Spam Detection for Smart Home IoT Devices Using Machine Learning: A Comprehensive Framework
Keywords:
IoT Security, Spam Detection, Smart Home, Machine Learning, Bagging Classifier, Decision Tree, PCA, AdaBoost, Voting Classifier, Gaussian Naive Bayes, Flask, Deep Learning, REFIT Dataset, Spamicity Score
Abstract
This paper presents an intelligent spam detection framework for securing smart home Internet of Things (IoT) devices
using machine learning. With the number of connected smart home devices exceeding 25 billion globally, network
security against spam injection, unauthorized access, and anomalous device communications has become critically
important. The proposed system analyzes IoT network traffic from the REFIT Smart Home dataset—1,664 records with
13 features encompassing source/destination addresses, device types, locations, operations, and timestamps—to classify
communications as Normal (valid) or Spam. A four-stage preprocessing pipeline applies Label Encoding for 10
categorical features, Simple Imputation for missing values, Standard Scaling for normalization, and Principal
Component Analysis (PCA) to reduce 11 features to 10 principal components capturing ≈98% of total variance. Five
machine learning classifiers are systematically evaluated: Bagging Classifier with SVC base estimator (99.4%
accuracy), Decision Tree with Gini criterion (99.4%), AdaBoost with 100 estimators (95.5%), Voting Classifier
combining Logistic Regression, Random Forest, and Gaussian Naive Bayes (89.2%), and Gaussian Naive Bayes
(89.2%). A Keras Sequential deep learning model (Input(10)→Dense(4,ReLU)→Dense(4,ReLU)→Dense(1,Sigmoid))
trained for 200 epochs with Adam optimizer provides complementary validation. Each model produces a spamicity score
representing device trustworthiness. A Flask web application offers role-based access (admin for model management,
user for predictions) with Docker containerization and an optional Next.js frontend. This article presents the
mathematical foundations of all algorithms, the complete system architecture, algorithmic pseudocode, and thorough
results analysis including comparative tables, PCA variance analysis, learning curve data, and cross-validation
findings.
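
The preprocessing and classification stages named in the abstract can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic integer codes standing in for label-encoded REFIT features, the mean imputation strategy and the random seed are arbitrary choices, and build_pipeline is a hypothetical helper name. Only the stage order (imputation, standard scaling, PCA from 11 features to 10 components, Decision Tree with Gini criterion) is taken from the abstract.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def build_pipeline(n_components=10, seed=0):
    """Impute -> scale -> PCA -> Gini decision tree (hypothetical helper)."""
    return make_pipeline(
        SimpleImputer(strategy="mean"),   # assumption: strategy not given in the abstract
        StandardScaler(),
        PCA(n_components=n_components, random_state=seed),
        DecisionTreeClassifier(criterion="gini", random_state=seed),
    )


# Synthetic stand-in for label-encoded traffic records (NOT the REFIT data):
# 200 rows, 11 integer-coded features, roughly 5% of entries missing.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 11)).astype(float)
X[rng.random(X.shape) < 0.05] = np.nan
y = rng.integers(0, 2, size=200)          # 0 = Normal, 1 = Spam (assumed coding)

model = build_pipeline().fit(X, y)
pred = model.predict(X)

# Fraction of total variance retained by the 10 principal components.
explained = model.named_steps["pca"].explained_variance_ratio_.sum()
print(f"variance retained by PCA: {explained:.3f}")
```

Swapping the final step for another estimator (for example a Bagging or AdaBoost classifier) leaves the preprocessing stages unchanged, which is one reason to keep them in a single pipeline object.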
License
Copyright (c) 2026 Authors

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.










