Abstract
The internet has brought numerous benefits to end-users, companies, and governments, but it has also led to severe threats to the security and privacy of data and systems. One such threat is botnets, which are networks of infected systems controlled by a central malicious machine. Botnets are used to spread various types of malware and to launch attacks, which makes them very dangerous. The two best-known identifying characteristics of a botnet are its use of a command-and-control (C&C) server and the need for a domain name or IP address for communication between the bots and the C&C server. Botnets nowadays use domain generation algorithms (DGA) to obscure the actual domain that will be used for communication, making the botnet harder to detect. This research aims to improve DGA-based botnet detection using packet flow datasets that include DGA-based botnet network activity.
The main research question is to what extent machine learning (ML) models can be built for the detection of DGA-based botnets by using packet flow datasets and context-related feature selection methods. Improving the accuracy and speed of DGA-based botnet detection is crucial due to the increasing number of systems connected to the internet and the growing impact that successful attacks can have on the infected host systems. Many studies have already been performed in the field of DGA-based botnet detection using traditional ML models, but few of them use traditional ML models based on packet flow data, and fewer still use a combined dataset derived from multiple other datasets. This research is conducted through an extensive literature study, a search for and consolidation of workable datasets, several experiments to support the selection of feature sets and ML models, and a validation step to position the results against the outcomes of earlier work.
Out of a selection of eight machine learning techniques, this study identifies three as the best-performing models: Bagging, XGBoost, and Decision Tree. The performance results are based on several experiments run on a combination of publicly available datasets and a dataset merged as part of the experiments in this study. The datasets contain packet flow information in which DGA-based botnet behavior is present and labeled. Based on the results of the experiments, the study concludes that high-performing machine learning models can be built for the detection of DGA-based botnets by using packet flow datasets and context-related feature selection. Using only four features derived from the available datasets, extended with two additional features based on the IP address, an XGBoost classifier was trained that reaches an accuracy of 99.72%, an AUC-score of 88.15%, and an F1-score on the positive class of 79.28%. The F1-score on the positive class is deemed the most important measure for this study because, in a very imbalanced dataset, it indicates how well records of the positive class are correctly identified. The positive class in this case represents actual DGA-based botnet behavior.
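The gap between the reported accuracy and the F1-score on the positive class can be illustrated with a minimal Python sketch. The confusion-matrix counts below are hypothetical and chosen only to mimic a heavily imbalanced packet flow dataset; they are not taken from the thesis. The `Confusion` class and its method names are likewise illustrative assumptions, not the author's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Binary confusion matrix; the positive class stands for botnet flows."""
    tp: int  # botnet flows flagged as botnet
    fp: int  # benign flows flagged as botnet
    fn: int  # botnet flows missed
    tn: int  # benign flows left alone

    def accuracy(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total

    def f1_positive(self) -> float:
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall.
        denom = 2 * self.tp + self.fp + self.fn
        return 2 * self.tp / denom if denom else 0.0


# Hypothetical counts: 400 botnet flows among 100,000 flows in total.
detector = Confusion(tp=250, fp=100, fn=150, tn=99_500)
# A trivial baseline that labels every flow as benign.
always_benign = Confusion(tp=0, fp=0, fn=400, tn=99_600)

for name, cm in [("detector", detector), ("always_benign", always_benign)]:
    print(f"{name}: accuracy={cm.accuracy():.4f}, "
          f"F1(positive)={cm.f1_positive():.4f}")
```

With these assumed counts, both classifiers score above 99% accuracy, yet the baseline that never flags a botnet has an F1-score of zero on the positive class. This is why accuracy alone says little on imbalanced data and why the positive-class F1-score is the more informative measure here.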
Date of Award | 20 Jun 2023 |
---|---|
Original language | English |
Supervisor | Harald Vranken (Examiner) & Clara Maathuis (Co-assessor) |
Master's Degree
- Master Computer Science