• J.M. van Renswou

Student thesis: Master's Thesis


The almost unlimited opportunities of the internet are not always a positive thing. Not everyone uses the internet for good; some use these opportunities for unwanted activities. One example is botnets: networks of hijacked computers that are used for criminal actions. This is a major threat that affects everyone connected to the internet.
In this research an attempt is made to detect computers that have been hijacked to be part of a botnet by monitoring their network behaviour.
The main research question is:
How can machine learning techniques effectively and efficiently detect botnets from TCP/IP network traffic?
Three machine learning techniques are selected, and models are trained on publicly available datasets containing network traffic from botnets and from normal programs. The techniques used are Random Forest Classifier (RFC), Support Vector Machines (SVM) and Gradient Boosted Trees (GBT). Using a flow-based approach with only 17 features per flow, an RFC could be trained to detect botnet network traffic. With an accuracy of 99.63% on the validation data, it performs better than the SVM and GBT classifiers. The small number of features keeps the algorithm complexity low. A low-complexity algorithm reduces the chance of overfitting and the resources needed to evaluate a new flow.
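To illustrate what a flow-based feature vector can look like, the following is a minimal sketch in Python. The abstract does not list the 17 features that were used, so the seven features below, the `Packet` record and the `flow_features` function are hypothetical names chosen for illustration, not the thesis's actual feature set.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Packet:
    """Minimal packet record; the field names are assumptions."""
    timestamp: float  # seconds since the start of the capture
    size: int         # packet size in bytes
    outbound: bool    # True if sent by the host that opened the flow


def flow_features(packets):
    """Summarise one flow as a small, fixed-length feature dictionary."""
    if not packets:
        raise ValueError("a flow needs at least one packet")
    ordered = sorted(packets, key=lambda p: p.timestamp)
    sizes = [p.size for p in ordered]
    # Inter-arrival times between consecutive packets of the flow.
    gaps = [b.timestamp - a.timestamp for a, b in zip(ordered, ordered[1:])]
    outbound = sum(1 for p in ordered if p.outbound)
    return {
        "packet_count": len(ordered),
        "total_bytes": sum(sizes),
        "duration": ordered[-1].timestamp - ordered[0].timestamp,
        "mean_size": mean(sizes),
        "std_size": pstdev(sizes),
        "mean_gap": mean(gaps) if gaps else 0.0,
        "outbound_ratio": outbound / len(ordered),
    }


# Illustrative flow of three packets (made-up values).
example_flow = [
    Packet(0.0, 60, True),
    Packet(0.5, 1500, False),
    Packet(1.5, 40, True),
]
features = flow_features(example_flow)
```

Because every flow is reduced to the same small set of numbers, regardless of how many packets it contains, a classifier only has to evaluate a short fixed-length vector per flow.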
On network flows that contain traffic from new botnets and normal programs, however, the RFC performs poorly, with an accuracy of only 55.59%. The features extracted from the flows are good at detecting known botnets but are not generic enough to distinguish network flows of unknown botnets from those of normal programs.
Multiple datasets are available to train an algorithm. However, not all datasets use the same method for adding a truth label to the network packets. Therefore, a software package is created to convert different datasets to flows and add a truth value to each flow.
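The conversion from packets to labelled flows can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used by the software package: packets are plain dictionaries with hypothetical field names, a flow is identified by a direction-independent 5-tuple, and a flow is labelled as botnet traffic when one of its endpoints appears in a list of known infected IP addresses (one possible labelling convention).

```python
from collections import defaultdict


def flow_key(pkt):
    """Direction-independent 5-tuple, so both directions map to one flow."""
    a = (pkt["src_ip"], pkt["src_port"])
    b = (pkt["dst_ip"], pkt["dst_port"])
    first, second = (a, b) if a <= b else (b, a)
    return (pkt["proto"],) + first + second


def packets_to_labelled_flows(packets, infected_ips):
    """Group packets into flows and attach a truth value to each flow."""
    flows = defaultdict(list)
    for pkt in packets:
        flows[flow_key(pkt)].append(pkt)

    labelled = []
    for key, pkts in flows.items():
        # key layout: (proto, ip_a, port_a, ip_b, port_b)
        endpoints = {key[1], key[3]}
        labelled.append({
            "key": key,
            "packets": pkts,
            # Assumed rule: any infected endpoint marks the flow as botnet.
            "is_botnet": bool(endpoints & infected_ips),
        })
    return labelled


# Illustrative packets (made-up addresses and ports).
packets = [
    {"src_ip": "10.0.0.5", "src_port": 4444,
     "dst_ip": "93.184.216.34", "dst_port": 80, "proto": "tcp"},
    {"src_ip": "93.184.216.34", "src_port": 80,
     "dst_ip": "10.0.0.5", "dst_port": 4444, "proto": "tcp"},
    {"src_ip": "10.0.0.9", "src_port": 5555,
     "dst_ip": "93.184.216.34", "dst_port": 443, "proto": "tcp"},
]
flows = packets_to_labelled_flows(packets, infected_ips={"10.0.0.5"})
```

Datasets that label individual packets or use other conventions would need a different labelling rule; only the last step of the function would change, while the grouping into flows stays the same.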
The complete software package, called botshot, can be used to convert datasets to flows, create feature sets from flows, train a machine learning model on the feature sets and validate the performance of the machine learning algorithm used. The software package is documented, and its architecture makes it easy to adapt to new datasets, export different features or try new training algorithms. The botshot software package will help new research to focus on selecting new features and better classification algorithms, instead of spending time on converting raw data to usable features and testing performance.
Date of Award: 2 Feb 2021
Original language: English
Supervisor: Harald Vranken (Examinator) & Arjen Hommersom (Co-assessor)
