This paper shows how, and to what extent, data mining classification algorithms can detect shadow IT in databases. Data classification points out data fields that contain coded data, which indicates agreed procedures. Discovered shadow IT usually exposes gaps or shortcomings in the systems in use, as well as opportunities for system improvement. Data classification algorithms focus on the data itself to distinguish shadow IT, whereas other researchers focus on text classification and on association or clustering algorithms. Data classification aims to distinguish data structures in databases that do not follow formal workflows and are not accepted or supported by the IT department. On a synthetic dataset, supervised learning with data classification is examined with Naïve Bayes, k-NN and the probabilistic classifiers Decision Trees and Logistic Regression. Because they work with Euclidean distances, the k-NN and Support Vector Machines algorithms are not suitable. Classifying the imbalanced dataset often runs into overfitting and other issues, which require special attention and affect the selection of performance metrics. Accuracy, precision, recall, specificity and the area under the curve are evaluated. Suggested system improvements include, for example, adding dedicated code fields instead of a fictitious date to avoid bias, and adding a validity period to enrich the data and make it more dynamic.
- Shadow IT
- data classification
- classification algorithms
- binary classification and performance measurement
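
The point that class imbalance affects the choice of performance metrics can be illustrated with a small sketch. The function name `binary_metrics` and all counts below are hypothetical and chosen only for illustration; they are not results from the paper. The area under the curve is left out because it needs ranked classifier scores rather than confusion-matrix counts.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute four binary-classification metrics from confusion-matrix counts.

    The positive class stands for "shadow IT"; a metric whose denominator
    is zero is reported as 0.0.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Hypothetical imbalanced dataset: 1000 records, 20 of them shadow IT.
# A classifier that labels every record "regular" detects nothing ...
always_negative = binary_metrics(tp=0, fp=0, tn=980, fn=20)

# ... while one that finds 15 of the 20 cases raises 30 false alarms.
detector = binary_metrics(tp=15, fp=30, tn=950, fn=5)

for name, result in [("always negative", always_negative), ("detector", detector)]:
    print(name, {metric: round(value, 3) for metric, value in result.items()})
```

With these assumed counts, the classifier that never flags shadow IT scores the higher accuracy although its recall is zero, which is why accuracy alone is a weak guide on imbalanced data and why precision, recall and specificity are reported alongside it.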