Dr. Dustin M Mink

Research Faculty, Hal Marcus College of Science and Engineering

Cybersecurity

Information Technology

Conference proceeding Open access Peer reviewed

Selecting Feature Subsets in Continuous Flow Network Attack Traffic Big Data Using Incremental Frequent Pattern Mining

by Sikha Bagui, Andrew Benyacko, Mink Dustin, Subhash Bagui and Bagchi Arijit

Published 12/16/2025

Algorithms, 18, 12, 795

This work focuses on finding frequent patterns in continuous flow network traffic Big Data using incremental frequent pattern mining. A newly created Zeek Conn Log MITRE ATT&CK framework labeled dataset, UWF-ZeekData24, generated using the Cyber Range at The University of West Florida, was used for this study. While FP-Growth is effective for static datasets, its standard implementation does not support incremental mining, which poses challenges for applications involving continuously growing data streams, such as network traffic logs. To overcome this limitation, a staged incremental FP-Growth approach is adopted for this work. The novelty of this work is in showing how incremental FP-Growth can be used efficiently on continuous flow network traffic, or streaming network traffic data, where no rebuild is necessary when new transactions are scanned and integrated. Incremental frequent pattern mining also generates feature subsets that are useful for understanding the nature of the individual attack tactics. Hence, a detailed understanding of the features or feature subsets of the seven different MITRE ATT&CK tactics is also presented. For example, the results indicate that core behavioral rules, such as those involving TCP protocols and service associations, emerge early and remain stable throughout later increments. The incremental FP-Growth framework provides a structured lens through which network behaviors can be observed and compared over time, supporting not only classification but also investigative use cases such as anomaly tracking and technique attribution. And finally, the results of this work, the frequent itemsets, will be useful for intrusion detection machine learning/artificial intelligence algorithms.

Journal article Open access Peer reviewed

Analyzing Performance of Data Preprocessing Techniques on CPUs vs. GPUs with and Without the MapReduce Environment

by Sikha S. Bagui, Colin Eller, Rianna Armour, Shivani Singh, Subhash C. Bagui and Dustin Mink

Published 09/10/2025

Electronics (Basel), 14, 18, 3597

Data preprocessing is usually necessary before running most machine learning classifiers. This work compares three different preprocessing techniques, minimal preprocessing, Principal Components Analysis (PCA), and Linear Discriminant Analysis (LDA). The efficiency of these three preprocessing techniques is measured using the Support Vector Machine (SVM) classifier. Efficiency is measured in terms of statistical metrics such as accuracy, precision, recall, the F-1 measure, and AUROC. The preprocessing times and the classifier run times are also compared using the three differently preprocessed datasets. Finally, a comparison of performance timings on CPUs vs. GPUs with and without the MapReduce environment is performed. Two newly created Zeek Connection Log datasets, collected using the Security Onion 2 network security monitor and labeled using the MITRE ATT&CK framework, UWF-ZeekData22 and UWF-ZeekDataFall22, are used for this work. Results from this work show that binomial LDA, on average, performs the best in terms of statistical measures as well as timings using GPUs or MapReduce GPUs.

Journal article Open access Peer reviewed

Model Retraining upon Concept Drift Detection in Network Traffic Big Data

by Sikha S. Bagui, Mohammad Pale Khan, Chedlyne Valmyr, Subhash C. Bagui and Dustin Mink

Published 07/24/2025

Future internet, 17, 8, 328

This paper presents a comprehensive model for detecting and addressing concept drift in network security data using the Isolation Forest algorithm. The approach leverages Isolation Forest’s inherent ability to efficiently isolate anomalies in high-dimensional data, making it suitable for adapting to shifting data distributions in dynamic environments.Anomalies in network attack data may not occur in large numbers, so it is important to be able to detect anomalies even with small batch sizes. The novelty of this work lies in successfully detecting anomalies even with small batch sizes and identifying the point at which incremental retraining needs to be started. Triggering retraining early also keeps the model in sync with the latest data, reducing the chance for attacks to be successfully conducted. Our methodology implements an end-to-end workflow that continuously monitors incoming data and detects distribution changes using Isolation Forest, then manages model retraining using Random Forest to maintain optimal performance. We evaluate our approach using UWF-ZeekDataFall22, a newly created dataset that analyzes Zeek’s Connection Logs collected through Security Onion 2 network security monitor and labeled using the MITRE ATT&CK framework. Incremental as well as full retraining are analyzed using Random Forest. There was a steady increase in the model’s performance with incremental retraining and a positive impact on the model’s performance with full model retraining.

Journal article Open access Peer reviewed

Detecting Cyber Threats in UWF-ZeekDataFall22 Using K-Means Clustering in the Big Data Environment

by Sikha Bagui, Carvalho Germano Correa Silva De, Mishra Asmi, Dustin Mink, Subhash Bagui and Stephanie Eager

Published 06/18/2025

Future internet, 17, 6, 267

In an era marked by the rapid growth of the Internet of Things (IoT), network security has become increasingly critical. Traditional Intrusion Detection Systems, particularly signature-based methods, struggle to identify evolving cyber threats such as Advanced Persistent Threats (APTs)and zero-day attacks. Such threats or attacks go undetected with supervised machine-learning methods. In this paper, we apply K-means clustering, an unsupervised clustering technique, to a newly created modern network attack dataset, UWF-ZeekDataFall22. Since this dataset contains labeled Zeek logs, the dataset was de-labeled before using this data for K-means clustering. The labeled data, however, was used in the evaluation phase, to determine the attack clusters post-clustering. In order to identify APTs as well as zero-day attack clusters, three different labeling heuristics were evaluated to determine the attack clusters. To address the challenges faced by Big Data, the Big Data framework, that is, Apache Spark and PySpark, were used for our development environment. In addition, the uniqueness of this work is also in using connection-based features. Using connection-based features, an in-depth study is done to determine the effect of the number of clusters, seeds, as well as features, for each of the different labeling heuristics. If the objective is to detect every single attack, the results indicate that 325 clusters with a seed of 200, using an optimal set of features, would be able to correctly place 99% of attacks.

Journal article Open access Peer reviewed

Introducing UWF-ZeekData24: An Enterprise MITRE ATT&CK Labeled Network Attack Traffic Dataset for Machine Learning/AI

by Marshall Elam, Dustin Mink, Sikha S. Bagui, Russell Plenkers and Subhash C. Bagui

Published 04/25/2025

Data (Basel), 10, 5, 59

This paper describes the creation of a new dataset, UWF-ZeekData24, aligned with the Enterprise MITRE ATT&CK Framework, that addresses critical shortcomings in existing network security datasets. Controlling the construction of attacks and meticulously labeling the data provides a more accurate and dynamic environment for testing of IDS/IPS systems and their machine learning algorithms. The outcomes of this research will assist in the development of cybersecurity solutions as well as increase the robustness and adaptability towards modern day cybersecurity threats. This new carefully engineered dataset will enhance cyber defense mechanisms that are responsible for safeguarding critical infrastructures and digital assets. Finally, this paper discusses the differences between crowd-sourced data and data collected in a more controlled environment.

Journal article Open access Peer reviewed

Applying Multi-CLASS Support Vector Machines: One-vs.-One vs. One-vs.-All on the UWF-ZeekDataFall22 Dataset

by Rocio Krebs, Sikha S. Bagui, Dustin Mink and Subhash C. Bagui

Published 10/03/2024

Electronics (Basel), 13, 19, 3916

This study investigates the technical challenges of applying Support Vector Machines (SVM) for multi-class classification in network intrusion detection using the UWF-ZeekDataFall22 dataset, which is labeled based on the MITRE ATT&CK framework. A key challenge lies in handling imbalanced classes and complex attack patterns, which are inherent in intrusion detection data. This work highlights the difficulties in implementing SVMs for multi-class classification, particularly with One-vs.-One (OvO) and One-vs.-All (OvA) methods, including scalability issues due to the large volume of network traffic logs and the tendency of SVMs to be sensitive to noisy data and class imbalances. SMOTE was used to address class imbalances, while preprocessing techniques were applied to improve feature selection and reduce noise in the data. The unique structure of network traffic data, with overlapping patterns between attack vectors, posed significant challenges in achieving accurate classification. Our model reached an accuracy of over 90% with OvO and over 80% with OvA, demonstrating that despite these challenges, multi-class SVMs can be effectively applied to complex intrusion detection tasks when combined with appropriate balancing and preprocessing techniques.

Journal article Open access Peer reviewed

Node Classification of Network Threats Leveraging Graph-Based Characterizations Using Memgraph

by Sadaf Charkhabi, Peyman Samimi, Sikha S. Bagui, Dustin Mink and Subhash C. Bagui

Published 07/15/2024

Computers (Basel), 13, 7, 171

This research leverages Memgraph, an open-source graph database, to analyze graph-based network data and apply Graph Neural Networks (GNNs) for a detailed classification of cyberattack tactics categorized by the MITRE ATT&CK framework. As part of graph characterization, the page rank, degree centrality, betweenness centrality, and Katz centrality are presented. Node classification is utilized to categorize network entities based on their role in the traffic. Graph-theoretic features such as in-degree, out-degree, PageRank, and Katz centrality were used in node classification to ensure that the model captures the structure of the graph. The study utilizes the UWF-ZeekDataFall22 dataset, a newly created dataset which consists of labeled network logs from the University of West Florida’s Cyber Range. The uniqueness of this study is that it uses the power of combining graph-based characterization or analysis with machine learning to enhance the understanding and visualization of cyber threats, thereby improving the network security measures.

Journal article Open access Peer reviewed

Extended Isolation Forest for Intrusion Detection in Zeek Data

by Fariha Moomtaheen, Sikha S. Bagui, Subhash C. Bagui and Dustin Mink

Published 07/12/2024

Information (Basel), 15, 7, 404

The novelty of this paper is in determining and using hyperparameters to improve the Extended Isolation Forest (EIF) algorithm, a relatively new algorithm, to detect malicious activities in network traffic. The EIF algorithm is a variation of the Isolation Forest algorithm, known for its efficacy in detecting anomalies in high-dimensional data. Our research assesses the performance of the EIF model on a newly created dataset composed of Zeek Connection Logs, UWF-ZeekDataFall22. To handle the enormous volume of data involved in this research, the Hadoop Distributed File System (HDFS) is employed for efficient and fault-tolerant storage, and the Apache Spark framework, a powerful open-source Big Data analytics platform, is utilized for machine learning (ML) tasks. The best results for the EIF algorithm came from the 0-extension level. We received an accuracy of 82.3% for the Resource Development tactic, 82.21% for the Reconnaissance tactic, and 78.3% for the Discovery tactic.

Journal article Open access Peer reviewed

Resampling to Classify Rare Attack Tactics in UWF-ZeekData22

by Sikha S. Bagui, Dustin Mink, Subhash C. Bagui and Sakthivel Subramaniam

Published 03/14/2024

Knowledge, 4, 1, 96 - 119

One of the major problems in classifying network attack tactics is the imbalanced nature of data. Typical network datasets have an extremely high percentage of normal or benign traffic and machine learners are skewed toward classes with more data; hence, attack data remain incorrectly classified. This paper addresses the class imbalance problem using resampling techniques on a newly created dataset, UWF-ZeekData22. This is the first dataset with tactic labels, labeled as per the MITRE ATT&CK framework. This dataset contains about half benign data and half attack tactic data, but specific tactics have a meager number of occurrences within the attack tactics. Our objective in this paper was to use resampling techniques to classify two rare tactics, privilege escalation and credential access, never before classified. The study also looks at the order of oversampling and undersampling. Varying resampling ratios were used with oversampling techniques such as BSMOTE and SVM-SMOTE and random undersampling without replacement was used. Based on the results, it can be observed that the order of oversampling and undersampling matters and, in many cases, even an oversampling ratio of 10% of the majority data is enough to obtain the best results.

Journal article Open access Peer reviewed

Graphical Representation of UWF-ZeekData22 Using Memgraph

by Sikha Bagui, Dustin Mink, Subhash C Bagui, Dae Sung and Farooq Mahmud

Published 03/01/2024

Electronics, 13, 6, 1015

This work uses Memgraph, an open-source graph data platform, to analyze, visualize, and apply graph machine learning techniques to detect cybersecurity attack tactics in a newly created Zeek Conn log dataset, UWF-ZeekData22, generated in The University of West Florida’s cyber simulation environment. The dataset is transformed to a representative graph, and the graph’s properties studied in this paper are PageRank, degree, bridge, weakly connected components, node and edge cardinality, and path length. Node classification is used to predict the connection between IP addresses and ports as a form of attack tactic or non-attack tactic in the MITRE framework, implemented using Memgraph’s graph neural networks. Multi-classification is performed using the attack tactics, and three different graph neural network models are compared. Using only three graph features, in-degree, out-degree, and PageRank, Memgraph’s GATJK model performs the best, with source node classification accuracy of 98.51% and destination node classification accuracy of 97.85%.