Dr. Sikha S Bagui

Distinguished University Professor, Computer Science

Big Data analytics

The Big Data Framework (Hadoop)

Machine Learning and Deep Learning

Data Mining

Structured Query Language (SQL)

Database design and architecture

Data Pre-Processing

Load Balancing

Resampling

Attribute Selection

Database design

Entity-Relationship Modeling

Association Rule Mining

Decision trees

Random Forest

SVM

Hive

Conference proceeding Open access Peer reviewed

Selecting Feature Subsets in Continuous Flow Network Attack Traffic Big Data Using Incremental Frequent Pattern Mining

by Sikha Bagui, Andrew Benyacko, Dustin Mink, Subhash Bagui and Arijit Bagchi

Published 12/16/2025

Algorithms, 18, 12, 795

This work focuses on finding frequent patterns in continuous flow network traffic Big Data using incremental frequent pattern mining. A newly created Zeek Conn Log MITRE ATT&CK framework labeled dataset, UWF-ZeekData24, generated using the Cyber Range at The University of West Florida, was used for this study. While FP-Growth is effective for static datasets, its standard implementation does not support incremental mining, which poses challenges for applications involving continuously growing data streams, such as network traffic logs. To overcome this limitation, a staged incremental FP-Growth approach is adopted for this work. The novelty of this work is in showing how incremental FP-Growth can be used efficiently on continuous flow network traffic, or streaming network traffic data, where no rebuild is necessary when new transactions are scanned and integrated. Incremental frequent pattern mining also generates feature subsets that are useful for understanding the nature of the individual attack tactics. Hence, a detailed understanding of the features or feature subsets of the seven different MITRE ATT&CK tactics is also presented. For example, the results indicate that core behavioral rules, such as those involving TCP protocols and service associations, emerge early and remain stable throughout later increments. The incremental FP-Growth framework provides a structured lens through which network behaviors can be observed and compared over time, supporting not only classification but also investigative use cases such as anomaly tracking and technique attribution. And finally, the results of this work, the frequent itemsets, will be useful for intrusion detection machine learning/artificial intelligence algorithms.

Journal article Open access Peer reviewed

Classifying Cyber Ranges: A Case-Based Analysis Using the UWF Cyber Range

by Emily Miller, Dustin Mink, Peyton Spellings, Sikha S. Bagui and Subhash C. Bagui

Published 10/10/2025

Encyclopedia (Basel, Switzerland), 5, 4, 162

To address the gaps in cyber range survey research, this entry develops and applies a structured classification taxonomy to support the comparison, evaluation, and design of cyber ranges. The entry will address the following question: What are the objectives and key features of current cyber ranges, and how can they be classified into a comprehensive taxonomy? The entry synthesizes existing frameworks and analyzes and classifies a variety of documented cyber ranges to find similarities and gaps in the current classification methods. The findings indicate recurring design elements across ranges, persistent gaps in standardization, and demonstrate how the University of West Florida (UWF) Cyber Range exemplifies the taxonomy application in practice. The goal is to facilitate informed decision-making by cybersecurity professionals when choosing platforms and to support academic research in cybersecurity education. Pulling information from studies about other cyber ranges to compare with the UWF Cyber Range, this taxonomy aims to contribute to the documentation of cyber ranges by providing a clear understanding of the current cyber range landscape.

Journal article Open access Peer reviewed

Analyzing Performance of Data Preprocessing Techniques on CPUs vs. GPUs with and Without the MapReduce Environment

by Sikha S. Bagui, Colin Eller, Rianna Armour, Shivani Singh, Subhash C. Bagui and Dustin Mink

Published 09/10/2025

Electronics (Basel), 14, 18, 3597

Data preprocessing is usually necessary before running most machine learning classifiers. This work compares three different preprocessing techniques, minimal preprocessing, Principal Components Analysis (PCA), and Linear Discriminant Analysis (LDA). The efficiency of these three preprocessing techniques is measured using the Support Vector Machine (SVM) classifier. Efficiency is measured in terms of statistical metrics such as accuracy, precision, recall, the F-1 measure, and AUROC. The preprocessing times and the classifier run times are also compared using the three differently preprocessed datasets. Finally, a comparison of performance timings on CPUs vs. GPUs with and without the MapReduce environment is performed. Two newly created Zeek Connection Log datasets, collected using the Security Onion 2 network security monitor and labeled using the MITRE ATT&CK framework, UWF-ZeekData22 and UWF-ZeekDataFall22, are used for this work. Results from this work show that binomial LDA, on average, performs the best in terms of statistical measures as well as timings using GPUs or MapReduce GPUs.

Journal article Open access Peer reviewed

PredictMed-CDSS: Artificial Intelligence-Based Decision Support System Predicting the Probability to Develop Neuromuscular Hip Dysplasia

by Carlo M. Bertoncelli, Federico Solla, Michal Latalski, Sikha Bagui, Subhash C. Bagui, Stefania Costantini and Domenico Bertoncelli

Published 08/06/2025

Bioengineering (Basel), 12, 8, 846

Neuromuscular hip dysplasia (NHD) is a common deformity in children with cerebral palsy (CP). Although some predictive factors of NHD are known, the prediction of NHD is in its infancy. We present a Clinical Decision Support System (CDSS) designed to calculate the probability of developing NHD in children with CP. The system utilizes an ensemble of three machine learning (ML) algorithms: Neural Network (NN), Support Vector Machine (SVM), and Logistic Regression (LR). The development and evaluation of the CDSS followed the DECIDE-AI guidelines for AI-driven clinical decision support tools. The ensemble was trained on a data series from 182 subjects. Inclusion criteria were age between 12 and 18 years and diagnosis of CP from two specialized units. Clinical and functional data were collected prospectively between 2005 and 2023, and then analyzed in a cross-sectional study. Accuracy and area under the receiver operating characteristic (AUROC) were calculated for each method. Best logistic regression scores highlighted history of previous orthopedic surgery (p = 0.001), poor motor function (p = 0.004), truncal tone disorder (p = 0.008), scoliosis (p = 0.031), number of affected limbs (p = 0.05), and epilepsy (p = 0.05) as predictors of NHD. Both accuracy and AUROC were highest for NN, 83.7% and 0.92, respectively. The novelty of this study lies in the development of an efficient Clinical Decision Support System (CDSS) prototype, specifically designed to predict future outcomes of neuromuscular hip dysplasia (NHD) in patients with cerebral palsy (CP) using clinical data. The proposed system, PredictMed-CDSS, demonstrated strong predictive performance for estimating the probability of NHD development in children with CP, with the highest accuracy achieved using neural networks (NN). PredictMed-CDSS has the potential to assist clinicians in anticipating the need for early interventions and preventive strategies in the management of NHD among CP patients.

Journal article Open access Peer reviewed

Model Retraining upon Concept Drift Detection in Network Traffic Big Data

by Sikha S. Bagui, Mohammad Pale Khan, Chedlyne Valmyr, Subhash C. Bagui and Dustin Mink

Published 07/24/2025

Future internet, 17, 8, 328

This paper presents a comprehensive model for detecting and addressing concept drift in network security data using the Isolation Forest algorithm. The approach leverages Isolation Forest’s inherent ability to efficiently isolate anomalies in high-dimensional data, making it suitable for adapting to shifting data distributions in dynamic environments. Anomalies in network attack data may not occur in large numbers, so it is important to be able to detect anomalies even with small batch sizes. The novelty of this work lies in successfully detecting anomalies even with small batch sizes and identifying the point at which incremental retraining needs to be started. Triggering retraining early also keeps the model in sync with the latest data, reducing the chance for attacks to be successfully conducted. Our methodology implements an end-to-end workflow that continuously monitors incoming data and detects distribution changes using Isolation Forest, then manages model retraining using Random Forest to maintain optimal performance. We evaluate our approach using UWF-ZeekDataFall22, a newly created dataset that analyzes Zeek’s Connection Logs collected through Security Onion 2 network security monitor and labeled using the MITRE ATT&CK framework. Incremental as well as full retraining are analyzed using Random Forest. There was a steady increase in the model’s performance with incremental retraining and a positive impact on the model’s performance with full model retraining.

Journal article Open access Peer reviewed

Detecting Cyber Threats in UWF-ZeekDataFall22 Using K-Means Clustering in the Big Data Environment

by Sikha Bagui, Germano Correa Silva De Carvalho, Asmi Mishra, Dustin Mink, Subhash Bagui and Stephanie Eager

Published 06/18/2025

Future internet, 17, 6, 267

In an era marked by the rapid growth of the Internet of Things (IoT), network security has become increasingly critical. Traditional Intrusion Detection Systems, particularly signature-based methods, struggle to identify evolving cyber threats such as Advanced Persistent Threats (APTs) and zero-day attacks. Such threats or attacks go undetected with supervised machine-learning methods. In this paper, we apply K-means clustering, an unsupervised clustering technique, to a newly created modern network attack dataset, UWF-ZeekDataFall22. Since this dataset contains labeled Zeek logs, the dataset was de-labeled before using this data for K-means clustering. The labeled data, however, was used in the evaluation phase, to determine the attack clusters post-clustering. In order to identify APTs as well as zero-day attack clusters, three different labeling heuristics were evaluated to determine the attack clusters. To address the challenges faced by Big Data, the Big Data framework, that is, Apache Spark and PySpark, were used for our development environment. In addition, the uniqueness of this work is also in using connection-based features. Using connection-based features, an in-depth study is done to determine the effect of the number of clusters, seeds, as well as features, for each of the different labeling heuristics. If the objective is to detect every single attack, the results indicate that 325 clusters with a seed of 200, using an optimal set of features, would be able to correctly place 99% of attacks.

Journal article Open access Peer reviewed

Introducing UWF-ZeekData24: An Enterprise MITRE ATT&CK Labeled Network Attack Traffic Dataset for Machine Learning/AI

by Marshall Elam, Dustin Mink, Sikha S. Bagui, Russell Plenkers and Subhash C. Bagui

Published 04/25/2025

Data (Basel), 10, 5, 59

This paper describes the creation of a new dataset, UWF-ZeekData24, aligned with the Enterprise MITRE ATT&CK Framework, that addresses critical shortcomings in existing network security datasets. Controlling the construction of attacks and meticulously labeling the data provides a more accurate and dynamic environment for testing of IDS/IPS systems and their machine learning algorithms. The outcomes of this research will assist in the development of cybersecurity solutions as well as increase the robustness and adaptability towards modern day cybersecurity threats. This new carefully engineered dataset will enhance cyber defense mechanisms that are responsible for safeguarding critical infrastructures and digital assets. Finally, this paper discusses the differences between crowd-sourced data and data collected in a more controlled environment.

Journal article Open access

Critical Reflection Sessions: Teacher's Perspectives During Professional Development

by Lisa S. Ockerman and Sikha Bagui

First online publication 10/14/2024

International Journal of Changes in Education, online ahead of print

This study aimed to explore the critical reflection experiences of teachers who took part in effective and ineffective professional learning events. It examined the influence of D.A. Kolb's reflective observation stage within the experiential learning theory (ELT) framework on teachers' professional development. Using a qualitative interpretive phenomenological method, the research investigated teachers' viewpoints, beliefs, and experiences related to various professional development activities to evaluate their effectiveness. Eleven teachers attended the same educational conference. The investigation involved semi-structured individual interviews and a focus group discussion. The interview questions centered around the concepts of enactive mastery and vicarious experiences. Open-ended discussions allowed participants to explore their experiences of professional learning. To highlight emergent patterns, this study employed a phenomenological technique to analyze data and identified critical observations through an inductive coding method using the NVivo software. The study's findings suggested that the reflective process necessitated time, collaboration, and structure during professional learning sessions. Moreover, having a group of peers was advantageous for critical reflection. They created a learning environment where support was the norm, and dedicated time encouraged self-reflection, which promoted effective growth in teacher development. The researchers found that inquiry-based professional development promoted introspection and facilitated profound learning. The research emphasized that teachers responded positively when professional development facilitators allowed participants time to link their personal experiences to new teaching methods, prompting reflection and collaboration with peers.

Journal article Open access Peer reviewed

Applying Multi-CLASS Support Vector Machines: One-vs.-One vs. One-vs.-All on the UWF-ZeekDataFall22 Dataset

by Rocio Krebs, Sikha S. Bagui, Dustin Mink and Subhash C. Bagui

Published 10/03/2024

Electronics (Basel), 13, 19, 3916

This study investigates the technical challenges of applying Support Vector Machines (SVM) for multi-class classification in network intrusion detection using the UWF-ZeekDataFall22 dataset, which is labeled based on the MITRE ATT&CK framework. A key challenge lies in handling imbalanced classes and complex attack patterns, which are inherent in intrusion detection data. This work highlights the difficulties in implementing SVMs for multi-class classification, particularly with One-vs.-One (OvO) and One-vs.-All (OvA) methods, including scalability issues due to the large volume of network traffic logs and the tendency of SVMs to be sensitive to noisy data and class imbalances. SMOTE was used to address class imbalances, while preprocessing techniques were applied to improve feature selection and reduce noise in the data. The unique structure of network traffic data, with overlapping patterns between attack vectors, posed significant challenges in achieving accurate classification. Our model reached an accuracy of over 90% with OvO and over 80% with OvA, demonstrating that despite these challenges, multi-class SVMs can be effectively applied to complex intrusion detection tasks when combined with appropriate balancing and preprocessing techniques.

Journal article Open access Peer reviewed

MongoDB: Meeting the Dynamic Needs of Modern Applications

by Mukesh Rathore and Sikha S. Bagui

Published 09/27/2024

Encyclopedia (Basel, Switzerland), 4, 4, 1433 - 1453

This entry reviews MongoDB’s fundamentals, architectural features, advantages, and limitations, providing a comprehensive understanding of its capabilities. MongoDB’s impact on the database landscape is profound, challenging traditional relational databases and influencing the adoption of NoSQL solutions globally. With its continued growth, innovation, and commitment to addressing evolving market needs, MongoDB remains a pivotal player in modern data management, empowering organizations to build scalable, efficient, and high-performance applications.
