List of works
Conference proceeding
Published 12/16/2025
Algorithms, 18, 12, 795
This work focuses on finding frequent patterns in continuous-flow network traffic Big Data using incremental frequent pattern mining. The study uses UWF-ZeekData24, a newly created Zeek Conn Log dataset labeled with the MITRE ATT&CK framework and generated using the Cyber Range at The University of West Florida. While FP-Growth is effective for static datasets, its standard implementation does not support incremental mining, which poses challenges for applications involving continuously growing data streams, such as network traffic logs. To overcome this limitation, a staged incremental FP-Growth approach is adopted. The novelty of this work is in showing how incremental FP-Growth can be used efficiently on continuous-flow, or streaming, network traffic data, where no rebuild is necessary when new transactions are scanned and integrated. Incremental frequent pattern mining also generates feature subsets that are useful for understanding the nature of the individual attack tactics. Hence, a detailed understanding of the features or feature subsets of the seven different MITRE ATT&CK tactics is also presented. For example, the results indicate that core behavioral rules, such as those involving TCP protocols and service associations, emerge early and remain stable throughout later increments. The incremental FP-Growth framework provides a structured lens through which network behaviors can be observed and compared over time, supporting not only classification but also investigative use cases such as anomaly tracking and technique attribution. Finally, the frequent itemsets produced by this work will be useful for machine learning and artificial intelligence algorithms for intrusion detection.
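The idea of integrating new transactions without a rebuild can be illustrated with a small sketch. This is not the staged incremental FP-Growth approach from the paper; it is a simplified canonical-order prefix tree, a swapped-in technique in which items are inserted in a fixed (alphabetical) order so that earlier batches never need rescanning. The class name, the method names, and the item strings such as `proto=tcp` are assumptions made for illustration only.

```python
class Node:
    """One prefix-tree node: an item plus the number of transactions through it."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


class IncrementalPrefixTree:
    """Simplified canonical-order tree (hypothetical; not the paper's method)."""
    def __init__(self):
        self.root = Node()
        self.n_transactions = 0

    def add_batch(self, transactions):
        # Each transaction is inserted in alphabetical item order, so the tree
        # layout never depends on item frequencies and never needs a rebuild.
        for transaction in transactions:
            node = self.root
            for item in sorted(set(transaction)):
                node = node.children.setdefault(item, Node(item))
                node.count += 1
            self.n_transactions += 1

    def support(self, itemset):
        """Number of stored transactions that contain every item in itemset."""
        target = sorted(set(itemset))
        if not target:
            return self.n_transactions
        return self._count(self.root, target, 0)

    def _count(self, node, target, i):
        total = 0
        for item, child in node.children.items():
            if item == target[i]:
                if i == len(target) - 1:
                    total += child.count      # every path through child matches
                else:
                    total += self._count(child, target, i + 1)
            elif item < target[i]:
                total += self._count(child, target, i)  # target may lie deeper
            # item > target[i]: paths are sorted, so target[i] cannot occur below
        return total


# Hypothetical connection-log transactions arriving in two increments.
batch_1 = [
    {"proto=tcp", "service=http"},
    {"proto=tcp", "service=ssl"},
    {"proto=udp", "service=dns"},
]
batch_2 = [
    {"proto=tcp", "service=http"},
    {"proto=tcp", "service=http", "state=SF"},
]

tree = IncrementalPrefixTree()
tree.add_batch(batch_1)
support_after_first = tree.support({"proto=tcp", "service=http"})
tree.add_batch(batch_2)   # integrates new transactions; batch_1 is not rescanned
support_after_second = tree.support({"proto=tcp", "service=http"})
```

Dividing a support count by `tree.n_transactions` gives the relative support that a minimum-support threshold would be compared against after each increment.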
Journal article
Classifying Cyber Ranges: A Case-Based Analysis Using the UWF Cyber Range
Published 10/10/2025
Encyclopedia (Basel, Switzerland), 5, 4, 162
To address the gaps in cyber range survey research, this entry develops and applies a structured classification taxonomy to support the comparison, evaluation, and design of cyber ranges. The entry addresses the following question: What are the objectives and key features of current cyber ranges, and how can they be classified into a comprehensive taxonomy? The entry synthesizes existing frameworks and analyzes and classifies a variety of documented cyber ranges to find similarities and gaps in the current classification methods. The findings indicate recurring design elements across ranges and persistent gaps in standardization, and they demonstrate how the University of West Florida (UWF) Cyber Range exemplifies the application of the taxonomy in practice. The goal is to facilitate informed decision-making by cybersecurity professionals when choosing platforms and to support academic research in cybersecurity education. Drawing on studies of other cyber ranges for comparison with the UWF Cyber Range, the entry aims to contribute to the documentation of cyber ranges by providing a clear understanding of the current cyber range landscape.
Journal article
Published 09/10/2025
Electronics (Basel), 14, 18, 3597
Data preprocessing is usually necessary before running most machine learning classifiers. This work compares three different preprocessing techniques: minimal preprocessing, Principal Component Analysis (PCA), and Linear Discriminant Analysis (LDA). The efficiency of these three preprocessing techniques is measured using the Support Vector Machine (SVM) classifier. Efficiency is measured in terms of statistical metrics such as accuracy, precision, recall, the F1 measure, and AUROC. The preprocessing times and the classifier run times are also compared using the three differently preprocessed datasets. Finally, performance timings on CPUs vs. GPUs, with and without the MapReduce environment, are compared. Two newly created Zeek Connection Log datasets, UWF-ZeekData22 and UWF-ZeekDataFall22, collected using the Security Onion 2 network security monitor and labeled using the MITRE ATT&CK framework, are used for this work. Results from this work show that binomial LDA, on average, performs the best in terms of statistical measures as well as timings using GPUs or MapReduce GPUs.
Journal article
Published 08/06/2025
Bioengineering (Basel), 12, 8, 846
Neuromuscular hip dysplasia (NHD) is a common deformity in children with cerebral palsy (CP). Although some predictive factors of NHD are known, the prediction of NHD is in its infancy. We present a Clinical Decision Support System (CDSS) designed to calculate the probability of developing NHD in children with CP. The system utilizes an ensemble of three machine learning (ML) algorithms: Neural Network (NN), Support Vector Machine (SVM), and Logistic Regression (LR). The development and evaluation of the CDSS followed the DECIDE-AI guidelines for AI-driven clinical decision support tools. The ensemble was trained on a data series from 182 subjects. Inclusion criteria were age between 12 and 18 years and diagnosis of CP from two specialized units. Clinical and functional data were collected prospectively between 2005 and 2023, and then analyzed in a cross-sectional study. Accuracy and area under the receiver operating characteristic curve (AUROC) were calculated for each method. Best logistic regression scores highlighted history of previous orthopedic surgery (p = 0.001), poor motor function (p = 0.004), truncal tone disorder (p = 0.008), scoliosis (p = 0.031), number of affected limbs (p = 0.05), and epilepsy (p = 0.05) as predictors of NHD. Both accuracy and AUROC were highest for NN, at 83.7% and 0.92, respectively. The novelty of this study lies in the development of an efficient CDSS prototype specifically designed to predict future outcomes of NHD in patients with CP using clinical data. The proposed system, PredictMed-CDSS, demonstrated strong predictive performance for estimating the probability of NHD development in children with CP, with the highest accuracy achieved using NN. PredictMed-CDSS has the potential to assist clinicians in anticipating the need for early interventions and preventive strategies in the management of NHD among CP patients.
Journal article
Model Retraining upon Concept Drift Detection in Network Traffic Big Data
Published 07/24/2025
Future internet, 17, 8, 328
This paper presents a comprehensive model for detecting and addressing concept drift in network security data using the Isolation Forest algorithm. The approach leverages Isolation Forest’s inherent ability to efficiently isolate anomalies in high-dimensional data, making it suitable for adapting to shifting data distributions in dynamic environments. Anomalies in network attack data may not occur in large numbers, so it is important to be able to detect anomalies even with small batch sizes. The novelty of this work lies in successfully detecting anomalies even with small batch sizes and identifying the point at which incremental retraining needs to be started. Triggering retraining early also keeps the model in sync with the latest data, reducing the chance that attacks are conducted successfully. Our methodology implements an end-to-end workflow that continuously monitors incoming data and detects distribution changes using Isolation Forest, then manages model retraining using Random Forest to maintain optimal performance. We evaluate our approach using UWF-ZeekDataFall22, a newly created dataset comprising Zeek’s Connection Logs collected through the Security Onion 2 network security monitor and labeled using the MITRE ATT&CK framework. Incremental as well as full retraining are analyzed using Random Forest. There was a steady increase in the model’s performance with incremental retraining and a positive impact on the model’s performance with full model retraining.
Journal article
Detecting Cyber Threats in UWF-ZeekDataFall22 Using K-Means Clustering in the Big Data Environment
Published 06/18/2025
Future internet, 17, 6, 267
In an era marked by the rapid growth of the Internet of Things (IoT), network security has become increasingly critical. Traditional Intrusion Detection Systems, particularly signature-based methods, struggle to identify evolving cyber threats such as Advanced Persistent Threats (APTs) and zero-day attacks. Such threats or attacks go undetected with supervised machine-learning methods. In this paper, we apply K-means clustering, an unsupervised clustering technique, to a newly created modern network attack dataset, UWF-ZeekDataFall22. Since this dataset contains labeled Zeek logs, the dataset was de-labeled before it was used for K-means clustering. The labels, however, were used in the evaluation phase to determine the attack clusters post-clustering. In order to identify APTs as well as zero-day attack clusters, three different labeling heuristics were evaluated to determine the attack clusters. To address the challenges posed by Big Data, a Big Data framework, namely Apache Spark with PySpark, was used as our development environment. In addition, the uniqueness of this work lies in using connection-based features. Using connection-based features, an in-depth study is done to determine the effect of the number of clusters, the seeds, and the features for each of the different labeling heuristics. If the objective is to detect every single attack, the results indicate that 325 clusters with a seed of 200, using an optimal set of features, would be able to correctly place 99% of attacks.
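The post-clustering evaluation step described above can be sketched as follows. The abstract does not specify the three labeling heuristics, so the rule below (mark a cluster as an attack cluster when its attack fraction reaches a threshold) is purely a hypothetical example, and the function names, the label strings, and the threshold values are all assumptions.

```python
from collections import Counter, defaultdict


def label_clusters(cluster_ids, true_labels, attack_threshold=0.5):
    """Hypothetical heuristic: a cluster is 'attack' if its attack share
    is at least attack_threshold, otherwise 'benign'."""
    per_cluster = defaultdict(Counter)
    for cid, label in zip(cluster_ids, true_labels):
        per_cluster[cid][label] += 1
    cluster_labels = {}
    for cid, counts in per_cluster.items():
        attack_fraction = counts["attack"] / sum(counts.values())
        cluster_labels[cid] = (
            "attack" if attack_fraction >= attack_threshold else "benign"
        )
    return cluster_labels


def attack_placement_rate(cluster_ids, true_labels, cluster_labels):
    """Share of true attack records that fall in a cluster labeled 'attack'."""
    attack_cids = [
        cid for cid, label in zip(cluster_ids, true_labels) if label == "attack"
    ]
    if not attack_cids:
        return 0.0
    placed = sum(1 for cid in attack_cids if cluster_labels[cid] == "attack")
    return placed / len(attack_cids)


# Toy data: cluster assignments (as K-means would produce) and held-back labels.
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2]
true_labels = ["benign", "benign", "attack", "attack", "attack",
               "benign", "benign", "benign"]

strict = label_clusters(cluster_ids, true_labels, attack_threshold=0.5)
lenient = label_clusters(cluster_ids, true_labels, attack_threshold=0.3)
strict_rate = attack_placement_rate(cluster_ids, true_labels, strict)
lenient_rate = attack_placement_rate(cluster_ids, true_labels, lenient)
```

Lowering the threshold marks more clusters as attack clusters, which places more attacks correctly at the cost of sweeping in more benign records; this is the kind of trade-off that comparing several heuristics exposes.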
Journal article
Bahadur–Kiefer Type Representations for Smoothed Conditional Quantile Estimators
Published 05/2025
Bulletin - Calcutta Statistical Association, 77, 1, 57 - 90
Bahadur and Kiefer derived almost sure (a.s.) representations for the (unconditional) sample quantile function in terms of the standard (unsmoothed) empirical distribution function. Their representations later became commonly known as the Bahadur–Kiefer (BK) representations. In this article, we establish BK type a.s. representations, and the resulting laws of iterated logarithm, for three distinct fully nonparametric smooth conditional quantile estimators (with optimal orders for the remainders), viz. for a smooth linear type, a Parzen-type smoothed (integrated) inverse and a smooth inverse type (kernel) conditional quantile estimator (c.q.e.) under some broad conditions on the underlying cdf’s and the kernels and bandwidth sequences employed. We also demonstrate that of these the linear type c.q.e. is, in fact, ‘second-order-equivalent’ to the Parzen-type smoothed (integrated) inverse c.q.e. Some remarks are included on the comparative merits of these smooth c.q.e.’s, and their BK representations relative to their smooth and unsmoothed counterparts studied earlier in the literature and possible extensions of the present results. Our results are of the exact a.s. type and provide improvements over those achieved hitherto in the literature. They are of considerable value for studying the asymptotics of quantile regression analytics.
AMS Subject Classification: Primary 62G05, 62G07; secondary: 60F15, 62G20, 62G30
Journal article
Published 04/25/2025
Data (Basel), 10, 5, 59
This paper describes the creation of a new dataset, UWF-ZeekData24, aligned with the Enterprise MITRE ATT&CK Framework, that addresses critical shortcomings in existing network security datasets. Controlling the construction of attacks and meticulously labeling the data provides a more accurate and dynamic environment for testing IDS/IPS systems and their machine learning algorithms. The outcomes of this research will assist in the development of cybersecurity solutions and increase their robustness and adaptability to modern-day cybersecurity threats. This new, carefully engineered dataset will enhance cyber defense mechanisms that are responsible for safeguarding critical infrastructures and digital assets. Finally, this paper discusses the differences between crowd-sourced data and data collected in a more controlled environment.
Journal article
Uniformly Minimum Variance Unbiased Estimators (UMVUE) Not Attaining Cramer-Rao Lower Bounds
Published 12/23/2024
International Journal of Statistical Sciences, 24, 20, 18
The main thrust of this article is to provide counterexamples where the variance of the UMVUE does not achieve the Cramer-Rao lower bound. We provide many motivating counterexamples and show that these UMVU estimators are, in fact, asymptotically efficient estimators. All counterexamples are new or may not be available in standard textbooks. To illustrate the entire process, we supply many definitions related to UMVUE and describe various methods and step-by-step approaches for finding UMVUEs. In the concluding remarks, we also give a short biography of Professor C.R. Rao. It is hoped that the article will have pedagogical value in courses on statistical inference.
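As a worked illustration of the phenomenon, consider the following well-known Poisson example. It is a standard textbook case and is not claimed to be one of the article's counterexamples; the symbols $g$, $T$, and $\hat g$ are notation introduced here.

```latex
% Standard illustration (an assumption of this sketch, not taken from the article):
% estimate g(\lambda) = e^{-\lambda} = P(X = 0) from an i.i.d. Poisson sample.
\begin{align*}
  & X_1,\dots,X_n \overset{\text{iid}}{\sim} \mathrm{Poisson}(\lambda),
    \qquad T = \sum_{i=1}^{n} X_i \sim \mathrm{Poisson}(n\lambda), \\
  & \hat g = \Bigl(1 - \tfrac{1}{n}\Bigr)^{T},
    \qquad E[\hat g] = \exp\Bigl\{ n\lambda \Bigl(1 - \tfrac{1}{n} - 1\Bigr) \Bigr\}
                     = e^{-\lambda}, \\
  & \operatorname{Var}(\hat g)
      = \exp\Bigl\{ n\lambda \Bigl[\Bigl(1 - \tfrac{1}{n}\Bigr)^{2} - 1\Bigr] \Bigr\}
        - e^{-2\lambda}
      = e^{-2\lambda}\bigl(e^{\lambda/n} - 1\bigr), \\
  & \text{CRLB} = \frac{\{g'(\lambda)\}^{2}}{n\,I(\lambda)}
      = \frac{e^{-2\lambda}}{n/\lambda}
      = \frac{\lambda\, e^{-2\lambda}}{n}, \\
  & \frac{\operatorname{Var}(\hat g)}{\text{CRLB}}
      = \frac{e^{\lambda/n} - 1}{\lambda/n} > 1
      \quad \text{for every } n,
      \qquad \frac{\operatorname{Var}(\hat g)}{\text{CRLB}} \to 1
      \ \text{ as } n \to \infty .
\end{align*}
```

Because $T$ is a complete sufficient statistic and $\hat g$ is an unbiased function of $T$, $\hat g$ is the UMVUE by the Lehmann-Scheffé theorem. Its variance strictly exceeds the bound for every finite $n$, since $e^{x} - 1 > x$ for $x > 0$, yet the ratio tends to 1, which is the sense in which such an estimator is asymptotically efficient.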
Journal article
Published 10/03/2024
Electronics (Basel), 13, 19, 3916
This study investigates the technical challenges of applying Support Vector Machines (SVM) for multi-class classification in network intrusion detection using the UWF-ZeekDataFall22 dataset, which is labeled based on the MITRE ATT&CK framework. A key challenge lies in handling imbalanced classes and complex attack patterns, which are inherent in intrusion detection data. This work highlights the difficulties in implementing SVMs for multi-class classification, particularly with One-vs.-One (OvO) and One-vs.-All (OvA) methods, including scalability issues due to the large volume of network traffic logs and the tendency of SVMs to be sensitive to noisy data and class imbalances. SMOTE was used to address class imbalances, while preprocessing techniques were applied to improve feature selection and reduce noise in the data. The unique structure of network traffic data, with overlapping patterns between attack vectors, posed significant challenges in achieving accurate classification. Our model reached an accuracy of over 90% with OvO and over 80% with OvA, demonstrating that despite these challenges, multi-class SVMs can be effectively applied to complex intrusion detection tasks when combined with appropriate balancing and preprocessing techniques.
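The scalability difference between the two decomposition schemes can be made concrete with a short sketch. It only enumerates the binary subproblems and shows majority voting for OvO; it does not train any SVM, and the class names and helper names are hypothetical rather than taken from the study.

```python
from collections import Counter
from itertools import combinations


def ovo_tasks(classes):
    """One-vs.-One: one binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))


def ova_tasks(classes):
    """One-vs.-All: one binary problem per class against all the others."""
    return [(cls, "rest") for cls in classes]


def ovo_vote(pairwise_winners):
    """Combine OvO outputs: the class that wins the most pairwise contests."""
    return Counter(pairwise_winners).most_common(1)[0][0]


# Hypothetical class labels, for illustration only.
classes = ["benign", "reconnaissance", "discovery", "credential_access"]

ovo = ovo_tasks(classes)   # k * (k - 1) / 2 binary classifiers
ova = ova_tasks(classes)   # k binary classifiers

# Suppose each of the six pairwise classifiers reported these winners
# for a single connection record (made-up values):
winners = ["reconnaissance", "benign", "benign",
           "reconnaissance", "reconnaissance", "discovery"]
predicted = ovo_vote(winners)
```

With k classes, OvO needs k(k-1)/2 classifiers, each trained on only two classes, whereas OvA needs k classifiers, each trained on the full and typically more imbalanced data. This is one reason class balancing matters for both schemes.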