Dr. Subhash C Bagui

Distinguished University Professor, Hal Marcus College of Science and Engineering

Applied Mathematics

Applied Statistics

Data Mining

Mathematics

Mathematics Education

Pattern Recognition

Statistics

Biostatistics

non-central t-distributions

probability and distribution theory

nonparametric classification/discrimination

statistical pattern recognition

cluster analysis

statistical computing

K-12 statistics education

Conference proceeding Open access Peer reviewed

Selecting Feature Subsets in Continuous Flow Network Attack Traffic Big Data Using Incremental Frequent Pattern Mining

by Sikha Bagui, Andrew Benyacko, Dustin Mink, Subhash Bagui and Arijit Bagchi

Published 12/16/2025

Algorithms, 18, 12, 795

This work focuses on finding frequent patterns in continuous flow network traffic Big Data using incremental frequent pattern mining. The study uses UWF-ZeekData24, a newly created dataset of Zeek Conn logs labeled with the MITRE ATT&CK framework and generated using the Cyber Range at The University of West Florida. While FP-Growth is effective for static datasets, its standard implementation does not support incremental mining, which poses challenges for applications involving continuously growing data streams, such as network traffic logs. To overcome this limitation, a staged incremental FP-Growth approach is adopted. The novelty of this work lies in showing how incremental FP-Growth can be used efficiently on continuous flow (streaming) network traffic data, where no rebuild is necessary when new transactions are scanned and integrated. Incremental frequent pattern mining also generates feature subsets that are useful for understanding the nature of individual attack tactics. Hence, a detailed analysis of the features and feature subsets of seven different MITRE ATT&CK tactics is also presented. For example, the results indicate that core behavioral rules, such as those involving the TCP protocol and service associations, emerge early and remain stable throughout later increments. The incremental FP-Growth framework provides a structured lens through which network behaviors can be observed and compared over time, supporting not only classification but also investigative use cases such as anomaly tracking and technique attribution. Finally, the frequent itemsets produced by this work will be useful to intrusion detection algorithms based on machine learning and artificial intelligence.
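The idea of integrating new transactions without rescanning earlier ones can be illustrated with a minimal sketch. This is not the staged incremental FP-Growth approach of the paper and it builds no FP-tree: it simply keeps running support counts for small itemsets and updates them batch by batch. The class name, the `max_size` limit, and the item strings in the usage note (such as `proto=tcp`) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


class IncrementalItemsetCounter:
    """Running support counts for itemsets, updated one batch at a time.

    Illustrative stand-in only: brute-force enumeration of itemsets up to
    max_size items, not an FP-tree and not the paper's method.
    """

    def __init__(self, max_size=2):
        self.max_size = max_size      # largest itemset size to track (assumption)
        self.counts = Counter()       # itemset (sorted tuple) -> occurrence count
        self.n_transactions = 0       # transactions seen so far

    def update(self, batch):
        """Fold a new batch into the counts; earlier batches are not rescanned."""
        for transaction in batch:
            items = sorted(set(transaction))
            self.n_transactions += 1
            for k in range(1, self.max_size + 1):
                for itemset in combinations(items, k):
                    self.counts[itemset] += 1

    def frequent(self, min_support):
        """Return {itemset: support} for itemsets with support >= min_support."""
        if self.n_transactions == 0:
            return {}
        return {
            itemset: count / self.n_transactions
            for itemset, count in self.counts.items()
            if count / self.n_transactions >= min_support
        }
```

Calling `update` once per increment and `frequent` after each call shows which itemsets stay frequent as data arrives. For example, after a first batch of `["proto=tcp", "service=http"]` and `["proto=tcp", "service=dns"]`, only `("proto=tcp",)` meets a minimum support of 0.6; adding one more `["proto=tcp", "service=http"]` transaction brings the pair `("proto=tcp", "service=http")` above that threshold as well.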

Conference proceeding Peer reviewed

Classifying Phishing Email Using Machine Learning and Deep Learning

by Sikha Bagui, Debarghya Nandi, Subhash Bagui and Robert Jamie White

Published 06/2019

2019 International Conference on Cyber Security and Protection of Digital Services (Cyber Security)

International Conference on Cyber Security and Protection of Digital Services (Cyber Security), 06/03/2019–06/04/2019, Oxford, UK

In this work, we applied deep semantic analysis, along with machine learning and deep learning techniques, to capture inherent characteristics of email text and classify emails as phishing or non-phishing.

Conference proceeding Peer reviewed

A Parallel Implementation of Information Gain Using Hive in conjunction with MapReduce for Continuous Features

by S. Bagui, Sharon John, John S. Baggs and Subhash C Bagui

Published 2018

Trends and Applications in Knowledge Discovery and Data Mining: PAKDD 2018 Workshops, BDASC, BDM, ML4Cyber, PAISI, DaMEMO, Melbourne, VIC, Australia, June 3, 2018, Revised Selected Papers, 283-294

Pacific Asia Workshop on Intelligence and Security Informatics (PAISI)

Finding efficient ways to compute Information Gain is becoming even more important as we enter the Big Data era, where data volume and dimensionality are increasing at alarming rates. When machine learning algorithms are overburdened with high-dimensional data containing redundant features, Information Gain becomes crucial for feature selection. Information Gain is also often used as a precursor step in creating decision trees, text classifiers, support vector machines, etc. Due to the very large volume of today’s data, there is a need to efficiently parallelize classic algorithms like Information Gain. In this paper, we present a parallel implementation of Information Gain for continuous features in the MapReduce environment, using MapReduce in conjunction with Hive. In our approach, Hive was used to calculate the counts and parent entropy, and a map-only job was used to complete the Information Gain calculations. Our approach demonstrated gains in run times, as we carefully designed the MapReduce jobs to leverage the Hadoop cluster efficiently.
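For readers unfamiliar with the quantity being parallelized, the following is a minimal single-machine sketch of Information Gain: the parent entropy minus the weighted entropy of the child partitions. It does not use Hive or MapReduce and is not the paper's implementation. Handling a continuous feature with a single binary threshold split is an assumption made here for illustration, as are the function names.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(values, labels, threshold):
    """Information Gain from splitting a continuous feature at `threshold`.

    Assumption: a binary split (value <= threshold vs. value > threshold).
    """
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    total = len(labels)
    child_entropy = (
        (len(left) / total) * entropy(left)
        + (len(right) / total) * entropy(right)
    )
    return entropy(labels) - child_entropy
```

With feature values `[1.0, 2.0, 3.0, 4.0]`, labels `["a", "a", "b", "b"]`, and a threshold of `2.0`, the split separates the two classes perfectly, so the gain equals the full parent entropy of 1 bit. A threshold below every value leaves all records on one side and yields a gain of 0.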
