List of works
Abstract
Mining Positive and Negative Association Rules in Hadoop's MapReduce Environment
Published 01/01/2018
ACMSE '18: Proceedings of the ACMSE 2018 Conference, 33
ACMSE '18: Southeast Conference, 03/29/2018–03/31/2018, Richmond, Kentucky
In this paper, we mine positive and negative association rules from Big Data in Hadoop's MapReduce environment. Positive association rule mining finds items that are positively correlated, whereas negative association rule mining finds items that are negatively correlated. Positive association rule mining has traditionally been used to mine association rules, but negative association rule mining also has many applications, including the building of efficient decision support systems, crime data analysis [2], the health care sector [1], etc. In this paper, we mine positive and negative association rules using the Apriori algorithm in the Big Data environment using Hadoop's MapReduce framework. A positive association rule is of the form X → Y, which has support s in a transaction set D if s% of the transactions in D contain X ∪ Y. A negative association rule is of the form X → ¬Y, ¬X → Y, or ¬X → ¬Y, where X ∩ Y = Ø. X → ¬Y refers to X occurring in the absence of Y; ¬X → Y refers to Y occurring in the absence of X; ¬X → ¬Y means neither X nor Y occurs. For positive association rules: Support(X → Y) is the percentage of transactions in the dataset in which itemsets X and Y co-occur. Confidence(X → Y) is the conditional probability P(Y|X), that is, the percentage of transactions containing X that also contain Y. The support of the negative association rules must satisfy: Supp(X → ¬Y) > min_supp; Supp(¬X → Y) > min_supp; Supp(¬X → ¬Y) > min_supp. The confidence of the negative association rules must satisfy: Conf(X → ¬Y) > min_conf; Conf(¬X → Y) > min_conf; Conf(¬X → ¬Y) > min_conf. In MapReduce, we scan the dataset and create 1-itemsets in one MapReduce job and then use these 1-itemsets to create 2-itemsets in another MapReduce job. In the last map job, the positive and negative association rules are generated and their support, confidence and lift are computed. Therefore, in essence, we use three map and two reduce jobs. The main contribution of this work is in presenting how the Apriori algorithm can be used to extract negative association rules from Big Data and how it can be executed efficiently on MapReduce.
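For concreteness, the following is a minimal sketch, not the paper's implementation, of how the support, confidence and lift of the four rule forms described above can be derived from 1- and 2-itemset counts such as those produced by the earlier MapReduce jobs; the names `derive_rules` and `supports` are illustrative assumptions.

```python
# Minimal illustrative sketch (not the authors' code): deriving positive and
# negative association rules from itemset counts. Names are assumptions.

def derive_rules(supports, n_transactions, min_supp, min_conf):
    """`supports` maps frozensets of items to their transaction counts."""
    rules = []
    for itemset, count in supports.items():
        if len(itemset) != 2:
            continue
        a, b = sorted(itemset)                    # deterministic X, Y order
        s_xy = count / n_transactions             # supp(X ∪ Y)
        s_x = supports[frozenset([a])] / n_transactions
        s_y = supports[frozenset([b])] / n_transactions

        # (support, confidence, support of the consequent) for each rule form
        candidates = {
            "X -> Y":         (s_xy, s_xy / s_x, s_y),
            "X -> not Y":     (s_x - s_xy, (s_x - s_xy) / s_x, 1 - s_y),
            "not X -> Y":     (s_y - s_xy, (s_y - s_xy) / (1 - s_x), s_y),
            "not X -> not Y": (1 - s_x - s_y + s_xy,
                               (1 - s_x - s_y + s_xy) / (1 - s_x), 1 - s_y),
        }
        for form, (supp, conf, s_cons) in candidates.items():
            if supp > min_supp and conf > min_conf:
                lift = conf / s_cons              # lift = conf / supp(consequent)
                rules.append((form, (a, b), supp, conf, lift))
    return rules

# Toy example: 10 transactions, {bread} in 6, {milk} in 5, {bread, milk} in 2
counts = {frozenset(["bread"]): 6, frozenset(["milk"]): 5,
          frozenset(["bread", "milk"]): 2}
print(derive_rules(counts, 10, min_supp=0.1, min_conf=0.5))
```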
Abstract
Published 01/01/2018
ACMSE '18: Proceedings of the ACMSE 2018 Conference, 37
ACMSE '18: Southeast Conference, 03/29/2018–03/31/2018, Richmond, Kentucky
Information Gain (IG), also referred to as the Kullback-Leibler divergence, is a statistical measure employed to select useful features from datasets and eliminate redundant and valueless ones. Applying this feature selection technique paves the way for sophisticated analysis on Big Data, requiring the underlying framework to handle the data's complexity, volume and velocity. The Hadoop ecosystem comes in handy here, enabling seamless distributed computing that leverages the computing potential of many commodity machines. Previous research studies [1, 2] indicate that Hive is best suited for data warehousing and ETL (Extract, Transform, Load) workloads. We aim to extend Hive's use to analytical algorithms and compare its performance with MapReduce. In this Big Data era, it is essential to design algorithms efficiently to reap the benefits of parallelization over existing frameworks. This study showcases how IG can be designed for the Hadoop framework and discusses the implementation of IG as an analytical workload on Hive and MapReduce. Both components are built over a shared-nothing architecture, which avoids contention issues and increases data parallelism, making them well suited for analytical workloads; the programmer is also relieved of the overhead of maintaining structures such as indexes, caches and partitions. Assessing the implementation of Information Gain on both of these parallel processing components will provide insights into the benefits and downsides of each and, at large, will enable researchers and developers to employ the appropriate component for a given task.
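As a point of reference for the analytical workload described above, here is a minimal sketch, not the paper's implementation, of computing Information Gain for a single feature from (feature value, class label) records; the per-value class counts are the kind of aggregation a Hive GROUP BY query or a MapReduce reduce step would produce. All names are illustrative.

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    """Shannon entropy of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(records, feature_index, label_index=-1):
    """IG(class; feature) = H(class) - sum over values v of p(v) * H(class | v)."""
    class_counts = Counter(r[label_index] for r in records)
    h_class = entropy(list(class_counts.values()))

    by_value = defaultdict(Counter)            # feature value -> class counts
    for r in records:
        by_value[r[feature_index]][r[label_index]] += 1

    total = len(records)
    h_conditional = sum((sum(c.values()) / total) * entropy(list(c.values()))
                        for c in by_value.values())
    return h_class - h_conditional

# Toy example: (outlook, play) records
data = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
        ("rain", "yes"), ("rain", "yes"), ("rain", "no")]
print(information_gain(data, feature_index=0))   # ~0.54 bits
```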
References
A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Anthony and R. Murthy, 2010, Hive - A Petabyte Scale Data Warehouse Using Hadoop, In Proceedings of the International Conference on Data Engineering, 996-1005.
C. Stachniss, G. Grisetti and W. Burgard, 2005, Information Gain-Based Exploration Using Rao-Blackwellized Particle Filters, In Robotics: Science and Systems.
Abstract
A Hybrid Genetic Algorithm for Network Intrusion Detection
Published 01/01/2018
ACMSE '18: Proceedings of the ACMSE 2018 Conference, 34
ACMSE '18: Southeast Conference, 03/29/2018–03/31/2018, Richmond, Kentucky
Feature selection is common in prediction tasks because it reduces computation time as well as the dimensionality of the data. This paper presents a hybrid filter-wrapper approach to detecting network intrusion attacks using the genetic algorithm. The genetic algorithm is a popular search algorithm with wide applications in optimization problems such as the traveling salesman problem (TSP). One of its biggest advantages is its continuous evolution towards better solutions. However, it takes a greedy approach, evaluating each candidate against a fitness function, which makes it vulnerable to local optima. A certain amount of randomness at each generation can help overcome this problem. In network intrusion detection systems, the number of real attacks is sometimes far lower than the false alarm rate, causing real attacks to be ignored. To overcome this problem, we propose an objective function that not only rewards higher accuracy but also heavily penalizes false positives. Features are initially selected based on information gain, each feature is weighted differently based on domain knowledge, and the selected subset of features is then scored based on accuracy, with a higher penalty for false positives. In addition, crossover and mutation are carried out to allow for sufficient randomness in feature selection and to avoid overfitting. Sample experimentation on the UNSW-NB15 dataset shows that our approach performs much better than traditional methods and other state-of-the-art intrusion detection classification algorithms.
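To make the objective concrete, the following is a minimal sketch, not the authors' implementation, of a fitness function that rewards accuracy while heavily penalizing false positives, together with simple crossover and mutation over a binary feature mask; the penalty weight, mutation rate and the `evaluate_subset` callback are illustrative assumptions.

```python
import random

def fitness(feature_mask, evaluate_subset, fp_penalty=5.0):
    """Score a candidate feature subset: reward accuracy, heavily penalize
    false positives. `evaluate_subset` is assumed to train and validate a
    classifier on the selected features and return (accuracy, fp_rate)."""
    accuracy, fp_rate = evaluate_subset(feature_mask)
    return accuracy - fp_penalty * fp_rate

def crossover(parent_a, parent_b):
    """Single-point crossover of two binary feature masks."""
    point = random.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]

def mutate(mask, rate=0.02):
    """Flip each bit with a small probability to inject randomness and
    help the search escape local optima."""
    return [bit ^ 1 if random.random() < rate else bit for bit in mask]
```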