A Parallel Implementation of Information Gain Using Hive in conjunction with MapReduce for Continuous Features
Conference proceeding   Peer reviewed

S. Bagui, Sharon John, John Baggs and Subhash C. Bagui
Trends and Applications in Knowledge Discovery and Data Mining: PAKDD 2018 Workshops, BDASC, BDM, ML4Cyber, PAISI, DaMEMO, Melbourne, VIC, Australia, June 3, 2018, Revised Selected Papers, pp. 283-294
Lecture Notes in Computer Science, vol. 11154
Pacific Asia Workshop on Intelligence and Security Informatics (PAISI)
2018
Web of Science ID: WOS:000714952500028


Abstract

Finding efficient ways to compute Information Gain is becoming ever more important as we enter the Big Data era, where both data volume and dimensionality are increasing at alarming rates. When machine learning algorithms are over-burdened with high-dimensional data containing redundant features, Information Gain becomes crucial for feature selection. Information Gain is also often used as a preliminary step in building decision trees, text classifiers, support vector machines, etc. Given the very large volume of today's data, there is a need to efficiently parallelize classic algorithms such as Information Gain. In this paper, we present a parallel implementation of Information Gain for continuous features in the MapReduce environment, using MapReduce in conjunction with Hive. In our approach, Hive is used to calculate the counts and the parent entropy, and a map-only job completes the Information Gain calculations. Our approach demonstrated gains in run time because the MapReduce jobs were carefully designed to leverage the Hadoop cluster efficiently.
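For reference, Information Gain for a continuous feature is typically computed by choosing a split threshold and comparing the parent entropy with the class-weighted entropy of the two resulting partitions. The sketch below illustrates that calculation in plain single-machine Python; the function names, threshold choice, and toy data are illustrative only, and the paper's actual approach, which distributes the counting and entropy steps across Hive queries and a map-only MapReduce job, is not shown here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Gain from splitting a continuous feature at `threshold`:
    parent entropy minus the size-weighted entropy of the two partitions."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    weighted = (len(left) / len(labels)) * entropy(left) \
             + (len(right) / len(labels)) * entropy(right)
    return entropy(labels) - weighted

# Toy example: the threshold 2.5 separates the classes perfectly,
# so the gain equals the full parent entropy (1.0 bit).
values = [1.0, 2.0, 3.0, 4.0]
labels = ["a", "a", "b", "b"]
print(information_gain(values, labels, 2.5))  # 1.0
```

In a distributed setting, the per-partition class counts that `Counter` produces here are exactly the aggregates one would obtain from grouped Hive queries, leaving only the final arithmetic for a map-only job.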
