A Parallel Implementation of Information Gain Using Hive in conjunction with MapReduce for Continuous Features
Conference proceeding   Peer reviewed

S. Bagui, Sharon John, John Baggs and Subhash C. Bagui
Trends and Applications in Knowledge Discovery and Data Mining: PAKDD 2018 Workshops, BDASC, BDM, ML4Cyber, PAISI, DaMEMO, Melbourne, VIC, Australia, June 3, 2018, Revised Selected Papers, pp. 283-294
Lecture Notes in Computer Science, vol. 11154
Pacific Asia Workshop on Intelligence and Security Informatics (PAISI)
2018
Web of Science ID: WOS:000714952500028


Abstract

Finding efficient ways to compute Information Gain is becoming ever more important as we enter the Big Data era, where both data volume and dimensionality are increasing at alarming rates. When machine learning algorithms are over-burdened with high-dimensional data containing redundant features, Information Gain becomes crucial for feature selection. Information Gain is also often used as a preliminary step in building decision trees, text classifiers, support vector machines, etc. Given the very large volume of today's data, there is a need to efficiently parallelize classic algorithms such as Information Gain. In this paper, we present a parallel implementation of Information Gain for continuous features in the MapReduce environment, using MapReduce in conjunction with Hive. In our approach, Hive is used to calculate the counts and the parent entropy, and a map-only job completes the Information Gain calculations. Our approach demonstrated gains in run time because the MapReduce jobs were carefully designed to leverage the Hadoop cluster efficiently.
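For reference, Information Gain for a continuous feature is typically computed by choosing a split threshold and comparing the parent entropy with the class-weighted entropy of the two resulting partitions. The sketch below illustrates that calculation in plain single-machine Python; the function names, threshold choice, and toy data are illustrative only, and the paper's actual approach, which distributes the counting and entropy steps across Hive queries and a map-only MapReduce job, is not shown here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Gain from splitting a continuous feature at `threshold`:
    parent entropy minus the size-weighted entropy of the two partitions."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    weighted = (len(left) / len(labels)) * entropy(left) \
             + (len(right) / len(labels)) * entropy(right)
    return entropy(labels) - weighted

# Toy example: the threshold 2.5 separates the classes perfectly,
# so the gain equals the full parent entropy (1.0 bit).
values = [1.0, 2.0, 3.0, 4.0]
labels = ["a", "a", "b", "b"]
print(information_gain(values, labels, 2.5))  # 1.0
```

In a distributed setting, the per-partition class counts that `Counter` produces here are exactly the aggregates one would obtain from grouped Hive queries, leaving only the final arithmetic for a map-only job.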
