A Comparative Study of MapReduce and Hive based on the Design of the Information Gain Algorithm for Analytical Workloads Extended Abstract

Sikha Bagui; Sharon K. John; John P. Baggs

doi:10.1145/3190645.3190705

Open access

ACMSE '18: Proceedings of the ACMSE 2018 Conference, 37

ACMSE '18: Southeast Conference (Richmond, Kentucky, 03/29/2018–03/31/2018)

01/01/2018

DOI: https://doi.org/10.1145/3190645.3190705

Web of Science ID: WOS:000697984300040

Abstract

Information Gain (IG), also known as the Kullback-Leibler algorithm, is a statistical algorithm used to extract useful features from datasets and eliminate redundant and valueless ones. Applying this feature selection technique paves the way for sophisticated analysis of Big Data, which requires the underlying framework to handle the data's complexity, volume and velocity. The Hadoop ecosystem comes in handy here, enabling seamless distributed computing that leverages the computing potential of many commodity machines. Previous research studies [1, 2] indicate that Hive is best suited for data warehousing and ETL (Extract, Transform, Load) workloads. We aim to extend Hive's use to analytical algorithms, analyze how well it suits them, and compare its performance with MapReduce. In this Big Data era, it is essential to design algorithms efficiently to reap the benefits of parallelization over existing frameworks. This study will showcase the efficacy of designing IG for the Hadoop framework and discuss the implementation of IG for analytical workloads on Hive and MapReduce. Both components are inherently built on a shared-nothing architecture, which prevents contention and increases data parallelism, making them well suited to analytical workloads. Hence, the programmer is relieved of the overhead of maintaining structures such as indexes, caches and partitions. Assessing the implementation of Information Gain on both parallel processing components will provide insight into the benefits and downsides that each component has to offer and, more broadly, will enable researchers and developers to employ the appropriate component for a given task.

References

[1] A. Thusoo, J.S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Anthony and R. Murthy. 2010. Hive - A Petabyte Scale Data Warehouse Using Hadoop. In Proceedings of the International Conference on Data Engineering, 996–1005.

[2] C. Stachniss, G. Grisetti and W. Burgard. 2005. Information Gain-Based Exploration Using Rao-Blackwellized Particle Filters. Robotics: Science and Systems.
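
For readers unfamiliar with the technique, the following is a minimal sketch of how Information Gain for feature selection can be phrased as a count-then-aggregate job, the shape that both a MapReduce job and a Hive GROUP BY query share. It is not the authors' design, which the abstract does not describe. The function names (map_phase, reduce_phase, entropy, information_gain) and the toy rows are hypothetical and run in a single Python process. It assumes the usual definition: IG of a feature is the entropy of the class label minus the label's entropy conditioned on that feature.

```python
from collections import Counter
from math import log2

def map_phase(rows, label_index):
    """Emit ((feature_index, feature_value, label), 1) for each feature cell.

    Hypothetical stand-in for a mapper; a real job would read input splits.
    """
    for row in rows:
        label = row[label_index]
        for i, value in enumerate(row):
            if i != label_index:
                yield (i, value, label), 1

def reduce_phase(pairs):
    """Sum the emitted counts per key, as a reducer or a GROUP BY would."""
    counts = Counter()
    for key, n in pairs:
        counts[key] += n
    return counts

def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(counts, feature_index):
    """IG(label; feature) = H(label) - H(label | feature), from joint counts."""
    label_totals = Counter()
    by_value = {}
    for (i, value, label), n in counts.items():
        if i != feature_index:
            continue
        label_totals[label] += n
        by_value.setdefault(value, Counter())[label] += n
    total = sum(label_totals.values())
    h_label = entropy(label_totals.values())
    h_conditional = sum(
        sum(c.values()) / total * entropy(c.values()) for c in by_value.values()
    )
    return h_label - h_conditional

# Toy data (assumed for illustration): column 2 is the class label.
rows = [
    ("a", "x", "yes"),
    ("a", "y", "yes"),
    ("b", "x", "no"),
    ("b", "y", "no"),
]
counts = reduce_phase(map_phase(rows, label_index=2))
for feature in (0, 1):
    print(feature, information_gain(counts, feature))
```

In this toy input, feature 0 determines the label while feature 1 is independent of it, so a feature selector based on IG would keep the former and discard the latter. Only the counting step needs to be distributed; the entropy arithmetic operates on the small aggregated table.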

Details

Title: A Comparative Study of MapReduce and Hive based on the Design of the Information Gain Algorithm for Analytical Workloads Extended Abstract
Publication Details: ACMSE '18: Proceedings of the ACMSE 2018 Conference, 37
Resource Type: Abstract
Conference: ACMSE '18: Southeast Conference (Richmond, Kentucky, 03/29/2018–03/31/2018)
Publisher: Association for Computing Machinery; New York
Format: link
Number of pages: 1
Identifiers: WOS:000697984300040; 99380179095506600
Academic Unit: Hal Marcus College of Science and Engineering; Computer Science
Language: English
