Information Gain (IG), also known as Kullback-Leibler divergence, is a statistical measure used to select useful features from datasets and eliminate redundant and valueless ones. Applying this feature selection technique paves the way for sophisticated analysis of Big Data, which requires the underlying framework to handle the data's complexity, volume and velocity. The Hadoop ecosystem comes in handy here, enabling seamless distributed computing that leverages the computing potential of many commodity machines. Previous research studies [1, 2] indicate that Hive is best suited for data warehousing and ETL (Extract, Transform, Load) workloads. We aim to extend Hive's use to analytical algorithms, assess how well it suits them, and compare its performance with MapReduce. In this Big Data era, it is essential to design algorithms efficiently to reap the benefits of parallelization over existing frameworks. This study showcases the efficacy of designing IG for the Hadoop framework and discusses the implementation of IG for analytical workloads on Hive and MapReduce. Both components are inherently built on a shared-nothing architecture, which prevents contention and increases data parallelism, making them well suited to analytical workloads. Hence, the programmer is relieved of the overhead of maintaining structures such as indexes, caches and partitions. Assessing the implementation of Information Gain on both of these parallel processing components will provide insight into the benefits and downsides that each has to offer and, more broadly, will enable researchers and developers to employ the appropriate component for a given task.
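As a point of reference for the measure discussed above, the following is a minimal single-machine sketch of Information Gain for one categorical feature, computed as the class entropy minus the class entropy conditioned on the feature. It is not the Hive or MapReduce implementation evaluated in this study; the function names `entropy` and `information_gain` and the toy data are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """IG of a categorical feature: H(class) - H(class | feature)."""
    total = len(labels)
    # Group the class labels by the feature value they co-occur with.
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    # Weighted average of the entropy within each feature-value group.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: the first feature separates the classes, the second does not.
labels = ["a", "a", "b", "b"]
informative = [0, 0, 1, 1]
uninformative = [0, 1, 0, 1]
print(information_gain(informative, labels))
print(information_gain(uninformative, labels))
```

The computation depends only on counts of each (feature value, class) pair, which is the kind of grouped aggregation that both a SQL-style engine and a map/reduce job can distribute.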
References
[1] A. Thusoo, J.S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Anthony and R. Murthy, 2010, Hive - A Petabyte Scale Data Warehouse Using Hadoop, In Proceedings of the International Conference on Data Engineering, 996-1005.
[2] C. Stachniss, G. Grisetti and W. Burgard, 2005, Information Gain-Based Exploration Using Rao-Blackwellized Particle Filters, Robotics: Science and Systems.
Details
Title
A Comparative Study of MapReduce and Hive based on the Design of the Information Gain Algorithm for Analytical Workloads Extended Abstract
Publication Details
ACMSE '18: Proceedings of the ACMSE 2018 Conference, 37