Critical Insight for MapReduce Optimization in Hadoop

Burhan Ul Islam Khan; Rashidah F. Olanrewaju; Hunain Altaf; Asadullah Shah

In the present-day scenario, the cloud has become an inevitable need for the majority of IT organizations. Cloud applications such as data storage, data retrieval and data portability have become significant requirements for cloud computing, and numerous applications are being developed for Big Data. Achieving higher performance in terms of efficient load balancing, load distribution, optimum resource utilization, minimum overhead and the least possible delay has been a vital issue for cloud infrastructure. Apache Hadoop is one of the most widely used frameworks for cloud infrastructure. The predominant philosophy behind Hadoop optimization is the optimization of MapReduce, its dominant programming platform, in which many functional enhancements have been achieved through the scheduling algorithms developed and implemented for it. MapReduce has emerged as the most significant part of the Hadoop system, establishing itself as a framework that effectively simplifies the complexity of running parallel data processes across a network of computing nodes. A number of scheduling techniques have been advocated in the last couple of years for achieving enhanced load balancing in Hadoop. Unfortunately, Hadoop still lacks a system model that could deliver optimized performance without creating much computational overhead. To pave the way for the future development of an adept and decisive load balancing and job scheduling scheme that achieves minimum execution time and optimum resource utilization, this paper presents a comprehensive review of some of the major works and discusses the prominent issues that will need to be addressed while developing such a scheme.
