Towards an Ontology-Based Semantic Approach to Tuning Parameters to Improve Hadoop Application Performance

Ailton Bonifacio, Andre Menolli, Fabiano Silva

Abstract


Hadoop MapReduce helps companies and researchers process large volumes of data. Hadoop has many configuration parameters that must be tuned to obtain better application performance. However, inexperienced users cannot easily find the best parameter settings. Therefore, it is necessary to create environments that promote and motivate information sharing and knowledge dissemination. In addition, it is important that all acquired knowledge be organized so that it can be reused quickly, easily and efficiently whenever necessary. This paper proposes an ontology-based semantic approach to tuning parameters to improve Hadoop application performance. The approach integrates techniques from machine learning, semantic search and ontologies.

Keywords


Hadoop MapReduce; Hadoop Performance; Ontology; Parameter Tuning





Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.


IT in Industry (2012 - ), http://www.it-in-industry.com, ISSN (Online): 2203-1731; ISSN (Print): 2204-0595