BIG DATA ANALYTICS: PERFORMANCE TUNING IN APACHE SPARK

Authors

  • Sanjay Puthenpariyarath, Staff Data Engineer, CVS Health, United States of America (USA)

DOI:

https://doi.org/10.34218/IJCET_16_02_006

Keywords:

Apache Spark, Performance Tuning, Big Data, Data Handling, Query Optimization, Parallelism, Optimization Techniques, Key Salting, Adaptive Partitioning, Shuffle Operations, Data Partitioning, Memory Management Optimization, Shuffle Optimization, Data Skew Handling

Abstract

Today’s organizations generate and manage massive amounts of data from varied sources such as social media platforms, IoT devices, and transactional systems. This exponential growth has made robust, scalable, and efficient frameworks for large-scale data processing a pressing requirement [14]. Apache Spark, an open-source distributed computing system, has become a leading solution for big data processing [5]. Its in-memory computation, unified API, and support for multiple programming languages make it a versatile tool for diverse workloads, including batch processing, real-time analytics, machine learning, and graph computation [7]. Nevertheless, running Spark applications efficiently remains challenging, even with these advanced features available. As datasets grow in size and complexity, insufficient memory, data skew, heavy shuffle overhead, and poor configuration can become performance bottlenecks. The consequences are prolonged execution times, wasted resources, and system instability, all of which undermine the framework's potential. This paper presents performance tuning techniques that address these challenges and offers a structured approach to tuning Spark applications for large-scale data processing. Data engineers and practitioners need to understand established strategies such as configuration optimization, data partitioning, and memory management to unlock Apache Spark's full potential. We validate these techniques experimentally and show that they improve the efficiency, scalability, and reliability of Spark applications.

References

H. Karau and R. Warren, High-Performance Spark. O’Reilly Media, Inc., 2017.

S. Salloum, R. Dautov, X. Chen, P. X. Peng, and J. Z. Huang, “Big Data Analytics on Apache Spark,” International Journal of Data Science and Analytics, vol. 1, no. 3–4, pp. 145–164, Oct. 2016, doi: https://doi.org/10.1007/s41060-016-0027-9.

M. A. Rahman, J. Hossen, and C. Venkataseshaiah, “SMBSP: A Self-Tuning Approach using Machine Learning to Improve Performance of Spark in Big Data Processing,” in 2018 7th International Conference on Computer and Communication Engineering (ICCCE), Kuala Lumpur, Malaysia: IEEE, Sep. 2018, pp. 274–279. doi: https://doi.org/10.1109/iccce.2018.8539328.

H. Herodotou, Y. Chen, and J. Lu, “A Survey on Automatic Parameter Tuning for Big Data Processing Systems,” ACM Computing Surveys, vol. 53, no. 2, pp. 1–37, Jul. 2020, doi: https://doi.org/10.1145/3381027.

E. Shaikh, I. Mohiuddin, Y. Alufaisan, and I. Nahvi, “Apache Spark: A Big Data Processing Engine,” in 2019 2nd IEEE Middle East and North Africa COMMunications Conference (MENACOMM), Manama, Bahrain: IEEE, Nov. 2019. doi: https://doi.org/10.1109/menacomm46666.2019.8988541.

J. Kroß and H. Krcmar, “PerTract: Model Extraction and Specification of Big Data Systems for Performance Prediction by the Example of Apache Spark and Hadoop,” Big Data and Cognitive Computing, vol. 3, no. 3, p. 47, Aug. 2019, doi: https://doi.org/10.3390/bdcc3030047.

S. Penchikala, Big Data Processing With Apache Spark. C4Media, 2018.

O.-C. Marcu, A. Costan, G. Antoniu, and M. S. Pérez-Hernández, “Spark versus Flink: Understanding Performance in Big Data Analytics Frameworks,” in 2016 IEEE International Conference on Cluster Computing (CLUSTER), Taipei, Taiwan: IEEE, Sep. 2016, pp. 433–442. doi: https://doi.org/10.1109/CLUSTER.2016.22.

N. Nguyen, M. Maifi, and K. Wang, “Towards Automatic Tuning of Apache Spark Configuration,” in 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), San Francisco, CA, USA: IEEE, Jul. 2018, pp. 417–425. doi: https://doi.org/10.1109/cloud.2018.00059.

R. Maheshwar and D. Haritha, “Survey on High-Performance Analytics of Bigdata with Apache Spark,” in 2016 International Conference on Advanced Communication Control and Computing Technologies (ICACCCT), Ramanathapuram, India: IEEE, May 2016. doi: https://doi.org/10.1109/icaccct.2016.7831734.

A. J. Awan, M. Brorsson, V. Vlassov, and E. Ayguade, “Architectural Impact on Performance of In-memory Data Analytics: Apache Spark Case Study,” arXiv.org, Apr. 2016, doi: https://doi.org/10.48550/arXiv.1604.08484.

G. Wang, J. Xu, and B. He, “A Novel Method for Tuning Configuration Parameters of Spark Based on Machine Learning,” in IEEE Xplore, Sydney, NSW, Australia, Dec. 2016, pp. 586–593. doi: https://doi.org/10.1109/HPCC-SmartCity-DSS.2016.0088.

G. Cheng, S. Ying, and B. Wang, “Tuning Configuration of Apache Spark on Public Clouds by Combining multi-objective Optimization and Performance Prediction Model,” Journal of Systems and Software, vol. 180, p. 111028, Oct. 2021, doi: https://doi.org/10.1016/j.jss.2021.111028.

M. Assefi, E. Behravesh, A. P. Tafti, and G. Liu, “Big Data Machine Learning Using Apache Spark MLlib,” in 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, USA: IEEE, 2017. Available: https://ieeexplore.ieee.org/abstract/document/8258338

M. R. Sundarakumar et al., “A Comprehensive Study and Review of Tuning the Performance on Database Scalability in Big Data Analytics,” Journal of Intelligent & Fuzzy Systems, vol. 44, no. 3, pp. 1–25, Dec. 2022, doi: https://doi.org/10.3233/jifs-223295.

M. Zaharia et al., “Apache Spark,” Communications of the ACM, vol. 59, no. 11, pp. 56–65, Oct. 2016, doi: https://doi.org/10.1145/2934664.

B. Sun, M. Li, Y. Li, M. Lv, Z. Peng, and R. Hong, “An interpretable operating condition partitioning approach based on global spatial structure compensation-local temporal information aggregation self-organizing map for complex industrial processes,” Expert Systems with Applications, vol. 249, p. 123841, Sep. 2024, doi: https://doi.org/10.1016/j.eswa.2024.123841.

Published

2025-03-14

How to Cite

Sanjay Puthenpariyarath. (2025). BIG DATA ANALYTICS: PERFORMANCE TUNING IN APACHE SPARK. INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING AND TECHNOLOGY, 16(2), 99-117. https://doi.org/10.34218/IJCET_16_02_006