DEMYSTIFYING SYNTHETIC DATA GENERATION FOR PERFORMANCE BENCHMARKING

Authors

  • Sudhakar Reddy Narra, USA

Keywords:

Synthetic Data Generation, Machine Learning-Based Data Synthesis, Privacy-Preserving Testing, Performance Engineering, CI/CD Pipeline Integration

Abstract

Synthetic data generation has become an important technique in performance engineering because it lets teams test applications without relying on sensitive or unavailable production data. This article explores the fundamental principles, methodologies, and tools that enable engineers to create realistic, production-like datasets tailored for performance testing. By examining key components such as schema modeling, randomized data generation, and data anonymization, it shows how organizations can produce high-quality, representative test data while maintaining privacy compliance. The article also examines how machine learning, Monte Carlo simulation, and blockchain-based approaches are changing the way synthetic data is generated and validated. Finally, it discusses the practical implementation of these techniques within modern CI/CD pipelines and highlights best practices and strategies for successful deployment across industries.
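As an illustration of the three components named in the abstract (schema modeling, randomized data generation, and data anonymization), the following is a minimal sketch and not the article's own implementation. The schema, its field names, the value distributions, and the salt are all hypothetical, and salted hashing is strictly a form of pseudonymization rather than full anonymization.

```python
import hashlib
import random

# Hypothetical schema: each field maps to a generator that draws a value
# from a seeded random.Random instance. Fields and distributions are
# illustrative assumptions, not taken from the article.
SCHEMA = {
    "user_id": lambda rng: rng.randint(1, 10_000),
    "country": lambda rng: rng.choice(["US", "DE", "IN", "BR"]),
    "order_total": lambda rng: round(rng.lognormvariate(3.0, 0.8), 2),
}


def generate_rows(schema, count, seed=0):
    """Generate `count` rows from the schema; the same seed gives the same rows."""
    rng = random.Random(seed)
    return [{name: gen(rng) for name, gen in schema.items()} for _ in range(count)]


def pseudonymize(value, salt="bench-salt"):
    """Replace a value with a truncated salted SHA-256 digest (16 hex characters)."""
    digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).hexdigest()
    return digest[:16]


def anonymize_rows(rows, sensitive_fields):
    """Return copies of the rows with the sensitive fields pseudonymized."""
    return [
        {key: pseudonymize(val) if key in sensitive_fields else val
         for key, val in row.items()}
        for row in rows
    ]


rows = anonymize_rows(generate_rows(SCHEMA, 5, seed=42), {"user_id"})
for row in rows:
    print(row)
```

Seeding the generator keeps a benchmark run reproducible, which matters when the same dataset has to be regenerated on each CI/CD pipeline execution.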

Published

2025-01-16

How to Cite

Sudhakar Reddy Narra. (2025). DEMYSTIFYING SYNTHETIC DATA GENERATION FOR PERFORMANCE BENCHMARKING. INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING AND TECHNOLOGY, 16(01), 172-185. https://ijcet.in/index.php/ijcet/article/view/196