Chaos Engineering for Resilience and Reliability in Cloud-Native and Serverless Systems: A Systematic and Theoretical Synthesis

Abstract

The rapid transition toward cloud-native, microservices-based, and serverless computing has introduced new levels of system complexity, interdependence, and operational uncertainty. Traditional reliability engineering approaches, which emphasize fault avoidance and redundancy, are increasingly insufficient for the dynamic and non-deterministic behavior of modern distributed environments. This research presents a synthesis of chaos engineering as a methodology for enhancing resilience and reliability in contemporary software systems. Drawing on the works listed in the References, the study integrates insights from foundational chaos engineering principles, cloud computing reliability challenges, DevOps practices, and empirical evaluations of system performance under failure conditions. It adopts a systematic literature review methodology to identify, categorize, and synthesize the key theoretical constructs and practical implementations of chaos engineering across diverse technological contexts. The findings indicate that chaos engineering enables proactive resilience validation: by systematically injecting faults into production and pre-production environments, it uncovers latent vulnerabilities and improves system robustness. The reviewed literature further reports that integrating chaos engineering with site reliability engineering, observability frameworks, and self-adaptive systems improves operational stability and performance predictability. The study also explores the role of human-centered learning frameworks in developing high-reliability engineering teams capable of managing complex systems under uncertainty. Despite its advantages, chaos engineering presents challenges in risk management, scalability, and organizational adoption, particularly in highly regulated or mission-critical domains. The research contributes a unified conceptual framework that bridges the theoretical and applied dimensions of chaos engineering and offers a foundation for future work on resilient system design. The study concludes by identifying research gaps and proposing directions for resilience engineering in increasingly autonomous and distributed technological ecosystems.
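
As an illustration of the fault-injection idea summarized above, the following is a minimal Python sketch of a chaos experiment: it states a steady-state hypothesis, injects failures into a dependency, and checks whether the hypothesis still holds. It is not taken from the study or from any of the referenced tools. All names (`fetch_price`, `handle_request`, `inject_faults`, `run_experiment`), the 30% failure rate, and the 95% threshold are assumptions chosen for the example, and the "service" is an in-process stand-in rather than a real distributed system.

```python
import random


class InjectedFault(Exception):
    """Hypothetical error raised in place of a real dependency failure."""


def inject_faults(call, failure_rate, rng):
    """Wrap `call` so that each invocation fails with probability `failure_rate`."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("injected dependency failure")
        return call(*args, **kwargs)
    return wrapped


def fetch_price(item):
    """Stand-in for a remote dependency (assumed for illustration)."""
    return 10.0


def handle_request(item, dependency, retries):
    """Service logic under test: retry the dependency, then degrade gracefully."""
    for _ in range(retries + 1):
        try:
            return {"item": item, "price": dependency(item), "degraded": False}
        except InjectedFault:
            continue
    return {"item": item, "price": None, "degraded": True}


def full_response_rate(dependency, retries, requests=5000):
    """Steady-state metric: fraction of requests answered without degradation."""
    full = sum(
        not handle_request(f"item-{i}", dependency, retries)["degraded"]
        for i in range(requests)
    )
    return full / requests


def run_experiment(retries, failure_rate=0.3, threshold=0.95, seed=0):
    """Compare the metric with and without injected faults against the hypothesis."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    baseline = full_response_rate(fetch_price, retries)
    faulty = inject_faults(fetch_price, failure_rate, rng)
    under_fault = full_response_rate(faulty, retries)
    return {
        "baseline": baseline,
        "under_fault": under_fault,
        "hypothesis_holds": under_fault >= threshold,
    }


if __name__ == "__main__":
    for retries in (0, 2):
        print(f"retries={retries}:", run_experiment(retries))
```

In this sketch, the configuration without retries violates the hypothesis under injected faults, while the configuration with two retries tolerates them. That is the kind of latent weakness a chaos experiment is meant to surface before a real outage does.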

Keywords

Chaos engineering, cloud-native systems, serverless computing, resilience engineering

References

  1. Basiri, A., et al. (2016). Chaos Engineering. IEEE Software, 33(3), 35–41.
  2. Basiri, A., et al. (2019). Automating Chaos Experiments in Production. Proceedings of the IEEE/ACM International Conference on Software Engineering.
  3. Microsoft (2023). Quickstart: Create and Run a Chaos Experiment by Using Azure Chaos Studio.
  4. LitmusChaos (2023). Open Source Chaos Engineering Platform.
  5. Eismann, S., Costa, D. E., Liao, L., Bezemer, C.-P., Shang, W., Hoorn, A., and Kounev, S. (2022). A case study on the stability of performance tests for serverless applications. Journal of Systems and Software.
  6. Scheuner, J. (2022). Performance evaluation of serverless applications and infrastructures.
  7. Cerveira, F., Barbosa, R., Madeira, H., and Araujo, F. (2020). The effects of soft errors and mitigation strategies for virtualization servers. IEEE Transactions on Cloud Computing.
  8. Al-Said Ahmad, A., and Andras, P. (2022). Scalability resilience framework using application-level fault injection for cloud-based software services. Journal of Cloud Computing.
  9. Poltronieri, F., Tortonesi, M., and Stefanelli, C. (2022). A chaos engineering approach for improving the resiliency of IT services configurations. Proceedings of the IEEE/IFIP Network Operations and Management Symposium.
  10. Naqvi, M. A., Malik, S., Astekin, M., and Moonen, L. (2022). On evaluating self-adaptive and self-healing systems using chaos engineering. Proceedings of the IEEE International Conference on Autonomic Computing and Self-Organizing Systems.
  11. Zhu, J. (2021). Serverless chaos—measuring the performance and resilience of cloud function platforms.
  12. Beloki, U. H. (2022). The art of site reliability engineering with Azure.
  13. Chockaiyan, R. (2020). Capital One adoption and evolution of chaos engineering. In Chaos Engineering.
  14. Haber, M. J., Chappell, B., and Hills, C. (2022). Cloud attack vectors: Building effective cyber-defense strategies to protect cloud resources.
  15. Chen, G., Bai, G., Zhang, C., Wang, J., Ni, K., and Chen, Z. (2022). Big Data System Testing Method Based on Chaos Engineering. Proceedings of the IEEE International Conference on Electronics Information and Emergency Communication.
  16. Jernberg, H., Runeson, P., and Engström, E. (2020). Getting Started with Chaos Engineering – Design of an Implementation Framework in Practice. Proceedings of the ACM/IEEE International Symposium on Empirical Software Engineering and Measurement.
  17. De Suman (2021). A Study on Chaos Engineering for Improving Cloud Software Quality and Reliability. Proceedings of the International Conference on Disruptive Technologies for Multi-Disciplinary Research and Applications.
  18. Pethuru Raj, Vanga Skylab, and Chaudhary Akshita (2023). Cloud-native computing: How to design, develop, and secure microservices and event-driven applications.
  19. Arsecularatne, M., and Wickramarachchi, R. (2023). Adoptability of Chaos Engineering with DevOps to Stimulate the Software Delivery Performance. Proceedings of the International Research Conference on Smart Computing and Systems Engineering.
  20. Kesarpu, S. (2025). Chaos Engineering as a Learning Framework: A Human-Centered Model for Developing High-Reliability Engineering Teams. The American Journal of Engineering and Technology, 7(12), 57–64. https://doi.org/10.37547/tajet/Volume07Issue12-05