Optimizing Data Pipelines for Real-Time Analytics: A Comparative Study of Modern Data Engineering Frameworks

  • Unique Paper ID: 174423
  • PageNo: 3712-3717
  • Abstract:
  • Real-time analytics has become a cornerstone of modern data-driven decision-making, yet selecting optimal frameworks for low-latency, scalable data pipelines remains a challenge. This study evaluates three prominent data engineering tools (Apache Kafka, Apache Flink, and Spark Streaming) through empirical benchmarks measuring latency, throughput, resource efficiency, and fault tolerance. Using synthetic clickstream data and the NYC Taxi dataset, experiments simulate real-world scenarios such as stateful aggregations and hybrid batch-stream processing. Results identify Apache Flink as the leader in low-latency workloads (<=100 ms), ideal for IoT and fraud detection, while Kafka excels in high-throughput ingestion (150,000+ events/sec). Spark Streaming, though slower (500 ms–2 s latency), proves cost-effective for legacy systems requiring batch-stream unification. The analysis further reveals bottlenecks such as Kafka’s partition rebalancing overhead and Flink’s memory demands, proposing optimizations like hybrid Kafka-Flink architectures and cloud-native autoscaling. This work provides a decision-making framework for engineers balancing performance, cost, and scalability in real-time pipeline design, with implications for industries ranging from healthcare to finance. Future directions include AI-driven tuning and edge computing integration to address evolving data velocity and volume challenges.

Copyright & License

Copyright © 2026 Authors retain the copyright of this article. This article is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

BibTeX

@article{174423,
        author = {Susmit Kulkarni and Harshal Parmar and Barkha Makwana},
        title = {Optimizing Data Pipelines for Real-Time Analytics: A Comparative Study of Modern Data Engineering Frameworks},
        journal = {International Journal of Innovative Research in Technology},
        year = {2025},
        volume = {11},
        number = {10},
        pages = {3712--3717},
        issn = {2349-6002},
        url = {https://ijirt.org/article?manuscript=174423},
        abstract = {Real-time analytics has become a cornerstone of modern data-driven decision-making, yet selecting optimal frameworks for low-latency, scalable data pipelines remains a challenge. This study evaluates three prominent data engineering tools (Apache Kafka, Apache Flink, and Spark Streaming) through empirical benchmarks measuring latency, throughput, resource efficiency, and fault tolerance. Using synthetic clickstream data and the NYC Taxi dataset, experiments simulate real-world scenarios such as stateful aggregations and hybrid batch-stream processing. Results identify Apache Flink as the leader in low-latency workloads (<=100 ms), ideal for IoT and fraud detection, while Kafka excels in high-throughput ingestion (150,000+ events/sec). Spark Streaming, though slower (500 ms–2 s latency), proves cost-effective for legacy systems requiring batch-stream unification. The analysis further reveals bottlenecks such as Kafka’s partition rebalancing overhead and Flink’s memory demands, proposing optimizations like hybrid Kafka-Flink architectures and cloud-native autoscaling. This work provides a decision-making framework for engineers balancing performance, cost, and scalability in real-time pipeline design, with implications for industries ranging from healthcare to finance. Future directions include AI-driven tuning and edge computing integration to address evolving data velocity and volume challenges.},
        keywords = {},
        month = {March},
        }

Cite This Article

Kulkarni, S., Parmar, H., & Makwana, B. (2025). Optimizing Data Pipelines for Real-Time Analytics: A Comparative Study of Modern Data Engineering Frameworks. International Journal of Innovative Research in Technology (IJIRT), 11(10), 3712–3717.
