Real-time-Transit-Data-Pipeline-MBTA-ETL-CDC

Back to Portfolio

Real-time Transit Data Pipeline: ETL from MBTA to MySQL and CDC with MongoDB

Go to repository

This project presents a comprehensive and integrated real-world data engineering pipeline that leverages real-time transit data from the Massachusetts Bay Transportation Authority (MBTA).

The pipeline demonstrates the use of Extract, Transform, Load (ETL) and Change Data Capture (CDC) processes to ensure real-time data ingestion and storage, as well as data synchronization across storage systems for efficient data replication and consistency. Utilizing a variety of data engineering tools and technologies, including Docker, Apache Airflow, MySQL, MongoDB, and Python MySQL Replication, the pipeline supports real-time data availability, event-driven architectures, and disaster recovery. This makes it an exemplary model for handling dynamic data in a production-like environment.

In addition to real-time data handling, the pipeline also facilitates historical data analysis and visualization. This enables time-series analysis and trend detection to provide insights into transit patterns, informing decision-making and optimizing transit operations.

Table of Contents

Back to top

Overview

The pipeline is designed to:

  1. Real-time Data Ingestion:
    • Create a MySQL database within a Docker container to store real-time data fetched from the MBTA API.
    • Periodically call the MBTA API every 10 seconds, parse the JSON response, and store the data in the MySQL database for further analysis.
    • Run the server for 24 hours to continuously collect and store data.
  2. Change Data Capture (CDC):
    • Monitor and propagate changes from the MySQL database to a MongoDB database.
    • Capture MySQL binary logs to track and replicate data changes efficiently.
  3. Data analysis and visualization:
    • Perform in-depth data analysis and visualization on data stored in the MySQL database using Jupyter Notebook.
    • Calculate the average time for a bus to complete Route 1 and estimate bus speed using the Haversine formula.
    • Visualize the analysis results to understand traffic patterns and peak times in Boston.

Back to top

Benefits

Back to top

Architecture

mbta_pipeline_architecture.png

  1. MBTA API: Source of real-time transit data.
  2. Docker: Containerization of services.
  3. Apache Airflow: ETL and CDC workflows orchestration.
  4. MySQL: Stores real-time data fetched from MBTA API.
  5. MongoDB: Receives data changes from MySQL via CDC.
  6. Python MySQL Replication: Supports the MySQL binary log CDC process.
  7. SQLAlchemy: Facilitates database connections and interactions.
  8. Pandas and Matplotlib: Data analysis and visualization performed in Jupyter Notebook.

Back to top

Prerequisites

Back to top

System Setup

  1. Clone the repository.
  2. Navigate to the root directory where the Dockerfile is located.
  3. Start Docker container for Apache Airflow.
     docker-compose up
    
  4. Launch the Airflow web UI.
     open http://localhost:8080
    
  5. Set up Docker network, build customized container image for MySQL, and start Docker containers for both MySQL and MongoDB.
     python3 containers.py -network
     python3 containers.py -image
     python3 containers.py -create
    
  6. Deploy the Airflow DAG and start the pipeline.
     airflow dags trigger mbta_etl_cdc_pipeline
    

    or manually trigger the DAG from the Airflow web UI.

Back to top