About Me
Data Engineer with a strong background in end-to-end data pipeline development, ETL automation, and cloud-based data solutions, transitioning from a technical role in Building Information Modeling (BIM) to enterprise data engineering. Proficient in Python, SQL, AWS, Airflow, and Spark, with hands-on experience in big data technologies, data warehousing, and API integration. Demonstrated ability to manage complex projects and lead cross-functional teams to deliver impactful solutions. Former AEC (Architecture, Engineering & Construction) professional with a decade of experience, now applying that domain expertise to building scalable data solutions.
Work History
Global architecture and design firm integrating data-driven methodologies for enhanced decision-making
- Designed and implemented an end-to-end data pipeline orchestrated by Apache Airflow to manage over 100,000 evolving Building Information Modeling (BIM) data points for 6,600 curtain wall units of Tencent’s 207,700 m² Global Headquarters, achieving a 75% reduction in manual effort and a 50% efficiency gain
- Automated ETL workflows integrating Autodesk Platform Services APIs for real-time metadata extraction, Pandas for data pre-processing, and the AWS SDK for Python (Boto3) to load raw data into Amazon S3 and trigger AWS Glue jobs (sketched below, after this list)
- Engineered advanced transformations in AWS Glue using PySpark scripts, including timestamp-based partitioning and schema cataloging, enabling seamless integration with Amazon Redshift Spectrum for downstream analytics
- Delivered actionable insights by leveraging Redshift Spectrum for querying and visualization, supporting construction detail modifications, cost optimization, and compliance with energy and fire safety standards
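For illustration, a minimal sketch of how such an Airflow orchestration might look. All bucket, job, and endpoint names here are hypothetical placeholders, not the production configuration; the actual DAG lives in the repository linked under Projects.

```python
# Simplified sketch: extract BIM metadata, stage it in S3, trigger a Glue job.
# Endpoint, bucket, and job names are placeholders for illustration only.
from datetime import datetime

import boto3
import pandas as pd
import requests
from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_metadata(**context):
    # Pull curtain-wall metadata from an Autodesk Platform Services endpoint
    # (URL and token handling are stubbed out here).
    resp = requests.get(
        "https://developer.api.autodesk.com/...",  # placeholder endpoint
        headers={"Authorization": "Bearer <token>"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # pushed to XCom; suitable for small payloads


def load_to_s3(**context):
    records = context["ti"].xcom_pull(task_ids="extract_metadata")
    df = pd.DataFrame(records)               # pre-process with Pandas
    df.to_csv("/tmp/panels.csv", index=False)
    boto3.client("s3").upload_file(           # load raw data into S3
        "/tmp/panels.csv", "example-bim-raw", "panels/panels.csv"
    )


def trigger_glue(**context):
    # Kick off the downstream PySpark transformation job
    boto3.client("glue").start_job_run(JobName="example-bim-transform")


with DAG(
    dag_id="bim_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_metadata", python_callable=extract_metadata)
    load = PythonOperator(task_id="load_to_s3", python_callable=load_to_s3)
    glue = PythonOperator(task_id="trigger_glue", python_callable=trigger_glue)
    extract >> load >> glue
```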
Architecture firm specializing in retail, mixed-use, residential, and large-scale urban planning
- Led architectural design and BIM documentation for 10+ large-scale urban projects, collaborating with clients and consultants. Explored Python automation to streamline workflows, sparking a transition into data engineering
Education
This program equipped me with the skills to advance my career in data engineering. Key learning outcomes include:
- Data Processing, Analysis, and Visualization: Proficient in Python and multiple libraries, including NumPy, Pandas, Matplotlib, PySpark, SQLAlchemy, PyMongo, Scikit-learn, SciPy, Dask, NLTK, Graphviz, and Paho.
- Database Design and Management: Proficient in SQL; wrote complex database queries; designed database schemas (data modeling); worked with common RDBMS and NoSQL databases; applied regular expressions.
- Database Containerization: Used Docker to create and manage images and containers; performed change data capture (CDC) across different types of databases, including MongoDB, Cassandra, Redis, and Firebase.
- ETL Processes and Data Pipelines: Performed extract, transform, and load (ETL) operations using Python; created ETL pipelines and orchestrated workflows using Apache Airflow.
- Big Data Handling: Processed big data using Spark and Hadoop; ran parallel operations using Dask; handled real-time streaming data using Mosquitto, ThingsBoard, and Kafka.
- Statistics Basics: Identified key concepts in statistics, including various types of probability distributions, the Central Limit Theorem, and correlation.
- AI/ML Algorithms: Built prediction models using linear regression; implemented foundational ML algorithms, including gradient descent for error reduction, Naïve Bayes classification, k-means clustering, reinforcement learning (Q-matrix, Bellman equation), and deep neural networks.
- Software Engineering and Network Basics: Learned commonly used command-line commands; identified key concepts of Java, asynchronous event-driven programming, and HTTP and client–server architecture; created applications using the Flask web framework and the Jinja templating language.
USC provided a comprehensive education in architecture with an emphasis on data-driven design, honing my ability to manage and analyze the complex datasets involved in building design and construction. Key learning outcomes relevant to data engineering include:
- BIM Data Management: Proficient in Building Information Modeling (BIM) using Revit, focusing on managing and analyzing large datasets generated by BIM processes.
- Data Analysis in Architecture: Skilled in extracting and manipulating BIM data to enhance project decision-making and optimize design and construction processes.
- Revit Proficiency: Expertise in using Revit for architectural design and data extraction, contributing to efficient project workflows and accurate data management.
This unique combination of skills positions me well for roles in data engineering, where technical proficiency and the ability to manage and interpret large datasets are crucial.
Projects
Tencent Headquarters BIM Data Pipeline Using Apache Airflow, AWS Services, and Autodesk Platform Services
github.com/siconge/Tencent-HQ-BIM-Data-Pipeline-with-AWS
This project delivers an end-to-end data pipeline solution designed to process BIM (Building Information Modeling) data from the Revit model of Tencent Global Headquarters in Shenzhen. Developed during the design and construction phases, the pipeline efficiently handles unitized curtain wall metadata to support data-driven decision-making.
The pipeline integrates upstream metadata extraction and pre-processing from Autodesk BIM 360 with downstream cloud-based storage, partitioning, schema cataloging, and analytics on AWS. Orchestrated by Apache Airflow, it leverages Autodesk Platform Services APIs, AWS CloudFormation, Amazon S3, AWS Glue, and Amazon Redshift to enable automated, scalable workflows for data extraction, transformation, and storage.
Tailored for design teams, technical consultants, and Tencent clients, the solution supports precise construction detail modifications, cost optimization, and regulatory compliance with standards for energy consumption and fire protection. Its modular and iterative BIM data processing approach adapts to evolving design requirements while preserving architectural integrity.
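A condensed sketch of the Glue transformation step described above, assuming placeholder S3 paths and a hypothetical `extracted_at` timestamp column carried through from the extraction step; the actual job script is in the repository.

```python
# Illustrative AWS Glue PySpark job: read raw metadata from S3, derive
# timestamp-based partitions, and write partitioned Parquet for Redshift
# Spectrum. Paths and column names are assumptions, not the repo's schema.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Standard Glue job boilerplate
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw panel metadata from the landing bucket (placeholder path)
df = glue_context.spark_session.read.json("s3://example-bim-raw/panels/")

# Timestamp-based partitioning: derive year/month from the assumed
# `extracted_at` column
df = (
    df.withColumn("extracted_at", F.to_timestamp("extracted_at"))
      .withColumn("year", F.year("extracted_at"))
      .withColumn("month", F.month("extracted_at"))
)

# Write partitioned Parquet that Redshift Spectrum can query once the
# schema is cataloged (placeholder curated bucket)
(
    df.write.mode("append")
      .partitionBy("year", "month")
      .parquet("s3://example-bim-curated/panels/")
)

job.commit()
```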
Real-time Transit Data Pipeline: ETL from MBTA to MySQL and CDC with MongoDB
github.com/siconge/Real-time-Transit-Data-Pipeline-MBTA-ETL-CDC
This project presents a comprehensive and integrated real-world data engineering pipeline that leverages real-time transit data from the Massachusetts Bay Transportation Authority (MBTA).
The pipeline demonstrates the use of Extract, Transform, Load (ETL) and Change Data Capture (CDC) processes to ensure real-time data ingestion and storage, as well as data synchronization across storage systems for efficient data replication and consistency. Utilizing a variety of data engineering tools and technologies, including Docker, Apache Airflow, MySQL, MongoDB, and Python MySQL Replication, the pipeline supports real-time data availability, event-driven architectures, and disaster recovery. This makes it an exemplary model for handling dynamic data in a production-like environment.
In addition to real-time data handling, the pipeline facilitates historical data analysis and visualization. This enables time-series analysis and trend detection to provide insights into transit patterns, informing decision-making and optimizing transit operations.
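The CDC leg can be sketched roughly as follows, using python-mysql-replication to tail the MySQL binlog and mirror row changes into MongoDB. Connection settings, the `id` primary key, and the database/collection names are illustrative assumptions; see the repository for the actual implementation.

```python
# Minimal CDC sketch: replicate MySQL row changes into MongoDB.
# Hosts, credentials, and names below are placeholders.
from pymongo import MongoClient
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent,
    UpdateRowsEvent,
    WriteRowsEvent,
)

mongo = MongoClient("mongodb://localhost:27017")
collection = mongo["mbta"]["vehicle_positions"]  # assumed target collection

stream = BinLogStreamReader(
    connection_settings={"host": "localhost", "port": 3306,
                         "user": "repl", "passwd": "secret"},
    server_id=100,             # must be unique among replication clients
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    blocking=True,             # keep listening for new binlog events
    resume_stream=True,
)

for event in stream:
    for row in event.rows:
        if isinstance(event, WriteRowsEvent):
            collection.insert_one(row["values"])
        elif isinstance(event, UpdateRowsEvent):
            # Assumes an `id` primary key on the source table
            collection.replace_one(
                {"id": row["before_values"]["id"]}, row["after_values"]
            )
        elif isinstance(event, DeleteRowsEvent):
            collection.delete_one({"id": row["values"]["id"]})
```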
ETL Processing and Time Series Analysis of MRTS Dataset
github.com/siconge/MRTS-ETL-Time-Series-Analysis
The Monthly Retail Trade Survey (MRTS) is conducted by the U.S. Census Bureau to gather data from retail businesses, providing insights into the retail sector’s performance. This data covers various aspects of retail, including sales and inventories.
This project has two primary goals: to perform ETL (Extract, Transform, Load) processing on the MRTS dataset using Python and powerful data transformation libraries like Pandas and SQLAlchemy, and to apply key time series analysis techniques, including trend analysis, percentage changes, and rolling time windows, to analyze the data of target businesses. The process involves using MySQL for data retrieval and Python for detailed data manipulation and visualization.
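A brief sketch of the analysis step, assuming hypothetical table and column names (the repository defines the actual schema): pull one category's monthly series from MySQL, then compute percentage changes and a rolling window.

```python
# Illustrative time-series pass over the loaded MRTS data: month-over-month
# change and a 12-month rolling mean for one assumed business category.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:pass@localhost/mrts")

df = pd.read_sql(
    "SELECT sales_date, sales FROM retail_sales "          # assumed table
    "WHERE category = 'Retail and food services sales, total' "
    "ORDER BY sales_date",
    engine,
    parse_dates=["sales_date"],
).set_index("sales_date")

df["pct_change"] = df["sales"].pct_change() * 100           # MoM change
df["rolling_12m"] = df["sales"].rolling(window=12).mean()   # smoothed trend

print(df.tail())
```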
Skills
- Programming Languages: Python (Pandas, NumPy, Matplotlib, PySpark, SQLAlchemy, Scikit-learn)
- Data Storage & Processing: SQL (MySQL), AWS (Glue, S3, Redshift, CloudFormation), NoSQL (MongoDB), Airflow, Spark, Docker, Azure, YAML Templates, Linux CLI
- Data Engineering & Analytics: Data Pipelines, ETL, Data Modeling, Analysis, Visualization, Debugging & Testing, CDC, IaC, Git, Machine Learning, Statistics
- Languages: English, Mandarin, Japanese
Certifications
- AWS Certified Solutions Architect - Associate
- California Licensed Architect