About Me
Data Engineer with a strong background in end-to-end data pipeline development, ETL automation, and cloud-based data solutions, transitioning from a technical role in Building Information Modeling (BIM) to enterprise data engineering. Proficient in Python, SQL, AWS, Airflow, and Spark, with hands-on experience in big data technologies, data warehousing, and API integration. Demonstrated ability to manage complex projects and lead cross-functional teams to deliver impactful solutions. Former AEC (Architecture, Engineering & Construction) professional with a decade of experience, now applying that domain expertise to building scalable data solutions.
Work History
Global architecture and design firm integrating data-driven methodologies for enhanced decision-making
- Designed and implemented an end-to-end data pipeline orchestrated by Apache Airflow to manage over 100,000 evolving Building Information Modeling (BIM) data points for 6,600 curtain wall units of Tencent’s 207,700 m² Global Headquarters, achieving a 75% reduction in manual effort and a 50% efficiency gain
- Automated ETL workflows integrating Autodesk Platform Services APIs for real-time metadata extraction, Pandas for data pre-processing, and the AWS SDK for Python (Boto3) to load raw data into Amazon S3 and trigger AWS Glue jobs (sketched below, after this list)
- Engineered advanced transformations in AWS Glue using PySpark scripts, including timestamp-based partitioning and schema cataloging, enabling seamless integration with Amazon Redshift Spectrum for downstream analytics
- Delivered actionable insights by leveraging Redshift Spectrum for querying and visualization, supporting construction detail modifications, cost optimization, and compliance with energy and fire safety standards
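For illustration, a minimal sketch of how such an Airflow orchestration might look. All bucket, job, and endpoint names here are hypothetical placeholders, not the production configuration; the actual DAG lives in the repository linked under Projects.

```python
# Simplified sketch: extract BIM metadata, stage it in S3, trigger a Glue job.
# Endpoint, bucket, and job names are placeholders for illustration only.
from datetime import datetime

import boto3
import pandas as pd
import requests
from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_metadata(**context):
    # Pull curtain-wall metadata from an Autodesk Platform Services endpoint
    # (URL and token handling are stubbed out here).
    resp = requests.get(
        "https://developer.api.autodesk.com/...",  # placeholder endpoint
        headers={"Authorization": "Bearer <token>"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # pushed to XCom; suitable for small payloads


def load_to_s3(**context):
    records = context["ti"].xcom_pull(task_ids="extract_metadata")
    df = pd.DataFrame(records)               # pre-process with Pandas
    df.to_csv("/tmp/panels.csv", index=False)
    boto3.client("s3").upload_file(           # load raw data into S3
        "/tmp/panels.csv", "example-bim-raw", "panels/panels.csv"
    )


def trigger_glue(**context):
    # Kick off the downstream PySpark transformation job
    boto3.client("glue").start_job_run(JobName="example-bim-transform")


with DAG(
    dag_id="bim_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_metadata", python_callable=extract_metadata)
    load = PythonOperator(task_id="load_to_s3", python_callable=load_to_s3)
    glue = PythonOperator(task_id="trigger_glue", python_callable=trigger_glue)
    extract >> load >> glue
```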
Architecture firm specializing in retail, mixed-use, residential, and large-scale urban planning
- Led architectural design and BIM documentation for 10+ large-scale urban projects, collaborating with clients and consultants. Explored Python automation to streamline workflows, sparking a transition into data engineering
Education
This program equipped me with the skills to advance my career in data engineering. Key learning outcomes include:
- Data Processing, Analysis, and Visualization: Proficient in Python and multiple libraries, including NumPy, Pandas, Matplotlib, PySpark, SQLAlchemy, PyMongo, Scikit-learn, SciPy, Dask, NLTK, Graphviz, and Paho.
- Database Design and Management: Proficient in SQL; wrote complex database queries; designed database schemas (data modeling); worked with common RDBMS and NoSQL databases; applied regular expressions.
- Database Containerization: Used Docker to create and manage images and containers; performed change data capture (CDC) across different types of databases, including MongoDB, Cassandra, Redis, and Firebase.
- ETL Processes and Data Pipelines: Performed extract, transform, and load (ETL) operations using Python; created ETL pipelines and orchestrated workflows using Apache Airflow.
- Big Data Handling: Processed big data using Spark and Hadoop; ran parallel operations using Dask; handled real-time streaming data using Mosquitto, ThingsBoard, and Kafka.
- Statistics Basics: Identified key concepts in statistics, including various types of probability distributions, the Central Limit Theorem, and correlation.
- AI/ML Algorithms: Built prediction models using linear regression; implemented foundational ML algorithms, including gradient descent for error reduction, Naïve Bayes classification, k-means clustering, reinforcement learning (Q-matrix, Bellman equation), and deep neural networks.
- Software Engineering and Network Basics: Learned commonly used command-line commands; identified key concepts of Java, asynchronous event-driven programming, and HTTP and client–server architecture; created applications using the Flask web framework and the Jinja templating language.
USC provided a comprehensive education in architecture with an emphasis on data-driven design, honing my ability to manage and analyze the complex datasets involved in building design and construction. Key learning outcomes relevant to data engineering include:
- BIM Data Management: Proficient in Building Information Modeling (BIM) using Revit, focusing on managing and analyzing large datasets generated by BIM processes.
- Data Analysis in Architecture: Skilled in extracting and manipulating BIM data to enhance project decision-making and optimize design and construction processes.
- Revit Proficiency: Expertise in using Revit for architectural design and data extraction, contributing to efficient project workflows and accurate data management.
This unique combination of skills positions me well for roles in data engineering, where technical proficiency and the ability to manage and interpret large datasets are crucial.
Projects
Tencent Headquarters BIM Data Pipeline Using Apache Airflow, AWS Services, and Autodesk Platform Services
github.com/siconge/Tencent-HQ-BIM-Data-Pipeline-with-AWS
This project delivers an end-to-end data pipeline solution designed to process BIM (Building Information Modeling) data from the Revit model of Tencent Global Headquarters in Shenzhen. Developed during the design and construction phases, the pipeline efficiently handles unitized curtain wall metadata to support data-driven decision-making.
The pipeline integrates upstream metadata extraction and pre-processing from Autodesk BIM 360 with downstream cloud-based storage, partitioning, schema cataloging, and analytics on AWS. Orchestrated by Apache Airflow, it leverages Autodesk Platform Services APIs, AWS CloudFormation, Amazon S3, AWS Glue, and Amazon Redshift to enable automated, scalable workflows for data extraction, transformation, and storage.
Tailored for design teams, technical consultants, and Tencent clients, the solution supports precise construction detail modifications, cost optimization, and regulatory compliance with standards for energy consumption and fire protection. Its modular and iterative BIM data processing approach adapts to evolving design requirements while preserving architectural integrity.
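A condensed sketch of the Glue transformation step described above, assuming placeholder S3 paths and a hypothetical `extracted_at` timestamp column carried through from the extraction step; the actual job script is in the repository.

```python
# Illustrative AWS Glue PySpark job: read raw metadata from S3, derive
# timestamp-based partitions, and write partitioned Parquet for Redshift
# Spectrum. Paths and column names are assumptions, not the repo's schema.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Standard Glue job boilerplate
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw panel metadata from the landing bucket (placeholder path)
df = glue_context.spark_session.read.json("s3://example-bim-raw/panels/")

# Timestamp-based partitioning: derive year/month from the assumed
# `extracted_at` column
df = (
    df.withColumn("extracted_at", F.to_timestamp("extracted_at"))
      .withColumn("year", F.year("extracted_at"))
      .withColumn("month", F.month("extracted_at"))
)

# Write partitioned Parquet that Redshift Spectrum can query once the
# schema is cataloged (placeholder curated bucket)
(
    df.write.mode("append")
      .partitionBy("year", "month")
      .parquet("s3://example-bim-curated/panels/")
)

job.commit()
```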
Real-time Transit Data Pipeline: ETL from MBTA to MySQL and CDC with MongoDB
github.com/siconge/Real-time-Transit-Data-Pipeline-MBTA-ETL-CDC
This project presents a comprehensive and integrated real-world data engineering pipeline that leverages real-time transit data from the Massachusetts Bay Transportation Authority (MBTA).
The pipeline demonstrates the use of Extract, Transform, Load (ETL) and Change Data Capture (CDC) processes to ensure real-time data ingestion and storage, as well as data synchronization across storage systems for efficient data replication and consistency. Utilizing a variety of data engineering tools and technologies, including Docker, Apache Airflow, MySQL, MongoDB, and Python MySQL Replication, the pipeline supports real-time data availability, event-driven architectures, and disaster recovery. This makes it an exemplary model for handling dynamic data in a production-like environment.
In addition to real-time data handling, the pipeline facilitates historical data analysis and visualization. This enables time-series analysis and trend detection to provide insights into transit patterns, informing decision-making and optimizing transit operations.
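The CDC leg can be sketched roughly as follows, using python-mysql-replication to tail the MySQL binlog and mirror row changes into MongoDB. Connection settings, the `id` primary key, and the database/collection names are illustrative assumptions; see the repository for the actual implementation.

```python
# Minimal CDC sketch: replicate MySQL row changes into MongoDB.
# Hosts, credentials, and names below are placeholders.
from pymongo import MongoClient
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent,
    UpdateRowsEvent,
    WriteRowsEvent,
)

mongo = MongoClient("mongodb://localhost:27017")
collection = mongo["mbta"]["vehicle_positions"]  # assumed target collection

stream = BinLogStreamReader(
    connection_settings={"host": "localhost", "port": 3306,
                         "user": "repl", "passwd": "secret"},
    server_id=100,             # must be unique among replication clients
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    blocking=True,             # keep listening for new binlog events
    resume_stream=True,
)

for event in stream:
    for row in event.rows:
        if isinstance(event, WriteRowsEvent):
            collection.insert_one(row["values"])
        elif isinstance(event, UpdateRowsEvent):
            # Assumes an `id` primary key on the source table
            collection.replace_one(
                {"id": row["before_values"]["id"]}, row["after_values"]
            )
        elif isinstance(event, DeleteRowsEvent):
            collection.delete_one({"id": row["values"]["id"]})
```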
ETL Processing and Time Series Analysis of MRTS Dataset
github.com/siconge/MRTS-ETL-Time-Series-Analysis
The Monthly Retail Trade Survey (MRTS) is conducted by the U.S. Census Bureau to gather data from retail businesses, providing insights into the retail sector’s performance. This data covers various aspects of retail, including sales and inventories.
This project has two primary goals: to perform ETL (Extract, Transform, Load) processing on the MRTS dataset using Python and powerful data transformation libraries like Pandas and SQLAlchemy, and to apply key time series analysis techniques, including trend analysis, percentage changes, and rolling time windows, to analyze the data of target businesses. The process involves using MySQL for data retrieval and Python for detailed data manipulation and visualization.
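A brief sketch of the analysis step, assuming hypothetical table and column names (the repository defines the actual schema): pull one category's monthly series from MySQL, then compute percentage changes and a rolling window.

```python
# Illustrative time-series pass over the loaded MRTS data: month-over-month
# change and a 12-month rolling mean for one assumed business category.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:pass@localhost/mrts")

df = pd.read_sql(
    "SELECT sales_date, sales FROM retail_sales "          # assumed table
    "WHERE category = 'Retail and food services sales, total' "
    "ORDER BY sales_date",
    engine,
    parse_dates=["sales_date"],
).set_index("sales_date")

df["pct_change"] = df["sales"].pct_change() * 100           # MoM change
df["rolling_12m"] = df["sales"].rolling(window=12).mean()   # smoothed trend

print(df.tail())
```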
Skills
- Programming Languages: Python (Pandas, NumPy, Matplotlib, PySpark, SQLAlchemy, Scikit-learn)
- Data Storage & Processing: SQL (MySQL), AWS (Glue, S3, Redshift, CloudFormation), NoSQL (MongoDB), Airflow, Spark, Docker, Azure, YAML Templates, Linux CLI
- Data Engineering & Analytics: Data Pipelines, ETL, Data Modeling, Analysis, Visualization, Debugging & Testing, CDC, IaC, Git, Machine Learning, Statistics
- Languages: English, Mandarin, Japanese
Certifications
- AWS Certified Solutions Architect - Associate
- California Licensed Architect