Skip to content
View traceyho59's full-sized avatar

Block or report traceyho59

Block user

Prevent this user from interacting with your repositories and sending you notifications. Learn more about blocking users.

You must be logged in to block users.

Maximum 250 characters. Please don't include any personal information such as legal names or email addresses. Markdown supported. This note will be visible to only you.
Report abuse

Contact GitHub support about this user’s behavior. Learn more about reporting abuse.

Report abuse
traceyho59/README.md

Hi, I'm Tracey Ho πŸ‘‹

MS Applied Analytics @ Columbia University (GPA 3.92) | Data Engineer | Machine Learning Engineer

Currently: Data Analyst @ Yondr | Previously: Management Consulting @ Protiviti, Operations @ Amazon

πŸš€ Building data pipelines, ML models, and analytics platforms that solve real-world business problems

πŸ“ New York, NY | πŸ“§ tracey.ho@columbia.edu | πŸ”— LinkedIn


πŸ’Ό What I Do

I bridge data engineering and data science - building scalable pipelines that process millions of records, then extracting insights through ML and analytics. My sweet spot is turning messy data into actionable business intelligence.

Recent highlights:

  • πŸ—οΈ Yondr: Centralizing fragmented data systems with Python/SQL ETL workflows, building unified dashboards for COO and CFO
  • πŸ” Protiviti: Built Python-based cryptocurrency risk detection system (pending patent), analyzed $30M-$1B securitizations
  • πŸ“¦ Amazon: Led 100+ employees to 60K+ packages/day, saved $25M through robotics optimization

πŸš€ Featured Projects

Full-stack ML solution for financial fraud detection combining Random Forest (94% AUC), Isolation Forest for anomaly detection, and market volatility analysis. Discovered 15% fraud increase during market turbulence.

Impact: Catch 38% of fraud at 5% hit rate β†’ save ~$300K/month per 1M transactions

Tech Stack: R Random Forest Isolation Forest SMOTE PCA Statistical Analysis Fintech


End-to-end customer analytics platform for JOOLA (premium pickleball brand) analyzing 1.9M transactions to predict churn and optimize retention. Built XGBoost model (AUC 0.85) identifying key churn drivers, discovered 66% one-time buyer problem, and found 110K monthly searches in untapped keyword opportunity.

Key insights: 3 premium SKUs drive 50%+ profit (concentration risk), high-spend customers show 2.1x retention vs discount users, court-related keywords have low competition at $0.86-$3.69 CPC

Tech Stack: Python XGBoost Logistic Regression Pandas Streamlit Tableau RFM Analysis Customer Segmentation

Dashboard Demo


End-to-end database system analyzing 2,133 luxury cosmetics pop-up events across global markets. Built normalized 3NF PostgreSQL schema, Python ETL pipeline, and Metabase dashboards tracking revenue efficiency and brand performance for Chanel, Dior, YSL, and 21 other luxury brands.

Key insights: Asia leads with 74% sell-through rate, Special Launch events outperform by 5%, Tokyo generates highest revenue efficiency

Tech Stack: PostgreSQL Python Pandas SQLAlchemy Metabase ETL Database Design 3NF Normalization


Cloud-native ML system analyzing 50,000 Amazon toy reviews to identify products with hidden quality issues. Built sentiment classifier (91% accuracy) using TF-IDF + Logistic Regression, deployed on AWS (S3, Lambda, EC2, RDS). Flask dashboard reveals products where high star ratings mask negative sentiment.

Key insight: Products with large rating-sentiment gaps signal quality concerns invisible to traditional metrics

Tech Stack: AWS S3 Lambda EC2 RDS Python Flask Scikit-learn TF-IDF PostgreSQL NLP

Watch Demo


Production-scale data engineering pipeline processing 22M NYC taxi trips (2.13GB) with Spark + PostgreSQL. Built Streamlit dashboard delivering real-time revenue recommendations for drivers competing with Uber.

Key findings: Optimal 5-10 mile trips, 2 AM peak tipping (30% rate), JFK generates $58M revenue (24% of total)

Tech Stack: Python Apache Spark PostgreSQL Streamlit Geospatial Analysis ETL Docker


Machine learning model predicting Click-Through Rates using XGBoost with hyperparameter tuning. Achieved Test RMSE of 0.059, demonstrating feature engineering and model optimization.

Business value: Optimize digital ad spend by targeting high-conversion audience segments

Tech Stack: R XGBoost Hyperparameter Tuning Marketing Analytics Predictive Modeling


Real estate analytics examining crime, education, and walkability factors affecting rent prices across NYC boroughs. Provides data-driven insights for renters and investors.

Tech Stack: Python Pandas Data Visualization Exploratory Data Analysis


Statistical research quantifying the impact of 15-minute workplace naps on productivity and energy levels. Demonstrates experimental design, power analysis, and hypothesis testing.

Tech Stack: R Statistical Analysis Hypothesis Testing Power Analysis Research Design


πŸ› οΈ Technical Skills

Data Engineering

Apache Spark PostgreSQL MongoDB ETL Pipelines PySpark SQL APIs Docker Cloud Integration

Data Science & ML

Python R Scikit-learn XGBoost Random Forest Predictive Modeling Feature Engineering Statistical Analysis

Visualization & BI

Streamlit Tableau Metabase PowerBI Plotly ggplot2 Matplotlib Interactive Dashboards Bootstrap UI

Cloud & Dev Tools

AWS Azure Git VS Code Agile Development Flask Jupyter RStudio Lambda EC2 DynamoDB

Domain Expertise

Financial Risk Modeling Fraud Detection Transportation Analytics Marketing Analytics Operations Optimization


πŸ“ˆ Experience Highlights

Data Engineering

  • Built ETL pipelines processing 22M+ records with Spark, PostgreSQL, MongoDB
  • Centralized fragmented data systems at Yondr with Python/SQL workflows
  • Designed scalable data models connecting sales, supply chain, and finance

Machine Learning

  • Developed fraud detection models (Random Forest 94% AUC, Isolation Forest)
  • Built cryptocurrency risk detection system (pending patent at Protiviti)
  • Created predictive models (XGBoost, Logistic Regression) for customer conversion

Analytics & BI

  • Built interactive dashboards (Streamlit, Tableau) for C-suite executives
  • Conducted lender due diligence on $30M-$1B securitizations (CLO, auto, student loans)
  • Analyzed 1.9M records for customer conversion & retention at Joola

Operations Leadership

  • Managed 100+ employees at Amazon fulfillment center
  • Delivered 60K+ packages/day (record-breaking) while reducing costs by $237K
  • Optimized robotics workflows saving $25M and increasing productivity 30%

πŸŽ“ Education

Columbia University | Master of Science, Applied Analytics | GPA: 4.0 / 4.0 | 2024-2025

  • Coursework: AI, Cloud Computing, Machine Learning, Blockchain

UT Austin | Bachelor of Business Administration, Finance | GPA: 3.62 / 4.0 | 2020-2024

  • Full-Ride Scholarship ($300K+) | SAT: 1520/1600

πŸ† What Sets Me Apart

1. Full-Stack Data Skills

  • Not just analytics or engineering - I do both
  • Build the pipeline (Spark ETL) β†’ Train the model (ML) β†’ Ship the product (Dashboard)

2. Business Impact Focus

  • Every project has measurable outcomes: $300K/month fraud savings, $25M cost reduction, 30% productivity gain
  • I translate data science into executive-friendly insights

3. Production Experience

  • Internships at Yondr, Protiviti, Amazon - not just academic projects
  • Built systems handling real money ($1B securitizations), real scale (22M trips), real users (100+ employees)

4. Rapid Learner

  • 4.0 GPA at Columbia in technical MS program
  • Full-ride scholarship at UT Austin (top 1% of applicants)
  • Self-taught: Blockchain analytics, patent-pending crypto risk detection

πŸ“« Let's Connect!

LinkedIn Email GitHub

πŸ“§ Email: tracey.ho@columbia.edu
πŸ“ Location: New York, NY


🌟 Fun Facts

πŸ•Ί Breakdancer | 🌍 Traveling to the Seven Wonders | πŸ”οΈ Outdoor enthusiast | 🀝 Volunteering

πŸ—£οΈ Languages: Native Vietnamese | Basic Spanish, Mandarin, Korean, German


πŸ’‘ "I don't just analyze data - I build systems that turn data into competitive advantage."

Popular repositories Loading

  1. ctr-prediction-xgboost ctr-prediction-xgboost Public

    Machine learning project predicting Click-Through Rates using XGBoost with hyperparameter tuning (Test RMSE: 0.059)

    R

  2. nyc-rent-analysis nyc-rent-analysis Public

    NYC rent price analysis examining crime, education, and walkability factors | Columbia University

  3. workplace-napping-productivity workplace-napping-productivity Public

    A data science approach to employee wellbeing - using statistical simulation and power analysis to quantify the impact of 15-minute naps on productivity and energy levels.

    HTML

  4. fraud-detection-market-analysis fraud-detection-market-analysis Public

    A comprehensive machine learning approach to fraud detection combining unsupervised clustering, supervised classification, and financial market analysis to identify fraudulent transactions in real-…

    R

  5. nyc-taxi-revenue-optimization nyc-taxi-revenue-optimization Public

    A comprehensive data engineering and analytics solution that empowers NYC yellow taxi drivers to compete with rideshare giants by leveraging 22+ million trip records to optimize revenue, identify h…

    Jupyter Notebook

  6. traceyho59 traceyho59 Public