This repository contains curated Today I Learned (TIL) insights, case studies, and detailed project overviews from my work as a Data Scientist, focusing on reproducible workflows, applied analytics, data engineering, and production-grade systems.
Visit the individual case studies below to explore:
- how I refactor and optimize data applications (e.g., Shiny),
- how I build full ML pipelines,
- and how I benchmark performance across tools and environments.
Each entry includes technical explanations, code snippets, results, and lessons learned.
👤 Profile: https://github.com/philkleer
📄 LinkedIn: https://linkedin.com/in/philkleer
These projects represent my most relevant work as an applied Data Scientist, with a focus on production systems, reproducible analytics, and decision support.
- Redesigning an Application in Production: Instalometro na Conectividade na Saúde
  Designing, hardening, and shipping a production-ready data application with a focus on performance, reproducibility, and maintainability.
- Modularizing a Large Shiny Application (OBIA)
  Refactoring and hardening a national-scale analytics application, reducing code size by ~41% and introducing CI/CD, testing, and reproducibility.
- Leveling Up an Internal R Package for Team-Scale Use
  Productionizing an internal analytics package with versioned releases, CI/CD pipelines, and reproducible environments.
- Shiny Application – IT Governance (MGI)
  End-to-end development and deployment of a public-facing Shiny application to assess IT governance across national entities.
- Network Technology Analysis & Visualization
  Statistical analysis and visual storytelling to support technical and policy-oriented decision-making.
- End-to-End MLOps Pipeline
  Implementation of a production-like ML lifecycle with experiment tracking and data/model versioning.
Designing, hardening, and shipping a production-ready data application with a focus on performance, reproducibility, and maintainability.
Instalometro (inside the project Conectividade na Saúde) was an established application monitoring connection data of health institutions. However, as data volume, usage expectations, and operational requirements grew, the application slowed down, and the original setup revealed clear limitations in performance, scalability, and reproducibility.
This project documents the transition from a working prototype to a production-grade data application, emphasizing architectural decisions, data engineering practices, and infrastructure reliability rather than feature growth.
The goal was to deliver a maintainable, performant, and deployable system that could be confidently operated and handed over to a broader team.
- Large datasets causing slow startup times and high memory usage
- JSON-based APIs creating heavy payloads and long response times
- Unclear separation between data access and reactive application logic
- Fragile dependency restoration at runtime
- Slow and unreliable CI builds with compiled dependencies
- Need for measurable performance guarantees before production release
- Re-architected data access using Parquet-based pipelines instead of large JSON payloads
- Introduced Polars with lazy evaluation for scalable, on-demand data loading
- Clearly separated data engineering concerns from Shiny reactive logic
- Implemented load testing with `shinyloadtest` to validate performance under concurrent usage
- Designed deterministic, multi-stage Docker builds using `{renv}` for reproducible environments
- Migrated CI pipelines to Docker Buildx with registry-backed cache for faster, more reliable builds
- Verified CI/CD workflows and documented clean handover instructions for the team
Languages
- R
Frameworks & Libraries
- Shiny
- Polars
- Arrow
- DBI / dbplyr
Data & Storage
- PostgreSQL / PostGIS
- Parquet
Testing & Validation
- shinyloadtest
Reproducibility
- renv
CI/CD & Infrastructure
- GitLab CI
- Docker (multi-stage builds)
- Docker Buildx (registry-backed cache)
- Kubernetes
- Significantly reduced application startup time and memory footprint
- Improved scalability and predictability under multi-user load
- Reduced API payload sizes by an order of magnitude through Parquet exports
- Deterministic, reproducible builds independent of runtime package restoration
- Faster and more reliable CI pipelines
- Clear documentation enabling smooth handover and future maintenance
Most importantly, Instalometro na Conectividade na Saúde developed into an optimized data product ready for long-term operation.
This project prioritized architectural correctness and operational stability over rapid feature expansion.
Many performance issues were solved not through micro-optimizations, but through better placement of responsibilities between databases, data pipelines, and the Shiny application layer.
Several of the patterns developed here have since informed reusable tooling and shared infrastructure.
🔎 Detailed walkthrough: Case study — Redesigning an Application in Production
🔗 Live application: https://conectividadenasaude.nic.br
Refactoring and hardening a production-grade Shiny application for long-term maintainability, collaboration, and reliability.
This project documents the refactoring of a large, production Shiny application used in a national analytics context.
The original codebase had grown organically into a monolithic structure that was difficult to maintain, test, and extend.
The goal was to transform the application into a modular, testable, and reproducible system, suitable for multi-developer collaboration and continuous deployment.
- Refactored a monolithic Shiny application into a fully modular architecture
- Reduced total lines of code by ~41% while improving readability and extensibility
- Introduced automated testing, linting, and formatting standards
- Implemented reproducible dependency management using `renv`
- Set up CI/CD pipelines to ensure code quality and deployment safety
- Improved application performance and load behavior
- Languages: R
- Frameworks: Shiny, plotly
- Testing: testthat
- Reproducibility: renv
- CI/CD: GitLab CI
- Deployment: Docker, Kubernetes
- Significantly improved maintainability and onboarding for new contributors
- Enabled reliable multi-developer workflows
- Increased confidence in production releases through automated checks
- Established a reusable architectural pattern for future Shiny applications
This refactor prioritizes long-term sustainability over short-term feature additions and serves as a reference architecture for future analytical applications.
🔎 Detailed walkthrough:
🔗 Live application: https://obia.nic.br/s/indicadores
Standardizing, hardening, and productionizing an internal R package to support reproducible analytics, CI/CD, and multi-developer collaboration.
When joining a new team, I inherited an internal R package used to centralize shared analytical functionality across multiple products. While the package was already in use, it lacked standardization, clear role separation between users and contributors, and a reliable CI/CD and release process.
The goal of this project was to transform the package into a stable, versioned, and reproducible internal dependency, suitable for long-term maintenance and safe use across production systems.
- Standardized package structure, formatting, and development conventions across the entire codebase
- Introduced CI/CD pipelines to automate checks, builds, and versioned internal releases
- Established a clear separation between user-facing and contributor-facing logic and documentation
- Implemented reproducible dependency management using `renv`, compatible with multiple R versions
- Added unit testing with `testthat` and enforced code quality via formatting, linting, and pre-commit hooks
- Designed and implemented a safe versioning strategy to prevent breaking changes in dependent products
- Language: R
- Package tooling: testthat, roxygen2
- Reproducibility: renv, rig
- Code quality: Air (formatting), lintr (linting), pre-commit
- CI/CD: GitLab CI
- Distribution: pak, internal release artifacts
- Enabled versioned installation of the package, allowing teams to pin stable releases and avoid regressions
- Reduced onboarding time through clear README and CONTRIBUTING documentation
- Established reproducible builds with downloadable artifacts produced by the CI pipeline
- Improved development consistency across contributors and environments
- Made internal analytics workflows more reliable, scalable, and maintainable
This work turned the package from a loosely maintained codebase into a production-ready internal dependency, supporting both rapid development and long-term stability.
- Older products could safely continue using pinned package versions while new releases evolved independently
- The separation of users vs. contributors clarified responsibilities and reduced friction in collaboration
- The CI/CD setup now serves as a reference template for other internal R packages
🔎 Detailed walkthrough: CI/CD overhaul case study
Designing and deploying a production-grade analytical application to evaluate IT governance across national entities.
This project involved the development and deployment of a public-facing, production-grade R Shiny application designed to assess and analyze IT governance practices among national public-sector entities.
I was responsible for the entire application lifecycle, from data integration and analytical logic to visual design, automation, and deployment. The goal was to deliver a stable, maintainable, and transparent analytics platform that supports evidence-based evaluation and comparison.
- Developed a production-grade Shiny application covering the full analytical workflow
- Integrated and processed data from multiple heterogeneous sources
- Designed a consistent visual design system to ensure clarity, comparability, and usability
- Implemented CI/CD pipelines to automate testing, builds, and deployments
- Ensured application stability, maintainability, and reproducibility across environments
- Delivered a publicly accessible analytics portal for ongoing use and updates
- Language: R
- Frameworks: Shiny, ggplot2, ggiraph
- Data: Relational databases, structured datasets
- CI/CD: GitLab CI
- Deployment: Docker, Kubernetes
- Delivered a robust and maintainable analytics platform for assessing IT governance at national scale
- Enabled consistent and transparent comparison across entities
- Reduced operational overhead through automated deployment and quality checks
- Established a reusable blueprint for future public-sector analytical applications
🔗 Live application: https://obia.nic.br/s/indicadores-mgi
Statistical and exploratory analysis of network technologies with a focus on communication and decision support.
This project analyzes network technology data to identify patterns, quality indicators, and trends relevant for technical and policy-oriented audiences.
- Conducted exploratory and statistical analyses
- Applied regression-based methods where appropriate
- Translated analytical results into clear visual narratives
- Prepared presentation-ready outputs for non-technical stakeholders
- R
- tidyverse, ggplot2, brms
- reveal.js
- Quarto
- Supported evidence-based discussions on network technologies
- Improved accessibility of complex analytical results through visualization
🔎 Presentation at IX Forum 2025 (10min): Link to presentation
Implementing a production-like machine learning lifecycle with model tracking and versioning.
This project explores the design of an end-to-end MLOps workflow, covering model training, experiment tracking, data and model versioning, and reproducibility.
- Implemented experiment tracking with MLflow
- Versioned data and models using DVC
- Simulated a production-style model lifecycle
- Documented pipeline structure and design choices
- Python
- MLflow
- DVC
- Git
- Demonstrates practical understanding of MLOps concepts
- Provides a reference implementation for small-to-medium ML projects
- 2026-02-10 — Redesigning an Application in Production: Instalometro na Conectividade na Saúde
- 2026-01-05 — Case Study: School Detection from Satellite Imagery
- 2025-12-17 — How I build data-driven presentations with Quarto + revealjs (a real-world example)
- 2025-11-20 — Case Study: Benchmarking Shiny app performance across environments with `shinyloadtest`
- 2025-09-19 — Case Study: Debugging across multiple R versions with `rig` + `renv`
- 2025-09-14 — From ad‑hoc repo to versioned, CI‑driven R package: nicverso
- 2025-08-30 — R big data benchmarks: dplyr/duckplyr/polars & Postgres/DuckDB
- 2025-08-14 — Modularizing a Large Shiny App (R)
- 2026-02-28 — TIL: Refactoring a 200-Line SQL Query: Fewer CTEs, Fewer Scans
- 2026-02-24 — TIL: Scaling OSM-Based Weak Label Generation for Semantic Segmentation
- 2026-02-20 — 🧠 TIL: Speeding Up a Plumber API by Switching from JSON to Parquet
- 2026-02-05 — 🧠 TIL: Where to Run Data Operations (PostgreSQL vs R engines like DuckDB/Polars)
- 2026-01-25 — 🧠 TIL: Migrating to Docker Buildx with Registry Cache (and Why It’s Worth Showing)
- 2026-01-23 — 🧠 TIL: Making {renv} Work in a Multi-Stage Docker Build (Builder → Runtime)
- 2026-01-19 — 🧠 TIL: Shrinking Docker Images with Multi-Stage Builds (Builder + Runtime)
- 2026-01-15 — TIL: Using `ellmer`, `gander`, `chores`, and `ensure` to Draft R Docs + Tests with an Ollama Connection
- 2026-01-09 — TIL: Learning Window Functions in PostgreSQL (with Practical Examples)
- 2026-01-05 — TIL: Geographic train/test splits are essential for honest geospatial ML evaluation
- 2026-01-05 — TIL: OpenStreetMap is powerful weak supervision—but it teaches what is mapped, not what exists
- 2026-01-05 — TIL: Point labels are often better suited for site detection than for segmentation
Last updated: 2026-03-07 07:42 UTC
Ph.D.-trained Data Scientist with 8+ years of experience in quantitative analysis, statistical modeling, and applied data science. I specialize in building reproducible analytical workflows and production-grade data applications that support data-driven decision-making.
My work combines advanced statistical and Bayesian modeling, machine learning, and software engineering practices, with hands-on experience in R, Python, SQL, CI/CD, and containerized deployments. I focus on translating complex data into actionable insights through robust analysis, interactive dashboards, and clear analytical narratives.
Currently, I work as a Data Scientist at CEPTRO / NIC.br, where I develop and maintain analytical systems used to understand and monitor internet usage and network quality in Brazil. I collaborate in international and interdisciplinary teams and bring strong experience working across cultural and institutional contexts.
Github Profile: https://github.com/philkleer
LinkedIn: https://linkedin.com/in/philkleer
MIT (see LICENSE).