This repository contains curated Today I Learned (TIL) insights, case studies, and detailed project overviews from my work as a Data Scientist, focusing on reproducible workflows, applied analytics, data engineering, and production-grade systems.
Visit the individual case studies below to explore:
- how I refactor and optimize data applications (e.g., Shiny),
- how I build full ML pipelines,
- and how I benchmark performance across tools and environments.
Each entry includes technical explanations, code snippets, results, and lessons learned.
👤 Profile: https://github.com/philkleer
📄 LinkedIn: https://linkedin.com/in/philkleer
These projects represent my most relevant work as an applied Data Scientist, with a focus on production systems, reproducible analytics, and decision support.
- Redesigning an Application in Production: Instalometro na Conectividade na Saúde
  Designing, hardening, and shipping a production-ready data application with a focus on performance, reproducibility, and maintainability.
- Modularizing a Large Shiny Application (OBIA)
  Refactoring and hardening a national-scale analytics application, reducing code size by ~41% and introducing CI/CD, testing, and reproducibility.
- Leveling Up an Internal R Package for Team-Scale Use
  Productionizing an internal analytics package with versioned releases, CI/CD pipelines, and reproducible environments.
- Shiny Application – IT Governance (MGI)
  End-to-end development and deployment of a public-facing Shiny application to assess IT governance across national entities.
- Network Technology Analysis & Visualization
  Statistical analysis and visual storytelling to support technical and policy-oriented decision-making.
- End-to-End MLOps Pipeline
  Implementation of a production-like ML lifecycle with experiment tracking and data/model versioning.
Designing, hardening, and shipping a production-ready data application with a focus on performance, reproducibility, and maintainability.
Instalometro (inside the project Conectividade na Saúde) was an established application monitoring connection data of health institutions. However, as data volume, usage expectations, and operational requirements grew, the application slowed down, and the original setup revealed clear limitations in performance, scalability, and reproducibility.
This project documents the transition from a working prototype to a production-grade data application, emphasizing architectural decisions, data engineering practices, and infrastructure reliability rather than feature growth.
The goal was to deliver a maintainable, performant, and deployable system that could be confidently operated and handed over to a broader team.
- Large datasets causing slow startup times and high memory usage
- JSON-based APIs creating heavy payloads and long response times
- Unclear separation between data access and reactive application logic
- Fragile dependency restoration at runtime
- Slow and unreliable CI builds with compiled dependencies
- Need for measurable performance guarantees before production release
- Re-architected data access using Parquet-based pipelines instead of large JSON payloads
- Introduced Polars with lazy evaluation for scalable, on-demand data loading
- Clearly separated data engineering concerns from Shiny reactive logic
- Implemented load testing with `shinyloadtest` to validate performance under concurrent usage
- Designed deterministic, multi-stage Docker builds using `{renv}` for reproducible environments
- Migrated CI pipelines to Docker Buildx with registry-backed cache for faster, more reliable builds
- Verified CI/CD workflows and documented clean handover instructions for the team
Languages
- R
Frameworks & Libraries
- Shiny
- Polars
- Arrow
- DBI / dbplyr
Data & Storage
- PostgreSQL / PostGIS
- Parquet
Testing & Validation
- shinyloadtest
Reproducibility
- renv
CI/CD & Infrastructure
- GitLab CI
- Docker (multi-stage builds)
- Docker Buildx (registry-backed cache)
- Kubernetes
- Significantly reduced application startup time and memory footprint
- Improved scalability and predictability under multi-user load
- Reduced API payload sizes by an order of magnitude through Parquet exports
- Deterministic, reproducible builds independent of runtime package restoration
- Faster and more reliable CI pipelines
- Clear documentation enabling smooth handover and future maintenance
Most importantly, Instalometro na Conectividade na Saúde developed into an optimized data product ready for long-term operation.
This project prioritized architectural correctness and operational stability over rapid feature expansion.
Many performance issues were solved not through micro-optimizations, but through better placement of responsibilities between databases, data pipelines, and the Shiny application layer.
Several of the patterns developed here have since informed reusable tooling and shared infrastructure.
🔎 Detailed walkthrough: Case study — Redesigning an Application in Production
🔗 Live application: https://conectividadenasaude.nic.br
Refactoring and hardening a production-grade Shiny application for long-term maintainability, collaboration, and reliability.
This project documents the refactoring of a large, production Shiny application used in a national analytics context.
The original codebase had grown organically into a monolithic structure that was difficult to maintain, test, and extend.
The goal was to transform the application into a modular, testable, and reproducible system, suitable for multi-developer collaboration and continuous deployment.
- Refactored a monolithic Shiny application into a fully modular architecture
- Reduced total lines of code by ~41% while improving readability and extensibility
- Introduced automated testing, linting, and formatting standards
- Implemented reproducible dependency management using `renv`
- Set up CI/CD pipelines to ensure code quality and deployment safety
- Improved application performance and load behavior
- Languages: R
- Frameworks: Shiny, plotly
- Testing: testthat
- Reproducibility: renv
- CI/CD: GitLab CI
- Deployment: Docker, Kubernetes
- Significantly improved maintainability and onboarding for new contributors
- Enabled reliable multi-developer workflows
- Increased confidence in production releases through automated checks
- Established a reusable architectural pattern for future Shiny applications
This refactor prioritizes long-term sustainability over short-term feature additions and serves as a reference architecture for future analytical applications.
🔎 Detailed walkthrough:
🔗 Live application: https://obia.nic.br/s/indicadores
Standardizing, hardening, and productionizing an internal R package to support reproducible analytics, CI/CD, and multi-developer collaboration.
When joining a new team, I inherited an internal R package used to centralize shared analytical functionality across multiple products. While the package was already in use, it lacked standardization, clear role separation between users and contributors, and a reliable CI/CD and release process.
The goal of this project was to transform the package into a stable, versioned, and reproducible internal dependency, suitable for long-term maintenance and safe use across production systems.
- Standardized package structure, formatting, and development conventions across the entire codebase
- Introduced CI/CD pipelines to automate checks, builds, and versioned internal releases
- Established a clear separation between user-facing and contributor-facing logic and documentation
- Implemented reproducible dependency management using `renv`, compatible with multiple R versions
- Added unit testing with `testthat` and enforced code quality via formatting, linting, and pre-commit hooks
- Designed and implemented a safe versioning strategy to prevent breaking changes in dependent products
- Language: R
- Package tooling: testthat, roxygen2
- Reproducibility: renv, rig
- Code quality: Air (formatting), lintr (linting), pre-commit
- CI/CD: GitLab CI
- Distribution: pak, internal release artifacts
- Enabled versioned installation of the package, allowing teams to pin stable releases and avoid regressions
- Reduced onboarding time through clear README and CONTRIBUTING documentation
- Established reproducible builds with downloadable artifacts produced by the CI pipeline
- Improved development consistency across contributors and environments
- Made internal analytics workflows more reliable, scalable, and maintainable
This work turned the package from a loosely maintained codebase into a production-ready internal dependency, supporting both rapid development and long-term stability.
- Older products could safely continue using pinned package versions while new releases evolved independently
- The separation of users vs. contributors clarified responsibilities and reduced friction in collaboration
- The CI/CD setup now serves as a reference template for other internal R packages
🔎 Detailed walkthrough: CI/CD overhaul case study
Designing and deploying a production-grade analytical application to evaluate IT governance across national entities.
This project involved the development and deployment of a public-facing, production-grade R Shiny application designed to assess and analyze IT governance practices among national public-sector entities.
I was responsible for the entire application lifecycle, from data integration and analytical logic to visual design, automation, and deployment. The goal was to deliver a stable, maintainable, and transparent analytics platform that supports evidence-based evaluation and comparison.
- Developed a production-grade Shiny application covering the full analytical workflow
- Integrated and processed data from multiple heterogeneous sources
- Designed a consistent visual design system to ensure clarity, comparability, and usability
- Implemented CI/CD pipelines to automate testing, builds, and deployments
- Ensured application stability, maintainability, and reproducibility across environments
- Delivered a publicly accessible analytics portal for ongoing use and updates
- Language: R
- Frameworks: Shiny, ggplot2, ggiraph
- Data: Relational databases, structured datasets
- CI/CD: GitLab CI
- Deployment: Docker, Kubernetes
- Delivered a robust and maintainable analytics platform for assessing IT governance at national scale
- Enabled consistent and transparent comparison across entities
- Reduced operational overhead through automated deployment and quality checks
- Established a reusable blueprint for future public-sector analytical applications
🔗 Live application: https://obia.nic.br/s/indicadores-mgi
Statistical and exploratory analysis of network technologies with a focus on communication and decision support.
This project analyzes network technology data to identify patterns, quality indicators, and trends relevant for technical and policy-oriented audiences.
- Conducted exploratory and statistical analyses
- Applied regression-based methods where appropriate
- Translated analytical results into clear visual narratives
- Prepared presentation-ready outputs for non-technical stakeholders
- R
- tidyverse, ggplot2, brms
- reveal.js
- Quarto
- Supported evidence-based discussions on network technologies
- Improved accessibility of complex analytical results through visualization
🔎 Presentation at IX Forum 2025 (10min): Link to presentation
Implementing a production-like machine learning lifecycle with model tracking and versioning.
This project explores the design of an end-to-end MLOps workflow, covering model training, experiment tracking, data and model versioning, and reproducibility.
- Implemented experiment tracking with MLflow
- Versioned data and models using DVC
- Simulated a production-style model lifecycle
- Documented pipeline structure and design choices
- Python
- MLflow
- DVC
- Git
- Demonstrates practical understanding of MLOps concepts
- Provides a reference implementation for small-to-medium ML projects
- 2026-02-10 — Redesigning an Application in Production: Instalometro na Conectividade na Saúde
- 2026-01-05 — Case Study: School Detection from Satellite Imagery
- 2025-12-17 — How I build data-driven presentations with Quarto + revealjs (a real-world example)
- 2025-11-20 — Case Study: Benchmarking Shiny app performance across environments with `shinyloadtest`
- 2025-09-19 — Case Study: Debugging across multiple R versions with `rig` + `renv`
- 2025-09-14 — From ad‑hoc repo to versioned, CI‑driven R package: nicverso
- 2025-08-30 — R big data benchmarks: dplyr/duckplyr/polars & Postgres/DuckDB
- 2025-08-14 — Modularizing a Large Shiny App (R)
- 2026-02-28 — TIL: Refactoring a 200-Line SQL Query: Fewer CTEs, Fewer Scans
- 2026-02-24 — TIL: Scaling OSM-Based Weak Label Generation for Semantic Segmentation
- 2026-02-20 — 🧠 TIL: Speeding Up a Plumber API by Switching from JSON to Parquet
- 2026-02-05 — 🧠 TIL: Where to Run Data Operations (PostgreSQL vs R engines like DuckDB/Polars)
- 2026-01-25 — 🧠 TIL: Migrating to Docker Buildx with Registry Cache (and Why It’s Worth Showing)
- 2026-01-23 — 🧠 TIL: Making {renv} Work in a Multi-Stage Docker Build (Builder → Runtime)
- 2026-01-19 — 🧠 TIL: Shrinking Docker Images with Multi-Stage Builds (Builder + Runtime)
- 2026-01-15 — TIL: Using `ellmer`, `gander`, `chores`, and `ensure` to Draft R Docs + Tests with an Ollama Connection
- 2026-01-09 — TIL: Learning Window Functions in PostgreSQL (with Practical Examples)
- 2026-01-05 — TIL: Geographic train/test splits are essential for honest geospatial ML evaluation
- 2026-01-05 — TIL: OpenStreetMap is powerful weak supervision—but it teaches what is mapped, not what exists
- 2026-01-05 — TIL: Point labels are often better suited for site detection than for segmentation
Last updated: 2026-03-07 07:42 UTC
Ph.D.-trained Data Scientist with 8+ years of experience in quantitative analysis, statistical modeling, and applied data science. I specialize in building reproducible analytical workflows and production-grade data applications that support data-driven decision-making.
My work combines advanced statistical and Bayesian modeling, machine learning, and software engineering practices, with hands-on experience in R, Python, SQL, CI/CD, and containerized deployments. I focus on translating complex data into actionable insights through robust analysis, interactive dashboards, and clear analytical narratives.
Currently, I work as a Data Scientist at CEPTRO / NIC.br, where I develop and maintain analytical systems used to understand and monitor internet usage and network quality in Brazil. I collaborate in international and interdisciplinary teams and bring strong experience working across cultural and institutional contexts.
Github Profile: https://github.com/philkleer
LinkedIn: https://linkedin.com/in/philkleer
MIT (see LICENSE).