Data Science Notes & Project Portfolio

This repository contains curated Today I Learned (TIL) insights, case studies, and detailed project overviews from my work as a Data Scientist, focusing on reproducible workflows, applied analytics, data engineering, and production-grade systems.

Visit the individual case studies below to explore:

  • how I refactor and optimize data applications (e.g., Shiny),
  • how I build full ML pipelines,
  • and how I benchmark performance across tools and environments.

Each entry includes technical explanations, code snippets, results, and lessons learned.

👤 Profile: https://github.com/philkleer
📄 LinkedIn: https://linkedin.com/in/philkleer

Table of Contents

  1. ⭐ Featured Projects
  2. 📂 Detailed Projects
  3. 📚 Case Studies
  4. 🧠 Learning Notes (TIL)
  5. 🙋🏻‍♂️ About Me

Featured Projects

These projects represent my most relevant work as an applied Data Scientist, with a focus on production systems, reproducible analytics, and decision support.

  1. Redesigning an Application in Production: Instalometro na Conectividade na Saúde
    Designing, hardening, and shipping a production-ready data application with a focus on performance, reproducibility, and maintainability.

  2. Modularizing a Large Shiny Application (OBIA)
    Refactoring and hardening a national-scale analytics application, reducing code size by ~41% and introducing CI/CD, testing, and reproducibility.

  3. Leveling Up an Internal R Package for Team-Scale Use
    Productionizing an internal analytics package with versioned releases, CI/CD pipelines, and reproducible environments.

  4. Shiny Application – IT Governance (MGI)
    End-to-end development and deployment of a public-facing Shiny application to assess IT governance across national entities.

  5. Network Technology Analysis & Visualization
    Statistical analysis and visual storytelling to support technical and policy-oriented decision-making.

  6. End-to-End MLOps Pipeline
    Implementation of a production-like ML lifecycle with experiment tracking and data/model versioning.

Detailed Projects

⬇️ **Redesigning an Application in Production: _Instalometro na Conectividade na Saúde_**

Designing, hardening, and shipping a production-ready data application with a focus on performance, reproducibility, and maintainability.

Overview

Instalometro (part of the Conectividade na Saúde project) was an established application for monitoring connectivity data from health institutions. However, as data volume, usage expectations, and operational requirements grew over time, the original setup revealed clear limitations in performance, scalability, and reproducibility.

This project documents the transition from a working prototype to a production-grade data application, emphasizing architectural decisions, data engineering practices, and infrastructure reliability rather than feature growth.

The goal was to deliver a maintainable, performant, and deployable system that could be confidently operated and handed over to a broader team.


Key Challenges

  • Large datasets causing slow startup times and high memory usage
  • JSON-based APIs creating heavy payloads and long response times
  • Unclear separation between data access and reactive application logic
  • Fragile dependency restoration at runtime
  • Slow and unreliable CI builds with compiled dependencies
  • Need for measurable performance guarantees before production release

Key Contributions

  • Re-architected data access using Parquet-based pipelines instead of large JSON payloads
  • Introduced Polars with lazy evaluation for scalable, on-demand data loading
  • Clearly separated data engineering concerns from Shiny reactive logic
  • Implemented load testing with shinyloadtest to validate performance under concurrent usage
  • Designed deterministic, multi-stage Docker builds using {renv} for reproducible environments
  • Migrated CI pipelines to Docker Buildx with registry-backed cache for faster, more reliable builds
  • Verified CI/CD workflows and documented clean handover instructions for the team

Tech Stack

Languages

  • R

Frameworks & Libraries

  • Shiny
  • Polars
  • Arrow
  • DBI / dbplyr

Data & Storage

  • PostgreSQL / PostGIS
  • Parquet

Testing & Validation

  • shinyloadtest

Reproducibility

  • renv

CI/CD & Infrastructure

  • GitLab CI
  • Docker (multi-stage builds)
  • Docker Buildx (registry-backed cache)
  • Kubernetes

Results & Impact

  • Significantly reduced application startup time and memory footprint
  • Improved scalability and predictability under multi-user load
  • Reduced API payload sizes by an order of magnitude through Parquet exports
  • Deterministic, reproducible builds independent of runtime package restoration
  • Faster and more reliable CI pipelines
  • Clear documentation enabling smooth handover and future maintenance

Most importantly, Instalometro na Conectividade na Saúde developed into an optimized data product ready for long-term operation.


Notes

This project prioritized architectural correctness and operational stability over rapid feature expansion.

Many performance issues were solved not through micro-optimizations, but through better placement of responsibilities between databases, data pipelines, and the Shiny application layer.

Several of the patterns developed here have since informed reusable tooling and shared infrastructure.


🔎 Detailed walkthrough: Case study — Redesigning an Application in Production

🔗 Live application: https://conectividadenasaude.nic.br

⬇️ Modularizing a Large Shiny Application: Observatório de Inteligência Artificial (OBIA)

Refactoring and hardening a production-grade Shiny application for long-term maintainability, collaboration, and reliability.

Overview

This project documents the refactoring of a large, production Shiny application used in a national analytics context.

The original codebase had grown organically into a monolithic structure that was difficult to maintain, test, and extend.

The goal was to transform the application into a modular, testable, and reproducible system, suitable for multi-developer collaboration and continuous deployment.

Key Contributions

  • Refactored a monolithic Shiny application into a fully modular architecture
  • Reduced total lines of code by ~41% while improving readability and extensibility
  • Introduced automated testing, linting, and formatting standards
  • Implemented reproducible dependency management using renv
  • Set up CI/CD pipelines to ensure code quality and deployment safety
  • Improved application performance and load behavior

Tech Stack

  • Languages: R
  • Frameworks: Shiny, plotly
  • Testing: testthat
  • Reproducibility: renv
  • CI/CD: GitLab CI
  • Deployment: Docker, Kubernetes

Results & Impact

  • Significantly improved maintainability and onboarding for new contributors
  • Enabled reliable multi-developer workflows
  • Increased confidence in production releases through automated checks
  • Established a reusable architectural pattern for future Shiny applications

Notes

This refactor prioritizes long-term sustainability over short-term feature additions and serves as a reference architecture for future analytical applications.

🔎 Detailed walkthrough:

🔗 Live application:
https://obia.nic.br/s/indicadores

⬇️ Leveling Up an Internal R Package for Team-Scale Use

Standardizing, hardening, and productionizing an internal R package to support reproducible analytics, CI/CD, and multi-developer collaboration.

Overview

When joining a new team, I inherited an internal R package used to centralize shared analytical functionality across multiple products. While the package was already in use, it lacked standardization, clear role separation between users and contributors, and a reliable CI/CD and release process.

The goal of this project was to transform the package into a stable, versioned, and reproducible internal dependency, suitable for long-term maintenance and safe use across production systems.

Key Contributions

  • Standardized package structure, formatting, and development conventions across the entire codebase
  • Introduced CI/CD pipelines to automate checks, builds, and versioned internal releases
  • Established a clear separation between user-facing and contributor-facing logic and documentation
  • Implemented reproducible dependency management using renv, compatible with multiple R versions
  • Added unit testing with testthat and enforced code quality via formatting, linting, and pre-commit hooks
  • Designed and implemented a safe versioning strategy to prevent breaking changes in dependent products

Tech Stack

  • Language: R
  • Package tooling: testthat, roxygen2
  • Reproducibility: renv, rig
  • Code quality: Air (formatting), lintr (linting), pre-commit
  • CI/CD: GitLab CI
  • Distribution: pak, internal release artifacts

Results & Impact

  • Enabled versioned installation of the package, allowing teams to pin stable releases and avoid regressions
  • Reduced onboarding time through clear README and CONTRIBUTING documentation
  • Established reproducible builds with downloadable artifacts produced by the CI pipeline
  • Improved development consistency across contributors and environments
  • Made internal analytics workflows more reliable, scalable, and maintainable

This work turned the package from a loosely maintained codebase into a production-ready internal dependency, supporting both rapid development and long-term stability.

Notes

  • Older products could safely continue using pinned package versions while new releases evolved independently
  • The separation of users vs. contributors clarified responsibilities and reduced friction in collaboration
  • The CI/CD setup now serves as a reference template for other internal R packages

🔎 Detailed walkthrough:
CI/CD overhaul case study

⬇️ Shiny Application: Autodiagnóstico do Sistema de Administração dos Recursos de Tecnologia da Informação

Designing and deploying a production-grade analytical application to evaluate IT governance across national entities.

Overview

This project involved the development and deployment of a public-facing, production-grade R Shiny application designed to assess and analyze IT governance practices among national public-sector entities.

I was responsible for the entire application lifecycle, from data integration and analytical logic to visual design, automation, and deployment. The goal was to deliver a stable, maintainable, and transparent analytics platform that supports evidence-based evaluation and comparison.

Key Contributions

  • Developed a production-grade Shiny application covering the full analytical workflow
  • Integrated and processed data from multiple heterogeneous sources
  • Designed a consistent visual design system to ensure clarity, comparability, and usability
  • Implemented CI/CD pipelines to automate testing, builds, and deployments
  • Ensured application stability, maintainability, and reproducibility across environments
  • Delivered a publicly accessible analytics portal for ongoing use and updates

Tech Stack

  • Language: R
  • Frameworks: Shiny, ggplot2, ggiraph
  • Data: Relational databases, structured datasets
  • CI/CD: GitLab CI
  • Deployment: Docker, Kubernetes

Results & Impact

  • Delivered a robust and maintainable analytics platform for assessing IT governance at national scale
  • Enabled consistent and transparent comparison across entities
  • Reduced operational overhead through automated deployment and quality checks
  • Established a reusable blueprint for future public-sector analytical applications

Notes

🔗 Live application:
https://obia.nic.br/s/indicadores-mgi

⬇️ Network Technology Analysis & Visualization

Statistical and exploratory analysis of network technologies with a focus on communication and decision support.

Overview

This project analyzes network technology data to identify patterns, quality indicators, and trends relevant for technical and policy-oriented audiences.

Key Contributions

  • Conducted exploratory and statistical analyses
  • Applied regression-based methods where appropriate
  • Translated analytical results into clear visual narratives
  • Prepared presentation-ready outputs for non-technical stakeholders

Tech Stack

  • R
  • tidyverse, ggplot2, brms
  • revealJS
  • Quarto

Results & Impact

  • Supported evidence-based discussions on network technologies
  • Improved accessibility of complex analytical results through visualization

Notes

🔎 Presentation at IX Forum 2025 (10 min):
Link to presentation

⬇️ End-to-End MLOps Pipeline

Implementing a production-like machine learning lifecycle with model tracking and versioning.

Overview

This project explores the design of an end-to-end MLOps workflow, covering model training, experiment tracking, data and model versioning, and reproducibility.

Key Contributions

  • Implemented experiment tracking with MLflow
  • Versioned data and models using DVC
  • Simulated a production-style model lifecycle
  • Documented pipeline structure and design choices

Tech Stack

  • Python
  • MLflow
  • DVC
  • Git

Results & Impact

  • Demonstrates practical understanding of MLOps concepts
  • Provides a reference implementation for small-to-medium ML projects

Notes


Case studies

TIL: Latest Lessons

Last updated: 2026-03-07 07:42 UTC

About me

Ph.D.-trained Data Scientist with 8+ years of experience in quantitative analysis, statistical modeling, and applied data science. I specialize in building reproducible analytical workflows and production-grade data applications that support data-driven decision-making.

My work combines advanced statistical and Bayesian modeling, machine learning, and software engineering practices, with hands-on experience in R, Python, SQL, CI/CD, and containerized deployments. I focus on translating complex data into actionable insights through robust analysis, interactive dashboards, and clear analytical narratives.

Currently, I work as a Data Scientist at CEPTRO / NIC.br, where I develop and maintain analytical systems used to understand and monitor internet usage and network quality in Brazil. I collaborate in international and interdisciplinary teams and bring strong experience working across cultural and institutional contexts.

📄 CV

GitHub Profile: https://github.com/philkleer

LinkedIn: https://linkedin.com/in/philkleer

License

MIT (see LICENSE).
