Pulso Electoral Colombia 2026

Social listening and NLP for electoral discourse analysis -- built as a capability demonstration for the CIVICUS Digital Democracy Initiative.

Colombia's 2026 election cycle is unfolding under conditions of rising polarization, coordinated digital manipulation, and threats to civic space. Monitoring how these dynamics play out in online discourse requires social listening infrastructure that works with Latin American Spanish, handles the nuances of Colombian political culture, and produces research-grade outputs -- not marketing dashboards.

This project demonstrates that capability. Built in approximately one week, it collects live data from three sources (RSS, GDELT, ACLED), processes it through NLP models trained on Latin American Spanish, and delivers reproducible analytical notebooks that follow a Signal-Insight-Action-Outcome framework designed for research teams. The project includes 32,488 collected records, 57 passing tests, and an interactive Streamlit dashboard for exploration.

What This Demo Demonstrates

Everything here is real and running -- not mockups, not planned features.

What is working	What it produces
3-source data pipeline (RSS, GDELT, ACLED)	32,488 normalized, deduplicated records across news and event databases
Sentiment analysis with `pysentimiento`	Polarity and emotion scores calibrated to Latin American Spanish (trained on ~500M tweets)
Named entity recognition with `spaCy`	Politicians, organizations, and locations extracted from Spanish text
Topic modeling with sentence-transformers	Emerging narrative clusters detected without predefined categories
Emotion detection and hate speech analysis	Fine-grained emotional signals and toxicity scoring via pysentimiento
Colombian slang normalization	Culturally-aware preprocessing for terms like bodega, mermelada, castrochavismo
Language detection and filtering	Automatic identification and filtering of non-Spanish content
9 numbered notebooks across 5 stages	Reproducible analytical story from raw data to findings
Interactive Streamlit dashboard (6 pages)	Overview, sentiment thermometer, anomaly detection, geographic map, platform comparison, data explorer -- with real-time data freshness indicator
Config-driven keyword monitoring	5 YAML configs: election, manipulation, civic space, political figures, Colombian slang
Tool comparison analysis	Brandwatch vs. open-source evaluation with cost, language, and sustainability criteria
Signal-Insight-Action-Outcome templates	Structured handoff format between data analyst and domain expert
Data ethics and compliance documentation	Data minimization, no-model-training policy, source-specific compliance
57 tests with full CI/CD	pytest + ruff + mypy + commitizen, all passing in GitHub Actions
Zero-cost infrastructure	All sources free, storage embedded (DuckDB), runs on a laptop
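
As an illustration of the slang normalization row above, here is a minimal sketch of dictionary-based tagging. The names `SLANG_TAGS`, `strip_accents`, and `tag_slang` are hypothetical and are not taken from the project's code; the three terms and their glosses follow this README's own descriptions.

```python
import re
import unicodedata

# Hypothetical mapping from slang term to an analytical tag.
# Glosses follow the README: bodega = troll farm, mermelada = patronage,
# castrochavismo = polarization signal.
SLANG_TAGS = {
    "bodega": "troll_farm",
    "mermelada": "patronage",
    "castrochavismo": "polarization_signal",
}


def strip_accents(text: str) -> str:
    """Lowercase and drop diacritics (this also folds n-tilde to n; used for matching only)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def tag_slang(text: str) -> list[str]:
    """Return sorted tags for slang terms found as whole words, singular or plural."""
    normalized = strip_accents(text)
    found = set()
    for term, tag in SLANG_TAGS.items():
        # Optional trailing "s" covers simple plurals such as "bodegas".
        if re.search(rf"\b{re.escape(term)}s?\b", normalized):
            found.add(tag)
    return sorted(found)


print(tag_slang("Las bodegas digitales reparten mermelada"))
```

A plain term match cannot tell the political sense of bodega from its everyday meaning (a shop or warehouse), so a tag produced this way is best treated as a candidate signal for later review rather than a finding.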

Capability-to-Engagement Mapping

Each capability demonstrated here enables a specific part of the DDI consulting engagement. This table maps what the demo proves to what it means for the actual work.

Demonstrated capability	What this enables in a full engagement
Multi-source collection pipeline	Rapid onboarding of additional sources (Telegram, Bluesky, X) -- each new source is one fetch function and one normalize function; downstream analysis does not change
Latin American Spanish NLP	Accurate sentiment and emotion detection for Colombian discourse, extensible to other Global South languages via AfriSenti and XLM-RoBERTa
Config-driven keyword system	Non-technical team members adjust monitoring scope by editing YAML files, no code changes required
Colombian political slang normalization	Culturally-aware preprocessing that captures terms like bodega (troll farm), mermelada (patronage), and castrochavismo (polarization signal) -- missed by generic Spanish models
ACLED + GDELT integration	Cross-referencing online sentiment with real-world protest, violence, and political events for early warning detection
Reproducible notebook workflow	Every analytical claim is auditable, re-runnable, and documented -- critical for research credibility and funder reporting
Interactive dashboard with anomaly detection	Visual exploration of sentiment, volume spikes, geographic patterns, and cross-platform comparison -- functional prototype of the monitoring interface a full engagement would deliver
DuckDB embedded storage	Zero-infrastructure deployment; migrates to MotherDuck (cloud), PostgreSQL, or BigQuery when team size and access needs require it
Open-source stack at $0/year	Sustainable after contract ends -- CIVICUS keeps everything, no license expiration
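
To show how config-driven keyword monitoring can work without code changes, here is a minimal sketch. The dictionary stands in for the result of parsing one keyword YAML file (for example with PyYAML's `yaml.safe_load`); the `topic` and `keywords` keys are an assumed layout, not the project's actual schema, and `compile_keywords` and `count_matches` are hypothetical helpers. The three phrases are the manipulation keywords named in this README.

```python
import re
from collections import Counter

# Stand-in for a parsed keyword file. The key names are assumptions.
MANIPULATION_CONFIG = {
    "topic": "manipulation",
    "keywords": ["bodegas digitales", "cuentas falsas", "granjas de trolls"],
}


def compile_keywords(config: dict) -> re.Pattern:
    """Build one case-insensitive pattern that matches any configured phrase."""
    # Longest phrases first so overlapping phrases prefer the longer match.
    phrases = sorted(config["keywords"], key=len, reverse=True)
    alternation = "|".join(re.escape(phrase) for phrase in phrases)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


def count_matches(texts: list[str], config: dict) -> Counter:
    """Count how often each configured phrase occurs across the texts."""
    pattern = compile_keywords(config)
    counts: Counter = Counter()
    for text in texts:
        for match in pattern.findall(text):
            counts[match.lower()] += 1
    return counts


sample = [
    "Denuncian Bodegas Digitales y cuentas falsas",
    "Aparecen más cuentas falsas",
]
print(count_matches(sample, MANIPULATION_CONFIG))
```

Because the matcher is built entirely from the config, adding or removing a phrase in the file changes what is monitored the next time the function runs.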

Thematic Coverage

The CIVICUS DDI Terms of Reference identify three thematic areas. Here is what this demo covers for each and how a funded engagement would extend it.

ToR thematic area	What is demonstrated here	How we would apply it in the engagement
Algorithmic Bias and Polarisation	Sentiment distribution by platform; Colombian slang configs that flag polarization signals (mamerto, paraco, tibio); cross-source comparison of tone	Longitudinal narrative tracking, visibility asymmetry analysis across platforms, coordinated amplification detection
Digital Manipulation During Elections	Keyword configs for bodegas digitales, cuentas falsas, granjas de trolls; hate speech detection via pysentimiento; topic clustering that surfaces emerging manipulation narratives	Network analysis with coordination detection (NetworkX + Louvain), bot scoring heuristics, real-time alerting on anomalous narrative velocity
Early Warning Indicators	ACLED event data correlated with online sentiment shifts; Signal-Insight-Action-Outcome framework for structured analytical handoff	Anomaly detection on narrative velocity, automated threshold alerts, ACLED + GDELT + social media triangulation for early warning dashboards

Demo vs. Full Engagement

This demo is a proof of concept, not a finished product. Here is what changes with funding and a proper engagement timeline.

Dimension	This demo	A funded engagement adds
Data sources	3 (RSS, GDELT, ACLED) -- 32,488 records	Telegram, Bluesky, X/Twitter, Facebook (via the Meta Content Library or equivalent; CrowdTangle was shut down in August 2024), Reddit (deferred due to API delays)
Collection mode	Manual notebook runs	Automated scheduled collection (cron/Airflow)
Analysis depth	Sentiment, NER, topic clustering, emotion detection, hate speech analysis, anomaly detection	Network analysis, coordination detection, visibility asymmetry, bot scoring
Geographic scope	Colombia only	Multi-country (methodology portable via config changes and model swaps)
NLP languages	Latin American Spanish	Additional Global South languages via AfriSenti, XLM-RoBERTa, NLLB
Storage	Local DuckDB file	Cloud-accessible database (MotherDuck, PostgreSQL, or BigQuery -- decided in Inception Phase)
Team	Solo data engineer	Data engineer + domain expert + research coordinator
Outputs	Notebooks + exported CSVs	Research Data Packages (dataset + insight brief + visualizations + methodology note)
Training	Documentation only	Live capacity building sessions, recorded training, advisory support
Dashboard	Interactive Streamlit app with 6 analytical pages, anomaly detection, geographic mapping, and data freshness monitoring	Production monitoring dashboard with role-based access and alerting
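
This README does not state which anomaly detection method the dashboard uses. As one common approach, a trailing-window z-score on daily record counts can flag volume spikes; the sketch below assumes that technique, and the function name `flag_spikes`, the 7-day window, the threshold of 3.0, and the toy counts are all illustrative choices.

```python
from statistics import mean, stdev


def flag_spikes(daily_counts: list[int], window: int = 7, threshold: float = 3.0) -> list[int]:
    """Return indices of days whose count sits more than `threshold`
    standard deviations above the mean of the preceding `window` days."""
    spikes = []
    for i in range(window, len(daily_counts)):
        history = daily_counts[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history gives no scale for a z-score; skip the day.
            continue
        z_score = (daily_counts[i] - mu) / sigma
        if z_score > threshold:
            spikes.append(i)
    return spikes


# Toy series: a quiet week, one burst on day index 7, then back to normal.
counts = [10, 12, 11, 9, 10, 12, 11, 50, 11]
print(flag_spikes(counts))
```

Using only the preceding days as the baseline keeps the burst itself from inflating the mean it is compared against, although the burst does raise the baseline for the days that follow it.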

Data Governance and Ethics

Research-grade social listening requires explicit data governance. This project includes both documentation and implementation.

Data ethics framework: docs/data_ethics.md -- covers data minimization, no-model-training commitment, and source-specific compliance notes for GDELT and ACLED
Tool comparison: notebooks/5-appendix/01_mgo_tool_comparison_20260328.ipynb -- evaluates Brandwatch, Meltwater, Talkwalker, and Pulsar against the open-source stack on cost, Latin American language support, data ownership, reproducibility, capacity building, and sustainability
General principles: minimum necessary collection, no re-identification, local storage by default, purpose limitation, credentials never committed

The tool comparison concludes that for DDI's specific needs (Global South languages, research reproducibility, capacity building, sustainability after contract), an open-source stack provides superior value -- at $0/year vs. $10K-180K/year for commercial alternatives.

Architecture

conf/ (YAML configs)
  |   keywords/*.yml (election, manipulation, civic_space, political_figures, slang)
  |   data_collection/, nlp/
  v
notebooks/ (PRIMARY deliverables)
  |   1-data/ --> 2-exploration/ --> 3-analysis/ --> 4-output/ --> 5-appendix/
  |   Each notebook tells a story: what data, how analyzed, what found
  v
src/ (thin utility functions)
  |   collectors/  nlp/  analysis/  storage/  utils/
  |   Simple helpers consumed by notebooks -- no abstract classes
  v
data/ (numbered layers)
  |   01_raw/ --> 02_intermediate/ --> 03_primary/ --> 04_enriched/ --> 05_analysis/ --> 06_reporting/
  v
app/ (Streamlit dashboard — 6 analytical pages)
      Interactive exploration with sentiment, anomaly detection, geographic map

Data Sources

#	Source	What you get	Records collected	Library
1	Colombian RSS Feeds	News articles from 5 major outlets	640 articles	`feedparser`
2	GDELT	Colombian news events, tone, themes (30-day window)	769 articles	`requests`
3	ACLED	Protests, political violence, social leader killings (2018--2026)	31,079 events	`requests`

Total: 32,488 records | Total infrastructure cost: $0 -- all sources are free or open-access.
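
The record counts above are described as normalized and deduplicated. The README does not say how duplicates are identified, so the following is only a sketch of one plausible approach: hashing a cleaned title together with a cleaned URL. The function names and the `title` and `url` keys are hypothetical.

```python
import hashlib
import re


def fingerprint(title: str, url: str) -> str:
    """Hash a case- and whitespace-normalized title plus the URL without its query string."""
    clean_title = re.sub(r"\s+", " ", title).strip().lower()
    # Dropping the query string removes tracking parameters, at the cost of
    # merging pages that are distinguished only by a query parameter.
    clean_url = url.split("?", 1)[0].rstrip("/")
    return hashlib.sha256(f"{clean_title}|{clean_url}".encode("utf-8")).hexdigest()


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each fingerprint, preserving input order."""
    seen: set[str] = set()
    unique = []
    for record in records:
        key = fingerprint(record["title"], record["url"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


records = [
    {"title": "Elecciones  2026", "url": "https://example.org/a?utm_source=rss"},
    {"title": "elecciones 2026", "url": "https://example.org/a/"},
    {"title": "Otro titular", "url": "https://example.org/b"},
]
print(len(deduplicate(records)))
```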

Note on Reddit: Reddit was initially planned as a social media data source (r/Colombia, ~500-700K members), but was removed due to API access delays. This is documented transparently as a scope decision -- the architecture supports adding it when API access is granted, since each new source is one fetch function and one normalize function.
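
The "one fetch function and one normalize function" pattern can be sketched as follows. Everything here is illustrative: the `Record` fields, the `SOURCES` registry, and the `example` source with its payload shape are assumptions, not the project's real schema or any real API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Iterable


@dataclass(frozen=True)
class Record:
    """Common shape that downstream analysis would consume (hypothetical fields)."""

    source: str
    text: str
    url: str
    published_at: datetime


def fetch_example() -> list[dict]:
    """Stand-in for an API call; returns raw payloads in the source's own shape."""
    return [{"headline": "Titular de ejemplo", "link": "https://example.org/a", "ts": 1767225600}]


def normalize_example(raw: dict) -> Record:
    """Map one raw payload onto the common Record shape."""
    return Record(
        source="example",
        text=raw["headline"],
        url=raw["link"],
        published_at=datetime.fromtimestamp(raw["ts"], tz=timezone.utc),
    )


# Adding a source means adding one entry here; collect() itself never changes.
SOURCES: dict[str, tuple[Callable[[], Iterable[dict]], Callable[[dict], Record]]] = {
    "example": (fetch_example, normalize_example),
}


def collect(source_name: str) -> list[Record]:
    """Fetch raw items from a registered source and normalize each one."""
    fetch, normalize = SOURCES[source_name]
    return [normalize(raw) for raw in fetch()]


print(collect("example"))
```

Under this design, source-specific quirks stay inside the two functions for that source, which is what lets downstream analysis remain unchanged when a source is added.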

RSS Feed Availability as a Research Finding

Of eleven originally targeted Colombian news outlets, only five currently provide public RSS feeds: El Tiempo (Colombia and Politics sections), La Silla Vacía, Razón Pública, and Pulzo. The remaining six have discontinued public RSS entirely with no working alternatives:

Outlet	Notes
El Espectador	Oldest newspaper in Colombia; independent editorial line
Semana	Major news magazine; historically center-right
Cambio Colombia	Investigative journalism
Caracol Radio	Major radio network with online presence
Blu Radio	Second-largest radio network

This is not merely a data collection inconvenience. When major outlets close machine-readable data channels, the pool of sources accessible to independent monitors concentrates around those that still offer open feeds. Reduced plurality in accessible sources limits monitoring coverage and raises the barrier for civil society organizations without commercial data subscriptions. This dynamic is directly relevant to civic space health and is documented here as a methodological finding.

This situation reinforces the project's multi-source design: RSS alone cannot provide adequate coverage of the Colombian media landscape, which is why GDELT and ACLED are treated as co-equal primary sources rather than supplements.

Notebook Guide

9 numbered notebooks across 5 stages, each telling a specific part of the analytical story:

Stage	Notebooks	Purpose
1-data	2 collection notebooks	Collect from RSS, GDELT+ACLED
2-exploration	1 overview notebook	EDA, platform comparison, data quality
3-analysis	3 analysis notebooks	Sentiment, NER, topic modeling
4-output	2 output notebooks	Dataset export + analysis summary (Signal-Insight-Action-Outcome framework)
5-appendix	1 comparison notebook	Brandwatch vs. open-source tool evaluation

Tech Stack

Category	Tool
Language	Python 3.12
Package Manager	uv
NLP - Sentiment	pysentimiento (Latin American Spanish, ~500M tweet training set)
NLP - Emotion	pysentimiento emotion detection
NLP - Hate Speech	pysentimiento hate speech detection
NLP - NER	spaCy es_core_news_lg
NLP - Topics	sentence-transformers + sklearn
NLP - Language	Language detection and filtering
Text Processing	Colombian slang normalization
Collection	feedparser, requests
Storage	DuckDB (zero-infrastructure, single file)
Dashboard	Streamlit + Plotly + Folium
Config	PyYAML
Quality	ruff + mypy + commitizen (pre-commit hooks)
Testing	pytest (57 tests)
CI/CD	GitHub Actions
Containers	Docker
Docs	MkDocs

Why DuckDB?

Zero infrastructure: Embedded analytical database -- no server setup, just a file
SQL on DataFrames: Query collected data with standard SQL directly against pandas DataFrames, without loading them into a separate database first
Parquet native: Reads/writes Parquet files, keeping data portable across tools
Scalable path: Migrates cleanly to MotherDuck (cloud), PostgreSQL, or BigQuery for production

Quick Start

# 1. Clone the repository
git clone https://github.com/juandagalo/pulso-electoral.git
cd pulso-electoral

# 2. Install dependencies (requires uv: https://docs.astral.sh/uv/)
make install_env

# 3. Copy environment variables and fill in API credentials
cp .env.example .env

# 4. Download NLP models
make download_models

# 5. Run collection notebooks
make collect_all

# 6. Run analysis notebooks
make analyze

# 7. Export datasets and analysis summary
make export

# 8. (Optional) Start the Streamlit dashboard
make run_app

Development

make test           # Run tests with coverage (57 tests)
make test_verbose   # Run tests in verbose mode
make check          # Run all pre-commit hooks (ruff, mypy, commitizen)
make lint           # Run ruff linter only (faster)
make docs           # Serve MkDocs documentation locally
make docs_test      # Build docs and check for errors
make clean          # Remove caches, compiled files, DB files

Team

This project was built for the CIVICUS Digital Democracy Initiative consulting application, combining research expertise in Colombian political culture with data engineering and NLP capabilities. All code, analysis, and documentation were produced in approximately one week as a technical work sample.

License

MIT
