Social-RAG is a research-oriented Retrieval-Augmented Generation (RAG) framework for interpretive (hermeneutic) analysis and factual question answering over socially produced textual corpora (e.g., messaging platforms). It places strong emphasis on methodological transparency, evaluation traceability, and responsible data governance.
This repository contains the code and research artifacts supporting our ongoing manuscript on Social-RAG, including prompts, evaluation protocols, figures, appendices, and sanitized samples.
Most RAG implementations optimize primarily for “answer correctness” in conventional QA settings. Social-RAG is motivated by a different research need: supporting historical and social-scientific inquiry where:
- questions can be interpretive, not only extractive;
- evidence often consists of fragmented, redundant, and conversational messages;
- “good answers” must be evaluated across criteria like thematic alignment, analytic depth, synthesis, and evidence precision;
- transparency matters: it should be possible to trace how an answer was formed and what evidence was retrieved.
Social-RAG is therefore positioned as a hermeneutic assistant rather than a counting tool: it is not intended to behave like a deterministic aggregator for totals (e.g., number of links, exact term counts), but as a retrieval-grounded workflow to support interpretation.
- `prompts/`: System prompts and task templates used by Social-RAG (including the system prompt that conditions the model on the dataset “theme”).
- `evaluation/`: LLM-as-judge materials and outputs: per-judge scores, consolidated summaries, and scripts used to generate tables/plots.
- `figures/`: Exported figures used in the manuscript (PNG/SVG when available) and any source data required to regenerate them.
- `appendices/`: Paper appendices: full prompts, methodological notes, extended tables, and supplementary material referenced in the manuscript.
- `samples/`: Sanitized/minimized samples for demonstration and reproducibility, respecting privacy and data governance constraints.
- `docs/`: Project documentation (repository overview, reproducibility steps, data governance, ethics & safety notes).
- `CITATION.cff`: Citation metadata for academic use.
- `LICENSE.md`: Licensing terms for code and materials.
- `requirements.txt`: Python dependencies for running scripts and reproducing results.
Our experiments currently focus on two thematic corpora used in the paper:
- Rouanet: discussions related to Brazil’s cultural incentive law (Lei Rouanet).
- Vaccine: discussions around vaccination and vaccine-related narratives.
Important: This repository does not redistribute full raw datasets. See Data Governance below for what is shared and what is restricted.
We evaluate model behavior under two main question families:
- Factual questions: oriented toward explicit extraction/verification (entities, numbers, negations, URLs, etc.).
- Hermeneutic questions: oriented toward interpretive and analytical output (depth, synthesis, narrative coherence, evidence use).
We use an LLM-as-judge protocol with multiple blind rounds and multiple judges, reporting both:
- per-judge results; and
- consolidated summaries computed to avoid over-weighting judges with more blind rounds (see paper + appendices for details).
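As an illustrative sketch only (not the actual script in `evaluation/`; all names and scores here are hypothetical), one way to avoid over-weighting judges with more blind rounds is to average within each judge first, then average the per-judge means:

```python
from statistics import mean

# Hypothetical per-round scores: judge -> scores from that judge's
# blind rounds (judges may differ in the number of rounds completed).
scores = {
    "judge_a": [4.0, 4.5, 4.2],  # three blind rounds
    "judge_b": [3.8, 4.1],       # two blind rounds
}

# Naive pooling over all rounds over-weights judge_a,
# who contributed more rounds.
pooled = mean(s for rounds in scores.values() for s in rounds)

# Equal-weight consolidation: one mean per judge, then the
# mean of those per-judge means.
per_judge = {judge: mean(rounds) for judge, rounds in scores.items()}
consolidated = mean(per_judge.values())

print(per_judge)      # per-judge results
print(consolidated)   # consolidated summary
```

The per-judge means correspond to the per-judge results reported above; the second-level mean gives each judge equal weight in the consolidated summary regardless of how many blind rounds they completed.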
This repository follows a share-what-we-can principle while protecting privacy and minimizing re-identification risk.
- Publicly shared: prompts, evaluation protocols, aggregated results, figures, and sanitized samples.
- Restricted: raw message-level datasets and any material that could enable direct re-identification or reconstruction of sensitive contexts.
See docs/data-governance.md for details on sampling/anonymization strategy and access constraints.
Working with platform-derived data can involve:
- privacy and re-identification risks,
- harmful or extremist content,
- bias amplification,
- misuse risks.
We provide explicit handling notes and recommended constraints in docs/ethics-and-safety.md.
If you use this repository, please cite it using CITATION.cff and, when available, the associated paper/preprint referenced in the manuscript.
See LICENSE.md.
Social-RAG is developed within LABHDUFBA’s research activities and collaborations. We also acknowledge the open-source ecosystem that enables reproducible research workflows in RAG.
For questions, issues, or collaboration proposals, please open a GitHub issue in this repository.