Social-RAG

Social-RAG is a research-oriented Retrieval-Augmented Generation (RAG) framework designed for interpretive (hermeneutic) analysis and factual question answering over socially produced textual corpora (e.g., messaging platforms), with a strong emphasis on methodological transparency, evaluation traceability, and responsible data governance.

This repository contains the code and research artifacts supporting our ongoing manuscript on Social-RAG, including prompts, evaluation protocols, figures, appendices, and sanitized samples.


Why Social-RAG?

Most RAG implementations optimize primarily for “answer correctness” in conventional QA settings. Social-RAG is motivated by a different research need: supporting historical and social-scientific inquiry where:

  • questions can be interpretive, not only extractive;
  • evidence often consists of fragmented, redundant, and conversational messages;
  • “good answers” must be evaluated across criteria like thematic alignment, analytic depth, synthesis, and evidence precision;
  • transparency matters: it should be possible to trace how an answer was formed and what evidence was retrieved.

Social-RAG is therefore positioned as a hermeneutic assistant rather than a counting tool: it is not intended to behave like a deterministic aggregator for totals (e.g., number of links, exact term counts), but as a retrieval-grounded workflow to support interpretation.
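A retrieval-grounded workflow of this kind can be sketched as below. This is an illustrative sketch only: embed, vector_index, and llm are hypothetical stand-ins for whichever embedding model, vector store, and LLM client are used, not this repository's actual API.

```python
def answer(question, embed, vector_index, llm, k=5):
    """Retrieve top-k message chunks, then condition generation on them.

    Returns both the answer and the retrieved evidence, so it is
    possible to trace what the answer was grounded on.
    """
    query_vec = embed(question)                       # embed the question
    evidence = vector_index.search(query_vec, k=k)    # top-k message chunks
    context = "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(evidence))
    prompt = (
        "Answer the question using ONLY the numbered evidence below, "
        "citing passage indices so the reasoning is traceable.\n\n"
        f"Evidence:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt), evidence
```

Returning the retrieved evidence alongside the answer is what makes the transparency goal above concrete: a reader can inspect exactly which messages conditioned the response.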


What’s in this repository

Core artifacts (for the paper)

  • prompts/
    System prompts and task templates used by Social-RAG (including the system prompt that conditions the model on the dataset “theme”).
  • evaluation/
    LLM-as-judge materials and outputs: per-judge scores, consolidated summaries, and scripts used to generate tables/plots.
  • figures/
    Exported figures used in the manuscript (PNG/SVG when available) and any source data required to regenerate them.
  • appendices/
    Paper appendices: full prompts, methodological notes, extended tables, and supplementary material referenced in the manuscript.
  • samples/
    Sanitized/minimized samples for demonstration and reproducibility, respecting privacy and data governance constraints.
  • docs/
    Project documentation (repository overview, reproducibility steps, data governance, ethics & safety notes).

Repository metadata

  • CITATION.cff
    Citation metadata for academic use.
  • LICENSE.md
    Licensing terms for code and materials.
  • requirements.txt
    Python dependencies for running scripts and reproducing results.

Datasets and experimental themes

Our experiments currently focus on two thematic corpora used in the paper:

  • Rouanet: discussions related to Brazil’s cultural incentive law (Lei Rouanet).
  • Vaccine: discussions around vaccination and vaccine-related narratives.

Important: this repository does not redistribute full raw datasets. See Data Governance below for what is shared and what is restricted.


Evaluation design (high-level)

We evaluate model behavior under two main question families:

  1. Factual questions
    Oriented toward explicit extraction/verification (entities, numbers, negations, URLs, etc.).
  2. Hermeneutic questions
    Oriented toward interpretive and analytical output (depth, synthesis, narrative coherence, evidence use).

We use an LLM-as-judge protocol with multiple blind rounds and multiple judges, reporting both:

  • per-judge results; and
  • consolidated summaries computed to avoid over-weighting judges with more blind rounds (see paper + appendices for details).
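One simple way to keep judges with more blind rounds from dominating a consolidated summary is a macro-average: average within each judge first, then across judges. The sketch below illustrates that idea only; the exact consolidation protocol is described in the paper and appendices.

```python
from statistics import mean

def consolidate(scores_by_judge):
    """Macro-average per-judge scores.

    Averaging within each judge before averaging across judges means a
    judge with more blind rounds contributes no more weight than any
    other judge. (Illustrative sketch, not the paper's exact protocol.)
    """
    per_judge_means = [mean(rounds) for rounds in scores_by_judge.values()]
    return mean(per_judge_means)

# A pooled mean over all rounds would weight judge_a's three rounds
# 3:1 against judge_b's single round; the macro-average does not:
scores = {"judge_a": [4, 5, 3], "judge_b": [2]}
consolidate(scores)  # macro-average 3.0, versus a pooled mean of 3.5
```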

Data governance

This repository follows a share-what-we-can principle while protecting privacy and minimizing re-identification risk.

  • Publicly shared: prompts, evaluation protocols, aggregated results, figures, and sanitized samples.
  • Restricted: raw message-level datasets and any material that could enable direct re-identification or reconstruction of sensitive contexts.

See docs/data-governance.md for details on sampling/anonymization strategy and access constraints.
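As a toy example of the kind of sanitization such a strategy involves, the sketch below replaces user handles and phone-like strings with stable salted hashes, preserving conversational structure without exposing identities. This is purely illustrative and is not the repository's actual anonymization pipeline; the regexes and salt are hypothetical.

```python
import hashlib
import re

def pseudonymize(message, salt="replace-with-a-secret-salt"):
    """Replace @handles and phone-like numbers with stable pseudonyms.

    The same identifier always maps to the same tag (so threads remain
    followable), but the original value is not recoverable without the
    salt. (Toy sketch only, not the repo's real sanitization code.)
    """
    def tag(kind, value):
        digest = hashlib.sha256((salt + value).encode()).hexdigest()[:8]
        return f"<{kind}:{digest}>"

    message = re.sub(r"@\w+", lambda m: tag("user", m.group()), message)
    message = re.sub(r"\+?\d[\d\- ]{7,}\d", lambda m: tag("phone", m.group()), message)
    return message
```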


Ethics & safety

Working with platform-derived data can involve:

  • privacy and re-identification risks,
  • harmful or extremist content,
  • bias amplification,
  • misuse risks.

We provide explicit handling notes and recommended constraints in docs/ethics-and-safety.md.


How to cite

If you use this repository, please cite it using CITATION.cff and, when available, the associated paper/preprint.


License

See LICENSE.md.


Acknowledgements

Social-RAG is developed within LABHDUFBA’s research activities and collaborations. We also acknowledge the open-source ecosystem that enables reproducible research workflows in RAG.


Contact

For questions, issues, or collaboration proposals, please open a GitHub issue in this repository.