Social-RAG

Social-RAG is a research-oriented Retrieval-Augmented Generation (RAG) framework designed for interpretive (hermeneutic) analysis and factual question answering over socially produced textual corpora (e.g., messaging platforms), with a strong emphasis on methodological transparency, evaluation traceability, and responsible data governance.

This repository contains the code and research artifacts supporting our ongoing manuscript on Social-RAG, including prompts, evaluation protocols, figures, appendices, and sanitized samples.


Why Social-RAG?

Most RAG implementations optimize primarily for “answer correctness” in conventional QA settings. Social-RAG is motivated by a different research need: supporting historical and social-scientific inquiry where:

  • questions can be interpretive, not only extractive;
  • evidence often consists of fragmented, redundant, and conversational messages;
  • “good answers” must be evaluated across criteria like thematic alignment, analytic depth, synthesis, and evidence precision;
  • transparency matters: it should be possible to trace how an answer was formed and what evidence was retrieved.

Social-RAG is therefore positioned as a hermeneutic assistant rather than a counting tool: it is not intended to behave like a deterministic aggregator for totals (e.g., number of links, exact term counts), but as a retrieval-grounded workflow to support interpretation.
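A retrieval-grounded workflow of this kind can be sketched as below. This is an illustrative sketch only: embed, vector_index, and llm are hypothetical stand-ins for whichever embedding model, vector store, and LLM client are used, not this repository's actual API.

```python
def answer(question, embed, vector_index, llm, k=5):
    """Retrieve top-k message chunks, then condition generation on them.

    Returns both the answer and the retrieved evidence, so it is
    possible to trace what the answer was grounded on.
    """
    query_vec = embed(question)                       # embed the question
    evidence = vector_index.search(query_vec, k=k)    # top-k message chunks
    context = "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(evidence))
    prompt = (
        "Answer the question using ONLY the numbered evidence below, "
        "citing passage indices so the reasoning is traceable.\n\n"
        f"Evidence:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt), evidence
```

Returning the retrieved evidence alongside the answer is what makes the transparency goal above concrete: a reader can inspect exactly which messages conditioned the response.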


What’s in this repository

Core artifacts (for the paper)

  • prompts/
    System prompts and task templates used by Social-RAG (including the system prompt that conditions the model on the dataset “theme”).
  • evaluation/
    LLM-as-judge materials and outputs: per-judge scores, consolidated summaries, and scripts used to generate tables/plots.
  • figures/
    Exported figures used in the manuscript (PNG/SVG when available) and any source data required to regenerate them.
  • appendices/
    Paper appendices: full prompts, methodological notes, extended tables, and supplementary material referenced in the manuscript.
  • samples/
    Sanitized/minimized samples for demonstration and reproducibility, respecting privacy and data governance constraints.
  • docs/
    Project documentation (repository overview, reproducibility steps, data governance, ethics & safety notes).

Repository metadata

  • CITATION.cff
    Citation metadata for academic use.
  • LICENSE.md
    Licensing terms for code and materials.
  • requirements.txt
    Python dependencies for running scripts and reproducing results.

Datasets and experimental themes

Our experiments currently focus on two thematic corpora used in the paper:

  • Rouanet: discussions related to Brazil’s cultural incentive law (Lei Rouanet).
  • Vaccine: discussions around vaccination and vaccine-related narratives.

Important: this repository does not redistribute full raw datasets. See Data Governance below for what is shared and what is restricted.


Evaluation design (high-level)

We evaluate model behavior under two main question families:

  1. Factual questions
    Oriented toward explicit extraction/verification (entities, numbers, negations, URLs, etc.).
  2. Hermeneutic questions
    Oriented toward interpretive and analytical output (depth, synthesis, narrative coherence, evidence use).

We use an LLM-as-judge protocol with multiple blind rounds and multiple judges, reporting both:

  • per-judge results; and
  • consolidated summaries computed to avoid over-weighting judges with more blind rounds (see paper + appendices for details).
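One simple way to keep judges with more blind rounds from dominating a consolidated summary is a macro-average: average within each judge first, then across judges. The sketch below illustrates that idea only; the exact consolidation protocol is described in the paper and appendices.

```python
from statistics import mean

def consolidate(scores_by_judge):
    """Macro-average per-judge scores.

    Averaging within each judge before averaging across judges means a
    judge with more blind rounds contributes no more weight than any
    other judge. (Illustrative sketch, not the paper's exact protocol.)
    """
    per_judge_means = [mean(rounds) for rounds in scores_by_judge.values()]
    return mean(per_judge_means)

# A pooled mean over all rounds would weight judge_a's three rounds
# 3:1 against judge_b's single round; the macro-average does not:
scores = {"judge_a": [4, 5, 3], "judge_b": [2]}
consolidate(scores)  # macro-average 3.0, versus a pooled mean of 3.5
```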

Data governance

This repository follows a share-what-we-can principle while protecting privacy and minimizing re-identification risk.

  • Publicly shared: prompts, evaluation protocols, aggregated results, figures, and sanitized samples.
  • Restricted: raw message-level datasets and any material that could enable direct re-identification or reconstruction of sensitive contexts.

See docs/data-governance.md for details on sampling/anonymization strategy and access constraints.
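As a toy example of the kind of sanitization such a strategy involves, the sketch below replaces user handles and phone-like strings with stable salted hashes, preserving conversational structure without exposing identities. This is purely illustrative and is not the repository's actual anonymization pipeline; the regexes and salt are hypothetical.

```python
import hashlib
import re

def pseudonymize(message, salt="replace-with-a-secret-salt"):
    """Replace @handles and phone-like numbers with stable pseudonyms.

    The same identifier always maps to the same tag (so threads remain
    followable), but the original value is not recoverable without the
    salt. (Toy sketch only, not the repo's real sanitization code.)
    """
    def tag(kind, value):
        digest = hashlib.sha256((salt + value).encode()).hexdigest()[:8]
        return f"<{kind}:{digest}>"

    message = re.sub(r"@\w+", lambda m: tag("user", m.group()), message)
    message = re.sub(r"\+?\d[\d\- ]{7,}\d", lambda m: tag("phone", m.group()), message)
    return message
```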


Ethics & safety

Working with platform-derived data can involve:

  • privacy and re-identification risks,
  • harmful or extremist content,
  • bias amplification,
  • misuse risks.

We provide explicit handling notes and recommended constraints in docs/ethics-and-safety.md.


How to cite

If you use this repository, please cite it using CITATION.cff and, when available, the associated paper/preprint.


License

See LICENSE.md.


Acknowledgements

Social-RAG is developed within LABHDUFBA’s research activities and collaborations. We also acknowledge the open-source ecosystem that enables reproducible research workflows in RAG.


Contact

For questions, issues, or collaboration proposals, please open a GitHub issue in this repository.