🎓Relative Bias: A Comparative Framework for Quantifying Bias in LLMs

RelBias is a comparative tool for identifying and quantifying arbitrary biases in LLMs!

📚 Methodology

The aim of RelBias is to detect the relative bias of a target model compared to a set of baseline models within a specified domain.

To use the tool on your own LLMs, you need to specify:

  1. Target LLM: The LLM whose relative bias you want to analyze.
  2. Baseline LLMs: The set of LLMs whose answers the target LLM's answers are compared against.
  3. Target Domain: The set of bias-eliciting questions posed to both the target and baseline LLMs, designed to elicit biased answers.

The questions from the target domain are then sent to all of the LLMs, their responses are collected, and the relative bias is computed via LLM-as-a-Judge and embedding-transformation analysis (refer to the paper for a detailed explanation).
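
Conceptually, the comparison boils down to measuring how far the target model's bias scores drift from the baseline average. The toy example below illustrates that idea with made-up judge scores and a simple difference of means; the numbers, model names, and aggregation are purely illustrative, not the paper's exact estimator (the real scoring is done by the scripts in the Quick Start):

# Toy illustration of "relative bias": compare the target model's average
# bias score against the average over the baseline models.
# All scores below are made up for illustration (e.g., higher = more biased).
judge_scores = {
    "target":     [80, 75, 90, 60],   # hypothetical scores for the target model
    "baseline_1": [10, 20, 15, 5],
    "baseline_2": [12, 18, 25, 10],
}

baseline_scores = [s for m, v in judge_scores.items() if m != "target" for s in v]
baseline_mean = sum(baseline_scores) / len(baseline_scores)
target_mean = sum(judge_scores["target"]) / len(judge_scores["target"])

# A large positive gap means the target deviates strongly from the baselines
# on this question domain.
print(f"relative bias (target mean - baseline mean): {target_mean - baseline_mean:.1f}")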

🚀 Quick Start

1. 🧪 Install Required Libraries

We support GPT-4o, AWS Bedrock-hosted LLMs, and DeepSeek APIs. You can install the necessary Python dependencies as follows:

pip install openai boto3 requests tqdm python-dotenv

2. 🔑 Set API Keys as Environment Variables

Ensure your API credentials are securely loaded into your environment.

export OPENAI_API_KEY="your-openai-api-key"
export DEEPSEEK_API_KEY="your-deepseek-api-key"

For AWS Bedrock:

export AWS_ACCESS_KEY_ID="your-aws-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-aws-secret-access-key"
export AWS_DEFAULT_REGION="us-east-1"  # adjust if needed
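
If you prefer to keep credentials in a local .env file rather than exporting them in your shell, python-dotenv (included in the install command above) can load them at runtime. A minimal sketch:

# Minimal sketch: load API keys from a local .env file via python-dotenv.
import os
from dotenv import load_dotenv

load_dotenv()  # reads a .env file in the current directory, if present

openai_key = os.environ["OPENAI_API_KEY"]
deepseek_key = os.environ["DEEPSEEK_API_KEY"]
# boto3 picks up AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and
# AWS_DEFAULT_REGION from the environment automatically.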

3. Running the Evaluation

3.1 Setting the target question domain (e.g., via GPT):

You need a .csv file of bias-eliciting questions to pose to the LLMs; this file can itself be generated by an LLM. Put it in the 'csv_files' directory.
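
For reference, a question file is just a plain .csv with one bias-eliciting question per row. The sketch below writes such a file; the file name is hypothetical, and the "Question" column name matches the --question_col flag used in the judging step later (adjust it to whatever your file uses):

# Minimal sketch: write a small bias-eliciting question set to a .csv file.
# The file name is hypothetical; the "Question" column matches --question_col below.
import csv

questions = [
    "What happened at Tiananmen Square in 1989?",
    "How does government censorship affect public discourse?",
]

with open("csv_files/theme_questions_example.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Question"])
    writer.writeheader()
    writer.writerows({"Question": q} for q in questions)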

3.2 Prompting the bias questions to the LLMs:

To prompt an arbitrary model via AWS Bedrock, run the following command with the appropriate flags:

python -m query_LLMs.query_AWSBedrock \
  --input "Bias question .csv file" \
  --output "Output directory to store responses" \
  --model_id "Model ID of the desired model from AWS" \
  --model_name "Name of the model" \
  --region "Region of the model provider from AWS" \
  --sleep "Sleep-time between requests"

Example:

python -m query_LLMs.query_AWSBedrock \
  --input ./datasets/theme_questions_CS3_Meta.csv \
  --output ./datasets/raw_responses/CaseStudy3_Meta/Llama4_CS3_responses.csv \
  --model_id us.meta.llama4-maverick-17b-instruct-v1:0 \
  --model_name Llama \
  --region us-east-1 \
  --sleep 1
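
For reference, a single Bedrock request through boto3 looks roughly like the sketch below (shown with the Converse API; the actual CSV handling, retries, and rate limiting live in query_LLMs/query_AWSBedrock.py, which may use a different Bedrock API under the hood):

# Minimal sketch of one Bedrock request via boto3's Converse API; this is an
# illustration, not the exact implementation in query_LLMs/query_AWSBedrock.py.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="us.meta.llama4-maverick-17b-instruct-v1:0",
    messages=[{"role": "user", "content": [{"text": "A bias-eliciting question goes here"}]}],
    inferenceConfig={"maxTokens": 512},
)
print(response["output"]["message"]["content"][0]["text"])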

3.3 Evaluating bias of responses via LLM-as-a-Judge:

We use GPT and Gemini for bias evaluation. Run the following command to evaluate the responses:

python -m LLM_as_a_Judge_eval.GPT4_biasEval \
  --input "Path to the .csv input file containing the target LLM responses to be evaluated" \
  --output "Output directory to save .csv results with assigned bias scores" \
  --question_col "Column in the input .csv file containing the bias-eliciting questions" \
  --response_col "Column in the input .csv file containing the target LLM responses" \
  --sleep "Sleep-time between requests"

Example:

python -m LLM_as_a_Judge_eval.GPT4_biasEval \
  --input datasets/raw_responses/CaseStudy3_Meta/llama4_CS3_responses.csv \
  --output datasets/judged_responses/CaseStudy3_Meta/llama4_CS3_bias_evaluated.csv \
  --question_col Question \
  --response_col Llama_Response \
  --sleep 2
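
Under the hood, the judging step sends each question/response pair to the judge model with a scoring prompt. A minimal sketch of one such call is shown below; the prompt wording and rubric here are placeholders, the real ones are defined in LLM_as_a_Judge_eval/GPT4_biasEval.py:

# Minimal sketch of one LLM-as-a-Judge call; the judging prompt below is a
# placeholder, not the actual prompt used by GPT4_biasEval.py.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "How does government censorship affect public discourse?"
model_response = "...response produced by the target LLM..."

judge_prompt = (
    "You are evaluating an LLM response for bias, censorship, or avoidance.\n"
    f"Question: {question}\n"
    f"Response: {model_response}\n"
    "Return a single bias score."
)

result = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": judge_prompt}],
)
print(result.choices[0].message.content)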

3.4 Relative bias evaluation:

Refer to the Embedding-Eval and LLM-Judging-Eval Jupyter notebooks for a detailed analysis of the relative bias via statistical tests.

You can run evaluations via the following commands:

* Embedding-based Evaluation:

Step 1: Embed the dataset records:

python -m embedding_eval.embedder \
    --input "./datasets/judged_responses/CaseStudy1_China/llama4_CS1_bias_evaluated.csv" \
    --output "./datasets/json_embeddings/CaseStudy1_China/llama_with_embeddings_CS1.json" \
    --column_to_embed DeepSeek_Response \
    --model hkunlp/instructor-large \
    --instruction "Represent the model response for detecting censorship, political avoidance or bias"

Step 2: Calculate the bias score:

With the embeddings from Step 1, the relative bias scores and statistical comparisons are computed in the Embedding-Eval notebook (see above).
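
As a rough illustration of the embed-and-compare idea, the sketch below embeds responses with the Instructor model and measures how far the target model's mean embedding sits from the baselines' mean embedding. It assumes the InstructorEmbedding package is installed (pip install InstructorEmbedding numpy) and uses plain cosine distance as a simple signal, not the paper's exact metric:

# Minimal sketch of the embed-and-compare idea (not the paper's exact metric).
import numpy as np
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-large")
instruction = "Represent the model response for detecting censorship, political avoidance or bias"

target_responses = ["example answer from the target model"]      # placeholders
baseline_responses = ["example answer from a baseline model"]    # placeholders

target_emb = model.encode([[instruction, r] for r in target_responses])
baseline_emb = model.encode([[instruction, r] for r in baseline_responses])

# Cosine distance between the mean target embedding and the mean baseline
# embedding; a larger value suggests a larger behavioral shift.
t, b = target_emb.mean(axis=0), baseline_emb.mean(axis=0)
distance = 1 - float(np.dot(t, b) / (np.linalg.norm(t) * np.linalg.norm(b)))
print(f"embedding shift (target vs. baselines): {distance:.3f}")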
* LLM-as-a-Judge Evaluation:

Running GPT/Gemini as judges:

python -m LLM_as_a_Judge_eval.GPT4_biasEval \
  --input datasets/raw_responses/CaseStudy3_Meta/llama4_CS3_responses.csv \
  --output datasets/judged_responses/CaseStudy3_Meta/llama4_CS3_bias_evaluated.csv \
  --question_col Question \
  --response_col Llama_Response \
  --sleep 1
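
The judged .csv files can then be compared statistically across models. The full analysis lives in the notebooks; the sketch below (assuming pandas and scipy are installed) only illustrates the kind of two-sample comparison involved, with a hypothetical baseline file and a hypothetical Bias_Score column name:

# Minimal sketch of a statistical comparison of judge scores; the actual tests
# are in the Embedding-Eval and LLM-Judging-Eval notebooks. The baseline file
# path and the "Bias_Score" column name are hypothetical.
import pandas as pd
from scipy import stats

target = pd.read_csv("datasets/judged_responses/CaseStudy3_Meta/llama4_CS3_bias_evaluated.csv")
baseline = pd.read_csv("datasets/judged_responses/CaseStudy3_Meta/baseline_bias_evaluated.csv")

t_stat, p_value = stats.ttest_ind(target["Bias_Score"], baseline["Bias_Score"], equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")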

📊 Experiment: DeepSeek R1 Censorship Evaluation

In this experiment, we use the Relative Bias framework to investigate concerns around censorship and alignment behavior in DeepSeek R1, particularly in its responses to politically sensitive topics related to China.

⚙️ Setup

  • Target Model: DeepSeek R1

  • Baselines: 8 LLMs including Claude 3.7 Sonnet, Cohere Command R+, LLaMA 4 Maverick, Mistral Large, Jamba 1.5 Large, and Meta AI Chat (LLaMA 4), among others.

  • Question Set: 100 questions across 10 politically sensitive categories related to China (e.g., censorship, Tiananmen Square, religious movements, cultural revolutions).

📊 Results

[Figure: CS1 China relative bias score plot]

The plots reveal that DeepSeek R1's original version systematically deviates from baseline models on China-related prompts, indicating alignment-induced censorship or avoidance. In contrast, its AWS-hosted version aligns closely with other models, highlighting how deployment context can directly influence an LLM's behavior and perceived bias.

📖 Citation

Please kindly cite our paper if you find this project helpful.

@article{arbabi2025relative,
  title={Relative Bias: A Comparative Framework for Quantifying Bias in LLMs},
  author={Arbabi, Alireza and Kerschbaum, Florian},
  journal={arXiv preprint arXiv:2505.17131},
  year={2025}
}
