RelBias is a comparative tool for identifying and quantifying arbitrary bias in LLMs!
The aim of RelBias is to detect the relative bias of a target model compared to a set of baseline models within a specified domain.
Thus, to use the tool on your own LLMs, you need to specify:
- Target LLM: The LLM whose relative bias you want to analyze.
- Baseline LLMs: The set of LLMs whose answers the target LLM's answers are compared against.
- Target Domain: The set of bias-eliciting questions posed to both the target and baseline LLMs, designed to draw out biased answers.
Then, the target-domain questions are sent to all of the LLMs, their responses are gathered, and the relative bias is computed via LLM-as-a-Judge and Embedding-Transformation analysis (refer to the paper for a detailed explanation).
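For orientation, here is a minimal, self-contained sketch of that flow using stand-in models. The function names, question strings, and baseline names are illustrative only and do not correspond to the repo's actual API; the real entry points are the query_LLMs, LLM_as_a_Judge_eval, and embedding_eval modules described below.

# Illustrative RelBias flow with stand-in models; not the repo's actual API.
from typing import Callable, List

def ask_target(question: str) -> str:
    # Stand-in for the target LLM (normally an API-backed model).
    return f"target answer to: {question}"

def ask_baseline(question: str) -> str:
    # Stand-in for one baseline LLM.
    return f"baseline answer to: {question}"

def collect(ask: Callable[[str], str], questions: List[str]) -> List[str]:
    # Send every target-domain question to one model and gather its answers.
    return [ask(q) for q in questions]

questions = ["Question 1 from the target domain", "Question 2 from the target domain"]
target_responses = collect(ask_target, questions)
baseline_responses = {"baseline_A": collect(ask_baseline, questions)}

# Relative bias is then estimated from these response sets via LLM-as-a-Judge
# scoring and embedding-based analysis (see the commands below and the paper).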
We support GPT-4o, AWS Bedrock-hosted LLMs, and DeepSeek APIs. You can install the necessary Python dependencies as follows:
pip install openai boto3 requests tqdm python-dotenv
Ensure your API credentials are securely loaded into your environment.
export OPENAI_API_KEY="your-openai-api-key"
export DEEPSEEK_API_KEY="your-deepseek-api-key"
For AWS Bedrock:
export AWS_ACCESS_KEY_ID="your-aws-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-aws-secret-access-key"
export AWS_DEFAULT_REGION="us-east-1" # adjust if needed
You need a .csv file containing the target bias-eliciting questions to be asked of the LLMs; these questions can themselves be generated by an LLM. Put them in the 'csv_files' directory.
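As a quick sanity check, the snippet below (a sketch, not part of the repo) loads credentials from a local .env file with python-dotenv and writes a tiny example question file. The "Question" column name and file path are assumptions based on the examples further down in this README; adapt them to whatever your own .csv uses.

# Sketch: load credentials from a .env file and create a small questions .csv.
# The "Question" column name and file path are assumptions, not requirements of the repo.
import csv
import os

from dotenv import load_dotenv  # provided by the python-dotenv dependency

load_dotenv()  # optional alternative to exporting the variables manually
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"

os.makedirs("csv_files", exist_ok=True)
with open("csv_files/bias_questions.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Question"])
    writer.writeheader()
    writer.writerow({"Question": "How would you describe this politically sensitive event?"})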
To prompt an arbitrary model via AWS Bedrock, run the following command with the appropriate flags:
python -m query_LLMs.query_AWSBedrock \
--input "Bias question .csv file" \
--output "Output directory to store responses" \
--model_id "Model ID of the desired model from AWS" \
--model_name "Name of the model" \
--region "Region of the model provider from AWS" \
--sleep "Sleep-time between reqeusts"example:
python -m query_LLMs.query_AWSBedrock \
--input ./datasets/theme_questions_CS3_Meta.csv \
--output ./datasets/raw_responses/CaseStudy3_Meta/Llama4_CS3_responses.csv \
--model_id us.meta.llama4-maverick-17b-instruct-v1:0 \
--model_name Llama \
--region us-east-1 \
--sleep 1
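Under the hood, query_LLMs.query_AWSBedrock sends each question to the Bedrock runtime. The exact request handling is in that module, but a single call looks roughly like the following sketch using boto3's Converse API (the model ID is taken from the example above).

# Rough sketch of one AWS Bedrock request with boto3's Converse API.
# This mirrors what query_LLMs.query_AWSBedrock does per question, but the
# module's actual request/response handling may differ.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="us.meta.llama4-maverick-17b-instruct-v1:0",
    messages=[{"role": "user", "content": [{"text": "Your bias-eliciting question here"}]}],
)
print(response["output"]["message"]["content"][0]["text"])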
We use GPT and Gemini for bias evaluation. To do so, run the following command:
python -m LLM_as_a_Judge_eval.GPT4_biasEval \
--input "Firectory of the .csv input file containg target LLM responses to be evaluated" \
--output "Output directory to save .csv results with assigned bias scores" \
--question_col "Column in the input .csv file containing the bias-eliciting questions" \
--response_col "Column in the input .csv file containing the target LLM responses" \
--sleep "Sleep-time between reqeusts"
Example:
python -m LLM_as_a_Judge_eval.GPT4_biasEval \
--input datasets/raw_responses/CaseStudy3_Meta/llama4_CS3_responses.csv \
--output datasets/judged_responses/CaseStudy3_Meta/llama4_CS3_bias_evaluated.csv \
--question_col Question \
--response_col Llama_Response \
--sleep 2
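The judging rubric, prompt, and score scale live in LLM_as_a_Judge_eval.GPT4_biasEval; the sketch below only illustrates the general call pattern with the OpenAI client, and its prompt wording and model name are placeholders rather than the repo's actual settings.

# Sketch of the LLM-as-a-Judge call pattern with the OpenAI client.
# Prompt wording and model name are placeholders, not the repo's actual rubric.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(question: str, response: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You rate how biased or evasive a model response is."},
            {"role": "user", "content": f"Question: {question}\nResponse: {response}\nReturn a bias score."},
        ],
    )
    return completion.choices[0].message.content

print(judge("Example question", "Example model response"))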
Refer to the Embedding-Eval and LLM-Judging-Eval Jupyter notebooks for a detailed analysis of the relative bias via statistical tests.
You can run evaluations via the following commands:
Step 1: Embed the dataset records:
python -m embedding_eval.embedder \
--input "./datasets/judged_responses/CaseStudy1_China/llama4_CS1_bias_evaluated.csv" \
--output "./datasets/json_embeddings/CaseStudy1_China/llama_with_embeddings_CS1.json" \
--column_to_embed DeepSeek_Response \
--model hkunlp/instructor-large \
--instruction "Represent the model response for detecting censorship, political avoidance or bias"
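For reference, hkunlp/instructor-large is an instruction-tuned embedding model. The sketch below shows how such instruction-plus-text embeddings are typically computed with the InstructorEmbedding package; that package is not in the dependency list above, and embedding_eval.embedder may load the model differently.

# Sketch of instruction-conditioned embeddings with hkunlp/instructor-large,
# assuming the InstructorEmbedding package; embedding_eval.embedder may differ.
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-large")
instruction = "Represent the model response for detecting censorship, political avoidance or bias"
texts = ["Example model response to embed"]

embeddings = model.encode([[instruction, t] for t in texts])
print(embeddings.shape)  # (number of texts, embedding dimension)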
Step 2: Calculate bias score:
python -m embedding_eval.embedder \
--input "./datasets/judged_responses/CaseStudy1_China/llama4_CS1_bias_evaluated.csv" \
--output "./datasets/json_embeddings/CaseStudy1_China/llama_with_embeddings_CS1.json" \
--column_to_embed DeepSeek_Response \
--model hkunlp/instructor-large \
--instruction "Represent the model response for detecting censorship, political avoidance or bias"
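The statistical analysis actually used for the embedding-based bias score is in the Embedding-Eval notebook and the paper. As a simple illustration of comparing a target model's responses to a baseline's, the sketch below computes the cosine distance between mean embedding vectors; the "embedding" key and the file names are assumptions, not the repo's guaranteed schema.

# Illustrative comparison of target vs. baseline embeddings via cosine distance
# between mean vectors. NOT the paper's exact statistic; the "embedding" key
# and file names are assumptions.
import json

import numpy as np

def mean_embedding(path: str, key: str = "embedding") -> np.ndarray:
    with open(path) as f:
        records = json.load(f)  # expected: a list of records, each with an embedding vector
    return np.mean([np.asarray(r[key], dtype=float) for r in records], axis=0)

target = mean_embedding("target_with_embeddings.json")
baseline = mean_embedding("baseline_with_embeddings.json")

cosine_distance = 1 - np.dot(target, baseline) / (np.linalg.norm(target) * np.linalg.norm(baseline))
print(f"Cosine distance between mean embeddings: {cosine_distance:.4f}")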
Running GPT/Gemini as judges:
python -m LLM_as_a_Judge_eval.GPT4_biasEval \
--input datasets/raw_responses/CaseStudy3_Meta/llama4_CS3_responses.csv \
--output datasets/judged_responses/CaseStudy3_Meta/llama4_CS3_bias_evaluated.csv \
--question_col Question \
--response_col Llama_Response \
--sleep 1
In this experiment, we use the Relative Bias framework to investigate concerns around censorship and alignment behavior in the LLM DeepSeek R1, particularly in its responses to politically sensitive topics related to China.
- Target Model: DeepSeek R1
- Baselines: 8 LLMs including Claude 3.7 Sonnet, Cohere Command R+, LLaMA 4 Maverick, Mistral Large, Jamba 1.5 Large, and Meta AI Chat (LLaMA 4), among others.
- Question Set: 100 questions across 10 politically sensitive categories related to China (e.g., censorship, Tiananmen Square, religious movements, cultural revolutions).
The plots reveal that DeepSeek R1's original version systematically deviates from baseline models on China-related prompts, indicating alignment-induced censorship or avoidance. In contrast, its AWS-hosted version aligns closely with other models, highlighting how deployment context can directly influence an LLM's behavior and perceived bias.
Please kindly cite our paper if you find this project helpful.
@article{arbabi2025relative,
  title={Relative Bias: A Comparative Framework for Quantifying Bias in LLMs},
  author={Arbabi, Alireza and Kerschbaum, Florian},
  journal={arXiv preprint arXiv:2505.17131},
  year={2025}
}