- Download and install Ollama: https://ollama.com/
- Make sure Ollama is running when you run the code
- The code will handle downloading the models for you; test with something small like SmolLM2 (a quick sanity-check sketch follows this list)
- Set up your virtual environment according to the instructions below
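
A minimal sanity check that the local Ollama server is reachable, assuming the official `ollama` Python client (`pip install ollama`) and the small `smollm2:135m` tag; swap in whichever model the code actually uses.

```python
# Minimal check that Ollama is running and can serve a small model.
# Assumes the official `ollama` Python client (pip install ollama) and a
# local Ollama server on its default port; smollm2:135m is a small tag.
import ollama

MODEL = "smollm2:135m"

ollama.pull(MODEL)  # downloads the model if it is not already cached locally
response = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(response["message"]["content"])
```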
Python version: 3.11.9; you can use pyenv to manage your local Python installs.
macOS/Linux:

- Create a virtual environment: `python -m venv venv`
- Activate the virtual environment: `source venv/bin/activate`
- Install requirements: `pip install -r requirements.txt`

Windows:

- Create a virtual environment: `python -m venv venv`
- Activate the virtual environment: `.\venv\Scripts\activate`
- Install requirements: `pip install -r requirements.txt`

For SAM-Sum:
```bash
curl -L -O https://huggingface.co/datasets/Samsung/samsum/resolve/main/data/corpus.7z
7z x corpus.7z
```

For Webis:

```bash
for i in {0..9}; do
  curl -L -O https://huggingface.co/datasets/webis/tldr-17/resolve/refs%2Fconvert%2Fparquet/default/partial-train/000$i.parquet
done
```
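
If you would rather stay in Python, the same partial-train parquets can be fetched with `huggingface_hub` (`pip install huggingface_hub`); this is an alternative to the curl loop, not the project's required method. Files land in the Hugging Face cache (by default under `~/.cache/huggingface`), and the returned paths point there.

```python
# Alternative to the curl loop above: download the Webis partial-train
# parquets via huggingface_hub. Files are cached locally; the list below
# holds the paths to the cached copies.
from huggingface_hub import hf_hub_download

paths = [
    hf_hub_download(
        repo_id="webis/tldr-17",
        repo_type="dataset",
        revision="refs/convert/parquet",
        filename=f"default/partial-train/000{i}.parquet",
    )
    for i in range(10)
]
print(paths)
```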
- `samsum/test.json` (Source)
  - 819 instances
  - 3 fields: `id`, `summary`, `dialogue`
  - `dialogue` ranges from 3-30 utterances (newline-separated)
- `webis/data.json` (Source)
  - 3,848,330 instances in full
  - We can consider using the `partial-train` branch data (9 parquets) -- see download instructions above
  - Relevant fields: `id`, `content`, `summary`, `subreddit` (a loading sketch for both datasets follows this list)
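
A quick way to sanity-check both files once they are in place. This assumes `samsum/test.json` is a JSON array with the three fields above, and that the Webis parquets sit in the working directory under the names the curl loop produces; adjust the paths for your layout.

```python
# Sanity-check both datasets; the paths and file layout here are assumptions.
import json
from pathlib import Path

import pandas as pd  # reading parquet also requires pyarrow (or fastparquet)

# SAM-Sum test split: assumed to be a JSON array of {id, summary, dialogue}.
with open("samsum/test.json", encoding="utf-8") as f:
    samsum = json.load(f)
print(len(samsum), "SAM-Sum instances")          # expect 819
print(sorted(samsum[0].keys()))                  # expect ['dialogue', 'id', 'summary']
print(samsum[0]["dialogue"].count("\n") + 1, "utterances in the first dialogue")

# Webis partial-train parquets, assumed to be in the working directory with
# the names the curl loop produces (0000.parquet .. 0009.parquet).
parquet_paths = sorted(Path(".").glob("000*.parquet"))
webis = pd.concat((pd.read_parquet(p) for p in parquet_paths), ignore_index=True)
print(len(webis), "Webis rows across", len(parquet_paths), "parquet files")
print(webis[["id", "content", "summary", "subreddit"]].head())
```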
A single parquet of the Webis data is massive compared to the SAM-Sum test split. Another repo includes an analysis of duplicate rows and removal of problematic rows, which we may want to replicate. We may also want to drop noisy/graphic data from certain subreddits (maybe only keep TrueReddit or AskReddit?); a filtering sketch follows below.
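
A sketch of that cleanup step, assuming the Webis rows are already in a pandas DataFrame like the one loaded above; the deduplication columns and the subreddit whitelist are just the examples mentioned, not a settled choice.

```python
# Sketch of the Webis cleanup discussed above; column names follow the field
# list, and the whitelist uses the two subreddits given as examples.
import pandas as pd

KEEP_SUBREDDITS = {"TrueReddit", "AskReddit"}

def clean_webis(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate/empty post-summary pairs and keep only whitelisted subreddits."""
    df = df.drop_duplicates(subset=["content", "summary"])
    df = df.dropna(subset=["content", "summary"])
    return df[df["subreddit"].isin(KEEP_SUBREDDITS)].reset_index(drop=True)

# Example usage, given the `webis` DataFrame from the loading sketch:
# webis_clean = clean_webis(webis)
```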