- Team Name: Team DAY
- Members:
  - Andrew Li (ayl2159)
  - David Zhang (dwz2107)
  - Yuqi Zhang (yz5072)
Evaluate long-context language models and state space models (SSMs) on LongBench-v2, Ada-LEval, and PG-19, focusing on quality and efficiency across large context lengths (up to 131k+ for Transformers and 32k for SSM baselines).
- Architectures: Transformer LLMs (Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct, Jamba-3B, Nemotron-h-8b-r128k) and SSMs (Mamba-2.8B, Mamba-Codestral-7B).
- Frameworks: PyTorch with vLLM 0.11.2; `mamba-ssm` for SSM kernels.
- Customization: Context-length scaling via per-model JSON configs; shared greedy decoding (`temperature=0.1`, `max_new_tokens=256`, `batch_size=16` for efficient serving); note that Qwen-7B requires a YARN rope-scaling adjustment for the largest contexts.
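For orientation, a per-model config could look roughly like the sketch below. The field names and file path are assumptions made for illustration (the repo's actual schema under `configs/models/` may differ); only the values mirror the shared decoding settings listed above.

```python
# Illustrative sketch only: every key below is an assumed field name, not the
# repo's actual configs/models/<model>.json schema; the values mirror the
# shared decoding settings described above.
import json

example_model_cfg = {
    "model_name": "meta-llama/Llama-3.1-8B-Instruct",  # assumed key for the HF model id
    "max_model_len": 131072,                           # assumed key for the context cap
    "temperature": 0.1,
    "max_new_tokens": 256,
    "batch_size": 16,
}

# Hypothetical file name, for illustration only
with open("configs/models/llama-3.1-8b-instruct.json", "w") as f:
    json.dump(example_model_cfg, f, indent=2)
```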
We evaluated four representative models across increasing context lengths (8k to 131k tokens):
Accuracy Scaling with Context Length
Among evaluated models, Qwen2.5-7B-Instruct demonstrates the strongest positive scaling behavior, with accuracy steadily increasing as context length grows—indicating effective use of additional context. Llama-3.1-8B-Instruct maintains relatively stable accuracy with slight degradation at the longest context lengths. Both mamba-codestral-7B and nemotron-h-8b-r128k underperform Transformers in absolute accuracy. The pure SSM model shows the lowest accuracy overall with limited improvement from extended context, suggesting architectural limitations in long-context reasoning. The hybrid Nemotron model achieves intermediate accuracy with a modest peak at mid-range contexts.
Null Rate and Output Stability
Null prediction rates (indicating unparseable or non-compliant outputs) generally increase with context length. Transformer models maintain relatively controlled null rates, with Qwen2.5-7B showing particularly low rates overall. In contrast, SSM-based and hybrid models exhibit consistently higher null rates across all context lengths, with rates increasing further as context grows. This indicates that a significant fraction of errors in smaller SSM/hybrid models stem from output-format and instruction-following failures rather than purely incorrect reasoning.
| Metric | Value |
|---|---|
| Models Evaluated | Llama-3.1-8B, Qwen2.5-7B, Mamba-Codestral-7B, Nemotron-8B |
| Context Range | 8k–131k tokens |
| Best Performer (Accuracy) | Qwen2.5-7B-Instruct (34% @ 131k ctx) |
| Benchmark | LongBench-v2 |
| Device | NVIDIA A6000 (48 GB VRAM) |
Refer to JSONL logs in `results/` and `pure_ssm/outputs/pure_ssm_logs/` for detailed run-level metrics.
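As a rough illustration of how a run-level JSONL log could be summarized, the sketch below assumes each line carries a parsed prediction and a gold answer; the actual field names in this repo's logs are not documented here and may differ.

```python
# Sketch only: the field names "pred" and "answer" are assumptions about the
# JSONL schema; a missing/None prediction is treated as a null (unparseable) output.
import json

def summarize(jsonl_path):
    total = correct = null = 0
    with open(jsonl_path) as f:
        for line in f:
            rec = json.loads(line)
            total += 1
            pred = rec.get("pred")            # parsed choice, e.g. "A"/"B"/"C"/"D"
            if pred is None:
                null += 1                     # unparseable / non-compliant output
            elif pred == rec.get("answer"):   # gold label
                correct += 1
    return {"accuracy": correct / total, "null_rate": null / total}

# Hypothetical file name, for illustration only
print(summarize("results/example_run.jsonl"))
```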
Clone and install:
```bash
git clone https://github.com/andli28/SSM_experiment
cd SSM_experiment
pip install -r requirements.txt
```

View evaluation metrics here: https://api.wandb.ai/links/davidwz2003-columbia-university/57zty5qi
This repo focuses on inference sweeps. For a single model run:
```bash
python scripts/pred_one.py --model_cfg configs/models/<model>.json --ctx <max_context_length>
```

To sweep all models at configured context lengths:

```bash
python scripts/run_grid.py
```

Set the required context lengths in `configs/global.json`. This will run all models under `configs/models/` at the specified context lengths.
Note: For Qwen-7B, the rope scaling factor must be adjusted for the largest context being run.
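For reference, one common way to apply a YaRN rope-scaling override for Qwen2.5 at 131k context with vLLM is sketched below (factor ≈ target context / 32768, following the Qwen2.5 model card). Whether `scripts/pred_one.py` applies the override this way, or reads it from the per-model JSON, is an assumption.

```python
# Hedged sketch: applying a YaRN rope-scaling override for Qwen2.5 via vLLM's
# hf_overrides. How this repo actually wires the override is an assumption.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=131072,
    hf_overrides={
        "rope_scaling": {
            "rope_type": "yarn",
            "factor": 4.0,                              # 131072 / 32768
            "original_max_position_embeddings": 32768,
        }
    },
)
```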
Transformer results are stored under `results/*.jsonl`. For SSM baselines (8k/16k/32k), run the Colab-style pipeline:
```bash
# Open and run all cells in:
pure_ssm/notebooks/ssm_8k_16k_32k_final.ipynb
```

Logs land in Google Drive under `pure_ssm_logs/` and summaries in `pure_ssm/results/`.
If you want to log results automatically, enable wandb in the global config. Otherwise results and more details can be viewed through results/view.ipynb.
To reproduce a transformer sweep (example):
```bash
# 1) Install dependencies
pip install -r requirements.txt

# 2) Configure contexts in configs/global.json
python - <<'PY'
import json

path = 'configs/global.json'
with open(path) as f:
    cfg = json.load(f)

# Set the context lengths for the sweep
cfg['ctx'] = [65536, 131072]

with open(path, 'w') as f:
    json.dump(cfg, f, indent=2)
PY

# 3) Run sweep
python scripts/run_grid.py

# 4) Inspect outputs
ls results/
```

To reproduce a pure SSM 8k/16k/32k run (Colab):
```bash
# 1) Clone into Drive (in Colab)
git clone https://github.com/andli28/SSM_experiment /content/drive/MyDrive/SSM_experiment

# 2) Open and run all cells
pure_ssm/notebooks/ssm_8k_16k_32k_final.ipynb
```

This repository also includes a pure SSM (Mamba) long-context evaluation pipeline under the `test_notebooks/pure_ssm/` directory. It evaluates Mamba models at 8k/16k/32k context on LongBench-v2, Ada-LEval, and PG-19, and reports both quality and efficiency.
The SSM track uses:
- `state-spaces/mamba-2.8b-hf` (≈2.8B parameters)
- `mistralai/Mamba-Codestral-7B-v0.1` (≈7B parameters, Mamba-2)
Both are run with vLLM and `mamba-ssm` using a shared decoding configuration (greedy decoding, `temperature=0.0`, `max_new_tokens=256`).
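A minimal sketch of that shared decoding setup with vLLM's offline API is shown below; the model-loading arguments are assumptions, and only the decoding settings (greedy, 256 new tokens) come from the description above.

```python
# Minimal sketch of the shared greedy decoding configuration using vLLM's
# offline API; the model-loading arguments are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="state-spaces/mamba-2.8b-hf", max_model_len=8192)
sampling = SamplingParams(temperature=0.0, max_tokens=256)  # greedy, 256 new tokens

outputs = llm.generate(["<long-context prompt goes here>"], sampling)
print(outputs[0].outputs[0].text)
```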
The end-to-end pipeline lives in:
`pure_ssm/notebooks/ssm_8k_16k_32k_final.ipynb`
Typical way to run it (on Google Colab):
- Mount Google Drive and clone this repo into `/content/drive/MyDrive/SSM_experiment`.
- Open `pure_ssm/notebooks/ssm_8k_16k_32k_final.ipynb` in Colab.
- Run all cells. The notebook will:
  - Set up the environment (vLLM, `mamba-ssm`, etc.)
  - Build or load 8k/16k/32k prompt sets for LongBench-v2, Ada-LEval, and PG-19
  - Run pure-SSM baselines for Mamba-2.8B and Mamba-Codestral-7B at each context length
  - Run LongBench multiple-choice evaluation via `pure_ssm/eval_longbench_mc.py`
  - Aggregate metrics into CSV files
- Summary CSVs are written under `pure_ssm/results/`. For example, `pure_ssm/results/ssm_8k.csv` contains quality + efficiency metrics for the 8k context runs.
- Per-run logs (JSONL with prompts, completions, tokens/s, VRAM, etc.) are written to a `pure_ssm_logs/` directory in your Google Drive root: `/content/drive/MyDrive/pure_ssm_logs/`
These outputs can be used to compare the pure SSM models against Jamba and the other baselines in this repository.
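For a quick look at those summaries outside the notebook, something like the pandas sketch below works; the column names are whatever the notebook wrote, so the grouping shown is only a placeholder.

```python
# Sketch only: column names in the summary CSVs are defined by the notebook;
# "model" and "accuracy" below are placeholders, not a documented schema.
import pandas as pd

df = pd.read_csv("pure_ssm/results/ssm_8k.csv")
print(df.head())  # inspect whichever columns the notebook actually wrote
# e.g. df.groupby("model")["accuracy"].mean()  # hypothetical aggregation
```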
General config values are located in `configs/global.json`.
Model-specific configs are located in `configs/models/<model>.json`.
- Results Analysis: `results/view.ipynb` generates detailed analysis of locally stored transformer runs, including raw metric values, prediction statistics, and accuracy plots across context lengths and models. For SSM baselines, summaries are available in `pure_ssm/results/*.csv`.
- Resources:
- Contact: ayl2159@columbia.edu
- Datasets: We acknowledge and thank the maintainers of LongBench‑v2, Ada‑LEval, and PG‑19 for making their datasets available for research.
  - LongBench‑v2: https://longbench2.github.io/
  - Ada‑LEval: https://github.com/Alibaba-NLP/Ada-LEval
  - PG‑19: https://huggingface.co/datasets/pg19
- Serving engine: vLLM (0.11.2) was used for inference and long‑context serving — https://vllm.ai/
- SSM kernels: The `mamba-ssm` library was used for Mamba models — https://pypi.org/project/mamba-ssm/
- Model providers:
  - Meta AI — Llama‑3.1‑8B‑Instruct
  - Alibaba — Qwen2.5‑7B‑Instruct
  - Mistral AI — Mamba‑Codestral‑7B v0.1
  - NVIDIA — Nemotron‑H‑8B‑r128k
  - State‑Spaces (mamba‑2.8b‑hf)
  - AI21 Labs — Jamba‑3B‑Reasoning
- Prior work: We utilize some of the original code from LongBench and Ada‑LEval prediction pipelines in this project.

