
LOCI: A Benchmark for Synthetic Concept Induction During Evaluation

LOCI (Learning Ongoingly through Concept Induction) is a contamination-resistant benchmark measuring in-context concept formation in language models. It was developed for the Kaggle "Measuring Progress Toward AGI" competition (Learning track).

What LOCI Tests

Each episode presents a model with six labeled examples from a synthetic world and asks it to classify eight unlabeled queries. A hidden category is defined by an exact symbolic rule, but the model never sees the rule or the real attribute names — all categorical attributes and values are replaced with episode-local nonce tokens. The only usable information is the relational structure inside the episode.

This design ensures that pretraining recall is useless: the model must induce the concept from local evidence.
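As a rough illustration of this design, the sketch below generates an episode with episode-local nonce tokens and a hidden rule. All names and rule shapes here are hypothetical simplifications; the reference generators live in data_generation/.

```python
import random
import string

def nonce(rng, length=5):
    # Episode-local nonce token, e.g. "qzfmt" -- fresh every episode,
    # so pretraining knowledge about real categories cannot help.
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

def make_episode(seed=0, n_support=6, n_query=8, n_attrs=3, n_values=3):
    rng = random.Random(seed)
    # Nonce names for attributes and their possible values.
    attrs = [nonce(rng) for _ in range(n_attrs)]
    values = {a: [nonce(rng) for _ in range(n_values)] for a in attrs}
    # Hidden rule: here, a simple conjunction over two attribute-value
    # pairs (the benchmark also uses disjunctions and other rule types).
    a1, a2 = rng.sample(attrs, 2)
    def rule(item):
        return item[a1] == values[a1][0] and item[a2] == values[a2][0]

    def sample_item():
        return {a: rng.choice(values[a]) for a in attrs}

    support = [(it, rule(it)) for it in (sample_item() for _ in range(n_support))]
    queries = [sample_item() for _ in range(n_query)]
    return support, queries, rule
```

The model only ever sees the labeled support items and the unlabeled queries; the rule itself stays hidden.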

Three Tasks

Task                        What It Measures
Core Acquisition            Single-turn concept induction from 6 support examples
Hard-Split Generalization   Low-ambiguity subset with zero support-consistent competitor rules
Delayed Retention           Same concepts, but with 3 distractor turns inserted before queries

Key Findings

  • Disjunctions are structurally hardest for every model and for a deterministic structured baseline — models systematically collapse OR rules into conjunction-like hypotheses
  • Current frontier models reach up to 83.1% query accuracy but only 53.8% exact-episode rate on core acquisition
  • A deterministic hypothesis-testing baseline outperforms all six evaluated models
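A hypothesis-testing baseline of the kind mentioned above can be sketched as follows: enumerate a fixed space of candidate rules over the attribute-value pairs seen in the support set, keep only those consistent with all labeled examples, and classify each query by majority vote among the survivors. This is an illustrative reconstruction, not the repository's exact reference implementation.

```python
from itertools import combinations

def candidate_rules(support):
    # Hypothesis space: single literals, plus pairwise ANDs and ORs
    # over attribute-value pairs observed in the support examples.
    items = [item for item, _ in support]
    literals = sorted({(a, v) for item in items for a, v in item.items()})
    rules = [("lit", ((a, v),)) for a, v in literals]
    for p, q in combinations(literals, 2):
        rules.append(("and", (p, q)))
        rules.append(("or", (p, q)))
    return rules

def applies(rule, item):
    kind, lits = rule
    hits = [item.get(a) == v for a, v in lits]
    return all(hits) if kind in ("lit", "and") else any(hits)

def predict(support, query):
    # Keep hypotheses consistent with every labeled support example,
    # then classify the query by majority vote among the survivors.
    consistent = [r for r in candidate_rules(support)
                  if all(applies(r, it) == lbl for it, lbl in support)]
    if not consistent:
        return False  # fall back when nothing in the space fits
    votes = sum(applies(r, query) for r in consistent)
    return votes * 2 >= len(consistent)
```

Because the rule space is enumerated explicitly, disjunctions are represented on equal footing with conjunctions, yet the paper's finding is that OR rules remain the hardest category even for this baseline.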

Repository Structure

LOCI_kaggle_tasks_v1_1/     # Benchmark task code and data
  data/                     # Evaluation splits (public_dev + private_test per task)
  data_generation/          # Reference generation scripts
  notebooks/                # Kaggle Benchmarks task notebooks
paper/v6/                   # Research paper (LaTeX source, figures, tables, backing data)
writeup/                    # Kaggle competition writeup

Evaluated Models

Gemini 2.5 Flash, Claude Sonnet 4.6, DeepSeek-R1, Claude Haiku 4.5, DeepSeek V3.2, Gemma 3 27B

Running the Benchmark

The benchmark runs on the Kaggle Benchmarks platform. See LOCI_kaggle_tasks_v1_1/README.md for setup instructions.

License

CC0 (as required by competition rules). See LICENSE.

Citation

If you use LOCI in your research, please cite:

@misc{bahrani2026loci,
  title={LOCI: A Benchmark for Synthetic Concept Induction During Evaluation},
  author={Bahrani, Homam},
  year={2026}
}
