NewBestSub

NewBestSub is a research tool for efficient topic-set reduction in IR evaluation. Its objective is to identify small subsets of topics that preserve, as closely as possible, the system ranking induced by the full topic set.

The selection task is formulated as a bi-objective optimization—minimizing subset cardinality K while maximizing rank agreement with the full set—solved via NSGA-II. Given a system-by-topic AP matrix and a chosen correlation metric (Pearson or Kendall), the method explores the search space and streams results to CSV/Parquet, including best/worst subsets and percentile summaries. Deterministic execution is supported for reproducibility.

⚠️ Important notice: Migration from 1.0 to 2.0

Starting with version 2.0, NewBestSub has undergone major refactoring:

  • Requires Java 22 and Kotlin 2.2 (previously Java 8 and Kotlin 1.x).
  • Output format is now streaming-based with FUN, VAR, and TOP separation and 6-digit correlation precision.
  • New deterministic mode and reproducible folder naming pattern.
  • Updated CLI options (-fi, -c, -t, and others); see the CLI options section.
  • Artifacts renamed: NewBestSub-2.0-jar-with-dependencies.jar instead of NewBestSub-1.0-....

If you have scripts or pipelines based on 1.0, they will not work unmodified. Please adapt your datasets, CLI calls, and parsing code to the new schema.


Features

  • BEST, WORST, and AVERAGE experiments powered by NSGA-II (jMetal 6.9.x).
  • Streaming engine:
    • FUN and VAR: append only when the per-K representative improves for Best and Worst; exactly one row per K for Average.
    • TOP: replace the entire per-K block on change (up to 10 rows), keeping output stable.
  • MNDS environmental selection (Merge Non-Dominated Sort) plus local crowding distance.
  • Incremental evaluation ("Step 3"): O(S) updates via cached per-system sums and fixed-K swap hints.
  • Compact genotype I/O: VAR and mask-based TOP carry packed Base64 bitmasks (see layout below).
  • Dual outputs: streaming-first CSV and Parquet; consistent 6-digit precision for correlations.
  • Deterministic mode: explicit --seed or stable derived seed from run parameters.

Requirements

  • JDK 22+ (set JAVA_HOME to a JDK, not a JRE)
  • Maven 3.9+
  • Internet access (for dependency downloads on first build)

Build

mvn -DskipTests=false clean package

Artifacts:

target/NewBestSub-2.0.jar
target/NewBestSub-2.0-jar-with-dependencies.jar
target/NewBestSub-2.0-test-jar-with-dependencies.jar

Release artifacts & versioning

We pin the distribution filenames to a major/minor track for stable scripting, while the Maven project version keeps full semver (including patches). The latest release is 2.0.4.

  • Release assets (fat & test-fat jars)
    Produced by the Assembly plugin using a pinned dist.version
    NewBestSub-2.0-jar-with-dependencies.jar
    NewBestSub-2.0-test-jar-with-dependencies.jar

  • Local Maven artifact (thin jar)
    Also pinned via finalName to the same dist.version
    target/NewBestSub-2.0.jar

  • Why this pinning?
    Patch bumps (2.0.1, 2.0.2, 2.0.3, 2.0.4, …) won’t break scripts or docs that reference the stable NewBestSub-2.0*.jar names.

  • Coordinates remain precise
    If you depend via Maven/Gradle, you still pull the full project version (e.g. 2.0.4):
    it.uniud.newbestsub:NewBestSub:2.0.4

  • Override the label (optional)

    mvn -Ddist.version=2.1 clean package

    → produces NewBestSub-2.1.jar, NewBestSub-2.1-jar-with-dependencies.jar, etc. (POM version unchanged).

When building from source, you’ll typically see both pinned artifacts:

  • target/NewBestSub-2.0.jar (thin jar)
  • target/NewBestSub-2.0-jar-with-dependencies.jar (fat jar)

Run from release

Instead of building from source, you can download a ready-to-use JAR:

  1. Go to the Releases page.

  2. Download the asset named:

    NewBestSub-<version>-jar-with-dependencies.jar
    

    (Pinned distribution name example: NewBestSub-2.0-jar-with-dependencies.jar).

  3. Run it with Java 22:

    java -Xmx4g -jar NewBestSub-2.0-jar-with-dependencies.jar --help
  4. Example toy run:

    java -Xmx1g -jar NewBestSub-2.0-jar-with-dependencies.jar -fi samples/toy -c Pearson -t Best -po 50 -i 5000 -log Limited

Quick start

Show help:

java -Xmx4g -jar target/NewBestSub-2.0-jar-with-dependencies.jar --help

Small toy run (adjust -i, -po, and -r to your machine):

# BEST (Pearson)
java -Xmx1g -jar target/NewBestSub-2.0-jar-with-dependencies.jar -fi samples/toy -c Pearson -t Best -po 50 -i 5000 -log Limited

# WORST (Pearson)
java -Xmx1g -jar target/NewBestSub-2.0-jar-with-dependencies.jar -fi samples/toy -c Pearson -t Worst -po 50 -i 5000 -log Limited

# AVERAGE (Kendall, 5th to 95th percentiles, 100 reps)
java -Xmx1g -jar target/NewBestSub-2.0-jar-with-dependencies.jar -fi samples/toy -c Kendall -t Average -r 100 -pe 5,95 -log Limited

# ALL (combines Best, Worst, and Average)
java -Xmx1g -jar target/NewBestSub-2.0-jar-with-dependencies.jar -fi samples/toy -c Pearson -t All -po 50 -i 5000 -r 100 -pe 1,100 -log Limited

Tip: on large runs ensure -Xmx is sized appropriately, and keep -po >= #topics.
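
To size -po before a large run, you can count the topics (columns after the empty first header cell) in your dataset; a minimal Python sketch, using the bundled samples/toy.csv as an example:

# Count topics in the dataset header to size -po accordingly.
with open("samples/toy.csv") as handle:
    header = handle.readline().rstrip("\n").split(",")
n_topics = len(header) - 1
print(f"{n_topics} topics; pass -po >= {n_topics}")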


Dataset schema

CSV layout (wide table, systems by topics of AP values):

  • Header row: first cell is empty; remaining cells are topic labels (t1,t2,...).
  • Rows: first cell is a system label; remaining cells are Average Precision (AP) values for each topic.
  • AP range: expected 0.0 .. 1.0 (no hard clamp). Missing values are not allowed.
  • Order: row and column order is preserved and used in outputs.
  • Constraint: when running Best, Worst, or All, population must satisfy -po >= #topics.

Example (samples/toy.csv):

,t1,t2,t3,t4,t5
BM25,0.31,0.45,0.22,0.18,0.40
QL,0.28,0.48,0.19,0.21,0.36
RM3,0.40,0.51,0.26,0.17,0.42
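
A matrix in this layout can be produced with standard tooling; a minimal sketch, assuming pandas is available and using illustrative AP values:

import pandas as pd

# Illustrative AP values: rows are systems, columns are topics.
ap = pd.DataFrame(
    {"t1": [0.31, 0.28, 0.40], "t2": [0.45, 0.48, 0.51], "t3": [0.22, 0.19, 0.26]},
    index=["BM25", "QL", "RM3"],
)

# An unnamed index leaves the first header cell empty, as required;
# each data row then starts with the system label.
ap.to_csv("samples/toy.csv")

The run is then started with the basename, e.g. -fi samples/toy (no extension).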

Outputs

Results are written into a per-run container named from parameters (see Folder naming pattern). Each container has CSV/ and Parquet/ subfolders.

CSV

  • ...-Fun.csv contains K corr (space-separated; K integer; corr with 6 digits; always natural correlation).
  • ...-Var.csv contains K B64:<payload> where <payload> is the packed Base64 mask.
  • ...-Top-10-Solutions.csv is either:
    • K,Correlation,B64:<payload> (mask-based), or
    • K,Correlation,topic1;topic2;... (label-based), depending on the view configuration.
  • ...-Final.csv and ...-Info.csv contain summaries and metadata.

Selection recipes (per K), illustrated by the sketch after this list:

  • FUN (after the view’s final rewrite)
    • BEST → file is sorted (K asc, corr asc); pick the LAST row per K (max).
    • WORST → file is sorted (K asc, corr desc); pick the LAST row per K (min).
  • TOP
    • BEST → lines are in descending order; pick the FIRST (max).
    • WORST → lines are in ascending order; pick the FIRST (min).
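
Because of these orderings, the last row of each K block in FUN is always the per-K representative (maximum for Best, minimum for Worst); a minimal Python sketch of that selection, assuming the space-separated K corr layout and a hypothetical file path:

def representative_per_k(fun_csv_path: str) -> dict[int, float]:
    """Return the per-K representative correlation from a ...-Fun.csv file.

    Rows within a K block are ordered so that the last one is the representative
    (max for Best, min for Worst), per the selection recipe above.
    """
    per_k: dict[int, float] = {}
    with open(fun_csv_path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 2:
                continue
            k, corr = int(fields[0]), float(fields[1])
            per_k[k] = corr  # later rows in the same K block overwrite earlier ones
    return per_k

# Example (hypothetical path inside a run's CSV/ subfolder):
# print(representative_per_k("CSV/run-Fun.csv"))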

Parquet

  • ...-Fun.parquet schema: { K:int, Correlation:double }
  • ...-Var.parquet schema: { K:int, Mask:string } (bare Base64, no B64: prefix)
  • ...-Top-10-Solutions.parquet schema: { K:int, Correlation:double, MaskOrTopics:string }
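
The Parquet files can be read with standard dataframe tooling; a minimal sketch, assuming pandas with a Parquet engine such as pyarrow is installed, and using hypothetical file names inside a run's Parquet/ subfolder:

import pandas as pd

fun = pd.read_parquet("Parquet/run-Fun.parquet")  # columns: K, Correlation
var = pd.read_parquet("Parquet/run-Var.parquet")  # columns: K, Mask (bare Base64)

# VAR rows are aligned with FUN rows, so a positional join pairs each
# correlation with the mask of the subset that produced it.
merged = pd.concat([fun, var["Mask"]], axis=1)
print(merged.head())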

VAR/TOP Base64 layout

We pack the topic-presence mask into Base64 as follows:

  • Pack bits into 64-bit words, LSB-first within each word (bit 0 becomes topic 0).
  • Serialize each 64-bit word as little-endian bytes.
  • Encode the concatenated bytes using Base64 without padding.
  • CSV uses a B64: prefix; Parquet stores the bare payload.
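
The encoding direction follows the same layout; a minimal Python sketch (illustrative, not the production encoder):

import base64

def encode_mask_to_base64(mask: list[bool]) -> str:
    """Pack a topic-presence mask: LSB-first bits within 64-bit words,
    little-endian bytes per word, unpadded Base64 (no B64: prefix)."""
    raw = bytearray()
    for block_start in range(0, len(mask), 64):
        word = 0
        for bit, present in enumerate(mask[block_start:block_start + 64]):
            if present:
                word |= 1 << bit  # bit 0 of the word maps to the first topic of the block
        raw += word.to_bytes(8, "little")
    return base64.b64encode(bytes(raw)).decode("ascii").rstrip("=")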

Decoding snippets

Kotlin (mirror of production code):

fun decodeMaskFromBase64(b64OrPrefixed: String, expectedSize: Int): BooleanArray {
    val s = b64OrPrefixed.trim()
    val start = s.indexOf("B64:")
    val payload = if (start >= 0) s.substring(start + 4) else s

    val raw = java.util.Base64.getDecoder().decode(payload)
    val out = BooleanArray(expectedSize)

    var bitAbsolute = 0
    var offset = 0
    while (offset < raw.size && bitAbsolute < expectedSize) {
        // Assemble one 64-bit little-endian word
        var w = 0L
        var i = 0
        while (i < 8 && offset + i < raw.size) {
            val b = (raw[offset + i].toLong() and 0xFFL) // 0xFFL (Long literal) keeps the byte unsigned
            w = w or (b shl (8 * i))
            i++
        }
        // Expand 64 LSB-first bits
        var bitInWord = 0
        while (bitInWord < 64 && bitAbsolute < expectedSize) {
            out[bitAbsolute] = ((w ushr bitInWord) and 1L) != 0L
            bitInWord++
            bitAbsolute++
        }
        offset += 8
    }
    return out
}

Python:

import base64

def decode_mask_from_base64(b64_or_prefixed: str, n_topics: int) -> list[bool]:
    s = b64_or_prefixed.strip()
    idx = s.find("B64:")
    payload = s[idx+4:] if idx >= 0 else s

    missing_padding = len(payload) % 4
    if missing_padding:
        payload += "=" * (4 - missing_padding)

    raw = base64.b64decode(payload)
    out = [False] * n_topics

    bit_abs = 0
    off = 0
    while off < len(raw) and bit_abs < n_topics:
        # Assemble one 64-bit little-endian word
        w = 0
        for i in range(8):
            if off + i >= len(raw):
                break
            w |= raw[off + i] << (8 * i)
        # Expand 64 LSB-first bits
        for bit in range(64):
            if bit_abs >= n_topics:
                break
            out[bit_abs] = ((w >> bit) & 1) != 0
            bit_abs += 1
        off += 8
    return out
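
For example, to map a ...-Var.csv line back to topic labels (assuming the space-separated K B64:<payload> layout described above; values are illustrative):

topic_labels = ["t1", "t2", "t3", "t4", "t5"]  # from the dataset header
line = "3 B64:FQAAAAAAAAA"                     # illustrative VAR row (K = 3)

k_str, payload = line.split(maxsplit=1)
mask = decode_mask_from_base64(payload, len(topic_labels))
selected = [label for label, kept in zip(topic_labels, mask) if kept]
print(int(k_str), selected)  # -> 3 ['t1', 't3', 't5']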

Additional notes

  • FUN: improvements-only for Best/Worst, one row per K for Average.
  • VAR: aligned with FUN rows.
  • TOP: up to 10 rows per K, replaced atomically.
  • Negative correlations: expected for Worst, possible at small K in Average.

CLI options

Required

  • -fi, --fileIn <file> input CSV basename (no extension)
  • -c, --corr <Pearson|Kendall> correlation method
  • -t, --targ <Best|Worst|Average|All> target
  • -l, --log <Verbose|Limited|Off> logging level

Optional, general

  • --copy copy results into NewBestSub-Experiments sibling folder
  • -det, --deterministic enable deterministic mode
  • -sd, --seed <long> master seed (implies deterministic mode)
  • -mr, --mrg <int> merge N executions
  • -et, --expt <int> add fake topics per step
  • -es, --exps <int> add fake systems per step
  • -mx, --max <int> cap for expansions

Optional, target-specific

  • -i, --iter <int> iterations (Best, Worst, All)
  • -po, --pop <int> population size (must be >= number of topics; Best, Worst, All)
  • -r, --rep <int> repetitions per K (Average, All)
  • -pe, --perc <a,b> percentile range (Average, All)

Deterministic execution

  • --seed <long> sets the master seed explicitly.
  • --deterministic enables reproducibility. If --seed is absent, a stable seed is derived.
  • The effective seed is logged and embedded in the output folder name.

Folder naming pattern

Example:

AH99-Pearson-top50-sys129-po1000-i10000-r2000-time2025-08-22-19-19-40

Pattern:

<DATASET>-<CORR>-top<Topics>-sys<Systems>-po<Population>-i<Iterations>[-r<Repetitions>][-exec<Executions>][-seed<Seed>][-det]-time<YYYY-MM-DD-HH-mm-ss>

Tools

parse_run_name.py

Located in tools/parse_run_name.py. Parses run folder names back into a parameter dictionary.

Example:

python tools/parse_run_name.py "mmlu-Pearson-top18955-sys34-po20000-i100000-r2000-time2025-08-24-17-15-53" --pretty

Output:

{
  "dataset": "mmlu",
  "correlation": "Pearson",
  "topics": 18955,
  "systems": 34,
  "population": 20000,
  "iterations": 100000,
  "repetitions": 2000,
  "executions": null,
  "seed": null,
  "deterministic": false,
  "timestamp": "2025-08-24-17-15-53"
}

Architecture overview

  • DatasetModel: loads data, wires correlation and targets, runs NSGA-II, emits streaming events.
  • Streaming NSGA-II wrapper: calls a per-generation hook and uses MNDS plus crowding in replacement.
  • Incremental evaluation: cached per-system sums and swap deltas.
  • Operators: BinaryPruningCrossover, FixedKSwapMutation.

Testing

JUnit 5 with Surefire 3.x:

mvn -DskipTests=false -Dmaven.test.skip=false -Dsurefire.printSummary=true test

Key test areas (2.0.4):

  • Views (CSV/Parquet) with BEST/WORST, including negative/mixed correlations.
  • Model AVERAGE path: one representative per K; percentile completeness.
  • Operators: FixedKSwap mutation (fixed-K swap/repair); BinaryPruning crossover (AND/OR + feasibility repair).
  • Bitmask encoding/decoding round trips and edge cases across 64-bit boundaries.

Build and logging

Key versions (see pom.xml):

  • Kotlin 2.2.0, Java 22, jMetal 6.9.1
  • Parquet 1.15.2 and Hadoop 3.3.6, Log4j2 2.24.3, JUnit 5.13.4

Logging:

  • Console plus rolling file (log4j2.xml)
  • Uses baseLogFileName system property for destinations

Troubleshooting

  • OutOfMemoryError: increase -Xmx, or reduce -po, -i, or -r.
  • Population smaller than topics: ensure -po >= #topics.
  • TOP incomplete: blocks are emitted only when filled; small runs may delay some Ks.
  • Multiple SLF4J bindings: use Log4j2 only.

Changelog

See CHANGELOG.md for the full history.

2.0.4 — 2025‑09‑03

  • NATURAL vs INTERNAL clarified; selection rules for FUN/TOP documented.
  • AVERAGE percentile interpretation documented with selection recipe.
  • Program: early -po ≥ #topics validation + friendlier runtime error handling.
  • Tests: BEST/WORST view goldens (incl. negative/mixed); operator tests.

2.0.3 — 2025‑08‑29

  • Container folder names now include the iteration token when provided by CLI.
  • AVERAGE CSV/Parquet filenames omit the iteration token by design.
  • numberOfIterations assigned at start of DatasetModel.solve() for consistent naming.

2.0.0 — 2025‑08‑24

  • Java 22 + Kotlin 2.2 toolchain; jMetal 6.9.1.
  • Streaming NSGA‑II + MNDS; incremental evaluation with swap hints.
  • Compact VAR/TOP Base64 masks; 6‑digit correlation precision.

Citation

Software archive: https://doi.org/10.5281/zenodo.16950389

  • Kevin Roitero, Michael Soprano, Andrea Brunello, Stefano Mizzaro. Reproduce and Improve: An Evolutionary Approach to Select a Few Good Topics for Information Retrieval Evaluation. ACM JDIQ 10(3), Article 12, 2018. https://doi.org/10.1145/3239573
  • Kevin Roitero, Michael Soprano, Stefano Mizzaro. Effectiveness Evaluation with a Subset of Topics: A Practical Approach. In SIGIR '18, 1145–1148. https://doi.org/10.1145/3209978.3210108

License

Released under the MIT License.
