NewBestSub is a research tool for efficient topic-set reduction in IR evaluation. Its objective is to identify small subsets of topics that preserve, as closely as possible, the system ranking induced by the full topic set.
The selection task is formulated as a bi-objective optimization—minimizing subset cardinality K while maximizing rank agreement with the full set—solved via NSGA-II. Given a system-by-topic AP matrix and a chosen correlation metric (Pearson or Kendall), the method explores the search space and streams results to CSV/Parquet, including best/worst subsets and percentile summaries. Deterministic execution is supported for reproducibility.
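To make the rank-agreement objective concrete, here is a minimal sketch of scoring one candidate subset. It assumes each system is scored by its mean AP over the selected topics (the aggregation is an assumption of this sketch), and the helper names mean_ap, pearson, kendall_tau_a, and subset_correlation are illustrative, not part of NewBestSub. The matrix reuses the values from samples/toy.csv.

```python
from math import sqrt

# Rows are systems (BM25, QL, RM3); columns are topics t1..t5.
AP = [
    [0.31, 0.45, 0.22, 0.18, 0.40],
    [0.28, 0.48, 0.19, 0.21, 0.36],
    [0.40, 0.51, 0.26, 0.17, 0.42],
]

def mean_ap(ap, topics):
    """Mean AP of every system over the given topic indices."""
    topics = list(topics)
    return [sum(row[t] for t in topics) / len(topics) for row in ap]

def pearson(x, y):
    """Pearson correlation between two equally long score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)

def subset_correlation(ap, subset, corr=pearson):
    """Agreement between full-set and subset system scores."""
    full = mean_ap(ap, range(len(ap[0])))
    return corr(full, mean_ap(ap, subset))

# Topic t1 alone orders the three systems like the full set; t4 alone reverses it.
print(subset_correlation(AP, [0], kendall_tau_a))
print(subset_correlation(AP, [3], kendall_tau_a))
```

The second objective is simply the subset size K, that is, len(subset).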
⚠️ Important notice: Migration from 1.0 to 2.0
Starting with version 2.0, NewBestSub has undergone major refactoring:
- Requires Java 22 and Kotlin 2.2 (previously Java 8 and Kotlin 1.x).
- Output format is now streaming-based with FUN, VAR, and TOP separation and 6-digit correlation precision.
- New deterministic mode and reproducible folder naming pattern.
- Updated CLI options (-fi, -c, -t, and others); see CLI options.
- Artifacts renamed: NewBestSub-2.0-jar-with-dependencies.jar instead of NewBestSub-1.0-....
If you have scripts or pipelines based on 1.0, they will not work unmodified. Please adapt your datasets, CLI calls, and parsing code to the new schema.
- Features
- Requirements
- Build
- Release artifacts & versioning
- Run from release
- Quick start
- Dataset schema
- Outputs
- CLI options
- Deterministic execution
- Folder naming pattern
- Tools
- Architecture overview
- Testing
- Build and logging
- Troubleshooting
- Changelog
- Citation
- License
Features
- BEST, WORST, and AVERAGE experiments powered by NSGA-II (jMetal 6.9.x).
- Streaming engine:
  - FUN and VAR: append only when the per-K representative improves for Best and Worst; exactly one row per K for Average.
  - TOP: replace the entire per-K block on change (up to 10 rows), keeping output stable.
- MNDS environmental selection (Merge Non-Dominated Sort) plus local crowding distance.
- Incremental evaluation ("Step 3"): O(S) updates via cached per-system sums and fixed-K swap hints.
- Compact genotype I/O: VAR and mask-based TOP carry packed Base64 bitmasks (see layout below).
- Dual outputs: streaming-first CSV and Parquet; consistent 6-digit precision for correlations.
- Deterministic mode: explicit --seed or stable derived seed from run parameters.
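The incremental evaluation idea can be illustrated with a small sketch. It assumes S is the number of systems and that a fixed-K move swaps one selected topic for one unselected topic; the class name SubsetSums and its methods are hypothetical and do not mirror the production code.

```python
class SubsetSums:
    """Caches, for each system, the sum of AP over the selected topics."""

    def __init__(self, ap, selected):
        self.ap = ap                      # ap[s][t]: AP of system s on topic t
        self.selected = set(selected)     # indices of the selected topics
        self.sums = [sum(row[t] for t in self.selected) for row in ap]

    def swap(self, out_topic, in_topic):
        """Replace one selected topic with an unselected one in O(S)."""
        if out_topic not in self.selected or in_topic in self.selected:
            raise ValueError("swap must remove a selected topic and add an unselected one")
        for s, row in enumerate(self.ap):
            # One addition and one subtraction per system; no re-summing over K topics.
            self.sums[s] += row[in_topic] - row[out_topic]
        self.selected.remove(out_topic)
        self.selected.add(in_topic)

    def means(self):
        """Per-system mean AP over the current subset."""
        k = len(self.selected)
        return [total / k for total in self.sums]


ap = [
    [0.31, 0.45, 0.22, 0.18, 0.40],
    [0.28, 0.48, 0.19, 0.21, 0.36],
    [0.40, 0.51, 0.26, 0.17, 0.42],
]
state = SubsetSums(ap, [0, 1])
state.swap(out_topic=1, in_topic=4)   # subset becomes {t1, t5}; K stays 2
print(state.means())
```

The cached sums stay valid across swaps, so the correlation for the new subset can be recomputed from S numbers rather than from S x K matrix cells.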
Requirements
- JDK 22+ (set JAVA_HOME to a JDK, not a JRE)
- Maven 3.9+
- Internet access (for dependency downloads on first build)
Build
mvn -DskipTests=false clean package
Artifacts:
target/NewBestSub-2.0.jar
target/NewBestSub-2.0-jar-with-dependencies.jar
target/NewBestSub-2.0-test-jar-with-dependencies.jar
Release artifacts & versioning
We pin the distribution filenames to a major/minor track for stable scripting, while the Maven project version keeps full semver (including patches). The latest release is 2.0.4.
- Release assets (fat & test-fat jars): produced by the Assembly plugin using a pinned dist.version →
  NewBestSub-2.0-jar-with-dependencies.jar
  NewBestSub-2.0-test-jar-with-dependencies.jar
- Local Maven artifact (thin jar): also pinned via finalName to the same dist.version →
  target/NewBestSub-2.0.jar
- Why this pinning? Patch bumps (2.0.1, 2.0.2, 2.0.3, 2.0.4, …) won't break scripts or docs that reference the stable NewBestSub-2.0*.jar names.
- Coordinates remain precise: if you depend via Maven/Gradle, you still pull the full project version (e.g. 2.0.4):
  it.uniud.newbestsub:NewBestSub:2.0.4
- Override the label (optional):
  mvn -Ddist.version=2.1 clean package
  → produces NewBestSub-2.1.jar, NewBestSub-2.1-jar-with-dependencies.jar, etc. (POM version unchanged).
When building from source, you'll typically see both pinned artifacts:
- target/NewBestSub-2.0.jar (thin jar)
- target/NewBestSub-2.0-jar-with-dependencies.jar (fat jar)
Run from release
Instead of building from source, you can download a ready-to-use JAR:
- Go to the Releases page.
- Download the asset named NewBestSub-<version>-jar-with-dependencies.jar (pinned distribution name example: NewBestSub-2.0-jar-with-dependencies.jar).
- Run it with Java 22:
  java -Xmx4g -jar NewBestSub-2.0-jar-with-dependencies.jar --help
- Example toy run:
  java -Xmx1g -jar NewBestSub-2.0-jar-with-dependencies.jar -fi samples/toy -c Pearson -t Best -po 50 -i 5000 -log Limited
Quick start
Show help:
java -Xmx4g -jar target/NewBestSub-2.0-jar-with-dependencies.jar --help
Small toy run (adjust -i, -po, and -r to your machine):
# BEST (Pearson)
java -Xmx1g -jar target/NewBestSub-2.0-jar-with-dependencies.jar -fi samples/toy -c Pearson -t Best -po 50 -i 5000 -log Limited
# WORST (Pearson)
java -Xmx1g -jar target/NewBestSub-2.0-jar-with-dependencies.jar -fi samples/toy -c Pearson -t Worst -po 50 -i 5000 -log Limited
# AVERAGE (Kendall, 5th to 95th percentiles, 100 reps)
java -Xmx1g -jar target/NewBestSub-2.0-jar-with-dependencies.jar -fi samples/toy -c Kendall -t Average -r 100 -pe 5,95 -log Limited
# ALL (combine Best, Worst, and Average)
java -Xmx1g -jar target/NewBestSub-2.0-jar-with-dependencies.jar -fi samples/toy -c Pearson -t All -po 50 -i 5000 -r 100 -pe 1,100 -log Limited
Tip: on large runs, ensure -Xmx is sized appropriately, and keep -po >= #topics.
Dataset schema
CSV layout (wide table, systems by topics of AP values):
- Header row: first cell is empty; remaining cells are topic labels (t1, t2, ...).
- Rows: first cell is a system label; remaining cells are Average Precision (AP) values for each topic.
- AP range: expected 0.0 .. 1.0 (no hard clamp). Missing values are not allowed.
- Order: row and column order is preserved and used in outputs.
- Constraint: when running Best, Worst, or All, population must satisfy -po >= #topics.
Example (samples/toy.csv):
,t1,t2,t3,t4,t5
BM25,0.31,0.45,0.22,0.18,0.40
QL,0.28,0.48,0.19,0.21,0.36
RM3,0.40,0.51,0.26,0.17,0.42
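As a rough illustration of the schema rules above, the following sketch reads a wide AP table and checks it before a run. The function names load_ap_csv and check_population and the error messages are hypothetical; NewBestSub's own loader is written in Kotlin and may validate differently.

```python
import csv
import io

TOY_CSV = """,t1,t2,t3,t4,t5
BM25,0.31,0.45,0.22,0.18,0.40
QL,0.28,0.48,0.19,0.21,0.36
RM3,0.40,0.51,0.26,0.17,0.42
"""

def load_ap_csv(text):
    """Parse a systems-by-topics AP table, preserving row and column order."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    header = rows[0]
    topics = header[1:]                      # first header cell is empty
    systems, ap = [], []
    for row in rows[1:]:
        if len(row) != len(header) or any(cell.strip() == "" for cell in row[1:]):
            raise ValueError(f"missing AP value in row for system {row[0]!r}")
        systems.append(row[0])
        ap.append([float(cell) for cell in row[1:]])
    return topics, systems, ap

def check_population(population, n_topics):
    """Mirror the documented constraint for Best, Worst, and All: -po >= #topics."""
    if population < n_topics:
        raise ValueError(f"-po {population} is smaller than the number of topics ({n_topics})")

topics, systems, ap = load_ap_csv(TOY_CSV)
check_population(50, len(topics))
print(topics, systems)
```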
Outputs
Results are written into a per-run container named from parameters (see Folder naming pattern).
Each container has CSV/ and Parquet/ subfolders.
- ...-Fun.csv contains K corr (space-separated; K integer; corr with 6 digits; always natural correlation).
- ...-Var.csv contains K B64:<payload>, where <payload> is the packed Base64 mask.
- ...-Top-10-Solutions.csv is either K,Correlation,B64:<payload> (mask-based) or K,Correlation,topic1;topic2;... (label-based), depending on the view configuration.
- ...-Final.csv and ...-Info.csv contain summaries and metadata.
Selection recipes (per K):
- FUN (after the view's final rewrite)
  - BEST → file is (K asc, corr asc) → pick LAST (max).
  - WORST → file is (K asc, corr desc) → pick LAST (min).
- TOP
  - BEST → lines are descending → pick FIRST (max).
  - WORST → lines are ascending → pick FIRST (min).
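The recipes above reduce to "last row per K" for FUN and "first row per K" for TOP, for both Best and Worst. A minimal sketch follows; the function names are hypothetical, the sample rows are made up for illustration, and the code assumes FUN lines look like "K corr" and TOP lines like "K,Correlation,...", skipping anything whose first field is not an integer.

```python
def pick_fun_representatives(lines):
    """FUN: keep the LAST correlation seen for each K."""
    last = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[0].isdigit():
            continue                      # skip blank or unexpected lines
        last[int(parts[0])] = float(parts[1])
    return last

def pick_top_representatives(lines):
    """TOP: keep the FIRST correlation seen for each K."""
    first = {}
    for line in lines:
        fields = line.strip().split(",", 2)
        if len(fields) < 2 or not fields[0].strip().isdigit():
            continue                      # skip blank lines and any header row
        first.setdefault(int(fields[0]), float(fields[1]))
    return first

# Made-up rows in the documented shapes (BEST ordering).
fun_lines = ["1 0.412345", "1 0.538921", "2 0.701234", "2 0.764310"]
top_lines = ["1,0.538921,t3", "1,0.412345,t1", "2,0.764310,t1;t3"]

print(pick_fun_representatives(fun_lines))
print(pick_top_representatives(top_lines))
```

For a Worst run the same two functions apply unchanged, because the file ordering already places the minimum last in FUN and first in TOP.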
- ...-Fun.parquet schema: { K:int, Correlation:double }
- ...-Var.parquet schema: { K:int, Mask:string } (bare Base64, no B64: prefix)
- ...-Top-10-Solutions.parquet schema: { K:int, Correlation:double, MaskOrTopics:string }
We pack the topic-presence mask into Base64 as follows:
- Pack bits into 64-bit words, LSB-first within each word (bit 0 becomes topic 0).
- Serialize each 64-bit word as little-endian bytes.
- Encode the concatenated bytes using Base64 without padding.
- CSV uses a B64: prefix; Parquet stores the bare payload.
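The packing steps above can be sketched as an encoder. This is an illustration written from the four rules only, not a copy of the production code; the name encode_mask_to_base64 is hypothetical, and the sketch assumes every word is written as a full 8 bytes, including the last one.

```python
import base64

def encode_mask_to_base64(mask, prefixed=True):
    """Pack a topic-presence mask into unpadded Base64 (optionally with the CSV prefix)."""
    raw = bytearray()
    for start in range(0, len(mask), 64):
        word = 0
        # LSB-first: the first topic of each 64-topic chunk lands in bit 0.
        for bit, present in enumerate(mask[start:start + 64]):
            if present:
                word |= 1 << bit
        raw += word.to_bytes(8, "little")     # little-endian 64-bit word
    payload = base64.b64encode(bytes(raw)).decode("ascii").rstrip("=")
    return "B64:" + payload if prefixed else payload

# Topics 0 and 2 selected out of 3: the single word has value 0b101 = 5.
print(encode_mask_to_base64([True, False, True]))          # B64:BQAAAAAAAAA
print(encode_mask_to_base64([True, False, True], False))   # BQAAAAAAAAA
```

Pass prefixed=False to get the bare payload form that the Parquet files store.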
Kotlin (mirror of production code):
fun decodeMaskFromBase64(b64OrPrefixed: String, expectedSize: Int): BooleanArray {
    // Accept both the CSV form ("B64:<payload>") and the bare Parquet payload
    val s = b64OrPrefixed.trim()
    val start = s.indexOf("B64:")
    val payload = if (start >= 0) s.substring(start + 4) else s
    val raw = java.util.Base64.getDecoder().decode(payload)
    val out = BooleanArray(expectedSize)
    var bitAbsolute = 0
    var offset = 0
    while (offset < raw.size && bitAbsolute < expectedSize) {
        // Assemble one 64-bit little-endian word
        var w = 0L
        var i = 0
        while (i < 8 && offset + i < raw.size) {
            val b = (raw[offset + i].toLong() and 0xFFL) // <-- note the L
            w = w or (b shl (8 * i))
            i++
        }
        // Expand 64 LSB-first bits
        var bitInWord = 0
        while (bitInWord < 64 && bitAbsolute < expectedSize) {
            out[bitAbsolute] = ((w ushr bitInWord) and 1L) != 0L
            bitInWord++
            bitAbsolute++
        }
        offset += 8
    }
    return out
}
Python:
import base64

def decode_mask_from_base64(b64_or_prefixed: str, n_topics: int) -> list[bool]:
    s = b64_or_prefixed.strip()
    idx = s.find("B64:")
    payload = s[idx+4:] if idx >= 0 else s
    # The payload is stored without padding; restore it before decoding
    missing_padding = len(payload) % 4
    if missing_padding:
        payload += "=" * (4 - missing_padding)
    raw = base64.b64decode(payload)
    out = [False] * n_topics
    bit_abs = 0
    off = 0
    while off < len(raw) and bit_abs < n_topics:
        # Assemble one 64-bit little-endian word
        w = 0
        for i in range(8):
            if off + i >= len(raw):
                break
            w |= raw[off + i] << (8 * i)
        # Expand 64 LSB-first bits
        for bit in range(64):
            if bit_abs >= n_topics:
                break
            out[bit_abs] = ((w >> bit) & 1) != 0
            bit_abs += 1
        off += 8
    return out

- FUN: improvements-only for Best/Worst, one row per K for Average.
- VAR: aligned with FUN rows.
- TOP: up to 10 rows per K, replaced atomically.
- Negative correlations: expected for Worst, possible at small K in Average.
CLI options
Required
- -fi, --fileIn <file>: input CSV basename (no extension)
- -c, --corr <Pearson|Kendall>: correlation method
- -t, --targ <Best|Worst|Average|All>: target
- -l, --log <Verbose|Limited|Off>: logging level
Optional, general
- --copy: copy results into NewBestSub-Experiments sibling folder
- -det, --deterministic: enable deterministic mode
- -sd, --seed <long>: master seed (implies deterministic mode)
- -mr, --mrg <int>: merge N executions
- -et, --expt <int>: add fake topics per step
- -es, --exps <int>: add fake systems per step
- -mx, --max <int>: cap for expansions
Optional, target-specific
- -i, --iter <int>: iterations (Best, Worst, All)
- -po, --pop <int>: population size (must be >= number of topics; Best, Worst, All)
- -r, --rep <int>: repetitions per K (Average, All)
- -pe, --perc <a,b>: percentile range (Average, All)
Deterministic execution
- --seed <long> sets the master seed explicitly.
- --deterministic enables reproducibility. If --seed is absent, a stable seed is derived.
- The effective seed is logged and embedded in the output folder name.
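How the stable seed is derived is not specified here. Purely as an illustration of the idea (same run parameters, same seed), one could hash a canonical rendering of the parameters. The function derive_seed, the choice of SHA-256, and the parameter keys below are all hypothetical and should not be read as NewBestSub's actual scheme.

```python
import hashlib

def derive_seed(params):
    """Map run parameters to a signed 64-bit seed, independent of key order."""
    # Sort keys so that the same parameters always produce the same string.
    canonical = "|".join(f"{key}={params[key]}" for key in sorted(params))
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    # Keep the first 8 bytes so the value fits a JVM long.
    return int.from_bytes(digest[:8], "big", signed=True)

run_a = {"dataset": "toy", "corr": "Pearson", "target": "Best", "pop": 50, "iter": 5000}
run_b = {"iter": 5000, "pop": 50, "target": "Best", "corr": "Pearson", "dataset": "toy"}

print(derive_seed(run_a) == derive_seed(run_b))   # same parameters, same seed
```

Whatever scheme is used, the effective seed printed in the log and folder name is the value to pass back via --seed when reproducing a run.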
Folder naming pattern
Example:
AH99-Pearson-top50-sys129-po1000-i10000-r2000-time2025-08-22-19-19-40
Pattern:
<DATASET>-<CORR>-top<Topics>-sys<Systems>-po<Population>-i<Iterations>[-r<Repetitions>][-exec<Executions>][-seed<Seed>][-det]-time<YYYY-MM-DD-HH-mm-ss>
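The pattern above maps naturally onto a regular expression. The sketch below is an independent illustration and is not the code in tools/parse_run_name.py; the regex, the function name, and the choice to treat the -po and -i tokens as optional (Average runs do not take those options) are assumptions.

```python
import re

RUN_NAME = re.compile(
    r"^(?P<dataset>.+?)-(?P<correlation>Pearson|Kendall)"
    r"-top(?P<topics>\d+)-sys(?P<systems>\d+)"
    r"(?:-po(?P<population>\d+))?(?:-i(?P<iterations>\d+))?"
    r"(?:-r(?P<repetitions>\d+))?(?:-exec(?P<executions>\d+))?"
    r"(?:-seed(?P<seed>-?\d+))?(?P<deterministic>-det)?"
    r"-time(?P<timestamp>\d{4}(?:-\d{2}){5})$"
)

INT_FIELDS = ("topics", "systems", "population", "iterations",
              "repetitions", "executions", "seed")

def parse_run_name(name):
    """Split a run folder name into its parameters; absent tokens become None."""
    match = RUN_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised run folder name: {name!r}")
    fields = match.groupdict()
    for key in INT_FIELDS:
        if fields[key] is not None:
            fields[key] = int(fields[key])
    fields["deterministic"] = fields["deterministic"] is not None
    return fields

info = parse_run_name(
    "AH99-Pearson-top50-sys129-po1000-i10000-r2000-time2025-08-22-19-19-40"
)
print(info["dataset"], info["population"], info["timestamp"])
```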
Tools
The run-name parser is located in tools/parse_run_name.py. It parses run folder names back into a parameter dictionary.
Example:
python tools/parse_run_name.py "mmlu-Pearson-top18955-sys34-po20000-i100000-r2000-time2025-08-24-17-15-53" --pretty
Output:
{
"dataset": "mmlu",
"correlation": "Pearson",
"topics": 18955,
"systems": 34,
"population": 20000,
"iterations": 100000,
"repetitions": 2000,
"executions": null,
"seed": null,
"deterministic": false,
"timestamp": "2025-08-24-17-15-53"
}
Architecture overview
- DatasetModel: loads data, wires correlation and targets, runs NSGA-II, emits streaming events.
- Streaming NSGA-II wrapper: calls a per-generation hook and uses MNDS plus crowding in replacement.
- Incremental evaluation: cached per-system sums and swap deltas.
- Operators: BinaryPruningCrossover, FixedKSwapMutation.
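To give a feel for a fixed-K swap move, here is a minimal sketch inferred only from the operator's name and the "fixed-K swap/repair" wording used for it in this README. The function fixed_k_swap is hypothetical; the real FixedKSwapMutation operates on jMetal solutions and may choose topics differently (for example, using swap hints).

```python
import random

def fixed_k_swap(mask, rng):
    """Return a copy of mask with one selected and one unselected topic exchanged."""
    selected = [i for i, bit in enumerate(mask) if bit]
    unselected = [i for i, bit in enumerate(mask) if not bit]
    child = list(mask)
    if not selected or not unselected:
        return child                      # nothing to swap when K is 0 or all topics
    child[rng.choice(selected)] = False   # drop one topic ...
    child[rng.choice(unselected)] = True  # ... and add another, so K is unchanged
    return child

rng = random.Random(42)                   # seeded for a repeatable demonstration
parent = [True, True, False, False, False]
child = fixed_k_swap(parent, rng)
print(sum(parent), sum(child))            # cardinality K is the same before and after
```

Keeping K fixed is what allows the cached per-system sums to be updated with a single in/out delta per system.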
Testing
JUnit 5 with Surefire 3.x:
mvn -DskipTests=false -Dmaven.test.skip=false -Dsurefire.printSummary=true test
Key test areas (2.0.4):
- Views (CSV/Parquet) with BEST/WORST, including negative/mixed correlations.
- Model AVERAGE path: one representative per K; percentile completeness.
- Operators: FixedKSwap mutation (fixed-K swap/repair); BinaryPruning crossover (AND/OR + feasibility repair).
- Bitmask encoding/decoding round trips and edge cases across 64-bit boundaries.
Build and logging
Key versions (see pom.xml):
- Kotlin 2.2.0, Java 22, jMetal 6.9.1
- Parquet 1.15.2 and Hadoop 3.3.6, Log4j2 2.24.3, JUnit 5.13.4
Logging:
- Console plus rolling file (log4j2.xml)
- Uses baseLogFileName system property for destinations
Troubleshooting
- OutOfMemoryError: increase -Xmx, or reduce -po, -i, or -r.
- Population smaller than topics: ensure -po >= #topics.
- TOP incomplete: blocks are emitted only when filled; small runs may delay some Ks.
- Multiple SLF4J bindings: use Log4j2 only.
Changelog
See CHANGELOG.md for the full history.
2.0.4 — 2025‑09‑03
- NATURAL vs INTERNAL clarified; selection rules for FUN/TOP documented.
- AVERAGE percentile interpretation documented with selection recipe.
- Program: early -po ≥ #topics validation + friendlier runtime error handling.
- Tests: BEST/WORST view goldens (incl. negative/mixed); operator tests.
2.0.3 — 2025‑08‑29
- Container folder names now include the iteration token when provided by CLI.
- AVERAGE CSV/Parquet filenames omit the iteration token by design.
- numberOfIterations assigned at start of DatasetModel.solve() for consistent naming.
2.0.0 — 2025‑08‑24
- Java 22 + Kotlin 2.2 toolchain; jMetal 6.9.1.
- Streaming NSGA‑II + MNDS; incremental evaluation with swap hints.
- Compact VAR/TOP Base64 masks; 6‑digit correlation precision.
Citation
Software archive: https://doi.org/10.5281/zenodo.16950389
- Kevin Roitero, Michael Soprano, Andrea Brunello, Stefano Mizzaro. Reproduce and Improve: An Evolutionary Approach to Select a Few Good Topics for Information Retrieval Evaluation. ACM JDIQ 10(3), Article 12, 2018. https://doi.org/10.1145/3239573
- Kevin Roitero, Michael Soprano, Stefano Mizzaro. Effectiveness Evaluation with a Subset of Topics: A Practical Approach. In SIGIR '18, 1145–1148. https://doi.org/10.1145/3209978.3210108
License
Released under the MIT License.