
Commit 3f330d0

Merge pull request #1 from FujitsuResearch/feature/v1-0-0
Feature/v1 0 0
2 parents 0319d41 + ed397d9 commit 3f330d0

22 files changed (+197, -17 lines)

CHANGELOG.md

Lines changed: 18 additions & 0 deletions
@@ -1,5 +1,23 @@
 # Change log
 
+## [v1.0.0] 2026-03-31
+
+### Default Parameter Changes
+
+- Changed `Runner.__init__` default values for calibration parameters:
+  - `max_length`: `512` → `2048`
+  - `num_calibration_samples`: `128` → `512`
+- Pinned old default values explicitly in all `example/` and `tests/` files that previously relied on the defaults
+
+### Documentation
+
+- Updated `docs/user-guide/configuration.md` to reflect the new default values for `max_length` and `num_calibration_samples`
+- Added quantizer feature support table to `docs/user-guide/basic-usage.md` and `docs/api/quantizers/base.md`
+  - Documents which quantizers support `save_quantized_model()` / `create_quantized_model()` and quantized-model PPL/ACC evaluation
+  - Currently supported: **GPTQ**, **DBF**, **AutoBitQuantizer** (requires `get_quant_config()` and `create_inference_layer()`)
+  - Unsupported quantizers (RTN, JointQ, QUIP, CQ, ARB, QBB, Onebit): PPL/ACC evaluation automatically falls back to the dequantized (FP16) model
+- Updated the perplexity/accuracy evaluation note in `basic-usage.md` to reflect AutoBitQuantizer support and fallback behavior
+
 ## [v0.5.0] 2026-03-30
 
 ### New Feature: Post-quantization Workflow

README.md

Lines changed: 26 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -6,14 +6,26 @@ This package is currently under development (version 0) and may behave unstably.
66

77
## 📦 Features
88

9-
- **Quantization Error Propagation (QEP)**: A post-training quantization method that corrects quantization errors by propagating them to subsequent layers, improving the accuracy of quantized LLMs. See [Arai & Ichikawa, NeurIPS 2025](https://openreview.net/forum?id=a3l3K9khbL) for details.
9+
- **Quantization Error Propagation (QEP)**: A post-training quantization method that corrects quantization errors by propagating them to subsequent layers, improving the accuracy of quantized LLMs. See [Arai & Ichikawa, NeurIPS 2025](https://openreview.net/forum?id=a3l3K9khbL) for details. The original reference implementation is available at [FujitsuResearch/qep](https://github.com/FujitsuResearch/qep).
1010
- **vLLM Plugin Integration**: Serve OneComp-quantized models with [vLLM](https://docs.vllm.ai/) via built-in plugins for DBF and Mixed-GPTQ quantization methods.
1111
- **AutoBit**: Mixed-precision quantization with ILP-based bitwidth assignment. Automatically estimates the target bitwidth from available VRAM and assigns per-layer bitwidths to minimize quantization error under the memory budget.
1212
- **JointQ**: Joint quantization method that optimizes weight assignments and scale parameters simultaneously for improved quantization accuracy. Supports group-wise quantization (e.g., 4-bit, groupsize=128).
1313
- **LoRA SFT Post-Process**: Fine-tune quantized models with LoRA adapters for accuracy recovery or domain-specific knowledge injection. Supports SFT loss, teacher distillation, and intermediate block alignment.
1414
- **Rotation Preprocessing**: SpinQuant/OstQuant-based rotation preprocessing that reduces quantization error by learning optimal rotation matrices before quantization. Rotation/scaling matrices are absorbed into model weights, with online Hadamard hooks automatically registered at load time. Supports Llama and Qwen3 architectures.
1515
- (TBD)
1616

17+
## 🤖 Supported Models
18+
19+
OneComp has been verified with the following model architectures.
20+
Other Hugging Face-compatible models may work but are currently untested.
21+
22+
| # | Architecture | Verified Models | Status |
23+
|---|-------------|-----------------|--------|
24+
| 1 | Llama | TinyLlama, Llama-2, Llama-3 | ✅ Verified |
25+
| 2 | Qwen3 | Qwen3-0.6B ~ 32B | ✅ Verified |
26+
27+
> **Note:** Support for additional architectures is planned. Contributions and test reports are welcome.
28+
1729
## 🔧 Installation
1830

1931
### for users (pip)
@@ -181,6 +193,19 @@ See [LICENSE](./LICENSE) for more details.
181193

182194
## Citation
183195

196+
OneComp technical report (coming soon on ArXiv):
197+
198+
```
199+
@misc{onecomp2026,
200+
title={TBD},
201+
author={TBD},
202+
year={2026},
203+
note={arXiv preprint coming soon}
204+
}
205+
```
206+
207+
QEP (Quantization Error Propagation):
208+
184209
```
185210
@inproceedings{
186211
arai2025quantization,
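The AutoBit feature above assigns per-layer bitwidths under a memory budget using an ILP solver. As a rough illustration of the underlying trade-off only (not the OneComp implementation), a greedy variant can be sketched: start every layer at the cheapest bitwidth, then repeatedly upgrade whichever layer buys the largest error reduction per extra bit of memory. All names below are hypothetical.

```python
def assign_bitwidths(num_weights, err, budget_bits, choices=(2, 4, 8)):
    """Greedy sketch of budgeted mixed-precision assignment.

    num_weights[i]: parameter count of layer i.
    err[i][b]: estimated quantization error of layer i at bitwidth b.
    budget_bits: total weight-memory budget in bits.
    """
    bits = [min(choices)] * len(num_weights)          # start at the cheapest width
    used = sum(n * b for n, b in zip(num_weights, bits))
    while True:
        best = None
        best_gain = 0.0
        for i, n in enumerate(num_weights):
            higher = [b for b in choices if b > bits[i]]
            if not higher:
                continue                               # already at max precision
            nb = min(higher)                           # next step up
            extra = n * (nb - bits[i])
            if used + extra > budget_bits:
                continue                               # upgrade would bust the budget
            gain = (err[i][bits[i]] - err[i][nb]) / extra  # error drop per extra bit
            if gain > best_gain:
                best, best_gain, best_nb, best_extra = i, gain, nb, extra
        if best is None:
            break                                      # no affordable improvement left
        bits[best] = best_nb
        used += best_extra
    return bits
```

An ILP solver, as AutoBit uses, finds the globally optimal assignment; the greedy loop above only conveys the shape of the problem.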

docs/algorithms/qep.md

Lines changed: 2 additions & 1 deletion
@@ -6,7 +6,8 @@ the error that propagates from previously quantized layers to subsequent ones.
 !!! abstract "Reference"
     Yamato Arai and Yuma Ichikawa, "Quantization Error Propagation: Revisiting Layer-Wise
     Post-Training Quantization," NeurIPS 2025.
-    [OpenReview](https://openreview.net/forum?id=a3l3K9khbL)
+    [OpenReview](https://openreview.net/forum?id=a3l3K9khbL) |
+    [Original implementation](https://github.com/FujitsuResearch/qep)
 
 ## Motivation
 
docs/api/quantizers/base.md

Lines changed: 28 additions & 0 deletions
@@ -4,6 +4,34 @@
 
 Abstract base class for all quantizers. Defines the common interface and shared functionality.
 
+### Quantizer Feature Support
+
+`Runner.save_quantized_model()`, `Runner.create_quantized_model()`, and quantized-model
+PPL/ACC evaluation internally call `get_quant_config()` and `create_inference_layer()` on
+the quantizer. These methods raise `NotImplementedError` by default and must be overridden
+by each quantizer to enable these features.
+
+| Quantizer          | `get_quant_config` | `create_inference_layer` | Save | Quantized PPL/ACC |
+|--------------------|:------------------:|:------------------------:|:----:|:-----------------:|
+| `GPTQ`             | Yes                | Yes                      | Yes  | Yes               |
+| `DBF`              | Yes                | Yes                      | Yes  | Yes               |
+| `AutoBitQuantizer` | Yes                | Yes                      | Yes  | Yes               |
+| `RTN`              | No                 | No                       | No   | No (fallback)     |
+| `JointQ`           | No                 | No                       | No   | No (fallback)     |
+| `QUIP`             | No                 | No                       | No   | No (fallback)     |
+| `CQ`               | No                 | No                       | No   | No (fallback)     |
+| `ARB`              | No                 | No                       | No   | No (fallback)     |
+| `QBB`              | No                 | No                       | No   | No (fallback)     |
+| `Onebit`           | No                 | No                       | No   | No (fallback)     |
+
+For quantizers without support:
+
+- **PPL/ACC evaluation**: `calculate_perplexity()` / `calculate_accuracy()` with
+  `quantized_model=True` automatically falls back to the dequantized (FP16) model.
+  No error is raised.
+- **Saving**: use `save_dequantized_model()` (FP16) or `save_quantization_results()`
+  to persist results.
+
 ::: onecomp.quantizer._quantizer.Quantizer
     options:
       show_source: false
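The override-or-fallback contract described in this section can be illustrated with a toy sketch. Class and method names mirror the docs above, but this is only an illustration of the dispatch pattern, not the `onecomp` source:

```python
class Quantizer:
    """Base class: feature hooks raise NotImplementedError by default."""

    def get_quant_config(self):
        raise NotImplementedError

    def create_inference_layer(self):
        raise NotImplementedError


class GPTQ(Quantizer):
    """Supported quantizer: overrides both hooks."""

    def get_quant_config(self):
        return {"bits": 4, "group_size": 128}   # illustrative values

    def create_inference_layer(self):
        return "int4-layer"                     # stand-in for a real layer object


class RTN(Quantizer):
    """Unsupported quantizer: overrides nothing."""


def eval_path(quantizer):
    """Which model would PPL/ACC evaluation run on for this quantizer?"""
    try:
        quantizer.get_quant_config()
        quantizer.create_inference_layer()
        return "quantized"
    except NotImplementedError:
        return "dequantized-fp16"               # silent fallback, no error raised
```

Probing for `NotImplementedError` rather than checking a capability flag matches the docs' statement that the base-class methods "must be overridden by each quantizer to enable these features."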

docs/index.md

Lines changed: 26 additions & 0 deletions
@@ -17,6 +17,19 @@ It implements state-of-the-art quantization algorithms including GPTQ, DBF, RTN,
 - **LoRA SFT Post-Process** -- Fine-tune quantized models with LoRA adapters for accuracy recovery or domain-specific knowledge injection. Supports SFT loss, teacher distillation, and intermediate block alignment.
 - **Rotation Preprocessing** -- SpinQuant/OstQuant-based rotation preprocessing that reduces quantization error by learning optimal rotation matrices before quantization. Rotation/scaling matrices are absorbed into model weights, with online Hadamard hooks automatically registered at load time. Supports Llama and Qwen3 architectures.
 
+## Supported Models
+
+OneComp has been verified with the following model architectures.
+Other Hugging Face-compatible models may work but are currently untested.
+
+| # | Architecture | Verified Models             | Status                      |
+|---|--------------|-----------------------------|-----------------------------|
+| 1 | Llama        | TinyLlama, Llama-2, Llama-3 | :white_check_mark: Verified |
+| 2 | Qwen3        | Qwen3-0.6B ~ 32B            | :white_check_mark: Verified |
+
+!!! note
+    Support for additional architectures is planned. Contributions and test reports are welcome.
+
 ## Quick Example
 
 Quantize any Hugging Face model in a single line -- with QEP, GPTQ 4-bit quantization,
@@ -72,6 +85,19 @@ For full control over each step, see the [step-by-step workflow](user-guide/basi
 
 If you use OneComp in your research, please cite our paper:
 
+OneComp technical report (coming soon on arXiv):
+
+```bibtex
+@misc{onecomp2026,
+  title={TBD},
+  author={TBD},
+  year={2026},
+  note={arXiv preprint coming soon}
+}
+```
+
+QEP (Quantization Error Propagation):
+
 ```bibtex
 @inproceedings{
 arai2025quantization,

docs/user-guide/basic-usage.md

Lines changed: 26 additions & 1 deletion
@@ -121,7 +121,9 @@ print(f"Quantized: {quantized_ppl:.2f}")
 
 !!! note
     Evaluating the original or dequantized model requires loading the full model on GPU.
-    Quantized-model evaluation is currently supported only for **GPTQ** and **DBF** quantizers. Support for other methods is planned.
+    Quantized-model evaluation (`quantized_model=True`) is supported only for quantizers
+    that implement `create_quantized_model()` (**GPTQ**, **DBF**, **AutoBitQuantizer**).
+    For other quantizers, evaluation automatically falls back to the dequantized (FP16) model.
 
 ### Zero-shot Accuracy
 
@@ -146,6 +148,29 @@ runner.save_dequantized_model("./output/dequantized")
 runner.save_quantized_model("./output/quantized")
 ```
 
+!!! note "Quantizer feature support"
+    `save_quantized_model()`, `create_quantized_model()`, and quantized-model PPL/ACC evaluation
+    require the quantizer to implement `get_quant_config()` and `create_inference_layer()`.
+    Currently only **GPTQ**, **DBF**, and **AutoBitQuantizer** support these features.
+
+    | Quantizer          | Save | Quantized PPL/ACC | Fallback                 |
+    |--------------------|:----:|:-----------------:|--------------------------|
+    | `GPTQ`             | Yes  | Yes               | —                        |
+    | `DBF`              | Yes  | Yes               | —                        |
+    | `AutoBitQuantizer` | Yes  | Yes               | —                        |
+    | `RTN`              | —    | —                 | Dequantized (FP16) model |
+    | `JointQ`           | —    | —                 | Dequantized (FP16) model |
+    | `QUIP`             | —    | —                 | Dequantized (FP16) model |
+    | `CQ`               | —    | —                 | Dequantized (FP16) model |
+    | `ARB`              | —    | —                 | Dequantized (FP16) model |
+    | `QBB`              | —    | —                 | Dequantized (FP16) model |
+    | `Onebit`           | —    | —                 | Dequantized (FP16) model |
+
+    For unsupported quantizers:
+
+    - **PPL/ACC evaluation**: automatically falls back to the dequantized (FP16) model. No error is raised.
+    - **Saving**: use `save_dequantized_model()` (FP16) or `save_quantization_results()` instead.
+
 ## Enabling QEP
 
 QEP adjusts weights before quantization to compensate for error propagation across layers.
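As background for the PPL numbers this guide prints, perplexity is the exponential of the mean per-token negative log-likelihood. A minimal self-contained sketch of the standard definition (not OneComp's evaluation code):

```python
import math


def perplexity(nll_per_token):
    """Corpus perplexity = exp(mean of per-token negative log-likelihoods)."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))
```

Lower is better: a model that assigns every token probability 1 has perplexity 1, and a model that assigns every token probability 1/4 has perplexity 4.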

docs/user-guide/configuration.md

Lines changed: 4 additions & 4 deletions
@@ -36,8 +36,8 @@ from onecomp import Runner
 runner = Runner(
     model_config=model_config,
     quantizer=quantizer,
-    max_length=512,
-    num_calibration_samples=128,
+    max_length=2048,
+    num_calibration_samples=512,
     qep=False,
 )
 ```
@@ -57,8 +57,8 @@ runner = Runner(
 | Parameter                 | Type      | Description                               | Default       |
 |---------------------------|-----------|-------------------------------------------|---------------|
 | `calibration_dataset`     | `Dataset` | Custom calibration dataset                | `None`        |
-| `max_length`              | `int`     | Maximum input sequence length             | `512`         |
-| `num_calibration_samples` | `int`     | Number of calibration samples             | `128`         |
+| `max_length`              | `int`     | Maximum input sequence length             | `2048`        |
+| `num_calibration_samples` | `int`     | Number of calibration samples             | `512`         |
 | `calibration_strategy`    | `str`     | Strategy for preparing calibration inputs | `"drop_rand"` |
 | `calibration_seed`        | `int`     | Random seed for calibration               | `0`           |
 | `calibration_batch_size`  | `int`     | Batch size for chunked calibration        | `None`        |
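To make the calibration parameters above concrete, here is a hypothetical sketch of drawing `num_calibration_samples` fixed-length windows of `max_length` tokens from a token stream, seeded for reproducibility. The actual `"drop_rand"` strategy in OneComp may differ; the function name and logic here are illustrative only.

```python
import random


def sample_calibration(tokens, num_samples=512, max_length=2048, seed=0):
    """Draw fixed-length token windows for calibration; deterministic per seed."""
    rng = random.Random(seed)                 # seeded like `calibration_seed`
    windows = []
    for _ in range(num_samples):
        start = rng.randrange(0, len(tokens) - max_length + 1)
        windows.append(tokens[start:start + max_length])
    return windows
```

The v1.0.0 defaults (512 samples of 2048 tokens) consume 4x the samples and 4x the sequence length of the old defaults, which is why the examples below pin `max_length=512, num_calibration_samples=128` to preserve their previous runtime and memory footprint.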

example/example_autobit.py

Lines changed: 2 additions & 0 deletions
@@ -30,6 +30,8 @@
     model_config=ModelConfig(model_id=MODEL_ID, device="cuda:0"),
     quantizer=quantizer,
     qep=False,
+    max_length=512,
+    num_calibration_samples=128,
 )
 runner.run()
 
example/example_gptq.py

Lines changed: 7 additions & 1 deletion
@@ -22,7 +22,13 @@
 gptq = GPTQ(wbits=3)
 
 # Configure the runner
-runner = Runner(model_config=model_config, quantizer=gptq, qep=False)
+runner = Runner(
+    model_config=model_config,
+    quantizer=gptq,
+    qep=False,
+    max_length=512,
+    num_calibration_samples=128,
+)
 
 # Run quantization
 runner.run()

example/example_jointq.py

Lines changed: 7 additions & 1 deletion
@@ -22,7 +22,13 @@
 jointq = JointQ(bits=4, group_size=128)
 
 # Configure the runner
-runner = Runner(model_config=model_config, quantizer=jointq, qep=False)
+runner = Runner(
+    model_config=model_config,
+    quantizer=jointq,
+    qep=False,
+    max_length=512,
+    num_calibration_samples=128,
+)
 
 # Run quantization
 runner.run()
