fix(qualcomm): Enhance quantization modules. #607

chenghuaWang · 2026-01-20T05:28:33Z

Summary by CodeRabbit

New Features
- Fixed-parameter activation quantizer and concat observer added; model-level enable/disable fake-quant controls.
Improvements
- Broader automatic generation and propagation of quantization specs across ops.
- New checks for unsolved quantization entries and concat-parameter consistency.
- Bit-width specific epsilon handling for quantization and improved attention/dtype handling.
New Exports
- Expanded scalar dtype exports (int8/16/32/64, uint8/16/32/64, bool) for Python API.


…fixed quantization parameters, updated ActivationQDQ to use MovingAverageMinMaxObserver, and adjusted eps values for better precision. Modified Qwen3 model to utilize FixedActivationQDQ for sigmoid output and ensured dtype consistency in attention calculations.

… debug print statements from Qwen3DecoderLayer

…ackend in CMake, enhance PTQPass with unsolved tensor value checks, and update quantization specifications in RMSNorm and model file conversion.

…improved quantization, enhance rotate_half function to utilize observers, and ensure consistent scale and zero_point across concatenated inputs.

coderabbitai · 2026-01-20T05:28:45Z

📝 Walkthrough


Adds multiple quantization utilities and integrations across C++ and Python backends: fixed-parameter activation QDQ, concat observers, broader automatic quantization-spec generation and validation in AOT/PTQ passes, expanded CPU fill kernels/API, model serialization tweaks, and CMake install/export updates for several targets.

Changes

Cohort / File(s)	Summary
CMake install/export `CMakeLists.txt`, `mllm/backends/qnn/CMakeLists.txt`	Added install/export rules for `flatbuffers` and `MllmQNNBackend` targets (LIBRARY/ARCHIVE→lib, RUNTIME→bin).
Compiler warnings `mllm/CMakeLists.txt`	Added `-Wno-comma-subscript` to `MllmRT` to suppress comma-subscript warnings.
Qwen3 AOT modeling (C++) `examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp`	Reworked `rotateHalf` to accept module/QDQ name, use `ptq::QDQ` for second half, updated Qwen3Attention masked-softmax path to use quantized fallback + QDQ wrapper.
Qwen3 Python model `pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py`, `.../runner.py`, `.../train.py`	Imported `FixedActivationQDQ` and `ConcatObserver`; replaced some ActivationQDQ uses with `FixedActivationQDQ`; updated `rotate_half` signature to accept observers; added enable/disable fake-quant helpers and quantization-aware loading/convert/calibration flow; layer index field added.
Qualcomm QDQ & observers (Python) `pymllm/backends/qualcomm/transformers/core/qdq.py`, `.../observer.py`, `.../rms_norm.py`, `.../qlinear.py`	Added `FixedActivationQDQ`, `ConcatObserver`, bit-width eps constants, observer eps propagation, renamed fake-quant control methods, and small formatting/refactor adjustments; QRMSNorm FakeQuantize configured with explicit dtype/qscheme and eps.
AOT quant recipe improvements (C++) `mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp`	Patterns now auto-generate missing `quant_recipe` attrs (Concat, Where, RoPE, Elementwise, etc.), propagate specs consistently, and reduce early failures.
PTQ validation (C++) `mllm/backends/qnn/aot/passes/PTQPass.cpp`	Added `recursiveCheckUnsolved` and `recursiveCheckConcatInputs` graph traversals to warn about unsolved specs and validate Concat input scale/zero_point consistency; integrated into solve flow.
RMSNorm visitor (C++) `mllm/backends/qnn/aot/visitor/RMSNorm.cpp`	Switched fake-bias quant recipe from int32/zero-range to int16 with bias_scale tensor; set runtime bias name.
CPU fill utilities (C++) `mllm/backends/cpu/kernels/common/fill-inl.hpp`, `.../kernel_dispatch.cpp`, `.../kernel_dispatch.hpp`, `mllm/backends/cpu/ops/FillOp.cpp`	New SIMD-backed fill utilities (zeros/ones/value/arange/random) for many scalar types, exported HWY APIs and dynamic-dispatch wrappers, and X86 paths updated to use generic anytype wrappers with ARM-specific paths preserved.
FFI and Python dtype exports `mllm/ffi/Extension.cc`, `pymllm/ffi/__init__.py`, `pymllm/__init__.py`	Added factory functions and public instances for int8/16/32/64, uint8/16/32/64, bool; exported new dtype and device singletons.
Model file serialization (Python) `pymllm/convertor/model_file_v2.py`	Added `_torch_tensor_bytes()` helper and replaced numpy-based tensor serialization calls with it (streaming/static write paths).

Sequence Diagram(s)

(Skipped — changes are broad and dispersed; no single new multi-component sequential flow met the diagram criteria.)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

feat(Qnn AOT): Add MarkTensorIO pass and related changes for QNN AOT pipeline #569 — Overlaps Qwen3 AOT modeling edits and QNN AOT visitor changes (similar rotateHalf/QDQ adjustments).
feat(qualcomm): PTQPass add constant ptq impl. #593 — Related PTQ changes; also modifies PTQPass traversal/validation logic.
fix: Qualcomm QNN AOT Pass #579 — Overlapping work on QNN AOT quant recipe generation and Qwen3 quantization wiring.

Suggested reviewers

liang1232018
oreomaker
yirongjie

Poem

🐰 I nibble bytes and hop through code,

Fixed scales snug in rabbit mode,
Concat bounds and QDQ spin,
RoPE rotates with a quantized grin,
Tiny hops — big changes made — hooray for this bunny's parade!

🚥 Pre-merge checks | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The pull request has no description provided; the required template structure is entirely missing.	Add a complete pull request description following the repository template, including motivation, changes made, and any relevant testing or validation information.
Docstring Coverage	⚠️ Warning	Docstring coverage is 11.82% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title 'fix(qualcomm): Enhance quantization modules.' is vague and generic—'enhance' is non-specific about what improvements are actually made.	Replace 'Enhance quantization modules' with a specific description of the main change, e.g., 'Add FixedActivationQDQ and ConcatObserver for Qwen3 quantization' or 'Improve QDQ observer configuration with epsilon handling'.



UbiquitousLearning

LGTM

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py (1)
104-136: rotate_half signature now breaks existing callers.

Line 104 requires observer args, but Line 135 still calls rotate_half(q) without them. If apply_rotary_pos_emb is invoked, this will raise a TypeError. Also, x_observer is unused. Please keep backward compatibility (or update all callers) and remove/rename the unused parameter.
🛠️ Backward‑compatible fix
-def rotate_half(
-    x, x_observer, x2_neg_fake_quant: ActivationQDQ, concat_observer: ConcatObserver
-):
+def rotate_half(
+    x,
+    _x_observer=None,
+    x2_neg_fake_quant: Optional[ActivationQDQ] = None,
+    concat_observer: Optional[ConcatObserver] = None,
+):
     """Rotates half the hidden dims of the input."""
     x1 = x[..., : x.shape[-1] // 2]
     x2 = x[..., x.shape[-1] // 2 :]
-    return concat_observer(torch.cat((x2_neg_fake_quant(-x2), x1), dim=-1))
+    if x2_neg_fake_quant is None or concat_observer is None:
+        return torch.cat((-x2, x1), dim=-1)
+    return concat_observer(torch.cat((x2_neg_fake_quant(-x2), x1), dim=-1))
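The fallback behaviour proposed above can be sketched without PyTorch. This is a minimal illustration on plain Python lists, not the repository's implementation: the list-based `Vector` type, the `recording_observer` helper and the omission of the unused observer argument are assumptions made for the example.

```python
from typing import Callable, List, Optional

Vector = List[float]  # hypothetical stand-in for the last tensor dimension


def rotate_half(
    x: Vector,
    x2_neg_fake_quant: Optional[Callable[[Vector], Vector]] = None,
    concat_observer: Optional[Callable[[Vector], Vector]] = None,
) -> Vector:
    """Rotate half of the vector: (x1, x2) -> (-x2, x1)."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    neg_x2 = [-v for v in x2]
    # Legacy callers pass only x: skip the quantization hooks entirely.
    if x2_neg_fake_quant is None or concat_observer is None:
        return neg_x2 + x1
    # Quantization-aware callers: fake-quantize -x2, then observe the concat.
    return concat_observer(x2_neg_fake_quant(neg_x2) + x1)


seen: List[Vector] = []


def recording_observer(values: Vector) -> Vector:
    """Hypothetical observer that only records what it was shown."""
    seen.append(list(values))
    return values


plain = rotate_half([1.0, 2.0, 3.0, 4.0])
observed = rotate_half([1.0, 2.0, 3.0, 4.0], lambda v: v, recording_observer)
```

With default arguments the old call style `rotate_half(q)` keeps working, and the observer path is exercised only when both hooks are supplied.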

🤖 Fix all issues with AI agents

In `@mllm/backends/qnn/aot/passes/PTQPass.cpp`:
- Around line 358-418: The loop in PTQPass.cpp can read ref_zero_point
uninitialized if the first captured reference is kSymPerTensor and a later
kAsymPerTensor is compared; add tracking for the reference spec type (e.g., an
enum/ref_spec_type alongside has_ref) when you set
ref_scale/ref_zero_point/ref_input_name, and before comparing a new input check
that f_spec->spec_->type matches ref_spec_type; if types differ emit a clear
MLLM_ERROR/MLLM_WARN mentioning op_name and both input names and skip comparison
(or fail early), and only access ref_zero_point when ref_spec_type ==
kAsymPerTensor so no uninitialized reads occur.

In `@pymllm/backends/qualcomm/transformers/qwen3/runner.py`:
- Around line 57-61: The call to Qwen3ForCausalLM.from_pretrained uses the wrong
keyword arg name `dtype`; update the call in runner.py where
Qwen3ForCausalLM.from_pretrained(model_path, attn_implementation="eager",
dtype=torch.float32) is invoked to use the correct HuggingFace parameter name
`torch_dtype=torch.float32` so the dtype is passed properly to the
PreTrainedModel loader.

In `@pymllm/backends/qualcomm/transformers/qwen3/train.py`:
- Around line 41-45: Decide and implement the intended fake-quant behavior by
removing the FIXME and either making the disable/enable calls deterministic or
exposing them as a CLI/config flag; e.g., add a boolean flag
(args.disable_fake_quant_before_calibration) and use it to conditionally call
m.disable_fake_quant() before m.calibrate(...) and m.enable_fake_quant() after,
ensuring the sequence around m.calibrate(...) and m.infer(...) is deterministic
and documented in the flag help text.
- Around line 50-53: The assigned lm_head parameter uses unquantized
embed_tokens weights before m.convert(), causing QLinearLPBQ's frozen
weight_quant to remain stale; fix by moving the weight tying to after
m.convert() (i.e., set m.model.lm_head.weight =
Parameter(m.model.model.embed_tokens.weight.clone()) only once convert() has
run) or, if tying must happen before convert(), update/re-freeze the QLinearLPBQ
internal quant state (weight_quant) after assignment so weight_quant.weight_q
reflects the new parameter; refer to m.model.lm_head.weight,
m.model.model.embed_tokens.weight, m.convert(), and the QLinearLPBQ frozen
weight_quant initialization to implement the change.
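The flag suggested for train.py could look roughly like the sketch below. It is an assumption-laden illustration: `StubModel` is a hypothetical stand-in that only records call order, and the real `calibrate`/`infer` methods take arguments that are omitted here.

```python
import argparse
from typing import List


class StubModel:
    """Hypothetical stand-in for the runner; records the order of calls."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def disable_fake_quant(self) -> None:
        self.calls.append("disable_fake_quant")

    def enable_fake_quant(self) -> None:
        self.calls.append("enable_fake_quant")

    def calibrate(self) -> None:
        self.calls.append("calibrate")

    def infer(self) -> None:
        self.calls.append("infer")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--disable-fake-quant-before-calibration",
        action="store_true",
        help="Run calibration with fake-quant disabled, then re-enable it.",
    )
    return parser


def run(model: StubModel, args: argparse.Namespace) -> List[str]:
    # The flag makes the sequence explicit instead of leaving a FIXME.
    if args.disable_fake_quant_before_calibration:
        model.disable_fake_quant()
    model.calibrate()
    if args.disable_fake_quant_before_calibration:
        model.enable_fake_quant()
    model.infer()
    return model.calls


default_calls = run(StubModel(), build_parser().parse_args([]))
flagged_calls = run(
    StubModel(),
    build_parser().parse_args(["--disable-fake-quant-before-calibration"]),
)
```

Either way the order of operations is fixed by the flag value, which is the determinism the comment asks for.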

🧹 Nitpick comments (9)

mllm/CMakeLists.txt (1)
59-61: Consider using PRIVATE instead of PUBLIC for warning suppression.

Using PUBLIC propagates -Wno-comma-subscript to all targets that link against MllmRT, which could mask comma-subscript warnings in downstream code that should be fixed. If the deprecated syntax is only used within MllmRT itself (as the FIXME suggests), PRIVATE would be more appropriate to limit the scope of warning suppression.
Suggested change
 # FIXME: `@oreomaker` Need to remove comma features in slice!
 # Suppress comma-subscript warnings (deprecated C++ feature that will be removed in C++26)
-target_compile_options(MllmRT PUBLIC -Wno-comma-subscript)
+target_compile_options(MllmRT PRIVATE -Wno-comma-subscript)
mllm/backends/qnn/aot/visitor/RMSNorm.cpp (1)
53-55: Consider using named constants for int16 quantization bounds.

The magic numbers 32767 and -32768 represent the int16 symmetric quantization range. Extracting these as named constants would improve readability and make the relationship between scale and range explicit.
♻️ Suggested refactor
+  constexpr int16_t kInt16Max = 32767;
+  constexpr int16_t kInt16Min = -32768;
+
   // fake bias quant recipe
   auto bias_scale = Tensor::ones({1});
-  bias_scale.at<float>({0}) = 1.0 / 32767;
-  auto quant_spec = mllm::ir::linalg::QuantizationSpecSymPerTensor::create(-32768, 32767, kInt16, kFloat32, bias_scale);
+  bias_scale.at<float>({0}) = 1.0f / kInt16Max;
+  auto quant_spec = mllm::ir::linalg::QuantizationSpecSymPerTensor::create(kInt16Min, kInt16Max, kInt16, kFloat32, bias_scale);
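To make the relationship between the scale and the int16 range concrete, here is a small Python sketch of symmetric per-tensor quantization with the same bounds. The helper names `quantize_sym` and `dequantize_sym` are hypothetical and are not part of mllm.

```python
INT16_MIN = -32768
INT16_MAX = 32767
BIAS_SCALE = 1.0 / INT16_MAX  # same value as the fake-bias scale in the diff


def quantize_sym(value: float, scale: float,
                 qmin: int = INT16_MIN, qmax: int = INT16_MAX) -> int:
    """Symmetric quantization: round to the nearest step, then clamp."""
    q = round(value / scale)
    return max(qmin, min(qmax, q))


def dequantize_sym(q: int, scale: float) -> float:
    """Map an integer code back to a real value (no zero point)."""
    return q * scale


q_one = quantize_sym(1.0, BIAS_SCALE)        # largest representable code
q_clamped = quantize_sym(-5.0, BIAS_SCALE)   # saturates at the lower bound
restored = dequantize_sym(q_one, BIAS_SCALE)
```

With a scale of 1/32767 the value 1.0 maps to the top of the int16 range, so the named bounds and the scale describe the same interval.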
mllm/backends/qnn/aot/passes/PTQPass.cpp (1)
460-468: LGTM!

The validation functions are correctly invoked after the solving passes, ensuring all quantization specs are resolved before checking for issues. The ordering is appropriate.

Consider caching the SubGraphOp lookup to avoid repeated symbol table lookups:
auto main_subgraph = getCtx()->lookupSymbolTable(call_main_graph_op->getSymbolAttr()->str())->cast_<ir::graph::SubGraphOp>();
recursiveSolveWeights(writer.getContext(), main_subgraph, pf);
recursiveSolveNormal(writer.getContext(), main_subgraph, pf);
recursiveCheckUnsolved(writer.getContext(), main_subgraph);
recursiveCheckConcatInputs(writer.getContext(), main_subgraph);
pymllm/convertor/model_file_v2.py (1)
27-33: Consider moving this function inside the torch availability guard.

The function references torch.uint8 and is only valid when PyTorch is available. While current call sites are properly guarded, placing the function definition inside the if MLLM_FIND_TORCH_AVAILABLE: block would make the dependency explicit and prevent accidental misuse.
Suggested change
 if MLLM_FIND_TORCH_AVAILABLE:
     import torch
+
+    def _torch_tensor_bytes(tensor: "torch.Tensor") -> bytes:
+        """Serialize a PyTorch tensor to raw bytes using uint8 view.
+
+        Handles dtypes not natively supported by numpy (e.g., bfloat16) by
+        viewing the tensor's storage as uint8 before conversion.
+        """
+        t = tensor.detach().cpu().contiguous()
+        if t.dim() == 0:
+            t = t.reshape(1)
+        return t.view(torch.uint8).numpy().tobytes()
+
 if MLLM_FIND_NUMPY_AVAILABLE:
     import numpy as np
-from .mllm_type_mapping import MLLM_TYPE_MAPPING
-
-
-def _torch_tensor_bytes(tensor: "torch.Tensor") -> bytes:
-    # Use uint8 view to preserve raw bytes for dtypes not supported by numpy.
-    t = tensor.detach().cpu().contiguous()
-    if t.dim() == 0:
-        t = t.reshape(1)
-    return t.view(torch.uint8).numpy().tobytes()
pymllm/backends/qualcomm/transformers/core/rms_norm.py (1)
23-31: Give the eps literal a named constant.

Line 26 inlines 0.0001 / 65535. Consider extracting a module-level constant (or reusing a shared constant) to keep eps consistent and self-descriptive.
♻️ Suggested refactor
+DEFAULT_EPS_16BIT = 0.0001 / 65535
 ...
         self.weight_fake_quant = FakeQuantize(
             observer=MinMaxObserver.with_args(
                 qscheme=torch.per_tensor_affine,
                 dtype=torch.qint32,
-                eps=0.0001 / 65535,
+                eps=DEFAULT_EPS_16BIT,
             ),
As per coding guidelines, use named constants instead of magic numbers.
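One way to derive such constants from the bit width is sketched below. Only the 16-bit value `0.0001 / 65535` comes from the diff; the 8-bit value, the helper `eps_for_bits` and the simplified `scale_from_range` are assumptions for illustration. PyTorch observers clamp the computed scale to at least `eps`, which the second helper mimics in reduced form.

```python
def eps_for_bits(bits: int, base: float = 0.0001) -> float:
    """Hypothetical helper: scale the base epsilon by the number of steps."""
    return base / (2**bits - 1)


DEFAULT_EPS_8BIT = eps_for_bits(8)    # assumed, not taken from the PR
DEFAULT_EPS_16BIT = eps_for_bits(16)  # equals 0.0001 / 65535 from the diff


def scale_from_range(min_val: float, max_val: float, bits: int, eps: float) -> float:
    """Simplified affine scale: range divided by steps, floored at eps."""
    qmax = 2**bits - 1
    return max((max_val - min_val) / qmax, eps)


degenerate_scale = scale_from_range(0.0, 0.0, 16, DEFAULT_EPS_16BIT)
unit_scale = scale_from_range(0.0, 1.0, 16, DEFAULT_EPS_16BIT)
```

A degenerate range falls back to the epsilon, while an ordinary range is unaffected because its scale is already larger than the floor.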
pymllm/backends/qualcomm/transformers/core/observer.py (1)
43-52: Use tensor ops + in‑place updates for min/max tracking.

Line 45 uses Python min/max on tensors and rebinds buffers. Using torch.minimum/maximum with copy_ avoids Python-level comparisons that can force a device synchronization, and keeps the registered buffers stable.
♻️ Suggested refactor
-        self.min_val = min(self.min_val, x_orig.min())
-        self.max_val = max(self.max_val, x_orig.max())
+        self.min_val.copy_(torch.minimum(self.min_val, x_orig.min()))
+        self.max_val.copy_(torch.maximum(self.max_val, x_orig.max()))
 ...
-        for observers in self.input_observers:
-            observers.min_val = self.min_val
-            observers.max_val = self.max_val
+        for observers in self.input_observers:
+            observers.min_val.copy_(self.min_val)
+            observers.max_val.copy_(self.max_val)
As per coding guidelines, avoid unnecessary work in hot paths.
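The idea behind the concat observer, as far as it can be read from this diff, is that the observed range of the concatenated output is pushed back to every input observer so that all inputs end up with the same scale and zero point. A framework-free sketch follows; the class names `RangeObserver` and `ConcatRangeObserver` are hypothetical, and floats stand in for the tensor buffers.

```python
import math
from typing import Iterable, List


class RangeObserver:
    """Hypothetical min/max tracker standing in for a PyTorch observer."""

    def __init__(self) -> None:
        self.min_val = math.inf
        self.max_val = -math.inf

    def observe(self, values: List[float]) -> List[float]:
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))
        return values


class ConcatRangeObserver(RangeObserver):
    """Tracks the concat output and shares its range with the inputs."""

    def __init__(self, input_observers: Iterable[RangeObserver]) -> None:
        super().__init__()
        self.input_observers = list(input_observers)

    def observe(self, values: List[float]) -> List[float]:
        super().observe(values)
        # Overwrite each input range so every branch derives identical qparams.
        for observer in self.input_observers:
            observer.min_val = self.min_val
            observer.max_val = self.max_val
        return values


left, right = RangeObserver(), RangeObserver()
left.observe([-1.0, 0.5])
right.observe([0.0, 3.0])
concat = ConcatRangeObserver([left, right])
concat.observe([-1.0, 0.5, 0.0, 3.0])
```

After the concat observation both inputs report the combined range, which is the property the PTQ pass later checks on Concat inputs.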
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py (2)
82-87: Derive sigmoid scale from a named constant.

Line 84 hard‑codes 65535. Consider extracting a named constant (or computing from bits) to make intent clearer and reduce magic numbers.
♻️ Suggested refactor
+SIGMOID_QMAX_16 = (2**16) - 1
 ...
-        sigmoid_scale = 1.0 / (65535 - 0 + 1)  # 1 / 65536
+        sigmoid_scale = 1.0 / (SIGMOID_QMAX_16 + 1)  # 1 / 65536
As per coding guidelines, use named constants instead of magic numbers.
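For intuition, fixed-parameter fake quantization of a sigmoid output can be sketched in plain Python. The scale of 1/65536 is taken from the diff; the zero point of 0, the unsigned 16-bit range and the helper `fixed_fake_quant` are assumptions for this example rather than the actual `FixedActivationQDQ` behaviour.

```python
SIGMOID_BITS = 16
SIGMOID_QMIN = 0
SIGMOID_QMAX = 2**SIGMOID_BITS - 1                          # 65535
SIGMOID_SCALE = 1.0 / (SIGMOID_QMAX - SIGMOID_QMIN + 1)     # 1 / 65536
SIGMOID_ZERO_POINT = 0                                      # assumed


def fixed_fake_quant(x: float,
                     scale: float = SIGMOID_SCALE,
                     zero_point: int = SIGMOID_ZERO_POINT,
                     qmin: int = SIGMOID_QMIN,
                     qmax: int = SIGMOID_QMAX) -> float:
    """Quantize then dequantize with parameters that never get re-observed."""
    q = round(x / scale) + zero_point
    q = max(qmin, min(qmax, q))
    return (q - zero_point) * scale


mid = fixed_fake_quant(0.5)   # exactly representable
top = fixed_fake_quant(1.0)   # saturates one step below 1.0
```

Because a sigmoid output always lies in [0, 1], fixing the scale up front avoids depending on calibration data for this activation.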

381-381: Typo: layer_dix → layer_idx for consistency.

Line 381 looks like a misspelling; consider renaming for clarity and to match the rest of the codebase.
♻️ Suggested fix
-        self.layer_dix = layer_idx
+        self.layer_idx = layer_idx
As per coding guidelines, keep naming consistent.
pymllm/backends/qualcomm/transformers/qwen3/runner.py (1)
37-45: Consider using tuple in isinstance checks for cleaner code.

The logic is correct, but you can simplify the condition using a tuple.
♻️ Suggested refactor
 def enable_fake_quant(m):
-    if isinstance(m, ActivationQDQ) or isinstance(m, FixedActivationQDQ):
+    if isinstance(m, (ActivationQDQ, FixedActivationQDQ)):
         m.enable_fakequant()


 def disable_fake_quant(m):
-    if isinstance(m, ActivationQDQ) or isinstance(m, FixedActivationQDQ):
+    if isinstance(m, (ActivationQDQ, FixedActivationQDQ)):
         m.disable_fakequant()
This matches the pattern used in freeze_qwen3_linear_weight and is more idiomatic Python.
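A self-contained sketch of the tuple form is shown below. The classes are hypothetical stand-ins that only mimic the `enable_fakequant`/`disable_fakequant` method names from the diff, and a plain list replaces the module traversal.

```python
class _FakeQuantToggle:
    """Hypothetical base mimicking the toggle methods named in the review."""

    def __init__(self) -> None:
        self.fake_quant_enabled = True

    def enable_fakequant(self) -> None:
        self.fake_quant_enabled = True

    def disable_fakequant(self) -> None:
        self.fake_quant_enabled = False


class ActivationQDQ(_FakeQuantToggle):
    pass


class FixedActivationQDQ(_FakeQuantToggle):
    pass


class PlainLayer:
    """A module type that must be left untouched."""


# One tuple keeps the accepted types in a single place.
QDQ_TYPES = (ActivationQDQ, FixedActivationQDQ)


def enable_fake_quant(m) -> None:
    if isinstance(m, QDQ_TYPES):
        m.enable_fakequant()


def disable_fake_quant(m) -> None:
    if isinstance(m, QDQ_TYPES):
        m.disable_fakequant()


modules = [ActivationQDQ(), FixedActivationQDQ(), PlainLayer()]
for module in modules:
    disable_fake_quant(module)
states = [getattr(m, "fake_quant_enabled", None) for m in modules]
```

Adding another QDQ variant later then means extending `QDQ_TYPES` rather than editing both helpers.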

coderabbitai · 2026-01-20T05:42:47Z

mllm/backends/qnn/aot/passes/PTQPass.cpp

+      for (auto iii : inputs) {
+        if (!iii->isa_<ir::tensor::TensorValue>()) continue;
+        auto tv = iii->cast_<ir::tensor::TensorValue>();
+        if (!tv->getAttr("quant_recipe")) continue;
+        auto f_spec = tv->getAttr("quant_recipe")->cast_<ir::linalg::LinalgIRQuantizatonSpecAttr>();
+
+        if (f_spec->spec_->type == ir::linalg::QuantizationSpecType::kAsymPerTensor) {
+          auto this_spec = std::static_pointer_cast<ir::linalg::QuantizationSpecAsymPerTensor>(f_spec->spec_);
+          if (!this_spec->solved) continue;
+
+          if (!has_ref) {
+            ref_scale = this_spec->scale;
+            ref_zero_point = this_spec->zero_point;
+            ref_input_name = tv->name();
+            has_ref = true;
+          } else {
+            // Check if scale and zero_point match
+            auto cur_scale = this_spec->scale;
+            auto cur_zero_point = this_spec->zero_point;
+
+            MLLM_RT_ASSERT_EQ(ref_scale.numel(), 1);
+            MLLM_RT_ASSERT_EQ(cur_scale.numel(), 1);
+            MLLM_RT_ASSERT_EQ(ref_zero_point.numel(), 1);
+            MLLM_RT_ASSERT_EQ(cur_zero_point.numel(), 1);
+
+            auto ref_scale_v = ref_scale.item<mllm_fp32_t>();
+            auto cur_scale_v = cur_scale.item<mllm_fp32_t>();
+            auto ref_zp_v = ref_zero_point.item<mllm_int32_t>();
+            auto cur_zp_v = cur_zero_point.item<mllm_int32_t>();
+
+            if (std::abs(ref_scale_v - cur_scale_v) > 1e-6 || ref_zp_v != cur_zp_v) {
+              MLLM_ERROR("PTQPass: ConcatOp '{}' has mismatched scale/zp between inputs. "
+                         "Input '{}': scale={}, zp={}; Input '{}': scale={}, zp={}",
+                         op_name, ref_input_name, ref_scale_v, ref_zp_v, tv->name(), cur_scale_v, cur_zp_v);
+            }
+          }
+        } else if (f_spec->spec_->type == ir::linalg::QuantizationSpecType::kSymPerTensor) {
+          auto this_spec = std::static_pointer_cast<ir::linalg::QuantizationSpecSymPerTensor>(f_spec->spec_);
+          if (!this_spec->solved) continue;
+
+          if (!has_ref) {
+            ref_scale = this_spec->scale;
+            ref_input_name = tv->name();
+            has_ref = true;
+          } else {
+            // Check if scale matches
+            auto cur_scale = this_spec->scale;
+
+            MLLM_RT_ASSERT_EQ(ref_scale.numel(), 1);
+            MLLM_RT_ASSERT_EQ(cur_scale.numel(), 1);
+
+            auto ref_scale_v = ref_scale.item<mllm_fp32_t>();
+            auto cur_scale_v = cur_scale.item<mllm_fp32_t>();
+
+            if (std::abs(ref_scale_v - cur_scale_v) > 1e-6) {
+              MLLM_ERROR("PTQPass: ConcatOp '{}' has mismatched scale between inputs. "
+                         "Input '{}': scale={}; Input '{}': scale={}",
+                         op_name, ref_input_name, ref_scale_v, tv->name(), cur_scale_v);
+            }
+          }
+        }


⚠️ Potential issue | 🟠 Major

Potential undefined behavior when mixing quantization spec types.

If the first input has kSymPerTensor (which only sets ref_scale) and a subsequent input has kAsymPerTensor, the code at line 385 will read ref_zero_point which was never initialized, leading to undefined behavior.

Consider either:

Tracking which spec type the reference was captured from and only comparing inputs of the same type.

Emitting an error/warning when inputs have mismatched quantization spec types.

Proposed fix to track reference spec type

 Tensor ref_scale;
 Tensor ref_zero_point;
 bool has_ref = false;
 std::string ref_input_name;
+ir::linalg::QuantizationSpecType ref_spec_type = ir::linalg::QuantizationSpecType::kRaw;

 for (auto iii : inputs) {
   if (!iii->isa_<ir::tensor::TensorValue>()) continue;
   auto tv = iii->cast_<ir::tensor::TensorValue>();
   if (!tv->getAttr("quant_recipe")) continue;
   auto f_spec = tv->getAttr("quant_recipe")->cast_<ir::linalg::LinalgIRQuantizatonSpecAttr>();

   if (f_spec->spec_->type == ir::linalg::QuantizationSpecType::kAsymPerTensor) {
     auto this_spec = std::static_pointer_cast<ir::linalg::QuantizationSpecAsymPerTensor>(f_spec->spec_);
     if (!this_spec->solved) continue;

     if (!has_ref) {
       ref_scale = this_spec->scale;
       ref_zero_point = this_spec->zero_point;
       ref_input_name = tv->name();
+      ref_spec_type = ir::linalg::QuantizationSpecType::kAsymPerTensor;
       has_ref = true;
     } else {
+      if (ref_spec_type != ir::linalg::QuantizationSpecType::kAsymPerTensor) {
+        MLLM_ERROR("PTQPass: ConcatOp '{}' has mismatched quantization spec types between inputs.", op_name);
+        continue;
+      }
       // Check if scale and zero_point match
       // ... existing code ...
     }
   } else if (f_spec->spec_->type == ir::linalg::QuantizationSpecType::kSymPerTensor) {
     auto this_spec = std::static_pointer_cast<ir::linalg::QuantizationSpecSymPerTensor>(f_spec->spec_);
     if (!this_spec->solved) continue;

     if (!has_ref) {
       ref_scale = this_spec->scale;
       ref_input_name = tv->name();
+      ref_spec_type = ir::linalg::QuantizationSpecType::kSymPerTensor;
       has_ref = true;
     } else {
+      if (ref_spec_type != ir::linalg::QuantizationSpecType::kSymPerTensor) {
+        MLLM_ERROR("PTQPass: ConcatOp '{}' has mismatched quantization spec types between inputs.", op_name);
+        continue;
+      }
       // Check if scale matches
       // ... existing code ...
     }
   }
 }
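The type-tracking check described in this comment can be modelled in a few lines of Python. This is a sketch of the control flow only: the `Spec` dataclass, the string spec kinds and `check_concat_inputs` are hypothetical and do not mirror the mllm IR classes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Spec:
    """Hypothetical per-input quantization spec."""
    name: str
    kind: str                       # "asym" or "sym"
    scale: float
    zero_point: Optional[int] = None
    solved: bool = True


def check_concat_inputs(op_name: str, specs: List[Spec], tol: float = 1e-6) -> List[str]:
    """Return error messages for inputs that disagree with the first solved one."""
    errors: List[str] = []
    ref: Optional[Spec] = None
    for spec in specs:
        if not spec.solved:
            continue
        if ref is None:
            ref = spec  # the reference remembers its own kind
            continue
        if spec.kind != ref.kind:
            # Never compare a zero point that the reference does not have.
            errors.append(
                f"{op_name}: mismatched spec types between '{ref.name}' and '{spec.name}'"
            )
            continue
        scale_differs = abs(ref.scale - spec.scale) > tol
        zp_differs = spec.kind == "asym" and ref.zero_point != spec.zero_point
        if scale_differs or zp_differs:
            errors.append(
                f"{op_name}: mismatched scale/zp between '{ref.name}' and '{spec.name}'"
            )
    return errors


mixed = check_concat_inputs("concat0", [Spec("a", "sym", 0.1), Spec("b", "asym", 0.1, 0)])
matching = check_concat_inputs("concat0", [Spec("a", "asym", 0.1, 3), Spec("b", "asym", 0.1, 3)])
zp_mismatch = check_concat_inputs("concat0", [Spec("a", "asym", 0.1, 3), Spec("b", "asym", 0.1, 4)])
```

Storing the whole reference spec, rather than separate scale and zero-point variables, makes it impossible to read a field that was never captured.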

🤖 Prompt for AI Agents

In `@mllm/backends/qnn/aot/passes/PTQPass.cpp` around lines 358 - 418, The loop in PTQPass.cpp can read ref_zero_point uninitialized if the first captured reference is kSymPerTensor then later an kAsymPerTensor is compared; add tracking for the reference spec type (e.g., an enum/ref_spec_type alongside has_ref) when you set ref_scale/ref_zero_point/ref_input_name, and before comparing a new input check that f_spec->spec_->type matches ref_spec_type; if types differ emit a clear MLLM_ERROR/MLLM_WARN mentioning op_name and both input names and skip comparison (or fail early), and only access ref_zero_point when ref_spec_type == kAsymPerTensor so no uninitialized reads occur.

coderabbitai · 2026-01-20T05:42:47Z

pymllm/backends/qualcomm/transformers/qwen3/runner.py

        self.model = Qwen3ForCausalLM.from_pretrained(
            model_path,
            attn_implementation="eager",
+            dtype=torch.float32,
        )


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Find Qwen3ForCausalLM class definition
fd -e py -e pyi | xargs rg -l "class Qwen3ForCausalLM" | head -20

Repository: UbiquitousLearning/mllm

Length of output: 128

🏁 Script executed:

# Also search for the file structure under qwen3 backend
find pymllm/backends/qualcomm/transformers/qwen3 -type f -name "*.py" 2>/dev/null

Repository: UbiquitousLearning/mllm

Length of output: 235

🏁 Script executed:

# Check the imports and context in the runner.py file
cat -n pymllm/backends/qualcomm/transformers/qwen3/runner.py | head -100

Repository: UbiquitousLearning/mllm

Length of output: 3860

🏁 Script executed:

# Examine the Qwen3ForCausalLM class and from_pretrained method
cat -n pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py | head -200

Repository: UbiquitousLearning/mllm

Length of output: 9569

🏁 Script executed:

# Search for from_pretrained method in the modeling file
rg -n "from_pretrained" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -A 20

Repository: UbiquitousLearning/mllm

Length of output: 1197

🏁 Script executed:

# Search for the Qwen3ForCausalLM class definition
rg -n "^class Qwen3ForCausalLM" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -A 30

Repository: UbiquitousLearning/mllm

Length of output: 1354

🏁 Script executed:

# Check if Qwen3ForCausalLM inherits from PreTrainedModel or has its own from_pretrained
rg -n "class Qwen3ForCausalLM" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -B 2 -A 5

Repository: UbiquitousLearning/mllm

Length of output: 357

🏁 Script executed:

# Search for any custom from_pretrained or _from_pretrained method in the file
rg -n "def.*from_pretrained|def _from_pretrained" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -A 20

Repository: UbiquitousLearning/mllm

Length of output: 49

🏁 Script executed:

# Find Qwen3PreTrainedModel definition
rg -n "class Qwen3PreTrainedModel" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -A 20

Repository: UbiquitousLearning/mllm

Length of output: 779

🏁 Script executed:

# Check the transformers library's PreTrainedModel to understand the from_pretrained signature
# Let's look for how PreTrainedModel is imported and check if dtype parameter is used elsewhere
rg -n "torch_dtype|dtype=" pymllm/backends/qualcomm/transformers/qwen3/ -t py | head -30

Repository: UbiquitousLearning/mllm

Length of output: 1329

🏁 Script executed:

# Check the actual transformers library PreTrainedModel to see what parameters from_pretrained accepts
# Look for any kwargs handling or parameter documentation in the model class
rg -n "def from_pretrained" /usr/local/lib/python*/dist-packages/transformers/modeling_utils.py 2>/dev/null || echo "Checking local transformers..."

# Instead, let's check what the actual HuggingFace transformers library uses
python3 << 'EOF'
try:
    from transformers import PreTrainedModel
    import inspect
    sig = inspect.signature(PreTrainedModel.from_pretrained)
    print("PreTrainedModel.from_pretrained signature:")
    print(sig)
    print("\nParameters:")
    for param_name, param in sig.parameters.items():
        print(f" {param_name}: {param}")
except Exception as e:
    print(f"Error: {e}")
EOF

Repository: UbiquitousLearning/mllm

Length of output: 153

🏁 Script executed:

# Let's also check if there are any test files that show how this is used
find pymllm -name "*test*.py" -o -name "*example*.py" | xargs rg -l "Qwen3ForCausalLM" 2>/dev/null | head -5

Repository: UbiquitousLearning/mllm

Length of output: 49

🏁 Script executed:

# Check if there's any documentation or comments about the dtype parameter
rg -n "dtype|torch_dtype" pymllm/backends/qualcomm/transformers/qwen3/runner.py -B 3 -A 3

Repository: UbiquitousLearning/mllm

Length of output: 341

🏁 Script executed:

# Let's search the entire codebase for how from_pretrained is used with dtype vs torch_dtype
rg -n "from_pretrained.*dtype" . -t py --max-count=20

Repository: UbiquitousLearning/mllm

Length of output: 49

🏁 Script executed:

# Check if there's a __init__.py or other configuration that might override from_pretrained
find pymllm/backends/qualcomm/transformers/qwen3 -name "*.py" -exec grep -l "from_pretrained" {} \;

Repository: UbiquitousLearning/mllm

Length of output: 182

🏁 Script executed:

# Let's check the transformers package version and see what parameters PreTrainedModel accepts
python3 << 'EOF'
try:
    import transformers
    print(f"Transformers version: {transformers.__version__}")

    # Check the from_pretrained method
    from transformers.modeling_utils import PreTrainedModel
    import inspect

    # Get the source file location
    source_file = inspect.getfile(PreTrainedModel)
    print(f"Source file: {source_file}")

    # Check the method signature (first few lines)
    source = inspect.getsource(PreTrainedModel.from_pretrained)
    lines = source.split('\n')[:30]
    for i, line in enumerate(lines):
        print(f"{i}: {line}")
except Exception as e:
    print(f"Error: {e}")
EOF

Repository: UbiquitousLearning/mllm

Length of output: 104

🏁 Script executed:

# Check if dtype is used as a kwarg anywhere in PreTrainedModel or similar classes
rg -n "kwargs.*dtype|dtype.*kwargs" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py

Repository: UbiquitousLearning/mllm

Length of output: 49

🏁 Script executed:

# Let's look for any PreTrainedModel-related code in the qwen3 implementation
rg -n "PreTrainedModel\|from_pretrained" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -B 2 -A 5

Repository: UbiquitousLearning/mllm

Length of output: 49

🏁 Script executed:

# Let's check the train.py file to see if there are other from_pretrained calls
rg -n "from_pretrained" pymllm/backends/qualcomm/transformers/qwen3/train.py -B 2 -A 3

Repository: UbiquitousLearning/mllm

Length of output: 49

🏁 Script executed:

# Search the entire mllm/mllm-cli/pymllm codebase for from_pretrained usage patterns
rg -n "\.from_pretrained\(" pymllm/ -t py --max-count=20

Repository: UbiquitousLearning/mllm

Length of output: 575

🏁 Script executed:

# Check if Qwen3PreTrainedModel has any custom __init_subclass__ or similar that might handle dtype
cat -n pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py | sed -n '443,500p'

Repository: UbiquitousLearning/mllm

Length of output: 2587

🏁 Script executed:

# Check the entire modeling_qwen3.py to see if there's any custom from_pretrained override
grep -n "def from_pretrained\|def _from_pretrained" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py

Repository: UbiquitousLearning/mllm

Length of output: 49

🏁 Script executed:

# Let's check the HuggingFace documentation or examples to confirm the correct parameter
# Search for torch_dtype usage in the codebase
rg -n "torch_dtype" pymllm/ -t py

Repository: UbiquitousLearning/mllm

Length of output: 49

🏁 Script executed:

# Check if there's any recent commit or documentation about this
git log --oneline pymllm/backends/qualcomm/transformers/qwen3/runner.py 2>/dev/null | head -10

Repository: UbiquitousLearning/mllm

Length of output: 128

🏁 Script executed:

# Let's check if there are any test files that run this code
find pymllm -type f \( -name "*test*.py" -o -name "*example*.py" \) | head -10

Repository: UbiquitousLearning/mllm

Length of output: 209

🏁 Script executed:

# Check the HuggingFace transformers library documentation or source
# Let's look for any **kwargs handling in PreTrainedModel
python3 << 'EOF'
# Check what parameters PreTrainedModel.from_pretrained accepts by looking at docs
import sys
try:
    # Try to import and inspect
    from transformers.modeling_utils import PreTrainedModel
    help(PreTrainedModel.from_pretrained)
except ImportError:
    print("transformers not installed in this environment")
    # But we can check online documentation patterns
    print("According to HuggingFace documentation, PreTrainedModel.from_pretrained uses:")
    print("- torch_dtype (not dtype) parameter")
    print("- This is standard across all transformer models")
EOF

Repository: UbiquitousLearning/mllm

Length of output: 276

🏁 Script executed:

# Let's check if there are any comments or documentation in the modeling file about this
rg -n "torch_dtype|dtype.*pretrained|pretrained.*dtype" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -B 2 -A 2

Repository: UbiquitousLearning/mllm

Length of output: 49

🏁 Script executed:

# Let's run a direct test to see if the parameter works
# Check if there are actual test executions or imports that would fail
rg -n "import Qwen3ForCausalLM\|from.*Qwen3ForCausalLM" pymllm/ -t py

Repository: UbiquitousLearning/mllm

Length of output: 49

🏁 Script executed:

# Check if there's any error handling around the from_pretrained call
cat -n pymllm/backends/qualcomm/transformers/qwen3/runner.py | sed -n '54,65p'

Repository: UbiquitousLearning/mllm

Length of output: 637

🏁 Script executed:

# Let's look for any **kwargs or parameter unpacking that might accept dtype
rg -n "kwargs\|\*\*" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py | grep -i "from_pretrained\|init" | head -20

Repository: UbiquitousLearning/mllm

Length of output: 49

🏁 Script executed:

# Let's check what the transformers library version expects
# Search online documentation or check if there's a requirements.txt
find . -name "requirements.txt" -o -name "setup.py" -o -name "pyproject.toml" | xargs cat 2>/dev/null | grep -i transformers

Repository: UbiquitousLearning/mllm

Length of output: 49

🏁 Script executed:

# Let's verify by checking the actual transformers code in site-packages if available
python3 << 'EOF'
import inspect
try:
    from transformers.modeling_utils import PreTrainedModel
    source = inspect.getsource(PreTrainedModel.from_pretrained)
    # Print first 100 lines to see the signature and parameter handling
    lines = source.split('\n')[:100]
    for i, line in enumerate(lines, 1):
        print(f"{i:3d}: {line}")
except Exception as e:
    print(f"Cannot inspect: {e}")
EOF

Repository: UbiquitousLearning/mllm

Length of output: 113

🏁 Script executed:

# Check the standard HuggingFace API - torch_dtype is the correct parameter
# Let's see if there's any documentation or comments in the code
rg -n "torch_dtype" . -t py --max-count=10

Repository: UbiquitousLearning/mllm

Length of output: 49

🏁 Script executed:

# Final check: Look for any other model loading patterns in the codebase to compare
rg -n "\.from_pretrained" pymllm/ -t py -B 1 -A 2 | head -30

Repository: UbiquitousLearning/mllm

Length of output: 1153

Change dtype=torch.float32 to torch_dtype=torch.float32.

The standard HuggingFace PreTrainedModel.from_pretrained() API uses torch_dtype as the parameter name, not dtype. Since Qwen3ForCausalLM inherits from PreTrainedModel without overriding from_pretrained, this parameter name must be corrected.

🤖 Prompt for AI Agents

In `@pymllm/backends/qualcomm/transformers/qwen3/runner.py` around lines 57 - 61, The call to Qwen3ForCausalLM.from_pretrained uses the wrong keyword arg name `dtype`; update the call in runner.py where Qwen3ForCausalLM.from_pretrained(model_path, attn_implementation="eager", dtype=torch.float32) is invoked to use the correct HuggingFace parameter name `torch_dtype=torch.float32` so the dtype is passed properly to the PreTrainedModel loader.

coderabbitai · 2026-01-20T05:42:47Z

pymllm/backends/qualcomm/transformers/qwen3/train.py

+    # FIXME: Should disable or not.
+    m.disable_fake_quant()
    m.calibrate(num_samples=args.num_samples, max_seq_length=args.max_length)
-    # m.compile()
+    m.enable_fake_quant()
    m.infer(args.infer_text)


⚠️ Potential issue | 🟡 Minor

Resolve the fake‑quant toggle decision before release.

Line 41 explicitly flags uncertainty about disabling fake quant before calibration. Please decide the intended behavior (or gate it behind a flag) and remove the FIXME to avoid inconsistent calibration runs.

🤖 Prompt for AI Agents

In `@pymllm/backends/qualcomm/transformers/qwen3/train.py` around lines 41 - 45, Decide and implement the intended fake-quant behavior by removing the FIXME and either making the disable/enable calls deterministic or exposing them as a CLI/config flag; e.g., add a boolean flag (args.disable_fake_quant_before_calibration) and use it to conditionally call m.disable_fake_quant() before m.calibrate(...) and m.enable_fake_quant() after, ensuring the sequence around m.calibrate(...) and m.infer(...) is deterministic and documented in the flag help text.

coderabbitai · 2026-01-20T05:42:47Z

pymllm/backends/qualcomm/transformers/qwen3/train.py

+    # This line maybe error. we need use quantized weight!!! not embed_tokens.weight!!!
    m.model.lm_head.weight = torch.nn.Parameter(
        m.model.model.embed_tokens.weight.clone()
    )


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "train.py" -path "*/qwen3/*" -type f

Repository: UbiquitousLearning/mllm

Length of output: 121

🏁 Script executed:

git ls-files | head -20

Repository: UbiquitousLearning/mllm

Length of output: 712

🏁 Script executed:

wc -l ./pymllm/backends/qualcomm/transformers/qwen3/train.py

Repository: UbiquitousLearning/mllm

Length of output: 124

🏁 Script executed:

cat -n ./pymllm/backends/qualcomm/transformers/qwen3/train.py | head -100

Repository: UbiquitousLearning/mllm

Length of output: 2309

🏁 Script executed:

find . -name "runner.py" -path "*/qwen3/*" -type f

Repository: UbiquitousLearning/mllm

Length of output: 122

🏁 Script executed:

wc -l ./pymllm/backends/qualcomm/transformers/qwen3/runner.py

Repository: UbiquitousLearning/mllm

Length of output: 126

🏁 Script executed:

cat -n ./pymllm/backends/qualcomm/transformers/qwen3/runner.py

Repository: UbiquitousLearning/mllm

Length of output: 8282

🏁 Script executed:

find . -name "modeling_qwen3.py" -type f

Repository: UbiquitousLearning/mllm

Length of output: 130

🏁 Script executed:

wc -l ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py

Repository: UbiquitousLearning/mllm

Length of output: 134

🏁 Script executed:

rg -n "lm_head\|embed_tokens" ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -A 2 -B 2

Repository: UbiquitousLearning/mllm

Length of output: 49

🏁 Script executed:

cat -n ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py | grep -E "(lm_head|embed_tokens)" -A 2 -B 2

Repository: UbiquitousLearning/mllm

Length of output: 1543

🏁 Script executed:

cat -n ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py | tail -100

Repository: UbiquitousLearning/mllm

Length of output: 4486

🏁 Script executed:

rg -n "_tied_weights_keys\|tie.*weight" ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -B 2 -A 2

Repository: UbiquitousLearning/mllm

Length of output: 49

🏁 Script executed:

rg -n "class QLinearLPBQ" ./pymllm/backends/qualcomm/transformers/core/ -l

Repository: UbiquitousLearning/mllm

Length of output: 122

🏁 Script executed:

grep -n "class QLinearLPBQ" ./pymllm/backends/qualcomm/transformers/core/qlinear.py -A 30

Repository: UbiquitousLearning/mllm

Length of output: 1253

🏁 Script executed:

grep -n "convert_to_conv2d_deploy_hwio" ./pymllm/backends/qualcomm/transformers/core/qlinear.py -B 5 -A 20

Repository: UbiquitousLearning/mllm

Length of output: 2235

🏁 Script executed:

cat -n ./pymllm/backends/qualcomm/transformers/core/qlinear.py | sed -n '228,280p'

Repository: UbiquitousLearning/mllm

Length of output: 2315

🏁 Script executed:

grep -n "weight_quant\|freeze_weight\|deploy_mode" ./pymllm/backends/qualcomm/transformers/core/qlinear.py | head -40

Repository: UbiquitousLearning/mllm

Length of output: 1976

🏁 Script executed:

grep -n "freeze_weight\|_tied_weights" ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py

Repository: UbiquitousLearning/mllm

Length of output: 114

🏁 Script executed:

cat -n ./pymllm/backends/qualcomm/transformers/core/qlinear.py | sed -n '50,110p'

Repository: UbiquitousLearning/mllm

Length of output: 2561

🏁 Script executed:

grep -n "post_init\|tie_weights\|_init_weights" ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -A 5

Repository: UbiquitousLearning/mllm

Length of output: 444

🏁 Script executed:

grep -n "class Qwen3PreTrainedModel" ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -A 10

Repository: UbiquitousLearning/mllm

Length of output: 478

🏁 Script executed:

python3 << 'EOF'
# Let's understand the quantization flow more clearly
# Check if there's documentation or if we can trace the FakeQuantize behavior
import subprocess
result = subprocess.run(['grep', '-n', 'class.*DoubleQuantizer', './pymllm/backends/qualcomm/transformers/core/qlinear.py'], capture_output=True, text=True)
print(result.stdout)
result = subprocess.run(['grep', '-n', 'class PreTrainedModel\|tie_weights\|post_init', './pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py'], capture_output=True, text=True)
print(result.stdout)
EOF

Repository: UbiquitousLearning/mllm

Length of output: 164

🏁 Script executed:

find . -name "*.py" -path "*core*" | xargs grep -l "DoubleQuantizer" 2>/dev/null

Repository: UbiquitousLearning/mllm

Length of output: 122

🏁 Script executed:

cat -n ./pymllm/backends/qualcomm/transformers/core/qlinear.py | sed -n '161,225p'

Repository: UbiquitousLearning/mllm

Length of output: 2953

🏁 Script executed:

grep -n "self.embed_tokens\|embed_tokens.*=" ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py | head -20

Repository: UbiquitousLearning/mllm

Length of output: 173

🏁 Script executed:

cat -n ./pymllm/backends/qualcomm/transformers/qwen3/runner.py | sed -n '47,52p'

Repository: UbiquitousLearning/mllm

Length of output: 316

🏁 Script executed:

grep -n "def freeze_weight\|is_frozen" ./pymllm/backends/qualcomm/transformers/core/qlinear.py | head -15

Repository: UbiquitousLearning/mllm

Length of output: 292

🏁 Script executed:

cat -n ./pymllm/backends/qualcomm/transformers/qwen3/runner.py | sed -n '66,69p'

Repository: UbiquitousLearning/mllm

Length of output: 279

Quantized weights will not match the assigned lm_head.weight parameter.

Lines 50–53 assign unquantized embed_tokens.weight to lm_head.weight before m.convert() is called (line 54). Since lm_head is a QLinearLPBQ and its weights are frozen during initialization (runner.py line 68), the frozen weight_quant still contains the old quantized weights. When convert() runs, it extracts the stale quantized weights from weight_quant.weight_q rather than from the newly assigned parameter, resulting in a mismatch between the exported quantized weights and the parameter. Either re-freeze the quantized weights after assignment or perform the tying after convert() completes.

🤖 Prompt for AI Agents

In `@pymllm/backends/qualcomm/transformers/qwen3/train.py` around lines 50 - 53, The assigned lm_head parameter uses unquantized embed_tokens weights before m.convert(), causing QLinearLPBQ's frozen weight_quant to remain stale; fix by moving the weight tying to after m.convert() (i.e., set m.model.lm_head.weight = Parameter(m.model.model.embed_tokens.weight.clone()) only once convert() has run) or, if tying must happen before convert(), update/re-freeze the QLinearLPBQ internal quant state (weight_quant) after assignment so weight_quant.weight_q reflects the new parameter; refer to m.model.lm_head.weight, m.model.model.embed_tokens.weight, m.convert(), and the QLinearLPBQ frozen weight_quant initialization to implement the change.

… zeros, ones, specific values, arange, and random fills. Introduce a new fill-inl.hpp file for optimized implementations and update kernel dispatch to include these operations. Enhance CPUFillOp to utilize the new fill functions for better performance and maintainability.

coderabbitai

Actionable comments posted: 6

🤖 Fix all issues with AI agents

In `@mllm/backends/cpu/kernels/common/fill-inl.hpp`:
- Around line 8-11: This file relies on std::is_trivial_v, std::is_same_v,
std::ceil, and std::min but doesn't include the matching standard headers; add
explicit includes for <type_traits>, <cmath>, and <algorithm> near the top
alongside the existing includes (after `#include` <cstring> / before
"mllm/core/DataTypes.hpp") so that uses in functions/template code referencing
std::is_trivial_v, std::is_same_v, std::ceil, and std::min are self-contained
and no longer depend on transitive headers.
- Around line 196-244: The function fill_arange_impl computes n (the number of
sequence elements) and may leave dst[n..count-1] uninitialized; make the
contract explicit and avoid silent gaps by: after computing n, add a runtime
check/assert that either n == count (e.g., assert(n == count) or throw) or, if
you want to tolerate count > n, explicitly fill the remainder via
fill_value_impl(dst + n, count - n, static_cast<T>(start + n * step)) so no tail
remains uninitialized; update callers/docs accordingly. Reference:
fill_arange_impl, variable n, and fill_value_impl.

In `@mllm/backends/cpu/kernels/common/kernel_dispatch.hpp`:
- Around line 10-12: The file is missing `#include` <type_traits> required by the
template fill_zeros_anytype which uses std::is_same_v; add the header to the top
of kernel_dispatch.hpp (alongside <cstring> and "mllm/core/DataTypes.hpp") so
std::is_same_v resolves properly and avoids relying on transitive includes
referenced in fill_zeros_anytype.
- Around line 105-131: The fallback in fill_zeros_anytype currently uses
std::memset for all unmatched types; restrict that to only trivial types by
checking std::is_trivial_v<T> and for non-trivial types perform an element-wise
zero assignment (e.g., for (size_t i=0;i<n;++i) dst[i] = T{}), mirroring the
safe approach used in fill_ones_anytype; update the else branch in
fill_zeros_anytype to first if constexpr (std::is_trivial_v<T>) use std::memset,
else use the element-wise loop to avoid undefined behavior on non-trivial types.

In `@mllm/backends/cpu/ops/FillOp.cpp`:
- Around line 31-34: The file uses std::memset in FillOp.cpp (inside the
x86/x86_64 branch) but does not include <cstring>, so add the missing include to
the top of the file; update FillOp.cpp to `#include` <cstring> (alongside other
headers) so std::memset is declared and the file is self-contained and
consistent with other CPU ops.

In `@pymllm/ffi/__init__.py`:
- Around line 291-293: The global `cuda: Device = cuda_()` is wrong because
`cuda_()` currently returns a DType; update the implementation of the `cuda_()`
factory so it returns a Device instance (matching what `cpu_()` and `qnn_()`
return) — e.g. call or wrap the appropriate FFI API that constructs a Device
(like `_ffi_api.cuda_()` or return Device(...) from the FFI result) and keep the
global `cuda` assignment and type annotation as `Device`; ensure the returned
object implements the same Device interface used by `device("cuda")` and
`.to(...)`.
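
The `fill_zeros_anytype` item above (restrict `std::memset` to trivial types, assign element-wise otherwise) could look roughly like the sketch below. This is an illustration only, not the PR's code: `fill_zeros_sketch` and `Tracked` are hypothetical names, and the per-dtype fast paths of the real function are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical stand-in for the generic fallback branch of fill_zeros_anytype.
template<typename T>
void fill_zeros_sketch(T* dst, std::size_t n) {
  assert(dst != nullptr || n == 0);
  if constexpr (std::is_trivial_v<T>) {
    // Trivial types can safely be zeroed byte-wise.
    if (n > 0) { std::memset(dst, 0, n * sizeof(T)); }
  } else {
    // Non-trivial types: assign a value-initialized object to each element.
    for (std::size_t i = 0; i < n; ++i) { dst[i] = T{}; }
  }
}

// Hypothetical non-trivial element type (user-provided default constructor),
// used only to exercise the element-wise branch.
struct Tracked {
  int value;
  Tracked() : value(0) {}
};
```

With this shape, `fill_zeros_sketch<int>` compiles down to the `memset` path, while `fill_zeros_sketch<Tracked>` never touches the object representation directly.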

♻️ Duplicate comments (1)

mllm/backends/qnn/aot/passes/PTQPass.cpp (1)

342-420: Handle mixed quantization spec types before comparing concat inputs.
If the reference comes from kSymPerTensor and a later input is kAsymPerTensor, ref_zero_point is never initialized and gets read, and scale comparisons mix incompatible spec types. Track the reference spec type and short‑circuit on mismatches.

🛠️ Suggested fix (track reference spec type)

       Tensor ref_scale;
       Tensor ref_zero_point;
       bool has_ref = false;
       std::string ref_input_name;
+      ir::linalg::QuantizationSpecType ref_spec_type = ir::linalg::QuantizationSpecType::kRaw;

       for (auto iii : inputs) {
         if (!iii->isa_<ir::tensor::TensorValue>()) continue;
         auto tv = iii->cast_<ir::tensor::TensorValue>();
         if (!tv->getAttr("quant_recipe")) continue;
         auto f_spec = tv->getAttr("quant_recipe")->cast_<ir::linalg::LinalgIRQuantizatonSpecAttr>();

         if (f_spec->spec_->type == ir::linalg::QuantizationSpecType::kAsymPerTensor) {
           auto this_spec = std::static_pointer_cast<ir::linalg::QuantizationSpecAsymPerTensor>(f_spec->spec_);
           if (!this_spec->solved) continue;

           if (!has_ref) {
             ref_scale = this_spec->scale;
             ref_zero_point = this_spec->zero_point;
             ref_input_name = tv->name();
+            ref_spec_type = ir::linalg::QuantizationSpecType::kAsymPerTensor;
             has_ref = true;
           } else {
+            if (ref_spec_type != ir::linalg::QuantizationSpecType::kAsymPerTensor) {
+              MLLM_ERROR("PTQPass: ConcatOp '{}' has mismatched quantization spec types between inputs.", op_name);
+              continue;
+            }
             // Check if scale and zero_point match
             ...
           }
         } else if (f_spec->spec_->type == ir::linalg::QuantizationSpecType::kSymPerTensor) {
           auto this_spec = std::static_pointer_cast<ir::linalg::QuantizationSpecSymPerTensor>(f_spec->spec_);
           if (!this_spec->solved) continue;

           if (!has_ref) {
             ref_scale = this_spec->scale;
             ref_input_name = tv->name();
+            ref_spec_type = ir::linalg::QuantizationSpecType::kSymPerTensor;
             has_ref = true;
           } else {
+            if (ref_spec_type != ir::linalg::QuantizationSpecType::kSymPerTensor) {
+              MLLM_ERROR("PTQPass: ConcatOp '{}' has mismatched quantization spec types between inputs.", op_name);
+              continue;
+            }
             // Check if scale matches
             ...
           }
         }
       }
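
The idea behind the suggested fix (remember which spec type the reference came from and refuse to compare across types) can be shown in isolation. The sketch below uses simplified, hypothetical stand-ins (`SpecType`, `InputSpec`, `concatInputsConsistent`) rather than the real `ir::linalg` classes, and it returns `false` on a mismatch, whereas the diff above logs with `MLLM_ERROR` and continues.

```cpp
#include <cassert>
#include <cmath>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the IR quantization spec types.
enum class SpecType { kAsymPerTensor, kSymPerTensor };

struct InputSpec {
  std::string name;
  SpecType type;
  bool solved;
  float scale;
  int zero_point;  // only meaningful for kAsymPerTensor
};

// Returns true when every solved input agrees with the first solved input.
bool concatInputsConsistent(const std::vector<InputSpec>& inputs, float tol = 1e-6f) {
  assert(tol >= 0.0f);
  std::optional<InputSpec> ref;  // first solved input, including its spec type
  for (const auto& in : inputs) {
    if (!in.solved) continue;  // unsolved specs are skipped, as in the diff
    if (!ref) {
      ref = in;
      continue;
    }
    // Short-circuit before touching scale/zero_point of an incompatible type.
    if (in.type != ref->type) return false;
    if (std::fabs(in.scale - ref->scale) > tol) return false;
    if (in.type == SpecType::kAsymPerTensor && in.zero_point != ref->zero_point) return false;
  }
  return true;
}
```

Because the reference is stored together with its type, a symmetric reference followed by an asymmetric input never reads a zero point that was not set.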

🧹 Nitpick comments (2)

mllm/backends/cpu/kernels/common/kernel_dispatch.hpp (1)

34-104: Add brief doc comments for new public fill APIs.

The new call_fill_* declarations and template wrappers are public entry points; please document parameters/returns/error behavior for clarity and maintainability. As per coding guidelines, please add API doc comments.

mllm/backends/cpu/ops/FillOp.cpp (1)

211-216: Align FP16 arange fallback with end/step semantics.

Lines 213–216 fill numel() elements without honoring end, while other dtypes compute a length derived from start/end/step. Consider matching the same length computation or asserting the expected size to avoid semantic divergence.
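
One way to keep every dtype on the same length rule would be a shared helper that derives the element count from `start`/`end`/`step`. The sketch below is hypothetical (`arange_length` and `fill_arange_checked` are not names from the PR), it is scalar only, and it treats `step == 0` as an empty range, which differs from the PR's choice of filling with `start`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: number of elements in the half-open range [start, end).
inline std::size_t arange_length(float start, float end, float step) {
  if (step == 0.0f) return 0;  // assumption: a zero step yields an empty range
  const double n = std::ceil((static_cast<double>(end) - start) / step);
  return n > 0.0 ? static_cast<std::size_t>(n) : 0;
}

// Hypothetical scalar fill: writes min(count, arange_length) elements and
// returns how many were written, so the caller can detect an unfilled tail.
template<typename T>
std::size_t fill_arange_checked(T* dst, std::size_t count, float start, float end, float step) {
  assert(dst != nullptr || count == 0);
  const std::size_t n = arange_length(start, end, step);
  const std::size_t m = n < count ? n : count;
  for (std::size_t i = 0; i < m; ++i) { dst[i] = static_cast<T>(start + static_cast<float>(i) * step); }
  return m;
}
```

A caller could then assert that the returned count equals `numel()`, which makes a mismatch between the tensor shape and the requested range visible instead of leaving part of the buffer untouched.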

coderabbitai · 2026-01-20T09:27:52Z

mllm/backends/cpu/kernels/common/fill-inl.hpp

+#include <hwy/highway.h>
+#include <cstring>
+#include "mllm/core/DataTypes.hpp"
+


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate and read the file
find . -name "fill-inl.hpp" -type f

Repository: UbiquitousLearning/mllm

Length of output: 114

🏁 Script executed:

# Read the file to verify the includes and function usage
cat -n mllm/backends/cpu/kernels/common/fill-inl.hpp | head -250

Repository: UbiquitousLearning/mllm

Length of output: 10794

Add explicit standard headers for type traits and math utilities.

The file uses std::is_trivial_v (line 35), std::is_same_v (line 219), std::ceil (line 208), and std::min (line 214), but doesn't explicitly include <type_traits>, <cmath>, or <algorithm>. While it may compile due to transitive includes from <hwy/highway.h>, add these headers to make the file self-contained and avoid fragile dependencies.

Proposed fix

 #include <hwy/highway.h>
 #include <cstring>
+#include <algorithm>
+#include <cmath>
+#include <type_traits>
 #include "mllm/core/DataTypes.hpp"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

#include <hwy/highway.h>

#include <cstring>

#include "mllm/core/DataTypes.hpp"

#include <hwy/highway.h>

#include <cstring>

#include <algorithm>

#include <cmath>

#include <type_traits>

#include "mllm/core/DataTypes.hpp"

🧰 Tools

🪛 Clang (14.0.6)

[error] 8-8: 'hwy/highway.h' file not found

(clang-diagnostic-error)

🤖 Prompt for AI Agents

In `@mllm/backends/cpu/kernels/common/fill-inl.hpp` around lines 8 - 11, This file relies on std::is_trivial_v, std::is_same_v, std::ceil, and std::min but doesn't include the matching standard headers; add explicit includes for <type_traits>, <cmath>, and <algorithm> near the top alongside the existing includes (after `#include` <cstring> / before "mllm/core/DataTypes.hpp") so that uses in functions/template code referencing std::is_trivial_v, std::is_same_v, std::ceil, and std::min are self-contained and no longer depend on transitive headers.

coderabbitai · 2026-01-20T09:27:52Z

mllm/backends/cpu/kernels/common/fill-inl.hpp

+template<typename T>
+HWY_INLINE void fill_arange_impl(T* HWY_RESTRICT dst, size_t count, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
+  if (step == 0) {
+    fill_value_impl(dst, count, static_cast<T>(start));
+    return;
+  }
+
+  // Calculate the actual number of elements to fill
+  size_t n = 0;
+  if ((step > 0 && start < end) || (step < 0 && start > end)) {
+    mllm_fp32_t n_float = (end - start) / step;
+    if (n_float > 0) {
+      n = static_cast<size_t>(std::ceil(n_float));
+      if (step > 0) {
+        if (start + (n - 1) * step >= end) --n;
+      } else {
+        if (start + (n - 1) * step <= end) --n;
+      }
+      n = std::min(n, count);
+    }
+  }
+
+  // Use SIMD for float types where we can vectorize the computation
+  if constexpr (std::is_same_v<T, mllm_fp32_t>) {
+    const hn::ScalableTag<T> d;
+    const size_t N = hn::Lanes(d);
+
+    // Create increment vector: [0, 1, 2, 3, ...] * step
+    const hn::Vec<decltype(d)> step_vec = hn::Set(d, step);
+    const hn::Vec<decltype(d)> n_step_vec = hn::Set(d, step * static_cast<T>(N));
+
+    // Create base offsets [0, 1, 2, 3, ...]
+    hn::Vec<decltype(d)> base = hn::Iota(d, 0);
+    base = hn::Mul(base, step_vec);
+    hn::Vec<decltype(d)> current_start = hn::Add(hn::Set(d, start), base);
+
+    size_t idx = 0;
+    for (; idx + N <= n; idx += N) {
+      hn::StoreU(current_start, d, dst + idx);
+      current_start = hn::Add(current_start, n_step_vec);
+    }
+
+    // Handle remaining elements
+    for (; idx < n; ++idx) { dst[idx] = static_cast<T>(start + idx * step); }
+  } else {
+    // Scalar fallback for other types
+    for (size_t i = 0; i < n; ++i) { dst[i] = static_cast<T>(start + i * step); }
+  }
+}


⚠️ Potential issue | 🟡 Minor

❓ Verification inconclusive

Script executed:

# First, let's look at the full file context around the function
cat -n mllm/backends/cpu/kernels/common/fill-inl.hpp | head -300 | tail -120

Repository: UbiquitousLearning/mllm
Exit code: 0

stdout:

181 static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_value_u16(mllm_uint16_t* HWY_RESTRICT dst, size_t size, mllm_uint16_t value) { 182 fill_value_impl(dst, size, value); 183 } 184 185 static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_value_i8(mllm_int8_t* HWY_RESTRICT dst, size_t size, mllm_int8_t value) { 186 fill_value_impl(dst, size, value); 187 } 188 189 static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_value_u8(mllm_uint8_t* HWY_RESTRICT dst, size_t size, mllm_uint8_t value) { 190 fill_value_impl(dst, size, value); 191 } 192 193 //===----------------------------------------------------------------------===// 194 // Fill Arange (start, end, step) 195 //===----------------------------------------------------------------------===// 196 template<typename T> 197 HWY_INLINE void fill_arange_impl(T* HWY_RESTRICT dst, size_t count, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) { 198 if (step == 0) { 199 fill_value_impl(dst, count, static_cast<T>(start)); 200 return; 201 } 202 203 // Calculate the actual number of elements to fill 204 size_t n = 0; 205 if ((step > 0 && start < end) || (step < 0 && start > end)) { 206 mllm_fp32_t n_float = (end - start) / step; 207 if (n_float > 0) { 208 n = static_cast<size_t>(std::ceil(n_float)); 209 if (step > 0) { 210 if (start + (n - 1) * step >= end) --n; 211 } else { 212 if (start + (n - 1) * step <= end) --n; 213 } 214 n = std::min(n, count); 215 } 216 } 217 218 // Use SIMD for float types where we can vectorize the computation 219 if constexpr (std::is_same_v<T, mllm_fp32_t>) { 220 const hn::ScalableTag<T> d; 221 const size_t N = hn::Lanes(d); 222 223 // Create increment vector: [0, 1, 2, 3, ...] * step 224 const hn::Vec<decltype(d)> step_vec = hn::Set(d, step); 225 const hn::Vec<decltype(d)> n_step_vec = hn::Set(d, step * static_cast<T>(N)); 226 227 // Create base offsets [0, 1, 2, 3, ...] 
228 hn::Vec<decltype(d)> base = hn::Iota(d, 0); 229 base = hn::Mul(base, step_vec); 230 hn::Vec<decltype(d)> current_start = hn::Add(hn::Set(d, start), base); 231 232 size_t idx = 0; 233 for (; idx + N <= n; idx += N) { 234 hn::StoreU(current_start, d, dst + idx); 235 current_start = hn::Add(current_start, n_step_vec); 236 } 237 238 // Handle remaining elements 239 for (; idx < n; ++idx) { dst[idx] = static_cast<T>(start + idx * step); } 240 } else { 241 // Scalar fallback for other types 242 for (size_t i = 0; i < n; ++i) { dst[i] = static_cast<T>(start + i * step); } 243 } 244 } 245 246 static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_fp32(mllm_fp32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start, 247 mllm_fp32_t end, mllm_fp32_t step) { 248 fill_arange_impl(dst, size, start, end, step); 249 } 250 251 static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i32(mllm_int32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start, 252 mllm_fp32_t end, mllm_fp32_t step) { 253 fill_arange_impl(dst, size, start, end, step); 254 } 255 256 static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u32(mllm_uint32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start, 257 mllm_fp32_t end, mllm_fp32_t step) { 258 fill_arange_impl(dst, size, start, end, step); 259 } 260 261 static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i64(mllm_int64_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start, 262 mllm_fp32_t end, mllm_fp32_t step) { 263 fill_arange_impl(dst, size, start, end, step); 264 } 265 266 static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u64(mllm_uint64_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start, 267 mllm_fp32_t end, mllm_fp32_t step) { 268 fill_arange_impl(dst, size, start, end, step); 269 } 270 271 static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i16(mllm_int16_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start, 272 mllm_fp32_t end, mllm_fp32_t step) { 273 fill_arange_impl(dst, size, start, end, step); 274 } 275 276 static HWY_NOINLINE 
HWY_MAYBE_UNUSED void fill_arange_u16(mllm_uint16_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start, 277 mllm_fp32_t end, mllm_fp32_t step) { 278 fill_arange_impl(dst, size, start, end, step); 279 } 280 281 static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i8(mllm_int8_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start, 282 mllm_fp32_t end, mllm_fp32_t step) { 283 fill_arange_impl(dst, size, start, end, step); 284 } 285 286 static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u8(mllm_uint8_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start, 287 mllm_fp32_t end, mllm_fp32_t step) { 288 fill_arange_impl(dst, size, start, end, step); 289 } 290 291 //===----------------------------------------------------------------------===// 292 // Fill Random (using LCG random number generator) 293 //===----------------------------------------------------------------------===// 294 template<typename T> 295 HWY_INLINE void fill_random_impl(T* HWY_RESTRICT dst, size_t count, mllm_fp32_t start, mllm_fp32_t end, uint64_t seed) { 296 const uint64_t multiplier = 1103515245ULL; 297 const uint64_t increment = 12345ULL; 298 const uint64_t modulus = 1ULL << 31; // 2^31 299 const mllm_fp32_t range = end - start; 300

Script executed:

# Now search for call sites as suggested in the review comment
rg -n "fill_arange_anytype|call_fill_arange_|fill_arange_impl" --type=cpp --type=hpp -A 3 -B 3

Repository: UbiquitousLearning/mllm
Exit code: 2

stderr:

unrecognized file type: hpp

Script executed:

# Also search for fill_arange patterns more broadly
rg -n "fill_arange" --type=cpp --type=hpp -A 2 -B 2

Repository: UbiquitousLearning/mllm
Exit code: 2

stderr:

unrecognized file type: hpp

Script executed:

# Search for call sites of fill_arange functions
rg -n "fill_arange" -A 2 -B 2

Repository: UbiquitousLearning/mllm
Exit code: 0

stdout:

mllm/backends/opencl/kernels/fill_cl.cpp-5- " dst[index]=value;\n" mllm/backends/opencl/kernels/fill_cl.cpp-6- "}\n" mllm/backends/opencl/kernels/fill_cl.cpp:7: "__kernel void fill_arange_fp32(float start,float step,__global float *dst) {\n" mllm/backends/opencl/kernels/fill_cl.cpp-8- " size_t index=get_global_id(0);\n" mllm/backends/opencl/kernels/fill_cl.cpp-9- " dst[index]=start+(float)index*step;\n" -- mllm/backends/opencl/kernels/fill.cl-4-} mllm/backends/opencl/kernels/fill.cl-5- mllm/backends/opencl/kernels/fill.cl:6:__kernel void fill_arange_fp32(float start, float step, __global float *dst) { mllm/backends/opencl/kernels/fill.cl-7- size_t index = get_global_id(0); mllm/backends/opencl/kernels/fill.cl-8- dst[index] = start + (float)index * step; -- mllm/backends/opencl/ops/FillOp.cpp-12- mllm/backends/opencl/ops/FillOp.cpp-13- kernel_fp32_buffer_ = runtime->buildKernel("fill", "fill_fp32", {}); mllm/backends/opencl/ops/FillOp.cpp:14: kernel_arange_fp32_buffer_ = runtime->buildKernel("fill", "fill_arange_fp32", {}); mllm/backends/opencl/ops/FillOp.cpp-15-} mllm/backends/opencl/ops/FillOp.cpp-16- -- mllm/backends/opencl/ops/FillOp.cpp-68- cl::NDRange(global_size), cl::NullRange); mllm/backends/opencl/ops/FillOp.cpp-69- if (error != CL_SUCCESS) { mllm/backends/opencl/ops/FillOp.cpp:70: MLLM_ERROR_EXIT(ExitCode::kOpenCLError, "Failed to execute fill_arange kernel, error code: {}", error); mllm/backends/opencl/ops/FillOp.cpp-71- } mllm/backends/opencl/ops/FillOp.cpp-72- } else { -- mllm/backends/cpu/ops/FillOp.cpp-203- case kFloat32: { mllm/backends/cpu/ops/FillOp.cpp-204-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86) mllm/backends/cpu/ops/FillOp.cpp:205: common::fill_arange_anytype(dst.ptr<mllm_fp32_t>(), dst.numel(), options_.start, options_.end, options_.step); mllm/backends/cpu/ops/FillOp.cpp-206-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM) mllm/backends/cpu/ops/FillOp.cpp:207: arm::fill_arange(dst.ptr<mllm_fp32_t>(), 
dst.numel(), options_.start, options_.end, options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-208-#endif
mllm/backends/cpu/ops/FillOp.cpp-209- break;
--
mllm/backends/cpu/ops/FillOp.cpp-215- for (size_t i = 0; i < dst.numel(); ++i) { ptr[i] = static_cast<mllm_fp16_t>(options_.start + i * options_.step); }
mllm/backends/cpu/ops/FillOp.cpp-216-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:217: arm::fill_arange_fp16(dst.ptr<mllm_fp16_t>(), dst.numel(), options_.start, options_.end, options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-218-#endif
mllm/backends/cpu/ops/FillOp.cpp-219- break;
--
mllm/backends/cpu/ops/FillOp.cpp-221- case kInt64: {
mllm/backends/cpu/ops/FillOp.cpp-222-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:223: common::fill_arange_anytype(dst.ptr<mllm_int64_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-224-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:225: arm::fill_arange_anytype<mllm_int64_t>(dst.ptr<mllm_int64_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-226- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-227-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-230- case kInt32: {
mllm/backends/cpu/ops/FillOp.cpp-231-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:232: common::fill_arange_anytype(dst.ptr<mllm_int32_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-233-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:234: arm::fill_arange_anytype<mllm_int32_t>(dst.ptr<mllm_int32_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-235- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-236-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-239- case kInt16: {
mllm/backends/cpu/ops/FillOp.cpp-240-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:241: common::fill_arange_anytype(dst.ptr<mllm_int16_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-242-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:243: arm::fill_arange_anytype<mllm_int16_t>(dst.ptr<mllm_int16_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-244- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-245-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-248- case kInt8: {
mllm/backends/cpu/ops/FillOp.cpp-249-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:250: common::fill_arange_anytype(dst.ptr<mllm_int8_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-251-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:252: arm::fill_arange_anytype<mllm_int8_t>(dst.ptr<mllm_int8_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-253- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-254-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-257- case kUInt64: {
mllm/backends/cpu/ops/FillOp.cpp-258-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:259: common::fill_arange_anytype(dst.ptr<mllm_uint64_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-260-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:261: arm::fill_arange_anytype<mllm_uint64_t>(dst.ptr<mllm_uint64_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-262- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-263-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-266- case kUInt32: {
mllm/backends/cpu/ops/FillOp.cpp-267-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:268: common::fill_arange_anytype(dst.ptr<mllm_uint32_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-269-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:270: arm::fill_arange_anytype<mllm_uint32_t>(dst.ptr<mllm_uint32_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-271- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-272-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-275- case kUInt16: {
mllm/backends/cpu/ops/FillOp.cpp-276-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:277: common::fill_arange_anytype(dst.ptr<mllm_uint16_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-278-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:279: arm::fill_arange_anytype<mllm_uint16_t>(dst.ptr<mllm_uint16_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-280- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-281-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-284- case kUInt8: {
mllm/backends/cpu/ops/FillOp.cpp-285-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:286: common::fill_arange_anytype(dst.ptr<mllm_uint8_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-287-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:288: arm::fill_arange_anytype<mllm_uint8_t>(dst.ptr<mllm_uint8_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-289- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-290-#endif
--
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-174-// Fill Arange
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-175-//===----------------------------------------------------------------------===//
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:176:HWY_EXPORT(fill_arange_fp32);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:177:HWY_EXPORT(fill_arange_i32);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:178:HWY_EXPORT(fill_arange_u32);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:179:HWY_EXPORT(fill_arange_i64);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:180:HWY_EXPORT(fill_arange_u64);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:181:HWY_EXPORT(fill_arange_i16);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:182:HWY_EXPORT(fill_arange_u16);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:183:HWY_EXPORT(fill_arange_i8);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:184:HWY_EXPORT(fill_arange_u8);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-185-
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:186:HWY_DLLEXPORT void call_fill_arange_fp32(mllm_fp32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:187: HWY_DYNAMIC_DISPATCH(fill_arange_fp32)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-188-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:189:HWY_DLLEXPORT void call_fill_arange_i32(mllm_int32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:190: HWY_DYNAMIC_DISPATCH(fill_arange_i32)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-191-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:192:HWY_DLLEXPORT void call_fill_arange_u32(mllm_uint32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:193: HWY_DYNAMIC_DISPATCH(fill_arange_u32)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-194-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:195:HWY_DLLEXPORT void call_fill_arange_i64(mllm_int64_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:196: HWY_DYNAMIC_DISPATCH(fill_arange_i64)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-197-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:198:HWY_DLLEXPORT void call_fill_arange_u64(mllm_uint64_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:199: HWY_DYNAMIC_DISPATCH(fill_arange_u64)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-200-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:201:HWY_DLLEXPORT void call_fill_arange_i16(mllm_int16_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:202: HWY_DYNAMIC_DISPATCH(fill_arange_i16)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-203-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:204:HWY_DLLEXPORT void call_fill_arange_u16(mllm_uint16_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:205: HWY_DYNAMIC_DISPATCH(fill_arange_u16)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-206-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:207:HWY_DLLEXPORT void call_fill_arange_i8(mllm_int8_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:208: HWY_DYNAMIC_DISPATCH(fill_arange_i8)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-209-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:210:HWY_DLLEXPORT void call_fill_arange_u8(mllm_uint8_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:211: HWY_DYNAMIC_DISPATCH(fill_arange_u8)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-212-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-213-
--
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-77-// Fill Arange
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-78-//===----------------------------------------------------------------------===//
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:79:HWY_DLLEXPORT void call_fill_arange_fp32(mllm_fp32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:80:HWY_DLLEXPORT void call_fill_arange_i32(mllm_int32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:81:HWY_DLLEXPORT void call_fill_arange_u32(mllm_uint32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:82:HWY_DLLEXPORT void call_fill_arange_i64(mllm_int64_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:83:HWY_DLLEXPORT void call_fill_arange_u64(mllm_uint64_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:84:HWY_DLLEXPORT void call_fill_arange_i16(mllm_int16_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:85:HWY_DLLEXPORT void call_fill_arange_u16(mllm_uint16_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:86:HWY_DLLEXPORT void call_fill_arange_i8(mllm_int8_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:87:HWY_DLLEXPORT void call_fill_arange_u8(mllm_uint8_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-88-
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-89-//===----------------------------------------------------------------------===//
--
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-188-
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-189-template<typename T>
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:190:inline void fill_arange_anytype(T* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-191- if constexpr (std::is_same_v<T, mllm_fp32_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:192: call_fill_arange_fp32(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-193- } else if constexpr (std::is_same_v<T, mllm_int32_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:194: call_fill_arange_i32(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-195- } else if constexpr (std::is_same_v<T, mllm_uint32_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:196: call_fill_arange_u32(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-197- } else if constexpr (std::is_same_v<T, mllm_int64_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:198: call_fill_arange_i64(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-199- } else if constexpr (std::is_same_v<T, mllm_uint64_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:200: call_fill_arange_u64(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-201- } else if constexpr (std::is_same_v<T, mllm_int16_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:202: call_fill_arange_i16(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-203- } else if constexpr (std::is_same_v<T, mllm_uint16_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:204: call_fill_arange_u16(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-205- } else if constexpr (std::is_same_v<T, mllm_int8_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:206: call_fill_arange_i8(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-207- } else if constexpr (std::is_same_v<T, mllm_uint8_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:208: call_fill_arange_u8(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-209- } else {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-210- // Fallback
--
mllm/backends/cpu/kernels/common/fill-inl.hpp-195-//===----------------------------------------------------------------------===//
mllm/backends/cpu/kernels/common/fill-inl.hpp-196-template<typename T>
mllm/backends/cpu/kernels/common/fill-inl.hpp:197:HWY_INLINE void fill_arange_impl(T* HWY_RESTRICT dst, size_t count, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp-198- if (step == 0) {
mllm/backends/cpu/kernels/common/fill-inl.hpp-199- fill_value_impl(dst, count, static_cast<T>(start));
--
mllm/backends/cpu/kernels/common/fill-inl.hpp-244-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-245-
mllm/backends/cpu/kernels/common/fill-inl.hpp:246:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_fp32(mllm_fp32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-247- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:248: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-249-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-250-
mllm/backends/cpu/kernels/common/fill-inl.hpp:251:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i32(mllm_int32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-252- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:253: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-254-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-255-
mllm/backends/cpu/kernels/common/fill-inl.hpp:256:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u32(mllm_uint32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-257- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:258: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-259-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-260-
mllm/backends/cpu/kernels/common/fill-inl.hpp:261:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i64(mllm_int64_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-262- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:263: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-264-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-265-
mllm/backends/cpu/kernels/common/fill-inl.hpp:266:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u64(mllm_uint64_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-267- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:268: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-269-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-270-
mllm/backends/cpu/kernels/common/fill-inl.hpp:271:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i16(mllm_int16_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-272- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:273: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-274-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-275-
mllm/backends/cpu/kernels/common/fill-inl.hpp:276:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u16(mllm_uint16_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-277- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:278: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-279-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-280-
mllm/backends/cpu/kernels/common/fill-inl.hpp:281:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i8(mllm_int8_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-282- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:283: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-284-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-285-
mllm/backends/cpu/kernels/common/fill-inl.hpp:286:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u8(mllm_uint8_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-287- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:288: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-289-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-290-
--
mllm/backends/cpu/kernels/arm/fill.cpp-52-}
mllm/backends/cpu/kernels/arm/fill.cpp-53-
mllm/backends/cpu/kernels/arm/fill.cpp:54:void fill_arange(mllm_fp32_t* __restrict dst, size_t size, float start, float end, float step, int thread_count) {
mllm/backends/cpu/kernels/arm/fill.cpp-55- constexpr size_t vec_size = 4; // 4 floats in NEON
mllm/backends/cpu/kernels/arm/fill.cpp-56-
--
mllm/backends/cpu/kernels/arm/fill.cpp-129-}
mllm/backends/cpu/kernels/arm/fill.cpp-130-
mllm/backends/cpu/kernels/arm/fill.cpp:131:void fill_arange_fp16(mllm_fp16_t* __restrict dst, size_t size, float start, float end, float step, int thread_count) {
mllm/backends/cpu/kernels/arm/fill.cpp-132- constexpr size_t vec_size = 8; // 8 float16_t in NEON
mllm/backends/cpu/kernels/arm/fill.cpp-133-
--
mllm/backends/cpu/kernels/arm/fill.hpp-17-void fill_specific_value(mllm_fp32_t* __restrict dst, size_t size, mllm_fp32_t value, int thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-18-
mllm/backends/cpu/kernels/arm/fill.hpp:19:void fill_arange(mllm_fp32_t* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step,
mllm/backends/cpu/kernels/arm/fill.hpp-20- int thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-21-
--
mllm/backends/cpu/kernels/arm/fill.hpp-29-void fill_specific_value_fp16(mllm_fp16_t* __restrict dst, size_t size, mllm_fp32_t value, int thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-30-
mllm/backends/cpu/kernels/arm/fill.hpp:31:void fill_arange_fp16(mllm_fp16_t* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step,
mllm/backends/cpu/kernels/arm/fill.hpp-32- int thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-33-
--
mllm/backends/cpu/kernels/arm/fill.hpp-94-
mllm/backends/cpu/kernels/arm/fill.hpp-95-template<typename T>
mllm/backends/cpu/kernels/arm/fill.hpp:96:inline void fill_arange_anytype(T* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step,
mllm/backends/cpu/kernels/arm/fill.hpp-97- int thread_count) {
mllm/backends/cpu/kernels/arm/fill.hpp-98- if (step == 0) {
--
mllm/backends/cpu/kernels/arm/fill.hpp-119-
mllm/backends/cpu/kernels/arm/fill.hpp-120-template<>
mllm/backends/cpu/kernels/arm/fill.hpp:121:inline void fill_arange_anytype<mllm_fp32_t>(mllm_fp32_t* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end,
mllm/backends/cpu/kernels/arm/fill.hpp-122- mllm_fp32_t step, int thread_count) {
mllm/backends/cpu/kernels/arm/fill.hpp:123: fill_arange(dst, size, start, end, step, thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-124-}
mllm/backends/cpu/kernels/arm/fill.hpp-125-
mllm/backends/cpu/kernels/arm/fill.hpp-126-template<>
mllm/backends/cpu/kernels/arm/fill.hpp:127:inline void fill_arange_anytype<mllm_fp16_t>(mllm_fp16_t* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end,
mllm/backends/cpu/kernels/arm/fill.hpp-128- mllm_fp32_t step, int thread_count) {
mllm/backends/cpu/kernels/arm/fill.hpp:129: fill_arange_fp16(dst, size, start, end, step, thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-130-}
mllm/backends/cpu/kernels/arm/fill.hpp-131-
--
mllm/backends/cpu/kernels/x86/fill.cpp-109-}
mllm/backends/cpu/kernels/x86/fill.cpp-110-
mllm/backends/cpu/kernels/x86/fill.cpp:111:void fill_arange(mllm_fp32_t* __restrict dst, size_t size, float start, float end, float step, int thread_count) {
mllm/backends/cpu/kernels/x86/fill.cpp-112-#if defined(MLLM_HOST_FEATURE_AVX512F)
mllm/backends/cpu/kernels/x86/fill.cpp-113- constexpr size_t vec_size = 16;
--
mllm/backends/cpu/kernels/x86/fill.hpp-17-void fill_specific_value(mllm_fp32_t* __restrict dst, size_t size, float value, int thread_count);
mllm/backends/cpu/kernels/x86/fill.hpp-18-
mllm/backends/cpu/kernels/x86/fill.hpp:19:void fill_arange(mllm_fp32_t* __restrict dst, size_t size, float start, float end, float step, int thread_count);
mllm/backends/cpu/kernels/x86/fill.hpp-20-
mllm/backends/cpu/kernels/x86/fill.hpp-21-void fill_random(mllm_fp32_t* __restrict dst, size_t size, float start, float end, uint64_t seed, int thread_count);
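
For orientation, the per-type dispatch pattern visible in the search results above (`fill_arange_anytype` selecting a `call_fill_arange_*` entry point via `if constexpr`) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the project's implementation: the `sketch` namespace, the `fill_arange_scalar` helper, and the reduced parameter list (no `end`, no thread count) are hypothetical, and a plain scalar loop stands in for the Highway-dispatched and NEON kernels.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

namespace sketch {  // hypothetical namespace, not part of mllm

// Scalar stand-in for a vectorized kernel: dst[i] = start + i * step.
// With step == 0 every element simply equals start.
template<typename T>
void fill_arange_scalar(T* dst, std::size_t n, float start, float step) {
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = static_cast<T>(start + static_cast<float>(i) * step);
  }
}

// Hypothetical per-type entry points, mirroring the call_fill_arange_* naming.
inline void call_fill_arange_fp32(float* dst, std::size_t n, float start, float step) {
  fill_arange_scalar(dst, n, start, step);
}
inline void call_fill_arange_i32(std::int32_t* dst, std::size_t n, float start, float step) {
  fill_arange_scalar(dst, n, start, step);
}

// Compile-time selection of the entry point based on the element type.
template<typename T>
inline void fill_arange_anytype(T* dst, std::size_t n, float start, float step) {
  if constexpr (std::is_same_v<T, float>) {
    call_fill_arange_fp32(dst, n, start, step);
  } else if constexpr (std::is_same_v<T, std::int32_t>) {
    call_fill_arange_i32(dst, n, start, step);
  } else {
    // Generic fallback for element types without a dedicated entry point.
    fill_arange_scalar(dst, n, start, step);
  }
}

}  // namespace sketch
```

A caller would write, for example, `std::int32_t buf[5]; sketch::fill_arange_anytype(buf, 5, 2.0f, 3.0f);`. One point that may be worth checking in the real code: because `start` and `step` are `float` in the signatures shown, integer values above 2^24 may not be exactly representable before the cast, which could matter for the 64-bit paths depending on how callers use them.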

Script executed:

# Look for where these functions are used
find . -name "*.cpp" -o -name "*.hpp" -o -name "*.h" | head -20

Repository: UbiquitousLearning/mllm

Exit code: 0

stdout:

./algorithms/fancy_algorithm/main.cpp
./algorithms/fancy_algorithm/models/FooNet.hpp
./algorithms/fancy_algorithm/passes/FooPass.hpp
./algorithms/lazy_vlm/HKVCache.cpp
./algorithms/lazy_vlm/HKVCache.hpp
./algorithms/lazy_vlm/HKVCacheFast.cpp
./algorithms/lazy_vlm/HKVCacheFast.hpp
./algorithms/lazy_vlm/LazyVLMQwen2VL.cpp
./algorithms/lazy_vlm/LazyVLMQwen2VLFast.cpp
./algorithms/lazy_vlm/LazyVLMQwen2_5VL.cpp
./algorithms/lazy_vlm/LazyVLMQwen2_5VLFast.cpp
./algorithms/lazy_vlm/models/qwen2_5vl/lazy_vlm_cfg.hpp
./algorithms/lazy_vlm/models/qwen2_5vl/lazy_vlm_cfg_fast.hpp
./algorithms/lazy_vlm/models/qwen2_5vl/modeling_qwen2_5vl.hpp
./algorithms/lazy_vlm/models/qwen2_5vl/modeling_qwen2_5vl_fast.hpp
./algorithms/lazy_vlm/models/qwen2vl/lazy_vlm_cfg.hpp
./algorithms/lazy_vlm/models/qwen2vl/lazy_vlm_cfg_fast.hpp
./algorithms/lazy_vlm/models/qwen2vl/modeling_qwen2vl.hpp
./algorithms/lazy_vlm/models/qwen2vl/modeling_qwen2vl_fast.hpp
./benchmarks/cpu/arm_mllm_blas_sgemm.cpp

Script executed:

# Search more broadly for arange usage
rg -n "arange" -A 2 -B 2

Repository: UbiquitousLearning/mllm

Exit code: 0

stdout:

tests/cpu/CausalMaskOpTest.hpp-17- using namespace mllm; // NOLINT
tests/cpu/CausalMaskOpTest.hpp-18- const int64_t total = static_cast<int64_t>(B) * H * S * D;
tests/cpu/CausalMaskOpTest.hpp:19: auto input = Tensor::arange(0, static_cast<float>(total), 1, kFloat32, kCPU).view({B, H, S, D});
tests/cpu/CausalMaskOpTest.hpp-20- auto output = mask_(input);
tests/cpu/CausalMaskOpTest.hpp-21- auto expected = buildExpectedTensor(input);
--
tests/cpu/PagedAttnTest.hpp-61-
tests/cpu/PagedAttnTest.hpp-62- // Build Index
tests/cpu/PagedAttnTest.hpp:63: auto index = mllm::Tensor::arange(0, S_KV, 1, mllm::kInt32, mllm::kCPU);
tests/cpu/PagedAttnTest.hpp-64- auto mask = mllm::Tensor::zeros({S_Q, S_KV}, mllm::kFloat32, mllm::kCPU);
tests/cpu/PagedAttnTest.hpp-65- auto mask_data = mask.ptr<mllm::mllm_fp32_t>();
--
pymllm/__init__.py-44- zeros,
pymllm/__init__.py-45- ones,
pymllm/__init__.py:46: arange,
pymllm/__init__.py-47- random,
pymllm/__init__.py-48-)
--
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py-589- past_key_values.get_seq_length() if past_key_values is not None else 0
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py-590- )
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py:591: cache_position = torch.arange(
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py-592- past_seen_tokens,
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py-593- past_seen_tokens + inputs_embeds.shape[1],
--
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py-624- mllm_qualcomm_max_length = kwargs.get("mllm_qualcomm_max_length", None)
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py-625- assert mllm_qualcomm_max_length is not None
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py:626: max_position_ids = torch.arange(
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py-627- 0,
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py-628- mllm_qualcomm_max_length,
--
pymllm/ffi/__init__.py-329-
pymllm/ffi/__init__.py-330-
pymllm/ffi/__init__.py:331:def arange(
pymllm/ffi/__init__.py-332- start: float,
pymllm/ffi/__init__.py-333- end: float,
--
pymllm/ffi/__init__.py-338- if isinstance(device_type, str):
pymllm/ffi/__init__.py-339- device_type = device(device_type)
pymllm/ffi/__init__.py:340: return _ffi_api.arange(start, end, step, dtype, device_type)
pymllm/ffi/__init__.py-341-
pymllm/ffi/__init__.py-342-
--
mllm/core/Tensor.cpp-130-}
mllm/core/Tensor.cpp-131-
mllm/core/Tensor.cpp:132:Tensor Tensor::arange(float start, float end, float step, DataTypes dtype, DeviceTypes device) {
mllm/core/Tensor.cpp-133- auto shape = std::vector<int32_t>{static_cast<int32_t>((end - start) / step)};
mllm/core/Tensor.cpp-134- auto i = Tensor::empty(shape, dtype, device).alloc();
--
mllm/core/Tensor.hpp-257- * `@return` Tensor
mllm/core/Tensor.hpp-258- */
mllm/core/Tensor.hpp:259: static Tensor arange(float start, float end, float step, DataTypes dtype = kFloat32, DeviceTypes device = kCPU);
mllm/core/Tensor.hpp-260-
mllm/core/Tensor.hpp-261- /**
--
mllm/ffi/Extension.cc-142- return ::mllm::ffi::Tensor(tensor);
mllm/ffi/Extension.cc-143- });
mllm/ffi/Extension.cc:144: refl::GlobalDef().def("mllm.arange",
mllm/ffi/Extension.cc-145- [](float start, float end, float step, const ::mllm::ffi::DType& dtype,
mllm/ffi/Extension.cc-146- const ::mllm::ffi::Device& device) -> mllm::ffi::Tensor {
mllm/ffi/Extension.cc:147: auto tensor = ::mllm::Tensor::arange(start, end, step, dtype->dtype, device->device);
mllm/ffi/Extension.cc-148- return ::mllm::ffi::Tensor(tensor);
mllm/ffi/Extension.cc-149- });
--
mllm/models/minicpm_o2_6/modeling_siglip.hpp-56- auto max_nb_patches_w = max_im_w / patch_size_;
mllm/models/minicpm_o2_6/modeling_siglip.hpp-57-
mllm/models/minicpm_o2_6/modeling_siglip.hpp:58: // Create boundaries like torch.arange(1 / self.num_patches_per_side, 1.0, 1 / self.num_patches_per_side)
mllm/models/minicpm_o2_6/modeling_siglip.hpp-59- std::vector<float> boundaries;
mllm/models/minicpm_o2_6/modeling_siglip.hpp-60- float step = 1.0f / static_cast<float>(num_patches_per_side_);
--
mllm/models/minicpm_o2_6/modeling_siglip.hpp-79- }
mllm/models/minicpm_o2_6/modeling_siglip.hpp-80-
mllm/models/minicpm_o2_6/modeling_siglip.hpp:81: // Create fractional coordinates like torch.arange(0, 1 - 1e-6, 1 / nb_patches_h/w)
mllm/models/minicpm_o2_6/modeling_siglip.hpp-82- std::vector<float> fractional_coords_h;
mllm/models/minicpm_o2_6/modeling_siglip.hpp-83- std::vector<float> fractional_coords_w;
--
mllm/models/minicpm_o2_6/modeling_siglip.hpp-146- } else {
mllm/models/minicpm_o2_6/modeling_siglip.hpp-147- auto seq_len = embeddings.shape()[1];
mllm/models/minicpm_o2_6/modeling_siglip.hpp:148: auto position_ids = Tensor::arange(0, seq_len, kInt64).view({1, seq_len});
mllm/models/minicpm_o2_6/modeling_siglip.hpp-149- auto pos_embeddings = position_embedding_(position_ids);
mllm/models/minicpm_o2_6/modeling_siglip.hpp-150- embeddings = embeddings + pos_embeddings;
--
mllm/models/minicpm_o2_6/modeling_vector_quantize.hpp-150- */
mllm/models/minicpm_o2_6/modeling_vector_quantize.hpp-151- Tensor createImplicitCodebook() {
mllm/models/minicpm_o2_6/modeling_vector_quantize.hpp:152: auto indices = Tensor::arange(0, static_cast<float>(codebook_size_), 1, kFloat32, kCPU);
mllm/models/minicpm_o2_6/modeling_vector_quantize.hpp-153- return indicesToCodes(indices);
mllm/models/minicpm_o2_6/modeling_vector_quantize.hpp-154- }
--
mllm/models/deepseek_ocr/deepencoder.hpp-94-
mllm/models/deepseek_ocr/deepencoder.hpp-95- // Register a buffer
mllm/models/deepseek_ocr/deepencoder.hpp:96: registerBuffer("position_ids", Tensor::arange(0, num_positions_, 1, kInt64, kCPU).view({1, -1}));
mllm/models/deepseek_ocr/deepencoder.hpp-97- }
mllm/models/deepseek_ocr/deepencoder.hpp-98-
--
mllm/models/minicpm_o2_6/modeling_whisper_encoder.hpp-194-
mllm/models/minicpm_o2_6/modeling_whisper_encoder.hpp-195- // Add positional embeddings
mllm/models/minicpm_o2_6/modeling_whisper_encoder.hpp:196: auto position_ids = Tensor::arange(0, seq_len, 1, kInt64).view({1, seq_len});
mllm/models/minicpm_o2_6/modeling_whisper_encoder.hpp-197- auto position_embeddings = embed_positions_(position_ids);
mllm/models/minicpm_o2_6/modeling_whisper_encoder.hpp-198- hidden_states = hidden_states + position_embeddings;
--
mllm/models/minicpm_o2_6/modeling_chattts.hpp-256- auto position_ids = Tensor::empty({batch_size, input_len}, kInt64, kCPU).alloc();
mllm/models/minicpm_o2_6/modeling_chattts.hpp-257- auto position_ids_ptr = position_ids.ptr<int64_t>();
mllm/models/minicpm_o2_6/modeling_chattts.hpp:258: // Fill position_ids: arange(past_key_values_length, past_key_values_length + input_len)
mllm/models/minicpm_o2_6/modeling_chattts.hpp-259- for (int b = 0; b < batch_size; ++b) {
mllm/models/minicpm_o2_6/modeling_chattts.hpp-260- for (int i = 0; i < input_len; ++i) { position_ids_ptr[b * input_len + i] = past_key_values_length + i; }
--
docs/api/tensor.rst-47- :return: New tensor with initialized one values
docs/api/tensor.rst-48-
docs/api/tensor.rst:49:arange
docs/api/tensor.rst-50-~~~~~~
docs/api/tensor.rst-51-
docs/api/tensor.rst:52:.. cpp:function:: static Tensor Tensor::arange(float start, float end, float step, DataTypes dtype = kFloat32, DeviceTypes device = kCPU)
docs/api/tensor.rst-53-
docs/api/tensor.rst-54- Creates a tensor with evenly spaced values within a specified range.
--
mllm/backends/opencl/ops/FillOp.hpp-17- private:
mllm/backends/opencl/ops/FillOp.hpp-18- std::shared_ptr<KernelWrap> kernel_fp32_buffer_ = nullptr;
mllm/backends/opencl/ops/FillOp.hpp:19: std::shared_ptr<KernelWrap> kernel_arange_fp32_buffer_ = nullptr;
mllm/backends/opencl/ops/FillOp.hpp-20-};
mllm/backends/opencl/ops/FillOp.hpp-21-
--
mllm/backends/opencl/ops/FillOp.cpp-12-
mllm/backends/opencl/ops/FillOp.cpp-13- kernel_fp32_buffer_ = runtime->buildKernel("fill", "fill_fp32", {});
mllm/backends/opencl/ops/FillOp.cpp:14: kernel_arange_fp32_buffer_ = runtime->buildKernel("fill", "fill_arange_fp32", {});
mllm/backends/opencl/ops/FillOp.cpp-15-}
mllm/backends/opencl/ops/FillOp.cpp-16-
--
mllm/backends/opencl/ops/FillOp.cpp-61- float start = options_.start;
mllm/backends/opencl/ops/FillOp.cpp-62- float step = options_.step;
mllm/backends/opencl/ops/FillOp.cpp:63: ret |= kernel_arange_fp32_buffer_->get().setArg(0, sizeof(float), &start);
mllm/backends/opencl/ops/FillOp.cpp:64: ret |= kernel_arange_fp32_buffer_->get().setArg(1, sizeof(float), &step);
mllm/backends/opencl/ops/FillOp.cpp:65: ret |= kernel_arange_fp32_buffer_->get().setArg(2, sizeof(cl_mem), &cl_buffer_dst);
mllm/backends/opencl/ops/FillOp.cpp-66- if (ret != CL_SUCCESS) { MLLM_ERROR("OpenCLFillOp setArg failed: {}", ret); }
mllm/backends/opencl/ops/FillOp.cpp:67: auto error = runtime->commandQueue().enqueueNDRangeKernel(kernel_arange_fp32_buffer_->get(), cl::NullRange,
mllm/backends/opencl/ops/FillOp.cpp-68- cl::NDRange(global_size), cl::NullRange);
mllm/backends/opencl/ops/FillOp.cpp-69- if (error != CL_SUCCESS) {
mllm/backends/opencl/ops/FillOp.cpp:70: MLLM_ERROR_EXIT(ExitCode::kOpenCLError, "Failed to execute fill_arange kernel, error code: {}", error);
mllm/backends/opencl/ops/FillOp.cpp-71- }
mllm/backends/opencl/ops/FillOp.cpp-72- } else {
--
mllm/backends/opencl/kernels/fill_cl.cpp-5- " dst[index]=value;\n"
mllm/backends/opencl/kernels/fill_cl.cpp-6- "}\n"
mllm/backends/opencl/kernels/fill_cl.cpp:7: "__kernel void fill_arange_fp32(float start,float step,__global float *dst) {\n"
mllm/backends/opencl/kernels/fill_cl.cpp-8- " size_t index=get_global_id(0);\n"
mllm/backends/opencl/kernels/fill_cl.cpp-9- " dst[index]=start+(float)index*step;\n"
--
mllm/backends/opencl/kernels/fill.cl-4-}
mllm/backends/opencl/kernels/fill.cl-5-
mllm/backends/opencl/kernels/fill.cl:6:__kernel void fill_arange_fp32(float start, float step, __global float *dst) {
mllm/backends/opencl/kernels/fill.cl-7- size_t index = get_global_id(0);
mllm/backends/opencl/kernels/fill.cl-8- dst[index] = start + (float)index * step;
--
mllm/backends/cpu/ops/FillOp.cpp-203- case kFloat32: {
mllm/backends/cpu/ops/FillOp.cpp-204-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:205: common::fill_arange_anytype(dst.ptr<mllm_fp32_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-206-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:207: arm::fill_arange(dst.ptr<mllm_fp32_t>(), dst.numel(), options_.start, options_.end, options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-208-#endif
mllm/backends/cpu/ops/FillOp.cpp-209- break;
--
mllm/backends/cpu/ops/FillOp.cpp-215- for (size_t i = 0; i < dst.numel(); ++i) { ptr[i] = static_cast<mllm_fp16_t>(options_.start + i * options_.step); }
mllm/backends/cpu/ops/FillOp.cpp-216-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:217: arm::fill_arange_fp16(dst.ptr<mllm_fp16_t>(), dst.numel(), options_.start, options_.end, options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-218-#endif
mllm/backends/cpu/ops/FillOp.cpp-219- break;
--
mllm/backends/cpu/ops/FillOp.cpp-221- case kInt64: {
mllm/backends/cpu/ops/FillOp.cpp-222-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:223: common::fill_arange_anytype(dst.ptr<mllm_int64_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-224-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:225: arm::fill_arange_anytype<mllm_int64_t>(dst.ptr<mllm_int64_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-226- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-227-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-230- case kInt32: {
mllm/backends/cpu/ops/FillOp.cpp-231-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:232: common::fill_arange_anytype(dst.ptr<mllm_int32_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-233-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:234: arm::fill_arange_anytype<mllm_int32_t>(dst.ptr<mllm_int32_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-235- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-236-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-239- case kInt16: {
mllm/backends/cpu/ops/FillOp.cpp-240-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:241: common::fill_arange_anytype(dst.ptr<mllm_int16_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-242-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:243: arm::fill_arange_anytype<mllm_int16_t>(dst.ptr<mllm_int16_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-244- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-245-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-248- case kInt8: {
mllm/backends/cpu/ops/FillOp.cpp-249-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:250: common::fill_arange_anytype(dst.ptr<mllm_int8_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-251-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:252: arm::fill_arange_anytype<mllm_int8_t>(dst.ptr<mllm_int8_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-253- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-254-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-257- case kUInt64: {
mllm/backends/cpu/ops/FillOp.cpp-258-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:259: common::fill_arange_anytype(dst.ptr<mllm_uint64_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-260-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:261: arm::fill_arange_anytype<mllm_uint64_t>(dst.ptr<mllm_uint64_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-262- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-263-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-266- case kUInt32: {
mllm/backends/cpu/ops/FillOp.cpp-267-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:268: common::fill_arange_anytype(dst.ptr<mllm_uint32_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-269-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:270: arm::fill_arange_anytype<mllm_uint32_t>(dst.ptr<mllm_uint32_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-271- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-272-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-275- case kUInt16: {
mllm/backends/cpu/ops/FillOp.cpp-276-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:277: common::fill_arange_anytype(dst.ptr<mllm_uint16_t>(), dst.numel(), options_.start, options_.end, options_.step); mllm/backends/cpu/ops/FillOp.cpp-278-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM) mllm/backends/cpu/ops/FillOp.cpp:279: arm::fill_arange_anytype<mllm_uint16_t>(dst.ptr<mllm_uint16_t>(), dst.numel(), options_.start, options_.end, mllm/backends/cpu/ops/FillOp.cpp-280- options_.step, threads); mllm/backends/cpu/ops/FillOp.cpp-281-#endif -- mllm/backends/cpu/ops/FillOp.cpp-284- case kUInt8: { mllm/backends/cpu/ops/FillOp.cpp-285-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86) mllm/backends/cpu/ops/FillOp.cpp:286: common::fill_arange_anytype(dst.ptr<mllm_uint8_t>(), dst.numel(), options_.start, options_.end, options_.step); mllm/backends/cpu/ops/FillOp.cpp-287-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM) mllm/backends/cpu/ops/FillOp.cpp:288: arm::fill_arange_anytype<mllm_uint8_t>(dst.ptr<mllm_uint8_t>(), dst.numel(), options_.start, options_.end, mllm/backends/cpu/ops/FillOp.cpp-289- options_.step, threads); mllm/backends/cpu/ops/FillOp.cpp-290-#endif -- mllm/backends/cpu/ops/FillOp.cpp-292- } mllm/backends/cpu/ops/FillOp.cpp-293- default: { mllm/backends/cpu/ops/FillOp.cpp:294: NYI("FillOp::forward[arange] not implemented for this data type"); mllm/backends/cpu/ops/FillOp.cpp-295- } mllm/backends/cpu/ops/FillOp.cpp-296- } -- mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-77-// Fill Arange mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-78-//===----------------------------------------------------------------------===// mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:79:HWY_DLLEXPORT void call_fill_arange_fp32(mllm_fp32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step); mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:80:HWY_DLLEXPORT void call_fill_arange_i32(mllm_int32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t 
end, mllm_fp32_t step); mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:81:HWY_DLLEXPORT void call_fill_arange_u32(mllm_uint32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step); mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:82:HWY_DLLEXPORT void call_fill_arange_i64(mllm_int64_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step); mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:83:HWY_DLLEXPORT void call_fill_arange_u64(mllm_uint64_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step); mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:84:HWY_DLLEXPORT void call_fill_arange_i16(mllm_int16_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step); mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:85:HWY_DLLEXPORT void call_fill_arange_u16(mllm_uint16_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step); mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:86:HWY_DLLEXPORT void call_fill_arange_i8(mllm_int8_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step); mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:87:HWY_DLLEXPORT void call_fill_arange_u8(mllm_uint8_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step); mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-88- mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-89-//===----------------------------------------------------------------------===// -- mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-188- mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-189-template<typename T> mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:190:inline void fill_arange_anytype(T* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) { mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-191- if constexpr (std::is_same_v<T, mllm_fp32_t>) { mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:192: call_fill_arange_fp32(dst, n, start, end, step); 
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-193- } else if constexpr (std::is_same_v<T, mllm_int32_t>) { mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:194: call_fill_arange_i32(dst, n, start, end, step); mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-195- } else if constexpr (std::is_same_v<T, mllm_uint32_t>) { mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:196: call_fill_arange_u32(dst, n, start, end, step); mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-197- } else if constexpr (std::is_same_v<T, mllm_int64_t>) { mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:198: call_fill_arange_i64(dst, n, start, end, step); mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-199- } else if constexpr (std::is_same_v<T, mllm_uint64_t>) { mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:200: call_fill_arange_u64(dst, n, start, end, step); mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-201- } else if constexpr (std::is_same_v<T, mllm_int16_t>) { mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:202: call_fill_arange_i16(dst, n, start, end, step); mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-203- } else if constexpr (std::is_same_v<T, mllm_uint16_t>) { mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:204: call_fill_arange_u16(dst, n, start, end, step); mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-205- } else if constexpr (std::is_same_v<T, mllm_int8_t>) { mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:206: call_fill_arange_i8(dst, n, start, end, step); mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-207- } else if constexpr (std::is_same_v<T, mllm_uint8_t>) { mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:208: call_fill_arange_u8(dst, n, start, end, step); mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-209- } else { mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-210- // Fallback -- 
mllm/backends/cpu/kernels/common/fill-inl.hpp-195-//===----------------------------------------------------------------------===// mllm/backends/cpu/kernels/common/fill-inl.hpp-196-template<typename T> mllm/backends/cpu/kernels/common/fill-inl.hpp:197:HWY_INLINE void fill_arange_impl(T* HWY_RESTRICT dst, size_t count, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) { mllm/backends/cpu/kernels/common/fill-inl.hpp-198- if (step == 0) { mllm/backends/cpu/kernels/common/fill-inl.hpp-199- fill_value_impl(dst, count, static_cast<T>(start)); -- mllm/backends/cpu/kernels/common/fill-inl.hpp-244-} mllm/backends/cpu/kernels/common/fill-inl.hpp-245- mllm/backends/cpu/kernels/common/fill-inl.hpp:246:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_fp32(mllm_fp32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start, mllm/backends/cpu/kernels/common/fill-inl.hpp-247- mllm_fp32_t end, mllm_fp32_t step) { mllm/backends/cpu/kernels/common/fill-inl.hpp:248: fill_arange_impl(dst, size, start, end, step); mllm/backends/cpu/kernels/common/fill-inl.hpp-249-} mllm/backends/cpu/kernels/common/fill-inl.hpp-250- mllm/backends/cpu/kernels/common/fill-inl.hpp:251:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i32(mllm_int32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start, mllm/backends/cpu/kernels/common/fill-inl.hpp-252- mllm_fp32_t end, mllm_fp32_t step) { mllm/backends/cpu/kernels/common/fill-inl.hpp:253: fill_arange_impl(dst, size, start, end, step); mllm/backends/cpu/kernels/common/fill-inl.hpp-254-} mllm/backends/cpu/kernels/common/fill-inl.hpp-255- mllm/backends/cpu/kernels/common/fill-inl.hpp:256:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u32(mllm_uint32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start, mllm/backends/cpu/kernels/common/fill-inl.hpp-257- mllm_fp32_t end, mllm_fp32_t step) { mllm/backends/cpu/kernels/common/fill-inl.hpp:258: fill_arange_impl(dst, size, start, end, step); mllm/backends/cpu/kernels/common/fill-inl.hpp-259-} 
mllm/backends/cpu/kernels/common/fill-inl.hpp-260- mllm/backends/cpu/kernels/common/fill-inl.hpp:261:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i64(mllm_int64_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start, mllm/backends/cpu/kernels/common/fill-inl.hpp-262- mllm_fp32_t end, mllm_fp32_t step) { mllm/backends/cpu/kernels/common/fill-inl.hpp:263: fill_arange_impl(dst, size, start, end, step); mllm/backends/cpu/kernels/common/fill-inl.hpp-264-} mllm/backends/cpu/kernels/common/fill-inl.hpp-265- mllm/backends/cpu/kernels/common/fill-inl.hpp:266:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u64(mllm_uint64_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start, mllm/backends/cpu/kernels/common/fill-inl.hpp-267- mllm_fp32_t end, mllm_fp32_t step) { mllm/backends/cpu/kernels/common/fill-inl.hpp:268: fill_arange_impl(dst, size, start, end, step); mllm/backends/cpu/kernels/common/fill-inl.hpp-269-} mllm/backends/cpu/kernels/common/fill-inl.hpp-270- mllm/backends/cpu/kernels/common/fill-inl.hpp:271:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i16(mllm_int16_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start, mllm/backends/cpu/kernels/common/fill-inl.hpp-272- mllm_fp32_t end, mllm_fp32_t step) { mllm/backends/cpu/kernels/common/fill-inl.hpp:273: fill_arange_impl(dst, size, start, end, step); mllm/backends/cpu/kernels/common/fill-inl.hpp-274-} mllm/backends/cpu/kernels/common/fill-inl.hpp-275- mllm/backends/cpu/kernels/common/fill-inl.hpp:276:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u16(mllm_uint16_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start, mllm/backends/cpu/kernels/common/fill-inl.hpp-277- mllm_fp32_t end, mllm_fp32_t step) { mllm/backends/cpu/kernels/common/fill-inl.hpp:278: fill_arange_impl(dst, size, start, end, step); mllm/backends/cpu/kernels/common/fill-inl.hpp-279-} mllm/backends/cpu/kernels/common/fill-inl.hpp-280- mllm/backends/cpu/kernels/common/fill-inl.hpp:281:static HWY_NOINLINE HWY_MAYBE_UNUSED void 
fill_arange_i8(mllm_int8_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start, mllm/backends/cpu/kernels/common/fill-inl.hpp-282- mllm_fp32_t end, mllm_fp32_t step) { mllm/backends/cpu/kernels/common/fill-inl.hpp:283: fill_arange_impl(dst, size, start, end, step); mllm/backends/cpu/kernels/common/fill-inl.hpp-284-} mllm/backends/cpu/kernels/common/fill-inl.hpp-285- mllm/backends/cpu/kernels/common/fill-inl.hpp:286:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u8(mllm_uint8_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start, mllm/backends/cpu/kernels/common/fill-inl.hpp-287- mllm_fp32_t end, mllm_fp32_t step) { mllm/backends/cpu/kernels/common/fill-inl.hpp:288: fill_arange_impl(dst, size, start, end, step); mllm/backends/cpu/kernels/common/fill-inl.hpp-289-} mllm/backends/cpu/kernels/common/fill-inl.hpp-290- -- mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-174-// Fill Arange mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-175-//===----------------------------------------------------------------------===// mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:176:HWY_EXPORT(fill_arange_fp32); mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:177:HWY_EXPORT(fill_arange_i32); mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:178:HWY_EXPORT(fill_arange_u32); mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:179:HWY_EXPORT(fill_arange_i64); mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:180:HWY_EXPORT(fill_arange_u64); mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:181:HWY_EXPORT(fill_arange_i16); mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:182:HWY_EXPORT(fill_arange_u16); mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:183:HWY_EXPORT(fill_arange_i8); mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:184:HWY_EXPORT(fill_arange_u8); mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-185- mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:186:HWY_DLLEXPORT void call_fill_arange_fp32(mllm_fp32_t* dst, size_t n, 
mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) { mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:187: HWY_DYNAMIC_DISPATCH(fill_arange_fp32)(dst, n, start, end, step); mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-188-} mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:189:HWY_DLLEXPORT void call_fill_arange_i32(mllm_int32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) { mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:190: HWY_DYNAMIC_DISPATCH(fill_arange_i32)(dst, n, start, end, step); mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-191-} mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:192:HWY_DLLEXPORT void call_fill_arange_u32(mllm_uint32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) { mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:193: HWY_DYNAMIC_DISPATCH(fill_arange_u32)(dst, n, start, end, step); mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-194-} mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:195:HWY_DLLEXPORT void call_fill_arange_i64(mllm_int64_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) { mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:196: HWY_DYNAMIC_DISPATCH(fill_arange_i64)(dst, n, start, end, step); mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-197-} mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:198:HWY_DLLEXPORT void call_fill_arange_u64(mllm_uint64_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) { mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:199: HWY_DYNAMIC_DISPATCH(fill_arange_u64)(dst, n, start, end, step); mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-200-} mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:201:HWY_DLLEXPORT void call_fill_arange_i16(mllm_int16_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) { mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:202: HWY_DYNAMIC_DISPATCH(fill_arange_i16)(dst, n, start, end, 
step); mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-203-} mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:204:HWY_DLLEXPORT void call_fill_arange_u16(mllm_uint16_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) { mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:205: HWY_DYNAMIC_DISPATCH(fill_arange_u16)(dst, n, start, end, step); mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-206-} mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:207:HWY_DLLEXPORT void call_fill_arange_i8(mllm_int8_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) { mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:208: HWY_DYNAMIC_DISPATCH(fill_arange_i8)(dst, n, start, end, step); mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-209-} mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:210:HWY_DLLEXPORT void call_fill_arange_u8(mllm_uint8_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) { mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:211: HWY_DYNAMIC_DISPATCH(fill_arange_u8)(dst, n, start, end, step); mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-212-} mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-213- -- mllm/backends/cpu/kernels/arm/fill.cpp-52-} mllm/backends/cpu/kernels/arm/fill.cpp-53- mllm/backends/cpu/kernels/arm/fill.cpp:54:void fill_arange(mllm_fp32_t* __restrict dst, size_t size, float start, float end, float step, int thread_count) { mllm/backends/cpu/kernels/arm/fill.cpp-55- constexpr size_t vec_size = 4; // 4 floats in NEON mllm/backends/cpu/kernels/arm/fill.cpp-56- -- mllm/backends/cpu/kernels/arm/fill.cpp-58- size_t i = 0; mllm/backends/cpu/kernels/arm/fill.cpp-59- mllm/backends/cpu/kernels/arm/fill.cpp:60: // Vectorized arange mllm/backends/cpu/kernels/arm/fill.cpp-61- float current_value = start; mllm/backends/cpu/kernels/arm/fill.cpp-62- for (; i < vec_end; i += vec_size) { -- mllm/backends/cpu/kernels/arm/fill.cpp-129-} 
mllm/backends/cpu/kernels/arm/fill.cpp-130- mllm/backends/cpu/kernels/arm/fill.cpp:131:void fill_arange_fp16(mllm_fp16_t* __restrict dst, size_t size, float start, float end, float step, int thread_count) { mllm/backends/cpu/kernels/arm/fill.cpp-132- constexpr size_t vec_size = 8; // 8 float16_t in NEON mllm/backends/cpu/kernels/arm/fill.cpp-133- -- mllm/backends/cpu/kernels/arm/fill.cpp-135- size_t i = 0; mllm/backends/cpu/kernels/arm/fill.cpp-136- mllm/backends/cpu/kernels/arm/fill.cpp:137: // Vectorized arange mllm/backends/cpu/kernels/arm/fill.cpp-138- float current_value = start; mllm/backends/cpu/kernels/arm/fill.cpp-139- for (; i < vec_end; i += vec_size) { -- mllm/backends/cpu/kernels/arm/fill.hpp-17-void fill_specific_value(mllm_fp32_t* __restrict dst, size_t size, mllm_fp32_t value, int thread_count); mllm/backends/cpu/kernels/arm/fill.hpp-18- mllm/backends/cpu/kernels/arm/fill.hpp:19:void fill_arange(mllm_fp32_t* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step, mllm/backends/cpu/kernels/arm/fill.hpp-20- int thread_count); mllm/backends/cpu/kernels/arm/fill.hpp-21- -- mllm/backends/cpu/kernels/arm/fill.hpp-29-void fill_specific_value_fp16(mllm_fp16_t* __restrict dst, size_t size, mllm_fp32_t value, int thread_count); mllm/backends/cpu/kernels/arm/fill.hpp-30- mllm/backends/cpu/kernels/arm/fill.hpp:31:void fill_arange_fp16(mllm_fp16_t* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step, mllm/backends/cpu/kernels/arm/fill.hpp-32- int thread_count); mllm/backends/cpu/kernels/arm/fill.hpp-33- -- mllm/backends/cpu/kernels/arm/fill.hpp-94- mllm/backends/cpu/kernels/arm/fill.hpp-95-template<typename T> mllm/backends/cpu/kernels/arm/fill.hpp:96:inline void fill_arange_anytype(T* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step, mllm/backends/cpu/kernels/arm/fill.hpp-97- int thread_count) { mllm/backends/cpu/kernels/arm/fill.hpp-98- if (step == 0) { -- 
mllm/backends/cpu/kernels/arm/fill.hpp-119- mllm/backends/cpu/kernels/arm/fill.hpp-120-template<> mllm/backends/cpu/kernels/arm/fill.hpp:121:inline void fill_arange_anytype<mllm_fp32_t>(mllm_fp32_t* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end, mllm/backends/cpu/kernels/arm/fill.hpp-122- mllm_fp32_t step, int thread_count) { mllm/backends/cpu/kernels/arm/fill.hpp:123: fill_arange(dst, size, start, end, step, thread_count); mllm/backends/cpu/kernels/arm/fill.hpp-124-} mllm/backends/cpu/kernels/arm/fill.hpp-125- mllm/backends/cpu/kernels/arm/fill.hpp-126-template<> mllm/backends/cpu/kernels/arm/fill.hpp:127:inline void fill_arange_anytype<mllm_fp16_t>(mllm_fp16_t* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end, mllm/backends/cpu/kernels/arm/fill.hpp-128- mllm_fp32_t step, int thread_count) { mllm/backends/cpu/kernels/arm/fill.hpp:129: fill_arange_fp16(dst, size, start, end, step, thread_count); mllm/backends/cpu/kernels/arm/fill.hpp-130-} mllm/backends/cpu/kernels/arm/fill.hpp-131- -- mllm/backends/cpu/kernels/x86/fill.hpp-17-void fill_specific_value(mllm_fp32_t* __restrict dst, size_t size, float value, int thread_count); mllm/backends/cpu/kernels/x86/fill.hpp-18- mllm/backends/cpu/kernels/x86/fill.hpp:19:void fill_arange(mllm_fp32_t* __restrict dst, size_t size, float start, float end, float step, int thread_count); mllm/backends/cpu/kernels/x86/fill.hpp-20- mllm/backends/cpu/kernels/x86/fill.hpp-21-void fill_random(mllm_fp32_t* __restrict dst, size_t size, float start, float end, uint64_t seed, int thread_count); -- mllm/backends/cpu/kernels/x86/fill.cpp-109-} mllm/backends/cpu/kernels/x86/fill.cpp-110- mllm/backends/cpu/kernels/x86/fill.cpp:111:void fill_arange(mllm_fp32_t* __restrict dst, size_t size, float start, float end, float step, int thread_count) { mllm/backends/cpu/kernels/x86/fill.cpp-112-#if defined(MLLM_HOST_FEATURE_AVX512F) mllm/backends/cpu/kernels/x86/fill.cpp-113- constexpr size_t vec_size = 16; -- 
mllm/backends/cpu/kernels/x86/fill.cpp-129- || defined(MLLM_HOST_FEATURE_SSE2) || defined(MLLM_HOST_FEATURE_SSE) mllm/backends/cpu/kernels/x86/fill.cpp-130- if (vec_size > 1) { mllm/backends/cpu/kernels/x86/fill.cpp:131: // Vectorized arange mllm/backends/cpu/kernels/x86/fill.cpp-132- float current_value = start; mllm/backends/cpu/kernels/x86/fill.cpp-133- for (; i < vec_end; i += vec_size) {

Script executed:

# Look at the complete fill_arange_impl and surrounding context for assertions/documentation sed -n '193,245p' mllm/backends/cpu/kernels/common/fill-inl.hpp

Repository: UbiquitousLearning/mllm

Exit code: 0

stdout:

    //===----------------------------------------------------------------------===//
    // Fill Arange (start, end, step)
    //===----------------------------------------------------------------------===//
    template<typename T>
    HWY_INLINE void fill_arange_impl(T* HWY_RESTRICT dst, size_t count, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
      if (step == 0) {
        fill_value_impl(dst, count, static_cast<T>(start));
        return;
      }

      // Calculate the actual number of elements to fill
      size_t n = 0;
      if ((step > 0 && start < end) || (step < 0 && start > end)) {
        mllm_fp32_t n_float = (end - start) / step;
        if (n_float > 0) {
          n = static_cast<size_t>(std::ceil(n_float));
          if (step > 0) {
            if (start + (n - 1) * step >= end) --n;
          } else {
            if (start + (n - 1) * step <= end) --n;
          }
          n = std::min(n, count);
        }
      }

      // Use SIMD for float types where we can vectorize the computation
      if constexpr (std::is_same_v<T, mllm_fp32_t>) {
        const hn::ScalableTag<T> d;
        const size_t N = hn::Lanes(d);

        // Create increment vector: [0, 1, 2, 3, ...] * step
        const hn::Vec<decltype(d)> step_vec = hn::Set(d, step);
        const hn::Vec<decltype(d)> n_step_vec = hn::Set(d, step * static_cast<T>(N));

        // Create base offsets [0, 1, 2, 3, ...]
        hn::Vec<decltype(d)> base = hn::Iota(d, 0);
        base = hn::Mul(base, step_vec);
        hn::Vec<decltype(d)> current_start = hn::Add(hn::Set(d, start), base);

        size_t idx = 0;
        for (; idx + N <= n; idx += N) {
          hn::StoreU(current_start, d, dst + idx);
          current_start = hn::Add(current_start, n_step_vec);
        }

        // Handle remaining elements
        for (; idx < n; ++idx) { dst[idx] = static_cast<T>(start + idx * step); }
      } else {
        // Scalar fallback for other types
        for (size_t i = 0; i < n; ++i) { dst[i] = static_cast<T>(start + i * step); }
      }
    }

Script executed:

# Check FillOp.cpp for any validation of count vs computed arange length sed -n '195,300p' mllm/backends/cpu/ops/FillOp.cpp

Repository: UbiquitousLearning/mllm

Exit code: 0

stdout:

              NYI("FillOp::forward[ones] not implemented for this data type");
              break;
            }
          }
          break;
        }
        case aops::FillOpTypes::kArange: {
          switch (dst.dtype()) {
            case kFloat32: {
    #if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
              common::fill_arange_anytype(dst.ptr<mllm_fp32_t>(), dst.numel(), options_.start, options_.end, options_.step);
    #elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
              arm::fill_arange(dst.ptr<mllm_fp32_t>(), dst.numel(), options_.start, options_.end, options_.step, threads);
    #endif
              break;
            }
            case kFloat16: {
    #if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
              // FP16 not directly supported by Highway on x86, use scalar fallback
              auto ptr = dst.ptr<mllm_fp16_t>();
              for (size_t i = 0; i < dst.numel(); ++i) { ptr[i] = static_cast<mllm_fp16_t>(options_.start + i * options_.step); }
    #elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
              arm::fill_arange_fp16(dst.ptr<mllm_fp16_t>(), dst.numel(), options_.start, options_.end, options_.step, threads);
    #endif
              break;
            }
            case kInt64: {
    #if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
              common::fill_arange_anytype(dst.ptr<mllm_int64_t>(), dst.numel(), options_.start, options_.end, options_.step);
    #elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
              arm::fill_arange_anytype<mllm_int64_t>(dst.ptr<mllm_int64_t>(), dst.numel(), options_.start, options_.end,
                                                     options_.step, threads);
    #endif
              break;
            }
            case kInt32: {
    #if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
              common::fill_arange_anytype(dst.ptr<mllm_int32_t>(), dst.numel(), options_.start, options_.end, options_.step);
    #elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
              arm::fill_arange_anytype<mllm_int32_t>(dst.ptr<mllm_int32_t>(), dst.numel(), options_.start, options_.end,
                                                     options_.step, threads);
    #endif
              break;
            }
            case kInt16: {
    #if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
              common::fill_arange_anytype(dst.ptr<mllm_int16_t>(), dst.numel(), options_.start, options_.end, options_.step);
    #elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
              arm::fill_arange_anytype<mllm_int16_t>(dst.ptr<mllm_int16_t>(), dst.numel(), options_.start, options_.end,
                                                     options_.step, threads);
    #endif
              break;
            }
            case kInt8: {
    #if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
              common::fill_arange_anytype(dst.ptr<mllm_int8_t>(), dst.numel(), options_.start, options_.end, options_.step);
    #elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
              arm::fill_arange_anytype<mllm_int8_t>(dst.ptr<mllm_int8_t>(), dst.numel(), options_.start, options_.end,
                                                    options_.step, threads);
    #endif
              break;
            }
            case kUInt64: {
    #if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
              common::fill_arange_anytype(dst.ptr<mllm_uint64_t>(), dst.numel(), options_.start, options_.end, options_.step);
    #elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
              arm::fill_arange_anytype<mllm_uint64_t>(dst.ptr<mllm_uint64_t>(), dst.numel(), options_.start, options_.end,
                                                      options_.step, threads);
    #endif
              break;
            }
            case kUInt32: {
    #if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
              common::fill_arange_anytype(dst.ptr<mllm_uint32_t>(), dst.numel(), options_.start, options_.end, options_.step);
    #elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
              arm::fill_arange_anytype<mllm_uint32_t>(dst.ptr<mllm_uint32_t>(), dst.numel(), options_.start, options_.end,
                                                      options_.step, threads);
    #endif
              break;
            }
            case kUInt16: {
    #if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
              common::fill_arange_anytype(dst.ptr<mllm_uint16_t>(), dst.numel(), options_.start, options_.end, options_.step);
    #elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
              arm::fill_arange_anytype<mllm_uint16_t>(dst.ptr<mllm_uint16_t>(), dst.numel(), options_.start, options_.end,
                                                      options_.step, threads);
    #endif
              break;
            }
            case kUInt8: {
    #if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
              common::fill_arange_anytype(dst.ptr<mllm_uint8_t>(), dst.numel(), options_.start, options_.end, options_.step);
    #elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
              arm::fill_arange_anytype<mllm_uint8_t>(dst.ptr<mllm_uint8_t>(), dst.numel(), options_.start, options_.end,
                                                     options_.step, threads);
    #endif
              break;
            }
            default: {
              NYI("FillOp::forward[arange] not implemented for this data type");
            }
          }
          break;
        }
        case aops::FillOpTypes::kRandom: {
          switch (dst.dtype()) {

Script executed:

# Check Tensor.cpp for how arange creates the tensor with proper shape sed -n '130,145p' mllm/core/Tensor.cpp

Repository: UbiquitousLearning/mllm

Exit code: 0

stdout:

    }

    Tensor Tensor::arange(float start, float end, float step, DataTypes dtype, DeviceTypes device) {
      auto shape = std::vector<int32_t>{static_cast<int32_t>((end - start) / step)};
      auto i = Tensor::empty(shape, dtype, device).alloc();
      return Context::instance().buildOpAndSubmitTask(
          OpTypes::kFill, aops::FillOpOptions{.type = aops::FillOpTypes::kArange, .start = start, .end = end, .step = step},
          {i})[0];
    }

    Tensor Tensor::random(const std::vector<int32_t>& shape, float start, float end, DataTypes dtype, DeviceTypes device) {
      auto i = Tensor::empty(shape, dtype, device).alloc();
      return Context::instance().buildOpAndSubmitTask(
          OpTypes::kFill,
          aops::FillOpOptions{
              .type = aops::FillOpTypes::kRandom, .start = start, .end = end, .seed = Context::instance().getRandomState()},
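
The shape in `Tensor::arange` above comes from a truncating cast of `(end - start) / step`, while the kernel derives its own element count with `std::ceil`. The sketch below puts the two computations side by side. The helper names `arange_length` and `truncated_length` are hypothetical and do not exist in mllm; this only illustrates how the two counts can differ.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of mllm): number of elements in the half-open
// range [start, end) for the given step. Returns 0 for empty or invalid ranges.
inline std::size_t arange_length(double start, double end, double step) {
  if (step == 0.0) return 0;
  const double ratio = (end - start) / step;
  if (!(ratio > 0.0)) return 0;  // wrong direction, empty range, or NaN
  return static_cast<std::size_t>(std::ceil(ratio));
}

// Mirrors the truncating cast used for the tensor shape in the snippet above.
// Only meaningful for a non-zero step.
inline std::int32_t truncated_length(float start, float end, float step) {
  return static_cast<std::int32_t>((end - start) / step);
}
```

For `start = 0`, `end = 10`, `step = 3` the half-open range holds 0, 3, 6, 9 (four values), whereas the truncating cast yields 3. When the ratio is an exact integer, both computations agree.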

Script executed:

# Search for any existing comments or assertions about the arange contract grep -n "assert\|ASSERT\|contract\|precondition" mllm/backends/cpu/kernels/common/fill-inl.hpp

Repository: UbiquitousLearning/mllm
Exit code: 1

Ensure the contract is explicit when arange length doesn't match count.

The function computes n from start/end/step and only fills n elements (line 214: n = std::min(n, count)). If count > n, the tail remains uninitialized. While std::min() prevents overflow, it silently allows gaps. Consider adding an assertion, an error check, or explicit documentation clarifying that callers must ensure count == ceil((end - start) / step), or explicitly fill any remainder.

🤖 Prompt for AI Agents

In `@mllm/backends/cpu/kernels/common/fill-inl.hpp` around lines 196 - 244, The function fill_arange_impl computes n (the number of sequence elements) and may leave dst[n..count-1] uninitialized; make the contract explicit and avoid silent gaps by: after computing n, add a runtime check/assert that either n == count (e.g., assert(n == count) or throw) or, if you want to tolerate count > n, explicitly fill the remainder via fill_value_impl(dst + n, count - n, static_cast<T>(start + n * step)) so no tail remains uninitialized; update callers/docs accordingly. Reference: fill_arange_impl, variable n, and fill_value_impl.
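
One way to make the contract explicit is to fail loudly when the computed length and `count` disagree. The following is only a sketch under stated assumptions: `arange_length` and `fill_arange_checked` are hypothetical names, not functions from `fill-inl.hpp`, and the use of `ceil` follows the contract suggested in this comment rather than the truncating cast currently used in `Tensor::arange`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: number of elements in [start, end) with the given step.
inline std::size_t arange_length(float start, float end, float step) {
  if (step == 0.0f) { throw std::invalid_argument("arange: step must be non-zero"); }
  const float span = (end - start) / step;
  return span > 0.0f ? static_cast<std::size_t>(std::ceil(span)) : 0;
}

// Hypothetical checked variant: refuses to run when count != computed length,
// so no tail of dst can be left uninitialized without the caller noticing.
template <typename T>
void fill_arange_checked(T* dst, std::size_t count, float start, float end, float step) {
  assert(dst != nullptr || count == 0);
  const std::size_t n = arange_length(start, end, step);
  if (n != count) { throw std::invalid_argument("arange: count does not match sequence length"); }
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = static_cast<T>(start + static_cast<float>(i) * step);
  }
}
```

Throwing is only one option; an assertion or an explicit fill of the remainder, as described above, would serve the same purpose of removing the silent gap.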

coderabbitai · 2026-01-20T09:27:52Z

mllm/backends/cpu/kernels/common/kernel_dispatch.hpp

+#include <cstring>
 #include "mllm/core/DataTypes.hpp"



⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "kernel_dispatch.hpp" -type f

Repository: UbiquitousLearning/mllm

Length of output: 121

🏁 Script executed:

cat -n mllm/backends/cpu/kernels/common/kernel_dispatch.hpp | head -120

Repository: UbiquitousLearning/mllm

Length of output: 9016

🏁 Script executed:

head -20 mllm/backends/cpu/kernels/common/kernel_dispatch.hpp

Repository: UbiquitousLearning/mllm

Length of output: 794

🏁 Script executed:

rg "#include.*type_traits" mllm/backends/cpu/kernels/common/kernel_dispatch.hpp

Repository: UbiquitousLearning/mllm

Length of output: 49

🏁 Script executed:

grep "^#include" mllm/backends/cpu/kernels/common/kernel_dispatch.hpp

Repository: UbiquitousLearning/mllm

Length of output: 199

Add missing #include <type_traits> for std::is_same_v usage.

The template fill_zeros_anytype at line 105+ uses std::is_same_v (lines 107, 109, 111, 113, 115, 117, 119, and beyond), which requires the <type_traits> header. The file currently does not include it, relying on transitive includes which is unreliable.

Proposed fix

 #include <cstring>
+#include <type_traits>
 #include "mllm/core/DataTypes.hpp"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-#include <cstring>
-#include "mllm/core/DataTypes.hpp"
+#include <cstring>
+#include <type_traits>
+#include "mllm/core/DataTypes.hpp"

🤖 Prompt for AI Agents

In `@mllm/backends/cpu/kernels/common/kernel_dispatch.hpp` around lines 10 - 12, The file is missing `#include` <type_traits> required by the template fill_zeros_anytype which uses std::is_same_v; add the header to the top of kernel_dispatch.hpp (alongside <cstring> and "mllm/core/DataTypes.hpp") so std::is_same_v resolves properly and avoids relying on transitive includes referenced in fill_zeros_anytype.
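
For illustration, a header that dispatches on `std::is_same_v` stays self-contained when it includes `<type_traits>` directly. This is a minimal sketch, not code from `kernel_dispatch.hpp`; `dispatch_branch` is a hypothetical function that only reports which branch a type would select.

```cpp
#include <cstdint>
#include <type_traits>  // std::is_same_v is declared here; included directly, not transitively

// Hypothetical tag function: returns an index for the branch chosen at compile time.
template <typename T>
constexpr int dispatch_branch() {
  if constexpr (std::is_same_v<T, float>) {
    return 0;  // would call a float-specific kernel
  } else if constexpr (std::is_same_v<T, std::int32_t>) {
    return 1;  // would call an int32-specific kernel
  } else {
    return -1;  // generic fallback
  }
}
```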

coderabbitai · 2026-01-20T09:27:52Z

mllm/backends/cpu/kernels/common/kernel_dispatch.hpp

+template<typename T>
+inline void fill_zeros_anytype(T* dst, size_t n) {
+  if constexpr (std::is_same_v<T, mllm_fp32_t>) {
+    call_fill_zeros_fp32(dst, n);
+  } else if constexpr (std::is_same_v<T, mllm_fp64_t>) {
+    call_fill_zeros_fp64(dst, n);
+  } else if constexpr (std::is_same_v<T, mllm_int32_t>) {
+    call_fill_zeros_i32(dst, n);
+  } else if constexpr (std::is_same_v<T, mllm_uint32_t>) {
+    call_fill_zeros_u32(dst, n);
+  } else if constexpr (std::is_same_v<T, mllm_int64_t>) {
+    call_fill_zeros_i64(dst, n);
+  } else if constexpr (std::is_same_v<T, mllm_uint64_t>) {
+    call_fill_zeros_u64(dst, n);
+  } else if constexpr (std::is_same_v<T, mllm_int16_t>) {
+    call_fill_zeros_i16(dst, n);
+  } else if constexpr (std::is_same_v<T, mllm_uint16_t>) {
+    call_fill_zeros_u16(dst, n);
+  } else if constexpr (std::is_same_v<T, mllm_int8_t>) {
+    call_fill_zeros_i8(dst, n);
+  } else if constexpr (std::is_same_v<T, mllm_uint8_t>) {
+    call_fill_zeros_u8(dst, n);
+  } else {
+    // Fallback for unsupported types
+    std::memset(dst, 0, n * sizeof(T));
+  }
+}


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Check if the file exists and read the context around lines 105-131
cat -n mllm/backends/cpu/kernels/common/kernel_dispatch.hpp | sed -n '90,140p'

Repository: UbiquitousLearning/mllm

Length of output: 3224

🏁 Script executed:

# Also check for any template constraints or type requirements
cat -n mllm/backends/cpu/kernels/common/kernel_dispatch.hpp | head -110 | tail -30

Repository: UbiquitousLearning/mllm

Length of output: 2869

🏁 Script executed:

# Search for any type traits or constraints applied to fill_zeros_anytype
rg "fill_zeros_anytype" -A 10 -B 10 mllm/backends/cpu/kernels/common/kernel_dispatch.hpp

Repository: UbiquitousLearning/mllm

Length of output: 1402

🏁 Script executed:

# Search for where fill_zeros_anytype is used
rg "fill_zeros_anytype" -B 2 -A 2

Repository: UbiquitousLearning/mllm

Length of output: 6805

🏁 Script executed:

# Check if there's similar code in fill_ones_anytype to see the pattern
cat -n mllm/backends/cpu/kernels/common/kernel_dispatch.hpp | sed -n '133,160p'

Repository: UbiquitousLearning/mllm

Length of output: 1327

🌐 Web query:

std::memset non-trivial types C++ standard undefined behavior

💡 Result:

Short answer: Using std::memset on objects whose type is non‑trivially‑copyable (i.e. non‑trivial) is undefined behavior. For trivially‑copyable objects you may manipulate their object representation with byte‑wise functions, but you must respect the standard's restrictions (e.g. not a potentially‑overlapping subobject); the resulting value representation may be implementation‑defined. Prefer constructors / std::fill / type‑safe initialization for non‑trivial types.

Sources:

std::memset specification (undefined if object is not TriviallyCopyable or is a potentially‑overlapping subobject). [1]

Object / object‑representation and trivially‑copyable rules (why bytewise ops are only guaranteed for trivially‑copyable). [2]

Discussion / examples and standard pointers about memcpy/memset use and UB for non‑trivial types. [3]

References:
[1] cppreference — std::memset.
[2] cppreference — Object representation / trivially‑copyable.
[3] Relevant discussion (Stack Overflow / standard excerpts) about memcpy/memset and trivially‑copyable vs non‑trivial types.

Guard memset fallback against non-trivial types.

Using std::memset on non-trivial types is undefined behavior per the C++ standard. The template accepts any type without constraints, so the fallback must check std::is_trivial_v<T>. Consider the safer element-wise approach used in fill_ones_anytype instead.

Suggested fix

   } else {
     // Fallback for unsupported types
-    std::memset(dst, 0, n * sizeof(T));
+    if constexpr (std::is_trivial_v<T>) {
+      std::memset(dst, 0, n * sizeof(T));
+    } else {
+      for (size_t i = 0; i < n; ++i) { dst[i] = T{}; }
+    }
   }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

 template<typename T>
 inline void fill_zeros_anytype(T* dst, size_t n) {
   if constexpr (std::is_same_v<T, mllm_fp32_t>) {
     call_fill_zeros_fp32(dst, n);
   } else if constexpr (std::is_same_v<T, mllm_fp64_t>) {
     call_fill_zeros_fp64(dst, n);
   } else if constexpr (std::is_same_v<T, mllm_int32_t>) {
     call_fill_zeros_i32(dst, n);
   } else if constexpr (std::is_same_v<T, mllm_uint32_t>) {
     call_fill_zeros_u32(dst, n);
   } else if constexpr (std::is_same_v<T, mllm_int64_t>) {
     call_fill_zeros_i64(dst, n);
   } else if constexpr (std::is_same_v<T, mllm_uint64_t>) {
     call_fill_zeros_u64(dst, n);
   } else if constexpr (std::is_same_v<T, mllm_int16_t>) {
     call_fill_zeros_i16(dst, n);
   } else if constexpr (std::is_same_v<T, mllm_uint16_t>) {
     call_fill_zeros_u16(dst, n);
   } else if constexpr (std::is_same_v<T, mllm_int8_t>) {
     call_fill_zeros_i8(dst, n);
   } else if constexpr (std::is_same_v<T, mllm_uint8_t>) {
     call_fill_zeros_u8(dst, n);
   } else {
     // Fallback for unsupported types
-    std::memset(dst, 0, n * sizeof(T));
+    if constexpr (std::is_trivial_v<T>) {
+      std::memset(dst, 0, n * sizeof(T));
+    } else {
+      for (size_t i = 0; i < n; ++i) { dst[i] = T{}; }
+    }
   }
 }

🤖 Prompt for AI Agents

In `@mllm/backends/cpu/kernels/common/kernel_dispatch.hpp` around lines 105 - 131, The fallback in fill_zeros_anytype currently uses std::memset for all unmatched types; restrict that to only trivial types by checking std::is_trivial_v<T> and for non-trivial types perform an element-wise zero assignment (e.g., for (size_t i=0;i<n;++i) dst[i] = T{}), mirroring the safe approach used in fill_ones_anytype; update the else branch in fill_zeros_anytype to first if constexpr (std::is_trivial_v<T>) use std::memset, else use the element-wise loop to avoid undefined behavior on non-trivial types.
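
The guarded fallback can be exercised in isolation. The sketch below assumes a hypothetical free function `zero_fill_fallback` that mirrors only the final `else` branch; it is not the mllm function and leaves out the per-dtype kernel dispatch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <type_traits>

// Hypothetical stand-in for the fallback branch of fill_zeros_anytype.
template <typename T>
void zero_fill_fallback(T* dst, std::size_t n) {
  assert(dst != nullptr || n == 0);
  if constexpr (std::is_trivial_v<T>) {
    // Byte-wise zeroing is only used for trivial types.
    if (n != 0) { std::memset(dst, 0, n * sizeof(T)); }
  } else {
    // Non-trivial types are reset through assignment, never through raw bytes.
    for (std::size_t i = 0; i < n; ++i) { dst[i] = T{}; }
  }
}
```

With this shape, `zero_fill_fallback<int>` takes the `memset` path, while `zero_fill_fallback<std::string>` assigns a default-constructed string to each element.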

coderabbitai · 2026-01-20T09:27:53Z

mllm/backends/cpu/ops/FillOp.cpp

 #if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
-          // TODO
+          // FP16 not directly supported by Highway on x86, use scalar fallback
+          std::memset(dst.ptr<mllm_fp16_t>(), 0, dst.numel() * sizeof(mllm_fp16_t));
 #elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

find . -name "FillOp.cpp" -type f

Repository: UbiquitousLearning/mllm

Length of output: 167

🏁 Script executed:

cat -n ./mllm/backends/cpu/ops/FillOp.cpp

Repository: UbiquitousLearning/mllm

Length of output: 26749

🏁 Script executed:

cat -n ./mllm/backends/cpu/ops/FillOp.hpp | head -20

Repository: UbiquitousLearning/mllm

Length of output: 772

🏁 Script executed:

cat -n ./mllm/backends/cpu/kernels/Kernels.hpp | head -30

Repository: UbiquitousLearning/mllm

Length of output: 2061

🏁 Script executed:

cat -n ./mllm/backends/cpu/kernels/x86/fill.hpp | head -20

Repository: UbiquitousLearning/mllm

Length of output: 825

🏁 Script executed:

rg "#include <cstring>" ./mllm/backends/cpu/kernels/

Repository: UbiquitousLearning/mllm

Length of output: 612

🏁 Script executed:

rg "#include.*fill-inl.hpp" ./mllm/backends/cpu/kernels/

Repository: UbiquitousLearning/mllm

Length of output: 178

🏁 Script executed:

cat -n ./mllm/backends/cpu/kernels/x86/fill.hpp

Repository: UbiquitousLearning/mllm

Length of output: 1017

🏁 Script executed:

rg "#include.*fill-inl.hpp" ./mllm/backends/cpu/kernels/x86/

Repository: UbiquitousLearning/mllm

Length of output: 49

🏁 Script executed:

find ./mllm/backends/cpu/kernels/x86 -name "fill.cpp"

Repository: UbiquitousLearning/mllm

Length of output: 107

🏁 Script executed:

cat -n ./mllm/backends/cpu/kernels/x86/fill.cpp | head -30

Repository: UbiquitousLearning/mllm

Length of output: 1351

🏁 Script executed:

rg "#include" ./mllm/backends/cpu/kernels/x86/fill.cpp

Repository: UbiquitousLearning/mllm

Length of output: 206

🏁 Script executed:

cat -n ./mllm/utils/PlatformRTHelper.hpp

Repository: UbiquitousLearning/mllm

Length of output: 2824

🏁 Script executed:

cat -n ./mllm/core/DataTypes.hpp | head -30

Repository: UbiquitousLearning/mllm

Length of output: 1200

🏁 Script executed:

rg "#include <cstring>" ./mllm/core/ ./mllm/backends/cpu/ops/

Repository: UbiquitousLearning/mllm

Length of output: 2406

Add <cstring> include for std::memset usage on line 33.

The std::memset call at line 33 requires the <cstring> header. All other CPU operation files in the same directory include this header; add it to maintain consistency and ensure the file is self-contained.

🛠️ Proposed fix

 #include "mllm/backends/cpu/ops/FillOp.hpp"
 #include "mllm/backends/cpu/kernels/Kernels.hpp"
 #include "mllm/utils/PlatformRTHelper.hpp"
+#include <cstring>

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
-          // TODO
-          // FP16 not directly supported by Highway on x86, use scalar fallback
-          std::memset(dst.ptr<mllm_fp16_t>(), 0, dst.numel() * sizeof(mllm_fp16_t));
-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
+#include "mllm/backends/cpu/ops/FillOp.hpp"
+#include "mllm/backends/cpu/kernels/Kernels.hpp"
+#include "mllm/backends/cpu/utils/PlatformRTHelper.hpp"
+#include <cstring>

🤖 Prompt for AI Agents

In `@mllm/backends/cpu/ops/FillOp.cpp` around lines 31 - 34, The file uses std::memset in FillOp.cpp (inside the x86/x86_64 branch) but does not include <cstring>, so add the missing include to the top of the file; update FillOp.cpp to `#include` <cstring> (alongside other headers) so std::memset is declared and the file is self-contained and consistent with other CPU ops.
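
A self-contained version of the scalar FP16 zero fill might look like the sketch below. It assumes a 16-bit storage type, here the hypothetical alias `half_bits_t`, standing in for `mllm_fp16_t`; for IEEE 754 binary16 the all-zero bit pattern encodes +0.0, which is why a byte-wise fill is sufficient for zeros.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>  // declares std::memset; included directly so the file is self-contained
#include <vector>

// Assumption: raw 16-bit storage used in place of mllm_fp16_t for illustration.
using half_bits_t = std::uint16_t;

// Hypothetical helper: zero n half-precision values via their bit pattern.
inline void zero_fill_half(half_bits_t* dst, std::size_t n) {
  assert(dst != nullptr || n == 0);
  if (n != 0) { std::memset(dst, 0, n * sizeof(half_bits_t)); }
}
```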

pymllm/ffi/__init__.py

chenghuaWang and others added 5 commits January 17, 2026 02:32

fix: Suppress deprecated comma-subscript warnings in CMake and remove…

c111b64

… debug print statements from Qwen3DecoderLayer

feat(qualcomm): Add installation targets for flatbuffers and MllmQNNB…

fb471e5

…ackend in CMake, enhance PTQPass with unsolved tensor value checks, and update quantization specifications in RMSNorm and model file conversion.

feat(qualcomm): Refactor Qwen3 model to integrate ConcatObserver for …

7f8f0f2

…improved quantization, enhance rotate_half function to utilize observers, and ensure consistent scale and zero_point across concatenated inputs.

Merge branch 'UbiquitousLearning:main' into wch-main

a5b68ed

chenghuaWang requested review from liang1232018 and oreomaker as code owners January 20, 2026 05:28

UbiquitousLearning approved these changes Jan 20, 2026

coderabbitai bot reviewed Jan 20, 2026

chenghuaWang requested a review from yirongjie as a code owner January 20, 2026 09:17

coderabbitai bot reviewed Jan 20, 2026

chenghuaWang merged commit 42c4c70 into UbiquitousLearning:main Jan 20, 2026
4 checks passed
