Collaborator

@chenghuaWang chenghuaWang commented Jan 20, 2026

Summary by CodeRabbit

  • New Features

    • Fixed-parameter activation quantizer and concat observer added; model-level enable/disable fake-quant controls.
  • Improvements

    • Broader automatic generation and propagation of quantization specs across ops.
    • New checks for unsolved quantization entries and concat-parameter consistency.
    • Bit-width specific epsilon handling for quantization and improved attention/dtype handling.
  • New Exports

    • Expanded scalar dtype exports (int8/16/32/64, uint8/16/32/64, bool) for Python API.

chenghuaWang and others added 5 commits January 17, 2026 02:32
…fixed quantization parameters, updated ActivationQDQ to use MovingAverageMinMaxObserver, and adjusted eps values for better precision. Modified Qwen3 model to utilize FixedActivationQDQ for sigmoid output and ensured dtype consistency in attention calculations.
… debug print statements from Qwen3DecoderLayer
…ackend in CMake, enhance PTQPass with unsolved tensor value checks, and update quantization specifications in RMSNorm and model file conversion.
…improved quantization, enhance rotate_half function to utilize observers, and ensure consistent scale and zero_point across concatenated inputs.
Contributor

coderabbitai bot commented Jan 20, 2026

📝 Walkthrough

Adds multiple quantization utilities and integrations across C++ and Python backends: fixed-parameter activation QDQ, concat observers, broader automatic quantization-spec generation and validation in AOT/PTQ passes, expanded CPU fill kernels/API, model serialization tweaks, and CMake install/export updates for several targets.

Changes

Cohort / File(s) Summary
CMake install/export
CMakeLists.txt, mllm/backends/qnn/CMakeLists.txt
Added install/export rules for flatbuffers and MllmQNNBackend targets (LIBRARY/ARCHIVE→lib, RUNTIME→bin).
Compiler warnings
mllm/CMakeLists.txt
Added -Wno-comma-subscript to MllmRT to suppress comma-subscript warnings.
Qwen3 AOT modeling (C++)
examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp
Reworked rotateHalf to accept module/QDQ name, use ptq::QDQ for second half, updated Qwen3Attention masked-softmax path to use quantized fallback + QDQ wrapper.
Qwen3 Python model
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py, .../runner.py, .../train.py
Imported FixedActivationQDQ and ConcatObserver; replaced some ActivationQDQ uses with FixedActivationQDQ; updated rotate_half signature to accept observers; added enable/disable fake-quant helpers and quantization-aware loading/convert/calibration flow; layer index field added.
Qualcomm QDQ & observers (Python)
pymllm/backends/qualcomm/transformers/core/qdq.py, .../observer.py, .../rms_norm.py, .../qlinear.py
Added FixedActivationQDQ, ConcatObserver, bit-width eps constants, observer eps propagation, renamed fake-quant control methods, and small formatting/refactor adjustments; QRMSNorm FakeQuantize configured with explicit dtype/qscheme and eps.
AOT quant recipe improvements (C++)
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp
Patterns now auto-generate missing quant_recipe attrs (Concat, Where, RoPE, Elementwise, etc.), propagate specs consistently, and reduce early failures.
PTQ validation (C++)
mllm/backends/qnn/aot/passes/PTQPass.cpp
Added recursiveCheckUnsolved and recursiveCheckConcatInputs graph traversals to warn about unsolved specs and validate Concat input scale/zero_point consistency; integrated into solve flow.
RMSNorm visitor (C++)
mllm/backends/qnn/aot/visitor/RMSNorm.cpp
Switched fake-bias quant recipe from int32/zero-range to int16 with bias_scale tensor; set runtime bias name.
CPU fill utilities (C++)
mllm/backends/cpu/kernels/common/fill-inl.hpp, .../kernel_dispatch.cpp, .../kernel_dispatch.hpp, mllm/backends/cpu/ops/FillOp.cpp
New SIMD-backed fill utilities (zeros/ones/value/arange/random) for many scalar types, exported HWY APIs and dynamic-dispatch wrappers, and X86 paths updated to use generic anytype wrappers with ARM-specific paths preserved.
FFI and Python dtype exports
mllm/ffi/Extension.cc, pymllm/ffi/__init__.py, pymllm/__init__.py
Added factory functions and public instances for int8/16/32/64, uint8/16/32/64, bool; exported new dtype and device singletons.
Model file serialization (Python)
pymllm/convertor/model_file_v2.py
Added _torch_tensor_bytes() helper and replaced numpy-based tensor serialization calls with it (streaming/static write paths).

Sequence Diagram(s)

(Skipped — changes are broad and dispersed; no single new multi-component sequential flow met the diagram criteria.)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • liang1232018
  • oreomaker
  • yirongjie

Poem

🐰 I nibble bytes and hop through code,

Fixed scales snug in rabbit mode,
Concat bounds and QDQ spin,
RoPE rotates with a quantized grin,
Tiny hops — big changes made — hooray for this bunny's parade!

🚥 Pre-merge checks | ❌ 3
❌ Failed checks (2 warnings, 1 inconclusive)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The pull request has no description provided; the required template structure is entirely missing. | Add a complete pull request description following the repository template, including motivation, changes made, and any relevant testing or validation information. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 11.82% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title 'fix(qualcomm): Enhance quantization modules.' is vague and generic; 'enhance' is non-specific about what improvements are actually made. | Replace 'Enhance quantization modules' with a specific description of the main change, e.g., 'Add FixedActivationQDQ and ConcatObserver for Qwen3 quantization' or 'Improve QDQ observer configuration with epsilon handling'. |

Owner

@UbiquitousLearning UbiquitousLearning left a comment

LGTM

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py (1)

104-136: rotate_half signature now breaks existing callers.

Line 104 requires observer args, but Line 135 still calls rotate_half(q) without them. If apply_rotary_pos_emb is invoked, this will raise a TypeError. Also, x_observer is unused. Please keep backward compatibility (or update all callers) and remove/rename the unused parameter.

🛠️ Backward‑compatible fix
-def rotate_half(
-    x, x_observer, x2_neg_fake_quant: ActivationQDQ, concat_observer: ConcatObserver
-):
+def rotate_half(
+    x,
+    _x_observer=None,
+    x2_neg_fake_quant: Optional[ActivationQDQ] = None,
+    concat_observer: Optional[ConcatObserver] = None,
+):
     """Rotates half the hidden dims of the input."""
     x1 = x[..., : x.shape[-1] // 2]
     x2 = x[..., x.shape[-1] // 2 :]
-    return concat_observer(torch.cat((x2_neg_fake_quant(-x2), x1), dim=-1))
+    if x2_neg_fake_quant is None or concat_observer is None:
+        return torch.cat((-x2, x1), dim=-1)
+    return concat_observer(torch.cat((x2_neg_fake_quant(-x2), x1), dim=-1))
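The backward-compatible behavior proposed in the diff above can be illustrated without PyTorch. The sketch below is hypothetical: plain Python lists stand in for tensors, and `negate_hook` and `concat_hook` are assumed stand-ins for `x2_neg_fake_quant` and `concat_observer`. It only shows that a legacy call without observers keeps working while the observed path returns the same values.

```python
from typing import Callable, List, Optional

Vector = List[float]


def rotate_half(
    x: Vector,
    negate_hook: Optional[Callable[[Vector], Vector]] = None,
    concat_hook: Optional[Callable[[Vector], Vector]] = None,
) -> Vector:
    """Rotate half of the last dimension: (x1, x2) -> (-x2, x1)."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    neg_x2 = [-v for v in x2]
    if negate_hook is None or concat_hook is None:
        # Legacy call path: no observers supplied, plain float math.
        return neg_x2 + x1
    # Quantization-aware path: observe -x2, then the concatenated result.
    return concat_hook(negate_hook(neg_x2) + x1)


seen = []


def record(v: Vector) -> Vector:
    """Toy observer that records what it saw and passes the data through."""
    seen.append(list(v))
    return v


plain = rotate_half([1.0, 2.0, 3.0, 4.0])
observed = rotate_half([1.0, 2.0, 3.0, 4.0], record, record)
```

Defaulting the hooks to `None` is one way to keep old callers such as `apply_rotary_pos_emb` working; updating every caller instead would be an equally valid resolution.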
🤖 Fix all issues with AI agents
In `@mllm/backends/qnn/aot/passes/PTQPass.cpp`:
- Around line 358-418: The loop in PTQPass.cpp can read ref_zero_point
uninitialized if the first captured reference is kSymPerTensor and a later input is
kAsymPerTensor; add tracking for the reference spec type (e.g., an
enum/ref_spec_type alongside has_ref) when you set
ref_scale/ref_zero_point/ref_input_name, and before comparing a new input check
that f_spec->spec_->type matches ref_spec_type; if types differ emit a clear
MLLM_ERROR/MLLM_WARN mentioning op_name and both input names and skip comparison
(or fail early), and only access ref_zero_point when ref_spec_type ==
kAsymPerTensor so no uninitialized reads occur.

In `@pymllm/backends/qualcomm/transformers/qwen3/runner.py`:
- Around line 57-61: The call to Qwen3ForCausalLM.from_pretrained uses the wrong
keyword arg name `dtype`; update the call in runner.py where
Qwen3ForCausalLM.from_pretrained(model_path, attn_implementation="eager",
dtype=torch.float32) is invoked to use the correct HuggingFace parameter name
`torch_dtype=torch.float32` so the dtype is passed properly to the
PreTrainedModel loader.

In `@pymllm/backends/qualcomm/transformers/qwen3/train.py`:
- Around line 41-45: Decide and implement the intended fake-quant behavior by
removing the FIXME and either making the disable/enable calls deterministic or
exposing them as a CLI/config flag; e.g., add a boolean flag
(args.disable_fake_quant_before_calibration) and use it to conditionally call
m.disable_fake_quant() before m.calibrate(...) and m.enable_fake_quant() after,
ensuring the sequence around m.calibrate(...) and m.infer(...) is deterministic
and documented in the flag help text.
- Around line 50-53: The assigned lm_head parameter uses unquantized
embed_tokens weights before m.convert(), causing QLinearLPBQ's frozen
weight_quant to remain stale; fix by moving the weight tying to after
m.convert() (i.e., set m.model.lm_head.weight =
Parameter(m.model.model.embed_tokens.weight.clone()) only once convert() has
run) or, if tying must happen before convert(), update/re-freeze the QLinearLPBQ
internal quant state (weight_quant) after assignment so weight_quant.weight_q
reflects the new parameter; refer to m.model.lm_head.weight,
m.model.model.embed_tokens.weight, m.convert(), and the QLinearLPBQ frozen
weight_quant initialization to implement the change.
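One way to make the calibration sequence in the first train.py item deterministic is sketched below. This is an assumption-laden illustration, not the repository's code: `StubModel` is a hypothetical stand-in that only records call order, and the flag name follows the one suggested in the prompt above.

```python
import argparse


class StubModel:
    """Hypothetical stand-in that only records the order of calls."""

    def __init__(self):
        self.calls = []

    def disable_fake_quant(self):
        self.calls.append("disable_fake_quant")

    def enable_fake_quant(self):
        self.calls.append("enable_fake_quant")

    def calibrate(self):
        self.calls.append("calibrate")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--disable-fake-quant-before-calibration",
        action="store_true",
        help="Run calibration with fake-quant off, then re-enable it.",
    )
    return parser


def run_calibration(m, args) -> None:
    """Toggle fake-quant around calibration only when the flag is set."""
    if args.disable_fake_quant_before_calibration:
        m.disable_fake_quant()
    m.calibrate()
    if args.disable_fake_quant_before_calibration:
        m.enable_fake_quant()


default_model, flagged_model = StubModel(), StubModel()
run_calibration(default_model, build_parser().parse_args([]))
run_calibration(
    flagged_model,
    build_parser().parse_args(["--disable-fake-quant-before-calibration"]),
)
```

With the flag off, only `calibrate` runs; with it on, the disable/enable pair always brackets calibration, so the behavior no longer depends on a commented-out line.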
🧹 Nitpick comments (9)
mllm/CMakeLists.txt (1)

59-61: Consider using PRIVATE instead of PUBLIC for warning suppression.

Using PUBLIC propagates -Wno-comma-subscript to all targets that link against MllmRT, which could mask comma-subscript warnings in downstream code that should be fixed. If the deprecated syntax is only used within MllmRT itself (as the FIXME suggests), PRIVATE would be more appropriate to limit the scope of warning suppression.

Suggested change
 # FIXME: `@oreomaker` Need to remove comma features in slice!
 # Suppress comma-subscript warnings (deprecated C++ feature that will be removed in C++26)
-target_compile_options(MllmRT PUBLIC -Wno-comma-subscript)
+target_compile_options(MllmRT PRIVATE -Wno-comma-subscript)
mllm/backends/qnn/aot/visitor/RMSNorm.cpp (1)

53-55: Consider using named constants for int16 quantization bounds.

The magic numbers 32767 and -32768 represent the int16 symmetric quantization range. Extracting these as named constants would improve readability and make the relationship between scale and range explicit.

♻️ Suggested refactor
+  constexpr int16_t kInt16Max = 32767;
+  constexpr int16_t kInt16Min = -32768;
+
   // fake bias quant recipe
   auto bias_scale = Tensor::ones({1});
-  bias_scale.at<float>({0}) = 1.0 / 32767;
-  auto quant_spec = mllm::ir::linalg::QuantizationSpecSymPerTensor::create(-32768, 32767, kInt16, kFloat32, bias_scale);
+  bias_scale.at<float>({0}) = 1.0f / kInt16Max;
+  auto quant_spec = mllm::ir::linalg::QuantizationSpecSymPerTensor::create(kInt16Min, kInt16Max, kInt16, kFloat32, bias_scale);
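To show how the scale `1 / 32767` relates to the int16 bounds, here is a minimal Python sketch of symmetric per-tensor quantization. The function and constant names are hypothetical and do not correspond to the mllm API; the rounding mode is Python's built-in `round`, which may differ from what a real backend uses.

```python
INT16_MIN = -32768
INT16_MAX = 32767
BIAS_SCALE = 1.0 / INT16_MAX  # maps the float value 1.0 onto INT16_MAX


def quantize_sym_int16(value: float, scale: float = BIAS_SCALE) -> int:
    """Symmetric per-tensor quantization: round(value / scale), then clamp."""
    q = round(value / scale)
    return max(INT16_MIN, min(INT16_MAX, q))


def dequantize_sym_int16(q: int, scale: float = BIAS_SCALE) -> float:
    """Map an int16 code back to its approximate float value."""
    return q * scale
```

Values outside roughly [-1, 1] saturate at the bounds, which is why the bounds and the scale are best defined next to each other.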
mllm/backends/qnn/aot/passes/PTQPass.cpp (1)

460-468: LGTM!

The validation functions are correctly invoked after the solving passes, ensuring all quantization specs are resolved before checking for issues. The ordering is appropriate.

Consider caching the SubGraphOp lookup to avoid repeated symbol table lookups:

auto main_subgraph = getCtx()->lookupSymbolTable(call_main_graph_op->getSymbolAttr()->str())->cast_<ir::graph::SubGraphOp>();
recursiveSolveWeights(writer.getContext(), main_subgraph, pf);
recursiveSolveNormal(writer.getContext(), main_subgraph, pf);
recursiveCheckUnsolved(writer.getContext(), main_subgraph);
recursiveCheckConcatInputs(writer.getContext(), main_subgraph);
pymllm/convertor/model_file_v2.py (1)

27-33: Consider moving this function inside the torch availability guard.

The function references torch.uint8 and is only valid when PyTorch is available. While current call sites are properly guarded, placing the function definition inside the if MLLM_FIND_TORCH_AVAILABLE: block would make the dependency explicit and prevent accidental misuse.

Suggested change
 if MLLM_FIND_TORCH_AVAILABLE:
     import torch
+
+    def _torch_tensor_bytes(tensor: "torch.Tensor") -> bytes:
+        """Serialize a PyTorch tensor to raw bytes using uint8 view.
+
+        Handles dtypes not natively supported by numpy (e.g., bfloat16) by
+        viewing the tensor's storage as uint8 before conversion.
+        """
+        t = tensor.detach().cpu().contiguous()
+        if t.dim() == 0:
+            t = t.reshape(1)
+        return t.view(torch.uint8).numpy().tobytes()
+
 if MLLM_FIND_NUMPY_AVAILABLE:
     import numpy as np
-from .mllm_type_mapping import MLLM_TYPE_MAPPING
-
-
-def _torch_tensor_bytes(tensor: "torch.Tensor") -> bytes:
-    # Use uint8 view to preserve raw bytes for dtypes not supported by numpy.
-    t = tensor.detach().cpu().contiguous()
-    if t.dim() == 0:
-        t = t.reshape(1)
-    return t.view(torch.uint8).numpy().tobytes()
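The idea behind `_torch_tensor_bytes`, reinterpreting a buffer as unsigned bytes instead of converting element values, can be shown with the standard library alone. This sketch uses `array` and `memoryview` as an analogy for the `view(torch.uint8)` call; it does not use PyTorch and does not cover the 0-dim reshape handled in the original helper.

```python
from array import array


def raw_bytes(buf: array) -> bytes:
    """Return the buffer's bytes unchanged by viewing it as unsigned bytes."""
    return memoryview(buf).cast("B").tobytes()


values = array("f", [1.0, -2.5, 0.0])
payload = raw_bytes(values)

# Round trip: the same bytes rebuild the same float values.
restored = array("f")
restored.frombytes(payload)
```

Because no element is converted, the byte count is exactly `len(values) * values.itemsize`, which is the property the serializer relies on for dtypes that numpy cannot represent.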
pymllm/backends/qualcomm/transformers/core/rms_norm.py (1)

23-31: Give the eps literal a named constant.

Line 26 inlines 0.0001 / 65535. Consider extracting a module-level constant (or reusing a shared constant) to keep eps consistent and self-descriptive.

♻️ Suggested refactor
+DEFAULT_EPS_16BIT = 0.0001 / 65535
 ...
         self.weight_fake_quant = FakeQuantize(
             observer=MinMaxObserver.with_args(
                 qscheme=torch.per_tensor_affine,
                 dtype=torch.qint32,
-                eps=0.0001 / 65535,
+                eps=DEFAULT_EPS_16BIT,
             ),

As per coding guidelines, use named constants instead of magic numbers.
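A bit-width-aware constant could be derived rather than hard-coded, as in the sketch below. Only the 16-bit value `0.0001 / 65535` appears in the review; generalizing the same formula to other bit widths is an assumption, and `eps_for_bits` and `scale_from_range` are hypothetical helper names. The sketch treats eps as a lower bound on the computed scale.

```python
EPS_NUMERATOR = 0.0001


def eps_for_bits(bits: int) -> float:
    """Smallest allowed scale for an unsigned range of the given bit width."""
    return EPS_NUMERATOR / (2**bits - 1)


DEFAULT_EPS_16BIT = eps_for_bits(16)


def scale_from_range(min_val: float, max_val: float, bits: int) -> float:
    """Affine per-tensor scale, floored at eps to avoid a zero scale."""
    levels = 2**bits - 1
    return max((max_val - min_val) / levels, eps_for_bits(bits))
```

A degenerate range such as `min_val == max_val` then yields `DEFAULT_EPS_16BIT` instead of zero.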

pymllm/backends/qualcomm/transformers/core/observer.py (1)

43-52: Use tensor ops + in‑place updates for min/max tracking.

Line 45 uses Python min/max on tensors and rebinds buffers. Using torch.minimum/maximum with copy_ avoids sync-y comparisons and keeps buffers stable.

♻️ Suggested refactor
-        self.min_val = min(self.min_val, x_orig.min())
-        self.max_val = max(self.max_val, x_orig.max())
+        self.min_val.copy_(torch.minimum(self.min_val, x_orig.min()))
+        self.max_val.copy_(torch.maximum(self.max_val, x_orig.max()))
 ...
-        for observers in self.input_observers:
-            observers.min_val = self.min_val
-            observers.max_val = self.max_val
+        for observers in self.input_observers:
+            observers.min_val.copy_(self.min_val)
+            observers.max_val.copy_(self.max_val)

As per coding guidelines, avoid unnecessary work in hot paths.
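The range-sharing behavior discussed here can be sketched in plain Python. The classes below are hypothetical simplifications (floats instead of tensor buffers, no moving average) and are not the `ConcatObserver` from the PR; they only show why every concat input ends up with the same min/max, and therefore the same scale and zero point.

```python
class RangeObserver:
    """Hypothetical per-input observer holding a running min/max."""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, values):
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))


class ConcatRangeObserver(RangeObserver):
    """Tracks the concatenated output and pushes its range to every input."""

    def __init__(self, input_observers):
        super().__init__()
        self.input_observers = list(input_observers)

    def observe(self, values):
        super().observe(values)
        for obs in self.input_observers:
            obs.min_val = self.min_val
            obs.max_val = self.max_val


a, b = RangeObserver(), RangeObserver()
a.observe([-1.0, 0.5])
b.observe([0.0, 3.0])
concat = ConcatRangeObserver([a, b])
concat.observe([-1.0, 0.5, 0.0, 3.0])
```

After the last call both inputs report the combined range, which is the consistency that `recursiveCheckConcatInputs` later validates on the C++ side.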

pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py (2)

82-87: Derive sigmoid scale from a named constant.

Line 84 hard‑codes 65535. Consider extracting a named constant (or computing from bits) to make intent clearer and reduce magic numbers.

♻️ Suggested refactor
+SIGMOID_QMAX_16 = (2**16) - 1
 ...
-        sigmoid_scale = 1.0 / (65535 - 0 + 1)  # 1 / 65536
+        sigmoid_scale = 1.0 / (SIGMOID_QMAX_16 + 1)  # 1 / 65536

As per coding guidelines, use named constants instead of magic numbers.
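For context, fixed quantization parameters for a sigmoid output can be sketched as follows. The scale `1 / 65536` comes from the reviewed line; the zero point of 0, the unsigned 16-bit clamp range, and the function name are assumptions made for illustration.

```python
import math

QMIN_U16 = 0
QMAX_U16 = (2**16) - 1
SIGMOID_SCALE = 1.0 / (QMAX_U16 - QMIN_U16 + 1)  # 1 / 65536
SIGMOID_ZERO_POINT = 0  # assumed: sigmoid output is never negative


def fake_quant_sigmoid(x: float) -> float:
    """Quantize sigmoid(x) with fixed parameters, then dequantize."""
    y = 1.0 / (1.0 + math.exp(-x))
    q = round(y / SIGMOID_SCALE) + SIGMOID_ZERO_POINT
    q = max(QMIN_U16, min(QMAX_U16, q))
    return (q - SIGMOID_ZERO_POINT) * SIGMOID_SCALE
```

Since sigmoid is bounded to (0, 1), the parameters never need calibration, which is the motivation for a fixed-parameter QDQ module on this output.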


381-381: Typo: layer_dix → layer_idx for consistency.

Line 381 looks like a misspelling; consider renaming for clarity and to match the rest of the codebase.

♻️ Suggested fix
-        self.layer_dix = layer_idx
+        self.layer_idx = layer_idx

As per coding guidelines, keep naming consistent.

pymllm/backends/qualcomm/transformers/qwen3/runner.py (1)

37-45: Consider using tuple in isinstance checks for cleaner code.

The logic is correct, but you can simplify the condition using a tuple.

♻️ Suggested refactor
 def enable_fake_quant(m):
-    if isinstance(m, ActivationQDQ) or isinstance(m, FixedActivationQDQ):
+    if isinstance(m, (ActivationQDQ, FixedActivationQDQ)):
         m.enable_fakequant()


 def disable_fake_quant(m):
-    if isinstance(m, ActivationQDQ) or isinstance(m, FixedActivationQDQ):
+    if isinstance(m, (ActivationQDQ, FixedActivationQDQ)):
         m.disable_fakequant()

This matches the pattern used in freeze_qwen3_linear_weight and is more idiomatic Python.
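A self-contained version of the tuple form is sketched below. The two classes are hypothetical stubs that only mimic the `enable_fakequant`/`disable_fakequant` methods named in the diff; in the real code they would be the QDQ modules, and the helper would typically be passed to `torch.nn.Module.apply`.

```python
class ActivationQDQ:
    """Stub exposing the toggle methods named in the review."""

    def __init__(self):
        self.fake_quant_enabled = True

    def enable_fakequant(self):
        self.fake_quant_enabled = True

    def disable_fakequant(self):
        self.fake_quant_enabled = False


class FixedActivationQDQ:
    """Independent stub; deliberately not a subclass of ActivationQDQ."""

    def __init__(self):
        self.fake_quant_enabled = True

    def enable_fakequant(self):
        self.fake_quant_enabled = True

    def disable_fakequant(self):
        self.fake_quant_enabled = False


QDQ_TYPES = (ActivationQDQ, FixedActivationQDQ)


def disable_fake_quant(m):
    """Disable fake-quant on QDQ modules and ignore everything else."""
    if isinstance(m, QDQ_TYPES):
        m.disable_fakequant()


modules = [ActivationQDQ(), FixedActivationQDQ(), object()]
for m in modules:
    disable_fake_quant(m)
```

Keeping the tuple in one named constant also means a future QDQ variant only needs to be added in a single place.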

Comment on lines 358 to 418
for (auto iii : inputs) {
if (!iii->isa_<ir::tensor::TensorValue>()) continue;
auto tv = iii->cast_<ir::tensor::TensorValue>();
if (!tv->getAttr("quant_recipe")) continue;
auto f_spec = tv->getAttr("quant_recipe")->cast_<ir::linalg::LinalgIRQuantizatonSpecAttr>();

if (f_spec->spec_->type == ir::linalg::QuantizationSpecType::kAsymPerTensor) {
auto this_spec = std::static_pointer_cast<ir::linalg::QuantizationSpecAsymPerTensor>(f_spec->spec_);
if (!this_spec->solved) continue;

if (!has_ref) {
ref_scale = this_spec->scale;
ref_zero_point = this_spec->zero_point;
ref_input_name = tv->name();
has_ref = true;
} else {
// Check if scale and zero_point match
auto cur_scale = this_spec->scale;
auto cur_zero_point = this_spec->zero_point;

MLLM_RT_ASSERT_EQ(ref_scale.numel(), 1);
MLLM_RT_ASSERT_EQ(cur_scale.numel(), 1);
MLLM_RT_ASSERT_EQ(ref_zero_point.numel(), 1);
MLLM_RT_ASSERT_EQ(cur_zero_point.numel(), 1);

auto ref_scale_v = ref_scale.item<mllm_fp32_t>();
auto cur_scale_v = cur_scale.item<mllm_fp32_t>();
auto ref_zp_v = ref_zero_point.item<mllm_int32_t>();
auto cur_zp_v = cur_zero_point.item<mllm_int32_t>();

if (std::abs(ref_scale_v - cur_scale_v) > 1e-6 || ref_zp_v != cur_zp_v) {
MLLM_ERROR("PTQPass: ConcatOp '{}' has mismatched scale/zp between inputs. "
"Input '{}': scale={}, zp={}; Input '{}': scale={}, zp={}",
op_name, ref_input_name, ref_scale_v, ref_zp_v, tv->name(), cur_scale_v, cur_zp_v);
}
}
} else if (f_spec->spec_->type == ir::linalg::QuantizationSpecType::kSymPerTensor) {
auto this_spec = std::static_pointer_cast<ir::linalg::QuantizationSpecSymPerTensor>(f_spec->spec_);
if (!this_spec->solved) continue;

if (!has_ref) {
ref_scale = this_spec->scale;
ref_input_name = tv->name();
has_ref = true;
} else {
// Check if scale matches
auto cur_scale = this_spec->scale;

MLLM_RT_ASSERT_EQ(ref_scale.numel(), 1);
MLLM_RT_ASSERT_EQ(cur_scale.numel(), 1);

auto ref_scale_v = ref_scale.item<mllm_fp32_t>();
auto cur_scale_v = cur_scale.item<mllm_fp32_t>();

if (std::abs(ref_scale_v - cur_scale_v) > 1e-6) {
MLLM_ERROR("PTQPass: ConcatOp '{}' has mismatched scale between inputs. "
"Input '{}': scale={}; Input '{}': scale={}",
op_name, ref_input_name, ref_scale_v, tv->name(), cur_scale_v);
}
}
}
Contributor

⚠️ Potential issue | 🟠 Major

Potential undefined behavior when mixing quantization spec types.

If the first input has kSymPerTensor (which only sets ref_scale) and a subsequent input has kAsymPerTensor, the code at line 385 will read ref_zero_point which was never initialized, leading to undefined behavior.

Consider either:

  1. Tracking which spec type the reference was captured from and only comparing inputs of the same type.
  2. Emitting an error/warning when inputs have mismatched quantization spec types.
Proposed fix to track reference spec type
       Tensor ref_scale;
       Tensor ref_zero_point;
       bool has_ref = false;
       std::string ref_input_name;
+      ir::linalg::QuantizationSpecType ref_spec_type = ir::linalg::QuantizationSpecType::kRaw;

       for (auto iii : inputs) {
         if (!iii->isa_<ir::tensor::TensorValue>()) continue;
         auto tv = iii->cast_<ir::tensor::TensorValue>();
         if (!tv->getAttr("quant_recipe")) continue;
         auto f_spec = tv->getAttr("quant_recipe")->cast_<ir::linalg::LinalgIRQuantizatonSpecAttr>();

         if (f_spec->spec_->type == ir::linalg::QuantizationSpecType::kAsymPerTensor) {
           auto this_spec = std::static_pointer_cast<ir::linalg::QuantizationSpecAsymPerTensor>(f_spec->spec_);
           if (!this_spec->solved) continue;

           if (!has_ref) {
             ref_scale = this_spec->scale;
             ref_zero_point = this_spec->zero_point;
             ref_input_name = tv->name();
+            ref_spec_type = ir::linalg::QuantizationSpecType::kAsymPerTensor;
             has_ref = true;
           } else {
+            if (ref_spec_type != ir::linalg::QuantizationSpecType::kAsymPerTensor) {
+              MLLM_ERROR("PTQPass: ConcatOp '{}' has mismatched quantization spec types between inputs.", op_name);
+              continue;
+            }
             // Check if scale and zero_point match
             // ... existing code ...
           }
         } else if (f_spec->spec_->type == ir::linalg::QuantizationSpecType::kSymPerTensor) {
           auto this_spec = std::static_pointer_cast<ir::linalg::QuantizationSpecSymPerTensor>(f_spec->spec_);
           if (!this_spec->solved) continue;

           if (!has_ref) {
             ref_scale = this_spec->scale;
             ref_input_name = tv->name();
+            ref_spec_type = ir::linalg::QuantizationSpecType::kSymPerTensor;
             has_ref = true;
           } else {
+            if (ref_spec_type != ir::linalg::QuantizationSpecType::kSymPerTensor) {
+              MLLM_ERROR("PTQPass: ConcatOp '{}' has mismatched quantization spec types between inputs.", op_name);
+              continue;
+            }
             // Check if scale matches
             // ... existing code ...
           }
         }
       }
🤖 Prompt for AI Agents
In `@mllm/backends/qnn/aot/passes/PTQPass.cpp` around lines 358 - 418, The loop in
PTQPass.cpp can read ref_zero_point uninitialized if the first captured
reference is kSymPerTensor and a later input is kAsymPerTensor; add
tracking for the reference spec type (e.g., an enum/ref_spec_type alongside
has_ref) when you set ref_scale/ref_zero_point/ref_input_name, and before
comparing a new input check that f_spec->spec_->type matches ref_spec_type; if
types differ emit a clear MLLM_ERROR/MLLM_WARN mentioning op_name and both input
names and skip comparison (or fail early), and only access ref_zero_point when
ref_spec_type == kAsymPerTensor so no uninitialized reads occur.
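The type-tracking check described above can be modeled in a few lines of Python. This is a simplified sketch under stated assumptions, not a port of PTQPass.cpp: `Spec` is a hypothetical stand-in for a solved per-tensor spec, the first input is always used as the reference, and errors are returned as strings instead of being logged.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Spec:
    """Hypothetical stand-in for a solved per-tensor quantization spec."""

    name: str
    kind: str  # "asym" or "sym"
    scale: float
    zero_point: Optional[int] = None  # only meaningful for "asym"


def check_concat_inputs(op_name: str, specs: List[Spec]) -> List[str]:
    """Return one message per input that disagrees with the first input."""
    errors: List[str] = []
    if not specs:
        return errors
    ref = specs[0]
    for cur in specs[1:]:
        if cur.kind != ref.kind:
            # Never compare zero points across spec types.
            errors.append(
                f"{op_name}: spec type mismatch between '{ref.name}' and '{cur.name}'"
            )
            continue
        scale_differs = abs(ref.scale - cur.scale) > 1e-6
        zp_differs = ref.kind == "asym" and ref.zero_point != cur.zero_point
        if scale_differs or zp_differs:
            errors.append(
                f"{op_name}: scale/zero_point mismatch between '{ref.name}' and '{cur.name}'"
            )
    return errors


mixed = check_concat_inputs(
    "concat0", [Spec("a", "sym", 0.1), Spec("b", "asym", 0.1, 3)]
)
matching = check_concat_inputs(
    "concat1", [Spec("a", "asym", 0.1, 3), Spec("b", "asym", 0.1, 3)]
)
zp_mismatch = check_concat_inputs(
    "concat2", [Spec("a", "asym", 0.1, 3), Spec("b", "asym", 0.1, 4)]
)
```

The zero point is read only when the reference is asymmetric, so the mixed-type case reports a type mismatch instead of touching an unset value.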

Comment on lines 57 to 61
self.model = Qwen3ForCausalLM.from_pretrained(
model_path,
attn_implementation="eager",
dtype=torch.float32,
)
Contributor

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Find Qwen3ForCausalLM class definition
fd -e py -e pyi | xargs rg -l "class Qwen3ForCausalLM" | head -20

Repository: UbiquitousLearning/mllm

Length of output: 128


🏁 Script executed:

# Also search for the file structure under qwen3 backend
find pymllm/backends/qualcomm/transformers/qwen3 -type f -name "*.py" 2>/dev/null

Repository: UbiquitousLearning/mllm

Length of output: 235


🏁 Script executed:

# Check the imports and context in the runner.py file
cat -n pymllm/backends/qualcomm/transformers/qwen3/runner.py | head -100

Repository: UbiquitousLearning/mllm

Length of output: 3860


🏁 Script executed:

# Examine the Qwen3ForCausalLM class and from_pretrained method
cat -n pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py | head -200

Repository: UbiquitousLearning/mllm

Length of output: 9569


🏁 Script executed:

# Search for from_pretrained method in the modeling file
rg -n "from_pretrained" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -A 20

Repository: UbiquitousLearning/mllm

Length of output: 1197


🏁 Script executed:

# Search for the Qwen3ForCausalLM class definition
rg -n "^class Qwen3ForCausalLM" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -A 30

Repository: UbiquitousLearning/mllm

Length of output: 1354


🏁 Script executed:

# Check if Qwen3ForCausalLM inherits from PreTrainedModel or has its own from_pretrained
rg -n "class Qwen3ForCausalLM" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -B 2 -A 5

Repository: UbiquitousLearning/mllm

Length of output: 357


🏁 Script executed:

# Search for any custom from_pretrained or _from_pretrained method in the file
rg -n "def.*from_pretrained|def _from_pretrained" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -A 20

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Find Qwen3PreTrainedModel definition
rg -n "class Qwen3PreTrainedModel" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -A 20

Repository: UbiquitousLearning/mllm

Length of output: 779


🏁 Script executed:

# Check the transformers library's PreTrainedModel to understand the from_pretrained signature
# Let's look for how PreTrainedModel is imported and check if dtype parameter is used elsewhere
rg -n "torch_dtype|dtype=" pymllm/backends/qualcomm/transformers/qwen3/ -t py | head -30

Repository: UbiquitousLearning/mllm

Length of output: 1329


🏁 Script executed:

# Check the actual transformers library PreTrainedModel to see what parameters from_pretrained accepts
# Look for any kwargs handling or parameter documentation in the model class
rg -n "def from_pretrained" /usr/local/lib/python*/dist-packages/transformers/modeling_utils.py 2>/dev/null || echo "Checking local transformers..."

# Instead, let's check what the actual HuggingFace transformers library uses
python3 << 'EOF'
try:
    from transformers import PreTrainedModel
    import inspect
    sig = inspect.signature(PreTrainedModel.from_pretrained)
    print("PreTrainedModel.from_pretrained signature:")
    print(sig)
    print("\nParameters:")
    for param_name, param in sig.parameters.items():
        print(f"  {param_name}: {param}")
except Exception as e:
    print(f"Error: {e}")
EOF

Repository: UbiquitousLearning/mllm

Length of output: 153


🏁 Script executed:

# Let's also check if there are any test files that show how this is used
find pymllm -name "*test*.py" -o -name "*example*.py" | xargs rg -l "Qwen3ForCausalLM" 2>/dev/null | head -5

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Check if there's any documentation or comments about the dtype parameter
rg -n "dtype|torch_dtype" pymllm/backends/qualcomm/transformers/qwen3/runner.py -B 3 -A 3

Repository: UbiquitousLearning/mllm

Length of output: 341


🏁 Script executed:

# Let's search the entire codebase for how from_pretrained is used with dtype vs torch_dtype
rg -n "from_pretrained.*dtype" . -t py --max-count=20

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Check if there's a __init__.py or other configuration that might override from_pretrained
find pymllm/backends/qualcomm/transformers/qwen3 -name "*.py" -exec grep -l "from_pretrained" {} \;

Repository: UbiquitousLearning/mllm

Length of output: 182


🏁 Script executed:

# Let's check the transformers package version and see what parameters PreTrainedModel accepts
python3 << 'EOF'
try:
    import transformers
    print(f"Transformers version: {transformers.__version__}")
    
    # Check the from_pretrained method
    from transformers.modeling_utils import PreTrainedModel
    import inspect
    
    # Get the source file location
    source_file = inspect.getfile(PreTrainedModel)
    print(f"Source file: {source_file}")
    
    # Check the method signature (first few lines)
    source = inspect.getsource(PreTrainedModel.from_pretrained)
    lines = source.split('\n')[:30]
    for i, line in enumerate(lines):
        print(f"{i}: {line}")
except Exception as e:
    print(f"Error: {e}")
EOF

Repository: UbiquitousLearning/mllm

Length of output: 104


🏁 Script executed:

# Check if dtype is used as a kwarg anywhere in PreTrainedModel or similar classes
rg -n "kwargs.*dtype|dtype.*kwargs" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Let's look for any PreTrainedModel-related code in the qwen3 implementation
rg -n "PreTrainedModel\|from_pretrained" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -B 2 -A 5

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Let's check the train.py file to see if there are other from_pretrained calls
rg -n "from_pretrained" pymllm/backends/qualcomm/transformers/qwen3/train.py -B 2 -A 3

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Search the entire mllm/mllm-cli/pymllm codebase for from_pretrained usage patterns
rg -n "\.from_pretrained\(" pymllm/ -t py --max-count=20

Repository: UbiquitousLearning/mllm

Length of output: 575


🏁 Script executed:

# Check if Qwen3PreTrainedModel has any custom __init_subclass__ or similar that might handle dtype
cat -n pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py | sed -n '443,500p'

Repository: UbiquitousLearning/mllm

Length of output: 2587


🏁 Script executed:

# Check the entire modeling_qwen3.py to see if there's any custom from_pretrained override
grep -n "def from_pretrained\|def _from_pretrained" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Let's check the HuggingFace documentation or examples to confirm the correct parameter
# Search for torch_dtype usage in the codebase
rg -n "torch_dtype" pymllm/ -t py

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Check if there's any recent commit or documentation about this
git log --oneline pymllm/backends/qualcomm/transformers/qwen3/runner.py 2>/dev/null | head -10

Repository: UbiquitousLearning/mllm

Length of output: 128


🏁 Script executed:

# Let's check if there are any test files that run this code
find pymllm -type f \( -name "*test*.py" -o -name "*example*.py" \) | head -10

Repository: UbiquitousLearning/mllm

Length of output: 209


🏁 Script executed:

# Check the HuggingFace transformers library documentation or source
# Let's look for any **kwargs handling in PreTrainedModel
python3 << 'EOF'
# Check what parameters PreTrainedModel.from_pretrained accepts by looking at docs
import sys
try:
    # Try to import and inspect
    from transformers.modeling_utils import PreTrainedModel
    help(PreTrainedModel.from_pretrained)
except ImportError:
    print("transformers not installed in this environment")
    # But we can check online documentation patterns
    print("According to HuggingFace documentation, PreTrainedModel.from_pretrained uses:")
    print("- torch_dtype (not dtype) parameter")
    print("- This is standard across all transformer models")
EOF

Repository: UbiquitousLearning/mllm

Length of output: 276


🏁 Script executed:

# Let's check if there are any comments or documentation in the modeling file about this
rg -n "torch_dtype|dtype.*pretrained|pretrained.*dtype" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -B 2 -A 2

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Let's run a direct test to see if the parameter works
# Check if there are actual test executions or imports that would fail
rg -n "import Qwen3ForCausalLM\|from.*Qwen3ForCausalLM" pymllm/ -t py

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Check if there's any error handling around the from_pretrained call
cat -n pymllm/backends/qualcomm/transformers/qwen3/runner.py | sed -n '54,65p'

Repository: UbiquitousLearning/mllm

Length of output: 637


🏁 Script executed:

# Let's look for any **kwargs or parameter unpacking that might accept dtype
rg -n "kwargs\|\*\*" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py | grep -i "from_pretrained\|init" | head -20

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Let's check what the transformers library version expects
# Search online documentation or check if there's a requirements.txt
find . -name "requirements.txt" -o -name "setup.py" -o -name "pyproject.toml" | xargs cat 2>/dev/null | grep -i transformers

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Let's verify by checking the actual transformers code in site-packages if available
python3 << 'EOF'
import inspect
try:
    from transformers.modeling_utils import PreTrainedModel
    source = inspect.getsource(PreTrainedModel.from_pretrained)
    # Print first 100 lines to see the signature and parameter handling
    lines = source.split('\n')[:100]
    for i, line in enumerate(lines, 1):
        print(f"{i:3d}: {line}")
except Exception as e:
    print(f"Cannot inspect: {e}")
EOF

Repository: UbiquitousLearning/mllm

Length of output: 113


🏁 Script executed:

# Check the standard HuggingFace API - torch_dtype is the correct parameter
# Let's see if there's any documentation or comments in the code
rg -n "torch_dtype" . -t py --max-count=10

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Final check: Look for any other model loading patterns in the codebase to compare
rg -n "\.from_pretrained" pymllm/ -t py -B 1 -A 2 | head -30

Repository: UbiquitousLearning/mllm

Length of output: 1153


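Confirm the dtype keyword against the installed Transformers version.

Older HuggingFace Transformers releases document torch_dtype as the from_pretrained() keyword, while recent releases accept dtype and treat torch_dtype as a deprecated alias. The scripts above found no pinned transformers requirement and could not import the library, so which case applies here was not verified. Qwen3ForCausalLM does not appear to override from_pretrained (the grep for an override returned nothing), so the keyword is handled by PreTrainedModel directly. If the project targets an older release, change dtype=torch.float32 to torch_dtype=torch.float32; otherwise the current call can stay.

🤖 Prompt for AI Agents
In `@pymllm/backends/qualcomm/transformers/qwen3/runner.py` around lines 57 - 61,
the call Qwen3ForCausalLM.from_pretrained(model_path, attn_implementation="eager",
dtype=torch.float32) passes the dtype via the `dtype` keyword. Check which
Transformers version the project targets: if it predates support for `dtype`,
switch to `torch_dtype=torch.float32` so the dtype reaches the PreTrainedModel
loader; if it accepts `dtype`, leave the call unchanged.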

Comment on lines +41 to 45
# FIXME: Should disable or not.
m.disable_fake_quant()
m.calibrate(num_samples=args.num_samples, max_seq_length=args.max_length)
# m.compile()
m.enable_fake_quant()
m.infer(args.infer_text)

⚠️ Potential issue | 🟡 Minor

Resolve the fake‑quant toggle decision before release.

Line 41 explicitly flags uncertainty about disabling fake quant before calibration. Please decide the intended behavior (or gate it behind a flag) and remove the FIXME to avoid inconsistent calibration runs.

🤖 Prompt for AI Agents
In `@pymllm/backends/qualcomm/transformers/qwen3/train.py` around lines 41 - 45,
Decide and implement the intended fake-quant behavior by removing the FIXME and
either making the disable/enable calls deterministic or exposing them as a
CLI/config flag; e.g., add a boolean flag
(args.disable_fake_quant_before_calibration) and use it to conditionally call
m.disable_fake_quant() before m.calibrate(...) and m.enable_fake_quant() after,
ensuring the sequence around m.calibrate(...) and m.infer(...) is deterministic
and documented in the flag help text.
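A minimal sketch of gating the toggle behind a flag, using a stand-in model so it runs without the real runner. The flag name `--disable-fake-quant-before-calibration` and the `RecordingRunner` class are hypothetical; only the method names `disable_fake_quant`, `calibrate`, `enable_fake_quant`, and `infer` are taken from the snippet under review.

```python
import argparse


class RecordingRunner:
    """Hypothetical stand-in for the real runner; it only records call order."""

    def __init__(self):
        self.calls = []

    def disable_fake_quant(self):
        self.calls.append("disable_fake_quant")

    def enable_fake_quant(self):
        self.calls.append("enable_fake_quant")

    def calibrate(self, num_samples, max_seq_length):
        self.calls.append("calibrate")

    def infer(self, text):
        self.calls.append("infer")


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_samples", type=int, default=8)
    parser.add_argument("--max_length", type=int, default=128)
    parser.add_argument("--infer_text", default="hello")
    # Hypothetical flag: makes the calibration-time behavior an explicit choice.
    parser.add_argument(
        "--disable-fake-quant-before-calibration",
        action="store_true",
        help="Disable fake quant while calibrating; it is re-enabled before inference.",
    )
    return parser


def run(m, args):
    if args.disable_fake_quant_before_calibration:
        m.disable_fake_quant()
    m.calibrate(num_samples=args.num_samples, max_seq_length=args.max_length)
    # Always re-enable so inference starts from the same state with or without the flag.
    m.enable_fake_quant()
    m.infer(args.infer_text)
    return m.calls
```

With this shape the call order is fixed for each flag value, which is what makes calibration runs comparable.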

Comment on lines +50 to 53
# This line maybe error. we need use quantized weight!!! not embed_tokens.weight!!!
m.model.lm_head.weight = torch.nn.Parameter(
m.model.model.embed_tokens.weight.clone()
)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "train.py" -path "*/qwen3/*" -type f

Repository: UbiquitousLearning/mllm

Length of output: 121


🏁 Script executed:

git ls-files | head -20

Repository: UbiquitousLearning/mllm

Length of output: 712


🏁 Script executed:

wc -l ./pymllm/backends/qualcomm/transformers/qwen3/train.py

Repository: UbiquitousLearning/mllm

Length of output: 124


🏁 Script executed:

cat -n ./pymllm/backends/qualcomm/transformers/qwen3/train.py | head -100

Repository: UbiquitousLearning/mllm

Length of output: 2309


🏁 Script executed:

find . -name "runner.py" -path "*/qwen3/*" -type f

Repository: UbiquitousLearning/mllm

Length of output: 122


🏁 Script executed:

wc -l ./pymllm/backends/qualcomm/transformers/qwen3/runner.py

Repository: UbiquitousLearning/mllm

Length of output: 126


🏁 Script executed:

cat -n ./pymllm/backends/qualcomm/transformers/qwen3/runner.py

Repository: UbiquitousLearning/mllm

Length of output: 8282


🏁 Script executed:

find . -name "modeling_qwen3.py" -type f

Repository: UbiquitousLearning/mllm

Length of output: 130


🏁 Script executed:

wc -l ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py

Repository: UbiquitousLearning/mllm

Length of output: 134


🏁 Script executed:

rg -n "lm_head\|embed_tokens" ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -A 2 -B 2

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

cat -n ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py | grep -E "(lm_head|embed_tokens)" -A 2 -B 2

Repository: UbiquitousLearning/mllm

Length of output: 1543


🏁 Script executed:

cat -n ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py | tail -100

Repository: UbiquitousLearning/mllm

Length of output: 4486


🏁 Script executed:

rg -n "_tied_weights_keys\|tie.*weight" ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -B 2 -A 2

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

rg -n "class QLinearLPBQ" ./pymllm/backends/qualcomm/transformers/core/ -l

Repository: UbiquitousLearning/mllm

Length of output: 122


🏁 Script executed:

grep -n "class QLinearLPBQ" ./pymllm/backends/qualcomm/transformers/core/qlinear.py -A 30

Repository: UbiquitousLearning/mllm

Length of output: 1253


🏁 Script executed:

grep -n "convert_to_conv2d_deploy_hwio" ./pymllm/backends/qualcomm/transformers/core/qlinear.py -B 5 -A 20

Repository: UbiquitousLearning/mllm

Length of output: 2235


🏁 Script executed:

cat -n ./pymllm/backends/qualcomm/transformers/core/qlinear.py | sed -n '228,280p'

Repository: UbiquitousLearning/mllm

Length of output: 2315


🏁 Script executed:

grep -n "weight_quant\|freeze_weight\|deploy_mode" ./pymllm/backends/qualcomm/transformers/core/qlinear.py | head -40

Repository: UbiquitousLearning/mllm

Length of output: 1976


🏁 Script executed:

grep -n "freeze_weight\|_tied_weights" ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py

Repository: UbiquitousLearning/mllm

Length of output: 114


🏁 Script executed:

cat -n ./pymllm/backends/qualcomm/transformers/core/qlinear.py | sed -n '50,110p'

Repository: UbiquitousLearning/mllm

Length of output: 2561


🏁 Script executed:

grep -n "post_init\|tie_weights\|_init_weights" ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -A 5

Repository: UbiquitousLearning/mllm

Length of output: 444


🏁 Script executed:

grep -n "class Qwen3PreTrainedModel" ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -A 10

Repository: UbiquitousLearning/mllm

Length of output: 478


🏁 Script executed:

python3 << 'EOF'
# Let's understand the quantization flow more clearly
# Check if there's documentation or if we can trace the FakeQuantize behavior

import subprocess
result = subprocess.run(['grep', '-n', 'class.*DoubleQuantizer', './pymllm/backends/qualcomm/transformers/core/qlinear.py'], 
                       capture_output=True, text=True)
print(result.stdout)
result = subprocess.run(['grep', '-n', 'class PreTrainedModel\|tie_weights\|post_init', './pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py'], 
                       capture_output=True, text=True)
print(result.stdout)
EOF

Repository: UbiquitousLearning/mllm

Length of output: 164


🏁 Script executed:

find . -name "*.py" -path "*core*" | xargs grep -l "DoubleQuantizer" 2>/dev/null

Repository: UbiquitousLearning/mllm

Length of output: 122


🏁 Script executed:

cat -n ./pymllm/backends/qualcomm/transformers/core/qlinear.py | sed -n '161,225p'

Repository: UbiquitousLearning/mllm

Length of output: 2953


🏁 Script executed:

grep -n "self.embed_tokens\|embed_tokens.*=" ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py | head -20

Repository: UbiquitousLearning/mllm

Length of output: 173


🏁 Script executed:

cat -n ./pymllm/backends/qualcomm/transformers/qwen3/runner.py | sed -n '47,52p'

Repository: UbiquitousLearning/mllm

Length of output: 316


🏁 Script executed:

grep -n "def freeze_weight\|is_frozen" ./pymllm/backends/qualcomm/transformers/core/qlinear.py | head -15

Repository: UbiquitousLearning/mllm

Length of output: 292


🏁 Script executed:

cat -n ./pymllm/backends/qualcomm/transformers/qwen3/runner.py | sed -n '66,69p'

Repository: UbiquitousLearning/mllm

Length of output: 279


Quantized weights will not match the assigned lm_head.weight parameter.

Lines 50–53 assign unquantized embed_tokens.weight to lm_head.weight before m.convert() is called (line 54). Since lm_head is a QLinearLPBQ and its weights are frozen during initialization (runner.py line 68), the frozen weight_quant still contains the old quantized weights. When convert() runs, it extracts the stale quantized weights from weight_quant.weight_q rather than from the newly assigned parameter, resulting in a mismatch between the exported quantized weights and the parameter. Either re-freeze the quantized weights after assignment or perform the tying after convert() completes.

🤖 Prompt for AI Agents
In `@pymllm/backends/qualcomm/transformers/qwen3/train.py` around lines 50 - 53,
The assigned lm_head parameter uses unquantized embed_tokens weights before
m.convert(), causing QLinearLPBQ's frozen weight_quant to remain stale; fix by
moving the weight tying to after m.convert() (i.e., set m.model.lm_head.weight =
Parameter(m.model.model.embed_tokens.weight.clone()) only once convert() has
run) or, if tying must happen before convert(), update/re-freeze the QLinearLPBQ
internal quant state (weight_quant) after assignment so weight_quant.weight_q
reflects the new parameter; refer to m.model.lm_head.weight,
m.model.model.embed_tokens.weight, m.convert(), and the QLinearLPBQ frozen
weight_quant initialization to implement the change.
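The ordering problem can be illustrated with a toy model of a layer that caches its quantized weights when frozen. `FrozenLinear`, its rounding "quantizer", and the sample values are hypothetical stand-ins, not the real QLinearLPBQ; the sketch only assumes, as the comment above describes, that conversion reads a cached snapshot rather than the live parameter.

```python
class FrozenLinear:
    """Hypothetical stand-in for a layer that caches quantized weights on freeze."""

    def __init__(self, weight):
        self.weight = list(weight)
        self.weight_q = None  # cached snapshot, filled by freeze_weight()

    def freeze_weight(self):
        # Toy "quantization": round each weight to the nearest integer.
        self.weight_q = [round(w) for w in self.weight]

    def convert(self):
        # Export reads the cached snapshot, not the live parameter.
        return list(self.weight_q)


embed = [1.2, -3.7, 5.1]

# Order A: freeze, tie the weight, convert. The snapshot is stale.
head = FrozenLinear([0.0, 0.0, 0.0])
head.freeze_weight()
head.weight = list(embed)
stale = head.convert()

# Order B: freeze, tie the weight, re-freeze, convert. The snapshot is refreshed.
head = FrozenLinear([0.0, 0.0, 0.0])
head.freeze_weight()
head.weight = list(embed)
head.freeze_weight()
fresh = head.convert()
```

In order A the exported values still come from the pre-tying weights, which mirrors the mismatch described above; order B corresponds to the re-freeze option.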

… zeros, ones, specific values, arange, and random fills. Introduce a new fill-inl.hpp file for optimized implementations and update kernel dispatch to include these operations. Enhance CPUFillOp to utilize the new fill functions for better performance and maintainability.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@mllm/backends/cpu/kernels/common/fill-inl.hpp`:
- Around line 8-11: This file relies on std::is_trivial_v, std::is_same_v,
std::ceil, and std::min but doesn't include the matching standard headers; add
explicit includes for <type_traits>, <cmath>, and <algorithm> near the top
alongside the existing includes (after `#include` <cstring> / before
"mllm/core/DataTypes.hpp") so that uses in functions/template code referencing
std::is_trivial_v, std::is_same_v, std::ceil, and std::min are self-contained
and no longer depend on transitive headers.
- Around line 196-244: The function fill_arange_impl computes n (the number of
sequence elements) and may leave dst[n..count-1] uninitialized; make the
contract explicit and avoid silent gaps by: after computing n, add a runtime
check/assert that either n == count (e.g., assert(n == count) or throw) or, if
you want to tolerate count > n, explicitly fill the remainder via
fill_value_impl(dst + n, count - n, static_cast<T>(start + n * step)) so no tail
remains uninitialized; update callers/docs accordingly. Reference:
fill_arange_impl, variable n, and fill_value_impl.

In `@mllm/backends/cpu/kernels/common/kernel_dispatch.hpp`:
- Around line 10-12: The file is missing `#include` <type_traits> required by the
template fill_zeros_anytype which uses std::is_same_v; add the header to the top
of kernel_dispatch.hpp (alongside <cstring> and "mllm/core/DataTypes.hpp") so
std::is_same_v resolves properly and avoids relying on transitive includes
referenced in fill_zeros_anytype.
- Around line 105-131: The fallback in fill_zeros_anytype currently uses
std::memset for all unmatched types; restrict that to only trivial types by
checking std::is_trivial_v<T> and for non-trivial types perform an element-wise
zero assignment (e.g., for (size_t i=0;i<n;++i) dst[i] = T{}), mirroring the
safe approach used in fill_ones_anytype; update the else branch in
fill_zeros_anytype to first if constexpr (std::is_trivial_v<T>) use std::memset,
else use the element-wise loop to avoid undefined behavior on non-trivial types.

In `@mllm/backends/cpu/ops/FillOp.cpp`:
- Around line 31-34: The file uses std::memset in FillOp.cpp (inside the
x86/x86_64 branch) but does not include <cstring>, so add the missing include to
the top of the file; update FillOp.cpp to `#include` <cstring> (alongside other
headers) so std::memset is declared and the file is self-contained and
consistent with other CPU ops.

In `@pymllm/ffi/__init__.py`:
- Around line 291-293: The global `cuda: Device = cuda_()` is wrong because
`cuda_()` currently returns a DType; update the implementation of the `cuda_()`
factory so it returns a Device instance (matching what `cpu_()` and `qnn_()`
return) — e.g. call or wrap the appropriate FFI API that constructs a Device
(like `_ffi_api.cuda_()` or return Device(...) from the FFI result) and keep the
global `cuda` assignment and type annotation as `Device`; ensure the returned
object implements the same Device interface used by `device("cuda")` and
`.to(...)`.
♻️ Duplicate comments (1)
mllm/backends/qnn/aot/passes/PTQPass.cpp (1)

342-420: Handle mixed quantization spec types before comparing concat inputs.
If the reference comes from kSymPerTensor and a later input is kAsymPerTensor, ref_zero_point is never initialized and gets read, and scale comparisons mix incompatible spec types. Track the reference spec type and short‑circuit on mismatches.

🛠️ Suggested fix (track reference spec type)
       Tensor ref_scale;
       Tensor ref_zero_point;
       bool has_ref = false;
       std::string ref_input_name;
+      ir::linalg::QuantizationSpecType ref_spec_type = ir::linalg::QuantizationSpecType::kRaw;

       for (auto iii : inputs) {
         if (!iii->isa_<ir::tensor::TensorValue>()) continue;
         auto tv = iii->cast_<ir::tensor::TensorValue>();
         if (!tv->getAttr("quant_recipe")) continue;
         auto f_spec = tv->getAttr("quant_recipe")->cast_<ir::linalg::LinalgIRQuantizatonSpecAttr>();

         if (f_spec->spec_->type == ir::linalg::QuantizationSpecType::kAsymPerTensor) {
           auto this_spec = std::static_pointer_cast<ir::linalg::QuantizationSpecAsymPerTensor>(f_spec->spec_);
           if (!this_spec->solved) continue;

           if (!has_ref) {
             ref_scale = this_spec->scale;
             ref_zero_point = this_spec->zero_point;
             ref_input_name = tv->name();
+            ref_spec_type = ir::linalg::QuantizationSpecType::kAsymPerTensor;
             has_ref = true;
           } else {
+            if (ref_spec_type != ir::linalg::QuantizationSpecType::kAsymPerTensor) {
+              MLLM_ERROR("PTQPass: ConcatOp '{}' has mismatched quantization spec types between inputs.", op_name);
+              continue;
+            }
             // Check if scale and zero_point match
             ...
           }
         } else if (f_spec->spec_->type == ir::linalg::QuantizationSpecType::kSymPerTensor) {
           auto this_spec = std::static_pointer_cast<ir::linalg::QuantizationSpecSymPerTensor>(f_spec->spec_);
           if (!this_spec->solved) continue;

           if (!has_ref) {
             ref_scale = this_spec->scale;
             ref_input_name = tv->name();
+            ref_spec_type = ir::linalg::QuantizationSpecType::kSymPerTensor;
             has_ref = true;
           } else {
+            if (ref_spec_type != ir::linalg::QuantizationSpecType::kSymPerTensor) {
+              MLLM_ERROR("PTQPass: ConcatOp '{}' has mismatched quantization spec types between inputs.", op_name);
+              continue;
+            }
             // Check if scale matches
             ...
           }
         }
       }
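The control flow of the suggested fix can be sketched independently of the IR classes. The `Spec` dataclass and `check_concat_inputs` function below are hypothetical simplifications: the first solved input becomes the reference, a spec-type mismatch is reported and skipped, and only like-typed specs have their fields compared. Exact equality is used for brevity; a real check might compare with a tolerance.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Spec:
    """Hypothetical, simplified per-tensor quantization spec."""
    name: str
    kind: str  # "asym" or "sym"
    scale: float
    zero_point: Optional[int] = None  # only meaningful for "asym"
    solved: bool = True


def check_concat_inputs(specs: List[Spec]) -> List[str]:
    """Return human-readable problems; an empty list means the inputs agree."""
    problems = []
    ref = None
    for spec in specs:
        if not spec.solved:
            continue
        if ref is None:
            ref = spec  # first solved input becomes the reference
            continue
        if spec.kind != ref.kind:
            problems.append(
                f"{spec.name}: spec type {spec.kind} differs from {ref.kind} ({ref.name})"
            )
            continue  # never compare fields across incompatible spec types
        if spec.scale != ref.scale:
            problems.append(f"{spec.name}: scale differs from {ref.name}")
        if spec.kind == "asym" and spec.zero_point != ref.zero_point:
            problems.append(f"{spec.name}: zero_point differs from {ref.name}")
    return problems
```

Because the type check short-circuits, a symmetric reference is never asked for a zero point it does not have.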
🧹 Nitpick comments (2)
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp (1)

34-104: Add brief doc comments for new public fill APIs.

The new call_fill_* declarations and template wrappers are public entry points; please document parameters/returns/error behavior for clarity and maintainability. As per coding guidelines, please add API doc comments.

mllm/backends/cpu/ops/FillOp.cpp (1)

211-216: Align FP16 arange fallback with end/step semantics.

Lines 213–216 fill numel() elements without honoring end, while other dtypes compute a length derived from start/end/step. Consider matching the same length computation or asserting the expected size to avoid semantic divergence.
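As a rough illustration of the length check suggested here, the sketch below derives the element count from start/end/step and rejects a buffer whose size disagrees. It assumes half-open interval semantics (values stay strictly before `end`), and the names `arange_length` and `checked_arange` are hypothetical. Unlike the C++ implementation, which treats a zero step as a constant fill, this sketch simply raises.

```python
import math


def arange_length(start: float, end: float, step: float) -> int:
    """Count the values start, start + step, ... that stay strictly before end."""
    if step == 0:
        raise ValueError("step must be non-zero")
    n = math.ceil((end - start) / step)
    return max(n, 0)  # a step pointing away from end yields an empty range


def checked_arange(count: int, start: float, end: float, step: float):
    """Fill `count` slots, refusing to proceed when the range length disagrees."""
    n = arange_length(start, end, step)
    if n != count:
        raise ValueError(f"buffer holds {count} elements but the range yields {n}")
    return [start + i * step for i in range(count)]
```

Applying the same check in every dtype branch would keep the FP16 path from silently diverging from the others.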

Comment on lines +8 to +11
#include <hwy/highway.h>
#include <cstring>
#include "mllm/core/DataTypes.hpp"


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate and read the file
find . -name "fill-inl.hpp" -type f

Repository: UbiquitousLearning/mllm

Length of output: 114


🏁 Script executed:

# Read the file to verify the includes and function usage
cat -n mllm/backends/cpu/kernels/common/fill-inl.hpp | head -250

Repository: UbiquitousLearning/mllm

Length of output: 10794


Add explicit standard headers for type traits and math utilities.

The file uses std::is_trivial_v (line 35), std::is_same_v (line 219), std::ceil (line 208), and std::min (line 214), but doesn't explicitly include <type_traits>, <cmath>, or <algorithm>. While it may compile due to transitive includes from <hwy/highway.h>, add these headers to make the file self-contained and avoid fragile dependencies.

Proposed fix
 #include <hwy/highway.h>
 #include <cstring>
+#include <algorithm>
+#include <cmath>
+#include <type_traits>
 #include "mllm/core/DataTypes.hpp"
🧰 Tools
🪛 Clang (14.0.6)

[error] 8-8: 'hwy/highway.h' file not found

(clang-diagnostic-error)

🤖 Prompt for AI Agents
In `@mllm/backends/cpu/kernels/common/fill-inl.hpp` around lines 8 - 11, This file
relies on std::is_trivial_v, std::is_same_v, std::ceil, and std::min but doesn't
include the matching standard headers; add explicit includes for <type_traits>,
<cmath>, and <algorithm> near the top alongside the existing includes (after
`#include` <cstring> / before "mllm/core/DataTypes.hpp") so that uses in
functions/template code referencing std::is_trivial_v, std::is_same_v,
std::ceil, and std::min are self-contained and no longer depend on transitive
headers.

Comment on lines +196 to +244
template<typename T>
HWY_INLINE void fill_arange_impl(T* HWY_RESTRICT dst, size_t count, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
if (step == 0) {
fill_value_impl(dst, count, static_cast<T>(start));
return;
}

// Calculate the actual number of elements to fill
size_t n = 0;
if ((step > 0 && start < end) || (step < 0 && start > end)) {
mllm_fp32_t n_float = (end - start) / step;
if (n_float > 0) {
n = static_cast<size_t>(std::ceil(n_float));
if (step > 0) {
if (start + (n - 1) * step >= end) --n;
} else {
if (start + (n - 1) * step <= end) --n;
}
n = std::min(n, count);
}
}

// Use SIMD for float types where we can vectorize the computation
if constexpr (std::is_same_v<T, mllm_fp32_t>) {
const hn::ScalableTag<T> d;
const size_t N = hn::Lanes(d);

// Create increment vector: [0, 1, 2, 3, ...] * step
const hn::Vec<decltype(d)> step_vec = hn::Set(d, step);
const hn::Vec<decltype(d)> n_step_vec = hn::Set(d, step * static_cast<T>(N));

// Create base offsets [0, 1, 2, 3, ...]
hn::Vec<decltype(d)> base = hn::Iota(d, 0);
base = hn::Mul(base, step_vec);
hn::Vec<decltype(d)> current_start = hn::Add(hn::Set(d, start), base);

size_t idx = 0;
for (; idx + N <= n; idx += N) {
hn::StoreU(current_start, d, dst + idx);
current_start = hn::Add(current_start, n_step_vec);
}

// Handle remaining elements
for (; idx < n; ++idx) { dst[idx] = static_cast<T>(start + idx * step); }
} else {
// Scalar fallback for other types
for (size_t i = 0; i < n; ++i) { dst[i] = static_cast<T>(start + i * step); }
}
}

⚠️ Potential issue | 🟡 Minor

❓ Verification inconclusive

Script executed:

# First, let's look at the full file context around the function
cat -n mllm/backends/cpu/kernels/common/fill-inl.hpp | head -300 | tail -120

Repository: UbiquitousLearning/mllm
Exit code: 0

stdout:

   181	static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_value_u16(mllm_uint16_t* HWY_RESTRICT dst, size_t size, mllm_uint16_t value) {
   182	  fill_value_impl(dst, size, value);
   183	}
   184	
   185	static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_value_i8(mllm_int8_t* HWY_RESTRICT dst, size_t size, mllm_int8_t value) {
   186	  fill_value_impl(dst, size, value);
   187	}
   188	
   189	static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_value_u8(mllm_uint8_t* HWY_RESTRICT dst, size_t size, mllm_uint8_t value) {
   190	  fill_value_impl(dst, size, value);
   191	}
   192	
   193	//===----------------------------------------------------------------------===//
   194	// Fill Arange (start, end, step)
   195	//===----------------------------------------------------------------------===//
   196	template<typename T>
   197	HWY_INLINE void fill_arange_impl(T* HWY_RESTRICT dst, size_t count, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
   198	  if (step == 0) {
   199	    fill_value_impl(dst, count, static_cast<T>(start));
   200	    return;
   201	  }
   202	
   203	  // Calculate the actual number of elements to fill
   204	  size_t n = 0;
   205	  if ((step > 0 && start < end) || (step < 0 && start > end)) {
   206	    mllm_fp32_t n_float = (end - start) / step;
   207	    if (n_float > 0) {
   208	      n = static_cast<size_t>(std::ceil(n_float));
   209	      if (step > 0) {
   210	        if (start + (n - 1) * step >= end) --n;
   211	      } else {
   212	        if (start + (n - 1) * step <= end) --n;
   213	      }
   214	      n = std::min(n, count);
   215	    }
   216	  }
   217	
   218	  // Use SIMD for float types where we can vectorize the computation
   219	  if constexpr (std::is_same_v<T, mllm_fp32_t>) {
   220	    const hn::ScalableTag<T> d;
   221	    const size_t N = hn::Lanes(d);
   222	
   223	    // Create increment vector: [0, 1, 2, 3, ...] * step
   224	    const hn::Vec<decltype(d)> step_vec = hn::Set(d, step);
   225	    const hn::Vec<decltype(d)> n_step_vec = hn::Set(d, step * static_cast<T>(N));
   226	
   227	    // Create base offsets [0, 1, 2, 3, ...]
   228	    hn::Vec<decltype(d)> base = hn::Iota(d, 0);
   229	    base = hn::Mul(base, step_vec);
   230	    hn::Vec<decltype(d)> current_start = hn::Add(hn::Set(d, start), base);
   231	
   232	    size_t idx = 0;
   233	    for (; idx + N <= n; idx += N) {
   234	      hn::StoreU(current_start, d, dst + idx);
   235	      current_start = hn::Add(current_start, n_step_vec);
   236	    }
   237	
   238	    // Handle remaining elements
   239	    for (; idx < n; ++idx) { dst[idx] = static_cast<T>(start + idx * step); }
   240	  } else {
   241	    // Scalar fallback for other types
   242	    for (size_t i = 0; i < n; ++i) { dst[i] = static_cast<T>(start + i * step); }
   243	  }
   244	}
   245	
   246	static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_fp32(mllm_fp32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
   247	                                                           mllm_fp32_t end, mllm_fp32_t step) {
   248	  fill_arange_impl(dst, size, start, end, step);
   249	}
   250	
   251	static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i32(mllm_int32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
   252	                                                          mllm_fp32_t end, mllm_fp32_t step) {
   253	  fill_arange_impl(dst, size, start, end, step);
   254	}
   255	
   256	static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u32(mllm_uint32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
   257	                                                          mllm_fp32_t end, mllm_fp32_t step) {
   258	  fill_arange_impl(dst, size, start, end, step);
   259	}
   260	
   261	static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i64(mllm_int64_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
   262	                                                          mllm_fp32_t end, mllm_fp32_t step) {
   263	  fill_arange_impl(dst, size, start, end, step);
   264	}
   265	
   266	static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u64(mllm_uint64_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
   267	                                                          mllm_fp32_t end, mllm_fp32_t step) {
   268	  fill_arange_impl(dst, size, start, end, step);
   269	}
   270	
   271	static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i16(mllm_int16_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
   272	                                                          mllm_fp32_t end, mllm_fp32_t step) {
   273	  fill_arange_impl(dst, size, start, end, step);
   274	}
   275	
   276	static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u16(mllm_uint16_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
   277	                                                          mllm_fp32_t end, mllm_fp32_t step) {
   278	  fill_arange_impl(dst, size, start, end, step);
   279	}
   280	
   281	static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i8(mllm_int8_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
   282	                                                         mllm_fp32_t end, mllm_fp32_t step) {
   283	  fill_arange_impl(dst, size, start, end, step);
   284	}
   285	
   286	static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u8(mllm_uint8_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
   287	                                                         mllm_fp32_t end, mllm_fp32_t step) {
   288	  fill_arange_impl(dst, size, start, end, step);
   289	}
   290	
   291	//===----------------------------------------------------------------------===//
   292	// Fill Random (using LCG random number generator)
   293	//===----------------------------------------------------------------------===//
   294	template<typename T>
   295	HWY_INLINE void fill_random_impl(T* HWY_RESTRICT dst, size_t count, mllm_fp32_t start, mllm_fp32_t end, uint64_t seed) {
   296	  const uint64_t multiplier = 1103515245ULL;
   297	  const uint64_t increment = 12345ULL;
   298	  const uint64_t modulus = 1ULL << 31;  // 2^31
   299	  const mllm_fp32_t range = end - start;
   300	

Script executed:

# Now search for call sites as suggested in the review comment
rg -n "fill_arange_anytype|call_fill_arange_|fill_arange_impl" --type=cpp --type=hpp -A 3 -B 3

Repository: UbiquitousLearning/mllm
Exit code: 2

stderr:

unrecognized file type: hpp

Script executed:

# Also search for fill_arange patterns more broadly
rg -n "fill_arange" --type=cpp --type=hpp -A 2 -B 2

Repository: UbiquitousLearning/mllm
Exit code: 2

stderr:

unrecognized file type: hpp

Script executed:

# Search for call sites of fill_arange functions
rg -n "fill_arange" -A 2 -B 2

Repository: UbiquitousLearning/mllm
Exit code: 0

stdout:

mllm/backends/opencl/kernels/fill_cl.cpp-5-                   " dst[index]=value;\n"
mllm/backends/opencl/kernels/fill_cl.cpp-6-                   "}\n"
mllm/backends/opencl/kernels/fill_cl.cpp:7:                   "__kernel void fill_arange_fp32(float start,float step,__global float *dst) {\n"
mllm/backends/opencl/kernels/fill_cl.cpp-8-                   " size_t index=get_global_id(0);\n"
mllm/backends/opencl/kernels/fill_cl.cpp-9-                   " dst[index]=start+(float)index*step;\n"
--
mllm/backends/opencl/kernels/fill.cl-4-}
mllm/backends/opencl/kernels/fill.cl-5-
mllm/backends/opencl/kernels/fill.cl:6:__kernel void fill_arange_fp32(float start, float step, __global float *dst) {
mllm/backends/opencl/kernels/fill.cl-7-  size_t index = get_global_id(0);
mllm/backends/opencl/kernels/fill.cl-8-  dst[index] = start + (float)index * step;
--
mllm/backends/opencl/ops/FillOp.cpp-12-
mllm/backends/opencl/ops/FillOp.cpp-13-  kernel_fp32_buffer_ = runtime->buildKernel("fill", "fill_fp32", {});
mllm/backends/opencl/ops/FillOp.cpp:14:  kernel_arange_fp32_buffer_ = runtime->buildKernel("fill", "fill_arange_fp32", {});
mllm/backends/opencl/ops/FillOp.cpp-15-}
mllm/backends/opencl/ops/FillOp.cpp-16-
--
mllm/backends/opencl/ops/FillOp.cpp-68-                                                              cl::NDRange(global_size), cl::NullRange);
mllm/backends/opencl/ops/FillOp.cpp-69-    if (error != CL_SUCCESS) {
mllm/backends/opencl/ops/FillOp.cpp:70:      MLLM_ERROR_EXIT(ExitCode::kOpenCLError, "Failed to execute fill_arange kernel, error code: {}", error);
mllm/backends/opencl/ops/FillOp.cpp-71-    }
mllm/backends/opencl/ops/FillOp.cpp-72-  } else {
--
mllm/backends/cpu/ops/FillOp.cpp-203-        case kFloat32: {
mllm/backends/cpu/ops/FillOp.cpp-204-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:205:          common::fill_arange_anytype(dst.ptr<mllm_fp32_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-206-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:207:          arm::fill_arange(dst.ptr<mllm_fp32_t>(), dst.numel(), options_.start, options_.end, options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-208-#endif
mllm/backends/cpu/ops/FillOp.cpp-209-          break;
--
mllm/backends/cpu/ops/FillOp.cpp-215-          for (size_t i = 0; i < dst.numel(); ++i) { ptr[i] = static_cast<mllm_fp16_t>(options_.start + i * options_.step); }
mllm/backends/cpu/ops/FillOp.cpp-216-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:217:          arm::fill_arange_fp16(dst.ptr<mllm_fp16_t>(), dst.numel(), options_.start, options_.end, options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-218-#endif
mllm/backends/cpu/ops/FillOp.cpp-219-          break;
--
mllm/backends/cpu/ops/FillOp.cpp-221-        case kInt64: {
mllm/backends/cpu/ops/FillOp.cpp-222-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:223:          common::fill_arange_anytype(dst.ptr<mllm_int64_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-224-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:225:          arm::fill_arange_anytype<mllm_int64_t>(dst.ptr<mllm_int64_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-226-                                                 options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-227-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-230-        case kInt32: {
mllm/backends/cpu/ops/FillOp.cpp-231-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:232:          common::fill_arange_anytype(dst.ptr<mllm_int32_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-233-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:234:          arm::fill_arange_anytype<mllm_int32_t>(dst.ptr<mllm_int32_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-235-                                                 options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-236-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-239-        case kInt16: {
mllm/backends/cpu/ops/FillOp.cpp-240-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:241:          common::fill_arange_anytype(dst.ptr<mllm_int16_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-242-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:243:          arm::fill_arange_anytype<mllm_int16_t>(dst.ptr<mllm_int16_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-244-                                                 options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-245-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-248-        case kInt8: {
mllm/backends/cpu/ops/FillOp.cpp-249-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:250:          common::fill_arange_anytype(dst.ptr<mllm_int8_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-251-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:252:          arm::fill_arange_anytype<mllm_int8_t>(dst.ptr<mllm_int8_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-253-                                                options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-254-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-257-        case kUInt64: {
mllm/backends/cpu/ops/FillOp.cpp-258-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:259:          common::fill_arange_anytype(dst.ptr<mllm_uint64_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-260-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:261:          arm::fill_arange_anytype<mllm_uint64_t>(dst.ptr<mllm_uint64_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-262-                                                  options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-263-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-266-        case kUInt32: {
mllm/backends/cpu/ops/FillOp.cpp-267-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:268:          common::fill_arange_anytype(dst.ptr<mllm_uint32_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-269-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:270:          arm::fill_arange_anytype<mllm_uint32_t>(dst.ptr<mllm_uint32_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-271-                                                  options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-272-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-275-        case kUInt16: {
mllm/backends/cpu/ops/FillOp.cpp-276-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:277:          common::fill_arange_anytype(dst.ptr<mllm_uint16_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-278-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:279:          arm::fill_arange_anytype<mllm_uint16_t>(dst.ptr<mllm_uint16_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-280-                                                  options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-281-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-284-        case kUInt8: {
mllm/backends/cpu/ops/FillOp.cpp-285-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:286:          common::fill_arange_anytype(dst.ptr<mllm_uint8_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-287-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:288:          arm::fill_arange_anytype<mllm_uint8_t>(dst.ptr<mllm_uint8_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-289-                                                 options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-290-#endif
--
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-174-// Fill Arange
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-175-//===----------------------------------------------------------------------===//
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:176:HWY_EXPORT(fill_arange_fp32);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:177:HWY_EXPORT(fill_arange_i32);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:178:HWY_EXPORT(fill_arange_u32);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:179:HWY_EXPORT(fill_arange_i64);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:180:HWY_EXPORT(fill_arange_u64);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:181:HWY_EXPORT(fill_arange_i16);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:182:HWY_EXPORT(fill_arange_u16);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:183:HWY_EXPORT(fill_arange_i8);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:184:HWY_EXPORT(fill_arange_u8);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-185-
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:186:HWY_DLLEXPORT void call_fill_arange_fp32(mllm_fp32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:187:  HWY_DYNAMIC_DISPATCH(fill_arange_fp32)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-188-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:189:HWY_DLLEXPORT void call_fill_arange_i32(mllm_int32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:190:  HWY_DYNAMIC_DISPATCH(fill_arange_i32)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-191-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:192:HWY_DLLEXPORT void call_fill_arange_u32(mllm_uint32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:193:  HWY_DYNAMIC_DISPATCH(fill_arange_u32)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-194-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:195:HWY_DLLEXPORT void call_fill_arange_i64(mllm_int64_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:196:  HWY_DYNAMIC_DISPATCH(fill_arange_i64)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-197-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:198:HWY_DLLEXPORT void call_fill_arange_u64(mllm_uint64_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:199:  HWY_DYNAMIC_DISPATCH(fill_arange_u64)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-200-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:201:HWY_DLLEXPORT void call_fill_arange_i16(mllm_int16_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:202:  HWY_DYNAMIC_DISPATCH(fill_arange_i16)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-203-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:204:HWY_DLLEXPORT void call_fill_arange_u16(mllm_uint16_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:205:  HWY_DYNAMIC_DISPATCH(fill_arange_u16)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-206-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:207:HWY_DLLEXPORT void call_fill_arange_i8(mllm_int8_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:208:  HWY_DYNAMIC_DISPATCH(fill_arange_i8)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-209-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:210:HWY_DLLEXPORT void call_fill_arange_u8(mllm_uint8_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:211:  HWY_DYNAMIC_DISPATCH(fill_arange_u8)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-212-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-213-
--
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-77-// Fill Arange
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-78-//===----------------------------------------------------------------------===//
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:79:HWY_DLLEXPORT void call_fill_arange_fp32(mllm_fp32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:80:HWY_DLLEXPORT void call_fill_arange_i32(mllm_int32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:81:HWY_DLLEXPORT void call_fill_arange_u32(mllm_uint32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:82:HWY_DLLEXPORT void call_fill_arange_i64(mllm_int64_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:83:HWY_DLLEXPORT void call_fill_arange_u64(mllm_uint64_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:84:HWY_DLLEXPORT void call_fill_arange_i16(mllm_int16_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:85:HWY_DLLEXPORT void call_fill_arange_u16(mllm_uint16_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:86:HWY_DLLEXPORT void call_fill_arange_i8(mllm_int8_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:87:HWY_DLLEXPORT void call_fill_arange_u8(mllm_uint8_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-88-
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-89-//===----------------------------------------------------------------------===//
--
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-188-
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-189-template<typename T>
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:190:inline void fill_arange_anytype(T* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-191-  if constexpr (std::is_same_v<T, mllm_fp32_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:192:    call_fill_arange_fp32(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-193-  } else if constexpr (std::is_same_v<T, mllm_int32_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:194:    call_fill_arange_i32(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-195-  } else if constexpr (std::is_same_v<T, mllm_uint32_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:196:    call_fill_arange_u32(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-197-  } else if constexpr (std::is_same_v<T, mllm_int64_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:198:    call_fill_arange_i64(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-199-  } else if constexpr (std::is_same_v<T, mllm_uint64_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:200:    call_fill_arange_u64(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-201-  } else if constexpr (std::is_same_v<T, mllm_int16_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:202:    call_fill_arange_i16(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-203-  } else if constexpr (std::is_same_v<T, mllm_uint16_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:204:    call_fill_arange_u16(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-205-  } else if constexpr (std::is_same_v<T, mllm_int8_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:206:    call_fill_arange_i8(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-207-  } else if constexpr (std::is_same_v<T, mllm_uint8_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:208:    call_fill_arange_u8(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-209-  } else {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-210-    // Fallback
--
mllm/backends/cpu/kernels/common/fill-inl.hpp-195-//===----------------------------------------------------------------------===//
mllm/backends/cpu/kernels/common/fill-inl.hpp-196-template<typename T>
mllm/backends/cpu/kernels/common/fill-inl.hpp:197:HWY_INLINE void fill_arange_impl(T* HWY_RESTRICT dst, size_t count, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp-198-  if (step == 0) {
mllm/backends/cpu/kernels/common/fill-inl.hpp-199-    fill_value_impl(dst, count, static_cast<T>(start));
--
mllm/backends/cpu/kernels/common/fill-inl.hpp-244-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-245-
mllm/backends/cpu/kernels/common/fill-inl.hpp:246:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_fp32(mllm_fp32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-247-                                                           mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:248:  fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-249-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-250-
mllm/backends/cpu/kernels/common/fill-inl.hpp:251:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i32(mllm_int32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-252-                                                          mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:253:  fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-254-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-255-
mllm/backends/cpu/kernels/common/fill-inl.hpp:256:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u32(mllm_uint32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-257-                                                          mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:258:  fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-259-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-260-
mllm/backends/cpu/kernels/common/fill-inl.hpp:261:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i64(mllm_int64_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-262-                                                          mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:263:  fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-264-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-265-
mllm/backends/cpu/kernels/common/fill-inl.hpp:266:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u64(mllm_uint64_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-267-                                                          mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:268:  fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-269-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-270-
mllm/backends/cpu/kernels/common/fill-inl.hpp:271:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i16(mllm_int16_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-272-                                                          mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:273:  fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-274-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-275-
mllm/backends/cpu/kernels/common/fill-inl.hpp:276:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u16(mllm_uint16_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-277-                                                          mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:278:  fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-279-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-280-
mllm/backends/cpu/kernels/common/fill-inl.hpp:281:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i8(mllm_int8_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-282-                                                         mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:283:  fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-284-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-285-
mllm/backends/cpu/kernels/common/fill-inl.hpp:286:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u8(mllm_uint8_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-287-                                                         mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:288:  fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-289-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-290-
--
mllm/backends/cpu/kernels/arm/fill.cpp-52-}
mllm/backends/cpu/kernels/arm/fill.cpp-53-
mllm/backends/cpu/kernels/arm/fill.cpp:54:void fill_arange(mllm_fp32_t* __restrict dst, size_t size, float start, float end, float step, int thread_count) {
mllm/backends/cpu/kernels/arm/fill.cpp-55-  constexpr size_t vec_size = 4;  // 4 floats in NEON
mllm/backends/cpu/kernels/arm/fill.cpp-56-
--
mllm/backends/cpu/kernels/arm/fill.cpp-129-}
mllm/backends/cpu/kernels/arm/fill.cpp-130-
mllm/backends/cpu/kernels/arm/fill.cpp:131:void fill_arange_fp16(mllm_fp16_t* __restrict dst, size_t size, float start, float end, float step, int thread_count) {
mllm/backends/cpu/kernels/arm/fill.cpp-132-  constexpr size_t vec_size = 8;  // 8 float16_t in NEON
mllm/backends/cpu/kernels/arm/fill.cpp-133-
--
mllm/backends/cpu/kernels/arm/fill.hpp-17-void fill_specific_value(mllm_fp32_t* __restrict dst, size_t size, mllm_fp32_t value, int thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-18-
mllm/backends/cpu/kernels/arm/fill.hpp:19:void fill_arange(mllm_fp32_t* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step,
mllm/backends/cpu/kernels/arm/fill.hpp-20-                 int thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-21-
--
mllm/backends/cpu/kernels/arm/fill.hpp-29-void fill_specific_value_fp16(mllm_fp16_t* __restrict dst, size_t size, mllm_fp32_t value, int thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-30-
mllm/backends/cpu/kernels/arm/fill.hpp:31:void fill_arange_fp16(mllm_fp16_t* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step,
mllm/backends/cpu/kernels/arm/fill.hpp-32-                      int thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-33-
--
mllm/backends/cpu/kernels/arm/fill.hpp-94-
mllm/backends/cpu/kernels/arm/fill.hpp-95-template<typename T>
mllm/backends/cpu/kernels/arm/fill.hpp:96:inline void fill_arange_anytype(T* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step,
mllm/backends/cpu/kernels/arm/fill.hpp-97-                                int thread_count) {
mllm/backends/cpu/kernels/arm/fill.hpp-98-  if (step == 0) {
--
mllm/backends/cpu/kernels/arm/fill.hpp-119-
mllm/backends/cpu/kernels/arm/fill.hpp-120-template<>
mllm/backends/cpu/kernels/arm/fill.hpp:121:inline void fill_arange_anytype<mllm_fp32_t>(mllm_fp32_t* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end,
mllm/backends/cpu/kernels/arm/fill.hpp-122-                                             mllm_fp32_t step, int thread_count) {
mllm/backends/cpu/kernels/arm/fill.hpp:123:  fill_arange(dst, size, start, end, step, thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-124-}
mllm/backends/cpu/kernels/arm/fill.hpp-125-
mllm/backends/cpu/kernels/arm/fill.hpp-126-template<>
mllm/backends/cpu/kernels/arm/fill.hpp:127:inline void fill_arange_anytype<mllm_fp16_t>(mllm_fp16_t* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end,
mllm/backends/cpu/kernels/arm/fill.hpp-128-                                             mllm_fp32_t step, int thread_count) {
mllm/backends/cpu/kernels/arm/fill.hpp:129:  fill_arange_fp16(dst, size, start, end, step, thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-130-}
mllm/backends/cpu/kernels/arm/fill.hpp-131-
--
mllm/backends/cpu/kernels/x86/fill.cpp-109-}
mllm/backends/cpu/kernels/x86/fill.cpp-110-
mllm/backends/cpu/kernels/x86/fill.cpp:111:void fill_arange(mllm_fp32_t* __restrict dst, size_t size, float start, float end, float step, int thread_count) {
mllm/backends/cpu/kernels/x86/fill.cpp-112-#if defined(MLLM_HOST_FEATURE_AVX512F)
mllm/backends/cpu/kernels/x86/fill.cpp-113-  constexpr size_t vec_size = 16;
--
mllm/backends/cpu/kernels/x86/fill.hpp-17-void fill_specific_value(mllm_fp32_t* __restrict dst, size_t size, float value, int thread_count);
mllm/backends/cpu/kernels/x86/fill.hpp-18-
mllm/backends/cpu/kernels/x86/fill.hpp:19:void fill_arange(mllm_fp32_t* __restrict dst, size_t size, float start, float end, float step, int thread_count);
mllm/backends/cpu/kernels/x86/fill.hpp-20-
mllm/backends/cpu/kernels/x86/fill.hpp-21-void fill_random(mllm_fp32_t* __restrict dst, size_t size, float start, float end, uint64_t seed, int thread_count);
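
For readers skimming the matches above, the arange kernels appear to share one per-element rule: the OpenCL kernel writes `dst[index] = start + (float)index * step`, and the x86 fp16 path in `FillOp.cpp` does the same in a scalar loop. A minimal scalar sketch of that rule follows. `fill_arange_reference` and `arange_vector` are hypothetical names used only for illustration and are not part of mllm. The sketch also drops the `end` parameter, on the assumption that the element count `n` already fixes the range.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar reference, not the repository's SIMD kernels.
// Writes dst[i] = start + i * step for i in [0, n), cast to T.
// With step == 0 every element equals start, which matches the
// fill-value shortcut visible in fill_arange_impl.
template <typename T>
void fill_arange_reference(T* dst, std::size_t n, float start, float step) {
  assert(dst != nullptr || n == 0);
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = static_cast<T>(start + static_cast<float>(i) * step);
  }
}

// Convenience wrapper (also hypothetical) that owns its storage.
template <typename T>
std::vector<T> arange_vector(std::size_t n, float start, float step) {
  std::vector<T> out(n);
  fill_arange_reference(out.data(), out.size(), start, step);
  return out;
}
```

A reference like this could serve as a comparison baseline in a unit test for the per-dtype `call_fill_arange_*` entry points, since all of them take `float` start/step values and differ only in the destination element type.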

Script executed:

# Look for where these functions are used
find . -name "*.cpp" -o -name "*.hpp" -o -name "*.h" | head -20

Repository: UbiquitousLearning/mllm
Exit code: 0

stdout:

./algorithms/fancy_algorithm/main.cpp
./algorithms/fancy_algorithm/models/FooNet.hpp
./algorithms/fancy_algorithm/passes/FooPass.hpp
./algorithms/lazy_vlm/HKVCache.cpp
./algorithms/lazy_vlm/HKVCache.hpp
./algorithms/lazy_vlm/HKVCacheFast.cpp
./algorithms/lazy_vlm/HKVCacheFast.hpp
./algorithms/lazy_vlm/LazyVLMQwen2VL.cpp
./algorithms/lazy_vlm/LazyVLMQwen2VLFast.cpp
./algorithms/lazy_vlm/LazyVLMQwen2_5VL.cpp
./algorithms/lazy_vlm/LazyVLMQwen2_5VLFast.cpp
./algorithms/lazy_vlm/models/qwen2_5vl/lazy_vlm_cfg.hpp
./algorithms/lazy_vlm/models/qwen2_5vl/lazy_vlm_cfg_fast.hpp
./algorithms/lazy_vlm/models/qwen2_5vl/modeling_qwen2_5vl.hpp
./algorithms/lazy_vlm/models/qwen2_5vl/modeling_qwen2_5vl_fast.hpp
./algorithms/lazy_vlm/models/qwen2vl/lazy_vlm_cfg.hpp
./algorithms/lazy_vlm/models/qwen2vl/lazy_vlm_cfg_fast.hpp
./algorithms/lazy_vlm/models/qwen2vl/modeling_qwen2vl.hpp
./algorithms/lazy_vlm/models/qwen2vl/modeling_qwen2vl_fast.hpp
./benchmarks/cpu/arm_mllm_blas_sgemm.cpp

Script executed:

# Search more broadly for arange usage
rg -n "arange" -A 2 -B 2

Repository: UbiquitousLearning/mllm
Exit code: 0

stdout:

tests/cpu/CausalMaskOpTest.hpp-17-    using namespace mllm;  // NOLINT
tests/cpu/CausalMaskOpTest.hpp-18-    const int64_t total = static_cast<int64_t>(B) * H * S * D;
tests/cpu/CausalMaskOpTest.hpp:19:    auto input = Tensor::arange(0, static_cast<float>(total), 1, kFloat32, kCPU).view({B, H, S, D});
tests/cpu/CausalMaskOpTest.hpp-20-    auto output = mask_(input);
tests/cpu/CausalMaskOpTest.hpp-21-    auto expected = buildExpectedTensor(input);
--
tests/cpu/PagedAttnTest.hpp-61-
tests/cpu/PagedAttnTest.hpp-62-    // Build Index
tests/cpu/PagedAttnTest.hpp:63:    auto index = mllm::Tensor::arange(0, S_KV, 1, mllm::kInt32, mllm::kCPU);
tests/cpu/PagedAttnTest.hpp-64-    auto mask = mllm::Tensor::zeros({S_Q, S_KV}, mllm::kFloat32, mllm::kCPU);
tests/cpu/PagedAttnTest.hpp-65-    auto mask_data = mask.ptr<mllm::mllm_fp32_t>();
--
pymllm/__init__.py-44-    zeros,
pymllm/__init__.py-45-    ones,
pymllm/__init__.py:46:    arange,
pymllm/__init__.py-47-    random,
pymllm/__init__.py-48-)
--
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py-589-                past_key_values.get_seq_length() if past_key_values is not None else 0
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py-590-            )
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py:591:            cache_position = torch.arange(
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py-592-                past_seen_tokens,
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py-593-                past_seen_tokens + inputs_embeds.shape[1],
--
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py-624-            mllm_qualcomm_max_length = kwargs.get("mllm_qualcomm_max_length", None)
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py-625-            assert mllm_qualcomm_max_length is not None
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py:626:            max_position_ids = torch.arange(
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py-627-                0,
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py-628-                mllm_qualcomm_max_length,
--
pymllm/ffi/__init__.py-329-
pymllm/ffi/__init__.py-330-
pymllm/ffi/__init__.py:331:def arange(
pymllm/ffi/__init__.py-332-    start: float,
pymllm/ffi/__init__.py-333-    end: float,
--
pymllm/ffi/__init__.py-338-    if isinstance(device_type, str):
pymllm/ffi/__init__.py-339-        device_type = device(device_type)
pymllm/ffi/__init__.py:340:    return _ffi_api.arange(start, end, step, dtype, device_type)
pymllm/ffi/__init__.py-341-
pymllm/ffi/__init__.py-342-
--
mllm/core/Tensor.cpp-130-}
mllm/core/Tensor.cpp-131-
mllm/core/Tensor.cpp:132:Tensor Tensor::arange(float start, float end, float step, DataTypes dtype, DeviceTypes device) {
mllm/core/Tensor.cpp-133-  auto shape = std::vector<int32_t>{static_cast<int32_t>((end - start) / step)};
mllm/core/Tensor.cpp-134-  auto i = Tensor::empty(shape, dtype, device).alloc();
--
mllm/core/Tensor.hpp-257-   * @return Tensor
mllm/core/Tensor.hpp-258-   */
mllm/core/Tensor.hpp:259:  static Tensor arange(float start, float end, float step, DataTypes dtype = kFloat32, DeviceTypes device = kCPU);
mllm/core/Tensor.hpp-260-
mllm/core/Tensor.hpp-261-  /**
--
mllm/ffi/Extension.cc-142-                          return ::mllm::ffi::Tensor(tensor);
mllm/ffi/Extension.cc-143-                        });
mllm/ffi/Extension.cc:144:  refl::GlobalDef().def("mllm.arange",
mllm/ffi/Extension.cc-145-                        [](float start, float end, float step, const ::mllm::ffi::DType& dtype,
mllm/ffi/Extension.cc-146-                           const ::mllm::ffi::Device& device) -> mllm::ffi::Tensor {
mllm/ffi/Extension.cc:147:                          auto tensor = ::mllm::Tensor::arange(start, end, step, dtype->dtype, device->device);
mllm/ffi/Extension.cc-148-                          return ::mllm::ffi::Tensor(tensor);
mllm/ffi/Extension.cc-149-                        });
--
mllm/models/minicpm_o2_6/modeling_siglip.hpp-56-      auto max_nb_patches_w = max_im_w / patch_size_;
mllm/models/minicpm_o2_6/modeling_siglip.hpp-57-
mllm/models/minicpm_o2_6/modeling_siglip.hpp:58:      // Create boundaries like torch.arange(1 / self.num_patches_per_side, 1.0, 1 / self.num_patches_per_side)
mllm/models/minicpm_o2_6/modeling_siglip.hpp-59-      std::vector<float> boundaries;
mllm/models/minicpm_o2_6/modeling_siglip.hpp-60-      float step = 1.0f / static_cast<float>(num_patches_per_side_);
--
mllm/models/minicpm_o2_6/modeling_siglip.hpp-79-        }
mllm/models/minicpm_o2_6/modeling_siglip.hpp-80-
mllm/models/minicpm_o2_6/modeling_siglip.hpp:81:        // Create fractional coordinates like torch.arange(0, 1 - 1e-6, 1 / nb_patches_h/w)
mllm/models/minicpm_o2_6/modeling_siglip.hpp-82-        std::vector<float> fractional_coords_h;
mllm/models/minicpm_o2_6/modeling_siglip.hpp-83-        std::vector<float> fractional_coords_w;
--
mllm/models/minicpm_o2_6/modeling_siglip.hpp-146-    } else {
mllm/models/minicpm_o2_6/modeling_siglip.hpp-147-      auto seq_len = embeddings.shape()[1];
mllm/models/minicpm_o2_6/modeling_siglip.hpp:148:      auto position_ids = Tensor::arange(0, seq_len, kInt64).view({1, seq_len});
mllm/models/minicpm_o2_6/modeling_siglip.hpp-149-      auto pos_embeddings = position_embedding_(position_ids);
mllm/models/minicpm_o2_6/modeling_siglip.hpp-150-      embeddings = embeddings + pos_embeddings;
--
mllm/models/minicpm_o2_6/modeling_vector_quantize.hpp-150-   */
mllm/models/minicpm_o2_6/modeling_vector_quantize.hpp-151-  Tensor createImplicitCodebook() {
mllm/models/minicpm_o2_6/modeling_vector_quantize.hpp:152:    auto indices = Tensor::arange(0, static_cast<float>(codebook_size_), 1, kFloat32, kCPU);
mllm/models/minicpm_o2_6/modeling_vector_quantize.hpp-153-    return indicesToCodes(indices);
mllm/models/minicpm_o2_6/modeling_vector_quantize.hpp-154-  }
--
mllm/models/deepseek_ocr/deepencoder.hpp-94-
mllm/models/deepseek_ocr/deepencoder.hpp-95-    // Register a buffer
mllm/models/deepseek_ocr/deepencoder.hpp:96:    registerBuffer("position_ids", Tensor::arange(0, num_positions_, 1, kInt64, kCPU).view({1, -1}));
mllm/models/deepseek_ocr/deepencoder.hpp-97-  }
mllm/models/deepseek_ocr/deepencoder.hpp-98-
--
mllm/models/minicpm_o2_6/modeling_whisper_encoder.hpp-194-
mllm/models/minicpm_o2_6/modeling_whisper_encoder.hpp-195-    // Add positional embeddings
mllm/models/minicpm_o2_6/modeling_whisper_encoder.hpp:196:    auto position_ids = Tensor::arange(0, seq_len, 1, kInt64).view({1, seq_len});
mllm/models/minicpm_o2_6/modeling_whisper_encoder.hpp-197-    auto position_embeddings = embed_positions_(position_ids);
mllm/models/minicpm_o2_6/modeling_whisper_encoder.hpp-198-    hidden_states = hidden_states + position_embeddings;
--
mllm/models/minicpm_o2_6/modeling_chattts.hpp-256-    auto position_ids = Tensor::empty({batch_size, input_len}, kInt64, kCPU).alloc();
mllm/models/minicpm_o2_6/modeling_chattts.hpp-257-    auto position_ids_ptr = position_ids.ptr<int64_t>();
mllm/models/minicpm_o2_6/modeling_chattts.hpp:258:    // Fill position_ids: arange(past_key_values_length, past_key_values_length + input_len)
mllm/models/minicpm_o2_6/modeling_chattts.hpp-259-    for (int b = 0; b < batch_size; ++b) {
mllm/models/minicpm_o2_6/modeling_chattts.hpp-260-      for (int i = 0; i < input_len; ++i) { position_ids_ptr[b * input_len + i] = past_key_values_length + i; }
--
docs/api/tensor.rst-47-   :return: New tensor with initialized one values
docs/api/tensor.rst-48-
docs/api/tensor.rst:49:arange
docs/api/tensor.rst-50-~~~~~~
docs/api/tensor.rst-51-
docs/api/tensor.rst:52:.. cpp:function:: static Tensor Tensor::arange(float start, float end, float step, DataTypes dtype = kFloat32, DeviceTypes device = kCPU)
docs/api/tensor.rst-53-
docs/api/tensor.rst-54-   Creates a tensor with evenly spaced values within a specified range.
--
mllm/backends/opencl/ops/FillOp.hpp-17- private:
mllm/backends/opencl/ops/FillOp.hpp-18-  std::shared_ptr<KernelWrap> kernel_fp32_buffer_ = nullptr;
mllm/backends/opencl/ops/FillOp.hpp:19:  std::shared_ptr<KernelWrap> kernel_arange_fp32_buffer_ = nullptr;
mllm/backends/opencl/ops/FillOp.hpp-20-};
mllm/backends/opencl/ops/FillOp.hpp-21-
--
mllm/backends/opencl/ops/FillOp.cpp-12-
mllm/backends/opencl/ops/FillOp.cpp-13-  kernel_fp32_buffer_ = runtime->buildKernel("fill", "fill_fp32", {});
mllm/backends/opencl/ops/FillOp.cpp:14:  kernel_arange_fp32_buffer_ = runtime->buildKernel("fill", "fill_arange_fp32", {});
mllm/backends/opencl/ops/FillOp.cpp-15-}
mllm/backends/opencl/ops/FillOp.cpp-16-
--
mllm/backends/opencl/ops/FillOp.cpp-61-    float start = options_.start;
mllm/backends/opencl/ops/FillOp.cpp-62-    float step = options_.step;
mllm/backends/opencl/ops/FillOp.cpp:63:    ret |= kernel_arange_fp32_buffer_->get().setArg(0, sizeof(float), &start);
mllm/backends/opencl/ops/FillOp.cpp:64:    ret |= kernel_arange_fp32_buffer_->get().setArg(1, sizeof(float), &step);
mllm/backends/opencl/ops/FillOp.cpp:65:    ret |= kernel_arange_fp32_buffer_->get().setArg(2, sizeof(cl_mem), &cl_buffer_dst);
mllm/backends/opencl/ops/FillOp.cpp-66-    if (ret != CL_SUCCESS) { MLLM_ERROR("OpenCLFillOp setArg failed: {}", ret); }
mllm/backends/opencl/ops/FillOp.cpp:67:    auto error = runtime->commandQueue().enqueueNDRangeKernel(kernel_arange_fp32_buffer_->get(), cl::NullRange,
mllm/backends/opencl/ops/FillOp.cpp-68-                                                              cl::NDRange(global_size), cl::NullRange);
mllm/backends/opencl/ops/FillOp.cpp-69-    if (error != CL_SUCCESS) {
mllm/backends/opencl/ops/FillOp.cpp:70:      MLLM_ERROR_EXIT(ExitCode::kOpenCLError, "Failed to execute fill_arange kernel, error code: {}", error);
mllm/backends/opencl/ops/FillOp.cpp-71-    }
mllm/backends/opencl/ops/FillOp.cpp-72-  } else {
--
mllm/backends/opencl/kernels/fill_cl.cpp-5-                   " dst[index]=value;\n"
mllm/backends/opencl/kernels/fill_cl.cpp-6-                   "}\n"
mllm/backends/opencl/kernels/fill_cl.cpp:7:                   "__kernel void fill_arange_fp32(float start,float step,__global float *dst) {\n"
mllm/backends/opencl/kernels/fill_cl.cpp-8-                   " size_t index=get_global_id(0);\n"
mllm/backends/opencl/kernels/fill_cl.cpp-9-                   " dst[index]=start+(float)index*step;\n"
--
mllm/backends/opencl/kernels/fill.cl-4-}
mllm/backends/opencl/kernels/fill.cl-5-
mllm/backends/opencl/kernels/fill.cl:6:__kernel void fill_arange_fp32(float start, float step, __global float *dst) {
mllm/backends/opencl/kernels/fill.cl-7-  size_t index = get_global_id(0);
mllm/backends/opencl/kernels/fill.cl-8-  dst[index] = start + (float)index * step;
--
mllm/backends/cpu/ops/FillOp.cpp-203-        case kFloat32: {
mllm/backends/cpu/ops/FillOp.cpp-204-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:205:          common::fill_arange_anytype(dst.ptr<mllm_fp32_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-206-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:207:          arm::fill_arange(dst.ptr<mllm_fp32_t>(), dst.numel(), options_.start, options_.end, options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-208-#endif
mllm/backends/cpu/ops/FillOp.cpp-209-          break;
--
mllm/backends/cpu/ops/FillOp.cpp-215-          for (size_t i = 0; i < dst.numel(); ++i) { ptr[i] = static_cast<mllm_fp16_t>(options_.start + i * options_.step); }
mllm/backends/cpu/ops/FillOp.cpp-216-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:217:          arm::fill_arange_fp16(dst.ptr<mllm_fp16_t>(), dst.numel(), options_.start, options_.end, options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-218-#endif
mllm/backends/cpu/ops/FillOp.cpp-219-          break;
--
mllm/backends/cpu/ops/FillOp.cpp-221-        case kInt64: {
mllm/backends/cpu/ops/FillOp.cpp-222-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:223:          common::fill_arange_anytype(dst.ptr<mllm_int64_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-224-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:225:          arm::fill_arange_anytype<mllm_int64_t>(dst.ptr<mllm_int64_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-226-                                                 options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-227-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-230-        case kInt32: {
mllm/backends/cpu/ops/FillOp.cpp-231-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:232:          common::fill_arange_anytype(dst.ptr<mllm_int32_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-233-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:234:          arm::fill_arange_anytype<mllm_int32_t>(dst.ptr<mllm_int32_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-235-                                                 options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-236-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-239-        case kInt16: {
mllm/backends/cpu/ops/FillOp.cpp-240-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:241:          common::fill_arange_anytype(dst.ptr<mllm_int16_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-242-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:243:          arm::fill_arange_anytype<mllm_int16_t>(dst.ptr<mllm_int16_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-244-                                                 options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-245-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-248-        case kInt8: {
mllm/backends/cpu/ops/FillOp.cpp-249-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:250:          common::fill_arange_anytype(dst.ptr<mllm_int8_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-251-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:252:          arm::fill_arange_anytype<mllm_int8_t>(dst.ptr<mllm_int8_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-253-                                                options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-254-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-257-        case kUInt64: {
mllm/backends/cpu/ops/FillOp.cpp-258-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:259:          common::fill_arange_anytype(dst.ptr<mllm_uint64_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-260-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:261:          arm::fill_arange_anytype<mllm_uint64_t>(dst.ptr<mllm_uint64_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-262-                                                  options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-263-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-266-        case kUInt32: {
mllm/backends/cpu/ops/FillOp.cpp-267-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:268:          common::fill_arange_anytype(dst.ptr<mllm_uint32_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-269-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:270:          arm::fill_arange_anytype<mllm_uint32_t>(dst.ptr<mllm_uint32_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-271-                                                  options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-272-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-275-        case kUInt16: {
mllm/backends/cpu/ops/FillOp.cpp-276-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:277:          common::fill_arange_anytype(dst.ptr<mllm_uint16_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-278-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:279:          arm::fill_arange_anytype<mllm_uint16_t>(dst.ptr<mllm_uint16_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-280-                                                  options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-281-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-284-        case kUInt8: {
mllm/backends/cpu/ops/FillOp.cpp-285-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:286:          common::fill_arange_anytype(dst.ptr<mllm_uint8_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-287-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:288:          arm::fill_arange_anytype<mllm_uint8_t>(dst.ptr<mllm_uint8_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-289-                                                 options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-290-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-292-        }
mllm/backends/cpu/ops/FillOp.cpp-293-        default: {
mllm/backends/cpu/ops/FillOp.cpp:294:          NYI("FillOp::forward[arange] not implemented for this data type");
mllm/backends/cpu/ops/FillOp.cpp-295-        }
mllm/backends/cpu/ops/FillOp.cpp-296-      }
--
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-77-// Fill Arange
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-78-//===----------------------------------------------------------------------===//
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:79:HWY_DLLEXPORT void call_fill_arange_fp32(mllm_fp32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:80:HWY_DLLEXPORT void call_fill_arange_i32(mllm_int32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:81:HWY_DLLEXPORT void call_fill_arange_u32(mllm_uint32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:82:HWY_DLLEXPORT void call_fill_arange_i64(mllm_int64_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:83:HWY_DLLEXPORT void call_fill_arange_u64(mllm_uint64_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:84:HWY_DLLEXPORT void call_fill_arange_i16(mllm_int16_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:85:HWY_DLLEXPORT void call_fill_arange_u16(mllm_uint16_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:86:HWY_DLLEXPORT void call_fill_arange_i8(mllm_int8_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:87:HWY_DLLEXPORT void call_fill_arange_u8(mllm_uint8_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-88-
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-89-//===----------------------------------------------------------------------===//
--
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-188-
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-189-template<typename T>
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:190:inline void fill_arange_anytype(T* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-191-  if constexpr (std::is_same_v<T, mllm_fp32_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:192:    call_fill_arange_fp32(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-193-  } else if constexpr (std::is_same_v<T, mllm_int32_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:194:    call_fill_arange_i32(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-195-  } else if constexpr (std::is_same_v<T, mllm_uint32_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:196:    call_fill_arange_u32(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-197-  } else if constexpr (std::is_same_v<T, mllm_int64_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:198:    call_fill_arange_i64(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-199-  } else if constexpr (std::is_same_v<T, mllm_uint64_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:200:    call_fill_arange_u64(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-201-  } else if constexpr (std::is_same_v<T, mllm_int16_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:202:    call_fill_arange_i16(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-203-  } else if constexpr (std::is_same_v<T, mllm_uint16_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:204:    call_fill_arange_u16(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-205-  } else if constexpr (std::is_same_v<T, mllm_int8_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:206:    call_fill_arange_i8(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-207-  } else if constexpr (std::is_same_v<T, mllm_uint8_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:208:    call_fill_arange_u8(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-209-  } else {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-210-    // Fallback
--
mllm/backends/cpu/kernels/common/fill-inl.hpp-195-//===----------------------------------------------------------------------===//
mllm/backends/cpu/kernels/common/fill-inl.hpp-196-template<typename T>
mllm/backends/cpu/kernels/common/fill-inl.hpp:197:HWY_INLINE void fill_arange_impl(T* HWY_RESTRICT dst, size_t count, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp-198-  if (step == 0) {
mllm/backends/cpu/kernels/common/fill-inl.hpp-199-    fill_value_impl(dst, count, static_cast<T>(start));
--
mllm/backends/cpu/kernels/common/fill-inl.hpp-244-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-245-
mllm/backends/cpu/kernels/common/fill-inl.hpp:246:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_fp32(mllm_fp32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-247-                                                           mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:248:  fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-249-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-250-
mllm/backends/cpu/kernels/common/fill-inl.hpp:251:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i32(mllm_int32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-252-                                                          mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:253:  fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-254-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-255-
mllm/backends/cpu/kernels/common/fill-inl.hpp:256:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u32(mllm_uint32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-257-                                                          mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:258:  fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-259-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-260-
mllm/backends/cpu/kernels/common/fill-inl.hpp:261:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i64(mllm_int64_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-262-                                                          mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:263:  fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-264-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-265-
mllm/backends/cpu/kernels/common/fill-inl.hpp:266:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u64(mllm_uint64_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-267-                                                          mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:268:  fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-269-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-270-
mllm/backends/cpu/kernels/common/fill-inl.hpp:271:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i16(mllm_int16_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-272-                                                          mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:273:  fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-274-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-275-
mllm/backends/cpu/kernels/common/fill-inl.hpp:276:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u16(mllm_uint16_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-277-                                                          mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:278:  fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-279-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-280-
mllm/backends/cpu/kernels/common/fill-inl.hpp:281:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i8(mllm_int8_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-282-                                                         mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:283:  fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-284-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-285-
mllm/backends/cpu/kernels/common/fill-inl.hpp:286:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u8(mllm_uint8_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-287-                                                         mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:288:  fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-289-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-290-
--
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-174-// Fill Arange
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-175-//===----------------------------------------------------------------------===//
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:176:HWY_EXPORT(fill_arange_fp32);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:177:HWY_EXPORT(fill_arange_i32);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:178:HWY_EXPORT(fill_arange_u32);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:179:HWY_EXPORT(fill_arange_i64);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:180:HWY_EXPORT(fill_arange_u64);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:181:HWY_EXPORT(fill_arange_i16);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:182:HWY_EXPORT(fill_arange_u16);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:183:HWY_EXPORT(fill_arange_i8);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:184:HWY_EXPORT(fill_arange_u8);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-185-
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:186:HWY_DLLEXPORT void call_fill_arange_fp32(mllm_fp32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:187:  HWY_DYNAMIC_DISPATCH(fill_arange_fp32)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-188-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:189:HWY_DLLEXPORT void call_fill_arange_i32(mllm_int32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:190:  HWY_DYNAMIC_DISPATCH(fill_arange_i32)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-191-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:192:HWY_DLLEXPORT void call_fill_arange_u32(mllm_uint32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:193:  HWY_DYNAMIC_DISPATCH(fill_arange_u32)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-194-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:195:HWY_DLLEXPORT void call_fill_arange_i64(mllm_int64_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:196:  HWY_DYNAMIC_DISPATCH(fill_arange_i64)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-197-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:198:HWY_DLLEXPORT void call_fill_arange_u64(mllm_uint64_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:199:  HWY_DYNAMIC_DISPATCH(fill_arange_u64)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-200-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:201:HWY_DLLEXPORT void call_fill_arange_i16(mllm_int16_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:202:  HWY_DYNAMIC_DISPATCH(fill_arange_i16)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-203-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:204:HWY_DLLEXPORT void call_fill_arange_u16(mllm_uint16_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:205:  HWY_DYNAMIC_DISPATCH(fill_arange_u16)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-206-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:207:HWY_DLLEXPORT void call_fill_arange_i8(mllm_int8_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:208:  HWY_DYNAMIC_DISPATCH(fill_arange_i8)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-209-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:210:HWY_DLLEXPORT void call_fill_arange_u8(mllm_uint8_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:211:  HWY_DYNAMIC_DISPATCH(fill_arange_u8)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-212-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-213-
--
mllm/backends/cpu/kernels/arm/fill.cpp-52-}
mllm/backends/cpu/kernels/arm/fill.cpp-53-
mllm/backends/cpu/kernels/arm/fill.cpp:54:void fill_arange(mllm_fp32_t* __restrict dst, size_t size, float start, float end, float step, int thread_count) {
mllm/backends/cpu/kernels/arm/fill.cpp-55-  constexpr size_t vec_size = 4;  // 4 floats in NEON
mllm/backends/cpu/kernels/arm/fill.cpp-56-
--
mllm/backends/cpu/kernels/arm/fill.cpp-58-  size_t i = 0;
mllm/backends/cpu/kernels/arm/fill.cpp-59-
mllm/backends/cpu/kernels/arm/fill.cpp:60:  // Vectorized arange
mllm/backends/cpu/kernels/arm/fill.cpp-61-  float current_value = start;
mllm/backends/cpu/kernels/arm/fill.cpp-62-  for (; i < vec_end; i += vec_size) {
--
mllm/backends/cpu/kernels/arm/fill.cpp-129-}
mllm/backends/cpu/kernels/arm/fill.cpp-130-
mllm/backends/cpu/kernels/arm/fill.cpp:131:void fill_arange_fp16(mllm_fp16_t* __restrict dst, size_t size, float start, float end, float step, int thread_count) {
mllm/backends/cpu/kernels/arm/fill.cpp-132-  constexpr size_t vec_size = 8;  // 8 float16_t in NEON
mllm/backends/cpu/kernels/arm/fill.cpp-133-
--
mllm/backends/cpu/kernels/arm/fill.cpp-135-  size_t i = 0;
mllm/backends/cpu/kernels/arm/fill.cpp-136-
mllm/backends/cpu/kernels/arm/fill.cpp:137:  // Vectorized arange
mllm/backends/cpu/kernels/arm/fill.cpp-138-  float current_value = start;
mllm/backends/cpu/kernels/arm/fill.cpp-139-  for (; i < vec_end; i += vec_size) {
--
mllm/backends/cpu/kernels/arm/fill.hpp-17-void fill_specific_value(mllm_fp32_t* __restrict dst, size_t size, mllm_fp32_t value, int thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-18-
mllm/backends/cpu/kernels/arm/fill.hpp:19:void fill_arange(mllm_fp32_t* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step,
mllm/backends/cpu/kernels/arm/fill.hpp-20-                 int thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-21-
--
mllm/backends/cpu/kernels/arm/fill.hpp-29-void fill_specific_value_fp16(mllm_fp16_t* __restrict dst, size_t size, mllm_fp32_t value, int thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-30-
mllm/backends/cpu/kernels/arm/fill.hpp:31:void fill_arange_fp16(mllm_fp16_t* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step,
mllm/backends/cpu/kernels/arm/fill.hpp-32-                      int thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-33-
--
mllm/backends/cpu/kernels/arm/fill.hpp-94-
mllm/backends/cpu/kernels/arm/fill.hpp-95-template<typename T>
mllm/backends/cpu/kernels/arm/fill.hpp:96:inline void fill_arange_anytype(T* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step,
mllm/backends/cpu/kernels/arm/fill.hpp-97-                                int thread_count) {
mllm/backends/cpu/kernels/arm/fill.hpp-98-  if (step == 0) {
--
mllm/backends/cpu/kernels/arm/fill.hpp-119-
mllm/backends/cpu/kernels/arm/fill.hpp-120-template<>
mllm/backends/cpu/kernels/arm/fill.hpp:121:inline void fill_arange_anytype<mllm_fp32_t>(mllm_fp32_t* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end,
mllm/backends/cpu/kernels/arm/fill.hpp-122-                                             mllm_fp32_t step, int thread_count) {
mllm/backends/cpu/kernels/arm/fill.hpp:123:  fill_arange(dst, size, start, end, step, thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-124-}
mllm/backends/cpu/kernels/arm/fill.hpp-125-
mllm/backends/cpu/kernels/arm/fill.hpp-126-template<>
mllm/backends/cpu/kernels/arm/fill.hpp:127:inline void fill_arange_anytype<mllm_fp16_t>(mllm_fp16_t* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end,
mllm/backends/cpu/kernels/arm/fill.hpp-128-                                             mllm_fp32_t step, int thread_count) {
mllm/backends/cpu/kernels/arm/fill.hpp:129:  fill_arange_fp16(dst, size, start, end, step, thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-130-}
mllm/backends/cpu/kernels/arm/fill.hpp-131-
--
mllm/backends/cpu/kernels/x86/fill.hpp-17-void fill_specific_value(mllm_fp32_t* __restrict dst, size_t size, float value, int thread_count);
mllm/backends/cpu/kernels/x86/fill.hpp-18-
mllm/backends/cpu/kernels/x86/fill.hpp:19:void fill_arange(mllm_fp32_t* __restrict dst, size_t size, float start, float end, float step, int thread_count);
mllm/backends/cpu/kernels/x86/fill.hpp-20-
mllm/backends/cpu/kernels/x86/fill.hpp-21-void fill_random(mllm_fp32_t* __restrict dst, size_t size, float start, float end, uint64_t seed, int thread_count);
--
mllm/backends/cpu/kernels/x86/fill.cpp-109-}
mllm/backends/cpu/kernels/x86/fill.cpp-110-
mllm/backends/cpu/kernels/x86/fill.cpp:111:void fill_arange(mllm_fp32_t* __restrict dst, size_t size, float start, float end, float step, int thread_count) {
mllm/backends/cpu/kernels/x86/fill.cpp-112-#if defined(MLLM_HOST_FEATURE_AVX512F)
mllm/backends/cpu/kernels/x86/fill.cpp-113-  constexpr size_t vec_size = 16;
--
mllm/backends/cpu/kernels/x86/fill.cpp-129-    || defined(MLLM_HOST_FEATURE_SSE2) || defined(MLLM_HOST_FEATURE_SSE)
mllm/backends/cpu/kernels/x86/fill.cpp-130-  if (vec_size > 1) {
mllm/backends/cpu/kernels/x86/fill.cpp:131:    // Vectorized arange
mllm/backends/cpu/kernels/x86/fill.cpp-132-    float current_value = start;
mllm/backends/cpu/kernels/x86/fill.cpp-133-    for (; i < vec_end; i += vec_size) {

Script executed:

# Look at the complete fill_arange_impl and surrounding context for assertions/documentation
sed -n '193,245p' mllm/backends/cpu/kernels/common/fill-inl.hpp

Repository: UbiquitousLearning/mllm
Exit code: 0

stdout:

//===----------------------------------------------------------------------===//
// Fill Arange (start, end, step)
//===----------------------------------------------------------------------===//
template<typename T>
HWY_INLINE void fill_arange_impl(T* HWY_RESTRICT dst, size_t count, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
  if (step == 0) {
    fill_value_impl(dst, count, static_cast<T>(start));
    return;
  }

  // Calculate the actual number of elements to fill
  size_t n = 0;
  if ((step > 0 && start < end) || (step < 0 && start > end)) {
    mllm_fp32_t n_float = (end - start) / step;
    if (n_float > 0) {
      n = static_cast<size_t>(std::ceil(n_float));
      if (step > 0) {
        if (start + (n - 1) * step >= end) --n;
      } else {
        if (start + (n - 1) * step <= end) --n;
      }
      n = std::min(n, count);
    }
  }

  // Use SIMD for float types where we can vectorize the computation
  if constexpr (std::is_same_v<T, mllm_fp32_t>) {
    const hn::ScalableTag<T> d;
    const size_t N = hn::Lanes(d);

    // Create increment vector: [0, 1, 2, 3, ...] * step
    const hn::Vec<decltype(d)> step_vec = hn::Set(d, step);
    const hn::Vec<decltype(d)> n_step_vec = hn::Set(d, step * static_cast<T>(N));

    // Create base offsets [0, 1, 2, 3, ...]
    hn::Vec<decltype(d)> base = hn::Iota(d, 0);
    base = hn::Mul(base, step_vec);
    hn::Vec<decltype(d)> current_start = hn::Add(hn::Set(d, start), base);

    size_t idx = 0;
    for (; idx + N <= n; idx += N) {
      hn::StoreU(current_start, d, dst + idx);
      current_start = hn::Add(current_start, n_step_vec);
    }

    // Handle remaining elements
    for (; idx < n; ++idx) { dst[idx] = static_cast<T>(start + idx * step); }
  } else {
    // Scalar fallback for other types
    for (size_t i = 0; i < n; ++i) { dst[i] = static_cast<T>(start + i * step); }
  }
}


Script executed:

# Check FillOp.cpp for any validation of count vs computed arange length
sed -n '195,300p' mllm/backends/cpu/ops/FillOp.cpp

Repository: UbiquitousLearning/mllm
Exit code: 0

stdout:

          NYI("FillOp::forward[ones] not implemented for this data type");
          break;
        }
      }
      break;
    }
    case aops::FillOpTypes::kArange: {
      switch (dst.dtype()) {
        case kFloat32: {
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
          common::fill_arange_anytype(dst.ptr<mllm_fp32_t>(), dst.numel(), options_.start, options_.end, options_.step);
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
          arm::fill_arange(dst.ptr<mllm_fp32_t>(), dst.numel(), options_.start, options_.end, options_.step, threads);
#endif
          break;
        }
        case kFloat16: {
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
          // FP16 not directly supported by Highway on x86, use scalar fallback
          auto ptr = dst.ptr<mllm_fp16_t>();
          for (size_t i = 0; i < dst.numel(); ++i) { ptr[i] = static_cast<mllm_fp16_t>(options_.start + i * options_.step); }
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
          arm::fill_arange_fp16(dst.ptr<mllm_fp16_t>(), dst.numel(), options_.start, options_.end, options_.step, threads);
#endif
          break;
        }
        case kInt64: {
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
          common::fill_arange_anytype(dst.ptr<mllm_int64_t>(), dst.numel(), options_.start, options_.end, options_.step);
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
          arm::fill_arange_anytype<mllm_int64_t>(dst.ptr<mllm_int64_t>(), dst.numel(), options_.start, options_.end,
                                                 options_.step, threads);
#endif
          break;
        }
        case kInt32: {
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
          common::fill_arange_anytype(dst.ptr<mllm_int32_t>(), dst.numel(), options_.start, options_.end, options_.step);
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
          arm::fill_arange_anytype<mllm_int32_t>(dst.ptr<mllm_int32_t>(), dst.numel(), options_.start, options_.end,
                                                 options_.step, threads);
#endif
          break;
        }
        case kInt16: {
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
          common::fill_arange_anytype(dst.ptr<mllm_int16_t>(), dst.numel(), options_.start, options_.end, options_.step);
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
          arm::fill_arange_anytype<mllm_int16_t>(dst.ptr<mllm_int16_t>(), dst.numel(), options_.start, options_.end,
                                                 options_.step, threads);
#endif
          break;
        }
        case kInt8: {
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
          common::fill_arange_anytype(dst.ptr<mllm_int8_t>(), dst.numel(), options_.start, options_.end, options_.step);
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
          arm::fill_arange_anytype<mllm_int8_t>(dst.ptr<mllm_int8_t>(), dst.numel(), options_.start, options_.end,
                                                options_.step, threads);
#endif
          break;
        }
        case kUInt64: {
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
          common::fill_arange_anytype(dst.ptr<mllm_uint64_t>(), dst.numel(), options_.start, options_.end, options_.step);
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
          arm::fill_arange_anytype<mllm_uint64_t>(dst.ptr<mllm_uint64_t>(), dst.numel(), options_.start, options_.end,
                                                  options_.step, threads);
#endif
          break;
        }
        case kUInt32: {
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
          common::fill_arange_anytype(dst.ptr<mllm_uint32_t>(), dst.numel(), options_.start, options_.end, options_.step);
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
          arm::fill_arange_anytype<mllm_uint32_t>(dst.ptr<mllm_uint32_t>(), dst.numel(), options_.start, options_.end,
                                                  options_.step, threads);
#endif
          break;
        }
        case kUInt16: {
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
          common::fill_arange_anytype(dst.ptr<mllm_uint16_t>(), dst.numel(), options_.start, options_.end, options_.step);
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
          arm::fill_arange_anytype<mllm_uint16_t>(dst.ptr<mllm_uint16_t>(), dst.numel(), options_.start, options_.end,
                                                  options_.step, threads);
#endif
          break;
        }
        case kUInt8: {
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
          common::fill_arange_anytype(dst.ptr<mllm_uint8_t>(), dst.numel(), options_.start, options_.end, options_.step);
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
          arm::fill_arange_anytype<mllm_uint8_t>(dst.ptr<mllm_uint8_t>(), dst.numel(), options_.start, options_.end,
                                                 options_.step, threads);
#endif
          break;
        }
        default: {
          NYI("FillOp::forward[arange] not implemented for this data type");
        }
      }
      break;
    }
    case aops::FillOpTypes::kRandom: {
      switch (dst.dtype()) {

Script executed:

# Check Tensor.cpp for how arange creates the tensor with proper shape
sed -n '130,145p' mllm/core/Tensor.cpp

Repository: UbiquitousLearning/mllm
Exit code: 0

stdout:

}

Tensor Tensor::arange(float start, float end, float step, DataTypes dtype, DeviceTypes device) {
  auto shape = std::vector<int32_t>{static_cast<int32_t>((end - start) / step)};
  auto i = Tensor::empty(shape, dtype, device).alloc();
  return Context::instance().buildOpAndSubmitTask(
      OpTypes::kFill, aops::FillOpOptions{.type = aops::FillOpTypes::kArange, .start = start, .end = end, .step = step},
      {i})[0];
}

Tensor Tensor::random(const std::vector<int32_t>& shape, float start, float end, DataTypes dtype, DeviceTypes device) {
  auto i = Tensor::empty(shape, dtype, device).alloc();
  return Context::instance().buildOpAndSubmitTask(
      OpTypes::kFill,
      aops::FillOpOptions{
          .type = aops::FillOpTypes::kRandom, .start = start, .end = end, .seed = Context::instance().getRandomState()},

Script executed:

# Search for any existing comments or assertions about the arange contract
grep -n "assert\|ASSERT\|contract\|precondition" mllm/backends/cpu/kernels/common/fill-inl.hpp

Repository: UbiquitousLearning/mllm
Exit code: 1


Ensure the contract is explicit when arange length doesn't match count.

The function computes n from start/end/step and only fills n elements (line 214: n = std::min(n, count)). If count > n, the tail remains uninitialized. While std::min() prevents overflow, it silently allows gaps. Consider adding an assertion, an error check, or explicit documentation clarifying that callers must ensure count == ceil((end - start) / step), or explicitly fill any remainder.

🤖 Prompt for AI Agents
In `@mllm/backends/cpu/kernels/common/fill-inl.hpp` around lines 196 - 244, The
function fill_arange_impl computes n (the number of sequence elements) and may
leave dst[n..count-1] uninitialized; make the contract explicit and avoid silent
gaps by: after computing n, add a runtime check/assert that either n == count
(e.g., assert(n == count) or throw) or, if you want to tolerate count > n,
explicitly fill the remainder via fill_value_impl(dst + n, count - n,
static_cast<T>(start + n * step)) so no tail remains uninitialized; update
callers/docs accordingly. Reference: fill_arange_impl, variable n, and
fill_value_impl.
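
One way the "explicit contract" option could look is sketched below. This is a scalar illustration only, not the Highway-based kernel: `arange_length` and `fill_arange_checked` are hypothetical names that do not exist in mllm, and the choice to reject a mismatched `count` (rather than pad the tail) is an assumption made for the example.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper (not part of mllm): number of values in the half-open
// range [start, end) when stepping by `step`. Returns 0 for an empty range
// or a zero step.
inline std::size_t arange_length(float start, float end, float step) {
  if (step == 0.0f) return 0;
  const double span =
      (static_cast<double>(end) - static_cast<double>(start)) / static_cast<double>(step);
  if (!(span > 0.0)) return 0;  // empty range; this comparison also rejects NaN
  return static_cast<std::size_t>(std::ceil(span));
}

// Hypothetical checked fill: writes the sequence only when the destination
// size matches the sequence length exactly, so no tail is left uninitialized.
// Returns false (and writes nothing) on a mismatch.
template<typename T>
inline bool fill_arange_checked(T* dst, std::size_t count, float start, float end, float step) {
  if (count != arange_length(start, end, step)) return false;
  for (std::size_t i = 0; i < count; ++i) {
    dst[i] = static_cast<T>(start + static_cast<float>(i) * step);
  }
  return true;
}
```

Note that `Tensor::arange`, as shown in the script output above, sizes the tensor with a truncating `static_cast<int32_t>((end - start) / step)`, so a strict check like this would only pass if the shape computation and the kernel agreed on the same length formula.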

Comment on lines +10 to 12
#include <cstring>
#include "mllm/core/DataTypes.hpp"


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "kernel_dispatch.hpp" -type f

Repository: UbiquitousLearning/mllm

Length of output: 121


🏁 Script executed:

cat -n mllm/backends/cpu/kernels/common/kernel_dispatch.hpp | head -120

Repository: UbiquitousLearning/mllm

Length of output: 9016


🏁 Script executed:

head -20 mllm/backends/cpu/kernels/common/kernel_dispatch.hpp

Repository: UbiquitousLearning/mllm

Length of output: 794


🏁 Script executed:

rg "#include.*type_traits" mllm/backends/cpu/kernels/common/kernel_dispatch.hpp

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

grep "^#include" mllm/backends/cpu/kernels/common/kernel_dispatch.hpp

Repository: UbiquitousLearning/mllm

Length of output: 199


Add missing #include <type_traits> for std::is_same_v usage.

The template fill_zeros_anytype at line 105+ uses std::is_same_v (lines 107, 109, 111, 113, 115, 117, 119, and beyond), which requires the <type_traits> header. The file currently does not include it, relying on transitive includes which is unreliable.

Proposed fix
 #include <cstring>
+#include <type_traits>
 #include "mllm/core/DataTypes.hpp"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#include <cstring>
#include "mllm/core/DataTypes.hpp"
#include <cstring>
#include <type_traits>
#include "mllm/core/DataTypes.hpp"
🤖 Prompt for AI Agents
In `@mllm/backends/cpu/kernels/common/kernel_dispatch.hpp` around lines 10 - 12,
The file is missing `#include` <type_traits> required by the template
fill_zeros_anytype which uses std::is_same_v; add the header to the top of
kernel_dispatch.hpp (alongside <cstring> and "mllm/core/DataTypes.hpp") so
std::is_same_v resolves properly and avoids relying on transitive includes
referenced in fill_zeros_anytype.

Comment on lines +105 to +131
template<typename T>
inline void fill_zeros_anytype(T* dst, size_t n) {
if constexpr (std::is_same_v<T, mllm_fp32_t>) {
call_fill_zeros_fp32(dst, n);
} else if constexpr (std::is_same_v<T, mllm_fp64_t>) {
call_fill_zeros_fp64(dst, n);
} else if constexpr (std::is_same_v<T, mllm_int32_t>) {
call_fill_zeros_i32(dst, n);
} else if constexpr (std::is_same_v<T, mllm_uint32_t>) {
call_fill_zeros_u32(dst, n);
} else if constexpr (std::is_same_v<T, mllm_int64_t>) {
call_fill_zeros_i64(dst, n);
} else if constexpr (std::is_same_v<T, mllm_uint64_t>) {
call_fill_zeros_u64(dst, n);
} else if constexpr (std::is_same_v<T, mllm_int16_t>) {
call_fill_zeros_i16(dst, n);
} else if constexpr (std::is_same_v<T, mllm_uint16_t>) {
call_fill_zeros_u16(dst, n);
} else if constexpr (std::is_same_v<T, mllm_int8_t>) {
call_fill_zeros_i8(dst, n);
} else if constexpr (std::is_same_v<T, mllm_uint8_t>) {
call_fill_zeros_u8(dst, n);
} else {
// Fallback for unsupported types
std::memset(dst, 0, n * sizeof(T));
}
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Check if the file exists and read the context around lines 105-131
cat -n mllm/backends/cpu/kernels/common/kernel_dispatch.hpp | sed -n '90,140p'

Repository: UbiquitousLearning/mllm

Length of output: 3224


🏁 Script executed:

# Also check for any template constraints or type requirements
cat -n mllm/backends/cpu/kernels/common/kernel_dispatch.hpp | head -110 | tail -30

Repository: UbiquitousLearning/mllm

Length of output: 2869


🏁 Script executed:

# Search for any type traits or constraints applied to fill_zeros_anytype
rg "fill_zeros_anytype" -A 10 -B 10 mllm/backends/cpu/kernels/common/kernel_dispatch.hpp

Repository: UbiquitousLearning/mllm

Length of output: 1402


🏁 Script executed:

# Search for where fill_zeros_anytype is used
rg "fill_zeros_anytype" -B 2 -A 2

Repository: UbiquitousLearning/mllm

Length of output: 6805


🏁 Script executed:

# Check if there's similar code in fill_ones_anytype to see the pattern
cat -n mllm/backends/cpu/kernels/common/kernel_dispatch.hpp | sed -n '133,160p'

Repository: UbiquitousLearning/mllm

Length of output: 1327


🌐 Web query:

std::memset non-trivial types C++ standard undefined behavior

💡 Result:

Short answer: Using std::memset on objects whose type is non‑trivially‑copyable (i.e. non‑trivial) is undefined behavior. For trivially‑copyable objects you may manipulate their object representation with byte‑wise functions, but you must respect the standard's restrictions (e.g. not a potentially‑overlapping subobject); the resulting value representation may be implementation‑defined. Prefer constructors / std::fill / type‑safe initialization for non‑trivial types.

Sources:

  • std::memset specification (undefined if object is not TriviallyCopyable or is a potentially‑overlapping subobject). [1]
  • Object / object‑representation and trivially‑copyable rules (why bytewise ops are only guaranteed for trivially‑copyable). [2]
  • Discussion / examples and standard pointers about memcpy/memset use and UB for non‑trivial types. [3]

References:
[1] cppreference — std::memset.
[2] cppreference — Object representation / trivially‑copyable.
[3] Relevant discussion (Stack Overflow / standard excerpts) about memcpy/memset and trivially‑copyable vs non‑trivial types.


Guard memset fallback against non-trivial types.

Using std::memset on non-trivial types is undefined behavior per the C++ standard. The template accepts any type without constraints, so the fallback must check std::is_trivial_v<T>. Consider the safer element-wise approach used in fill_ones_anytype instead.

Suggested fix
   } else {
     // Fallback for unsupported types
-    std::memset(dst, 0, n * sizeof(T));
+    if constexpr (std::is_trivial_v<T>) {
+      std::memset(dst, 0, n * sizeof(T));
+    } else {
+      for (size_t i = 0; i < n; ++i) { dst[i] = T{}; }
+    }
   }
📝 Committable suggestion


Suggested change
template<typename T>
inline void fill_zeros_anytype(T* dst, size_t n) {
if constexpr (std::is_same_v<T, mllm_fp32_t>) {
call_fill_zeros_fp32(dst, n);
} else if constexpr (std::is_same_v<T, mllm_fp64_t>) {
call_fill_zeros_fp64(dst, n);
} else if constexpr (std::is_same_v<T, mllm_int32_t>) {
call_fill_zeros_i32(dst, n);
} else if constexpr (std::is_same_v<T, mllm_uint32_t>) {
call_fill_zeros_u32(dst, n);
} else if constexpr (std::is_same_v<T, mllm_int64_t>) {
call_fill_zeros_i64(dst, n);
} else if constexpr (std::is_same_v<T, mllm_uint64_t>) {
call_fill_zeros_u64(dst, n);
} else if constexpr (std::is_same_v<T, mllm_int16_t>) {
call_fill_zeros_i16(dst, n);
} else if constexpr (std::is_same_v<T, mllm_uint16_t>) {
call_fill_zeros_u16(dst, n);
} else if constexpr (std::is_same_v<T, mllm_int8_t>) {
call_fill_zeros_i8(dst, n);
} else if constexpr (std::is_same_v<T, mllm_uint8_t>) {
call_fill_zeros_u8(dst, n);
} else {
// Fallback for unsupported types
std::memset(dst, 0, n * sizeof(T));
}
}
template<typename T>
inline void fill_zeros_anytype(T* dst, size_t n) {
if constexpr (std::is_same_v<T, mllm_fp32_t>) {
call_fill_zeros_fp32(dst, n);
} else if constexpr (std::is_same_v<T, mllm_fp64_t>) {
call_fill_zeros_fp64(dst, n);
} else if constexpr (std::is_same_v<T, mllm_int32_t>) {
call_fill_zeros_i32(dst, n);
} else if constexpr (std::is_same_v<T, mllm_uint32_t>) {
call_fill_zeros_u32(dst, n);
} else if constexpr (std::is_same_v<T, mllm_int64_t>) {
call_fill_zeros_i64(dst, n);
} else if constexpr (std::is_same_v<T, mllm_uint64_t>) {
call_fill_zeros_u64(dst, n);
} else if constexpr (std::is_same_v<T, mllm_int16_t>) {
call_fill_zeros_i16(dst, n);
} else if constexpr (std::is_same_v<T, mllm_uint16_t>) {
call_fill_zeros_u16(dst, n);
} else if constexpr (std::is_same_v<T, mllm_int8_t>) {
call_fill_zeros_i8(dst, n);
} else if constexpr (std::is_same_v<T, mllm_uint8_t>) {
call_fill_zeros_u8(dst, n);
} else {
// Fallback for unsupported types
if constexpr (std::is_trivial_v<T>) {
std::memset(dst, 0, n * sizeof(T));
} else {
for (size_t i = 0; i < n; ++i) { dst[i] = T{}; }
}
}
}
🤖 Prompt for AI Agents
In `@mllm/backends/cpu/kernels/common/kernel_dispatch.hpp` around lines 105 - 131,
The fallback in fill_zeros_anytype currently uses std::memset for all unmatched
types; restrict that to only trivial types by checking std::is_trivial_v<T> and
for non-trivial types perform an element-wise zero assignment (e.g., for (size_t
i=0;i<n;++i) dst[i] = T{}), mirroring the safe approach used in
fill_ones_anytype; update the else branch in fill_zeros_anytype to first if
constexpr (std::is_trivial_v<T>) use std::memset, else use the element-wise loop
to avoid undefined behavior on non-trivial types.

Comment on lines 31 to 34
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
// TODO
// FP16 not directly supported by Highway on x86, use scalar fallback
std::memset(dst.ptr<mllm_fp16_t>(), 0, dst.numel() * sizeof(mllm_fp16_t));
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

find . -name "FillOp.cpp" -type f

Repository: UbiquitousLearning/mllm

Length of output: 167


🏁 Script executed:

cat -n ./mllm/backends/cpu/ops/FillOp.cpp

Repository: UbiquitousLearning/mllm

Length of output: 26749


🏁 Script executed:

cat -n ./mllm/backends/cpu/ops/FillOp.hpp | head -20

Repository: UbiquitousLearning/mllm

Length of output: 772


🏁 Script executed:

cat -n ./mllm/backends/cpu/kernels/Kernels.hpp | head -30

Repository: UbiquitousLearning/mllm

Length of output: 2061


🏁 Script executed:

cat -n ./mllm/backends/cpu/kernels/x86/fill.hpp | head -20

Repository: UbiquitousLearning/mllm

Length of output: 825


🏁 Script executed:

rg "#include <cstring>" ./mllm/backends/cpu/kernels/

Repository: UbiquitousLearning/mllm

Length of output: 612


🏁 Script executed:

rg "#include.*fill-inl.hpp" ./mllm/backends/cpu/kernels/

Repository: UbiquitousLearning/mllm

Length of output: 178


🏁 Script executed:

cat -n ./mllm/backends/cpu/kernels/x86/fill.hpp

Repository: UbiquitousLearning/mllm

Length of output: 1017


🏁 Script executed:

rg "#include.*fill-inl.hpp" ./mllm/backends/cpu/kernels/x86/

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

find ./mllm/backends/cpu/kernels/x86 -name "fill.cpp"

Repository: UbiquitousLearning/mllm

Length of output: 107


🏁 Script executed:

cat -n ./mllm/backends/cpu/kernels/x86/fill.cpp | head -30

Repository: UbiquitousLearning/mllm

Length of output: 1351


🏁 Script executed:

rg "#include" ./mllm/backends/cpu/kernels/x86/fill.cpp

Repository: UbiquitousLearning/mllm

Length of output: 206


🏁 Script executed:

cat -n ./mllm/utils/PlatformRTHelper.hpp

Repository: UbiquitousLearning/mllm

Length of output: 2824


🏁 Script executed:

cat -n ./mllm/core/DataTypes.hpp | head -30

Repository: UbiquitousLearning/mllm

Length of output: 1200


🏁 Script executed:

rg "#include <cstring>" ./mllm/core/ ./mllm/backends/cpu/ops/

Repository: UbiquitousLearning/mllm

Length of output: 2406


Add <cstring> include for std::memset usage on line 33.

The std::memset call at line 33 requires the <cstring> header. All other CPU operation files in the same directory include this header; add it to maintain consistency and ensure the file is self-contained.

🛠️ Proposed fix
 #include "mllm/backends/cpu/ops/FillOp.hpp"
 #include "mllm/backends/cpu/kernels/Kernels.hpp"
 #include "mllm/utils/PlatformRTHelper.hpp"
+#include <cstring>
📝 Committable suggestion


Suggested change
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
// TODO
// FP16 not directly supported by Highway on x86, use scalar fallback
std::memset(dst.ptr<mllm_fp16_t>(), 0, dst.numel() * sizeof(mllm_fp16_t));
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
#include "mllm/backends/cpu/ops/FillOp.hpp"
#include "mllm/backends/cpu/kernels/Kernels.hpp"
#include "mllm/utils/PlatformRTHelper.hpp"
#include <cstring>
🤖 Prompt for AI Agents
In `@mllm/backends/cpu/ops/FillOp.cpp` around lines 31 - 34, The file uses
std::memset in FillOp.cpp (inside the x86/x86_64 branch) but does not include
<cstring>, so add the missing include to the top of the file; update FillOp.cpp
to `#include` <cstring> (alongside other headers) so std::memset is declared and
the file is self-contained and consistent with other CPU ops.

@chenghuaWang chenghuaWang merged commit 42c4c70 into UbiquitousLearning:main Jan 20, 2026
4 checks passed