fix(qualcomm): Enhance quantization modules. #607
…fixed quantization parameters, updated ActivationQDQ to use MovingAverageMinMaxObserver, and adjusted eps values for better precision. Modified Qwen3 model to utilize FixedActivationQDQ for sigmoid output and ensured dtype consistency in attention calculations.
… debug print statements from Qwen3DecoderLayer
…ackend in CMake, enhance PTQPass with unsolved tensor value checks, and update quantization specifications in RMSNorm and model file conversion.
…improved quantization, enhance rotate_half function to utilize observers, and ensure consistent scale and zero_point across concatenated inputs.
📝 Walkthrough
Adds multiple quantization utilities and integrations across C++ and Python backends: fixed-parameter activation QDQ, concat observers, broader automatic quantization-spec generation and validation in AOT/PTQ passes, expanded CPU fill kernels/API, model serialization tweaks, and CMake install/export updates for several targets.
Sequence Diagram(s)
(Skipped: changes are broad and dispersed; no single new multi-component sequential flow met the diagram criteria.)
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ❌ 3
❌ Failed checks (2 warnings, 1 inconclusive)
UbiquitousLearning
left a comment
LGTM
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py (1)
104-136: `rotate_half` signature now breaks existing callers.
Line 104 requires observer args, but Line 135 still calls `rotate_half(q)` without them. If `apply_rotary_pos_emb` is invoked, this will raise a `TypeError`. Also, `x_observer` is unused. Please keep backward compatibility (or update all callers) and remove/rename the unused parameter.
🛠️ Backward-compatible fix
-def rotate_half(
-    x, x_observer, x2_neg_fake_quant: ActivationQDQ, concat_observer: ConcatObserver
-):
+def rotate_half(
+    x,
+    _x_observer=None,
+    x2_neg_fake_quant: Optional[ActivationQDQ] = None,
+    concat_observer: Optional[ConcatObserver] = None,
+):
     """Rotates half the hidden dims of the input."""
     x1 = x[..., : x.shape[-1] // 2]
     x2 = x[..., x.shape[-1] // 2 :]
-    return concat_observer(torch.cat((x2_neg_fake_quant(-x2), x1), dim=-1))
+    if x2_neg_fake_quant is None or concat_observer is None:
+        return torch.cat((-x2, x1), dim=-1)
+    return concat_observer(torch.cat((x2_neg_fake_quant(-x2), x1), dim=-1))
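The fallback logic in the suggested fix can be illustrated without PyTorch. This is a minimal sketch, assuming plain Python lists stand in for the last tensor dimension and that the two quantization hooks are arbitrary callables; `rotate_half_compat` and its parameter types are hypothetical and not the project's API.

```python
from typing import Callable, List, Optional

Vector = List[float]


def rotate_half_compat(
    x: Vector,
    x2_neg_fake_quant: Optional[Callable[[Vector], Vector]] = None,
    concat_observer: Optional[Callable[[Vector], Vector]] = None,
) -> Vector:
    """Rotate half the entries of x: (x1, x2) -> (-x2, x1)."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    neg_x2 = [-v for v in x2]
    if x2_neg_fake_quant is None or concat_observer is None:
        # Legacy call path: no quantization hooks were supplied.
        return neg_x2 + x1
    # Quantization-aware path: fake-quantize -x2, then observe the concat.
    return concat_observer(x2_neg_fake_quant(neg_x2) + x1)
```

Calling `rotate_half_compat([1.0, 2.0, 3.0, 4.0])` with no hooks takes the legacy path, so callers that were written against the old one-argument signature keep working.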
🤖 Fix all issues with AI agents
In `@mllm/backends/qnn/aot/passes/PTQPass.cpp`:
- Around line 358-418: The loop in PTQPass.cpp can read ref_zero_point
uninitialized if the first captured reference is kSymPerTensor and a later
kAsymPerTensor input is compared; add tracking for the reference spec type (e.g., an
enum/ref_spec_type alongside has_ref) when you set
ref_scale/ref_zero_point/ref_input_name, and before comparing a new input check
that f_spec->spec_->type matches ref_spec_type; if types differ emit a clear
MLLM_ERROR/MLLM_WARN mentioning op_name and both input names and skip comparison
(or fail early), and only access ref_zero_point when ref_spec_type ==
kAsymPerTensor so no uninitialized reads occur.
In `@pymllm/backends/qualcomm/transformers/qwen3/runner.py`:
- Around line 57-61: The call to Qwen3ForCausalLM.from_pretrained uses the wrong
keyword arg name `dtype`; update the call in runner.py where
Qwen3ForCausalLM.from_pretrained(model_path, attn_implementation="eager",
dtype=torch.float32) is invoked to use the correct HuggingFace parameter name
`torch_dtype=torch.float32` so the dtype is passed properly to the
PreTrainedModel loader.
In `@pymllm/backends/qualcomm/transformers/qwen3/train.py`:
- Around line 41-45: Decide and implement the intended fake-quant behavior by
removing the FIXME and either making the disable/enable calls deterministic or
exposing them as a CLI/config flag; e.g., add a boolean flag
(args.disable_fake_quant_before_calibration) and use it to conditionally call
m.disable_fake_quant() before m.calibrate(...) and m.enable_fake_quant() after,
ensuring the sequence around m.calibrate(...) and m.infer(...) is deterministic
and documented in the flag help text.
- Around line 50-53: The assigned lm_head parameter uses unquantized
embed_tokens weights before m.convert(), causing QLinearLPBQ's frozen
weight_quant to remain stale; fix by moving the weight tying to after
m.convert() (i.e., set m.model.lm_head.weight =
Parameter(m.model.model.embed_tokens.weight.clone()) only once convert() has
run) or, if tying must happen before convert(), update/re-freeze the QLinearLPBQ
internal quant state (weight_quant) after assignment so weight_quant.weight_q
reflects the new parameter; refer to m.model.lm_head.weight,
m.model.model.embed_tokens.weight, m.convert(), and the QLinearLPBQ frozen
weight_quant initialization to implement the change.
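The stale-state concern in the last item can be shown with a toy model. This is only a sketch under stated assumptions: `ToyQuantLinear`, `freeze`, and `tie_then_refreeze` are hypothetical names, and the real `QLinearLPBQ` internals are not reproduced here.

```python
from typing import List, Optional


class ToyQuantLinear:
    """Hypothetical stand-in for a layer that caches a quantized copy of its weight."""

    def __init__(self, weight: List[float], scale: float) -> None:
        self.weight = list(weight)
        self.scale = scale
        self.weight_q: Optional[List[int]] = None  # frozen integer weights

    def freeze(self) -> None:
        # Snapshot the current float weight as integers. Later changes to
        # self.weight do not update this snapshot automatically.
        self.weight_q = [round(w / self.scale) for w in self.weight]


def tie_then_refreeze(layer: ToyQuantLinear, embed_weight: List[float]) -> None:
    layer.weight = list(embed_weight)  # analogous to cloning embed_tokens.weight
    layer.freeze()                     # refresh the frozen snapshot


lm_head = ToyQuantLinear([1.0, -2.0], scale=0.5)
lm_head.freeze()
lm_head.weight = [0.5, 1.5]        # tie without re-freezing
stale = list(lm_head.weight_q)     # still derived from the old weight
tie_then_refreeze(lm_head, [0.5, 1.5])
fresh = list(lm_head.weight_q)     # now matches the tied weight
```

Whichever ordering the project settles on, the point of the sketch is that the frozen copy has to be rebuilt after the float weight is replaced.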
🧹 Nitpick comments (9)
mllm/CMakeLists.txt (1)
59-61: Consider using `PRIVATE` instead of `PUBLIC` for warning suppression.
Using `PUBLIC` propagates `-Wno-comma-subscript` to all targets that link against `MllmRT`, which could mask comma-subscript warnings in downstream code that should be fixed. If the deprecated syntax is only used within `MllmRT` itself (as the FIXME suggests), `PRIVATE` would be more appropriate to limit the scope of warning suppression.
Suggested change
 # FIXME: `@oreomaker` Need to remove comma features in slice!
 # Suppress comma-subscript warnings (deprecated C++ feature that will be removed in C++26)
-target_compile_options(MllmRT PUBLIC -Wno-comma-subscript)
+target_compile_options(MllmRT PRIVATE -Wno-comma-subscript)
mllm/backends/qnn/aot/visitor/RMSNorm.cpp (1)
53-55: Consider using named constants for int16 quantization bounds.
The magic numbers `32767` and `-32768` represent the int16 symmetric quantization range. Extracting these as named constants would improve readability and make the relationship between scale and range explicit.
♻️ Suggested refactor
+  constexpr int16_t kInt16Max = 32767;
+  constexpr int16_t kInt16Min = -32768;
+
   // fake bias quant recipe
   auto bias_scale = Tensor::ones({1});
-  bias_scale.at<float>({0}) = 1.0 / 32767;
-  auto quant_spec = mllm::ir::linalg::QuantizationSpecSymPerTensor::create(-32768, 32767, kInt16, kFloat32, bias_scale);
+  bias_scale.at<float>({0}) = 1.0f / kInt16Max;
+  auto quant_spec = mllm::ir::linalg::QuantizationSpecSymPerTensor::create(kInt16Min, kInt16Max, kInt16, kFloat32, bias_scale);
mllm/backends/qnn/aot/passes/PTQPass.cpp (1)
460-468: LGTM!
The validation functions are correctly invoked after the solving passes, ensuring all quantization specs are resolved before checking for issues. The ordering is appropriate.
Consider caching the `SubGraphOp` lookup to avoid repeated symbol table lookups:
auto main_subgraph = getCtx()->lookupSymbolTable(call_main_graph_op->getSymbolAttr()->str())->cast_<ir::graph::SubGraphOp>();
recursiveSolveWeights(writer.getContext(), main_subgraph, pf);
recursiveSolveNormal(writer.getContext(), main_subgraph, pf);
recursiveCheckUnsolved(writer.getContext(), main_subgraph);
recursiveCheckConcatInputs(writer.getContext(), main_subgraph);
pymllm/convertor/model_file_v2.py (1)
27-33: Consider moving this function inside the torch availability guard.
The function references `torch.uint8` and is only valid when PyTorch is available. While current call sites are properly guarded, placing the function definition inside the `if MLLM_FIND_TORCH_AVAILABLE:` block would make the dependency explicit and prevent accidental misuse.
Suggested change
 if MLLM_FIND_TORCH_AVAILABLE:
     import torch
+
+    def _torch_tensor_bytes(tensor: "torch.Tensor") -> bytes:
+        """Serialize a PyTorch tensor to raw bytes using uint8 view.
+
+        Handles dtypes not natively supported by numpy (e.g., bfloat16) by
+        viewing the tensor's storage as uint8 before conversion.
+        """
+        t = tensor.detach().cpu().contiguous()
+        if t.dim() == 0:
+            t = t.reshape(1)
+        return t.view(torch.uint8).numpy().tobytes()
+
 if MLLM_FIND_NUMPY_AVAILABLE:
     import numpy as np
-from .mllm_type_mapping import MLLM_TYPE_MAPPING
-
-
-def _torch_tensor_bytes(tensor: "torch.Tensor") -> bytes:
-    # Use uint8 view to preserve raw bytes for dtypes not supported by numpy.
-    t = tensor.detach().cpu().contiguous()
-    if t.dim() == 0:
-        t = t.reshape(1)
-    return t.view(torch.uint8).numpy().tobytes()
pymllm/backends/qualcomm/transformers/core/rms_norm.py (1)
23-31: Give the eps literal a named constant.
Line 26 inlines `0.0001 / 65535`. Consider extracting a module-level constant (or reusing a shared constant) to keep eps consistent and self-descriptive.
♻️ Suggested refactor
+DEFAULT_EPS_16BIT = 0.0001 / 65535
 ...
         self.weight_fake_quant = FakeQuantize(
             observer=MinMaxObserver.with_args(
                 qscheme=torch.per_tensor_affine,
                 dtype=torch.qint32,
-                eps=0.0001 / 65535,
+                eps=DEFAULT_EPS_16BIT,
             ),
As per coding guidelines, use named constants instead of magic numbers.
pymllm/backends/qualcomm/transformers/core/observer.py (1)
43-52: Use tensor ops + in-place updates for min/max tracking.
Line 45 uses Python `min`/`max` on tensors and rebinds buffers. Using `torch.minimum`/`maximum` with `copy_` avoids sync-y comparisons and keeps buffers stable.
♻️ Suggested refactor
-        self.min_val = min(self.min_val, x_orig.min())
-        self.max_val = max(self.max_val, x_orig.max())
+        self.min_val.copy_(torch.minimum(self.min_val, x_orig.min()))
+        self.max_val.copy_(torch.maximum(self.max_val, x_orig.max()))
 ...
-        for observers in self.input_observers:
-            observers.min_val = self.min_val
-            observers.max_val = self.max_val
+        for observers in self.input_observers:
+            observers.min_val.copy_(self.min_val)
+            observers.max_val.copy_(self.max_val)
As per coding guidelines, avoid unnecessary work in hot paths.
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py (2)
82-87: Derive sigmoid scale from a named constant.
Line 84 hard-codes `65535`. Consider extracting a named constant (or computing from `bits`) to make intent clearer and reduce magic numbers.
♻️ Suggested refactor
+SIGMOID_QMAX_16 = (2**16) - 1
 ...
-        sigmoid_scale = 1.0 / (65535 - 0 + 1)  # 1 / 65536
+        sigmoid_scale = 1.0 / (SIGMOID_QMAX_16 + 1)  # 1 / 65536
As per coding guidelines, use named constants instead of magic numbers.
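To make the arithmetic behind this constant concrete: sigmoid outputs lie in (0, 1), so a fixed scale of 1/65536 spreads that range over a 16-bit grid. The following is a minimal sketch, assuming unsigned 16-bit storage, a zero point of 0, and round-to-nearest; `quantize` and `dequantize` are hypothetical helpers, not functions from the PR.

```python
QUANT_BITS = 16
QMIN = 0
QMAX = (1 << QUANT_BITS) - 1             # 65535
SIGMOID_SCALE = 1.0 / (QMAX - QMIN + 1)  # 1 / 65536
SIGMOID_ZERO_POINT = 0                   # assumption: range starts at 0.0


def quantize(x: float) -> int:
    """Map a float in [0, 1] to the fixed 16-bit grid, clamping at the ends."""
    q = round(x / SIGMOID_SCALE) + SIGMOID_ZERO_POINT
    return max(QMIN, min(QMAX, q))


def dequantize(q: int) -> float:
    """Recover the float value represented by grid index q."""
    return (q - SIGMOID_ZERO_POINT) * SIGMOID_SCALE
```

Because the scale is fixed rather than observed, every sigmoid output in the model shares the same grid, and an input of exactly 1.0 clamps to the top code `QMAX`.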
381-381: Typo: `layer_dix` → `layer_idx` for consistency.
Line 381 looks like a misspelling; consider renaming for clarity and to match the rest of the codebase.
♻️ Suggested fix
-        self.layer_dix = layer_idx
+        self.layer_idx = layer_idx
As per coding guidelines, keep naming consistent.
pymllm/backends/qualcomm/transformers/qwen3/runner.py (1)
37-45: Consider using tuple in `isinstance` checks for cleaner code.
The logic is correct, but you can simplify the condition using a tuple.
♻️ Suggested refactor
 def enable_fake_quant(m):
-    if isinstance(m, ActivationQDQ) or isinstance(m, FixedActivationQDQ):
+    if isinstance(m, (ActivationQDQ, FixedActivationQDQ)):
         m.enable_fakequant()


 def disable_fake_quant(m):
-    if isinstance(m, ActivationQDQ) or isinstance(m, FixedActivationQDQ):
+    if isinstance(m, (ActivationQDQ, FixedActivationQDQ)):
         m.disable_fakequant()
This matches the pattern used in `freeze_qwen3_linear_weight` and is more idiomatic Python.
for (auto iii : inputs) {
  if (!iii->isa_<ir::tensor::TensorValue>()) continue;
  auto tv = iii->cast_<ir::tensor::TensorValue>();
  if (!tv->getAttr("quant_recipe")) continue;
  auto f_spec = tv->getAttr("quant_recipe")->cast_<ir::linalg::LinalgIRQuantizatonSpecAttr>();

  if (f_spec->spec_->type == ir::linalg::QuantizationSpecType::kAsymPerTensor) {
    auto this_spec = std::static_pointer_cast<ir::linalg::QuantizationSpecAsymPerTensor>(f_spec->spec_);
    if (!this_spec->solved) continue;

    if (!has_ref) {
      ref_scale = this_spec->scale;
      ref_zero_point = this_spec->zero_point;
      ref_input_name = tv->name();
      has_ref = true;
    } else {
      // Check if scale and zero_point match
      auto cur_scale = this_spec->scale;
      auto cur_zero_point = this_spec->zero_point;

      MLLM_RT_ASSERT_EQ(ref_scale.numel(), 1);
      MLLM_RT_ASSERT_EQ(cur_scale.numel(), 1);
      MLLM_RT_ASSERT_EQ(ref_zero_point.numel(), 1);
      MLLM_RT_ASSERT_EQ(cur_zero_point.numel(), 1);

      auto ref_scale_v = ref_scale.item<mllm_fp32_t>();
      auto cur_scale_v = cur_scale.item<mllm_fp32_t>();
      auto ref_zp_v = ref_zero_point.item<mllm_int32_t>();
      auto cur_zp_v = cur_zero_point.item<mllm_int32_t>();

      if (std::abs(ref_scale_v - cur_scale_v) > 1e-6 || ref_zp_v != cur_zp_v) {
        MLLM_ERROR("PTQPass: ConcatOp '{}' has mismatched scale/zp between inputs. "
                   "Input '{}': scale={}, zp={}; Input '{}': scale={}, zp={}",
                   op_name, ref_input_name, ref_scale_v, ref_zp_v, tv->name(), cur_scale_v, cur_zp_v);
      }
    }
  } else if (f_spec->spec_->type == ir::linalg::QuantizationSpecType::kSymPerTensor) {
    auto this_spec = std::static_pointer_cast<ir::linalg::QuantizationSpecSymPerTensor>(f_spec->spec_);
    if (!this_spec->solved) continue;

    if (!has_ref) {
      ref_scale = this_spec->scale;
      ref_input_name = tv->name();
      has_ref = true;
    } else {
      // Check if scale matches
      auto cur_scale = this_spec->scale;

      MLLM_RT_ASSERT_EQ(ref_scale.numel(), 1);
      MLLM_RT_ASSERT_EQ(cur_scale.numel(), 1);

      auto ref_scale_v = ref_scale.item<mllm_fp32_t>();
      auto cur_scale_v = cur_scale.item<mllm_fp32_t>();

      if (std::abs(ref_scale_v - cur_scale_v) > 1e-6) {
        MLLM_ERROR("PTQPass: ConcatOp '{}' has mismatched scale between inputs. "
                   "Input '{}': scale={}; Input '{}': scale={}",
                   op_name, ref_input_name, ref_scale_v, tv->name(), cur_scale_v);
      }
    }
  }
Potential undefined behavior when mixing quantization spec types.
If the first input has kSymPerTensor (which only sets ref_scale) and a subsequent input has kAsymPerTensor, the code at line 385 will read ref_zero_point which was never initialized, leading to undefined behavior.
Consider either:
- Tracking which spec type the reference was captured from and only comparing inputs of the same type.
- Emitting an error/warning when inputs have mismatched quantization spec types.
Proposed fix to track reference spec type
 Tensor ref_scale;
 Tensor ref_zero_point;
 bool has_ref = false;
 std::string ref_input_name;
+ir::linalg::QuantizationSpecType ref_spec_type = ir::linalg::QuantizationSpecType::kRaw;
 for (auto iii : inputs) {
   if (!iii->isa_<ir::tensor::TensorValue>()) continue;
   auto tv = iii->cast_<ir::tensor::TensorValue>();
   if (!tv->getAttr("quant_recipe")) continue;
   auto f_spec = tv->getAttr("quant_recipe")->cast_<ir::linalg::LinalgIRQuantizatonSpecAttr>();
   if (f_spec->spec_->type == ir::linalg::QuantizationSpecType::kAsymPerTensor) {
     auto this_spec = std::static_pointer_cast<ir::linalg::QuantizationSpecAsymPerTensor>(f_spec->spec_);
     if (!this_spec->solved) continue;
     if (!has_ref) {
       ref_scale = this_spec->scale;
       ref_zero_point = this_spec->zero_point;
       ref_input_name = tv->name();
+      ref_spec_type = ir::linalg::QuantizationSpecType::kAsymPerTensor;
       has_ref = true;
     } else {
+      if (ref_spec_type != ir::linalg::QuantizationSpecType::kAsymPerTensor) {
+        MLLM_ERROR("PTQPass: ConcatOp '{}' has mismatched quantization spec types between inputs.", op_name);
+        continue;
+      }
       // Check if scale and zero_point match
       // ... existing code ...
     }
   } else if (f_spec->spec_->type == ir::linalg::QuantizationSpecType::kSymPerTensor) {
     auto this_spec = std::static_pointer_cast<ir::linalg::QuantizationSpecSymPerTensor>(f_spec->spec_);
     if (!this_spec->solved) continue;
     if (!has_ref) {
       ref_scale = this_spec->scale;
       ref_input_name = tv->name();
+      ref_spec_type = ir::linalg::QuantizationSpecType::kSymPerTensor;
       has_ref = true;
     } else {
+      if (ref_spec_type != ir::linalg::QuantizationSpecType::kSymPerTensor) {
+        MLLM_ERROR("PTQPass: ConcatOp '{}' has mismatched quantization spec types between inputs.", op_name);
+        continue;
+      }
       // Check if scale matches
       // ... existing code ...
     }
   }
 }
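The type-tracking idea can also be modeled outside the C++ pass. The following is a minimal Python sketch, assuming each concat input carries a small spec record; `SpecType`, `QuantSpec`, and `check_concat_inputs` are hypothetical names for illustration and do not mirror the mllm IR classes.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional


class SpecType(Enum):
    SYM_PER_TENSOR = auto()
    ASYM_PER_TENSOR = auto()


@dataclass
class QuantSpec:
    name: str
    type: SpecType
    scale: float
    zero_point: Optional[int] = None  # only meaningful for asymmetric specs
    solved: bool = True


def check_concat_inputs(op_name: str, specs: List[QuantSpec], tol: float = 1e-6) -> List[str]:
    """Return error messages for inputs that disagree with the first solved input."""
    errors: List[str] = []
    ref: Optional[QuantSpec] = None
    for spec in specs:
        if not spec.solved:
            continue
        if ref is None:
            ref = spec  # the reference remembers its own spec type
            continue
        if spec.type is not ref.type:
            # Never compare zero points across spec types.
            errors.append(
                f"{op_name}: mismatched spec types between '{ref.name}' and '{spec.name}'"
            )
            continue
        if abs(spec.scale - ref.scale) > tol:
            errors.append(f"{op_name}: mismatched scale between '{ref.name}' and '{spec.name}'")
        elif spec.type is SpecType.ASYM_PER_TENSOR and spec.zero_point != ref.zero_point:
            errors.append(f"{op_name}: mismatched zp between '{ref.name}' and '{spec.name}'")
    return errors
```

Storing the whole reference spec, rather than separate scale and zero-point variables, means the zero point is only read on the branch where the reference is known to be asymmetric.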
self.model = Qwen3ForCausalLM.from_pretrained(
    model_path,
    attn_implementation="eager",
    dtype=torch.float32,
)
🧩 Analysis chain
🏁 Script executed:
# Find Qwen3ForCausalLM class definition
fd -e py -e pyi | xargs rg -l "class Qwen3ForCausalLM" | head -20
Repository: UbiquitousLearning/mllm
Length of output: 128
🏁 Script executed:
# Also search for the file structure under qwen3 backend
find pymllm/backends/qualcomm/transformers/qwen3 -type f -name "*.py" 2>/dev/null
Repository: UbiquitousLearning/mllm
Length of output: 235
🏁 Script executed:
# Check the imports and context in the runner.py file
cat -n pymllm/backends/qualcomm/transformers/qwen3/runner.py | head -100
Repository: UbiquitousLearning/mllm
Length of output: 3860
🏁 Script executed:
# Examine the Qwen3ForCausalLM class and from_pretrained method
cat -n pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py | head -200
Repository: UbiquitousLearning/mllm
Length of output: 9569
🏁 Script executed:
# Search for from_pretrained method in the modeling file
rg -n "from_pretrained" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -A 20
Repository: UbiquitousLearning/mllm
Length of output: 1197
🏁 Script executed:
# Search for the Qwen3ForCausalLM class definition
rg -n "^class Qwen3ForCausalLM" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -A 30
Repository: UbiquitousLearning/mllm
Length of output: 1354
🏁 Script executed:
# Check if Qwen3ForCausalLM inherits from PreTrainedModel or has its own from_pretrained
rg -n "class Qwen3ForCausalLM" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -B 2 -A 5
Repository: UbiquitousLearning/mllm
Length of output: 357
🏁 Script executed:
# Search for any custom from_pretrained or _from_pretrained method in the file
rg -n "def.*from_pretrained|def _from_pretrained" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -A 20
Repository: UbiquitousLearning/mllm
Length of output: 49
🏁 Script executed:
# Find Qwen3PreTrainedModel definition
rg -n "class Qwen3PreTrainedModel" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -A 20
Repository: UbiquitousLearning/mllm
Length of output: 779
🏁 Script executed:
# Check the transformers library's PreTrainedModel to understand the from_pretrained signature
# Let's look for how PreTrainedModel is imported and check if dtype parameter is used elsewhere
rg -n "torch_dtype|dtype=" pymllm/backends/qualcomm/transformers/qwen3/ -t py | head -30
Repository: UbiquitousLearning/mllm
Length of output: 1329
🏁 Script executed:
# Check the actual transformers library PreTrainedModel to see what parameters from_pretrained accepts
# Look for any kwargs handling or parameter documentation in the model class
rg -n "def from_pretrained" /usr/local/lib/python*/dist-packages/transformers/modeling_utils.py 2>/dev/null || echo "Checking local transformers..."
# Instead, let's check what the actual HuggingFace transformers library uses
python3 << 'EOF'
try:
    from transformers import PreTrainedModel
    import inspect
    sig = inspect.signature(PreTrainedModel.from_pretrained)
    print("PreTrainedModel.from_pretrained signature:")
    print(sig)
    print("\nParameters:")
    for param_name, param in sig.parameters.items():
        print(f" {param_name}: {param}")
except Exception as e:
    print(f"Error: {e}")
EOF
Repository: UbiquitousLearning/mllm
Length of output: 153
🏁 Script executed:
# Let's also check if there are any test files that show how this is used
find pymllm -name "*test*.py" -o -name "*example*.py" | xargs rg -l "Qwen3ForCausalLM" 2>/dev/null | head -5
Repository: UbiquitousLearning/mllm
Length of output: 49
🏁 Script executed:
# Check if there's any documentation or comments about the dtype parameter
rg -n "dtype|torch_dtype" pymllm/backends/qualcomm/transformers/qwen3/runner.py -B 3 -A 3
Repository: UbiquitousLearning/mllm
Length of output: 341
🏁 Script executed:
# Let's search the entire codebase for how from_pretrained is used with dtype vs torch_dtype
rg -n "from_pretrained.*dtype" . -t py --max-count=20
Repository: UbiquitousLearning/mllm
Length of output: 49
🏁 Script executed:
# Check if there's a __init__.py or other configuration that might override from_pretrained
find pymllm/backends/qualcomm/transformers/qwen3 -name "*.py" -exec grep -l "from_pretrained" {} \;
Repository: UbiquitousLearning/mllm
Length of output: 182
🏁 Script executed:
# Let's check the transformers package version and see what parameters PreTrainedModel accepts
python3 << 'EOF'
try:
    import transformers
    print(f"Transformers version: {transformers.__version__}")
    # Check the from_pretrained method
    from transformers.modeling_utils import PreTrainedModel
    import inspect
    # Get the source file location
    source_file = inspect.getfile(PreTrainedModel)
    print(f"Source file: {source_file}")
    # Check the method signature (first few lines)
    source = inspect.getsource(PreTrainedModel.from_pretrained)
    lines = source.split('\n')[:30]
    for i, line in enumerate(lines):
        print(f"{i}: {line}")
except Exception as e:
    print(f"Error: {e}")
EOF
Repository: UbiquitousLearning/mllm
Length of output: 104
🏁 Script executed:
# Check if dtype is used as a kwarg anywhere in PreTrainedModel or similar classes
rg -n "kwargs.*dtype|dtype.*kwargs" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py
Repository: UbiquitousLearning/mllm
Length of output: 49
🏁 Script executed:
# Let's look for any PreTrainedModel-related code in the qwen3 implementation
rg -n "PreTrainedModel\|from_pretrained" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -B 2 -A 5
Repository: UbiquitousLearning/mllm
Length of output: 49
🏁 Script executed:
# Let's check the train.py file to see if there are other from_pretrained calls
rg -n "from_pretrained" pymllm/backends/qualcomm/transformers/qwen3/train.py -B 2 -A 3
Repository: UbiquitousLearning/mllm
Length of output: 49
🏁 Script executed:
# Search the entire mllm/mllm-cli/pymllm codebase for from_pretrained usage patterns
rg -n "\.from_pretrained\(" pymllm/ -t py --max-count=20
Repository: UbiquitousLearning/mllm
Length of output: 575
🏁 Script executed:
# Check if Qwen3PreTrainedModel has any custom __init_subclass__ or similar that might handle dtype
cat -n pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py | sed -n '443,500p'
Repository: UbiquitousLearning/mllm
Length of output: 2587
🏁 Script executed:
# Check the entire modeling_qwen3.py to see if there's any custom from_pretrained override
grep -n "def from_pretrained\|def _from_pretrained" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py
Repository: UbiquitousLearning/mllm
Length of output: 49
🏁 Script executed:
# Let's check the HuggingFace documentation or examples to confirm the correct parameter
# Search for torch_dtype usage in the codebase
rg -n "torch_dtype" pymllm/ -t py
Repository: UbiquitousLearning/mllm
Length of output: 49
🏁 Script executed:
# Check if there's any recent commit or documentation about this
git log --oneline pymllm/backends/qualcomm/transformers/qwen3/runner.py 2>/dev/null | head -10
Repository: UbiquitousLearning/mllm
Length of output: 128
🏁 Script executed:
# Let's check if there are any test files that run this code
find pymllm -type f \( -name "*test*.py" -o -name "*example*.py" \) | head -10
Repository: UbiquitousLearning/mllm
Length of output: 209
🏁 Script executed:
# Check the HuggingFace transformers library documentation or source
# Let's look for any **kwargs handling in PreTrainedModel
python3 << 'EOF'
# Check what parameters PreTrainedModel.from_pretrained accepts by looking at docs
import sys
try:
    # Try to import and inspect
    from transformers.modeling_utils import PreTrainedModel
    help(PreTrainedModel.from_pretrained)
except ImportError:
    print("transformers not installed in this environment")
    # But we can check online documentation patterns
    print("According to HuggingFace documentation, PreTrainedModel.from_pretrained uses:")
    print("- torch_dtype (not dtype) parameter")
    print("- This is standard across all transformer models")
EOF
Repository: UbiquitousLearning/mllm
Length of output: 276
🏁 Script executed:
# Let's check if there are any comments or documentation in the modeling file about this
rg -n "torch_dtype|dtype.*pretrained|pretrained.*dtype" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -B 2 -A 2
Repository: UbiquitousLearning/mllm
Length of output: 49
🏁 Script executed:
# Let's run a direct test to see if the parameter works
# Check if there are actual test executions or imports that would fail
rg -n "import Qwen3ForCausalLM\|from.*Qwen3ForCausalLM" pymllm/ -t py
Repository: UbiquitousLearning/mllm
Length of output: 49
🏁 Script executed:
# Check if there's any error handling around the from_pretrained call
cat -n pymllm/backends/qualcomm/transformers/qwen3/runner.py | sed -n '54,65p'
Repository: UbiquitousLearning/mllm
Length of output: 637
🏁 Script executed:
# Let's look for any **kwargs or parameter unpacking that might accept dtype
rg -n "kwargs\|\*\*" pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py | grep -i "from_pretrained\|init" | head -20
Repository: UbiquitousLearning/mllm
Length of output: 49
🏁 Script executed:
# Let's check what the transformers library version expects
# Search online documentation or check if there's a requirements.txt
find . -name "requirements.txt" -o -name "setup.py" -o -name "pyproject.toml" | xargs cat 2>/dev/null | grep -i transformers
Repository: UbiquitousLearning/mllm
Length of output: 49
🏁 Script executed:
# Let's verify by checking the actual transformers code in site-packages if available
python3 << 'EOF'
import inspect
try:
    from transformers.modeling_utils import PreTrainedModel
    source = inspect.getsource(PreTrainedModel.from_pretrained)
    # Print first 100 lines to see the signature and parameter handling
    lines = source.split('\n')[:100]
    for i, line in enumerate(lines, 1):
        print(f"{i:3d}: {line}")
except Exception as e:
    print(f"Cannot inspect: {e}")
EOF
Repository: UbiquitousLearning/mllm
Length of output: 113
🏁 Script executed:
# Check the standard HuggingFace API - torch_dtype is the correct parameter
# Let's see if there's any documentation or comments in the code
rg -n "torch_dtype" . -t py --max-count=10
Repository: UbiquitousLearning/mllm
Length of output: 49
🏁 Script executed:
# Final check: Look for any other model loading patterns in the codebase to compare
rg -n "\.from_pretrained" pymllm/ -t py -B 1 -A 2 | head -30
Repository: UbiquitousLearning/mllm
Length of output: 1153
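Check dtype=torch.float32 against the installed transformers version; older releases expect torch_dtype=torch.float32.
HuggingFace PreTrainedModel.from_pretrained() has historically used torch_dtype as the parameter name, while recent transformers releases also accept dtype and deprecate torch_dtype. Since Qwen3ForCausalLM inherits from PreTrainedModel without overriding from_pretrained, the keyword must match what the pinned transformers version recognizes.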
# FIXME: Should disable or not.
m.disable_fake_quant()
m.calibrate(num_samples=args.num_samples, max_seq_length=args.max_length)
# m.compile()
m.enable_fake_quant()
m.infer(args.infer_text)
Resolve the fake‑quant toggle decision before release.
Line 41 explicitly flags uncertainty about disabling fake quant before calibration. Please decide the intended behavior (or gate it behind a flag) and remove the FIXME to avoid inconsistent calibration runs.
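One way to gate the toggle behind a flag is sketched below. It assumes an argparse-based entry point; the flag name follows the suggestion above, while the default values and the `RecordingRunner` stub are hypothetical and exist only to make the call order observable.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="PTQ calibration sketch")
    parser.add_argument("--num_samples", type=int, default=128)   # assumed default
    parser.add_argument("--max_length", type=int, default=2048)   # assumed default
    parser.add_argument("--infer_text", type=str, default="hello")
    parser.add_argument(
        "--disable_fake_quant_before_calibration",
        action="store_true",
        help="Turn fake quantization off while observers collect statistics.",
    )
    return parser


class RecordingRunner:
    """Hypothetical stand-in for the runner; records the order of calls."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def disable_fake_quant(self) -> None:
        self.calls.append("disable_fake_quant")

    def enable_fake_quant(self) -> None:
        self.calls.append("enable_fake_quant")

    def calibrate(self, num_samples: int, max_seq_length: int) -> None:
        self.calls.append("calibrate")

    def infer(self, text: str) -> None:
        self.calls.append("infer")


def run(m: RecordingRunner, args: argparse.Namespace) -> None:
    if args.disable_fake_quant_before_calibration:
        m.disable_fake_quant()
    m.calibrate(num_samples=args.num_samples, max_seq_length=args.max_length)
    m.enable_fake_quant()  # always enabled for inference, regardless of the flag
    m.infer(args.infer_text)
```

With the behavior tied to an explicit flag, two runs with the same arguments follow the same sequence, and the choice is recorded in the command line rather than in a FIXME.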
# This line maybe error. we need use quantized weight!!! not embed_tokens.weight!!!
m.model.lm_head.weight = torch.nn.Parameter(
    m.model.model.embed_tokens.weight.clone()
)
🧩 Analysis chain
🏁 Script executed:
find . -name "train.py" -path "*/qwen3/*" -type f
Repository: UbiquitousLearning/mllm
Length of output: 121
🏁 Script executed:
git ls-files | head -20
Repository: UbiquitousLearning/mllm
Length of output: 712
🏁 Script executed:
wc -l ./pymllm/backends/qualcomm/transformers/qwen3/train.py
Repository: UbiquitousLearning/mllm
Length of output: 124
🏁 Script executed:
cat -n ./pymllm/backends/qualcomm/transformers/qwen3/train.py | head -100
Repository: UbiquitousLearning/mllm
Length of output: 2309
🏁 Script executed:
find . -name "runner.py" -path "*/qwen3/*" -type f
Repository: UbiquitousLearning/mllm
Length of output: 122
🏁 Script executed:
wc -l ./pymllm/backends/qualcomm/transformers/qwen3/runner.py
Repository: UbiquitousLearning/mllm
Length of output: 126
🏁 Script executed:
cat -n ./pymllm/backends/qualcomm/transformers/qwen3/runner.py
Repository: UbiquitousLearning/mllm
Length of output: 8282
🏁 Script executed:
find . -name "modeling_qwen3.py" -type f
Repository: UbiquitousLearning/mllm
Length of output: 130
🏁 Script executed:
wc -l ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py
Repository: UbiquitousLearning/mllm
Length of output: 134
🏁 Script executed:
rg -n "lm_head\|embed_tokens" ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -A 2 -B 2
Repository: UbiquitousLearning/mllm
Length of output: 49
🏁 Script executed:
cat -n ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py | grep -E "(lm_head|embed_tokens)" -A 2 -B 2
Repository: UbiquitousLearning/mllm
Length of output: 1543
🏁 Script executed:
cat -n ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py | tail -100
Repository: UbiquitousLearning/mllm
Length of output: 4486
🏁 Script executed:
rg -n "_tied_weights_keys\|tie.*weight" ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -B 2 -A 2
Repository: UbiquitousLearning/mllm
Length of output: 49
🏁 Script executed:
rg -n "class QLinearLPBQ" ./pymllm/backends/qualcomm/transformers/core/ -l
Repository: UbiquitousLearning/mllm
Length of output: 122
🏁 Script executed:
grep -n "class QLinearLPBQ" ./pymllm/backends/qualcomm/transformers/core/qlinear.py -A 30
Repository: UbiquitousLearning/mllm
Length of output: 1253
🏁 Script executed:
grep -n "convert_to_conv2d_deploy_hwio" ./pymllm/backends/qualcomm/transformers/core/qlinear.py -B 5 -A 20
Repository: UbiquitousLearning/mllm
Length of output: 2235
🏁 Script executed:
cat -n ./pymllm/backends/qualcomm/transformers/core/qlinear.py | sed -n '228,280p'
Repository: UbiquitousLearning/mllm
Length of output: 2315
🏁 Script executed:
grep -n "weight_quant\|freeze_weight\|deploy_mode" ./pymllm/backends/qualcomm/transformers/core/qlinear.py | head -40
Repository: UbiquitousLearning/mllm
Length of output: 1976
🏁 Script executed:
grep -n "freeze_weight\|_tied_weights" ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py
Repository: UbiquitousLearning/mllm
Length of output: 114
🏁 Script executed:
cat -n ./pymllm/backends/qualcomm/transformers/core/qlinear.py | sed -n '50,110p'
Repository: UbiquitousLearning/mllm
Length of output: 2561
🏁 Script executed:
grep -n "post_init\|tie_weights\|_init_weights" ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -A 5
Repository: UbiquitousLearning/mllm
Length of output: 444
🏁 Script executed:
grep -n "class Qwen3PreTrainedModel" ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py -A 10
Repository: UbiquitousLearning/mllm
Length of output: 478
🏁 Script executed:
python3 << 'EOF'
# Let's understand the quantization flow more clearly
# Check if there's documentation or if we can trace the FakeQuantize behavior
import subprocess
result = subprocess.run(['grep', '-n', 'class.*DoubleQuantizer', './pymllm/backends/qualcomm/transformers/core/qlinear.py'],
                        capture_output=True, text=True)
print(result.stdout)
result = subprocess.run(['grep', '-n', 'class PreTrainedModel\|tie_weights\|post_init', './pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py'],
                        capture_output=True, text=True)
print(result.stdout)
EOF
Repository: UbiquitousLearning/mllm
Length of output: 164
🏁 Script executed:
find . -name "*.py" -path "*core*" | xargs grep -l "DoubleQuantizer" 2>/dev/null
Repository: UbiquitousLearning/mllm
Length of output: 122
🏁 Script executed:
cat -n ./pymllm/backends/qualcomm/transformers/core/qlinear.py | sed -n '161,225p'
Repository: UbiquitousLearning/mllm
Length of output: 2953
🏁 Script executed:
grep -n "self.embed_tokens\|embed_tokens.*=" ./pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py | head -20
Repository: UbiquitousLearning/mllm
Length of output: 173
🏁 Script executed:
cat -n ./pymllm/backends/qualcomm/transformers/qwen3/runner.py | sed -n '47,52p'
Repository: UbiquitousLearning/mllm
Length of output: 316
🏁 Script executed:
grep -n "def freeze_weight\|is_frozen" ./pymllm/backends/qualcomm/transformers/core/qlinear.py | head -15
Repository: UbiquitousLearning/mllm
Length of output: 292
🏁 Script executed:
cat -n ./pymllm/backends/qualcomm/transformers/qwen3/runner.py | sed -n '66,69p'
Repository: UbiquitousLearning/mllm
Length of output: 279
Quantized weights will not match the assigned lm_head.weight parameter.
Lines 50–53 assign unquantized embed_tokens.weight to lm_head.weight before m.convert() is called (line 54). Since lm_head is a QLinearLPBQ and its weights are frozen during initialization (runner.py line 68), the frozen weight_quant still contains the old quantized weights. When convert() runs, it extracts the stale quantized weights from weight_quant.weight_q rather than from the newly assigned parameter, resulting in a mismatch between the exported quantized weights and the parameter. Either re-freeze the quantized weights after assignment or perform the tying after convert() completes.
🤖 Prompt for AI Agents
In `@pymllm/backends/qualcomm/transformers/qwen3/train.py` around lines 50 - 53,
The assigned lm_head parameter uses unquantized embed_tokens weights before
m.convert(), causing QLinearLPBQ's frozen weight_quant to remain stale; fix by
moving the weight tying to after m.convert() (i.e., set m.model.lm_head.weight =
Parameter(m.model.model.embed_tokens.weight.clone()) only once convert() has
run) or, if tying must happen before convert(), update/re-freeze the QLinearLPBQ
internal quant state (weight_quant) after assignment so weight_quant.weight_q
reflects the new parameter; refer to m.model.lm_head.weight,
m.model.model.embed_tokens.weight, m.convert(), and the QLinearLPBQ frozen
weight_quant initialization to implement the change.
… zeros, ones, specific values, arange, and random fills. Introduce a new fill-inl.hpp file for optimized implementations and update kernel dispatch to include these operations. Enhance CPUFillOp to utilize the new fill functions for better performance and maintainability.
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@mllm/backends/cpu/kernels/common/fill-inl.hpp`:
- Around line 8-11: This file relies on std::is_trivial_v, std::is_same_v,
std::ceil, and std::min but doesn't include the matching standard headers; add
explicit includes for <type_traits>, <cmath>, and <algorithm> near the top
alongside the existing includes (after `#include` <cstring> / before
"mllm/core/DataTypes.hpp") so that uses in functions/template code referencing
std::is_trivial_v, std::is_same_v, std::ceil, and std::min are self-contained
and no longer depend on transitive headers.
- Around line 196-244: The function fill_arange_impl computes n (the number of
sequence elements) and may leave dst[n..count-1] uninitialized; make the
contract explicit and avoid silent gaps by: after computing n, add a runtime
check/assert that either n == count (e.g., assert(n == count) or throw) or, if
you want to tolerate count > n, explicitly fill the remainder via
fill_value_impl(dst + n, count - n, static_cast<T>(start + n * step)) so no tail
remains uninitialized; update callers/docs accordingly. Reference:
fill_arange_impl, variable n, and fill_value_impl.
In `@mllm/backends/cpu/kernels/common/kernel_dispatch.hpp`:
- Around line 10-12: The file is missing `#include <type_traits>` required by the
template fill_zeros_anytype which uses std::is_same_v; add the header to the top
of kernel_dispatch.hpp (alongside <cstring> and "mllm/core/DataTypes.hpp") so
std::is_same_v resolves properly and avoids relying on transitive includes
referenced in fill_zeros_anytype.
- Around line 105-131: The fallback in fill_zeros_anytype currently uses
std::memset for all unmatched types; restrict that to only trivial types by
checking std::is_trivial_v<T> and for non-trivial types perform an element-wise
zero assignment (e.g., for (size_t i=0;i<n;++i) dst[i] = T{}), mirroring the
safe approach used in fill_ones_anytype; update the else branch in
fill_zeros_anytype to first if constexpr (std::is_trivial_v<T>) use std::memset,
else use the element-wise loop to avoid undefined behavior on non-trivial types.
In `@mllm/backends/cpu/ops/FillOp.cpp`:
- Around line 31-34: The file uses std::memset in FillOp.cpp (inside the
x86/x86_64 branch) but does not include <cstring>, so add the missing include to
the top of the file; update FillOp.cpp to `#include <cstring>` (alongside other
headers) so std::memset is declared and the file is self-contained and
consistent with other CPU ops.
In `@pymllm/ffi/__init__.py`:
- Around line 291-293: The global `cuda: Device = cuda_()` is wrong because
`cuda_()` currently returns a DType; update the implementation of the `cuda_()`
factory so it returns a Device instance (matching what `cpu_()` and `qnn_()`
return) — e.g. call or wrap the appropriate FFI API that constructs a Device
(like `_ffi_api.cuda_()` or return Device(...) from the FFI result) and keep the
global `cuda` assignment and type annotation as `Device`; ensure the returned
object implements the same Device interface used by `device("cuda")` and
`.to(...)`.
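The `fill_zeros_anytype` item above asks for `std::memset` only on trivial types, with an element-wise fallback otherwise. A minimal sketch of that split follows; the helper name `fill_zeros_fallback` and the demo type `Wrapped` are hypothetical and not part of mllm, and the real function has additional dtype-specific branches that are omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical stand-in for the fallback branch of fill_zeros_anytype.
template <typename T>
void fill_zeros_fallback(T* dst, std::size_t n) {
  assert(dst != nullptr || n == 0);
  if constexpr (std::is_trivial_v<T>) {
    // Trivial types can be zeroed bytewise.
    std::memset(dst, 0, n * sizeof(T));
  } else {
    // Non-trivial types are assigned a value-initialized object instead,
    // because memset over them is undefined behavior.
    for (std::size_t i = 0; i < n; ++i) { dst[i] = T{}; }
  }
}

// Demo type: the user-provided constructor makes it non-trivial.
struct Wrapped {
  float v;
  Wrapped() : v(0.0f) {}
};
```

Calling `fill_zeros_fallback` on a `float` buffer takes the `memset` path, while a `Wrapped` buffer takes the loop; both end with every element equal to zero.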
♻️ Duplicate comments (1)
mllm/backends/qnn/aot/passes/PTQPass.cpp (1)
342-420: Handle mixed quantization spec types before comparing concat inputs.
If the reference comes from kSymPerTensor and a later input is kAsymPerTensor, ref_zero_point is never initialized and gets read, and scale comparisons mix incompatible spec types. Track the reference spec type and short-circuit on mismatches.
🛠️ Suggested fix (track reference spec type)
 Tensor ref_scale;
 Tensor ref_zero_point;
 bool has_ref = false;
 std::string ref_input_name;
+ir::linalg::QuantizationSpecType ref_spec_type = ir::linalg::QuantizationSpecType::kRaw;
 for (auto iii : inputs) {
   if (!iii->isa_<ir::tensor::TensorValue>()) continue;
   auto tv = iii->cast_<ir::tensor::TensorValue>();
   if (!tv->getAttr("quant_recipe")) continue;
   auto f_spec = tv->getAttr("quant_recipe")->cast_<ir::linalg::LinalgIRQuantizatonSpecAttr>();
   if (f_spec->spec_->type == ir::linalg::QuantizationSpecType::kAsymPerTensor) {
     auto this_spec = std::static_pointer_cast<ir::linalg::QuantizationSpecAsymPerTensor>(f_spec->spec_);
     if (!this_spec->solved) continue;
     if (!has_ref) {
       ref_scale = this_spec->scale;
       ref_zero_point = this_spec->zero_point;
       ref_input_name = tv->name();
+      ref_spec_type = ir::linalg::QuantizationSpecType::kAsymPerTensor;
       has_ref = true;
     } else {
+      if (ref_spec_type != ir::linalg::QuantizationSpecType::kAsymPerTensor) {
+        MLLM_ERROR("PTQPass: ConcatOp '{}' has mismatched quantization spec types between inputs.", op_name);
+        continue;
+      }
       // Check if scale and zero_point match
       ...
     }
   } else if (f_spec->spec_->type == ir::linalg::QuantizationSpecType::kSymPerTensor) {
     auto this_spec = std::static_pointer_cast<ir::linalg::QuantizationSpecSymPerTensor>(f_spec->spec_);
     if (!this_spec->solved) continue;
     if (!has_ref) {
       ref_scale = this_spec->scale;
       ref_input_name = tv->name();
+      ref_spec_type = ir::linalg::QuantizationSpecType::kSymPerTensor;
       has_ref = true;
     } else {
+      if (ref_spec_type != ir::linalg::QuantizationSpecType::kSymPerTensor) {
+        MLLM_ERROR("PTQPass: ConcatOp '{}' has mismatched quantization spec types between inputs.", op_name);
+        continue;
+      }
       // Check if scale matches
       ...
     }
   }
 }
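The idea behind the suggested fix, comparing spec types before touching scale or zero point, can be illustrated in isolation. The sketch below uses simplified, hypothetical stand-ins (`SpecType`, `Spec`, `concat_inputs_consistent`) rather than the real `ir::linalg` classes, and it represents scale and zero point as plain scalars, which is an assumption made only to keep the example self-contained.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical, simplified stand-ins for the IR quantization specs.
enum class SpecType { kAsymPerTensor, kSymPerTensor };

struct Spec {
  SpecType type;
  float scale;
  int zero_point;  // meaningful only for kAsymPerTensor
  bool solved;
};

// Returns true when every solved input agrees with the first solved input
// on spec type, scale, and (for asymmetric specs) zero point.
inline bool concat_inputs_consistent(const std::vector<Spec>& inputs) {
  std::optional<Spec> ref;
  for (const Spec& s : inputs) {
    if (!s.solved) continue;                // unsolved inputs are skipped
    if (!ref) { ref = s; continue; }        // first solved input is the reference
    if (s.type != ref->type) return false;  // short-circuit on mixed spec types
    if (s.scale != ref->scale) return false;
    if (s.type == SpecType::kAsymPerTensor && s.zero_point != ref->zero_point) return false;
  }
  return true;
}
```

Because the type check comes first, the zero point of a symmetric reference is never compared against an asymmetric input, which is the read of an uninitialized `ref_zero_point` that the comment warns about.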
🧹 Nitpick comments (2)
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp (1)
34-104: Add brief doc comments for new public fill APIs.
The new call_fill_* declarations and template wrappers are public entry points; please document parameters/returns/error behavior for clarity and maintainability. As per coding guidelines, please add API doc comments.
mllm/backends/cpu/ops/FillOp.cpp (1)
211-216: Align FP16 arange fallback with end/step semantics.
Lines 213–216 fill numel() elements without honoring end, while other dtypes compute a length derived from start/end/step. Consider matching the same length computation or asserting the expected size to avoid semantic divergence.
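To make the start/end/step length semantics in the arange nitpick concrete, here is a minimal scalar sketch. The names `arange_length` and `fill_arange_checked` are hypothetical, and the half-open `[start, end)` convention is an assumption; the kernels in the PR are SIMD-based and may round differently at range boundaries.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: element count of the half-open range [start, end).
inline std::size_t arange_length(float start, float end, float step) {
  if (step == 0.0f) return 0;  // a zero step defines no sequence in this sketch
  const float n = std::ceil((end - start) / step);
  return n > 0.0f ? static_cast<std::size_t>(n) : 0;
}

// Fills dst with start, start + step, ... and returns the number written.
// The assertion makes the contract explicit: the buffer must match the
// sequence length, so no uninitialized tail is left behind.
template <typename T>
std::size_t fill_arange_checked(T* dst, std::size_t count, float start, float end, float step) {
  const std::size_t n = arange_length(start, end, step);
  assert(n == count && "buffer size must match the arange length");
  const std::size_t m = std::min(n, count);
  for (std::size_t i = 0; i < m; ++i) {
    dst[i] = static_cast<T>(start + static_cast<float>(i) * step);
  }
  return m;
}
```

A caller that sizes its output tensor with `arange_length` and then fills it with `fill_arange_checked` gets the same length on every dtype path, which is the consistency the nitpick is asking for.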
| #include <hwy/highway.h> | ||
| #include <cstring> | ||
| #include "mllm/core/DataTypes.hpp" | ||
|
|
🧩 Analysis chain
🏁 Script executed:
# First, locate and read the file
find . -name "fill-inl.hpp" -type f
Repository: UbiquitousLearning/mllm
Length of output: 114
🏁 Script executed:
# Read the file to verify the includes and function usage
cat -n mllm/backends/cpu/kernels/common/fill-inl.hpp | head -250
Repository: UbiquitousLearning/mllm
Length of output: 10794
Add explicit standard headers for type traits and math utilities.
The file uses std::is_trivial_v (line 35), std::is_same_v (line 219), std::ceil (line 208), and std::min (line 214), but doesn't explicitly include <type_traits>, <cmath>, or <algorithm>. While it may compile due to transitive includes from <hwy/highway.h>, add these headers to make the file self-contained and avoid fragile dependencies.
Proposed fix
 #include <hwy/highway.h>
 #include <cstring>
+#include <algorithm>
+#include <cmath>
+#include <type_traits>
 #include "mllm/core/DataTypes.hpp"
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| #include <hwy/highway.h> | |
| #include <cstring> | |
| #include "mllm/core/DataTypes.hpp" | |
| #include <hwy/highway.h> | |
| #include <cstring> | |
| #include <algorithm> | |
| #include <cmath> | |
| #include <type_traits> | |
| #include "mllm/core/DataTypes.hpp" | |
🧰 Tools
🪛 Clang (14.0.6)
[error] 8-8: 'hwy/highway.h' file not found
(clang-diagnostic-error)
🤖 Prompt for AI Agents
In `@mllm/backends/cpu/kernels/common/fill-inl.hpp` around lines 8 - 11, This file
relies on std::is_trivial_v, std::is_same_v, std::ceil, and std::min but doesn't
include the matching standard headers; add explicit includes for <type_traits>,
<cmath>, and <algorithm> near the top alongside the existing includes (after
`#include <cstring>` / before "mllm/core/DataTypes.hpp") so that uses in
functions/template code referencing std::is_trivial_v, std::is_same_v,
std::ceil, and std::min are self-contained and no longer depend on transitive
headers.
| template<typename T> | ||
| HWY_INLINE void fill_arange_impl(T* HWY_RESTRICT dst, size_t count, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) { | ||
| if (step == 0) { | ||
| fill_value_impl(dst, count, static_cast<T>(start)); | ||
| return; | ||
| } | ||
|
|
||
| // Calculate the actual number of elements to fill | ||
| size_t n = 0; | ||
| if ((step > 0 && start < end) || (step < 0 && start > end)) { | ||
| mllm_fp32_t n_float = (end - start) / step; | ||
| if (n_float > 0) { | ||
| n = static_cast<size_t>(std::ceil(n_float)); | ||
| if (step > 0) { | ||
| if (start + (n - 1) * step >= end) --n; | ||
| } else { | ||
| if (start + (n - 1) * step <= end) --n; | ||
| } | ||
| n = std::min(n, count); | ||
| } | ||
| } | ||
|
|
||
| // Use SIMD for float types where we can vectorize the computation | ||
| if constexpr (std::is_same_v<T, mllm_fp32_t>) { | ||
| const hn::ScalableTag<T> d; | ||
| const size_t N = hn::Lanes(d); | ||
|
|
||
| // Create increment vector: [0, 1, 2, 3, ...] * step | ||
| const hn::Vec<decltype(d)> step_vec = hn::Set(d, step); | ||
| const hn::Vec<decltype(d)> n_step_vec = hn::Set(d, step * static_cast<T>(N)); | ||
|
|
||
| // Create base offsets [0, 1, 2, 3, ...] | ||
| hn::Vec<decltype(d)> base = hn::Iota(d, 0); | ||
| base = hn::Mul(base, step_vec); | ||
| hn::Vec<decltype(d)> current_start = hn::Add(hn::Set(d, start), base); | ||
|
|
||
| size_t idx = 0; | ||
| for (; idx + N <= n; idx += N) { | ||
| hn::StoreU(current_start, d, dst + idx); | ||
| current_start = hn::Add(current_start, n_step_vec); | ||
| } | ||
|
|
||
| // Handle remaining elements | ||
| for (; idx < n; ++idx) { dst[idx] = static_cast<T>(start + idx * step); } | ||
| } else { | ||
| // Scalar fallback for other types | ||
| for (size_t i = 0; i < n; ++i) { dst[i] = static_cast<T>(start + i * step); } | ||
| } | ||
| } |
❓ Verification inconclusive
Script executed:
# First, let's look at the full file context around the function
cat -n mllm/backends/cpu/kernels/common/fill-inl.hpp | head -300 | tail -120
Repository: UbiquitousLearning/mllm
stdout:
181 static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_value_u16(mllm_uint16_t* HWY_RESTRICT dst, size_t size, mllm_uint16_t value) {
182 fill_value_impl(dst, size, value);
183 }
184
185 static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_value_i8(mllm_int8_t* HWY_RESTRICT dst, size_t size, mllm_int8_t value) {
186 fill_value_impl(dst, size, value);
187 }
188
189 static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_value_u8(mllm_uint8_t* HWY_RESTRICT dst, size_t size, mllm_uint8_t value) {
190 fill_value_impl(dst, size, value);
191 }
192
193 //===----------------------------------------------------------------------===//
194 // Fill Arange (start, end, step)
195 //===----------------------------------------------------------------------===//
196 template<typename T>
197 HWY_INLINE void fill_arange_impl(T* HWY_RESTRICT dst, size_t count, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
198 if (step == 0) {
199 fill_value_impl(dst, count, static_cast<T>(start));
200 return;
201 }
202
203 // Calculate the actual number of elements to fill
204 size_t n = 0;
205 if ((step > 0 && start < end) || (step < 0 && start > end)) {
206 mllm_fp32_t n_float = (end - start) / step;
207 if (n_float > 0) {
208 n = static_cast<size_t>(std::ceil(n_float));
209 if (step > 0) {
210 if (start + (n - 1) * step >= end) --n;
211 } else {
212 if (start + (n - 1) * step <= end) --n;
213 }
214 n = std::min(n, count);
215 }
216 }
217
218 // Use SIMD for float types where we can vectorize the computation
219 if constexpr (std::is_same_v<T, mllm_fp32_t>) {
220 const hn::ScalableTag<T> d;
221 const size_t N = hn::Lanes(d);
222
223 // Create increment vector: [0, 1, 2, 3, ...] * step
224 const hn::Vec<decltype(d)> step_vec = hn::Set(d, step);
225 const hn::Vec<decltype(d)> n_step_vec = hn::Set(d, step * static_cast<T>(N));
226
227 // Create base offsets [0, 1, 2, 3, ...]
228 hn::Vec<decltype(d)> base = hn::Iota(d, 0);
229 base = hn::Mul(base, step_vec);
230 hn::Vec<decltype(d)> current_start = hn::Add(hn::Set(d, start), base);
231
232 size_t idx = 0;
233 for (; idx + N <= n; idx += N) {
234 hn::StoreU(current_start, d, dst + idx);
235 current_start = hn::Add(current_start, n_step_vec);
236 }
237
238 // Handle remaining elements
239 for (; idx < n; ++idx) { dst[idx] = static_cast<T>(start + idx * step); }
240 } else {
241 // Scalar fallback for other types
242 for (size_t i = 0; i < n; ++i) { dst[i] = static_cast<T>(start + i * step); }
243 }
244 }
245
246 static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_fp32(mllm_fp32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
247 mllm_fp32_t end, mllm_fp32_t step) {
248 fill_arange_impl(dst, size, start, end, step);
249 }
250
251 static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i32(mllm_int32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
252 mllm_fp32_t end, mllm_fp32_t step) {
253 fill_arange_impl(dst, size, start, end, step);
254 }
255
256 static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u32(mllm_uint32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
257 mllm_fp32_t end, mllm_fp32_t step) {
258 fill_arange_impl(dst, size, start, end, step);
259 }
260
261 static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i64(mllm_int64_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
262 mllm_fp32_t end, mllm_fp32_t step) {
263 fill_arange_impl(dst, size, start, end, step);
264 }
265
266 static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u64(mllm_uint64_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
267 mllm_fp32_t end, mllm_fp32_t step) {
268 fill_arange_impl(dst, size, start, end, step);
269 }
270
271 static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i16(mllm_int16_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
272 mllm_fp32_t end, mllm_fp32_t step) {
273 fill_arange_impl(dst, size, start, end, step);
274 }
275
276 static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u16(mllm_uint16_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
277 mllm_fp32_t end, mllm_fp32_t step) {
278 fill_arange_impl(dst, size, start, end, step);
279 }
280
281 static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i8(mllm_int8_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
282 mllm_fp32_t end, mllm_fp32_t step) {
283 fill_arange_impl(dst, size, start, end, step);
284 }
285
286 static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u8(mllm_uint8_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
287 mllm_fp32_t end, mllm_fp32_t step) {
288 fill_arange_impl(dst, size, start, end, step);
289 }
290
291 //===----------------------------------------------------------------------===//
292 // Fill Random (using LCG random number generator)
293 //===----------------------------------------------------------------------===//
294 template<typename T>
295 HWY_INLINE void fill_random_impl(T* HWY_RESTRICT dst, size_t count, mllm_fp32_t start, mllm_fp32_t end, uint64_t seed) {
296 const uint64_t multiplier = 1103515245ULL;
297 const uint64_t increment = 12345ULL;
298 const uint64_t modulus = 1ULL << 31; // 2^31
299 const mllm_fp32_t range = end - start;
300
Script executed:
# Now search for call sites as suggested in the review comment
rg -n "fill_arange_anytype|call_fill_arange_|fill_arange_impl" --type=cpp --type=hpp -A 3 -B 3
Repository: UbiquitousLearning/mllm
stderr:
unrecognized file type: hpp
Script executed:
# Also search for fill_arange patterns more broadly
rg -n "fill_arange" --type=cpp --type=hpp -A 2 -B 2
Repository: UbiquitousLearning/mllm
stderr:
unrecognized file type: hpp
Script executed:
# Search for call sites of fill_arange functions
rg -n "fill_arange" -A 2 -B 2
Repository: UbiquitousLearning/mllm
stdout:
mllm/backends/opencl/kernels/fill_cl.cpp-5- " dst[index]=value;\n"
mllm/backends/opencl/kernels/fill_cl.cpp-6- "}\n"
mllm/backends/opencl/kernels/fill_cl.cpp:7: "__kernel void fill_arange_fp32(float start,float step,__global float *dst) {\n"
mllm/backends/opencl/kernels/fill_cl.cpp-8- " size_t index=get_global_id(0);\n"
mllm/backends/opencl/kernels/fill_cl.cpp-9- " dst[index]=start+(float)index*step;\n"
--
mllm/backends/opencl/kernels/fill.cl-4-}
mllm/backends/opencl/kernels/fill.cl-5-
mllm/backends/opencl/kernels/fill.cl:6:__kernel void fill_arange_fp32(float start, float step, __global float *dst) {
mllm/backends/opencl/kernels/fill.cl-7- size_t index = get_global_id(0);
mllm/backends/opencl/kernels/fill.cl-8- dst[index] = start + (float)index * step;
--
mllm/backends/opencl/ops/FillOp.cpp-12-
mllm/backends/opencl/ops/FillOp.cpp-13- kernel_fp32_buffer_ = runtime->buildKernel("fill", "fill_fp32", {});
mllm/backends/opencl/ops/FillOp.cpp:14: kernel_arange_fp32_buffer_ = runtime->buildKernel("fill", "fill_arange_fp32", {});
mllm/backends/opencl/ops/FillOp.cpp-15-}
mllm/backends/opencl/ops/FillOp.cpp-16-
--
mllm/backends/opencl/ops/FillOp.cpp-68- cl::NDRange(global_size), cl::NullRange);
mllm/backends/opencl/ops/FillOp.cpp-69- if (error != CL_SUCCESS) {
mllm/backends/opencl/ops/FillOp.cpp:70: MLLM_ERROR_EXIT(ExitCode::kOpenCLError, "Failed to execute fill_arange kernel, error code: {}", error);
mllm/backends/opencl/ops/FillOp.cpp-71- }
mllm/backends/opencl/ops/FillOp.cpp-72- } else {
--
mllm/backends/cpu/ops/FillOp.cpp-203- case kFloat32: {
mllm/backends/cpu/ops/FillOp.cpp-204-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:205: common::fill_arange_anytype(dst.ptr<mllm_fp32_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-206-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:207: arm::fill_arange(dst.ptr<mllm_fp32_t>(), dst.numel(), options_.start, options_.end, options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-208-#endif
mllm/backends/cpu/ops/FillOp.cpp-209- break;
--
mllm/backends/cpu/ops/FillOp.cpp-215- for (size_t i = 0; i < dst.numel(); ++i) { ptr[i] = static_cast<mllm_fp16_t>(options_.start + i * options_.step); }
mllm/backends/cpu/ops/FillOp.cpp-216-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:217: arm::fill_arange_fp16(dst.ptr<mllm_fp16_t>(), dst.numel(), options_.start, options_.end, options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-218-#endif
mllm/backends/cpu/ops/FillOp.cpp-219- break;
--
mllm/backends/cpu/ops/FillOp.cpp-221- case kInt64: {
mllm/backends/cpu/ops/FillOp.cpp-222-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:223: common::fill_arange_anytype(dst.ptr<mllm_int64_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-224-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:225: arm::fill_arange_anytype<mllm_int64_t>(dst.ptr<mllm_int64_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-226- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-227-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-230- case kInt32: {
mllm/backends/cpu/ops/FillOp.cpp-231-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:232: common::fill_arange_anytype(dst.ptr<mllm_int32_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-233-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:234: arm::fill_arange_anytype<mllm_int32_t>(dst.ptr<mllm_int32_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-235- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-236-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-239- case kInt16: {
mllm/backends/cpu/ops/FillOp.cpp-240-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:241: common::fill_arange_anytype(dst.ptr<mllm_int16_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-242-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:243: arm::fill_arange_anytype<mllm_int16_t>(dst.ptr<mllm_int16_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-244- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-245-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-248- case kInt8: {
mllm/backends/cpu/ops/FillOp.cpp-249-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:250: common::fill_arange_anytype(dst.ptr<mllm_int8_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-251-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:252: arm::fill_arange_anytype<mllm_int8_t>(dst.ptr<mllm_int8_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-253- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-254-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-257- case kUInt64: {
mllm/backends/cpu/ops/FillOp.cpp-258-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:259: common::fill_arange_anytype(dst.ptr<mllm_uint64_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-260-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:261: arm::fill_arange_anytype<mllm_uint64_t>(dst.ptr<mllm_uint64_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-262- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-263-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-266- case kUInt32: {
mllm/backends/cpu/ops/FillOp.cpp-267-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:268: common::fill_arange_anytype(dst.ptr<mllm_uint32_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-269-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:270: arm::fill_arange_anytype<mllm_uint32_t>(dst.ptr<mllm_uint32_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-271- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-272-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-275- case kUInt16: {
mllm/backends/cpu/ops/FillOp.cpp-276-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:277: common::fill_arange_anytype(dst.ptr<mllm_uint16_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-278-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:279: arm::fill_arange_anytype<mllm_uint16_t>(dst.ptr<mllm_uint16_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-280- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-281-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-284- case kUInt8: {
mllm/backends/cpu/ops/FillOp.cpp-285-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:286: common::fill_arange_anytype(dst.ptr<mllm_uint8_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-287-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:288: arm::fill_arange_anytype<mllm_uint8_t>(dst.ptr<mllm_uint8_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-289- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-290-#endif
--
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-174-// Fill Arange
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-175-//===----------------------------------------------------------------------===//
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:176:HWY_EXPORT(fill_arange_fp32);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:177:HWY_EXPORT(fill_arange_i32);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:178:HWY_EXPORT(fill_arange_u32);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:179:HWY_EXPORT(fill_arange_i64);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:180:HWY_EXPORT(fill_arange_u64);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:181:HWY_EXPORT(fill_arange_i16);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:182:HWY_EXPORT(fill_arange_u16);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:183:HWY_EXPORT(fill_arange_i8);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:184:HWY_EXPORT(fill_arange_u8);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-185-
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:186:HWY_DLLEXPORT void call_fill_arange_fp32(mllm_fp32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:187: HWY_DYNAMIC_DISPATCH(fill_arange_fp32)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-188-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:189:HWY_DLLEXPORT void call_fill_arange_i32(mllm_int32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:190: HWY_DYNAMIC_DISPATCH(fill_arange_i32)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-191-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:192:HWY_DLLEXPORT void call_fill_arange_u32(mllm_uint32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:193: HWY_DYNAMIC_DISPATCH(fill_arange_u32)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-194-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:195:HWY_DLLEXPORT void call_fill_arange_i64(mllm_int64_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:196: HWY_DYNAMIC_DISPATCH(fill_arange_i64)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-197-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:198:HWY_DLLEXPORT void call_fill_arange_u64(mllm_uint64_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:199: HWY_DYNAMIC_DISPATCH(fill_arange_u64)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-200-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:201:HWY_DLLEXPORT void call_fill_arange_i16(mllm_int16_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:202: HWY_DYNAMIC_DISPATCH(fill_arange_i16)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-203-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:204:HWY_DLLEXPORT void call_fill_arange_u16(mllm_uint16_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:205: HWY_DYNAMIC_DISPATCH(fill_arange_u16)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-206-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:207:HWY_DLLEXPORT void call_fill_arange_i8(mllm_int8_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:208: HWY_DYNAMIC_DISPATCH(fill_arange_i8)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-209-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:210:HWY_DLLEXPORT void call_fill_arange_u8(mllm_uint8_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:211: HWY_DYNAMIC_DISPATCH(fill_arange_u8)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-212-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-213-
--
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-77-// Fill Arange
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-78-//===----------------------------------------------------------------------===//
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:79:HWY_DLLEXPORT void call_fill_arange_fp32(mllm_fp32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:80:HWY_DLLEXPORT void call_fill_arange_i32(mllm_int32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:81:HWY_DLLEXPORT void call_fill_arange_u32(mllm_uint32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:82:HWY_DLLEXPORT void call_fill_arange_i64(mllm_int64_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:83:HWY_DLLEXPORT void call_fill_arange_u64(mllm_uint64_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:84:HWY_DLLEXPORT void call_fill_arange_i16(mllm_int16_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:85:HWY_DLLEXPORT void call_fill_arange_u16(mllm_uint16_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:86:HWY_DLLEXPORT void call_fill_arange_i8(mllm_int8_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:87:HWY_DLLEXPORT void call_fill_arange_u8(mllm_uint8_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-88-
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-89-//===----------------------------------------------------------------------===//
--
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-188-
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-189-template<typename T>
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:190:inline void fill_arange_anytype(T* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-191- if constexpr (std::is_same_v<T, mllm_fp32_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:192: call_fill_arange_fp32(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-193- } else if constexpr (std::is_same_v<T, mllm_int32_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:194: call_fill_arange_i32(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-195- } else if constexpr (std::is_same_v<T, mllm_uint32_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:196: call_fill_arange_u32(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-197- } else if constexpr (std::is_same_v<T, mllm_int64_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:198: call_fill_arange_i64(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-199- } else if constexpr (std::is_same_v<T, mllm_uint64_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:200: call_fill_arange_u64(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-201- } else if constexpr (std::is_same_v<T, mllm_int16_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:202: call_fill_arange_i16(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-203- } else if constexpr (std::is_same_v<T, mllm_uint16_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:204: call_fill_arange_u16(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-205- } else if constexpr (std::is_same_v<T, mllm_int8_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:206: call_fill_arange_i8(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-207- } else if constexpr (std::is_same_v<T, mllm_uint8_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:208: call_fill_arange_u8(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-209- } else {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-210- // Fallback
--
mllm/backends/cpu/kernels/common/fill-inl.hpp-195-//===----------------------------------------------------------------------===//
mllm/backends/cpu/kernels/common/fill-inl.hpp-196-template<typename T>
mllm/backends/cpu/kernels/common/fill-inl.hpp:197:HWY_INLINE void fill_arange_impl(T* HWY_RESTRICT dst, size_t count, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp-198- if (step == 0) {
mllm/backends/cpu/kernels/common/fill-inl.hpp-199- fill_value_impl(dst, count, static_cast<T>(start));
--
mllm/backends/cpu/kernels/common/fill-inl.hpp-244-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-245-
mllm/backends/cpu/kernels/common/fill-inl.hpp:246:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_fp32(mllm_fp32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-247- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:248: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-249-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-250-
mllm/backends/cpu/kernels/common/fill-inl.hpp:251:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i32(mllm_int32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-252- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:253: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-254-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-255-
mllm/backends/cpu/kernels/common/fill-inl.hpp:256:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u32(mllm_uint32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-257- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:258: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-259-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-260-
mllm/backends/cpu/kernels/common/fill-inl.hpp:261:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i64(mllm_int64_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-262- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:263: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-264-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-265-
mllm/backends/cpu/kernels/common/fill-inl.hpp:266:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u64(mllm_uint64_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-267- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:268: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-269-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-270-
mllm/backends/cpu/kernels/common/fill-inl.hpp:271:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i16(mllm_int16_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-272- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:273: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-274-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-275-
mllm/backends/cpu/kernels/common/fill-inl.hpp:276:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u16(mllm_uint16_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-277- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:278: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-279-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-280-
mllm/backends/cpu/kernels/common/fill-inl.hpp:281:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i8(mllm_int8_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-282- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:283: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-284-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-285-
mllm/backends/cpu/kernels/common/fill-inl.hpp:286:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u8(mllm_uint8_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-287- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:288: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-289-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-290-
--
mllm/backends/cpu/kernels/arm/fill.cpp-52-}
mllm/backends/cpu/kernels/arm/fill.cpp-53-
mllm/backends/cpu/kernels/arm/fill.cpp:54:void fill_arange(mllm_fp32_t* __restrict dst, size_t size, float start, float end, float step, int thread_count) {
mllm/backends/cpu/kernels/arm/fill.cpp-55- constexpr size_t vec_size = 4; // 4 floats in NEON
mllm/backends/cpu/kernels/arm/fill.cpp-56-
--
mllm/backends/cpu/kernels/arm/fill.cpp-129-}
mllm/backends/cpu/kernels/arm/fill.cpp-130-
mllm/backends/cpu/kernels/arm/fill.cpp:131:void fill_arange_fp16(mllm_fp16_t* __restrict dst, size_t size, float start, float end, float step, int thread_count) {
mllm/backends/cpu/kernels/arm/fill.cpp-132- constexpr size_t vec_size = 8; // 8 float16_t in NEON
mllm/backends/cpu/kernels/arm/fill.cpp-133-
--
mllm/backends/cpu/kernels/arm/fill.hpp-17-void fill_specific_value(mllm_fp32_t* __restrict dst, size_t size, mllm_fp32_t value, int thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-18-
mllm/backends/cpu/kernels/arm/fill.hpp:19:void fill_arange(mllm_fp32_t* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step,
mllm/backends/cpu/kernels/arm/fill.hpp-20- int thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-21-
--
mllm/backends/cpu/kernels/arm/fill.hpp-29-void fill_specific_value_fp16(mllm_fp16_t* __restrict dst, size_t size, mllm_fp32_t value, int thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-30-
mllm/backends/cpu/kernels/arm/fill.hpp:31:void fill_arange_fp16(mllm_fp16_t* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step,
mllm/backends/cpu/kernels/arm/fill.hpp-32- int thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-33-
--
mllm/backends/cpu/kernels/arm/fill.hpp-94-
mllm/backends/cpu/kernels/arm/fill.hpp-95-template<typename T>
mllm/backends/cpu/kernels/arm/fill.hpp:96:inline void fill_arange_anytype(T* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step,
mllm/backends/cpu/kernels/arm/fill.hpp-97- int thread_count) {
mllm/backends/cpu/kernels/arm/fill.hpp-98- if (step == 0) {
--
mllm/backends/cpu/kernels/arm/fill.hpp-119-
mllm/backends/cpu/kernels/arm/fill.hpp-120-template<>
mllm/backends/cpu/kernels/arm/fill.hpp:121:inline void fill_arange_anytype<mllm_fp32_t>(mllm_fp32_t* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end,
mllm/backends/cpu/kernels/arm/fill.hpp-122- mllm_fp32_t step, int thread_count) {
mllm/backends/cpu/kernels/arm/fill.hpp:123: fill_arange(dst, size, start, end, step, thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-124-}
mllm/backends/cpu/kernels/arm/fill.hpp-125-
mllm/backends/cpu/kernels/arm/fill.hpp-126-template<>
mllm/backends/cpu/kernels/arm/fill.hpp:127:inline void fill_arange_anytype<mllm_fp16_t>(mllm_fp16_t* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end,
mllm/backends/cpu/kernels/arm/fill.hpp-128- mllm_fp32_t step, int thread_count) {
mllm/backends/cpu/kernels/arm/fill.hpp:129: fill_arange_fp16(dst, size, start, end, step, thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-130-}
mllm/backends/cpu/kernels/arm/fill.hpp-131-
--
mllm/backends/cpu/kernels/x86/fill.cpp-109-}
mllm/backends/cpu/kernels/x86/fill.cpp-110-
mllm/backends/cpu/kernels/x86/fill.cpp:111:void fill_arange(mllm_fp32_t* __restrict dst, size_t size, float start, float end, float step, int thread_count) {
mllm/backends/cpu/kernels/x86/fill.cpp-112-#if defined(MLLM_HOST_FEATURE_AVX512F)
mllm/backends/cpu/kernels/x86/fill.cpp-113- constexpr size_t vec_size = 16;
--
mllm/backends/cpu/kernels/x86/fill.hpp-17-void fill_specific_value(mllm_fp32_t* __restrict dst, size_t size, float value, int thread_count);
mllm/backends/cpu/kernels/x86/fill.hpp-18-
mllm/backends/cpu/kernels/x86/fill.hpp:19:void fill_arange(mllm_fp32_t* __restrict dst, size_t size, float start, float end, float step, int thread_count);
mllm/backends/cpu/kernels/x86/fill.hpp-20-
mllm/backends/cpu/kernels/x86/fill.hpp-21-void fill_random(mllm_fp32_t* __restrict dst, size_t size, float start, float end, uint64_t seed, int thread_count);
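The matches above show `fill_arange_anytype` selecting a per-type entry point with an `if constexpr` chain and falling back to something else for unlisted types. A minimal standalone sketch of that pattern follows. The `*_sketch` names and the scalar loop are hypothetical stand-ins, not the repository's code; the real entry points go through `HWY_DYNAMIC_DISPATCH`, and the body of the real fallback branch is not visible in the output.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical scalar kernel: dst[i] = start + i * step, converted to T.
template <typename T>
void scalar_fill_arange_sketch(T* dst, std::size_t n, float start, float step) {
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = static_cast<T>(start + static_cast<float>(i) * step);
  }
}

// Hypothetical stand-ins for the exported per-type entry points
// (call_fill_arange_fp32, call_fill_arange_i32, ...).
inline void call_fill_arange_fp32_sketch(float* dst, std::size_t n, float start, float step) {
  scalar_fill_arange_sketch(dst, n, start, step);
}
inline void call_fill_arange_i32_sketch(std::int32_t* dst, std::size_t n, float start, float step) {
  scalar_fill_arange_sketch(dst, n, start, step);
}

// Compile-time dispatch on the element type. `end` is kept only to mirror
// the signature seen in the search output; this sketch does not use it.
template <typename T>
inline void fill_arange_anytype_sketch(T* dst, std::size_t n, float start, float end, float step) {
  (void)end;
  if constexpr (std::is_same_v<T, float>) {
    call_fill_arange_fp32_sketch(dst, n, start, step);
  } else if constexpr (std::is_same_v<T, std::int32_t>) {
    call_fill_arange_i32_sketch(dst, n, start, step);
  } else {
    // Assumed fallback: a plain scalar loop for any other type.
    scalar_fill_arange_sketch(dst, n, start, step);
  }
}
```

Because the branch is resolved at compile time, each instantiation contains only the call for its own element type, so adding a new type means adding one exported entry point and one branch.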
Script executed:
# Look for where these functions are used
find . -name "*.cpp" -o -name "*.hpp" -o -name "*.h" | head -20
Repository: UbiquitousLearning/mllm
Exit code: 0
stdout:
./algorithms/fancy_algorithm/main.cpp
./algorithms/fancy_algorithm/models/FooNet.hpp
./algorithms/fancy_algorithm/passes/FooPass.hpp
./algorithms/lazy_vlm/HKVCache.cpp
./algorithms/lazy_vlm/HKVCache.hpp
./algorithms/lazy_vlm/HKVCacheFast.cpp
./algorithms/lazy_vlm/HKVCacheFast.hpp
./algorithms/lazy_vlm/LazyVLMQwen2VL.cpp
./algorithms/lazy_vlm/LazyVLMQwen2VLFast.cpp
./algorithms/lazy_vlm/LazyVLMQwen2_5VL.cpp
./algorithms/lazy_vlm/LazyVLMQwen2_5VLFast.cpp
./algorithms/lazy_vlm/models/qwen2_5vl/lazy_vlm_cfg.hpp
./algorithms/lazy_vlm/models/qwen2_5vl/lazy_vlm_cfg_fast.hpp
./algorithms/lazy_vlm/models/qwen2_5vl/modeling_qwen2_5vl.hpp
./algorithms/lazy_vlm/models/qwen2_5vl/modeling_qwen2_5vl_fast.hpp
./algorithms/lazy_vlm/models/qwen2vl/lazy_vlm_cfg.hpp
./algorithms/lazy_vlm/models/qwen2vl/lazy_vlm_cfg_fast.hpp
./algorithms/lazy_vlm/models/qwen2vl/modeling_qwen2vl.hpp
./algorithms/lazy_vlm/models/qwen2vl/modeling_qwen2vl_fast.hpp
./benchmarks/cpu/arm_mllm_blas_sgemm.cpp
Script executed:
# Search more broadly for arange usage
rg -n "arange" -A 2 -B 2
Repository: UbiquitousLearning/mllm
Exit code: 0
stdout:
tests/cpu/CausalMaskOpTest.hpp-17- using namespace mllm; // NOLINT
tests/cpu/CausalMaskOpTest.hpp-18- const int64_t total = static_cast<int64_t>(B) * H * S * D;
tests/cpu/CausalMaskOpTest.hpp:19: auto input = Tensor::arange(0, static_cast<float>(total), 1, kFloat32, kCPU).view({B, H, S, D});
tests/cpu/CausalMaskOpTest.hpp-20- auto output = mask_(input);
tests/cpu/CausalMaskOpTest.hpp-21- auto expected = buildExpectedTensor(input);
--
tests/cpu/PagedAttnTest.hpp-61-
tests/cpu/PagedAttnTest.hpp-62- // Build Index
tests/cpu/PagedAttnTest.hpp:63: auto index = mllm::Tensor::arange(0, S_KV, 1, mllm::kInt32, mllm::kCPU);
tests/cpu/PagedAttnTest.hpp-64- auto mask = mllm::Tensor::zeros({S_Q, S_KV}, mllm::kFloat32, mllm::kCPU);
tests/cpu/PagedAttnTest.hpp-65- auto mask_data = mask.ptr<mllm::mllm_fp32_t>();
--
pymllm/__init__.py-44- zeros,
pymllm/__init__.py-45- ones,
pymllm/__init__.py:46: arange,
pymllm/__init__.py-47- random,
pymllm/__init__.py-48-)
--
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py-589- past_key_values.get_seq_length() if past_key_values is not None else 0
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py-590- )
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py:591: cache_position = torch.arange(
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py-592- past_seen_tokens,
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py-593- past_seen_tokens + inputs_embeds.shape[1],
--
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py-624- mllm_qualcomm_max_length = kwargs.get("mllm_qualcomm_max_length", None)
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py-625- assert mllm_qualcomm_max_length is not None
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py:626: max_position_ids = torch.arange(
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py-627- 0,
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py-628- mllm_qualcomm_max_length,
--
pymllm/ffi/__init__.py-329-
pymllm/ffi/__init__.py-330-
pymllm/ffi/__init__.py:331:def arange(
pymllm/ffi/__init__.py-332- start: float,
pymllm/ffi/__init__.py-333- end: float,
--
pymllm/ffi/__init__.py-338- if isinstance(device_type, str):
pymllm/ffi/__init__.py-339- device_type = device(device_type)
pymllm/ffi/__init__.py:340: return _ffi_api.arange(start, end, step, dtype, device_type)
pymllm/ffi/__init__.py-341-
pymllm/ffi/__init__.py-342-
--
mllm/core/Tensor.cpp-130-}
mllm/core/Tensor.cpp-131-
mllm/core/Tensor.cpp:132:Tensor Tensor::arange(float start, float end, float step, DataTypes dtype, DeviceTypes device) {
mllm/core/Tensor.cpp-133- auto shape = std::vector<int32_t>{static_cast<int32_t>((end - start) / step)};
mllm/core/Tensor.cpp-134- auto i = Tensor::empty(shape, dtype, device).alloc();
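The shape expression in the `Tensor::arange` match above truncates `(end - start) / step` toward zero. A small sketch of how that count compares with the ceiling-based count used by NumPy- and PyTorch-style `arange` is given below. The helper names are hypothetical, and the sketch assumes the visible expression is the whole length computation, which the search output alone does not confirm.

```cpp
#include <cmath>
#include <cstdint>

// Truncating count, mirroring the expression visible in the match.
inline std::int32_t arange_count_trunc(float start, float end, float step) {
  return static_cast<std::int32_t>((end - start) / step);
}

// Ceiling count: number of values start, start + step, ... that lie below end
// (for a positive step).
inline std::int32_t arange_count_ceil(float start, float end, float step) {
  return static_cast<std::int32_t>(std::ceil((end - start) / step));
}
```

The two agree whenever the range is an exact multiple of the step, which covers the integer call sites in the output such as `arange(0, seq_len, 1, kInt64)`. They differ otherwise: for `start = 0, end = 5, step = 2` the truncating form yields 2 elements, while the ceiling form yields 3 (0, 2, 4). Neither helper guards against `step == 0`.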
--
mllm/core/Tensor.hpp-257- * @return Tensor
mllm/core/Tensor.hpp-258- */
mllm/core/Tensor.hpp:259: static Tensor arange(float start, float end, float step, DataTypes dtype = kFloat32, DeviceTypes device = kCPU);
mllm/core/Tensor.hpp-260-
mllm/core/Tensor.hpp-261- /**
--
mllm/ffi/Extension.cc-142- return ::mllm::ffi::Tensor(tensor);
mllm/ffi/Extension.cc-143- });
mllm/ffi/Extension.cc:144: refl::GlobalDef().def("mllm.arange",
mllm/ffi/Extension.cc-145- [](float start, float end, float step, const ::mllm::ffi::DType& dtype,
mllm/ffi/Extension.cc-146- const ::mllm::ffi::Device& device) -> mllm::ffi::Tensor {
mllm/ffi/Extension.cc:147: auto tensor = ::mllm::Tensor::arange(start, end, step, dtype->dtype, device->device);
mllm/ffi/Extension.cc-148- return ::mllm::ffi::Tensor(tensor);
mllm/ffi/Extension.cc-149- });
--
mllm/models/minicpm_o2_6/modeling_siglip.hpp-56- auto max_nb_patches_w = max_im_w / patch_size_;
mllm/models/minicpm_o2_6/modeling_siglip.hpp-57-
mllm/models/minicpm_o2_6/modeling_siglip.hpp:58: // Create boundaries like torch.arange(1 / self.num_patches_per_side, 1.0, 1 / self.num_patches_per_side)
mllm/models/minicpm_o2_6/modeling_siglip.hpp-59- std::vector<float> boundaries;
mllm/models/minicpm_o2_6/modeling_siglip.hpp-60- float step = 1.0f / static_cast<float>(num_patches_per_side_);
--
mllm/models/minicpm_o2_6/modeling_siglip.hpp-79- }
mllm/models/minicpm_o2_6/modeling_siglip.hpp-80-
mllm/models/minicpm_o2_6/modeling_siglip.hpp:81: // Create fractional coordinates like torch.arange(0, 1 - 1e-6, 1 / nb_patches_h/w)
mllm/models/minicpm_o2_6/modeling_siglip.hpp-82- std::vector<float> fractional_coords_h;
mllm/models/minicpm_o2_6/modeling_siglip.hpp-83- std::vector<float> fractional_coords_w;
--
mllm/models/minicpm_o2_6/modeling_siglip.hpp-146- } else {
mllm/models/minicpm_o2_6/modeling_siglip.hpp-147- auto seq_len = embeddings.shape()[1];
mllm/models/minicpm_o2_6/modeling_siglip.hpp:148: auto position_ids = Tensor::arange(0, seq_len, kInt64).view({1, seq_len});
mllm/models/minicpm_o2_6/modeling_siglip.hpp-149- auto pos_embeddings = position_embedding_(position_ids);
mllm/models/minicpm_o2_6/modeling_siglip.hpp-150- embeddings = embeddings + pos_embeddings;
--
mllm/models/minicpm_o2_6/modeling_vector_quantize.hpp-150- */
mllm/models/minicpm_o2_6/modeling_vector_quantize.hpp-151- Tensor createImplicitCodebook() {
mllm/models/minicpm_o2_6/modeling_vector_quantize.hpp:152: auto indices = Tensor::arange(0, static_cast<float>(codebook_size_), 1, kFloat32, kCPU);
mllm/models/minicpm_o2_6/modeling_vector_quantize.hpp-153- return indicesToCodes(indices);
mllm/models/minicpm_o2_6/modeling_vector_quantize.hpp-154- }
--
mllm/models/deepseek_ocr/deepencoder.hpp-94-
mllm/models/deepseek_ocr/deepencoder.hpp-95- // Register a buffer
mllm/models/deepseek_ocr/deepencoder.hpp:96: registerBuffer("position_ids", Tensor::arange(0, num_positions_, 1, kInt64, kCPU).view({1, -1}));
mllm/models/deepseek_ocr/deepencoder.hpp-97- }
mllm/models/deepseek_ocr/deepencoder.hpp-98-
--
mllm/models/minicpm_o2_6/modeling_whisper_encoder.hpp-194-
mllm/models/minicpm_o2_6/modeling_whisper_encoder.hpp-195- // Add positional embeddings
mllm/models/minicpm_o2_6/modeling_whisper_encoder.hpp:196: auto position_ids = Tensor::arange(0, seq_len, 1, kInt64).view({1, seq_len});
mllm/models/minicpm_o2_6/modeling_whisper_encoder.hpp-197- auto position_embeddings = embed_positions_(position_ids);
mllm/models/minicpm_o2_6/modeling_whisper_encoder.hpp-198- hidden_states = hidden_states + position_embeddings;
--
mllm/models/minicpm_o2_6/modeling_chattts.hpp-256- auto position_ids = Tensor::empty({batch_size, input_len}, kInt64, kCPU).alloc();
mllm/models/minicpm_o2_6/modeling_chattts.hpp-257- auto position_ids_ptr = position_ids.ptr<int64_t>();
mllm/models/minicpm_o2_6/modeling_chattts.hpp:258: // Fill position_ids: arange(past_key_values_length, past_key_values_length + input_len)
mllm/models/minicpm_o2_6/modeling_chattts.hpp-259- for (int b = 0; b < batch_size; ++b) {
mllm/models/minicpm_o2_6/modeling_chattts.hpp-260- for (int i = 0; i < input_len; ++i) { position_ids_ptr[b * input_len + i] = past_key_values_length + i; }
--
docs/api/tensor.rst-47- :return: New tensor with initialized one values
docs/api/tensor.rst-48-
docs/api/tensor.rst:49:arange
docs/api/tensor.rst-50-~~~~~~
docs/api/tensor.rst-51-
docs/api/tensor.rst:52:.. cpp:function:: static Tensor Tensor::arange(float start, float end, float step, DataTypes dtype = kFloat32, DeviceTypes device = kCPU)
docs/api/tensor.rst-53-
docs/api/tensor.rst-54- Creates a tensor with evenly spaced values within a specified range.
--
mllm/backends/opencl/ops/FillOp.hpp-17- private:
mllm/backends/opencl/ops/FillOp.hpp-18- std::shared_ptr<KernelWrap> kernel_fp32_buffer_ = nullptr;
mllm/backends/opencl/ops/FillOp.hpp:19: std::shared_ptr<KernelWrap> kernel_arange_fp32_buffer_ = nullptr;
mllm/backends/opencl/ops/FillOp.hpp-20-};
mllm/backends/opencl/ops/FillOp.hpp-21-
--
mllm/backends/opencl/ops/FillOp.cpp-12-
mllm/backends/opencl/ops/FillOp.cpp-13- kernel_fp32_buffer_ = runtime->buildKernel("fill", "fill_fp32", {});
mllm/backends/opencl/ops/FillOp.cpp:14: kernel_arange_fp32_buffer_ = runtime->buildKernel("fill", "fill_arange_fp32", {});
mllm/backends/opencl/ops/FillOp.cpp-15-}
mllm/backends/opencl/ops/FillOp.cpp-16-
--
mllm/backends/opencl/ops/FillOp.cpp-61- float start = options_.start;
mllm/backends/opencl/ops/FillOp.cpp-62- float step = options_.step;
mllm/backends/opencl/ops/FillOp.cpp:63: ret |= kernel_arange_fp32_buffer_->get().setArg(0, sizeof(float), &start);
mllm/backends/opencl/ops/FillOp.cpp:64: ret |= kernel_arange_fp32_buffer_->get().setArg(1, sizeof(float), &step);
mllm/backends/opencl/ops/FillOp.cpp:65: ret |= kernel_arange_fp32_buffer_->get().setArg(2, sizeof(cl_mem), &cl_buffer_dst);
mllm/backends/opencl/ops/FillOp.cpp-66- if (ret != CL_SUCCESS) { MLLM_ERROR("OpenCLFillOp setArg failed: {}", ret); }
mllm/backends/opencl/ops/FillOp.cpp:67: auto error = runtime->commandQueue().enqueueNDRangeKernel(kernel_arange_fp32_buffer_->get(), cl::NullRange,
mllm/backends/opencl/ops/FillOp.cpp-68- cl::NDRange(global_size), cl::NullRange);
mllm/backends/opencl/ops/FillOp.cpp-69- if (error != CL_SUCCESS) {
mllm/backends/opencl/ops/FillOp.cpp:70: MLLM_ERROR_EXIT(ExitCode::kOpenCLError, "Failed to execute fill_arange kernel, error code: {}", error);
mllm/backends/opencl/ops/FillOp.cpp-71- }
mllm/backends/opencl/ops/FillOp.cpp-72- } else {
--
mllm/backends/opencl/kernels/fill_cl.cpp-5- " dst[index]=value;\n"
mllm/backends/opencl/kernels/fill_cl.cpp-6- "}\n"
mllm/backends/opencl/kernels/fill_cl.cpp:7: "__kernel void fill_arange_fp32(float start,float step,__global float *dst) {\n"
mllm/backends/opencl/kernels/fill_cl.cpp-8- " size_t index=get_global_id(0);\n"
mllm/backends/opencl/kernels/fill_cl.cpp-9- " dst[index]=start+(float)index*step;\n"
--
mllm/backends/opencl/kernels/fill.cl-4-}
mllm/backends/opencl/kernels/fill.cl-5-
mllm/backends/opencl/kernels/fill.cl:6:__kernel void fill_arange_fp32(float start, float step, __global float *dst) {
mllm/backends/opencl/kernels/fill.cl-7- size_t index = get_global_id(0);
mllm/backends/opencl/kernels/fill.cl-8- dst[index] = start + (float)index * step;
--
mllm/backends/cpu/ops/FillOp.cpp-203- case kFloat32: {
mllm/backends/cpu/ops/FillOp.cpp-204-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:205: common::fill_arange_anytype(dst.ptr<mllm_fp32_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-206-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:207: arm::fill_arange(dst.ptr<mllm_fp32_t>(), dst.numel(), options_.start, options_.end, options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-208-#endif
mllm/backends/cpu/ops/FillOp.cpp-209- break;
--
mllm/backends/cpu/ops/FillOp.cpp-215- for (size_t i = 0; i < dst.numel(); ++i) { ptr[i] = static_cast<mllm_fp16_t>(options_.start + i * options_.step); }
mllm/backends/cpu/ops/FillOp.cpp-216-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:217: arm::fill_arange_fp16(dst.ptr<mllm_fp16_t>(), dst.numel(), options_.start, options_.end, options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-218-#endif
mllm/backends/cpu/ops/FillOp.cpp-219- break;
--
mllm/backends/cpu/ops/FillOp.cpp-221- case kInt64: {
mllm/backends/cpu/ops/FillOp.cpp-222-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:223: common::fill_arange_anytype(dst.ptr<mllm_int64_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-224-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:225: arm::fill_arange_anytype<mllm_int64_t>(dst.ptr<mllm_int64_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-226- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-227-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-230- case kInt32: {
mllm/backends/cpu/ops/FillOp.cpp-231-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:232: common::fill_arange_anytype(dst.ptr<mllm_int32_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-233-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:234: arm::fill_arange_anytype<mllm_int32_t>(dst.ptr<mllm_int32_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-235- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-236-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-239- case kInt16: {
mllm/backends/cpu/ops/FillOp.cpp-240-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:241: common::fill_arange_anytype(dst.ptr<mllm_int16_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-242-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:243: arm::fill_arange_anytype<mllm_int16_t>(dst.ptr<mllm_int16_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-244- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-245-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-248- case kInt8: {
mllm/backends/cpu/ops/FillOp.cpp-249-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:250: common::fill_arange_anytype(dst.ptr<mllm_int8_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-251-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:252: arm::fill_arange_anytype<mllm_int8_t>(dst.ptr<mllm_int8_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-253- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-254-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-257- case kUInt64: {
mllm/backends/cpu/ops/FillOp.cpp-258-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:259: common::fill_arange_anytype(dst.ptr<mllm_uint64_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-260-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:261: arm::fill_arange_anytype<mllm_uint64_t>(dst.ptr<mllm_uint64_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-262- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-263-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-266- case kUInt32: {
mllm/backends/cpu/ops/FillOp.cpp-267-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:268: common::fill_arange_anytype(dst.ptr<mllm_uint32_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-269-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:270: arm::fill_arange_anytype<mllm_uint32_t>(dst.ptr<mllm_uint32_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-271- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-272-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-275- case kUInt16: {
mllm/backends/cpu/ops/FillOp.cpp-276-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:277: common::fill_arange_anytype(dst.ptr<mllm_uint16_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-278-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:279: arm::fill_arange_anytype<mllm_uint16_t>(dst.ptr<mllm_uint16_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-280- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-281-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-284- case kUInt8: {
mllm/backends/cpu/ops/FillOp.cpp-285-#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
mllm/backends/cpu/ops/FillOp.cpp:286: common::fill_arange_anytype(dst.ptr<mllm_uint8_t>(), dst.numel(), options_.start, options_.end, options_.step);
mllm/backends/cpu/ops/FillOp.cpp-287-#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
mllm/backends/cpu/ops/FillOp.cpp:288: arm::fill_arange_anytype<mllm_uint8_t>(dst.ptr<mllm_uint8_t>(), dst.numel(), options_.start, options_.end,
mllm/backends/cpu/ops/FillOp.cpp-289- options_.step, threads);
mllm/backends/cpu/ops/FillOp.cpp-290-#endif
--
mllm/backends/cpu/ops/FillOp.cpp-292- }
mllm/backends/cpu/ops/FillOp.cpp-293- default: {
mllm/backends/cpu/ops/FillOp.cpp:294: NYI("FillOp::forward[arange] not implemented for this data type");
mllm/backends/cpu/ops/FillOp.cpp-295- }
mllm/backends/cpu/ops/FillOp.cpp-296- }
--
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-77-// Fill Arange
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-78-//===----------------------------------------------------------------------===//
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:79:HWY_DLLEXPORT void call_fill_arange_fp32(mllm_fp32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:80:HWY_DLLEXPORT void call_fill_arange_i32(mllm_int32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:81:HWY_DLLEXPORT void call_fill_arange_u32(mllm_uint32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:82:HWY_DLLEXPORT void call_fill_arange_i64(mllm_int64_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:83:HWY_DLLEXPORT void call_fill_arange_u64(mllm_uint64_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:84:HWY_DLLEXPORT void call_fill_arange_i16(mllm_int16_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:85:HWY_DLLEXPORT void call_fill_arange_u16(mllm_uint16_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:86:HWY_DLLEXPORT void call_fill_arange_i8(mllm_int8_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:87:HWY_DLLEXPORT void call_fill_arange_u8(mllm_uint8_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-88-
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-89-//===----------------------------------------------------------------------===//
--
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-188-
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-189-template<typename T>
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:190:inline void fill_arange_anytype(T* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-191- if constexpr (std::is_same_v<T, mllm_fp32_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:192: call_fill_arange_fp32(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-193- } else if constexpr (std::is_same_v<T, mllm_int32_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:194: call_fill_arange_i32(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-195- } else if constexpr (std::is_same_v<T, mllm_uint32_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:196: call_fill_arange_u32(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-197- } else if constexpr (std::is_same_v<T, mllm_int64_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:198: call_fill_arange_i64(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-199- } else if constexpr (std::is_same_v<T, mllm_uint64_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:200: call_fill_arange_u64(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-201- } else if constexpr (std::is_same_v<T, mllm_int16_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:202: call_fill_arange_i16(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-203- } else if constexpr (std::is_same_v<T, mllm_uint16_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:204: call_fill_arange_u16(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-205- } else if constexpr (std::is_same_v<T, mllm_int8_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:206: call_fill_arange_i8(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-207- } else if constexpr (std::is_same_v<T, mllm_uint8_t>) {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp:208: call_fill_arange_u8(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-209- } else {
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp-210- // Fallback
--
mllm/backends/cpu/kernels/common/fill-inl.hpp-195-//===----------------------------------------------------------------------===//
mllm/backends/cpu/kernels/common/fill-inl.hpp-196-template<typename T>
mllm/backends/cpu/kernels/common/fill-inl.hpp:197:HWY_INLINE void fill_arange_impl(T* HWY_RESTRICT dst, size_t count, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp-198- if (step == 0) {
mllm/backends/cpu/kernels/common/fill-inl.hpp-199- fill_value_impl(dst, count, static_cast<T>(start));
--
mllm/backends/cpu/kernels/common/fill-inl.hpp-244-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-245-
mllm/backends/cpu/kernels/common/fill-inl.hpp:246:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_fp32(mllm_fp32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-247- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:248: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-249-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-250-
mllm/backends/cpu/kernels/common/fill-inl.hpp:251:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i32(mllm_int32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-252- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:253: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-254-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-255-
mllm/backends/cpu/kernels/common/fill-inl.hpp:256:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u32(mllm_uint32_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-257- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:258: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-259-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-260-
mllm/backends/cpu/kernels/common/fill-inl.hpp:261:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i64(mllm_int64_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-262- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:263: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-264-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-265-
mllm/backends/cpu/kernels/common/fill-inl.hpp:266:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u64(mllm_uint64_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-267- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:268: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-269-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-270-
mllm/backends/cpu/kernels/common/fill-inl.hpp:271:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i16(mllm_int16_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-272- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:273: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-274-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-275-
mllm/backends/cpu/kernels/common/fill-inl.hpp:276:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u16(mllm_uint16_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-277- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:278: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-279-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-280-
mllm/backends/cpu/kernels/common/fill-inl.hpp:281:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_i8(mllm_int8_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-282- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:283: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-284-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-285-
mllm/backends/cpu/kernels/common/fill-inl.hpp:286:static HWY_NOINLINE HWY_MAYBE_UNUSED void fill_arange_u8(mllm_uint8_t* HWY_RESTRICT dst, size_t size, mllm_fp32_t start,
mllm/backends/cpu/kernels/common/fill-inl.hpp-287- mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/fill-inl.hpp:288: fill_arange_impl(dst, size, start, end, step);
mllm/backends/cpu/kernels/common/fill-inl.hpp-289-}
mllm/backends/cpu/kernels/common/fill-inl.hpp-290-
--
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-174-// Fill Arange
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-175-//===----------------------------------------------------------------------===//
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:176:HWY_EXPORT(fill_arange_fp32);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:177:HWY_EXPORT(fill_arange_i32);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:178:HWY_EXPORT(fill_arange_u32);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:179:HWY_EXPORT(fill_arange_i64);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:180:HWY_EXPORT(fill_arange_u64);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:181:HWY_EXPORT(fill_arange_i16);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:182:HWY_EXPORT(fill_arange_u16);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:183:HWY_EXPORT(fill_arange_i8);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:184:HWY_EXPORT(fill_arange_u8);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-185-
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:186:HWY_DLLEXPORT void call_fill_arange_fp32(mllm_fp32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:187: HWY_DYNAMIC_DISPATCH(fill_arange_fp32)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-188-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:189:HWY_DLLEXPORT void call_fill_arange_i32(mllm_int32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:190: HWY_DYNAMIC_DISPATCH(fill_arange_i32)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-191-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:192:HWY_DLLEXPORT void call_fill_arange_u32(mllm_uint32_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:193: HWY_DYNAMIC_DISPATCH(fill_arange_u32)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-194-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:195:HWY_DLLEXPORT void call_fill_arange_i64(mllm_int64_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:196: HWY_DYNAMIC_DISPATCH(fill_arange_i64)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-197-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:198:HWY_DLLEXPORT void call_fill_arange_u64(mllm_uint64_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:199: HWY_DYNAMIC_DISPATCH(fill_arange_u64)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-200-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:201:HWY_DLLEXPORT void call_fill_arange_i16(mllm_int16_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:202: HWY_DYNAMIC_DISPATCH(fill_arange_i16)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-203-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:204:HWY_DLLEXPORT void call_fill_arange_u16(mllm_uint16_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:205: HWY_DYNAMIC_DISPATCH(fill_arange_u16)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-206-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:207:HWY_DLLEXPORT void call_fill_arange_i8(mllm_int8_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:208: HWY_DYNAMIC_DISPATCH(fill_arange_i8)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-209-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:210:HWY_DLLEXPORT void call_fill_arange_u8(mllm_uint8_t* dst, size_t n, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp:211: HWY_DYNAMIC_DISPATCH(fill_arange_u8)(dst, n, start, end, step);
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-212-}
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp-213-
--
mllm/backends/cpu/kernels/arm/fill.cpp-52-}
mllm/backends/cpu/kernels/arm/fill.cpp-53-
mllm/backends/cpu/kernels/arm/fill.cpp:54:void fill_arange(mllm_fp32_t* __restrict dst, size_t size, float start, float end, float step, int thread_count) {
mllm/backends/cpu/kernels/arm/fill.cpp-55- constexpr size_t vec_size = 4; // 4 floats in NEON
mllm/backends/cpu/kernels/arm/fill.cpp-56-
--
mllm/backends/cpu/kernels/arm/fill.cpp-58- size_t i = 0;
mllm/backends/cpu/kernels/arm/fill.cpp-59-
mllm/backends/cpu/kernels/arm/fill.cpp:60: // Vectorized arange
mllm/backends/cpu/kernels/arm/fill.cpp-61- float current_value = start;
mllm/backends/cpu/kernels/arm/fill.cpp-62- for (; i < vec_end; i += vec_size) {
--
mllm/backends/cpu/kernels/arm/fill.cpp-129-}
mllm/backends/cpu/kernels/arm/fill.cpp-130-
mllm/backends/cpu/kernels/arm/fill.cpp:131:void fill_arange_fp16(mllm_fp16_t* __restrict dst, size_t size, float start, float end, float step, int thread_count) {
mllm/backends/cpu/kernels/arm/fill.cpp-132- constexpr size_t vec_size = 8; // 8 float16_t in NEON
mllm/backends/cpu/kernels/arm/fill.cpp-133-
--
mllm/backends/cpu/kernels/arm/fill.cpp-135- size_t i = 0;
mllm/backends/cpu/kernels/arm/fill.cpp-136-
mllm/backends/cpu/kernels/arm/fill.cpp:137: // Vectorized arange
mllm/backends/cpu/kernels/arm/fill.cpp-138- float current_value = start;
mllm/backends/cpu/kernels/arm/fill.cpp-139- for (; i < vec_end; i += vec_size) {
--
mllm/backends/cpu/kernels/arm/fill.hpp-17-void fill_specific_value(mllm_fp32_t* __restrict dst, size_t size, mllm_fp32_t value, int thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-18-
mllm/backends/cpu/kernels/arm/fill.hpp:19:void fill_arange(mllm_fp32_t* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step,
mllm/backends/cpu/kernels/arm/fill.hpp-20- int thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-21-
--
mllm/backends/cpu/kernels/arm/fill.hpp-29-void fill_specific_value_fp16(mllm_fp16_t* __restrict dst, size_t size, mllm_fp32_t value, int thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-30-
mllm/backends/cpu/kernels/arm/fill.hpp:31:void fill_arange_fp16(mllm_fp16_t* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step,
mllm/backends/cpu/kernels/arm/fill.hpp-32- int thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-33-
--
mllm/backends/cpu/kernels/arm/fill.hpp-94-
mllm/backends/cpu/kernels/arm/fill.hpp-95-template<typename T>
mllm/backends/cpu/kernels/arm/fill.hpp:96:inline void fill_arange_anytype(T* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step,
mllm/backends/cpu/kernels/arm/fill.hpp-97- int thread_count) {
mllm/backends/cpu/kernels/arm/fill.hpp-98- if (step == 0) {
--
mllm/backends/cpu/kernels/arm/fill.hpp-119-
mllm/backends/cpu/kernels/arm/fill.hpp-120-template<>
mllm/backends/cpu/kernels/arm/fill.hpp:121:inline void fill_arange_anytype<mllm_fp32_t>(mllm_fp32_t* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end,
mllm/backends/cpu/kernels/arm/fill.hpp-122- mllm_fp32_t step, int thread_count) {
mllm/backends/cpu/kernels/arm/fill.hpp:123: fill_arange(dst, size, start, end, step, thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-124-}
mllm/backends/cpu/kernels/arm/fill.hpp-125-
mllm/backends/cpu/kernels/arm/fill.hpp-126-template<>
mllm/backends/cpu/kernels/arm/fill.hpp:127:inline void fill_arange_anytype<mllm_fp16_t>(mllm_fp16_t* __restrict dst, size_t size, mllm_fp32_t start, mllm_fp32_t end,
mllm/backends/cpu/kernels/arm/fill.hpp-128- mllm_fp32_t step, int thread_count) {
mllm/backends/cpu/kernels/arm/fill.hpp:129: fill_arange_fp16(dst, size, start, end, step, thread_count);
mllm/backends/cpu/kernels/arm/fill.hpp-130-}
mllm/backends/cpu/kernels/arm/fill.hpp-131-
--
mllm/backends/cpu/kernels/x86/fill.hpp-17-void fill_specific_value(mllm_fp32_t* __restrict dst, size_t size, float value, int thread_count);
mllm/backends/cpu/kernels/x86/fill.hpp-18-
mllm/backends/cpu/kernels/x86/fill.hpp:19:void fill_arange(mllm_fp32_t* __restrict dst, size_t size, float start, float end, float step, int thread_count);
mllm/backends/cpu/kernels/x86/fill.hpp-20-
mllm/backends/cpu/kernels/x86/fill.hpp-21-void fill_random(mllm_fp32_t* __restrict dst, size_t size, float start, float end, uint64_t seed, int thread_count);
--
mllm/backends/cpu/kernels/x86/fill.cpp-109-}
mllm/backends/cpu/kernels/x86/fill.cpp-110-
mllm/backends/cpu/kernels/x86/fill.cpp:111:void fill_arange(mllm_fp32_t* __restrict dst, size_t size, float start, float end, float step, int thread_count) {
mllm/backends/cpu/kernels/x86/fill.cpp-112-#if defined(MLLM_HOST_FEATURE_AVX512F)
mllm/backends/cpu/kernels/x86/fill.cpp-113- constexpr size_t vec_size = 16;
--
mllm/backends/cpu/kernels/x86/fill.cpp-129- || defined(MLLM_HOST_FEATURE_SSE2) || defined(MLLM_HOST_FEATURE_SSE)
mllm/backends/cpu/kernels/x86/fill.cpp-130- if (vec_size > 1) {
mllm/backends/cpu/kernels/x86/fill.cpp:131: // Vectorized arange
mllm/backends/cpu/kernels/x86/fill.cpp-132- float current_value = start;
mllm/backends/cpu/kernels/x86/fill.cpp-133- for (; i < vec_end; i += vec_size) {
Script executed:
# Look at the complete fill_arange_impl and surrounding context for assertions/documentation
sed -n '193,245p' mllm/backends/cpu/kernels/common/fill-inl.hpp
Repository: UbiquitousLearning/mllm
Exit code: 0
stdout:
//===----------------------------------------------------------------------===//
// Fill Arange (start, end, step)
//===----------------------------------------------------------------------===//
template<typename T>
HWY_INLINE void fill_arange_impl(T* HWY_RESTRICT dst, size_t count, mllm_fp32_t start, mllm_fp32_t end, mllm_fp32_t step) {
if (step == 0) {
fill_value_impl(dst, count, static_cast<T>(start));
return;
}
// Calculate the actual number of elements to fill
size_t n = 0;
if ((step > 0 && start < end) || (step < 0 && start > end)) {
mllm_fp32_t n_float = (end - start) / step;
if (n_float > 0) {
n = static_cast<size_t>(std::ceil(n_float));
if (step > 0) {
if (start + (n - 1) * step >= end) --n;
} else {
if (start + (n - 1) * step <= end) --n;
}
n = std::min(n, count);
}
}
// Use SIMD for float types where we can vectorize the computation
if constexpr (std::is_same_v<T, mllm_fp32_t>) {
const hn::ScalableTag<T> d;
const size_t N = hn::Lanes(d);
// Create increment vector: [0, 1, 2, 3, ...] * step
const hn::Vec<decltype(d)> step_vec = hn::Set(d, step);
const hn::Vec<decltype(d)> n_step_vec = hn::Set(d, step * static_cast<T>(N));
// Create base offsets [0, 1, 2, 3, ...]
hn::Vec<decltype(d)> base = hn::Iota(d, 0);
base = hn::Mul(base, step_vec);
hn::Vec<decltype(d)> current_start = hn::Add(hn::Set(d, start), base);
size_t idx = 0;
for (; idx + N <= n; idx += N) {
hn::StoreU(current_start, d, dst + idx);
current_start = hn::Add(current_start, n_step_vec);
}
// Handle remaining elements
for (; idx < n; ++idx) { dst[idx] = static_cast<T>(start + idx * step); }
} else {
// Scalar fallback for other types
for (size_t i = 0; i < n; ++i) { dst[i] = static_cast<T>(start + i * step); }
}
}
Script executed:
# Check FillOp.cpp for any validation of count vs computed arange length
sed -n '195,300p' mllm/backends/cpu/ops/FillOp.cpp
Repository: UbiquitousLearning/mllm
Exit code: 0
stdout:
NYI("FillOp::forward[ones] not implemented for this data type");
break;
}
}
break;
}
case aops::FillOpTypes::kArange: {
switch (dst.dtype()) {
case kFloat32: {
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
common::fill_arange_anytype(dst.ptr<mllm_fp32_t>(), dst.numel(), options_.start, options_.end, options_.step);
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
arm::fill_arange(dst.ptr<mllm_fp32_t>(), dst.numel(), options_.start, options_.end, options_.step, threads);
#endif
break;
}
case kFloat16: {
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
// FP16 not directly supported by Highway on x86, use scalar fallback
auto ptr = dst.ptr<mllm_fp16_t>();
for (size_t i = 0; i < dst.numel(); ++i) { ptr[i] = static_cast<mllm_fp16_t>(options_.start + i * options_.step); }
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
arm::fill_arange_fp16(dst.ptr<mllm_fp16_t>(), dst.numel(), options_.start, options_.end, options_.step, threads);
#endif
break;
}
case kInt64: {
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
common::fill_arange_anytype(dst.ptr<mllm_int64_t>(), dst.numel(), options_.start, options_.end, options_.step);
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
arm::fill_arange_anytype<mllm_int64_t>(dst.ptr<mllm_int64_t>(), dst.numel(), options_.start, options_.end,
options_.step, threads);
#endif
break;
}
case kInt32: {
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
common::fill_arange_anytype(dst.ptr<mllm_int32_t>(), dst.numel(), options_.start, options_.end, options_.step);
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
arm::fill_arange_anytype<mllm_int32_t>(dst.ptr<mllm_int32_t>(), dst.numel(), options_.start, options_.end,
options_.step, threads);
#endif
break;
}
case kInt16: {
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
common::fill_arange_anytype(dst.ptr<mllm_int16_t>(), dst.numel(), options_.start, options_.end, options_.step);
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
arm::fill_arange_anytype<mllm_int16_t>(dst.ptr<mllm_int16_t>(), dst.numel(), options_.start, options_.end,
options_.step, threads);
#endif
break;
}
case kInt8: {
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
common::fill_arange_anytype(dst.ptr<mllm_int8_t>(), dst.numel(), options_.start, options_.end, options_.step);
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
arm::fill_arange_anytype<mllm_int8_t>(dst.ptr<mllm_int8_t>(), dst.numel(), options_.start, options_.end,
options_.step, threads);
#endif
break;
}
case kUInt64: {
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
common::fill_arange_anytype(dst.ptr<mllm_uint64_t>(), dst.numel(), options_.start, options_.end, options_.step);
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
arm::fill_arange_anytype<mllm_uint64_t>(dst.ptr<mllm_uint64_t>(), dst.numel(), options_.start, options_.end,
options_.step, threads);
#endif
break;
}
case kUInt32: {
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
common::fill_arange_anytype(dst.ptr<mllm_uint32_t>(), dst.numel(), options_.start, options_.end, options_.step);
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
arm::fill_arange_anytype<mllm_uint32_t>(dst.ptr<mllm_uint32_t>(), dst.numel(), options_.start, options_.end,
options_.step, threads);
#endif
break;
}
case kUInt16: {
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
common::fill_arange_anytype(dst.ptr<mllm_uint16_t>(), dst.numel(), options_.start, options_.end, options_.step);
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
arm::fill_arange_anytype<mllm_uint16_t>(dst.ptr<mllm_uint16_t>(), dst.numel(), options_.start, options_.end,
options_.step, threads);
#endif
break;
}
case kUInt8: {
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
common::fill_arange_anytype(dst.ptr<mllm_uint8_t>(), dst.numel(), options_.start, options_.end, options_.step);
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
arm::fill_arange_anytype<mllm_uint8_t>(dst.ptr<mllm_uint8_t>(), dst.numel(), options_.start, options_.end,
options_.step, threads);
#endif
break;
}
default: {
NYI("FillOp::forward[arange] not implemented for this data type");
}
}
break;
}
case aops::FillOpTypes::kRandom: {
switch (dst.dtype()) {
Script executed:
# Check Tensor.cpp for how arange creates the tensor with proper shape
sed -n '130,145p' mllm/core/Tensor.cpp
Repository: UbiquitousLearning/mllm
Exit code: 0
stdout:
}
Tensor Tensor::arange(float start, float end, float step, DataTypes dtype, DeviceTypes device) {
auto shape = std::vector<int32_t>{static_cast<int32_t>((end - start) / step)};
auto i = Tensor::empty(shape, dtype, device).alloc();
return Context::instance().buildOpAndSubmitTask(
OpTypes::kFill, aops::FillOpOptions{.type = aops::FillOpTypes::kArange, .start = start, .end = end, .step = step},
{i})[0];
}
Tensor Tensor::random(const std::vector<int32_t>& shape, float start, float end, DataTypes dtype, DeviceTypes device) {
auto i = Tensor::empty(shape, dtype, device).alloc();
return Context::instance().buildOpAndSubmitTask(
OpTypes::kFill,
aops::FillOpOptions{
.type = aops::FillOpTypes::kRandom, .start = start, .end = end, .seed = Context::instance().getRandomState()},
Script executed:
# Search for any existing comments or assertions about the arange contract
grep -n "assert\|ASSERT\|contract\|precondition" mllm/backends/cpu/kernels/common/fill-inl.hpp
Repository: UbiquitousLearning/mllm
Exit code: 1
Ensure the contract is explicit when arange length doesn't match count.
The function computes n from start/end/step and only fills n elements (line 214: n = std::min(n, count)). If count > n, the tail remains uninitialized. While std::min() prevents overflow, it silently allows gaps. Consider adding an assertion, an error check, or explicit documentation clarifying that callers must ensure count == ceil((end - start) / step), or explicitly fill any remainder.
🤖 Prompt for AI Agents
In `@mllm/backends/cpu/kernels/common/fill-inl.hpp` around lines 196 - 244, The
function fill_arange_impl computes n (the number of sequence elements) and may
leave dst[n..count-1] uninitialized; make the contract explicit and avoid silent
gaps by: after computing n, add a runtime check/assert that either n == count
(e.g., assert(n == count) or throw) or, if you want to tolerate count > n,
explicitly fill the remainder via fill_value_impl(dst + n, count - n,
static_cast<T>(start + n * step)) so no tail remains uninitialized; update
callers/docs accordingly. Reference: fill_arange_impl, variable n, and
fill_value_impl.
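One way to make that contract explicit is sketched below. This is a minimal illustration rather than the mllm implementation: `arange_length` and `fill_arange_checked` are hypothetical helper names, the length is taken as a plain `ceil((end - start) / step)` in double precision, and filling the tail with a caller-supplied value is an assumed policy (an assertion that `n == count` would be the stricter alternative).

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of mllm): number of elements in the half-open
// range [start, end) when walked with `step`. Returns 0 for a zero step or a
// step that points away from `end`.
inline std::size_t arange_length(float start, float end, float step) {
  if (step == 0.0f) return 0;
  const double n = std::ceil((static_cast<double>(end) - start) / step);
  return n > 0.0 ? static_cast<std::size_t>(n) : 0;
}

// Hypothetical helper: writes the sequence into dst[0..n) and then fills
// dst[n..count) with `tail_value`, so no element of the buffer is left
// uninitialized. Returns n so the caller can detect a length mismatch.
template<typename T>
std::size_t fill_arange_checked(T* dst, std::size_t count, float start, float end, float step,
                                T tail_value = T{}) {
  const std::size_t n = std::min(arange_length(start, end, step), count);
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = static_cast<T>(start + static_cast<float>(i) * step);
  }
  std::fill(dst + n, dst + count, tail_value);  // explicit tail policy (assumption)
  return n;
}
```

A caller that treats a mismatch as a bug can compare the return value against `count` and raise an error. Note that the `Tensor::arange` snippet shown earlier sizes the tensor with a truncating cast of `(end - start) / step`, so the two lengths can differ when the division is not exact.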
#include <cstring>
#include "mllm/core/DataTypes.hpp"
🧩 Analysis chain
🏁 Script executed:
find . -name "kernel_dispatch.hpp" -type f
Repository: UbiquitousLearning/mllm
Length of output: 121
🏁 Script executed:
cat -n mllm/backends/cpu/kernels/common/kernel_dispatch.hpp | head -120
Repository: UbiquitousLearning/mllm
Length of output: 9016
🏁 Script executed:
head -20 mllm/backends/cpu/kernels/common/kernel_dispatch.hpp
Repository: UbiquitousLearning/mllm
Length of output: 794
🏁 Script executed:
rg "#include.*type_traits" mllm/backends/cpu/kernels/common/kernel_dispatch.hpp
Repository: UbiquitousLearning/mllm
Length of output: 49
🏁 Script executed:
grep "^#include" mllm/backends/cpu/kernels/common/kernel_dispatch.hpp
Repository: UbiquitousLearning/mllm
Length of output: 199
Add missing #include <type_traits> for std::is_same_v usage.
The template fill_zeros_anytype at line 105+ uses std::is_same_v (lines 107, 109, 111, 113, 115, 117, 119, and beyond), which requires the <type_traits> header. The file currently does not include it, relying on transitive includes which is unreliable.
Proposed fix
 #include <cstring>
+#include <type_traits>
 #include "mllm/core/DataTypes.hpp"
🤖 Prompt for AI Agents
In `@mllm/backends/cpu/kernels/common/kernel_dispatch.hpp` around lines 10 - 12,
The file is missing `#include` <type_traits> required by the template
fill_zeros_anytype which uses std::is_same_v; add the header to the top of
kernel_dispatch.hpp (alongside <cstring> and "mllm/core/DataTypes.hpp") so
std::is_same_v resolves properly and avoids relying on transitive includes
referenced in fill_zeros_anytype.
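To make the dependency concrete, here is a minimal sketch of the same compile-time dispatch pattern with its own includes spelled out. The names demo_fill_zeros, demo_fill_zeros_f32 and demo_fill_zeros_i32 are hypothetical stand-ins, not mllm's kernels, and the integer return tag exists only so the chosen branch can be observed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>  // declares std::is_same_v; included directly, not transitively

// Hypothetical typed kernels standing in for call_fill_zeros_fp32 and friends.
inline void demo_fill_zeros_f32(float* dst, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) dst[i] = 0.0f;
}
inline void demo_fill_zeros_i32(std::int32_t* dst, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) dst[i] = 0;
}

// Picks a kernel at compile time. Returns 1 or 2 for the typed kernels and
// 0 for the byte-wise fallback, purely for illustration.
template <typename T>
inline int demo_fill_zeros(T* dst, std::size_t n) {
  assert(dst != nullptr || n == 0);
  if constexpr (std::is_same_v<T, float>) {
    demo_fill_zeros_f32(dst, n);
    return 1;
  } else if constexpr (std::is_same_v<T, std::int32_t>) {
    demo_fill_zeros_i32(dst, n);
    return 2;
  } else {
    // This sketch only allows the byte-wise path for trivially copyable types.
    static_assert(std::is_trivially_copyable_v<T>,
                  "byte-wise zeroing needs a trivially copyable type");
    if (n != 0) std::memset(dst, 0, n * sizeof(T));
    return 0;
  }
}
```

Because every header the template needs is listed at the top, the sketch keeps compiling even if unrelated headers stop including <type_traits> on their own.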
template<typename T>
inline void fill_zeros_anytype(T* dst, size_t n) {
  if constexpr (std::is_same_v<T, mllm_fp32_t>) {
    call_fill_zeros_fp32(dst, n);
  } else if constexpr (std::is_same_v<T, mllm_fp64_t>) {
    call_fill_zeros_fp64(dst, n);
  } else if constexpr (std::is_same_v<T, mllm_int32_t>) {
    call_fill_zeros_i32(dst, n);
  } else if constexpr (std::is_same_v<T, mllm_uint32_t>) {
    call_fill_zeros_u32(dst, n);
  } else if constexpr (std::is_same_v<T, mllm_int64_t>) {
    call_fill_zeros_i64(dst, n);
  } else if constexpr (std::is_same_v<T, mllm_uint64_t>) {
    call_fill_zeros_u64(dst, n);
  } else if constexpr (std::is_same_v<T, mllm_int16_t>) {
    call_fill_zeros_i16(dst, n);
  } else if constexpr (std::is_same_v<T, mllm_uint16_t>) {
    call_fill_zeros_u16(dst, n);
  } else if constexpr (std::is_same_v<T, mllm_int8_t>) {
    call_fill_zeros_i8(dst, n);
  } else if constexpr (std::is_same_v<T, mllm_uint8_t>) {
    call_fill_zeros_u8(dst, n);
  } else {
    // Fallback for unsupported types
    std::memset(dst, 0, n * sizeof(T));
  }
}
🧩 Analysis chain
🏁 Script executed:
# Check if the file exists and read the context around lines 105-131
cat -n mllm/backends/cpu/kernels/common/kernel_dispatch.hpp | sed -n '90,140p'
Repository: UbiquitousLearning/mllm
Length of output: 3224
🏁 Script executed:
# Also check for any template constraints or type requirements
cat -n mllm/backends/cpu/kernels/common/kernel_dispatch.hpp | head -110 | tail -30
Repository: UbiquitousLearning/mllm
Length of output: 2869
🏁 Script executed:
# Search for any type traits or constraints applied to fill_zeros_anytype
rg "fill_zeros_anytype" -A 10 -B 10 mllm/backends/cpu/kernels/common/kernel_dispatch.hpp
Repository: UbiquitousLearning/mllm
Length of output: 1402
🏁 Script executed:
# Search for where fill_zeros_anytype is used
rg "fill_zeros_anytype" -B 2 -A 2
Repository: UbiquitousLearning/mllm
Length of output: 6805
🏁 Script executed:
# Check if there's similar code in fill_ones_anytype to see the pattern
cat -n mllm/backends/cpu/kernels/common/kernel_dispatch.hpp | sed -n '133,160p'
Repository: UbiquitousLearning/mllm
Length of output: 1327
🌐 Web query:
std::memset non-trivial types C++ standard undefined behavior
💡 Result:
Short answer: Using std::memset on objects whose type is non‑trivially‑copyable (i.e. non‑trivial) is undefined behavior. For trivially‑copyable objects you may manipulate their object representation with byte‑wise functions, but you must respect the standard's restrictions (e.g. not a potentially‑overlapping subobject); the resulting value representation may be implementation‑defined. Prefer constructors / std::fill / type‑safe initialization for non‑trivial types.
Sources:
- std::memset specification (undefined if object is not TriviallyCopyable or is a potentially‑overlapping subobject). [1]
- Object / object‑representation and trivially‑copyable rules (why bytewise ops are only guaranteed for trivially‑copyable). [2]
- Discussion / examples and standard pointers about memcpy/memset use and UB for non‑trivial types. [3]
References:
[1] cppreference — std::memset.
[2] cppreference — Object representation / trivially‑copyable.
[3] Relevant discussion (Stack Overflow / standard excerpts) about memcpy/memset and trivially‑copyable vs non‑trivial types.
Guard memset fallback against non-trivial types.
Calling std::memset on an object whose type is not trivially copyable is undefined behavior per the C++ standard. The template accepts any type without constraints, so the fallback should check std::is_trivial_v<T> (which implies trivially copyable) before using std::memset. For all other types, consider the safer element-wise approach used in fill_ones_anytype.
Suggested fix
   } else {
     // Fallback for unsupported types
-    std::memset(dst, 0, n * sizeof(T));
+    if constexpr (std::is_trivial_v<T>) {
+      std::memset(dst, 0, n * sizeof(T));
+    } else {
+      for (size_t i = 0; i < n; ++i) { dst[i] = T{}; }
+    }
   }
📝 Committable suggestion
template<typename T>
inline void fill_zeros_anytype(T* dst, size_t n) {
  if constexpr (std::is_same_v<T, mllm_fp32_t>) {
    call_fill_zeros_fp32(dst, n);
  } else if constexpr (std::is_same_v<T, mllm_fp64_t>) {
    call_fill_zeros_fp64(dst, n);
  } else if constexpr (std::is_same_v<T, mllm_int32_t>) {
    call_fill_zeros_i32(dst, n);
  } else if constexpr (std::is_same_v<T, mllm_uint32_t>) {
    call_fill_zeros_u32(dst, n);
  } else if constexpr (std::is_same_v<T, mllm_int64_t>) {
    call_fill_zeros_i64(dst, n);
  } else if constexpr (std::is_same_v<T, mllm_uint64_t>) {
    call_fill_zeros_u64(dst, n);
  } else if constexpr (std::is_same_v<T, mllm_int16_t>) {
    call_fill_zeros_i16(dst, n);
  } else if constexpr (std::is_same_v<T, mllm_uint16_t>) {
    call_fill_zeros_u16(dst, n);
  } else if constexpr (std::is_same_v<T, mllm_int8_t>) {
    call_fill_zeros_i8(dst, n);
  } else if constexpr (std::is_same_v<T, mllm_uint8_t>) {
    call_fill_zeros_u8(dst, n);
  } else {
    // Fallback for unsupported types
    if constexpr (std::is_trivial_v<T>) {
      std::memset(dst, 0, n * sizeof(T));
    } else {
      for (size_t i = 0; i < n; ++i) { dst[i] = T{}; }
    }
  }
}
🤖 Prompt for AI Agents
In `@mllm/backends/cpu/kernels/common/kernel_dispatch.hpp` around lines 105 - 131,
The fallback in fill_zeros_anytype currently uses std::memset for all unmatched
types; restrict that to only trivial types by checking std::is_trivial_v<T> and
for non-trivial types perform an element-wise zero assignment (e.g., for (size_t
i=0;i<n;++i) dst[i] = T{}), mirroring the safe approach used in
fill_ones_anytype; update the else branch in fill_zeros_anytype to first if
constexpr (std::is_trivial_v<T>) use std::memset, else use the element-wise loop
to avoid undefined behavior on non-trivial types.
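The guarded fallback can be tried in isolation. The following is a minimal sketch, assuming a hypothetical helper named demo_zero_fallback that contains only the else branch of the suggestion; it is not mllm code. It uses std::string as an example of a non-trivial element type, for which the element-wise path assigns a value-initialized object instead of overwriting bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <type_traits>

// Hypothetical helper mirroring only the guarded fallback branch.
template <typename T>
inline void demo_zero_fallback(T* dst, std::size_t n) {
  assert(dst != nullptr || n == 0);
  if constexpr (std::is_trivial_v<T>) {
    // Trivial types are trivially copyable, so byte-wise zeroing is permitted.
    if (n != 0) std::memset(dst, 0, n * sizeof(T));
  } else {
    // Non-trivial types go through their assignment operator instead.
    for (std::size_t i = 0; i < n; ++i) { dst[i] = T{}; }
  }
}
```

With this shape, demo_zero_fallback on an int array takes the memset path, while the same call on an array of std::string leaves every element as a valid empty string. Note that for non-trivial types "zero" here means value-initialized, which is whatever T{} produces for that type.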
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
  // TODO
  // FP16 not directly supported by Highway on x86, use scalar fallback
  std::memset(dst.ptr<mllm_fp16_t>(), 0, dst.numel() * sizeof(mllm_fp16_t));
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
🧩 Analysis chain
🏁 Script executed:
find . -name "FillOp.cpp" -type f
Repository: UbiquitousLearning/mllm
Length of output: 167
🏁 Script executed:
cat -n ./mllm/backends/cpu/ops/FillOp.cpp
Repository: UbiquitousLearning/mllm
Length of output: 26749
🏁 Script executed:
cat -n ./mllm/backends/cpu/ops/FillOp.hpp | head -20
Repository: UbiquitousLearning/mllm
Length of output: 772
🏁 Script executed:
cat -n ./mllm/backends/cpu/kernels/Kernels.hpp | head -30
Repository: UbiquitousLearning/mllm
Length of output: 2061
🏁 Script executed:
cat -n ./mllm/backends/cpu/kernels/x86/fill.hpp | head -20
Repository: UbiquitousLearning/mllm
Length of output: 825
🏁 Script executed:
rg "#include <cstring>" ./mllm/backends/cpu/kernels/
Repository: UbiquitousLearning/mllm
Length of output: 612
🏁 Script executed:
rg "#include.*fill-inl.hpp" ./mllm/backends/cpu/kernels/
Repository: UbiquitousLearning/mllm
Length of output: 178
🏁 Script executed:
cat -n ./mllm/backends/cpu/kernels/x86/fill.hpp
Repository: UbiquitousLearning/mllm
Length of output: 1017
🏁 Script executed:
rg "#include.*fill-inl.hpp" ./mllm/backends/cpu/kernels/x86/
Repository: UbiquitousLearning/mllm
Length of output: 49
🏁 Script executed:
find ./mllm/backends/cpu/kernels/x86 -name "fill.cpp"
Repository: UbiquitousLearning/mllm
Length of output: 107
🏁 Script executed:
cat -n ./mllm/backends/cpu/kernels/x86/fill.cpp | head -30
Repository: UbiquitousLearning/mllm
Length of output: 1351
🏁 Script executed:
rg "#include" ./mllm/backends/cpu/kernels/x86/fill.cpp
Repository: UbiquitousLearning/mllm
Length of output: 206
🏁 Script executed:
cat -n ./mllm/utils/PlatformRTHelper.hpp
Repository: UbiquitousLearning/mllm
Length of output: 2824
🏁 Script executed:
cat -n ./mllm/core/DataTypes.hpp | head -30
Repository: UbiquitousLearning/mllm
Length of output: 1200
🏁 Script executed:
rg "#include <cstring>" ./mllm/core/ ./mllm/backends/cpu/ops/
Repository: UbiquitousLearning/mllm
Length of output: 2406
Add <cstring> include for std::memset usage on line 33.
The std::memset call at line 33 requires the <cstring> header. All other CPU operation files in the same directory include this header; add it to maintain consistency and ensure the file is self-contained.
🛠️ Proposed fix
 #include "mllm/backends/cpu/ops/FillOp.hpp"
 #include "mllm/backends/cpu/kernels/Kernels.hpp"
 #include "mllm/utils/PlatformRTHelper.hpp"
+#include <cstring>
📝 Committable suggestion
#include "mllm/backends/cpu/ops/FillOp.hpp"
#include "mllm/backends/cpu/kernels/Kernels.hpp"
#include "mllm/backends/cpu/utils/PlatformRTHelper.hpp"
#include <cstring>
🤖 Prompt for AI Agents
In `@mllm/backends/cpu/ops/FillOp.cpp` around lines 31 - 34, The file uses
std::memset in FillOp.cpp (inside the x86/x86_64 branch) but does not include
<cstring>, so add the missing include to the top of the file; update FillOp.cpp
to #include <cstring> (alongside other headers) so std::memset is declared and
the file is self-contained and consistent with other CPU ops.
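As a rough illustration of why the scalar fallback is a plain byte fill, the sketch below zeroes a buffer of 16-bit half-precision values with std::memset and includes <cstring> itself. DemoHalf and demo_fill_zeros_fp16 are hypothetical names; DemoHalf merely models raw IEEE 754 binary16 storage and is an assumption about the layout, not mllm's definition of mllm_fp16_t.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>  // declares std::memset; included directly
#include <type_traits>

// Hypothetical stand-in for a half-precision element: raw binary16 bits.
struct DemoHalf {
  std::uint16_t bits;
};
static_assert(std::is_trivially_copyable_v<DemoHalf>,
              "byte-wise zeroing needs a trivially copyable type");

// Scalar fallback: in IEEE 754 binary16 the all-zero bit pattern is +0.0,
// so clearing the bytes sets every element to zero.
inline void demo_fill_zeros_fp16(DemoHalf* dst, std::size_t n) {
  assert(dst != nullptr || n == 0);
  if (n != 0) std::memset(dst, 0, n * sizeof(DemoHalf));
}
```

The explicit n != 0 check avoids passing a null pointer to std::memset for empty tensors, which is a small defensive choice of this sketch.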
Summary by CodeRabbit
New Features
Improvements
New Exports