Open
53 commits
f4231a8
Update settings for vs code
Synicix Nov 6, 2025
6e37676
Add custom hasher framework
Synicix Nov 6, 2025
28761b3
Fix clippy errors
Synicix Nov 6, 2025
e6336c5
Add list hashing
Synicix Nov 7, 2025
1fc9bf5
Add decimal hashing
Synicix Nov 7, 2025
72b1fb8
Rename hasher to arrow_digester
Synicix Nov 7, 2025
e377ee8
Add String hashing
Synicix Nov 8, 2025
df3a21b
Add binary and string hashing
Synicix Nov 8, 2025
1f6577a
Add time hashing
Synicix Nov 8, 2025
1fff344
Change to new custom hasher and remove old one.
Synicix Nov 13, 2025
5a4371a
Add rust tests
Synicix Nov 13, 2025
934e6e7
Fix all clippy recommendations
Synicix Nov 14, 2025
da3a892
Remove incorrect categories
Synicix Nov 14, 2025
80653d1
Update categories
Synicix Nov 14, 2025
8da263c
Update clippy actions
Synicix Nov 14, 2025
4730702
Update read me to include section about hashing
Synicix Dec 4, 2025
2b9da46
Change delimiter from _ to __
Synicix Dec 4, 2025
485a544
Add test for field name extraction and fix logic bug
Synicix Dec 4, 2025
c2ff003
Up the version due to bug
Synicix Dec 4, 2025
669647a
Update README.md
Synicix Dec 4, 2025
882adca
Update README.md
Synicix Dec 4, 2025
3f083ba
Update src/arrow_digester.rs
Synicix Dec 4, 2025
a908bba
Update src/arrow_digester.rs
Synicix Dec 4, 2025
9dce44e
Change Vec to array since it is more flexiable
Synicix Dec 4, 2025
616101b
Add check for hash
Synicix Dec 4, 2025
3921886
Fix double negation
Synicix Dec 4, 2025
bcb984c
Update field name to deal with possible underflow
Synicix Dec 4, 2025
535615f
Fix nested field hash
Synicix Dec 4, 2025
115d80e
Remove casting scale because arrow allow for negative scale for rounding
Synicix Dec 4, 2025
48cd2cb
Update version with bug fixes
Synicix Dec 4, 2025
5bb6414
Move some dependenices into dev
Synicix Dec 4, 2025
ca3426c
Fix decimal and possible string array collision
Synicix Dec 4, 2025
08bc1f1
Remove unused panic
Synicix Dec 4, 2025
cb9cf80
Fix cargo fmt error
Synicix Dec 4, 2025
9ce2993
Change postcard hashing to json hashing
Synicix Dec 8, 2025
81726f5
Change delimiter to /
Synicix Dec 8, 2025
932abe1
Add binary array len hashing to resolve hash collision problem & tests
Synicix Dec 9, 2025
58ac701
Save progress on redesigning null handling
Synicix Dec 9, 2025
7728bfe
Patch nullbits handling and included datatypes into schema definition
Synicix Dec 10, 2025
c2e2564
Move actual arrow_digester logic to core and private it, while making…
Synicix Dec 11, 2025
63fb32a
Up clippy version
Synicix Dec 11, 2025
d4a233e
Update hashing to meet new arrow format
Synicix Dec 11, 2025
5a86fbc
Remove stale file that was already move to the lib module
Synicix Dec 11, 2025
b9b6384
Add nullable and non-nullable tests
Synicix Dec 11, 2025
45cb028
Add 3 bytes at the start for versioning
Synicix Dec 11, 2025
2f866e4
Add documentation about hashing
Synicix Dec 11, 2025
4f4b577
Add test to confirm update in batches and hashing all at once results…
Synicix Dec 11, 2025
23fc982
Add test to check for consistent hashing when one batch is null but t…
Synicix Dec 12, 2025
bfa2b17
feat: Remove some python interp settings
Synicix Jan 6, 2026
8b9db23
feat: update some stale comments
Synicix Jan 6, 2026
70effd5
feat: Expose new functions to python side
Synicix Jan 7, 2026
3591940
feat: remove eadianness file
Synicix Jan 7, 2026
26990d8
Fix fmt error
Synicix Jan 7, 2026
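
The commit "Add binary array len hashing to resolve hash collision problem & tests" suggests that variable-length values are framed with their length before being hashed. The following is only a minimal sketch of why that can help, not the code from this PR: the helper names `frame_naive` and `frame_with_len` are hypothetical, and the choice of a little-endian `u64` length prefix is an assumption. It compares the byte streams that would be fed to a hasher rather than calling a real digest.

```rust
/// Hypothetical helper: concatenate values with no framing at all.
fn frame_naive(values: &[&[u8]]) -> Vec<u8> {
    values.iter().flat_map(|v| v.iter().copied()).collect()
}

/// Hypothetical helper: prefix each value with its length
/// (assumed here to be a little-endian u64) before its bytes.
fn frame_with_len(values: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    for v in values {
        out.extend_from_slice(&(v.len() as u64).to_le_bytes());
        out.extend_from_slice(v);
    }
    out
}

fn main() {
    // Two different arrays whose concatenated bytes are identical ("abc").
    let a: [&[u8]; 2] = [b"ab", b"c"];
    let b: [&[u8]; 2] = [b"a", b"bc"];

    // Without framing, the hasher would see the same input for both arrays.
    assert_eq!(frame_naive(&a), frame_naive(&b));

    // With a length prefix, the element boundaries become part of the input.
    assert_ne!(frame_with_len(&a), frame_with_len(&b));

    println!("naive streams equal: {}", frame_naive(&a) == frame_naive(&b));
    println!("framed streams equal: {}", frame_with_len(&a) == frame_with_len(&b));
}
```

Because the length is written before each value, two arrays can only produce the same stream if they agree on every element boundary, which is the property the collision fix appears to rely on.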
19 changes: 19 additions & 0 deletions .github/workflows/clippy.yml
Contributor

As far as I can tell, this only runs the checks and does not apply the formatting, does it? Shall we make it so that the code gets auto-formatted?

Collaborator Author

In VS Code, with the settings I set, it auto-formats when you save. Or do you mean having the GitHub Action force a commit that auto-formats the code?

@@ -0,0 +1,19 @@
name: rust_checks
on:
  - pull_request
jobs:
  rust-syntax-style-format-and-integration:
    runs-on: ubuntu-latest
    env:
      CARGO_TERM_COLOR: always
    steps:
      - uses: actions/checkout@v4
      - name: Install Rust + components
        uses: actions-rust-lang/setup-rust-toolchain@v1
        with:
          toolchain: 1.91.1
          components: rustfmt,clippy
      - name: Run syntax and style tests
        run: cargo clippy --all-targets -- -D warnings
      - name: Run format test
        run: cargo fmt --check
26 changes: 26 additions & 0 deletions .vscode/settings.json
Contributor

Generally, I don't think we should include VS Code settings. If we do, we should keep them minimal and limited to things that apply to everyone.

Collaborator Author

A lot of it lets people who use VS Code have everything set up with the correct configuration, such as rust-analyzer and formatting. Ideally I want to keep it somewhere, but I can simplify it to the pure essentials.

Contributor

Let's reduce this to the bare minimum. There are definitely some entries, like the Python 3 interpreter path, that shouldn't be set and can't be expected to be the same across different working environments.

@@ -0,0 +1,26 @@
{
  "[markdown]": {
    "editor.defaultFormatter": null
  },
  "editor.formatOnPaste": false,
  "editor.formatOnSave": true,
  "editor.rulers": [
    100
  ],
  "files.autoSave": "off",
  "files.insertFinalNewline": true,
  "gitlens.showWhatsNewAfterUpgrades": false,
  "lldb.consoleMode": "evaluate",
  "rust-analyzer.check.command": "clippy",
  "rust-analyzer.checkOnSave": true,
  "rust-analyzer.runnables.extraTestBinaryArgs": [
    "--nocapture"
  ],
  "rust-analyzer.rustfmt.extraArgs": [
    "--config",
    "max_width=100"
  ],
  "notebook.formatOnSave.enabled": true,
  "notebook.output.scrolling": true,
  "python.terminal.activateEnvironment": false
}
82 changes: 77 additions & 5 deletions Cargo.toml
@@ -1,14 +1,34 @@
[package]
name = "starfix"
version = "0.1.3"
edition = "2024"
version = "0.0.2"
edition = "2021"
description = "Package for hashing Arrow's data structures uniquely for identifying and comparing data efficiently."
authors = ["synicix <synicix@gmail.com>"]
readme = "README.md"
repository = "https://github.com/nauticalab/starfix"
license = "MIT OR Apache-2.0"
keywords = ["arrow", "hashing"]
categories = ["algorithms"]

[dependencies]
arrow = { version = "56.0.0", features = ["ffi"] }
arrow-digest = "56.0.0"
arrow = { version = "57.0.0", features = ["ffi"] }
arrow-schema = { version = "57.0.0", features = ["serde"] }
bitvec = "1.0.1"
digest = "0.10.7"
indoc = "2.0.7"

postcard = "1.1.3"

serde = "1.0.228"
serde_json = "1.0.145"
sha2 = "0.10.9"
# automated CFFI + bindings in other languages
uniffi = { version = "0.29.4", features = ["cli", "tokio"] }
uniffi = { version = "0.30.0", features = ["cli", "tokio"] }

[dev-dependencies]
hex = "0.4.3"
pretty_assertions = "1.4.1"


[[bin]]
name = "uniffi-bindgen"
@@ -22,3 +42,55 @@ crate-type = ["rlib", "cdylib"]

[package.metadata.release]
publish = false


[lints.clippy]
cargo = "deny"
complexity = "deny"
correctness = "deny"
nursery = "deny"
pedantic = "deny"
perf = "deny"
restriction = "deny"
style = "deny"
suspicious = "deny"

min_ident_chars = { level = "allow", priority = 127 } # allow for variables that is one char
arbitrary_source_item_ordering = { level = "allow", priority = 127 } # allow arbitrary ordering to keep relevant code nearby
as_conversions = { level = "allow", priority = 127 } # allow casting
blanket_clippy_restriction_lints = { level = "allow", priority = 127 } # allow setting all restrictions so we can omit specific ones
default_numeric_fallback = { level = "allow", priority = 127 } # allow type inferred by numeric literal, detection is buggy
disallowed_script_idents = { level = "allow", priority = 127 } # skip since we use only ascii
exhaustive_enums = { level = "allow", priority = 127 } # remove requirement to label enum as exhaustive
exhaustive_structs = { level = "allow", priority = 127 } # revisit once lib is ready to be used externally
field_scoped_visibility_modifiers = { level = "allow", priority = 127 } # allow field-level visibility modifiers
float_arithmetic = { level = "allow", priority = 127 } # allow float arithmetic
impl_trait_in_params = { level = "allow", priority = 127 } # impl in params ok
implicit_return = { level = "allow", priority = 127 } # missing return ok
iter_over_hash_type = { level = "allow", priority = 127 } # allow iterating over unordered iterables like `HashMap`
little_endian_bytes = { level = "allow", priority = 127 } # allow to_le_bytes / from_le_bytes
missing_docs_in_private_items = { level = "allow", priority = 127 } # missing docs on private ok
missing_inline_in_public_items = { level = "allow", priority = 127 } # let rust compiler determine best inline logic
missing_trait_methods = { level = "allow", priority = 127 } # allow in favor of rustc `implement the missing item`
module_name_repetitions = { level = "allow", priority = 127 } # allow use of module name in type names
multiple_crate_versions = { level = "allow", priority = 127 } # allow since list of exceptions changes frequently from external
multiple_inherent_impl = { level = "allow", priority = 127 } # required in best practice to limit exposure over UniFFI
must_use_candidate = { level = "allow", priority = 127 } # omitting #[must_use] ok
mod_module_files = { level = "allow", priority = 127 } # mod directories ok
non_ascii_literal = { level = "allow", priority = 127 } # non-ascii char in string literal ok
partial_pub_fields = { level = "allow", priority = 127 } # partial struct pub fields ok
pattern_type_mismatch = { level = "allow", priority = 127 } # allow in favor of clippy::ref_patterns
print_stderr = { level = "allow", priority = 127 } # stderr prints ok
print_stdout = { level = "allow", priority = 127 } # stdout prints ok
pub_use = { level = "allow", priority = 127 } # ok to structure source into many files but clean up import
pub_with_shorthand = { level = "allow", priority = 127 } # allow use of pub(super)
question_mark_used = { level = "allow", priority = 127 } # allow question operator
self_named_module_files = { level = "allow", priority = 127 } # mod files ok
separated_literal_suffix = { level = "allow", priority = 127 } # literal suffixes should be separated by underscore
single_call_fn = { level = "allow", priority = 127 } # allow functions called only once, which allows better code organization
single_char_lifetime_names = { level = "allow", priority = 127 } # single char lifetimes ok
std_instead_of_alloc = { level = "allow", priority = 127 } # we should use std when possible
std_instead_of_core = { level = "allow", priority = 127 } # we should use std when possible
string_add = { level = "allow", priority = 127 } # simple concat ok
use_debug = { level = "warn", priority = 127 } # debug print
wildcard_enum_match_arm = { level = "allow", priority = 127 } # allow wildcard match arm in enums