-
Notifications
You must be signed in to change notification settings - Fork 2
feat: Treesitter and goblin #7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…atform mismatch detection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements tree-sitter based Ruby symbol extraction and goblin-based binary architecture validation for gem files. The implementation adds three new CLI commands (sync, index, validate) and extends the database schema to store extracted symbols.
- Adds tree-sitter integration to extract class/module definitions from Ruby source files
- Implements goblin-based binary parsing to detect and validate architecture mismatches in native gems
- Extends the database schema with a
gem_symbolstable for storing indexed Ruby symbols
Reviewed changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 14 comments.
Show a summary per file
| File | Description |
|---|---|
| src/gem_metadata/symbols.rs | New module for extracting Ruby class/module symbols using tree-sitter queries |
| src/gem_metadata/indexer.rs | New module to index all Ruby files in a gem and extract their symbols |
| src/gem_metadata/binary_arch.rs | New module for detecting binary architecture and validating against claimed gem platforms |
| src/gem_metadata/analyzer.rs | Extended with binary architecture validation functionality |
| src/gem_metadata.rs | Module declarations and public API exports for new functionality |
| crates/vein-admin/src/main.rs | Refactored CLI to support subcommands (serve, sync, index, validate) |
| crates/vein-admin/src/commands.rs | Module declaration for new command implementations |
| crates/vein-admin/src/commands/index.rs | Command implementation for indexing gem symbols into the database |
| crates/vein-admin/src/commands/validate.rs | Command implementation for validating binary architectures in gems |
| crates/vein-adapter/src/cache.rs | Added trait methods for symbol storage and retrieval |
| crates/vein-adapter/src/cache/sqlite.rs | SQLite implementation for gem_symbols table and symbol operations |
| crates/vein-adapter/src/cache/postgres.rs | PostgreSQL implementation for gem_symbols table and symbol operations |
| Cargo.toml | Added dependencies for tree-sitter (0.26.3), tree-sitter-ruby (0.23.1), and goblin (0.9) |
| Cargo.lock | Lock file updates for new dependencies and their transitive dependencies |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| #[test] | ||
| fn test_extract_nested_class() { | ||
| let source = r#" | ||
| module Foo | ||
| class Bar | ||
| end | ||
| end | ||
| "#; | ||
| let symbols = extract_symbols(source).unwrap(); | ||
| assert_eq!(symbols.len(), 2); | ||
| assert!(symbols | ||
| .iter() | ||
| .any(|s| s.name == "Foo" && s.symbol_type == SymbolType::Module)); | ||
| assert!(symbols | ||
| .iter() | ||
| .any(|s| s.name == "Bar" && s.symbol_type == SymbolType::Class)); | ||
| } |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test lacks assertion for the parent field. The test for nested classes (test_extract_nested_class) verifies that 'Foo' and 'Bar' are extracted but doesn't verify that the nested 'Bar' class has the correct parent relationship. According to the current implementation, nested classes defined as 'module Foo; class Bar; end; end' won't capture the parent relationship because the tree-sitter query only handles scope resolution syntax (::), not lexical nesting. This should either be tested and documented, or the implementation should be updated to handle nested definitions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. The additional test is warranted.
| parent_name TEXT, | ||
| line_number INTEGER, | ||
| created_at TIMESTAMP DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')), | ||
| UNIQUE(gem_name, gem_version, gem_platform, file_path, symbol_name) |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The UNIQUE constraint includes gem_platform in a way that may not handle NULL values as expected across all versions of SQLite. When gem_platform is NULL, multiple rows with the same gem_name, gem_version, file_path, and symbol_name but different NULL gem_platform values might be treated as distinct. Consider using a partial unique index or adjusting the constraint to ensure consistent behavior.
| DELETE FROM gem_symbols | ||
| WHERE gem_name = ?1 | ||
| AND gem_version = ?2 | ||
| AND ((gem_platform IS NULL AND ?3 IS NULL) OR gem_platform = ?3) |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The NULL comparison logic in the WHERE clause may not work as intended. In SQL, NULL != NULL, so the condition '(gem_platform IS NULL AND ?3 IS NULL)' will not match when both are NULL because the parameter binding doesn't preserve NULL semantics in this comparison. This should use proper NULL-safe comparison or conditional logic.
| AND ((gem_platform IS NULL AND ?3 IS NULL) OR gem_platform = ?3) | |
| AND ( | |
| (?3 IS NULL AND gem_platform IS NULL) | |
| OR | |
| (?3 IS NOT NULL AND gem_platform = ?3) | |
| ) |
| DELETE FROM gem_symbols | ||
| WHERE gem_name = $1 | ||
| AND gem_version = $2 | ||
| AND ((gem_platform IS NULL AND $3 IS NULL) OR gem_platform = $3) |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The NULL comparison logic in the WHERE clause may not work as intended. In PostgreSQL, while the syntax is different from SQLite, the condition '(gem_platform IS NULL AND $3 IS NULL)' still has issues. Parameter $3 will never be IS NULL in this context - it's either bound to a value or NULL. The proper way to handle this is to use COALESCE or conditional SQL logic to ensure NULL values are matched correctly.
| AND ((gem_platform IS NULL AND $3 IS NULL) OR gem_platform = $3) | |
| AND gem_platform IS NOT DISTINCT FROM $3 |
src/gem_metadata/analyzer.rs
Outdated
| if is_binary { | ||
| // Read the file contents | ||
| let mut contents = Vec::new(); | ||
| if let Err(_) = data_entry.read_to_end(&mut contents) { |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Error value is discarded with underscore binding. When a file can't be read, the error information could be valuable for debugging. Consider logging the error or at least including it in the warning message context.
| if let Err(_) = data_entry.read_to_end(&mut contents) { | |
| if let Err(err) = data_entry.read_to_end(&mut contents) { | |
| eprintln!("Warning: failed to read binary file '{}': {}", path_str, err); |
| async fn validate_gem_version_with_platform( | ||
| config: &VeinConfig, | ||
| name: &str, | ||
| version: &str, | ||
| platform: Option<&str>, | ||
| ) -> Result<bool> { | ||
| // Parse version which might contain platform (e.g., "0.0.52-arm64-darwin") | ||
| let (actual_version, claimed_platform): (String, Option<String>) = if platform.is_some() { | ||
| (version.to_string(), platform.map(|s| s.to_string())) | ||
| } else { | ||
| // Try to extract platform from version string | ||
| // E.g., "0.0.52-arm64-darwin" -> ("0.0.52", Some("arm64-darwin")) | ||
| // Split by '-' to separate version from platform | ||
| let parts: Vec<&str> = version.splitn(2, '-').collect(); | ||
| if parts.len() == 2 { | ||
| // First part is version (e.g., "0.0.52"), second is platform (e.g., "arm64-darwin") | ||
| (parts[0].to_string(), Some(parts[1].to_string())) | ||
| } else { | ||
| (version.to_string(), None) | ||
| } |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The version parsing logic here is fragile. Splitting on '-' with splitn(2, '-') assumes the version number never contains a hyphen, but some gem versions do (e.g., "1.0.0-pre1"). This could incorrectly parse "1.0.0-pre1" as version "1.0.0" with platform "pre1". Consider using a more robust parsing strategy that takes into account valid platform strings.
| async fn validate_gem_version_with_platform( | |
| config: &VeinConfig, | |
| name: &str, | |
| version: &str, | |
| platform: Option<&str>, | |
| ) -> Result<bool> { | |
| // Parse version which might contain platform (e.g., "0.0.52-arm64-darwin") | |
| let (actual_version, claimed_platform): (String, Option<String>) = if platform.is_some() { | |
| (version.to_string(), platform.map(|s| s.to_string())) | |
| } else { | |
| // Try to extract platform from version string | |
| // E.g., "0.0.52-arm64-darwin" -> ("0.0.52", Some("arm64-darwin")) | |
| // Split by '-' to separate version from platform | |
| let parts: Vec<&str> = version.splitn(2, '-').collect(); | |
| if parts.len() == 2 { | |
| // First part is version (e.g., "0.0.52"), second is platform (e.g., "arm64-darwin") | |
| (parts[0].to_string(), Some(parts[1].to_string())) | |
| } else { | |
| (version.to_string(), None) | |
| } | |
| fn looks_like_rubygems_platform(s: &str) -> bool { | |
| // Common RubyGems platform identifiers | |
| const KNOWN_PLATFORMS: &[&str] = &[ | |
| "ruby", | |
| "jruby", | |
| "java", | |
| "x86-mswin32", | |
| "x64-mswin64", | |
| "x86-mingw32", | |
| "x64-mingw32", | |
| "x64-mingw-ucrt", | |
| "x86_64-linux", | |
| "x86-linux", | |
| "x86_64-darwin", | |
| "arm64-darwin", | |
| "universal-darwin", | |
| "x86_64-freebsd", | |
| "x86_64-openbsd", | |
| "x86_64-solaris", | |
| ]; | |
| if KNOWN_PLATFORMS.contains(&s) { | |
| return true; | |
| } | |
| // Fallback: accept simple CPU-OS[-version] style strings with safe characters | |
| fn valid_segment(seg: &str) -> bool { | |
| !seg.is_empty() | |
| && seg | |
| .chars() | |
| .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '.') | |
| } | |
| let mut parts = s.split('-'); | |
| let first = match parts.next() { | |
| Some(p) => p, | |
| None => return false, | |
| }; | |
| let second = match parts.next() { | |
| Some(p) => p, | |
| None => return false, | |
| }; | |
| let third = parts.next(); | |
| // If there are more than three segments, it's unlikely to be a standard platform string. | |
| if parts.next().is_some() { | |
| return false; | |
| } | |
| if !valid_segment(first) || !valid_segment(second) { | |
| return false; | |
| } | |
| if let Some(third) = third { | |
| if !valid_segment(third) { | |
| return false; | |
| } | |
| } | |
| true | |
| } | |
| async fn validate_gem_version_with_platform( | |
| config: &VeinConfig, | |
| name: &str, | |
| version: &str, | |
| platform: Option<&str>, | |
| ) -> Result<bool> { | |
| // Parse version which might contain platform (e.g., "0.0.52-arm64-darwin"). | |
| // Only treat a suffix as platform if it looks like a valid RubyGems platform. | |
| let (actual_version, claimed_platform): (String, Option<String>) = if let Some(p) = platform { | |
| (version.to_string(), Some(p.to_string())) | |
| } else if let Some(idx) = version.find('-') { | |
| // Split into potential version and platform components. | |
| let (maybe_version, rest) = version.split_at(idx); | |
| // `rest` starts with '-', so skip it to get the candidate platform. | |
| let maybe_platform = &rest[1..]; | |
| if looks_like_rubygems_platform(maybe_platform) { | |
| (maybe_version.to_string(), Some(maybe_platform.to_string())) | |
| } else { | |
| // Hyphen is part of the version (e.g., "1.0.0-pre1"), not a platform separator. | |
| (version.to_string(), None) | |
| } | |
| } else { | |
| (version.to_string(), None) |
| .filter_map(|(gem_name, version_platform)| { | ||
| if gem_name == name { | ||
| Some((version_platform, None)) |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function always returns the second element of tuples with None as the platform. Looking at the tuple structure (String, Option<String>), it appears the version_platform variable is being treated as just version, but the platform information is being lost. The return type suggests this should preserve platform information from the cache, but it's hardcoded to None.
| .filter_map(|(gem_name, version_platform)| { | |
| if gem_name == name { | |
| Some((version_platform, None)) | |
| .filter_map(|(gem_name, (version, platform))| { | |
| if gem_name == name { | |
| Some((version, platform)) |
crates/vein-admin/src/main.rs
Outdated
| None => { | ||
| // Default to serve if no subcommand provided | ||
| let config_path = std::path::Path::new("crates/vein-admin/config"); | ||
| let app = loco_rs::boot::create_app::<App, migration::Migrator>( | ||
| StartMode::ServerOnly, | ||
| &environment, | ||
| loco_rs::config::Config::from_folder(&environment, config_path)?, | ||
| ) | ||
| .await?; | ||
|
|
||
| let binding = app.app_context.config.server.binding.clone(); | ||
| let port = app.app_context.config.server.port; | ||
| let serve_params = ServeParams { port, binding }; | ||
|
|
||
| boot::start::<App>(app, serve_params, false).await?; |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate code block for default server startup. The code in the None branch (lines 142-154) duplicates the logic from the Serve command branch (lines 101-119), except it doesn't use CLI arguments for bind and port. This duplication makes maintenance harder. Consider extracting this into a helper function or restructuring to avoid repetition.
src/gem_metadata/indexer.rs
Outdated
| pub fn index_gem(gem_path: &Path) -> Result<Vec<FileSymbols>> { | ||
| let file = | ||
| File::open(gem_path).with_context(|| format!("opening gem at {}", gem_path.display()))?; | ||
| let mut archive = Archive::new(file); | ||
|
|
||
| let mut results = Vec::new(); | ||
|
|
||
| // Find and process data.tar.gz | ||
| for entry in archive.entries().context("reading gem archive entries")? { | ||
| let entry = entry.context("accessing gem archive entry")?; | ||
| let path = entry.path().context("reading entry path")?.into_owned(); | ||
|
|
||
| if path.as_os_str().to_string_lossy() == "data.tar.gz" { | ||
| let decoder = GzDecoder::new(entry); | ||
| let symbols = process_data_tar(decoder)?; | ||
| results = symbols; | ||
| break; | ||
| } | ||
| } | ||
|
|
||
| Ok(results) |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function returns an empty Vec when data.tar.gz is not found, which silently succeeds even though no symbols were indexed. This could be misleading - if a gem file doesn't contain data.tar.gz or has a malformed structure, the function returns Ok with an empty result without any indication that something went wrong. Consider returning an error if data.tar.gz is not found.
src/gem_metadata/analyzer.rs
Outdated
| // Check if this is a binary file by extension or location | ||
| let path_str_lower = path_str.to_ascii_lowercase(); | ||
| let is_binary = file_path | ||
| .extension() | ||
| .and_then(|s| s.to_str()) | ||
| .map(|ext| matches!(ext, "so" | "dll" | "dylib" | "bundle" | "exe")) | ||
| .unwrap_or(false) | ||
| || path_str_lower.starts_with("exe/") | ||
| || path_str_lower.starts_with("bin/"); | ||
|
|
||
| if is_binary { | ||
| // Read the file contents | ||
| let mut contents = Vec::new(); | ||
| if let Err(_) = data_entry.read_to_end(&mut contents) { | ||
| continue; // Skip files we can't read | ||
| } | ||
|
|
||
| // Detect architecture | ||
| if let Ok(binary_info) = detect_binary_arch(&contents) { | ||
| let is_match = matches_platform(claimed_platform, &binary_info); | ||
|
|
||
| if !is_match { | ||
| mismatches.push(path_str.to_string()); | ||
| } | ||
|
|
||
| detected_binaries.push((path_str.to_string(), binary_info)); | ||
| } |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Binary detection by file extension may produce false positives. Files in bin/ or exe/ directories might be shell scripts, Ruby scripts, or other non-binary executables. The code should verify that the file is actually a binary by attempting to parse it with goblin before classifying it as a binary. Currently, script files would be incorrectly flagged as binaries and fail parsing, which is caught by the Ok() check, but this approach is inefficient and could miss actual binaries that fail parsing for other reasons.
| // Check if this is a binary file by extension or location | |
| let path_str_lower = path_str.to_ascii_lowercase(); | |
| let is_binary = file_path | |
| .extension() | |
| .and_then(|s| s.to_str()) | |
| .map(|ext| matches!(ext, "so" | "dll" | "dylib" | "bundle" | "exe")) | |
| .unwrap_or(false) | |
| || path_str_lower.starts_with("exe/") | |
| || path_str_lower.starts_with("bin/"); | |
| if is_binary { | |
| // Read the file contents | |
| let mut contents = Vec::new(); | |
| if let Err(_) = data_entry.read_to_end(&mut contents) { | |
| continue; // Skip files we can't read | |
| } | |
| // Detect architecture | |
| if let Ok(binary_info) = detect_binary_arch(&contents) { | |
| let is_match = matches_platform(claimed_platform, &binary_info); | |
| if !is_match { | |
| mismatches.push(path_str.to_string()); | |
| } | |
| detected_binaries.push((path_str.to_string(), binary_info)); | |
| } | |
| // Read the file contents | |
| let mut contents = Vec::new(); | |
| if let Err(_) = data_entry.read_to_end(&mut contents) { | |
| continue; // Skip files we can't read | |
| } | |
| // Detect architecture; only treat as binary if parsing succeeds | |
| if let Ok(binary_info) = detect_binary_arch(&contents) { | |
| let is_match = matches_platform(claimed_platform, &binary_info); | |
| if !is_match { | |
| mismatches.push(path_str.to_string()); | |
| } | |
| detected_binaries.push((path_str.to_string(), binary_info)); |
pboling
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The copilot suggestions for additional tests are great ideas. Otherwise looks like a good intial implementation.
| @@ -0,0 +1,167 @@ | |||
| use anyhow::{Context, Result}; | |||
| use tree_sitter::{Parser, Query, QueryCursor, StreamingIterator}; | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
| "class_scope" | "module_scope" => { | ||
| parent = Some(node_text.to_string()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, a more complex test is warranted.
| #[test] | ||
| fn test_extract_nested_class() { | ||
| let source = r#" | ||
| module Foo | ||
| class Bar | ||
| end | ||
| end | ||
| "#; | ||
| let symbols = extract_symbols(source).unwrap(); | ||
| assert_eq!(symbols.len(), 2); | ||
| assert!(symbols | ||
| .iter() | ||
| .any(|s| s.name == "Foo" && s.symbol_type == SymbolType::Module)); | ||
| assert!(symbols | ||
| .iter() | ||
| .any(|s| s.name == "Bar" && s.symbol_type == SymbolType::Class)); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. The additional test is warranted.
|
|
||
| # Tree-sitter for Ruby code analysis | ||
| tree-sitter = "0.26.3" | ||
| tree-sitter-ruby = "0.23.1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
…orm parsing, error handling, tests, code quality)
Here is the implementation ..
This was cherrypicked ,