Skip to content

Conversation

@seuros
Copy link
Collaborator

@seuros seuros commented Jan 6, 2026

Here is the implementation ..

This was cherrypicked ,

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements tree-sitter based Ruby symbol extraction and goblin-based binary architecture validation for gem files. The implementation adds three new CLI commands (sync, index, validate) and extends the database schema to store extracted symbols.

  • Adds tree-sitter integration to extract class/module definitions from Ruby source files
  • Implements goblin-based binary parsing to detect and validate architecture mismatches in native gems
  • Extends the database schema with a gem_symbols table for storing indexed Ruby symbols

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 14 comments.

Show a summary per file
File Description
src/gem_metadata/symbols.rs New module for extracting Ruby class/module symbols using tree-sitter queries
src/gem_metadata/indexer.rs New module to index all Ruby files in a gem and extract their symbols
src/gem_metadata/binary_arch.rs New module for detecting binary architecture and validating against claimed gem platforms
src/gem_metadata/analyzer.rs Extended with binary architecture validation functionality
src/gem_metadata.rs Module declarations and public API exports for new functionality
crates/vein-admin/src/main.rs Refactored CLI to support subcommands (serve, sync, index, validate)
crates/vein-admin/src/commands.rs Module declaration for new command implementations
crates/vein-admin/src/commands/index.rs Command implementation for indexing gem symbols into the database
crates/vein-admin/src/commands/validate.rs Command implementation for validating binary architectures in gems
crates/vein-adapter/src/cache.rs Added trait methods for symbol storage and retrieval
crates/vein-adapter/src/cache/sqlite.rs SQLite implementation for gem_symbols table and symbol operations
crates/vein-adapter/src/cache/postgres.rs PostgreSQL implementation for gem_symbols table and symbol operations
Cargo.toml Added dependencies for tree-sitter (0.26.3), tree-sitter-ruby (0.23.1), and goblin (0.9)
Cargo.lock Lock file updates for new dependencies and their transitive dependencies

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +138 to +154
#[test]
fn test_extract_nested_class() {
let source = r#"
module Foo
class Bar
end
end
"#;
let symbols = extract_symbols(source).unwrap();
assert_eq!(symbols.len(), 2);
assert!(symbols
.iter()
.any(|s| s.name == "Foo" && s.symbol_type == SymbolType::Module));
assert!(symbols
.iter()
.any(|s| s.name == "Bar" && s.symbol_type == SymbolType::Class));
}
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test lacks assertion for the parent field. The test for nested classes (test_extract_nested_class) verifies that 'Foo' and 'Bar' are extracted but doesn't verify that the nested 'Bar' class has the correct parent relationship. According to the current implementation, nested classes defined as 'module Foo; class Bar; end; end' won't capture the parent relationship because the tree-sitter query only handles scope resolution syntax (::), not lexical nesting. This should either be tested and documented, or the implementation should be updated to handle nested definitions.

Copilot uses AI. Check for mistakes.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. The additional test is warranted.

parent_name TEXT,
line_number INTEGER,
created_at TIMESTAMP DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),
UNIQUE(gem_name, gem_version, gem_platform, file_path, symbol_name)
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The UNIQUE constraint includes gem_platform in a way that may not handle NULL values as expected across all versions of SQLite. When gem_platform is NULL, multiple rows with the same gem_name, gem_version, file_path, and symbol_name but different NULL gem_platform values might be treated as distinct. Consider using a partial unique index or adjusting the constraint to ensure consistent behavior.

Copilot uses AI. Check for mistakes.
DELETE FROM gem_symbols
WHERE gem_name = ?1
AND gem_version = ?2
AND ((gem_platform IS NULL AND ?3 IS NULL) OR gem_platform = ?3)
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NULL comparison logic in the WHERE clause may not work as intended. In SQL, NULL != NULL, so the condition '(gem_platform IS NULL AND ?3 IS NULL)' will not match when both are NULL because the parameter binding doesn't preserve NULL semantics in this comparison. This should use proper NULL-safe comparison or conditional logic.

Suggested change
AND ((gem_platform IS NULL AND ?3 IS NULL) OR gem_platform = ?3)
AND (
(?3 IS NULL AND gem_platform IS NULL)
OR
(?3 IS NOT NULL AND gem_platform = ?3)
)

Copilot uses AI. Check for mistakes.
DELETE FROM gem_symbols
WHERE gem_name = $1
AND gem_version = $2
AND ((gem_platform IS NULL AND $3 IS NULL) OR gem_platform = $3)
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NULL comparison logic in the WHERE clause may not work as intended. In PostgreSQL, while the syntax is different from SQLite, the condition '(gem_platform IS NULL AND $3 IS NULL)' still has issues. Parameter $3 will never be IS NULL in this context - it's either bound to a value or NULL. The proper way to handle this is to use COALESCE or conditional SQL logic to ensure NULL values are matched correctly.

Suggested change
AND ((gem_platform IS NULL AND $3 IS NULL) OR gem_platform = $3)
AND gem_platform IS NOT DISTINCT FROM $3

Copilot uses AI. Check for mistakes.
if is_binary {
// Read the file contents
let mut contents = Vec::new();
if let Err(_) = data_entry.read_to_end(&mut contents) {
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error value is discarded with underscore binding. When a file can't be read, the error information could be valuable for debugging. Consider logging the error or at least including it in the warning message context.

Suggested change
if let Err(_) = data_entry.read_to_end(&mut contents) {
if let Err(err) = data_entry.read_to_end(&mut contents) {
eprintln!("Warning: failed to read binary file '{}': {}", path_str, err);

Copilot uses AI. Check for mistakes.
Comment on lines 95 to 114
async fn validate_gem_version_with_platform(
config: &VeinConfig,
name: &str,
version: &str,
platform: Option<&str>,
) -> Result<bool> {
// Parse version which might contain platform (e.g., "0.0.52-arm64-darwin")
let (actual_version, claimed_platform): (String, Option<String>) = if platform.is_some() {
(version.to_string(), platform.map(|s| s.to_string()))
} else {
// Try to extract platform from version string
// E.g., "0.0.52-arm64-darwin" -> ("0.0.52", Some("arm64-darwin"))
// Split by '-' to separate version from platform
let parts: Vec<&str> = version.splitn(2, '-').collect();
if parts.len() == 2 {
// First part is version (e.g., "0.0.52"), second is platform (e.g., "arm64-darwin")
(parts[0].to_string(), Some(parts[1].to_string()))
} else {
(version.to_string(), None)
}
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The version parsing logic here is fragile. Splitting on '-' with splitn(2, '-') assumes the version number never contains a hyphen, but some gem versions do (e.g., "1.0.0-pre1"). This could incorrectly parse "1.0.0-pre1" as version "1.0.0" with platform "pre1". Consider using a more robust parsing strategy that takes into account valid platform strings.

Suggested change
async fn validate_gem_version_with_platform(
config: &VeinConfig,
name: &str,
version: &str,
platform: Option<&str>,
) -> Result<bool> {
// Parse version which might contain platform (e.g., "0.0.52-arm64-darwin")
let (actual_version, claimed_platform): (String, Option<String>) = if platform.is_some() {
(version.to_string(), platform.map(|s| s.to_string()))
} else {
// Try to extract platform from version string
// E.g., "0.0.52-arm64-darwin" -> ("0.0.52", Some("arm64-darwin"))
// Split by '-' to separate version from platform
let parts: Vec<&str> = version.splitn(2, '-').collect();
if parts.len() == 2 {
// First part is version (e.g., "0.0.52"), second is platform (e.g., "arm64-darwin")
(parts[0].to_string(), Some(parts[1].to_string()))
} else {
(version.to_string(), None)
}
fn looks_like_rubygems_platform(s: &str) -> bool {
// Common RubyGems platform identifiers
const KNOWN_PLATFORMS: &[&str] = &[
"ruby",
"jruby",
"java",
"x86-mswin32",
"x64-mswin64",
"x86-mingw32",
"x64-mingw32",
"x64-mingw-ucrt",
"x86_64-linux",
"x86-linux",
"x86_64-darwin",
"arm64-darwin",
"universal-darwin",
"x86_64-freebsd",
"x86_64-openbsd",
"x86_64-solaris",
];
if KNOWN_PLATFORMS.contains(&s) {
return true;
}
// Fallback: accept simple CPU-OS[-version] style strings with safe characters
fn valid_segment(seg: &str) -> bool {
!seg.is_empty()
&& seg
.chars()
.all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '.')
}
let mut parts = s.split('-');
let first = match parts.next() {
Some(p) => p,
None => return false,
};
let second = match parts.next() {
Some(p) => p,
None => return false,
};
let third = parts.next();
// If there are more than three segments, it's unlikely to be a standard platform string.
if parts.next().is_some() {
return false;
}
if !valid_segment(first) || !valid_segment(second) {
return false;
}
if let Some(third) = third {
if !valid_segment(third) {
return false;
}
}
true
}
async fn validate_gem_version_with_platform(
config: &VeinConfig,
name: &str,
version: &str,
platform: Option<&str>,
) -> Result<bool> {
// Parse version which might contain platform (e.g., "0.0.52-arm64-darwin").
// Only treat a suffix as platform if it looks like a valid RubyGems platform.
let (actual_version, claimed_platform): (String, Option<String>) = if let Some(p) = platform {
(version.to_string(), Some(p.to_string()))
} else if let Some(idx) = version.find('-') {
// Split into potential version and platform components.
let (maybe_version, rest) = version.split_at(idx);
// `rest` starts with '-', so skip it to get the candidate platform.
let maybe_platform = &rest[1..];
if looks_like_rubygems_platform(maybe_platform) {
(maybe_version.to_string(), Some(maybe_platform.to_string()))
} else {
// Hyphen is part of the version (e.g., "1.0.0-pre1"), not a platform separator.
(version.to_string(), None)
}
} else {
(version.to_string(), None)

Copilot uses AI. Check for mistakes.
Comment on lines 194 to 196
.filter_map(|(gem_name, version_platform)| {
if gem_name == name {
Some((version_platform, None))
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function always returns the second element of tuples with None as the platform. Looking at the tuple structure (String, Option<String>), it appears the version_platform variable is being treated as just version, but the platform information is being lost. The return type suggests this should preserve platform information from the cache, but it's hardcoded to None.

Suggested change
.filter_map(|(gem_name, version_platform)| {
if gem_name == name {
Some((version_platform, None))
.filter_map(|(gem_name, (version, platform))| {
if gem_name == name {
Some((version, platform))

Copilot uses AI. Check for mistakes.
Comment on lines 140 to 154
None => {
// Default to serve if no subcommand provided
let config_path = std::path::Path::new("crates/vein-admin/config");
let app = loco_rs::boot::create_app::<App, migration::Migrator>(
StartMode::ServerOnly,
&environment,
loco_rs::config::Config::from_folder(&environment, config_path)?,
)
.await?;

let binding = app.app_context.config.server.binding.clone();
let port = app.app_context.config.server.port;
let serve_params = ServeParams { port, binding };

boot::start::<App>(app, serve_params, false).await?;
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate code block for default server startup. The code in the None branch (lines 142-154) duplicates the logic from the Serve command branch (lines 101-119), except it doesn't use CLI arguments for bind and port. This duplication makes maintenance harder. Consider extracting this into a helper function or restructuring to avoid repetition.

Copilot uses AI. Check for mistakes.
Comment on lines 17 to 37
pub fn index_gem(gem_path: &Path) -> Result<Vec<FileSymbols>> {
let file =
File::open(gem_path).with_context(|| format!("opening gem at {}", gem_path.display()))?;
let mut archive = Archive::new(file);

let mut results = Vec::new();

// Find and process data.tar.gz
for entry in archive.entries().context("reading gem archive entries")? {
let entry = entry.context("accessing gem archive entry")?;
let path = entry.path().context("reading entry path")?.into_owned();

if path.as_os_str().to_string_lossy() == "data.tar.gz" {
let decoder = GzDecoder::new(entry);
let symbols = process_data_tar(decoder)?;
results = symbols;
break;
}
}

Ok(results)
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function returns an empty Vec when data.tar.gz is not found, which silently succeeds even though no symbols were indexed. This could be misleading - if a gem file doesn't contain data.tar.gz or has a malformed structure, the function returns Ok with an empty result without any indication that something went wrong. Consider returning an error if data.tar.gz is not found.

Copilot uses AI. Check for mistakes.
Comment on lines 147 to 173
// Check if this is a binary file by extension or location
let path_str_lower = path_str.to_ascii_lowercase();
let is_binary = file_path
.extension()
.and_then(|s| s.to_str())
.map(|ext| matches!(ext, "so" | "dll" | "dylib" | "bundle" | "exe"))
.unwrap_or(false)
|| path_str_lower.starts_with("exe/")
|| path_str_lower.starts_with("bin/");

if is_binary {
// Read the file contents
let mut contents = Vec::new();
if let Err(_) = data_entry.read_to_end(&mut contents) {
continue; // Skip files we can't read
}

// Detect architecture
if let Ok(binary_info) = detect_binary_arch(&contents) {
let is_match = matches_platform(claimed_platform, &binary_info);

if !is_match {
mismatches.push(path_str.to_string());
}

detected_binaries.push((path_str.to_string(), binary_info));
}
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Binary detection by file extension may produce false positives. Files in bin/ or exe/ directories might be shell scripts, Ruby scripts, or other non-binary executables. The code should verify that the file is actually a binary by attempting to parse it with goblin before classifying it as a binary. Currently, script files would be incorrectly flagged as binaries and fail parsing, which is caught by the Ok() check, but this approach is inefficient and could miss actual binaries that fail parsing for other reasons.

Suggested change
// Check if this is a binary file by extension or location
let path_str_lower = path_str.to_ascii_lowercase();
let is_binary = file_path
.extension()
.and_then(|s| s.to_str())
.map(|ext| matches!(ext, "so" | "dll" | "dylib" | "bundle" | "exe"))
.unwrap_or(false)
|| path_str_lower.starts_with("exe/")
|| path_str_lower.starts_with("bin/");
if is_binary {
// Read the file contents
let mut contents = Vec::new();
if let Err(_) = data_entry.read_to_end(&mut contents) {
continue; // Skip files we can't read
}
// Detect architecture
if let Ok(binary_info) = detect_binary_arch(&contents) {
let is_match = matches_platform(claimed_platform, &binary_info);
if !is_match {
mismatches.push(path_str.to_string());
}
detected_binaries.push((path_str.to_string(), binary_info));
}
// Read the file contents
let mut contents = Vec::new();
if let Err(_) = data_entry.read_to_end(&mut contents) {
continue; // Skip files we can't read
}
// Detect architecture; only treat as binary if parsing succeeds
if let Ok(binary_info) = detect_binary_arch(&contents) {
let is_match = matches_platform(claimed_platform, &binary_info);
if !is_match {
mismatches.push(path_str.to_string());
}
detected_binaries.push((path_str.to_string(), binary_info));

Copilot uses AI. Check for mistakes.
Copy link

@pboling pboling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copilot suggestions for additional tests are great ideas. Otherwise looks like a good intial implementation.

@@ -0,0 +1,167 @@
use anyhow::{Context, Result};
use tree_sitter::{Parser, Query, QueryCursor, StreamingIterator};
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

Comment on lines +82 to +83
"class_scope" | "module_scope" => {
parent = Some(node_text.to_string());
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, a more complex test is warranted.

Comment on lines +138 to +154
#[test]
fn test_extract_nested_class() {
let source = r#"
module Foo
class Bar
end
end
"#;
let symbols = extract_symbols(source).unwrap();
assert_eq!(symbols.len(), 2);
assert!(symbols
.iter()
.any(|s| s.name == "Foo" && s.symbol_type == SymbolType::Module));
assert!(symbols
.iter()
.any(|s| s.name == "Bar" && s.symbol_type == SymbolType::Class));
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. The additional test is warranted.


# Tree-sitter for Ruby code analysis
tree-sitter = "0.26.3"
tree-sitter-ruby = "0.23.1"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

…orm parsing, error handling, tests, code quality)
@seuros seuros merged commit c2ebe25 into master Jan 6, 2026
1 check passed
@seuros seuros deleted the treesitter branch January 6, 2026 08:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants