- Project Overview
- Architecture
- Core Components
- Algorithm Details
- Performance Characteristics
- Building and Development
- Testing
- Contributing
- Advanced Usage
Omega is a high-performance file search utility designed to efficiently search large filesystems using parallel processing. It addresses the need for fast, flexible file discovery across different operating systems.
- Maximum search performance through parallelization
- Cross-platform compatibility (Linux, macOS, Windows)
- Low memory footprint even on large filesystems
- Flexible search options (fuzzy matching, content search, filtering)
- Both human-readable and machine-readable output formats
- Language: Rust (stable)
- Key dependencies:
  - `clap`: command-line argument parsing
  - `rayon`: data parallelism
  - `crossbeam`: lock-free concurrent data structures
  - `walkdir`: filesystem traversal
Omega follows a modular architecture with clear separation of concerns:
```
┌─────────────┐
│     CLI     │  Parse arguments
└──────┬──────┘
       │
┌──────▼──────┐
│   Engine    │  Orchestrate search
└──────┬──────┘
       │
   ┌───┴────┬─────────┬──────────┐
   │        │         │          │
┌──▼────┐ ┌─▼─────┐ ┌─▼─────┐ ┌──▼─────┐
│Scanner│ │Matcher│ │Metrics│ │Printer │
└───────┘ └───────┘ └───────┘ └────────┘
```
Omega uses a producer-consumer pattern:
- Main thread creates a Rayon thread pool
- Multiple scanner threads traverse directories in parallel
- Found files are sent through a lock-free channel
- A dedicated printer thread consumes and outputs results
- Atomic counters track progress across threads
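The pipeline above can be sketched with standard-library primitives. This is a minimal illustration, not Omega's actual code: `std::sync::mpsc` and `std::thread` stand in for the crossbeam channel and Rayon pool, and all names are assumptions.

```rust
use std::sync::mpsc;
use std::thread;

// Producer-consumer sketch: scanner threads send found paths through a
// channel while one dedicated printer thread consumes them.
fn run_pipeline(roots: &[&str]) -> usize {
    let (tx, rx) = mpsc::channel::<String>();

    // Dedicated consumer: drains the channel until every sender is dropped.
    let printer = thread::spawn(move || {
        let mut printed = 0;
        for path in rx {
            println!("{path}");
            printed += 1;
        }
        printed
    });

    // One simulated scanner thread per search root.
    let scanners: Vec<_> = roots
        .iter()
        .map(|root| {
            let tx = tx.clone();
            let root = root.to_string();
            thread::spawn(move || {
                // A real scanner would traverse the tree with walkdir here.
                tx.send(format!("{root}/example.txt")).unwrap();
            })
        })
        .collect();

    drop(tx); // close our handle so the printer loop can terminate
    for s in scanners {
        s.join().unwrap();
    }
    printer.join().unwrap()
}

fn main() {
    assert_eq!(run_pipeline(&["/tmp", "/var", "/home"]), 3);
}
```

Dropping the last sender is what lets the printer's receive loop end cleanly; the same holds for crossbeam channels.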
```
Input Args
    ↓
Search Roots Determination
    ↓
Parallel Directory Traversal
    ↓
Pattern Matching + Filtering
    ↓
Channel Communication
    ↓
Result Output + Metrics
```
Defines the command-line interface using `clap`. All search parameters are encapsulated in the `Args` struct with validation annotations.
Key responsibilities:
- Parse command-line arguments
- Provide help text and version information
- Validate argument combinations
The orchestration layer that coordinates the search process.
SearchEngine struct:
- Initializes scanner, matcher, and output components
- Creates thread pool based on configuration
- Manages the producer-consumer pipeline
- Handles graceful shutdown
Search flow:
- Build Rayon thread pool
- Create unbounded channel for results
- Spawn printer thread
- Execute parallel directory scans
- Wait for completion and collect metrics
Implements filesystem traversal and filtering logic.
FileSystemScanner:
- Uses `walkdir` for depth-first traversal
- Applies filtering based on file type (files/directories)
- Respects depth limits
- Handles symbolic links (does not follow)
- Checks access permissions before traversal
SearchConfig:
- Thread count (auto-detected from CPU)
- Maximum depth limit
- File/directory filtering flags
Key methods:
- `scan_directory`: main traversal logic
- `can_access_path`: permission checking
Pattern matching engine supporting multiple match modes.
PatternMatcher:
- Exact substring matching (default)
- Case-sensitive and case-insensitive modes
- Fuzzy matching using Levenshtein distance
- Content search for text files
Matching strategies:
- Exact match:
  - Simple substring containment
  - Case handling via string transformation
- Fuzzy match:
  - First tries an exact match
  - Falls back to word-level Levenshtein distance
  - Configurable distance threshold
- Content match:
  - Reads file contents as a UTF-8 string
  - Applies the same matching logic as for filenames
  - Respects a maximum file size limit
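A minimal sketch of the exact-match strategy, assuming a free function rather than Omega's actual `PatternMatcher` API:

```rust
// Exact substring matching with optional case folding, as described above.
// The function name and signature are illustrative assumptions.
fn matches_exact(name: &str, pattern: &str, case_sensitive: bool) -> bool {
    if case_sensitive {
        name.contains(pattern)
    } else {
        // Case handling via string transformation.
        name.to_lowercase().contains(&pattern.to_lowercase())
    }
}

fn main() {
    assert!(matches_exact("Report.TXT", "report", false));
    assert!(!matches_exact("Report.TXT", "report", true));
}
```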
Thread-safe metrics collection using atomic operations.
SearchMetrics:
- Found count: Number of matching items
- Scanned count: Total items examined
- Error count: Access/read failures
- Shutdown flag: Coordinate early termination
All counters use AtomicU64 with relaxed ordering for performance.
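A hedged sketch of such a metrics structure; the field and method names are assumptions, not Omega's exact API:

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

// Thread-safe counters plus a shutdown flag, shared across scanner threads.
struct SearchMetrics {
    found: AtomicU64,
    scanned: AtomicU64,
    errors: AtomicU64,
    shutdown: AtomicBool,
}

impl SearchMetrics {
    fn new() -> Self {
        Self {
            found: AtomicU64::new(0),
            scanned: AtomicU64::new(0),
            errors: AtomicU64::new(0),
            shutdown: AtomicBool::new(false),
        }
    }

    // Relaxed ordering suffices: the counters are independent tallies that
    // only need to be totalled, not synchronized with other memory accesses.
    fn record_found(&self) { self.found.fetch_add(1, Ordering::Relaxed); }
    fn record_scanned(&self) { self.scanned.fetch_add(1, Ordering::Relaxed); }
    fn record_error(&self) { self.errors.fetch_add(1, Ordering::Relaxed); }

    fn get_found(&self) -> u64 { self.found.load(Ordering::Relaxed) }

    fn trigger_shutdown(&self) { self.shutdown.store(true, Ordering::Relaxed); }
    fn is_shutdown(&self) -> bool { self.shutdown.load(Ordering::Relaxed) }
}

fn main() {
    let m = SearchMetrics::new();
    m.record_scanned();
    m.record_found();
    m.record_error();
    assert_eq!(m.get_found(), 1);
    assert!(!m.is_shutdown());
    m.trigger_shutdown();
    assert!(m.is_shutdown());
}
```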
SearchLimits:
- Optional limits on found/scanned counts
- Triggers shutdown when limits reached
Error logging:
- Concurrent file writing with mutex protection
- Timestamped error entries
- Filters out common "Access denied" errors
Handles result formatting and printing.
OutputMode:
- Normal: Simple path output
- API: CSV format with detailed metadata
ResultPrinter:
- Runs in dedicated thread
- Consumes results from channel
- Prints in chosen format
- No buffering for real-time output
SearchResult:
- Final summary statistics
- Formatted output to stderr (doesn't interfere with piped results)
Rich file metadata representation.
FileInfo struct fields:
- path: Full path string
- name: Filename only
- is_dir, is_file: Type flags
- size: Bytes and human-readable format
- modified: Unix timestamp and ISO 8601 string
- is_hidden: Platform-specific detection
- extension: File extension
- permissions: Unix-style permission string
CSV serialization:
- Proper escaping of special characters
- Quote wrapping for fields with commas/newlines
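The escaping rules described above can be sketched as follows. This is RFC 4180-style escaping written from scratch for illustration, not copied from Omega's `escape_csv`:

```rust
// Wrap a field in quotes when it contains a comma, quote, or newline,
// doubling any embedded quotes.
fn escape_csv(field: &str) -> String {
    if field.contains(',') || field.contains('"') || field.contains('\n') {
        format!("\"{}\"", field.replace('"', "\"\""))
    } else {
        field.to_string()
    }
}

fn main() {
    assert_eq!(escape_csv("plain"), "plain");
    assert_eq!(escape_csv("a,b"), "\"a,b\"");
    assert_eq!(escape_csv("say \"hi\""), "\"say \"\"hi\"\"\"");
}
```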
Platform-specific root path determination.
Windows:
- Checks drives C:\ through Z:\
- Tests existence of each drive letter
- Returns all available drives
Unix (Linux/macOS):
- Always uses "/" as root
Custom paths:
- Validates existence before use
- Multiple paths can be specified via a repeated `-p` flag
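A sketch of this root-determination logic, using a runtime flag purely for illustration; the real code would more likely use `#[cfg(windows)]` conditional compilation:

```rust
use std::path::PathBuf;

// Probe drive letters C: through Z: on Windows; fall back to "/" on Unix.
fn default_roots(windows: bool) -> Vec<PathBuf> {
    if windows {
        (b'C'..=b'Z')
            .map(|letter| PathBuf::from(format!("{}:\\", letter as char)))
            .filter(|drive| drive.exists()) // keep only drives that exist
            .collect()
    } else {
        vec![PathBuf::from("/")]
    }
}

fn main() {
    let roots = default_roots(cfg!(windows));
    println!("search roots: {roots:?}");
}
```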
Utility functions for formatting and platform compatibility.
Functions:
- `escape_csv`: proper CSV field escaping
- `format_size`: human-readable byte sizes (B, KB, MB, GB, TB)
- `format_timestamp`: Unix epoch to ISO 8601 conversion
- `is_hidden_file`: platform-specific hidden file detection
- `format_permissions`: Unix-style permission strings
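A possible shape for the size formatter; the 1024-based thresholds and two-decimal output are assumptions about the exact format, not Omega's verified behavior:

```rust
// Human-readable byte sizes (B, KB, MB, GB, TB), dividing by 1024 per unit.
fn format_size(bytes: u64) -> String {
    const UNITS: [&str; 5] = ["B", "KB", "MB", "GB", "TB"];
    let mut size = bytes as f64;
    let mut unit = 0;
    while size >= 1024.0 && unit < UNITS.len() - 1 {
        size /= 1024.0;
        unit += 1;
    }
    if unit == 0 {
        format!("{} B", bytes) // exact byte counts need no decimals
    } else {
        format!("{:.2} {}", size, UNITS[unit])
    }
}

fn main() {
    assert_eq!(format_size(512), "512 B");
    assert_eq!(format_size(1536), "1.50 KB");
    assert_eq!(format_size(1048576), "1.00 MB");
}
```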
Date conversion:
- Custom implementation avoiding external dependencies
- Handles leap years correctly
- Unix epoch (1970-01-01) based calculations
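The calendar part of such a dependency-free conversion might look like the following sketch (time-of-day handling omitted; function names are illustrative, not Omega's):

```rust
// Gregorian leap-year rule: divisible by 4, except centuries not divisible
// by 400.
fn is_leap_year(year: u64) -> bool {
    (year % 4 == 0 && year % 100 != 0) || year % 400 == 0
}

// Convert days since the Unix epoch (1970-01-01 = day 0) to (year, month, day)
// by walking whole years, then whole months.
fn civil_from_days(mut days: u64) -> (u64, u32, u32) {
    let mut year: u64 = 1970;
    loop {
        let len: u64 = if is_leap_year(year) { 366 } else { 365 };
        if days < len {
            break;
        }
        days -= len;
        year += 1;
    }
    let month_lengths: [u64; 12] = [
        31, if is_leap_year(year) { 29 } else { 28 }, 31, 30, 31, 30,
        31, 31, 30, 31, 30, 31,
    ];
    let mut month = 0;
    while days >= month_lengths[month] {
        days -= month_lengths[month];
        month += 1;
    }
    (year, month as u32 + 1, days as u32 + 1)
}

fn main() {
    assert_eq!(civil_from_days(0), (1970, 1, 1));
    // 30 years * 365 days + 7 leap days (1972..1996) = 10957.
    assert_eq!(civil_from_days(10957), (2000, 1, 1));
}
```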
Omega implements the Wagner-Fischer dynamic programming algorithm for computing edit distance:
```rust
fn levenshtein_distance(s1: &str, s2: &str) -> usize {
    let s1_chars: Vec<char> = s1.chars().collect();
    let s2_chars: Vec<char> = s2.chars().collect();
    let len1 = s1_chars.len();
    let len2 = s2_chars.len();

    // Base cases
    if len1 == 0 { return len2; }
    if len2 == 0 { return len1; }

    // Dynamic programming with space optimization
    let mut prev_row: Vec<usize> = (0..=len2).collect();
    let mut curr_row = vec![0; len2 + 1];

    for (i, &ch1) in s1_chars.iter().enumerate() {
        curr_row[0] = i + 1;
        for j in 0..len2 {
            let cost = if ch1 == s2_chars[j] { 0 } else { 1 };
            curr_row[j + 1] = (curr_row[j] + 1)   // insertion
                .min(prev_row[j + 1] + 1)         // deletion
                .min(prev_row[j] + cost);         // substitution
        }
        std::mem::swap(&mut prev_row, &mut curr_row);
    }

    prev_row[len2]
}
```

Complexity:
- Time: O(m * n) where m and n are string lengths
- Space: O(min(m, n)) using row optimization
Fuzzy matching strategy:
- Split target string into words (alphanumeric boundaries)
- Compare each word against patterns
- Match if any word is within threshold distance
- Default threshold is 2 edits
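Combining a Wagner-Fischer edit-distance function with a word splitter gives a sketch of this strategy. The distance function is repeated here so the example is self-contained; `fuzzy_match` and its signature are illustrative assumptions:

```rust
// Space-optimized Wagner-Fischer edit distance (two rows instead of a matrix).
fn levenshtein_distance(s1: &str, s2: &str) -> usize {
    let s1: Vec<char> = s1.chars().collect();
    let s2: Vec<char> = s2.chars().collect();
    if s1.is_empty() { return s2.len(); }
    if s2.is_empty() { return s1.len(); }
    let mut prev: Vec<usize> = (0..=s2.len()).collect();
    let mut curr = vec![0; s2.len() + 1];
    for (i, &c1) in s1.iter().enumerate() {
        curr[0] = i + 1;
        for j in 0..s2.len() {
            let cost = if c1 == s2[j] { 0 } else { 1 };
            curr[j + 1] = (curr[j] + 1).min(prev[j + 1] + 1).min(prev[j] + cost);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[s2.len()]
}

// Word-level fuzzy matching: split on non-alphanumeric boundaries and accept
// if any word is within `threshold` edits of the pattern.
fn fuzzy_match(target: &str, pattern: &str, threshold: usize) -> bool {
    // Exact containment short-circuits the edit-distance work.
    if target.contains(pattern) {
        return true;
    }
    target
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .any(|word| levenshtein_distance(word, pattern) <= threshold)
}

fn main() {
    // "reprot" is two substitutions away from "report".
    assert!(fuzzy_match("my_reprot.txt", "report", 2));
    assert!(!fuzzy_match("banana.txt", "report", 2));
}
```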
Rayon's parallel iterator splits work across threads:
```rust
roots.par_iter().for_each(|root| {
    self.scanner.scan_directory(root, tx.clone());
});
```

Each root path is processed independently. Within each root, `walkdir` provides a depth-first iterator that each thread consumes sequentially.
Work distribution:
- Multiple roots: Rayon distributes across threads
- Single root: less parallelism, though the pipeline still overlaps traversal with matching and output
- Work stealing: Rayon balances load automatically
Omega supports early termination via limits:
```rust
pub fn should_continue(&self, metrics: &SearchMetrics) -> bool {
    if metrics.is_shutdown() {
        return false;
    }

    if let Some(limit) = self.found {
        if metrics.get_found() >= limit {
            metrics.trigger_shutdown();
            return false;
        }
    }

    // Similar check for the scanned limit
    true
}
```

The shutdown flag uses `AtomicBool` to propagate termination across threads without locks.
Directory traversal:
- O(N) where N is total filesystem entries
- Bounded by I/O speed and filesystem cache
Pattern matching:
- Exact: O(m * k) where m is filename length, k is pattern count
- Fuzzy: O(m * w * k * p) where w is word count, p is pattern length
Content search:
- Adds O(f * c) where f is file size, c is content matching complexity
- Significantly slower than filename matching
Memory usage:
- O(D) for depth-first traversal where D is max depth
- O(T) for channel buffer where T is thread count
- O(1) for metrics (atomic counters)
- FileInfo objects are streamed, not accumulated
Typical memory footprint:
- Base: ~5-10 MB
- Per thread: ~1-2 MB stack
- Channel: Unbounded but flows quickly
- No large data structures retained
Filesystem operations:
- Mostly sequential reads (directory listings)
- Metadata reads for each entry
- Content reads only when content search enabled
- Benefits from OS filesystem cache
Bottlenecks:
- Cold cache: 100-1000x slower than warm cache
- Network filesystems: High latency impact
- Spinning disks: Sequential access helps
- SSDs: Parallel access is effective
Thread scaling:
- Near-linear speedup with multiple roots
- Diminishing returns on single root
- Optimal thread count typically equals CPU cores
- I/O bound more than CPU bound
Dataset scaling:
- Linear time growth with filesystem size
- Constant memory usage regardless of size
- Early termination limits worst-case time
- Rust toolchain (1.70.0 or later recommended)
- Cargo package manager
```shell
# Clone the repository
git clone https://github.com/naseridev/omega.git
cd omega

# Build the debug version
cargo build

# Build the optimized release version
cargo build --release

# Run tests
cargo test

# Generate documentation
cargo doc --open
```

Project layout:

```
omega/
├── src/
│   ├── main.rs       # Entry point
│   ├── lib.rs        # Library root
│   ├── cli.rs        # Argument parsing
│   ├── engine.rs     # Search orchestration
│   ├── scanner.rs    # Filesystem traversal
│   ├── matcher.rs    # Pattern matching
│   ├── metrics.rs    # Statistics tracking
│   ├── output.rs     # Result formatting
│   ├── file_info.rs  # File metadata
│   ├── paths.rs      # Root path logic
│   └── utils.rs      # Utility functions
├── Cargo.toml        # Dependencies
└── README.md         # User guide
```
Core dependencies:
- `clap = "4.x"`: CLI parsing with derive macros
- `rayon = "1.x"`: data parallelism
- `crossbeam = "0.8"`: lock-free channels
- `walkdir = "2.x"`: directory traversal
Platform-specific:
- Unix: uses `std::os::unix::fs::PermissionsExt`
- Windows: uses `std::os::windows::fs::MetadataExt`
Release profile in Cargo.toml:
```toml
[profile.release]
opt-level = 3     # Maximum optimization
lto = true        # Link-time optimization
codegen-units = 1 # Better optimization
strip = true      # Remove debug symbols
```

Omega follows standard Rust conventions:
- Use `cargo fmt` for formatting
- Use `cargo clippy` for linting
- Maximum line length: 100 characters
- Prefer explicit types in public APIs
```shell
# Run all tests
cargo test

# Run with output
cargo test -- --nocapture

# Run a specific test
cargo test test_levenshtein_distance

# Run with coverage (requires tarpaulin)
cargo tarpaulin --out Html
```

Unit tests:
- Levenshtein distance correctness
- CSV escaping edge cases
- Timestamp formatting
- Permission string formatting
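As an illustration of what one of these unit tests might look like, here is a sketch for permission-string formatting; the helper and the test name are assumptions, not Omega's actual code:

```rust
// Render a Unix mode's low nine bits as an rwx string ("rwxr-xr-x" style).
fn format_permissions(mode: u32) -> String {
    let mut s = String::new();
    for shift in [6, 3, 0] {
        let bits = (mode >> shift) & 0o7;
        s.push(if bits & 0o4 != 0 { 'r' } else { '-' });
        s.push(if bits & 0o2 != 0 { 'w' } else { '-' });
        s.push(if bits & 0o1 != 0 { 'x' } else { '-' });
    }
    s
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn permissions_rwxr_xr_x() {
        assert_eq!(format_permissions(0o755), "rwxr-xr-x");
    }
}

fn main() {
    assert_eq!(format_permissions(0o644), "rw-r--r--");
}
```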
Integration tests:
- End-to-end search scenarios
- Multi-threaded correctness
- Limit enforcement
- Error handling
Manual testing:
```shell
# Create a test directory structure
mkdir -p test_data/dir1/dir2
echo "test content" > test_data/file1.txt
echo "other data" > test_data/dir1/file2.log

# Test basic search
omega test -p test_data

# Test content search
omega "content" -c -p test_data

# Test fuzzy search
omega tset -z -p test_data

# Test CSV output
omega test --api -p test_data > results.csv
```

Contribution workflow:
- Fork the repository
- Create a feature branch: `git checkout -b feature-name`
- Make changes with clear commit messages
- Add tests for new functionality
- Ensure all tests pass: `cargo test`
- Format code: `cargo fmt`
- Check lints: `cargo clippy`
- Submit a pull request with a description
Priority improvements:
- Additional pattern matching algorithms (regex, glob)
- Interactive TUI mode with progress display
- Incremental search with result updates
- Configuration file support
- Bookmark/saved search functionality
Code quality:
- Expand test coverage (target: 80%+)
- Add benchmarks for performance tracking
- Improve error messages
- Add more inline documentation
Platform support:
- Extended attribute handling (macOS)
- Windows registry search
- Network path optimization
- Correctness: Does it work as intended?
- Performance: No significant regressions
- Safety: Proper error handling, no panics
- Style: Follows Rust conventions
- Tests: Adequate test coverage
- Documentation: Clear comments and docs
Using with grep:

```shell
# Find and filter results
omega .rs --api | grep "src/" | cut -d, -f1

# Count matches by extension
omega pattern --api | awk -F, '{print $10}' | sort | uniq -c
```

Using with xargs:

```shell
# Delete found files (use with caution!)
omega .tmp -f | xargs rm

# Copy found files to a directory
omega .pdf -f | xargs -I {} cp {} /backup/
```

Using with jq (convert CSV to JSON):

```shell
omega pattern --api > results.csv
# Convert CSV to JSON using an external tool
```

Bash script to archive old files:
```shell
#!/bin/bash
# Find .log files and archive those older than a threshold.
# Appending with `tar -rf` per file (then compressing once at the end)
# avoids recreating the archive on every loop iteration.
omega ".log" -f --api | while IFS=, read -r path rest; do
    # A full script would extract the modification timestamp from the CSV
    # and compare it against the current time before archiving.
    tar -rf archive.tar "$path"
done
gzip archive.tar
```

Python script for analysis:
```python
import csv
import subprocess

# Run omega and capture its CSV output
result = subprocess.run(
    ['omega', 'pattern', '--api'],
    capture_output=True,
    text=True
)

# Parse the CSV (DictReader treats the first line as a header row)
reader = csv.DictReader(result.stdout.splitlines())
for row in reader:
    size = int(row['size'])
    if size > 1000000:  # files larger than ~1 MB
        print(f"Large file: {row['path']}")
```

For maximum speed:
```shell
# Use many threads, limit results
omega pattern -t 16 -l 100

# Search specific paths only
omega pattern -p /specific/directory

# Avoid content search unless needed
omega pattern  # filename only
```

For a thorough search:

```shell
# Include content, use fuzzy matching
omega pattern -c -z

# No limits, search everything
omega pattern  # no -l or -s flags
```

For large filesystems:

```shell
# Limit depth to avoid deep traversal
omega pattern -d 4

# Use a scan limit to sample the filesystem
omega pattern -s 100000
```

Find configuration files:

```shell
omega config -i            # case-insensitive
omega .conf .cfg .ini -f   # by extension
```

Find source code files:

```shell
omega .rs .go .py -f
omega "func main" -c   # by content
```

Find large files:

```shell
omega "" --api | awk -F, '$5 > 1000000000 {print $1,$6}'
```

Find recently modified files:

```shell
omega "" --api | sort -t, -k7 -rn | head -n 20
```

Issue: Search is too slow
Solutions:
- Use `-p` to limit the search scope
- Increase threads with `-t`
- Add limits with `-l` or `-s`
- Disable content search if not needed
Issue: Too many errors
Solutions:
- Use `-e` to hide the error count
- Check file permissions
- Review `omega.log` for details
- Exclude problematic directories
Issue: Out of memory
Solutions:
- Reduce the thread count with `-t`
- Use a scan limit with `-s`
- Avoid searching network filesystems
Issue: Wrong results
Solutions:
- Check case sensitivity (use `-i`)
- Verify the fuzzy threshold with `-T`
- Ensure the patterns are correct
- Test with `--api` to see metadata
Enable verbose error logging:

```shell
# Errors are always logged to omega.log
tail -f omega.log  # monitor in real time
```

Test with a small dataset:

```shell
mkdir test && cd test
touch file1.txt file2.log
omega txt  # should find file1.txt
```

Verify thread behavior:

```shell
omega pattern -t 1  # single-threaded
omega pattern -t 4  # multi-threaded
# Compare results to ensure consistency
```

Documentation:
- User Guide: README.md
- API documentation: `cargo doc --open`
- Code comments: inline documentation in source files
- Issues: GitHub Issues for bug reports
- Discussions: GitHub Discussions for questions
- Pull Requests: Contributions welcome
- ripgrep: Fast grep alternative (content search focused)
- fd: Fast find alternative (simpler, with fewer features)
- fzf: Fuzzy finder (interactive selection)
- ag (The Silver Searcher): Fast code search
This project is open source. Please refer to the LICENSE file in the repository for specific terms and conditions.
This project builds upon the excellent work of the Rust community and leverages several high-quality open-source libraries. Special thanks to all contributors and maintainers of the dependencies used in this project.
Author: Nima Naseri
For questions, suggestions, or contributions, please use GitHub Issues or submit a pull request.