A smart CSV and Excel file combiner that intelligently groups files by header compatibility and merges them with automatic column alignment.
- Multi-format Support: Reads CSV and Excel files (.csv, .xlsx, .xls, .xlsm, .xlsb, .ods)
- Smart Header Grouping: Automatically groups files based on column header compatibility
- Intelligent Merging: Merges files with similar headers (≥50% overlap) into a single output
- Column Alignment: Automatically aligns columns and fills missing values with empty strings
- Flexible Header Handling: Handles files with extra, missing, or reordered columns
- Recursive Directory Search: Automatically finds all compatible files in a directory
- Detailed Logging: Provides clear feedback about which files are being processed
- Rust 1.70 or higher
- Cargo (comes with Rust)
# Clone the repository
git clone <repository-url>
cd csv_combine
# Build the project
cargo build --release
# The binary will be available at target/release/csv_combine

# Process all CSV/Excel files in the current directory
csv_combine
# Process files in a specific directory
csv_combine /path/to/directory
# Process a single file
csv_combine /path/to/file.csv

The program uses intelligent header compatibility detection:
- Read all files and extract their headers
- Group files by header compatibility (≥50% column overlap)
- Merge headers within each group into a superset of all columns
- Align data by mapping rows to the merged header (missing columns filled with empty strings)
- Write output files with descriptive names based on header hash
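
The merge and align steps above can be sketched in Rust roughly as follows. This is an illustration only, not the code in src/main.rs: the names `merge_headers` and `align_row` are hypothetical, and keeping columns in first-seen order is an assumption that happens to match the examples in this README.

```rust
use std::collections::HashMap;

/// Hypothetical helper: builds the merged header, i.e. every column from
/// every file in the group, kept in first-seen order (an assumption).
fn merge_headers(all_headers: &[Vec<String>]) -> Vec<String> {
    let mut merged: Vec<String> = Vec::new();
    for headers in all_headers {
        for column in headers {
            if !merged.contains(column) {
                merged.push(column.clone());
            }
        }
    }
    merged
}

/// Hypothetical helper: re-orders one data row to match `merged`,
/// filling columns the source file does not have with empty strings.
fn align_row(file_headers: &[String], row: &[String], merged: &[String]) -> Vec<String> {
    // Map column name -> position in the source file.
    let index: HashMap<&str, usize> = file_headers
        .iter()
        .enumerate()
        .map(|(i, name)| (name.as_str(), i))
        .collect();

    merged
        .iter()
        .map(|name| {
            index
                .get(name.as_str())
                .and_then(|&i| row.get(i))
                .cloned()
                .unwrap_or_default()
        })
        .collect()
}
```

With a merged header of Name, Age, City, a row from a file that only has Name and Age comes out with an empty City field, and a file whose columns are in a different order is re-ordered to match.
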
- Multiple compatible files: combined_{hash}.csv
  - Contains merged data from all files in the group
  - Headers are the union of all columns
- Single unique file: single_{hash}.csv
  - Contains data from a file with unique headers
  - No other files share ≥50% column overlap
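
The naming scheme above could be produced along these lines. This README does not say which hash function the tool uses or how long the hash is, so treat everything here as an assumption: `output_name` is a hypothetical helper, and `DefaultHasher` is used only because it ships with the standard library.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: derives an output file name from the merged header.
/// `file_count` is the number of input files that ended up in the group.
fn output_name(merged_headers: &[String], file_count: usize) -> String {
    // Assumption: hash the merged header with the standard library hasher.
    let mut hasher = DefaultHasher::new();
    merged_headers.hash(&mut hasher);
    let hash = hasher.finish();

    // Groups with several files get "combined", lone files get "single".
    let prefix = if file_count > 1 { "combined" } else { "single" };
    format!("{prefix}_{hash:x}.csv")
}
```

Note that the algorithm behind `DefaultHasher` is not guaranteed to stay the same across Rust releases, so a tool that needs stable file names over time would likely pick an explicitly specified hash instead.
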
Input:
file1.csv: Name, Age, City
file2.csv: Name, Age, City
Output:
combined_abc123.csv: Name, Age, City
- All rows from file1 and file2 combined
Input:
file1.csv: Name, Age
file2.csv: Name, Age, City
file3.csv: Name, Age, City, Country
Compatibility:
- Lowest pairwise overlap (file1 vs file3): 2 common columns out of 4 total = 50% ✓
- All three files are compatible!
Output:
combined_def456.csv: Name, Age, City, Country
- file1 rows: Name, Age filled; City, Country empty
- file2 rows: Name, Age, City filled; Country empty
- file3 rows: All columns filled
Input:
file1.csv: Name, Age, City
file2.csv: Name, Age, Country
Compatibility:
- Common columns: Name, Age (2)
- Total unique columns: Name, Age, City, Country (4)
- Overlap: 2/4 = 50% ✓
Output:
combined_ghi789.csv: Name, Age, City, Country
- file1 rows: Name, Age, City filled; Country empty
- file2 rows: Name, Age, Country filled; City empty
Input:
employees.csv: Name, Age, Department
products.csv: Product, Price, SKU
Compatibility:
- Common columns: 0
- Overlap: 0/6 = 0% ✗
Output:
single_jkl012.csv: Name, Age, Department (employees data)
single_mno345.csv: Product, Price, SKU (products data)
Input:
file1.csv: Name, Age, City
file2.csv: Name, Product, Price
Compatibility:
- Common columns: Name (1)
- Total unique columns: 5
- Overlap: 1/5 = 20% ✗
Output:
single_pqr678.csv: Name, Age, City
single_stu901.csv: Name, Product, Price
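
One way to arrive at the groupings shown in the examples above is a greedy pass over the files. The README does not describe the actual grouping strategy, so this sketch rests on assumptions: each file is compared against a group's merged header (not against every member), it joins the first group that accepts it, and `HeaderGroup` and `group_by_headers` are hypothetical names. The compatibility rule is passed in as a closure so the sketch stays independent of how overlap is computed.

```rust
/// Hypothetical type: a group of files whose headers were judged compatible.
struct HeaderGroup {
    /// Union of all columns seen in the group, in first-seen order.
    merged: Vec<String>,
    /// Indices (into the input slice) of the files in this group.
    members: Vec<usize>,
}

/// Greedy grouping (an assumed strategy): each file joins the first group
/// whose merged header it is compatible with, otherwise it starts a new group.
fn group_by_headers<F>(files: &[Vec<String>], compatible: F) -> Vec<HeaderGroup>
where
    F: Fn(&[String], &[String]) -> bool,
{
    let mut groups: Vec<HeaderGroup> = Vec::new();
    for (i, headers) in files.iter().enumerate() {
        let found = groups
            .iter()
            .position(|g| compatible(g.merged.as_slice(), headers.as_slice()));
        match found {
            Some(idx) => {
                let group = &mut groups[idx];
                // Extend the merged header with any columns not seen yet.
                for column in headers {
                    if !group.merged.contains(column) {
                        group.merged.push(column.clone());
                    }
                }
                group.members.push(i);
            }
            None => groups.push(HeaderGroup {
                merged: headers.clone(),
                members: vec![i],
            }),
        }
    }
    groups
}
```

Passing a closure that implements the ≥50% overlap rule reproduces the examples: the three Name/Age files end up in one group, while a Product/Price/SKU file gets a group of its own. Because a greedy pass compares against a growing merged header, its result can depend on the order in which files are read.
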
Files are considered compatible if they share ≥50% of their columns:
Overlap Percentage = (Common Columns) / (Total Unique Columns)
Compatible: Overlap ≥ 50% → Files merged together
Incompatible: Overlap < 50% → Files separated
This threshold ensures:
- ✓ Files with the same columns are always merged
- ✓ Files with a few extra columns are merged
- ✓ Files with minor variations are merged
- ✗ Files with fundamentally different schemas are separated
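
The overlap rule above can be written as a small Rust function. This is a sketch under assumptions, not the project's actual code: the name `headers_are_compatible` only echoes the test name mentioned in this README, its signature is a guess, `overlap_counts` is a hypothetical helper, and treating two empty headers as incompatible is an arbitrary choice.

```rust
use std::collections::HashSet;

/// Hypothetical helper: returns (common columns, total unique columns).
/// Comparison is case-sensitive, so "Name" and "name" count as different.
fn overlap_counts(a: &[String], b: &[String]) -> (usize, usize) {
    let set_a: HashSet<&str> = a.iter().map(String::as_str).collect();
    let set_b: HashSet<&str> = b.iter().map(String::as_str).collect();
    let common = set_a.intersection(&set_b).count();
    let total = set_a.union(&set_b).count();
    (common, total)
}

/// True when common / total >= 50%.
/// Integer arithmetic avoids floating-point rounding at exactly 50%.
fn headers_are_compatible(a: &[String], b: &[String]) -> bool {
    let (common, total) = overlap_counts(a, b);
    total > 0 && common * 2 >= total
}
```

For Name, Age, City against Name, Age, Country this gives 2 common out of 4 unique columns, which sits exactly on the threshold and counts as compatible. Name, Age, City against Name, Product, Price gives 1 out of 5 and is rejected.
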
The program provides detailed debug logging:
[INFO] Searching for files in: /path/to/directory
[INFO] Found 5 files to process
[INFO] Reading: file1.csv
[INFO] Found 2 compatible header groups
[INFO] Processing group with merged headers: Name, Age, City (3 files)
[INFO] - Including: file1.csv (headers: Name, Age)
[INFO] - Including: file2.csv (headers: Name, Age, City)
[INFO] - Including: file3.csv (headers: Name, Age, City)
[INFO] Created: combined_abc123.csv (3 files, 150 data rows)
[INFO] Processing complete! Created 2 output files
# Run all tests
cargo test
# Run tests with output
cargo test -- --nocapture
# Run a specific test
cargo test test_headers_are_compatible

The project includes 26 comprehensive tests covering:
- File path validation
- CSV and Excel reading
- Header compatibility detection
- Header merging logic
- Row mapping with missing columns
- Complex scenarios with quotes and special characters
csv_combine/
├── src/
│ └── main.rs # Main application code
├── Cargo.toml # Project dependencies
├── Cargo.lock # Locked dependencies
└── README.md # This file
- CSV: .csv files (with proper quote and comma handling)
- Excel: .xlsx, .xls, .xlsm, .xlsb (reads first sheet)
- OpenDocument: .ods (reads first sheet)
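
Choosing a reader from the file extension might look like the sketch below. The `InputFormat` enum and `detect_format` function are hypothetical names, and matching extensions case-insensitively is an assumption; the README does not say how the tool treats names such as DATA.CSV.

```rust
use std::path::Path;

/// Hypothetical classification of the supported input formats.
#[derive(Debug, PartialEq)]
enum InputFormat {
    /// Plain .csv files.
    Csv,
    /// Excel or OpenDocument workbooks; only the first sheet is read.
    Spreadsheet,
}

/// Returns the format for a supported extension, or None for anything else.
fn detect_format(path: &Path) -> Option<InputFormat> {
    // Assumption: extensions are compared case-insensitively.
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "csv" => Some(InputFormat::Csv),
        "xlsx" | "xls" | "xlsm" | "xlsb" | "ods" => Some(InputFormat::Spreadsheet),
        _ => None,
    }
}
```

Files for which this returns None, such as .txt files or files without an extension, would simply be skipped during the directory search.
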
- csv - CSV reading and writing
- calamine - Excel file support
- anyhow - Error handling
- walkdir - Directory traversal
- log + pretty_env_logger - Logging
- tokio - Async runtime
- system-pause - User interaction
- Only processes the first sheet of Excel workbooks
- Files must have headers in the first row
- Column matching is case-sensitive
- Large files are processed in memory (consider RAM usage)
Problem: "No CSV or Excel files found"
- Check that your directory contains .csv or Excel files
- Verify the file extensions are correct
Problem: Files not being combined
- Check the header overlap percentage
- Files need ≥50% column overlap to be compatible
- Use debug logging to see compatibility decisions
Problem: Missing data in output
- This is expected! Files with fewer columns will have empty values for missing columns
- Check the merged header to see all available columns
Contributions are welcome! Please feel free to submit issues or pull requests.