Complete user guide with examples and best practices for using the project.
## Quick Start

```bash
# Activate virtual environment
# Windows:
venv\Scripts\activate
# Linux/macOS:
source venv/bin/activate

cd ms-access-data-processor

# Generate sample data for testing
python tests/generate_test_data.py

# Process all files
python src/access_processor.py

# View processed results
cat data/output/result.csv

# View processing log
cat processor.log
```

## Directory Structure

```
data/
├── input/                  # Input files (CSV format)
│   ├── file1.csv
│   ├── file2.csv
│   └── ...
├── correspondence.csv      # ID → ID2 mapping
├── filename_codes.csv      # Filename → Code mapping
└── output/                 # Output directory (created automatically)
    └── result.csv          # Final results
```
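The layout above can also be created programmatically before the first run; a minimal sketch using `pathlib` (`setup_layout` is a hypothetical helper for illustration, not part of the project):

```python
from pathlib import Path

def setup_layout(root: str) -> None:
    """Create the expected data/ directory layout if it is missing."""
    base = Path(root)
    # Input and output directories
    (base / "input").mkdir(parents=True, exist_ok=True)
    (base / "output").mkdir(parents=True, exist_ok=True)
    # Empty placeholder mapping files; fill these in before processing
    for name in ("correspondence.csv", "filename_codes.csv"):
        (base / name).touch(exist_ok=True)

setup_layout("data")
```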
## Input File Formats

### Input files (`data/input/*.csv`)

Format: CSV with semicolon separator

Required columns:
- `ID`: original identifier (e.g., `8d 7d 2c_Ah9h`)

Example:

```csv
RecordID;ID;SomeData
1;8d 7d 2c_Ah9h;Data_1
2;3f 2a 1b_Xk5l;Data_2
3;5e 9c 4d_Bm3n;Data_3
```

### Correspondence table (`data/correspondence.csv`)

Format: CSV with semicolon separator

Columns:
- `id`: original ID
- `ID2`: corresponding ID2

Example:

```csv
id;ID2
8d 7d 2c_Ah9h;8d 7d 2c_P000
3f 2a 1b_Xk5l;3f 2a 1b_P000
5e 9c 4d_Bm3n;5e 9c 4d_P000
```

### Filename codes (`data/filename_codes.csv`)

Format: CSV with semicolon separator

Columns:
- `filename`: input filename
- `code`: corresponding code

Example:

```csv
filename;code
file1.csv;AF21
file2.csv;BG22
file3.csv;EJ25
```

## Command-Line Usage

```bash
# Process all files in data/input/
python src/access_processor.py
```
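The mapping files described above are plain two-column, semicolon-separated CSVs, so they can be inspected with the standard library alone. A minimal sketch (`load_mapping` and the sample file name are illustrative, not a project API):

```python
import csv

def load_mapping(path: str, key_col: str, value_col: str) -> dict:
    """Load a two-column, semicolon-separated CSV into a dict."""
    mapping = {}
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f, delimiter=';'):
            mapping[row[key_col]] = row[value_col]
    return mapping

# Write a one-row sample matching the correspondence.csv format, then load it
with open('correspondence_sample.csv', 'w', newline='', encoding='utf-8') as f:
    f.write('id;ID2\n8d 7d 2c_Ah9h;8d 7d 2c_P000\n')

id_map = load_mapping('correspondence_sample.csv', 'id', 'ID2')
print(id_map['8d 7d 2c_Ah9h'])  # 8d 7d 2c_P000
```

The same helper works for `filename_codes.csv` with `key_col='filename'` and `value_col='code'`.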
## Python API

Basic usage:

```python
from src.access_processor import AccessDataProcessor

# Create processor instance
processor = AccessDataProcessor(
    input_dir='data/input',
    correspondence_file='data/correspondence.csv',
    codes_file='data/filename_codes.csv'
)

# Process all files
processor.process_all_files('data/output/result.csv')
```
Custom configuration:

```python
from src.access_processor import AccessDataProcessor

# Custom configuration
processor = AccessDataProcessor(
    input_dir='custom/input/path',
    correspondence_file='custom/correspondence.csv',
    codes_file='custom/codes.csv'
)

# Process a specific file
processor.process_file('custom/input/specific_file.csv', 'custom/output/result.csv')
```
## Typical Workflow

```bash
# 1. Generate test data
python tests/generate_test_data.py

# 2. Run processing
python src/access_processor.py

# 3. Check results
head -10 data/output/result.csv
```

Expected output:

```csv
ID3;ID4
AF21_8d 7d 2c_P000;AF21_8d 7d 2c_Ah9h
AF21_3f 2a 1b_P000;AF21_3f 2a 1b_Xk5l
BG22_5e 9c 4d_P000;BG22_5e 9c 4d_Bm3n
```
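Judging from the sample output, each result row combines the filename code with the mapped and original IDs. A sketch of that apparent rule (inferred from the examples; the authoritative logic lives in `process_file()` in `src/access_processor.py`):

```python
def build_output_row(code: str, original_id: str, id2: str) -> tuple:
    """Combine a filename code with the mapped (ID2) and original IDs.

    Inferred from the sample output: ID3 = "<code>_<ID2>", ID4 = "<code>_<ID>".
    """
    id3 = f"{code}_{id2}"
    id4 = f"{code}_{original_id}"
    return id3, id4

# Reproduces the first row of the expected output above
print(build_output_row('AF21', '8d 7d 2c_Ah9h', '8d 7d 2c_P000'))
# ('AF21_8d 7d 2c_P000', 'AF21_8d 7d 2c_Ah9h')
```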
## Advanced Usage

Load custom mapping tables and process individual files:

```python
from src.access_processor import AccessDataProcessor

# Create custom processor
processor = AccessDataProcessor()

# Load custom correspondence table
processor.load_correspondence_table('custom_mapping.csv')

# Load custom filename codes
processor.load_filename_codes('custom_codes.csv')

# Process specific files
processor.process_file('data/input/special_file.csv', 'data/output/special_result.csv')
```
Batch processing over several input directories:

```python
# Process multiple input directories
input_dirs = ['data/input1', 'data/input2', 'data/input3']
for input_dir in input_dirs:
    processor = AccessDataProcessor(input_dir=input_dir)
    output_file = f'data/output/results_{input_dir.split("/")[-1]}.csv'
    processor.process_all_files(output_file)
```

## Logging

The processor uses Python's `logging` module with different levels:
- DEBUG: Detailed processing information
- INFO: General processing steps
- WARNING: Non-critical issues
- ERROR: Processing errors
All logs are saved to `processor.log` with detailed information about:
- Processing start/end times
- File processing status
- Record counts
- Error messages
Adjusting the log level:

```python
import logging

# Set custom log level
logging.getLogger().setLevel(logging.DEBUG)

# Or modify the processor directly
processor = AccessDataProcessor(...)
# Logs will be written to processor.log
```

## FAQ

**Q: Does the processor support MS Access (.mdb) files?**
A: It currently supports CSV files with semicolon separators. Support for MS Access .mdb files is planned for future versions.

**Q: Can it handle large files?**
A: The processor uses streaming processing, so it can handle files of any size. Memory usage remains constant regardless of file size.

**Q: What happens when an ID is missing from the correspondence table?**
A: The processor will log warnings for missing IDs and skip those records. Check the log file for details.

**Q: Are files processed in parallel?**
A: Currently, files are processed sequentially. Parallel processing is planned for future versions.

**Q: How do I change how ID3 and ID4 are generated?**
A: Modify the `process_file()` method in `src/access_processor.py` to change the ID3 and ID4 generation logic.

**Q: How do I use a different CSV delimiter?**
A: Modify the `delimiter=';'` parameter in the `csv.DictReader()` calls in the processor code.

**Q: How do I add support for other file formats?**
A: Extend the `process_file()` method to handle different file formats and extensions.

**Q: Can I run the processor on a schedule?**
A: Yes, you can set up a cron job (Linux/macOS) or Task Scheduler (Windows) to run the processor automatically.
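For the delimiter question above: `csv.DictReader` takes a `delimiter` argument, so switching from semicolons to, say, commas is a one-argument change. A quick self-contained demonstration:

```python
import csv
import io

# The same record serialized with two different delimiters
semicolon_data = io.StringIO('id;ID2\n8d 7d 2c_Ah9h;8d 7d 2c_P000\n')
comma_data = io.StringIO('id,ID2\n8d 7d 2c_Ah9h,8d 7d 2c_P000\n')

rows_semi = list(csv.DictReader(semicolon_data, delimiter=';'))
rows_comma = list(csv.DictReader(comma_data, delimiter=','))

# Both parse to the same records
print(rows_semi == rows_comma)  # True
```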
## Troubleshooting

**Problem: "file not found" errors**

Solution:
- Check the file paths in the configuration
- Ensure all required files exist
- Verify file permissions

**Problem: empty output file**

Solution:
- Check that the input files contain data
- Verify the CSV format and separators
- Check the correspondence table for completeness

**Problem: memory errors on large files**

Solution:
- The processor should handle large files automatically
- Check available system memory
- Consider processing files in smaller batches

**Problem: duplicate records**

Solution:
- This is normal behavior: duplicates are automatically skipped
- Check the log file for details about skipped records
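Duplicate skipping of this kind is typically done with a seen-set keyed on an identifier; a sketch of the general technique (an illustration, not the project's actual implementation):

```python
def drop_duplicates(rows, key):
    """Yield each row whose key has not been seen before, skipping repeats."""
    seen = set()
    for row in rows:
        k = key(row)
        if k in seen:
            continue  # duplicate: skipped, as the processor logs and does
        seen.add(k)
        yield row

rows = [
    {'ID3': 'AF21_8d 7d 2c_P000'},
    {'ID3': 'AF21_8d 7d 2c_P000'},  # duplicate
    {'ID3': 'BG22_5e 9c 4d_P000'},
]
unique = list(drop_duplicates(rows, key=lambda r: r['ID3']))
print(len(unique))  # 2
```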
## Performance

Tips:
- Use SSD storage for better I/O performance
- Ensure sufficient RAM (2 GB+ recommended)
- Close other applications during processing
- Use a virtual environment to avoid package conflicts
- Monitor log files for performance insights

Approximate throughput:
- Small files (< 1 MB): ~1000 records/second
- Medium files (1–10 MB): ~500 records/second
- Large files (> 10 MB): ~200 records/second
## Support

Need help or have questions?

- 📧 Email: palagina00@gmail.com
- 🐛 Report a bug: GitHub Issues
- 📖 Documentation: INSTALLATION.md
Happy processing! 🎉
For more examples and advanced usage, check the source code in `src/access_processor.py`.