Flexible, Robust Multi-File Concatenation Toolkit
ConCat (short for concatenate) is a Python-based command-line tool designed for researchers, bioinformaticians, and data scientists who routinely merge large collections of CSV/TSV/tabular files. ConCat provides:
- Automatic delimiter sniffing (comma, tab, semicolon, pipe)
- Optional delimiter normalization across heterogeneous files
- Strict / union / intersection schema modes
- Column-selection mode with missing-column policies
- Automatic source-file annotation (first column)
- Chunked streaming for multi‑GB files
- Modular architecture similar to SeqForge and ReGAIN
We recommend installing ConCat in your ~/home directory
cd ~
git clone https://github.com/ERBringHorvath/ConCat
cd ~/concatpip install .Add the following line to the end of .bashrc/.bash_profile/etc
export PATH="/home/usr/concat/bin:$PATH"Or add the executable directory to your PATH:
echo 'export PATH="$HOME/concat/bin:$PATH"' >> ~/.bashrc
source ~/.bashrcReplace /home/user/concat/bin with the actual path to the directory containing the executable.
Whatever the initial directory, this path should end with /concat/bin
Close and re-open your terminal or run source ~/.bashrc (or whatever your file is named)
Once installed, verify:
concat --versionConCat uses a subcommand structure:
concat <command> [arguments]The primary command is:
concat combine| Argument | Description |
|---|---|
-d DIR, --directory DIR |
Read all files in a directory |
--glob PATTERN [PATTERN ...] |
Glob pattern(s) expanded by ConCat (not the shell) |
-i FILE [...], --input-files |
Explicit file paths; works with shell-expanded globs |
| Argument | Description |
|---|---|
-e EXT, --extension EXT |
Enforce a specific extension; otherwise all must match |
--sample-rows N |
Rows used for delimiter sniffing (default: 50) |
--normalize {comma,tab,semicolon,pipe} |
Normalize mixed delimiters to a unified one |
Supported delimiters:
- comma →
, - tab →
\t - semicolon →
; - pipe →
|
| Argument | Description |
|---|---|
--schema strict |
All files must share the exact same columns (default) |
--schema union |
Output includes all columns ever seen in any file |
--schema intersection |
Only output columns shared by all files |
(Ignored if --columns is used)
| Argument | Description |
|---|---|
--columns COL [...] |
Only include the specified columns, in order |
--missing-policy {error,skip,fillna} |
How to handle missing columns in selection mode |
--case-insensitive |
Case-insensitive column matching |
Missing policy behaviors:
error→ abortskip→ skip the filefillna→ include file, missing values become NA
| Argument | Description |
|---|---|
--no-source-col |
Disable automatic first-column source annotation |
--source-col-name NAME |
Rename the source column (default: source_file) |
--source-col-mode {name,stem,path} |
Value in source column |
Modes:
name→file.tsvstem→filepath→/full/path/to/file.tsv
| Argument | Description |
|---|---|
-o FILE, --out FILE |
Output file (required) |
--out-delim {comma,tab,semicolon,pipe} |
Output delimiter (default: comma) |
--no-header |
Omit header row |
| Argument | Description |
|---|---|
--chunksize N |
Rows per chunk (default: 200,000) |
-T N, --threads N |
Threads for normalization (default: 4) |
| Argument | Description |
|---|---|
--dry-run |
Summarize actions but do not output |
-V, --verbose |
Verbose logging |
--version |
Display ConCat version |
concat combine -d ./results/ -e tsv \
--out merged.tsv --out-delim tabconcat combine --glob './runs/*virus_summary.tsv' \
-o combined_virus_summaries.csvconcat combine -i ./data/*.tsv -o merged.tsvconcat combine --glob './data/*' --normalize tab \
-o merged.tsvconcat combine -i FileA.csv FileB.csv FileC.csv --schema union \
-o merged.csvconcat combine --glob './samples/*.csv' \
--columns sample_id date score \
--missing-policy fillna \
-o combined_subset.csvconcat combine -i *.tsv \
--source-col-mode path \
-o merged.tsvconcat combine --glob './data/*.tsv' --dry-runLauncher script must point PYTHONPATH to the project root:
PROJECT_ROOT="$(cd "${SCRIPT_DIR}/.." && pwd)"
export PYTHONPATH="${PYTHONPATH:-}:${PROJECT_ROOT}"Use:
--normalize tabOr enforce consistent input.
Choose:
--schema unionOr:
--schema intersectionOr explicitly:
--columns col1 col2 col3ConCat is licensed under the GNU GPLv3 (or later).
Full license text: https://www.gnu.org/licenses/gpl-3.0.txt
ConCat: Flexible Multi-File Concatenation Toolkit (https://github.com/ERBringHorvath/ConCat)
PRs welcome! When contributing:
Keep modules modular (cli.py, combine.py)
Keep behavior consistent with SeqForge/ReGAIN_CLI patterns
