13 changes: 12 additions & 1 deletion CMakeLists.txt
@@ -3,6 +3,17 @@ project(fastq_pair)

set(CMAKE_C_STANDARD 99)

# Add the zlib library from the external folder
add_subdirectory(external/zlib-1.3.1)

# List your source files
set(SOURCE_FILES main.c robstr.c fastq_pair.c is_gzipped.c is_gzipped.h)

# Add the executable for your project
add_executable(fastq_pair ${SOURCE_FILES})
install (TARGETS fastq_pair DESTINATION bin)

# Link zlib (built locally) with your executable
target_link_libraries(fastq_pair PRIVATE zlibstatic)

# Installation configuration
install(TARGETS fastq_pair DESTINATION bin)
112 changes: 69 additions & 43 deletions README.md
@@ -1,6 +1,25 @@
[![Edwards Lab](https://img.shields.io/badge/Bioinformatics-EdwardsLab-03A9F4)](https://edwards.sdsu.edu/research)
[![DOI](https://zenodo.org/badge/98881309.svg)](https://zenodo.org/badge/latestdoi/98881309)
[![Build Status](https://travis-ci.org/linsalrob/fastq-pair.svg?branch=master)](https://travis-ci.org/linsalrob/fastq-pair)
## Acknowledgments

This project is based on the original implementation by [Edwards Lab](https://github.com/linsalrob/fastq-pair). Their work laid the foundation for this repository.


## Modifications and Enhancements

### Added support for gzipped FastQ files
- The tool now fully supports input and output of `.gz` compressed FastQ files.
- This addition allows seamless handling of large datasets, avoiding the need for users to uncompress and re-compress FastQ files.
- This reduces disk space usage and improves overall efficiency.

### Added option for entry deduplication
- Introduced an optional feature (`-d`) to remove duplicate entries within each file, based on the entry names.

### Added option for identifier reformatting
- Introduced an option (`-f`) to reformat sequence identifiers to the minimal identifier (the part before the first space), for better compatibility with some downstream analysis tools.
- Introduced an option (`-h`) to only reformat sequence identifiers, without pairing or deduplicating.

### Changed the output filename routine
- The output filenames are now generated by retaining only the basename and removing the extension.
- This provides clearer, more consistent file naming and avoids redundancy.

# FASTQ PAIR

@@ -37,6 +56,11 @@ out how many sequences there are in your fastq file:
```
wc -l fastq_filename
```
or, for gzipped files:
```
zcat fastq_filename.gz | wc -l
```

The number of sequences will be the number printed here, divided by 4.
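
The same arithmetic can be sketched in C. This is only an illustration, not code from `fastq_pair`: the helper name `count_fastq_records` is hypothetical, and it assumes an uncompressed stream with exactly four lines per record.

```c
#include <stdio.h>

/* Hypothetical helper (not part of fastq_pair): count the newline
 * characters in an already-open, uncompressed stream and divide by 4,
 * assuming every FASTQ record occupies exactly four lines. */
long count_fastq_records(FILE *fp)
{
    long lines = 0;
    int c;

    while ((c = fgetc(fp)) != EOF) {
        if (c == '\n')
            lines++;
    }
    return lines / 4;
}
```

A file with wrapped sequence lines or a missing final newline would need more careful parsing than this.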

_Note_: If you get an error that looks like
@@ -55,26 +79,12 @@ need to increase the value you provide to `-t`. If most of the entries are zero,
As an aside, this code is also _really_ slow if _none_ of your sequences are paired. You should most likely use this
after taking a peek at your files and making sure there are at least _some_ paired sequences in your files!

## Installing fastq_pair

We recommend installing fastq-pair using [bioconda](https://bioconda.github.io/recipes/fastq-pair/README.html)

```
mamba install -c bioconda fastq-pair
```

or in its own environment:

```
mamba create --name fastq-pair -c bioconda fastq-pair
```

### Installing from source

To install the code, grab the github repository, then make a build directory:
```$xslt
mkdir build && cd build
cmake3 ..
cmake ..
make && sudo make install
```
There are more instructions on the [installation](INSTALLATION.md) page.
@@ -99,52 +109,68 @@ You can also print out the number of elements in each bucket using the `-p` para
fastq_pair -p -t 100 file1.fastq file2.fastq
```

You can also de-duplicate your entries using the `-d` parameter. This removes duplicate entries within each fastq file, based on the identifier. Please note that this doubles the amount of memory used:

```$xslt
fastq_pair -d file1.fastq file2.fastq
```

You can also reformat your entry identifiers, leaving only the minimal identifier (the part before the first space), using the `-f` parameter. Note that this should not be used with the `-s` parameter:

```$xslt
fastq_pair -f file1.fastq file2.fastq
```
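
As a rough illustration of what "minimal identifier" means, here is a sketch in C. The function name `minimal_identifier` is hypothetical and this is not the code used by `fastq_pair`; it also assumes that a trailing line ending should be dropped along with everything after the first space.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of fastq_pair): truncate a FASTQ header
 * line in place at the first space, also stripping any trailing CR/LF
 * (an assumption). Returns the length of the shortened identifier. */
size_t minimal_identifier(char *header)
{
    size_t n = strcspn(header, " \r\n");

    header[n] = '\0';
    return n;
}
```

For a header such as `@SRR1.1 1 length=100`, this keeps only `@SRR1.1`.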

You can also ONLY reformat your entry identifiers, leaving only the minimal identifier (the part before the first space), using the `-h` parameter. Note that this should not be used with the `-s` parameter, and that this option does NOT deduplicate your reads or match reads between files. This is faster and requires less memory:

```$xslt
fastq_pair -h file1.fastq file2.fastq
```


## Testing fastq_pair

In the [test](test/) directory there are two fastq files that you can use to test `fastq_pair`. There are 250 sequences
in the [left](test/left.fastq) file and 75 sequences in the [right](test/right.fastq) file. Only 50 sequences are common
between the two files.
In the [test](test/) directory there are two fastq files that you can use to test `fastq_pair`. There are 251 sequences
in the [left](test/left.fastq) file and 78 sequences in the [right](test/right.fastq) file. Only 50 sequences are common
between the two files; the [left](test/left.fastq) file contains 1 duplicated entry and the [right](test/right.fastq) file contains
3 duplicated entries. In addition, the [test](test/) directory also contains gzipped versions of both files.

You can test the code with:

```$xslt
fastq_pair -t 1000 test/left.fastq test/right.fastq
fastq_pair -d -t 1000 test/left.fastq test/right.fastq
```

This will make four files in the [test/](test) directory:
- left.fastq.paired.fq
- left.fastq.single.fq
- right.fastq.paired.fq
- right.fastq.single.fq
- left.paired.fastq
- left.single.fastq
- right.paired.fastq
- right.single.fastq

The _paired_ files have 50 sequences each, and the two _single_ files have 200 and 25 sequences (left and right respectively).

### A note about gzipped fastq files

Unfortunately `fastq_pair` doesn't work with gzipped files at the moment, because it relies heavily on random access of
the file stream. That is complex with gzipped files, especially when the uncompressed file exceeds available memory
(which is exactly the situation that `fastq_pair` was designed to handle).
`fastq_pair` also works with gzipped files. Gzipped files are read using the zlib library, a copy of which is included in the [external](external) folder and built during installation. Note that if any of the fastq files provided is gzipped, the output files will also be gzipped.

Therefore, at this time, `fastq_pair` does not support gzipped files. You need to uncompress the files before using
`fastq_pair`.
Simply provide your gzipped files to `fastq_pair`.

If you really need to use gzipped files, and can accept slightly worse performance, then
[we have some alternative](https://edwards.sdsu.edu/research/sorting-and-paring-fastq-files/) approaches
written in Python that you can try.
You can test the code with:

### Testing for gzipped files ([issue #6](https://github.com/linsalrob/fastq-pair/issues/6))
```$xslt
fastq_pair -d -t 1000 test/left.fastq.gz test/right.fastq.gz
```

We take a peek at the first couple of bytes in the file to see if the file is gzip compressed. Per the standard, the
files should start 0x1F and 0x8B as the first two bytes. There is a small tester for the gzip program, called `test_gzip.c`,
that takes a single argument and reports whether it is gzipped or not. You can compile that tester with the command:
This will make four files in the [test/](test) directory:
- left.paired.fastq.gz
- left.single.fastq.gz
- right.paired.fastq.gz
- right.single.fastq.gz

```
gcc -std=gnu99 -o testgz ./test_gzip.c is_gzipped.c
```
The results should be identical to the non-gzipped fastq files.

We now test both files and exit (hopefully gracefully) if either is gzip compressed. The easiest solution is to
uncompress your files, and we recommend and love [pigz](https://zlib.net/pigz/) because it is awesome!
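
The two-byte check described above (0x1F followed by 0x8B) can be sketched as follows. This is a minimal illustration, not the repository's `is_gzipped.c`; the names `has_gzip_magic` and `file_looks_gzipped` are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: return 1 if the buffer starts with the gzip
 * magic bytes 0x1F 0x8B, and 0 otherwise (including short buffers). */
int has_gzip_magic(const unsigned char *buf, size_t len)
{
    return len >= 2 && buf[0] == 0x1F && buf[1] == 0x8B;
}

/* Hypothetical helper: peek at the first two bytes of a file.
 * Returns 1 if it looks gzipped, 0 if not, and -1 if it cannot be opened. */
int file_looks_gzipped(const char *path)
{
    unsigned char head[2];
    size_t got;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    got = fread(head, 1, sizeof head, fp);
    fclose(fp);
    return has_gzip_magic(head, got);
}
```

The magic bytes only show that the file claims to be gzip; they do not guarantee the rest of the stream is valid.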
Alternatively, [we have other approaches](https://edwards.sdsu.edu/research/sorting-and-paring-fastq-files/)
written in Python that you can try.

## Citing fastq_pair

89 changes: 89 additions & 0 deletions external/zlib-1.3.1/.github/workflows/cmake.yml
@@ -0,0 +1,89 @@
name: CMake
on: [push, pull_request]
jobs:
  ci-cmake:
    name: ${{ matrix.name }}
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        include:
          - name: Ubuntu GCC
            os: ubuntu-latest
            compiler: gcc

          # Test out of source builds
          - name: Ubuntu GCC OSB
            os: ubuntu-latest
            compiler: gcc
            build-dir: ../build
            src-dir: ../zlib

          - name: Ubuntu GCC -O3
            os: ubuntu-latest
            compiler: gcc
            cflags: -O3

          - name: Ubuntu Clang
            os: ubuntu-latest
            compiler: clang

          - name: Ubuntu Clang Debug
            os: ubuntu-latest
            compiler: clang
            build-config: Debug

          - name: Windows MSVC Win32
            os: windows-latest
            compiler: cl
            cmake-args: -A Win32

          - name: Windows MSVC Win64
            os: windows-latest
            compiler: cl
            cmake-args: -A x64

          - name: Windows GCC
            os: windows-latest
            compiler: gcc
            cmake-args: -G Ninja

          - name: macOS Clang
            os: macos-latest
            compiler: clang

          - name: macOS GCC
            os: macos-latest
            compiler: gcc-11

    steps:
    - name: Checkout repository
      uses: actions/checkout@v3

    - name: Install packages (Windows)
      if: runner.os == 'Windows'
      run: |
        choco install --no-progress ninja ${{ matrix.packages }}

    - name: Generate project files
      run: cmake -S ${{ matrix.src-dir || '.' }} -B ${{ matrix.build-dir || '.' }} ${{ matrix.cmake-args }} -D CMAKE_BUILD_TYPE=${{ matrix.build-config || 'Release' }}
      env:
        CC: ${{ matrix.compiler }}
        CFLAGS: ${{ matrix.cflags }}

    - name: Compile source code
      run: cmake --build ${{ matrix.build-dir || '.' }} --config ${{ matrix.build-config || 'Release' }}

    - name: Run test cases
      run: ctest -C Release --output-on-failure --max-width 120
      working-directory: ${{ matrix.build-dir || '.' }}

    - name: Upload build errors
      uses: actions/upload-artifact@v3
      if: failure()
      with:
        name: ${{ matrix.name }} (cmake)
        path: |
          **/CMakeFiles/CMakeOutput.log
          **/CMakeFiles/CMakeError.log
        retention-days: 7