Enhanced FastQ-Pair Tool with New Features and Improvements #23


Open

f-huber wants to merge 8 commits into linsalrob:master from f-huber:master

CMakeLists.txt

@@ -3,6 +3,17 @@ project(fastq_pair)
     set(CMAKE_C_STANDARD 99)
+    # Add the zlib library from the external folder
+    add_subdirectory(external/zlib-1.3.1)
+    # List your source files
     set(SOURCE_FILES main.c robstr.c fastq_pair.c is_gzipped.c is_gzipped.h)
+    # Add the executable for your project
     add_executable(fastq_pair ${SOURCE_FILES})
-    install (TARGETS fastq_pair DESTINATION bin)
+    # Link zlib (built locally) with your executable
+    target_link_libraries(fastq_pair PRIVATE zlibstatic)
+    # Installation configuration
+    install(TARGETS fastq_pair DESTINATION bin)

README.md

            
                  
    @@ -1,6 +1,25 @@
  
    [![Edwards Lab](https://img.shields.io/badge/Bioinformatics-EdwardsLab-03A9F4)](https://edwards.sdsu.edu/research)

    [![DOI](https://zenodo.org/badge/98881309.svg)](https://zenodo.org/badge/latestdoi/98881309)

    [![Build Status](https://travis-ci.org/linsalrob/fastq-pair.svg?branch=master)](https://travis-ci.org/linsalrob/fastq-pair)

    ## Acknowledgments

    This project is based on the original implementation by [Edwards Lab](https://github.com/linsalrob/fastq-pair). Their work laid the foundation for this repository.

    ## Modifications and Enhancements

    ### Added support for gzipped FastQ files

    - The tool now fully supports input and output of `.gz` compressed FastQ files.

    - This addition allows seamless handling of large datasets, avoiding the need for users to uncompress and re-compress FastQ files.

    - This reduces disk space usage and improves overall efficiency.

    ### Added option for entry deduplication

    - Introduced an optional feature (`-d`) to remove duplicated entries from each file, based on the entry identifiers.

    ### Added option for identifier reformatting

    - Introduced an option (`-f`) to reduce sequence identifiers to the minimal identifier (the text before the first space), for better compatibility with some downstream analysis tools.

    - Introduced an option (`-h`) to reformat the sequence identifiers only, without pairing the reads.

    ### Changed the output filename routine

    - The output filenames are now generated by retaining only the basename and removing the extension.

    - This provides clearer, more consistent file naming and avoids redundancy.
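The naming rule described above can be sketched in shell. This is only an illustration of the pattern seen in the test outputs listed in this README (`left.paired.fastq`, `left.single.fastq`, and their `.gz` variants); `output_names` is a hypothetical helper, not part of `fastq_pair`, and it decides the `.gz` suffix per file, whereas the tool gzips all outputs if any input is gzipped.

```shell
# output_names FILE: print the paired and single output names that the
# naming rule would give for FILE (hypothetical helper, for illustration).
output_names() {
    local name suffix=""
    name=$(basename "$1")            # keep only the basename
    case "$name" in
        *.gz) suffix=".gz"; name=${name%.gz} ;;   # remember gzip inputs
    esac
    name=${name%.*}                  # drop the last extension (.fastq, .fq, ...)
    printf '%s.paired.fastq%s\n%s.single.fastq%s\n' \
        "$name" "$suffix" "$name" "$suffix"
}
```

For example, `output_names test/left.fastq.gz` prints `left.paired.fastq.gz` and `left.single.fastq.gz`.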

    # FASTQ PAIR

    @@ -37,6 +56,11 @@ out how many sequences there are in your fastq file:
  
    ```

    wc -l fastq_filename

    ```

    or, for gzipped files:

    ```

    zcat fastq_filename.gz | wc -l

    ```

    The number of sequences will be the number printed here, divided by 4.
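As a convenience, the division by 4 can be wrapped in a small shell function. This is a sketch that assumes every record occupies exactly four lines; `count_reads` is a hypothetical name, and it uses `gzip -dc` instead of `zcat` because `zcat` behaves differently on some systems.

```shell
# count_reads FILE: print the number of FASTQ records in FILE.
# Assumes four lines per record (no wrapped sequence lines).
count_reads() {
    local file=$1 lines
    case "$file" in
        *.gz) lines=$(gzip -dc "$file" | wc -l) ;;   # gzipped input
        *)    lines=$(wc -l < "$file") ;;            # plain-text input
    esac
    echo $(( lines / 4 ))
}
```

The result is a reasonable starting point when choosing a value for `-t`.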

    _Note_: If you get an error that looks like 

    @@ -55,26 +79,12 @@ need to increase the value you provide to `-t`. If most of the entries are zero,
  
    As an aside, this code is also _really_ slow if _none_ of your sequences are paired. You should most likely use this

    after taking a peek at your files and making sure there are at least _some_ paired sequences in your files!

    ## Installing fastq_pair

    We recommend installing fastq-pair using [bioconda](https://bioconda.github.io/recipes/fastq-pair/README.html)

    ```

    mamba install -c bioconda fastq-pair

    ```

    or in its own environment:

    ```

    mamba create --name fastq-pair -c bioconda fastq-pair

    ```

    ### Installing from source

    To install the code, grab the github repository, then make a build directory:

    ```$xslt

    mkdir build && cd build

    cmake3 ..

    cmake ..

    make && sudo make install

    ```

    There are more instructions on the [installation](INSTALLATION.md) page.

    @@ -99,52 +109,68 @@ You can also print out the number of elements in each bucket using the `-p` para
  
    fastq_pair -p -t 100 file1.fastq file2.fastq

    ```

    You can also de-duplicate your entries using the `-d` parameter. This removes duplicated entries, detected by their identifier, from each fastq file. Please note that this doubles the amount of memory used:

    ```$xslt

    fastq_pair -d file1.fastq file2.fastq

    ```
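To illustrate what identifier-based deduplication means, here is a rough `awk` equivalent wrapped in a shell function. It is not how `fastq_pair` implements `-d`; it assumes four-line records, keeps the first occurrence of each identifier, and treats the text before the first whitespace as the identifier (the tool may compare a different part of the header). `dedup_fastq` is a hypothetical name.

```shell
# dedup_fastq FILE: print FILE with repeated identifiers removed,
# keeping the first record seen for each identifier.
dedup_fastq() {
    awk '
        NR % 4 == 1 {                # header line of a four-line record
            id = $1                  # identifier up to the first whitespace
            keep = !(id in seen)     # keep only the first occurrence
            seen[id] = 1
        }
        keep                         # print every line of a kept record
    ' "$1"
}
```

Like `-d`, this sketch holds every identifier in memory, so it is only practical for files of moderate size.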

    You can also reformat your entry identifiers, leaving only the minimal identifier (the text before the first space), using the `-f` parameter. Note that this should not be used with the `-s` parameter:

    ```$xslt

    fastq_pair -f file1.fastq file2.fastq

    ```
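The effect on the header lines can be previewed with a short `awk` sketch. It assumes four-line records and is only an approximation of the reformatting (it also cuts at tabs, not just spaces); `minimal_ids` is a hypothetical name, not a `fastq_pair` command.

```shell
# minimal_ids FILE: print FILE with each header line cut at the first
# whitespace; sequence, separator and quality lines are left unchanged.
minimal_ids() {
    awk 'NR % 4 == 1 { print $1; next } { print }' "$1"
}
```

A header such as `@r1 1:N:0:ACGT` becomes `@r1`.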

    You can also reformat your entry identifiers ONLY, leaving just the minimal identifier (the text before the first space), using the `-h` parameter. Note that this should not be used with the `-s` parameter, and that this option will NOT deduplicate your reads or match reads between files. This is faster and requires less memory:

    ```$xslt

    fastq_pair -h file1.fastq file2.fastq

    ```

    ## Testing fastq_pair

    In the [test](test/) directory there are two fastq files that you can use to test `fastq_pair`. There are 250 sequences

    in the [left](test/left.fastq) file and 75 sequences in the [right](test/right.fastq) file. Only 50 sequences are common

    between the two files.

    In the [test](test/) directory there are two fastq files that you can use to test `fastq_pair`. There are 251 sequences

    in the [left](test/left.fastq) file and 78 sequences in the [right](test/right.fastq) file. Only 50 sequences are common

    between the two files; the [left](test/left.fastq) contains 1 duplicate entry and the [right](test/right.fastq) contains 

    3 duplicated entries. In addition, the [test](test/) directory also contains the gzipped version of both files.

    You can test the code with:

    ```$xslt

    fastq_pair -t 1000 test/left.fastq test/right.fastq

    fastq_pair -d -t 1000 test/left.fastq test/right.fastq

    ```

    This will make four files in the [test/](test) directory:

    - left.fastq.paired.fq

    - left.fastq.single.fq

    - right.fastq.paired.fq

    - right.fastq.single.fq

    - left.paired.fastq

    - left.single.fastq

    - right.paired.fastq

    - right.single.fastq

    The _paired_ files have 50 sequences each, and the two _single_ files have 200 and 25 sequences (left and right respectively).

    ### A note about gzipped fastq files

    Unfortunately `fastq_pair` doesn't work with gzipped files at the moment, because it relies heavily on random access of

    the file stream. That is complex with gzipped files, especially when the uncompressed file exceeds available memory

    (which is exactly the situation that `fastq_pair` was designed to handle).

    `fastq_pair` also works with gzipped files. Gzipped files are read using the zlib library, a copy of which is included in the [external](external) folder and built during installation. Note that if any of the fastq files provided is gzipped, the output files will also be gzipped.

    Therefore, at this time, `fastq_pair` does not support gzipped files. You need to uncompress the files before using

    `fastq_pair`.

    Simply provide your gzipped files to `fastq_pair`.

    If you really need to use gzipped files, and can accept slightly worse performance, then

    [we have some alternative](https://edwards.sdsu.edu/research/sorting-and-paring-fastq-files/) approaches

    written in Python that you can try.

    You can test the code with:

    ### Testing for gzipped files ([issue #6](https://github.com/linsalrob/fastq-pair/issues/6))

    ```$xslt

    fastq_pair -d -t 1000 test/left.fastq.gz test/right.fastq.gz

    ```

    We take a peek at the first couple of bytes in the file to see if the file is gzip compressed. Per the standard, the

    files should start 0x1F and 0x8B as the first two bytes. There is a small tester for the gzip program, called `test_gzip.c`,

    that takes a single argument and reports whether it is gzipped or not. You can compile that tester with the command:

    This will make four files in the [test/](test) directory:

    - left.paired.fastq.gz

    - left.single.fastq.gz

    - right.paired.fastq.gz

    - right.single.fastq.gz

    ```

    gcc -std=gnu99  -o testgz ./test_gzip.c  is_gzipped.c

    ```
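The same two-byte check can be done from the shell without compiling anything. This sketch mirrors the idea behind the C tester but is not that code; `is_gzipped` here is a hypothetical shell function that only inspects the magic bytes `0x1F 0x8B` and does not validate the rest of the stream.

```shell
# is_gzipped FILE: succeed (exit status 0) if FILE starts with the
# gzip magic bytes 0x1F 0x8B, and fail otherwise.
is_gzipped() {
    local magic
    # Read the first two bytes and render them as hex, e.g. "1f8b".
    magic=$(head -c 2 < "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "1f8b" ]
}
```

It can be used in a condition, for example `if is_gzipped reads.fastq.gz; then echo gzipped; fi`.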

    The results should be identical to the non-gzipped fastq files.

    We now test both files and exit (hopefully gracefully) if either is gzip compressed. The easiest solution is to

    uncompress your files, and we recommend and love [pigz](https://zlib.net/pigz/) because it is awesome!

    Alternatively, [we have alternative](https://edwards.sdsu.edu/research/sorting-and-paring-fastq-files/) approaches

    written in Python that you can try.

    ## Citing fastq_pair

external/zlib-1.3.1/.github/workflows/cmake.yml

@@ -0,0 +1,89 @@
+    name: CMake
+    on: [push, pull_request]
+    jobs:
+      ci-cmake:
+        name: ${{ matrix.name }}
+        runs-on: ${{ matrix.os }}
+        strategy:
+          fail-fast: false
+          matrix:
+            include:
+              - name: Ubuntu GCC
+                os: ubuntu-latest
+                compiler: gcc
+              # Test out of source builds
+              - name: Ubuntu GCC OSB
+                os: ubuntu-latest
+                compiler: gcc
+                build-dir: ../build
+                src-dir: ../zlib
+              - name: Ubuntu GCC -O3
+                os: ubuntu-latest
+                compiler: gcc
+                cflags: -O3
+              - name: Ubuntu Clang
+                os: ubuntu-latest
+                compiler: clang
+              - name: Ubuntu Clang Debug
+                os: ubuntu-latest
+                compiler: clang
+                build-config: Debug
+              - name: Windows MSVC Win32
+                os: windows-latest
+                compiler: cl
+                cmake-args: -A Win32
+              - name: Windows MSVC Win64
+                os: windows-latest
+                compiler: cl
+                cmake-args: -A x64
+              - name: Windows GCC
+                os: windows-latest
+                compiler: gcc
+                cmake-args: -G Ninja
+              - name: macOS Clang
+                os: macos-latest
+                compiler: clang
+              - name: macOS GCC
+                os: macos-latest
+                compiler: gcc-11
+        steps:
+        - name: Checkout repository
+          uses: actions/checkout@v3
+        - name: Install packages (Windows)
+          if: runner.os == 'Windows'
+          run: |
+            choco install --no-progress ninja ${{ matrix.packages }}
+        - name: Generate project files
+          run: cmake -S ${{ matrix.src-dir || '.' }} -B ${{ matrix.build-dir || '.' }} ${{ matrix.cmake-args }} -D CMAKE_BUILD_TYPE=${{ matrix.build-config || 'Release' }}
+          env:
+            CC: ${{ matrix.compiler }}
+            CFLAGS: ${{ matrix.cflags }}
+        - name: Compile source code
+          run: cmake --build ${{ matrix.build-dir || '.' }} --config ${{ matrix.build-config || 'Release' }}
+        - name: Run test cases
+          run: ctest -C Release --output-on-failure --max-width 120
+          working-directory: ${{ matrix.build-dir || '.' }}
+        - name: Upload build errors
+          uses: actions/upload-artifact@v3
+          if: failure()
+          with:
+            name: ${{ matrix.name }} (cmake)
+            path: |
+              **/CMakeFiles/CMakeOutput.log
+              **/CMakeFiles/CMakeError.log
+            retention-days: 7
