xcftools dealing with empty xcf.bin (no markers) #14

@gtdoctor

Hi Olivier and Simone,

I have been using xcftools as part of the IMPUTE5 UKB imputation pipeline.
As previously mentioned, I had been experimenting with creating the xcf reference panel from phased data using approx. 2 Mb "little chunks" (each given chunk subdivided into 20 smaller chunks), and then using xcftools concat --naive to stitch them back together into the "big chunk". (The point is to run xcf conversion on "little chunks" on low-priority (cheap) workers for a short c. 1-1.5 h duration without risking being kicked off the worker and losing the work, saving $$.) While this appeared to work (no failure notice at either xcf conversion or xcf concatenation), some of the resulting "big chunks" were malformed, as I discovered when I tried to impute against them and got segmentation fault errors.

I have partially diagnosed the issue and thought I would share it with you.

Many of the "little chunk" xcfs have empty genotype (.bin) files, because there are no variants in range and passing filters to convert. These cause trouble when concatenating.

Take the case where I have a folder $DIR which contains a mix of normal and 'empty' xcf .bin files:

# list all xcf.bcf files
ls -1v $DIR/*bcf > allxcfs.list
  1. Using `xcftools concat --naive --input allxcfs.list --output combined.xcf1.bcf` appears to create a well-formed "big chunk" xcf, but the file is malformed and can't be converted back to a regular bcf or imputed against.
# list only non-empty xcf.bcf files 
ls -1vs "/mnt/project/${inp_pfx}"*.bin | awk '$1 > 0 { sub(/\.bin$/, ".bcf", $2); print $2 }' > normalxcfonly.list
  1. Using `xcftools concat --naive --input normalxcfonly.list --output combined.xcf2.bcf` creates a well-formed xcf that can be converted back to a regular bcf and imputed against.
  2. However, using `xcftools concat --ligate --input normalxcfonly.list --output combined.xcf2.bcf` errors out, saying that one of the well-formed files (subchunkxcf.bcf) cannot be opened. Notably, this doesn't happen if subchunkxcf.bcf is the first xcf.bcf file in the list. The file itself is well-formed; I think the error might occur because its buffer does not overlap with the previous file in the list (the one that would have preceded it has been excluded).
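For anyone wanting to reproduce the filtering step, here is a minimal sketch of an alternative way to build the non-empty list. It tests each file's byte size with `[ -s ]` instead of parsing the block counts printed by `ls -s`, which can vary by filesystem. The function name `list_nonempty_xcfs` is hypothetical, and the sketch assumes each xcf is a `<prefix>.bcf` / `<prefix>.bin` pair in the same directory.

```shell
# Hypothetical helper: print the .bcf path of every xcf whose companion
# .bin file is non-empty. Assumes <prefix>.bcf and <prefix>.bin sit side by side.
list_nonempty_xcfs() {
    dir=$1
    for bin in "$dir"/*.bin; do
        # Skip the unexpanded glob pattern when the directory has no .bin files.
        [ -e "$bin" ] || continue
        # -s is true only if the file exists and its size is greater than zero.
        if [ -s "$bin" ]; then
            printf '%s\n' "${bin%.bin}.bcf"
        fi
    done
}

# Possible usage (sort -V is GNU coreutils version sort, comparable to ls -v):
#   list_nonempty_xcfs "$DIR" | sort -V > normalxcfonly.list
```

Note that this only works around the `--naive` case; it does not address the `--ligate` failure described in point 2.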

These behaviours were observed using the current release version, but persist (with altered error messages) when testing with the latest xcftools compiled from GitHub.

The troubling thing, from my perspective, was that xcftools view happily (and silently) created an empty xcf.bin file, but then made it hard to use it!
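Until the tool itself warns about this, a pipeline could check the output right after conversion. The sketch below is only an illustration of such a guard: the function name `check_xcf_bin` and its exit codes are hypothetical, and it assumes the path of the .bin file written by the conversion step is known.

```shell
# Hypothetical guard: report loudly when a conversion produced no genotype data.
# Returns 0 for a non-empty .bin, 1 for an empty one, 2 if the file is missing.
check_xcf_bin() {
    bin=$1
    if [ ! -e "$bin" ]; then
        echo "error: $bin was not created" >&2
        return 2
    elif [ ! -s "$bin" ]; then
        echo "warning: $bin is empty (no markers written)" >&2
        return 1
    fi
    return 0
}
```

A caller can then decide whether an empty chunk should be skipped or treated as a failure, instead of finding out at imputation time.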

While the underlying issue is with xcftools, the simplest fix for the UKB imputation pipeline is to update a line in xcftools_concat.sh, so I've forked this.
