This repository was archived by the owner on Jan 14, 2026. It is now read-only.
[WORK IN PROGRESS] - preparsing job recognises uploaded archive better #492
Open
philippbayer wants to merge 6 commits into master from fixUploadZip
Commits by philippbayer:
08e2c4b First piece of work on the preparsing file recognition
276b593 Fixed python True/False with ruby true/false
2485376 Merge branch 'master' into fixUploadZip
9369ad0 Accidentally left part of hack in
610c56f Satisfy hound
b5b38a0 further hound satisfaction
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,5 +1,7 @@ | ||
| # frozen_string_literal: true | ||
| require 'zip' | ||
| require 'zlib' | ||
| require 'rubygems/package' | ||
| require 'digest' | ||
|
|
||
| class Preparsing | ||
|
|
@@ -10,11 +12,40 @@ class Preparsing | |
| def perform(genotype_id) | ||
| genotype = Genotype.find(genotype_id) | ||
|
|
||
| logger.info "Starting preparse" | ||
| biggest = '' | ||
| biggest_size = 0 | ||
| begin | ||
| Zip::File.open(genotype.genotype.path) do |zipfile| | ||
| logger.info "Starting preparse on #{genotype.genotype.path}" | ||
| # First, we need to find out which archive or flat text our uploaded file is! | ||
| # We use the bash tool file for that | ||
| # | ||
| # There are two possible outcomes - file is a collection of files (tar, tar.gz, zip) | ||
| # or file is a single file (ASCII, gz) | ||
| filetype = `file #{genotype.genotype.path}` | ||
| case filetype | ||
| when /ASCII text/ | ||
| logger.info 'File is flat text' | ||
| reader = File.method('open') | ||
| is_collection = false | ||
| when /gzip compressed data, was/ | ||
| reader = Zlib::GzipReader.method('open') | ||
| logger.info 'file is gz' | ||
| is_collection = false | ||
| when /gzip compressed data, last modified/ | ||
| reader = ->(zipfile){ Gem::Package::TarReader.new(Zlib::GzipReader.open(zipfile)) } | ||
|
Collaborator: Space missing to the left of {. |
||
| is_collection = true | ||
| when /POSIX tar archive/ | ||
| logger.info 'File is tar' | ||
| reader = Gem::Package::TarReader.method('new') | ||
| is_collection = true | ||
| when /Zip archive data/ | ||
| logger.info 'File is zip' | ||
| reader = Zip::File.method('open') | ||
| is_collection = true | ||
| end | ||
|
|
||
| if is_collection | ||
| # Find the biggest file in the archive | ||
| biggest = '' | ||
| biggest_size = 0 | ||
| reader.call genotype.genotype.path do |zipfile| | ||
| # find the biggest file, since that's going to be the genotyping | ||
| zipfile.each do |entry| | ||
| if entry.size > biggest_size | ||
|
|
@@ -23,18 +54,19 @@ def perform(genotype_id) | |
| end | ||
| end | ||
|
|
||
| zipfile.extract(biggest,"#{Rails.root}/tmp/#{genotype.fs_filename}.csv") | ||
| system("mv #{Rails.root}/tmp/#{genotype.fs_filename}.csv #{Rails.root}/public/data/#{genotype.fs_filename}") | ||
| logger.info "copied file" | ||
| zipfile.extract(biggest, Rails.root.join('tmp', "#{genotype.fs_filename}.csv")) | ||
| system("mv #{Rails.root.join('tmp', "#{genotype.fs_filename}.csv")} \ | ||
| #{Rails.root.join('public', 'data',genotype.fs_filename)}") | ||
|
Collaborator: Space missing after comma. |
||
| logger.info 'Copied file' | ||
| end | ||
|
|
||
| rescue | ||
| logger.info "nothing to unzip, seems to be a text-file in the first place" | ||
| else | ||
| system("cp #{genotype.genotype.path} \ | ||
|
Collaborator: Please use Rails.root.join('path', 'to') instead. |
||
| #{Rails.root.join('public', 'data', genotype.fs_filename)}") | ||
| end | ||
|
|
||
| # now that they are unzipped, check if they're actually proper files | ||
| file_is_ok = false | ||
| fh = File.open(genotype.genotype.path) | ||
|
Member (Author): How did this ever work?? As written, it parses the original uploaded file, which can be anything before extraction. |
||
| fh = File.open Rails.root.join('public', 'data', genotype.fs_filename) | ||
| l = fh.readline() | ||
| # some files, for some reason, start with the UTF-BOM-marker | ||
| l = l.sub("\uFEFF","") | ||
|
|
||
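A note on the detection step in the diff: the backtick call interpolates the upload path into a shell command, so a path containing spaces or shell metacharacters would be misread, and the `case` has no `else` branch for unrecognised types. The sketch below is one possible way to handle both, not the PR's implementation. `describe_upload` and `classify_upload` are hypothetical names, the regexes are copied from the diff, and `file --brief` is assumed to be available on the host.

```ruby
require 'open3'

# Hypothetical helper: run `file` without going through a shell.
# Each argument is passed to the program as-is, so the path is never
# interpreted by a shell. Raises Errno::ENOENT if `file` is not installed.
def describe_upload(path)
  output, status = Open3.capture2('file', '--brief', path.to_s)
  status.success? ? output.strip : ''
end

# Hypothetical helper: map a `file` description onto [kind, is_collection],
# using the same patterns as the diff plus an explicit fallback.
def classify_upload(description)
  case description
  when /ASCII text/ then [:text, false]
  when /gzip compressed data, was/ then [:gzip, false]
  when /gzip compressed data, last modified/ then [:tar_gz, true]
  when /POSIX tar archive/ then [:tar, true]
  when /Zip archive data/ then [:zip, true]
  else [:unknown, false]
  end
end
```

Keeping the classification separate from the shell call means the mapping can be exercised with plain strings, without real uploads.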
I saw some .docx files as well.. not sure if after extraction or before..
Haha, yeah, some folks aren't great at uploading the right file types. I'd say we should reject anything that's not ASCII post-unzip :)
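The "reject anything that's not ASCII post-unzip" idea could look roughly like the sketch below. `ascii_text?` is a hypothetical helper, not part of the PR. It only samples the start of the file (64 KiB is an arbitrary choice) and tolerates a leading UTF-8 byte-order mark, since the job already strips one from the first line.

```ruby
# Hypothetical helper, not part of the PR.
# The UTF-8 byte-order mark as raw bytes.
UTF8_BOM = "\xEF\xBB\xBF".b

# Returns true when the sampled prefix of the file contains only ASCII bytes
# (ignoring a leading BOM) and no NUL bytes. Empty files are rejected.
def ascii_text?(path, sample_bytes: 64 * 1024)
  sample = File.binread(path, sample_bytes)
  return false if sample.nil?

  sample = sample.delete_prefix(UTF8_BOM)
  return false if sample.empty?

  sample.ascii_only? && !sample.include?("\0")
end
```

A .docx upload is a zip container, so its first bytes already fail this check. Files that are ASCII but not genotype data would still pass and need a separate format check.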
That's true. I also saw a few post-genotype analyses from Promethease.
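On the archive branches in the diff: `extract` is a `Zip::File` method, and as far as I can tell `Gem::Package::TarReader` offers no counterpart and expects an IO object rather than a path, so the tar and tar.gz cases would likely need their own extraction path. A minimal sketch for tar.gz using only the standard library follows. Both method names are hypothetical and this is not the PR's code.

```ruby
require 'rubygems/package'
require 'zlib'

# Hypothetical helper: name of the largest regular file in a .tar.gz,
# or nil if the archive contains no regular files.
def largest_tar_gz_entry(archive_path)
  largest_name = nil
  largest_size = -1
  Zlib::GzipReader.open(archive_path) do |gz|
    Gem::Package::TarReader.new(gz).each do |entry|
      next unless entry.file?

      size = entry.header.size
      if size > largest_size
        largest_name = entry.full_name
        largest_size = size
      end
    end
  end
  largest_name
end

# Hypothetical helper: stream one named entry to `destination` in chunks,
# so a large genotype file is never held in memory at once.
# Returns true if the entry was found and written, false otherwise.
def extract_tar_gz_entry(archive_path, entry_name, destination)
  Zlib::GzipReader.open(archive_path) do |gz|
    Gem::Package::TarReader.new(gz).each do |entry|
      next unless entry.file? && entry.full_name == entry_name

      File.open(destination, 'wb') do |out|
        while (chunk = entry.read(64 * 1024))
          out.write(chunk)
        end
      end
      return true
    end
  end
  false
end
```

The archive is read twice, once to pick the entry and once to copy it, which avoids relying on seeking within a gzip stream.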