Commit 8726af9

Merge pull request #121 from PoonLab/dev
add data files as args fixed issue dropping isSDRM from mutdict
2 parents b96522c + 4e1aa93 commit 8726af9

5 files changed

Lines changed: 227 additions & 54 deletions

File tree

README.md

Lines changed: 45 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -47,7 +47,7 @@ pip install --user .
4747
## Using sierra-local
4848

4949
### Command-line interface (CLI)
50-
Before running, we recommend using the `sierralocal/updater.py` script to update the data files associated with this repository to the most updated versions available from [hivfacts](https://github.com/hivdb/hivfacts/tree/main/data). Please note that you do need the requests package stated above for the following command to run. More information regarding this script is detailed below.
50+
Before running, we recommend using the `sierralocal/updater.py` script to update the data files associated with this repository to the most updated versions available from [hivfacts](https://github.com/hivdb/hivfacts/tree/main/data). Please note that you do need the requests package stated above for the following command to run. More information regarding this script is detailed below. Alternatively, the same update can be triggered through the main function, as described below.
5151
```console
5252
(sierra) will@dyn172-30-75-11 sierra-local % python3 sierralocal/updater.py
5353
Downloading the latest HIVDB XML File
@@ -160,6 +160,50 @@ Writing JSON to file RT_results.json
160160
Time elapsed: 9.3442 seconds (10.751 it/s)
161161
```
162162

163+
To specify alternative files for determining the `isApobecMutation`, `isUnusual`, `isSDRM`, and `primaryType` fields, use the arguments `-apobec_csv`, `-unusual_csv`, `-sdrms_csv`, and `-mutation_csv`, respectively:
164+
```console
165+
(sierra) will@Williams-MacBook-Pro sierra-local % sierralocal -apobec_csv apobecs.csv -unusual_csv rx-all_subtype-all.csv -sdrms_csv sdrms_hiv1.csv -mutation_csv mutation-type-pairs_hiv1.csv RT.fa
166+
searching path /Users/will/miniconda3/envs/sierra/lib/python3.9/site-packages/sierralocal/data/HIVDB*.xml
167+
searching path /Users/will/miniconda3/envs/sierra/lib/python3.9/site-packages/sierralocal/data/apobec_drms.json
168+
HIVdb version 9.8
169+
Using unusual file: rx-all_subtype-all.csv
170+
Using SDRM mutations file: sdrms_hiv1.csv
171+
Using mutation type file: mutation-type-pairs_hiv1.csv
172+
Using APOBEC file: apobecs.csv
173+
Aligning using post-align
174+
Aligned RT.fa
175+
100 sequences found in file RT.fa.
176+
Writing JSON to file RT_results.json
177+
Time elapsed: 9.5917 seconds (10.481 it/s)
178+
```
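Files passed through these arguments are validated against an expected header row by `check_input()` in `sierralocal/main.py` (added in this commit). The following is a minimal sketch of that idea, not the package's implementation: `header_matches` and `EXPECTED_HEADERS` are hypothetical names, and only two of the four header lists from the commit are reproduced.

```python
import csv

# Expected header rows, copied from check_input() in this commit.
EXPECTED_HEADERS = {
    "apobec_csv": ["gene", "position", "aa"],
    "sdrms_csv": ["drug_class", "gene", "position", "aa"],
}


def header_matches(path, key):
    """Hypothetical helper: True if the first CSV row equals the expected header."""
    # utf-8-sig drops a leading byte-order mark, matching how the commit opens these files
    with open(path, newline="", encoding="utf-8-sig") as handle:
        header = next(csv.reader(handle), None)  # None for an empty file
    return header == EXPECTED_HEADERS[key]
```

Checking a custom file this way before a long run can catch a wrong or reordered column list early, since the comparison is an exact, order-sensitive match.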
179+
180+
To update these files while running the script and write them to a chosen directory, use `--forceupdate` together with `-updater_outdir` and the target path. Please note that if you run with `--forceupdate`, you must rerun the installation steps to apply the changes. If you choose a different output directory, you must always specify the new locations of these files; otherwise they will default to the ones found in the `sierralocal/data` folder.
181+
```console
182+
(sierra) will@Williams-MacBook-Pro sierra-local % sierralocal --forceupdate -updater_outdir . RT.fa
183+
Downloading the latest HIVDB XML File
184+
Updated HIVDB XML into ./HIVDB_9.8.xml
185+
Downloading the latest APOBEC DRMS File
186+
Updated APOBEC DRMs into ./apobec_drms.json
187+
Downloading the latest file to determine apobec
188+
Updated apobecs file to ./apobecs.csv
189+
Downloading the latest file to determine is unusual
190+
Updated is unusual file to ./rx-all_subtype-all.csv
191+
Downloading the latest file to determine SDRM mutations
192+
Updated SDRM mutations file to ./sdrms_hiv1.csv
193+
Downloading the latest file to determine mutation type
194+
Updated mutation type file to ./mutation-type-pairs_hiv1.csv
195+
HIVdb version 9.8
196+
Using unusual file: /Users/will/miniconda3/envs/sierra/lib/python3.9/site-packages/sierralocal/data/rx-all_subtype-all.csv
197+
Using SDRM mutations file: /Users/will/miniconda3/envs/sierra/lib/python3.9/site-packages/sierralocal/data/sdrms_hiv1.csv
198+
Using mutation type file: /Users/will/miniconda3/envs/sierra/lib/python3.9/site-packages/sierralocal/data/mutation-type-pairs_hiv1.csv
199+
Using APOBEC file: /Users/will/miniconda3/envs/sierra/lib/python3.9/site-packages/sierralocal/data/apobecs.csv
200+
Aligning using post-align
201+
Aligned RT.fa
202+
100 sequences found in file RT.fa.
203+
Writing JSON to file RT_results.json
204+
Time elapsed: 9.9952 seconds (10.846 it/s)
205+
```
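In the log above, the updated files are written to `.`, yet the "Using ... file" lines still point at the packaged `data` folder, because no `-*_csv` arguments were given. A minimal sketch of that fallback, modelled on the pattern added to `sierralocal/jsonwriter.py` in this commit; `resolve_data_file` is a hypothetical helper name and its error message is simplified:

```python
import os
from pathlib import Path


def resolve_data_file(user_path, default_name, data_dir):
    """Hypothetical helper: use the user-supplied file, else the packaged default."""
    if user_path is None:
        # no argument given: fall back to the copy in the data folder
        return str(Path(data_dir) / default_name)
    if os.path.isfile(user_path):  # ensure the path is an existing file
        return user_path
    raise FileNotFoundError(
        "File cannot be found at user specified path {}".format(user_path))
```

Under this reading, using the freshly downloaded copies in a later run would mean passing them explicitly, for example `-apobec_csv ./apobecs.csv`.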
206+
163207
### As a Python module
164208
If you have downloaded the package source to your computer, you can also run *sierra-local* as a Python module from the root directory of the package. In the following example, we are calling the main function of *sierra-local* from an interactive Python session:
165209
```console

sierralocal/hivdb.py

Lines changed: 8 additions & 3 deletions
@@ -12,14 +12,19 @@ class HIVdb():
1212
webserver, to retrieve the rules-based prediction algorithm as ASI XML,
1313
and convert this information into Python objects.
1414
"""
15-
def __init__(self, asi2=None, apobec=None, forceupdate=False):
15+
def __init__(self, asi2=None, apobec=None, forceupdate=False, updater_outdir=None):
1616
self.xml_filename = None
1717
self.json_filename = None
1818

1919
if forceupdate:
2020
import sierralocal.updater as updater
21-
self.xml_filename = updater.update_HIVDB()
22-
self.json_filename = updater.update_APOBEC()
21+
self.xml_filename = updater.update_hivdb(updater_outdir)
22+
self.json_filename = updater.update_apobec_mutation(updater_outdir)
23+
self.apobec_csv = updater.update_apobec(updater_outdir)
24+
self.is_unusual_csv = updater.update_is_unusual(updater_outdir)
25+
self.sdrms_csv = updater.update_sdrms(updater_outdir)
26+
self.mutation_type_csv = updater.update_mutation_type(updater_outdir)
27+
2328
else:
2429
self.set_hivdb_xml(asi2)
2530
self.set_apobec_json(apobec)

sierralocal/jsonwriter.py

Lines changed: 52 additions & 9 deletions
@@ -10,7 +10,7 @@
1010

1111

1212
class JSONWriter():
13-
def __init__(self, algorithm):
13+
def __init__(self, algorithm, apobec_csv, unusual_csv, sdrms_csv, mutation_csv):
1414
# possible alternative drug abbrvs
1515
self.names = {'3TC': 'LMV'}
1616

@@ -39,7 +39,17 @@ def __init__(self, algorithm):
3939
self.rt_comments = dict(csv.reader(rt_file, delimiter='\t'))
4040

4141
# make dictionary for isUnusual
42-
dest = str(Path(os.path.dirname(__file__)) / 'data' / 'rx-all_subtype-all.csv')
42+
if unusual_csv is None:
43+
dest = str(Path(os.path.dirname(__file__)) / 'data' / 'rx-all_subtype-all.csv')
44+
else:
45+
if os.path.isfile(unusual_csv): # Ensure is a file
46+
dest = unusual_csv
47+
else:
48+
raise FileNotFoundError(
49+
"Path to CSV file to determine if is unusual cannot be found at user specified "
50+
"path {}".format(unusual_csv))
51+
print("Using unusual file: "+dest)
52+
4353
with open(dest, 'r', encoding='utf-8-sig') as is_unusual_file:
4454
is_unusual_file = csv.DictReader(is_unusual_file)
4555
self.is_unusual_dic = {}
@@ -54,7 +64,17 @@ def __init__(self, algorithm):
5464
self.is_unusual_dic[gene].update({pos: {}})
5565
self.is_unusual_dic[gene][pos].update({aa: unusual})
5666

57-
dest = str(Path(os.path.dirname(__file__)) / 'data' / 'sdrms_hiv1.csv')
67+
if sdrms_csv is None:
68+
dest = str(Path(os.path.dirname(__file__)) / 'data' / 'sdrms_hiv1.csv')
69+
else:
70+
if os.path.isfile(sdrms_csv): # Ensure is a file
71+
dest = sdrms_csv
72+
else:
73+
raise FileNotFoundError(
74+
"Path to CSV file to determine SDRM mutations cannot be found at user specified "
75+
"path {}".format(sdrms_csv))
76+
print("Using SDRM mutations file: "+dest)
77+
5878
with open(dest, 'r', encoding='utf-8-sig') as sdrm_files:
5979
sdrm_files = csv.DictReader(sdrm_files)
6080
self.sdrm_dic = {}
@@ -86,7 +106,17 @@ def __init__(self, algorithm):
86106
self.apobec_drm_dic[gene][position] += aa
87107

88108
# make dictionary for primary type
89-
dest = str(Path(os.path.dirname(__file__)) / 'data' / 'mutation-type-pairs_hiv1.csv')
109+
if mutation_csv is None:
110+
dest = str(Path(os.path.dirname(__file__)) / 'data' / 'mutation-type-pairs_hiv1.csv')
111+
else:
112+
if os.path.isfile(mutation_csv): # Ensure is a file
113+
dest = mutation_csv
114+
else:
115+
raise FileNotFoundError(
116+
"Path to CSV file to determine mutation type cannot be found at user specified "
117+
"path {}".format(mutation_csv))
118+
119+
print("Using mutation type file: "+dest)
90120
with open(dest, 'r', encoding='utf-8-sig') as mut_type_pairs1_files:
91121
mut_type_pairs1_files = csv.DictReader(mut_type_pairs1_files)
92122
self.primary_type_dic = {}
@@ -102,7 +132,16 @@ def __init__(self, algorithm):
102132
self.primary_type_dic[gene][pos].update({aa: mut})
103133

104134
# make dictionary for apobec mutations
105-
dest = str(Path(os.path.dirname(__file__)) / 'data' / 'apobecs.csv')
135+
if apobec_csv is None:
136+
dest = str(Path(os.path.dirname(__file__)) / 'data' / 'apobecs.csv')
137+
else:
138+
if os.path.isfile(apobec_csv): # Ensure is a file
139+
dest = apobec_csv
140+
else:
141+
raise FileNotFoundError(
142+
"Path to CSV file with APOBEC cannot be found at user specified "
143+
"path {}".format(apobec_csv))
144+
print("Using APOBEC file: "+dest)
106145
with open(dest, 'r', encoding='utf-8-sig') as apobec_mutations:
107146
apobec_mutations = csv.DictReader(apobec_mutations)
108147
self.apobec_mutations_dic = {}
@@ -295,9 +334,11 @@ def format_aligned_gene_sequences(self, ordered_mutation_list,
295334
mutation[3])
296335
check_sdrm, sdrm_aas = self.is_sdrm(gene,
297336
mutation[0],
298-
mutation[1])
299-
300-
if check_sdrm:
337+
mutation[1],
338+
mutation[3])
339+
mutdict['isSDRM'] = check_sdrm
340+
341+
if check_sdrm:
301342
dic['SDRMs'].append({'text': mutation[2] + str(mutation[0]) + sdrm_aas})
302343

303344
mutdict['hasStop'] = self.has_stop(mutation, mutation[3])
@@ -470,14 +511,16 @@ def is_apobec_drm(self, gene, consensus, position, AA):
470511
return True
471512
return False
472513

473-
def is_sdrm(self, gene, position, AA):
514+
def is_sdrm(self, gene, position, AA, text):
474515
"""
475516
see if specific amino acid mutation is a sdrm through checking hivdb facts
476517
@param gene: str, RT, IN, PR
477518
@param position: int, position of mutation relative to POL
478519
@param AA: new amino acid
479520
@return: bool
480521
"""
522+
if text == 'X':
523+
return False, ''
481524
position = str(position)
482525
all_aas = ''
483526
found = False

sierralocal/main.py

Lines changed: 77 additions & 5 deletions
@@ -3,13 +3,14 @@
33
import time
44
import argparse
55
import json
6+
from pathlib import Path
7+
import csv
68

79
from sierralocal import score_alg
810
from sierralocal.hivdb import HIVdb
911
from sierralocal.jsonwriter import JSONWriter
1012
from sierralocal.nucaminohook import NucAminoAligner
1113

12-
1314
def score(filename, xml_path=None, tsv_path=None, forceupdate=False, do_subtype=False, program='post'): # pragma: no cover
1415
"""
1516
Functionality as a Python module. Can import this function from sierralocal.
@@ -123,7 +124,8 @@ def scorefile(input_file, algorithm, do_subtype=False, program='post'):
123124
file_genes, sequence_lengths, file_trims, subtypes, na_sequence, ambiguous, gene_order
124125

125126
def sierralocal(fasta, outfile, xml=None, json=None, cleanup=False, forceupdate=False,
126-
program='post', do_subtype=False): # pragma: no cover
127+
apobec_csv=None, unusual_csv=None, sdrms_csv=None, mutation_csv=None,
128+
updater_outdir=None, program='post', do_subtype=False): # pragma: no cover
127129
"""
128130
Contains all initializing and processing calls.
129131
@@ -134,13 +136,17 @@ def sierralocal(fasta, outfile, xml=None, json=None, cleanup=False, forceupdate=
134136
@param json: <optional> str, path to local copy of HIVdb algorithm APOBEC DRM file
135137
@param cleanup: <optional> bool, to delete alignment file
136138
@param forceupdate: <optional> bool, forces sierralocal to update its local copy of the HIVdb algorithm
139+
@param apobec_csv: str <optional>, Path to CSV APOBEC csv file (default: apobecs.csv)
140+
@param unusual_csv: str <optional>, Path to CSV file to determine if is unusual (default: rx-all_subtype-all.csv)
141+
@param sdrms_csv: str <optional>, Path to CSV file to determine SDRM mutations (default: sdrms_hiv1.csv)
142+
@param mutation_csv: str <optional>, Path to CSV file to determine mutation type (default: mutation-type-pairs_hiv1.csv)
137143
@return: tuple, a tuple of (number of records processed, time elapsed initializing algorithm)
138144
"""
139145

140146
# initialize algorithm and jsonwriter
141147
time0 = time.time()
142-
algorithm = HIVdb(asi2=xml, apobec=json, forceupdate=forceupdate)
143-
writer = JSONWriter(algorithm)
148+
algorithm = HIVdb(asi2=xml, apobec=json, forceupdate=forceupdate, updater_outdir=updater_outdir)
149+
writer = JSONWriter(algorithm, apobec_csv, unusual_csv, sdrms_csv, mutation_csv)
144150
time_elapsed = time.time() - time0
145151

146152
# accommodate single file path argument
@@ -188,7 +194,7 @@ def parse_args(): # pragma: no cover
188194
parser.add_argument('fasta', nargs='+', type=str, help='List of input files.')
189195
parser.add_argument('-o', dest='outfile', default=None, type=str, help='Output filename.')
190196
parser.add_argument('-xml', default=None,
191-
help='<optional> Path to HIVdb ASI2 XML file')
197+
help='<optional> Path to HIVdb ASI2 XML file (default: HIVDB_9.4.xml)')
192198
parser.add_argument('-json', default=None,
193199
help='<optional> Path to JSON HIVdb APOBEC DRM file')
194200
parser.add_argument('--cleanup', action='store_true',
@@ -197,16 +203,80 @@ def parse_args(): # pragma: no cover
197203
help='Forces update of HIVdb algorithm. Requires network connection.')
198204
parser.add_argument('-alignment', default='post', choices=['post', 'nuc'],
199205
help='Alignment program to use, "post" for post align and "nuc" for nucamino')
206+
parser.add_argument('-apobec_csv', default=None,
207+
help='<optional> Path to CSV APOBEC csv file (default: apobecs.csv)')
208+
parser.add_argument('-unusual_csv', default=None,
209+
help='<optional> Path to CSV file to determine if is unusual (default: rx-all_subtype-all.csv)')
210+
parser.add_argument('-sdrms_csv', default=None,
211+
help='<optional> Path to CSV file to determine SDRM mutations (default: sdrms_hiv1.csv)')
212+
parser.add_argument('-mutation_csv', default=None,
213+
help='<optional> Path to CSV file to determine mutation type (default: mutation-type-pairs_hiv1.csv)')
214+
parser.add_argument('-updater_outdir', default=None,
215+
help='<optional> Path to folder to store updated files from updater (default: sierralocal/data folder)')
216+
200217
args = parser.parse_args()
201218
return args
202219

203220

221+
def check_input(apobec_path, unusual_path, sdrms_path, mutation_path):
222+
"""
223+
Check if the input for the files are valid based on the first row of the csv.
224+
225+
apobec_path: path to apobecs.csv
226+
unusual_path: path to rx-all_subtype-all.csv
227+
sdrms_path: path to sdrms_hiv1.csv
228+
mutation_path: path to mutation-type-pairs_hiv1.csv
229+
"""
230+
exp = {
231+
"apobec_csv": ["gene", "position", "aa"],
232+
"unusual_csv": ["gene", "position", "aa", "percent", "count", "total", "reason", "isUnusual"],
233+
"sdrms_csv": ["drug_class", "gene", "position", "aa"],
234+
"mutation_csv": ["strain", "gene", "drugClass", "position", "aas", "mutationType", "isUnusual"],
235+
}
236+
237+
paths = {
238+
"apobec_csv": apobec_path,
239+
"unusual_csv": unusual_path,
240+
"sdrms_csv": sdrms_path,
241+
"mutation_csv": mutation_path,
242+
}
243+
for key, path in paths.items():
244+
if path is None:
245+
continue
246+
try:
247+
with open(path, newline="", encoding="utf-8-sig") as f:
248+
reader = csv.reader(f)
249+
header = next(reader)
250+
except Exception as e:
251+
sys.exit(f"Could not open {key} file '{path}': {e}")
252+
253+
if header != exp[key]:
254+
print(
255+
f"Invalid header in {key} file '{path}'.\n"
256+
f"Expected: {exp[key]}\nFound: {header}"
257+
)
258+
sys.exit()
259+
260+
204261
def main(): # pragma: no cover
205262
"""
206263
Main function called from CLI.
207264
"""
208265
args = parse_args()
209266

267+
# check for valid file inputs
268+
check_input(args.apobec_csv, args.unusual_csv, args.sdrms_csv, args.mutation_csv)
269+
270+
mod_path = Path(os.path.dirname(__file__))
271+
272+
if args.updater_outdir:
273+
target_dir = args.updater_outdir
274+
else:
275+
target_dir = os.path.join(mod_path, "data")
276+
277+
# Create directory if it doesn't exist
278+
os.makedirs(target_dir, exist_ok=True)
279+
210280
# check that FASTA files in list all exist
211281
for file in args.fasta:
212282
if not os.path.exists(file):
@@ -216,6 +286,8 @@ def main(): # pragma: no cover
216286
time_start = time.time()
217287
count, time_elapsed = sierralocal(args.fasta, args.outfile, xml=args.xml,
218288
json=args.json, cleanup=args.cleanup, forceupdate=args.forceupdate,
289+
apobec_csv=args.apobec_csv, unusual_csv=args.unusual_csv,
290+
sdrms_csv=args.sdrms_csv, mutation_csv=args.mutation_csv, updater_outdir=target_dir,
219291
program=args.alignment)
220292
time_diff = time.time() - time_start
221293
