Merged
41 commits
893da48
Add base collections to Base flag
BTheDragonMaster Apr 23, 2025
3d98f1c
Add tests for bases
BTheDragonMaster Apr 23, 2025
61e3a9a
Add negative test for Base
BTheDragonMaster Apr 23, 2025
e664d62
Add base pair featurisation
BTheDragonMaster Apr 24, 2025
6e46de7
Add test for zero-padding base
BTheDragonMaster Apr 24, 2025
e5a5a86
Bugfix: prevent aliasing of reference dictionary when one-hot encoding
BTheDragonMaster Apr 24, 2025
df0ea14
Assert that the max stem size is equal to or greater than number of s…
BTheDragonMaster Apr 24, 2025
5fbde28
Max stem size can now be smaller than the stem
BTheDragonMaster Apr 25, 2025
45e1dc8
Add script to obtain feature labels from a feature index
BTheDragonMaster Apr 28, 2025
9dc77f8
Separate embedding of purines and pyrimidines into two features
BTheDragonMaster Apr 28, 2025
167ead4
Backup
BTheDragonMaster Apr 28, 2025
cb57b88
Add __next__ and __iter__ methods to class Sequence (untested)
BTheDragonMaster Apr 28, 2025
ddd8105
Add vector method and tests for class Loop
BTheDragonMaster Apr 28, 2025
6c2a348
Add check and test for odd max loop size
BTheDragonMaster Apr 28, 2025
3d5e59f
Change get_hairpin_parts method to set_hairpin_parts method
BTheDragonMaster Apr 28, 2025
2fe5daf
Call get_basepairs method in __init__ of class Stem
BTheDragonMaster Apr 28, 2025
27d676e
Code cleanup
BTheDragonMaster Apr 28, 2025
6982077
Add hairpin type enum
BTheDragonMaster Apr 28, 2025
5256f52
Bugfix: a tract vector now uses a_tract_size argument for indexing se…
BTheDragonMaster Apr 28, 2025
fd1ba16
Add tests for a-tract and u-tract embeddings
BTheDragonMaster Apr 28, 2025
b0d34bc
Add terminator class (untested)
BTheDragonMaster Apr 28, 2025
f7ea4f3
Parse terminators from termite output (untested)
BTheDragonMaster Apr 28, 2025
4f17651
Add enum for feature type
BTheDragonMaster May 7, 2025
ce8cf73
Bugfix: loop embeddings are now properly aligned
BTheDragonMaster May 7, 2025
f136083
Bugfix: terminator featurisation
BTheDragonMaster May 7, 2025
0ef6ebb
Add machine learning code
BTheDragonMaster May 7, 2025
baea83c
Add parser for DNABERT data
BTheDragonMaster May 7, 2025
d415734
Add scikit-learn to workflow
BTheDragonMaster May 7, 2025
5e9327a
Add code for creating train test splits
BTheDragonMaster May 7, 2025
0ebc707
Add mock test data
BTheDragonMaster May 7, 2025
8d4b128
Add output folder for testing
BTheDragonMaster May 7, 2025
ebbc4a7
Code refactor: move binning and train test split to separate files
BTheDragonMaster May 8, 2025
c8f4848
Turn TE percentage into number from 0-1 when reading in termite data
BTheDragonMaster May 8, 2025
a0919d5
Change DNABert data parser to remove 100 division
BTheDragonMaster May 8, 2025
5532edf
Backup
BTheDragonMaster May 30, 2025
6e11961
Refactor transformer (untested)
BTheDragonMaster Jun 2, 2025
7e1e2ac
Refactor finetuning code (untested)
BTheDragonMaster Jun 3, 2025
324073c
Add script to find best performing model from a folder
BTheDragonMaster Jun 12, 2025
cd72a89
Add custom reduce on plateau scheduler with warmup and early stopping
BTheDragonMaster Jun 12, 2025
dc05457
Add option to add second hidden layer in finetuning head
BTheDragonMaster Jun 16, 2025
adda653
Improve modularity to train on data from different sources
BTheDragonMaster Jul 15, 2025
1 change: 1 addition & 0 deletions .github/workflows/pytest.yml
@@ -36,6 +36,7 @@ jobs:
run: |
python -m pip install --upgrade pip
python -m pip install pytest
python -m pip install scikit-learn==1.6.1
python -m pip install .
- name: Test with pytest
run: |
146 changes: 146 additions & 0 deletions data/sequence_data/chen/test.txt
@@ -0,0 +1,146 @@
GACGAACAATAAGGCCTCCCTAACGGGGGGCCTTTTTTATTGATAACAAAA 0.9943801281330785
CCGGCTCATTGCAGCGAAATAATCCTCTCTTTATCTGCTATACCTGGT 0.7752808988764045
ttcctgacttAAGCGGCGCTGGTTATCCATcggagccatc 0.8427672955974843
accaggtataGCAGATAAAGAGAGGATTATTTCGCTGCaatgagccgg 0.6389891696750902
TAGCGTGCTAACCACGCACGCTATTTTTTGTA 0.5305164319248826
tacaaaaaaaGCCTCCACTGGGAGGCtttcaggcgc 0.4818652849740932
CCTGGTAAGACGCCGCAGCGTCGCATCAGGTTTGTTGAG 0.0
tggaaattaaTGCGCTGGCGGCAATGGCAcagcacagaa 0.0
ttttagctatAAAAAAACCCGCCGAAGCGGGTTTTTTcgaaaattgt 0.5614035087719298
TCGGTTACATGTTCGCATGTAACCGATTATCAAAA 0.0
CGACGATGCCATTCGTGGCATCGTCGTTAAAATAA 0.0
CTCGGTACCAAAGACGAACAATAAGACGCTGAAAAGCGTCTTTTTTCGTTTTGGTCC 0.9960885551122585
GGCGTGAGATTGGAATACAATTTCGCGCCTTTTGTT 0.0990990990990992
CTCGGTACCAAATTCCAGAAAAGAGGCCTCCCGAAAGGGGGGCCTTTTTTCGTTTTGGTCC 0.9973830895245074
GTTGCCATTTGCCCTCCGCTGCGGCGGGGGGCTTTTAACCGGG 0.8516320474777448
caaccatccgAAACCGCTCTCATCCATTCGATGAGAGCGGTTTttttaattac 0.8861047835990888
tggggagactAAGGCAGCCAGATGGCTGCCTTttttacaggt 0.8525073746312685
AAAACTCCAGGCCGGGTACGGTGTTTTACGCCGCATCCGGCATTACAAAAT 0.0
CCTGGTAAGACGCAGATGCGTCTTATCAGGTTTTTTTTT 0.0825688073394496
ctctttgacgGGCCAATAGCGATATTGGCCATTTTTTTagcgcaacat 0.896049896049896
TCGGTTACATGTTCGCATGTAACCGATTTTCTCTG 0.8938428874734607
caatccatgtAAAAAAAGGGCCCTGAAATTCAGGACCCTTTCtggcatcagc 0.45945945945945954
TCGGTTACCGCTTCGGCGGTAACCGATTAAAATAA 0.0
CGACGATATTCGTATCGTCGTTTTTTGGG 0.4117647058823529
TCGGTTACCGCTTCGGCGGTAACCGATTTATTGTC 0.5983935742971889
CTCGGTACCAAATTCCAGAAAAGAGACGCTGAAAAGCGTCTTTTTTTTTTTTGGTCC 0.9946638207043756
CGACGATATTCGTATCGTCGTTTTTTTTT 0.31506849315068497
ggggaaataaACGGCCCATCCATGAGGAATGGGCCGTgaaaggagat 0.8531571218795888
ataacagaaaACTCCCCCGCGAGAAGCGGGGGAGTcgctggttaa 0.07407407407407418
CGACGATGCCATTCGTGGCATCGTCGTTTTGTTGG 0.3939393939393939
tagcaacaaaAAAGCCGACTCACTTGCAGTCGGCTTTctcattttaa 0.8522895125553914
acaagaaaaaAGGCACGTCATCTGACGTGCCttttttattt 0.9921334172435494
cagccactgcTCTGACCACAAGTAATTGTTCAGAttgataaaac 0.0
CCCCGATTTATCGGGGTTTTTTGTTATCTGACTACAGAATAACTGGGCTTTAGGCCCTTTTTTT 0.8593530239099859
TGATCATCAAGGCTTCCTTCGGGAAGCCTTTCTACGTTA 0.8275862068965517
GCCGGATCGGCGCACTGATCCGGCTTTTGCAAC 0.0
AAGACCCCCGCACCGAAAGGTCCGGGGGTTTTTTTT 0.9752413963852439
AAAGTCAAAATGCCCGATCGAGGATCGGGCATTTTTGTAGC 0.765807962529274
tgcgttatttTCGGCACCTTTTATGTAGCGAAGGTGCCGGaatatattct 0.3197278911564626
agttttaacgAAGGGGTGGTTTCACCCCTTttgtctttct 0.7762863534675615
GCCGCCAGTTCCGCTGGCGGCATTTT 0.19999999999999996
CGACGATGTTCGCATCGTCGTTAAAATAA 0.0
cgataaaaaaAGCCTGCCAGATGGCAGGCTatttaataac 0.24812030075187974
CCCGCTTCGGCGGGTTTTTTTTT 0.9582985821517932
AAAAAAAAAAAACACCCTAACGGGTGTTTTTTTTTTTTTGGTCTCCC 0.9964526427811281
CCTGGTAAGACGCGAACAGCGTCGCATCAGGTTTTGCAAC 0.0
TTTAATATGACACCGGACTCCGTTCCTCGATGGGGTCCGGTTGTTTTATTCAC 0.0
GGGCGGTCGAACAGATCGCCCTTGTTGTAT 0.0
AGCAGGAAAGAGTAAGGCTGAACCTTCATGTTCAACCTTACTCTCATTTAC 0.3197278911564626
tattgattatAAAGGGCTTTAATTTTTGGCCCTTTtatttttggt 0.8347107438016529
CCCGCATGTTCGCATGCGGGTTTTTTTTT 0.7252747252747254
CTCGGTACCAAATTCCAGAAAAGAGACGCTGAAAAGCGTCTTTTTTCGTTTTGGTCC 0.9958317702471761
CGACGATATTCGTATCGTCGTTTATTGTC 0.4949494949494949
gtttctcgcgCAGGCGCTGAAAATAGCGCCTGtttttatttc 0.43181818181818177
GATCCAGCCCATTCGTGGGCTGGATCTTAAAATAA 0.2063492063492064
TTCTGTGCTGTGCCATTGCCGCCAGCGCATTAATTTCCA 0.15966386554621848
TCGGTTACATGTTCGCATGTAACCGATTTTTTTTT 0.5515695067264574
AACGAGAAAAGCCAACCTGCGGGTTGGCTTTTTTATGCA 0.9355670103092784
GGGCGGTCAGATGATCGCCCTTTTTTTTT 0.9666332999666333
tcggtcggtcCCCTCGCCCCTCTGGGGAGAGGGttagggtgag 0.33333333333333337
ACGGCCCTGAACAAGGGCCGTTTGTTGTAT 0.0
CCCGCATGTTCGCATGCGGGTTTATTGTC 0.5726495726495726
aaaaatatgaATATATTCCGGCGCTTAATGCCACGCCGGAACATATcgaaatgatg 0.8425196850393701
CCAATTATTGAACACCCGAAAGGGTGTTTTTTTGTTTCTGGTCTCCC 0.9936520027931187
GACGAACAATAACACCCTAACGGGTGTTTTTTTGTTTCTGGTCTCCC 0.983878768337901
GCCGGAGCGGCGCACTGCTCCGGCTTTTGCAAC 0.19999999999999996
TTCCAGAAAAGAGGCCTCCCAAATCGGGGGGCCTTTTTTATTGATAACAAAA 0.9931544359255202
ACGGCCCTGAACAAGGGCCGTTTTTGCAAC 0.6168582375478927
CTCGGTACCAAATTCCAGAAAAGAGACGCTGAAAAGCGTCTTTTTTTATAGCGGTCC 0.9789517996211324
gcgtaaaaaaGCACCTTTTTAGGTGCttttttgtgg 0.8
TTCCGCTGAAGGCGTAATTGTTTAAATAACATTACGCCGCCTGGCCTT 0.7890295358649789
taacgtagaaAGGCTTCCCGAAGGAAGCCttgatgatca 0.0
gaacacatttGTCGGATGCGGCGCGAGCGCCTTATCCGACctacggttcg 0.6515679442508711
CGACGTTCGCGTCGTTTTTTGGG 0.22480620155038766
CAGATTGCTGACAACGTGCGCGTTGTTCATGCCGGA 0.6402877697841727
CCTGGTAAGACGCCGCAGCGTCTTATCAGGTTTTTTGTA 0.2063492063492064
GCCCGGACCAGGCCGCAGGGGGGAAACTCTGCGGCCTTTTTCGTTCTTACT 0.9178981937602627
aataagcaatAACGGTACGACAGCTGTGTCGTGCCGTttgttttttc 0.3464052287581699
atcaaaaaggAGCCGCCTGAGGGCGGCTtctttttgtg 0.9139414802065404
TCTAACTAAAAAGGCCTCCCAAATCGGGGGGCCTTTTTTCTTTTCAACAAAA 0.9814436815735758
CGACGATGCCATTCGTGGCATCGTCGTTTATTGTC 0.5689655172413792
ttgaagataaAAAACCCTCTGTAGTAACAGAGGGTTTTgttcattcat 0.8373983739837398
CCCGCTTCGGCGGGTTATCAAAA 0.0
GATCCAGCTTCGGCTGGATCTTTTCTCTG 0.8990918264379415
GACGAACAATAAGGCCTCCCTTTAGGGGGGGCCTTTTTTATTGATAACAAAA 0.9635302698760029
tccggcaattAAAAAAGCGGCTAACCACGCCGCTTTTTTtacgtctgca 0.9916114419931213
CTCGGTACCAAATTCCAGAAAAGAGGGGAGCGGGAAACCGCTCCCCTTTTTTCGTTTTGGTCC 0.9921550168667137
TCGGTTACCGCTTCGGCGGTAACCGATTTTTTGGG 0.6309963099630996
CGAACCGTAGGTCGGATAAGGCGCTCGCGCCGCATCCGACAAATGTGTTC 0.0825688073394496
TAGCGTGACCGGAGATTCGGTCACGCTATTTTTTTTT 0.4285714285714286
CGCCCGCGAACAGCGGGCGTTTTGCAAC 0.4117647058823529
TCGGTTACCGCTTCGGCGGTAACCGATTTTTTTTT 0.4974874371859297
GATCCAGCCCATTCGTGGGCTGGATCTTTATTGTC 0.6677740863787376
cgcaaaaaaaAGCCAGCCTGTTTCCAGACTGGCttttgtgctt 0.7382198952879582
GGCTCAAAGACCCGCTGCGGCGGGTTTTTTTGTCT 0.896049896049896
GTAACAACGGAAACCGGCCATTGCGCCGGTTTTTTTTGGCCT 0.9847071417647958
tccggcatgaACAACGCGCACGTTGTcagcaatctg 0.8427672955974843
ttagtgcccaGGGTTCCCTCTCACCCTAACCCTCTCCCCGGTGGGGCGAGGGGACTgaccgagcgc 0.8058252427184466
CGATTGAGCCTTCCAGTCCTTCGGGACTGGAATTTTTTTGTT 0.4444444444444444
CGACGATACCATTCGTGGTATCGTCGTTTTCTCTG 0.8287671232876712
AGGCCTCCCCGCAGGGGGGCCTTTTTTTGTA 0.5121951219512195
CGCCCGCGAACAGCGGGCGTTGTTGTAT 0.0
GACGAACAATAAGGCCTCCCAAATCGGGGGGCCTTTTTTCTTTTCAACAAAA 0.9736217356897916
tacttcttacTCGCCCATCTGCAACGGATGGGCGAatttataccc 0.6563573883161512
CCCGCATGCCATTCGTGGCATGCGGGTTAAAATAA 0.0
CGACGATGCCATTCGTGGCATCGTCGTTTTTTGGG 0.6212121212121212
agaaacagcaaacaatccaaaacgccgcgttcagcggcgttttttctgcttttct 0.9596122778675282
ACTATTTTCTAAAGGCGCTTCGGCGCCTTTTTAGTCAGAT 0.7929606625258799
aacggtttatTAGTCTGGAGACGGCAGACTAtcctcttccc 0.9445676274944568
ATGGGAGGCGTTTCGTCGTGTGAAACAGAATGCGAAGACGAACAATAAAGGCCTCCCAAATCGGGGGGCCTTTTTT 0.9532273152478952
cgcaaataacCAGGAGATAAAACCGACCACGGCACCAGGCAGTGACCATGTGGTTTCTTCAtcctcagtaa 0.9559082892416226
aagtcaaaagcctccggtcggaggcttttgacttt 0.8868778280542986
ttcctgatgtAATGCCGGATGACCTTCGTGTCATCCGGCATTtttcttttca 0.9546485260770975
CCAATTATTGAACACCCTAACGGGTGTTTTTTCTTTTCTGGTCTCCC 0.985505145673286
aggccaaaaaAAACCGGCGCAATGGCCGGTTTccgttgttac 0.07407407407407418
CTCGGTACCAAATTCCAGAAAAGAGACGCTGAAAAGCGTCTTTTTATTTTTCGGTCC 0.9420289855072463
catgactaaaAACAGCAGCAGTAAAACAGACCCTACTGCTGTTaaaacaagcg 0.04761904761904767
TACGAATAAACGGCTCAGAAATGAGCCGTTTATTTTTTC 0.9090909090909091
CCAATTATTGAACACCCTAACGGGTGTTTTTATTTTTCTGGTCTCCC 0.9836947660198924
aagatgaacaAAACTAAAGCGCCACAAGGGCGCTTTAGTTTgttttccggt 0.7607655502392344
CTCGGTACCAAAAAAAAAAAAAAAGACGCTGAAAAGCGTCTTTTTTCGTTTTGGTCC 0.9967691910054277
TAATCGGATGCAGGCAGGGGAAGTGTCTGTTTACCCTGCCTGGTCTGATACG 0.4923857868020305
ctgatgaaaaGGTGCCGGATGATGTGAATCATCCGGCACtggattatta 0.4974874371859297
CGACGTTCGCGTCGTTATCAAAA 0.0
CTAAAGCGCCGAACAGGCGCTTTAGTTGTTGTAT 0.14529914529914523
cataaaaaaaGGGCCTAAAGCCCagttattctg 0.2063492063492064
CCCGCTTCGGCGGGTTTTTTGGG 0.5726495726495726
GATCCAGCCCATTCGTGGGCTGGATCTTTTGTTGG 0.4382022471910112
AACTCCGCTGTTGCCCTGTTTCAGGGCAATTTTGCAACC 0.6047430830039525
TTTTCGAAAAAAGGCCTCCCAAATCGGGGGGCCTTTTTTATTGATAACAAAA 0.9967320261437909
acgcgtacaaCCGCGTGGGGAGACGACGCGGatttttaact 0.0
TAGCGTGACCGGCGCATCGGTCACGCTATTTGTTGAG 0.0
AGGCCTCCCAGATGGGGGGCCTTTTTTTTTT 0.9242424242424242
ACGGCCCTAGATAGGGCCGTTTTTTTTTT 0.9651567944250871
CGACGATACCATTCGTGGTATCGTCGTTAAAATAA 0.0
CCCGCTTCGGCGGGTTTTCTCTG 0.8986828774062816
GACGAACAATAAGGCCGCAAATCGCGGCCTTTTTTATTGATAACAAAA 0.9921911603935655
cataaaaaaaCCCGCTTGCGCGGGctttttcaca 0.16666666666666663
gtattcgcgcACCCCGGTCTAGCCGGGGTCATTTTTTagtggctttt 0.8366013071895425
TAGCGTGACCGGCGCATCGGTCACGCTATTTTGCAAC 0.0
aacgcatgagAAAGCCCCCGGAAGATCACCTTCCGGGGGCTTTtttattgcgc 0.9939246658566221
TTCCAGAAAAGACACCCTAACGGGTGTTTTTTCGTTTTTGGTCTCCC 0.99153403318659
ATCTCTCTACGCCCTCACCCGTACAGGGTGAGGGCAATAATCTTT 0.7429305912596401
CCCGCACTTAACCCGCTTCGGCGGGTTTTTGTTTTT 0.9511957052220595
CCTGGTAAGACGCTAACCACGCGTCTTATCAGGTTGTTGTAT 0.0
GCCGGAGCGGTAACCACCTGCTCCGGCTTTTTTGTA 0.5305164319248826
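
Each line of `data/sequence_data/chen/test.txt` above holds a nucleotide sequence followed by a numeric label. A minimal sketch of how such a file might be read is shown below. It assumes the two columns are separated by whitespace and that the label is a fraction between 0 and 1 (consistent with the commit "Turn TE percentage into number from 0-1 when reading in termite data", though the diff itself does not say what the label measures). The function name `parse_labelled_sequences` is hypothetical and is not taken from the repository. The diff does not state what the lowercase stretches in some sequences mean, so the sketch leaves letter case untouched.

```python
from typing import List, Tuple


def parse_labelled_sequences(text: str) -> List[Tuple[str, float]]:
    """Parse lines of the form '<sequence> <label>' into (sequence, label) pairs.

    Hypothetical helper: the column meaning and the 0-1 range are assumptions.
    """
    records: List[Tuple[str, float]] = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            # Skip blank lines rather than treating them as errors.
            continue
        fields = line.split()  # splits on spaces or tabs
        if len(fields) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 columns, found {len(fields)}"
            )
        sequence, label = fields[0], float(fields[1])
        if not 0.0 <= label <= 1.0:
            raise ValueError(f"line {line_number}: label {label} outside [0, 1]")
        records.append((sequence, label))
    return records


# Two lines copied from the data file above.
example = (
    "GACGAACAATAAGGCCTCCCTAACGGGGGGCCTTTTTTATTGATAACAAAA 0.9943801281330785\n"
    "CCTGGTAAGACGCCGCAGCGTCGCATCAGGTTTGTTGAG 0.0\n"
)
records = parse_labelled_sequences(example)
```

Rejecting labels outside the 0 to 1 range is a design choice for this sketch: it would surface a file that still stores percentages instead of fractions.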