sequenenz_analyzer

Introduction

This is a university project. We were provided with a data set containing several prediction embeddings per protein. In addition, each protein was labelled as having either an exophysiological (exo.) or an endophysiological (endo.) function.

After a standard exploratory data analysis (EDA), we looked for a suitable way to analyse the data. I came up with the idea of analysing all embeddings that provide a prediction for each individual amino acid. The goal was to find a specific amino acid, or a combination of amino acids, that gives strong evidence for an exo or endo function. The idea is to take a new look at each individual amino acid together with the properties predicted for it.

Wrangling with the embeddings

I only looked at the following embeddings:

Amino acid sequence
DSSP3 prediction for each amino acid
DSSP8 prediction for each amino acid
Conservation score for each amino acid
membran_tmbed score for each amino acid
Metal binding per amino acid
Nucleic acid binding per amino acid
Small molecule binding per amino acid

AA:                         V   G   V   L   L   D   I   L   Q   R   ...
dssp3:                      E   L   L   L   H   H   H   H   E   L   ...
dssp8:                      H   H   H   T   T   C   E   E   E   E   ...
conserv_score:              0   3   3   3   4   0   5   0   5   5   ...
membran_tmbed:              S   S   S   o   o   o   o   o   o   o   ...
metal_binding:              M   M   -   M   -   -   -   -   -   -   ...
nucleic_binding:            N   -   -   -   N   -   -   -   -   -   ... 
small_molecules_binding:    -   -   S   -   -   S   -   S   -   S   ...
position:                   0   1   2   3   4   5   6   7   8   9   ...

In the next phase, I constructed a word for each amino acid by combining all the letters at its position. Each word describes the amino acid more precisely. From the example above, I constructed the following words for positions 0, 1 and 2:

0:  VEH0SMN-    |   Translated: Valine, Sheet, alpha-helix, cons:0, Transmembrane-alpha-helix, MetalBinding, NucleicBinding, no SmallMoleculeBinding
1:  GLH3SM--    |   Translated:...
2:  VLH3S--S    |   Translated:...
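
The word construction can be sketched in a few lines of Python. This is only a minimal illustration, not the code used in the repository: the function name `build_words` and the variable `example_tracks` are made up for this example, and it assumes that every track has exactly one character per amino acid (so conservation scores must be single digits).

```python
def build_words(tracks):
    """Combine per-residue annotation tracks into one word per position.

    Assumption: each track is a string with exactly one character
    per amino acid, and all tracks have the same length.
    """
    lengths = {len(track) for track in tracks}
    if len(lengths) != 1:
        raise ValueError("all tracks must have the same length")
    # zip(*tracks) yields one column of characters per position
    return ["".join(column) for column in zip(*tracks)]


# First ten positions of the example protein shown above
example_tracks = [
    "VGVLLDILQR",  # AA
    "ELLLHHHHEL",  # dssp3
    "HHHTTCEEEE",  # dssp8
    "0333405055",  # conserv_score
    "SSSooooooo",  # membran_tmbed
    "MM-M------",  # metal_binding
    "N---N-----",  # nucleic_binding
    "--S--S-S-S",  # small_molecules_binding
]

words = build_words(example_tracks)
print(words[:3])  # ['VEH0SMN-', 'GLH3SM--', 'VLH3S--S']
```

The order of the tracks fixes the meaning of each character in the word, so it has to be the same for every protein.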

In this way, I created for each protein its set of amino-acid words. In addition, I collected the positions at which each word occurs and how often each word occurs per data set.
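
Collecting occurrences and positions could look roughly like the sketch below. It is a hedged illustration under stated assumptions, not the repository's implementation: `collect_word_stats`, the protein identifiers, and the toy input are hypothetical, and the input is assumed to be a mapping from protein ID to that protein's list of words for one data set.

```python
from collections import Counter, defaultdict


def collect_word_stats(proteins):
    """Count how often each word occurs and record where it occurs.

    Assumption: `proteins` maps a protein ID to the list of words of
    that protein, in sequence order, for a single data set.
    """
    occurrences = Counter()
    positions = defaultdict(list)
    for protein_id, words in proteins.items():
        for position, word in enumerate(words):
            occurrences[word] += 1
            positions[word].append((protein_id, position))
    return occurrences, positions


# Toy input with hypothetical protein IDs
toy_proteins = {
    "protein_a": ["VEH0SMN-", "GLH3SM--", "VEH0SMN-"],
    "protein_b": ["GLH3SM--"],
}

occurrences, positions = collect_word_stats(toy_proteins)
print(occurrences["VEH0SMN-"])  # 2
print(positions["GLH3SM--"])    # [('protein_a', 1), ('protein_b', 0)]
```

Running this once on the exo proteins and once on the endo proteins would give per-data-set counts that can then be compared.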
