The project is divided into four main parts:
- [Preliminary analysis]
- [Classification]
- [Prediction with different algorithms and evaluation]
  - 4.a [Prediction of the expression values of the protein]
  - 4.b [Determination of whether the test performance of the best model found at step (2.) improves if the SOD1_N feature is also used for prediction (and training)]

Programming language: Python, in a Jupyter notebook.

The datasets used for the analysis are the training and test datasets.

- Inspection of classes and parameters/proteins
- Inspection of protein expression distribution
- Missing data
- Extreme values
- Overview of the proteins important for predicting each class
- Feature-to-feature relationships and collinearity
- Check for unbalanced classes
- Division of proteins into groups of low, medium, and high expression for each class
- Important proteins for each pair of biologically meaningful classes
- Clustering: search for structure in the data
- Feature selection
- Comparison of different classification algorithms
- Test the best algorithm on the test set
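
As a rough illustration of three of the inspection steps listed above (missing data, unbalanced classes, and feature-to-feature collinearity), here is a minimal sketch using only the Python standard library. The toy rows, the feature names `PROT_A` and `PROT_B`, and the labels `class_1` and `class_2` are hypothetical placeholders and are not taken from the project's datasets; in a notebook the same checks would more likely be run on a data frame.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy data: two made-up protein features and a class label.
# None marks a missing expression value.
rows = [
    {"PROT_A": 1.0, "PROT_B": 2.0, "label": "class_1"},
    {"PROT_A": 2.0, "PROT_B": 4.0, "label": "class_1"},
    {"PROT_A": 3.0, "PROT_B": 6.0, "label": "class_1"},
    {"PROT_A": None, "PROT_B": 8.0, "label": "class_2"},
    {"PROT_A": 5.0, "PROT_B": 10.0, "label": "class_2"},
]
features = ["PROT_A", "PROT_B"]


def count_missing(rows, features):
    """Return the number of missing (None) values per feature."""
    return {f: sum(1 for r in rows if r[f] is None) for f in features}


def class_counts(rows, label_key="label"):
    """Return the number of samples per class, to spot unbalanced classes."""
    return Counter(r[label_key] for r in rows)


def pearson(xs, ys):
    """Pearson correlation over the pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


missing = count_missing(rows, features)
balance = class_counts(rows)
corr_ab = pearson([r["PROT_A"] for r in rows], [r["PROT_B"] for r in rows])
```

A correlation close to 1 or -1 between two features, as between the two made-up columns here, is the kind of collinearity that the feature selection step would need to take into account.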