This project is a simple Python exercise for practicing text processing, file handling, and validation logic.
The script performs the following tasks:
- Reads the raw preproinsulin sequence from a text file.
- Cleans the sequence by removing all non-letter characters and converting it to lowercase.
- Splits the cleaned sequence into parts:
  - Signal peptide (24 amino acids)
  - B chain (30 amino acids)
  - C peptide (35 amino acids)
  - A chain (21 amino acids)
- Saves each part to its own file.
- Validates that each file has the correct length.
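The cleaning and splitting steps above could look roughly like the sketch below. It assumes the segments appear in the listed order (signal peptide, B chain, C peptide, A chain) and that "letters" means ASCII letters. The function names, the `SEGMENTS` table, and the toy input are hypothetical and are not taken from the actual script.

```python
import re

# Segment lengths from the task list; the names are illustrative only.
SEGMENTS = [
    ("signal_peptide", 24),
    ("b_chain", 30),
    ("c_peptide", 35),
    ("a_chain", 21),
]


def clean_sequence(raw):
    """Drop every non-letter character and lowercase what remains."""
    return re.sub(r"[^A-Za-z]", "", raw).lower()


def split_sequence(cleaned):
    """Slice the cleaned sequence into consecutive segments."""
    parts = {}
    start = 0
    for name, length in SEGMENTS:
        parts[name] = cleaned[start:start + length]
        start += length
    return parts


# Toy input (not real sequence data): 110 letters mixed with digits,
# spaces, slashes, and a newline, similar to what a raw file may contain.
toy_raw = "1 " + "A" * 24 + " 25\n" + "B" * 30 + "//" + "C" * 35 + " 99 " + "D" * 21
parts = split_sequence(clean_sequence(toy_raw))
print({name: len(seq) for name, seq in parts.items()})
# {'signal_peptide': 24, 'b_chain': 30, 'c_peptide': 35, 'a_chain': 21}
```

The four segment lengths add up to 110, which is the length the full cleaned sequence is validated against.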
To run the script:
- Place the raw sequence file, named `preproinsulin-seq.txt`, in the same directory as the script.
- Run the script: `python analyze-insulin.py`
- Check the output files:
  - `preproinsulin-seq-clean.txt`
  - `lsinsulin-seq-clean.txt`
  - `binsulin-seq-clean.txt`
  - `cinsulin-seq-clean.txt`
  - `ainsulin-seq-clean.txt`
- Review the console output to confirm validation results.
Requirements:
- Python 3.x
- No additional libraries required; the script uses only the `re` module from the Python standard library.
Example console output:
Cleaned sequence: 116 characters
lsinsulin-seq-clean.txt saved with 24 characters.
binsulin-seq-clean.txt saved with 30 characters.
cinsulin-seq-clean.txt saved with 35 characters.
ainsulin-seq-clean.txt saved with 21 characters.
=== Validation ===
❌ preproinsulin-seq-clean.txt: ERROR! 116/110 characters
✅ lsinsulin-seq-clean.txt: OK! 24/24 characters
✅ binsulin-seq-clean.txt: OK! 30/30 characters
✅ cinsulin-seq-clean.txt: OK! 35/35 characters
✅ ainsulin-seq-clean.txt: OK! 21/21 characters
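Validation lines like the ones in this output could be produced by a check along the following lines. This is only a sketch: the `validate_file` helper is hypothetical, and stripping surrounding whitespace before counting is an assumption about how the real script measures length.

```python
import tempfile
from pathlib import Path


def validate_file(path, expected_length):
    """Return a status line comparing the file's length to the expected one."""
    # Assumption: trailing newlines or spaces are not counted.
    actual = len(Path(path).read_text(encoding="utf-8").strip())
    if actual == expected_length:
        mark, status = "✅", "OK"
    else:
        mark, status = "❌", "ERROR"
    return f"{mark} {Path(path).name}: {status}! {actual}/{expected_length} characters"


# Demo against a throwaway file so that no real output file is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "ainsulin-seq-clean.txt"
    demo.write_text("a" * 21, encoding="utf-8")
    ok_line = validate_file(demo, 21)
    bad_line = validate_file(demo, 20)

print(ok_line)   # ✅ ainsulin-seq-clean.txt: OK! 21/21 characters
print(bad_line)  # ❌ ainsulin-seq-clean.txt: ERROR! 21/20 characters
```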
Concepts practiced:
- Regular expressions for data cleaning.
- String slicing to extract subsequences.
- File operations: reading, writing, validating.
- Error handling with try-except.
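As one illustration of the try-except point, reading the input file might be guarded as in the sketch below. The `read_sequence` helper, its messages, and the demo file name are hypothetical; the real script may handle errors differently.

```python
from pathlib import Path


def read_sequence(path):
    """Return the file's text, or None if the file cannot be read."""
    try:
        return Path(path).read_text(encoding="utf-8")
    except FileNotFoundError:
        print(f"File not found: {path}")
    except OSError as exc:
        # Covers other I/O problems, such as permission errors.
        print(f"Could not read {path}: {exc}")
    return None


# A deliberately missing file shows the fallback path.
missing = read_sequence("no-such-file-for-demo.txt")
print(missing)  # None
```

Returning `None` instead of letting the exception propagate lets the caller decide whether to stop or continue; raising would be an equally reasonable design.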
Ícaro Torres — Software development student, always learning and improving Python skills.