Conversation
Reviewer's GuideThis PR introduces a comprehensive feedback notebook that walks through the shark‐attacks dataset cleaning pipeline—covering country normalization, sex and age standardization, injury classification, fatality derivation, and date parsing—alongside a FEEDBACK.md file summarizing high-level repo and style suggestions. Entity relationship diagram for cleaned shark attack dataseterDiagram
SHARK_ATTACKS {
string Country
string State
string Location
string Activity
string Name
string Sex
int Age
string Injury
string Type
string Fatal_YN
string Time
string Species
string Source
int DD
string MM
int YYYY
int MM_clean
string Injury_Severity
}
SHARK_ATTACKS ||--o{ COUNTRY : has
SHARK_ATTACKS ||--o{ SEX : has
SHARK_ATTACKS ||--o{ AGE : has
SHARK_ATTACKS ||--o{ INJURY : has
SHARK_ATTACKS ||--o{ TYPE : has
SHARK_ATTACKS ||--o{ FATAL_YN : has
SHARK_ATTACKS ||--o{ DATE : has
COUNTRY {
string name
}
SEX {
string value
}
AGE {
int value
}
INJURY {
string description
string severity
}
TYPE {
string classification
}
FATAL_YN {
string value
}
DATE {
int DD
string MM
int YYYY
int MM_clean
}
Class diagram for data cleaning pipeline stepsclassDiagram
class SharkAttackRaw {
+string Date
+float Year
+string Type
+string Country
+string State
+string Location
+string Activity
+string Name
+string Sex
+string Age
+string Injury
+string Fatal_YN
+string Time
+string Species
+string Source
}
class SharkAttackCleaned {
+string Country
+string State
+string Location
+string Activity
+string Name
+string Sex
+int Age
+string Injury
+string Type
+string Fatal_YN
+string Time
+string Species
+string Source
+int DD
+string MM
+int YYYY
+int MM_clean
+string Injury_Severity
}
SharkAttackRaw <|-- SharkAttackCleaned : cleaning
class CountryCleaner {
+normalize_country_names()
+drop_nulls()
+replace_aliases()
}
class SexCleaner {
+standardize_sex()
+drop_nulls()
}
class AgeCleaner {
+standardize_age()
+categorize_age()
+drop_nulls()
}
class InjuryClassifier {
+classify_injury()
+classify_severity()
}
class FatalityDeriver {
+derive_fatality()
}
class DateParser {
+parse_date()
+split_date()
+drop_invalid()
}
SharkAttackRaw --> CountryCleaner : uses
SharkAttackRaw --> SexCleaner : uses
SharkAttackRaw --> AgeCleaner : uses
SharkAttackRaw --> InjuryClassifier : uses
SharkAttackRaw --> FatalityDeriver : uses
SharkAttackRaw --> DateParser : uses
CountryCleaner --> SharkAttackCleaned : outputs
SexCleaner --> SharkAttackCleaned : outputs
AgeCleaner --> SharkAttackCleaned : outputs
InjuryClassifier --> SharkAttackCleaned : outputs
FatalityDeriver --> SharkAttackCleaned : outputs
DateParser --> SharkAttackCleaned : outputs
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey there - I've reviewed your changes - here's some feedback:
- Extract recurring cleaning logic (e.g. country normalization, month mapping) into reusable utility functions or modules to DRY up the notebook and improve maintainability.
- Consider breaking the monolithic data‐cleaning notebook into smaller, focused scripts or notebooks so each cleaning step can be reviewed and tested in isolation.
- Use pandas built-in parsing (like pd.to_datetime) and categorical dtypes instead of custom regex and manual mapping for more robust and concise date and category transformations.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Extract recurring cleaning logic (e.g. country normalization, month mapping) into reusable utility functions or modules to DRY up the notebook and improve maintainability.
- Consider breaking the monolithic data‐cleaning notebook into smaller, focused scripts or notebooks so each cleaning step can be reviewed and tested in isolation.
- Use pandas built-in parsing (like pd.to_datetime) and categorical dtypes instead of custom regex and manual mapping for more robust and concise date and category transformations.
## Individual Comments
### Comment 1
<location> `FEEDBACK.md:10` </location>
<code_context>
+- Also love how organized the Jupyter notebook is :heart-eyes:. That's a work of art!
+- Like that you used a GDrive link to make sure you all worked with the same raw data but using the same relative path, i.e. `pd.read_csv("../data/raw.csv")` would have achieved the same result!
+- Really thorough, clean, and well-organized data cleaning, colour me impressed with your performance!!
+(there is a meme reference which I'm offerring as a well-done, from the kickboxing world: [A Canadian boxer trying to talk smack](https://www.youtube.com/shorts/RdoFCWhiL9g))
+
+### To fix / improve on:
</code_context>
<issue_to_address>
**issue (typo):** Typo: 'offerring' should be 'offering'.
```suggestion
(there is a meme reference which I'm offering as a well-done, from the kickboxing world: [A Canadian boxer trying to talk smack](https://www.youtube.com/shorts/RdoFCWhiL9g))
```
</issue_to_address>
### Comment 2
<location> `FEEDBACK.md:18` </location>
<code_context>
+- I noticed you did describe the data quite often during the cleaning process, but I didn't see you making any comments on what you were seeing. This is rule #1: Whenever you check something, you need to extract an insight from it, whether it's boring/obvious or not! That's your main job as analysts. Of course, now I ask you to do this everytime in writing so you can build the habit and you might stop doing it for simple things at some point, but for now... it's mandatory!
+
+Tips:
+- If you want your bulletpoints to show up in new lines, make sure to add at least 2 spaces after each paragraph
+- You can also write `pd.options.display.max_columns = None`
+
</code_context>
<issue_to_address>
**issue (typo):** Typo: 'bulletpoints' should be 'bullet points'.
Please update 'bulletpoints' to 'bullet points' for correct spelling.
```suggestion
- If you want your bullet points to show up in new lines, make sure to add at least 2 spaces after each paragraph
```
</issue_to_address>
### Comment 3
<location> `FEEDBACK.md:14` </location>
<code_context>
+
+### To fix / improve on:
+- Stay consistent with the separators you use in titles, choose either "_" or "-"
+- The file should be `__init___.py` not `init.py`
+- I noticed you did describe the data quite often during the cleaning process, but I didn't see you making any comments on what you were seeing. This is rule #1: Whenever you check something, you need to extract an insight from it, whether it's boring/obvious or not! That's your main job as analysts. Of course, now I ask you to do this everytime in writing so you can build the habit and you might stop doing it for simple things at some point, but for now... it's mandatory!
+
</code_context>
<issue_to_address>
**issue (typo):** Typo: '__init___.py' should be '__init__.py' (two underscores).
Use '__init__.py' with two underscores before and after 'init' for Python package initialization files.
```suggestion
- The file should be `__init__.py` not `init.py`
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| - Also love how organized the Jupyter notebook is :heart-eyes:. That's a work of art! | ||
| - Like that you used a GDrive link to make sure you all worked with the same raw data but using the same relative path, i.e. `pd.read_csv("../data/raw.csv")` would have achieved the same result! | ||
| - Really thorough, clean, and well-organized data cleaning, colour me impressed with your performance!! | ||
| (there is a meme reference which I'm offerring as a well-done, from the kickboxing world: [A Canadian boxer trying to talk smack](https://www.youtube.com/shorts/RdoFCWhiL9g)) |
There was a problem hiding this comment.
issue (typo): Typo: 'offerring' should be 'offering'.
| (there is a meme reference which I'm offerring as a well-done, from the kickboxing world: [A Canadian boxer trying to talk smack](https://www.youtube.com/shorts/RdoFCWhiL9g)) | |
| (there is a meme reference which I'm offering as a well-done, from the kickboxing world: [A Canadian boxer trying to talk smack](https://www.youtube.com/shorts/RdoFCWhiL9g)) |
| - I noticed you did describe the data quite often during the cleaning process, but I didn't see you making any comments on what you were seeing. This is rule #1: Whenever you check something, you need to extract an insight from it, whether it's boring/obvious or not! That's your main job as analysts. Of course, now I ask you to do this everytime in writing so you can build the habit and you might stop doing it for simple things at some point, but for now... it's mandatory! | ||
|
|
||
| Tips: | ||
| - If you want your bulletpoints to show up in new lines, make sure to add at least 2 spaces after each paragraph |
There was a problem hiding this comment.
issue (typo): Typo: 'bulletpoints' should be 'bullet points'.
Please update 'bulletpoints' to 'bullet points' for correct spelling.
| - If you want your bulletpoints to show up in new lines, make sure to add at least 2 spaces after each paragraph | |
| - If you want your bullet points to show up in new lines, make sure to add at least 2 spaces after each paragraph |
|
|
||
| ### To fix / improve on: | ||
| - Stay consistent with the separators you use in titles, choose either "_" or "-" | ||
| - The file should be `__init___.py` not `init.py` |
There was a problem hiding this comment.
issue (typo): Typo: 'init_.py' should be 'init.py' (two underscores).
Use 'init.py' with two underscores before and after 'init' for Python package initialization files.
| - The file should be `__init___.py` not `init.py` | |
| - The file should be `__init__.py` not `init.py` |
Summary by Sourcery
Include reviewer feedback on data cleaning project
Documentation: