DS (first half)#5

Open
SarahRana wants to merge 7 commits into Stephen-Cole267:Data_Science from SarahRana:Master
@SarahRana

Part 1. Will upload second notebook with DS once completed.

@alexnaylor1999
Collaborator

Hey Sarah, firstly thanks for submitting the first half of the project. Just a few pointers/suggestions:

  • When creating the connection object, save your credentials as environment variables instead of typing them into the notebook. You can then read them in the notebook using the os module. This is good practice, as you don't want to make your credentials public when dealing with sensitive data.
  • Really liked the additional checks after making the manipulations. Also loved the chart checking TotalWorkingYears is valid with Age.
  • Good suggestions for why there may be discrepancies in the dataset.
  • Good work with using histograms and boxplots to analyse MonthlyIncome, YearsSinceLastPromotion, TotalWorkingYears and YearsAtCompany distributions.
  • Overall, really clear and effective visuals.
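
A minimal sketch of the environment-variable suggestion above. The variable names (DB_USER, DB_PASSWORD, DB_HOST) and the helper load_db_credentials are hypothetical, since the notebook's actual connection parameters aren't shown in this thread:

```python
import os


def load_db_credentials(prefix="DB"):
    """Read connection settings from environment variables.

    The names DB_USER, DB_PASSWORD and DB_HOST are assumptions made for
    this example; use whatever your connection object actually needs.
    """
    fields = ("USER", "PASSWORD", "HOST")
    # Collect every missing variable so the error lists them all at once.
    missing = [f"{prefix}_{f}" for f in fields if f"{prefix}_{f}" not in os.environ]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    # Return lower-case keys, e.g. {"user": ..., "password": ..., "host": ...}.
    return {f.lower(): os.environ[f"{prefix}_{f}"] for f in fields}
```

In the notebook you would call creds = load_db_credentials() and pass creds["user"], creds["password"] and creds["host"] to the connection object, so nothing secret ends up in the committed file.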

Great stuff! Let me know when you submit the second half - I'll take a look over it.

@Stephen-Cole267
Owner

A lot of insight into the data without a lot of code! This is really good :)

What I liked:

  • Like Alex has mentioned, I really like how you did additional checks after manipulating the data and made sure that every question about the data was answered.
  • Good display of function creation - percent_plot and percent_line
  • Analysed the distribution of each feature and its interaction with the attrition target by looking at percentages, which makes it easy to understand
  • Easy to follow code
  • Changed the ordinal columns into their respective categoricals, which mean a lot more to the stakeholder than integers and are also more representative within the model
  • Used a magic command to see wall time - this is especially useful for code that could potentially run for a long time

Suggestions to improve:
Note: Not sure if these will be covered in the second half, so please ignore any points you are already planning to cover there

  • Could check whether there are any nulls in the data and use your distribution plots to decide how to impute the affected features
  • Remove any highly correlated features (depending on the model you are planning on using) as they will not contribute much to the resulting model. You can use your correlation and percent plots to decide which feature of each pair to remove.
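
A rough, dependency-free sketch of both suggestions above. The helper names, the 0.9 threshold and the toy rows are all made up for illustration (only the column names come from this thread); in the notebook the pandas equivalents would be df.isnull().sum(), df[col].fillna(...) and df.corr():

```python
from statistics import median

# Toy rows with invented values; None stands in for a null.
rows = [
    {"TotalWorkingYears": 2, "YearsAtCompany": 1, "MonthlyIncome": 2500},
    {"TotalWorkingYears": 5, "YearsAtCompany": 3, "MonthlyIncome": None},
    {"TotalWorkingYears": 10, "YearsAtCompany": 8, "MonthlyIncome": 6000},
    {"TotalWorkingYears": 20, "YearsAtCompany": 15, "MonthlyIncome": 4000},
]
columns = ["TotalWorkingYears", "YearsAtCompany", "MonthlyIncome"]


def count_nulls(rows, columns):
    """Return {column: number of None values in that column}."""
    return {c: sum(1 for r in rows if r[c] is None) for c in columns}


def impute_median(rows, column):
    """Fill None in `column` with the median of the observed values.

    The median is one reasonable choice for a skewed distribution;
    the distribution plots should guide the actual decision.
    """
    fill = median(r[column] for r in rows if r[column] is not None)
    for r in rows:
        if r[column] is None:
            r[column] = fill
    return fill


def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def highly_correlated_pairs(rows, columns, threshold=0.9):
    """List (col_a, col_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for i, a in enumerate(columns):
        for b in columns[i + 1:]:
            r = pearson([row[a] for row in rows], [row[b] for row in rows])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


nulls = count_nulls(rows, columns)           # nulls per column before imputing
fill_value = impute_median(rows, "MonthlyIncome")
pairs = highly_correlated_pairs(rows, columns)
```

Each pair returned is a candidate for dropping one of its two features; which one goes is still a judgement call based on the plots and the model being used.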

Looking forward to the second half :)


3 participants