Guided Project: Demonstration of Data Cleaning and Manipulation with Pandas

Overview

The goal of this project is to combine everything you have learned about data wrangling, cleaning, and manipulation with Pandas so you can see how it all works together. For this project, you will start with a messy data set of your choice. You will need to import it, use your data wrangling skills to clean it up, prepare it to be analyzed, and then export it as a clean CSV data file.

You will be working individually for this project, but we'll be guiding you along the process and helping you as you go. Show us what you've got!


Technical Requirements

The technical requirements for this project are as follows:

  • You must start out with a significantly messy data set so that you can apply the different cleaning and manipulation techniques you have learned.
  • Import the data using Pandas.
  • Examine the data for potential issues.
  • Use at least 8 of the cleaning and manipulation methods you have learned on the data.
  • Produce a Jupyter Notebook that shows the steps you took and the code you used to clean and transform your data set.
  • Export a clean CSV version of your data using Pandas.
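
The requirements above can be sketched as one small pipeline. This is only an illustration, not a model solution: the data, the column names (`name`, `age`, `city`, `joined`), and the `clean` helper are all made up for the example, and the data is read from an in-memory string so the block runs on its own. In your project you would pass a file path to `pd.read_csv` and `DataFrame.to_csv` instead.

```python
import io

import pandas as pd

# Hypothetical messy data standing in for a real file.
RAW_CSV = """Name , Age,City,Joined
 alice ,29,new york,2021-01-05
Bob,,Boston,2021-02-05
 alice ,29,new york,2021-01-05
Carla,41,,not a date
"""


def clean(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply a handful of common cleaning steps to the sample data."""
    df = raw.copy()
    # Normalize column names: strip stray whitespace and lowercase them.
    df.columns = df.columns.str.strip().str.lower()
    # Strip whitespace and standardize capitalization in the text columns.
    for col in ["name", "city"]:
        df[col] = df[col].str.strip().str.title()
    # Remove exact duplicate rows.
    df = df.drop_duplicates()
    # Fill missing categorical values with an explicit placeholder.
    df["city"] = df["city"].fillna("Unknown")
    # Convert age to a nullable integer so missing values stay missing.
    df["age"] = pd.to_numeric(df["age"], errors="coerce").astype("Int64")
    # Parse dates; values that cannot be parsed become NaT.
    df["joined"] = pd.to_datetime(df["joined"], errors="coerce")
    return df.reset_index(drop=True)


# Import, clean, and export.
raw = pd.read_csv(io.StringIO(RAW_CSV))
cleaned = clean(raw)
clean_csv = cleaned.to_csv(index=False)  # pass a path to write a file instead
print(cleaned)
```

Keeping the cleaning steps in one function makes it easy to rerun the whole workflow from the raw data whenever you change a step.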

Necessary Deliverables

The following deliverables should be pushed to your GitHub repo for this chapter.

  • A cleaned CSV data file containing the results of your data wrangling work.
  • A Jupyter Notebook (data-wrangling.ipynb) containing all Python code and commands used in the importing, cleaning, manipulation, and exporting of your data set.
  • A README.md file containing a detailed explanation of the process followed in the importing, cleaning, manipulation, and exporting of your data as well as your results, obstacles encountered, and lessons learned.

Suggested Ways to Get Started

  • Find a messy data set - a great place to start looking would be Awesome Public Data Sets and Kaggle Data Sets.
  • Examine the data and try to understand what the fields mean before diving into data cleaning and manipulation methods.
  • Break the project down into different steps - use the topics covered in the lessons to form a check list, add anything else you can think of that may be wrong with your data set, and then work through the check list.
  • Use the tools in your tool kit - your knowledge of Python, data structures, Pandas, and data wrangling.
  • Work through the lessons in class & ask questions when you need to! Think about adding relevant code to your project each night, instead of, you know... procrastinating.
  • Commit early and commit often. Don’t be afraid of doing something incorrectly, because you can always roll back to a previous version.
  • Consult documentation and resources provided to better understand the tools you are using and how to accomplish what you want.
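
One way to turn the "examine the data" and "check list" suggestions into code is a small per-column summary. The sketch below is an assumption about what such a first look might contain, not a required step: `quality_report` is a hypothetical helper and `sample` is invented data.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column: dtype, missing count, missing share, unique values."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing": df.isna().sum(),
            "missing_pct": (df.isna().mean() * 100).round(1),
            "unique": df.nunique(),  # missing values are not counted
        }
    )


# Hypothetical sample with typical problems: a duplicated row, numbers stored
# as text, a placeholder string for missing values, and inconsistent casing.
sample = pd.DataFrame(
    {
        "id": [1, 2, 2, 4],
        "price": ["10.5", "n/a", "n/a", "7"],
        "city": ["Paris", None, None, "paris"],
    }
)

report = quality_report(sample)
duplicate_rows = int(sample.duplicated().sum())

print(report)
print("Fully duplicated rows:", duplicate_rows)
```

A report like this will not catch everything. Here, the `"n/a"` strings in `price` are not counted as missing, which is exactly the kind of issue to add to your check list by hand.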

Useful Resources

Project Feedback + Evaluation

  • Technical Requirements: Did you deliver a project that met all the technical requirements? Given what the class has covered so far, did you build something that was reasonably complex?

  • Creativity: Did you add a personal spin or creative element to your project submission? Did you incorporate domain knowledge or a unique perspective into your analysis?

  • Code Quality: Did you follow code style guidance and best practices covered in class?

  • Total: Your instructors will give you a total score on your project between:

    Score Expectations
    0 Does not meet expectations
    1 Meets expectations, good job!
    2 Exceeds expectations, you wonderful creature, you!

This will be useful as an overall gauge of whether you met the project goals, but the more important scores are described in the specs above, which can help you identify where to focus your efforts for the next project!

Presentation Guideline and Criteria

Format

  • Presentation Time: 6 minutes
  • Q & A: 3 minutes
  • Total Time: 9 minutes

Attire

Outputs

  • A presentation on slides.com
  • A demo deployed on GitHub Pages
  • The presentation and demo will be executed on a class computer (instead of your own)
  • Get ready to explain some of your code on GitHub

Things you might want to talk about

  • Short presentation of yourself:
    • Who are you?
    • A hobby you have.
    • Note: we are getting you ready for the final presentation!
  • Elevator pitch:
    • Data set you chose.
    • Why did you choose that data set?
    • The most important thing you learned.
  • One technical challenge you faced:
    • Explain the challenge.
    • Explain what you did to overcome it and how.
    • Show and explain code snippets in your presentation slides.
  • Git:
    • Display a screenshot of your GitHub graphs to show your commit frequency and how much work you did.
  • Pandas Data Wrangling Walkthrough:
    • Walk the audience through the data set you chose, providing an overview of some of the fields and other information contained in the data.
    • Walk the audience through your data wrangling workflow including what initial problems you identified in the data, what cleaning and manipulation techniques you employed, what avenues you decided to pursue and why, and what lessons you learned.
  • One important mistake you made:
    • Did you make a mistake planning your time? Maybe transforming a variable that wasn't useful? Accidentally dropping one that was?