Mid Bootcamp Project

Introduction:

The goal of this project is to perform a descriptive statistical analysis, extract insights from it, and build an interactive dashboard whose visualizations tell a compelling story and support decision-making.

Your mid-bootcamp project is an opportunity to create a piece of work that is valuable to you. This could mean:

  • A project that you can use to help you get a job as a data analyst
  • A project that you can use to advance your career in your current company
  • A project on a topic that you are passionate about

If you're struggling to come up with a topic, we have provided a list of datasets for you to consider. However, we highly recommend that you explore and select a topic and dataset that personally interests you, as this will make the project more engaging and rewarding.

Prerequisites:

In order to successfully complete the upcoming project, you should possess a strong understanding of several key concepts, including Python programming, data wrangling, exploratory data analysis (EDA), SQL and Tableau. The following are essential prerequisites that you should have before beginning the project:

  • Proficiency in basic Python programming, including knowledge of data wrangling and data cleaning techniques in Python.
  • Proficiency in EDA and descriptive statistics, including an understanding of different types of data and their properties, how to transform between data types, and how to perform analysis based on the data type. This includes numerical and graphical techniques, as well as bivariate and multivariate analysis to identify and analyze relationships between pairs or sets of numerical and categorical variables.
  • Ability to use measures such as frequency tables and centrality measures, as well as graphical methods such as bar charts, histograms, and box plots.
  • Ability to carry out thorough bivariate and multivariate analysis, employing techniques such as contingency tables, chi-square tests of independence, correlation coefficients, and a range of graphical methods such as scatterplots and correlation maps.
  • Familiarity with basic Matplotlib and Seaborn for graphical analysis with Python.
  • Knowledge of how to detect outliers using scatterplots, box plots, etc., and how to handle them.
  • Basic knowledge of probability and inferential statistics to conduct tests used in EDA, such as chi-square tests, normality tests, and to interpret p-values in the context of data analytics. This includes an understanding of probability distributions, such as the normal distribution.
  • Familiarity with Tableau in order to create interactive dashboards for decision making.
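
As a rough self-check on the statistical prerequisites above, the sketch below computes a frequency table, flags outliers with the box plot (1.5 × IQR) rule, and computes a chi-square statistic for a contingency table. It assumes plain Python with only the standard library; the function names and the sample data are made up for illustration. In the project itself you would more likely rely on pandas and SciPy (for example `pandas.crosstab` and `scipy.stats.chi2_contingency`, which also returns a p-value).

```python
from collections import Counter
from statistics import mean, median, quantiles


def frequency_table(values):
    """Return {category: (count, relative frequency)} for a categorical column."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: (n, n / total) for category, n in counts.items()}


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (the box plot rule)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table.

    `table` is a list of rows, each a list of observed counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Made-up sample data, for illustration only
order_values = [12, 15, 14, 13, 16, 15, 14, 95]
segments = ["retail", "retail", "online", "retail",
            "online", "online", "retail", "online"]

print(frequency_table(segments))
print(mean(order_values), median(order_values))
print(iqr_outliers(order_values))  # [95]
print(chi_square_independence([[30, 10], [20, 40]]))
```

The mean and median of `order_values` differ noticeably because of the single extreme value, which is the kind of pattern a box plot would make visible.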

Suggested ways to get started:

  • Select a business problem and formulate one or more hypotheses that will guide your data analysis, allowing you to draw meaningful conclusions. Locate relevant data sources. Consider merging multiple datasets to augment your analysis with additional insights.
  • Examine the data, understand what the fields mean and use exploratory data analysis to explore the data and identify any issues that need to be addressed.
  • Clean and wrangle the data to prepare the dataset for analysis. Remember to look into missing data, outliers, data types, feature selection, converting qualitative data to quantitative, etc.
  • Analyze the data, including numerical and graphical techniques (univariate, bivariate and multivariate analysis) to reveal patterns, relationships, and trends that may not be apparent from the raw data.
  • Create a dashboard with key KPIs and insights that tell a compelling story about the data, while allowing for decision-making.
  • Create a visually appealing presentation with minimal text that effectively communicates your insights and conclusions to stakeholders, building a compelling narrative that highlights the significance of your analysis.
  • Remember, the goal of this project is to showcase your skills in data analysis, visualization and extraction of insights that are meaningful and actionable.
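
The cleaning step in the list above (missing data, data types, qualitative to quantitative) could look something like the following minimal sketch. It assumes a tiny made-up customer table held as a list of dictionaries and uses only the standard library; the column names, the median imputation and the `plan_codes` mapping are illustrative assumptions, not project requirements. With pandas, the equivalent steps would typically use `astype`, `fillna` and `map`.

```python
from statistics import median

# Made-up raw data: every value arrives as text, some are missing or messy
raw_rows = [
    {"customer_id": "1", "age": "34", "plan": "basic", "monthly_spend": "20.5"},
    {"customer_id": "2", "age": "", "plan": "Premium ", "monthly_spend": "55.0"},
    {"customer_id": "3", "age": "41", "plan": "basic", "monthly_spend": "n/a"},
    {"customer_id": "4", "age": "29", "plan": "premium", "monthly_spend": "60.0"},
]


def to_float(text):
    """Convert text to float; return None for blanks and unparseable values."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def clean(rows):
    """Fix data types, impute missing numbers, and encode the plan category."""
    cleaned = []
    for row in rows:
        cleaned.append({
            "customer_id": int(row["customer_id"]),
            "age": to_float(row["age"]),
            "plan": row["plan"].strip().lower(),  # normalise category labels
            "monthly_spend": to_float(row["monthly_spend"]),
        })

    # Impute missing numeric values with the column median (one simple choice)
    for column in ("age", "monthly_spend"):
        known = [r[column] for r in cleaned if r[column] is not None]
        fill_value = median(known)
        for r in cleaned:
            if r[column] is None:
                r[column] = fill_value

    # Convert the qualitative plan to a numeric code (hypothetical mapping)
    plan_codes = {"basic": 0, "premium": 1}
    for r in cleaned:
        r["plan_code"] = plan_codes[r["plan"]]
    return cleaned


cleaned_rows = clean(raw_rows)
for row in cleaned_rows:
    print(row)
```

Median imputation is used here only because it is short to show; whether to impute, drop, or flag missing values depends on your dataset and hypotheses.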

Deliverables:

You must submit the following deliverables in order for the project to be deemed complete:

  1. A new repo on your GitHub account.
  2. Working code that meets all technical requirements, built by you, in either Python or SQL.
  3. At least one Jupyter notebook containing your Python code, or a SQL file, for the data cleaning and preprocessing steps, descriptive statistical analysis, and visualizations. You will also provide a brief summary of the insights gained from the analysis and visualizations, along with recommendations for further analysis or actions based on the results.
  4. Your functions, included in .py files.
  5. A Tableau or Power BI report.
  6. Any additional files needed for your work.
  7. A README with the completed project documentation.
  8. The URL of the slides for your project presentation, if you use slides.
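
For the deliverable about keeping functions in .py files, one possible layout is a small module that the notebook imports. The sketch below assumes a hypothetical file named `cleaning_utils.py`; the file name and both function names are illustrative, not prescribed by the project.

```python
# cleaning_utils.py (hypothetical file name)
import re


def standardize_column_name(name):
    """Convert a raw column header to snake_case."""
    name = name.strip().lower()
    # Replace every run of characters other than a-z and 0-9 with one underscore
    name = re.sub(r"[^a-z0-9]+", "_", name)
    return name.strip("_")


def standardize_columns(names):
    """Apply standardize_column_name to a list of headers."""
    return [standardize_column_name(n) for n in names]


if __name__ == "__main__":
    # Quick manual check when the file is run directly
    print(standardize_columns(["Order Date ", "Customer-ID", "Total (EUR)"]))
    # ['order_date', 'customer_id', 'total_eur']
```

In the notebook, assuming the file sits in the same folder, the functions could then be reused with `from cleaning_utils import standardize_columns`.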

Presentation:

When presenting your work, there are many important factors to consider, such as the content of your presentation and the way you deliver it. The presentation will take a maximum of 7 minutes and no code will be shown. Remember, your audience does not care about the code; they care about the insights you provide and the story you tell.

Rubrics:

A rubric will be used to assess your project and ensure all requirements are met. Your teaching staff use it to evaluate your project, and it also communicates what constitutes incomplete, acceptable and excellent performance for each of the project's learning outcomes. Take some time to review the rubric here and ask Gonçalo or Karol any questions about it if necessary.

Optional Advanced Features:

While completing the basic requirements of your project is a great start, taking advantage of some advanced features can really take your work to the next level. Here are some options to consider if you want to go above and beyond:

  • Data gathering and integration: use APIs and web scraping to gather data from different sources. Combine and integrate data from multiple sources, including different databases, APIs, or file formats.
  • Use advanced data cleaning techniques in addition to the basic ones when imputing missing values or handling duplicates (for example, imputing missing values with a linear regression).
  • Database creation: create a database to store your raw and your clean data for analysis.
  • Correlation and statistical analysis: use hypothesis testing concepts to interpret your correlation analysis and to validate your findings.
  • Improve your code by using error handling techniques, applying functions for modularity and reusability, and utilizing regular expressions to extract insights from textual data.
  • Do advanced visualizations using interactive libraries such as Plotly.
  • Deploy your dashboard to the web using a tool like Flask or Django to make it accessible to others and share your insights more widely.
  • Use a kanban board to organize and manage project tasks.
  • Anything outside of the box that can improve your analysis!
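
As one illustration of the advanced cleaning option above, the sketch below imputes missing values with a simple linear regression on a single predictor, with basic error handling. It assumes made-up housing data and hand-written least squares using only the standard library; the function names and column names are hypothetical. In practice a library such as scikit-learn or statsmodels would usually do the fitting.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are constant; cannot fit a line")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def impute_with_regression(rows, x_key, y_key):
    """Return copies of rows with missing y_key values predicted from x_key."""
    complete = [r for r in rows
                if r[x_key] is not None and r[y_key] is not None]
    if len(complete) < 2:
        raise ValueError("need at least two complete rows to fit a line")
    slope, intercept = fit_line([r[x_key] for r in complete],
                                [r[y_key] for r in complete])
    imputed = []
    for row in rows:
        row = dict(row)  # copy, so the raw data stays untouched
        if row[y_key] is None and row[x_key] is not None:
            row[y_key] = slope * row[x_key] + intercept
        imputed.append(row)
    return imputed


# Made-up data: price (thousands) is missing for one listing
listings = [
    {"area_m2": 50, "price_k": 150},
    {"area_m2": 70, "price_k": 210},
    {"area_m2": 90, "price_k": 270},
    {"area_m2": 60, "price_k": None},
]

filled = impute_with_regression(listings, "area_m2", "price_k")
print(filled[3])
```

Regression imputation only makes sense when the predictor is genuinely related to the missing variable, so it is worth checking the correlation on the complete rows first.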