diff --git a/_sources/Introduction/introduction.rst b/_sources/Introduction/introduction.rst index 2e9f4ee..9d507e4 100644 --- a/_sources/Introduction/introduction.rst +++ b/_sources/Introduction/introduction.rst @@ -261,7 +261,7 @@ discover your ability to perform complex analyses to solve real-world problem. There is another definition of the learning zone that is related to what we have been talking about. In this amazing -`TED talk: How to get better at the things you care about `_, +TED talk: How to get better at the things you care about, Eduardo Briceño talks about the "performance zone" versus the "learning zone." Please watch it. diff --git a/_sources/Statistics/Figures/crossreference1.PNG b/_sources/Statistics/Figures/crossreference1.PNG new file mode 100644 index 0000000..c915374 Binary files /dev/null and b/_sources/Statistics/Figures/crossreference1.PNG differ diff --git a/_sources/Statistics/Figures/crossreference2.PNG b/_sources/Statistics/Figures/crossreference2.PNG new file mode 100644 index 0000000..b7495e5 Binary files /dev/null and b/_sources/Statistics/Figures/crossreference2.PNG differ diff --git a/_sources/Statistics/Figures/crossreference3.PNG b/_sources/Statistics/Figures/crossreference3.PNG new file mode 100644 index 0000000..d862047 Binary files /dev/null and b/_sources/Statistics/Figures/crossreference3.PNG differ diff --git a/_sources/Statistics/Figures/crossreference4.PNG b/_sources/Statistics/Figures/crossreference4.PNG new file mode 100644 index 0000000..fdf38b7 Binary files /dev/null and b/_sources/Statistics/Figures/crossreference4.PNG differ diff --git a/_sources/Statistics/Figures/datatypes.png b/_sources/Statistics/Figures/datatypes.png new file mode 100644 index 0000000..8531b74 Binary files /dev/null and b/_sources/Statistics/Figures/datatypes.png differ diff --git a/_sources/Statistics/cs1_exploring_happiness.rst b/_sources/Statistics/cs1_exploring_happiness.rst index 59c6e02..4757e2a 100644 --- 
a/_sources/Statistics/cs1_exploring_happiness.rst +++ b/_sources/Statistics/cs1_exploring_happiness.rst @@ -24,10 +24,21 @@ factors may contribute to the happiness of a country, and we will use spreadshee explore and analyze what factors may be most important in determining a country's happiness. -We will start by loading the -`happiness_2017.csv <../_static/happiness_2017.csv>`_ file into Google Sheets. -The list below gives a bit of detail about each of the columns on the -spreadsheet. +We will be using Google Sheets instead of Microsoft Excel. Google Sheets makes it easy to share a link and +work in real time with a team on the same dataset. The same Solver tool is available in Microsoft Excel; however, +Google Sheets is preferred here because data can be shared quickly with other users. + +.. _googlesheet_setup: + +We will start by loading the `happiness_2017.csv <../_static/happiness_2017.csv>`_ file into Google Sheets. + +1. To do that, go to `Google Sheets `_. + +2. Click on "Go to Sheets". + +3. Open a blank spreadsheet, then at the top click File and then Import to import the file `happiness_2017.csv <../_static/happiness_2017.csv>`_. + +The list below gives a bit of detail about each of the columns on the spreadsheet. The following definitions are reproduced from `World Happiness Report 2018 `_. @@ -43,6 +54,7 @@ The following definitions are reproduced from per capita, as this form fits the data significantly better than GDP per capita. + 2. The time series of healthy life expectancy at birth are constructed based on data from the World Health Organization (WHO) and WDI. WHO publishes the data on healthy life expectancy for the year 2012. The time series of life @@ -53,33 +65,40 @@ The following definitions are reproduced from country-specific ratios to other years to generate the healthy life expectancy data. + 3. 
Social support is the national average of the binary responses (either 0 or 1) to the Gallup World Poll (GWP) question "If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?" + 4. Freedom to make life choices is the national average of binary responses to the GWP question "Are you satisfied or dissatisfied with your freedom to choose what you do with your life?" + 5. Generosity is a function of the national average of GWP responses to the question "Have you donated money to a charity in the past month?" on GDP per capita. + 6. Perceptions of corruption are the average of binary answers to two GWP questions: "Is corruption widespread throughout the government or not?" and "Is corruption widespread within businesses or not?". Where data for government corruption are missing, the perception of business corruption is used as the overall corruption-perception measure. + 7. Positive affect is defined as the average of previous-day affect measures for happiness, laughter, and enjoyment for GWP waves 3-7 (years 2008 to 2012, and some in 2013). It is defined as the average of laughter and enjoyment for other waves where the happiness question was not asked. + 8. Negative affect is defined as the average of previous-day affect measures for worry, sadness, and anger for all waves. + In this first part, we will review and practice some spreadsheet calculations by doing some exploratory data analysis. If you have never used a spreadsheet before, don't worry, you will catch on quickly. Remember that we are just exploring at this @@ -206,7 +225,7 @@ Summary Statistics .. fillintheblank:: fb_avghappiness Calculating the average happiness score. You should include three - digits to the right of the decimal point.|blank| + digits to the right of the decimal point. 
- :5.399: Is the correct answer :5.398: 5.3989 should be rounded up to 5.399 diff --git a/_sources/Statistics/cs2_exploring_business_data.rst b/_sources/Statistics/cs2_exploring_business_data.rst index 0f4cbdc..abb5e1a 100644 --- a/_sources/Statistics/cs2_exploring_business_data.rst +++ b/_sources/Statistics/cs2_exploring_business_data.rst @@ -4,8 +4,8 @@ http://creativecommons.org/licenses/by-sa/4.0/. -Case Study 2: Considering Starting a Business? -============================================== +Case Study 2: Business Start-Up Analysis in Different Countries +=============================================================== Data science and data analytics can be used to analyze and understand data related to many different fields, such as education, business, targeted advertising, healthcare, and many more. @@ -13,12 +13,14 @@ In this case study, we will explore a data set related to starting a business. -Thinking About Starting Your Business -------------------------------------- +Analyzing Business Start-Ups +---------------------------- +In the following section, we will import the provided dataset into Google Sheets following the :ref:`instructions ` from a previous chapter. +We will use **Google Sheets** to explore which of these indicators are most important to start a new business in each economy's largest cities. -This case study utilizes the `starting a business <../_static/Start_a_Business_2019.csv>`_ (also called business start-up) data set obtained from the Doing Business-World Bank website. +This case study utilizes the starting a business dataset `Start_a_Business_2019.csv <../_static/Start_a_Business_2019.csv>`_, also called business start-up, obtained from the Doing Business-World Bank website. The data set contains indicators from over 190 countries that measure the relative ease of starting a business in those countries. The data set looks at -two limited liability companies in various regions and countries around the world. 
+two limited liability companies in various regions and countries around the world related to the ease of starting a business in different countries. Each country in the data set measures things such as the minimum amount of capital investment an entrepreneur must have to start a business, and the number of procedures that are necessary to register the business, and more that will be covered throughout this case study. @@ -36,8 +38,23 @@ Below are definitions of the indicators found in the data set. - **Paid-In Minimum Capital:** The minimum amount of money the entrepreneur must have in the bank for the business registration process to be completed. - **Income Level:** This represents the income levels of each country's economy. This indicator is divided into low, lower-middle, upper-middle, and high, based on a country's gross national income (GNI) per person. -We will use **Google Sheets** to explore which of these indicators are most important to start a new business in each economy's largest cities. -Import the data set that you downloaded earlier, `starting a business <../_static/Start_a_Business_2019.csv>`_, into Google Sheets. +Now, let's review what we have learned from Chapter 2.1. Based on the `Data Types in Statistics `_ article, we categorize variables by splitting them into Categorical and Numerical Data. + +Categorical data describe characteristics of the variables and can be further split into Nominal and Ordinal. Some examples of categorical data are eye color and social class. +Nominal values represent discrete units and are used to label the variables; they have no quantitative value. An example of nominal data is eye color (blue, brown, black). +Ordinal data are discrete values that have a natural order. An example of ordinal data is social class (lower, middle, upper). + +Numerical data are quantitative, and they are separated into interval and ratio data. 
Some examples of numerical data are height, weight, and income. +Interval values are ordered and have equal differences between consecutive values, but they do not have a "true zero". Some examples of interval values are temperature in degrees Celsius and calendar years. +Ratio values are also ordered with equal differences between consecutive values, and they do have a "true zero". Some examples of ratio data are height, length, weight, and income. + +The following picture is a great way to determine to which category a variable belongs. + +.. image:: Figures/datatypes.png + :alt: A graph showing the different data types + +Numerical data can also be separated into discrete and continuous. Discrete data can be counted. Some discrete data are the number of students in a class, or someone's shoe size. +Continuous data can't be counted, but they can be measured. An example of continuous data is a person's height. .. mchoice:: dat_sab1 @@ -60,8 +77,11 @@ Import the data set that you downloaded earlier, `starting a business <../_stati - Incorrect -Business Start-Up Data Analysis Research Questions --------------------------------------------------- +Business Start-Up Analysis in Different Countries +------------------------------------------------- + +The research questions below are interesting questions that can be addressed using data analysis. Using data analytics techniques we will be able to explore some of these +research questions throughout the book. 1. What are the different factors that lead to a high ranking in the business start-up dataset? 2. What role does “income level” play in determining the rank of a country? @@ -71,8 +91,59 @@ Business Start-Up Data Analysis Research Questions The data set lists countries based on their business start-up scores. While it is easy to see the best countries for starting a business based on the business start-up rank, it is not -easy to grasp the relative simplicity of each country. 
We can use the functions that we -learned in the previous case study to create a common baseline: average, standard deviation, and median. So, let's average +easy to grasp the relative simplicity of each country. + +Descriptive Statistics +----------------------- + +The following are some very important terms in data analytics that are used to describe the dataset. + +**Mean** is the average of a set of values. It is important in analytics as it is a measure of central tendency. In Google Sheets we use the function ``AVERAGE`` and then select the cells of the values to find the mean. +From now on we will use the words mean and average interchangeably. + +**Range** is the difference between the lowest and highest values of the dataset. To find the maximum value you use the function ``MAX``. Similarly, to find the minimum value you use the +function ``MIN``. + +**Standard deviation** is, roughly, the typical distance of the values from the mean. It shows how spread out the data is. To find the standard deviation we use the +function ``STDEV``. + +**Mode** is the most common value in the dataset. It is very important in categorical data because it describes the most frequent option. To find the mode you can use the function +``MODE``. + +**Median** is the middle value of the dataset. The median is also important because it provides another kind of baseline besides mean and mode. The function that gives the median is ``MEDIAN``. + +**Pearson correlation** is a type of measurement; it measures the strength and direction of a linear relationship between two variables. The Pearson correlation coefficient takes values from -1 to 1. +The value of -1 means it has a strong negative relationship, and the value of +1 means it has a strong positive relationship. + +Another important feature of Google Sheets is cell referencing with ``$``. 
The dollar sign ``$`` can be used before the column and/or row of a reference to control how the reference is updated when the formula is copied to another row or column. +The dollar sign keeps the component that follows it unchanged. +For example, ``=A$1`` keeps the row number the same. Here are the formulas we get if we write ``=A$1`` in C1 and drag it down and to the right: + +.. image:: Figures/crossreference1.PNG + :alt: Google sheets crossreference example 1 + +This shows that the ``$`` holds the row number constant as we move the formula along the rows and columns. Here are the results: + +.. image:: Figures/crossreference2.PNG + :alt: Google sheets crossreference Result 1 + +If we drag the formula ``=A$1`` from cell C1 through cell C3, the column stays the same because we have not changed columns, and the row stays at 1 because of the ``$``, so every cell shows the value of A1, which is 1. If we drag the formula ``=A1`` from cell C1 through cell C3, +the column also stays the same, but in this case cell C1 will have the formula ``=A1``, cell C2 will have the formula ``=A2`` and cell C3 will have the formula ``=A3``, and the results will change to 1, 3, and 5 respectively. + +Similarly, ``=$A1`` keeps the column letter the same. Here are the formulas we get if we write ``=$A1`` in C1 and drag it down and to the right: + +.. image:: Figures/crossreference3.PNG + :alt: Google sheets crossreference example 2 + +This shows that the ``$`` holds the column constant as we move the formula along the rows and columns. Here are the results: + +.. image:: Figures/crossreference4.PNG + :alt: Google sheets crossreference Result 2 + +Here, when we drag the formula from cell C1 to cell C3, the column stays constant and the row number changes. However, when we drag C1 to D1 the formula remains the same: the column stays constant because of the ``$``, and the row does not change because we stayed in the same row. +If we had the formula ``=A1`` in C1 and dragged it to D1, the formula would change to ``=B1``. 
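The ``$`` rules described above can be sketched in a few lines of Python. This is only an illustrative model of how a single reference is rewritten when a formula is copied, not how Google Sheets is implemented; the names ``shift_reference``, ``col_to_num``, and ``num_to_col`` are hypothetical helpers introduced for this example, and the sketch assumes simple references such as ``A1``, ``A$1``, ``$A1``, or ``$A$1``.

```python
import re


def col_to_num(col):
    """Convert a column label such as 'A' or 'AB' to a 1-based number."""
    n = 0
    for ch in col:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def num_to_col(n):
    """Convert a 1-based column number back to a label such as 'A' or 'AB'."""
    label = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        label = chr(ord("A") + rem) + label
    return label


def shift_reference(ref, d_rows, d_cols):
    """Return the reference after copying d_rows down and d_cols to the right.

    A '$' in front of the column or the row pins that part in place;
    the unpinned parts move by the copy offset.
    """
    m = re.fullmatch(r"(\$?)([A-Z]+)(\$?)(\d+)", ref)
    if m is None:
        raise ValueError(f"not a simple cell reference: {ref!r}")
    col_abs, col, row_abs, row = m.groups()
    new_col = col_to_num(col) + (0 if col_abs else d_cols)
    new_row = int(row) + (0 if row_abs else d_rows)
    if new_col < 1 or new_row < 1:
        raise ValueError("reference would move off the sheet")
    return f"{col_abs}{num_to_col(new_col)}{row_abs}{new_row}"


# Copying from C1 down to C3 is an offset of 2 rows and 0 columns.
print(shift_reference("A$1", 2, 0))  # A$1  (row pinned)
print(shift_reference("A1", 2, 0))   # A3   (row moves)
# Copying from C1 across to D1 is an offset of 0 rows and 1 column.
print(shift_reference("$A1", 0, 1))  # $A1  (column pinned)
print(shift_reference("A1", 0, 1))   # B1   (column moves)
```

The four printed cases mirror the drag examples in the text: the pinned part of the reference never changes, and the unpinned part follows the direction of the copy.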
+ +We can use the functions that we learned in the previous case study to create a common baseline: average, standard deviation, and median. So, let's average the business start-up score of all countries together. a. Use the ``AVERAGE`` function to calculate the mean in column D. Scroll down and click on a cell in column 194. @@ -87,13 +158,15 @@ b. Many formulas in Google Sheets use ranges. They can span cells in a single co - E2:E192 - E2:L192 -c. **Standard deviation** is the average distance from the mean. It shows how spread out the data is more - than other types of variabilities. The median is also as important because it provides another kind of - baseline besides mean and mode. Calculate the ``STDEV`` and ``MEDIAN`` for the business start-up score column. +c. Calculate the ``STDEV`` and ``MEDIAN`` for the business start-up score column. d. Calculate the standard deviation and median by copying and pasting the formula to other columns. -e. Copy the formula for ``=AVERAGE(D2:D141)`` from a, and the formula for standard deviation from c then calculate: +e. Copy the formula for ``=AVERAGE(D2:D141)`` from a, and the formula for standard deviation from c + +f. Remember, use a ``$`` so Google Sheets will not change the cell references when copy/pasting. + +Then calculate the following: .. fillintheblank:: fb_sab8 @@ -115,49 +188,29 @@ e. Copy the formula for ``=AVERAGE(D2:D141)`` from a, and the formula for standa :x: USE the ``STDEV`` function and the range from N2 to N192 -f. Remember, use a ``$`` so Google Sheets will not change the cell references when copy/pasting. - - -Visualizing How to Start a Business ------------------------------------ - -1. Visualizing the data is a great way to begin to interpret the data because doing so allows the viewer to easily see trends or find outliers. -A **histogram** is one way to visualize the standard deviation of a particular data set. - -2. 
When you have a data set covering the entire world, it can be interesting to identify certain information. For instance, -you can calculate which countries have the largest or smallest GNI, the income per capita of women and men, and so on. - -a. Remember, finding the maximum value of a column does not mean we know which country it corresponds to. Therefore, we can use the ``MATCH`` and ``INDEX`` functions - to fix this problem. Let's find what country corresponds to the maximum value of GNI. First, calculate the maximum GNI in cell M193, then in cell M194 type ``=MATCH(M193, M2:M192, 0)``. - Notice that the match function searches for the value in cell M193 in the range ``M2:M192``, and the 0 tells Google Sheets that the data is not sorted. The 0 is - important because, without it, sheets will assume the data is sorted and will stop when it finds a value greater than the value in M194. - -b. Type ``=INDEX(A2:A192, M194)`` in cell M195. The ``A2:A192`` parameters is the range from which ``INDEX`` will return a corresponding value; in this - case, it is the location. M194 from the previous question is ``=MATCH(M193, M2:M192, 0)``. So the ``INDEX`` is practically telling sheets to find the - location, from column A, that is found in the same row as the maximum value. +More Data Analytics +------------------- -c. All three steps shown above can be performed in a single cell. Let’s look at the country that has the lowest Procedure Men number. - In cell E193 type ``=INDEX($A2:$A141, MATCH(MIN(E2:E141), E2:E141, 0))``. The ``MATCH`` and ``MIN`` functions both return one value. - So, sheets will first find the minimum value in cells ``J2:J141``. Then it will use the ``MATCH`` function to find the cell location (column and row) - of where that minimum value is. Finally, it will use the ``INDEX`` function to find what value from ``A2:A141`` matches up with the given parameters. Try - this and see what it returns. 
It should return New Zealand, its region, business start-up rank, and business start-up score. +you can calculate which countries have the largest or smallest gross national income (GNI), the income per capita of women and men, and so on. -d. Practice using the functions you have learned by finding the names of locations for other columns. +Remember, finding the maximum value of a column does not mean we know which country it corresponds to. Therefore, we can use the ``MATCH`` and ``INDEX`` functions to fix this problem. Let's find what country corresponds to the maximum value of GNI. First, calculate the maximum GNI in cell M193, then in cell M194 type ``=MATCH(M193, M2:M192, 0)``. +Notice that the match function searches for the value in cell M193 in the range ``M2:M192``, and the 0 tells Google Sheets that the data is not sorted. The 0 is +important because, without it, Sheets will assume the data is sorted and will stop when it finds a value greater than the value in M193. -e. If you want to copy/paste, check the ranges carefully and add the ``$`` sign to avoid running into errors. +Type ``=INDEX(A2:A192, M194)`` in cell M195. The ``A2:A192`` parameter is the range from which ``INDEX`` will return a corresponding value; in this +case, it is the location. M194 from the previous step is ``=MATCH(M193, M2:M192, 0)``. So the ``INDEX`` is practically telling Sheets to find the +location, from column A, that is found in the same row as the maximum value. -3. Another great way of visualizing data is to use a **choropleth**. As you know, a choropleth takes in a set of geographic data and uses a map -to show another set of data, such as business start-up score. - -a. Click on Insert then select Chart +All three steps shown above can be performed in a single cell. Let’s look at the country that has the lowest Procedure Men number. 
+In cell E193 type ``=INDEX($A2:$A141, MATCH(MIN(E2:E141), E2:E141, 0))``. The ``MATCH`` and ``MIN`` functions both return one value. +So, Sheets will first find the minimum value in cells ``E2:E141``. Then it will use the ``MATCH`` function to find the position of that minimum value +within the range. Finally, it will use the ``INDEX`` function to find what value from ``A2:A141`` sits at that position. Try +this and see what it returns. It should return New Zealand, its region, business start-up rank, and business start-up score. -b. On the new Chart editor section, click on Chart Type and select Geo Chart +Practice using the functions you have learned by finding the names of locations for other columns. If you want to copy/paste, check the ranges carefully and add the ``$`` sign to avoid running into errors. -c. Select location column (``A2:A192``) as the region and any column that you wish to see as the Color. - -d. You may hover around each country to see its respective statistic. - -4. You may be wondering if there is a **correlation** between a country’s ease of starting a business score and GNI or procedure. +You may be wondering if there is a **correlation** between a country’s ease of starting a business score and GNI or procedure. One way to check this is to use the ``CORREL`` function to see how the score is affected by each factor i.e., business start-up score to GNI, business start-up score to the procedure, business start-up score to time. @@ -173,7 +226,24 @@ a. Calculate the mean of each factor for the top 20 countries, then do so for th in those averages for each of the factors for the top and bottom 20 countries. Which factors have the most impact on the business start-up score? -6. 
While using the choropleth, you might have noticed some outliers in the data, for example, South Africa has one of the lowest cost +Data Visualization +------------------- + +Visualizing the data is a great way to begin to interpret the data because doing so allows the viewer to easily see trends or find outliers. +A **histogram** is one way to visualize how the values of a particular data set are spread out. + +Another great way of visualizing data is to use a **choropleth**. As you know, a choropleth takes in a set of geographic data and uses a map +to show another set of data, such as business start-up score. + +a. Click on Insert then select Chart + +b. On the new Chart editor section, click on Chart Type and select Geo Chart + +c. Select location column (``A2:A192``) as the region and any column that you wish to see as the Color. + +d. You may hover around each country to see its respective statistic. + +While using the choropleth, you might have noticed some outliers in the data. For example, South Africa has one of the lowest costs of starting a business but is ranked 139. The countries above and below South Africa have a cost of 5 and 5.7 while South Africa has a cost of 0.2. diff --git a/_sources/Statistics/glossary.rst b/_sources/Statistics/glossary.rst index 72770f1..767c1e8 100644 --- a/_sources/Statistics/glossary.rst +++ b/_sources/Statistics/glossary.rst @@ -19,11 +19,19 @@ Definitions **Histogram:** Is a graph used to display data. +**Mean:** Is the average of a set of values. + +**Median:** Is the middle value of the dataset. + +**Mode:** Is the most common value in the dataset, that is, the value with the highest frequency. + **Pearson correlation:** Is a type of measurement; it measures the strength and direction of a linear relationship between two variables. -1 has a strong negative relationship, and +1 has a strong positive relationship. **Pivot table:** Is a function used in Google Sheets to summarize, organize, sort, and perform other operations on data sets. 
-**Standard Deviation:** Is used to measure the degree of variation of a set of values. +**Range:** Is the difference between the lowest and highest values of the dataset. + +**Standard Deviation:** Is used to measure the degree of variation of a set of values. It shows how far the values typically lie from the mean, that is, how spread out the data is. Keywords --------