diff --git a/All Submissions/Finalprojectpaper.rmd b/All Submissions/Finalprojectpaper.rmd new file mode 100644 index 00000000..802b6d18 --- /dev/null +++ b/All Submissions/Finalprojectpaper.rmd @@ -0,0 +1,379 @@ +--- +title: "Final Project/Paper" +description: "Final project topic covers academic factors that increase student GPA using a dataset developed by Yılmaz and Sekeroglu (2019)" +author: "Emily Duryea" +date: "2022-12-20" +output: distill::distill_article +--- + +# Academic Factors that Increase Student GPA + +## Introduction + +The research literature documents numerous factors that correlate with students' cumulative GPA, including classroom engagement, time management, motivation, class attendance, time spent studying, and group study sessions (Fokkens-Bruinsma, et al., 2021; Büchele, 2021; Nelson, 2003; Thibodeaux, et al., 2017; Vargas, et al., 2018). The purpose of this research project was to further investigate these findings using a dataset collected by Yılmaz and Sekeroglu (2019). The dataset was originally used to classify student performance using artificial intelligence. For the purposes of this study, I used the dataset to examine which factors correlated with students' cumulative GPA. + +This research project examined the following three research questions: 1) Does classroom engagement (i.e., taking notes, attending class, listening) result in a higher GPA in university students? 2) Does reported studying (i.e., weekly study hours) result in a higher GPA in university students? 3) Does collaboration between students (i.e., studying together, positive class discussions) result in a higher GPA in university students? + +For the first research question, it is reasonable to hypothesize that classroom engagement will have a positive effect on students' academic achievement. Previous research supports this hypothesis. 
For example, one study found that classroom engagement, as well as other related factors such as time management and autonomous motivation, are predictors of academic achievement (Fokkens-Bruinsma, et al., 2021). Another study found that attendance in higher education is a small, but still statistically significant, predictor of academic performance (Büchele, 2021). In this study, classroom engagement will be defined as "taking notes, attendance, and frequency of listening." These measures will be reported by university students via survey. + +With regard to the second research question, it is hypothesized that students who study more will have a higher GPA. There are many previous studies that support this claim. For instance, one study found that university freshmen who studied more than eight hours a week saw an average increase in GPA of 0.580 (Nelson, 2003). Research has also found that increasing study time leads to an increased GPA (Thibodeaux, et al., 2017). In this study, hours spent studying will be measured through students' estimated range of hours studied, reported via survey. + +For the third research question, it is hypothesized that student collaboration will have a positive effect on student GPA. There is some research literature that supports this statement. One study found that students who study with their peers achieve significantly higher homework scores (Vargas, et al., 2018). Another study found that university students who had a strong social network and exhibited collaborative behaviors tended to achieve higher grades (Ellis & Han, 2021). Effective student collaboration can also occur during class time, such as through small group discussions. Research has found that students who participate in small group discussions demonstrate an increase in resilience, which has been shown to improve academic performance (Torrento-estimo, et al., 2012). 
In this study, student collaboration will be measured through students' reported studying with peers and the reported impact of their class discussions. + +## Data + +### The Dataset + +I found the dataset on the website Kaggle (https://www.kaggle.com/datasets/csafrit2/higher-education-students-performance-evaluation?resource=download). The dataset is a survey given to university students that collects demographic variables (e.g., age, job status, family background) and variables pertaining to their academic performance (e.g., time spent studying, class attendance, GPA). Below is the R code I used to read in the dataset, as well as a summary of the data. + +```{r} +knitr::opts_chunk$set(echo = TRUE) +studentsurvey <- read.csv("_data/student_prediction.csv") +summary(studentsurvey) +``` + +Because this dataset was used in a previous research study (Yılmaz & Sekeroglu, 2019), participants had already been de-identified, and responses were represented by number codes to keep subjects' identities private. What I did want to change was the numeric coding of certain variables where it was unnecessary. For example, to represent gender, the researchers used the code "1" to represent "female" and "2" to represent "male." To make the dataset less confusing to use, I chose to relabel these codes. However, I left some unchanged: if the variable involved a range, the numeric placeholder was kept (e.g., for ages, the ranges were 1 = 18-21, 2 = 22-25, 3 = 26 or above). 
+ +```{r} +library(tidyverse) +library(tidyr) +library(dplyr) + +studentsurvey$GENDER <- factor(studentsurvey$GENDER, + levels=c(1,2), + labels=c("female","male")) + +studentsurvey$HS_TYPE <- factor(studentsurvey$HS_TYPE, + levels=c(1,2,3), + labels=c("private","state", "other")) + +studentsurvey$SCHOLARSHIP <- factor(studentsurvey$SCHOLARSHIP, + levels=c(1,2,3,4,5), + labels=c("None","25%", "50%", "75%", "Full")) + +studentsurvey$WORK <- factor(studentsurvey$WORK, + levels=c(1,2), + labels=c("Yes","No")) + +studentsurvey$ACTIVITY <- factor(studentsurvey$ACTIVITY, + levels=c(1,2), + labels=c("Yes","No")) + +studentsurvey$PARTNER <- factor(studentsurvey$PARTNER, + levels=c(1,2), + labels=c("Yes","No")) + +studentsurvey$TRANSPORT <- factor(studentsurvey$TRANSPORT, + levels=c(1,2,3,4), + labels=c("Bus","Car/Taxi", "Bicycle", "Other")) + +studentsurvey$LIVING <- factor(studentsurvey$LIVING, + levels=c(1,2,3,4), + labels=c("Rental","Dorm", "With Family", "Other")) + +studentsurvey$MOTHER_EDU <- factor(studentsurvey$MOTHER_EDU, + levels=c(1,2,3,4,5,6), + labels=c("primary school","secondary school", "high school", "university", "Masters", "Doctorate")) + +studentsurvey$FATHER_EDU <- factor(studentsurvey$FATHER_EDU, + levels=c(1,2,3,4,5,6), + labels=c("primary school","secondary school", "high school", "university", "Masters", "Doctorate")) + +studentsurvey$KIDS <- factor(studentsurvey$KIDS, + levels=c(1,2,3), + labels=c("Married","Divorced", "Died")) + +studentsurvey$MOTHER_JOB <- factor(studentsurvey$MOTHER_JOB, + levels=c(1,2,3,4,5,6), + labels=c("retired","housewife", "government officer", "private sector employee", "self-employment", "other")) + +studentsurvey$FATHER_JOB <- factor(studentsurvey$FATHER_JOB, + levels=c(1,2,3,4,5), + labels=c("retired", "government officer", "private sector employee", "self-employment", "other")) + +studentsurvey$READ_FREQ <- factor(studentsurvey$READ_FREQ, + levels=c(1,2,3), + labels=c("None","Sometimes", "Often")) + 
+studentsurvey$READ_FREQ_SCI <- factor(studentsurvey$READ_FREQ_SCI, + levels=c(1,2,3), + labels=c("None","Sometimes", "Often")) + +studentsurvey$ATTEND_DEPT <- factor(studentsurvey$ATTEND_DEPT, + levels=c(1,2), + labels=c("Yes","No")) + +studentsurvey$IMPACT <- factor(studentsurvey$IMPACT, + levels=c(1,2,3), + labels=c("Positive","Negative","Neutral")) + +studentsurvey$ATTEND <- factor(studentsurvey$ATTEND, + levels=c(1,2,3), + labels=c("always","sometimes","never")) + +studentsurvey$PREP_STUDY <- factor(studentsurvey$PREP_STUDY, + levels=c(1,2,3), + labels=c("alone","with friends","not applicable")) + +studentsurvey$PREP_EXAM <- factor(studentsurvey$PREP_EXAM, + levels=c(1,2,3), + labels=c("close to exam date","regularly throughout semester","never")) + +studentsurvey$NOTES <- factor(studentsurvey$NOTES, + levels=c(1,2,3), + labels=c("never","sometimes","always")) + +studentsurvey$LISTENS <- factor(studentsurvey$LISTENS, + levels=c(1,2,3), + labels=c("never","sometimes","always")) + +studentsurvey$LIKES_DISCUSS <- factor(studentsurvey$LIKES_DISCUSS, + levels=c(1,2,3), + labels=c("never","sometimes","always")) + +studentsurvey$CLASSROOM <- factor(studentsurvey$CLASSROOM, + levels=c(1,2,3), + labels=c("not useful","useful","not applicable")) +``` + +### Demographic Variables + +In order to gain information about the participant sample, I began by running some descriptive statistics with the sample's demographic variable. Below are some bar graphs (with the code needed to generate the graphs) pertaining to the demographics of the sample. 
+ +```{r} +# Bar graph of sample gender +ggplot(studentsurvey, aes(x = GENDER)) + ggtitle("Sample Gender") + geom_bar() + +# Bar graph of sample high school type +ggplot(studentsurvey, aes(x = HS_TYPE)) + ggtitle("High School Type Sample Graduated From") + geom_bar() + +# Bar graph of sample scholarship received +ggplot(studentsurvey, aes(x = SCHOLARSHIP)) + ggtitle("Percentage of Scholarship Received") + geom_bar() + +# Bar graph of sample work status +ggplot(studentsurvey, aes(x = WORK)) + ggtitle("Sample Work Status") + geom_bar() +``` + +In this particular study, there were more male than female participants. Most students attended a state/public high school. Additionally, most students have received at least 50% scholarship at this university, indicating that many students at this particular university have received scholarships. Furthermore, most students do not have a job while they are studying at university in this sample. As the vast majority of students have scholarships, working a job during university may not be necessary. + +This sample may not be representative of the U.S. student population. There are more male than female students, which is not the case at most schools: there is about a 1:2 male to female ratio at U.S. colleges (Leukhina & Smaldone, 2022). Additionally, like in the sample, the vast majority of students attended public schools (Riser-Kositsky, 2022). In regards to scholarships, the students at this particular university receive scholarships at significantly higher rates than the rest of the U.S. Only about one in eight students receive a scholarship, and only 5% receive a full scholarship (Scholarship Statistics, 2021). While the enrollment statuses of the students were not given, if all students were full-time students, it would align with research that shows that less than half of full-time students (40%) in U.S. universities work while in school. While this sample may not be entirely representative of the U.S. 
college student population, analyses of this dataset may provide some insight into factors associated with university students' GPA. + +## Visualization + +### Analysis + +For research question 1 -- the influence of classroom engagement on student GPA -- I chose to run a simple linear regression and a correlation test. I did also conduct a multiple regression analysis, but I preferred to separate the three variables within my definition of "classroom engagement" so I could analyze them individually. For research question 2, as with research question 1, I chose to run a simple linear regression and a correlation test to analyze the data, but because only one predictor variable was involved, a multiple regression analysis was not conducted. The same method of analysis used for research question 1 was applied to research question 3 as well. + +### Research Question 1 + +#### Statistical Analyses + +```{r} +# The analyses need the numeric codes rather than the labelled factors, so the dataset is re-imported here. 
+studentsurvey <- read.csv("_data/student_prediction.csv") + +### Factor 1: Taking Notes and GPA ### + +# Simple linear regression (GPA as the outcome, note-taking as the predictor) +nfit <- lm(CUML_GPA ~ NOTES, data = studentsurvey) +summary(nfit) + +# Correlation test +cor.test(studentsurvey$NOTES, studentsurvey$CUML_GPA) + +### Factor 2: Class Attendance and GPA ### + +# Simple linear regression +afit <- lm(CUML_GPA ~ ATTEND, data = studentsurvey) +summary(afit) + +# Correlation test +cor.test(studentsurvey$ATTEND, studentsurvey$CUML_GPA) + +### Factor 3: Reported Listening and GPA ### + +# Simple linear regression +lfit <- lm(CUML_GPA ~ LISTENS, data = studentsurvey) +summary(lfit) + +# Correlation test +cor.test(studentsurvey$LISTENS, studentsurvey$CUML_GPA) + +### Multiple Regression: All Factors Combined and GPA ### +summary(lm(CUML_GPA ~ NOTES + ATTEND + LISTENS, data = studentsurvey)) +``` + +Three variables were classified as "classroom engagement": 1) taking notes, 2) class attendance, and 3) reported listening in class. The first variable, taking notes, did not appear to have a significant association with cumulative GPA. The p-value (0.08499) was greater than 0.05, indicating the result was not statistically significant. Additionally, the correlation coefficient was positive, but only slightly (0.1435413). The adjusted R-squared was also low (0.01376). + +The second variable, class attendance, was found to be statistically significant, as the p-value was less than 0.05 (0.0319). Because attendance is coded from 1 (always) to 3 (never), the negative correlation coefficient (-0.1783047) means that students who attended class more often tended to have higher GPAs. This correlation is also slight, as indicated by the coefficient and the adjusted R-squared (0.02502). + +Students' reported listening during class was not a statistically significant predictor of GPA, with a p-value higher than 0.05 (0.5079). The correlation was also extremely slight, with a positive correlation coefficient of 0.05542742 and an adjusted R-squared value of -0.003899. 
+ +Finally, the multiple linear regression of all factors combined was not statistically significant (p = 0.07402). Thus, my hypothesis that classroom engagement would have a positive influence on GPA was not supported. + +#### Data Visualization + +```{r} +### Factor 1: Taking Notes and GPA ### +# Numeric key: 1 = never takes notes, 2 = sometimes takes notes, and 3 = always takes notes +ggplot(data = studentsurvey, aes(x = NOTES, y = CUML_GPA)) + + geom_point() + + geom_smooth(method = lm) + +### Factor 2: Class Attendance and GPA ### +# Numeric key: 1 = always attends class, 2 = sometimes attends class, 3 = never attends class +ggplot(data = studentsurvey, aes(x = ATTEND, y = CUML_GPA)) + + geom_point() + + geom_smooth(method = lm) + +### Factor 3: Reported Listening and GPA ### +# Numeric key: 1 = never listens to class lectures, 2 = sometimes listens to class lectures, 3 = always listens to class lectures +ggplot(data = studentsurvey, aes(x = LISTENS, y = CUML_GPA)) + + geom_point() + + geom_smooth(method = lm) +``` + +### Research Question 2 + +#### Statistical Analyses + +```{r} +# Simple linear regression (GPA as the outcome, study hours as the predictor) +shfit <- lm(CUML_GPA ~ STUDY_HRS, data = studentsurvey) +summary(shfit) + +# Correlation test +cor.test(studentsurvey$STUDY_HRS, studentsurvey$CUML_GPA) +``` + +As with my previous research question, I chose to run a simple linear regression and a correlation test to analyze the data. The results indicated that hours spent studying had very little association with cumulative GPA. The p-value was greater than 0.05 (0.9225), and both the correlation coefficient, although positive, and the adjusted R-squared were extremely small (0.008144991 and -0.006926). Thus, my hypothesis was not supported. 
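A simple linear regression with a single predictor and a Pearson correlation test evaluate the same linear relationship, which is why the two tests report matching p-values. The sketch below illustrates this with simulated data; the `toy` data frame and its `study_hrs` and `gpa` columns are hypothetical stand-ins, not survey values. It also shows that the slope's p-value and the R-squared do not depend on which variable sits on the left of the formula.

```r
# Hypothetical, simulated data: these values are NOT from the survey.
set.seed(42)
toy <- data.frame(
  study_hrs = sample(1:5, size = 100, replace = TRUE),  # ordinal codes 1-5
  gpa       = sample(1:5, size = 100, replace = TRUE)   # coded GPA bands
)

# Fit the regression in both directions, plus the Pearson correlation test.
fit_a <- lm(gpa ~ study_hrs, data = toy)
fit_b <- lm(study_hrs ~ gpa, data = toy)
ct    <- cor.test(toy$study_hrs, toy$gpa)

# p-value of the slope in each model.
p_a <- summary(fit_a)$coefficients["study_hrs", "Pr(>|t|)"]
p_b <- summary(fit_b)$coefficients["gpa", "Pr(>|t|)"]

same_p_both_directions <- isTRUE(all.equal(p_a, p_b))
same_p_as_cor_test     <- isTRUE(all.equal(p_a, ct$p.value))
r2_equals_r_squared    <- isTRUE(all.equal(summary(fit_a)$r.squared,
                                           unname(ct$estimate)^2))

print(c(same_p_both_directions, same_p_as_cor_test, r2_equals_r_squared))
```

Only the coefficient table differs between the two model directions; putting GPA on the left is what makes the slope read as "change in GPA per step of the predictor."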
+ +#### Data Visualization + +```{r} +# Numeric key: 1 = 0 hours per week, 2 = <5 hours, 3 = 6-10 hours, 4 = 11-20 hours, 5 = more than 20 hours +ggplot(data = studentsurvey, aes(x = STUDY_HRS, y = CUML_GPA)) + + geom_point() + + geom_smooth(method = lm) +``` + +### Research Question 3 + +#### Statistical Analyses + +```{r} +### Factor 1: Peer Study Groups and GPA ### + +# Dividing into whether or not students study with peers (1 = no, 2 = yes) +studentsurvey$PREP_STUDY <- ifelse(studentsurvey$PREP_STUDY==2, 2, 1) + +# Simple linear regression (GPA as the outcome) +spfit <- lm(CUML_GPA ~ PREP_STUDY, data = studentsurvey) +summary(spfit) + +# Correlation test +cor.test(studentsurvey$PREP_STUDY, studentsurvey$CUML_GPA) + +### Factor 2: Positive Class Discussions and GPA ### + +# Dividing into whether or not students enjoy class discussions (1 = no, 2 = yes) +studentsurvey$LIKES_DISCUSS <- ifelse(studentsurvey$LIKES_DISCUSS==1, 1, 2) + +ldfit <- lm(CUML_GPA ~ LIKES_DISCUSS, data = studentsurvey) +summary(ldfit) + +cor.test(studentsurvey$LIKES_DISCUSS, studentsurvey$CUML_GPA) + +### Multiple Regression: Both Factors and GPA ### +summary(lm(CUML_GPA ~ PREP_STUDY + LIKES_DISCUSS, data = studentsurvey)) +``` + +Students who study with their peers tended to have higher GPAs, according to the simple linear regression and correlation test. The p-value was less than 0.05 (0.01535). However, the correlation was not extremely high (0.2009882) and neither was the adjusted R-squared value (0.03369). That being said, the results were statistically significant. + +Additionally, students who found class discussions to be helpful to their education and learning (always or some of the time, compared to those who did not find class discussions to be a positive experience) were significantly more likely to have higher GPAs. The p-value was less than 0.01 (0.007804). Again, neither the correlation (0.2201251) nor the adjusted R-squared (0.0418) was large. 
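When a predictor has only two levels coded 1 and 2, as with the peer-study split above, the slope of a simple regression is just the difference between the two group means, and its p-value matches an equal-variance two-sample t-test. A minimal sketch with simulated data (the `toy` data frame and its `with_peers` and `gpa` columns are hypothetical, not survey values):

```r
# Hypothetical, simulated data: these values are NOT from the survey.
set.seed(1)
toy <- data.frame(
  with_peers = rep(c(1, 2), times = c(30, 20)),  # 1 = studies alone, 2 = with peers
  gpa        = round(runif(50, min = 1, max = 5), 1)
)

# Regression of GPA on the two-level predictor.
fit   <- lm(gpa ~ with_peers, data = toy)
slope <- unname(coef(fit)["with_peers"])

# Difference in mean GPA between the two groups.
group_means <- tapply(toy$gpa, toy$with_peers, mean)
diff_means  <- unname(group_means["2"] - group_means["1"])

# Pooled-variance t-test on the same split.
tt      <- t.test(gpa ~ with_peers, data = toy, var.equal = TRUE)
slope_p <- summary(fit)$coefficients["with_peers", "Pr(>|t|)"]

slope_is_mean_difference <- isTRUE(all.equal(slope, diff_means))
p_matches_t_test         <- isTRUE(all.equal(slope_p, tt$p.value))

print(c(slope_is_mean_difference, p_matches_t_test))
```

Reading the regression this way makes the interpretation concrete: a significant slope says the two groups differ in average GPA, not that every extra unit of collaboration adds a fixed amount.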
+ +The multiple regression analysis also found the combined two variables to be statistically significant (0.01666). Thus, it could be concluded that collaboration has a positive impact on GPA, supporting my hypothesis. + +#### Data Visualization + +```{r} +### Factor 1: Peer Study Groups and GPA ### +# Numeric key: 1 = does not study with peers, 2 = studies with peers +ggplot(data = studentsurvey, aes(x = PREP_STUDY, y = CUML_GPA)) + + geom_point() + + geom_smooth(method = lm) + +### Factor 2: Positive Class Discussions and GPA ### +# Numeric key: 1 = does not enjoy class discussions, 2 = enjoys class discussions +ggplot(data = studentsurvey, aes(x = LIKES_DISCUSS, y = CUML_GPA)) + + geom_point() + + geom_smooth(method = lm) +``` + +## Reflection + +To be honest, this project was a huge challenge for me. Although the dataset I chose seems relatively simple, it was certainly a challenge for me. As someone who is interested in education (I used to be in school to become a school psychologist), I decided to conduct a research project relating to education. I searched for datasets relating to education on numerous sites, but the one that stood out to me was the one I found on Kaggle. It was so well-organized with an outstanding key, it was possible for me to wrap my head around it. + +The next part was cleaning it. I wanted to have the variables be in word format rather than assigned numbers so it was easier to read. It took a lot of research to figure out how to do it with ease, but once I got it, it was simple. Unfortunately, when it came to data analysis, numeric values were needed to conduct proper analyses, so I had to convert it back when I went to analyze the data. I must admit that I was pretty disappointed that my work seemed pointless, but running tests with the code I learned from classes was pretty easy. It was interesting to combine different factors to see how they affected my dependent variable (student cumulative GPA). 
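One way to avoid converting back and forth, sketched here under the assumption that the numeric codes run 1, 2, 3 in level order, is to store the labelled factor in a new column and leave the original codes in place. The `toy` data frame and the `NOTES_LABEL` column name are hypothetical and only illustrate the idea.

```r
# Hypothetical toy data standing in for one survey column (not the real responses).
toy <- data.frame(NOTES = c(3, 2, 3, 1, 2, 3))

# Store the labelled version in a new column instead of overwriting the codes.
toy$NOTES_LABEL <- factor(toy$NOTES,
                          levels = c(1, 2, 3),
                          labels = c("never", "sometimes", "always"))

# Readable labels for tables and bar charts.
print(table(toy$NOTES_LABEL))

# Numeric codes are still available for regression and correlation.
print(mean(toy$NOTES))

# If only the factor had been kept, as.integer() recovers the codes,
# because the levels were declared in code order (1, 2, 3).
recovered <- as.integer(toy$NOTES_LABEL)
print(identical(recovered, as.integer(toy$NOTES)))
```

With both columns present, the labelled version can feed the demographic bar graphs while the numeric version feeds `lm()` and `cor.test()`, so no re-import is needed.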
+ +If I were to continue with the project, I'd like to conduct my own study to see whether my hypothesis that student collaboration improves student GPA holds in other samples. A potential study could compare the academic performance of students in a class that is primarily discussion-based vs. a class that is lecture-based. I would like to get survey data from students, too, asking about their experiences in the classes. My hypothesis would be that students in discussion-based classes perform better than those in lecture-based classes, and that those students enjoy their discussion-based classes more. + +## Conclusion + +The hypothesis that classroom engagement would have a positive influence on GPA was not supported. It may be because of the quality of students' time spent in and outside the classroom. However, it appears that attending class is associated with a higher GPA. Additionally, the quality of notes and active listening may be more important than the quantity. Further research into students' classroom habits is needed to confirm this. The hypothesis that more hours studying would have a positive influence on GPA was also not supported. Again, it may be an issue of quality rather than quantity. Perhaps overstudying is a problem, or students spend a lot of that time actually distracted. More research into students' study habits would be needed. It could be concluded that collaboration is positively associated with GPA, supporting the hypothesis. As university classes move to a more collaborative format, encouraging peer study groups and ensuring that classroom discussions are a positive experience for students may help with both their social and academic skills. + +## References + +Büchele, S. (2021). Evaluating the link between attendance and performance in higher education: + +the role of classroom engagement dimensions. Assessment & Evaluation in Higher Education, 46(1), 132-150. + +Ellis, R., & Han, F. (2021). 
Assessing university student collaboration in new ways. Assessment + +& Evaluation in Higher Education, 46(4), 509-524. + +Fokkens-Bruinsma, M., Vermue, C., Deinum, J. F., & van Rooij, E. (2021). First-year + +academic achievement: the role of academic self-efficacy, self-regulated learning and beyond classroom engagement. Assessment & Evaluation in Higher Education, 46(7), 1115-1126. + +Hanson, M. (2022, July 26). College Enrollment & Student Demographic Statistics. + +EducationData.org. Retrieved from . + +Leukhina, O., & Smaldone, A. (2022, March 14). Why do women outnumber men in college + +enrollment? Saint Louis Fed Eagle. Retrieved from . + +National Center for Education Statistics. (2022, May). College Student Employment. Coe - + +college student employment. Retrieved from . + +Nelson, R. (2003). Student Efficiency: A study on the behavior and productive efficiency of + +college students and the determinants of GPA. Issues in Political Economy, 12, 32-43. + +Riser-Kositsky, M. (2022, August 2). Education statistics: Facts about American Schools. + +Education Week. Retrieved from . + +Scholarship statistics. ThinkImpact.com. (2021, November 10). Retrieved from + +. + +Thibodeaux, J., Deutsch, A., Kitsantas, A., & Winsler, A. (2017). First-year college students' + +time use: Relations with self-regulation and GPA. Journal of Advanced Academics, 28(1), 5-27. + +Torrento-estimo, E., Lourdes, C., & Evidente, L. G. (2012). Collaborative Learning in Small + +Group Discussions and Its Impact on Resilience Quotient and Academic Performance. JPAIR Multidisciplinary Research Journal, 7(1), 1-1. + +Vargas, D. L., Bridgeman, A. M., Schmidt, D. R., Kohl, P. B., Wilcox, B. R., & Carr, L. D. + +(2018). Correlation between student collaboration network centrality and academic performance. Physical Review Physics Education Research, 14(2), 020112. + +Yılmaz, N., & Sekeroglu, B. (2019, August). 
Student Performance Classification Using Artificial  + +Intelligence Techniques. In International Conference on Theory and Application of Soft Computing, Computing with Words and Perceptions (pp. 596-603). Springer, Cham. + +\ diff --git a/All Submissions/Homework3.qmd b/All Submissions/Homework3.qmd new file mode 100644 index 00000000..8fbf3352 --- /dev/null +++ b/All Submissions/Homework3.qmd @@ -0,0 +1,118 @@ +--- +title: "Homework 3" +author: "Emily Duryea" +desription: "Homework 3 submission by Emily Duryea" +date: "12/20/2022" +format: + html: + toc: true + code-fold: true + code-copy: true + code-tools: true +categories: + - homework2 + - emilyduryea + - student + - academic +--- + +# Homework 3 + +```{r} +# Importing dataset +studentsurvey <- read.csv("_data/student_prediction.csv") +summary(studentsurvey) +``` + +## Descriptive Statistics & Visualization + +```{r} +library(tidyverse) +library(tidyr) +library(dplyr) +library(epiDisplay) + +### Sample Information ### + +# Sample Gender +tab1(studentsurvey$GENDER, sort.group = "decreasing", cum.percent = TRUE) +# Numeric key: 1 = female, 2 = male + +# Sample's Graduated High School Type +tab1(studentsurvey$HS_TYPE, sort.group = "decreasing", cum.percent = TRUE) +# Numeric key: 1 = graduated from a private high school, 2 = state high school, 3 = other) + +# Sample's Work Status +tab1(studentsurvey$WORK, sort.group = "decreasing", cum.percent = TRUE) +# Numeric key: 1 = Yes, 2 = No + +# Sample's Received Scholarship +tab1(studentsurvey$SCHOLARSHIP, sort.group = "decreasing", cum.percent = TRUE) +# Numeric key: 1 = No Scholarship, 2 = 25% Scholarship, 3 = 50% Scholarship, 4 = 75% Scholarship, 5 = Full Scholarship) + +### Research Question 1 ### + +# Taking Notes in Class +mean(studentsurvey$NOTES) +median(studentsurvey$NOTES) +sd(studentsurvey$NOTES) +# Numeric key: 1 = never takes notes, 2 = sometimes takes notes, and 3 = always takes notes +# Frequency visualization +tab1(studentsurvey$NOTES, sort.group = 
"decreasing", cum.percent = TRUE) + +# Class Attendance +mean(studentsurvey$ATTEND) +median(studentsurvey$ATTEND) +sd(studentsurvey$ATTEND) +# Numeric key: 1 = always attends class, 2 = sometimes attends class, 3 = never attends class) +# Frequency visualization +tab1(studentsurvey$ATTEND, sort.group = "decreasing", cum.percent = TRUE) + +# Reported Listening in Class +mean(studentsurvey$LISTENS) +median(studentsurvey$LISTENS) +sd(studentsurvey$LISTENS) +# Numeric key: 1 = never listens to class lectures, 2 = sometimes listens to class lectures, 3 = always listens to class lectures +# Frequency visualization +tab1(studentsurvey$LISTENS, sort.group = "decreasing", cum.percent = TRUE) + +### Research Question 2 ### + +# Hours Studying +mean(studentsurvey$STUDY_HRS) +median(studentsurvey$STUDY_HRS) +sd(studentsurvey$STUDY_HRS) +# Numeric key: 1 = 0 hours per week, 2 = <5 hours, 3 = 6-10 hours, 4 = 11-20 hours, 5 = more than 20 hours) +# Frequency visualization +tab1(studentsurvey$STUDY_HRS, sort.group = "decreasing", cum.percent = TRUE) + +### Research Question 3 ### + +# Peer Study Groups +mean(studentsurvey$PREP_STUDY) +median(studentsurvey$PREP_STUDY) +sd(studentsurvey$PREP_STUDY) +# Numeric key: 1 = studies alone, 2 = studies with friends, 3 = not applicable +# Frequency visualization +tab1(studentsurvey$PREP_STUDY, sort.group = "decreasing", cum.percent = TRUE) + +# Positive Class Discussions +mean(studentsurvey$LIKES_DISCUSS) +median(studentsurvey$LIKES_DISCUSS) +sd(studentsurvey$LIKES_DISCUSS) +# Numeric key: 1 = never likes/participates in discussions 2 = sometimes, 3 = always) +# Frequency visualization +tab1(studentsurvey$LIKES_DISCUSS, sort.group = "decreasing", cum.percent = TRUE) +``` + +### Sample Conclusions + +In this particular study, there were more male (60%) than female (40%) participants. Most students attended a state/public high school (71%). 
Additionally, most students received at least a 50% scholarship at this university (52.4% received a 50% scholarship, 29% received 75%, and 15.9% received a full scholarship, whereas only 2.1% received 25% and 0.7% received no scholarship), indicating that scholarships are very common at this particular university. Furthermore, most students in this sample do not have a job (66.2%) while they are studying at university. As the vast majority of students have scholarships, working a job during university may not be necessary. + +### Research Question Variables Conclusions + +For research question 1, most students reported always taking notes (57.3%; m = 2.544828, sd = 0.5649399, median = 3). Most students reported always attending class (75.9%; m = 1.241379, sd = 0.429403, median = 1). Interestingly, the majority of students reported only sometimes listening in class (54.2%, m = 2.055172, sd = 0.6747357, median = 2). The statistics reveal that the majority of students engage in classroom engagement behaviors. In regard to research question 2, most students reported studying less than five hours a week (51%; m = 2.2, sd = 0.9174239, median = 2). That is less study time than I would have anticipated for university students. For research question 3, most students reported studying alone (73.8%; m = 1.337931, sd = 0.61487, median = 1). Most students also enjoyed class discussions sometimes (48.3%) or all of the time (45.5%). Only a few never enjoyed class discussions or found them beneficial to their learning (6.2%; m = 2.393103, sd = 0.6043425, median = 2). + +## Reflection + +A limitation of my visuals is that a naive viewer probably could not interpret them without the numeric key. That is why I provided the numeric keys as comments in the code so that the output could be understood. Going forward, I would be interested to see how different variables (like classroom engagement, peer collaboration, and study habits) interact with cumulative GPA. I would also like to see how demographic variables interact with GPA and other factors (e.g., hours spent studying and work status). diff --git a/All Submissions/challenge1.qmd b/All Submissions/challenge1.qmd new file mode 100644 index 00000000..9d9a2a62 --- /dev/null +++ b/All Submissions/challenge1.qmd @@ -0,0 +1,69 @@ +--- +title: "Challenge 1" +author: "Emily Duryea" +description: "Reading in data and creating a post" +date: "11/25/2022" +format: + html: + toc: true + code-fold: true + code-copy: true + code-tools: true +categories: + - challenge_1 + - birds +--- + +```{r} +#| label: setup +#| warning: false +#| message: false + +library(tidyverse) + +knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) +``` + +## Challenge Overview + +Today's challenge is to + +1) read in a dataset, and + +2) describe the dataset using both words and any supporting information (e.g., tables, etc) + +## Read in the Data + +Read in one (or more) of the following data sets, using the correct R package and command. + +- railroad_2012_clean_county.csv ⭐ +- birds.csv ⭐⭐ +- FAOstat\*.csv ⭐⭐ +- wild_bird_data.xlsx ⭐⭐⭐ +- StateCounty2012.xls ⭐⭐⭐⭐ + +Find the `_data` folder, located inside the `posts` folder. Then you can read in the data, using either one of the `readr` standard tidy read commands, or a specialized package such as `readxl`. + +```{r} +# Importing the data file +library(readr) +birds <- read_csv("_data/birds.csv") +View(birds) +``` + +Add any comments or documentation as needed. More challenging data sets may require additional code chunks and documentation. + +## Describe the data + +Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data). 
+ +```{r} +#| label: summary +summary(birds) +count(birds, Item) +count(birds, Area) +``` + +This dataset includes 30,977 rows, with 14 columns. It contains data on 5 categories of birds from 248 countries. Across those 248 countries, 13,074 are chickens, 6,909 are ducks, 5,693 are turkeys, 4,136 are geese and guinea fowls, and 1,165 are pigeons and other birds. Some countries contain a large portion of those entries (e.g., France, Egypt, and Greece, with 290), while others have very few (e.g., Luxembourg with 19, Montenegro with 13, and Sudan with 7). + + diff --git a/All Submissions/challenge2.qmd b/All Submissions/challenge2.qmd new file mode 100644 index 00000000..2e00f8c3 --- /dev/null +++ b/All Submissions/challenge2.qmd @@ -0,0 +1,86 @@ +--- +title: "Challenge 2" +author: "Emily Duryea" +desription: "Data wrangling: using group() and summarise()" +date: "11/25/2022" +format: + html: + toc: true + code-fold: true + code-copy: true + code-tools: true +categories: + - challenge_2 + - railroad +--- + +```{r} +#| label: setup +#| warning: false +#| message: false + +library(tidyverse) + +knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) +``` + +## Challenge Overview + +Today's challenge is to + +1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc) +2) provide summary statistics for different interesting groups within the data, and interpret those statistics + +## Read in the Data + +Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command. + +- railroad\*.csv or StateCounty2012.xls ⭐ +- FAOstat\*.csv or birds.csv ⭐⭐⭐ +- hotel_bookings.csv ⭐⭐⭐⭐ + +```{r} +# Importing the data file +library(readr) +railroad <- read_csv("_data/railroad_2012_clean_county.csv") +View(railroad) +``` + +Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation. 
+ +## Describe the data + +Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data). + +```{r} +#| label: summary +summary(railroad) +dim(railroad) +str(railroad) +mean(railroad$total_employees) +min(railroad$total_employees) +max(railroad$total_employees) +median(railroad$total_employees) +count(railroad, state) +count(railroad, county) +``` + +This dataset includes 2,930 rows, with 3 columns. It contains data from 53 states & territories and 1,709 distinct county names. The mean number of railroad employees per county in this dataset was about 87.18. The minimum number of employees in any county is 1, and the max is 8,207. The median is 21. These results suggest that there are some major outliers that have increased the mean, since the median is much lower than the mean, and the maximum is an extremely large value. This maximum is located in Cook County, Illinois. + +## Provide Grouped Summary Statistics + +Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set. + +```{r} +# Finding the central tendency and spread of total employees by state +railroad %>% + select(state, total_employees) %>% + group_by(state) %>% + summarize(mean_employees = mean(total_employees), median_employees = median(total_employees), sd_employees = sd(total_employees)) +``` + +### Explain and Interpret + +Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
+ +I chose to examine the central tendency of the total number of railroad employees per county, by state. I was curious to see how states varied by total employees. The average appears to be higher in states with larger populations. For example, California, which has a high population, has an average of 238 employees, while a state like Maine, with a lower population, has an average of 40 employees. It would be interesting to see if this hypothesis would hold up in further analyses. \ No newline at end of file diff --git a/All Submissions/challenge3.qmd b/All Submissions/challenge3.qmd new file mode 100644 index 00000000..6d93fb58 --- /dev/null +++ b/All Submissions/challenge3.qmd @@ -0,0 +1,63 @@ +--- +title: "Challenge 3" +author: "Emily Duryea" +description: "Challenge 3" +date: "12/20/2022" +format: + html: + toc: true + code-fold: true + code-copy: true + code-tools: true +categories: + - challenge_3 +--- + +# Challenge 3 + +```{r} +#| label: setup +#| warning: false +#| message: false + +library(tidyverse) + +knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) +``` + +## Read in Data + +```{r} +aw <- read.csv("_data/animal_weight.csv") +aw +``` + +The dataset, which I chose to label as "aw" as short for "animal weights," contains data on different animal weights from different regions of the world. The animals include: 1) dairy cattle, 2) non-dairy cattle, 3) buffaloes, 4) market swine, 5) breeding swine, 6) chickens (broilers), 7) chickens (layers), 8) ducks, 9) turkeys, 10) sheep, 11) goats, 12) horses, 13) asses, 14) mules, 15) camels, and 16) llamas. The animals are listed in columns. The regions of the animals are in rows, and are as follows: 1) Indian subcontinent, 2) Eastern Europe, 3) Africa, 4) Oceania, 5) Western Europe, 6) Latin America, 7) Asia, 8) Middle East, and 9) North America. The values in the rows and columns are the animal weights by region.
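The wide layout described above (one row per region, one column per animal) implies a predictable long form, which can be worked out before pivoting. The following is a minimal sketch, assuming a small made-up data frame `aw_toy` with a hypothetical `region` identifier column; the names and weights are illustrative and are not taken from `animal_weight.csv`.

```r
library(tidyr)

# Hypothetical stand-in for the animal weight data: 3 regions x 2 animals
aw_toy <- data.frame(
  region = c("Africa", "Asia", "Oceania"),
  Cattle = c(275, 350, 500),
  Sheep  = c(28, 28, 48.5)
)

# Expected long dimensions: one row per region-animal pair,
# and three columns (region, animal, weight)
expected_rows <- nrow(aw_toy) * (ncol(aw_toy) - 1)
expected_cols <- 3

# Pivot every column except the identifier
aw_long <- pivot_longer(aw_toy, cols = -region,
                        names_to = "animal", values_to = "weight")

dim(aw_long)
```

Comparing `dim(aw_long)` with `expected_rows` and `expected_cols` gives a quick check that no rows were dropped or duplicated by the pivot.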
+ +## Finding the Dimensions + +```{r} +# Getting the number of rows +nrow(aw) + +# Getting the number of columns +ncol(aw) + +# Calculating the expected number of total cases (rows times animal columns, excluding the region identifier column) +nrow(aw) * (ncol(aw)-1) + +# Calculating the expected number of columns (region, animal, weights) +1+1+1 +``` + +The current dataset has 16 animal columns (not counting the region identifier column) with 9 rows, and it is anticipated to have 144 cases after pivoting. + +## Pivot the Data + +```{r} +pivot_longer(aw, "Cattle...dairy":"Llamas", + names_to="animal", + values_to = "weights") +``` + +After pivoting the data, there are 3 columns with 144 rows, as anticipated by the calculations. diff --git a/All Submissions/challenge4.qmd b/All Submissions/challenge4.qmd new file mode 100644 index 00000000..1dcbdb6d --- /dev/null +++ b/All Submissions/challenge4.qmd @@ -0,0 +1,68 @@ +--- +title: "Challenge 4" +author: "Emily Duryea" +description: "Challenge 4" +date: "12/20/2022" +format: + html: + toc: true + code-fold: true + code-copy: true + code-tools: true +categories: + - challenge_4 +--- + +# Challenge 4 + +```{r} +#| label: setup +#| warning: false +#| message: false + +library(tidyverse) + +knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) +``` + +## Read in Data + +```{r} +abc <- read.csv("_data/abc_poll_2021.csv") +abc + +# summary of data +summary(abc) + +``` + +This dataset appears to be an ABC Poll, a survey conducted with 527 respondents (the number of rows). This survey contains 15 demographic variables, 10 political attitudes questions, and 5 survey administration variables.
+ +## Cleaning the Data + +```{r} +# Renaming variables +abc <-rename(abc, language = xspanish, age = ppage, education5 = ppeduc5, education = ppeducat, gender = ppgender, ethnicity = ppethm, household_size = pphhsize, income = ppinc7, marital_status = ppmarit5, region = ppreg4, rent = pprent, state = ppstaten, work = PPWORKA, employment = ppemploy) +abc <- abc%>% + mutate(ethnicity = str_remove (ethnicity, ", Non-Hispanic")) + +# Removing values where respondents skipped +abc<-abc %>% + mutate(across(starts_with("Q"), ~ na_if(.x, "Skipped"))) +``` + +## Mutating Variables + +```{r} +# Mutating Party ID +abc <-abc %>% + mutate(QPID = fct_recode(QPID, "dem" = "A Democrat", + "rep" = "A Republican", + "ind" = "An Independent", + "na" = "Skipped", + "other" = "Something else")) %>% + mutate(QPID = fct_relevel(QPID, "dem", "ind", "rep","other", "na")) + +ggplot(abc, aes(QPID)) + geom_bar() + +``` diff --git a/All Submissions/challenge5.qmd b/All Submissions/challenge5.qmd new file mode 100644 index 00000000..f82d43fa --- /dev/null +++ b/All Submissions/challenge5.qmd @@ -0,0 +1,119 @@ +--- +title: "Challenge 5" +author: "Emily Duryea" +desription: "Challenge 5" +date: "12/20/2022" +format: + html: + toc: true + code-fold: true + code-copy: true + code-tools: true +categories: + - challenge_5 +--- + +# Challenge 5 + +```{r} +#| label: setup +#| warning: false +#| message: false + +library(tidyverse) +library(ggplot2) + +knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) +``` + +## Read in Data + +```{r} +# Reading in the data +cereal <- read.csv("_data/cereal.csv") +cereal +``` + +The dataset contains four columns: cereal (name), amount of sodium per serving of cereal (in mg), amount of sugar (in g), and the type of cereal ("A" for adult cereal and "C" for children cereal). There are twenty types of cereal which make up the rows. 
+ +## Tidying the Data + +```{r} +# Renaming columns +cereal_tidy <- cereal %>% + rename(cereal_name = Cereal, sugar=Sugar, sodium=Sodium, type=Type) +cereal_tidy +``` + +## Mutating Variables + +```{r} +# Making it so cereal sodium is in grams instead of mg to match measurement of sugar (in grams) +cereal_sodium <- cereal_tidy %>% + mutate(sodium = sodium/1000) +cereal_sodium +``` + +## Pivoting the Data + +```{r} +# Pivoting it so that the data is grouped by sodium and sugar +cerealg <- cereal_sodium %>% + pivot_longer(cols =c("sodium", "sugar"), + names_to="Sodium_Sugar", + values_to="Amount") +cerealg +``` + +## Univariate Visualization + +```{r} +# The visualization orders the cereals from the least to the greatest amount of sodium +cereal_sodium %>% + arrange(sodium) %>% +mutate(cereal_name=factor(cereal_name, levels=cereal_name)) %>% + ggplot(aes(x=cereal_name, y=sodium)) + + geom_segment(aes(xend=cereal_name, yend=0), color="green") + + geom_point(colour="orange", size=2, alpha=0.5)+ + coord_flip() +``` + +The visualization shows that Raisin Bran has the most sodium of the twenty cereals in the dataset, with Frosted Mini Wheats having the least. + +```{r} +# The visualization orders the cereals from the least to the greatest amount of sugar +cereal_sodium %>% + arrange(sugar) %>% +mutate(cereal_name=factor(cereal_name, levels=cereal_name)) %>% + ggplot(aes(x=cereal_name, y=sugar)) + + geom_bar(stat="identity") + + coord_flip() +``` + +The visualization demonstrates that, once again, Raisin Bran has the most sugar, whereas Fiber One has the least. + +## Bivariate Visualization + +For this visualization, I am looking at whether the amount of sodium/sugar plays a role in whether the cereal is classified as an adult or children's cereal.
I hypothesize that cereals with more sugar/sodium will be classified as children's cereal, as children tend to have strong sugar/sodium cravings, and adult cereal tends to be marketed as "healthier," and as adults try to be more health conscious, the sugar/sodium content is monitored. + +```{r} +# Looking at if the amount of sodium/sugar plays a role in whether a cereal is classified as an adult or children's cereal + +# Cereal sugar content & type of cereal +cereal_sodium %>% + arrange(sugar) %>% +mutate(cereal_name=factor(cereal_name, levels=cereal_name)) %>% + ggplot(aes(x=cereal_name, y=sugar, fill=type)) + + geom_bar(stat="identity") + + coord_flip() + +# Cereal sodium content & type of cereal +cereal_sodium %>% + arrange(sodium) %>% +mutate(cereal_name=factor(cereal_name, levels=cereal_name)) %>% + ggplot(aes(x=cereal_name, y=sodium, fill=type)) + + geom_bar(stat="identity") + + coord_flip() +``` + +Based on these visualizations, it would seem that adult cereals actually have higher sugar content than children's cereal. The cereals with the highest sugar content are all classified as adult cereals (Raisin Bran, Crackling Oat Bran, and Honey Smacks). Sodium appears to be a toss-up between adult and children, with highest sodium contents flipping between adult and children cereal. Thus, my hypothesis that adult cereal would have less sugar and sodium would be refuted. 
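The visual comparison above can be cross-checked numerically by averaging sugar and sodium within each cereal type. This is a minimal sketch, assuming a made-up data frame `cereal_toy` that only borrows the column names of `cereal_tidy`; the cereal names and values are hypothetical and do not come from `cereal.csv`.

```r
library(dplyr)

# Hypothetical cereals: two adult ("A") and two children's ("C") cereals
cereal_toy <- data.frame(
  cereal_name = c("Adult One", "Adult Two", "Child One", "Child Two"),
  sodium = c(200, 100, 150, 250),  # mg per serving (made-up values)
  sugar = c(10, 2, 12, 8),         # g per serving (made-up values)
  type = c("A", "A", "C", "C")
)

# Mean sugar and sodium per cereal type, with the group size for context
type_summary <- cereal_toy %>%
  group_by(type) %>%
  summarise(
    mean_sugar = mean(sugar),
    mean_sodium = mean(sodium),
    n = n()
  )

type_summary
```

A table like this makes it easier to say whether one type is higher on average, rather than judging only from which cereals sit at the top of each bar chart.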
diff --git a/All Submissions/challenge6.qmd b/All Submissions/challenge6.qmd new file mode 100644 index 00000000..f7b59ab4 --- /dev/null +++ b/All Submissions/challenge6.qmd @@ -0,0 +1,96 @@ +--- +title: "Challenge 6" +author: "Emily Duryea" +description: "Challenge 6" +date: "12/20/2022" +format: + html: + toc: true + code-fold: true + code-copy: true + code-tools: true +categories: + - challenge_6 +--- + +# Challenge 6 + +```{r} +#| label: setup +#| warning: false +#| message: false + +library(tidyverse) +library(ggplot2) +library(lubridate) +library(readxl) +library(viridis) +library(hrbrthemes) +library(plotly) + +knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) +``` + +## Reading in Data + +```{r} +# Reading in dataset +debt <-read_excel("debt_in_trillions.xlsx", skip= 1, col_names = c("Year_Quarter", "Mortgage", "HE_Revolving", "Auto_Loan", "Credit_Card", "Student_Loan", "Other", "Total")) +debt +``` + +This dataset contains debt (in trillions) from 2003 to 2021, reported by quarter, for six different categories: 1) mortgage, 2) home equity revolving debt, 3) auto loan, 4) credit card, 5) student loan, and 6) miscellaneous debts, along with a total. + +## Tidying the Data + +```{r} +# Separating year and quarters +debt_new<- debt %>% + separate("Year_Quarter",c("Year","Quarter"),sep = ":") +view(debt_new) +``` + +To tidy the data, I separated the yearly quarters into "years" and "quarters."
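As a minimal sketch of that split: the example below assumes quarter labels of the form `03:Q1` (two-digit year, colon, quarter), which is an assumption inferred from the `sep = ":"` argument and the later axis breaks, not something confirmed from the spreadsheet. The data frame `debt_toy` and its values are made up.

```r
library(tidyr)

# Hypothetical rows in the assumed "YY:Qn" format
debt_toy <- data.frame(
  Year_Quarter = c("03:Q1", "03:Q2", "04:Q1"),
  Total = c(7.2, 7.4, 8.3)  # made-up totals
)

# Split the combined label into two character columns at the colon
debt_toy_split <- separate(debt_toy, Year_Quarter,
                           into = c("Year", "Quarter"), sep = ":")

debt_toy_split
```

Note that `Year` stays a character column after the split, which is why a discrete x scale is the natural fit when it is plotted.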
+ +## Time Dependent Visualization + +```{r} +# Student loan debt over time, colored by quarter (column names are unquoted so they map to the data) +debt_plot <- debt_new %>% + ggplot(mapping=aes(x = Year, y = Student_Loan))+ + geom_point(aes(color=Quarter)) +debt_plot +``` + +## Pivoting the Data + +```{r} +debt_new1<- debt_new %>% + pivot_longer(!c(Year,Quarter), names_to = "DebtType",values_to = "DebtPercent" ) + +debt_new1 +``` + +## **Part-Whole Relationships Visualization** + +```{r} +# Base plot built from the pivoted data, which contains DebtType and DebtPercent +debt_new1_plot <- debt_new1 %>% + ggplot(mapping=aes(x = Year, y = DebtPercent)) + +debt_new1_plot + + facet_wrap(~DebtType, scales = "free") + +debt_new1_plot + + geom_point(aes(color = DebtType)) + +debt_new1_plot+ + geom_point() + + facet_wrap(~DebtType) + + scale_x_discrete(breaks = c('03','06','09','12','15','18','21')) + +debt_new1_plot + + geom_point(aes(color = Quarter, alpha=0.9)) + + facet_wrap(~DebtType, scales = "free_y") + + guides(alpha="none") + + labs(title="Debt by type from '03 - '21")+ + scale_x_discrete(breaks = c('03','06','09','12','15','18','21')) +``` diff --git a/All Submissions/challenge7.qmd b/All Submissions/challenge7.qmd new file mode 100644 index 00000000..daa81a03 --- /dev/null +++ b/All Submissions/challenge7.qmd @@ -0,0 +1,101 @@ +--- +title: "Challenge 7" +author: "Emily Duryea" +description: "Challenge 7" +date: "12/20/2022" +format: + html: + toc: true + code-fold: true + code-copy: true + code-tools: true +categories: + - challenge_7 +--- + +# Challenge 7 + +```{r} +#| label: setup +#| warning: false +#| message: false + +library(tidyverse) +library(ggplot2) + +knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) +``` + +## Read in Data + +```{r} +egg <- read.csv("_data/eggs_tidy.csv") +egg +``` + +This dataset shows egg sales by month/year, as well as the size (half dozen large, dozen large, half dozen extra large, dozen extra large).
+ +## Tidying & Mutating the Data + +```{r} +# Collecting the names of the carton columns (everything except month and year) +column <- names(egg) +column <- column[!column %in% c("year","month")] +column + +# Pivoting the data so each row is one month, year, and carton type +egg <- egg %>% + pivot_longer(cols=all_of(column), names_to = "carton_type", values_to = "sales") +egg +``` + +## Visualization with Multiple Dimensions + +```{r} +# Summing sales by year +egggroup <- egg %>% + group_by(year) %>% + summarise( + total_sales = sum(sales) + ) +egggroup + +# Creating a line plot of total sales and year +ggplot(egggroup, aes(x = year, y = total_sales)) + + geom_line(color = "black") + + theme_minimal() + + theme( + plot.background = element_rect(fill = "lightyellow"), + panel.grid = element_line(color = "grey") + ) + + +# Summing sales by year and carton type +egggroup2 <- egg %>% + group_by(year, carton_type) %>% + summarise( + total = sum(sales) + ) +egggroup2 + +ggplot(egggroup2, aes(x=year, y=total, fill=carton_type)) + + geom_col(color="black", size=0.5) + + theme(text = element_text(family="Times")) + + geom_vline(xintercept=c(2010, 2015, 2020), color="blue", linetype="dashed", size=1) + + scale_fill_brewer(type="seq", palette="Reds") + +ggplot(data=egggroup2, aes(x=year, y=total, color= carton_type)) + + geom_line() + + geom_point() + + labs( + x = "Year", + y = "Total Sales", + color = "Carton Type", + title = "Total Sales of Egg Carton Types Over the Years" + ) + + guides(color = guide_legend(title="Carton Type")) + + theme_minimal() + + theme( + text = element_text(family="Times", size=12, color="black"), + panel.background = element_rect(fill="lightyellow") + ) +``` diff --git a/All Submissions/challenge8.qmd b/All Submissions/challenge8.qmd new file mode 100644 index 00000000..b2875518 --- /dev/null +++ b/All Submissions/challenge8.qmd @@ -0,0 +1,63 @@ +--- +title: "Challenge 8" +author: "Emily Duryea" +description: "Challenge 8" +date: "12/20/2022" +format: + html: + toc: true + code-fold: true + code-copy: true + code-tools: true +categories: + - challenge_8 +--- + +#
Challenge 8 + +```{r} +#| label: setup +#| warning: false +#| message: false + +library(tidyverse) +library(ggplot2) + +knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) +``` + +## Reading in Data + +```{r} +cgroups <- read.csv("_data/FAOSTAT_country_groups.csv") +dcattle <- read.csv("_data/FAOSTAT_cattle_dairy.csv") +``` + +The dataset that I will primarily be working with is the FAO Stat Cattle dataset. I will combine it with another dataset that groups countries into regions, so that the FAO Stat Cattle data can be examined by region rather than by individual country. The FAO Stat Cattle dataset contains data on cow milk and sales in 245 countries all over the world. The data run from 1961 to 2018. There are 14 columns and 36,449 rows. + +## Tidying & Combining the Data + +```{r} +# Changing "Area.Code" to "Country.Code" to match the other dataset +dcattle2 <- rename(dcattle, "Country.Code"= "Area.Code" ) +dcattle2 + +# Joining the two datasets together +newcattle <- left_join(dcattle2, cgroups, by = "Country.Code" ) +newcattle + +# Grouping by value of cow milk and country group +newcattle1 <- newcattle %>% + group_by(Country.Group) %>% + summarise(Value) +newcattle1 + +# Creating plot of value of cow milk and country group +ggplot(newcattle1, aes(x = Country.Group, y = Value)) + + geom_line(color = "black") + + theme_minimal() + + theme( + plot.background = element_rect(fill = "lightyellow"), + panel.grid = element_line(color = "grey") + ) +``` diff --git a/Finalprojectpaper.rmd b/Finalprojectpaper.rmd new file mode 100644 index 00000000..802b6d18 --- /dev/null +++ b/Finalprojectpaper.rmd @@ -0,0 +1,379 @@ +--- +title: "Final Project/Paper" +description: "Final project topic covers academic factors that increase student GPA using a dataset developed by Yılmaz and Sekeroglu (2019)" +author: "Emily Duryea" +date: "2022-12-20" +output: distill::distill_article +--- + +# Academic Factors that Increase Student GPA + +##
Introduction + +There are numerous factors documented in the research literature that correlate with students' cumulative GPA, including classroom engagement, time management, motivation, class attendance, time spent studying, and group study sessions (Fokkens-Bruinsma et al., 2021; Büchele, 2021; Nelson, 2003; Thibodeaux et al., 2017; Vargas et al., 2018). The purpose of this research project was to further investigate these findings using a dataset collected by Yılmaz and Sekeroglu (2019). The dataset was originally used to classify student performance using artificial intelligence. For the purposes of this study, I used the dataset to examine what factors correlated with students' cumulative GPA. + +This research project examined the following three research questions: 1) Does classroom engagement (i.e., taking notes, attending class, listening) result in a higher GPA in university students?; 2) Does reported studying (i.e., weekly study hours) result in a higher GPA in university students?; and 3) Does collaboration between students (i.e., studying together, positive class discussions) result in a higher GPA in university students? + +For the first research question, it is reasonable to hypothesize that classroom engagement will have a positive effect on students' academic achievement. Previous research supports this hypothesis. For example, one study found that classroom engagement, as well as other related factors such as time management and autonomous motivation, are predictors of academic achievement (Fokkens-Bruinsma et al., 2021). Another study found that attendance in higher education is a small, but still statistically significant, predictor of academic performance (Büchele, 2021). In this study, classroom engagement will be defined as "taking notes, attendance, and frequency of listening." These measures will be reported by university students via survey.
+ +In regards to the second research question, it is hypothesized that students who study more will have a higher GPA. There are many previous studies that support this claim. For instance, one study found that university freshmen who studied more than eight hours a week saw an average increase in GPA of 0.580 (Nelson, 2003). Research has also found that increasing study time leads to an increased GPA (Thibodeaux, et al., 2017). In this study, hours spent studying will be measured through students' estimated range of hours studied, reported via survey. + +In response to the third research question, it is hypothesized that student collaboration will have a positive effect on student GPA. There is some research literature that supports this statement. One study found that students who study with their peers achieve significantly higher homework scores (Vargas, et al., 2018). Another study found that university students who had a strong social network and exhibited collaborative behaviors tended to achieve higher grades (Ellis & Han, 2021). Effective student collaboration can also occur during class time, such as through small group discussions. Research has found that students who participate in small group discussions demonstrate an increase in resilience, which has shown to improve academic performance (Torrento-estimo, et al, 2012). In this study, student collaboration will be measured through students' reported time spent studying with peers, and impact that their class discussions have. + +## Data + +### The Dataset + +The dataset that I used is a data set I found on the website Kaggle (link can be found here: https://www.kaggle.com/datasets/csafrit2/higher-education-students-performance-evaluation?resource=download). The dataset is a survey given to university students that collects demographic variables (e.g., age, job status, family background) and variables pertaining to their academic performance (e.g., time spent studying, class attendance, GPA). 
Below is the R code I used to read in the data set, as well as a summary of the data. + +```{r} +knitr::opts_chunk$set(echo = TRUE) +studentsurvey <- read.csv("_data/student_prediction.csv") +summary(studentsurvey) +``` + +As this dataset was used in a previous research study (Yılmaz & Sekeroglu, 2019), the data have already been de-identified, with number codes representing responses in order to keep participants' identities private. What I did want to change was the labeling of certain variables where a number code was unnecessary. For example, to represent gender, the researchers used the code "1" to represent "female" and "2" to represent "male." To make it less confusing when using the dataset, I chose to relabel these values. However, some I chose not to relabel - if the variable involved a range, the number placeholder was kept (e.g., for ages, the ranges were 1 = 18-21, 2 = 22-25, 3 = 26 or above). + +```{r} +library(tidyverse) +library(tidyr) +library(dplyr) + +studentsurvey$GENDER <- factor(studentsurvey$GENDER, + levels=c(1,2), + labels=c("female","male")) + +studentsurvey$HS_TYPE <- factor(studentsurvey$HS_TYPE, + levels=c(1,2,3), + labels=c("private","state", "other")) + +studentsurvey$SCHOLARSHIP <- factor(studentsurvey$SCHOLARSHIP, + levels=c(1,2,3,4,5), + labels=c("None","25%", "50%", "75%", "Full")) + +studentsurvey$WORK <- factor(studentsurvey$WORK, + levels=c(1,2), + labels=c("Yes","No")) + +studentsurvey$ACTIVITY <- factor(studentsurvey$ACTIVITY, + levels=c(1,2), + labels=c("Yes","No")) + +studentsurvey$PARTNER <- factor(studentsurvey$PARTNER, + levels=c(1,2), + labels=c("Yes","No")) + +studentsurvey$TRANSPORT <- factor(studentsurvey$TRANSPORT, + levels=c(1,2,3,4), + labels=c("Bus","Car/Taxi", "Bicycle", "Other")) + +studentsurvey$LIVING <- factor(studentsurvey$LIVING, + levels=c(1,2,3,4), + labels=c("Rental","Dorm", "With Family", "Other")) + +studentsurvey$MOTHER_EDU <- factor(studentsurvey$MOTHER_EDU, + levels=c(1,2,3,4,5,6), + labels=c("primary
school","secondary school", "high school", "university", "Masters", "Doctorate")) + +studentsurvey$FATHER_EDU <- factor(studentsurvey$FATHER_EDU, + levels=c(1,2,3,4,5,6), + labels=c("primary school","secondary school", "high school", "university", "Masters", "Doctorate")) + +studentsurvey$KIDS <- factor(studentsurvey$KIDS, + levels=c(1,2,3), + labels=c("Married","Divorced", "Died")) + +studentsurvey$MOTHER_JOB <- factor(studentsurvey$MOTHER_JOB, + levels=c(1,2,3,4,5,6), + labels=c("retired","housewife", "government officer", "private sector employee", "self-employment", "other")) + +studentsurvey$FATHER_JOB <- factor(studentsurvey$FATHER_JOB, + levels=c(1,2,3,4,5), + labels=c("retired", "government officer", "private sector employee", "self-employment", "other")) + +studentsurvey$READ_FREQ <- factor(studentsurvey$READ_FREQ, + levels=c(1,2,3), + labels=c("None","Sometimes", "Often")) + +studentsurvey$READ_FREQ_SCI <- factor(studentsurvey$READ_FREQ_SCI, + levels=c(1,2,3), + labels=c("None","Sometimes", "Often")) + +studentsurvey$ATTEND_DEPT <- factor(studentsurvey$ATTEND_DEPT, + levels=c(1,2), + labels=c("Yes","No")) + +studentsurvey$IMPACT <- factor(studentsurvey$IMPACT, + levels=c(1,2,3), + labels=c("Positive","Negative","Neutral")) + +studentsurvey$ATTEND <- factor(studentsurvey$ATTEND, + levels=c(1,2,3), + labels=c("always","sometimes","never")) + +studentsurvey$PREP_STUDY <- factor(studentsurvey$PREP_STUDY, + levels=c(1,2,3), + labels=c("alone","with friends","not applicable")) + +studentsurvey$PREP_EXAM <- factor(studentsurvey$PREP_EXAM, + levels=c(1,2,3), + labels=c("close to exam date","regularly throughout semester","never")) + +studentsurvey$NOTES <- factor(studentsurvey$NOTES, + levels=c(1,2,3), + labels=c("never","sometimes","always")) + +studentsurvey$LISTENS <- factor(studentsurvey$LISTENS, + levels=c(1,2,3), + labels=c("never","sometimes","always")) + +studentsurvey$LIKES_DISCUSS <- factor(studentsurvey$LIKES_DISCUSS, + levels=c(1,2,3), + 
labels=c("never","sometimes","always")) + +studentsurvey$CLASSROOM <- factor(studentsurvey$CLASSROOM, + levels=c(1,2,3), + labels=c("not useful","useful","not applicable")) +``` + +### Demographic Variables + +In order to gain information about the participant sample, I began by running some descriptive statistics with the sample's demographic variable. Below are some bar graphs (with the code needed to generate the graphs) pertaining to the demographics of the sample. + +```{r} +# Bar graph of sample gender +ggplot(studentsurvey, aes(x = GENDER)) + ggtitle("Sample Gender") + geom_bar() + +# Bar graph of sample high school type +ggplot(studentsurvey, aes(x = HS_TYPE)) + ggtitle("High School Type Sample Graduated From") + geom_bar() + +# Bar graph of sample scholarship received +ggplot(studentsurvey, aes(x = SCHOLARSHIP)) + ggtitle("Percentage of Scholarship Received") + geom_bar() + +# Bar graph of sample work status +ggplot(studentsurvey, aes(x = WORK)) + ggtitle("Sample Work Status") + geom_bar() +``` + +In this particular study, there were more male than female participants. Most students attended a state/public high school. Additionally, most students have received at least 50% scholarship at this university, indicating that many students at this particular university have received scholarships. Furthermore, most students do not have a job while they are studying at university in this sample. As the vast majority of students have scholarships, working a job during university may not be necessary. + +This sample may not be representative of the U.S. student population. There are more male than female students, which is not the case at most schools: there is about a 1:2 male to female ratio at U.S. colleges (Leukhina & Smaldone, 2022). Additionally, like in the sample, the vast majority of students attended public schools (Riser-Kositsky, 2022). 
With regard to scholarships, the students at this particular university receive scholarships at significantly higher rates than students in the rest of the U.S. Nationally, only about one in eight students receives a scholarship, and only 5% receive a full scholarship (Scholarship Statistics, 2021). While the enrollment statuses of the students were not given, if all students were full-time students, the low rate of employment in the sample would align with research showing that less than half of full-time students (40%) in U.S. universities work while in school. While this sample may not be entirely representative of the U.S. college student population, analyses of this dataset may provide some insight into factors that improve university students' GPA. + +## Visualization + +### Analysis + +For research question 1 -- the influence of classroom engagement on student GPA -- I chose to run a simple linear regression and a correlation test. I did also conduct a multiple regression analysis, but I preferred to separate the three variables within my definition of "classroom engagement" so I could analyze them individually. For research question 2, I again ran a simple linear regression and a correlation test, but because there was only one predictor variable (weekly study hours), a multiple regression analysis was not conducted. The same method of analysis for research question 1 was applied to research question 3 as well. + +### Research Question 1 + +#### Statistical Analyses + +```{r} +# In order to conduct proper analysis, the numeric values are needed, thus the dataset was reimported.
+studentsurvey <- read.csv("_data/student_prediction.csv") + +### Factor 1: Taking Notes and GPA ### + +# Simple linear regression (GPA is the outcome, matching the multiple regression below) +nfit <- lm(CUML_GPA ~ NOTES, data = studentsurvey) +summary(nfit) + +# Correlation test +cor.test(studentsurvey$NOTES, studentsurvey$CUML_GPA) + +### Factor 2: Class Attendance and GPA ### + +# Simple linear regression (GPA is the outcome) +afit <- lm(CUML_GPA ~ ATTEND, data = studentsurvey) +summary(afit) + +# Correlation test +cor.test(studentsurvey$ATTEND, studentsurvey$CUML_GPA) + +### Factor 3: Reported Listening and GPA ### + +# Simple linear regression (GPA is the outcome) +lfit <- lm(CUML_GPA ~ LISTENS, data = studentsurvey) +summary(lfit) + +# Correlation test +cor.test(studentsurvey$LISTENS, studentsurvey$CUML_GPA) + +### Multiple Regression: All Factors Combined and GPA ### +summary(lm(CUML_GPA ~ NOTES + ATTEND + LISTENS, data = studentsurvey)) +``` + +Three variables were classified as "classroom engagement": 1) taking notes, 2) class attendance, and 3) reported listening in class. The first variable, taking notes, did not have a statistically significant association with cumulative GPA. The p-value (0.08499) was greater than 0.05, indicating the result was not statistically significant. Additionally, the correlation coefficient was positive, but only slightly (0.1435413). The adjusted r squared also indicated a weak relationship (0.01376). + +The second variable, class attendance, was found to be statistically significant, as the p-value was less than 0.05 (0.0319). Students who always attended class tended to have higher GPAs than those who never attended class. This correlation is also slight, as indicated by the correlation coefficient (-0.1783047, negative because lower attendance codes mean more frequent attendance) and the adjusted r squared (0.02502). + +Students' reported listening during class was not a statistically significant predictor of GPA, with a p-value higher than 0.05 (0.5079). The correlation was also extremely slight, with a positive correlation coefficient of 0.05542742 and an adjusted r squared value of -0.003899.
+ +Finally, the multiple linear regression of all factors combined was not statistically significant (p = 0.07402). Thus, my hypothesis that classroom engagement would have a positive influence on GPA was not supported overall, although class attendance on its own was a significant predictor. + +#### Data Visualization + +```{r} +### Factor 1: Taking Notes and GPA ### +# Numeric key: 1 = never takes notes, 2 = sometimes takes notes, and 3 = always takes notes +ggplot(data = studentsurvey, aes(x = NOTES, y = CUML_GPA)) + + geom_point() + + geom_smooth(method = lm) + +### Factor 2: Class Attendance and GPA ### +# Numeric key: 1 = always attends class, 2 = sometimes attends class, 3 = never attends class +ggplot(data = studentsurvey, aes(x = ATTEND, y = CUML_GPA)) + + geom_point() + + geom_smooth(method = lm) + +### Factor 3: Reported Listening and GPA ### +# Numeric key: 1 = never listens to class lectures, 2 = sometimes listens to class lectures, 3 = always listens to class lectures +ggplot(data = studentsurvey, aes(x = LISTENS, y = CUML_GPA)) + + geom_point() + + geom_smooth(method = lm) +``` + +### Research Question 2 + +#### Statistical Analyses + +```{r} +# Simple linear regression (GPA is the outcome) +shfit <- lm(CUML_GPA ~ STUDY_HRS, data = studentsurvey) +summary(shfit) + +# Correlation test +cor.test(studentsurvey$STUDY_HRS, studentsurvey$CUML_GPA) +``` + +As with my previous research question, I chose to run a simple linear regression and a correlation test to analyze the data. The results indicated that hours spent studying had very little relationship with cumulative GPA. The p-value was greater than 0.05 (0.9225), and both the correlation coefficient, although positive, and the adjusted r squared value were extremely small (0.008144991 and -0.006926). Thus, my hypothesis was not supported.
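Because each research question pairs a simple linear regression with a correlation test, it may help to see how the two relate. The following is a minimal sketch using R's built-in `mtcars` data rather than the survey data: with a single predictor, the regression's R-squared equals the squared Pearson correlation, and the slope's p-value matches the p-value from `cor.test()`, whichever variable is placed on the left of the formula.

```r
# Fit the same pair of variables in both directions
fit_xy <- lm(mpg ~ wt, data = mtcars)
fit_yx <- lm(wt ~ mpg, data = mtcars)

# Pearson correlation test on the same pair
ct <- cor.test(mtcars$mpg, mtcars$wt)

# Extract R-squared and the slope p-value from each fit
r_squared_xy <- summary(fit_xy)$r.squared
r_squared_yx <- summary(fit_yx)$r.squared
p_slope_xy <- summary(fit_xy)$coefficients["wt", "Pr(>|t|)"]
p_slope_yx <- summary(fit_yx)$coefficients["mpg", "Pr(>|t|)"]

# Both comparisons should agree up to floating point tolerance
all.equal(r_squared_xy, unname(ct$estimate)^2)
all.equal(p_slope_xy, ct$p.value)
```

The direction of the formula still matters for interpreting the slope and for prediction, so it is usually clearer to put the outcome of interest on the left-hand side.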
+ +#### Data Visualization + +```{r} +# Numeric key: 1 = 0 hours per week, 2 = <5 hours, 3 = 6-10 hours, 4 = 11-20 hours, 5 = more than 20 hours +ggplot(data = studentsurvey, aes(x = STUDY_HRS, y = CUML_GPA)) + + geom_point() + + geom_smooth(method = lm) +``` + +### Research Question 3 + +#### Statistical Analyses + +```{r} +### Factor 1: Peer Study Groups and GPA ### + +# Dividing into whether or not students study with peers +studentsurvey$PREP_STUDY <- ifelse(studentsurvey$PREP_STUDY==2, 2, 1) + +# Simple linear regression (GPA is the outcome) +spfit <- lm(CUML_GPA ~ PREP_STUDY, data = studentsurvey) +summary(spfit) + +# Correlation test +cor.test(studentsurvey$PREP_STUDY, studentsurvey$CUML_GPA) + +### Factor 2: Positive Class Discussions and GPA ### + +# Dividing into whether or not students enjoy class discussions +# (the result is stored in LIKES_DISCUSS so that PREP_STUDY is not overwritten) +studentsurvey$LIKES_DISCUSS <- ifelse(studentsurvey$LIKES_DISCUSS==1, 1, 2) + +# Simple linear regression (GPA is the outcome) +ldfit <- lm(CUML_GPA ~ LIKES_DISCUSS, data = studentsurvey) +summary(ldfit) + +# Correlation test +cor.test(studentsurvey$LIKES_DISCUSS, studentsurvey$CUML_GPA) + +### Multiple Regression: Both Factors and GPA ### +summary(lm(CUML_GPA ~ PREP_STUDY + LIKES_DISCUSS, data = studentsurvey)) +``` + +Students who study with their peers are more likely to have higher GPAs, according to the simple linear regression and correlation test. The p-value was less than 0.05 (0.01535). However, the correlation was not especially high (0.2009882) and neither was the adjusted R-squared value (0.03369). That being said, the results were statistically significant. + +Additionally, students who found class discussions helpful to their education and learning (always or some of the time) were significantly more likely to have higher GPAs than those who did not find class discussions to be a positive experience. The p-value was less than 0.01 (0.007804). Again, neither the correlation (0.2201251) nor the adjusted R-squared (0.0418) was large. 
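One way to keep recodes like these from overwriting the original responses is to store each two-level version in a new column. The following is a minimal sketch with made-up responses: the data frame `toy_rq3` and the `_BIN` column names are hypothetical, and the cut points simply mirror the `ifelse()` calls above.

```{r}
# Made-up responses for illustration only
toy_rq3 <- data.frame(
  PREP_STUDY    = c(1, 2, 3, 2, 1),
  LIKES_DISCUSS = c(1, 2, 3, 3, 1)
)

# Two-level versions stored alongside the original codes
toy_rq3$PREP_STUDY_BIN    <- ifelse(toy_rq3$PREP_STUDY == 2, 2, 1)
toy_rq3$LIKES_DISCUSS_BIN <- ifelse(toy_rq3$LIKES_DISCUSS == 1, 1, 2)

# Cross-tabulating old against new codes is a quick check on the recode
table(original = toy_rq3$PREP_STUDY, recoded = toy_rq3$PREP_STUDY_BIN)
toy_rq3
```

Keeping both versions makes it possible to rerun the chunk without the second run recoding values that were already recoded.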
+ +The multiple regression analysis also found the two variables combined to be statistically significant (p = 0.01666). Thus, it could be concluded that collaboration has a positive relationship with GPA, supporting my hypothesis. + +#### Data Visualization + +```{r} +### Factor 1: Peer Study Groups and GPA ### +# Numeric key: 1 = does not study with peers, 2 = studies with peers +ggplot(data = studentsurvey, aes(x = PREP_STUDY, y = CUML_GPA)) + + geom_point() + + geom_smooth(method = lm) + +### Factor 2: Positive Class Discussions and GPA ### +# Numeric key: 1 = does not enjoy class discussions, 2 = enjoys class discussions +ggplot(data = studentsurvey, aes(x = LIKES_DISCUSS, y = CUML_GPA)) + + geom_point() + + geom_smooth(method = lm) +``` + +## Reflection + +To be honest, this project was a huge challenge for me, even though the dataset I chose seems relatively simple. As someone who is interested in education (I used to be in school to become a school psychologist), I decided to conduct a research project relating to education. I searched for datasets relating to education on numerous sites, but the one that stood out to me was the one I found on Kaggle. It was so well-organized, with an outstanding key, that it was possible for me to wrap my head around it. + +The next part was cleaning it. I wanted the variables to be in word format rather than assigned numbers so they were easier to read. It took a lot of research to figure out how to do it, but once I got it, it was simple. Unfortunately, when it came to data analysis, numeric values were needed to conduct proper analyses, so I had to convert the data back before analyzing it. I must admit that I was pretty disappointed that my work seemed pointless, but running tests with the code I learned from classes was pretty easy. It was interesting to combine different factors to see how they affected my dependent variable (student cumulative GPA). 
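A labelled factor can hold both forms at once, which may avoid converting back and forth. The sketch below is illustrative only: the vector `notes_code` is made up, and the labels follow the numeric key used for the note-taking plot (1 = never, 2 = sometimes, 3 = always).

```{r}
# Made-up codes for illustration only
notes_code <- c(1, 3, 2, 3, 1)

# Readable labels for tables and plots
notes_label <- factor(notes_code, levels = 1:3,
                      labels = c("Never", "Sometimes", "Always"))
table(notes_label)

# The numeric codes are still recoverable for regression or correlation
as.integer(notes_label)
```

With this approach the word labels can be used for display while `as.integer()` supplies the numbers when a test needs them.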
+ +If I were to continue with the project, I'd like to conduct my own study to see if my hypothesis that student collaboration improves student GPA holds with other samples. A potential study could compare the academic performance of students in a class that is primarily discussion-based vs. a class that is lecture-based. I would like to get survey data from students, too, asking about their experiences in the classes. My hypothesis would be that students in discussion-based classes perform better than those in lecture-based classes, and that those students enjoy their discussion-based classes more. + +## Conclusion + +The hypothesis that classroom engagement would have a positive influence on GPA would be rejected. It may be because of the quality of students' time spent in and outside the classroom. However, it appears that attending class is very important for students' GPA. Additionally, the quality of notes and active listening may be more important than the quantity. Further research is needed with students' classroom habits to confirm. The hypothesis that more hours studying would have a positive influence on GPA would be rejected. Again, it may be an issue with quality rather than quantity. Perhaps overstudying is a problem, or students spend a lot of that time actually distracted. More research into students' study habits would be needed. It could be concluded that collaboration has a positive relationship with GPA, supporting the hypothesis. As university classes move to a more collaborative format, encouraging peer study groups and ensuring that classroom discussions are a positive experience for students may help with both their social and academic skills. + +## References + +Büchele, S. (2021). Evaluating the link between attendance and performance in higher education:  + +the role of classroom engagement dimensions. Assessment & Evaluation in Higher Education, 46(1), 132-150. + +Ellis, R., & Han, F. (2021). 
Assessing university student collaboration in new ways. Assessment  + +& Evaluation in Higher Education, 46(4), 509-524. + +Fokkens-Bruinsma, M., Vermue, C., Deinum, J. F., & van Rooij, E. (2021). First-year  + +academic achievement: the role of academic self-efficacy, self-regulated learning and beyond classroom engagement. Assessment & Evaluation in Higher Education, 46(7), 1115-1126. + +Hanson, M. (2022, July 26). College Enrollment & Student Demographic Statistics.  + +EducationData.org. Retrieved from . + +Leukhina, O., & Smaldone, A. (2022, March 14). Why do women outnumber men in college  + +enrollment? Saint Louis Fed Eagle. Retrieved from . + +National Center for Education Statistics. (2022, May). College Student Employment. Coe -  + +college student employment. Retrieved from . + +Nelson, R. (2003). Student Efficiency: A study on the behavior and productive efficiency of  + +college students and the determinants of GPA. Issues in Political Economy, 12, 32-43. + +Riser-Kositsky, M. (2022, August 2). Education statistics: Facts about American Schools.  + +Education Week. Retrieved from . + +Scholarship statistics. ThinkImpact.com. (2021, November 10). Retrieved from  + +. + +Thibodeaux, J., Deutsch, A., Kitsantas, A., & Winsler, A. (2017). First-year college students\'  + +time use: Relations with self-regulation and GPA. Journal of Advanced Academics, 28(1), 5-27. + +Torrento-estimo, E., Lourdes, C., & Evidente, L. G. (2012). Collaborative Learning in Small  + +Group Discussions and Its Impact on Resilience Quotient and Academic Performance. JPAIR Multidisciplinary Research Journal, 7(1), 1-1. + +Vargas, D. L., Bridgeman, A. M., Schmidt, D. R., Kohl, P. B., Wilcox, B. R., & Carr, L. D.  + +(2018). Correlation between student collaboration network centrality and academic performance. Physical Review Physics Education Research, 14(2), 020112. + +Yılmaz, N., & Sekeroglu, B. (2019, August). 
Student Performance Classification Using Artificial  + +Intelligence Techniques. In International Conference on Theory and Application of Soft Computing, Computing with Words and Perceptions (pp. 596-603). Springer, Cham. + +\ diff --git a/Homework3.qmd b/Homework3.qmd new file mode 100644 index 00000000..ea0a2d6d --- /dev/null +++ b/Homework3.qmd @@ -0,0 +1,192 @@ +--- + title: "Homework 3 - Emily Duryea" +author: "Emily Duryea" +description: "The third homework assignment for DACSS 603" +date: "10/31/2022" +format: + html: + toc: true +code-fold: true +code-copy: true +code-tools: true +categories: +- hw3 +- Emily Duryea +--- + +# Homework 3 + +## Question 1 + +United Nations (Data file: UN11 in alr4) + +The data in the file UN11 contains several variables, including ppgdp, the gross national product per person in U.S. dollars, and fertility, the birth rate per 1000 females, both from the year 2009. The data are for 199 localities, mostly UN member countries, but also other areas such as Hong Kong that are not independent countries. The data were collected from the United Nations (2011). We will study the dependence of fertility on ppgdp. + +### 1.1.1 + +Question: Identify the predictor and the response. + +Answer: The predictor is ppgdp, and the response is fertility, since we are looking at how ppgdp (the independent variable) affects fertility (the dependent variable). + +### 1.1.2 + +Question: Draw the scatterplot of fertility on the vertical axis versus ppgdp on the horizontal axis and summarize the information in this graph. Does a straight-line mean function seem to be plausible for a summary of this graph? 
+ +```{r} +# Importing needed libraries +library(tidyverse) +library(ggplot2) +library(alr4) +library(smss) + +# Importing the UN11 dataset +data(UN11) + +# Creating a scatterplot +ggplot(data = UN11, aes(x = ppgdp, y = fertility)) + + geom_point(color = 'black') + + labs(title = "PPGDP and Fertility") +``` + +Answer: This graph does not look like it could be represented by a linear function. Rather, it looks like it would be represented by a nonlinear (curvilinear) function. + +### 1.1.3 + +Question: Draw the scatterplot of log(fertility) versus log(ppgdp) using natural logarithms. Does the simple linear regression model seem plausible for a summary of this graph? If you use a different base of logarithms, the shape of the graph won't change, but the values on the axes will change. + +```{r} +# Creating a scatterplot of the log-transformed variables +ggplot(data = UN11, aes(x = log(ppgdp), y = log(fertility))) + + geom_point(color = 'black') + + geom_smooth(method = lm) + + labs(title = "log(PPGDP) and log(Fertility)") +``` + +Answer: After taking the logarithm of each variable, based on the graph, it is now plausible to use a simple linear regression. + +## Question 2 + +Annual income, in dollars, is an explanatory variable in a regression analysis. For a British version of the report on the analysis, all responses are converted to British pounds sterling (1 pound equals about 1.33 dollars, as of 2016). + +### Part A + +Question: How, if at all, does the slope of the prediction equation change? 
+ +```{r} +# Creating a variable for the British pound +# (1 pound is about 1.33 dollars, so pounds = dollars / 1.33) +UN11$Britishpound <- UN11$ppgdp / 1.33 + +# Examining the slope +summary(lm(fertility ~ Britishpound, UN11)) +ggplot(data = UN11, aes(x = log(Britishpound), y = log(fertility))) + + geom_point(color = 'black') + + geom_smooth(method = lm) + + labs(title = "British Pound and Fertility") + +# Comparing the slope +summary(lm(fertility ~ ppgdp, UN11)) +``` + +Answer: The slope is rescaled by the conversion factor. Because a value in pounds is the dollar value divided by 1.33, the slope for the pound-denominated predictor is 1.33 times the slope for the dollar-denominated predictor. The fit itself does not change: according to the results of the summary function, the adjusted R-squared is the same for both (0.1895). + +### Part B + +Question: How, if at all, does the correlation change? + +```{r} +# Finding the correlation with US dollars +cor(UN11$ppgdp, UN11$fertility) + +# Finding the correlation with British pounds +cor(UN11$Britishpound, UN11$fertility) +``` + +Answer: The correlations of fertility with US dollars and with British pounds are the same. Converting dollars to pounds only rescales the values by a positive constant (1.33), and correlation does not depend on the units of measurement. + +## Question 3 + +Water runoff in the Sierras (Data file: water in alr4) + +Can Southern California's water supply in future years be predicted from past data? One factor affecting water availability is stream runoff. If runoff could be predicted, engineers, planners, and policy makers could do their jobs more efficiently. The data file contains 43 years' worth of precipitation measurements taken at six sites in the Sierra Nevada mountains (labeled APMAM, APSAB, APSLAKE, OPBPC, OPRC, and OPSLAKE) and stream runoff volume at a site near Bishop, California, labeled BSAAM. Draw the scatterplot matrix for these data and summarize the information available from these plots. (Hint: Use the pairs() function.) 
+ +```{r} +# Loading dataset +data(water) + +# Creating scatterplots +pairs(water) + +# Conducting regression analysis +water1 <- lm(BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE, data = water) +summary(water1) +``` + +Answer: This graph does not look like it could be represented by a linear function. Rather, it looks like it would be represented by a nonlinear (curvilinear) function. + +## Question 4 + +Professor ratings (Data file: Rateprof in alr4) + +In the website and online forum RateMyProfessors.com, students rate and comment on their instructors. Launched in 1999, the site includes millions of ratings on thousands of instructors. The data file includes the summaries of the ratings of 364 instructors at a large campus in the Midwest (Bleske-Rechek and Fritsch, 2011). Each instructor included in the data had at least 10 ratings over a several year period. Students provided ratings of 1--5 on quality, helpfulness, clarity, easiness of instructor's courses, and raterInterest in the subject matter covered in the instructor's courses. The data file provides the averages of these five ratings. Create a scatterplot matrix of these five variables. Provide a brief description of the relationships between the five ratings. + +```{r} +# Importing dataset +data(Rateprof) + +# Creating a subset of the dataset with the five variables of interest +Rateprof5 <- Rateprof %>% select(quality, helpfulness, clarity, easiness, raterInterest) + +# Creating the scatterplots +pairs(Rateprof5) +``` + +Answer: All 5 of the variables of interest are positively correlated with one another. However, some relationships are stronger than others. Quality, helpfulness, and clarity have strong positive relationships with each other, while the relationships involving easiness and raterInterest are much weaker. 
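A correlation matrix can put numbers on what a scatterplot matrix shows. As a hedged sketch, the built-in `mtcars` data stands in for the five rating columns here (an assumption made only so the example runs without `alr4`); the same `cor()` call should work on a data frame of numeric columns such as `Rateprof5`.

```{r}
# Stand-in data: five numeric columns from a built-in dataset
five_vars <- mtcars[, c("mpg", "disp", "hp", "drat", "wt")]

# Pairwise Pearson correlations, rounded for readability
cor_matrix <- round(cor(five_vars), 2)
cor_matrix
```

Reading the matrix alongside the `pairs()` output makes it easier to say which relationships are strong and which are weak.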
+ +## Question 5 + +For the student.survey data file in the smss package, conduct regression analyses relating (by convention, y denotes the outcome variable, x denotes the explanatory variable) (i) y = political ideology and x = religiosity, (ii) y = high school GPA and x = hours of TV watching. (You can use ?student.survey in the R console, after loading the package, to see what each variable means.) + +### Part A + +Question: Graphically portray how the explanatory variable relates to the outcome variable in each of the two cases. + +```{r} +# Importing dataset +data(student.survey) +studentsurvey <- student.survey + +# Creating subset of data with variables needed +studentsurvey <- studentsurvey %>% + select(hi, tv, pi, re) + +# Creating a plot to compare political ideology (pi) with religious service attendance (re) +plot(pi ~ re, data = studentsurvey) + +# Creating a plot comparing high school GPA (hi) and average number of hours of TV watched per week (tv) +ggplot(data = studentsurvey, aes(x = tv, y = hi)) + + geom_point() + + geom_smooth(method = lm) +``` + +Answer: Based on the plots generated, religious service attendance is correlated with conservatism, and hours of TV watched per week has a negative relationship with high school GPA. + +### Part B + +Question: Summarize and interpret results of inferential analyses. + +```{r} +# Changing the pi variable to a numeric one +studentsurvey$pi <- as.numeric(studentsurvey$pi) + +# Removing ordering from the re variable +levels(studentsurvey$re) <- c("N", "O", "M", "E") +studentsurvey$re <- factor(studentsurvey$re, ordered = FALSE) + +# Conducting regression analyses for pi and re +summary(lm(pi ~ re, studentsurvey)) + +# Conducting regression analyses for hi and tv +summary(lm(hi ~ tv, studentsurvey)) +``` + +Answer: According to this dataset, people who attended religious services most weeks or every week are significantly more likely to report as conservative (p \< 0.001). 
Additionally, people who watch fewer hours of TV are significantly more likely to have a higher GPA (p \< 0.05). diff --git a/challenge1.qmd b/challenge1.qmd new file mode 100644 index 00000000..9d9a2a62 --- /dev/null +++ b/challenge1.qmd @@ -0,0 +1,69 @@ +--- +title: "Challenge 1" +author: "Emily Duryea" +description: "Reading in data and creating a post" +date: "11/25/2022" +format: + html: + toc: true + code-fold: true + code-copy: true + code-tools: true +categories: + - challenge_1 + - birds +--- + +```{r} +#| label: setup +#| warning: false +#| message: false + +library(tidyverse) + +knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) +``` + +## Challenge Overview + +Today's challenge is to + +1) read in a dataset, and + +2) describe the dataset using both words and any supporting information (e.g., tables, etc) + +## Read in the Data + +Read in one (or more) of the following data sets, using the correct R package and command. + +- railroad_2012_clean_county.csv ⭐ +- birds.csv ⭐⭐ +- FAOstat\*.csv ⭐⭐ +- wild_bird_data.xlsx ⭐⭐⭐ +- StateCounty2012.xls ⭐⭐⭐⭐ + +Find the `_data` folder, located inside the `posts` folder. Then you can read in the data, using either one of the `readr` standard tidy read commands, or a specialized package such as `readxl`. + +```{r} +# Importing the data file +library(readr) +birds <- read_csv("_data/birds.csv") +View(birds) +``` + +Add any comments or documentation as needed. More challenging data sets may require additional code chunks and documentation. + +## Describe the data + +Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data). 
+ +```{r} +#| label: summary +summary(birds) +count(birds, Item) +count(birds, Area) +``` + +This dataset includes 30,977 rows, with 14 columns. It contains data on 5 categories of birds from 248 countries. Of those rows, 13,074 are for chickens, 6,909 are for ducks, 5,693 are for turkeys, 4,136 are for geese and guinea fowls, and 1,165 are for pigeons and other birds. Some countries account for a large number of those entries (e.g., France, Egypt, and Greece, with 290 each), while others have very few (e.g., Luxembourg with 19, Montenegro with 13, and Sudan with 7). + + diff --git a/challenge2.qmd b/challenge2.qmd new file mode 100644 index 00000000..2e00f8c3 --- /dev/null +++ b/challenge2.qmd @@ -0,0 +1,86 @@ +--- +title: "Challenge 2" +author: "Emily Duryea" +description: "Data wrangling: using group_by() and summarise()" +date: "11/25/2022" +format: + html: + toc: true + code-fold: true + code-copy: true + code-tools: true +categories: + - challenge_2 + - railroad +--- + +```{r} +#| label: setup +#| warning: false +#| message: false + +library(tidyverse) + +knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) +``` + +## Challenge Overview + +Today's challenge is to + +1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc) +2) provide summary statistics for different interesting groups within the data, and interpret those statistics + +## Read in the Data + +Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command. + +- railroad\*.csv or StateCounty2012.xls ⭐ +- FAOstat\*.csv or birds.csv ⭐⭐⭐ +- hotel_bookings.csv ⭐⭐⭐⭐ + +```{r} +# Importing the data file +library(readr) +railroad <- read_csv("_data/railroad_2012_clean_county.csv") +View(railroad) +``` + +Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation. 
+ +## Describe the data + +Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data). + +```{r} +#| label: summary +summary(railroad) +dim(railroad) +str(railroad) +mean(railroad$total_employees) +min(railroad$total_employees) +max(railroad$total_employees) +median(railroad$total_employees) +count(railroad, state) +count(railroad, county) +``` + +This dataset includes 2,930 rows, with 3 columns. It contains data from 53 states & territories and 1,709 distinct county names. The mean number of railroad employees per county in this dataset was 87.17816. The minimum number of employees in any county in this data set is 1, and the max is 8,207. The median is 21. These results suggest that there are some major outliers that have increased the mean, since the median is much lower than the mean, and the maximum is an extremely large value. This maximum is located in Cook County, Illinois. + +## Provide Grouped Summary Statistics + +Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set. + +```{r} +# Finding the central tendency and dispersion of total employees by state +railroad %>% + select(state, total_employees)%>% + group_by(state) %>% + summarize(mean(total_employees), median(total_employees), sd(total_employees)) +``` + +### Explain and Interpret + +Be sure to explain why you chose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included. 
+ +I chose to examine the central tendency of the total number of railroad employees per county, grouped by state. I was curious to see how states varied by total employees. The average appears to be higher in states with higher populations. For example, California, which has a high population, has an average of 238 employees, while a state like Maine, with a lower population, has an average of 40 employees. It would be interesting to see if this hypothesis would be correct in further analyses. \ No newline at end of file diff --git a/challenge3.qmd b/challenge3.qmd new file mode 100644 index 00000000..6d93fb58 --- /dev/null +++ b/challenge3.qmd @@ -0,0 +1,63 @@ +--- +title: "Challenge 3" +author: "Emily Duryea" +description: "Challenge 3" +date: "12/20/2022" +format: + html: + toc: true + code-fold: true + code-copy: true + code-tools: true +categories: + - challenge_3 +--- + +# Challenge 3 + +```{r} +#| label: setup +#| warning: false +#| message: false + +library(tidyverse) + +knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) +``` + +## Read in Data + +```{r} +aw <- read.csv("_data/animal_weight.csv") +aw +``` + +The dataset, which I chose to label as "aw" as short for "animal weights," contains data on different animal weights from different regions of the world. The animals include: 1) dairy cattle, 2) non-dairy cattle, 3) buffaloes, 4) market swine, 5) breeding swine, 6) chickens (broilers), 7) chickens (layers), 8) ducks, 9) turkeys, 10) sheep, 11) goats, 12) horses, 13) asses, 14) mules, 15) camels, and 16) llamas. The animals are listed in columns. The regions of the animals are in rows, and are as follows: 1) Indian subcontinent, 2) Eastern Europe, 3) Africa, 4) Oceania, 5) Western Europe, 6) Latin America, 7) Asia, 8) Middle East, and 9) North America. The values in the rows and columns are the animal weights by region. 
+ +## Finding the Dimensions + +```{r} +# Getting the number of rows +nrow(aw) + +# Getting the number of columns +ncol(aw) + +# Calculating the expected number of total cases (rows times animal columns, excluding the region column) +nrow(aw) * (ncol(aw)-1) + +# Calculating the expected number of columns (region, animal, and weight) +1+1+1 +``` + +The current dataset has 9 rows and 16 animal weight columns (plus one region column), so the pivoted data is anticipated to have 9 x 16 = 144 cases. + +## Pivot the Data + +```{r} +pivot_longer(aw, "Cattle...dairy":"Llamas", + names_to="animal", + values_to = "weights") +``` + +After pivoting the data, there are 3 columns with 144 rows, as anticipated by the calculations. diff --git a/challenge4.qmd b/challenge4.qmd new file mode 100644 index 00000000..1dcbdb6d --- /dev/null +++ b/challenge4.qmd @@ -0,0 +1,68 @@ +--- +title: "Challenge 4" +author: "Emily Duryea" +description: "Challenge 4" +date: "12/20/2022" +format: + html: + toc: true + code-fold: true + code-copy: true + code-tools: true +categories: + - challenge_4 +--- + +# Challenge 4 + +```{r} +#| label: setup +#| warning: false +#| message: false + +library(tidyverse) + +knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) +``` + +## Read in Data + +```{r} +abc <- read.csv("_data/abc_poll_2021.csv") +abc + +# summary of data +summary(abc) + +``` + +This dataset appears to be an ABC Poll, a survey conducted with 527 respondents (the number of rows). This survey contains 15 demographic variables, 10 political attitudes questions, and 5 survey administration variables. 
+ +## Cleaning the Data + +```{r} +# Renaming variables +abc <- rename(abc, language = xspanish, age = ppage, education5 = ppeduc5, education = ppeducat, gender = ppgender, ethnicity = ppethm, household_size = pphhsize, income = ppinc7, marital_status = ppmarit5, region = ppreg4, rent = pprent, state = ppstaten, work = PPWORKA, employment = ppemploy) +abc <- abc %>% + mutate(ethnicity = str_remove(ethnicity, ", Non-Hispanic")) + +# Setting skipped responses to NA +abc <- abc %>% + mutate(across(starts_with("Q"), ~ na_if(.x, "Skipped"))) +``` + +## Mutating Variables + +```{r} +# Mutating Party ID +abc <- abc %>% + mutate(QPID = fct_recode(QPID, "dem" = "A Democrat", + "rep" = "A Republican", + "ind" = "An Independent", + "na" = "Skipped", + "other" = "Something else")) %>% + mutate(QPID = fct_relevel(QPID, "dem", "ind", "rep","other", "na")) + +ggplot(abc, aes(QPID)) + geom_bar() + +``` diff --git a/challenge5.qmd b/challenge5.qmd new file mode 100644 index 00000000..f82d43fa --- /dev/null +++ b/challenge5.qmd @@ -0,0 +1,119 @@ +--- +title: "Challenge 5" +author: "Emily Duryea" +description: "Challenge 5" +date: "12/20/2022" +format: + html: + toc: true + code-fold: true + code-copy: true + code-tools: true +categories: + - challenge_5 +--- + +# Challenge 5 + +```{r} +#| label: setup +#| warning: false +#| message: false + +library(tidyverse) +library(ggplot2) + +knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) +``` + +## Read in Data + +```{r} +# Reading in the data +cereal <- read.csv("_data/cereal.csv") +cereal +``` + +The dataset contains four columns: cereal (name), amount of sodium per serving of cereal (in mg), amount of sugar (in g), and the type of cereal ("A" for adult cereal and "C" for children's cereal). There are twenty cereals, which make up the rows. 
+ +## Tidying the Data + +```{r} +# Renaming columns +cereal_tidy <- cereal %>% + rename(cereal_name = Cereal, sugar=Sugar, sodium=Sodium, type=Type) +cereal_tidy +``` + +## Mutating Variables + +```{r} +# Converting sodium from mg to grams to match the measurement of sugar (in grams) +cereal_sodium <- cereal_tidy %>% + mutate(sodium = sodium/1000) +cereal_sodium +``` + +## Pivoting the Data + +```{r} +# Pivoting so that sodium and sugar share one column of amounts +cerealg <- cereal_sodium %>% + pivot_longer(cols = c("sodium", "sugar"), + names_to="Sodium_Sugar", + values_to="Amount") +cerealg +``` + +## Univariate Visualization + +```{r} +# The visualization orders the cereals from the most sodium to the least +cereal_sodium %>% + arrange(sodium) %>% + mutate(cereal_name=factor(cereal_name, levels=cereal_name)) %>% + ggplot(aes(x=cereal_name, y=sodium)) + + geom_segment(aes(xend=cereal_name, yend=0), color="green") + + geom_point(colour="orange", size=2, alpha=0.5)+ + coord_flip() +``` + +The visualization shows that Raisin Bran has the most sodium of the twenty cereals in the dataset, with Frosted Mini Wheats having the least. + +```{r} +# The visualization orders the cereals from the most sugar to the least +cereal_sodium %>% + arrange(sugar) %>% + mutate(cereal_name=factor(cereal_name, levels=cereal_name)) %>% + ggplot(aes(x=cereal_name, y=sugar)) + + geom_bar(stat="identity") + + coord_flip() +``` + +The visualization demonstrates that, once again, Raisin Bran has the most sugar, whereas Fiber One has the least. + +## Bivariate Visualization + +For this visualization, I am looking at whether the amount of sodium/sugar plays a role in whether the cereal is classified as an adult or children's cereal. 
I hypothesize that cereals with more sugar/sodium will be classified as children's cereal, as children tend to have strong sugar/sodium cravings, while adult cereal tends to be marketed as "healthier" and its sugar/sodium content is monitored by adults trying to be more health conscious. + +```{r} +# Looking at whether the amount of sodium/sugar plays a role in whether a cereal is classified as an adult or children's cereal + +# Cereal sugar content & type of cereal +cereal_sodium %>% + arrange(sugar) %>% + mutate(cereal_name=factor(cereal_name, levels=cereal_name)) %>% + ggplot(aes(x=cereal_name, y=sugar, fill=type)) + + geom_bar(stat="identity") + + coord_flip() + +# Cereal sodium content & type of cereal +cereal_sodium %>% + arrange(sodium) %>% + mutate(cereal_name=factor(cereal_name, levels=cereal_name)) %>% + ggplot(aes(x=cereal_name, y=sodium, fill=type)) + + geom_bar(stat="identity") + + coord_flip() +``` + +Based on these visualizations, it would seem that adult cereals actually have higher sugar content than children's cereals. The cereals with the highest sugar content are all classified as adult cereals (Raisin Bran, Crackling Oat Bran, and Honey Smacks). Sodium appears to be a toss-up, with the highest sodium contents alternating between adult and children's cereals. Thus, my hypothesis that adult cereal would have less sugar and sodium would be refuted. 
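A grouped summary is one way to check a visual impression like this. The sketch below uses invented numbers (the data frame `toy_cereal` and its values are hypothetical, not the cereal data) and base R's `aggregate()` to average sugar and sodium by cereal type.

```{r}
# Invented values for illustration only
toy_cereal <- data.frame(
  type   = c("A", "A", "A", "C", "C", "C"),
  sugar  = c(18, 7, 5, 15, 12, 9),
  sodium = c(0.34, 0.26, 0.14, 0.18, 0.20, 0.21)
)

# Mean sugar (g) and sodium (g) for adult ("A") and children's ("C") cereals
type_means <- aggregate(cbind(sugar, sodium) ~ type, data = toy_cereal, FUN = mean)
type_means
```

A few very sweet cereals can dominate the top of a bar chart, so comparing group means (or medians) is a useful complement to the plots.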
diff --git a/challenge6.qmd b/challenge6.qmd new file mode 100644 index 00000000..f7b59ab4 --- /dev/null +++ b/challenge6.qmd @@ -0,0 +1,96 @@ +--- +title: "Challenge 6" +author: "Emily Duryea" +description: "Challenge 6" +date: "12/20/2022" +format: + html: + toc: true + code-fold: true + code-copy: true + code-tools: true +categories: + - challenge_6 +--- + +# Challenge 6 + +```{r} +#| label: setup +#| warning: false +#| message: false + +library(tidyverse) +library(ggplot2) +library(lubridate) +library(readxl) +library(viridis) +library(hrbrthemes) +library(plotly) + +knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) +``` + +## Reading in Data + +```{r} +# Reading in dataset +debt <- read_excel("debt_in_trillions.xlsx", skip = 1, col_names = c("Year_Quarter", "Mortgage", "HE_Revolving", "Auto_Loan", "Credit_Card", "Student_Loan", "Other", "Total")) +debt +``` + +This dataset contains debt (in trillions) from 2003 to 2021 (in yearly quarters) for six different categories: 1) mortgage, 2) home equity revolving debt, 3) auto loan, 4) credit card, 5) student loan, and 6) miscellaneous debts. + +## Tidying the Data + +```{r} +# Separating year and quarters +debt_new <- debt %>% + separate("Year_Quarter", c("Year","Quarter"), sep = ":") +view(debt_new) +``` + +To tidy the data, I separated the yearly quarters into \"years\" and \"quarters.\" 
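For plotting over time, the two text columns can also be combined into a single numeric position on the time axis. This is a minimal sketch, assuming labels of the form `03:Q1` (two-digit year, colon, quarter), which is what the `sep = ":"` argument suggests; the vector `yq_label` is made up for illustration.

```{r}
# Made-up labels in the assumed "YY:Qn" form
yq_label <- c("03:Q1", "03:Q4", "21:Q2")

# Splitting each label into its year and quarter parts
parts   <- strsplit(yq_label, ":", fixed = TRUE)
year    <- 2000 + as.integer(sapply(parts, `[`, 1))
quarter <- as.integer(sub("Q", "", sapply(parts, `[`, 2)))

# Fractional year: Q1 -> .00, Q2 -> .25, Q3 -> .50, Q4 -> .75
time_index <- year + (quarter - 1) / 4
time_index
```

A numeric index like this keeps the quarters in chronological order on a continuous axis, whereas a character `Year` column is treated as discrete categories.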
+ +## Time Dependent Visualization + +```{r} +# Column names are passed unquoted so that ggplot maps the data, not a constant string +debt_plot <- debt_new %>% + ggplot(mapping=aes(x = Year, y = Student_Loan))+ + geom_point(aes(color=Quarter)) +debt_plot +``` + +## Pivoting the Data + +```{r} +debt_new1 <- debt_new %>% + pivot_longer(!c(Year,Quarter), names_to = "DebtType", values_to = "DebtPercent") + +debt_new1 +``` + +## **Part-Whole Relationships Visualization** + +```{r} +# Using the pivoted data (debt_new1), which contains DebtType and DebtPercent +debt_new1_plot <- debt_new1 %>% + ggplot(mapping=aes(x = Year, y = DebtPercent)) + +debt_new1_plot + + facet_wrap(~DebtType, scales = "free") + +debt_new1_plot + + geom_point(aes(color = DebtType)) + +debt_new1_plot+ + geom_point() + + facet_wrap(~DebtType) + + scale_x_discrete(breaks = c('03','06','09','12','15','18','21')) + +debt_new1_plot + + geom_point(aes(color = Quarter), alpha = 0.9) + + facet_wrap(~DebtType, scales = "free_y") + + labs(title="Debt by type from '03 - '21")+ + scale_x_discrete(breaks = c('03','06','09','12','15','18','21')) +``` diff --git a/challenge7.qmd b/challenge7.qmd new file mode 100644 index 00000000..daa81a03 --- /dev/null +++ b/challenge7.qmd @@ -0,0 +1,101 @@ +--- +title: "Challenge 7" +author: "Emily Duryea" +description: "Challenge 7" +date: "12/20/2022" +format: + html: + toc: true + code-fold: true + code-copy: true + code-tools: true +categories: + - challenge_7 +--- + +# Challenge 7 + +```{r} +#| label: setup +#| warning: false +#| message: false + +library(tidyverse) +library(ggplot2) + +knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) +``` + +## Read in Data + +```{r} +egg <- read.csv("_data/eggs_tidy.csv") +egg +``` + +This dataset shows egg sales by month/year, as well as by carton size (half dozen large, dozen large, half dozen extra large, dozen extra large). 
+ +## Tidying & Mutating the Data + +```{r} +# Collecting the names of the carton columns (everything except month and year) +column <- names(egg) +column <- column[!column %in% c("year","month")] +column + +# Pivoting the data so each row is one month, year, and carton type +egg <- egg %>% + pivot_longer(cols = all_of(column), names_to = "carton_type", values_to = "sales") +egg +``` + +## Visualization with Multiple Dimensions + +```{r} +# Summing sales by year +egggroup <- egg %>% + group_by(year) %>% + summarise( + total_sales = sum(sales) + ) +egggroup + +# Creating a line plot of total sales by year +ggplot(egggroup, aes(x = year, y = total_sales)) + + geom_line(color = "black") + + theme_minimal() + + theme( + plot.background = element_rect(fill = "lightyellow"), + panel.grid = element_line(color = "grey") + ) + + +egggroup2 <- egg %>% + group_by(year, carton_type) %>% + summarise( + total = sum(sales) + ) +egggroup2 + +ggplot(egggroup2, aes(x=year, y=total, fill=carton_type)) + + geom_col(color="black", size=0.5) + + theme(text = element_text(family="Times")) + + geom_vline(xintercept=c(2010, 2015, 2020), color="blue", linetype="dashed", size=1) + + scale_fill_brewer(type="seq", palette="Reds") + +ggplot(data=egggroup2, aes(x=year, y=total, color= carton_type)) + + geom_line() + + geom_point() + + labs( + x = "Year", + y = "Total Sales", + color = "Carton Type", + title = "Total Sales of Egg Carton Types Over the Years" + ) + + guides(color = guide_legend(title="Carton Type")) + + theme_minimal() + + theme( + text = element_text(family="Times", size=12, color="black"), + panel.background = element_rect(fill="lightyellow") + ) +``` diff --git a/challenge8.qmd b/challenge8.qmd new file mode 100644 index 00000000..b2875518 --- /dev/null +++ b/challenge8.qmd @@ -0,0 +1,63 @@ +--- +title: "Challenge 8" +author: "Emily Duryea" +description: "Challenge 8" +date: "12/20/2022" +format: + html: + toc: true + code-fold: true + code-copy: true + code-tools: true +categories: + - challenge_8 +--- + +# Challenge 8 + +```{r} +#| label: setup +#| 
warning: false +#| message: false + +library(tidyverse) +library(ggplot2) + +knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) +``` + +## Reading in Data + +```{r} +cgroups <- read.csv("_data/FAOSTAT_country_groups.csv") +dcattle <- read.csv("_data/FAOSTAT_cattle_dairy.csv") +``` + +The dataset that I will primarily be working with is the FAO Stat Cattle dataset. I will combine it with a second dataset that groups countries into regions, so that the FAO Stat Cattle data can be examined by region rather than by individual country. The FAO Stat Cattle dataset contains data on cow milk and sales for 245 countries around the world, covering 1961 to 2018. It has 14 columns and 36,449 rows. + +## Tidying & Combining the Data + +```{r} +# Changing "Area.Code" to "Country.Code" to match the other dataset +dcattle2 <- rename(dcattle, "Country.Code" = "Area.Code") +dcattle2 + +# Joining the two datasets together +newcattle <- left_join(dcattle2, cgroups, by = "Country.Code") +newcattle + +# Grouping the cow milk values by country group +newcattle1 <- newcattle %>% + group_by(Country.Group) %>% + summarise(Value) +newcattle1 + +# Creating plot of value of cow milk and country group +ggplot(newcattle1, aes(x = Country.Group, y = Value)) + + geom_line(color = "black") + + theme_minimal() + + theme( + plot.background = element_rect(fill = "lightyellow"), + panel.grid = element_line(color = "grey") + ) +```