379 changes: 379 additions & 0 deletions All Submissions/Finalprojectpaper.rmd

Large diffs are not rendered by default.

118 changes: 118 additions & 0 deletions All Submissions/Homework3.qmd
@@ -0,0 +1,118 @@
---
title: "Homework 3"
author: "Emily Duryea"
description: "Homework 3 submission by Emily Duryea"
date: "12/20/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
  - homework3
- emilyduryea
- student
- academic
---

# Homework 3

```{r}
# Importing dataset
studentsurvey <- read.csv("_data/student_prediction.csv")
summary(studentsurvey)
```

## Descriptive Statistics & Visualization

```{r}
library(tidyverse)   # loads dplyr and tidyr, among others
library(epiDisplay)  # provides tab1() frequency tables

### Sample Information ###

# Sample Gender
tab1(studentsurvey$GENDER, sort.group = "decreasing", cum.percent = TRUE)
# Numeric key: 1 = female, 2 = male

# Sample's Graduated High School Type
tab1(studentsurvey$HS_TYPE, sort.group = "decreasing", cum.percent = TRUE)
# Numeric key: 1 = graduated from a private high school, 2 = state high school, 3 = other

# Sample's Work Status
tab1(studentsurvey$WORK, sort.group = "decreasing", cum.percent = TRUE)
# Numeric key: 1 = Yes, 2 = No

# Sample's Received Scholarship
tab1(studentsurvey$SCHOLARSHIP, sort.group = "decreasing", cum.percent = TRUE)
# Numeric key: 1 = No Scholarship, 2 = 25% Scholarship, 3 = 50% Scholarship, 4 = 75% Scholarship, 5 = Full Scholarship

### Research Question 1 ###

# Taking Notes in Class
mean(studentsurvey$NOTES)
median(studentsurvey$NOTES)
sd(studentsurvey$NOTES)
# Numeric key: 1 = never takes notes, 2 = sometimes takes notes, and 3 = always takes notes
# Frequency visualization
tab1(studentsurvey$NOTES, sort.group = "decreasing", cum.percent = TRUE)

# Class Attendance
mean(studentsurvey$ATTEND)
median(studentsurvey$ATTEND)
sd(studentsurvey$ATTEND)
# Numeric key: 1 = always attends class, 2 = sometimes attends class, 3 = never attends class
# Frequency visualization
tab1(studentsurvey$ATTEND, sort.group = "decreasing", cum.percent = TRUE)

# Reported Listening in Class
mean(studentsurvey$LISTENS)
median(studentsurvey$LISTENS)
sd(studentsurvey$LISTENS)
# Numeric key: 1 = never listens to class lectures, 2 = sometimes listens to class lectures, 3 = always listens to class lectures
# Frequency visualization
tab1(studentsurvey$LISTENS, sort.group = "decreasing", cum.percent = TRUE)

### Research Question 2 ###

# Hours Studying
mean(studentsurvey$STUDY_HRS)
median(studentsurvey$STUDY_HRS)
sd(studentsurvey$STUDY_HRS)
# Numeric key: 1 = 0 hours per week, 2 = <5 hours, 3 = 6-10 hours, 4 = 11-20 hours, 5 = more than 20 hours
# Frequency visualization
tab1(studentsurvey$STUDY_HRS, sort.group = "decreasing", cum.percent = TRUE)

### Research Question 3 ###

# Peer Study Groups
mean(studentsurvey$PREP_STUDY)
median(studentsurvey$PREP_STUDY)
sd(studentsurvey$PREP_STUDY)
# Numeric key: 1 = studies alone, 2 = studies with friends, 3 = not applicable
# Frequency visualization
tab1(studentsurvey$PREP_STUDY, sort.group = "decreasing", cum.percent = TRUE)

# Positive Class Discussions
mean(studentsurvey$LIKES_DISCUSS)
median(studentsurvey$LIKES_DISCUSS)
sd(studentsurvey$LIKES_DISCUSS)
# Numeric key: 1 = never likes/participates in discussions, 2 = sometimes, 3 = always
# Frequency visualization
tab1(studentsurvey$LIKES_DISCUSS, sort.group = "decreasing", cum.percent = TRUE)
```

### Sample Conclusions

In this study there were more male (60%) than female (40%) participants, and most students attended a state/public high school (71%). Most had received at least a 50% scholarship (52.4% received a 50% scholarship, 29% a 75% scholarship, and 15.9% a full scholarship, while only 2.1% received a 25% scholarship and 0.7% none). Furthermore, most students in this sample (66.2%) did not hold a job while studying; with scholarships so common, working during university may not be necessary.

### Research Question Variables Conclusions

For research question 1, most students reported always taking notes (57.3%; M = 2.54, SD = 0.56, median = 3), and most reported always attending class (75.9%; M = 1.24, SD = 0.43, median = 1). Interestingly, the majority reported only sometimes listening in class (54.2%; M = 2.06, SD = 0.67, median = 2). Overall, these statistics suggest that most students engage in classroom engagement behaviors. For research question 2, most students reported studying less than five hours a week (51%; M = 2.2, SD = 0.92, median = 2), less study time than I would have anticipated of university students. For research question 3, most students reported studying alone (73.8%; M = 1.34, SD = 0.61, median = 1). Most also enjoyed class discussions sometimes (48.3%) or all of the time (45.5%); only a few (6.2%) never enjoyed them or found them beneficial to their learning (M = 2.39, SD = 0.60, median = 2).
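The per-variable `mean()`, `median()`, and `sd()` calls above can be collapsed into a single `summarise(across(...))` call. A minimal sketch on a toy stand-in for `studentsurvey` (the real columns come from `student_prediction.csv`; the values here are invented for illustration):

```r
library(dplyr)

# Toy stand-in for studentsurvey; values invented for illustration only
toy <- tibble(
  NOTES   = c(3, 2, 3, 2, 3),
  ATTEND  = c(1, 1, 2, 1, 1),
  LISTENS = c(2, 2, 3, 1, 2)
)

# One call computes mean, median, and sd for every engagement variable
res <- toy %>%
  summarise(across(c(NOTES, ATTEND, LISTENS),
                   list(mean = mean, median = median, sd = sd)))
res
```

The default `across()` naming scheme yields columns like `NOTES_mean` and `ATTEND_sd`, which keeps each statistic traceable to its variable.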

## Reflection

A limitation of my visuals is that a naive viewer probably could not interpret them without the numeric key, which is why I provided the necessary numeric values in the code comments. Going forward, I would be interested to see how different variables (like classroom engagement, peer collaboration, and study habits) interact with cumulative GPA, and perhaps how demographic variables interact with GPA and other factors (e.g., hours spent studying and work status).
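As a sketch of that follow-up, study hours could be correlated with cumulative GPA. Both the GPA column name and all values below are assumptions invented for illustration, not taken from the dataset:

```r
# Ordinal study-hours codes and a hypothetical cumulative-GPA code;
# all values are invented for illustration
study_hrs <- c(1, 2, 2, 3, 4, 5)
cuml_gpa  <- c(2, 2, 3, 3, 4, 5)

# Spearman is a reasonable choice since both variables are ordinal codes
r <- cor(study_hrs, cuml_gpa, method = "spearman")
r
```

With real data, a positive `r` would support the idea that more study time tracks with higher GPA, though ordinal codes only support a rank-based interpretation.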
69 changes: 69 additions & 0 deletions All Submissions/challenge1.qmd
@@ -0,0 +1,69 @@
---
title: "Challenge 1"
author: "Emily Duryea"
description: "Reading in data and creating a post"
date: "11/25/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_1
- birds
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

Today's challenge is to

1) read in a dataset, and

2) describe the dataset using both words and any supporting information (e.g., tables, etc)

## Read in the Data

Read in one (or more) of the following data sets, using the correct R package and command.

- railroad_2012_clean_county.csv ⭐
- birds.csv ⭐⭐
- FAOstat\*.csv ⭐⭐
- wild_bird_data.xlsx ⭐⭐⭐
- StateCounty2012.xls ⭐⭐⭐⭐

Find the `_data` folder, located inside the `posts` folder. Then you can read in the data, using either one of the `readr` standard tidy read commands, or a specialized package such as `readxl`.

```{r}
# Importing the data file
library(readr)
birds <- read_csv("_data/birds.csv")
# head() prints in the rendered document; View() only opens interactively
head(birds)
```

Add any comments or documentation as needed. More challenging data sets may require additional code chunks and documentation.

## Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

```{r}
#| label: summary
summary(birds)
count(birds, Item)
count(birds, Area)
```

This dataset includes 30,977 rows and 14 columns. It covers 5 categories of birds across 248 countries and areas. Of those rows, 13,074 record chickens, 6,909 ducks, 5,693 turkeys, 4,136 geese and guinea fowls, and 1,165 pigeons and other birds. Some areas account for a large share of the entries (e.g., France, Egypt, and Greece, with 290 each), while others have very few (e.g., Luxembourg with 19, Montenegro with 13, and Sudan with 7).
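The largest and smallest area counts quoted above can be pulled out programmatically rather than read off the full `count()` table. A minimal sketch on invented stand-in rows (the real `Area` values come from `birds.csv`):

```r
library(dplyr)

# Toy stand-in; the number of rows per Area is invented for illustration
toy <- tibble(Area = c(rep("France", 4), rep("Egypt", 3), "Sudan"))

area_counts <- count(toy, Area)

top    <- slice_max(area_counts, n, n = 1)  # most-represented area
bottom <- slice_min(area_counts, n, n = 1)  # least-represented area
top
bottom
```

`slice_max()`/`slice_min()` keep ties by default, so on the real data they would surface all areas sharing the extreme count (e.g., France, Egypt, and Greece at 290).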


86 changes: 86 additions & 0 deletions All Submissions/challenge2.qmd
@@ -0,0 +1,86 @@
---
title: "Challenge 2"
author: "Emily Duryea"
description: "Data wrangling: using group() and summarise()"
date: "11/25/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- railroad
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

Today's challenge is to

1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics

## Read in the Data

Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.

- railroad\*.csv or StateCounty2012.xls ⭐
- FAOstat\*.csv or birds.csv ⭐⭐⭐
- hotel_bookings.csv ⭐⭐⭐⭐

```{r}
# Importing the data file
library(readr)
railroad <- read_csv("_data/railroad_2012_clean_county.csv")
# head() prints in the rendered document; View() only opens interactively
head(railroad)
```

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

## Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

```{r}
#| label: summary
summary(railroad)
dim(railroad)
str(railroad)
mean(railroad$total_employees)
min(railroad$total_employees)
max(railroad$total_employees)
median(railroad$total_employees)
count(railroad, state)
count(railroad, county)
```

This dataset includes 2,930 rows and 3 columns, covering 53 states and territories and 1,709 counties. The mean number of railroad employees per county is 87.18, the minimum is 1, the maximum is 8,207, and the median is 21. Because the median is far below the mean and the maximum is an extremely large value, some major outliers are pulling the mean upward. The maximum is located in Cook County, Illinois.
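The claim that the maximum sits in Cook County, Illinois can be checked directly with `slice_max()` rather than by scanning the table. A minimal sketch on invented stand-in rows (only the 8,207 figure comes from the summary above; the other rows are made up):

```r
library(dplyr)

# Toy stand-in rows; all values except the known maximum are invented
toy <- tibble(
  state           = c("IL", "ME", "CA"),
  county          = c("COOK", "YORK", "LOS ANGELES"),
  total_employees = c(8207, 40, 1000)
)

top <- slice_max(toy, total_employees, n = 1)  # row holding the maximum
top
```

On the real data the same call returns the full row for the outlier county, which is handy when deciding whether to report medians instead of means.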

## Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

```{r}
# Finding the central tendency for total employees by state
railroad %>%
  select(state, total_employees) %>%
  group_by(state) %>%
  summarize(mean = mean(total_employees),
            median = median(total_employees),
            sd = sd(total_employees))
```

### Explain and Interpret

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.

I chose to examine the central tendency of railroad employees per county, grouped by state, because I was curious how states varied in total employees. Averages appear higher in states with larger populations: California, with a high population, has an average of 238 employees per county, while a lower-population state like Maine averages 40. It would be interesting to test this hypothesis in further analyses.
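One way to eyeball that population hypothesis is to sort the grouped means from largest to smallest with `arrange(desc(...))`. A minimal sketch with invented per-county values chosen so the state averages land on the 238 and 40 quoted above:

```r
library(dplyr)

# Invented per-county employee counts for illustration
toy <- tibble(
  state           = c("CA", "CA", "ME", "ME"),
  total_employees = c(300, 176, 50, 30)
)

res <- toy %>%
  group_by(state) %>%
  summarise(mean_emp = mean(total_employees)) %>%
  arrange(desc(mean_emp))   # largest state averages first
res
```

On the real data, pairing this ranked table with state population figures would make the population/employment relationship easy to inspect.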
63 changes: 63 additions & 0 deletions All Submissions/challenge3.qmd
@@ -0,0 +1,63 @@
---
title: "Challenge 3"
author: "Emily Duryea"
description: "Challenge 3"
date: "12/20/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_3
---

# Challenge 3

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Read in Data

```{r}
aw <- read.csv("_data/animal_weight.csv")
aw
```

The dataset, which I chose to label "aw" (short for "animal weights"), contains average animal weights for different regions of the world. The animal columns are: 1) dairy cattle, 2) non-dairy cattle, 3) buffaloes, 4) market swine, 5) breeding swine, 6) chickens (broilers), 7) chickens (layers), 8) ducks, 9) turkeys, 10) sheep, 11) goats, 12) horses, 13) asses, 14) mules, 15) camels, and 16) llamas. The regions appear in rows: 1) Indian subcontinent, 2) Eastern Europe, 3) Africa, 4) Oceania, 5) Western Europe, 6) Latin America, 7) Asia, 8) Middle East, and 9) North America. Each cell holds the animal weight for that region.

## Finding the Dimensions

```{r}
# Getting the number of rows
nrow(aw)

# Getting the number of columns
ncol(aw)

# Calculating the expected number of cases after pivoting
# (rows times animal columns; the first column holds the region, so it is excluded)
nrow(aw) * (ncol(aw) - 1)

# Calculating the expected number of columns after pivoting
# (region + animal + weights)
1 + 1 + 1
```

The current dataset has 9 rows and 17 columns (one region column plus 16 animal columns), so pivoting is anticipated to produce 144 cases (9 × 16).

## Pivot the Data

```{r}
aw_long <- pivot_longer(aw, "Cattle...dairy":"Llamas",
                        names_to = "animal",
                        values_to = "weights")
aw_long
```

After pivoting the data, there are 3 columns with 144 rows, as anticipated by the calculations.
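The row/column arithmetic above can be verified on any wide table. A minimal sketch using an invented two-region, three-animal table (the real data has 9 regions and 16 animal columns):

```r
library(tidyr)

# Invented wide table: one region column plus three animal columns
wide <- tibble::tibble(
  Region = c("A", "B"),
  Cattle = c(400, 410),
  Ducks  = c(2, 3),
  Goats  = c(30, 28)
)

long <- pivot_longer(wide, Cattle:Goats,
                     names_to = "animal", values_to = "weights")

# Expected cases: rows times animal columns (region column excluded)
nrow(wide) * (ncol(wide) - 1)
dim(long)
```

The same check on `aw` confirms the anticipated 9 × 16 = 144 rows and 3 columns.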