379 changes: 379 additions & 0 deletions All Submissions/Finalprojectpaper.rmd

Large diffs are not rendered by default.

118 changes: 118 additions & 0 deletions All Submissions/Homework3.qmd
@@ -0,0 +1,118 @@
---
title: "Homework 3"
author: "Emily Duryea"
description: "Homework 3 submission by Emily Duryea"
date: "12/20/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
  - homework3
- emilyduryea
- student
- academic
---

# Homework 3

```{r}
# Importing dataset
studentsurvey <- read.csv("_data/student_prediction.csv")
summary(studentsurvey)
```

## Descriptive Statistics & Visualization

```{r}
library(tidyverse)   # loads dplyr and tidyr, among others
library(epiDisplay)  # provides tab1() frequency tables

### Sample Information ###

# Sample Gender
tab1(studentsurvey$GENDER, sort.group = "decreasing", cum.percent = TRUE)
# Numeric key: 1 = female, 2 = male

# Sample's Graduated High School Type
tab1(studentsurvey$HS_TYPE, sort.group = "decreasing", cum.percent = TRUE)
# Numeric key: 1 = graduated from a private high school, 2 = state high school, 3 = other

# Sample's Work Status
tab1(studentsurvey$WORK, sort.group = "decreasing", cum.percent = TRUE)
# Numeric key: 1 = Yes, 2 = No

# Sample's Received Scholarship
tab1(studentsurvey$SCHOLARSHIP, sort.group = "decreasing", cum.percent = TRUE)
# Numeric key: 1 = No Scholarship, 2 = 25% Scholarship, 3 = 50% Scholarship, 4 = 75% Scholarship, 5 = Full Scholarship

### Research Question 1 ###

# Taking Notes in Class
mean(studentsurvey$NOTES)
median(studentsurvey$NOTES)
sd(studentsurvey$NOTES)
# Numeric key: 1 = never takes notes, 2 = sometimes takes notes, and 3 = always takes notes
# Frequency visualization
tab1(studentsurvey$NOTES, sort.group = "decreasing", cum.percent = TRUE)

# Class Attendance
mean(studentsurvey$ATTEND)
median(studentsurvey$ATTEND)
sd(studentsurvey$ATTEND)
# Numeric key: 1 = always attends class, 2 = sometimes attends class, 3 = never attends class
# Frequency visualization
tab1(studentsurvey$ATTEND, sort.group = "decreasing", cum.percent = TRUE)

# Reported Listening in Class
mean(studentsurvey$LISTENS)
median(studentsurvey$LISTENS)
sd(studentsurvey$LISTENS)
# Numeric key: 1 = never listens to class lectures, 2 = sometimes listens to class lectures, 3 = always listens to class lectures
# Frequency visualization
tab1(studentsurvey$LISTENS, sort.group = "decreasing", cum.percent = TRUE)

### Research Question 2 ###

# Hours Studying
mean(studentsurvey$STUDY_HRS)
median(studentsurvey$STUDY_HRS)
sd(studentsurvey$STUDY_HRS)
# Numeric key: 1 = 0 hours per week, 2 = <5 hours, 3 = 6-10 hours, 4 = 11-20 hours, 5 = more than 20 hours
# Frequency visualization
tab1(studentsurvey$STUDY_HRS, sort.group = "decreasing", cum.percent = TRUE)

### Research Question 3 ###

# Peer Study Groups
mean(studentsurvey$PREP_STUDY)
median(studentsurvey$PREP_STUDY)
sd(studentsurvey$PREP_STUDY)
# Numeric key: 1 = studies alone, 2 = studies with friends, 3 = not applicable
# Frequency visualization
tab1(studentsurvey$PREP_STUDY, sort.group = "decreasing", cum.percent = TRUE)

# Positive Class Discussions
mean(studentsurvey$LIKES_DISCUSS)
median(studentsurvey$LIKES_DISCUSS)
sd(studentsurvey$LIKES_DISCUSS)
# Numeric key: 1 = never likes/participates in discussions, 2 = sometimes, 3 = always
# Frequency visualization
tab1(studentsurvey$LIKES_DISCUSS, sort.group = "decreasing", cum.percent = TRUE)
```

### Sample Conclusions

In this study there were more male (60%) than female (40%) participants, and most students attended a state/public high school (71%). Most had received at least a 50% scholarship (52.4% received a 50% scholarship, 29% a 75% scholarship, and 15.9% a full scholarship, while only 2.1% received a 25% scholarship and 0.7% none). Furthermore, most students in this sample (66.2%) did not hold a job while studying; with scholarships so common, working during university may not be necessary.

### Research Question Variables Conclusions

For research question 1, most students reported always taking notes (57.3%; M = 2.54, SD = 0.56, median = 3), and most reported always attending class (75.9%; M = 1.24, SD = 0.43, median = 1). Interestingly, the majority reported only sometimes listening in class (54.2%; M = 2.06, SD = 0.67, median = 2). Overall, these statistics suggest that most students engage in classroom engagement behaviors. For research question 2, most students reported studying less than five hours a week (51%; M = 2.2, SD = 0.92, median = 2), less study time than I would have anticipated of university students. For research question 3, most students reported studying alone (73.8%; M = 1.34, SD = 0.61, median = 1). Most also enjoyed class discussions sometimes (48.3%) or all of the time (45.5%); only a few (6.2%) never enjoyed them or found them beneficial to their learning (M = 2.39, SD = 0.60, median = 2).
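The per-variable `mean()`, `median()`, and `sd()` calls above can be collapsed into a single `summarise(across(...))` call. A minimal sketch on a toy stand-in for `studentsurvey` (the real columns come from `student_prediction.csv`; the values here are invented for illustration):

```r
library(dplyr)

# Toy stand-in for studentsurvey; values invented for illustration only
toy <- tibble(
  NOTES   = c(3, 2, 3, 2, 3),
  ATTEND  = c(1, 1, 2, 1, 1),
  LISTENS = c(2, 2, 3, 1, 2)
)

# One call computes mean, median, and sd for every engagement variable
res <- toy %>%
  summarise(across(c(NOTES, ATTEND, LISTENS),
                   list(mean = mean, median = median, sd = sd)))
res
```

The default `across()` naming scheme yields columns like `NOTES_mean` and `ATTEND_sd`, which keeps each statistic traceable to its variable.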

## Reflection

A limitation of my visuals is that a naive viewer probably could not interpret them without the numeric key, which is why I provided the necessary numeric values in the code comments. Going forward, I would be interested to see how different variables (like classroom engagement, peer collaboration, and study habits) interact with cumulative GPA, and perhaps how demographic variables interact with GPA and other factors (e.g., hours spent studying and work status).
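As a sketch of that follow-up, study hours could be correlated with cumulative GPA. Both the GPA column name and all values below are assumptions invented for illustration, not taken from the dataset:

```r
# Ordinal study-hours codes and a hypothetical cumulative-GPA code;
# all values are invented for illustration
study_hrs <- c(1, 2, 2, 3, 4, 5)
cuml_gpa  <- c(2, 2, 3, 3, 4, 5)

# Spearman is a reasonable choice since both variables are ordinal codes
r <- cor(study_hrs, cuml_gpa, method = "spearman")
r
```

With real data, a positive `r` would support the idea that more study time tracks with higher GPA, though ordinal codes only support a rank-based interpretation.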
69 changes: 69 additions & 0 deletions All Submissions/challenge1.qmd
@@ -0,0 +1,69 @@
---
title: "Challenge 1"
author: "Emily Duryea"
description: "Reading in data and creating a post"
date: "11/25/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_1
- birds
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

Today's challenge is to

1) read in a dataset, and

2) describe the dataset using both words and any supporting information (e.g., tables, etc)

## Read in the Data

Read in one (or more) of the following data sets, using the correct R package and command.

- railroad_2012_clean_county.csv ⭐
- birds.csv ⭐⭐
- FAOstat\*.csv ⭐⭐
- wild_bird_data.xlsx ⭐⭐⭐
- StateCounty2012.xls ⭐⭐⭐⭐

Find the `_data` folder, located inside the `posts` folder. Then you can read in the data, using either one of the `readr` standard tidy read commands, or a specialized package such as `readxl`.

```{r}
# Importing the data file
library(readr)
birds <- read_csv("_data/birds.csv")
# head() prints in the rendered document; View() only opens interactively
head(birds)
```

Add any comments or documentation as needed. More challenging data sets may require additional code chunks and documentation.

## Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

```{r}
#| label: summary
summary(birds)
count(birds, Item)
count(birds, Area)
```

This dataset includes 30,977 rows and 14 columns. It covers 5 categories of birds across 248 countries and areas. Of those rows, 13,074 record chickens, 6,909 ducks, 5,693 turkeys, 4,136 geese and guinea fowls, and 1,165 pigeons and other birds. Some areas account for a large share of the entries (e.g., France, Egypt, and Greece, with 290 each), while others have very few (e.g., Luxembourg with 19, Montenegro with 13, and Sudan with 7).
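The largest and smallest area counts quoted above can be pulled out programmatically rather than read off the full `count()` table. A minimal sketch on invented stand-in rows (the real `Area` values come from `birds.csv`):

```r
library(dplyr)

# Toy stand-in; the number of rows per Area is invented for illustration
toy <- tibble(Area = c(rep("France", 4), rep("Egypt", 3), "Sudan"))

area_counts <- count(toy, Area)

top    <- slice_max(area_counts, n, n = 1)  # most-represented area
bottom <- slice_min(area_counts, n, n = 1)  # least-represented area
top
bottom
```

`slice_max()`/`slice_min()` keep ties by default, so on the real data they would surface all areas sharing the extreme count (e.g., France, Egypt, and Greece at 290).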


86 changes: 86 additions & 0 deletions All Submissions/challenge2.qmd
@@ -0,0 +1,86 @@
---
title: "Challenge 2"
author: "Emily Duryea"
description: "Data wrangling: using group() and summarise()"
date: "11/25/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- railroad
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

Today's challenge is to

1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics

## Read in the Data

Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.

- railroad\*.csv or StateCounty2012.xls ⭐
- FAOstat\*.csv or birds.csv ⭐⭐⭐
- hotel_bookings.csv ⭐⭐⭐⭐

```{r}
# Importing the data file
library(readr)
railroad <- read_csv("_data/railroad_2012_clean_county.csv")
# head() prints in the rendered document; View() only opens interactively
head(railroad)
```

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

## Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

```{r}
#| label: summary
summary(railroad)
dim(railroad)
str(railroad)
mean(railroad$total_employees)
min(railroad$total_employees)
max(railroad$total_employees)
median(railroad$total_employees)
count(railroad, state)
count(railroad, county)
```

This dataset includes 2,930 rows and 3 columns, covering 53 states and territories and 1,709 counties. The mean number of railroad employees per county is 87.18, the minimum is 1, the maximum is 8,207, and the median is 21. Because the median is far below the mean and the maximum is an extremely large value, some major outliers are pulling the mean upward. The maximum is located in Cook County, Illinois.
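The claim that the maximum sits in Cook County, Illinois can be checked directly with `slice_max()` rather than by scanning the table. A minimal sketch on invented stand-in rows (only the 8,207 figure comes from the summary above; the other rows are made up):

```r
library(dplyr)

# Toy stand-in rows; all values except the known maximum are invented
toy <- tibble(
  state           = c("IL", "ME", "CA"),
  county          = c("COOK", "YORK", "LOS ANGELES"),
  total_employees = c(8207, 40, 1000)
)

top <- slice_max(toy, total_employees, n = 1)  # row holding the maximum
top
```

On the real data the same call returns the full row for the outlier county, which is handy when deciding whether to report medians instead of means.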

## Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

```{r}
# Finding the central tendency for total employees by state
railroad %>%
  select(state, total_employees) %>%
  group_by(state) %>%
  summarize(mean = mean(total_employees),
            median = median(total_employees),
            sd = sd(total_employees))
```

### Explain and Interpret

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.

I chose to examine the central tendency of railroad employees per county, grouped by state, because I was curious how states varied in total employees. Averages appear higher in states with larger populations: California, with a high population, has an average of 238 employees per county, while a lower-population state like Maine averages 40. It would be interesting to test this hypothesis in further analyses.
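One way to eyeball that population hypothesis is to sort the grouped means from largest to smallest with `arrange(desc(...))`. A minimal sketch with invented per-county values chosen so the state averages land on the 238 and 40 quoted above:

```r
library(dplyr)

# Invented per-county employee counts for illustration
toy <- tibble(
  state           = c("CA", "CA", "ME", "ME"),
  total_employees = c(300, 176, 50, 30)
)

res <- toy %>%
  group_by(state) %>%
  summarise(mean_emp = mean(total_employees)) %>%
  arrange(desc(mean_emp))   # largest state averages first
res
```

On the real data, pairing this ranked table with state population figures would make the population/employment relationship easy to inspect.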
63 changes: 63 additions & 0 deletions All Submissions/challenge3.qmd
@@ -0,0 +1,63 @@
---
title: "Challenge 3"
author: "Emily Duryea"
description: "Challenge 3"
date: "12/20/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_3
---

# Challenge 3

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Read in Data

```{r}
aw <- read.csv("_data/animal_weight.csv")
aw
```

The dataset, which I chose to label "aw" (short for "animal weights"), contains average animal weights for different regions of the world. The animal columns are: 1) dairy cattle, 2) non-dairy cattle, 3) buffaloes, 4) market swine, 5) breeding swine, 6) chickens (broilers), 7) chickens (layers), 8) ducks, 9) turkeys, 10) sheep, 11) goats, 12) horses, 13) asses, 14) mules, 15) camels, and 16) llamas. The regions appear in rows: 1) Indian subcontinent, 2) Eastern Europe, 3) Africa, 4) Oceania, 5) Western Europe, 6) Latin America, 7) Asia, 8) Middle East, and 9) North America. Each cell holds the animal weight for that region.

## Finding the Dimensions

```{r}
# Getting the number of rows
nrow(aw)

# Getting the number of columns
ncol(aw)

# Calculating the expected number of cases after pivoting
# (rows times animal columns; the first column holds the region, so it is excluded)
nrow(aw) * (ncol(aw) - 1)

# Calculating the expected number of columns after pivoting
# (region + animal + weights)
1 + 1 + 1
```

The current dataset has 9 rows and 17 columns (one region column plus 16 animal columns), so pivoting is anticipated to produce 144 cases (9 × 16).

## Pivot the Data

```{r}
aw_long <- pivot_longer(aw, "Cattle...dairy":"Llamas",
                        names_to = "animal",
                        values_to = "weights")
aw_long
```

After pivoting the data, there are 3 columns with 144 rows, as anticipated by the calculations.
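The row/column arithmetic above can be verified on any wide table. A minimal sketch using an invented two-region, three-animal table (the real data has 9 regions and 16 animal columns):

```r
library(tidyr)

# Invented wide table: one region column plus three animal columns
wide <- tibble::tibble(
  Region = c("A", "B"),
  Cattle = c(400, 410),
  Ducks  = c(2, 3),
  Goats  = c(30, 28)
)

long <- pivot_longer(wide, Cattle:Goats,
                     names_to = "animal", values_to = "weights")

# Expected cases: rows times animal columns (region column excluded)
nrow(wide) * (ncol(wide) - 1)
dim(long)
```

The same check on `aw` confirms the anticipated 9 × 16 = 144 rows and 3 columns.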