Data Skills 2 - R

Fall Quarter 2024

Homework 4

Due: November 30 before midnight on Gradescope

Question 1 Big Data (50%):

From Canvas (Modules -> Problem Sets), download the nursing-home-inspect-data.zip file. It contains unredacted nursing home inspections that ProPublica acquired through a Freedom of Information Act request as part of its investigation into nursing home quality.

1. Unzip the files (this can be done manually or in R) in a folder outside of the one you will upload to GitHub. Your code should allow the user to input the path to the folder that holds the unzipped CSV files.
2. Load the CSVs together as one parquet. The code to do this should be generalized so that if more files are added to the folder of CSVs, the code still handles them.
3. Then partition the data by state, and write the partitioned dataset to disk as a series of parquets. Again, save this to a folder outside of the one you will upload to GitHub. Load the partitioned datasets as a new parquet.
4. Write a function that takes a state name as input and times how long it takes to do the following: filter the data by state, then create a summary table with the number of deficiencies for each severity level. The function should measure the run time on the non-partitioned data loaded from CSV and on the partitioned data loaded from the parquets you saved in step 3. (Hint: look into how to use Sys.time() to measure how long code takes to run.) The function should print the time elapsed with and without partitioning, as well as the difference between the two.
5. Make a scatter plot of how much faster the code runs on partitioned data (e.g. 2x faster, 3x faster) vs. the size of the filtered data. Let each state be one point on the scatter plot. The x-axis should be the total number of observations in the dataset after filtering for that state. Is there a relationship with the size of the filtered data? Note that this might take a while to run! Save the scatter plot as a .png.
6. Then create a new plot with three data series on the y-axis: the total time without partitioning, the total time with partitioning, and the difference in time between the two. What is the relationship between the time saved by using parquets and the size of the filtered data? Describe this in your README file.
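The Sys.time() hint in the timing step can be sketched as follows. This is a minimal base R illustration under stated assumptions, not a solution: `time_call`, the toy data frame, and its column names `state` and `severity` are hypothetical stand-ins, and the real task would time the filter-and-summarise step on the nursing home data instead.

```r
# Hypothetical helper: runs a function and reports how long it took in seconds.
time_call <- function(f) {
  start_time <- Sys.time()
  result <- f()
  end_time <- Sys.time()
  elapsed <- as.numeric(difftime(end_time, start_time, units = "secs"))
  list(result = result, elapsed_secs = elapsed)
}

# Toy data standing in for the inspections data (column names are assumptions).
toy <- data.frame(
  state = c("IL", "IL", "WI", "IL"),
  severity = c("D", "E", "D", "D")
)

# Time a filter-by-state followed by a count per severity level.
timed <- time_call(function() {
  il <- toy[toy$state == "IL", ]
  table(il$severity)
})

print(timed$result)
cat("Elapsed seconds:", timed$elapsed_secs, "\n")
```

Calling the same helper once on the non-partitioned data and once on the partitioned data would give two elapsed times whose difference and ratio can be reported. Very short runs may produce noisy timings, so repeating each measurement is one option to consider.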

Question 2 Python (50%):

You receive a piece of code from a former colleague that is written in Python (program.py). Unfortunately, your team works in R and your former colleague has bad coding style, so the rest of the team can't understand what this code does. Your boss asks you to look at their Python code, figure out what it is doing, and then translate it into R code.

1. Test the program out in Python. Add comments to the .py code to explain each step, then save and upload it as program_commented.py. (Hint: it can be helpful to add extra print statements to see what the code is doing at each step.)
2. Then, in the README file, write a short description of what this code does.
3. Then, write a piece of R code that does the same thing as this Python code. Unlike your colleague, use good coding style (i.e., useful function and variable names) so that future users can understand the code more easily.

