56 changes: 25 additions & 31 deletions README.md
@@ -11,77 +11,73 @@ Spills of hazardous materials, like petroleum, mercury, and battery acid, that c

The data used in this tutorial was collected from [https://catalog.data.gov/dataset/spill-incidents/resource/a8f9d3c8-c3fa-4ca1-a97a-55e55ca6f8c0](https://catalog.data.gov/dataset/spill-incidents/resource/a8f9d3c8-c3fa-4ca1-a97a-55e55ca6f8c0) and modified for teaching purposes.

To access all of the materials to complete this tutorial, first log into your OSPool access point and run the following command: `git clone https://github.com/OSGConnect/tutorial-spills-R/`.
To access all of the materials to complete this tutorial, first log into your OSPool access point and run the following commands:

```bash
git clone https://github.com/OSGConnect/tutorial-spills-R/
```

```bash
cd tutorial-spills-R
```

## Step 1: Get to Know Hazardous Spills Dataset

Let's explore the data files that we will be analyzing. Before we do so, we must make sure we are in the tutorial directory (`tutorial-spills-R/`). We can check this by printing our working directory (`pwd`):


```bash
pwd
```

We should see something similar to `/home/jovyan/tutorial-spills-R/`, where `jovyan` could alternatively be your OSG account username.

Next, let's navigate to our `/data` directory and list (`ls`) the files inside of it:

Next, let's navigate to our `data` directory and list (`ls`) the files inside of it:

```bash
cd data/
```


```bash
ls
```

We should see seven `.csv` files, one for each decade from 1950 to 2019.
We should see seven `.csv` files named using the format `spills_<StartingYear>_<EndingYear>.csv`, one for each decade from 1950 to 2019.

To explore the contents of these files, we can use commands like `head -n 5 <fileName>` to view the first 5 lines of our data files.


```bash
head -n 5 spills_1980_1989.csv
```
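
Beyond `head`, a couple of standard shell tools can show how a file is laid out. The sketch below is only an illustration: it writes the header and first record of `spills_1950_1959.csv` to a scratch file (the scratch directory and the `spills_sample.csv` name are assumptions made for this example), then lists the column names and counts the data rows.

```bash
# Scratch copy of the header and first record of spills_1950_1959.csv
# (the scratch directory and file name are hypothetical)
workdir=$(mktemp -d)
cat > "$workdir/spills_sample.csv" <<'EOF'
SpillNumber,ProgramFacilityName,Street1,Street2,Locality,County,ZIPCode,SWISCode,DECRegion,SpillDate,SpillYear,ReceivedDate,ContributingFactor,Waterbody,Source,CloseDate,MaterialName,MaterialFamily,Quantity,Units,Recovered
207294,AES WESTOVER,720 RIVERSIDE DRIVE,,JOHNSON CITY,Broome,,446,7,1/1/50,1950,10/14/02,Other,,Commercial/Industrial,10/16/02,hydrogen chloride,Hazardous Material,0,Gallons,0
EOF

# Print each column name on its own numbered line
head -n 1 "$workdir/spills_sample.csv" | tr ',' '\n' | nl

# Count the columns in the header row
n_columns=$(( $(head -n 1 "$workdir/spills_sample.csv" | tr ',' '\n' | wc -l) ))

# Count data rows: total lines minus the header line
data_rows=$(( $(wc -l < "$workdir/spills_sample.csv") - 1 ))
echo "columns: $n_columns, data rows: $data_rows"
```

On the real files, the same two counting commands can be pointed at any of the `spills_*.csv` datasets in place of the scratch file.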

<span style="color:blue">We can also use the navigation bar on the left side of your notebook to double-click and open each comma-separated value (`.csv`) file and view it as a table, instead of the command-line rendering above.</span>
> <span style="color:blue"> If using the [OSPool Notebook](https://portal.osg-htc.org/documentation/htc_workloads/submitting_workloads/jupyter/), you can also use the navigation bar on the left side of your notebook to double-click and open each comma-separated value (`.csv`) file and view it as a table, instead of the command-line rendering above.</span>

## Step 2: Prepare the R Executable

Next, we need to create an R script to analyze our datasets. An example of an R script can be found in our main tutorial directory, so let's navigate there:


```bash
cd ../ # change directory to move one up
```


```bash
ls # list files
```


```bash
cat spill_calculation.r
```

Then let us print the contents of our executable script:


```bash
cat spill_calculation.r
```

This script will read in different datasets as arguments and then compute summary statistics, printing the number of spills recorded per decade and the total size (in gallons) of the hazardous spills.
This script takes the name of a dataset `.csv` file as an argument, then computes summary statistics and prints the number of hazardous spills recorded and their total size (in gallons).
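
To make the idea of "file name in, summary out" concrete, here is a rough shell analogue of that kind of calculation. It is not the contents of `spill_calculation.r`; the function name `summarize_spills` and the scratch file are assumptions for illustration. It also assumes that `Quantity` is always the third field from the end of each row, which holds for the rows shown here even when an earlier field contains a quoted comma, and it ignores the `Units` column.

```bash
# Scratch copy of data/spills_1950_1959.csv (the scratch location is hypothetical)
sample=$(mktemp)
cat > "$sample" <<'EOF'
SpillNumber,ProgramFacilityName,Street1,Street2,Locality,County,ZIPCode,SWISCode,DECRegion,SpillDate,SpillYear,ReceivedDate,ContributingFactor,Waterbody,Source,CloseDate,MaterialName,MaterialFamily,Quantity,Units,Recovered
207294,AES WESTOVER,720 RIVERSIDE DRIVE,,JOHNSON CITY,Broome,,446,7,1/1/50,1950,10/14/02,Other,,Commercial/Industrial,10/16/02,hydrogen chloride,Hazardous Material,0,Gallons,0
207294,AES WESTOVER,720 RIVERSIDE DRIVE,,JOHNSON CITY,Broome,,446,7,1/1/50,1950,10/14/02,Other,,Commercial/Industrial,10/16/02,hydrogen fluoride,Hazardous Material,0,Gallons,0
911661,KNOLLS ATOMIC BUILDING 1,350 ATOMIC PROJECT ROAD,,BALLSTON SPA,Saratoga,,4642,5,1/1/50,1950,2/1/10,Unknown,NONE,Commercial/Industrial,4/4/13,asbestos,Hazardous Material,0,,0
1505910,"VACANT LOT, FORMER RESIDENCE",37 HUGUENOT AVE,,NEW ROCHELLE,Westchester,,6010,3,9/2/50,1950,9/2/15,Equipment Failure,,Private Dwelling,9/28/15,#2 fuel oil,Petroleum,0,,0
1401615,TOWN OF PIERCEFIELD SAND PIT,OFF MAIN ST,,PIERCEFIELD,St Lawrence,,4568,6,5/15/54,1954,5/15/14,Unknown,,Unknown,8/20/15,#6 fuel oil,Petroleum,0,,0
EOF

# Hypothetical helper: takes a dataset file name as its only argument,
# skips the header row, counts the records, and sums the Quantity field.
# Assumption: Quantity is the third field from the end of every row.
summarize_spills() {
  awk -F',' 'NR > 1 { n++; total += $(NF - 2) }
             END { printf "spills=%d total_quantity=%d\n", n, total }' "$1"
}

summary=$(summarize_spills "$sample")
echo "$summary"   # prints: spills=5 total_quantity=0
```

Counting from the end of the row sidesteps the quoted comma in `"VACANT LOT, FORMER RESIDENCE"`, but a real CSV parser (such as R's `read.csv`) is the more robust choice.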

## Step 3: Prepare Portable Software

Some common software, like R, is provided by OSG using containers. Because of this, you do not need to install R yourself; you just tell HTCondor which container to use for your jobs. Additionally, this tutorial uses only base R and no special libraries, but if you need libraries (e.g., tidyverse, ggplot2), you can always install them in your R container.

A list of containers and other software provided by OSG staff can be found on our website [https://portal.osg-htc.org/documentation/](https://portal.osg-htc.org/documentation/), along with resources for learning how to add libraries to your container.

We will be using the R container for R 3.5.0, which is accessible under `/cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-r:3.5.0`, so we must make sure to tell HTCondor to fetch this container when starting each of our jobs. To learn how to tell HTCondor to do this, see below.
We will be using the R container for R 3.5.0, which is accessible under `/cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-r:3.5.0`, so we must make sure to tell HTCondor to fetch this container when starting each of our jobs. We have already included the command to do so in the provided submit file. For more information on how to use containers, see our [container guide](https://portal.osg-htc.org/documentation/htc_workloads/using_software/containers-singularity/).
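
For orientation only, the sketch below writes a one-line submit-file fragment that requests this container through the `+SingularityImage` job attribute, which is one way OSPool submit files have requested containers. This is an assumption about a typical setup, not a copy of the provided `R.submit`, which may use a different command; the scratch file is hypothetical.

```bash
# Write a hypothetical submit-file fragment to a scratch file.
# This is NOT the tutorial's R.submit; it only illustrates one way
# a container can be requested on the OSPool.
submit_sketch=$(mktemp)
cat > "$submit_sketch" <<'EOF'
+SingularityImage = "/cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-r:3.5.0"
EOF

# Show the fragment
cat "$submit_sketch"
```

When in doubt, treat the provided submit file and the container guide as authoritative.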

## Step 4: Prepare and Submit an HTCondor Submit File for One Test Job

@@ -91,23 +87,20 @@ For example, you should specify what executable you want run, if you want a cont

### Step 4A: Prepare and Submit an HTCondor Submit File

A sample submit file to analyze our smallest dataset, `spills_1950_1959.csv`, might look like:

A sample submit file to analyze our smallest dataset, `spills_1950_1959.csv`, is provided for you. Take a look at it with the command:

```bash
cat R.submit
```

We can submit this job using `condor_submit <SubmitFile>`:

We can submit this job to HTCondor using the command `condor_submit <SubmitFile>`:

```bash
condor_submit R.submit
```

We can check on the status of our job in HTCondor's queue by running:


```bash
condor_q
```
@@ -118,7 +111,6 @@ Once our job is done running, it will leave HTCondor's queue automatically.

Once our job is done running, we can check the results by looking in our `output` folder:


```bash
cat output/spills.out
```
@@ -127,15 +119,21 @@ We should see that from 1950-1959, New York recorded five spills that totalled l

## Step 5: Scale Out Your Workflow to Analyze Many Datasets

We just prepared and ran one job analyzing the `spills_1950_1959.csv` dataset! But now, we want to analyze the remaining 6 datasets. Luckily, HTCondor is very helpful when it comes to rapidly queueing many small jobs!
We just prepared and ran one job analyzing the `spills_1950_1959.csv` dataset! But now, we want to analyze the remaining 6 datasets. Thankfully, HTCondor is very helpful when it comes to rapidly queueing many small jobs!

To do so, we will update our submit file to use the `queue <variable> from <list>` syntax. But before we do this, we need to create a list of the files we want to queue a job for:

```bash
cd data/
```

```bash
ls data > list_of_datasets.txt
ls *.csv > ../list_of_datasets.txt
```

```bash
cd ../
```
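
The sketch below rehearses this step in a scratch directory, so it can be tried without touching the tutorial files. It creates seven empty placeholder files that follow the `spills_<StartingYear>_<EndingYear>.csv` naming format, builds the list, and then loops over it to mimic what a `queue dataset from list_of_datasets.txt` line does: one job per line, with that line's text available as `$(dataset)`. The scratch directory and the loop are illustrative assumptions; HTCondor itself is not invoked.

```bash
# Build a scratch data directory with empty placeholder files
workdir=$(mktemp -d)
mkdir "$workdir/data"
for start in 1950 1960 1970 1980 1990 2000 2010; do
  touch "$workdir/data/spills_${start}_$((start + 9)).csv"
done

# Create the list one level above data/, one file name per line
( cd "$workdir/data" && ls *.csv > ../list_of_datasets.txt )

# Mimic `queue dataset from list_of_datasets.txt`:
# each line of the list corresponds to one job.
job_count=0
while read -r dataset; do
  echo "job $job_count: \$(dataset) = $dataset"
  job_count=$((job_count + 1))
done < "$workdir/list_of_datasets.txt"
echo "jobs that would be queued: $job_count"
```

Writing bare file names (rather than `data/...` paths) into the list keeps each `$(dataset)` value short, so any directory prefix can be added wherever the variable is used.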

```bash
cat list_of_datasets.txt
@@ -149,8 +147,7 @@ Now, let's modify the queue line of our submit file to use the new queue syntax.

We can then call this new variable, `dataset`, elsewhere in our submit file by wrapping it with `$()` like so: `$(dataset)`.

Our updated submit file might look like this:

We have provided the file `many-R.submit` with these changes already applied. Take a look at this modified submit file with:

```bash
cat many-R.submit
@@ -160,7 +157,6 @@ cat many-R.submit

Now we can submit our new submit file using `condor_submit` again:


```bash
condor_submit many-R.submit
```
@@ -171,14 +167,12 @@ Notice that we have now queued 7 jobs using one submit file!

We can check on the status of our 7 jobs using `condor_q`:


```bash
condor_q
```

Once our jobs are done, we can also review our output files:


```bash
cat output/*.csv.out
```
7 changes: 6 additions & 1 deletion data/spills_1950_1959.csv
@@ -1 +1,6 @@
SpillNumber,ProgramFacilityName,Street1,Street2,Locality,County,ZIPCode,SWISCode,DECRegion,SpillDate,SpillYear,ReceivedDate,ContributingFactor,Waterbody,Source,CloseDate,MaterialName,MaterialFamily,Quantity,Units,Recovered
207294,AES WESTOVER,720 RIVERSIDE DRIVE,,JOHNSON CITY,Broome,,446,7,1/1/50,1950,10/14/02,Other,,Commercial/Industrial,10/16/02,hydrogen chloride,Hazardous Material,0,Gallons,0
207294,AES WESTOVER,720 RIVERSIDE DRIVE,,JOHNSON CITY,Broome,,446,7,1/1/50,1950,10/14/02,Other,,Commercial/Industrial,10/16/02,hydrogen fluoride,Hazardous Material,0,Gallons,0
911661,KNOLLS ATOMIC BUILDING 1,350 ATOMIC PROJECT ROAD,,BALLSTON SPA,Saratoga,,4642,5,1/1/50,1950,2/1/10,Unknown,NONE,Commercial/Industrial,4/4/13,asbestos,Hazardous Material,0,,0
1505910,"VACANT LOT, FORMER RESIDENCE",37 HUGUENOT AVE,,NEW ROCHELLE,Westchester,,6010,3,9/2/50,1950,9/2/15,Equipment Failure,,Private Dwelling,9/28/15,#2 fuel oil,Petroleum,0,,0
1401615,TOWN OF PIERCEFIELD SAND PIT,OFF MAIN ST,,PIERCEFIELD,St Lawrence,,4568,6,5/15/54,1954,5/15/14,Unknown,,Unknown,8/20/15,#6 fuel oil,Petroleum,0,,0