Home

Welcome to the DHOxSS2016 wiki!

To-dos and major issues are being tracked in Zenhub.

6 July 2016

Notes from Andrea - Bertram has tasked me with converting the dates in the "index" file to the ISO standard. So, I'm going to follow my own advice and document things as I go along.

Steps are as follows:

I reviewed the data to get a sense of trends and common structure. I made a text facet (click the triangle; select "Facet > Text facet") and noticed that some of the "n.d."s (for "no date") were inconsistently punctuated. So I normalized those.
I clicked through the OpenRefine menus to see if there was a "convert to ISO date" function that I'd forgotten about. There was not :(
I took to Google! Google input: "convert to date open refine". Google output: https://github.com/OpenRefine/OpenRefine/wiki/GREL-Date-Functions
Nice! Looks like there's a "convert to date" function within GREL. It is:

toDate(o, boolean month_first / format1, format2, ... )

Returns o converted to a date object.

All other arguments are optional:

month_first: set false if the date is formatted with the day before the month.

formatN: attempt to parse the date using an ordered list of possible formats. See SimpleDateFormat for the syntax.

Examples: You can parse the cells "Nov-09" and "11/09" using

value.toDate('MM/yy','MMM-yy').toString('yyyy-MM')

For a date of the form "1/4/2012 13:30:00", use the GREL function:

toDate(value,"dd/MM/yyyy H:m:s")

Time to try this on my dataset! I don't want to lose my 'fuzzy' dates (e.g. "18??" or "May 1850"), so instead of transforming my column, I'm going to select Edit column > Add column based on this column.
I enter value.toDate() and BAM, I've got dates. However, I'm not sure if these are quite in the right format -- they've got this blanked-out time information, and I don't know if that's what we want.
Am going to upload the file anyway for everyone to review, and we can amend as necessary -- because remember, OpenRefine is almost infinitely undoable!
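For anyone who wants to see the logic of the steps above outside OpenRefine, here is a rough Python sketch (not what I actually ran). It tries an ordered list of formats, writes a date-only ISO string so there is no time portion, and leaves fuzzy dates untouched. The format list and the function names are hypothetical; the real index file may need different patterns.

```python
import re
from datetime import datetime

# Hypothetical formats to try, in order; adjust to match the real index file.
# Month names (%B) assume an English locale.
FORMATS = ["%d %B %Y", "%B %d, %Y", "%Y-%m-%d"]

def normalize_no_date(cell):
    """Collapse punctuation variants of 'n.d.' (e.g. 'nd', 'N.D') to 'n.d.'."""
    if re.fullmatch(r"n\.?\s*d\.?", cell.strip(), flags=re.IGNORECASE):
        return "n.d."
    return cell

def to_iso(cell):
    """Return YYYY-MM-DD if the whole cell parses, otherwise the cell unchanged."""
    cell = normalize_no_date(cell)
    for fmt in FORMATS:
        try:
            return datetime.strptime(cell.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # this format did not match; try the next one
    return cell  # fuzzy dates such as "18??" or "May 1850" are kept as-is

print(to_iso("6 July 1850"))  # 1850-07-06
print(to_iso("May 1850"))     # May 1850
```

Because unparseable cells come back unchanged, the result can sit in a new column next to the original, which is the same reason for using "Add column based on this column".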

5 July 2016

Notes on how I cleaned the files in the "Open Refine" outputs:

Splitting the box/folder/letter column

When I sat down to work on this tonight, I suddenly remembered the .split() recipe. It works the same way as "split column on..." but it gives you much more control over what you can do. You can also combine it with other "ingredients" (e.g. .replace).

Calling .split() on a value turns it into an array (nb, this is v. pythonic). So, if you just want one element of the array, you need to call its index. Indexing starts at 0 - again, v. pythonic.
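Since this behaves so much like Python, here is the same idea as a small Python sketch. The cell value is invented for illustration; the real box/folder/letter cells may look different.

```python
# Hypothetical cell value, invented for illustration.
value = "Box 1, Folder 2, Letter 3"

parts = value.split(", ")    # splitting returns a list (GREL calls it an array)
first = parts[0]             # indexing starts at 0
second = parts[1]

# Combining split with another "ingredient", here replace:
letter_number = parts[2].replace("Letter ", "")

print(parts)          # ['Box 1', 'Folder 2', 'Letter 3']
print(first)          # Box 1
print(letter_number)  # 3
```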

First tried value.split('\n')[1], then tried value.split('to'). Faceted to check: noticed some weird "n"s and other things. The problem: it also split inside the names of people called "Crofton".

Changed to value.split(' to ')[1] and faceted to check: it worked!
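A quick Python sketch of why the spaces matter, using an invented cell value (the real cells may be formatted differently):

```python
# Hypothetical cell value, invented for illustration.
value = "Smith to Crofton"

# Splitting on the bare letters also cuts inside "Crofton".
loose = value.split("to")

# Splitting on the word with surrounding spaces only cuts between the names.
strict = value.split(" to ")

print(loose)      # ['Smith ', ' Crof', 'n']
print(strict)     # ['Smith', 'Crofton']
print(strict[1])  # Crofton
```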

Now need to clear out the names from the senders. Made a transform: value.split(" to ")[0]

Now need to clear out the notes: '(' + value.split(" (")[1]. This splits on the paren, which also removes the paren, so the expression adds it back at the front.
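Here are the last few splits in one place as a Python sketch. The cell value and variable names are invented for illustration, and this assumes every cell has both " to " and " (" in it.

```python
# Hypothetical cell value, invented for illustration.
value = "Smith to Crofton (draft copy)"

sender = value.split(" to ")[0]        # everything before " to "
rest = value.split(" to ")[1]          # everything after " to "
recipient = rest.split(" (")[0]        # drop the trailing note

# Splitting on " (" removes the paren, so add it back at the front.
# A cell with no " (" would raise IndexError here.
notes = "(" + value.split(" (")[1]

print(sender)     # Smith
print(recipient)  # Crofton
print(notes)      # (draft copy)
```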

Now to delete the notes

At step 21 - done with the index information

Moving on to column 2

Clustering organization - used metaphone - this worked fairly well
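For the curious: metaphone is one of the keying functions in OpenRefine's key collision clustering, which groups values that end up with the same key. The Python sketch below shows the general key collision idea only. It swaps in a simplified fingerprint-style key instead of metaphone (there is no metaphone in the Python standard library), and the organization names are invented.

```python
import re
from collections import defaultdict

def fingerprint(name):
    """Simplified key: lowercase, drop punctuation, sort the unique tokens."""
    cleaned = re.sub(r"[^\w\s]", "", name.strip().lower())
    return " ".join(sorted(set(cleaned.split())))

def cluster(values):
    """Group values sharing a key; keep groups with more than one spelling."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    return [sorted(g) for g in groups.values() if len(g) > 1]

# Hypothetical organization names, invented for illustration.
names = ["Royal Society", "royal society.", "Society, Royal", "British Museum"]
clusters = cluster(names)
print(clusters)  # [['Royal Society', 'Society, Royal', 'royal society.']]
```

A phonetic key such as metaphone would also catch spellings that sound alike, which this simpler key does not.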

[old] Notes [that can probably be deleted]:

Zenhub only works with Chrome and Firefox - and will need to be installed on the Oxford computers

Use scenarios:

could base on Virgil & Thomer report? this is very specific to mega-large projects though
Someone who wants to visualize punctuation: https://medium.com/@neuroecology/punctuation-in-novels-8f316d542ec4#.ekcsuz39b
Someone who wants to create social networks of who talks to who
Someone who wants to do topic modeling?

OR day could involve general metadata cleaning so that everyone's more or less doing the same thing.
