Home
Welcome to the DHOxSS2016 wiki!
To-dos and major issues are being tracked in Zenhub.
Notes from Andrea - Bertram has tasked me with converting the dates in the "index" file to the ISO standard. So, I'm going to follow my own advice and document things as I go along.
Steps are as follows:
- I reviewed the data to get a sense of trends and common structure. I made a text facet (click the triangle; select "Facet > Text facet") and noticed that some of the "n.d."s (for "no date") were inconsistently punctuated, so I normalized those.
- I clicked through the OpenRefine menus to see if there was a "convert to ISO date" function that I'd forgotten about. There was not :(
- I took to google! Google input: "convert to date open refine". Google output: https://github.com/OpenRefine/OpenRefine/wiki/GREL-Date-Functions
- Nice! Looks like there's a "convert to date" function within GREL. It is:
toDate(o, boolean month_first / format1, format2, ... )
Returns o converted to a date object.
All other arguments are optional:
month_first: set false if the date is formatted with the day before the month.
formatN: attempt to parse the date using an ordered list of possible formats. See SimpleDateFormat for the syntax.
Examples: You can parse the cells "Nov-09" and "11/09" using
value.toDate('MM/yy','MMM-yy').toString('yyyy-MM')
For a date of the form "1/4/2012 13:30:00", use the GREL function:
toDate(value,"dd/MM/yyyy H:m:s")
(In SimpleDateFormat syntax, MM is the month and mm is minutes, so the month needs the capital letters.)
- Time to try this on my dataset! I don't want to lose my 'fuzzy' dates (e.g. "18??" or "May 1850"), so instead of transforming my column, I'm going to select Edit Column > Add column based on this column
- I enter value.toDate()
- BAM, I've got dates. However, I'm not sure if these are quite in the right format -- they've got this blanked-out time information, and I don't know if that's what we want.
- Am going to upload the file anyway for everyone to review, and we can amend as necessary -- because remember, OpenRefine is almost infinitely undoable!
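For anyone who wants to try the same idea outside OpenRefine, here is a minimal Python sketch of "try a list of formats, keep the original column, leave fuzzy dates alone". The format list, the function names (`to_iso`, `add_iso_column`) and the column names are all assumptions made up for illustration, not what is actually in the index file. It returns a plain date with no time part, which may or may not be the form the project settles on.

```python
from datetime import datetime

# Candidate formats, tried in order. These are guesses at what a date
# column might contain, not the project's actual list.
FORMATS = ["%d %B %Y", "%B %d, %Y", "%Y-%m-%d"]


def to_iso(value):
    """Return an ISO 8601 date string, or None if no format matches."""
    text = value.strip()
    for fmt in FORMATS:
        try:
            # .date() drops the time part, so the result is just yyyy-mm-dd.
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    # Fuzzy or missing dates ("18??", "May 1850", "n.d.") are not guessed at.
    return None


def add_iso_column(rows, source="date", target="date_iso"):
    """Add a new key to each row instead of overwriting the original value."""
    return [{**row, target: to_iso(row[source])} for row in rows]
```

Because the converted value goes into a new key, the original text (fuzzy dates included) is still there to review afterwards, much like adding a column rather than transforming one.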
Splitting the box/folder/letter column
When I sat down to work on this tonight, I suddenly remembered the .split() recipe. It works the same way as "split column on..." but it gives you much more control over what you can do. You can also combine it with other "ingredients" (e.g. .replace)
When you split a value, you get back an array (nb, this is v. pythonic). So, if you just want one item from the list, you need to call its index. The index starts at 0 -- again, v. pythonic.
Started with value.split('\n')[1], then tried value.split('to'). Faceted to check: noticed some weird "n"s and other things, BUT it also split inside the names of people called "Crofton" (which contains "to").
Changed to value.split(' to ')[1], faceted it, and it worked!
Need to clear the names out of the senders now: made a transform value.split(" to ")[0]
Now need to clear out the notes: '('+value.split(" (")[1] -- this splits on the paren, and the '('+ puts the paren back, since split removes it
Now to delete the notes
At step 21 - done with the index information
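The split steps above can be sketched in Python, since the GREL split behaves much like Python's here (an array back, index from 0). The cell layout "Sender to Recipient (notes)" and the example names are assumptions for illustration; the real column may be shaped differently, and `split_entry` is a made-up helper name.

```python
def split_entry(value):
    """Split a 'Sender to Recipient (notes)' cell into three parts."""
    # Split off the notes on " (". split() removes the separator,
    # so the opening paren is added back by hand.
    pieces = value.split(" (")
    main = pieces[0]
    notes = "(" + pieces[1] if len(pieces) > 1 else ""

    # Split on " to " with the spaces, so a name that merely contains
    # the letters "to" is left whole.
    names = main.split(" to ")
    sender = names[0]                                # index 0: first item
    recipient = names[1] if len(names) > 1 else ""   # index 1: second item
    return sender, recipient, notes


# Made-up example cell
cell = "John Smith to Mary Crofton (draft copy)"
sender, recipient, notes = split_entry(cell)

# What goes wrong without the spaces: the name is cut in the middle.
bad_split = "John Smith to Mary Crofton".split("to")
```

With the made-up cell, `bad_split` comes out with a stray "n" as its last item, because the bare "to" also matches inside "Crofton".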
Moving on to column 2
Clustering organization - used metaphone - this worked fairly well
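To show roughly what key-collision clustering does, here is a small Python sketch. Note that it does NOT use metaphone: it swaps in a simplified fingerprint-style key (lowercase, drop punctuation, sort the unique words), because that fits in a few lines without extra libraries. The function names and the sample values are made up, and OpenRefine's own keying functions do more than this.

```python
import re
from collections import defaultdict


def fingerprint(value):
    """Simplified key: lowercase, drop punctuation, sort the unique tokens."""
    cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values that share a key; keep only groups with variants."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    # A group of one has nothing to merge, so it is left out.
    return [sorted(g) for g in groups.values() if len(g) > 1]


# Made-up organization names
sample = ["Royal Society", "royal society.", "Society, Royal", "British Museum"]
clusters = cluster(sample)
```

The idea is the same whichever key is used: values that produce the same key land in one group, and a person then decides whether to merge them. A phonetic key like metaphone would also catch spelling variants that sound alike, which this simple key will miss.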
Zenhub only works with Chrome and Firefox - and will need to be installed on the Oxford computers
Use scenarios:
- could base on Virgil & Thomer report? this is very specific to mega-large projects though
- Someone who wants to visualize punctuation: https://medium.com/@neuroecology/punctuation-in-novels-8f316d542ec4#.ekcsuz39b
- Someone who wants to create social networks of who talks to who
- Someone who wants to do topic modeling?
OR the day could involve general metadata cleaning, so that everyone's more or less doing the same thing.