Added Stanford tokenizer, sentence splitter, and part of speech tagging to varaha.text #4
@rjurney Would you mind squashing these commits so I can look at a single diff?

Yeah, I can do that. I think you can also do that in the interface? On Sunday, December 29, 2013, Jacob wrote:
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com

Sorry for taking so long to get to this. Overall it looks good. Can we put the UDFs that rely strictly on the Stanford NLP package in their own namespace? varaha.text is getting a little crowded.

Yeah, I'll do that. On Tuesday, January 14, 2014, Jacob wrote:
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com
register ../../lib/stanford-postagger-withModel.jar
register ../../target/varaha-1.0-SNAPSHOT.jar
reviews = LOAD 'data/ten.avro' USING AvroStorage;
foo = FOREACH reviews GENERATE business_id, varaha.text.StanfordTokenize(text) AS tagged;
DUMP foo;
(41J1FgfIsmsLRCZ3QILG6w,{(truly),(impressive),(facility),(came),(for),(two),(books),(not),(knowing),(this),(location),(-LRB-),(normally),(Appaloosa),(-RRB-),(The),(staff),(was),(very),(helpful),(and),(found),(what),(wanted),(very),(quickly),(was),(there),(minutes),(tops),(would),(highly),(recommend),(this),(Library),(anyone),(interested),('ll),(coming),(back),(very),(soon),(for),(next),(batch)})
(4YX4ZtUqs6xtcc4AdjbpeQ,{(Other),(circle),(are),(much),(cleaner),(than),(this),(one),(The),(best),(thing),(about),(this),(store),(the),(Employees),(are),(friendly),(and),(nice),('ve),(been),(this),(location),(the),(morning),(and),(the),(evening),(and),(there),(must),(point),(where),(the),(shift),(changes),(and),(they),(stop),(cleaning),(the),(bathrooms),(and),(emptying),(the),(trash),(the),(morning),(everything),(clean),(the),(time),(evening),(rolls),(around),(there),(are),(odd),(smells),(all),(over),(the),(store),(shame),(since),(larger),(newer),(looking),(store),(that),(n't),(cleaner),('ll),(back),(hopes),(they),(clean),(little),(more)})
(5kRug3bEienrpovtPRVVwg,{(Went),(with),(husband),(Richardson),(Rokerij),(for),(the),(first),(time),(raved),(about),(this),(place),(went),(Wednesday),(night),(with),(reservation),(The),(wait),(was),(about),(hour),(Luckily),(there),(were),(bar),(seats),(that),(became),(available),(took),(them),(ordered),(the),(cheese),(flatbread),(appetizer),(and),(was),(delicious),(had),(large),(salad),(for),(dinner),(which),(was),(perfect),(was),(not),(very),(hungry),(husband),(had),(the),(chicken),(enchiladas),(that),(tasted),(and),(were),(very),(good),(The),(food),(cooked),(order),(did),(take),(while),(get),(our),(meal),(but),(was),(worth),(the),(wait),(and),(service),(was),(excellent),(While),(waiting),(chatted),(with),(several),(people),(the),(bar),(and),(one),(couple),(offered),(taste),(their),(appetizer),(returned),(the),(favor),(when),(flatbread),(came),(One),(more),(thing),(not),(leave),(without),(getting),(the),(decadent),(truffle),(dessert),(Heavenly),(but),(not),(over),(done),(any),(way),(All),(all),(great),(experience),(recommend),(reservations)})
reviews = LOAD 'data/ten.avro' USING AvroStorage();
reviews = LIMIT reviews 1000;
bar = FOREACH reviews GENERATE business_id, FLATTEN(varaha.text.SentenceTokenize(text)) AS tokenized_sentences;
bar = FOREACH bar GENERATE business_id, varaha.text.StanfordPOSTagger(tokenized_sentences) AS tagged;
DUMP bar;
(6VRbbNQe5ouWmwsMebUMkg,{(My,PRP$),(friend,NN),(added,VBD),(some,DT),(sugar,NN),(to,TO),(it,PRP),(and,CC),(it,PRP),(turned,VBD),(okay/good,NN),(.,.)})
(6VRbbNQe5ouWmwsMebUMkg,{(Entrees,NNS),(average,VBP),(about,IN),($,$),(10,CD),(-,:),($,$),(13,CD),(.,.)})
(6VRbbNQe5ouWmwsMebUMkg,{(Naan,NN),(ranges,NNS),(from,IN),(about,IN),($,$),(1.50,CD),(-,:),($,$),(3,CD),(.,.)})
(6VRbbNQe5ouWmwsMebUMkg,{(Appetizers,NNS),(during,IN),(happy,JJ),(hour,NN),(range,NN),(from,IN),($,$),(3,CD),(-,:),($,$),(8,CD),(+,CC),(.,.)})
(6VRbbNQe5ouWmwsMebUMkg,{(Add,VB),(in,IN),(alcohol,NN),(and,CC),(you,PRP),('re,VBP),(looking,VBG),(at,IN),(a,DT),(not,RB),(inexpensive,JJ),(meal,NN),(but,CC),(definitely,RB),(good,JJ),(quality,NN),(.,.)})
(6oRAC4uyJCsJl1X0WZpVSA,{(love,VB),(the,DT),(gyro,NN),(plate,NN),(.,.)})
(6oRAC4uyJCsJl1X0WZpVSA,{(Rice,NNP),(is,VBZ),(so,RB),(good,JJ),(and,CC),(I,PRP),(also,RB),(dig,VBP),(their,PRP$),(candy,NN),(selection,NN),(:,:),(-RRB-,-RRB-)})
reviews = LOAD 'data/ten.avro' USING AvroStorage();
reviews = LIMIT reviews 1000;
bar = FOREACH reviews GENERATE business_id, varaha.text.StanfordPOSTagger(varaha.text.StanfordTokenize(text)) AS tokens;
DUMP bar;
(-UnYs8XvV1M983xZoREdng,{(have,VB),(say,VB),(loved,NN),(Vino,NNP),(First,NNP),(off,RB),(very,RB),(unpretentious,JJ),(not,RB),(very,RB),(knowledgeable,JJ),(about,IN),(wine,NN),(tend,VBP),(shy,JJ),(away,RB),(from,IN),(places,NNS),(that,WDT),(have,VBP),(attitude,NN),(also,RB),(had,VBD),(one,CD),(the,DT),(1000,CD),(outstanding,JJ),(Groupons,NNS),(about,IN),(expire,VBP),(And,CC),(spite,NN),(the,DT),(fact,NN),(that,IN),(just,RB),(about,IN),(everyone,NN),(coming,VBG),(that,IN),(evening,NN),(had,VBD),(Groupon,NNP),(the,DT),(staff,NN),(was,VBD),(fantastic,JJ),(they,PRP),(not,RB),(have,VBP),(kitchen,NN),(all,DT),(appetizers,NNS),(are,VBP),(cold,JJ),(but,CC),(had,VBD),(nice,JJ),(cheese,NN),(plate,NN),(which,WDT),(included,VBD),(cheeses,NNS),(olives,NNS),(nuts,NNS),(grapes,NNS),(and,CC),(dried,VBD),(fruit,NN),(only,RB),(complaint,NN),(was,VBD),(that,IN),(the,DT),(lahvosh-like,JJ),(crackers,NNS),(were,VBD),(really,RB),(oily,JJ),(and,CC),(not,RB),(good,JJ),(all,DT),(Lose,VB),(those,DT),(and,CC),(would,MD),(have,VB),(been,VBN),(much,RB),(better,RBR),(for,IN),(the,DT),(wine,NN),(was,VBD),(actually,RB),(better,JJR),(than,IN),(expected,VBN),(Although,IN),(n't,RB),(generally,RB),(care,VB),(for,IN),(really,RB),(sweet,JJ),(wines,NNS),(both,CC),(the,DT),(Summer,NN),(Rain,NN),(and,CC),(Peachy,JJ),(Keen,JJ),(were,VBD),(really,RB),(enjoyable,JJ),(just,RB),(think,VB),(them,PRP),(more,RBR),(crisp,JJ),(summer,NN),(beverage,NN),(than,IN),(wine,NN),(was,VBD),(surprised,VBN),(like,IN),(the,DT),(Pinot,NNP),(Grigio,NNP),(much,RB),(did,VBD),(and,CC),(may,MD),(have,VB),(purchased,VBN),(bottle,NN),(but,CC),(was,VBD),(not,RB),(available,JJ),(that,IN),(evening,NN),(The,DT),(Miscela,NNP),(Italian,NNP),(blend,VB),(was,VBD),(miss,VB),(for,IN),(-LRB-,-LRB-),(too,RB),(acidic,JJ),(for,IN),(taste,NN),(-RRB-,-RRB-),(but,CC),(the,DT),(Malbec,NNP),(was,VBD),(better,JJR),(For,IN),(after,IN),(dinner,NN),(wines,NNS),(the,DT),(Grande,NNP),(Finale,NNP),(was,VBD),(over-the-top,JJ),(sweet,JJ),(would,MD),(probably,RB),(not,RB),(drink,VB),(more,JJR),(than,IN),(tasting,NN),(The,DT),(Porto,NNP),(Cocoa,NNP),(however,RB),(was,VBD),(fantastic,JJ),(generally,RB),(stay,VB),(away,RB),(from,IN),(Port,NNP),(because,IN),(dislike,NN),(the,DT),(brandy,NN),(burn,VBP),(But,CC),(one,CD),(whiff,NN),(this,DT),(and,CC),(was,VBD),(hooked,VBN),(before,IN),(tasted,VBN),(While,IN),(not,RB),(like,IN),(terribly,RB),(sweet,JJ),(you,PRP),(definitely,RB),(get,VBP),(the,DT),(essence,NN),(chocolate,NN),(bought,VBD),(bottle,NN),(take,VB),(home,NN),(fact,NN),(but,CC),(only,RB),(saw,VBD),(one,CD),(wee,NN),(little,JJ),(glass,NN),(husband,NN),(apparently,RB),(mistook,VBD),(for,IN),(Yoo-hoo,NN),(and,CC),(drank,VBD),(the,DT),(rest,NN),(Great,JJ),(place,NN),(begin,VB),(your,PRP$),(evening,NN),(And,CC),(because,IN),(many,JJ),(these,DT),(young,JJ),(wines,NNS),(are,VBP),(sweeter,JJR),(even,RB),(non-wine-drinking,JJ),(husband,NN),(enjoyed,VBN)})
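
The fully qualified UDF names in the examples above are long, and they would get longer if the Stanford-backed UDFs move into their own namespace as requested in the review. As a minimal sketch, Pig's DEFINE statement can give them short aliases. This assumes both UDFs can be constructed with no arguments, which the direct calls above suggest; the alias names Tokenize and POSTag are illustrative and not part of varaha.

```pig
REGISTER ../../lib/stanford-postagger-withModel.jar;
REGISTER ../../target/varaha-1.0-SNAPSHOT.jar;

-- Short aliases for the fully qualified UDF class names.
-- The alias names are hypothetical; only the class names come from this PR.
DEFINE Tokenize varaha.text.StanfordTokenize();
DEFINE POSTag   varaha.text.StanfordPOSTagger();

reviews = LOAD 'data/ten.avro' USING AvroStorage();
reviews = LIMIT reviews 1000;

-- Same pipeline as the last example: tokenize each review, then tag the tokens.
tagged = FOREACH reviews GENERATE business_id, POSTag(Tokenize(text)) AS tokens;
DUMP tagged;
```

With aliases in place, a later namespace change would only require updating the two DEFINE lines rather than every call site.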