pull data from wikipedia

given an ISO language code (de, en, ru), find X\* random articles on wikipedia in that language (open question: how to deal with language-specific urls for Wikipedia:Random?)
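One possible way around the language-specific url question is to skip the localized page name and ask the MediaWiki API of each language edition for random titles instead. The sketch below assumes the language code doubles as the Wikipedia subdomain (true for de, en, ru, not guaranteed for every code). The function names and the User-Agent string are made up for illustration, and the API caps how many titles one request may return, so larger X would need several calls.

```python
import json
import urllib.parse
import urllib.request


def random_titles_url(lang: str, count: int) -> str:
    """Build a MediaWiki API URL asking for `count` random article titles."""
    params = {
        "action": "query",
        "list": "random",
        "rnnamespace": 0,  # main (article) namespace only
        "rnlimit": count,
        "format": "json",
    }
    # Assumption: the language code is also the Wikipedia subdomain.
    return f"https://{lang}.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params)


def fetch_random_titles(lang: str, count: int) -> list[str]:
    """Fetch random article titles. Performs a network request."""
    request = urllib.request.Request(
        random_titles_url(lang, count),
        # Hypothetical User-Agent; replace with a real project name and contact.
        headers={"User-Agent": "ngram-corpus-sketch/0.1"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return [page["title"] for page in payload["query"]["random"]]
```

Calling `fetch_random_titles("de", 5)` would hit de.wikipedia.org; the url builder is kept separate so it can be checked without network access.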

get those files into lang/dirty
strip the useful text of those files into lang/clean
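A rough sketch of the dirty-to-clean step, assuming the raw downloads are HTML files named `*.html` under `lang/dirty`. "Useful text" is approximated here as whatever sits inside `<p>` elements, which is a heuristic and not something this issue specifies. `ParagraphText`, `strip_html` and `clean_directory` are hypothetical names.

```python
from html.parser import HTMLParser
from pathlib import Path


class ParagraphText(HTMLParser):
    """Collect text found inside <p> elements, ignoring <script>/<style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.paragraph_depth = 0
        self.skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.paragraph_depth += 1
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.paragraph_depth:
            self.paragraph_depth -= 1
            self.chunks.append("\n")  # one line per paragraph
        elif tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.paragraph_depth and not self.skip_depth:
            self.chunks.append(data)


def strip_html(html: str) -> str:
    """Return paragraph text with whitespace collapsed, one paragraph per line."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    lines = "".join(parser.chunks).splitlines()
    return "\n".join(" ".join(line.split()) for line in lines if line.strip())


def clean_directory(lang: str, root: Path = Path(".")) -> None:
    """Write a stripped .txt into lang/clean for every .html in lang/dirty."""
    dirty = root / lang / "dirty"
    clean = root / lang / "clean"
    clean.mkdir(parents=True, exist_ok=True)
    for source in dirty.glob("*.html"):
        text = strip_html(source.read_text(encoding="utf-8"))
        (clean / (source.stem + ".txt")).write_text(text, encoding="utf-8")
```

Keeping the raw files in `lang/dirty` untouched means the stripping rule can be changed later and rerun without downloading anything again.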

learn from those files instead of dictionaries to get more realistic ngrams (dictionaries overrepresent patterns from rare words)
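To illustrate the difference, here is a toy comparison, assuming the ngrams in question are character ngrams within words (the issue does not say so explicitly). A dictionary counts every word once, while running text counts each word as often as it is actually used. The helper names and sample sentence are made up.

```python
from collections import Counter


def char_ngrams(word: str, n: int):
    """Yield every run of n consecutive characters in word."""
    return (word[i:i + n] for i in range(len(word) - n + 1))


def ngram_counts(words, n: int) -> Counter:
    """Count character ngrams over an iterable of words."""
    counts = Counter()
    for word in words:
        counts.update(char_ngrams(word.lower(), n))
    return counts


# Toy data: the same vocabulary as running text and as a word list.
running_text = "the cat and the dog and the bird".split()
dictionary = sorted(set(running_text))

text_counts = ngram_counts(running_text, 2)
dict_counts = ngram_counts(dictionary, 2)

print(text_counts["th"], dict_counts["th"])  # 3 1
```

In the word list, "th" weighs the same as "bi" from the single rare word; in the running text it is three times as common, which is closer to what a reader actually sees.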

\* likely dependent on how high the n in your ngram is.


pull data from wikipedia #3