Technical Considerations
For every task, we present one possible way (out of several, potentially very different ones) to mine Wikipedia's taxonomy and then article parts such as the title, infobox, and summary. Note that this text represents an opinion: there is no evaluation backing this technical discussion, but you will find many people discussing along these lines on Stack Overflow and elsewhere.
I do not want to go into accessing Wikipedia through its API. It is the most restricted route, because whenever you want an article's content, you have to send an HTTP request. When you read the documentation, you find out that requesting a small number of pages (say, fewer than 500) is okay, but you should not request more. Imagine every Wikipedia researcher posting many requests all the time...
For Python, SPARQLWrapper is a good way of accessing SPARQL endpoints.
from SPARQLWrapper import SPARQLWrapper, JSON
sparql = SPARQLWrapper('http://dbpedia.org/sparql')
sparql.setReturnFormat(JSON)
# Example query you may want to use: it returns all articles from depth 0 to depth 5
# in the subcategory tree below Category:Computer_languages.
query = """
SELECT DISTINCT ?article where {
?article dct:subject/skos:broader{0,5} http://dbpedia.org/resource/Category:Computer_languages.
}
ORDER BY ASC(?article)
"""
sparql.setQuery(query)
res = sparql.query().convert()
While this is a convenient way to access some important content, such as the summary of every article and the names of the templates an article uses, it has its limitations.
- Whenever you want to get more than 10,000 results you have to work with offsets and subqueries, because the endpoint returns at most 10,000 rows per query (see the paging sketch after this list). This is still an okay thing to do, but even this approach has its limits.
- You might want to be lazy and let the endpoint extract all articles in your scope using the example query from above. When you set the maximum depth to a larger number (>12, I think; I have not tested this in a while), the endpoint runs out of memory. You then have to come up with a more elaborate solution, such as expanding subcategories level by level from a frontier (also sketched below). In our case this performed badly, because we again had to post thousands of requests to the endpoint. Use the dumps for big data processing.
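To work around the 10,000-row cap, a paging loop over LIMIT/OFFSET is usually enough. Below is a minimal sketch with SPARQLWrapper; the page size is an assumption, and for some queries Virtuoso wants the ordered SELECT wrapped in a subquery before LIMIT/OFFSET is applied.
from SPARQLWrapper import SPARQLWrapper, JSON
# Page through a result set that exceeds the endpoint's per-query cap.
sparql = SPARQLWrapper('http://dbpedia.org/sparql')
sparql.setReturnFormat(JSON)
PAGE_SIZE = 10000  # assumption: the endpoint's result cap
articles = []
offset = 0
while True:
    sparql.setQuery("""
        PREFIX dct: <http://purl.org/dc/terms/>
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
        SELECT DISTINCT ?article WHERE {
          ?article dct:subject/skos:broader{0,5} <http://dbpedia.org/resource/Category:Computer_languages> .
        }
        ORDER BY ASC(?article)
        LIMIT %d OFFSET %d
    """ % (PAGE_SIZE, offset))
    bindings = sparql.query().convert()['results']['bindings']
    if not bindings:
        break
    articles.extend(b['article']['value'] for b in bindings)
    offset += PAGE_SIZE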
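For the frontier-based variant, a breadth-first expansion that batches the current frontier into a single VALUES clause keeps the request count at one per level instead of one per category. This is only a sketch under the assumption that each frontier fits into a single query; otherwise chunk it.
from SPARQLWrapper import SPARQLWrapper, JSON
# Expand the category tree level by level from a frontier of categories.
sparql = SPARQLWrapper('http://dbpedia.org/sparql')
sparql.setReturnFormat(JSON)
root = 'http://dbpedia.org/resource/Category:Computer_languages'
seen = {root}
frontier = [root]
for depth in range(5):  # maximum depth; pick whatever fits your scope
    values = ' '.join('<%s>' % c for c in frontier)
    sparql.setQuery("""
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
        SELECT DISTINCT ?sub WHERE {
          VALUES ?parent { %s }
          ?sub skos:broader ?parent .
        }
    """ % values)
    bindings = sparql.query().convert()['results']['bindings']
    frontier = [b['sub']['value'] for b in bindings if b['sub']['value'] not in seen]
    seen.update(frontier)
    if not frontier:
        break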
- Download dumps here https://dumps.wikimedia.org/enwiki/ or from mirrors for better download speed, e.g., ftp://ftp.acc.umu.se/mirror/wikimedia.org/dumps/enwiki/20190220/
- For extracting articles and categories under certain root categories, you will need the 'page.sql' and the 'categorylinks.sql' dumps.
- Load the SQL dumps into a mysql database (Beware: Follow the advice at https://stackoverflow.com/questions/30387731/loading-enwiki-latest-categorylinks-sql-into-mysql for better performance when loading the dumps. For Windows, you might want to install Windows-grep and cygwin.)
- You can join the tables on categorylinks.cl_from = page.page_id (as advised here: https://stackoverflow.com/questions/4789843/a-sql-query-that-acquires-the-list-of-categories-given-a-page-title-from-wikiped)
- You may want to dump the results of a join into a CSV file. If you run your mysql server locally, the easiest way is to use INTO OUTFILE '<path.csv>' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n'. You can speed up the export by further refining the join query from the previous step: if you only want subcategory relationships, add WHERE page.page_namespace = 14; if you only want article relationships, add WHERE page.page_namespace = 0. (A sketch of the full statement follows after this list.)
- If you want to read the CSVs using Python's csv module, read the advice here, as you will encounter escaped quotes inside titles: https://stackoverflow.com/questions/23897193/handling-escaped-quotes-with-pythons-csv-reader (see also the reading sketch after this list)
- For mining article texts, download the pages-articles-multistream.xml dump.
- From here on, the choice of working format is up to you. I chose to keep working with CSV. Beware here, as the character combination ',"' that appears in some article texts may mess up your CSV processing.
- The WikiOnto scripts check whether an article is in scope and then extract its text (a generic sketch of such a pass follows after this list).
- Mind the notice from the dumps page: The 7zip decoder on Windows is known to have problems with some bz2-format files for larger wikis; we recommend the use of bzip2 for Windows for these cases.
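For the join and CSV export, the following sketch assembles the full statement so it can be pasted into the mysql client. The selected columns (page.page_title and categorylinks.cl_to), the namespace filter, and the output path are assumptions you will want to adapt; the path must be writable by the MySQL server.
# Assemble the join/export statement described in the steps above.
NAMESPACE = 14  # 14: subcategory relationships, 0: article relationships
OUTFILE = '/tmp/category_links.csv'  # assumption: adjust to a server-writable path
export_query = """
SELECT page.page_title, categorylinks.cl_to
FROM categorylinks
JOIN page ON categorylinks.cl_from = page.page_id
WHERE page.page_namespace = {ns}
INTO OUTFILE '{path}'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\\n';
""".format(ns=NAMESPACE, path=OUTFILE)
print(export_query)  # paste the printed statement into the mysql client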
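Reading the export back in Python then requires telling the csv module about MySQL's backslash escaping (the default ESCAPED BY '\'), since quotes inside titles are written as \" rather than doubled. The file name and column layout follow the export sketch above and are assumptions.
import csv
# Read the MySQL export; quotes inside titles are backslash-escaped, not doubled.
in_scope = set()
with open('/tmp/category_links.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f, delimiter=',', quotechar='"', doublequote=False, escapechar='\\')
    for page_title, category in reader:
        in_scope.add(page_title)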
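The following is not the WikiOnto implementation, only a minimal sketch of how such a scope-check-and-extract pass over the pages-articles-multistream dump might look, using xml.etree's iterparse and the in_scope set from the reading sketch above (both assumptions). Note that the SQL dumps store titles with underscores while the XML dump uses spaces, so one side has to be normalized. If you write the extracted text back to CSV, quoting every field (csv.QUOTE_ALL) helps with the ',"' combination mentioned above.
import bz2
import xml.etree.ElementTree as ET
# Namespace of current dumps; double-check the export version in your file's header.
NS = '{http://www.mediawiki.org/xml/export-0.10/}'
def iter_scoped_pages(path, in_scope):
    with bz2.open(path, 'rb') as f:  # the multistream .bz2 can be read directly
        for event, elem in ET.iterparse(f, events=('end',)):
            if elem.tag == NS + 'page':
                title = elem.findtext(NS + 'title')
                text = elem.findtext('{0}revision/{0}text'.format(NS))
                # SQL dumps use underscores in titles, the XML dump uses spaces.
                if title and title.replace(' ', '_') in in_scope:
                    yield title, text
                elem.clear()  # drop the finished page element to keep memory flat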
(Side note: one might think that the abstract dumps generated for Yahoo (available online, e.g., at https://dumps.wikimedia.org/enwiki/20180901/) are useful. Be careful when using them, because many of them still contain template fragments where templates were not properly resolved.)