This repository contains instructions on how to extract and transform OpenAlex data for data analysis with Google BigQuery.
The following packages are required for this workflow.
OpenAlex snapshots are available through AWS. Instructions for downloading can be found here: https://docs.openalex.org/download-all-data/download-to-your-machine.
$ aws s3 sync 's3://openalex' 'openalex-snapshot' --no-sign-request
To reduce the size of the data stored in BigQuery, some data transformation
is applied to the works entity. Data transformation is
carried out on the High Performance Cluster of the
GWDG Göttingen. However, you can also
use the script on other servers with only minor adjustments. Entities
like authors, publishers, institutions, funders and sources
are not affected by the data transformation step.
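
As a rough illustration of what such a transformation step involves, the sketch below streams one gzipped JSON Lines file (the format of the OpenAlex snapshot part files) through a transform function and writes a gzipped JSON Lines file back out. It is not the script used in this repository: the function name `rewrite_part_file` and the `transform` hook are hypothetical, chosen only for this example.

```python
import gzip
import json
from pathlib import Path
from typing import Callable


def rewrite_part_file(src: Path, dst: Path, transform: Callable[[dict], dict]) -> int:
    """Stream one gzipped JSON Lines file through `transform`.

    Hypothetical helper for illustration; returns the number of records written.
    """
    dst.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with gzip.open(src, "rt", encoding="utf-8") as fin, \
            gzip.open(dst, "wt", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines instead of failing on them
            record = transform(json.loads(line))
            # One JSON object per line keeps the output loadable as
            # newline-delimited JSON.
            fout.write(json.dumps(record) + "\n")
            count += 1
    return count
```

Processing line by line keeps memory use flat regardless of file size, and the output stays in the gzipped, newline-delimited form that the `bq load` command in this document expects.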
$ sbatch openalex_works_hpc.sh
Files can be uploaded to a Google Bucket using gsutil. Note that only
data in the works entity has been transformed. All other data can be found
in openalex-snapshot/data.
$ gsutil -m cp -r /scratch/users/haupka/works gs://bigschol
Use bq load to create a table in BigQuery with data stored in a
Google Bucket. Schemas for the tables can be found here.
$ bq load --ignore_unknown_values --source_format=NEWLINE_DELIMITED_JSON subugoe-collaborative:openalex.works gs://bigschol/works/*.gz schema_openalex_work.json
- The following fields are not included in the works schema: mesh, related_works, concepts.
- An additional field, has_abstract, is added during the data transformation step; it replaces the field abstract_inverted_index.
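
The two notes above can be sketched at the level of a single works record. This is a minimal illustration, not the repository's code: `transform_work` is a hypothetical name, and it assumes that has_abstract is a boolean that is true when the record carries a non-empty abstract_inverted_index.

```python
import json

# Fields that the notes above list as absent from the works schema.
DROPPED_FIELDS = ("mesh", "related_works", "concepts")


def transform_work(record: dict) -> dict:
    """Return a slimmed-down copy of one works record (hypothetical helper)."""
    slim = {key: value for key, value in record.items() if key not in DROPPED_FIELDS}
    # Assumption: has_abstract is a boolean flag that replaces the
    # (potentially large) abstract_inverted_index field.
    inverted_index = slim.pop("abstract_inverted_index", None)
    slim["has_abstract"] = bool(inverted_index)
    return slim


if __name__ == "__main__":
    # Made-up record for demonstration only.
    example = {
        "id": "W1",
        "mesh": [],
        "related_works": ["W2"],
        "concepts": [],
        "abstract_inverted_index": {"Open": [0], "data": [1]},
    }
    print(json.dumps(transform_work(example)))
```

Replacing the inverted index with a flag keeps the information that an abstract exists while dropping what is often one of the largest fields in a works record.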