A collection of scripts for automatic corpus generation on Google Cloud Platform.
- Convert XLSX to newline-delimited JSON.
- Upload data to Cloud Storage.
- Create external tables in BigQuery from data in Cloud Storage.
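
As a rough illustration of the first step: newline-delimited JSON holds one JSON object per line. The sketch below assumes the spreadsheet rows are already available as dictionaries (reading the XLSX file itself is left out). The `rows_to_ndjson` helper and the sample values are hypothetical and are not taken from the scripts.

```python
import json
from datetime import date


def _encode(value):
    # Dates are written as ISO strings (YYYY-MM-DD).
    if isinstance(value, date):
        return value.isoformat()
    raise TypeError(f"cannot serialize {type(value).__name__}")


def rows_to_ndjson(rows):
    """Serialize an iterable of dict rows, one JSON object per line."""
    # ensure_ascii=False keeps Hungarian characters readable in the output.
    return "".join(
        json.dumps(row, ensure_ascii=False, default=_encode) + "\n"
        for row in rows
    )


# Made-up sample rows, for illustration only.
sample = [
    {"cap_id": 1, "nev_magyar": "példa"},
    {"cycle_from": date(2000, 1, 1)},
]
print(rows_to_ndjson(sample), end="")
```

Each output line can be parsed on its own, so large files can be processed without loading them whole.
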
- an existing project on Google Cloud Platform
- the required APIs enabled for that project (the scripts use Cloud Storage and BigQuery)
- a private key for a Google Cloud service account
  - open Credentials
  - create/open a service account
  - go to KEYS
  - ADD KEY > Create new key
  - in the dialog window, create a JSON key
  - place the JSON file in the same folder as the script
  - rename it to "gcp_key.json"
- for the NLP script, a HuSpaCy model has to be installed
  - the larger model (hu_core_news_lg) is recommended
  - it can be installed quickly with pip:
    `pip install https://huggingface.co/huspacy/hu_core_news_lg/resolve/main/hu_core_news_lg-any-py3-none-any.whl`
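
Before running the scripts, it may help to check that the prerequisites above are in place. The following is a minimal sketch, not part of the scripts: `check_key_file` and `model_installed` are hypothetical helpers. The fields checked are ones that service account key files downloaded as JSON contain.

```python
import importlib.util
import json
from pathlib import Path

# Fields found in service account key files downloaded as JSON.
REQUIRED_KEY_FIELDS = {"type", "project_id", "private_key", "client_email"}


def check_key_file(path="gcp_key.json"):
    """Return a list of problems with the key file (empty if none)."""
    key_path = Path(path)
    if not key_path.is_file():
        return [f"{key_path} not found"]
    try:
        key = json.loads(key_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"{key_path} is not valid JSON: {exc}"]
    if not isinstance(key, dict):
        return [f"{key_path} does not contain a JSON object"]
    problems = [
        f"missing field: {name}"
        for name in sorted(REQUIRED_KEY_FIELDS - key.keys())
    ]
    if "type" in key and key["type"] != "service_account":
        problems.append('"type" should be "service_account"')
    return problems


def model_installed(name="hu_core_news_lg"):
    """Return True if the model package can be found by Python."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    for problem in check_key_file():
        print("key file:", problem)
    if not model_installed():
        print("hu_core_news_lg is not installed")
```
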
You can modify the parameters in config.ini, which is structured as follows:

| parameter | expected string |
| --- | --- |
| raw | folder containing raw XLSX data (relative/absolute path) |
| json | output folder for newline-delimited JSON (relative/absolute path) |
| schemas | folder containing schema information as JSON (relative/absolute path) |

| parameter | expected string |
| --- | --- |
| project | GCP project ID |
| bucket | Cloud Storage bucket name |
| dataset | BigQuery dataset name |

| parameter | expected string |
| --- | --- |
| xlsx2jsonl | True/False (turns conversion on/off) |
| storage | True/False (turns Cloud Storage upload on/off) |
| bigquery | True/False (turns BigQuery table generation on/off) |

| parameter | expected string |
| --- | --- |
| nlp | True/False (turns text analysis on/off) |

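
A file with these parameters can be read with Python's standard `configparser` module. The sketch below is only an illustration: the section names (`paths`, `gcp`, `steps`, `nlp`) and all values are assumptions, since only the parameter names are documented here. Check the shipped config.ini for the real section names.

```python
import configparser

# Section names and values are hypothetical; only the parameter
# names come from the tables above.
SAMPLE_CONFIG = """
[paths]
raw = data/raw
json = data/json
schemas = schemas

[gcp]
project = my-project-id
bucket = my-bucket
dataset = my_dataset

[steps]
xlsx2jsonl = True
storage = True
bigquery = False

[nlp]
nlp = False
"""


def load_settings(text):
    """Parse the config text into paths, GCP names and on/off switches."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    paths = dict(parser["paths"])
    gcp = dict(parser["gcp"])
    # getboolean() turns the strings "True"/"False" into real booleans.
    switches = {
        name: parser.getboolean("steps", name)
        for name in ("xlsx2jsonl", "storage", "bigquery")
    }
    switches["nlp"] = parser.getboolean("nlp", "nlp")
    return paths, gcp, switches


paths, gcp, switches = load_settings(SAMPLE_CONFIG)
print(paths["raw"], gcp["bucket"], switches)
```

Using `getboolean` rather than comparing strings avoids treating the text "False" as a truthy value.
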
Schemas are loaded from JSON files.
Table structure is almost identical to cap_pilot_benchmark_xlsx_json.

| name | type | mode |
| --- | --- | --- |
| cap_id | INTEGER | REQUIRED |
| nev_angol | STRING | NULLABLE |
| nev_magyar | STRING | NULLABLE |

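
Since schemas are loaded from JSON files, the table above could be stored as a list of field objects with `name`, `type` and `mode` keys, which is the layout BigQuery uses for JSON schema files. The sketch below shows that layout together with a small row check. The `check_row` helper and the sample rows are hypothetical and do not describe how the scripts validate data.

```python
import json

# The first schema table, written as a BigQuery-style JSON schema.
SCHEMA_JSON = """
[
  {"name": "cap_id", "type": "INTEGER", "mode": "REQUIRED"},
  {"name": "nev_angol", "type": "STRING", "mode": "NULLABLE"},
  {"name": "nev_magyar", "type": "STRING", "mode": "NULLABLE"}
]
"""


def check_row(row, schema):
    """Report unknown columns and empty REQUIRED fields in one row."""
    known = {field["name"] for field in schema}
    problems = [f"unknown column: {name}" for name in row if name not in known]
    for field in schema:
        if field["mode"] == "REQUIRED" and row.get(field["name"]) is None:
            problems.append(f"missing required value: {field['name']}")
    return problems


schema = json.loads(SCHEMA_JSON)
# Made-up rows, for illustration only.
print(check_row({"cap_id": 1, "nev_angol": "example"}, schema))
print(check_row({"nev_magyar": "példa"}, schema))
```
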

| name | type | mode |
| --- | --- | --- |
| szo_id | INTEGER | REQUIRED |
| szoalak | STRING | NULLABLE |
| lemma | STRING | NULLABLE |
| entity_IOB | STRING | NULLABLE |
| POS | STRING | NULLABLE |
| morf_analysis | STRING | NULLABLE |
| dependencia_el | STRING | NULLABLE |
| mondat_id | INTEGER | NULLABLE |


| name | type | mode |
| --- | --- | --- |
| text_id | INTEGER | REQUIRED |
| text_type | INTEGER | NULLABLE |
| exact_date | DATE | NULLABLE |
| cycle_number | INTEGER | NULLABLE |
| parliamentary_id | STRING | NULLABLE |
| text | STRING | NULLABLE |
| napirendi_pont | STRING | NULLABLE |
| video_felszolalas_ido | STRING | NULLABLE |
| video_feszolalas_url | STRING | NULLABLE |
| felszolalas_url | STRING | NULLABLE |
| tokenszam | STRING | NULLABLE |
| major_topic | STRING | NULLABLE |
| COVID | STRING | NULLABLE |
| text_id_old | STRING | NULLABLE |


| name | type | mode |
| --- | --- | --- |
| parliamentary_id | STRING | REQUIRED |
| surname | STRING | NULLABLE |
| first_name | STRING | NULLABLE |
| birth_year | FLOAT | NULLABLE |
| birth_place | STRING | NULLABLE |
| sex | INTEGER | NULLABLE |
| death_date | FLOAT | NULLABLE |
| death_place | STRING | NULLABLE |
| PERSON_POS | INTEGER | NULLABLE |
| change_name | STRING | NULLABLE |
| surname_new | STRING | NULLABLE |
| surname_from | DATE | NULLABLE |
| id_old | STRING | NULLABLE |


| name | type | mode |
| --- | --- | --- |
| mondat_id | INTEGER | REQUIRED |
| raw_text | STRING | NULLABLE |
| text_id | STRING | NULLABLE |


| name | type | mode |
| --- | --- | --- |
| cycle_number | INTEGER | REQUIRED |
| cycle_years_from | INTEGER | NULLABLE |
| cycle_years_to | INTEGER | NULLABLE |
| cycle_from | DATE | NULLABLE |
| cycle_to | DATE | NULLABLE |


| name | type | mode |
| --- | --- | --- |
| party_id | INTEGER | REQUIRED |
| party_name_full_HUN | STRING | NULLABLE |
| party_name_full_HUN_from | DATE | NULLABLE |
| party_name_full_HUN_to | DATE | NULLABLE |
| party_name2_full_HUN | STRING | NULLABLE |
| party_name2_full_HUN_from | DATE | NULLABLE |
| party_name2_full_HUN_to | DATE | NULLABLE |
| party_name3_full_HUN | STRING | NULLABLE |
| party_name_full3_HUN_from | DATE | NULLABLE |
| party_name_full3_HUN_to | DATE | NULLABLE |

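
For orientation, the upload and table-creation steps could look roughly like this with the `google-cloud-storage` and `google-cloud-bigquery` client libraries. This is a sketch under assumptions, not the scripts' actual code: the function names and arguments are hypothetical, and the client libraries are imported inside the functions so that the module loads even where they are not installed.

```python
def gcs_uri(bucket, blob_name):
    """Build the gs:// URI of an object in a Cloud Storage bucket."""
    return f"gs://{bucket}/{blob_name}"


def upload_file(key_path, bucket, local_path, blob_name):
    """Upload one local file to Cloud Storage and return its URI."""
    from google.cloud import storage  # requires google-cloud-storage

    client = storage.Client.from_service_account_json(key_path)
    client.bucket(bucket).blob(blob_name).upload_from_filename(local_path)
    return gcs_uri(bucket, blob_name)


def create_external_table(key_path, project, dataset, table_name,
                          schema_path, uri):
    """Create a BigQuery external table over newline-delimited JSON."""
    from google.cloud import bigquery  # requires google-cloud-bigquery

    client = bigquery.Client.from_service_account_json(key_path)
    # The schema file is a JSON list of name/type/mode field objects.
    schema = client.schema_from_json(schema_path)
    table = bigquery.Table(f"{project}.{dataset}.{table_name}", schema=schema)
    external_config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
    external_config.source_uris = [uri]
    table.external_data_configuration = external_config
    return client.create_table(table, exists_ok=True)
```

An external table leaves the data in Cloud Storage, and BigQuery reads the files at query time.
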