cap-benchmark-gcp

A collection of scripts for automatic corpus generation on Google Cloud Platform.

Convert XLSX to newline-delimited JSON.
Upload data to Cloud Storage.
Create external tables in BigQuery from data in Cloud Storage.

Prerequisites

existing project on Google Cloud Platform.
APIs enabled
private key for a Google Cloud Service Account
- open Creditentials
- create/open a service account
- go to KEYS
- ADD KEY > Create new key
- in the dialog window create a JSON key
- place the JSON file in the same folder as the script
- rename it to "gcp_key.json"
for the NLP script you have to download a HuSpaCy model
- the larger model is recommended (hu_core_news_lg)
- you can do this quickly the following way:
```
pip install https://huggingface.co/huspacy/hu_core_news_lg/resolve/main/hu_core_news_lg-any-py3-none-any.whl
```

Usage

You can modify the parameters in config.ini, which is structured the following way:

[local]

parameter	expected string
raw	folder containing raw XLSX data (relative/absolute path)
json	output folder for newline-delimited JSON (relative/absolute path)
schemas	folder containing schema information as JSON (relative/absolute path)

[cloud]

parameter	expected string
project	GCP project ID
bucket	Cloud Storage bucket name
dataset	BigQuery dataset name

[options]

parameter	expected string
xlsx2jsonl	True/False (turns conversion on/off)
storage	True/False (turns Cloud Storage upload on/off)
bigquery	True/False (turns BigQuery table generation on/off)

[nlp]

parameter	expected string
nlp	True/False (turns text analysis on/off)

Schema information

Schemas are loaded from JSON files. Table structure is almost identical to cap_pilot_benchmark_xlsx_json.

cap_kod

name	type	mode
cap_id	INTEGER	REQUIRED
nev_angol	STRING	NULLABLE
nev_magyar	STRING	NULLABLE

conll_kod

name	type	mode
szo_id	INTEGER	REQUIRED
szoalak	STRING	NULLABLE
lemma	STRING	NULLABLE
entity_IOB	STRING	NULLABLE
POS	STRING	NULLABLE
morf_analysis	STRING	NULLABLE
dependencia_el	STRING	NULLABLE
mondat_id	INTEGER	NULLABLE

felszolalas

name	type	mode
text_id	INTEGER	REQUIRED
text_type	INTEGER	NULLABLE
exact_date	DATE	NULLABLE
cycle_number	INTEGER	NULLABLE
parliamentary_id	STRING	NULLABLE
text	STRING	NULLABLE
napirendi_pont	STRING	NULLABLE
video_felszolalas_ido	STRING	NULLABLE
video_feszolalas_url	STRING	NULLABLE
felszolalas_url	STRING	NULLABLE
tokenszam	STRING	NULLABLE
major_topic	STRING	NULLABLE
COVID	STRING	NULLABLE
text_id_old	STRING	NULLABLE

kepviselo

name	type	mode
parliamentary_id	STRING	REQUIRED
surname	STRING	NULLABLE
first_name	STRING	NULLABLE
birth_year	FLOAT	NULLABLE
birth_place	STRING	NULLABLE
sex	INTEGER	NULLABLE
death_date	FLOAT	NULLABLE
death_place	STRING	NULLABLE
PERSON_POS	INTEGER	NULLABLE
change_name	STRING	NULLABLE
surname_new	STRING	NULLABLE
surname_from	DATE	NULLABLE
id_old	STRING	NULLABLE

mondat

name	type	mode
mondat_id	INTEGER	REQUIRED
raw_text	STRING	NULLABLE
text_id	STRING	NULLABLE

parlamenti_ciklus

name	type	mode
cycle_number	INTEGER	REQUIRED
cycle_years_from	INTEGER	NULLABLE
cycle_years_to	INTEGER	NULLABLE
cycle_from	DATE	NULLABLE
cycle_to	DATE	NULLABLE

part

name	type	mode
party_id	INTEGER	REQUIRED
party_name_full_HUN	STRING	NULLABLE
party_name_full_HUN_from	DATE	NULLABLE
party_name_full_HUN_to	DATE	NULLABLE
party_name2_full_HUN	STRING	NULLABLE
party_name2_full_HUN_from	DATE	NULLABLE
party_name2_full_HUN_to	DATE	NULLABLE
party_name3_full_HUN	STRING	NULLABLE
party_name_full3_HUN_from	DATE	NULLABLE
party_name_full3_HUN_to	DATE	NULLABLE

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
data_jsonl		data_jsonl
data_raw		data_raw
schemas		schemas
src		src
README.md		README.md
config.ini		config.ini
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

cap-benchmark-gcp

Prerequisites

Usage

[local]

[cloud]

[options]

[nlp]

Schema information

cap_kod

conll_kod

felszolalas

kepviselo

mondat

parlamenti_ciklus

part

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

cap-benchmark-gcp

Prerequisites

Usage

[local]

[cloud]

[options]

[nlp]

Schema information

cap_kod

conll_kod

felszolalas

kepviselo

mondat

parlamenti_ciklus

part

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages