Description
Currently, rows in the scraping table cannot be related to their corresponding extracted product information in the green-db table via an id. To join the tables, we have to match on timestamp, url and category.
We already use the id to retrieve a specific row from the scraping table, but it is not used any further when the extracted product information is written to green-db, see:
green-db/workers/workers/extract.py
Lines 36 to 39 in 90b631b
    scraped_page = CONNECTION_FOR_TABLE[table_name].get_scraped_page(id=row_id)
    if product := extract_product(table_name=table_name, scraped_page=scraped_page):
        green_db_connection.write(product)
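As a rough illustration of what carrying the id through could look like, here is a minimal sketch. It assumes a hypothetical, simplified `Product` record with an extra `scraping_id` field. `InMemoryGreenDB`, `fake_extract_product` and `extract_and_write` are illustrative stand-ins, not code from the repository.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Product:
    # Hypothetical, simplified stand-in for the extracted product record.
    url: str
    name: str
    # Assumed new field that links back to the row in the scraping table.
    scraping_id: Optional[int] = None


class InMemoryGreenDB:
    """Illustrative stand-in for the green-db connection."""

    def __init__(self) -> None:
        self.rows: List[Product] = []

    def write(self, product: Product) -> None:
        self.rows.append(product)


def fake_extract_product(scraped_page: dict) -> Optional[Product]:
    # Stand-in extractor: returns None when there is nothing to extract.
    if not scraped_page.get("html"):
        return None
    return Product(url=scraped_page["url"], name=scraped_page["html"].strip())


def extract_and_write(row_id: int, scraped_page: dict, db: InMemoryGreenDB) -> None:
    if product := fake_extract_product(scraped_page):
        # Attach the scraping row id before writing, so both tables share a key.
        db.write(replace(product, scraping_id=row_id))


db = InMemoryGreenDB()
extract_and_write(42, {"url": "https://example.com/p/1", "html": "Organic T-Shirt"}, db)
extract_and_write(43, {"url": "https://example.com/p/2", "html": ""}, db)
```

Only the first page yields a product, and the stored record keeps `scraping_id=42`, so it can be traced back to its scraping row.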
The green-db table already has an id column, but this is autogenerated, see:
green-db/database/database/tables.py
Line 203 in 90b631b
    id = Column(INTEGER, nullable=False, autoincrement=True, primary_key=True)
So integrating this shouldn't be a lot of work, and it would help whenever we want to use information from the scraping table together with the green-db table, for example combining the HTML with the extracted product information for some ML.
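To make the difference concrete, the following sketch compares the two joins in an in-memory SQLite database. The table layouts are heavily simplified, and the `scraping_id` column is an assumed name for the proposed link; the real schema lives in `database/tables.py` and may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE scraping (
        id INTEGER PRIMARY KEY,
        timestamp TEXT, url TEXT, category TEXT, html TEXT
    );
    CREATE TABLE green_db (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        scraping_id INTEGER REFERENCES scraping (id),  -- assumed new column
        timestamp TEXT, url TEXT, category TEXT, name TEXT
    );
    """
)
conn.execute(
    "INSERT INTO scraping VALUES (?, ?, ?, ?, ?)",
    (7, "2022-01-01", "https://example.com/p/1", "SHIRT", "<html>shirt</html>"),
)
conn.execute(
    "INSERT INTO green_db (scraping_id, timestamp, url, category, name) "
    "VALUES (?, ?, ?, ?, ?)",
    (7, "2022-01-01", "https://example.com/p/1", "SHIRT", "Organic T-Shirt"),
)

# Current approach: match on three columns.
by_columns = conn.execute(
    "SELECT s.html, g.name FROM scraping s JOIN green_db g "
    "ON s.timestamp = g.timestamp AND s.url = g.url AND s.category = g.category"
).fetchall()

# Proposed approach: match on the scraping id alone.
by_id = conn.execute(
    "SELECT s.html, g.name FROM scraping s JOIN green_db g ON g.scraping_id = s.id"
).fetchall()
```

Both queries return the same pair of HTML and product name here, but the id-based join does not depend on three columns staying in sync across the two tables.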