Project of BDT - Giorgia Villa, Daniele Passabì
This code provides a possible implementation of a Big Data System able to perform predictions on tenders published by Italian administrative bodies. It returns the most likely winner of a tender specified by the end-user through a customised interface. Predictions are based on historical data and obtained through a machine learning model running a K-Nearest Neighbours classification algorithm.
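As a rough illustration of the idea (not the project's actual model), a K-Nearest Neighbours classifier can map tender features to the most frequent winner among similar historical tenders. The features and company names below are toy assumptions; the real model is trained on the ANAC data stored in MySQL.

```python
# Minimal K-NN sketch: predict a tender winner from toy historical data.
# Feature columns (importo_lotto, settore encoded as an integer) and the
# company names are illustrative assumptions, not the project's real schema.
from sklearn.neighbors import KNeighborsClassifier

# Toy historical tenders: [importo_lotto, settore] -> winning company
X = [[100_000, 0], [120_000, 0], [5_000, 1], [7_000, 1]]
y = ["Impresa A", "Impresa A", "Impresa B", "Impresa B"]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

# Most likely winner for a new, unseen tender
print(model.predict([[110_000, 0]])[0])  # -> Impresa A
```

In practice categorical columns (sector, region, contracting authority) would need encoding and numeric columns scaling before distances are meaningful.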
The end-user only needs to run the `predict_winner_interface.py` script. This opens a window where some company and tender information must be specified.
After that, it is enough to click the Get prediction button and the most likely winner of the tender will be displayed.
We assume that the first step of the solution (the population of the MySQL database) has already been carried out. Therefore, on the developer side, the only action to be performed is keeping the above-mentioned database up to date. Recalling that tender information has to be updated monthly and winners information every six months, to update the database one should:
- Download the new data from ANAC.
- Run the `update_database_interface.py` script.

After selecting which kind of data to update and the path of the new .csv, together with the MySQL credentials, the application runs the whole update process autonomously: first the database is updated, then the prediction model is retrained on the new data as well.
- Suggested Python version: 3.8.10
- Required libraries: the file `requirements.txt` contains all the libraries to be installed for the solution to run properly.
- It is necessary to install a MySQL instance, such as MySQL Workbench.
- In the chosen MySQL instance two tables must be generated, as follows:
Table 1

Table Name: elenco_aggiudicatari

| Column Name | Column Type |
|---|---|
| cig | CHAR(10) |
| aggiudicatario | VARCHAR(128) |
| tipo_aggiudicatario | VARCHAR(256) |
Table 2

Table Name: appalti_aggiudicatari

| Column Name | Column Type |
|---|---|
| cig | CHAR(10) |
| numero_gara | VARCHAR(10) |
| importo_complessivo_gara | FLOAT |
| n_lotti_componenti | SMALLINT |
| importo_lotto | FLOAT |
| settore | VARCHAR(32) |
| data_pubblicazione | DATE |
| tipo_scelta_contraente | VARCHAR(128) |
| modalita_realizzazione | VARCHAR(128) |
| denominazione_amministrazione_appaltante | VARCHAR(256) |
| sezione_regionale | VARCHAR(32) |
| descrizione_cpv | VARCHAR(256) |
| aggiudicatario | VARCHAR(128) |
| tipo_aggiudicatario | VARCHAR(256) |
Notes: the column types were chosen based on the official documentation and on the type of variable present in each column. The aim was to minimise the required space without preventing further developments of the application.
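The two tables above could be created programmatically, for instance with the mysql-connector-python driver. The DDL below mirrors the schema in the tables; the connection call is left commented out because credentials and database name depend on your MySQL instance.

```python
# DDL matching the two tables described above. The execution snippet at the
# bottom is an assumption (mysql-connector-python, placeholder credentials).
DDL_ELENCO = """
CREATE TABLE IF NOT EXISTS elenco_aggiudicatari (
    cig CHAR(10),
    aggiudicatario VARCHAR(128),
    tipo_aggiudicatario VARCHAR(256)
);
"""

DDL_APPALTI = """
CREATE TABLE IF NOT EXISTS appalti_aggiudicatari (
    cig CHAR(10),
    numero_gara VARCHAR(10),
    importo_complessivo_gara FLOAT,
    n_lotti_componenti SMALLINT,
    importo_lotto FLOAT,
    settore VARCHAR(32),
    data_pubblicazione DATE,
    tipo_scelta_contraente VARCHAR(128),
    modalita_realizzazione VARCHAR(128),
    denominazione_amministrazione_appaltante VARCHAR(256),
    sezione_regionale VARCHAR(32),
    descrizione_cpv VARCHAR(256),
    aggiudicatario VARCHAR(128),
    tipo_aggiudicatario VARCHAR(256)
);
"""

# Example execution (adapt credentials to your instance):
# import mysql.connector
# conn = mysql.connector.connect(user="...", password="...", database="...")
# cur = conn.cursor()
# for ddl in (DDL_ELENCO, DDL_APPALTI):
#     cur.execute(ddl)
# conn.commit()
```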
Below is a detailed description of the code structure of this project.
This is the very first step and it is intended to be executed only once.
In the folder database_population there are two files:

- `config.py`, with the configuration settings necessary to populate the database for the first time. In detail, it contains:
  - the paths of the folders with the .csv files downloaded from the ANAC official website;
  - the credentials of the MySQL database in which the cleaned data will be stored.
- `populate_database.py`, which:
  - takes the datasets in the raw folder in .csv format;
  - cleans the aggiudicatari data and uploads them to the MySQL database;
  - cleans the tenders data, joins them with the winners data and uploads the result to the MySQL database.
As mentioned at the beginning, this operation is intended to be performed only once. After this first initialisation, data will be updated through the application presented in the next paragraph.
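The clean-and-join step of `populate_database.py` can be sketched with pandas. The cleaning rules and column names below are illustrative assumptions based on the schema above, not the project's exact code.

```python
# Hypothetical sketch of the populate_database.py logic: clean the winners
# data, then attach it to the tenders data via the CIG key. The cleaning
# rules (drop missing CIGs, normalise names) are assumptions.
import pandas as pd

def clean_winners(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop rows without a CIG and normalise the winner name."""
    out = raw.dropna(subset=["cig"]).copy()
    out["aggiudicatario"] = out["aggiudicatario"].str.strip().str.upper()
    return out

def join_tenders_winners(tenders: pd.DataFrame,
                         winners: pd.DataFrame) -> pd.DataFrame:
    """Attach winner information to each tender on the CIG key."""
    return tenders.merge(winners, on="cig", how="inner")

winners = clean_winners(pd.DataFrame({
    "cig": ["0000000001", None],
    "aggiudicatario": ["  impresa a ", "impresa b"],
}))
tenders = pd.DataFrame({"cig": ["0000000001"], "importo_lotto": [50_000.0]})
print(join_tenders_winners(tenders, winners))
```

The joined result is what would then be uploaded to the `appalti_aggiudicatari` table.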
ANAC provides data on tenders (CIG) and on winners (aggiudicatari) differently:
- CIG: every month a new .csv is released with data related to tenders of that month;
- Winners: from time to time the same .csv file is updated with new winners (hence the file keeps growing).
For this reason two different procedures were implemented to update data on the MySQL database.
The MySQL database can be updated in two ways: through the application (suggested) or by running the Python scripts specific to the tables in the database.
- The update application can be run through the script `update_database_interface.py`.
- It lets the user select which table to update (`elenco_aggiudicatari` | `appalti_aggiudicatari`).
- It lets the user select the new .csv with the raw data to be added to the MySQL database.
- It lets the user provide their own MySQL credentials.
- Once all the information has been provided, it is only necessary to click the Update DB button to launch the update procedure.

Below are some screenshots of the application.
First of all, it is necessary to provide the credentials and settings for the MySQL database in the configuration file `config.py`.
Below is a detailed explanation of the scripts and of the logic of the system:
- `update_database_CIG.py`
  - Updates the table `appalti_aggiudicatari`.
  - Lets the user select a new raw .csv file with new data concerning tenders.
  - Cleans the new data.
  - Updates the MySQL database.
- `update_database_aggiudicatari_v1.py`
  - Updates the table `elenco_aggiudicatari`.
  - Lets the user select a new raw .csv file with new data concerning winners.
  - Removes the old winners data from the MySQL database.
  - Uploads the new data to the MySQL database.
- `update_database_aggiudicatari_v2.py`
  - Updates the table `elenco_aggiudicatari`.
  - Lets the user select a new raw .csv file with new data concerning winners.
  - Compares the new data with the data already present in the MySQL database.
  - Updates the MySQL database, loading only the records not already present.

Note: `update_database_aggiudicatari_v1.py` is far less efficient than `update_database_aggiudicatari_v2.py`. It is kept as an alternative solution for the case in which the dataset is too large to be held twice in memory (which is necessary for the comparison operation in `update_database_aggiudicatari_v2.py`).
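The comparison step of the v2 approach amounts to an anti-join: keep only the new records whose key is not already in the database. A possible pandas sketch, with column names taken from the `elenco_aggiudicatari` schema (the function name is an assumption):

```python
# Sketch of the v2 comparison step: an anti-join keeping only winner
# records not already present in the database. Requires both datasets in
# memory, which is why v1 exists as a fallback for very large files.
import pandas as pd

def records_to_upload(new: pd.DataFrame,
                      existing: pd.DataFrame) -> pd.DataFrame:
    """Keep rows of `new` whose (cig, aggiudicatario) pair is unseen."""
    merged = new.merge(existing, on=["cig", "aggiudicatario"],
                       how="left", indicator=True)
    return merged[merged["_merge"] == "left_only"].drop(columns="_merge")

existing = pd.DataFrame({"cig": ["0000000001"],
                         "aggiudicatario": ["IMPRESA A"]})
new = pd.DataFrame({"cig": ["0000000001", "0000000002"],
                    "aggiudicatario": ["IMPRESA A", "IMPRESA B"]})
print(records_to_upload(new, existing))
```

Only the row with CIG `0000000002` survives the filter and would be uploaded.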
In the folder database_backup it is possible to run a script that generates a backup of the MySQL database.
In `config.py` it is possible to provide customised database settings and the folder in which to save the dump.
It is recommended to dump the database periodically and to store the data in the cloud or on a different machine from the one hosting the original database.
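Such a backup can be produced by wrapping the standard `mysqldump` utility. The helper below is a sketch under the assumption that `mysqldump` is on the PATH and that credentials come from the project's `config.py`; function names are illustrative.

```python
# Hypothetical backup sketch: build a timestamped dump path and run
# mysqldump into it. Credentials, database name and dump folder would come
# from config.py; the function names here are assumptions.
import datetime
import pathlib
import subprocess

def dump_path(folder: str, database: str) -> pathlib.Path:
    """Timestamped .sql file path for a dump of `database` in `folder`."""
    stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    return pathlib.Path(folder) / f"{database}_{stamp}.sql"

def dump_database(user: str, password: str,
                  database: str, folder: str) -> pathlib.Path:
    """Run mysqldump and return the path of the generated dump file."""
    path = dump_path(folder, database)
    with path.open("w") as out:
        subprocess.run(
            ["mysqldump", f"--user={user}", f"--password={password}", database],
            stdout=out, check=True,
        )
    return path
```

The timestamp in the file name keeps successive dumps from overwriting each other, which makes the periodic-backup policy above straightforward.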
The data used are licensed under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) licence. More info here.
The data are attributed to the Italian Autorità Nazionale Anticorruzione (ANAC).
The original version of the data can be found here.

