Commits
86 commits
67c3af8
gitignore add .idea/
larnhold Jun 1, 2020
2052bc7
rm cached idea files
larnhold Jun 1, 2020
2684108
separate flask routes
larnhold Jun 2, 2020
b55d272
fix swagger documentation
larnhold Jun 3, 2020
e1d68eb
refactoring
larnhold Jun 28, 2020
23dfe30
add marshmallow for http parameter parsing
larnhold Jul 3, 2020
38c23c8
adapt to changes in the ted api
larnhold Jul 25, 2020
5c981bf
add fullTextModel
larnhold Jul 25, 2020
32afdc2
first svm model on descriptions
larnhold Aug 1, 2020
348bac3
adapt to language options
larnhold Aug 21, 2020
cebb543
save dataset for development
larnhold Aug 22, 2020
d9666a8
full text sklearn svm modell
larnhold Sep 2, 2020
6239299
TransformerModel use cuda if available
larnhold Sep 18, 2020
1ec73d3
add Pytorch BertClassification model
larnhold Sep 29, 2020
80f1a4f
download tender descriptions in original language
larnhold Oct 1, 2020
d53324f
pytorch transformer model
larnhold Oct 1, 2020
13e4769
Pytorch Lighting Bert-Model pass Tenders with description embeddings
larnhold Oct 1, 2020
5a5a848
add dev dataset
larnhold Oct 1, 2020
89672fd
logging and scheduler
larnhold Oct 1, 2020
bc3dc12
statistical analysis on the dataset
larnhold Oct 14, 2020
7aa5980
update requirements
larnhold Oct 14, 2020
ed09cfd
count token length for statistics
larnhold Oct 16, 2020
2717fc0
use config for tranformer model
larnhold Oct 16, 2020
213d3e9
refactor old pytorch transformer model
larnhold Oct 16, 2020
4132531
rm old pytorch model withoud pytorch-lightning
larnhold Oct 16, 2020
5dc1544
refactoring
larnhold Oct 16, 2020
198e6d8
updated persitence json parsing
larnhold Oct 18, 2020
090f768
correct import order
larnhold Oct 18, 2020
41bc11a
return all tenders of no date specified
larnhold Oct 18, 2020
1981dc2
saving and loading of transformer models - implementation of classifi…
larnhold Oct 21, 2020
cd844e6
add link to tender entity
larnhold Oct 21, 2020
fea2506
add classification to FullTextSvmModel
larnhold Oct 21, 2020
1c52e11
package resolution
larnhold Oct 28, 2020
a0e790e
add model with trained bert layers
larnhold Oct 28, 2020
db994ad
renaming
larnhold Oct 28, 2020
a3ea038
add bert dataset
larnhold Oct 28, 2020
2240497
change path for saved models
larnhold Oct 28, 2020
48d7a8b
adapt to changes in the ted api
larnhold Oct 30, 2020
4d48fc1
revert to api fallback
larnhold Oct 30, 2020
30e81e8
reverse order of fetched contract notices to deal with format break r…
larnhold Oct 30, 2020
7ca3341
refactor location of the original language entity to match descriptio…
larnhold Dec 5, 2020
50fc4a9
renaming
larnhold Dec 5, 2020
18a39d7
updated readme
larnhold Dec 6, 2020
5c62598
updated readme
larnhold Dec 6, 2020
323aa6e
updated readme
larnhold Dec 6, 2020
eadd9d2
use api v1 instead of v2
larnhold Dec 6, 2020
1cc4bde
add fFullTextFastTextModel
larnhold Dec 16, 2020
9f320f1
refactor classifier folder structure
larnhold Dec 16, 2020
dab59d5
fix imports
larnhold Dec 16, 2020
1ed08b3
fix imports
larnhold Dec 16, 2020
aa0fe76
cleanup of FullTextSvcModel
larnhold Dec 16, 2020
7dd5a7e
add kfold validation
larnhold Dec 21, 2020
388ea52
validation
larnhold Dec 21, 2020
e5b8548
validation
larnhold Dec 21, 2020
61c4192
fix imports
larnhold Dec 21, 2020
566ec4b
clear cuda cache
larnhold Dec 21, 2020
41091ac
garbage collecting
larnhold Dec 21, 2020
f0bc635
no grad transformer evaluation
larnhold Dec 21, 2020
41bfa72
revert set eval mode for pytorch prediction
larnhold Dec 22, 2020
5d856ce
clear cuda memory
larnhold Dec 22, 2020
0e6a2ca
delete underlying bert model on new model
larnhold Dec 23, 2020
abbe67a
set correct batch size
larnhold Dec 24, 2020
25cd4e0
add profiling
larnhold Dec 26, 2020
42868ca
use one single HuggingFace model in the FulltextTransformerModel
larnhold Dec 26, 2020
d5b63a3
update docker integration
larnhold Jan 21, 2021
d3329a6
include data volume in docker
larnhold Jan 21, 2021
9013010
add fetch rest call
larnhold Jan 30, 2021
5c4a151
move fetcher to service
larnhold Jan 30, 2021
a532a57
dedicated fetch endpoint
larnhold Jan 30, 2021
7fd0b88
integrate new fetch endpoint
larnhold Jan 31, 2021
9657537
remove dev mode
larnhold Jan 31, 2021
5943906
train from dataset endpoint
larnhold Jan 31, 2021
c523dad
split model training into distinct new - save -load
larnhold Feb 1, 2021
52d7878
FullTextSvmModel load save
larnhold Feb 1, 2021
052caea
legacy SpacySciKitModel load save
larnhold Feb 1, 2021
6a4a504
FullTextSvmModel load save
larnhold Feb 1, 2021
97f313c
test save validation
larnhold Feb 6, 2021
5b9b73b
fix docker integration
larnhold Feb 6, 2021
d0630c4
validation endpoint
larnhold Feb 6, 2021
efa96f2
fix svm classifier
larnhold Feb 6, 2021
34ca842
adapt swagger
larnhold Feb 6, 2021
cca0c77
web recommendation filter by original language
larnhold Feb 6, 2021
e8bc0a0
implement dataset split utitlity
larnhold Feb 7, 2021
7f1c025
adapt spacy scikit model
larnhold Feb 7, 2021
f72a025
update data inspector
larnhold Feb 7, 2021
355bcf2
add german bert lite model
larnhold Feb 7, 2021
1 change: 1 addition & 0 deletions .dockerignore
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
src/data/*
8 changes: 7 additions & 1 deletion .gitignore
@@ -1,8 +1,14 @@
.idea/
outputs/
data/
data/*
venv/
test/
runs/
scikit_model
cache_dir/
__pycache__/
wandb/
src.iml
/src/tb_logs/
/src/tenderclass/
/src/tenderclass-backend-src/
3 changes: 0 additions & 3 deletions .idea/.gitignore

This file was deleted.

6 changes: 0 additions & 6 deletions .idea/inspectionProfiles/profiles_settings.xml

This file was deleted.

7 changes: 0 additions & 7 deletions .idea/misc.xml

This file was deleted.

8 changes: 0 additions & 8 deletions .idea/modules.xml

This file was deleted.

12 changes: 0 additions & 12 deletions .idea/tenderclass-backend.iml

This file was deleted.

6 changes: 0 additions & 6 deletions .idea/vcs.xml

This file was deleted.

8 changes: 6 additions & 2 deletions Dockerfile
@@ -9,11 +9,15 @@ WORKDIR /

#RUN apk add make gcc musl-dev g++

RUN pip install --upgrade pip
RUN pip install cython


RUN pip install git+https://github.com/huggingface/transformers.git

# we need to install further python packages which are listed in requirements.txt
COPY requirements.txt ./

RUN pip install --upgrade pip
RUN pip install cython
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
45 changes: 36 additions & 9 deletions README.md
@@ -1,9 +1,12 @@
# tuwien-inso-bachelorthesis-tenderclass-backend

tenderclass is an automated screening system for public procurement notices using state-of-the-art Machine Learning and Natural Language Processing (NLP) frameworks. This git repository holds the Python-based backend of tenderclass. It is responsible for downloading, parsing and classifying tenders from Tenders Electronic Daily (TED). For this reason, this prototype implements two Machine Learning approaches:
tenderclass is an automated screening system for public procurement notices using state-of-the-art Machine Learning and Natural Language Processing (NLP) frameworks. This git repository holds the Python-based backend of tenderclass. It is responsible for downloading, parsing and classifying tenders from Tenders Electronic Daily (TED). For this reason, this prototype implements the following Machine Learning approaches:

- SpacyScikitModel: Machine Learning Model based on [spaCy](https://spacy.io/) and [scikit-learn](https://scikit-learn.org/stable/)
- TransformerModel: Machine Learning Model based on [Hugging Face](https://github.com/huggingface/transformers) and [simpletransformers](https://github.com/ThilinaRajapakse/simpletransformers)
- SpacyScikitModel: Machine Learning Model based on [spaCy](https://spacy.io/) and [scikit-learn](https://scikit-learn.org/stable/) (titles only)
- TransformerModel: Machine Learning Model based on [Hugging Face](https://github.com/huggingface/transformers) and [simpletransformers](https://github.com/ThilinaRajapakse/simpletransformers) (titles only)
- FullTextSvmModel: Machine Learning Model based on [spaCy](https://spacy.io/) and [scikit-learn](https://scikit-learn.org/stable/)
- FastTextModel: Machine Learning Model based on [FastText](https://fasttext.cc/)
- FullTextTransformerModel: Machine Learning Model based on [Hugging Face](https://github.com/huggingface/transformers) and [PyTorch](https://pytorch.org/) with [PytorchLightning](https://www.pytorchlightning.ai/)

## Getting Started

@@ -14,7 +14,7 @@ These instructions will get you a copy of the project up and running on your loc
What things you need to install the software and how to install them

- [Python 3.7/3.8](https://www.python.org/downloads/)
- OPTIONAL: If you want to train the TransformerModel on a Nvidia GPU (much faster!): [CUDA Toolkit 10.2](https://developer.nvidia.com/cuda-downloads)
- OPTIONAL: Some models can be trained on an Nvidia GPU (much faster!): [CUDA Toolkit 10.2](https://developer.nvidia.com/cuda-downloads)
- OPTIONAL: If you want to deploy it as a Docker container: [Docker](https://www.docker.com/) runtime environment

### Installing
@@ -39,14 +39,17 @@ Run on Linux: `$ source venv/bin/activate`
3. Install all the required dependencies using the Python package manager `pip`.<br/>
`$ pip install -r requirements.txt`

4. Install the spaCy german language model.<br/>
4. The Transformers package has to be installed from source.<br/>
`$ pip install git+https://github.com/huggingface/transformers.git`

5. Install the spaCy german language model.<br/>
`$ python -m spacy download de`

5. Navigate to the `src` directory and start the web server by running `main.py`.<br/>
6. Navigate to the `src` directory and start the web server by running `main.py`.<br/>
`$ cd src`<br/>
`$ python main.py`

6. OPTIONAL: Deactive the virtual environment:<br/>
7. OPTIONAL: Deactivate the virtual environment:<br/>
Run on Windows: `$ venv\Scripts\deactivate.bat`<br/>
Run on Linux: `$ source venv/bin/deactivate`

@@ -75,6 +81,7 @@ You can deploy the backend of tenderclass by using a Docker container.

## API Endpoints
Documentation for the API Endpoints is available in Swagger UI. After starting the web server, enter the following web site into your browser:<br/>

[API Documenation](http://localhost:5000/swagger)

## Architecture
@@ -83,34 +90,54 @@ The back end incorporates the business logic and Machine Learning services inclu
- Trainer: This module trains the model by fetching two tender sets (positive tenders and negative tenders), labeling them and then feeding them to the model. It also allows resetting the model.
- Fetcher: This class is responsible for downloading tenders from the internet. Although it only delegates the request to the TedFetcher, there is the possibility that tenderclass could also address other public procurement data sources next to TED. This component would decide which data source should be used.
- TedFetcher: Given a number and a search query, this component automatically downloads the raw XML tender data using the TedDownloader. Afterwards, it delegates the XML document to TedExtractor, which builds up the Tender entity by extracting the relevant fields from the XML document.

![Component Diagram](doc/arch2.png)

### Data model
The prototype tenderclass implements two types of entities. A Tender represents one single public procurement notice. It holds the unique identifier of type string, which is assigned externally by TED, the hyperlink to the official TED website document of type string as well as the
list of CPV codes of type list of strings. Moreover, each tender consists of an array of at least one LanguageEntity. This entity holds the language-specific information such as title of type string and description of type string. Although requirement analysis only dictates to support German public procurement notices, the data model already supports multiple languages in case of extending the prototype with additional features such as multi-language or translation support. The following figure shows the corresponding class diagram of the data model.
The prototype tenderclass implements two types of entities. A Tender represents one single public procurement notice. It holds the unique identifier of type string, which is assigned externally by TED, the list of CPV codes of type list of strings and the language of the original submission of the procurement notice.
Moreover, each tender consists of an array of at least one LanguageEntity, containing at least the entry in the language of the original publication.
This entity holds the language-specific information such as the title of type string and the description of type string, and additionally a link to the procurement notice on the TED website.
Although requirement analysis only dictates to support German public procurement notices, the data model already supports multiple languages in case of extending the prototype with additional features such as multi-language or translation support. The following figure shows the corresponding class diagram of the data model.

![Data model](doc/datamodel.png)
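
The entity layout described above can be sketched as plain Python dataclasses. This is a hypothetical sketch drawn from the description; the actual attribute names in the repository may differ:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch of the data model described above; the actual
# attribute names in the repository may differ.
@dataclass
class LanguageEntity:
    lang: str         # language code of this entry
    title: str        # language-specific title
    description: str  # language-specific description
    link: str         # link to the notice on the TED website

@dataclass
class Tender:
    id: str               # unique identifier assigned externally by TED
    cpv_codes: List[str]  # list of CPV codes
    original_lang: str    # language of the original submission
    # at least one LanguageEntity, starting with the original publication
    entities: List[LanguageEntity] = field(default_factory=list)
```

Keeping the language-specific fields in their own entity is what leaves the door open for the multi-language extension mentioned above.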

## Endpoints and Program Flow

### Get recommendations
This endpoint downloads all tenders published on the current date, classifies them and then only returns the positive tenders as recommendation. The following figure depicts the communication flow. After receiving the request, firstly the Flask web server delegates it to the Recommender module. This component uses the Fetcher module for downloading and parsing the tenders. The Fetch Model section displays its sequence communication in more detail. Subsequently, the Recommender module only returns those tenders the model has classified to be interesting.

![Get recommendations](doc/recommendations.png)
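
The flow above can be condensed into a short sketch; the `fetcher`/`model` interfaces and the query string are assumptions for illustration, not the repository's actual signatures:

```python
# Condensed sketch of the recommendation flow described above; the
# fetcher/model interfaces and the query string are assumptions.
def recommend(fetcher, model):
    tenders = fetcher.get(search_criteria="published today")  # placeholder query
    # keep only the tenders the model classifies as interesting (label 1)
    return [t for t in tenders if model.classify(t) == 1]
```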

### Create new model
This endpoint creates a new model and trains it with two distinct sets of tenders. For this purpose, the JSON body requires four different properties. The pos_number attribute indicates how many positive tenders the application should download from TED and feed to the model. Thereby, the pos_search_criteria specifies the constraints each positive tender must fulfill. In this case, at least one CPV code must start with 72. Analogously, the same procedure applies for the negative tenders. The following figure illustrates the communication flow. After receiving the request from the Flask web server, the Trainer component firstly creates a new model. Secondly, it fetches both sets of positive and negative tenders respectively. With the tenders wrapped to tuples together with their corresponding labels (1 for positive, 0 for negative), the Trainer module randomly shuffles the tuples. The reason is that otherwise, the model would firstly be trained with the series of positive tenders and afterwards with the series of negative tenders. To counteract this imbalance, the following train call receives the shuffled list of labeled tenders.

![Create new model](doc/newmodel.png)
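
A minimal sketch of the request body and the shuffle step described above; the numbers and the query strings are illustrative assumptions, not the exact TED search syntax:

```python
import random

# Hypothetical JSON body for the create-new-model endpoint; the query
# strings are placeholders, not the exact TED search syntax.
body = {
    "pos_number": 100,
    "neg_number": 100,
    "pos_search_criteria": "CPV starts with 72",       # placeholder
    "neg_search_criteria": "CPV not starting with 72", # placeholder
}

# The Trainer labels the fetched tenders (1 = positive, 0 = negative)
# and shuffles the combined list so the model is not trained on one
# long series of positives followed by one of negatives.
pos_tenders = [f"pos_{i}" for i in range(body["pos_number"])]  # stand-ins
neg_tenders = [f"neg_{i}" for i in range(body["neg_number"])]
labeled = [(t, 1) for t in pos_tenders] + [(t, 0) for t in neg_tenders]
random.shuffle(labeled)
```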

### Train from web
This endpoint updates the model with additional labeled training data. As the user should have the possibility to either confirm or reject the recommendations, this endpoint fits the model with feedback data. For that reason, the Flask web server accepts a JSON body with two properties. The ids property is a JSON list of tender identifiers which tenderclass automatically downloads. The labels property is an integer list which gives the corresponding labels for these tenders. The i-th label belongs to the i-th id. This is why both lists must be of the same length. Similar to the get recommendations endpoint in the previous section, Flask delegates the request to another component, but this time to the Trainer. The following figure outlines the communication flow. As this module only knows the ids, but not the actual tender data, it first of all needs to download the entire tender metadata. This is why it builds up a search criteria query such that the tender id must match at least one id in the list. After passing this search criteria to the Fetcher, it receives all tender entities that have been found. As a second step, the Trainer module maps the downloaded tenders to the given labels before wrapping them to tuples. By passing them to the train method, the Model component feeds those labeled tenders to its internal classification model. Finally, the Flask web server returns with OK.

![Train from web](doc/train.png)
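
A hedged example of the expected JSON body; the ids shown are made up for illustration:

```python
# Hypothetical body for the train-from-web endpoint. The i-th label
# belongs to the i-th id, which is why both lists must be equally long.
body = {
    "ids": ["111111-2020", "222222-2020", "333333-2020"],  # made-up TED ids
    "labels": [1, 0, 1],                                   # 1 = relevant, 0 = not
}
assert len(body["ids"]) == len(body["labels"])
```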

### Fetch models
Although there is no designated endpoint for fetching specific, query-based tenders, each core function requires downloading, parsing and extracting tenders which the Fetcher module is responsible for. This is why this subsection explains in detail the communication flow of fetching a tender, as seen in the following figure. As soon as the Fetcher module gets a request, it immediately delegates the request to the TedFetcher. Despite this extra delegation, this pattern allows developers to add additional data sources such as national public procurement platforms. As the TED API supports pagination with up to 100 tender documents per API call, the TedFetcher needs to enter a loop. In each iteration, it calls get_xml_contracts with i as the page number. Subsequently, the triggered TedDownloader issues a REST call to the TED API as described in the Fetch Tender section. Once it has parsed the response and returned the list of XML documents of the i-th page, the TedFetcher module calls the extract method from the TedExtractor. This second step instantiates and initializes a new Tender entity by extracting CPV code, id, title and description out of the XML document. As soon as the component either reaches the requested number c of tenders or exceeds the maximum number of pages (which implies that fewer tenders than intended are returned), the module returns the list of Tender entities.

![Fetch models](doc/get.png)
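
The pagination loop described above can be sketched as follows; `get_xml_contracts` and `extract` stand in for the TedDownloader/TedExtractor calls, and the page limit is an assumption:

```python
def fetch_tenders(count, query, get_xml_contracts, extract, max_pages=50):
    """Sketch of the TedFetcher loop described above: page through the
    API (up to 100 documents per call) until the requested number of
    tenders is reached or the pages run out."""
    tenders = []
    page = 1
    while len(tenders) < count and page <= max_pages:
        for xml_doc in get_xml_contracts(query, page):
            tenders.append(extract(xml_doc))
            if len(tenders) == count:
                break
        page += 1
    # fewer than `count` entries come back if the pages were exhausted
    return tenders
```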

## Logging
The FullTextTransformerModel supports wandb for logging the statistics of a training run. The logs can be accessed on the [Weights & Biases](https://wandb.ai/home) (wandb) portal.

To write logs to the platform, the user has to be logged in. Provided the corresponding `wandb` package has been installed via pip (which is the case if all requirements have been installed), the user can log in via the following command:

`wandb login`

Now the logged metrics of FullTextTransformerModel runs can be viewed in the wandb account associated with the credentials used, on [https://www.wandb.com/](https://www.wandb.com/).

## Authors

* **Nicolas Griebenow** - *Initial work* - [ngriebenow](https://github.com/ngriebenow)
* **Lukas Arnhold** - *Further development of classification models* - [larnhold](https://github.com/larnhold)



Binary file modified doc/datamodel.png
9 changes: 9 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,9 @@
version: "3"

services:
backend:
build: .
ports:
- "5000:5000"
volumes:
- "./data:/src/data:rw"
22 changes: 17 additions & 5 deletions requirements.txt
@@ -1,9 +1,21 @@
scikit-learn
joblib
pytorch-lightning~=1.1.4
scikit-learn~=0.23.1
joblib~=0.16.0
simpletransformers
flask
beautifulsoup4
flask~=1.1.2
beautifulsoup4~=4.9.1
flask_swagger_ui
spaCy
spaCy~=2.3.2
lxml
flask_cors
requests~=2.24.0
pandas~=1.0.4
marshmallow~=3.7.1
nltk~=3.5
sklearn~=0.0
torch~=1.7.1
torchvision
wandb
fasttext
matplotlib~=3.3.2
memory_profiler
4 changes: 4 additions & 0 deletions src/Models/FromDatasetsModelModel.py
@@ -0,0 +1,4 @@
class FromDatasetsModel:
def __init__(self, pos_filename, neg_filename):
self.pos_filename = pos_filename
self.neg_filename = neg_filename
3 changes: 3 additions & 0 deletions src/Models/ModelNameModel.py
@@ -0,0 +1,3 @@
class ModelNameModel:
def __init__(self, name):
self.name = name
6 changes: 6 additions & 0 deletions src/Models/NewModelModel.py
@@ -0,0 +1,6 @@
class NewModelModel:
def __init__(self, pos_number, neg_number, pos_search_criteria, neg_search_criteria):
self.pos_number = pos_number
self.neg_number = neg_number
self.pos_search_criteria = pos_search_criteria
self.neg_search_criteria = neg_search_criteria
7 changes: 7 additions & 0 deletions src/Models/TedSaveModel.py
@@ -0,0 +1,7 @@
class TedSaveModel:
def __init__(self, amount, search_criteria, dataset_name, original_languages=None, languages=None):
self.amount: int = amount
self.search_criteria: str = search_criteria
self.original_languages: list[str] = original_languages
self.languages: list[str] = languages
self.dataset_name: str = dataset_name
Empty file added src/Models/__init__.py
Empty file.