Commits
86 commits
67c3af8
gitignore add .idea/
larnhold Jun 1, 2020
2052bc7
rm cached idea files
larnhold Jun 1, 2020
2684108
separate flask routes
larnhold Jun 2, 2020
b55d272
fix swagger documentation
larnhold Jun 3, 2020
e1d68eb
refactoring
larnhold Jun 28, 2020
23dfe30
add marshmallow for http parameter parsing
larnhold Jul 3, 2020
38c23c8
adapt to changes in the ted api
larnhold Jul 25, 2020
5c981bf
add fullTextModel
larnhold Jul 25, 2020
32afdc2
first svm model on descriptions
larnhold Aug 1, 2020
348bac3
adapt to language options
larnhold Aug 21, 2020
cebb543
save dataset for development
larnhold Aug 22, 2020
d9666a8
full text sklearn svm modell
larnhold Sep 2, 2020
6239299
TransformerModel use cuda if available
larnhold Sep 18, 2020
1ec73d3
add Pytorch BertClassification model
larnhold Sep 29, 2020
80f1a4f
download tender descriptions in original language
larnhold Oct 1, 2020
d53324f
pytorch transformer model
larnhold Oct 1, 2020
13e4769
Pytorch Lighting Bert-Model pass Tenders with description embeddings
larnhold Oct 1, 2020
5a5a848
add dev dataset
larnhold Oct 1, 2020
89672fd
logging and scheduler
larnhold Oct 1, 2020
bc3dc12
statistical analysis on the dataset
larnhold Oct 14, 2020
7aa5980
update requirements
larnhold Oct 14, 2020
ed09cfd
count token length for statistics
larnhold Oct 16, 2020
2717fc0
use config for tranformer model
larnhold Oct 16, 2020
213d3e9
refactor old pytorch transformer model
larnhold Oct 16, 2020
4132531
rm old pytorch model withoud pytorch-lightning
larnhold Oct 16, 2020
5dc1544
refactoring
larnhold Oct 16, 2020
198e6d8
updated persitence json parsing
larnhold Oct 18, 2020
090f768
correct import order
larnhold Oct 18, 2020
41bc11a
return all tenders of no date specified
larnhold Oct 18, 2020
1981dc2
saving and loading of transformer models - implementation of classifi…
larnhold Oct 21, 2020
cd844e6
add link to tender entity
larnhold Oct 21, 2020
fea2506
add classification to FullTextSvmModel
larnhold Oct 21, 2020
1c52e11
package resolution
larnhold Oct 28, 2020
a0e790e
add model with trained bert layers
larnhold Oct 28, 2020
db994ad
renaming
larnhold Oct 28, 2020
a3ea038
add bert dataset
larnhold Oct 28, 2020
2240497
change path for saved models
larnhold Oct 28, 2020
48d7a8b
adapt to changes in the ted api
larnhold Oct 30, 2020
4d48fc1
revert to api fallback
larnhold Oct 30, 2020
30e81e8
reverse order of fetched contract notices to deal with format break r…
larnhold Oct 30, 2020
7ca3341
refactor location of the original language entity to match descriptio…
larnhold Dec 5, 2020
50fc4a9
renaming
larnhold Dec 5, 2020
18a39d7
updated readme
larnhold Dec 6, 2020
5c62598
updated readme
larnhold Dec 6, 2020
323aa6e
updated readme
larnhold Dec 6, 2020
eadd9d2
use api v1 instead of v2
larnhold Dec 6, 2020
1cc4bde
add fFullTextFastTextModel
larnhold Dec 16, 2020
9f320f1
refactor classifier folder structure
larnhold Dec 16, 2020
dab59d5
fix imports
larnhold Dec 16, 2020
1ed08b3
fix imports
larnhold Dec 16, 2020
aa0fe76
cleanup of FullTextSvcModel
larnhold Dec 16, 2020
7dd5a7e
add kfold validation
larnhold Dec 21, 2020
388ea52
validation
larnhold Dec 21, 2020
e5b8548
validation
larnhold Dec 21, 2020
61c4192
fix imports
larnhold Dec 21, 2020
566ec4b
clear cuda cache
larnhold Dec 21, 2020
41091ac
garbage collecting
larnhold Dec 21, 2020
f0bc635
no grad transformer evaluation
larnhold Dec 21, 2020
41bfa72
revert set eval mode for pytorch prediction
larnhold Dec 22, 2020
5d856ce
clear cuda memory
larnhold Dec 22, 2020
0e6a2ca
delete underlying bert model on new model
larnhold Dec 23, 2020
abbe67a
set correct batch size
larnhold Dec 24, 2020
25cd4e0
add profiling
larnhold Dec 26, 2020
42868ca
use one single HuggingFace model in the FulltextTransformerModel
larnhold Dec 26, 2020
d5b63a3
update docker integration
larnhold Jan 21, 2021
d3329a6
include data volume in docker
larnhold Jan 21, 2021
9013010
add fetch rest call
larnhold Jan 30, 2021
5c4a151
move fetcher to service
larnhold Jan 30, 2021
a532a57
dedicated fetch endpoint
larnhold Jan 30, 2021
7fd0b88
integrate new fetch endpoint
larnhold Jan 31, 2021
9657537
remove dev mode
larnhold Jan 31, 2021
5943906
train from dataset endpoint
larnhold Jan 31, 2021
c523dad
split model training into distinct new - save -load
larnhold Feb 1, 2021
52d7878
FullTextSvmModel load save
larnhold Feb 1, 2021
052caea
legacy SpacySciKitModel load save
larnhold Feb 1, 2021
6a4a504
FullTextSvmModel load save
larnhold Feb 1, 2021
97f313c
test save validation
larnhold Feb 6, 2021
5b9b73b
fix docker integration
larnhold Feb 6, 2021
d0630c4
validation endpoint
larnhold Feb 6, 2021
efa96f2
fix svm classifier
larnhold Feb 6, 2021
34ca842
adapt swagger
larnhold Feb 6, 2021
cca0c77
web recommendation filter by original language
larnhold Feb 6, 2021
e8bc0a0
implement dataset split utitlity
larnhold Feb 7, 2021
7f1c025
adapt spacy scikit model
larnhold Feb 7, 2021
f72a025
update data inspector
larnhold Feb 7, 2021
355bcf2
add german bert lite model
larnhold Feb 7, 2021
1 change: 1 addition & 0 deletions .dockerignore
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
src/data/*
8 changes: 7 additions & 1 deletion .gitignore
@@ -1,8 +1,14 @@
.idea/
outputs/
data/
data/*
venv/
test/
runs/
scikit_model
cache_dir/
__pycache__/
wandb/
src.iml
/src/tb_logs/
/src/tenderclass/
/src/tenderclass-backend-src/
3 changes: 0 additions & 3 deletions .idea/.gitignore

This file was deleted.

6 changes: 0 additions & 6 deletions .idea/inspectionProfiles/profiles_settings.xml

This file was deleted.

7 changes: 0 additions & 7 deletions .idea/misc.xml

This file was deleted.

8 changes: 0 additions & 8 deletions .idea/modules.xml

This file was deleted.

12 changes: 0 additions & 12 deletions .idea/tenderclass-backend.iml

This file was deleted.

6 changes: 0 additions & 6 deletions .idea/vcs.xml

This file was deleted.

8 changes: 6 additions & 2 deletions Dockerfile
@@ -9,11 +9,15 @@ WORKDIR /

#RUN apk add make gcc musl-dev g++

RUN pip install --upgrade pip
RUN pip install cython


RUN pip install git+https://github.com/huggingface/transformers.git

# we need to install further python packages which are listed in requirements.txt
COPY requirements.txt ./

RUN pip install --upgrade pip
RUN pip install cython
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
45 changes: 36 additions & 9 deletions README.md
@@ -1,9 +1,12 @@
# tuwien-inso-bachelorthesis-tenderclass-backend

tenderclass is an automated screening system for public procurement notices using state-of-the-art Machine Learning and Natural Language Processing (NLP) frameworks. This git repository holds the Python-based backend of tenderclass. It is responsible for downloading, parsing and classifying tenders from Tenders Electronic Daily (TED). For this reason, this prototype implements two Machine Learning approaches:
tenderclass is an automated screening system for public procurement notices using state-of-the-art Machine Learning and Natural Language Processing (NLP) frameworks. This git repository holds the Python-based backend of tenderclass. It is responsible for downloading, parsing and classifying tenders from Tenders Electronic Daily (TED). For this reason, this prototype implements the following Machine Learning approaches:

- SpacyScikitModel: Machine Learning Model based on [spaCy](https://spacy.io/) and [scikit-learn](https://scikit-learn.org/stable/)
- TransformerModel: Machine Learning Model based on [Hugging Face](https://github.com/huggingface/transformers) and [simpletransformers](https://github.com/ThilinaRajapakse/simpletransformers)
- SpacyScikitModel: Machine Learning Model based on [spaCy](https://spacy.io/) and [scikit-learn](https://scikit-learn.org/stable/) (titles only)
- TransformerModel: Machine Learning Model based on [Hugging Face](https://github.com/huggingface/transformers) and [simpletransformers](https://github.com/ThilinaRajapakse/simpletransformers) (titles only)
- FullTextSvmModel: Machine Learning Model based on [spaCy](https://spacy.io/) and [scikit-learn](https://scikit-learn.org/stable/)
- FastTextModel: Machine Learning Model based on [FastText](https://fasttext.cc/)
- FullTextTransformerModel: Machine Learning Model based on [Hugging Face](https://github.com/huggingface/transformers) and [PyTorch](https://pytorch.org/) with [PytorchLightning](https://www.pytorchlightning.ai/)

## Getting Started

@@ -14,7 +14,7 @@ These instructions will get you a copy of the project up and running on your loc
What things you need to install the software and how to install them

- [Python 3.7/3.8](https://www.python.org/downloads/)
- OPTIONAL: If you want to train the TransformerModel on a Nvidia GPU (much faster!): [CUDA Toolkit 10.2](https://developer.nvidia.com/cuda-downloads)
- OPTIONAL: Some models can be trained on an Nvidia GPU (much faster!): [CUDA Toolkit 10.2](https://developer.nvidia.com/cuda-downloads)
- OPTIONAL: If you want to deploy it as a Docker container: [Docker](https://www.docker.com/) runtime environment

### Installing
@@ -39,14 +39,17 @@ Run on Linux: `$ source venv/bin/activate`
3. Install all the required dependencies using the Python package manager `pip`.<br/>
`$ pip install -r requirements.txt`

4. Install the spaCy german language model.<br/>
4. The Transformers package has to be installed from source.<br/>
`$ pip install git+https://github.com/huggingface/transformers.git`

5. Install the spaCy german language model.<br/>
`$ python -m spacy download de`

5. Navigate to the `src` directory and start the web server by running `main.py`.<br/>
6. Navigate to the `src` directory and start the web server by running `main.py`.<br/>
`$ cd src`<br/>
`$ python main.py`

6. OPTIONAL: Deactive the virtual environment:<br/>
7. OPTIONAL: Deactivate the virtual environment:<br/>
Run on Windows: `$ venv\Scripts\deactivate.bat`<br/>
Run on Linux: `$ source venv/bin/deactivate`

@@ -75,6 +81,7 @@ You can deploy the backend of tenderclass by using a Docker container.

## API Endpoints
Documentation for the API Endpoints is available in Swagger UI. After starting the web server, enter the following web site into your browser:<br/>

[API Documenation](http://localhost:5000/swagger)

## Architecture
@@ -83,34 +90,54 @@ The back end incorporates the business logic and Machine Learning services inclu
- Trainer: This module trains the model by fetching two tender sets (positive tenders and negative tenders), labeling them and then feeding them to the model. It also allows resetting the model.
- Fetcher: This class is responsible for downloading tenders from the internet. Although it only delegates the request to the TedFetcher, there is the possibility that tenderclass could also address other public procurement data sources next to TED. This component would decide which data source should be used.
- TedFetcher: Given a number and a search query, this component automatically downloads the raw XML tender data using the TedDownloader. Afterwards, it delegates the XML document to TedExtractor, which builds up the Tender entity by extracting the relevant fields from the XML document.

![Component Diagram](doc/arch2.png)

### Data model
The prototype tenderclass implements two types of entities. A Tender represents one single public procurement notice. It holds the unique identifier of type string, which is assigned externally by TED, the hyperlink to the official TED website document of type string as well as the
list of CPV codes of type list of strings. Moreover, each tender consists of an array of at least one LanguageEntity. This entity holds the language-specific information such as title of type string and description of type string. Although requirement analysis only dictates to support German public procurement notices, the data model already supports multiple languages in case of extending the prototype with additional features such as multi-language or translation support. The following figure shows the corresponding class diagram of the data model.
The prototype tenderclass implements two types of entities. A Tender represents one single public procurement notice. It holds the unique identifier of type string, which is assigned externally by TED, the list of CPV codes of type list of strings and the language of the original submission of the procurement notice.
Moreover, each tender consists of an array of at least one LanguageEntity, containing at least the entry in the language of the original publication.
This entity holds the language-specific information such as the title of type string and the description of type string, and additionally a link to the procurement notice on the TED website.
Although requirement analysis only dictates to support German public procurement notices, the data model already supports multiple languages in case of extending the prototype with additional features such as multi-language or translation support. The following figure shows the corresponding class diagram of the data model.

![Data model](doc/datamodel.png)
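
The entity layout described above can be sketched as plain Python dataclasses. This is a hypothetical sketch drawn from the description; the actual attribute names in the repository may differ:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch of the data model described above; the actual
# attribute names in the repository may differ.
@dataclass
class LanguageEntity:
    lang: str         # language code of this entry
    title: str        # language-specific title
    description: str  # language-specific description
    link: str         # link to the notice on the TED website

@dataclass
class Tender:
    id: str               # unique identifier assigned externally by TED
    cpv_codes: List[str]  # list of CPV codes
    original_lang: str    # language of the original submission
    # at least one LanguageEntity, starting with the original publication
    entities: List[LanguageEntity] = field(default_factory=list)
```

Keeping the language-specific fields in their own entity is what leaves the door open for the multi-language extension mentioned above.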

## Endpoints and Program Flow

### Get recommendations
This endpoint downloads all tenders published on the current date, classifies them and then only returns the positive tenders as recommendation. The following figure depicts the communication flow. After receiving the request, firstly the Flask web server delegates it to the Recommender module. This component uses the Fetcher module for downloading and parsing the tenders. The Fetch Model section displays its sequence communication in more detail. Subsequently, the Recommender module only returns those tenders the model has classified to be interesting.

![Get recommendations](doc/recommendations.png)
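
The flow above can be condensed into a short sketch; the `fetcher`/`model` interfaces and the query string are assumptions for illustration, not the repository's actual signatures:

```python
# Condensed sketch of the recommendation flow described above; the
# fetcher/model interfaces and the query string are assumptions.
def recommend(fetcher, model):
    tenders = fetcher.get(search_criteria="published today")  # placeholder query
    # keep only the tenders the model classifies as interesting (label 1)
    return [t for t in tenders if model.classify(t) == 1]
```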

### Create new model
This endpoint creates a new model and trains it with two distinct sets of tenders. For this purpose, the JSON body requires four different properties. The pos_number attribute indicates how many positive tenders the application should download from TED and feed to the model. Thereby, the pos_search_criteria specifies the constraints each positive tender must fulfill. In this case, at least one CPV code must start with 72. Analogously, the same procedure applies for the negative tenders. The following figure illustrates the communication flow. After receiving the request from the Flask web server, the Trainer component firstly creates a new model. Secondly, it fetches both sets of positive and negative tenders respectively. With the tenders wrapped to tuples together with their corresponding labels (1 for positive, 0 for negative), the Trainer module randomly shuffles the tuples. The reason is that otherwise, the model would firstly be trained with the series of positive tenders and afterwards with the series of negative tenders. To counteract this imbalance, the following train call receives the shuffled list of labeled tenders.

![Create new model](doc/newmodel.png)
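
A minimal sketch of the request body and the shuffle step described above; the numbers and the query strings are illustrative assumptions, not the exact TED search syntax:

```python
import random

# Hypothetical JSON body for the create-new-model endpoint; the query
# strings are placeholders, not the exact TED search syntax.
body = {
    "pos_number": 100,
    "neg_number": 100,
    "pos_search_criteria": "CPV starts with 72",       # placeholder
    "neg_search_criteria": "CPV not starting with 72", # placeholder
}

# The Trainer labels the fetched tenders (1 = positive, 0 = negative)
# and shuffles the combined list so the model is not trained on one
# long series of positives followed by one of negatives.
pos_tenders = [f"pos_{i}" for i in range(body["pos_number"])]  # stand-ins
neg_tenders = [f"neg_{i}" for i in range(body["neg_number"])]
labeled = [(t, 1) for t in pos_tenders] + [(t, 0) for t in neg_tenders]
random.shuffle(labeled)
```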

### Train from web
This endpoint updates the model with additional labeled training data. As the user should have the possibility to either confirm or reject the recommendations, this endpoint fits the model with feedback data. For that reason, the Flask web server accepts a JSON body with two properties. The ids property is a JSON list of tender identifiers which tenderclass automatically downloads. The labels property is an integer list which gives the corresponding labels for these tenders. The i-th label belongs to the i-th id. This is why both lists must be of the same length. Similar to the get recommendations endpoint in the previous section, Flask delegates the request to another component, but this time to the Trainer. The following figure outlines the communication flow. As this module only knows the ids, but not the actual tender data, it first of all needs to download the entire tender metadata. This is why it builds up a search criteria query such that the tender id must match at least one id in the list. After passing this search criteria to the Fetcher, it receives all tender entities that have been found. As a second step, the Trainer module maps the downloaded tenders to the given labels before wrapping them to tuples. By passing them to the train method, the Model component feeds those labeled tenders to its internal classification model. Finally, the Flask web server returns with OK.

![Train from web](doc/train.png)
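
A hedged example of the expected JSON body; the ids shown are made up for illustration:

```python
# Hypothetical body for the train-from-web endpoint. The i-th label
# belongs to the i-th id, which is why both lists must be equally long.
body = {
    "ids": ["111111-2020", "222222-2020", "333333-2020"],  # made-up TED ids
    "labels": [1, 0, 1],                                   # 1 = relevant, 0 = not
}
assert len(body["ids"]) == len(body["labels"])
```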

### Fetch models
Although there is no designated endpoint for fetching specific, query-based tenders, each core function requires downloading, parsing and extracting tenders which the Fetcher module is responsible for. This is why this subsection explains in detail the communication flow of fetching a tender, as seen in the following figure. As soon as the Fetcher module gets a request, it immediately delegates the request to the TedFetcher. Despite this extra delegation, this pattern allows developers to add additional data sources such as national public procurement platforms. As the TED API supports pagination with up to 100 tender documents per API call, the TedFetcher needs to enter a loop. In each iteration, it calls get_xml_contracts with i as the page number. Subsequently, the triggered TedDownloader issues a REST call to the TED API as described in the Fetch Tender section. Once it has parsed the response and returned the list of XML documents of the i-th page, the TedFetcher module calls the extract method from the TedExtractor. This second step instantiates and initializes a new Tender entity by extracting CPV code, id, title and description out of the XML document. As soon as the component either reaches the requested number c of tenders or exceeds the maximum number of pages (which implies that fewer tenders than intended are returned), the module returns the list of Tender entities.

![Fetch models](doc/get.png)
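
The pagination loop described above can be sketched as follows; `get_xml_contracts` and `extract` stand in for the TedDownloader/TedExtractor calls, and the page limit is an assumption:

```python
def fetch_tenders(count, query, get_xml_contracts, extract, max_pages=50):
    """Sketch of the TedFetcher loop described above: page through the
    API (up to 100 documents per call) until the requested number of
    tenders is reached or the pages run out."""
    tenders = []
    page = 1
    while len(tenders) < count and page <= max_pages:
        for xml_doc in get_xml_contracts(query, page):
            tenders.append(extract(xml_doc))
            if len(tenders) == count:
                break
        page += 1
    # fewer than `count` entries come back if the pages were exhausted
    return tenders
```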

## Logging
The FullTextTransformerModel supports wandb for logging the statistics of a training run. The logs can be accessed on the [Weights & Biases](https://wandb.ai/home) (wandb) portal.

To write logs to the platform, the user has to be logged in. Provided the corresponding `wandb` package has been installed via pip (which is the case if all requirements have been installed), the user can log in via the following command:

`wandb login`

Now the logged metrics of FullTextTransformerModel runs can be viewed in the wandb account associated with the credentials used, on [https://www.wandb.com/](https://www.wandb.com/).

## Authors

* **Nicolas Griebenow** - *Initial work* - [ngriebenow](https://github.com/ngriebenow)
* **Lukas Arnhold** - *Further development of classification models* - [larnhold](https://github.com/larnhold)



Binary file modified doc/datamodel.png
9 changes: 9 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,9 @@
version: "3"

services:
backend:
build: .
ports:
- "5000:5000"
volumes:
- "./data:/src/data:rw"
22 changes: 17 additions & 5 deletions requirements.txt
@@ -1,9 +1,21 @@
scikit-learn
joblib
pytorch-lightning~=1.1.4
scikit-learn~=0.23.1
joblib~=0.16.0
simpletransformers
flask
beautifulsoup4
flask~=1.1.2
beautifulsoup4~=4.9.1
flask_swagger_ui
spaCy
spaCy~=2.3.2
lxml
flask_cors
requests~=2.24.0
pandas~=1.0.4
marshmallow~=3.7.1
nltk~=3.5
sklearn~=0.0
torch~=1.7.1
torchvision
wandb
fasttext
matplotlib~=3.3.2
memory_profiler
4 changes: 4 additions & 0 deletions src/Models/FromDatasetsModelModel.py
@@ -0,0 +1,4 @@
class FromDatasetsModel:
def __init__(self, pos_filename, neg_filename):
self.pos_filename = pos_filename
self.neg_filename = neg_filename
3 changes: 3 additions & 0 deletions src/Models/ModelNameModel.py
@@ -0,0 +1,3 @@
class ModelNameModel:
def __init__(self, name):
self.name = name
6 changes: 6 additions & 0 deletions src/Models/NewModelModel.py
@@ -0,0 +1,6 @@
class NewModelModel:
def __init__(self, pos_number, neg_number, pos_search_criteria, neg_search_criteria):
self.pos_number = pos_number
self.neg_number = neg_number
self.pos_search_criteria = pos_search_criteria
self.neg_search_criteria = neg_search_criteria
7 changes: 7 additions & 0 deletions src/Models/TedSaveModel.py
@@ -0,0 +1,7 @@
class TedSaveModel:
def __init__(self, amount, search_criteria, dataset_name, original_languages=None, languages=None):
self.amount: int = amount
self.search_criteria: str = search_criteria
self.original_languages: list[str] = original_languages
self.languages: list[str] = languages
self.dataset_name: str = dataset_name
Empty file added src/Models/__init__.py
Empty file.