Conversational agent in Spanish built with deep learning and a dataset of movie subtitles. If you want to go directly to the transformer version, you can do it here.
Clone the repository and install:
pip install .
Alternatively, install the package with pip, which does not download the exploratory notebooks:
pip install spanish_chatbot
For a quickstart:
from spanish_chatbot import TransformerChatbot
chatbot = TransformerChatbot(load_quant=True, use_cuda=False)  # load pre-trained model
chatbot.evaluateOneInput('Hola')  # one input, one output
chatbot.evaluateCycle()  # cycle of inputs and outputs
- Seq2seq. For a detailed explanation in Spanish, see this blog post. Features:
- Transformer. Features:
- Weight tying
- Beam search
- Quantization: PyTorch Dynamic Quantization. Model size reduced to 41% of the original, with a 2x inference speedup. Backends supported:
- x86 CPUs with AVX2 support or higher (without AVX2 some operations have inefficient implementations)
- ARM CPUs (typically found in mobile/embedded devices)
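To make the beam search feature listed above more concrete, here is a minimal sketch in plain Python. It is not the package's implementation: the function `beam_search`, the callback `next_log_probs`, and the toy probability table are all hypothetical stand-ins for the transformer decoder, and the sketch applies no length normalization.

```python
import math


def beam_search(next_log_probs, start_token, end_token, beam_width=3, max_len=10):
    """Return (sequence, log_probability) of the best hypothesis found.

    next_log_probs(seq) is a hypothetical stand-in for the decoder: it must
    return a dict {token: log_probability} for the token that follows seq.
    """
    beams = [([start_token], 0.0)]  # hypotheses still being extended
    finished = []                   # hypotheses that produced end_token
    for _ in range(max_len):
        # Extend every live hypothesis by every possible next token.
        candidates = []
        for seq, score in beams:
            for token, logp in next_log_probs(seq).items():
                candidates.append((seq + [token], score + logp))
        # Keep only the beam_width highest-scoring extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == end_token:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    # Hypotheses cut off by max_len still compete with the finished ones.
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])


# Toy next-token probabilities (assumption, for illustration only).
TABLE = {
    "<s>": {"hola": 0.6, "buenos": 0.4},
    "hola": {"</s>": 0.5, "amigo": 0.5},
    "amigo": {"</s>": 1.0},
    "buenos": {"dias": 1.0},
    "dias": {"</s>": 1.0},
}


def toy_next_log_probs(seq):
    # Only the last token matters in this toy model.
    return {tok: math.log(p) for tok, p in TABLE[seq[-1]].items()}


best_seq, best_score = beam_search(toy_next_log_probs, "<s>", "</s>", beam_width=2)
print(best_seq)  # ['<s>', 'buenos', 'dias', '</s>']
```

With `beam_width=1` the search becomes greedy and settles on `['<s>', 'hola', '</s>']`, which has a lower total probability (0.3 versus 0.4). Keeping more than one hypothesis alive is what lets the search recover from a locally attractive first token.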
- For training:
- Download the dataset from here (2 GB) and put it in /data
- Generate data with python pre_processing.py. Arguments:
- --lines: number of lines from the original dataset to be processed. Default 500_00
- --max_len: maximum length of a sentence. Default: 40
- --min_count: minimum number of occurrences a word needs to be kept in the vocabulary. Default: 10
- Run the training notebook to train and evaluate the model
For a detailed explanation of the processing, see the notebook.
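As a rough illustration of what the --max_len and --min_count arguments control, the sketch below filters question/answer pairs in plain Python. It does not reproduce pre_processing.py: the helper names `build_vocab` and `filter_pairs` are hypothetical, and it assumes that length is measured in whitespace-separated words and that pairs containing a rare word are dropped rather than mapped to an unknown token.

```python
from collections import Counter


def build_vocab(pairs, min_count):
    """Keep the words that occur at least min_count times across all pairs."""
    counts = Counter(
        word
        for question, answer in pairs
        for word in question.split() + answer.split()
    )
    return {word for word, count in counts.items() if count >= min_count}


def filter_pairs(pairs, max_len, vocab):
    """Drop pairs that are too long or that contain out-of-vocabulary words."""
    kept = []
    for question, answer in pairs:
        q_words, a_words = question.split(), answer.split()
        # Assumption: max_len is a word count applied to both sides.
        if len(q_words) > max_len or len(a_words) > max_len:
            continue
        if all(word in vocab for word in q_words + a_words):
            kept.append((question, answer))
    return kept


# Tiny made-up corpus, for illustration only.
pairs = [
    ("hola", "hola amigo"),
    ("hola", "que tal"),
    ("hola amigo", "hola"),
]
vocab = build_vocab(pairs, min_count=2)
print(sorted(vocab))                            # ['amigo', 'hola']
print(filter_pairs(pairs, max_len=2, vocab=vocab))
```

Raising min_count shrinks the vocabulary and removes more pairs, while lowering max_len removes long sentences; both trade coverage for a smaller model input space.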
- For evaluation:
- Download the parameters for the seq2seq model, the full transformer model, or the quantized transformer model and uncompress them into ./data.
- Run the evaluation notebook.
- PyTorch tutorial, used as the base of the model. Link
- OpenSubtitles and their collection of movie subtitle datasets in every language.