Code for the paper
PDCDE : Patent Document Clustering with Deep Embeddings
Jaeyoung Kim, Janghyeok Yoon, Eunjeong Park, Sungchul Choi
https://www.researchgate.net/publication/325251122_Patent_Document_Clustering_with_Deep_Embeddings
KIPRIS dataset
- The KIPRIS dataset consists of abstracts from five categories of US patents.
- Categories : car, cameras, CPUs, memory, graphics.
The combination used in the paper
- Task 1 : car-camera (less relevant classes)
- Task 2 : memory-cpu (relevant classes)
- Task 3 : car, camera, cpu, memory, graphics.
- The 3-categories task uses the KISTA dataset; we will add this dataset soon.
- Tensorflow 1.4.0
- Keras 2.2.0
- nltk 3.3
- pandas 0.23.0
- scikit-learn 0.19.1
#python2
$ pip install -r requirments.txt
#python3
$ pip3 install -r requirments.txt
- "category" is one of: car_camera, memory_cpu, 5_categories
$ python embedding_patent.py --dataset "category"
$ python train.py --dataset "category"
$ python train.py --dataset "category" --task test
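The three commands above can also be driven from a small Python script. The following is a minimal sketch and not part of the repository: the helper names `build_pipeline_commands` and `run_pipeline` are hypothetical, and it assumes the scripts are run from the repository root with `python` on the PATH. It uses only the `--dataset` and `--task` flags shown above.

```python
import subprocess

# Dataset names listed in this README.
DATASETS = ("car_camera", "memory_cpu", "5_categories")


def build_pipeline_commands(dataset):
    """Return the embedding, training, and test commands for one dataset.

    Hypothetical helper: it only assembles the command lines shown above.
    """
    if dataset not in DATASETS:
        raise ValueError("unknown dataset: %r" % (dataset,))
    return [
        ["python", "embedding_patent.py", "--dataset", dataset],
        ["python", "train.py", "--dataset", dataset],
        ["python", "train.py", "--dataset", dataset, "--task", "test"],
    ]


def run_pipeline(dataset):
    """Run the three steps in order, stopping at the first failure."""
    for command in build_pipeline_commands(dataset):
        subprocess.check_call(command)


if __name__ == "__main__":
    # Print the commands instead of running them; call run_pipeline() to execute.
    for command in build_pipeline_commands("car_camera"):
        print(" ".join(command))
```

Calling `run_pipeline("memory_cpu")` would run embedding, training, and testing in sequence for Task 2.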
- dataset: dataset categories to use. One of {"car_camera", "memory_cpu", "5_categories"}.
- save_embedding_vector: path to the embedding vectors.
- save_weight_path: path to the trained weights.
- dataset_path: path to the KIPRIS dataset. Default is ./dataset.
- window_size: Doc2Vec window size. Default is 5.
- embedding_size: embedding vector dimension. Default is 50.
- doc_initializer: Doc2Vec word and document initializer. Default is uniform.
- negative_sample: number of negative samples used for the NCE loss. Default is 5.
- doc_lr: Doc2Vec initial learning rate. Default is 0.01.
- doc_batch_size: Doc2Vec batch size. Default is 256.
- doc_epochs: Doc2Vec epochs. Default is 500.
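The general and Doc2Vec options listed above could be declared with `argparse` roughly as follows. This is a sketch under assumptions, not the repository's actual parser: only `--dataset` appears as a command-line flag in this README, so the other flag spellings (`--window_size` and so on) are assumed from the option names. `save_embedding_vector` and `save_weight_path` are left out because no defaults are stated for them.

```python
import argparse


def build_parser():
    """Declare the general and Doc2Vec options with the defaults listed above."""
    parser = argparse.ArgumentParser(description="Doc2Vec embedding options (sketch)")
    # General options.
    parser.add_argument("--dataset", required=True,
                        choices=["car_camera", "memory_cpu", "5_categories"])
    parser.add_argument("--dataset_path", default="./dataset")
    # Doc2Vec options; flag names are assumed to match the option names.
    parser.add_argument("--window_size", type=int, default=5)
    parser.add_argument("--embedding_size", type=int, default=50)
    parser.add_argument("--doc_initializer", default="uniform")
    parser.add_argument("--negative_sample", type=int, default=5)
    parser.add_argument("--doc_lr", type=float, default=0.01)
    parser.add_argument("--doc_batch_size", type=int, default=256)
    parser.add_argument("--doc_epochs", type=int, default=500)
    return parser


if __name__ == "__main__":
    # Parse an explicit argument list so the sketch runs without a command line.
    args = build_parser().parse_args(["--dataset", "car_camera"])
    print(vars(args))
```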
- dec_batch_size: DEC model batch size. Default is 256.
- dec_lr: DEC initial learning rate. Default is 0.001.
- dec_decay_step: step decay every n epochs.
- layerwise_pretrain_iters: layer-wise weight pretraining iterations (greedy layer-wise autoencoder). Default is 5000.
- finetune_iters: fine-tuning iterations after layer-wise weight pretraining. Default is 5000.
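`dec_decay_step` describes a step decay of the DEC learning rate. The sketch below shows one common form of such a schedule. The function name `step_decay_lr` and the decay factor `decay_rate=0.1` are assumptions for illustration; this README does not state the factor or the default step, so the repository's schedule may differ.

```python
def step_decay_lr(initial_lr, epoch, decay_step, decay_rate=0.1):
    """Learning rate at `epoch` under a step decay schedule.

    The rate is multiplied by `decay_rate` once every `decay_step` epochs.
    `decay_rate` is an assumed value, not taken from the repository.
    """
    if decay_step <= 0:
        raise ValueError("decay_step must be positive")
    return initial_lr * (decay_rate ** (epoch // decay_step))


if __name__ == "__main__":
    # 0.001 is the documented dec_lr default; decay_step=20 is an arbitrary example.
    for epoch in (0, 19, 20, 40):
        print(epoch, step_decay_lr(0.001, epoch, decay_step=20))
```

With these example values the rate stays at its initial value for the first 20 epochs and drops by a factor of ten at each multiple of 20.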


