This repository contains the code and notebooks for tutorials on how to:
- Extract embedding features from COCO pictures using the ResNeXt model developed by Facebook AI.
- Visualize the picture embedding vectors in 3D space using PCA and t-SNE.
- Find the nearest neighbors of each picture based on cosine distance.
- Reduce the dimensionality of the embedding space while preserving manifold structure using UMAP.
- Find the optimal number of GMM clusters using the BIC elbow method and silhouette analysis.
- Visualize the pictures closest to each centroid to identify the cluster topic.
- Apply an adapted version of the P-SIF (partition averaging) algorithm to produce document embeddings from the bag-of-words model and the original picture embedding vectors.
- Test the effectiveness of the proposed method against baseline methods for document averaging (weighted averaging and TF-IDF).
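
As a rough illustration of the nearest-neighbor step listed above, here is a minimal NumPy sketch of a cosine-distance lookup. It is not the code from the notebooks: the function name `cosine_nearest_neighbors` and the toy 2D vectors are assumptions made for this example, and real picture embeddings have far more dimensions.

```python
import numpy as np


def cosine_nearest_neighbors(embeddings, k=5):
    """Return the indices of the k nearest rows for each row, by cosine distance.

    Hypothetical helper for illustration; `embeddings` has shape (n, d).
    """
    x = np.asarray(embeddings, dtype=float)
    # Normalize each row to unit length (guarding against zero vectors).
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)
    # Cosine distance = 1 - cosine similarity.
    distances = 1.0 - unit @ unit.T
    # Exclude each picture from its own neighbor list.
    np.fill_diagonal(distances, np.inf)
    return np.argsort(distances, axis=1)[:, :k]


# Toy data: rows 0 and 1 point roughly the same way, as do rows 2 and 3.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
neighbors = cosine_nearest_neighbors(toy_embeddings, k=1)
print(neighbors[:, 0].tolist())
```

The full pairwise matrix costs O(n²) memory, so a sketch like this only suits small subsets; for a dataset the size of COCO, an approximate nearest-neighbor index would likely be a better fit.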
Original paper: P-SIF: Document Embeddings Using Partition Averaging, V. Gupta et al.
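
The general idea of partition averaging can be sketched as follows: average the item vectors separately within each cluster (partition), then concatenate the per-cluster averages into one document vector. This is a simplified sketch of that idea only, not the adapted algorithm from the notebooks or the exact formulation of the paper. The function name `partition_average`, the soft `memberships` matrix (for example, GMM posterior probabilities), and the optional per-item `weights` are all assumptions made for illustration.

```python
import numpy as np


def partition_average(item_vectors, memberships, weights=None):
    """Concatenate per-partition weighted averages of the item vectors.

    Hypothetical helper for illustration.
    item_vectors: (n, d) embeddings of the items in one document.
    memberships:  (n, k) soft assignment of each item to k partitions.
    weights:      (n,) optional per-item weights (uniform if omitted).
    Returns a vector of length k * d.
    """
    vectors = np.asarray(item_vectors, dtype=float)
    memberships = np.asarray(memberships, dtype=float)
    if weights is None:
        weights = np.ones(vectors.shape[0])
    weights = np.asarray(weights, dtype=float)
    # (k, n) @ (n, d) -> (k, d): weighted sum of item vectors per partition.
    per_partition = (memberships * weights[:, None]).T @ vectors
    per_partition /= weights.sum()
    # Flatten the k partition vectors into a single document embedding.
    return per_partition.reshape(-1)


# Toy document: three items in 2D, two partitions, the last item split evenly.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_memberships = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_embedding = partition_average(toy_vectors, toy_memberships)
print(doc_embedding.shape)
```

Unlike a plain weighted average, which collapses everything into `d` dimensions, this layout keeps one `d`-dimensional slot per partition, so items from different clusters do not cancel each other out.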
Algorithm overview diagram:
Articles of the "Embed, Cluster, Average" series:
- Extracting rich embedding features from COCO pictures using PyTorch and ResNeXt-WSL
- Manifold clustering in the embedding space using UMAP and GMM
- A novel approach to Document Embedding using Partition Averaging on Bag of Words (soon to be published)
You can view and execute the development notebook in Colab:
