The repository contains the code and notebooks for the tutorials on how to extract embedding features from pictures using the ResNeXt model. The quality and effectiveness of the techniques are demonstrated by the clustering in the embedding space and the correlation of the clusters with their corresponding labels.

gm-spacagna/docem

Document Embedding (DocEm)

The repository contains the code and notebooks for tutorials on how to:

  1. Extract embedding features from COCO pictures using the ResNeXt model developed by Facebook AI.
  2. Visualize the picture embedding vectors in a 3D space using PCA and t-SNE.
  3. Find the nearest neighbors of each picture based on the cosine distance.
  4. Reduce the dimensionality of the embedding space while preserving manifold structures using UMAP.
  5. Find the optimal number of GMM clusters using the BIC elbow method and silhouette analysis.
  6. Visualize the pictures closest to each centroid to identify the cluster topic.
  7. Apply an adapted version of the p-SIF (partition averaging) algorithm to produce document embeddings from the bag-of-words model and the original picture embedding vectors.
  8. Test the effectiveness of the proposed method against the baseline methods for document averaging (weighted averaging and TF-IDF).
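
As an illustration of the nearest-neighbor step, here is a minimal, dependency-free sketch of a cosine-distance lookup. It is not the code from the notebooks: the function names (`cosine_distance`, `nearest_neighbors`) and the 3-dimensional toy vectors are assumptions made for this example, and real ResNeXt embeddings would have far more dimensions and would normally be handled with NumPy or scikit-learn.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cos(u, v): 0 for the same direction, 2 for opposite directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def nearest_neighbors(
    embeddings: Dict[str, Sequence[float]], query_id: str, k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k pictures closest to query_id, sorted by ascending distance."""
    query = embeddings[query_id]
    distances = [
        (other_id, cosine_distance(query, vector))
        for other_id, vector in embeddings.items()
        if other_id != query_id  # a picture is not its own neighbor
    ]
    distances.sort(key=lambda pair: pair[1])
    return distances[:k]


# Hypothetical toy embeddings, invented for this example only.
toy_embeddings = {
    "cat_1": [1.0, 0.1, 0.0],
    "cat_2": [0.9, 0.2, 0.0],
    "car_1": [0.0, 0.1, 1.0],
}

neighbors = nearest_neighbors(toy_embeddings, "cat_1", k=2)
print(neighbors)
```

Cosine distance ignores the vector magnitudes and compares directions only, which is why two pictures with similar content can be close even when their embedding norms differ.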

Overview of the p-SIF algorithm

Original paper: P-SIF: Document Embeddings Using Partition Averaging, V. Gupta et al.

[Figure: p-SIF algorithm overview diagram]
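
The core idea of partition averaging can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the adapted version used in this repository: it assumes each item (a picture, playing the role of a word) already has an embedding vector and a soft membership over K partitions (for example GMM posterior probabilities), and it leaves out the SIF weighting and common-component removal described in the original paper. The function name `partition_average` is hypothetical.

```python
from typing import List, Sequence


def partition_average(
    item_vectors: Sequence[Sequence[float]],
    memberships: Sequence[Sequence[float]],
) -> List[float]:
    """Build a document vector of size K * dim.

    Block k of the result holds the average of the item vectors,
    each one scaled by that item's membership in partition k.
    """
    if not item_vectors or len(item_vectors) != len(memberships):
        raise ValueError("need one membership row per item vector")

    dim = len(item_vectors[0])
    num_partitions = len(memberships[0])
    document = [0.0] * (num_partitions * dim)

    for vector, weights in zip(item_vectors, memberships):
        for k, weight in enumerate(weights):
            offset = k * dim  # start of the block for partition k
            for j, value in enumerate(vector):
                document[offset + j] += weight * value

    count = len(item_vectors)
    return [value / count for value in document]


# Two items of dimension 2, each assigned fully to a different partition.
doc_vector = partition_average(
    item_vectors=[[1.0, 0.0], [0.0, 2.0]],
    memberships=[[1.0, 0.0], [0.0, 1.0]],
)
print(doc_vector)  # [0.5, 0.0, 0.0, 1.0]
```

Unlike a plain average, which would blend both items into one 2-dimensional vector, the concatenated blocks keep the contribution of each partition separate, at the cost of a K times larger document vector.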

Read more

Articles of the "Embed, Cluster, Average" series:

Experiment yourself

You can view and execute the development notebook in Colab:

Open In Colab
