
Commit 7ebf7cf

updated docs, minor renames
1 parent 6413d75 commit 7ebf7cf

File tree

8 files changed, +210 -202 lines changed

README.md

Lines changed: 11 additions & 3 deletions
@@ -7,6 +7,13 @@ Leverage the power of NLP Topic Modeling, Semantic Similarity and Network analys
 - Please leave a ⭐ to let me know it has been useful to you so that I can dedicate more of my time working on it.
 
 ## Install
+- Highly recommend to install in a conda environment
+```
+conda create -n stripnet python=3.8 jupyterlab -y
+conda activate stripnet
+```
+
+- Pip install this library
 ```
 pip install stripnet
 ```
@@ -47,7 +54,7 @@ stripnet.fit_transform(data['text'])
 - The plot is fully interactive too! Hovering over any bar shows the relevant information of the paper.
 
 ```
-stripnet.most_important()
+stripnet.most_important_docs()
 ```
 
 ![Most Important Text](https://github.com/stephenleo/stripnet/blob/main/images/centrality.png?raw=true "Most Important Papers")
@@ -63,5 +70,6 @@ STriP Net stands on the shoulder of giants and several prior work. The most nota
 # Buy me a coffee
 If this work helped you in any way, please consider the following way to give me feedback so I can spend more time on this project
 1. ⭐ this repository
-2. ❤️ [the Huggingface space ](https://huggingface.co/spaces/stephenleo/strip)
-3.[Buy me a Coffee!](https://www.buymeacoffee.com/stephenleo)
+2. ❤️ [the Huggingface space ](https://huggingface.co/spaces/stephenleo/strip) (Coming Jan 11 2022!)
+3. 👏 [the Medium post](https://stephen-leo.medium.com/) (Coming End Jan 2022!)
+4.[Buy me a Coffee!](https://www.buymeacoffee.com/stephenleo)
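
The README diff above tracks this commit's rename of `stripnet.most_important()` to `stripnet.most_important_docs()`. The commit renames the method outright, with no alias; the sketch below is purely hypothetical and illustrates how a library could keep the old name working across such a rename (the `STriP` body here is a stub, not the real implementation):

```python
import warnings

# Hypothetical sketch only: this shim is NOT part of the stripnet commit.
# It shows one way to absorb a rename like most_important -> most_important_docs.
class STriP:
    def most_important_docs(self, topn: int = 5):
        # Stub standing in for the real method, which surfaces the most
        # central documents in the STriP network.
        return [f"doc_{i}" for i in range(topn)]

    def most_important(self, topn: int = 5):
        # Deprecated alias so code written against the old name keeps running.
        warnings.warn(
            "most_important() was renamed to most_important_docs()",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.most_important_docs(topn)
```

Since no such alias ships in 0.0.5, callers of `stripnet.most_important()` must update to `stripnet.most_important_docs()` when upgrading.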

notebook/stripnet.html

Lines changed: 0 additions & 108 deletions
This file was deleted.
File renamed without changes.
Lines changed: 38 additions & 68 deletions
@@ -94,23 +94,23 @@
 "name": "stderr",
 "output_type": "stream",
 "text": [
-"2022-01-05 20:42:12 INFO: Load pretrained SentenceTransformer: allenai-specter\n",
-"2022-01-05 20:42:36 INFO: Use pytorch device: cuda\n",
-"2022-01-05 20:42:36 INFO: Missing data detected. Dropping them\n",
-"2022-01-05 20:42:36 INFO: ========== Step1: Calculating Embeddings ==========\n",
-"Batches: 100%|██████████| 3/3 [00:02<00:00, 1.17it/s]\n",
-"2022-01-05 20:42:41 INFO: ========== Step2: Topic modeling ==========\n",
-"2022-01-05 20:42:41 INFO: Initializing the topic model\n",
-"2022-01-05 20:42:41 INFO: Training the topic model\n",
-"2022-01-05 20:42:50,425 - BERTopic - Reduced dimensionality with UMAP\n",
-"2022-01-05 20:42:50,437 - BERTopic - Clustered UMAP embeddings with HDBSCAN\n",
-"2022-01-05 20:42:50 INFO: Populating Topic Results\n",
-"2022-01-05 20:42:50 INFO: ========== Step3: STriP Network ==========\n",
-"2022-01-05 20:42:50 INFO: Cosine similarity\n",
-"2022-01-05 20:42:50 INFO: Calculating optimal threshold\n",
-"2022-01-05 20:42:50 INFO: Number of connections: 126\n",
-"2022-01-05 20:42:50 INFO: Calculating Network Plot\n",
-"2022-01-05 20:42:50 INFO: ========== Model Fit Successfully! ==========\n"
+"2022-01-06 12:16:44 INFO: Load pretrained SentenceTransformer: allenai-specter\n",
+"2022-01-06 12:17:07 INFO: Use pytorch device: cuda\n",
+"2022-01-06 12:17:07 INFO: Missing data detected. Dropping them\n",
+"2022-01-06 12:17:07 INFO: ========== Step1: Calculating Embeddings ==========\n",
+"Batches: 100%|██████████| 3/3 [00:02<00:00, 1.11it/s]\n",
+"2022-01-06 12:17:12 INFO: ========== Step2: Topic modeling ==========\n",
+"2022-01-06 12:17:12 INFO: Initializing the topic model\n",
+"2022-01-06 12:17:12 INFO: Training the topic model\n",
+"2022-01-06 12:17:21,291 - BERTopic - Reduced dimensionality with UMAP\n",
+"2022-01-06 12:17:21,304 - BERTopic - Clustered UMAP embeddings with HDBSCAN\n",
+"2022-01-06 12:17:21 INFO: Populating Topic Results\n",
+"2022-01-06 12:17:21 INFO: ========== Step3: STriP Network ==========\n",
+"2022-01-06 12:17:21 INFO: Cosine similarity\n",
+"2022-01-06 12:17:21 INFO: Calculating optimal threshold\n",
+"2022-01-06 12:17:21 INFO: Number of connections: 126\n",
+"2022-01-06 12:17:21 INFO: Calculating Network Plot\n",
+"2022-01-06 12:17:21 INFO: ========== Model Fit Successfully! ==========\n"
 ]
 },
 {
@@ -139,7 +139,7 @@
 "name": "stderr",
 "output_type": "stream",
 "text": [
-"2022-01-05 20:07:00 INFO: Calculating Network Centrality\n"
+"2022-01-06 12:18:23 INFO: Calculating Network Centrality\n"
 ]
 },
 {
@@ -161,6 +161,9 @@
 [
 "Want To Reduce Labeling Cost? GPT-3 Can<br>Help<br><br>Data annotation is a time-consuming<br>and labor-intensive process for many NLP tasks.<br>Although there exist various methods to produce<br>pseudo data labels, they are often task-specific<br>and require a decent amount of labeled data to<br>start with. Recently, the immense language model<br>GPT-3 with 175 billion parameters has achieved<br>tremendous improvement across many few-shot<br>learning tasks. In this paper, we explore ways to<br>..."
 ],
+[
+"FNet: Mixing Tokens with Fourier<br>Transforms<br><br>We show that Transformer encoder<br>architec-tures can be massively sped up, with<br>limited accuracy costs, by replacing the self-<br>attention sublayers with simple linear<br>transformations that \"mix\" input tokens. These<br>linear transformations , along with simple<br>nonlinearities in feed-forward layers, are<br>sufficient to model semantic relationships in<br>several text classification tasks. Perhaps most<br>surprisingly, we find that ..."
+],
 [
 "Neural Machine Translation of Rare Words with<br>Subword Units<br><br>Neural machine translation<br>(NMT) models typically operate with a fixed<br>vocabulary , but translation is an open-vocabulary<br>problem. Previous work addresses the translation<br>of out-of-vocabulary words by backing off to a<br>dictionary. In this paper , we introduce a simpler<br>and more effective approach, making the NMT model<br>capable of open-vocabulary translation by encoding<br>rare and unknown words as sequences ..."
 ]
@@ -183,13 +186,15 @@
 0.10673251529415916,
 0.09934333324744282,
 0.07184737801176154,
+0.05168250442223043,
 0.042019416676950916
 ],
 "xaxis": "x",
 "y": [
 "5",
 "24",
 "59",
+"8",
 "7"
 ],
 "yaxis": "y"
@@ -200,9 +205,6 @@
 [
 "An Image is Worth 16x16 Words: Transformers for<br>Image Recognition at Scale<br><br>While the<br>Transformer architecture has become the de-facto<br>standard for natural language processing tasks,<br>its applications to computer vision remain<br>limited. In vision, attention is either applied in<br>conjunction with convolutional networks, or used<br>to replace certain components of convolutional<br>networks while keeping their overall structure in<br>place. We show that this reliance on CNNs is..."
 ],
-[
-"Unsupervised Data Augmentation for Consistency<br>Training<br><br>Semi-supervised learning lately<br>has shown much promise in improving deep learning<br>models when labeled data is scarce. Common among<br>recent approaches is the use of consistency<br>training on a large amount of unlabeled data to<br>constrain model predictions to be invariant to<br>input noise. In this work, we present a new<br>perspective on how to effectively noise unlabeled<br>examples and argue that the quality of noising..."
-],
 [
 "The 2021 Image Similarity Dataset and<br>Challenge<br><br>This paper introduces a new<br>benchmark for large-scale image similarity<br>detection. This benchmark is used for the Image<br>Similarity Challenge at NeurIPS'21 (ISC2021). The<br>goal is to determine whether a query image is a<br>modified copy of any image in a reference corpus<br>of size 1~million. The benchmark features a<br>variety of image transformations such as automated<br>transformations, hand-crafted image edits and<br>machine-..."
 ],
@@ -213,31 +215,29 @@
 "Learning Transferable Visual Models From Natural<br>Language Supervision<br><br>State-of-the-art<br>computer vision systems are trained to predict a<br>fixed set of predetermined object categories. This<br>restricted form of supervision limits their<br>generality and usability since additional labeled<br>data is needed to specify any other visual<br>concept. Learning directly from raw text about<br>images is a promising alternative which leverages<br>a much broader source of supervision. We<br>d..."
 ]
 ],
-"hovertemplate": "Topic_Name=image, vision, learning, visual<br>Betweenness Centrality=%{x}<br>index=%{y}<br>Text=%{customdata[0]}<extra></extra>",
-"legendgroup": "image, vision, learning, visual",
+"hovertemplate": "Topic_Name=image, learning, contrastive, vision<br>Betweenness Centrality=%{x}<br>index=%{y}<br>Text=%{customdata[0]}<extra></extra>",
+"legendgroup": "image, learning, contrastive, vision",
 "marker": {
 "color": "#EF553B",
 "pattern": {
 "shape": ""
 }
 },
-"name": "image, vision, learning, visual",
-"offsetgroup": "image, vision, learning, visual",
+"name": "image, learning, contrastive, vision",
+"offsetgroup": "image, learning, contrastive, vision",
 "orientation": "h",
 "showlegend": true,
 "textposition": "auto",
 "type": "bar",
 "x": [
 0.07375869019704637,
-0.06015364679748239,
 0.059237319511292116,
 0.053904344657769304,
 0.03623643212684308
 ],
 "xaxis": "x",
 "y": [
 "57",
-"18",
 "42",
 "1",
 "34"
@@ -248,29 +248,29 @@
 "alignmentgroup": "True",
 "customdata": [
 [
-"FNet: Mixing Tokens with Fourier<br>Transforms<br><br>We show that Transformer encoder<br>architec-tures can be massively sped up, with<br>limited accuracy costs, by replacing the self-<br>attention sublayers with simple linear<br>transformations that \"mix\" input tokens. These<br>linear transformations , along with simple<br>nonlinearities in feed-forward layers, are<br>sufficient to model semantic relationships in<br>several text classification tasks. Perhaps most<br>surprisingly, we find that ..."
+"Unsupervised Data Augmentation for Consistency<br>Training<br><br>Semi-supervised learning lately<br>has shown much promise in improving deep learning<br>models when labeled data is scarce. Common among<br>recent approaches is the use of consistency<br>training on a large amount of unlabeled data to<br>constrain model predictions to be invariant to<br>input noise. In this work, we present a new<br>perspective on how to effectively noise unlabeled<br>examples and argue that the quality of noising..."
 ]
 ],
-"hovertemplate": "Topic_Name=image, matching, similarity, copy<br>Betweenness Centrality=%{x}<br>index=%{y}<br>Text=%{customdata[0]}<extra></extra>",
-"legendgroup": "image, matching, similarity, copy",
+"hovertemplate": "Topic_Name=learning, image, titles, product<br>Betweenness Centrality=%{x}<br>index=%{y}<br>Text=%{customdata[0]}<extra></extra>",
+"legendgroup": "learning, image, titles, product",
 "marker": {
 "color": "#00cc96",
 "pattern": {
 "shape": ""
 }
 },
-"name": "image, matching, similarity, copy",
-"offsetgroup": "image, matching, similarity, copy",
+"name": "learning, image, titles, product",
+"offsetgroup": "learning, image, titles, product",
 "orientation": "h",
 "showlegend": true,
 "textposition": "auto",
 "type": "bar",
 "x": [
-0.05168250442223043
+0.06015364679748239
 ],
 "xaxis": "x",
 "y": [
-"8"
+"18"
 ],
 "yaxis": "y"
 }
@@ -1138,36 +1138,7 @@
 }
 ],
 "source": [
-"stripnet.most_important()"
-]
-},
-{
-"cell_type": "code",
-"execution_count": 3,
-"metadata": {},
-"outputs": [
-{
-"data": {
-"text/plain": [
-"['bertopic==0.9.4',\n",
-" 'networkx==2.6.3',\n",
-" 'numpy==1.22.0',\n",
-" 'pandas==1.3.5',\n",
-" 'plotly==5.5.0',\n",
-" 'pyvis==0.1.9',\n",
-" 'scikit_learn==1.0.2',\n",
-" 'sentence_transformers==2.1.0',\n",
-" 'setuptools==58.0.4']"
-]
-},
-"execution_count": 3,
-"metadata": {},
-"output_type": "execute_result"
-}
-],
-"source": [
-"import pathlib\n",
-"pathlib.Path(\"../requirements.txt\").read_text().splitlines()"
+"stripnet.most_important_docs()"
 ]
 },
 {
@@ -1183,7 +1154,7 @@
 "hash": "165d1ae889830a583229da7bcb4f0175182080283a5d782889056a279531f3b2"
 },
 "kernelspec": {
-"display_name": "Python 3.8.12 64-bit ('stripnet': conda)",
+"display_name": "Python 3 (ipykernel)",
 "language": "python",
 "name": "python3"
 },
@@ -1198,9 +1169,8 @@
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
 "version": "3.8.12"
-},
-"orig_nbformat": 4
+}
 },
 "nbformat": 4,
-"nbformat_minor": 2
+"nbformat_minor": 4
 }

notebooks/stripnet.html

Lines changed: 108 additions & 0 deletions
Large diffs are not rendered by default.

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -7,3 +7,4 @@ pyvis==0.1.9
 scikit_learn==1.0.2
 sentence_transformers==2.1.0
 setuptools==58.0.4
+ipywidgets==7.6.5

setup.py

Lines changed: 3 additions & 2 deletions
@@ -12,11 +12,12 @@
 'plotly==5.5.0',
 'pyvis==0.1.9',
 'scikit_learn==1.0.2',
-'sentence_transformers==2.1.0']
+'sentence_transformers==2.1.0',
+'ipywidgets==7.6.5']
 
 setuptools.setup(
 name="stripnet",
-version="0.0.4",
+version="0.0.5",
 author="stephenleo",
 author_email="stephen.leo87@gmail.com",
 description="STriP Net: Semantic Similarity of Scientific Papers (S3P) Network",
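
Both `requirements.txt` and the `install_requires` list above use exact `name==version` pins (e.g. `ipywidgets==7.6.5`). A minimal sketch of splitting such pins into name/version pairs; the `parse_pin` helper is hypothetical, not part of the repo:

```python
def parse_pin(requirement: str) -> tuple[str, str]:
    """Split an exact 'name==version' pin into (name, version)."""
    name, _, version = requirement.partition("==")
    return name, version

# The two pins this commit touches
pins = ["sentence_transformers==2.1.0", "ipywidgets==7.6.5"]
parsed = dict(parse_pin(p) for p in pins)
# parsed holds {'sentence_transformers': '2.1.0', 'ipywidgets': '7.6.5'}
```

Exact pins like these make installs reproducible at the cost of blocking compatible upgrades, which is why the commit must bump both files together.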
