Reorganise the Repo #55

Open
NirantK wants to merge 6 commits into dev from dataset


Owner

@NirantK NirantK commented Feb 1, 2021

No description provided.

radhikasethi2011 and others added 5 commits February 1, 2021 18:36
* Add WanDB to Hinglish (#48)

* Adding SentencePiece

* Rearrange

* Adding finetuning scripts

* Making dir structure more managable

* Add Random Search

* Add majority voting

* Strip notebooks

* Add nb stripout from fastai standalone version

* Move everything to one notebook

* Change the name of the file saved

* Change the parameters

* Combine common code

* Pass "name" to methods in hinglish utils

* fix imports

* Pass name as variable to add_padding

* Remove output

* Fix typo

* Change names for ensemble models

* Change from "output" to name of the LM model

* Fix typo

* Remove hardcoding for epochs

* isort and black :)

* Move everything to hinglish

* Coffe isn't good for me

* Break everything into sensible methods

* import clash

* fix run_valid

* Change notebook according to hinglish.py

* Change logs

* Fix imports

* Change method names

* Remove tf dependency

* Remove tf from requirements.txt

* Remove hardcoding from pd columns

* Use store_attr to load variables for class

* Fix the size mismatch error by changing final_test.json file

* nb-stripout worked

* Add majority voting explanation

* extract if tarfile and run_language_modeling documentation

* Split the transformers notebook

* Something broke, I don't know what.

* Remove setup

* Things would be easier if I knew OOP or Python better

* Will fix this later is this works¿

* nan sent¿

* Print eval and test metrics

* Change the label for eval

* Change the file with empty clean_text

* Remove eval testing for now

* Logfile name

* moving the part which copies things to drive here

* Fix formatting add pathlib

* add drivepath

* Changed the file paths

* Remove additional code

* Update model performance after reproducing code (#40)

- Reproduced on Monday 29th December. 
- [Training Files](https://drive.google.com/drive/folders/12qEbxbefBY24-YqahVV0v7q_IFyxz3L8?usp=sharing)
- [Model/Output files](https://drive.google.com/drive/folders/1x-6klxSJEQu5gUOR1zHUHyHjHrApKmRD?usp=sharing)

* Strip NB

* Black and Isort all Python code

* Squashed commit of the following (Ref wandb):

commit 733d6eb
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Tue Oct 13 14:18:24 2020 +0530

    Add sweeps for wandb

commit 2f10366
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Tue Oct 13 12:45:24 2020 +0530

    Add command line training

commit 9f3680d
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Tue Oct 13 00:48:44 2020 +0530

    Removing print statements for the variables which are being logged by wandb

commit e4d15be
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Tue Oct 13 00:35:54 2020 +0530

    Add wname

commit e3c9896
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 23:54:18 2020 +0530

    Change timestamp logic

commit a80379d
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 23:50:41 2020 +0530

    Add configs to wandb

commit 77024af
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 23:43:28 2020 +0530

    Add hyperparameters

commit 5aeb78f
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 22:52:08 2020 +0530

    Need to push before power cuts

commit 738f1fb
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 22:47:27 2020 +0530

    We need apex?

commit 37361b0
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 22:46:05 2020 +0530

    transformers??

commit f9f9a11
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 22:43:29 2020 +0530

    Clean notebooks

commit 3d6f182
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 22:41:45 2020 +0530

    Fix requirements

commit fec8e54
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 22:39:13 2020 +0530

    Edit requirements and typo

commit eaf18d3
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 22:35:13 2020 +0530

    Wandb notebook changes

commit b35d72f
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 22:32:58 2020 +0530

    Add wandb

commit 9ad9372
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 22:13:16 2020 +0530

    Add final metrics to wandb too

commit 129a6ab
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 21:43:58 2020 +0530

    Remove stray print statement

commit 66009c6
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 21:39:29 2020 +0530

    Remove redundant print statements

commit 4497615
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 21:31:42 2020 +0530

    Change valid logging

commit 05e7fe8
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 21:24:13 2020 +0530

    Make logging per epoch

commit 5709ff3
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 21:15:22 2020 +0530

    Log traning loss per batch

commit 0c63488
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 21:14:22 2020 +0530

    Print metrics too

commit 12115e8
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 21:11:42 2020 +0530

    Change run valid

commit 8b76118
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 21:03:52 2020 +0530

    Init wandb only once and run valid at every batch

commit 3a6742a
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 20:50:54 2020 +0530

    Change name of logging variables

commit 435ecc4
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 20:46:59 2020 +0530

    save all output files created

commit e898996
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 20:40:20 2020 +0530

    Only log metrics for wandb

commit fc997e9
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 20:23:53 2020 +0530

    Make wandb logs to dict

commit 07c8e5f
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 20:22:00 2020 +0530

    Undo run_lm changes

commit 9fbc95e
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 19:58:30 2020 +0530

    Remove logger

commit 18a5fc1
Author: meghanabhange <meghanabhange13@gmail.com>
Date:   Mon Oct 12 19:46:54 2020 +0530

    Convert the existing logging to wandb logs

* Remove the branch switching in notebook

* Didn't remove merge conflict properly

* Put random back in the universe

* Resolve merge conflicts in hinglishutils

* Black Hinglishutils

* Fix import errors

* Extract only tarfile

* This is where tests would have helped right?

* import time

* Fix imports

* Fix imports more

Co-authored-by: meghanabhange <meghana.bhange@cumminscollege.in>
Co-authored-by: meghanabhange <34004739+meghanabhange@users.noreply.github.com>
Co-authored-by: meghanabhange <meghanabhange13@gmail.com>

* Add MIT License

* Created using Colaboratory

* graph exploration

* resolving conflicts

Co-authored-by: Nirant <NirantK@users.noreply.github.com>
Co-authored-by: meghanabhange <meghana.bhange@cumminscollege.in>
Co-authored-by: meghanabhange <34004739+meghanabhange@users.noreply.github.com>
Co-authored-by: meghanabhange <meghanabhange13@gmail.com>
@NirantK NirantK requested a review from meghanabhange February 1, 2021 18:39
@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@@ -0,0 +1,165 @@
{
Collaborator

Can we have one shared clean function for tweets that every notebook uses, instead of a new one in each notebook? This will change how the cleaned file looks, right?
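
A minimal sketch of what a single shared cleaner could look like, for example living in hinglishutils so every notebook imports the same function. The cleaning rules here (dropping URLs and @mentions, collapsing whitespace, lowercasing) and the name `clean_tweet` are assumptions for illustration; the rules the notebooks actually apply are not shown in this thread.

```python
import re

# Hypothetical cleaning rules; swap in whatever the notebooks really do.
_URL = re.compile(r"https?://\S+|www\.\S+")
_MENTION = re.compile(r"@\w+")
_WHITESPACE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Drop URLs and @mentions, collapse whitespace, and lowercase a tweet."""
    text = _URL.sub(" ", text)
    text = _MENTION.sub(" ", text)
    return _WHITESPACE.sub(" ", text).strip().lower()
```

With one function in one place, any change to the cleaning rules changes the cleaned file for all notebooks at once, which is the behaviour the question above is asking about.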

@@ -0,0 +1,165 @@
{
Collaborator

Um. Confusion is me. Where are we using data_lst again? Shouldn't this be appended to text?

@@ -0,0 +1,74 @@
{
Collaborator

A question here: is "outputs" inside the "data" dir? Are we creating this dir somewhere? Also, shouldn't we keep the code that downloads and extracts the 7z from gdrive somewhere? Otherwise this will be hard to figure out again in 5 months. Or do we not want to keep this code public?

Owner Author

Good question. We should remove this code from public right away.

@@ -0,0 +1,74 @@
{
Collaborator

Again, can we save tweet_ids.json in the data dir?

Owner Author

Yes, we should!
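
One way this could look, as a sketch: write tweet_ids.json under the data directory and create that directory on the fly, so the question of whether it exists never comes up. The function name `save_tweet_ids` and the default `data` root are assumptions, not names from the notebook.

```python
import json
from pathlib import Path


def save_tweet_ids(tweet_ids, data_dir="data", filename="tweet_ids.json"):
    """Write tweet IDs as JSON inside data_dir, creating the directory if needed."""
    out_dir = Path(data_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    out_path = out_dir / filename
    with out_path.open("w", encoding="utf-8") as f:
        json.dump(list(tweet_ids), f)
    return out_path
```

The same `mkdir(parents=True, exist_ok=True)` call would also cover an "outputs" directory nested under the data directory.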

@@ -0,0 +1,165 @@
{
Collaborator

Is this notebook repeated?

@@ -0,0 +1,903 @@
{
Collaborator

@meghanabhange meghanabhange Feb 2, 2021

@/radhika look at relative paths using pathlib. This notebook is very Colab-specific.

@@ -0,0 +1,903 @@
{
Collaborator

Avoid %cd in code directly. Instead, use the Path(data_dir)/"filename" kind of convention?
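
A small sketch of the convention suggested above: build each file path from one `data_dir` value instead of changing the working directory. The `data` root, the helper name `data_file`, and the choice of final_test.json as the example file are illustrative assumptions.

```python
from pathlib import Path

# Hypothetical location; in Colab this might be a mounted Drive folder instead.
data_dir = Path("data")


def data_file(filename, root=data_dir):
    """Return the path of filename under root without changing the working directory."""
    return Path(root) / filename


# Illustrative file name only.
test_path = data_file("final_test.json")
```

Because nothing calls `%cd`, every cell keeps working regardless of the order in which cells were run.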

@@ -0,0 +1,903 @@
{
Collaborator

Try having a consistent naming convention for the variable names. Ref: https://www.python.org/dev/peps/pep-0008/#naming-conventions

@@ -0,0 +1,903 @@
{
Collaborator

Change the name MegaTextDoc to something more readable?

@@ -0,0 +1,903 @@
{
Collaborator

Is stopwords-hinglish.txt uploaded locally?

This will be hard to reproduce for someone who doesn't have access to that file. Add the text file to Drive and use gdown there?
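
A rough sketch of the gdown route, assuming the file is shared on Drive so that anyone with the link can read it. The file ID below is a placeholder (the real one is not given in this thread), and the helper names are hypothetical.

```python
from pathlib import Path

# Placeholder only: the real Drive file ID is not given in this thread.
STOPWORDS_FILE_ID = "YOUR_FILE_ID"


def drive_url(file_id):
    """Build the direct-download URL form that gdown accepts."""
    return f"https://drive.google.com/uc?id={file_id}"


def fetch_stopwords(file_id=STOPWORDS_FILE_ID, data_dir="data"):
    """Download stopwords-hinglish.txt into data_dir unless it is already there."""
    import gdown  # third-party: pip install gdown

    out_path = Path(data_dir) / "stopwords-hinglish.txt"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    if not out_path.exists():
        gdown.download(drive_url(file_id), str(out_path), quiet=False)
    return out_path
```

Skipping the download when the file already exists keeps notebook re-runs fast and avoids hitting Drive repeatedly.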

@meghanabhange
Collaborator

Also, @NirantK, when reorganising, a lot of files will break because of the file paths. Currently they define data paths like:

```python
data_raw = datapath/"raw"
data_interim = datapath/"interim"
data_processed = datapath/"processed"
cleanlab_datapath = datapath/"cleanlab"
```

So this might require changes.
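
One possible way to limit the breakage, sketched under assumptions: define the four subdirectories once, derived from a single root, so that a reorganisation only changes the root. The class name `DataPaths` and the `data` root are hypothetical; the subdirectory names are the ones quoted above.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DataPaths:
    """All data subdirectories, derived from a single root."""

    root: Path

    @property
    def raw(self) -> Path:
        return self.root / "raw"

    @property
    def interim(self) -> Path:
        return self.root / "interim"

    @property
    def processed(self) -> Path:
        return self.root / "processed"

    @property
    def cleanlab(self) -> Path:
        return self.root / "cleanlab"


# Hypothetical root; after a reorganisation only this line should need to change.
paths = DataPaths(Path("data"))
```

Notebooks would then refer to `paths.raw`, `paths.interim`, and so on, rather than each defining its own copies.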

@meghanabhange
Collaborator

Also, notebooks like nbs/07_Transformers.ipynb will break because they import from the scripts, e.g. `from hinglishutils import get_files_from_gdrive`, so those imports would also need changing. Jupyter notebooks also act very moody with imports from parallel directories.
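
A common workaround for the parallel-directory problem, as a sketch: put the directory that holds the scripts on `sys.path` at the top of the notebook. The helper name `add_to_import_path` is hypothetical, and the layout in the usage comment (notebooks in nbs/, scripts reachable from the repo root one level up) is an assumption about where things end up after the reorganisation.

```python
import sys
from pathlib import Path


def add_to_import_path(directory):
    """Put directory at the front of sys.path so its modules can be imported."""
    resolved = str(Path(directory).resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


# Assumed usage in the first cell of a notebook under nbs/:
#   add_to_import_path(Path.cwd().parent)
#   from hinglishutils import get_files_from_gdrive
```

Making the scripts an installable package (`pip install -e .`) would avoid the path manipulation altogether, at the cost of adding packaging files to the repo.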

Collaborator

@meghanabhange meghanabhange left a comment

  • Reordering would need changes to the datapath in the notebooks so that they don't break.
  • It would also require a change in how imports work in the notebooks that import from the hinglish and hinglishutils scripts.
  • Vocabulary_Count.ipynb is repeated.

3 participants