
Using NIH RePORTER data as a machine learning playground for Databricks, NLP and collaborative development


sdchandra/nih_reporter_DC

 
 

nih_reporter

Using NIH RePORTER data as a machine learning playground for Databricks, NLP, Azure tools, and collaborative development

Stream Labels

This repo is intended to contain multiple streams (sub-projects or research ideas). Each stream has a unique label, which is used as its directory name throughout the repo so that a stream's files can be matched across directories. The label shared is reserved for code and features common to all streams.

Stream labels should also be used as branch names to aid code management.

Key directories

  doc/                          - documentation
  src/                          - source code
    |_  pipelines/[stream]/     - data / ML pipelines
    |_  notebooks/[stream]/     - exploratory / experimental notebooks
    |_  utils                   - utility scripts
  test/                         - code for unit or regression testing
    |_ [stream]/                - organised by stream
  out/[stream]/                 - small output files (e.g. plots) generated by code
  data/[stream]/                - small resources or files used by your program
  models/[stream]/              - saved models for deployment
  README.md
  requirements.txt              - use if applicable

Note: Large files (say, > 1 MB) should reside in an external file system such as Databricks DBFS or OneDrive.
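
One way to act on this note is to scan the working tree for oversized files before committing. The sketch below is only an illustration: the helper name find_large_files is hypothetical, and it assumes "1 MB" means 1,000,000 bytes, which the note does not specify.

```python
from pathlib import Path

# Assumption: the README's "> 1MB" is read as 1,000,000 bytes.
SIZE_LIMIT_BYTES = 1_000_000


def find_large_files(root, limit=SIZE_LIMIT_BYTES):
    """Return sorted (relative path, size in bytes) pairs for files above limit."""
    root = Path(root)
    large = []
    for path in root.rglob("*"):
        rel = path.relative_to(root)
        # Skip Git's internal storage; it is not part of the working tree.
        if ".git" in rel.parts:
            continue
        if path.is_file():
            size = path.stat().st_size
            if size > limit:
                large.append((rel.as_posix(), size))
    return sorted(large)
```

Calling find_large_files(".") from the repo root lists candidates to move to external storage; the limit argument can be adjusted if a different threshold is agreed.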

Notes for contributors

  1. FORK: Create a fork from the main repo [jtjli/nih_reporter] unless you want to develop on top of an existing fork.
  2. BRANCH: Use a branch name that represents your development, such as a Stream Label. Avoid developing on the main branch.
  3. Create a Pull Request when your code is ready to be merged into the main repo.
  4. Wherever appropriate, use Stream Labels as section headings in files such as .gitignore, the global requirements.txt, and the README.
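
The branch naming convention in step 2 could be checked with a small helper like the one below. This is a sketch under stated assumptions, not part of the repo: the function names are hypothetical, and it assumes that the sub-directories of src/pipelines/ are a reasonable source of known stream labels. Because the notes only suggest stream labels as branch names, a non-matching branch produces a warning rather than an error.

```python
from pathlib import Path

# Assumption: "main" is the only branch contributors should avoid developing on.
PROTECTED_BRANCHES = {"main"}


def known_stream_labels(repo_root):
    """Collect stream labels from the sub-directories of src/pipelines/."""
    pipelines = Path(repo_root) / "src" / "pipelines"
    if not pipelines.is_dir():
        return set()
    return {p.name for p in pipelines.iterdir() if p.is_dir()}


def check_branch(branch, stream_labels):
    """Return a list of warnings for a branch name; an empty list means no concerns."""
    warnings = []
    if branch in PROTECTED_BRANCHES:
        warnings.append(f"'{branch}' is the main branch; develop on a stream branch instead")
    elif branch not in stream_labels:
        warnings.append(f"'{branch}' does not match a known stream label")
    return warnings
```

The current branch name can be obtained with `git rev-parse --abbrev-ref HEAD` and passed to check_branch together with known_stream_labels(".").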

