Dataset collection improvements #16

@tobiolabode

Based on what we discussed on the Discord call, I will be looking further into dataset improvements. The scraper already works well; as the TODOs mention, it just needs to be expanded. Some of this also relates to other tasks, such as choosing the non-LangChain library we want to use for this repo, which will help with the filtering tasks below.

One question as well: is it better to focus first on enriching our arXiv dataset (filtering, search, etc.), or on expanding the dataset as much as possible (for example, with non-arXiv papers)?

I'm open to any ideas regarding the direction to take with the dataset section.

These are the tasks I will start looking into:

  • Cron job or GitHub Actions workflow to download newly published papers
  • Write a crawler/scraper to get data directly from arXiv
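
For the scheduled download, one possible shape is to keep a small state file of paper IDs already fetched, so each run only stores what is new. This is a rough sketch, not the repo's existing code: the function names, the JSON state file, and the assumption that each paper is a dict with an "id" key are all hypothetical.

```python
import json
from pathlib import Path


def load_seen_ids(state_path):
    """Return the set of paper IDs recorded by earlier runs (empty on first run)."""
    path = Path(state_path)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def select_new_papers(papers, seen_ids):
    """Keep only papers whose 'id' is not already known, preserving input order.

    Duplicates inside `papers` itself are dropped as well.
    """
    known = set(seen_ids)
    new_papers = []
    for paper in papers:
        if paper["id"] not in known:
            known.add(paper["id"])
            new_papers.append(paper)
    return new_papers


def save_seen_ids(state_path, seen_ids):
    """Write the IDs as a sorted JSON list so the file diffs cleanly in git."""
    Path(state_path).write_text(json.dumps(sorted(seen_ids)), encoding="utf-8")
```

A cron job or GitHub Actions run would then load the state, fetch the latest results, call `select_new_papers`, store those papers, and save the updated ID set.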

Taken from the main page:

  • Need to write a crawler/scraper to get data directly from arXiv (@mnm-matin will push a small prototype script)
  • Need to be able to search/filter for AI Safety papers on arXiv
  • Download papers and extract the abstract and any relevant tags from them
  • Cron job or GitHub Actions workflow to download newly published papers
  • Extract citations for each paper?
  • Extend to non-arXiv papers, this project
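
To make the search/filter and abstract/tag extraction items concrete, here is a minimal sketch using only the Python standard library against the public arXiv API, which returns an Atom feed. It is not the prototype script mentioned above; the function names, the choice of an `all:` phrase search, and the output dict layout are assumptions for illustration.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # namespace prefix for Atom elements


def build_query_url(phrases, start=0, max_results=50):
    """Build an arXiv API URL that OR-s together exact-phrase searches."""
    search_query = " OR ".join(f'all:"{p}"' for p in phrases)
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",  # newest submissions first
        "sortOrder": "descending",
    }
    return ARXIV_API + "?" + urllib.parse.urlencode(params)


def parse_feed(xml_data):
    """Turn an Atom feed (str or bytes) into a list of paper dicts."""
    root = ET.fromstring(xml_data)
    papers = []
    for entry in root.findall(ATOM + "entry"):
        papers.append({
            "id": entry.findtext(ATOM + "id", default="").strip(),
            # Titles and abstracts are hard-wrapped in the feed; collapse whitespace.
            "title": " ".join(entry.findtext(ATOM + "title", default="").split()),
            "abstract": " ".join(entry.findtext(ATOM + "summary", default="").split()),
            "published": entry.findtext(ATOM + "published", default="").strip(),
            "tags": [c.get("term") for c in entry.findall(ATOM + "category")],
        })
    return papers


def fetch_papers(phrases, max_results=50):
    """Query the arXiv API over the network and return parsed papers."""
    url = build_query_url(phrases, max_results=max_results)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_feed(resp.read())
```

For example, `fetch_papers(["AI safety", "AI alignment"], max_results=20)` would return the most recent matches with their abstracts and category tags. A plain keyword query like this will likely be noisy, so a second filtering pass would probably still be needed.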

Best, Tobi
