Dataset collection improvements #16

Based on what we discussed on the Discord call, I will be looking further into dataset improvements. The scraper already works well; as the TODOs mention, it just needs to be expanded. Some of this also relates to other tasks, such as choosing the non-LangChain library we want to use for this repo, which will help with the filtering tasks below.

One open question: is it better to focus on enriching our arXiv dataset first (filtering, search, etc.), or on expanding the dataset as much as possible (for example, with non-arXiv papers)?

I'm open to any ideas regarding the direction to take with the dataset section.

These are the tasks I will start looking into:
- [ ] Cron job to download newly published papers, or [github-actions](https://github.com/AutoLLM/ArxivDigest)
- [ ] Need to write a crawler/scraper to get data directly from arXiv

### Taken from the main page
- [ ] Need to write a crawler/scraper to get data directly from arXiv (@mnm-matin will push a small prototype script)
- [ ] Need to be able to search/filter for AI Safety papers on arXiv
- [ ] Download papers and extract the abstract and any relevant tags from them
- [ ] Cron job to download newly published papers, or [github-actions](https://github.com/AutoLLM/ArxivDigest)
- [ ] Extract citations for each paper?
- [ ] Extend to non-arXiv papers, see [this project](https://github.com/monk1337/resp)
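
As a starting point for the search/filter and abstract/tag extraction items above, here is a minimal sketch that queries the public arXiv API and pulls out the abstract and category tags using only the standard library. It is not the existing scraper or the prototype script mentioned above. The function names (`build_query_url`, `parse_feed`, `fetch_papers`) and the search phrases are hypothetical placeholders, and a plain phrase search is only one possible way to find AI Safety papers.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # namespace used by the Atom feed


def build_query_url(terms, max_results=10):
    """Build an arXiv API URL matching any of the given phrases, newest first."""
    # Each phrase is searched across all fields; phrases are OR-ed together.
    search = " OR ".join(f'all:"{t}"' for t in terms)
    params = {
        "search_query": search,
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_feed(xml_data):
    """Extract id, title, abstract, and category tags from an Atom feed."""
    root = ET.fromstring(xml_data)
    papers = []
    for entry in root.findall(f"{ATOM}entry"):
        papers.append({
            "id": entry.findtext(f"{ATOM}id", default="").strip(),
            # Collapse the line breaks and indentation inside titles/abstracts.
            "title": " ".join(entry.findtext(f"{ATOM}title", default="").split()),
            "abstract": " ".join(entry.findtext(f"{ATOM}summary", default="").split()),
            "tags": [c.get("term") for c in entry.findall(f"{ATOM}category")],
        })
    return papers


def fetch_papers(terms, max_results=10):
    """Network call: download and parse the newest matching papers."""
    url = build_query_url(terms, max_results)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_feed(resp.read())
```

Calling `fetch_papers(["AI safety", "AI alignment"], max_results=20)` from a cron job or a scheduled workflow would then return the newest matches as a list of dictionaries. Keeping `parse_feed` separate from the network call means the parsing can be tested offline against a saved feed.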


Best, Tobi

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dataset collection improvements #16

Taken from the main page

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Dataset collection improvements #16

Description

Taken from the main page

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions