forked from CERC-AAI/generalist-agent
Description
Inappropriate content: NSFW, hate speech, offensive words, sentiment
Fields to check: title, captions, tags, comments
Quality data:
Length, aspect ratio, resolution, likes, views, number of subscribers of the creator, comments (filtered with a language model); also ensure there is some amount of movement in the video
Language: English for now
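As a starting point for discussion, the metadata-based criteria above could be combined into a single predicate. This is only a sketch: the `VideoMeta` record, its field names, and every threshold are hypothetical placeholders that would need tuning against the real metadata. Likes, comments, and motion are left out because they need extra data or models.

```python
from dataclasses import dataclass


@dataclass
class VideoMeta:
    # Hypothetical record; field names are illustrative only.
    length_s: float    # duration in seconds
    width: int         # frame width in pixels
    height: int        # frame height in pixels
    views: int
    subscribers: int   # subscribers of the uploading channel


# Placeholder thresholds (assumptions, not agreed values).
MIN_LENGTH_S = 10
MAX_LENGTH_S = 3600
MIN_HEIGHT = 360
MIN_ASPECT_RATIO = 0.4   # width / height
MAX_ASPECT_RATIO = 2.5
MIN_VIEWS = 100
MIN_SUBSCRIBERS = 50


def passes_quality(v: VideoMeta) -> bool:
    """Return True if the video passes all placeholder quality checks."""
    if v.width <= 0 or v.height <= 0:
        return False  # missing or broken resolution metadata
    aspect = v.width / v.height
    return (
        MIN_LENGTH_S <= v.length_s <= MAX_LENGTH_S
        and v.height >= MIN_HEIGHT
        and MIN_ASPECT_RATIO <= aspect <= MAX_ASPECT_RATIO
        and v.views >= MIN_VIEWS
        and v.subscribers >= MIN_SUBSCRIBERS
    )
```

Keeping the thresholds as named constants makes it easy to sweep them during the data analysis and see how many videos each one removes.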
(From Michael) the final version of the dataset from the collection side of things will look like the following:
- channels.tsv with columns ['link', 'name', 'description', 'subscribers', 'isFamilySafe', 'tags']
- videos.tsv with columns ['channel_link', 'id', 'title', 'date', 'length', 'views']
- a sample with 2 videos from a random set of channels listed in channels.tsv can be found in the DuckAI Google Drive (note there may be some duplicates!)
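A minimal sketch of how the two files might be loaded, de-duplicated, and joined, using only the standard library. It assumes both files are tab-separated with a header row matching the column lists above and that `channel_link` in videos.tsv matches `link` in channels.tsv. The rows in the in-memory example are invented for illustration and are not taken from the real sample.

```python
import csv
import io


def read_tsv(handle):
    """Read a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def dedupe_videos(videos):
    """Keep only the first row seen for each video id."""
    seen = set()
    unique = []
    for row in videos:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique.append(row)
    return unique


def attach_channels(videos, channels):
    """Copy channel-level fields onto each video row, joined on the channel link."""
    by_link = {c["link"]: c for c in channels}
    joined = []
    for v in videos:
        ch = by_link.get(v["channel_link"])
        if ch is None:
            continue  # skip videos whose channel is not listed in channels.tsv
        joined.append({**v,
                       "subscribers": ch["subscribers"],
                       "isFamilySafe": ch["isFamilySafe"]})
    return joined


# Invented example data (not from the real sample).
CHANNELS_TSV = (
    "link\tname\tdescription\tsubscribers\tisFamilySafe\ttags\n"
    "/c/a\tA\tdemo\t1200\tTrue\tx\n"
)
VIDEOS_TSV = (
    "channel_link\tid\ttitle\tdate\tlength\tviews\n"
    "/c/a\tv1\tFirst\t2023-01-01\t60\t10\n"
    "/c/a\tv1\tFirst\t2023-01-01\t60\t10\n"   # duplicate row
    "/c/a\tv2\tSecond\t2023-01-02\t90\t20\n"
)

channels = read_tsv(io.StringIO(CHANNELS_TSV))
videos = dedupe_videos(read_tsv(io.StringIO(VIDEOS_TSV)))
rows = attach_channels(videos, channels)
```

For the real files, `read_tsv(open("channels.tsv", newline="", encoding="utf-8"))` would replace the `io.StringIO` calls. All values come back as strings, so numeric columns such as `views` need explicit conversion before analysis.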
Loose tasks:
- Data analysis on metadata in the TSVs
- Build pipeline to collect metadata using video2dataset (may need to alter video2dataset)
- Analysis on NSFW (using some kind of NSFW filter)
- English filter
- Pixel-based filters using thumbnails (see comments below on YouTube's auto-generated thumbnails)
- Brainstorm more ideas on filtering to have high quality video!
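For the English filter task, one cheap first pass could be a stopword-ratio heuristic on titles and descriptions. This is a crude stand-in, not a proper language-identification model. The stopword set and the `min_ratio` threshold below are hypothetical choices made for illustration.

```python
import re

# Small hand-picked set of common English function words (illustrative, not exhaustive).
ENGLISH_STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "for",
    "on", "with", "this", "that", "how", "my", "you", "it", "what", "why",
}


def looks_english(text: str, min_ratio: float = 0.15) -> bool:
    """Guess whether text is English from the share of common English words."""
    # Extract runs of letters (Unicode-aware), ignoring digits and punctuation.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return False
    hits = sum(1 for w in words if w in ENGLISH_STOPWORDS)
    return hits / len(words) >= min_ratio
```

Short titles with no function words (for example, a game name plus "speedrun") will be rejected by this heuristic, so it is probably better applied to the channel description or to title and description concatenated, with a real language-ID model as the eventual replacement.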