Filtering based on meta data for each youtube URL #17

@daniel-z-kaplan

  • Inappropriate content: NSFW, hate speech, offensive words, sentiment. Checked against the title, captions, tags, and comments.
  • Quality data: length, aspect ratio, resolution, likes, views, number of subscribers of the creator, comments (filtered with a language model), and a check that the video contains some amount of movement.
  • Language: English for now.
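
As a starting point for the offensive-words check on text fields, here is a minimal sketch of a blocklist scan. The field names (`title`, `captions`, `tags`, `comments`), the record layout, and the blocklist terms are all placeholders assumed for illustration; a real filter would use a curated word list and probably a classifier for hate speech and sentiment.

```python
import re

# Hypothetical placeholder terms; a real list would come from a curated source.
BLOCKLIST = {"badword", "slur"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def flagged_fields(record, blocklist=BLOCKLIST,
                   fields=("title", "captions", "tags", "comments")):
    """Return {field: [matched terms]} for every text field that hits the blocklist.

    `record` is assumed to be a dict whose values are strings or lists of strings.
    """
    flagged = {}
    for field in fields:
        value = record.get(field) or ""
        if isinstance(value, (list, tuple)):
            value = " ".join(value)
        hits = sorted(set(tokenize(value)) & blocklist)
        if hits:
            flagged[field] = hits
    return flagged
```

A video would be dropped (or sent for review) whenever `flagged_fields(record)` is non-empty. Exact-token matching misses misspellings and obfuscation, so this is only a first pass.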

(From Michael) The final version of the dataset from the collection side will look like this:

  • channels.tsv with columns ['link', 'name', 'description', 'subscribers', 'isFamilySafe', 'tags']
  • videos.tsv with columns ['channel_link', 'id', 'title', 'date', 'length', 'views']
  • a sample with 2 videos from a random set of channels listed in channels.tsv can be found in the DuckAI Google Drive (note: there may be some duplicates!)
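
For the analysis tasks, a rough sketch of loading the two TSVs, dropping duplicate video ids, and attaching channel fields to each video might look like the following. It assumes both files have a header row with the column names listed above and that `channel_link` in videos.tsv matches `link` in channels.tsv; the function names are made up for this example.

```python
import csv
import io


def parse_tsv(text):
    """Parse TSV text with a header row into a list of dicts.

    QUOTE_NONE keeps stray quote characters in titles/descriptions literal.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def read_tsv(path):
    """Read a TSV file from disk (assumed UTF-8)."""
    with open(path, newline="", encoding="utf-8") as f:
        return parse_tsv(f.read())


def join_videos_to_channels(channels, videos):
    """Attach channel metadata to each video, skipping duplicate video ids."""
    by_link = {c["link"]: c for c in channels}
    seen_ids = set()
    joined = []
    for v in videos:
        if v["id"] in seen_ids:
            continue  # the sample is known to contain some duplicates
        seen_ids.add(v["id"])
        channel = by_link.get(v["channel_link"])
        if channel is None:
            continue  # video whose channel is not listed in channels.tsv
        joined.append({
            **v,
            "channel_name": channel["name"],
            "subscribers": channel["subscribers"],
            "isFamilySafe": channel["isFamilySafe"],
            "tags": channel["tags"],
        })
    return joined
```

Usage would be along the lines of `join_videos_to_channels(read_tsv("channels.tsv"), read_tsv("videos.tsv"))`. All values stay as strings here; type conversion is left to the filtering step.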

Loose tasks:

  • Data analysis on metadata in the TSVs
  • Build pipeline to collect metadata using video2dataset (may need to alter video2dataset)
  • Analysis on NSFW (using some kind of NSFW filter)
  • English filter
  • Pixel-based filters using thumbnails (see comments below on YouTube's auto-generated thumbnails)
  • Brainstorm more ideas on filtering to have high quality video!
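
To make the metadata-only part of these tasks concrete, here is a small sketch of a per-video quality filter. Several things are assumptions rather than facts about the dataset: that `length` is in seconds, that `isFamilySafe` is stored as the string "True"/"False", and all threshold values. The English check is a crude script heuristic (share of ASCII letters in the title), standing in for a proper language-identification model; it cannot tell English from other Latin-script languages.

```python
def to_int(value, default=0):
    """Parse counts like '5,000' into ints; fall back to `default` on bad input."""
    try:
        return int(str(value).replace(",", "").strip())
    except ValueError:
        return default


def looks_english(text, min_ascii_letter_ratio=0.9):
    """Crude placeholder: True if most letters in the text are ASCII letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    ascii_letters = sum(1 for ch in letters if ch.isascii())
    return ascii_letters / len(letters) >= min_ascii_letter_ratio


def passes_filters(row, min_length_s=10, min_views=1000, min_subscribers=100):
    """Apply family-safe, length, popularity, and language checks to one joined row.

    `row` is assumed to be a dict of strings with keys
    'length', 'views', 'subscribers', 'isFamilySafe', and 'title'.
    """
    if str(row.get("isFamilySafe", "")).strip().lower() != "true":
        return False
    if to_int(row.get("length")) < min_length_s:
        return False
    if to_int(row.get("views")) < min_views:
        return False
    if to_int(row.get("subscribers")) < min_subscribers:
        return False
    return looks_english(row.get("title") or "")
```

Thresholds like these are best chosen after the data analysis on the TSVs, for example by looking at the distribution of views and subscribers before fixing a cutoff.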
