Video-Segmentation-Transcript-Summarization/Datasets at main · harshilpradhan59/Video-Segmentation-Transcript-Summarization · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
# VT-SSum Dataset

## Overview

**VT-SSum** is a benchmark dataset created for research in **video transcript segmentation and summarization**. It consists of spoken language transcripts extracted from real-world videos, making it suitable for developing and evaluating models that process, segment, and summarize video transcripts.

## Dataset Structure

- **Total Videos:**
  - **Train:** 7,692
  - **Dev:** 962
  - **Test:** 962

- **Data Format:**
  Each entry is a JSON object with the following fields:
  - `id`: Unique video identifier
  - `title`: Title of the video
  - `info`: Additional video details (e.g., recording/upload time)
  - `url`: Source link to the video
  - `segmentation`: List of transcript segments (each segment is a list of sentences)
  - `summarization`: Dictionary containing segment summaries and sentence-level summary annotations

### Example Data Instance

{
"id": "A01",
"title": "Sample Video Title",
"info": "Uploaded 2021-01-01",
"url": "https://sample.video.url",
"segmentation": [
["Sentence 1", "Sentence 2", "..."],
["Sentence 3", "Sentence 4", "..."]
],
"summarization": {
"segments": [
{
"summary": "...",
"is_summarization_sample": true,
"summarization_data": [
{"sentence": "...", "label": 1},
...
]
}
]
}
}


## Supported Tasks

- **Transcript Segmentation:** Divide long transcripts into semantically consistent segments.
- **Summarization:** Generate coherent summaries at the segment or video level.
- **Benchmarking:** Systematic evaluation of modern language models with provided training, validation, and test splits.

## Data Files

**File structure:**

VT-SSum/
├── train.json
├── dev.json
├── test.json
└── README.md

- All files are in JSON format.
- Each file contains a list of video transcript entries as described above.

## Languages

The dataset is in **English**, covering conversational, documentary, and lecture-style spoken content.