# VT-SSum Dataset
## Overview
**VT-SSum** is a benchmark dataset created for research in **video transcript segmentation and summarization**. It consists of spoken language transcripts extracted from real-world videos, making it suitable for developing and evaluating models that process, segment, and summarize video transcripts.
## Dataset Structure
- **Total Videos:** 9,616
  - **Train:** 7,692
  - **Dev:** 962
  - **Test:** 962
- **Data Format:**
  Each entry is a JSON object with the following fields:
  - `id`: Unique video identifier
  - `title`: Title of the video
  - `info`: Additional video details (e.g., recording/upload time)
  - `url`: Source link to the video
  - `segmentation`: List of transcript segments (each segment is a list of sentences)
  - `summarization`: Dictionary containing segment summaries and sentence-level summary annotations
### Example Data Instance
```json
{
  "id": "A01",
  "title": "Sample Video Title",
  "info": "Uploaded 2021-01-01",
  "url": "https://sample.video.url",
  "segmentation": [
    ["Sentence 1", "Sentence 2", "..."],
    ["Sentence 3", "Sentence 4", "..."]
  ],
  "summarization": {
    "segments": [
      {
        "summary": "...",
        "is_summarization_sample": true,
        "summarization_data": [
          {"sentence": "...", "label": 1},
          ...
        ]
      }
    ]
  }
}
```
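The sentence-level annotations in the example instance can be read with a few lines of Python. This is a minimal sketch, not an official loader: the function name `selected_sentences` is hypothetical, and it assumes that `label` equal to 1 marks a sentence chosen for the summary and that segments with `is_summarization_sample` set to false should be skipped. Neither assumption is stated explicitly in this README.

```python
def selected_sentences(entry: dict) -> list:
    """Return, for each annotated segment, the sentences whose label is 1."""
    selected = []
    for segment in entry["summarization"]["segments"]:
        # Assumption: segments not flagged as summarization samples are skipped.
        if not segment.get("is_summarization_sample", False):
            continue
        selected.append(
            [
                item["sentence"]
                for item in segment.get("summarization_data", [])
                if item.get("label") == 1  # assumption: 1 means "in the summary"
            ]
        )
    return selected


# Toy entry shaped like the example instance (values are made up).
toy_entry = {
    "id": "A01",
    "summarization": {
        "segments": [
            {
                "summary": "...",
                "is_summarization_sample": True,
                "summarization_data": [
                    {"sentence": "Sentence 1", "label": 1},
                    {"sentence": "Sentence 2", "label": 0},
                ],
            }
        ]
    },
}

print(selected_sentences(toy_entry))  # [['Sentence 1']]
```

The result keeps one list per segment so that segment-level and video-level summaries can both be assembled from it.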
## Supported Tasks
- **Transcript Segmentation:** Divide long transcripts into semantically consistent segments.
- **Summarization:** Generate coherent summaries at the segment or video level.
- **Benchmarking:** Systematic evaluation of language models on the provided train, dev, and test splits.
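For the segmentation task, the `segmentation` field can be turned into per-sentence training targets. The sketch below shows one common framing, sentence-level boundary classification, where the last sentence of each segment gets label 1 and all others get 0. This labelling scheme and the function name `boundary_labels` are assumptions made for illustration, not something the dataset prescribes.

```python
def boundary_labels(segmentation: list) -> tuple:
    """Flatten segments into (sentences, labels); label 1 = last sentence of a segment."""
    sentences = []
    labels = []
    for segment in segmentation:
        for index, sentence in enumerate(segment):
            sentences.append(sentence)
            # Assumed scheme: mark the sentence that closes each segment.
            labels.append(1 if index == len(segment) - 1 else 0)
    return sentences, labels


# Toy input: two segments of two and three sentences.
toy_segmentation = [["s1", "s2"], ["s3", "s4", "s5"]]
sentences, labels = boundary_labels(toy_segmentation)
print(sentences)  # ['s1', 's2', 's3', 's4', 's5']
print(labels)     # [0, 1, 0, 0, 1]
```

A model's predicted boundaries can then be compared against `labels` with whichever segmentation metric the evaluation calls for.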
## Data Files
**File structure:**
```
VT-SSum/
├── train.json
├── dev.json
├── test.json
└── README.md
```
- All files are in JSON format.
- Each file contains a list of video transcript entries as described above.
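Given that layout, a split can be loaded with the standard library alone. This is a minimal sketch: the helper name `load_split` and the `VT-SSum` root path are illustrative assumptions, and the code relies only on what is stated above, namely that each file is a JSON list of entries.

```python
import json
from pathlib import Path


def load_split(root, split):
    """Load one split ('train', 'dev' or 'test') as a list of entry dicts."""
    path = Path(root) / f"{split}.json"
    with path.open(encoding="utf-8") as handle:
        entries = json.load(handle)
    # The README describes each file as a list of entries; fail early otherwise.
    if not isinstance(entries, list):
        raise ValueError(f"{path} does not contain a JSON list")
    return entries


if __name__ == "__main__":
    # Assumed location of the dataset directory; adjust to the local checkout.
    root = Path("VT-SSum")
    for split in ("train", "dev", "test"):
        if (root / f"{split}.json").exists():
            print(split, len(load_split(root, split)))
```

If the files match the split sizes listed under Dataset Structure, the three counts should come out as 7,692, 962, and 962.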
## Languages
The dataset is in **English**, covering conversational, documentary, and lecture-style spoken content.