A Python-based pipeline for processing TikTok videos to extract 19 human value annotations based on Schwartz's value framework. Supports two modes: a two-step process (video → script → annotations) or a faster one-step mode (video → annotations directly).
The pipeline processes videos stored in Google Cloud Storage (GCS) and outputs a CSV file containing value annotations for each video. It supports flexible execution modes, configurable retry logic, and optional intermediate script storage.
- Two processing modes: Choose between two-step (via scripts) or one-step (direct) annotation
- Flexible execution: Run complete pipeline or individual stages
- Robust error handling: Exponential backoff retry logic with configurable delays
- Cloud-native: Built for Google Cloud Platform with GCS and Vertex AI
- Configurable: YAML-based configuration for all pipeline parameters
- Optional script storage: Save intermediate scripts or process in-memory (two-step mode)
- Prerequisites
- Installation
- Configuration
- Usage
- Pipeline Stages
- Output Format
- Troubleshooting
- Advanced Usage
Before installing the pipeline, ensure you have:
- Python 3.9 or higher
  python --version
- Google Cloud Platform account with:
  - A GCS bucket containing your video files
  - Vertex AI API enabled
  - Appropriate IAM permissions
- Google Cloud SDK installed and configured
  gcloud --version
- Authentication set up using one of:
  - Application Default Credentials (ADC)
  - Service Account Key

Option 1: Application Default Credentials (Recommended for local development)
gcloud auth application-default login

Option 2: Service Account Key
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"

Your account or service account needs:
- storage.objects.get - Read videos from GCS
- storage.objects.create - Write scripts and CSV to GCS
- aiplatform.endpoints.predict - Call Vertex AI models
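To confirm that credentials and bucket access are in place before installing, a quick check like the one below can help. This is a minimal sketch: it assumes the google-cloud-storage package is installed, and the bucket name and prefix are placeholders you would replace with your own.

from google.cloud import storage

# Minimal access check -- bucket name and prefix below are placeholders.
client = storage.Client()  # picks up ADC or GOOGLE_APPLICATION_CREDENTIALS
blobs = list(client.list_blobs("your-bucket-name", prefix="path/to/videos/", max_results=5))
print(f"Credentials OK, found {len(blobs)} object(s) under the video prefix")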
- Clone or download the repository
  cd video-annotation-pipeline
- Create a virtual environment (recommended)
  python -m venv venv
  # On Windows
  venv\Scripts\activate
  # On macOS/Linux
  source venv/bin/activate
- Install dependencies
  pip install -r requirements.txt
- Verify installation
  python main.py --help
The pipeline uses a YAML configuration file to specify all parameters. A sample config.yaml is provided.
# Google Cloud Storage Configuration
gcs:
  bucket_name: "your-bucket-name"
  video_source_path: "path/to/videos/"
  script_output_path: "path/to/scripts/"  # Optional
  csv_output_path: "path/to/output.csv"

# LLM Model Configuration
model:
  name: "gemini-1.5-pro-002"
  max_retries: 4
  retry_delay: 40
  request_delay: 3

# Pipeline Execution Configuration
pipeline:
  stage: "both"  # Options: "both", "video_to_script", "script_to_annotation"
  save_scripts: true

# Safety Settings
safety_settings:
  harassment: "BLOCK_NONE"
  hate_speech: "BLOCK_NONE"
  sexually_explicit: "BLOCK_NONE"
  dangerous_content: "BLOCK_NONE"

| Option | Type | Required | Description |
|---|---|---|---|
| `bucket_name` | string | Yes | Name of your GCS bucket |
| `video_source_path` | string | Yes | Path prefix where videos are located (e.g., "videos/") |
| `script_output_path` | string | No | Path prefix for saving scripts (only if `save_scripts: true`) |
| `csv_output_path` | string | Yes | Full path including filename for CSV output |
| Option | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Vertex AI model name (e.g., "gemini-1.5-pro-002") |
| `max_retries` | integer | Yes | Maximum retry attempts for failed API calls (recommended: 3-5) |
| `retry_delay` | integer | Yes | Base delay in seconds for exponential backoff (recommended: 30-60) |
| `request_delay` | integer | Yes | Delay in seconds between consecutive requests (recommended: 2-5) |
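The retry options above drive an exponential backoff loop. The snippet below is an illustration only, not the pipeline's actual implementation, of how max_retries, retry_delay, and request_delay typically interact.

import time

def call_with_backoff(request_fn, max_retries=4, retry_delay=40, request_delay=3):
    """Hypothetical sketch: retry a request with exponential backoff."""
    for attempt in range(max_retries):
        try:
            result = request_fn()
            time.sleep(request_delay)  # fixed pause between consecutive requests
            return result
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            time.sleep(retry_delay * (2 ** attempt))  # 40s, 80s, 160s, ...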
| Option | Type | Required | Description |
|---|---|---|---|
| `mode` | string | No | Pipeline mode: "one_step" (direct) or "two_step" (via scripts). Default: "two_step" |
| `stage` | string | Yes* | Which stage(s) to run: "both", "video_to_script", or "script_to_annotation" (*ignored in one_step mode) |
| `save_scripts` | boolean | Yes* | Whether to save intermediate scripts to GCS (*only applies to two_step mode) |
| Option | Type | Required | Description |
|---|---|---|---|
| `harassment` | string | Yes | Filter level for harassment content |
| `hate_speech` | string | Yes | Filter level for hate speech |
| `sexually_explicit` | string | Yes | Filter level for sexually explicit content |
| `dangerous_content` | string | Yes | Filter level for dangerous content |
Valid safety values: BLOCK_NONE, BLOCK_ONLY_HIGH, BLOCK_MEDIUM_AND_ABOVE, BLOCK_LOW_AND_ABOVE
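The following is a minimal sketch of loading config.yaml with PyYAML and spot-checking the required GCS fields; the pipeline performs its own, more complete validation.

import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# Check a few required GCS fields before running the pipeline
for key in ("bucket_name", "video_source_path", "csv_output_path"):
    assert cfg["gcs"].get(key), f"gcs.{key} is required"
print("mode:", cfg.get("pipeline", {}).get("mode", "two_step"))  # defaults to two_step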
Run the complete pipeline with default configuration:
python main.py --config config.yaml

Video to Script Only

# In config.yaml
pipeline:
  stage: "video_to_script"
  save_scripts: true

python main.py --config config.yaml

Script to Annotation Only

# In config.yaml
pipeline:
  stage: "script_to_annotation"

python main.py --config config.yaml

A typical two-step workflow:
- First run: Process videos and save scripts
  pipeline:
    stage: "both"
    save_scripts: true
- Reprocess annotations: Use existing scripts
  pipeline:
    stage: "script_to_annotation"
For faster processing without intermediate script generation, use one-step mode:
# One-step mode configuration
gcs:
  bucket_name: "your-bucket-name"
  video_source_path: "path/to/videos/"
  csv_output_path: "path/to/output.csv"

model:
  name: "gemini-1.5-pro-002"
  max_retries: 4
  retry_delay: 40
  request_delay: 3

pipeline:
  mode: "one_step"  # Direct video to annotations (no scripts)

safety_settings:
  harassment: "BLOCK_NONE"
  hate_speech: "BLOCK_NONE"
  sexually_explicit: "BLOCK_NONE"
  dangerous_content: "BLOCK_NONE"

Benefits of one-step mode:
- Faster processing: Single LLM call per video instead of two
- Lower costs: Reduced API usage
- Simpler workflow: No intermediate artifacts
When to use two-step mode instead:
- You need to review/edit intermediate scripts
- You want to reprocess annotations without re-processing videos
- You need the detailed movie script output
For faster, more cost-effective annotation in step 2 (script to annotations), you can use RoBERTa or another masked language model (MLM) instead of the Gemini LLM.
Configuration example with RoBERTa:
# MLM-based annotation configuration
gcs:
  bucket_name: "your-bucket-name"
  video_source_path: "path/to/videos/"
  script_output_path: "path/to/scripts/"
  csv_output_path: "path/to/roberta_result.csv"

# Use MLM for script-to-annotation stage
model:
  type: "mlm"           # Use MLM instead of LLM
  name: "roberta-base"  # RoBERTa model from HuggingFace
  config:
    max_length: 512     # Maximum sequence length
    device: "auto"      # Auto-detect GPU/CPU
    batch_size: 16      # Batch size for inference
    padding: true
    truncation: true

pipeline:
  stage: "script_to_annotation"  # Process existing scripts
  save_scripts: false            # Scripts already exist
  mode: "two_step"

Supported MLM models:
- roberta-base - Lightweight, good for testing (125M parameters)
- roberta-large - Better accuracy (355M parameters)
- Custom fine-tuned models from HuggingFace
Benefits of MLM approach:
- Fast inference: ~100ms per script on CPU vs ~2-3 seconds for Gemini
- No API costs: Models run locally or on your GPU
- Offline capability: No internet required after model download
- Customizable: Fine-tune on your own annotated data
When to use MLM:
- You need fast batch processing
- Latency is critical
- You want to avoid API costs
- You have a fine-tuned model for your specific task
When to use Gemini LLM:
- Better out-of-box accuracy without fine-tuning
- You need zero-shot learning on new domains
- Complex reasoning or interpretation needed
- Cost is not a concern
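For orientation, here is a minimal sketch of local MLM inference with Hugging Face transformers. It assumes the transformers and torch packages and a hypothetical local script file; note that plain roberta-base ships without a trained classification head, so meaningful value scores require a fine-tuned checkpoint (the pipeline's MLM adapter handles the mapping to the 19 categories).

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "roberta-base"  # or a fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=19)
model.eval()

script_text = open("script_001.txt").read()  # hypothetical local script file
inputs = tokenizer(script_text, max_length=512, truncation=True, padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, 19), one score per value category
print(logits.squeeze().tolist())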
Running the pipeline:
# Process videos to scripts with Gemini
python main.py --config config.yaml
# Then annotate scripts with RoBERTa
python main.py --config config_roberta_step2.yaml

Or compare both approaches:
# Run evaluation with both Gemini and RoBERTa
python run_evaluation.py --config config_roberta_evaluation.yaml

Stage 1: Video to Script

Converts video files to structured movie scripts including:
- Audio transcription
- Visual descriptions
- On-screen text/captions
- Scene descriptions
Input: MP4 video files in GCS
Output: Text scripts (saved to GCS or kept in memory)
Stage 2: Script to Annotation

Extracts 19 human value annotations from movie scripts based on Schwartz's value framework.
Input: Movie scripts (from GCS or memory)
Output: JSON annotations with values and metadata
Stage 3: CSV Generation

Aggregates all annotations into a single CSV file.
Input: Annotation dictionaries
Output: CSV file in GCS
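As an illustration of this stage, the sketch below flattens per-video annotation dictionaries into a CSV using the standard library; column names follow the output format documented below, and the sample rows are hypothetical.

import csv

annotations = [
    {"video_id": "video_001.mp4", "Achievement": 1, "Hedonism": 0, "Has_sound": True, "notes": ""},
    {"video_id": "video_002.mp4", "Achievement": 0, "Hedonism": 2, "Has_sound": False, "notes": "no speech"},
]

with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(annotations[0].keys()))
    writer.writeheader()
    writer.writerows(annotations)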
The pipeline generates a CSV file with the following columns:
| Column | Type | Description |
|---|---|---|
| `video_id` | string | Video filename identifier |
| `Self_Direction_Thought` | integer | Value score (-1, 0, 1, 2) |
| `Self_Direction_Action` | integer | Value score (-1, 0, 1, 2) |
| `Stimulation` | integer | Value score (-1, 0, 1, 2) |
| `Hedonism` | integer | Value score (-1, 0, 1, 2) |
| `Achievement` | integer | Value score (-1, 0, 1, 2) |
| `Power_Resources` | integer | Value score (-1, 0, 1, 2) |
| `Power_Dominance` | integer | Value score (-1, 0, 1, 2) |
| `Face` | integer | Value score (-1, 0, 1, 2) |
| `Security_Personal` | integer | Value score (-1, 0, 1, 2) |
| `Security_Social` | integer | Value score (-1, 0, 1, 2) |
| `Conformity_Rules` | integer | Value score (-1, 0, 1, 2) |
| `Conformity_Interpersonal` | integer | Value score (-1, 0, 1, 2) |
| `Tradition` | integer | Value score (-1, 0, 1, 2) |
| `Humility` | integer | Value score (-1, 0, 1, 2) |
| `Benevolence_Dependability` | integer | Value score (-1, 0, 1, 2) |
| `Benevolence_Care` | integer | Value score (-1, 0, 1, 2) |
| `Universalism_Concern` | integer | Value score (-1, 0, 1, 2) |
| `Universalism_Nature` | integer | Value score (-1, 0, 1, 2) |
| `Universalism_Tolerance` | integer | Value score (-1, 0, 1, 2) |
| `Has_sound` | boolean | Whether video has audio |
| `notes` | string | Optional text notes from annotation |
- -1: Value is contradicted or opposed
- 0: Value is not present or neutral
- 1: Value is present or supported
- 2: Value is strongly emphasized
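Once the CSV has been downloaded (for example with gsutil cp), a quick way to inspect it is with pandas. A minimal sketch, assuming pandas is installed:

import pandas as pd

df = pd.read_csv("output.csv")
value_columns = [c for c in df.columns if c not in ("video_id", "Has_sound", "notes")]

# Share of videos in which each value is endorsed (score 1 or 2)
endorsed = (df[value_columns] >= 1).mean().sort_values(ascending=False)
print(endorsed.head(10))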
Error: google.auth.exceptions.DefaultCredentialsError
Solution:
# Set up application default credentials
gcloud auth application-default login
# Or set service account key
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json"

Error: 403 Forbidden or Permission denied
Solution:
- Verify your account has the required IAM roles
- Check bucket permissions:
  gsutil iam get gs://your-bucket-name
- Ensure Vertex AI API is enabled:
  gcloud services enable aiplatform.googleapis.com
Error: 429 Too Many Requests or ResourceExhausted
Solution:
- Increase request_delay in config (e.g., from 3 to 5 seconds)
- Increase retry_delay for longer backoff (e.g., from 40 to 60 seconds)
- Reduce batch size by processing videos in smaller groups
Error: Configuration validation error
Solution:
- Check YAML syntax is valid
- Ensure all required fields are present
- Verify field types match expected values
- Check that paths don't have trailing spaces
Error: Individual videos fail to process
Solution:
- Check video format (must be MP4)
- Verify video file is not corrupted
- Check video size (very large files may timeout)
- Review safety settings - content may be blocked
- Check pipeline logs for specific error messages
Error: Model not found or Invalid model name
Solution:
- Verify model name is correct (e.g., "gemini-1.5-pro-002")
- Check model is available in your GCP region
- Ensure Vertex AI API is enabled
Error: MemoryError or system slowdown
Solution:
- Set save_scripts: false to reduce memory usage
- Process videos in smaller batches
- Increase system memory or use a machine with more RAM
- Enable verbose logging:
  # In main.py, change logging level
  logging.basicConfig(level=logging.DEBUG)
- Test with a small dataset:
  - Start with 2-3 videos to verify configuration
  - Check output format before processing full dataset
- Run stages independently:
  - Test video-to-script stage first
  - Verify scripts look correct
  - Then run script-to-annotation stage
- Check GCS paths:
  # List videos in your bucket
  gsutil ls gs://your-bucket-name/path/to/videos/
  # Verify bucket access
  gsutil ls gs://your-bucket-name/
- Monitor API quotas:
  - Check Vertex AI quotas in GCP Console
  - Monitor API usage in Cloud Monitoring
If you encounter issues not covered here:
- Check the execution summary for specific error messages
- Review the logs for detailed error traces
- Verify your GCP project configuration
- Ensure all dependencies are up to date:
pip install --upgrade -r requirements.txt
For large video collections:
- Adjust delays to avoid rate limits:
  model:
    request_delay: 5   # Increase delay between requests
    retry_delay: 60    # Increase backoff delay
- Process in batches:
  - Split videos into subdirectories
  - Process each batch separately
  - Combine CSV outputs afterward (see the sketch after this list)
- Use faster model for testing:
  model:
    name: "gemini-1.5-flash-002"  # Faster, lower cost
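A minimal sketch for combining the per-batch CSV outputs mentioned above (file names are hypothetical; assumes pandas):

import pandas as pd

parts = ["output_batch1.csv", "output_batch2.csv", "output_batch3.csv"]
combined = pd.concat([pd.read_csv(p) for p in parts], ignore_index=True)
combined = combined.drop_duplicates(subset="video_id")
combined.to_csv("output_combined.csv", index=False)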
The pipeline uses instruction files in the prompts/ directory:
- prompts/video_to_script_instructions.txt - Video to script conversion
- prompts/script_to_annotation_instructions.txt - Script to annotation extraction
You can modify these files to customize the LLM behavior.
The pipeline provides real-time progress updates:
Processing video 1/10: video_001.mp4
Processing video 2/10: video_002.mp4
...
Failed items are logged and summarized at the end.
To reduce costs:
- Use Flash model: gemini-1.5-flash-002 (faster, cheaper)
- Don't save scripts: Set save_scripts: false
- Process in-memory: Run complete pipeline without intermediate storage
- Batch processing: Process multiple videos in one session to amortize startup costs
video-annotation-pipeline/
├── config/ # Configuration module
├── evaluation/ # Model Evaluation Module
│ ├── adapters/ # Model adapters (base, gemini, MLM, script)
│ ├── metrics/ # Metrics calculation
│ └── reports/ # Report generation
├── examples/ # Example configs and sample data
├── gcs/ # GCS interface module
├── llm/ # LLM client modules
├── orchestrator/ # Pipeline orchestrator
├── processors/ # Video and script processors
├── prompts/ # System instruction files
├── tests/ # Test suite
├── utils/ # Utility modules (logging)
├── config.yaml # Configuration file
├── main.py # Main entry point
├── run_evaluation.py # Model evaluation CLI
├── requirements.txt # Python dependencies
└── README.md # This file
The Model Evaluation Module provides a framework for evaluating and comparing different model predictions against ground truth annotations. It calculates comprehensive metrics and generates detailed reports.
- Multiple Adapter Support: Evaluate different models through a unified interface
- Comprehensive Metrics: F1 scores (macro, weighted), precision, recall for each category
- Endorsed/Conflict Analysis: Separate metrics for endorsed values (1,2) and conflict values (-1)
- Flexible Configuration: YAML/JSON configuration files
- Report Generation: CSV and JSON reports with model comparisons
- Sampling Support: Evaluate on subsets with reproducible random sampling
- Create a configuration file (evaluation_config.yaml):
  ground_truth_path: "path/to/ground_truth.csv"
  scripts_path: "path/to/scripts/"
  output_dir: "evaluation_output/"
  models:
    - model_type: gemini
      model_name: gemini-1.5-pro
      adapter_class: GeminiAdapter
      config:
        model_id: "gemini-1.5-pro-002"
        project_id: "your-project-id"
        location: "us-central1"
- Run the evaluation:
  python run_evaluation.py --config evaluation_config.yaml
- View results in the evaluation_output/ directory.
python run_evaluation.py --config CONFIG_FILE [OPTIONS]
Options:
--config, -c PATH Path to configuration file (required)
--verbose, -v Enable verbose output (DEBUG level)
--quiet, -q Suppress non-essential output (WARNING level)
--dry-run Validate configuration without running
--output-dir PATH Override output directory
--skip-reports Skip report generation
--models MODEL Filter to specific models (can repeat)

Dry run to validate configuration:
python run_evaluation.py --config config.yaml --dry-run

Verbose output for debugging:
python run_evaluation.py --config config.yaml --verbose

Evaluate only specific models:
python run_evaluation.py --config config.yaml --models model_a --models model_b

See examples/evaluation_config.yaml for a fully documented example.
| Field | Type | Description |
|---|---|---|
| `ground_truth_path` | string | Path to CSV file with ground truth annotations |
| `scripts_path` | string | Directory containing script files |
| `output_dir` | string | Directory for output reports |
| `models` | list | List of model configurations |
| Field | Type | Default | Description |
|---|---|---|---|
| `sample_size` | integer | null | Number of videos to sample (null = all) |
| `random_seed` | integer | null | Seed for reproducible sampling |
| `min_frequency_threshold` | float | 0.05 | Min category frequency for aggregate metrics |
| `parallel_execution` | boolean | true | Enable parallel prediction |
| `max_workers` | integer | 4 | Maximum parallel workers |
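For intuition, the sketch below shows how parallel_execution and max_workers typically map onto concurrent prediction calls; it is an illustration, not the evaluation module's internal code.

from concurrent.futures import ThreadPoolExecutor

def predict_one(video_id):
    # placeholder for a single adapter prediction call
    return video_id

video_ids = ["video_001", "video_002", "video_003"]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(predict_one, video_ids))
print(results)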
Each model in the models list requires:
| Field | Type | Description |
|---|---|---|
| `model_type` | string | Type identifier (e.g., "gemini", "custom") |
| `model_name` | string | Unique name for this model |
| `adapter_class` | string | Adapter class name to use |
| `config` | object | Model-specific configuration |
The ground truth CSV should have columns:
- video_id: Unique video identifier
- video_uri: Video file path/URI
- script_uri: Script file path
- Category columns (19 value categories): Achievement, Benevolence, Conformity, etc.
Values should be:
- -1: Value is contradicted
- 0: Value not present
- 1: Value is present
- 2: Value is strongly emphasized
See examples/sample_ground_truth.csv for a sample file.
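Before running an evaluation, it can help to sanity-check the ground truth file. A minimal sketch, assuming pandas:

import pandas as pd

gt = pd.read_csv("examples/sample_ground_truth.csv")

required = {"video_id", "video_uri", "script_uri"}
missing = required - set(gt.columns)
assert not missing, f"Missing columns: {missing}"

# Every category value should be one of -1, 0, 1, 2
category_cols = [c for c in gt.columns if c not in required]
valid = gt[category_cols].isin([-1, 0, 1, 2]).all(axis=1)
print("Rows with out-of-range values:", int((~valid).sum()))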
The evaluation calculates these metrics for each category:
| Metric | Description |
|---|---|
| `precision` | True positives / (True positives + False positives) |
| `recall` | True positives / (True positives + False negatives) |
| `f1_score` | Harmonic mean of precision and recall |
| `support` | Number of ground truth instances |
Aggregate metrics are provided for:
- Endorsed values: Categories with values 1 or 2 (collapsed to binary)
- Conflict values: Categories with value -1
- Combined: All predictions together
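The per-category numbers correspond to standard classification metrics. The snippet below is a sketch of computing them with scikit-learn for one category, with endorsed values collapsed to binary (1 or 2 -> 1, everything else -> 0); the toy labels are made up.

from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 1, 2, 0, 1]  # ground truth scores for one category
y_pred = [0, 1, 1, 0, 0]  # model predictions for the same category

true_bin = [1 if v >= 1 else 0 for v in y_true]
pred_bin = [1 if v >= 1 else 0 for v in y_pred]
precision, recall, f1, _ = precision_recall_fscore_support(
    true_bin, pred_bin, average="binary", zero_division=0
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")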
After evaluation, the following files are generated:
- {model_name}_category_metrics.csv: Per-category metrics
- {model_name}_aggregate_metrics.csv: Summary metrics
- {model_name}_report.json: Complete JSON report
- model_comparison.csv: Side-by-side model comparison (if multiple models)
To evaluate a custom model, create an adapter class:
from evaluation.adapters import ModelAdapter
from evaluation.models import VideoAnnotation, PredictionResult

class MyModelAdapter(ModelAdapter):
    def __init__(self, model_name: str, **config):
        super().__init__(model_name=model_name, **config)
        # Initialize your model

    def initialize(self) -> bool:
        # Return True if initialization succeeds
        return True

    def predict(self, video: VideoAnnotation) -> PredictionResult:
        # Run prediction and return result
        predictions = {"Achievement": 1, "Benevolence": 0, ...}
        return PredictionResult(
            video_id=video.video_id,
            predictions=predictions,
            success=True
        )

    def get_model_type(self) -> str:
        return "my_model"

    def get_model_name(self) -> str:
        return self._model_name

Register and use your adapter:
from evaluation import EvaluationOrchestrator
EvaluationOrchestrator.register_adapter("MyModelAdapter", MyModelAdapter)

The examples/ directory contains:
- evaluation_config.yaml: Documented configuration template
- sample_ground_truth.csv: Sample dataset with 10 videos
- sample_scripts/: Sample script files
[Add your license information here]
[Add contribution guidelines here]
For questions or issues, please [add contact information or issue tracker link].