Hello,
I have already generated watermarked text with the code sample below:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    SynthIDTextWatermarkingConfig,
)

# Standard model and tokenizer initialization
tokenizer = AutoTokenizer.from_pretrained('repo/id')
model = AutoModelForCausalLM.from_pretrained('repo/id')

# SynthID Text configuration
watermarking_config = SynthIDTextWatermarkingConfig(
    keys=[654, 400, 836, 123, 340, 443, 597, 160, 57, ...],
    ngram_len=5,
)

# Generation with watermarking
# return_tensors="pt" makes the tokenizer return tensors, which generate() expects
tokenized_prompts = tokenizer(["your prompts here"], return_tensors="pt")
output_sequences = model.generate(
    **tokenized_prompts,
    watermarking_config=watermarking_config,
    do_sample=True,
)
watermarked_text = tokenizer.batch_decode(output_sequences)
Could you help me with:
How do I train the detector, and how do I then use it to detect the watermark?
My output texts are at most 200 tokens long. Can you suggest a detection threshold for texts of that length?
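To make the threshold question concrete, here is a minimal sketch of the kind of calibration I have in mind. It assumes I can already obtain one detector score per text (higher meaning "more likely watermarked"), which is exactly the part I do not know how to do yet. The function names `threshold_at_fpr` and `detection_rates` are hypothetical helpers of mine, the scores are made-up placeholders, and nothing here is part of the SynthID Text or transformers API.

```python
import math


def threshold_at_fpr(unwatermarked_scores, target_fpr):
    """Pick a threshold from scores of texts known to be unwatermarked.

    A text is flagged when its score is strictly greater than the threshold.
    The returned threshold keeps the false positive rate on the given
    unwatermarked scores at or below target_fpr.
    """
    if not unwatermarked_scores:
        raise ValueError("need at least one unwatermarked score")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    ordered = sorted(unwatermarked_scores)
    n = len(ordered)
    # Number of unwatermarked texts we tolerate being flagged by mistake.
    allowed = math.floor(target_fpr * n)
    return ordered[n - allowed - 1]


def detection_rates(unwatermarked_scores, watermarked_scores, threshold):
    """Return (false positive rate, true positive rate) at a threshold."""
    fpr = sum(s > threshold for s in unwatermarked_scores) / len(unwatermarked_scores)
    tpr = sum(s > threshold for s in watermarked_scores) / len(watermarked_scores)
    return fpr, tpr


# Placeholder scores, NOT real detector output.
unwatermarked = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
watermarked = [0.55, 0.65, 0.9, 0.95]

threshold = threshold_at_fpr(unwatermarked, target_fpr=0.25)
fpr, tpr = detection_rates(unwatermarked, watermarked, threshold)
print(threshold, fpr, tpr)
```

If this is a reasonable approach, I would run it on held-out texts truncated to the same 200-token limit, since I assume the score distribution depends on text length.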