Inference code for text-image-to-image (TI→I) tasks

Hello, thank you for open-sourcing the Tar project. 

I'm currently trying to reproduce the text-image-to-image (TI→I) task mentioned in your paper, aiming to generate images like "a photo of XX in the style of <image>".

My implementation is similar to this:
```python
import re

import torch
from PIL import Image
# Assumption: to_tensor is torchvision's helper (the original snippet omitted its imports)
from torchvision.transforms.functional import to_tensor

# The code below runs inside a method of my inference class (hence `self`).
prompt = "a photo of panda in the style of <image>"

# Encode the reference image into discrete visual tokens
image = Image.open(image_path).convert('RGB')
image = to_tensor(image).unsqueeze(0).to(self.device)
image_code = self.input_visual_tokenizer(image)['encoded']
image_text = "".join([f"<I{x}>" for x in image_code[0].cpu().tolist()])

# Prepare prompt: image tokens first, then the text instruction
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": f"{image_text}\n{prompt}"}
]

input_text = self.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True)
# Start the image response at the configured scale
input_text += f"<im_start><S{self.config.scale}>"

# Generate tokens
inputs = self.tokenizer(input_text, return_tensors="pt")
gen_ids = self.model.generate(
    inputs.input_ids.to(self.device),
    max_new_tokens=self.config.seq_len,
    do_sample=True,
    temperature=self.config.temperature,
    top_p=self.config.top_p,
    top_k=self.config.top_k)

# Process generated tokens: decode everything generate() returned
# and collect every <I...> code found in that text
gen_text = self.tokenizer.batch_decode(gen_ids)[0]
gen_code = [int(x) for x in re.findall(r'<I(\d+)>', gen_text)]
# Truncate or zero-pad to exactly seq_len codes
gen_code = gen_code[:self.config.seq_len] + [0] * max(0, self.config.seq_len - len(gen_code))
gen_code = torch.tensor(gen_code).unsqueeze(0).to(self.device)

# Decode the codes back to pixels
gen_tensor = self.visual_tokenizer.decode_from_encoder_indices(
    gen_code,
    {'cfg_scale': self.config.cfg_scale}
)
gen_image = Image.fromarray(gen_tensor[0].cpu().numpy())
```

However, the generated image is almost identical to the input image rather than a new image in the specified style. Could you please share the recommended inference code for subject-driven TI→I tasks with the Tar model?
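One possible cause worth checking (an assumption, not confirmed against the Tar codebase): for Hugging Face decoder-only models, `generate()` returns the prompt tokens followed by the new tokens. If the whole sequence is decoded, the `<I…>` regex also matches the input image's codes, and truncating to `seq_len` may keep mostly those. The toy sketch below illustrates the effect with a made-up vocabulary; `extract_generated_codes` and `vocab` are hypothetical names used only for illustration.

```python
import re

def extract_generated_codes(full_ids, prompt_len, decode, seq_len, pad_code=0):
    """Collect <I...> codes from the newly generated tokens only."""
    new_ids = full_ids[prompt_len:]          # drop the echoed prompt
    text = decode(new_ids)
    codes = [int(x) for x in re.findall(r"<I(\d+)>", text)]
    # Truncate or pad to exactly seq_len codes
    return codes[:seq_len] + [pad_code] * max(0, seq_len - len(codes))

# Toy demonstration with a hypothetical id-to-token vocabulary
vocab = {0: "<I7>", 1: "<I8>", 2: "panda", 3: "<I42>", 4: "<I43>"}
decode = lambda ids: "".join(vocab[i] for i in ids)

prompt_ids = [0, 1, 2]          # input image codes 7 and 8, plus the text prompt
full_ids = prompt_ids + [3, 4]  # prompt followed by two newly generated codes

# Decoding the full sequence picks up the input image's codes first
leaky = [int(x) for x in re.findall(r"<I(\d+)>", decode(full_ids))][:2]
clean = extract_generated_codes(full_ids, len(prompt_ids), decode, seq_len=2)

print(leaky)  # [7, 8]
print(clean)  # [42, 43]
```

In the snippet above, the equivalent change would be to decode `gen_ids[:, inputs.input_ids.shape[1]:]` instead of `gen_ids`. Whether this fully explains the behavior is unverified.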

Any code examples or detailed instructions would be greatly appreciated.

Thank you again for your hard work and for sharing this project with us!
