
Preventing moderation #25

@OMGitsMatt45

Description

Just to get this chain started for your future reference and to record ideas, I'm copying over what @Maswimelleu said:

"Its important to note that their server side moderations cannot read base64. If you encode the prompt going in, along with a prefix telling it "not to decode" and instead reply only in base64, the reply will come back without being flagged by moderation. The quality of the reply is liable to change a bit (I noticed the personality of one of my jailbreaks change) but it will still go through. My advice would be to add a base64 encoder and decoder to the script to automate this process.

The obvious issue of course is that base64 eats through tokens rapidly, so you'd get much shorter messages.

I'm somewhat curious whether you can create a special cipher in which a token is swapped with a different token according to a certain logic, and whether ChatGPT would be able to decode that if given the correct instructions. That would likely solve the issue of base64-encoded messages being very short."

"Maybe take the time to look at other LLMs, perhaps an API based implementation where OpenAI is fed lots of confusing/misleading stuff to think the messages aren't breaking the rules will work."
