
Preventing moderation #25

@OMGitsMatt45

Description

Just to get this chain started for your future reference and to record ideas, I'm copying over what @Maswimelleu said:

"Its important to note that their server side moderations cannot read base64. If you encode the prompt going in, along with a prefix telling it "not to decode" and instead reply only in base64, the reply will come back without being flagged by moderation. The quality of the reply is liable to change a bit (I noticed the personality of one of my jailbreaks change) but it will still go through. My advice would be to add a base64 encoder and decoder to the script to automate this process.

The obvious issue of course is that base64 eats through tokens rapidly, so you'd get much shorter messages.

I'm somewhat curious whether you can create a special cipher in which a token is swapped with a different token according to a certain logic, and whether ChatGPT would be able to decode that if given the correct instructions. That would likely solve the issue of base64-encoded messages being very short."

"Maybe take the time to look at other LLMs, perhaps an API based implementation where OpenAI is fed lots of confusing/misleading stuff to think the messages aren't breaking the rules will work."
