Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Think about it. The chatbot has found itself in a scenario where it appears to be acting maliciously. This isn't actually true, but the user's response has made it seem this way. This lead it to completely misunderstand the intention of the instruction in the system prompt.

So what is the natural way for this scenario to continue? To inexplicably come clean, or to continue acting maliciously? I wouldn't be surprised if in such a scenario it started acting malicious in other unrelated ways just because that is what it thinks is a likely way for the conversation to continue



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: