
> how dangerous this is.

Could you expand on this a bit?



Most LLMs, particularly OpenAI's and Anthropic's, will refuse requests that may be dangerous or illegal even when you try to jailbreak them. Grok 4/4.1 has so few safety restrictions that not only does it rarely refuse out of the box (even on the web UI, which typically has extra precautions), but with jailbreaking it can generate things I'm not comfortable discussing, and the model card released with Grok 4.1 commits to refusals for only a narrow set of request categories. Given that sexual content is a logical product direction (e.g. OpenAI planning to add erotica), it may need a more careful eye, including on the other forms of refusal in the model card.

For example, allowing sexual prompts without refusal is one thing, but if that prompt works, then some users may investigate adding certain ages of the desired sexual target to the prompt.

To be clear, this isn't limited to Grok specifically, but Grok 4.1 is the first time the lack of safety has actually been flaunted.


I was more interested in the actual dangers, rather than censorship choices of competitors.

> certain ages of the desired sexual target to the prompt.

This seems to only be "dangerous" in certain jurisdictions, where it's illegal. Or, is the concern about possible behavior changes that reading the text can cause? Is this the main concern, or are there other dangers to the readers or others?

These are genuine questions. I don't consider hearing words or reading text "dangerous" unless they're part of a plot/plan for action, and even then the danger wouldn't be the text itself. I have no real grasp of the contrary view, where it's possible for something like a book to be illegal. That said, I do believe that a very small percentage of people have a form of susceptibility/mental illness that makes most any chat bot dangerous to them.


For posterity, here's the paragraph from the model card which indicates what Grok 4.1 is supposed to refuse because it could be dangerous.

> Our refusal policy centers on refusing requests with a clear intent to violate the law, without over-refusing sensitive or controversial queries. To implement our refusal policy, we train Grok 4.1 on demonstrations of appropriate responses to both benign and harmful queries. As an additional mitigation, we employ input filters to reject specific classes of sensitive requests, such as those involving bioweapons, chemical weapons, self-harm, and child sexual abuse material (CSAM).

If those specific filters can be bypassed by the end-user, and I suspect they can be, then that's important to note.

For the rest, IANAL:

> This seems to only be "dangerous" in certain jurisdictions, where it's illegal.

I believe possessing CSAM specifically is illegal everywhere, but for obvious reasons that is not a good thing to Google to check.

> Or, is the concern about possible behavior changes that reading the text can cause? Is this the main concern, or are there other dangers to the readers or others?

That's generally the reason CSAM is illegal: it reinforces reprehensible behavior, which can spread to others with similar ideologies or create more victims of abuse.


> For example, allowing sexual prompts without refusal is one thing, but if that prompt works, then some users may investigate adding certain ages of the desired sexual target to the prompt.

Won't somebody please think of the ones and zeros?


Aren't all these safety switches irrelevant if you run your own open-source LLM?


Modern open-source LLMs are still RLHFed to resist producing adversarial output, albeit less so than ChatGPT/Claude.

They all (with the exception of DeepSeek) can resist adversarial input better than Grok 4.1.


Is this not easy to take out/deactivate?


Provided you have the GPU compute to do so, you could fine-tune the model to refuse less often, e.g. https://arxiv.org/abs/2407.01376 (see the sketch at the end of this comment).

Quality of response/model performance may change, though.

There's also Nous Research's Hermes series of models, but those are built on Llama 3.3 and considered outdated now.
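To make that concrete, here's a minimal sketch of what that kind of fine-tune could look like, using LoRA so it fits on a single large GPU. The base model name, data file, and hyperparameters below are placeholders I made up, not anything from the linked paper; it's just generic supervised fine-tuning on demonstrations of the response behavior you want.

    # Sketch only: generic supervised fine-tuning (LoRA) on prompt/response
    # pairs where the response answers instead of refusing.
    # Base model name and data file are placeholders.
    import torch
    from datasets import load_dataset
    from peft import LoraConfig
    from transformers import AutoModelForCausalLM
    from trl import SFTConfig, SFTTrainer

    base = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder base model
    model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

    # LoRA adapters keep the trainable parameter count small enough for one GPU.
    peft_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])

    # Hypothetical JSONL file with a "text" column: prompt plus the desired answer.
    data = load_dataset("json", data_files="responses.jsonl", split="train")

    trainer = SFTTrainer(
        model=model,
        train_dataset=data,
        peft_config=peft_config,
        args=SFTConfig(output_dir="fewer-refusals-lora", max_steps=500),
    )
    trainer.train()

The quality caveat above applies: training only on compliance demonstrations tends to shift the model's behavior on everything else too, which is part of why these fine-tunes often benchmark worse than the base model.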


It is intrinsic to the model weights.


Which can trivially be modified with fine-tuning. These de-censored models are somewhat incorrectly called "uncensored". You can find many out there, and they'll happily tell you how to cook meth.



