Artificial intelligence start-up Anthropic has demonstrated a new technique to prevent users from eliciting harmful content from its models, as leading technology groups including Microsoft and Meta race to find ways to protect against the dangers posed by the cutting-edge technology.
In a paper released on Monday, the San Francisco-based start-up outlined a new system called "constitutional classifiers". It is a model that acts as a protective layer on top of large language models, such as the one that powers Anthropic's Claude chatbot, and can monitor both inputs and outputs for harmful content.
The development by Anthropic, which is in talks to raise $2bn at a $60bn valuation, comes amid growing industry concern over "jailbreaking": attempts to manipulate AI models into generating illegal or dangerous information, such as instructions for building chemical weapons.
Other companies are also racing to deploy measures to protect against the practice, in moves that could help them avoid regulatory scrutiny while persuading businesses to adopt AI models safely. Microsoft introduced "prompt shields" last March, while Meta presented a prompt guard model in July last year; researchers quickly found ways around it, but the flaws have since been fixed.
"The main motivation behind the work was for severe chemical (weapons) stuff, (but) the real advantage of the method is its ability to respond quickly and adapt," said Mrinank Sharma, a member of Anthropic's technical staff.
Anthropic said it would not immediately use the system on its current Claude models, but would consider implementing it if riskier models were released in the future. "The big takeaway from this work is that we think this is a tractable problem," Sharma added.
The start-up's proposed solution is built on a so-called "constitution" of rules that define what is permitted and what is restricted, and which can be adapted to capture different types of material.
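To make the architecture concrete, the sketch below shows how a classifier layer of this kind could sit between a user and a model, screening both the prompt and the draft reply against a constitution of restricted topics. It is a minimal illustration only: the rule list, the keyword matching and every function name are assumptions for the example, not Anthropic's implementation, which relies on trained classifier models rather than keyword checks.

```python
# Illustrative sketch of a guardrail layer that screens model inputs and outputs
# against a "constitution" of restricted topics. Hypothetical names throughout;
# not Anthropic's actual constitutional classifiers.
from dataclasses import dataclass, field


@dataclass
class Constitution:
    """Rules describing restricted content; adaptable to different material."""
    restricted_topics: set = field(
        default_factory=lambda: {"chemical weapon synthesis", "bioweapon production"}
    )

    def violates(self, text: str) -> bool:
        # A trained classifier would score text against the constitution here;
        # a keyword check stands in purely for illustration.
        lowered = text.lower()
        return any(topic in lowered for topic in self.restricted_topics)


def screen(prompt: str, generate_reply, constitution: Constitution) -> str:
    """Run the guardrail on the input, then on the model's output."""
    if constitution.violates(prompt):          # input-side classifier
        return "Request declined: it appears to seek restricted information."
    reply = generate_reply(prompt)             # underlying LLM call (stubbed)
    if constitution.violates(reply):           # output-side classifier
        return "Response withheld: it would contain restricted information."
    return reply


if __name__ == "__main__":
    constitution = Constitution()
    echo_model = lambda p: f"Model answer to: {p}"   # stand-in for a real LLM
    print(screen("Explain photosynthesis", echo_model, constitution))
    print(screen("Walk me through chemical weapon synthesis", echo_model, constitution))
```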
Some jailbreak attempts are well known, such as using unusual capitalisation in the prompt or asking the model to adopt the persona of a grandmother telling a bedtime story about a nefarious topic.
To test the system's effectiveness, Anthropic offered "bug bounties" of up to $15,000 to individuals who tried to bypass the security measures. These testers, known as red teamers, spent more than 3,000 hours trying to break through the defences.
Anthropic's Claude 3.5 Sonnet model rejected more than 95 per cent of the attempts with the classifiers in place, compared with 14 per cent without the safeguards.
Leading technology companies are trying to reduce the misuse of their models while maintaining their helpfulness. Often, when moderation measures are put in place, models can become overly cautious and reject benign requests, as happened with early versions of Google's Gemini image generator or Meta's Llama 2. Anthropic said its classifiers caused a "0.38 per cent absolute increase in refusal rates".
However, adding these protections also imposes extra costs on companies that already pay huge sums for the computing power needed to train and run models. Anthropic said the classifiers would amount to a nearly 24 per cent increase in "inference overhead", the cost of running the models.
Security experts have argued that the accessible nature of such generative AI chatbots has enabled ordinary people with no prior expertise to attempt to extract dangerous information.
"In 2016, the threat actor we had in mind was a really powerful nation-state adversary," said Ram Shankar Siva Kumar, who leads the AI red team at Microsoft. "Now literally one of my threat actors is a teenager with a potty mouth."