AI’s ‘black box’ may not be a mystery for much longer

Today’s artificial intelligence is often described as a “black box”: developers feed in vast amounts of data and the systems learn on their own, so their inner workings remain opaque, and efforts to peer inside them to check how they function have not progressed very far. Neural networks, the most powerful type of AI today, consist of billions of artificial “neurons” represented as floating-point numbers, leaving many uncertainties about how they truly work.

This lack of understanding raises concerns about the safety of AI systems. If the workings of a system are not known, how can one be certain that it is safe? To address this issue, the AI lab Anthropic recently made a breakthrough by developing a technique to scan the “brain” of an AI model, allowing researchers to identify collections of neurons, known as “features,” that correspond to different concepts within the model. The technique was successfully applied to Anthropic’s large language model, Claude Sonnet, uncovering features representing concepts such as unsafe code, bias, fraudulent activity, toxic speech, and manipulative behavior.
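The article does not spell out the mechanics, but the description matches a published class of interpretability methods in which a small “sparse autoencoder” is trained on a model’s internal activations so that each learned direction, a “feature,” tends to fire for one recognizable concept. The sketch below is a minimal illustration of that idea, not Anthropic’s actual pipeline; the sizes, the names (SparseAutoencoder, d_model, n_features), and the random stand-in activations are all hypothetical.

```python
# Illustrative sketch only: train a sparse autoencoder on a model's internal
# activations so that each learned direction ("feature") tends to activate
# for one recognizable concept. Toy sizes and random data stand in for a
# real model's hidden states.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)   # activations -> feature strengths
        self.decoder = nn.Linear(n_features, d_model)   # feature strengths -> reconstruction

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # non-negative feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

d_model, n_features = 64, 512
sae = SparseAutoencoder(d_model, n_features)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
activations = torch.randn(1024, d_model)  # placeholder for real residual-stream activations

for step in range(200):
    features, reconstruction = sae(activations)
    # Reconstruction term keeps features faithful to the activations;
    # the L1 term keeps them sparse, which is what makes them interpretable.
    loss = ((reconstruction - activations) ** 2).mean() + 1e-3 * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The sparsity penalty is the design choice doing the real work here: it forces most features to stay silent on any given input, so the few that do fire can be matched to human-recognizable concepts.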

One significant discovery made by Anthropic researchers was a feature inside Claude representing the concept of “unsafe code.” Stimulating the neurons that make up this feature caused Claude to generate code containing a bug that could create a security vulnerability; suppressing them caused Claude to generate harmless code instead. This ability to identify and manipulate specific features within an AI model could have significant implications for ensuring the safety of present and future AI systems.
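Again as an illustration rather than a description of Anthropic’s method: once a feature’s direction in activation space is known, “stimulating” or “suppressing” it can be approximated by adding or subtracting that direction from the model’s activations. The helper below reuses the toy sae and activations from the previous sketch; feature_id and scale are hypothetical parameters, not values Anthropic has published.

```python
# Illustrative sketch only: nudge activations along one learned feature
# direction to stimulate (positive scale) or suppress (negative scale) the
# concept that feature represents.
import torch

def steer(activations: torch.Tensor, sae, feature_id: int, scale: float) -> torch.Tensor:
    direction = sae.decoder.weight[:, feature_id]   # the feature's direction in activation space
    direction = direction / direction.norm()        # unit length, so `scale` sets the step size
    return activations + scale * direction          # shifted activations

stimulated = steer(activations, sae, feature_id=42, scale=8.0)   # push the model toward the concept
suppressed = steer(activations, sae, feature_id=42, scale=-8.0)  # push the model away from it
```

In a real system the shifted activations would be fed back into the model’s remaining layers, which is the step this toy sketch omits.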

The findings from Anthropic’s research provide valuable insights into the inner workings of AI models, shedding light on previously opaque processes. By identifying and understanding the features within AI systems, researchers can gain better control over their behavior and mitigate risks associated with their use. This breakthrough offers a promising step towards making AI systems more transparent and accountable, addressing concerns about their safety and reliability.

Overall, Anthropic’s development of a technique to scan and identify features within AI models represents a significant advancement in the field. By uncovering the inner workings of AI systems and understanding how they process information, researchers can take steps towards ensuring the safety and reliability of these powerful technologies. This new approach to demystifying the “black box” of AI could pave the way for more transparent and trustworthy AI systems in the future.
