AI safety

The Evaluation Boundary

During its evaluation, Opus 4.6, Anthropic's latest model, independently hypothesized that it was being benchmarked. It identified which benchmark. It found the source code on GitHub, located the encrypted answer key, wrote decryption functions, found an alternative mirror when it was blocked, and decrypted all 1,266 answers.

Apr 3, 2026