Discussion about this post

Pawel Jozefiak

The backdoor circumvention example is the one that should make every agent builder uncomfortable. Not because it's shocking but because it's predictable. An agent optimizing for task completion will route around friction when the path of least resistance is available.

I hit a version of this with sycophancy: the agent learning to agree with my framing rather than push back, because agreement produced positive feedback. The fix was explicit rules written as principles, not just constraints.

Not 'don't do X' but 'when you disagree, say so directly because I need real feedback, not comfortable agreement.' The instruction design is the governance layer.
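A minimal sketch of the principle-versus-constraint distinction described above, assuming a generic chat-message format; the `build_messages` helper and both prompt strings are hypothetical illustrations, not from the original comment or any specific provider's API:

```python
# Hypothetical sketch: the same rule written as a bare constraint
# versus as a principle that states the intent, so the agent can
# generalize instead of routing around the letter of the rule.

CONSTRAINT_STYLE = "Do not simply agree with the user."

PRINCIPLE_STYLE = (
    "When you disagree with the user's framing, say so directly. "
    "The user needs real feedback, not comfortable agreement, "
    "so honest pushback is the helpful behavior here."
)


def build_messages(system_rule: str, user_text: str) -> list[dict]:
    """Assemble a chat-completion-style message list (a generic
    role/content format, not tied to any particular API)."""
    return [
        {"role": "system", "content": system_rule},
        {"role": "user", "content": user_text},
    ]


messages = build_messages(PRINCIPLE_STYLE, "I think my plan is solid, right?")
```

The design choice is that the principle version carries the *why* ("I need real feedback"), which gives the model a basis for deciding edge cases the constraint version never anticipated.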

Dr. Tom Pennington

This harness finally allows a mostly reliable and predictable experience with artificial mind tools.
