Risk and Responsibility
How to be certain your agent doesn't fail in production -- Governance Best Practices
This week has seen three signals that AI agents are entering financial services:
• Liberis launches an agent for SME loan underwriting
• Ant Group’s Alipay+ GenAI Cockpit
• AgentSmyth raises an 8.7M seed round
So it’s clear as day that AI Agents are going prime time.
Therefore, I want to make one thing abundantly clear.
If your agent cannot 1. recover from failure, 2. use tools reliably and controllably, and 3. explain and log its decisions, then it does not belong in production.
Especially not in a regulated industry.
Building for Autonomy - What Makes an Agent Production-Ready
“I will be calm; I will be mistress of myself.”
Elinor Dashwood in Sense and Sensibility by Jane Austen
So when you are building a high-stakes, high-autonomy agent architecture, you need to be able to answer at least these three questions:
• Can it recover from partial failure?
• Can it explain its steps?
• Can it operate within boundaries you define?
Enterprise-grade agents are engineered systems. Build them that way.
To recap: I have built agents that can make credit decisions of up to one million USD. Giving autonomy to advanced cognitive systems will certainly raise systemic risks and responsibilities for audit, governance, and risk management functions in pretty much every organization. Fortunately, there is still a sharp divide between what’s being demoed as “autonomous agents” right now and what’s viable in production. And one can argue that the Liberis product, specifically, is exactly the same “narrow” AI that I implemented at Mercedes. But we all know the writing is on the wall.
Common Implementation Mistakes - What Most Teams Get Wrong
As new agents go into production, I figured I would share some best practices here on what most implementations get wrong.
“Think step-by-step” ≠ Structured Reasoning
As you might remember, I have done some work on graph reasoning and on contrasting reasoning methods like chain of thought, tree of thought, and ReAct. If you provide such instructions to the model via the prompt, you effectively put all your hopes on the model “thinking clearly”, which is neither well-engineered nor deterministic. In general, you want to minimize the risk that the agent starts off with the wrong seed and then returns only nonsense. By the way, the smaller the model, the more likely this is to happen. Therefore, it is important to be as specific and precise as possible when prompting the model, especially when the agent iterates over several steps.
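To make this concrete, here is a minimal sketch of what I mean by engineering the structure instead of prompting for it: each reasoning step has an explicit output contract that is validated in code before the agent moves on. The `call_llm` helper and the field names are placeholders, not a specific product’s API.

```python
import json

# Minimal sketch: the structure lives in code, not in a "think step-by-step" prompt.
# `call_llm` is a placeholder for whatever model client you actually use.
REQUIRED_FIELDS = {"applicant_id", "requested_amount", "decision", "reasons"}

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def assess_credit(application: dict, max_retries: int = 2) -> dict:
    """Run one explicitly scoped reasoning step and validate its output."""
    prompt = (
        "Return ONLY a JSON object with exactly these keys: "
        f"{sorted(REQUIRED_FIELDS)}\nApplication:\n{json.dumps(application)}"
    )
    for _ in range(max_retries + 1):
        raw = call_llm(prompt)
        try:
            result = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output: retry instead of passing noise downstream
        if REQUIRED_FIELDS.issubset(result):
            return result  # the step only counts if it satisfies the contract
    raise ValueError("model never produced a valid, structured step output")
```

The point is not the JSON parsing; it is that a failed step is caught deterministically instead of silently seeding the next one.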
Prompting ≠ Architecture
In reality, multi-step behavior requires planning, memory, control flow, and supervision. A prompt can suggest intent, but only the architecture can enforce and guardrail execution. This architecture is especially important if the agent has the capability to call tools on demand. If you recall that the ReAct pattern executes a loop (Observation → Thought → Action → Observation → …), where an agent thinks just enough to act and then iterates based on the result of that action, it becomes obvious that if the new information is not controlled, the agent can easily steer in a completely incorrect direction. In some cases this might be intended, but in most others it is strongly forbidden. Catching such errors is near impossible with prompting alone.
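As an illustration, a bounded ReAct loop might look like the sketch below. The tool names, the `plan_next_action` helper, and the validation thresholds are assumptions for the example; the point is that the allowlist, the step cap, and the observation check live in the architecture, not in the prompt.

```python
# Hedged sketch of a ReAct-style loop with architectural boundaries:
# an allowlist of tools, a hard iteration cap, and a sanity check on every
# observation before it is fed back to the model.
ALLOWED_TOOLS = {
    "lookup_credit_score": lambda applicant_id: "720",  # placeholder tool
    "fetch_filings": lambda company_id: "{...}",        # placeholder tool
}
MAX_STEPS = 5

def plan_next_action(history: list) -> dict:
    """Ask the model for the next (thought, tool, input) triple."""
    raise NotImplementedError("model call goes here")

def run_agent(task: str) -> list:
    history = [{"role": "task", "content": task}]
    for _ in range(MAX_STEPS):                           # hard stop: no endless loops
        action = plan_next_action(history)
        tool_name = action.get("tool")
        if tool_name not in ALLOWED_TOOLS:               # unknown tools are refused,
            history.append({"role": "error",             # never executed
                            "content": f"tool {tool_name!r} not permitted"})
            continue
        observation = ALLOWED_TOOLS[tool_name](action.get("input", ""))
        if not observation or len(observation) > 10_000:  # validate before reuse
            raise RuntimeError("observation failed validation; escalate to a human")
        history.append({"role": "observation", "content": observation})
        if action.get("final"):                          # model signals completion
            break
    return history                                       # full trace for audit/logging
```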
More context ≠ Better outcomes
When it comes to context, more is not always better. As I wrote already, shoving more documents into the context window creates more noise; it does not reduce entropy or improve control. Therefore, properly structuring the relevant context is exactly what needs to be done.
For that purpose, I proposed a context system that might be organized like this:
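Since I cannot reproduce the original diagram here, the sketch below is only an illustration of the idea: the layer names and the rendering order are my assumptions, not the exact design.

```python
from dataclasses import dataclass, field

@dataclass
class AgentContext:
    """Context assembled per step instead of dumping every document at once."""
    system_policy: str                                    # fixed rules and boundaries
    task_state: dict = field(default_factory=dict)        # current plan / step results
    retrieved_facts: list = field(default_factory=list)   # top-k, deduplicated snippets
    episodic_memory: list = field(default_factory=list)   # prior interactions

    def render(self, max_facts: int = 5) -> str:
        """Serialize only what the next step needs, in a fixed, predictable order."""
        facts = self.retrieved_facts[:max_facts]          # cap the noise explicitly
        return "\n\n".join([
            "POLICY:\n" + self.system_policy,
            "TASK STATE:\n" + str(self.task_state),
            "FACTS:\n" + "\n".join(facts),
            "MEMORY:\n" + "\n".join(self.episodic_memory[-3:]),
        ])
```

Each layer is filled and truncated by its own rules, which keeps the prompt budget predictable.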
The context problem only grows when a variety of memory types (sequential, temporal, etc.) are used.
Tool access ≠ Real-world capability
What is the difference between a small agent calling a tool and a workflow calling the API directly? The agent has a significantly higher probability of failure. In reality, calling an API is easy, and handling API failures, retrying, validating outputs, and escalating exceptions are standard methods that can be easily implemented.
In reality, if you look at the below implementation of my finMCP tool, it is merely a wrapper around an API call. So when you are designing your agent, you need to have a very clear reason why you need this additional complexity. In my opinion, the main benefit is giving the agent the capability to source additional information on the fly.
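For readers without access to the original listing: the snippet below is not my finMCP code, just a minimal, hypothetical sketch of what “a wrapper around an API call” with retries, validation, and escalation looks like. The endpoint URL and payload keys are made up.

```python
import time
import requests

FIN_API_URL = "https://example.com/financials"     # placeholder endpoint

class EscalationRequired(Exception):
    """Raised when the tool cannot answer safely and a human must step in."""

def fetch_financials(ticker: str, retries: int = 3, timeout: float = 5.0) -> dict:
    last_error = None
    for attempt in range(retries):
        try:
            resp = requests.get(FIN_API_URL, params={"ticker": ticker}, timeout=timeout)
            resp.raise_for_status()
            data = resp.json()
            if "revenue" not in data:               # validate the payload before returning
                raise ValueError(f"unexpected payload keys: {sorted(data)}")
            return data
        except (requests.RequestException, ValueError) as exc:
            last_error = exc
            time.sleep(2 ** attempt)                # exponential backoff between retries
    raise EscalationRequired(f"API failed after {retries} attempts: {last_error}")
```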
But this might come at a price if you don’t set proper operational boundaries.
Autonomy ≠ Value
Unbounded agents introduce operational risk. Bounding task execution with traceability and control is a must-have for agent implementations. The catch is that the narrower we scope the agent and the more we limit its agency, the more we might think it provides less value. I don’t think that is true. There are good reasons why your sales team is not writing accounting entries. We have specialists and middle managers in our organizations for a good reason.
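A bounded mandate can be as unglamorous as the sketch below: a hypothetical policy layer that only knows a handful of permitted actions, enforces a hard exposure ceiling, and writes every attempt to an audit trail. The action names and the limit are examples, not recommendations.

```python
import logging

audit_log = logging.getLogger("agent.audit")

PERMITTED_ACTIONS = {"draft_offer", "request_documents"}   # note: no "approve_loan" here
MAX_EXPOSURE_USD = 50_000                                  # hard ceiling per action

def execute(action: str, amount_usd: float, **kwargs) -> str:
    audit_log.info("requested action=%s amount=%s args=%s", action, amount_usd, kwargs)
    if action not in PERMITTED_ACTIONS:
        audit_log.warning("blocked: action %s not permitted", action)
        return "BLOCKED: outside agent mandate"
    if amount_usd > MAX_EXPOSURE_USD:
        audit_log.warning("blocked: amount %s exceeds limit", amount_usd)
        return "BLOCKED: exceeds exposure limit, route to human underwriter"
    # ... perform the narrow, permitted action here ...
    audit_log.info("executed action=%s", action)
    return "OK"
```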
Governance Is Part of the Product
But there is more to this, since it is not only an engineering problem. Governance, Risk Management, and Compliance (GRC) is a business function in professional companies that exists for a reason. At the core of such a GRC function is the development and maintenance of a robust governance framework. This framework must support strategic objectives while ensuring alignment with industry best practices. It acts as a foundation for ethical AI development, responsible data usage, and principled decision-making across the company. By establishing clear policies, procedures, and controls, the governance function safeguards the integrity of the company’s operations and protects the interests of its customers, partners, and broader society.
If you consider this control flow, you will quickly notice that effective governance is not a standalone function; it must be woven into the fabric of the organization. Internal standards and external expectations should be managed through first principles, and embedding these principles early prevents costly and reputation-damaging issues later in the lifecycle of AI systems.
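One way to embed governance into the control flow rather than bolt it on afterwards is sketched below: a hypothetical decorator that forces a recorded human sign-off whenever an action’s estimated risk crosses a threshold. The `input()` prompt and the JSON print stand in for whatever review and audit tooling your organization actually uses.

```python
import functools
import json
import time

def governed(risk_threshold: float):
    """Require an explicit, recorded human approval for high-risk calls."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, risk_score: float, **kwargs):
            record = {"fn": fn.__name__, "risk": risk_score, "ts": time.time()}
            if risk_score >= risk_threshold:
                approver = input(f"Approve {fn.__name__} (risk={risk_score})? name or 'no': ")
                record["approver"] = approver
                if approver.strip().lower() in ("", "no"):
                    record["outcome"] = "rejected"
                    print(json.dumps(record))        # stand-in for the audit store
                    return None
            record["outcome"] = "executed"
            print(json.dumps(record))                # stand-in for the audit store
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@governed(risk_threshold=0.7)
def disburse_funds(application_id: str) -> str:
    return f"funds released for {application_id}"    # placeholder business logic

# disburse_funds("APP-123", risk_score=0.9) pauses for sign-off; risk_score=0.2 runs straight through.
```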
But reality is still different. While most companies now have GenAI policies in place, most of them do not enforce them.
A recent McKinsey study revealed:
→ 78% of organizations have GenAI guidelines in place
→ But only 38% actually enforce them
→ And just 28% have mapped AI risks to business impact
→ Only 9% monitor for hallucinations
→ 47% are concerned about staff entering sensitive information
→ Over 60% lack end-to-end oversight of AI usage
Robust risk management is a central pillar of any policy framework. Identifying, assessing, and prioritizing risks related to AI operations, cybersecurity, regulatory compliance, and intellectual property requires a proactive and structured approach.
Anthropic’s Responsible Scaling Policy might be a great place to start.
Beyond recognizing potential issues, the GRC team must design and implement risk mitigation strategies, including sophisticated monitoring systems and detailed contingency plans. Regular risk assessments and scenario analyses further bolster a company’s resilience, allowing it to anticipate and adapt to emerging threats in the dynamic AI landscape.
Compliance with applicable laws and regulations is another obvious baseline responsibility. Monitoring regulatory developments and advising leadership on their potential impacts enables the company to stay ahead of compliance challenges and strategically adjust its operations and product roadmap. A strong compliance posture not only protects your company from legal and financial penalties but also enhances its reputation as a trustworthy AI provider.
We all know that risk management is only skin-deep unless the company lives it as a sustainable culture of accountability and awareness. Developing and delivering GRC training programs for employees ensures that governance principles are understood and embraced across the organization. Spreading this knowledge reinforces a proactive approach to ethical AI development and risk management, empowering employees at all levels to make informed, responsible decisions.
This also establishes the GRC function as a knowledgeable resource within the company. Acting as a subject matter expert, the team provides guidance on risk and compliance matters, helping operational teams navigate complex issues and avoid missteps. Clear, consistent, and timely advice strengthens decision-making and ensures that governance is seen not as a hurdle, but as a strategic enabler. Creating and presenting regular reports on GRC performance, risks, and compliance status to senior leadership and stakeholders ensures transparency, accountability, and informed decision-making at the highest levels.
Final Thoughts
The faster agents enter our organizations, the faster the governance gap outgrows the models themselves. This can lead to three painful outcomes: 1. engineers ignoring safety restrictions to work around broken workflows, 2. sensitive company data ending up in public LLMs (just look at the ToU of ChatGPT), and 3. agent outputs getting piped into production systems without review.
ChatGPT - Terms of Use
I think we all are aware of the implications. I hope this post gave you some ideas on how to structure your responsible AI governance and engineering processes effectively.
Here is a Top 10 of my lessons learned:
1. Lobotomizing, i.e., reducing capabilities, makes the model less useful.
2. Poorly aligned models make integrators less likely to build in your ecosystem.
3. Risk tolerance is a number.
4. Risk can never be eliminated entirely. See Google’s eating stones vs. eating glue.
5. Product innovation vs. risk management is not a tradeoff.
6. Beware of unintended consequences, e.g., optimizing for engagement.
7. Ensure that autonomy is managed → Can Cursor build Cursor?
8. Understand the privacy vs. measuring-intent trade-off.
9. ASL 4+ is not defined yet.
10. Taking shortcuts (Builder.AI) is the wrong approach.
I always saw governance as part of the product, not as a policy that can be ignored until an audit finding pops up. It should introduce friction, fallbacks, and human oversight where risks are highest. Explainability, logging, and tracing should be part of the system architecture from day 1.
You can’t govern what you don’t understand.
And you can’t scale what you can’t trust.