At the start of this year, Jensen Huang, CEO of Nvidia, said 2025 will be the year of the AI agent. Many high-profile companies like Shopify and Duolingo have reinvented themselves with AI at it score, building internal systems and agents to automate processes and reduce headcount.
I spent the last 3 years running a Venture Studio that built startups with AI at the core. Prior to that I built one of the first AI companies on GPT-3. And now I consult for companies on AI implementation. Whether you’re a business leader looking to automate complex workflows or an engineer figuring out the nuts and bolts, this guide contains the entire process I use with my clients.
This is a non-technical guide and is model and framework agnostic. If you’re looking for technical implementations, I have guides on the OpenAI Agent SDK, Google’s Agent Development Kit, and CrewAI.
The purpose of this guide is to help you identify where agents will be useful in your organization, and how to design them to produce real business results. Much like you design a product before building it, this should be your first starting point before building an agent.
Let us begin.
PS – I’ve put together a 5-day email course where I walk through designing and implementing a live AI agent using no-code tools. Sign up below.
What Makes a System an “Agent”?
No, that automation you built with Zapier is not an AI agent. Neither is the chatbot you have on your website.
An AI agent is a system that independently accomplishes tasks on your behalf with minimal supervision. Unlike passive systems that just respond to queries or execute simple commands, agents proactively make decisions and take actions to accomplish goals.
Think of it like a human intern or an analyst. It can do what they can, except get you coffee.
How do they do this? There are 4 main components to an AI agent – the model, the instructions, the tools, and the memory. We’ll go into more detail later on, but here’s a quick visual on how they work.

The model is the core component. This is an AI model like GPT, Claude, Gemini or whatever, and it starts when it is invoked or triggered by some action.
Some agents get triggered by a chat or phone call. You’ve probably come across these. Others get triggered when a button is clicked or a form is submitted. Some even get triggered through a cron job at regular intervals, or an API call from another app.
For example, this content creation agent I built for a VC fund gets triggered when a new investment memo is uploaded to a form.
When triggered, the model uses the instructions it has been given to figure out what to do. In this case, the instructions tell it to analyze the memo, research the company, remove sensitive data, and convert it into a blog post.
To do this, the agent has access to tools such as a web scraper that finds information about the company. It loops through these tools and finally produces a blog post, using its memory of the fund’s past content to write in their tone and voice.
You can see how this is different from a regular automation, where you define every step. Even if you use AI in your automation, it’s one step in a sequence. With an agent, the AI forms the central component, decides which steps to performs, and then loops through them until the job is done.
We’ll cover how to structure these components and create that loop later. But first…
Do You really need an AI agent?
Most of the things you want automated don’t really need an AI agent. You can trigger email followups, schedule content, and more through basic automation tools.
Rule of thumb, if a process can be fully captured in a flowchart with no ambiguity or judgment calls, traditional automation is likely more efficient and far more cost-effective.
I also generally advise against building AI agents for high-stakes decisions where an error could be extremely costly, or there’s a legal requirement to provide explainability and transparency.
When you exclude processes that are too simple or too risky, you’re left with good candidates for AI Agents. These tend to be:
- Processes where you have multiple variables, shifting context, plenty of edge cases, or decision criteria that can’t be captured with rules. Customer refund approvals are a good example.
- Processes that resemble a tangled web of if-then statements with frequent exceptions and special cases, like vendor security reviews.
- Processes that involve significant amounts of unstructured data, like natural language understanding, reading documents, analyzing text or images, and so on. Insurance claims processing is a good example.
A VC fund I worked with wanted to automate some of their processes. We excluded simple ones like pitch deck submission (can be done through a Typeform with CRM integration), and high-stakes ones like making the actual investment decisions.
We then built AI agents to automate the rest, like a Due Diligence Agent (research companies, founders, markets, and competition, to build a thorough investment memo) and the content generation agent I mentioned earlier.
Practical Identification Process
To systematically identify agent opportunities in your organization, follow this process:
- Catalog existing processes
- Document current workflows, especially those with manual steps
- Note pain points, bottlenecks, and error-prone activities
- Identify processes with high volume or strategic importance
- Evaluate against the criteria above
- Score each process on complexity, reasoning requirements, tool access, etc.
- Eliminate clear mismatches (too simple, too risky, etc.)
- Prioritize high-potential candidates
- Assess feasibility
- Review available data and system integrations
- Evaluate current documentation and process definitions
- Consider organizational readiness and potential resistance
- Calculate potential ROI
- Estimate current costs (time, errors, delays)
- Project implementation and ongoing costs
- Quantify potential benefits (efficiency, quality, scalability)
- Start small and target quick wins
- Begin with bounded, lower-risk opportunities
- Focus on areas with clear metrics for success
- Build capabilities and confidence with each implementation
Remember that the best agent implementations often start with a clear problem to solve rather than a technology looking for an application.
Contact me if you need help with this
I offer free process audits to help companies identify where they can build agents and reduce wasted time. Book a time with me here.
Agent Architecture & Design Principles
Remember that loop I mentioned earlier? That’s our architecture pattern and will tell us how to select and connect the 4 core components of our agent.
Simple “Loop and Fetch” Architecture
The most basic agent architecture follows a straightforward loop:
- Receive input (from a user or another system or a cron job)
- Process the input using an AI model (with guidance from instructions)
- Determine the next action (respond directly or call a tool)
- Execute the action (use memory if needed)
- Observe the result (check against instructions)
- Loop back to step 2
This pattern works well for simpler agents with limited tool sets and straightforward workflows. It’s easy to implement and reason about, making it a good starting point for many projects and it’s the one I used for the content agent I mentioned.
Here’s a conceptual example:
function runAgent(input, context) {
while (true) {
// Process input with LLM
const llmResponse = model.process(input, context)
// Check if the LLM wants to use a tool
if (llmResponse.hasTool) {
// Execute the tool
const toolResult = executeTool(llmResponse.tool, llmResponse.parameters)
// Update context with the result
context.addToolResult(toolResult)
// Continue the loop with the tool result as new input
input = toolResult
} else {
// No tool needed, return the response
return llmResponse.message
}
}
}
ReAct-Style Reasoning Frameworks
ReAct (Reasoning and Acting) frameworks enhance the basic loop with more explicit reasoning steps. Rather than immediately jumping to actions, the agent follows a more deliberate process:
- Thought: Reason about the current state and goal
- Action: Decide on a specific action to take
- Observation: Observe the result of the action
- Repeat: Continue this cycle until the goal is achieved
The key difference between this and the simple loop is the agent thinks explicitly about each step, making its reasoning more transparent and often leading to better decision-making for complex tasks. This is the architecture often used in research agents, like the Deep Research feature in Gemini and ChatGPT.
I custom-built this for a SaaS client that was spending a lot of time on research for their long-form blog content –
Hierarchical Planning Structures
For more complex workflows, hierarchical planning separates high-level strategy from tactical execution:
- A top-level planner breaks down the overall goal into major steps
- Each step might be further decomposed into smaller tasks
- Execution happens at the lowest level of the hierarchy
- Results flow back up, potentially triggering replanning
This architecture excels at managing complex, multi-stage workflows where different levels of abstraction are helpful. For example, a document processing agent might:
- At the highest level, plan to extract information, verify it, and generate a report
- At the middle level, break “extract information” into steps for each document section
- At the lowest level, execute specific extraction tasks on individual paragraphs
Memory-Augmented Frameworks
Memory-augmented architectures extend basic agents with sophisticated memory systems:
- Before processing input, the agent retrieves relevant information from memory
- The retrieved context enriches the agent’s reasoning
- After completing an action, the agent updates its memory with new information
This approach is particularly valuable for:
- Personalized agents that adapt to individual users over time
- Knowledge-intensive tasks where retrieval of relevant information is critical
- Interactions that benefit from historical context
Multi-Agent Cooperative Systems
Sometimes the most effective approach involves multiple specialized agents working together:
- A coordinator agent breaks down the overall task
- Specialized agents handle different aspects of the workflow
- Results are aggregated and synthesized
- The coordinator determines next steps or delivers final outputs
This architecture works well when different parts of a workflow require substantially different capabilities or tool sets. For example, a customer service system might employ:
- A documentation agent to retrieve relevant resources
- A triage agent to understand initial requests
- A technical support agent for product issues
- A billing specialist for financial matters

If this is your first agent, I suggest starting with the simple loop architecture. I find it helps to sketch out the process, starting with what triggers our agent, what the instructions should be, what tools it has access to, if it needs memory, and what the final output looks like.
I show you how to implement this in my 5-day Challenge.
Core Components of Effective Agents
As I said earlier, every effective agent, regardless of implementation details, consists of four fundamental layers:
1. The Model Layer: The “Brain”
This is the large language models that provide the reasoning and decision-making capabilities. These models:
- Process and understand natural language inputs
- Generate coherent and contextually appropriate responses
- Apply complex reasoning to solve problems
- Make decisions about what actions to take next
Different agents may use different models or even multiple models for different aspects of their workflow. A customer service agent might use a smaller, faster model for initial triage and a more powerful model for complex problem-solving.
2. The Tool Layer: The “Hands”
Tools extend an agent’s capabilities by connecting it to external systems and data sources. These might include:
- Data tools: Database queries, knowledge base searches, document retrieval
- Action tools: Email sending, calendar management, CRM updates
- Orchestration tools: Coordination with other agents or services
Tools are the difference between an agent that can only talk about doing something and one that can actually get things done.
3. The Instruction Layer: The “Rulebook”
Instructions and guardrails define how an agent behaves and the boundaries within which it operates. This includes:
- Task-specific guidelines and procedures
- Ethical constraints and safety measures
- Error handling protocols
- User preference settings
Clear instructions reduce ambiguity and improve agent decision-making, resulting in smoother workflow execution and fewer errors. Without proper instructions, even the most sophisticated model with the best tools will struggle to deliver consistent results.
4. Memory Systems: The “Experience”
Memory is crucial for agents that maintain context over time:
- Short-term memory: Tracking the current state of a conversation or task
- Long-term memory: Recording persistent information about users, past interactions, or domain knowledge
Memory enables agents to learn from experience, avoid repeating mistakes, and provide personalized service based on historical context.
The next few sections covers the strategy behind these components, plus two additional considerations – guardrails, and error handling.
Model Selection Strategy
Not every task requires the most advanced (and expensive) model available. You need to balance capability, cost, and latency requirements for your specific use case.
Capability Assessment
Different models have different strengths. When evaluating models for your agent:
- Start with baseline requirements:
- Understanding complex instructions
- Multi-step reasoning capabilities
- Contextual awareness
- Tool usage proficiency
- Consider specialized capabilities needed:
- Code generation and analysis
- Mathematical reasoning
- Multi-lingual support
- Domain-specific knowledge
- Assess the complexity of your tasks:
- Simple classification or routing might work with smaller models
- Complex decision-making typically requires more advanced models
- Multi-step reasoning benefits from models with stronger planning abilities
For example, a customer service triage agent might effectively use a smaller model to categorize incoming requests, while a coding agent working on complex refactoring tasks would benefit from a more sophisticated model with strong reasoning capabilities and code understanding.
Creating a Performance Baseline
A proven approach is to begin with the most capable model available to establish a performance baseline:
- Start high: Build your initial prototype with the most advanced model
- Define clear metrics: Establish concrete measures of success
- Test thoroughly: Validate performance across a range of typical scenarios
- Document the baseline: Record performance benchmarks for comparison
This baseline represents the upper limit of what’s currently possible and provides a reference point for evaluating tradeoffs with smaller or more specialized models.
Optimization Strategy
Once you’ve established your baseline, you can optimize by testing smaller, faster, or less expensive models:
- Identify candidate models: Select models with progressively lower capability/cost profiles
- Comparative testing: Evaluate each candidate against your benchmark test set
- Analyze performance gaps: Determine where and why performance differs
- Make informed decisions: Choose the simplest model that meets your requirements
This methodical approach helps you find the optimal balance between performance and efficiency without prematurely limiting your agent’s capabilities.
Multi-Model Architecture
For complex workflows, consider using different models for different tasks within the same agent system:
- Smaller, faster models for routine tasks (classification, simple responses)
- Medium-sized models for standard interactions and decisions
- Larger, more capable models for complex reasoning, planning, or specialized tasks
For example, an agent might use a smaller model for initial user intent classification, then invoke a larger model only when it encounters complex requests requiring sophisticated reasoning.
This tiered approach can significantly reduce average costs and latency while maintaining high-quality results for challenging tasks.
My Default Models
I find myself defaulting to a handful of models, at least when starting out, before optimizing the agent:
- Reasoning – OpenAI o3 or Gemini 2.5 Pro
- Data Analysis – Gemini 2.5 Flash
- Image Generation – GPT 4o
- Code Generation – Gemini 2.5 Pro
- Content Generation – Claude 3.7 Sonnet
- Triage – GPT 3.5 Turbo or Gemini 2.0 Flash-Lite (hey I don’t make the names ok)
Every model provider has a Playground where you can test the models. Start there if you’re not sure which one to pick.
Tool Definition Best Practices
Tools extend your agent’s capabilities by connecting it to external systems and data sources. Well-designed tools are clear, reliable, and reusable across multiple agents.
Tool Categories and Planning
When planning your agent’s tool set, consider the three main categories of tools it might need:
- Data Tools: Enable agents to retrieve context and information
- Database queries – Eg: find a user’s profile information
- Document retrieval – Eg: get the latest campaign plan
- Search capabilities – Eg: search through emails
- Knowledge base access – Eg: Find the refund policy
- Action Tools: Allow agents to interact with systems and take actions
- Sending messages – Eg: send a Slack alert
- Updating records – Eg: change the user’s profile
- Creating content – Eg: generate an image
- Managing resources – Eg: give access to some other tool
- Initiating processes – Eg: Trigger another process or automation
- Orchestration Tools: Connect agents to other agents or specialized services
- Expert consultations – Eg: connect to a fine-tuned medical model
- Specialized analysis – Eg: handoff to a reasoning model for data analysis
- Delegated sub-tasks – Eg: Handoff to a content generation agent
A well-rounded agent typically needs tools from multiple categories to handle complex workflows effectively.
Designing Effective Tool Interfaces
Tool design has a significant impact on your agent’s ability to use them correctly. Follow these guidelines:
- Clear naming: Use descriptive, task-oriented names that indicate exactly what the tool does
- Good:
search_customer_records
,update_shipping_address
- Poor:
db_func
,process_op
,handle_data
- Good:
- Comprehensive descriptions: Provide detailed documentation about:
- The tool’s purpose and when to use it
- Required parameters and their formats
- Expected outputs and potential errors
- Limitations or constraints to be aware of
- Focused functionality: Each tool should do one thing and do it well
- Prefer multiple specialized tools over single complex tools
- Maintain a clear separation of concerns
- Simplify parameter requirements for each individual tool
- Consistent patterns: Apply consistent conventions across your tool set
- Standardize parameter naming and formats
- Use similar patterns for related tools
- Maintain consistent error handling and response structures
Here’s an example of a well-defined tool:
@function_tool
def search_customer_orders(customer_id: str, status: Optional[str] = None,
start_date: Optional[str] = None,
end_date: Optional[str] = None) -> List[Order]:
"""
Search for a customer's orders with optional filtering.
Parameters:
- customer_id: The unique identifier for the customer (required)
- status: Optional filter for order status ('pending', 'shipped', 'delivered', 'cancelled')
- start_date: Optional start date for filtering orders (format: YYYY-MM-DD)
- end_date: Optional end date for filtering orders (format: YYYY-MM-DD)
Returns:
A list of order objects matching the criteria, each containing:
- order_id: Unique order identifier
- date: Date the order was placed
- items: List of items in the order
- total: Order total amount
- status: Current order status
Example usage:
search_customer_orders("CUST123", status="shipped")
search_customer_orders("CUST123", start_date="2023-01-01", end_date="2023-01-31")
"""
# Implementation details here
Crafting Effective Instructions
Instructions form the foundation of agent behavior. They define goals, constraints, and expectations, guiding how the agent approaches tasks and makes decisions.
Effective instructions (aka prompt engineering) follow these core principles:
- Clarity over brevity: Be explicit rather than assuming the model will infer your intent
- Structure over freeform: Organize instructions in logical sections with clear headings
- Examples over rules: Demonstrate desired behaviors through concrete examples
- Specificity over generality: Address common edge cases and failure modes directly
All of this is to say, the more precise and detailed you can be with instructions, the better. It’s like creating a SOP for an executive assistant.
In fact, I often start with existing documentation and resources like operating procedures, sales or support scripts, policy documents, and knowledge base articles when creating instructions for agents in business contexts.
I’ll turn them into LLM-friendly instructions with clear actions, decision criteria, and expected outputs.
For example, converting a customer refund policy into agent instructions might look like this:
Original policy: “Refunds may be processed for items returned within 30 days of purchase with a valid receipt. Items showing signs of use may receive partial refunds at manager discretion. Special order items are non-refundable.”
Agent-friendly instructions:
When processing a refund request:
1. Verify return eligibility:
- Check if the return is within 30 days of purchase
- Confirm the customer has a valid receipt
- Determine if the item is a special order (check the "special_order" flag in the order details)
2. Assess item condition:
- If the item is unopened and in original packaging, proceed with full refund
- If the item shows signs of use or opened packaging, classify as "partial refund candidate"
- If the item is damaged beyond normal use, classify as "potential warranty claim"
3. Determine refund amount:
- For eligible returns in new condition: Issue 100% refund of purchase price
- For "partial refund candidates": Issue 75% refund if within 14 days, 50% if 15-30 days
- For special order items: Explain these are non-refundable per policy
- For potential warranty claims: Direct to warranty process
4. Process the refund:
- For amounts under $50: Process automatically
- For amounts $50-$200: Request supervisor review if partial refund
- For amounts over $200: Escalate to manager
You’re not going to get this right on the first shot. Instead, it is an iterative process:
- Start with draft instructions based on existing documentation
- Test with realistic scenarios to identify gaps or unclear areas
- Observe agent behavior and note any deviations from expected actions
- Refine instructions to address observed issues by adding in edge cases or missing information
- Repeat until performance meets requirements
I cover these concepts in my 5-Day AI Agent Challenge. Sign up here.
Memory Systems Implementation
Effective memory implementation is crucial for agents that maintain context over time or learn from experience.
Short-term memory handles the immediate context of the current interaction:
- Conversation history: Recent exchanges between user and agent
- Current task state: The agent’s progress on the active task
- Working information: Temporary data needed for the current interaction
For most agents, this context is maintained within the conversation window, though you may need to implement summarization or pruning strategies as conversations grow longer.
Long-term memory preserves information across sessions:
- User profiles: Preferences, history, and specific needs
- Learned patterns: Recurring issues or successful approaches
- Domain knowledge: Accumulated expertise and background information
Implementation options include:
- Traditional databases for structured information
- Vector stores for semantic retrieval capabilities
- Hybrid approaches combining multiple storage methods
Whatever method you use to store memory, you need a smart retrieval mechanism because you’re going to be adding all that data to the context window of your agent’s core model or tools:
- Relevance filtering: Surface only information pertinent to the current context
- Recency weighting: Prioritize recent information when appropriate
- Semantic search: Find conceptually related information even with different wording
- Hierarchical retrieval: Start with general context and add details as needed
Well-designed retrieval keeps memory useful without overwhelming the agent with irrelevant information or taking up space in the context window.
Privacy and Data Management
Ensuring your agent can’t mishandle data, access the wrong type of data, or reveal data to users is extremely important for obvious reasons. I could write a whole blog post about this.
In most cases, having really good tool design, plus guardrails and safety mechanisms (next section) ensures privacy and data, but here are some things to think about:
- Retention policies: Define how long different types of information should be kept
- Anonymization: Remove identifying details when full identity isn’t needed
- Access controls: Limit who (or what) can access stored information
- User control: Give users visibility into what’s stored and how it’s used
Guardrails and Safety Mechanisms
Even the best-designed agents need guardrails to ensure they operate safely and appropriately. Guardrails are protective mechanisms that define boundaries, prevent harmful actions, and ensure the agent behaves as expected.
A good strategy takes a layered approach, so if one layer fails, others can still prevent potential issues. Start with setting clear boundaries while defining the agent’s instructions in the previous section.
You can then add some input validation to process user requests to the agent and identify if it’s out of scope or potentially harmful (like a jailbreak).
@input_guardrail
def safety_guardrail(ctx, agent, input):
# Check input against safety classifier
safety_result = safety_classifier.classify(input)
if safety_result.is_unsafe:
# Return a predefined response instead of processing the input
return GuardrailFunctionOutput(
output="I'm not able to respond to that type of request.
Is there something else I can help you with?",
tripwire_triggered=True
)
# Input is safe, continue normal processing
return GuardrailFunctionOutput(
tripwire_triggered=False
)
Output guardrails verify the agent’s responses before they reach the user, to flag PII (personally identifiable information) or inappropriate content:
@output_guardrail
def pii_filter_guardrail(ctx, agent, output):
# Check for PII in the output
pii_result = pii_detector.scan(output)
if pii_result.has_pii:
# Redact PII from the output
redacted_output = pii_detector.redact(output)
return GuardrailFunctionOutput(
output=redacted_output,
tripwire_triggered=True
)
# Output is clean
return GuardrailFunctionOutput(
tripwire_triggered=False
)
Also ensure you have guardrails on tool usage, especially if these tools are used to change data, trigger a critical process, or something that requires permissions or approvals.
@output_guardrail
def pii_filter_guardrail(ctx, agent, output):
# Check for PII in the output
pii_result = pii_detector.scan(output)
if pii_result.has_pii:
# Redact PII from the output
redacted_output = pii_detector.redact(output)
return GuardrailFunctionOutput(
output=redacted_output,
tripwire_triggered=True
)
# Output is clean
return GuardrailFunctionOutput(
tripwire_triggered=False
)
Human-in-the-Loop Integration
I always recommend a human-in-the-loop to my clients, especialy for high-risk operations. Here are some ways to build that in:
- Feedback integration: Incorporate human feedback to improve agent behavior
- Approval workflows: Route certain actions for human review before execution
- Sampling for quality: Review a percentage of agent interactions for quality control
- Escalation paths: Define clear processes for when and how to involve humans
Error Handling and Recovery
Even the best agents will encounter errors and unexpected situations. When you test your agent, first identify and isolate where the error is coming from:
- Input errors: Problems with user requests (ambiguity, incompleteness)
- Tool errors: Issues with external systems or services
- Processing errors: Problems in the agent’s reasoning or decision-making
- Resource errors: Timeouts, memory limitations, or quota exhaustion
Based on the error type, the agent can apply appropriate recovery strategies. Ideally, agents should be able to recover from minor errors through self-correction:
- Validation loops: Check results against expectations before proceeding
- Retry strategies: Attempt failed operations again with adjustments
- Alternative approaches: Try different methods when the primary approach fails
- Graceful degradation: Fall back to simpler capabilities when advanced ones fail
For example, if a database query fails, the agent might retry with a more general query, or fall back to cached information. Beyond that, you may want to build out alert systems and escalation paths to human employees, and explain the limitation to the user.
Testing Your Agent
Now that you have all the pieces of the puzzle, it’s time to test the agent.
Testing AI agents fundamentally differs from testing traditional software. While conventional applications follow deterministic paths with predictable outputs, agents exhibit non-deterministic behavior that can vary based on context, inputs, and implementation details.
This leads to challenges that are unique to AI agents, such as hallucinations, bias, prompt injections, inefficient loops, and more.
Unit Testing Components
- Test individual modules independently (models, tools, memory systems, instructions)
- Verify tool functionality, error handling, and edge cases
Example: A financial advisor agent uses a stock price tool. Unit tests would verify the tool returns correct data for valid tickers, handles non-existent tickers gracefully, and manages API failures appropriately, all without involving the full agent.
Integration Testing
- Test end-to-end workflows in simulated environments
- Verify components work together correctly
Example: An e-commerce support agent integration test would validate the complete customer journey, from initial inquiry about a delayed package through tracking lookup, status explanation, and potential resolution options, ensuring all tools and components work together seamlessly.
Security Testing
Security testing probes the agent’s resilience against misuse or manipulation.
- Instruction override attempts: Try to make the agent ignore its guidelines
- Parameter manipulation: Attempt to pass invalid or dangerous parameters to tools
- Context contamination: Try to confuse the agent with misleading context
- Jailbreak testing: Test known techniques for bypassing guardrails
Example: Security testing for a healthcare agent would include attempts to extract patient data through crafted prompts, testing guardrails against medical misinformation, and verifying that sensitive information isn’t retained or leaked.
Hallucination Testing
- Compare responses against verified information
- Check source attribution and citation practices
Example: A financial advisor agent might be tested against questions with known answers about market events, company performance, and financial regulations, verifying accuracy and appropriate expressions of uncertainty for projections or predictions.
Performance and Scalability Testing
Performance testing evaluates how well the agent handles real-world conditions and workloads.
- Response time: Track how quickly the agent processes requests
- Model usage optimization: Track token consumption and model invocations
- Cost per transaction: Calculate average cost to complete typical workflows
These are just a few tests and error types to keep in mind and should be enough for basic agents.
As your agent grows more complex, you’ll need a more comprehensive testing and evaluation framework, which I’ll cover in a later blog post. Sign up to my emails to stay posted.
Deploying, Monitoring, and Improving Your Agent
The final piece is to deploy your agent, see how it performs in the real-world, collect feedback, and improve it over time.
Deploying agents depends heavily on how you build it. No-code platforms like Make, n8n, and Relevance have their own deployment solutions. If you’re coding your own agents, you may want to look into custom hosting and deployment solutions.
I often advise clients to deploy agents alongside the existing process, slowly and gradually. See how it performs in the real-world and continuously improve it. Over time you can phase out the existing process and use the agent instead.
Doing it this way also allows you to evaluate the performance of the agent against current numbers. Does it handle customer support inquiries with a higher NPS score? Do the ads it generates have better CTRs?
Many of these no-code platforms also come with built-in observability, allowing you to monitor your agent and track how it performs. If you’re coding the agent yourself, consider using a framework like OpenAI’s agent SDK, or Google ADK, which comes with built-in tracing.
You also want to collect actual usage feedback, like how often are users interacting with the agent, how happy are they, and so on. You can then use this to further improve the agent through refining the instructions, adding more tools, or updating the memory.
Again, for basic agents, these out-of-the-box solutions are more than enough. If you’re building more complex agents, you’ll need to build out AgentOps to monitor and improve the agent. More on that in a later blog post.
Case Studies
You’re now familiar with all the components that make up an agent, how to put the components together, and how to test, deploy, and evaluate them. Let’s look at some case studies and implementation examples to drive the point home and inspire you.
Customer Service Agent
One of the most widely implemented agent types helps customers resolve issues, answer questions, and navigate services. Successful customer service agents typically include:
- Feedback collection: Gathers user satisfaction data for improvement
- Intent classification system: Quickly categorizes customer inquiries
- Knowledge retrieval system: Accesses relevant policies and information
- User context integration: Incorporates customer history and account information
- Escalation mechanism: Seamlessly transfers to human agents when needed
An eCommerce company I worked with wanted a 24/7 customer support chatbot on their site. We started with narrow use cases like answering FAQs and order information. The chatbot triggered a triage agent which determined whether the query was within our initial use case set or not.
If it was, it had access to knowledge base documents for FAQs and order information based on an order number.
For everything else, it handed it off to a support agent. This allowed the company to dramatically decreases average response times and increase support volume while maintaining their satisfaction scores.

Research Assistant
Research assistant agents help users gather, synthesize, and analyze information from multiple sources. Effective research assistants typically include:
- Search and retrieval capabilities: Access to diverse information sources
- Information verification mechanisms: Cross-checking facts across sources
- Synthesis frameworks: Methods for combining information coherently
- Citation and attribution systems: Tracking information provenance
- User collaboration interfaces: Tools for refining and directing research
A VC firm I worked with wanted to build a due diligence agent for the deals they were looking at. We triggered the agent when a new deal was created in their database. The agent would first identify the company and the market they were in, and then research them and synthesize the information into an investment memo.
This spend up the diligence process from a couple of hours to a couple of minutes.
Content Generation
Content generation agents help create, refine, and manage various forms of content, from marketing materials to technical documentation.
Effective content generation agents typically include:
- Style and tone frameworks: Guidance for appropriate content voice
- Factual knowledge integration: Access to accurate domain information
- Feedback incorporation mechanisms: Methods for refining outputs
- Format adaptation: Generating content appropriate to different channels
A PR agency I worked with wanted an agent to create highly personalized responses to incoming PR requests. When a new request hit their inbox, it triggered an agent to look through their database of client content and find something specific to that pitch.
It then used the agency’s internal guidelines to craft a pitch and respond to the request. This meant the agency could respond within minutes instead of hours, and get ahead of other responses.
A Thought Exercise
Here’s a bit of homework for you to see if you’ve learned something from this. You’re tasked with designing a travel booking agent. Yeah, I know, it’s a cliche example at this point, but it’s also a process that’s well understood by a large audience.
The exercise is to design the agent. A simple flow chart with pen and paper or on a Figjam is usually how I start.
Draw out the full process – what triggers the agent, what data is sent to it, is it a simple loop agent or a hierarchy of agents, what models and instructions will you give them, what tools and memory can they access.
If you can do this and get into the habit of thinking in agents, implementation becomes easy. For visual examples, sign up for my 5-day Agent Challenge.
Putting It All Together
Phew, over 5,000 words later, we’re almost at the end. We’ve covered a lot in this post so let’s recap:
- Start with clear goals: Define exactly what your agent should accomplish and for whom
- Select appropriate models: Choose models that balance capability, cost, and latency
- Define your tool set: Implement and document the tools your agent needs
- Create clear instructions: Develop comprehensive guidance for agent behavior
- Implement layered guardrails: Build in appropriate safety mechanisms
- Design error handling: Plan for failures and define recovery strategies
- Add memory as needed: Implement context management appropriate to your use case, and external memory
- Test thoroughly: Validate performance across a range of scenarios
- Deploy incrementally: Roll out capabilities gradually to manage risk
- Monitor and improve: Collect data on real-world performance to drive improvements
Next Steps
There’s only one next step. Go build an agent. Start with something small and low-risk. One of my first agents was a content research agent, fully coded in Python. You can vibe code it if you’re not good at coding.
If you want to use a framework, I suggest either OpenAI’s SDK or Google’s ADK, which I have in-depth guides on.
And if you don’t want to touch code, there are some really good no-code platforms like Make, n8n, and Relevance. Sign up for my free email series below where I walk you through building an Agent in 5 Days with these tools.