I have been a heavy user of Claude Code since it came out (and recently Amp Code). As someone who builds agents for a living, I’ve always wondered what makes it so good.
So I decided to try and reverse engineer it.
It turns out building a coding agent is surprisingly straightforward once you understand the core concepts. You don’t need a PhD in machine learning or years of AI research experience. You don’t even need an agent framework.
Over the course of this tutorial, we’re going to build a baby Claude Code using nothing but Python. It won’t be nearly as good as the real thing, but you will have a real, working agent that can:
- Read and understand codebases
- Execute code safely in a sandboxed environment
- Iterate on solutions based on test results and error feedback
- Handle multi-step coding tasks
- Debug itself when things go wrong
So grab your favorite terminal, fire up your Python environment, and let’s build something awesome.
Understanding Coding Agents: Core Concepts
Before we dive into implementation details, let’s take a step back and define what a “coding agent” actually is.
An agent is a system that perceives its environment, makes decisions based on those perceptions, and takes actions to achieve goals.
In our case, the environment is a codebase, the perceptions come from reading files and executing code, and the actions are things like creating files, running tests, or modifying existing code.
What makes coding agents particularly interesting is that they operate in a domain that’s already highly structured and rule-based. Code either works or it doesn’t. Tests pass or fail. Syntax is valid or invalid. This binary feedback creates excellent training signals for iterative improvement.
The ReAct Pattern: How Agents Actually Think
Most agents today follow a pattern called ReAct (short for Reason + Act, with an Observe step closing the loop). Here’s how it works in practice:
Reason: The agent analyzes the current situation and plans its next step. “I need to understand this codebase. Let me start by looking at the main entry point and understanding the project structure.”
Act: The agent takes a concrete action based on its reasoning. It might read a file, execute a command, or write some code.
Observe: The agent examines the results of its action and incorporates that feedback into its understanding.
Then the cycle repeats. Reason → Act → Observe → Reason → Act → Observe.
It’s similar to how humans solve problems. When you’re debugging a complex issue, you don’t just stare at the code hoping for divine inspiration. You form a hypothesis (reason), test it by adding a print statement or running a specific test (act), look at the results (observe), and then refine your understanding based on what you learned.
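To make the cycle concrete, here’s a toy, dependency-free sketch of the loop. The stub functions are purely illustrative; the real loop, driven by Claude and actual tools, is what we build in Phase 1.

# A toy Reason → Act → Observe loop with hypothetical stubs
def reason(history: list) -> str:
    """Decide the next step based on what we've observed so far."""
    return "read_file" if not history else "done"

def act(decision: str) -> str:
    """Carry out the chosen action and return what we observed."""
    return "contents of main.py" if decision == "read_file" else ""

history: list = []
while True:
    decision = reason(history)        # Reason
    if decision == "done":
        break
    observation = act(decision)       # Act
    history.append(observation)       # Observe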
The Four Pillars of Our Coding Agent
Every effective AI agent needs four core components: the brain, the tools, the instructions, and the memory (or context).
I’ll skim over the details here but I’ve explained more in my guide to designing AI agents.
- The brain is the core LLM that does the reasoning and code gen. Reasoning models like Claude Sonnet, Gemini 2.5 Pro, and OpenAI’s o-series or GPT-5 are recommended. In this tutorial we use Claude Sonnet.
- The instructions are the core system prompt you give to the LLM when you initialize it. Read about prompt engineering to learn more.
- The tools are the concrete actions your agent can take in the world. Reading files, writing code, executing commands, running tests – basically anything a human developer can do through their keyboard.
- Memory is the data your agent works with. For coding agents, we need a context management system that allows your agent to work with large codebases by intelligently selecting the most relevant information for each task.
For coding agents specifically, I’d add that we need an execution sandbox. Your agent will be writing and executing code, potentially on your production machine. Without proper sandboxing, you’re essentially giving a very enthusiastic and tireless intern root access to your system.
The Agent Architecture We’re Building
I want to show you the complete blueprint before we start coding, because understanding the overall architecture will make every individual component make sense as we implement it.
Here’s our roadmap:
Phase 1: Minimal Viable Agent – Get the core ReAct loop working with basic file operations. By the end of this phase, you’ll have an agent that can read files, understand simple tasks, and reason through solutions step by step.
Phase 2: Safe Code Execution Engine – Add the ability to generate and execute code safely. This is where we implement AST-based validation and process sandboxing. Your agent will be able to write Python code, test it, and iterate based on the results.
Phase 3: Context Management for Large Codebases – Scale beyond toy examples to real projects. We’ll implement search and intelligent context retrieval so your agent can work with codebases containing hundreds of files.
Each phase builds on the previous one, and you’ll have working software at every step.
Phase 1: Minimum Viable Agent
We’re going to do this all in one file and about 300 lines of code. Just create a folder on your computer and, inside it, create a file called agent.py.
Step 1: Define the Brain
Everything goes into one big CodingAgent class. We’re going to initialize an Anthropic client and also set our working directory:
# The imports at the top of agent.py
import asyncio
import json
import os
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
import anthropic

class CodingAgent:
    def __init__(self,
                 api_key: str,
                 working_directory: str = ".",
                 history_file: str = "agent_history.json"):
        self.client = anthropic.Anthropic(api_key=api_key)
        self.working_directory = Path(working_directory).resolve()
        self.history_file = history_file
        self.messages: List[Dict] = []
        self.load_history()
You’ll notice some references to ‘history’ in there. That’s our primitive memory and context management. I’ll come to it later.
Let’s use Sonnet 4 as our main model. It’s a solid reasoning model and really good at coding.
async def _call_claude(self, messages: List[Dict]) -> Tuple[Any, Optional[str]]:
    try:
        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4000,
            system=SYSTEM_PROMPT,
            tools=TOOLS_SCHEMA,
            messages=messages,
            temperature=0.7
        )
        return response.content, None
    except anthropic.APIError as e:
        return None, f"API Error: {str(e)}"
    except Exception as e:
        return None, f"Unexpected error calling Claude API: {str(e)}"
And that’s really it. This is boilerplate code for calling a Claude model. Gemini, GPT, and others are different, but as long as you’re using a reasoning model you’re good.
Step 2: Give it Instructions
In our _call_claude method, you may have noticed we’re passing in a system prompt and a tools schema. These are the instructions we give to our model so that it knows how to behave and what tools it has access to.
Here’s my system prompt, feel free to tweak it as needed:
SYSTEM_PROMPT = """You are a helpful coding agent that assists with programming tasks and file operations.
When responding to requests:
1. Analyze what the user needs
2. Use the minimum number of tools necessary to accomplish the task
3. After using tools, provide a concise summary of what was done
IMPORTANT: Once you've completed the requested task, STOP and provide your final response. Do not continue creating additional files or performing extra actions unless specifically asked.
Examples of good behavior:
- User: "Create a file that adds numbers" → Create ONE file, then summarize
- User: "Create files for add and subtract" → Create ONLY those two files, then summarize
- User: "Create math operation files" → Ask for clarification on which operations, or create a reasonable set and stop
After receiving tool results:
- If the task is complete, provide a final summary
- Only continue with more tools if the original request is not yet fulfilled
- Do not interpret successful tool execution as a request to do more
Be concise and efficient. Complete the requested task and stop."""
Current-gen models have native tool-use ability; you just need to send a schema up front so that when the model is reasoning it can look at the tool list and decide if it needs one to help with its task.
We define it like this:
TOOLS_SCHEMA = [
    {
        "name": "read_file",
        "description": "Read the contents of a file",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "The path to the file to read"}
            },
            "required": ["path"]
        }
    },
    # Other tool definitions follow a similar pattern
]
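For instance, a write_file entry could look like this (a hypothetical sketch that mirrors the pattern above; check the repo for the exact definitions):

# Illustrative only: the field values are my guesses, the structure is what matters
WRITE_FILE_TOOL = {
    "name": "write_file",
    "description": "Write content to a file, creating it if it doesn't exist",
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "The path to the file to write"},
            "content": {"type": "string", "description": "The content to write"}
        },
        "required": ["path", "content"]
    }
}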
Step 3: Define the Tool Logic
Let’s also define our actual tool logic. Here’s what it would look like for the Read File tool:
async def _read_file(self, path: str) -> Dict[str, Any]:
    """Read a file and return its contents"""
    try:
        file_path = (self.working_directory / path).resolve()
        # Path.is_relative_to (Python 3.9+) avoids the classic startswith()
        # pitfall, e.g. /home/me/project vs /home/me/project2
        if not file_path.is_relative_to(self.working_directory):
            return {"error": "Access denied: path outside working directory"}
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()
        return {"success": True, "content": content, "path": str(file_path)}
    except Exception as e:
        return {"error": f"Could not read file: {str(e)}"}
Continue defining the rest of the tools that way and add them to the tools schema. You can look at the full code in my GitHub Repository for help.
I have implemented read, write, list, and search but you can add more for an extra challenge.
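As one example, here’s a minimal _write_file along the same lines. This is a sketch (again assuming Python 3.9+ for Path.is_relative_to), not necessarily identical to the repo version:

async def _write_file(self, path: str, content: str) -> Dict[str, Any]:
    """Write content to a file inside the working directory"""
    try:
        file_path = (self.working_directory / path).resolve()
        if not file_path.is_relative_to(self.working_directory):
            return {"error": "Access denied: path outside working directory"}
        # Create parent directories so the agent can scaffold new packages
        file_path.parent.mkdir(parents=True, exist_ok=True)
        with open(file_path, 'w', encoding='utf-8') as f:
            f.write(content)
        return {"success": True, "path": str(file_path)}
    except Exception as e:
        return {"error": f"Could not write file: {str(e)}"}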
We’ll also need a function to execute the tool that we call if our LLM responds with a tool use request.
async def _execute_tool_calls(self, tool_uses: List[Any]) -> List[Dict]:
    tool_results = []
    for tool_use in tool_uses:
        print(f"Executing: {tool_use.name}")
        try:
            if tool_use.name == "read_file":
                result = await self._read_file(tool_use.input.get("path", ""))
            elif tool_use.name == "write_file":
                result = await self._write_file(tool_use.input.get("path", ""),
                                                tool_use.input.get("content", ""))
            elif tool_use.name == "list_files":
                result = await self._list_files(tool_use.input.get("path", "."))
            elif tool_use.name == "search_files":
                result = await self._search_files(tool_use.input.get("pattern", ""),
                                                  tool_use.input.get("path", "."))
            else:
                result = {"error": f"Unknown tool: {tool_use.name}"}
        except Exception as e:
            result = {"error": f"Tool execution failed: {str(e)}"}
        # Log success/error briefly
        if "success" in result and result["success"]:
            print("Tool executed successfully")
        elif "error" in result:
            print(f"Error: {result['error']}")
        # Collect result for the API
        tool_results.append({
            "tool_use_id": tool_use.id,
            "content": json.dumps(result)
        })
    return tool_results
It’s a bit verbose but good enough for our MVP. And now our Brain is connected with Tools!
Step 4: Context Management and Memory
Remember the references to ‘history’ from earlier? That’s a crude implementation of memory. We basically write our conversation to a history file. Every time we start up our agent, it reads that file and loads the full conversation. We can clear the file and start a fresh conversation.
def save_history(self):
    """Save conversation history"""
    try:
        with open(self.history_file, 'w') as f:
            json.dump(self.messages, f, indent=2)
    except Exception as e:
        print(f"Warning: Could not save history: {e}")

def load_history(self):
    """Load conversation history"""
    try:
        if os.path.exists(self.history_file):
            with open(self.history_file, 'r') as f:
                self.messages = json.load(f)
    except Exception:
        self.messages = []
Let’s also define some functions to help with context management. Right now we’re just going to track the conversation history and build a messages list.
def add_message(self, role: str, content: str):
    """Add a message to conversation history"""
    self.messages.append({"role": role, "content": content})
    self.save_history()

def build_messages_list(self, user_input: Optional[str] = None,
                        tool_results: Optional[List[Dict]] = None,
                        assistant_content: Optional[Any] = None,
                        max_history: int = 20) -> List[Dict]:
    """Build a clean messages list for the API call"""
    messages = []
    # Add conversation history (limited to recent messages for the context window)
    start_idx = max(0, len(self.messages) - max_history)
    for msg in self.messages[start_idx:]:
        if isinstance(msg, dict) and "role" in msg and "content" in msg:
            # Clean the message for API compatibility
            clean_msg = {"role": msg["role"], "content": msg["content"]}
            messages.append(clean_msg)
    # Add new user input if provided
    if user_input:
        messages.append({"role": "user", "content": user_input})
    # Add assistant content if provided (for tool use continuation)
    if assistant_content:
        messages.append({"role": "assistant", "content": assistant_content})
    # Add tool results as a user message if provided
    if tool_results:
        messages.append({
            "role": "user",
            "content": [
                {
                    "type": "tool_result",
                    "tool_use_id": tr["tool_use_id"],
                    "content": tr["content"]
                }
                for tr in tool_results
            ]
        })
    return messages
And those are the core components of our coding agent!
Step 5: Build the ReAct Loop
Finally, we need a function to guide our model to follow the ReAct pattern.
async def react_loop(self, user_input: str) -> str:
    # Add user message to history
    self.add_message("user", user_input)
    # Build the initial messages list (the user message is already in
    # history, so we don't pass it again or it would be duplicated)
    messages = self.build_messages_list()
    # Track the last text response to avoid duplication
    last_complete_response = None
    # Safety limit to prevent infinite loops
    safety_limit = 20
    iterations = 0
    while iterations < safety_limit:
        iterations += 1
        # Get Claude's response
        content_blocks, error = await self._call_claude(messages)
        if error:
            error_msg = f"Error: {error}"
            self.add_message("assistant", error_msg)
            return error_msg
        # Parse response into text and tool uses
        text_responses, tool_uses = self._parse_claude_response(content_blocks)
        # Store the last complete text response
        if text_responses:
            last_complete_response = "\n".join(text_responses)
        # If no tools were used, Claude is done - return final response
        if not tool_uses:
            break
        # Execute tools and collect results
        tool_results = await self._execute_tool_calls(tool_uses)
        # Build messages for the next iteration
        messages = self.build_messages_list(
            assistant_content=content_blocks,
            tool_results=tool_results
        )
    # Prepare final response
    if not last_complete_response:
        final_response = "I couldn't generate a response."
    elif iterations >= safety_limit:
        final_response = f"{last_complete_response}\n\n(Note: I reached my processing limit. You may want to break this down into smaller steps.)"
    else:
        final_response = last_complete_response
    # Save to history and return
    self.add_message("assistant", final_response)
    return final_response

async def process_message(self, user_input: str) -> str:
    """Main entry point for processing user messages"""
    try:
        # Use the ReAct loop to process the message
        response = await self.react_loop(user_input)
        return response
    except Exception as e:
        error_msg = f"Unexpected error processing message: {str(e)}"
        self.add_message("assistant", error_msg)
        return error_msg
Yes, it really is just a while loop. We call Claude with our request and it answers. If it needs to use a tool, we process the tool (as defined before) and then send back the tool result.
And then we loop. We’ve set a safety limit of 20 turns to avoid infinite loops (and to stop you from racking up those API calls).
When there are no more tool calls, we assume it’s done and print the final response.
We’re also parsing Claude’s responses for readability so that we can print them to our terminal and see what’s happening.
def _parse_claude_response(self, content_blocks: Any) -> Tuple[List[str], List[Any]]:
    text_responses = []
    tool_uses = []
    for block in content_blocks:
        if block.type == "text":
            text_responses.append(block.text)
            print(f"{block.text}")
        elif block.type == "tool_use":
            tool_uses.append(block)
            print(f"Tool call: {block.name}")
    return text_responses, tool_uses
Let’s Test it out!
Our agent is ready to use. We’re at about 400 lines of code, but that includes comments, error handling, and helper functions; the core agent logic is ~300 lines. Let’s see if it’s any good!
Let’s add a main function to our code so that we can get that CLI interface:
async def main():
    """Main CLI interface"""
    print("Welcome to Baby Claude Code!!")
    print("Type 'exit' or 'quit' to quit, 'clear' to clear history, 'history' to show recent messages")
    print("-" * 50)
    # Get API key
    api_key = os.getenv("ANTHROPIC_API_KEY")
    if not api_key:
        api_key = input("Enter your Anthropic API key: ").strip()
    # Initialize agent
    agent = CodingAgent(api_key)
    while True:
        try:
            user_input = input("\nYou: ").strip()
            if user_input.lower() in ['exit', 'quit']:
                print("Goodbye!")
                break
            elif user_input.lower() == 'clear':
                agent.messages = []
                agent.save_history()
                print("History cleared!")
                continue
            elif user_input.lower() == 'history':
                print("\nRecent conversation history:")
                for msg in agent.messages[-10:]:
                    role = msg.get("role", "unknown")
                    content = str(msg.get("content", ""))
                    if len(content) > 100:
                        content = content[:100] + "..."
                    print(f"[{role}] {content}")
                continue
            elif not user_input:
                continue
            print("\nAgent processing...")
            # Responses are printed as they arrive by _parse_claude_response
            await agent.process_message(user_input)
        except KeyboardInterrupt:
            print("\n\nGoodbye!")
            break
        except Exception as e:
            print(f"\nError: {e}")

if __name__ == "__main__":
    asyncio.run(main())
Now run the file and watch your own baby Claude Code come to life!
Understanding the Code Flow
If you’ve been following along, you should have a working coding agent. It’s basic but it gets the job done.
We first pass your task to the react_loop method, which compiles a conversation history and calls Claude.
Based on our system prompt and tool schema, Claude decides if it needs to use a tool to answer our request. If so, it sends back a tool request which we execute. We add the results to our message history and send it back to Claude, and loop over.
We keep doing this until there are no more tool calls, in which case we assume Claude has nothing else to do and we return the final answer.

Et voila! We have a functioning coding agent that can explain codebases, write new code, and keep track of a conversation.
Pretty sweet.
I’ve added all the code to my GitHub. Enter your email below to receive it.
Want to build your own AI agents?
Sign up for my newsletter covering everything from the tools, APIs, and frameworks you need, to building and serving your own multi-step AI agents.
Phase 2: Adding Code Execution
We have a coding agent that can read and write code, but in this age of vibe coding, we want it to be able to test and execute code as well. Those bugs ain’t gonna debug themselves.
All we need to do is give it new tools to execute code. The main complexity is ensuring it doesn’t run malicious code or delete our OS by mistake. That’s why this phase is mostly about code validation and sandboxing. Let’s see how.
Step 1: Code Refactoring
Before we do anything, let’s refactor our existing code for better readability and modularity.
Here’s our new project structure:
coding_agent/
├── __init__.py # Package initialization
├── config.py # Central configuration
├── agent.py # Main CodingAgent class
├── tools/
│ ├── __init__.py
│ ├── base.py # Tool interface & registry
│ ├── file_ops.py # File operation tools
│ └── code_exec.py # Code execution tools
├── execution/
│ ├── __init__.py
│ ├── validator.py # AST-based validator
│ └── executor.py # Sandboxed executor
└── cli.py # CLI interface
Most of the code is pretty much the same. config.py has our model configuration parameters and system prompt. cli.py is the main CLI interface that we added right at the end of Phase 1.
agent.py is the core agent class sans the tools setup, which now lives in the tools/ folder. We have a base tool template in base.py, then define the read, write, and search file tools in file_ops.py.
The new code is code_exec.py, which contains the metadata for the executor and validator tools; the actual implementations of those tools live in the execution/ folder for readability.
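I won’t paste the whole refactor, but to give you an idea, the tool interface and registry in base.py might look something like this (my reconstruction, with illustrative names; see the repo for the real version):

from abc import ABC, abstractmethod
from typing import Any, Dict

class Tool(ABC):
    """Base template that every tool implements."""
    name: str
    description: str
    input_schema: Dict[str, Any]

    @abstractmethod
    async def run(self, **kwargs: Any) -> Dict[str, Any]:
        """Execute the tool and return a JSON-serializable result."""

class ToolRegistry:
    """Maps tool names to implementations so the agent can dispatch calls."""
    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    async def execute(self, name: str, **kwargs: Any) -> Dict[str, Any]:
        if name not in self._tools:
            return {"error": f"Unknown tool: {name}"}
        return await self._tools[name].run(**kwargs)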
Step 2: The Validator
The CodeValidator uses Python’s Abstract Syntax Tree (AST) to analyze code before it runs. Think of it as a security guard that inspects code at the gate.
class CodeValidator:
    def validate(self, code: str) -> Tuple[bool, List[str]]:
        self.violations: List[str] = []
        # Parse code into an AST (reject code that doesn't even parse)
        try:
            tree = ast.parse(code)
        except SyntaxError as e:
            return False, [f"Syntax error: {e}"]
        # Walk the tree looking for dangerous patterns
        self._check_node(tree)
        # Return validation result
        return len(self.violations) == 0, self.violations
What the Validator Blocks:
1. Dangerous Imports
import os # BLOCKED - could delete files
import subprocess # BLOCKED - could run shell commands
import socket # BLOCKED - could make network connections
2. File Operations
open('file.txt', 'w') # BLOCKED - could overwrite files
with open('/etc/passwd', 'r'): # BLOCKED - could read sensitive files
3. Dangerous Built-in Functions
eval("malicious_code") # BLOCKED - arbitrary code execution
exec("import os; os.system('rm -rf /')") # BLOCKED
__import__('os') # BLOCKED - dynamic imports
4. System Access Attempts
sys.exit() # BLOCKED - could crash the program
os.environ['SECRET_KEY'] # BLOCKED - environment access
The validator works by walking the AST and checking each node type:
- ast.Import and ast.ImportFrom nodes → check against dangerous modules
- ast.Call nodes → check for dangerous function calls
- ast.Attribute nodes → check for dangerous attribute access
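Here’s a minimal sketch of that walk. The denylists are abbreviated for illustration; the full lists live in the repo:

import ast

# Abbreviated denylists (illustrative; the real ones are longer)
DANGEROUS_MODULES = {"os", "subprocess", "socket", "sys", "shutil"}
DANGEROUS_CALLS = {"eval", "exec", "open", "__import__", "compile"}

def _check_node(self, tree: ast.AST) -> None:
    for node in ast.walk(tree):
        # Imports: flag dangerous modules
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in DANGEROUS_MODULES:
                    self.violations.append(f"Dangerous import: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in DANGEROUS_MODULES:
                self.violations.append(f"Dangerous import: {node.module}")
        # Calls: flag eval/exec/open/__import__ and friends
        elif isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name) and node.func.id in DANGEROUS_CALLS:
                self.violations.append(f"Dangerous call: {node.func.id}()")
        # Attribute access: flag things like os.environ or sys.exit
        elif isinstance(node, ast.Attribute):
            if isinstance(node.value, ast.Name) and node.value.id in DANGEROUS_MODULES:
                self.violations.append(f"Dangerous attribute access: {node.value.id}.{node.attr}")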
Most coding agents don’t actually block all of this. They have a permissioning system to give their users control. I’m just being overly cautious for the sake of this tutorial.
Step 3: The Executor
Even if code passes validation, we still need runtime protection. Again, I’m being overly cautious here and creating a custom Python environment with only certain built-in functions:
# User code runs with ONLY these functions available
safe_builtins = {
    'print': print,   # Safe for output
    'len': len,       # Safe for measurement
    'range': range,   # Safe for iteration
    'int': int,       # Safe type conversion
    # ... other safe functions
    # Notably missing:
    # - open (no file access)
    # - __import__ (no imports)
    # - eval/exec (no dynamic execution)
    # - input (no user interaction)
}
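How does the allowlist actually get applied? Here’s a simplified sketch of one way to wire it up (the repo’s executor may differ): exec the user’s code with __builtins__ swapped for the allowlist.

def run_restricted(code: str) -> None:
    # User code sees ONLY the names in safe_builtins (defined above)
    restricted_globals = {"__builtins__": safe_builtins}
    exec(code, restricted_globals)

run_restricted("print(len(range(5)))")    # works: prints 5
try:
    run_restricted("open('x.txt', 'w')")  # open isn't in the allowlist
except NameError as e:
    print(f"Blocked: {e}")

Note that swapping __builtins__ on its own is famously escapable, which is exactly why the AST validator and the subprocess isolation below exist as additional layers.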
And then when we do run code, it’s in a separate sub-process:
process = await asyncio.create_subprocess_exec(
    sys.executable, code_file,
    stdout=asyncio.subprocess.PIPE,
    stderr=asyncio.subprocess.PIPE,
    cwd=str(self.sandbox_dir)  # Isolated directory
)
This gives us:
- Memory isolation: Can’t access parent process memory
- Crash protection: If code crashes, main program continues
- Clean termination: Can kill runaway processes
- Output capture: All output is captured and controlled
The executor also sets strict resource limits at the OS level:
# CPU time limit (prevents infinite loops)
resource.setrlimit(resource.RLIMIT_CPU, (5, 5))
# Memory limit (prevents memory bombs)
resource.setrlimit(resource.RLIMIT_AS, (100_000_000, 100_000_000)) # 100MB
# No core dumps (prevents disk filling)
resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
# Process limit (prevents fork bombs)
resource.setrlimit(resource.RLIMIT_NPROC, (1, 1))
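One detail worth flagging: these setrlimit calls have to run inside the child process, not the parent. A common way to arrange that on POSIX systems (an assumption about the wiring; the repo may do it differently) is subprocess’s preexec_fn hook, which runs in the child between fork and exec:

import resource

def _set_limits() -> None:
    # Executed in the child process, so the limits don't touch the parent
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))
    resource.setrlimit(resource.RLIMIT_AS, (100_000_000, 100_000_000))
    resource.setrlimit(resource.RLIMIT_CORE, (0, 0))

process = await asyncio.create_subprocess_exec(
    sys.executable, code_file,
    stdout=asyncio.subprocess.PIPE,
    stderr=asyncio.subprocess.PIPE,
    cwd=str(self.sandbox_dir),
    preexec_fn=_set_limits,  # apply limits to the child only
)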
As a final safeguard, all code execution has a timeout:
try:
    stdout, stderr = await asyncio.wait_for(
        process.communicate(),
        timeout=10  # 10-second maximum
    )
except asyncio.TimeoutError:
    process.kill()  # Force terminate
    return {"error": "Execution timed out"}
There’s a bit more code around creating the sandbox environment to execute code but we’re almost at 5,000 words and my WordPress backend is getting sluggish, so I’m not going to paste it all here. You can get it from my Github.
You’ll also want to update the tool schema with the new tools and describe how Claude should use them in the system prompt.
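For example, a schema entry for the execution tool might look like this (the name execute_code and its parameters are my guesses; align them with your implementation):

EXECUTE_CODE_TOOL = {
    "name": "execute_code",
    "description": "Validate Python code, then run it in a sandboxed subprocess and return stdout/stderr",
    "input_schema": {
        "type": "object",
        "properties": {
            "code": {"type": "string", "description": "The Python code to execute"}
        },
        "required": ["code"]
    }
}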
How It All Works Together
Adding code execution transforms our agent from a simple file manipulator into a true coding assistant that can:
- Learn from execution results to improve its suggestions
- Write and immediately test solutions
- Debug by seeing actual error messages
- Iterate on solutions that don’t work
- Validate that code produces expected output
Here’s the complete flow when the agent executes code:
User Request: "Test this fibonacci function"
↓
1. Agent calls execute_code tool
↓
2. CodeValidator.validate(code)
├─ Parse to AST
├─ Check for dangerous imports ✓
├─ Check for dangerous functions ✓
└─ Check for file operations ✓
↓
3. CodeExecutor.execute(code)
├─ Create sandboxed code file
├─ Apply restricted builtins
├─ Set resource limits
├─ Run in subprocess
├─ Monitor with timeout
└─ Capture output safely
↓
4. Return results to agent
├─ stdout: "Fibonacci(10) = 55"
├─ stderr: ""
└─ success: true
And that’s Phase 2! If you’ve been implementing along with me, you should be getting results like this:
[Screenshot: sample run of the agent validating and executing code]
Phase 3: Better Context Management
Phases 1 and 2 gave our agent powerful capabilities: it can manipulate files and safely execute code. But try asking it to “refactor the authentication system” in a real project with 500 files, and it hits a wall. The agent doesn’t know:
- What files are relevant to authentication
- How components connect across the codebase
- Which functions call which others
- What context it needs to make safe changes
This is the fundamental challenge of AI coding assistants: context. LLMs have a limited context window, and even if we could fit an entire codebase, indiscriminately dumping hundreds of files would be wasteful and confusing. The agent would spend most of its reasoning power just figuring out what’s relevant.
I’m going to pause here for now and come back to this section later. Meanwhile, read my guide on Context Engineering to understand the concepts behind this. And sign up below for when I complete Phase 3!
Want to build your own AI agents?
Sign up for my newsletter covering everything from the tools, APIs, and frameworks you need, to building and serving your own multi-step AI agents.