I have been a heavy user of Claude Code since it came out (and recently Amp Code). As someone who builds agents for a living, I’ve always wondered what makes it so good.
So I decided to try and reverse engineer it.
It turns out building a coding agent is surprisingly straightforward once you understand the core concepts. You don’t need a PhD in machine learning or years of AI research experience. You don’t even need an agent framework.
Over the course of this tutorial, we’re going to build a baby Claude Code using nothing but Python. It won’t be nearly as good as the real thing, but you will have a real, working agent that can:
- Read and understand codebases
- Execute code safely in a sandboxed environment
- Iterate on solutions based on test results and error feedback
- Handle multi-step coding tasks
- Debug itself when things go wrong
So grab your favorite terminal, fire up your Python environment, and let’s build something awesome.
Understanding Coding Agents: Core Concepts
Before we dive into implementation details, let’s take a step back and define what a “coding agent” actually is.
An agent is a system that perceives its environment, makes decisions based on those perceptions, and takes actions to achieve goals.
In our case, the environment is a codebase, the perceptions come from reading files and executing code, and the actions are things like creating files, running tests, or modifying existing code.
What makes coding agents particularly interesting is that they operate in a domain that’s already highly structured and rule-based. Code either works or it doesn’t. Tests pass or fail. Syntax is valid or invalid. This binary feedback creates excellent training signals for iterative improvement.
The ReAct Pattern: How Agents Actually Think
Most agents today follow a pattern called ReAct (short for Reason + Act, with an Observe step closing the loop). Here’s how it works in practice:
Reason: The agent analyzes the current situation and plans its next step. “I need to understand this codebase. Let me start by looking at the main entry point and understanding the project structure.”
Act: The agent takes a concrete action based on its reasoning. It might read a file, execute a command, or write some code.
Observe: The agent examines the results of its action and incorporates that feedback into its understanding.
Then the cycle repeats. Reason → Act → Observe → Reason → Act → Observe.
It’s similar to how humans solve problems. When you’re debugging a complex issue, you don’t just stare at the code hoping for divine inspiration. You form a hypothesis (reason), test it by adding a print statement or running a specific test (act), look at the results (observe), and then refine your understanding based on what you learned.
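To make the cycle concrete, here’s a toy, dependency-free sketch of the loop. The stub functions are purely illustrative; the real loop, driven by Claude and actual tools, is what we build in Phase 1.

# A toy Reason → Act → Observe loop with hypothetical stubs
def reason(history: list) -> str:
    """Decide the next step based on what we've observed so far."""
    return "read_file" if not history else "done"

def act(decision: str) -> str:
    """Carry out the chosen action and return what we observed."""
    return "contents of main.py" if decision == "read_file" else ""

history: list = []
while True:
    decision = reason(history)        # Reason
    if decision == "done":
        break
    observation = act(decision)       # Act
    history.append(observation)       # Observe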
The Four Pillars of Our Coding Agent
Every effective AI agent needs four core components: the brain, the tools, the instructions, and the memory (or context).
I’ll skim over the details here but I’ve explained more in my guide to designing AI agents.
- The brain is the core LLM that does the reasoning and code gen. Reasoning models like Claude Sonnet, Gemini 2.5 Pro, and OpenAI’s o-series or GPT-5 are recommended. In this tutorial we use Claude Sonnet.
- The instructions are the core system prompt you give to the LLM when you initialize it. Read about prompt engineering to learn more.
- The tools are the concrete actions your agent can take in the world. Reading files, writing code, executing commands, running tests – basically anything a human developer can do through their keyboard.
- Memory is the data your agent works with. For coding agents, we need a context management system that allows your agent to work with large codebases by intelligently selecting the most relevant information for each task.
For coding agents specifically, I’d add that we need an execution sandbox. Your agent will be writing and executing code, potentially on your production machine. Without proper sandboxing, you’re essentially giving a very enthusiastic and tireless intern root access to your system.
The Agent Architecture We’re Building
I want to show you the complete blueprint before we start coding, because understanding the overall architecture will make every individual component make sense as we implement it.
Here’s our roadmap:
Phase 1: Minimal Viable Agent – Get the core ReAct loop working with basic file operations. By the end of this phase, you’ll have an agent that can read files, understand simple tasks, and reason through solutions step by step.
Phase 2: Safe Code Execution Engine – Add the ability to generate and execute code safely. This is where we implement AST-based validation and process sandboxing. Your agent will be able to write Python code, test it, and iterate based on the results.
Phase 3: Context Management for Large Codebases – Scale beyond toy examples to real projects. We’ll implement search and intelligent context retrieval so your agent can work with codebases containing hundreds of files.
Each phase builds on the previous one, and you’ll have working software at every step.
Phase 1: Minimum Viable Agent
We’re going to do this all in one file and about 300 lines of code. Just create a folder on your computer and, inside it, create a file called agent.py.
Step 1: Define the Brain
Everything goes into one big CodingAgent class. We’re going to initialize an Anthropic client and also set our working directory:
# The imports at the top of agent.py
import asyncio
import json
import os
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
import anthropic

class CodingAgent:
    def __init__(self,
                 api_key: str,
                 working_directory: str = ".",
                 history_file: str = "agent_history.json"):
        self.client = anthropic.Anthropic(api_key=api_key)
        self.working_directory = Path(working_directory).resolve()
        self.history_file = history_file
        self.messages: List[Dict] = []
        self.load_history()
You’ll notice some references to ‘history’ in there. That’s our primitive memory and context management. I’ll come to it later.
Let’s use Sonnet 4 as our main model. It’s a solid reasoning model and really good at coding.
async def _call_claude(self, messages: List[Dict]) -> Tuple[Any, Optional[str]]:
    try:
        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4000,
            system=SYSTEM_PROMPT,
            tools=TOOLS_SCHEMA,
            messages=messages,
            temperature=0.7
        )
        return response.content, None
    except anthropic.APIError as e:
        return None, f"API Error: {str(e)}"
    except Exception as e:
        return None, f"Unexpected error calling Claude API: {str(e)}"
And that’s really it. This is boilerplate code for calling a Claude model. Gemini, GPT, and others are different, but as long as you’re using a reasoning model you’re good.
Step 2: Give it Instructions
In our _call_claude method, you may have noticed we’re passing in a system prompt and a tools schema. These are the instructions we give to our model so that it knows how to behave and what tools it has access to.
Here’s my system prompt, feel free to tweak it as needed:
SYSTEM_PROMPT = """You are a helpful coding agent that assists with programming tasks and file operations.
When responding to requests:
1. Analyze what the user needs
2. Use the minimum number of tools necessary to accomplish the task
3. After using tools, provide a concise summary of what was done
IMPORTANT: Once you've completed the requested task, STOP and provide your final response. Do not continue creating additional files or performing extra actions unless specifically asked.
Examples of good behavior:
- User: "Create a file that adds numbers" → Create ONE file, then summarize
- User: "Create files for add and subtract" → Create ONLY those two files, then summarize
- User: "Create math operation files" → Ask for clarification on which operations, or create a reasonable set and stop
After receiving tool results:
- If the task is complete, provide a final summary
- Only continue with more tools if the original request is not yet fulfilled
- Do not interpret successful tool execution as a request to do more
Be concise and efficient. Complete the requested task and stop."""
Current-gen models have native tool-use ability; you just need to send a schema up front so that when the model is reasoning it can look at the tool list and decide if it needs one to help with its task.
We define it like this:
TOOLS_SCHEMA = [
    {
        "name": "read_file",
        "description": "Read the contents of a file",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "The path to the file to read"}
            },
            "required": ["path"]
        }
    },
    # Other tool definitions follow a similar pattern
]
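For instance, a write_file entry could look like this (a hypothetical sketch that mirrors the pattern above; check the repo for the exact definitions):

# Illustrative only: the field values are my guesses, the structure is what matters
WRITE_FILE_TOOL = {
    "name": "write_file",
    "description": "Write content to a file, creating it if it doesn't exist",
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "The path to the file to write"},
            "content": {"type": "string", "description": "The content to write"}
        },
        "required": ["path", "content"]
    }
}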
Step 3: Define the Tool Logic
Let’s also define our actual tool logic. Here’s what it would look like for the Read File tool:
async def _read_file(self, path: str) -> Dict[str, Any]:
    """Read a file and return its contents"""
    try:
        file_path = (self.working_directory / path).resolve()
        # Path.is_relative_to (Python 3.9+) avoids the classic startswith()
        # pitfall, e.g. /home/me/project vs /home/me/project2
        if not file_path.is_relative_to(self.working_directory):
            return {"error": "Access denied: path outside working directory"}
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()
        return {"success": True, "content": content, "path": str(file_path)}
    except Exception as e:
        return {"error": f"Could not read file: {str(e)}"}
Continue defining the rest of the tools that way and add them to the tools schema. You can look at the full code in my GitHub Repository for help.
I have implemented read, write, list, and search but you can add more for an extra challenge.
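As one example, here’s a minimal _write_file along the same lines. This is a sketch (again assuming Python 3.9+ for Path.is_relative_to), not necessarily identical to the repo version:

async def _write_file(self, path: str, content: str) -> Dict[str, Any]:
    """Write content to a file inside the working directory"""
    try:
        file_path = (self.working_directory / path).resolve()
        if not file_path.is_relative_to(self.working_directory):
            return {"error": "Access denied: path outside working directory"}
        # Create parent directories so the agent can scaffold new packages
        file_path.parent.mkdir(parents=True, exist_ok=True)
        with open(file_path, 'w', encoding='utf-8') as f:
            f.write(content)
        return {"success": True, "path": str(file_path)}
    except Exception as e:
        return {"error": f"Could not write file: {str(e)}"}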
We’ll also need a function to execute the tool that we call if our LLM responds with a tool use request.
async def _execute_tool_calls(self, tool_uses: List[Any]) -> List[Dict]:
    tool_results = []
    for tool_use in tool_uses:
        print(f"Executing: {tool_use.name}")
        try:
            if tool_use.name == "read_file":
                result = await self._read_file(tool_use.input.get("path", ""))
            elif tool_use.name == "write_file":
                result = await self._write_file(tool_use.input.get("path", ""),
                                                tool_use.input.get("content", ""))
            elif tool_use.name == "list_files":
                result = await self._list_files(tool_use.input.get("path", "."))
            elif tool_use.name == "search_files":
                result = await self._search_files(tool_use.input.get("pattern", ""),
                                                  tool_use.input.get("path", "."))
            else:
                result = {"error": f"Unknown tool: {tool_use.name}"}
        except Exception as e:
            result = {"error": f"Tool execution failed: {str(e)}"}
        # Log success/error briefly
        if "success" in result and result["success"]:
            print("Tool executed successfully")
        elif "error" in result:
            print(f"Error: {result['error']}")
        # Collect result for the API
        tool_results.append({
            "tool_use_id": tool_use.id,
            "content": json.dumps(result)
        })
    return tool_results
It’s a bit verbose but good enough for our MVP. And now our Brain is connected with Tools!
Step 4: Context Management and Memory
Remember the references to ‘history’ from earlier? That’s a crude implementation of memory. We basically write our conversation to a history file. Every time we start up our agent, it reads that file and loads the full conversation. We can clear the file and start a fresh conversation.
def save_history(self):
    """Save conversation history"""
    try:
        with open(self.history_file, 'w') as f:
            json.dump(self.messages, f, indent=2)
    except Exception as e:
        print(f"Warning: Could not save history: {e}")

def load_history(self):
    """Load conversation history"""
    try:
        if os.path.exists(self.history_file):
            with open(self.history_file, 'r') as f:
                self.messages = json.load(f)
    except Exception:
        self.messages = []
Let’s also define some functions to help with context management. Right now we’re just going to track the conversation history and build a messages list.
def add_message(self, role: str, content: str):
    """Add a message to conversation history"""
    self.messages.append({"role": role, "content": content})
    self.save_history()

def build_messages_list(self, user_input: Optional[str] = None,
                        tool_results: Optional[List[Dict]] = None,
                        assistant_content: Optional[Any] = None,
                        max_history: int = 20) -> List[Dict]:
    """Build a clean messages list for the API call"""
    messages = []
    # Add conversation history (limited to recent messages for the context window)
    start_idx = max(0, len(self.messages) - max_history)
    for msg in self.messages[start_idx:]:
        if isinstance(msg, dict) and "role" in msg and "content" in msg:
            # Clean the message for API compatibility
            clean_msg = {"role": msg["role"], "content": msg["content"]}
            messages.append(clean_msg)
    # Add new user input if provided
    if user_input:
        messages.append({"role": "user", "content": user_input})
    # Add assistant content if provided (for tool use continuation)
    if assistant_content:
        messages.append({"role": "assistant", "content": assistant_content})
    # Add tool results as a user message if provided
    if tool_results:
        messages.append({
            "role": "user",
            "content": [
                {
                    "type": "tool_result",
                    "tool_use_id": tr["tool_use_id"],
                    "content": tr["content"]
                }
                for tr in tool_results
            ]
        })
    return messages
And those are the core components of our coding agent!
Step 5: Build the ReAct Loop
Finally, we need a function to guide our model to follow the ReAct pattern.
async def react_loop(self, user_input: str) -> str:
    # Add user message to history
    self.add_message("user", user_input)
    # Build the initial messages list (the user message is already in
    # history, so we don't pass it again or it would be duplicated)
    messages = self.build_messages_list()
    # Track the last text response to avoid duplication
    last_complete_response = None
    # Safety limit to prevent infinite loops
    safety_limit = 20
    iterations = 0
    while iterations < safety_limit:
        iterations += 1
        # Get Claude's response
        content_blocks, error = await self._call_claude(messages)
        if error:
            error_msg = f"Error: {error}"
            self.add_message("assistant", error_msg)
            return error_msg
        # Parse response into text and tool uses
        text_responses, tool_uses = self._parse_claude_response(content_blocks)
        # Store the last complete text response
        if text_responses:
            last_complete_response = "\n".join(text_responses)
        # If no tools were used, Claude is done - return final response
        if not tool_uses:
            break
        # Execute tools and collect results
        tool_results = await self._execute_tool_calls(tool_uses)
        # Build messages for the next iteration
        messages = self.build_messages_list(
            assistant_content=content_blocks,
            tool_results=tool_results
        )
    # Prepare final response
    if not last_complete_response:
        final_response = "I couldn't generate a response."
    elif iterations >= safety_limit:
        final_response = f"{last_complete_response}\n\n(Note: I reached my processing limit. You may want to break this down into smaller steps.)"
    else:
        final_response = last_complete_response
    # Save to history and return
    self.add_message("assistant", final_response)
    return final_response

async def process_message(self, user_input: str) -> str:
    """Main entry point for processing user messages"""
    try:
        # Use the ReAct loop to process the message
        response = await self.react_loop(user_input)
        return response
    except Exception as e:
        error_msg = f"Unexpected error processing message: {str(e)}"
        self.add_message("assistant", error_msg)
        return error_msg
Yes, it really is just a while loop. We call Claude with our request and it answers. If it needs to use a tool, we process the tool (as defined before) and then send back the tool result.
And then we loop. We’ve set a safety limit of 20 turns to avoid infinite loops (and to stop you from racking up those API calls).
When there are no more tool calls, we assume it’s done and print the final response.
We’re also parsing Claude’s responses for readability so that we can print them to our terminal and see what’s happening.
def _parse_claude_response(self, content_blocks: Any) -> Tuple[List[str], List[Any]]:
    text_responses = []
    tool_uses = []
    for block in content_blocks:
        if block.type == "text":
            text_responses.append(block.text)
            print(f"{block.text}")
        elif block.type == "tool_use":
            tool_uses.append(block)
            print(f"Tool call: {block.name}")
    return text_responses, tool_uses
Let’s Test it out!
Our agent is ready to use. We’re at about 400 lines of code, but that includes comments, error handling, and helper functions; the core agent logic is ~300 lines. Let’s see if it’s any good!
Let’s add a main function to our code so that we can get that CLI interface:
async def main():
    """Main CLI interface"""
    print("Welcome to Baby Claude Code!!")
    print("Type 'exit' or 'quit' to quit, 'clear' to clear history, 'history' to show recent messages")
    print("-" * 50)
    # Get API key
    api_key = os.getenv("ANTHROPIC_API_KEY")
    if not api_key:
        api_key = input("Enter your Anthropic API key: ").strip()
    # Initialize agent
    agent = CodingAgent(api_key)
    while True:
        try:
            user_input = input("\nYou: ").strip()
            if user_input.lower() in ['exit', 'quit']:
                print("Goodbye!")
                break
            elif user_input.lower() == 'clear':
                agent.messages = []
                agent.save_history()
                print("History cleared!")
                continue
            elif user_input.lower() == 'history':
                print("\nRecent conversation history:")
                for msg in agent.messages[-10:]:
                    role = msg.get("role", "unknown")
                    content = str(msg.get("content", ""))
                    if len(content) > 100:
                        content = content[:100] + "..."
                    print(f"[{role}] {content}")
                continue
            elif not user_input:
                continue
            print("\nAgent processing...")
            # Responses are printed as they arrive by _parse_claude_response
            await agent.process_message(user_input)
        except KeyboardInterrupt:
            print("\n\nGoodbye!")
            break
        except Exception as e:
            print(f"\nError: {e}")

if __name__ == "__main__":
    asyncio.run(main())
Now run the file and watch your own baby Claude Code come to life!
Understanding the Code Flow
If you’ve been following along, you should have a working coding agent. It’s basic but it gets the job done.
We first pass your task to the react_loop method, which compiles a conversation history and calls Claude.
Based on our system prompt and tool schema, Claude decides if it needs to use a tool to answer our request. If so, it sends back a tool request which we execute. We add the results to our message history and send it back to Claude, and loop over.
We keep doing this until there are no more tool calls, in which case we assume Claude has nothing else to do and we return the final answer.

Et voila! We have a functioning coding agent that can explain codebases, write new code, and keep track of a conversation.
Pretty sweet.
I’ve added all the code to my GitHub. Enter your email below to receive it.
Want to build your own AI agents?
Sign up for my newsletter covering everything from the tools, APIs, and frameworks you need, to building and serving your own multi-step AI agents.
Phase 2: Adding Code Execution
We have a coding agent that can read and write code, but in this age of vibe coding, we want it to be able to test and execute code as well. Those bugs ain’t gonna debug themselves.
All we need to do is give it new tools to execute code. The main complexity is ensuring it doesn’t run malicious code or delete our OS by mistake. That’s why this phase is mostly about code validation and sandboxing. Let’s see how.
Step 1: Code Refactoring
Before we do anything, let’s refactor our existing code for better readability and modularity.
Here’s our new project structure:
coding_agent/
├── __init__.py # Package initialization
├── config.py # Central configuration
├── agent.py # Main CodingAgent class
├── tools/
│ ├── __init__.py
│ ├── base.py # Tool interface & registry
│ ├── file_ops.py # File operation tools
│ └── code_exec.py # Code execution tools
├── execution/
│ ├── __init__.py
│ ├── validator.py # AST-based validator
│ └── executor.py # Sandboxed executor
└── cli.py # CLI interface
Most of the code is pretty much the same. config.py has our model configuration parameters and system prompt. cli.py is the main CLI interface that we added right at the end of Phase 1.
agent.py is the core agent class sans the tools setup, which now lives in the tools/ folder. We have a base tool template in base.py, then define the read, write, and search file tools in file_ops.py.
The new code is code_exec.py, which contains the metadata for the executor and validator tools; the actual implementations of those tools live in the execution/ folder for readability.
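I won’t paste the whole refactor, but to give you an idea, the tool interface and registry in base.py might look something like this (my reconstruction, with illustrative names; see the repo for the real version):

from abc import ABC, abstractmethod
from typing import Any, Dict

class Tool(ABC):
    """Base template that every tool implements."""
    name: str
    description: str
    input_schema: Dict[str, Any]

    @abstractmethod
    async def run(self, **kwargs: Any) -> Dict[str, Any]:
        """Execute the tool and return a JSON-serializable result."""

class ToolRegistry:
    """Maps tool names to implementations so the agent can dispatch calls."""
    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    async def execute(self, name: str, **kwargs: Any) -> Dict[str, Any]:
        if name not in self._tools:
            return {"error": f"Unknown tool: {name}"}
        return await self._tools[name].run(**kwargs)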
Step 2: The Validator
The CodeValidator uses Python’s Abstract Syntax Tree (AST) to analyze code before it runs. Think of it as a security guard that inspects code at the gate.
class CodeValidator:
    def validate(self, code: str) -> Tuple[bool, List[str]]:
        self.violations: List[str] = []
        # Parse code into an AST (reject code that doesn't even parse)
        try:
            tree = ast.parse(code)
        except SyntaxError as e:
            return False, [f"Syntax error: {e}"]
        # Walk the tree looking for dangerous patterns
        self._check_node(tree)
        # Return validation result
        return len(self.violations) == 0, self.violations
What the Validator Blocks:
1. Dangerous Imports
import os # BLOCKED - could delete files
import subprocess # BLOCKED - could run shell commands
import socket # BLOCKED - could make network connections
2. File Operations
open('file.txt', 'w') # BLOCKED - could overwrite files
with open('/etc/passwd', 'r'): # BLOCKED - could read sensitive files
3. Dangerous Built-in Functions
eval("malicious_code") # BLOCKED - arbitrary code execution
exec("import os; os.system('rm -rf /')") # BLOCKED
__import__('os') # BLOCKED - dynamic imports
4. System Access Attempts
sys.exit() # BLOCKED - could crash the program
os.environ['SECRET_KEY'] # BLOCKED - environment access
The validator works by walking the AST and checking each node type:
- ast.Import and ast.ImportFrom nodes → check against dangerous modules
- ast.Call nodes → check for dangerous function calls
- ast.Attribute nodes → check for dangerous attribute access
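Here’s a minimal sketch of that walk. The denylists are abbreviated for illustration; the full lists live in the repo:

import ast

# Abbreviated denylists (illustrative; the real ones are longer)
DANGEROUS_MODULES = {"os", "subprocess", "socket", "sys", "shutil"}
DANGEROUS_CALLS = {"eval", "exec", "open", "__import__", "compile"}

def _check_node(self, tree: ast.AST) -> None:
    for node in ast.walk(tree):
        # Imports: flag dangerous modules
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in DANGEROUS_MODULES:
                    self.violations.append(f"Dangerous import: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in DANGEROUS_MODULES:
                self.violations.append(f"Dangerous import: {node.module}")
        # Calls: flag eval/exec/open/__import__ and friends
        elif isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name) and node.func.id in DANGEROUS_CALLS:
                self.violations.append(f"Dangerous call: {node.func.id}()")
        # Attribute access: flag things like os.environ or sys.exit
        elif isinstance(node, ast.Attribute):
            if isinstance(node.value, ast.Name) and node.value.id in DANGEROUS_MODULES:
                self.violations.append(f"Dangerous attribute access: {node.value.id}.{node.attr}")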
Most coding agents don’t actually block all of this. They have a permissioning system to give their users control. I’m just being overly cautious for the sake of this tutorial.
Step 3: The Executor
Even if code passes validation, we still need runtime protection. Again, I’m being overly cautious here and creating a custom Python environment with only certain built-in functions:
# User code runs with ONLY these functions available
safe_builtins = {
    'print': print,   # Safe for output
    'len': len,       # Safe for measurement
    'range': range,   # Safe for iteration
    'int': int,       # Safe type conversion
    # ... other safe functions
    # Notably missing:
    # - open (no file access)
    # - __import__ (no imports)
    # - eval/exec (no dynamic execution)
    # - input (no user interaction)
}
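How does the allowlist actually get applied? Here’s a simplified sketch of one way to wire it up (the repo’s executor may differ): exec the user’s code with __builtins__ swapped for the allowlist.

def run_restricted(code: str) -> None:
    # User code sees ONLY the names in safe_builtins (defined above)
    restricted_globals = {"__builtins__": safe_builtins}
    exec(code, restricted_globals)

run_restricted("print(len(range(5)))")    # works: prints 5
try:
    run_restricted("open('x.txt', 'w')")  # open isn't in the allowlist
except NameError as e:
    print(f"Blocked: {e}")

Note that swapping __builtins__ on its own is famously escapable, which is exactly why the AST validator and the subprocess isolation below exist as additional layers.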
And then when we do run code, it’s in a separate sub-process:
process = await asyncio.create_subprocess_exec(
    sys.executable, code_file,
    stdout=asyncio.subprocess.PIPE,
    stderr=asyncio.subprocess.PIPE,
    cwd=str(self.sandbox_dir)  # Isolated directory
)
This gives us:
- Memory isolation: Can’t access parent process memory
- Crash protection: If code crashes, main program continues
- Clean termination: Can kill runaway processes
- Output capture: All output is captured and controlled
The executor also sets strict resource limits at the OS level:
# CPU time limit (prevents infinite loops)
resource.setrlimit(resource.RLIMIT_CPU, (5, 5))
# Memory limit (prevents memory bombs)
resource.setrlimit(resource.RLIMIT_AS, (100_000_000, 100_000_000)) # 100MB
# No core dumps (prevents disk filling)
resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
# Process limit (prevents fork bombs)
resource.setrlimit(resource.RLIMIT_NPROC, (1, 1))
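One detail worth flagging: these setrlimit calls have to run inside the child process, not the parent. A common way to arrange that on POSIX systems (an assumption about the wiring; the repo may do it differently) is subprocess’s preexec_fn hook, which runs in the child between fork and exec:

import resource

def _set_limits() -> None:
    # Executed in the child process, so the limits don't touch the parent
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))
    resource.setrlimit(resource.RLIMIT_AS, (100_000_000, 100_000_000))
    resource.setrlimit(resource.RLIMIT_CORE, (0, 0))

process = await asyncio.create_subprocess_exec(
    sys.executable, code_file,
    stdout=asyncio.subprocess.PIPE,
    stderr=asyncio.subprocess.PIPE,
    cwd=str(self.sandbox_dir),
    preexec_fn=_set_limits,  # apply limits to the child only
)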
As a final safeguard, all code execution has a timeout:
try:
    stdout, stderr = await asyncio.wait_for(
        process.communicate(),
        timeout=10  # 10-second maximum
    )
except asyncio.TimeoutError:
    process.kill()  # Force terminate
    return {"error": "Execution timed out"}
There’s a bit more code around creating the sandbox environment to execute code but we’re almost at 5,000 words and my WordPress backend is getting sluggish, so I’m not going to paste it all here. You can get it from my Github.
You’ll also want to update the tool schema with the new tools and describe how Claude should use them in the system prompt.
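For example, a schema entry for the execution tool might look like this (the name execute_code and its parameters are my guesses; align them with your implementation):

EXECUTE_CODE_TOOL = {
    "name": "execute_code",
    "description": "Validate Python code, then run it in a sandboxed subprocess and return stdout/stderr",
    "input_schema": {
        "type": "object",
        "properties": {
            "code": {"type": "string", "description": "The Python code to execute"}
        },
        "required": ["code"]
    }
}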
How It All Works Together
Adding code execution transforms our agent from a simple file manipulator into a true coding assistant that can:
- Learn from execution results to improve its suggestions
- Write and immediately test solutions
- Debug by seeing actual error messages
- Iterate on solutions that don’t work
- Validate that code produces expected output
Here’s the complete flow when the agent executes code:
User Request: "Test this fibonacci function"
↓
1. Agent calls execute_code tool
↓
2. CodeValidator.validate(code)
├─ Parse to AST
├─ Check for dangerous imports ✓
├─ Check for dangerous functions ✓
└─ Check for file operations ✓
↓
3. CodeExecutor.execute(code)
├─ Create sandboxed code file
├─ Apply restricted builtins
├─ Set resource limits
├─ Run in subprocess
├─ Monitor with timeout
└─ Capture output safely
↓
4. Return results to agent
├─ stdout: "Fibonacci(10) = 55"
├─ stderr: ""
└─ success: true
And that’s Phase 2! If you’ve been implementing along with me, you should be getting results like this:
[Screenshot: sample run of the agent validating and executing code]
Phase 3: Better Context Management
Phases 1 and 2 gave our agent powerful capabilities: it can manipulate files and safely execute code. But try asking it to “refactor the authentication system” in a real project with 500 files, and it hits a wall. The agent doesn’t know:
- What files are relevant to authentication
- How components connect across the codebase
- Which functions call which others
- What context it needs to make safe changes
This is the fundamental challenge of AI coding assistants: context. LLMs have a limited context window, and even if we could fit an entire codebase, indiscriminately dumping hundreds of files would be wasteful and confusing. The agent would spend most of its reasoning power just figuring out what’s relevant.
I’m going to pause here for now and come back to this section later. Meanwhile, read my guide on Context Engineering to understand the concepts behind this. And sign up below for when I complete Phase 3!
Want to build your own AI agents?
Sign up for my newsletter covering everything from the tools, APIs, and frameworks you need, to building and serving your own multi-step AI agents.