I spend way too much time context-switching between client emails, project notes, meeting transcripts, and CRM data. Every morning, I’m asking myself the same questions: What’s urgent? Who needs a response? What deadlines am I forgetting?
So I built an AI assistant that answers all of this, and it runs 100% locally on my MacBook.
No API costs. No data leaving my machine. Just my documents, my AI, completely private.
I’m calling it Vault. Not a very creative name but it works!
Why Local AI Matters
Here’s the thing about cloud AI: every query you send is data you’re handing over to someone else. When you’re dealing with client information, financial data, or sensitive business documents, that’s a problem.
And then there’s cost. I’ve burned through hundreds of dollars in OpenAI API credits on projects before. For a personal knowledge base I’m querying dozens of times a day? That adds up fast.
Local inference solves both problems. Your data stays on your machine, and once you’ve got the model running, queries are essentially free.
The tradeoff used to be performance: local models were slow and dumb compared to the closed-source state of the art. But that's changing fast.
Enter Parallax
I’m using Parallax from Gradient Network for local inference. It’s a fully decentralized inference engine for local AI models, and the setup is dead simple:
git clone https://github.com/GradientHQ/parallax.git
cd parallax
# Create and activate a Python virtual environment
python3 -m venv ./venv
source ./venv/bin/activate
pip install -e '.[mac]'
Once done, just run parallax run and you can start setting up your AI cluster at localhost:3001. Follow the instructions and you should soon be able to pick one of many LLMs and chat with it!
The Architecture
Of course, we’re going to do more with our local AI than just chat. Vault is a RAG (Retrieval-Augmented Generation) system and a personal AI Chief of Staff (glorified executive assistant). The idea is simple:
- Ingest documents from my Gmail and Google Drive into a local vector database
- Search for relevant chunks when the user asks a question about a project
- Generate an answer using the retrieved context and the local AI
Here’s the high-level flow:
Documents (PDF, Email, DOCX, CSV)
↓
[Chunking & Embedding]
↓
ChromaDB Vector Store
↓
Semantic Search
↓
Parallax LLM
↓
Contextual Answer
Let me walk through each component.
Document Ingestion
The first challenge: getting all my messy data into a format the AI can work with. I’ve got PDFs, Word docs, email chains, CSV exports from my CRM, JSON files with meeting notes, the works.
class DocumentLoader:
    """Load documents from various file formats."""

    SUPPORTED_EXTENSIONS = {
        '.pdf', '.md', '.txt', '.docx',  # Documents
        '.eml',                          # Emails
        '.csv', '.json'                  # Data exports
    }

    def __init__(self, chunk_size: int = 500, chunk_overlap: int = 50):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
The chunk_size and chunk_overlap parameters are crucial. Too large, and you waste context window space. Too small, and you lose coherence. I landed on 500 characters with 50-character overlap after some experimentation.
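The chunking itself doesn't need to be clever. Here's a minimal sketch of a character-window chunk_text method; the internals are illustrative, not necessarily what Vault ships with:

def chunk_text(self, text: str, source: str) -> list[dict]:
    """Split text into overlapping character windows (illustrative sketch)."""
    chunks = []
    step = max(self.chunk_size - self.chunk_overlap, 1)
    for start in range(0, len(text), step):
        piece = text[start:start + self.chunk_size]
        if piece.strip():
            chunks.append({"content": piece, "source": source})
    return chunks

Each chunk keeps a source field so answers can cite where they came from later.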
For emails, I extract the metadata into a structured format the AI can reason about:
import email
from email import policy
from pathlib import Path

def _load_email(self, path: Path) -> str:
    """Extract content and metadata from .eml files."""
    with open(path, 'rb') as f:
        msg = email.message_from_binary_file(f, policy=policy.default)

    parts = [
        "EMAIL",
        f"From: {msg['From']}",
        f"To: {msg['To']}",
        f"Subject: {msg['Subject']}",
        f"Date: {msg['Date']}",
    ]

    # Extract body
    body = ""
    if msg.is_multipart():
        for part in msg.walk():
            if part.get_content_type() == "text/plain":
                body = part.get_content()
                break
    else:
        body = msg.get_content()

    parts.append(f"\n{body}")
    return "\n".join(parts)
This way, when I ask “what emails need my response?”, the AI has all the metadata it needs to give a useful answer.
Gmail and Google Drive Integration
I didn’t want to manually export emails and documents every time. So I built integrations for Gmail and Google Drive that sync directly into the knowledge base.
The Gmail client uses OAuth 2.0 and the Gmail API to fetch messages:
from datetime import datetime, timedelta

class GmailClient:
    """Gmail API client for syncing emails to Vault."""

    SCOPES = ['https://www.googleapis.com/auth/gmail.readonly']

    def fetch_messages(self, days_back: int = 30, query: str = "",
                       max_results: int = 500) -> list[dict]:
        """Fetch messages from Gmail."""
        after_date = (datetime.now() - timedelta(days=days_back)).strftime('%Y/%m/%d')
        search_query = f"after:{after_date}"
        if query:
            search_query += f" {query}"

        results = self.service.users().messages().list(
            userId='me', q=search_query, maxResults=max_results
        ).execute()

        messages = []
        for msg_info in results.get('messages', []):
            msg = self.service.users().messages().get(
                userId='me', id=msg_info['id'], format='full'
            ).execute()
            messages.append(self._parse_message(msg))
        return messages
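The OAuth side is standard Google client-library boilerplate. Here's a rough sketch of the authentication step, assuming a credentials.json downloaded from the Google Cloud console (the file names and helper are illustrative):

from pathlib import Path
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

def _authenticate(self, credentials_path: str = "credentials.json",
                  token_path: str = "token.json") -> None:
    """Run the OAuth flow once, then reuse the cached token (sketch)."""
    creds = None
    if Path(token_path).exists():
        creds = Credentials.from_authorized_user_file(token_path, self.SCOPES)
    if not creds or not creds.valid:
        flow = InstalledAppFlow.from_client_secrets_file(credentials_path, self.SCOPES)
        creds = flow.run_local_server(port=0)
        Path(token_path).write_text(creds.to_json())
    self.service = build('gmail', 'v1', credentials=creds)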
Google Drive works similarly. It fetches documents, exports Google Docs to plain text, and downloads supported file types:
class DriveClient:
    """Google Drive client for syncing documents to Vault."""

    SUPPORTED_MIME_TYPES = {
        'application/pdf': '.pdf',
        'text/plain': '.txt',
        'text/markdown': '.md',
        'application/vnd.google-apps.document': '.gdoc',  # Export as text
    }

    def sync_to_vault(self, vectorstore, loader, folder_id=None,
                      days_back: int = 30) -> int:
        """Sync Drive files to the knowledge base."""
        files = self.list_files(folder_id=folder_id, days_back=days_back)
        for file_info in files:
            content = self.download_file(file_info['id'], file_info['mimeType'])
            chunks = loader.chunk_text(content, source=f"drive://{file_info['name']}")
            vectorstore.add_chunks(chunks)
        return len(files)
Now I can run python main.py sync gmail and have the last 30 days of emails indexed in seconds.
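The main.py wiring is nothing fancy: argparse dispatching to the clients above. A sketch, assuming a GmailClient.sync_to_vault analogous to the Drive one:

import argparse

def main():
    parser = argparse.ArgumentParser(prog="vault")
    sub = parser.add_subparsers(dest="command", required=True)
    sync = sub.add_parser("sync", help="Sync a source into the knowledge base")
    sync.add_argument("source", choices=["gmail", "drive"])
    sync.add_argument("--days-back", type=int, default=30)
    args = parser.parse_args()

    loader = DocumentLoader()
    store = VectorStore()
    if args.source == "gmail":
        GmailClient().sync_to_vault(store, loader, days_back=args.days_back)
    else:
        DriveClient().sync_to_vault(store, loader, days_back=args.days_back)

if __name__ == "__main__":
    main()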
Vector Storage with ChromaDB
ChromaDB handles embedding and similarity search. The nice thing is it includes a default embedding model, so you don’t need to set up a separate embedding service:
import chromadb
from chromadb.config import Settings
from pathlib import Path

class VectorStore:
    """ChromaDB-based vector store for document chunks."""

    def __init__(self, persist_dir: str = "./data/chromadb"):
        self.persist_dir = Path(persist_dir)
        self.persist_dir.mkdir(parents=True, exist_ok=True)
        self.client = chromadb.PersistentClient(
            path=str(self.persist_dir),
            settings=Settings(anonymized_telemetry=False)
        )
        self.collection = self.client.get_or_create_collection(
            name="knowledge_base",
            metadata={"hnsw:space": "cosine"}
        )
I’m using cosine similarity for the HNSW index. It works well for semantic search and is the standard choice for most text embedding models.
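Adding documents is a thin wrapper over collection.add; ChromaDB embeds them with its default model on the way in. A sketch of add_chunks, assuming the chunk dicts carry content and source keys like the loader produces:

import uuid

def add_chunks(self, chunks: list[dict]) -> None:
    """Embed and store document chunks (sketch)."""
    if not chunks:
        return
    self.collection.add(
        ids=[str(uuid.uuid4()) for _ in chunks],
        documents=[c["content"] for c in chunks],
        metadatas=[{"source": c["source"]} for c in chunks],
    )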
Searching is a single method call:
def search(self, query: str, n_results: int = 5) -> list[dict]:
    """Search for similar chunks."""
    results = self.collection.query(
        query_texts=[query],
        n_results=n_results,
        include=["documents", "metadatas", "distances"]
    )
    # ... process results
ChromaDB returns the top N most similar chunks, along with their similarity scores. We pass these to the LLM as context.
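The result processing I elided is mostly unpacking ChromaDB's parallel lists into flat dicts. Roughly (a sketch; the field names just match the rest of this post):

# Sketch of the elided post-processing at the end of search()
chunks = []
for doc, meta, dist in zip(results["documents"][0],
                           results["metadatas"][0],
                           results["distances"][0]):
    chunks.append({
        "content": doc,
        "source": meta.get("source", "unknown"),
        "similarity": 1 - dist,  # cosine distance -> similarity
    })
return chunks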
The RAG Engine
This is where it all comes together. The RAG engine:
- Takes a user question
- Retrieves relevant context from the vector store
- Builds a prompt with the context
- Sends it to Parallax for generation
- Streams the response back
class RAGEngine:
    """Retrieval-Augmented Generation engine with conversation memory."""

    SYSTEM_PROMPT = """You are Vault, a helpful AI assistant with access to a personal knowledge base.
Your role is to answer questions based on the provided context from the user's documents.

Guidelines:
- Answer based primarily on the provided context
- If the context doesn't contain enough information, say so clearly
- Cite the source documents when relevant
- Be concise but thorough"""

    def build_context(self, query: str) -> tuple[str, list[dict]]:
        """Retrieve relevant context for a query."""
        results = self.vectorstore.search(query, n_results=self.n_results)
        if not results:
            return "", []

        context_parts = []
        for i, result in enumerate(results, 1):
            source = result['source'].split('/')[-1]
            context_parts.append(
                f"[{i}] Source: {source}\n{result['content']}"
            )
        return "\n\n---\n\n".join(context_parts), results
Streaming Responses
Nobody wants to wait 30 seconds staring at a blank screen. Parallax supports streaming, so we can show tokens as they're generated:
import json
from typing import Generator

import httpx

def _stream_chat(self, payload: dict) -> Generator[str, None, None]:
    """Streaming chat completion."""
    with httpx.Client(timeout=120.0) as client:
        with client.stream("POST", self.api_url, json=payload) as response:
            for line in response.iter_lines():
                if line.startswith("data: "):
                    data_str = line[6:]
                    if data_str.strip() == "[DONE]":
                        break
                    data = json.loads(data_str)
                    content = data['choices'][0].get('delta', {}).get('content', '')
                    if content:
                        yield content
This is the same SSE (Server-Sent Events) format that OpenAI uses, so if you’ve worked with their streaming API before, this will look familiar.
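On the consuming side, the CLI just prints tokens as they arrive, something like this (engine being a RAGEngine instance):

for token in engine.ask("What are my priorities for today?"):
    print(token, end="", flush=True)
print()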
Conversation Memory
A knowledge base isn’t very useful if you have to re-explain context every query. I added a simple conversation history that gets included in subsequent prompts:
from dataclasses import dataclass, field

@dataclass
class ChatSession:
    """Maintains conversation history for multi-turn chat."""

    messages: list[ChatMessage] = field(default_factory=list)
    max_history: int = 10

    def get_history_summary(self) -> str:
        """Get a summary of recent conversation for context."""
        summary_parts = []
        for msg in self.messages[-6:]:  # Last 3 exchanges
            prefix = "User" if msg.role == "user" else "Assistant"
            content = msg.content[:200] + "..." if len(msg.content) > 200 else msg.content
            summary_parts.append(f"{prefix}: {content}")
        return "\n".join(summary_parts)
Now I can ask “what’s the deadline for Mike’s project?” and follow up with “what did we discuss in the last meeting?” without re-specifying who Mike is.
The Chief of Staff Experience
This is where it gets interesting. With all my client data indexed (emails, meeting notes, CRM exports, project briefs) I can just ask natural questions:
> What are my priorities for today?
Based on your documents, here are today's priorities:
1. **Acme Corp** - API integration deadline tomorrow. John emailed about
OAuth token issues that need resolving before the Phase 2 delivery.
2. **TechStart Inc** - Proposal due Friday for the mobile app project.
Sarah wants a follow-up on the architecture recommendations.
3. **GlobalBank** - URGENT: Mike emailed about the compliance review.
They need the security documentation before end of week.
4. **RetailMax** - Payment gateway integration due next week.
Lisa asked about supporting Apple Pay in addition to Stripe.
This is just semantic search + LLM reasoning over my actual documents.
I can drill down:
> What did Mike say about the compliance issue?
According to your email from Mike (Dec 4), GlobalBank's compliance team
has approved the project to move forward. He needs you to confirm Q1
availability by December 10th and will send updated security requirements
by December 15th.
Or get synthesis across sources:
> What's the status of all my active projects?
You have 4 active projects:
1. **Acme Corp** - API Integration, in progress, $150,000 (deadline Dec 15)
2. **TechStart Inc** - Website Redesign, complete, awaiting final invoice
3. **GlobalBank** - Mobile App, on hold until January, $280,000
4. **RetailMax** - E-commerce Platform, in progress, $95,000
Next Steps
There are lots of directions I can take this. For starters, I’m using a pretty small model, and it still works great for answering questions about my projects, identifying my priorities and keeping me on track, and even writing emails.
But I can build it into something more powerful that helps me actually execute on projects and deliver client work. I just need to add more nodes to the cluster and run a bigger model.
As I continue building it out, I’ll add to this post. Sign up below to stay tuned!