I spend way too much time context-switching between client emails, project notes, meeting transcripts, and CRM data. Every morning, I’m asking myself the same questions: What’s urgent? Who needs a response? What deadlines am I forgetting?
So I built an AI assistant that answers all of this, and it runs 100% locally on my MacBook.
No API costs. No data leaving my machine. Just my documents, my AI, completely private.
I’m calling it Vault. Not a very creative name but it works!
Why Local AI Matters
Here’s the thing about cloud AI: every query you send is data you’re handing over to someone else. When you’re dealing with client information, financial data, or sensitive business documents, that’s a problem.
And then there’s cost. I’ve burned through hundreds of dollars in OpenAI API credits on projects before. For a personal knowledge base I’m querying dozens of times a day? That adds up fast.
Local inference solves both problems. Your data stays on your machine, and once you’ve got the model running, queries are essentially free.
The tradeoff used to be performance: local models were slow and dumb compared to the closed-source state of the art. But that's changing fast.
Enter Parallax
I’m using Parallax from Gradient Network for local inference. It’s a fully decentralized inference engine for local AI models, and the setup is dead simple:
git clone https://github.com/GradientHQ/parallax.git
cd parallax
# Create and activate a Python virtual environment
python3 -m venv ./venv
source ./venv/bin/activate
pip install -e '.[mac]'
Once done, just run parallax run and you can start setting up your AI cluster at localhost:3001. Follow the instructions and you should soon be able to pick one of many LLMs and chat with it!
The Architecture
Of course, we’re going to do more with our local AI than just chat. Vault is a RAG (Retrieval-Augmented Generation) system and a personal AI Chief of Staff (glorified executive assistant). The idea is simple:
- Ingest documents from my Gmail and Google Drive into a local vector database
- Search for relevant chunks when the user asks a question about a project
- Generate an answer using the retrieved context and the local AI
Here’s the high-level flow:
Documents (PDF, Email, DOCX, CSV)
↓
[Chunking & Embedding]
↓
ChromaDB Vector Store
↓
Semantic Search
↓
Parallax LLM
↓
Contextual Answer
Let me walk through each component.
Document Ingestion
The first challenge: getting all my messy data into a format the AI can work with. I’ve got PDFs, Word docs, email chains, CSV exports from my CRM, JSON files with meeting notes, the works.
class DocumentLoader:
    """Load documents from various file formats."""

    SUPPORTED_EXTENSIONS = {
        '.pdf', '.md', '.txt', '.docx',  # Documents
        '.eml',                          # Emails
        '.csv', '.json'                  # Data exports
    }

    def __init__(self, chunk_size: int = 500, chunk_overlap: int = 50):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
The chunk_size and chunk_overlap parameters are crucial. Too large, and you waste context window space. Too small, and you lose coherence. I landed on 500 characters with 50-character overlap after some experimentation.
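The chunking itself doesn't need to be clever. Here's a minimal sketch of a character-window chunk_text method; the internals are illustrative, not necessarily what Vault ships with:

def chunk_text(self, text: str, source: str) -> list[dict]:
    """Split text into overlapping character windows (illustrative sketch)."""
    chunks = []
    step = max(self.chunk_size - self.chunk_overlap, 1)
    for start in range(0, len(text), step):
        piece = text[start:start + self.chunk_size]
        if piece.strip():
            chunks.append({"content": piece, "source": source})
    return chunks

Each chunk keeps a source field so answers can cite where they came from later.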
For emails, I extract the metadata into a structured format the AI can reason about:
import email
from email import policy
from pathlib import Path

def _load_email(self, path: Path) -> str:
    """Extract content and metadata from .eml files."""
    with open(path, 'rb') as f:
        msg = email.message_from_binary_file(f, policy=policy.default)

    parts = [
        "EMAIL",
        f"From: {msg['From']}",
        f"To: {msg['To']}",
        f"Subject: {msg['Subject']}",
        f"Date: {msg['Date']}",
    ]

    # Extract body
    body = ""
    if msg.is_multipart():
        for part in msg.walk():
            if part.get_content_type() == "text/plain":
                body = part.get_content()
                break
    else:
        body = msg.get_content()

    parts.append(f"\n{body}")
    return "\n".join(parts)
This way, when I ask “what emails need my response?”, the AI has all the metadata it needs to give a useful answer.
Gmail and Google Drive Integration
I didn’t want to manually export emails and documents every time. So I built integrations for Gmail and Google Drive that sync directly into the knowledge base.
The Gmail client uses OAuth 2.0 and the Gmail API to fetch messages:
from datetime import datetime, timedelta

class GmailClient:
    """Gmail API client for syncing emails to Vault."""

    SCOPES = ['https://www.googleapis.com/auth/gmail.readonly']

    def fetch_messages(self, days_back: int = 30, query: str = "",
                       max_results: int = 500) -> list[dict]:
        """Fetch messages from Gmail."""
        after_date = (datetime.now() - timedelta(days=days_back)).strftime('%Y/%m/%d')
        search_query = f"after:{after_date}"
        if query:
            search_query += f" {query}"

        results = self.service.users().messages().list(
            userId='me', q=search_query, maxResults=max_results
        ).execute()

        messages = []
        for msg_info in results.get('messages', []):
            msg = self.service.users().messages().get(
                userId='me', id=msg_info['id'], format='full'
            ).execute()
            messages.append(self._parse_message(msg))
        return messages
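The OAuth side is standard Google client-library boilerplate. Here's a rough sketch of the authentication step, assuming a credentials.json downloaded from the Google Cloud console (the file names and helper are illustrative):

from pathlib import Path
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

def _authenticate(self, credentials_path: str = "credentials.json",
                  token_path: str = "token.json") -> None:
    """Run the OAuth flow once, then reuse the cached token (sketch)."""
    creds = None
    if Path(token_path).exists():
        creds = Credentials.from_authorized_user_file(token_path, self.SCOPES)
    if not creds or not creds.valid:
        flow = InstalledAppFlow.from_client_secrets_file(credentials_path, self.SCOPES)
        creds = flow.run_local_server(port=0)
        Path(token_path).write_text(creds.to_json())
    self.service = build('gmail', 'v1', credentials=creds)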
Google Drive works similarly. It fetches documents, exports Google Docs to plain text, and downloads supported file types:
class DriveClient:
    """Google Drive client for syncing documents to Vault."""

    SUPPORTED_MIME_TYPES = {
        'application/pdf': '.pdf',
        'text/plain': '.txt',
        'text/markdown': '.md',
        'application/vnd.google-apps.document': '.gdoc',  # Export as text
    }

    def sync_to_vault(self, vectorstore, loader, folder_id=None,
                      days_back: int = 30) -> int:
        """Sync Drive files to the knowledge base."""
        files = self.list_files(folder_id=folder_id, days_back=days_back)
        for file_info in files:
            content = self.download_file(file_info['id'], file_info['mimeType'])
            chunks = loader.chunk_text(content, source=f"drive://{file_info['name']}")
            vectorstore.add_chunks(chunks)
        return len(files)
Now I can run python main.py sync gmail and have the last 30 days of emails indexed in seconds.
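The main.py wiring is nothing fancy: argparse dispatching to the clients above. A sketch, assuming a GmailClient.sync_to_vault analogous to the Drive one:

import argparse

def main():
    parser = argparse.ArgumentParser(prog="vault")
    sub = parser.add_subparsers(dest="command", required=True)
    sync = sub.add_parser("sync", help="Sync a source into the knowledge base")
    sync.add_argument("source", choices=["gmail", "drive"])
    sync.add_argument("--days-back", type=int, default=30)
    args = parser.parse_args()

    loader = DocumentLoader()
    store = VectorStore()
    if args.source == "gmail":
        GmailClient().sync_to_vault(store, loader, days_back=args.days_back)
    else:
        DriveClient().sync_to_vault(store, loader, days_back=args.days_back)

if __name__ == "__main__":
    main()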
Vector Storage with ChromaDB
ChromaDB handles embedding and similarity search. The nice thing is it includes a default embedding model, so you don’t need to set up a separate embedding service:
import chromadb
from chromadb.config import Settings
from pathlib import Path

class VectorStore:
    """ChromaDB-based vector store for document chunks."""

    def __init__(self, persist_dir: str = "./data/chromadb"):
        self.persist_dir = Path(persist_dir)
        self.persist_dir.mkdir(parents=True, exist_ok=True)
        self.client = chromadb.PersistentClient(
            path=str(self.persist_dir),
            settings=Settings(anonymized_telemetry=False)
        )
        self.collection = self.client.get_or_create_collection(
            name="knowledge_base",
            metadata={"hnsw:space": "cosine"}
        )
I’m using cosine similarity for the HNSW index. It works well for semantic search and is the standard choice for most text embedding models.
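Adding documents is a thin wrapper over collection.add; ChromaDB embeds them with its default model on the way in. A sketch of add_chunks, assuming the chunk dicts carry content and source keys like the loader produces:

import uuid

def add_chunks(self, chunks: list[dict]) -> None:
    """Embed and store document chunks (sketch)."""
    if not chunks:
        return
    self.collection.add(
        ids=[str(uuid.uuid4()) for _ in chunks],
        documents=[c["content"] for c in chunks],
        metadatas=[{"source": c["source"]} for c in chunks],
    )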
Searching is a single method call:
def search(self, query: str, n_results: int = 5) -> list[dict]:
    """Search for similar chunks."""
    results = self.collection.query(
        query_texts=[query],
        n_results=n_results,
        include=["documents", "metadatas", "distances"]
    )
    # ... process results
ChromaDB returns the top N most similar chunks, along with their similarity scores. We pass these to the LLM as context.
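The result processing I elided is mostly unpacking ChromaDB's parallel lists into flat dicts. Roughly (a sketch; the field names just match the rest of this post):

# Sketch of the elided post-processing at the end of search()
chunks = []
for doc, meta, dist in zip(results["documents"][0],
                           results["metadatas"][0],
                           results["distances"][0]):
    chunks.append({
        "content": doc,
        "source": meta.get("source", "unknown"),
        "similarity": 1 - dist,  # cosine distance -> similarity
    })
return chunks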
The RAG Engine
This is where it all comes together. The RAG engine:
- Takes a user question
- Retrieves relevant context from the vector store
- Builds a prompt with the context
- Sends it to Parallax for generation
- Streams the response back
class RAGEngine:
    """Retrieval-Augmented Generation engine with conversation memory."""

    SYSTEM_PROMPT = """You are Vault, a helpful AI assistant with access to a personal knowledge base.
Your role is to answer questions based on the provided context from the user's documents.

Guidelines:
- Answer based primarily on the provided context
- If the context doesn't contain enough information, say so clearly
- Cite the source documents when relevant
- Be concise but thorough"""

    def build_context(self, query: str) -> tuple[str, list[dict]]:
        """Retrieve relevant context for a query."""
        results = self.vectorstore.search(query, n_results=self.n_results)
        if not results:
            return "", []

        context_parts = []
        for i, result in enumerate(results, 1):
            source = result['source'].split('/')[-1]
            context_parts.append(
                f"[{i}] Source: {source}\n{result['content']}"
            )
        return "\n\n---\n\n".join(context_parts), results
Streaming Responses
Nobody wants to wait 30 seconds staring at a blank screen. Parallax supports streaming, so we can show tokens as they're generated:
import json
from typing import Generator

import httpx

def _stream_chat(self, payload: dict) -> Generator[str, None, None]:
    """Streaming chat completion."""
    with httpx.Client(timeout=120.0) as client:
        with client.stream("POST", self.api_url, json=payload) as response:
            for line in response.iter_lines():
                if line.startswith("data: "):
                    data_str = line[6:]
                    if data_str.strip() == "[DONE]":
                        break
                    data = json.loads(data_str)
                    content = data['choices'][0].get('delta', {}).get('content', '')
                    if content:
                        yield content
This is the same SSE (Server-Sent Events) format that OpenAI uses, so if you’ve worked with their streaming API before, this will look familiar.
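On the consuming side, the CLI just prints tokens as they arrive, something like this (engine being a RAGEngine instance):

for token in engine.ask("What are my priorities for today?"):
    print(token, end="", flush=True)
print()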
Conversation Memory
A knowledge base isn’t very useful if you have to re-explain context every query. I added a simple conversation history that gets included in subsequent prompts:
from dataclasses import dataclass, field

@dataclass
class ChatSession:
    """Maintains conversation history for multi-turn chat."""

    messages: list[ChatMessage] = field(default_factory=list)
    max_history: int = 10

    def get_history_summary(self) -> str:
        """Get a summary of recent conversation for context."""
        summary_parts = []
        for msg in self.messages[-6:]:  # Last 3 exchanges
            prefix = "User" if msg.role == "user" else "Assistant"
            content = msg.content[:200] + "..." if len(msg.content) > 200 else msg.content
            summary_parts.append(f"{prefix}: {content}")
        return "\n".join(summary_parts)
Now I can ask “what’s the deadline for Mike’s project?” and follow up with “what did we discuss in the last meeting?” without re-specifying who Mike is.
The Chief of Staff Experience
This is where it gets interesting. With all my client data indexed (emails, meeting notes, CRM exports, project briefs) I can just ask natural questions:
> What are my priorities for today?
Based on your documents, here are today's priorities:
1. **Acme Corp** - API integration deadline tomorrow. John emailed about
OAuth token issues that need resolving before the Phase 2 delivery.
2. **TechStart Inc** - Proposal due Friday for the mobile app project.
Sarah wants a follow-up on the architecture recommendations.
3. **GlobalBank** - URGENT: Mike emailed about the compliance review.
They need the security documentation before end of week.
4. **RetailMax** - Payment gateway integration due next week.
Lisa asked about supporting Apple Pay in addition to Stripe.
This is just semantic search + LLM reasoning over my actual documents.
I can drill down:
> What did Mike say about the compliance issue?
According to your email from Mike (Dec 4), GlobalBank's compliance team
has approved the project to move forward. He needs you to confirm Q1
availability by December 10th and will send updated security requirements
by December 15th.
Or get synthesis across sources:
> What's the status of all my active projects?
You have 4 active projects:
1. **Acme Corp** - API Integration, in progress, $150,000 (deadline Dec 15)
2. **TechStart Inc** - Website Redesign, complete, awaiting final invoice
3. **GlobalBank** - Mobile App, on hold until January, $280,000
4. **RetailMax** - E-commerce Platform, in progress, $95,000
Next Steps
There are lots of directions I can take this. For starters, I’m using a pretty small model, and it still works great for answering questions about my projects, identifying my priorities and keeping me on track, and even writing emails.
But I can build it into something more powerful that helps me actually execute on projects and deliver client work. I just need to add more nodes to the cluster and run a bigger model.
As I continue building it out, I’ll add to this post. Sign up below to stay tuned!