I recently worked with a company to help their marketing team set up a custom competitive intelligence system. They’re in a hyper-competitive space and with new AI products sprouting up in their industry every day, the list of companies they keep tabs on is multiplying.
While the overall project is part of a larger build to eventually generate sales enablement content, BI dashboards, and competitive landing pages, I figured I’d share how I built the core piece here.
In this deep-dive tutorial, I’ll show you how to build an automated competitor monitoring system using Firecrawl that not only tracks changes but also turns them into actionable intelligence, all with basic Python code.
Why Firecrawl?
You can absolutely build your own web scraping tool. There are some packages like Beautiful Soup that make it easier. But it’s just annoying. You have to parse complex HTML and handle JS rendering. Your selectors break. You fight anti-bot measures.
And that doesn’t even count the cleaning and structuring of extracted data. Basically, you spend more time maintaining your scraping infrastructure than actually analyzing competitive data.
Firecrawl flips this equation. Instead of battling technical complexity, you describe what you want in plain English. Firecrawl’s AI understands context, handles the technical heavy lifting, and returns clean, structured data.
Out of the box, it provides:
- Automatic JavaScript rendering: No need for Selenium or Puppeteer
- AI-powered extraction: Describe what you want in natural language
- Clean markdown output: No HTML parsing needed
- Built-in rate limiting: Respectful scraping by default
- Structured data extraction: Get JSON data with defined schemas
Think of Firecrawl as having a smart assistant who visits websites for you, understands what’s important, and returns exactly the data you need.
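To make that concrete, here’s roughly what a single call looks like with the firecrawl-py SDK. This is a minimal sketch; the exact import and response shape vary a bit between SDK versions, so treat it as illustrative rather than definitive.
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-your-api-key-here")

# One call: Firecrawl renders the page and hands back clean markdown
response = firecrawl.scrape("https://firecrawl.dev/pricing", formats=["markdown"])
print(response.get("markdown", "")[:500])  # first 500 characters of clean content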
The Solution Architecture
The system has four core components working together.
- The Data Extractor acts like a research librarian, systematically gathering information from target sources and organizing it consistently.
- The Change Detector functions like an analyst, comparing new information against historical data to identify what’s different and why it matters.
- The Report Generator serves as a communications specialist, transforming technical changes into business insights that inform decision-making.
- The Storage Layer works like an institutional memory, maintaining historical context that enables trend analysis and pattern recognition.
We’re going to build this as a one-directional, pre-defined pipeline, but if you wanted to make it agentic, each of these components would become a sub-agent.
For this tutorial, we’ll monitor Firecrawl’s own website as our “competitor.” This gives us a real, working example that you can run immediately while learning the concepts. The techniques transfer directly to monitoring actual competitors.
Prerequisites and Setup
Before we start coding, let’s ensure you have everything needed:
# Check Python version (need 3.9+)
python --version
# Create project directory
mkdir competitor-research
cd competitor-research
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install firecrawl-py python-dotenv deepdiff
Understanding Our Dependencies
Each dependency serves a specific purpose in our intelligence pipeline.
- firecrawl-py provides the official Python SDK for Firecrawl’s API, abstracting away the complexity of web scraping and data extraction.
- python-dotenv manages environment variables securely, ensuring API keys never end up in your codebase.
- deepdiff offers intelligent comparison of complex data structures, understanding that changing the order of items in a list might not be meaningful while changing their content definitely is.
Create a .env file for your API key:
FIRECRAWL_API_KEY=fc-your-api-key-here
Get your free API key at firecrawl.dev. The free tier provides 500 pages per month, which is plenty for experimentation and learning the system.
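With python-dotenv, the key gets loaded from that file at runtime, so it never needs to be hard-coded (the Firecrawl client class here mirrors the usage elsewhere in this tutorial):
import os
from dotenv import load_dotenv
from firecrawl import Firecrawl

load_dotenv()  # reads the .env file in the current directory
firecrawl = Firecrawl(api_key=os.getenv("FIRECRAWL_API_KEY"))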
Step 1: Configuration Design
Let’s start by defining what we want to monitor. This configuration is the brain of our system. It tells our extractor what to look for and how to interpret it. Think of this as programming your research assistant’s knowledge about what matters in competitive intelligence.
We’re hard-coding Firecrawl’s pages for the purposes of this demo, but you can of course extend this to take in other competitor URLs dynamically.
Create config.py:
MONITORING_TARGETS = {
"pricing": {
"url": "https://firecrawl.dev/pricing",
"description": "Pricing plans and tiers",
"extract_schema": {
"type": "object",
"properties": {
"plans": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"price": {"type": "string"},
"pages_per_month": {"type": "string"},
"features": {"type": "array", "items": {"type": "string"}}
}
}
}
}
}
},
"blog": {
"url": "https://firecrawl.dev/blog",
"description": "Latest blog posts",
"extract_prompt": "Extract the titles, dates, and summaries of the latest blog posts"
}
}
Design Decision: Schema vs Prompt Extraction
Notice we’re using two different extraction methods. Each approach serves different competitive intelligence needs, and understanding when to use which method is crucial for effective monitoring.
Schema-based extraction (for the pricing page) works like filling out a standardized form. You define exactly what fields you expect and what types of data they should contain. This approach provides consistent structure across extractions, guarantees specific fields will be present or explicitly null, enables reliable numerical comparisons for metrics like prices, and works best when you know exactly what data structure to expect.
Prompt-based extraction (for the blog) operates more like asking a smart assistant to summarize what they observe. You describe what you’re looking for in natural language, and the AI adapts to whatever it finds. This approach offers flexibility for varied content, adapts to different page layouts without breaking, handles content that might have varying formats, and uses natural language understanding to capture nuanced information.
The choice between these methods depends on your competitive intelligence goals. Use schema extraction when you need to track specific metrics over time, compare numerical data across competitors, or ensure consistency for automated analysis. Use prompt extraction when monitoring diverse content types, tracking qualitative changes, or exploring new areas where you’re not sure what data might be valuable.
Step 2: Building the Data Extraction Engine
Now let’s build the component that actually fetches our competitive intelligence data. First, we define how we want to store our data:
def _setup_database(self):
"""Create database and tables if they don't exist."""
os.makedirs(os.path.dirname(DATABASE_PATH), exist_ok=True)
conn = sqlite3.connect(DATABASE_PATH)
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE IF NOT EXISTS snapshots (
id INTEGER PRIMARY KEY AUTOINCREMENT,
target_name TEXT NOT NULL,
url TEXT NOT NULL,
data TEXT NOT NULL,
markdown TEXT,
extracted_at TIMESTAMP NOT NULL,
UNIQUE(target_name, extracted_at)
)
''')
conn.commit()
conn.close()
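For reference, the methods in this step live on a CompetitorExtractor class. Here’s a minimal skeleton showing the imports and constructor the snippets assume (where DATABASE_PATH lives and the exact module layout are assumptions on my part):
import os
import json
import sqlite3
from datetime import datetime
from typing import Dict, Any

from firecrawl import Firecrawl
from config import MONITORING_TARGETS

DATABASE_PATH = "data/snapshots.db"  # assumed location; could also live in config.py

class CompetitorExtractor:
    def __init__(self, api_key: str):
        self.firecrawl = Firecrawl(api_key=api_key)
        self._setup_database()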
Database Design Philosophy
The database design prioritizes simplicity for the purposes of this tutorial. SQLite requires zero configuration, creates a portable single-file database, provides sufficient capability for learning and prototyping, and comes built into Python without additional dependencies.
Our schema intentionally focuses on snapshots rather than normalized relational data. We store both structured data as JSON and raw markdown for maximum flexibility. Timestamps enable historical analysis and trend identification. The unique constraint prevents accidental duplicate snapshots during development.
This design works well for understanding competitive monitoring concepts and prototyping systems with moderate data volumes. However, it has limitations we’ll address in our production considerations section.
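The extraction logic below also calls two storage helpers that aren’t shown: _save_snapshot and get_previous_snapshot. Here’s a rough sketch of both against the schema above (treat the bodies as illustrative):
def _save_snapshot(self, target_name, url, data, markdown, timestamp):
    """Persist one extraction as a row in the snapshots table."""
    conn = sqlite3.connect(DATABASE_PATH)
    cursor = conn.cursor()
    cursor.execute(
        "INSERT OR IGNORE INTO snapshots (target_name, url, data, markdown, extracted_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (target_name, url, json.dumps(data), markdown, timestamp.isoformat())
    )
    conn.commit()
    conn.close()

def get_previous_snapshot(self, target_name):
    """Return the snapshot before the one just saved, or None on the first run."""
    conn = sqlite3.connect(DATABASE_PATH)
    cursor = conn.cursor()
    # OFFSET 1 skips the snapshot that extract_all_targets saved moments ago,
    # so we compare against the previous run rather than the current one.
    cursor.execute(
        "SELECT url, data, markdown, extracted_at FROM snapshots "
        "WHERE target_name = ? ORDER BY extracted_at DESC LIMIT 1 OFFSET 1",
        (target_name,)
    )
    row = cursor.fetchone()
    conn.close()
    if not row:
        return None
    return {
        "url": row[0],
        "data": json.loads(row[1]),
        "markdown": row[2],
        "extracted_at": row[3],
    }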
The Extraction Logic
Let’s now define the logic to extract data from the targets we set up in our config earlier.
def extract_all_targets(self) -> Dict[str, Any]:
"""Extract data from all configured targets."""
results = {}
timestamp = datetime.now()
for target_name, target_config in MONITORING_TARGETS.items():
print(f"Extracting {target_name}...")
try:
# Extract data based on configuration (with change tracking enabled)
if "extract_schema" in target_config:
# Use schema-based extraction
response = self.firecrawl.scrape(
target_config["url"],
formats=[
"markdown",
{
"type": "json",
"schema": target_config["extract_schema"]
}
]
)
extracted_data = response.get("json", {})
elif "extract_prompt" in target_config:
# Use prompt-based extraction
response = self.firecrawl.scrape(
target_config["url"],
formats=[
"markdown",
{
"type": "json",
"prompt": target_config["extract_prompt"]
}
]
)
extracted_data = response.get("json", {})
else:
# Just get markdown
response = self.firecrawl.scrape(
target_config["url"],
formats=["markdown"]
)
extracted_data = {}
markdown_content = response.get("markdown", "")
# Store in results
results[target_name] = {
"url": target_config["url"],
"data": extracted_data,
"markdown": markdown_content,
"extracted_at": timestamp.isoformat()
}
# Save to database
self._save_snapshot(
target_name,
target_config["url"],
extracted_data,
markdown_content,
timestamp
)
print(f"✓ Extracted {target_name}")
except Exception as e:
print(f"✗ Error extracting {target_name}: {str(e)}")
results[target_name] = {
"url": target_config["url"],
"error": str(e),
"extracted_at": timestamp.isoformat()
}
return results
Key Design Patterns for Reliable Extraction
The extraction logic implements several patterns that make the system robust for real-world use.
- Graceful degradation ensures that if one target fails to extract, monitoring continues for other targets. This prevents a single problematic website from breaking your entire competitive intelligence pipeline.
- Multiple format extraction captures both structured data and clean markdown text. The structured data enables automated analysis and comparison, while the markdown provides human-readable context and serves as a backup when structured extraction encounters unexpected page layouts.
- Consistent timestamps ensure all targets in a single monitoring run share the same timestamp, creating coherent snapshots for historical analysis. This prevents timing discrepancies that could confuse change detection.
- Error context preservation stores error information for debugging without crashing the system. This helps you understand why specific extractions fail and improve your monitoring configuration over time.
Understanding Firecrawl’s Response
When Firecrawl processes a page, it returns:
{
"markdown": "# Clean markdown of the page...",
"extract": {
# Your structured data based on schema/prompt
},
"metadata": {
"title": "Page title",
"statusCode": 200,
# ... other metadata
}
}
The markdown output represents the page content cleaned of navigation elements, advertisements, and other visual clutter. This is what makes Firecrawl superior to basic HTML scraping: you get the actual content without the noise. The json field contains your structured data, formatted according to your schema or prompt. The metadata field provides technical details about the extraction process.
Step 3: Intelligent Change Detection
Change detection is where our system provides real value. The goal is to understand which differences matter for competitive decision making.
from deepdiff import DeepDiff
class ChangeDetector:
def detect_changes(self, current, previous):
"""
Compare current snapshot with previous snapshot.
This is where the magic happens - DeepDiff intelligently
compares nested structures and gives us actionable insights.
"""
if not previous:
# First run - establish baseline
return {
"is_first_run": True,
"message": "First extraction - no previous data to compare",
"current_data": current
}
changes = {
"is_first_run": False,
"changes_detected": False,
"summary": [],
"details": {}
}
# Compare structured data if available
if current.get("data") and previous.get("data"):
data_diff = DeepDiff(
previous["data"],
current["data"],
ignore_order=True, # Order changes aren't usually significant
verbose_level=2, # Get detailed change information
exclude_paths=["root['timestamp']"] # Ignore expected changes
)
if data_diff:
changes["changes_detected"] = True
changes["details"]["data_changes"] = self._parse_deepdiff(data_diff)
# Also check for significant content changes
if current.get("markdown") and previous.get("markdown"):
current_len = len(current["markdown"])
previous_len = len(previous["markdown"])
# Threshold of 100 chars filters out minor changes
if abs(current_len - previous_len) > 100:
changes["changes_detected"] = True
changes["details"]["content_change"] = {
"previous_length": previous_len,
"current_length": current_len,
"difference": current_len - previous_len
}
return changes
Why DeepDiff?
Firecrawl does have a built-in change detection feature but it’s still in beta and I didn’t want to take the risk of trying something new with my client. I might update this in the future after I’ve tried it out but for now DeepDiff is a good, free alternative.
It understands the semantic meaning of differences rather than just flagging that something changed. Instead of surfacing every tiny modification and creating noise that obscures important signals, it does the following (there’s a short example after this list):
- Handles Nested Structures: Pricing plans often have nested features, tiers, etc.
- Ignores Irrelevant Changes: Array order changes don’t trigger false positives
- Provides Change Context: Tells us not just what changed, but where in the structure
- Makes Type-Aware Comparison: Knows that the string “100” and the integer 100 might represent the same value in different contexts
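Here’s that short example, with made-up values, showing the kind of output DeepDiff hands back:
from deepdiff import DeepDiff

previous = {
    "trial": "7 days",
    "plans": [{"name": "Starter", "price": "$29/mo"}],
}
current = {
    "trial": "14 days",
    "plans": [
        {"name": "Starter", "price": "$29/mo"},
        {"name": "Growth", "price": "$99/mo"},
    ],
}

diff = DeepDiff(previous, current, ignore_order=True, verbose_level=2)
print(diff)
# Roughly:
# {'values_changed': {"root['trial']": {'new_value': '14 days', 'old_value': '7 days'}},
#  'iterable_item_added': {"root['plans'][1]": {'name': 'Growth', 'price': '$99/mo'}}}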
Parsing DeepDiff Output
DeepDiff returns changes in categories that we need to interpret and parse:
- values_changed: Modified values (price changes, text updates)
- iterable_item_added: New items in lists (new features, plans)
- iterable_item_removed: Removed items (discontinued features)
- dictionary_item_added: New fields (new data points)
- dictionary_item_removed: Removed fields (deprecated info)
def _parse_deepdiff(self, diff):
parsed = {}
# Value modifications - most common and important
if "values_changed" in diff:
parsed["modified"] = []
for path, change in diff["values_changed"].items():
parsed["modified"].append({
"path": self._clean_path(path),
"old_value": change["old_value"],
"new_value": change["new_value"]
})
# New items - often indicates new features or products
if "iterable_item_added" in diff:
parsed["added"] = []
for path, value in diff["iterable_item_added"].items():
parsed["added"].append({
"path": self._clean_path(path),
"value": value
})
# Removed items - could indicate discontinued offerings
if "iterable_item_removed" in diff:
parsed["removed"] = []
for path, value in diff["iterable_item_removed"].items():
parsed["removed"].append({
"path": self._clean_path(path),
"value": value
})
return parsed
def _clean_path(self, path):
"""
Convert DeepDiff's technical paths to readable descriptions.
Example: "root['plans'][2]['price']" becomes "plans.2.price"
"""
path = path.replace("root", "")
path = path.replace("[", ".").replace("]", "")
path = path.replace("'", "")
return path.strip(".")
The Importance of Thresholds
Notice the 100-character threshold for content changes. This is intentional because not all changes are worth acting on. Small modifications like fixing typos or adjusting formatting create noise that distracts from meaningful signals. Significant changes like new sections, removed features, or substantial content additions indicate strategic shifts worth investigating.
Setting appropriate thresholds requires understanding your competitive landscape. In fast-moving markets, you might want lower thresholds to catch early signals. In stable industries, higher thresholds prevent alert fatigue from minor updates.
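One simple refinement is to pull the threshold into config.py so it can be tuned per deployment. This is a hypothetical setting, not part of the code above; you’d reference it in detect_changes in place of the literal 100:
# config.py -- hypothetical addition
CONTENT_CHANGE_THRESHOLD = 100  # characters; lower it in fast-moving markets

# in detect_changes, use the setting instead of the hard-coded value:
# if abs(current_len - previous_len) > CONTENT_CHANGE_THRESHOLD: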
Step 4: Creating Actionable Reports
While our change detection system identifies what’s different, the reporter system explains what those differences mean for your competitive position and what actions you should consider taking.
All we’re doing here is sending the information we’ve gathered to OpenAI (or the LLM of your choice) to turn into a report. On our first run, we ask it to generate a baseline of our competitor and then on subsequent runs we ask it to analyze the diffs within that context and produce an actionable report.
Most of this is just prompt engineering. Here are some basic prompts you can start with, but feel free to tweak them for your use case:
system_prompt = """You are a competitive intelligence analyst. Your job is to analyze competitor data and changes, then generate actionable business insights.
Given competitor monitoring data with DETECTED CHANGES, create a professional markdown report that includes:
1. **Executive Summary** - High-level insights and key takeaways
2. **Critical Changes** - Most important changes that require immediate attention
3. **Strategic Implications** - What these changes mean for competitive positioning
4. **Recommended Actions** - Specific steps the business should consider
5. **Market Intelligence** - Broader patterns and trends observed
Focus on business impact, not technical details. Be concise but insightful. Use markdown formatting with appropriate headers and bullet points."""
user_prompt = f"""Analyze this competitor monitoring data and generate a competitive intelligence report focused on CHANGES DETECTED:
**Date:** {timestamp.strftime('%B %d, %Y')}
**Data Overview:**
- Targets monitored: {len(analysis_data['targets_analyzed'])}
- Changes detected: {analysis_data['changes_detected']}
**Detailed Data with Changes:**
```json
{json.dumps(analysis_data, indent=2, default=str)}
```
Please generate a professional competitive intelligence report based on the changes detected. Focus on actionable business insights rather than technical details."""
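These prompts need a thin wrapper that sends them to the LLM and writes the result to disk. Here’s a minimal sketch of an AIReporter built on OpenAI’s Python SDK (pip install openai, with OPENAI_API_KEY in your .env). The class and method names match what main.py expects below, but the model choice and file layout are assumptions:
import os
import json
from datetime import datetime
from openai import OpenAI

class AIReporter:
    """Sketch: turns extraction results and detected changes into a markdown report."""

    def __init__(self, model="gpt-4o-mini"):  # model choice is an assumption
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model

    def generate_report(self, current_results, all_changes, change_summary):
        timestamp = datetime.now()
        analysis_data = {
            "targets_analyzed": list(current_results.keys()),
            "changes_detected": any(
                c.get("changes_detected") for c in all_changes["targets"].values()
            ),
            "changes": all_changes,
            "summary": change_summary,
        }
        # system_prompt and user_prompt are the strings shown above; the user
        # prompt is an f-string, so build it here where timestamp and
        # analysis_data are in scope.
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
        )
        os.makedirs("reports", exist_ok=True)
        report_path = f"reports/report_{timestamp.strftime('%Y%m%d_%H%M%S')}.md"
        with open(report_path, "w") as f:
            f.write(response.choices[0].message.content)
        return report_path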
Running the System
And those are our four components! As I mentioned earlier, I’m building this as part of a larger system for my client, so we have it set up to run automatically at regular intervals, and aside from generating a report (which gets posted to Slack automatically), it also updates other competitive positioning material like landing pages and sales enablement content.
But for the purposes of this demo, we can run it manually from the command line. Create a main.py file to orchestrate the full system:
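main.py needs a handful of imports at the top. The module names for the three classes are assumptions on my part, so adjust them to match wherever you defined each component:
import os
import sys
from datetime import datetime
from dotenv import load_dotenv

from config import MONITORING_TARGETS
from extractor import CompetitorExtractor    # assumed module names --
from change_detector import ChangeDetector   # adjust to your file layout
from reporter import AIReporter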
def main():
"""Main execution function."""
print("=" * 60)
print("Competitor Research Automation with Firecrawl")
print("=" * 60)
# Load environment variables
load_dotenv()
api_key = os.getenv("FIRECRAWL_API_KEY")
if not api_key:
print("\nError: FIRECRAWL_API_KEY not found in environment variables")
print("Please set your API key in a .env file or as an environment variable")
print("Example: export FIRECRAWL_API_KEY='fc-your-key-here'")
sys.exit(1)
print(f"\nRun started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Monitoring {len(MONITORING_TARGETS)} targets\n")
# Initialize components
extractor = CompetitorExtractor(api_key)
detector = ChangeDetector()
reporter = AIReporter()
# Extract current data
print("Extracting current data from targets...\n")
current_results = extractor.extract_all_targets()
# Get previous snapshots for comparison
previous_snapshots = {}
for target_name in MONITORING_TARGETS.keys():
previous = extractor.get_previous_snapshot(target_name)
if previous:
previous_snapshots[target_name] = previous
# Detect changes
print("\nAnalyzing changes...")
all_changes = detector.detect_all_changes(current_results, previous_snapshots)
# Generate summary
change_summary = detector.summarize_changes(all_changes)
# Display summary in console
print("\nSummary of Changes:")
print("-" * 40)
if change_summary:
for summary_item in change_summary:
print(summary_item)
else:
print("No targets monitored yet.")
# Generate report
print("\nGenerating report...")
report_path = reporter.generate_report(current_results, all_changes, change_summary)
# Final status
print("\n" + "=" * 60)
print("Monitoring Complete!")
print(f"Report saved to: {report_path}")
# Check if this is the first run
if all([changes.get("is_first_run") for changes in all_changes["targets"].values()]):
print("\nThis was the first run - baseline data has been captured.")
print(" Run the script again later to detect changes!")
print("=" * 60)
The initial run serves as the foundation for all future competitive analysis. During this run, the system captures baseline data for each target, establishes the data structure for comparison, creates the storage schema, and validates that extraction works correctly for your chosen targets.
After establishing your baseline, subsequent runs focus on identifying and analyzing changes that inform competitive strategy.
Production Considerations: Understanding System Limitations
While this tutorial creates a functional competitive monitoring system, it’s designed for demonstration and learning rather than enterprise deployment. Understanding these limitations helps you recognize when and how to evolve the system for production use.
Database and Storage Limitations
The SQLite database provides excellent simplicity for learning and prototyping, but it has constraints that affect production scalability. SQLite handles concurrent reads well but struggles with concurrent writes, making it unsuitable for systems that need to extract data from multiple sources simultaneously. The single-file design makes backup and replication more complex than necessary for critical business systems.
For production systems, consider PostgreSQL or MySQL for better concurrency handling and enterprise features. Cloud databases like AWS RDS or Google Cloud SQL provide managed infrastructure, automated backups, and scaling capabilities.
API Rate Limiting and Cost Management
The current system makes API calls sequentially without sophisticated rate limiting or cost optimization. Firecrawl’s pricing scales with usage, so uncontrolled extraction could become expensive quickly. The system doesn’t implement intelligent scheduling based on page change frequency, meaning it might waste API calls on static content.
Production systems should implement adaptive scheduling that checks high-priority targets more frequently, uses exponential backoff for rate limiting, implements cost monitoring and alerts, and caches results when appropriate to reduce redundant API calls.
Error Recovery and Resilience
The current error handling is basic and suitable for development but insufficient for production reliability. Network failures, API timeouts, and parsing errors need more sophisticated handling. The system doesn’t implement retry logic with exponential backoff or distinguish between temporary and permanent failures.
Production systems require comprehensive logging for debugging and monitoring, retry mechanisms for transient failures, circuit breakers to prevent cascading failures, and health checks to monitor system status.
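As a starting point, even a small retry helper around the scrape call goes a long way. Here’s a sketch of simple exponential backoff you could wrap around the self.firecrawl.scrape calls in extract_all_targets (names and defaults are illustrative):
import time

def scrape_with_retry(firecrawl, url, formats, max_retries=3, base_delay=2.0):
    """Retry transient scrape failures with exponential backoff (sketch)."""
    for attempt in range(max_retries):
        try:
            return firecrawl.scrape(url, formats=formats)
        except Exception as e:
            if attempt == max_retries - 1:
                raise  # out of retries -- let the caller handle/log it
            delay = base_delay * (2 ** attempt)
            print(f"Retrying {url} in {delay:.0f}s after error: {e}")
            time.sleep(delay)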
Data Quality and Validation
The tutorial system assumes extracted data is reliable and correctly formatted, but real-world web scraping encounters many data quality issues. Websites change their structure, introduce temporary errors, or modify content in ways that break extraction logic.
Production systems need data validation pipelines that verify extracted data meets expected formats, detect and handle parsing failures gracefully, implement data quality scoring to identify unreliable extractions, and provide alerts when data quality degrades.
Customizing and Extending The System
I’ve only shown you the core functionality of scraping competitors and identifying changes. With this in place as your foundation, there’s a lot you can do to turn this into a powerful competitive intelligence system for your company:
- Alerting system: Integrate with Slack or email to send notifications to different people or teams in your organization based on the type of change.
- Track patterns: Extend the system to track changes over longer periods of time and see patterns.
- Add more data sources: Scrape their ads, social media, and other properties for more insights into their GTM and positioning.
- Integrate with BI: Incorporate competitive data into executive dashboards, combine it with internal metrics, and support strategic planning processes.
- Multi-competitor dashboards: Instead of just generating reports, you can create an interactive dashboard to visualize changes.
- Auto-update your assets: As I’m doing with my client, you can automatically update your competitive positioning assets like landing pages if there’s a significant product or pricing update.
Conclusion: From Monitoring to Intelligence
With tools like Firecrawl, we can abstract away the scraping and monitoring infrastructure and focus on building out an actual intelligence system that suggests and even takes actions for us.
Firecrawl also has a dashboard where you can experiment with the different scraping options and see what comes back. Give it a try and implement the code in your app.
And if you want more tutorials on building useful AI agents, sign up below.
Want to build your own AI agents?
Sign up for my newsletter covering everything from the tools, APIs, and frameworks you need, to building and serving your own multi-step AI agents.