I was talking to a friend recently about an idea he had for generating AI podcasts in the format of How I Built This. He wanted to be able to just enter the name of a company and get a podcast on all the details of how it was started, on demand.
One way I’d build a system like this: first run deep research on the company, then turn it all into an engaging podcast script, and finally convert that script into audio with a voice AI.
The weakest link in that system is the voice AI. More specifically: how do you generate a voice that can keep listeners engaged for an hour? And how do you do it cost-effectively?
That’s what drew me to Cartesia. Their most recent model sounds very lifelike (especially in English; the other languages feel a bit flat), with the ability to play with speed and emotion. And after meeting the CEO at a recent meetup, I decided to play around with it.
This project is a simplified version of my friend’s idea where you can put in the URL to a blog post and it generates a podcast based on that. I’m going to be generating them in my voice so that I can turn this blog into a podcast.
What We’re Building
The system has three distinct stages:
Content Extraction → Scrape and clean article text from any URL
Script Generation → Use AI to reformat content for spoken delivery
Voice Synthesis → Convert the script to ultra-realistic speech with Cartesia
Each stage has a single, well-defined responsibility. This separation matters because it makes the system testable, debuggable, and extensible. Want to add multi-voice support? Just modify the voice synthesis stage. Need better content extraction? Swap out the scraper without touching anything else.
The data flow looks like this:
URL → ContentFetcher → {title, content} → ContentProcessor → {script} → AudioGenerator → audio.wav

Let’s build it.
Setting Up The Project
You’ll need API keys for:
- Cartesia (get one here) – The star of the show
- OpenAI (get one here) – For script generation
- Firecrawl (get one here) – Optional but recommended for better content extraction
Store these in a .env file:
CARTESIA_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
FIRECRAWL_API_KEY=your_key_here  # optional

And then install dependencies:
pip install cartesia openai python-dotenv requests beautifulsoup4 firecrawl-py

Now let’s build the pipeline, starting with content extraction.
Stage 1: Content Extraction
The first challenge is getting clean article text from arbitrary URLs. This is harder than it sounds because every website structures content differently. Some use <article> tags, others use <div class="content">, and some wrap everything in JavaScript frameworks that require browser rendering.
I use Firecrawl for all scraping needs. It’s an AI-powered scraper that intelligently identifies main content and handles all the other messy stuff out of the box.
It’s a paid product, so if you want a free alternative, BeautifulSoup works.
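If you go the free route, a minimal fallback can be sketched with requests and BeautifulSoup. The tag selection below is an assumption (every site structures content differently), so treat it as a starting point:

```python
import requests
from bs4 import BeautifulSoup

def parse_article(html: str) -> dict:
    """Pull the title and main text out of raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop non-content elements before extracting text
    for tag in soup(["script", "style", "nav", "footer", "header", "aside"]):
        tag.decompose()
    title = soup.title.get_text(strip=True) if soup.title else "Untitled"
    # Prefer an <article> tag if present, otherwise fall back to the whole body
    article = soup.find("article") or soup.body
    content = article.get_text(separator="\n", strip=True) if article else ""
    return {"title": title, "content": content}

def fetch_with_beautifulsoup(url: str) -> dict:
    """Basic fallback: fetch the page and parse it."""
    response = requests.get(url, timeout=30, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()
    return parse_article(response.text)
```

This won’t handle JavaScript-rendered pages, which is exactly where Firecrawl earns its keep.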
I won’t go into how either of these works as I’ve covered them before. The main implementation, a ContentFetcher that fetches and extracts content from the input URL, lives in content_fetcher.py:
import os
from typing import Dict

try:
    from firecrawl import FirecrawlApp
    FIRECRAWL_AVAILABLE = True
except ImportError:
    FIRECRAWL_AVAILABLE = False

class ContentFetcher:
    def __init__(self):
        self.firecrawl_api_key = os.getenv("FIRECRAWL_API_KEY")
        self.firecrawl_client = None
        if FIRECRAWL_AVAILABLE and self.firecrawl_api_key:
            self.firecrawl_client = FirecrawlApp(api_key=self.firecrawl_api_key)
            print("Using Firecrawl for enhanced content extraction")

    def fetch(self, url: str) -> Dict[str, str]:
        """Fetch content from URL with automatic fallback."""
        print(f"Fetching content from: {url}")
        # Try Firecrawl first if available
        if self.firecrawl_client:
            try:
                return self._fetch_with_firecrawl(url)
            except Exception as e:
                print(f"Firecrawl failed: {e}, falling back to basic scraping")
        # BeautifulSoup fallback (implementation omitted here)
        return self._fetch_with_beautifulsoup(url)

Stage 2: Script Generation with OpenAI
Now we have article text, but it’s not podcast-ready yet. Written content and spoken content are fundamentally different mediums:
- Written: Can reference images (“As shown in Figure 1…”)
- Spoken: Must describe everything verbally
- Written: Readers can re-read complex sentences
- Spoken: Listeners need shorter, clearer phrasing
- Written: Acronyms like “API” are fine
- Spoken: Need to be spelled out or expanded
This is where AI comes in. Rather than manually rewriting articles, we can use a model like Sonnet 4.5 or GPT-5 (though I’m using GPT-4o here because it’s cheaper) to automatically transform content into podcast-friendly scripts.
import os
from openai import OpenAI

class ContentProcessor:
    """Processes content into podcast format using AI."""

    def __init__(self, config: PodcastConfig):
        self.config = config
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    def process(self, title: str, content: str) -> dict:
        summary = self._generate_summary(title, content)
        # Format main content as podcast script
        main_script = self._format_for_podcast(title, content)
        # Build full script with intro/outro
        intro = INTRO_TEMPLATE.format(title=title, summary=summary)
        full_script = f"{intro}\n\n{main_script}\n\n{OUTRO_TEMPLATE}"
        return {
            'full_script': full_script,
            'word_count': len(full_script.split())
        }

    def _generate_summary(self, title: str, content: str) -> str:
        """Create engaging 2-3 sentence summary."""
        prompt = f"Create a 2-3 sentence summary of this article:\n\nTitle: {title}\n\n{content[:3000]}"
        response = self.client.chat.completions.create(
            model=self.config.ai_model,
            messages=[
                {"role": "system", "content": "You create engaging podcast introductions."},
                {"role": "user", "content": prompt}
            ],
            temperature=self.config.temperature,
            max_tokens=200
        )
        return response.choices[0].message.content.strip()

    def _format_for_podcast(self, title: str, content: str) -> str:
        """Format article as podcast script."""
        word_count = self.config.estimated_word_count
        prompt = CONTENT_FORMATTING_PROMPT.format(
            word_count=word_count,
            title=title,
            content=content
        )
        response = self.client.chat.completions.create(
            model=self.config.ai_model,
            messages=[
                {"role": "system", "content": "You are an expert podcast script writer."},
                {"role": "user", "content": prompt}
            ],
            temperature=self.config.temperature,
            max_tokens=word_count * 2
        )
        return response.choices[0].message.content.strip()

Aside from the main script, we’re generating a summary that acts as our intro. Most of this is boilerplate OpenAI calls.
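The PodcastConfig, INTRO_TEMPLATE, and OUTRO_TEMPLATE referenced above aren’t shown in the snippet. Here’s one possible shape for them; the default values and the template wording are my assumptions, so adjust to taste:

```python
from dataclasses import dataclass

@dataclass
class PodcastConfig:
    """Settings shared across the pipeline (values here are illustrative)."""
    ai_model: str = "gpt-4o"
    temperature: float = 0.7
    estimated_word_count: int = 1200  # roughly 8 minutes at ~150 words per minute
    output_dir: str = "output"

# Intro/outro used by ContentProcessor.process (wording is a placeholder)
INTRO_TEMPLATE = "Welcome to the show. Today we're talking about {title}. {summary}"
OUTRO_TEMPLATE = "That's it for today's episode. Thanks for listening!"
```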
The heavy lifting is done by the prompt. We’re asking OpenAI to convert the article into a script, and also to insert SSML-style (Speech Synthesis Markup Language) markup: emotion tags like [laughter], plus pauses and breaks.
I’ll explain more about this below. For now just use this sample prompt:
CONTENT_FORMATTING_PROMPT = """
You are a podcast script writer. Convert the following article into an engaging podcast script with natural emotional expression and pacing.
Requirements:
- Target length: approximately {word_count} words
- Write in a conversational, engaging tone suitable for audio
- Remove references to images, videos, or visual elements
- Spell out acronyms on first use
- Use natural speech patterns and transitions
- Break complex ideas into digestible segments
- Maintain the key insights and takeaways from the original content
- Do not add meta-commentary about being a podcast
- Write ONLY the words that should be spoken aloud
- Use short sentences and natural paragraph breaks for pacing
- Vary sentence length to create rhythm and emphasis
SSML TAGS - Use these inline tags to enhance delivery and pacing (Cartesia TTS will interpret them):
EMOTION TAGS - Add natural emotional expression at key moments:
- [laughter] - For genuine humor or lighthearted moments
- <emotion value="excited" /> - When discussing impressive achievements or breakthroughs
- <emotion value="curious" /> - When posing intriguing questions or exploring unknowns
- <emotion value="surprised" /> - For unexpected findings or revelations
- <emotion value="contemplative" /> - During reflective or contemplative passages
PAUSE/BREAK TAGS - Add dramatic pauses for emphasis:
- <break time="0.5s"/> - Short pause (half second) for brief emphasis
- <break time="1s"/> - Medium pause (one second) before important points
- <break time="1.5s"/> - Longer pause for dramatic effect or topic transitions
- Use pauses sparingly (1-3 per script) at natural transition points
Cartesia also supports other SSML tags like <speed ratio="1.2"/> and <volume ratio="0.8"/> to vary the tone for added engagement.
Guidelines:
- Use emotion tags sparingly (2-5 times per script) at natural inflection points
- Use breaks for dramatic pauses before revealing key insights
- Place them where a human speaker would naturally pause or change tone
- They should feel organic, not forced
- Example: "And then something unexpected happened <break time="0.5s"/> <emotion value="surprised" /> the results exceeded all predictions."
- Example: "But here's the fascinating part <break time="1s"/> <emotion value="curious" /> what if we could do this at scale?"
- Example: "After months of research, they discovered <break time="1s"/> a completely new approach."
Article Title: {title}
Article Content: {content}
Generate only the podcast script below, ready to be read aloud:
"""

Stage 3: Voice Synthesis with Cartesia
We finally get to the fun part. Cartesia’s API is straightforward to use, but it offers some powerful features that aren’t immediately obvious from the documentation.
First, let’s make a custom voice. Cartesia comes with plenty of voices, but they also give you the option to clone yours from a 10-second audio sample. And it’s quite good!

Once we do that, we get back an ID which we pass through as a parameter (along with a number of other params) when we call the Cartesia API in audio_generator.py:
with open(output_path, "wb") as audio_file:
    bytes_iter = self.client.tts.bytes(
        model_id="sonic-3",
        transcript=script,
        voice={
            "mode": "id",
            "id": VOICE_ID,  # enter your custom voice ID here
        },
        language="en",
        generation_config={
            "volume": 1.0,  # Volume level (0.5 to 2.0)
            "speed": 0.9,   # Speed multiplier (0.6 to 1.5)
            "emotion": "excited"
        },
        output_format={
            "container": CONTAINER,
            "sample_rate": SAMPLE_RATE,
            "encoding": ENCODING,
        },
    )
    for chunk in bytes_iter:
        audio_file.write(chunk)

Model Selection: sonic-3 vs sonic-turbo
Cartesia offers two models with different trade-offs:
- sonic-3: 90ms latency, highest quality, most emotional range
- sonic-turbo: 40ms latency, faster generation, still excellent quality
For podcast generation, I use sonic-3 because emotional range matters more than latency.
Voice and Generation Parameters
We also pass in our custom voice ID if we have cloned our voice. Cartesia also comes with a number of other voices, each with their own characteristics. Try them out, see which ones you like, and enter those IDs instead.
The more interesting parameters are the volume, speed, and emotion controls. What we’re passing through here are the voice defaults. In the config above I’m making the voice slightly slower than normal, and also giving it an “excited” emotion. Cartesia has dozens of different emotions that you can play with.
But podcast hosts don’t speak in a monotone. They vary the speed and emotion. They pause, they laugh, and more. That’s why we had our script generator insert SSML tags directly into the script.
Example script output:
“And then something unexpected happened <break time="0.5s"/> [surprise] the results exceeded all predictions.”
“But here’s the fascinating part <break time="1s"/> [curiosity] what if we could do this at scale?”
Cartesia’s TTS engine automatically interprets these tags when generating audio. This creates podcast audio that sounds like a human narrator reacting to the material with natural pauses and emotional inflection, rather than just reading prepared text.
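Since the model occasionally emits malformed tags, a quick sanity check on the generated script can catch problems before you pay for synthesis. This is my own addition, not part of Cartesia’s API, and the thresholds just mirror the limits in the prompt above:

```python
import re

def check_ssml(script: str) -> list:
    """Return a list of warnings about suspicious SSML usage in a script."""
    warnings = []
    # Well-formed break tags look like <break time="0.5s"/>
    breaks = re.findall(r'<break time="([\d.]+)s"\s*/>', script)
    if len(breaks) > 3:
        warnings.append(f"{len(breaks)} break tags; the prompt asked for 1-3")
    # Any <break that didn't match the expected self-closing form is suspect
    malformed = len(re.findall(r"<break", script)) - len(breaks)
    if malformed:
        warnings.append(f"{malformed} malformed break tag(s)")
    emotions = re.findall(r'<emotion value="(\w+)"\s*/>', script)
    if len(emotions) > 5:
        warnings.append(f"{len(emotions)} emotion tags; the prompt asked for 2-5")
    return warnings
```

Run it on the script before calling the TTS endpoint and regenerate (or strip the tags) if it returns warnings.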
And that’s how we get our engaging podcast host sound.
Moment Of Truth
And now we get to our moment of truth. Does it work? How does it sound?
You’ll want to create a main.py that takes in a URL as an argument and then passes it through our system:
import argparse
import os

def main() -> int:
    # Argument parsing (the flags here are illustrative)
    parser = argparse.ArgumentParser(description="Generate a podcast from a blog post URL")
    parser.add_argument("url", help="URL of the article to convert")
    parser.add_argument("--save-script", action="store_true", help="Also save the generated script")
    args = parser.parse_args()
    config = PodcastConfig()

    try:
        # Generate podcast
        result = generate_podcast(args.url, config, args)

        # Print success summary
        print("\n" + "="*70)
        print("PODCAST GENERATION COMPLETE!")
        print("="*70)
        print(f"\nTitle: {result['title']}")
        print(f"Audio file: {result['audio_path']}")
        print(f"Script length: {result['word_count']} words")
        if args.save_script:
            script_path = os.path.join(config.output_dir, f"{result['output_name']}_script.txt")
            print(f"Script file: {script_path}")
        print("\nYour podcast is ready to share!")
        print()
        return 0
    except KeyboardInterrupt:
        print("\n\nOperation cancelled by user.")
        return 130

You can then call this via the command line in your terminal and you’ll get a wav file output.
I ran this through my recent blog post on Claude Skills and here’s what I got back:
Not bad right? I think the initial voice sample I recorded to train the custom voice could have been better (clearer, more consistent). And there are some minor script issues that can be sorted out with a better prompt, or perhaps using a better model like GPT-5 or Sonnet 4.5.
But for a POC this is quite good. And Cartesia works out to around 4¢ per minute, which is a lot lower than ElevenLabs and other TTS providers.
What Else Can You Build?
I’m just scratching the surface of Cartesia’s offerings. They have a platform to build end-to-end voice agents that can be deployed in customer support, healthcare, finance, education, and more.
Even with the use case I just showed you, you can build out different types of applications. One way to extend this, for example, is to go back to the original idea of taking in a topic, doing deep research and gathering a ton of content, and then turning all of that into a script and generating a two-person podcast.
Some other TTS ideas:
- Audiobook generation – Convert long-form content to audio
- Accessibility tools – Make written content accessible to visually impaired users
- Language learning – Generate pronunciation examples
- Voice assistants – Create custom voice responses
- Content localization – Generate audio in multiple languages (Cartesia supports 100+ languages)
The three-stage pipeline (extract → process → synthesize) is a general-purpose pattern for text-to-speech automation.
And if you need help building this, let me know!
Get more deep dives on AI
Like this post? Sign up for my newsletter and get notified every time I do a deep dive like this one.
