Tendril

Audio Synthesis Pipelines

ElevenLabs, Stable Audio, and Suno expose APIs for voice, SFX, and music. Here's how to compose them into a production audio pipeline.

Three kinds of audio, three kinds of API

Production audio pipelines usually combine three components: voice (narration, dialogue, character voices), SFX (footsteps, explosions, ambience), and music (beds, stings, full tracks). Each has its own API family, and each has its own quirks.

ElevenLabs for voice

ElevenLabs v3 streaming TTS with audio tags.

python

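import os

from elevenlabs.client import ElevenLabs

# Read the key from the environment rather than hard-coding it
client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

# Expressive narration: audio tags such as [whispering] steer the delivery inline.
# Note: convert_as_stream is the 1.x SDK name; newer SDK releases expose the
# same call as text_to_speech.stream(...). Check the version you have installed.
audio_stream = client.text_to_speech.convert_as_stream(
    voice_id="21m00Tcm4TlvDq8ikWAM",
    model_id="eleven_v3",
    text="[whispering] Nobody knows what's behind the door. [excited] But today, we find out!",
    voice_settings={
        "stability": 0.4,     # lower = more expressive
        "similarity_boost": 0.75,
        "style": 0.5,         # higher = more emotional
        "use_speaker_boost": True,
    },
    output_format="mp3_44100_128",
)

# Write chunks to disk as they arrive
with open("narration.mp3", "wb") as f:
    for chunk in audio_stream:
        f.write(chunk)

# Streaming = start playback before full generation completes
# Critical for conversational apps, where the target is sub-500ms to the first
# chunk; measure this for the model you pick rather than assuming it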
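
First-chunk latency is easy to measure for yourself. The sketch below is a minimal, hedged example: `save_stream_with_latency` and `fake_stream` are hypothetical helpers, not part of any SDK, and the fake stream stands in for a real TTS iterator so the code runs without an API key. It assumes only that the stream is an iterable of `bytes`.

```python
import time
from typing import Iterable, Iterator, Tuple


def save_stream_with_latency(chunks: Iterable[bytes], path: str) -> Tuple[float, int]:
    """Write an audio byte stream to `path`.

    Returns (seconds until the first non-empty chunk arrived, total bytes written).
    The latency is NaN if the stream produced no data.
    """
    start = time.perf_counter()
    first_chunk_s = float("nan")
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if not chunk:
                continue  # skip empty keep-alive chunks
            if total == 0:
                first_chunk_s = time.perf_counter() - start
            f.write(chunk)
            total += len(chunk)
    return first_chunk_s, total


def fake_stream() -> Iterator[bytes]:
    """Stand-in for a TTS stream (hypothetical): three 1 KiB chunks, 10 ms apart."""
    for _ in range(3):
        time.sleep(0.01)
        yield b"\x00" * 1024
```

To use it with a real stream, pass the iterator returned by the TTS call instead of `fake_stream()`. Because SDK iterators are usually lazy, the measured time then includes the network round trip, which is the number a conversational app cares about.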

Stable Audio for SFX and music beds

Stable Audio for an SFX cue. Great for textures and short loops.

python

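import os

import requests

# Read the key from the environment (the variable name is our choice)
STABILITY_KEY = os.environ["STABILITY_API_KEY"]

# Stable Audio 2 via Stability AI API.
# Confirm the endpoint path and parameter ranges against the current
# Stability API reference before relying on them.
resp = requests.post(
    "https://api.stability.ai/v2beta/audio/stable-audio-2/generate",
    headers={"Authorization": f"Bearer {STABILITY_KEY}"},
    files={"none": ""},      # forces a multipart/form-data request body
    data={
        "prompt": "Heavy wooden door creaking open in a stone hallway, reverberant, dramatic",
        "duration": 6,       # seconds
        "steps": 50,
        "cfg_scale": 7,
        "output_format": "mp3",
        "seed": 42,          # fixed seed for repeatable output
    },
    timeout=120,
)
resp.raise_for_status()      # fail loudly instead of saving an error body as audio
with open("sfx_door.mp3", "wb") as f:
    f.write(resp.content)

# Stable Audio 2: strong at SFX and short musical loops up to ~3 min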

Suno for full songs

Suno v5 — async, returns two song variants.

python

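import os

import requests

# Read the key from the environment (the variable name is our choice)
SUNO_KEY = os.environ["SUNO_API_KEY"]

# Suno API (via partner providers like api.box / suno.ai developer portal).
# The URL, field names, and response shape vary by provider; treat the ones
# below as placeholders and check your provider's reference.
resp = requests.post(
    "https://api.suno.ai/v1/songs",
    headers={"Authorization": f"Bearer {SUNO_KEY}"},
    json={
        "prompt": "Upbeat indie-folk with ukulele, handclaps, joyful female lead",
        "lyrics": "[Verse]\nMorning comes slow in this little town\n[Chorus]\nWe rise / we rise / we rise with the sun",
        "model": "suno-v5",
        "duration": 120,
        "instrumental": False,
    },
    timeout=60,
)
resp.raise_for_status()
job_id = resp.json()["id"]
# Suno is async: poll the job status until it completes.
# It typically returns two stereo 48kHz MP3s.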
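
The polling step mentioned in the last comment can be sketched generically. This is a hedged illustration, not a provider's API: `poll_until_done` is a hypothetical helper, and the status strings `"complete"` and `"failed"` are assumptions that need to be mapped to whatever your provider actually returns. The status lookup is injected as a callable so the loop can be exercised without a network.

```python
import time
from typing import Any, Callable, Dict


def poll_until_done(
    fetch_status: Callable[[], Dict[str, Any]],
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call `fetch_status` until the job completes, fails, or times out.

    The "status" key and its values are assumed, not taken from any real API.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_status()
        status = job.get("status")
        if status == "complete":
            return job
        if status == "failed":
            raise RuntimeError(f"generation failed: {job.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_s} s")
        sleep(interval_s)
```

In a real pipeline, `fetch_status` would wrap a `requests.get` against the provider's job endpoint using the `job_id` from the submit call. A fixed interval keeps the sketch short; exponential backoff is a reasonable refinement for long jobs.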

Composing the pipeline

1. Script / story → identify needed audio assets (N voice lines, M SFX, K music cues).
2. Parallel-generate voice (ElevenLabs), SFX (Stable Audio), music (Suno / ElevenMusic).
3. Normalize loudness (integrated LUFS target: -16 for spoken-word streaming such as podcasts, -14 for music streaming apps).
4. Mix programmatically with pydub or ffmpeg, or in a DAW such as Reaper or DaVinci Fairlight.
5. Export aligned to video timeline (if applicable). Save stems + mix.
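
Step 2 is mostly I/O-bound waiting on remote APIs, so a thread pool is a natural fit. The sketch below is a minimal illustration under stated assumptions: `AssetRequest`, `generate_all`, and the stub generators are hypothetical names, and the stubs return placeholder bytes where real code would call the voice, SFX, and music APIs shown earlier.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AssetRequest:
    asset_id: str   # unique name, e.g. "line_01" or "sfx_door"
    kind: str       # "voice", "sfx", or "music"
    prompt: str     # text to speak, or a description of the sound


Generator = Callable[[AssetRequest], bytes]


def generate_all(
    requests_: List[AssetRequest],
    generators: Dict[str, Generator],
    max_workers: int = 4,
) -> Dict[str, bytes]:
    """Run every request through the generator for its kind, in parallel.

    Returns {asset_id: audio bytes}. An exception raised by any generator
    propagates when its result is collected.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {r.asset_id: pool.submit(generators[r.kind], r) for r in requests_}
        return {asset_id: fut.result() for asset_id, fut in futures.items()}


def stub_generator(req: AssetRequest) -> bytes:
    """Placeholder for a real API call (hypothetical)."""
    return f"{req.kind}:{req.prompt}".encode()


STUB_GENERATORS: Dict[str, Generator] = {
    "voice": stub_generator,
    "sfx": stub_generator,
    "music": stub_generator,
}
```

Keep `max_workers` at or below each provider's concurrency limit; a separate pool per provider is a simple way to respect different limits.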

Programmatic mix with pydub — ducking, overlay, normalize.

python

from pydub import AudioSegment
from pydub.effects import normalize

# Load assets
narration = AudioSegment.from_mp3("narration.mp3")
sfx_door = AudioSegment.from_mp3("sfx_door.mp3")
music_bed = AudioSegment.from_mp3("music_bed.mp3")

INTRO_MS, OUTRO_MS = 3000, 2000            # 3s music intro, 2s music outro
duck_end = INTRO_MS + len(narration)       # where the narration finishes
total_ms = duck_end + OUTRO_MS

# overlay() never extends the base segment, so lay down a full-length music
# bed first, looping the source if it is shorter than the scene
bed = music_bed
while len(bed) < total_ms:
    bed += music_bed
bed = bed[:total_ms]

# Duck the music 12 dB under the narration only (common broadcast technique);
# the intro and outro stay at full level
bed = bed[:INTRO_MS] + (bed[INTRO_MS:duck_end] - 12) + bed[duck_end:]

# Build the timeline on top of the bed
output = bed.overlay(narration, position=INTRO_MS)   # narration starts at 3s
output = output.overlay(sfx_door, position=5500)     # door SFX at 5.5s

# Peak-normalize the mix. This sets headroom; it is not loudness (LUFS) normalization.
final = normalize(output)
final.export("scene_1_mix.mp3", format="mp3", bitrate="192k")
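
pydub's `normalize` adjusts peak level, so it does not by itself hit the LUFS targets from step 3. One common option is ffmpeg's `loudnorm` filter. The sketch below only builds and runs the command; `loudnorm_command` and `run_loudnorm` are hypothetical helper names, the default true-peak and loudness-range values are illustrative choices, and running it assumes `ffmpeg` is on your PATH.

```python
import subprocess
from typing import List


def loudnorm_command(
    src: str,
    dst: str,
    target_lufs: float = -16.0,   # integrated loudness target (I)
    true_peak_db: float = -1.5,   # true-peak ceiling (TP)
    lra: float = 11.0,            # loudness range target (LRA)
) -> List[str]:
    """Build an ffmpeg argv list that applies single-pass loudnorm."""
    audio_filter = f"loudnorm=I={target_lufs}:TP={true_peak_db}:LRA={lra}"
    return ["ffmpeg", "-y", "-i", src, "-af", audio_filter, dst]


def run_loudnorm(src: str, dst: str, **kwargs: float) -> None:
    """Run the command; raises CalledProcessError if ffmpeg exits non-zero."""
    subprocess.run(loudnorm_command(src, dst, **kwargs), check=True)
```

For example, `run_loudnorm("scene_1_mix.mp3", "scene_1_mix_-16lufs.mp3")` targets -16 LUFS, and passing `target_lufs=-14.0` targets the music-app level. Single-pass `loudnorm` adjusts gain dynamically; a two-pass run that feeds the measured values back in is generally closer to the target.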

Compare the options

Task	Best tool (2026)	Rough cost
Narration / dialogue	ElevenLabs v3	$0.30 per 1k chars (Pro)
Short SFX cue	Stable Audio 2	~$0.01-0.05 per cue
Custom song (full)	Suno v5 or ElevenMusic	$0.10-0.50 per song
Real-time conversation voice	Cartesia Sonic / ElevenLabs Turbo / 11ai	$0.50-3.00 per hour of dialogue
Licensed-data-only music	ElevenMusic	Subscription-based
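
To turn the table into a budget, multiply through by your asset counts. The sketch below is back-of-envelope arithmetic only: the constants are the table's rough figures, not quoted prices, and `estimate_cost` is a hypothetical helper.

```python
from typing import Tuple

# Rough per-unit prices copied from the comparison table (USD)
NARRATION_PER_1K_CHARS = 0.30
SFX_PER_CUE = (0.01, 0.05)        # (low, high)
SONG_PER_TRACK = (0.10, 0.50)     # (low, high)


def estimate_cost(narration_chars: int, sfx_cues: int, songs: int) -> Tuple[float, float]:
    """Return a (low, high) estimate in USD, rounded to cents."""
    narration = narration_chars / 1000 * NARRATION_PER_1K_CHARS
    low = narration + sfx_cues * SFX_PER_CUE[0] + songs * SONG_PER_TRACK[0]
    high = narration + sfx_cues * SFX_PER_CUE[1] + songs * SONG_PER_TRACK[1]
    return round(low, 2), round(high, 2)


# A short episode: 12,000 characters of narration, 20 SFX cues, 3 songs
low, high = estimate_cost(narration_chars=12_000, sfx_cues=20, songs=3)
print(f"${low:.2f} to ${high:.2f}")  # $4.10 to $6.10
```

At these figures narration dominates ($3.60 of the total), so script length is the first thing to trim when a budget is tight.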

Real-time (streaming) audio

Conversational agents need sub-500ms first-chunk latency. ElevenLabs Turbo v2 / v3 streaming, Cartesia Sonic, and OpenAI Realtime API all hit this. Architecture: user speech → STT → LLM token stream → chunked TTS → audio stream. Chunking matters: break LLM output on clause boundaries, send each to TTS as it arrives, play as chunks land.
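
The clause-boundary chunking described above can be sketched as a small generator. This is one possible approach, not a standard algorithm: `clause_chunks` is a hypothetical helper, and both the punctuation set and the `min_chars` threshold are assumptions to tune for your voice and language. It buffers streamed tokens and emits text as soon as a clause ends, so each piece can go to TTS while the LLM is still writing.

```python
import re
from typing import Iterable, Iterator

# Clause-ending punctuation followed by whitespace. Requiring the whitespace
# avoids splitting inside numbers such as "3.5".
_CLAUSE_END = re.compile(r"[.!?;:,]\s")


def clause_chunks(tokens: Iterable[str], min_chars: int = 20) -> Iterator[str]:
    """Group streamed LLM tokens into clause-sized chunks for TTS.

    A chunk is emitted at the first clause boundary that makes it at least
    `min_chars` long, so very short fragments are merged into the next clause.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            cut = None
            for match in _CLAUSE_END.finditer(buffer):
                if match.end() >= min_chars:
                    cut = match.end()
                    break
            if cut is None:
                break
            yield buffer[:cut].strip()
            buffer = buffer[cut:]
    # Flush whatever is left when the LLM stream ends
    if buffer.strip():
        yield buffer.strip()


tokens = ["Nobody ", "knows ", "what's ", "behind ", "the ", "door. ",
          "But ", "today, ", "we ", "find ", "out!"]
for chunk in clause_chunks(tokens):
    print(chunk)
# Nobody knows what's behind the door.
# But today, we find out!
```

A lower `min_chars` gets audio out sooner but produces choppier prosody, since the TTS model sees less context per request; a higher value does the opposite.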
