Audio Synthesis Pipelines
ElevenLabs, Stable Audio, and Suno expose APIs for voice, SFX, and music. Here's how to compose them into a production audio pipeline.
Creators · Creative AI · ~23 min read
Three kinds of audio, three kinds of API
Production audio pipelines usually combine three components: voice (narration, dialogue, character voices), SFX (footsteps, explosions, ambient), and music (beds, stings, full tracks). Each has its own API family and each has quirks.
ElevenLabs for voice
ElevenLabs v3 streaming TTS with audio tags.
```python
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_KEY")

# Expressive narration driven by inline audio tags.
# Note: depending on the installed SDK version, this method may be named
# stream() rather than convert_as_stream().
audio_stream = client.text_to_speech.convert_as_stream(
    voice_id="21m00Tcm4TlvDq8ikWAM",
    model_id="eleven_v3",
    text="[whispering] Nobody knows what's behind the door. [excited] But today, we find out!",
    voice_settings={
        "stability": 0.4,  # lower = more expressive
        "similarity_boost": 0.75,
        "style": 0.5,  # higher = more emotional
        "use_speaker_boost": True,
    },
    output_format="mp3_44100_128",
)

# Write chunks to disk as they arrive
with open("narration.mp3", "wb") as f:
    for chunk in audio_stream:
        f.write(chunk)

# Streaming = start playback before full generation completes.
# Critical for conversational apps (sub-500ms first-chunk latency with v3).
```
Stable Audio for SFX and music beds
Stable Audio for an SFX cue. Great for textures and short loops.
```python
import os

import requests

STABILITY_KEY = os.environ["STABILITY_KEY"]  # read the API key from the environment

# Stable Audio 2 via Stability AI API
resp = requests.post(
    "https://api.stability.ai/v2beta/audio/stable-audio-2/generate",
    headers={"Authorization": f"Bearer {STABILITY_KEY}"},
    files={"none": ""},  # forces a multipart/form-data request
    data={
        "prompt": "Heavy wooden door creaking open in a stone hallway, reverberant, dramatic",
        "duration": 6,  # seconds
        "steps": 50,
        "cfg_scale": 7,
        "output_format": "mp3",
        "seed": 42,
    },
)
resp.raise_for_status()  # avoid saving an error body as if it were audio

with open("sfx_door.mp3", "wb") as f:
    f.write(resp.content)

# Stable Audio 2: strong at SFX and short musical loops up to ~3 min
```
Suno for full songs
Suno v5 — async, returns two song variants.
```python
import os

import requests

SUNO_KEY = os.environ["SUNO_KEY"]  # read the API key from the environment

# Suno API (via partner providers like api.box / suno.ai developer portal)
resp = requests.post(
    "https://api.suno.ai/v1/songs",
    headers={"Authorization": f"Bearer {SUNO_KEY}"},
    json={
        "prompt": "Upbeat indie-folk with ukulele, handclaps, joyful female lead",
        "lyrics": "[Verse]\nMorning comes slow in this little town\n[Chorus]\nWe rise / we rise / we rise with the sun",
        "model": "suno-v5",
        "duration": 120,
        "instrumental": False,
    },
)
resp.raise_for_status()
job_id = resp.json()["id"]

# Suno is async: poll the job status until it finishes.
# Typically returns two stereo 48kHz MP3s.
```
Composing the pipeline
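The Suno request above returns only a job id, so any pipeline built on it needs a polling step before it can collect audio. The following is a minimal sketch of such a helper, not the provider's documented client. The status values `"complete"` and `"failed"`, the `error` field, and the `fetch_status` callable are all assumptions for illustration; check the response schema of whichever provider you use.

```python
import time
from typing import Any, Callable, Dict


def poll_until_done(
    fetch_status: Callable[[], Dict[str, Any]],
    *,
    interval_s: float = 2.0,
    max_interval_s: float = 30.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status() until the job reports a terminal state.

    The "status" key and its values are hypothetical placeholders.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_status()
        status = job.get("status")
        if status == "complete":
            return job
        if status == "failed":
            raise RuntimeError(f"generation failed: {job.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(interval_s)
        # Back off gradually so long jobs do not hammer the API.
        interval_s = min(interval_s * 1.5, max_interval_s)
```

In practice `fetch_status` would wrap a `requests.get(...)` call against the provider's job-status endpoint and return the parsed JSON. Injecting it (and `sleep`) keeps the helper testable without network access. With a helper like this available, the pipeline steps are: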
1. Script / story → identify needed audio assets (N voice lines, M SFX, K music cues).
2. Parallel-generate voice (ElevenLabs), SFX (Stable Audio), music (Suno / ElevenMusic).
3. Normalize loudness (common targets: -16 LUFS for spoken-word streaming such as podcasts, -14 LUFS for music apps).
4. Mix programmatically with pydub or ffmpeg, or in a DAW such as Reaper or DaVinci Fairlight.
5. Export aligned to the video timeline (if applicable). Save stems + mix.
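Steps 1 to 3 above can be sketched roughly as follows. This is an illustrative skeleton under stated assumptions, not a definitive implementation: `AssetRequest`, `generate_all`, `gain_to_target_db`, and the stub generators are hypothetical names introduced here, and the stubs return placeholder bytes so the example runs offline. In a real pipeline each generator would wrap the corresponding vendor call.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AssetRequest:
    name: str    # output identifier, e.g. "line_01"
    kind: str    # "voice", "sfx", or "music"
    prompt: str  # text to speak, or a description of the sound


Generator = Callable[[AssetRequest], bytes]


def generate_all(
    asset_requests: List[AssetRequest],
    generators: Dict[str, Generator],
    max_workers: int = 4,
) -> Dict[str, bytes]:
    """Run every request on a thread pool and return {name: audio bytes}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            req.name: pool.submit(generators[req.kind], req)
            for req in asset_requests
        }
        # result() re-raises any exception from the worker thread.
        return {name: fut.result() for name, fut in futures.items()}


def gain_to_target_db(measured_lufs: float, target_lufs: float = -16.0) -> float:
    """Gain in dB that moves a clip from its measured loudness to the target."""
    return target_lufs - measured_lufs


# Placeholder generators so the sketch runs offline; swap in real API calls.
def _stub(kind: str) -> Generator:
    return lambda req: f"{kind}:{req.prompt}".encode()


STUB_GENERATORS = {kind: _stub(kind) for kind in ("voice", "sfx", "music")}

# Step 1: the manifest derived from the script.
manifest = [
    AssetRequest("line_01", "voice", "Nobody knows what's behind the door."),
    AssetRequest("door", "sfx", "Heavy wooden door creaking open"),
    AssetRequest("bed", "music", "Upbeat indie-folk with ukulele"),
]

# Step 2: generate everything concurrently.
assets = generate_all(manifest, STUB_GENERATORS)
```

Threads suit this workload because each task spends most of its time waiting on HTTP. For step 3, the measured loudness has to come from a loudness meter (ffmpeg's `loudnorm` filter is one option); the resulting gain can then be applied with pydub's `apply_gain`.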
Programmatic mix with pydub — ducking, overlay, normalize.
```python
from pydub import AudioSegment
from pydub.effects import normalize

# Load assets
narration = AudioSegment.from_mp3("narration.mp3")
sfx_door = AudioSegment.from_mp3("sfx_door.mp3")
music_bed = AudioSegment.from_mp3("music_bed.mp3")

INTRO_MS, OUTRO_MS = 3000, 2000
voice_end_ms = INTRO_MS + len(narration)
total_ms = voice_end_ms + OUTRO_MS

# overlay() never extends the base segment, so build the full-length bed first.
# Assumes music_bed is at least total_ms long.
intro = music_bed[:INTRO_MS]                       # 3s music intro at full level
under_voice = music_bed[INTRO_MS:voice_end_ms] - 12  # duck 12 dB under narration
outro = music_bed[voice_end_ms:total_ms]           # 2s music outro at full level
output = intro + under_voice + outro

# Place the foreground elements on the timeline
output = output.overlay(narration, position=INTRO_MS)  # narration starts at 3s
output = output.overlay(sfx_door, position=5500)       # door SFX at 5.5s

# Peak-normalize the mix (this is not LUFS loudness normalization)
final = normalize(output)
final.export("scene_1_mix.mp3", format="mp3", bitrate="192k")
```
Compare the options
| Task | Best tool (2026) | Rough cost |
|---|---|---|
| Narration / dialogue | ElevenLabs v3 | $0.30 per 1k chars (Pro) |
| Short SFX cue | Stable Audio 2 | ~$0.01-0.05 per cue |
| Custom song (full) | Suno v5 or ElevenMusic | $0.10-0.50 per song |
| Real-time conversation voice | Cartesia Sonic / ElevenLabs Turbo / 11ai | $0.50-3.00 per hour of dialogue |
| Licensed-data-only music | ElevenMusic | Subscription-based |
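To turn the table into a budget, the rough figures can be plugged into a small estimator. The sketch below uses only the rates quoted in the table; the function name and its parameters are made up for illustration, and real pricing depends on plan and provider.

```python
from typing import Tuple

# Rough rates taken from the comparison table (USD).
VOICE_USD_PER_1K_CHARS = 0.30
SFX_USD_PER_CUE = (0.01, 0.05)   # (low, high)
SONG_USD_PER_SONG = (0.10, 0.50)  # (low, high)


def scene_cost_range_usd(
    narration_chars: int, sfx_cues: int, songs: int
) -> Tuple[float, float]:
    """Return a (low, high) cost estimate for one scene."""
    voice = narration_chars / 1000 * VOICE_USD_PER_1K_CHARS
    low = voice + sfx_cues * SFX_USD_PER_CUE[0] + songs * SONG_USD_PER_SONG[0]
    high = voice + sfx_cues * SFX_USD_PER_CUE[1] + songs * SONG_USD_PER_SONG[1]
    return low, high


low, high = scene_cost_range_usd(narration_chars=2000, sfx_cues=4, songs=1)
print(f"${low:.2f} to ${high:.2f}")
```

For a 2,000-character script with four SFX cues and one song, this works out to roughly $0.74 to $1.30, with narration ($0.60) as the largest single line item at the low end.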
Real-time (streaming) audio
Conversational agents need sub-500ms first-chunk latency. ElevenLabs Turbo v2 / v3 streaming, Cartesia Sonic, and the OpenAI Realtime API all hit this. The architecture is: user speech → STT → LLM token stream → chunked TTS → audio stream. Chunking matters: split the LLM output on clause boundaries, send each clause to TTS as soon as it is complete, and start playback as the audio chunks land.
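The clause-boundary chunking described above can be sketched as a small generator that buffers LLM tokens and emits text whenever a punctuation boundary appears. This is one possible approach, not a vendor-prescribed one: the function name, the punctuation set, and the `min_chars` threshold (which stops very short fragments from being sent to TTS on their own) are all assumptions chosen for illustration.

```python
import re
from typing import Iterable, Iterator

# A boundary is clause or sentence punctuation followed by whitespace.
_BOUNDARY = re.compile(r"[.!?;:,]\s+")


def clause_chunks(tokens: Iterable[str], min_chars: int = 20) -> Iterator[str]:
    """Group streamed tokens into clause-sized chunks for TTS."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            # Find the first boundary that yields a chunk of at least min_chars.
            cut = next(
                (m.end() for m in _BOUNDARY.finditer(buffer) if m.end() >= min_chars),
                None,
            )
            if cut is None:
                break
            yield buffer[:cut].strip()
            buffer = buffer[cut:]
    # Flush whatever is left when the LLM stream ends.
    if buffer.strip():
        yield buffer.strip()


tokens = ["Nobody ", "knows ", "what's ", "behind ", "the ", "door. ",
          "But ", "today, ", "we ", "find ", "out!"]
for chunk in clause_chunks(tokens):
    print(chunk)  # each chunk would be sent to the TTS stream here
```

A larger `min_chars` gives the TTS model more context per request (usually better prosody) at the cost of a later first chunk; a smaller value does the opposite.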
