Audio Synthesis Pipelines

ElevenLabs, Stable Audio, and Suno expose APIs for voice, SFX, and music. Here's how to compose them into a production audio pipeline.

38 min · Reviewed 2026

Three kinds of audio, three kinds of API

Production audio pipelines usually combine three components: voice (narration, dialogue, character voices), SFX (footsteps, explosions, ambient), and music (beds, stings, full tracks). Each has its own API family and each has quirks.

ElevenLabs for voice

from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_KEY")

# Multi-turn, emotional narration
audio_stream = client.text_to_speech.convert_as_stream(
    voice_id="21m00Tcm4TlvDq8ikWAM",
    model_id="eleven_v3",
    text="[whispering] Nobody knows what's behind the door. [excited] But today, we find out!",
    voice_settings={
        "stability": 0.4,     # lower = more expressive
        "similarity_boost": 0.75,
        "style": 0.5,         # higher = more emotional
        "use_speaker_boost": True,
    },
    output_format="mp3_44100_128",
)

with open("narration.mp3", "wb") as f:
    for chunk in audio_stream:
        f.write(chunk)

# Streaming = start playback before full generation completes
# Critical for conversational apps (sub-500ms first-chunk latency with v3)

ElevenLabs v3 streaming TTS with audio tags.

Stable Audio for SFX and music beds

import requests

# Stable Audio 2 via Stability AI API
resp = requests.post(
    "https://api.stability.ai/v2beta/audio/stable-audio-2/generate",
    headers={"Authorization": f"Bearer {STABILITY_KEY}"},
    files={"none": ""},  # dummy file field forces multipart/form-data encoding
    data={
        "prompt": "Heavy wooden door creaking open in a stone hallway, reverberant, dramatic",
        "duration": 6,       # seconds
        "steps": 50,
        "cfg_scale": 7,
        "output_format": "mp3",
        "seed": 42,
    },
)
resp.raise_for_status()  # fail loudly instead of saving an error body as audio
with open("sfx_door.mp3", "wb") as f:
    f.write(resp.content)

# Stable Audio 2: strong at SFX and short musical loops up to ~3 min

Stable Audio for an SFX cue. Great for textures and short loops.

Suno for full songs

import requests

# Suno API (via partner providers like api.box / suno.ai developer portal)
resp = requests.post(
    "https://api.suno.ai/v1/songs",
    headers={"Authorization": f"Bearer {SUNO_KEY}"},
    json={
        "prompt": "Upbeat indie-folk with ukulele, handclaps, joyful female lead",
        "lyrics": "[Verse]\nMorning comes slow in this little town\n[Chorus]\nWe rise / we rise / we rise with the sun",
        "model": "suno-v5",
        "duration": 120,
        "instrumental": False,
    },
)
resp.raise_for_status()
job_id = resp.json()["id"]
# Suno is async: poll the job status. Typically returns two stereo 48kHz MP3s.

Suno v5: async, returns two song variants.
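
Because generation is asynchronous, the caller has to poll until the job finishes. The helper below is a minimal sketch of that loop. It is not tied to any provider: `fetch_status` is a hypothetical callable you supply (for example, a function that GETs the provider's status endpoint for `job_id`), and the `"status"` field with its `"complete"` and `"failed"` values is an assumed payload shape, not a documented one.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], dict],
    interval_s: float = 5.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status() until the job reaches a terminal state.

    Assumes the payload carries a "status" key whose terminal values are
    "complete" or "failed" (hypothetical names; adapt to your provider).
    """
    waited = 0.0
    while True:
        job = fetch_status()
        status = job.get("status")
        if status == "complete":
            return job
        if status == "failed":
            raise RuntimeError(f"generation failed: {job}")
        if waited >= timeout_s:
            raise TimeoutError(f"job not finished after {timeout_s:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

Injecting `sleep` keeps the loop testable without real waiting. A fixed interval is the simplest choice; exponential backoff is a reasonable refinement if the provider rate-limits status calls.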

Composing the pipeline

1. Script / story → identify needed audio assets (N voice lines, M SFX, K music cues).
2. Parallel-generate voice (ElevenLabs), SFX (Stable Audio), music (Suno / ElevenMusic).
3. Normalize loudness (LUFS target: -16 for streaming, -14 for music apps).
4. Mix with pydub, ffmpeg, or Reaper/DaVinci Fairlight programmatically.
5. Export aligned to video timeline (if applicable). Save stems + mix.
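
The first two steps can be sketched as an asset manifest plus a thread pool. This is an illustration under assumptions, not a prescribed design: `AssetRequest`, `GENERATORS`, and the three `generate_*` functions are hypothetical names, and the generator bodies are placeholders where the real API calls would go.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class AssetRequest:
    kind: str    # "voice", "sfx", or "music"
    name: str    # output file stem, e.g. "line_01"
    prompt: str  # text to speak, or a description of the sound


# Placeholder generators: replace each body with the real API call
# and return the path of the file it wrote.
def generate_voice(req: AssetRequest) -> str:
    return f"voice/{req.name}.mp3"


def generate_sfx(req: AssetRequest) -> str:
    return f"sfx/{req.name}.mp3"


def generate_music(req: AssetRequest) -> str:
    return f"music/{req.name}.mp3"


GENERATORS = {"voice": generate_voice, "sfx": generate_sfx, "music": generate_music}


def generate_all(asset_requests: list[AssetRequest], max_workers: int = 4) -> dict[str, str]:
    """Run every request concurrently and return {asset name: output path}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            req.name: pool.submit(GENERATORS[req.kind], req)
            for req in asset_requests
        }
        # result() re-raises any exception from the worker thread.
        return {name: future.result() for name, future in futures.items()}


manifest = [
    AssetRequest("voice", "line_01", "Nobody knows what's behind the door."),
    AssetRequest("sfx", "door_creak", "Heavy wooden door creaking open"),
    AssetRequest("music", "bed_intro", "Tense ambient bed, low strings"),
]
paths = generate_all(manifest)
```

Threads fit here because each call spends its time waiting on the network. Keep `max_workers` modest so you stay within each provider's rate limits.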

from pydub import AudioSegment
from pydub.effects import normalize

# Load assets
narration = AudioSegment.from_mp3("narration.mp3")
sfx_door = AudioSegment.from_mp3("sfx_door.mp3")
music_bed = AudioSegment.from_mp3("music_bed.mp3")

INTRO_MS = 3000  # music alone before narration starts
OUTRO_MS = 2000  # music alone after narration ends
narration_end = INTRO_MS + len(narration)
total_ms = narration_end + OUTRO_MS

# overlay() never extends the base segment, so build a full-length bed first.
# Assumes music_bed.mp3 is at least total_ms long.
bed = music_bed[:total_ms]

# Duck the music under narration (common broadcast technique)
output = (
    bed[:INTRO_MS]                          # 3s music intro at full level
    + (bed[INTRO_MS:narration_end] - 12)    # bed 12 dB lower under narration
    + bed[narration_end:]                   # 2s music outro at full level
)
output = output.overlay(narration, position=INTRO_MS)  # narration starts at 3s
output = output.overlay(sfx_door, position=5500)       # door SFX at 5.5s

# Peak-normalize the mix; pydub's normalize() does not target a LUFS value
final = normalize(output)
final.export("scene_1_mix.mp3", format="mp3", bitrate="192k")

Programmatic mix with pydub: ducking, overlay, normalize.
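
One caveat on the last step: `pydub.effects.normalize` scales to a peak level, so it will not by itself land on the -16 or -14 LUFS targets from the pipeline steps. Hitting a LUFS target means measuring integrated loudness with a separate meter (ffmpeg's `loudnorm` filter and the `pyloudnorm` package are two options) and then applying the difference as gain. The arithmetic is sketched below. `gain_to_target` is a hypothetical helper, and the -1 dBFS ceiling is an assumed safety margin, not a figure from this lesson.

```python
from typing import Optional


def gain_to_target(
    measured_lufs: float,
    target_lufs: float = -16.0,
    peak_dbfs: Optional[float] = None,
    ceiling_dbfs: float = -1.0,
) -> float:
    """Return the gain in dB that moves a mix from measured_lufs to target_lufs.

    If the current peak level is supplied, the gain is capped so the
    boosted peak does not exceed ceiling_dbfs.
    """
    gain_db = target_lufs - measured_lufs
    if peak_dbfs is not None:
        headroom_db = ceiling_dbfs - peak_dbfs
        gain_db = min(gain_db, headroom_db)
    return gain_db


# A mix measured at -23 LUFS needs +7 dB to reach -16 LUFS.
print(gain_to_target(-23.0))
# With peaks already at -4 dBFS, only +3 dB fits under the -1 dBFS ceiling.
print(gain_to_target(-23.0, peak_dbfs=-4.0))
```

The result can be applied with pydub's `AudioSegment.apply_gain(gain_db)`. When the cap kicks in, the mix ends up quieter than the target unless you add a limiter or compressor first.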

Task	Best tool (2026)	Rough cost
Narration / dialogue	ElevenLabs v3	$0.30 per 1k chars (Pro)
Short SFX cue	Stable Audio 2	~$0.01-0.05 per cue
Custom song (full)	Suno v5 or ElevenMusic	$0.10-0.50 per song
Real-time conversation voice	Cartesia Sonic / ElevenLabs Turbo / 11ai	$0.50-3.00 per hour of dialogue
Licensed-data-only music	ElevenMusic	Subscription-based
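
To turn the rough costs above into a per-project budget, multiply each rate by the asset counts from your script. The sketch below uses only the three per-unit rates from the table, which are themselves rough. The function name and the example counts are made up for illustration.

```python
# (low, high) USD per unit, copied from the table above.
RATES = {
    "voice_per_1k_chars": (0.30, 0.30),
    "sfx_per_cue": (0.01, 0.05),
    "song": (0.10, 0.50),
}


def estimate_cost(voice_chars: int, sfx_cues: int, songs: int) -> tuple[float, float]:
    """Return a (low, high) USD estimate for one project."""
    units = {
        "voice_per_1k_chars": voice_chars / 1000,
        "sfx_per_cue": sfx_cues,
        "song": songs,
    }
    low = sum(RATES[key][0] * count for key, count in units.items())
    high = sum(RATES[key][1] * count for key, count in units.items())
    return round(low, 2), round(high, 2)


# Example: 12,000 characters of narration, 20 SFX cues, 3 songs.
low, high = estimate_cost(voice_chars=12_000, sfx_cues=20, songs=3)
print(f"${low:.2f} to ${high:.2f}")  # $4.10 to $6.10
```

Retries and rejected takes are not counted, so treat the high end as a floor for real-world spend.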

Real-time (streaming) audio

Conversational agents need sub-500ms first-chunk latency. ElevenLabs Turbo v2 / v3 streaming, Cartesia Sonic, and OpenAI Realtime API all hit this. Architecture: user speech → STT → LLM token stream → chunked TTS → audio stream. Chunking matters: break LLM output on clause boundaries, send each to TTS as it arrives, play as chunks land.
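
The chunking step can be sketched as a generator that buffers streamed tokens and emits a chunk whenever the buffer ends on clause punctuation. This is a minimal illustration, not a production tokenizer: `clause_chunks`, the punctuation set, and the `min_chars` threshold are assumptions chosen for the example.

```python
from typing import Iterable, Iterator

CLAUSE_END = ".,;:!?"


def clause_chunks(tokens: Iterable[str], min_chars: int = 20) -> Iterator[str]:
    """Group streamed LLM tokens into clause-sized chunks for TTS.

    A chunk is emitted once the buffered text ends in clause punctuation
    and is at least min_chars long; leftover text is flushed at the end.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        text = buffer.strip()
        if len(text) >= min_chars and text[-1] in CLAUSE_END:
            yield text
            buffer = ""
    if buffer.strip():
        yield buffer.strip()


stream = ["Nobody", " knows", " what's", " behind", " the", " door,",
          " but", " today", " we", " find", " out!"]
for chunk in clause_chunks(stream):
    print(chunk)  # each chunk would be sent to the TTS stream here
```

The `min_chars` floor avoids sending very short fragments, which tend to sound choppy. Raising it trades first-chunk latency for smoother prosody. The simple punctuation test also splits on decimal points and abbreviations, so real input may need a stricter boundary rule.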

End-of-lesson check

15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-creative-audio-pipeline-creators

What is the core idea behind "Audio Synthesis Pipelines"?
1. ElevenLabs, Stable Audio, and Suno expose APIs for voice, SFX, and music. Here's how to compose them into a production audio pipeline.
2. design
3. Lay out the exhibition concept and the requested object's role
4. Draw the board on paper based on AI's idea
Which term best describes a foundational idea in "Audio Synthesis Pipelines"?
1. ElevenLabs streaming
2. audio pipeline
3. Stable Audio
4. Suno API
A learner studying Audio Synthesis Pipelines would need to understand which concept?
1. audio pipeline
2. Stable Audio
3. ElevenLabs streaming
4. Suno API
Which of these is directly relevant to Audio Synthesis Pipelines?
1. audio pipeline
2. ElevenLabs streaming
3. Suno API
4. Stable Audio
Which of the following is a key point about Audio Synthesis Pipelines?
1. Script / story → identify needed audio assets (N voice lines, M SFX, K music cues).
2. Parallel-generate voice (ElevenLabs), SFX (Stable Audio), music (Suno / ElevenMusic).
3. Normalize loudness (LUFS target: -16 for streaming, -14 for music apps).
4. Mix with pydub, ffmpeg, or Reaper/DaVinci Fairlight programmatically.
Which of these does NOT belong in a discussion of Audio Synthesis Pipelines?
1. design
2. Script / story → identify needed audio assets (N voice lines, M SFX, K music cues).
3. Parallel-generate voice (ElevenLabs), SFX (Stable Audio), music (Suno / ElevenMusic).
4. Normalize loudness (LUFS target: -16 for streaming, -14 for music apps).
What is the key insight about "Consent records for voice cloning" in the context of Audio Synthesis Pipelines?
1. design
2. Lay out the exhibition concept and the requested object's role
3. If you cloned a voice, keep the verification video, the signed consent, and the revocation policy in cold storage.
4. Draw the board on paper based on AI's idea
What is the key insight about "License-first for music" in the context of Audio Synthesis Pipelines?
1. design
2. Lay out the exhibition concept and the requested object's role
3. Draw the board on paper based on AI's idea
4. Suno's lawsuit outcome is unknown. If your product ships commercial music, strongly consider ElevenMusic (licensed train…
What is the recommended tip about "Use AI as a co-creator" in the context of Audio Synthesis Pipelines?
1. Set creative constraints before generating: tone, length, style reference, POV.
2. design
3. Lay out the exhibition concept and the requested object's role
4. Draw the board on paper based on AI's idea
Which statement accurately describes an aspect of Audio Synthesis Pipelines?
1. design
2. Production audio pipelines usually combine three components: voice (narration, dialogue, character voices), SFX (footsteps, explosions, ambi…
3. Lay out the exhibition concept and the requested object's role
4. Draw the board on paper based on AI's idea
What does working with Audio Synthesis Pipelines typically involve?
1. design
2. Lay out the exhibition concept and the requested object's role
3. Conversational agents need sub-500ms first-chunk latency. ElevenLabs Turbo v2 / v3 streaming, Cartesia Sonic, and OpenAI Realtime API all hi…
4. Draw the board on paper based on AI's idea
Which best describes the scope of "Audio Synthesis Pipelines"?
1. It is unrelated to creative workflows
2. It applies only to the opposite beginner tier
3. It was deprecated in 2024 and no longer relevant
4. It focuses on composing the ElevenLabs, Stable Audio, and Suno APIs for voice, SFX, and music into a production audio pipeline
Which section heading best belongs in a lesson about Audio Synthesis Pipelines?
1. ElevenLabs for voice
2. design
3. Lay out the exhibition concept and the requested object's role
4. Draw the board on paper based on AI's idea
Which section heading best belongs in a lesson about Audio Synthesis Pipelines?
1. design
2. Stable Audio for SFX and music beds
3. Lay out the exhibition concept and the requested object's role
4. Draw the board on paper based on AI's idea
Which section heading best belongs in a lesson about Audio Synthesis Pipelines?
1. design
2. Lay out the exhibition concept and the requested object's role
3. Suno for full songs
4. Draw the board on paper based on AI's idea


Tendril · Creators · Creative AI