Video generation is the most expensive and least controllable AI medium. Even when models like Sora are available, getting useful clips is a craft, and the platform reality keeps shifting.
A still image is one frame. A 10-second clip is hundreds of frames that must agree on what each object looks like, where it is, and how it moves. That coherence problem is why text-to-video models lag image models by a generation, and why running them is so expensive that platforms quietly come and go.
OpenAI's Sora was the highest-profile text-to-video demo of 2024-2025, and its production availability has shifted multiple times. Treat the brand as an ecosystem signal more than a stable SKU; assume access, length limits, and pricing will change. The skills below transfer to whichever video model is currently available: Runway, Veo, Kling, or the next OpenAI release.
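As a rough, platform-agnostic illustration of that transfer, the sketch below structures each shot prompt around a subject, action, camera, lighting, and aesthetic reference (the kind of shot-by-shot pattern the quiz questions below revisit). The exact field breakdown, the `Shot` class, and the 4-second default are illustrative assumptions, not any vendor's API.

```python
# Hypothetical sketch: a platform-agnostic shot list. One prompt per shot,
# rather than one long prompt for the whole scene; submit the strings to
# whichever model you are using today.
from dataclasses import dataclass

@dataclass
class Shot:
    subject: str
    action: str
    camera: str
    lighting: str
    aesthetic: str
    seconds: int = 4  # short clips tend to stay more coherent than long takes

    def prompt(self) -> str:
        return (f"{self.subject} {self.action}, {self.camera}, "
                f"{self.lighting}, {self.aesthetic}")

shots = [
    Shot("a lighthouse keeper", "climbs a spiral staircase",
         "slow tracking shot", "late afternoon golden hour", "shot on 16mm film"),
    Shot("the same keeper", "lights the lamp at the top",
         "static wide shot", "late afternoon golden hour", "shot on 16mm film"),
]

for shot in shots:
    print(shot.prompt())  # swap print() for the current model's generation call
```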
| Failure mode | What you see | Mitigation |
|---|---|---|
| Limb glitching | Hands warp, legs add joints | Avoid close-up on hands; loose clothing helps |
| Text in the scene | Garbled signage, fake letters | Avoid prompts with on-screen text |
| Multi-character consistency | Faces morph across cuts | Generate each character separately and composite |
| Physics violations | Liquids float, objects defy gravity | Keep scenes simple; prefer slow motion |
| Audio mismatch | Generated audio is generic | Replace audio in post |
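If you automate prompt drafting, a pre-flight check can surface these mitigations before you spend render credits. A minimal sketch, assuming simple keyword heuristics; the regex patterns and the `preflight` helper are illustrative, not from the lesson.

```python
# Minimal sketch: flag prompts likely to trigger the failure modes in the
# table above, and echo back the table's mitigation advice.
import re

CHECKS = [
    (r"\b(close[- ]?up on (the )?hands?|fingers)\b",
     "Limb glitching risk: avoid hand close-ups; loose clothing helps."),
    (r"\b(sign|signage|billboard|label|headline|text reading)\b",
     "On-screen text usually comes out garbled; drop it or add it in post."),
    (r"\b(two|three|both) (characters|people)\b.*\b(talk|convers)",
     "Multi-character scenes drift: generate each character separately and composite."),
    (r"\b(pour|splash|spill|explosion|juggl)\w*",
     "Physics-heavy action: keep the scene simple and prefer slow motion."),
]

def preflight(prompt: str) -> list[str]:
    """Return mitigation reminders for any failure modes the prompt may hit."""
    return [advice for pattern, advice in CHECKS
            if re.search(pattern, prompt, re.IGNORECASE)]

for warning in preflight("Close-up on the hands of two people talking by a neon sign"):
    print("WARNING:", warning)
```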
The big idea: video generation is a real production tool today, but it is the most expensive and least stable AI medium. Build your craft on the prompts, not the brand.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-openai-sora-creators
1. A creator wants to generate a scene with a conversation between two characters. What does the lesson recommend to avoid character consistency problems?
2. What mitigation strategy does the lesson suggest for dealing with limb glitching in video generation?
3. What does the lesson identify as the primary issue with including text in video generation prompts?
4. How does the lesson recommend handling scenes where physics violations commonly occur, such as liquids or gravity?
5. The lesson mentions that audio generated by video models is often mismatched with the visual content. What is the recommended solution?
6. What does the lesson say about the computational cost of video generation compared to large language model usage?
7. The lesson describes Sora as 'a moving target' and an 'ecosystem signal' rather than a stable product. What is the main reason for this characterization?
8. Why does the lesson recommend writing separate prompts for each shot in a multi-shot scene rather than one long prompt?
9. What is the advantage of adding a film aesthetic reference like 'shot on 16mm film' to a video generation prompt?
10. The lesson suggests that specific lighting descriptions like 'late afternoon golden hour' produce better results than generic terms like 'sunny'. Why?
11. What community-tested pattern does the lesson say consistently outperforms basic prompts for Sora and similar models?
12. Why might a creator want to generate multiple 4-second clips instead of a single longer take?
13. The lesson recommends treating prompt skills as transferable rather than specific to one platform. What is the rationale given?
14. What failure mode involves characters' faces changing appearance between different cuts or frames in the same scene?
15. When budgeting for a video generation project, what comparison does the lesson use to illustrate the potential costs?