# Stable Audio

Generate music and sound effects with Stability AI's Stable Audio on CLORE.AI GPUs.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Why Stable Audio?

* **High quality** - 44.1kHz stereo audio generation
* **Variable length** - Generate up to 47 seconds with Stable Audio Open, or up to 3 minutes with Stable Audio 2.0
* **Versatile** - Music, sound effects, ambient sounds
* **Text-to-audio** - Describe what you want to hear
* **Open weights** - Stable Audio Open available

## Model Variants

| Model             | Duration | Quality   | VRAM | License    |
| ----------------- | -------- | --------- | ---- | ---------- |
| Stable Audio Open | 47 sec   | Good      | 8GB  | Open       |
| Stable Audio 2.0  | 3 min    | Excellent | 12GB | Commercial |

## Quick Deploy on CLORE.AI

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
7860/http
```

**Command:**

```bash
pip install stable-audio-tools gradio soundfile && \
python -c "
import gradio as gr
import torch
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond
import soundfile as sf
import tempfile

model, model_config = get_pretrained_model('stabilityai/stable-audio-open-1.0')
model = model.to('cuda')

def generate(prompt, duration, steps, seed):
    conditioning = [{
        'prompt': prompt,
        'seconds_start': 0,
        'seconds_total': duration
    }]

    # seed=-1 asks stable-audio-tools for a random seed; Gradio passes floats, so cast to int
    seed = int(seed) if seed > 0 else -1

    output = generate_diffusion_cond(
        model,
        conditioning=conditioning,
        steps=int(steps),
        cfg_scale=7,
        sample_size=model_config['sample_size'],
        sample_rate=model_config['sample_rate'],
        device='cuda',
        seed=seed
    )

    audio = output[0].T.cpu().numpy()

    with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as f:
        sf.write(f.name, audio, model_config['sample_rate'])
        return f.name

gr.Interface(
    fn=generate,
    inputs=[
        gr.Textbox(label='Prompt'),
        gr.Slider(1, 47, value=10, label='Duration (sec)'),
        gr.Slider(10, 150, value=100, label='Steps'),
        gr.Number(value=-1, label='Seed')
    ],
    outputs=gr.Audio(label='Generated Audio'),
    title='Stable Audio Open'
).launch(server_name='0.0.0.0', server_port=7860)
"
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to the **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Open `https://YOUR_HTTP_PUB_URL` in your browser to reach the Gradio interface started by the command above.
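
A minimal sketch of turning the `http_pub` value copied from **My Orders** into a full HTTPS address. The helper name `service_url` is hypothetical (it is not part of any CLORE.AI tooling), and it assumes the value may or may not already carry a scheme or trailing slash.

```python
def service_url(http_pub: str) -> str:
    """Build the HTTPS address for an http_pub host (hypothetical helper)."""
    host = http_pub.strip()
    # Drop any scheme that was pasted along with the host
    for prefix in ("https://", "http://"):
        if host.startswith(prefix):
            host = host[len(prefix):]
    return "https://" + host.rstrip("/")


print(service_url("abc123.clorecloud.net"))
```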

## Hardware Requirements

| Model             | Minimum GPU   | Recommended   |
| ----------------- | ------------- | ------------- |
| Stable Audio Open | RTX 3070 8GB  | RTX 3090 24GB |
| Stable Audio 2.0  | RTX 3060 12GB | RTX 4090 24GB |

## Installation

```bash
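pip install stable-audio-tools torch torchaudio

# The stabilityai/stable-audio-open-1.0 repo is gated on Hugging Face:
# accept the license on the model page, then log in so the weights can download
huggingface-cli login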
```

## Basic Usage

### Text to Music

```python
import torch
import torchaudio
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond

# Load model
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
model = model.to("cuda")

sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]

# Define what you want
conditioning = [{
    "prompt": "Upbeat electronic dance music with a catchy synth melody, 128 BPM",
    "seconds_start": 0,
    "seconds_total": 30
}]

# Generate
output = generate_diffusion_cond(
    model,
    conditioning=conditioning,
    steps=100,
    cfg_scale=7,
    sample_size=sample_size,
    sample_rate=sample_rate,
    device="cuda"
)

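# Save: output is (batch, channels, samples); torchaudio.save expects (channels, samples)
audio = output[0].to(torch.float32).clamp(-1, 1).cpu()
torchaudio.save("music.wav", audio, sample_rate)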
```
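
With stable-audio-tools, the tensor returned for `sample_size=model_config["sample_size"]` generally spans the model's full window, and `seconds_total` only steers how much of it contains sound, so a shorter clip may end in silence. The sketch below computes where to cut, assuming the 44.1 kHz rate listed above. `samples_for_duration` is a hypothetical helper, not a library function.

```python
def samples_for_duration(seconds: float, sample_rate: int = 44100) -> int:
    """Number of audio frames covering `seconds` at `sample_rate`."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return int(round(seconds * sample_rate))


n = samples_for_duration(30)
print(n)  # 1323000

# With the tensors from the example above, trimming would look like:
# audio = output[0][:, :n]   # keep all channels, first n samples
```

Trimming before saving keeps the file length aligned with the requested `seconds_total`.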

### Sound Effects

```python
conditioning = [{
    "prompt": "Thunderstorm with heavy rain and distant thunder",
    "seconds_start": 0,
    "seconds_total": 20
}]

output = generate_diffusion_cond(
    model,
    conditioning=conditioning,
    steps=100,
    cfg_scale=7,
    sample_size=sample_size,
    sample_rate=sample_rate,
    device="cuda"
)

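# output[0] is already (channels, samples), the layout torchaudio.save expects
torchaudio.save("thunderstorm.wav", output[0].to(torch.float32).clamp(-1, 1).cpu(), sample_rate)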
```

### Ambient Sounds

```python
conditioning = [{
    "prompt": "Peaceful forest ambience with birds singing and gentle wind",
    "seconds_start": 0,
    "seconds_total": 45
}]

output = generate_diffusion_cond(
    model,
    conditioning=conditioning,
    steps=100,
    cfg_scale=7,
    sample_size=sample_size,
    sample_rate=sample_rate,
    device="cuda"
)

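# output[0] is already (channels, samples), the layout torchaudio.save expects
torchaudio.save("forest.wav", output[0].to(torch.float32).clamp(-1, 1).cpu(), sample_rate)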
```

## Prompt Examples

### Music Genres

```python
prompts = {
    "electronic": "Energetic EDM track with deep bass, synth arpeggios, and a driving beat, 130 BPM",
    "jazz": "Smooth jazz piano trio with upright bass and brushed drums, relaxed tempo",
    "rock": "Heavy rock guitar riff with distortion, drums, and bass, powerful and energetic",
    "classical": "Orchestral piece with strings and woodwinds, dramatic and cinematic",
    "ambient": "Atmospheric ambient soundscape with pads and subtle textures, dreamy",
    "hiphop": "Lo-fi hip hop beat with vinyl crackle, mellow piano, and chill drums, 85 BPM"
}
```

### Sound Effects

```python
prompts = {
    "explosion": "Massive explosion with debris and fire, cinematic",
    "footsteps": "Footsteps on gravel, slow walking pace",
    "car": "Sports car engine revving and accelerating",
    "water": "Water splashing and dripping in a cave",
    "wind": "Strong wind howling through mountains",
    "fire": "Crackling campfire with wood popping"
}
```

### Ambient/Background

```python
prompts = {
    "cafe": "Coffee shop ambience with quiet chatter and espresso machine",
    "ocean": "Ocean waves on a sandy beach, seagulls in distance",
    "city": "Busy city street with traffic, horns, and pedestrians",
    "rain": "Gentle rain on window with occasional thunder",
    "space": "Sci-fi spaceship interior hum and beeps"
}
```

## Advanced Options

### Controlling Generation

```python
output = generate_diffusion_cond(
    model,
    conditioning=conditioning,
    steps=150,              # More steps: usually cleaner output, longer runtime
    cfg_scale=7,            # Prompt adherence; 5-10 is a typical range
    sample_size=sample_size,
    sample_rate=sample_rate,
    device="cuda",
    seed=42                 # Reproducible results
)
```
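
User interfaces often deliver the seed as a float, `None`, or a negative number. The sketch below normalizes such input to an integer, assuming stable-audio-tools treats `-1` as "choose a random seed" (an assumption about the library's convention worth checking against your installed version). `normalize_seed` is a hypothetical helper name.

```python
def normalize_seed(value) -> int:
    """Map UI input to an int seed; -1 stands for 'random' (assumed convention)."""
    if value is None:
        return -1
    seed = int(value)
    # Any negative value is folded into the single "random" marker
    return seed if seed >= 0 else -1


print(normalize_seed(42.0))  # 42
print(normalize_seed(-5))    # -1
```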

### Variable Length

```python
# Short sound effect (5 seconds)
conditioning = [{
    "prompt": "Door creaking open slowly",
    "seconds_start": 0,
    "seconds_total": 5
}]

# Medium clip (30 seconds)
conditioning = [{
    "prompt": "Upbeat rock music",
    "seconds_start": 0,
    "seconds_total": 30
}]

# Maximum length (47 seconds for Open)
conditioning = [{
    "prompt": "Ambient electronic music, evolving textures",
    "seconds_start": 0,
    "seconds_total": 47
}]
```
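
The three conditioning blocks above differ only in prompt and length, so they can be built by one function. This is a minimal sketch with a hypothetical helper, `make_conditioning`, which clamps the duration to the 47-second limit of Stable Audio Open. The 1-second lower bound is an assumption chosen to match the Gradio slider used in this guide.

```python
MAX_SECONDS_OPEN = 47  # Stable Audio Open limit stated in this guide


def make_conditioning(prompt: str, seconds_total: float,
                      max_seconds: float = MAX_SECONDS_OPEN) -> list:
    """Return the conditioning list with the duration clamped to a valid range."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    seconds = min(max(float(seconds_total), 1.0), float(max_seconds))
    return [{"prompt": prompt, "seconds_start": 0, "seconds_total": seconds}]


print(make_conditioning("Door creaking open slowly", 5))
print(make_conditioning("Ambient electronic music, evolving textures", 120))
```

The second call asks for 120 seconds and is clamped to 47.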

## Batch Generation

```python
import os

prompts = [
    "Energetic drum and bass track",
    "Calm piano melody",
    "Sci-fi laser sound effects",
    "Rain on a tin roof"
]

output_dir = "./audio_output"
os.makedirs(output_dir, exist_ok=True)

for i, prompt in enumerate(prompts):
    conditioning = [{
        "prompt": prompt,
        "seconds_start": 0,
        "seconds_total": 15
    }]

    output = generate_diffusion_cond(
        model,
        conditioning=conditioning,
        steps=100,
        cfg_scale=7,
        sample_size=sample_size,
        sample_rate=sample_rate,
        device="cuda"
    )

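    # output[0] is already (channels, samples), the layout torchaudio.save expects
    torchaudio.save(f"{output_dir}/audio_{i}.wav", output[0].to(torch.float32).clamp(-1, 1).cpu(), sample_rate)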
    print(f"Generated: {prompt[:30]}...")

    torch.cuda.empty_cache()
```
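
Numbered names such as `audio_0.wav` make it hard to tell clips apart later. As one possible alternative, the sketch below derives a filesystem-safe name from each prompt. `prompt_to_filename` is a hypothetical helper and the 40-character cap is an arbitrary choice.

```python
import re


def prompt_to_filename(index: int, prompt: str, max_len: int = 40) -> str:
    """Turn a prompt into a zero-padded, filesystem-safe WAV filename."""
    # Collapse every run of non-alphanumeric characters into one underscore
    slug = re.sub(r"[^a-z0-9]+", "_", prompt.lower()).strip("_")
    slug = slug[:max_len].rstrip("_") or "audio"
    return f"{index:03d}_{slug}.wav"


print(prompt_to_filename(0, "Energetic drum and bass track"))
# 000_energetic_drum_and_bass_track.wav
```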

## Gradio Web Interface

```python
import gradio as gr
import torch
import torchaudio
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond
import tempfile

model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
model = model.to("cuda")

sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]

def generate_audio(prompt, duration, steps, cfg_scale, seed):
    conditioning = [{
        "prompt": prompt,
        "seconds_start": 0,
        "seconds_total": duration
    }]

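    # stable-audio-tools uses seed=-1 for a random seed; Gradio sliders and number boxes pass floats
    generator_seed = int(seed) if seed > 0 else -1

    output = generate_diffusion_cond(
        model,
        conditioning=conditioning,
        steps=int(steps),
        cfg_scale=cfg_scale,
        sample_size=sample_size,
        sample_rate=sample_rate,
        device="cuda",
        seed=generator_seed
    )

    # output[0] is already (channels, samples), the layout torchaudio.save expects
    audio = output[0].to(torch.float32).clamp(-1, 1).cpu()

    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        torchaudio.save(f.name, audio, sample_rate)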
        return f.name

demo = gr.Interface(
    fn=generate_audio,
    inputs=[
        gr.Textbox(label="Prompt", placeholder="Describe the audio you want..."),
        gr.Slider(1, 47, value=15, step=1, label="Duration (seconds)"),
        gr.Slider(20, 200, value=100, step=10, label="Steps"),
        gr.Slider(1, 15, value=7, step=0.5, label="CFG Scale"),
        gr.Number(value=-1, label="Seed (-1 for random)")
    ],
    outputs=gr.Audio(label="Generated Audio", type="filepath"),
    title="Stable Audio Open - Text to Audio",
    description="Generate music and sound effects from text descriptions. Running on CLORE.AI.",
    examples=[
        ["Upbeat electronic dance music with synths, 128 BPM", 20, 100, 7, 42],
        ["Thunderstorm with heavy rain", 15, 100, 7, 123],
        ["Peaceful piano melody, emotional", 30, 100, 7, 456]
    ]
)

demo.launch(server_name="0.0.0.0", server_port=7860)
```

## Performance

| Duration | Steps | GPU      | Time  |
| -------- | ----- | -------- | ----- |
| 10 sec   | 100   | RTX 3090 | \~15s |
| 10 sec   | 100   | RTX 4090 | \~10s |
| 30 sec   | 100   | RTX 3090 | \~40s |
| 30 sec   | 100   | RTX 4090 | \~25s |
| 47 sec   | 100   | RTX 4090 | \~40s |
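
One way to read this table is as a real-time factor: generation time divided by clip length, where values below 1.0 mean the clip is produced faster than it plays. The sketch below only reuses the approximate figures from the table; `realtime_factor` is a hypothetical helper.

```python
# (clip seconds, GPU) -> approximate generation seconds, copied from the table
TIMINGS = {
    (10, "RTX 3090"): 15,
    (10, "RTX 4090"): 10,
    (30, "RTX 3090"): 40,
    (30, "RTX 4090"): 25,
    (47, "RTX 4090"): 40,
}


def realtime_factor(clip_seconds: float, gen_seconds: float) -> float:
    """Generation time per second of audio; below 1.0 is faster than playback."""
    return gen_seconds / clip_seconds


for (clip, gpu), gen in TIMINGS.items():
    print(f"{gpu}, {clip}s clip: {realtime_factor(clip, gen):.2f}x")
```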

## Quality Tips

### Better Music

```python
# Include tempo and style
prompt = "Energetic rock music, electric guitar, drums, bass, 140 BPM, high energy"

# Be specific about instruments
prompt = "Solo acoustic guitar fingerpicking, folk style, warm and intimate"

# Describe mood
prompt = "Melancholic piano piece, minor key, slow tempo, emotional and sad"
```
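
The three tips above (style and tempo, instruments, mood) can be combined in one prompt. The sketch below assembles such a prompt from its parts; `build_music_prompt` is a hypothetical helper, and the comma-separated layout simply mirrors the example prompts in this guide rather than any format required by the model.

```python
def build_music_prompt(style, instruments=(), mood=None, bpm=None):
    """Join style, instruments, mood, and tempo into one comma-separated prompt."""
    parts = [style.strip()]
    parts.extend(i.strip() for i in instruments if i.strip())
    if mood:
        parts.append(mood.strip())
    if bpm is not None:
        parts.append(f"{int(bpm)} BPM")
    return ", ".join(parts)


print(build_music_prompt(
    "Energetic rock music",
    instruments=["electric guitar", "drums", "bass"],
    mood="high energy",
    bpm=140,
))
# Energetic rock music, electric guitar, drums, bass, high energy, 140 BPM
```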

### Better Sound Effects

```python
# Be specific
prompt = "Single gunshot from a rifle, outdoor, echo"

# Include environment
prompt = "Footsteps on wooden floor, indoor, slow pace, creaking"

# Describe texture
prompt = "Fire crackling, large bonfire, wood popping, sparks"
```

## Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU           | Hourly Rate | \~30sec clips/hour |
| ------------- | ----------- | ------------------ |
| RTX 3060 12GB | \~$0.03     | \~50               |
| RTX 3090 24GB | \~$0.06     | \~90               |
| RTX 4090 24GB | \~$0.10     | \~140              |
| A100 40GB     | \~$0.17     | \~200              |

*Prices vary. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*
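
Dividing the hourly rate by the clips per hour gives an approximate cost per 30-second clip. The sketch below applies that arithmetic to the table's approximate figures; actual marketplace prices will differ, and `cost_per_clip` is a hypothetical helper.

```python
# GPU -> (approximate hourly rate in USD, approximate 30-second clips per hour)
RATES = {
    "RTX 3060 12GB": (0.03, 50),
    "RTX 3090 24GB": (0.06, 90),
    "RTX 4090 24GB": (0.10, 140),
    "A100 40GB": (0.17, 200),
}


def cost_per_clip(hourly_rate: float, clips_per_hour: float) -> float:
    """Approximate USD cost of one clip at a given rate and throughput."""
    return hourly_rate / clips_per_hour


for gpu, (rate, clips) in RATES.items():
    print(f"{gpu}: ${cost_per_clip(rate, clips):.5f} per clip")
```

With these figures every GPU in the table comes out below a tenth of a cent per clip.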

## Troubleshooting

### Out of Memory

```python
# Reduce duration
conditioning = [{
    "prompt": prompt,
    "seconds_start": 0,
    "seconds_total": 15  # Instead of 47
}]

# Or run the model in half precision to roughly halve the memory used by the weights
model = model.to(torch.float16)
```

### Poor Quality Output

* Increase steps (150-200)
* Adjust CFG scale (try 5-10)
* Be more specific in prompt
* Try different seeds
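
The suggestions above can be explored systematically with a small parameter sweep. The sketch below only builds the list of settings to try; the value ranges are taken from the bullets, the seeds are arbitrary, and `sweep` is a hypothetical helper.

```python
from itertools import product


def sweep(steps_options=(150, 200), cfg_options=(5, 7, 10), seeds=(1, 2, 3)):
    """Return every combination of steps, CFG scale, and seed as keyword dicts."""
    return [
        {"steps": s, "cfg_scale": c, "seed": seed}
        for s, c, seed in product(steps_options, cfg_options, seeds)
    ]


runs = sweep()
print(len(runs))  # 18
print(runs[0])    # {'steps': 150, 'cfg_scale': 5, 'seed': 1}
```

Each dictionary uses the same keyword names as the `generate_diffusion_cond` calls in this guide, so it could be unpacked into a call with `**run` alongside the other arguments.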

### No Sound / Silence

* Check prompt is descriptive enough
* Avoid very abstract descriptions
* Try known-working prompts first

### Audio Artifacts

* Increase steps
* Lower CFG scale
* Reduce duration
* Check for GPU thermal throttling

## Stable Audio vs Others

| Feature  | Stable Audio | AudioCraft | Bark  |
| -------- | ------------ | ---------- | ----- |
| Music    | Excellent    | Excellent  | Poor  |
| SFX      | Great        | Good       | Poor  |
| Speech   | No           | No         | Yes   |
| Duration | 47s / 3min   | 30s        | 15s   |
| Quality  | 44.1kHz      | 32kHz      | 24kHz |
| Open     | Partial      | Yes        | Yes   |

**Use Stable Audio for:**

* High-quality music generation
* Sound effects for games/video
* Background music
* Ambient soundscapes

## Next Steps

* [AudioCraft](/guides/audio-and-voice/audiocraft-music.md) - Meta's music generation
* [Bark TTS](/guides/audio-and-voice/bark-tts.md) - Voice synthesis
* [Demucs](/guides/audio-and-voice/demucs-separation.md) - Audio separation
* [Whisper](/guides/audio-and-voice/whisper-transcription.md) - Transcription

