# Stable Audio

Generate music and sound effects with Stability AI's Stable Audio on CLORE.AI GPUs.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Why Stable Audio?

* **High quality** - 44.1kHz stereo audio generation
* **Variable length** - Generate up to 47 seconds (Open) or 3 minutes (2.0)
* **Versatile** - Music, sound effects, ambient sounds
* **Text-to-audio** - Describe what you want to hear
* **Open weights** - Stable Audio Open available

## Model Variants

| Model             | Duration | Quality   | VRAM | License    |
| ----------------- | -------- | --------- | ---- | ---------- |
| Stable Audio Open | 47 sec   | Good      | 8GB  | Open       |
| Stable Audio 2.0  | 3 min    | Excellent | 12GB | Commercial |

## Quick Deploy on CLORE.AI

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
7860/http
```

**Command:**

```bash
pip install stable-audio-tools gradio soundfile && \
python -c "
import gradio as gr
import torch
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond
import soundfile as sf
import tempfile

model, model_config = get_pretrained_model('stabilityai/stable-audio-open-1.0')
model = model.to('cuda')

def generate(prompt, duration, steps, seed):
    conditioning = [{
        'prompt': prompt,
        'seconds_start': 0,
        'seconds_total': duration
    }]

    output = generate_diffusion_cond(
        model,
        conditioning=conditioning,
        steps=int(steps),
        cfg_scale=7,
        sample_size=model_config['sample_size'],
        sample_rate=model_config['sample_rate'],
        device='cuda',
        seed=int(seed) if seed > 0 else -1  # -1 = random seed in stable-audio-tools
    )

    audio = output[0].T.cpu().numpy()

    with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as f:
        sf.write(f.name, audio, model_config['sample_rate'])
        return f.name

gr.Interface(
    fn=generate,
    inputs=[
        gr.Textbox(label='Prompt'),
        gr.Slider(1, 47, value=10, label='Duration (sec)'),
        gr.Slider(10, 150, value=100, label='Steps'),
        gr.Number(value=-1, label='Seed')
    ],
    outputs=gr.Audio(label='Generated Audio'),
    title='Stable Audio Open'
).launch(server_name='0.0.0.0', server_port=7860)
"
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.
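Once the service is up, you can also call it programmatically via `gradio_client`. A minimal sketch; `service_url`, `request_audio`, and the example hostname are hypothetical helpers, and the positional inputs assume the Gradio interface defined later in this guide:

```python
def service_url(http_pub: str) -> str:
    """Build the public URL from the http_pub hostname shown in My Orders."""
    return f"https://{http_pub}"

def request_audio(http_pub: str, prompt: str, duration: int = 15):
    """Call the Stable Audio Gradio app; returns a local path to the WAV file."""
    from gradio_client import Client  # pip install gradio_client
    client = Client(service_url(http_pub))
    # Positional inputs match the interface: prompt, duration, steps, cfg_scale, seed
    return client.predict(prompt, duration, 100, 7, -1)

# Example (substitute your own http_pub hostname):
# wav_path = request_audio("abc123.clorecloud.net", "Calm piano melody")
```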

## Hardware Requirements

| Model             | Minimum GPU   | Recommended   |
| ----------------- | ------------- | ------------- |
| Stable Audio Open | RTX 3070 8GB  | RTX 3090 24GB |
| Stable Audio 2.0  | RTX 3090 12GB | RTX 4090 24GB |

## Installation

```bash
pip install stable-audio-tools torch torchaudio
```
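A quick sanity check that the packages resolved (a minimal sketch; it prints `MISSING` for anything pip skipped, without crashing):

```python
import importlib.util

def installed(pkg: str) -> bool:
    """True if the top-level package can be imported in this environment."""
    return importlib.util.find_spec(pkg) is not None

for pkg in ("torch", "torchaudio", "stable_audio_tools"):
    print(f"{pkg}: {'ok' if installed(pkg) else 'MISSING'}")
```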

## Basic Usage

### Text to Music

```python
import torch
import torchaudio
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond

# Load model
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
model = model.to("cuda")

sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]

# Define what you want
conditioning = [{
    "prompt": "Upbeat electronic dance music with a catchy synth melody, 128 BPM",
    "seconds_start": 0,
    "seconds_total": 30
}]

# Generate
output = generate_diffusion_cond(
    model,
    conditioning=conditioning,
    steps=100,
    cfg_scale=7,
    sample_size=sample_size,
    sample_rate=sample_rate,
    device="cuda"
)

# Save (torchaudio expects [channels, samples])
audio = output[0].to(torch.float32).cpu()
audio = audio / audio.abs().max().clamp(min=1e-8)  # peak-normalize to avoid clipping
torchaudio.save("music.wav", audio, sample_rate)
```

### Sound Effects

```python
conditioning = [{
    "prompt": "Thunderstorm with heavy rain and distant thunder",
    "seconds_start": 0,
    "seconds_total": 20
}]

output = generate_diffusion_cond(
    model,
    conditioning=conditioning,
    steps=100,
    cfg_scale=7,
    sample_size=sample_size,
    sample_rate=sample_rate,
    device="cuda"
)

torchaudio.save("thunderstorm.wav", output[0].cpu(), sample_rate)  # [channels, samples]
```

### Ambient Sounds

```python
conditioning = [{
    "prompt": "Peaceful forest ambience with birds singing and gentle wind",
    "seconds_start": 0,
    "seconds_total": 45
}]

output = generate_diffusion_cond(
    model,
    conditioning=conditioning,
    steps=100,
    cfg_scale=7,
    sample_size=sample_size,
    sample_rate=sample_rate,
    device="cuda"
)

torchaudio.save("forest.wav", output[0].cpu(), sample_rate)  # [channels, samples]
```

## Prompt Examples

### Music Genres

```python
prompts = {
    "electronic": "Energetic EDM track with deep bass, synth arpeggios, and a driving beat, 130 BPM",
    "jazz": "Smooth jazz piano trio with upright bass and brushed drums, relaxed tempo",
    "rock": "Heavy rock guitar riff with distortion, drums, and bass, powerful and energetic",
    "classical": "Orchestral piece with strings and woodwinds, dramatic and cinematic",
    "ambient": "Atmospheric ambient soundscape with pads and subtle textures, dreamy",
    "hiphop": "Lo-fi hip hop beat with vinyl crackle, mellow piano, and chill drums, 85 BPM"
}
```

### Sound Effects

```python
prompts = {
    "explosion": "Massive explosion with debris and fire, cinematic",
    "footsteps": "Footsteps on gravel, slow walking pace",
    "car": "Sports car engine revving and accelerating",
    "water": "Water splashing and dripping in a cave",
    "wind": "Strong wind howling through mountains",
    "fire": "Crackling campfire with wood popping"
}
```

### Ambient/Background

```python
prompts = {
    "cafe": "Coffee shop ambience with quiet chatter and espresso machine",
    "ocean": "Ocean waves on a sandy beach, seagulls in distance",
    "city": "Busy city street with traffic, horns, and pedestrians",
    "rain": "Gentle rain on window with occasional thunder",
    "space": "Sci-fi spaceship interior hum and beeps"
}
```

## Advanced Options

### Controlling Generation

```python
output = generate_diffusion_cond(
    model,
    conditioning=conditioning,
    steps=150,              # More steps = better quality
    cfg_scale=7,            # Prompt adherence (5-10)
    sample_size=sample_size,
    sample_rate=sample_rate,
    device="cuda",
    seed=42                 # Reproducible results
)
```

### Variable Length

```python
# Short sound effect (5 seconds)
conditioning = [{
    "prompt": "Door creaking open slowly",
    "seconds_start": 0,
    "seconds_total": 5
}]

# Medium clip (30 seconds)
conditioning = [{
    "prompt": "Upbeat rock music",
    "seconds_start": 0,
    "seconds_total": 30
}]

# Maximum length (47 seconds for Open)
conditioning = [{
    "prompt": "Ambient electronic music, evolving textures",
    "seconds_start": 0,
    "seconds_total": 47
}]
```
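Note that `generate_diffusion_cond` returns a full `sample_size` window regardless of `seconds_total`, so shorter requests are typically padded with silence at the end. A small helper (hypothetical, not part of stable-audio-tools) to compute how many samples to keep:

```python
def n_samples(seconds_total: float, sample_rate: int) -> int:
    """Number of audio samples covering seconds_total at sample_rate."""
    return int(seconds_total * sample_rate)

# Usage - output shape is [batch, channels, samples]:
# trimmed = output[..., :n_samples(30, sample_rate)]
# torchaudio.save("clip.wav", trimmed[0].cpu(), sample_rate)
```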

## Batch Generation

```python
import os

prompts = [
    "Energetic drum and bass track",
    "Calm piano melody",
    "Sci-fi laser sound effects",
    "Rain on a tin roof"
]

output_dir = "./audio_output"
os.makedirs(output_dir, exist_ok=True)

for i, prompt in enumerate(prompts):
    conditioning = [{
        "prompt": prompt,
        "seconds_start": 0,
        "seconds_total": 15
    }]

    output = generate_diffusion_cond(
        model,
        conditioning=conditioning,
        steps=100,
        cfg_scale=7,
        sample_size=sample_size,
        sample_rate=sample_rate,
        device="cuda"
    )

    torchaudio.save(f"{output_dir}/audio_{i}.wav", output[0].cpu(), sample_rate)  # [channels, samples]
    print(f"Generated: {prompt[:30]}...")

    torch.cuda.empty_cache()
```

## Gradio Web Interface

```python
import gradio as gr
import torch
import torchaudio
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond
import tempfile

model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
model = model.to("cuda")

sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]

def generate_audio(prompt, duration, steps, cfg_scale, seed):
    conditioning = [{
        "prompt": prompt,
        "seconds_start": 0,
        "seconds_total": duration
    }]

    generator_seed = int(seed) if seed > 0 else -1  # -1 = random seed in stable-audio-tools

    output = generate_diffusion_cond(
        model,
        conditioning=conditioning,
        steps=int(steps),
        cfg_scale=cfg_scale,
        sample_size=sample_size,
        sample_rate=sample_rate,
        device="cuda",
        seed=generator_seed
    )

    audio = output[0].cpu()  # torchaudio expects [channels, samples]

    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        torchaudio.save(f.name, audio, sample_rate)
        return f.name

demo = gr.Interface(
    fn=generate_audio,
    inputs=[
        gr.Textbox(label="Prompt", placeholder="Describe the audio you want..."),
        gr.Slider(1, 47, value=15, step=1, label="Duration (seconds)"),
        gr.Slider(20, 200, value=100, step=10, label="Steps"),
        gr.Slider(1, 15, value=7, step=0.5, label="CFG Scale"),
        gr.Number(value=-1, label="Seed (-1 for random)")
    ],
    outputs=gr.Audio(label="Generated Audio", type="filepath"),
    title="Stable Audio Open - Text to Audio",
    description="Generate music and sound effects from text descriptions. Running on CLORE.AI.",
    examples=[
        ["Upbeat electronic dance music with synths, 128 BPM", 20, 100, 7, 42],
        ["Thunderstorm with heavy rain", 15, 100, 7, 123],
        ["Peaceful piano melody, emotional", 30, 100, 7, 456]
    ]
)

demo.launch(server_name="0.0.0.0", server_port=7860)
```

## Performance

| Duration | Steps | GPU      | Time  |
| -------- | ----- | -------- | ----- |
| 10 sec   | 100   | RTX 3090 | \~15s |
| 10 sec   | 100   | RTX 4090 | \~10s |
| 30 sec   | 100   | RTX 3090 | \~40s |
| 30 sec   | 100   | RTX 4090 | \~25s |
| 47 sec   | 100   | RTX 4090 | \~40s |
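To benchmark your own rented GPU against this table, a small timing wrapper (hypothetical helper) is enough:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Usage with the generation call from the examples above:
# output, secs = timed(generate_diffusion_cond, model, conditioning=conditioning,
#                      steps=100, cfg_scale=7, sample_size=sample_size,
#                      sample_rate=sample_rate, device="cuda")
# print(f"Generated in {secs:.1f}s")
```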

## Quality Tips

### Better Music

```python
# Include tempo and style
prompt = "Energetic rock music, electric guitar, drums, bass, 140 BPM, high energy"

# Be specific about instruments
prompt = "Solo acoustic guitar fingerpicking, folk style, warm and intimate"

# Describe mood
prompt = "Melancholic piano piece, minor key, slow tempo, emotional and sad"
```

### Better Sound Effects

```python
# Be specific
prompt = "Single gunshot from a rifle, outdoor, echo"

# Include environment
prompt = "Footsteps on wooden floor, indoor, slow pace, creaking"

# Describe texture
prompt = "Fire crackling, large bonfire, wood popping, sparks"
```

## Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU           | Hourly Rate | \~30sec clips/hour |
| ------------- | ----------- | ------------------ |
| RTX 3060 12GB | \~$0.03     | \~50               |
| RTX 3090 24GB | \~$0.06     | \~90               |
| RTX 4090 24GB | \~$0.10     | \~140              |
| A100 40GB     | \~$0.17     | \~200              |

*Prices vary. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*
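Per-clip cost is just the hourly rate divided by throughput. For example, using the RTX 4090 row above:

```python
def cost_per_clip(hourly_rate_usd: float, clips_per_hour: int) -> float:
    """Approximate marketplace cost of one generated clip, in USD."""
    return hourly_rate_usd / clips_per_hour

# RTX 4090 row: ~$0.10/hour at ~140 clips/hour
print(round(cost_per_clip(0.10, 140), 5))  # → 0.00071
```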

## Troubleshooting

### Out of Memory

```python
# Reduce duration
conditioning = [{
    "prompt": prompt,
    "seconds_start": 0,
    "seconds_total": 15  # Instead of 47
}]

# Or run the model in half precision to roughly halve VRAM use
model = model.to("cuda", torch.float16)
```

### Poor Quality Output

* Increase steps (150-200)
* Adjust CFG scale (try 5-10)
* Be more specific in prompt
* Try different seeds
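Seed and CFG scale interact, so it is often fastest to sweep both and keep the best take. A sketch (the generation call is commented out because it needs the model and conditioning from the examples above):

```python
def sweep_settings(seeds=(1, 42, 123), cfg_scales=(5, 7, 9)):
    """Yield (seed, cfg_scale) pairs to try."""
    for seed in seeds:
        for cfg in cfg_scales:
            yield seed, cfg

# for seed, cfg in sweep_settings():
#     output = generate_diffusion_cond(model, conditioning=conditioning,
#         steps=150, cfg_scale=cfg, seed=seed,
#         sample_size=sample_size, sample_rate=sample_rate, device="cuda")
#     torchaudio.save(f"take_seed{seed}_cfg{cfg}.wav", output[0].cpu(), sample_rate)
```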

### No Sound / Silence

* Check prompt is descriptive enough
* Avoid very abstract descriptions
* Try known-working prompts first

### Audio Artifacts

* Increase steps
* Lower CFG scale
* Reduce duration
* Check for GPU thermal throttling
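Clicks at clip boundaries can also be softened in post-processing. A minimal NumPy sketch (hypothetical helper; expects a float array shaped [samples, channels], as written by soundfile):

```python
import numpy as np

def apply_fade(audio: np.ndarray, sample_rate: int, fade_s: float = 0.05) -> np.ndarray:
    """Linear fade-in/out to soften clicks at the clip edges."""
    n = int(fade_s * sample_rate)
    ramp = np.linspace(0.0, 1.0, n)
    out = audio.astype(np.float64, copy=True)
    out[:n] *= ramp[:, None]       # fade in over the first n samples
    out[-n:] *= ramp[::-1][:, None]  # fade out over the last n samples
    return out

# faded = apply_fade(output[0].T.cpu().numpy(), sample_rate)
# sf.write("faded.wav", faded, sample_rate)
```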

## Stable Audio vs Others

| Feature  | Stable Audio | AudioCraft | Bark  |
| -------- | ------------ | ---------- | ----- |
| Music    | Excellent    | Excellent  | Poor  |
| SFX      | Great        | Good       | Poor  |
| Speech   | No           | No         | Yes   |
| Duration | 47s / 3min   | 30s        | 15s   |
| Quality  | 44.1kHz      | 32kHz      | 24kHz |
| Open     | Partial      | Yes        | Yes   |

**Use Stable Audio for:**

* High-quality music generation
* Sound effects for games/video
* Background music
* Ambient soundscapes

## Next Steps

* [AudioCraft](https://docs.clore.ai/guides/audio-and-voice/audiocraft-music) - Meta's music generation
* [Bark TTS](https://docs.clore.ai/guides/audio-and-voice/bark-tts) - Voice synthesis
* [Demucs](https://docs.clore.ai/guides/audio-and-voice/demucs-separation) - Audio separation
* [Whisper](https://docs.clore.ai/guides/audio-and-voice/whisper-transcription) - Transcription
