
# ESMFold Protein Structure Prediction

**Ultra-fast protein structure prediction from Meta AI**: predicts 3D protein structures from amino acid sequences in seconds, without multiple sequence alignments.

> 🧬 Developed by **Meta AI Research** | MIT license | 10x–60x faster than AlphaFold2

***

## What is ESMFold?

ESMFold is Meta AI's protein structure prediction system. It uses **Evolutionary Scale Modeling (ESM-2)**, a family of protein language models with up to 15 billion parameters, to predict 3D protein structures directly from amino acid sequences. The released `esmfold_v1` model is built on the 3-billion-parameter ESM-2 variant.

### Key advantages over AlphaFold2

| Feature                   | ESMFold          | AlphaFold2            |
| ------------------------- | ---------------- | --------------------- |
| MSA required              | ❌ No             | ✅ Yes                 |
| Speed (typical protein)   | **\~2 seconds**  | \~10 minutes to hours |
| Accuracy (TM-score)       | \~0.87           | \~0.92                |
| GPU VRAM (650 aa)         | \~8 GB           | \~8 GB                |
| Single-sequence input     | ✅ Yes            | Limited               |
| Orphan proteins           | ✅ Excellent      | Struggles             |

### Why no MSA?

AlphaFold2 requires a **multiple sequence alignment (MSA)**: it collects and aligns evolutionary relatives of the query protein. This is computationally expensive, and it is not possible for novel or designed proteins that have no evolutionary relatives.

ESMFold stores evolutionary information **in its language model weights** (trained on 250 million protein sequences) and eliminates the MSA step entirely. This makes it:

* **Faster:** no MSA search (saves minutes per prediction)
* **More scalable:** entire proteomes can be processed efficiently
* **Better for novel proteins:** designed sequences have no evolutionary relatives

***

## Quick start on Clore.ai

### Step 1: Choose a server

On the [clore.ai](https://clore.ai) marketplace:

* **Minimum:** NVIDIA GPU with **16 GB VRAM** (the ESM-2 language model is large)
* **Recommended:** A100 40GB, RTX 3090, or RTX 4090 for the full model
* **Smaller option:** use `esm2_t33_650M_UR50D` for 8 GB VRAM

GPU VRAM guide:

| Protein length | Model variant      | Required VRAM |
| -------------- | ------------------ | ------------- |
| Up to 300 aa   | ESMFold (3B)       | \~16 GB       |
| Up to 500 aa   | ESMFold (3B)       | \~20 GB       |
| Up to 1000 aa  | ESMFold (3B)       | \~40 GB       |
| Up to 600 aa   | ESMFold (chunked)  | \~8 GB        |

### Step 2: Build a custom Docker image

```dockerfile
FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel

# System dependencies
RUN apt-get update && apt-get install -y \
    git \
    wget \
    curl \
    openssh-server \
    libhdf5-dev \
    pkg-config \
    && rm -rf /var/lib/apt/lists/*

# Configure SSH (change the default root password before exposing the container)
RUN mkdir /var/run/sshd && \
    echo 'root:esmfold' | chpasswd && \
    sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config

# Install ESMFold and its dependencies
RUN pip install --no-cache-dir \
    fair-esm[esmfold] \
    torch \
    biopython \
    biotite \
    fastapi \
    uvicorn \
    pydantic \
    openmm==8.0.0 \
    pdbfixer

# Install OpenFold (required by ESMFold)
RUN pip install "git+https://github.com/aqlaboratory/openfold.git@4b41059694619831a7db195b7e0988fc4ff3a307"

EXPOSE 22

CMD ["/usr/sbin/sshd", "-D"]
```

### Step 3: Deploy on Clore.ai

* **Docker image:** `yourname/esmfold:latest`
* **Ports:** `22` (SSH)
* **Environment:** `NVIDIA_VISIBLE_DEVICES=all`

***

## Installation & Setup

### Method 1: pip install

```bash
# Install ESMFold (quoted so that shells such as zsh do not expand the brackets)
pip install "fair-esm[esmfold]"

# Install OpenFold (required dependency)
pip install "git+https://github.com/aqlaboratory/openfold.git@4b41059694619831a7db195b7e0988fc4ff3a307"

# Optional but recommended
pip install biotite biopython
```

### Method 2: From source

```bash
git clone https://github.com/facebookresearch/esm.git
cd esm
pip install -e ".[esmfold]"
```

### Verify the installation

```python
import esm
print("ESM version:", esm.__version__)

# Quick test: load the model
model = esm.pretrained.esmfold_v1()
print("ESMFold loaded successfully!")
```

***

## Basic usage

### Predicting a single protein structure

```python
import torch
import esm

# Load the ESMFold model
model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()

# Optional: set a chunk size to save VRAM.
# Chunking increases compute time but reduces peak VRAM usage.
model.set_chunk_size(64)  # lower this value to use less VRAM

# Protein sequence (example: lysozyme C)
sequence = "KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL"

# Predict the structure (infer_pdb returns the PDB file content as a string)
with torch.no_grad():
    output = model.infer_pdb(sequence)

# Save the PDB file
with open("lysozyme.pdb", "w") as f:
    f.write(output)

print("Structure predicted! Saved to lysozyme.pdb")
print(f"Sequence length: {len(sequence)} amino acids")
```

### Predicting multiple sequences (batch)

```python
import torch
import esm

model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()

sequences = {
    "protein_A": "MKTAYIAKQRQISFVKSHFSRQ...",
    "protein_B": "MGDVEKGKKIFVQKCAQCHTVEK...",
    "ubiquitin": "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG",
}

for name, seq in sequences.items():
    with torch.no_grad():
        output = model.infer_pdb(seq)
    
    with open(f"{name}.pdb", "w") as f:
        f.write(output)
    
    print(f"Predicted {name}: {len(seq)} aa")

print("All predictions complete!")
```
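After a batch run it is often useful to rank the resulting files by confidence. The sketch below reads the mean pLDDT back out of a saved PDB string using only the standard library. It assumes that the B-factor column of the file holds the pLDDT value, which is the usual convention for predicted structures; the function name `mean_plddt_from_pdb` is hypothetical and not part of fair-esm.

```python
def mean_plddt_from_pdb(pdb_text: str) -> float:
    """Average the B-factor column over the C-alpha atoms of a PDB string.

    Assumption: the B-factor column stores pLDDT, as is conventional
    for predicted structures.
    """
    scores = []
    for line in pdb_text.splitlines():
        # PDB is a fixed-column format: the atom name sits in columns 13-16
        # and the B-factor in columns 61-66 (1-based).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError("no C-alpha ATOM records found")
    return sum(scores) / len(scores)
```

Calling it on the content of each `{name}.pdb` file gives one number per protein, which can then be sorted to find the least reliable predictions first.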

### Getting per-residue confidence (pLDDT)

```python
import torch
import esm
import numpy as np

model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()

sequence = "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"

with torch.no_grad():
    output = model.infer(sequence)

# Extract pLDDT scores (0-100). fair-esm returns per-atom values with
# shape [1, seq_len, 37]; index 1 of the atom37 layout is the C-alpha atom.
plddt = output["plddt"].cpu().numpy()
plddt_per_residue = plddt[0, :, 1] if plddt.ndim == 3 else plddt[0]

print(f"Mean pLDDT: {plddt_per_residue.mean():.2f}")
print(f"High-confidence residues (>90): {(plddt_per_residue > 90).sum()}")
print(f"Low-confidence residues (<50): {(plddt_per_residue < 50).sum()}")

# Classify confidence regions
for i, score in enumerate(plddt_per_residue):
    if score >= 90:
        confidence = "Very high (blue)"
    elif score >= 70:
        confidence = "Confident (light blue)"
    elif score >= 50:
        confidence = "Low (yellow)"
    else:
        confidence = "Very low (orange)"
    # print(f"Residue {i+1}: {score:.1f} - {confidence}")  # uncomment for the full per-residue output
```
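The banding logic above can also be kept as a small reusable helper that works on any list of scores, without a GPU. This is a minimal sketch using the same thresholds (90, 70, 50); the names `classify_plddt` and `summarize_bands` are hypothetical.

```python
from collections import Counter

# (lower bound, label) pairs, checked from highest to lowest
PLDDT_BANDS = [(90.0, "very high"), (70.0, "confident"), (50.0, "low")]


def classify_plddt(score: float) -> str:
    """Map a single pLDDT score (0-100) to its confidence band."""
    for threshold, label in PLDDT_BANDS:
        if score >= threshold:
            return label
    return "very low"


def summarize_bands(scores) -> Counter:
    """Count how many residues fall into each confidence band."""
    return Counter(classify_plddt(s) for s in scores)
```

For example, `summarize_bands(plddt_per_residue.tolist())` gives a per-band residue count that is easier to log than the full per-residue listing.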

***

## REST API server

Create a REST API for ESMFold:

```python
# api_server.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import torch
import esm
import time
from typing import Optional

app = FastAPI(
    title="ESMFold Protein Structure Prediction API",
    description="Predict 3D protein structures from amino acid sequences",
    version="1.0.0"
)

# Load the model at startup
print("Loading ESMFold model (this takes ~30 seconds)...")
model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()
model.set_chunk_size(64)  # memory optimization
print("ESMFold ready!")

class PredictionRequest(BaseModel):
    sequence: str
    name: Optional[str] = "protein"

class PredictionResponse(BaseModel):
    name: str
    sequence_length: int
    pdb_content: str
    mean_plddt: float
    inference_time_seconds: float

@app.post("/predict", response_model=PredictionResponse)
async def predict_structure(request: PredictionRequest):
    """Predict a 3D protein structure from an amino acid sequence."""
    
    # Validate the sequence
    valid_aa = set("ACDEFGHIKLMNPQRSTVWY")
    sequence = request.sequence.upper().strip()
    
    if not sequence:
        raise HTTPException(status_code=400, detail="Sequence is empty.")
    
    invalid = set(sequence) - valid_aa
    if invalid:
        raise HTTPException(
            status_code=400,
            detail=f"Invalid amino acids in sequence: {sorted(invalid)}. Use the 20 standard amino acids.",
        )
    
    if len(sequence) > 2000:
        raise HTTPException(
            status_code=400,
            detail="Sequence too long (max. 2000 amino acids). For longer sequences, use chunked prediction.",
        )
    
    start_time = time.time()
    
    try:
        with torch.no_grad():
            output = model.infer(sequence)
            pdb_content = model.output_to_pdb(output)[0]
            
        plddt = output["plddt"].cpu().numpy()[0]
        mean_plddt = float(plddt.mean())
        
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()
        raise HTTPException(
            status_code=507,
            detail="GPU out of memory. Try a shorter sequence or reduce the chunk size.",
        )
    
    inference_time = time.time() - start_time
    
    return PredictionResponse(
        name=request.name,
        sequence_length=len(sequence),
        pdb_content=pdb_content,
        mean_plddt=mean_plddt,
        inference_time_seconds=round(inference_time, 2)
    )

@app.get("/health")
def health():
    gpu_mem = torch.cuda.memory_allocated() / 1024**3 if torch.cuda.is_available() else 0
    return {
        "status": "ok",
        "model": "ESMFold v1",
        "device": str(next(model.parameters()).device),
        "gpu_memory_gb": round(gpu_mem, 2)
    }

@app.get("/")
def root():
    return {"message": "ESMFold API: POST /predict to predict structures, /docs for Swagger UI"}
```

```bash
# Start the API
pip install fastapi uvicorn
uvicorn api_server:app --host 0.0.0.0 --port 8080 --workers 1
```
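The `/predict` endpoint rejects anything that is not a bare string of the 20 standard amino acids, so input pasted from a FASTA file (header line, line breaks) fails validation. The following is a minimal sketch of a client-side cleaning step that could run before a request is sent. The helper name `clean_sequence` is hypothetical, and the 2000-residue limit simply mirrors the server code above.

```python
VALID_AA = frozenset("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw: str, max_length: int = 2000) -> str:
    """Normalize pasted sequence text into a single uppercase string.

    Drops FASTA header lines (starting with '>'), removes all whitespace,
    and raises ValueError for empty, invalid, or overly long sequences.
    """
    lines = [line.strip() for line in raw.strip().splitlines()]
    body = "".join(line for line in lines if not line.startswith(">"))
    sequence = "".join(body.split()).upper()
    if not sequence:
        raise ValueError("empty sequence")
    invalid = sorted(set(sequence) - VALID_AA)
    if invalid:
        raise ValueError(f"invalid residues: {''.join(invalid)}")
    if len(sequence) > max_length:
        raise ValueError(f"sequence too long: {len(sequence)} > {max_length}")
    return sequence
```

Keeping this step on the client side means the server only ever sees sequences that will pass its own checks.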

***

## API usage examples

```bash
# Predict a structure through the API
curl -X POST http://localhost:8080/predict \
  -H "Content-Type: application/json" \
  -d '{
    "name": "ubiquitin",
    "sequence": "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"
  }' | python3 -c "
import sys, json
data = json.load(sys.stdin)
print(f\"Name: {data['name']}\")
print(f\"Length: {data['sequence_length']} aa\")
print(f\"Mean pLDDT: {data['mean_plddt']:.1f}\")
print(f\"Time: {data['inference_time_seconds']}s\")
# Save the PDB
open('ubiquitin.pdb', 'w').write(data['pdb_content'])
print('PDB saved!')
"
```
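The same request can be sent from Python without extra dependencies. This is a sketch using only the standard library; it assumes the server from the previous section is listening on `http://localhost:8080`, and the function names `build_predict_request` and `predict` are hypothetical.

```python
import json
import urllib.request


def build_predict_request(sequence: str, name: str = "protein",
                          base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build the POST request for the /predict endpoint without sending it."""
    payload = json.dumps({"name": name, "sequence": sequence}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(sequence: str, name: str = "protein",
            base_url: str = "http://localhost:8080", timeout: float = 600.0) -> dict:
    """Send the request and return the decoded JSON response as a dict."""
    request = build_predict_request(sequence, name, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A call such as `predict("MQIFVKTLTG", name="demo")` returns a dictionary with the fields of `PredictionResponse`, including `pdb_content`. The generous timeout is a guess that leaves room for long sequences.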

***

## Batch processing script

```python
# batch_predict.py
import torch
import esm
import os
from pathlib import Path
from Bio import SeqIO  # pip install biopython

def predict_fasta(fasta_file: str, output_dir: str, chunk_size: int = 64):
    """Predict structures for all sequences in a FASTA file."""
    
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    
    # Load the model
    model = esm.pretrained.esmfold_v1()
    model = model.eval().cuda()
    model.set_chunk_size(chunk_size)
    
    # Read the FASTA file
    sequences = list(SeqIO.parse(fasta_file, "fasta"))
    print(f"Predicting structures for {len(sequences)} proteins...")
    
    results = []
    for i, record in enumerate(sequences):
        seq = str(record.seq).upper()
        name = record.id
        
        print(f"[{i+1}/{len(sequences)}] Predicting {name} ({len(seq)} aa)...")
        
        try:
            with torch.no_grad():
                output = model.infer(seq)
                pdb = model.output_to_pdb(output)[0]
            
            plddt = output["plddt"].cpu().numpy()[0].mean()
            
            # Save the PDB file
            output_path = os.path.join(output_dir, f"{name}.pdb")
            with open(output_path, "w") as f:
                f.write(pdb)
            
            results.append({
                "name": name,
                "length": len(seq),
                "mean_plddt": round(float(plddt), 2),
                "output": output_path,
                "status": "success"
            })
            
        except Exception as e:
            print(f"  Error: {e}")
            results.append({"name": name, "status": f"error: {e}"})
    
    # Write the summary (newline="" avoids blank rows in the CSV on Windows)
    import csv
    with open(os.path.join(output_dir, "summary.csv"), "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "length", "mean_plddt", "output", "status"])
        writer.writeheader()
        writer.writerows(results)
    
    success = sum(1 for r in results if r.get("status") == "success")
    print(f"\nDone! {success}/{len(sequences)} structures predicted successfully")
    print(f"Results saved to {output_dir}/")

if __name__ == "__main__":
    predict_fasta(
        fasta_file="./proteins.fasta",
        output_dir="./predicted_structures",
        chunk_size=64
    )
```
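Before a long batch run it can help to drop sequences that will not fit on the GPU and to process the short ones first, so that out-of-memory failures come late rather than early. The sketch below does this with a small standard-library FASTA reader; `read_fasta` and `plan_batch` are hypothetical helpers, and a suitable `max_length` depends on the GPU.

```python
def read_fasta(text: str) -> list:
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].strip()
            # Use the first whitespace-delimited token as the record name
            name = header.split()[0] if header else "unnamed"
            chunks = []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def plan_batch(records: list, max_length: int) -> tuple:
    """Return (records to run, shortest first; names skipped as too long)."""
    keep = sorted((r for r in records if len(r[1]) <= max_length), key=lambda r: len(r[1]))
    skipped = [r[0] for r in records if len(r[1]) > max_length]
    return keep, skipped
```

The skipped names can be written to a separate file and handled later on a larger GPU or after splitting.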

***

## Visualizing structures

### Using py3Dmol (Jupyter / Python)

```python
import py3Dmol  # pip install py3Dmol

with open("protein.pdb") as f:
    pdb_data = f.read()

view = py3Dmol.view(width=800, height=600)
view.addModel(pdb_data, "pdb")
view.setStyle({"cartoon": {"colorscheme": "ssJmol"}})
view.zoomTo()
view.show()
```

### Using PyMOL

```bash
# Install PyMOL
apt-get install pymol

# Open the structure
pymol lysozyme.pdb
```

### Programmatic analysis with Biotite

```python
import biotite.structure.io.pdb as pdb
import biotite.structure as struc
import numpy as np

# Load the predicted structure
pdb_file = pdb.PDBFile.read("lysozyme.pdb")
structure = pdb.get_structure(pdb_file, model=1)

# Analyze the secondary structure
sse = struc.annotate_sse(structure)

helix_frac = (sse == 'a').mean() * 100
sheet_frac = (sse == 'b').mean() * 100
coil_frac = (sse == 'c').mean() * 100

print("Secondary structure composition:")
print(f"  Alpha helix:  {helix_frac:.1f}%")
print(f"  Beta sheet:   {sheet_frac:.1f}%")
print(f"  Coil/other:   {coil_frac:.1f}%")
```

***

## Memory optimization

### Chunk size guide

```python
# Smaller chunk_size = less VRAM, slower prediction
# Larger chunk_size = more VRAM, faster prediction

# For 8GB VRAM (allows up to ~400 aa)
model.set_chunk_size(32)

# For 16GB VRAM (up to ~700 aa)
model.set_chunk_size(64)

# For 40GB VRAM (up to ~2000 aa, no chunking)
model.set_chunk_size(None)  # disable chunking
```
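The guide above can be turned into a small lookup so that scripts pick a chunk size from the available VRAM instead of hard-coding it. This is only a sketch of those three tiers; `suggest_chunk_size` is a hypothetical helper, and the thresholds are the rough figures from the guide rather than measured limits.

```python
from typing import Optional


def suggest_chunk_size(vram_gb: float) -> Optional[int]:
    """Return a chunk size for the given VRAM, or None to disable chunking."""
    if vram_gb >= 40:
        return None  # enough memory to run without chunking
    if vram_gb >= 16:
        return 64
    return 32
```

The result can be passed straight to the model, for example `model.set_chunk_size(suggest_chunk_size(24))`.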

### CPU offloading for very long sequences

```python
# Load the model on the CPU and move it to the GPU only for inference
model = esm.pretrained.esmfold_v1()
model = model.eval()

# Move to the GPU for inference, then back to the CPU
model = model.cuda()
with torch.no_grad():
    output = model.infer(sequence)
model = model.cpu()  # release GPU memory
torch.cuda.empty_cache()
```

***

## Troubleshooting

### CUDA Out of Memory

```bash
# Reduce the chunk size in your Python code, for example:
#   model.set_chunk_size(32)  # or even 16

# Check free VRAM
nvidia-smi --query-gpu=memory.free --format=csv,noheader

# For very long proteins, split the sequence into domains.
# Proteins longer than 1000 aa can usually be split into domains of 300–500 aa.
```
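When no domain annotation is available, a crude fallback is to cut the sequence into fixed-size overlapping windows and predict each one separately. The sketch below shows only the slicing step. It is not domain-aware, so a window boundary may cut through a folded domain, and the defaults (400 aa windows, 50 aa overlap) are illustrative assumptions rather than recommended values.

```python
def split_into_windows(sequence: str, window: int = 400, overlap: int = 50) -> list:
    """Split a sequence into overlapping windows.

    Returns a list of (start_index, subsequence) tuples, where start_index
    is the 0-based position of the window in the original sequence.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    if len(sequence) <= window:
        return [(0, sequence)]
    step = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, len(sequence))
        windows.append((start, sequence[start:end]))
        if end == len(sequence):
            break
        start += step
    return windows
```

Each window can then be passed to `model.infer_pdb`, keeping the start index so that residue numbers can be mapped back to the full-length protein.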

### ImportError for openfold

```bash
# Reinstall from the pinned commit
pip install "git+https://github.com/aqlaboratory/openfold.git@4b41059694619831a7db195b7e0988fc4ff3a307"

# Verify the installation
python -c "import openfold; print('OpenFold OK')"
```

### Slow model loading

```bash
# The first load downloads 2.7GB of model weights; this is normal
# Later loads use the cached weights (~30s load time)

# Check the cache location
python -c "import torch; print(torch.hub.get_dir())"
ls ~/.cache/torch/hub/
```

{% hint style="warning" %}
**Note on memory:** ESMFold's language model (the 3B-parameter ESM-2 backbone) needs a substantial amount of VRAM. On GPU servers with less than 16 GB VRAM, use the `esm2_t33_650M_UR50D` backbone variant or enable aggressive chunking.
{% endhint %}

{% hint style="info" %}
**Interpreting pLDDT:**

* **>90**: very high confidence (blue in the AlphaFold color scheme)
* **70–90**: confident (cyan/light blue)
* **50–70**: low confidence (yellow); treat with caution
* **<50**: very low confidence (orange/red); likely a disordered region
{% endhint %}

***

## Clore.ai GPU recommendations

ESMFold's VRAM requirement is dominated by the 3B-parameter ESM-2 language model. Sequence length adds further memory on top of that.

| GPU       | VRAM  | Clore.ai price | Maximum sequence length     | Prediction time (300 aa) |
| --------- | ----- | -------------- | --------------------------- | ------------------------ |
| RTX 3090  | 24 GB | \~$0.12/hour   | \~400 aa (with chunking)    | \~8 seconds              |
| RTX 4090  | 24 GB | \~$0.70/hour   | \~400 aa (with chunking)    | \~5 seconds              |
| A100 40GB | 40 GB | \~$1.20/hour   | \~800 aa comfortably        | \~3 seconds              |
| A100 80GB | 80 GB | \~$2.00/hour   | \~1500+ aa, large proteins  | \~4 seconds              |

{% hint style="warning" %}
**Minimum VRAM: 16 GB.** ESMFold cannot run on 8 GB GPUs with the full ESM-2 backbone. The RTX 3090/4090 (24 GB) can handle proteins of up to \~400 amino acids without chunking; for longer sequences, enable chunking with `model.set_chunk_size(64)`.
{% endhint %}

**Best value for research:** An RTX 3090 at \~$0.12/hour handles the vast majority of protein structure prediction tasks (average human protein: \~300–400 aa). At \~8 seconds per prediction, that is \~450 structures per hour for \~$0.12 in total, whereas AlphaFold2 needs an MSA computation that takes minutes per structure.

**High-throughput proteomics:** For screening thousands of sequences, an A100 40GB (\~$1.20/hour) with batched inference processes \~1,200+ predictions per hour, which is suitable for proteome-scale studies.
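The throughput and cost figures above follow from simple arithmetic, sketched here so that they can be recomputed for other GPUs or prices. The function name `throughput` is hypothetical, and the inputs are the approximate values from the table, not measurements.

```python
def throughput(seconds_per_prediction: float, price_per_hour: float) -> tuple:
    """Return (predictions per hour, cost per prediction in USD)."""
    per_hour = 3600.0 / seconds_per_prediction
    return per_hour, price_per_hour / per_hour


# RTX 3090: ~8 s per prediction at ~$0.12/hour
rtx3090_per_hour, rtx3090_cost = throughput(8, 0.12)

# A100 40GB: ~3 s per prediction at ~$1.20/hour
a100_per_hour, a100_cost = throughput(3, 1.20)
```

With these inputs the RTX 3090 comes out at 450 predictions per hour and the A100 40GB at 1,200, while the cost per prediction is lower on the RTX 3090.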

***

## Resources

* 🐙 **GitHub:** [github.com/facebookresearch/esm](https://github.com/facebookresearch/esm)
* 🤗 **Models:** [huggingface.co/facebook/esmfold\_v1](https://huggingface.co/facebook/esmfold_v1)
* 📄 **Paper:** [Evolutionary-scale prediction of atomic-level protein structure with a language model (Science, 2023)](https://www.science.org/doi/10.1126/science.ade2574)
* 🌐 **ESM Metagenomic Atlas:** [esmatlas.com](https://esmatlas.com), with 772M structures predicted by ESMFold
* 💻 **Meta AI Blog:** [ai.meta.com/blog/protein-folding-esmfold-metagenomics](https://ai.meta.com/blog/protein-folding-esmfold-metagenomics/)
* 🔬 **ESM changelog:** [github.com/facebookresearch/esm/blob/main/CHANGELOG.md](https://github.com/facebookresearch/esm/blob/main/CHANGELOG.md)

