# ESMFold Protein Structure

**Ultra-fast protein structure prediction by Meta AI** — predict 3D protein structures from amino acid sequences in seconds, without multiple sequence alignments.

> 🧬 Developed by **Meta AI Research** | MIT License | 10x–60x faster than AlphaFold2

***

## What is ESMFold?

ESMFold is Meta AI's protein structure prediction system that leverages **Evolutionary Scale Modeling (ESM-2)** — a family of protein language models scaling up to 15 billion parameters — to predict 3D protein structures directly from amino acid sequences. The released ESMFold v1 folds on top of the 3-billion-parameter ESM-2 backbone.

### Key Advantages Over AlphaFold2

| Feature                 | ESMFold         | AlphaFold2     |
| ----------------------- | --------------- | -------------- |
| MSA required            | ❌ No            | ✅ Yes          |
| Speed (typical protein) | **\~2–8 seconds** | \~10 min–hours |
| Accuracy (TM-score)     | \~0.87          | \~0.92         |
| GPU VRAM (650 aa)       | \~16GB (\~8GB chunked) | \~8GB   |
| Single sequence input   | ✅ Yes           | Limited        |
| Orphan proteins         | ✅ Excellent     | Struggles      |

### Why No MSA?

AlphaFold2 requires **Multiple Sequence Alignment (MSA)** — collecting and aligning evolutionary relatives of the query protein. This is computationally expensive and impossible for novel or engineered proteins with no evolutionary relatives.

ESMFold stores evolutionary information **in its language model weights** (trained on 250 million protein sequences), eliminating MSA entirely. This makes it:

* **Faster:** No MSA search (minutes saved per prediction)
* **More scalable:** Process entire proteomes efficiently
* **Better for novel proteins:** Engineered sequences have no evolutionary relatives

***

## Quick Start on Clore.ai

### Step 1: Select a Server

On [clore.ai](https://clore.ai) marketplace:

* **Minimum:** NVIDIA GPU with **16GB VRAM** (the ESM-2 language model is large)
* **Recommended:** A100 40GB, RTX 3090, RTX 4090 for full model
* **Smaller option:** a 650M-backbone variant (`esm2_t33_650M_UR50D`) fits in \~8GB VRAM — see the sketch below
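
For the 8GB route, fair-esm also ships smaller "structure-module-only" ablation checkpoints built on the 650M ESM-2 backbone. A minimal sketch, assuming your fair-esm version exposes `esm.pretrained.esmfold_structure_module_only_650M` (expect noticeably lower accuracy than the full 3B model):

```python
import torch
import esm

# Assumption: the structure-module-only ablation checkpoints are available
# in your fair-esm version; they swap the 3B backbone for ESM-2 650M.
model = esm.pretrained.esmfold_structure_module_only_650M()
model = model.eval().cuda()

sequence = "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"
with torch.no_grad():
    pdb = model.infer_pdb(sequence)

with open("ubiquitin_650m.pdb", "w") as f:
    f.write(pdb)
```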

GPU VRAM guide:

| Protein Length | Model Variant   | VRAM Required |
| -------------- | --------------- | ------------- |
| Up to 300 aa   | ESMFold (3B)    | \~16GB        |
| Up to 500 aa   | ESMFold (3B)    | \~20GB        |
| Up to 1000 aa  | ESMFold (3B)    | \~40GB        |
| Up to 600 aa   | ESMFold (chunk) | \~8GB         |

### Step 2: Build Custom Docker Image

```dockerfile
FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel

# System dependencies
RUN apt-get update && apt-get install -y \
    git \
    wget \
    curl \
    openssh-server \
    libhdf5-dev \
    pkg-config \
    && rm -rf /var/lib/apt/lists/*

# Configure SSH
RUN mkdir /var/run/sshd && \
    echo 'root:esmfold' | chpasswd && \
    sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config

# Install ESMFold and dependencies
RUN pip install --no-cache-dir \
    fair-esm[esmfold] \
    torch \
    biopython \
    biotite \
    fastapi \
    uvicorn \
    pydantic \
    openmm==8.0.0 \
    pdbfixer

# Install OpenFold (required for ESMFold)
RUN pip install "git+https://github.com/aqlaboratory/openfold.git@4b41059694619831a7db195b7e0988fc4ff3a307"

EXPOSE 22

CMD ["/usr/sbin/sshd", "-D"]
```

### Step 3: Deploy on Clore.ai

* **Docker image:** `yourname/esmfold:latest`
* **Ports:** `22` (SSH)
* **Environment:** `NVIDIA_VISIBLE_DEVICES=all`

***

## Installation & Setup

### Method 1: pip install

```bash
# Install ESMFold (quotes keep the extras bracket safe from shell globbing)
pip install "fair-esm[esmfold]"

# Install OpenFold (required dependency)
pip install "git+https://github.com/aqlaboratory/openfold.git@4b41059694619831a7db195b7e0988fc4ff3a307"

# Optional but recommended
pip install biotite biopython
```

### Method 2: From Source

```bash
git clone https://github.com/facebookresearch/esm.git
cd esm
pip install -e ".[esmfold]"
```

### Verify Installation

```python
import esm
print("ESM version:", esm.__version__)

# Quick model load test (the first run downloads the model weights)
model = esm.pretrained.esmfold_v1()
print("ESMFold loaded successfully!")
```

***

## Basic Usage

### Predict a Single Protein Structure

```python
import torch
import esm

# Load ESMFold model
model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()

# Optional: Enable chunk-size to save VRAM
# Increases computation time but reduces VRAM usage
model.set_chunk_size(64)  # Reduce for less VRAM

# Protein sequence (example: Lysozyme C)
sequence = "KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL"

# Predict structure
with torch.no_grad():
    output = model.infer_pdb(sequence)

# Save PDB file
with open("lysozyme.pdb", "w") as f:
    f.write(output)

print(f"Structure predicted! Saved to lysozyme.pdb")
print(f"Sequence length: {len(sequence)} amino acids")
```

### Predict Multiple Sequences (Batch)

```python
import torch
import esm

model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()

sequences = {
    "protein_A": "MKTAYIAKQRQISFVKSHFSRQ...",
    "protein_B": "MGDVEKGKKIFVQKCAQCHTVEK...",
    "ubiquitin": "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG",
}

for name, seq in sequences.items():
    with torch.no_grad():
        output = model.infer_pdb(seq)
    
    with open(f"{name}.pdb", "w") as f:
        f.write(output)
    
    print(f"Predicted {name}: {len(seq)} aa")

print("All predictions complete!")
```

### Get Per-Residue Confidence (pLDDT)

```python
import torch
import esm
import numpy as np

model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()

sequence = "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"

with torch.no_grad():
    output = model.infer(sequence)

# Extract per-residue confidence. output["plddt"] is per-atom with shape
# [1, seq_len, 37]; atom37 index 1 is the Cα atom, a standard per-residue proxy.
plddt_per_residue = output["plddt"][0, :, 1].cpu().numpy()

print(f"Mean pLDDT: {plddt_per_residue.mean():.2f}")
print(f"High confidence residues (>90): {(plddt_per_residue > 90).sum()}")
print(f"Low confidence residues (<50): {(plddt_per_residue < 50).sum()}")

# Classify confidence regions
for i, score in enumerate(plddt_per_residue):
    if score >= 90:
        confidence = "Very High (blue)"
    elif score >= 70:
        confidence = "Confident (light blue)"
    elif score >= 50:
        confidence = "Low (yellow)"
    else:
        confidence = "Very Low (orange)"
    # print(f"Residue {i+1}: {score:.1f} - {confidence}")  # Uncomment for full output
```
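
To eyeball the confidence profile along the chain, a quick matplotlib sketch (reusing `plddt_per_residue` from above; matplotlib is an extra install):

```python
import matplotlib.pyplot as plt  # pip install matplotlib

# Per-residue confidence profile with the standard pLDDT thresholds
plt.figure(figsize=(10, 3))
plt.plot(range(1, len(plddt_per_residue) + 1), plddt_per_residue)
plt.axhline(90, color="tab:blue", linestyle="--", label="very high (90)")
plt.axhline(50, color="tab:orange", linestyle="--", label="low (50)")
plt.xlabel("Residue")
plt.ylabel("pLDDT")
plt.legend(loc="lower right")
plt.tight_layout()
plt.savefig("plddt_profile.png", dpi=150)
```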

***

## REST API Server

Build a production API for ESMFold:

```python
# api_server.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import torch
import esm
import time
from typing import Optional

app = FastAPI(
    title="ESMFold Protein Structure Prediction API",
    description="Predict protein 3D structures from amino acid sequences",
    version="1.0.0"
)

# Load model at startup
print("Loading ESMFold model (this takes ~30 seconds)...")
model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()
model.set_chunk_size(64)  # Memory optimization
print("ESMFold ready!")

class PredictionRequest(BaseModel):
    sequence: str
    name: Optional[str] = "protein"

class PredictionResponse(BaseModel):
    name: str
    sequence_length: int
    pdb_content: str
    mean_plddt: float
    inference_time_seconds: float

@app.post("/predict", response_model=PredictionResponse)
async def predict_structure(request: PredictionRequest):
    """Predict protein 3D structure from amino acid sequence."""
    
    # Validate sequence
    valid_aa = set("ACDEFGHIKLMNPQRSTVWY")
    sequence = request.sequence.upper().strip()
    
    invalid = set(sequence) - valid_aa
    if invalid:
        raise HTTPException(
            status_code=400,
            detail=f"Invalid amino acids in sequence: {invalid}. Use standard 20 amino acids."
        )
    
    if len(sequence) > 2000:
        raise HTTPException(
            status_code=400,
            detail="Sequence too long (max 2000 amino acids). For longer sequences, use chunked prediction."
        )
    
    start_time = time.time()
    
    try:
        with torch.no_grad():
            output = model.infer(sequence)
            pdb_content = model.output_to_pdb(output)[0]
            
        plddt = output["plddt"].cpu().numpy()[0]
        mean_plddt = float(plddt.mean())
        
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()
        raise HTTPException(
            status_code=507,
            detail="GPU out of memory. Try a shorter sequence or reduce chunk size."
        )
    
    inference_time = time.time() - start_time
    
    return PredictionResponse(
        name=request.name,
        sequence_length=len(sequence),
        pdb_content=pdb_content,
        mean_plddt=mean_plddt,
        inference_time_seconds=round(inference_time, 2)
    )

@app.get("/health")
def health():
    gpu_mem = torch.cuda.memory_allocated() / 1024**3 if torch.cuda.is_available() else 0
    return {
        "status": "ok",
        "model": "ESMFold v1",
        "device": str(next(model.parameters()).device),
        "gpu_memory_gb": round(gpu_mem, 2)
    }

@app.get("/")
def root():
    return {"message": "ESMFold API — /predict to predict structures, /docs for Swagger UI"}
```

```bash
# Run the API
pip install fastapi uvicorn
uvicorn api_server:app --host 0.0.0.0 --port 8080 --workers 1
```

***

## API Usage Examples

```bash
# Predict structure via API
curl -X POST http://localhost:8080/predict \
  -H "Content-Type: application/json" \
  -d '{
    "name": "ubiquitin",
    "sequence": "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"
  }' | python3 -c "
import sys, json
data = json.load(sys.stdin)
print(f\"Name: {data['name']}\")
print(f\"Length: {data['sequence_length']} aa\")
print(f\"Mean pLDDT: {data['mean_plddt']:.1f}\")
print(f\"Time: {data['inference_time_seconds']}s\")
# Save PDB
open('ubiquitin.pdb', 'w').write(data['pdb_content'])
print('PDB saved!')
"
```
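
The same call from Python with `requests`, as a minimal client sketch against the server above:

```python
import requests  # pip install requests

resp = requests.post(
    "http://localhost:8080/predict",
    json={
        "name": "ubiquitin",
        "sequence": "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG",
    },
    timeout=300,  # long sequences can take minutes on smaller GPUs
)
resp.raise_for_status()
data = resp.json()

print(f"{data['name']}: {data['sequence_length']} aa, "
      f"mean pLDDT {data['mean_plddt']:.1f}, {data['inference_time_seconds']}s")

with open(f"{data['name']}.pdb", "w") as f:
    f.write(data["pdb_content"])
```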

***

## Batch Processing Script

```python
# batch_predict.py
import torch
import esm
import os
from pathlib import Path
from Bio import SeqIO  # pip install biopython

def predict_fasta(fasta_file: str, output_dir: str, chunk_size: int = 64):
    """Predict structures for all sequences in a FASTA file."""
    
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    
    # Load model
    model = esm.pretrained.esmfold_v1()
    model = model.eval().cuda()
    model.set_chunk_size(chunk_size)
    
    # Read FASTA
    sequences = list(SeqIO.parse(fasta_file, "fasta"))
    print(f"Predicting structures for {len(sequences)} proteins...")
    
    results = []
    for i, record in enumerate(sequences):
        seq = str(record.seq).upper()
        name = record.id
        
        print(f"[{i+1}/{len(sequences)}] Predicting {name} ({len(seq)} aa)...")
        
        try:
            with torch.no_grad():
                output = model.infer(seq)
                pdb = model.output_to_pdb(output)[0]
            
            plddt = output["plddt"].cpu().numpy()[0].mean()
            
            # Save PDB
            output_path = os.path.join(output_dir, f"{name}.pdb")
            with open(output_path, "w") as f:
                f.write(pdb)
            
            results.append({
                "name": name,
                "length": len(seq),
                "mean_plddt": round(float(plddt), 2),
                "output": output_path,
                "status": "success"
            })
            
        except Exception as e:
            print(f"  Error: {e}")
            results.append({"name": name, "status": f"error: {e}"})
    
    # Write summary
    import csv
    with open(os.path.join(output_dir, "summary.csv"), "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "length", "mean_plddt", "output", "status"])
        writer.writeheader()
        writer.writerows(results)
    
    success = sum(1 for r in results if r.get("status") == "success")
    print(f"\nDone! {success}/{len(sequences)} structures predicted successfully")
    print(f"Results saved to {output_dir}/")

if __name__ == "__main__":
    predict_fasta(
        fasta_file="./proteins.fasta",
        output_dir="./predicted_structures",
        chunk_size=64
    )
```

***

## Visualizing Structures

### Using Py3Dmol (Jupyter / Python)

```python
import py3Dmol  # pip install py3Dmol

with open("protein.pdb") as f:
    pdb_data = f.read()

view = py3Dmol.view(width=800, height=600)
view.addModel(pdb_data, "pdb")
view.setStyle({"cartoon": {"colorscheme": "ssJmol"}})
view.zoomTo()
view.show()
```
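
ESMFold writes per-residue pLDDT into the PDB B-factor column, so you can color the cartoon by confidence. A sketch reusing `pdb_data` from above, assuming 3Dmol.js's B-factor gradient syntax (red = low, blue = high):

```python
# Color the cartoon by the pLDDT stored in the B-factor column
view = py3Dmol.view(width=800, height=600)
view.addModel(pdb_data, "pdb")
view.setStyle({"cartoon": {"colorscheme": {"prop": "b", "gradient": "roygb", "min": 50, "max": 90}}})
view.zoomTo()
view.show()
```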

### Using PyMOL

```bash
# Install PyMOL
apt-get install pymol

# Open structure
pymol lysozyme.pdb
```

### Programmatic Visualization with Biotite

```python
import biotite.structure.io.pdb as pdb
import biotite.structure as struc
import numpy as np

# Load predicted structure
pdb_file = pdb.PDBFile.read("lysozyme.pdb")
structure = pdb.get_structure(pdb_file, model=1)

# Analyze secondary structure
sse = struc.annotate_sse(structure)

helix_frac = (sse == 'a').mean() * 100
sheet_frac = (sse == 'b').mean() * 100
coil_frac = (sse == 'c').mean() * 100

print(f"Secondary structure composition:")
print(f"  Alpha helix:  {helix_frac:.1f}%")
print(f"  Beta sheet:   {sheet_frac:.1f}%")
print(f"  Coil/Other:   {coil_frac:.1f}%")
```

***

## Memory Optimization

### Chunk Size Guide

```python
# Lower chunk_size = less VRAM, slower prediction
# Higher chunk_size = more VRAM, faster prediction

# For 8GB VRAM (allows up to ~400 aa)
model.set_chunk_size(32)

# For 16GB VRAM (up to ~700 aa)
model.set_chunk_size(64)

# For 40GB VRAM (up to ~2000 aa, no chunking)
model.set_chunk_size(None)  # Disable chunking
```

### CPU Offloading for Very Long Sequences

```python
# Load model on CPU, move to GPU per inference
model = esm.pretrained.esmfold_v1()
model = model.eval()

# Move to GPU for inference, back to CPU after
model = model.cuda()
with torch.no_grad():
    output = model.infer(sequence)
model = model.cpu()  # Free GPU memory
torch.cuda.empty_cache()
```

***

## Troubleshooting

### CUDA Out of Memory

```python
# Reduce chunk size before inference (Python)
model.set_chunk_size(32)  # or even 16
```

```bash
# Check free VRAM
nvidia-smi --query-gpu=memory.free --format=csv,noheader
```

For very long proteins, split them into domains: sequences over \~1000 aa can typically be split into 300–500 aa segments (see the Python sketch below).
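A naive window-splitting sketch for that last case, assuming `model` is already loaded and `long_sequence` holds your protein. Fixed-size overlapping windows are only a fallback; real domain boundaries should come from a domain-prediction tool:

```python
def split_into_windows(sequence: str, window: int = 400, overlap: int = 50):
    """Split a long sequence into overlapping fixed-size windows."""
    step = window - overlap
    return [
        (start, sequence[start:start + window])
        for start in range(0, max(len(sequence) - overlap, 1), step)
    ]

# Predict each window independently to stay inside VRAM limits
for start, segment in split_into_windows(long_sequence):
    with torch.no_grad():
        pdb = model.infer_pdb(segment)
    with open(f"segment_{start + 1}_{start + len(segment)}.pdb", "w") as f:
        f.write(pdb)
```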

### ImportError for openfold

```bash
# Reinstall with specific commit
pip install "git+https://github.com/aqlaboratory/openfold.git@4b41059694619831a7db195b7e0988fc4ff3a307"

# Check installation
python -c "import openfold; print('OpenFold OK')"
```

### Slow Model Loading

```bash
# First load downloads 2.7GB model weights — this is normal
# Subsequent loads use cached weights (~30s load time)

# Check cache location
python -c "import torch; print(torch.hub.get_dir())"
ls ~/.cache/torch/hub/
```

{% hint style="warning" %}
**Memory note:** ESMFold's language model backbone (ESM-2, 3B parameters) requires significant VRAM. For GPU servers with less than 16GB VRAM, use a smaller 650M backbone (e.g. `esm2_t33_650M_UR50D`, see the sketch under Step 1) or enable aggressive chunking.
{% endhint %}

{% hint style="info" %}
**pLDDT interpretation:**

* **>90** = Very high confidence (blue in AlphaFold coloring)
* **70–90** = Confident (cyan/light blue)
* **50–70** = Low confidence (yellow) — treat with caution
* **<50** = Very low confidence (orange/red) — likely disordered region
{% endhint %}
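
Because pLDDT is written into the PDB B-factor column, you can also recover confidence from any saved prediction without rerunning the model. A short biotite sketch (biotite was installed earlier):

```python
import biotite.structure.io as bsio

# ESMFold stores per-residue pLDDT in the B-factor column
struct = bsio.load_structure("lysozyme.pdb", extra_fields=["b_factor"])
print(f"Mean pLDDT from file: {struct.b_factor.mean():.1f}")
```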

***

## Clore.ai GPU Recommendations

ESMFold's VRAM requirement is dominated by the 3-billion-parameter ESM-2 language model. Sequence length adds further activation memory on top.

| GPU       | VRAM  | Clore.ai Price | Max Sequence Length        | Prediction Time (300 aa) |
| --------- | ----- | -------------- | -------------------------- | ------------------------ |
| RTX 3090  | 24 GB | \~$0.12/hr     | \~400 aa (more with chunking) | \~8 seconds            |
| RTX 4090  | 24 GB | \~$0.70/hr     | \~400 aa (more with chunking) | \~5 seconds            |
| A100 40GB | 40 GB | \~$1.20/hr     | \~800 aa comfortably       | \~3 seconds              |
| A100 80GB | 80 GB | \~$2.00/hr     | \~1500+ aa, large proteins | \~4 seconds              |
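
To see where the memory goes on your own server, a quick measurement sketch (weights resident after load vs. peak during inference; the 300-residue poly-Met input is just a synthetic test):

```python
import torch
import esm

gb = 1024 ** 3
model = esm.pretrained.esmfold_v1().eval().cuda()
print(f"Weights resident: {torch.cuda.memory_allocated() / gb:.1f} GB")

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model.infer_pdb("M" * 300)  # synthetic 300 aa test sequence
print(f"Peak during inference: {torch.cuda.max_memory_allocated() / gb:.1f} GB")
```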

{% hint style="warning" %}
**Minimum VRAM: 16GB.** ESMFold cannot run on 8GB GPUs with the full ESM-2 backbone. The RTX 3090/4090 (24GB) can handle proteins up to \~400 amino acids without chunking — enable `chunk_size=64` in the API for longer sequences.
{% endhint %}

**Best value for research:** RTX 3090 at \~$0.12/hr handles the vast majority of protein structure prediction tasks (average human protein: \~300–400 aa). At \~8 seconds per prediction, you can process \~450 structures per hour for \~$0.12 total — compared to AlphaFold2 which requires MSA computation taking minutes per structure.

**High-throughput proteomics:** For screening thousands of sequences, A100 40GB (\~$1.20/hr) with batched inference processes \~1,200+ predictions per hour — viable for proteome-scale studies.

***

## Resources

* 🐙 **GitHub:** [github.com/facebookresearch/esm](https://github.com/facebookresearch/esm)
* 🤗 **Models:** [huggingface.co/facebook/esmfold\_v1](https://huggingface.co/facebook/esmfold_v1)
* 📄 **Paper:** [Evolutionary-scale prediction of atomic-level protein structure with a language model (Science, 2023)](https://www.science.org/doi/10.1126/science.ade2574)
* 🌐 **ESM Metagenomic Atlas:** [esmatlas.com](https://esmatlas.com) — 772M structures predicted with ESMFold
* 💻 **Meta AI Blog:** [ai.meta.com/blog/protein-folding-esmfold-metagenomics](https://ai.meta.com/blog/protein-folding-esmfold-metagenomics/)
* 🔬 **ESM Changelog:** [github.com/facebookresearch/esm/blob/main/CHANGELOG.md](https://github.com/facebookresearch/esm/blob/main/CHANGELOG.md)
