# AlphaFold2 Protein Prediction

> **Predict protein structures with Nobel Prize-winning AI — powered by GPU acceleration on Clore.ai**

AlphaFold2, developed by DeepMind, revolutionized structural biology by predicting protein 3D structures with atomic accuracy. It has been applied to over 200 million protein sequences, and its creators shared the 2024 Nobel Prize in Chemistry. Running AlphaFold2 requires significant GPU memory and compute — Clore.ai provides affordable access to the high-end GPUs needed.

**GitHub:** [google-deepmind/alphafold](https://github.com/google-deepmind/alphafold) — 13K+ ⭐

***

## Prerequisites

* A Clore.ai account with sufficient balance
* Basic familiarity with the Linux command line
* Your target protein sequence(s) in FASTA format
* \~2.5TB disk space for the full genetic databases (or use reduced databases for testing)

***

## Why Run AlphaFold2 on Clore.ai?

AlphaFold2 benefits enormously from GPU acceleration:

| Hardware         | Prediction Time (typical protein \~400aa) |
| ---------------- | ----------------------------------------- |
| CPU only         | 6–24+ hours                               |
| Single A100 80GB | 15–45 minutes                             |
| Single RTX 4090  | 20–60 minutes                             |
| Single RTX 3090  | 30–90 minutes                             |

Clore.ai offers A100, RTX 4090, and RTX 3090 nodes at a fraction of cloud provider costs, making large-scale proteomics studies accessible.

***

## Step 1 — Choose Your GPU Rental on Clore.ai

{% hint style="info" %}
**Recommended GPUs for AlphaFold2:**

* **A100 80GB** — Best for large proteins (>700 aa) and multimer prediction
* **RTX 4090 24GB** — Great for standard monomers (<500 aa)
* **RTX 3090 24GB** — Cost-effective for smaller proteins

For multimer prediction, 40GB+ VRAM is strongly recommended.
{% endhint %}

1. Log in to [clore.ai](https://clore.ai) and go to **Marketplace**
2. Filter by GPU model (A100 or RTX 4090 recommended)
3. Ensure the server has **at least 100GB disk space** (or 2.5TB for full databases)
4. Select a server and click **Rent**

***

## Step 2 — Configure Your Deployment

When setting up your rental order, use the following configuration:

**Docker Image:**

```
nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu20.04
```

{% hint style="warning" %}
AlphaFold2 requires a custom Docker setup. We will install it from source inside the container. Alternatively, use a community image such as `catgumag/alphafold` or `merteroglu/alphafold2`, which pre-package the environment.
{% endhint %}

**Ports to expose:**

```
22
```

**Environment Variables:**

```
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=compute,utility
```

**Minimum Resources:**

* CPU: 8 cores
* RAM: 32GB (64GB recommended for large proteins)
* Disk: 100GB minimum (2.5TB for full databases)

***

## Step 3 — Connect via SSH

Once your instance is running:

```bash
ssh root@<server-ip> -p <ssh-port>
```

Verify GPU is visible:

```bash
nvidia-smi
```

Expected output should show your GPU (e.g., A100 80GB SXM4).

***

## Step 4 — Install AlphaFold2

### Option A: Using the Official Installer Script

```bash
# Update system packages
apt-get update && apt-get install -y \
    wget \
    git \
    python3-pip \
    python3-dev \
    aria2 \
    hmmer \
    kalign \
    hhsuite

# Install Miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
bash miniconda.sh -b -p /opt/conda
export PATH="/opt/conda/bin:$PATH"

# Clone AlphaFold2
git clone https://github.com/google-deepmind/alphafold.git /opt/alphafold
cd /opt/alphafold

# Create a conda environment (the repo ships no environment.yml; create one
# manually and install the pinned requirements)
conda create -n alphafold python=3.8 -y
source /opt/conda/etc/profile.d/conda.sh
conda activate alphafold
pip install -r requirements.txt
```

### Option B: Using pip (Faster Setup)

```bash
# Install system dependencies (hhsuite is packaged in Ubuntu's universe repo)
apt-get update && apt-get install -y \
    wget curl git aria2 hmmer kalign hhsuite

# Clone and install AlphaFold2
git clone https://github.com/google-deepmind/alphafold.git /opt/alphafold
cd /opt/alphafold

pip install -r requirements.txt
pip install --upgrade "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

# Install AlphaFold itself (setup.py install is deprecated; use pip)
pip install .
```

***

## Step 5 — Download Genetic Databases

{% hint style="warning" %}
**Full database download requires \~2.5TB disk space and can take 6–24 hours.** For initial testing, use the reduced databases (see Reduced DB section below).
{% endhint %}

### Full Databases (Production Use)

```bash
cd /opt/alphafold

# Download all databases using the provided script
bash scripts/download_all_data.sh /data/alphafold_databases
```

This downloads:

* **BFD** (\~270GB) — Big Fantastic Database
* **UniRef90** (\~58GB) — UniProt Reference Clusters
* **MGnify** (\~64GB) — Metagenomics sequences
* **PDB70** (\~56GB) — Protein Data Bank representative structures
* **PDB mmCIF** — full PDB structure files used as templates
* **PDB seqres** (\~0.2GB)
* **UniClust30** (\~86GB)
* **UniProt** — needed for multimer prediction
* **Small BFD** (\~17GB) — Reduced version

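As a sanity check before committing disk space, the approximate compressed download sizes listed above can be totalled with a short script (a rough sketch; in practice a run uses either BFD or Small BFD, not both, and the unpacked databases grow to roughly 2.5TB):

```python
# Approximate compressed download sizes in GB, taken from the list above
db_sizes_gb = {
    "BFD": 270,
    "UniRef90": 58,
    "MGnify": 64,
    "PDB70": 56,
    "PDB seqres": 0.2,
    "UniClust30": 86,
    "Small BFD": 17,
}

# The full preset uses BFD; the reduced preset swaps in Small BFD
full_gb = sum(v for k, v in db_sizes_gb.items() if k != "Small BFD")
reduced_gb = sum(v for k, v in db_sizes_gb.items() if k != "BFD")
print(f"Full preset download:    ~{full_gb:.0f} GB")
print(f"Reduced preset download: ~{reduced_gb:.0f} GB")
```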
### Reduced Databases (Testing/Development)

For testing on limited disk (remember to pass `--db_preset=reduced_dbs` and `--small_bfd_database_path` when running):

```bash
# Download small_bfd plus the other required databases
bash scripts/download_small_bfd.sh /data/alphafold_databases
bash scripts/download_pdb70.sh /data/alphafold_databases
bash scripts/download_uniclust30.sh /data/alphafold_databases
bash scripts/download_uniref90.sh /data/alphafold_databases
bash scripts/download_mgnify.sh /data/alphafold_databases
bash scripts/download_pdb_mmcif.sh /data/alphafold_databases

# Only needed for multimer prediction:
bash scripts/download_pdb_seqres.sh /data/alphafold_databases
bash scripts/download_uniprot.sh /data/alphafold_databases
```

***

## Step 6 — Download AlphaFold Model Weights

```bash
# Create directory for model parameters
mkdir -p /data/alphafold_databases/params

# Download model parameters (~3.5GB)
wget -q -P /data/alphafold_databases/params \
    https://storage.googleapis.com/alphafold/alphafold_params_2022-12-06.tar

# Extract
tar -xf /data/alphafold_databases/params/alphafold_params_2022-12-06.tar \
    -C /data/alphafold_databases/params
```
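After extraction, the params directory should hold one `.npz` weight file per model variant. A quick check (`count_param_files` is a hypothetical helper; the exact file count is an assumption to verify against your tarball):

```python
import glob
import os

def count_param_files(params_dir):
    """Count extracted AlphaFold weight files (params_model_*.npz)."""
    return len(glob.glob(os.path.join(params_dir, "params_model_*.npz")))

# The 2022-12-06 release ships monomer, pTM, and multimer weights
# (5 models each), so a successful extraction should yield 15 files;
# treat that count as an assumption and verify against your download.
# print(count_param_files("/data/alphafold_databases/params"))
```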

***

## Step 7 — Prepare Your Input Sequence

Create a FASTA file with your target protein sequence:

```bash
cat > /tmp/target_protein.fasta << 'EOF'
>my_protein
MKTLLLTLVVVTIVCLDLGAVGNGSGLKCRQTGSCVHFPKDLQALPKDDTASDLNRSLDAEAFKAFQRLAENFNATEYRDIQNFNNKIQHSLEELAKKLDEKLAKLKEKLKQLEN
EOF
```

{% hint style="info" %}
**FASTA Format Tips:**

* Header line starts with `>`
* Sequence should contain only standard amino acid letters (ACDEFGHIKLMNPQRSTVWY)
* Remove any gaps or non-standard characters
* For multimer prediction, include all chains with separate headers
{% endhint %}
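The tips above can be turned into a quick pre-flight check before submitting a job (`validate_fasta` is a hypothetical helper, not part of AlphaFold):

```python
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")

def validate_fasta(text):
    """Return a list of problems found in a FASTA string (empty list = OK)."""
    problems = []
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        problems.append("first line must be a '>' header")
    for ln in lines:
        if ln.startswith(">"):
            continue
        bad = set(ln.upper()) - VALID_AA
        if bad:
            problems.append(f"non-standard characters: {sorted(bad)}")
    return problems

assert validate_fasta(">my_protein\nMKTLLLTLVVVTIVCLDLG") == []
```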

***

## Step 8 — Run AlphaFold2

### Monomer Prediction (Single Chain)

```bash
cd /opt/alphafold

python3 run_alphafold.py \
    --fasta_paths=/tmp/target_protein.fasta \
    --max_template_date=2022-01-01 \
    --model_preset=monomer \
    --db_preset=full_dbs \
    --data_dir=/data/alphafold_databases \
    --output_dir=/tmp/alphafold_output \
    --uniref90_database_path=/data/alphafold_databases/uniref90/uniref90.fasta \
    --mgnify_database_path=/data/alphafold_databases/mgnify/mgy_clusters_2022_05.fa \
    --template_mmcif_dir=/data/alphafold_databases/pdb_mmcif/mmcif_files \
    --obsolete_pdbs_path=/data/alphafold_databases/pdb_mmcif/obsolete.dat \
    --pdb70_database_path=/data/alphafold_databases/pdb70/pdb70 \
    --bfd_database_path=/data/alphafold_databases/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --uniclust30_database_path=/data/alphafold_databases/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --use_gpu_relax=True
```
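For batch work over many FASTA files, the long flag list is easier to maintain when assembled programmatically. A minimal sketch (`build_alphafold_cmd` is a hypothetical helper; the database-path flags from the full command above are omitted for brevity and would be merged in the same way):

```python
def build_alphafold_cmd(fasta_path, data_dir, output_dir, preset="monomer"):
    """Assemble a run_alphafold.py invocation as an argv list."""
    flags = {
        "fasta_paths": fasta_path,
        "max_template_date": "2022-01-01",
        "model_preset": preset,
        "db_preset": "full_dbs",
        "data_dir": data_dir,
        "output_dir": output_dir,
        "use_gpu_relax": "True",
    }
    return ["python3", "run_alphafold.py"] + [f"--{k}={v}" for k, v in flags.items()]

cmd = build_alphafold_cmd("/tmp/target_protein.fasta",
                          "/data/alphafold_databases",
                          "/tmp/alphafold_output")
```

Pass the resulting list to `subprocess.run(cmd, check=True)` to launch each prediction.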

### Multimer Prediction (Protein Complex)

```bash
python3 run_alphafold.py \
    --fasta_paths=/tmp/complex.fasta \
    --max_template_date=2022-01-01 \
    --model_preset=multimer \
    --db_preset=full_dbs \
    --data_dir=/data/alphafold_databases \
    --output_dir=/tmp/alphafold_output \
    --uniref90_database_path=/data/alphafold_databases/uniref90/uniref90.fasta \
    --mgnify_database_path=/data/alphafold_databases/mgnify/mgy_clusters_2022_05.fa \
    --template_mmcif_dir=/data/alphafold_databases/pdb_mmcif/mmcif_files \
    --obsolete_pdbs_path=/data/alphafold_databases/pdb_mmcif/obsolete.dat \
    --uniprot_database_path=/data/alphafold_databases/uniprot/uniprot.fasta \
    --pdb_seqres_database_path=/data/alphafold_databases/pdb_seqres/pdb_seqres.txt \
    --use_gpu_relax=True
```
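The multimer input FASTA simply lists each chain under its own header. A sketch of assembling one (`make_multimer_fasta` is a hypothetical helper and the sequences are placeholders, not real proteins):

```python
def make_multimer_fasta(chains):
    """Build a multi-chain FASTA string from {name: sequence} pairs."""
    return "".join(f">{name}\n{seq}\n" for name, seq in chains.items())

complex_fasta = make_multimer_fasta({
    "chain_A": "MKTLLLTLVVVTIVCLDLG",  # placeholder sequence
    "chain_B": "GSHMLKEKLEALAKKLEEL",  # placeholder sequence
})
# with open("/tmp/complex.fasta", "w") as f:
#     f.write(complex_fasta)
```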

***

## Step 9 — Understanding Output Files

AlphaFold2 produces several output files per prediction:

```
/tmp/alphafold_output/my_protein/
├── ranked_0.pdb          # Best predicted structure
├── ranked_1.pdb          # Second-best prediction
├── ranked_2.pdb
├── ranked_3.pdb
├── ranked_4.pdb
├── result_model_1.pkl    # Full prediction data (pickle)
├── result_model_2.pkl
├── ...
├── msas/                 # Multiple Sequence Alignments
│   ├── bfd_uniclust_hits.a3m
│   ├── mgnify_hits.sto
│   └── uniref90_hits.sto
└── timings.json          # Runtime breakdown
```
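`timings.json` records how long each pipeline stage took, which is useful for spotting whether the MSA search or the model inference dominates. A short summary sketch (the key names inside the file vary between AlphaFold versions, so treat them as assumptions):

```python
import json

def summarize_timings(path):
    """Print each stage from timings.json, slowest first; return total seconds."""
    with open(path) as f:
        timings = json.load(f)
    for stage, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
        print(f"{stage:45s} {seconds / 60:6.1f} min")
    return sum(timings.values())

# Example: summarize_timings("/tmp/alphafold_output/my_protein/timings.json")
```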

{% hint style="info" %}
**Interpreting Results:**

* **ranked\_0.pdb** is your best structure — open it in PyMOL, ChimeraX, or UCSF Chimera
* **pLDDT score** (0–100): per-residue confidence. >90 = very high, 70–90 = good, 50–70 = low, <50 = disordered
* **PAE (Predicted Aligned Error)** plots show inter-domain confidence
{% endhint %}
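The pLDDT bands above can be encoded directly when triaging many predictions (a sketch; the thresholds follow the rough bands listed in the hint):

```python
def plddt_band(score):
    """Map a per-residue pLDDT score to the confidence bands above."""
    if score >= 90:
        return "very high"
    if score >= 70:
        return "good"
    if score >= 50:
        return "low"
    return "disordered"

assert plddt_band(95.3) == "very high"
assert plddt_band(42.0) == "disordered"
```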

***

## Step 10 — Visualize Results

### Download PDB Files to Your Local Machine

```bash
# From your local machine:
scp -P <ssh-port> root@<server-ip>:/tmp/alphafold_output/my_protein/ranked_0.pdb ./

# Or use rsync for the full output directory:
rsync -avz -e "ssh -p <ssh-port>" \
    root@<server-ip>:/tmp/alphafold_output/ \
    ./alphafold_results/
```

### Visualize in PyMOL (locally)

```python
# In PyMOL — color by pLDDT, which AlphaFold stores in the B-factor column.
# red_white_blue puts high-confidence residues in blue, matching the
# AlphaFold Database convention.
load ranked_0.pdb
spectrum b, red_white_blue, minimum=0, maximum=100
```

### Quick pLDDT Analysis

```python
import numpy as np

# Parse per-residue pLDDT from the B-factor column; AlphaFold writes the
# same value for every atom of a residue, so reading only CA atoms gives
# exactly one score per residue
plddt_scores = []
with open('ranked_0.pdb', 'r') as f:
    for line in f:
        if line.startswith('ATOM') and line[12:16].strip() == 'CA':
            plddt_scores.append(float(line[60:66]))

print(f"Mean pLDDT: {np.mean(plddt_scores):.1f}")
print(f"Residues >90 pLDDT: {sum(s > 90 for s in plddt_scores)}/{len(plddt_scores)}")
```

***

## Using ColabFold (Faster Alternative)

ColabFold is a faster AlphaFold2 implementation using MMseqs2 for MSA generation:

```bash
pip install "colabfold[alphafold]"

# Run prediction (much faster MSA step); --amber enables the relaxation
# step that --use-gpu-relax accelerates
colabfold_batch /tmp/target_protein.fasta /tmp/colabfold_output \
    --num-recycle 3 \
    --amber \
    --use-gpu-relax
```

{% hint style="success" %}
**ColabFold is typically 10–40x faster** than the original AlphaFold2 pipeline due to the MMseqs2 MSA server. Ideal for iterative research workflows.
{% endhint %}

***

## Troubleshooting

### CUDA Out of Memory

```bash
# Allocate GPU memory on demand instead of letting XLA preallocate
export XLA_PYTHON_CLIENT_ALLOCATOR=platform
# Or cap the preallocated fraction instead
export XLA_PYTHON_CLIENT_MEM_FRACTION=0.85

# For multimer runs, also reduce the number of predictions per model
--num_multimer_predictions_per_model 1
```

### HHblits / Jackhmmer Errors

```bash
# Ensure hhsuite is properly installed
which hhblits
hhblits --version

# Reinstall if needed
conda install -c bioconda hhsuite -y
```

### Database Download Failures

```bash
# Resume interrupted downloads with aria2
aria2c -c -x 16 -s 16 <database-url> -d /data/alphafold_databases/
```

### JAX/CUDA Compatibility Issues

```bash
# Check JAX can see GPU
python3 -c "import jax; print(jax.devices())"

# Reinstall JAX with correct CUDA version
pip install --upgrade "jax[cuda11_pip]" \
    -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
```

***

## Performance Tips

{% hint style="success" %}
**Optimize Your AlphaFold2 Runs:**

1. **Use ColabFold** for faster MSA generation (10–40x speedup)
2. **Set `--num-recycle 1`** for quick screening, use 3 for final predictions
3. **Use `--db_preset=reduced_dbs`** for exploratory work
4. **Batch multiple sequences** in one FASTA file for efficient pipeline runs
5. **Enable GPU relaxation** (`--use_gpu_relax=True`) — much faster than CPU relaxation
{% endhint %}

***

## Cost Estimation on Clore.ai

| Scenario                          | GPU       | Est. Time | Est. Cost    |
| --------------------------------- | --------- | --------- | ------------ |
| Single protein (\~300aa)          | RTX 3090  | 1–2h      | \~$0.30–0.60 |
| Single protein (\~500aa)          | RTX 4090  | 45–90min  | \~$0.40–0.80 |
| Multimer complex                  | A100 80GB | 2–4h      | \~$1.50–3.00 |
| Proteome screening (100 proteins) | A100 80GB | 8–12h     | \~$6–10      |

*Costs are approximate and depend on current marketplace pricing.*

***

## Additional Resources

* [AlphaFold2 GitHub](https://github.com/google-deepmind/alphafold)
* [AlphaFold Database](https://alphafold.ebi.ac.uk/) — Pre-computed structures for 200M+ proteins
* [ColabFold GitHub](https://github.com/sokrypton/ColabFold)
* [DeepMind AlphaFold Blog](https://www.deepmind.com/research/highlighted-research/alphafold)
* [OpenFold](https://github.com/aqlaboratory/openfold) — Trainable PyTorch reimplementation
* [ESMFold](https://github.com/facebookresearch/esm) — Meta's faster alternative

***

*This guide covers AlphaFold2 deployment on Clore.ai GPU rentals. For the latest AlphaFold3, see the separate AlphaFold3 guide.*

***

## Clore.ai GPU Recommendations

| Use Case                    | Recommended GPU | Est. Cost on Clore.ai |
| --------------------------- | --------------- | --------------------- |
| Development/Testing         | RTX 3090 (24GB) | \~$0.12/gpu/hr        |
| Standard Proteins           | RTX 4090 (24GB) | \~$0.70/gpu/hr        |
| Large Molecules / Multimers | A100 80GB       | \~$1.20/gpu/hr        |
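Since billing is hourly, expected cost is simply rate times wall time. A quick estimator using the indicative rates above (marketplace prices fluctuate, so treat the numbers as assumptions):

```python
# Indicative hourly rates from the table above (USD per GPU-hour)
RATES_USD_PER_HR = {"RTX 3090": 0.12, "RTX 4090": 0.70, "A100 80GB": 1.20}

def estimate_cost(gpu, hours):
    """Estimated rental cost in USD for `hours` on the given GPU."""
    return RATES_USD_PER_HR[gpu] * hours

print(f"Multimer on A100 80GB (~3h): ${estimate_cost('A100 80GB', 3):.2f}")
```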

> 💡 All examples in this guide can be deployed on [Clore.ai](https://clore.ai/marketplace) GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.

