# Mergekit Model Merging

**Mergekit** is the definitive toolkit for merging pretrained large language models. With 5K+ GitHub stars, it implements every major model merging algorithm — SLERP, TIES, DARE, DARE-TIES, MoE merging, and more — enabling you to create powerful new models without any training data or GPU training time.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

***

## What is Mergekit?

Model merging is a powerful technique that combines the strengths of multiple LLMs into a single model:

* **No training required** — merge happens in weight space, not through backprop (see the sketch after this list)
* **Combine capabilities** — blend a coding model with an instruction-following model
* **Reduce weaknesses** — average out individual model failures across an ensemble
* **Create Mixture of Experts** — combine models into a sparse MoE architecture
* **Domain adaptation** — merge base model with domain-specialized models
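
To make "merging in weight space" concrete, here is a minimal, illustrative sketch of the simplest case: a plain linear average of two same-architecture checkpoints. This is not Mergekit's code; the model paths and the `alpha` blend factor are placeholders.

```python
# Illustrative only: a naive linear merge of two same-architecture checkpoints.
# Mergekit's real methods (SLERP, TIES, DARE, ...) are more sophisticated.
import torch
from transformers import AutoModelForCausalLM

def linear_merge(path_a: str, path_b: str, alpha: float = 0.5):
    model_a = AutoModelForCausalLM.from_pretrained(path_a, torch_dtype=torch.bfloat16)
    model_b = AutoModelForCausalLM.from_pretrained(path_b, torch_dtype=torch.bfloat16)
    state_b = model_b.state_dict()

    with torch.no_grad():
        for name, tensor in model_a.state_dict().items():
            # Weighted average of corresponding tensors: no gradients, no training data
            tensor.copy_((1.0 - alpha) * tensor + alpha * state_b[name])
    return model_a  # model_a now holds the merged weights

# merged = linear_merge("models/model-a", "models/model-b", alpha=0.5)
# merged.save_pretrained("merged-linear/")
```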

Mergekit implements all state-of-the-art algorithms:

| Algorithm           | Description                                       | Best For                                            |
| ------------------- | ------------------------------------------------- | --------------------------------------------------- |
| **SLERP**           | Spherical linear interpolation between two models | Smooth blending of two similar models               |
| **TIES**            | Trim redundant parameters, elect signs, merge     | Combining multiple models with minimal interference |
| **DARE**            | Drop and rescale random parameters                | Reducing parameter interference in large merges     |
| **DARE-TIES**       | DARE + TIES combined                              | Best all-around for multi-model merges              |
| **Linear**          | Simple weighted average                           | Quick baseline merges                               |
| **Task Arithmetic** | Add/subtract task vectors                         | Adding/removing specific capabilities               |
| **Passthrough**     | Copy layers directly                              | MoE construction                                    |

{% hint style="info" %}
Model merging is surprisingly effective. Merged models often outperform their parents on benchmarks by combining complementary knowledge. The MergeKit community on HuggingFace hosts thousands of merged models.
{% endhint %}

***

## Server Requirements

| Component | Minimum                           | Recommended                   |
| --------- | --------------------------------- | ----------------------------- |
| GPU       | Not required (CPU merge possible) | A100 40 GB for large models   |
| VRAM      | —                                 | 80 GB for 70B model merges    |
| RAM       | 32 GB                             | 64 GB+ (models load into RAM) |
| CPU       | 8 cores                           | 16+ cores                     |
| Storage   | 100 GB                            | 500 GB+                       |
| OS        | Ubuntu 20.04+                     | Ubuntu 22.04                  |
| Python    | 3.10+                             | 3.11                          |

{% hint style="warning" %}
For CPU-only merging (the most common mode), RAM is your bottleneck. Merging two 7B models in bf16 requires \~28 GB RAM minimum. Use `--lazy-unpickle` for lower memory usage.
{% endhint %}
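
As a rule of thumb, peak RAM for an in-memory merge scales with parameter count, bytes per parameter, and the number of checkpoints held at once. A back-of-the-envelope helper (illustrative only; flags like `--lazy-unpickle` reduce actual usage):

```python
# Rough RAM estimate for holding N checkpoints in memory at once (illustrative only)
def estimate_merge_ram_gb(params_billions: float, num_models: int, bytes_per_param: int = 2) -> float:
    # bf16/fp16 use 2 bytes per parameter, fp32 uses 4
    # billions * 1e9 params * bytes, divided by 1e9 bytes per GB
    return params_billions * num_models * bytes_per_param

print(estimate_merge_ram_gb(7, 2))    # two 7B models in bf16  -> ~28 GB
print(estimate_merge_ram_gb(70, 2))   # two 70B models in bf16 -> ~280 GB
```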

***

## Ports

| Port | Service | Notes                             |
| ---- | ------- | --------------------------------- |
| 22   | SSH     | Terminal access and file transfer |

Mergekit runs as a command-line tool — no web server needed.

***

## Installation on Clore.ai

### Step 1 — Rent a Server

1. Go to [Clore.ai Marketplace](https://clore.ai/marketplace)
2. Filter for **RAM ≥ 64 GB** (critical for large model merges)
3. Choose **Storage ≥ 500 GB** (merged models need space for 2-4 input models + output)
4. GPU is optional but useful if you want to test the merged model afterward
5. Open port **22** only

### Step 2 — Connect via SSH

```bash
ssh root@<server-ip> -p <ssh-port>
```

### Step 3 — Install Python Environment

```bash
# Install Python 3.11 (Ubuntu 22.04 ships 3.10; the deadsnakes PPA provides 3.11)
apt-get update
apt-get install -y software-properties-common git
add-apt-repository -y ppa:deadsnakes/ppa
apt-get update
apt-get install -y python3.11 python3.11-venv

# Create virtual environment (pip is available inside the venv)
python3.11 -m venv /opt/mergekit
source /opt/mergekit/bin/activate
pip install --upgrade pip
```

### Step 4 — Install Mergekit

```bash
# Install from PyPI
pip install mergekit

# Or install from source (recommended for latest features)
git clone https://github.com/arcee-ai/mergekit.git
cd mergekit
pip install -e .  # optional extras such as '.[evolve]' add the evolutionary optimizer
```

### Step 5 — Install HuggingFace CLI

```bash
pip install huggingface_hub
huggingface-cli login  # Enter your HF token
```

### Step 6 — Verify Installation

```bash
mergekit-yaml --help
mergekit-moe --help
```

***

## Downloading Models to Merge

```bash
# Download models you want to merge
# Using huggingface_hub

python3 << 'EOF'
from huggingface_hub import snapshot_download

# Download model 1
snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.3",
    local_dir="models/Mistral-7B-Instruct-v0.3"
)

# Download model 2
snapshot_download(
    repo_id="meta-llama/Llama-3.2-8B-Instruct",
    local_dir="models/Llama-3.2-8B-Instruct",
    token="hf_your-token"  # Required for gated models
)
EOF

# Or use huggingface_hub CLI
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3 \
  --local-dir models/Mistral-7B-Instruct-v0.3
```

***

## Merge Configurations

Mergekit uses YAML configuration files to define merges.

### Example 1: SLERP Merge (Two Models)

SLERP blends two models along a spherical arc — best for models of the same architecture:

```yaml
# slerp_merge.yaml
# SLERP merges exactly two models, listed under `slices` below
# (a mergekit config takes either a top-level `models` list or `slices`, not both)

merge_method: slerp
base_model: models/Mistral-7B-Instruct-v0.3

slices:
  - sources:
    - model: models/Mistral-7B-Instruct-v0.3
      layer_range: [0, 32]
    - model: models/OpenHermes-2.5-Mistral-7B
      layer_range: [0, 32]

parameters:
  t:
    - filter: self_attn
      value: 0.5  # 50/50 blend for attention layers
    - filter: mlp
      value: 0.3  # 30% from model 2 for MLP layers
    - value: 0.5  # default for everything else

dtype: bfloat16
```

```bash
mergekit-yaml slerp_merge.yaml merged-model/ --lazy-unpickle
```
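
For intuition, SLERP treats each pair of corresponding tensors as flattened vectors and interpolates along the arc between them rather than along a straight line, which tends to preserve weight magnitudes. A simplified per-tensor sketch, not Mergekit's exact implementation:

```python
import torch

def slerp(t: float, a: torch.Tensor, b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two same-shaped weight tensors."""
    a_flat, b_flat = a.flatten().float(), b.flatten().float()
    a_unit = a_flat / (a_flat.norm() + eps)
    b_unit = b_flat / (b_flat.norm() + eps)
    omega = torch.arccos(torch.clamp(a_unit @ b_unit, -1.0, 1.0))  # angle between the tensors

    if omega.abs() < 1e-4:
        # Nearly parallel tensors: fall back to plain linear interpolation
        merged = (1.0 - t) * a_flat + t * b_flat
    else:
        so = torch.sin(omega)
        merged = (torch.sin((1.0 - t) * omega) / so) * a_flat + (torch.sin(t * omega) / so) * b_flat
    return merged.reshape(a.shape).to(a.dtype)
```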

### Example 2: TIES Merge (Multiple Models)

TIES handles interference between multiple merged models:

```yaml
# ties_merge.yaml
models:
  - model: models/Mistral-7B-v0.3
    parameters:
      weight: 1.0    # Base model, full weight
      density: 1.0

  - model: models/Mistral-7B-coding
    parameters:
      weight: 0.7    # Coding capability
      density: 0.5   # Keep 50% of changed parameters

  - model: models/Mistral-7B-math
    parameters:
      weight: 0.5    # Math capability
      density: 0.3   # Keep 30% of changed parameters

merge_method: ties
base_model: models/Mistral-7B-v0.3

parameters:
  normalize: true
  int8_mask: true

dtype: bfloat16
```

```bash
mergekit-yaml ties_merge.yaml merged-ties/ --lazy-unpickle
```
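
Conceptually, TIES operates on each model's "task vector" (its delta from the base): it trims low-magnitude entries according to `density`, elects a per-parameter sign by total magnitude, and averages only the deltas that agree with the elected sign. A simplified single-tensor sketch (illustrative, not Mergekit's code):

```python
import torch

def ties_merge_tensor(base, finetuned_list, densities, weights):
    """Simplified TIES for one tensor: trim -> elect sign -> disjoint mean."""
    deltas = []
    for ft, density, weight in zip(finetuned_list, densities, weights):
        delta = (ft - base) * weight
        # Trim: keep only the top `density` fraction of entries by magnitude
        k = max(1, int(delta.numel() * density))
        threshold = delta.abs().flatten().kthvalue(delta.numel() - k + 1).values
        deltas.append(torch.where(delta.abs() >= threshold, delta, torch.zeros_like(delta)))

    stacked = torch.stack(deltas)
    # Elect a sign per parameter from the summed (trimmed) deltas
    elected_sign = torch.sign(stacked.sum(dim=0))
    agree = torch.sign(stacked) == elected_sign
    # Average only the deltas that agree with the elected sign
    merged_delta = (stacked * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)
    return base + merged_delta
```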

### Example 3: DARE-TIES Merge (Best All-Around)

```yaml
# dare_ties_merge.yaml
models:
  - model: models/Llama-3.1-8B-Instruct
    parameters:
      weight: 1.0
      density: 0.7   # fraction of this model's delta kept after the DARE drop step

  - model: models/Llama-3.1-8B-code
    parameters:
      weight: 0.8
      density: 0.5

  - model: models/Llama-3.1-8B-math
    parameters:
      weight: 0.6
      density: 0.4

merge_method: dare_ties
base_model: models/Llama-3.1-8B-Instruct

parameters:
  normalize: true
  int8_mask: true

dtype: bfloat16
```

```bash
mergekit-yaml dare_ties_merge.yaml merged-dare-ties/ --lazy-unpickle
```
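
DARE's key step is drop-and-rescale: before merging, each entry of a model's delta is randomly dropped with probability `1 - density`, and the survivors are rescaled by `1 / density` so the expected delta is unchanged. A minimal single-tensor sketch (illustrative only):

```python
import torch

def dare_delta(base: torch.Tensor, finetuned: torch.Tensor, density: float) -> torch.Tensor:
    """Drop And REscale: sparsify a task vector while preserving its expectation."""
    delta = finetuned - base
    keep_mask = torch.rand_like(delta.float()) < density   # keep each entry with prob `density`
    # Rescale the survivors by 1/density so the expected delta equals the original delta
    return delta * keep_mask / density
```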

### Example 4: Task Arithmetic (Add Capabilities)

Add a "skill delta" to a base model:

```yaml
# task_arithmetic.yaml
# Adds the math skill from a math-tuned model to a general base model
models:
  - model: models/Llama-3.1-8B-Instruct
    parameters:
      weight: 1.0

  - model: models/Llama-3.1-8B-math
    parameters:
      weight: 0.7   # Positive = add this capability

  # To REMOVE a capability, use negative weight:
  # - model: models/Llama-3.1-8B-harmful
  #   parameters:
  #     weight: -0.5

merge_method: task_arithmetic
base_model: models/Llama-3.1-8B-Instruct

dtype: bfloat16
```
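
Task arithmetic reduces to simple vector math on the weights: merged = base + sum_i w_i * (model_i - base), so a positive weight adds a capability's delta and a negative weight subtracts it. A per-tensor sketch (illustrative only):

```python
import torch

def task_arithmetic_tensor(base: torch.Tensor,
                           finetuned: list[torch.Tensor],
                           weights: list[float]) -> torch.Tensor:
    """merged = base + sum_i w_i * (finetuned_i - base); negative weights remove a capability."""
    merged = base.clone().float()
    for ft, w in zip(finetuned, weights):
        merged += w * (ft.float() - base.float())
    return merged.to(base.dtype)
```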

### Example 5: MoE (Mixture of Experts)

Combine models into a sparse MoE architecture:

```yaml
# moe_merge.yaml
base_model: models/Llama-3.1-8B-Instruct

gate_mode: hidden  # initialize expert gates from hidden states of the prompts below
dtype: bfloat16
experts:
  - source_model: models/Llama-3.1-8B-coding
    positive_prompts:
      - "Write code"
      - "Debug this function"
      - "Implement an algorithm"
    negative_prompts:
      - "Tell me a story"
      - "Explain history"
  
  - source_model: models/Llama-3.1-8B-creative
    positive_prompts:
      - "Write a story"
      - "Be creative"
      - "Imagine"
    negative_prompts:
      - "Write code"
      - "Calculate"
```

```bash
mergekit-moe moe_merge.yaml merged-moe/ --lazy-unpickle
```

***

## Running the Merge

### Basic Command

```bash
# Activate environment
source /opt/mergekit/bin/activate

# Run merge with lazy unpickling (saves RAM)
mergekit-yaml your_config.yaml output_model/ --lazy-unpickle

# With CUDA acceleration (if GPU available)
mergekit-yaml your_config.yaml output_model/ \
  --lazy-unpickle \
  --cuda \
  --copy-tokenizer

# Low memory mode (slower but works on smaller servers)
mergekit-yaml your_config.yaml output_model/ \
  --lazy-unpickle \
  --low-cpu-memory
```

### Monitor Progress

```bash
# Mergekit shows progress for each layer
# Typical output:
# Loading model 1...
# Loading model 2...
# Merging layer 0/32: embed_tokens
# Merging layer 1/32: layers.0.self_attn
# ...
# Saving merged model...
# Done! Saved to output_model/
```

***

## Testing the Merged Model

```bash
# Quick test with transformers
python3 << 'EOF'
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = "output_model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

prompt = "Explain the difference between LoRA and full finetuning:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.7)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
EOF
```

***

## Publishing to HuggingFace

```bash
# Login
huggingface-cli login

# Create and push repository
python3 << 'EOF'
from huggingface_hub import HfApi
api = HfApi()

# Create repo
api.create_repo("my-merged-model-7b", private=False)

# Upload merged model
api.upload_folder(
    folder_path="output_model/",
    repo_id="your-username/my-merged-model-7b",
    repo_type="model"
)
print("Uploaded!")
EOF
```

***

## Advanced: Evolutionary Merge

Use Mergekit's evolutionary optimizer to find optimal merge weights:

```bash
# Install the evolutionary optimizer dependencies
pip install 'mergekit[evolve]'

# Run evolutionary search; the genome (models, merge method, parameter ranges)
# and the evaluation tasks are defined in evolve_config.yaml
mergekit-evolve evolve_config.yaml \
  --storage-path ./evolve-workspace \
  --max-fevals 100
```

***

## Troubleshooting

### Out of Memory (OOM) during merge

```bash
# Always use --lazy-unpickle for large models
mergekit-yaml config.yaml out/ --lazy-unpickle

# Add --low-cpu-memory flag
mergekit-yaml config.yaml out/ --lazy-unpickle --low-cpu-memory

# Check available RAM before merging
free -h

# For 7B models you need ~30 GB RAM minimum
# For 13B models: ~60 GB RAM
# For 70B models: ~280 GB RAM (or use a high-RAM CPU server)
```

### `ValueError: models are not compatible`

```bash
# Models must have the same architecture
# You cannot merge Llama-3 with Mistral directly
# Check model configs
python3 -c "
import json
for path in ['models/model1/config.json', 'models/model2/config.json']:
    with open(path) as f:
        cfg = json.load(f)
    print(path, ':', cfg.get('model_type'), cfg.get('hidden_size'), cfg.get('num_hidden_layers'))
"
```

### Merge is very slow

```bash
# Use GPU for faster tensor operations
mergekit-yaml config.yaml out/ --lazy-unpickle --cuda

# Ensure PyTorch with CUDA is installed
python3 -c "import torch; print(torch.cuda.is_available())"

# If CUDA not available, install it:
pip install torch --index-url https://download.pytorch.org/whl/cu121
```

### Merged model produces gibberish

```bash
# Common causes:
# 1. Merging incompatible model families (e.g., Llama + Mistral)
# 2. Extreme or unbalanced merge weights (for SLERP, keep t roughly in 0.3-0.7)
# 3. Too high density in TIES (try density: 0.3-0.5)

# Diagnostic: test each parent model first
# Then try a 50/50 SLERP merge as baseline

# Check merged model config
cat output_model/config.json | python3 -m json.tool
```

### `FileNotFoundError` for model files

```bash
# List what was downloaded
ls -la models/your-model/

# Required files:
# config.json, tokenizer.json, *.safetensors (or *.bin)

# Re-download with force
huggingface-cli download <repo_id> --local-dir models/<name> --force-download
```

***

## Popular Merge Recipes

### General Assistant + Coding

```yaml
# Great for developers who also want general capability
models:
  - model: mistralai/Mistral-7B-Instruct-v0.3
    parameters: {weight: 1.0, density: 0.7}
  - model: models/Mistral-7B-coding   # any coding fine-tune sharing the Mistral-7B architecture
    parameters: {weight: 0.8, density: 0.5}

merge_method: dare_ties
base_model: mistralai/Mistral-7B-Instruct-v0.3
dtype: bfloat16
```

### Multilingual Boost

```yaml
# Add multilingual capability to an English-centric model
models:
  - model: meta-llama/Llama-3.1-8B-Instruct
    parameters: {weight: 1.0, density: 0.8}
  - model: models/Llama-3.1-8B-multilingual   # any multilingual fine-tune sharing the Llama-3.1-8B architecture
    parameters: {weight: 0.6, density: 0.4}

merge_method: ties
base_model: meta-llama/Llama-3.1-8B-Instruct
dtype: bfloat16
```

***

## Useful Links

* **GitHub**: <https://github.com/arcee-ai/mergekit> ⭐ 5K+
* **Documentation**: <https://github.com/arcee-ai/mergekit/wiki>
* **MergeKit Models on HuggingFace**: <https://huggingface.co/models?other=mergekit>
* **Arcee.ai Discord**: <https://discord.gg/arcee>
* **TIES Paper**: <https://arxiv.org/abs/2306.01708>
* **DARE Paper**: <https://arxiv.org/abs/2311.03099>
* **Clore.ai Marketplace**: <https://clore.ai/marketplace>

***

## Clore.ai GPU Recommendations

| Use Case               | Recommended GPU | Est. Cost on Clore.ai |
| ---------------------- | --------------- | --------------------- |
| Development/Testing    | RTX 3090 (24GB) | \~$0.12/gpu/hr        |
| Model Merging (7B–13B) | RTX 4090 (24GB) | \~$0.70/gpu/hr        |
| Large Models (70B+)    | A100 80GB       | \~$1.20/gpu/hr        |
| Multi-GPU Merging      | 2-4x A100 80GB  | \~$2.40–$4.80/hr      |

> 💡 All examples in this guide can be deployed on [Clore.ai](https://clore.ai/marketplace) GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.
