# Mergekit Model Merging

**Mergekit** is the definitive toolkit for merging pretrained large language models. With 5K+ GitHub stars, it implements every major model merging algorithm — SLERP, TIES, DARE, DARE-TIES, MoE merging, and more — enabling you to create powerful new models without any training data or GPU training time.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

***

## What is Mergekit?

Model merging is a powerful technique that combines the strengths of multiple LLMs into a single model:

* **No training required** — merge happens in weight space, not through backprop
* **Combine capabilities** — blend a coding model with an instruction-following model
* **Reduce weaknesses** — average out individual model failures across an ensemble
* **Create Mixture of Experts** — combine models into a sparse MoE architecture
* **Domain adaptation** — merge base model with domain-specialized models

Mergekit implements all state-of-the-art algorithms:

| Algorithm           | Description                                       | Best For                                            |
| ------------------- | ------------------------------------------------- | --------------------------------------------------- |
| **SLERP**           | Spherical linear interpolation between two models | Smooth blending of two similar models               |
| **TIES**            | Trim redundant parameters, elect signs, merge     | Combining multiple models with minimal interference |
| **DARE**            | Drop and rescale random parameters                | Reducing parameter interference in large merges     |
| **DARE-TIES**       | DARE + TIES combined                              | Best all-around for multi-model merges              |
| **Linear**          | Simple weighted average                           | Quick baseline merges                               |
| **Task Arithmetic** | Add/subtract task vectors                         | Adding/removing specific capabilities               |
| **Passthrough**     | Copy layers directly                              | MoE construction                                    |
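
To make "merging in weight space" concrete, here is what the simplest method above (a linear weighted average) looks like in plain PyTorch. This is an illustration of the idea rather than Mergekit's implementation, and the model paths are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM

# Two fine-tunes of the same base architecture (placeholder paths)
model_a = AutoModelForCausalLM.from_pretrained("models/model-a", torch_dtype=torch.bfloat16)
model_b = AutoModelForCausalLM.from_pretrained("models/model-b", torch_dtype=torch.bfloat16)

alpha = 0.5  # blend ratio: 1.0 = all model_a, 0.0 = all model_b
state_b = model_b.state_dict()
merged = {
    name: alpha * tensor + (1 - alpha) * state_b[name]
    for name, tensor in model_a.state_dict().items()
}

model_a.load_state_dict(merged)
model_a.save_pretrained("models/linear-merged")
```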

{% hint style="info" %}
Model merging is surprisingly effective. Merged models often outperform their parents on benchmarks by combining complementary knowledge. The Mergekit community on HuggingFace hosts thousands of merged models.
{% endhint %}

***

## Server Requirements

| Component | Minimum                           | Recommended                   |
| --------- | --------------------------------- | ----------------------------- |
| GPU       | Not required (CPU merge possible) | A100 40 GB for large models   |
| VRAM      | —                                 | 80 GB for 70B model merges    |
| RAM       | 32 GB                             | 64 GB+ (models load into RAM) |
| CPU       | 8 cores                           | 16+ cores                     |
| Storage   | 100 GB                            | 500 GB+                       |
| OS        | Ubuntu 20.04+                     | Ubuntu 22.04                  |
| Python    | 3.10+                             | 3.11                          |

{% hint style="warning" %}
For CPU-only merging (the most common mode), RAM is your bottleneck. Merging two 7B models in bf16 requires \~28 GB RAM minimum. Use `--lazy-unpickle` for lower memory usage.
{% endhint %}
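
A quick way to budget RAM is bytes per parameter times the number of models held in memory at once. A back-of-the-envelope helper, assuming bf16 (2 bytes per parameter) and ignoring working-memory overhead:

```python
def merge_ram_gb(params_billion: float, n_models: int = 2, bytes_per_param: int = 2) -> float:
    """Rough lower bound on RAM needed to hold n_models in memory at once."""
    return params_billion * 1e9 * bytes_per_param * n_models / 2**30

print(f"{merge_ram_gb(7):.0f} GB")   # two 7B models in bf16  -> ~26 GB
print(f"{merge_ram_gb(70):.0f} GB")  # two 70B models in bf16 -> ~261 GB
```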

***

## Ports

| Port | Service | Notes                             |
| ---- | ------- | --------------------------------- |
| 22   | SSH     | Terminal access and file transfer |

Mergekit runs as a command-line tool — no web server needed.

***

## Installation on Clore.ai

### Step 1 — Rent a Server

1. Go to [Clore.ai Marketplace](https://clore.ai/marketplace)
2. Filter for **RAM ≥ 64 GB** (critical for large model merges)
3. Choose **Storage ≥ 500 GB** (merged models need space for 2-4 input models + output)
4. GPU is optional but useful if you want to test the merged model afterward
5. Open port **22** only

### Step 2 — Connect via SSH

```bash
ssh root@<server-ip> -p <ssh-port>
```

### Step 3 — Install Python Environment

```bash
# Install Python 3.11
apt-get update
apt-get install -y python3.11 python3.11-venv git  # pip is provided inside the venv created below

# Create virtual environment
python3.11 -m venv /opt/mergekit
source /opt/mergekit/bin/activate
```

### Step 4 — Install Mergekit

```bash
# Install from PyPI
pip install mergekit

# Or install from source (recommended for latest features)
git clone https://github.com/arcee-ai/mergekit.git
cd mergekit
pip install -e .  # add extras as needed, e.g. pip install -e '.[evolve]'
```

### Step 5 — Install HuggingFace CLI

```bash
pip install huggingface_hub
huggingface-cli login  # Enter your HF token
```

### Step 6 — Verify Installation

```bash
mergekit-yaml --help
mergekit-moe --help
```
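
If both commands print usage text, the install succeeded. You can also confirm the installed package version from Python:

```python
from importlib.metadata import version

print(version("mergekit"))
```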

***

## Downloading Models to Merge

```bash
# Download models you want to merge
# Using huggingface_hub

python3 << 'EOF'
from huggingface_hub import snapshot_download

# Download model 1
snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.3",
    local_dir="models/Mistral-7B-Instruct-v0.3"
)

# Download model 2
snapshot_download(
    repo_id="meta-llama/Llama-3.2-8B-Instruct",
    local_dir="models/Llama-3.2-8B-Instruct",
    token="hf_your-token"  # Required for gated models
)
EOF

# Or use huggingface_hub CLI
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3 \
  --local-dir models/Mistral-7B-Instruct-v0.3
```
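
Before writing a merge config, it helps to confirm that each download completed and to see how much disk it occupies. A small sketch:

```python
from pathlib import Path

# Report the on-disk size of every downloaded model
for model_dir in sorted(Path("models").iterdir()):
    if model_dir.is_dir():
        size_gb = sum(f.stat().st_size for f in model_dir.rglob("*") if f.is_file()) / 2**30
        print(f"{model_dir.name}: {size_gb:.1f} GB")
```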

***

## Merge Configurations

Mergekit uses YAML configuration files to define merges.

### Example 1: SLERP Merge (Two Models)

SLERP blends two models along a spherical arc — best for models of the same architecture:

```yaml
# slerp_merge.yaml
merge_method: slerp
base_model: models/Mistral-7B-Instruct-v0.3

slices:
  - sources:
    - model: models/Mistral-7B-Instruct-v0.3
      layer_range: [0, 32]
    - model: models/OpenHermes-2.5-Mistral-7B
      layer_range: [0, 32]

parameters:
  t:
    - filter: self_attn
      value: 0.5  # 50/50 blend for attention layers
    - filter: mlp
      value: 0.3  # 30% from model 2 for MLP layers
    - value: 0.5  # default for everything else

dtype: bfloat16
```

```bash
mergekit-yaml slerp_merge.yaml merged-model/ --lazy-unpickle
```
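
For intuition, SLERP interpolates along the arc between two weight tensors rather than the straight line between them, which preserves magnitudes better than a plain average when the tensors point in different directions. A simplified per-tensor sketch (Mergekit's actual implementation handles more edge cases):

```python
import torch

def slerp(t: float, a: torch.Tensor, b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors (illustrative)."""
    a_n = a / (a.norm() + eps)  # unit-norm copies, used only to measure the angle
    b_n = b / (b.norm() + eps)
    omega = torch.arccos(torch.clamp((a_n * b_n).sum(), -1.0, 1.0))
    so = torch.sin(omega)
    if so.abs() < eps:  # nearly parallel tensors: fall back to linear interpolation
        return (1 - t) * a + t * b
    return (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b
```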

### Example 2: TIES Merge (Multiple Models)

TIES handles interference between multiple merged models:

```yaml
# ties_merge.yaml
models:
  - model: models/Mistral-7B-v0.3
    # Base model: no weight/density needed (deltas are measured against it)

  - model: models/Mistral-7B-coding
    parameters:
      weight: 0.7    # Coding capability
      density: 0.5   # Keep 50% of changed parameters

  - model: models/Mistral-7B-math
    parameters:
      weight: 0.5    # Math capability
      density: 0.3   # Keep 30% of changed parameters

merge_method: ties
base_model: models/Mistral-7B-v0.3

parameters:
  normalize: true
  int8_mask: true

dtype: bfloat16
```

```bash
mergekit-yaml ties_merge.yaml merged-ties/ --lazy-unpickle
```
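
Conceptually, TIES does three things to each tensor: trim every task vector (fine-tune minus base) to its largest-magnitude entries according to `density`, elect a majority sign across models, and average only the entries that agree with that sign. A compact sketch of the logic, not Mergekit's optimized implementation:

```python
import torch

def ties_merge(base, tuned_models, densities, weights):
    """Illustrative TIES merge for a single weight tensor."""
    deltas = []
    for tuned, density, w in zip(tuned_models, densities, weights):
        delta = tuned - base
        k = max(1, int(delta.numel() * density))
        thresh = delta.abs().flatten().topk(k).values.min()  # trim: keep top-k by magnitude
        delta = torch.where(delta.abs() >= thresh, delta, torch.zeros_like(delta))
        deltas.append(w * delta)
    stacked = torch.stack(deltas)
    sign = stacked.sum(dim=0).sign()  # elect the majority sign per entry
    mask = stacked.sign() == sign     # keep only entries that agree with it
    merged = (stacked * mask).sum(dim=0) / mask.sum(dim=0).clamp(min=1)
    return base + merged
```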

### Example 3: DARE-TIES Merge (Best All-Around)

```yaml
# dare_ties_merge.yaml
models:
  - model: models/Llama-3.1-8B-Instruct
    # Base model: no weight/density needed

  - model: models/Llama-3.1-8B-code
    parameters:
      weight: 0.8
      density: 0.5   # DARE randomly keeps ~50% of delta parameters, rescaled

  - model: models/Llama-3.1-8B-math
    parameters:
      weight: 0.6
      density: 0.4

merge_method: dare_ties
base_model: models/Llama-3.1-8B-Instruct

parameters:
  normalize: true
  int8_mask: true

dtype: bfloat16
```

```bash
mergekit-yaml dare_ties_merge.yaml merged-dare-ties/ --lazy-unpickle
```
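
DARE's addition on top of TIES is a drop-and-rescale step: each task vector is randomly sparsified to the configured `density`, and the survivors are rescaled by `1/density` so the expected delta is preserved. Roughly:

```python
import torch

def dare(delta: torch.Tensor, density: float) -> torch.Tensor:
    """Illustrative DARE: randomly drop delta entries, rescale the rest."""
    mask = torch.rand_like(delta, dtype=torch.float32) < density
    return delta * mask / density
```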

### Example 4: Task Arithmetic (Add Capabilities)

Add a "skill delta" to a base model:

```yaml
# task_arithmetic.yaml
# Adds the math skill from a math-tuned model to a general base model
models:
  - model: models/Llama-3.1-8B-Instruct
    parameters:
      weight: 1.0

  - model: models/Llama-3.1-8B-math
    parameters:
      weight: 0.7   # Positive = add this capability

  # To REMOVE a capability, use a negative weight:
  # - model: models/Llama-3.1-8B-harmful
  #   parameters:
  #     weight: -0.5

merge_method: task_arithmetic
base_model: models/Llama-3.1-8B-Instruct

dtype: bfloat16
```
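
The arithmetic behind this method is simple: the merged weights are the base plus a weighted sum of task vectors. For one tensor, the config above boils down to the following (stand-in tensors for illustration):

```python
import torch

base = torch.randn(4096)                      # stand-in for one base weight tensor
math_tuned = base + 0.01 * torch.randn(4096)  # stand-in for the math fine-tune

# task_arithmetic with weight 0.7 on the math model:
merged = base + 0.7 * (math_tuned - base)
```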

### Example 5: MoE (Mixture of Experts)

Combine models into a sparse MoE architecture:

```yaml
# moe_merge.yaml
base_model: models/Llama-3.1-8B-Instruct

gate_mode: hidden  # Use hidden states to route to experts
dtype: bfloat16
experts:
  - source_model: models/Llama-3.1-8B-coding
    positive_prompts:
      - "Write code"
      - "Debug this function"
      - "Implement an algorithm"
    negative_prompts:
      - "Tell me a story"
      - "Explain history"
  
  - source_model: models/Llama-3.1-8B-creative
    positive_prompts:
      - "Write a story"
      - "Be creative"
      - "Imagine"
    negative_prompts:
      - "Write code"
      - "Calculate"
```

```bash
mergekit-moe moe_merge.yaml merged-moe/ --lazy-unpickle
```

***

## Running the Merge

### Basic Command

```bash
# Activate environment
source /opt/mergekit/bin/activate

# Run merge with lazy unpickling (saves RAM)
mergekit-yaml your_config.yaml output_model/ --lazy-unpickle

# With CUDA acceleration (if GPU available)
mergekit-yaml your_config.yaml output_model/ \
  --lazy-unpickle \
  --cuda \
  --copy-tokenizer

# Low memory mode (slower but works on smaller servers)
mergekit-yaml your_config.yaml output_model/ \
  --lazy-unpickle \
  --low-cpu-memory
```

### Monitor Progress

```bash
# Mergekit prints progress as it plans and executes the merge graph,
# processing the model tensor by tensor, then writes the merged weights
# plus config and tokenizer files to the output directory.
# Watch RAM usage from a second terminal while the merge runs:
watch -n 5 free -h
```

***

## Testing the Merged Model

```bash
# Quick test with transformers
python3 << 'EOF'
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = "output_model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

prompt = "Explain the difference between LoRA and full finetuning:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
EOF
```
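
To judge whether the merge actually helped, run the same prompt through a parent model and compare the outputs side by side. A small loop sketch, with paths taken from the earlier examples:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

prompt = "Explain the difference between LoRA and full finetuning:"

for path in ["models/Mistral-7B-Instruct-v0.3", "output_model"]:
    tok = AutoTokenizer.from_pretrained(path)
    model = AutoModelForCausalLM.from_pretrained(
        path, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    print(f"=== {path} ===")
    print(tok.decode(out[0], skip_special_tokens=True))
```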

***

## Publishing to HuggingFace

```bash
# Login
huggingface-cli login

# Create and push repository
python3 << 'EOF'
from huggingface_hub import HfApi
api = HfApi()

# Create repo
api.create_repo("your-username/my-merged-model-7b", private=False, exist_ok=True)

# Upload merged model
api.upload_folder(
    folder_path="output_model/",
    repo_id="your-username/my-merged-model-7b",
    repo_type="model"
)
print("Uploaded!")
EOF
```

***

## Advanced: Evolutionary Merge

Use Mergekit's evolutionary optimizer to find optimal merge weights:

```bash
# Install the evolutionary optimizer extras
pip install 'mergekit[evolve]'

# Run evolutionary search; the models, merge method, and lm-eval
# tasks to optimize against are defined in the genome config file
mergekit-evolve evolve_config.yaml \
  --storage-path ./evolve-workspace \
  --max-fevals 100
```

***

## Troubleshooting

### Out of Memory (OOM) during merge

```bash
# Always use --lazy-unpickle for large models
mergekit-yaml config.yaml out/ --lazy-unpickle

# Add --low-cpu-memory flag
mergekit-yaml config.yaml out/ --lazy-unpickle --low-cpu-memory

# Check available RAM before merging
free -h

# For 7B models you need ~30 GB RAM minimum
# For 13B models: ~60 GB RAM
# For 70B models: ~280 GB RAM (or use a high-RAM CPU server)
```

### `ValueError: models are not compatible`

```bash
# Models must have the same architecture
# You cannot merge Llama-3 with Mistral directly
# Check model configs
python3 -c "
import json
for path in ['models/model1/config.json', 'models/model2/config.json']:
    with open(path) as f:
        cfg = json.load(f)
    print(path, ':', cfg.get('model_type'), cfg.get('hidden_size'), cfg.get('num_hidden_layers'))
"
```

### Merge is very slow

```bash
# Use GPU for faster tensor operations
mergekit-yaml config.yaml out/ --lazy-unpickle --cuda

# Ensure PyTorch with CUDA is installed
python3 -c "import torch; print(torch.cuda.is_available())"

# If CUDA not available, install it:
pip install torch --index-url https://download.pytorch.org/whl/cu121
```

### Merged model produces gibberish

```bash
# Common causes:
# 1. Merging incompatible model families (e.g., Llama + Mistral)
# 2. Mismatched tokenizers between parents (re-run with --copy-tokenizer)
# 3. Too high density in TIES (try density: 0.3-0.5)

# Diagnostic: test each parent model first
# Then try a 50/50 SLERP merge as baseline

# Check merged model config
cat output_model/config.json | python3 -m json.tool
```
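
A quick numeric sanity check is to measure the model's loss on a plain English sentence: a healthy model scores low, while a badly broken merge usually lands far higher. A sketch (the threshold in the comment is a rough heuristic, not an official metric):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("output_model")
model = AutoModelForCausalLM.from_pretrained(
    "output_model", torch_dtype=torch.bfloat16, device_map="auto"
)

ids = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt").input_ids
ids = ids.to(model.device)
with torch.no_grad():
    loss = model(input_ids=ids, labels=ids).loss
print(f"loss: {loss.item():.2f}")  # gibberish-producing merges often score far above ~5
```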

### `FileNotFoundError` for model files

```bash
# List what was downloaded
ls -la models/your-model/

# Required files:
# config.json, tokenizer.json, *.safetensors (or *.bin)

# Re-download with force
huggingface-cli download <repo_id> --local-dir models/<name> --force-download
```

***

## Popular Merge Recipes

### General Assistant + Coding

```yaml
# Great for developers who also want general capability.
# The second model is a placeholder: substitute any coding fine-tune
# that shares the Mistral-7B architecture.
models:
  - model: mistralai/Mistral-7B-Instruct-v0.3
    parameters: {weight: 1.0, density: 0.7}
  - model: your-org/Mistral-7B-code-finetune
    parameters: {weight: 0.8, density: 0.5}

merge_method: dare_ties
base_model: mistralai/Mistral-7B-Instruct-v0.3
dtype: bfloat16
```

### Multilingual Boost

```yaml
# Add multilingual capabilities to an English-centric model.
# The second model is a placeholder: substitute a multilingual
# fine-tune that shares the Llama-3.1-8B architecture.
models:
  - model: meta-llama/Llama-3.1-8B-Instruct
    parameters: {weight: 1.0, density: 0.8}
  - model: your-org/Llama-3.1-8B-multilingual
    parameters: {weight: 0.6, density: 0.4}

merge_method: ties
base_model: meta-llama/Llama-3.1-8B-Instruct
dtype: bfloat16
```

***

## Useful Links

* **GitHub**: <https://github.com/arcee-ai/mergekit> ⭐ 5K+
* **Documentation**: <https://github.com/arcee-ai/mergekit/wiki>
* **MergeKit Models on HuggingFace**: <https://huggingface.co/models?other=mergekit>
* **Arcee.ai Discord**: <https://discord.gg/arcee>
* **TIES Paper**: <https://arxiv.org/abs/2306.01708>
* **DARE Paper**: <https://arxiv.org/abs/2311.03099>
* **Clore.ai Marketplace**: <https://clore.ai/marketplace>

***

## Clore.ai GPU Recommendations

| Use Case               | Recommended GPU | Est. Cost on Clore.ai |
| ---------------------- | --------------- | --------------------- |
| Development/Testing    | RTX 3090 (24GB) | \~$0.12/gpu/hr        |
| Model Merging (7B–13B) | RTX 4090 (24GB) | \~$0.70/gpu/hr        |
| Large Models (70B+)    | A100 80GB       | \~$1.20/gpu/hr        |
| Multi-GPU Merging      | 2-4x A100 80GB  | \~$2.40–$4.80/hr      |

> 💡 All examples in this guide can be deployed on [Clore.ai](https://clore.ai/marketplace) GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.

