# Haystack AI Framework

Haystack is deepset's open-source AI orchestration framework for building production-grade LLM applications. With 18K+ GitHub stars, it provides a flexible **pipeline-based architecture** that wires together document stores, retrievers, readers, generators, and agents — all in clean, composable Python. Whether you need RAG over private documents, semantic search, or multi-step agent workflows, Haystack handles the plumbing so you can focus on the application logic.

On Clore.ai, Haystack shines when you need a GPU for local model inference via Hugging Face Transformers or sentence-transformers. If you rely purely on external APIs (OpenAI, Anthropic), you can run it on CPU-only instances — but for embedding generation and local LLMs, a GPU cuts latency dramatically.

{% hint style="success" %}
All examples run on GPU servers rented through the [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

{% hint style="info" %}
This guide covers **Haystack v2.x** (`haystack-ai` package). The v2 API differs significantly from v1 (`farm-haystack`). If you have existing v1 pipelines, see the [migration guide](https://docs.haystack.deepset.ai/docs/migration).
{% endhint %}

## Overview

| Property             | Details                                                                           |
| -------------------- | --------------------------------------------------------------------------------- |
| **Project**          | [deepset-ai/haystack](https://github.com/deepset-ai/haystack)                     |
| **License**          | Apache 2.0                                                                        |
| **GitHub Stars**     | 18K+                                                                              |
| **Version**          | v2.x (`haystack-ai`)                                                              |
| **Primary Use Case** | RAG, semantic search, document QA, agent workflows                                |
| **GPU Support**      | Optional — required for local embeddings / local LLMs                             |
| **Difficulty**       | Medium                                                                            |
| **API Serving**      | Hayhooks (FastAPI-based, REST)                                                    |
| **Key Integrations** | Ollama, OpenAI, Anthropic, HuggingFace, Elasticsearch, Pinecone, Weaviate, Qdrant |

### What You Can Build

* **RAG pipelines** — ingest documents, generate embeddings, retrieve context, answer questions
* **Semantic search** — query documents by meaning, not keywords
* **Document processing** — parse PDFs, HTML, Word docs; split, clean, and index content
* **Agent workflows** — multi-step reasoning with tool use (web search, calculators, APIs)
* **REST API services** — expose any Haystack pipeline as an endpoint via Hayhooks

## Requirements

### Hardware Requirements

| Use Case                                     | GPU        | VRAM  | RAM   | Disk   | Clore.ai Price  |
| -------------------------------------------- | ---------- | ----- | ----- | ------ | --------------- |
| **API mode only** (OpenAI/Anthropic)         | None / CPU | —     | 4 GB  | 20 GB  | \~$0.01–0.05/hr |
| **Local embeddings** (sentence-transformers) | RTX 3060   | 8 GB  | 16 GB | 30 GB  | \~$0.10–0.15/hr |
| **Local embeddings + small LLM** (7B)        | RTX 3090   | 24 GB | 16 GB | 50 GB  | \~$0.20–0.25/hr |
| **Local LLM** (13B–34B)                      | RTX 4090   | 24 GB | 32 GB | 80 GB  | \~$0.35–0.50/hr |
| **Large local LLM** (70B, quantized)         | A100 80GB  | 80 GB | 64 GB | 150 GB | \~$1.10–1.50/hr |

{% hint style="info" %}
For most RAG use cases, an **RTX 3090** at \~$0.20/hr is the sweet spot — 24 GB VRAM handles sentence-transformer embeddings + a 7B–13B local LLM simultaneously.
{% endhint %}

### Software Requirements

* Docker (pre-installed on Clore.ai servers)
* NVIDIA drivers + CUDA (pre-installed on Clore.ai GPU servers)
* Python 3.10+ (inside the container)
* CUDA 11.8 or 12.x

## Quick Start

### 1. Rent a Clore.ai Server

In the [Clore.ai Marketplace](https://clore.ai/marketplace), filter for:

* **VRAM**: ≥ 8 GB for embedding workloads, ≥ 24 GB for local LLMs
* **Docker**: Enabled (default on most listings)
* **Image**: `nvidia/cuda:12.1.0-devel-ubuntu22.04` or `pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime`

Note the server's public IP and SSH port from **My Orders**.

### 2. Connect and Verify GPU

```bash
ssh root@<clore-server-ip> -p <port>

# Verify GPU is available
nvidia-smi

# Expected output shows your GPU, driver version, CUDA version
```
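
If the listing uses the NVIDIA container runtime (the default on Clore.ai GPU servers), it is also worth confirming that containers can see the GPU before building anything. A quick check, assuming the CUDA base image tag below is available on Docker Hub:

```bash
# Confirm GPU passthrough works from inside a container
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```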

### 3. Build the Haystack Docker Image

Haystack v2 is installed with pip (the `haystack-ai` package), so create a custom Dockerfile:

```bash
mkdir -p /workspace/haystack-app && cd /workspace/haystack-app

cat > Dockerfile << 'EOF'
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

# Avoid interactive prompts
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1

# Install Python and system deps
RUN apt-get update && apt-get install -y \
    python3.11 \
    python3.11-dev \
    python3.11-distutils \
    git \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Set python3.11 as default and bootstrap pip for it
# (Ubuntu's apt python3-pip targets the default 3.10 interpreter)
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1 && \
    update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1 && \
    curl -sS https://bootstrap.pypa.io/get-pip.py | python3.11

# Install Haystack v2 and core dependencies
# (pypdf is needed by the PyPDFToDocument converter used later in this guide)
RUN python3 -m pip install --no-cache-dir \
    haystack-ai \
    hayhooks \
    sentence-transformers \
    transformers \
    torch \
    accelerate \
    pypdf \
    fastapi \
    uvicorn

# Install optional integrations
RUN python3 -m pip install --no-cache-dir \
    ollama-haystack \
    haystack-experimental

WORKDIR /app

# Default port for Hayhooks
EXPOSE 1416

CMD ["hayhooks", "run", "--host", "0.0.0.0", "--port", "1416"]
EOF

# Build the image
docker build -t haystack-clore:latest .
```
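
Before wiring up Hayhooks, you can sanity-check the freshly built image — a quick probe that Haystack imports cleanly and PyTorch sees the GPU:

```bash
# Optional: verify the image before starting the API server
docker run --rm --gpus all haystack-clore:latest \
  python3 -c "import haystack, torch; print('CUDA available:', torch.cuda.is_available())"
```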

### 4. Run Haystack with Hayhooks

[Hayhooks](https://github.com/deepset-ai/hayhooks) turns any Haystack pipeline into a REST API automatically:

```bash
# Create a directory for your pipelines
mkdir -p /workspace/haystack-pipelines

# Run Hayhooks with GPU access
docker run -d \
  --name haystack \
  --gpus all \
  -p 1416:1416 \
  -v /workspace/haystack-pipelines:/app/pipelines \
  -e OPENAI_API_KEY=${OPENAI_API_KEY:-""} \
  -e HF_TOKEN=${HF_TOKEN:-""} \
  haystack-clore:latest

# Check it's running
curl http://localhost:1416/status
```

Expected response:

```json
{"status": "ok", "pipelines": []}
```

### 5. Create Your First RAG Pipeline

Write a pipeline YAML that Hayhooks will serve as an endpoint:

```bash
cat > /workspace/haystack-pipelines/rag_pipeline.yml << 'EOF'
# RAG pipeline using Ollama for LLM + local embeddings for retrieval
components:
  embedder:
    type: haystack.components.embedders.SentenceTransformersTextEmbedder
    init_parameters:
      model: BAAI/bge-small-en-v1.5

  retriever:
    type: haystack.components.retrievers.in_memory.InMemoryEmbeddingRetriever
    init_parameters:
      document_store:
        type: haystack.document_stores.in_memory.InMemoryDocumentStore

  prompt_builder:
    type: haystack.components.builders.PromptBuilder
    init_parameters:
      template: |
        Answer the question based on the context below.
        Context: {% for doc in documents %}{{ doc.content }}{% endfor %}
        Question: {{ question }}

  llm:
    type: haystack_integrations.components.generators.ollama.OllamaGenerator
    init_parameters:
      model: llama3
      url: http://host.docker.internal:11434

connections:
  - sender: embedder.embedding
    receiver: retriever.query_embedding
  - sender: retriever.documents
    receiver: prompt_builder.documents
  - sender: prompt_builder.prompt
    receiver: llm.prompt

inputs:
  query:
    - embedder.text
    - prompt_builder.question

outputs:
  answer: llm.replies
EOF
```

Hayhooks automatically discovers and serves this pipeline. Test it:

```bash
# List deployed pipelines
curl http://localhost:1416/pipelines

# Query the RAG pipeline
curl -X POST http://localhost:1416/rag_pipeline/run \
  -H "Content-Type: application/json" \
  -d '{"query": "What is Haystack?"}'
```
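
The same request can be scripted from any machine that can reach port 1416. A minimal client sketch using only the Python standard library, assuming the payload shape from the curl example above:

```bash
cat > query_rag.py << 'EOF'
import json
import urllib.request

# Adjust the host if you reach the server through an SSH tunnel (see "Secure Hayhooks" below)
HAYHOOKS_URL = "http://localhost:1416/rag_pipeline/run"

payload = json.dumps({"query": "What is Haystack?"}).encode("utf-8")
request = urllib.request.Request(
    HAYHOOKS_URL, data=payload, headers={"Content-Type": "application/json"}
)

with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))
EOF

python3 query_rag.py
```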

## Configuration

### Environment Variables

| Variable                     | Description                         | Example               |
| ---------------------------- | ----------------------------------- | --------------------- |
| `OPENAI_API_KEY`             | OpenAI API key for GPT models       | `sk-...`              |
| `ANTHROPIC_API_KEY`          | Anthropic API key for Claude        | `sk-ant-...`          |
| `HF_TOKEN`                   | Hugging Face token for gated models | `hf_...`              |
| `HAYSTACK_TELEMETRY_ENABLED` | Set to `false` to disable telemetry | `false`               |
| `CUDA_VISIBLE_DEVICES`       | Select specific GPU                 | `0`                   |
| `TRANSFORMERS_CACHE`         | Cache path for HF models            | `/workspace/hf-cache` |

### Run with Full Configuration

```bash
docker run -d \
  --name haystack \
  --gpus '"device=0"' \
  -p 1416:1416 \
  -v /workspace/haystack-pipelines:/app/pipelines \
  -v /workspace/hf-cache:/root/.cache/huggingface \
  -e OPENAI_API_KEY="your-key-here" \
  -e HF_TOKEN="your-hf-token" \
  -e HAYSTACK_TELEMETRY_ENABLED=false \
  -e CUDA_VISIBLE_DEVICES=0 \
  --restart unless-stopped \
  haystack-clore:latest
```
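
Components that wrap hosted APIs pick these keys up from the environment on their own, so nothing has to be passed in code. A minimal sketch, assuming the container was started with `OPENAI_API_KEY` set as above and that the model is enabled on your OpenAI account:

```bash
docker exec -i haystack python3 << 'EOF'
from haystack.components.generators import OpenAIGenerator

# No api_key argument needed — the component falls back to the OPENAI_API_KEY env var
generator = OpenAIGenerator(model="gpt-4o-mini")
print(generator.run(prompt="Say hello from Haystack")["replies"][0])
EOF
```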

### Document Ingestion Pipeline

Build a separate indexing pipeline to ingest documents:

```bash
cat > /workspace/index_documents.py << 'EOF'
from pathlib import Path

from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.preprocessors import DocumentSplitter, DocumentCleaner
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Initialize document store
document_store = InMemoryDocumentStore()

# Build indexing pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("converter", PyPDFToDocument())
indexing_pipeline.add_component("cleaner", DocumentCleaner())
indexing_pipeline.add_component("splitter", DocumentSplitter(
    split_by="word",
    split_length=200,
    split_overlap=20
))
indexing_pipeline.add_component("embedder", SentenceTransformersDocumentEmbedder(
    model="BAAI/bge-small-en-v1.5"
))
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))

# Connect components
indexing_pipeline.connect("converter", "cleaner")
indexing_pipeline.connect("cleaner", "splitter")
indexing_pipeline.connect("splitter", "embedder")
indexing_pipeline.connect("embedder", "writer")

# Run indexing over all PDFs in the mounted documents directory
indexing_pipeline.run({"converter": {"sources": list(Path("/data/documents").glob("*.pdf"))}})

print(f"Indexed {document_store.count_documents()} document chunks")
EOF

docker run --rm \
  --gpus all \
  -v /workspace:/workspace \
  -v /your/documents:/data/documents \
  -v /workspace/hf-cache:/root/.cache/huggingface \
  haystack-clore:latest \
  python3 /workspace/index_documents.py
```
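
Keep in mind that `InMemoryDocumentStore` only lives as long as the Python process, so an index built in a throwaway container is gone when it exits. Recent `haystack-ai` releases add `save_to_disk()` / `load_from_disk()` helpers to the in-memory store; a sketch of persisting the index to the mounted workspace (check that your installed version has these methods):

```bash
# Append a save step to the indexing script, so the index survives the container
cat >> /workspace/index_documents.py << 'EOF'

# Persist the populated store as JSON on the mounted volume
document_store.save_to_disk("/workspace/document_store.json")
EOF

# At query time, reload instead of re-embedding:
#   from haystack.document_stores.in_memory import InMemoryDocumentStore
#   document_store = InMemoryDocumentStore.load_from_disk("/workspace/document_store.json")
```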

### Using Vector Databases (Production)

For production workloads, replace the in-memory store with a persistent vector database:

```bash
# Launch Qdrant alongside Haystack
docker network create haystack-net

docker run -d \
  --name qdrant \
  --network haystack-net \
  -p 6333:6333 \
  -v /workspace/qdrant-data:/qdrant/storage \
  qdrant/qdrant

# Install Qdrant integration in Haystack container
# Add to Dockerfile:  RUN pip install qdrant-haystack
# Then use QdrantDocumentStore instead of InMemoryDocumentStore
```
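
Once the Qdrant container is up and `qdrant-haystack` is installed in the image, switching stores is a small code change. A sketch, assuming the Haystack code runs on `haystack-net` so the hostname `qdrant` resolves, and that the embedding dimension matches your embedder (384 for `bge-small-en-v1.5`, 768 for `bge-base-en-v1.5`):

```bash
cat > /workspace/qdrant_store_example.py << 'EOF'
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

# Persistent store backed by the Qdrant container on the shared Docker network
document_store = QdrantDocumentStore(
    url="http://qdrant:6333",
    index="documents",
    embedding_dim=384,
    recreate_index=False,
)
print("Documents in Qdrant:", document_store.count_documents())
EOF

docker run --rm --network haystack-net \
  -v /workspace:/workspace \
  haystack-clore:latest \
  python3 /workspace/qdrant_store_example.py
```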

## GPU Acceleration

Haystack puts the GPU to work in three main scenarios:

### 1. Embedding Generation (Sentence Transformers)

A GPU is highly beneficial when embedding large document collections:

```bash
cat > /workspace/benchmark_embeddings.py << 'EOF'
import time
import torch
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack import Document

# Check GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# Create embedder
embedder = SentenceTransformersDocumentEmbedder(
    model="BAAI/bge-base-en-v1.5"
)
embedder.warm_up()

# Benchmark
docs = [Document(content=f"Sample document {i} with some text content.") for i in range(100)]

start = time.time()
result = embedder.run(documents=docs)
elapsed = time.time() - start

print(f"Embedded 100 documents in {elapsed:.2f}s ({100/elapsed:.0f} docs/sec)")
EOF

docker run --rm --gpus all \
  -v /workspace:/workspace \
  haystack-clore:latest \
  python3 /workspace/benchmark_embeddings.py
```

### 2. Local LLM Inference (Hugging Face Transformers)

For running LLMs directly in Haystack without Ollama:

```bash
cat > /workspace/local_llm_pipeline.py << 'EOF'
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import HuggingFaceLocalGenerator

# Uses GPU automatically when available
generator = HuggingFaceLocalGenerator(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    task="text-generation",
    generation_kwargs={
        "max_new_tokens": 512,
        "temperature": 0.7,
        "do_sample": True,
    }
)

prompt_builder = PromptBuilder(template="Answer this question: {{ question }}")

pipeline = Pipeline()
pipeline.add_component("prompt_builder", prompt_builder)
pipeline.add_component("llm", generator)
pipeline.connect("prompt_builder.prompt", "llm.prompt")

result = pipeline.run({"prompt_builder": {"question": "What is RAG?"}})
print(result["llm"]["replies"][0])
EOF

docker run --rm --gpus all \
  -v /workspace:/workspace \
  -e HF_TOKEN="your-hf-token" \
  haystack-clore:latest \
  python3 /workspace/local_llm_pipeline.py
```

### 3. Pair with Ollama (Recommended Approach)

For the best combination of ease and performance, run Ollama for LLM inference and Haystack for orchestration:

```bash
# Step 1: Start Ollama (see Ollama guide)
docker run -d \
  --name ollama \
  --gpus all \
  -p 11434:11434 \
  -v /workspace/ollama:/root/.ollama \
  ollama/ollama

# Step 2: Pull a chat model and an embedding model
docker exec ollama ollama pull llama3
docker exec ollama ollama pull nomic-embed-text  # For embeddings via Ollama

# Step 3: Start Haystack pointing to Ollama
docker run -d \
  --name haystack \
  --gpus '"device=0"' \
  -p 1416:1416 \
  --add-host=host.docker.internal:host-gateway \
  -v /workspace/haystack-pipelines:/app/pipelines \
  haystack-clore:latest
```
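
The `ollama-haystack` integration baked into the image exposes both generator and embedder components that talk to the Ollama container. A minimal sketch (model names must match what you pulled, and the URL relies on the `--add-host` mapping above):

```bash
docker exec -i haystack python3 << 'EOF'
from haystack_integrations.components.embedders.ollama import OllamaTextEmbedder
from haystack_integrations.components.generators.ollama import OllamaGenerator

OLLAMA_URL = "http://host.docker.internal:11434"

# Embeddings served by the nomic-embed-text model pulled earlier
embedder = OllamaTextEmbedder(model="nomic-embed-text", url=OLLAMA_URL)
print(len(embedder.run(text="Haystack on Clore.ai")["embedding"]), "dimensions")

# Generation served by llama3
generator = OllamaGenerator(model="llama3", url=OLLAMA_URL)
print(generator.run(prompt="In one sentence, what is RAG?")["replies"][0])
EOF
```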

Monitor GPU usage across both containers:

```bash
watch -n 2 nvidia-smi
```

## Tips & Best Practices

### Choose the Right Embedding Model

| Model                          | VRAM     | Speed   | Quality   | Best For                 |
| ------------------------------ | -------- | ------- | --------- | ------------------------ |
| `BAAI/bge-small-en-v1.5`       | \~0.5 GB | Fastest | Good      | High-throughput indexing |
| `BAAI/bge-base-en-v1.5`        | \~1 GB   | Fast    | Better    | General RAG              |
| `BAAI/bge-large-en-v1.5`       | \~2 GB   | Medium  | Best      | Highest accuracy         |
| `nomic-ai/nomic-embed-text-v1` | \~1.5 GB | Fast    | Excellent | Long documents           |

### Pipeline Design Tips

* **Split documents wisely** — 200–400 word chunks with 10–15% overlap work well for most RAG use cases
* **Cache embeddings** — persist your document store to disk; re-embedding is expensive
* **Use `warm_up()`** — call `component.warm_up()` before production use to load models into GPU memory
* **Batch indexing** — process documents in batches of 32–64 for optimal GPU utilization
* **Filter with metadata** — use Haystack's metadata filtering to scope retrieval (e.g., by date, source, category); see the sketch below

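A sketch of the metadata-filtering tip, using Haystack 2.x's comparison/logic filter syntax. It uses the BM25 retriever so no model download is needed; the same `filters` argument works on `InMemoryEmbeddingRetriever`:

```bash
cat > /workspace/filter_demo.py << 'EOF'
from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Attach metadata when indexing...
store = InMemoryDocumentStore()
store.write_documents([
    Document(content="Q3 revenue grew 12%.", meta={"source": "finance", "year": 2024}),
    Document(content="New office opened in Lisbon.", meta={"source": "hr", "year": 2023}),
])

# ...then scope retrieval with a filter
retriever = InMemoryBM25Retriever(document_store=store)
result = retriever.run(
    query="revenue",
    filters={
        "operator": "AND",
        "conditions": [
            {"field": "meta.source", "operator": "==", "value": "finance"},
            {"field": "meta.year", "operator": ">=", "value": 2024},
        ],
    },
)
print([d.content for d in result["documents"]])
EOF

docker exec -i haystack python3 < /workspace/filter_demo.py
```
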
### Cost Optimization

```bash
# Use spot-style pricing on Clore.ai — choose servers with lower $/hr
# For development/testing: RTX 3060 (~$0.10/hr) is sufficient for embedding
# For production embedding: RTX 3090 (~$0.20/hr) — 24 GB handles large batches
# For local LLM + embedding: A100 40GB (~$0.60/hr) — headroom for concurrent users

# Monitor resource usage
docker stats haystack
nvidia-smi dmon -s u -d 5  # GPU utilization every 5 seconds
```

### Secure Hayhooks for External Access

```bash
# Option 1: SSH tunnel (simplest, for personal use)
# From your local machine:
ssh -L 1416:localhost:1416 root@<clore-ip> -p <clore-ssh-port>
# Then access http://localhost:1416 locally

# Option 2: Add basic auth via an nginx reverse proxy (see the sample config below)
docker run -d \
  --name nginx-proxy \
  -p 80:80 \
  --add-host=host.docker.internal:host-gateway \
  -v /workspace/nginx.conf:/etc/nginx/conf.d/default.conf \
  -v /workspace/.htpasswd:/etc/nginx/.htpasswd \
  nginx:alpine
```
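
A minimal `nginx.conf` sketch for option 2 — it assumes you create the htpasswd file first (here with `htpasswd` from `apache2-utils`) and that port 80 is exposed in your Clore.ai order:

```bash
# Create credentials for basic auth
apt-get update && apt-get install -y apache2-utils
htpasswd -bc /workspace/.htpasswd admin 'choose-a-strong-password'

# Reverse proxy config: basic auth in front of Hayhooks
cat > /workspace/nginx.conf << 'EOF'
server {
    listen 80;

    location / {
        auth_basic           "Haystack API";
        auth_basic_user_file /etc/nginx/.htpasswd;

        proxy_pass         http://host.docker.internal:1416;
        proxy_set_header   Host $host;
        proxy_set_header   X-Real-IP $remote_addr;
    }
}
EOF

docker restart nginx-proxy
```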

## Troubleshooting

| Problem                            | Likely Cause                | Solution                                                                                           |
| ---------------------------------- | --------------------------- | -------------------------------------------------------------------------------------------------- |
| `ModuleNotFoundError: haystack`    | Package not installed       | Rebuild Docker image; check `pip install haystack-ai` succeeded                                    |
| `CUDA out of memory`               | Embedding model too large   | Use `bge-small-en-v1.5` or reduce batch size                                                       |
| Hayhooks returns 404 on pipeline   | YAML file not found         | Check volume mount; pipeline file must be in `/app/pipelines/`                                     |
| Slow embedding on CPU              | GPU not detected            | Verify `--gpus all` flag; check `torch.cuda.is_available()`                                        |
| Ollama connection refused          | Wrong hostname              | Use `--add-host=host.docker.internal:host-gateway`; set URL to `http://host.docker.internal:11434` |
| HuggingFace download fails         | Missing token or rate limit | Set `HF_TOKEN` env var; ensure model is not gated                                                  |
| Pipeline YAML parse error          | Invalid syntax              | Validate YAML; use `python3 -c "import yaml; yaml.safe_load(open('pipeline.yml'))"`                |
| Container exits immediately        | Startup error               | Check `docker logs haystack`; ensure Dockerfile CMD is correct                                     |
| Port 1416 not reachable externally | Firewall / port forwarding  | Expose port in Clore.ai order settings; check server's open ports                                  |

### Debug Commands

```bash
# Check container logs
docker logs haystack --tail 50 -f

# Test Hayhooks API
curl http://localhost:1416/status
curl http://localhost:1416/pipelines

# Interactive Python debug session
docker exec -it haystack python3

# Check GPU inside container
docker exec haystack python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"

# Check installed packages
docker exec haystack python3 -m pip show haystack-ai hayhooks
```

## Further Reading

* [Haystack Documentation](https://docs.haystack.deepset.ai/) — official v2 docs
* [Hayhooks GitHub](https://github.com/deepset-ai/hayhooks) — REST API serving for pipelines
* [Haystack Cookbook](https://haystack.deepset.ai/cookbook) — end-to-end tutorials (RAG, agents, search)
* [deepset-ai/haystack on GitHub](https://github.com/deepset-ai/haystack) — source, issues, releases
* [Haystack Integrations](https://haystack.deepset.ai/integrations) — full list of supported vector stores, LLMs, and tools
* [Ollama on Clore.ai](https://docs.clore.ai/guides/language-models/ollama) — pair Haystack with Ollama for local LLM inference
* [vLLM on Clore.ai](https://docs.clore.ai/guides/language-models/vllm) — high-throughput LLM serving backend for Haystack
* [GPU Comparison Guide](https://docs.clore.ai/guides/getting-started/gpu-comparison) — choose the right Clore.ai GPU for your workload
* [CLORE.AI Marketplace](https://clore.ai/marketplace) — rent GPU servers


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/ai-platforms-and-agents/haystack.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
