Haystack AI Framework

Deploy Haystack by deepset on Clore.ai — build production RAG pipelines, semantic search, and LLM agent workflows on affordable GPU infrastructure.

Haystack is deepset's open-source AI orchestration framework for building production-grade LLM applications. With 18K+ GitHub stars, it provides a flexible pipeline-based architecture that wires together document stores, retrievers, readers, generators, and agents — all in clean, composable Python. Whether you need RAG over private documents, semantic search, or multi-step agent workflows, Haystack handles the plumbing so you can focus on the application logic.

On Clore.ai, Haystack shines when you need a GPU for local model inference via Hugging Face Transformers or sentence-transformers. If you rely purely on external APIs (OpenAI, Anthropic), you can run it on CPU-only instances — but for embedding generation and local LLMs, a GPU cuts latency dramatically.


This guide covers Haystack v2.x (haystack-ai package). The v2 API differs significantly from v1 (farm-haystack). If you have existing v1 pipelines, see the migration guide.

Overview

| Property | Details |
| --- | --- |
| License | Apache 2.0 |
| GitHub Stars | 18K+ |
| Version | v2.x (haystack-ai) |
| Primary Use Case | RAG, semantic search, document QA, agent workflows |
| GPU Support | Optional — required for local embeddings / local LLMs |
| Difficulty | Medium |
| API Serving | Hayhooks (FastAPI-based, REST) |
| Key Integrations | Ollama, OpenAI, Anthropic, HuggingFace, Elasticsearch, Pinecone, Weaviate, Qdrant |

What You Can Build

  • RAG pipelines — ingest documents, generate embeddings, retrieve context, answer questions

  • Semantic search — query documents by meaning, not keywords

  • Document processing — parse PDFs, HTML, Word docs; split, clean, and index content

  • Agent workflows — multi-step reasoning with tool use (web search, calculators, APIs)

  • REST API services — expose any Haystack pipeline as an endpoint via Hayhooks

Requirements

Hardware Requirements

| Use Case | GPU | VRAM | RAM | Disk | Clore.ai Price |
| --- | --- | --- | --- | --- | --- |
| API mode only (OpenAI/Anthropic) | None / CPU | N/A | 4 GB | 20 GB | ~$0.01–0.05/hr |
| Local embeddings (sentence-transformers) | RTX 3060 | 8 GB | 16 GB | 30 GB | ~$0.10–0.15/hr |
| Local embeddings + small LLM (7B) | RTX 3090 | 24 GB | 16 GB | 50 GB | ~$0.20–0.25/hr |
| Local LLM (13B–34B) | RTX 4090 | 24 GB | 32 GB | 80 GB | ~$0.35–0.50/hr |
| Large local LLM (70B, quantized) | A100 80GB | 80 GB | 64 GB | 150 GB | ~$1.10–1.50/hr |


For most RAG use cases, an RTX 3090 at ~$0.20/hr is the sweet spot — 24 GB VRAM handles sentence-transformer embeddings + a 7B–13B local LLM simultaneously.

Software Requirements

  • Docker (pre-installed on Clore.ai servers)

  • NVIDIA drivers + CUDA (pre-installed on Clore.ai GPU servers)

  • Python 3.10+ (inside the container)

  • CUDA 11.8 or 12.x

Quick Start

1. Rent a Clore.ai Server

In the Clore.ai Marketplace, filter for:

  • VRAM: ≥ 8 GB for embedding workloads, ≥ 24 GB for local LLMs

  • Docker: Enabled (default on most listings)

  • Image: nvidia/cuda:12.1.0-devel-ubuntu22.04 or pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime

Note the server's public IP and SSH port from My Orders.

2. Connect and Verify GPU
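
SSH into the server with the IP and port from My Orders, then confirm the GPU is visible. A typical check (the IP and port below are placeholders for your server's values):

```bash
# Connect to the rented server
ssh -p <ssh-port> root@<server-ip>

# Confirm the NVIDIA driver sees the GPU
nvidia-smi
```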

3. Build the Haystack Docker Image

Haystack v2 recommends pip installation. Create a custom Dockerfile:
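
A minimal Dockerfile along these lines works; the base image, the package list, and the hayhooks launch flags are assumptions based on current releases, so adjust them to your setup:

```dockerfile
# Dockerfile
FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime

WORKDIR /app

# Haystack v2 core, Hayhooks for REST serving, and sentence-transformers
# for local GPU embeddings
RUN pip install --no-cache-dir haystack-ai hayhooks sentence-transformers

# Hayhooks looks for pipelines relative to the working directory by default
# (here /app/pipelines); adjust via its pipelines-dir setting if your version differs
RUN mkdir -p /app/pipelines

EXPOSE 1416

CMD ["hayhooks", "run", "--host", "0.0.0.0"]
```

Build it on the server (the haystack-gpu tag is an arbitrary name reused in the commands below):

```bash
docker build -t haystack-gpu .
```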

4. Run Haystack with Hayhooks

Hayhooks turns any Haystack pipeline into a REST API automatically:
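
A docker run invocation along these lines starts the server. The image tag and host pipelines path come from the steps above; the /status route is an assumption, so check the routes your Hayhooks version actually exposes:

```bash
docker run -d --name haystack \
  --gpus all \
  -p 1416:1416 \
  -v /root/pipelines:/app/pipelines \
  -e HAYSTACK_TELEMETRY_ENABLED=false \
  haystack-gpu

# Quick health check from the server itself
curl http://localhost:1416/status
```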

A healthy server answers with HTTP 200 and a short JSON status payload; the exact fields depend on the Hayhooks version. Deployed pipelines also show up in the auto-generated API docs at http://<server-ip>:1416/docs.

5. Create Your First RAG Pipeline

Write a pipeline YAML that Hayhooks will serve as an endpoint:
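
One way to produce that YAML without hand-writing the serialization schema is to define the pipeline in Python and dump it. A sketch assuming an in-memory document store and the OpenAI generator (the model name and output path are assumptions):

```python
# build_rag_pipeline.py - build a minimal RAG pipeline and serialize it to YAML
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

template = """
Answer the question using only the context below.

Context:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}

Question: {{ question }}
Answer:
"""

rag = Pipeline()
rag.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
rag.add_component("prompt_builder", PromptBuilder(template=template))
rag.add_component("llm", OpenAIGenerator(model="gpt-4o-mini"))
rag.connect("retriever.documents", "prompt_builder.documents")
rag.connect("prompt_builder.prompt", "llm.prompt")

# Hayhooks serves pipeline YAML files placed in its pipelines directory
with open("/app/pipelines/rag_pipeline.yml", "w") as f:
    rag.dump(f)
```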

Hayhooks automatically discovers and serves this pipeline. Test it:
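
The request body for YAML-deployed pipelines mirrors the pipeline's inputs keyed by component; the exact route and payload shape are assumptions here, so confirm them against the interactive docs Hayhooks generates at /docs:

```bash
curl -X POST http://<server-ip>:1416/rag_pipeline/run \
  -H "Content-Type: application/json" \
  -d '{
        "retriever": {"query": "What is Haystack?"},
        "prompt_builder": {"question": "What is Haystack?"}
      }'
```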

Configuration

Environment Variables

| Variable | Description | Example |
| --- | --- | --- |
| OPENAI_API_KEY | OpenAI API key for GPT models | sk-... |
| ANTHROPIC_API_KEY | Anthropic API key for Claude | sk-ant-... |
| HF_TOKEN | Hugging Face token for gated models | hf_... |
| HAYSTACK_TELEMETRY_ENABLED | Disable usage telemetry | false |
| CUDA_VISIBLE_DEVICES | Select specific GPU | 0 |
| TRANSFORMERS_CACHE | Cache path for HF models | /workspace/hf-cache |

Run with Full Configuration
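
Putting the environment variables together, a fuller launch looks roughly like this; the image tag, mount paths, and key values are placeholders:

```bash
docker run -d --name haystack \
  --gpus all \
  -p 1416:1416 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_KEY=sk-... \
  -e HF_TOKEN=hf_... \
  -e HAYSTACK_TELEMETRY_ENABLED=false \
  -e CUDA_VISIBLE_DEVICES=0 \
  -e TRANSFORMERS_CACHE=/workspace/hf-cache \
  -v /root/pipelines:/app/pipelines \
  -v /root/hf-cache:/workspace/hf-cache \
  haystack-gpu
```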

Document Ingestion Pipeline

Build a separate indexing pipeline to ingest documents:
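
A sketch of an indexing pipeline that converts PDFs, cleans and splits them, embeds the chunks, and writes them to the store. File paths, the embedding model, and split sizes are assumptions, and PyPDFToDocument needs pypdf installed:

```python
# indexing_pipeline.py - convert, clean, split, embed, and write documents
from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

indexing = Pipeline()
indexing.add_component("converter", PyPDFToDocument())
indexing.add_component("cleaner", DocumentCleaner())
indexing.add_component("splitter", DocumentSplitter(split_by="word", split_length=300, split_overlap=40))
indexing.add_component("embedder", SentenceTransformersDocumentEmbedder(model="BAAI/bge-base-en-v1.5"))
indexing.add_component("writer", DocumentWriter(document_store=document_store))

indexing.connect("converter", "cleaner")
indexing.connect("cleaner", "splitter")
indexing.connect("splitter", "embedder")
indexing.connect("embedder", "writer")

# Embed and index the PDFs you want to query
indexing.run({"converter": {"sources": ["/data/report.pdf"]}})
```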

Using Vector Databases (Production)

For production workloads, replace the in-memory store with a persistent vector database:
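
A Qdrant-backed store is a drop-in replacement for the in-memory one. The sketch below assumes the qdrant-haystack integration (pip install qdrant-haystack) and a Qdrant instance running as a sidecar container on the same server (docker run -d -p 6333:6333 qdrant/qdrant); the URL and collection name are placeholders:

```python
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

document_store = QdrantDocumentStore(
    url="http://localhost:6333",
    index="clore_docs",          # collection name - an arbitrary example
    embedding_dim=768,           # must match your embedding model's dimension
    recreate_index=False,        # keep existing vectors between restarts
)
```

Swap this store into both the indexing and RAG pipelines so embeddings persist across container restarts.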

GPU Acceleration

Haystack uses GPU acceleration in two main scenarios:

1. Embedding Generation (Sentence Transformers)

GPU is highly beneficial for embedding large document collections:
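
The sentence-transformers embedders pick up CUDA automatically when it is available, and you can pin the device explicitly; the model and batch size below are assumptions:

```python
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.utils import ComponentDevice

embedder = SentenceTransformersDocumentEmbedder(
    model="BAAI/bge-base-en-v1.5",
    device=ComponentDevice.from_str("cuda:0"),  # pin to the first GPU
    batch_size=64,                              # larger batches keep the GPU busy
)
embedder.warm_up()  # load the model onto the GPU before the first run
```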

2. Local LLM Inference (Hugging Face Transformers)

For running LLMs directly in Haystack without Ollama:
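
HuggingFaceLocalGenerator runs a Transformers model inside the Haystack process. The model choice and generation parameters here are assumptions, gated models additionally require HF_TOKEN, and the transformers library must be installed:

```python
from haystack.components.generators import HuggingFaceLocalGenerator

generator = HuggingFaceLocalGenerator(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    task="text-generation",
    generation_kwargs={"max_new_tokens": 256, "temperature": 0.2},
)
generator.warm_up()  # loads the weights (onto the GPU when one is available)

result = generator.run(prompt="Explain retrieval-augmented generation in one sentence.")
print(result["replies"][0])
```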

For the best combination of ease and performance, run Ollama for LLM inference and Haystack for orchestration:
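
With Ollama running in its own container (or on the host), point Haystack's Ollama generator at it. This assumes the ollama-haystack integration (pip install ollama-haystack); note that older integration versions expect the full /api/generate URL instead of the base URL:

```python
from haystack_integrations.components.generators.ollama import OllamaGenerator

generator = OllamaGenerator(
    model="llama3",                           # any model already pulled into Ollama
    url="http://host.docker.internal:11434",  # Ollama reachable from the Haystack container
)
```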

Monitor GPU usage across both containers:
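
A simple watch loop over nvidia-smi is enough:

```bash
# Refreshes every second; the Ollama and Haystack processes both show up here
watch -n 1 nvidia-smi
```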

Tips & Best Practices

Choose the Right Embedding Model

| Model | VRAM | Speed | Quality | Best For |
| --- | --- | --- | --- | --- |
| BAAI/bge-small-en-v1.5 | ~0.5 GB | Fastest | Good | High-throughput indexing |
| BAAI/bge-base-en-v1.5 | ~1 GB | Fast | Better | General RAG |
| BAAI/bge-large-en-v1.5 | ~2 GB | Medium | Best | Highest accuracy |
| nomic-ai/nomic-embed-text-v1 | ~1.5 GB | Fast | Excellent | Long documents |

Pipeline Design Tips

  • Split documents wisely — 200–400 word chunks with 10–15% overlap work well for most RAG use cases

  • Cache embeddings — persist your document store to disk; re-embedding is expensive

  • Use warm_up() — call component.warm_up() before production use to load models into GPU memory

  • Batch indexing — process documents in batches of 32–64 for optimal GPU utilization

  • Filter with metadata — use Haystack's metadata filtering to scope retrieval (e.g., by date, source, category)
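
For the metadata filtering mentioned in the last tip, retrievers accept a filters argument using Haystack 2.x's condition syntax; the field names and values below are illustrative:

```python
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

retriever = InMemoryBM25Retriever(document_store=InMemoryDocumentStore())

# Restrict retrieval to documents from one source and recent years
result = retriever.run(
    query="GPU pricing",
    filters={
        "operator": "AND",
        "conditions": [
            {"field": "meta.source", "operator": "==", "value": "clore-docs"},
            {"field": "meta.year", "operator": ">=", "value": 2024},
        ],
    },
)
```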

Cost Optimization

Secure Hayhooks for External Access
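
Hayhooks does not add authentication on its own, so avoid exposing port 1416 directly to the internet. The simplest option on Clore.ai is to leave the port closed and reach the API through an SSH tunnel; a reverse proxy with basic auth is the next step up. A tunnel sketch (IP and port are placeholders):

```bash
# Forward the remote Hayhooks port to your local machine
ssh -p <ssh-port> -N -L 1416:localhost:1416 root@<server-ip>

# The API is then reachable at http://localhost:1416 on your machine
```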

Troubleshooting

| Problem | Likely Cause | Solution |
| --- | --- | --- |
| ModuleNotFoundError: haystack | Package not installed | Rebuild Docker image; check pip install haystack-ai succeeded |
| CUDA out of memory | Embedding model too large | Use bge-small-en-v1.5 or reduce batch size |
| Hayhooks returns 404 on pipeline | YAML file not found | Check volume mount; pipeline file must be in /app/pipelines/ |
| Slow embedding on CPU | GPU not detected | Verify --gpus all flag; check torch.cuda.is_available() |
| Ollama connection refused | Wrong hostname | Use --add-host=host.docker.internal:host-gateway; set URL to http://host.docker.internal:11434 |
| HuggingFace download fails | Missing token or rate limit | Set HF_TOKEN env var; ensure model is not gated |
| Pipeline YAML parse error | Invalid syntax | Validate YAML; use python3 -c "import yaml; yaml.safe_load(open('pipeline.yml'))" |
| Container exits immediately | Startup error | Check docker logs haystack; ensure Dockerfile CMD is correct |
| Port 1416 not reachable externally | Firewall / port forwarding | Expose port in Clore.ai order settings; check server's open ports |

Debug Commands
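
A few commands that cover most failure modes (the container name haystack and pipeline path follow the examples above):

```bash
# Container status and live logs
docker ps
docker logs -f haystack

# Is the GPU visible to PyTorch inside the container?
docker exec haystack python3 -c "import torch; print(torch.cuda.is_available())"

# Which Haystack version is installed?
docker exec haystack pip show haystack-ai

# Validate a pipeline YAML before Hayhooks tries to load it
docker exec haystack python3 -c "import yaml; yaml.safe_load(open('/app/pipelines/rag_pipeline.yml'))"

# GPU memory and utilization on the host
nvidia-smi
```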

Further Reading
