Haystack AI Framework

Deploy Haystack by deepset on Clore.ai — build production RAG pipelines, semantic search, and LLM agent workflows on affordable GPU infrastructure.

Haystack is deepset's open-source AI orchestration framework for building production-grade LLM applications. With 18K+ GitHub stars, it provides a flexible pipeline-based architecture that wires together document stores, retrievers, readers, generators, and agents — all in clean, composable Python. Whether you need RAG over private documents, semantic search, or multi-step agent workflows, Haystack handles the plumbing so you can focus on the application logic.

On Clore.ai, Haystack shines when you need a GPU for local model inference via Hugging Face Transformers or sentence-transformers. If you rely purely on external APIs (OpenAI, Anthropic), you can run it on CPU-only instances — but for embedding generation and local LLMs, a GPU cuts latency dramatically.


This guide covers Haystack v2.x (haystack-ai package). The v2 API differs significantly from v1 (farm-haystack). If you have existing v1 pipelines, see the migration guide.

Overview

| Property | Details |
| --- | --- |
| License | Apache 2.0 |
| GitHub Stars | 18K+ |
| Version | v2.x (haystack-ai) |
| Primary Use Case | RAG, semantic search, document QA, agent workflows |
| GPU Support | Optional — required for local embeddings / local LLMs |
| Difficulty | Medium |
| API Serving | Hayhooks (FastAPI-based, REST) |
| Key Integrations | Ollama, OpenAI, Anthropic, HuggingFace, Elasticsearch, Pinecone, Weaviate, Qdrant |

What You Can Build

  • RAG pipelines — ingest documents, generate embeddings, retrieve context, answer questions

  • Semantic search — query documents by meaning, not keywords

  • Document processing — parse PDFs, HTML, Word docs; split, clean, and index content

  • Agent workflows — multi-step reasoning with tool use (web search, calculators, APIs)

  • REST API services — expose any Haystack pipeline as an endpoint via Hayhooks

Requirements

Hardware Requirements

| Use Case | GPU | VRAM | RAM | Disk | Clore.ai Price |
| --- | --- | --- | --- | --- | --- |
| API mode only (OpenAI/Anthropic) | None / CPU | N/A | 4 GB | 20 GB | ~$0.01–0.05/hr |
| Local embeddings (sentence-transformers) | RTX 3060 | 8 GB | 16 GB | 30 GB | ~$0.10–0.15/hr |
| Local embeddings + small LLM (7B) | RTX 3090 | 24 GB | 16 GB | 50 GB | ~$0.20–0.25/hr |
| Local LLM (13B–34B) | RTX 4090 | 24 GB | 32 GB | 80 GB | ~$0.35–0.50/hr |
| Large local LLM (70B, quantized) | A100 80GB | 80 GB | 64 GB | 150 GB | ~$1.10–1.50/hr |


For most RAG use cases, an RTX 3090 at ~$0.20/hr is the sweet spot — 24 GB VRAM handles sentence-transformer embeddings + a 7B–13B local LLM simultaneously.

Software Requirements

  • Docker (pre-installed on Clore.ai servers)

  • NVIDIA drivers + CUDA (pre-installed on Clore.ai GPU servers)

  • Python 3.10+ (inside the container)

  • CUDA 11.8 or 12.x

Quick Start

1. Rent a Clore.ai Server

In the Clore.ai Marketplace, filter for:

  • VRAM: ≥ 8 GB for embedding workloads, ≥ 24 GB for local LLMs

  • Docker: Enabled (default on most listings)

  • Image: nvidia/cuda:12.1.0-devel-ubuntu22.04 or pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime

Note the server's public IP and SSH port from My Orders.

2. Connect and Verify GPU
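
SSH into the server with the IP and port from My Orders, then confirm the GPU is visible. A typical check (the IP and port below are placeholders for your server's values):

```bash
# Connect to the rented server
ssh -p <ssh-port> root@<server-ip>

# Confirm the NVIDIA driver sees the GPU
nvidia-smi
```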

3. Build the Haystack Docker Image

Haystack v2 recommends pip installation. Create a custom Dockerfile:
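
A minimal Dockerfile along these lines works; the base image, the package list, and the hayhooks launch flags are assumptions based on current releases, so adjust them to your setup:

```dockerfile
# Dockerfile
FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime

WORKDIR /app

# Haystack v2 core, Hayhooks for REST serving, and sentence-transformers
# for local GPU embeddings
RUN pip install --no-cache-dir haystack-ai hayhooks sentence-transformers

# Hayhooks looks for pipelines relative to the working directory by default
# (here /app/pipelines); adjust via its pipelines-dir setting if your version differs
RUN mkdir -p /app/pipelines

EXPOSE 1416

CMD ["hayhooks", "run", "--host", "0.0.0.0"]
```

Build it on the server (the haystack-gpu tag is an arbitrary name reused in the commands below):

```bash
docker build -t haystack-gpu .
```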

4. Run Haystack with Hayhooks

Hayhooks turns any Haystack pipeline into a REST API automatically:
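
A docker run invocation along these lines starts the server. The image tag and host pipelines path come from the steps above; the /status route is an assumption, so check the routes your Hayhooks version actually exposes:

```bash
docker run -d --name haystack \
  --gpus all \
  -p 1416:1416 \
  -v /root/pipelines:/app/pipelines \
  -e HAYSTACK_TELEMETRY_ENABLED=false \
  haystack-gpu

# Quick health check from the server itself
curl http://localhost:1416/status
```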

A healthy server answers with HTTP 200 and a short JSON status payload; the exact fields depend on the Hayhooks version. Deployed pipelines also show up in the auto-generated API docs at http://<server-ip>:1416/docs.

5. Create Your First RAG Pipeline

Write a pipeline YAML that Hayhooks will serve as an endpoint:
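
One way to produce that YAML without hand-writing the serialization schema is to define the pipeline in Python and dump it. A sketch assuming an in-memory document store and the OpenAI generator (the model name and output path are assumptions):

```python
# build_rag_pipeline.py - build a minimal RAG pipeline and serialize it to YAML
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

template = """
Answer the question using only the context below.

Context:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}

Question: {{ question }}
Answer:
"""

rag = Pipeline()
rag.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
rag.add_component("prompt_builder", PromptBuilder(template=template))
rag.add_component("llm", OpenAIGenerator(model="gpt-4o-mini"))
rag.connect("retriever.documents", "prompt_builder.documents")
rag.connect("prompt_builder.prompt", "llm.prompt")

# Hayhooks serves pipeline YAML files placed in its pipelines directory
with open("/app/pipelines/rag_pipeline.yml", "w") as f:
    rag.dump(f)
```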

Hayhooks automatically discovers and serves this pipeline. Test it:
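
The request body for YAML-deployed pipelines mirrors the pipeline's inputs keyed by component; the exact route and payload shape are assumptions here, so confirm them against the interactive docs Hayhooks generates at /docs:

```bash
curl -X POST http://<server-ip>:1416/rag_pipeline/run \
  -H "Content-Type: application/json" \
  -d '{
        "retriever": {"query": "What is Haystack?"},
        "prompt_builder": {"question": "What is Haystack?"}
      }'
```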

Configuration

Environment Variables

| Variable | Description | Example |
| --- | --- | --- |
| OPENAI_API_KEY | OpenAI API key for GPT models | sk-... |
| ANTHROPIC_API_KEY | Anthropic API key for Claude | sk-ant-... |
| HF_TOKEN | Hugging Face token for gated models | hf_... |
| HAYSTACK_TELEMETRY_ENABLED | Disable usage telemetry | false |
| CUDA_VISIBLE_DEVICES | Select specific GPU | 0 |
| TRANSFORMERS_CACHE | Cache path for HF models | /workspace/hf-cache |

Run with Full Configuration
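
Putting the environment variables together, a fuller launch looks roughly like this; the image tag, mount paths, and key values are placeholders:

```bash
docker run -d --name haystack \
  --gpus all \
  -p 1416:1416 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_KEY=sk-... \
  -e HF_TOKEN=hf_... \
  -e HAYSTACK_TELEMETRY_ENABLED=false \
  -e CUDA_VISIBLE_DEVICES=0 \
  -e TRANSFORMERS_CACHE=/workspace/hf-cache \
  -v /root/pipelines:/app/pipelines \
  -v /root/hf-cache:/workspace/hf-cache \
  haystack-gpu
```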

Document Ingestion Pipeline

Build a separate indexing pipeline to ingest documents:
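
A sketch of an indexing pipeline that converts PDFs, cleans and splits them, embeds the chunks, and writes them to the store. File paths, the embedding model, and split sizes are assumptions, and PyPDFToDocument needs pypdf installed:

```python
# indexing_pipeline.py - convert, clean, split, embed, and write documents
from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

indexing = Pipeline()
indexing.add_component("converter", PyPDFToDocument())
indexing.add_component("cleaner", DocumentCleaner())
indexing.add_component("splitter", DocumentSplitter(split_by="word", split_length=300, split_overlap=40))
indexing.add_component("embedder", SentenceTransformersDocumentEmbedder(model="BAAI/bge-base-en-v1.5"))
indexing.add_component("writer", DocumentWriter(document_store=document_store))

indexing.connect("converter", "cleaner")
indexing.connect("cleaner", "splitter")
indexing.connect("splitter", "embedder")
indexing.connect("embedder", "writer")

# Embed and index the PDFs you want to query
indexing.run({"converter": {"sources": ["/data/report.pdf"]}})
```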

Using Vector Databases (Production)

For production workloads, replace the in-memory store with a persistent vector database:
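
A Qdrant-backed store is a drop-in replacement for the in-memory one. The sketch below assumes the qdrant-haystack integration (pip install qdrant-haystack) and a Qdrant instance running as a sidecar container on the same server (docker run -d -p 6333:6333 qdrant/qdrant); the URL and collection name are placeholders:

```python
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

document_store = QdrantDocumentStore(
    url="http://localhost:6333",
    index="clore_docs",          # collection name - an arbitrary example
    embedding_dim=768,           # must match your embedding model's dimension
    recreate_index=False,        # keep existing vectors between restarts
)
```

Swap this store into both the indexing and RAG pipelines so embeddings persist across container restarts.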

GPU Acceleration

Haystack uses GPU acceleration in two main scenarios:

1. Embedding Generation (Sentence Transformers)

GPU is highly beneficial for embedding large document collections:
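
The sentence-transformers embedders pick up CUDA automatically when it is available, and you can pin the device explicitly; the model and batch size below are assumptions:

```python
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.utils import ComponentDevice

embedder = SentenceTransformersDocumentEmbedder(
    model="BAAI/bge-base-en-v1.5",
    device=ComponentDevice.from_str("cuda:0"),  # pin to the first GPU
    batch_size=64,                              # larger batches keep the GPU busy
)
embedder.warm_up()  # load the model onto the GPU before the first run
```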

2. Local LLM Inference (Hugging Face Transformers)

For running LLMs directly in Haystack without Ollama:
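
HuggingFaceLocalGenerator runs a Transformers model inside the Haystack process. The model choice and generation parameters here are assumptions, gated models additionally require HF_TOKEN, and the transformers library must be installed:

```python
from haystack.components.generators import HuggingFaceLocalGenerator

generator = HuggingFaceLocalGenerator(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    task="text-generation",
    generation_kwargs={"max_new_tokens": 256, "temperature": 0.2},
)
generator.warm_up()  # loads the weights (onto the GPU when one is available)

result = generator.run(prompt="Explain retrieval-augmented generation in one sentence.")
print(result["replies"][0])
```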

For the best combination of ease and performance, run Ollama for LLM inference and Haystack for orchestration:
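
With Ollama running in its own container (or on the host), point Haystack's Ollama generator at it. This assumes the ollama-haystack integration (pip install ollama-haystack); note that older integration versions expect the full /api/generate URL instead of the base URL:

```python
from haystack_integrations.components.generators.ollama import OllamaGenerator

generator = OllamaGenerator(
    model="llama3",                           # any model already pulled into Ollama
    url="http://host.docker.internal:11434",  # Ollama reachable from the Haystack container
)
```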

Monitor GPU usage across both containers:
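
A simple watch loop over nvidia-smi is enough:

```bash
# Refreshes every second; the Ollama and Haystack processes both show up here
watch -n 1 nvidia-smi
```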

Tips & Best Practices

Choose the Right Embedding Model

| Model | VRAM | Speed | Quality | Best For |
| --- | --- | --- | --- | --- |
| BAAI/bge-small-en-v1.5 | ~0.5 GB | Fastest | Good | High-throughput indexing |
| BAAI/bge-base-en-v1.5 | ~1 GB | Fast | Better | General RAG |
| BAAI/bge-large-en-v1.5 | ~2 GB | Medium | Best | Highest accuracy |
| nomic-ai/nomic-embed-text-v1 | ~1.5 GB | Fast | Excellent | Long documents |

Pipeline Design Tips

  • Split documents wisely — 200–400 word chunks with 10–15% overlap work well for most RAG use cases

  • Cache embeddings — persist your document store to disk; re-embedding is expensive

  • Use warm_up() — call component.warm_up() before production use to load models into GPU memory

  • Batch indexing — process documents in batches of 32–64 for optimal GPU utilization

  • Filter with metadata — use Haystack's metadata filtering to scope retrieval (e.g., by date, source, category)
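
For the metadata filtering mentioned in the last tip, retrievers accept a filters argument using Haystack 2.x's condition syntax; the field names and values below are illustrative:

```python
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

retriever = InMemoryBM25Retriever(document_store=InMemoryDocumentStore())

# Restrict retrieval to documents from one source and recent years
result = retriever.run(
    query="GPU pricing",
    filters={
        "operator": "AND",
        "conditions": [
            {"field": "meta.source", "operator": "==", "value": "clore-docs"},
            {"field": "meta.year", "operator": ">=", "value": 2024},
        ],
    },
)
```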

Cost Optimization

Secure Hayhooks for External Access
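
Hayhooks does not add authentication on its own, so avoid exposing port 1416 directly to the internet. The simplest option on Clore.ai is to leave the port closed and reach the API through an SSH tunnel; a reverse proxy with basic auth is the next step up. A tunnel sketch (IP and port are placeholders):

```bash
# Forward the remote Hayhooks port to your local machine
ssh -p <ssh-port> -N -L 1416:localhost:1416 root@<server-ip>

# The API is then reachable at http://localhost:1416 on your machine
```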

Troubleshooting

| Problem | Likely Cause | Solution |
| --- | --- | --- |
| ModuleNotFoundError: haystack | Package not installed | Rebuild Docker image; check pip install haystack-ai succeeded |
| CUDA out of memory | Embedding model too large | Use bge-small-en-v1.5 or reduce batch size |
| Hayhooks returns 404 on pipeline | YAML file not found | Check volume mount; pipeline file must be in /app/pipelines/ |
| Slow embedding on CPU | GPU not detected | Verify --gpus all flag; check torch.cuda.is_available() |
| Ollama connection refused | Wrong hostname | Use --add-host=host.docker.internal:host-gateway; set URL to http://host.docker.internal:11434 |
| HuggingFace download fails | Missing token or rate limit | Set HF_TOKEN env var; ensure model is not gated |
| Pipeline YAML parse error | Invalid syntax | Validate YAML; use python3 -c "import yaml; yaml.safe_load(open('pipeline.yml'))" |
| Container exits immediately | Startup error | Check docker logs haystack; ensure Dockerfile CMD is correct |
| Port 1416 not reachable externally | Firewall / port forwarding | Expose port in Clore.ai order settings; check server's open ports |

Debug Commands
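
A few commands that cover most failure modes (the container name haystack and pipeline path follow the examples above):

```bash
# Container status and live logs
docker ps
docker logs -f haystack

# Is the GPU visible to PyTorch inside the container?
docker exec haystack python3 -c "import torch; print(torch.cuda.is_available())"

# Which Haystack version is installed?
docker exec haystack pip show haystack-ai

# Validate a pipeline YAML before Hayhooks tries to load it
docker exec haystack python3 -c "import yaml; yaml.safe_load(open('/app/pipelines/rag_pipeline.yml'))"

# GPU memory and utilization on the host
nvidia-smi
```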

Further Reading
