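Haystack AI Framework

Deploy deepset's Haystack on Clore.ai to build production-grade RAG pipelines, semantic search, and LLM agent workflows on budget GPU infrastructure.

Haystack is deepset's open-source AI orchestration framework for building production-grade LLM applications. With 18K+ GitHub stars, it provides a flexible pipeline-based architecture that connects document stores, retrievers, readers, generators, and agents, all in clean, composable Python. Whether you need RAG over private documents, semantic search, or multi-step agent workflows, Haystack handles the underlying plumbing so you can focus on application logic.

On Clore.ai, Haystack's advantages are most apparent when you run model inference locally through Hugging Face Transformers or sentence-transformers. If you rely entirely on external APIs (OpenAI, Anthropic), a CPU-only instance is enough; for embedding generation and local LLMs, a GPU significantly reduces latency.

All examples run on GPU servers rented through the CLORE.AI Marketplace.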

This guide covers Haystack v2.x (the haystack-ai package). The v2 API differs substantially from v1 (farm-haystack). If you have existing v1 pipelines, see the migration guide.

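Overview

| Property | Details |
| --- | --- |
| Project | deepset-ai/haystack |
| License | Apache 2.0 |
| GitHub stars | 18K+ |
| Version | v2.x (haystack-ai) |
| Primary use cases | RAG, semantic search, document Q&A, agent workflows |
| GPU support | Optional; required for local embeddings and local LLMs |
| Difficulty | Medium |
| API serving | Hayhooks (FastAPI-based, REST) |
| Key integrations | Ollama, OpenAI, Anthropic, HuggingFace, Elasticsearch, Pinecone, Weaviate, Qdrant |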

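What You Can Build

RAG pipelines: ingest documents, generate embeddings, retrieve context, answer questions
Semantic search: query documents by meaning rather than keywords
Document processing: parse PDF, HTML, and Word documents; split, clean, and index content
Agent workflows: multi-step reasoning with tools (web search, calculator, APIs)
REST API services: expose any Haystack pipeline as an endpoint through Hayhooks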
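
To make "query by meaning" concrete, here is a minimal sketch of the ranking step behind semantic search: the query and the documents are represented as vectors, and results are ordered by cosine similarity. The three-dimensional vectors and document names are made up for illustration. In a real pipeline an embedding model produces vectors with hundreds of dimensions, and the retriever component performs this ranking for you.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=2):
    """Return the top_k (name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (illustrative values only, not produced by a real model)
toy_docs = {
    "gpu_pricing": [0.9, 0.1, 0.0],
    "cooking_tips": [0.0, 0.2, 0.9],
    "cuda_setup": [0.8, 0.3, 0.1],
}
toy_query = [1.0, 0.2, 0.0]

for name, score in rank_documents(toy_query, toy_docs):
    print(f"{name}: {score:.3f}")
```

The two GPU-related toy documents rank above the unrelated one even though no keywords are compared, which is the property the embedding retriever relies on.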

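Requirements

Hardware Requirements

| Configuration | GPU | VRAM | RAM | Disk | Clore.ai price |
| --- | --- | --- | --- | --- | --- |
| API-only mode (OpenAI/Anthropic) | None / CPU only | n/a | 4 GB | 20 GB | ≈ $0.01–0.05/hr |
| Local embeddings (sentence-transformers) | 8 GB-class GPU | 8 GB | 16 GB | 30 GB | ≈ $0.10–0.15/hr |
| Local embeddings + small LLM (7B) | 24 GB-class GPU | 24 GB | 16 GB | 50 GB | ≈ $0.20–0.25/hr |
| Local LLM (13B–34B) | 24 GB-class GPU | 24 GB | 32 GB | 80 GB | ≈ $0.35–0.50/hr |
| Large local LLM (70B, quantized) | 80 GB-class GPU | 80 GB | 64 GB | 150 GB | ≈ $1.10–1.50/hr |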

For most RAG use cases, a 24 GB GPU at around $0.20/hr is the sweet spot: 24 GB of VRAM can handle sentence-transformer embeddings and a 7B–13B local LLM at the same time.
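
As a rough sanity check on the VRAM figures, model weights take about (parameter count × bytes per parameter). The sketch below applies that arithmetic. The 20% overhead factor for activations and KV cache is an assumption chosen for illustration, not a measured value; real usage depends on context length, batch size, and the runtime.

```python
def estimate_vram_gb(params_billion, bits_per_param=16, overhead=1.2):
    """Rough VRAM estimate in GB: weight size times an assumed overhead factor."""
    # 1e9 parameters * (bits / 8) bytes each = that many GB of weights
    weight_gb = params_billion * bits_per_param / 8
    return weight_gb * overhead


for label, params, bits in [
    ("7B at fp16", 7, 16),
    ("13B at 8-bit", 13, 8),
    ("70B at 4-bit", 70, 4),
]:
    print(f"{label}: ~{estimate_vram_gb(params, bits):.1f} GB")
```

Under these assumptions a 7B model at fp16 fits on a 24 GB card with room left for an embedding model, while a 70B model needs quantization to fit in 80 GB.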

Software Requirements

Docker (pre-installed on Clore.ai servers)
NVIDIA driver + CUDA (pre-installed on Clore.ai GPU servers)
Python 3.10+ (inside the container)
CUDA 11.8 or 12.x

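Quick Start

1. Rent a Clore.ai Server

In the Clore.ai Marketplace, filter for:

VRAM: ≥ 8 GB for embedding workloads, ≥ 24 GB for local LLMs
Docker pre-installed: enabled (on by default for most listings)
Image: nvidia/cuda:12.1.1-devel-ubuntu22.04 or pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime

Note the server's public IP and SSH port under My Orders.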

2. Connect and Verify the GPU

ssh root@<clore-server-ip> -p <port>

# Verify that the GPU is available
nvidia-smi

# The output should list your GPU, driver version, and CUDA version

3. Build the Haystack Docker Image

The recommended way to install Haystack v2 is through pip. Create a custom Dockerfile:

mkdir -p /workspace/haystack-app && cd /workspace/haystack-app

cat > Dockerfile << 'EOF'
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04

# Avoid interactive prompts during package installation
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1

# Install Python and system dependencies
RUN apt-get update && apt-get install -y \
    python3.11 \
    python3-pip \
    python3.11-dev \
    git \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Make python3.11 the default interpreter
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1
RUN update-alternatives --install /usr/bin/python python python3.11 1

# Install Haystack v2 and core dependencies
RUN pip install --no-cache-dir \
    haystack-ai \
    hayhooks \
    sentence-transformers \
    transformers \
    torch \
    accelerate \
    fastapi \
    uvicorn

# Install optional integrations
RUN pip install --no-cache-dir \
    ollama-haystack \
    haystack-experimental

WORKDIR /app

# Default Hayhooks port
EXPOSE 1416

CMD ["hayhooks", "run", "--host", "0.0.0.0", "--port", "1416"]
EOF

# Build the image
docker build -t haystack-clore:latest .

4. Run Haystack with Hayhooks

Hayhooks automatically turns any Haystack pipeline into a REST API:

# Create a directory for your pipelines
mkdir -p /workspace/haystack-pipelines

# Run Hayhooks with GPU access
docker run -d \
  --name haystack \
  --gpus all \
  -p 1416:1416 \
  -v /workspace/haystack-pipelines:/app/pipelines \
  -e OPENAI_API_KEY=${OPENAI_API_KEY:-""} \
  -e HF_TOKEN=${HF_TOKEN:-""} \
  haystack-clore:latest

# Check that it is running
curl http://localhost:1416/status

Expected response:

{"status": "ok", "pipelines": []}
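
If you script the deployment, the same check can be done from Python. The sketch below only parses a response body shaped like the one above. It assumes the `status` and `pipelines` fields shown in the expected response and does not contact the server itself; `summarize_status` is a hypothetical helper name.

```python
import json


def summarize_status(body: str) -> dict:
    """Parse a status response body and report health plus deployed pipelines."""
    data = json.loads(body)
    pipelines = list(data.get("pipelines", []))
    return {
        "healthy": data.get("status") == "ok",
        "pipeline_count": len(pipelines),
        "pipelines": pipelines,
    }


# Body copied from the expected response above
sample_body = '{"status": "ok", "pipelines": []}'
print(summarize_status(sample_body))
```

An empty pipeline list right after startup is normal: nothing has been placed in the pipelines directory yet.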

5. Create Your First RAG Pipeline

Write a pipeline YAML file that Hayhooks will serve as an endpoint:

cat > /workspace/haystack-pipelines/rag_pipeline.yml << 'EOF'
# RAG pipeline using Ollama as the LLM and local embeddings for retrieval
components:
  embedder:
    type: haystack.components.embedders.SentenceTransformersTextEmbedder
    init_parameters:
      model: BAAI/bge-small-en-v1.5

  retriever:
    type: haystack.components.retrievers.in_memory.InMemoryEmbeddingRetriever
    init_parameters:
      document_store:
        type: haystack.document_stores.in_memory.InMemoryDocumentStore

  prompt_builder:
    type: haystack.components.builders.PromptBuilder
    init_parameters:
      template: |
        Answer the question based on the context below.
        Context: {% for doc in documents %}{{ doc.content }}{% endfor %}
        Question: {{ question }}

  llm:
    type: haystack_integrations.components.generators.ollama.OllamaGenerator
    init_parameters:
      model: llama3
      url: http://host.docker.internal:11434

connections:
  - sender: embedder.embedding
    receiver: retriever.query_embedding
  - sender: retriever.documents
    receiver: prompt_builder.documents
  - sender: prompt_builder.prompt
    receiver: llm.prompt

inputs:
  query:
    - embedder.text
    - prompt_builder.question

outputs:
  answer: llm.replies
EOF

Hayhooks discovers and serves this pipeline automatically. Test it:

# List deployed pipelines
curl http://localhost:1416/pipelines

# Query the RAG pipeline
curl -X POST http://localhost:1416/rag_pipeline/run \
  -H "Content-Type: application/json" \
  -d '{"query": "What is Haystack?"}'
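
The same request can be built from Python with only the standard library. This is a minimal sketch that mirrors the curl call above (same URL layout, same JSON body); `build_run_request` is a hypothetical helper, and the request is constructed but not sent so the example works without a running server.

```python
import json
import urllib.request


def build_run_request(base_url, pipeline_name, query):
    """Build a POST request for <base_url>/<pipeline_name>/run with a JSON body."""
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{pipeline_name}/run",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_run_request("http://localhost:1416", "rag_pipeline", "What is Haystack?")
print(request.get_method(), request.full_url)

# To actually send it once Hayhooks is running:
# with urllib.request.urlopen(request, timeout=60) as response:
#     print(json.loads(response.read()))
```

A generous timeout is worth keeping, since the first query may trigger a model download or load.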

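Configuration

Environment Variables

| Variable | Description | Example |
| --- | --- | --- |
| OPENAI_API_KEY | OpenAI API key for GPT models | sk-... |
| ANTHROPIC_API_KEY | Anthropic API key for Claude | sk-ant-... |
| HF_TOKEN | Hugging Face token for gated models | hf_... |
| HAYSTACK_TELEMETRY_ENABLED | Set to false to disable usage telemetry | false |
| CUDA_VISIBLE_DEVICES | Select specific GPUs | 0 |
| TRANSFORMERS_CACHE | Cache path for HF models | /workspace/hf-cache |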

Running with Full Configuration

docker run -d \
  --name haystack \
  --gpus '"device=0"' \
  -p 1416:1416 \
  -v /workspace/haystack-pipelines:/app/pipelines \
  -v /workspace/hf-cache:/root/.cache/huggingface \
  -e OPENAI_API_KEY="your-key-here" \
  -e HF_TOKEN="your-hf-token" \
  -e HAYSTACK_TELEMETRY_ENABLED=false \
  -e CUDA_VISIBLE_DEVICES=0 \
  --restart unless-stopped \
  haystack-clore:latest
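
Inside your own pipeline scripts it can help to read these variables defensively, because the earlier `docker run` example passes empty strings when a key is not set on the host. The sketch below is one way to do that; `load_settings` is a hypothetical helper, and its handling of `HAYSTACK_TELEMETRY_ENABLED` reflects only how this sketch interprets the value, not how Haystack itself reads it.

```python
import os


def load_settings(env=None):
    """Collect the guide's environment variables into a dict, treating empty values as unset."""
    env = os.environ if env is None else env
    telemetry_raw = env.get("HAYSTACK_TELEMETRY_ENABLED", "true").strip().lower()
    return {
        "openai_api_key": env.get("OPENAI_API_KEY") or None,
        "anthropic_api_key": env.get("ANTHROPIC_API_KEY") or None,
        "hf_token": env.get("HF_TOKEN") or None,
        "telemetry_enabled": telemetry_raw not in {"false", "0", "no"},
        "visible_gpus": [gpu for gpu in env.get("CUDA_VISIBLE_DEVICES", "").split(",") if gpu],
    }


# Example with an explicit mapping instead of the real process environment
example_env = {
    "OPENAI_API_KEY": "",
    "HAYSTACK_TELEMETRY_ENABLED": "false",
    "CUDA_VISIBLE_DEVICES": "0",
}
print(load_settings(example_env))
```

Failing early when a required key resolves to `None` gives a clearer error than a rejected API call deep inside a pipeline run.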

Document Ingestion Pipeline

Build a separate indexing pipeline to ingest documents:

cat > /workspace/index_documents.py << 'EOF'
from pathlib import Path

from haystack import Pipeline
# PyPDFToDocument relies on the pypdf package; add it to the image if it is missing
from haystack.components.converters import PyPDFToDocument
from haystack.components.preprocessors import DocumentSplitter, DocumentCleaner
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Initialize the document store (in-memory: contents are lost when the script exits)
document_store = InMemoryDocumentStore()

# Build the indexing pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("converter", PyPDFToDocument())
indexing_pipeline.add_component("cleaner", DocumentCleaner())
indexing_pipeline.add_component("splitter", DocumentSplitter(
    split_by="word",
    split_length=200,
    split_overlap=20
))
indexing_pipeline.add_component("embedder", SentenceTransformersDocumentEmbedder(
    model="BAAI/bge-small-en-v1.5"
))
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))

# Connect the components in processing order
indexing_pipeline.connect("converter", "cleaner")
indexing_pipeline.connect("cleaner", "splitter")
indexing_pipeline.connect("splitter", "embedder")
indexing_pipeline.connect("embedder", "writer")

# Run indexing over every PDF in the mounted documents directory
pdf_files = list(Path("/data/documents").glob("*.pdf"))
indexing_pipeline.run({"converter": {"sources": pdf_files}})

print(f"Indexed {document_store.count_documents()} document chunks")
EOF

docker run --rm \
  --gpus all \
  -v /workspace:/workspace \
  -v /your/documents:/data/documents \
  -v /workspace/hf-cache:/root/.cache/huggingface \
  haystack-clore:latest \
  python3 /workspace/index_documents.py
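
To see what `split_length=200` with `split_overlap=20` means for your chunks, the following standalone sketch reproduces the idea of word-based splitting with overlap. It is an illustration of the technique under simple assumptions (whitespace tokenization, no sentence awareness), not Haystack's actual `DocumentSplitter` implementation.

```python
def split_by_words(text, split_length=200, split_overlap=20):
    """Split text into chunks of split_length words, each sharing split_overlap words with the previous one."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once a chunk has reached the end of the text
        if start + split_length >= len(words):
            break
    return chunks


# Small demonstration: 10 words, chunks of 4 with 1 word of overlap
sample_text = " ".join(f"w{i}" for i in range(10))
for chunk in split_by_words(sample_text, split_length=4, split_overlap=1):
    print(chunk)
```

With the guide's settings, each new chunk starts 180 words after the previous one, so a 500-word document yields three chunks.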

Using a Vector Database (Production)

For production workloads, replace the in-memory store with a persistent vector database:

# Start Qdrant alongside Haystack
docker network create haystack-net

docker run -d \
  --name qdrant \
  --network haystack-net \
  -p 6333:6333 \
  -v /workspace/qdrant-data:/qdrant/storage \
  qdrant/qdrant

# Install the Qdrant integration in the Haystack container
# Add to the Dockerfile:  RUN pip install qdrant-haystack
# Then use QdrantDocumentStore instead of InMemoryDocumentStore

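GPU Acceleration

Haystack uses GPU acceleration in two main scenarios:

1. Embedding Generation (Sentence Transformers)

GPUs are especially beneficial when embedding large document collections: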

cat > /workspace/benchmark_embeddings.py << 'EOF'
import time
import torch
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack import Document

# Check GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# Create the embedder
embedder = SentenceTransformersDocumentEmbedder(
    model="BAAI/bge-base-en-v1.5"
)
embedder.warm_up()

# Benchmark: embed 100 short documents and time the run
docs = [Document(content=f"Sample document {i} with some text content.") for i in range(100)]

start = time.time()
result = embedder.run(documents=docs)
elapsed = time.time() - start

print(f"Embedded 100 documents in {elapsed:.2f}s ({100/elapsed:.0f} docs/sec)")
EOF

docker run --rm --gpus all \
  -v /workspace:/workspace \
  haystack-clore:latest \
  python3 /workspace/benchmark_embeddings.py

2. Local LLM Inference (Hugging Face Transformers)

To run an LLM directly inside Haystack (without Ollama):

cat > /workspace/local_llm_pipeline.py << 'EOF'
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import HuggingFaceLocalGenerator

# Uses the GPU automatically when one is available
generator = HuggingFaceLocalGenerator(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    task="text-generation",
    generation_kwargs={
        "max_new_tokens": 512,
        "temperature": 0.7,
        "do_sample": True,
    }
)

prompt_builder = PromptBuilder(template="Answer this question: {{ question }}")

pipeline = Pipeline()
pipeline.add_component("prompt_builder", prompt_builder)
pipeline.add_component("llm", generator)
pipeline.connect("prompt_builder.prompt", "llm.prompt")

result = pipeline.run({"prompt_builder": {"question": "What is RAG?"}})
print(result["llm"]["replies"][0])
EOF

docker run --rm --gpus all \
  -v /workspace:/workspace \
  -e HF_TOKEN="your-hf-token" \
  haystack-clore:latest \
  python3 /workspace/local_llm_pipeline.py

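3. Pairing with Ollama (Recommended)

For the best balance between ease of use and performance, let Ollama handle LLM inference while Haystack handles orchestration:

# Step 1: Start Ollama (see the Ollama guide)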
docker run -d \
  --name ollama \
  --gpus all \
  -p 11434:11434 \
  -v /workspace/ollama:/root/.ollama \
  ollama/ollama

# Step 2: Pull a chat model and an embedding model
docker exec ollama ollama pull llama3
docker exec ollama ollama pull nomic-embed-text  # for generating embeddings through Ollama

# Step 3: Start Haystack pointed at Ollama
docker run -d \
  --name haystack \
  --gpus '"device=0"' \
  -p 1416:1416 \
  --add-host=host.docker.internal:host-gateway \
  -v /workspace/haystack-pipelines:/app/pipelines \
  haystack-clore:latest

Monitor GPU usage across both containers:

watch -n 2 nvidia-smi
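
For scripted monitoring, `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits` prints one comma-separated line per GPU. The sketch below parses that format. The sample line is an illustrative value, not a measurement from a real server, and `parse_gpu_memory` is a hypothetical helper.

```python
def parse_gpu_memory(csv_text):
    """Parse 'index, used_mib, total_mib' lines into dicts with a used-percentage field."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        used_mib, total_mib = int(used), int(total)
        rows.append({
            "gpu": int(index),
            "used_mib": used_mib,
            "total_mib": total_mib,
            "used_pct": round(100 * used_mib / total_mib, 1),
        })
    return rows


# Illustrative sample in the CSV format described above
sample_output = "0, 10240, 24576\n"
for row in parse_gpu_memory(sample_output):
    print(f"GPU {row['gpu']}: {row['used_mib']} / {row['total_mib']} MiB ({row['used_pct']}%)")
```

Because Ollama and Haystack share the same card here, watching the combined figure helps you spot when the LLM and the embedding model together approach the VRAM limit.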

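Tips and Best Practices

Choosing the Right Embedding Model

| Model | VRAM | Speed | Quality | Best for |
| --- | --- | --- | --- | --- |
| BAAI/bge-small-en-v1.5 | ≈ 0.5 GB | Fastest | Good | High-throughput indexing |
| BAAI/bge-base-en-v1.5 | ≈ 1 GB | Fast | Better | General RAG |
| BAAI/bge-large-en-v1.5 | ≈ 2 GB | Medium | Best | Highest accuracy |
| nomic-ai/nomic-embed-text-v1 | ≈ 1.5 GB | Fast | Excellent | Long documents |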

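Pipeline Design Tips

Split documents sensibly: for most RAG use cases, chunks of 200–400 words with 10–15% overlap work well
Cache embeddings: persist the document store to disk; regenerating embeddings is expensive
Use warm_up(): call component.warm_up() before production use to load the model into GPU memory
Index in batches: process documents in batches of 32–64 for the best GPU utilization
Use metadata filtering: use Haystack's metadata filters to narrow retrieval scope (for example by date, source, or category)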
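
As an illustration of the metadata filtering tip, the sketch below builds a filter dictionary in the Haystack 2.x layout (`field`, `operator`, `value`, with conditions combined under a logical `AND`). The metadata keys `source`, `date`, and `category` are assumptions about how your documents were indexed, and `build_meta_filter` is a hypothetical helper rather than part of Haystack.

```python
def build_meta_filter(source=None, min_date=None, category=None):
    """Build a filter dict from optional criteria; returns None when nothing is set."""
    conditions = []
    if source is not None:
        conditions.append({"field": "meta.source", "operator": "==", "value": source})
    if min_date is not None:
        conditions.append({"field": "meta.date", "operator": ">=", "value": min_date})
    if category is not None:
        conditions.append({"field": "meta.category", "operator": "==", "value": category})

    if not conditions:
        return None
    if len(conditions) == 1:
        return conditions[0]
    return {"operator": "AND", "conditions": conditions}


filters = build_meta_filter(source="handbook.pdf", min_date="2024-01-01")
print(filters)
```

The resulting dictionary would typically be passed as the `filters` argument when running a retriever, so that only matching documents are scored against the query embedding.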

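Cost Optimization

# Clore.ai offers spot-like pricing: pick servers with a lower hourly rate
# Development/testing: RTX 3060 (≈ $0.10/hr) is enough for embeddings
# Production embeddings: RTX 3090 (≈ $0.20/hr), 24 GB handles large batches
# Local LLM + embeddings: A100 40GB (≈ $0.60/hr), headroom for concurrent users

# Monitor resource usage
docker stats haystack
nvidia-smi dmon -s u -d 5  # report GPU utilization every 5 seconds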
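
To compare these options over a billing period, multiply the hourly rate by the hours you expect to keep the server rented. The sketch below uses the approximate rates quoted above; the usage patterns (8 hours a day for 20 days, or around the clock for 30 days) are assumptions for illustration.

```python
def rental_cost(rate_per_hour, hours_per_day, days):
    """Total rental cost in dollars for a given usage pattern, rounded to cents."""
    return round(rate_per_hour * hours_per_day * days, 2)


scenarios = {
    "RTX 3060, dev (8 h/day, 20 days)": rental_cost(0.10, 8, 20),
    "RTX 3090, always on (30 days)": rental_cost(0.20, 24, 30),
    "A100 40GB, always on (30 days)": rental_cost(0.60, 24, 30),
}

for label, cost in scenarios.items():
    print(f"{label}: ${cost:.2f}")
```

Stopping a development server outside working hours changes the total far more than picking a slightly cheaper listing.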

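Securing Hayhooks for External Access

# Option 1: SSH tunnel (simplest, for personal use)
# From your local machine:
ssh -L 1416:localhost:1416 root@<clore-ip> -p <clore-ssh-port>
# Then open http://localhost:1416 locally

# Option 2: Add basic auth through an nginx reverse proxy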
docker run -d \
  --name nginx-proxy \
  -p 80:80 \
  -v /workspace/nginx.conf:/etc/nginx/conf.d/default.conf \
  nginx:alpine

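Troubleshooting

| Problem | Likely cause | Solution |
| --- | --- | --- |
| ModuleNotFoundError: haystack | Package not installed | Rebuild the Docker image; check that pip install haystack-ai succeeded |
| CUDA out of memory | Embedding model too large | Use bge-small-en-v1.5 or reduce the batch size |
| Hayhooks returns 404 for a pipeline | YAML file not found | Check the volume mount; pipeline files must be in /app/pipelines/ |
| Slow embeddings on CPU | GPU not detected | Verify the --gpus all flag; check torch.cuda.is_available() |
| Ollama connection refused | Wrong hostname | Use --add-host=host.docker.internal:host-gateway; set the URL to http://host.docker.internal:11434 |
| HuggingFace download fails | Missing token or rate limit | Set the HF_TOKEN environment variable; confirm the model is not gated, or that your account has access to it |
| Pipeline YAML parse error | Invalid syntax | Validate the YAML with python3 -c "import yaml; yaml.safe_load(open('pipeline.yml'))" |
| Container exits immediately | Startup error | Check docker logs haystack; make sure the Dockerfile CMD is correct |
| Port 1416 unreachable from outside | Firewall / port forwarding | Expose the port in the Clore.ai order settings; check the server's open ports |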
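
A pipeline file can be valid YAML and still fail to load because a connection names a component that was never defined. The sketch below checks for that on an already-parsed pipeline, that is, the dictionary `yaml.safe_load` would return for a file with `components` and `connections` sections like the one in this guide. `find_unknown_components` is a hypothetical helper, and the deliberately incomplete example pipeline exists only to trigger the check.

```python
def find_unknown_components(pipeline):
    """Return component names used in connections but missing from the components section."""
    defined = set(pipeline.get("components", {}))
    unknown = []
    for connection in pipeline.get("connections", []):
        for end in ("sender", "receiver"):
            # "retriever.documents" -> component "retriever", socket "documents"
            name = connection[end].split(".", 1)[0]
            if name not in defined and name not in unknown:
                unknown.append(name)
    return unknown


# Parsed pipeline with prompt_builder accidentally left out of 'components'
example_pipeline = {
    "components": {"embedder": {}, "retriever": {}},
    "connections": [
        {"sender": "embedder.embedding", "receiver": "retriever.query_embedding"},
        {"sender": "retriever.documents", "receiver": "prompt_builder.documents"},
    ],
}
print(find_unknown_components(example_pipeline))
```

Running a check like this before copying a file into the pipelines directory catches typos earlier than reading the container logs after a failed deployment.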

Debugging Commands

# Check container logs
docker logs haystack --tail 50 -f

# Test the Hayhooks API
curl http://localhost:1416/status
curl http://localhost:1416/pipelines

# Interactive Python debugging session
docker exec -it haystack python3

# Check the GPU from inside the container
docker exec haystack python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"

# Check installed packages
docker exec haystack pip show haystack-ai hayhooks
