# LlamaIndex

LlamaIndex (formerly GPT Index) is a **data framework for LLM applications** with over **37,000 GitHub stars**. While LangChain focuses on chaining LLM calls, LlamaIndex excels at **data ingestion, indexing, and structured querying** — making it the go-to choice when your application needs to reason over large, heterogeneous document collections.

LlamaIndex provides first-class support for ingesting diverse data sources (databases, APIs, PDFs, Notion pages, GitHub repos) and for sophisticated retrieval strategies. Running it on Clore.ai GPU servers with local LLMs eliminates API costs and keeps your data private.

Key strengths:

* 📊 **Data connectors** — 160+ integrations (PDF, SQL, Notion, Slack, GitHub, etc.)
* 🗂️ **Multiple index types** — vector, tree, list, keyword, knowledge graph
* 🔍 **Advanced retrieval** — sub-question decomposition, recursive retrieval, hybrid search
* 🤖 **Query engines** — SQL, structured, and natural language over any data source
* 🧩 **Multi-modal** — images, audio, and video alongside text
* 💾 **Persistence** — built-in support for ChromaDB, Pinecone, Weaviate, etc.
* ⚡ **Async-first** — built for production throughput
* 🔗 **LangChain compatible** — use both frameworks together

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

***

## Server Requirements

| Parameter | Minimum                 | Recommended                   |
| --------- | ----------------------- | ----------------------------- |
| GPU       | NVIDIA RTX 3080 (10 GB) | NVIDIA RTX 4090 (24 GB)       |
| VRAM      | 8 GB (7B model)         | 24 GB (13B–34B models)        |
| RAM       | 16 GB                   | 32–64 GB                      |
| CPU       | 4 cores                 | 16 cores                      |
| Disk      | 30 GB                   | 100+ GB (local models + data) |
| OS        | Ubuntu 20.04+           | Ubuntu 22.04                  |
| CUDA      | 11.8+                   | 12.1+                         |
| Python    | 3.9+                    | 3.11                          |
| Ports     | 22, 8000                | 22, 8000, 11434 (Ollama)      |

{% hint style="info" %}
LlamaIndex is a Python library — GPU resources are consumed by the underlying LLM and embedding model. For production deployments, pair LlamaIndex with Ollama (for local inference) and ChromaDB (for vector storage), both running on your Clore.ai GPU server.
{% endhint %}

***

## Quick Deploy on CLORE.AI

### 1. Find a suitable server

Go to [CLORE.AI Marketplace](https://clore.ai/marketplace) and choose based on your LLM size:

| Use Case              | GPU              | Notes                          |
| --------------------- | ---------------- | ------------------------------ |
| Development / Testing | RTX 3080 (10 GB) | 7B models, small document sets |
| Production (small)    | RTX 4090 (24 GB) | 13B models, medium datasets    |
| Production (large)    | A100 40G / 80G   | 34B–70B models, large datasets |
| Enterprise            | H100 (80 GB)     | Maximum throughput             |

### 2. Configure your deployment

**Docker Image (base):**

```
nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04
```

**Port Mappings:**

```
22    → SSH access
8000  → LlamaIndex API / Gradio UI
11434 → Ollama inference engine
```

**Startup Script:**

```bash
#!/bin/bash
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
ollama serve &
sleep 5
ollama pull llama3:8b
ollama pull nomic-embed-text

# Install LlamaIndex
pip install llama-index llama-index-llms-ollama llama-index-embeddings-ollama
pip install chromadb fastapi uvicorn

python /workspace/app.py
```

### 3. Access the API

```
http://<your-clore-server-ip>:8000
```
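
The startup script above launches `/workspace/app.py`, which is not included in this guide. Below is a minimal sketch of what that entry point could look like — the `/query` endpoint name and the persisted-index path are illustrative assumptions, and it presumes the index from Step 6 already exists:

```python
# /workspace/app.py — minimal sketch of the API server the startup script expects.
# Assumptions: index persisted to /workspace/index_storage (see Step 6),
# Ollama reachable on localhost:11434; the endpoint name is illustrative.
from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn

from llama_index.core import Settings, StorageContext, load_index_from_storage
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

Settings.llm = Ollama(model="llama3:8b", base_url="http://localhost:11434")
Settings.embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",
    base_url="http://localhost:11434",
)

# Load the persisted index once at startup and reuse the query engine
storage_context = StorageContext.from_defaults(persist_dir="/workspace/index_storage")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine(similarity_top_k=5)

app = FastAPI(title="LlamaIndex API")

class Query(BaseModel):
    question: str

@app.post("/query")
def query(payload: Query):
    response = query_engine.query(payload.question)
    return {"answer": str(response), "sources": len(response.source_nodes)}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

Once the server is up, a request like the following should return a JSON answer:

```bash
# Example request against the hypothetical /query endpoint defined above
curl -X POST http://<your-clore-server-ip>:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What GPUs are available on Clore.ai?"}'
```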

***

## Step-by-Step Setup

### Step 1: SSH into your server

```bash
ssh root@<your-clore-server-ip> -p <ssh-port>
```

### Step 2: Install Ollama

```bash
curl -fsSL https://ollama.ai/install.sh | sh
ollama serve &
sleep 5

# Pull models
ollama pull llama3:8b              # LLM for generation
ollama pull nomic-embed-text       # Embedding model

# Verify
ollama list
```

### Step 3: Set up Python environment

```bash
mkdir -p /workspace/llamaindex-app
cd /workspace/llamaindex-app

python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
```

### Step 4: Install LlamaIndex packages

```bash
# Core LlamaIndex
pip install llama-index

# LLM integrations
pip install llama-index-llms-ollama
pip install llama-index-llms-openai     # Optional: OpenAI

# Embedding integrations
pip install llama-index-embeddings-ollama
pip install llama-index-embeddings-huggingface

# Vector store integrations
pip install llama-index-vector-stores-chroma

# Data loaders
pip install llama-index-readers-file
pip install llama-index-readers-web

# Optional: additional readers
pip install pypdf docx2txt
```

### Step 5: Configure global settings

```python
# settings.py
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

# Configure LLM
Settings.llm = Ollama(
    model="llama3:8b",
    base_url="http://localhost:11434",
    request_timeout=300.0,
    temperature=0.1,
)

# Configure embeddings
Settings.embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",
    base_url="http://localhost:11434",
)

# Configure chunk settings
Settings.chunk_size = 1024
Settings.chunk_overlap = 200
```

### Step 6: Build your first index

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Load documents from a directory
documents = SimpleDirectoryReader("/workspace/data/docs").load_data()
print(f"Loaded {len(documents)} documents")

# Build vector index (auto-embeds and stores)
index = VectorStoreIndex.from_documents(documents)

# Save index to disk
index.storage_context.persist("/workspace/index_storage")
print("Index built and saved!")
```

### Step 7: Query the index

```python
from llama_index.core import load_index_from_storage, StorageContext

# Load existing index
storage_context = StorageContext.from_defaults(persist_dir="/workspace/index_storage")
index = load_index_from_storage(storage_context)

# Create query engine
query_engine = index.as_query_engine(similarity_top_k=5)

# Ask questions
response = query_engine.query("What GPU servers are available on Clore.ai?")
print(f"Answer: {response}")
print(f"\nSources: {len(response.source_nodes)} nodes used")
```

***

## Usage Examples

### Example 1: Basic Document Q\&A

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from pathlib import Path

# Configure LlamaIndex with local Ollama models
Settings.llm = Ollama(model="llama3:8b", base_url="http://localhost:11434")
Settings.embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",
    base_url="http://localhost:11434"
)

# Create sample documents directory
data_dir = Path("/workspace/data")
data_dir.mkdir(exist_ok=True)

# Create a sample document
(data_dir / "clore_faq.txt").write_text("""
Clore.ai FAQ

Q: What is Clore.ai?
A: Clore.ai is a decentralized GPU cloud marketplace connecting GPU owners with AI researchers and developers who need computing power.

Q: What GPUs are available?
A: Clore.ai offers GPUs ranging from NVIDIA GTX 1080 to the latest H100 80GB. Popular options include RTX 4090, A100 40G/80G, and RTX 3090.

Q: How does pricing work?
A: Prices are set by GPU providers and vary by GPU model, VRAM, and availability. Generally 30-70% cheaper than AWS/GCP.

Q: What software can I run?
A: Any Docker container. Pre-configured images for PyTorch, TensorFlow, ComfyUI, Stable Diffusion, and more are available.
""")

# Build index
documents = SimpleDirectoryReader(str(data_dir)).load_data()
index = VectorStoreIndex.from_documents(documents, show_progress=True)

# Query
query_engine = index.as_query_engine(similarity_top_k=3)

questions = [
    "What GPUs does Clore.ai offer?",
    "How does Clore.ai pricing compare to AWS?",
    "Can I run custom Docker containers?",
]

for q in questions:
    print(f"\n❓ {q}")
    response = query_engine.query(q)
    print(f"💬 {response}")
```

***

### Example 2: Multi-Document RAG with ChromaDB

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext, Settings
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
import chromadb

# Configure LLM and embeddings
Settings.llm = Ollama(model="llama3:8b", base_url="http://localhost:11434")
Settings.embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",
    base_url="http://localhost:11434"
)

# Connect to ChromaDB (running on same Clore.ai server)
chroma_client = chromadb.HttpClient(host="localhost", port=8001)
chroma_collection = chroma_client.get_or_create_collection("llamaindex_docs")

# Create ChromaDB vector store for LlamaIndex
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Load documents from multiple sources
docs_dir = "/workspace/data/docs"
documents = SimpleDirectoryReader(
    docs_dir,
    recursive=True,              # Include subdirectories
    required_exts=[".pdf", ".txt", ".md"],  # Only these formats
    filename_as_id=True          # Use filename as doc ID
).load_data()

print(f"Loaded {len(documents)} documents from {docs_dir}")

# Build index (stores in ChromaDB)
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    show_progress=True
)
print("Index built and persisted in ChromaDB!")

# Load existing index (future sessions)
# index = VectorStoreIndex.from_vector_store(vector_store)

# Advanced query engine with metadata filtering
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters

# Query with metadata filter
# Note: SimpleDirectoryReader stores the MIME type (e.g. "application/pdf") under "file_type"
filtered_engine = index.as_query_engine(
    similarity_top_k=5,
    filters=MetadataFilters(
        filters=[
            MetadataFilter(key="file_type", value="application/pdf"),
        ]
    )
)

response = filtered_engine.query("Summarize the key technical concepts in the documents.")
print(f"\nFiltered response: {response}")
```

***

### Example 3: Sub-Question Decomposition

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

Settings.llm = Ollama(model="llama3:8b", base_url="http://localhost:11434")
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text", base_url="http://localhost:11434")

# Create separate indices for different knowledge domains
def build_index(docs_path):
    docs = SimpleDirectoryReader(docs_path).load_data()
    return VectorStoreIndex.from_documents(docs)

# Separate knowledge bases
pricing_index = build_index("/workspace/data/pricing")
technical_index = build_index("/workspace/data/technical")
faq_index = build_index("/workspace/data/faq")

# Wrap as tools
tools = [
    QueryEngineTool(
        query_engine=pricing_index.as_query_engine(),
        metadata=ToolMetadata(
            name="pricing_docs",
            description="Contains pricing information, cost comparisons, and billing details for Clore.ai."
        )
    ),
    QueryEngineTool(
        query_engine=technical_index.as_query_engine(),
        metadata=ToolMetadata(
            name="technical_docs",
            description="Contains technical documentation about GPU specs, Docker deployment, and APIs."
        )
    ),
    QueryEngineTool(
        query_engine=faq_index.as_query_engine(),
        metadata=ToolMetadata(
            name="faq_docs",
            description="Contains frequently asked questions and their answers."
        )
    ),
]

# Sub-question engine decomposes complex queries
sub_question_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=tools,
    verbose=True
)

# Complex multi-part question
complex_question = """
Compare the cost of running a 7B parameter LLM on Clore.ai vs AWS for 100 hours,
and explain the technical setup required for each option.
"""

print(f"Question: {complex_question}")
response = sub_question_engine.query(complex_question)
print(f"\nComprehensive Answer:\n{response}")
```

***

### Example 4: Knowledge Graph Index

```python
from llama_index.core import KnowledgeGraphIndex, SimpleDirectoryReader, Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

Settings.llm = Ollama(model="llama3:13b", base_url="http://localhost:11434")  # Larger model for better extraction
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text", base_url="http://localhost:11434")

# Load documents
documents = SimpleDirectoryReader("/workspace/data/docs").load_data()

# Build Knowledge Graph (extracts entities and relationships)
kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    max_triplets_per_chunk=10,   # Extract up to 10 triplets per chunk
    include_embeddings=True,
    show_progress=True
)

# Save the graph
kg_index.storage_context.persist("/workspace/kg_storage")
print(f"Knowledge graph built!")
print(f"Nodes: {len(kg_index.index_struct.table)}")

# Query the knowledge graph
kg_query_engine = kg_index.as_query_engine(
    include_text=True,            # Include source text
    retriever_mode="keyword",     # Use keyword-based retrieval
    response_mode="tree_summarize"
)

questions = [
    "What are the relationships between GPU models and use cases?",
    "How are pricing and GPU specifications related?",
    "What deployment methods connect to which services?",
]

for q in questions:
    print(f"\n🔍 {q}")
    response = kg_query_engine.query(q)
    print(f"📊 {response}")
```

***

### Example 5: SQL Query Engine over Database

```python
from llama_index.core import SQLDatabase, Settings
from llama_index.core.query_engine import NLSQLTableQueryEngine
from llama_index.llms.ollama import Ollama
from sqlalchemy import create_engine, text

Settings.llm = Ollama(model="llama3:8b", base_url="http://localhost:11434")

# Create sample database with GPU marketplace data
engine = create_engine("sqlite:////workspace/clore_data.db")

# Create and populate tables
with engine.connect() as conn:
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS gpu_servers (
            id INTEGER PRIMARY KEY,
            gpu_model TEXT,
            vram_gb INTEGER,
            price_per_hour REAL,
            location TEXT,
            available INTEGER
        )
    """))

    conn.execute(text("""
        INSERT OR REPLACE INTO gpu_servers VALUES
        (1, 'RTX 4090', 24, 0.65, 'US-East', 1),
        (2, 'RTX 4090', 24, 0.70, 'EU-West', 1),
        (3, 'A100 80G', 80, 2.50, 'US-West', 1),
        (4, 'H100 80G', 80, 4.20, 'US-East', 0),
        (5, 'RTX 3090', 24, 0.35, 'Asia-Pacific', 1),
        (6, 'RTX 3080', 10, 0.20, 'EU-East', 1),
        (7, 'A100 40G', 40, 1.50, 'US-East', 1)
    """))
    conn.commit()

# Create LlamaIndex SQL database wrapper
sql_database = SQLDatabase(engine, include_tables=["gpu_servers"])

# Natural language to SQL query engine
query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=["gpu_servers"],
)

# Query the database in natural language
nl_queries = [
    "What is the cheapest GPU server available?",
    "Show me all GPU servers with more than 40GB of VRAM",
    "What is the average price per hour for RTX 4090 servers?",
    "Which locations have GPU servers available?",
    "List all available A100 servers sorted by price",
]

for query in nl_queries:
    print(f"\n💬 Natural Language: {query}")
    response = query_engine.query(query)
    print(f"📊 Answer: {response}")
    if hasattr(response, 'metadata') and 'sql_query' in response.metadata:
        print(f"🔧 SQL: {response.metadata['sql_query']}")
```

***

## Configuration

### Docker Compose (Full LlamaIndex Stack)

```yaml
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    runtime: nvidia
    ports:
      - "11434:11434"
    volumes:
      - ollama_models:/root/.ollama
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    restart: unless-stopped

  chromadb:
    image: chromadb/chroma:latest
    container_name: chromadb
    ports:
      - "8001:8000"
    volumes:
      - chroma_data:/chroma/chroma
    environment:
      - IS_PERSISTENT=TRUE
      - ANONYMIZED_TELEMETRY=FALSE
    restart: unless-stopped

  llamaindex-api:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: llamaindex-api
    ports:
      - "8000:8000"
    volumes:
      - ./data:/workspace/data
      - ./indices:/workspace/indices
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - CHROMA_HOST=chromadb
      - CHROMA_PORT=8000
      - LLM_MODEL=llama3:8b
      - EMBED_MODEL=nomic-embed-text
    depends_on:
      - ollama
      - chromadb
    restart: unless-stopped

volumes:
  ollama_models:
  chroma_data:
```
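
The `llamaindex-api` service builds from a local `Dockerfile` that is not shown in this guide. A minimal sketch, assuming the package list mirrors Step 4 and the entry point is the `app.py` sketched earlier:

```dockerfile
# Minimal sketch of the Dockerfile referenced by the llamaindex-api service.
# Package list mirrors the setup steps; adjust the entry point to your app.
FROM python:3.11-slim

WORKDIR /workspace

RUN pip install --no-cache-dir \
    llama-index \
    llama-index-llms-ollama \
    llama-index-embeddings-ollama \
    llama-index-vector-stores-chroma \
    llama-index-readers-file \
    chromadb fastapi uvicorn

COPY app.py /workspace/app.py

EXPOSE 8000
CMD ["python", "/workspace/app.py"]
```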

### Key Configuration Variables

| Setting                   | Default        | Description                |
| ------------------------- | -------------- | -------------------------- |
| `Settings.llm`            | OpenAI GPT-3.5 | LLM for generation         |
| `Settings.embed_model`    | OpenAI Ada     | Embedding model            |
| `Settings.chunk_size`     | 1024           | Text chunk size in tokens  |
| `Settings.chunk_overlap`  | 200            | Overlap between chunks     |
| `Settings.num_output`     | 256            | Max tokens in LLM response |
| `Settings.context_window` | 4096           | LLM context window size    |
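
These defaults can be overridden on the same `Settings` object used in Step 5. A short sketch — the values below are illustrative, not recommendations:

```python
from llama_index.core import Settings

# Illustrative overrides; tune to your model and hardware
Settings.num_output = 512         # cap on tokens generated per response
Settings.context_window = 8192    # must not exceed the model's actual context window
```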

***

## Performance Tips

### 1. Async Queries for Throughput

```python
import asyncio
from llama_index.core import VectorStoreIndex

query_engine = index.as_query_engine(use_async=True)

async def batch_query(questions):
    tasks = [query_engine.aquery(q) for q in questions]
    return await asyncio.gather(*tasks)

questions = ["Q1?", "Q2?", "Q3?", "Q4?", "Q5?"]
answers = asyncio.run(batch_query(questions))
```

### 2. Hybrid Search (Keyword + Semantic)

```python
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever  # pip install llama-index-retrievers-bm25

# Combine semantic (vector) and keyword (BM25) retrieval
vector_retriever = index.as_retriever(similarity_top_k=5)
bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore,   # BM25 scores the nodes stored in the index docstore
    similarity_top_k=5,
)

retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=5,
    num_queries=3,  # Generate multiple query variations
    use_async=True,
    verbose=True,
)

query_engine = RetrieverQueryEngine(retriever=retriever)
```

### 3. Re-Ranking for Quality

```python
from llama_index.core.postprocessor import SentenceTransformerRerank

# Add re-ranking step after retrieval (requires: pip install sentence-transformers)
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2",
    top_n=3
)

query_engine = index.as_query_engine(
    similarity_top_k=10,  # Retrieve more candidates
    node_postprocessors=[reranker]  # Re-rank to top 3
)
```

### 4. Streaming for Responsive UIs

```python
# Stream tokens as they're generated
streaming_engine = index.as_query_engine(streaming=True)
response = streaming_engine.query("Explain how Clore.ai works")

for token in response.response_gen:
    print(token, end="", flush=True)
```

***

## Troubleshooting

### Issue: Embedding model not connecting to Ollama

```bash
# Test Ollama embeddings directly
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "test text"
}'
```

### Issue: Index building is slow

```bash
# Monitor GPU usage during embedding
watch -n1 nvidia-smi
```

```python
# Use smaller batch sizes and track progress
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(
    docs,
    show_progress=True,
    # Insert in smaller batches
)
```

### Issue: ModuleNotFoundError for integrations

```bash
# LlamaIndex uses plugin architecture in v0.10+
pip install llama-index-llms-ollama
pip install llama-index-embeddings-ollama
pip install llama-index-vector-stores-chroma

# Check installed packages
pip list | grep llama
```

### Issue: Context window exceeded

```python
# Reduce chunk size
Settings.chunk_size = 512
Settings.chunk_overlap = 50

# Or use a model with larger context
Settings.llm = Ollama(
    model="llama3:8b",
    context_window=8192  # Extend context window
)
```

### Issue: Queries return irrelevant results

```python
# Increase similarity top-k
query_engine = index.as_query_engine(similarity_top_k=10)

# Or use a better embedding model
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-large-en-v1.5"
)
```

***

## Links

* **GitHub**: <https://github.com/run-llama/llama_index>
* **Official Docs**: <https://docs.llamaindex.ai>
* **PyPI**: <https://pypi.org/project/llama-index>
* **Integrations**: <https://llamahub.ai>
* **Discord**: <https://discord.gg/dGcwcsnxhU>
* **Blog**: <https://www.llamaindex.ai/blog>
* **CLORE.AI Marketplace**: <https://clore.ai/marketplace>

***

## Clore.ai GPU Recommendations

| Use Case                  | Recommended GPU | Est. Cost on Clore.ai |
| ------------------------- | --------------- | --------------------- |
| Development/Testing       | RTX 3090 (24GB) | \~$0.12/gpu/hr        |
| Production RAG            | RTX 3090 (24GB) | \~$0.12/gpu/hr        |
| High-throughput Embedding | RTX 4090 (24GB) | \~$0.70/gpu/hr        |

> 💡 All examples in this guide can be deployed on [Clore.ai](https://clore.ai/marketplace) GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.

