# HuggingFace Transformers

Use the Transformers library for NLP, vision, and audio on GPU.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is Transformers?

Hugging Face Transformers provides:

* 100,000+ pretrained models
* Easy model loading and inference
* Fine-tuning support
* Multi-modal capabilities

## Quick Deploy

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
```

**Command:**

```bash
pip install transformers accelerate datasets huggingface_hub
```

## Installation

```bash
pip install transformers[torch]
pip install accelerate  # For large models
pip install datasets    # For training data
```

## Text Generation

### Basic Generation

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

prompt = "Explain quantum computing in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.7,
    do_sample=True
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Chat Models

```python
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

messages = [
    {"role": "user", "content": "What is machine learning?"}
]

outputs = pipe(
    messages,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7
)

print(outputs[0]["generated_text"][-1]["content"])
```

### Streaming

```python
from transformers import TextStreamer

streamer = TextStreamer(tokenizer)

model.generate(
    **inputs,
    max_new_tokens=200,
    streamer=streamer
)
```

## Quantization

### 4-bit Quantization

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
```

### 8-bit Quantization

```python
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    load_in_8bit=True,
    device_map="auto"
)
```

## Embeddings

```python
from transformers import AutoModel, AutoTokenizer
import torch

model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).cuda()

def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to("cuda")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling
    embedding = outputs.last_hidden_state.mean(dim=1)
    return embedding

emb = get_embedding("Hello, world!")
print(f"Embedding shape: {emb.shape}")
```

## Image Classification

```python
from transformers import pipeline
from PIL import Image

classifier = pipeline("image-classification", model="google/vit-base-patch16-224", device=0)

image = Image.open("cat.jpg")
results = classifier(image)

for result in results:
    print(f"{result['label']}: {result['score']:.4f}")
```

## Object Detection

```python
from transformers import pipeline
from PIL import Image

detector = pipeline("object-detection", model="facebook/detr-resnet-50", device=0)

image = Image.open("street.jpg")
results = detector(image)

for result in results:
    print(f"{result['label']}: {result['score']:.4f} at {result['box']}")
```

## Image Segmentation

```python
from transformers import pipeline
from PIL import Image

segmenter = pipeline("image-segmentation", model="facebook/maskformer-swin-base-ade", device=0)

image = Image.open("scene.jpg")
results = segmenter(image)

for segment in results:
    print(f"{segment['label']}: score {segment['score']:.4f}")
```

## Speech Recognition

```python
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    device=0
)

result = transcriber("audio.mp3")
print(result["text"])
```

## Text-to-Speech

```python
from transformers import pipeline
import scipy

synthesizer = pipeline("text-to-speech", model="microsoft/speecht5_tts", device=0)

speech = synthesizer("Hello, this is a test of text to speech.")

scipy.io.wavfile.write("output.wav", rate=speech["sampling_rate"], data=speech["audio"])
```

## Fine-Tuning

### Prepare Dataset

```python
from datasets import load_dataset

dataset = load_dataset("imdb")
train_dataset = dataset["train"].select(range(1000))
eval_dataset = dataset["test"].select(range(200))
```

### Training

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer
)

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_eval = eval_dataset.map(tokenize_function, batched=True)

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
)

trainer.train()
```

## Pipeline Tasks

| Task                 | Pipeline Name                  |
| -------------------- | ------------------------------ |
| Text Generation      | `text-generation`              |
| Fill Mask            | `fill-mask`                    |
| Summarization        | `summarization`                |
| Translation          | `translation`                  |
| Question Answering   | `question-answering`           |
| Sentiment Analysis   | `sentiment-analysis`           |
| Image Classification | `image-classification`         |
| Object Detection     | `object-detection`             |
| Speech Recognition   | `automatic-speech-recognition` |
| Text-to-Speech       | `text-to-speech`               |

## Multi-GPU

```python
from transformers import AutoModelForCausalLM

# Automatic device placement
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    device_map="auto",
    torch_dtype=torch.float16
)

# Manual device map
device_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 0,
    # ...
    "model.layers.39": 1,
    "model.norm": 1,
    "lm_head": 1
}

model = AutoModelForCausalLM.from_pretrained(
    "model_name",
    device_map=device_map
)
```

## Memory Optimization

```python

# Flash Attention 2
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="auto"
)

# Gradient checkpointing
model.gradient_checkpointing_enable()
```

## Model Hub

### Download Models

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Llama-2-7b-hf",
    local_dir="./llama-2-7b"
)
```

### Upload Models

```python
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="./my_model",
    repo_id="username/my-model",
    repo_type="model"
)
```

## Performance Tips

| Tip                             | Effect              |
| ------------------------------- | ------------------- |
| Use `torch_dtype=torch.float16` | 50% less memory     |
| Enable Flash Attention 2        | 2x faster attention |
| Use quantization                | 75% less memory     |
| Batch inference                 | Higher throughput   |

## Troubleshooting

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers

## Next Steps

* vLLM Inference - Production serving
* [Fine-tune LLMs](https://docs.clore.ai/guides/training/finetune-llm) - LoRA training
* [DeepSpeed Training](https://docs.clore.ai/guides/training/deepspeed-training) - Distributed training
