# Phi-4

Run Microsoft's Phi-4, a small but powerful language model.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is Phi-4?

Phi-4 from Microsoft offers:

* 14B parameters in a compact footprint
* Benchmark performance that rivals much larger models
* Strong reasoning, math, and coding capabilities
* Efficient inference on a single GPU

## Model Variants

| Model          | Parameters        | VRAM | Specialty          |
| -------------- | ----------------- | ---- | ------------------ |
| Phi-4          | 14B               | 16GB | General            |
| Phi-3.5-mini   | 3.8B              | 4GB  | Lightweight        |
| Phi-3.5-MoE    | 42B (6.6B active) | 16GB | Mixture of Experts |
| Phi-3.5-vision | 4.2B              | 6GB  | Vision             |

## Quick Deploy

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime
```

**Ports:**

```
22/tcp
8000/http
```

**Command:**

```bash
# phi4_server.py is a serving script you provide
# (e.g. the Gradio app shown later in this guide)
pip install transformers accelerate torch && \
python phi4_server.py
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.

## Using Ollama

```bash
# Install Ollama first (one-line installer)
curl -fsSL https://ollama.com/install.sh | sh

# Run Phi-4
ollama run phi4

# Phi-3.5 mini (faster)
ollama run phi3.5

# Phi-3.5 vision
ollama run phi3.5-vision
```

## Installation

```bash
pip install transformers accelerate torch
```

## Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "microsoft/Phi-4"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain the difference between TCP and UDP."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True
).to("cuda")

outputs = model.generate(
    inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(response)
```

## Phi-3.5-Vision

For image understanding:

```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

model_id = "microsoft/Phi-3.5-vision-instruct"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

image = Image.open("diagram.png")

messages = [
    {"role": "user", "content": "<|image_1|>\nDescribe this diagram in detail."}
]

prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(prompt, [image], return_tensors="pt").to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

# Decode only the newly generated tokens, skipping the echoed prompt
response = processor.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(response)
```

## Math and Reasoning

```python
messages = [
    {"role": "user", "content": """
Solve step by step:
A farmer has chickens and rabbits.
Total heads: 35
Total legs: 94
How many of each animal?
"""}
]

# Phi-4 excels at step-by-step reasoning
```
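The puzzle above is a two-equation system you can verify by hand: with `c` chickens and `r` rabbits, `c + r = 35` and `2c + 4r = 94`. Subtracting twice the heads equation from the legs equation gives `2r = 24`, so `r = 12` and `c = 23`. A quick sketch to check the model's answer:

```python
# Farmer puzzle: c + r = 35 heads, 2c + 4r = 94 legs.
# 2c + 4r - 2(c + r) = 94 - 70  =>  2r = 24  =>  r = 12, c = 23.
heads, legs = 35, 94

rabbits = (legs - 2 * heads) // 2
chickens = heads - rabbits

assert chickens + rabbits == heads
assert 2 * chickens + 4 * rabbits == legs
print(chickens, rabbits)  # 23 12
```

Checking a reasoning model's arithmetic like this is a cheap sanity test when you first deploy.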

## Code Generation

```python
messages = [
    {"role": "user", "content": """
Write a Python implementation of binary search tree with:
- Insert
- Search
- Delete
- In-order traversal
Include type hints and docstrings.
"""}
]
```
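For reference when judging the model's output, a minimal hand-written version of the tree the prompt asks for might look like this (docstrings kept short; this is a sketch of one correct answer, not the model's output):

```python
from __future__ import annotations
from typing import Iterator, Optional


class Node:
    """A single BST node."""
    def __init__(self, key: int) -> None:
        self.key = key
        self.left: Optional[Node] = None
        self.right: Optional[Node] = None


class BST:
    """Binary search tree with insert, search, delete, in-order traversal."""
    def __init__(self) -> None:
        self.root: Optional[Node] = None

    def insert(self, key: int) -> None:
        """Insert key, ignoring duplicates."""
        def _ins(node: Optional[Node], key: int) -> Node:
            if node is None:
                return Node(key)
            if key < node.key:
                node.left = _ins(node.left, key)
            elif key > node.key:
                node.right = _ins(node.right, key)
            return node
        self.root = _ins(self.root, key)

    def search(self, key: int) -> bool:
        """Return True if key is present."""
        node = self.root
        while node:
            if key == node.key:
                return True
            node = node.left if key < node.key else node.right
        return False

    def delete(self, key: int) -> None:
        """Remove key if present, rebuilding links recursively."""
        def _del(node: Optional[Node], key: int) -> Optional[Node]:
            if node is None:
                return None
            if key < node.key:
                node.left = _del(node.left, key)
            elif key > node.key:
                node.right = _del(node.right, key)
            else:
                if node.left is None:
                    return node.right
                if node.right is None:
                    return node.left
                # Two children: replace with in-order successor.
                succ = node.right
                while succ.left:
                    succ = succ.left
                node.key = succ.key
                node.right = _del(node.right, succ.key)
            return node
        self.root = _del(self.root, key)

    def inorder(self) -> Iterator[int]:
        """Yield keys in sorted order."""
        def _walk(node: Optional[Node]) -> Iterator[int]:
            if node:
                yield from _walk(node.left)
                yield node.key
                yield from _walk(node.right)
        return _walk(self.root)
```

Compare the model's answer against this shape: correct delete handling (zero, one, and two children) is where generated code most often goes wrong.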

## Quantized Inference

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4",
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True
)
```
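A rough rule of thumb for whether quantization is needed: weight memory is parameter count times bytes per parameter, so Phi-4's 14B parameters take about 28GB in bf16 (2 bytes each) but only about 7GB of weights at 4 bits, plus overhead for activations and the KV cache. A back-of-the-envelope sketch:

```python
def weight_gb(params_billion: float, bits: int) -> float:
    """Approximate weight memory in GB: params * (bits / 8) bytes."""
    return params_billion * 1e9 * bits / 8 / 1e9

print(weight_gb(14, 16))  # 28.0 GB in bf16
print(weight_gb(14, 4))   # 7.0 GB at 4-bit
```

This is weights only; budget a few extra GB for the KV cache at long contexts.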

## Gradio Interface

```python
import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "microsoft/Phi-4"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

def chat(message, history, system_prompt, temperature):
    messages = [{"role": "system", "content": system_prompt}]
    for h in history:
        messages.append({"role": "user", "content": h[0]})
        messages.append({"role": "assistant", "content": h[1]})
    messages.append({"role": "user", "content": message})

    inputs = tokenizer.apply_chat_template(
        messages, return_tensors="pt", add_generation_prompt=True
    ).to("cuda")
    outputs = model.generate(inputs, max_new_tokens=512, temperature=temperature, do_sample=True)

    return tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)

demo = gr.ChatInterface(
    fn=chat,
    additional_inputs=[
        gr.Textbox(value="You are a helpful assistant.", label="System"),
        gr.Slider(0.1, 1.5, value=0.7, label="Temperature")
    ],
    title="Phi-4 Chat"
)

demo.launch(server_name="0.0.0.0", server_port=7860)
```

## Performance

| Model         | GPU      | Tokens/sec |
| ------------- | -------- | ---------- |
| Phi-3.5-mini  | RTX 3060 | \~100      |
| Phi-3.5-mini  | RTX 4090 | \~150      |
| Phi-4         | RTX 4090 | \~60       |
| Phi-4         | A100     | \~90       |
| Phi-4 (4-bit) | RTX 3090 | \~40       |
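The throughput figures above translate directly into response latency: a 512-token reply at ~60 tokens/sec on an RTX 4090 takes roughly 8.5 seconds. A quick sketch (the throughputs are the rough estimates from the table, not measurements):

```python
def generation_seconds(tokens: int, tokens_per_sec: float) -> float:
    """Estimated wall-clock time to generate `tokens` at a given throughput."""
    return tokens / tokens_per_sec

for setup, tps in [("Phi-4 / RTX 4090", 60),
                   ("Phi-4 / A100", 90),
                   ("Phi-3.5-mini / RTX 4090", 150)]:
    print(f"{setup}: ~{generation_seconds(512, tps):.1f}s for 512 tokens")
```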

## Benchmarks

| Model         | MMLU  | HumanEval | GSM8K |
| ------------- | ----- | --------- | ----- |
| Phi-4         | 84.8% | 82.6%     | 94.6% |
| GPT-4-Turbo   | 86.4% | 85.4%     | 94.2% |
| Llama-3.1-70B | 83.6% | 80.5%     | 92.1% |

*Phi-4 matches or beats much larger models*

## Troubleshooting

### "trust\_remote\_code" error

* Add `trust_remote_code=True` to `from_pretrained()`
* This is required for Phi models

### Repetitive outputs

* Lower temperature (0.3-0.6)
* Add repetition\_penalty=1.1
* Use proper chat template

### Memory issues

* Phi-4's 14B parameters need \~28GB in bf16, or \~10GB with 4-bit quantization
* Use 4-bit quantization on smaller GPUs
* Reduce context length to shrink the KV cache

### Wrong output format

* Use `apply_chat_template()` for proper formatting
* Check you're using instruct version, not base

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*
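Session cost is simply the hourly rate times hours rented; for example, a 4-hour Phi-4 session on an RTX 4090 at ~$0.10/hr comes to about $0.40. A quick sketch using the indicative rates above:

```python
def session_cost(hourly_rate_usd: float, hours: float) -> float:
    """Estimated rental cost in USD, rounded to cents."""
    return round(hourly_rate_usd * hours, 2)

print(session_cost(0.10, 4))   # RTX 4090, 4-hour session
print(session_cost(0.10, 24))  # RTX 4090, full day
print(session_cost(0.25, 4))   # A100 80GB, 4-hour session
```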

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers

## Use Cases

* Math tutoring
* Code assistance
* Document analysis (vision)
* Efficient edge deployment
* Cost-effective inference

## Next Steps

* Qwen2.5 - Alternative model
* Gemma 2 - Google's model
* Llama 3.2 - Meta's model
