# Phi-4

Run Microsoft's Phi-4, a small but powerful language model.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is Phi-4?

Phi-4 from Microsoft offers:

* 14B parameters with performance that rivals much larger models
* Strong step-by-step reasoning and math
* Competitive benchmark scores (see Benchmarks below)
* Efficient inference on a single GPU

## Model Variants

| Model          | Parameters        | VRAM (approx.) | Specialty          |
| -------------- | ----------------- | -------------- | ------------------ |
| Phi-4          | 14B               | 16GB | General            |
| Phi-3.5-mini   | 3.8B              | 4GB  | Lightweight        |
| Phi-3.5-MoE    | 42B (6.6B active) | 16GB | Mixture of Experts |
| Phi-3.5-vision | 4.2B              | 6GB  | Vision             |

## Quick Deploy

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime
```

**Ports:**

```
22/tcp
8000/http
```

**Command:**

```bash
# phi4_server.py is your own serving script; a minimal sketch follows below
pip install transformers accelerate torch && \
python phi4_server.py
```
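
The startup command assumes a `phi4_server.py` already on the instance. Below is a minimal sketch of what such a script could look like, built on FastAPI and uvicorn (hypothetical layout, not an official script; you would also need `pip install fastapi uvicorn` in the startup command):

```python
# Hypothetical phi4_server.py: a minimal HTTP wrapper around Phi-4
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import uvicorn

model_id = "microsoft/phi-4"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 512

@app.post("/generate")
def generate(req: GenerateRequest):
    messages = [{"role": "user", "content": req.prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, return_tensors="pt", add_generation_prompt=True
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=req.max_new_tokens)
    text = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
    return {"response": text}

if __name__ == "__main__":
    # Listen on the HTTP port exposed in the order configuration (8000 above)
    uvicorn.run(app, host="0.0.0.0", port=8000)
```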

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.
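
For example, a quick reachability check from your local machine (a sketch using the `requests` package; the URL below is a placeholder):

```python
import requests

# Substitute the http_pub URL from My Orders; "abc123.clorecloud.net" is a placeholder
base_url = "https://abc123.clorecloud.net"

resp = requests.get(base_url, timeout=10)
print(resp.status_code)  # 200 means the web service behind the HTTP port is up
```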

## Using Ollama

```bash
# Install Ollama first (one-time setup)
curl -fsSL https://ollama.com/install.sh | sh

# Run Phi-4
ollama run phi4

# Phi-3.5 mini (faster)
ollama run phi3.5

# Phi-3.5 vision
ollama run phi3.5-vision
```
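
Ollama also exposes an HTTP API on port 11434, so you can query the model programmatically (replace `localhost` with your server's address if calling remotely):

```python
import requests

# Ollama's local API endpoint; "stream": False returns one complete response
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "phi4", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])
```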

## Installation

```bash
pip install transformers accelerate torch

# Optional extras used by later examples:
# pip install pillow bitsandbytes gradio
```

## Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "microsoft/Phi-4"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain the difference between TCP and UDP."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True
).to("cuda")

outputs = model.generate(
    inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(response)
```
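
For interactive use, you can stream tokens to stdout as they are generated with `TextStreamer` from transformers, reusing the `inputs` from above:

```python
from transformers import TextStreamer

# Prints tokens as they are generated, skipping the echoed prompt
streamer = TextStreamer(tokenizer, skip_prompt=True)

model.generate(
    inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
    streamer=streamer,
)
```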

## Phi-3.5-Vision

For image understanding:

```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

model_id = "microsoft/Phi-3.5-vision-instruct"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

image = Image.open("diagram.png")

messages = [
    {"role": "user", "content": "<|image_1|>\nDescribe this diagram in detail."}
]

prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(prompt, [image], return_tensors="pt").to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

# Decode only the newly generated tokens, not the echoed prompt
response = processor.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(response)
```
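
Phi-3.5-vision supports multiple images: pass them as a list in order and reference each as `<|image_1|>`, `<|image_2|>`, and so on in the prompt.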

## Math and Reasoning

```python
# Reuses the tokenizer and model loaded in Basic Usage
messages = [
    {"role": "user", "content": """
Solve step by step:
A farmer has chickens and rabbits.
Total heads: 35
Total legs: 94
How many of each animal?
"""}
]

inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to("cuda")

# Phi-4 excels at step-by-step reasoning; greedy decoding keeps the math deterministic
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False)

# Expected: 23 chickens and 12 rabbits
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```

## Code Generation

```python
# Reuses the tokenizer and model loaded in Basic Usage
messages = [
    {"role": "user", "content": """
Write a Python implementation of binary search tree with:
- Insert
- Search
- Delete
- In-order traversal
Include type hints and docstrings.
"""}
]

inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to("cuda")

# Lower temperature tends to produce more reliable code
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0.2, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```

## Quantized Inference

```python
# Requires: pip install bitsandbytes
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4",
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True
)
```
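
A quick sanity check on how much memory the quantized model actually occupies (`get_memory_footprint()` is a standard transformers method):

```python
# 14B weights at 4-bit should land around 7-8 GB, plus activation overhead
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
```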

## Gradio Interface

```python
# Requires: pip install gradio
import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "microsoft/Phi-4"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

def chat(message, history, system_prompt, temperature):
    messages = [{"role": "system", "content": system_prompt}]
    for h in history:
        messages.append({"role": "user", "content": h[0]})
        messages.append({"role": "assistant", "content": h[1]})
    messages.append({"role": "user", "content": message})

    inputs = tokenizer.apply_chat_template(
        messages, return_tensors="pt", add_generation_prompt=True
    ).to("cuda")
    outputs = model.generate(inputs, max_new_tokens=512, temperature=temperature, do_sample=True)

    return tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)

demo = gr.ChatInterface(
    fn=chat,
    additional_inputs=[
        gr.Textbox(value="You are a helpful assistant.", label="System"),
        gr.Slider(0.1, 1.5, value=0.7, label="Temperature")
    ],
    title="Phi-4 Chat"
)

demo.launch(server_name="0.0.0.0", server_port=7860)
```
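
The interface listens on port 7860; expose it as an HTTP port in your order configuration to reach it through your `http_pub` URL.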

## Performance

| Model         | GPU      | Tokens/sec |
| ------------- | -------- | ---------- |
| Phi-3.5-mini  | RTX 3060 | \~100      |
| Phi-3.5-mini  | RTX 4090 | \~150      |
| Phi-4         | RTX 4090 | \~60       |
| Phi-4         | A100     | \~90       |
| Phi-4 (4-bit) | RTX 3090 | \~40       |

## Benchmarks

| Model         | MMLU  | HumanEval | GSM8K |
| ------------- | ----- | --------- | ----- |
| Phi-4         | 84.8% | 82.6%     | 94.6% |
| GPT-4-Turbo   | 86.4% | 85.4%     | 94.2% |
| Llama-3.1-70B | 83.6% | 80.5%     | 92.1% |

*Phi-4 matches or beats much larger models on these benchmarks.*

## Troubleshooting

### "trust\_remote\_code" error

* Add `trust_remote_code=True` to `from_pretrained()`
* This is required for Phi models

### Repetitive outputs

* Lower the temperature (0.3-0.6)
* Add `repetition_penalty=1.1` to `generate()`
* Use the proper chat template via `apply_chat_template()`

### Memory issues

* Phi-4 needs \~28GB of VRAM in bf16, or \~8GB with 4-bit quantization
* Use 4-bit quantization if needed
* Reduce context length

### Wrong output format

* Use `apply_chat_template()` for proper formatting
* Check you're using instruct version, not base

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers

## Use Cases

* Math tutoring
* Code assistance
* Document analysis (vision)
* Efficient edge deployment
* Cost-effective inference

## Next Steps

* Qwen2.5 - Alternative model
* Gemma 2 - Google's model
* Llama 3.2 - Meta's model


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/language-models/phi4.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, when you need clarification or additional context, or when you want to retrieve related documentation sections.
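
For example, in Python (a minimal sketch using the `requests` package):

```python
import requests

url = "https://docs.clore.ai/guides/language-models/phi4.md"
question = "Which ports does the Phi-4 quick deploy configuration expose?"

# The response contains a direct answer plus relevant excerpts and sources
resp = requests.get(url, params={"ask": question})
print(resp.text)
```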
