# CodeLlama

{% hint style="info" %}
**Newer alternatives!** For coding tasks, consider [**Qwen2.5-Coder**](https://docs.clore.ai/guides/language-models/qwen25) (32B, state-of-the-art code gen) or [**DeepSeek-R1**](https://docs.clore.ai/guides/language-models/deepseek-r1) (reasoning + coding). CodeLlama is still useful for lightweight deployments.
{% endhint %}

Generate, complete, and explain code with Meta's CodeLlama.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## Model Variants

| Model         | Size | VRAM  | Best For        |
| ------------- | ---- | ----- | --------------- |
| CodeLlama-7B  | 7B   | 8GB   | Fast completion |
| CodeLlama-13B | 13B  | 16GB  | Balanced        |
| CodeLlama-34B | 34B  | 40GB  | Best quality    |
| CodeLlama-70B | 70B  | 80GB+ | Maximum quality |

### Variants

* **Base**: Code completion
* **Instruct**: Follow instructions
* **Python**: Python-specialized
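
As a rough rule of thumb, you can map the VRAM you rent to the largest variant that fits. A minimal sketch (the `pick_variant` helper is hypothetical; the thresholds come from the VRAM column in the table above and assume FP16 with no quantization):

```python
def pick_variant(vram_gb: float) -> str:
    """Map available VRAM (GB) to the largest CodeLlama variant that fits,
    using the FP16 requirements from the table above."""
    if vram_gb >= 80:
        return "CodeLlama-70B"
    if vram_gb >= 40:
        return "CodeLlama-34B"
    if vram_gb >= 16:
        return "CodeLlama-13B"
    if vram_gb >= 8:
        return "CodeLlama-7B"
    raise ValueError("At least 8GB VRAM is needed for CodeLlama-7B in FP16")
```

For example, a rented RTX 3090 (24GB) maps to CodeLlama-13B. Quantization (see Troubleshooting) lowers these thresholds considerably.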

## Quick Deploy

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime
```

**Ports:**

```
22/tcp
8000/http
```

**Command:**

```bash
pip install vllm && \
python -m vllm.entrypoints.openai.api_server \
    --model codellama/CodeLlama-7b-Instruct-hf \
    --port 8000
```
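
Once the container is up, you can verify the server is serving the model by querying vLLM's OpenAI-compatible `/v1/models` endpoint. A stdlib-only sketch (the helper names are our own; the endpoint and response shape follow the OpenAI API that vLLM implements):

```python
import json
import urllib.request

def model_ids(payload: dict) -> list:
    """Extract model IDs from an OpenAI-style /v1/models response."""
    return [m["id"] for m in payload.get("data", [])]

def check_server(base_url: str = "http://localhost:8000") -> list:
    """Return the list of models the vLLM server is currently serving."""
    with urllib.request.urlopen(f"{base_url}/v1/models") as resp:
        return model_ids(json.load(resp))
```

Replace `localhost` with your `http_pub` URL once deployed; a connection error usually means the model is still downloading.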

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.

## Installation

### Using Ollama

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run CodeLlama
ollama run codellama

# Run Python variant
ollama run codellama:python
```
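
Ollama also exposes a local HTTP API (port 11434 by default) that you can call from Python. A minimal sketch using only the standard library; the payload fields follow Ollama's `/api/generate` endpoint, and setting `"stream": False` returns a single JSON object instead of a line-delimited stream:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def build_payload(model: str, prompt: str) -> dict:
    """Build a non-streaming request body for Ollama's /api/generate."""
    return {"model": model, "prompt": prompt, "stream": False}

def ollama_generate(model: str, prompt: str) -> str:
    """Send a prompt to the local Ollama server and return its completion."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

For example, `ollama_generate("codellama", "def add(a, b):")` completes the function body using whichever CodeLlama tag you pulled above.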

### Using Transformers

```bash
pip install transformers accelerate
```

## Code Completion

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "codellama/CodeLlama-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Code completion
code = """
def fibonacci(n):
    '''Calculate the nth fibonacci number'''
"""

inputs = tokenizer(code, return_tensors="pt").to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.2,
    do_sample=True
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Instruct Model

For following coding instructions:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "codellama/CodeLlama-7b-Instruct-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

prompt = """[INST] Write a Python function that:
1. Takes a list of numbers
2. Removes duplicates
3. Sorts in descending order
4. Returns top 5 elements
[/INST]"""

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=500,
    temperature=0.2,
    do_sample=True
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
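
The `[INST] ... [/INST]` wrapper comes from the Llama-2 chat format, which also supports an optional system prompt wrapped in `<<SYS>>` tags. A small hypothetical helper for building both shapes:

```python
def build_prompt(user_msg: str, system_msg: str = None) -> str:
    """Wrap a request in the Llama-2 instruction format used by
    CodeLlama-Instruct, with an optional system prompt."""
    if system_msg:
        return f"[INST] <<SYS>>\n{system_msg}\n<</SYS>>\n\n{user_msg} [/INST]"
    return f"[INST] {user_msg} [/INST]"
```

For example, `build_prompt("Write a regex for emails", "Answer with code only, no prose")` steers the model toward code-only answers.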

## Fill-in-the-Middle (FIM)

```python
# CodeLlama 7B and 13B support FIM for code insertion
# (the Python-specialized and 34B/70B checkpoints do not);
# this reuses the tokenizer/model loaded from CodeLlama-7b-hf above
prefix = """def calculate_area(shape, dimensions):
    if shape == "circle":
        radius = dimensions[0]
"""

suffix = """
    elif shape == "rectangle":
        length, width = dimensions
        return length * width
    return None
"""

# Use special tokens for FIM
prompt = f"<PRE> {prefix} <SUF>{suffix} <MID>"

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Python-Specialized Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "codellama/CodeLlama-7b-Python-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Python-specific completion
code = """
import pandas as pd
import numpy as np

def analyze_sales_data(df):
    '''Analyze sales data and return key metrics'''
"""

inputs = tokenizer(code, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=300)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## vLLM Server

```bash
python -m vllm.entrypoints.openai.api_server \
    --model codellama/CodeLlama-13b-Instruct-hf \
    --dtype float16 \
    --max-model-len 8192
```

### API Usage

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

response = client.chat.completions.create(
    model="codellama/CodeLlama-13b-Instruct-hf",
    messages=[
        {"role": "user", "content": "Write a FastAPI endpoint for user authentication"}
    ],
    temperature=0.2,
    max_tokens=1000
)

print(response.choices[0].message.content)
```
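
For interactive use you can also stream tokens as they are generated instead of waiting for the full reply. A sketch using the same OpenAI client (the helper names are ours; `stream=True` and the per-chunk `delta.content` field are part of the OpenAI-compatible API that vLLM serves):

```python
def join_deltas(deltas) -> str:
    """Assemble streamed content deltas into the full response text."""
    return "".join(d for d in deltas if d)

def stream_completion(base_url: str, model: str, prompt: str) -> str:
    """Print tokens as they arrive and return the assembled response."""
    from openai import OpenAI  # imported here so join_deltas stays dependency-free

    client = OpenAI(base_url=base_url, api_key="x")
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
        max_tokens=500,
        stream=True,
    )
    deltas = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
        deltas.append(delta)
    return join_deltas(deltas)
```

For example, `stream_completion("http://localhost:8000/v1", "codellama/CodeLlama-13b-Instruct-hf", "Write a binary search in Python")` prints code as it is generated.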

## Code Explanation

```python
code_to_explain = """
def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)
"""

prompt = f"[INST] Explain this code step by step:\n\n{code_to_explain}\n[/INST]"
```

## Bug Fixing

```python
buggy_code = """
def reverse_string(s):
    result = ""
    for i in range(len(s)):
        result += s[i]
    return result
"""

prompt = f"""[INST] Find and fix the bug in this code. The function should reverse a string:

{buggy_code}
[/INST]"""
```

## Code Translation

```python
python_code = """
def factorial(n):
    if n <= 1:
        return 1
    return n * factorial(n - 1)
"""

prompt = f"""[INST] Convert this Python code to JavaScript:

{python_code}
[/INST]"""
```
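
The explanation, bug-fixing, and translation snippets above only build prompts. To actually run them, a small hypothetical helper wraps the generate/decode steps, assuming the Instruct `tokenizer` and `model` loaded earlier:

```python
def extract_response(decoded: str) -> str:
    """Keep only the model's answer, dropping the echoed [INST] prompt."""
    return decoded.split("[/INST]")[-1].strip()

def generate_response(prompt: str, max_new_tokens: int = 500) -> str:
    """Run an [INST]-formatted prompt through the Instruct model loaded above."""
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        temperature=0.2,
        do_sample=True,
    )
    return extract_response(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Any of the prompts above can then be run as `print(generate_response(prompt))`.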

## Gradio Interface

```python
import gradio as gr
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "codellama/CodeLlama-7b-Instruct-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

def generate_code(instruction, temperature, max_tokens):
    prompt = f"[INST] {instruction} [/INST]"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    outputs = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        temperature=temperature,
        do_sample=True
    )

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response.split("[/INST]")[-1].strip()

demo = gr.Interface(
    fn=generate_code,
    inputs=[
        gr.Textbox(label="Instruction", lines=5, placeholder="Write a Python function that..."),
        gr.Slider(0.1, 1.0, value=0.2, label="Temperature"),
        gr.Slider(100, 2000, value=500, step=100, label="Max Tokens")
    ],
    outputs=gr.Code(language="python", label="Generated Code"),
    title="CodeLlama Code Generator"
)

demo.launch(server_name="0.0.0.0", server_port=7860)
```

## Batch Processing

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "codellama/CodeLlama-7b-Instruct-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

tasks = [
    "Write a function to validate email addresses",
    "Create a class for managing a shopping cart",
    "Write a function to parse JSON from a URL",
    "Create a decorator for timing function execution",
    "Write a function to generate random passwords"
]

for task in tasks:
    prompt = f"[INST] {task} [/INST]"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    outputs = model.generate(
        **inputs,
        max_new_tokens=500,
        temperature=0.2,
        do_sample=True
    )

    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"\n=== {task} ===")
    print(result.split("[/INST]")[-1].strip())
```

## Use with Continue (VSCode)

Configure Continue extension:

```json
{
  "models": [
    {
      "title": "CodeLlama",
      "provider": "ollama",
      "model": "codellama:7b-instruct"
    }
  ],
  "tabAutocompleteModel": {
    "title": "CodeLlama",
    "provider": "ollama",
    "model": "codellama:7b-code"
  }
}
```

## Performance

| Model         | GPU      | Tokens/sec |
| ------------- | -------- | ---------- |
| CodeLlama-7B  | RTX 3090 | \~90       |
| CodeLlama-7B  | RTX 4090 | \~130      |
| CodeLlama-13B | RTX 4090 | \~70       |
| CodeLlama-34B | A100     | \~50       |

## Troubleshooting

### Poor Code Quality

* Lower the temperature (0.1-0.3)
* Use the Instruct variant for task-style prompts
* Use a larger model if VRAM allows

### Incomplete Output

* Increase max\_new\_tokens
* Check context length

### Slow Generation

* Use vLLM
* Quantize model
* Use smaller variant
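
Quantization can be applied at load time via bitsandbytes. A sketch assuming `pip install bitsandbytes` is installed alongside transformers (4-bit NF4 roughly quarters VRAM use, with some quality cost):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "codellama/CodeLlama-13b-Instruct-hf"

# 4-bit NF4 quantization: 13B fits in roughly 10GB instead of ~16GB
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
```

The rest of the generation code is unchanged; only the load call differs.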

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers
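
Using the sample rates above, a quick back-of-envelope estimate (hypothetical helper; real rates fluctuate with demand):

```python
def session_cost(hourly_rate: float, hours: float) -> float:
    """Estimated rental cost in USD for a session, rounded to cents."""
    return round(hourly_rate * hours, 2)

# e.g. a 4-hour session on an RTX 4090 at ~$0.10/hr costs about $0.40,
# matching the 4-Hour Session column in the table above
```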

## Next Steps

* Open Interpreter - Execute code
* vLLM Inference - Production serving
* Mistral/Mixtral - Alternative models
