#!/usr/bin/env python3
"""
GPU-accelerated ETL Pipeline using RAPIDS on Clore.ai.
Usage:
    python gpu_etl.py --api-key YOUR_API_KEY --input data.csv --output result.parquet \
        --filter "amount > 100" --groupby customer_id --agg "amount:sum,count:count"
"""
import os
import time
import json
import argparse
import secrets

import requests
import paramiko
from scp import SCPClient
from typing import List, Dict, Optional
from dataclasses import dataclass


@dataclass
class ETLResult:
    operation: str
    input_rows: int
    output_rows: int
    time_seconds: float
    success: bool
    error: Optional[str] = None


class CloreRAPIDSETL:
    """Complete GPU ETL solution using RAPIDS on Clore.ai."""

    BASE_URL = "https://api.clore.ai"
    RAPIDS_IMAGE = "rapidsai/rapidsai-core:24.02-cuda12.0-runtime-ubuntu22.04-py3.10"

    def __init__(self, api_key: str):
        self.api_key = api_key
        self.headers = {"auth": api_key}
        self.server = None
        self.order_id = None
        self.ssh_host = None
        self.ssh_port = None
        self.ssh_password = None
        self.hourly_cost = 0.0
        self._ssh = None
        self._scp = None

    def _api(self, method: str, endpoint: str, **kwargs) -> Dict:
        """Call the Clore API, backing off and retrying on retryable errors."""
        url = f"{self.BASE_URL}{endpoint}"
        for attempt in range(3):
            response = requests.request(method, url, headers=self.headers, **kwargs)
            data = response.json()
            if data.get("code") == 5:  # retryable error code: exponential backoff
                time.sleep(2 ** attempt)
                continue
            if data.get("code") != 0:
                raise Exception(f"API Error: {data}")
            return data
        raise Exception("Max retries exceeded")

    def find_gpu(self, min_vram: int = 16, max_price: float = 0.60) -> Optional[Dict]:
        servers = self._api("GET", "/v1/marketplace")["servers"]
        # GPUs with sufficient VRAM for RAPIDS
        good_gpus = {"A100": 80, "RTX 4090": 24, "RTX 3090": 24, "A6000": 48, "A5000": 24}
        candidates = []
        for s in servers:
            if s.get("rented"):
                continue
            gpus = s.get("gpu_array", [])
            match = None
            for gpu in gpus:
                for g, vram in good_gpus.items():
                    if g in gpu and vram >= min_vram:
                        match = (g, vram)
                        break
            if not match:
                continue
            price = s.get("price", {}).get("usd", {}).get("spot")
            if price and price <= max_price:
                candidates.append({
                    "id": s["id"], "gpu": match[0], "vram": match[1],
                    "price": price, "reliability": s.get("reliability", 0)
                })
        if not candidates:
            return None
        return max(candidates, key=lambda x: (x["vram"] / x["price"], x["reliability"]))
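
    # Worked example of the ranking key: a 24 GB RTX 4090 at $0.30/hr scores
    # 24 / 0.30 = 80 GB per dollar-hour, beating an 80 GB A100 at $1.20/hr
    # (80 / 1.20 ≈ 66.7); reliability only breaks ties between equal scores.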

    def setup(self, min_vram: int = 16, max_price: float = 0.60):
        print("🔍 Finding RAPIDS GPU...")
        gpu = self.find_gpu(min_vram, max_price)
        if not gpu:
            raise Exception(f"No GPU with {min_vram}GB+ under ${max_price}/hr")
        print(f"   {gpu['gpu']} ({gpu['vram']}GB) @ ${gpu['price']:.2f}/hr")
        self.ssh_password = secrets.token_urlsafe(16)
        self.hourly_cost = gpu["price"]
        print("🚀 Provisioning server...")
        order_data = {
            "renting_server": gpu["id"],
            "type": "spot",
            "currency": "CLORE-Blockchain",
            "image": self.RAPIDS_IMAGE,
            "ports": {"22": "tcp"},
            "env": {"NVIDIA_VISIBLE_DEVICES": "all"},
            "ssh_password": self.ssh_password,
            "spotprice": gpu["price"] * 1.15  # bid 15% above the current spot price
        }
        result = self._api("POST", "/v1/create_order", json=order_data)
        self.order_id = result["order_id"]
        print("⏳ Waiting for server...")
        for _ in range(120):
            orders = self._api("GET", "/v1/my_orders")["orders"]
            order = next((o for o in orders if o["order_id"] == self.order_id), None)
            if order and order.get("status") == "running":
                # Assumed connection string format: "ssh root@<host> -p <port>"
                conn = order["connection"]["ssh"]
                parts = conn.split()
                self.ssh_host = parts[1].split("@")[1]
                self.ssh_port = int(parts[-1]) if "-p" in conn else 22
                break
            time.sleep(2)
        else:
            raise Exception("Timed out waiting for the server to start")
        # Connect SSH
        self._ssh = paramiko.SSHClient()
        self._ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        self._ssh.connect(self.ssh_host, port=self.ssh_port,
                          username="root", password=self.ssh_password, timeout=30)
        self._scp = SCPClient(self._ssh.get_transport())
        print(f"✅ Server ready: {self.ssh_host}:{self.ssh_port}")
        # Verify RAPIDS
        stdin, stdout, stderr = self._ssh.exec_command(
            'python3 -c "import cudf; print(cudf.__version__)"'
        )
        version = stdout.read().decode().strip()
        print(f"   cuDF version: {version}")

    def _exec(self, cmd: str, timeout: int = 3600) -> str:
        stdin, stdout, stderr = self._ssh.exec_command(cmd, timeout=timeout)
        stdout.channel.recv_exit_status()  # block until the remote command finishes
        return stdout.read().decode()

    def upload(self, local: str) -> str:
        remote = f"/tmp/data/{os.path.basename(local)}"
        self._exec("mkdir -p /tmp/data /tmp/output")
        print(f"📤 Uploading {os.path.basename(local)}...")
        self._scp.put(local, remote)
        return remote

    def download(self, remote: str, local: str):
        print(f"📥 Downloading to {local}...")
        os.makedirs(os.path.dirname(local) or ".", exist_ok=True)
        self._scp.get(remote, local)

    def run_etl(self, script: str) -> ETLResult:
        """Execute a RAPIDS ETL script on the remote GPU and parse its result."""
        wrapped = f'''
import cudf
import time
import json
import rmm

# Managed (unified) memory lets cuDF spill to host RAM when data exceeds VRAM
rmm.reinitialize(managed_memory=True)

start = time.time()
result = {{"success": True, "error": None, "input_rows": 0, "output_rows": 0}}
try:
{self._indent(script)}
except Exception as e:
    result["success"] = False
    result["error"] = str(e)
result["time"] = time.time() - start
print("RESULT:" + json.dumps(result))
'''
        self._exec(f"cat > /tmp/etl.py << 'EOF'\n{wrapped}\nEOF")
        out = self._exec("python3 /tmp/etl.py")
        for line in out.split("\n"):
            if line.startswith("RESULT:"):
                data = json.loads(line[7:])
                return ETLResult(
                    operation="custom",
                    input_rows=data.get("input_rows", 0),
                    output_rows=data.get("output_rows", 0),
                    time_seconds=data.get("time", 0),
                    success=data.get("success", False),
                    error=data.get("error")
                )
        return ETLResult("custom", 0, 0, 0, False, "Parse error: no RESULT line in output")

    def _indent(self, s: str, n: int = 4) -> str:
        return "\n".join(" " * n + line for line in s.split("\n"))

    def filter_data(self, input_path: str, output_path: str, condition: str) -> ETLResult:
        script = f'''
df = cudf.read_parquet("{input_path}") if "{input_path}".endswith(".parquet") else cudf.read_csv("{input_path}")
result["input_rows"] = len(df)
df = df.query("{condition}")
result["output_rows"] = len(df)
df.to_parquet("{output_path}") if "{output_path}".endswith(".parquet") else df.to_csv("{output_path}", index=False)
'''
        print(f"🔍 Filter: {condition}")
        return self.run_etl(script)
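
    # For --filter "amount > 100" the generated snippet reduces to
    # df = df.query("amount > 100"), evaluated on the GPU by cuDF.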

    def aggregate_data(self, input_path: str, output_path: str,
                       group_by: List[str], aggs: Dict[str, str]) -> ETLResult:
        agg_str = ", ".join(f'"{k}": "{v}"' for k, v in aggs.items())
        group_str = ", ".join(f'"{g}"' for g in group_by)
        script = f'''
df = cudf.read_parquet("{input_path}") if "{input_path}".endswith(".parquet") else cudf.read_csv("{input_path}")
result["input_rows"] = len(df)
df = df.groupby([{group_str}]).agg({{{agg_str}}}).reset_index()
result["output_rows"] = len(df)
df.to_parquet("{output_path}") if "{output_path}".endswith(".parquet") else df.to_csv("{output_path}", index=False)
'''
        print(f"📊 Aggregate by {group_by}")
        return self.run_etl(script)

    def join_data(self, left: str, right: str, output: str, on: str, how: str = "inner") -> ETLResult:
        script = f'''
left = cudf.read_parquet("{left}") if "{left}".endswith(".parquet") else cudf.read_csv("{left}")
right = cudf.read_parquet("{right}") if "{right}".endswith(".parquet") else cudf.read_csv("{right}")
result["input_rows"] = len(left) + len(right)
df = left.merge(right, on="{on}", how="{how}")
result["output_rows"] = len(df)
df.to_parquet("{output}") if "{output}".endswith(".parquet") else df.to_csv("{output}", index=False)
'''
        print(f"🔗 Join on {on} ({how})")
        return self.run_etl(script)

    def cleanup(self):
        if self._scp:
            self._scp.close()
        if self._ssh:
            self._ssh.close()
        if self.order_id:
            print("🧹 Releasing server...")
            self._api("POST", "/v1/cancel_order", json={"id": self.order_id})

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.cleanup()
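
# Because the class is a context manager, cleanup() runs even when a pipeline
# step raises, so the rented server is released (and billing stops) on failure too.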


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--api-key", required=True)
    parser.add_argument("--input", "-i", required=True, help="Input file (CSV/Parquet)")
    parser.add_argument("--output", "-o", required=True, help="Output file")
    parser.add_argument("--filter", help="Filter condition (e.g., 'amount > 100')")
    parser.add_argument("--groupby", help="Group by columns (comma-separated)")
    parser.add_argument("--agg", help="Aggregations (col:func,col:func)")
    parser.add_argument("--join", help="Right table for join")
    parser.add_argument("--join-on", help="Join column")
    parser.add_argument("--join-how", default="inner", help="Join type")
    parser.add_argument("--min-vram", type=int, default=16)
    parser.add_argument("--max-price", type=float, default=0.60)
    args = parser.parse_args()

    with CloreRAPIDSETL(args.api_key) as etl:
        etl.setup(args.min_vram, args.max_price)
        # Upload input
        remote_input = etl.upload(args.input)
        remote_output = f"/tmp/output/{os.path.basename(args.output)}"
        etl._exec("mkdir -p /tmp/output")
        results = []
        current_input = remote_input
        step = 0
        # Filter
        if args.filter:
            step += 1
            step_output = f"/tmp/output/step{step}.parquet"
            result = etl.filter_data(current_input, step_output, args.filter)
            results.append(result)
            if result.success:
                print(f"   ✅ {result.input_rows:,} → {result.output_rows:,} rows ({result.time_seconds:.2f}s)")
                current_input = step_output
        # Join
        if args.join and args.join_on:
            remote_right = etl.upload(args.join)
            step += 1
            step_output = f"/tmp/output/step{step}.parquet"
            result = etl.join_data(current_input, remote_right, step_output, args.join_on, args.join_how)
            results.append(result)
            if result.success:
                print(f"   ✅ {result.input_rows:,} → {result.output_rows:,} rows ({result.time_seconds:.2f}s)")
                current_input = step_output
        # Aggregate
        if args.groupby and args.agg:
            group_cols = [c.strip() for c in args.groupby.split(",")]
            agg_dict = {}
            for pair in args.agg.split(","):
                col, func = pair.split(":")
                agg_dict[col.strip()] = func.strip()
            step += 1
            step_output = f"/tmp/output/step{step}.parquet"
            result = etl.aggregate_data(current_input, step_output, group_cols, agg_dict)
            results.append(result)
            if result.success:
                print(f"   ✅ {result.input_rows:,} → {result.output_rows:,} rows ({result.time_seconds:.2f}s)")
                current_input = step_output
        # Copy to final output
        if current_input != remote_input:
            if args.output.endswith(".parquet"):
                etl._exec(f"cp {current_input} {remote_output}")
            else:
                etl._exec(f"python3 -c \"import cudf; cudf.read_parquet('{current_input}').to_csv('{remote_output}', index=False)\"")
        else:
            remote_output = current_input

        # Download output
        etl.download(remote_output, args.output)

        # Summary
        print("\n" + "=" * 60)
        print("📊 ETL SUMMARY")
        total_time = sum(r.time_seconds for r in results)
        cost = (total_time / 3600) * etl.hourly_cost
        print(f"   Operations: {len(results)}")
        print(f"   Total time: {total_time:.2f}s")
        print(f"   Cost: ${cost:.4f}")
        print(f"   Output: {args.output}")


if __name__ == "__main__":
    main()
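
# Example pipeline run (illustrative file and column names):
#   python gpu_etl.py --api-key YOUR_API_KEY \
#       --input transactions.csv --output summary.parquet \
#       --filter "amount > 100" \
#       --join customers.csv --join-on customer_id --join-how left \
#       --groupby customer_id --agg "amount:sum"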