
Error Handling and Retry Patterns for Data Analysis APIs

How to handle timeouts, container expiry, malformed CSVs, and rate limits when integrating DataStoryBot — with production-ready retry patterns.

By DataStoryBot Team

DataStoryBot's API does non-trivial work. When you call /analyze, it spins up a Code Interpreter session, writes Python code, executes it against your data, and returns structured results. That takes seconds to minutes, not milliseconds. And things can go wrong at every step.

This article covers the failure modes you'll encounter when integrating DataStoryBot into production applications, and the patterns for handling them gracefully. The goal is an integration that recovers automatically from transient failures and fails clearly on permanent ones.

The Error Categories

Transient Errors (Retry)

  • 502/503 Bad Gateway / Service Unavailable — upstream capacity, usually resolves in seconds
  • 429 Too Many Requests — rate limit hit, back off and retry
  • 504 Gateway Timeout — the request took too long at the proxy layer, but the analysis may still be running
  • Container busy — the container is processing a previous request

Permanent Errors (Don't Retry)

  • 400 Bad Request — malformed request, missing parameters
  • 401 Unauthorized — invalid or expired API key
  • 413 Payload Too Large — file exceeds the 50 MB upload limit
  • 422 Unprocessable Entity — the file was received but can't be parsed as CSV

Ambiguous Errors (Retry Once)

  • 500 Internal Server Error — could be transient or permanent. Retry once; if it fails again, surface the error
  • Analysis returned no stories — the Code Interpreter ran but didn't find anything interesting. Might work with a different steering prompt; won't work with the same one
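These categories map naturally onto a small dispatch helper. A minimal sketch: the status codes are the ones listed above, and `retry_policy` is an illustrative helper, not part of any client library:

```python
# Map an HTTP status code to a retry decision, mirroring the
# categories above. Unknown codes fail loudly by default.

def retry_policy(status_code: int) -> str:
    """Return 'retry', 'retry_once', or 'fail'."""
    if status_code in (429, 502, 503, 504):
        return "retry"        # transient: back off and try again
    if status_code == 500:
        return "retry_once"   # ambiguous: one retry, then surface
    return "fail"             # 400/401/413/422 and anything unknown
```

Centralizing the decision in one function keeps the retry loop readable and makes the policy easy to unit-test.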

The Retry Pattern

Use exponential backoff with jitter for transient errors:

import time
import random
import requests
from requests.exceptions import RequestException

class DataStoryBotClient:
    def __init__(self, base_url="https://datastory.bot/api"):
        self.base_url = base_url
        self.session = requests.Session()

    def _request(self, method, path, max_retries=3, **kwargs):
        """Make an API request with retry logic."""
        last_exception = None

        for attempt in range(max_retries + 1):
            try:
                response = self.session.request(
                    method, f"{self.base_url}{path}", **kwargs
                )

                # Success
                if response.status_code < 400:
                    return response.json()

                # Permanent errors — don't retry
                if response.status_code in (400, 401, 413, 422):
                    try:
                        detail = response.json().get("error", "Unknown error")
                    except ValueError:
                        # Body wasn't JSON; fall back to raw text
                        detail = response.text[:200] or "Unknown error"
                    raise ApiError(response.status_code, detail)

                # Rate limit — use Retry-After header if present
                if response.status_code == 429:
                    try:
                        retry_after = int(
                            response.headers.get("Retry-After", 5)
                        )
                    except ValueError:
                        # Retry-After may be an HTTP date; use a default
                        retry_after = 5
                    time.sleep(retry_after)
                    continue

                # Transient errors — retry with backoff
                if response.status_code in (500, 502, 503, 504):
                    if attempt < max_retries:
                        delay = _backoff_delay(attempt)
                        time.sleep(delay)
                        continue
                    raise ApiError(
                        response.status_code,
                        f"Failed after {max_retries} retries"
                    )

                # Anything else (404 for an expired container, etc.)
                # is not retryable: raise so callers can inspect the code
                raise ApiError(response.status_code, response.text[:200])

            except RequestException as e:
                last_exception = e
                if attempt < max_retries:
                    delay = _backoff_delay(attempt)
                    time.sleep(delay)
                    continue
                raise ConnectionError(
                    f"Connection failed after {max_retries} retries: {e}"
                )

        raise last_exception or ApiError(0, "Unknown failure")


def _backoff_delay(attempt):
    """Exponential backoff with jitter."""
    base = 2 ** attempt  # 1, 2, 4, 8...
    jitter = random.uniform(0, base * 0.5)
    return base + jitter


class ApiError(Exception):
    def __init__(self, status_code, message):
        self.status_code = status_code
        self.message = message
        super().__init__(f"HTTP {status_code}: {message}")

Handling Each Endpoint

Upload Errors

The /upload endpoint is the simplest — it's a file upload. Failures are usually permanent:

def upload(self, csv_path):
    """Upload CSV with error handling."""
    import os

    # Pre-validate
    if not os.path.exists(csv_path):
        raise FileNotFoundError(f"CSV not found: {csv_path}")

    file_size = os.path.getsize(csv_path)
    if file_size > 50 * 1024 * 1024:  # 50 MB
        raise ValueError(
            f"File too large ({file_size / 1024 / 1024:.1f} MB). "
            "Maximum is 50 MB. Pre-aggregate or chunk the data."
        )

    if file_size == 0:
        raise ValueError("CSV file is empty")

    with open(csv_path, "rb") as f:
        result = self._request("POST", "/upload", files={"file": f})

    return result["containerId"]

Pre-validate file size and existence before hitting the API. A 50 MB upload that gets rejected wastes time and bandwidth.
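If a file trips the 50 MB check, one option mentioned above is to pre-aggregate before uploading. A hedged sketch using only the standard library; `pre_aggregate` and its column parameters are hypothetical helpers, not part of the DataStoryBot API:

```python
import csv
from collections import defaultdict


def pre_aggregate(in_path, out_path, key_col, value_col):
    """Collapse row-level data into per-key sums to shrink the CSV."""
    totals = defaultdict(float)
    with open(in_path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row[key_col]] += float(row[value_col])

    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([key_col, value_col])
        for key in sorted(totals):
            writer.writerow([key, totals[key]])
```

Aggregation loses row-level detail, so only do this when the analysis you want is at the aggregate grain anyway.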

Analyze Errors

The /analyze endpoint is where most complexity lives. Analysis takes 10-120 seconds, and Code Interpreter can fail in ways the API can't always predict:

def analyze(self, container_id, steering=None, timeout=300):
    """Run analysis with timeout and empty-result handling."""
    payload = {"containerId": container_id}
    if steering:
        payload["steeringPrompt"] = steering

    result = self._request(
        "POST", "/analyze",
        json=payload,
        timeout=timeout,
        max_retries=1  # Analysis is expensive — retry once at most
    )

    stories = result if isinstance(result, list) else result.get("stories", [])

    if not stories:
        raise AnalysisEmpty(
            "Analysis returned no stories. Try a different steering "
            "prompt or check that the CSV contains analyzable data."
        )

    return stories


class AnalysisEmpty(Exception):
    pass

Key points:

  • Longer timeout. Set timeout=300 (5 minutes) for the analysis request. A typical 30-second client timeout will kill legitimate analyses.
  • Limited retries. Analysis is computationally expensive. Retry once for 500/502/503 errors, but don't retry aggressively.
  • Empty result handling. An analysis that returns zero stories isn't an HTTP error — it's a semantic one. Handle it separately.
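Because an empty result might succeed with a different steering prompt but won't with the same one, one recovery pattern is a fallback prompt list. A sketch: `analyze_with_fallbacks` and the prompts are illustrative, not part of the API:

```python
class AnalysisEmpty(Exception):
    pass


def analyze_with_fallbacks(run_analysis, steering, fallbacks):
    """Try the original steering prompt, then each fallback in order.

    run_analysis: callable(prompt) -> list of stories, raising
    AnalysisEmpty when nothing interesting was found.
    """
    for prompt in [steering, *fallbacks]:
        try:
            return run_analysis(prompt)
        except AnalysisEmpty:
            continue
    raise AnalysisEmpty(
        f"No stories after {1 + len(fallbacks)} steering prompts"
    )
```

In practice run_analysis would be a closure over client.analyze(container_id, prompt).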

Refine Errors

The /refine endpoint depends on the container still being alive:

def refine(self, container_id, story_title):
    """Refine a story with container expiry handling."""
    try:
        return self._request("POST", "/refine", json={
            "containerId": container_id,
            "selectedStoryTitle": story_title
        }, timeout=300)
    except ApiError as e:
        if e.status_code == 404:
            raise ContainerExpired(
                "Container has expired. Re-upload the CSV and "
                "re-run analysis to get a new container."
            )
        raise


class ContainerExpired(Exception):
    pass

If the container expired between /analyze and /refine, you get a 404. The only recovery is to re-upload and re-analyze. Design your application flow to handle this — either by keeping the user engaged (so they refine within the 20-minute window) or by storing the original CSV path so you can re-upload automatically.
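If you store the CSV path for automatic re-upload, it also helps to track the container's age so you can re-upload before /refine fails rather than after. A sketch: `ContainerHandle` is a hypothetical wrapper, and the 20-minute lifetime comes from the window described above:

```python
import time

CONTAINER_TTL = 20 * 60  # seconds, per the 20-minute window


class ContainerHandle:
    """A container ID paired with its creation time."""

    def __init__(self, container_id):
        self.container_id = container_id
        self.created_at = time.monotonic()

    def likely_expired(self, margin=30):
        """True when within `margin` seconds of the expiry window."""
        return time.monotonic() - self.created_at > CONTAINER_TTL - margin
```

Checking likely_expired() before calling /refine turns a reactive 404 into a proactive re-upload.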

File Retrieval Errors

Chart and dataset downloads from /files/{containerId}/{fileId}:

def download_file(self, container_id, file_id, output_path):
    """Download a file with container expiry handling."""
    try:
        response = self.session.get(
            f"{self.base_url}/files/{container_id}/{file_id}",
            stream=True,
            timeout=60  # a stalled download should fail, not hang
        )
        response.raise_for_status()

        with open(output_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)

        return output_path
    except requests.HTTPError as e:
        if e.response.status_code == 404:
            raise ContainerExpired("Container expired — files no longer available")
        raise

The Complete Resilient Client

class ResilientDataStoryBot:
    """Production-ready DataStoryBot client with full error handling."""

    def __init__(self, base_url="https://datastory.bot/api"):
        self.client = DataStoryBotClient(base_url)

    def analyze_csv(self, csv_path, steering=None):
        """Full pipeline with error recovery."""
        # Upload (retries transient errors)
        container_id = self.client.upload(csv_path)

        # Analyze (retries once, raises AnalysisEmpty if no stories)
        stories = self.client.analyze(container_id, steering)

        # Refine the top story (handles container expiry)
        try:
            report = self.client.refine(
                container_id, stories[0]["title"]
            )
        except ContainerExpired:
            # Re-upload and re-analyze
            container_id = self.client.upload(csv_path)
            stories = self.client.analyze(container_id, steering)
            report = self.client.refine(
                container_id, stories[0]["title"]
            )

        # Download charts
        charts = []
        for i, chart in enumerate(report.get("charts", [])):
            path = f"/tmp/chart_{i+1}.png"
            try:
                self.client.download_file(
                    container_id, chart["fileId"], path
                )
                charts.append({"path": path, "caption": chart["caption"]})
            except ContainerExpired:
                pass  # Charts lost — report narrative is still usable

        return {
            "narrative": report["narrative"],
            "charts": charts,
            "stories": stories
        }

Rate Limiting

DataStoryBot uses per-key rate limits. If you hit 429 responses:

  1. Respect Retry-After headers. The response tells you how long to wait.
  2. Queue requests. Don't fire 50 analysis requests simultaneously. Use a semaphore or rate limiter.
  3. Cache results. If the same CSV might be analyzed multiple times, cache the analysis results to avoid redundant API calls.
A simple way to cap concurrency is an asyncio semaphore:

import asyncio

class RateLimitedClient:
    def __init__(self, max_concurrent=3):
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.client = ResilientDataStoryBot()

    async def analyze(self, csv_path, steering=None):
        async with self.semaphore:
            # Run the blocking client in a worker thread so the
            # event loop stays responsive (Python 3.9+)
            return await asyncio.to_thread(
                self.client.analyze_csv, csv_path, steering
            )
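For point 3 above, a cache keyed by the file's content hash avoids re-analyzing byte-identical CSVs. A minimal in-memory sketch: `AnalysisCache` is a hypothetical helper, and production code would likely back it with Redis or disk instead of a dict:

```python
import hashlib


class AnalysisCache:
    """Cache analysis results keyed by the CSV's SHA-256 hash."""

    def __init__(self):
        self._cache = {}

    def key_for(self, csv_path):
        digest = hashlib.sha256()
        with open(csv_path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def get_or_run(self, csv_path, run):
        """Return a cached result, or call run() once and cache it."""
        key = self.key_for(csv_path)
        if key not in self._cache:
            self._cache[key] = run()
        return self._cache[key]
```

Hashing content rather than the path means a re-exported but identical file still hits the cache.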

Logging for Debugging

Log every API interaction in production. When something goes wrong at 3 AM, logs are all you have:

import logging

logger = logging.getLogger("datastorybot")

# In _request method:
logger.info(f"API {method} {path} (attempt {attempt + 1}/{max_retries + 1})")
logger.info(f"API response: {response.status_code} ({response.elapsed.total_seconds():.1f}s)")
if response.status_code >= 400:
    logger.error(f"API error: {response.status_code} {response.text[:500]}")

Log the container ID with every request — it's the correlation key for debugging an entire analysis session.
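A logging.LoggerAdapter can bind the container ID once so every log line carries it automatically. A sketch; the container_id value here is illustrative:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")


class ContainerLogger(logging.LoggerAdapter):
    """Prefix every message with the session's container ID."""

    def process(self, msg, kwargs):
        return f"[container={self.extra['container_id']}] {msg}", kwargs


log = ContainerLogger(
    logging.getLogger("datastorybot"), {"container_id": "abc123"}
)
log.info("analysis started")  # logs: [container=abc123] analysis started
```

Create one adapter per upload and pass it down, so the correlation key is impossible to forget.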

What to Read Next

For the API endpoints and parameters that these patterns wrap, see the DataStoryBot API reference.

For async and webhook patterns when the synchronous approach isn't enough, read webhooks and async patterns for long-running analysis.

For the foundational API tutorial, start with getting started with the DataStoryBot API.

Ready to find your data story?

Upload a CSV and DataStoryBot will uncover the narrative in seconds.

Try DataStoryBot →