
Building an AI Data Analyst with the DataStoryBot API

End-to-end tutorial: build a Python app that accepts CSV uploads and returns AI-generated data stories using the DataStoryBot API. FastAPI example included.

By DataStoryBot Team


Most "AI data analysis" tutorials end at a prompt template. You get a script that sends a CSV to an LLM, gets back some text, and calls it a day. The result is fragile, the output is unstructured, and the charts are nonexistent.

This tutorial is different. We are building a small but real application — a FastAPI service that accepts CSV uploads from users, forwards them to DataStoryBot for autonomous analysis, and returns structured narratives with charts. By the end, you will have a working AI data analyst you can deploy, integrate into other tools, or extend with your own logic.

What We Are Building

The application does three things:

  1. Accepts a CSV upload from any HTTP client
  2. Sends it through DataStoryBot's three-step pipeline (upload, analyze, refine)
  3. Returns a JSON response containing story angles, a written narrative, and chart URLs

The user uploads a file. They get back insights. No pandas on your server. No matplotlib. No prompt engineering. DataStoryBot's ephemeral Code Interpreter container handles the analysis — your app is just the interface.

Prerequisites

You need Python 3.9+ and two packages:

pip install fastapi uvicorn requests python-multipart

No API key is required during the current open beta. All requests go to https://datastory.bot. Check the playground for the latest auth requirements.

The DataStoryBot Pipeline

If you have not used the API before, here is the flow. Three endpoints, called in sequence:

Endpoint       Method             Purpose
/api/upload    POST (multipart)   Upload CSV, get containerId and fileId
/api/analyze   POST (JSON)        Discover 3 story angles in the data
/api/refine    POST (JSON)        Generate full narrative + charts for one story
The container is an ephemeral OpenAI Code Interpreter environment running GPT-4o. It lives for 20 minutes, then everything — data, charts, state — is deleted. For a deeper walkthrough of each endpoint, read the getting started guide.

Step 1: The Core Analysis Function

Before building the web layer, let us write the function that does the actual work. This wraps DataStoryBot's three API calls into a single function:

import requests
from typing import Optional

BASE_URL = "https://datastory.bot"

def analyze_csv(
    file_bytes: bytes,
    filename: str,
    steering_prompt: Optional[str] = None,
    story_index: int = 0,
    refinement_prompt: Optional[str] = None,
) -> dict:
    """Upload a CSV to DataStoryBot and return the full analysis."""

    # Step 1: Upload
    upload_resp = requests.post(
        f"{BASE_URL}/api/upload",
        files={"file": (filename, file_bytes, "text/csv")},
    )
    upload_resp.raise_for_status()
    upload_data = upload_resp.json()
    container_id = upload_data["containerId"]

    # Step 2: Analyze — discover story angles
    analyze_payload = {"containerId": container_id}
    if steering_prompt:
        analyze_payload["steeringPrompt"] = steering_prompt

    analyze_resp = requests.post(
        f"{BASE_URL}/api/analyze",
        json=analyze_payload,
    )
    analyze_resp.raise_for_status()
    stories = analyze_resp.json()

    # Step 3: Refine — generate the full narrative
    selected_title = stories[story_index]["title"]
    refine_payload = {
        "containerId": container_id,
        "selectedStoryTitle": selected_title,
    }
    if refinement_prompt:
        refine_payload["refinementPrompt"] = refinement_prompt

    refine_resp = requests.post(
        f"{BASE_URL}/api/refine",
        json=refine_payload,
    )
    refine_resp.raise_for_status()
    refined = refine_resp.json()

    # Build chart URLs so the caller can fetch them
    chart_urls = [
        f"{BASE_URL}/api/files/{container_id}/{chart['fileId']}"
        for chart in refined.get("charts", [])
    ]

    return {
        "metadata": upload_data["metadata"],
        "stories": stories,
        "selectedStory": selected_title,
        "narrative": refined["narrative"],
        "chartUrls": chart_urls,
        "containerId": container_id,
    }

This function handles the entire pipeline. The caller provides bytes and a filename; they get back structured results. The steering_prompt and refinement_prompt parameters are optional — they let callers guide the analysis without writing code.

Step 2: The FastAPI Application

Now wrap that function in a web server:

import requests
from fastapi import FastAPI, UploadFile, HTTPException, Query
from fastapi.responses import JSONResponse
from typing import Optional

app = FastAPI(
    title="AI Data Analyst",
    description="Upload a CSV, get AI-generated data stories",
)

@app.post("/analyze")
async def analyze_endpoint(
    file: UploadFile,
    steering_prompt: Optional[str] = Query(None),
    story_index: int = Query(0, ge=0, le=2),
    refinement_prompt: Optional[str] = Query(None),
):
    # Validate file type (case-insensitive; the filename may be missing entirely)
    if not file.filename or not file.filename.lower().endswith(".csv"):
        raise HTTPException(400, "Only CSV files are supported")

    file_bytes = await file.read()
    if len(file_bytes) == 0:
        raise HTTPException(400, "File is empty")
    if len(file_bytes) > 50 * 1024 * 1024:
        raise HTTPException(400, "File exceeds 50 MB limit")

    try:
        result = analyze_csv(
            file_bytes=file_bytes,
            filename=file.filename,
            steering_prompt=steering_prompt,
            story_index=story_index,
            refinement_prompt=refinement_prompt,
        )
        return JSONResponse(content=result)
    except requests.exceptions.HTTPError as e:
        raise HTTPException(502, f"DataStoryBot API error: {e.response.status_code}")
    except (KeyError, IndexError) as e:
        raise HTTPException(502, f"Unexpected API response: {str(e)}")

Run it:

uvicorn app:app --host 0.0.0.0 --port 8000

Step 3: Testing the Application

Use curl to test the full pipeline:

curl -X POST "http://localhost:8000/analyze?steering_prompt=Focus%20on%20revenue%20trends" \
  -F "file=@sales_data.csv"

Or with Python:

import requests

with open("sales_data.csv", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/analyze",
        files={"file": ("sales_data.csv", f, "text/csv")},
        params={"steering_prompt": "Focus on customer retention"},
    )

result = resp.json()
print(f"Found {len(result['stories'])} story angles")
print(f"\nNarrative:\n{result['narrative']}")
print(f"\nCharts: {result['chartUrls']}")

The response is structured JSON. The narrative is Markdown. The chart URLs point to PNG images in the ephemeral container — download them within 20 minutes, or they are gone.
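Because the narrative is plain Markdown, persisting it is a one-liner. A minimal sketch, assuming the result dict has the shape returned by analyze_csv above (the report.md path is an arbitrary choice):

```python
from pathlib import Path

def save_narrative(result: dict, path: str = "report.md") -> str:
    """Write the Markdown narrative to disk before the container expires."""
    Path(path).write_text(result["narrative"], encoding="utf-8")
    return path
```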

Adding Steering Prompts

The basic endpoint above lets callers pass a steering_prompt query parameter. This is more useful than it looks. Different data types benefit from different analysis angles, and your application knows things about the data that DataStoryBot does not.

A simple pattern: maintain a dictionary of domain-specific prompts and select one based on the file name or detected columns:

from typing import Optional

STEERING_PROMPTS = {
    "sales": "Focus on revenue trends, regional differences, and seasonality",
    "users": "Focus on retention, churn indicators, and engagement patterns",
    "support": "Focus on ticket volume trends, resolution times, and escalation rates",
    "financial": "Focus on margin analysis, cost drivers, and period-over-period changes",
}

def pick_prompt(columns: list[str]) -> Optional[str]:
    col_text = " ".join(columns).lower()
    if "revenue" in col_text or "sales" in col_text:
        return STEERING_PROMPTS["sales"]
    if "churn" in col_text or "retention" in col_text:
        return STEERING_PROMPTS["users"]
    if "ticket" in col_text or "resolution" in col_text:
        return STEERING_PROMPTS["support"]
    if "margin" in col_text or "cost" in col_text:
        return STEERING_PROMPTS["financial"]
    return None

Call this after the upload step, using the columns from the metadata response, and pass the result as the steering prompt to the analyze endpoint. The stories DataStoryBot returns will be more relevant because the analysis is guided toward the domain that matters.

You can also let users override this by passing their own steering prompt through your API. Give the system a default, give the user an escape hatch.

Downloading and Caching Charts

The chart URLs returned by the refine step point to files inside an ephemeral container that expires in 20 minutes. If your application displays these charts later (in a dashboard, in an email, in a saved report), you need to download them immediately:

import requests
from pathlib import Path

def download_charts(container_id: str, charts: list[dict], output_dir: str = "./charts") -> list[str]:
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    saved = []
    for chart in charts:
        resp = requests.get(
            f"{BASE_URL}/api/files/{container_id}/{chart['fileId']}"
        )
        resp.raise_for_status()  # the container may already have expired
        filename = f"{output_dir}/{chart['fileId']}.png"
        with open(filename, "wb") as f:
            f.write(resp.content)
        saved.append(filename)
    return saved

Call this right after the refine step completes. Once the charts are on your storage, they are decoupled from the container lifecycle. This is especially important if you are building a feature where users can revisit past analyses.

Production Considerations

The basic example above handles HTTP errors. In production, you need to handle three additional failure modes:

Container expiry. The 20-minute container lifetime means your endpoint can fail if the user waits too long between steps. Return a 410 and prompt re-upload when you get a 404 from DataStoryBot.
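One way to centralize that translation is a small status mapper. This is a sketch; the detail strings are our own, not part of the DataStoryBot API:

```python
def map_upstream_status(status: int) -> tuple:
    """Translate a DataStoryBot HTTP error into a client-facing (code, detail) pair."""
    if status == 404:
        # The container most likely expired; ask the client to start over.
        return 410, "Analysis session expired; please re-upload the file"
    return 502, f"DataStoryBot API error: {status}"
```

In the FastAPI handler, the HTTPError branch becomes raise HTTPException(*map_upstream_status(e.response.status_code)).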

Large file timeouts. DataStoryBot's analysis step runs real code in a container. Large files take longer. Set timeout=120 on your requests to the analyze and refine endpoints.

Malformed CSVs. Not every file that ends in .csv is valid. DataStoryBot's upload endpoint will return an error for files it cannot parse. Surface that error clearly rather than returning a generic 502.
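A cheap local sanity check can catch obviously broken files before you spend an upload on them. Here is a sketch using the standard library's csv.Sniffer; it is a heuristic, so treat a pass as "probably CSV", not proof:

```python
import csv

def looks_like_csv(file_bytes: bytes, sample_size: int = 4096) -> bool:
    """Heuristic pre-check: decode a sample of the file and sniff for a CSV dialect."""
    try:
        sample = file_bytes[:sample_size].decode("utf-8")
    except UnicodeDecodeError:
        return False  # binary or wrongly encoded file
    try:
        csv.Sniffer().sniff(sample)
        return True
    except csv.Error:
        return False  # no consistent delimiter found
```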

Extending the Application

Split into two endpoints. Instead of running all three steps in one call, expose /stories (upload + analyze) and /refine (refine only) separately. This lets users see all three story angles and pick the one that matters before generating the full report. Pass the containerId between calls.
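If you split the flow, you need somewhere to keep per-container state between the two calls. A minimal in-memory sketch (a module-level dict is fine for a single worker; anything multi-process would need Redis or similar, and the TTL mirrors the 20-minute container lifetime):

```python
import time
from typing import Optional

SESSIONS: dict = {}  # containerId -> {"stories": [...], "created": timestamp}

def save_session(container_id: str, stories: list) -> None:
    SESSIONS[container_id] = {"stories": stories, "created": time.time()}

def get_session(container_id: str, ttl: float = 20 * 60) -> Optional[dict]:
    """Return the stored session, or None if it is missing or past the container TTL."""
    session = SESSIONS.get(container_id)
    if session is None or time.time() - session["created"] > ttl:
        SESSIONS.pop(container_id, None)  # evict stale entries lazily
        return None
    return session
```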

Add a steering prompt library. Maintain a dictionary of domain-specific prompts. When users upload financial data, automatically steer toward revenue trends. When they upload behavior data, steer toward retention patterns.

Download and cache charts. Instead of passing through DataStoryBot's file URLs, download the charts to your own storage immediately after refinement. This decouples your application from the 20-minute container lifetime.

Batch processing. Each file gets its own container, so multiple analyses can run in parallel. Use ThreadPoolExecutor or a job queue to process incoming CSVs concurrently.
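A sketch of the concurrent pattern. The analyze callable is injected so you can pass the analyze_csv function from Step 1 in production, or a stub in tests:

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_batch(files, analyze, max_workers: int = 4) -> list:
    """Run one analysis per (filename, bytes) pair concurrently.

    Each upload gets its own DataStoryBot container, so the calls are
    independent and safe to parallelize. Results come back in input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(analyze, data, name) for name, data in files]
        return [f.result() for f in futures]
```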

What Happens Inside the Container

When your application calls the analyze endpoint, DataStoryBot does not just summarize the CSV. Inside the ephemeral Code Interpreter container, GPT-4o reads the file with pandas, inspects column types and distributions, generates hypotheses, writes and executes Python code to test each hypothesis, ranks findings by statistical significance and narrative interest, and returns three story angles backed by computed evidence.

The charts are real matplotlib renders from actual computed values. The statistics are calculated, not hallucinated. For more on how the AI approaches data analysis, read how to use AI to analyze your data.

Next Steps

Try the full API flow interactively in the DataStoryBot playground before building on top of it.

For the complete API endpoint reference — request schemas, response formats, and error codes — see the API getting started guide.

If you are building a frontend for this, the React integration guide covers the client-side component architecture for upload, story selection, and result display.

Ready to find your data story?

Upload a CSV and DataStoryBot will uncover the narrative in seconds.

Try DataStoryBot →