The Data Story Workflow: Discover, Select, Refine, Publish
Map DataStoryBot's 4-step UI flow to the professional data storytelling process — and replicate it programmatically via the API.
Professional data storytelling follows a recognizable pattern. Analysts upload data, explore it to find angles worth telling, choose the most relevant story for their audience, refine the framing, and publish or hand off the finished output. This loop exists whether you're an analyst using Tableau, a data scientist working in notebooks, or a developer building a reporting pipeline.
DataStoryBot structures this same loop into four explicit steps: Discover, Select, Refine, and Publish. Each step in the UI corresponds directly to an API call, which means the workflow you prototype in the playground is the same workflow you automate in production. This article maps each step — what it does, why it exists in the storytelling process, and exactly how to replicate it via the API.
Why This Workflow Exists
The biggest failure mode in data analysis is skipping straight to narrative. An analyst opens a CSV, sees a number that looks interesting, and writes a headline around it — without first surveying what else the data contains. That number might anchor the most important story in the dataset, or only the third-most important. Without an exploration phase, you don't know which.
The Discover → Select → Refine → Publish sequence forces the exploration step. Before you commit to a story angle, you see what DataStoryBot found across the full dataset. That changes what you write about.
The four steps also map onto the professional data storytelling methodology described in academic and practitioner literature: data exploration, story selection (editorial judgment), narrative refinement for audience, and delivery. DataStoryBot operationalizes this process as an API contract.
Step 1: Discover — Upload and Surface Story Angles
What It Does
The Discover step ingests your CSV and returns three candidate story angles. It doesn't just summarize the data — it finds the patterns with the strongest narrative potential: statistical anomalies, trends, comparisons between groups, inflection points in time series.
In the UI Playground
In the DataStoryBot playground at datastory.bot, the Discover step is the file upload screen. You drag in a CSV (up to 50 MB) and DataStoryBot creates an ephemeral analysis environment — an OpenAI Code Interpreter container running GPT-4o. Within about 30-60 seconds, you see three story cards, each with a title and a preview chart.
The three stories are distinct by design. DataStoryBot doesn't return three variations on the same theme — it attempts to cover different analytical angles: a trend story, a comparison story, an anomaly story. You're seeing the data from three perspectives simultaneously.
Via the API
The Discover step is two calls: POST /api/upload and POST /api/analyze.
import requests

BASE_URL = "https://datastory.bot"

# Upload the CSV — creates the ephemeral container
with open("sales_data.csv", "rb") as f:
    upload_resp = requests.post(
        f"{BASE_URL}/api/upload",
        files={"file": ("sales_data.csv", f, "text/csv")}
    )

upload_data = upload_resp.json()
container_id = upload_data["containerId"]
metadata = upload_data["metadata"]

print(f"Container: {container_id}")
print(f"Dataset: {metadata['rowCount']} rows, {metadata['columnCount']} columns")
print(f"Columns: {metadata['columns']}")
The upload response includes a metadata object with row count, column count, and column names. Always validate this before proceeding — it confirms the file parsed correctly and shows you what DataStoryBot is working with.
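As a concrete pattern, that validation can be a small helper that checks the metadata against the schema you expect before spending an analyze call. The required columns and row ceiling below are hypothetical examples, not part of the API — substitute your own:

```python
REQUIRED_COLUMNS = {"region", "revenue", "quarter"}  # hypothetical schema for this example
MAX_ROWS = 1_000_000                                 # arbitrary sanity ceiling

def validate_metadata(metadata: dict) -> list[str]:
    """Return a list of problems; an empty list means the upload looks usable."""
    problems = []
    if metadata["rowCount"] == 0:
        problems.append("file parsed to zero rows; check delimiter and encoding")
    elif metadata["rowCount"] > MAX_ROWS:
        problems.append(f"unexpectedly large file: {metadata['rowCount']} rows")
    missing = REQUIRED_COLUMNS - set(metadata["columns"])
    if missing:
        problems.append(f"missing expected columns: {sorted(missing)}")
    return problems
```

Failing fast here is cheap: an upload that parsed to zero rows or lost a column to a bad delimiter will otherwise produce three plausible-looking but wrong story angles.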
# Discover story angles
analyze_resp = requests.post(
    f"{BASE_URL}/api/analyze",
    json={"containerId": container_id}
)
stories = analyze_resp.json()

for story in stories:
    print(f"\n[{story['id']}] {story['title']}")
    print(f"    {story['summary']}")
A typical analyze response:
[
  {
    "id": "story_1",
    "title": "Enterprise Segment Drove 71% of Q4 Revenue Growth",
    "summary": "Enterprise customers generated $4.2M of the $5.9M total Q4 revenue increase, with APAC leading at 44% growth quarter-over-quarter.",
    "chartFileId": "file-chart001"
  },
  {
    "id": "story_2",
    "title": "Return Rates Tripled for Pro Product Line Since August",
    "summary": "The Pro line return rate climbed from 2.1% to 6.8% between August and December, while other lines held below 3%.",
    "chartFileId": "file-chart002"
  },
  {
    "id": "story_3",
    "title": "Direct Channel Surpassed Retail for the First Time",
    "summary": "Direct-to-consumer revenue exceeded retail channel revenue in November, driven by a 31% increase in repeat purchases.",
    "chartFileId": "file-chart003"
  }
]
Each story has a preview chart you can download immediately:
# Optionally preview the charts from the discover step
for story in stories:
    chart_resp = requests.get(
        f"{BASE_URL}/api/files/{container_id}/{story['chartFileId']}"
    )
    with open(f"preview_{story['id']}.png", "wb") as f:
        f.write(chart_resp.content)
Steering the Discovery
If you already have a hypothesis, pass a steeringPrompt to the analyze call. This guides Code Interpreter toward the analytical angle you care about without locking out other findings:
analyze_resp = requests.post(
    f"{BASE_URL}/api/analyze",
    json={
        "containerId": container_id,
        "steeringPrompt": (
            "We suspect APAC growth is masking weakness in EMEA and North America. "
            "Investigate regional performance differences and surface any regions "
            "where revenue is flat or declining despite overall growth."
        )
    }
)
Steering prompts work best when you state both the hypothesis and the evidence you want examined. See using steering prompts to control analysis direction for a full pattern library.
Step 2: Select — Choose Which Story to Tell
What It Does
The Select step is editorial judgment: given the stories DataStoryBot found, which one matters most for your audience right now? This is the step where the human makes the decision. The API surfaces possibilities; the analyst or product system decides what to pursue.
In the UI Playground
The playground displays the three story cards side by side. Each card shows the story title, the one-sentence summary, and the preview chart. You click the card for the story you want to develop into a full narrative.
This UI step has no direct API analog — it's a decision point, not a computation. But if you're automating the workflow, you need to implement selection logic in your code.
Via the API
For automated pipelines, you have several options for implementing selection:
Manual selection (interactive scripts):
print("\nAvailable stories:")
for i, story in enumerate(stories):
    print(f"  {i + 1}. {story['title']}")
    print(f"     {story['summary']}\n")

choice = int(input("Select a story (1-3): ")) - 1
selected_story = stories[choice]
Keyword-based selection (automated pipelines):
PRIORITY_KEYWORDS = ["churn", "return rate", "decline", "drop", "down"]

def score_story(story):
    text = (story["title"] + " " + story["summary"]).lower()
    return sum(1 for kw in PRIORITY_KEYWORDS if kw in text)

# Select the story most aligned with priority keywords
selected_story = max(stories, key=score_story)
print(f"Selected: {selected_story['title']}")
First story (default, for exploration pipelines):
# DataStoryBot surfaces the strongest story as story_1 by default
selected_story = stories[0]
Index-based (when the analyze call was steered):
If you've used a steering prompt to focus the analysis, the first story returned is typically the most directly aligned with your prompt. In that case, stories[0] is a reliable default.
The selection logic you implement encodes your editorial judgment — what types of stories your users or stakeholders care about. For a customer-facing analytics product, you might always surface the story with the highest business impact signal. For an internal reporting pipeline, you might select based on which story changed most since the last run.
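A sketch of that last idea — selecting the story that changed most since the previous run — compares each new title against the titles reported last time and picks the least similar one. The state file and similarity heuristic here are illustrative choices, not part of the DataStoryBot API:

```python
import json
import os
from difflib import SequenceMatcher

STATE_FILE = "last_run_stories.json"  # hypothetical location for previous-run state

def novelty_score(story: dict, previous_titles: list[str]) -> float:
    """Higher means less similar to anything reported last run."""
    if not previous_titles:
        return 1.0
    best = max(
        SequenceMatcher(None, story["title"].lower(), prev.lower()).ratio()
        for prev in previous_titles
    )
    return 1.0 - best

def select_most_changed(stories: list[dict], state_file: str = STATE_FILE) -> dict:
    previous = []
    if os.path.exists(state_file):
        with open(state_file) as f:
            previous = json.load(f)
    selected = max(stories, key=lambda s: novelty_score(s, previous))
    # Record this run's titles for the next comparison
    with open(state_file, "w") as f:
        json.dump([s["title"] for s in stories], f)
    return selected
```

Title similarity is a crude proxy for "the story changed"; a production pipeline might instead diff the summaries or the underlying metrics.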
Step 3: Refine — Adjust the Analysis and Narrative
What It Does
The Refine step takes the selected story angle and generates a complete analysis: statistical evidence, multiple charts, a filtered dataset, and a full Markdown narrative. It also accepts a refinementPrompt to adjust the output for specific audiences or formats.
This is the most computation-intensive step. Code Interpreter writes and executes Python — running aggregations, building charts, filtering data — and GPT-4o synthesizes the results into a structured narrative.
In the UI Playground
After clicking a story card in the Select step, the playground shows a refinement panel. You can optionally enter a refinement instruction before clicking "Generate Story." The instruction field accepts plain English: "make it executive-friendly," "focus on EMEA," "include statistical confidence levels," "keep it under 200 words." If you leave it blank, DataStoryBot uses default narrative settings.
The generation takes 15-30 seconds, then the full story renders — narrative text on the left, charts on the right, with a download button for the filtered dataset.
Via the API
# Basic refine — uses default narrative settings
refine_resp = requests.post(
    f"{BASE_URL}/api/refine",
    json={
        "containerId": container_id,
        "selectedStoryTitle": selected_story["title"]
    }
)

result = refine_resp.json()
narrative = result["narrative"]       # Markdown string
charts = result["charts"]             # List of {fileId, caption}
dataset = result["resultDataset"]     # {fileId, caption}

print(f"Narrative: {len(narrative)} chars")
print(f"Charts: {len(charts)}")
With a refinement prompt:
# Tailored for an executive audience
refine_resp = requests.post(
    f"{BASE_URL}/api/refine",
    json={
        "containerId": container_id,
        "selectedStoryTitle": selected_story["title"],
        "refinementPrompt": (
            "Executive audience. Lead with the business impact in the first sentence. "
            "Keep total length under 200 words. Avoid technical jargon. "
            "End with a single recommended action."
        )
    }
)
# Tailored for an analytics team
refine_resp = requests.post(
    f"{BASE_URL}/api/refine",
    json={
        "containerId": container_id,
        "selectedStoryTitle": selected_story["title"],
        "refinementPrompt": (
            "Analytics team audience. Include the statistical methods used. "
            "Report confidence intervals for any trend claims. "
            "Note sample sizes and any data quality caveats. "
            "Flag correlations that should not be interpreted as causation."
        )
    }
)
Refinement prompts can control three independent dimensions simultaneously: content focus (which aspects of the data to emphasize), narrative tone (executive vs. technical vs. operational), and output format (length, structure, language). You can mix all three in a single prompt.
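As an illustration of mixing all three, the sketch below composes a single refinementPrompt from a content-focus clause, a tone clause, and a format clause. The helper function and placeholder container ID are hypothetical; only the payload shape matches the refine calls above:

```python
import json

def build_refinement_prompt(focus: str, tone: str, fmt: str) -> str:
    """Hypothetical helper: joins the three dimensions into one instruction string."""
    return " ".join([focus, tone, fmt])

payload = {
    "containerId": "cntr_123",  # placeholder — use the ID from your upload response
    "selectedStoryTitle": "Return Rates Tripled for Pro Product Line Since August",
    "refinementPrompt": build_refinement_prompt(
        focus="Focus on the EMEA region and the Pro product line only.",
        tone="Write for a non-technical operations team; avoid statistical jargon.",
        fmt="Use three short sections with bold headers, under 300 words total.",
    ),
}
print(json.dumps(payload, indent=2))
```

Keeping the three clauses as separate parameters in your own code makes it easy to vary one dimension (say, the audience) while holding the others fixed.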
You can also call refine multiple times on the same story with different refinement prompts — each call produces a different narrative from the same underlying analysis. The container persists for 20 minutes, so multiple refine calls against the same upload are fast and cheap.
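That fan-out can be sketched as a loop over audience prompts against the same container. The audience prompts here are examples, and the HTTP callable is injected (pass requests.post in real use) so the loop itself stays easy to test:

```python
BASE_URL = "https://datastory.bot"

# Example audience prompts — adjust to your own readers
AUDIENCE_PROMPTS = {
    "executive": "Executive audience. Business impact first. Under 200 words.",
    "analyst": "Analytics team audience. Include methods, confidence intervals, and caveats.",
    "operations": "Operations audience. Focus on concrete next steps.",
}

def refine_for_all_audiences(container_id: str, story_title: str, post) -> dict:
    """One refine call per audience against the same container.

    `post` is an HTTP POST callable such as requests.post.
    Returns {audience_name: narrative_markdown}.
    """
    narratives = {}
    for audience, prompt in AUDIENCE_PROMPTS.items():
        resp = post(
            f"{BASE_URL}/api/refine",
            json={
                "containerId": container_id,
                "selectedStoryTitle": story_title,
                "refinementPrompt": prompt,
            },
        )
        narratives[audience] = resp.json()["narrative"]
    return narratives
```

Because all three calls hit the same container, the underlying analysis is reused and only the narrative generation differs per call.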
Step 4: Publish — Get the Final Output Package
What It Does
The Publish step collects the narrative, charts, and filtered dataset and makes them available for downstream use — rendering in a UI, attaching to an email, storing in a CMS, pushing to Slack, or embedding in a dashboard.
In the UI Playground
The playground's final screen renders the narrative as formatted text with embedded charts. A toolbar provides download buttons for the narrative (as Markdown), each chart (as PNG), and the filtered dataset (as CSV). There's also a copy-to-clipboard button for the narrative text.
Via the API
Everything is downloaded via GET /api/files/{containerId}/{fileId}:
import os

output_dir = "story_output"
os.makedirs(output_dir, exist_ok=True)

# Save the narrative
with open(f"{output_dir}/narrative.md", "w") as f:
    f.write(result["narrative"])
print(f"Narrative saved: {output_dir}/narrative.md")

# Download all charts
for i, chart in enumerate(result["charts"], start=1):
    chart_resp = requests.get(
        f"{BASE_URL}/api/files/{container_id}/{chart['fileId']}"
    )
    filename = f"{output_dir}/chart_{i}.png"
    with open(filename, "wb") as f:
        f.write(chart_resp.content)
    print(f"Chart {i} saved: {filename} — {chart['caption']}")

# Download the filtered dataset
ds_resp = requests.get(
    f"{BASE_URL}/api/files/{container_id}/{dataset['fileId']}"
)
with open(f"{output_dir}/filtered_data.csv", "wb") as f:
    f.write(ds_resp.content)
print(f"Dataset saved: {output_dir}/filtered_data.csv")
Critical: containers expire 20 minutes after creation. All files — charts, narratives, filtered datasets — are deleted when the container expires. Download everything before that window closes. For batch workflows that process many files, track container creation time and re-upload if you're approaching the limit.
import time

container_created_at = time.time()
CONTAINER_TTL = 20 * 60      # 20 minutes in seconds
SAFETY_MARGIN = 2 * 60       # Stop 2 minutes early

def container_is_alive():
    elapsed = time.time() - container_created_at
    return elapsed < (CONTAINER_TTL - SAFETY_MARGIN)
Complete Python Workflow
Here's the full four-step pipeline as a single script, with error handling:
import requests
import os
import time

BASE_URL = "https://datastory.bot"

def run_data_story_workflow(
    csv_path: str,
    output_dir: str = "story_output",
    steering_prompt: str = None,
    refinement_prompt: str = None,
    story_index: int = 0,
) -> dict:
    """
    Run the complete DataStoryBot workflow: discover, select, refine, publish.

    Args:
        csv_path: Path to the CSV file to analyze.
        output_dir: Directory to write narrative, charts, and dataset.
        steering_prompt: Optional guidance for the analyze step.
        refinement_prompt: Optional audience/format instructions for the refine step.
        story_index: Which of the 3 discovered stories to refine (0, 1, or 2).

    Returns:
        dict with keys: narrative, chart_paths, dataset_path, selected_story
    """
    os.makedirs(output_dir, exist_ok=True)
    started_at = time.time()

    # --- Step 1: Discover (Upload) ---
    print(f"[1/4] Uploading {csv_path}...")
    with open(csv_path, "rb") as f:
        upload_resp = requests.post(
            f"{BASE_URL}/api/upload",
            files={"file": (os.path.basename(csv_path), f, "text/csv")}
        )
    upload_resp.raise_for_status()
    upload_data = upload_resp.json()
    container_id = upload_data["containerId"]
    meta = upload_data["metadata"]
    print(f"  Container: {container_id}")
    print(f"  Dataset: {meta['rowCount']} rows, {meta['columnCount']} columns")

    # --- Step 1b: Discover (Analyze) ---
    print("[2/4] Discovering story angles...")
    analyze_payload = {"containerId": container_id}
    if steering_prompt:
        analyze_payload["steeringPrompt"] = steering_prompt
    analyze_resp = requests.post(
        f"{BASE_URL}/api/analyze",
        json=analyze_payload
    )
    analyze_resp.raise_for_status()
    stories = analyze_resp.json()
    print(f"  Found {len(stories)} story angles:")
    for i, story in enumerate(stories):
        marker = "  <-- selected" if i == story_index else ""
        print(f"  {i + 1}. {story['title']}{marker}")

    # --- Step 2: Select ---
    selected_story = stories[story_index]
    print(f"\n[3/4] Refining: {selected_story['title']}")

    # --- Step 3: Refine ---
    refine_payload = {
        "containerId": container_id,
        "selectedStoryTitle": selected_story["title"]
    }
    if refinement_prompt:
        refine_payload["refinementPrompt"] = refinement_prompt
    refine_resp = requests.post(
        f"{BASE_URL}/api/refine",
        json=refine_payload
    )
    refine_resp.raise_for_status()
    result = refine_resp.json()

    # --- Step 4: Publish (Download) ---
    elapsed = time.time() - started_at
    remaining = (20 * 60) - elapsed
    print(f"[4/4] Downloading outputs ({remaining:.0f}s remaining in container)...")

    # Narrative
    narrative_path = f"{output_dir}/narrative.md"
    with open(narrative_path, "w") as f:
        f.write(result["narrative"])

    # Charts
    chart_paths = []
    for i, chart in enumerate(result["charts"], start=1):
        chart_resp = requests.get(
            f"{BASE_URL}/api/files/{container_id}/{chart['fileId']}"
        )
        chart_resp.raise_for_status()
        path = f"{output_dir}/chart_{i}.png"
        with open(path, "wb") as f:
            f.write(chart_resp.content)
        chart_paths.append({"path": path, "caption": chart["caption"]})
        print(f"  Chart {i}: {chart['caption']}")

    # Filtered dataset
    ds = result["resultDataset"]
    ds_resp = requests.get(
        f"{BASE_URL}/api/files/{container_id}/{ds['fileId']}"
    )
    ds_resp.raise_for_status()
    dataset_path = f"{output_dir}/filtered_data.csv"
    with open(dataset_path, "wb") as f:
        f.write(ds_resp.content)
    print(f"  Dataset: {ds['caption']}")

    total = time.time() - started_at
    print(f"\nDone in {total:.1f}s. Output: {output_dir}/")

    return {
        "narrative": result["narrative"],
        "chart_paths": chart_paths,
        "dataset_path": dataset_path,
        "selected_story": selected_story,
    }

if __name__ == "__main__":
    output = run_data_story_workflow(
        csv_path="sales_data.csv",
        output_dir="story_output",
        steering_prompt="Focus on regional performance differences over the past two quarters.",
        refinement_prompt="Executive audience. Lead with the financial impact. Under 250 words.",
        story_index=0,
    )
    print("\n--- NARRATIVE PREVIEW ---")
    print(output["narrative"][:500] + "...")
How This Workflow Maps to Data Storytelling Methodology
The four-step sequence isn't arbitrary — it reflects how experienced analysts actually structure their work.
Discover corresponds to the exploratory analysis phase. Professional data storytellers don't start with a conclusion; they examine the data to understand what patterns exist before deciding what to communicate. DataStoryBot's /analyze endpoint compresses the exploration phase from hours to seconds by running statistical profiling, trend detection, and comparative analysis simultaneously. You still make editorial decisions, but you make them with more information.
Select is the editorial judgment call. This is where analyst intuition matters most. Two analysts might look at the same three discovered stories and choose different ones based on their knowledge of the audience, the business context, or the current strategic priorities. The API supports this: it returns options, and your code (or your user) makes the selection.
Refine is where storytelling craft enters. The same finding framed for a board of directors reads very differently from the same finding framed for an engineering team. The refinementPrompt parameter encodes that audience awareness into the generation step. Good data storytellers spend significant time on this phase — testing different framings, adjusting the level of technical detail, making sure the implications section lands correctly for the specific reader.
Publish is distribution and audit. Professional data stories need to exist in multiple formats (email, slide, dashboard, PDF) and be traceable to source data. The combination of Markdown narrative + PNG charts + filtered CSV covers this: the narrative is format-agnostic, the charts embed in anything, and the filtered dataset serves as the audit trail.
Generating Multiple Stories from One Dataset
If you need comprehensive coverage rather than a single focal story, refine all three discovered stories:
stories = requests.post(
    f"{BASE_URL}/api/analyze",
    json={"containerId": container_id}
).json()

sections = []
for story in stories:
    result = requests.post(
        f"{BASE_URL}/api/refine",
        json={
            "containerId": container_id,
            "selectedStoryTitle": story["title"]
        }
    ).json()
    sections.append(result["narrative"])

# Combine into a single Markdown document
full_report = "\n\n---\n\n".join(sections)
with open("full_report.md", "w") as f:
    f.write(full_report)
Three refine calls take roughly 60-90 seconds total — well within the 20-minute container TTL. Download all charts after all three refine calls complete to keep the pipeline linear.
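One way to keep that download step linear is a helper that walks every refine result after the loop finishes. This is a sketch: the naming scheme is arbitrary, and the HTTP callable is injected (use requests.get in practice):

```python
import os

BASE_URL = "https://datastory.bot"

def download_all_charts(container_id: str, refine_results: list, get,
                        output_dir: str = "full_report_charts") -> list:
    """Download every chart from a list of refine results (one result per story).

    `get` is an HTTP GET callable such as requests.get.
    Returns the list of file paths written.
    """
    os.makedirs(output_dir, exist_ok=True)
    saved = []
    for story_num, result in enumerate(refine_results, start=1):
        for chart_num, chart in enumerate(result["charts"], start=1):
            resp = get(f"{BASE_URL}/api/files/{container_id}/{chart['fileId']}")
            path = os.path.join(output_dir, f"story{story_num}_chart{chart_num}.png")
            with open(path, "wb") as f:
                f.write(resp.content)
            saved.append(path)
    return saved
```

Collect each refine response into a list during the loop, then call this once at the end — all downloads happen while the container is still comfortably inside its TTL.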
Next Steps
The workflow described here is the foundation for most DataStoryBot integrations. Once you have it running end-to-end, the extensions are straightforward:
- For a complete introduction to the API endpoints with JavaScript and curl examples, see getting started with the DataStoryBot API.
- For a deep dive into controlling analysis direction with steering prompts — including 10 tested examples — see using steering prompts to control analysis direction.
- For patterns around scheduling, batching, and building fully automated reporting pipelines, see automating data stories via API.
- To experiment with the four-step flow without writing any code, try the DataStoryBot playground — the UI exposes each step interactively.
Ready to find your data story?
Upload a CSV and DataStoryBot will uncover the narrative in seconds.
Try DataStoryBot →