Pandas vs. DataStoryBot: When to Script and When to Ship
An honest comparison of pandas and DataStoryBot for CSV analysis. Same dataset, both approaches, with code. Know when to script and when to ship.
Pandas is the most important library in the Python data ecosystem. It is also the library that developers spend the most time fighting. The dtype coercion, the SettingWithCopyWarning, the groupby syntax that never looks right on the first try — pandas is powerful, flexible, and slow to get results from when all you need is "what is going on in this CSV."
DataStoryBot takes a different approach. You upload a CSV. An AI agent running inside an ephemeral Code Interpreter container writes and executes pandas code for you, then returns a narrative, charts, and a filtered dataset. Three API calls. No local Python environment.
This is not a "pandas is dead" article. Pandas is the right tool when you need full control over your analysis. DataStoryBot is the right tool when you need to move fast and the exact methodology matters less than the insight. This article shows the same analysis done both ways, honestly, so you can see where each one shines.
The Dataset
We will use an ecommerce orders dataset with these columns: order_id, order_date, customer_id, product_category, region, quantity, unit_price, discount_pct, is_returning_customer. About 25,000 rows covering 12 months of transactions.
The question: "What are the most important patterns in this data?"
The Pandas Approach
Here is a thorough exploratory analysis in pandas. This is what a competent data analyst writes when handed a new CSV:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load and clean
df = pd.read_csv("ecommerce_orders.csv")
df["order_date"] = pd.to_datetime(df["order_date"])
df["revenue"] = df["quantity"] * df["unit_price"] * (1 - df["discount_pct"])
print(f"Shape: {df.shape}")
print(f"Date range: {df['order_date'].min()} to {df['order_date'].max()}")
print(f"Null counts:\n{df.isnull().sum()}")
print(f"\nBasic stats:\n{df.describe()}")
Already 12 lines of code and we have not found anything interesting yet. Let us continue:
# Monthly revenue trend
monthly = df.set_index("order_date").resample("M")["revenue"].agg(["sum", "count"])
monthly.columns = ["revenue", "orders"]
fig, ax1 = plt.subplots(figsize=(12, 5))
ax1.bar(monthly.index, monthly["revenue"], width=20, alpha=0.7, label="Revenue")
ax2 = ax1.twinx()
ax2.plot(monthly.index, monthly["orders"], color="red", label="Order Count")
ax1.set_title("Monthly Revenue and Order Count")
ax1.legend(loc="upper left")
ax2.legend(loc="upper right")
plt.tight_layout()
plt.savefig("monthly_trend.png")
# Revenue by region
regional = df.groupby("region")["revenue"].agg(["sum", "mean", "count"])
regional.columns = ["total_revenue", "avg_order_revenue", "order_count"]
regional = regional.sort_values("total_revenue", ascending=False)
print(f"\nRegional breakdown:\n{regional}")
# Category performance
# Category performance
category = df.groupby("product_category").agg(
    total_revenue=("revenue", "sum"),
    avg_discount=("discount_pct", "mean"),
    order_count=("order_id", "count"),
    avg_quantity=("quantity", "mean"),
)
category = category.sort_values("total_revenue", ascending=False)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
category["total_revenue"].plot(kind="barh", ax=axes[0], title="Revenue by Category")
category["avg_discount"].plot(kind="barh", ax=axes[1], title="Avg Discount by Category")
plt.tight_layout()
plt.savefig("category_analysis.png")
# Returning vs new customers
retention = df.groupby("is_returning_customer").agg(
    orders=("order_id", "count"),
    revenue=("revenue", "sum"),
    avg_revenue=("revenue", "mean"),
    avg_discount=("discount_pct", "mean"),
)
print(f"\nReturning vs New:\n{retention}")
# Discount impact
df["discount_bucket"] = pd.cut(
    df["discount_pct"],
    bins=[0, 0.05, 0.10, 0.20, 0.50, 1.0],
    labels=["0-5%", "5-10%", "10-20%", "20-50%", "50%+"],
    include_lowest=True,  # keep 0%-discount orders in the first bucket
)
discount_impact = df.groupby("discount_bucket", observed=True).agg(
    avg_revenue=("revenue", "mean"),
    avg_quantity=("quantity", "mean"),
    order_count=("order_id", "count"),
)
print(f"\nDiscount Impact:\n{discount_impact}")
That is about 60 lines of code. It produces two charts, some printed tables, and a set of basic breakdowns. The total time to write this, assuming you are comfortable with pandas: 15-25 minutes. If you are rusty on the API or need to Google the resample syntax, add another 10.
And here is the thing: this analysis only answers the questions I thought to ask. I checked monthly trends, regional breakdowns, category performance, and retention. I did not check day-of-week effects, customer lifetime value distributions, cross-category purchase patterns, or whether discount rates vary by region in ways that affect margin. Those might be where the real stories are.
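Each of those skipped checks is only a few more lines — the problem is remembering to write them. For instance, a day-of-week check might look like this (a sketch using a tiny inline frame in place of the real CSV):

```python
import pandas as pd

# Toy stand-in for the ecommerce orders frame loaded earlier.
df = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-06", "2024-01-07"]
    ),
    "revenue": [100.0, 120.0, 300.0, 280.0],
})

# Average revenue by day of week, busiest days first.
df["day_of_week"] = df["order_date"].dt.day_name()
dow = (
    df.groupby("day_of_week")["revenue"]
      .agg(["mean", "count"])
      .sort_values("mean", ascending=False)
)
print(dow)
```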
The DataStoryBot Approach
The same dataset, same question, three API calls:
# Upload from the shell, for a quick smoke test
curl -X POST https://datastory.bot/api/upload \
  -F "file=@ecommerce_orders.csv"
The same flow in Python:
import requests
BASE_URL = "https://datastory.bot"
# Upload
with open("ecommerce_orders.csv", "rb") as f:
    upload = requests.post(
        f"{BASE_URL}/api/upload",
        files={"file": ("ecommerce_orders.csv", f, "text/csv")},
    ).json()
container_id = upload["containerId"]
print(f"Uploaded: {upload['metadata']['rowCount']} rows, "
      f"{upload['metadata']['columnCount']} columns")
# Analyze
stories = requests.post(
    f"{BASE_URL}/api/analyze",
    json={"containerId": container_id},
).json()
for story in stories:
    print(f"\n{story['title']}")
    print(f"  {story['summary']}")
# Refine the most interesting story
refined = requests.post(
    f"{BASE_URL}/api/refine",
    json={
        "containerId": container_id,
        "selectedStoryTitle": stories[0]["title"],
    },
).json()
print(f"\n{refined['narrative']}")
# Download charts
for chart in refined["charts"]:
    data = requests.get(
        f"{BASE_URL}/api/files/{container_id}/{chart['fileId']}"
    )
    with open(f"chart_{chart['fileId']}.png", "wb") as f:
        f.write(data.content)
    print(f"Saved: {chart['caption']}")
That is 35 lines of code. Time to write: under 5 minutes (or zero if you copy from the getting started guide). Time to run: 30-60 seconds. No pandas installed. No matplotlib. No cleaning code.
But more importantly, the output is different. DataStoryBot does not produce tables and charts for the dimensions I chose. It produces narratives for the patterns it found. And those patterns might include things I would not have looked for:
Discounts Destroy Margin in Electronics but Drive Volume in Apparel
Electronics orders with >15% discounts generate 40% less margin per
unit than non-discounted orders, with no statistically significant
increase in quantity. Apparel orders with the same discount range
show 2.8x the quantity with only 12% margin compression — a net
positive ROI on discounting.
That insight requires a cross-tabulation of discount impact by category with margin analysis. It is the kind of analysis a pandas script would find if you specifically coded for it. DataStoryBot found it because it tested combinations I did not think to test.
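That cross-tabulation is straightforward once you know to ask for it. A minimal sketch of the pivot behind such a story, using a tiny inline frame in place of the real dataset (the column names match the article's schema, but the numbers are illustrative):

```python
import pandas as pd

# Toy data standing in for ecommerce_orders.csv; values are illustrative.
df = pd.DataFrame({
    "product_category": ["Electronics", "Electronics", "Apparel", "Apparel"],
    "discount_pct": [0.05, 0.25, 0.05, 0.25],
    "revenue": [500.0, 290.0, 80.0, 150.0],
})

# Flag heavily discounted orders, then cross-tab average revenue
# by category and discount band.
df["discounted"] = df["discount_pct"] > 0.15
cross = df.pivot_table(
    index="product_category",
    columns="discounted",
    values="revenue",
    aggfunc="mean",
)
print(cross)
```

The point is not that this code is hard; it is that someone has to think of writing it. The agent's advantage is breadth of hypotheses, not cleverness of code.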
The Honest Comparison
Here is where each approach wins:
| Dimension | Pandas | DataStoryBot |
|---|---|---|
| Time to first insight | 15-25 min | 1-2 min |
| Lines of code | 50-100+ | 15-35 |
| Setup required | Python, pandas, matplotlib | None (HTTP calls) |
| Methodology control | Full | Steering prompts |
| Reproducibility | Exact (same code = same output) | High (same API, structured output) |
| Handles schema changes | Breaks | Adapts automatically |
| Finds unexpected patterns | Only what you code for | Tests broad hypotheses |
| Custom statistical tests | Any test you want | Limited to what the AI runs |
| Works offline | Yes | No |
| Data stays local | Yes | Data enters ephemeral container |
| Audit trail | Your code is the audit trail | Narrative + filtered dataset |
Where Pandas Wins
Custom logic. If you need a specific statistical test — a Mann-Whitney U test, a Cox proportional hazards model, a custom segmentation algorithm — pandas gives you the flexibility to implement it. DataStoryBot runs general-purpose exploratory analysis. It does not do bespoke modeling.
Large-scale ETL. If your analysis is really a transformation pipeline — clean the data, reshape it, join it with three other tables, and output a specific format — pandas (or polars, or dask) is the right tool. DataStoryBot analyzes data; it does not transform it for downstream systems.
Deterministic reproduction. The same pandas script on the same data produces the exact same output every time. DataStoryBot's output varies slightly between runs because the AI makes different analytical choices. For regulatory or audit contexts where byte-identical reproduction matters, script everything.
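If you need to prove determinism rather than assume it, hash the output. A minimal sketch (the `summarize` function and toy frame are illustrative, not from the article's dataset):

```python
import hashlib

import pandas as pd

# Hypothetical analysis step whose output we want to prove reproducible.
def summarize(df: pd.DataFrame) -> str:
    return df.groupby("region")["revenue"].sum().to_csv()

df = pd.DataFrame({
    "region": ["East", "West", "East"],
    "revenue": [10.0, 20.0, 5.0],
})

# Same code, same data: the digests must match byte-for-byte.
digest_1 = hashlib.sha256(summarize(df).encode()).hexdigest()
digest_2 = hashlib.sha256(summarize(df).encode()).hexdigest()
print(digest_1 == digest_2)
```

Store the digest next to the report and an auditor can re-run the script and verify nothing drifted.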
Offline and air-gapped environments. Pandas runs locally. DataStoryBot requires an internet connection and sends your data to an ephemeral container. If your data cannot leave your network, pandas is the only option.
Cost at scale. If you are analyzing thousands of files per day, pandas scripts running on your own infrastructure will be cheaper than API calls. The crossover point depends on your engineer's time versus API costs, but at high volume, in-house wins on marginal cost.
Where DataStoryBot Wins
Speed to insight. Uploading a file and getting three story angles in 30 seconds beats writing, debugging, and interpreting a pandas script every time. If you need to understand a new dataset now, the API is faster.
Hypothesis generation. The most dangerous phrase in data analysis is "I already know what to look for." DataStoryBot tests combinations you would not have checked. The value is in the stories you would not have written the code to find.
Non-analyst users. If the person who needs the insight cannot write Python, pandas is not an option. DataStoryBot's API can be wrapped in a simple UI (or called from a no-code tool) to give non-technical users access to real analysis.
Schema resilience. Your pandas script breaks when a column is renamed or a new column appears. DataStoryBot reads the schema fresh every time and adapts. For weekly CSV exports where the schema drifts, this matters.
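The best a static script can do is fail loudly. A small guard like this (the expected-column set is illustrative) at least turns a cryptic KeyError three transforms later into an immediate, actionable message:

```python
import pandas as pd

# Columns the downstream script assumes exist (illustrative subset).
EXPECTED = {"order_id", "order_date", "revenue"}

def missing_columns(df: pd.DataFrame) -> set:
    """Return expected columns absent from the frame."""
    return EXPECTED - set(df.columns)

# Simulated drift: the export renamed order_date to order_dt.
drifted = pd.DataFrame({
    "order_id": [1],
    "order_dt": ["2024-01-01"],
    "revenue": [9.5],
})
missing = missing_columns(drifted)
print(missing)
```

Failing loudly is still failing; the agent re-reads the schema each run instead.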
Narrative output. Pandas produces numbers and charts. DataStoryBot produces written narratives with numbers and charts. If the deliverable is a report for stakeholders (not a notebook for analysts), the narrative saves the step of translating data into language.
The Crossover Point
The crossover point — where pandas becomes more efficient than DataStoryBot — depends on how many times you will run the same analysis.
One-off exploration on a new dataset: DataStoryBot wins. Faster by 10-20 minutes. No environment setup.
Weekly report on the same schema: DataStoryBot wins if the schema might change or you want fresh story angles each week. Pandas wins if you need the exact same charts and tables every time.
Daily analysis in a production pipeline: Depends. If the pipeline needs to adapt to varying inputs (user-uploaded CSVs, inconsistent schemas), DataStoryBot is more resilient. If the pipeline is stable and well-defined, pandas is cheaper and more predictable.
Custom statistical modeling: Pandas wins. DataStoryBot does not do custom modeling. It does exploratory analysis.
Building an analytics feature into a product: DataStoryBot wins. You do not want to maintain pandas infrastructure for every customer's data. Three API calls and you have a feature. See how to automate CSV analysis for this pattern.
Using Both Together
The most effective pattern is not either/or. It is DataStoryBot for discovery, pandas for follow-up.
import requests
import pandas as pd
from scipy import stats
BASE_URL = "https://datastory.bot"
# Step 1: Use DataStoryBot to find what's interesting
with open("ecommerce_orders.csv", "rb") as f:
    upload = requests.post(
        f"{BASE_URL}/api/upload",
        files={"file": ("ecommerce_orders.csv", f, "text/csv")},
    ).json()
stories = requests.post(
    f"{BASE_URL}/api/analyze",
    json={"containerId": upload["containerId"]},
).json()
# DataStoryBot found that discounts affect categories differently
# Now use pandas for a targeted statistical test
# Step 2: Use pandas for the precise analysis
df = pd.read_csv("ecommerce_orders.csv")
df["revenue"] = df["quantity"] * df["unit_price"] * (1 - df["discount_pct"])
df["discount_amount"] = df["quantity"] * df["unit_price"] * df["discount_pct"]  # dollars given up to discounts (a margin proxy; the dataset has no cost column)
# Test the specific hypothesis DataStoryBot surfaced
electronics = df[df["product_category"] == "Electronics"]
apparel = df[df["product_category"] == "Apparel"]
for cat_name, cat_df in [("Electronics", electronics), ("Apparel", apparel)]:
    discounted = cat_df[cat_df["discount_pct"] > 0.15]["revenue"]
    full_price = cat_df[cat_df["discount_pct"] <= 0.15]["revenue"]
    # Welch's t-test: do not assume the two groups have equal variance
    t_stat, p_val = stats.ttest_ind(discounted, full_price, equal_var=False)
    print(f"\n{cat_name}:")
    print(f"  Discounted avg revenue: ${discounted.mean():,.2f}")
    print(f"  Full price avg revenue: ${full_price.mean():,.2f}")
    print(f"  t-statistic: {t_stat:.3f}, p-value: {p_val:.4f}")
DataStoryBot finds the story. Pandas validates it with the exact statistical rigor you need. Discovery and validation are different tasks, and they benefit from different tools.
The Real Comparison Is Time
Every minute you spend writing pandas boilerplate is a minute you are not spending on interpretation. Every hour you spend debugging a SettingWithCopyWarning is an hour you are not spending on the insight that moves the business.
Pandas is not going anywhere. It is the foundation. But for the common case — "I have a CSV and I need to know what is in it" — the fastest path to an answer is not installing pandas. It is three API calls.
For a deeper look at automating this workflow, read five ways to automate CSV analysis. For the specific use case of building this into a product, see the automated CSV analysis guide.
Or just try it. Upload a CSV to the DataStoryBot playground and see how the output compares to your last pandas notebook. The stories you did not think to look for are usually the most valuable ones.
Ready to find your data story?
Upload a CSV and DataStoryBot will uncover the narrative in seconds.
Try DataStoryBot →