Handling Messy CSVs: Nulls, Mixed Types, and Encoding Issues
How DataStoryBot's Code Interpreter handles dirty data — nulls, mixed column types, encoding problems — and when to pre-clean vs. let the AI figure it out.
Real CSV files are messy. The textbook tutorial uses a clean 10-row dataset where every column is perfectly typed and nothing is missing. Your actual data has nulls where they shouldn't be, columns where "N/A" and "" and "null" and "-" all mean the same thing, date formats that change halfway through the file, and that one row where someone pasted a formula instead of a value.
The question isn't whether your CSV has problems. The question is whether your analysis tool can handle them without crashing or silently producing garbage results.
DataStoryBot's Code Interpreter runs Python inside an isolated container. When it encounters messy data, it writes cleaning code — type coercion, null handling, encoding fixes — before running the analysis. This article covers what it handles automatically, what needs guidance via steering prompts, and when you should pre-clean the data yourself.
What "Messy" Actually Means
CSV messiness falls into categories, and each requires a different fix:
Null and Missing Values
The most common issue. Missing data appears in CSVs as:
- Empty strings (",," in the raw file)
- The literal string "null" or "NULL"
- "N/A", "n/a", "NA", "#N/A"
- "-" or "--"
- "." (common in government datasets)
- 0 when it should be null (ambiguous — is it a real zero or a missing value?)
Pandas handles empty strings and "NA" automatically. But "-" and "." and "null" need explicit na_values configuration. DataStoryBot's Code Interpreter typically adds this:
df = pd.read_csv(
    "/mnt/data/messy_data.csv",
    na_values=["", "null", "NULL", "N/A", "n/a", "NA", "#N/A", "-", "--", "."],
)
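To sanity-check that configuration, here is a minimal sketch with made-up values (only "-" and "null" appear in this sample):

```python
import io
import pandas as pd

# Made-up sample where "-" and "null" stand in for missing values
raw = "city,population\nSpringfield,-\nShelbyville,null\nOgdenville,41000\n"

df = pd.read_csv(
    io.StringIO(raw),
    na_values=["", "null", "NULL", "N/A", "n/a", "NA", "#N/A", "-", "--", "."],
)
print(df["population"].isna().sum())  # 2 of the 3 values are treated as missing
```

Because the markers are nulled at parse time, the column comes back numeric (float) rather than as strings.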
Mixed Column Types
A column that's mostly numeric but has a few text values:
revenue
1234.50
2345.00
pending
1567.80
N/A
Pandas reads this as a string column because of "pending." Code Interpreter detects this pattern and coerces:
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
The errors="coerce" flag turns unparseable values into NaN instead of raising an error. This is almost always what you want — keep the numbers, null out the garbage.
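Before coercing, it can be worth logging which values are about to be nulled out, so you can confirm they really are garbage. A minimal sketch using the revenue sample above:

```python
import pandas as pd

# The mixed revenue column from the example above
raw = pd.Series(["1234.50", "2345.00", "pending", "1567.80", "N/A"])

coerced = pd.to_numeric(raw, errors="coerce")

# Entries that were non-null text but failed to parse are the "garbage"
garbage = raw[coerced.isna() & raw.notna()]
print(sorted(garbage))  # ['N/A', 'pending']
```

If the garbage turns out to carry meaning (say, "pending" orders you want counted separately), split it into its own column instead of discarding it.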
Date Format Inconsistency
Dates are the second most common source of messiness. A single column might contain:
2026-03-15
03/15/2026
15-Mar-2026
March 15, 2026
3/15/26
Code Interpreter uses pd.to_datetime(), which infers the format automatically (the infer_datetime_format flag is deprecated as of pandas 2.0; format="mixed" now covers per-element variation). For truly inconsistent dates, it falls back to dateutil.parser:
from dateutil import parser
df["date"] = df["date"].apply(lambda x: parser.parse(str(x)) if pd.notna(x) else pd.NaT)
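On pandas 2.0 or later, format="mixed" is the supported way to parse per-element format variation without the dateutil fallback. A sketch using the five spellings above:

```python
import pandas as pd

dates = pd.Series([
    "2026-03-15", "03/15/2026", "15-Mar-2026", "March 15, 2026", "3/15/26",
])

# format="mixed" (pandas >= 2.0) guesses the format for each element separately
parsed = pd.to_datetime(dates, format="mixed", errors="coerce")
print(parsed.dt.date.nunique())  # all five spellings parse to the same date
```

Note the ambiguity hiding here: "03/15/2026" is unambiguous, but "04/03/2026" could be April 3 or March 4. That is exactly the case where a steering prompt (covered below) is needed.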
Encoding Issues
CSV files from Excel on Windows often use cp1252 or latin-1 encoding instead of UTF-8. Symptoms: garbled characters, UnicodeDecodeError, or wrong characters in names and descriptions.
Code Interpreter tries UTF-8 first, then falls back:
try:
    df = pd.read_csv("/mnt/data/file.csv", encoding="utf-8")
except UnicodeDecodeError:
    try:
        # cp1252 covers most Excel-on-Windows exports
        df = pd.read_csv("/mnt/data/file.csv", encoding="cp1252")
    except UnicodeDecodeError:
        # latin-1 maps every byte, so it never raises; keep it as the last resort
        df = pd.read_csv("/mnt/data/file.csv", encoding="latin-1")
Structural Issues
- Extra header rows — some CSVs have metadata in the first few rows before the actual header
- Trailing commas — creating phantom empty columns
- Inconsistent delimiters — tabs mixed with commas, or semicolons (common in European exports)
- Merged cells — exported from Excel with merged cell ranges that produce null-filled rows
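The first and third issues can often be handled in a few lines of plain pandas. A hedged sketch against a made-up semicolon-delimited export with two metadata rows:

```python
import csv
import io
import pandas as pd

# Made-up export: two metadata rows, then a semicolon-delimited table
raw = (
    "Exported by: Example Corp\n"
    "Export date: 2026-03-15\n"
    "region;revenue\n"
    "North;1234.50\n"
    "South;2345.00\n"
)

# Sniff the delimiter from the data section instead of assuming a comma
sample = "\n".join(raw.splitlines()[2:])
dialect = csv.Sniffer().sniff(sample)

# skiprows jumps past the metadata so the third row becomes the header
df = pd.read_csv(io.StringIO(raw), sep=dialect.delimiter, skiprows=2)
print(df.columns.tolist())  # ['region', 'revenue']
```

Trailing commas and merged-cell residue are easier to fix after parsing: drop all-null columns and rows with df.dropna(axis=1, how="all") and df.dropna(how="all").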
What DataStoryBot Handles Automatically
When you upload a CSV and call /analyze, the Code Interpreter inspects the data before running analysis. Here's what it typically handles without any guidance:
Null detection and reporting. The analysis narrative will mention columns with significant null rates: "The region column has 12% missing values, which were excluded from the regional breakdown."
Type coercion. Numeric columns with a few text values get coerced. Date columns get parsed. Boolean columns with various representations (true/false, 1/0, yes/no) get standardized.
Encoding fallback. UTF-8 → cp1252 → latin-1 cascade (latin-1 last, since it accepts any byte sequence).
Basic structural fixes. Stripping whitespace from column names, handling BOM characters at the start of files, detecting the delimiter.
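Two of those fixes are one-liners in pandas. A minimal sketch with a made-up BOM-prefixed file:

```python
import io
import pandas as pd

# Made-up file: UTF-8 BOM at the start plus padded column names
raw_bytes = "\ufeffregion , revenue\nNorth,100\n".encode("utf-8")

# encoding="utf-8-sig" strips the BOM; str.strip() cleans the headers
df = pd.read_csv(io.BytesIO(raw_bytes), encoding="utf-8-sig")
df.columns = df.columns.str.strip()
print(df.columns.tolist())  # ['region', 'revenue']
```

Without utf-8-sig, the BOM gets glued onto the first column name, so df["region"] raises a KeyError even though the column looks right when printed.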
When to Guide with Steering Prompts
Some messiness is ambiguous — the Code Interpreter can't know your intent without context.
Zeros that might be nulls:
steering = (
    "In the revenue column, zero values should be treated as "
    "missing data, not actual zero revenue. Exclude them from "
    "averages and aggregations."
)
Domain-specific null indicators:
steering = (
    "Values of 999 and 9999 in the temperature column are "
    "sensor error codes, not real temperatures. Treat them as nulls."
)
Date format specification:
steering = (
    "Dates are in DD/MM/YYYY format (European), not MM/DD/YYYY. "
    "March 4 is written as 04/03/2026."
)
Multiple tables in one file:
steering = (
    "The CSV has two sections separated by a blank row. "
    "The first section (rows 1-50) is monthly summary data. "
    "The second section (rows 52-200) is daily detail data. "
    "Analyze the daily detail data only."
)
When to Pre-Clean
Sometimes the data is messy enough that you should fix it before uploading. Pre-clean when:
The file won't parse at all. If the CSV has severely inconsistent delimiters, embedded newlines in unquoted fields, or binary garbage mixed in, no automated tool will fix it reliably. Open it in a text editor, fix the structural issues, then upload.
The cleaning logic is business-critical. If how you handle nulls directly affects the analysis conclusion (e.g., should missing survey responses be counted as neutral or excluded?), make that decision explicitly rather than letting the AI choose.
The file exceeds 50 MB. Large files with messiness take longer to clean inside the container because Code Interpreter explores the data to discover the issues. Pre-cleaning with a local pandas script is faster.
You need reproducible cleaning. If this is a recurring pipeline and the same cleaning must happen identically every time, write a script. DataStoryBot's Code Interpreter might handle the same messiness slightly differently on each run.
Quick Pre-Cleaning Script
import pandas as pd

def clean_csv(input_path, output_path):
    """Fix common CSV issues before uploading to DataStoryBot."""
    df = pd.read_csv(
        input_path,
        # latin-1 decodes any byte sequence; swap in the real encoding if you know it
        encoding="latin-1",
        na_values=["", "null", "NULL", "N/A", "n/a", "-", "--", "."],
    )

    # Strip whitespace from column names
    df.columns = df.columns.str.strip()

    # Coerce numeric columns
    for col in ["revenue", "quantity", "price"]:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")

    # Parse dates
    for col in ["date", "order_date", "created_at"]:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors="coerce")

    # Drop fully empty rows
    df.dropna(how="all", inplace=True)

    df.to_csv(output_path, index=False)
    print(f"Cleaned: {len(df)} rows, {len(df.columns)} columns")
    return output_path
Data Quality Audit Mode
Before running analysis, you can use DataStoryBot as a data quality tool — steer it to report on messiness rather than analyze the content:
stories = requests.post(f"{BASE_URL}/analyze", json={
    "containerId": container_id,
    "steeringPrompt": (
        "Focus on data quality assessment. For each column, report: "
        "data type, null count and percentage, unique value count, "
        "and any suspicious patterns (mixed types, unusual values, "
        "potential encoding issues). Do not analyze trends or "
        "patterns — just assess data quality."
    )
})
This returns story angles focused on quality issues rather than business insights:
[
  {
    "id": 1,
    "title": "3 Columns Have Over 15% Missing Data",
    "summary": "The region (18% null), discount_code (34% null), and referral_source (22% null) columns have significant missing data..."
  },
  {
    "id": 2,
    "title": "Revenue Column Contains 47 Non-Numeric Values",
    "summary": "The revenue column is predominantly numeric but contains 47 rows with values like 'pending', 'refunded', and 'cancelled'..."
  }
]
Run the quality audit first, fix the issues it identifies, then re-upload for the actual analysis. Two API calls instead of one, but you know your analysis is based on clean data.
What to Read Next
For the complete CSV analysis workflow (assuming clean data), see how to analyze a CSV file automatically.
For handling large files that need chunking or pre-aggregation, read CSV analysis at scale: processing large datasets via API.
For a comparison of manual pandas cleaning versus DataStoryBot's automated approach, see pandas vs. DataStoryBot.
Ready to find your data story?
Upload a CSV and DataStoryBot will uncover the narrative in seconds.
Try DataStoryBot →