Building a data quality monitor for CSV pipelines

Remember the taxi dashboard we built in Part 1? It works great — until it doesn't. One day the vendor changes timestamp formats. The next week, fare amounts arrive in cents instead of dollars. Your dashboard shows revenue down 98%, and it takes three days to realize the data was wrong, not the business.

In Part 2, we learned to handle these problems when we know about them upfront. But what about the problems we don't anticipate? What about catching them automatically, the moment they appear?

This is Part 3: building a data quality monitor that watches your CSV pipelines in real time.

The best time to catch a data quality issue is before it enters your analytics pipeline. The second best time is immediately after.

What we're protecting against

In Part 2, we encountered specific failure modes:

Mixed types — a user_id column with N/A or PENDING values mixed in.
Timestamp chaos — four different date formats in the same column.
Encoding drift — files switching from UTF-8 to Latin-1.
Leading zeros — ZIP codes parsed as integers, turning 01234 into 1234.

We fixed each with specific CsvSpecs configurations - but those fixes assume you know the problem exists. A proper monitor catches these issues before you've debugged them manually.

Let's build one using the taxi data from Part 1.

Step 1: Establish a baseline from good data

First, load the taxi data from Part 1 and capture what "normal" looks like:

from deephaven import read_csv, agg

# Load our known-good taxi data
trips = read_csv(
    "https://media.githubusercontent.com/media/deephaven/examples/main/Taxi/csv/taxi.csv"
)

# Capture baseline statistics for key columns
baseline = trips.agg_by(
    [
        agg.count_("row_count"),
        agg.min_("min_fare = fare_amount"),
        agg.max_("max_fare = fare_amount"),
        agg.avg("avg_fare = fare_amount"),
        agg.min_("min_tip = tip_amount"),
        agg.max_("max_tip = tip_amount"),
        agg.avg("avg_total = total_amount"),
    ]
)

This baseline captures what "normal" looks like — average fares, typical tip ranges, expected row counts. When new data arrives, we compare against these expectations.

Step 2: Define rules for the problems we've seen

Now let's encode validation rules that catch the specific issues from Part 2:

from deephaven import new_table
from deephaven.column import string_col, double_col
from deephaven.constants import NULL_DOUBLE

# Validation rules based on Part 2 failure modes
validation_rules = new_table(
    [
        string_col(
            "column_name",
            [
                "fare_amount",
                "fare_amount",
                "fare_amount",
                "tip_amount",
                "passenger_count",
                "passenger_count",
                "trip_distance",
                "PULocationID",
            ],
        ),
        string_col(
            "rule_type",
            [
                "min",
                "max",
                "cents_check",
                "min",
                "min",
                "max",
                "min",
                "not_null",
            ],
        ),
        double_col(
            "threshold",
            [
                0.0,  # fare_amount min
                500.0,  # fare_amount max
                0.5,  # fare_amount cents_check (avg > $0.50)
                0.0,  # tip_amount min
                1.0,  # passenger_count min
                6.0,  # passenger_count max
                0.0,  # trip_distance min
                NULL_DOUBLE,  # PULocationID not_null (threshold N/A)
            ],
        ),
        string_col(
            "description",
            [
                "Fare must be positive",
                "Fare over $500 is suspicious",
                "Average fare below $0.50 suggests cents-to-dollars error",
                "Tip cannot be negative",
                "Must have at least 1 passenger",
                "More than 6 passengers is suspicious",
                "Trip distance must be non-negative",
                "Pickup location is required",
            ],
        ),
    ]
)

Notice the cents_check rule — this catches the scenario from the intro where a vendor starts sending amounts in cents instead of dollars. If average fare suddenly drops to $0.12, something is wrong.

Step 3: Build real-time validation

Here's where Part 1's key insight pays off: the same code works for static and streaming data. Let's build validation that runs continuously on live taxi data:

from deephaven import time_table

# Simulate live taxi data arriving (same pattern as Part 1)
# Inject bad records: ~5% violations (negative values) and ~5% cents-instead-of-dollars
live_trips = (
    time_table("PT1S")
    .update(
        [
            "VendorID = (int)(Math.random() * 2) + 1",
            "IsBadRecord = Math.random() < 0.05",
            "IsCentsRecord = Math.random() < 0.05",
            "passenger_count = IsBadRecord ? (int)(Math.random() * 10) : (int)(Math.random() * 6) + 1",
            "trip_distance = IsBadRecord ? -1.0 : Math.round(Math.random() * 15 * 100) / 100.0",
            "base_fare = 2.5 + Math.abs(trip_distance) * 2.5",
            "fare_amount = IsBadRecord ? -5.0 : (IsCentsRecord ? base_fare / 100.0 : base_fare)",
            "tip_amount = Math.round(fare_amount * Math.random() * 0.25 * 100) / 100.0",
            "total_amount = fare_amount + tip_amount",
            "PULocationID = (int)(Math.random() * 263) + 1",
        ]
    )
    .drop_columns(["IsBadRecord", "IsCentsRecord", "base_fare"])
)

# Real-time validation: flag suspicious rows as they arrive
violations = live_trips.update(
    [
        # Flag based on our Part 2 learnings
        "FareIssue = fare_amount < 0 || fare_amount > 500",
        "TipIssue = tip_amount < 0",
        "PassengerIssue = passenger_count < 1 || passenger_count > 6",
        "DistanceIssue = trip_distance < 0",
        "HasIssue = FareIssue || TipIssue || PassengerIssue || DistanceIssue",
    ]
).where("HasIssue")

Violations table showing flagged records

Now violations is a live table that shows only problematic rows — and it updates automatically as new data arrives. This is the real-time monitoring we promised.

Step 4: Detect the cents-to-dollars bug

Remember the intro scenario? A vendor starts sending fares in cents instead of dollars. Here's how to catch that automatically by monitoring aggregate statistics:

from deephaven.updateby import rolling_avg_tick

# Calculate rolling statistics on recent data
rolling_stats = live_trips.update_by(
    [
        # Rolling average of last 10 fares (small window to catch issues fast)
        rolling_avg_tick(cols=["avg_fare_10 = fare_amount"], rev_ticks=10),
        rolling_avg_tick(cols=["avg_tip_10 = tip_amount"], rev_ticks=10),
    ]
)

# Flag when averages look wrong
anomalies = rolling_stats.update(
    [
        # If average fare drops below $5, something is wrong (cents instead of dollars)
        "FareAnomaly = avg_fare_10 < 5.0",
        # If average fare exceeds $100, also suspicious
        "FareSpike = avg_fare_10 > 100.0",
    ]
).where("FareAnomaly || FareSpike")

Anomalies table showing flagged records

If a vendor suddenly switches to cents, avg_fare_10 will plummet to ~$0.12 instead of ~$12. The anomalies table catches this immediately.

Step 5: Detect schema drift

In Part 2, we saw how columns can change types unexpectedly — a user_id column suddenly containing N/A values. While the previous steps monitor live data, schema drift happens at load time. Here's how to detect it:

from deephaven import new_table
from deephaven.column import string_col, double_col, int_col

# Define the expected schema as a reference table
expected_schema = new_table(
    [
        string_col("pickup_datetime", []),
        double_col("fare_amount", []),
        double_col("tip_amount", []),
        int_col("passenger_count", []),
    ]
)

# Simulate a "drifted" schema — fare_amount became a string (vendor added $ prefix)
drifted_schema = new_table(
    [
        string_col("pickup_datetime", []),
        string_col("fare_amount", []),  # Type changed!
        double_col("tip_amount", []),
        int_col("passenger_count", []),
        string_col("surcharge_type", []),  # New column added
    ]
)


def compare_schemas(expected, actual):
    """Compare two table schemas, return differences."""
    expected_cols = {c.name: str(c.data_type) for c in expected.columns}
    actual_cols = {c.name: str(c.data_type) for c in actual.columns}

    issues = []

    for col, dtype in expected_cols.items():
        if col not in actual_cols:
            issues.append(f"Missing column: {col}")
        elif expected_cols[col] != actual_cols[col]:
            issues.append(f"Type changed: {col} was {dtype}, now {actual_cols[col]}")

    for col in actual_cols:
        if col not in expected_cols:
            issues.append(f"New column: {col} ({actual_cols[col]})")

    return issues


# Compare and display drift
schema_drift = compare_schemas(expected_schema, drifted_schema)
print("Schema issues detected:", schema_drift)

This catches exactly the kind of issues we saw in Part 2 — a vendor adding a $ prefix to fares (changing the column from double to string), or adding unexpected columns.

Step 6: Build the monitoring dashboard

Now let's combine everything into a dashboard using deephaven.ui. We'll use ui.dashboard with panels so each table is resizable and scrollable:

from deephaven import ui

# Use ui.dashboard with panels for resizable, scrollable tables
dashboard = ui.dashboard(
    ui.column(
        ui.row(
            ui.panel(ui.table(live_trips.tail(50)), title="Recent Trips"),
            ui.panel(ui.table(violations.tail(50)), title="Violations"),
        ),
        ui.row(
            ui.panel(ui.table(anomalies.tail(20)), title="Statistical Anomalies"),
            ui.panel(ui.table(validation_rules), title="Active Validation Rules"),
        ),
    )
)

Data quality monitoring dashboard

This dashboard shows live data, live violations, and live anomalies — all updating in real time. When something goes wrong, you see it immediately.

Note

Our simulation injects ~5% violations (negative values) and ~5% cents-instead-of-dollars records to demonstrate both tables populating. In production, these tables only show data when real issues occur.

Step 7: Alert on issues

The final piece: triggering alerts when problems appear. Use Deephaven's table listeners to react to violations:

from deephaven.table_listener import listen

alert_log = []


def on_violation(update, is_replay):
    """Called when new violations appear."""
    if update.added_size > 0:
        # Log the alert (in production, send to Slack, PagerDuty, etc.)
        alert_log.append(f"{update.added_size} data quality violations detected!")
        print(f"ALERT: {update.added_size} data quality violations detected!")


# Monitor the violations table
handle = listen(violations, on_violation)

Now you're notified the moment bad data arrives — not days later when someone notices the dashboard looks wrong.

Connecting it all together

Here's the full picture of how this series fits together:

Part	Focus	Key Technique
Part 1	Load → Dashboard	`read_csv`, `agg_by`, `deephaven.ui`
Part 2	Handle problems	`CsvSpecs`, type overrides, encoding
Part 3 (this post)	Prevent problems	Real-time validation, anomaly detection

The techniques from Part 2 — CsvSpecs for null literals, type overrides for ZIP codes, charset for encoding — are your fix when you know what's wrong. The monitoring from Part 3 is how you discover what's wrong in the first place.

The payoff

Before: You load a CSV, run your reports, and discover three days later that revenue tanked because amounts were in cents.

After: A real-time monitor catches the anomaly within seconds. You get a Slack alert. You apply the appropriate CsvSpecs fix from Part 2. Your dashboard never shows bad data.

Series wrap-up

Over these three posts, we've covered:

From CSV to dashboard — Loading, transforming, and visualizing data in a live workflow.
Problem CSVs — Handling edge cases that break other tools.
Data quality monitoring (this post) — Building systems that watch your data for you.

The common thread: Deephaven treats CSVs not as static files to be loaded once, but as data sources that fit into a larger, real-time analytics pipeline. The same code works whether your data arrives as a file drop or a streaming feed.

Questions? Ideas for what we should cover next? Join us on Slack — we'd love to hear from you.

Part 3 of the CSV Mastery series: Catch problems before they spread