The CSV files Pandas can't handle

Part 2 of the CSV Mastery series: Conquering edge cases and problem data

February 10, 2026

Margaret Kennedy, Communications Director @Deephaven
AI prompt: A glowing CSV file icon wrapped in yellow caution tape and red warning triangles, with corrupted characters floating around it

Some CSV files just don't want to cooperate. You know the ones — the 40GB monster that crashes your notebook, the file with timestamps in three different formats, the export from a legacy system that uses backslashes as escape characters and Latin-1 encoding because it was written in 1998.

These are the files that turn a 30-minute task into a multi-day debugging session.

This is Part 2 of our CSV Mastery series. In Part 1, we built a real-time dashboard from a well-behaved CSV. Today, we're going to break things — or rather, show you how Deephaven's high-performance CSV reader handles the files that break everything else.

The true test of a CSV reader isn't how fast it handles clean data — it's whether it survives the data you're actually working with.

Problem 1: The file that doesn't fit in memory

Pandas loads your entire CSV into memory. For a 40GB file on a machine with 32GB of RAM, that's a non-starter.

Deephaven's CSV reader takes a different approach. It's column-oriented and multithreaded — parsing happens in parallel across all your cores, and the resulting table uses a compact columnar format that's far more memory-efficient than the equivalent Pandas DataFrame.

The big win is storage efficiency. Pandas falls back to per-cell Python objects for string and mixed-type columns, and that overhead adds up fast. Deephaven stores each column as a contiguous array of primitive values — no object wrappers, no per-cell memory tax. A 40GB CSV might become a 15GB Deephaven table, well within your 32GB machine's capacity.

Plus, multithreaded parsing means you're not waiting on a single core to churn through the file. All your cores work in parallel, so a file that takes Pandas 10 minutes to load might take Deephaven 90 seconds.
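There's nothing extra to configure for size alone. A minimal sketch (the path is just a placeholder):

```python
from deephaven import read_csv

# No special flags for big files: the reader parses in parallel and stores
# each column as a contiguous primitive array rather than Python objects.
ticks = read_csv("/data/historical_ticks.csv")
```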

When you'll hit this: Log files, historical tick data, IoT sensor dumps, anything from a system that's been running for years.

Problem 2: Mixed types that break type inference

Consider a column that looks like this:
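(The values below are illustrative; it's the mix that matters.)

```
quantity
100
250
N/A
310
PENDING
175
```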

Pandas will either choke or silently fall back to an object column of strings — losing numeric operations on the valid data. Deephaven's CSV reader uses two-phase type inference: it examines every value in a column before deciding on a type, not just the first few rows.

But what if you know those should be integers, and N/A and PENDING are errors you want to catch? You can customize the parsing:
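Here's one way to do it, sketched with the Python read_csv. The file name and column names are examples, and whether header= can name just the one column or must cover all of them is worth checking against your version's docs:

```python
from deephaven import read_csv
from deephaven import dtypes as dht

# Load the troublesome column as strings so nothing is silently dropped
# (header= maps column names to explicit types).
raw = read_csv("orders.csv", header={"quantity": dht.string})

# Convert clean digit strings to ints; everything else becomes Deephaven's
# null marker (NULL_INT), so the valid data stays numeric.
orders = raw.update(
    "quantity_int = quantity.matches(`\\d+`) ? Integer.parseInt(quantity) : NULL_INT"
)
```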

Now you have proper integers with null markers, and you can query for the problematic rows:
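Continuing that sketch, isNull is a built-in query-language function:

```python
# Rows where the conversion came back null are the ones that need a human.
problems = orders.where("isNull(quantity_int)")
```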

When you'll hit this: Any CSV that comes from human data entry, legacy system exports, or data that's been through multiple ETL processes.

Problem 3: Timestamps in multiple formats

Real-world timestamp columns are a nightmare:
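(Again, illustrative values; the point is the mix of formats.)

```
created_at
2026-02-10T14:30:00Z
02/10/2026 14:30
10-Feb-2026 2:30 PM
1770733800
```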

Four rows, four formats. Pandas will turn this into strings and leave you to sort it out.

Deephaven's ISO datetime parser handles the most common formats automatically. If your data uses a consistent non-ISO format, you can specify it directly:
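Here's a sketch of one way to pin a single format down: keep the column as a string at read time, then convert it with an explicit Java DateTimeFormatter in an update. The column name, pattern, and time zone are assumptions about your data:

```python
from deephaven import read_csv
from deephaven import dtypes as dht

# Keep the column as a string at read time, then apply one explicit format.
raw = read_csv("events.csv", header={"created_at": dht.string})

events = raw.update(
    "created = java.time.LocalDateTime.parse(created_at,"
    " java.time.format.DateTimeFormatter.ofPattern(`MM/dd/yyyy HH:mm`))"
    " .atZone(java.time.ZoneId.of(`UTC`)).toInstant()"
)
```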

But what about truly chaotic data — multiple formats in the same column? Load it as strings first, then handle each format explicitly:
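A sketch of that, continuing with the illustrative column above. The regexes and patterns are assumptions about your data; add or adjust branches to match the formats you actually have:

```python
from deephaven import read_csv
from deephaven import dtypes as dht

raw = read_csv("events.csv", header={"created_at": dht.string})

# Match each known format with a regex, parse it the appropriate way,
# and fall through to null for anything unrecognized. Add more branches
# for the remaining formats in your data as needed.
events = raw.update(
    "created ="
    " created_at.matches(`\\d{4}-\\d{2}-\\d{2}T.*`) ? parseInstant(created_at)"
    " : created_at.matches(`\\d+`) ? epochSecondsToInstant(Long.parseLong(created_at))"
    " : created_at.matches(`\\d{2}/\\d{2}/\\d{4} .*`)"
    "   ? java.time.LocalDateTime.parse(created_at,"
    "       java.time.format.DateTimeFormatter.ofPattern(`MM/dd/yyyy HH:mm`))"
    "       .atZone(java.time.ZoneId.of(`UTC`)).toInstant()"
    " : null"
)
```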

That ternary chain looks intimidating, but it's also explicit: you can see exactly how each format is handled, and when something fails, you know where to look. Compare that to Pandas' pd.to_datetime(df['created_at'], infer_datetime_format=True), which silently guesses and silently fails.

When you'll hit this: Merged datasets from multiple sources, data from systems in different locales, anything involving user-submitted dates.

Problem 4: Non-UTF-8 encodings

The file looks like garbage because it's encoded in Latin-1, Windows-1252, or some other legacy encoding:
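(Illustrative rendering of the kind of mangling you'll see:)

```
customer,city
M�ller GmbH,K�ln
Fran�ois Dupr�,Orl�ans
```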

Deephaven supports explicit charset specification:
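With the Python read_csv, that's the charset argument. The file name below is a placeholder, and the charset strings are standard Java charset names:

```python
from deephaven import read_csv

# Name the file's real encoding instead of letting it default to UTF-8.
customers = read_csv("legacy_export.csv", charset="ISO-8859-1")   # Latin-1

# Windows-1252 is the usual suspect for old Windows exports:
# customers = read_csv("legacy_export.csv", charset="windows-1252")
```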

When you'll hit this: Exports from European systems, legacy databases, anything older than 2010.

Problem 5: Escape characters and malformed quoting

Some CSVs use backslash escaping instead of RFC 4180 double-quote escaping:
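(Illustrative rows:)

```
id,comment
1,"He said \"thanks\" and hung up"
2,"Files live in C:\\exports\\daily"
```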

Deephaven now supports custom escape characters:
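Roughly like this, though note that the option name below is my placeholder, not a confirmed parameter; check the read_csv and CsvSpecs documentation for your version for the exact spelling:

```python
from deephaven import read_csv

# NOTE: "escape_char" is an assumed parameter name, used here for illustration.
# Consult your version's read_csv / CsvSpecs docs for the real option.
comments = read_csv("mysql_export.csv", escape_char="\\", quote='"')
```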

When you'll hit this: Exports from MySQL, some ERP systems, hand-edited CSVs.

Problem 6: Per-column type overrides

Sometimes the automatic inference is almost right, but you need to override specific columns:
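A sketch with the header= type mapping. The column names are examples, and whether you can list only the overrides or must cover every column is worth checking against your version's docs:

```python
from deephaven import read_csv
from deephaven import dtypes as dht

accounts = read_csv(
    "accounts.csv",
    header={
        "account_id": dht.string,  # looks numeric, but it's an identifier
        "zip_code": dht.string,    # keep the leading zeros
        "phone": dht.string,
        "balance": dht.double,
    },
)
```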

This is critical for columns like ZIP codes, phone numbers, and account IDs that look numeric but shouldn't be treated as numbers.

When you'll hit this: Any data with identifiers that have leading zeros, or when you need precise control over memory usage.

Problem 7: Missing or misplaced headers

Not every CSV has a header row. Some have the header on line 3 after two rows of metadata. Some have no header at all.
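For the no-header case, the sketch below tells the reader the file is headless and supplies names and types explicitly (column names are examples):

```python
from deephaven import read_csv
from deephaven import dtypes as dht

readings = read_csv(
    "sensor_dump.csv",
    headless=True,                 # no header row in the file
    header={                       # so provide names and types yourself
        "sensor_id": dht.string,
        "reading": dht.double,
        "recorded_at": dht.string,
    },
)
```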

If there's junk before the header, you can skip leading rows:
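Something along these lines, with the caveat that the row-skipping option name below is my assumption; check your version's read_csv and CsvSpecs docs for the exact knob:

```python
from deephaven import read_csv

# NOTE: "skip_rows" is an assumed parameter name for skipping leading rows --
# verify the exact option against your version's documentation.
report = read_csv("instrument_export.csv", skip_rows=2)
```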

When you'll hit this: Sensor dumps, legacy report exports, files generated by scientific instruments, anything where a human added "helpful" context at the top.

The compound problem

Real-world files often have multiple issues. Here's a kitchen-sink example:
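A sketch of what that can look like, pulling the earlier options together. The escape-character and row-skipping options from Problems 5 and 7 would slot in here too, under whatever names your version exposes:

```python
from deephaven import read_csv
from deephaven import dtypes as dht

legacy = read_csv(
    "legacy_dump.csv",
    charset="windows-1252",          # Problem 4: legacy encoding
    headless=True,                   # Problem 7: no usable header row
    header={                         # Problems 2 and 6: explicit names and types
        "account_id": dht.string,    # identifier, keep leading zeros
        "created_at": dht.string,    # Problem 3: parse timestamps explicitly after load
        "quantity": dht.string,      # mixed values, convert with update() afterwards
        "amount": dht.double,
    },
)
```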

One declaration, all problems handled.


The pattern here is consistent: Deephaven's defaults handle the common cases intelligently, but when your data doesn't fit the mold, you have precise control over every aspect of parsing. You're not fighting the tool — you're configuring it.

This matters because data problems compound. A file with encoding issues and mixed types and malformed quoting isn't three times harder than a file with one problem — it's ten times harder, because each issue masks the others. Having a single, composable configuration object (CsvSpecs) that addresses all of them at once is the difference between a 20-minute fix and a two-day nightmare.

What's next

You now have a toolkit for the files that make other tools give up. But so far we've been reactive — dealing with problems as we encounter them. What if you could catch data quality issues before they contaminate your analytics?

In Part 3, we'll build a data quality monitor that validates incoming CSVs automatically, flags anomalies, and alerts you before bad data propagates downstream.

Have a CSV horror story? Share it on Slack — we'd love to hear what you're dealing with.