Data-driven bracket picks: March Madness analytics before tip-off

Your bracket pool has 50 entries. Everyone picks the 1-seeds to the Final Four. The chalk brackets pile up, and when Duke wins it all, you split the pot 20 ways.

The smart money isn't on picking winners — it's on picking different winners. Upsets that others miss. Value picks where the data says the seed is wrong.

Let's use Deephaven to analyze historical tournament data, calculate upset probabilities, and build a bracket strategy that maximizes your edge.

The best bracket isn't the most accurate. It's the most accurate where others are wrong.

Historical upset rates

First question: how often do upsets actually happen? Let's load historical tournament results and find out.

This example uses data from the deephaven/examples repo, sourced from Kaggle: March Madness Data by nishaanamin. Deephaven can read CSVs directly from URLs:

from deephaven import read_csv

# Load historical seed performance (2008-2025)
seed_results = read_csv(
    "https://media.githubusercontent.com/media/deephaven/examples/refs/heads/main/MarchMadness/csv/Seed%20Results.csv"
)

This dataset shows how each seed has performed historically — wins, losses, and advancement rates by round. The original CSV has a WIN% column, but Deephaven sanitizes it to WIN (special characters are removed):

# View upset-prone seeds (lower win rates = more upsets against them)
# Note: WIN% in the CSV becomes WIN after sanitization
upset_rates = seed_results.view(["SEED", "GAMES", "W", "L", "WIN"]).sort("SEED")

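The view operation selects specific columns from a table, and sort orders the rows. Look at the results: 5-seeds win only about 53% of their games, while 12-seeds win about 35%. Those figures cover every game each seed plays, not just the first-round meeting, but they already show that a 5-seed is closer to a coin flip than its seed line suggests.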

The 12-vs-5 matchup is notorious for a reason — historically, 12-seeds win about 35% of the time. That's not a fluke; it's a pattern. The 5-seed is often an overrated power conference team, while the 12-seed is a hot mid-major.
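
To see why a 35% underdog amounts to a pattern rather than a fluke, it helps to work through the arithmetic for a whole bracket. The sketch below assumes four 12-vs-5 games per tournament, treats them as independent, and applies the roughly 35% figure to each one. The independence assumption and the variable names are simplifications for illustration, not something taken from the dataset.

```python
# Rough arithmetic for 12-vs-5 upsets across one tournament.
# Assumptions (not from the dataset): 4 independent games, each with
# the ~35% upset chance quoted above.
UPSET_PROB = 0.35
GAMES_PER_TOURNAMENT = 4

# Expected number of 12-seed wins in the first round
expected_upsets = GAMES_PER_TOURNAMENT * UPSET_PROB

# P(at least one 12-seed wins) = 1 - P(all four 5-seeds win)
p_at_least_one = 1 - (1 - UPSET_PROB) ** GAMES_PER_TOURNAMENT

print(f"Expected 12-over-5 upsets: {expected_upsets:.1f}")
print(f"Chance of at least one:    {p_at_least_one:.0%}")
```

Under these assumptions you would expect about 1.4 such upsets per tournament, with roughly an 82% chance of seeing at least one. A bracket with no 12-over-5 pick is betting against the more likely outcome.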

Finding value picks

Upset rates tell you what happens on average. But this year's bracket has specific teams. We need to identify where the seed doesn't match the talent — teams that are better than their seeding suggests.

# Load team efficiency metrics (KenPom/BartTorvik data)
kenpom_raw = read_csv(
    "https://media.githubusercontent.com/media/deephaven/examples/refs/heads/main/MarchMadness/csv/KenPom%20Barttorvik.csv"
)

# Filter to 2025 tournament teams and select key columns
# KADJ EM RANK in CSV becomes KADJ_EM_RANK after sanitization (spaces → underscores)
team_metrics = (
    kenpom_raw.where("YEAR == 2025 && ROUND < 68")
    .view(["TEAM", "SEED", "KADJ_EM_RANK", "BARTHAG", "GAMES", "W"])
    .sort("SEED")
)

The where operation filters rows — here we keep only 2025 tournament teams (ROUND < 68 means they made the 64-team field, excluding play-in losers). We then view just the columns we need. Notice how KADJ EM RANK from the CSV is referenced as KADJ_EM_RANK — Deephaven automatically sanitizes column names.

Tip

CSV column name sanitization: When Deephaven reads CSVs, column names are automatically sanitized to be valid identifiers — spaces become underscores, and special characters are removed. So KADJ EM RANK becomes KADJ_EM_RANK, and WIN% becomes WIN. For other CSV options like custom delimiters or encodings, see CsvSpecs.
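
If you want to predict what a header will look like before loading a file, the rule described above can be approximated in plain Python. This is a sketch of the two behaviors mentioned here (spaces to underscores, other special characters dropped), not Deephaven's actual implementation, which may handle additional cases such as duplicate or reserved names. The function name sanitize_column_name is made up for this example.

```python
import re


def sanitize_column_name(name: str) -> str:
    """Approximate the renaming described above; not Deephaven's own code."""
    # Spaces become underscores
    with_underscores = name.strip().replace(" ", "_")
    # Drop anything that is not a letter, digit, or underscore
    return re.sub(r"[^A-Za-z0-9_]", "", with_underscores)


for raw in ["KADJ EM RANK", "WIN%", "SEED"]:
    print(f"{raw!r} -> {sanitize_column_name(raw)!r}")
```

When in doubt, inspect the loaded table's column names directly rather than relying on an approximation like this one.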

Now let's find teams where the power ratings suggest they're better than their seed:

# Find "underseeded" teams: low KADJ_EM_RANK (better) relative to their seed
# KADJ_EM_RANK is KenPom's adjusted efficiency margin ranking (lower = better)
value_picks = (
    team_metrics.update(
        [
            # Expected rank based on seed, as a rough linear heuristic (1-seed ~ rank 16, 16-seed ~ rank 256)
            "ExpectedRank = SEED * 16",
            # Value = how much better their rank is than expected (positive = underseeded)
            "ValueGap = ExpectedRank - KADJ_EM_RANK",
            # Flag teams ranked significantly better than their seed suggests
            "IsUnderseeded = ValueGap > 20",
        ]
    )
    .where("SEED >= 9")  # Focus on lower seeds (potential upset picks)
    .sort_descending("ValueGap")
)

The update operation adds new calculated columns to each row. Here we're computing how much better each team's KenPom ranking is than their seed suggests — a 12-seed ranked like a typical 6-seed has serious upset potential. Teams at the top of this list are your value picks.
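
To make the arithmetic concrete, here is the same calculation in plain Python for two made-up teams. The function mirrors the three columns in the query (SEED * 16, the gap, and the threshold of 20); the team names and ranks are hypothetical and not drawn from the dataset.

```python
def value_gap(seed: int, kadj_em_rank: int,
              rank_per_seed: int = 16, threshold: int = 20):
    """Mirror the ExpectedRank / ValueGap / IsUnderseeded columns for one team."""
    expected_rank = seed * rank_per_seed
    gap = expected_rank - kadj_em_rank
    return expected_rank, gap, gap > threshold


# Hypothetical teams as (seed, KenPom rank); not taken from the dataset
examples = {"Team A": (12, 45), "Team B": (14, 230)}
for name, (seed, rank) in examples.items():
    expected, gap, underseeded = value_gap(seed, rank)
    print(f"{name}: expected rank {expected}, gap {gap}, underseeded={underseeded}")
```

Team A comes out with an expected rank of 192 and a gap of 147, so it is flagged; Team B has an expected rank of 224 and a gap of -6, so it is not. Note that with this linear mapping a 12-seed clears the threshold at any rank better than 172, so the ordering by ValueGap tells you more than the flag alone.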

Look for games where:

Historical upset rate is high (12-vs-5, 11-vs-6)
Efficiency gap is small — the lower seed is better than their seeding suggests
Schedule strength differs — a mid-major with a weak schedule but high efficiency often gets underseeded

Pool strategy: contrarian picks

Here's where game theory enters. If everyone in your pool picks Duke, you don't gain anything when Duke wins — you just keep pace. You gain when you're right where others are wrong.

# Load public pick percentages (how often the public picks each team to advance)
public_picks = read_csv(
    "https://media.githubusercontent.com/media/deephaven/examples/refs/heads/main/MarchMadness/csv/Public%20Picks.csv"
)

# R64 column = percentage picking team to win first round game
public_picks_2025 = public_picks.where("YEAR == 2025").view(
    ["TEAM", "R64", "R32", "S16", "F4"]
)

Now we combine our value picks with public sentiment to find contrarian opportunities. The join operation merges two tables based on a matching column — in this case, team name:

# Join value picks with public pick rates (inner join on team name)
contrarian_analysis = value_picks.join(
    public_picks_2025,
    on=["TEAM"],
    joins=["PublicR64 = R64"],  # bring over only R64, renamed to PublicR64
).update(
    [
        # Parse percentage string to number
        "PublicPct = Double.parseDouble(PublicR64.replace(`%`, ``))",
        # Contrarian edge: underseeded teams the public is ignoring
        "ContrarianEdge = (IsUnderseeded && PublicPct < 40) ? `HIGH` : (IsUnderseeded ? `MEDIUM` : `LOW`)",
    ]
)

# Best contrarian picks: underseeded teams with low public support
best_contrarian = (
    contrarian_analysis.where("ContrarianEdge != `LOW`")
    .view(["TEAM", "SEED", "KADJ_EM_RANK", "ValueGap", "PublicPct", "ContrarianEdge"])
    .sort_descending("ValueGap")
)
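
The nested ternary in the ContrarianEdge column can be hard to read at a glance. The plain-Python sketch below spells out the same three-way split, along with the percentage parsing. It assumes the public pick values are strings ending in a percent sign, as the query implies; the function names and sample inputs are hypothetical.

```python
def parse_pct(text: str) -> float:
    """Turn a string such as '28.5%' into 28.5."""
    return float(text.replace("%", "").strip())


def contrarian_edge(is_underseeded: bool, public_pct: float,
                    cutoff: float = 40.0) -> str:
    """Same three-way split as the ContrarianEdge column."""
    if is_underseeded and public_pct < cutoff:
        return "HIGH"
    if is_underseeded:
        return "MEDIUM"
    return "LOW"


# Hypothetical inputs for illustration
print(contrarian_edge(True, parse_pct("28%")))   # HIGH
print(contrarian_edge(True, parse_pct("55%")))   # MEDIUM
print(contrarian_edge(False, parse_pct("12%")))  # LOW
```

A team that is not flagged as underseeded always lands in LOW, no matter how few brackets pick it, so the public percentage only separates HIGH from MEDIUM.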

The 12-seed with 35% win probability that only 28% of brackets pick? That's your edge. You're not just betting on an upset. You're betting on an upset that separates you from the field.
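
One rough way to put a number on that edge is to divide a team's win probability by the share of brackets picking it. The name leverage and the function below are illustrative, not part of Deephaven or the dataset. The 65% and 72% figures for the 5-seed are simply the complements of the numbers above, assuming every bracket picks one of the two teams.

```python
def leverage(win_prob: float, public_share: float) -> float:
    """Ratio of a team's win probability to the share of brackets picking it."""
    if not 0 < public_share <= 1:
        raise ValueError("public_share must be in (0, 1]")
    return win_prob / public_share


# Figures from the example above: a 12-seed at 35% to win, picked by 28% of brackets
underdog = leverage(0.35, 0.28)
# The other side of the same game: a 5-seed at 65% to win, picked by 72%
favorite = leverage(0.65, 0.72)

print(f"12-seed leverage: {underdog:.2f}")
print(f"5-seed leverage:  {favorite:.2f}")
```

A ratio above 1 means the pool is picking the team less often than it wins; below 1 means the pick is crowded. Here the 12-seed comes out at 1.25 and the 5-seed at about 0.90, even though the 5-seed is still the more likely winner.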

Building your bracket

Putting it together:

# Final recommendations: combine all our analysis
final_recommendations = (
    contrarian_analysis.update(
        [
            "Recommendation = ContrarianEdge == `HIGH` ? `STRONG UPSET PICK` : (ContrarianEdge == `MEDIUM` ? `CONSIDER UPSET` : `PICK CHALK`)",
        ]
    )
    .view(["TEAM", "SEED", "KADJ_EM_RANK", "ValueGap", "PublicPct", "Recommendation"])
    .sort("SEED")
)

Your data-driven bracket strategy:

Lock in the chalk where efficiency gaps are large (1-vs-16, 2-vs-15).
Take calculated upsets in 12-vs-5, 11-vs-6 where efficiency says the game is closer than the seed.
Maximize contrarian value — pick upsets where the public is underweighting the lower seed.

What data can't tell you

Deephaven gives you tools to analyze data — not a crystal ball. We're not promising you'll win your bracket pool. March Madness is called "madness" for a reason.

A few caveats:

Injuries matter — a last-minute injury to a star player changes everything.
Hot streaks are real — a team peaking at the right time can outperform their season metrics.
Experience counts — tournament experience (coaches and players) doesn't show up in efficiency ratings.
Luck is real — a bouncing ball, a bad call, a cold shooting night. The best team doesn't always win.

Use the data as a starting point, then layer in your basketball knowledge. Watch the games. Trust your gut on the close calls. And remember: even the best models get it wrong.

Data gets you to the final three picks. Intuition picks the winner.

Next steps

Get started with Deephaven
Learn about aggregations
Part 2: Real-time March Madness analytics — track your bracket during the tournament.

Questions about analytics? Find us on Slack.
