AI-powered baseball analytics: Natural language queries on Statcast data

Opening Day is here. Somewhere around the third inning, the questions start piling up — who's sitting on 99 tonight? Why is that pitcher's spin rate down from last season? Can Ohtani actually hit that high slider?

The data exists. Statcast tracks every pitch, every swing, every batted ball in granular detail. But turning a question into an answer usually means writing queries, joining tables, and knowing which of the 90-something columns you actually need. Most fans and even many analysts don't make it that far.

This post shows a different path: load Statcast data into Deephaven, connect Claude to it via MCP, and ask questions in plain English. The agent reads the schema, writes the query, executes it, and returns the answer. You stay focused on the baseball.

"Who throws the fastest fastball in the AL East?" — That's the whole query. The AI handles the rest.

What you'll need

Two things before you start:

Claude Desktop (or Cursor or Windsurf) — any AI tool with MCP support works. This post uses Claude Desktop.
About 15 minutes — that's the realistic setup time start to finish.

The architecture is simple: Deephaven holds and queries the data. Claude figures out what to query and how to express it. MCP is the bridge between the two. We'll set these up in order.

Part 1: Get Deephaven running

Note

The code in this post runs in a Deephaven console. If you don't have Deephaven running yet, see the Quickstart guide — Docker or pip, under five minutes. Once it's running, install the extra dependency inside your environment:

pip install pybaseball

Open http://localhost:10000 to confirm Deephaven is running. You'll see the browser-based console where you'll run the Python code in this post.

Part 2: Load the data

In that console, run the following. This pulls every pitch from the first week of the 2026 season:

from pybaseball import statcast
from deephaven import pandas as dhpd

statcast_df = statcast(start_dt="2026-03-27", end_dt="2026-04-03")
pitches = dhpd.to_table(statcast_df)

Statcast Data in Deephaven

pitches is now a Deephaven table with 90+ columns per pitch: velocity, spin rate, movement, launch angle, exit velocity, pitch location, game state, and more. Working with all 90 columns at once is unwieldy, so create a focused view of the ones you'll actually query:

pitches_core = pitches.view(
    [
        "game_date",
        "pitch_type",  # FF, SL, CH, etc.
        "release_speed",  # Velocity (mph)
        "release_spin_rate",  # Spin rate (rpm)
        "pfx_x",  # Horizontal movement
        "pfx_z",  # Vertical movement
        "launch_speed",  # Exit velocity (mph)
        "launch_angle",  # Degrees
        "events",  # Outcome: strikeout, home_run, etc.
        "player_name",  # Pitcher, formatted "Last, First"
        "batter",  # Batter's MLBAM ID (numeric), not a name
        "home_team",
        "away_team",
    ]
)

pitches_core table

view creates a lightweight reference without copying data — good for exploration when the source dataset is large.

Adding division context

The raw Statcast data has home_team and away_team but no division information. To answer questions like "which AL East pitchers throw hardest," you need to derive the pitching team and join it against a reference table. Build that reference now, in the same Deephaven console:

Teams reference table

from deephaven import new_table
from deephaven.column import string_col

teams = new_table(
    [
        string_col(
            "team",
            # Codes need to match Statcast's home_team/away_team values exactly;
            # verify with pitches.select_distinct(["home_team"]).
            [
                "NYY",
                "BOS",
                "TB",
                "TOR",
                "BAL",
                "CLE",
                "MIN",
                "DET",
                "CWS",
                "KC",
                "HOU",
                "SEA",
                "TEX",
                "LAA",
                "ATH",
                "ATL",
                "NYM",
                "PHI",
                "MIA",
                "WSH",
                "MIL",
                "CHC",
                "STL",
                "CIN",
                "PIT",
                "LAD",
                "SD",
                "SF",
                "AZ",
                "COL",
            ],
        ),
        string_col(
            "division",
            [
                "AL East",
                "AL East",
                "AL East",
                "AL East",
                "AL East",
                "AL Central",
                "AL Central",
                "AL Central",
                "AL Central",
                "AL Central",
                "AL West",
                "AL West",
                "AL West",
                "AL West",
                "AL West",
                "NL East",
                "NL East",
                "NL East",
                "NL East",
                "NL East",
                "NL Central",
                "NL Central",
                "NL Central",
                "NL Central",
                "NL Central",
                "NL West",
                "NL West",
                "NL West",
                "NL West",
                "NL West",
            ],
        ),
        string_col(
            "league",
            [
                "AL",
                "AL",
                "AL",
                "AL",
                "AL",
                "AL",
                "AL",
                "AL",
                "AL",
                "AL",
                "AL",
                "AL",
                "AL",
                "AL",
                "AL",
                "NL",
                "NL",
                "NL",
                "NL",
                "NL",
                "NL",
                "NL",
                "NL",
                "NL",
                "NL",
                "NL",
                "NL",
                "NL",
                "NL",
                "NL",
            ],
        ),
    ]
)

With the reference table built, derive the pitching team and join:

pitches_with_teams = pitches.update_view(
    "pitching_team = (inning_topbot == `Top`) ? home_team : away_team"
).natural_join(teams, on=["pitching_team=team"], joins=["division", "league"])

Now the AI can answer division-specific questions without you spelling out join logic every time you ask one.
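The half-inning rule behind pitching_team is easy to get backwards: the home team pitches in the top half, the visitors in the bottom. The plain-Python sketch below restates that rule and adds a sanity check for team codes that have no match in the reference table, which natural_join would otherwise leave as silent nulls. The helper names (pitching_team, unmatched_codes) and the sample rows are illustrative assumptions, not part of Deephaven or pybaseball.

```python
def pitching_team(inning_topbot, home_team, away_team):
    """Return the team in the field: home pitches the top half, away the bottom."""
    return home_team if inning_topbot == "Top" else away_team


def unmatched_codes(rows, known_teams):
    """Collect pitching-team codes that are missing from the reference table."""
    seen = {
        pitching_team(r["inning_topbot"], r["home_team"], r["away_team"])
        for r in rows
    }
    return sorted(seen - set(known_teams))


# Hypothetical sample rows; "XYZ" stands in for a code the reference table lacks.
sample_rows = [
    {"inning_topbot": "Top", "home_team": "NYY", "away_team": "BOS"},
    {"inning_topbot": "Bot", "home_team": "NYY", "away_team": "BOS"},
    {"inning_topbot": "Top", "home_team": "XYZ", "away_team": "TOR"},
]
known = ["NYY", "BOS", "TOR"]

print(unmatched_codes(sample_rows, known))  # ['XYZ']
```

Team abbreviations vary between data providers, so it is worth running the equivalent check in Deephaven, for example pitches_with_teams.where("isNull(division)").select_distinct(["pitching_team"]), and adjusting the teams table if any codes show up.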

Filtering by team in the UI

For quick team-specific exploration, create a dropdown filter source:

team_filter = pitches_with_teams.select_distinct(["home_team"])

Open Controls > Dropdown Filter in the Deephaven IDE, set home_team as both the source and filter column, and any table with a home_team column filters to your chosen team instantly.

Dropdown filter configuration

Part 3: Connect Claude

The data is loaded and structured. Now connect Claude Desktop so it can query that data.

Install Deephaven MCP

Install the MCP package:

pip install "deephaven-mcp[core]"

This is the MCP server that runs on your machine and bridges Claude to your Deephaven session.

Configure Claude Desktop

Open Claude Desktop's configuration file. On macOS it lives at ~/Library/Application Support/Claude/claude_desktop_config.json. Add this block:

{
  "mcpServers": {
    "deephaven-systems": {
      "command": "/path/to/.venv/bin/dh-mcp-systems-server",
      "args": [],
      "env": {
        "DH_MCP_CONFIG_FILE": "/path/to/deephaven_mcp.json"
      }
    }
  }
}

Create deephaven_mcp.json and point it at your running Deephaven session. Use the PSK token from your Deephaven startup logs (Docker) or your server configuration (pip):

{
  "community": {
    "sessions": {
      "local": {
        "host": "localhost",
        "port": 10000,
        "auth_type": "io.deephaven.authentication.psk.PskAuthenticationHandler",
        "auth_token": "YOUR_PSK_TOKEN"
      }
    }
  }
}

If you started Deephaven with anonymous authentication instead of a PSK, set "auth_type": "Anonymous" and omit auth_token. Full configuration options are in the MCP setup guide.
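If you switch between PSK and anonymous sessions, generating the file can be less error-prone than editing it by hand. The sketch below builds the same structure shown above; session_config is a hypothetical helper, and the keys and values are taken directly from the example in this post rather than from a full reading of the MCP configuration schema.

```python
import json


def session_config(host="localhost", port=10000, psk_token=None):
    """Build the deephaven_mcp.json structure for one local session.

    With a PSK token, use the PSK handler; without one, fall back to
    anonymous auth and leave auth_token out entirely.
    """
    session = {"host": host, "port": port}
    if psk_token:
        session["auth_type"] = (
            "io.deephaven.authentication.psk.PskAuthenticationHandler"
        )
        session["auth_token"] = psk_token
    else:
        session["auth_type"] = "Anonymous"
    return {"community": {"sessions": {"local": session}}}


anonymous_config = session_config()
psk_config = session_config(psk_token="YOUR_PSK_TOKEN")

# Print the anonymous variant; write it to deephaven_mcp.json if it looks right.
print(json.dumps(anonymous_config, indent=2))
```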

Restart Claude Desktop after saving the config. You'll know the connection is active when Deephaven tools appear in Claude's interface.

Asking questions

With Deephaven holding the data and Claude connected to it, you can type questions instead of writing queries. Claude reads the table schema, figures out how to express the question as a Deephaven operation, executes it, and returns the result.

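The conversations below show what the interaction looks like. The player names and numbers are illustrative; your results will reflect the real Statcast data you loaded. The spin-rate and year-over-year questions also need more history than the one-week sample from Part 2, so widen the start_dt and end_dt range before trying them.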

"Who throws the fastest fastball in the AL East?"

You: Who throws the fastest fastball in the AL East?

Claude: Looking at pitches_with_teams for four-seam fastballs in the AL East...

from deephaven import agg

fastest_al_east = (
    pitches_with_teams.where("pitch_type == `FF` && division == `AL East`")
    .agg_by([agg.max_("max_velo=release_speed")], by=["player_name"])
    .sort_descending("max_velo")
    .head(10)
)

Result:

player_name	max_velo
Félix Bautista	101.8
Aroldis Chapman	100.4
Gerrit Cole	100.2

Félix Bautista tops the list at 101.8 mph. Claude mapped "fastest fastball" to max release_speed on pitch_type = 'FF', and "AL East" to a division filter on the joined table — that join setup from Part 2 is doing real work here.

"Show me pitchers whose spin rate dropped this month."

You: Show me pitchers whose spin rate dropped significantly in the last month.

Claude: Comparing average spin rates — last 30 days versus the 30 days before...

Result:

player_name	recent_spin	prior_spin	change_pct
Corbin Burnes	2,412	2,589	-6.8%
Dylan Cease	2,198	2,341	-6.1%

That's a multi-step query — time windows, per-pitcher aggregations, percentage calculations — expressed as a single sentence.
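To make those steps concrete, here is a minimal plain-Python sketch of the two-window comparison. It is not the query Claude generates; spin_change and the sample readings are made up for illustration, and the 30-day window boundaries are one reasonable interpretation of "the last month."

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean


def spin_change(rows, as_of, window_days=30):
    """Compare each pitcher's mean spin in the latest window with the one before.

    rows: iterable of (player_name, game_date, release_spin_rate) tuples.
    Returns {player_name: (recent_mean, prior_mean, change_pct)} for pitchers
    with readings in both windows.
    """
    recent_start = as_of - timedelta(days=window_days)
    prior_start = as_of - timedelta(days=2 * window_days)
    recent, prior = defaultdict(list), defaultdict(list)
    for name, game_date, spin in rows:
        if spin is None:
            continue  # skip pitches with no spin reading
        if recent_start < game_date <= as_of:
            recent[name].append(spin)
        elif prior_start < game_date <= recent_start:
            prior[name].append(spin)
    result = {}
    for name in recent.keys() & prior.keys():
        r, p = mean(recent[name]), mean(prior[name])
        result[name] = (r, p, (r - p) * 100 / p)
    return result


# Made-up readings for a hypothetical pitcher, not real Statcast values.
rows = [
    ("Pitcher A", date(2026, 5, 20), 2400.0),
    ("Pitcher A", date(2026, 5, 28), 2300.0),
    ("Pitcher A", date(2026, 4, 20), 2500.0),
    ("Pitcher A", date(2026, 4, 25), 2500.0),
]
changes = spin_change(rows, as_of=date(2026, 5, 31))
print(changes["Pitcher A"])  # (2350.0, 2500.0, -6.0)
```

Pitchers who appear in only one window are dropped, which avoids reporting a "change" for someone who simply did not pitch in the earlier period.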

"Which hitters struggle against high sliders?"

You: Which hitters struggle most against high sliders?

Claude: Filtering for sliders in the upper part of the zone and above (plate_z > 2.5 ft), then calculating whiff rate and batting average...

Result:

batter_name	pitches_seen	whiff_rate	ba
Joey Gallo	47	52.3%	.043
Kyle Schwarber	38	48.7%	.079
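Whiff rate is conventionally swings and misses divided by swings, not by all pitches seen. The sketch below shows that calculation for high sliders in plain Python. The function name and sample pitches are invented, and the sets of description values are an assumption about Statcast's labels that you should check against your own data.

```python
# Assumed Statcast `description` values; verify them against your own table.
WHIFFS = {"swinging_strike", "swinging_strike_blocked"}
SWINGS = WHIFFS | {"foul", "foul_tip", "hit_into_play"}


def high_slider_whiff_rate(pitches, min_height=2.5):
    """Whiffs per swing against sliders located above `min_height` feet.

    pitches: iterable of dicts with pitch_type, plate_z, and description.
    Returns None when there were no swings at a qualifying pitch.
    """
    swings = whiffs = 0
    for p in pitches:
        if p["pitch_type"] != "SL":
            continue
        if p["plate_z"] is None or p["plate_z"] <= min_height:
            continue
        if p["description"] in SWINGS:
            swings += 1
            whiffs += p["description"] in WHIFFS
    return whiffs / swings if swings else None


# Made-up pitches for one hypothetical batter.
seen = [
    {"pitch_type": "SL", "plate_z": 3.1, "description": "swinging_strike"},
    {"pitch_type": "SL", "plate_z": 2.8, "description": "foul"},
    {"pitch_type": "SL", "plate_z": 3.4, "description": "ball"},
    {"pitch_type": "SL", "plate_z": 1.9, "description": "swinging_strike"},
    {"pitch_type": "FF", "plate_z": 3.0, "description": "swinging_strike"},
]
print(high_slider_whiff_rate(seen))  # 0.5
```

Only the first two pitches count as swings at a high slider: the third was taken, the fourth was low, and the fifth was a fastball.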

"Compare Ohtani's exit velocity this year vs last year"

You: Compare Ohtani's exit velocity in 2026 versus 2025.

Claude: Pulling Shohei Ohtani's batted ball data for both seasons...

Result:

Season	Avg exit velo	Max exit velo	Hard-hit %
2025	93.2	118.4	54.2%
2026	94.1	119.1	56.8%

Exit velocity is up slightly, with hard-hit rate climbing 2.6 points year over year.
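The three columns in that table reduce to simple arithmetic over a season's batted balls. A minimal sketch follows, assuming the usual Statcast definition of a hard-hit ball as an exit velocity of 95 mph or more; exit_velo_summary and the sample velocities are hypothetical.

```python
def exit_velo_summary(launch_speeds, hard_hit_mph=95.0):
    """Average, max, and hard-hit share for one season of batted balls.

    Entries of None (pitches with no tracked batted ball) are ignored.
    Returns None when nothing was tracked.
    """
    tracked = [v for v in launch_speeds if v is not None]
    if not tracked:
        return None
    hard = sum(1 for v in tracked if v >= hard_hit_mph)
    return {
        "avg": sum(tracked) / len(tracked),
        "max": max(tracked),
        "hard_hit_pct": hard * 100 / len(tracked),
    }


# Made-up exit velocities, not real player data.
by_season = {
    2025: [85.0, 95.0, 105.0, 99.0, None],
    2026: [96.0, 101.0, 98.0, 110.0],
}
for season, speeds in by_season.items():
    print(season, exit_velo_summary(speeds))
```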

The agent writes queries you'd spend ten minutes constructing — and explains what the numbers mean.

Replaying a real game

The queries above work on any historical dataset. The more interesting case is asking those same questions while a game is still happening — before you know how the at-bat ends.

The easiest way to test that without waiting for a live game is to replay real historical data pitch by pitch. Game 1 of the 2024 World Series has a useful ending: Dodgers trailing 3–2 in the tenth, bases loaded, Freddie Freeman at the plate. Load it and sort into game order:

from pybaseball import statcast
from deephaven import pandas as dhpd

ws_df = statcast(start_dt="2024-10-25", end_dt="2024-10-25")
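# at_bat_number runs across the whole game, so it restores game order on its own.
# Sorting on inning_topbot would put "Bot" ahead of "Top" alphabetically.
ws_df_sorted = ws_df.sort_values(["at_bat_number", "pitch_number"]).reset_index(
    drop=True
)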

Then stream it into a ticking table using DynamicTableWriter:

import threading
import time
import pandas as pd
from deephaven.table_factory import DynamicTableWriter
import deephaven.dtypes as dht

writer = DynamicTableWriter(
    {
        "inning": dht.int32,
        "inning_half": dht.string,
        "at_bat_number": dht.int64,
        "pitcher_name": dht.string,
        "pitch_type": dht.string,
        "release_speed": dht.double,
        "release_spin_rate": dht.double,
        "description": dht.string,
        "events": dht.string,
        "launch_speed": dht.double,
        "launch_angle": dht.double,
    }
)

live_game = writer.table


from deephaven.constants import NULL_DOUBLE


def text(value):
    # Missing strings become empty strings.
    return str(value) if pd.notna(value) else ""


def number(value):
    # Missing measurements become nulls so they don't drag averages toward zero.
    return float(value) if pd.notna(value) else NULL_DOUBLE


def replay(speed=6):
    # Write one pitch every 1/speed seconds, in game order.
    for _, row in ws_df_sorted.iterrows():
        writer.write_row(
            int(row["inning"]) if pd.notna(row["inning"]) else 0,
            text(row["inning_topbot"]),
            int(row["at_bat_number"]) if pd.notna(row["at_bat_number"]) else 0,
            text(row["player_name"]),
            text(row["pitch_type"]),
            number(row["release_speed"]),
            number(row["release_spin_rate"]),
            text(row["description"]),
            text(row["events"]),
            number(row["launch_speed"]),
            number(row["launch_angle"]),
        )
        time.sleep(1.0 / speed)


threading.Thread(target=replay, daemon=True).start()

live_game is now a ticking table. Pitches arrive in game order while the thread runs, and every downstream query updates automatically as new rows land. Ask Claude mid-replay:

You: What's Gerrit Cole's pitch mix and average velocity so far?

Claude: Filtering live_game for Cole's pitches and grouping by type...

from deephaven.agg import avg, count_

cole_mix = (
    live_game.where("pitcher_name == `Cole, Gerrit`")
    .agg_by([count_("n"), avg("avg_velo=release_speed")], by=["pitch_type"])
    .sort_descending("n")
)

Result:

pitch_type	n	avg_velo
FF	45	96.7
SL	17	89.9
KC	11	83.9
FC	9	92.5
CH	4	89.7
SI	2	96.8

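cole_mix is itself a ticking table, so there is no need to rerun the query: leave it open and the counts climb as pitches arrive. At the default speed=6 the replay writes six pitches per second, so a full game finishes in roughly a minute; lower the speed argument to stretch it out. Ask about the tenth inning once it loads: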

You: What happened in the bottom of the tenth?

Claude: Filtering live_game for inning 10, bottom half...

tenth_inning = live_game.where("inning == 10 && inning_half == `Bot`").view(
    [
        "at_bat_number",
        "pitcher_name",
        "pitch_type",
        "release_speed",
        "description",
        "events",
        "launch_speed",
        "launch_angle",
    ]
)

Result:

Bottom of the 10th inning replay

The last row is the one you came for: Nestor Cortes, 92.2 mph fastball, 109.8 mph off the bat at 30° — Freeman's grand slam, in actual Statcast data.

Going live during game time

The replay above uses the same architecture as a live feed. The only difference is the source: instead of replaying a sorted DataFrame row by row, you pull from a real-time API, a Kafka topic, or any other streaming source and write rows into the same DynamicTableWriter as they arrive.
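One detail a live source adds is deduplication: a poll usually returns pitches you have already written. The sketch below shows one way to handle that in plain Python by keying each pitch on game_pk, at_bat_number, and pitch_number. write_new_pitches and the stand-in feed are hypothetical; in practice the write_row argument would be a small wrapper around DynamicTableWriter.write_row.

```python
def pitch_key(pitch):
    """Identify a pitch by game, plate appearance, and pitch number."""
    return (pitch["game_pk"], pitch["at_bat_number"], pitch["pitch_number"])


def write_new_pitches(fetch_pitches, write_row, seen_keys):
    """Write only pitches that have not been written before.

    fetch_pitches: callable returning the latest list of pitch dicts.
    write_row: callable that appends one pitch to the live table.
    seen_keys: set of pitch keys already written; updated in place.
    Returns the number of rows written on this poll.
    """
    written = 0
    # Sort so rows land in game order even if the source returns newest-first.
    for pitch in sorted(fetch_pitches(), key=pitch_key):
        key = pitch_key(pitch)
        if key in seen_keys:
            continue
        write_row(pitch)
        seen_keys.add(key)
        written += 1
    return written


# Stand-in feed: the second poll repeats one pitch and adds a new one.
polls = iter(
    [
        [{"game_pk": 1, "at_bat_number": 1, "pitch_number": 1}],
        [
            {"game_pk": 1, "at_bat_number": 1, "pitch_number": 2},
            {"game_pk": 1, "at_bat_number": 1, "pitch_number": 1},
        ],
    ]
)
written, seen = [], set()
print(write_new_pitches(lambda: next(polls), written.append, seen))  # 1
print(write_new_pitches(lambda: next(polls), written.append, seen))  # 1
```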

For polling-based sources, function_generated_table wraps any fetch function into a ticking table on a schedule:

from deephaven import function_generated_table
from deephaven import pandas as dhpd
from pybaseball import statcast
from datetime import date


def fetch_today():
    today = date.today().strftime("%Y-%m-%d")
    return dhpd.to_table(statcast(start_dt=today, end_dt=today))


live_pitches = function_generated_table(
    table_generator=fetch_today,
    refresh_interval_ms=5 * 60 * 1000,
)

Every table built from live_pitches updates automatically on each refresh. Wire them into a deephaven.ui dashboard to get a live scoreboard:

import deephaven.ui as ui
from deephaven.agg import avg, count_, max_, sum_

velo_leaders = (
    live_pitches.where("pitch_type == `FF` && !isNull(release_speed)")
    .agg_by(
        [
            count_("pitches"),
            avg("avg_velo=release_speed"),
            max_("max_velo=release_speed"),
        ],
        by=["player_name"],
    )
    .sort_descending("avg_velo")
)

homers = (
    live_pitches.where("events == `home_run`")
    .update_view("batting_team = (inning_topbot == `Top`) ? away_team : home_team")
    .view(
        [
            "batting_team",
            "batter",
            "player_name",
            "pitch_type",
            "release_speed",
            "launch_speed",
            "launch_angle",
        ]
    )
)

hard_hit = (
    live_pitches.where("launch_speed >= 95")
    .view(
        [
            "batter",
            "player_name",
            "pitch_type",
            "release_speed",
            "launch_speed",
            "launch_angle",
            "events",
        ]
    )
    .sort_descending("launch_speed")
)

whiff_by_type = (
    live_pitches.update_view(
        "is_whiff = description == `swinging_strike` || description == `swinging_strike_blocked`"
    )
    .agg_by([count_("total"), sum_("whiffs=is_whiff")], by=["pitch_type"])
    .update_view("whiff_pct = (double)whiffs / total * 100")
    .sort_descending("total")
)

baseball_dashboard = ui.dashboard(
    ui.column(
        ui.row(
            ui.panel(velo_leaders, title="Fastball velocity leaders"),
            ui.panel(homers, title="Home runs today"),
        ),
        ui.row(
            ui.panel(hard_hit, title="Hard-hit balls (95+ mph)"),
            ui.panel(whiff_by_type, title="Whiff rate by pitch type"),
        ),
    )
)

Live baseball dashboard with four auto-updating panels

All four panels update every five minutes without rerunning anything. Claude can query any of the underlying tables the same way it queried the replay — the architecture is identical.

Note

Deephaven connects natively to Kafka, Redpanda, and other streaming sources. See the Kafka integration guide for how to swap in a real-time feed.

Try it yourself

Try it in the browser — no install required.
Install Deephaven — free, takes about five minutes.
Connect Claude — point your AI at your Deephaven session.
Load Statcast data — Part 2 covers season-level setup; the replay section covers loading a specific game.
Start asking — one question in plain English, no queries to write.

The same pattern works for any sport with structured data — NBA shot tracking, NFL play-by-play, soccer event data. If you can get it into a table, you can query it in plain English.

Questions? Join us on Slack.
