Kaggle is one of the best resources for data. If you're at a loss about what to do with your next project, jump into the rabbit hole of Kaggle for never-ending inspiration.
Ok - now you have ideas, but how do you actually get your hands on the data and start processing it ASAP?
When I first started using Kaggle, I had no idea how easy it is to get the data into Python. My workflow was pretty manual:
- find an interesting data set
- explore data set limitations like size and variable type
- download the data
- copy or scp the data set to wherever I could work on it
This process was tedious until I pieced together a nearly automated system. Now all I need to do is step 1!
This guide shows you how to pull a Kaggle CSV file into Python NOW.
1) Get an API Key
Sorry, this is the longest step. You need to get a Kaggle API key: on your Kaggle account page, 'Create New API Token' generates a kaggle.json holding your username and key.
When you see your JSON, don't close it! Copy those values into your environment variables.
Inside your Python console, enter the username and key:
import os
os.environ['KAGGLE_USERNAME'] = "USER_NAME_FROM_JSON"
os.environ['KAGGLE_KEY'] = "KEY_FROM_JSON"
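If you'd rather not paste credentials by hand, here's a minimal sketch that reads the same two values out of the downloaded kaggle.json - assuming you saved it in the conventional ~/.kaggle/ spot (if it lives there, the kaggle package will also find it on its own):
import json
import os
# Read username and key from the downloaded kaggle.json
with open(os.path.expanduser('~/.kaggle/kaggle.json')) as f:
    creds = json.load(f)
os.environ['KAGGLE_USERNAME'] = creds['username']
os.environ['KAGGLE_KEY'] = creds['key']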
2) What data should I get?
The hardest step is to decide what you want. Kaggle has all the data I've ever wanted. Want to read all the Harry Potter books again? Download the text corpus from Kaggle. Interested in the SEC Annual Financial Filings... yes, Kaggle has that. Or perhaps I'm writing an AI for music recognition - I could start at this music classification set on Kaggle.
Once you do pick a data set, the Kaggle API is fantastic. My favorite and easiest method to use is dataset_download_file.
Set the data_set and data_file variables to match the data you want. In this example, I want the CSV file called list.csv from the google-doodles data set. (Note: this automated method works as-is only for CSVs; other data types need small modifications. Slack us if you need ideas on how to make those changes.)
I can either copy the API command or the URL. Take only the username/data-set portion and assign it to data_set.
Next, I want to specify which file. This data set has two; I pick list.csv and assign that to data_file.
data_set = 'joonasyoon/google-doodles'
data_file = 'list.csv'
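If you copied the full URL rather than just the slug, a small helper can trim it down for you. This is my own convenience function (kaggle_slug is a hypothetical name, not part of the Kaggle API):
from urllib.parse import urlparse

def kaggle_slug(url):
    # Keep only the 'username/dataset-name' part of a Kaggle URL
    parts = urlparse(url).path.strip('/').split('/')
    if parts and parts[0] == 'datasets':
        parts = parts[1:]  # drop the leading 'datasets' path segment
    return '/'.join(parts[:2])

data_set = kaggle_slug('https://www.kaggle.com/datasets/joonasyoon/google-doodles')
# data_set is now 'joonasyoon/google-doodles'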
Slack us if you found a cool data set to use!
3) Using the Kaggle API
The nitty-gritty details step. If you're using a CSV file, you won't need to change any of this code - it just works. The comments in the code echo the detailed explanation below the block.
# A install
os.system("pip install kaggle")
# B import and authenticate
import kaggle
from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()
# C download
api.dataset_download_file(data_set, data_file)
# D unzip
from zipfile import ZipFile
zf = ZipFile(data_file+'.zip')
zf.extractall()
zf.close()
# E convert csv to a table for easy manipulation
from deephaven import read_csv
data = read_csv(data_file)
# E (alternative) pandas command for csv if outside Deephaven
# import pandas as pd
# data = pd.read_csv(data_file)
- A) I like to install Kaggle with a local pip install. If you plan on using this frequently, you might want to read our guide on other ways to install Python packages.
- B) Next, import the library and authenticate your credentials.
- C) This uses the data_set and data_file variables we set above to download that file. Remember, this downloads inside the Docker container, not to your local machine.
- D) Most of the time the data set needs to be unzipped. This code unzips our file.
- E) Careful with this step. It seems like 90%+ of the data on Kaggle is CSV, but there might be other formats. Deephaven works with those, too, like Parquet.
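Two gotchas worth guarding against: small files sometimes arrive already unzipped, and some data sets hold more files than you expect. Here's a hedged sketch of both - dataset_list_files is part of the same Kaggle client, and the unzip check simply skips step D when there is no zip:
import os
from zipfile import ZipFile, is_zipfile

# Only unzip when the download actually produced a zip file
zip_name = data_file + '.zip'
if os.path.exists(zip_name) and is_zipfile(zip_name):
    with ZipFile(zip_name) as zf:
        zf.extractall()

# Not sure which file you want? List everything in the data set first
for f in api.dataset_list_files(data_set).files:
    print(f)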
Bonus points and next steps
Deephaven is built for large data sets, like many of those on Kaggle. Try some of our features such as aggregations, formulas, or our many plotting options.
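For a quick taste, here's a minimal sketch of an aggregation on the table we just loaded - the date column name is hypothetical, so swap in whatever columns your data set actually has:
from deephaven import read_csv

doodles = read_csv('list.csv')

# Count doodles per year - assumes a string 'date' column like '2021-05-17'
by_year = doodles.update(["Year = date.substring(0, 4)"]).count_by("N", by=["Year"])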