If you’ve ever worked with large volumes of raw data, you’ve likely dealt with annoying outliers and crowded graphs. This noise can make analysis challenging and obscure trends in the data. Some of these problems can be avoided by cleaning your data, and one of my favorite ways to do so is with deciles. Deciles split a ranked set of data into 10 equally sized subsections, so a large ordered data set becomes a handful of smaller groups that can be analyzed independently.
To summarize your data, average the values within each subsection. This yields 10 data points that represent the overall trend. Plotting these averaged deciles turns a crowded graph into one that is much simpler and easier to interpret.
Reducing noise with a few lines of code
This type of decile analysis takes only a few easy steps.
- First, sort the data on the independent variable (x-axis).
- Then, separate your values into 10 equally sized groups, which will be your deciles.
- Next, average all the x- and y-values that fall into the same decile.
- Finally, plot those 10 averaged values.
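To make the steps concrete outside of any particular engine, here is a minimal sketch in plain Python. The function name `decile_averages` and the `sample` data are hypothetical, made up purely for illustration, and the grouping rule (rank position times 10, divided by the row count) is just one reasonable way to form 10 groups.

```python
from statistics import mean

def decile_averages(points):
    """Return up to 10 (mean_x, mean_y) pairs, one per decile of x."""
    ordered = sorted(points, key=lambda p: p[0])  # step 1: sort on x
    n = len(ordered)
    groups = [[] for _ in range(10)]
    for i, point in enumerate(ordered):
        # step 2: map rank position i to a decile index 0..9
        groups[i * 10 // n].append(point)
    # step 3: average x and y within each non-empty decile
    return [
        (mean(x for x, _ in g), mean(y for _, y in g))
        for g in groups
        if g
    ]

# Hypothetical noisy data: y is 2*x plus alternating +5 / -5 noise
sample = [(x, 2 * x + (5 if x % 2 else -5)) for x in range(100)]
summary = decile_averages(sample)  # step 4 would plot these pairs
```

Each entry of `summary` is one averaged point, so the 100 noisy inputs collapse to 10 points that follow the underlying line.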
In Deephaven, you only need a few lines of code:
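# `table` stands for any Deephaven table; swap in your own column names.
# Decile is each row's position in the sorted table as a percentage (0 to <100);
# DecileRank rounds that down to a multiple of 10.
result = table.sort(order_by=["IndependentVariable"]) \
    .update(formulas=[ \
        "Decile = ii/(IndependentVariable_.size()) * 100", \
        "DecileRank = lowerBin(Decile, 10)" \
    ]) \
    .view(formulas=["DecileRank", "IndependentVariable", "DependentVariables"]) \
    .avg_by("DecileRank")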
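If the binning arithmetic is not obvious, the sketch below mimics it in plain Python. It assumes that `lowerBin(value, 10)` rounds down to the nearest multiple of 10, which is my reading of the function, and it uses integer arithmetic instead of the floating-point expression in the query to sidestep rounding surprises. The helper `decile_rank` and the row count of 25 are hypothetical.

```python
from collections import Counter

def decile_rank(i, n):
    """Label for row i of n sorted rows: 0, 10, ..., 90."""
    # Integer form of: percentile = i / n * 100, rounded down to a multiple of 10
    return (i * 10 // n) * 10

n_rows = 25  # hypothetical table size, deliberately not a multiple of 10
ranks = [decile_rank(i, n_rows) for i in range(n_rows)]
bin_sizes = Counter(ranks)  # how many rows land in each decile label
```

When the row count is not a multiple of 10, the groups cannot be exactly equal; here they alternate between 3 and 2 rows, which is as even as 25 rows allow.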
In my current project, I’m looking at cryptocurrencies and their momentum. I want to see whether a common momentum trading strategy applies in the crypto world, so I’ve acquired a lot of raw financial data. A simple scatter plot of the momentum signal vs. future price change looked like this:
from deephaven.plot.figure import Figure
from deephaven.plot import PlotStyle, Colors, Shape
eth_momentum_scatter_plot = Figure() \
    .plot_xy(series_name="Ethereum Momentum", t=eth_hist, x="Momentum", y="FuturePriceChange") \
    .axes(plot_style=PlotStyle.SCATTER) \
    .point(shape=Shape.SQUARE, size=10, label="Big Point", color=Colors.RED) \
    .show()
As you can see, this graph is quite unhelpful. There is just too much noise, and it’s hard to see any clear trend. To clean up the data, I made a decile plot, using the decile code above to compute the averaged deciles.
decile_eth = eth_hist.sort(order_by=["Momentum"]) \
    .update(formulas=[ \
        "Decile = ii/(Momentum_.size()) * 100", \
        "Decile_Rank = lowerBin(Decile, 10)" \
    ]) \
    .view(formulas=["Decile_Rank", "Momentum", "PriceChange = FuturePriceChange"]) \
    .avg_by("Decile_Rank")
Then, with my 10 averaged decile data points, all I had to do was graph them.
eth_decile = Figure() \
    .plot_xy(series_name="Future Price Change", t=decile_eth, x="Momentum", y="PriceChange") \
    .show()
This graph is much better for human analysis than my original raw graph.
Although averaging discards the individual data points, it reveals general trends that would otherwise be hard to see. This is only one example of how useful deciles can be when working with raw, noisy data. Whatever your project, a decile analysis can help you spot trends that the raw data hides. So, if you want to analyze large amounts of data efficiently, look no further than Deephaven and its vast toolset.