Histograms in Postgres
April 10, 2018 • 7 minute read
Ever been searching through SQL tables and thought, “Gee, I wish I could just get a quick look at a histogram of this data”? Yeah, me too. It happens all the time, but alas, you’re stuck in a SQL prompt somewhere and you’re still 2-3 copy/pastes, an Excel array formula, and 5-6 clicks away from reaching this nirvana.
Well fret no more, because I’m about to show you how to calculate and “plot” a histogram from within Postgres. That’s right, you’ll be making ASCII art all from within the confines of your psql shell.
What is a histogram?
Histograms are a visual way of displaying the distribution of your data. They help answer the questions, “What values does your data take?” and “How often does each value occur?”
Histograms shouldn’t be confused with “frequency charts”. Frequency charts, while they can be just as helpful, are used for categorical (non-numerical) data. For example, if you wanted to count how many countries’ names start with each letter of the alphabet, you’d see something like this:
So while it’s a similar exercise, just know that when we’re talking about histograms, we’re talking about numerical data. For example, a histogram showing the price of a set of diamonds looks like this:
So you can see why they’re so useful! They’re a great way to take a quick look at a bunch of data without having to inspect each row individually or put everything into a pivot table in Excel.
Data
For this post I’m going to be using the [diamonds dataset](). It’s a fairly common sample dataset that’s found in both industry and academia. It even ships with some popular R packages like ggplot2.
The dataset consists of ~50K rows and 10 columns. In our example below we’ll be calculating a histogram for the price column (just like the example above).
…insert first 5 rows of diamonds dataset…
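If you’ve already loaded the data into Postgres, the quickest way to take that first peek is a simple limit query (this assumes the dataset lives in a table called diamonds):

-- Peek at the data; assumes the dataset has been loaded into a table named diamonds.
select * from diamonds limit 5;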
Methodology
Calculating histograms consists of a few steps:
- Calculating the “bins”
- Counting the number of data points in each bin
- Making the visualization
Calculating Your Bins
Determining the “right” number of buckets, or bins, for your histogram is a hotly contested (or at least opinionated) topic. In short, there’s no one perfect way to calculate the number and size of your bins; you’ll need to take things on a case-by-case basis. A few common rules have stayed popular over the years (there’s a quick SQL sketch of all three right after this list):
- Square-root: take the square root of the number of data points (n)
- Sturges’ formula: log2(n) + 1
- Freedman–Diaconis: a bin width of (2 * (75th percentile − 25th percentile)) / n^(1/3)
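Just for comparison (you won’t need this for the rest of the post), here’s a rough sketch of all three computed in one query, assuming the same sample table rnorm with a numeric column x that the rest of the examples use:

-- Sketch: compare the three binning rules side by side.
-- Assumes a table rnorm with a numeric column x.
select
  ceil(sqrt(count(x))) as sqrt_bins                       -- square-root rule: a bin count
  , ceil(log(2, count(x)::numeric)) + 1 as sturges_bins   -- Sturges' formula: a bin count
  , 2 * (percentile_cont(0.75) within group (order by x)
       - percentile_cont(0.25) within group (order by x))
      / (count(x) ^ (1.0 / 3.0)) as fd_bin_width          -- Freedman-Diaconis: a bin *width*, not a count
from
  rnorm;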
For our case we’re going to use a modified version of the square-root method. Since there’s only so many rows we can display on our screen at once, we have an upper bound on the number of bins that we can reasonably use. For practical purposes, our output starts to break down after 50 bins, so the formula we’ll be using is:
min(50, sqrt(n))
Now we’ll of course need to translate this to SQL, which is actually fairly easy and intuitive to do. (Note that the example queries and output below run against a sample table called rnorm with a numeric column x; substitute your own table and column, e.g. diamonds and price.)
select
min(x) as x_min
, max(x) as x_max
, least(50, ceil(sqrt(count(x)))) as nbins
into temp bin_params
from
rnorm;
There are a few things going on here:
1) Counting the # of instances of x we have (n)
2) Taking the sqrt of that value
3) Rounding that value up to the nearest whole number (ceil)
4) Taking the lower of the value from (3) and 50
As you can see, we’re also saving the min/max values of x. These will come in handy later.
Next we’ll create a bin_range table that contains the lower and upper bounds of each bin. To do this we’ll use the generate_series function in Postgres to generate evenly spaced bin boundaries between x_min and x_max. We’ll then use the lag window function to offset this data, giving the lower and upper bounds their own columns.
select
generate_series(x_min::numeric, x_max::numeric, ((x_max - x_min) / nbins)::numeric) as bin
into temp bins
from
bin_params;
select
lag(bin) over (order by bin) as low_bin
, b.bin as high_bin
into temp bin_range
from
bins b;
low_bin | high_bin |
---|---|
 | -3.92909625498578 |
-3.92909625498578 | -3.673483410809675 |
-3.673483410809675 | -3.417870566633570 |
-3.417870566633570 | -3.162257722457465 |
... | ... |
Binning and Counting
Conceptually this part is quite simple: we need to create a frequency table that gives us the # of points that occur in each bin. In SQL terms, however, this is a little tricky. We’re going to be doing something that will make your Data Warehouse Engineer cringe…a join on a range condition (essentially a filtered cartesian join)!
We’ll start by joining our bin_range table to our data table. We’re going to use a left join from the bins to the data so that every bin shows up in the result, even when nothing lands in it. Now the next step is very important: we’re going to join the two tables wherever the value of x falls within the range of one of the bins. It looks like this:
select
b.low_bin
, b.high_bin
, r.x
from
bin_range b
left join
rnorm r
on r.x <= b.high_bin and
r.x > b.low_bin
where
b.low_bin is not null and
b.high_bin is not null
order by
b.low_bin;
low_bin | high_bin | x |
---|---|---|
-3.60916920471936 | -3.375286320204154 | |
-3.375286320204154 | -3.141403435688948 | -3.14852622197941 |
-3.375286320204154 | -3.141403435688948 | -3.29226258210838 |
-3.375286320204154 | -3.141403435688948 | -3.27717558760196 |
-3.141403435688948 | -2.907520551173742 | -3.06108768424019 |
-3.141403435688948 | -2.907520551173742 | -2.96137290727347 |
-3.141403435688948 | -2.907520551173742 | -2.96593475434929 |
... | ... | ... |
This gets us almost all the way there. The last thing we need to do is group by bin and count the # of values that fall into each one. We’ll also save the result into a frequency_table temp table so we can use it for the chart in the next step.
select
b.low_bin
, b.high_bin
, count(r.x) as freq -- count(r.x) rather than count(*) so bins with no data show 0
into temp frequency_table
from
bin_range b
left join
rnorm r
on r.x <= b.high_bin and
r.x > b.low_bin
where
b.low_bin is not null and
b.high_bin is not null
group by
b.low_bin
, b.high_bin
order by
b.low_bin;
low_bin | high_bin | freq |
---|---|---|
... | ... | ... |
-2.547947306347932 | -2.282606475399404 | 25 |
-2.282606475399404 | -2.017265644450876 | 28 |
-2.017265644450876 | -1.751924813502348 | 41 |
-1.751924813502348 | -1.486583982553820 | 35 |
-1.486583982553820 | -1.221243151605292 | 50 |
-1.221243151605292 | -0.955902320656764 | 59 |
-0.955902320656764 | -0.690561489708236 | 43 |
-0.690561489708236 | -0.425220658759708 | 43 |
-0.425220658759708 | -0.159879827811180 | 40 |
... | ... | ... |
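As an aside, Postgres also ships with a built-in width_bucket() function that can assign each row to a bin in a single pass. It’s not what the rest of this post uses, but here’s a sketch of how it could stand in for the range join, reusing the bin_params temp table we saved earlier:

-- Alternative sketch: let width_bucket() do the binning.
-- Reuses the bin_params temp table (x_min, x_max, nbins) created above.
-- Note: values equal to x_max land in an overflow bucket (nbins + 1).
select
  width_bucket(r.x, p.x_min, p.x_max, p.nbins::int) as bucket
  , count(*) as freq
from
  rnorm r
cross join
  bin_params p
group by
  bucket
order by
  bucket;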
Our Pseudo-Chart
Alright, we’re almost there! Time for the fun part. We’re going to make a pseudo-visualization right here from our SQL table. Instead of displaying the frequency for each bin as a number, we’re going to make a bar chart. To do that we’ll be using the repeat(s text, n int) function from Postgres. This function repeats a string (s), n number of times.
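For a quick sanity check of what repeat() does:

select repeat('|', 5);
-- returns '|||||'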
So in our case we’ll repeat the '|' character according to the count for each bin. To scale our chart, we’ll normalize so that the bin with the largest frequency gets 50 '|' marks and scale the other buckets accordingly.
select
f.low_bin
, f.high_bin
, repeat('|', ceil(50 * f.freq / (select max(freq)::numeric from frequency_table))::int)
from
frequency_table f
;
low_bin | high_bin | count
--------------------+--------------------+----------------------------------------------------
-3.70145826274529 | -3.474553561931394 | ||||
-3.474553561931394 | -3.247648861117498 | |
-3.247648861117498 | -3.020744160303602 | |||||
-3.020744160303602 | -2.793839459489706 | ||||||||
-2.793839459489706 | -2.566934758675810 | |||||
-2.566934758675810 | -2.340030057861914 | |||||||||
-2.340030057861914 | -2.113125357048018 | ||||||||||||||||||||||||
-2.113125357048018 | -1.886220656234122 | ||||||||||||||||||||||||||||||
-1.886220656234122 | -1.659315955420226 | ||||||||||||||||||||||||||||||||||||
-1.659315955420226 | -1.432411254606330 | ||||||||||||||||||||||||||||||||||||
-1.432411254606330 | -1.205506553792434 | |||||||||||||||||||||||||||||||||||||||||
-1.205506553792434 | -0.978601852978538 | ||||||||||||||||||||||||||||||||||||||||||||||||||
-0.978601852978538 | -0.751697152164642 | ||||||||||||||||||||||||||||||||||||||||||
-0.751697152164642 | -0.524792451350746 | ||||||||||||||||||||||||||||||||||||||||||||||||||
-0.524792451350746 | -0.297887750536850 | ||||||||||||||||||||||||||||||||||||
-0.297887750536850 | -0.070983049722954 | ||||||||||||||||||||||||||||
-0.070983049722954 | 0.155921651090942 | |||||||||||||||||
0.155921651090942 | 0.382826351904838 | ||||||||||||
0.382826351904838 | 0.609731052718734 | ||||||||||||||||
0.609731052718734 | 0.836635753532630 | ||||||||||
0.836635753532630 | 1.063540454346526 | ||||||||||||
1.063540454346526 | 1.290445155160422 | ||||||
1.290445155160422 | 1.517349855974318 | ||
There you have it! You can simplify this script a lot by removing some of the more detailed bits around bin selection and scaling. You can find [simplified versions in this gist]().
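If temp tables aren’t your thing, here’s one way the whole pipeline can be collapsed into a single query using CTEs (same assumptions as above: a table rnorm with a numeric column x):

-- Sketch: the full histogram in one query, no temp tables.
with bin_params as (
  select
    min(x) as x_min
    , max(x) as x_max
    , least(50, ceil(sqrt(count(x)))) as nbins
  from rnorm
), bins as (
  select
    generate_series(x_min::numeric, x_max::numeric, ((x_max - x_min) / nbins)::numeric) as bin
  from bin_params
), bin_range as (
  select
    lag(bin) over (order by bin) as low_bin
    , bin as high_bin
  from bins
), frequency_table as (
  select
    b.low_bin
    , b.high_bin
    , count(r.x) as freq
  from bin_range b
  left join rnorm r
    on r.x <= b.high_bin and
       r.x > b.low_bin
  where b.low_bin is not null and
        b.high_bin is not null
  group by b.low_bin, b.high_bin
)
select
  low_bin
  , high_bin
  , repeat('|', ceil(50 * freq / (max(freq) over ())::numeric)::int) as bar
from frequency_table
order by low_bin;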
Before You Go
Some other helpful resources for histograms and all things SQL: