Revisiting the calculation of medians on aggregated data.

From the obscure, to the obvious.

Introduction

In this article we touch upon a recurring topic. I needed to calculate percentiles for aggregated sales data, and browsing through countless pages only turned up buggy or overly complex solutions. So I set out to find something simple and satisfying for my use case of analyzing sales data. It turned out to be a harder journey than I expected and, as you will see, I had to employ diverse tools until I reached my destination. I have tried to document my findings here in a pedagogic way.

Sales

Analyzing sales is at the heart of many consulting firms. The insights they provide help clients understand their market positioning, their performance and where to put optimisation effort. These analyses are one of the many products consulting firms sell, and the better the analysis, the more income is generated. From the smallest shop to the largest corporation, the computer era has given us a convenient way to generate sales reports that can also be used to manage stock, predict profit or even demand. The biggest difference between the smallest shop and the largest corporation is the volume (or, put alternatively, the scale) of the generated data, but in all cases the demand for insights is real. Where pencil and paper were once used to keep track of and analyse these numbers, computer programs (and especially spreadsheets) have made this far easier.

Median

One of the most typical analyses to get an idea about sales for a product is to calculate the mean or even the median (very popular these days with Covid-19 reports). While the mean is very well known through its unbiased estimator, the average, the median is not commonly used in everyday life. You can find the juicy details … of course on Wikipedia. The bottom line is that, given an odd number of measurements, we sort them in ascending order and select the one in the middle (if the number of measurements is even, we artificially create a “middle” measurement). Wikipedia has beautiful graphs. The median is used as a more robust statistic compared to the mean. But what happens when we have repetitions?
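As a quick sanity check of this definition in PostgreSQL (a minimal sketch on made-up numbers; percentile_cont is the built-in ordered-set aggregate):

-- Odd number of samples: 1, 3, 7 -> the middle value, 3.
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY x) AS median
FROM (VALUES (1), (3), (7)) AS t(x);

-- Even number of samples: 1, 2, 3, 4 -> the artificial "middle" point (2 + 3)/2 = 2.5.
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY x) AS median
FROM (VALUES (1), (2), (3), (4)) AS t(x);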

A first attempt for an SQL script

While searching for relevant SQL code for the general case I bumped into the very enlightening SQL script here. Unfortunately I did not have an SQL Server instance available, but the code was fairly standard, so all I had to do was grab the latest Bitnami PostgreSQL image and get to work. The Docker Compose file (including a pgTap setup) and the Azure Data Studio notebook can be found in my supporting-material Github repo. To run the DB and the tests you can consult this script. I have not added Flyway scripts, so please “Run all” in the notebook when you start the DB in order to play!

Fractional sales

When people imagine sales, they imagine products sold at certain prices and in certain quantities. For example, 1 can of beans at 1 Euro, 2 packets of chips at 0.5, a six-pack of cokes and so on. Suppose now that we seek the median price of a specific type of packet of chips, for a month and a retail chain in Athens, Greece. Unfortunately this is not possible with the naive approach, for two main reasons (a small sketch after the list below makes them concrete):

Aggregates across shops.
Fractional quantities.
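To make these two issues concrete, here is a minimal sketch on made-up numbers (the table and column names are placeholders): every row is an aggregated sales line, yet a plain row-level median treats each line as a single observation, whether it represents ten units or half a unit.

-- Three aggregated sales lines for one product: (unit_price, quantity).
WITH sales(unit_price, quantity) AS (
    VALUES (1.00, 10.0),   -- ten units at 1.00
           (2.00,  0.5),   -- half a unit (e.g. 0.5 kg) at 2.00
           (9.00,  1.0)    -- one unit at 9.00
)
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY unit_price) AS row_level_median
FROM sales;

The result is 2.00, the middle of the three line prices. But 10 of the 11.5 units were sold at 1.00, so any sensible “median unit price” is 1.00: the quantities have to enter the calculation as weights.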

The weighted median

Let’s recap our problem. We are given a number of quantity/price pairs for a specific product and we would like to find the median price at which this product is sold. Head over to Kaggle and download the supermarket sales dataset; we will use it to illustrate the ideas on real data. As you can see from the dataset, the sales are aggregated!

Empirical cdf
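The empirical cdf of a sample, evaluated at a value x, is simply the fraction of observations that are less than or equal to x; in the weighted setting it becomes the fraction of the total quantity sold at or below a given price. A quick way to see its characteristic staircase shape is PostgreSQL’s cume_dist() window function, which computes exactly the unweighted version (a minimal sketch on toy data):

-- Empirical cdf of the sample 1, 2, 3, 4.
SELECT x,
       cume_dist() OVER (ORDER BY x) AS ecdf   -- fraction of rows <= x
FROM (VALUES (1), (2), (3), (4)) AS t(x);
-- x | ecdf
-- 1 | 0.25
-- 2 | 0.50
-- 3 | 0.75
-- 4 | 1.00

The median question then becomes: at which value does this staircase first reach 0.5?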

The nearest rank method as an alternative

When we attempt to find the median of an even number of samples (no weighting), like 1, 2, 3 and 4, we are faced with a dilemma: which one is the median, 2 or 3? We usually create the “odd” point we desperately need artificially, as (2 + 3)/2 = 2.5. But this convention is not universally accepted, even if the discrepancy matters less and less as the number of samples grows. Let’s revisit the cdf figure. While 0.5 may not be “hit” exactly, if we relax our requirement and move “north” we eventually “hit” something. In the example above, applying this methodology hits the 2, which sits at exactly 50%. This is the “lower median”. It is also called the “nearest rank”, because it is the nearest we can get to 50%. This obviously generalizes to percentiles other than 0.5, like 0.25 and 0.75, and it works fine for weighted percentiles (see the pdf two paragraphs later for formal definitions). For the weighted case in particular, this is illustrated in the next Python gist.
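Incidentally, for the unweighted case PostgreSQL already ships this nearest-rank behaviour as the built-in ordered-set aggregate percentile_disc, which makes a handy cross-check for the hand-written SQL of the next section (a minimal sketch on the toy sample from above):

-- Nearest rank / lower median of 1, 2, 3, 4: the first value whose
-- cumulative fraction reaches 0.5.
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY x) AS lower_median
FROM (VALUES (1), (2), (3), (4)) AS t(x);
-- returns 2

There is no weighted counterpart among the built-ins, which is why the weighted case is the one that needs a hand-written query.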

The SQL code

The SQL code for the nearest rank is easy to understand. First, the code for the unweighted case:
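The idea is to number the rows in price order, count them, and return the first price whose rank reaches the requested fraction of the total. In the sketch below the table and column names (sales, unit_price, quantity) are placeholders for whatever schema you use.

-- Unweighted nearest-rank median: the value at rank ceil(0.5 * n).
WITH ranked AS (
    SELECT unit_price,
           ROW_NUMBER() OVER (ORDER BY unit_price) AS rn,
           COUNT(*)     OVER ()                    AS n
    FROM sales
)
SELECT unit_price AS nearest_rank_median
FROM ranked
WHERE rn >= CEIL(0.5 * n)
ORDER BY rn
LIMIT 1;

The weighted version is the same idea with the row count replaced by a running sum of quantities: the weighted (lower) median is the smallest price whose cumulative quantity reaches half of the total quantity.

-- Weighted nearest-rank median: cumulative quantity instead of row count.
WITH cumulated AS (
    SELECT unit_price,
           SUM(quantity) OVER (ORDER BY unit_price) AS cum_qty,
           SUM(quantity) OVER ()                    AS total_qty
    FROM sales
)
SELECT unit_price AS weighted_median
FROM cumulated
WHERE cum_qty >= 0.5 * total_qty
ORDER BY unit_price
LIMIT 1;

Substituting 0.25 or 0.75 for 0.5 gives the corresponding nearest-rank percentiles; on the toy data from the fractional-sales example above, the weighted query returns 1.00, as expected.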

Closing thoughts

We have not touched on the interpolation aspect of percentile calculation; that is another interesting topic. I mostly focused on what was necessary to do my work: SQL was the main application field where I needed a quick way to calculate percentiles. But it sometimes pays to delve deep into other people's code and spend time thinking about how to improve existing approaches. It pays to question the status quo, in every possible sense. I hope you enjoyed the article. Clap if you liked it, and comment if you find a shortcoming.