Revisiting the calculation of medians on aggregated data.
In this article we revisit a recurring topic. I needed to calculate percentiles for aggregated sales data and had to browse through countless pages, only to find buggy or overly complex solutions. I decided to find something simple and satisfying for my use case of analyzing sales data. This turned out to be a difficult journey, harder than I expected, and as you will see I had to employ diverse tools before reaching my destination. I have tried to document my findings here in a pedagogical way.
Analyzing sales is at the heart of many consulting firms. By providing insights, they help clients understand their market positioning, their performance and where to put optimisation effort. These analyses are one of the many products consulting firms sell, and the better the analysis, the more income is generated. From the smallest shop to the largest corporation, the computer era has given us a convenient way to generate sales reports that can also be used to restock items and to predict profit or even demand. The biggest difference is the volume (or scale, to put it alternatively) of the generated data, but in all cases the demand for insights is real. While in the past pencil and paper were used to keep track of and analyse these numbers, computer programs (and especially spreadsheets) made this routine.
The explosive increase in computing power and the recent advances in big-data algorithmics, data engineering tooling and machine learning allow us to store large volumes of sales data for analysis. With these sophisticated means, the modern data analyst or data scientist can form a picture of the sales through new-generation spreadsheet programs (much like the first-generation ones, but now handling huge amounts of data that reside with a third party). In any case, despite the volume of the data, the analysis programs must run fast. Various tricks are employed: approximate analysis (streaming algorithms), precomputation, or simply throwing more computing power at the problem.
One of the most typical analyses for getting an idea about the sales of a product is to calculate the mean or even the median (very popular these days with Covid-19 reports). While the mean is very well known through its unbiased estimator, the average, the median is not commonly used in everyday life. You can find the juicy details … of course on Wikipedia. The bottom line is that given an odd number of measurements, we sort them ascending and select the measurement in the middle (if the number of measurements is even, we artificially create a “middle” measurement). Wikipedia has beautiful graphs. The median is used as a more robust statistic compared to the mean. But what happens when we have repetitions?
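As a quick concrete illustration of the textbook definition, using Python's standard library:

```python
import statistics

# Odd number of measurements: sort and take the middle one.
print(statistics.median([7, 1, 3]))      # sorted: [1, 3, 7] -> middle is 3

# Even number of measurements: average the two middle ones.
print(statistics.median([4, 1, 3, 2]))   # sorted: [1, 2, 3, 4] -> (2 + 3) / 2 = 2.5
```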
The easiest way would be to just find the median directly. Given m distinct numbers with N1, N2, …, Nm occurrences respectively, bounded by N, in an arbitrary sequence, the complexity of calculating the median is (in the worst case) N times that of calculating the median in the original case. Even worse, if someone gave us the numbers together with their occurrence counts, in an attempt to compress the data, a naive approach would be to “explode” them in order to “reuse” the median function available in the various programming libraries. In this case we incur the performance hit described above. I provide an example for R with tidyverse. (Yes, I find tidyverse very convenient.)
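The original gist is in R with tidyverse and is not reproduced here; the same “explode” idea can be sketched in Python (the function name is illustrative):

```python
from statistics import median

def exploded_median(values, occurrences):
    """Naive approach: expand each value according to its occurrence
    count and reuse the stock median function. Memory and time grow
    with the total number of occurrences, not the distinct values."""
    exploded = [v for v, n in zip(values, occurrences) for _ in range(n)]
    return median(exploded)

print(exploded_median([1.0, 2.0, 3.0], [1, 3, 1]))  # expands to [1, 2, 2, 2, 3] -> 2.0
```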
If we need to be “faster” we have to implement it ourselves or use an external library. By implementing the original algorithm directly on the occurrence counts with some clever counting, we can keep the complexity low (O(n * log(n)), where n is the number of distinct values). See the following Python function, where values is a list of doubles and occurrences is a list of positive numbers:
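The gist did not survive here, so the following is my own reconstruction of such a counting-based median, not necessarily the article's exact code:

```python
def counted_median(values, occurrences):
    """Median of data given as (value, occurrence) pairs, without
    exploding. Sorting the distinct values dominates: O(n log n)."""
    pairs = sorted(zip(values, occurrences))
    total = sum(occurrences)
    lo_rank = (total + 1) // 2   # 1-based rank of the lower middle element
    hi_rank = total // 2 + 1     # 1-based rank of the upper middle element
    lo = hi = None
    cumulative = 0
    for value, count in pairs:
        cumulative += count
        if lo is None and cumulative >= lo_rank:
            lo = value
        if cumulative >= hi_rank:
            hi = value
            break
    return (lo + hi) / 2

print(counted_median([1.0, 2.0, 3.0], [1, 3, 1]))  # odd total (5) -> 2.0
print(counted_median([1.0, 2.0], [2, 2]))          # even total (4) -> (1 + 2) / 2 = 1.5
```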
We sacrifice simplicity but we can do our work. This is the tip of an iceberg. Things will become way more interesting.
A first attempt at an SQL script
While searching for relevant SQL code for the general case I stumbled upon the very enlightening SQL script here. Unfortunately I did not have an SQL server available; however, the code was fairly standard, so all I had to do was grab the latest Bitnami PostgreSQL image and get to work. The Docker Compose file (including a pgTap setup) and the Azure Data Studio notebook can be found in my supporting-material GitHub repo. In order to run the DB and the tests you can consult this script. I have not added Flyway scripts, so please “Run all” in the notebook when you start the DB in order to play!
First we populate our database with some data.
Then I show an adaptation of the very interesting code the author provides for the median. His code actually improves upon an old answer and is a gem. I will not explain the code here, but the notebook is very clear, with examples of why this function gives the correct result. Take time to digest it and play with the notebook.
The author claims that the second function provided in that blog post gives an answer for the cases where multiple occurrences happen. It does not work with aggregated data, though, only with multiple occurrences.
One would expect both of these functions to give the same answer. The pgTap fragment shows success in this specific situation (as it happened with the data in the blog post!).
But the next test shows that this is not always the case.
While the code comes up with some number, it fails a very crucial property: the two calculations should give the same number. I had to try hard to find a case where they disagree. This is where my journey started.
When people imagine sales, they imagine products sold at certain prices and quantities. For example, 1 can of beans at 1 euro, 2 packets of chips at 0.5, a six-pack of cokes and so on. Suppose now that we seek the median value of a specific type of packet of chips, for a month and for a chain in Athens, Greece. Unfortunately this is not possible with the naive approach, for a multitude of reasons.
Every time we sell something we would have to fire an event in a distributed system and then filter just the chips sales; there are problems of scalability. For this reason some convenience-store chains may report aggregate sales of individual shops, which may have slightly different prices.
A bad artifact of this is that we no longer have a small set of prices; the set spreads, with an inevitable performance hit. The spread is data specific. One could artificially truncate the range, but this should be avoided, especially when the range is not known in advance.
Other shops may not even report regularly, and consequently we need to average the aggregate quantities sold and the monetary volumes. An example is shown.
Quantities and monetary volumes are averaged to get (100 + 91 + 120) / 3 and (100 * 1.07 + 91 * 1.00 + 120 * 1.29) / 3 respectively. The price is derived by dividing the two averages. What kind of median can we find from these reports? The true median is “hidden” by the aggregations. We may even need to find the median of “predicted” or synthetic sales with fractional quantities.
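Working the arithmetic out with the three reports above (quantities 100, 91 and 120 at prices 1.07, 1.00 and 1.29):

```python
quantities = [100, 91, 120]
prices = [1.07, 1.00, 1.29]

# Average quantity: (100 + 91 + 120) / 3
avg_quantity = sum(quantities) / len(quantities)
# Average monetary volume: (100 * 1.07 + 91 * 1.00 + 120 * 1.29) / 3
avg_monetary = sum(q * p for q, p in zip(quantities, prices)) / len(quantities)

# The derived price is a synthetic value no shop actually charged.
derived_price = avg_monetary / avg_quantity
print(round(avg_quantity, 2), round(avg_monetary, 2), round(derived_price, 4))
```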
One would also hope that things would be easier if we reported the sales of a product for a specific shop only. This would not work either. One reason is TPRs (temporary price reductions): the same exact item, e.g. “Oregano chips of brand XXX 120g”, can be sold at different prices in the same month because of promotions.
The problem generalizes easily when we observe prices for mixed product sales, not only one product.
So the “trick” with occurrence counts would never work, because of the fractional quantities. It will also not work because the items do not take values in a fixed range of prices. The Python code I provided before is a no-go. How can we find the median now? The original textbook definition does not apply. But the textbooks have more pages than one.
The weighted median
Let’s recap our problem. We are given a number of (quantity, price) pairs for a specific product and we would like to find the median price at which this product is sold. Head over to Kaggle and download the supermarket sales dataset; we will use it for illustration purposes on real data. As you can see from the dataset, the sales are aggregated!
Suppose we need to find the median for January. What the textbook says here is to estimate the density from the data through an approximation like a histogram of the sales prices, and then integrate to find the half-probability value. But why resort to approximate methods like the histogram when we can use our full data? Let’s use the cumulative sum of the data to find the probability distribution of prices; no need to decide on a binning strategy. The next gist does exactly this (we also divide by the total sales to obtain an empirical cdf).
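Since the gist is not embedded here, the idea can be sketched on made-up (price, quantity) pairs as a weighted empirical CDF:

```python
def weighted_ecdf(prices, quantities):
    """Empirical CDF of prices, weighted by quantity sold:
    sort by price, accumulate quantities, divide by the total."""
    pairs = sorted(zip(prices, quantities))
    total = sum(quantities)
    cdf, cumulative = [], 0
    for price, quantity in pairs:
        cumulative += quantity
        cdf.append((price, cumulative / total))
    return cdf

# Illustrative aggregated sales: one (price, quantity) pair per report.
for price, p in weighted_ecdf([1.00, 1.07, 1.29], [91, 100, 120]):
    print(price, round(p, 3))
```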
Observing the above, we are tempted to draw a horizontal line at 0.5 to “meet” the graph and find the value we are seeking. In general, if we are given samples with some positive weight (Quantity in our case) and a value (Sales price in our case), we can follow this exact procedure. This is the essence of the weighted median. We can generalize to a percentile other than 0.5, as explained on Wikipedia. But the samples are discrete and the line may fail to meet anything; interpolation comes to the rescue. In this amazing report the interpolation trick is analyzed and generalized further. But we are looking for something easier. More SQL friendly.
The nearest rank method as an alternative
When we attempt to find the median of an even number of samples (no weighting), like 1, 2, 3 and 4, then we are faced with a dilemma: which is the median, 2 or 3? We usually create the “odd” point we desperately need artificially, as (2 + 3) / 2 = 2.5. But this is not universally accepted, nor necessary as the number of samples increases. Let’s revisit the cdf figure. While 0.5 may not be “hit”, if we relax our requirements and move “north” we eventually “hit” something. In the example above, applying this methodology hits the 2, which sits exactly at 50%. This is the “lower median”. It is also called the “nearest rank”, because it is the nearest we can get to 50%. This obviously generalizes to percentiles other than 0.5, like 0.25 and 0.75, and it works fine for weighted percentiles (see the pdf two paragraphs later for formal definitions). For the weighted case in particular, this is illustrated in the next Python gist.
Change 0.5 to the percentile you wish. In other words, we sort by price ascending and accumulate quantities (with jumps when the value changes) until we hit 50%, or the first time we exceed it, so as to account for the discreteness.
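The gist itself did not survive extraction; a sketch of the weighted nearest-rank rule as described (weights assumed strictly positive) is:

```python
def weighted_nearest_rank(values, weights, q=0.5):
    """Weighted nearest-rank percentile: sort by value, accumulate
    weights, return the first value whose cumulative weight reaches
    a fraction q of the total. Weights must be > 0."""
    assert all(w > 0 for w in weights)
    total = sum(weights)
    cumulative = 0.0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= q * total:  # first time we hit or exceed q
            return value

# Fractional quantities are fine: no "explosion" is ever needed.
print(weighted_nearest_rank([1.00, 1.07, 1.29], [91.0, 100.5, 120.0]))  # -> 1.07
```

For unweighted samples 1, 2, 3, 4 (all weights equal to 1) this returns 2, the “lower median” from the dilemma above.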
If the number of samples is big, the runtime cannot be improved unless we “bucket” or “bin” the prices in order to reduce the complexity. But this is a very simple and understandable way to compute the percentile, without complicated interpolations. Also, in order to avoid degenerate cases, the weights are required to be > 0. But is this approximate method a substitute for the real thing? Wikipedia says yes, in the limit of infinite samples. This is not the whole truth, as we will see; things are better.
While searching the Internet I was not able to find these properties, although I had strong suspicions that they hold, so I took the time to prove them myself. If you can suggest references, please share them. The LaTeX code of the notes is in the repository too. The bottom line is that the weighted nearest rank is a very sound and simple way of estimating percentiles.
The SQL code.
The SQL code for the nearest rank is easy to understand. First, the code for the unweighted case,
and then the code for the weighted case. We use window functions to implement the total summation and the search for the first point “hitting” the threshold. The “threshold” parameter is the q. Let’s write a pgTap test to make sure that the median works the same in the aggregated and the exploded cases.
The test indeed finds the same value for a threshold (q) equal to 0.5. Feel free to add more tests during your experiments. Since I had to experiment with regional data and calculate per-region percentile data, a very common task, I include the code for this case too.
The table specification for my experiments
and the corresponding function.
Here I use the PostgreSQL-specific “distinct on” in order to get the “first line” per region, since the code keeps, in increasing percentage, the lines greater than or equal to the threshold and we are interested … well, in the first line. The repo contains a pgTap test for this case too. The main idea is to do separate calculations per region and then make sure that the function produces a result that decomposes into these per-region values.
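The same decomposition idea can be sketched in Python rather than SQL: group by region, then apply the weighted nearest rank within each group (the region names and row layout below are illustrative):

```python
from collections import defaultdict

def per_region_percentile(rows, q=0.5):
    """rows: iterable of (region, price, quantity) tuples. Returns the
    weighted nearest-rank percentile per region, mirroring the row that
    the SQL "distinct on" query picks out for each region."""
    by_region = defaultdict(list)
    for region, price, quantity in rows:
        by_region[region].append((price, quantity))
    result = {}
    for region, pairs in by_region.items():
        pairs.sort()  # ascending by price
        total = sum(quantity for _, quantity in pairs)
        cumulative = 0.0
        for price, quantity in pairs:
            cumulative += quantity
            if cumulative >= q * total:  # first row reaching the threshold
                result[region] = price
                break
    return result

rows = [("Athens", 1.07, 100), ("Athens", 1.00, 91), ("Athens", 1.29, 120),
        ("Patras", 0.95, 50), ("Patras", 1.10, 60)]
print(per_region_percentile(rows))  # Athens -> 1.07, Patras -> 1.10
```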
We have not touched on the interpolation aspect of the percentile calculation; this is another interesting topic. I mostly focused on what was necessary to do my work. SQL was the main application field where I needed a way to calculate percentiles quickly. But it sometimes pays to delve deep into others’ code and spend time thinking about how to improve existing approaches. It pays to question the status quo, in every possible sense. I hope you enjoyed the article. Clap if you liked it. Comment if you find a shortcoming.