Calculating Moving Median with BigQuery: A Deeper Dive
When working with time-series data, calculating moving averages and medians can be a useful way to identify trends and patterns. In this article, we’ll explore how to calculate a 7-day moving median using BigQuery Standard SQL.
Understanding the Problem
The problem is to calculate a 7-day moving median for a specific column of a BigQuery table. The data contains outliers, which distort a moving average, so we want a statistic that reflects the typical value in each window without being pulled around by a few extreme readings. The moving median is that statistic.
Background: What is a Moving Median?
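A moving median is the median (the middle value of the sorted values) computed over a window that slides along the data. It is also known as a rolling median. A cumulative median is something different: it uses every value seen so far rather than a fixed-size window. For one window, the median is:
Median = x[(n + 1) / 2] if n is odd, and (x[n / 2] + x[n / 2 + 1]) / 2 if n is even
where x[1] ≤ x[2] ≤ ... ≤ x[n] are the n values in the window sorted in ascending order. For a 7-day moving median, the window for each row is that row's day plus the six days before it.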
Handling Outliers
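Outliers can significantly distort a moving average, because every value in the window contributes to the mean in proportion to its size. The median depends only on the order of the values, so a single extreme value in the window moves it by at most one position in the sorted list. Switching from a moving average to a moving median is therefore itself the way to handle the outliers; no separate step to exclude or down-weight them is needed.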
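To make the difference concrete, here is a small sketch that compares the mean and the median of a single 7-value window containing one outlier. The values are made up for illustration, and the scalar subquery around PERCENTILE_CONT is only one of several ways to take the median of an array in BigQuery.

```sql
-- Hypothetical window of seven readings; 100 is the outlier.
WITH w AS (
  SELECT [3, 5, 4, 100, 6, 5, 4] AS vals
)
SELECT
  -- Mean: pulled far above the typical reading by the single outlier.
  (SELECT AVG(v) FROM UNNEST(vals) AS v) AS mean_value,
  -- Median: the middle of the sorted values, regardless of how large the outlier is.
  (SELECT PERCENTILE_CONT(v, 0.5) OVER () FROM UNNEST(vals) AS v LIMIT 1) AS median_value
FROM w;
```

Sorted, the values are 3, 4, 4, 5, 5, 6, 100, so the median is 5, while the mean is 127 / 7, roughly 18.1.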
BigQuery’s ARRAY_AGG Function
In BigQuery, ARRAY_AGG collects values into an array; it does not sum or average them. Used as a window function with an OVER clause, it returns, for every row, an array of the values that fall inside that row's window frame.
To build the windows for a moving median, we combine three parts of the OVER clause. PARTITION BY keeps each group (here, each port) separate. ORDER BY on a day number puts the rows on a numeric time axis. A RANGE frame on that axis then selects the rows that fall within the last seven days. The median is taken from the resulting array in a second step.
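As a rough illustration of what the windowed ARRAY_AGG call returns, the sketch below runs it on a few hypothetical rows with two gaps in the dates. The sample values are invented; the column names port, datetime and qty follow the table used in this article. It also shows a ROWS frame next to the RANGE frame, since the two differ as soon as days are missing.

```sql
-- Hypothetical sample data: one port, 11:00 readings, with gaps after 2024-01-03 and 2024-01-06.
WITH t AS (
  SELECT 'A' AS port, DATETIME '2024-01-01 11:00:00' AS datetime, 10 AS qty UNION ALL
  SELECT 'A', DATETIME '2024-01-02 11:00:00', 12 UNION ALL
  SELECT 'A', DATETIME '2024-01-03 11:00:00', 11 UNION ALL
  SELECT 'A', DATETIME '2024-01-06 11:00:00', 13 UNION ALL
  SELECT 'A', DATETIME '2024-01-10 11:00:00', 14
)
SELECT
  port,
  datetime,
  qty,
  -- RANGE frame: rows whose day number is within 6 of the current row's day number.
  ARRAY_AGG(qty) OVER (
    PARTITION BY port
    ORDER BY DATETIME_DIFF(datetime, DATETIME '2000-01-01', DAY)
    RANGE BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS by_range,
  -- ROWS frame: the current row and the 6 rows before it, whatever their dates.
  ARRAY_AGG(qty) OVER (
    PARTITION BY port
    ORDER BY datetime
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS by_rows
FROM t
ORDER BY port, datetime;
```

For the 2024-01-10 row, the RANGE frame covers 2024-01-04 through 2024-01-10 and so holds only the readings 13 and 14, whereas the ROWS frame holds all five readings. For a window defined in calendar days, RANGE on a day number is the safer choice.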
Writing the SQL Query
The SQL query provided in the original Stack Overflow question is a good starting point for calculating the moving median. However, we need to modify it slightly to handle the outliers and calculate the moving median accurately.
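    SELECT
      t.*,
      -- Sort the window's values, then pick the middle one (the lower middle when the count is even).
      ARRAY(SELECT q FROM UNNEST(qtys) AS q ORDER BY q)[ORDINAL(CAST(ARRAY_LENGTH(qtys) / 2 AS INT64))] AS moving_median
    FROM (
      SELECT
        t.*,
        -- Collect qty for the current day and the 6 days before it (7 days in total).
        ARRAY_AGG(qty) OVER (
          PARTITION BY port
          ORDER BY DATETIME_DIFF(datetime, DATETIME '2000-01-01', DAY)
          RANGE BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS qtys
      FROM t
      WHERE EXTRACT(HOUR FROM datetime) = 11
    ) t;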
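This query uses ARRAY_AGG as a window function to collect, for every row, the qty values that fall inside that row's window. The PARTITION BY clause keeps each port's series separate, and the ORDER BY on the day number, together with the RANGE frame, defines the window in calendar days rather than in rows. The WHERE clause keeps only the 11:00 readings. The outer query then picks the middle element of the array. Because the median is the middle of the sorted values, the array has to be ordered by qty before that element is taken.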
Handling Even Number of Rows
When a window holds an even number of values, there is no single middle value. The usual definition of the median averages the two middle values; this query takes the simpler route of picking one of them. The position is computed as:
    qtys[ordinal(cast(array_length(qtys) / 2 as int64))]
ORDINAL makes the index 1-based. ARRAY_LENGTH(qtys) / 2 is floating-point division, and CAST(... AS INT64) rounds halves away from zero. For an array of n values, the expression therefore selects position (n + 1) / 2 when n is odd, which is the true middle, and position n / 2 when n is even, which is the lower of the two middle values.
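The index arithmetic is easier to check once it is tabulated. The following sketch evaluates the same CAST expression for array lengths 1 through 8, using a generated series in place of real arrays.

```sql
SELECT
  n AS array_length,
  -- n / 2 is FLOAT64 division; CAST to INT64 rounds halves away from zero.
  CAST(n / 2 AS INT64) AS middle_ordinal
FROM UNNEST(GENERATE_ARRAY(1, 8)) AS n
ORDER BY n;
```

For lengths 1 through 8 this gives positions 1, 1, 2, 2, 3, 3, 4, 4, so the index is always valid for a non-empty array.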
Using Window Functions
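BigQuery supports window functions, which perform calculations across rows that are related to the current row. The windowed ARRAY_AGG call above is already one of them. For the median itself, BigQuery offers PERCENTILE_CONT(value, 0.5), but its OVER clause accepts only PARTITION BY, with no ORDER BY and no frame, so it cannot be slid along the table as a 7-day window on its own.
One way to combine the two is to keep the windowed array and apply PERCENTILE_CONT to its unnested elements:
    SELECT
      t.* EXCEPT (qtys),
      -- PERCENTILE_CONT interpolates, so an even-sized window yields the mean of the two middle values.
      (SELECT PERCENTILE_CONT(q, 0.5) OVER () FROM UNNEST(qtys) AS q LIMIT 1) AS moving_median
    FROM (
      SELECT
        t.*,
        ARRAY_AGG(qty) OVER (
          PARTITION BY port
          ORDER BY DATETIME_DIFF(datetime, DATETIME '2000-01-01', DAY)
          RANGE BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS qtys
      FROM t
      WHERE EXTRACT(HOUR FROM datetime) = 11
    ) t;
This query does not need to sort the array or compute an index. PERCENTILE_CONT finds the median of each window directly and, when the window holds an even number of values, interpolates between the two middle values instead of picking one. Note that AVG over the same window would give a moving average, not a moving median, and that ranking functions such as DENSE_RANK do not accept a window frame, so neither can stand in for the median here.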
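To see how a moving median behaves next to a moving average, here is a self-contained sketch on hypothetical data: seven daily readings for one port with an outlier on 2024-01-05. The sample values and the CTE name windowed are invented for illustration; the column names follow the table used in this article.

```sql
-- Hypothetical sample data: seven daily 11:00 readings, with an outlier on 2024-01-05.
WITH t AS (
  SELECT 'A' AS port, DATETIME '2024-01-01 11:00:00' AS datetime, 10 AS qty UNION ALL
  SELECT 'A', DATETIME '2024-01-02 11:00:00', 12 UNION ALL
  SELECT 'A', DATETIME '2024-01-03 11:00:00', 11 UNION ALL
  SELECT 'A', DATETIME '2024-01-04 11:00:00', 13 UNION ALL
  SELECT 'A', DATETIME '2024-01-05 11:00:00', 500 UNION ALL
  SELECT 'A', DATETIME '2024-01-06 11:00:00', 12 UNION ALL
  SELECT 'A', DATETIME '2024-01-07 11:00:00', 11
),
windowed AS (
  SELECT
    t.*,
    -- Both functions share the same 7-day window definition.
    AVG(qty) OVER w AS moving_avg,
    ARRAY_AGG(qty) OVER w AS qtys
  FROM t
  WINDOW w AS (
    PARTITION BY port
    ORDER BY DATETIME_DIFF(datetime, DATETIME '2000-01-01', DAY)
    RANGE BETWEEN 6 PRECEDING AND CURRENT ROW
  )
)
SELECT
  port,
  datetime,
  qty,
  ROUND(moving_avg, 1) AS moving_avg,
  -- Median of the values collected for this row's window.
  (SELECT PERCENTILE_CONT(q, 0.5) OVER () FROM UNNEST(qtys) AS q LIMIT 1) AS moving_median
FROM windowed
ORDER BY port, datetime;
```

On the last row the window holds all seven readings. The moving average is 569 / 7, about 81.3, while the moving median is 12, which is in line with the ordinary readings.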
Conclusion
Calculating a moving median in BigQuery comes down to two steps: collect each row's window with ARRAY_AGG used as a window function, then take the middle of the sorted values, either by indexing into the sorted array or by applying PERCENTILE_CONT to the unnested array. Because the median depends only on the order of the values, the result stays stable even when the time series contains outliers.
Last modified on 2024-11-27