Wednesday, 4 April 2018

Approximate Top-N Query Processing (APPROX_RANK, APPROX_SUM, APPROX_COUNT) In Oracle Database 18c

The APPROX_RANK, APPROX_SUM and APPROX_COUNT functions were introduced in Oracle 18c to allow approximate top-n query processing. This extends the approximate query processing that was introduced in the previous two releases (12cR1 and 12cR2).

◈ Setup


The examples in this article require the following table. The table is populated with random data, so your results may look a little different than those presented here.

DROP TABLE t1 PURGE;

CREATE TABLE t1 AS
SELECT level AS id,
       TRUNC(DBMS_RANDOM.value(1, 5))*10 department,
       UPPER(TO_CHAR(TO_DATE(TRUNC(DBMS_RANDOM.value(1, 6)),'j'), 'jsp')) AS record_type,
       ROUND(DBMS_RANDOM.value(1,51),2) AS record_value
FROM   dual
CONNECT BY level <= 100000;

You can get an idea of your data set using the following query.

COLUMN record_type FORMAT A12

SELECT department,
       record_type,
       COUNT(*) AS record_count,
       SUM(record_value) AS sum_record_val
FROM   t1
GROUP BY department, record_type
ORDER BY department, record_type;

DEPARTMENT RECORD_TYPE  RECORD_COUNT  SUM_RECORD_VAL
10
10
10
10
10
20
20
20
20
20
30
30
30
30
30
40
40
40
40
40
FIVE
FOUR
ONE
THREE
TWO
FIVE
FOUR
ONE
THREE
TWO
FIVE
FOUR
ONE
THREE
TWO
FIVE
FOUR
ONE
THREE
TWO
5012
5042
4912
4922
5160
4816
4983
4961
5049
5018
5059
4969
5042
5115
4918
4965
5007
5017
4944
5089
130044.4
131431.59
128534.48
126185.08
133586.67
124278.87
130158.06
128156.44
131219.07
129587.6
131168.33
128337.92
131607.88
131455.11
127673.23
128567.84
128968.23
130735.96
130084.71
131310.33

SQL>

Introduction


In this section I'll keep showing the same query, highlighting different parts of the statement to illustrate the point being made. It's not important what the query actually does at this point. Just focus on the syntax. The examples are all based around the APPROX_SUM function, but the points apply equally to the APPROX_COUNT function.

The approximate top-n query processing currently focuses on scenarios where you are ranking groups of data based on a SUM of values or a COUNT of the number of rows in the group, hence the APPROX_SUM and APPROX_COUNT functions. These functions can only be used in combination with the APPROX_RANK function, which must be present in the HAVING clause of the query, but can also appear in the SELECT list, if you want to display the rank of the data.

SELECT department,
       record_type,
       APPROX_SUM(record_value) AS approx_sum_val,
       APPROX_RANK(PARTITION BY department ORDER BY APPROX_SUM(record_value) DESC) AS approx_rank_val
FROM   t1
GROUP BY department, record_type
HAVING APPROX_RANK(PARTITION BY department ORDER BY APPROX_SUM(record_value) DESC) <= 2
ORDER BY 1, 4;

The same approximate function used in the SELECT list must also appear in the ORDER BY clause of the APPROX_RANK function.

SELECT department,
       record_type,
       APPROX_SUM(record_value) AS approx_sum_val,
       APPROX_RANK(PARTITION BY department ORDER BY APPROX_SUM(record_value) DESC) AS approx_rank_val
FROM   t1
GROUP BY department, record_type
HAVING APPROX_RANK(PARTITION BY department ORDER BY APPROX_SUM(record_value) DESC) <= 2
ORDER BY 1, 4;

The ORDER BY clause is mandatory and must always include the DESC keyword, so this can't be used to perform bottom-n queries like most other methods.

SELECT department,
       record_type,
       APPROX_SUM(record_value) AS approx_sum_val,
       APPROX_RANK(PARTITION BY department ORDER BY APPROX_SUM(record_value) DESC) AS approx_rank_val
FROM   t1
GROUP BY department, record_type
HAVING APPROX_RANK(PARTITION BY department ORDER BY APPROX_SUM(record_value) DESC) <= 2
ORDER BY 1, 4;

The PARTITION BY clause must include all or a subset of the columns in the GROUP BY clause. If there is more than one column in the GROUP BY clause the PARTITION BY clause is mandatory. The PARTITION BY clause can be omitted if there is only a single column in the GROUP BY clause, as this will automatically be the partition.

SELECT department,
       record_type,
       APPROX_SUM(record_value) AS approx_sum_val,
       APPROX_RANK(PARTITION BY department ORDER BY APPROX_SUM(record_value) DESC) AS approx_rank_val
FROM   t1
GROUP BY department, record_type
HAVING APPROX_RANK(PARTITION BY department ORDER BY APPROX_SUM(record_value) DESC) <= 2
ORDER BY 1, 4;

The HAVING clause limits the ranks returned by the query. This is the top-n.

SELECT department,
       record_type,
       APPROX_SUM(record_value) AS approx_sum_val,
       APPROX_RANK(PARTITION BY department ORDER BY APPROX_SUM(record_value) DESC) AS approx_rank_val
FROM   t1
GROUP BY department, record_type
HAVING APPROX_RANK(PARTITION BY department ORDER BY APPROX_SUM(record_value) DESC) <= 2
ORDER BY 1, 4;

The APPROX_SUM and APPROX_COUNT functions can accept a second parameter of 'MAX_ERROR', which causes them to display the maximum error between the actual and approximate sum or count respectively. I've not seen this return anything but "0" in the data sets I've tried.

SELECT department,
       record_type,
       APPROX_SUM(record_value) AS approx_sum_val,
       APPROX_SUM(record_value, 'MAX_ERROR') AS approx_sum_val_me,
       APPROX_RANK(PARTITION BY department ORDER BY APPROX_SUM(record_value) DESC) AS approx_rank_val
FROM   t1
GROUP BY department, record_type
HAVING APPROX_RANK(PARTITION BY department ORDER BY APPROX_SUM(record_value) DESC) <= 2
ORDER BY 1, 5;

The following sections will provide some examples of the usage.

APPROX_SUM and APPROX_RANK


In this section we will show some examples of using the APPROX_SUM and APPROX_RANK functions together.

The following example displays the top-2 record types, based on the approximate sum of the values per record type across all departments.

SELECT record_type,
       APPROX_SUM(record_value) AS approx_sum_val,
       APPROX_SUM(record_value, 'MAX_ERROR') AS approx_sum_val_me,
       APPROX_RANK(ORDER BY APPROX_SUM(record_value) DESC) AS approx_rank_val
FROM   t1
GROUP BY record_type
HAVING APPROX_RANK(ORDER BY APPROX_SUM(record_value) DESC) <= 2
ORDER BY 1, 4;

RECORD_TYPE APPROX_SUM_VAL APPROX_SUM_VAL_ME APPROX_RANK_VAL
ONE
TWO
519034.76
522157.83
0
0
2
1

The following example displays the top-2 record types in each department, based on the approximate sum of the values per record type for each department.

SELECT department,
       record_type,
       APPROX_SUM(record_value) AS approx_sum_val,
       APPROX_SUM(record_value, 'MAX_ERROR') AS approx_sum_val_me,
       APPROX_RANK(PARTITION BY department ORDER BY APPROX_SUM(record_value) DESC) AS approx_rank_val
FROM   t1
GROUP BY department, record_type
HAVING APPROX_RANK(PARTITION BY department ORDER BY APPROX_SUM(record_value) DESC) <= 2
ORDER BY 1, 5;

DEPARTMENT RECORD_TYPE APPROX_SUM_VAL PPROX_SUM_VAL_ME APPROX_RANK_VAL
10
10
20
20
30
30
40
40
TWO
FOUR
THREE
FOUR
ONE
THREE
TWO
ONE
133631
131452
131247
130162
131646
131482
131315
130750
0
0
0
0
0
0
0
0
1
2
1
2
1
2
1
2

The following example displays the top department, based on the approximate sum of the values per department for all record types.

SELECT department,
       APPROX_SUM(record_value) AS approx_sum_val,
       APPROX_SUM(record_value, 'MAX_ERROR') AS approx_sum_val_me,
       APPROX_RANK(ORDER BY APPROX_SUM(record_value) DESC) AS approx_rank_val
FROM   t1
GROUP BY department
HAVING APPROX_RANK(ORDER BY APPROX_SUM(record_value) DESC) <= 1
ORDER BY 1, 4;

DEPARTMENT APPROX_SUM_VAL PPROX_SUM_VAL_ME APPROX_RANK_VAL
30 650242.47 0 1

SQL>

The following example displays the top department for each record type, based on the approximate sum of the values per department for each record type.

SELECT record_type,
       department,
       APPROX_SUM(record_value) AS approx_sum_val,
       APPROX_SUM(record_value, 'MAX_ERROR') AS approx_sum_val_me,
       APPROX_RANK(PARTITION BY record_type ORDER BY APPROX_SUM(record_value) DESC) AS approx_rank_val
FROM   t1
GROUP BY record_type, department
HAVING APPROX_RANK(PARTITION BY record_type ORDER BY APPROX_SUM(record_value) DESC) <= 1
ORDER BY 1, 5;

The following example displays the top department for each record type, based on the approximate sum of the values per department for each record type.

SELECT record_type,
       department,
       APPROX_SUM(record_value) AS approx_sum_val,
       APPROX_SUM(record_value, 'MAX_ERROR') AS approx_sum_val_me,
       APPROX_RANK(PARTITION BY record_type ORDER BY APPROX_SUM(record_value) DESC) AS approx_rank_val
FROM   t1
GROUP BY record_type, department
HAVING APPROX_RANK(PARTITION BY record_type ORDER BY APPROX_SUM(record_value) DESC) <= 1
ORDER BY 1, 5;

RECORD_TYPE DEPARTMENT APPROX_SUM_VAL APPROX_SUM_VAL_ME APPROX_RANK_VAL
FIVE
FOUR
ONE
THREE
TWO
30
10
30
30
10
131171
131452
131646
131482
133631
0
0
0
0
0
1
1
1
1
1

APPROX_COUNT and APPROX_RANK


In this section we will show some examples of using the APPROX_COUNT and APPROX_RANK functions together.

The following example displays the top-2 record types, based on the approximate number of rows per record type across all departments.

SELECT record_type,
       APPROX_COUNT(*) AS approx_count_val,
       APPROX_COUNT(*, 'MAX_ERROR') AS approx_count_val_me,
       APPROX_RANK(ORDER BY APPROX_COUNT(*) DESC) AS approx_rank_val
FROM   t1
GROUP BY record_type
HAVING APPROX_RANK(ORDER BY APPROX_COUNT(*) DESC) <= 2
ORDER BY 1, 4;


RECORD_TYPEAPPROX_SUM_VALAPPROX_SUM_VAL_MEAPPROX_RANK_VAL
THREE
TWO
20030
20185
0
0
2
1

The following example displays the top-2 record types in each department, based on the approximate number of rows per record type in each department.

SELECT department,
       record_type,
       APPROX_COUNT(*) AS approx_count_val,
       APPROX_COUNT(*, 'MAX_ERROR') AS approx_count_val_me,
       APPROX_RANK(PARTITION BY department ORDER BY APPROX_COUNT(*) DESC) AS approx_rank_val
FROM   t1
GROUP BY department, record_type
HAVING APPROX_RANK(PARTITION BY department ORDER BY APPROX_COUNT(*) DESC) <= 2
ORDER BY 1, 5;

DEPARTMENTRECORD_TYPEAPPROX_SUM_VALPPROX_SUM_VAL_MEAPPROX_RANK_VAL
10
10
20
20
30
30
40
40
TWO
FOUR
THREE
TWO
THREE
FIVE
TWO
ONE
5160
5042
5049
5018
5115
5059
5089
5017
0
0
0
0
0
0
0
0
1
2
1
2
1
2
1
2

The following example displays the top department, based on the approximate number of rows of all record types in the department

SELECT department,
       APPROX_COUNT(*) AS approx_count_val,
       APPROX_COUNT(*, 'MAX_ERROR') AS approx_count_val_me,
       APPROX_RANK(ORDER BY APPROX_COUNT(*) DESC) AS approx_rank_val
FROM   t1
GROUP BY department
HAVING APPROX_RANK(ORDER BY APPROX_COUNT(*) DESC) <= 1
ORDER BY 1, 4;

DEPARTMENTAPPROX_SUM_VALPPROX_SUM_VAL_MEAPPROX_RANK_VAL
3025103                   01

The following example displays the top department for each record type, based on the approximate number of rows of that record type in the department

SELECT record_type,
       department,
       APPROX_COUNT(*) AS approx_count_val,
       APPROX_COUNT(*, 'MAX_ERROR') AS approx_count_val_me,
       APPROX_RANK(PARTITION BY record_type ORDER BY APPROX_COUNT(*) DESC) AS approx_rank_val
FROM   t1
GROUP BY record_type, department
HAVING APPROX_RANK(PARTITION BY record_type ORDER BY APPROX_COUNT(*) DESC) <= 1
ORDER BY 1, 5;

RECORD_TYPEDEPARTMENTAPPROX_SUM_VALAPPROX_SUM_VAL_MEAPPROX_RANK_VAL
FIVE
FOUR
ONE
THREE
TWO
30
10
30
30
10
5059
5042
5042
5115
5160
0
0
0
0
0
1
1
1
1
1

SQL>