DBA Tips Archive for Oracle |
|
Using Analytic Functions in Oracle9i
by Jeff Hunter, Sr. Database Administrator
Contents
Overview
Analytic functions are designed to address problems like:
Although most of these types of problems can be solved in other programming languages like PL/SQL and Java,
the performance is often not very effiecient. Analytic functions add extensions to the SQL language to
not only make these operations easier to code; they make them faster than could be achieved with pure
SQL or PL/SQL. These extensions are currently under review by the ANSI SQL committee for inclusion in
the SQL specification.
Analytic functions enable you to compute aggregate values for a specific
group of rows. Groups are formed using the new
As mentioned above, analytic functions compute an aggregate value based on a group
of rows. They differ from aggregate functions in that they return multiple rows
for each group. The group of rows is called a window and is defined by an
analytic clause. For each row, a sliding window of rows is defined. This
window determines the range of rows used to perform the calculations for the
current row. Window sizes can be based on either a physical number of rows
or a logical interval such as time.
Analytic functions are the last set of operations performed in a query except for the
final ORDER BY clause. All joins and all WHERE, GROUP BY, and HAVING clauses are completed
BEFORE the analytic functions are processed. Therefore, analytic functions can appear only
in the SELECT list or ORDER BY clause.
The Syntax
Specify the name of an analytic function. Oracle provides many
analytic functions such as:
Analytic functions take 0 to 3 arguments.
The PARTITION BY clause logically breaks a single result set
into N groups, according to the criteria set by the partition
expressions. The words "PARTITION" and "GROUP" are used synonymously
here. The analytic functions are applied to each group independently,
they are reset for each group.
The ORDER BY clause specifies how the data is sorted within each
group (PARTITION). This will definitely affect the outcome of
any analytic function.
The windowing clause gives us a way to define a sliding or anchored
window of data, on which the analytic function will operate, within
a group. This clause can be used to have the analytic function compute
its value based on any arbitrary sliding or anchored window within a
group.
Distribution Functions
You can use CUME_DIST as an aggregate function or as an analytic function. When you use
CUME_DIST as an aggregate function, it determines the relative position of an apparent
row of information within an existing group of rows. CUME_DIST as an analytic function
determines the relative position of a specific value within a group of values.
The following example calculates the salary percentile for each employee within each
job category excluding job categories PRESIDENT and MANAGER. In the examle below, 50% of
CLERKS have salaries less than or equal to JAMES.
Inverse Distribution Functions
The PERCENTILE_CONT function is an inverse distribution function that assumes
a continuous distribution model. It takes a percentile value and a sort
specification, and returns an interpolated value that would fall into that
percentile value with respect to the sort specification. Nulls are ignored in
the calculation.
The PERCENTILE_DISC function is an inverse distribution function that assumes
a discrete distribution model. It takes a percentile value and a sort specification
and returns an element from the set. Nulls are ignored in the calculation.
Both functions, given a percentile and a sequence, will determine the
interpolated value at that percentile. In both functions, interpolation
ignores NULL values. All interpolated values will be between 0 and 1.
Aggregate Example
The following example computes the median salary in each department:
Analytic Example - (PERCENTILE_CONT)
In the following example, the median for Department 30 is 1375, which has a
corresponding percentile (Percent_Rank) of 0.5. None of the salaries in
Department 30 have a percentile of 0.5, so the median value must be interpolated
between 2400 (percentile 0.4) and 2500 (percentile 0.6), which evaluates to 2450.
Analytic Example - (PERCENTILE_DISC)
The following example calculates the median discrete percentile of
the salary of each employee in the sample table emp.
The median value for Department 10 is 2450, which is the value whose
corresponding percentile (Cume_Dist) is the smallest value greater than
or equal to 0.5. The median value for Department 30 is 1500, which is
the value whose corresponding percentile is the smallest value greater
than or equal to 0.5.
Top-N Queries
Let's take a look at several examples.
Example 1
This example will sort the sales people by salary from greatest to least.
Give the first three rows. If there are less then three people in a department,
this will return less than three records.
Example 2
In this example, I want to see the set of sales people who make the top 3 salaries -
that is, find the set of distinct salary amounts, sort them, take the largest three,
and list everyone who makes one of those values.
The DENSE_RANK() function computes the rank of a row in an ordered
group of rows. The ranks are consecutive integers beginning with 1.
The largest rank value is the number of unique values returned by the
query. Rank values are not skipped in the event of ties. Rows with equal
values for the ranking criteria receive the same rank.
The DENSE_RANK function does not skip numbers and will assign the same
number to those rows with the same value. Hence, after the result set is
built in the inline view, we can simply select all of the rows with a dense
rank of three or less, this gives us everyone who makes the top three
salaries by department number.
Rank Functions
The RANK function is
Positional Functions
Histogram Bucket Function
Grouping Sets
SQL Scripts
Copyright (c) 1998-2012 Jeffrey M. Hunter. All rights reserved.
All articles, scripts and material located at the Internet address of http://www.idevelopment.info is the copyright of Jeffrey M. Hunter
and is protected under copyright laws of the United States. This document may not be hosted on any other site without my express,
prior, written permission. Application to host any of the material elsewhere can be made by contacting me at jhunter@idevelopment.info.
I have made every effort and taken great care in making sure that the material included on my web site is technically accurate,
but I disclaim any and all responsibility for any loss, damage or destruction of data or any other property which may arise from
relying on it. I will in no case be liable for any monetary damages arising from such loss, damage or destruction.
Oracle has made analytic functions available starting with Oracle 8i (8.1.6).
This article documents the usage and syntax surrounding the new analytic functions
found in Oracle9i.
WITHIN GROUP
clause. Enhancements related to analytic functions new to Oracle9i
are explained below. In this section of the document, all examples use the
all famous EMP / DEPT tables. I included the DDL script (scott.sql)
to create these tables in the SQL Scripts
repository.
Analytic-Function(<Argument1>,<Argument2>,...)
OVER (
<Query-Partition-Clause>
<Order-By-Clause>
<Windowing-Clause>
)
The first analytical function to look at is the cumulative distribution function (CUME_DIST).
Cumulative distribution is an analytic function that computes the relative position of a
specified value in a group of values. For a row R, assuming ascending ordering, the
CUME_DIST of R is the number of rows with values lower than or equal to the value of R
divided by the number of rows being evaluated (the entire query result set or a partition).
The range of values returned by CUME_DIST is >0 to <=1. Tie values always evaluate to the
same cumulative distribution value.
SQL> SELECT job, ename, sal, CUME_DIST()
OVER (PARTITION BY job ORDER BY sal) AS cume_dist
FROM emp
WHERE job NOT IN ('MANAGER', 'PRESIDENT');
JOB ENAME SAL CUME_DIST
--------- ---------- ---------- ----------
ANALYST SCOTT 3000 1
ANALYST FORD 3000 1
CLERK SMITH 800 .25
CLERK JAMES 950 .5
CLERK ADAMS 1100 .75
CLERK MILLER 1300 1
SALESMAN WARD 1250 .5
SALESMAN MARTIN 1250 .5
SALESMAN TURNER 1500 .75
SALESMAN ALLEN 1600 1
10 rows selected.
Oracle9i introduces the PERCENTILE_CONT and PERCENTILE_DIST which are
both inverse distribution functions.
SQL> SELECT deptno,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sal DESC)
"Median cont"
, PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY sal DESC)
"Median disc"
FROM emp
GROUP BY deptno;
DEPTNO Median cont Median disc
---------- ----------- -----------
10 2450 2450
20 2975 2975
30 1375 1500
SQL> SELECT
ename
, sal
, deptno
, PERCENTILE_CONT(0.5)
WITHIN GROUP (ORDER BY sal DESC)
OVER (PARTITION BY deptno) "Percentile_Cont"
, PERCENT_RANK()
OVER (PARTITION BY deptno
ORDER BY sal DESC) "Percent_Rank"
FROM emp WHERE deptno IN (10, 30);
ENAME SAL DEPTNO Percentile_Cont Percent_Rank
---------- ---------- ---------- --------------- ------------
KING 5000 10 2450 0
CLARK 2450 10 2450 .5
MILLER 1300 10 2450 1
BLAKE 2850 30 1375 0
ALLEN 1600 30 1375 .2
TURNER 1500 30 1375 .4
WARD 1250 30 1375 .6
MARTIN 1250 30 1375 .6
JAMES 950 30 1375 1
9 rows selected.
SQL> SELECT
ename
, sal
, deptno
, PERCENTILE_DISC(0.5)
WITHIN GROUP (ORDER BY sal DESC)
OVER (PARTITION BY deptno) "Percentile_Disc"
, CUME_DIST()
OVER (PARTITION BY deptno
ORDER BY sal DESC) "Cume_Dist"
FROM emp where deptno in (10, 30);
ENAME SAL DEPTNO Percentile_Disc Cume_Dist
---------- ---------- ---------- --------------- ----------
KING 5000 10 2450 .333333333
CLARK 2450 10 2450 .666666667
MILLER 1300 10 2450 1
BLAKE 2850 30 1500 .166666667
ALLEN 1600 30 1500 .333333333
TURNER 1500 30 1500 .5
WARD 1250 30 1500 .833333333
MARTIN 1250 30 1500 .833333333
JAMES 950 30 1500 1
9 rows selected.
Using Top-N Queries, you are able to get the Top-N records by some set of fields.
set autotrace on explain
break on deptno skip 1
SELECT *
FROM (
SELECT
deptno
, ename
, sal
, ROW_NUMBER()
OVER (
PARTITION BY deptno ORDER BY sal DESC
) Top3 FROM emp
)
WHERE Top3 <= 3;
DEPTNO ENAME SAL TOP3
---------- ---------- ---------- ----------
10 KING 5000 1
CLARK 2450 2
MILLER 1300 3
20 SCOTT 3000 1
FORD 3000 2
JONES 2975 3
30 BLAKE 2850 1
ALLEN 1600 2
TURNER 1500 3
9 rows selected.
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE
1 0 VIEW
2 1 WINDOW (SORT PUSHED RANK)
3 2 TABLE ACCESS (FULL) OF 'EMP'
This query works by sorting each partition (or group, which is the deptno),
in a descending order, based on the salary column and then assigning a
sequential row number to each row in the group as it is processed. The use of
a WHERE clause after doing this to get just the first three rows in each partition.
SELECT *
FROM (
SELECT
deptno
, ename
, sal
,DENSE_RANK()
OVER (
PARTITION BY deptno ORDER BY sal desc
) TopN FROM emp
)
WHERE TopN <= 3
ORDER BY deptno, sal DESC;
DEPTNO ENAME SAL TOPN
---------- ---------- ---------- ----------
10 KING 5000 1
CLARK 2450 2
MILLER 1300 3
20 SCOTT 3000 1 <--- !
FORD 3000 1 <--- !
JONES 2975 2
ADAMS 1100 3
30 BLAKE 2850 1
ALLEN 1600 2
30 TURNER 1500 3
10 rows selected.
Execution Plan
--------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE
1 0 VIEW
2 1 WINDOW (SORT PUSHED RANK)
3 2 TABLE ACCESS (FULL) OF 'EMP'
In the example above, the DENSE_RANK function was used to get
the top three salaries. We assigned the dense rank to the salary
column and sorted it in a descending order.
The rank function was actually introduced in Oracle8i, but has
been enhanced in Oracle9i to now allow functions to accept
a parameter to specify the analysis.




SQL Scripts
scott.sql
SQL script that can be used to create the EMP / DEPT tables used in this article.
Saturday, 18-Sep-2010 17:53:12 EDT
Page Count: 34331