DBA Tips Archive for Oracle

  


Using Analytic Functions in Oracle9i

by Jeff Hunter, Sr. Database Administrator


Contents

  1. Overview
  2. Distribution Functions
  3. Inverse Distribution Functions
  4. Top-N Queries
  5. Rank Functions
  6. Positional Functions
  7. Histogram Bucket Function
  8. Grouping Sets
  9. SQL Scripts



Overview

Oracle has made analytic functions available starting with Oracle 8i (8.1.6). This article documents the usage and syntax surrounding the new analytic functions found in Oracle9i.

Analytic functions are designed to address problems like:

Although most of these types of problems can be solved in other programming languages like PL/SQL and Java, the performance is often not very effiecient. Analytic functions add extensions to the SQL language to not only make these operations easier to code; they make them faster than could be achieved with pure SQL or PL/SQL. These extensions are currently under review by the ANSI SQL committee for inclusion in the SQL specification.

Analytic functions enable you to compute aggregate values for a specific group of rows. Groups are formed using the new WITHIN GROUP clause. Enhancements related to analytic functions new to Oracle9i are explained below. In this section of the document, all examples use the all famous EMP / DEPT tables. I included the DDL script (scott.sql) to create these tables in the SQL Scripts repository.

As mentioned above, analytic functions compute an aggregate value based on a group of rows. They differ from aggregate functions in that they return multiple rows for each group. The group of rows is called a window and is defined by an analytic clause. For each row, a sliding window of rows is defined. This window determines the range of rows used to perform the calculations for the current row. Window sizes can be based on either a physical number of rows or a logical interval such as time.

Analytic functions are the last set of operations performed in a query except for the final ORDER BY clause. All joins and all WHERE, GROUP BY, and HAVING clauses are completed BEFORE the analytic functions are processed. Therefore, analytic functions can appear only in the SELECT list or ORDER BY clause.

The Syntax

Analytic-Function(<Argument1>,<Argument2>,...)
OVER (
  <Query-Partition-Clause>
  <Order-By-Clause>
  <Windowing-Clause>
)



Distribution Functions

The first analytical function to look at is the cumulative distribution function (CUME_DIST). Cumulative distribution is an analytic function that computes the relative position of a specified value in a group of values. For a row R, assuming ascending ordering, the CUME_DIST of R is the number of rows with values lower than or equal to the value of R divided by the number of rows being evaluated (the entire query result set or a partition). The range of values returned by CUME_DIST is >0 to <=1. Tie values always evaluate to the same cumulative distribution value.

You can use CUME_DIST as an aggregate function or as an analytic function. When you use CUME_DIST as an aggregate function, it determines the relative position of an apparent row of information within an existing group of rows. CUME_DIST as an analytic function determines the relative position of a specific value within a group of values.

The following example calculates the salary percentile for each employee within each job category excluding job categories PRESIDENT and MANAGER. In the examle below, 50% of CLERKS have salaries less than or equal to JAMES.

SQL> SELECT job, ename, sal, CUME_DIST()
     OVER (PARTITION BY job ORDER BY sal) AS cume_dist
     FROM emp
     WHERE job NOT IN ('MANAGER', 'PRESIDENT'); 


JOB       ENAME             SAL  CUME_DIST
--------- ---------- ---------- ----------
ANALYST   SCOTT            3000          1
ANALYST   FORD             3000          1
CLERK     SMITH             800        .25
CLERK     JAMES             950         .5
CLERK     ADAMS            1100        .75
CLERK     MILLER           1300          1
SALESMAN  WARD             1250         .5
SALESMAN  MARTIN           1250         .5
SALESMAN  TURNER           1500        .75
SALESMAN  ALLEN            1600          1

10 rows selected.



Inverse Distribution Functions

Oracle9i introduces the PERCENTILE_CONT and PERCENTILE_DIST which are both inverse distribution functions.

The PERCENTILE_CONT function is an inverse distribution function that assumes a continuous distribution model. It takes a percentile value and a sort specification, and returns an interpolated value that would fall into that percentile value with respect to the sort specification. Nulls are ignored in the calculation.

The PERCENTILE_DISC function is an inverse distribution function that assumes a discrete distribution model. It takes a percentile value and a sort specification and returns an element from the set. Nulls are ignored in the calculation.

Both functions, given a percentile and a sequence, will determine the interpolated value at that percentile. In both functions, interpolation ignores NULL values. All interpolated values will be between 0 and 1.

Aggregate Example

The following example computes the median salary in each department:

SQL> SELECT deptno,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sal DESC)
          "Median cont"
      , PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY sal DESC)
          "Median disc"
    FROM emp
        GROUP BY deptno;


    DEPTNO Median cont Median disc
---------- ----------- -----------
        10        2450        2450
        20        2975        2975
        30        1375        1500

Analytic Example - (PERCENTILE_CONT)

In the following example, the median for Department 30 is 1375, which has a corresponding percentile (Percent_Rank) of 0.5. None of the salaries in Department 30 have a percentile of 0.5, so the median value must be interpolated between 2400 (percentile 0.4) and 2500 (percentile 0.6), which evaluates to 2450.

SQL> SELECT
         ename
       , sal
       , deptno
       , PERCENTILE_CONT(0.5)
             WITHIN GROUP (ORDER BY sal DESC)
             OVER (PARTITION BY deptno) "Percentile_Cont"
       , PERCENT_RANK()
             OVER (PARTITION BY deptno 
                   ORDER BY sal DESC)   "Percent_Rank"
      FROM emp WHERE deptno IN (10, 30);


ENAME             SAL     DEPTNO Percentile_Cont Percent_Rank
---------- ---------- ---------- --------------- ------------
KING             5000         10            2450            0
CLARK            2450         10            2450           .5
MILLER           1300         10            2450            1
BLAKE            2850         30            1375            0
ALLEN            1600         30            1375           .2
TURNER           1500         30            1375           .4
WARD             1250         30            1375           .6
MARTIN           1250         30            1375           .6
JAMES             950         30            1375            1

9 rows selected.

Analytic Example - (PERCENTILE_DISC)

The following example calculates the median discrete percentile of the salary of each employee in the sample table emp.

The median value for Department 10 is 2450, which is the value whose corresponding percentile (Cume_Dist) is the smallest value greater than or equal to 0.5. The median value for Department 30 is 1500, which is the value whose corresponding percentile is the smallest value greater than or equal to 0.5.

SQL> SELECT
         ename
       , sal
       , deptno
       , PERCENTILE_DISC(0.5)
             WITHIN GROUP (ORDER BY sal DESC)
             OVER (PARTITION BY deptno) "Percentile_Disc"
       , CUME_DIST()
             OVER (PARTITION BY deptno 
                   ORDER BY sal DESC) "Cume_Dist"
     FROM emp where deptno in (10, 30);


ENAME             SAL     DEPTNO Percentile_Disc  Cume_Dist
---------- ---------- ---------- --------------- ----------
KING             5000         10            2450 .333333333
CLARK            2450         10            2450 .666666667
MILLER           1300         10            2450          1
BLAKE            2850         30            1500 .166666667
ALLEN            1600         30            1500 .333333333
TURNER           1500         30            1500         .5
WARD             1250         30            1500 .833333333
MARTIN           1250         30            1500 .833333333
JAMES             950         30            1500          1

9 rows selected.



Top-N Queries

Using Top-N Queries, you are able to get the Top-N records by some set of fields.

Let's take a look at several examples.

Example 1

This example will sort the sales people by salary from greatest to least. Give the first three rows. If there are less then three people in a department, this will return less than three records.

set autotrace on explain
break on deptno skip 1

SELECT *
FROM (
  SELECT
      deptno
    , ename
    , sal
    , ROW_NUMBER()
  OVER (
    PARTITION BY deptno ORDER BY sal DESC
  ) Top3 FROM emp
)
WHERE Top3 <= 3;


    DEPTNO ENAME             SAL       TOP3
---------- ---------- ---------- ----------
        10 KING             5000          1
           CLARK            2450          2
           MILLER           1300          3

        20 SCOTT            3000          1
           FORD             3000          2
           JONES            2975          3

        30 BLAKE            2850          1
           ALLEN            1600          2
           TURNER           1500          3

9 rows selected.


Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE
   1    0   VIEW
   2    1     WINDOW (SORT PUSHED RANK)
   3    2       TABLE ACCESS (FULL) OF 'EMP'
This query works by sorting each partition (or group, which is the deptno), in a descending order, based on the salary column and then assigning a sequential row number to each row in the group as it is processed. The use of a WHERE clause after doing this to get just the first three rows in each partition.

Example 2

In this example, I want to see the set of sales people who make the top 3 salaries - that is, find the set of distinct salary amounts, sort them, take the largest three, and list everyone who makes one of those values.

SELECT *
FROM (
  SELECT
      deptno
    , ename
    , sal
    ,DENSE_RANK()
  OVER (
    PARTITION BY deptno ORDER BY sal desc
  ) TopN FROM emp
)
WHERE TopN <= 3
ORDER BY deptno, sal DESC;

    DEPTNO ENAME             SAL       TOPN
---------- ---------- ---------- ----------
        10 KING             5000          1
           CLARK            2450          2
           MILLER           1300          3

        20 SCOTT            3000          1  <--- !
           FORD             3000          1  <--- !
           JONES            2975          2
           ADAMS            1100          3

        30 BLAKE            2850          1
           ALLEN            1600          2
        30 TURNER           1500          3


10 rows selected.

Execution Plan
--------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE
   1    0   VIEW
   2    1     WINDOW (SORT PUSHED RANK)
   3    2       TABLE ACCESS (FULL) OF 'EMP'
In the example above, the DENSE_RANK function was used to get the top three salaries. We assigned the dense rank to the salary column and sorted it in a descending order.

The DENSE_RANK() function computes the rank of a row in an ordered group of rows. The ranks are consecutive integers beginning with 1. The largest rank value is the number of unique values returned by the query. Rank values are not skipped in the event of ties. Rows with equal values for the ranking criteria receive the same rank.

The DENSE_RANK function does not skip numbers and will assign the same number to those rows with the same value. Hence, after the result set is built in the inline view, we can simply select all of the rows with a dense rank of three or less, this gives us everyone who makes the top three salaries by department number.



Rank Functions

The rank function was actually introduced in Oracle8i, but has been enhanced in Oracle9i to now allow functions to accept a parameter to specify the analysis.

The RANK function is




Positional Functions




Histogram Bucket Function




Grouping Sets




SQL Scripts

SQL Scripts
    scott.sql
SQL script that can be used to create the EMP / DEPT tables used in this article.


Copyright (c) 1998-2017 Jeffrey M. Hunter. All rights reserved.

All articles, scripts and material located at the Internet address of http://www.idevelopment.info is the copyright of Jeffrey M. Hunter and is protected under copyright laws of the United States. This document may not be hosted on any other site without my express, prior, written permission. Application to host any of the material elsewhere can be made by contacting me at jhunter@idevelopment.info.

I have made every effort and taken great care in making sure that the material included on my web site is technically accurate, but I disclaim any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on it. I will in no case be liable for any monetary damages arising from such loss, damage or destruction.

Last modified on
Saturday, 18-Sep-2010 17:53:12 EDT
Page Count: 45659