Northern California Oracle Users Group


SQL Mini-challenge entry: Kim Berg Hansen v6

by Kim Berg Hansen

Okay, okay, Iggy, you want one buffer get? I’ll do it 😉

Actually, I have two different versions that might be useful in different circumstances. The first one goes like this:

create materialized view log on employees with rowid, commit scn
   (employee_id, job_id, first_name, last_name, department_id)
   including new values;
create materialized view log on departments with rowid, commit scn
   (department_id, location_id)
   including new values;
create materialized view log on locations with rowid, commit scn
   (location_id, city)
   including new values;
create materialized view log on jobs with rowid, commit scn
   (job_id, job_title)
   including new values;
create materialized view log on job_history with rowid, commit scn
   (job_id, employee_id)
   including new values;

create materialized view emp_jobs_mv
build immediate
refresh fast on commit
enable query rewrite
as
select e.first_name, e.last_name, j.job_title, l.location_id, l.city, 'HIST' record_type
     , j.rowid j_rowid, jh.rowid jh_rowid, e.rowid e_rowid, d.rowid d_rowid, l.rowid l_rowid
  from jobs j, job_history jh, employees e, departments d, locations l
 where jh.job_id = j.job_id
   and e.employee_id = jh.employee_id
   and d.department_id = e.department_id
   and l.location_id = d.location_id
union all
select e.first_name, e.last_name, j.job_title, l.location_id, l.city, 'CURR' record_type
     , j.rowid j_rowid, null jh_rowid, e.rowid e_rowid, d.rowid d_rowid, l.rowid l_rowid
  from jobs j, employees e, departments d, locations l
 where e.job_id = j.job_id
   and d.department_id = e.department_id
   and l.location_id = d.location_id;

create index emp_jobs_kings_presidents on emp_jobs_mv (
   case
      when first_name = 'Steven' and last_name = 'King' then 1
      when job_title = 'President' then 1
      else 0
   end
 , location_id
 , city
);

begin
   dbms_stats.gather_schema_stats(
      'HR'
    , estimate_percent => null
    , cascade => true
   );
end;
/

We create a single fast-refresh-on-commit mview that contains both names and job titles (current as well as historical). Then, to make it possible to access just a single index block, we create a very specialized index whose leading column is a “boolean” expression indicating whether the row is a King or a President.

Now we select the data where the “boolean” expression is “true”:

select /*+ gather_plan_statistics */
       distinct location_id, city
  from (
   select e.first_name, e.last_name, j.job_title, l.location_id, l.city
     from jobs j, job_history jh, employees e, departments d, locations l
    where jh.job_id = j.job_id
      and e.employee_id = jh.employee_id
      and d.department_id = e.department_id
      and l.location_id = d.location_id
   union all
   select e.first_name, e.last_name, j.job_title, l.location_id, l.city
     from jobs j, employees e, departments d, locations l
    where e.job_id = j.job_id
      and d.department_id = e.department_id
      and l.location_id = d.location_id
  )
 where case
          when first_name = 'Steven' and last_name = 'King' then 1
          when job_title = 'President' then 1
          else 0
       end = 1;

And we get the desired single buffer get:

-----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation         | Name                      | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                           |      1 |        |       |     2 (100)|          |      1 |00:00:00.01 |       1 |
|   1 |  HASH UNIQUE      |                           |      1 |      1 |    19 |     2  (50)| 00:00:01 |      1 |00:00:00.01 |       1 |
|*  2 |   INDEX RANGE SCAN| EMP_JOBS_KINGS_PRESIDENTS |      1 |      1 |    19 |     1   (0)| 00:00:01 |      1 |00:00:00.01 |       1 |
-----------------------------------------------------------------------------------------------------------------------------------------
                                                                                                                                         
Predicate Information (identified by operation id):                                                                                      
---------------------------------------------------                                                                                      
                                                                                                                                         
   2 - access("EMP_JOBS_MV"."SYS_NC00012$"=1)                                                                                            

That will scale and keep queries down to a few buffer gets even for much more data – as long as there are few Kings and Presidents 😉

The alternative solution puts the predicates inside the mview, so the mview contains the desired data and nothing more. The mview becomes less generally usable, but for specialized cases it may be OK. (In a way it is similar to creating indexes on expressions that are NULL for undesired rows, which yields a very small index containing only the desired data.)

create materialized view kings_presidents_mv
build immediate
refresh fast on commit
enable query rewrite
as
select e.first_name, e.last_name, j.job_title, l.location_id, l.city, 'HIST' record_type
     , j.rowid j_rowid, jh.rowid jh_rowid, e.rowid e_rowid, d.rowid d_rowid, l.rowid l_rowid
  from jobs j, job_history jh, employees e, departments d, locations l
 where jh.job_id = j.job_id
   and e.employee_id = jh.employee_id
   and d.department_id = e.department_id
   and l.location_id = d.location_id
   and j.job_title = 'President'
union all
select e.first_name, e.last_name, j.job_title, l.location_id, l.city, 'CURR' record_type
     , j.rowid j_rowid, null jh_rowid, e.rowid e_rowid, d.rowid d_rowid, l.rowid l_rowid
  from jobs j, employees e, departments d, locations l
 where e.job_id = j.job_id
   and d.department_id = e.department_id
   and l.location_id = d.location_id
   and ( (e.first_name = 'Steven' and e.last_name = 'King')
         or j.job_title = 'President' );

alter table kings_presidents_mv modify
(
  first_name   not null,
  last_name    not null,
  job_title    not null,
  location_id  not null,
  city         not null,
  record_type  not null,
  j_rowid      not null,
  e_rowid      not null,
  d_rowid      not null,
  l_rowid      not null
);

create index kings_presidents_ix on kings_presidents_mv (
   location_id, city
);

begin
   dbms_stats.gather_schema_stats(
      'HR'
    , estimate_percent => null
    , cascade => true
   );
end;
/

And we can just select the entire mview index via query rewrite:

select /*+ gather_plan_statistics */
       distinct location_id, city
  from (
   select l.location_id, l.city
     from jobs j, job_history jh, employees e, departments d, locations l
    where jh.job_id = j.job_id
      and e.employee_id = jh.employee_id
      and d.department_id = e.department_id
      and l.location_id = d.location_id
      and j.job_title = 'President'
   union all
   select l.location_id, l.city
     from jobs j, employees e, departments d, locations l
    where e.job_id = j.job_id
      and d.department_id = e.department_id
      and l.location_id = d.location_id
      and ( (e.first_name = 'Steven' and e.last_name = 'King')
            or j.job_title = 'President' )
  );
------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name                | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                     |      1 |        |       |     2 (100)|          |      1 |00:00:00.01 |       1 |
|   1 |  SORT UNIQUE NOSORT|                     |      1 |      1 |    11 |     2  (50)| 00:00:01 |      1 |00:00:00.01 |       1 |
|   2 |   INDEX FULL SCAN  | KINGS_PRESIDENTS_IX |      1 |      1 |    11 |     1   (0)| 00:00:01 |      1 |00:00:00.01 |       1 |
------------------------------------------------------------------------------------------------------------------------------------

No predicates are used at all: just return the contents of the index, and we are done.

Setting the mview columns to NOT NULL to match the nullability of the master tables is one way to make that query use the index. If the columns had been left nullable, we would have gotten a full table scan of the mview instead, which would be 3 buffer gets on my 12.1 instance.

An alternative to setting the mview columns to NOT NULL would be to write the query like this:

select /*+ gather_plan_statistics */
       distinct location_id, city
  from (
   select l.location_id, l.city
     from jobs j, job_history jh, employees e, departments d, locations l
    where jh.job_id = j.job_id
      and e.employee_id = jh.employee_id
      and d.department_id = e.department_id
      and l.location_id = d.location_id
      and j.job_title = 'President'
   union all
   select l.location_id, l.city
     from jobs j, employees e, departments d, locations l
    where e.job_id = j.job_id
      and d.department_id = e.department_id
      and l.location_id = d.location_id
      and ( (e.first_name = 'Steven' and e.last_name = 'King')
            or j.job_title = 'President' )
  )
 where location_id is not null
   and city is not null;

This also gets us an index full scan:

------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name                | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                     |      1 |        |       |     2 (100)|          |      1 |00:00:00.01 |       1 |
|   1 |  SORT UNIQUE NOSORT|                     |      1 |      1 |    11 |     2  (50)| 00:00:01 |      1 |00:00:00.01 |       1 |
|*  2 |   INDEX FULL SCAN  | KINGS_PRESIDENTS_IX |      1 |      1 |    11 |     1   (0)| 00:00:01 |      1 |00:00:00.01 |       1 |
------------------------------------------------------------------------------------------------------------------------------------
                                                                                                                                    
Predicate Information (identified by operation id):                                                                                 
---------------------------------------------------                                                                                 
                                                                                                                                    
   2 - filter(("KINGS_PRESIDENTS_MV"."LOCATION_ID" IS NOT NULL AND "KINGS_PRESIDENTS_MV"."CITY" IS NOT NULL))                       

But that requires evaluating a filter predicate 😉


9 Comments

  1. Iggy Fernandez says:

    Functionally correct, unbeatable performance, and easy to read; who could ask for anything more? But no one knows what criteria the judges will use. The NoCOUG Journal does not even say who will be the judges of the SQL throwdown. Bobby Flay and Alton Brown? I hope not.

    Like

    • vbarun says:

      Iggy, if I understand it correctly, the predicate values could change. What about the columns, meaning, could the query choose different columns to filter on (vs what is in the base query currently)?

      Like

      • Iggy Fernandez says:

My opinion is that allowing the predicate values or the predicates to change would change the rules of the game. Kim is striving to create auxiliary structures that can handle requirements that are not in the problem statement, which is commendable, but it is a self-imposed handicap. 🙂

        Like

    • Oh I don’t know… Two celebrity chefs like that might bring novel ideas as judges – I could do a lasagna as my entree 😉

      Like

  2. vbarun says:

Kim, I wonder why you don’t just use a normal index, meaning an index that has all the columns, vs the Boolean approach? As I have shown in my solution, you can still achieve just 1 buffer get with a normal index. That way, you don’t restrict your solution to any specific predicate values…

    Like

I use the “boolean” approach because of the predicates on name OR job title. Suppose you have a million employees with 10 Steven Kings and 10 Presidents. With a normal index you either need a full index scan or, at best, if there are few distinct job titles, a skip scan of an index with job title as the leading column followed by first/last name. With the “boolean” approach, an index range scan directly gets the 20 rows out of the million. Alternatively, create two indexes on the mview – one with job title leading, one with name leading – and let the optimizer rewrite the OR into an index range scan for name union an index range scan for job title.
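A sketch of that two-index alternative might look like this (the index names and column orderings are illustrative, not part of my original solution):

```sql
-- Sketch of the two-index alternative on emp_jobs_mv.
-- One index leads with job_title, the other with the name columns, so the
-- optimizer can OR-expand the query into two index range scans concatenated
-- together, each picking up location_id and city from the index itself.
create index emp_jobs_mv_title_ix on emp_jobs_mv (job_title, location_id, city);
create index emp_jobs_mv_name_ix  on emp_jobs_mv (last_name, first_name, location_id, city);
```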

For solutions not restricted to specific predicate values, I think I would prefer my solution with three separate mviews. Unless it is very important to lower the buffer gets even more than that solution offers, I think the ease of management outweighs the slightly higher buffer gets. These “fast refresh on commit” mviews might need multiple “auxiliary” indexes in order to make the fast refreshes go fast (you wouldn’t want a refresh at every commit to have to full-table-scan the mview). And the mviews with unions in them might need special care in those indexes, probably including the ‘CURR’/’HIST’ record_type marker so that the fast refresh can be fast.
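For example, such auxiliary indexes might look something like this – a sketch only; which rowid columns actually need indexing depends on what the refresh mechanism touches:

```sql
-- Hypothetical auxiliary indexes to speed up fast refresh of emp_jobs_mv:
-- one per master-table rowid, with the record_type marker included so a
-- refresh can locate just the 'CURR' or 'HIST' rows affected by a change
-- instead of scanning the whole mview.
create index emp_jobs_mv_e_ix  on emp_jobs_mv (e_rowid, record_type);
create index emp_jobs_mv_j_ix  on emp_jobs_mv (j_rowid, record_type);
create index emp_jobs_mv_jh_ix on emp_jobs_mv (jh_rowid, record_type);
```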

Every bit of complexity increases the chances of something going wrong, so we need to weigh the benefits of KISS, using perhaps a few more buffer gets, against the complexity needed to get just a few fewer buffer gets. Which solution is most applicable depends very much on the circumstances. For more general usability, allowing the use of different predicates, my favorite would be the slightly simpler solution with three mviews.

But there can be use cases where the predicates are very specific and constitute a business rule. Consider tables Customers, CRM_Cust_Activity, CRM_Activity_Log, Cust_Account and Cust_Offers. Say we have a business rule stating that our customer services department needs to get in touch with a customer if the customer meets any one of three conditions:
      1) Has a row in CRM_Cust_Activity with Description like ‘%complaint%’ which has a child row in CRM_Activity_Log with Status=Unresolved
      2) Has Cust_Account.Balance <= -100000
      3) Has a row in Cust_Offers with Status=Pending
      Such a case might usefully implement a single materialized view Customers_To_Contact.
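A sketch of such a mview might look like this. The join columns (customer_id, activity_id) are assumptions, and a real fast-refresh-on-commit version would also need the rowid columns and mview logs shown in the examples above:

```sql
-- Sketch: one mview per business rule, one UNION ALL branch per condition.
create materialized view customers_to_contact as
select c.customer_id, 'UNRESOLVED_COMPLAINT' reason
  from customers c, crm_cust_activity ca, crm_activity_log al
 where ca.customer_id = c.customer_id
   and ca.description like '%complaint%'
   and al.activity_id  = ca.activity_id
   and al.status       = 'Unresolved'
union all
select c.customer_id, 'LOW_BALANCE' reason
  from customers c, cust_account a
 where a.customer_id = c.customer_id
   and a.balance    <= -100000
union all
select c.customer_id, 'PENDING_OFFER' reason
  from customers c, cust_offers o
 where o.customer_id = c.customer_id
   and o.status      = 'Pending';
```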

If the mview itself contained the predicates, customer services could just "select * from Customers_To_Contact". (And I think I would do it like that rather than using query rewrite. Query rewrite has the advantage that if the mview isn't available, the query can still run from the master tables. But that might mean that the customer services department chokes the server… Letting the application select from the mview rather than using query rewrite will fail that application in case of mview failure, but only that – it will not kill the performance of the server. Also, letting the application select from the mview encapsulates the business rule in one place: the mview definition.)

If the mview contains all the data and a "boolean" index is used, that has the advantage that if the business rule changes slightly (say Cust_Account.Balance <= -150000), it would only entail changing the index, not rebuilding the entire mview.
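For instance, supposing an all-data mview (here hypothetically called cust_all_data_mv) exposes the columns the rule needs, the rule change would only mean recreating one index:

```sql
-- Hypothetical: the changed rule lives in the function-based index alone;
-- the mview holding all the customer data stays untouched. Column names
-- (complaint_status, balance, offer_status) are illustrative.
create index cust_to_contact_ix on cust_all_data_mv (
   case
      when complaint_status = 'Unresolved' then 1
      when balance <= -150000              then 1
      when offer_status = 'Pending'        then 1
      else 0
   end
 , customer_id
);
```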

      All the solutions have some pros and cons – you'll have to weigh the benefits in each specific use case 😉

      Like

  3. jimmybrock says:

Kim, looks great. But with function-based indexes don’t you have to gather stats in this manner?

exec dbms_stats.gather_table_stats(ownname=>'HR', tabname=>'EMPLOYEES', method_opt=>'FOR ALL HIDDEN COLUMNS SIZE 1');

I think this is because Oracle creates a hidden virtual column for a function-based index, and this is the only way Oracle can collect stats on it. But I could be wrong; that is how I remember it from earlier versions of Oracle.

    Like

  4. Iggy Fernandez says:

    Just a nitpicky observation that you don’t need to include the columns e.first_name, e.last_name and j.job_title in the materialized view 🙂

    Iggy

    Like
