by Kim Berg Hansen
Okay, okay, Iggy, you want one buffer get? I’ll do it 😉
Actually, here are two different versions that might be useful in different circumstances. The first one goes like this:
create materialized view log on employees
   with rowid, commit scn (employee_id, job_id, first_name, last_name, department_id)
   including new values;

create materialized view log on departments
   with rowid, commit scn (department_id, location_id)
   including new values;

create materialized view log on locations
   with rowid, commit scn (location_id, city)
   including new values;

create materialized view log on jobs
   with rowid, commit scn (job_id, job_title)
   including new values;

create materialized view log on job_history
   with rowid, commit scn (job_id, employee_id)
   including new values;

create materialized view emp_jobs_mv
   build immediate
   refresh fast on commit
   enable query rewrite
as
select e.first_name, e.last_name, j.job_title, l.location_id, l.city, 'HIST' record_type
     , j.rowid j_rowid, jh.rowid jh_rowid, e.rowid e_rowid, d.rowid d_rowid, l.rowid l_rowid
  from jobs j, job_history jh, employees e, departments d, locations l
 where jh.job_id = j.job_id
   and e.employee_id = jh.employee_id
   and d.department_id = e.department_id
   and l.location_id = d.location_id
union all
select e.first_name, e.last_name, j.job_title, l.location_id, l.city, 'CURR' record_type
     , j.rowid j_rowid, null jh_rowid, e.rowid e_rowid, d.rowid d_rowid, l.rowid l_rowid
  from jobs j, employees e, departments d, locations l
 where e.job_id = j.job_id
   and d.department_id = e.department_id
   and l.location_id = d.location_id;

create index emp_jobs_kings_presidents on emp_jobs_mv (
   case
      when first_name = 'Steven' and last_name = 'King' then 1
      when job_title = 'President' then 1
      else 0
   end
 , location_id
 , city
);

begin
   dbms_stats.gather_schema_stats(
      'HR'
    , estimate_percent => null
    , cascade          => true
   );
end;
We create a single fast-refresh-on-commit mview that contains both names and job titles (current as well as historical). Then, to make it possible to access just a single index block, we create a very specialized index with a "boolean" expression indicating whether the row belongs to King or to a President.
Now we select the data where the “boolean” expression is “true”:
select /*+ gather_plan_statistics */ distinct location_id, city
  from (
   select e.first_name, e.last_name, j.job_title, l.location_id, l.city
     from jobs j, job_history jh, employees e, departments d, locations l
    where jh.job_id = j.job_id
      and e.employee_id = jh.employee_id
      and d.department_id = e.department_id
      and l.location_id = d.location_id
   union all
   select e.first_name, e.last_name, j.job_title, l.location_id, l.city
     from jobs j, employees e, departments d, locations l
    where e.job_id = j.job_id
      and d.department_id = e.department_id
      and l.location_id = d.location_id
  )
 where case
          when first_name = 'Steven' and last_name = 'King' then 1
          when job_title = 'President' then 1
          else 0
       end = 1;
And we get the desired single buffer get:
-----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation         | Name                      | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                           |      1 |        |       |     2 (100)|          |      1 |00:00:00.01 |       1 |
|   1 |  HASH UNIQUE      |                           |      1 |      1 |    19 |     2  (50)| 00:00:01 |      1 |00:00:00.01 |       1 |
|*  2 |   INDEX RANGE SCAN| EMP_JOBS_KINGS_PRESIDENTS |      1 |      1 |    19 |     1   (0)| 00:00:01 |      1 |00:00:00.01 |       1 |
-----------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("EMP_JOBS_MV"."SYS_NC00012$"=1)
That will scale and keep queries down to a few buffer gets even for much more data – as long as there are few Kings and Presidents 😉
An alternative solution puts the predicates inside the mview, so the mview contains only the desired data and nothing more. The mview becomes less generally usable, but for specialized cases it might be OK. (In a way it is similar to creating indexes on expressions that are NULL for non-desired rows, yielding a very small index containing only the desired data.)
create materialized view kings_presidents_mv
   build immediate
   refresh fast on commit
   enable query rewrite
as
select e.first_name, e.last_name, j.job_title, l.location_id, l.city, 'HIST' record_type
     , j.rowid j_rowid, jh.rowid jh_rowid, e.rowid e_rowid, d.rowid d_rowid, l.rowid l_rowid
  from jobs j, job_history jh, employees e, departments d, locations l
 where jh.job_id = j.job_id
   and e.employee_id = jh.employee_id
   and d.department_id = e.department_id
   and l.location_id = d.location_id
   and j.job_title = 'President'
union all
select e.first_name, e.last_name, j.job_title, l.location_id, l.city, 'CURR' record_type
     , j.rowid j_rowid, null jh_rowid, e.rowid e_rowid, d.rowid d_rowid, l.rowid l_rowid
  from jobs j, employees e, departments d, locations l
 where e.job_id = j.job_id
   and d.department_id = e.department_id
   and l.location_id = d.location_id
   and (   (e.first_name = 'Steven' and e.last_name = 'King')
        or j.job_title = 'President' );

alter table kings_presidents_mv modify (
   first_name  not null,
   last_name   not null,
   job_title   not null,
   location_id not null,
   city        not null,
   record_type not null,
   j_rowid     not null,
   e_rowid     not null,
   d_rowid     not null,
   l_rowid     not null
);

create index kings_presidents_ix on kings_presidents_mv (location_id, city);

begin
   dbms_stats.gather_schema_stats(
      'HR'
    , estimate_percent => null
    , cascade          => true
   );
end;
And we can just select the entire mview index via query rewrite:
select /*+ gather_plan_statistics */ distinct location_id, city
  from (
   select l.location_id, l.city
     from jobs j, job_history jh, employees e, departments d, locations l
    where jh.job_id = j.job_id
      and e.employee_id = jh.employee_id
      and d.department_id = e.department_id
      and l.location_id = d.location_id
      and j.job_title = 'President'
   union all
   select l.location_id, l.city
     from jobs j, employees e, departments d, locations l
    where e.job_id = j.job_id
      and d.department_id = e.department_id
      and l.location_id = d.location_id
      and (   (e.first_name = 'Steven' and e.last_name = 'King')
           or j.job_title = 'President' )
  );
------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name                | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                     |      1 |        |       |     2 (100)|          |      1 |00:00:00.01 |       1 |
|   1 |  SORT UNIQUE NOSORT|                     |      1 |      1 |    11 |     2  (50)| 00:00:01 |      1 |00:00:00.01 |       1 |
|   2 |   INDEX FULL SCAN  | KINGS_PRESIDENTS_IX |      1 |      1 |    11 |     1   (0)| 00:00:01 |      1 |00:00:00.01 |       1 |
------------------------------------------------------------------------------------------------------------------------------------
No predicates are used at all – just return the contents of the index and done.
Setting the mview columns to NOT NULL to match the nullability of the master tables is one way to make that query use the index. If the columns had been nullable, we would have gotten a full table scan of the mview instead, which would be 3 buffer gets on my 12.1 instance.
An alternative to setting the mview columns to NOT NULL would be to write the query like this:
select /*+ gather_plan_statistics */ distinct location_id, city
  from (
   select l.location_id, l.city
     from jobs j, job_history jh, employees e, departments d, locations l
    where jh.job_id = j.job_id
      and e.employee_id = jh.employee_id
      and d.department_id = e.department_id
      and l.location_id = d.location_id
      and j.job_title = 'President'
   union all
   select l.location_id, l.city
     from jobs j, employees e, departments d, locations l
    where e.job_id = j.job_id
      and d.department_id = e.department_id
      and l.location_id = d.location_id
      and (   (e.first_name = 'Steven' and e.last_name = 'King')
           or j.job_title = 'President' )
  )
 where location_id is not null
   and city is not null;
That also gets us an index full scan:
------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name                | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                     |      1 |        |       |     2 (100)|          |      1 |00:00:00.01 |       1 |
|   1 |  SORT UNIQUE NOSORT|                     |      1 |      1 |    11 |     2  (50)| 00:00:01 |      1 |00:00:00.01 |       1 |
|*  2 |   INDEX FULL SCAN  | KINGS_PRESIDENTS_IX |      1 |      1 |    11 |     1   (0)| 00:00:01 |      1 |00:00:00.01 |       1 |
------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("KINGS_PRESIDENTS_MV"."LOCATION_ID" IS NOT NULL AND
               "KINGS_PRESIDENTS_MV"."CITY" IS NOT NULL))
But that requires evaluating a filter predicate 😉
Functionally correct, unbeatable performance, and easy to read; who could ask for anything more? But no one knows what criteria the judges will use. The NoCOUG Journal does not even say who will be the judges of the SQL throwdown. Bobby Flay and Alton Brown? I hope not.
Iggy, if I understand it correctly, the predicate values could change. What about the columns, meaning, could the query choose different columns to filter on (vs what is in the base query currently)?
My opinion is that allowing the predicate values or the predicates themselves to change would change the rules of the game. Kim is striving to create auxiliary structures that can handle requirements that are not in the problem statement, which is commendable, but it is a self-imposed handicap. 🙂
Oh I don’t know… Two celebrity chefs like that might bring novel ideas as judges – I could do a lasagna as my entree 😉
Kim, I wonder why don’t you just use a normal index, meaning an index that has all the columns vs the Boolean approach? As I have shown in my solution, you can still achieve just 1 buffer get with a normal index. In that way, you don’t restrict your solution to any specific predicate values…
I use the “boolean” approach because of the predicates on name OR job title. Suppose you have a million employees, 10 Steven Kings, and 10 Presidents. With a normal index you either need a full index scan, or – at best, if there are few distinct job titles – you can skip scan an index with job title as the leading column followed by first/last name. An index range scan on the “boolean” index directly gets the 20 rows out of the million. Alternatively, create two indexes on the mview – one with job title leading, one with the name leading – and let the optimizer rewrite the OR into an index range scan for the name union an index range scan for the job title.
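The two-index alternative could be sketched roughly like this (the index names are my own invention, and this assumes the emp_jobs_mv from the first solution):

```sql
-- Hypothetical sketch of the two-index alternative on emp_jobs_mv.
-- With both indexes in place, the optimizer may use OR expansion to
-- rewrite the name-OR-job-title predicate into a concatenation of
-- two index range scans instead of using one "boolean" expression index.
create index emp_jobs_name_ix  on emp_jobs_mv (last_name, first_name, location_id, city);
create index emp_jobs_title_ix on emp_jobs_mv (job_title, location_id, city);
```

Including location_id and city in both indexes would let each branch of the expanded OR be answered from the index alone, without visiting the mview blocks.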
For solutions not restricted to specific predicate values, I think I would prefer my solution with three separate mviews. Unless it is very important to lower the buffer gets even further than that solution offers, I think the ease of management outweighs the slightly higher buffer gets. These “fast refresh on commit” mviews might need multiple auxiliary indexes in order to make the fast refreshes go fast (you wouldn’t want a refresh at every commit to have to full scan the mview). And the mviews with unions in them might need special care in those indexes – probably having to include the ‘CURR’/’HIST’ record_type marker in the indexes to allow the fast refresh to be fast.
Every bit of complexity increases the chances of something going wrong, so we need to weigh the benefits of KISS (perhaps using a few more buffer gets) against the complexity needed to save just a few buffer gets. Which solution is most applicable will depend very much on the circumstances. For more general usability, allowing different predicates, my favorite would be the slightly simpler solution with three mviews.
But there can be use cases where the predicates are very specific and constitute a business rule. Consider tables Customers, CRM_Cust_Activity, CRM_Activity_Log, Cust_Account, and Cust_Offers. Say we have a business rule stating that our customer services department needs to get in touch with a customer if the customer meets any one of three conditions:
1) Has a row in CRM_Cust_Activity with Description like ‘%complaint%’ which has a child row in CRM_Activity_Log with Status=Unresolved
2) Has Cust_Account.Balance <= -100000
3) Has a row in Cust_Offers with Status=Pending
Such a case might usefully be implemented as a single materialized view, Customers_To_Contact.
If the mview itself contained the predicates, customer services could just "select * from Customers_To_Contact". (And I think I would do it like that rather than using query rewrite. Query rewrite has the advantage that if the mview isn't available, the query can still run against the master tables. But that might mean that the customer services department chokes the server… Letting the application select from the mview rather than using query rewrite will fail that application in case of mview failure, but only that – it won't kill the performance of the server. Also, letting the application select from the mview encapsulates the business rule in one place: the mview definition.)
If the mview contains all data and a "boolean" index is used, that might have the advantage that if the business rule changes slightly (say, Cust_Account.Balance <= -150000), it would only entail changing the index, not rebuilding the entire mview.
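As an illustration only – the join keys and column names below are my assumptions, since the comment gives just the table names and the three conditions – such a Customers_To_Contact mview could be sketched roughly like this (mview logs and the full set of fast-refresh requirements are omitted for brevity):

```sql
-- A hypothetical sketch: customer_id, activity_id, description, status,
-- and balance are assumed column names, not given in the text above.
-- The reason literal doubles as a UNION ALL branch marker.
create materialized view customers_to_contact
   build immediate
   refresh fast on commit
as
select c.customer_id, 'COMPLAINT' reason
     , c.rowid c_rowid, ca.rowid x_rowid, al.rowid y_rowid
  from customers c, crm_cust_activity ca, crm_activity_log al
 where ca.customer_id = c.customer_id
   and ca.description like '%complaint%'
   and al.activity_id  = ca.activity_id
   and al.status       = 'Unresolved'
union all
select c.customer_id, 'BALANCE' reason
     , c.rowid c_rowid, a.rowid x_rowid, null y_rowid
  from customers c, cust_account a
 where a.customer_id = c.customer_id
   and a.balance    <= -100000
union all
select c.customer_id, 'OFFER' reason
     , c.rowid c_rowid, o.rowid x_rowid, null y_rowid
  from customers c, cust_offers o
 where o.customer_id = c.customer_id
   and o.status      = 'Pending';
```

The application would then simply "select customer_id from customers_to_contact", with the business rule living entirely in the mview definition.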
All the solutions have some pros and cons – you'll have to weigh the benefits in each specific use case 😉
Kim, looks great. But with function-based indexes, don’t you have to gather stats like this?
exec dbms_stats.gather_table_stats(ownname=>'HR', tabname=>'EMPLOYEES', method_opt=>'FOR ALL HIDDEN COLUMNS SIZE 1');
I think this is because Oracle creates a hidden virtual column for a function-based index, and this is the only way Oracle can collect stats on it. But I could be wrong; that is how I remember it from earlier versions of Oracle.
Good point, Jimmy. Like Richard Foote writes about here:
http://richardfoote.wordpress.com/2008/12/04/function-based-indexes-and-missing-statistics-no-surprises/
I haven’t found anything to indicate it should be different in newer versions, but as you can see in my example on a 12.1 instance, I just gather schema stats with cascade => true. And the access of the hidden virtual column actually does get an estimate of E-Rows = 1, which could indicate that the stats are in place for correct selectivity. Or I am just lucky due to the relatively small size of the dataset 😉
Just a nitpicky observation: you don’t need to include the columns e.first_name, e.last_name, and j.job_title in the materialized view 🙂
Iggy