DBA Sensation

March 11, 2010

DBMS_STATS, METHOD_OPT and FOR ALL INDEXED COLUMNS

Filed under: [System Performance tuning] — zhefeng @ 10:14 am

Another very good article about the DBMS_STATS package:
http://structureddata.org/2008/10/14/dbms_stats-method_opt-and-for-all-indexed-columns/

I’ve written before on choosing an optimal stats gathering strategy, but I recently came across a scenario that I didn’t directly blog about and that deserves attention. As I mentioned in that previous post, one should deviate from the defaults only when there is a reason to, and only with a full understanding of that reason and the effect of the decision.

Understanding METHOD_OPT

The METHOD_OPT parameter of DBMS_STATS controls two things:

1. on which columns statistics will be collected
2. on which columns histograms will be collected (and how many buckets)

It is very important to understand #1 and how the choice of METHOD_OPT affects the collection of column statistics.
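
To make that concrete, here are a few common METHOD_OPT values and what they ask for. This is a sketch of the syntax for the cases discussed in this post, not an exhaustive list:

-- The 10g/11g default: base stats on every column; histograms only
-- where data skew and recorded column usage suggest they would help.
method_opt => 'FOR ALL COLUMNS SIZE AUTO'

-- Base stats on every column, no histograms (one bucket each).
method_opt => 'FOR ALL COLUMNS SIZE 1'

-- Base stats (and possibly histograms) only on indexed columns;
-- this is the variant this post argues against.
method_opt => 'FOR ALL INDEXED COLUMNS SIZE AUTO'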

Prerequisite: Where Do I Find Column Statistics?

Understanding where to find column statistics is vital for troubleshooting bad execution plans. These views will be the arrows in your quiver:

* USER_TAB_COL_STATISTICS
* USER_PART_COL_STATISTICS
* USER_SUBPART_COL_STATISTICS

Depending on whether the table is partitioned or subpartitioned, and on what GRANULARITY the stats were gathered with, the latter two views may or may not be populated.
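
As a quick sanity check, a query along these lines shows what, if anything, has been gathered for each column (MY_TABLE is a placeholder; substitute your own table name):

-- MY_TABLE is a placeholder table name for illustration
select column_name, num_distinct, histogram, num_buckets, last_analyzed
from   user_tab_col_statistics
where  table_name = 'MY_TABLE';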

The Bane of METHOD_OPT: FOR ALL INDEXED COLUMNS

If you are using FOR ALL INDEXED COLUMNS as part of your METHOD_OPT you probably should not be. Allow me to explain. Using METHOD_OPT=>'FOR ALL INDEXED COLUMNS SIZE AUTO' (a common METHOD_OPT I see) tells DBMS_STATS: "only gather stats on columns that participate in an index, and based on the data distribution and workload of those indexed columns decide if a histogram should be created and how many buckets it should contain". Is that really what you want? My guess is probably not. Let me work through a few examples to explain why.

I’m going to start with this table.
SQL> exec dbms_random.initialize(1);

PL/SQL procedure successfully completed.

SQL> create table t1
  2  as
  3  select
  4    column_value pk,
  5    round(dbms_random.value(1,2)) a,
  6    round(dbms_random.value(1,5)) b,
  7    round(dbms_random.value(1,10)) c,
  8    round(dbms_random.value(1,100)) d,
  9    round(dbms_random.value(1,100)) e
 10  from table(counter(1,1000000))
 11  /

Table created.

SQL> begin
  2    dbms_stats.gather_table_stats(
  3      ownname          => user ,
  4      tabname          => 'T1' ,
  5      estimate_percent => 100 ,
  6      cascade          => true);
  7  end;
  8  /

PL/SQL procedure successfully completed.

SQL> select
  2    COLUMN_NAME, NUM_DISTINCT, HISTOGRAM, NUM_BUCKETS,
  3    to_char(LAST_ANALYZED,'yyyy-dd-mm hh24:mi:ss') LAST_ANALYZED
  4  from user_tab_col_statistics
  5  where table_name='T1'
  6  /

COLUMN_NAME NUM_DISTINCT HISTOGRAM       NUM_BUCKETS LAST_ANALYZED
----------- ------------ --------------- ----------- -------------------
PK               1000000 NONE                      1 2008-13-10 18:39:51
A                      2 NONE                      1 2008-13-10 18:39:51
B                      5 NONE                      1 2008-13-10 18:39:51
C                     10 NONE                      1 2008-13-10 18:39:51
D                    100 NONE                      1 2008-13-10 18:39:51
E                    100 NONE                      1 2008-13-10 18:39:51

6 rows selected.

This six-column table contains 1,000,000 rows of randomly generated numbers (COUNTER is a pipelined table function the original author uses as a row generator; any row source producing 1,000,000 rows would do). I’ve queried USER_TAB_COL_STATISTICS to display some of the important attributes (NDV, histogram, number of buckets, etc.).

I’m now going to put an index on T1(PK), delete the stats, and regather them using two different METHOD_OPT values that each use 'FOR ALL INDEXED COLUMNS'.
SQL> create unique index PK_T1 on T1(PK);

Index created.

SQL> begin
  2    dbms_stats.delete_table_stats(user,'T1');
  3
  4    dbms_stats.gather_table_stats(
  5      ownname          => user ,
  6      tabname          => 'T1' ,
  7      estimate_percent => 100 ,
  8      method_opt       => 'for all indexed columns' ,
  9      cascade          => true);
 10  end;
 11  /

PL/SQL procedure successfully completed.

SQL> select COLUMN_NAME, NUM_DISTINCT, HISTOGRAM, NUM_BUCKETS,
  2    to_char(LAST_ANALYZED,'yyyy-dd-mm hh24:mi:ss') LAST_ANALYZED
  3  from user_tab_col_statistics
  4  where table_name='T1'
  5  /

COLUMN_NAME NUM_DISTINCT HISTOGRAM       NUM_BUCKETS LAST_ANALYZED
----------- ------------ --------------- ----------- -------------------
PK               1000000 HEIGHT BALANCED          75 2008-13-10 18:41:10

SQL> begin
  2    dbms_stats.delete_table_stats(user,'T1');
  3
  4    dbms_stats.gather_table_stats(
  5      ownname          => user ,
  6      tabname          => 'T1' ,
  7      estimate_percent => 100 ,
  8      method_opt       => 'for all indexed columns size auto' ,
  9      cascade          => true);
 10  end;
 11  /

PL/SQL procedure successfully completed.

SQL> select COLUMN_NAME, NUM_DISTINCT, HISTOGRAM, NUM_BUCKETS,
  2    to_char(LAST_ANALYZED,'yyyy-dd-mm hh24:mi:ss') LAST_ANALYZED
  3  from user_tab_col_statistics
  4  where table_name='T1'
  5  /

COLUMN_NAME NUM_DISTINCT HISTOGRAM       NUM_BUCKETS LAST_ANALYZED
----------- ------------ --------------- ----------- -------------------
PK               1000000 NONE                      1 2008-13-10 18:41:12

Notice that in both cases only column PK has stats on it; columns A, B, C, D and E have no stats collected on them at all. Also note that in the first gather, where no SIZE clause was specified, SIZE defaulted to 75 buckets.

Now one might think that is no big deal, or perhaps not even realize it is happening for lack of ever looking at the stats. Let’s see what cardinality estimates the Optimizer produces for a few scenarios.
SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5  /

  COUNT(*)
----------
    500227

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  4df0g0r99zmba, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(*) from t1 where a=1

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.24 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |  10000 |    500K|00:00:00.50 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("A"=1)

Notice the E-Rows estimate for T1. The Optimizer is estimating 10,000 rows when in reality there are 500,227. The estimate is off by more than an order of magnitude (50x). Normally, for a single-table query with one equality predicate, the cardinality would be calculated as:
number of rows in T1 * 1/NDV = 1,000,000 * 1/2 = 500,000
but in this case the estimate is 10,000. Strangely enough (or not), 10,000 is exactly 0.01 (1%) of 1,000,000. Because there are no column stats for T1.A, the Optimizer is forced to guess, and that guess is a 1% selectivity.

As you can see from the 10053 trace (below), since there are no statistics on the column, defaults are used. In this case they yield very poor cardinality estimates.

SINGLE TABLE ACCESS PATH
-----------------------------------------
BEGIN Single Table Cardinality Estimation
-----------------------------------------
  Column (#2): A(NUMBER)  NO STATISTICS (using defaults)
    AvgLen: 13.00 NDV: 31250 Nulls: 0 Density: 3.2000e-05
  Table: T1  Alias: T1
    Card: Original: 1000000  Rounded: 10000  Computed: 10000.00  Non Adjusted: 10000.00
-----------------------------------------
END Single Table Cardinality Estimation
-----------------------------------------
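
If you want to reproduce this, the output above comes from an optimizer (event 10053) trace. A minimal way to capture one looks like this; the tracefile identifier is just an illustrative label, and the statement must be hard parsed for the trace to be written:

-- label the trace file, then enable the CBO trace for this session
alter session set tracefile_identifier = 'method_opt_demo';
alter session set events '10053 trace name context forever, level 1';

-- run the statement to be traced
select count(*) from t1 where a=1;

-- turn tracing off; the file appears in USER_DUMP_DEST
alter session set events '10053 trace name context off';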

Now that I’ve demonstrated how poor the cardinality estimation was with a single equality predicate, let’s see what two equality predicates give us for a cardinality estimate.
SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5  and b=3
  6  /

  COUNT(*)
----------
    124724

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  ctq8q59qdymw6, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(*) from t1 where a=1 and b=3

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.19 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |    100 |    124K|00:00:00.25 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("A"=1 AND "B"=3))

Yikes. In this case the cardinality estimate is 100 when the actual number of rows is 124,724, a difference of over three orders of magnitude (over 1000x). Where did the 100-row estimate come from? There are two equality predicates, so the selectivity is calculated as 1% * 1%, or 0.01 * 0.01 = 0.0001, and 1,000,000 * 0.0001 = 100. Funny that. (The 1% is the default selectivity for an equality predicate without stats.)

Now let’s add a derived predicate as well and check the estimates.
SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5  and b=3
  6  and d+e > 50
  7  /

  COUNT(*)
----------
    109816

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  5x200q9rqvvfu, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(*) from t1 where a=1 and b=3
and d+e > 50

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.22 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |      5 |    109K|00:00:00.33 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("A"=1 AND "B"=3 AND "D"+"E">50))

Doh! The cardinality estimate is now 5, but the actual number of rows being returned is 109,816. Not good at all. The Optimizer estimated 5 rows because it used a default selectivity of 1% (for A=1) * 1% (for B=3) * 5% (for D+E > 50) * 1,000,000 rows. Now can you see why column statistics are very important? All it takes is a few predicates and the cardinality estimation becomes very small, very fast. Now consider this:

* What is likely to happen in a data warehouse where the queries are 5+ table joins and the fact table columns do not have indexes?
* Would the Optimizer choose the correct driving table?
* Would nested loops plans be chosen when they are really not appropriate?

Hopefully you can see where this is going. If you don’t, here is the all too common chain of events:

* Non representative (or missing) statistics lead to
* Poor cardinality estimates which leads to
* Poor access path selection which leads to
* Poor join method selection which leads to
* Poor join order selection which leads to
* Poor SQL execution times

Take 2: Using the Defaults

Now I’m going to regather stats with the default METHOD_OPT and run through the three queries again:
SQL> begin
  2    dbms_stats.delete_table_stats(user,'t1');
  3
  4    dbms_stats.gather_table_stats(
  5      ownname          => user ,
  6      tabname          => 'T1' ,
  7      estimate_percent => 100 ,
  8      degree           => 8,
  9      cascade          => true);
 10  end;
 11  /

PL/SQL procedure successfully completed.

SQL> select column_name, num_distinct, histogram, NUM_BUCKETS,
  2    to_char(LAST_ANALYZED,'yyyy-dd-mm hh24:mi:ss') LAST_ANALYZED
  3  from user_tab_col_statistics where table_name='T1'
  4  /

COLUMN_NAME NUM_DISTINCT HISTOGRAM       NUM_BUCKETS LAST_ANALYZED
----------- ------------ --------------- ----------- -------------------
PK               1000000 NONE                      1 2008-13-10 19:44:32
A                      2 FREQUENCY                 2 2008-13-10 19:44:32
B                      5 FREQUENCY                 5 2008-13-10 19:44:32
C                     10 FREQUENCY                10 2008-13-10 19:44:32
D                    100 NONE                      1 2008-13-10 19:44:32
E                    100 NONE                      1 2008-13-10 19:44:32

6 rows selected.
SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5  /

  COUNT(*)
----------
    500227

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  4df0g0r99zmba, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(*) from t1 where a=1

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.20 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |    500K|    500K|00:00:00.50 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("A"=1)
SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5  and b=3
  6  /

  COUNT(*)
----------
    124724

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  ctq8q59qdymw6, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(*) from t1 where a=1 and b=3

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.14 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |    124K|    124K|00:00:00.25 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("B"=3 AND "A"=1))
SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5  and b=3
  6  and d+e > 50
  7  /

  COUNT(*)
----------
    109816

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  5x200q9rqvvfu, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(*) from t1 where a=1 and b=3
and d+e>50

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.17 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |   6236 |    109K|00:00:00.22 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("B"=3 AND "A"=1 AND "D"+"E">50))

As you can see, the first two queries have spot-on cardinality estimates, but the third query isn’t as good, because it filters on the expression D+E and there are stats only on columns D and E individually, not on their combination. I’m going to rerun the third query with OPTIMIZER_DYNAMIC_SAMPLING set to 4 (in 10g it defaults to 2) and reevaluate the cardinality estimate.
SQL> alter session set optimizer_dynamic_sampling=4;

Session altered.

SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5  and b=3
  6  and d+e > 50
  7  /

  COUNT(*)
----------
    109816

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  5x200q9rqvvfu, child number 1
-------------------------------------
select /*+ gather_plan_statistics */ count(*) from t1 where a=1 and b=3
and d+e > 50

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.17 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |    102K|    109K|00:00:00.22 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("B"=3 AND "A"=1 AND "D"+"E">50))

Note
-----
   - dynamic sampling used for this statement
Bingo! Close enough to call statistically equivalent.
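
As an aside that goes beyond the original demonstration: on 11g and later, another way to fix the D+E estimate specifically is extended (expression) statistics, which give the Optimizer real stats on the combination instead of relying on dynamic sampling at parse time. A minimal sketch:

-- create an extension on the (d+e) expression; the function returns
-- a system-generated virtual column name (SYS_STU...)
select dbms_stats.create_extended_stats(user, 'T1', '(D+E)') from dual;

-- regather so the new hidden column picks up statistics
begin
  dbms_stats.gather_table_stats(
    ownname    => user,
    tabname    => 'T1',
    method_opt => 'for all columns size auto');
end;
/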

Summary

I hope this little exercise demonstrates how important it is to have representative statistics and that when statistics are representative the Optimizer can very often accurately estimate the cardinality and thus choose the best plan for the query. Remember these points:

* Recent statistics do not necessarily equate to representative statistics.
* Statistics are required on all columns, not just indexed columns, to yield good plans.
* You probably should not be using METHOD_OPT => 'FOR ALL INDEXED COLUMNS SIZE AUTO', especially in a data warehouse where indexes are used sparingly; see the sketch after this list for what using the defaults looks like.
* Dynamic Sampling can assist with cardinality estimates where existing stats are not enough.
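
To make that concrete, here is a minimal sketch of "just use the defaults" for this example table. In 10g and later, omitting METHOD_OPT is equivalent to specifying 'FOR ALL COLUMNS SIZE AUTO':

-- gather with the default METHOD_OPT: base stats on every column,
-- histograms only where skew and recorded column usage suggest them
begin
  dbms_stats.gather_table_stats(
    ownname => user,
    tabname => 'T1',
    cascade => true);
end;
/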
