Home » RDBMS Server » Performance Tuning » Improving a "DISTINCT" query on a column with skewed data
Improving a "DISTINCT" query on a column with skewed data [message #65778] Wed, 22 December 2004 08:36 Go to next message
Shane Kaszyca
Messages: 16
Registered: May 2003
Junior Member
I am trying to perform a query for distinct values on a column in one of my tables. There are about 16000 rows in the table and about 30 distinct values total. The data distribution appears at the bottom of this message. The query is performing a full table scan. I have tried putting an index on the column but that doesn't seem to work. Basically, I am doing a SELECT DISTINCT value FROM table. Any idea how I can prevent a Full Table Scan? It's costing me 4 seconds - which I can't afford. The data is skewed to one of the values - It appears about 84 percent of the time. Regards,

Shane Kaszyca

-value-    -count-
value 1      38
value 2      113
value 3      70
value 4      436
value 5      172
value 6      48
value 7      1
value 8      1
value 9      117
value 10     1
value 11     129
value 12     13728
value 13     75
value 14     70
value 15     21
value 16     109
value 17     348
value 18     71
value 19     44
value 20     158
value 21     119
value 22     1
value 23     69
value 24     130
value 25     58
value 26     194
value 27     130
value 28     4
value 29     1
value 30     50
Re: Improving a "DISTINCT" query on a column with skewed data [message #65779 is a reply to message #65778] Wed, 22 December 2004 12:15 Go to previous messageGo to next message
Art Metzer
Messages: 2480
Registered: December 2002
Senior Member
----------------------------------------------------------------------
SQL> SELECT   description
  2  ,        qty
  3  ,        LPAD(TO_CHAR(RATIO_TO_REPORT(qty)
  4                        OVER () * 100
  5                ,       'fm90.00')
  6           ,    5)               "%AGE"
  7  FROM    (SELECT   description
  8           ,        COUNT(*)     qty
  9           FROM     t
 10           GROUP BY description)
 11  ORDER BY TO_NUMBER(SUBSTR(description,INSTR(description,' ') + 1))
 12  /
 
DESCRIPTION            QTY %AGE
--------------- ---------- -----
value 1                 38  0.23
value 2                113  0.69
value 3                 36  0.22
value 4                436  2.67
value 5                172  1.05
value 6                 48  0.29
value 7                  1  0.01
value 8                  1  0.01
value 9                117  0.72
value 10                 2  0.01
value 11               129  0.79
value 12             13728 84.18
value 13                75  0.46
value 14                34  0.21
value 15                21  0.13
value 16               109  0.67
value 17               348  2.13
value 18                71  0.44
value 19                44  0.27
value 20               158  0.97
value 21               119  0.73
value 22                 1  0.01
value 23                69  0.42
value 24                66  0.40
value 25                58  0.36
value 26               194  1.19
value 27                64  0.39
value 28                 4  0.02
value 29                 1  0.01
value 30                50  0.31
 
30 rows selected.
 
SQL>

----------------------------------------------------------------------
Doing this DISTINCT should not take four seconds. In my testing, I had no indexes, and I was able to get a response in about 60 milliseconds.

That said, I don't see how you can do a DISTINCT without doing a full table scan. I mean, think about it: if you're not doing a FTS on a DISTINCT operation, then the row(s) that you hadn't table-scanned might have held that additional, unique value (and you would get back a wrong answer). And full-scanning an index wouldn't give that much performance boost over full-scanning the table, especially if the table is skinny.

I ran tkprof on three approaches I could think of to this problem. Those results are listed below.
********************************************************************************
 
<font color=red>SELECT DISTINCT description
FROM
 t</font>
 
 
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        2      0.06       0.06         43         46         12          30
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4      0.06       0.06         43         46         12          30
 
Misses in library cache during parse: 0
Optimizer goal: CHOOSE
Parsing user id: 217
 
Rows     Row Source Operation
-------  ---------------------------------------------------
     30  SORT UNIQUE
  16307   TABLE ACCESS FULL T
 
********************************************************************************

********************************************************************************
 
<font color=#663366>SELECT description
FROM
 (SELECT description, COUNT(*) FROM t GROUP BY description)</font>
 
 
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        2      0.06       0.06          0         46         12          30
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4      0.06       0.06          0         46         12          30
 
Misses in library cache during parse: 0
Optimizer goal: CHOOSE
Parsing user id: 217
 
Rows     Row Source Operation
-------  ---------------------------------------------------
     30  SORT GROUP BY
  16307   TABLE ACCESS FULL T

********************************************************************************

********************************************************************************
 
<font color=blue>SELECT description
FROM
(
SELECT description
,      ROW_NUMBER()
       OVER (PARTITION BY description
             ORDER BY     NULL) r
FROM   t
)
WHERE  r = 1</font>
 
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        2      0.37       0.37         66         46         18          30
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4      0.37       0.37         66         46         18          30
 
Misses in library cache during parse: 0
Optimizer goal: CHOOSE
Parsing user id: 217
 
Rows     Row Source Operation
-------  ---------------------------------------------------
     30  VIEW
  16307   WINDOW SORT
  16307    TABLE ACCESS FULL T
 
********************************************************************************
As you can see, the first two approaches perform similarly in 0.06 seconds, but the third one is a throwaway.

What kind of tkprof results do you get using your data?

And remember, full table scans are not always evil, indexes are not always good.
Re: Improving a "DISTINCT" query on a column with skewed data [message #65784 is a reply to message #65779] Thu, 23 December 2004 04:29 Go to previous message
Shane Kaszyca
Messages: 16
Registered: May 2003
Junior Member
I tried tracing, it does not appear that my trace file is being created...I looked at what could be the problem, and information says that the oracle instance must have write access to the dump destination directory. Permissions are drwxrwxrwt on the dest dir, and still nothing. Any ideas?

Shane
Previous Topic: Do Remote Table use Remote indexes ?
Next Topic: procedure or function which is best related to tuning
Goto Forum:
  


Current Time: Fri Apr 19 14:07:48 CDT 2024