Tuesday, August 4, 2009

Monotonicity in Contiguous Perfstat Snapshots

Oracle statspack is a well respected performance monitoring, or should I say, performance capturing and reporting tool. It is usually a backdrop reporting system that is set up with little effort to start capturing system performance. Many different things are captured, but, the data I'm particularly interested in for this blog, is the sql summary data.

The default user, into which the statspack objects are stored is 'perfstat'. There are a number of different tables in this schema, all of which relate to the saving of data thay you can normally see just floating around in the v$ views. One of the most interesting sets of data captured is that which is stored in the stats$sql_summary table. This table is the storage location for data normally visiable in the v$sqlarea table and it's contents are used to for the basis of the SQL performance section of the statspack reports.

However, there are a few anomolies that you must understand in order to make a proper assessment of your system's performance using the statspack tools. Basically, they appear to suffer from what I would describe as non-monotonic data across contiguous snapshot ids. Now, what does all that mean? I heard you mumble. Right, so consider a single SQL statement, that has a fairly high execution count. - i.e. it is called a lot of times. Let's suppose that it is called so many times that it is always in the SQL pool.

For example.
SELECT 'x' FROM DUAL;

Now lets say that after taking a good look through your stats$sql_summary table, that you find the hash value of this little statement. If you then select out from the stats$sql_summary table all occurences of this little statement, then you may be in for a surprise. Ordered over time, you would expect such a statement to show ever increasing values: for example, two important columns in the stats$sql_summary table are: executions and cpu_usage. But, the interesting thing is that these metrics are not always always incrementing (monotonic).

Now, if you have been around for a while, you will probably start to say something like, "ah, that'll be because the statement ages out of the pool. When it is then re-discovered, it is given a clean set of statistics". If you said, this then you are correct. That does, indeed, happen with statements. However, I have been working on a case in which a statement like the one in the example shows a daily incrementing 'staircase' pattern, and then glitches downwards to around 1/2 its maximum value, then it starts to increment once again. It then increments for a few more days before then suffering another huge drop in its value. (The glitch affects both executions and cpu_usage parameters. I haven't studied other metrics, but I suspect that it is also true of these too.)

So what is the big deal? Does it matter that this happens? Well yes it does. When you run a statspack report you choose two snap_ids that will be used to compute a 'what has happened between these two snapshots' report. I've discovered that these snapshots do NOT take into account all the intermediate data captured. So, for example, suppose that the above statement was called 100,000 times an hour, over a day, the number of executions that this statement reports being called, rises from say 0 to 2.4 million. After a day and one hour, say, it it's execution value drops from 2.4 million down to 1.2 million, and then one hour later it drops again to 600,000 and then again to 300,000. Now, you run a statspack report to cover the whole period, and instead of being told that the statement has executed 2.7 million times, you are told, quite incorrectly, that it was only executed 300,000 times. This figure being the difference between its starting value and its ending value, now 300,000. But, statspack has missed the fact that in the meantime, this statement achieved 2.4 millions peak and then got glitched downwards.

If the statement ages out over time, then you have an even bigger issue to contend with. For example, suppose that a really big query that only runs once per day at lunchtime and consumes one CPU hour. And that:

1. this statement is not in the midnight snapshot (having been aged out through the day) and
2. this statement remains in the pool only to make it into the 13:00 snapshot. (by 14:00 it is aged out and disappears).

Now, if you run a statspack report for the whole day: midnight to midnight, you will not be shown the SQL statement that was running from 12:00 to 13:00. Furthermore, suppose that the statement starts at 12:01 and completes at 13:01. Then it would only appear in a single snapshot - assuming that the statement ages out by 14:00. Under these conditions you will not be shown this statement if you run a report between 12:00 and 13:00 as the statement did not have a value at 12:00 to form the starting point.