Monday, July 01, 2013

Column statistics and Histogram in Sybase ASE

Some notes on the column level statistics and histogram, and their usage in Search Argument and Join:

Histogram: When you create an index, a histogram is created on the first column of the index. It represents the distribution of values in a column. It contains cells with weights that represent the percentage of rows of the column occupied by the values within the cell.
- Frequency cell: equality relationship, the weight of the cell is the portion of rows that has this value.
- Range celll: portion of rows whose values fall in this range. It can be used for estimating the number of rows to be returned when search argument values falls within a range cell.

When merging histograms from several partitions: we order the histograms from different partitions by the lower bound, then interpolate other values into the range.

Density (range cell and total): Represent the average number of duplicates in the column.
- Range cell density: measures the average number of duplicates of all values that are represented by range cells, excluding frequency cells. It's used for search arguments.
- Total density: measures the average number of duplicates in entire column including both range and frequency cells. Used for estimation of the number matching rows for joins.

Seach argument (SARG) estimation:
- For single SARG, the weight of the range/frequency cell covered are added up and multiplied with the total number of rows in the table, to get the estimated selectivity.
- For multiple SARGs, the selectivity on different columns are computed, then combined (can be simple weight multiplication) to get the total selectivity estimation.

Join estimation:
When statistics are available, use the total density to estimate the number of rows matching the join key. For multi-column join, use the composite total density.

Use join density estimates derived from join histograms that give accurate estimates of qualifying joining rows and the rows to be scanned in the outer and inner tables.

Use table-normalized histograms of the joining attributes. This technique gives an exact value for the skewed values (that is, frequency count) and uses the range cell densities from each histogram to estimate the cell counts of corresponding range cells.

The join density is dynamically computed from the join histogram. The first histogram join occurs typically between two base tables when both attributes have histograms. Every histogram join creates a new histogram on the corresponding attribute of the parent join's projection.