Scoring systems

Extensivity

Many intuitive scoring criteria for segments, such as fold-enrichment, produce quantities that are “intensive”, that is, they do not scale with a segment’s length. Energies (and scores) are extensive quantities, so applying an intensive measure produces an implicit a priori bias against large segments: a segmentation with larger segments has fewer of them, and therefore a lower cumulative score when fold-enrichment is identical across all segments.

Therefore, to turn an intensive segment quality metric into a scoring function with a uniform prior on segment length, one must scale a segment’s score in direct proportion to its length in bins: a “fair” scoring function must be extensive.
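The effect can be illustrated with a minimal sketch. The helper `cumulative_score` and the constant quality metric are hypothetical and serve only to make the argument concrete: two segmentations of the same 8 bins receive the same quality per segment, yet their cumulative scores agree only when each segment's score is scaled by its length.

```python
def cumulative_score(boundaries, quality, extensive):
    """Sum per-segment scores over a segmentation given as sorted boundaries."""
    total = 0.0
    for a, b in zip(boundaries[:-1], boundaries[1:]):
        q = quality(a, b)  # intensive quality of segment [a, b)
        # Extensive variant: scale the quality by the segment length in bins
        total += q * (b - a) if extensive else q
    return total

# Hypothetical metric: every segment has the same fold-enrichment of 2.0
def uniform_quality(a, b):
    return 2.0

coarse = [0, 4, 8]        # two segments of length 4
fine = [0, 2, 4, 6, 8]    # four segments of length 2

intensive_coarse = cumulative_score(coarse, uniform_quality, extensive=False)
intensive_fine = cumulative_score(fine, uniform_quality, extensive=False)
extensive_coarse = cumulative_score(coarse, uniform_quality, extensive=True)
extensive_fine = cumulative_score(fine, uniform_quality, extensive=True)
```

With the intensive score the finer segmentation wins simply because it has more segments; with the extensive score the two segmentations tie.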

On the other hand, we can impose an explicit prior on segment length by building a scoring function from an intensive quality metric and applying to it a particular scaling relation with respect to segment length.
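One possible scaling relation, shown here purely as an assumption for illustration, is a power law in segment length with a hypothetical exponent `alpha`: `alpha = 1` recovers the extensive (uniform prior) case, `alpha > 1` favors long segments, and `alpha < 1` favors short ones.

```python
def prior_score(boundaries, quality, alpha):
    """Cumulative score where each segment's intensive quality is scaled by
    length**alpha (alpha is a hypothetical prior-strength parameter)."""
    return sum(
        quality(a, b) * (b - a) ** alpha
        for a, b in zip(boundaries[:-1], boundaries[1:])
    )

# Hypothetical metric: identical quality for every segment
def uniform_quality(a, b):
    return 2.0

coarse = [0, 4, 8]        # two segments of length 4
fine = [0, 2, 4, 6, 8]    # four segments of length 2

favors_long = prior_score(coarse, uniform_quality, 2.0) > prior_score(fine, uniform_quality, 2.0)
favors_short = prior_score(coarse, uniform_quality, 0.5) < prior_score(fine, uniform_quality, 0.5)
neutral = prior_score(coarse, uniform_quality, 1.0) == prior_score(fine, uniform_quality, 1.0)
```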

Log-odds scores

Potts energy model

In the space of segmentations \(S\), we can decompose the multiresolution Potts Hamiltonian described earlier into an energy function for segments:
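
Assuming the standard configuration-model null term, with \(k_i\) the total edge mass of bin \(i\) and \(2m\) the total edge mass of the heatmap, a per-segment energy consistent with the special case below would read:

\[S_{\textrm{potts}}(a,b) = \sum_{i=a}^{b-1}\sum_{j=a}^{b-1} \left(A_{ij} - \gamma\frac{k_i k_j}{2m} \right)\]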

When a heatmap is a stochastic matrix (e.g. a balanced Hi-C heatmap), we can take all \(k_i = 1\) and \(2m = N\). Then the Potts energy function can be written:

\[\begin{split}S_{\textrm{potts}}(a,b) &= \sum_{i=a}^{b-1}\sum_{j=a}^{b-1} \left(A_{ij} - \frac{\gamma}{N} \right) \\ &= \left[\sum_{i=a}^{b-1}\sum_{j=a}^{b-1} A_{ij}\right] - \left[\frac{\gamma}{N}(b-a)^2\right]\end{split}\]

Consider the submatrix corresponding to segment \([a,b)\) in the heatmap: A[a:b,a:b]. The configuration null model, weighted by \(\gamma\), assigns a background edge mass of \(\gamma/N\) to every pixel of such a submatrix. The Potts score is the difference between the observed mass in the submatrix and this background mass.
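The balanced-heatmap form of the score can be sketched in a few lines of NumPy. The function name `potts_score` and the toy heatmap are assumptions made for illustration and are not taken from any particular library.

```python
import numpy as np

def potts_score(A, a, b, gamma=1.0):
    """Potts score of segment [a, b) for a heatmap A whose rows sum to 1."""
    N = A.shape[0]
    observed = A[a:b, a:b].sum()            # edge mass inside the segment
    background = gamma / N * (b - a) ** 2   # null-model mass, gamma/N per pixel
    return observed - background

# Toy balanced heatmap: two blocks of 3 bins, uniform within each block
A = np.kron(np.eye(2), np.full((3, 3), 1.0 / 3.0))

score_block = potts_score(A, 0, 3)      # segment matching the first block
score_straddle = potts_score(A, 1, 5)   # segment straddling both blocks
```

In this toy case the segment that matches a block scores higher than the one that straddles the block boundary, because the straddling segment includes pixels with no observed mass but still pays the background term for them.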

The Potts scoring function imposes a segment length bias on segmentations. Take a segment with total edge mass \(c\) and scale it up to twice its length while keeping the edge mass per pixel fixed, so that its submatrix covers four times as many pixels and its total edge mass is \(4c\). Both terms of the Potts score then increase by a factor of four, rather than the factor of two that an extensive score would give.
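A quick numerical check of the factor of four, assuming the edge mass per pixel stays constant when the segment is doubled. The heatmap below is a hypothetical construction with uniform density inside two blocks of 8 bins.

```python
import numpy as np

def potts_score(A, a, b, gamma=1.0):
    """Potts score of segment [a, b) for a heatmap A whose rows sum to 1."""
    N = A.shape[0]
    return A[a:b, a:b].sum() - gamma / N * (b - a) ** 2

N = 16
# Hypothetical heatmap: constant density within two blocks of 8 bins each,
# so every row sums to 1
A = np.zeros((N, N))
A[:8, :8] = 1.0 / 8.0
A[8:, 8:] = 1.0 / 8.0

small = potts_score(A, 0, 4)   # segment of length 4
large = potts_score(A, 0, 8)   # segment of length 8, same mass per pixel
ratio = large / small
```

Doubling the length quadruples both the observed mass and the background term, so `ratio` comes out at four.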

Other scoring functions

Armatus

Filippova et al. (2014) presented the optimal segmentation algorithm in the context of a Hi-C domain scoring function whose tunable scale parameter \(\gamma\) allows domains to be found at multiple resolutions.

\[\begin{split}& q(a,b) = \frac{\sum_{i=a}^{b-1} \sum_{j=a}^{b-1} A_{ij}}{ (b-a)^\gamma }\\ & \mu(l) = \textrm{mean } q \textrm{ for segments with length } l \\ \\ & S_{ \textrm{armatus} }(a,b) = \max \left(0, q(a,b) - \mu(b-a) \right)\end{split}\]
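
The three definitions above can be sketched directly in NumPy. The function names are hypothetical, the sum runs over the full square submatrix exactly as the formula is written here, and \(\mu(l)\) is computed by brute force over every segment of length \(l\) along the diagonal, which is an assumption about how the mean is taken.

```python
import numpy as np

def armatus_quality(A, a, b, gamma):
    """Length-rescaled mass q(a, b) of segment [a, b)."""
    return A[a:b, a:b].sum() / (b - a) ** gamma

def mean_quality(A, length, gamma):
    """mu(l): mean of q over every segment of the given length."""
    N = A.shape[0]
    qs = [armatus_quality(A, a, a + length, gamma)
          for a in range(N - length + 1)]
    return float(np.mean(qs))

def armatus_score(A, a, b, gamma):
    """S_armatus(a, b): excess of q over the length-matched mean, floored at 0."""
    q = armatus_quality(A, a, b, gamma)
    return max(0.0, q - mean_quality(A, b - a, gamma))

# Toy heatmap: two blocks of 3 bins, uniform within each block
A = np.kron(np.eye(2), np.full((3, 3), 1.0 / 3.0))

on_block = armatus_score(A, 0, 3, gamma=1.0)    # segment matching a block
off_block = armatus_score(A, 1, 4, gamma=1.0)   # segment crossing the boundary
```

The segment that coincides with a block scores above the length-matched mean, while the one crossing the block boundary falls below it and is clipped to zero.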

Corner score

Phase-consistency score