We advocate the use of information retrieval metric score standardization to allow retrieval effectiveness scores derived on different topics and collections to be more meaningful in themselves and more easily compared.

In order to standardize the score that a system has received from a metric for a run against a given topic, one must know the mean of the scores achieved by the original experimental systems against the topic, and also the standard deviation of those scores. A standardized z-score is then calculated by subtracting the mean from the raw score, and dividing the result by the standard deviation.
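As a minimal sketch, the z-score computation described above (assuming the per-topic mean and standard deviation of the original experimental systems are already known) is:

```python
def standardize(raw_score, topic_mean, topic_stdev):
    """Convert a raw metric score for one topic to a z-score,
    using the mean and standard deviation of the scores that the
    original experimental systems achieved on that topic."""
    return (raw_score - topic_mean) / topic_stdev

# Example: a raw AP of 0.30 on a topic where the original systems
# averaged 0.25 with a standard deviation of 0.10 standardizes to
# half a standard deviation above the mean.
z = standardize(0.30, 0.25, 0.10)
```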

A z-score is unbounded both above and below, whereas most IR metrics range from 0 to 1. Additionally, the unbounded nature of z-scores can give outliers an exaggerated influence when scores are aggregated. For these reasons, we suggest mapping the z-score to the range 0 to 1. One function that can be used to do this is the cumulative distribution function of the standard Normal distribution, which is provided by most good statistical packages. Under R, use the `pnorm()` function.
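For readers not working in R, the same mapping is available in Python's standard library via `statistics.NormalDist` (Python 3.8+); this is a direct equivalent of `pnorm()`, not a method prescribed by the page:

```python
from statistics import NormalDist

def z_to_unit_interval(z):
    """Map an unbounded z-score into (0, 1) using the cumulative
    distribution function of the standard Normal distribution
    (the equivalent of R's pnorm())."""
    return NormalDist().cdf(z)

# A z-score of 0 (an exactly average run) maps to 0.5; scores
# above the topic mean map above 0.5, scores below map below.
```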

Standardization factors are provided for the following metrics:

- Average Precision (AP).
- Sum of Precisions (SP). This is AP without normalization by the total number of (known) relevant documents.
- Discounted Cumulative Gain (DCG), as originally described by Jarvelin and Kekalainen, with discounting base *b* = 2.
- Normalized Discounted Cumulative Gain (nDCG).
- Variant Discounted Cumulative Gain (VDCG), *b* = 2, as described by Burges et al. This discounts by the log of the rank *r* + 1, not of the rank, and so does not need to adjust for *r* < *b*.
- Normalized Variant Discounted Cumulative Gain (nVDCG).
- Precision at 10 (P@10).
- Reciprocal Rank (RR).
- Rank-Biased Precision, persistence parameter *p* = 0.8 (RBP.8).
- Rank-Biased Precision, *p* = 0.95 (RBP.95).
- Precision at R (RP).

The standardized score of a metric that has been normalized by dividing by the score of an ideal ranking is identical to the standardized score of the unnormalized form of the same metric: dividing every system's score on a topic by the same per-topic constant rescales the mean and the standard deviation by that constant, leaving the z-scores unchanged. Therefore sAP == sSP (AP being SP normalized by the number of relevant documents, which is the ideal SP), sDCG == snDCG, and sVDCG == snVDCG. Standardization factors for both the normalized and unnormalized forms of these metric pairs are provided for convenience; they give the same results.
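This invariance is easy to check numerically; the sketch below uses arbitrary made-up scores and an arbitrary normalizing constant, standing in for DCG and the ideal DCG of a topic:

```python
from statistics import mean, stdev

raw = [0.2, 0.4, 0.6]            # unnormalized scores, e.g. DCG
ideal = 0.8                      # per-topic normalizing constant, e.g. ideal DCG
norm = [s / ideal for s in raw]  # normalized scores, e.g. nDCG

# Dividing by a per-topic constant rescales both the mean and the
# standard deviation by that constant, so the z-scores coincide.
z_raw = [(s - mean(raw)) / stdev(raw) for s in raw]
z_norm = [(s - mean(norm)) / stdev(norm) for s in norm]
```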

The files below contain the means and standard deviations of all of the above metrics for each of the specified TREC tracks. All experimental systems contributing to those tracks have been included in generating the standardization factors. Each file contains a matrix in CSV format, with topics as rows, and metrics as columns. The first row lists the metric names, and the first column lists the topic ids. The standardization factors may also be downloaded as a single tar.gz file.
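As a sketch of how such factor files might be consumed (the miniature CSVs written below are hypothetical stand-ins, and the real files may package means and standard deviations differently than as two separate files):

```python
import csv

# Hypothetical miniature factor files, written here only so the
# sketch is self-contained; the distributed files cover the full
# topic and metric sets of each TREC track.
with open("means.csv", "w", newline="") as f:
    f.write("topic,AP,P@10\n401,0.25,0.40\n")
with open("stdevs.csv", "w", newline="") as f:
    f.write("topic,AP,P@10\n401,0.10,0.15\n")

def load_factors(path):
    """Read a factor matrix in the CSV layout described above:
    first row metric names, first column topic ids."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    metrics = rows[0][1:]
    return {row[0]: dict(zip(metrics, map(float, row[1:])))
            for row in rows[1:]}

means = load_factors("means.csv")
stdevs = load_factors("stdevs.csv")

# Standardize a raw AP of 0.30 on topic 401.
z = (0.30 - means["401"]["AP"]) / stdevs["401"]["AP"]
```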

If there are any test collections or metrics not listed that you think should be included, then please let us know.

*Disclaimer: This page, its content and style, are the responsibility of
the author and do not necessarily represent the views, policies, or
opinions of The University of Melbourne.*