Home > Spearman Rho null distribution

Motivation

Linear regressions have a fairly strict set of assumptions that need to be satisfied to make meaningful conclusions about the p-value or estimation error, such as normality and independence of the residuals. Moreover, many data sets do not satisfy these assumptions, even though many statistics books would have one believe most things are distributed normally. For example, most financial data sets are fat-tailed.

At the expense of producing a predictive model, Spearman Rho is a much more robust test for the existence of a monotonic relationship between the variables. However, the null distribution is more complex, so it's also more difficult to obtain an accurate p-value. (I'd argue an exact p-value is far less important than knowing certainly that the model assumptions hold!) The computation below shows the exact null distribution for small sample sizes.

Furthermore, the null distribution is also related to a number theory problem of intrinsic interest: the distribution of , where is a random permutation on the numbers 1 through N, and i ranges from 1 through N. See also A056876.

Unfortunately, until now, there was little data freely available on the internet about this distribution. Several statistics papers have discussed it, but the only free one I can find is here (Fast computation of the exact null distributio of Spearman's rho and Page's L statistic for samples with and without ties, M.A. van de Wiel and A. Di Bucchianico).

Computation technique

Note: I've discovered an algorithm with better complexity but haven't implemented it yet. See the next section for details.

As in the aforementioned paper, a convenient technique for calculating the distribution of , from which one easily obtains the Spearman Rho null distribution, is using the permanent of . One can verify that the coefficient of xi is the frequency for the sum i. So, this computation is equivalent to calculating the permanent of a certain symmetric matrix of monomials.

The paper uses a Laplace-type expansion for the permanent, and then by factoring out powers of x and using symmetry, a great deal of computation can be reused. They do not provide a complexity analysis or timing results, but it appears to be O(2nn5) on a cursory look. (We assume arithmetic operations have constant cost. In reality, precision of n log n is required.) It also appears to require a fair amount of memory for the reusable computations, although they do not address this in the paper.

Instead, I opted to use Ryser's formula,


This provides the same asymptotic time O(2nn5), but is much simpler to implement and requires very little memory. Several constant-factor optimizations can be made based on the structure of the problem: There are many other possibilities of small time-saving tricks, but with vanishing improvements in speed for larger n.

My program is a fairly straightforward implementation of the above in C++, no fancy assembly or other low-level coding was used. 96-bit integers were used, which should accommodate N through 30. (Note that intermediate results exceed the 96-bit limit. However, this just means the computation is done modulo 296, and the final result is still valid.) The platform is Win32, compiled with gcc. I may make the source available after cleaning it up, but I would also like to try some other optimizations, such as CPU vector operations.

An algorithm with better complexity

I did also find an algorithm with better complexity. Simply choose O(n3) values for x and compute the permanent. The generating function for the distribution can be recovered using polynomial interpolation techniques. Using an efficient implementation of Ryser's formula, the permanent takes time O(2nn), thus we arrive at complexity O(2nn4). Again, this is the number of arithmetic operations; n log n bits of precision are required.

Results

Below are tables of the exact distribution.

Note that the results of a computation are NOT copyrighted. Therefore you may freely use these tables. Of course, a citation would be appreciated.

n = 3
n = 4
n = 5
n = 6
n = 7
n = 8
n = 9
n = 10
n = 11
n = 12
n = 13
n = 14
n = 15
n = 16
n = 17
n = 18
n = 19
n = 20
n = 21
n = 22
n = 23
n = 24
n = 25
n = 26

Timing

The timings are on a 2GHz Core 2 Duo, Win32. All numbers are in seconds. This doesn't particularly follow O(2nn5), apparently due to cache and memory effects.

n64 bits96 bits
30.0000030.000005
40.0000150.000009
50.0000140.000027
60.0000550.000096
70.0001720.000370
80.0006360.001466
90.0022370.003279
100.0079550.01210
110.025660.03730
120.081030.1204
130.25000.3600
140.76081.079
152.0903.195
165.8489.528
1716.0928.07
1843.6883.09
19116.9234.8
20300.5940.1
211963
224851
2317600
2441240
25124800
26316200