Non-uniform shuffling

A shuffle of a finite sequence of length n of distinguishable elements refers to an algorithmic process which — modulo pseudo-randomness — can be modeled as a random variable uniformly distributed on the permutations \mathbb{S}_n.
However, most pseudo-random entropy sources provide only a pseudo-uniformly distributed realization of \mathbb{F}_2^\mathbb{N}, leading to the necessity of finding an algorithmic transformation process if one wishes to achieve a shuffle.
In the following, I will assume that a transforming process to a family of independent and uniformly on \{1..n\} distributed random variables is already present for any n\in\mathbb{N}.

One naive and seemingly correct (it is not) approach is to traverse the given sequence, uniformly swapping the current entry with another one, i.e.

void falseShuffle(uint64_t *arr, size_t len) {
    for (size_t j = 0; j < len; j++)
        swap(arr, j, unif(len)); }

as an exemplary C implementation where \texttt{unif}(n) is independent and uniformly distributed on \{0..n-1\}.

Yet, even though sensible on first sight, the above defined random variable is only in the most trivial cases uniformly distributed and — as empirical evidence suggests, see below — horrendously non-uniformly distributed otherwise.
To prove the non-uniformity postulated above, I first present the following number-theoretic result.

Claim. In only three trivial cases does the factorial of a natural number divide its tetration; formally

\forall\,n\in\mathbb{N}_{>2}:n!\nmid n^n.

Proof. Let n\in\mathbb{N}_{>2} be a natural number larger than two. By the definition of the factorial, \prod_{p<n\ \text{prime}}p\mid n!\mid n^n is evident. Adhering to the uniqueness of prime factorizations, \prod_{p<n}p\mid n follows. Observe that n-1>1 has to be prime since \forall\,p<n\ \text{prime}:n-1\equiv -1\not\equiv 0\mod p, implying n-1\mid\prod_p=n which cannot hold for n>2. QED

Now suppose, \texttt{falseShuffle} was indeed non-trivially distributed uniformly. Without loss of generality, all involved probability spaces were finite. Then there had to exist a surjection from this algorithm’s entropic state to \mathbb{S}_n with fibers of the same finite cardinality, implying n!\mid n^n. By the above proven claim, n<3 followed, making the distribution trivial.¬†QED

One possible reason for the surprising nature of this non-uniformity is the striking source code resemblance to a correct implementation, i.e.

void shuffle(uint64_t *arr, size_t len) {
    for (size_t j = 0; j < len; j++)
        swap(arr, j, j + unif(len - j)); }

as an exemplary C implementation which can be inductively shown to resemble the same structure as \mathbb{S}_n, in each step sprinkling in some uniform randomness and thus being itself uniformly distributed.

To see just how non-uniform \texttt{falseShuffle} is, I have calculated its discrete density for n=4:

[       |                ]
[   |   |||              ]
[   |   |||              ]
[   |   |||              ]
[   ||  ||||||| ||       ]
[||||| |||||||| |||    ||]
[|||||||||||||||||| || ||]
n = 4

If it was uniformly distributed, the discrete density would look like a rectangle; [||||| ... |||||]. Further plots for 0\leq n\leq 6 are shown in nonUniformity.txt.

Source code for the analysis and plotting: nonUniformity.hs. Empirical evidence of non-uniformity: nonUniformity.c.



Continuing my journey implementing various sorting algorithms in C, in this post I am departing from the most well-known algorithms and implementing one of my personal favourites — heapsort.
Download the C source file here: heapsort.c.

Contrary to algorithms like quicksort which immediately jump into action sorting the given array, heapsort operates on a data structure called a heap which it efficiently transforms into a sorted list. However, as most arrays are not of the heap structure, heapsort first needs to transform a given array into a heap. Thus heapsort is a two-step process — first creating a heap and then sorting said heap.

Sorting can be applied to an infinite number of objects provided there is an order defined among them. However, not much is gained from changing the underlying value one is sorting which is why in this post I will only focus on sorting integers — technically even only integers in the C type sense; n\in\mathbb{Z},-2^{31} \leq n < 2^{31}.

The heap

A heap is a special type of binary tree that satisfies the heap property — every parent node’s value is not less than its child node’s (if existent) values. Furthermore, a heap is maximally filled at every level but possibly the last where the elements are as far left as possible.
From these properties it follows that the greatest value will be located at the root node.

One can also define a heap such that the root node will house the smallest element; such a heap would lead to a list sorted in descending order (more on that later).

      ( 86 )      
     /      \     
  (31)      (64)  
  /  \      /     
(20)(-4)  (17)    

A heap containing integers.

Continue reading