Knuth’s reservoir sampling in Python and Perl

Algorithms that perform calculations on evolving data streams, but in fixed memory, have increasing relevance in the age of Big Data. The reservoir sampling algorithm outputs a sample of N lines from a file of undetermined size. It does so in a single pass, using memory proportional to N.

These two features – (i) a constant memory footprint and (ii) a capacity to operate on files of indeterminate size – make it ideal for working with very large data sets common to event processing.

While it has likely been multiply discovered and implemented, like many algorithms, it was codified by Knuth’s The Art of Computer Programming.

The trick of this algorithm is to first fill up the sample buffer, and afterwards, to probabilistically replace it with additional lines of input.

Python version

#!/usr/bin/python
import sys
import random
 
if len(sys.argv) == 3:
    input = open(sys.argv[2],'r')
elif len(sys.argv) == 2:
    input = sys.stdin;
else:
    sys.exit("Usage:  python samplen.py <lines> <?file>")
 
N = int(sys.argv[1]);
sample = [];
 
for i,line in enumerate(input):
    if i < N:
        sample.append(line)
    elif i >= N and random.random() < N/float(i+1):
        replace = random.randint(0,len(sample)-1)
        sample[replace] = line
 
for line in sample:
    sys.stdout.write(line)

Perl version

#!/usr/bin/perl -sw
 
$IN = 'STDIN' if (@ARGV == 1);
open($IN, '<'.$ARGV[1]) if (@ARGV == 2);
die "Usage:  perl samplen.pl <lines> <?file>n" if (!defined($IN));
 
$N = $ARGV[0];
@sample = ();
 
while (<$IN>) {
    if ($. <= $N) {
 $sample[$.-1] = $_;
    } elsif (($. > $N) && (rand() < $N/$.)) {
 $replace = int(rand(@sample));
 $sample[$replace] = $_;
    }
}
 
print foreach (@sample);
close($IN);

For example, imagine we are to sample 5 lines randomly from a 6-line file. Call i the line number of the input, and N the size of sample desired. For the first 5 lines (where i < = N), our sample fills entirely. (For the non-Perl hackers: the current line number i is held by the variable $., just as the special variable $_ holds the current line value).

It’s at successive lines of input that the probabilistic sampling starts: the 6th line has a 5/6th (N/i) chance of being sampled, and if chosen, it will replace one of the previously 5 chosen lines with a 1/5 chance: leaving them a (5/6 * 1/5) = 5/6 chance of being sampled. Thus all 6 lines have an equal chance of being sampled.

In general, as more lines are seen, the chance that any additional line is chosen for the sample falls; but the chance that any previously chosen line could be replaced grows. These two balance such that the probability for any given line of input to be sampled is identical.

A more sophisticated variation of this algorithm is one that can take into consideration a weighted sampling.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: