Simple Random Samples from a Sql database Simple Random Samples from a Sql database postgresql postgresql

Simple Random Samples from a Sql database


I think the fastest solution is

select * from table where rand() <= .3

Here is why I think this should do the job.

  • It will create a random number for each row. The number is between 0 and 1
  • It evaluates whether to display that row if the number generated is between 0 and .3 (30%).

This assumes that rand() is generating numbers in a uniform distribution. It is the quickest way to do this.

I saw that someone had recommended that solution and they got shot down without proof.. here is what I would say to that -

  • This is O(n) but no sorting is required so it is faster than the O(n lg n)
  • mysql is very capable of generating random numbers for each row. Try this -

    select rand() from INFORMATION_SCHEMA.TABLES limit 10;

Since the database in question is mySQL, this is the right solution.


There's a very interesting discussion of this type of issue here: http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/

I think with absolutely no assumptions about the table that your O(n lg n) solution is the best. Though actually with a good optimizer or a slightly different technique the query you list may be a bit better, O(m*n) where m is the number of random rows desired, as it wouldn't necesssarily have to sort the whole large array, it could just search for the smallest m times. But for the sort of numbers you posted, m is bigger than lg n anyway.

Three asumptions we might try out:

  1. there is a unique, indexed, primary key in the table

  2. the number of random rows you want to select (m) is much smaller than the number of rows in the table (n)

  3. the unique primary key is an integer that ranges from 1 to n with no gaps

With only assumptions 1 and 2 I think this can be done in O(n), though you'll need to write a whole index to the table to match assumption 3, so it's not necesarily a fast O(n). If we can ADDITIONALLY assume something else nice about the table, we can do the task in O(m log m). Assumption 3 would be an easy nice additional property to work with. With a nice random number generator that guaranteed no duplicates when generating m numbers in a row, an O(m) solution would be possible.

Given the three assumptions, the basic idea is to generate m unique random numbers between 1 and n, and then select the rows with those keys from the table. I don't have mysql or anything in front of me right now, so in slightly pseudocode this would look something like:

create table RandomKeys (RandomKey int)create table RandomKeysAttempt (RandomKey int)-- generate m random keys between 1 and nfor i = 1 to m  insert RandomKeysAttempt select rand()*n + 1-- eliminate duplicatesinsert RandomKeys select distinct RandomKey from RandomKeysAttempt-- as long as we don't have enough, keep generating new keys,-- with luck (and m much less than n), this won't be necessarywhile count(RandomKeys) < m  NextAttempt = rand()*n + 1  if not exists (select * from RandomKeys where RandomKey = NextAttempt)    insert RandomKeys select NextAttempt-- get our random rowsselect *from RandomKeys rjoin table t ON r.RandomKey = t.UniqueKey

If you were really concerned about efficiency, you might consider doing the random key generation in some sort of procedural language and inserting the results in the database, as almost anything other than SQL would probably be better at the sort of looping and random number generation required.


Faster Than ORDER BY RAND()

I tested this method to be much faster than ORDER BY RAND(), hence it runs in O(n) time, and does so impressively fast.

From http://technet.microsoft.com/en-us/library/ms189108%28v=sql.105%29.aspx:

Non-MSSQL version -- I did not test this

SELECT * FROM Sales.SalesOrderDetailWHERE 0.01 >= RAND()

MSSQL version:

SELECT * FROM Sales.SalesOrderDetailWHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)

This will select ~1% of records. So if you need exact # of percents or records to be selected, estimate your percentage with some safety margin, then randomly pluck excess records from resulting set, using the more expensive ORDER BY RAND() method.

Even Faster

I was able to improve upon this method even further because I had a well-known indexed column value range.

For example, if you have an indexed column with uniformly distributed integers [0..max], you can use that to randomly select N small intervals. Do this dynamically in your program to get a different set for each query run. This subset selection will be O(N), which can many orders of magnitude smaller than your full data set.

In my test I reduced the time needed to get 20 (out 20 mil) sample records from 3 mins using ORDER BY RAND() down to 0.0 seconds!