How to guarantee that at least N rows are returned by recursive CTE in Postgres

postgresql


This is too long for a comment, but it sheds light on what's going on with my existing query. From the documentation on recursive query evaluation, the steps that a recursive query takes are:

Evaluate the non-recursive term. For UNION (but not UNION ALL), discard duplicate rows. Include all remaining rows in the result of the recursive query, and also place them in a temporary working table.

So long as the working table is not empty, repeat these steps:

a. Evaluate the recursive term, substituting the current contents of the working table for the recursive self-reference. For UNION (but not UNION ALL), discard duplicate rows and rows that duplicate any previous result row. Include all remaining rows in the result of the recursive query, and also place them in a temporary intermediate table.

b. Replace the contents of the working table with the contents of the intermediate table, then empty the intermediate table.
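For illustration, the textbook counting CTE walks through exactly these steps: the non-recursive term seeds the working table, each pass evaluates the recursive term against the current working table only, and the loop ends once the working table comes up empty. A minimal, self-contained sketch:

WITH RECURSIVE t(n) AS (
   SELECT 1          -- step 1: the non-recursive term seeds the working table
   UNION             -- discards duplicate rows (none arise here, but see the discussion below)
   SELECT n + 1      -- step 2a: the recursive term sees only the current working table
   FROM   t
   WHERE  n < 5      -- once nothing passes this filter, the working table is empty
)
SELECT n FROM t;     -- returns 1 through 5, then the iteration stops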

So my hunch in the comments (after trying with UNION ALL) was mostly on the right track.

As the documentation states, this is actually just a type of iteration that re-uses the previous iteration's result (the working table) in place of the recursive name used in the recursive part.

So it's more like an ever-shrinking process, where the "working table" substituted for the recursive name consists only of those rows from the most recent recursive round that were not duplicates of any previous results.

Here's an example. Say we have 5000 rows in the table and we want to sample 3000 unique rows, doing a recursive sample of 1000 (potentially not unique) samples at a time.
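The exact query isn't shown here, but a recursive sampling query of roughly this shape reproduces the behaviour walked through below (the table name tbl, the column id and a gap-free id range 1..5000 are stand-ins, not my actual schema):

WITH RECURSIVE sample AS (
   SELECT t.*
   FROM  (
      SELECT 1 + trunc(random() * 5000)::int AS id
      FROM   generate_series(1, 1000)   -- first batch: 1000 random ids, dupes possible
      ) r
   JOIN   tbl t USING (id)
   UNION                                -- discards rows that duplicate any previous result row
   SELECT t.*
   FROM  (
      SELECT 1 + trunc(random() * 5000)::int AS id
      FROM   sample                     -- recursive self-reference: one id per working-table row
      ) r
   JOIN   tbl t USING (id)
   )
SELECT *
FROM   sample
LIMIT  3000;                            -- desired number of unique rows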

We do the first batch of 1000 and remove duplicates, ending up with about 818 non-dupes based on the binomial distribution for these large numbers (N = 5000, m = 1000, k = 1; rearrange terms to avoid overflow).
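(Unpacking that binomial note: with p = 1/N, the expected number of ids drawn exactly k times is N * C(m, k) * p^k * (1 - p)^(m - k); at k = 1 that gives the ~818 figure, which is easy to sanity-check:)

SELECT 5000 * 1000 * (1.0 / 5000) * power(4999 / 5000.0, 999) AS expected_singletons;  -- about 818.9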

These 818 become the working table and this result set is subbed in as the recursive term for our next pass. We draw another set of about 818 unique rows, but then have to remove duplicates (UNION) when comparing to the original 818 that are in the working table. Two different random draws of 818 will have significant overlap (somewhere around 150 on average), so all of those are discarded and whatever new unique rows remain become the new working table.

So you'll add about 818 unique samples on the first draw, then the working table shrinks, you'll add about 650 on the second draw, the working table shrinks, ... and you keep doing this until either you reach the overall total of samples desired (3000 in this case) or the working table ends up being empty.

Once the working table is small enough, there will be a very high probability that everything within it will be duplicated in the next draw of 1000, at which point the working table becomes empty and the recursion terminates.

For drawing 3000 samples, you might be able to repeat this working-table update enough times. But as you move from 3000 closer to the table's total size of 5000, the probability of getting there shrinks to nearly zero very fast.

So instead of this being an optimizer issue that's short-circuiting with a smaller result set, it's actually an issue with the specific way "recursion" is implemented in Postgres -- it's not actually recursion but simple iteration, where each new working table is the set difference between the most recent recursive query result and everything already in the output. For random sampling like this, that working table gets smaller very fast with each iteration, until it's extremely likely to be empty due to the high probability of selecting a duplicate.


Your assessment is to the point. The recursive query in my referenced answer is only somewhat more flexible than the original simple query. It still requires relatively few gaps in the ID space and a sample size that is substantially smaller than the table size to be reliable.

While we need a comfortable surplus ("limit + buffer") in the simple query to cover the worst case scenario of misses and duplicates, we can work with a smaller surplus that's typically enough - since we have the safety net of a recursive query that will fill in if we should fall short of the limit in the first pass.
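For concreteness, a sketch of that simple query with its "limit + buffer" surplus (simplified to the 5000-row example above; tbl, id and a 10% buffer are placeholders, not the exact query from the referenced answer):

SELECT t.*
FROM  (
   SELECT DISTINCT 1 + trunc(random() * 5000)::int AS id
   FROM   generate_series(1, 1100)   -- limit + buffer to absorb duplicates and misses
   ) r
JOIN   tbl t USING (id)
LIMIT  1000;                         -- the limit actually wanted

If too many of the 1100 candidate ids collapse as duplicates or fall into gaps, this comes up short of 1000 - which is exactly what the recursive safety net is there to cover.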

Either way, the technique is intended to get a small random selection from a big table quickly.

The technique is pointless for cases with too many gaps or (your focus) a sample size that is too close to the total table size - so that the recursive term might run dry before the limit is reached. For such cases a plain old:

SELECT *   -- or DISTINCT * to fold duplicates like UNION does
FROM   tbl
ORDER  BY random()
LIMIT  n;

.. is more efficient: you are going to read most of the table anyway.


It is possible to generate a known number of rows using Set Returning Functions (SRF). Also, OUTER joins guarantee that one side of the join will be returned in full.

In this case, I assume a FULL JOIN against generate_series(1, 100) (if you aim for at least 100 rows) should do the trick. In fact, a LEFT JOIN might also do the trick, but it could just as well filter out extra rows that are actually wanted, therefore I would choose FULL.
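A rough sketch of what I have in mind (since I haven't seen your code, sample stands in for whatever query produces your rows, and big with its column id is a placeholder table):

WITH sample AS (
   SELECT id, row_number() OVER () AS rn
   FROM  (
      SELECT id
      FROM   big
      ORDER  BY random()
      LIMIT  100
      ) sub
)
SELECT g.n, s.id
FROM   generate_series(1, 100) AS g(n)
FULL   JOIN sample s ON s.rn = g.n;

The FULL JOIN guarantees at least 100 output rows: if the sample comes up short, the unmatched numbers from generate_series appear with NULLs on the sample side, and if the sample somehow produces more than 100 rows, those are kept as well.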

P.S. If you could show your code, it'd be easier to provide some examples.