MySQL query to get "intersection" of numerous queries with limits MySQL query to get "intersection" of numerous queries with limits mysql mysql

MySQL query to get "intersection" of numerous queries with limits


Flatten Your Criteria


You can flatten your multi-dimensional criteria into a single level criteria

enter image description here

Now this criteria can be achieved in one query as follow

(SELECT * FROM users WHERE gender = 'Male' AND region = 'North' LIMIT 40) UNION ALL(SELECT * FROM users WHERE gender = 'Male' AND region = 'South' LIMIT 80) UNION ALL(SELECT * FROM users WHERE gender = 'Male' AND region = 'East' LIMIT 120) UNION ALL(SELECT * FROM users WHERE gender = 'Male' AND region = 'West' LIMIT 160) UNION ALL(SELECT * FROM users WHERE gender = 'Female' AND region = 'North' LIMIT 60) UNION ALL(SELECT * FROM users WHERE gender = 'Female' AND region = 'South' LIMIT 120) UNION ALL(SELECT * FROM users WHERE gender = 'Female' AND region = 'East' LIMIT 180) UNION ALL(SELECT * FROM users WHERE gender = 'Female' AND region = 'West' LIMIT 240)

Problem

  • It does not always return the correct result. For example, if there are less than 40 users whose are male and from north, then the query will return less than 1,000 records.

Adjust Your Criteria


Let say that there is less than 40 users whose are male and from north. Then, you need to adjust other criteria quantity to cover the missing quantity from "Male" and "North". I believe it is not possible to do it with bare SQL. This is pseudo code that I have in mind. For sake of simplification, I think we will only query for Male, Female, North, and South

conditions.add({ gender: 'Male',   region: 'North', limit: 40  })conditions.add({ gender: 'Male',   region: 'South', limit: 80  })conditions.add({ gender: 'Female', region: 'North', limit: 60  })conditions.add({ gender: 'Female', region: 'South', limit: 120  })foreach(conditions as condition) {    temp = getResultFromDatabaseByCondition(condition)    conditions.remove(condition)    // there is not enough result for this condition,    // increase other condition quantity    if (temp.length < condition.limit) {        adjust(...);    }}

Let say that there are only 30 northener male. So we need to adjust +10 male, and +10 northener.

To Adjust---------------------------------------------------Male        +10North       +10Remain Conditions----------------------------------------------------{ gender: 'Male',   region: 'South', limit: 80 }{ gender: 'Female', region: 'North', limit: 60  }{ gender: 'Female', region: 'South', limit: 120  }

'Male' + 'South' is the first condition that match the 'Male' adjustment condition. Increase it by +10, and remove it from the "remain condition" list. Since, we increase the South, we need to decrease it back at other condition. So add "South" condition into "To Adjust" list

To Adjust---------------------------------------------------South       -10North       +10Remain Conditions----------------------------------------------------{ gender: 'Female', region: 'North', limit: 60  }{ gender: 'Female', region: 'South', limit: 120  }Final Conditions----------------------------------------------------{ gender: 'Male',   region: 'South', limit: 90 }

Find condition that match the 'South' and repeat the same process.

To Adjust---------------------------------------------------Female      +10North       +10Remain Conditions----------------------------------------------------{ gender: 'Female', region: 'North', limit: 60  }Final Conditions----------------------------------------------------{ gender: 'Female', region: 'South', limit: 110  }{ gender: 'Male',   region: 'South', limit: 90 }

And finally

{ gender: 'Female', region: 'North', limit: 70  }{ gender: 'Female', region: 'South', limit: 110  }{ gender: 'Male',   region: 'South', limit: 90 }

I haven't come up with the exact implementation of adjustment yet. It is more difficult than I have expected. I will update once I can figure out how to implement it.


The problem that you describe is a multi-dimensional modeling problem. In particular, you are trying to get a stratified sample along multiple dimensions at the same time. The key to this is to go down to the smallest level of granularity and build up the sample from there.

I am further assuming that you want the sample to be representative at all levels. That is, you don't want all the users from "North" to be female. Or all the "males" to be from "West", even if that does meet the end criteria.

Start by thinking in terms of a total number of records, dimensions, and allocations along each dimension. For instance, for the first sample, think of it as:

  • 1000 records
  • 2 dimensions: gender, region
  • gender split: 60%, 40%
  • region split: 10%, 20%, 30%, 40%

Then, you want to allocate these numbers to each gender/region combination. The numbers are:

  • North, Male: 60
  • North, Female: 40
  • South, Male: 120
  • South, Female: 80
  • East, Male: 180
  • East, Female: 120
  • West, Male: 240
  • West, Female: 160

You'll see that these add up along the dimensions.

The calculation of the numbers in each cell is pretty easy. It is the product of the percentages times the total. So, "East, Female" is 30%*40% * 1000 . . . Voila! The value is 120.

Here is the solution:

  1. Take the input along each dimension as percentages of the total. And be sure they add up to 100% along each dimension.
  2. Create a table of the expected percentages for each of the cells. This is the product of the percentages along each dimension.
  3. Multiple the expected percentages by the overall total.
  4. The final query is outlined below.

Assume that you have a table cells with the expected count and the original data (users).

select enumerated.*from (select u.*,             (@rn := if(@dims = concat_ws(':', dim1, dim2, dim3), @rn + 1,                        if(@dims := concat_ws(':', dim1, dim2, dim3), 1, 1)                       )             ) as seqnum      from users u cross join           (select @dims = '', @rn := '') vars      order by dim1, dim2, dim3, rand()     ) enumerated join     cells     on enumerated.dims = cells.dimswhere enuemrated.seqnum <= cells.expectedcount;

Note that this is a sketch of the solution. You have to fill in the details about the dimensions.

This will work as long as you have enough data for all the cells.

In practice, when doing this type of multi-dimensional stratified sampling, you do run the risk that cells will be empty or too small. When this happens, you can often fix this with an additional pass afterwards. Take what you can from the cells that are large enough. These typically account for the majority of the data needed. Then add records in to meet the final count. The records to be added in are those whose values match what is needed along the most needed dimensions. However, this solution simply assumes that there is enough data to satisfy your criteria.


Problem with your request is that there's enormous number of options that can be used to achieve proposed numbers:

       Male    Female    Sum-----------------------------North:  100         0    100      South:  200         0    200East:   300         0    300 West:     0       400    400 Sum:    600       400-----------------------------North:   99         1    100      South:  200         0    200East:   300         0    300 West:     1       399    400 Sum:    600       400-----------------------------....-----------------------------North:    0       100    100      South:  200         0    200East:     0       300    300 West:   400         0    400 Sum:    600       400

Just by combining North, East and West (with south always male: 200) you'll get 400 possibilities how to achieve proposed numbers. And it gets even more complicated when you have just a limited amount of records per each "class" (Male/North = "class").

You may need need up to MIN(COUNT(gender), COUNT(location)) records for every cell in table above (for the case that it's counterpart will be zero).

That is up to:

       Male    Female    ---------------------North:  100       100      South:  200       200East:   300       300 West:   400       400 

So you need to count available records of each gender/location pair AVAILABLE(gender, location).

Finding particular fit seems to be close to semimagic squares[1][2].

And there are several questions on math.stackexchange.com about this [3][4].

I've ended up reading some paper on how to construct these and I doubt it's possible to do this with one select.

If you have enough records and won't end up in situation like this:

       Male    Female    ---------------------North:  100         0      South:  200       200East:   300         0 West:   200       200 

I would go with iterating trough locations and add proportional number of Males/Females in each step:

  1. M: 100 (16%); F: 0 (0%)
  2. M: 100 (16%); F: 200 (50%)
  3. M: 400 (66%); F: 200 (50%)
  4. M: 600 (100%); F: 400 (100%)

But this will give you only approximate results and after validating those you may want to iterate trough result few times and adjust counts in each category to be "good enough".