Channel: Select n random rows from SQL Server table - Stack Overflow

Answer by Rob Boek for Select n random rows from SQL Server table

Ordering by newid() will work, but it is very expensive on large tables because it has to generate an id for every row and then sort them all.

TABLESAMPLE is good from a performance standpoint, but the results will clump: it samples whole data pages, so either every row on a page is returned or none is.

For a better-performing true random sample, the best way is to filter out rows randomly. I found the following code sample in the SQL Server Books Online article "Limiting Result Sets by Using TABLESAMPLE":

If you really want a random sample of individual rows, modify your query to filter out rows randomly, instead of using TABLESAMPLE. For example, the following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:

    SELECT * FROM Sales.SalesOrderDetail
    WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
                  / CAST (0x7fffffff AS int)

The SalesOrderID column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int) evaluates to a random float value between 0 and 1.
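To see what that expression produces on its own, here is a minimal sketch that evaluates it over five literal rows. The derived table v and its column id are made up for illustration; they are not part of the Books Online example.

```sql
-- Hypothetical five-row input; v and id are illustrative names only.
SELECT v.id,
       -- Mask the checksum to a non-negative 31-bit integer,
       -- then divide by the largest such value to land in [0, 1].
       CAST(CHECKSUM(NEWID(), v.id) & 0x7fffffff AS float)
           / CAST(0x7fffffff AS int) AS rnd
FROM (VALUES (1), (2), (3), (4), (5)) AS v(id);
```

Each run should return a different rnd value per row. Comparing that value with 0.01 keeps any given row with a probability of roughly one percent, which is why the number of rows returned by the filter query varies from run to run.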

When run against a table with 1,000,000 rows, here are my results:

    SET STATISTICS TIME ON
    SET STATISTICS IO ON

    /* newid()
       rows returned: 10000
       logical reads: 3359
       CPU time: 3312 ms
       elapsed time = 3359 ms
    */
    SELECT TOP 1 PERCENT Number
    FROM Numbers
    ORDER BY newid()

    /* TABLESAMPLE
       rows returned: 9269 (varies)
       logical reads: 32
       CPU time: 0 ms
       elapsed time: 5 ms
    */
    SELECT Number
    FROM Numbers
    TABLESAMPLE (1 PERCENT)

    /* Filter
       rows returned: 9994 (varies)
       logical reads: 3359
       CPU time: 641 ms
       elapsed time: 627 ms
    */
    SELECT Number
    FROM Numbers
    WHERE 0.01 >= CAST(CHECKSUM(NEWID(), Number) & 0x7fffffff AS float)
                  / CAST (0x7fffffff AS int)

    SET STATISTICS IO OFF
    SET STATISTICS TIME OFF

If you can get away with using TABLESAMPLE, it will give you the best performance. Otherwise, use the newid()/filter method. newid()/order by should be a last resort if you have a large result set.
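If an exact row count is needed rather than roughly one percent, one possible approach (not part of the answer above or of the Books Online article) is to over-sample with the filter and then trim the result with TOP and ORDER BY NEWID(). A rough sketch, assuming the same Numbers table; the variable @n and the 2 percent over-sampling rate are arbitrary choices for illustration and have not been benchmarked:

```sql
-- Assumption: exactly @n rows are wanted from the Numbers table used above.
DECLARE @n int = 10000;

SELECT TOP (@n) Number
FROM Numbers
-- Over-sample: keep about 2 percent of the rows (an illustrative rate),
-- so that the filter almost always yields more than @n candidates.
WHERE 0.02 >= CAST(CHECKSUM(NEWID(), Number) & 0x7fffffff AS float)
              / CAST(0x7fffffff AS int)
-- Shuffle only the surviving candidates, then TOP trims to @n.
ORDER BY NEWID();
```

The sort here only has to handle the rows that survive the filter rather than the whole table. If the filter happens to return fewer than @n rows, the query returns fewer as well, so the over-sampling fraction would need to be raised for small tables or larger @n.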

