On Tue, 17 Sep 2019, Sven Schreiber wrote:
> [D]oesn't ranking involve expensive sorting? My gut feeling is
> that using mrandgen(i,...) to produce "more" than enough random
> indices, then simply discarding the duplicates with uniq(), could
> be faster.
Here's what we do for datasets with "smpl n --random". We first see
which is smaller, the number of observations to be selected or the
number to be dropped. Working with the smaller, we generate random
integers in the appropriate range to select observations to include or
skip, discarding duplicates and proceeding till we have enough.
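In hansl terms the logic is roughly as follows. This is just a
minimal sketch of the idea, not the actual internal code; the
function name sample_rows and the 2*k over-generation factor are
my own choices here, and it assumes 0 < n < rows(X).

function matrix sample_rows (const matrix X, int n)
  scalar T = rows(X)
  # work with whichever is smaller: rows to keep or rows to drop
  scalar dropping = (T - n < n)
  scalar k = dropping ? T - n : n
  matrix idx = {}
  loop while rows(idx) < k
    # over-generate random indices, then discard duplicates;
    # uniq() keeps first appearances, so the order stays random
    idx = uniq(idx | mrandgen(i, 1, T, 2*k, 1))
  endloop
  idx = idx[1:k]
  if dropping
    matrix keep = ones(T, 1)
    keep[idx] = 0
    return selifr(X, keep)  # all rows not drawn, in original order
  else
    return X[sort(idx),]    # the drawn rows, in original order
  endif
end function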
I've put a matrix version of this into git. It's provisionally called
msample and it takes a matrix and a scalar n (the number of rows to
select) as arguments.
matrix X = seq(1,100)'
eval msample(X, 20)
But one further point occurs to me: when people run such a function,
does it meet expectations if the observations/rows are a random subset
but in their original order? Or would users expect the order to be
shuffled? (Of course it's easy enough to shuffle the rows of a matrix
yourself if you want; see the sketch below.)
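For what it's worth, one way to do such a shuffle in hansl is to sort
row indices by a random key; again just a sketch, nothing official:

matrix Y = msample(X, 20)
scalar r = rows(Y)
# attach a random key to the indices 1..r, sort by the key,
# and use the permuted indices to reorder the rows
matrix perm = msortby(seq(1, r)' ~ muniform(r, 1), 2)[,1]
matrix shuffled = Y[perm,]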
Allin