On Sat, 21 Sep 2019, Artur Tarassow wrote:
Am 20.09.19 um 19:14 schrieb Allin Cottrell:
[revised timing
comparison]
>
> Ratio of Jack-to-msample() = 35.932
Oh, good point. I've thought that <if algo> is equivalent to <if
algo==1>. I also obtain a factor of about 26. As expected the C
based version dominates.
Since we've collectively devoted a fair amount of time to this issue
(the sampling without replacement, not just the timing) I think we
should try to resolve it as expeditiously as possible.
Sorry, this is going to be a bit long but I'll try to be concise.
1) Choice of algorithm. Jack's suggestion is very clever (no
surprise!) but it does appear to be a good deal slower than the
algorithm used by msample(). It's not just hansl-versus-C. I tried
coding Jack's ranking-based algorithm in C as an alternative
back-end for msample(), and in that case my timings show the hansl
version taking only about 20 percent longer than C.
The msample algorithm is quite different:
* Create a "pool" array of integers, {0, 1, ..., r-1}, where r is
the number of rows in the source matrix, and set poolsize = r.
Set k = 0.
* For i = 0 to n-1, where n is the required number of draws,
- Generate a random unsigned integer, u, on [0,poolsize)
- Write row pool[u] of the source matrix into row k of the
matrix to be returned
- poolsize <- poolsize - 1
- k <- k + 1
- If u did not select the last member of the pool, shift
the "tail" of the pool (to the right of the selection) one
place to the left to close up the gap left by the selected
row. Use the C-library function memmove().
This emulates the actual drawing of balls from an urn without
replacement.
I was not sure how expensive the calls to memmove() would be, but in
fact this method seems to be quite fast. Comparing C implementation
with C implementation, it's at least an order of magnitude faster
than using ranking, and very much faster if you're drawing
relatively few rows from a matrix with many rows.
In addition, in running lots of cases for timing purposes, the
problem that concerned Sven with the ranking method -- namely the
possibility of getting ties -- showed itself a couple of times:
Invalid selection vector
*** error in function sample_wo_rep
X[sel[1:n],]
(So there must have been a *.5 somewhere.)
I therefore propose that we use the "balls from urn" algorithm.
2) Interface. It would be nice if we could generalize an existing
function, resample, rather than add another. But I really don't see
how that could work.
resample takes a matrix argument plus an optional block-length
argument. Its return value always has as many rows as the input.
msample takes a matrix argument plus a number of cases to be drawn,
which seems the natural interface for such a function. (And I now
take it that we don't want to try adding a block-length parameter
when sampling without replacement.)
Trying to meld these two seems to me artificial. The documentation
would have to be kinda convoluted, such that a user might well
think, "Look, you've got two different functions in here, why not
just give them different names?"
The name "msample" is not set in stone; if anyone has a better
suggestion, go for it. But as of now I think we need a new function.
Allin