On Sat, 21 Sep 2019, Sven Schreiber wrote:
On 21.09.2019 at 18:13, Allin Cottrell wrote:
> On Sat, 21 Sep 2019, Artur Tarassow wrote:
> Since we've collectively devoted a fair amount of time to this issue
> (the sampling without replacement, not just the timing) I think we
> should try to resolve it as expeditiously as possible.
>
> Sorry, this is going to be a bit long but I'll try to be concise.
Thanks, Allin. What I still don't fully understand is which features
are needed (or are likely to be needed). Initially I said we don't want
ordered draws, but I'm no longer sure that they couldn't also be a
sensible option.
I suppose it could be -- though my original preference for allowing
it was really just based on (what I took to be) its relative
computational simplicity, something I no longer think is very
important now that I've figured out an efficient algorithm for doing
the sampling + scrambling in one pass.
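
For readers following along, here is a rough sketch of one well-known way
to get sampling without replacement and scrambling in a single pass, a
partial Fisher-Yates shuffle. It is an illustration only and is not
claimed to be the algorithm referred to above; the function name and its
arguments are made up for the example.

```python
import random

def sample_without_replacement(n, k, rng=random):
    """Return k distinct indices from range(n), already in random order."""
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    pool = list(range(n))
    for i in range(k):
        # Choose uniformly among the not-yet-selected positions i..n-1
        j = rng.randrange(i, n)
        # Move the chosen value into slot i; slots 0..i hold the sample so far
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:k]
```

Because each selected value is swapped into the next free slot as it is
drawn, the first k slots form a uniformly random ordered sample, so no
separate shuffling step is needed afterwards.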
Something similar goes for the block length.
I think sampling by blocks without replacement is really problematic
and probably not to be messed with. Do you have a response to my
diagnosis in the following message?
https://www.mail-archive.com/gretl-users@gretlml.univpm.it/msg14200.html
The second possible approach I mention there would be
straightforward to implement but I seriously doubt it would have any
desirable statistical properties.
Maybe Artur's reference to R's sample function is a hint that we should
study more closely what that function does?
The signature is
sample(x, size, replace = FALSE, prob = NULL)
where "size" is the number of draws and "prob" is an optional vector
of weights. As Artur said, sampling by blocks is not supported.
There's one more optional boolean argument, useHash (listed under
sample.int rather than sample itself), which is available only if
replace = FALSE, prob = NULL, and size <= n/2.
This indicates that "the hash-version of the algorithm should be
used" and is recommended for large n. Oddly, I don't see any
indication of whether useHash is the default when its conditions of
application are met.
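
To make the idea of a hash-based variant concrete, here is a minimal
sketch of sampling without replacement by rejection, using a set to
remember the values already drawn. This is only a guess at the general
technique; it is not taken from R's source, and the function name and
arguments are hypothetical.

```python
import random

def sample_hashed(n, k, rng=random):
    """Return k distinct indices from range(n), rejecting repeats via a set."""
    if not 0 <= k <= n // 2:
        raise ValueError("this sketch assumes 0 <= k <= n/2")
    seen = set()   # values drawn so far (hash lookup)
    draws = []     # accepted values, in the order drawn
    while len(draws) < k:
        candidate = rng.randrange(n)
        if candidate not in seen:
            seen.add(candidate)
            draws.append(candidate)
    return draws
```

The storage here grows with the number of draws rather than with n, which
would explain a recommendation for large n. With size <= n/2, fewer than
half of the values are ever taken, so each candidate is accepted with
probability above 1/2 and the loop needs fewer than two tries per draw on
average.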
Allin