On Tue, 17 Sep 2019, Sven Schreiber wrote:
> [D]oesn't ranking involve expensive sorting? My gut feeling is
> that using mrandgen(i,...) to produce "more" than enough random
> indices, then simply discarding the duplicates with uniq(), could
> be faster.
Here's what we do for datasets with "smpl n --random". We first see
which is smaller, the number of observations to be selected or the
number to be dropped. Working with the smaller, we generate random
integers in the appropriate range to select observations to include or
skip, discarding duplicates and proceeding till we have enough.
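In hansl terms the logic is roughly as follows. This is just a
minimal sketch of the idea, not the actual internal code; the
function name sample_rows and the 2*k over-generation factor are
my own choices here, and it assumes 0 < n < rows(X).

function matrix sample_rows (const matrix X, int n)
  scalar T = rows(X)
  # work with whichever is smaller: rows to keep or rows to drop
  scalar dropping = (T - n < n)
  scalar k = dropping ? T - n : n
  matrix idx = {}
  loop while rows(idx) < k
    # over-generate random indices, then discard duplicates;
    # uniq() keeps first appearances, so the order stays random
    idx = uniq(idx | mrandgen(i, 1, T, 2*k, 1))
  endloop
  idx = idx[1:k]
  if dropping
    matrix keep = ones(T, 1)
    keep[idx] = 0
    return selifr(X, keep)  # all rows not drawn, in original order
  else
    return X[sort(idx),]    # the drawn rows, in original order
  endif
end function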
I've put a matrix version of this into git. It's provisionally called
msample and it takes a matrix and a scalar n (the number of rows to
select) as arguments.
matrix X = seq(1,100)'
eval msample(X, 20)
But one further point occurs to me: when people run such a function,
does it meet expectations if the observations/rows are a random subset
but in their original order? Or would users expect the order to be
shuffled? (Of course it's easy enough to shuffle the rows of a matrix
yourself if you want; see the sketch below.)
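For what it's worth, one way to do such a shuffle in hansl is to sort
row indices by a random key; again just a sketch, nothing official:

matrix Y = msample(X, 20)
scalar r = rows(Y)
# attach a random key to the indices 1..r, sort by the key,
# and use the permuted indices to reorder the rows
matrix perm = msortby(seq(1, r)' ~ muniform(r, 1), 2)[,1]
matrix shuffled = Y[perm,]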
Allin