I've added in git a function named ecdf, which takes a series or
vector argument and returns the empirical CDF as a two-column
matrix: the unique sorted values of the input in column 1 and the
corresponding cumulative relative frequencies in column 2.
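Spelled out, with n valid observations x_1, ..., x_n and distinct sorted
values v_1 < ... < v_k, row j of the result holds v_j and the share of
observations not exceeding it. The notation here is introduced only for
illustration:

```latex
\hat{F}_n(v_j) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{x_i \le v_j\},
\qquad j = 1, \dots, k
```

For example, the input {3, 1, 2, 3, 1} gives the rows (1, 0.4), (2, 0.6)
and (3, 1.0).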
This is not very difficult to do in hansl, but it's substantially
faster in C (the differential depending on the dimensions of the
problem) and I think it may be worth having. Test script below.
<hansl>
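# ECDF computed in plain hansl: one pass over M per distinct value
function matrix hansl_ecdf (const matrix M)
    # column 1: sorted unique values; column 2: placeholder zeros
    matrix ret = values(M) ~ 0
    scalar n = rows(M)
    loop i=1..rows(ret) -q
        # share of observations less than or equal to the i-th value
        ret[i,2] = sumc(M .<= ret[i,1])/n
    endloop
    return ret
end function

# test data: N integer draws from 1 to 200, so many repeated values
scalar N = 3000
matrix M = zeros(N, 1)
loop i=1..N -q
    M[i] = randint(1, 200)
endloop

# accumulate timings over 500 replications; each access to
# $stopwatch gives the time since the previous access and resets it
scalar t1 = 0
scalar t2 = 0
loop 500 -q
    set stopwatch
    matrix ec1 = ecdf(M)
    t1 += $stopwatch
    matrix ec2 = hansl_ecdf(M)
    t2 += $stopwatch
endloop

printf "built-in: %.3fs\n", t1
printf "hansl: %.3fs\n", t2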
</hansl>
hansl_ecdf() could probably be improved upon, but it seems like a
"natural" solution for hansl users. Output on the machine I'm at:
built-in: 0.204s
hansl: 1.610s
(Besides, the built-in version handles both series and vectors, and
automatically drops NAs or NaNs from the calculation.)
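
To make the sort-and-scan idea concrete, here is a minimal standalone C
sketch of one way such a result could be computed. It is not the code
that went into git: the name ecdf_sketch, the two output arrays and the
use of qsort() are assumptions made for the example. It drops NaNs,
sorts the remaining values once, and then records the cumulative share
at the end of each run of equal values, so the work after sorting is a
single pass rather than one pass per distinct value as in hansl_ecdf().

```c
#include <math.h>
#include <stdlib.h>

/* qsort comparator for doubles; NaNs must be removed beforehand */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *) a;
    double y = *(const double *) b;

    return (x > y) - (x < y);
}

/* Hypothetical helper for illustration, not gretl's internal function.
   Writes the k distinct sorted values of x[0..n-1] to vals[] and their
   cumulative relative frequencies to freq[]; both arrays must have room
   for n entries. NaNs are skipped. Returns k, or -1 if allocation fails. */
int ecdf_sketch(const double *x, int n, double *vals, double *freq)
{
    double *s = malloc((n > 0 ? n : 1) * sizeof *s);
    int i, m = 0, k = 0;

    if (s == NULL) {
        return -1;
    }

    /* keep valid observations only */
    for (i = 0; i < n; i++) {
        if (!isnan(x[i])) {
            s[m++] = x[i];
        }
    }

    qsort(s, m, sizeof *s, cmp_double);

    /* the last element of each run of equal values closes that run */
    for (i = 0; i < m; i++) {
        if (i == m - 1 || s[i + 1] != s[i]) {
            vals[k] = s[i];
            freq[k] = (double) (i + 1) / m;
            k++;
        }
    }

    free(s);
    return k;
}
```

Called on the six values {3, 1, NaN, 2, 3, 1}, this returns 3, with
vals holding 1, 2, 3 and freq holding 0.4, 0.6, 1.0: the NaN is left out
of both the values and the denominator.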
If I've missed a reason why this is redundant (which sometimes
happens!) please let me know. Otherwise I'll go ahead and document
it.
Allin