On Wed, 8 Jan 2014, Sven Schreiber wrote:
> On 08.01.2014 04:31, Allin Cottrell wrote:
> > On Tue, 7 Jan 2014, Sven Schreiber wrote:
> >
> >
> > 1) Writing data as text
> >
> > ...
> >
> > What's new here is that I've worked out a substantially faster way
> > of determining the appropriate format specification for each
> > series. In my tests this cuts several seconds off big data writes.
> > This improvement is independent of the compression level and the
> > skip-padding status.
> Yes, the results are impressive! I'm just wondering whether it would
> also help in this context to use the information on which variables
> are officially "discrete". Most of them (though not all) will be
> integer-valued, for example.
Good idea. Even if they're not integer-valued, they certainly should
not require 17 digits. The artifacts that I mentioned disappear if
you use the printf format %.15g, and this could safely be applied to
series marked as discrete without any need for elaborate testing.
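To make that concrete, here is a tiny standalone illustration (just a
sketch, not the gretl code): a value that carries binary representation
noise, such as 0.1 + 0.2, prints "clean" at 15 significant digits but
shows the artifact at 17, while an integer-valued observation from a
discrete series is unaffected either way.

#include <stdio.h>

int main (void)
{
    double x = 0.1 + 0.2;  /* carries a representation artifact */
    double d = 7.0;        /* typical value from a discrete series */

    printf("x: %%.17g -> %.17g, %%.15g -> %.15g\n", x, x);
    printf("d: %%.17g -> %.17g, %%.15g -> %.15g\n", d, d);
    return 0;
}

On a typical IEEE 754 system this prints 0.30000000000000004 versus
0.3 for x, and plain 7 in both cases for d.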
> Also, in principle the data are now changed with respect to the old
> format I guess; hopefully just within the precision error margin of
> doubles, but this should probably be tested -- or did you already?
I did. We need 17 significant digits if we're to reproduce exactly
results obtained using (e.g.) logs and random numbers (by "reproduce"
I mean: run a regression, save the data, reopen the data, run the
regression again). But we still use 17 digits if that's required --
that is, if printing to 15 digits doesn't leave trailing zeros.
[Personally, I think logs and random numbers should be generated by
script, not saved in a data file, but anyway.]
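For the record, the kind of check involved can be sketched in a few
lines of C. This is only an illustration, not the actual gretl source:
the digits_needed() helper is made up for the example, and it decides
via a simple print-and-reparse round trip rather than the test used in
gretl itself.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Return the %g precision (15 or 17) needed to write x so that
   reading it back reproduces the same double exactly. */
static int digits_needed (double x)
{
    char buf[32];

    sprintf(buf, "%.15g", x);
    if (strtod(buf, NULL) == x) {
        return 15;  /* 15 significant digits round-trip exactly */
    }
    return 17;      /* fall back to full precision */
}

int main (void)
{
    double vals[] = { 2.5, 100.0, log(3.0), 1.0/3 };
    int i;

    for (i = 0; i < 4; i++) {
        printf("%.17g needs %%.%dg\n", vals[i], digits_needed(vals[i]));
    }
    return 0;
}

"Ordinary" data values such as 2.5 come out needing only 15 digits,
while values produced by log() or the RNG generally require the full
17 to survive the save/reopen round trip unchanged.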
Allin