On Sun, 25 Aug 2013, Sven Schreiber wrote:
Am 24.08.2013 20:15, schrieb Sven Schreiber:
> Am 23.08.2013 11:21, schrieb Sven Schreiber:
>
>>
>> Here's my take at doing it in hansl (untested), but let's not forget
>> that the goal is (IMHO) to make the preprocessing unnecessary
>> altogether, by enabling 'join' to do smaller/greater comparisons on
>> ISO date strings!
>
> Following is now an actually working version, tested with the
> real-world 1MB file of INDPRO. However, it is very slow; much slower,
> it seems, than my Python solution. I don't know whether there are
> some gretl string internals that could be sped up.
Specifically, the preprocessing, including all calling overhead, takes
under 2 seconds with the Python solution and roughly 120 seconds with
native gretl.
I think the crucial lines are the following:
> loop repetitions # loop over the lines in file
> sscanf(rest,"%s\t%s\t%s\t%s\n",col1,col2,col3,col4)
> string rest = strstr(rest,"\n") + 1 # offset to drop the leading \n
>
That is, there are thousands of operations working on strings that hold
(almost) the entire file content (in this case about 1MB, as I said):
each pass rescans and re-copies nearly the whole remaining buffer, so
the total work grows quadratically with the file size. I have tried to
consolidate this into a cleverer sscanf line, but that didn't really
help. Glad to hear more ideas.
In contrast, in Python the file is read line by line. [...]
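Sven's actual Python script isn't shown in the thread, but the line-by-line
approach he describes might look roughly like this (the four tab-separated
columns are inferred from the sscanf format string above; the function and
field names are placeholders):

```python
def preprocess(lines):
    """Split each tab-separated line into its four fields.

    Each iteration touches only one line, so the cost per line is
    O(len(line)); nothing rescans or re-copies the whole remaining
    buffer, as the hansl strstr/assignment idiom does.
    """
    rows = []
    for line in lines:
        col1, col2, col3, col4 = line.rstrip("\n").split("\t")
        rows.append((col1, col2, col3, col4))
    return rows
```

Used as `with open(path) as f: rows = preprocess(f)`, Python's file object
itself yields one line at a time, which is why the whole job stays linear
in the file size.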
Yes, that will make the difference. I think that to do this sort of
thing efficiently on big files we would need at least one more
function: something like the internal function bufgets(), which
returns the next line of a text buffer held in memory until the
buffer is exhausted (probably with a switch to drop the trailing
newline character).
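The semantics of the proposed function can be modeled in Python as a
generator (this is only an illustration of the intended behavior, not the
gretl-internal bufgets() itself; the `drop_newline` switch is the one
suggested above):

```python
def bufgets(buf, drop_newline=True):
    """Yield successive lines from a text buffer held in memory,
    until the buffer is exhausted. With drop_newline=True the
    trailing newline character is stripped from each line."""
    start = 0
    n = len(buf)
    while start < n:
        end = buf.find("\n", start)
        if end < 0:
            # last chunk has no trailing newline
            yield buf[start:]
            break
        yield buf[start:end] if drop_newline else buf[start:end + 1]
        start = end + 1
```

The key point is that `start` advances through a single fixed buffer, so
each line is extracted in time proportional to its own length rather than
to the length of the remaining file content.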
Greater flexibility would be available if we implemented the suite of
functions fopen, fclose, fgets and friends. This wouldn't be very
difficult, since our functions would just be wrappers for the C library
functions, but it does raise the "Swiss army knife" issue mentioned by
Jack.
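For comparison, Python's file objects are themselves thin wrappers over
the same C stdio routines, so the kind of user-level loop the proposed
hansl suite would enable can be sketched like this (the hansl functions
are only proposed here; this is merely a model of the intended usage):

```python
def count_lines(path):
    """Read a file one line at a time via the fopen/fgets/fclose
    pattern, never holding more than one line in memory."""
    f = open(path)           # analogue of C fopen()
    n = 0
    while True:
        line = f.readline()  # analogue of fgets(): one line per call
        if line == "":
            break            # EOF, where fgets() would return NULL
        n += 1
    f.close()                # analogue of C fclose()
    return n
```

Only the open/read-line/close primitives are needed for the preprocessing
task above; everything beyond that is where the "Swiss army knife"
question begins.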
Allin