On 26.08.24 at 17:18, Cottrell, Allin wrote:
> On Mon, Aug 26, 2024 at 7:00 AM Artur T. <atecon(a)posteo.de> wrote:
>>
>> I stumbled over an issue that arises when a dataset has no values
>> in its last k columns.
>>
>> The dataset which I want to open has the following format:
>>
>> A B C D
>> 1 4
>> 2
>> 3
>>
>> As can be seen, the last two columns have a header, but all their
>> values are missing.
>>
>> For a csv file, all four columns get imported as series. However,
>> for ods or xlsx files only the first two columns are imported;
>> columns C and D are dropped. I could not find any hint in the help
>> file.
>>
>> For both ods and xlsx files, the terminal prints the message "Sheet
>> has 2 trailing empty variables".
>>
>> Can anybody explain this to me, please? Or is this a bug?
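>>
>> A minimal illustration of what I observe (file names made up):
>>
>>    open mydata.csv --quiet    # all four series A, B, C, D created
>>    open mydata.xlsx --quiet   # only A and B are imported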

> It's not a bug. You can take a look at the function
> import_prune_columns(), at
> https://sourceforge.net/p/gretl/git/ci/master/tree/plugin/import_common.c
> which shows that this is a deliberate policy. However, there is an
> inconsistency in that we don't apply that policy to plain CSV input
> (which is handled in the main body of libgretl, not via a plugin).
>
> I guess the history of this is that at one point I noticed some
> spreadsheet imports with lots of empty trailing columns, but didn't
> see examples of that sort in plain CSV. If it's important to make the
> behavior consistent, my inclination would be to apply
> import_prune_columns() to CSV files. In principle a file might
> contain many such empty columns, eating up memory uselessly.

Thank you for your detailed explanation of the current behavior and
its rationale. I appreciate the historical reasons behind this design
choice.

However, I'd like to offer a different perspective. In my view, empty
columns should not be removed automatically; that decision should lie
with the user. In my specific use case, I was quite surprised and
confused by the differing column counts across datasets.

Let me explain my application: each dataset represents a different
variable and contains time series for various countries. I iterate
over these datasets, vertically concatenating the time series via the
stack() function to construct panel series that are later joined
together. At some point I noticed that the dimensions weren't adding
up. I was aware of missing values in the data, but was perplexed when
the dimensions no longer matched.
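
To make this concrete, here is a minimal sketch of the pattern I use
(file and series names are made up; I assume each file holds one
variable, with one column per country):

    # hypothetical file: one variable, one column per country
    open gdp_by_country.csv --quiet
    list L = dataset               # all imported country columns
    scalar T = $nobs               # length of each country's series
    scalar k = nelem(L)            # number of columns actually imported
    dataset addobs (k-1)*T         # extend dataset to k*T observations
    series gdp = stack(L, T)       # vertical concatenation

If trailing empty columns are silently dropped, k differs across
files and the stacked series no longer line up, which is exactly the
mismatch I ran into.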

This automatic removal of empty columns led to unexpected behavior in
my workflow. It took me some time to identify the source of the
discrepancy, which could have been avoided if the original structure
of the data had been preserved.

While I understand the intention to save memory by removing empty
trailing columns, I believe this approach can lead to unintended
consequences and potential data-integrity issues for users who rely
on consistent column structures across their datasets.

Perhaps a compromise could be to keep the current behavior but add an
option to preserve all columns, even if empty. This way, users who need
the original structure can maintain it, while those who prefer the
memory-saving approach can continue to use the current default behavior.
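
Something along these lines, where the option name is purely
hypothetical:

    open mydata.xlsx --keep-empty   # hypothetical flag: also import
                                    # trailing empty columns (as all-NA
                                    # series)

That way the default stays as it is, and nothing changes for users
unless they ask for it.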
What are your thoughts on this suggestion? I'm open to further
discussion on this topic.
Best,
Artur