On Mon, 30 Sep 2019, Riccardo (Jack) Lucchetti wrote:
Hi all,
I've had a recurring problem with string-valued series, and I'm struggling to
find a solution.
Suppose you have two or more string-valued series that you get from a csv or
Stata file, and that they represent comparable variables, so they contain the
same strings. Currently, we encode string-valued series by creating string
arrays that get filled by occurrence; this, however, implies that there is no
guarantee that the correspondence between internal numerical values and
strings is the same for the different series. This makes it awkward to read
the output from commands such as freq or xtab.
Writing a script to correct for that has proven quite difficult, and what I
was able to come up with is VERY far from elegant. An example script follows,
and suggestions are much appreciated.
Well, I doubt whether you'd consider the following more elegant than
your own version! But I think it's at least fairly transparent (and
if coded in C would probably be reasonably fast). I'm restricting
the scope to two series here but that restriction could be lifted.
(I should mention that in git there's now an optional third argument
to resample(), not yet documented, that allows you to oversample.)
<hansl>
nulldata 40
# create artificial string-valued series s1, s2 with
# inconsistent encodings
series s1 = resample(seq(1,4)', , 40)
series s2 = resample(seq(1,4)', , 40)
strings s1strs = defarray("B", "D", "A", "C")
strings s2strs = defarray("B", "A", "D", "C")
stringify(s1, s1strs)
stringify(s2, s2strs)
series x1 = s1
series x2 = s2
# display the inconsistency
print s1 x1 s2 x2 -o --range=1:12
# end constructed input
# retrieve (presumed common) string values
# and sort them
strings strs = sort(strvals(s1))
series s1mod s2mod
loop i=1..$nobs -q
    # find the position of s1's string in the sorted array
    loop j=1..nelem(strs) -q
        if s1[i] == strs[j]
            s1mod[i] = j
            break
        endif
    endloop
    # same lookup for s2
    loop j=1..nelem(strs) -q
        if s2[i] == strs[j]
            s2mod[i] = j
            break
        endif
    endloop
endloop
stringify(s1mod, strs)
stringify(s2mod, strs)
x1 = s1mod
x2 = s2mod
# display consistent encoding
print s1mod x1 s2mod x2 -o --range=1:12
</hansl>
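To confirm that the recoding did what was intended, something like the following could be appended to the script above. This is only a sketch: it assumes s1mod, s2mod and strs are still in memory, and that eval will print a strings array in your build of gretl. It uses nothing beyond strvals() from the script and the xtab command mentioned in the original question.

<hansl>
# assumes the script above has just been run in the same session
# the two string tables should now be identical and in sorted order
eval strvals(s1mod)
eval strvals(s2mod)
# the numeric codes now carry the same meaning in both series,
# so the rows and columns of the cross-tabulation should line up
xtab s1mod s2mod
</hansl>

If the two string tables differ, the series did not share the same set of strings to begin with, and the "presumed common" assumption in the script would need revisiting.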
Allin