Hi all,
I've had a recurring problem with string-valued series, and I'm struggling
to find a solution.
Suppose you have two or more string-valued series that you get from a csv
or Stata file, and that they represent comparable variables, so they
contain the same strings. Currently, we encode string-valued series by
creating string arrays that get filled in order of occurrence; this, however,
implies that there is no guarantee that the correspondence between
internal numerical values and strings is the same across the different
series. This makes it awkward to read the output from commands such as
freq or xtab.
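To see the mismatch concretely, something like the following minimal sketch
should do. It assumes two string-valued series, here called var1 and var2
purely for illustration, are already in the dataset, and simply lists the
string attached to each numerical code in each series.
<hansl>
# sketch only: var1 and var2 are assumed to be string-valued series
strings s1 = strvals(var1)
strings s2 = strvals(var2)
# list the code-to-string mapping of each series in turn
loop i = 1 .. nelem(s1) --quiet
    printf "var1: code %d = %s\n", i, s1[i]
endloop
loop i = 1 .. nelem(s2) --quiet
    printf "var2: code %d = %s\n", i, s2[i]
endloop
</hansl>
If the two listings differ in order, the same code stands for different
strings in the two series.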
Writing a script to correct for that has proven quite difficult, and what
I was able to come up with is VERY far from elegant. An example script
follows, and suggestions are much appreciated.
<hansl>
set verbose off
function series string_reorder(strings new, series x)
    # recode x so that its numerical values follow the ordering in "new"
    strings ss = strvals(x)
    n = nelem(ss)
    m = nelem(new)
    series tmp = NA
    loop i = 1 .. n --quiet
        si = ss[i]
        # look up the position of si in the target array
        k = 0
        loop j = 1 .. m --quiet
            if si == new[j]
                k = j
                break
            endif
        endloop
        if k > 0 # found
            tmp = (x == si) ? k : tmp
        endif
    endloop
    return tmp
end function
clear
set verbose off
outfile "@dotdir/tmp.csv"
printf "var1,var2,var3\n"
printf "a,b,c\na,c,b\nb,b,b\nc,a,b\na,b,c\na,c,c"
end outfile
open "@dotdir/tmp.csv" --quiet
print var1 var2 --byobs
xtab var1 var2 # no good
# record encoding for var1
ss = strvals(var1)
# note: you can't just assign to var2
var2new = string_reorder(ss, var2)
stringify(var2new, ss)
delete var2
rename var2new var2
# values are the same, but the encoding is reordered
print var1 var2 --byobs
xtab var1 var2 # better
</hansl>
-------------------------------------------------------
Riccardo (Jack) Lucchetti
Dipartimento di Scienze Economiche e Sociali (DiSES)
Università Politecnica delle Marche
(formerly known as Università di Ancona)
r.lucchetti(a)univpm.it
http://www2.econ.univpm.it/servizi/hpp/lucchetti
-------------------------------------------------------