On Monday, June 3, 2024 at 11:07:25 PM EDT, Allin Cottrell <cottrell@wfu.edu> wrote:

On Sun, 2 Jun 2024, Sven Schreiber wrote:

> Am 02.06.2024 um 06:37 schrieb g s:
>> Yes, I'm not exactly clear on what the results are showing.
>>
>>
>> 1) I did a correlation matrix of the following variables: BirthRate,
>> Agriculture, Service, GDPPerCap, Population, InfantMort. I did NOT click on
>> "ensure uniform sample size".
>>
>> The top of the results box says
>> Correlation Coefficients, using the observations 4 - 229
>> (missing values were skipped)
>> Two-tailed critical values for n = 221: 5% 0.1320, 1% 0.1729
>>
>> A couple of results:
>> BirthRate and Agriculture = 0.7021
>> ...
> ...
>>
>> 3) Next, I did a correlation matrix of JUST BirthRate and Agriculture. Here
>> are the results:
>>
>> corr(BirthRate, Agriculture) = 0.68261942
>> Under the null hypothesis of no correlation:
>> t(220) = 13.855, with two-tailed p-value 0.0000
>>
>> The correlation here is different from steps 1 or 2,
>
> Yes, I can confirm that, and indeed the difference between 1 and 3 (values
> 0.702 and 0.683) is unexpected, I'd say. Perhaps there's something wrong when
> a subset of variables are selected and missing values are all over the place

True. In corr without the --uniform option we were trimming, from the
start and end of the sample range, any observations with at least one
missing value among the selected series. That was wrong: we should
only trim observations with missing values for _all_ of the selected
series.

So some of the individual correlations could end up using fewer
observations than the maximum possible. In the example given above,
the BirthRate,Agriculture and Population,GDPPerCap correlations
were being calculated with n = 221 and n = 225, respectively, when
these should have been n = 222 and n = 228.
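
To make the difference between the two trimming rules concrete, here is a
minimal sketch in Python. It is not gretl's actual code: the function names
trim_range and pairwise_corr are hypothetical, and the three short series are
toy data, not the dataset discussed in this thread. The sketch trims the ends
of the sample range under each rule and then computes one pairwise
correlation, skipping missing values within the trimmed range.

```python
from math import sqrt

def trim_range(series, rule):
    """Trim leading and trailing observations from the sample range.

    rule=any: drop an end observation if ANY series is missing there
              (the old, too-aggressive behaviour described above).
    rule=all: drop an end observation only if ALL series are missing there
              (the corrected behaviour).
    Returns the (start, end) indices of the trimmed range, inclusive.
    """
    def droppable(t):
        return rule(s[t] is None for s in series)

    start, end = 0, len(series[0]) - 1
    while start <= end and droppable(start):
        start += 1
    while end >= start and droppable(end):
        end -= 1
    return start, end

def pairwise_corr(x, y, start, end):
    """Correlation of x and y over [start, end], using every observation
    in that range where both values are present. Returns (r, n)."""
    pairs = [(a, b) for a, b in zip(x[start:end + 1], y[start:end + 1])
             if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / sqrt(sxx * syy), n

# Toy data (hypothetical): x and y are complete, z has missing values
# at both ends of the range.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
z = [None, None, 1.0, 3.0, 2.0, 5.0, 4.0, None]

old_range = trim_range([x, y, z], any)   # trims where any series is missing
new_range = trim_range([x, y, z], all)   # trims only where all are missing

r_old, n_old = pairwise_corr(x, y, *old_range)
r_new, n_new = pairwise_corr(x, y, *new_range)

print("old rule: range", old_range, "n =", n_old)
print("new rule: range", new_range, "n =", n_new)
```

Under the old rule the x,y correlation loses the three end observations where
only z is missing, although x and y are both present there. That is the same
effect as the n = 221 versus n = 222 discrepancy above.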

That's now fixed in git; snapshots will follow before long.

Thanks, Gene, for probing this matter.

Allin Cottrell