Hi Luca
Many thanks for your clear explanation. My interest in the Bayesian estimator arises out of a personal interest (as a salmon angler) and a concern that one of the UK government environment agencies has been pooling data from the 12 rivers that are, or have been, monitored using fish counters to count the numbers of adult salmon entering a river to reproduce. The intention is to use the pooled data model to estimate the run size on 56 other non-monitored rivers, using a Bayesian model (with a Beta-binomial non-informative prior). A number of independent variables are in the model (e.g. river flow data, angler effort in days fished, etc.). The rationale is that, as all salmon anglers must report their annual catch, the percentage of adult fish caught on the monitored rivers can be derived. This can then be used to estimate the numbers of fish entering non-counter rivers, given the catch data for each of those rivers. Essentially it is a panel data estimation problem of the proportion of fish caught by anglers in frequentist terms, and one I have already modelled in gretl to compare with the official Bayesian model output now being proposed.

The bottom line is that, for a given reported catch level on a river without a fish counter, if the percentage of fish caught on that river can be estimated then the total number of fish can be derived, i.e. total estimated fish in river = rod catch / rod catch %. So the model dependent variable is the percentage of fish caught on each of the 12 monitored rivers, logit transformed.
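
In gretl terms the transformation and back-calculation are simply the following (series names here are placeholders for my own data):

  # illustrative only -- series names are placeholders
  series p_catch = rod_catch / counter_run      # proportion of the run caught by rods
  series lp = log(p_catch / (1 - p_catch))      # logit-transformed dependent variable
  # for a non-counter river, given a predicted catch proportion p_hat:
  series run_est = rod_catch / p_hat            # estimated total fish = catch / catch %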

There are a number of underlying issues with the recorded river data that give me some cause for concern about their model, which pools the full set of the 12 monitored rivers' data. First, fish counter recording on 2 rainfed rivers in the pool ceased permanently in 2015, and on another temporarily, though that counter is now being reinstalled. Second, the hydrological profiles of 4 of the other rivers are unique, being chalkstream rivers where much of the flow comes from subterranean spring water sources. Half of the pooled panel units therefore have rainfed river flow regimes, where salmon only enter the rivers at times of higher flow. It is clear that rod catch proportions are higher on the chalkstream sample than on the rainfed rivers, and in fact the chalkstreams account for only 4% of the national salmon rod catch, so they are disproportionately represented in the pooled sample.

In frequentist terms, several of the exploitation rates in the panel of monitored rivers are only trend stationary, or AR(1) (hence my question about recovering residuals from the Bayesian model fitted estimates). One of their primary independent variables, fishing effort, exhibits structural change which passed unrecognised in the official model.
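
For a single monitored river, the sort of checks I have in mind look like this in gretl (series names again just placeholders):

  # per-river checks, illustrative only
  genr time
  adf 4 exploit --c --ct              # unit-root test, with constant and with trend
  ols exploit const time flow effort  # simple per-river regression
  chow 2001                           # Chow test for the 2001 break in rod effort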


The official model fails to account for the different rod catch proportion levels on the different monitored rivers, which an LSDV (least squares dummy variable) panel data framework readily handles, with robust errors to cope with heteroskedasticity. The official model simulations, based on the Bayesian parameter mean estimates, consequently overstate the rod catch proportions for the 52 non-monitored rivers and under-predict the number of salmon entering non-counter rivers.
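
My frequentist comparison model is essentially a one-liner in gretl (variable names illustrative):

  # fixed-effects (LSDV) panel estimation with robust standard errors
  panel lp const flow effort --fixed-effects --robust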

In summary, the issue largely reduces to one concerning the constant in the estimated prediction equation, and the appropriate rivers to include in the pooled panel dataset. I have estimated the panel model with all 12 rivers, with 10 (excluding the 2 rivers where the counters have permanently ceased), and with 6 rivers (also excluding the 4 chalkstreams), i.e. a panel of rainfed rivers only. Across that sequence of shrinking panels, model errors fall and mean predicted rod catch percentages are lower than the official model estimates. My dataset also drops the initial sample years of 1994-2001 used in the official model, to avoid including the pre-2001 structural break in the rod effort series. My interest now is to replicate my model specification with Bayesian estimates, but also to be able to establish whether there is serial correlation in its residuals, which naturally I suspect will be lurking in the mean-coefficient fitted version of the official model.
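
Following Sven's lincomb() suggestion further down the thread, I imagine something along these lines would do it (untested, and assuming the BayTool output bundle is called out and the covariate list X, as in the sample script):

  # untested sketch: residuals from the posterior-mean fit, then a crude AR(1) check
  series yhat = lincomb(X, out.post_summ.post_betamean)
  series ures = lp - yhat
  ols ures const ures(-1) --robust   # a significant lag coefficient flags serial correlation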

As I chair a government special advisory panel for agricultural economics, you can see that I need to be clear on the issues of Bayesian estimation before stirring the pot in the Ministry, which has oversight of, and funds, the fisheries science agencies doing the modelling. A one-size-fits-all model may be administratively convenient, but in my view it is hardly appropriate.

The government agency has published a paper about their model in a fisheries science journal, on which I intend to comment formally. Much is left unsaid in the article regarding post-estimation diagnostic checking or panel selection. Tiny graphs are presented, usually with log-scale vertical axes, which makes the model fit and data points difficult to see clearly. However, I have photo-enlarged the graphs, only to discover that, depending on the river panel unit, some 39 to 70 percent of their actual recorded data points fall outside the 95% (in)credible interval of the model's fitted mean predicted values for each river.

Cheers


On Tue, 2 Apr 2024, 23:32 Luca Pedini, <luca_pedini@live.it> wrote:

Hi Brian,

 

thanks for the message, and thank you Sven for the quick reply. Unfortunately I’ve been experiencing some issues with the mailing list recently: I’m receiving the messages late or not at all! Sorry!

 

As for your original questions, the function by default will generate non-informative results (frequentist) alongside the informative ones (truly Bayesian, obtained via the chosen prior): the idea behind this is to have a direct comparison between the two. In the resulting output bundle you can access both of them: the sub-bundle non_info_case reports all the basic frequentist estimators, while for the informative results you can access two different sub-bundles: i) sampled_coeff, which contains all the sampled parameters and from which you can recover quantiles or any sort of function; and ii) post_summ, which simply reports posterior means and standard deviations.
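
For instance, assuming the output bundle is called out as in the sample script, you can inspect these roughly as follows:

  print out                          # lists the members of the output bundle
  bundle ni = out.non_info_case      # frequentist (non-informative) results
  bundle ps = out.post_summ          # posterior means and standard deviations
  print ni
  print ps
  eval out.post_summ.post_betamean   # posterior mean vector of the coefficients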

 

In the sample script the informative prior is “not so informative”; for this reason the results are quite similar to the frequentist ones. But if you have to choose which posterior quantity to use between non-informative and informative, always use the informative one. Moreover, if you have to compute some functions of the posterior quantities, I suggest computing them on the sampled parameters and not simply on the posterior mean. In this way you can derive a distribution and draw more exact conclusions.
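
For example, supposing the draws inside sampled_coeff are stored as a matrix with one row per draw (the member name "betas" below is only a placeholder), you could do something like:

  # illustrative only: "betas" stands for whatever matrix of sampled
  # coefficients (one row per draw) lives inside sampled_coeff
  matrix draws = out.sampled_coeff.betas
  eval quantile(draws, 0.025)   # per-coefficient lower 2.5% posterior quantile
  eval quantile(draws, 0.975)   # per-coefficient upper 97.5% posterior quantile
  eval meanc(draws)             # posterior means, should agree with post_summ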

 

As for the residuals question, these are not stored: keeping the residuals for every sampled parameter vector would quickly drain the computer's memory, especially with big datasets, so they are not saved. However, you can recover them via the sampled coefficients.
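
For example (again only a sketch, with the same placeholder name for the draws matrix, and y standing for the dependent variable), you could build up a distribution of residuals draw by draw:

  # untested sketch: residuals computed for each sampled coefficient vector
  matrix draws = out.sampled_coeff.betas   # placeholder member name, one row per draw
  matrix U = {}                            # residuals, one column per draw
  loop i = 1 .. rows(draws)
      series yh = lincomb(X, draws[i,]')   # fitted values for draw i
      series ei = y - yh                   # residuals for draw i
      U ~= {ei}                            # append this draw's residuals as a column
  endloop
  eval meanr(U)                            # per-observation mean residual across draws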

 

I hope this may help and if you have any other questions or doubts, do not hesitate to ask!

 

Best regards,

 

Luca

 

From: Brian Revell <bjr.newmail@gmail.com>
Date: Monday, 1 April 2024 23:18
To: Gretl list <gretl-users@gretlml.univpm.it>
Subject: [Gretl-users] Re: BayTool Package

Thanks Sven for the detailed explanation. I had eventually arrived at the conclusion that the mean value of the density was the point estimate being output, albeit that the densities for each parameter were also an output plot option. It would seem then that the residuals from the mean values are not retrievable (yet) from running the GUI version, which incidentally has now attached itself permanently as an option within the univariate options.

 

It would appear that in many applied studies the parameter means with credible intervals are employed. In a forecasting or projection context with Bayesian estimation of time series data, one might suppose that the residuals would still be important for post-estimation diagnosis regarding stationarity and serial correlation.

 

Brian

 

On Mon, 1 Apr 2024, 20:55 Sven Schreiber, <sven.schreiber@fu-berlin.de> wrote:

On 16.03.2024 at 17:22, Brian Revell wrote:

Hi

I have tried using the BayTool package with some success (in terms of getting output), and I have read (perhaps not well enough) the detailed paper by Luca Pedini.

Luca has had some email problems lately, perhaps he didn't get this one.

However, I am puzzled why, when simply using the linear model with conjugate prior as a default, I get two sets of posterior mean parameter and s.e. estimates: NI-post m and I-post m, and NI-post se and I-post se. Looking at the graphs of the Gaussian posterior density estimates, they appear to be more closely centred on the I-post mean. Although the differences between the NI and I posterior estimates are small, what is the basis for selecting one rather than the other?

I believe that NI stands for non-informative and I for informative, which typically refers to the prior used. So basically the NI results should be purely data-driven, which usually means the max-likelihood or frequentist quantities. This is just for comparison, a Bayesian would of course go for the informative ones. (Provided she really believes in the prior, which for some reason doesn't seem to happen all that often in reality, but that's a different story.)

 

Second question: how can I retrieve the fitted values and residuals from the BayTool package? Hopefully they reside somewhere to be retrieved post-estimation. Or is it necessary to run the input data through the equation to generate them oneself, and if so, which set of posterior mean estimates does one use?

Luca may correct me eventually, but it seems those things are not directly available. I guess this makes sense from a Bayesian perspective since it gives you densities, not point estimates. For example, do you want to consider the posterior mean, median or mode as your quasi-point estimate? All imply different sets of residuals. In principle they could all be stored of course, but it's not clear whether that's the most practical thing to do.

Having said that, it looks like for the coefficients (in the sub-bundle post_summ) the means are stored, so in that context this arbitrary choice was made, apparently.

So for the CASE 1 in the sample script with covariates list X, I guess you could try:

 lincomb(X, out.post_summ.post_betamean)

to get the fitted values. (Untested.)

 

PS -the BayTool package does not seem to attach permanently to the URL "model" list of options.

Sorry, I don't understand. What URL?

cheers

sven

_______________________________________________
Gretl-users mailing list -- gretl-users@gretlml.univpm.it
To unsubscribe send an email to gretl-users-leave@gretlml.univpm.it
Website: https://gretlml.univpm.it/postorius/lists/gretl-users.gretlml.univpm.it/