Model errors in tree biomass estimates computed with an approximation to a missing covariance matrix

Background Biomass and carbon estimation has become a priority in national and regional forest inventories. Biomass of individual trees is estimated using biomass equations. A covariance matrix for the parameters in a biomass equation is needed for the computation of an estimate of the model error in a tree level estimate of biomass. Unfortunately, many biomass equations do not provide key statistics for a direct estimation of model errors. This study proposes three new procedures for recovering missing statistics from available estimates of a coefficient of determination and sample size. They are complementary to a recently published study using a computationally intensive Monte Carlo approach. Results Our recovery approach uses survey data from the population targeted for an estimation of tree biomass. Examples from Germany and Mexico illustrate and validate the methods. Applications with biomass estimation and robust recovered fit statistics gave reasonable estimates of model errors in tree level estimates of biomass. Conclusions It is good practice to provide estimates of uncertainty to any model-dependent estimate of above ground biomass. When a direct approach to estimate uncertainty is impossible due to missing model statistics, the proposed robust procedure is a first step to good practice. Our recommended approach offers protection against inflated estimates of precision.


Background
The importance of forest biomass for the global carbon cycle is widely recognized [1][2][3][4]. The imperative of maintaining global levels of forest biomass and slowing regional rates of decline [5] has fostered international cooperation, initiatives, and projects to this end [6][7][8].
A large number of countries have agreed to implement an accounting system for forest carbon and to report on national-level annual gains and losses [9][10][11].
With few exceptions, the forest carbon accounting system has a national forest inventory at its core, and a suite of models to expand and transform inventory data to forest carbon [12][13][14]. Carbon components not fully covered by an inventory are typically estimated from activity data (e.g. harvest, disturbance, and erosion) and models fitted to data from research studies of, for example, litter-fall, litter-decomposition, fine-root turnover, seed production, and dead and downed-woody debris.
An estimate of the uncertainty in a carbon balance has become a routine requirement [15,16]. When the core inventory data comes from a probability sample, the uncertainty arises from three sources: observational and measurement errors [17][18][19], sampling errors, and errors in model parameters [12,20]. The live above-ground forest tree biomass (AGB) accounts for the largest contribution to the forest carbon balance [21,22].
In situ determination of AGB is extremely costly and destructive. A model-dependent approach with prediction of biomass from a biomass equation, with easy-to-measure explanatory variables, is the only practically feasible alternative [21,23].
It is, of course, very difficult to ascertain whether an off-the-shelf model is suitable for a particular application or not [37]. It remains a risky proposition to use externally fitted models without any form of validation or recalibration to local conditions [38]. An adopted model generates the desired predictions of above-ground biomass but a valid estimate of the associated covariance of model-parameters is needed to compute an estimate of the uncertainty in a prediction [12, 39, p. 73, 40]. A model-bias can only be quantified in a validation with actual observations of above-ground biomass and the predictors in a model [41, pp. 172 and 232, 42].
Although we have a plethora of equations for aboveground biomass as a function of, for example, stem diameter at a reference height of 1.3 m above ground level [21,26,31,43], information regarding the covariance matrix of model parameters is often missing. Available fit statistics are generally limited to one or more of the following: standard errors of estimated parameters, the coefficient of determination, the standard deviation of lack-of-fit residuals, and sample size [44].
This study demonstrates methods for recovering a covariance matrix for model parameters in a biomass equation from fit statistics restricted to sample size ($n$) and the coefficient of determination ($R^2$) [44]. We do not use a possibly available estimate of the standard deviation of empirical residuals because of its sensitivity to outliers [45] and its strong dependency on the sampling design [39, p. 55] and on the distribution of the response and explanatory variables in the study that gave us the equation of interest. Wayson et al. [44] proposed a Monte-Carlo approach to recover missing estimates of the covariance among parameters in a biomass equation.
The key idea in their approach is to generate a distribution of pseudo-data that mirrors, to the extent possible, a known or assumed distribution of explanatory variables in the sample trees behind an equation. The tenet behind our approach is different. It is rooted in survey sampling [39]. Hence, the recovered estimates of uncertainty are assumed compatible with estimates that could have been obtained from a sample taken from the population, for which we desire estimates of biomass. It is fully recognized that our recovery is neither perfect nor unbiased. However, supported by our results, we argue that our approach is consistent with the main objective of any recovery procedure: to estimate model errors in population estimates of biomass as opposed to a rediscovery of 'lost' estimates of model errors.
Our demonstrations include examples with equations and data from the first German national forest inventory in 1987 (BWI-1) [40,46] and the 2004-2009 Mexican National Forest Inventory [47][48][49]. We discuss limitations to our approach, and recommend a robust recovery method. We also emphasize the need to develop new and fully documented biomass equations for important species in regions where they are currently lacking.

Examples from Germany
Substitutes for missing covariance matrices for the biomass models in Table 1 are listed in Table 2. For refitted matrices, there were three rejections of the null hypothesis of equality (actual = refitted) at the 5 % level of significance. For the recovered matrices there was one rejection, and for the robust recovery there were zero rejections. A distinct pattern emerged when comparing refitted, recovered, and robust variances. Refitting appears to overestimate the variance in a regression parameter: by approximately 70 % for the first parameter and approximately 35 % for the second parameter. In contrast, the recovered variances were, on average, smaller than the actual variances (5 and 16 %, respectively). Robust estimates of variance were closer to actual estimates of variance than refitted and recovered estimates. Substitute covariance matrices for the nonlinear models were, in general, closer to the missing (actual) covariance matrix than a substitute covariance matrix for a linear model.
Taking into consideration that the relative error in the regression coefficient to diameter (DBH or DBH$^2$) is six to twelve times smaller than the relative error in the regression coefficient to √ DBH × HT or √ HT, a bias in the former is much more serious than in the latter. For unweighted linear and nonlinear equations, the robust procedure appears to be the most attractive. In addition, the strong impact of errors in the first regression coefficient on a tree-level estimate of AGB amplifies concerns surrounding the overestimation of model-errors encountered with the refitting procedure.
For the weighted least squares equations, the estimates of the substitute covariance matrix are in Table 3. Generally, the results were worse, with larger differences between estimated substitutes and actual covariance matrices. The refitted matrices were worst, both in terms of significant departures from the actual matrices (five out of six) and in terms of seriously overestimating the variances. The best results were obtained with the recovered matrices (two rejections of the null hypothesis of no difference). Yet there is an average overestimation of the first variance by 23 % and an average underestimation of the second by 25 %. Considering the larger contribution to the model error variance from the former, the overestimation is a concern. A robustly recovered matrix was in four cases significantly different from the actual covariance matrix and overestimated variances by 72 and 24 %.
Recovering an estimate of the residual variance was, as expected, easier than recovering a covariance matrix. The relative error in recovered estimates of the residual standard error varied from approximately −20 to +35 %.
Two of eight estimates were significantly different from the actual values (F-ratio test, P = 0.02); for the remaining six, the level of significance was 0.10 or greater.
Attempts at a recovery of the covariance matrices for the generalized above-ground biomass Eqs. 13-15 in Table 1 [26] failed, regardless of method. With the recovery methods, the estimated standard deviations of the three regression parameters were 2-8 times greater than those listed in Table 3 of Muukkonen [26]. Had we used the tabled values of the root mean squared error in lieu of the recovered substitute, the estimated errors would have been approximately 30-70 times too small. The failure is easy to explain: the fit-statistics of the generalized model apply to the set of models that are generalized. Footnotes to Table 3 in Muukkonen [26] carefully explain the constrained interpretation of the table entries. Due to the poor accuracy of the recovered generalized covariance matrices they were not used to gauge the error-propagation to estimates of tree-level AGB.
All recovery procedures are fraught with numerical problems due to co-linearity among regression coefficients (correlation coefficients varied between −0.87 and −0.97), and large differences in accuracy of parameter estimates. For example, the matrix condition number varied between 14.1 and 14.9, and determinants were less than $10^{-5}$, suggesting a serious potential for amplified estimation errors when inverting a covariance matrix [50,51]. Challenges of this nature will also be encountered in applications of the proposed procedures.
A summary of the effect of replacing a missing (actual) covariance matrix with a substitute approximation on the model-error in an estimate of the mean per tree AGB (kg) is provided in Table 4. With the un-weighted linear models the relative model error in the average tree-level estimate of AGB is 7-12 % (column ACT in Table 4). Model errors in estimates based on a nonlinear un-weighted equation were approximately 2-3 percentage points lower.
Weighted regressions were uniformly superior with the lowest relative errors. Results with the substitute covariance matrices followed, by and large, these trends, with estimates within one to six percentage points of results with the actual estimate of covariance. In the case of weighted regressions, two poor results with the refitting procedure with PINE data and two with the robust recovery with BEECH data stand out as examples of inflated estimates of model-error. The remaining estimates of error appear reasonable, yet they do not indicate that one recovery procedure is substantially and consistently better than the presented alternatives.

Examples from Mexico
For Guazuma ulmifolia and Ochroma pyramidale the substitute estimates of the parameter error variances were not statistically significantly different from the actual estimates of error (Table 5). This is in spite of overestimating, by a factor of approximately two, the variances in the regression parameters for G. ulmifolia. The relatively small sample sizes of 18 and 16 trees limit our power to declare practically important differences significant. In the case of Inga vera and Trichospermum mexicanum the substitute variances were two to four times larger than the published estimates. Each recovery procedure led to inflated estimates of variance. The basic recovery method holds a slight edge over the other two. We did not attempt a weighting scheme in the recovery procedure, as the log transformation of AGB and DBH in most cases removes variance heteroscedasticity in the original scale of the residuals. Power functions as used for Quercus spp. are extremely sensitive to the weighting schemes used in the German examples. Besides, the original biomass equations were not obtained by weighted least squares [52], so we did not employ a weighted recovery scheme.

Table note: Data were selected from 335 plots from the 1987 (West) German national forest inventory (BWI-1987). Plots were dominated by one of the three species groups. Selected trees have a DBH ≥ 7 cm, and were selected with a probability proportional to their basal area at breast height (basal area factor of 4) [77, ch. 8].
Table 2 Actual, refitted, and recovered covariance matrices of non-weighted regression coefficients in equations in Table 1. Actual covariance matrices are based on a sample size of 50.
Table 3 Actual, refitted, and recovered covariance matrices of regression coefficients in weighted least squares equations in Table 1. Actual covariance matrices are based on a sample size of 50.
Substitute estimates of the residual standard error were considerably and statistically significantly smaller (30-240 %) than the published values. These results paired with the inflation of the variance of regression coefficients suggest a much smaller variation of the explanatory variables in the samples from the national inventory than in the sample used for fitting. A uniform distribution of the explanatory variables in the model fitting sample [53] could explain our results.
Tabled estimates of the residual standard errors for the three Quercus spp. were three to four times smaller than recovered estimates. We noted that even a small reduction of 1-2 % in the published value of $R^2$ would bring the two sets of estimates within approximately 20 % of each other. Power functions are notorious in this regard.
When the uncertainty in biomass equation parameters was propagated to tree-level estimates of AGB, we obtained the average relative per tree model-errors in Table 6. Overall, the relative model errors in the average per tree AGB in G. ulmifolia appear too low, despite an apparent overestimation of the errors in the model parameters. Refitting of a missing covariance matrix via the parametric bootstrap generated unrealistically large estimates of relative errors in Quercus laeta and Quercus spp. Numerical instability of the covariance matrix, small sample sizes, and random multiplicative residuals with a large variance are a recipe for poor results. As expected, the robust recovery produces the largest estimates of relative errors.
Table 4 Relative model errors (%) in estimates of the mean per tree above-ground tree biomass with actual (ACT), refitted (REFIT), recovered (RECOV), and robustly recovered (RREC) covariance matrices for the parameters in the biomass equations in Table 1
Table 5 Actual, refitted, and recovered variances of regression coefficients in Eqs. 1-7 in Table 8. Actual covariance matrices are based on sample sizes listed in Table 8.

Discussion
The need for forest biomass equations has increased sharply over the past decades in response to efforts directed at quantifying stock and stock-changes in forest carbon and the potential for bioenergy extraction [21,42,54]. Ideally there would be an equation for each tree species and region with distinct growth forms and management regimes [55,56]. We are still far from this ideal. Even the equations we have are generally based on very limited sampling within a relatively small area and range of tree sizes [21]. This is understandable in light of the high costs of producing a biomass equation [21,26,57]. Biomass estimates for large trees are therefore fraught with problems of applicability of available biomass equations.
In the computation of forest biomass in a large region, country, or even a continent, it is common practice to use a suitable biomass equation for a particular species and growth region [58][59][60]. In most cases, there is no separate calibration of chosen biomass equations.
Against this background, national and regional estimates of above-ground biomass should be regarded as no more than first-order approximations [16]. The requirement [11] to quantify or at least assess uncertainty in a national or regional estimate of forest biomass has precipitated a need for estimates of errors in the parameters of employed biomass equations. For a large number of equations, this information is partially or entirely missing [31,44].
In a context of model-dependent estimation of forest tree biomass and model-errors in these estimates, a covariance matrix of the model parameters is needed [12,40]. When this statistic is missing a substitute is needed. Wayson et al. [44] proposed a computationally intensive method for generating a large number of pseudo data of the dependent and independent variables in a biomass equation. Samples are then drawn repeatedly and the model is refitted each time. The sampling aims at mimicking the actual sampling process (if known) of the original data behind a biomass equation.
Our proposed procedures for computing a substitute for a missing covariance matrix are computationally faster and make direct use of data of the explanatory variables sampled from the population targeted for an estimation of biomass. The distribution of the explanatory variables used to compute (recover) a covariance matrix plays a pivotal role in both approaches. If the actual distribution behind an equation differs from the distribution in the recovery process, a covariance matrix different from the actual (but unknown) will emerge from a recovery procedure. We saw several instances of this in our examples, but an equal number of instances where a substitute matrix was not statistically different from the target matrix. Wayson et al. [44] do not report at this level of detail, but we surmise that they encountered similar issues. The question is now whether these differences are relevant or not. We argue that sampling the explanatory variables from the target population yields estimates adapted to the application domain rather than to a small sample of trees with unknown representation in the target population.
The most intuitive approach to recover a missing covariance matrix is a variant of the parametric bootstrap [61]. In the textbook version of a parametric bootstrap, n pseudo observations of Y are generated a large number of times (say B) by adding a random draw from the observed empirical regression residuals to the n model predictions obtained from the original regression model and the observed explanatory variables. The regression model is then refitted B times to the pseudo observations of Y. At the end, the analyst has B replications of the covariance matrix of the model parameters. Without observed residuals, this approach is not feasible. Instead, our recovery by refitting resorted to random sampling of the explanatory variables from the target population for biomass estimation, and residuals from a distribution deemed realistic to the case at hand (e.g. a gamma distribution for multiplicative residuals). Although this method in many cases was as good as the alternative approaches, it was equally clear that it entails a considerable risk of poor results. A risk traced to random interactions between [...]
Table 6 Estimates of mean AGB (kg tree$^{-1}$) and relative errors (%) in estimates of mean AGB for seven Mexican species. Estimates are based on tree data provided by the Mexican NFI (see Table 9). The errors are derived with refitted (REFIT), recovered (RECOV), and robustly recovered (RREC) covariance matrices for the parameters in the biomass equations in Table 8.
[...] To carry this efficiency through to a recovered covariance matrix, a weighting scheme applied to the original biomass equation should be replicated in a recovery procedure.
A matrix recovery based on the average (vector) gradient of the model parameters with respect to the explanatory variables was, on balance, better suited for the purpose of estimation of model errors in tree-level biomass estimates. A robust variant of the recovery is easy to compute and, despite expected and observed larger estimates of model-errors, we recommend this procedure as a prudent choice. For the purpose of obtaining reasonable estimates of model-errors in tree-level estimates of biomass, it is not a strict requirement that a recovered covariance matrix is close to the actual but missing matrix. Most of our estimates, but especially those obtained with the robust recovery procedure, seem reasonable [13,16,20,57,64]. Our resampling of explanatory variables from inventory data representing the population targeted for an estimation of biomass ensures that the mean of the explanatory variables will be close to the mean in the target population. Ceteris paribus, this will counter the aforementioned inflation of model-parameter variances [39, ch. 5.4].
An attempt to recover a covariance matrix can end in failure. A failure was demonstrated with the generalized biomass equations for beech, pine, and spruce in the temperate zone [26, Table 1]. A failure is pre-ordained when estimates of $R^2$ and a root mean squared error are incompatible with the biomass equation applied to actual data. Our experience should raise awareness of potential pitfalls in published fit-statistics for a generalized equation, unless they reflect a proper meta-analysis [65].
Throughout we have treated published fit statistics as known entities. It would have been preferable to consider an empirical Bayesian recovery procedure [66]. The coefficient of determination is pivotal in our proposed procedures. Its sampling variance can only be estimated from the data supporting a biomass equation [67]. To recognize sampling variance in $R^2$, a recovery would be repeated a large number of times, each with a random draw from an anticipated distribution of $R^2$, to create an empirical Bayes posterior distribution of the recovered statistic. The recovery procedure by Wayson et al. [44] contains elements of a Bayesian approach.
Although a recovered covariance matrix affords an estimate of the model error in a tree level biomass estimate, the model-error is conditional on a correctly specified model. If a published biomass equation is the result of an intensive model and variable screening process, we must expect optimism in published statistics and model-bias [68].
We have demonstrated the recovery of a missing covariance matrix without too much concern about sample size. Clearly, a biomass equation derived from a small sample size has a relatively high risk of model bias due to a high influence of individual observations [62, p. 170]. It is not possible to give a definite recommendation about the minimum sample size for our robust recovery procedure. However, a first approximation can be gained from the following example: If we have fitted a linear regression model with three parameters, and we wish to declare a standardized regression residual of 3 as significant at the 5 % level (an indication that the model is unduly influenced by residuals of this magnitude), we need a sample size of approximately 55 [69]. Thus an application of our recovery procedure for regression models supported by fewer than 55 trees should proceed with caution and attention to robustness.
In large sample inventories the model errors in point estimates of biomass will often dominate sampling errors [12,40]. Fortunately, when estimating a temporal change in biomass and carbon stock between two inventories, model errors in a difference all but cancel [Ibid]. Thus applying recovered conservative (robust) estimates of a missing covariance matrix will have little impact on the estimate of model errors in a difference.
We have demonstrated that reasonable (robust) estimates of model-errors in estimates of tree-level biomass can be derived from a minimum of two available fit statistics for a biomass equation: the coefficient of determination and sample size. To complete an estimation of model-errors an analyst needs access to forest inventory sample data of the explanatory variables from the population targeted for biomass estimation.

Conclusions
It is good practice to provide estimates of uncertainty to any model-dependent estimate of above ground biomass. When a direct approach to estimate uncertainty is impossible due to missing model statistics, the proposed robust procedure is a first step to good practice. Our recommended approach offers protection against inflated estimates of precision.

The biomass model
The model we consider for above-ground live tree biomass is parametric and can be expressed as

$$y_i = f(x_i; b) + e_i \tag{1}$$

where $y_i$ is the above-ground forest tree biomass (AGB in kg) of the $i$th tree, $f$ is a known function (linear or nonlinear), $x_i$ is a $p \times 1$ row vector of regressor variables including an intercept (if any), $b$ is a $q \times 1$ vector of model parameters, and $e_i$ is a residual error. For a linear model $p = q$.
A model $f$ fitted to $n$ observations of $x_i$ and $y_i$ ($i = 1, \ldots, n$) allows a prediction $\hat{y}_j$ of the expected biomass in the, say, $j$th tree from knowledge of $x_j$ and the estimated parameters $\hat{b}$. In the application context of a forest inventory (survey) the model in (1) is used to predict AGB for out-of-sample trees. An estimator of the approximate out-of-sample model error variance in an estimate of AGB for a tree $j$ with a known (measured) vector $x_j$ of explanatory variables is given in (2) [70, ch. 6.3], where $\hat{\sigma}^2_e$ is an estimate of the variance of lack-of-fit residuals ($e_i$) of the trees used to fit the model in (1), $\partial f(x_j|b)/\partial b$ is the vector of derivatives (gradients) with respect to the model parameters, and $\widehat{\mathrm{cov}}(\hat{b})$ is an estimate of the covariance among model parameters. All gradients are evaluated at the least squares estimate of $b$. A superscript 't' denotes the transpose of a vector or a matrix. When the model is linear in $b$ the derivatives in (2) reduce to the vector $x_j$.
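As a minimal sketch of how such an error variance can be evaluated, the Python fragment below assumes the usual first-order (delta-method) form: the residual variance plus the quadratic form of the gradient in the parameter covariance matrix. This form is consistent with the definitions above but is an assumption here. The power model `agb_power`, the parameter values, the covariance matrix, and the residual standard error are all hypothetical and serve illustration only.

```python
import math

def agb_power(dbh, b):
    """Hypothetical biomass model f(x; b) = b0 * DBH**b1 (kg)."""
    return b[0] * dbh ** b[1]

def gradient(dbh, b):
    """Derivatives of f with respect to b0 and b1, evaluated at b."""
    return [dbh ** b[1], b[0] * dbh ** b[1] * math.log(dbh)]

def prediction_error_variance(dbh, b, cov_b, sigma2_e):
    """Assumed form: residual variance plus the quadratic form g' cov(b) g."""
    g = gradient(dbh, b)
    quad = sum(g[r] * cov_b[r][c] * g[c] for r in range(2) for c in range(2))
    return sigma2_e + quad

# Illustrative values only; the parameter correlation is about -0.95.
b_hat = [0.12, 2.4]
cov_b = [[4.0e-4, -1.9e-4], [-1.9e-4, 1.0e-4]]
sigma2_e = 25.0 ** 2

var_j = prediction_error_variance(30.0, b_hat, cov_b, sigma2_e)
print("model error (kg) for a tree with DBH 30 cm:", round(math.sqrt(var_j), 1))
```

For a model that is linear in the parameters, `gradient` would simply return the vector of explanatory variables.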
The $q \times q$ covariance matrix for $\hat{b}$ is given in (3) [63, p. 17].
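The sketch below illustrates the usual least-squares form of such a covariance matrix, the residual variance times the inverse of the cross-product matrix of the gradients. We assume, without confirmation from the text, that this is the form intended. The function name `cov_from_gradients`, the DBH values, and the residual variance are hypothetical, and the code is restricted to two parameters so that the inverse can be written in closed form.

```python
def cov_from_gradients(grads, sigma2_e):
    """sigma2_e * (F'F)^-1 for an n x 2 gradient matrix F (one row per tree)."""
    a = sum(g[0] * g[0] for g in grads)
    b = sum(g[0] * g[1] for g in grads)
    d = sum(g[1] * g[1] for g in grads)
    det = a * d - b * b
    if det <= 0.0:
        raise ValueError("F'F is singular; the gradients are collinear")
    # Closed-form inverse of the symmetric 2 x 2 matrix [[a, b], [b, d]]
    return [[sigma2_e * d / det, -sigma2_e * b / det],
            [-sigma2_e * b / det, sigma2_e * a / det]]

# Hypothetical linear model AGB = b0 + b1 * DBH^2, so each gradient is (1, DBH^2).
dbh = [12.0, 18.5, 24.0, 31.0, 37.5, 45.0]
grads = [[1.0, d ** 2] for d in dbh]
cov_b = cov_from_gradients(grads, sigma2_e=30.0 ** 2)
print(cov_b)
```

The negative off-diagonal element reflects the strong negative correlation between intercept and slope that arises whenever the explanatory variable has a positive mean.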

The estimation problem
It is clear from (2) that we cannot estimate the error in an out-of-sample estimate of the AGB in a single tree unless we have reasonable estimates of $\sigma^2_e$ and $\mathrm{cov}(b)$. Note that when we wish to estimate the error in an average of AGB over a large number ($m$) of trees, the contribution to the error from the residual variance can be ignored, as it declines at a rate of $m^{-1}$; the second term, however, is only averaged over $m$ [62, pp. 28-30].
When we are tasked with estimating the error variance in (2) but do not have the estimates $\hat{\sigma}^2_e$ or $\widehat{\mathrm{cov}}(\hat{b})$, we have to recover reasonable substitutes. Equations (2) and (3) implicitly suggest how to obtain substitutes $\tilde{\sigma}^2_e$ for $\hat{\sigma}^2_e$ and $\widetilde{\mathrm{cov}}(\hat{b})$ for $\widehat{\mathrm{cov}}(\hat{b})$ when we at least know the sample size $n$ used to estimate the parameters in the biomass model in (1), and the coefficient of determination $R^2$ or, preferably, the adjusted coefficient of determination [62, p. 91].
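The different behaviour of the two terms can be illustrated with a short sketch. It assumes that the error variance of a mean over $m$ trees is the residual variance divided by $m$ plus the quadratic form of the mean gradient in the parameter covariance matrix. The function name `mean_error_terms`, the covariance matrix, and the tree list are hypothetical.

```python
def mean_error_terms(grads, cov_b, sigma2_e):
    """Return (residual term, parameter term) for the mean AGB of m trees."""
    m, q = len(grads), len(cov_b)
    g_bar = [sum(g[k] for g in grads) / m for k in range(q)]
    param_term = sum(g_bar[r] * cov_b[r][c] * g_bar[c]
                     for r in range(q) for c in range(q))
    return sigma2_e / m, param_term

# Hypothetical linear model with gradient (1, DBH^2) and illustrative covariance.
cov_b = [[90.0, -0.08], [-0.08, 1.0e-4]]
base = [[1.0, d ** 2] for d in (15.0, 25.0, 35.0)]

res_few, par_few = mean_error_terms(base * 10, cov_b, sigma2_e=900.0)      # m = 30
res_many, par_many = mean_error_terms(base * 1000, cov_b, sigma2_e=900.0)  # m = 3000
print(res_few, par_few)
print(res_many, par_many)
```

In this sketch the residual term shrinks by a factor of 100 when the number of trees grows from 30 to 3000, while the parameter term stays the same because the mean gradient is unchanged.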

Recovery of missing fit statistics
A basic recovery of a substitute for $\widehat{\mathrm{cov}}(\hat{b})$ begins with $B$ random samples (without replacement) of size $n$ of $x$ taken from an inventory sample from the population for which tree-level predictions of AGB via (1) are desired. For each sample, substitutes $\tilde{\sigma}^2_e$ (see (4)) and $\widetilde{\mathrm{cov}}(\hat{b})$ are computed. The average over the $B$ replications of $\tilde{\sigma}^2_e$ and $\widetilde{\mathrm{cov}}(\hat{b})$ now serves to approximate the error variance in AGB of a single tree (see (2)). Implicit in this estimator of residual variance is the assumption of a homogeneous error-structure.
It is clear from (4) that the estimate $\tilde{\sigma}^2_e$ depends on the sampling distribution of $f(x_{b,j}|\hat{b})$, which may be quite different from the distribution in the original sample used in model fitting. Most biomass functions are fitted to an approximately uniform distribution of the explanatory variables, as it achieves large-sample optimality for model fitting [39, ch. 7.5]. However, for the typically small sample sizes in biomass studies, this no longer holds. Our repeated sampling from the target population is intended to give more robust and realistic estimates of the desired covariance matrix, albeit under the proviso that the reported coefficient of determination has not been maximized by a combination of model- and variable-selection procedures and a sampling design that, ceteris paribus, favors a linear model.
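The basic recovery can be sketched as follows. The link between $R^2$ and the residual variance used here (regression sum of squares of the predictions times $(1 - R^2)/R^2$, divided by $n - q$) follows from the definition of $R^2$ for a least-squares fit and is our assumption about the intent of (4), not a transcription of it. The linear model `f_lin`, the synthetic inventory of DBH values, and all numeric inputs are hypothetical.

```python
import random

def f_lin(dbh, b):
    """Hypothetical linear biomass model: AGB = b0 + b1 * DBH^2."""
    return b[0] + b[1] * dbh ** 2

def recover_once(x, b, r2):
    """One replicate: substitute residual variance and 2 x 2 covariance matrix."""
    n, q = len(x), len(b)
    pred = [f_lin(d, b) for d in x]
    p_bar = sum(pred) / n
    ss_reg = sum((p - p_bar) ** 2 for p in pred)
    # Assumed link: R2 = SSR / (SSR + SSE), hence SSE = SSR * (1 - R2) / R2
    sigma2 = ss_reg * (1.0 - r2) / (r2 * (n - q))
    # F'F for gradient rows (1, DBH^2), inverted in closed form
    a, c, d = float(n), sum(v ** 2 for v in x), sum(v ** 4 for v in x)
    det = a * d - c * c
    cov = [[sigma2 * d / det, -sigma2 * c / det],
           [-sigma2 * c / det, sigma2 * a / det]]
    return sigma2, cov

def basic_recovery(inventory, b, r2, n, reps, rng):
    """Average the substitutes over `reps` samples drawn without replacement."""
    s2_sum, cov_sum = 0.0, [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(reps):
        s2, cov = recover_once(rng.sample(inventory, n), b, r2)
        s2_sum += s2
        for r in range(2):
            for c in range(2):
                cov_sum[r][c] += cov[r][c]
    return s2_sum / reps, [[v / reps for v in row] for row in cov_sum]

rng = random.Random(1)
inventory = [rng.uniform(7.0, 60.0) for _ in range(2000)]  # synthetic DBH (cm)
sigma2_sub, cov_sub = basic_recovery(inventory, b=[5.0, 0.25], r2=0.95,
                                     n=50, reps=800, rng=rng)
print(sigma2_sub, cov_sub)
```

Because the samples come from the inventory, the recovered statistics reflect the spread of the explanatory variables in the target population rather than in the original fitting sample.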

Recovery via refitting
A recovered estimate of the residual variance (see (4)) can be used in a parametric bootstrap [71] to recover a substitute for a missing covariance matrix $\widehat{\mathrm{cov}}(\hat{b})$. The refitting begins with $n$ random draws of residuals ($e_j^*$, $j = 1, \ldots, n$) from a t-distribution with $n - q$ degrees of freedom. Pseudo data $y_j^* = f(x_j|\hat{b}) + e_j^*$ are then used to re-estimate the parameters $b^*$ and the associated covariance matrix $\mathrm{cov}(b^*)$. This process is repeated $B$ times; the mean of the $B$ covariance matrices is the substitute to use in computing an error of AGB via (2).
Adding a random residual to a biomass prediction $\hat{y}_j$ can make $y_j^*$ negative, in violation of AGB ≥ 0. Should that occur, we recommend computing $y_j^*$ from $\hat{y}_j \times e_j^*$, where $e_j^*$ is a random draw from a gamma distribution with parameters $\nu$ and $\nu^{-1}$ (i.e. with mean 1.0 and variance $\nu^{-1}$). The parameter $\nu$ can be found by using Goodman's formula for the exact variance $V(\hat{y}_i e_i^*)$ [72]. However, in our examples this formula did not give us real-valued solutions of $\nu$. By solving the equation in (5) for $\nu$ we obtained a good first-order approximation.
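A sketch of the refitting loop is given below for a hypothetical linear model AGB = b0 + b1 × DBH², fitted by ordinary least squares. Two choices are our assumptions: the t-distributed draws are rescaled so that their variance equals the recovered residual variance, and the explanatory variables are resampled from a synthetic inventory in every replicate. Function names and numeric inputs are illustrative.

```python
import math
import random

def t_draw(df, rng):
    """One draw from a Student t-distribution with df degrees of freedom."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)   # chi-square with df degrees of freedom
    return z / math.sqrt(chi2 / df)

def ols_cov(x, y):
    """Covariance matrix of (b0, b1) from a simple linear regression of y on x."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxx = sum((v - xb) ** 2 for v in x)
    b1 = sum((v - xb) * (w - yb) for v, w in zip(x, y)) / sxx
    b0 = yb - b1 * xb
    s2 = sum((w - b0 - b1 * v) ** 2 for v, w in zip(x, y)) / (n - 2)
    return [[s2 * (1.0 / n + xb * xb / sxx), -s2 * xb / sxx],
            [-s2 * xb / sxx, s2 / sxx]]

def refit_recovery(inventory, b, sigma2, n, reps, rng):
    """Mean covariance matrix over `reps` refits to pseudo data."""
    df = n - len(b)
    scale = math.sqrt(sigma2 * (df - 2) / df)   # assumed rescaling to variance sigma2
    acc = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(reps):
        x = [d ** 2 for d in rng.sample(inventory, n)]
        y = [b[0] + b[1] * v + scale * t_draw(df, rng) for v in x]
        cov = ols_cov(x, y)
        for r in range(2):
            for c in range(2):
                acc[r][c] += cov[r][c] / reps
    return acc

rng = random.Random(11)
inventory = [rng.uniform(7.0, 60.0) for _ in range(2000)]  # synthetic DBH (cm)
cov_refit = refit_recovery(inventory, b=[5.0, 0.25], sigma2=40.0 ** 2,
                           n=50, reps=200, rng=rng)
print(cov_refit)
```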
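The multiplicative alternative can be sketched with the standard library gamma generator, whose arguments are a shape and a scale. With shape $\nu$ and scale $\nu^{-1}$ the draws have mean 1 and variance $\nu^{-1}$. The value of `nu` and the prediction below are hypothetical, and the function name is ours.

```python
import random

def multiplicative_pseudo_obs(pred, nu, rng):
    """Pseudo observation pred * e*, with e* ~ gamma(shape=nu, scale=1/nu)."""
    return pred * rng.gammavariate(nu, 1.0 / nu)

rng = random.Random(3)
nu = 25.0                      # hypothetical; implies a residual CV of 20 %
factors = [rng.gammavariate(nu, 1.0 / nu) for _ in range(20000)]
mean_f = sum(factors) / len(factors)
var_f = sum((f - mean_f) ** 2 for f in factors) / (len(factors) - 1)

y_star = [multiplicative_pseudo_obs(350.0, nu, rng) for _ in range(5)]
print(mean_f, var_f, y_star)
```

Since gamma draws are strictly positive, the pseudo observations cannot violate AGB ≥ 0.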

Recovery of off-diagonal elements in cov(b)
In some cases estimates of errors in $b$ are available, but without estimates of covariance. In this scenario a substitute covariance matrix can be recovered from the available standard errors and the correlation coefficients in (6).
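One plausible reading of this step, stated here as an assumption, is that the published standard errors fix the diagonal while the correlations are borrowed from a recovered matrix. The sketch below implements that reading; the function names, the recovered matrix, and the standard errors are hypothetical.

```python
import math

def correlation_from_cov(cov):
    """Correlation matrix implied by a covariance matrix."""
    q = len(cov)
    sd = [math.sqrt(cov[k][k]) for k in range(q)]
    return [[cov[r][c] / (sd[r] * sd[c]) for c in range(q)] for r in range(q)]

def covariance_from_se(se, corr):
    """Covariance matrix with entries corr[r][c] * se[r] * se[c]."""
    q = len(se)
    return [[corr[r][c] * se[r] * se[c] for c in range(q)] for r in range(q)]

recovered = [[4.0e-4, -1.9e-4], [-1.9e-4, 1.0e-4]]   # hypothetical recovered matrix
published_se = [0.015, 0.012]                         # hypothetical published errors

corr = correlation_from_cov(recovered)
cov_mixed = covariance_from_se(published_se, corr)
print(corr[0][1], cov_mixed)
```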

Robust recovery
A recovered substitute $\widetilde{\mathrm{cov}}(\hat{b})$ may differ substantially from the target covariance matrix $\widehat{\mathrm{cov}}(\hat{b})$ when the distribution of $x_j$ in the samples taken from the population targeted for a prediction of AGB differs from the distribution of $x_i$ in the original, but unknown, sample used for model-fitting [39, ch. 5.4]. To mitigate this prospect, we propose a robust recovery of $\widehat{\mathrm{cov}}(\hat{b})$. It is borrowed from Gallant [63] and given in (7), where $\tilde{e}_i$ is a random draw from a t-distribution with $\lfloor 0.5n \rfloor$ degrees of freedom and variance $\tilde{\sigma}^2_e$. The choice of degrees of freedom for the t-distribution is arbitrary; it reflects the fact that most sample sizes supporting a tree biomass model are in the range of 6-30 [21]. A halving of these sample sizes results in increases of 3-21 % in the 95th percentiles of a Student's t-distribution. Robust alternatives to the correlation coefficients in (6) can be computed with a weighting of gradients proportional to the inverse of $|\tilde{e}_i|$.
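The stated effect of halving the degrees of freedom can be checked numerically. The sketch below computes one-sided 95th percentiles of the t-distribution by Simpson integration and bisection, using only the standard library. It assumes that the degrees of freedom are set equal to the sample size (30 or 6) and to half of it; the helper names are ours.

```python
import math

def t_pdf(x, df):
    """Density of the Student t-distribution."""
    c = math.gamma((df + 1) / 2.0) / (math.sqrt(df * math.pi) * math.gamma(df / 2.0))
    return c * (1.0 + x * x / df) ** (-(df + 1) / 2.0)

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, by Simpson's rule on [0, x] (steps must be even)."""
    h = x / steps
    s = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        s += (4.0 if i % 2 else 2.0) * t_pdf(i * h, df)
    return 0.5 + s * h / 3.0

def t_quantile(p, df):
    """Quantile for p > 0.5, found by bisection on [0, 50]."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

inc_30 = t_quantile(0.95, 15) / t_quantile(0.95, 30) - 1.0   # n = 30 halved
inc_6 = t_quantile(0.95, 3) / t_quantile(0.95, 6) - 1.0      # n = 6 halved
print(f"{100 * inc_30:.1f} % and {100 * inc_6:.1f} %")
```

Under these assumptions the two increases come out at roughly 3 % and 21 %, in line with the range quoted above.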

A weighted recovery
In regressions with a positively valued dependent variable ($y$), it is not uncommon to observe an increase in the variance of regression residuals with an increase in $y$ [73, ch. 5.1]. A weighted least squares (WLS) approach to model-fitting would be appropriate. If $f(x_i; b)$ was fitted using WLS, the recovery of a substitute for $\widehat{\mathrm{cov}}(\hat{b})$ should also employ a weighting scheme. Equation (8) provides an example.
In (8), $W$ is an $n \times n$ diagonal matrix of sum-to-one weights $w_1, \ldots, w_n$. In tree biomass models, the weights would typically be proportional to the inverse of, say, $\mathrm{DBH}_j^2$, which gives the weights $w_j = \mathrm{TDBH}^2 \times \mathrm{DBH}_j^{-2}$, where $\mathrm{TDBH}^2$ is the sum of $\mathrm{DBH}_j^2$ over the $n$ trees. A robust alternative to (8) is obtained by a straightforward extension of (7). A weighting scheme is also needed when trees for model-fitting were selected by an unequal probability selection scheme. Weights should then be proportional to the inverse of the sample inclusion probability [73, p. 41].
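A small sketch of the weight computation follows. It assumes that weights proportional to the inverse of squared DBH are rescaled so that they sum to one; under that rescaling any constant multiplier, such as the sum of squared diameters, cancels. The function name and DBH values are hypothetical.

```python
def sum_to_one_weights(dbh):
    """Weights proportional to 1 / DBH^2, rescaled to sum to one."""
    raw = [1.0 / d ** 2 for d in dbh]
    total = sum(raw)
    return [r / total for r in raw]

dbh = [9.0, 14.0, 22.0, 35.0, 48.0]      # hypothetical diameters (cm)
w = sum_to_one_weights(dbh)

# A constant multiplier (here the sum of DBH^2) does not change the result.
tdbh2 = sum(d ** 2 for d in dbh)
raw_scaled = [tdbh2 / d ** 2 for d in dbh]
w_scaled = [r / sum(raw_scaled) for r in raw_scaled]
print(w)
```

The same function could be applied to inverse inclusion probabilities when trees were selected with unequal probabilities.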

The number B of resampling replications
The value of $B$ was determined adaptively by monitoring the Monte Carlo error as a function of $B$ [74]. In our examples we fixed $B$ to 800. With this value of $B$, the Monte Carlo error in the determinant of $\widetilde{\mathrm{cov}}(\hat{b})$ was less than 4 %.
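One simple way to monitor a Monte Carlo error is sketched below. It assumes that the error is measured as the standard error of the mean of the replicate values relative to that mean, which may differ from the measure used in [74]. The replicate determinants are synthetic stand-ins, and the function names are ours.

```python
import math
import random
import statistics

def relative_mc_error(values):
    """Standard error of the mean of the replicates, relative to the mean."""
    b = len(values)
    return statistics.stdev(values) / (math.sqrt(b) * abs(statistics.fmean(values)))

def smallest_b(values, tol, step=50):
    """First B (in multiples of `step`) with a relative error below `tol`."""
    for b in range(step, len(values) + 1, step):
        if relative_mc_error(values[:b]) < tol:
            return b
    return None

rng = random.Random(7)
dets = [rng.lognormvariate(0.0, 0.5) for _ in range(800)]   # synthetic determinants
b_needed = smallest_b(dets, tol=0.04)
print(b_needed, relative_mc_error(dets))
```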

Comparing recovered and actual covariance matrices
A recovered substitute for a covariance matrix may vary considerably from an unknown target estimate when the joint distribution of the explanatory variables in the sample used for fitting differs from the joint distribution in the target population for model application. In our demonstrations we knew, in most cases, the actual estimates of the missing covariance matrix. It is therefore of interest to test the hypothesis of equality between a recovered substitute and the actual estimate. We use Box's M-test to obtain a chi-square test statistic and the probability of this test statistic under the null hypothesis of no difference [75, p. 281]. The same test was applied in examples where only the covariances in $b$ are unknown.
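For two 2 × 2 matrices, Box's M-test with the usual chi-square approximation can be sketched as below. Treating the recovered matrix as if it were based on the same sample size as the actual one is our assumption, and the matrices are hypothetical. With two parameters and two matrices the test has 3 degrees of freedom, for which the chi-square tail probability has a closed form.

```python
import math

def det2(s):
    """Determinant of a 2 x 2 matrix."""
    return s[0][0] * s[1][1] - s[0][1] * s[1][0]

def box_m_2x2(s1, n1, s2, n2):
    """Box's M-test for equality of two 2 x 2 covariance matrices."""
    p, k = 2, 2
    v1, v2 = n1 - 1, n2 - 1
    pooled = [[(v1 * s1[r][c] + v2 * s2[r][c]) / (v1 + v2) for c in range(2)]
              for r in range(2)]
    m = ((v1 + v2) * math.log(det2(pooled))
         - v1 * math.log(det2(s1)) - v2 * math.log(det2(s2)))
    corr = ((1.0 / v1 + 1.0 / v2 - 1.0 / (v1 + v2))
            * (2 * p * p + 3 * p - 1) / (6.0 * (p + 1) * (k - 1)))
    stat = max(0.0, m * (1.0 - corr))          # guard against rounding below zero
    df = p * (p + 1) * (k - 1) // 2            # equals 3 here
    # Upper tail probability of a chi-square variable with 3 degrees of freedom
    p_value = (math.erfc(math.sqrt(stat / 2.0))
               + math.sqrt(2.0 * stat / math.pi) * math.exp(-stat / 2.0))
    return stat, df, p_value

actual = [[4.0e-4, -1.9e-4], [-1.9e-4, 1.0e-4]]          # hypothetical
inflated = [[3.0 * v for v in row] for row in actual]    # hypothetical substitute

stat_same, df_same, p_same = box_m_2x2(actual, 50, actual, 50)
stat_diff, df_diff, p_diff = box_m_2x2(actual, 50, inflated, 50)
print(stat_same, p_same)
print(stat_diff, p_diff)
```

Identical matrices give a statistic of zero, whereas a threefold inflation of all elements is rejected at the 5 % level with these sample sizes.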

Examples from Germany
We demonstrate the above recovery procedures with 15 biomass equations (Table 1) and data (HT, DBH) from 335 plots in the first German national forest inventory (BWI-1987). Note that the data represent trees selected with probability proportional to their basal area. Their mean HT and DBH are therefore larger than the mean of trees selected with equal probability. However, for the purpose of a demonstration, this fact is deemed unimportant.
There are four equations (linear, nonlinear, weighted, un-weighted) for each of three species (BEECH, PINE, SPRUCE). Each equation (no. 1-12) was derived from a sample of n = 50 randomly selected trees from five BWI plots. The five plots were excluded from any recovery procedure. In the model fitting, BWI predictions of tree AGB (kg per tree) multiplied by a random uniformly distributed error on the interval [0.9, 1.1] were used as the dependent variable, and diameter at a reference height of 1.3 m (DBH) and tree height (HT) were used as predictors. A summary of the BWI data is in Table 7. The remaining three biomass equations (no. 13-15) are generalized species-specific AGB equations from Muukkonen and Heiskanen [76]. They are assumed applicable throughout the temperate zone. An analyst may prefer a generalized biomass equation over a local/regional model derived from a relatively small sample size and potentially from a sub-population with a different relationship between AGB and the explanatory variables than in a population targeted for estimation of AGB.

Examples from Mexico
Four linear (on a log-log scale) biomass equations [53] with published estimates of $R^2_{adj}$, $\sigma_e$, and standard errors of the regression coefficients are used to demonstrate the recovery procedures. The equations (no. 1-4) are in Table 8. Three non-linear biomass equations for Quercus spp. [52] with unknown standard errors of the regression coefficients were also included (no. 5-7). The recovery procedures are demonstrated with data from the 2004-2009 Mexican national forest inventory [47,48]. Specifically, 132 sample plots and 1,843 trees with known DBH and HT were included (Table 9).
Table 7 Means of DBH, HT, and AGB of trees from the 1987 German National Inventory used in this study. Standard deviations are in parentheses. Note, the mean applies to the population from which 50 trees were selected at random for model-fitting and B = 800 sets of 50 trees were selected for the recovery process (a tree used for model fitting was disallowed in the recovery process). See Table 1
Table 9 Summary of tree size (mean DBH cm, mean HT m), stem density of species groups (N ha$^{-1}$), and model-dependent predictions of above-ground forest tree biomass (AGB Mg ha$^{-1}$) in the Mexican NFI (2004-2009) plots