Study area
This research includes six US states: California, Colorado, Georgia, New York, Texas and Wisconsin (Fig. 1). These six states were chosen because they are spread throughout the country, they have very contrasting ecological and socioeconomic differences, and represent each of the US FIA work units [25]. This diversity in locations allowed us to represent different forest types, disturbance history, population dynamics and drivers that could influence our study variables. In addition, there were at least two plot measurements for all of those states, which allows for a temporal analysis of change. The study area was limited to areas with forest land cover at the start of the study. For this manuscript, we consider the FIA definition of forest land, defined as land at least 0.4 ha in size and 36.6 m wide that contains at least 10% canopy cover by live tally trees of any size or has had at least 10 percent canopy cover of live tally species in the past [25].
Forests in the US cover more than one-third of its landscape and comprise more than 28 billion cubic meters of wood volume. Even though overall forest area has remained stable, there have been changes at regional and local scales that are not reflected by aggregated numbers. In addition, the road network in the country has expanded, making forest lands more accessible to people [26]. This easier access to forests would likely encourage land use changes. Major natural forest disturbances nationally include wildfire, insects, and disease [26]. These and other disturbances, as well as changing forest conditions, cause changes in tree species composition and distribution. Climate in the study region varies widely.
Data collection
Ground data used in this study were collected and organized into a database by the USDA Forest Inventory and Analysis (FIA) program from 2000 to 2017. This database includes different phases and tables depending on the scale and level of detail of the information contained in each. For this research, we worked with ground-sampled FIA plots that were remeasured approximately every five years for eastern states and 10 years for western states, which allowed for a spatiotemporal analysis. These plots covered a 0.4 hectare sample area and were divided into four fixed-radius (7.32 m) subplots. Within each subplot, a microplot (2.07 m radius) was nested. Here, trees with a diameter at breast height (d.b.h.) less than 12.7 cm but greater than 2.54 cm were measured. Forest and non-forest conditions were assigned by the field crew in each remeasurement, which allowed us to track any land use changes that occurred. To obtain human population and housing information, we used the United States’ 2010 National Census [19]. For obtaining spectral information, we used a mosaic of Landsat 8 surface reflectance imagery (30 m spatial resolution) from the period 2018-05-01 to 2018-09-30 (which corresponded to the summer months) obtained from the Google Earth Engine platform (LANDSAT/LC08/C01/T1_SR). The initial image collection consisted of 4093 Landsat images that covered the study region during the summer months.
Data analysis
Information at the subplot and microplot levels was gathered for the different time periods available (initial time ti and final time tf) and scaled-up using the FIA expansion factors to a plot level (minimum modeling unit). The initial data set was divided into training and test data sets to fit and validate the models. The total number of plots with complete observations was 11,262, which where subdivided into the training and test data through probability samples with replacement in R (using a probability value of 0.7 for the training data and 0.3 for the test data) [27].
Our goal was to build one model per variable of interest (response variable). Therefore, model comparisons were performed to select the model which yielded the optimal fit statistics. Because our end goal was to build a replicable model applicable across the continental US, we included the variables state and forest type in both the LUC and the C change models. This allowed us to account for and capture the differences between the six states that were not captured with the rest of the explanatory variables.
The response variable for the logistic model was the probability of forested areas being converted to mixed or non-forested areas. For the purposes of this study, we will refer to forested plots which transitioned to mixed (partial forest/non-forest plot, e.g., one forest plus one non-forest condition) or non-forest conditions as plots that changed in land use. The response variable for the C model was C stock change in the total aboveground live biomass (defined as the sum of dry biomass located in the merchantable bole, top and limbs, stump, woodland tree species and saplings).
Model building
To explore the performance of different modeling approaches, we compared two main methods: a statistical model (parametric approach) and a machine learning algorithm (non-parametric approach). We compared the performance of these two main approaches to select the best performing model for each variable of interest. For both models, parametric methods were superior in predicting our response variable (See Additional file 1 for the Random Forests results and background). For the statistical models, we used mixed effects modeling because we worked with ecological data that had nested data, as well as temporal and spatial correlation structures. To select which type of mixed effects models to use, we looked at the characteristics of our variables of interest, especially the response variable. For the land use change model, the response variable had a binary behavior (0: no change and 1: change). Therefore, we chose a logistic model, which would give us the probability (or odds) of the response variable. In addition, for the C change model, the response variable was quantitative and had a linear relationship with the predictor variables. Therefore, we selected a linear regression model that accounted for correlated data, a linear mixed effects model with a random intercept.
Generalized linear mixed effects models
The generalized linear mixed effects models encompass several different types of regression models (i.e. general linear models, logistic regression, Poisson regression) while accounting for spatial and temporal correlations and nested data. This family of models has two main components: fixed effects (which includes the variables we are interested in) and random effects (which acknowledge the effect of the variables included here with a variation that is normally distributed with a certain variance) [28, 29].
For both the LUC and C change models, the training data were used to create a generalized mixed-effects model for each of the response variables. Explanatory variables (covariates) for both models were divided into the following categories: (1) forest attributes (ecological variables such as trees per hectare, basal area, and ecoregion), (2) plot attributes, (3) topography, (4) census data, (5) forest disturbances, and (6) number of conditions present in a plot (e.g., number of forest and non-forest types differentiated by reserved status, owner group, forest type, and stand density). The number of conditions present in a plot gave us a good representation of heterogeneity in a plot and allowed us to work with single- and multiple-condition plots. We regrouped the ecological classification codes into broader ecoregions due to the large number of categories, which would have complicated the model. See Additional files 2 and 3 for an extensive list of the variables used and metadata.
Variable selection techniques were applied in two stages. The first stage included dividing the variables into the categories defined above. Within each category, Pearson correlation coefficients were calculated to quantify relationships between quantitative variables. Variables with a coefficient of 0.5 or higher were marked to later verify that no correlated pair were kept in the final model. Categorical variables were visually assessed through scatterplot matrices. No variables were discarded in this step. In addition, for the logistic model we used the information value (IV) indicator to see how strong of a predictor each variable was and in which order to include the variables in the model (Additional file 2). The IV is used to select important variables in a logistic predictive model by ranking them based on their importance. We calculated the IV with the create_infotables and IV functions in R [30].
Variables were added to the model one at a time in descending order of their IV level of importance. For the regression model, variables were incorporated into the model according to the variable importance ranking obtained from the random forest model (Additional file 1). For both models, the Akaike information criterion (AIC) was used for model comparison. If an added variable increased the AIC value, it was removed. After this first variable selection stage, variables kept in each category were merged back to proceed with the variable selection.
We compared outputs of three variable selection approaches in R software using the car [31] and MASS [32] packages: forward selection, backward selection and stepwise selection. Finally, we used the variance inflation factor (VIF) to verify there was no multicollinearity (VIF > 10) present in the model. A logistic regression model with the remaining variables was created for the land use change model using the glmmPQL function in R with the cloglog link [32]. In contrast, a linear mixed-effects model was implemented for the C model using the glmer function [33]. For both models, different combinations of the retained variables plus state and forest types or ecoregion (independently and nested within each other) in the fixed or random section of the model were validated with the test dataset. This was done to account for correlated data points in our spatiotemporal analysis. In addition, interactions between potentially related covariates were examined. A complete workflow of the process can be observed in Fig. 2.
Model comparison
For both generalized mixed-effects models (logistic and linear models), a comparison between these techniques and the random forest approach was done to identify the best performing model. The criteria used to select the model was to look at the percentage of omission and commission errors, the overall accuracy, precision, and recall values. While we used the AIC values to compare models at the initial stage (to filter out preliminary models, Fig. 2), once we incorporated random effects in the models, AIC values were no longer estimated. Since we used the glmmPQL function in R to compute the models, no quasi-likelihood can be estimated for these mixed-effects models [34]. In the case of the C model, the root mean square error (RMSE) was used for model selection.
Creating predicted wall-to-wall likelihood of forest change
We created a spatially continuous map that covered the entire area of the six states of interest in a pixel by pixel format (i.e., a wall-to-wall map). This wall-to-wall map used the FIA data and the logistic model equation (See "Model building" section) to predict the likelihood of conversion of forested areas for every pixel. We used only the explanatory variables from the logistic model that had information available at a wall-to-wall scale to build the maps (See Fig. 3 for variables kept in the map and Additional File 2 for the complete list of variables). These variables were obtained through Google Earth Engine (GEE) and ArcMap 10.8 using original FIA coordinates, mosaic Landsat 8 images, and digital elevation data [35]. The variables used were: Normalized difference vegetation index -NDVI- (obtained from the Landsat 8 mosaic) as an estimate of basal area, ownership code [36], aspect [35], remeasurement period (Mean value per state from the FIA plot data), percentage of protected area in a county [19], and natural increase of the population (between 2000 and 2010) [19]. State and county boundaries were obtained from the United States Census Bureau [37]. All layers were converted to a raster format and transformed to the same projection (WGS84).
The NDVI variable was calculated in the GEE platform. Bands 5 and 4 from the imported Landsat 8 Surface Reflectance images for the summer months of 2018 for each state were used to generate a maximum NDVI for each pixel. This was done to obtain a mosaic of pixels displaying the maximum NDVI value for the period selected which would avoid clouds and ensure a value where hardwood trees had leaves. To extract NDVI and aspect values for each pixel, we used true FIA plot coordinates. If more than one plot measurement was available over time, means for plot coordinates values were calculated to account for variability and errors in each coordinate measurement. A buffer of 180 feet was used around the plot center to replicate the extent of a plot and calculate mean values for each pixel through zonal statistics.
With this wall-to-wall information, a multiple linear regression model was fit in R (with the predicted probability of change previously calculated as a response variable, see “Model building” section), obtaining an equation for the linear model. This equation (see Additional file 2) was used with the raster calculator tool in the ArcGIS environment. The maps generated here were clipped with the National Land Cover Database (NLCD) to display only forested areas. This new raster layer displayed the probability of conversion from forest to another land use (See Fig. 3 for a visual workflow).
For interpretation of the results shown in the maps, we overlapped ownership [36], protected areas [38], digital elevation models (DEM) [35], and major cities' shapefiles [37] to our maps.