---
title: "New-Models"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{New-Models}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
Usually, one creates a GOF-model-test-class via the function
*GOF_model()*. However, this function is actually a wrapper for the
class *GOF_model_test*. That class needs other classes to
work properly; in particular, it uses the three interfaces
*GOF_model_info_extractor*, *GOF_model_simulator*, and
*GOF_model_trainer*. In general, objects of class *lm* or
*glm* can be used with *GOF_model* directly. However, there are
situations where one has to overwrite the default behavior,
for instance if you want to apply the methodology to new models,
or if a model fit returned by an R package does not work properly
with the GOF-package. Here we show how to implement the three
interfaces for some concrete cases.

# Least squares estimates
Assume we have a model of the form
$$
Y = m(\beta^\top X) + \epsilon
$$
without any knowledge about $\epsilon$. Then we could
estimate $\beta$ using a least squares estimator
and use the GOF-package to check our fitted model.
First, we generate a data set:
```{r}
library(bootGOF)
set.seed(1)
X = runif(n = 200, min = 6, max = 14)
d = data.frame(x = X, y = sin(0.5 * X) + rnorm(200, sd = 0.2))
plot(y~x, data = d)
```
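In this example the true data-generating mechanism is $y = \sin(0.5\, x) + \epsilon$ with $\epsilon \sim N(0, 0.2^2)$, i.e. $m = \sin$ and $\beta = 0.5$ in the notation above.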
Let us take a short excursion at this point, because the plot gives
the impression that the following simple model might apply:
```{r}
wrong_model = lm(y ~ I(x^2), data = d)
```
However, a goodness-of-fit test rejects this model:
```{r}
mt <- GOF_model(
  data = d,
  model = wrong_model,
  simulator_type = "parametric",
  nmb_boot_samples = 100,
  y_name = "y",
  Rn1_statistic = Rn1_KS$new())
mt$get_pvalue()
```
Note that in this simple case the standard diagnostic plots
also reveal that this model is not sufficient. Now we
fit the model using a least squares estimator:
```{r}
library(minpack.lm)
fit <- minpack.lm::nlsLM(y ~ sin(a * x),
  data = d,
  start = c(a = 0.5),
  control = nls.control(maxiter = 500))
fit
```
In order to create a goodness-of-fit test using *GOF_model_test*
we have to implement three interfaces. The first interface requires
that we implement three functions, *yhat*, *y_minus_yhat* and
*beta_x_covariates*, which return the predictions for the dependent
variable (also called target variable), the residuals on the scale
of the dependent variable, and the inner product of the estimated
parameters and the independent variables (also called covariates or
features), respectively. However, the object returned by *minpack.lm::nlsLM*
does not contain the original data set, but that data set is necessary
to calculate the inner product. Hence, we make a list that
contains the model fit and the data that was used to fit the model:
```{r}
fit_and_data <- list(fit = fit, data = d)
```
Now we can implement the interface
```{r}
library(R6)
my_nls_info_extractor <- R6::R6Class(
classname = "my_nls_info_extractor",
inherit = GOF_model_info_extractor,
public = list(
yhat = function(model) {
predict(object = model$fit)
},
y_minus_yhat = function(model) {
residuals(object = model$fit)
},
beta_x_covariates = function(model) {
a_hat <- coef(object = model$fit)
x <- model$data$x
ret <- a_hat * x
return(ret)
}
))
my_info_extractor <- my_nls_info_extractor$new()
```
Implementing *yhat* and *y_minus_yhat* is straightforward
using the functions already offered by R,
but *beta_x_covariates* needs special attention. The
reason is that *minpack.lm::nlsLM* can also fit very
general models of the type $m(\beta, X)$. Hence,
there is no built-in function to extract objects
like $\beta^\top X$. Let us look at the first few data points:
```{r}
head(d)
```
Now we are able to predict $y$:
```{r}
head(my_info_extractor$yhat(model = fit_and_data))
```
And calculate the difference between $y$ and our prediction:
```{r}
head(my_info_extractor$y_minus_yhat(model = fit_and_data))
```
And based on the estimated coefficient we can calculate the
inner product with $x$:
```{r}
head(my_info_extractor$beta_x_covariates(model = fit_and_data))
```
Since we did not make an assumption about the distribution of
$\epsilon$, we cannot use a parametric resampling scheme. However,
we can do a wild bootstrap that uses only the predictions
and the residuals. The class *GOF_sim_wild_rademacher* implements
this wild bootstrap but needs an info extractor to obtain the
predictions and residuals:
```{r}
my_simulator <- GOF_sim_wild_rademacher$new(
  gof_model_info_extractor = my_info_extractor
)
```
This class generates as many observations (according
to the fitted model) as are contained in the data set
used to fit the model.
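A quick way to confirm this (a small check for illustration only, not part of the package workflow):
```{r}
# the number of resampled observations equals the number of rows in the data
length(my_simulator$resample_y(model = fit_and_data)) == nrow(d)
```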
Again, let us first look at the original data points:
```{r}
head(d)
```
Now let us look at new $y$s generated by the wild bootstrap
around the fitted model:
```{r}
head(my_simulator$resample_y(model = fit_and_data))
```
Note that the resampled $y$'s sometimes equal the
observed $y$'s. The reason is that this wild bootstrap
performs 'prediction +/- residual', and only the sign
is drawn at random.
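We can make this explicit with a small check (for illustration only, not part of the package workflow): up to the random sign, the resampled values differ from the predictions exactly by the residuals.
```{r}
y_new <- my_simulator$resample_y(model = fit_and_data)
y_hat <- my_info_extractor$yhat(model = fit_and_data)
res <- my_info_extractor$y_minus_yhat(model = fit_and_data)
# |resampled - prediction| coincides with the absolute residuals
all(abs(abs(y_new - y_hat) - abs(res)) < 1e-12)
```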
Finally, we need to implement the interface
*GOF_model_trainer*, which requires a function
*refit* that is able to update the model
object by refitting it to a new data set. R already
provides the necessary function, i.e. *stats::update*.
However, since we combined the fitted model with the
data set in a list, we need to take that into account:
```{r}
my_nls_trainer <- R6::R6Class(
classname = "GOF_nls_trainer",
inherit = GOF_model_trainer,
public = list(
refit = function(model, data) {
fit <- update(object = model$fit, data = data)
ret <- list(fit = fit, data = data)
return(ret)
}))
my_trainer <- my_nls_trainer$new()
```
This implementation basically equals the implementation
of *GOF_lm_trainer* and *GOF_glm_trainer*. The only difference is that
we again store the data with the fit because *nlsLM()*
does not do that for us.
Of course, fitting the model again to the original data set
results in the same fitted model; a quick check of this follows
below. With the defined classes we can now easily generate a new
data set and refit the model to that new data set:
```{r}
new_data <- d
new_data$y <- my_simulator$resample_y(model = fit_and_data)
my_trainer$refit(model = fit_and_data, data = new_data)$fit
```
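For comparison, refitting to the original data set indeed reproduces the fit obtained above:
```{r}
my_trainer$refit(model = fit_and_data, data = d)$fit
```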
Now all the ingredients are available for applying the
goodness-of-fit test:
```{r}
set.seed(1)
mt <- GOF_model_test$new(
  model = fit_and_data,
  data = d,
  nmb_boot_samples = 100,
  y_name = "y",
  Rn1_statistic = Rn1_CvM$new(),
  gof_model_info_extractor = my_info_extractor,
  gof_model_resample = GOF_model_resample$new(
    gof_model_simulator = my_simulator,
    gof_model_trainer = my_trainer
  )
)
mt$get_pvalue()
```

# Negative Binomial using the MASS-package
A negative binomial model is a generalized linear model.
Furthermore, within R, *MASS::glm.nb* returns an object of class
*glm*, and this package can actually process *glm* classes.
However, *MASS::glm.nb* seems to have a bug that prevents
such an object from being properly updated/refitted via the
*stats::update()* function. We briefly illustrate this using an
artificial data set:
```{r}
library(MASS)
set.seed(1)
X1 <- rnorm(100)
X2 <- rnorm(100)
d <- data.frame(
  y = MASS::rnegbin(n = 100, mu = exp(0.2 + X1 * 0.2 + X2 * 0.6), theta = 2),
  x1 = X1,
  x2 = X2)
fit <- MASS::glm.nb(y~x1+x2, data = d)
fit
```
Note that the fit-object shows that the call contains the
parameter *init.theta*, which obviously was not provided
by us. The problem is that this *init.theta* parameter
is reused by *stats::update* when refitting to resampled data.
So fitting
the resampled data and fitting the original data are slightly
different from the perspective of the fitting algorithm.
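The stored call can be inspected directly with base R's *getCall()* (shown here only for illustration; it is not part of the GOF workflow):
```{r}
# the stored call carries the estimated init.theta, which stats::update()
# would pass on when refitting the model to resampled data
getCall(fit)
```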
To circumvent this problem we can reimplement the
corresponding interface as follows:
```{r}
my_negbin_trainer <- R6::R6Class(
classname = "GOF_glmnb_trainer",
inherit = GOF_model_trainer,
public = list(
refit = function(model, data) {
MASS::glm.nb(formula = formula(model), data = data)
}))
```
This way we ensure that the original data set and the
resampled data sets are fitted in the same way.
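As a quick check (for illustration only), refitting the original data with this trainer reproduces essentially the same coefficient estimates:
```{r}
refit_orig <- my_negbin_trainer$new()$refit(model = fit, data = d)
# compare the coefficient estimates of the original fit and the refit
rbind(original = coef(fit), refitted = coef(refit_orig))
```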
Now we can create the GOF-test class using this new refitting class:
```{r}
set.seed(1)
mt <- GOF_model_test$new(
  model = fit,
  data = d,
  nmb_boot_samples = 100,
  y_name = "y",
  Rn1_statistic = Rn1_CvM$new(),
  gof_model_info_extractor = GOF_glm_info_extractor$new(),
  gof_model_resample = GOF_model_resample$new(
    gof_model_simulator = GOF_glm_sim_param$new(),
    gof_model_trainer = my_negbin_trainer$new()
  )
)
mt$get_pvalue()
```
Let us compare the result with the default GOF-test for *glm*'s:
```{r}
set.seed(1)
mt2 <- GOF_model(
  model = fit,
  data = d,
  nmb_boot_samples = 100,
  simulator_type = "parametric",
  y_name = "y",
  Rn1_statistic = Rn1_CvM$new()
)
mt2$get_pvalue()
```
In this case the p-values do not differ. However, they
might differ in other settings.