--- title: "New-Models" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{New-Models} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` Usually, one creates a GOF-model-test-class via the function *GOF_model()*. But this function is actually a wrapper for the class *GOF_model_test*. That class needs other classes to work properly. In particular it uses the three interfaces *GOF_model_info_extractor*, *GOF_model_simulator*, *GOF_model_trainer*. In general, objects of class *lm* or *glm* can be used with *GOF_model*. However, there might be situations where one has to overwrite the default behavior. For instance, if you want to apply the methodology to new models or simply because a model fit returned by an R-package does not work properly with the GOF-package. Here we show how to implement the three interfaces for some concret cases. # Least square estimates Assume we have such a model $$ Y = m(\beta^\top X) + \epsilon $$ without any knowledge about $\epsilon$. Then we could try to estimate the $\beta$ using a least square estimate and using the GOF-package for check our fitted model. First we generate a data set ```{r} library(bootGOF) set.seed(1) X = runif(n = 200, min = 6, max = 14) d = data.frame(x = X, y = sin(0.5 * X) + rnorm(200, sd = 0.2)) plot(y~x, data = d) ``` Lets have a short excursion at this point because the plot gives the impression that the following simple model might apply: ```{r} wrong_model = lm(y ~ I(x^2), data = d) ``` However, a goodness-of-fit-test rejects this model ```{r} mt <- GOF_model( data = d, model = wrong_model, simulator_type = "parametric", nmb_boot_samples = 100, y_name = "y", Rn1_statistic = Rn1_KS$new()) mt$get_pvalue() ``` Note that in this simple case the standard diagnostic plots also reveal that this model is not a sufficient. Now we fit the model using a least square estimator: ```{r} library(minpack.lm) fit <- minpack.lm::nlsLM(y ~ sin(a * x), data = d, start = c(a = 0.5), control = nls.control(maxiter = 500)) fit ``` In order to create a goodnes-of-fit-test using *GOF_model_test* we have to implement three interfaces. The first interface requires that we implement three functions *yhat*, *y_minus_yhat* and *beta_x_covariates*, which are the predictions for the dependent variable (also called target-variable), the residuals on the scale of the dependent variable and the inner product of the estimated parameters and the independent variables (also called covariates or features). However, the object returned by *minpack.lm::nlsLM* does not contain the original data set but that data set is necessary to calculate the inner product. Hence, we make a list that contains the model fit and the data that was used to fit the model. ```{r} fit_and_data <- list(fit = fit, data = d) ``` Now we can implement the interface ```{r} library(R6) my_nls_info_extractor <- R6::R6Class( classname = "my_nls_info_extractor", inherit = GOF_model_info_extractor, public = list( yhat = function(model) { predict(object = model$fit) }, y_minus_yhat = function(model) { residuals(object = model$fit) }, beta_x_covariates = function(model) { a_hat <- coef(object = model$fit) x <- model$data$x ret <- a_hat * x return(ret) } )) my_info_extractor <- my_nls_info_extractor$new() ``` Implementing *yhat* and *y_minus_yhat* is straight forward using the function already offered by R but *beta_x_covariates* needs special attention. 
The reason is that *minpack.lm::nlsLM* can also fit very general models of the type $m(\beta, X)$. Hence, there is no built-in function to extract objects like $\beta^\top X$. Let us look at the first few data points:

```{r}
head(d)
```

Now we are able to predict $y$:

```{r}
head(my_info_extractor$yhat(model = fit_and_data))
```

And calculate the difference between $y$ and our prediction:

```{r}
head(my_info_extractor$y_minus_yhat(model = fit_and_data))
```

And based on the estimated coefficient we can calculate the inner product with $x$:

```{r}
head(my_info_extractor$beta_x_covariates(model = fit_and_data))
```

Since we did not make an assumption about the distribution of $\epsilon$, we cannot use a parametric resampling scheme. However, we can do a wild bootstrap that uses only the predictions and the residuals. The class *GOF_sim_wild_rademacher* implements this wild bootstrap but needs an info extractor to obtain the predictions and residuals:

```{r}
my_simulator <- GOF_sim_wild_rademacher$new(
  gof_model_info_extractor = my_info_extractor
)
```

This class generates as many observations (according to the fitted model) as are contained in the data set used to fit the model. Again, let us first look at the original data points:

```{r}
head(d)
```

Now let us look at new $y$s generated by the wild bootstrap according to the fitted model:

```{r}
head(my_simulator$resample_y(model = fit_and_data))
```

Note that the resampled $y$s sometimes equal the observed $y$s. The reason is that this wild bootstrap computes 'prediction +/- residual', and only the sign is drawn at random.

Finally, we need to implement the interface *GOF_model_trainer*, which requires a function *refit* that is able to update the model object by refitting it to a new data set. R already provides the necessary function, i.e. *stats::update*. However, since we combined the fitted model with the data set in a list, we need to take that into account:

```{r}
my_nls_trainer <- R6::R6Class(
  classname = "GOF_nls_trainer",
  inherit = GOF_model_trainer,
  public = list(
    refit = function(model, data) {
      fit <- update(object = model$fit, data = data)
      ret <- list(fit = fit, data = data)
      return(ret)
    }))
my_trainer <- my_nls_trainer$new()
```

This implementation basically equals the implementation of *GOF\_lm/glm/\_trainer*. The only difference is that we again store the data with the fit because *nlsLM()* does not do that for us. Of course, fitting the model again to the original data set results in the same fitted model. With the defined classes we can now easily generate a new data set and refit the model to that new data set:

```{r}
new_data <- d
new_data$y <- my_simulator$resample_y(model = fit_and_data)
my_trainer$refit(model = fit_and_data, data = new_data)$fit
```

Now all ingredients are available for applying the goodness-of-fit test:

```{r}
set.seed(1)
mt <- GOF_model_test$new(
  model = fit_and_data,
  data = d,
  nmb_boot_samples = 100,
  y_name = "y",
  Rn1_statistic = Rn1_CvM$new(),
  gof_model_info_extractor = my_info_extractor,
  gof_model_resample = GOF_model_resample$new(
    gof_model_simulator = my_simulator,
    gof_model_trainer = my_trainer
  )
)
mt$get_pvalue()
```

# Negative Binomial using the MASS-package

A negative binomial model is a generalized linear model. Furthermore, within R, MASS::glm.nb returns an object of class *glm*, and this package can process *glm* classes.
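As a quick illustration of this class relationship, here is a minimal sketch (not evaluated, and using a hypothetical count data set `d_nb` with columns `y`, `x1` and `x2`) that inspects the class attribute of a `glm.nb` fit:

```{r, eval = FALSE}
library(MASS)
## a MASS::glm.nb fit inherits from "glm" (and "lm"), which is why the
## default glm extractors and simulators of this package can be applied to it
fit_nb <- MASS::glm.nb(y ~ x1 + x2, data = d_nb)
class(fit_nb)
#> [1] "negbin" "glm"    "lm"
```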
However, MASS::glm.nb seems to have a bug that prevents properly updating/refitting such an object via the stats::update() function. We briefly illustrate this using an artificial data set:

```{r}
library(MASS)
set.seed(1)
X1 <- rnorm(100)
X2 <- rnorm(100)
d <- data.frame(
  y = MASS::rnegbin(n = 100, mu = exp(0.2 + X1 * 0.2 + X2 * 0.6), theta = 2),
  x1 = X1,
  x2 = X2)
fit <- MASS::glm.nb(y~x1+x2, data = d)
fit
```

Note that the fit-object shows that the call contained the parameter *init.theta*, which obviously was not provided by us. The problem is that this *init.theta* parameter is used by *stats::update* when refitting to resampled data. So fitting resampled data and fitting the original data is slightly different from the perspective of the fitting algorithm. To circumvent this problem we can reimplement the corresponding interface as follows:

```{r}
my_negbin_trainer <- R6::R6Class(
  classname = "GOF_glmnb_trainer",
  inherit = GOF_model_trainer,
  public = list(
    refit = function(model, data) {
      MASS::glm.nb(formula = formula(model), data = data)
    }))
```

This way we ensure that the original data set and the resampled data sets are fitted in the same way. Now we can create the GOF-test-class using this new refitting class:

```{r}
set.seed(1)
mt <- GOF_model_test$new(
  model = fit,
  data = d,
  nmb_boot_samples = 100,
  y_name = "y",
  Rn1_statistic = Rn1_CvM$new(),
  gof_model_info_extractor = GOF_glm_info_extractor$new(),
  gof_model_resample = GOF_model_resample$new(
    gof_model_simulator = GOF_glm_sim_param$new(),
    gof_model_trainer = my_negbin_trainer$new()
  )
)
mt$get_pvalue()
```

Let us compare the result with the default GOF-test for glm's:

```{r}
set.seed(1)
mt2 <- GOF_model(
  model = fit,
  data = d,
  nmb_boot_samples = 100,
  simulator_type = "parametric",
  y_name = "y",
  Rn1_statistic = Rn1_CvM$new()
)
mt2$get_pvalue()
```

In this case the p-values do not differ. However, this could be different in other settings.
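If one wants to see whether the two refitting strategies actually diverge for a particular resample, a small sketch like the following (not part of the original analysis, not evaluated here, and reusing the objects defined above) could compare the estimated dispersion parameter `theta` after refitting the same simulated response both ways:

```{r, eval = FALSE}
## draw one parametric resample of y from the fitted negative binomial model
d_sim <- d
d_sim$y <- GOF_glm_sim_param$new()$resample_y(model = fit)

## refit via stats::update(): reuses the init.theta stored in the original call
theta_update <- update(fit, data = d_sim)$theta

## refit via the custom trainer: estimates theta from scratch
theta_custom <- my_negbin_trainer$new()$refit(model = fit, data = d_sim)$theta

c(update = theta_update, custom = theta_custom)
```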