How to Use Bootstrap Functions in R
Overview
You can write your own code to perform the bootstrap in R, but you can also use the boot() function from the boot package, which ships with standard R installations. The procedure is short, as outlined below, but the way you call it differs from most other functions, so it may feel unfamiliar at first.
Guide
Step 1.
Define a function boot.fn() that returns the statistic you want to obtain. The name of the function doesn't matter, but its second argument must be an index, as in boot.fn <- function(dataset, index).
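A minimal sketch of such a function, assuming the statistic of interest is the mean of the mpg column in the built-in mtcars data (both the dataset and the statistic are chosen here only for illustration):

```r
# Statistic function: the first argument is the data, the second is a vector
# of row indices selecting the rows that make up one resample.
boot.fn <- function(dataset, index) {
  mean(dataset$mpg[index])
}

# Passing every row index exactly once reproduces the statistic on the original data.
boot.fn(mtcars, seq_len(nrow(mtcars)))
```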
Step 2.
Set the arguments data, statistic, and R in the boot() function and run it. data is the dataset, statistic is the function defined in Step 1, and R is the number of bootstrap replicates.
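The call might look like the sketch below. It reuses the illustrative mtcars mean from Step 1; the seed and R = 1000 are arbitrary choices, not values from the original example.

```r
library(boot)  # recommended package, shipped with standard R installations

boot.fn <- function(dataset, index) {
  mean(dataset$mpg[index])
}

set.seed(1)  # make the resampling reproducible
result <- boot(data = mtcars, statistic = boot.fn, R = 1000)
result       # prints the original, bias, and std. error columns
```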
Example
Let’s walk through the example step by step.
Default is a dataset of default status and the conditions associated with it. The default variable is naturally the dependent variable, and because the task is to predict whether a case actually ends in default, logistic regression is appropriate.
The result of the logistic regression is shown above.
Here we will apply the bootstrap to the regression coefficients.
The function used for the bootstrap must take two arguments: dataset and index. dataset receives the original data, and index receives the vector of row indices that boot() draws with replacement for each resample.
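To get a feel for what arrives in index, the following sketch draws one resample by hand. It only imitates what boot() does for an ordinary nonparametric bootstrap, and the five-row data frame is made up for illustration.

```r
set.seed(1)
dataset <- data.frame(x = c(10, 20, 30, 40, 50))

# One bootstrap resample: n row indices drawn with replacement, so some rows
# can appear more than once and others not at all.
index <- sample(nrow(dataset), replace = TRUE)
index

# Subsetting by index gives the resampled data the statistic is computed on.
dataset[index, , drop = FALSE]
```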
If you look at return(), you can see that the function returns not the whole model or its summary but just a vector of regression coefficients. Because the statistic has to be defined as a function in this way, the bootstrap function can be hard to understand without a basic knowledge of R. If you know enough statistics to be using the bootstrap it shouldn't be a big problem, but it's normal to find it a bit difficult, so don't blame yourself.
The boot() function takes the dataset as data and the number of replicates as R. A common mistake involves the statistic argument: it must receive the function itself, boot.fn, and not the call boot.fn(), which stands for the function's return value (a vector) rather than the function.
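The difference can be seen in a small sketch. It uses the illustrative mtcars mean rather than the regression example, and wraps the faulty call in tryCatch() so the script keeps running.

```r
library(boot)

boot.fn <- function(dataset, index) {
  mean(dataset$mpg[index])
}

# Wrong: boot.fn() is a call with no arguments, so R fails while evaluating it
# instead of handing a function to boot().
wrong <- tryCatch(
  boot(data = mtcars, statistic = boot.fn(), R = 100),
  error = function(e) conditionMessage(e)
)
wrong  # the error message

# Right: pass the function object itself, without parentheses.
set.seed(1)
right <- boot(data = mtcars, statistic = boot.fn, R = 100)
```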
If you keep getting errors, passing boot.fn() instead of boot.fn is usually the cause. Looking at the results, the rows t1*, t2*, t3*, t4* correspond to the regression coefficients in order, starting from the first, and both the coefficients and the standard errors are very similar to those of the initial logistic regression. What the bootstrap actually provides here is not the original column but the std. error column. As you can confirm by comparing, original is just the regression coefficients computed from the entire dataset, so it involves no resampling and tells you nothing new.
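The sketch below shows where those columns come from in the object returned by boot(). It uses simulated data with a single predictor, not the Default data, so the numbers themselves mean nothing; the names sim, x, and y are made up for the example.

```r
library(boot)

# Simulated data, for illustration only.
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 0.8 * x))
sim <- data.frame(x = x, y = y)

boot.fn <- function(dataset, index) {
  coef(glm(y ~ x, data = dataset[index, ], family = binomial))
}

result <- boot(data = sim, statistic = boot.fn, R = 500)

# "original": the statistic on the full data, the same as a plain glm() fit.
result$t0

# "std. error": the standard deviation of each column of bootstrap replicates.
boot.se <- apply(result$t, 2, sd)

# Compare with the model-based standard errors reported by glm().
fit <- glm(y ~ x, data = sim, family = binomial)
cbind(bootstrap = boot.se, glm = summary(fit)$coefficients[, "Std. Error"])
```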
Code
Here is the example code. Because the bootstrap is used in such a distinctive way, if you don't fully understand it you can simply adapt the example to your own case.
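What follows is a sketch of the walkthrough rather than a definitive listing. It assumes Default is the data frame from the ISLR package; if ISLR is not installed, it falls back to a simulated stand-in with the same column names so that the script still runs. The seed and R = 200 are arbitrary choices that keep the run short, so raise R for more stable standard errors.

```r
library(boot)

# Assumption: Default is the data frame from the ISLR package. Without ISLR,
# a simulated stand-in is used; its numbers are not comparable to the real data.
if (requireNamespace("ISLR", quietly = TRUE)) {
  Default <- ISLR::Default
} else {
  set.seed(1)
  n <- 2000
  student <- factor(sample(c("No", "Yes"), n, replace = TRUE))
  balance <- pmax(rnorm(n, mean = 800, sd = 450), 0)
  income <- rnorm(n, mean = 33000, sd = 13000)
  p <- plogis(-8 + 0.005 * balance - 0.5 * (student == "Yes"))
  Default <- data.frame(
    default = factor(ifelse(runif(n) < p, "Yes", "No"), levels = c("No", "Yes")),
    student = student, balance = balance, income = income
  )
}

# Logistic regression on the full data.
fit <- glm(default ~ ., data = Default, family = binomial)
summary(fit)$coefficients

# Step 1: statistic function returning only the coefficient vector.
boot.fn <- function(dataset, index) {
  coef(glm(default ~ ., data = dataset[index, ], family = binomial))
}

# Step 2: pass the function itself (no parentheses) to boot().
set.seed(1)
boot.out <- boot(data = Default, statistic = boot.fn, R = 200)
boot.out
```

To bootstrap a different statistic, change only the body of boot.fn so that it returns the numbers you care about, computed on dataset[index, ].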