We considered two models: a penalized regression model fitted with glmnet and a gradient boosted trees model fitted with gbm. The targets pipeline has already fitted the models, computed feature importance, and generated predictions; in this report, we simply inspect the results.
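For reference, the upstream pipeline might be structured along these lines. This is a minimal sketch: the helper functions (`simulate_data()`, `fit_glmnet()`, `fit_gbm()`, `compute_importance()`, `compute_performance()`) are assumptions and not part of this report; only the target names match the objects read below.

```r
# _targets.R (sketch, assuming hypothetical helper functions)
library(targets)

list(
  tar_target(data, simulate_data()),         # hypothetical simulated data step
  tar_target(glmnet_fit, fit_glmnet(data)),  # penalized regression model
  tar_target(gbm_fit, fit_gbm(data)),        # gradient boosting model
  tar_target(feature_importance, compute_importance(glmnet_fit, gbm_fit)),
  tar_target(performance, compute_performance(glmnet_fit, gbm_fit, data))
)
```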
library(tidyverse)

# Load the fitted models and derived results from the targets store.
glmnet_fit <- targets::tar_read(glmnet_fit)
gbm_fit <- targets::tar_read(gbm_fit)
imp <- targets::tar_read(feature_importance)
perf <- targets::tar_read(performance)
First, we inspect whether the two models have converged.
# Show the two convergence diagnostics side by side.
par(mfrow = c(1, 2))
plot(glmnet_fit, main = "glmnet fit\n")
# Best number of trees estimated by cross-validation.
gbm::gbm.perf(gbm_fit, method = "cv")
## [1] 4276
title(main = "gbm fit")
par(mfrow = c(1, 1))
We assess predictive performance by generating predictions for the test set and comparing them to the true values using the RMSE and MAE (lower is better). We also compute the performance of an intercept-only model to serve as a reference.
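As a reminder, the two metrics can be computed as follows (toy vectors here, not the pipeline's data):

```r
rmse <- function(truth, pred) sqrt(mean((truth - pred)^2))
mae  <- function(truth, pred) mean(abs(truth - pred))

truth <- c(1, 2, 3)
pred  <- c(1, 2, 5)
rmse(truth, pred)  # sqrt(4/3), about 1.155
mae(truth, pred)   # 2/3, about 0.667
```

The intercept-only reference simply predicts the mean of the training responses for every test observation, so any useful model should beat it on both metrics.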
perf %>%
  pivot_longer(!Model, names_to = "Metric", values_to = "Value") %>%
  ggplot(aes(x = Model, y = Value)) +
  facet_grid(rows = vars(Metric)) +
  geom_col() +
  scale_y_continuous(expand = expansion(mult = c(0, .05))) +
  coord_flip() +
  labs(x = "", y = "") +
  theme_bw(base_size = 15)
Here, the glmnet model performs better than the gbm model.
We assess how well each model identifies important features by computing the Spearman rank correlation coefficient between the feature importance reported by the model and the true importance of the features (the absolute value of the coefficients used to generate the data). The higher the correlation, the better.
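Because the Spearman coefficient compares only ranks, it rewards getting the ordering of the features right even when the importance values are on completely different scales, e.g.:

```r
true_imp  <- c(0.1, 0.5, 2.0, 4.0)
model_imp <- true_imp^3  # monotone transform: same ranking, different scale
cor(true_imp, model_imp, method = "spearman")  # 1
```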
tibble(Model = c("glmnet", "gbm"),
       Correlation = c(cor(imp$True, imp$glmnet, method = "spearman"),
                       cor(imp$True, imp$gbm, method = "spearman"))) %>%
  ggplot(aes(x = Model, y = Correlation)) +
  geom_col() +
  scale_y_continuous(expand = expansion(mult = c(0, .05))) +
  coord_flip() +
  labs(x = "", y = "Spearman correlation with true importance of parameters") +
  theme_bw(base_size = 15)
Here, the glmnet model identifies the most important features better than the gbm model does.