💡专注R语言在🩺生物医学中的使用
免费千人🐧QQ交流群:613637742 面向医学生/医生的实用机器学习教程
机器学习基础
mlr3系列
tidymodels系列
caret系列
分类问题中结果变量的类不平衡很常见,一个类明显多于另一种,这样的结果会产生很大影响。 目前处理这种类不平衡的方法主要有: 除此之外,有些算法在自带权重选项,可以给类少的类别配上更高的权重,这样也可以减少类不平衡带来的影响。 关于这几种方法的详细解释,大家可以自行了解,我们主要演示 虚构一个数据,训练集和测试集各有10000个样本,其中占比低的类大概只有5.9%的比例,数据如下: 可以看到两个类差别非常大! 下面演示 首先是 然后是 然后是 最后是 使用袋装法评估这几种方法配平后的数据,重抽样方法选择重复10折交叉验证,评价指标选择AUC。 查看不同模型的表现: 可以看到up-sampling效果最好,ROSE效果最差。 下面定义一个函数,分别用训练好的模型作用于测试集,并计算AUC: 发现在测试集中结果完全不一样。。 上面是先把数据划分为训练集和测试集,然后对训练集使用不同的配平方法,然后训练模型,最后在测试集中查看模型表现。 现在是在每一次重抽样中都使用一次配平方法。 查看训练集中的结果: 把训练好的模型作用于测试集,并查看结果: 今天虽然详细介绍了几种处理类不平衡数据的方法,但其实真正使用时也就1行代码的事,或者只是作为一个数据预处理步骤而已,大家理解下内涵即可。sub-sampling
:SMOTE
和ROSE
:同时增加少的类和减少多的类,需要借助DMwR
和ROSE
包。caret
中的实现方法。4种方法演示
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(2969)
imbal_train <- twoClassSim(10000, intercept = -20, linearVars = 20)
imbal_test <- twoClassSim(10000, intercept = -20, linearVars = 20)
table(imbal_train$Class)
##
## Class1 Class2
## 9411 589caret
中的sub-sampling
。download-sampling
:set.seed(9560)
down_train <- downSample(x = imbal_train[, -ncol(imbal_train)],# 预测变量
y = imbal_train$Class) # 结果变量
table(down_train$Class)
##
## Class1 Class2
## 589 589up-sampling
:set.seed(9560)
up_train <- upSample(x = imbal_train[, -ncol(imbal_train)],
y = imbal_train$Class)
table(up_train$Class)
##
## Class1 Class2
## 9411 9411SMOTE
:DMwR
这个包默认方式肯定安装不了,需要下载到本地安装或者github安装,如果你不会,赶紧翻看我的视频教程:R语言必知的4种R包安装方式。library(DMwR)
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
set.seed(9560)
smote_train <- SMOTE(Class ~ ., data = imbal_train)
table(smote_train$Class)
##
## Class1 Class2
## 2356 1767ROSE
:library(ROSE)
## Loaded ROSE 0.0-4
set.seed(9560)
rose_train <- ROSE(Class ~ ., data = imbal_train)$data
table(rose_train$Class)
##
## Class1 Class2
## 4939 5061# 首先加个速
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
cl <- makePSOCKcluster(16)
registerDoParallel(cl)
# 然后训练模型
# 选择重抽样方法和评价指标
ctrl <- trainControl(method = "repeatedcv", repeats = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary)
# 原始训练集
set.seed(5627)
orig_fit <- train(Class ~ ., data = imbal_train,
method = "treebag",
nbagg = 50,
metric = "ROC",
trControl = ctrl)
# down sampling处理过的训练集
set.seed(5627)
down_outside <- train(Class ~ ., data = down_train,
method = "treebag",
nbagg = 50,
metric = "ROC",
trControl = ctrl)
# up sampling处理过的训练集
set.seed(5627)
up_outside <- train(Class ~ ., data = up_train,
method = "treebag",
nbagg = 50,
metric = "ROC",
trControl = ctrl)
# smote
set.seed(5627)
rose_outside <- train(Class ~ ., data = rose_train,
method = "treebag",
nbagg = 50,
metric = "ROC",
trControl = ctrl)
# rose
set.seed(5627)
smote_outside <- train(Class ~ ., data = smote_train,
method = "treebag",
nbagg = 50,
metric = "ROC",
trControl = ctrl)
stopCluster(cl)outside_models <- list(original = orig_fit,
down = down_outside,
up = up_outside,
SMOTE = smote_outside,
ROSE = rose_outside)
# 提取训练集中的模型表现
outside_resampling <- resamples(outside_models)
summary(outside_resampling, metric = "ROC")
##
## Call:
## summary.resamples(object = outside_resampling, metric = "ROC")
##
## Models: original, down, up, SMOTE, ROSE
## Number of resamples: 50
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## original 0.9098237 0.9298348 0.9386021 0.9394130 0.9493394 0.9685873 0
## down 0.9095558 0.9282175 0.9453907 0.9438384 0.9596021 0.9836254 0
## up 0.9989350 0.9999980 1.0000000 0.9998402 1.0000000 1.0000000 0
## SMOTE 0.9697171 0.9782214 0.9834234 0.9817476 0.9857071 0.9928255 0
## ROSE 0.8782985 0.8941488 0.8980313 0.8993135 0.9056404 0.9203092 0# 编写一个函数,计算测试集的表现
test_roc <- function(model, data) {
library(pROC)
roc_obj <- roc(data$Class,
predict(model, data, type = "prob")[, "Class1"],
levels = c("Class2", "Class1"))
ci(roc_obj)
}
outside_test <- lapply(outside_models, test_roc, data = imbal_test)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
outside_test <- lapply(outside_test, as.vector)
outside_test <- do.call("rbind", outside_test)
colnames(outside_test) <- c("lower", "ROC", "upper")
outside_test <- as.data.frame(outside_test)
outside_test
## lower ROC upper
## original 0.9130010 0.9247957 0.9365905
## down 0.9286964 0.9368361 0.9449758
## up 0.9244128 0.9338499 0.9432869
## SMOTE 0.9429536 0.9490585 0.9551634
## ROSE 0.9383809 0.9459729 0.9535649在重抽样过程中使用subsampling
library(doParallel)
cl <- makePSOCKcluster(16)
registerDoParallel(cl)
ctrl <- trainControl(method = "repeatedcv", repeats = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary,
## new option here:
sampling = "down")
set.seed(5627)
down_inside <- train(Class ~ ., data = imbal_train,
method = "treebag",
nbagg = 50,
metric = "ROC",
trControl = ctrl)
## now just change that option
ctrl$sampling <- "up"
set.seed(5627)
up_inside <- train(Class ~ ., data = imbal_train,
method = "treebag",
nbagg = 50,
metric = "ROC",
trControl = ctrl)
ctrl$sampling <- "rose"
set.seed(5627)
rose_inside <- train(Class ~ ., data = imbal_train,
method = "treebag",
nbagg = 50,
metric = "ROC",
trControl = ctrl)
ctrl$sampling <- "smote"
set.seed(5627)
smote_inside <- train(Class ~ ., data = imbal_train,
method = "treebag",
nbagg = 50,
metric = "ROC",
trControl = ctrl)
## Loading required package: recipes
## Loading required package: dplyr
stopCluster(cl)inside_models <- list(original = orig_fit,
down = down_inside,
up = up_inside,
SMOTE = smote_inside,
ROSE = rose_inside)
inside_resampling <- resamples(inside_models)
summary(inside_resampling, metric = "ROC")
##
## Call:
## summary.resamples(object = inside_resampling, metric = "ROC")
##
## Models: original, down, up, SMOTE, ROSE
## Number of resamples: 50
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## original 0.9098237 0.9298348 0.9386021 0.9394130 0.9493394 0.9685873 0
## down 0.9140294 0.9381766 0.9453610 0.9438490 0.9492917 0.9684522 0
## up 0.8887678 0.9308075 0.9393226 0.9392084 0.9517913 0.9679569 0
## SMOTE 0.9262054 0.9426785 0.9519728 0.9507166 0.9591469 0.9735586 0
## ROSE 0.9305013 0.9442821 0.9489859 0.9511117 0.9572416 0.9756750 0inside_test <- lapply(inside_models, test_roc, data = imbal_test)
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
inside_test <- lapply(inside_test, as.vector)
inside_test <- do.call("rbind", inside_test)
colnames(inside_test) <- c("lower", "ROC", "upper")
inside_test <- as.data.frame(inside_test)
inside_test
## lower ROC upper
## original 0.9130010 0.9247957 0.9365905
## down 0.9354534 0.9419704 0.9484875
## up 0.9353945 0.9431074 0.9508202
## SMOTE 0.9419165 0.9486050 0.9552936
## ROSE 0.9369170 0.9448367 0.9527563
视频教程可关注我的b站:阿越就是我
🔖精选合集
微信扫一扫
关注该公众号