生信碱移
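Fast gene feature selection with machine learning.
GeneSelectR is an open-source R package that combines machine learning (ML) with bioinformatics data-mining methods to help researchers select features from omics data. With GeneSelectR, you can select features from a normalized RNA dataset using several machine learning methods and user-defined parameters.
Feature selection in GeneSelectR is implemented with Python's scikit-learn library, so the package needs an Anaconda installation to work: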
# install.packages("devtools")
devtools::install_github("dzhakparov/GeneSelectR", build_vignettes = FALSE)
# On first launch, create the conda environment for the package
GeneSelectR::configure_environment()
At the start of every new analysis session, point GeneSelectR at the correct conda environment before loading the package:
GeneSelectR::set_reticulate_python()
library(GeneSelectR)
The input is a data frame with samples as rows and features as columns. The example dataset can be loaded and analyzed as follows:
data("UrbanRandomSubset")
head(UrbanRandomSubset[,1:10])
library(dplyr) # provides %>% and select()
# the treatment column serves as the class label
X <- UrbanRandomSubset %>% dplyr::select(-treatment) # get the feature matrix
y <- UrbanRandomSubset[['treatment']] # extract the label column as a vector
y <- as.factor(y)
# the most basic usage
selection_results <- GeneSelectR::GeneSelectR(X = X,
y = y,
njobs = -1) # use all available cores
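As a quick sanity check on the expected layout (samples as rows, features as columns, labels kept separately), here is a minimal base-R sketch. The toy data frame, its gene names, and its values are made up for illustration and are not part of UrbanRandomSubset.

```r
# Toy data: 6 samples (rows) x 3 hypothetical genes (columns) plus a label column.
set.seed(1)
toy <- data.frame(
  geneA = rnorm(6),
  geneB = rnorm(6),
  geneC = rnorm(6),
  treatment = rep(c("control", "treated"), each = 3)
)

# Feature matrix: every column except the label.
X_toy <- toy[, setdiff(names(toy), "treatment"), drop = FALSE]

# Label: extract the column as a plain vector with [[ ]], then convert to a factor.
y_toy <- as.factor(toy[["treatment"]])

# One label per sample, and the label column is no longer among the features.
stopifnot(nrow(X_toy) == length(y_toy))
stopifnot(!"treatment" %in% names(X_toy))
```

Extracting the label with `[[ ]]` (or `$`) returns a vector, whereas single brackets return a one-column data frame, which `as.factor()` does not convert element by element.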
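Before running the analysis you can supply your own feature selection methods and hyperparameter grids. If none are provided, the package falls back to its default methods and grids. By default there are four feature selection methods: univariate feature selection, logistic regression with an L1 penalty, Boruta, and random forest. The default hyperparameter grid for each method is: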
fs_param_grids <- list(
"Lasso" = list(
"feature_selector__estimator__C" = c(0.01, 0.1, 1L, 10L),
"feature_selector__estimator__solver" = c('liblinear','saga')
),
"Univariate" = list(
"feature_selector__param" = seq(50L, 200L, by = 50L)
),
"boruta" = list(
"feature_selector__perc" = seq(80L, 100L, by = 10L),
'feature_selector__n_estimators' = c(50L, 100L, 250L, 500L)
),
"RandomForest" = list(
"feature_selector__estimator__n_estimators" = seq(100L, 500L,by = 50L),
"feature_selector__estimator__max_depth" = c(10L, 20L, 30L),
"feature_selector__estimator__min_samples_split" = c(2L, 5L, 10L),
"feature_selector__estimator__min_samples_leaf" = c(1L, 2L, 4L),
"feature_selector__estimator__bootstrap" = c(TRUE, FALSE)
)
)
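Note that the default hyperparameter search is RandomizedSearchCV (randomized search). You can also set the search_type argument to use Bayesian optimization or an exhaustive grid search instead.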
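To get a feel for why a randomized search is a sensible default, this small base-R sketch counts how many combinations an exhaustive search would have to try for the random forest grid listed earlier. It only does arithmetic on the grid and does not call GeneSelectR.

```r
# Random forest selector grid, copied from the default grids shown earlier.
rf_grid <- list(
  "feature_selector__estimator__n_estimators" = seq(100L, 500L, by = 50L),
  "feature_selector__estimator__max_depth" = c(10L, 20L, 30L),
  "feature_selector__estimator__min_samples_split" = c(2L, 5L, 10L),
  "feature_selector__estimator__min_samples_leaf" = c(1L, 2L, 4L),
  "feature_selector__estimator__bootstrap" = c(TRUE, FALSE)
)

# Number of candidate values per hyperparameter.
n_values <- lengths(rf_grid)

# An exhaustive grid search fits every combination: the product of those counts.
n_combinations <- prod(n_values)
print(n_combinations) # 9 * 3 * 3 * 3 * 2 = 486
```

Each of those combinations would be refit once per cross-validation fold, so a full grid search grows expensive quickly, while a randomized search evaluates only a sample of them.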
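The package also supports other estimators that follow the scikit-learn API. For example, to use an XGBoost classifier instead of the default random forest: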
xgb <- reticulate::import('xgboost')
xgb.classifier <- xgb$XGBClassifier()
xgb_param_grid <- list(
"classifier__learning_rate" = c(0.01, 0.05, 0.1),
"classifier__n_estimators" = c(100L, 200L, 300L),
"classifier__max_depth" = c(3L, 5L, 7L),
"classifier__min_child_weight" = c(1L, 3L, 5L),
"classifier__gamma" = c(0, 0.1, 0.2),
"classifier__subsample" = c(0.8, 1.0),
"classifier__colsample_bytree" = c(0.8, 1.0)
)
# run the analysis
selection_results <- GeneSelectR::GeneSelectR(X = X,
y = y,
njobs = -1,
classifier = xgb.classifier,
classifier_grid = xgb_param_grid)
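The grid names above follow scikit-learn's pipeline convention of `step__parameter`, with the step names `classifier` and `feature_selector` taken from the grids shown in this post. As a small sketch, the helper below (`prefix_grid` is a hypothetical name, not a GeneSelectR function) adds such a prefix to a plainly named parameter list, which can cut down on typos in longer grids.

```r
# Plain hyperparameters, named as in the XGBClassifier constructor.
raw_grid <- list(
  learning_rate = c(0.01, 0.05, 0.1),
  n_estimators = c(100L, 200L, 300L),
  max_depth = c(3L, 5L, 7L)
)

# Hypothetical helper: prepend "<step>__" to every parameter name.
prefix_grid <- function(grid, step) {
  names(grid) <- paste0(step, "__", names(grid))
  grid
}

xgb_grid_small <- prefix_grid(raw_grid, "classifier")
print(names(xgb_grid_small))
```

The values are untouched; only the names change, so the result can be passed wherever a hand-written grid of the same shape would go.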
Feature importance:
plot_feature_importance(selection_results, top_n_features = 10)
Performance of the methods:
plot_metrics(selection_results)
# or access it as a dataframe
selection_results$test_metrics
selection_results$cv_mean_score
Overlap between methods:
overlap <- calculate_overlap_coefficients(selection_results)
plot_overlap_heatmaps(overlap)
# with custom feature lists:
custom_list <- list(custom_list = c('char1','char2','char3','char4','char5'),
custom_list2 = c('char1','char2','char3','char4','char5'))
overlap1 <- calculate_overlap_coefficients(selection_results, custom_lists = custom_list)
plot_overlap_heatmaps(overlap1)
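For intuition about what such overlap scores measure, here is a minimal base-R sketch of three common set-similarity measures between two feature lists. The gene lists are made up, `overlap_measures` is a hypothetical helper, and the package's own implementation may define or name its coefficients differently.

```r
# Two hypothetical gene lists, e.g. produced by two selection methods.
list_a <- c("TP53", "BRCA1", "EGFR", "MYC")
list_b <- c("TP53", "EGFR", "KRAS")

# Hypothetical helper: three standard similarity measures between two sets.
overlap_measures <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  shared <- length(intersect(a, b))
  c(
    overlap   = shared / min(length(a), length(b)),   # Szymkiewicz-Simpson
    soerensen = 2 * shared / (length(a) + length(b)), # Soerensen-Dice
    jaccard   = shared / length(union(a, b))          # Jaccard index
  )
}

res <- overlap_measures(list_a, list_b)
print(round(res, 3))
```

All three range from 0 (no shared features) to 1; the overlap coefficient reaches 1 whenever the smaller list is fully contained in the larger one, so it is the most forgiving of differences in list size.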
UpSet plot of the intersections:
plot_upset(selection_results)
# plot upset with custom lists
plot_upset(selection_results, custom_lists = custom_list)
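An UpSet plot is essentially a picture of a feature-by-list membership table. The base-R sketch below builds such a table for three made-up feature lists (the method and feature names are placeholders) and pulls out the features shared by all of them; it illustrates the idea only and does not reproduce the package's plotting code.

```r
# Hypothetical feature lists from three selection methods.
feature_lists <- list(
  method1 = c("g1", "g2", "g3"),
  method2 = c("g2", "g3", "g4"),
  method3 = c("g3", "g5")
)

# Every feature that appears in at least one list.
all_features <- sort(unique(unlist(feature_lists)))

# Logical membership matrix: rows are features, columns are lists.
membership <- sapply(feature_lists, function(fl) all_features %in% fl)
rownames(membership) <- all_features
print(membership)

# Features selected by every method.
in_all <- rownames(membership)[rowSums(membership) == ncol(membership)]
print(in_all)
```

Each distinct row pattern in this matrix corresponds to one bar in an UpSet plot, with the bar height given by the number of features sharing that pattern.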
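That's all for this quick share.
It is offered for readers' reference only.
If anything here infringes on rights or is incorrect, please get in touch so it can be removed or corrected.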
https://cran.r-project.org/web/packages/GeneSelectR/vignettes/example.html