Abstract
Lasso is an effective and widely used method for variable selection. On high-dimensional, massive data sets, however, its computational cost can become excessive. To address this, the paper proposes a split-and-conquer method: the high-dimensional data set is first divided evenly into K parts and variable selection is carried out on each part; the feature sets selected from the parts are then merged, and variable selection is performed again on the merged set. To verify the advantages of the method, experiments are run on six data sets. Prediction results from SVM, random forest, and neural network models show that the split-and-conquer method performs well on high-dimensional massive data and substantially reduces running time.
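The two-stage procedure described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say whether the data are split by features or by samples, so the sketch assumes the feature columns are split into K interleaved blocks; the Lasso solver is a plain coordinate-descent routine written for this example; and the names `lasso_cd`, `select`, `split_and_conquer`, the penalty `lam`, and the synthetic data are all hypothetical.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/(2n))*||y - X b||^2 + lam*||b||_1.
    X is a list of rows; no intercept and no standardization (assumption)."""
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    col_sq = [sum(v * v for v in c) / n for c in cols]
    beta = [0.0] * p
    resid = list(y)  # residual y - X b, with b = 0 initially
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            c = cols[j]
            # Correlation of column j with the partial residual.
            rho = sum(c[i] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= c[i] * delta
                beta[j] = new
    return beta

def select(X, y, col_idx, lam):
    """Run Lasso on the given columns; return the indices with nonzero weight."""
    sub = [[row[j] for j in col_idx] for row in X]
    beta = lasso_cd(sub, y, lam)
    return [col_idx[k] for k, b in enumerate(beta) if b != 0.0]

def split_and_conquer(X, y, K, lam):
    """Stage 1: select within each of K feature blocks.
    Stage 2: merge the selected features and select again."""
    p = len(X[0])
    blocks = [list(range(k, p, K)) for k in range(K)]  # interleaved split (assumption)
    merged = sorted(j for block in blocks for j in select(X, y, block, lam))
    return select(X, y, merged, lam) if merged else []

# Synthetic demo data (hypothetical): 3 informative features out of 40.
rng = random.Random(0)
n, p = 200, 40
true_support = [3, 17, 30]
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [3 * r[3] - 2 * r[17] + 2.5 * r[30] + rng.gauss(0, 0.5) for r in X]

selected = split_and_conquer(X, y, K=4, lam=0.5)
print(selected)
```

Each first-stage fit sees only p/K columns, which is where the time saving would come from; the second-stage fit runs only on the merged candidates. In practice a library solver such as scikit-learn's `Lasso` would likely replace the hand-written `lasso_cd`, and the penalty would be tuned rather than fixed.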
Authors
Wen Kun (温焜); Lan Xiaoran (兰晓然)
(School of Management, Nanchang University, Nanchang 330029, China; Jiangxi Administration Institute, Nanchang 330003, China; Cangzhou Central Sub-branch of People's Bank of China, Cangzhou, Hebei 061000, China)
Source
Statistics & Decision (《统计与决策》)
CSSCI
Peking University Core Journals (北大核心)
2018, Issue 16, pp. 74-76 (3 pages)