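Title: Parallel Design and Implementation of Typical Streaming Machine Learning Algorithms (典型流式机器学习算法并行化设计与实现)
Author: Shen Wenting (沈雯婷)
Degree type: Master's
Defense date: 2018-05-22
Degree-granting institution: Graduate University of Chinese Academy of Sciences (中国科学院研究生院)
Place of conferral: Beijing
Advisor: Wang Wei (王伟)
Keywords: streaming machine learning; incremental learning; online learning; stream data mining; parallelization
Degree discipline: Computer Software and Theory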
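Chinese abstract (translated)

Machine learning is a class of algorithms that automatically analyze known data to extract patterns and then use those patterns to make predictions on unseen data; it is widely applied across many fields. In many current machine learning applications the data arrive as a "high-speed", "dynamic", and "unbounded" stream: the training data volume is large and the data distribution depends on time. In a streaming environment the model therefore has to be updated continuously while both timeliness and accuracy are maintained, and traditional batch processing cannot meet these requirements.
Incremental learning and incremental batch processing based on periodic computation can meet the needs of machine learning in streaming environments. Streaming machine learning algorithms build on these techniques so that the model can be updated in real time and the importance of data decays over time, which improves accuracy. Some streaming machine learning algorithms have been implemented in parallel on distributed machine learning frameworks, but existing work has the following problems: (1) typical streaming machine learning algorithms lack parallelization schemes; (2) computation modes and data flow models for streaming machine learning are missing; (3) the parallel streaming algorithms that have been implemented perform poorly in timeliness and accuracy.
To address these shortcomings, this thesis (1) summarizes the characteristics of streaming machine learning algorithms; (2) classifies streaming machine learning algorithms into three computation modes, namely micro-batch incremental update, online incremental update, and online sketch update, which correspond to streaming adaptations of batch processing, online learning, and stream data mining techniques, respectively; (3) builds mathematical models from the temporal logic of the computation steps in each mode, and expresses the dependencies between the inputs and outputs of the computation functions in those models as data flow models, comprising a parameter-increment computation flow, a parameter update flow, and a model computation flow; (4) proposes partitioning schemes for streaming data and for changing parameters, proposes parallel implementation methods for each computation flow in the data flow models, and distills design steps for making algorithms streaming and parallel together with decision rules for choosing a parallelization method; (5) applies the three computation modes, the data flow models, and the parallelization methods to design and implement streaming, parallel versions of typical streaming machine learning algorithms on the Flink distributed stream processing framework.
Experimental results show that, with the three computation modes, the corresponding data flow models, and the parallel design methods proposed in this thesis, batch processing algorithms, online learning algorithms, and stream data mining algorithms can all be implemented in a straightforward way as parallel streaming algorithms in a distributed environment. Computation latency is on the order of hundreds of milliseconds, which meets the real-time target; throughput increases as computing nodes are added, so the algorithms scale; and the accuracy loss of the parallel implementations relative to the serial implementations stays within one order of magnitude.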
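The three computation modes named in the abstract can be illustrated with a minimal Python sketch. It is not taken from the thesis: the toy model y ≈ w * x, the decayed counter used as the "sketch", and all function names are assumptions chosen only to show how the modes differ in when and how state is updated.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Record = Tuple[float, float]  # (feature x, label y) for a toy model y ≈ w * x


def gradient(w: float, batch: List[Record]) -> float:
    """Mean squared-error gradient of the toy model over a non-empty batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def mini_batch_incremental(stream: Iterable[Record], batch_size: int, lr: float) -> float:
    """Mode 1 (assumed form): buffer records, update the model once per micro-batch."""
    w = 0.0
    buffer: List[Record] = []
    for record in stream:
        buffer.append(record)
        if len(buffer) == batch_size:
            w -= lr * gradient(w, buffer)
            buffer = []  # records left over at the end of the stream are ignored
    return w


def online_incremental(stream: Iterable[Record], lr: float) -> float:
    """Mode 2 (assumed form): update the model on every arriving record."""
    w = 0.0
    for record in stream:
        w -= lr * gradient(w, [record])
    return w


def online_sketch(stream: Iterable[str], decay: float) -> Counter:
    """Mode 3 (assumed form): keep a compact summary in one pass.

    Every arrival multiplies the existing counts by `decay`, so older
    items matter less over time.
    """
    summary: Counter = Counter()
    for item in stream:
        for key in list(summary):
            summary[key] *= decay
        summary[item] += 1.0
    return summary
```

In this sketch the first two modes maintain model parameters and differ only in update granularity, while the third maintains a summary of the stream rather than a trained parameter.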
 

English abstract

Machine learning automatically analyzes known data instances to obtain patterns and uses those patterns to predict unknown results. In current machine learning applications, the data instances arrive as a high-speed, dynamic, and infinite stream. Because the data volume is large and the data distribution is time-dependent, the model of a streaming machine learning algorithm must be updated constantly while ensuring real-time performance and accuracy. Traditional batch processing algorithms cannot meet these requirements.
Incremental learning and periodic incremental batch computation meet the memory-usage and model-updating needs of streaming environments. Streaming machine learning algorithms build on these techniques to update the model in real time and to reduce the importance of data over time. Although some streaming machine learning algorithms have been implemented in parallel on several distributed frameworks, the following problems remain: (1) classical streaming machine learning algorithms lack parallel approaches, (2) calculation modes and data flow models are missing, and (3) the parallel streaming algorithms that have been implemented perform poorly in real-time behavior and accuracy.
To address the deficiencies of existing work, this thesis (1) summarizes the characteristics of streaming machine learning algorithms; (2) generalizes streaming machine learning algorithms into three modes, mini-batch incremental update, online incremental update, and online sketch update, which correspond to improvements of batch processing, online learning, and One-Pass techniques; (3) defines mathematical models for the calculation modes and, based on the calculation functions in those models, expresses the temporal logic and the relationship between inputs and outputs as data flow models; (4) proposes partition methods for streaming data and the changing model, proposes a parallel implementation method for each computing flow, and establishes steps for streaming and parallel implementation; (5) guides the implementation of streaming machine learning algorithms based on the proposed calculation modes, data flow models, and parallel design rules.
The experimental results show that, with the calculation modes, data flow models, and parallel rules proposed in this thesis, batch algorithms and streaming algorithms can be implemented easily as parallel streaming algorithms in a distributed environment. Real-time expectations are met, with computational delay on the order of hundreds of milliseconds. Scalability meets expectations, with throughput increasing as computing nodes are added. The accuracy loss of the parallel implementation relative to the serial implementation is kept within one order of magnitude.
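As a rough illustration of partitioning streaming data and separating the computing flows, the sketch below splits one micro-batch across workers, computes a partial parameter increment per partition, merges the partial results into a single update, and then uses the model for prediction. It is an assumption-laden sketch rather than the thesis's implementation: a Python thread pool stands in for a distributed stream processor, the model is a toy y ≈ w * x, and every name (`partition`, `partial_gradient`, `parallel_update`, `predict`) is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Record = Tuple[float, float]  # (feature x, label y) for a toy model y ≈ w * x


def partition(batch: List[Record], n_workers: int) -> List[List[Record]]:
    """Round-robin split of one micro-batch across workers."""
    return [batch[i::n_workers] for i in range(n_workers)]


def partial_gradient(w: float, part: List[Record]) -> Tuple[float, int]:
    """Increment computation: gradient sum and record count for one partition."""
    return sum(2.0 * (w * x - y) * x for x, y in part), len(part)


def parallel_update(w: float, batch: List[Record], n_workers: int, lr: float) -> float:
    """Parameter update: merge the partial results, then apply one step.

    Assumes a non-empty batch; empty partitions are skipped.
    """
    parts = [p for p in partition(batch, n_workers) if p]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(lambda p: partial_gradient(w, p), parts))
    grad_sum = sum(g for g, _ in results)
    count = sum(n for _, n in results)
    return w - lr * grad_sum / count


def predict(w: float, x: float) -> float:
    """Model computation: apply the current parameter to a new input."""
    return w * x
```

Because each worker returns a gradient sum together with a record count, the merged update in this sketch equals the serial update on the same batch up to floating-point rounding, whatever the number of partitions.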
 

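Language: Chinese
Subject: Software theory
Content type: Thesis
Source URL: http://ir.iscas.ac.cn/handle/311060/19041
Collection: Institute of Software, Software Engineering Technology Research and Development Center, Theses
Author affiliation: Institute of Software, Chinese Academy of Sciences
Recommended citation (GB/T 7714):
Shen Wenting. 典型流式机器学习算法并行化设计与实现 [Parallel Design and Implementation of Typical Streaming Machine Learning Algorithms][D]. Beijing. Graduate University of Chinese Academy of Sciences. 2018.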
 
