Efficient Distributed Machine Learning with Trigger Driven Parallel Training
Li, Shenglong ; Xue, Jilong ; Yang, Zhi ; Dai, Yafei
2016
Keywords: Distributed machine learning; Straggler problem
Abstract: Distributed machine learning is becoming increasingly popular for large-scale data mining on large clusters. To mitigate the interference of straggler machines, recent distributed machine learning systems support flexible model consistency, which allows a worker to compute model updates using a stale local model without waiting for the newest model, while bounding the number of asynchronous steps to guarantee algorithm correctness. However, bounded asynchronous computing cannot tolerate consistent stragglers. We find that the root cause of this problem lies in the worker-driven parallel training mechanism of existing systems. To address the straggler problem fundamentally and fully exploit asynchronous efficiency, we propose a novel trigger-driven parallel training mechanism, in which the model server proactively triggers the collection of updates from workers instead of passively receiving them, which inherently avoids coordination issues among workers. In addition, we devise a dynamic load balancing strategy that equalizes the sampling frequency of each data item. Furthermore, bounded asynchronous computing is introduced to achieve algorithmic efficiency together with a convergence guarantee. Finally, we integrate these techniques into a distributed machine learning system called Squirrel. Squirrel provides a simple programming interface and makes it easy to deploy machine learning algorithms on a distributed cluster. Compared with the traditional worker-driven parallel training mechanism, the trigger-driven mechanism achieves up to 4x faster convergence of machine learning algorithms.
Funding: State Key Program of National Natural Science Foundation of China [61232004]; NSFC [61472009]; Shenzhen Key Fundamental Research Projects [JCYJ20151014093505032]
Indexed in: CPCI-S(ISTP)
Language: English
Source: 59th Annual IEEE Global Communications Conference (IEEE GLOBECOM)
Content type: Other
Source URL: [http://ir.pku.edu.cn/handle/20.500.11897/470163]
Collection: 信息科学技术学院 (School of Information Science and Technology)
Recommended citation
GB/T 7714
Li, Shenglong,Xue, Jilong,Yang, Zhi,et al. Efficient Distributed Machine Learning with Trigger Driven Parallel Training. 2016-01-01.