English Abstract |
Machine learning automatically analyzes known data instances to extract patterns and uses those patterns to predict unknown results. In current machine learning applications, data instances arrive as a high-speed, dynamic, and unbounded stream. Because the data volume is large and the data distribution changes over time, the model of a streaming machine learning algorithm must be updated continuously while maintaining both real-time performance and accuracy. Traditional batch processing algorithms cannot meet these requirements.
Incremental learning and periodic incremental computation over batches meet the memory-usage and model-updating needs of streaming environments. Streaming machine learning algorithms build on these techniques in order to update the model in real time and to reduce the importance of data as it ages. Although some streaming machine learning algorithms have been implemented in parallel on several distributed frameworks, the following problems remain: (1) classical streaming machine learning algorithms lack parallel approaches, (2) calculation modes and data flow models are lacking, and (3) the parallel streaming algorithms that have been implemented perform poorly in terms of real-time behavior and accuracy.
To address the deficiencies of existing work, this paper (1) summarizes the characteristics of streaming machine learning algorithms; (2) generalizes these algorithms into three calculation modes, namely mini-batch incremental update, online incremental update, and online sketch update, which correspond to improvements on batch processing, online learning, and one-pass techniques, respectively; (3) defines mathematical models for the calculation modes and, based on the calculation functions in those models, expresses the temporal logic and the relationship between input and output as data flow models; (4) proposes partition methods for the streaming data and the changing model, proposes a parallel implementation method for each computing flow, and establishes steps for streaming and parallel implementation; and (5) guides the implementation of streaming machine learning algorithms based on the proposed calculation modes, data flow models, and parallel design rules.
The experimental results show that, with the calculation modes, data flow models, and parallel rules proposed in this paper, both batch algorithms and streaming algorithms can be readily implemented in a parallel, streaming fashion in a distributed environment. Real-time expectations are met, with computational delay at the hundred-millisecond level. Scalability meets expectations, with throughput increasing as computing nodes are added. The accuracy loss of the parallel implementation relative to the serial implementation is kept within one order of magnitude.
|
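The three calculation modes named in the abstract can be illustrated with a minimal, single-machine sketch. The abstract does not specify the thesis's own algorithms or mathematical models, so everything below is an assumption made for illustration: the class names `MiniBatchMean`, `OnlineDecayedMean`, and `CountMinSketch` are hypothetical, and a running mean, an exponentially decayed mean, and a Count-Min sketch stand in for the mini-batch incremental, online incremental, and online sketch update modes, respectively.

```python
import hashlib


class MiniBatchMean:
    """Mini-batch incremental update: fold each batch summary into the model."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, batch):
        if not batch:
            return
        n = len(batch)
        batch_mean = sum(batch) / n
        total = self.count + n
        # Weighted merge of the stored mean and the new batch mean.
        self.mean += (batch_mean - self.mean) * n / total
        self.count = total


class OnlineDecayedMean:
    """Online incremental update: one instance at a time, older data decays."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha  # weight given to the newest instance (assumed value)
        self.value = None

    def update(self, x):
        if self.value is None:
            self.value = float(x)
        else:
            self.value = (1 - self.alpha) * self.value + self.alpha * x


class CountMinSketch:
    """Online sketch update: one-pass, fixed-memory frequency summary."""

    def __init__(self, width=64, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One hash per row, derived by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def update(self, item):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += 1

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum never underestimates.
        return min(
            self.table[row][self._index(row, item)] for row in range(self.depth)
        )
```

In each case the model state has a fixed size and is updated without revisiting earlier data, which is the property that makes these modes suitable for unbounded streams. How such state would be partitioned across nodes is the subject of the thesis's parallel design rules and is not shown here.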