StreamScan: Fast scan algorithms for GPUs without global barrier synchronization

Title	StreamScan: Fast scan algorithms for GPUs without global barrier synchronization
Authors	Yan, Shengen (1) ; Long, Guoping (1) ; Zhang, Yunquan (1)
Year	2013
Conference name	18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2013
Conference date	February 23, 2013 - February 27, 2013
Conference location	Shenzhen, China
Keywords	Scan; prefix-sum; OpenCL; CUDA; GPU; parallel algorithms
Pages	229-238
Corresponding author	Yan, S. (yanshengen@gmail.com)
Abstract	Scan (also known as prefix sum) is a useful primitive for many important parallel algorithms, such as sort, BFS, SpMV, and compaction. The current state-of-the-art GPU-based scan implementation consists of three consecutive Reduce-Scan-Scan phases. This approach requires at least two global barriers and 3N global memory accesses, where N is the problem size. In this paper we propose StreamScan, a novel approach that implements scan on GPUs with only one computation phase. The main idea is to restrict synchronization to adjacent workgroups only, thereby eliminating global barrier synchronization completely. The new approach requires only 2N global memory accesses and just one kernel invocation. On top of this we propose two important optimizations to further boost performance: thread grouping, which eliminates unnecessary local barriers, and register optimization, which expands the on-chip problem size. We designed an auto-tuning framework that searches the parameter space automatically to generate highly optimized code for both AMD and Nvidia GPUs. We implemented our technique with OpenCL. Compared with previous fast scan implementations, experimental results not only show promising speedups but also reveal dramatically different optimization tradeoffs between Nvidia and AMD GPU platforms. © 2013 ACM.
Indexed by	SCI ; EI
Proceedings place of publication	Association for Computing Machinery, General Post Office, P.O. Box 30777, NY 10087-0777, United States
Language	English
ISSN	0362-1340
ISBN	9781450319225
WOS accession number	WOS:000324158900022
Content type	Conference paper
Source URL	[http://ir.iscas.ac.cn/handle/311060/16554]
Collection	Institute of Software_Institute of Software Library_Conference Papers
Recommended citation (GB/T 7714)	Yan, Shengen, Long, Guoping, Zhang, Yunquan. StreamScan: Fast scan algorithms for GPUs without global barrier synchronization[C]. In: 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2013. Shenzhen, China. February 23, 2013 - February 27, 2013.
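
The abstract describes the core idea as replacing global barriers with synchronization between adjacent workgroups only. The following is a minimal CPU-side sketch of that idea, not the authors' OpenCL implementation: Python threads stand in for workgroups, and the names `stream_scan`, `block_size`, `carry`, and `ready` are hypothetical, chosen for illustration. Each block reads its input once, waits only on its left neighbour for a running total, publishes its own total for the right neighbour, and writes its output once, which is how the 2N access count would arise. On a real GPU this pattern presumably also depends on workgroups being scheduled in an order that lets each predecessor make progress.

```python
import threading
from itertools import accumulate


def stream_scan(data, block_size=4):
    """Inclusive prefix sum where each block syncs only with its left neighbour."""
    n = len(data)
    num_blocks = (n + block_size - 1) // block_size
    out = [0] * n
    # carry[b] holds the inclusive total of all elements up to the end of block b.
    carry = [0] * num_blocks
    # ready[b] is set once carry[b] has been published (hypothetical flag array).
    ready = [threading.Event() for _ in range(num_blocks)]

    def worker(b):
        lo, hi = b * block_size, min((b + 1) * block_size, n)
        # Single read of each input element, followed by a block-local scan.
        local = list(accumulate(data[lo:hi]))
        prefix = 0
        if b > 0:
            ready[b - 1].wait()        # wait on the adjacent block only
            prefix = carry[b - 1]
        carry[b] = prefix + local[-1]
        ready[b].set()                 # release the right neighbour before writing
        # Single write of each output element.
        for i, value in enumerate(local):
            out[lo + i] = prefix + value

    threads = [threading.Thread(target=worker, args=(b,)) for b in range(num_blocks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

For example, `stream_scan([1, 2, 3, 4, 5], block_size=2)` returns `[1, 3, 6, 10, 15]`. Note that no thread ever waits on all other threads at once; the only dependency is the chain from block b-1 to block b.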
