OpenBLAS: A High Performance BLAS Library on Loongson 3A CPU
Zhang Xian-Yi ; Wang Qian ; Zhang Yun-Quan
Journal: Ruan Jian Xue Bao/Journal of Software
Publication year: 2011
Volume: 22 Issue: SUPPL. 2 Pages: 208-216
Keywords: Computer software; Software engineering
ISSN: 1000-9825
Abstract: BLAS is a fundamental math library in scientific computing, so each CPU vendor releases an optimized BLAS library for its own CPUs. The Loongson CPU series is developed by the Institute of Computing Technology, Chinese Academy of Sciences, which released the Loongson 3 CPU series in 2010. This paper introduces the open source BLAS library OpenBLAS, which is forked from the GotoBLAS2 1.13 BSD version. The BLAS Level 3 functions of OpenBLAS are optimized for the Loongson 3A quad-core CPU. The sequential optimizations use blocking, a hand-coded assembly kernel, Loongson 3A special instructions, and instruction reordering. The performance of the BLAS Level 3 subroutines exceeded GotoBLAS and ATLAS by about 75% and 17%, respectively. In double precision functions, it exceeded GotoBLAS and ATLAS by about 103% and 36%. For the parallel multi-threaded optimization, this study used an interleaved data buffer layout to avoid conflicts among threads in the shared L2 cache. OpenBLAS achieved a speedup of 3.47 on four cores. With 4 threads, the performance of the OpenBLAS BLAS Level 3 functions exceeded GotoBLAS and ATLAS by about 69% and 34%, and by about 89% and 55% in double precision functions. ©2011 Journal of Software.
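Indexed by: EI
Language: Chinese
Date available: 2013-10-08
Content type: Journal article
Source URL: [http://ir.iscas.ac.cn/handle/311060/16164]
Collection: Institute of Software_Library of the Institute of Software_Journal articles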
Recommended citation
GB/T 7714
Zhang Xian-Yi, Wang Qian, Zhang Yun-Quan. OpenBLAS: a high performance BLAS library on Loongson 3A CPU[J]. Ruan Jian Xue Bao/Journal of Software, 2011, 22(SUPPL. 2): 208-216.
APA Zhang Xian-Yi, Wang Qian, & Zhang Yun-Quan. (2011). OpenBLAS: a high performance BLAS library on Loongson 3A CPU. Ruan Jian Xue Bao/Journal of Software, 22(SUPPL. 2), 208-216.
MLA Zhang Xian-Yi, et al. "OpenBLAS: a high performance BLAS library on Loongson 3A CPU". Ruan Jian Xue Bao/Journal of Software 22.SUPPL. 2 (2011): 208-216.