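To meet customer performance requirements for an IaaS private cloud, virtual machine memory performance must reach a defined level. Install the STREAM benchmark on the virtual machine and, with no applications running, measure memory read/write throughput with 4 threads. Results are reported in MB/s.
The following four metrics together count as a single test item:
1) Copy ≥ 50000 MB/s.
2) Scale ≥ 50000 MB/s.
3) Add ≥ 50000 MB/s.
4) Triad ≥ 50000 MB/s.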
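The pass/fail rule above can be sketched in C as follows. This is a minimal illustration and not part of STREAM: the helper names `parse_rate` and `all_pass` and the macro `REQUIRED_MBPS` are hypothetical, and the parser assumes result lines in the layout STREAM prints (`Copy: <rate> <avg> <min> <max>`).

```c
#include <stdio.h>
#include <string.h>

/* Threshold from the requirement above: every kernel must reach 50000 MB/s. */
#define REQUIRED_MBPS 50000.0

/* Hypothetical helper: extract the "Best Rate MB/s" column from one STREAM
 * result line, e.g. "Copy:  14719.4  0.010967  0.010870  0.011090".
 * Returns 1 and stores the rate in *rate if the line belongs to `kernel`,
 * otherwise returns 0 and leaves *rate untouched. */
int parse_rate(const char *line, const char *kernel, double *rate)
{
    char name[16];
    double value;

    /* Read the label up to the colon, then the first number after it. */
    if (sscanf(line, "%15[^:]: %lf", name, &value) != 2)
        return 0;
    if (strcmp(name, kernel) != 0)
        return 0;
    *rate = value;
    return 1;
}

/* Returns 1 only if all four kernels (Copy, Scale, Add, Triad) meet the
 * required rate; the four metrics count as one test item. */
int all_pass(const double rates[4], double required)
{
    for (int i = 0; i < 4; i++) {
        if (rates[i] < required)
            return 0;
    }
    return 1;
}
```

If any one of the four rates falls below `REQUIRED_MBPS`, `all_pass` returns 0 and the whole test item fails.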
# 1. Downloading the STREAM benchmark
The benchmark source is available at https://www.cs.virginia.edu/stream/FTP/Code/. For the C version, only one file is needed: stream.c.
# 2. Compiling
The C source can be compiled with gcc or g++. A basic build uses:
gcc -O stream.c -o stream_exe
To measure multi-core memory bandwidth, add the OpenMP option:
gcc -O -fopenmp stream.c -o stream_omp_exe
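# 3. Running and analyzing the results
Run the benchmark with ./stream_omp_exe. If multithreading does not take effect, set the number of threads before running, for example export OMP_NUM_THREADS=20 (use 4 for the 4-thread test required above). The variable must be exported and spelled exactly OMP_NUM_THREADS, and it only affects the binary built with -fopenmp; a binary compiled without that option always runs single-threaded.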
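The four operations, Copy, Scale, Add, and Triad, work as follows.
Copy is the simplest operation: it reads a value from one memory location and writes it to another, for 2 memory accesses.
Scale is a multiplication: it reads a value from one memory location, multiplies it by the constant scale, and writes the result to another location, again for 2 memory accesses.
Add is an addition: it reads one value from each of two memory locations, adds them, and writes the result to a third location, for 2 reads and 1 write, 3 memory accesses in total.
Triad combines the previous three: it reads a value from memory and multiplies it by scale, reads a value from another location and adds it to that product, then writes the result back to memory. This also takes 2 reads and 1 write, 3 memory accesses in total.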
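For reference, the four kernels can be written as plain C loops. The sketch below mirrors the loop bodies in stream.c, but the function names and signatures are introduced here for illustration only; stream.c itself runs the loops over static global arrays a, b, and c.

```c
#include <stddef.h>

/* Copy: c[j] = a[j]              (1 read, 1 write) */
void stream_copy(double *c, const double *a, size_t n)
{
    for (size_t j = 0; j < n; j++)
        c[j] = a[j];
}

/* Scale: b[j] = scalar * c[j]    (1 read, 1 write) */
void stream_scale(double *b, const double *c, double scalar, size_t n)
{
    for (size_t j = 0; j < n; j++)
        b[j] = scalar * c[j];
}

/* Add: c[j] = a[j] + b[j]        (2 reads, 1 write) */
void stream_add(double *c, const double *a, const double *b, size_t n)
{
    for (size_t j = 0; j < n; j++)
        c[j] = a[j] + b[j];
}

/* Triad: a[j] = b[j] + scalar * c[j]   (2 reads, 1 write) */
void stream_triad(double *a, const double *b, const double *c,
                  double scalar, size_t n)
{
    for (size_t j = 0; j < n; j++)
        a[j] = b[j] + scalar * c[j];
}
```

In stream.c each of these loops is preceded by `#pragma omp parallel for`, which is how the -fopenmp build spreads the work across threads.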
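In the run recorded below, the three-access kernels Add (16492.4 MB/s) and Triad (16534.7 MB/s) come out ahead of the two-access kernels Copy (14719.4 MB/s) and Scale (14241.2 MB/s). The usual explanation is that the more memory accesses an operation issues, the more memory latency can be hidden by overlapping them, so the reported bandwidth is higher. Conversely, the more complex the arithmetic, the longer each operation and the whole run take, and the lower the resulting bandwidth. That is why the three-access operations report more bandwidth than the two-access ones, and why Copy outperforms Scale. By the same argument Add would be expected to edge out Triad, but in this run the two differ by less than 0.3%, with Triad marginally ahead, which is too small a gap to rank them.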
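The reported rate follows directly from the number of bytes each kernel is counted as moving. The sketch below assumes the convention used by stream.c (1 MB = 10^6 bytes, rate computed from the best time); the function name `stream_mbps` is hypothetical.

```c
#include <stddef.h>

/* Bandwidth in MB/s for one kernel.
 *   accesses_per_elem: 2 for Copy and Scale, 3 for Add and Triad
 *   bytes_per_elem:    size of one array element (8 for double)
 *   n:                 number of array elements
 *   seconds:           best (minimum) time measured for the kernel */
double stream_mbps(size_t accesses_per_elem, size_t bytes_per_elem,
                   size_t n, double seconds)
{
    double bytes = (double)accesses_per_elem * (double)bytes_per_elem * (double)n;
    return 1.0e-6 * bytes / seconds;
}
```

For Copy in the run recorded below (10000000 elements of 8 bytes, 2 accesses, minimum time 0.010870 s), `stream_mbps(2, 8, 10000000, 0.010870)` gives roughly 14719 MB/s, in line with the reported 14719.4 MB/s.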
[root@ecs-3b4a ~]# wget https://www.cs.virginia.edu/stream/FTP/Code/stream.c
--2022-12-12 12:16:59-- https://www.cs.virginia.edu/stream/FTP/Code/stream.c
Resolving www.cs.virginia.edu (www.cs.virginia.edu)... 128.143.67.11
Connecting to www.cs.virginia.edu (www.cs.virginia.edu)|128.143.67.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19967 (19K) [text/plain]
Saving to: ‘stream.c’
100%[==================================================================================================================================================>] 19,967 87.4KB/s in 0.2s
2022-12-12 12:17:02 (87.4 KB/s) - ‘stream.c’ saved [19967/19967]
[root@ecs-3b4a ~]# ls
byte-unixbench-5.1.3 stream.c v5.1.3
[root@ecs-3b4a ~]# gcc -O stream.c -o stream_exe
[root@ecs-3b4a ~]# ls
byte-unixbench-5.1.3 stream.c stream_exe v5.1.3
[root@ecs-3b4a ~]# ll
total 184
drwxrwxr-x 3 root root 4096 Jun 5 2015 byte-unixbench-5.1.3
-rw-r--r-- 1 root root 19967 Nov 30 2021 stream.c
-rwxr-xr-x 1 root root 12960 Dec 12 12:18 stream_exe
-rw-r--r-- 1 root root 145908 Dec 12 12:10 v5.1.3
[root@ecs-3b4a ~]# export OMP_NUM_THREADS=4
[root@ecs-3b4a ~]# ./stream_exe
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 7233 microseconds.
(= 7233 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 14719.4 0.010967 0.010870 0.011090
Scale: 14241.2 0.011349 0.011235 0.011477
Add: 16492.4 0.014596 0.014552 0.014676
Triad: 16534.7 0.014555 0.014515 0.014691
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
[root@ecs-3b4a ~]#




