Memory benchmarks: Addendum primum
September 15, 2015
You might have noticed that I missed one case in my previous blog post: performing multiple independent random reads within a single thread. I have expanded the code to cover it:
    switch (inThreadParallelismLevel)
    {
        case 1:
        {
            for (int i = 0; i < steps; i++)
            {
                total0 += array[(11587L * i) & mask];
            }
            break;
        }
        case 2:
        {
            for (int i = 0; i < steps; i++)
            {
                total0 += array[(24317L * i) & mask];
                total1 += array[(14407L * i) & mask];
            }
            break;
        }
        . . .
We saw that random read performance scaled nicely with the number of threads (roughly linearly up to 5 threads on the 6-core CPU). So the question is whether we can achieve the same by performing multiple independent random reads within a single thread.
Rather disappointingly, this is not the case. Here are the results for the 8-core machine; the graph shows the memory bandwidth versus the number of concurrent requests within a single thread:
So, somewhat surprisingly, there is no advantage at all in issuing multiple independent reads within a single thread. The scaling observed in the case of chained reads doesn't happen here. Mind you, it is already going about 4 times faster here, so there is less potential for scaling. More importantly, the scaling we saw across threads/cores does not appear within a thread. So it looks like random reads only scale if the memory read requests originate from different cores rather than from a single one.
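For anyone who wants to try this on their own machine, here is a minimal Java sketch of such a measurement. It is not the exact harness behind the results above. The multipliers are taken from the snippet earlier in the post, but the class name, the array size, the step count and the warm-up pass are my assumptions for illustration, and only levels 1 and 2 are covered. The index is cast to int because Java arrays cannot be indexed with a long.

```java
final class InThreadReads {
    // Sums array elements using `level` independent index streams per iteration.
    // The array length is assumed to be a power of two so that (length - 1) is a valid mask.
    static long run(long[] array, int steps, int level) {
        long mask = array.length - 1;
        long total0 = 0, total1 = 0;
        switch (level) {
            case 1:
                for (int i = 0; i < steps; i++) {
                    total0 += array[(int) ((11587L * i) & mask)];
                }
                break;
            case 2:
                for (int i = 0; i < steps; i++) {
                    // The two reads do not depend on each other.
                    total0 += array[(int) ((24317L * i) & mask)];
                    total1 += array[(int) ((14407L * i) & mask)];
                }
                break;
            default:
                throw new IllegalArgumentException("unsupported level: " + level);
        }
        return total0 + total1;
    }

    // Times one run and returns the number of reads per second.
    static double readsPerSecond(long[] array, int steps, int level) {
        long start = System.nanoTime();
        long sink = run(array, steps, level);
        long elapsedNanos = System.nanoTime() - start;
        // Use the result so the JIT cannot discard the loop as dead code.
        if (sink == Long.MIN_VALUE) {
            System.out.println(sink);
        }
        return (double) steps * level / (elapsedNanos / 1e9);
    }

    public static void main(String[] args) {
        // Assumed size: 2^23 longs = 64 MiB, meant to exceed typical CPU caches.
        long[] array = new long[1 << 23];
        for (int i = 0; i < array.length; i++) {
            array[i] = i;
        }
        int steps = 20_000_000; // assumed step count
        for (int level = 1; level <= 2; level++) {
            readsPerSecond(array, steps, level); // warm-up pass for the JIT
            double rate = readsPerSecond(array, steps, level);
            System.out.printf("level %d: %.1f million reads/s%n", level, rate / 1e6);
        }
    }
}
```

If in-thread parallelism helped, the reads-per-second figure for level 2 would be noticeably higher than for level 1. A single timed pass like this is noisy, so treat the numbers as indicative only; a proper benchmark harness would repeat the runs and report a distribution.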