{"id":86,"date":"2024-03-20T17:27:01","date_gmt":"2024-03-20T15:27:01","guid":{"rendered":"https:\/\/tel-zur.net\/blog\/?p=86"},"modified":"2024-03-20T17:27:01","modified_gmt":"2024-03-20T15:27:01","slug":"the-roofline-model","status":"publish","type":"post","link":"https:\/\/tel-zur.net\/blog\/2024\/03\/20\/the-roofline-model\/","title":{"rendered":"The Roofline model"},"content":{"rendered":"\n<p class=\"has-text-align-left\">Guy Tel-Zur, March 20, 2024<\/p>\n\n\n\n<p>In this blog post I will explain what the roofline model is, why it is important, how to measure the achieved performance of a computer program, and how that performance compares to the theoretical peak performance of the computer. The model characterizes a program by the ratio between the computational work done and the memory traffic required to support that computation. This ratio is called the <em>arithmetic intensity<\/em> and it is measured in units of (#floating-point operations)\/(#bytes transferred between the memory and the CPU). An excellent paper describing the roofline model is given in [1] and its cover page is shown in the next figure.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"540\" height=\"809\" src=\"https:\/\/tel-zur.net\/blog\/wp-content\/uploads\/2024\/03\/roofline_paper.png\" alt=\"The Roofline model paper.\" class=\"wp-image-89\" title=\"Figure 1.\" srcset=\"https:\/\/tel-zur.net\/blog\/wp-content\/uploads\/2024\/03\/roofline_paper.png 540w, https:\/\/tel-zur.net\/blog\/wp-content\/uploads\/2024\/03\/roofline_paper-200x300.png 200w\" sizes=\"auto, (max-width: 540px) 100vw, 540px\" \/><\/figure>\n\n\n\n<p>As a test case I used the famous <em>stream<\/em> benchmark. At its core, stream executes the following computational kernel:<\/p>\n\n\n\n<p class=\"has-text-align-center\">c[i] = a[i] + b[i];<\/p>\n\n\n\n<p>Where a, b and c are large arrays. 
The arithmetic intensity in this case follows from 1 floating-point operation (&#8216;+&#8217;) and 3 data movements (read a and b from memory and write back c). If a, b, and c are of type <em>float<\/em>, each element occupies 4 bytes and the total data movement is 12 bytes, therefore the arithmetic intensity is 1\/12, which is about 0.083 FLOP\/Byte. We will test this prediction later on. The official stream benchmark can be downloaded from [3]. However, for my purpose this code seems to be too complex, and according to [2] the roofline results that it produces may be misleading. Therefore, I wrote a simple stream code myself. The reference code is enclosed in the code section below.<\/p>\n\n\n<pre class=\"wp-block-code synano\" data-theme=\"default\"><code class=\"language-auto\">#include &lt;stdio.h&gt;\n#include &lt;stdlib.h&gt; \/\/ for random numbers\n#include &lt;omp.h&gt;    \/\/ for omp_get_wtime()\n\n#define SIZE 5000000  \/\/ size of arrays\n#define REPS 1000    \/\/ number of repetitions to make the program run longer\n\nfloat a[SIZE],b[SIZE],c[SIZE];  \ndouble t_start, t_finish, t;\nint i,j;\n\nint main() {\n\n\/\/ initialize arrays\nfor (i=0; i&lt;SIZE; i++) {\n    a[i] = (float)rand();\n    b[i] = (float)rand();\n    c[i] = 0.;\n}\n\n\/\/ compute c[i] = a[i] + b[i]\nt_start = omp_get_wtime();\nfor (j=0; j&lt;REPS; j++)\n    for (i=0; i&lt;SIZE; i++)\n        c[i] = a[i] + b[i];\nt_finish = omp_get_wtime();\n\nt = t_finish - t_start;\n\nprintf(&quot;Run summary\\n&quot;);\nprintf(&quot;=================\\n&quot;);\nprintf(&quot;Array size: %d\\n&quot;,SIZE);\nprintf(&quot;Total time (sec.):%f\\n&quot;,t);\n\n\/\/ That&#039;s it!\nreturn 0;\n}<\/code><\/pre>\n\n\n<p><strong>The computational environment<\/strong><\/p>\n\n\n\n<p>I used a laptop with an Intel Core i7 CPU and 8GB of RAM, running Linux Mint 21.3. The compiler was from Intel&#8217;s oneAPI (version 2024.0.2), and I used Intel Advisor for measuring and visualizing the roofline. 
If you want to reproduce my test, the first step is to prepare the environment as shown here:<\/p>\n\n\n<pre class=\"wp-block-code synano\" data-theme=\"default\"><code class=\"language-auto\">$ source ~\/path\/to\/opt\/intel\/oneapi\/setvars.sh \n # change the line above according to the path in your file system\n:: initializing oneAPI environment ...\n   bash: BASH_VERSION = 5.1.16(1)-release\n   args: Using &quot;$@&quot; for setvars.sh arguments: \n:: advisor -- latest\n:: ccl -- latest\n:: compiler -- latest\n:: dal -- latest\n:: debugger -- latest\n:: dev-utilities -- latest\n:: dnnl -- latest\n:: dpcpp-ct -- latest\n:: dpl -- latest\n:: inspector -- latest\n:: ipp -- latest\n:: ippcp -- latest\n:: itac -- latest\n:: mkl -- latest\n:: mpi -- latest\n:: tbb -- latest\n:: vtune -- latest\n:: oneAPI environment initialized ::<\/code><\/pre>\n\n\n<p>Another preparation step is setting <em>ptrace_scope<\/em> to 0, otherwise Advisor won&#8217;t work (a value written this way is reset at the next reboot):<\/p>\n\n\n<pre class=\"wp-block-code synano\" data-theme=\"default\"><code class=\"language-auto\">$ cat \/proc\/sys\/kernel\/yama\/ptrace_scope\n1\n$ echo &quot;0&quot;|sudo tee \/proc\/sys\/kernel\/yama\/ptrace_scope\n[sudo] password for your_user_name:              \n0<\/code><\/pre>\n\n\n<p><strong>The results<\/strong><\/p>\n\n\n\n<p>First, I tested the un-optimized version listed above. The measured point sits at 0.028 FLOP\/Byte. This is lower than the theoretical prediction, which means that more effort is needed to improve the code. 
The roofline result of this un-optimized version is shown here:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"537\" src=\"https:\/\/tel-zur.net\/blog\/wp-content\/uploads\/2024\/03\/guy_stream_not_optimized-1024x537.png\" alt=\"\" class=\"wp-image-93\" style=\"width:659px;height:auto\" srcset=\"https:\/\/tel-zur.net\/blog\/wp-content\/uploads\/2024\/03\/guy_stream_not_optimized-1024x537.png 1024w, https:\/\/tel-zur.net\/blog\/wp-content\/uploads\/2024\/03\/guy_stream_not_optimized-300x157.png 300w, https:\/\/tel-zur.net\/blog\/wp-content\/uploads\/2024\/03\/guy_stream_not_optimized-768x402.png 768w, https:\/\/tel-zur.net\/blog\/wp-content\/uploads\/2024\/03\/guy_stream_not_optimized.png 1496w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>One can verify that the CPU spent most of its time in the main loop:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"652\" height=\"215\" src=\"https:\/\/tel-zur.net\/blog\/wp-content\/uploads\/2024\/03\/profiling.png\" alt=\"\" class=\"wp-image-94\" srcset=\"https:\/\/tel-zur.net\/blog\/wp-content\/uploads\/2024\/03\/profiling.png 652w, https:\/\/tel-zur.net\/blog\/wp-content\/uploads\/2024\/03\/profiling-300x99.png 300w\" sizes=\"auto, (max-width: 652px) 100vw, 652px\" \/><\/figure>\n\n\n\n<p>In the recommendations section Intel Advisor states: &#8220;The performance of the loop is bounded by the private cache bandwidth. The bandwidth of the shared cache and DRAM may degrade performance.<br>To improve performance: <em>Improve caching efficiency.\u00a0The loop is also scalar. <strong>To fix: Vectorize the loop<\/strong><\/em>&#8221;. Accordingly, in the next step I repeated the roofline measurement with a vectorized executable. 
The compilation command I used is:<\/p>\n\n\n<pre class=\"wp-block-code synano\" data-theme=\"default\"><code class=\"language-auto\">icx -g -O3 -qopt-report-file=guy_stream.txt -qopenmp -o guy_stream_vec .\/guy_stream_vec.c<\/code><\/pre>\n\n\n<p>and the vectorization report says:<\/p>\n\n\n<pre class=\"wp-block-code synano\" data-theme=\"default\"><code class=\"language-auto\">Global optimization report for : main\n\nLOOP BEGIN at .\/guy_stream.c (15, 1)\n    remark #15521: Loop was not vectorized: loop control variable was not identified. Explicitly compute the iteration count before executing the loop or try using canonical loop form from OpenMP specification\nLOOP END\n\nLOOP BEGIN at .\/guy_stream.c (23, 1)\n    remark #15553: loop was not vectorized: outer loop is not an auto-vectorization candidate.\n\n    LOOP BEGIN at .\/guy_stream.c (24, 5)\n        remark #15300: LOOP WAS VECTORIZED\n        remark #15305: vectorization support: vector length 4\n    LOOP END\nLOOP END<\/code><\/pre>\n\n\n<p>This time the roofline plot shows a performance improvement compared to the non-optimized code. However, in both cases the performance bottleneck is still the DRAM bandwidth, as expected. 
The vectorized roofline plot is shown here:<\/p>\n\n\n\n<figure class=\"wp-block-gallery has-nested-images columns-default is-cropped wp-block-gallery-1 is-layout-flex wp-block-gallery-is-layout-flex\">\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"485\" data-id=\"96\" src=\"https:\/\/tel-zur.net\/blog\/wp-content\/uploads\/2024\/03\/vectorization_1-1024x485.png\" alt=\"\" class=\"wp-image-96\" srcset=\"https:\/\/tel-zur.net\/blog\/wp-content\/uploads\/2024\/03\/vectorization_1-1024x485.png 1024w, https:\/\/tel-zur.net\/blog\/wp-content\/uploads\/2024\/03\/vectorization_1-300x142.png 300w, https:\/\/tel-zur.net\/blog\/wp-content\/uploads\/2024\/03\/vectorization_1-768x364.png 768w, https:\/\/tel-zur.net\/blog\/wp-content\/uploads\/2024\/03\/vectorization_1.png 1510w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/figure>\n\n\n\n<p>This time the measured arithmetic intensity is 0.083 FLOP\/Byte, which matches our theoretical prediction! 
This means that although the code hasn&#8217;t changed, the compiler managed to execute more &#8216;add&#8217; instructions per unit of time, in parallel, thanks to vectorization:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"487\" height=\"151\" src=\"https:\/\/tel-zur.net\/blog\/wp-content\/uploads\/2024\/03\/Screenshot-at-2024-03-19-16-37-07.png\" alt=\"\" class=\"wp-image-97\" srcset=\"https:\/\/tel-zur.net\/blog\/wp-content\/uploads\/2024\/03\/Screenshot-at-2024-03-19-16-37-07.png 487w, https:\/\/tel-zur.net\/blog\/wp-content\/uploads\/2024\/03\/Screenshot-at-2024-03-19-16-37-07-300x93.png 300w\" sizes=\"auto, (max-width: 487px) 100vw, 487px\" \/><\/figure>\n\n\n\n<p>Another possible optimization one could think of is aligning the arrays in memory:<\/p>\n\n\n<pre class=\"wp-block-code synano\" data-theme=\"default\"><code class=\"language-auto\">__attribute__((aligned (64)))<\/code><\/pre>\n\n\n<p>However, adding this requirement did not improve the performance much either. It seems that we really reached the performance wall, and the reason is that the bottleneck isn&#8217;t in the computation but in the DRAM bus performance.<\/p>\n\n\n\n<p>As a last step I tried another optimization technique: multi-threading, i.e. parallelizing the code with OpenMP. Adding an OpenMP parallel-for pragma causes the computational kernel to be computed in parallel. However, once again, there wasn&#8217;t any performance improvement.<\/p>\n\n\n<pre class=\"wp-block-code synano\" data-theme=\"default\"><code class=\"language-auto\">\/\/ i is a file-scope variable, so give each thread its own copy\n#pragma omp parallel for private(i)\nfor (j=0; j&lt;REPS; j++)\n    for (i=0; i&lt;SIZE; i++)\n        c[i] = a[i] + b[i];<\/code><\/pre>\n\n\n<p>To conclude, the roofline model is a powerful tool for locating the performance bottlenecks in the code. 
Unfortunately, as long as we are limited by the DRAM (or the caches), there isn&#8217;t much we can do to improve the performance. The CPU could process more operations on new data, but since the memory is slow the performance is poor. This is a challenging issue that is left for future computer architectures.<\/p>\n\n\n\n<p>If you enjoyed this article you are invited to leave a comment below. You can also subscribe to my <a href=\"https:\/\/www.youtube.com\/@tel-zur_computing\" data-type=\"link\" data-id=\"https:\/\/www.youtube.com\/@tel-zur_computing\">YouTube channel<\/a> (@tel-zur_computing) and follow me on <a href=\"https:\/\/twitter.com\/telzur\" data-type=\"link\" data-id=\"https:\/\/twitter.com\/telzur\">X<\/a> and <a href=\"https:\/\/www.linkedin.com\/in\/telzur\/\">LinkedIn<\/a>.<\/p>\n\n\n\n<p><strong>References:<\/strong><\/p>\n\n\n\n<p>[1] Samuel Williams, Andrew Waterman, and David Patterson, &#8220;<a href=\"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/1498765.1498785\" data-type=\"link\" data-id=\"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/1498765.1498785\" target=\"_blank\" rel=\"noreferrer noopener\">Roofline: An Insightful Visual Performance Model for Multicore Architectures<\/a>&#8221;, Communications of the ACM, April 2009, vol. 52, no. 4, pp. 65-76. 
<\/p>\n\n\n\n<p>[2] Supplementary material to [1]: <a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/1498765.1498785#sup\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/dl.acm.org\/doi\/10.1145\/1498765.1498785#sup<\/a><\/p>\n\n\n\n<p>[3] Stream, <a href=\"https:\/\/www.cs.virginia.edu\/stream\/ref.html\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/www.cs.virginia.edu\/stream\/ref.html<\/a><\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Guy Tel-Zur, March 20, 2024 In this blog post I will explain what is the roofline model, its importance and how to measure the achieved performance of a computer program, and how it is compared to the peak theoretical performance of the computer. According to this model we measure the performance of a computer program &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/tel-zur.net\/blog\/2024\/03\/20\/the-roofline-model\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;The Roofline 
model&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[],"class_list":["post-86","post","type-post","status-publish","format-standard","hentry","category-performance","entry"],"_links":{"self":[{"href":"https:\/\/tel-zur.net\/blog\/wp-json\/wp\/v2\/posts\/86","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/tel-zur.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/tel-zur.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/tel-zur.net\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/tel-zur.net\/blog\/wp-json\/wp\/v2\/comments?post=86"}],"version-history":[{"count":21,"href":"https:\/\/tel-zur.net\/blog\/wp-json\/wp\/v2\/posts\/86\/revisions"}],"predecessor-version":[{"id":113,"href":"https:\/\/tel-zur.net\/blog\/wp-json\/wp\/v2\/posts\/86\/revisions\/113"}],"wp:attachment":[{"href":"https:\/\/tel-zur.net\/blog\/wp-json\/wp\/v2\/media?parent=86"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/tel-zur.net\/blog\/wp-json\/wp\/v2\/categories?post=86"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/tel-zur.net\/blog\/wp-json\/wp\/v2\/tags?post=86"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}