How to convert a VirtualBox image to an AWS AMI?

Guy Tel-Zur, March 26, 2024

During the pandemic, when we were all isolated, I faced the challenge of keeping my students practicing parallel programming in my “Introduction to Parallel Processing” course. The students couldn’t meet at the computer lab, so I developed a VirtualBox image with all the tools I needed for the course (a Linux machine with a compiler, MPI, HTCondor, and profiling tools such as TAU and Scalasca). This idea of a parallel processing full-stack virtual machine (VM) is not unique or new; for example, there is an excellent tool from the E4S project. However, I preferred to create my own image, customized to my syllabus. The VM allowed the students to import a ready-to-use infrastructure into their private computers with zero effort. The VM settings are shown in the next figure:

The VM desktop, which is based on CentOS 8, is shown here:

Since then, I have kept using and upgrading this tool as an easy alternative to a full-scale educational departmental cluster. Of course, this isn’t a tool for breaking performance records, but it is quite convenient for educational purposes. However, there are some limitations. First, the VM cannot work on computers that are too old; the minimum requirements are at least 4GB RAM, 2 cores, and a few tens of GB of storage. Another significant limitation is that it isn’t possible to test how the codes scale as one increases the number of parallel tasks (because the VM was limited to only 2 cores). Therefore, important terms like speedup and efficiency could not be demonstrated. Nevertheless, I decided to preserve this concept of an easy-to-use, full-stack single machine as a complementary tool, but I also wanted to migrate it to the cloud so that anyone would be able to test the instance with many cores! Transferring the VM to the cloud turned out to be a challenging task, and I decided to summarize it here to ease your life in case you also want to convert a VirtualBox image (as an ova file) to an Amazon Web Services (AWS) machine image (AMI). Hopefully, after reading this post you will be able to complete that task in a fraction of the time I spent resolving all the challenges.
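As a reminder of what those terms mean, speedup and efficiency are simple ratios; a quick sketch with made-up timings (the numbers below are illustrative, not measurements from the VM):

```shell
# Speedup S = T1/Tp and efficiency E = S/p for p parallel tasks.
# t1 and tp are hypothetical runtimes in seconds on 1 and on p cores.
awk -v t1=10.0 -v tp=2.5 -v p=8 \
    'BEGIN { s = t1 / tp; printf "speedup = %.2f, efficiency = %.2f\n", s, s / p }'
```

With these example numbers the speedup is 4.00 and the efficiency 0.50 — exactly the kind of measurement a 2-core VM cannot demonstrate.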

Step 1: Export the VM to an OVA file (an archive in the Open Virtualization Format). This part is easy: just click “File” –> “Export Appliance”. It is good practice to remove the .bash_history file before exporting the VM, so that the history of the commands you used prior to that moment is cleared.

Step 2: Assuming that you already have an AWS account and have installed the AWS command-line tools and credentials, create an S3 bucket and copy your ova file into it:

aws s3 cp ./HPC_VM_2024.3.ova s3://gtz-vm-bucket/

This may take a few minutes, be patient.
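Note that the bucket has to exist before the copy; if it doesn’t, it can be created first (the bucket name below is just the one used in this post — pick your own, globally unique name):

```shell
# Create the S3 bucket that will hold the OVA (bucket names are global, so choose a unique one).
aws s3 mb s3://gtz-vm-bucket
```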

Step 3: Security matters. You need to create a policy and a role that allow the VM import service to handle the image:

aws iam create-role --role-name vmimport --assume-role-policy-document file://trust-policy.json

Where the policy JSON file is:

$ more ./trust-policy.json 
{
   "Version": "2012-10-17",
   "Statement": [
      {
         "Effect": "Allow",
         "Principal": { "Service": "vmie.amazonaws.com" },
         "Action": "sts:AssumeRole",
         "Condition": {
            "StringEquals":{
               "sts:Externalid": "vmimport"
            }
         }
      }
   ]
}

You can read more about this step here. You also need to attach a policy to the role, defined in another JSON file (use your own bucket name in the Resource ARNs):

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Action":[
            "s3:GetBucketLocation",
            "s3:GetObject",
            "s3:ListBucket"
         ],
         "Resource":[
            "arn:aws:s3:::gtz-vm-bucket",
            "arn:aws:s3:::gtz-vm-bucket/*"
         ]
      },
      {
         "Effect":"Allow",
         "Action":[
            "ec2:ModifySnapshotAttribute",
            "ec2:CopySnapshot",
            "ec2:RegisterImage",
            "ec2:Describe*"
         ],
         "Resource":"*"
      }
   ]
}

and the command to set it is:

aws iam put-role-policy --role-name vmimport --policy-name vmimport --policy-document file://role-policy.json

Step 4: Now it is time to convert the ova in S3 into an AMI using this command:

aws ec2 import-image --description "HPC_VM_2024.3" --disk-containers "file://./containers.json"

A few seconds after hitting ‘enter’ you will get a response containing your new import task ID; look for “import-ami-XXXXXXXXX”. A typical response looks like this:

{
    "Description": "HPC_VM_2024.3",
    "ImportTaskId": "import-ami-01478e05448530ae4",
    "Progress": "1",
    "SnapshotDetails": [
        {
            "Description": "Guy Tel-Zur - Parallel Processing Education",
            "DiskImageSize": 0.0,
            "Format": "OVA",
            "UserBucket": {
                "S3Bucket": "gtz-vm-bucket",
                "S3Key": "HPC_VM_2024.3.ova"
            }
        }
    ],
    "Status": "active",
    "StatusMessage": "pending"
}

The content of containers.json is:

[
  {
    "Description": "My Parallel Processing OVA",
    "Format": "ova",
    "UserBucket": {
      "S3Bucket": "gtz-vm-bucket",
      "S3Key": "HPC_VM_2024.3.ova"
    }
  }
]

Replace the Description, S3Bucket, and S3Key fields with your preferences.

Step 5: Checking the progress. At this stage you can execute this status-checking command:

$ aws ec2 describe-import-image-tasks --import-task-ids import-ami-01478e05448530ae4
{
    "ImportImageTasks": [
        {
            "Description": "HPC_VM_2024.2",
            "ImportTaskId": "import-ami-01478e05448530ae4",
            "Progress": "19",
            "SnapshotDetails": [
                {
                    "DiskImageSize": 20736632320.0,
                    "Format": "VMDK",
                    "Status": "active",
                    "UserBucket": {
                        "S3Bucket": "gtz-vm-bucket",
                        "S3Key": "HPC_VM_2024.2.ova"
                    }
                }
            ],
            "Status": "active",
            "StatusMessage": "converting",

and see that “pending” has changed to “converting”, which is a good sign.

But then came problems…

The image was not correctly prepared for the conversion into an AMI and I got the following error message:

$ aws ec2 describe-import-image-tasks --import-task-ids import-ami-01478e05448530ae4
{
    "ImportImageTasks": [
        {
            "Description": "HPC_VM_2024.2",
            "ImportTaskId": "import-ami-01478e05448530ae4",
            "SnapshotDetails": [
                {
                    "DeviceName": "/dev/sde",
                    "DiskImageSize": 20736632320.0,
                    "Format": "VMDK",
                    "Status": "completed",
                    "UserBucket": {
                        "S3Bucket": "gtz-vm-bucket",
                        "S3Key": "HPC_VM_2024.2.ova"
                    }
                }
            ],
            "Status": "deleted",
            "StatusMessage": "ClientError: BLSC-style GRUB found, but unable to detect default kernel",
            "Tags": []
        }
    ]

The way to resolve this error is to return to VirtualBox, boot the image, and make modifications as root. By default, GRUB_ENABLE_BLSCFG is set to true in the /etc/default/grub file. When this variable is set to true, GRUB2 uses blscfg files and the entries in them rather than plain menu entries in grub.cfg. To resolve the “ClientError: BLSC-style GRUB” error on import, set the GRUB_ENABLE_BLSCFG parameter to false in /etc/default/grub: open the file with a text editor, such as nano, and change GRUB_ENABLE_BLSCFG to false. Then, run the following command to rebuild the grub config file:

grub2-mkconfig -o /boot/grub2/grub.cfg
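The manual edit can also be scripted; a rough equivalent, run as root inside the VM, would be (assuming the variable is present and currently set to true):

```shell
# Flip GRUB_ENABLE_BLSCFG from true to false, then rebuild the GRUB configuration.
sed -i 's/^GRUB_ENABLE_BLSCFG=true$/GRUB_ENABLE_BLSCFG=false/' /etc/default/grub
grub2-mkconfig -o /boot/grub2/grub.cfg
```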

To read more about this issue click here. Now, shut down the VM and repeat steps 1..5 (this is time-consuming and tedious).

Then came another “surprise”. Because I had upgraded the image over the years, the VM had several kernels, but sadly the one supported by AWS wasn’t among them. In my case I got this error message:

"StatusMessage": "ClientError: Unsupported kernel version 5.4.156-1.el8.elrepo.x86_64",

It turns out that AWS supports only specific kernel versions for each Linux distribution; check here:

So I had to downgrade the kernel to 4.18.0 and make it the default kernel when booting the image. Then I had to repeat, once again, steps 1..5. Unfortunately, that wasn’t enough! The conversion process failed again, this time due to the presence of the other kernels. I had to completely remove all the other kernels and be left with only the 4.18.0 kernel. Even the rescue kernel disturbed the conversion process:

"StatusMessage": "ClientError: Unsupported kernel version 0-rescue-c02fbb5c652549588dbb069f20f31872",

So I had to go back to the VirtualBox image once more, erase all the other kernels, and repeat steps 1..5.
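On CentOS 8 the kernel cleanup can be done along these lines (a sketch, run as root inside the VM; the ELRepo package names below are examples — list first and adapt to what is actually installed):

```shell
# List every installed kernel package, then remove the unwanted ones.
rpm -qa 'kernel*' | sort

# Remove the unsupported 5.x kernels (hypothetical ELRepo package names).
dnf -y remove kernel-ml kernel-ml-core kernel-ml-modules

# Drop the rescue kernel images, which also break the import.
rm -f /boot/vmlinuz-0-rescue-* /boot/initramfs-0-rescue-*

# Make the remaining supported kernel the default and rebuild GRUB.
grubby --set-default /boot/vmlinuz-4.18.0-*
grub2-mkconfig -o /boot/grub2/grub.cfg
```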

This time it went well, all the way to the end!

{
    "ImportImageTasks": [
        {
            "Architecture": "x86_64",
            "Description": "HPC_VM_2024.3",
            "ImportTaskId": "import-ami-079a7bf1190f50bbf",
            "LicenseType": "BYOL",
            "Platform": "Linux",
            "Progress": "27",
            "SnapshotDetails": [
                {
                    "DeviceName": "/dev/sda1",
                    "DiskImageSize": 20776996864.0,
                    "Format": "VMDK",
                    "Status": "completed",
                    "UserBucket": {
                        "S3Bucket": "gtz-vm-bucket",
                        "S3Key": "HPC_VM_2024.3.ova"
                    }
                }
            ],
            "Status": "active",

Congratulations! Now we can go to the AWS dashboard and find our new AMI in the EC2 panel:

To test the AMI, click on “Launch instance from AMI”. The first instance I tried was a t2.micro, just for testing the connection. A simple ssh connection from the terminal was successful using the generated key pair:
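The connection test itself is a one-liner; the key file name, user, and address below are placeholders — use the key pair you generated, your instance’s public DNS, and whichever user exists inside the VM:

```shell
# Restrict the key's permissions, then connect to the new instance.
chmod 400 my-key.pem
ssh -i my-key.pem ec2-user@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
```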

Then I wanted to test the image in its full glory, so I created another instance, this time with 8 cores (t2.2xlarge). This node exceeds the performance of the VirtualBox image, which was the motivation for the whole exercise:

Indeed now there are 8 happy cores running:

Now it is time for a few parallel computing tests. For that, I used the famous cpi.c program. The following three figures show results from the TAU, Jumpshot, and Scalasca profiling tools, respectively:
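For reference, the basic build-and-run cycle for such a test looks like this (assuming MPI is already installed in the image, as described above):

```shell
# Compile the MPI example and launch it with 8 ranks, one per core.
mpicc -O2 -o cpi cpi.c
mpirun -np 8 ./cpi
```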

Mission achieved! That’s it for now. For further reading check this link.

If you enjoyed this article you are invited to leave a comment below. You can also subscribe to my YouTube channel (@tel-zur_computing) or connect with me on X and on Linkedin.

Good luck!

The Roofline model

Guy Tel-Zur, March 20, 2024

In this blog post I will explain what the roofline model is, why it is important, how to measure the achieved performance of a computer program, and how that compares to the peak theoretical performance of the computer. According to this model, we characterize a computer program by the ratio between the computational work done and the memory traffic required to allow this computation. This ratio is called the arithmetic intensity, and it is measured in units of (#floating-point operations)/(#bytes transferred between the memory and the CPU). An excellent paper describing the roofline model is [1], and its cover page is shown in the next figure.

The Roofline model paper.

As a test case I used the famous stream benchmark. At its core, stream performs the following computational kernel:

c[i] = a[i] + b[i];

Where a, b, and c are large arrays. The kernel in this case consists of 1 floating-point operation (‘+’) and 3 data movements (read a and b from memory, write back c). If a, b, and c are of type float, each element occupies 4 bytes, so the total data movement is 12 bytes; therefore the arithmetic intensity is 1/12, which is about 0.083. We will test this prediction later on. The official stream benchmark can be downloaded from [3]. However, for my purpose that code seemed too complex, and according to [2] the roofline results it produces may be misleading. Therefore, I wrote a simple stream-like code myself. The reference code is enclosed in the code section below.
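The arithmetic in that paragraph can be checked directly (1 floating-point add against 3 moves of 4-byte floats):

```shell
# Arithmetic intensity = FLOPs / bytes moved = 1 / (3 * 4 bytes).
awk 'BEGIN { printf "AI = %.3f FLOP/Byte\n", 1 / 12 }'
```

This prints AI = 0.083 FLOP/Byte, the theoretical prediction we will compare against the Advisor measurements below.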

#include <stdio.h>
#include <stdlib.h> // for random numbers
#include <omp.h>    // for omp_get_wtime()

#define SIZE 5000000  // size of arrays
#define REPS 1000    // number of repetitions to make the program run longer

float a[SIZE],b[SIZE],c[SIZE];  
double t_start, t_finish, t;
int i,j;

int main() {

    // initialize arrays with random data
    for (i = 0; i < SIZE; i++) {
        a[i] = (float)rand();
        b[i] = (float)rand();
        c[i] = 0.;
    }

    // compute c[i] = a[i] + b[i], repeated REPS times
    t_start = omp_get_wtime();
    for (j = 0; j < REPS; j++)
        for (i = 0; i < SIZE; i++)
            c[i] = a[i] + b[i];
    t_finish = omp_get_wtime();

    t = t_finish - t_start;

    printf("Run summary\n");
    printf("=================\n");
    printf("Array size: %d\n", SIZE);
    printf("Total time (sec.): %f\n", t);

    // That's it!
    return 0;
}

The computational environment

I used a laptop running Linux Mint 21.3 with 8GB RAM on an Intel Core i7. The compiler was Intel’s oneAPI (version 2024.0.2), and I used Intel Advisor for measuring and visualizing the roofline. If you want to reproduce my test, the first step is to prepare the environment, as can be seen here:

$ source ~/path/to/opt/intel/oneapi/setvars.sh 
 # change the line above according to the path in your file system
:: initializing oneAPI environment ...
   bash: BASH_VERSION = 5.1.16(1)-release
   args: Using "$@" for setvars.sh arguments: 
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: inspector -- latest
:: ipp -- latest
:: ippcp -- latest
:: itac -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: vtune -- latest
:: oneAPI environment initialized ::

Another one-time preparation step is setting ptrace_scope; otherwise, Advisor won’t work:

$ cat /proc/sys/kernel/yama/ptrace_scope
1
$ echo "0"|sudo tee /proc/sys/kernel/yama/ptrace_scope
[sudo] password for your_user_name:              
0

The results

First, I tested the unoptimized version listed above. The measured point sits at 0.028 FLOP/Byte. This result is lower than the theoretical prediction, which means that we need to put more effort into improving the code. The roofline result of this unoptimized version is shown here:

One can verify that the CPU spent most of its time in the main loop:

In the recommendations section, Intel Advisor states: “The performance of the loop is bounded by the private cache bandwidth. The bandwidth of the shared cache and DRAM may degrade performance. To improve performance: improve caching efficiency. The loop is also scalar. To fix: vectorize the loop.” Indeed, in the next step I repeated the roofline measurement, but with a vectorized executable. The compilation command I used is:

icx -g -O3 -qopt-report-file=guy_stream.txt -qopenmp -o guy_stream_vec ./guy_stream_vec.c

and the vectorization report says:

Global optimization report for : main

LOOP BEGIN at ./guy_stream.c (15, 1)
    remark #15521: Loop was not vectorized: loop control variable was not identified. Explicitly compute the iteration count before executing the loop or try using canonical loop form from OpenMP specification
LOOP END

LOOP BEGIN at ./guy_stream.c (23, 1)
    remark #15553: loop was not vectorized: outer loop is not an auto-vectorization candidate.

    LOOP BEGIN at ./guy_stream.c (24, 5)
        remark #15300: LOOP WAS VECTORIZED
        remark #15305: vectorization support: vector length 4
    LOOP END
LOOP END

This time the roofline plot shows a performance improvement compared to the non-optimized code. However, in both cases the performance bottleneck is still the DRAM bandwidth, as expected. The vectorized roofline plot is shown here:

This time the performance is 0.083 FLOP/Byte, which matches our theoretical prediction! This means that although the code hasn’t changed, the compiler managed to execute more ‘add’ instructions per unit of time, in parallel, thanks to the vectorization support:

Another possible optimization one could think of is adding an alignment to the arrays in memory:

__attribute__((aligned (64)))

However, adding this requirement didn’t improve the performance much either. It seems that we have really reached the performance wall, and the reason is that the bottleneck isn’t in the computation but in the DRAM bus bandwidth.

As a last step I tried another optimization technique: adding multi-threading, i.e., parallelizing the code with OpenMP. Adding an OpenMP parallel-for pragma makes the computational kernel run in parallel. However, once again, there wasn’t any performance improvement.

#pragma omp parallel for private(i)  // note: the directive is "omp", not "openmp"; the global loop index i must be private
for (j=0; j<REPS; j++)
    for (i=0; i<SIZE; i++)
        c[i] = a[i] + b[i];

To conclude, the roofline model is a powerful tool for finding the performance bottlenecks in a code. As long as we suffer from the limitations of the DRAM (or the caches), there isn’t much we can do to improve the performance: the CPU could ingest more operations on new data, but since the memory is slow, the performance is poor. Unfortunately, there is nothing we can do about it here. This is a challenging issue that is left to future computer architectures.

If you enjoyed this article you are invited to leave a comment below. You can also subscribe to my YouTube channel (@tel-zur_computing) and follow me on X and Linkedin.

References:

[1] Samuel Williams, Andrew Waterman, and David Patterson, “Roofline: An Insightful Visual Performance Model for Multicore Architectures”, Communications of the ACM, April 2009, vol. 52, no. 4, pp. 65-76.

[2] Supplementary material to [1]: https://dl.acm.org/doi/10.1145/1498765.1498785#sup

[3] Stream, https://www.cs.virginia.edu/stream/ref.html

IoT Project: Temperature measurements using Arduino, Raspberrypi, MQTT, Python, Influxdb and Grafana

Goal: Practice the model of IoT (Edge) data streaming and its remote analysis (Cloud).

This work was inspired by works of others, in particular [1] and [2].

The system schematic diagram is as follows:

Equipment:

  1. Arduino Uno.
  2. Arduino Ethernet Shield.
  3. LM35 temperature sensor (datasheet).
  4. Raspberrypi-3, serving as an MQTT broker.
  5. Another computer (my laptop) for visualization.
  6. Ethernet network to interconnect all the devices (make sure that the MQTT port, 1883 by default, is open; you may need “port forwarding”).

Parts can be purchased in many stores, including AliExpress and the like.

Software:

  1. Data acquisition code for the Arduino is developed, compiled, and uploaded to the device from the Arduino IDE.
  2. Python code for obtaining the data on the Raspberrypi.
  3. Influxdb for a database on the Raspberrypi.
  4. Grafana for visualization (operated on the Raspberrypi, but it can be installed on another computer as well).

The code I wrote is not perfect in terms of software quality, efficiency, and security! It should not be used in any real application; its purpose is educational only.

The code is available on my github.

How should you learn and understand this project? I would like to suggest a gradual approach:

  1. First, learn MQTT basics and publish a simple “Hello World” string from the device, which can be read (“subscribed to”) by another computer on the same network.
  2. Then, connect the temperature sensor, check that you can correctly read it, and then replace the “Hello World” string with the temperature reading. (Figure: Arduino Uno and LM35 temperature sensor.)
  3. Install the software on the Raspberrypi. You need to know how to create a new Influxdb database and master a few elementary SQL commands.
  4. Install Grafana and connect it with Influxdb using a built-in module.
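Step 1 of the list above can be exercised with the mosquitto command-line clients before writing any code (the host name and topic here are placeholders for your broker’s address and your own topic):

```shell
# Terminal 1 (any computer on the network): subscribe to the test topic on the Pi broker.
mosquitto_sub -h raspberrypi.local -t "test/hello"

# Terminal 2: publish a message; it should appear in terminal 1.
mosquitto_pub -h raspberrypi.local -t "test/hello" -m "Hello World"
```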

More ideas to go from here:

  1. You can install an MQTT client on your mobile phone and, after a short setup, view the temperature from there. (Figure: MQTT client mobile app.)

  2. You can upload the temperature reading to the cloud. It makes more sense to install the database on a big machine rather than on the Raspberrypi, since the data volume is expected to grow with time. In order to be as vendor-neutral as possible, I decided not to use the IoT-ready solutions from the cloud providers, and therefore I installed a fresh Ubuntu (“ubuntu-focal-20.04” image) node in the AWS cloud (IaaS). After installing the needed software tools (in a similar way to the Raspberrypi), the node became ready to accept the temperature data:

It is then possible to install Grafana on a local computer and connect it to the Influxdb in the cloud, or to install Grafana in the cloud as well and view it using tools such as a VNC server/client.

References:

[1] Maker.io, “How To Use Basic MQTT on Arduino“.

[2] DiyIOt, “Visualize MQTT Data with InfluxDB and Grafana“.