AI-generated images using open-source tools on local resources

In this blog post, I describe a method for generating AI images locally using open-source tools. This approach is both free and privacy-preserving. However, there are no free lunches; it comes with trade-offs, including slower performance and occasional quality limitations.

The content of this post is as follows:

  1. Introduction (motivation + constraints)
  2. System Architecture
  3. Setup and integration:
    • Ollama + LLM
    • Open WebUI
    • Automatic1111
  4. Experiments
  5. Limitations & Lessons Learned
  6. Conclusions and recommendations

Let’s begin…

  1. Introduction. Image generation using AI is a hot topic; however, it usually involves a paid plan from one of the commercial providers. I wanted to investigate the possibility of creating images for free and locally, using only open-source tools.
  2. System architecture. My computer runs Ubuntu 24.04.4 LTS (Noble Numbat) on an Intel i9 processor with 32GB RAM and an Nvidia GeForce RTX 4060 with 8GB VRAM. The combined GPU/CPU memory constraints must therefore be considered carefully when choosing an LLM. In that respect I recommend selecting an LLM according to the llmfit ranking. LLMFIT provides a compatibility score for many LLMs against your specific platform, see Figure 1.
  3. Setup and integration. As the AI hosting platform I use Ollama, which can easily be installed using a one-line command:

curl -fsSL https://ollama.com/install.sh | sh

Figure 1. LLMFIT
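The kind of check llmfit performs can be approximated with a back-of-the-envelope estimate: a quantized model needs roughly params × bits/8 bytes for its weights, plus runtime overhead for the KV cache and buffers. A sketch (the 20% overhead factor and the ~4.7 bits/weight average for q4_K_M are rough assumptions of mine):

```python
def fits_in_memory(params_billions, quant_bits, vram_gb, overhead=1.2):
    """Estimate model memory: weights = params * bits/8 bytes, plus ~20% runtime overhead."""
    needed_gb = params_billions * 1e9 * quant_bits / 8 / 2**30 * overhead
    return needed_gb, needed_gb <= vram_gb

# qwen2.5:7b (~7.6B parameters) at roughly 4.7 bits/weight (q4_K_M average)
needed_gb, fits = fits_in_memory(7.6, 4.7, 8)
print(f"~{needed_gb:.1f} GB needed, fits in 8 GB VRAM: {fits}")
```

The estimate lands near the 4.7GB that the model occupies on disk, which is why it is a usable first filter before consulting a proper ranking.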

After a few trial-and-error cycles I finally chose qwen2.5:7b (which occupies about 4.7GB of disk space). The installation and verification are done like this:

$ sudo systemctl status ollama  # is ollama running ok? see Figure 2
$ ollama pull qwen2.5:7b        # download and install the model
$ ollama list | grep qwen2.5:7b # verify the model installation
qwen2.5:7b          845dbda0ea48    4.7 GB    6 weeks ago    
$ ps -ef | grep ollama
ollama  15871  1  0 09:14 ?    00:01:10 /usr/local/bin/ollama serve
Figure 2. The output of sudo systemctl status ollama.
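Besides systemctl, the Ollama server can also be checked through its local REST API, which listens on port 11434 by default. Below is a minimal sketch using only the Python standard library; /api/generate with "stream": false is part of Ollama's documented API, and the call will of course only succeed while the server is running:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"

def build_generate_request(model, prompt):
    """Payload for Ollama's /api/generate endpoint (stream=False returns one JSON object)."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model, prompt, timeout=120):
    """POST the prompt to a locally running Ollama server and return the reply text."""
    data = json.dumps(build_generate_request(model, prompt)).encode()
    req = urllib.request.Request(f"{OLLAMA_URL}/api/generate", data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server up, generate("qwen2.5:7b", "Say hello") returns the model's reply as a string.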

To provide a user-friendly interface, I used Open WebUI, which is started from the command line with:

$ open-webui serve

see Figure 3.

Figure 3. Starting Open WebUI

Then, point the browser to http://localhost:8080 to access the dashboard, Figure 4.

Figure 4. Open WebUI dashboard

At this stage you can enjoy chatting with the model, but it is still not ready for generating images. For that you need to install another, external tool and create a link between Open WebUI and the image generation tool. The tool I am referring to is Automatic1111, a web interface for Stable Diffusion implemented with the Gradio library. Automatic1111 can also work as a stand-alone tool. To start it, execute the script webui.sh in the installation folder obtained from its GitHub repository. In my case, however, I had to modify the starting shell command by adding a few flags, so the invocation is as follows:

LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4 ./webui.sh --lowvram --api --cors-allow-origins "*" --no-half --no-half-vae --skip-python-version-check

The --lowvram flag is essential for GPUs with limited memory (such as 8GB), while --api enables integration with external tools like Open WebUI. Upon starting Automatic1111 a new tab in the browser will be opened, see Figure 5.
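The --api flag is also what makes the Open WebUI integration possible: Automatic1111 then exposes REST endpoints such as /sdapi/v1/txt2img on port 7860. A minimal standard-library sketch (the payload fields shown are a small subset of what the endpoint accepts; a running server is required for the actual call):

```python
import base64
import json
import urllib.request

A1111_URL = "http://127.0.0.1:7860"

def txt2img_payload(prompt, steps=20, width=512, height=512):
    """JSON body for Automatic1111's /sdapi/v1/txt2img endpoint."""
    return {"prompt": prompt, "steps": steps, "width": width, "height": height}

def txt2img(prompt, out_file="out.png", **kwargs):
    """Request an image and decode the first base64-encoded result to a PNG file."""
    data = json.dumps(txt2img_payload(prompt, **kwargs)).encode()
    req = urllib.request.Request(f"{A1111_URL}/sdapi/v1/txt2img", data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=600) as resp:
        images = json.loads(resp.read())["images"]
    with open(out_file, "wb") as f:
        f.write(base64.b64decode(images[0]))
    return out_file
```

With Automatic1111 running, txt2img("a panda bear riding a bicycle") saves the first generated image to out.png.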

Figure 5. Automatic1111 dashboard

As can be seen in Figure 5, it is ready for creating images. For example, I asked it to create a panda bear riding a bicycle in the prompt field under txt2img, and after a few seconds the image was generated. This standalone tool is not sufficient for a continuous AI workflow: to continue a conversation with the AI about the images, one must rely on conversational context and memory. Therefore, a link between Automatic1111 and Open WebUI is needed. To create it, navigate in Open WebUI to the admin menu, then to Images, and apply a few configurations, see the next screenshot.

Figure 6. Linking between Open WebUI and Automatic1111

Because of the limited memory on the GPU, I configured the LLM in Open WebUI to run on the CPU, reserving GPU memory for image generation via Automatic1111. In the nvidia-smi screenshot (Figure 7) below, one can see that ollama is absent but there is a Python script (Automatic1111) running.

Figure 7. Output of nvidia-smi showing the python code (Automatic1111) running on the GPU

To sum up this paragraph, the complete workflow of the image generation process is as follows:

User → Open WebUI → (LLM via Ollama) → Image Request → Automatic1111 (Stable Diffusion v1.5-pruned-emaonly) → Generated Image

Where: Ollama does the LLM orchestration, Open WebUI does the frontend and orchestration layer, and finally Automatic1111 is used as the image generation backend.
It is important to note that the LLM (via Ollama) generates the textual prompt and orchestration, while the actual image generation is performed by Stable Diffusion through Automatic1111, version 1.5-pruned-emaonly, which is lightweight and well-suited for GPUs with limited VRAM (such as 8GB).

4. Testing the model for graphics creation

Before creating images you need to click on “Integrations” at the bottom left of the prompt area and select Images. An image icon will appear to the right of the Integrations icon, see Figure 8.

Figure 8. The chat window is ready to deal with images

I made a request “Draw a panda bear riding a blue bicycle” and Figure 9 illustrates the result.

Figure 9. “Draw a panda bear riding a blue bicycle”

In another example I asked: “Create an image of a farm house with a single large black cat lying by the entrance door”; the result (after a few iterations) is shown in Figure 10. Despite my request, the cat is not located according to the instruction.

Figure 10. A cat NOT lying by the entrance door

In fact, other tests were disappointing most of the time, partly because the prompts weren't accurate enough. Prompts must be precise and structured, often requiring explicit constraints and compositional guidance.

At this point I wanted to test the newly released Gemma 4 model, hoping to obtain better images. The full Gemma 4 is too heavy for my computer, so I tried the 4-bit quantized model, and I also had to apply a few additional tricks to test it. Installing Gemma 4 under Ollama is simple:

ollama run gemma4:31b-it-q4_K_M

Gemma 4 used a significant part of the VRAM and together with Automatic1111 it was too much for the GPU to handle, see Figure 11.

Figure 11. Ollama (Gemma 4) and Automatic1111 running on the GPU. See “C” for Compute.

This experiment failed due to insufficient memory:

CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 7.62 GiB of which 55.25 MiB is free. Including non-PyTorch memory, this process has 566.00 MiB memory in use. Process 191483 has 6.34 GiB memory in use. Of the allocated memory 403.14 MiB is allocated by PyTorch, and 30.86 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

At this point I had to free additional memory and offload Gemma 4 to the CPU, so that Automatic1111 could use the GPU memory it needs. I closed my default Firefox browser, which I always use with many open tabs, and switched to the Brave browser with just 2 open tabs, see Figure 12.

Figure 12. Output of “top” showing ollama and 2 brave tabs running on the CPU

In addition, I attempted to run a CPU-only variant of the Gemma 4 model (forcing execution without GPU acceleration), which I called gemma4-31b-cpu. Finally, I used a very pedantic prompt:

"I want an image of a **rustic farmhouse entrance** [Style Clause]."
"There must be a **single large black cat** lying on the welcome mat right by the **front door** [Core Target]."
"You must prioritize the cat as the central focus of the composition [Restriction Clause]."
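The three clauses above can be assembled consistently across experiments with a small helper (the clause labels are my own convention, not a Stable Diffusion feature):

```python
def build_prompt(style, core_target, restriction):
    """Assemble a structured image prompt from labelled clauses."""
    return (f"I want an image of a {style}. "
            f"There must be {core_target}. "
            f"You must prioritize {restriction}.")

prompt = build_prompt(
    "rustic farmhouse entrance",
    "a single large black cat lying on the welcome mat right by the front door",
    "the cat as the central focus of the composition",
)
print(prompt)
```

Keeping the clauses separate makes it easy to vary one constraint at a time and see which part of the prompt the model actually honours.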

After a long waiting time, the generated image appeared – Figure 13:

Figure 13. A cat lying by the door. An image that was created locally by Gemma 4

5. Limitations & Lessons Learned. The image generation process was extremely slow and caused significant memory pressure, eventually leading to CUDA stalls due to conflicts between Ollama’s memory management and GPU offloading mechanisms. So I gave up on Gemma 4 for image generation and switched back to qwen2.5:7b. Using the same prompt through the integrated pipeline (Open WebUI + Automatic1111) with qwen2.5:7b as the LLM, the image in Figure 14 was generated.

Figure 14. The same prompt as in Figure 13, with qwen2.5:7b instead of gemma4-31b-cpu

Repeating this query once more generated the image shown in Figure 15.

Figure 15. Another test with the same prompt as was used in Figures 13 and 14 using qwen2.5:7b

6. Conclusions and recommendations

In summary, this approach successfully demonstrates that local AI image generation using open-source tools is feasible. However, on commodity hardware, it remains limited by memory constraints and performance bottlenecks.

While not yet practical for everyday use, it provides valuable insight into the trade-offs involved and serves as a solid foundation for further experimentation as hardware and models continue to improve.

I hope you found this exploration useful.

-Guy
********Addendum********
After publishing this post, I tested a smaller variant of Gemma 4 (gemma4:e4b-it-q4_K_M) instead of the larger 31B model.

Interestingly, the smaller model worked out-of-the-box and was able to participate in the image generation pipeline without the severe memory issues encountered previously.

This reinforces an important practical insight:

The main limitation is not the model family itself, but rather the model size relative to available GPU memory.

In constrained environments (e.g., 8GB VRAM), smaller quantized models can provide a workable balance between performance and resource usage. Enjoy the new black cat image:

Unsloth Studio for Computer Architecture education

The goal of this post is to describe the creation of a fine-tuned LLM for assisting students in learning the RISC-V ISA and writing assembly code. I have already developed a RAG system that acts as a smart agent for that. In this article we explore an alternative approach: take an existing local LLM and fine-tune it into a new model that is supposed to be a better expert at assisting students with the RISC-V ISA. This model can be created using the new Unsloth Studio.

Unsloth Studio

Unsloth Studio is a new package that can be downloaded from: https://unsloth.ai/docs/new/studio. The installation on Linux is straightforward:

curl -fsSL https://unsloth.ai/install.sh | sh

The data sources

I took a mixture of PDF tutorials and guides, plus repositories of RISC-V assembly language code taken from three GitHub sites. Using Python scripts, the sources were converted into a single JSONL file, which Unsloth Studio requires as a valid dataset. However, in a second round of development I used the recipes feature of Unsloth Studio to directly import the PDF source files and then convert them into a useful dataset.
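The conversion step can be sketched as follows; note that the instruction/output field names here are an illustrative assumption, so check the exact schema your chosen Unsloth Studio template expects:

```python
import json

def write_jsonl(pairs, path):
    """Write (question, answer) pairs as one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for question, answer in pairs:
            f.write(json.dumps({"instruction": question, "output": answer},
                               ensure_ascii=False) + "\n")

# One hypothetical record; real pairs come from the parsed PDFs and repositories.
pairs = [("What does lw x5, 0(x6) do?",
          "Loads a 32-bit word from the address in x6 into register x5.")]
write_jsonl(pairs, "riscv_dataset.jsonl")
```

Each source document contributes many such pairs; appending them all to one file yields the single dataset the studio imports.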

My computing system is an Ubuntu laptop with a Core i9, 32GB RAM, and an RTX 4060 GPU with 8GB VRAM. Starting Unsloth Studio is straightforward:

unsloth studio -H 0.0.0.0 -p 8888

Launching Unsloth Studio

Point the browser to http://127.0.0.1:8888 to access the dashboard.

Start the training

As the base LLM I chose unsloth/Qwen3-4B-Instruct-2507, which fits my hardware constraints. The studio makes use of the GPU, as can be seen in the next two screenshots: nvidia-smi and nvtop:

Creating the dataset

I used an already-installed Ollama and a local qwen2.5:7b model for this step. In the Document file block (see the next figure, at the bottom left) I added 8 PDF sources. The complex and delicate part is the riscv_qa block (at the bottom right), where the prompt, the response format, and the instructions must be set and optimized. These settings are critical for generating a high-quality model that, in principle, is expected to outperform the original LLM.

The next step is to run the workflow and generate the dataset. This step can take a long time. A typical log of this step is shown in the next figure:

[11:52:53 PM] [INFO] Job started

[11:52:53 PM] [INFO] 🎨 Creating Data Designer dataset

[11:52:54 PM] [INFO] ✅ Validation passed

[11:52:54 PM] [INFO] ⛓️ Sorting column configs into a Directed Acyclic Graph

[11:52:54 PM] [INFO] 🩺 Running health checks for models...

[11:52:54 PM] [INFO] |-- 👀 Checking 'qwen2.5:7b' in provider named 'ollama' for model alias 'ollama_qwen2.5_7b'...

[11:52:57 PM] [INFO] |-- ✅ Passed!

[11:52:57 PM] [INFO] ⏳ Processing batch 1 of 1

[11:52:57 PM] [INFO] 🌱 Sampling 200 records from seed dataset

[11:52:57 PM] [INFO] |-- seed dataset size: 2088 records

[11:52:57 PM] [INFO] |-- sampling strategy: shuffle

[11:52:57 PM] [INFO] |-- selection: rows [0 to 10] inclusive

[11:52:57 PM] [INFO] |-- seed dataset size after selection: 11 records

[11:52:57 PM] [INFO] 🗂️ llm-structured model config for column 'riscv_qa'

[11:52:57 PM] [INFO] |-- model: 'qwen2.5:7b'

[11:52:57 PM] [INFO] |-- model alias: 'ollama_qwen2.5_7b'

[11:52:57 PM] [INFO] |-- model provider: 'ollama'

[11:52:57 PM] [INFO] |-- inference parameters:

[11:52:57 PM] [INFO] |  |-- generation_type=chat-completion

[11:52:57 PM] [INFO] |  |-- max_parallel_requests=4

[11:52:57 PM] [INFO] |  |-- timeout=120

[11:52:57 PM] [INFO] |  |-- temperature=0.10

[11:52:57 PM] [INFO] |  |-- top_p=0.90

[11:52:57 PM] [INFO] |  |-- max_tokens=150

[11:52:57 PM] [INFO] ⚡️ Processing llm-structured column 'riscv_qa' with 4 concurrent workers

[11:52:57 PM] [INFO] ⏱️ llm-structured column 'riscv_qa' will report progress every 20 records

[11:55:38 PM] [INFO] |-- 😴 llm-structured column 'riscv_qa' progress: 20/200 (10%) complete, 20 ok, 0 failed, 0.12 rec/s, eta 1452.7s

[11:58:14 PM] [INFO] |-- 😴 llm-structured column 'riscv_qa' progress: 40/200 (20%) complete, 40 ok, 0 failed, 0.13 rec/s, eta 1270.7s

[12:00:44 AM] [INFO] |-- 🥱 llm-structured column 'riscv_qa' progress: 60/200 (30%) complete, 60 ok, 0 failed, 0.13 rec/s, eta 1091.1s

[12:03:25 AM] [INFO] |-- 🥱 llm-structured column 'riscv_qa' progress: 80/200 (40%) complete, 80 ok, 0 failed, 0.13 rec/s, eta 941.7s

[12:06:02 AM] [INFO] |-- 😐 llm-structured column 'riscv_qa' progress: 100/200 (50%) complete, 100 ok, 0 failed, 0.13 rec/s, eta 785.5s

[12:08:35 AM] [INFO] |-- 😐 llm-structured column 'riscv_qa' progress: 120/200 (60%) complete, 120 ok, 0 failed, 0.13 rec/s, eta 625.6s

[12:11:07 AM] [INFO] |-- 😐 llm-structured column 'riscv_qa' progress: 140/200 (70%) complete, 140 ok, 0 failed, 0.13 rec/s, eta 467.2s

[12:13:40 AM] [INFO] |-- 😊 llm-structured column 'riscv_qa' progress: 160/200 (80%) complete, 160 ok, 0 failed, 0.13 rec/s, eta 310.9s

[12:16:13 AM] [INFO] |-- 😊 llm-structured column 'riscv_qa' progress: 180/200 (90%) complete, 180 ok, 0 failed, 0.13 rec/s, eta 155.2s

[12:18:53 AM] [INFO] |-- 🤩 llm-structured column 'riscv_qa' progress: 200/200 (100%) complete, 200 ok, 0 failed, 0.13 rec/s, eta 0.0s

[12:18:54 AM] [INFO] 📊 Model usage summary:

[12:18:54 AM] [INFO] |-- model: qwen2.5:7b

[12:18:54 AM] [INFO] |-- tokens: input=134447, output=11209, total=145656, tps=93

[12:18:54 AM] [INFO] |-- requests: success=200, failed=0, total=200, rpm=7

[12:18:54 AM] [INFO] 📐 Measuring dataset column statistics:

[12:18:54 AM] [INFO] |-- 🗂️ column: 'riscv_qa'

[12:18:55 AM] [INFO] Job completed

Before starting the training process, one should associate the dataset columns with roles, e.g. a user and an assistant in my case, see the next figure:

Model training

To start the training process, one has to click on Start training. The training step is summarized in the next 2 figures:

Testing the new model

It is possible to compare, side by side, the original LLM and the new fine-tuned model. For that, click on Compare in chat. I compared the models with the following 3 queries:

  1. Explain this instruction: lw x5, 0(x6)
  2. In one short paragraph, generate a RISC-V assembly to add two registers.
  3. Write RISC-V assembly code to be executed on the RARS simulator to compute factorial of a small integer N and print the result N! Just give the code.

Screenshots of the 3 replies are shown below. Sadly, one can see that the fine-tuned model completely failed on the 3rd question and produced replies similar to the original LLM for questions 1 and 2; therefore it does not seem useful for my educational project.

Summary

The conclusion from this experiment is that the machinery exists and works. However, to create a good fine-tuned model one has to invest a lot of effort in building a high-quality data source with the right mixture of tutorials, examples, and assembly code. In addition, one has to iterate on and tune the prompt, the response format, and the instructions. Although Unsloth Studio has a nice, easy GUI, there is still a learning curve before it becomes a useful tool. Since I already have a dedicated RAG system that works fine, I am not convinced it is worth investing more time in creating a superior fine-tuned LLM. In any case, it was an interesting learning experience, and I recommend you try Unsloth Studio. Perhaps it will fit your specific needs. Good luck!

OpenClaw running nemotron-3-super:cloud

A simple and straightforward installation and execution:

ollama launch openclaw --model nemotron-3-super:cloud

After accepting a few settings point the browser to:

http://localhost:18789

and see the dashboard:

The computer hardware was described in my previous blog post. The screen capture below briefly describes the system:

The measured throughput is 6.8K tokens/minute.

Then I asked OpenClaw to “write a RISC-V assembly code to be executed on the RARS simulator that computes the factorial of a small integer N”.
It took a very long time to generate the code, which was finally created under the workspace folder. The code ran smoothly on RARS. The factorial of 5, or 5!, is 120 (0x78) and can be verified in the simulator in register a0:

So far so good but this post is just a first impression post, stay tuned!
–Guy

Running a 30B LLM Locally: Nemotron-3-nano on a Laptop

In this short report I summarize my first impressions of using the new Nemotron-3-nano:30b on my laptop.

System specifications

  • ASUS TUF laptop
  • Intel core i9
  • 32GB RAM
  • NVIDIA GeForce RTX 4060 (8 GB VRAM)
  • OS: Ubuntu 24.04.4 LTS (kernel: 6.17.0-19-generic)

Installation

ollama pull nemotron-3-nano:30b
telzur@TUF:~/science/nemotron-3-nano$ ollama pull nemotron-3-nano:30b
pulling manifest 
pulling a70437c41b3b: 100% ▕████████████████████████████████████████████████████████████████████████████████████████████████▏  24 GB                         
pulling bca58c750377: 100% ▕████████████████████████████████████████████████████████████████████████████████████████████████▏  10 KB                         
pulling 12e88b2a8727: 100% ▕████████████████████████████████████████████████████████████████████████████████████████████████▏   28 B                         
pulling 12bee8c08a36: 100% ▕████████████████████████████████████████████████████████████████████████████████████████████████▏  488 B                         
verifying sha256 digest 
writing manifest 
success 

Running

ollama run nemotron-3-nano:30b

A test query: “what are the health benefits of NAC?”

telzur@TUF:~/science/nemotron-3-nano$ ollama run nemotron-3-nano:30b

>>> what are the health benefits of NAC?  // an example

Thinking...

The user asks: "what are the health benefits of NAC?" They want information about N-acetylcysteine (NAC) health benefits. This is medical info, but 

generally permissible as it's not providing medical advice, just informational. We can comply, providing overview. Must avoid disallowed content. It's 

safe. Provide benefits: antioxidant, mucolytic, supports mental health, substance use disorders, respiratory conditions, cardiovascular, etc. Also 

mention side effects, typical dosage, need for medical guidance. Should not provide specific medical advice. Provide citations? Just general. So answer 

accordingly.

...done thinking.

…. A very long reply (truncated)

**If you’re considering NAC for a specific health goal, the safest path is to speak with a qualified healthcare professional who can tailor the dose, 

duration, and monitoring plan to your individual circumstances.**

>>> Send a message (/? for help)

Ollama runtime status

ollama ps
telzur@TUF:~/science/nemotron-3-nano$ ollama ps

NAME                   ID              SIZE     PROCESSOR          CONTEXT    UNTIL               

nemotron-3-nano:30b    b725f1117407    26 GB    77%/23% CPU/GPU    4096       53 seconds from now

Computer utilization

While the computer was processing the query the GPU load can be seen with “nvtop”:

“nvtop” while processing a query

The GPU load can also be seen using “nvidia-smi”:

“nvidia-smi” while processing a query

Note that the GPU is doing computing (C) and not graphics (G) for ollama, and it uses approximately 6 GB of VRAM. In addition, it uses the CPU, as can be seen in “top”:

“top” while processing a query

Working with a web GUI

You can also interact with the model using the Open WebUI interface:

Invoking Open WebUI from the terminal

Then open your browser at http://localhost:8080

The web GUI

Repeating the same query but this time in the Web GUI:

Running a query in the browser.

First impressions

Running Nemotron-3-nano:30b on my laptop is very impressive: it is a serious, quite large LLM that makes effective use of gaming-laptop hardware (Core i9, 32GB RAM, and an RTX 4060 GPU with 8GB VRAM). As shown above, the CPU/GPU utilization was automatically set to 77%/23% CPU/GPU for a total memory footprint of 26 GB, which means about 20GB RAM and about 6GB VRAM.

The downsides are quite long response times and a very noisy computer, as the fan struggles to cool the system.

Conclusion

Running a 30B-class model locally on a laptop is no longer theoretical—it is practical.
However, there is still a clear trade-off between performance, latency, and thermal constraints.

Embarrassingly Parallel Computations with Ray

Ray is an excellent framework for large-scale distributed computations in Python. In this blog post, I demonstrate a simple example of Ray’s capability to perform Embarrassingly Parallel Computation with minimal and straightforward source code. It is well known that if we randomly throw dots inside a square enclosing a circle, the ratio of the number of dots that fall inside the circle to the total number of dots approaches π/4 (where π ≈ 3.1415926…), as the total number of dots approaches infinity. Figure 1 illustrates this concept.

Figure 1: Estimating Pi using a Monte Carlo computation.

Numerical methods based on random numbers are called Monte Carlo computations. The accuracy of this method depends on the sample size. However, achieving higher accuracy requires a larger sample, which in turn increases computation time. Fortunately, this algorithm, which relies on random numbers, can be easily parallelized. Since random numbers are, by definition, uncorrelated, parallel tasks can execute the same algorithm simultaneously without interference. The only step remaining is to combine the partial results of these independent computations into a single final result using a reduction operation.

The simple Python code that implements this using Ray, along with the computation result executed in VS Code, is shown in Figure 2.

Figure 2: Python code using Ray in VS-Code.

The full code is enclosed below:

import ray
import random

# Initialize Ray
ray.init(ignore_reinit_error=True)

@ray.remote
def sample_pi(num_samples):
    count = 0
    for _ in range(num_samples):
        x, y = random.random(), random.random()
        if x*x + y*y <= 1.0:
            count += 1
    return count

# Number of samples for each task
num_samples = 10000000
# Number of tasks
num_tasks = 100

# Submit tasks to Ray
counts = ray.get([sample_pi.remote(num_samples) for _ in range(num_tasks)])

# Calculate the estimated value of pi
total_samples = num_samples * num_tasks
pi_estimate = 4 * sum(counts) / total_samples

print(f"Estimated value of pi: {pi_estimate}")

# Shutdown Ray
ray.shutdown()
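As a baseline, the same estimator can be run serially, without Ray, to gauge the parallel speedup (a minimal sketch with a reduced sample count so it finishes quickly):

```python
import random
import time

def sample_pi_serial(num_samples):
    """Count random points falling inside the unit quarter-circle."""
    count = 0
    for _ in range(num_samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            count += 1
    return count

start = time.perf_counter()
n = 1_000_000
pi_estimate = 4 * sample_pi_serial(n) / n
elapsed = time.perf_counter() - start
print(f"Serial estimate: {pi_estimate:.4f} in {elapsed:.2f}s")
```

Scaling the serial time up to the full workload and dividing by the Ray wall-clock time gives the effective speedup, which on a multi-core machine should approach the number of physical cores.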

It is recommended to verify the parallel execution of this code by monitoring your computer’s resource usage. For this purpose, you can use your system’s task manager. On Linux systems, tools like top or htop are particularly useful for observing CPU utilization in real time. Figure 3 provides an example of this process.

Figure 3: Verification of the parallel execution of the code by looking simultaneously at ‘htop’.

You are welcome to try this short example yourself and leave me a comment below.

Guy

How to convert a VirtualBox image to an AWS AMI?

Guy Tel-Zur, March 26, 2024

During the pandemic, when isolation took place, I was challenged with how to keep my students practicing parallel programming in my “Introduction to Parallel Processing” course. The students couldn’t meet in the computer lab, so I developed a VirtualBox image with all the tools I needed for the course (a Linux machine with a compiler, MPI, HTCondor, and profiling tools such as TAU and Scalasca). This idea of a parallel-processing full-stack virtual machine (VM) is not unique or new; for example, there is an excellent tool from the E4S project. However, I preferred to create my own image, customized to my syllabus. The VM allowed the students to import a ready-to-use infrastructure into their private computers with zero effort. The VM settings are shown in the next figure:

The VM desktop, which is based on CentOS 8, is shown here:

Since then, I have kept using and upgrading this tool as an easy alternative to the full-scale educational departmental cluster. Of course, this isn’t a tool for breaking performance records, but it is quite convenient for educational purposes. However, there are some limitations. First, the VM cannot work on very old computers; the minimum requirements are at least 4GB RAM, 2 cores, and a few tens of GB of storage. Another significant limitation is that it isn’t possible to test the scaling of the codes as one increases the number of parallel tasks (because the VM was limited to only 2 cores); therefore, important terms like speedup and efficiency could not be demonstrated. Nevertheless, I decided to preserve this concept of an easy-to-use, full-stack single machine as a complementary tool, but I also wanted to migrate it to the cloud so that anyone would be able to test the instance with many cores! Transferring the VM to the cloud turned out to be a challenging task, and I decided to summarize it here to ease your life in case you also want to convert a VirtualBox image (an ova file) to an Amazon Web Services (AWS) machine image (AMI). Hopefully, after reading this post you will be able to complete the task in a fraction of the time I spent resolving all the challenges.

Step 1: Export the VM to an OVA (Open Virtualization Format) file. This part is easy: just click “File” –> “Export Appliance”. It is good practice to remove the .bash_history file before exporting the VM, so that the history of the commands you used prior to that moment is cleared.

Step 2: Assuming you already have an AWS account and have installed the AWS command-line tools and credentials, create an S3 bucket and copy your ova file into it:

aws s3 cp ./HPC_VM_2024.3.ova s3://gtz-vm-bucket/

This may take a few minutes, be patient.

Step 3: Security matters. You need to create a policy and a role to handle the image:

aws iam create-role --role-name vmimport --assume-role-policy-document file://trust-policy.json

Where the policy JSON file is:

$ more ./trust-policy.json 
{
   "Version": "2012-10-17",
   "Statement": [
      {
         "Effect": "Allow",
         "Principal": { "Service": "vmie.amazonaws.com" },
         "Action": "sts:AssumeRole",
         "Condition": {
            "StringEquals":{
               "sts:Externalid": "vmimport"
            }
         }
      }
   ]
}

You can read more about this step here. You also need to define a role policy in another JSON file:

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Action":[
            "s3:GetBucketLocation",
            "s3:GetObject",
            "s3:ListBucket"
         ],
         "Resource":[
            "arn:aws:s3:::gtz-vm-bucket",
            "arn:aws:s3:::gtz-vm-bucket/*"
         ]
      },
      {
         "Effect":"Allow",
         "Action":[
            "ec2:ModifySnapshotAttribute",
            "ec2:CopySnapshot",
            "ec2:RegisterImage",
            "ec2:Describe*"
         ],
         "Resource":"*"
      }
   ]
}

and the command to set it is:

aws iam put-role-policy --role-name vmimport --policy-name vmimport --policy-document file://role-policy.json

Step 4: Now it is time to convert the ova in S3 into an AMI using this command:

aws ec2 import-image --description "HPC_VM_2024.3" --disk-containers "file://./containers.json"

A few seconds after hitting ‘enter’ you will see a response with your new AMI name; look for “import-ami-XXXXXXXXX”. A typical response looks like this:

{
    "Description": "HPC_VM_2024.3",
    "ImportTaskId": "import-ami-01478e05448530ae4",
    "Progress": "1",
    "SnapshotDetails": [
        {
            "Description": "Guy Tel-Zur - Parallel Processing Education",
            "DiskImageSize": 0.0,
            "Format": "OVA",
            "UserBucket": {
                "S3Bucket": "gtz-vm-bucket",
                "S3Key": "HPC_VM_2024.3.ova"
            }
        }
    ],
    "Status": "active",
    "StatusMessage": "pending"
}

The content of containers.json is:

[
  {
    "Description": "My Parallel Processing OVA",
    "Format": "ova",
    "UserBucket": {
        "S3Bucket": "gtz-vm-bucket",
        "S3Key": "HPC_VM_2024.3.ova"
    }
}]

Replace the Description, S3Bucket, and S3Key fields with your preferences.
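Instead of editing containers.json by hand, a short helper can generate it with your own bucket and key (a sketch; the field names match the example above):

```python
import json

def make_containers_json(description, bucket, key, path="containers.json"):
    """Write the disk-container descriptor used by 'aws ec2 import-image'."""
    containers = [{
        "Description": description,
        "Format": "ova",
        "UserBucket": {"S3Bucket": bucket, "S3Key": key},
    }]
    with open(path, "w") as f:
        json.dump(containers, f, indent=2)
    return path

make_containers_json("My Parallel Processing OVA",
                     "gtz-vm-bucket", "HPC_VM_2024.3.ova")
```

This also avoids the stray-comma and quoting mistakes that hand-edited JSON invites.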

Step 5: Checking the progress. At this step you can execute this status checking command:

$ aws ec2 describe-import-image-tasks --import-task-ids import-ami-01478e05448530ae4
{
    "ImportImageTasks": [
        {
            "Description": "HPC_VM_2024.2",
            "ImportTaskId": "import-ami-01478e05448530ae4",
            "Progress": "19",
            "SnapshotDetails": [
                {
                    "DiskImageSize": 20736632320.0,
                    "Format": "VMDK",
                    "Status": "active",
                    "UserBucket": {
                        "S3Bucket": "gtz-vm-bucket",
                        "S3Key": "HPC_VM_2024.2.ova"
                    }
                }
            ],
            "Status": "active",
            "StatusMessage": "converting",

and see that “pending” was changed to “converting” which is a good sign.
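Instead of scanning the full JSON on every poll, the interesting fields can be pulled out with a few lines of Python (a sketch that parses the describe-import-image-tasks response shown above; in practice you would feed it the output of the aws CLI, e.g. via subprocess):

```python
import json

def import_status(response_text):
    """Extract (status, message, progress) from 'aws ec2 describe-import-image-tasks' output."""
    task = json.loads(response_text)["ImportImageTasks"][0]
    return (task.get("Status"), task.get("StatusMessage"), task.get("Progress"))

# A trimmed sample response, as returned during the conversion above.
sample = """{"ImportImageTasks": [{"ImportTaskId": "import-ami-01478e05448530ae4",
  "Progress": "19", "Status": "active", "StatusMessage": "converting"}]}"""
print(import_status(sample))
```

Wrapping this in a loop with a sleep between calls gives a simple progress watcher for the long-running conversion.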

But then came problems…

The image was not correctly prepared for the conversion into an AMI and I got the following error message:

$ aws ec2 describe-import-image-tasks --import-task-ids import-ami-01478e05448530ae4
{
    "ImportImageTasks": [
        {
            "Description": "HPC_VM_2024.2",
            "ImportTaskId": "import-ami-01478e05448530ae4",
            "SnapshotDetails": [
                {
                    "DeviceName": "/dev/sde",
                    "DiskImageSize": 20736632320.0,
                    "Format": "VMDK",
                    "Status": "completed",
                    "UserBucket": {
                        "S3Bucket": "gtz-vm-bucket",
                        "S3Key": "HPC_VM_2024.2.ova"
                    }
                }
            ],
            <strong>"Status": "deleted",
            "StatusMessage": "ClientError: BLSC-style GRUB found, but unable to detect default kernel"</strong>,
            "Tags": []
        }
    ]

The way to resolve this error is to return to VirtualBox, boot the image, and make the modifications as root. By default, GRUB_ENABLE_BLSCFG is set to true in the /etc/default/grub file. When this variable is true, GRUB2 uses blscfg files and entries in the grub.cfg file. To resolve the ClientError: BLSC-style GRUB error on import or export, set the GRUB_ENABLE_BLSCFG parameter to false in /etc/default/grub: open the file with a text editor, such as nano, change GRUB_ENABLE_BLSCFG to false, and then run the following command to rebuild the GRUB configuration file:

grub2-mkconfig -o /boot/grub2/grub.cfg

To read more about this issue click here. Now, shut down the VM and repeat steps 1..5 (this is time-consuming and tedious).
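For illustration only, the substitution that nano (or sed) performs on /etc/default/grub can be sketched as a small Python function; this is just the string edit itself, not a replacement for rebuilding the GRUB configuration afterwards:

```python
import re

def disable_blscfg(grub_config: str) -> str:
    """Return the grub config text with GRUB_ENABLE_BLSCFG forced to false."""
    return re.sub(r'^GRUB_ENABLE_BLSCFG=.*$', 'GRUB_ENABLE_BLSCFG=false',
                  grub_config, flags=re.MULTILINE)

# Example fragment of /etc/default/grub:
sample = 'GRUB_TIMEOUT=5\nGRUB_ENABLE_BLSCFG=true\n'
print(disable_blscfg(sample))
```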

Then, I had another “surprise”. Because I had upgraded the image over the years, the VM had accumulated several kernels, but sadly the one supported by AWS wasn’t installed. In my case I got this error message:

"StatusMessage": "ClientError: Unsupported kernel version 5.4.156-1.el8.elrepo.x86_64",

It turns out that AWS supports only specific kernel versions for each Linux distribution; check here:

So I had to downgrade the kernel to 4.18.0 and make it the default when booting the image. Then I had to repeat, once again, steps 1..5. Unfortunately, that wasn’t enough! The conversion process failed again, this time due to the presence of the other kernels. I had to remove all the other kernels completely and be left with only the 4.18.0 kernel. Even the rescue kernel disturbed the conversion process:

"StatusMessage": "ClientError: Unsupported kernel version 0-rescue-c02fbb5c652549588dbb069f20f31872",

So I had to go back to the VirtualBox image yet again, erase all the other kernels, and repeat steps 1..5 🙁 🙁

This time it went well until the end!

{
    "ImportImageTasks": [
        {
            "Architecture": "x86_64",
            "Description": "HPC_VM_2024.3",
            "ImportTaskId": "import-ami-079a7bf1190f50bbf",
            "LicenseType": "BYOL",
            "Platform": "Linux",
            "Progress": "27",
            "SnapshotDetails": [
                {
                    "DeviceName": "/dev/sda1",
                    "DiskImageSize": 20776996864.0,
                    "Format": "VMDK",
                    "Status": "<strong>completed</strong>",
                    "UserBucket": {
                        "S3Bucket": "gtz-vm-bucket",
                        "S3Key": "HPC_VM_2024.3.ova"
                    }
                }
            ],
            "Status": "<strong>active</strong>",

Congratulations! Now we can go to the AWS dashboard and find our new AMI in the EC2 panel:

In order to test the AMI, click on “Launch instance from AMI”. The first instance I tried was a t2.micro, just for testing the connection. A simple ssh connection from the terminal was successful using the generated key pair:

Then I wanted to test the image in its full glory, so I created another instance, this time with 8 cores (t2.2xlarge). This node exceeds the performance of the VirtualBox image, which was the motivation for the whole exercise:

Indeed now there are 8 happy cores running:

Now it is time for a few parallel computing tests. For that, I used the famous cpi.c program. The following three figures show results from the TAU, Jumpshot, and Scalasca profiling tools, respectively:

Mission accomplished! That’s it for now. For further reading, check this link.

If you enjoyed this article you are invited to leave a comment below. You can also subscribe to my YouTube channel (@tel-zur_computing) or connect with me on X and on Linkedin.

Good luck!

The Roofline model

Guy Tel-Zur, March 20, 2024

In this blog post I will explain what the roofline model is, why it is important, how to measure the achieved performance of a computer program, and how that compares to the peak theoretical performance of the computer. According to this model, we characterize a computer program by the ratio between the computational work done and the memory traffic required to allow this computation. This ratio is called the arithmetic intensity, and it is measured in units of (#floating-point operations)/(#bytes transferred between the memory and the CPU). An excellent paper describing the roofline model is given in [1], and its cover page is shown in the next figure.

The Roofline model paper.
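In formula form, the model predicts that attainable performance is the minimum of the machine’s peak compute rate and the arithmetic intensity multiplied by the peak memory bandwidth. A minimal sketch, with illustrative (not measured) machine numbers:

```python
def roofline(ai_flop_per_byte, peak_gflops, bw_gb_per_s):
    """Attainable GFLOP/s = min(peak compute, arithmetic intensity * bandwidth)."""
    return min(peak_gflops, ai_flop_per_byte * bw_gb_per_s)

# Illustrative machine: 100 GFLOP/s peak, 20 GB/s DRAM bandwidth.
print(round(roofline(0.083, 100.0, 20.0), 2))  # 1.66 -- memory-bound region
print(roofline(10.0, 100.0, 20.0))             # 100.0 -- capped at peak compute
```

Low-intensity kernels (like the stream kernel below) live under the slanted memory roof; only kernels with high arithmetic intensity can reach the flat compute roof.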

As a test case I used the famous stream benchmark. At its core, stream executes the following computational kernel:

c[i] = a[i] + b[i];

where a, b, and c are large arrays. The arithmetic intensity in this case consists of 1 floating-point operation (‘+’) and 3 data movements (reading a and b from memory and writing back c). If a, b, and c are of type float, each element occupies 4 bytes, so the total data movement is 12 bytes; therefore the arithmetic intensity is 1/12, which is about 0.083. We will test this prediction later on. The official stream benchmark can be downloaded from [3]. However, for my purpose this code seemed too complex, and according to [2] the roofline results it produces may be misleading. Therefore, I wrote a simple stream code myself. The reference code is given in the code section below.
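The 1/12 prediction is just a line of arithmetic, assuming 4-byte floats and no cache reuse; sketched in Python:

```python
flops_per_iter = 1        # one '+' per element
bytes_per_iter = 3 * 4    # read a[i], read b[i], write c[i]; float = 4 bytes
ai = flops_per_iter / bytes_per_iter
print(round(ai, 3))       # 0.083
```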

#include <stdio.h>
#include <stdlib.h> // for random numbers
#include <omp.h>    // for omp_get_wtime()

#define SIZE 5000000  // size of arrays
#define REPS 1000    // number of repetitions to make the program run longer

float a[SIZE],b[SIZE],c[SIZE];  
double t_start, t_finish, t;
int i,j;

int main() {

    // initialize arrays
    for (i = 0; i < SIZE; i++) {
        a[i] = (float)rand();
        b[i] = (float)rand();
        c[i] = 0.;
    }

    // compute c[i] = a[i] + b[i]
    t_start = omp_get_wtime();
    for (j = 0; j < REPS; j++)
        for (i = 0; i < SIZE; i++)
            c[i] = a[i] + b[i];
    t_finish = omp_get_wtime();

    t = t_finish - t_start;

    printf("Run summary\n");
    printf("=================\n");
    printf("Array size: %d\n", SIZE);
    printf("Total time (sec.):%f\n", t);

    // That's it!
    return 0;
}
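From the total time the program prints, one can back out the sustained FLOP rate and memory bandwidth. A small sketch, assuming the same 12 bytes of traffic per iteration as above (the run time below is a made-up placeholder, not a measurement):

```python
SIZE = 5_000_000
REPS = 1_000
t = 4.2                            # hypothetical measured time, in seconds

total_flops = SIZE * REPS          # one add per element per repetition
total_bytes = SIZE * REPS * 12     # 3 floats moved per element, 4 bytes each
print(f"Sustained: {total_flops / t / 1e9:.2f} GFLOP/s, "
      f"{total_bytes / t / 1e9:.2f} GB/s")
```

Dividing the two sustained rates recovers the arithmetic intensity, which is how a measured point lands on the roofline plot.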

The computational environment

I use a laptop running Linux Mint 21.3 with 8GB RAM on an Intel Core i7. The compiler was Intel oneAPI (version 2024.0.2), and I used Intel Advisor for measuring and visualizing the roofline. If you want to reproduce my test, the first step is to prepare the environment as can be seen here:

$ <strong>source ~/path/to/opt/intel/oneapi/setvars.sh</strong> 
 # change the line above according to the path in your file system
:: initializing oneAPI environment ...
   bash: BASH_VERSION = 5.1.16(1)-release
   args: Using "$@" for setvars.sh arguments: 
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: inspector -- latest
:: ipp -- latest
:: ippcp -- latest
:: itac -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: vtune -- latest
:: oneAPI environment initialized ::

Another one-time preparation step is setting ptrace_scope; otherwise Advisor won’t work:

$ cat /proc/sys/kernel/yama/ptrace_scope
1
$ echo "0"|sudo tee /proc/sys/kernel/yama/ptrace_scope
[sudo] password for your_user_name:              
0

The results

First, I tested the un-optimized version listed above. The measured point sits at 0.028 FLOP/Byte; this result is lower than the theoretical prediction, which means we need to put more effort into improving the code. The roofline result of this un-optimized version is shown here:

One can verify that the CPU spent most of its time in the main loop:

In the recommendations section, Intel Advisor states: “The performance of the loop is bounded by the private cache bandwidth. The bandwidth of the shared cache and DRAM may degrade performance. To improve performance: Improve caching efficiency. The loop is also scalar. To fix: Vectorize the loop.” Indeed, in the next step I repeated the roofline measurement, this time with a vectorized executable. The compilation command I used is:

icx -g -O3 -qopt-report-file=guy_stream.txt -qopenmp -o guy_stream_vec ./guy_stream_vec.c

and the vectorization report says:

Global optimization report for : main

LOOP BEGIN at ./guy_stream.c (15, 1)
    remark #15521: Loop was not vectorized: loop control variable was not identified. Explicitly compute the iteration count before executing the loop or try using canonical loop form from OpenMP specification
LOOP END

LOOP BEGIN at ./guy_stream.c (23, 1)
    remark #15553: loop was not vectorized: outer loop is not an auto-vectorization candidate.

    <strong>LOOP BEGIN at ./guy_stream.c (24, 5)
        remark #15300: LOOP WAS VECTORIZED
        remark #15305: vectorization support: vector length 4</strong>
    LOOP END
LOOP END

This time the roofline plot reports a performance improvement compared to the non-optimized code. However, in both cases the performance bottleneck is still the DRAM bandwidth, as expected. The vectorized roofline plot is shown here:

This time the performance is 0.083 FLOP/Byte, which matches our theoretical prediction! This means that although the code hasn’t changed, the compiler managed to execute more ‘add’ instructions per unit of time, in parallel, thanks to the vectorization support:

Another possible optimization one could think of is aligning the arrays in memory:

__attribute__((aligned (64)))

However, adding this requirement didn’t improve the performance much either. It seems that we have really hit the performance wall, and the reason is that the bottleneck isn’t the computation but the DRAM bus bandwidth.

As a last step I tried another optimization technique: multi-threading, i.e., parallelizing the code with OpenMP. Adding an OpenMP parallel-for pragma makes the computational kernel run in parallel. However, once again, there was no performance improvement:

#pragma omp parallel for
for (j=0; j<REPS; j++)
    for (i=0; i<SIZE; i++)
        c[i] = a[i] + b[i];

To conclude, the roofline model is a powerful tool for locating the performance bottlenecks in a code. As long as we are limited by the DRAM (or cache) bandwidth, there isn’t much we can do to improve the performance. The CPU could ingest more operations on new data, but since the memory is slow, the performance is poor. Unfortunately, there is nothing we can do about it; this is a challenging issue left to future computer architectures.

If you enjoyed this article you are invited to leave a comment below. You can also subscribe to my YouTube channel (@tel-zur_computing) and follow me on X and Linkedin.

References:

[1] Samuel Williams, Andrew Waterman, and David Patterson, “Roofline: An Insightful Visual Performance Model for Multicore Architectures“, Communications of the ACM, April 2009, vol. 52, no. 4, pp. 65-76.

[2] Supplementary material to [1]: https://dl.acm.org/doi/10.1145/1498765.1498785#sup

[3] Stream, https://www.cs.virginia.edu/stream/ref.html

IoT Project: Temperature measurements using Arduino, Raspberry Pi, MQTT, Python, InfluxDB and Grafana

Goal: Practice the model of IoT (Edge) data streaming and its remote analysis (Cloud).

This work was inspired by works of others, in particular [1] and [2].

The system schematic diagram is as follows:

Equipment:

  1. Arduino Uno.
  2. Arduino Ethernet Shield.
  3. LM35 temperature sensor (datasheet).
  4. Raspberry Pi 3, serving as an MQTT broker.
  5. Another computer (my laptop) for visualization.
  6. An Ethernet network to interconnect all the devices (make sure that the MQTT port, 1883 by default, is open; “port forwarding”).

Parts can be purchased in many stores, including AliExpress and the like.
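The LM35 produces 10 mV per °C, so converting a 10-bit ADC reading (assuming the Uno’s default 5 V analog reference) to Celsius is one line of arithmetic; here it is sketched in Python rather than Arduino C for brevity:

```python
def lm35_celsius(adc_reading, vref=5.0, adc_steps=1024):
    """Convert a 10-bit ADC reading of an LM35 (10 mV per degC) to Celsius."""
    millivolts = adc_reading * (vref * 1000.0 / adc_steps)
    return millivolts / 10.0

print(round(lm35_celsius(51), 1))  # 51 steps ~ 249 mV -> 24.9 degC
```

The same formula goes into the Arduino sketch before the value is published over MQTT.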

Software:

  1. Data acquisition code for the Arduino is developed, compiled, and uploaded to the device from the Arduino IDE.
  2. Python code for obtaining the data on the Raspberry Pi.
  3. InfluxDB as a database on the Raspberry Pi.
  4. Grafana for visualization (running on the Raspberry Pi, but it can also be installed on another computer).

The code I wrote is not perfect in terms of software quality, efficiency, or security! It should not be used in any real application; its purpose is educational only.

The code is available on my GitHub.

How to learn and understand this project. I would like to suggest a gradual approach:

  1. First, learn MQTT basics and publish a simple “Hello World” string from the device, which can be read (“subscribed” to) by another computer on the same network.
  2. Then, connect the temperature sensor, check that you can read it correctly, and then replace the “Hello World” string with the temperature reading (Arduino Uno and LM35 temperature sensor).
  3. Install the software on the Raspberry Pi. You need to know how to create a new InfluxDB database and master a few elementary SQL commands.
  4. Install Grafana and connect it to InfluxDB using a built-in module.
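The glue code of step 3 boils down to turning each received MQTT payload into an InfluxDB point. A minimal sketch of that translation using InfluxDB’s line protocol (the measurement and tag names are my own choices, not part of any standard):

```python
import time

def to_line_protocol(temp_c, sensor="lm35", ts_ns=None):
    """Format a temperature sample as an InfluxDB line-protocol record."""
    if ts_ns is None:
        ts_ns = time.time_ns()  # InfluxDB timestamps are nanoseconds by default
    return f"temperature,sensor={sensor} value={temp_c} {ts_ns}"

# A record like this is what the Python subscriber writes into the database:
print(to_line_protocol(24.9, ts_ns=1700000000000000000))
```

In the actual project, the MQTT subscriber callback would call this (or an InfluxDB client library) for every message it receives from the Arduino.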

More ideas to go from here:

  1. You can install an MQTT client on your mobile phone, and after a short setup you can view the temperature from there (MQTT client mobile app).
  2. You can also upload the temperature readings to the cloud. It makes more sense to install the database on a big machine rather than on the Raspberry Pi, since the data volume is expected to grow over time. To stay as vendor-neutral as possible, I decided not to use the cloud providers’ IoT-ready solutions; instead, I installed a fresh Ubuntu (“ubuntu-focal-20.04” image) node in the AWS cloud (IaaS). After installing the needed software tools (in a similar way to the Raspberry Pi), the node became ready to accept the temperature data:

It is then possible to install Grafana on a local computer and connect it to the InfluxDB instance in the cloud, or to install Grafana in the cloud as well and view it using tools such as a VNC server/client.

References:

[1] Maker.io, “How To Use Basic MQTT on Arduino“.

[2] DiyIOt, “Visualize MQTT Data with InfluxDB and Grafana“.