Ever since I was a child, I’ve been fascinated by drawing. What struck me was not only the act of drawing itself, but also the idea that every drawing could always be improved. I remember reaching a very high level with my drawing style. However, once I felt I had reached the peak of perfection, I would try to see how I could push the drawing even further, alas, with disastrous results.
Since then I have kept the same mantra in mind: “refine and iterate and you’ll reach perfection”. At university, my approach was to read books many times, expanding my knowledge by searching for other sources and looking for hidden layers of meaning in each concept. Today, I apply the same philosophy to AI/ML and coding.
We know that matrix multiplication (matmul for short here) is the core operation of any AI workload. Some time ago I developed LLM.rust, a Rust mirror of Karpathy’s LLM.c. The hardest part of the Rust implementation was the matrix multiplication: since fine-tuning a GPT-based model requires thousands of iterations, we need an efficient matmul. For this purpose, I had to call the BLAS library through unsafe code to get around Rust’s limits and barriers. Using unsafe goes against Rust’s philosophy, which is why I am always looking for safer ways to improve matmul in this context.
So, taking inspiration from Sam Altman’s statement – “ask GPT how to create value” – I decided to ask local LLMs to generate, benchmark, and iterate on their own algorithms to create a better, native Rust matmul implementation.
The challenge has some constraints:
- Use our local environment; in my case, a MacBook Pro M3 with 36 GB of RAM;
- Overcome the token (context) limits of the local model;
- Time and benchmark the generated code within the generation loop itself.
I know that achieving BLAS-level performance with this method is almost impossible, but I want to highlight how we can leverage AI for custom needs, even with our “tiny” laptops, so that we can unblock ideas and push boundaries in any field. This post is meant as an inspiration for practitioners and for anyone who wants to get more familiar with Microsoft Autogen and local LLM deployment.
All the code can be found in this Github repo. This is an ongoing experiment, and many changes and improvements will be committed.
General idea
The overall idea is to have a roundtable of agents. The starting point is a local Mixtral 8x7B model (MrAderMacher, Q4_K_M quantisation). From this model we create five entities:
- the Proposer comes up with a new Strassen-like algorithm, looking for a better and more efficient way to perform matmul;
- the Verifier reviews the matmul formulation through symbolic math;
- the Coder creates the underlying Rust code;
- the Tester executes it and saves all the info to the vector database;
- the Manager acts silently, controlling the overall workflow.
| Agent | Role |
| --- | --- |
| Proposer | Analyses benchmark times and proposes new tuning parameters and matmul formulations. |
| Verifier | (Currently disabled in the code.) Verifies the Proposer’s mathematical formulation through symbolic verification. |
| Coder | Takes the parameters and works out the Rust template code. |
| Tester | Runs the Rust code, saves it, and computes the benchmark timing. |
| Manager | Overall control of the workflow. |
The overall workflow can be orchestrated through Microsoft Autogen as depicted in fig.1.
[Fig. 1: The overall agent workflow orchestrated with Microsoft Autogen.]
Prepare the input data and vector database
The input data is collected from academic papers focused on matrix multiplication optimisation. Many of these papers are referenced in, or related to, DeepMind’s Strassen paper. I wanted to start simply, so I collected 50 papers, published from 2020 to 2025, that specifically address matrix multiplication.
Next, I used chroma to create the vector database. The critical aspect in generating a new vector database is how the PDFs are chunked. In this context, I used a semantic chunker. Unlike plain split-text methods, the semantic chunker uses the actual meaning of the text to determine where to cut. The goal is to keep related sentences together in one chunk, making the final vector database more coherent and accurate. The embeddings are computed with the local model BAAI/bge-base-en-v1.5. The Github gist below shows the full implementation.
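As a rough idea of what the gist does, here is a minimal sketch of a semantic-chunking pipeline feeding a chroma collection. The similarity threshold, the collection name, and the simple sentence-level splitting are illustrative assumptions, not the exact code from the gist:

```python
# Minimal sketch: semantic chunking of PDF text + storage in a chroma collection.
# Threshold, collection name, and paths are illustrative, not the gist's values.
import chromadb
import numpy as np
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")  # local embedding model

def semantic_chunks(text: str, threshold: float = 0.75) -> list[str]:
    """Group consecutive sentences for as long as their embeddings stay similar."""
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    if not sentences:
        return []
    embs = embedder.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(embs[:-1], embs[1:], sentences[1:]):
        if float(np.dot(prev, cur)) >= threshold:  # cosine similarity (vectors are normalised)
            current.append(sent)
        else:  # semantic drop: cut the chunk here
            chunks.append(". ".join(current))
            current = [sent]
    chunks.append(". ".join(current))
    return chunks

client = chromadb.PersistentClient(path="./matmul_papers_db")
collection = client.get_or_create_collection("matmul_papers")

def index_pdf(path: str) -> None:
    """Extract the text of one paper, chunk it semantically, and store it in chroma."""
    text = " ".join(page.extract_text() or "" for page in PdfReader(path).pages)
    chunks = semantic_chunks(text)
    if not chunks:
        return
    collection.add(
        ids=[f"{path}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embedder.encode(chunks, normalize_embeddings=True).tolist(),
    )
```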
The core code: autogen-core and GGML models
I have used Microsoft Autogen, in particular the autogen-core variant (version 0.7.5). Unlike the higher-level chat API, autogen-core gives us access to the low-level, event-driven building blocks that are necessary to create the state-machine-driven workflow we need. As a matter of fact, the challenge is to maintain a strict workflow: all the agents must act in a specific order, Proposer → Verifier → Coder → Tester.
The core part is the BaseMatMulAgent, which inherits from AutoGen’s RoutedAgent. This base class allows us to standardise how the LLM agents take part in the chat and how they behave.
From the code above, we can see the class is designed to participate in an asynchronous group chat, handling conversation history, calling external tools, and generating responses through the local LLM.
The core component is @message_handler, a decorator that registers a method as a listener, or subscriber, based on the message type. The decorator automatically detects the type hint of the method’s first argument (in our case message: GroupChatMessage) and subscribes the agent to receive any event of that type sent to the agent’s topic. The handle_message async method is then responsible for updating the agent’s internal memory without generating a response.
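A stripped-down sketch of this pattern looks as follows; the GroupChatMessage dataclass and the memory handling are simplified placeholders, not the exact classes from the repo:

```python
# Stripped-down sketch of the base agent pattern; GroupChatMessage and the
# memory handling are simplified placeholders, not the exact repo code.
from dataclasses import dataclass

from autogen_core import MessageContext, RoutedAgent, message_handler

@dataclass
class GroupChatMessage:
    source: str   # which agent produced the message
    content: str  # the text of the message

class BaseMatMulAgent(RoutedAgent):
    def __init__(self, description: str) -> None:
        super().__init__(description)
        self._chat_history: list[GroupChatMessage] = []

    @message_handler
    async def handle_message(self, message: GroupChatMessage, ctx: MessageContext) -> None:
        # Each agent keeps its own copy of the conversation; no response is
        # generated here, the reply logic lives in the concrete agents.
        self._chat_history.append(message)
```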
With the listener-subscriber mechanism in place, we can focus on the Manager class. The MatMulManager inherits from RoutedAgent and orchestrates the overall agent flow.
The code above handles all the agents. We are skipping the Verifier part for the moment. The Coder publishes the final code, and the Tester takes care of saving both the code and the whole context to the vector database. In this way, we avoid consuming all the tokens of our local model: at each new run, the model catches up on the latest generated algorithms from the vector database and proposes a new solution.
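Conceptually, the Manager just forwards each agent’s output to the next topic in the chain. A simplified sketch of that routing, assuming the GroupChatMessage dataclass from the sketch above (the topic names and the routing table are illustrative), could look like this:

```python
# Simplified sketch of the strict routing enforced by the manager;
# topic names and the routing table are illustrative.
from autogen_core import MessageContext, RoutedAgent, TopicId, message_handler

class MatMulManager(RoutedAgent):
    # Verifier is skipped for now, so the chain collapses to Proposer -> Coder -> Tester
    _next_topic = {"proposer": "coder", "coder": "tester"}

    def __init__(self) -> None:
        super().__init__("Orchestrates the Proposer -> Coder -> Tester loop")

    @message_handler
    async def route(self, message: GroupChatMessage, ctx: MessageContext) -> None:
        next_agent = self._next_topic.get(message.source)
        if next_agent is None:
            return  # the Tester is the last step: its output ends up in the vector DB
        await self.publish_message(
            message, topic_id=TopicId(type=next_agent, source=self.id.key)
        )
```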
A very important caveat: to make sure autogen-core can work with llama models on macOS, use the following snippet:
#!/bin/bash
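# Rebuild llama-cpp-python from source with the Metal backend enabled, so GGUF models can use the Apple Silicon GPU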
CMAKE_ARGS="-DGGML_METAL=on" FORCE_CMAKE=1 pip install --upgrade --verbose --force-reinstall llama-cpp-python --no-cache-dir
Fig.2 summarises the entire code. We can roughly subdivide the code into 3 main blocks:
- The BaseAgent, which handles messages through the LLM agents, evaluating the mathematical formulation and generating code;
- The MatMulManager, which orchestrates the entire agents’ flow;
- autogen_core.SingleThreadedAgentRuntime, which allows us to make the entire workflow a reality (a minimal setup sketch follows this list).
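Wiring it all together boils down to a few lines. The sketch below assumes the BaseMatMulAgent and MatMulManager classes (and the GroupChatMessage dataclass) from the earlier sketches; the agent type names and the kick-off message are illustrative:

```python
# Simplified sketch of registering the agents and starting the runtime;
# agent/topic names and the kick-off message are illustrative.
import asyncio

from autogen_core import SingleThreadedAgentRuntime, TopicId, TypeSubscription

async def main() -> None:
    runtime = SingleThreadedAgentRuntime()

    # Register each worker agent under its own type and subscribe it to a topic of the same name
    for name in ("proposer", "coder", "tester"):
        await BaseMatMulAgent.register(runtime, name, lambda n=name: BaseMatMulAgent(f"{n} agent"))
        await runtime.add_subscription(TypeSubscription(topic_type=name, agent_type=name))
    await MatMulManager.register(runtime, "manager", MatMulManager)
    await runtime.add_subscription(TypeSubscription(topic_type="manager", agent_type="manager"))

    runtime.start()
    # Kick off the loop by asking the Proposer for a first matmul formulation
    await runtime.publish_message(
        GroupChatMessage(source="manager", content="Propose a new matmul algorithm"),
        topic_id=TopicId(type="proposer", source="run-0"),
    )
    await runtime.stop_when_idle()

asyncio.run(main())
```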
[Fig. 2: autogen_core.SingleThreadedAgentRuntime makes all of this work on our MacBook Pro. Image created with Nano Banana Pro.]
Results and benchmark
All the Rust code has been revised and re-run manually. While the workflow is robust, working with LLMs requires a critical eye. Several times the model confabulated*, generating code that looked optimised but failed to perform the actual matmul work.
The very first iteration generates a sort of Strassen-like algorithm (“Run 0” in fig. 3):
The model then reasons about better, more NEON-oriented Rust implementations, so that after four iterations it produces the following code (“Run 3” in fig. 3):
We can see the usage of functions like vaddq_f32, a CPU-specific intrinsic for ARM processors coming from std::arch::aarch64. The model manages to use rayon to split the work across multiple CPU cores, and inside the parallel threads it uses NEON intrinsics. The code itself is not totally correct; moreover, I noticed that it runs into an out-of-memory error when dealing with 1024×1024 matrices. I had to rework the code manually to make it run.
This brings us back to my mantra, “iterating to perfection”, and we can ask ourselves: can a local agent autonomously refine Rust code to the point of mastering complex NEON intrinsics? The findings show that, even on consumer hardware, this level of optimisation is achievable.
Fig. 3 shows the final results I’ve obtained after each iteration.
[Fig. 3: Benchmark timings of the generated matmul code after each iteration.]
The 0th and 2nd benchmarks have some errors, as it is physically impossible to achieve such results for a 1024×1024 matmul on a CPU (a full 1024×1024 multiplication requires about 2·1024³ ≈ 2.1 billion floating-point operations):
- the first code suffers from a diagonal fallacy: it computes only the diagonal blocks of the matrix and ignores the rest;
- the second code has a broken buffer: it repeatedly overwrites a small, cache-hot buffer of 1028 floats rather than traversing the full one-million-element result.
However, the workflow produced two pieces of real code, run 1 and run 3. Run 1 achieves 760 ms and constitutes a real baseline; it suffers from cache misses and a lack of SIMD vectorisation. Run 3 records 359 ms; the improvement comes from NEON SIMD and Rayon parallelism.
*: I wrote “the model confabulates” on purpose. From a medical point of view, LLMs do not hallucinate but confabulate: hallucination is a totally different phenomenon from what LLMs do when they babble and generate “wrong” answers.
Conclusions
This experiment started with a question that seemed an impossible challenge: “can we use consumer-grade local LLMs to discover high-performance Rust algorithms that can compete with BLAS implementations?”.
We can say yes, or at least we now have a valid and solid foundation on which to build better code and move towards a fully BLAS-like implementation in Rust.
The post showed how to interact with Microsoft Autogen, in particular autogen-core, and how to create a roundtable of agents.
The base model in use is a GGUF-quantised model, and it can run on a MacBook Pro M3 with 36 GB of RAM.
Of course, we haven’t (yet) found anything better than BLAS in a single, simple piece of code. However, we showed that a local agentic workflow, on a MacBook Pro, can achieve what was previously thought to require a massive cluster and massive models. Eventually, the model managed to find a reasonable Rust-NEON implementation (“Run 3” above) that gives a speed-up of over 50% on a standard Rayon implementation. We must highlight that the backbone of the implementation was AI generated.
The frontier is open. I hope this blog post inspires you to explore which limits we can overcome with local LLM deployment.
I am writing this in a personal capacity; these views are my own.