
Beyond Code Generation: Continuously Evolve Text with LLMs


What do you do when the initial response from an LLM doesn’t suit you? You rerun it, right? Now, if you were to automate that…

# Keep regenerating until the evaluation accepts the response
success = False
while not success:
    response = prompt.invoke()
    success = evaluate(response)

Alright, something like that. People have done it for code, and the same applies to non-code if the evaluate() function is suitable. Nowadays, you can use LLMs for both content generation and evaluation. However, a simple while loop that waits for the best random result is not always good enough. Sometimes you need to modify the prompt. Experiment and mix things up, and keep track of what works and what doesn’t. Follow different ideation paths to keep your options open…
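
Here is roughly what a slightly smarter version of that loop could look like: one that also mutates the prompt and keeps a record of which attempts scored well. This is an illustrative sketch only; generate, evaluate_text, and mutate_prompt are hypothetical stand-ins for your own LLM calls, not part of any library.

import random

def evolve_text(base_prompt, generate, evaluate_text, mutate_prompt, iterations=20):
    history = []                                    # keep track of what works and what doesn't
    best = None
    prompt = base_prompt
    for _ in range(iterations):
        response = generate(prompt)                 # ask the LLM
        score = evaluate_text(response)             # numeric feedback, e.g. 0.0 to 1.0
        history.append((prompt, response, score))
        if best is None or score > best[2]:
            best = (prompt, response, score)
        # Branch off one of the better earlier attempts instead of always the latest one
        parent_prompt, _, parent_score = random.choice(sorted(history, key=lambda h: h[2])[-3:])
        prompt = mutate_prompt(parent_prompt, parent_score)
    return best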

In this article, we will discuss how OpenEvolve [1], an open-source implementation of Google’s AlphaEvolve paper [2], can be used for content creation. In the background, it applies this “experiment and mix, follow different paths” approach to optimize the LLM prompts.

The AlphaEvolve paper applied an evolutionary system to code generation with LLMs. Read more about the exciting, brand-new results of this paper in my article, Google’s AlphaEvolve: Getting Started with Evolutionary Coding Agents. In essence, in a survival-of-the-fittest scheme, programs are mixed and improved upon. The authors suggest that these evolutionary coding agents can achieve research breakthroughs, and they present several results.

Because content can be so many things, I think there is potential for high-value content creation other than code that benefits from such a long-running, continuous evolution process. In this article, we explore how to apply the same technology to a non-code use case where LLMs, rather than algorithms, judge the results of the LLM-generated solution. We also discuss how to examine the results.

Prerequisites

First, let’s prepare a quick, basic setup.

LLM server

In order to use OpenEvolve, you will need access to an LLM server with OpenAI-compatible API endpoints. You can register with Cerebras (they have a free tier), OpenAI, Google Gemini, or a similar service. Alternatively, if you have a capable GPU, you can set up your own server, for example with ollama. You will need to pick at least two different LLM models, a weak one (e.g., 4bn parameters) and a strong one (e.g., 17bn parameters).

Python environment & git

I presume that you are running a Linux system with a prepared Python environment, in which you can create virtual environments and install packages from the Python Package Index.

OpenEvolve setup

Install OpenEvolve, then prepare your own project & prompt folders:

git clone https://github.com/codelion/openevolve.git
cd openevolve
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
mkdir -p examples/my_project/prompts

A little warning: OpenEvolve is currently a research project. Its code base is still developing quickly. Therefore, it is a good idea to follow all updates closely.

Configuration

Create the file examples/my_project/config.yaml:

checkpoint_interval: 1

# LLM configuration
llm:
  models:
    - name: "llama3.1-8b"
      weight: 0.8
      temperature: 1.5
    - name: "llama-4-scout-17b-16e-instruct"
      weight: 0.2
      temperature: 0.9
  evaluator_models:
    - name: "llama-4-scout-17b-16e-instruct"
      weight: 1.0
      temperature: 0.9
  api_base: "https://api.cerebras.ai/v1/" # The base URL of your LLM server API

# Prompt configuration
prompt:
  template_dir: "examples/my_project/prompts"
  num_top_programs: 0
  num_diverse_programs: 0

# Database configuration
database:
  num_islands: 3

# Evaluator configuration
evaluator:
  timeout: 60
  cascade_evaluation: false
  use_llm_feedback: true
  llm_feedback_weight: 1.0 # (Non-LLM metrics are weighed with a factor of 1)

diff_based_evolution: true
allow_full_rewrites: false

To get a general idea of what you are configuring here, consider how new solutions are generated and evaluated in OpenEvolve. Solutions consist of their respective text content and are stored in a database alongside their evaluation metrics and “side channel” textual results (e.g., errors during execution or textual improvement suggestions). The database also stores a list of elite programs and programs that perform particularly well on different metrics (MAP-Elites) to be able to provide inspirations for new solutions. An LLM generates these new, mutated solutions based on a single parent. Programmatic and/or LLM evaluators then judge the new solution before feeding it back into the database.

The OpenEvolve generation and evaluation flow: sample a parent and inspirations, generate a new child, evaluate it, and store it in the same island as the parent. (Image by author)
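
To make this flow more tangible, here is a highly simplified conceptual sketch of one iteration. It is not OpenEvolve’s actual implementation (which adds MAP-Elites binning, migration between islands, asynchronous workers, and more); the function and variable names are illustrative only.

import random

def run_iteration(islands, generate_child, evaluators):
    # Conceptual sketch only: pick an island, mutate one parent, evaluate, store.
    island = random.choice(islands)        # islands evolve largely independently
    parent = random.choice(island)         # a single parent per new child
    child_text = generate_child(parent)    # an LLM produces the mutated solution
    metrics, artifacts = {}, {}
    for evaluate in evaluators:            # programmatic and/or LLM evaluators
        m, a = evaluate(child_text)
        metrics.update(m)
        artifacts.update(a)                # textual "side channel" feedback
    island.append({"text": child_text, "metrics": metrics, "artifacts": artifacts})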

The configuration options include:

  • llm: models, evaluator_models
    For generation and evaluation, you can configure any number of models.
    The idea behind using multiple models is to combine a fast (weak) model that quickly explores many options with a slower (stronger) model that adds quality. For generation, the weight parameter controls the probability that each model is selected in an iteration; only one model is used per iteration, never several at once. For evaluation, all models are executed every time, and their output metrics are weighed with the specified parameter.
    The temperature setting influences how randomly these models behave. A value of 1.5 is very high, and 0.9 is still high. For this creative use case, I think these are good choices; for business content or code, use lower values. The OpenEvolve default setting is 0.7.
  • prompt: template_dir
    The template_dir option specifies the directory that contains the prompt templates that are used to overwrite the defaults. See below for more information on the folder’s contents.
  • database: num_top_programs, num_diverse_programs
    The prompts for generating new solutions can include inspirations from other programs in the database. With a value of 0, I turned this function off, because I found that the inspirations — which do not include the content itself, rather just metrics and change summary — were not too useful for creative content evolution.
  • database: num_islands controls how many separate sub-populations are maintained in the database. The more islands you use, the more diverging solution paths will result, whereas within the same island you will observe fewer substantial variations. For creative use cases, if you have enough time and resources to run many iterations, it may be beneficial to increase the number of islands.
  • evaluator: llm_feedback_weight
    The metrics generated by the evaluation LLMs are multiplied by this parameter. The numeric average over these and the algorithmically generated metrics is then used to find the best program. Say the generated metrics were
    length: 1.0
    llm_correctness: 0.5
    llm_style: 0.7

    with an llm_feedback_weight of 1.0, the overall score would be (1.0 + 0.5*1.0 + 0.7*1.0) / 3 ≈ 0.73 (see the short sketch after this list).
  • diff_based_evolution / allow_full_rewrites:
    Two different prompt approaches for the generator LLM are supported. In the diff mode, the LLM uses a search-and-replace response format to replace specific elements in the current solution. In the full_rewrite mode, the LLM simply outputs a full rewrite. The latter mode is less demanding for less capable LLMs, but it is also less suitable for long content. Quality is also better with diff mode, based on my tests.
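
To make the weighting concrete, the following minimal sketch computes the combined score from the llm_feedback_weight example above. It mirrors the description in this list, not OpenEvolve’s actual scoring code, and it simply assumes that the LLM-generated metrics are the ones prefixed with llm_.

def combined_score(metrics, llm_feedback_weight=1.0):
    # LLM metrics are scaled by llm_feedback_weight, algorithmic metrics by 1.0,
    # and the result is the plain average over all metrics.
    weighted = [
        value * (llm_feedback_weight if name.startswith("llm_") else 1.0)
        for name, value in metrics.items()
    ]
    return sum(weighted) / len(weighted)

print(combined_score({"length": 1.0, "llm_correctness": 0.5, "llm_style": 0.7}))  # ~0.73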

For more options, refer to configs/default_config.yaml.

Prompts

OpenEvolve’s default prompts are written for code evolution and are therefore not well suited for non-code generation. Fortunately, we can overwrite them. The default prompts are encoded in the file openevolve/prompt/templates.py.

Create the following files and adapt the prompts to match your use case. Let’s try a simple example for creating poems.

Initial placeholder content: examples/my_project/initial_content.txt

No initial poem, invent your own.

The initial content represents the “first generation” parent; it affects its offspring, the second-generation solutions.
For the initial content, you could provide an existing version or an empty placeholder text. You could also provide specific instructions, such as “Make sure it mentions cats,” to guide the initial generation in a desired direction. If you need more general context for all generations, include it in the system prompt.

The system prompt: examples/my_project/prompts/system_message.txt

You are a Shakespeare level poem writer, turning content into beautiful poetry and improving it further and further.

The system prompt just sets the general context for your generator model so it knows what your use case is all about. In this example, we are not creating code, we are writing poems.

User prompt for content generation: examples/my_project/prompts/diff_user.txt

# Current Solution Information
- Current performance metrics: {metrics}
- Areas identified for improvement: {improvement_areas}

{artifacts}

# Evolution History
{evolution_history}

# Current Solution
```
{current_program}
```

# Task
Suggest improvements to the answer that will lead to better performance on the specified metrics.

You MUST use the exact SEARCH/REPLACE diff format shown below to indicate changes:

<<<<<< SEARCH
text to find in the current solution (must match exactly)
======
replacement text
>>>>>> REPLACE

Example of valid diff format:

<<<<<< SEARCH
A line of the current poem that falls flat
======
A reworked line with stronger imagery
>>>>>> REPLACE

You can suggest multiple changes. Each SEARCH section must exactly match text in the current solution. If the solution is a blank placeholder, make sure to respond with exactly one diff replacement -- searching for the existing placeholder string, replacing it with your initial solution.

The content generation user prompt is very general. It contains several placeholders that will be replaced with content from the solution database, including the evaluation results of the parent program. This prompt illustrates how the evolution process influences the generation of new solutions.
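
These templates are filled in essentially like Python format strings (note how the literal JSON braces in the evaluator prompt further below are escaped as double braces). The following snippet only illustrates that mechanism with made-up values; it is not OpenEvolve’s actual rendering code.

from pathlib import Path

template = Path("examples/my_project/prompts/diff_user.txt").read_text()
user_prompt = template.format(
    metrics="beauty: 0.6, creativity: 0.5, overall_score: 0.55",
    improvement_areas="Strengthen the imagery in the second line.",
    artifacts="Evaluator note: the rhythm breaks in line 3.",
    evolution_history="(previous attempts omitted)",
    current_program="No initial poem, invent your own.",
)
print(user_prompt)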

User prompt for content generation without the diff method: examples/my_project/prompts/full_rewrite.txt

# Current Solution Information
- Current metrics: {metrics}
- Areas identified for improvement: {improvement_areas}

{artifacts}

# Evolution History
{evolution_history}

# Current Solution
```
{current_program}
```

# Task
Rewrite the answer to improve its performance on the specified metrics.
Provide the complete new answer. Do not add reasoning, changelog or comments after the answer!

# Your rewritten answer here

Prompt fragment for the evolution history: examples/my_project/prompts/evolution_history.txt

## Previous Attempts

{previous_attempts}

## Top Performing Solution

{top_programs}

Prompt fragment for the top programs: examples/my_project/prompts/top_programs.txt

### Solution {program_number} (Score: {score})
```
{program_snippet}
```
Key features: {key_features}

System prompt for the evaluator: examples/my_project/prompts/evaluator_system_message.txt

You are a Shakespeare level poem writer and are being asked to review someone else's work.

This system prompt for the evaluator models is essentially the same as the system prompt for the generator LLM.

User prompt for the evaluator: examples/my_project/prompts/evaluation.txt

Evaluate the following poem:
1. Beauty: Is it beautiful?
2. Inspiring: Is its message inspired and meaningful?
3. Emotion: Does the poem trigger an emotional response?
4. Creativity: Is it creative?
5. Syntax: Is its syntax good? Is it only a poem or does it also contain non-poem content (if yes, rate as 0)? Are its lines overly long (if yes, rate low)?
6. Overall score: Give an overall rating. If the poem, syntax, or length checks above were not okay, give a low overall score.

For each metric, provide a score between 0.0 and 1.0, where 1.0 is best.

Answer to evaluate:
```
{current_program}
```

Return your evaluation as a JSON object with the following format:
{{
    "beauty": score1,
    "inspiring": score2,
    "emotion": score3,
    "creativity": score4,
    "syntax": score5,
    "overall_score": score6,
    "improvement_suggestion": "..",
}}
Even for invalid input, return nothing but the JSON object.

This is where the magic happens. In this prompt, you must think of metrics that represent what you are optimizing. What determines whether the content is good or bad? Correctness? Humor? Writing skill? Decide what is important to you, and encode it wisely. This may take some experimentation before you see the evolution converge the way you intended. Play around as you observe the evolution of your content (more on that below).

Be careful: every metric is rated equally, and the LLM metrics are then multiplied by the llm_feedback_weight factor from your config.yaml. It is also a good idea to keep an overall_score metric that summarizes the big-picture evaluation, so you can sort the generated solutions by it.

The improvement_suggestion is a textual recommendation from the evaluator LLM. It will be stored along with the metrics in the database and provided to the generator LLM when this solution is used as a parent, as part of the {artifacts} placeholder you saw above. (Note: As of this writing, textual LLM feedback is still a pull request under review in the OpenEvolve codebase; be sure to use a version that supports it.)

The evaluator program

OpenEvolve was designed for code generation with algorithmic evaluators. Although it is difficult to write an algorithm that judges the beauty of a poem, we can still design a useful algorithmic evaluation function for our content generation use case. For instance, we can define a metric that targets a particular number of lines or words.

Create a file examples/my_project/evaluator.py:

from openevolve.evaluation_result import EvaluationResult


def linear_feedback(actual, target):
    deviation = abs(actual - target) / target
    return 1 - min(1.0, deviation)


def evaluate_stage1(file_path):
    # Read in file_path
    with open(file_path, 'r') as file:
        content = file.read()

    # Count lines and words
    lines = content.splitlines()
    num_lines = len(lines)
    num_words = sum(len(line.split()) for line in lines)

    # Target length
    line_target = 5
    word_target = line_target*7

    # Linear feedback between 0 (worst) and 1 (best)
    line_rating = linear_feedback(num_lines, line_target)
    word_rating = linear_feedback(num_words, word_target)
    combined_rating = (line_rating + word_rating) / 2

    # Create textual feedback
    length_comment_parts = []

    # Line count feedback
    line_ratio = num_lines / line_target
    if line_ratio > 1.2:
        length_comment_parts.append("Reduce the number of lines.")
    elif line_ratio < 0.8:
        length_comment_parts.append("Increase the number of lines.")

    # Words per line feedback
    words_per_line_ratio = (num_words / max(num_lines, 1)) / (word_target / line_target)
    if words_per_line_ratio > 1.2:
        length_comment_parts.append("Reduce the number of words per line.")
    elif words_per_line_ratio < 0.8:
        length_comment_parts.append("Increase the number of words per line.")

    length_comment = " ".join(length_comment_parts)

    # Return the numeric metric plus textual feedback ("artifacts") for the generator LLM.
    # The metric and artifact key names are this example's own choices.
    return EvaluationResult(
        metrics={"length_good": combined_rating},
        artifacts={"length_recommendation": length_comment},
    )


def evaluate(file_path):
    return evaluate_stage1(file_path)

This code does two things:
First, it creates a metric value that quantifies the quality of the response length: the score drops when the response is too short or too long and reaches 1 when the length is just right.
Second, it prepares textual feedback that the LLM can intuitively act on, so it knows what to change without being lured into a fixed idea of what to do whenever the length is off (for example, mistakenly concluding “I need to write more.. and more..”).
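
To get a feel for how the length score behaves, here is a tiny standalone demo of the linear_feedback helper from the evaluator above, using the line target of 5:

def linear_feedback(actual, target):
    # Same helper as in evaluator.py above
    deviation = abs(actual - target) / target
    return 1 - min(1.0, deviation)

print(linear_feedback(5, 5))   # 1.0: exactly on target
print(linear_feedback(6, 5))   # 0.8: 20% deviation from the 5-line target
print(linear_feedback(10, 5))  # 0.0: 100% deviation
print(linear_feedback(12, 5))  # 0.0: deviation beyond 100% is capped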

Data review: Evolution at play

Run the evolution process:

source .venv/bin/activate
export OPENAI_API_KEY="sk-.."
python3 openevolve-run.py \
    examples/my_project/initial_content.txt \
    examples/my_project/evaluator.py \
    --config examples/my_project/config.yaml \
    --iterations 9

It is best to begin with only a few iterations and analyze the results closely to ensure everything is functioning properly. To do so, start the visualization web server and observe in real time:

python3 scripts/visualizer.py

Or, if you have a specific past checkpoint that you wish to analyze, open it with:

python3 scripts/visualizer.py --path examples/my_project/openevolve_output/checkpoints/checkpoint_2

When rerunning your tests after making improvements, be sure to move the current checkpoint folders out of the way before starting over:

mkdir -p examples/my_project/archive
mv examples/my_project/openevolve_output/ examples/my_project/archive/

If everything is configured properly, you should see an evolution of improving results. (Image by author)

In the visualization front end, click the nodes to see the associated current solution text, as well as all of their metrics, prompts and LLM responses. You can also easily click through children in the sidebar. Use the yellow locator button if you get lost in the graph and can’t see a node. By observing the prompts, you can trace how the evaluation response from a parent affects the generation user prompt of the child. (Note: As of this writing, prompt & response logging is still a pull request under review in the OpenEvolve codebase; be sure to use a version that supports it.)

If you are interested in comparing all solutions by a specific metric, select it from the top bar:

The metrics select box shows all the metrics produced by your evaluator.py logic and evaluation.txt prompt. With it, you can change the metric used to determine the radii of the nodes in the graph. (Image by author)
  • The node colors represent the islands, in which evolution takes place largely separately (if you run it long enough!) and in different directions. Occasionally, depending on the migration parameters in the configuration, individuals from one island can be copied over into another.
  • The size of each node indicates its performance on the currently selected metric.
  • The edges in the visualization show which parent was modified to produce the child. This clearly has the strongest influence on the descendant.

In fact, the AlphaEvolve algorithm incorporates learnings from several previous programs in its prompting (configurable top-n programs). The generation prompt is augmented with a summary of previous changes and their influence on the resulting metrics. This “prompt crossover” is not visualized. Also not visualized are the relations of “clones”: When a solution migrates to another island, it is copied with all of its data, including its ID. The copy shows up as an unlinked element in the graph.

In any case, the best solution will be saved to examples/my_project/openevolve_output/best/best_program.txt:

In silken moonlight, where night’s veil is lifted,
A constellation of dreams is gently shifted,
The heart, a canvas, painted with vibrant hues,
A symphony of feelings, in tender Muse.

Can I…

  • ..use my own start prompt?
    Yes! Just put the solution you already have in your initial_content.txt.
  • ..not create my own start prompt?
    Yes! Just put a placeholder like “No initial poem, invent your own. Make sure it mentions cats.” in your initial_content.txt.
  • ..not write any code?
    Yes! If you don’t want an algorithmic evaluator, put a stub in your evaluator.py like this:
def evaluate_stage1(file_path):
    return {}
def evaluate(file_path):
    return evaluate_stage1(file_path)
  • …use a local or non-OpenAI LLM?
    Yes, as long as it is compatible with the OpenAI API! In your config.yaml, change llm: api_base: to a value like "http://localhost:11434/v1/" for a default ollama configuration. On the command line, set your API key before calling the Python program:
export OPENAI_API_KEY="ollama"

Final thought

This article described an experiment with LLM feedback in the context of evolutionary algorithms. I wanted to enable and explore this use case because the AlphaEvolve paper itself hinted at it and mentioned that the authors had not yet optimized for it. This is only the beginning: the right use cases, where this comparatively high effort for content generation is worth it, still need to be identified, and more experiments need to follow. Hopefully, all of this will become easier to use in the future.

Real-life results: In practice, I find that improvements across all metrics are observable up to a certain point. However, it is difficult to obtain good numeric metrics from an LLM because its ratings are not fine-grained and therefore quickly plateau. Better prompts, especially for the evaluator, could possibly improve on this. Either way, the combination of algorithmic and LLM evaluation with a powerful evolutionary algorithm and many configuration options makes the overall approach very effective.

To generate more exciting LLM metrics that justify the long-running evolution, multi-stage LLM evaluator pipelines could be incorporated. These pipelines could summarize content and ensure the presence of certain facts, among other things. By calling these pipelines from the evaluator.py file, this is possible right now within OpenEvolve.
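
As a rough illustration of that idea, an evaluator.py could itself call an OpenAI-compatible endpoint, first summarizing the content and then checking the summary for required facts, and fold the outcome into the returned metrics. This is only a sketch under assumptions: the model name, prompts, and required-facts list are placeholders, and a real pipeline would need error handling for malformed responses.

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY (and optionally OPENAI_BASE_URL) from the environment

REQUIRED_FACTS = ["mentions cats"]  # placeholder requirement


def ask(prompt):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def evaluate(file_path):
    with open(file_path) as f:
        content = f.read()
    # Stage 1: compress the content so the checking step stays focused
    summary = ask(f"Summarize the following text in two sentences:\n{content}")
    # Stage 2: check the summary against the required facts
    verdict = ask(
        'Respond with a JSON object {"facts_present": <number between 0.0 and 1.0>} only. '
        f"Which fraction of these facts appear in the summary?\nFacts: {REQUIRED_FACTS}\nSummary: {summary}"
    )
    score = float(json.loads(verdict).get("facts_present", 0.0))  # no error handling in this sketch
    return {"facts_present": score}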

With knowledge bases and tools, the capabilities of such evolutionary systems that incorporate LLM feedback can be extended further. An exciting addition for OpenEvolve could be the support for MCP servers in the future, but again, in the evaluator.py file you could already make use of these to generate feedback.

This whole approach could also be applied with multi-modal LLMs, or with a separate backend LLM that generates the actual content in a different modality and is prompted by the evolutionary system. Existing MCP servers could generate images, audio, and more. As long as we have an LLM suitable for evaluating the result, we can then refine the prompt to generate new, improved offspring.

In summary, there are many more experiments within this exciting framework waiting to be done. I look forward to your responses and am eager to see the outcome of this. Thanks for reading!

References

  1. Asankhaya Sharma, OpenEvolve: Open-source implementation of AlphaEvolve (2025), GitHub
  2. Novikov et al., AlphaEvolve: A Gemini-Powered Coding Agent for Designing Advanced Algorithms (2025), Google DeepMind
