Testing Gen AI Applications: By Pratheepan Raju


When we start thinking about Generative AI, two things come to mind: first, the GenAI model itself with its countless possibilities, and second, the application with a definite goal, purpose, or problem that needs to be met or solved by leveraging GenAI models.

The question then arises: what test strategy should be followed in such cases? This post is intended to answer that question and lay out a simple road map to follow.

We also need to remember that unlike traditional testing, where the output is fixed and predictable, GenAI models produce outputs that vary and are not predictable. LLMs produce creative responses in different ways, so the same input prompt does not always produce the same output response.

Testing Categories

Let’s look at the typical testing categories:

  • Unit Testing
  • Release Testing
  • System Testing
  • Data Quality Testing
  • Model Evaluation
  • Regression Testing
  • Non-functional Testing
  • User Acceptance Testing

Of the above categories, two are unique additions: Data Quality Testing and Model Evaluation. The other categories are generally followed for any application with a User Interface / Screen layer, a Business Layer where orchestration, logging, etc. are taken care of, and a Database Layer where the data resides. Data Quality Testing and Model Evaluation are specific to GenAI features.

LLM Testing

Let’s take a closer look at Data Quality Testing. A business application needs to work with data from its own database, not random data from elsewhere. This data is fed to the LLM, which forms an output response based on the input prompt. It is therefore vital that only this data is fed into the LLM and that the response is framed using only this data, in a human-like form. The boundary of this data needs to be validated to ensure that relevant data is returned in the response, no matter how the LLM varies its wording. A grounding check along the lines of the sketch below can enforce this.
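
The following is a minimal sketch of such a grounding check, assuming Python and a hypothetical `generate_answer` wrapper around the application's LLM call. It treats numeric values as a rough proxy for database-sourced facts and fails the test if the response contains values that are not present in the supplied context.

```python
import re

def generate_answer(prompt: str, context: str) -> str:
    """Hypothetical wrapper around the application's LLM call; replace with the real one."""
    raise NotImplementedError

def numbers_in(text: str) -> set[str]:
    """Extract numeric tokens as a rough proxy for database-sourced facts."""
    return set(re.findall(r"\d+(?:[.,]\d+)*", text))

def test_answer_uses_only_database_facts():
    # Context retrieved from the application's own database (illustrative record).
    context = "Account 1001 has a balance of 250.00 USD as of 2024-05-01."
    prompt = "What is the balance of account 1001?"
    answer = generate_answer(prompt, context)

    # The expected fact must be present...
    assert "250.00" in answer
    # ...and every number the model states must be traceable to the supplied
    # context, regardless of how the wording varies between runs.
    assert numbers_in(answer) <= numbers_in(context), (
        f"Response contains values not in the database context: "
        f"{numbers_in(answer) - numbers_in(context)}"
    )
```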

Next is Model Evaluation. There are different models available in the market from different vendors, each with unique capabilities and features. Once candidate models are chosen, the next step is to compare and score them to see which model comes closest to the expected answer or recommended solution; the sketch below shows one way to collect the responses for scoring. Model evaluation can be further categorized into Manual Evaluation and Automatic Evaluation.
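
A simple comparison harness could look like the sketch below. The model names, prompts, and the `call_model` wrapper are all assumptions, since each vendor SDK differs; the point is only to capture the same prompts against every candidate so the responses can be scored side by side.

```python
# Candidate models and evaluation prompts (placeholder names and text).
CANDIDATE_MODELS = ["model-a", "model-b", "model-c"]

EVAL_PROMPTS = [
    "Summarise the refund policy for premium customers.",
    "List the steps to reset a user's password.",
]

def call_model(model_name: str, prompt: str) -> str:
    """Hypothetical vendor-agnostic wrapper; replace with the real SDK call."""
    raise NotImplementedError

def collect_responses() -> dict[str, dict[str, str]]:
    """Return {prompt: {model: response}} ready for manual or automatic scoring."""
    results = {}
    for prompt in EVAL_PROMPTS:
        results[prompt] = {m: call_model(m, prompt) for m in CANDIDATE_MODELS}
    return results
```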

Manual Evaluation

Manual Evaluation is the gold standard, although it is a slow and costly approach. Domain experts provide detailed feedback and score the LLM outputs. Scoring could be on a scale of 1 to 5, with 1 being the lowest/no match and 5 being the best match, where the expert validates the response against a reference output. The evaluation should be done by multiple reviewers so that the scores can be compared and an agreed score reached, as in the sketch below.

Automatic Evaluation

Automatic Evaluation involves another LLM and guardrails to do the monitoring and testing, since not every request and response can be monitored manually. This approach is also useful post go-live, as it gives a view of monitoring scores on live data. Statistical evaluation techniques can also be used to collect metrics and benchmark against them; Perplexity, BLEU, BERTScore, and ROUGE are some of the methods available. Some tools in the market package these methods with dashboards for easy review, and a minimal example of computing such metrics is sketched below. Guardrails are not a testing method as such, but they ensure that some of the caveats of LLMs, such as toxicity, accuracy, bias, and hallucinations, are kept under control. Guardrail scores can also be used for evaluating LLMs.

Conclusion

As GenAI continues to evolve and the capability of the tools grows, testing boundaries need to be in place to ensure accuracy and relevance. The testing approach needs to combine manual and automatic evaluation for the best results and coverage.
