Continuous evaluation
A woman watches potatoes move down a conveyor belt (US National Archives, recoloured by deepai.org)
Last week my colleagues at the Incubator for AI and I put on a one-day conference about retrieval-augmented generation. We called it RAGtime, of course.
We pitched it differently from most AI events, in that it was neither a hackathon, nor an academic conference, nor a what-shall-we-do-about-AI kind of conference. It was specifically aimed at people building production RAG systems.
Why? Because it’s become clear that prompt engineering and hope will not be enough to make our chatbots obedient.
RAG, as Robert Smith from our hosts Digital Catapult observed in his keynote, is “the most pilotable technology ever”. That’s the pithiest expression I’ve heard of the fact that most RAG demos promise the world and most (most!) RAG deployments deliver nothing solid, or worse.
Fly-branded ointment
In a conventional software system, non-deterministic processes—processes whose outcome it’s not possible to predict—are flies in the ointment, to be tweezered out at the first opportunity. Non-determinism means instability. Instability means bugs.
But for a system based on GenAI, the presence of the fly is the unique selling point.
It is pointless to try to make the system deterministic, so the only reasonable path to stability is to find mitigations.
I came away from RAGtime convinced that the best place to start on that is with the established patterns of aggressive test and deployment automation that the DevOps movement discovered for increasing quality and resilience in software systems.
In line with continuous integration and continuous delivery (CI/CD), it seems we’ll call these practices “continuous evaluation”. And critically, the evaluation is not of models, but of products.
See no eval
One of the talks at RAGtime came from Henry Scott-Green at context.ai, who recently published a post called The Ultimate Guide to LLM Product Evaluation. It’s well worth your time. Henry’s compelling thesis is that “evals” as practised by model builders don’t cover the same things we care about when we’re building and operating products.
For instance, even if (maybe especially if) the foundation model has got “better”, you still need to ask questions like: if I change this prompt to fix this corner case I’ve just found, what will happen to the corner case I changed it to fix last week? Or: I’ve just doubled the size of my RAG corpus—do I continue to get accurate responses on the topics I did before? Or: is the LLM behind the API I call changing over time in a way that affects my outputs?
Henry’s approach treats evaluation as another strand of CI/CD: an accreting, repetitive process that runs on every change and every deployment. For instance, teams using context.ai evolve a set of standard model responses in the course of engineering their application; these become a ground truth for what they expect to see when the corresponding prompts are re-run on every deployment.
Busy, and bound to be incomplete? Yes. But then so is CI, and CI works.
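To make that concrete, here is a minimal sketch of what one such check might look like as a CI step. This is not context.ai’s API: `ask_product`, the file name and the pass criterion are all assumptions standing in for whatever your own pipeline does.

```python
# A minimal sketch of a continuous-evaluation gate, run on every deployment.
# Everything here is illustrative: ask_product() stands in for your real RAG
# pipeline, and the "required phrases" check is the crudest possible judge.

import json


def ask_product(prompt: str) -> str:
    """Placeholder for the real product call (retrieval + LLM)."""
    # In CI this would hit the deployed application, not return a canned string.
    return "You can cancel your subscription from the account settings page."


def passes(answer: str, required_phrases: list[str]) -> bool:
    """Crude ground-truth check: every required phrase must appear in the answer."""
    return all(phrase.lower() in answer.lower() for phrase in required_phrases)


def run_evals(path: str = "evals.jsonl") -> int:
    """Re-run every accumulated eval case and return the number of regressions."""
    regressions = 0
    with open(path) as f:
        for line in f:
            case = json.loads(line)  # e.g. {"prompt": "...", "required": ["cancel", "account settings"]}
            if not passes(ask_product(case["prompt"]), case["required"]):
                regressions += 1
    return regressions


if __name__ == "__main__":
    failed = run_evals()
    if failed:
        raise SystemExit(f"{failed} eval case(s) regressed; blocking this deployment")
```

In practice the pass criterion would be richer than phrase matching, perhaps semantic similarity or a judge model, but the shape is the same: a file of cases that accretes with every incident, and a gate that runs on every change.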
Non-determinism budgets
Getting this right is going to be a huge data engineering, automated QA and analytics challenge. It will also be expensive, and at least at first, it will be inefficient. It will probably itself depend on LLMs to some extent to do the evaluation—complexity begetting complexity all the way down. We will probably have to run those LLMs ourselves, so we can be confident that they aren’t changing. We will inevitably waste a lot of time and energy.
But once we can quantify our exposure to non-determinism, we can get to work coming up with useful tools like, say, non-determinism budgets. What percentage of answers will you tolerate being wrong? What sort of answers? What range of wrongness do you expect? How exactly are you trading off range vs depth of knowledge?
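To illustrate, and only as a sketch (the categories and thresholds are invented for the example, not recommendations), a non-determinism budget could be as simple as a table of failure rates you are willing to tolerate, checked against a sample of graded answers before a deploy is allowed through.

```python
# Illustrative only: the category names and limits are made up for the example.
NON_DETERMINISM_BUDGET = {
    "factual_error": 0.01,     # at most 1% of sampled answers may state something false
    "wrongful_refusal": 0.03,  # at most 3% may refuse a question they should answer
    "off_topic": 0.05,         # at most 5% may wander off the corpus entirely
}


def within_budget(failure_counts: dict[str, int], sample_size: int) -> bool:
    """True if every failure category stays at or under its budgeted rate."""
    return all(
        failure_counts.get(category, 0) / sample_size <= limit
        for category, limit in NON_DETERMINISM_BUDGET.items()
    )


# Example gate: 500 sampled answers, graded (by humans, or by a judge model you
# run yourself so that it cannot drift underneath you).
if not within_budget({"factual_error": 4, "off_topic": 11}, sample_size=500):
    raise SystemExit("Non-determinism budget exceeded; blocking deploy")
```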
Analytical data systems which tell you actually useful things about what your users are doing are hard enough to implement on their own. Getting an understanding of LLM behaviour in that context will require collecting and repeatedly testing on a huge amount of additional data, much of it wildly unstructured.
But making those feedback loops is where all the value lies. It enables us to build the hypotheses that are the bedrock of product development—if the system does X, then the real-world outcome will change by Y.
And it enables you to prove that the change you make to solve today’s problem is not causing the system to go backwards on the problem you solved yesterday, because the model becomes more and more constrained over time.
Trust: nice while it lasted
Henry’s rather bracing insight here, one which I suspect those of us who get excited about GenAI (or perhaps even AGI) aren’t, on the whole, particularly keen to accept, is that we need to trust these systems as little as we possibly can while still getting useful outputs.
This is in direct tension with last year’s LangChain-powered hope/hype of implementing a superhuman chatbot in an afternoon.
I want to note that minimising trust in it, putting it under constant surveillance, and strictly limiting its scope of activity all quite firmly deny the personhood of GenAI. That matters to users. A delegate at RAGtime questioned whether it would be appropriate for a chatbot to decline to answer a complaint, on the basis that a person in the same position would not do that.
I don’t think anyone has a satisfying answer to that question yet.
But if only highly-constrained models make it into production apps, I guess we’ll have to get used to chatbots being different from the rest of us.
The real work
Products like context.ai are powerful, but they are only tools which enable the real work.
That work will be to find methods and language to name and quantify the kinds of non-determinism we care about.
It will be to continually shape our evaluation datasets so they can discover and detect non-determinism as it emerges in our systems.
And it will be to express those decisions in automated processes that run dozens if not hundreds of times a day—and to act on the resulting data.
That’s going to be a lot of data, a lot of infrastructure, and a lot of work. Most of it will be difficult and technical. But I’m optimistic about this approach, because it won’t require us to invent very much.