Let’s bake a cake
If I were in the mood to bake a cake, I might go to a recipe aggregator and look for recipes that meet my ancillary requirements (e.g., flavour, gluten free) and then look for the recipes that have the most reviews and highest average rating. This would generally tell me that such a recipe likely yields a good and tasty cake, potentially without much cause for consternation (e.g., I’m not going to randomly be asked to make ganache or temper chocolate).
Once I’ve selected a recipe for consideration, I might skim the list of ingredients and instructions to make sure I have everything I need. Rarely would I wonder if the cake is going to be tasty or easy to make,[1] as I have already vetted that aspect through the high review count and average score. Maybe I will have also read some comments to make sure that the cake aligns with my taste profile (e.g., not too sweet), but the overall assumption is that since so many people have had success with this recipe, I should too.[2] If the recipe has a video, I may also watch that to confirm that the recipe is within my skill level when the instructions are unclear.
The one thing I certainly wouldn’t do, in general,[3] for a well-reviewed recipe is to first test it out at a smaller scale to make sure it works, because the successes of the reviewers already give me some assurance that it does.
Buying software, not as easy as baking a cake
Buying software can be a challenging decision due to the competing interests at play, both internal and external to the decision maker. This is particularly true when that software integrates technology with quantifiable pros and cons (e.g., privacy and security, accuracy metrics, speed). In such cases, it can be easy to rely on numbers to help inform decisions (e.g., “The machine learning underlying the software has an accuracy of 95%.”) and get stakeholder buy-in.
On the other hand, such numbers (or metrics) may not align well with the gap that the software is trying to fill. Going back to cake baking, if a recipe tells me it makes a vanilla cake in half the time of other recipes but I want to make a chocolate cake, then this recipe is perhaps not an effective way to attain my desired outcome.
This is one of the biggest issues in buying software that integrates machine learning (“ML”) or “artificial intelligence” (“AI”). The metrics that are reported typically focus on the specific outcomes of the ML/AI components (how well it classifies a document as a particular type, how well it predicts that a customer will churn) and not always on the larger task being targeted.
Indeed, the metrics themselves may be presented in such a way that the numbers reported do not reflect a face-value understanding of the metric. For example, if a piece of ML identifies the end date of a contract with 90% accuracy, we might reasonably believe that the ML is performing quite well. On the other hand, if it turns out that the evaluation only requires the ML to identify the sentence containing that date, we might be less inclined to agree with that 90% accuracy number, even though it is not necessarily wrong.[4]
It is not necessarily wrong because, for the use case envisioned by the software vendor, returning the entire sentence may be perfectly adequate. This leaves us with two questions:
- How are the metrics calculated?
- Does the task for which the metrics are being calculated align with my own task?
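To make that first question concrete, here is a minimal sketch of the contract end date example. The documents, predictions, and the two scoring rules are entirely made up for illustration and do not reflect any particular vendor’s evaluation; the point is only that the same predictions can earn very different “accuracy” numbers depending on how a match is defined.

```python
# A minimal sketch (hypothetical data and scoring rules) showing how the same
# predictions can earn different "accuracy" numbers depending on the definition
# of a correct answer.

gold = [
    {"end_date": "2025-03-31", "sentence": "This agreement terminates on March 31, 2025."},
    {"end_date": "2026-01-01", "sentence": "The term expires on January 1, 2026."},
    {"end_date": "2024-12-31", "sentence": "This lease ends on December 31, 2024."},
]

# Imagined model output: it finds the right sentence every time,
# but only extracts the exact date in one of the three cases.
predictions = [
    {"end_date": "2025-03-01", "sentence": "This agreement terminates on March 31, 2025."},
    {"end_date": None,         "sentence": "The term expires on January 1, 2026."},
    {"end_date": "2024-12-31", "sentence": "This lease ends on December 31, 2024."},
]

def accuracy(matches):
    return sum(matches) / len(matches)

# Strict evaluation: the extracted date must match exactly.
strict = accuracy([p["end_date"] == g["end_date"] for p, g in zip(predictions, gold)])

# Lenient evaluation: identifying the sentence containing the date is enough.
lenient = accuracy([p["sentence"] == g["sentence"] for p, g in zip(predictions, gold)])

print(f"strict accuracy:  {strict:.2f}")   # 0.33
print(f"lenient accuracy: {lenient:.2f}")  # 1.00
```

Both numbers are honest; they simply answer different questions.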
There is a third question that you might want to ask, but we’ll come back to it after a brief interlude.
Interlude: Metrics and User Models[5]
Many of us are familiar with “ad hoc web search” even if we never call it that. It is simply the academic way to refer to using Google, Bing, or your preferred web search engine. This task has been a part of the field of Information Retrieval for as long as users have wanted to search the World Wide Web.
In some respects, the task of evaluating (web) search engines was much easier back in the 1990s when there was far less digital data, but even then some notion of a user model was present. Indeed, returning an unordered set of documents[6] that matched the query terms would not have been ideal even in the 1990s (and earlier), as this could return tens to hundreds of thousands of documents. Accordingly, when a user issued a query, the search engine would return the matching documents ranked by the system’s (implicit) relevance score for each document with respect to the query. Documents that the system thought were more relevant to the query would be returned before those it thought were less relevant.
With such a ranking, a user might then traverse the list until they had found enough information to satisfy their information need. This motivates one of the simplest metrics: Precision@k, which measures the proportion of user-relevant documents among the top k results, for a pre-determined cut-off k. Note that the documents a user thinks are relevant and the documents a search system thinks are relevant to the same query need not be the same.
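As a minimal sketch, with made-up binary relevance judgements, Precision@k is just a ratio over the top of the ranked list:

```python
# A minimal sketch of Precision@k over a ranked list, using made-up
# binary relevance judgements (1 = the user found the result relevant).

def precision_at_k(relevance, k):
    """Proportion of the top-k results that are relevant."""
    top_k = relevance[:k]
    return sum(top_k) / k

# One system's ranking, judged by a (hypothetical) user from top to bottom.
judged = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

print(precision_at_k(judged, 5))   # 0.4 -- 2 of the first 5 are relevant
print(precision_at_k(judged, 10))  # 0.3 -- 3 of the first 10 are relevant
```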
Precision@k is perhaps not a great model of how users process a ranked list of documents, as they may not bother to examine all k of them. Instead, they may stop at the first relevant document they find, and this can be quantified in a metric called reciprocal rank. Over the last several decades, there have been many different ideas about how users seek and make sense of information (e.g., the berry-picking model, information foraging)[7] and, subsequently, a wide variety of metrics with built-in implicit user models that assume how users traverse a search engine results page (e.g., rank-biased precision,[8] discounted cumulative gain[9]). Despite all of this work, there was and still is no “one true” metric or model of user behaviour.[10] Most often, metrics are chosen (or adapted) based on how well the implied user model aligns with the actual high-level search task being modelled by the system.
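To see how different built-in user models can disagree, here is a small sketch (again with toy judgements) that scores the same two rankings with reciprocal rank and with rank-biased precision, assuming a persistence of 0.8:

```python
# A minimal sketch (toy judgements, not a full evaluation) of two metrics
# with different built-in user models, applied to the same rankings.

def reciprocal_rank(relevance):
    """User model: stop at the first relevant result."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def rank_biased_precision(relevance, persistence=0.8):
    """User model: after each result, continue with probability `persistence`
    (rank-biased precision, per Moffat and Zobel)."""
    return (1 - persistence) * sum(
        rel * persistence ** i for i, rel in enumerate(relevance)
    )

ranking_a = [0, 1, 1, 1, 0]  # relevant material starts at rank 2
ranking_b = [1, 0, 0, 0, 0]  # a single relevant hit, right at the top

for name, ranking in [("A", ranking_a), ("B", ranking_b)]:
    print(name, reciprocal_rank(ranking), round(rank_biased_precision(ranking), 3))
```

Reciprocal rank, which models an impatient user, prefers ranking B; rank-biased precision with a fairly persistent user prefers ranking A. Same rankings, different “winners”.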
Determining whether a metric is good
Going back to our two questions from before the interlude, you may have gotten the sense that one of the questions might matter more than the other. Hopefully, it was whether the task that the metric is designed to measure aligns with your own task. While knowing how a metric is calculated matters, it matters less if that task (or user model) does not match what you are looking to use that software to accomplish.
Unfortunately, most software marketing rarely talks about metrics other than in a way that is isolated from the end-to-end use of the system. This is due in part to the fact that it can be exceedingly difficult to quantify and measure differences in user experience. For example, consider two systems that identify contract end dates:
- System A: Has a recall[11] of 0.95 and a precision[12] of 0.85, but it is difficult to identify and fix its mistakes.
- System B: Has a recall of 0.87 and a precision of 0.8, but it is very easy to identify and fix its mistakes.
It would not be unreasonable to argue that System B is the better system if one has the resources to identify and fix (potentially costly) mistakes. On the other hand, if there are no resources to fix dates and the perceived risk of missing or misidentifying end dates is low, then System A may be better. Ideally, there would be a System C that takes the best parts of System A and System B, but that is not always feasible. That said, it is not entirely clear how to quantify the differences between A and B in a single number that encompasses these different aspects.
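One way to see why a single number is elusive is to score both systems under an explicitly assumed cost model. Everything in the sketch below, including the corpus size, the cost of a missed date, and the per-correction effort, is made up purely for illustration; change the assumptions and the “better” system changes.

```python
# A rough sketch of comparing the two systems under an assumed cost model.
# All numbers (document counts, per-error costs, correction effort) are
# made up purely for illustration.

def expected_cost(recall, precision, true_dates, miss_cost, fix_cost_per_error):
    """Estimate the cost of misses plus the cost of fixing false positives."""
    found = recall * true_dates      # true positives
    flagged = found / precision      # everything the system flagged as an end date
    false_positives = flagged - found
    missed = true_dates - found
    return missed * miss_cost + false_positives * fix_cost_per_error

TRUE_DATES = 1000  # assumed number of contract end dates in the corpus

# System A: better metrics, but mistakes are expensive to find and fix.
# System B: slightly worse metrics, but fixes are cheap.
for miss_cost in (50, 500):  # low vs. high perceived risk of a missed date
    cost_a = expected_cost(0.95, 0.85, TRUE_DATES, miss_cost, fix_cost_per_error=40)
    cost_b = expected_cost(0.87, 0.80, TRUE_DATES, miss_cost, fix_cost_per_error=5)
    print(f"miss_cost={miss_cost}: A={cost_a:.0f}, B={cost_b:.0f}")
```

With a low assumed cost per missed date, System B’s cheap corrections win; with a high one, System A’s higher recall wins. The ranking of the systems lives in the assumptions, not in the precision and recall numbers alone.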
Wait, wasn’t there a third question?
Yes, there is. And if you were particularly careful in analyzing how the interlude was worded, you might have figured out what that question is. In the interlude, I was careful to distinguish whether the system or the user thought a document was relevant. This matters because the system will not always have the full picture of what the user is trying to find or achieve.[13]
But how does this apply to ML and AI? Well, these algorithms have to be trained (or, for generative systems, prompted) by someone. That is, an individual (or several) at the software vendor has gone through and vetted the data that goes into producing the trained AI/ML algorithm. In doing so, they bake into the training data the software vendor’s conception of relevance for the task (e.g., what an executive employment agreement looks like, what an exclusivity clause is) and not necessarily some generally agreed-upon concept.
The question is then whether you think that the software vendor’s conception of relevance matches your own. In some cases, they may have guidelines they can provide or other forms of explanation or reasoning (e.g., the prompt that was used). But they may also wish to refrain from disclosing some or all of this information as it is part of their “secret sauce.” In such cases, the best thing to do is work to understand whether the task their software purports to solve aligns with your own.
Measuring the unmeasurable
How can you make a determination that what the vendor’s software does is (close enough to) what you want the software to do?
Well, you could simply rely on metrics, but that often does not end well since, as discussed previously, those metrics rarely encapsulate the full end-to-end experience. You could also deeply test the software, scrutinizing every detail and interrogating the vendor about minutiae. But this approach is costly and means that you likely won’t be able to explore many options.
What we really want to measure when we look for software, especially AI/ML-based software, is whether there will be enough benefit to justify the total cost of the software (e.g., implementation time, learning curve, monetary cost). There is rarely a single easy way to determine this.
What we often rely on instead is whether the vendor or their customers have published any form of data on their return on investment (“ROI”). The easiest ROI metrics show gains in revenue or reductions in costs. These could come in the form of time savings, the ability to win more business, making better decisions, or mitigating risk. But ROI could also come from something as complex as improved employee morale and reduced fatigue (e.g., using the software may not directly save time but could improve the quality of outputs because it minimizes the soul-draining tedium employees would normally endure). None of these things is necessarily easy or convenient to report, but they may be the best indicators of potential value to you.
The downside is that these outcomes are much, much harder to measure meaningfully and, much like the easier-to-measure metrics (e.g., accuracy, recall), still require you to understand whether your use of the software can and will align with the customer experiences that led to these positive outcomes.
Making a decision
As we have seen, there are many factors that can and should influence your purchasing decisions; many are not easy to measure, nor will they always align directly with what you’re seeking to accomplish.
But what if we employed the same tried-and-true methodology that we use to select a recipe? What would that look like? It would require us to look for software and determine if it does what we want it to do, and then look at reviews to see if the software does what it says it does.
The tricky part is that software rarely has the kind of convenient aggregator that recipes do. Instead, we must rely on who the vendor claims uses their software (e.g., the logos on their website) and whether we know (or believe) that those customers match our intended use case. We may then wish to view further documentation (e.g., metrics, use case suggestions) or a demo of the software to ensure that our assumptions hold. But we probably do not need to run an exhaustive test, because the number of customers similar to us gives us confidence that the software will likely meet our needs.
That being said, there are various enterprise software aggregators (e.g., Gartner, Legal Tech Hub, ProductHunt), but they come with caveats. Those that rely on the aggregator performing the research itself may carry hidden biases resulting from other relationships, and those that rely on public reviews may be skewed by negative reviews from customers whose chosen software did not adequately align with their needs (e.g., lacking other resources or information, they bought software that wasn’t fit for their purposes). Nevertheless, these types of sources can provide invaluable insights and information, and can help you make a decision or determine whether you actually need to perform a more substantive test.
When to test
There are times when one does need to test a recipe. It could be because the recipe uses advanced techniques that split the reviews, or because it does not have many reviews at all. In any case, I might try making a smaller version of the recipe with fewer ingredients so that if it is a bad recipe, or the techniques are beyond my skill, I do not waste too much time or money on ingredients.
The same thing is appropriate for software. If you cannot be certain that a particular piece of software is right for you, due to a lack of publicly announced customers or because those customers are not obviously similar to you or your use case, then you may need to run a small-scale test.
But keep the goal of the test in mind. The test is to determine whether this software aligns with your use case and facilitates completing your task(s). The test is not necessarily to re-evaluate individual components (though that can be informative) but to determine whether, by employing the software, your higher-level goals are accomplished in a demonstrably better way (e.g., faster, cheaper, more accurately).
Choosing between System A and System B is your goal (or determining that you actually needed System C). But to make that choice, you also need to understand what you want to use the software to accomplish and then use that to build your acceptance criteria.
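What might acceptance criteria look like in practice? As a purely hypothetical sketch, where the thresholds, sample size, and timing figures are placeholders you would derive from your own goals, they can be as simple as a few pass/fail checks over a trial on a representative sample of your own documents:

```python
# A hypothetical acceptance-criteria sketch. The thresholds, sample size,
# and timing figures below are placeholders to be set from your own goals.

from dataclasses import dataclass

@dataclass
class TrialResult:
    correct_extractions: int     # end dates the tool got right on your sample
    sample_size: int             # contracts in your representative sample
    minutes_per_contract: float  # average review time with the tool
    baseline_minutes: float      # average review time without the tool

def meets_acceptance_criteria(result: TrialResult) -> bool:
    """Pass only if the tool is accurate enough *and* demonstrably faster."""
    accuracy_ok = result.correct_extractions / result.sample_size >= 0.90
    time_saved_ok = result.minutes_per_contract <= 0.75 * result.baseline_minutes
    return accuracy_ok and time_saved_ok

# Imagined trial numbers for two candidate systems.
system_a = TrialResult(correct_extractions=93, sample_size=100,
                       minutes_per_contract=9.0, baseline_minutes=10.0)
system_b = TrialResult(correct_extractions=90, sample_size=100,
                       minutes_per_contract=6.0, baseline_minutes=10.0)

print("System A passes:", meets_acceptance_criteria(system_a))  # False: not enough time saved
print("System B passes:", meets_acceptance_criteria(system_b))  # True
```

The specific checks matter less than the fact that they are written down before the trial and tied to the higher-level goals (accuracy, time, cost) you actually care about.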
Footnotes
[1] There could be exceptions when an atypical ingredient or piece of equipment is called for.
[2] This collaborative effort to verify that the recipe produces a tasty cake is also the core of the scientific method, which requires scientists (or home cooks) to repeat an experiment (or recipe) and observe whether the outcomes match the hypothesis (a tasty cake is produced without much effort by combining ingredients in a particular way).
[3] I might do this for a recipe where there is a particularly novel technique that I might be unfamiliar with and have a high chance of messing up the first time.
[4] There is an even better example with respect to e-mail spam filtering. Historically, spam was substantially more prevalent than ham (yes, that is the name for not-spam), and so researchers had to use more nuanced metrics, as simple measures like accuracy or precision and recall would be biased towards systems that indiscriminately classified things as spam (e.g., one could trivially get 99.99999% accuracy with such an approach, but the user would never receive an email).
[5] This is a hilariously over-simplified history lesson but is meant more to be illustrative than 100% historically accurate.
[6] I will use “document” interchangeably to refer to “web page” as this is the common way to generically refer to the items being retrieved in Information Retrieval research.
[7] Many of these behavioural theories are derived from naturalistic settings (e.g., how animals forage for food, how humans pick berries) after researchers observed how information seeking behaviours mirrored these settings. The berrypicking model originated with Bates in the late 1980s and information foraging with Pirolli and Card in the early 1990s. Models have continued to be developed and are beyond the scope of this post.
[8] Proposed by Moffat and Zobel in 2008, this measure allows an evaluator to model how persistent a searcher is while traversing a ranked list of documents (i.e., how quickly they give up).
[9] Proposed by Järvelin and Kekäläinen in 2002, this measure attempts to model the idea that users have a rate of gain (i.e., relevant information) that decreases as they traverse a ranked list and view relevant items.
[10] Indeed, with the continuing growth and adoption of large language models into search systems, this process is in many ways beginning all over again.
[11] Recall measures the proportion of all contract end dates that the ML identified during internal testing. Recall is influenced by false negatives or misses (e.g., contract end dates that were not identified).
[12] Precision measures the proportion of items that the ML said were contract end dates and were actual end dates. Precision is influenced by false positives (e.g., things that aren’t actually end dates but are called end dates).
[13] We also note that following Nicholas Belkin’s work, the user themselves may be in an anomalous state of knowledge and not know exactly what they’re trying to find until they find it. A common example of this is going through one’s kitchen looking for something to eat and not knowing what they want until they see it.