Checkr ditches GPT-4 for a smaller genAI model, streamlines background checks



Checkr provides 1.5 million personnel background checks per month for thousands of businesses, a process that requires generative AI (genAI) and machine learning tools to sift through massive amounts of unstructured data.

The automation engine produces a report about each potential job prospect based on background information that can come from a number of sources, and it categorizes criminal or other issues described in the report.

Of Checkr’s unstructured data, about 2% is considered “messy,” meaning the records can’t be easily processed with traditional machine learning automation software. So, like many organizations today, Checkr decided to try a genAI tool — in this case, OpenAI’s GPT-4 large language model (LLM).

GPT-4, however, only achieved an 88% accuracy rate on background checks, and on the messy data, that figure dropped to 82%. Those low percentages meant the records didn’t meet customer standards.

Checkr then added retrieval-augmented generation (RAG) to its LLM setup, which supplies additional context to improve accuracy. While that worked on the majority of records (with a 96% accuracy rate), the figure for the more difficult data dropped even further, to just 79%.

The other problem? Both the general purpose GPT-4 model and the one using RAG had slow response times: background checks took 15 and seven seconds, respectively.

So, Checkr’s machine learning team decided to go small and try out an open-source small language model (SLM). Vlad Bukhin, Checkr’s machine learning engineer, fine-tuned the SLM using data collected over years to teach it what the company sought in employee background checks and verifications.

That move did the trick. The accuracy rate for the bulk of the data inched up to 97% — and for the messy data it jumped to 85%. Query response times also dropped to just half a second. Additionally, the cost to fine-tune an SLM based on Llama-3 with about 8 billion parameters was one-fifth of that for the roughly 1.8-trillion-parameter GPT-4 model.

To tune its SLM, Checkr turned to Predibase, which offers a cloud platform for fine-tuning. Checkr pulled thousands of examples from past background checks and connected that data to Predibase; from there, fine-tuning the Llama-3 SLM took just a few clicks in the Predibase UI. After a few hours of work, Bukhin had a custom model built.
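The raw material for a fine-tuning job like this is simply pairs of historical charge text and the human-verified category each one was assigned. As a rough illustration only (the field names, labels, and file format below are assumptions, not Checkr’s or Predibase’s actual schema), preparing such a dataset can be as simple as writing those pairs out as JSONL:

```python
# Illustrative sketch only: packaging past, human-verified charge
# annotations as a JSONL fine-tuning set. Field names, labels, and format
# are assumptions, not Checkr's or Predibase's actual schema.
import json

examples = [
    ("DISORDERLY CONDUCT - 2ND DEGREE", "disorderly conduct"),
    ("OPER MTR VEH W/O VALID LICENSE", "driving without a valid license"),
    ("THEFT OF PROPERTY < $500", "theft"),
]

with open("charge_classification.jsonl", "w") as f:
    for charge_text, category in examples:
        f.write(json.dumps({"prompt": charge_text, "completion": category}) + "\n")
```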

Predibase operates a platform that enables companies to fine-tune SLMs and deploy them as a cloud service for themselves or others. It works with all types of SLMs, ranging in size from 300 million to 72 billion parameters.

SLMs have gained traction quickly and some industry experts even believe they’re already becoming mainstream enterprise technology. Designed to perform well for simpler tasks, SLMs are more accessible and easier to use for organizations with limited resources; they’re more natively secure, because they exist in a fully self-manageable environment; they can be fine-tuned for particular domains and data security; and they’re cheaper to run than LLMs.

Computerworld spoke with Bukhin and Predibase CEO Dev Rishi about the project, and the process for creating a custom SLM. The following are excerpts from that interview.

When you talk about categories of data used to perform background checks, and what you were trying to automate, what does that mean? Bukhin: “There are many different types of categorizations they would do, but in this case [we] were trying to understand what civil or criminal charges were being described in reports. For example, ‘disorderly conduct.’”

What was the challenge in getting your data prepared for use by an LLM? Bukhin: “Obviously, LLMs have only been popular for the past couple of years. We’ve been annotating unstructured data long before LLMs. So, we didn’t need to do a lot of data cleaning for this project, though there could be in the future because we are generating lots of unstructured data that we haven’t cleaned yet, and now that may be possible.”

Why did your initial attempt with GPT-4 fail? You started using RAG on an OpenAI model. Why didn’t it work as well as you’d hoped? Bukhin: “We tried GPT-4 with and without RAG for this use case, and it worked decently well for the 98% of easy cases, but it struggled with the 2% of more complex cases. This was something I’d tried to fine-tune before. RAG would go through our current training [data] set and pick up 10 examples of similarly categorized queries, but these 2% [of complex cases, the messy data] don’t appear in our training set. So the sample we were giving to the LLM wasn’t as effective.”
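For readers unfamiliar with the setup Bukhin describes, the retrieval step works roughly like this: embed the incoming charge text, fetch the most similar labeled examples from the training set, and paste them into the prompt as few-shot context. The sketch below is a generic illustration with assumed model and data choices, and it also shows why the approach breaks down when the incoming charge has no close neighbors in the training set:

```python
# Generic sketch of the retrieval step described above: embed the incoming
# charge text, fetch the k most similar labeled training examples, and
# format them as few-shot context. Model choice and data are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

train_texts = ["DISORDERLY CONDUCT - 2ND DEGREE", "THEFT OF PROPERTY < $500", "DUI .08 OR MORE"]
train_labels = ["disorderly conduct", "theft", "driving under the influence"]
train_vecs = encoder.encode(train_texts, normalize_embeddings=True)

def build_prompt(charge_text: str, k: int = 10) -> str:
    query_vec = encoder.encode([charge_text], normalize_embeddings=True)[0]
    scores = train_vecs @ query_vec          # cosine similarity on normalized vectors
    top = np.argsort(scores)[::-1][:k]
    # If charge_text belongs to a rare class absent from the training set,
    # these "similar" shots are misleading; that is the failure mode Bukhin notes.
    shots = "\n".join(f"Charge: {train_texts[i]}\nCategory: {train_labels[i]}" for i in top)
    return f"{shots}\n\nCharge: {charge_text}\nCategory:"

print(build_prompt("PUBLIC INTOXICATION"))
```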

What did you feel failed? Bukhin: “RAG is useful for other use cases. In machine learning, you’re typically solving for the 80% or 90% of the problem, and then the longtail you handle more carefully. In this case where we are classifying text with a supervised model, it was kind of the opposite. I was trying to handle the last 2% — the unknown part. Because of that, RAG isn’t as useful because you’re bringing up known knowledge while dealing with the unknown 2%.”

Dev: “We see RAG be helpful for injecting fresh context into a given task. What Vlad is talking about is minority classes; things where you’re looking for the LLM to pick up on very subtle differences — in this case the classification data for background checks. In those cases, we find what’s more effective is teaching the model by example, which is what fine-tuning will do over a number of examples.”

Can you explain how you’re hosting the LLM and the background records? Is this SaaS or are you running this in your own data center? Bukhin: “This is where it’s more useful to use a smaller model. I mentioned we’re only classifying 2% of the data, but because we have a fairly large data lake, that’s still quite a few requests per second. Because our costs scale with usage, you have to think about the system set-up differently. With RAG, you would need to give the model a lot of context and input tokens, which results in a very expensive and high-latency model. Whereas with fine-tuning, because the classification part is already fine-tuned, you just give it the input. The number of tokens you’re giving it, and that it’s churning out, is so small that it becomes much more efficient at scale.”

“Now I just have one instance that’s running and it’s not even using the full instance.”
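A rough back-of-envelope comparison makes the input-token point concrete. The counts below are ballpark assumptions for illustration, not Checkr’s measured numbers:

```python
# Ballpark illustration (assumed token counts, not measured figures) of why
# a fine-tuned classifier is cheaper per request than a RAG prompt.
INSTRUCTION_TOKENS = 150      # assumed system prompt / instructions
TOKENS_PER_EXAMPLE = 40       # assumed size of one retrieved example
NUM_EXAMPLES = 10             # examples retrieved per request with RAG
CHARGE_TOKENS = 20            # the charge text being classified

rag_input = INSTRUCTION_TOKENS + NUM_EXAMPLES * TOKENS_PER_EXAMPLE + CHARGE_TOKENS
finetuned_input = CHARGE_TOKENS

print(f"RAG prompt:       ~{rag_input} input tokens per request")
print(f"Fine-tuned model: ~{finetuned_input} input tokens per request")
print(f"Roughly {rag_input / finetuned_input:.0f}x fewer input tokens after fine-tuning")
```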

What do you mean by “the 2% messy data” and what do you see as the difference between RAG and fine tuning? Dev: “The 2% refers to the most complex classification cases they’re working on.

“They have all this unstructured, complex, and messy data they have to process and classify to automate the million-plus background checks they do every month for customers. Two percent of those records can’t be processed very well with their traditional machine learning models. That’s why he brought in a language model.

“That’s where he first used GPT-4 and the RAG process to try to classify those records and automate background checks, but they didn’t get good accuracy, which means those background checks didn’t meet the needs of their customers.”

Vlad: “To give you an idea of scale, we process 1.5 million background checks per month. That results in one complex charge annotation request every three seconds, and sometimes that goes to several requests per second. That would be really tough to handle with a single-instance LLM because requests would just queue; if you were using RAG on an LLM, it would probably take several seconds to answer each one.

“In this case because it’s a small language model and it uses fewer GPUs, and the latency is less [under .15 seconds], you can accomplish more on a smaller instance.”
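Using the latencies quoted in the article, a back-of-envelope calculation shows why the smaller model keeps up with that request rate while a single LLM instance would fall behind. This assumes one worker handling requests sequentially, which is a simplification of any real serving setup:

```python
# Back-of-envelope throughput from the latencies quoted in the article,
# assuming a single worker processes requests one at a time.
latencies_seconds = {
    "GPT-4, no RAG": 15.0,
    "GPT-4 + RAG": 7.0,
    "fine-tuned SLM": 0.15,
}

for name, latency in latencies_seconds.items():
    per_minute = 60 / latency
    print(f"{name:>15}: ~{per_minute:.0f} requests/minute on one sequential worker")
```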

Do you have multiple SLMs running multiple applications, or just one running them all? Vlad: “Thanks to the Predibase platform, you can launch several solutions onto one [SLM] GPU instance. Currently, we just have the one, but there are several problems we’re trying to solve that we would eventually add. In Predibase terms, it’s called an Adapter. We would add another adapter to the same model for a different use case.

“So, for example, if you’ve deployed a small language model like Llama-3 and we have an adapter on it that responds to one type of request, we might have another adapter on that same instance because there’s still capacity, and that adapter can respond to a completely different type of request using the same base model.

“Same [SLM] instance but a different parameterized set that’s responsible just for your solution.”

Dev: “This implementation we’ve open-sourced as well. So, for any technologist who’s interested in how it works, we have an open-source serving project called LoRAX. When you fine-tune a model… the way I think about it is that RAG just injects some additional context when you make a request of the LLM, which is really good for Q&A-style use cases, such that it can get the freshest data. But it’s not good for specializing a model. That’s where fine-tuning comes in, where you specialize it by giving it sets of specific examples. There are a few different techniques people use in fine-tuning models.

“The most common technique is called LoRA, or low-rank adaptation. You customize a small percentage of the overall parameters of the model. So, for example, Llama-3 has 8 billion parameters. With LoRA, you’re usually fine-tuning maybe 1% of those parameters to make the entire model specialized for the task you want it to do. You can really shift the model to be able to do the task you want it to do.
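Rishi is describing the standard LoRA recipe. As a generic sketch (using Hugging Face’s peft library, which is an assumption here, not necessarily Predibase’s internal stack), attaching low-rank adapters to a Llama-3 8B base looks like this:

```python
# Generic LoRA sketch with Hugging Face's peft library; shown as an
# illustration of the technique, not necessarily Predibase's internal stack.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling applied to the update
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
# Prints trainable vs. total parameters; with settings like these, the
# trainable share is a small fraction of the 8 billion base weights.
model.print_trainable_parameters()
```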

“What organizations have traditionally had to do is put every fine-tuned model on its own GPU. If you had three different fine-tuned models – even if 99% of those models were the same – every single one would need to be on its own server. This gets very expensive very quickly.”

“One of the things we did with Predibase is have a single Llama 3 instance with 8 billion parameters and bring multiple fine-tuned Adapters to it. We call this small percentage of customized model weights Adapters because they’re the small part of the overall model that has been adapted for a specific task.

“Vlad has a use case up now, let’s call it Blue, running on Llama 3 with 8 billion parameters that does the background classification. But if he had another use case, for example to extract key information from those checks, he could serve another Adapter on top of his existing deployment.

“This is essentially a way of building multiple use cases cost-effectively using the same GPU and base model.”
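In practice, serving another Adapter on top of an existing deployment means the client names the adapter it wants applied on each request to the shared base model. The sketch below targets a LoRAX-style /generate endpoint; the URL and adapter IDs are placeholders, not Checkr’s actual deployment:

```python
# Sketch of two use cases sharing one base-model deployment by naming a
# different adapter per request. URL and adapter IDs are placeholders; the
# payload follows the /generate request shape used by LoRAX-style servers.
import requests

LORAX_URL = "http://localhost:8080/generate"  # placeholder deployment URL

def classify_charge(charge_text: str) -> str:
    """Route to the charge-classification adapter ('Blue' in Rishi's example)."""
    resp = requests.post(LORAX_URL, json={
        "inputs": charge_text,
        "parameters": {"adapter_id": "checkr/charge-classifier", "max_new_tokens": 16},
    })
    return resp.json()["generated_text"]

def extract_key_info(report_text: str) -> str:
    """A second, hypothetical adapter served from the same Llama-3 instance."""
    resp = requests.post(LORAX_URL, json={
        "inputs": report_text,
        "parameters": {"adapter_id": "checkr/key-info-extractor", "max_new_tokens": 128},
    })
    return resp.json()["generated_text"]
```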

How many GPUs is Checkr using to run its SLM? Dev: “Vlad’s running on a single A100 GPU today.

“What we see is when using a small model version, like sub 8 billion-parameter models, you can run the entire model with multiple use cases on a single GPU, running on the Predibase cloud offering, which is a distributed cloud.”

What were the major differences between the LLM and the SLM? Bukhin: “I don’t know that I would have been able to run a production instance for this problem using GPT. These big models are very costly, and there’s always a tradeoff between cost and scale.

“At scale, when there are a lot of requests coming in, it’s just a little bit costly to run them over GPT. I think using a RAG situation, it was going to cost me about $7,000 per month using GPT, $12,000 if we didn’t use RAG but just asked GPT-4 directly.

“With the SLM, it costs about $800 a month.”

What were the bigger hurdles in implementing the genAI technology? Bukhin: “I’d say there weren’t a lot of hurdles. The challenge was that, as Predibase and other new vendors were coming up, there were still a lot of documentation and SDK holes that needed to be fixed so you could just run it.

“It’s so new that metrics weren’t showing up as they needed to. The UI features weren’t as valuable. Basically, you had to do more testing on your own side after the model was built. You know, just debugging it. And, when it came to putting it into production, there were a few SDK errors we had to solve.

“Fine-tuning the model itself [on Predibase] was tremendously easy. Parameter tuning was easy, so we just needed to pick the right model.

“I found that not all models solve the problems with the same accuracy. We optimized for Llama-3, but we’re constantly trying different models to see if we can get better performance, and better convergence to our training set.”

Even with small, fine-tuned models, users report problems, such as errors and hallucinations. Did you experience those issues, and how did you address them? Bukhin: “Definitely. It hallucinates constantly. Luckily, when the problem is classification, you have 230 possible responses. Quite frequently, amazingly, it comes up with responses that are not in that set of 230 possible [trained] responses. That’s easy for me to check and just disregard, and then redo it.

“It’s simple programmatic logic. This isn’t part of the small language model. In this context, we’re solving a very narrow problem: here’s some text. Now, classify it.

“This isn’t the only thing happening to solve the entire problem. There’s a fallback mechanism… so, there are more models you try out, and if that’s not working you try deep learning and then an LLM. There’s a lot of logic surrounding LLMs. There is logic that can help as guardrails. It’s never just the model. There’s programmatic logic around it.

“The effort to clean most of the data is already complete, but we could enhance some of the cleaning with LLMs.”
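The guardrail Bukhin describes amounts to checking the model’s answer against the closed set of roughly 230 trained categories and escalating when it misses. Here is a minimal sketch of that wrapper logic, with the label set, retry count, and fallback hook as illustrative placeholders:

```python
# Minimal sketch of the guardrail logic described above: accept a prediction
# only if it falls in the closed set of trained categories, otherwise retry
# and finally escalate. Labels, retry count, and hooks are placeholders.
VALID_CATEGORIES = {"disorderly conduct", "theft", "driving under the influence"}  # ~230 in practice

def classify_with_guardrail(charge_text, model_call, fallback, max_retries=2):
    for _ in range(max_retries + 1):
        prediction = model_call(charge_text).strip().lower()
        if prediction in VALID_CATEGORIES:
            return prediction           # in-set answer: accept it
    return fallback(charge_text)        # hallucinated label: disregard and escalate
```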


