Mechanical Survival

Classifying undergrad degrees with GPT-4 as our intern

At work, we very much want to understand whether people tend to apply to train to teach subjects for which their degrees are relevant.

But our data isn’t good enough.

This post is about training a neural network to classify degrees for us, and the part GPT-4 played in that process.

What CAH?

Ideally every teacher training application we held would have its degree mapped to a “Common Aggregation Hierarchy” (CAH) code. With a CAH code we can say with some confidence, for instance, that your degree in 20th Century Film Studies is relevant to your application to train to teach Drama. This is great because it helps us understand how and why teachers choose to teach the subjects they do.

We have about 20,000 free-text degree names lacking CAH codes. And there are 166 possible codes.

This is a classic “get a team of interns to do it” sort of task — it calls for skill and discrimination, but fundamentally it’s rote work.

The main problem with that approach is that we don’t have interns. Thankfully, GPT-4 is good at this! You can give it a list of the codes, then a list of unstructured degree text, and it will intelligently assign codes.

So one way to achieve degree coding is to loop through all 20,000 degrees and pack them off to GPT-4 over the OpenAI API.
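
In sketch form it looks something like this. The prompt wording, batch size and output format here are illustrative assumptions, not our exact code:

```python
# Sketch: batch-classify free-text degree names with GPT-4 over the OpenAI API.
# Prompt wording, batch size and output parsing are illustrative, not our exact code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def classify_batch(degrees: list[str], cah_codes: list[str]) -> str:
    """Ask GPT-4 to assign one CAH code to each degree name in the batch."""
    system = ("You assign UK Common Aggregation Hierarchy (CAH) codes to degree titles. "
              "Valid codes:\n" + "\n".join(cah_codes))
    user = ("Return one line per degree, in the form 'degree name<TAB>CAH code':\n"
            + "\n".join(degrees))
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return response.choices[0].message.content


# Then loop over the ~20,000 degrees in tranches of 200 or so,
# collecting and parsing each response.
```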

And GPT-4 is good enough! How do we know? Because the results look reasonable, and the work simply cannot be done perfectly: nobody can sort degrees into 166 arbitrary categories precisely as the makers of the CAH would have intended.

In some ways it’s better than the interns. For instance, it can hold all 166 categories in its 8k context window - more than I could manage.

But it’s not perfect. Classifying a tranche of 200 degrees using GPT-4 takes about 3 minutes and costs roughly 25 cents. Multiply that by 100 (we have ~20,000 degrees to classify) and it’s 5 hours and costs $25.

One hour and $6.72 later

That cost is not awful, but 5 hours is quite a long time, and that’s the best case scenario — the OpenAI API falls over from time to time too so the process needs babysitting.

We can work with less, though. Having spent an hour and a small amount of money getting results for just 4,000 of our unlabelled degrees, we can put that data to use training a neural network. That’s interesting to do because

  1. it will be fast and free to run, and we will be able to develop it as we learn more; GPT-4 is a black box
  2. if we can implement something that works for this dataset, we can look at using neural networks on other unstructured or incomplete data, of which we have a lot. Some of that data contains personally identifiable information and is therefore unsuitable for sending to OpenAI
  3. we could run it next to the data in BigQuery, where our analytics platform lives, which means we could potentially automate this kind of inference in future

Lumpy data

Before adding the GPT-4 data, this is what our data looks like, shown as a distribution of CAH codes. This core data is cobbled together from the official HECoS-to-CAH mapping (1,092 records, which I’m calling the CAH data) and some 2,000-odd rows of 2018 data from the Office for Students that manually (I think!) map free-text degrees to CAH codes, which we call the ILR data.

[Figure: CAH3 frequency distribution in the core training data]

You can see that the overwhelming majority of CAH codes have fewer than 10 examples, but it’s very lumpy. Some codes have over 150!

We can and do use sklearn’s compute_class_weight feature to balance our training data to privilege underrepresented classes, but its powers are limited when the absolute numbers are small.
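
For reference, here is a minimal sketch of how that weighting behaves, with toy labels standing in for our CAH3 class ids; feeding the resulting weights into the classifier's loss is one common way to apply them (an assumption about the exact mechanism, not necessarily our setup):

```python
# Sketch: weight classes inversely to their frequency so rare CAH codes count for more.
# Toy labels; in the real data there are 166 classes with very uneven counts.
import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight

labels = np.array([0, 0, 0, 0, 1, 2, 2])  # stand-in for integer CAH3 class ids

weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(labels),
                               y=labels)
print(weights)  # -> roughly [0.58, 2.33, 1.17]: the rarest class gets the largest weight

# One common way to apply the weights (an assumption, not necessarily our exact setup):
loss_fn = torch.nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float32))
```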

More data should help!

Training a neural network

I trained a BertForSequenceClassification classifier from HuggingFace Transformers on different quantities of data and evaluated the outcome in two ways:

  1. the classic plot of training and evaluation loss — that is to say, how far the network’s predictions ended up from the “true” data in the training and test data, both of which are cut from the whole dataset
  2. by running the classifier against the 1,092 mappings in the original CAH list. That list is now a fraction of the data we have in the whole training set, and validating specifically against those “official” classes feels like a reasonable sanity check
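
For context, here is roughly what the fine-tuning setup looks like with HuggingFace Transformers. The toy dataset, checkpoint and hyperparameters below are placeholders rather than the real run (which also applies the class weights mentioned above):

```python
# Sketch: fine-tune BERT to map free-text degree names to CAH3 codes.
# Toy data and hyperparameters are illustrative; the real run has ~166 classes.
from datasets import Dataset
from transformers import (AutoTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)

# Stand-in for the real labelled data (degree text -> integer CAH3 class id).
train_ds = Dataset.from_dict({
    "text": ["physics with astrophysics", "fine art", "veterinary pathology"],
    "labels": [0, 1, 2],
})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=32)

train_ds = train_ds.map(tokenize, batched=True)

args = TrainingArguments(output_dir="cah-classifier",
                         num_train_epochs=3,
                         per_device_train_batch_size=8)

Trainer(model=model, args=args, train_dataset=train_ds).train()
```

Once trained, classifying a degree is a single forward pass, which is what makes this fast and free compared with calling the API.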

Holding all other things constant, these were the results, with different quantities of training data.

Note that a CAH code, e.g. “07-01-01”, has three parts, giving three levels of precision. CAH3 is the most specific and includes all three parts - something like “physics”. That’s the one we’re trying to assign here. CAH2 is the first two parts of the code, for example “physics and astronomy”. CAH1 is just the first part - e.g. “physical sciences”.
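
The coarser levels can be read straight off the front of the full code, which is presumably how the CAH2 and CAH1 matches below are scored (an illustrative helper, not necessarily the exact scoring code):

```python
# Derive the coarser CAH levels by truncating a full CAH3 code (illustrative helper).
def cah_levels(cah3: str) -> tuple[str, str, str]:
    parts = cah3.split("-")
    return parts[0], "-".join(parts[:2]), cah3  # CAH1, CAH2, CAH3

print(cah_levels("07-01-01"))  # -> ('07', '07-01', '07-01-01')
```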

Given that a significant amount of our relevance assessment relies on matching CAH1 or CAH2 codes, 92% and 94% accurate matching is useful!

| Dataset | CAH3 % | CAH2 % | CAH1 % |
| --- | --- | --- | --- |
| 1. CAH | 49 | 65 | 70 |
| 2. CAH + ILR | 49 | 79 | 81 |
| 3. CAH + ILR + GPT | 63 | 87 | 89 |
| 4. CAH + ILR + GPT + SR* | 78 | 92 | 94 |

*SR means I supplemented the training data with synonym replacement, creating new examples by replacing words in the original text with synonyms.
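
Something like this WordNet-based sketch, although the library choice and replacement strategy here are assumptions about the general technique rather than the exact augmentation code:

```python
# Sketch: naive synonym replacement with WordNet to generate extra training examples.
# Library choice and strategy are assumptions; requires nltk.download("wordnet").
import random
from nltk.corpus import wordnet

def synonym_replace(text: str, n: int = 1) -> str:
    """Swap up to n words in `text` for a WordNet synonym."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    for i in random.sample(candidates, min(n, len(candidates))):
        lemmas = {lemma.name().replace("_", " ")
                  for synset in wordnet.synsets(words[i])
                  for lemma in synset.lemmas()}
        lemmas.discard(words[i])
        if lemmas:
            words[i] = random.choice(sorted(lemmas))
    return " ".join(words)

print(synonym_replace("applied physics with astronomy"))  # e.g. "applied physics with uranology"
```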

Classifications with the final, biggest dataset are significantly more accurate/plausible than those from the first. For example:

| Degree | Before | After |
| --- | --- | --- |
| development in the Americas | teacher training | development studies |
| gas engineering | electrical engineering | chemical, process and energy engineering |
| vetinary pathology | chemistry | veterinary medicine and dentistry |
| rural planning | law | planning (urban, rural and regional) |
| Swedish language | Iberian studies | German and Scandinavian studies |

And here are the training and evaluation loss graphs for the various runs.

[Figure: training and evaluation loss for the various runs]

Conclusion

We’ve encoded some of GPT-4’s degree-categorisation skills into our own neural network, which is good enough for our purposes, breaking our dependency on GPT-4 and opening up the possibility of doing this for more tasks in future.

We’re not ready to put this into production yet. If you’re interested you can find the code here.

#Ml #Python #GPT-4 #BERT