Code Clinic | Finetuning Named Entity Recognition on Huggingface (Part 1)
I have a problem and it looks like this.
The story started a couple of years ago, when I couldn’t find any app that grouped similar news stories into a single thread, focused specifically on financial news, and pulled in posts from a variety of sources, including (but not limited to) social media, traditional media, Reddit discussions, YouTube videos, and EDGAR filings.
What I built was an operational flow that kind of worked like this. Obviously, that is the simplified view.
So, over the last few years, I wrote several NLP functions that annotated financial news. They did a good enough job for phase 1. Yet over time, not having “Cathie Wood” tagged as one entity in the post on the left started to annoy me and my users.
In addition, posts I get from Reddit have several times ended up with incorrect tags: either the tag itself was wrong or the wrong cashtag was picked up. An example is the plain word “shop” versus “$SHOP”, the cashtag for Shopify.
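To make the “shop” versus “$SHOP” confusion concrete, here is a minimal sketch of a stricter cashtag matcher. The pattern is my own assumption, not the tagging logic the platform uses: it treats a cashtag as a dollar sign followed by one to five uppercase letters, and the helper name `find_cashtags` is hypothetical.

```python
import re

# Assumption: a cashtag is "$" plus 1-5 uppercase letters, and the "$" must not
# be glued to a preceding letter, digit, underscore, or another "$".
CASHTAG = re.compile(r"(?<![A-Za-z0-9_$])\$([A-Z]{1,5})\b")


def find_cashtags(text):
    """Return ticker symbols written as cashtags, ignoring plain words."""
    return CASHTAG.findall(text)


text = "Stopped by the shop, then added to my $SHOP position."
found = find_cashtags(text)
print(found)
```

A rule like this would miss lowercase cashtags such as “$shop”, which do show up in casual posts, so it is at best a pre-filter and not a replacement for a trained model.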
Since we have aggregated about 250K posts on our platform since launch, I figured now is the right time to update our named entity recognition model.
Given that BERT is quite an established model, I figured that using it for a first evaluation exercise might be the right thing to do.
And the way I am going to implement this will be something along these lines:
Further, I will be using a classification algorithm that tags the identified entities as Location, Person, and Organization. Why?
Because we want to follow the IOB2 format, which will help us fine-tune the model on Huggingface later in the process.
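For readers who have not seen IOB2 before: every token gets exactly one tag, `B-` marks the first token of an entity, `I-` marks the tokens that continue it, and `O` marks everything else. Here is a minimal sketch of that idea. The sentence, the short labels `PER`, `ORG`, and `LOC`, and the helper `to_iob2` are all made up for illustration.

```python
from typing import List, Tuple


def to_iob2(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Turn token-index entity spans into one IOB2 tag per token.

    Each span is (start, end, label), with `end` exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # remaining tokens of the same entity
    return tags


# Hypothetical example sentence, already split into tokens.
tokens = ["Cathie", "Wood", "bought", "Shopify", "shares", "in", "Toronto"]
spans = [(0, 2, "PER"), (3, 4, "ORG"), (6, 7, "LOC")]

tags = to_iob2(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Note how “Cathie” and “Wood” end up as `B-PER` followed by `I-PER`, which is exactly what lets a model treat the two tokens as one person.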
But I am getting ahead of myself.
So procedurally, I will be performing some data management tasks first.
Extract data from MySQL
Store in a JSON file locally
Augment post stubs with full articles
Tag all posts in the document with a model that has not been fine-tuned yet
Correct the tags in the dataset
Use this dataset to train on Huggingface
Enjoy the benefits of a fine-tuned model.
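The first two steps above (extract the data, store it in a local JSON file) could be sketched roughly as follows. To keep the snippet runnable without a database server, I am using an in-memory `sqlite3` database as a stand-in for MySQL. The `posts` table, its columns, the sample rows, and the `export_posts` helper are all hypothetical.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def export_posts(conn, out_path: Path) -> int:
    """Dump every row of the (hypothetical) posts table to a JSON file."""
    cursor = conn.cursor()
    cursor.execute("SELECT id, source, title, body FROM posts ORDER BY id")
    columns = [col[0] for col in cursor.description]
    posts = [dict(zip(columns, row)) for row in cursor.fetchall()]
    out_path.write_text(
        json.dumps(posts, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(posts)


# Stand-in for the real MySQL database, filled with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, source TEXT, title TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO posts (source, title, body) VALUES (?, ?, ?)",
    [
        ("reddit", "Thoughts on $SHOP earnings?", "placeholder post stub"),
        ("edgar", "Form 10-K filing", "placeholder post stub"),
    ],
)

out_path = Path(tempfile.gettempdir()) / "posts_export.json"
count = export_posts(conn, out_path)
print(f"Wrote {count} posts to {out_path}")
```

With a real MySQL driver, the connection line would change but the cursor-based export should look much the same, since Python database drivers share the same basic cursor interface.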
So let’s dive int