Parallelizing Data Annotation: A Python Tutorial for Efficient NLP Processing
More horsepower is more horsepower is more horsepower is more horsepower
While I am still on my journey of building my almighty AI investment agent, I sourced a dataset of about 250,000 investment-related articles (including Twitter and Reddit posts) that I wanted to annotate with named entities using Bidirectional Encoder Representations from Transformers (BERT).
Named entities commonly refer to specific persons, organizations, locations, expressions of time, quantities, monetary values, and percentages that can be found in the body of a text and categorized into predefined classes.
This helps with risk management by identifying regulatory bodies or legal entities. It also supports event detection and impact analysis, where articles related to events such as acquisitions, mergers, earnings reports, or economic indicators can be properly tagged.
I decided on using the large BERT model because it has higher accuracy than the base BERT model.
The hypothesis I want to test is whether I can use this dataset to build knowledge graphs between the articles and companies, ETFs, bonds, etc., in support of my risk management.
The target output should look like this:
Sentence: Kevin Fitzsimmons I appreciate the guide on margin and NII.
Corresponding NER tags:
[{'end': 5, 'entity': 'B-PER', 'index': 1, 'score': '0.9995722', 'start': 0, 'word': 'Kevin'},
{'end': 8, 'entity': 'I-PER', 'index': 2, 'score': '0.9996213', 'start': 6, 'word': 'Fi'},
{'end': 10, 'entity': 'I-PER', 'index': 3, 'score': '0.9724247', 'start': 8, 'word': '##tz'},
{'end': 12, 'entity': 'I-PER', 'index': 4, 'score': '0.9850449', 'start': 10, 'word': '##si'},
{'end': 14, 'entity': 'I-PER', 'index': 5, 'score': '0.5496914', 'start': 12, 'word': '##mm'},
{'end': 17, 'entity': 'I-PER', 'index': 6, 'score': '0.9918085', 'start': 14, 'word': '##ons'}]
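For reference, output in this shape can be produced with the Hugging Face transformers NER pipeline. The snippet below is a minimal sketch; the dslim/bert-large-NER checkpoint is my assumption for a large BERT NER model, and any token-classification checkpoint can be swapped in.

```python
from transformers import pipeline

# Minimal sketch: a token-classification (NER) pipeline on a single GPU.
# The checkpoint name is an assumption; any BERT-large NER model works the same way.
ner = pipeline("ner", model="dslim/bert-large-NER", device=0)  # device=-1 for CPU

sentence = "Kevin Fitzsimmons I appreciate the guide on margin and NII."
for tag in ner(sentence):
    print(tag)  # e.g. {'entity': 'B-PER', 'word': 'Kevin', 'score': 0.99, ...}
```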
The problem I observed is that running this task on my MacBook M1 would take about one week. That is a long time to validate a hypothesis.
Yet I think it is worthwhile to test this on the full set of articles because, in the end, I am a big proponent of the idea that data is a moat. Having solid control of your data can be a key value driver.
I am in the fortunate position of owning a 6-GPU rig that is perfectly suited for tasks like this.
The strategy I will employ here is to split the full dataset into six parts, have each GPU operate on its own subset, and then combine the annotated subsets afterwards, as sketched below.
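In outline, that strategy could look like the following minimal sketch: split the DataFrame into six chunks, start one worker process per GPU, and concatenate the saved parts at the end. The column name text, the file names, and the model checkpoint are assumptions for illustration, not the final implementation.

```python
import numpy as np
import pandas as pd
from multiprocessing import get_context

NUM_GPUS = 6

def annotate_chunk(chunk: pd.DataFrame, gpu_id: int) -> None:
    """Annotate one chunk of articles on one GPU and save the result to disk."""
    from transformers import pipeline  # imported here so each worker initializes its own pipeline
    ner = pipeline("ner", model="dslim/bert-large-NER", device=gpu_id)
    chunk = chunk.copy()
    chunk["ner_tags"] = chunk["text"].apply(ner)  # assumes a 'text' column
    chunk.to_pickle(f"annotated_part_{gpu_id}.pkl")

if __name__ == "__main__":
    articles = pd.read_parquet("articles.parquet")  # hypothetical input file
    chunks = np.array_split(articles, NUM_GPUS)     # one chunk per GPU

    # 'spawn' avoids CUDA issues that can occur with forked processes
    ctx = get_context("spawn")
    workers = [ctx.Process(target=annotate_chunk, args=(chunk, gpu_id))
               for gpu_id, chunk in enumerate(chunks)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    # Recombine the six annotated parts into one dataset
    annotated = pd.concat(
        pd.read_pickle(f"annotated_part_{i}.pkl") for i in range(NUM_GPUS)
    )
    annotated.to_pickle("annotated_articles.pkl")
```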
So let’s dive in.
We will build our code from the following building blocks.