Comprehensive Guide to Financial Datasets for AI and Machine Learning

Why did the financial dataset break up with the stock market? Because it couldn’t handle the constant ups and downs!

Mar 20, 2024

We all know data is the new oil (throwback to 2017). In this post, I provide an exploratory analysis of popular financial datasets that can be used to train cognitive agents capable of reasoning about, analyzing, summarizing, and annotating bodies of financial text.

In investment finance, understanding the context of financial news is generally seen as a competitive advantage, not only for retail investors but also for professional investors who might not otherwise have gained the same understanding or insights.

For many new investors, the naive first step into this is sentiment analysis.

Sentiment Analysis

Also known as the art of riding a water hose, sentiment analysis (SA) tries to measure market sentiment, which is influenced by crowd psychology and is assumed to be revealed through buying and selling activity and, most recently, through social media finfluencers. Since SA is largely an emotion-driven concept, it does not necessarily correlate with fundamental changes in the market. Sentiment analysis is popular with day traders and technical analysts, who rely on market sentiment to measure and profit from short-term price moves driven by investor psychology. The same holds true for contrarian investors, just in the other direction.

SA assumes that the psychological differences among heterogeneous investors have implications for asset pricing. Personally, I don’t hold that opinion.

Datasets hosted on the Hugging Face Hub can, in many cases, be accessed like this:

from datasets import load_dataset

# Load every available split of the dataset from the Hugging Face Hub
dataset_name = "ChanceFocus/fiqa-sentiment-classification"
dataset = load_dataset(dataset_name)

# Save each split (e.g. train, valid, test) to its own CSV file
for split_name, split in dataset.items():
    split.to_csv(f"{split_name}.csv")

Let’s dive in. The datasets are sorted from simple to complex.

Financial PhraseBank

Financial PhraseBank is a dataset containing 4,840 sentences extracted from English-language financial news. Each sentence carries one of three sentiment labels: positive, negative, or neutral. The dataset is divided into subsets according to the rate of agreement among the 5-8 annotators who labeled each sentence.
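To make the idea of an agreement rate concrete, here is a minimal sketch of how a sentence's majority label and agreement rate could be computed from individual annotator votes. The thresholds (50%, 66%, 75%, 100%) mirror the agreement-based subsets the dataset is usually distributed in, but treat them as an assumption here. The function names and the sample votes are hypothetical and only serve as illustration.

```python
from collections import Counter

# Assumed agreement thresholds (at least 50%, 66%, 75%, and 100% agreement).
THRESHOLDS = (0.50, 0.66, 0.75, 1.00)


def majority_and_agreement(annotations):
    """Return the most common label and the share of annotators who chose it."""
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(annotations)


def qualifying_subsets(annotations):
    """List every threshold that the sentence's agreement rate meets."""
    _, rate = majority_and_agreement(annotations)
    return [t for t in THRESHOLDS if rate >= t]


# Hypothetical votes from six annotators for a single sentence.
votes = ["positive", "positive", "positive", "positive", "neutral", "positive"]
label, rate = majority_and_agreement(votes)
print(label, round(rate, 2), qualifying_subsets(votes))
```

A stricter threshold gives cleaner labels but fewer sentences, so the choice of subset is a trade-off between label quality and training set size.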

With roughly 5,000 sentences and just two columns, it is easy to integrate and use.

Link: https://huggingface.co/datasets/financial_phrasebank
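With a two-column layout like this, a sensible first check before training is the class balance. The sketch below counts labels in a small in-memory CSV. The column names "sentence" and "label" and the sample rows are assumptions made up for illustration, not taken from the dataset itself.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the column names "sentence" and "label"
# are assumptions for illustration only.
SAMPLE_CSV = """sentence,label
"Operating profit rose compared with the previous year.",positive
"The company will release its report on Friday.",neutral
"Net sales fell short of expectations.",negative
"The board meets again next quarter.",neutral
"""


def label_distribution(csv_text, label_column="label"):
    """Count how often each sentiment label occurs in a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[label_column] for row in reader)


distribution = label_distribution(SAMPLE_CSV)
for label, count in distribution.most_common():
    print(f"{label}: {count}")
```

The same helper would work on a CSV exported from the Hub if you read the file contents into a string first and pass the right column name.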

Encyclopedia Autonomica
