Encyclopedia Autonomica

Comprehensive Guide to Financial Datasets for AI and Machine Learning

Why did the financial dataset break up with the stock market? Because it couldn’t handle the constant ups and downs!

Jan Daniel Semrau (MFin, CAIO)
Mar 20, 2024

We all know data is the new oil (throwback to 2017). In this post, I provide an exploratory overview of popular financial datasets that can be used to train cognitive agents capable of reasoning about, analyzing, summarizing, and annotating bodies of financial text.

In investment finance, understanding the context of financial news is generally seen as a competitive advantage, not only for retail investors but also for professional investors, over those who have not gained the same understanding or insights.

For many new investors, the naive first step into this is sentiment analysis.

Sentiment Analysis

Also known as the art of riding a water hose, sentiment analysis (SA) tries to measure market sentiment, which is shaped by crowd psychology and is assumed to reveal itself through buying and selling activity and, most recently, through social media finfluencers. Since sentiment is largely an emotion-driven concept, it does not necessarily correlate with fundamental changes in the market. SA is popular with day traders and technical analysts who rely on market sentiment to measure and profit from short-term price moves driven by investor psychology. The same holds true for contrarian investors, just in the other direction.

SA assumes that the psychological differences among heterogeneous investors have implications for asset pricing. Personally, I don’t hold that opinion.

Datasets hosted on the Hugging Face Hub can, in many cases, be loaded like this:

from datasets import load_dataset

# Load the dataset from the Hugging Face Hub
dataset_name = "ChanceFocus/fiqa-sentiment-classification"
dataset = load_dataset(dataset_name)

# Save every available split to its own CSV file.
# Split names differ between datasets (e.g. "valid" vs. "validation"),
# so iterate over them rather than hard-coding the keys.
for split_name, split in dataset.items():
    split.to_csv(f"{split_name}.csv")

Let’s dive in. The datasets are sorted from simple to complex.


Financial PhraseBank

Financial PhraseBank is a dataset containing 4,840 sentences extracted from English-language financial news. Each sentence carries one of three sentiment labels: positive, negative, or neutral. Every sentence was labeled by 5-8 annotators, and the dataset is published in subsets according to how strongly those annotators agreed (at least 50%, 66%, 75%, or 100% agreement).

With roughly 5,000 sentences and just two columns (the sentence and its label), it is easy to integrate and use.

Link: https://huggingface.co/datasets/financial_phrasebank
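To make the agreement-rate subsets more concrete, here is a minimal sketch of how such a split could be computed from raw annotator votes. The sentences and votes below are made up for illustration and do not come from the dataset. The function names and the "majority label, then threshold" rule are my assumptions, not the dataset authors' documented procedure.

```python
from collections import Counter

# Hypothetical toy annotations (NOT taken from Financial PhraseBank):
# each sentence maps to the labels assigned by its 5-8 annotators.
annotations = {
    "Toy sentence A: profit rose sharply.": ["positive"] * 5,
    "Toy sentence B: the company cut 200 jobs.": ["negative"] * 5 + ["neutral"],
    "Toy sentence C: the firm opened a new office.": ["neutral"] * 5 + ["positive"] * 3,
    "Toy sentence D: sales were flat year on year.": ["neutral"] * 5 + ["negative"] * 2,
}

# Agreement thresholds mirroring the published subsets (50%, 66%, 75%, 100%).
THRESHOLDS = (0.50, 0.66, 0.75, 1.00)


def majority_label(votes):
    """Return the most common label and the share of annotators who chose it."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


def subsets_by_agreement(annotations, thresholds=THRESHOLDS):
    """Group (sentence, label) pairs under every threshold their agreement rate meets."""
    subsets = {t: [] for t in thresholds}
    for sentence, votes in annotations.items():
        label, rate = majority_label(votes)
        for t in thresholds:
            if rate >= t:
                subsets[t].append((sentence, label))
    return subsets


subsets = subsets_by_agreement(annotations)
for threshold, rows in subsets.items():
    print(f">= {threshold:.0%} agreement: {len(rows)} sentences")
```

The subsets are nested: every sentence in the 100% subset also appears in the 75%, 66%, and 50% subsets. A stricter subset gives cleaner labels but fewer training examples, so the choice depends on how much label noise your model can tolerate.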

FiQA Sentiment Analysis
