Sturdy Statistics Finance One Pager
Limitations of LLMs for Financial Analysis
Patronus AI has developed a benchmark to test LLMs specifically on financial analysis. Below we report the key results for GPT-4 Turbo, the best-performing model tested by Patronus AI (see Table 2 of Islam et al. 2023):1
| Algorithm | Accuracy | Description |
|---|---|---|
| Closed Book | 9% | Direct use of the LLM |
| RAG | 19% | RAG using OpenAI embeddings |
| Full Document | 79% | Include the entire filing in the prompt |
| Oracle | 85% | Include only the paragraph or page containing the answer in the prompt |
These results reveal two clear limitations:
- Cost + Performance
- There is a clear trade-off between cost and performance. The cost-effective RAG approach clearly struggles on this benchmark. Sending the entire 10-Q filing to GPT-4 in the prompt yields far better results, but at a cost of roughly $1 per question (a back-of-the-envelope cost comparison follows this list). Additionally, the authors note that “even if an LLM appears to be giving reasonable responses, there remains a risk that its answers are hallucinations.”2
- Unknown Unknowns
- In order to use an LLM for financial analysis, the analyst must first ask the LLM a concrete question. Consequently, an LLM cannot tell you anything you did not already know to ask about: it cannot tell you what to ask, it cannot make unexpected discoveries, and it cannot identify emerging trends or themes.
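As a rough illustration of the cost gap noted above, the sketch below estimates the per-question prompt cost of full-document prompting versus an oracle-style prompt. The token counts and per-token price are illustrative assumptions, not figures from the benchmark.

```python
# Back-of-the-envelope prompt-cost comparison (illustrative assumptions only).
# Assumed numbers: a 10-Q of ~100 pages at ~750 tokens per page, an oracle-style
# excerpt of ~750 tokens, and an input price of $0.01 per 1K tokens
# (roughly GPT-4 Turbo-era pricing; check current rates before relying on this).

PRICE_PER_1K_INPUT_TOKENS = 0.01   # USD, assumed
TOKENS_PER_PAGE = 750              # assumed average
PAGES_PER_FILING = 100             # assumed 10-Q length

def prompt_cost(num_tokens: int) -> float:
    """Input-token cost in USD for a single prompt."""
    return num_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

full_document_cost = prompt_cost(TOKENS_PER_PAGE * PAGES_PER_FILING)
oracle_cost = prompt_cost(TOKENS_PER_PAGE)  # one page worth of context

print(f"Full document: ~${full_document_cost:.2f} per question")
print(f"Oracle-style:  ~${oracle_cost:.4f} per question")
print(f"Ratio:         ~{full_document_cost / oracle_cost:.0f}x")
```

Under these assumptions the full-document prompt costs on the order of a dollar per question, while the oracle-style prompt costs roughly one hundredth of that.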
Sturdy Statistical Solutions
While the Patronus AI study ends on a pessimistic note about the promise of LLMs for financial analysis, the “Oracle” setting is notably both the most accurate and by far the most cost-effective way to run GPT-4. Under the oracle setting, the authors state that “the model is given the prompt as well as the text from the page used to evidence the answer … This turns the task into ‘open book’ question answering by removing the retrieval challenge.” It appears that the key to using GPT-4 effectively for financial analysis is to first identify only the information needed to answer a question, and to provide that information in a short-form prompt.
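As a minimal sketch of what oracle-style prompting looks like in practice, the snippet below sends only a short, pre-selected excerpt to GPT-4 via the OpenAI Python client. How the excerpt is selected is left abstract here; the model name and prompt wording are illustrative assumptions.

```python
# Minimal "oracle-style" question answering: the prompt contains only the
# short excerpt believed to contain the answer, not the whole filing.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_from_excerpt(question: str, excerpt: str) -> str:
    """Ask GPT-4 a question using only a short, pre-selected excerpt as context."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # illustrative choice; any capable chat model works
        messages=[
            {"role": "system",
             "content": "Answer the question using only the provided excerpt. "
                        "If the excerpt does not contain the answer, say so."},
            {"role": "user",
             "content": f"Excerpt:\n{excerpt}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

# Usage: the excerpt would come from a retrieval step,
# e.g. the Sturdy Statistics API described below.
# print(answer_from_excerpt("What was Q3 revenue?", excerpt="..."))
```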
The Sturdy Statistics API provides a proprietary information structuring and retrieval tool. Our demo application shows this tool to be extremely powerful in combination with LLMs such as GPT-4.
- Cost + Performance
- Our information-structuring API operates with a different cost structure than LLMs. With Sturdy Statistics, there is a one-time, fixed cost to train our model and index each new document; crucially, this fixed cost is amortized away with increased usage, and subsequent retrieval and analysis are far less expensive than even RAG. In exchange for this fixed cost, GPT-4 can operate in “oracle” mode: rather than including 100 pages of a 10-Q in the prompt for each question, we send just a few sentences or paragraphs per question, lowering the GPT cost (the part that does not amortize) by a factor of roughly 100.
- Unknown Unknowns
- The training step for Sturdy Statistics models is (in whole or in part) unsupervised: in other words, we do not instruct the code on what to look for or how to structure the data. The code is designed to automatically identify any significant, recurring themes. Because of this, our tool can surface the unknown unknowns: potentially crucial discoveries that an analyst would not have even thought to ask about (a generic illustration of unsupervised theme discovery follows below). Our tool therefore confers a competitive advantage over companies using only the techniques described in Islam et al. 2023. Additionally, while GPT operates on one document at a time, Sturdy Statistics analyzes the entire corpus holistically. This enables entire classes of analysis that are not possible with GPT alone.
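To make the idea of unsupervised theme discovery concrete, the sketch below uses a generic off-the-shelf topic model (scikit-learn’s LDA) to surface recurring themes across a corpus without being told what to look for. This is a stand-in for the concept only; it is not the Sturdy Statistics model or API, and the example filings are placeholders.

```python
# Generic illustration of unsupervised theme discovery across a corpus.
# scikit-learn's LDA is used as a stand-in; the Sturdy Statistics model
# is proprietary and not shown here.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

filings = [
    "Revenue grew on strong cloud demand, offset by currency headwinds ...",
    "Supply chain constraints increased component costs this quarter ...",
    "We recorded an impairment charge related to our retail segment ...",
    # ... in practice, the full corpus of filings goes here
]

vectorizer = CountVectorizer(stop_words="english", max_features=5000)
counts = vectorizer.fit_transform(filings)

lda = LatentDirichletAllocation(n_components=5, random_state=0)
lda.fit(counts)

# Print the top words for each discovered theme; no one specified these topics.
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-8:][::-1]]
    print(f"Theme {i}: {', '.join(top)}")
```

The point of the illustration is the workflow, not the particular model: themes emerge from the corpus itself, so an analyst can be alerted to topics they never thought to query.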
Notes
1 P. Islam, A. Kannappan, D. Kiela, R. Qian, N. Scherrer, and B. Vidgen. FinanceBench: A New Benchmark for Financial Question Answering. arXiv e-prints, Nov. 2023. doi: 10.48550/arXiv.2311.11944.
2 Also see S. Banerjee, A. Agarwal, and S. Singla. LLMs Will Always Hallucinate, and We Need to Live With This. arXiv e-prints, Sept. 2024. http://arxiv.org/abs/2409.05746.