2023 BloombergGPT: A Large Language Model for Finance
- (Wu, Irsoy et al., 2023) ⇒ Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. (2023). “BloombergGPT: A Large Language Model for Finance.” In: arXiv preprint arXiv:2303.17564. doi:10.48550/arXiv.2303.17564
Subject Headings: BloombergGPT, Domain-Specific LLM, Bloomberg L.P.
Notes
Cited By
Quotes
Abstract
The use of NLP in the realm of financial technology is broad and complex, with applications ranging from sentiment analysis and named entity recognition to question answering. Large Language Models (LLMs) have been shown to be effective on a variety of tasks; however, no LLM specialized for the financial domain has been reported in literature. In this work, we present BloombergGPT, a 50 billion parameter language model that is trained on a wide range of financial data. We construct a 363 billion token dataset based on Bloomberg's extensive data sources, perhaps the largest domain-specific dataset yet, augmented with 345 billion tokens from general purpose datasets. We validate BloombergGPT on standard LLM benchmarks, open financial benchmarks, and a suite of internal benchmarks that most accurately reflect our intended usage. Our mixed dataset training leads to a model that outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks. Additionally, we explain our modeling choices, training process, and evaluation methodology. As a next step, we plan to release training logs (Chronicles) detailing our experience in training BloombergGPT.
...
1. Introduction
...
3 Model
3.1 Architecture
Our model is a decoder-only causal language model based on BLOOM (Scao et al., 2022).
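The excerpt states only that the architecture follows BLOOM's decoder-only causal design. As a rough illustration, the PyTorch sketch below implements the causal self-attention at the core of such a model. It is a minimal sketch, not BloombergGPT's implementation: BLOOM-specific details such as ALiBi positional biases, embedding LayerNorm, and the rest of the transformer block (MLP, residuals, normalization) are omitted, and all dimensions are toy values.

```python
# Minimal sketch of decoder-only causal self-attention (BLOOM-style core).
# Illustrative only: omits ALiBi biases and other BLOOM specifics.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # fused Q, K, V projection
        self.proj = nn.Linear(d_model, d_model)     # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split heads: (B, T, C) -> (B, n_heads, T, head_dim)
        shape = (B, T, self.n_heads, C // self.n_heads)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
        # causal mask: each position attends only to itself and the past
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # merge heads
        return self.proj(y)

x = torch.randn(2, 16, 64)                  # (batch, sequence, d_model)
print(CausalSelfAttention(64, 8)(x).shape)  # torch.Size([2, 16, 64])
```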
...
5 Evaluation
We evaluate the performance of BloombergGPT on two broad categories of tasks: finance-specific and general purpose. The finance-specific tasks help us test our hypothesis that training on high-quality finance-specific data will yield better results on financial tasks. The general purpose tasks investigate whether the performance of our model is directly comparable to previously published results. For financial tasks, we assemble publicly available financial datasets that include a range of NLP tasks. Then, to directly test BloombergGPT's ability on Bloomberg tasks of interest, we also include tasks drawn from Bloomberg-internal high-quality evaluation sets for sentiment analysis and named entity recognition. For general-purpose tasks, we draw from multiple existing benchmarks and group results into the following categories: BIG-bench Hard, Knowledge Assessments, Reading Comprehension, and Linguistic Tasks. The number of tasks per type and the definitions of the groups are presented in Table 5.
| Suite | Tasks | What does it measure? |
|---|---|---|
| Public Financial Tasks | 5 | Public datasets in the financial domain |
| Bloomberg Financial Tasks | 12 | NER and sentiment analysis tasks |
| BIG-bench Hard (Suzgun et al., 2022) | 23 | Reasoning and general NLP tasks |
| Knowledge Assessments | 5 | Testing closed-book information recall |
| Reading Comprehension | 5 | Testing open-book tasks |
| Linguistic Tasks | 9 | Not directly user-facing NLP tasks |
- Table 5: Evaluation Benchmarks. We evaluate BloombergGPT on a high-coverage set of standard benchmarks that assess downstream performance, taken from HELM, SuperGLUE, MMLU, and the GPT-3 suite. Since these have significant overlap and/or include each other, we restructure them into the categories presented here. We only evaluate on one setup per dataset. We further assess BloombergGPT on a suite of internal and public financial tasks.
| Name | # Tokens (B) | # Params. (B) | Compute |
|---|---|---|---|
| BloombergGPT | 569 | 50.6 | 1.00 |
| GPT-NeoX | 472 | 20 | 0.33 |
| OPT | 300 | 66 | 0.69 |
| BLOOM | 366 | 176 | 2.24 |
| GPT-3 | 300 | 175 | 1.82 |
- Table 6: Evaluation model cohort. OPT and BLOOM each have multiple sizes available and we report those we evaluated. We note that compute numbers are only partially comparable between models: For example, BLOOM's training data is only 1/3 English, and OPT repeated some of its training data. We report GPT-3 results whenever available but did not run it ourselves due to lack of availability.
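The "Compute" column in Table 6 appears to be training compute normalized to BloombergGPT. As a sanity check, the snippet below reproduces the column from the token and parameter counts under the standard C ≈ 6·N·D FLOPs approximation (N = parameters, D = training tokens); both the approximation and the normalization are our assumptions, not stated in the excerpt.

```python
# Reproduce Table 6's "Compute" column, assuming it is training FLOPs
# relative to BloombergGPT under the common C ~= 6 * N * D approximation.
models = {                    # name: (tokens in billions, params in billions)
    "BloombergGPT": (569, 50.6),
    "GPT-NeoX":     (472, 20),
    "OPT":          (300, 66),
    "BLOOM":        (366, 176),
    "GPT-3":        (300, 175),
}

base = 6 * 569e9 * 50.6e9     # BloombergGPT's approximate training FLOPs
for name, (tokens, params) in models.items():
    flops = 6 * tokens * 1e9 * params * 1e9
    print(f"{name:>12}: {flops / base:.2f}")
# BloombergGPT: 1.00, GPT-NeoX: 0.33, OPT: 0.69, BLOOM: 2.24, GPT-3: 1.82
```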
We compare BloombergGPT to the three closest models described in §7 based on model size, type of training data, overall performance, and most importantly, access. An overview of the model sizes and compute is provided in Table 6.
1. GPT-NeoX (Black et al., 2022): According to Liang et al. (2022), this model is the best performing available model under 50B parameters.
2. OPT66B (Zhang et al., 2022a): We chose to compare to OPT66B since our model size and structure roughly match, though our model is smaller.
3. BLOOM176B (Scao et al., 2022): While this model is substantially larger than BloombergGPT, we use the same model architecture and software stack. We note that BLOOM176B is multilingual, so while it is much larger, it also is trained on data from more languages.
All three models use some of the same general-purpose datasets we use in our training corpus. We additionally report results from the original GPT-3 (Brown et al., 2020) whenever externally available.
...
5.3.3 Exploratory Task: NER
Even though NER is a well-established NLP task with state-of-the-art results from BERT-style (Wu and Dredze, 2019; Luoma and Pyysalo, 2020) and T5-style (Liu et al., 2022) models, it remains largely unexplored for generative LLMs. NER is not in HELM (Liang et al., 2022), BIG-bench (Srivastava et al., 2022) contains only a single (Polish) NER task, and none of the LLM papers we study report NER performance. Hence, we consider NER an exploratory task and report preliminary results, given its importance in the financial sector. There are a few reasons why NER may be difficult for generative LLMs. NER is an information extraction task and a better fit for encoder-decoder or encoder-only architectures; the generative nature of LLMs confers no advantage here. We find that obtaining reasonable NER results requires far more prompt engineering and a greater number of shots than other tasks. Finance-specific NER has subtleties that make it especially difficult for zero- or few-shot learning.
For example, consider the (fabricated) headline "Bloomberg: Mr. Musk adds new features to Twitter and comments on China". Depending on our annotation guidelines and downstream task needs: (a) the reporting news organization "Bloomberg" can be tagged or not, depending on whether we want only salient entities; (b) "Mr. Musk" or just "Musk" is the PER to be tagged; (c) "Twitter" can be tagged as an ORG or a PRD (product), since features are added to the Twitter product and not the organization; and (d) "China" can be tagged ORG or LOC, though the right tag is likely ORG. Without extensive annotation guidelines in the prompt, the LLM does not know the intended tagging behavior.
Based on preliminary testing, we determined the setting that yields the best performance from all models on the internal NER tasks. First, we restrict the predicted entity types to ORG, PER, and LOC; in all, we filtered out less than 1% of entities. We also remove all documents that contain no entities (i.e., all "O" tags). Both modifications are intended to increase the usefulness of the examples seen in few-shot prompting. We expect that further prompt-engineering work for NER could produce better results.
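As an illustration of the few-shot setup described above, the sketch below assembles a prompt restricted to ORG, PER, and LOC entities. The template, shot examples, and output format are hypothetical; the paper does not disclose its internal prompts.

```python
# Hypothetical few-shot NER prompt builder, restricted to ORG, PER, and LOC.
# The template and examples below are illustrative assumptions only.
FEW_SHOT_EXAMPLES = [
    ("Apple hires engineers in Austin.",
     "Apple: ORG; Austin: LOC"),
    ("Tim Cook spoke with regulators in Brussels.",
     "Tim Cook: PER; Brussels: LOC"),
]

def build_ner_prompt(sentence: str) -> str:
    """Assemble a few-shot prompt asking the model to list ORG/PER/LOC spans."""
    lines = ["Extract all ORG, PER, and LOC entities from the sentence."]
    for text, entities in FEW_SHOT_EXAMPLES:
        lines.append(f"Sentence: {text}\nEntities: {entities}")
    lines.append(f"Sentence: {sentence}\nEntities:")
    return "\n\n".join(lines)

print(build_ner_prompt("Bloomberg: Mr. Musk adds new features to Twitter."))
```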
We consider seven Bloomberg internal NER datasets from different domains.
- BN NER: This is a named entity recognition task on entities occurring in English long-form Bloomberg news content (the "BN wire") between 2017 and 2020.
- BFW NER: Similar to "BN NER," but instead of the long-form BN wire, we use short-form stories from the "Bloomberg First Word" wire between 2018 and 2020.
- Filings NER: The goal of this task is to identify entities that occur in mandatory financial disclosures filed by companies. The dataset contains filings sampled between 2016 and 2019.
- Headlines NER: The goal of this task is to identify entities that occur in headlines of English Bloomberg news content. The dataset contains headlines sampled between 2016 and 2020.
- Premium NER: The goal of this task is to identify entities that occur in a subset of the third-party English news content ingested by Bloomberg. The dataset contains ...
References
- Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. (2023). "BloombergGPT: A Large Language Model for Finance." arXiv preprint arXiv:2303.17564. doi:10.48550/arXiv.2303.17564.