2023 ZeroShotQuestionAnsweringoverFi
- (Phogat et al., 2023) ⇒ Karmvir Singh Phogat, Chetan Harsha, Sridhar Dasaratha, Shashishekar Ramakrishna, and Sai Akhil Puranam. (2023). “Zero-Shot Question Answering over Financial Documents Using Large Language Models.” In: arXiv preprint arXiv:2311.14722. doi:10.48550/arXiv.2311.14722
Subject Headings: Domain-Specific Zero-Shot Question-Answering.
Notes
- It introduces a large language model (LLM) based approach to answer complex questions requiring multi-hop numerical reasoning over financial reports.
- It proposes a new approach using zero-shot prompts for financial question answering with LLMs, eliminating the requirement for hand-crafted examples.
- It considers two baseline zero-shot prompting techniques: standard dual prompt (ZS-STD) and zero-shot chain-of-thought prompt (ZS-CoT).
- It evaluates the approach on three financial question answering datasets using GPT models.
- It shows the ZS-FinPYT and ZS-FinDSL methods significantly outperform the baselines.
- It demonstrates that ZS-FinPYT gave better results than ZS-FinDSL for the text-davinci-003 and gpt-3.5-turbo models, while both were comparable for gpt-4.
- It analyzes performance in detail over question types and complexity.
- It shows that, for text-davinci-003, the zero-shot approach is within 10% of few-shot results, demonstrating its effectiveness.
- It demonstrates the potential to extract complex numerical reasoning from LLMs through careful system design and tailored prompts.
Cited By
Quotes
Abstract
We introduce a large language model (LLM) based approach to answer complex questions requiring multi-hop numerical reasoning over financial reports. While LLMs have exhibited remarkable performance on various natural language and reasoning tasks, complex reasoning problems often rely on few-shot prompts that require carefully crafted examples. In contrast, our approach uses novel zero-shot prompts that guide the LLM to encode the required reasoning into a Python program or a domain specific language. The generated program is then executed by a program interpreter, thus mitigating the limitations of LLMs in performing accurate arithmetic calculations.
We evaluate the proposed approach on three financial datasets using some of the recently developed generative pretrained transformer (GPT) models and perform comparisons with various zero-shot baselines. The experimental results demonstrate that our approach significantly improves the accuracy for all the LLMs over their respective baselines. We provide a detailed analysis of the results, generating insights to support our findings. The success of our approach demonstrates the enormous potential to extract complex domain specific numerical reasoning by designing zero-shot prompts to effectively exploit the knowledge embedded in LLMs.
1 Introduction
In recent years, the development of large language models (LLMs) has achieved significant advances in natural language processing (NLP). Typically, LLMs are pretrained on large corpora of text from the internet, which has given rise to the capability of adapting to a wide variety of new tasks from different domains without the need for huge amounts of task-specific data. Scaling up the size of these models has not only improved sampling efficiency and performance (Kaplan et al., 2020) but also introduced reasoning capabilities (Wei et al., 2022a,b; Kojima et al., 2022).
LLMs have been shown to perform well on tasks requiring reasoning capabilities in various domains, including code writing (Chen et al., 2021a), math problem solving (Lewkowycz et al., 2022; Polu et al., 2023), dialogue (Glaese et al., 2022; Thoppilan et al., 2022), common sense reasoning (Shwartz et al., 2020; Chowdhery et al., 2022) and symbolic reasoning (Wei et al., 2022b; Wang et al., 2023). The design of the prompt, known as prompt engineering, plays a significant role in adapting pretrained LLMs to new tasks with little or no task-specific training data. Recently, there has been extensive work (Liu et al., 2023) demonstrating the importance of prompt design in using LLMs and unlocking their reasoning capabilities. However, (Mahowald et al., 2023) argue that LLMs cannot combine elementary knowledge with common sense reasoning. (Valmeekam et al., 2022) claim that the benchmarks on which LLMs show reasoning capabilities are simplistic and cannot be used as evidence. (Bubeck et al., 2023; Bang et al., 2023) show that LLMs face challenges in numerical reasoning. Hence, adapting LLMs to new domains requires prompt engineering and a system design that can overcome the limitations of LLMs.
Question answering in the financial domain is an active area of research that could potentially benefit from the use of LLMs with appropriate system design. Financial question answering involves numerous steps and complex numerical reasoning with precise arithmetic calculations, making it more challenging than classical question answering problems (Yang et al., 2018; Rajpurkar et al., 2018). Typically, few-shot prompt based approaches have been used for such complex problems (Wei et al., 2022b; Chen et al., 2023). However, it has been shown that the output of LLMs is sensitive to the few-shot samples used as well as to the ordering of those samples (Lu et al., 2022). Further, the samples can contain a large number of tokens, and providing multiple samples for few-shot prompts increases the number of input tokens, sometimes even exceeding the input limit of LLMs. Hence, designing and using few-shot prompts for financial question answering can become quite challenging. We propose a new approach using zero-shot prompts for financial question answering with LLMs, thus eliminating the requirement to create hand-crafted examples. These prompts contain high-level instructions to guide the encoding of the financial reasoning process into a Python program (ZS-FinPYT) or a domain-specific language (ZS-FinDSL). For ZS-FinPYT, we achieve the zero-shot system through instructions that lay out the high-level approach to generate a valid Python program, while for ZS-FinDSL we enable the same by identifying a program structure for robust domain-specific language (DSL) program extraction. In both cases, the generated program is executed externally by a program executor to provide the final answer. We evaluate the latest GPT-x models on their ability to perform financial reasoning, as they have shown state-of-the-art performance on various tasks involving question answering and reasoning (OpenAI, 2023; Frieder et al., 2023; Kung et al., 2023). Specifically, we explore the use of the GPT models text-davinci-003, gpt-3.5-turbo and gpt-4 in answering financial questions.
We evaluate the proposed approach on three financial question answering datasets, with three different GPT models, and compare with various baselines. The experimental results demonstrate that our approach significantly improves the accuracy for all models. The success of our approach demonstrates the enormous potential to extract complex domain-specific numerical reasoning by carefully designing LLM based systems for specific applications and crafting prompts to effectively exploit the knowledge embedded in the LLMs.
2 Background
NLP techniques have proven useful for solving various problems in the financial domain, such as sentiment analysis to assist market prediction (Day and Lee, 2016; Akhtar et al., 2017) and fraud detection for risk management (Han et al., 2018; Wang et al., 2019). Financial domain-specific language models have been trained on large scale financial data and fine-tuned for specific problems (Liu et al., 2021). (Chen et al., 2021b) introduce a large-scale question answering dataset, FinQA, and propose FinQANet, a retriever-generator architecture based on pretrained BERT-like models.
With the introduction of LLMs, it has become feasible to directly use these language models without domain-specific pretraining. (Chen et al., 2022) propose a large-scale financial dataset, ConvFinQA, for conversational question answering. They propose a few-shot prompt (with 16 exemplars) based approach using the GPT-3 text-davinci-002 model to generate a DSL program.
One of the key techniques that significantly improves the reasoning abilities of LLMs is chain-of-thought prompting, introduced by (Wei et al., 2022b). They propose a few-shot prompt that consists of triples <input, chain-of-thought, output>, where the chain-of-thought (CoT) is a series of intermediate natural language reasoning steps that leads to the final output. (Kojima et al., 2022) demonstrate that reasonable zero-shot learning is achieved by simply adding “Let’s think step by step” to the prompt and using a two-prompt approach: the first prompt to extract the reasoning path and the second to extract the final answer. Unlike our approach, which avoids performing calculations using the LLM, both of these approaches utilize the LLM to generate mathematical expressions that encode the reasoning and to perform arithmetic at each step.
Program of thoughts (PoT) prompting (Chen et al., 2023) and Program-aided Language Models (PAL) (Gao et al., 2023) are approaches that are conceptually similar to our proposed technique. However, (Chen et al., 2023) show only limited zero-shot prompting experiments for financial datasets, and their results indicate that few-shot prompting significantly outperforms the zero-shot prompts. (Gao et al., 2023) discuss only few-shot prompting and do not show any results on financial datasets. In contrast, our work focuses entirely on optimizing zero-shot prompts that generate a Python program or a domain-specific language program for financial question answering. We further demonstrate that carefully designed zero-shot prompts for financial question answering can achieve results comparable with few-shot methods.
3 Zero-shot Prompting for Financial Domains
We introduce a novel zero-shot template-based prompting approach for financial question answering. These prompts are designed to generate executable programs for answering questions. Generating executable programs and running them externally enables accurate mathematical calculations, which eliminates arithmetic errors. We follow the prompt guidelines described in (Reynolds and McDonell, 2021) and employ the following principles for designing zero-shot prompts for question answering:
Signifier: A signifier is a pattern which keys the intended behavior. A task specific signifier directly elucidates the task at hand. The sentence – “Read the following passage and then answer the question”, specifically describes the question answering task that is to be performed.
Memetic proxy: A memetic proxy is a concept in which a character or characteristic situation is used as a proxy for an intention. “#Python” can be a memetic proxy for the LLM to clarify the intention that the response should have a Python program.
Constraining behavior: In addition to directing the LLM on the desirable response, it is important for the prompt to inform the LLM of undesirable responses. Instructions restricting undesirable LLM responses fall into the constraining behavior category.
Meta prompting: A meta prompt is a short phrase or a fill-in-the-blank template encapsulating a more general intention that will unfold into a more specific prompt when combined with additional information such as the task at hand. In the question answering task, the sentence “Let us think step by step.” elicits step-by-step reasoning in LLMs for answering questions.
Inspired by these prompt design principles, we present two zero-shot prompting techniques: the ZS-FinPYT prompt, which enables LLMs to generate executable Python programs, and the ZS-FinDSL prompt, which enables LLMs to generate executable domain-specific language programs. We also discuss two baseline zero-shot prompting techniques, one using a simple dual prompt and another using zero-shot chain-of-thought prompting (ZS-CoT) motivated by (Kojima et al., 2022). For reproducibility purposes, we provide exact prompts for all techniques and datasets.
3.1 Zero-shot FinPYT
The ZS-FinPYT prompt is a collection of instructions that directs the LLM to generate a valid Python program that can be executed by the exec function. Based on preliminary experiments, we identified the following requirements for the proposed prompt:
(a) The prompt should describe the task such that it enables the LLM to generate consistent programs for answering the questions.
(b) The final answer to a question must be stored in a specified Python variable to enable consistent extraction of the executed answer.
(c) The LLM-generated program should not include non-executable Python statements, for seamless execution of the Python programs.
The ZS-FinPYT prompt is designed to accommodate the above requirements in the following manner:
Direct task specification using the signifier: We use the following signifier to explicitly specify the question answering task:
The signifier explicitly calls out the task of writing a Python program to answer the question after reading the passage, where the passage and the question are identified with the identifiers “Passage:” and “Question:” respectively. Furthermore, the prompt directs the LLM to follow certain instructions while answering the question.
Direct sub-task specification using the signifier: The sub-task of storing the final answer in a specific Python variable is described as part of the instructions to the LLM:
Constraining LLM behavior: To ensure naming conventions are followed and to prevent the generation of non-executable statements, we include the following instructions in the prompt:
Memetic proxy phrases: Certain memetic proxy phrases are employed to implicitly convey intentions. For instance, the memetic phrase “#Comment: . . .” guides the LLM to understand that comments are always preceded by the “#” character. Similarly, the memetic phrase “#Python” instructs the LLM to generate a Python program. The ZS-FinPYT prompt for the FinQA dataset is depicted in Figure 1.
Figure 1: ZS-FinPYT prompt for FinQA (the figure annotates the prompt's signifier, memetic proxy, constraining behavior, and input components).
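To make the ZS-FinPYT pipeline concrete, the sketch below assembles an illustrative prompt from the components described above and runs the generated program with Python's exec, reading the result from a designated answer variable. The prompt wording, the `call_llm` helper, and the variable name `ans` are assumptions for illustration; the paper's exact prompt is the one shown in Figure 1.

```python
# Illustrative ZS-FinPYT-style pipeline (not the paper's exact prompt wording).
# `call_llm` is a placeholder for a deterministic LLM completion call
# (temperature=0, top_p=0.95, max_tokens=1000, as in Section 4.1).

def build_finpyt_prompt(passage: str, question: str) -> str:
    return (
        "Read the following passage and then write a Python program to "
        "answer the question. Follow the instructions below.\n"          # signifier
        "- Store the final answer in the Python variable `ans`.\n"       # sub-task signifier
        "- Do not include any non-executable statements.\n"              # constraining behavior
        f"Passage: {passage}\n"
        f"Question: {question}\n"
        "#Python\n"                                                      # memetic proxy
    )

def answer_with_finpyt(passage: str, question: str, call_llm) -> float:
    program = call_llm(build_finpyt_prompt(passage, question))
    namespace: dict = {}
    exec(program, namespace)   # execute the generated program externally
    return namespace["ans"]    # final answer stored in the specified variable
```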
3.2 Zero-shot FinDSL
The zero-shot FinDSL (ZS-FinDSL) is a zero-shot prompting technique for program generation in a domain-specific language (DSL). We use a DSL similar to (Chen et al., 2021b) with two differences: we do not have table operators and instead have max and min operators. The output of the system is a DSL program that is extracted using a Python script and executed using a language interpreter. In the ZS-FinDSL technique, we adopt a dual-prompt approach to extract the reasoning for answering questions and to generate the corresponding DSL program.
3.2.1 Reasoning Extraction Prompt
The reasoning extraction prompt of ZS-FinDSL consists of two parts:
Direct task specification using the signifier: The question answering task is specified explicitly using the following signifier:
Meta prompting for reasoning: For generating step-by-step reasoning for answering the question, the following meta prompt is used:
3.2.2 Program Extraction Prompt
The primary goal of the program extraction prompt is to extract DSL programs from the LLM's response obtained through the reasoning extraction prompt. To achieve this, the program extraction prompt involves specifying the task of program extraction and constraining the LLM's behavior by incorporating domain-specific knowledge.
Direct task specification using the signifier: The program extraction task is specified using the following signifier:
Constraining LLM behavior: To ensure consistent program extraction, we limit the mathematical operations to the set specified by the DSL. These operations are commonly used for financial question answering. Moreover, we constrain the program's output to numerical or boolean values to make it executable. The following instructions are passed to the LLM to ensure consistent program extraction:
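The instructions themselves appear in Figure 2. For intuition about what executing the extracted DSL program involves, here is a minimal interpreter sketch for FinQA-style programs of the form `subtract(a, b), divide(#0, c)`, where `#k` references the result of step k. The operator set and parsing details are assumptions based on the description above and on (Chen et al., 2021b), not the paper's exact executor.

```python
import re

# Minimal interpreter for FinQA-style DSL programs, e.g.
#   "subtract(5829, 5735), divide(#0, 5735)"
# "#k" refers to the result of step k. The operator set is an assumption
# (arithmetic ops plus max/min, no table operators, as described above).

OPS = {
    "add": lambda x, y: x + y,
    "subtract": lambda x, y: x - y,
    "multiply": lambda x, y: x * y,
    "divide": lambda x, y: x / y,
    "exp": lambda x, y: x ** y,
    "greater": lambda x, y: x > y,
    "max": max,
    "min": min,
}

def _resolve(token: str, results: list):
    token = token.strip()
    if token.startswith("#"):      # reference to an earlier step's result
        return results[int(token[1:])]
    return float(token.replace(",", "").rstrip("%"))

def run_dsl(program: str):
    results = []
    for op, args in re.findall(r"(\w+)\(([^)]*)\)", program):
        x, y = (_resolve(a, results) for a in args.split(","))
        results.append(OPS[op](x, y))
    return results[-1]

# Example: relative change = (5829 - 5735) / 5735
print(run_dsl("subtract(5829, 5735), divide(#0, 5735)"))  # ≈ 0.0164
```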
The ZS-FinDSL prompt for the FinQA dataset is shown in Figure 2.
Figure 2: ZS-FinDSL prompt for FinQA (the figure annotates the signifier, memetic proxy, constraining behavior, meta prompt, and input components).
3.3 Zero-shot Standard Dual
A standard template-based prompting approach for question answering is the zero-shot standard dual (ZS-STD) prompt, which consists of an LLM answering prompt and an answer extraction prompt. In the LLM answering prompt, the question is appended below the passage and the trigger word “Answer” is added for the LLM to generate the answer. The answer extraction prompt takes the LLM-generated answer along with the question and appends the memetic proxy phrase “The final answer (float/int/boolean) is” to extract the final answer. The ZS-STD prompt for FinQA question answering is shown in Figure 3.
Figure 3: ZS-STD prompt for FinQA (the figure annotates the signifier, memetic proxy, and input components).
3.4 Zero-shot Chain of Thoughts
Similar to the zero-shot reasoners of (Kojima et al., 2022), the zero-shot chain-of-thought (ZS-CoT) prompt is derived from the ZS-STD prompt by adding the reasoning trigger sentence “Let us think step by step.” after the word “Answer:”. The answer extraction prompt of ZS-CoT is identical to that of ZS-STD. The ZS-CoT prompt for the FinQA dataset is shown in Figure 4. All prompts for TATQA are identical to those for FinQA, and for the ConvFinQA dataset the prompts are slightly modified to handle conversational questions, as shown in Appendix A.
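As a rough illustration of the baseline dual-prompt flow, the sketch below chains an answering prompt (with the ZS-CoT reasoning trigger) and an answer extraction prompt. The wording and the `call_llm` helper are assumptions, not the paper's prompts; the exact prompts appear in Figures 3 and 4.

```python
# Illustrative ZS-CoT dual-prompt flow (wording is an assumption; the paper's
# exact prompts are shown in Figures 3 and 4). `call_llm` is a placeholder
# for a deterministic LLM completion call (temperature=0).

def zs_cot_answer(passage: str, question: str, call_llm) -> str:
    # Prompt 1: elicit step-by-step reasoning.
    reasoning = call_llm(
        f"Passage: {passage}\n"
        f"Question: {question}\n"
        "Answer: Let us think step by step."        # ZS-CoT reasoning trigger
    )
    # Prompt 2: extract the final answer from the reasoning.
    return call_llm(
        f"Question: {question}\n"
        f"Answer: {reasoning}\n"
        "The final answer (float/int/boolean) is"   # answer extraction proxy
    )
```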
4 Experiments
4.1 Experimental Design
Datasets: We conduct our experiments on three financial question answering datasets: FinQA (Chen et al., 2021b), ConvFinQA (Chen et al., 2022) and TATQA (Zhu et al., 2021), summarized in Table 1. For our evaluations, we use the test split of FinQA, while for ConvFinQA and TATQA we use the dev sets, as answers for the test splits of these datasets are not available. The evaluations for TATQA are restricted to questions of arithmetic type. The question answering task is to answer the questions using a passage containing text and table content. The table content is represented in a textual format using the strategy adopted in (Chen, 2022): table columns are separated by ‘|’, rows are separated by ‘\n’, and empty cells are filled with ‘-’.
Large Language Models: We experimented with three Azure OpenAI[1] LLMs: text-davinci-003, gpt-3.5-turbo and gpt-4. The Python programs generated by the LLMs are executed using the Python function exec. The domain-specific programs are executed using the Python script provided by FinQA.[2] To achieve a more precise and predictable outcome, the LLM parameters are set as follows: temperature = 0, top_prob = 0.95, max_tokens = 1000.
[1] https://oai.azure.com/
[2] https://github.com/czyssrs/FinQA
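The table-to-text conversion described in the Datasets paragraph is straightforward; a minimal sketch of the linearization (column separator ‘|’, row separator ‘\n’, ‘-’ for empty cells) might look as follows. The function name, exact spacing, and the toy table values are assumptions for illustration.

```python
# Minimal table linearization following the strategy described above:
# columns separated by '|', rows separated by '\n', empty cells filled with '-'.

def linearize_table(rows: list[list[str]]) -> str:
    return "\n".join(
        " | ".join(cell if cell not in ("", None) else "-" for cell in row)
        for row in rows
    )

# Toy example (values are illustrative, not from any dataset).
table = [
    ["", "year 1", "year 2"],
    ["revenue", "100", "90"],
    ["expenses", "", "60"],
]
print(linearize_table(table))
# - | year 1 | year 2
# revenue | 100 | 90
# expenses | - | 60
```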
Figure 4: ZS-CoT prompt for FinQA (the figure annotates the signifier, memetic proxy, meta prompt, and input components).
Evaluation Metrics: For all the financial datasets – FinQA, ConvFinQA and TATQA – we implement the evaluation strategy discussed in program of thoughts prompting (Chen et al., 2023) on GitHub[3] with slight modifications. The LLM responses vary in form for questions with answers in thousands, millions, and percentages. For example, for the gold answer 7 million, the GPT response may be 7 million or 7,000,000; for the gold answer 23%, the GPT response may be 23% or 0.23. The evaluation strategy is modified to handle such cases. We relax the evaluations for ZS-CoT (Kojima et al., 2022) and standard dual prompting because LLMs using these prompting techniques generate answers instead of programs. Since LLMs cannot perform precise mathematical calculations (especially with high-precision floats and large numbers), we provide a tolerance while comparing the GPT final answer with the gold answer. The evaluation is implemented using the Python function isclose with a relative tolerance (rel_tol) of 0.001. The isclose function returns True when comparing the GPT final answer â with the gold answer ã if and only if the condition abs(â − ã) ≤ rel_tol * max(abs(â), abs(ã)) is satisfied.
Baselines: We consider two baselines for the zero-shot prompting setting: the ZS-STD prompt and the ZS-CoT prompt. These zero-shot prompting techniques are evaluated with all three Azure OpenAI models (text-davinci-003, gpt-3.5-turbo, gpt-4) on all three financial datasets (FinQA, ConvFinQA and TATQA).
[3] https://github.com/wenhuchen/Program-of-Thoughts
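A minimal sketch of this relaxed comparison is shown below, assuming the answer strings have already been parsed to floats. The scale normalization (percent vs. fraction, millions vs. raw numbers) is an illustrative simplification of the modified PoT evaluation script, not its exact logic.

```python
from math import isclose

# Relaxed answer comparison as described above: matching with a relative
# tolerance of 0.001. The candidate rescalings are an illustrative
# simplification of the modified evaluation script.

def answers_match(pred: float, gold: float, rel_tol: float = 1e-3) -> bool:
    candidates = {pred, pred * 100, pred / 100, pred * 1e6, pred / 1e6}
    return any(isclose(c, gold, rel_tol=rel_tol) for c in candidates)

print(answers_match(0.23, 23))        # True: 23% given as 0.23
print(answers_match(7_000_000, 7))    # True: 7 million given as 7,000,000
print(answers_match(7.1, 7))          # False: outside the relative tolerance
```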
4.2 Main Results
The evaluation results for the proposed prompting techniques ZS-FinPYT and ZS-FinDSL, along with the baselines ZS-STD and ZS-CoT, are summarized in Table 2. The ZS-FinPYT and ZS-FinDSL methods significantly outperform the ZS-STD prompt for all datasets and across all LLMs. ZS-FinPYT achieves a 4.5% to 47% and ZS-FinDSL a 5.22% to 38.72% improvement in accuracy over ZS-STD. The increases in accuracy for text-davinci-003 and gpt-3.5-turbo are much higher than for gpt-4, as the gpt-4 base model already performs reasonably well. These results indicate that our prompts are able to induce the required reasoning and successfully output the required Python or domain-specific language programs.
Both methods also make significant improvements over the ZS-CoT method for text-davinci-003 and gpt-3.5-turbo, with ZS-FinPYT achieving 3% to 33.22% and ZS-FinDSL achieving 0% to 24.94% improvement over ZS-CoT on the different datasets. For gpt-4, our approach slightly outperforms ZS-CoT for all datasets, with improvements in the range of 1.5–3.5%. However, it is important to highlight that ZS-CoT lacks the ability to provide precise answers and its accuracy is measured using a relaxed metric, while our methods generate precise answers and an exact metric is used to measure their accuracy.
In general, the ZS-FinPYT approach gave better results than ZS-FinDSL for the text-davinci-003 and gpt-3.5-turbo models across the different datasets. For gpt-4, both methods are comparable.
We also carried out an evaluation of the OpenAI models using few-shot PoT prompting, as shown in Table 3. The comparisons indicate the excellent performance of our zero-shot method: we are within 10% of the few-shot results, in many cases almost the same, and in a few cases even surpass the few-shot performance.
Dataset | Split | Examples | Input | Output
---|---|---|---|---
FinQA | Test | 1147 | Table + Text + Question | Number + Binary
ConvFinQA | Dev | 421 | Table + Text + Multi-turn Question | Number + Binary
TATQA | Dev† | 718 | Table + Text + Question | Number + Binary
† Only arithmetic questions from the Dev split of TATQA.
Table 1: Financial question answering datasets for evaluation.
Models | FinQA | ConvFinQA | TATQA
---|---|---|---
ZS-STD (text-davinci-003) | 22.58 | 13.30 | 39.97
ZS-CoT (text-davinci-003) | 41.15 | 27.08 | 68.94
ZS-FinDSL (text-davinci-003) | 56.76 | 52.02 | 68.25
ZS-FinPYT (text-davinci-003) | 66.60 | 60.30 | 78.40
ZS-STD (gpt-3.5-turbo) | 32.26 | 47.74 | 49.03
ZS-CoT (gpt-3.5-turbo) | 53.01 | 52.49 | 74.09
ZS-FinDSL (gpt-3.5-turbo) | 61.12 | 60.81 | 77.86
ZS-FinPYT (gpt-3.5-turbo) | 66.52 | 67.45 | 85.00
ZS-STD (gpt-4) | 63.64 | 72.45 | 77.58
ZS-CoT (gpt-4) | 74.19 | 75.30 | 90.11
ZS-FinDSL (gpt-4) | 77.33 | 77.67 | 90.53
ZS-FinPYT (gpt-4) | 77.51 | 76.95 | 93.00
Table 2: Comparison results (accuracy) of various models on different datasets.
4.3 Performance Analysis
We conduct a performance analysis on the FinQA dataset for two models, gpt-4 and gpt-3.5-turbo; see Table 4 for details. The FinQA questions are divided into various categories to gain further insights.
Performance on text and table questions: The FinQA questions are divided into three sets depending on where the information required to answer the question is available: table-only questions, text-only questions, and table-text questions.
Performance regarding program steps: The FinQA questions are divided into three sets based on the number of steps required to provide the answer: 1-step programs, 2-step programs, and >2-step programs.
Performance regarding question types: The FinQA questions are divided into numerical and boolean type questions.
The key findings are listed below:
The models achieve the highest accuracy on table-only questions. As tables are structured, and the tables in this dataset are simple, it may be easier for the LLMs to accurately extract values from them than from unstructured text.
Questions with multi-hop reasoning are challenging. As would be expected, both models find it easier to answer questions needing one or two hops of reasoning than questions needing more than two.
Numerical questions are more challenging than boolean questions. In general, the gpt-4 and gpt-3.5-turbo models excel at answering boolean questions over arithmetic questions. However, gpt-3.5-turbo's performance with the ZS-FinDSL prompt declines for boolean questions compared to arithmetic questions. Examination of a few cases indicated that gpt-3.5-turbo has greater difficulty in correctly writing DSL programs for boolean questions.
4.4 Error Analysis
We sampled 50 test cases from the FinQA dataset results of the text-davinci-003 model and examined the entire output of the system in detail to gain further insight into the obtained results. As expected, the ZS-STD prompt results in brief answers, with a sentence or value as the output and no details on the reasoning, potentially contributing to its poor performance. On the other hand, LLM responses with ZS-CoT detail the reasoning behind the answers and show significantly better performance than ZS-STD. However, arithmetic errors result in a substantial drop in performance for both the ZS-STD prompt and ZS-CoT.
Models | FinQA | ConvFinQA | TATQA
---|---|---|---
Few-shot PoT (text-davinci-003)⋆ | 72.27 | 69.35 | 83.21
ZS-FinPYT (text-davinci-003) | 66.60 | 60.30 | 78.40
Few-shot PoT (gpt-3.5-turbo)⋆ | 67.39 | 65.79 | 74.75
ZS-FinPYT (gpt-3.5-turbo) | 66.52 | 67.45 | 85.00
Few-shot PoT (gpt-4)⋆ | 78.46 | 82.42 | 91.89
ZS-FinPYT (gpt-4) | 77.51 | 76.95 | 93.00
⋆ Few-shot PoT uses 4 shots selected from the few-shots used in (Chen et al., 2023).
Table 3: Performance of ZS-FinPYT and few-shot PoT on different datasets.
Methods | ZS-FinPYT (gpt-4) | ZS-FinPYT (gpt-3.5-turbo) | ZS-FinDSL (gpt-4) | ZS-FinDSL (gpt-3.5-turbo)
---|---|---|---|---
Overall accuracy | 77.51 | 66.52 | 77.33 | 61.12
Performance on table and text | | | |
Table-only questions | 80.91 | 71.36 | 81.36 | 63.94
Text-only questions | 74.45 | 58.39 | 73.36 | 60.22
Table-text questions | 67.44 | 55.81 | 68.22 | 48.84
Performance regarding program steps | | | |
1-step programs | 80.73 | 69.27 | 79.82 | 62.08
2-step programs | 77.02 | 64.79 | 77.26 | 63.08
>2-step programs | 54.76 | 53.57 | 58.33 | 44.05
Performance regarding question types | | | |
Boolean questions | 90.00 | 95.00 | 85.00 | 45.00
Numerical questions | 77.28 | 66.02 | 77.20 | 61.40
Table 4: Performance breakdown of ZS-FinPYT and ZS-FinDSL on the FinQA dataset.
The ZS-FinPYT and ZS-FinDSL approaches demonstrated detailed reasoning. In the case of ZS-FinPYT, the task of writing a Python program triggers the reasoning, while in the case of ZS-FinDSL there are two prompts, where the first prompt is a meta prompt that drives the reasoning similar to ZS-CoT. These techniques produce programs instead of answers and therefore mitigate arithmetic errors. Hence, the proposed techniques significantly outperform ZS-CoT. The ZS-FinDSL performance is lower than that of ZS-FinPYT because the program extraction step fails for some cases where the reasoning step is correct. One possible explanation is that the GPT systems have likely been trained on huge amounts of Python programs and hence can generate Python programs efficiently, whereas for ZS-FinDSL the instruction alone contains the information on how to write the domain-specific program. This may be driving the slightly higher error rate of ZS-FinDSL. Some demonstrative examples supporting these observations may be found in Appendix B.
5 Conclusion
We proposed zero-shot prompting techniques to answer complex questions requiring multi-hop numerical reasoning over financial reports. The prompts guide the LLM to encode the required reasoning into a program that is executed by a program interpreter. The approach demonstrated excellent results on three financial datasets, achieving significant improvements over the respective baselines. We hope that our work will motivate a principled approach to prompt design with other LLMs.
Limitations
In this paper, we only experiment with the GPT-x series of LLMs. While this work shows the tremendous potential for zero-shot financial reasoning with LLMs, it is possible that better performance may be obtained with other LLMs. Moreover, the prompts we have proposed are designed to address specific problems observed with the three GPT models considered in this work. Other LLMs may behave differently and will likely need modification to the prompts to work effectively.
While we experiment and find zero-shot prompts that are effective for both ZS-FinPYT and ZS-FinDSL, and the error analysis provides insights into failures, there are also unexplained failures in reasoning, and more research is needed to understand the behavior of LLMs in certain cases. For ZS-FinDSL, we observed some patterns that result in failure of program extraction. However, it is unclear what drives these failures, and we leave that for future work.
For cases where the reasoning is incorrect, the system may still provide an explanation with a high level of confidence. Our prompts currently do not address or control for such behavior. In practice, this can pose challenges for use in real-world systems.
A Prompts for ConvFinQA
The ConvFinQA prompts are slightly modified to handle conversational questions.
ZS-FinPYT for ConvFinQA: For gpt-4, we use a single prompt where the last question in the series of questions is clearly marked and the system is instructed to answer that last question, as shown in Figure 5. For gpt-3.5-turbo and text-davinci-003, we use a dual-prompt approach consisting of a reasoning extraction prompt and a program generation prompt; see Figure 6. The reasoning extraction prompt generates answers with reasoning for all the questions in a conversation, and the program generation prompt generates a Python program answering the last question.
ZS-FinDSL for ConvFinQA: The ZS-FinDSL for ConvFinQA (see Figure 7) is a dual prompt consisting of a reasoning prompt and a program extraction prompt that are similar to the corresponding prompts for FinQA. The reasoning prompt instructs the LLM to generate answers with reasoning for all questions in a conversation. The program extraction prompt instructs the LLM to generate a program for performing the calculations to answer the last question.
ZS-STD and ZS-CoT for ConvFinQA: The LLM answering prompt of ZS-STD (see Figure 8) and the reasoning extraction prompt of ZS-CoT (see Figure 9) instruct the LLM to answer the questions of a conversation. The answer extraction prompt of both techniques then extracts the final answer.
B Error Analysis Examples
We show some examples from the FinQA dataset with the corresponding responses from the text-davinci-003 model under various prompts. These examples demonstrate successful attempts and failure cases under the various prompts.
We begin with examples where the ZS-FinDSL (text-davinci-003) system generates correct reasoning and the corresponding program generation succeeds; see Figure 10 and Figure 11. Similarly, Figure 12 and Figure 13 show successful Python program generation by the ZS-FinPYT (text-davinci-003) system.
In most cases, the LLM answering prompt of ZS-STD (text-davinci-003) generates only a value or a sentence; see Figure 14 and Figure 15 for details. In some cases, the answer extraction step fails, as shown in Figure 16.
The LLM responses with ZS-CoT detail the reasoning behind the answers and show significantly better performance than ZS-STD. However, arithmetic errors result in a substantial drop in performance for both the ZS-STD prompt and ZS-CoT. Examples demonstrating arithmetic errors are shown in Figure 17 and Figure 18. The ZS-FinDSL performance is lower than that of ZS-FinPYT because the program extraction step fails for some cases where the reasoning step is correct, as shown in Figure 19 and Figure 20.
Figure 5: ZS-FinPYT (gpt-4) prompt for ConvFinQA
Figure 6: ZS-FinPYT (gpt-3.5-turbo, text-davinci-003) prompt for ConvFinQA
Figure 7: ZS-FinDSL prompt for ConvFinQA
Figure 8: ZS-STD prompt for ConvFinQA
Figure 9: ZS-CoT prompt for ConvFinQA
Figure 10: An example where ZS-FinDSL extracted correct reasoning and successfully generated the DSL program.
Figure 11: An example where ZS-FinDSL extracted correct reasoning and successfully generated the DSL program.
Figure 12: An example of successful Python program generation by ZS-FinPYT.
Figure 13: An example of successful Python program generation by ZS-FinPYT.
Figure 14: An example where ZS-STD’s LLM answering prompt generates one value response.
Figure 15: An example where ZS-STD’s LLM answering prompt generates one sentence response.
Figure 16: An example where answer extraction failed for ZS-STD.
Figure 17: An example of arithmetic errors made by ZS-STD and ZS-CoT prompts.
Figure 18: An example of arithmetic errors made by ZS-STD and ZS-CoT prompts.
Figure 19: An example where program extraction failed for ZS-FinDSL prompt.
Figure 20: An example where program extraction failed for ZS-FinDSL prompt.
References
- (Akhtar et al., 2017) ⇒ Md Shad Akhtar, Abhishek Kumar, Deepanway Ghosal, Asif Ekbal, and Pushpak Bhattacharyya. (2017). “A Multilayer Perceptron based Ensemble Technique for Fine-grained Financial Sentiment Analysis.” In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Pages 540-546.
- (Bang et al., 2023) ⇒ Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wen-liang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. (2023). “A Multi-task, Multilingual, Multimodal Evaluation of Chat-GPT on Reasoning, Hallucination, and Interactivity.” arXiv Preprint arXiv:2302.04023.
- (Bubeck et al., 2023) ⇒ Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. (2023). “Sparks of Artificial General Intelligence: Early Experiments with GPT-4.” arXiv Preprint arXiv:2303.12712.
- (Chen et al., 2021a) ⇒ Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harrison Edwards, Yura Burda, et al. (2021). “Evaluating Large Language Models Trained on Code.” arXiv Preprint arXiv:2107.03374.
- (Wenhu Chen, 2022) ⇒ Wenhu Chen. (2022). “Large Language Models are Few (1)-Shot Table Reasoners.” arXiv Preprint arXiv:2210.06710.
- (Chen et al., 2023) ⇒ Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. (2023). “Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks.” Transactions on Machine Learning Research.
- (Chen et al., 2021b) ⇒ Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting Hao Huang, Bryan Routledge, et al. (2021). “FINQA: A Dataset of Numerical Reasoning over Financial Data.” In: 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Pages 3697-3711. Association for Computational Linguistics (ACL).
- (Chen et al., 2022) ⇒ Zhiyu Chen, Shiyang Li, Charese Smiley, Zhiqiang Ma, Sameena Shah, and William Yang Wang. (2022). “ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering.” arXiv Preprint arXiv:2210.03849.
- (Chowdhery et al., 2022) ⇒ Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, et al. (2022). “PaLM: Scaling Language Modeling with Pathways.” arXiv:2204.02311.
- (Day & Lee, 2016) ⇒ Min-Yuh Day and Chia-Chou Lee. (2016). “Deep Learning for Financial Sentiment Analysis on Finance News Providers.” In: 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Pages 1127-1134. IEEE.
- (Frieder et al., 2023) ⇒ Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, et al. (2023). “Mathematical Capabilities of ChatGPT.” arXiv Preprint arXiv:2301.13867.
- (Gao et al., 2023) ⇒ Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. (2023). “PAL: Program-aided Language Models.” In: Proceedings of the 40th International Conference on Machine Learning, Volume 202 of Proceedings of Machine Learning Research, Pages 10764-10799. PMLR.
- (Glaese et al., 2022) ⇒ Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, et al. (2022). “Improving Alignment of Dialogue Agents via Targeted Human Judgements.” arXiv Preprint arXiv:2209.14375.
- (Han et al., 2018) ⇒ Jingguang Han, Utsab Barman, Jer Hayes, Jinhua Du, Edward Burgin, and Dadong Wan. (2018). “NextGen AML: Distributed Deep Learning based Language Technologies to Augment Anti Money Laundering Investigation.” In: Proceedings of ACL 2018, System Demonstrations, Pages 37-42.
- (Kaplan et al., 2020) ⇒ Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. (2020). “Scaling Laws for Neural Language Models.” arXiv Preprint arXiv:2001.08361.
- (Kojima et al., 2022) ⇒ Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. (2022). “Large Language Models are Zero-Shot Reasoners.” In: Advances in Neural Information Processing Systems, Volume 35, Pages 22199-22213. Curran Associates, Inc.
- (Kung et al., 2023) ⇒ Tiffany H. Kung, Morgan Cheatham, Arielle Medenilla, Czarina Sillos, Lorie De Leon, Camille Elepaño, et al. (2023). “Performance of ChatGPT on USMLE: Potential for AI-assisted Medical Education Using Large Language Models.” PLOS Digital Health, Pages 1-12.
- (Lewkowycz et al., 2022) ⇒ Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, et al. (2022). “Solving Quantitative Reasoning Problems with Language Models.” In: Advances in Neural Information Processing Systems, Volume 35, Pages 3843-3857. Curran Associates, Inc.
- (Liu et al., 2023) ⇒ Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. (2023). “Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing.” ACM Computing Surveys, 55(9):1-35.
- (Liu et al., 2021) ⇒ Zhuang Liu, Degen Huang, Kaiyu Huang, Zhuang Li, and Jun Zhao. (2021). “FinBERT: A Pre-trained Financial Language Representation Model for Financial Text Mining.” In: Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, Pages 4513-4519.
- (Lu et al., 2022) ⇒ Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. (2022). “Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-shot Prompt Order Sensitivity.” In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Pages 8086-8098.
- (Mahowald et al., 2023) ⇒ Kyle Mahowald, Anna A Ivanova, Idan A Blank, Nancy Kanwisher, Joshua B Tenenbaum, and Evelina Fedorenko. (2023). “Dissociating Language and Thought in Large Language Models: A Cognitive Perspective.” arXiv Preprint arXiv:2301.06627.
- (OpenAI, 2023) ⇒ OpenAI. (2023). “GPT-4 Technical Report.” arXiv Preprint arXiv:2303.08774.
- (Polu et al., 2023) ⇒ Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, and Ilya Sutskever. (2023). “Formal Mathematics Statement Curriculum Learning.” In: The Eleventh International Conference on Learning Representations 2023. OpenReview.net.
- (Rajpurkar et al., 2018) ⇒ Pranav Rajpurkar, Robin Jia, and Percy Liang. (2018). “Know What You Don’t Know: Unanswerable Questions for SQuAD.” In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Pages 784-789.
- (Reynolds & McDonell, 2021) ⇒ Laria Reynolds and Kyle McDonell. (2021). “Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm.” In: Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, Pages 1-7.
- (Shwartz et al., 2020) ⇒ Vered Shwartz, Peter West, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. (2020). “Unsupervised Commonsense Question Answering with Self-Talk.” In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Pages 4615-4629.
- (Thoppilan et al., 2022) ⇒ Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. (2022). “LaMDA: Language Models for Dialog Applications.” arXiv Preprint arXiv:2201.08239.
- (Valmeekam et al., 2022) ⇒ Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. (2022). “Large Language Models Still Can’t Plan (A Benchmark for LLMs on Planning and Reasoning about Change).” In: NeurIPS 2022 Foundation Models for Decision Making Workshop.
- (Wang et al., 2019) ⇒ Weikang Wang, Jiajun Zhang, Qian Li, Chengqing Zong, and Zhifei Li. (2019). “Are You for Real? Detecting Identity Fraud via Dialogue Interactions.” In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Pages 1762-1771.
- (Wang et al., 2023) ⇒ Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. (2023). “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” In: The Eleventh International Conference on Learning Representations 2023. OpenReview.net.
- (Wei et al., 2022a) ⇒ Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. (2022). “Emergent Abilities of Large Language Models.” Transactions on Machine Learning Research.
- (Wei et al., 2022b) ⇒ Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. (2022). “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” In: Advances in Neural Information Processing Systems, Volume 35, Pages 24824-24837. Curran Associates, Inc.
- (Yang et al., 2018) ⇒ Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. (2018). “HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering.” In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Pages 2369-2380.
- (Zhu et al., 2021) ⇒ Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. (2021). “TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance.” In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Pages 3277-3287.
;
Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year | |
---|---|---|---|---|---|---|---|---|---|---|
2023 ZeroShotQuestionAnsweringoverFi | Karmvir Singh Phogat Chetan Harsha Sridhar Dasaratha Shashishekar Ramakrishna Sai Akhil Puranam | Zero-Shot Question Answering over Financial Documents Using Large Language Models | 10.48550/arXiv.2311.14722 | 2023 |