OpenAI o1 LLM
An OpenAI o1 LLM is an OpenAI GPT-4 reasoning LLM.
- Context:
- It can (typically) spend more time thinking before responding.
- ...
- It can process text, images, and audio data to handle complex reasoning tasks.
- It can provide enhanced real-time speech recognition and advanced text-to-speech capabilities.
- ...
- Example(s):
- o1-mini (o1-mini-2024-09-12), released on 2024-09-12.
- o1-preview (o1-preview-2024-09-12), released on 2024-09-12.
- o1-2024-12-17, released on 2024-12-17.
- ...
- Counter-Example(s):
- See: OpenAI LLM Model, Foundation Neural Model, GPT-4 Turbo.
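The snapshot identifiers in the examples above appear to follow a "family name plus date" pattern, with the date suffix matching the listed release date. A minimal sketch of splitting such an identifier is shown here; the function name `parse_snapshot_id` and the regular expression are illustrative assumptions inferred from the three examples, not a documented OpenAI naming rule.

```python
import re
from datetime import date

# Pattern assumed from the listed examples: "<family>-YYYY-MM-DD".
SNAPSHOT_RE = re.compile(
    r"^(?P<family>.+)-(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})$"
)


def parse_snapshot_id(snapshot_id: str) -> tuple[str, date]:
    """Split a dated snapshot identifier into (model family, snapshot date).

    Hypothetical helper: raises ValueError when the identifier has no
    trailing date, e.g. a bare alias such as "o1".
    """
    match = SNAPSHOT_RE.match(snapshot_id)
    if match is None:
        raise ValueError(f"not a dated snapshot identifier: {snapshot_id!r}")
    snapshot_date = date(int(match["y"]), int(match["m"]), int(match["d"]))
    return match["family"], snapshot_date


if __name__ == "__main__":
    for sid in ("o1-mini-2024-09-12", "o1-preview-2024-09-12", "o1-2024-12-17"):
        family, released = parse_snapshot_id(sid)
        print(family, released.isoformat())
```

Keeping the family and the date separate makes it straightforward to group snapshots by model family or to sort them chronologically.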
References
2024
- (Valmeekam et al., 2024) ⇒ Karthik Valmeekam, Kaya Stechly, and Subbarao Kambhampati. (2024). "LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench." In: arXiv preprint. DOI: 10.48550/arXiv.2409.13373
- ABSTRACT: The paper investigates whether large language models (LLMs) possess the ability to plan, a critical function of intelligent agents. Using PlanBench, a benchmark introduced in 2022, the authors evaluate the performance of various LLMs and OpenAI's new Large Reasoning Model (LRM) o1 (Strawberry). While o1 shows significant improvement, it still faces challenges in meeting the benchmark's full potential.
- NOTES:
- o1 is a Large Reasoning Model (LRM), designed by OpenAI to go beyond the capabilities of traditional autoregressive Large Language Models (LLMs), with a focus on reasoning and planning tasks.
- o1 incorporates reinforcement learning pre-training, allowing it to generate and evaluate chains of reasoning (Chain-of-Thought) to improve performance on complex tasks like planning.
- o1 shows substantial improvements in planning benchmarks, achieving 97.8% accuracy on simple PlanBench Blocksworld tasks, far surpassing previous LLMs, but its performance degrades significantly on larger, more complex problems.
- o1 is more expensive than earlier LLMs and offers no correctness guarantees. It dynamically adjusts its inference process by using an adaptive number of reasoning tokens, but it still struggles with unsolvable problems and with complex, obfuscated tasks.
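To make concrete what counting a Blocksworld plan as correct involves, the following is a minimal sketch of a plan checker. It is not PlanBench's own validator: the state representation, the action tuples, and the names `apply_action` and `validate_plan` are assumptions made for illustration, using the four classic Blocksworld actions (pickup, putdown, stack, unstack).

```python
def apply_action(state, action):
    """Return the successor state, or raise ValueError if a precondition fails.

    state is {"on": {block: support}, "holding": block or None}, where
    support is another block or the string "table". A held block is not
    listed in "on". (Illustrative representation, not PlanBench's.)
    """
    on, holding = dict(state["on"]), state["holding"]
    name, *args = action

    def is_clear(block):
        # A block is clear if it is placed somewhere and nothing rests on it.
        return block in on and block not in on.values()

    if name == "pickup":        # lift a clear block off the table
        (b,) = args
        if holding is not None or not is_clear(b) or on[b] != "table":
            raise ValueError(f"cannot pickup {b}")
        del on[b]
        holding = b
    elif name == "unstack":     # lift a clear block off another block
        b, c = args
        if holding is not None or not is_clear(b) or on[b] != c:
            raise ValueError(f"cannot unstack {b} from {c}")
        del on[b]
        holding = b
    elif name == "putdown":     # place the held block on the table
        (b,) = args
        if holding != b:
            raise ValueError(f"cannot putdown {b}")
        on[b] = "table"
        holding = None
    elif name == "stack":       # place the held block on a clear block
        b, c = args
        if holding != b or not is_clear(c):
            raise ValueError(f"cannot stack {b} on {c}")
        on[b] = c
        holding = None
    else:
        raise ValueError(f"unknown action {name}")
    return {"on": on, "holding": holding}


def validate_plan(initial_on, plan, goal_on):
    """True if every action is applicable and the goal relations hold at the end."""
    state = {"on": dict(initial_on), "holding": None}
    try:
        for action in plan:
            state = apply_action(state, action)
    except ValueError:
        return False
    return all(state["on"].get(b) == s for b, s in goal_on.items())


if __name__ == "__main__":
    initial = {"A": "table", "B": "A", "C": "table"}
    goal = {"A": "B"}
    plan = [("unstack", "B", "A"), ("putdown", "B"),
            ("pickup", "A"), ("stack", "A", "B")]
    print(validate_plan(initial, plan, goal))
```

Under this sketch, accuracy on a set of instances would be the fraction of model-generated plans for which `validate_plan` returns `True`.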
2024
- (OpenAI, 2024) ⇒ OpenAI. (2024). "Introducing OpenAI o1 LLM."
- NOTES:
- It is a multimodal large language model designed for advanced reasoning in challenging domains such as physics and computational tasks.
2024
- https://openai.com/index/learning-to-reason-with-llms/
- NOTES:
- o1 is a new large language model trained with reinforcement learning to perform complex reasoning, producing a detailed internal chain of thought before responding to users.
- The model significantly outperforms GPT-4o on challenging reasoning benchmarks across math, coding, and science exams, demonstrating advanced problem-solving capabilities.
- On the 2024 AIME exams:
- In competitive programming, o1 ranks:
- In the 89th percentile on Codeforces competitive programming questions.
- A version further trained for programming achieved an Elo rating of 1807, outperforming 93% of human competitors.
- o1 exceeds human PhD-level accuracy on GPQA, a benchmark of physics, biology, and chemistry problems.
- Through reinforcement learning, o1:
- Refines its chain of thought.
- Learns to recognize and correct mistakes.
- Breaks down complex steps.
- Adopts different approaches when necessary.
- The model's performance improves with:
- Increased reinforcement learning (train-time compute).
- More time allocated for reasoning (test-time compute), showing scalability with computational resources.
- An early version, o1-preview:
- Human preference evaluations show:
- o1-preview is strongly preferred over GPT-4o in reasoning-intensive tasks like data analysis, coding, and math.
- It is less preferred in some natural language tasks.
- Chain of thought reasoning enhances safety and alignment by:
- Enabling the model to reason about safety rules internally.
- Making it more robust to unexpected scenarios.
- To balance user experience and safety:
- OpenAI shows users a model-generated summary of the chain of thought instead of revealing the raw reasoning process.
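As background for the Elo figure cited in the notes above, the sketch below gives the standard Elo expected-score formula, which maps a rating gap to an expected result between two players. Codeforces uses its own Elo-like rating system, so this is only an aid to reading the number and not how the reported percentile was computed. The function name `elo_expected_score` and the opponent rating of 1407 are hypothetical.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B.

    Returns a value in (0, 1); 0.5 means the players are evenly matched.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    # 1807 is the rating cited in the notes; 1407 is a hypothetical opponent.
    print(round(elo_expected_score(1807, 1407), 3))
```

A 400-point gap corresponds to roughly a 10-to-1 expected advantage under this formula, which is why differences of a few hundred rating points are considered large.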