OpenAI GPT-4 Multimodal Language Model

An OpenAI GPT-4 Multimodal Language Model is an multimodal language model within an OpenAI GPT-4 LLM Family (LLM family).

Context:
- It can (typically) be a Foundation LLM.
- It can (typically) be a Instruction-Tuned LLM.
- ...
- It can range from being a Base GPT-4 Model to being a GPT-4 Turbo Model, depending on its model variant.
- It can range from being a Text-Only GPT-4 to being a Multimodal GPT-4, depending on its input modality.
- It can range from being a Standard Context GPT-4 to being an Extended Context GPT-4, depending on its context window size.
- ...
- It can have model architecture with:
  - GPT-4's Scale of ~1.8 trillion parameters across 120 layers
  - GPT-4 Mixture Of Experts (MoE) with 16 experts
  - GPT-4 Corpus of ~13T tokens
  - GPT-4 Training Cost of ~$63 million
  - GPT-4 Inference Architecture using 128 GPUs
- ...
- It can support input processing through:
  - Text Processing for natural language input
  - Code Processing for programming language input
  - Image Processing for visual input (as of 2023-09)
  - Audio Processing for speech input (in GPT-4 Omni)
- It can provide output generation through:
  - Text Output for natural language output
  - Code Output for programming solutions
  - Image Output for analytical tasks
- It can maintain factual accuracy through information verification.
- It can follow user instructions through steerability mechanism.
- ...
- It can be a Multi-Modal LLM.
- It can be a State-of-the-Art LLM.
- ...
Example(s):
- GPT-4 Base Models (from 2023-03, cutoff 2021-09), such as:
  - gpt-4 (8k) with 8,192 token context and base capabilitys
  - gpt-4-32k with 32,768 token context and extended context
  - gpt-4-0314 with initial release features
  - gpt-4-0613 with function calling support
- GPT-4 Vision Models (from 2023-09), such as:
  - gpt-4-vision-preview with visual analysis capability
  - gpt-4-vision-0124 with improved recognition
- GPT-4 Turbo Models (from 2023-11, cutoff 2023-04), such as:
  - gpt-4-turbo-preview with 128,000 token context
  - gpt-4-0125-preview with enhanced instruction handling
  - gpt-4-turbo-instruct with specialized completions
- GPT-4 Omni Models (from 2024-05, cutoff 2023-12), such as:
  - gpt-4o-2024-05 with initial multimodal support
  - gpt-4o-2024-vision with enhanced visual capabilities
  - gpt-4o-2024-audio with speech processing
- ...
Counter-Example(s):
- GPT-3.5, GPT o1.
- Google Gemini 1.5, Anthropic Claude 3.

See: Mixture of Experts (MoE), Azure OpenAI Service, Auto-GPT Framework, DSPy Framework.

References

2024-12-13

Perplexity.ai
- Question: Provide a comprehensive overview of the GPT4 Model Family, including its key variants, technical specifications, and capabilities. Describe its multimodal features, context window size, and performance improvements over previous models. Explain the significance of its SoA model features and its real-time processing abilities across text, audio, and image modalities. Compare its performance on benchmarks and exams to human-level performance. Include information on its training data, model size, and any known limitations or ethical considerations. How does GPT-4 family compare to other state-of-the-art language models?
- Answer: The GPT4 Model Family represents a significant advancement in Artificial Intelligence Technology, offering advanced Natural Language Processing and multimodal capabilities through its various model variants and specifications.
  - Model Variants:
    - GPT-4 and GPT-4 Turbo:
      - Released in March 2023 and November 2023 respectively.
      - Feature multimodal processing for text and image inputs.
      - Context window: 8,192 tokens for base model, expanded to 128,000 for Turbo.
      - Knowledge cutoff: September 2021 for GPT-4, April 2023 for Turbo.
    - GPT-4o ("Omni"):
      - Introduced in May 2024.
      - Features full integration of text, image, and audio modalities.
      - Includes real-time processing capabilities.
      - Enhanced non-English language performance.
      - Improved vision and audio understanding.
  - Multimodal Capabilities:
    - Input Processing:
      - Text and image processing.
      - Audio input and output (in GPT-4o).
      - Visual question answering.
      - Real-time verbal conversations.
  - Performance Improvements:
    - Enhanced Accuracy:
      - Reduced hallucinations.
      - Improved academic benchmark performance.
      - Enhanced reasoning capabilities.
      - Superior instruction handling.
  - Benchmark Performance:
    - Test Scores:
      - MMLU: 86.4% (vs. 70.0% for GPT-3.5).
      - HellaSwag: 95.3% (vs. 85.5% for GPT-3.5).
      - AI2 Reasoning Challenge: 96.3% (vs. 85.2% for GPT-3.5).
  - Training Information:
    - Training Data:
      - Larger and more diverse dataset compared to previous models.
      - Includes public information and licensed data.
      - Variable training cutoff dates by version.
  - Key Features:
    - Advanced Capabilities:
      - Improved steerability for behavior adjustment.
      - Enhanced language understanding.
      - Advanced visual analysis capabilities.
      - Long-term memory and context awareness.
  - Limitations and Ethical Considerations:
    - Challenges:
      - Potential for biased outputs.
      - Privacy concerns.
      - Lack of transparency.
      - Risk of incorrect information generation.
- Citations:

[1] https://platform.openai.com/docs/models/gp
[2] https://research.aimultiple.com/gpt4/
[3] https://lingarogroup.com/blog/whats-new-with-gpt-4-features-and-limitations
[4] https://www.techtarget.com/whatis/feature/GPT-4o-explained-Everything-you-need-to-know
[5] https://openai.com/index/gpt-4-research/
[6] https://www.restack.io/p/gpt-4-training-workshops-answer-training-data-size-cat-ai
[7] https://www.version1.com/blog/openai-gpt-4-a-complete-review/
[8] https://en.wikipedia.org/wiki/GPT-4
[9] https://www.datacamp.com/blog/what-we-know-gpt4
[10] https://pmc.ncbi.nlm.nih.gov/articles/PMC10795998/

2023

https://platform.openai.com/docs/models/gpt-4
- QUOTE: GPT-4 is a large multimodal model (accepting text inputs and emitting text outputs today, with image inputs coming in the future) that can solve difficult problems with greater accuracy than any of our previous models, thanks to its broader general knowledge and advanced reasoning capabilities. Like gpt-3.5-turbo, GPT-4 is optimized for chat but works well for traditional completions tasks using the Chat completions API. Learn how to use GPT-4 in our GPT guide.

Latest model	Description	Max tokens	Training data
gpt-4	More capable than any GPT-3.5 model, able to do more complex tasks, and optimized for chat. Will be updated with our latest model iteration 2 weeks after it is released.	8,192 tokens	Up to Sep 2021
gpt-4-0613	Snapshot of gpt-4 from June 13th 2023 with function calling data. Unlike gpt-4, this model will not receive updates, and will be deprecated 3 months after a new version is released.	8,192 tokens	Up to Sep 2021
gpt-4-32k	Same capabilities as the standard gpt-4 mode but with 4x the context length. Will be updated with our latest model iteration.	32,768 tokens	Up to Sep 2021
gpt-4-32k-0613	Snapshot of gpt-4-32 from June 13th 2023. Unlike gpt-4-32k, this model will not receive updates, and will be deprecated 3 months after a new version is released.	32,768 tokens	Up to Sep 2021
gpt-4-0314 (Legacy)	Snapshot of gpt-4 from March 14th 2023 with function calling data. Unlike gpt-4, this model will not receive updates, and will be deprecated on June 13th 2024 at the earliest.	8,192 tokens	Up to Sep 2021
gpt-4-32k-0314 (Legacy)	Snapshot of gpt-4-32 from March 14th 2023. Unlike gpt-4-32k, this model will not receive updates, and will be deprecated on June 13th 2024 at the earliest.	32,768 tokens

2023

https://the-decoder.com/gpt-4-architecture-datasets-costs-and-more-leaked/
- GPT-4's Scale: GPT-4 has ~1.8 trillion parameters across 120 layers, which is over 10 times larger than GPT-3.
- Mixture Of Experts (MoE): OpenAI utilizes 16 experts within their model, each with ~111B parameters for MLP. Two of these experts are routed per forward pass, which contributes to keeping costs manageable.
- Dataset: GPT-4 is trained on ~13T tokens, including both text-based and code-based data, with some fine-tuning data from ScaleAI and internally.
- Dataset Mixture: The training data included CommonCrawl & RefinedWeb, totaling 13T tokens. Speculation suggests additional sources like Twitter, Reddit, YouTube, and a large collection of textbooks.
- Training Cost: The training costs for GPT-4 was around $63 million, taking into account the computational power required and the time of training.
- Inference Cost: GPT-4 costs 3 times more than the 175B parameter Davinci, due to the larger clusters required and lower utilization rates.
- Inference Architecture: The inference runs on a cluster of 128 GPUs, using 8-way tensor parallelism and 16-way pipeline parallelism.
- Vision Multi-Modal: GPT-4 includes a vision encoder for autonomous agents to read web pages and transcribe images and videos. The architecture is similar to Flamingo. This adds more parameters on top and it is fine-tuned with another ~2 trillion tokens.

2023

https://openai.com/research/gpt-4
- We’ve created GPT-4, the latest milestone in OpenAI’s effort in scaling up deep learning. GPT-4 is a large multimodal model (accepting image and text inputs, emitting text outputs) that, while less capable than humans in many real-world scenarios, exhibits human-level performance on various professional and academic benchmarks. For example, it passes a simulated bar exam with a score around the top 10% of test takers; in contrast, GPT-3.5’s score was around the bottom 10%. We’ve spent 6 months iteratively aligning GPT-4 using lessons from our adversarial testing program as well as ChatGPT, resulting in our best-ever results (though far from perfect) on factuality, steerability, and refusing to go outside of guardrails. ...
  ...
- Steerability:
  We’ve been working on each aspect of the plan outlined in our post about defining the behavior of AIs, including steerability. Rather than the classic ChatGPT personality with a fixed verbosity, tone, and style, developers (and soon ChatGPT users) can now prescribe their AI’s style and task by describing those directions in the “system” message. System messages allow API users to significantly customize their users’ experience within bounds. ...
  ...
- Like previous GPT models, the GPT-4 base model was trained to predict the next word in a document, and was trained using publicly available data (such as internet data) as well as data we’ve licensed. The data is a web-scale corpus of data including correct and incorrect solutions to math problems, weak and strong reasoning, self-contradictory and consistent statements, and representing a great variety of ideologies and ideas. So when prompted with a question, the base model can respond in a wide variety of ways that might be far from a user’s intent. To align it with the user’s intent within guardrails, we fine-tune the model’s behavior using reinforcement learning with human feedback (RLHF). Note that the model’s capabilities seem to come primarily from the pre-training process—RLHF does not improve exam performance (without active effort, it actually degrades it). But steering of the model comes from the post-training process—the base model requires prompt engineering to even know that it should answer the questions.
- ...
- Once you have access, you can make text-only requests to the gpt-4 model (image inputs are still in limited alpha), which we will automatically update to our recommended stable model as we make new versions over time (you can pin the current version by calling gpt-4-0314, which we’ll support until June 14). Pricing is $0.03 per 1k prompt tokens and $0.06 per 1k completion tokens. Default rate limits are 40k tokens per minute and 200 requests per minute.
  gpt-4 has a context length of 8,192 tokens. We are also providing limited access to our 32,768–context (about 50 pages of text) version, gpt-4-32k, which will also be updated automatically over time (current version gpt-4-32k-0314, also supported until June 14). Pricing is $0.06 per 1K prompt tokens and $0.12 per 1k completion tokens. We are still improving model quality for long context and would love feedback on how it performs for your use-case. We are processing requests for the 8K and 32K engines at different rates based on capacity, so you may receive access to them at different times.

OpenAI GPT-4 Multimodal Language Model

References

2024-12-13

2023

2023

2023

Navigation menu

Search