Azure Provisioned Throughput Units (PTUs) Feature

From GM-RKB

An Azure Provisioned Throughput Units (PTUs) Feature is a Microsoft Azure feature that enables customers to reserve dedicated processing capacity for Azure OpenAI Service models.

  • Context:
    • It can (typically) be used to provide consistent model performance for production-level AI applications that require predictable processing power.
    • It can (typically) be used for deploying AI models such as GPT-4 in specified regions and scaling based on the number of calls per minute and token usage.
    • It can (often) include the ability to reserve a fixed number of Provisioned Throughput Units to ensure capacity for high-demand usage patterns.
    • ...
    • It can be purchased as a monthly or yearly commitment, with discounts for long-term reservations, ensuring better cost predictability.
    • It can allow scaling up or down based on the workload requirements, making it ideal for both large and growing deployments.
    • It can offer better cost-efficiency and consistent performance compared to the Pay-As-You-Go model, particularly for enterprises running regular high-volume tasks.
    • It can be accessed and managed using Azure's capacity planning tools, enabling enterprises to plan and provision PTUs based on workload characteristics.
    • It can provide the flexibility to assign or reallocate PTU quota across different deployments within a subscription and region.
    • It can simplify quota management by offering model-independent capacity, unlike the Tokens Per Minute (TPM) quota.
    • It can support predictable latency for real-time AI applications, particularly where performance consistency is critical.
    • ...
  • Example(s):
  • Counter-Example(s):
    • Pay-As-You-Go Azure OpenAI services, which offer more flexibility for variable workloads but without the consistency of performance and cost predictability provided by PTUs.
    • Tokens Per Minute (TPM) Quota, which limits token processing capacity for variable workloads without the guaranteed throughput offered by PTUs.
  • See: Azure OpenAI Services, Azure EA for OpenAI Services, Tokens Per Minute Quota, Azure Cost Management, Azure OpenAI Capacity Calculator, Azure OpenAI Benchmarking Tool.
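Because a PTU reservation trades a fixed monthly or yearly commitment for per-token Pay-As-You-Go billing, the cost comparison above reduces to a break-even calculation over monthly token volume. A minimal sketch (all prices are hypothetical placeholders, not actual Azure rates; consult the Azure pricing page for current figures):

```python
# Hypothetical break-even sketch: at what monthly token volume does a fixed
# PTU reservation become cheaper than Pay-As-You-Go? The rates below are
# ILLUSTRATIVE PLACEHOLDERS, not actual Azure prices.

def payg_monthly_cost(tokens_per_month: int, price_per_1k_tokens: float) -> float:
    """Pay-As-You-Go cost for a given monthly token volume."""
    return tokens_per_month / 1000 * price_per_1k_tokens

def breakeven_tokens(ptu_monthly_cost: float, price_per_1k_tokens: float) -> int:
    """Token volume above which the fixed PTU reservation is cheaper."""
    return int(ptu_monthly_cost / price_per_1k_tokens * 1000)

# With a (hypothetical) $10,000/month reservation at $0.03 per 1K tokens,
# the reservation wins past roughly 333M tokens/month.
volume = breakeven_tokens(10_000, 0.03)
```

Workloads well below the break-even volume generally favor Pay-As-You-Go; steady high-volume workloads favor the reservation, which is the cost-predictability trade-off described above.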


References

2024

  • Perplexity
    • To calculate the number of Provisioned Throughput Units (PTUs) needed for your specific Azure OpenAI Service workload, follow these steps:
      • Understand your workload requirements:
        • Determine the OpenAI Model you plan to use (e.g., GPT-4, GPT-3.5-Turbo).
        • Estimate your peak calls per minute.
        • Calculate the average tokens in prompt calls.
        • Estimate the average tokens in model responses.
      • Use the Azure OpenAI Capacity Calculator:
        • Access the calculator at https://oai.azure.com/portal/calculator.
        • Input your workload details (model, version, peak calls per minute, tokens in prompt, tokens in model response).
        • The calculator will provide an estimate of required PTUs[1][2].
      • Validate with benchmarking:
        • Use the Azure OpenAI Benchmarking Tool to test with real traffic patterns.
        • This helps account for variations in your workload that may affect PTU requirements[2].
      • Consider scaling and buffer:
        • PTUs are deployed in specific increments (e.g., 50 PTUs for GPT-4).
        • Round up to the nearest increment.
        • Consider adding a buffer for unexpected spikes in usage.
      • Monitor and adjust:
        • After deployment, monitor actual usage.
        • Adjust PTU allocation as needed based on real-world performance.
    • Remember that the capacity calculator provides an estimate based on simple input criteria. The most accurate way to determine your capacity is to benchmark a deployment with a representative workload for your use case[1].
    • For enterprise scenarios, it is often beneficial to work with Microsoft or a Cloud Partner to design and implement a solution that best fits your specific needs and use cases.
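The sizing steps above can be sketched as a rough calculation. Note that the per-PTU throughput and the deployment increment vary by model, so the figures below are illustrative placeholders; the Azure OpenAI Capacity Calculator and benchmarking tool are the authoritative sources for real numbers.

```python
import math

# Rough PTU sizing sketch following the steps above. The per-PTU throughput
# and increment defaults are HYPOTHETICAL PLACEHOLDERS -- use the Azure
# OpenAI Capacity Calculator and benchmarking for actual values.

def estimate_ptus(peak_calls_per_minute: int,
                  avg_prompt_tokens: int,
                  avg_response_tokens: int,
                  tokens_per_minute_per_ptu: int = 3_000,  # placeholder figure
                  increment: int = 50,                     # e.g. GPT-4 increments
                  buffer_fraction: float = 0.15) -> int:
    """Estimate PTUs: peak token throughput, plus a spike buffer,
    rounded up to the model's deployment increment."""
    peak_tpm = peak_calls_per_minute * (avg_prompt_tokens + avg_response_tokens)
    raw_ptus = peak_tpm / tokens_per_minute_per_ptu
    with_buffer = raw_ptus * (1 + buffer_fraction)
    return math.ceil(with_buffer / increment) * increment

# Example: 300 peak calls/min, 500 prompt tokens, 250 response tokens:
# peak_tpm = 225,000; raw = 75 PTUs; buffered = 86.25 -> rounds up to 100.
```

After deployment, the estimate should be revisited against monitored usage, as the final two steps above recommend.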
    • Citations:
[1] https://learn.microsoft.com/fi-fi/azure/ai-services/openai/how-to/provisioned-throughput-onboarding
[2] https://info.cloudchampion.fi/rs/779-DUO-006/images/Mastering%20AOAI%20-%20PTU.pdf?version=0
[3] https://www.youtube.com/watch?v=DNbJMTrIa6w
[4] https://support.freshservice.com/support/solutions/articles/50000004752-how-is-workload-calculated-
[5] https://azure.microsoft.com/en-us/pricing/details/cognitive-services/openai-service/?WT.mc_id=javascript-110690-gllemos
[6] https://osspeac.org/workload-calculator/
[7] https://learn.microsoft.com/en-in/azure/ai-services/openai/concepts/provisioned-throughput
[8] https://learn.microsoft.com/en-us/azure/cost-management-billing/reservations/azure-openai