Nexus provides token-based rate limiting for LLM endpoints, allowing you to control token consumption per user, per provider, and per model, with support for different user tiers.
Token rate limiting requires client identification to be configured. See Client Identification for setup instructions.
Configure token limits at the provider level to apply to all models from that provider:
[llm.providers.openai.rate_limits.per_user]
input_token_limit = 100000 # 100K input tokens
interval = "60s" # Per minute
Nexus counts tokens using the following approach (a rough code sketch follows the list):
- Input Token Counting:
  - Uses OpenAI's cl100k_base tokenizer (compatible with GPT-4/GPT-3.5)
  - Counts tokens for each message's role and content
  - Adds ~3 tokens per message for internal structure
  - Adds 3 tokens for assistant response initialization
- Pre-flight Check:
  - Input tokens are checked against the limit
  - The request is rejected with a 429 status if the limit would be exceeded
  - Uses a sliding window algorithm for rate limiting
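To make the estimate concrete, here is a rough sketch of that counting rule using the Python tiktoken package. The function name and the assumption of plain-string message content are ours for illustration; the counts Nexus produces may differ slightly.

import tiktoken

# cl100k_base is the tokenizer named above; tiktoken is used here purely for illustration.
encoding = tiktoken.get_encoding("cl100k_base")

def estimate_input_tokens(messages: list[dict]) -> int:
    """Approximate the input tokens counted for a chat request (string content only)."""
    total = 0
    for message in messages:
        total += 3  # ~3 tokens of per-message structure
        total += len(encoding.encode(message.get("role", "")))
        total += len(encoding.encode(message.get("content", "")))
    return total + 3  # plus 3 tokens for assistant response initialization

print(estimate_input_tokens([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this document."},
]))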
Configure different limits for specific models (overrides provider-level limits):
# Provider-level default for all OpenAI models
[llm.providers.openai.rate_limits.per_user]
input_token_limit = 100000 # 100K input tokens
interval = "60s"
# Model-specific limit for GPT-4 (more restrictive)
[llm.providers.openai.rate_limits."gpt-4".per_user]
input_token_limit = 10000 # Only 10K input tokens for GPT-4
interval = "60s"
# Model-specific limit for GPT-3.5-turbo (less restrictive)
[llm.providers.openai.rate_limits."gpt-3.5-turbo".per_user]
input_token_limit = 500000 # 500K input tokens for GPT-3.5
interval = "60s"
Configure different token limits for user groups/tiers:
# Configure client identification with groups
[server.client_identification]
enabled = true
client_id.jwt_claim = "sub"
group_id.jwt_claim = "plan"
[server.client_identification.validation]
group_values = ["free", "pro", "enterprise"]
# Default limits for users without a group
[llm.providers.openai.rate_limits.per_user]
input_token_limit = 10000 # 10K input tokens default
interval = "60s"
# Group-specific limits
[llm.providers.openai.rate_limits.per_user.groups.free]
input_token_limit = 10000 # 10K input tokens for free tier
interval = "60s"
[llm.providers.openai.rate_limits.per_user.groups.pro]
input_token_limit = 100000 # 100K input tokens for pro tier
interval = "60s"
[llm.providers.openai.rate_limits.per_user.groups.enterprise]
input_token_limit = 1000000 # 1M input tokens for enterprise tier
interval = "60s"
Rate limits are evaluated in the following order (most to least specific):
- Model + Group: [llm.providers.<provider>.rate_limits.<model>.per_user.groups.<group>]
- Model: [llm.providers.<provider>.rate_limits.<model>.per_user]
- Provider + Group: [llm.providers.<provider>.rate_limits.per_user.groups.<group>]
- Provider: [llm.providers.<provider>.rate_limits.per_user]
The first matching configuration is used.
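For example, a gpt-4 request from a user in the pro group is checked against [llm.providers.openai.rate_limits."gpt-4".per_user.groups.pro] if that table exists; otherwise the gpt-4 per_user limit applies, then the provider-level pro group limit, and finally the provider-level per_user default.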
Here's a comprehensive example showing all rate limiting features:
# OpenAI provider with tiered rate limits
[llm.providers.openai]
type = "openai"
api_key = "{{ env.OPENAI_API_KEY }}"
# Configure models
[llm.providers.openai.models."gpt-4"]
[llm.providers.openai.models."gpt-3.5-turbo"]
# Default rate limits for all OpenAI models
[llm.providers.openai.rate_limits.per_user]
input_token_limit = 50000 # 50K input tokens
interval = "60s"
# Free tier limits for all OpenAI models
[llm.providers.openai.rate_limits.per_user.groups.free]
input_token_limit = 10000 # 10K input tokens
interval = "60s"
# Pro tier limits for all OpenAI models
[llm.providers.openai.rate_limits.per_user.groups.pro]
input_token_limit = 100000 # 100K input tokens
interval = "60s"
# Enterprise tier limits for all OpenAI models
[llm.providers.openai.rate_limits.per_user.groups.enterprise]
input_token_limit = 1000000 # 1M input tokens
interval = "60s"
# GPT-4 specific limits (more restrictive)
[llm.providers.openai.rate_limits."gpt-4".per_user]
input_token_limit = 25000 # 25K input tokens
interval = "60s"
# GPT-4 enterprise tier gets special treatment
[llm.providers.openai.rate_limits."gpt-4".per_user.groups.enterprise]
input_token_limit = 500000 # 500K input tokens
interval = "60s"
# Anthropic provider with simpler limits
[llm.providers.anthropic]
type = "anthropic"
api_key = "{{ env.ANTHROPIC_API_KEY }}"
[llm.providers.anthropic.models."claude-3-5-sonnet-20241022"]
[llm.providers.anthropic.models."claude-3-opus-20240229"]
# Single limit for all Anthropic models and users
[llm.providers.anthropic.rate_limits.per_user]
input_token_limit = 75000 # 75K input tokens
interval = "60s"
Token rate limits use the same storage backend as configured for server rate limits:
# Option 1: in-memory storage (per-instance counters)
[server.rate_limits]
storage = "memory"

# Option 2: Redis storage (shared counters for multi-instance deployments)
[server.rate_limits]
storage = { type = "redis", url = "redis://localhost:6379" }
See Storage Backends for detailed configuration.
Nexus uses a sliding window algorithm for token rate limiting (a code sketch follows the list):
- Provides smooth rate limiting without hard resets
- Tokens are "returned" to the limit pool as time passes
- More accurate than fixed window counting
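As an illustration only, here is a minimal in-memory sketch of a sliding-window token limiter; the class and method names are invented for this example and do not reflect Nexus's internal implementation or its Redis-backed variant.

import time
from collections import deque

class SlidingWindowTokenLimiter:
    """Tracks token usage over a rolling interval (sketch only)."""

    def __init__(self, limit: int, interval_secs: float) -> None:
        self.limit = limit
        self.interval = interval_secs
        self.events: deque[tuple[float, int]] = deque()  # (timestamp, tokens) pairs

    def try_consume(self, tokens: int) -> bool:
        now = time.monotonic()
        # Entries older than the interval drop out, "returning" their tokens to the pool.
        while self.events and now - self.events[0][0] >= self.interval:
            self.events.popleft()
        used = sum(t for _, t in self.events)
        if used + tokens > self.limit:
            return False  # the caller would reject the request with HTTP 429
        self.events.append((now, tokens))
        return True

limiter = SlidingWindowTokenLimiter(limit=100_000, interval_secs=60)
print(limiter.try_consume(40_000))  # True
print(limiter.try_consume(70_000))  # False: 110K would exceed the 100K window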
Nexus counts only input tokens for rate limiting:
- Only the tokens from the request messages are counted
- Output tokens are not counted against the rate limit
- This provides more predictable rate limiting behavior
# Client identification from JWT
[server.client_identification]
enabled = true
client_id.jwt_claim = "user_id"
group_id.jwt_claim = "subscription_tier"
[server.client_identification.validation]
group_values = ["trial", "basic", "professional", "unlimited"]
# OpenAI configuration
[llm.providers.openai]
type = "openai"
api_key = "{{ env.OPENAI_API_KEY }}"
[llm.providers.openai.models."gpt-3.5-turbo"]
[llm.providers.openai.models."gpt-4"]
# Trial users - very limited
[llm.providers.openai.rate_limits.per_user.groups.trial]
input_token_limit = 5000
interval = "1d"
# Basic tier - reasonable limits
[llm.providers.openai.rate_limits.per_user.groups.basic]
input_token_limit = 50000
interval = "1h"
# Professional tier - generous limits
[llm.providers.openai.rate_limits.per_user.groups.professional]
input_token_limit = 500000
interval = "1h"
# Unlimited tier - no provider-level limits (still subject to server limits)
# No configuration means no token limits applied
# Identify by employee ID and department
[server.client_identification]
enabled = true
client_id.jwt_claim = "employee_id"
group_id.jwt_claim = "department"
[server.client_identification.validation]
group_values = ["engineering", "marketing", "sales", "executive"]
# Different limits per department
[llm.providers.openai.rate_limits.per_user.groups.engineering]
input_token_limit = 1000000 # Engineers need more for coding
interval = "3600s"
[llm.providers.openai.rate_limits.per_user.groups.marketing]
input_token_limit = 200000 # Content generation
interval = "3600s"
[llm.providers.openai.rate_limits.per_user.groups.sales]
input_token_limit = 100000 # Email assistance
interval = "3600s"
[llm.providers.openai.rate_limits.per_user.groups.executive]
input_token_limit = 500000 # Reports and analysis
interval = "3600s"
- Start Conservative: Begin with lower limits and increase based on usage patterns
- Monitor Usage: Track actual input token consumption patterns
- Use Groups: Implement tiered access for different user types
- Model-Specific Limits: Set stricter limits for expensive models (GPT-4, Claude Opus)
- Client Identification: Use JWT claims for secure user identification in production
- Redis for Production: Use Redis storage for multi-instance deployments
- Grace Periods: Consider longer intervals for better user experience
- Clear Communication: Inform users about their limits and usage
Token rate limits work alongside other rate limiting mechanisms:
# IP-based rate limits (always active)
[server.rate_limits.per_ip]
limit = 1000
interval = "60s"
# Token-based rate limits (requires client identification)
[llm.providers.openai.rate_limits.per_user]
input_token_limit = 100000 # 100K input tokens
interval = "60s"
Both limits are enforced independently - a request must pass all applicable rate limit checks.
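For example, with the configuration above, a request that would consume only 1,000 input tokens is still rejected once its client IP has already made 1,000 requests in the current 60-second window.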
If rate limits are not being applied:
- Verify client identification is enabled and working
- Check that the user's group matches configured groups
- Ensure the model name in rate limits matches exactly
- Review logs for rate limiting decisions
If token counts seem higher than expected:
- Remember that role names and message structure add tokens
- System messages count toward the token limit
- Token counts are estimates and may vary slightly
If you see Redis connection issues:
- Verify Redis is running and accessible
- Check connection string and credentials
- Monitor Redis memory usage
- Review connection pool settings
Next steps:
- Enable Token Forwarding for user-provided keys
- Review API Usage for integration examples
- Monitor metrics to optimize limits