Nexus provides token-based rate limiting for LLM endpoints, allowing you to control token consumption per user, per provider, and per model, with support for different user tiers.

Token rate limiting requires client identification to be configured. See Client Identification for setup instructions.

Configure token limits at the provider level to apply to all models from that provider:

[llm.providers.openai.rate_limits.per_user]
input_token_limit = 100000  # 100K input tokens
interval = "60s"            # Per minute

Nexus counts tokens using the following approach (a rough sketch in code follows this list):

  1. Input Token Counting:

    • Uses OpenAI's cl100k_base tokenizer (compatible with GPT-4/GPT-3.5)
    • Counts tokens for each message's role and content
    • Adds ~3 tokens per message for internal structure
    • Adds 3 tokens for assistant response initialization
  2. Pre-flight Check:

    • Input tokens are checked against the limit
    • Request is rejected with 429 status if limit would be exceeded
    • Uses sliding window algorithm for rate limiting
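
For illustration only, the estimate described in step 1 can be reproduced with the tiktoken library roughly as follows. The function name and the exact per-message overhead are assumptions made for this sketch; it is not Nexus's internal code.

# Rough sketch of the input token estimation, assuming tiktoken is installed.
import tiktoken

def estimate_input_tokens(messages: list[dict[str, str]]) -> int:
    enc = tiktoken.get_encoding("cl100k_base")
    total = 3  # tokens reserved for assistant response initialization
    for message in messages:
        total += 3  # approximate per-message structural overhead
        total += len(enc.encode(message["role"]))
        total += len(enc.encode(message["content"]))
    return total

# This estimate is what gets compared against input_token_limit
# in the pre-flight check before the request is forwarded.
print(estimate_input_tokens([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this document."},
]))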

Configure different limits for specific models (overrides provider-level limits):

# Provider-level default for all OpenAI models
[llm.providers.openai.rate_limits.per_user]
input_token_limit = 100000  # 100K input tokens
interval = "60s"

# Model-specific limit for GPT-4 (more restrictive)
[llm.providers.openai.rate_limits."gpt-4".per_user]
input_token_limit = 10000  # Only 10K input tokens for GPT-4
interval = "60s"

# Model-specific limit for GPT-3.5-turbo (less restrictive)
[llm.providers.openai.rate_limits."gpt-3.5-turbo".per_user]
input_token_limit = 500000  # 500K input tokens for GPT-3.5
interval = "60s"

Configure different token limits for user groups/tiers:

# Configure client identification with groups
[server.client_identification]
enabled = true
client_id.jwt_claim = "sub"
group_id.jwt_claim = "plan"

[server.client_identification.validation]
group_values = ["free", "pro", "enterprise"]

# Default limits for users without a group
[llm.providers.openai.rate_limits.per_user]
input_token_limit = 10000  # 10K input tokens default
interval = "60s"

# Group-specific limits
[llm.providers.openai.rate_limits.per_user.groups.free]
input_token_limit = 10000  # 10K input tokens for free tier
interval = "60s"

[llm.providers.openai.rate_limits.per_user.groups.pro]
input_token_limit = 100000  # 100K input tokens for pro tier
interval = "60s"

[llm.providers.openai.rate_limits.per_user.groups.enterprise]
input_token_limit = 1000000  # 1M input tokens for enterprise tier
interval = "60s"

Rate limits are evaluated in the following order (most to least specific):

  1. Model + Group: [llm.providers.<provider>.rate_limits.<model>.per_user.groups.<group>]
  2. Model: [llm.providers.<provider>.rate_limits.<model>.per_user]
  3. Provider + Group: [llm.providers.<provider>.rate_limits.per_user.groups.<group>]
  4. Provider: [llm.providers.<provider>.rate_limits.per_user]

The first matching configuration is used.
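
To illustrate the lookup order, here is a minimal Python sketch of first-match resolution over a parsed configuration. The function and dictionary shapes are hypothetical, not Nexus's actual code.

# Hypothetical sketch of the precedence rules above (not Nexus source code).
def resolve_per_user_limit(config: dict, provider: str, model: str, group: str | None) -> dict | None:
    rate_limits = config["llm"]["providers"][provider].get("rate_limits", {})
    model_limits = rate_limits.get(model, {})
    candidates = [
        model_limits.get("per_user", {}).get("groups", {}).get(group),  # 1. model + group
        model_limits.get("per_user"),                                   # 2. model
        rate_limits.get("per_user", {}).get("groups", {}).get(group),   # 3. provider + group
        rate_limits.get("per_user"),                                    # 4. provider
    ]
    # The first matching configuration wins; None means no token limit applies.
    return next((limit for limit in candidates if limit is not None), None)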

Here's a comprehensive example showing all rate limiting features:

# OpenAI provider with tiered rate limits
[llm.providers.openai]
type = "openai"
api_key = "{{ env.OPENAI_API_KEY }}"

# Configure models
[llm.providers.openai.models."gpt-4"]
[llm.providers.openai.models."gpt-3.5-turbo"]

# Default rate limits for all OpenAI models
[llm.providers.openai.rate_limits.per_user]
input_token_limit = 50000  # 50K input tokens
interval = "60s"

# Free tier limits for all OpenAI models
[llm.providers.openai.rate_limits.per_user.groups.free]
input_token_limit = 10000  # 10K input tokens
interval = "60s"

# Pro tier limits for all OpenAI models
[llm.providers.openai.rate_limits.per_user.groups.pro]
input_token_limit = 100000  # 100K input tokens
interval = "60s"

# Enterprise tier limits for all OpenAI models
[llm.providers.openai.rate_limits.per_user.groups.enterprise]
input_token_limit = 1000000  # 1M input tokens
interval = "60s"

# GPT-4 specific limits (more restrictive)
[llm.providers.openai.rate_limits."gpt-4".per_user]
input_token_limit = 25000  # 25K input tokens
interval = "60s"

# GPT-4 enterprise tier gets special treatment
[llm.providers.openai.rate_limits."gpt-4".per_user.groups.enterprise]
input_token_limit = 500000  # 500K input tokens
interval = "60s"

# Anthropic provider with simpler limits
[llm.providers.anthropic]
type = "anthropic"
api_key = "{{ env.ANTHROPIC_API_KEY }}"

[llm.providers.anthropic.models."claude-3-5-sonnet-20241022"]
[llm.providers.anthropic.models."claude-3-opus-20240229"]

# Single limit for all Anthropic models and users
[llm.providers.anthropic.rate_limits.per_user]
input_token_limit = 75000  # 75K input tokens
interval = "60s"

Token rate limits use the same storage backend as configured for server rate limits:

# In-memory storage (single instance)
[server.rate_limits]
storage = "memory"

# Redis storage (multi-instance deployments)
[server.rate_limits]
storage = { type = "redis", url = "redis://localhost:6379" }

See Storage Backends for detailed configuration.

Nexus uses a sliding window algorithm for token rate limiting (a conceptual sketch follows the list below):

  • Provides smooth rate limiting without hard resets
  • Tokens are "returned" to the limit pool as time passes
  • More accurate than fixed window counting
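
Conceptually, the behavior resembles the following sketch. It is illustrative only; Nexus's internal implementation and its storage handling differ.

# Minimal sliding-window sketch: consumption is timestamped, and entries older
# than the interval no longer count, so capacity "returns" gradually.
import time

class SlidingWindowTokenLimiter:
    def __init__(self, limit: int, interval_seconds: float):
        self.limit = limit
        self.interval = interval_seconds
        self.events: list[tuple[float, int]] = []  # (timestamp, tokens consumed)

    def try_consume(self, tokens: int) -> bool:
        now = time.monotonic()
        # Drop events that have slid out of the window.
        self.events = [(t, n) for t, n in self.events if now - t < self.interval]
        if sum(n for _, n in self.events) + tokens > self.limit:
            return False  # caller responds with 429
        self.events.append((now, tokens))
        return True

limiter = SlidingWindowTokenLimiter(limit=100_000, interval_seconds=60)
print(limiter.try_consume(2_500))  # True until the window fills up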

Nexus counts only input tokens for rate limiting:

  • Only the tokens from the request messages are counted
  • Output tokens are not counted against the rate limit
  • This provides more predictable rate limiting behavior

The following example maps SaaS subscription tiers from a JWT claim to token limits:

# Client identification from JWT
[server.client_identification]
enabled = true
client_id.jwt_claim = "user_id"
group_id.jwt_claim = "subscription_tier"

[server.client_identification.validation]
group_values = ["trial", "basic", "professional", "unlimited"]

# OpenAI configuration
[llm.providers.openai]
type = "openai"
api_key = "{{ env.OPENAI_API_KEY }}"

[llm.providers.openai.models."gpt-3.5-turbo"]
[llm.providers.openai.models."gpt-4"]

# Trial users - very limited
[llm.providers.openai.rate_limits.per_user.groups.trial]
input_token_limit = 5000
interval = "1d"

# Basic tier - reasonable limits
[llm.providers.openai.rate_limits.per_user.groups.basic]
input_token_limit = 50000
interval = "1h"

# Professional tier - generous limits
[llm.providers.openai.rate_limits.per_user.groups.professional]
input_token_limit = 500000
interval = "1h"

# Unlimited tier - no provider-level limits (still subject to server limits)
# No configuration means no token limits applied

The following example applies per-department limits for internal company usage:

# Identify by employee ID and department
[server.client_identification]
enabled = true
client_id.jwt_claim = "employee_id"
group_id.jwt_claim = "department"

[server.client_identification.validation]
group_values = ["engineering", "marketing", "sales", "executive"]

# Different limits per department
[llm.providers.openai.rate_limits.per_user.groups.engineering]
input_token_limit = 1000000  # Engineers need more for coding
interval = "3600s"

[llm.providers.openai.rate_limits.per_user.groups.marketing]
input_token_limit = 200000  # Content generation
interval = "3600s"

[llm.providers.openai.rate_limits.per_user.groups.sales]
input_token_limit = 100000  # Email assistance
interval = "3600s"

[llm.providers.openai.rate_limits.per_user.groups.executive]
input_token_limit = 500000  # Reports and analysis
interval = "3600s"

Best practices for configuring token limits:

  1. Start Conservative: Begin with lower limits and increase them based on usage patterns
  2. Monitor Usage: Track actual input token consumption patterns
  3. Use Groups: Implement tiered access for different user types
  4. Model-Specific Limits: Set stricter limits for expensive models (GPT-4, Claude Opus)
  5. Client Identification: Use JWT claims for secure user identification in production
  6. Redis for Production: Use Redis storage for multi-instance deployments
  7. Grace Periods: Consider longer intervals for better user experience
  8. Clear Communication: Inform users about their limits and usage

Token rate limits work alongside other rate limiting mechanisms:

# IP-based rate limits (always active)
[server.rate_limits.per_ip]
limit = 1000
interval = "60s"

# Token-based rate limits (requires client identification)
[llm.providers.openai.rate_limits.per_user]
input_token_limit = 100000  # 100K input tokens
interval = "60s"

Both limits are enforced independently - a request must pass all applicable rate limit checks.
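
Continuing the earlier sketches, a combined pre-flight check might look like the following. The helper names are hypothetical and reuse the sketches above; this is not Nexus's actual code.

# Both checks run independently; failing either one rejects the request.
def preflight_check(ip_limiter, token_limiter, messages) -> int | None:
    if not ip_limiter.try_consume(1):  # per-IP request limit
        return 429
    if not token_limiter.try_consume(estimate_input_tokens(messages)):  # per-user token limit
        return 429
    return None  # all limits passed; forward the request to the provider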

If rate limits are not being applied as expected:

  • Verify client identification is enabled and working
  • Check that the user's group matches a configured group
  • Ensure the model name in the rate limit configuration matches the requested model exactly
  • Review logs for rate limiting decisions

If token counts seem higher than expected:

  • Remember that role names and message structure add tokens
  • System messages count toward the token limit
  • Token counts are estimates and may vary slightly

If you use Redis storage and see connection issues:

  • Verify Redis is running and accessible
  • Check the connection string and credentials
  • Monitor Redis memory usage
  • Review connection pool settings

Next steps:

  • Enable Token Forwarding for user-provided keys
  • Review API Usage for integration examples
  • Monitor metrics to optimize limits