
Multi-server support: Externalize OAuth state management #81

@AnthonyRonning

Description

Problem Description

The OAuth authentication flow currently stores CSRF state tokens in server-local memory, preventing successful OAuth flows when users are load-balanced across multiple servers.

Current Implementation

  • Location: src/oauth.rs lines 26, 39, 136, 149, 247, 261
  • Structure: Each OAuth provider (GitHub, Google, Microsoft) maintains its own in-memory state store:
pub state_store: Arc<RwLock<HashMap<String, OAuthState>>>,
  • OAuth State Contents (a possible struct shape is sketched below):
    • CSRF token for security
    • User context (project_id, redirect URLs)
    • Timestamp for expiration
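
Whatever backend is chosen, this state must serialize once it leaves process memory. A possible shape, inferred from the contents listed above (field names are assumptions; the real struct lives in src/oauth.rs):

use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};

// Hypothetical shape; Serialize/Deserialize become required
// as soon as the state is stored outside the process.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct OAuthState {
    pub csrf_token: String,        // CSRF token for security
    pub project_id: String,        // user context
    pub redirect_url: String,      // where to send the user after the callback
    pub created_at: DateTime<Utc>, // timestamp for expiration
}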

Impact on Multi-Server Deployment

Without Changes (Sticky Sessions)

  • Will work if the user stays on the same server for the entire OAuth flow
  • Will fail if:
    • The server crashes/restarts during the OAuth flow → user must restart OAuth login
    • The load balancer reassigns the user mid-flow → user must restart OAuth login
    • The sticky session expires during OAuth → user must restart OAuth login

User Experience Impact

HIGH SEVERITY - Users who land on a different server mid-flow must restart the OAuth login process from scratch, leading to:

  • Confusion ("Why did my login fail?")
  • Potential abandonment of signup/login
  • Support tickets about "broken" login

Proposed Solutions

Option 1: Redis-based State Store

Implementation:

use redis::{AsyncCommands, Client};

pub struct RedisOAuthStateStore {
    client: Client,
    ttl_seconds: u64, // e.g., 600 for 10 minutes
}

impl RedisOAuthStateStore {
    pub async fn store_state(&self, state_token: String, oauth_state: OAuthState) -> Result<(), Error> {
        let mut conn = self.client.get_multiplexed_async_connection().await?;
        let serialized = serde_json::to_string(&oauth_state)?;
        // SET with EX: write and TTL in a single atomic command
        let _: () = conn.set_ex(&state_token, serialized, self.ttl_seconds).await?;
        Ok(())
    }

    pub async fn get_state(&self, state_token: &str) -> Result<Option<OAuthState>, Error> {
        let mut conn = self.client.get_multiplexed_async_connection().await?;
        let value: Option<String> = conn.get(state_token).await?;
        match value {
            Some(v) => Ok(Some(serde_json::from_str(&v)?)),
            None => Ok(None),
        }
    }

    pub async fn remove_state(&self, state_token: &str) -> Result<(), Error> {
        let mut conn = self.client.get_multiplexed_async_connection().await?;
        let _: () = conn.del(state_token).await?;
        Ok(())
    }
}
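
Wiring it up is straightforward (a sketch; the URL and TTL would come from configuration):

// Example construction; redis::Client::open accepts a connection URL.
let store = RedisOAuthStateStore {
    client: redis::Client::open("redis://127.0.0.1/")?,
    ttl_seconds: 600, // matches the 10-minute OAuth state lifetime
};
store.store_state(state_token, oauth_state).await?;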

Pros:

  • ✅ Fast O(1) lookups
  • ✅ Built-in TTL support (auto-cleanup)
  • ✅ Atomic operations
  • ✅ Minimal latency impact

Cons:

  • ❌ Additional infrastructure dependency
  • ❌ Redis clustering needed for HA
  • ❌ Memory cost for Redis instance

Dependencies:

redis = { version = "0.25", features = ["tokio-comp", "connection-manager"] }

Option 2: PostgreSQL-based State Store

Implementation:

-- Migration: Create oauth_states table
CREATE TABLE oauth_states (
    state_token VARCHAR(255) PRIMARY KEY,
    oauth_state JSONB NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    expires_at TIMESTAMP WITH TIME ZONE NOT NULL
);

-- Index for cleanup queries
CREATE INDEX idx_oauth_states_expires_at ON oauth_states(expires_at);
use diesel::prelude::*;
use chrono::{DateTime, Utc, Duration};

#[derive(Insertable)]
#[diesel(table_name = oauth_states)]
pub struct NewOAuthState {
    pub state_token: String,
    pub oauth_state: serde_json::Value,
    pub expires_at: DateTime<Utc>,
}

#[async_trait]
impl OAuthStateStore for PostgresStateStore {
    async fn store_state(&self, state_token: String, oauth_state: OAuthState) -> Result<(), Error> {
        let expires_at = Utc::now() + Duration::minutes(10);
        let new_state = NewOAuthState {
            state_token,
            oauth_state: serde_json::to_value(&oauth_state)?,
            expires_at,
        };

        // Diesel 2.x needs a mutable connection; assumes the store holds an r2d2 pool
        let mut conn = self.pool.get()?;
        diesel::insert_into(oauth_states::table)
            .values(&new_state)
            .execute(&mut conn)?;
        Ok(())
    }

    async fn get_state(&self, state_token: &str) -> Result<Option<OAuthState>, Error> {
        let mut conn = self.pool.get()?;
        // Filtering on expires_at hides stale rows even before the cleanup task runs
        let result = oauth_states::table
            .filter(oauth_states::state_token.eq(state_token))
            .filter(oauth_states::expires_at.gt(Utc::now()))
            .select(oauth_states::oauth_state)
            .first::<serde_json::Value>(&mut conn)
            .optional()?;

        match result {
            Some(json) => Ok(Some(serde_json::from_value(json)?)),
            None => Ok(None),
        }
    }

    async fn remove_state(&self, state_token: &str) -> Result<(), Error> {
        let mut conn = self.pool.get()?;
        // State tokens are one-time-use: delete on successful callback
        diesel::delete(oauth_states::table.filter(oauth_states::state_token.eq(state_token)))
            .execute(&mut conn)?;
        Ok(())
    }
}

Cleanup Task:

// Run periodically (e.g., every hour)
async fn cleanup_expired_states(conn: &mut PgConnection) -> Result<usize, Error> {
    // Returns the number of rows deleted
    Ok(diesel::delete(oauth_states::table)
        .filter(oauth_states::expires_at.lt(Utc::now()))
        .execute(conn)?)
}
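
A sketch of how this could be scheduled from the tokio runtime (PgPool is a placeholder for whatever connection pool the service uses):

// Spawn a background task that deletes expired rows once an hour.
fn spawn_cleanup(pool: PgPool) {
    tokio::spawn(async move {
        let mut interval = tokio::time::interval(std::time::Duration::from_secs(3600));
        loop {
            interval.tick().await;
            match pool.get() {
                Ok(mut conn) => {
                    if let Err(e) = cleanup_expired_states(&mut conn).await {
                        tracing::warn!("oauth state cleanup failed: {e}");
                    }
                }
                Err(e) => tracing::warn!("cleanup could not get a DB connection: {e}"),
            }
        }
    });
}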

Pros:

  • ✅ No additional infrastructure
  • ✅ Transactional consistency
  • ✅ Can join with user tables if needed
  • ✅ Existing backup/recovery procedures apply

Cons:

  • ❌ Slightly higher latency than Redis
  • ❌ Requires periodic cleanup task
  • ❌ Additional database load

Implementation Steps

  1. Create abstraction trait:
use async_trait::async_trait;

#[async_trait]
pub trait OAuthStateStore: Send + Sync {
    async fn store_state(&self, state_token: String, oauth_state: OAuthState) -> Result<(), Error>;
    async fn get_state(&self, state_token: &str) -> Result<Option<OAuthState>, Error>;
    async fn remove_state(&self, state_token: &str) -> Result<(), Error>;
}
  2. Update OAuth providers to use the injected state store instead of the local HashMap

  3. Add configuration for Redis/PostgreSQL selection (a sketch of steps 2 and 3 follows this list)

  4. Test scenarios:

    • OAuth flow across server restarts
    • Concurrent OAuth flows
    • State expiration handling
    • Performance under load
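
A possible shape for steps 2 and 3 (names are illustrative; assumes both concrete stores implement the trait from step 1):

use std::sync::Arc;

// Step 2: providers hold a shared trait object instead of their own HashMap.
pub struct GitHubOAuthProvider {
    pub state_store: Arc<dyn OAuthStateStore>,
    // ...client_id, client_secret, etc.
}

// Step 3: choose the backend once at startup from configuration.
pub enum StateStoreConfig {
    Redis { url: String, ttl_seconds: u64 },
    Postgres,
}

pub fn build_state_store(cfg: StateStoreConfig) -> Result<Arc<dyn OAuthStateStore>, Error> {
    match cfg {
        StateStoreConfig::Redis { url, ttl_seconds } => Ok(Arc::new(RedisOAuthStateStore {
            client: redis::Client::open(url)?,
            ttl_seconds,
        })),
        // PostgresStateStore::new() is a hypothetical constructor taking a pool
        StateStoreConfig::Postgres => Ok(Arc::new(PostgresStateStore::new()?)),
    }
}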

Recommendation

For production: Use Redis if you already have Redis infrastructure, otherwise PostgreSQL is simpler to maintain.

Reasoning:

  • OAuth state is ephemeral (10-minute lifetime)
  • Redis provides better performance for this use case
  • PostgreSQL is acceptable if avoiding new dependencies

Testing Plan

  1. Unit tests: Mock state store implementations (an in-memory test double is sketched after this list)
  2. Integration tests: Test with real Redis/PostgreSQL
  3. Load testing: Simulate 1000+ concurrent OAuth flows
  4. Chaos testing: Kill servers mid-OAuth flow
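
For item 1, the simplest mock is an in-memory implementation of the trait itself (a sketch; assumes OAuthState is Clone):

use std::collections::HashMap;
use std::sync::RwLock;
use async_trait::async_trait;

// Test double: same trait, plain HashMap behind a lock, no expiry.
#[derive(Default)]
pub struct InMemoryStateStore {
    states: RwLock<HashMap<String, OAuthState>>,
}

#[async_trait]
impl OAuthStateStore for InMemoryStateStore {
    async fn store_state(&self, state_token: String, oauth_state: OAuthState) -> Result<(), Error> {
        self.states.write().unwrap().insert(state_token, oauth_state);
        Ok(())
    }

    async fn get_state(&self, state_token: &str) -> Result<Option<OAuthState>, Error> {
        Ok(self.states.read().unwrap().get(state_token).cloned())
    }

    async fn remove_state(&self, state_token: &str) -> Result<(), Error> {
        self.states.write().unwrap().remove(state_token);
        Ok(())
    }
}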

Migration Strategy

  1. Deploy with a feature flag for the new state store (a transitional fallback store is sketched after this list)
  2. Test with small percentage of traffic
  3. Monitor error rates and latency
  4. Gradually increase traffic percentage
  5. Remove in-memory implementation
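
One optional way to protect in-flight logins during step 1 (a sketch, not part of the plan above): a transitional store that writes to the new backend but falls back to the legacy in-memory store on reads.

// Transitional store: new writes go to the external store; reads check it
// first, then fall back to the legacy in-memory store so logins that began
// before the deploy still complete.
pub struct MigratingStateStore {
    new_store: Arc<dyn OAuthStateStore>,
    legacy_store: Arc<dyn OAuthStateStore>,
}

#[async_trait]
impl OAuthStateStore for MigratingStateStore {
    async fn store_state(&self, token: String, state: OAuthState) -> Result<(), Error> {
        self.new_store.store_state(token, state).await
    }

    async fn get_state(&self, token: &str) -> Result<Option<OAuthState>, Error> {
        match self.new_store.get_state(token).await? {
            Some(s) => Ok(Some(s)),
            None => self.legacy_store.get_state(token).await,
        }
    }

    async fn remove_state(&self, token: &str) -> Result<(), Error> {
        // Best-effort removal from both during the transition window.
        self.legacy_store.remove_state(token).await.ok();
        self.new_store.remove_state(token).await
    }
}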

Monitoring

Add metrics for the following (a timing-helper sketch follows the list):

  • OAuth state store latency (p50, p95, p99)
  • OAuth flow success rate
  • State store errors
  • Expired state cleanup rate
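
These could hang off a small timing helper wrapped around every store call (a sketch using tracing; a real metrics client would record a histogram and error counter at the same point):

use std::time::Instant;

// Generic timing hook: wraps any state-store call and emits latency + outcome.
async fn timed<T, F>(op: &'static str, fut: F) -> Result<T, Error>
where
    F: std::future::Future<Output = Result<T, Error>>,
{
    let start = Instant::now();
    let result = fut.await;
    tracing::info!(
        op,
        latency_ms = start.elapsed().as_millis() as u64,
        ok = result.is_ok(),
        "oauth state store call"
    );
    result
}

// Usage at a call site:
// let state = timed("get_state", store.get_state(&token)).await?;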

Related Issues

  • #[TBD] - Externalize session states
  • #[TBD] - Externalize ephemeral keys
