
Documentation

Bennett Wu edited this page Sep 24, 2025 · 12 revisions

Configuration

Configuring Docker Containers

The node server and whisper service Docker containers support all of the same options as detailed in Configuring Node Server and Configuring Whisper Service with the following exceptions:

  • For both node server and whisper service
    • HOST and PORT are disabled.
      • Docker will expose the service on all hosts the Docker daemon is configured to use. Modify the daemon to change which hosts are used.
      • To select which port to use, publish a port when starting the Docker container:
        docker run -p [YOUR PORT]:80 --env-file .env scribear/node-server:main
        
        docker run -p [YOUR PORT]:80 --env-file .env -v ./device_config.json:/app/device_config.json scribear/whisper-service-cpu:main
        

Configuring Docker Compose Deployment

When deploying with Docker Compose, the node server and whisper service containers are configured the same way as standalone containers (see Configuring Docker Containers) with the following exceptions:

  • WHISPER_SERVICE_ENDPOINT for node server is automatically generated using API_KEY.
  • NODE_PORT is added to automatically select the port Docker exposes node server on.
  • FRONTEND_PORT is added to automatically select the port Docker exposes the frontend on.

Configuring Node Server

Node server is configured using environment variables defined in .env.

Runtime Options

  • NODE_ENV
    • Values: development, production, test (default: production)
    • Indicates the environment the service is running in.
  • LOG_LEVEL
    • Values: error, warn, info, debug, trace, silent (default: info)
    • Sets the verbosity of logging.

Server Options

  • HOST
    • Values: string (default: 127.0.0.1)
    • The socket address the node server will bind to. Use 0.0.0.0 to make the server available to the local network, 127.0.0.1 for localhost only.
  • PORT
    • Values: number (default: 8080)
    • Port number that node server will listen for connections on.
  • CORS_ORIGIN
    • Values: string (default: *)
    • CORS origin configuration for node server.
  • SERVER_ADDRESS
    • Values: string (default: 127.0.0.1:8080)
    • Address the node server is reachable at. Used in the ScribeAR QR code to allow other devices to connect.

Whisper Service Options

  • WHISPER_SERVICE_ENDPOINT
    • Values: string (required, no default)
    • Websocket address for the whisper service endpoint. Should be in the format ws://${ADDRESS}:${PORT}/sourcesink, where ADDRESS is the address or IP of the whisper service and PORT is the port the whisper service is listening on. This should match what the whisper service is configured to use.
  • API_KEY
    • Values: string (required, no default)
    • API key for the whisper service. This should match what the whisper service is configured to use.

Authentication Options

  • REQUIRE_AUTH
    • Values: true, false (default: true)
    • If true, requires authentication to connect to the node server API; otherwise no authentication is used. See Authentication for details.
  • SOURCE_TOKEN
    • Values: string (required if REQUIRE_AUTH=true, no default)
    • The key used by the frontend to connect as an audio source. See Authentication for details.
  • ACCESS_TOKEN_REFRESH_INTERVAL_SEC
    • Values: number (required if REQUIRE_AUTH=true, no default)
    • Number of seconds to wait before generating a new access token. See Authentication for details.
  • ACCESS_TOKEN_BYTES
    • Values: number (required if REQUIRE_AUTH=true, no default)
    • The number of random bytes used to generate access tokens.
  • ACCESS_TOKEN_VALID_PERIOD_SEC
    • Values: number (required if REQUIRE_AUTH=true, no default)
    • Number of seconds a newly generated access token is valid for. See Authentication for details.
  • SESSION_TOKEN_BYTES
    • Values: number (required if REQUIRE_AUTH=true, no default)
    • The number of random bytes used to generate session tokens.
  • SESSION_LENGTH_SEC
    • Values: number (required if REQUIRE_AUTH=true, no default)
    • Number of seconds a newly generated session token is valid for. See Authentication for details.
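Putting the options together, a hypothetical .env for a node server exposed to the local network with authentication enabled might look like the following. Every value is a placeholder for illustration, not a recommendation:

```
# Runtime
NODE_ENV=production
LOG_LEVEL=info

# Server
HOST=0.0.0.0
PORT=8080
SERVER_ADDRESS=192.168.1.50:8080

# Whisper service
WHISPER_SERVICE_ENDPOINT=ws://127.0.0.1:8000/sourcesink
API_KEY=replace-with-your-api-key

# Authentication
REQUIRE_AUTH=true
SOURCE_TOKEN=replace-with-a-long-random-secret
ACCESS_TOKEN_REFRESH_INTERVAL_SEC=60
ACCESS_TOKEN_BYTES=16
ACCESS_TOKEN_VALID_PERIOD_SEC=120
SESSION_TOKEN_BYTES=32
SESSION_LENGTH_SEC=7200
```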

Configuring Whisper Service

Whisper service is configured with two files: .env and device_config.json. .env configures the webserver whisper service uses, while device_config.json configures the models a particular deployment of whisper service will support.

.env Options

  • LOG_LEVEL
    • Values: info, debug, trace (default: info)
    • Sets the verbosity of logging.
  • API_KEY
    • Values: string (default: empty string)
    • The API key that must be passed to whisper service in order to establish a connection. Should match the api_key= URL parameter for node server.
  • HOST
    • Values: string (default: 127.0.0.1)
    • The socket address the whisper service will bind to. Use 0.0.0.0 to make the service available to the local network, 127.0.0.1 for localhost only.
  • PORT
    • Values: number (default: 8000)
    • Port number that whisper service will listen for connections on. Should match the port node server is trying to connect to.

device_config.json

When node server or a user connects to whisper service, they must specify a model_key (see Model Selection). This uniquely identifies a particular model implementation and configuration (see Model Implementations and Configurations) to be run by whisper service. The mapping is configured for each individual device whisper service runs on, to account for varying hardware capabilities. device_config.json defines this mapping from model_key to an implementation and configuration.

device_config.json is a JSON file (https://www.json.org/json-en.html). Think of it as a nested series of key-value pairs.

{
  "model_key": {
    "display_name": "string",
    "description": "string",
    "implementation_id": "string",
    "implementation_configuration": {},
    "available_features": {}
  },
  ...repeated for as many models as desired
}
  • model_key
    • The unique identifier used by node server or ScribeAR to select which model whisper service should run.
    • These can be whatever string you'd like and you can define as many unique model_keys as you'd like.
  • display_name
    • This is a friendly name that the ScribeAR frontend can display to the user.
  • description
    • This is a description that the ScribeAR frontend can display to the user.
  • implementation_id
    • The identifier of the model implementation whisper service should run (See Model Implementations and Configurations).
  • implementation_configuration
    • This is a configuration object that configures a model implementation.
    • This is unique for each implementation.
    • See Model Implementations and Configurations for the configuration options for your selected model implementation.
  • available_features
    • This is an object that defines which features that this model supports.

Example config:

{
  "mock_transcription_duration": {
    "display_name": "Sanity Test",
    "description": "Returns how many seconds of audio was received by whisper service.",
    "implementation_id": "mock_transcription_duration",
    "implementation_configuration": {},
    "available_features": {}
  },
  "faster-whisper:cpu-tiny-en": {
    "display_name": "Faster Whisper - tiny.en",
    "description": "Faster Whisper implementation of Open AI Whisper tiny.en model.",
    "implementation_id": "faster_whisper",
    "implementation_configuration": {
      "model": "tiny.en",
      "device": "cpu",
      "local_agree_dim": 2,
      "min_new_samples": 48000,
      "max_segment_samples": 480000
    },
    "available_features": {}
  }
}
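Because a malformed device_config.json is an easy way to break a deployment, it can help to sanity-check the file before starting whisper service. The sketch below is illustrative and not part of whisper service; the required field names follow the schema described above:

```python
import json

# Required fields for each model entry, per the device_config.json schema
REQUIRED_FIELDS = {
    "display_name": str,
    "description": str,
    "implementation_id": str,
    "implementation_configuration": dict,
    "available_features": dict,
}

def validate_device_config(text: str) -> list[str]:
    """Return a list of human-readable problems found in a device_config.json string."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err}"]
    if not isinstance(config, dict):
        return ["top level must be an object mapping model_key -> model entry"]
    problems = []
    for model_key, entry in config.items():
        if not isinstance(entry, dict):
            problems.append(f"{model_key}: entry must be an object")
            continue
        for field, expected in REQUIRED_FIELDS.items():
            if field not in entry:
                problems.append(f"{model_key}: missing {field}")
            elif not isinstance(entry[field], expected):
                problems.append(f"{model_key}: {field} should be a {expected.__name__}")
    return problems
```

For example, validate_device_config(open("device_config.json").read()) returns an empty list when every entry has all five fields with the right types.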

Model Implementations and Configurations

  • Every model implementation (e.g. faster whisper, whisper.cpp) that whisper service supports is given a unique implementation_id.
  • Each model implementation has unique configuration options to tune its performance, accuracy, and more.
  • A list of implementation_ids and their configuration schemas can be found below.

mock_transcription_duration

  • This is a sanity check model implementation. It returns "transcripts" that say how many seconds of audio was received.
  • Configuration:
    {}
    • This implementation has no configurable options.

faster_whisper

  • This uses Faster Whisper, an implementation of OpenAI's Whisper model using CTranslate2.
  • Configuration:
    {
        "model": "string",
        "device": "string",
        "local_agree_dim": "number",
        "min_new_samples": "number",
        "max_segment_samples": "number"
    }
    • model
      • The CTranslate2 whisper model that Faster Whisper should use.
      • Models are automatically downloaded from HuggingFace: https://huggingface.co/Systran
      • Models can also be loaded from a file path to a CTranslate2 model.
    • device
      • The device to run the model on.
      • Options are: cpu, cuda, auto
    • local_agree_dim
      • The number of times the model needs to agree with itself on the same segment of audio in order for a transcription to be considered "final".
      • Must be an integer at least 1.
    • min_new_samples
      • The minimum number of new audio samples to buffer before rerunning the model.
      • A smaller value means the model is run more frequently, lowering latency but increasing resource usage.
      • Note: audio is currently processed at 16,000 samples per second, so min_new_samples: 16000 corresponds to running the model approximately every second.
      • Must be an integer at least 1.
    • max_segment_samples
      • The maximum number of audio samples a single run of the model is allowed to contain.
      • Note: current whisper models can process a maximum of 30 seconds of audio.
      • Must be an integer at least min_new_samples.
    • Example:
      {
        "model": "tiny.en",
        "device": "cpu",
        "local_agree_dim": 2,
        "min_new_samples": 48000,
        "max_segment_samples": 480000
      }
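Since audio is processed at 16,000 samples per second, the sample-count options translate directly to time. A small helper (hypothetical, not part of whisper service) makes the arithmetic explicit:

```python
SAMPLE_RATE = 16_000  # whisper service currently processes audio at 16 kHz

def samples_to_seconds(samples: int) -> float:
    """How many seconds of audio a sample count represents at 16 kHz."""
    return samples / SAMPLE_RATE

def seconds_to_samples(seconds: float) -> int:
    """Sample count for a duration at 16 kHz."""
    return int(seconds * SAMPLE_RATE)

# The example configuration above reruns the model after every ~3 seconds of
# new audio and caps each run at 30 seconds, the maximum current whisper
# models can process.
assert samples_to_seconds(48_000) == 3.0
assert samples_to_seconds(480_000) == 30.0
```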

Endpoints

Node Server Endpoints

  • GET /healthcheck
    • Simple healthcheck endpoint
    • Used by Docker container as a health probe to determine if container is alive or not
  • POST /accessToken
    • Fetches the currently active accessToken
    • Only clients with a valid sourceToken can access this endpoint
    • Request Body:
      {
        "sourceToken": "string"
      }
      • sourceToken
        • The source token used by node server for authentication. See Authentication
    • Response:
      {
        "accessToken": "string",
        "serverAddress": "string",
        "expires": "string"
      }
      • accessToken
        • The currently active accessToken
      • serverAddress
        • The address the node server is reachable at (see SERVER_ADDRESS)
      • expires
        • The expiration datetime of accessToken given as an ISO date string
  • POST /startSession
    • Fetches a newly generated sessionToken
    • Only clients with a valid accessToken can access this endpoint
    • Request Body:
      {
        "accessToken": "string"
      }
      • accessToken
        • The access token used by node server for authentication. See Authentication
    • Response:
      {
        "sessionToken": "string",
        "expires": "string"
      }
      • sessionToken
        • The newly generated sessionToken
      • expires
        • The expiration datetime of sessionToken given as an ISO date string
  • WS /api/sourcesink
    • Websocket endpoint for sending audio and receiving transcriptions
    • Client must send a NodeAuthMessage containing a valid sourceToken immediately after connecting
  • WS /api/source
    • Websocket endpoint for sending audio
    • Client must send a NodeAuthMessage with valid sourceToken immediately after connecting
  • WS /api/sink
    • Websocket endpoint for receiving transcriptions
    • Client must send a NodeAuthMessage with valid sessionToken or sourceToken immediately after connecting

Whisper Service Endpoints

  • GET /healthcheck
    • Simple healthcheck endpoint
    • Used by Docker container as a health probe to determine if container is alive or not
  • WS /sourcesink
    • Websocket endpoint for sending audio and receiving transcriptions
    • Client must send a WhisperAuthMessage containing a valid api_key immediately after connecting

Message Schemas

Below are the websocket message schemas exchanged between the frontend, node server, and whisper service.

NodeAuthMessage

  • Frontend -> Node Server
  • Schema
    {
      "accessToken": "string / undefined",
      "sessionToken": "string / undefined",
      "sourceToken": "string / undefined"
    }
    • accessToken, sessionToken, sourceToken
      • The different tokens used by node server for authentication. See Authentication
      • Only one of these needs to be defined.
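A client-side helper for building this message could look like the sketch below. It is illustrative only; the field names match the schema above, and it enforces that exactly one token is supplied:

```python
import json

def node_auth_message(access_token=None, session_token=None, source_token=None) -> str:
    """Serialize a NodeAuthMessage with exactly one token field set."""
    fields = {
        "accessToken": access_token,
        "sessionToken": session_token,
        "sourceToken": source_token,
    }
    present = {key: value for key, value in fields.items() if value is not None}
    if len(present) != 1:
        raise ValueError("provide exactly one of accessToken, sessionToken, sourceToken")
    return json.dumps(present)
```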

AudioChunk

  • Frontend -> Node Server -> Whisper Service
  • Frontend -> Whisper Service
  • WAV audio buffers
    • 16-bit PCM (2 bytes per sample)
    • 16,000 samples per second
    • 1 channel
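The required format can be produced with Python's standard library alone. The snippet below builds one second of silence as a WAV buffer in that format (16-bit PCM, 16 kHz, mono); it is a sketch for testing a client, not part of any ScribeAR component:

```python
import io
import wave

SAMPLE_RATE = 16_000

def silent_wav_chunk(seconds: float = 1.0) -> bytes:
    """Build a WAV buffer of silence: 16-bit PCM, 16 kHz, 1 channel."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)           # mono
        wav.setsampwidth(2)           # 16-bit PCM = 2 bytes per sample
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(b"\x00\x00" * int(seconds * SAMPLE_RATE))
    return buffer.getvalue()
```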

SelectionOptions

  • Whisper Service -> Node Server -> Frontend
  • Whisper Service -> Frontend
  • Schema:
    [
      {
        "model_key": "string",
        "display_name": "string",
        "description": "string",
        "available_features": {}
      },
      ...repeated for however many models are supported
    ]

BackendTranscriptionBlock

  • Whisper Service -> Node Server -> Frontend
  • Whisper Service -> Frontend
  • Schema:
    {
      "type": "number",
      "text": "string",
      "start": "number",
      "end": "end"
    }
    • type
      • The type of transcription block this is.
      • 0 for a finalized block.
      • 1 for an in-progress block used to lower latency.
    • text
      • The text itself of the transcription
    • start
      • The start time of the transcription block, in seconds since the connection was opened
    • end
      • The end time of the transcription block, in seconds since the connection was opened
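A frontend typically appends finalized blocks (type 0) and shows only the latest in-progress block (type 1), which gets replaced as newer ones arrive. One possible (hypothetical) way to fold a stream of BackendTranscriptionBlock messages into display text:

```python
def assemble_transcript(blocks: list[dict]) -> str:
    """Join finalized blocks in order, then append only the newest in-progress block."""
    finalized = [b["text"] for b in blocks if b["type"] == 0]
    in_progress = [b["text"] for b in blocks if b["type"] == 1]
    return " ".join(finalized + in_progress[-1:])  # keep only the latest provisional text
```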

WhisperAuthMessage

  • Node Server -> Whisper Service
  • Frontend -> Whisper Service
  • Schema:
    {
      "api_key": "string"
    }
    • api_key is the API key the client presents to whisper service for authentication. It must match the configured API_KEY for whisper service to authenticate successfully.

SelectedOption

  • Frontend -> Node Server -> Whisper Service
  • Frontend -> Whisper Service
  • Schema:
    {
      "model_key": "string",
      "feature_selection": {}
    }
    • model_key
      • The model key the client is selecting. Should match one of the model_keys presented by whisper service in SelectionOptions
    • feature_selection
      • The configuration for the available features the client would like to set. (Currently unused, see issue #11)

Protocols

Authentication

Node Server Kiosk Authentication

  • Node server needs to restrict access for clients sending audio and receiving transcriptions, so that only an authorized device can send audio and students only have access to the transcript for the class they are in. This is accomplished using three tokens: sourceToken, sessionToken, and accessToken.

Relevant Entities

  • Node Server
    • This is a node server instance.
  • Kiosk device
    • This is the kiosk device that is set up in a classroom to record audio and display transcriptions on a display.
    • This device also shows a QR code to users that can be scanned to connect a user's device to node server.
    • Connects to node server to send audio and receive transcriptions
  • User device
    • This is a user's personal device.
    • They can scan the QR displayed by the kiosk in order to connect to node server to receive transcriptions.

Tokens

  • sourceToken
    • This is a secret token configured in .env for node server.
    • Clients with a valid sourceToken are permitted to be an audio source for node server.
    • In addition, clients with a valid sourceToken are permitted to retrieve the current accessToken.
  • sessionToken
    • This is a long-lived token that is randomly generated by node server when a client starts a session.
    • A user must have a valid accessToken in order to start a session and receive a sessionToken.
    • Clients with a valid sessionToken are permitted to receive transcriptions, but not to send audio.
  • accessToken
    • This is a rotating token that is randomly generated by node server.
    • This token is short-lived to limit sharing between users.
    • An accessToken only allows a client to start a session.
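A minimal sketch of how such tokens could be generated, assuming the token byte counts and validity windows from the .env options in Configuring Node Server. This is illustrative, not the node server's actual implementation:

```python
import secrets
import time

def generate_token(num_bytes: int) -> str:
    """Random token from the configured number of random bytes, hex-encoded."""
    return secrets.token_hex(num_bytes)

def new_access_token(access_token_bytes: int, valid_period_sec: int) -> dict:
    """A fresh accessToken plus its expiry (seconds since epoch).

    The node server would regenerate this every ACCESS_TOKEN_REFRESH_INTERVAL_SEC
    so that sharing a stale token with another user is of limited value.
    """
    return {
        "accessToken": generate_token(access_token_bytes),
        "expires": time.time() + valid_period_sec,
    }
```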

Authentication Flow

Authentication flow diagram

  1. Kiosk Device opens websocket connection to Node Server at the /api/sourcesink endpoint.
    • Kiosk Device sends a NodeAuthMessage containing sourceToken
    • Node Server verifies the sourceToken
      • If it is valid
        • The Kiosk Device is now able to send audio to and receive transcriptions from Node Server.
        • Node Server initializes a connection to Whisper Service once a source connects. See Whisper Service Authentication.
      • If it is invalid, Node Server closes the connection
  2. Kiosk Device makes a POST request to Node Server at the /accessToken endpoint.
    • Kiosk Device includes sourceToken in body of request.
  3. Kiosk Device receives currently active accessToken from Node Server.
    • Note: 2) and 3) are repeated in the background so that Kiosk Device always knows the currently active accessToken.
  4. User's Device scans QR code (or copies link) displayed by Kiosk Device.
    • This QR code includes the address to Node Server and the currently active accessToken from 3).
  5. User's Device makes a POST request to Node Server at the /startSession endpoint.
    • User's Device includes accessToken in body of request.
  6. User's Device receives a newly generated sessionToken from Node Server.
  7. User's Device opens websocket connection to Node Server at the /api/sink endpoint.
    • User's Device sends a NodeAuthMessage containing sessionToken.
    • Node Server verifies the sessionToken
      • If it is valid, the User's Device is now able to receive transcriptions from Node Server for the period sessionToken is valid for.
      • If it is invalid, Node Server closes the connection

Whisper Service Authentication

Relevant Entities

  • Node Server or ScribeAR Frontend
    • Both authenticate the same way with Whisper Service

Authentication Flow

  1. Node Server or the frontend sends a WhisperAuthMessage containing an API key to Whisper Service
  2. Whisper Service checks if the API key received matches the configured API_KEY for Whisper Service (See Configuring Whisper Service)
    • If the keys match, Whisper Service moves on to Model Selection
    • If the keys don't match, Whisper Service closes the connection
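When comparing the presented key against the configured one, a constant-time comparison avoids leaking information through response timing. A sketch of such a check (illustrative; the actual whisper service implementation may differ):

```python
import hmac

def api_key_matches(presented: str, configured: str) -> bool:
    """Compare the presented api_key against the configured API_KEY in constant time."""
    # hmac.compare_digest avoids short-circuiting on the first mismatched byte
    return hmac.compare_digest(presented.encode(), configured.encode())
```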

Model Selection

Model selection negotiation occurs between whisper service and the frontend. When node server is used (frontend connected to node server, node server connected to whisper service), node server simply forwards the messages, so it does not participate in the model selection protocol. Note that this negotiation occurs after Whisper Service Authentication.

Relevant Entities

  • ScribeAR Frontend
    • This could be a kiosk device or a user's personal device (a user running ScribeAR directly)
  • Whisper Service
    • This is a whisper service instance

Model Selection Flow

  1. Whisper Service sends a SelectionOptions message to the Frontend containing the available models the Frontend can choose from
  2. Frontend selects one of the models and sends a SelectedOption message to Whisper Service
  3. Whisper Service initializes the selected model. If successful, Whisper Service is ready to receive audio and return transcriptions.
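On the client side, step 2 amounts to picking one entry from the SelectionOptions list and echoing its model_key back in a SelectedOption message. A hypothetical helper:

```python
def build_selected_option(options: list[dict], model_key: str) -> dict:
    """Build a SelectedOption message, verifying the key was actually offered."""
    offered = {option["model_key"] for option in options}
    if model_key not in offered:
        raise ValueError(f"model_key {model_key!r} not in offered models: {sorted(offered)}")
    # feature_selection is currently unused (see issue #11), so send an empty object
    return {"model_key": model_key, "feature_selection": {}}
```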

Transcription

Transcription events occur between whisper service and the frontend. When node server is used (frontend connected to node server, node server connected to whisper service), node server simply forwards the messages, so it does not participate in the transcription protocol. Note that this protocol occurs after Model Selection with whisper service.

Relevant Entities

  • ScribeAR Frontend
    • This could be a kiosk device or a user's personal device (a user running ScribeAR directly)
  • Whisper Service
    • This is a whisper service instance

Transcription Flow

  • Frontend continuously sends AudioChunk messages to Whisper Service
  • When Whisper Service has enough audio and has generated a block of transcription, it sends a BackendTranscriptionBlock message to the Frontend
  • These two happen asynchronously; the Frontend doesn't have to wait for Whisper Service to return a transcription before sending more audio.
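The decoupling between sending audio and receiving transcriptions can be sketched as two independent asyncio tasks. This is a toy model, using in-process queues in place of the websocket and a stand-in for whisper service that simply reports how much audio it received:

```python
import asyncio

SAMPLE_RATE = 16_000  # 16-bit PCM at 16 kHz, as required by AudioChunk

async def send_audio(outgoing: asyncio.Queue, chunks: list[bytes]) -> None:
    """Send AudioChunk messages without waiting for any transcription."""
    for chunk in chunks:
        await outgoing.put(chunk)
    await outgoing.put(None)  # signal end of audio

async def receive_transcriptions(incoming: asyncio.Queue, received: list) -> None:
    """Collect BackendTranscriptionBlock messages as they arrive."""
    while True:
        block = await incoming.get()
        if block is None:
            break
        received.append(block)

async def mock_whisper_service(outgoing: asyncio.Queue, incoming: asyncio.Queue) -> None:
    """Stand-in for whisper service: acknowledges each chunk with a finalized block."""
    elapsed = 0.0
    while True:
        chunk = await outgoing.get()
        if chunk is None:
            await incoming.put(None)
            break
        duration = len(chunk) / 2 / SAMPLE_RATE  # 2 bytes per 16-bit sample
        await incoming.put({"type": 0, "text": f"{duration:.1f}s received",
                            "start": elapsed, "end": elapsed + duration})
        elapsed += duration

async def main() -> list:
    outgoing, incoming = asyncio.Queue(), asyncio.Queue()
    received: list = []
    await asyncio.gather(
        send_audio(outgoing, [b"\x00\x00" * SAMPLE_RATE] * 2),  # two 1-second chunks
        mock_whisper_service(outgoing, incoming),
        receive_transcriptions(incoming, received),
    )
    return received
```

Running asyncio.run(main()) returns one block per chunk; the sender never blocks on the receiver, mirroring the protocol above.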