
Documentation

Bennett Wu edited this page Sep 24, 2025 · 12 revisions

Configuration

Configuring Docker Containers

The node server and whisper service Docker containers support all of the same options as detailed in Configuring Node Server and Configuring Whisper Service with the following exceptions:

  • For both node server and whisper service
    • HOST and PORT are disabled.
      • Docker will expose the service on all hosts the Docker daemon is configured to use. Modify the daemon to change which hosts are used.
      • To select which port to use, publish a port when starting the Docker container:
        docker run -p [YOUR PORT]:80 --env-file .env scribear/node-server:main
        
        docker run -p [YOUR PORT]:80 --env-file .env -v ./device_config.json:/app/device_config.json scribear/whisper-service-cpu:main
        

Configuring Docker Compose Deployment

When deploying with Docker Compose, the node server and whisper service containers are configured the same way as standalone containers (see Configuring Docker Containers) with the following exceptions:

  • WHISPER_SERVICE_ENDPOINT for node server is automatically generated using API_KEY.
  • NODE_PORT is added to automatically select the port Docker exposes node server on.
  • FRONTEND_PORT is added to automatically select the port Docker exposes the frontend on.

Configuring Node Server

Node server is configured using environment variables defined in .env.

Runtime Options

  • NODE_ENV
    • Values: development, production, test (default: production)
    • Indicates the environment the service is running in.
  • LOG_LEVEL
    • Values: error, warn, info, debug, trace, silent (default: info)
    • Sets the verbosity of logging.

Server Options

  • HOST
    • Values: string (default: 127.0.0.1)
    • The socket address the node server will bind to. Use 0.0.0.0 to make the server available to the local network, 127.0.0.1 for localhost only.
  • PORT
    • Values: number (default: 8080)
    • Port number that node server will listen for connections on.
  • CORS_ORIGIN
    • Values: string (default: *)
    • CORS origin configuration for node server.
  • SERVER_ADDRESS
    • Values: string (default: 127.0.0.1:8080)
    • Address the node server is reachable at. Used in the ScribeAR QR code to allow other devices to connect.

Whisper Service Options

  • WHISPER_SERVICE_ENDPOINT
    • Values: string (required, no default)
    • Websocket address for the whisper service endpoint. Should be in the format ws://${ADDRESS}:${PORT}/sourcesink, where ADDRESS is the address or IP of the whisper service and PORT is the port the whisper service is listening on. This should match what the whisper service is configured to use.
  • API_KEY
    • Values: string (required, no default)
    • API key for the whisper service. This should match what the whisper service is configured to use.

Authentication Options

  • REQUIRE_AUTH
    • Values: true, false (default: true)
    • If true, requires authentication to connect to the node server API; otherwise no authentication is used. See Authentication for details.
  • SOURCE_TOKEN
    • Values: string (required if REQUIRE_AUTH=true, no default)
    • The key used by the frontend to connect as an audio source. See Authentication for details.
  • ACCESS_TOKEN_REFRESH_INTERVAL_SEC
    • Values: number (required if REQUIRE_AUTH=true, no default)
    • Number of seconds to wait before generating a new access token. See Authentication for details.
  • ACCESS_TOKEN_BYTES
    • Values: number (required if REQUIRE_AUTH=true, no default)
    • The number of random bytes used to generate access tokens.
  • ACCESS_TOKEN_VALID_PERIOD_SEC
    • Values: number (required if REQUIRE_AUTH=true, no default)
    • Number of seconds a newly generated access token is valid for. See Authentication for details.
  • SESSION_TOKEN_BYTES
    • Values: number (required if REQUIRE_AUTH=true, no default)
    • The number of random bytes used to generate session tokens.
  • SESSION_LENGTH_SEC
    • Values: number (required if REQUIRE_AUTH=true, no default)
    • Number of seconds a newly generated session token is valid for. See Authentication for details.
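Putting the options together, a hypothetical .env for a node server exposed to the local network with authentication enabled might look like the following. Every value is a placeholder for illustration, not a recommendation:

```
# Runtime
NODE_ENV=production
LOG_LEVEL=info

# Server
HOST=0.0.0.0
PORT=8080
SERVER_ADDRESS=192.168.1.50:8080

# Whisper service
WHISPER_SERVICE_ENDPOINT=ws://127.0.0.1:8000/sourcesink
API_KEY=replace-with-your-api-key

# Authentication
REQUIRE_AUTH=true
SOURCE_TOKEN=replace-with-a-long-random-secret
ACCESS_TOKEN_REFRESH_INTERVAL_SEC=60
ACCESS_TOKEN_BYTES=16
ACCESS_TOKEN_VALID_PERIOD_SEC=120
SESSION_TOKEN_BYTES=32
SESSION_LENGTH_SEC=7200
```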

Configuring Whisper Service

Whisper service is configured with two files: .env and device_config.json. .env configures the webserver whisper service uses, while device_config.json configures the models a particular deployment of whisper service will support.

.env Options

  • LOG_LEVEL
    • Values: info, debug, trace (default: info)
    • Sets the verbosity of logging.
  • API_KEY
    • Values: string (default: empty string)
    • The API key that must be passed to whisper service in order to establish a connection. Should match the api_key= URL parameter for node server.
  • HOST
    • Values: string (default: 127.0.0.1)
    • The socket address the whisper service will bind to. Use 0.0.0.0 to make the service available to the local network, 127.0.0.1 for localhost only.
  • PORT
    • Values: number (default: 8000)
    • Port number that whisper service will listen for connections on. Should match the port node server is trying to connect to.

device_config.json

When node server or a user connects to whisper service, they must specify a model_key (see Model Selection). This uniquely identifies a particular model implementation and configuration (see Model Implementations and Configurations) to be run by whisper service. The mapping is configured for each individual device whisper service runs on, to account for varying hardware capabilities. device_config.json defines this mapping from model_key to an implementation and configuration.

device_config.json is a JSON file (https://www.json.org/json-en.html). Think of it as a nested series of key-value pairs.

{
  "model_key": {
    "display_name": "string",
    "description": "string",
    "implementation_id": "string",
    "implementation_configuration": {},
    "available_features": {}
  },
  ...repeated for as many models as desired
}
  • model_key
    • The unique identifier used by node server or ScribeAR to select which model whisper service should run.
    • These can be whatever string you'd like and you can define as many unique model_keys as you'd like.
  • display_name
    • This is a friendly name that the ScribeAR frontend can display to the user.
  • description
    • This is a description that the ScribeAR frontend can display to the user.
  • implementation_id
    • The identifier of the model implementation whisper service should run (See Model Implementations and Configurations).
  • implementation_configuration
    • This is a configuration object that configures a model implementation.
    • This is unique for each implementation.
    • See Model Implementations and Configurations for the configuration options for your selected model implementation.
  • available_features
    • This is an object that defines which features that this model supports.

Example config:

{
  "mock_transcription_duration": {
    "display_name": "Sanity Test",
    "description": "Returns how many seconds of audio was received by whisper service.",
    "implementation_id": "mock_transcription_duration",
    "implementation_configuration": {},
    "available_features": {}
  },
  "faster-whisper:cpu-tiny-en": {
    "display_name": "Faster Whisper - tiny.en",
    "description": "Faster Whisper implementation of Open AI Whisper tiny.en model.",
    "implementation_id": "faster_whisper",
    "implementation_configuration": {
      "model": "tiny.en",
      "device": "cpu",
      "local_agree_dim": 2,
      "min_new_samples": 48000,
      "max_segment_samples": 480000
    },
    "available_features": {}
  }
}
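Because a malformed device_config.json is an easy way to break a deployment, it can help to sanity-check the file before starting whisper service. The sketch below is illustrative and not part of whisper service; the required field names follow the schema described above:

```python
import json

# Required fields for each model entry, per the device_config.json schema
REQUIRED_FIELDS = {
    "display_name": str,
    "description": str,
    "implementation_id": str,
    "implementation_configuration": dict,
    "available_features": dict,
}

def validate_device_config(text: str) -> list[str]:
    """Return a list of human-readable problems found in a device_config.json string."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err}"]
    if not isinstance(config, dict):
        return ["top level must be an object mapping model_key -> model entry"]
    problems = []
    for model_key, entry in config.items():
        if not isinstance(entry, dict):
            problems.append(f"{model_key}: entry must be an object")
            continue
        for field, expected in REQUIRED_FIELDS.items():
            if field not in entry:
                problems.append(f"{model_key}: missing {field}")
            elif not isinstance(entry[field], expected):
                problems.append(f"{model_key}: {field} should be a {expected.__name__}")
    return problems
```

For example, validate_device_config(open("device_config.json").read()) returns an empty list when every entry has all five fields with the right types.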

Model Implementations and Configurations

  • Every model implementation (e.g. faster whisper, whisper.cpp) that whisper service supports is given a unique implementation_id.
  • Each model implementation has unique configuration options to tune its performance, accuracy, and more.
  • A list of implementation_ids and their configuration schemas can be found below.

mock_transcription_duration

  • This is a sanity check model implementation. It returns "transcripts" that say how many seconds of audio was received.
  • Configuration:
    {}
    • This implementation has no configurable options.

faster_whisper

  • This uses Faster Whisper, an implementation of OpenAI's Whisper model using CTranslate2.
  • Configuration:
    {
        "model": "string",
        "device": "string",
        "local_agree_dim": "number",
        "min_new_samples": "number",
        "max_segment_samples": "number"
    }
    • model
      • The CTranslate2 whisper model that Faster Whisper should use.
      • Models are automatically downloaded from HuggingFace: https://huggingface.co/Systran
      • Models can also be loaded from a file path to a CTranslate2 model.
    • device
      • The device to run the model on.
      • Options are: cpu, cuda, auto
    • local_agree_dim
      • The number of times the model needs to agree with itself on the same segment of audio in order for a transcription to be considered "final".
      • Must be an integer at least 1.
    • min_new_samples
      • The minimum number of new audio samples to buffer before rerunning the model.
      • A smaller value means the model is run more frequently, lowering latency but increasing resource usage.
      • Note: audio is currently processed at 16,000 samples per second, so min_new_samples: 16000 corresponds to running the model approximately every second.
      • Must be an integer at least 1.
    • max_segment_samples
      • The maximum number of audio samples a single run of the model is allowed to contain.
      • Note: current whisper models can process a maximum of 30 seconds of audio.
      • Must be an integer at least min_new_samples.
    • Example:
      {
        "model": "tiny.en",
        "device": "cpu",
        "local_agree_dim": 2,
        "min_new_samples": 48000,
        "max_segment_samples": 480000
      }
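Since audio is processed at 16,000 samples per second, the sample-count options translate directly to time. A small helper (hypothetical, not part of whisper service) makes the arithmetic explicit:

```python
SAMPLE_RATE = 16_000  # whisper service currently processes audio at 16 kHz

def samples_to_seconds(samples: int) -> float:
    """How many seconds of audio a sample count represents at 16 kHz."""
    return samples / SAMPLE_RATE

def seconds_to_samples(seconds: float) -> int:
    """Sample count for a duration at 16 kHz."""
    return int(seconds * SAMPLE_RATE)

# The example configuration above reruns the model after every ~3 seconds of
# new audio and caps each run at 30 seconds, the maximum current whisper
# models can process.
assert samples_to_seconds(48_000) == 3.0
assert samples_to_seconds(480_000) == 30.0
```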

Endpoints

Node Server Endpoints

  • GET /healthcheck
    • Simple healthcheck endpoint
    • Used by Docker container as a health probe to determine if container is alive or not
  • POST /accessToken
    • Fetches the currently active accessToken
    • Only clients with a valid sourceToken can access this endpoint
    • Request Body:
      {
        "sourceToken": "string"
      }
      • sourceToken
        • The source token used by node server for authentication. See Authentication
    • Response:
      {
        "accessToken": "string",
        "serverAddress": "string",
        "expires": "string"
      }
      • accessToken
        • The currently active accessToken
      • serverAddress
        • The address the node server is reachable at (see SERVER_ADDRESS)
      • expires
        • The expiration datetime of accessToken given as an ISO date string
  • POST /startSession
    • Fetches a newly generated sessionToken
    • Only clients with a valid accessToken can access this endpoint
    • Request Body:
      {
        "accessToken": "string"
      }
      • accessToken
        • The access token used by node server for authentication. See Authentication
    • Response:
      {
        "sessionToken": "string",
        "expires": "string"
      }
      • sessionToken
        • The newly generated sessionToken
      • expires
        • The expiration datetime of sessionToken given as an ISO date string
  • WS /api/sourcesink
    • Websocket endpoint for sending audio and receiving transcriptions
    • Client must send a NodeAuthMessage containing a valid sourceToken immediately after connecting
  • WS /api/source
    • Websocket endpoint for sending audio
    • Client must send a NodeAuthMessage with valid sourceToken immediately after connecting
  • WS /api/sink
    • Websocket endpoint for receiving transcriptions
    • Client must send a NodeAuthMessage with valid sessionToken or sourceToken immediately after connecting

Whisper Service Endpoints

  • GET /healthcheck
    • Simple healthcheck endpoint
    • Used by Docker container as a health probe to determine if container is alive or not
  • WS /sourcesink
    • Websocket endpoint for sending audio and receiving transcriptions
    • Client must send a WhisperAuthMessage containing a valid api_key immediately after connecting

Message Schemas

Below are the websocket message schemas exchanged between the frontend, node server, and whisper service.

NodeAuthMessage

  • Frontend -> Node Server
  • Schema
    {
      "accessToken": "string / undefined",
      "sessionToken": "string / undefined",
      "sourceToken": "string / undefined"
    }
    • accessToken, sessionToken, sourceToken
      • The different tokens used by node server for authentication. See Authentication
      • Only one of these needs to be defined.
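A client-side helper for building this message could look like the sketch below. It is illustrative only; the field names match the schema above, and it enforces that exactly one token is supplied:

```python
import json

def node_auth_message(access_token=None, session_token=None, source_token=None) -> str:
    """Serialize a NodeAuthMessage with exactly one token field set."""
    fields = {
        "accessToken": access_token,
        "sessionToken": session_token,
        "sourceToken": source_token,
    }
    present = {key: value for key, value in fields.items() if value is not None}
    if len(present) != 1:
        raise ValueError("provide exactly one of accessToken, sessionToken, sourceToken")
    return json.dumps(present)
```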

AudioChunk

  • Frontend -> Node Server -> Whisper Service
  • Frontend -> Whisper Service
  • WAV audio buffers
    • 16-bit PCM (2 bytes per sample)
    • 16,000 samples per second
    • 1 channel
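The required format can be produced with Python's standard library alone. The snippet below builds one second of silence as a WAV buffer in that format (16-bit PCM, 16 kHz, mono); it is a sketch for testing a client, not part of any ScribeAR component:

```python
import io
import wave

SAMPLE_RATE = 16_000

def silent_wav_chunk(seconds: float = 1.0) -> bytes:
    """Build a WAV buffer of silence: 16-bit PCM, 16 kHz, 1 channel."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)           # mono
        wav.setsampwidth(2)           # 16-bit PCM = 2 bytes per sample
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(b"\x00\x00" * int(seconds * SAMPLE_RATE))
    return buffer.getvalue()
```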

SelectionOptions

  • Whisper Service -> Node Server -> Frontend
  • Whisper Service -> Frontend
  • Schema:
    [
      {
        "model_key": "string",
        "display_name": "string",
        "description": "string",
        "available_features": {}
      },
      ...repeated for however many models are supported
    ]

BackendTranscriptionBlock

  • Whisper Service -> Node Server -> Frontend
  • Whisper Service -> Frontend
  • Schema:
    {
      "type": "number",
      "text": "string",
      "start": "number",
      "end": "end"
    }
    • type
      • The type of transcription block this is.
      • 0 for a finalized block.
      • 1 for an in-progress block used to lower latency.
    • text
      • The text itself of the transcription
    • start
      • The start time of the transcription block, in seconds since the connection was opened
    • end
      • The end time of the transcription block, in seconds since the connection was opened
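A frontend typically appends finalized blocks (type 0) and shows only the latest in-progress block (type 1), which gets replaced as newer ones arrive. One possible (hypothetical) way to fold a stream of BackendTranscriptionBlock messages into display text:

```python
def assemble_transcript(blocks: list[dict]) -> str:
    """Join finalized blocks in order, then append only the newest in-progress block."""
    finalized = [b["text"] for b in blocks if b["type"] == 0]
    in_progress = [b["text"] for b in blocks if b["type"] == 1]
    return " ".join(finalized + in_progress[-1:])  # keep only the latest provisional text
```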

WhisperAuthMessage

  • Node Server -> Whisper Service
  • Frontend -> Whisper Service
  • Schema:
    {
      "api_key": "string"
    }
    • api_key is the API key the client presents to whisper service for authentication. It must match the configured API_KEY for whisper service to authenticate successfully.

SelectedOption

  • Frontend -> Node Server -> Whisper Service
  • Frontend -> Whisper Service
  • Schema:
    {
      "model_key": "string",
      "feature_selection": {}
    }
    • model_key
      • The model key the client is selecting. Should match one of the model_keys presented by whisper service in SelectionOptions
    • feature_selection
      • The configuration for the available features the client would like to set. (Currently unused, see issue #11)

Protocols

Authentication

Node Server Kiosk Authentication

  • Node server needs to restrict access for clients sending audio and receiving transcriptions, so that only an authorized device can send audio and students only have access to the transcript for the class they are in. This is accomplished using three tokens: sourceToken, sessionToken, and accessToken.

Relevant Entities

  • Node Server
    • This is a node server instance.
  • Kiosk device
    • This is the kiosk device that is set up in a classroom to record audio and display transcriptions on a display.
    • This device also shows a QR code to users that can be scanned to connect a user's device to node server.
    • Connects to node server to send audio and receive transcriptions
  • User device
    • This is a user's personal device.
    • They can scan the QR displayed by the kiosk in order to connect to node server to receive transcriptions.

Tokens

  • sourceToken
    • This is a secret token configured in .env for node server.
    • Clients with a valid sourceToken are permitted to be an audio source for node server.
    • In addition, clients with a valid sourceToken are permitted to retrieve the current accessToken.
  • sessionToken
    • This is a long-lived token that is randomly generated by node server when a client starts a session.
    • A user must have a valid accessToken in order to start a session and receive a sessionToken.
    • Clients with a valid sessionToken are permitted to receive transcriptions, but not to send audio.
  • accessToken
    • This is a rotating token that is randomly generated by node server.
    • This token is short-lived to limit sharing between users.
    • An accessToken only allows a client to start a session.
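A minimal sketch of how such tokens could be generated, assuming the token byte counts and validity windows from the .env options in Configuring Node Server. This is illustrative, not the node server's actual implementation:

```python
import secrets
import time

def generate_token(num_bytes: int) -> str:
    """Random token from the configured number of random bytes, hex-encoded."""
    return secrets.token_hex(num_bytes)

def new_access_token(access_token_bytes: int, valid_period_sec: int) -> dict:
    """A fresh accessToken plus its expiry (seconds since epoch).

    The node server would regenerate this every ACCESS_TOKEN_REFRESH_INTERVAL_SEC
    so that sharing a stale token with another user is of limited value.
    """
    return {
        "accessToken": generate_token(access_token_bytes),
        "expires": time.time() + valid_period_sec,
    }
```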

Authentication Flow

Authentication flow diagram

  1. Kiosk Device opens websocket connection to Node Server at the /api/sourcesink endpoint.
    • Kiosk Device sends a NodeAuthMessage containing sourceToken
    • Node Server verifies the sourceToken
      • If it is valid
        • The Kiosk Device is now able to send audio to and receive transcriptions from Node Server.
        • Node Server initializes a connection to Whisper Service once a source connects. See Whisper Service Authentication.
      • If it is invalid, Node Server closes the connection
  2. Kiosk Device makes a POST request to Node Server at the /accessToken endpoint.
    • Kiosk Device includes sourceToken in body of request.
  3. Kiosk Device receives currently active accessToken from Node Server.
    • Note: 2) and 3) are repeated in the background so that Kiosk Device always knows the currently active accessToken.
  4. User's Device scans QR code (or copies link) displayed by Kiosk Device.
    • This QR code includes the address to Node Server and the currently active accessToken from 3).
  5. User's Device makes a POST request to Node Server at the /startSession endpoint.
    • User's Device includes accessToken in body of request.
  6. User's Device receives a newly generated sessionToken from Node Server.
  7. User's Device opens websocket connection to Node Server at the /api/sink endpoint.
    • User's Device sends a NodeAuthMessage containing sessionToken.
    • Node Server verifies the sessionToken
      • If it is valid, the User's Device is now able to receive transcriptions from Node Server for the period sessionToken is valid for.
      • If it is invalid, Node Server closes the connection

Whisper Service Authentication

Relevant Entities

  • Node Server or ScribeAR Frontend
    • Both authenticate the same way with Whisper Service

Authentication Flow

  1. Node Server or the frontend sends a WhisperAuthMessage containing an API key to Whisper Service
  2. Whisper Service checks if the API key received matches the configured API_KEY for Whisper Service (See Configuring Whisper Service)
    • If the keys match, Whisper Service moves on to Model Selection
    • If the keys don't match, Whisper Service closes the connection
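When comparing the presented key against the configured one, a constant-time comparison avoids leaking information through response timing. A sketch of such a check (illustrative; the actual whisper service implementation may differ):

```python
import hmac

def api_key_matches(presented: str, configured: str) -> bool:
    """Compare the presented api_key against the configured API_KEY in constant time."""
    # hmac.compare_digest avoids short-circuiting on the first mismatched byte
    return hmac.compare_digest(presented.encode(), configured.encode())
```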

Model Selection

Model selection negotiation occurs between whisper service and the frontend. When node server is used (frontend connected to node server, node server connected to whisper service), node server simply forwards the messages, so it does not participate in the model selection protocol. Note that this negotiation occurs after Whisper Service Authentication.

Relevant Entities

  • ScribeAR Frontend
    • This could be a kiosk device or a user's personal device (a user running ScribeAR directly)
  • Whisper Service
    • This is a whisper service instance

Model Selection Flow

  1. Whisper Service sends a SelectionOptions message to the Frontend containing the available models the Frontend can choose from
  2. Frontend selects one of the models and sends a SelectedOption message to Whisper Service
  3. Whisper Service initializes the selected model. If successful, Whisper Service is ready to receive audio and return transcriptions.
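On the client side, step 2 amounts to picking one entry from the SelectionOptions list and echoing its model_key back in a SelectedOption message. A hypothetical helper:

```python
def build_selected_option(options: list[dict], model_key: str) -> dict:
    """Build a SelectedOption message, verifying the key was actually offered."""
    offered = {option["model_key"] for option in options}
    if model_key not in offered:
        raise ValueError(f"model_key {model_key!r} not in offered models: {sorted(offered)}")
    # feature_selection is currently unused (see issue #11), so send an empty object
    return {"model_key": model_key, "feature_selection": {}}
```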

Transcription

Transcription events occur between whisper service and the frontend. When node server is used (frontend connected to node server, node server connected to whisper service), node server simply forwards the messages, so it does not participate in the transcription protocol. Note that this protocol occurs after Model Selection with whisper service.

Relevant Entities

  • ScribeAR Frontend
    • This could be a kiosk device or a user's personal device (a user running ScribeAR directly)
  • Whisper Service
    • This is a whisper service instance

Transcription Flow

  • Frontend continuously sends AudioChunk messages to Whisper Service
  • When Whisper Service has enough audio and has generated a block of transcription, it sends a BackendTranscriptionBlock message to the Frontend
  • These two happen asynchronously; the Frontend doesn't have to wait for Whisper Service to return a transcription before sending more audio.
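The decoupling between sending audio and receiving transcriptions can be sketched as two independent asyncio tasks. This is a toy model, using in-process queues in place of the websocket and a stand-in for whisper service that simply reports how much audio it received:

```python
import asyncio

SAMPLE_RATE = 16_000  # 16-bit PCM at 16 kHz, as required by AudioChunk

async def send_audio(outgoing: asyncio.Queue, chunks: list[bytes]) -> None:
    """Send AudioChunk messages without waiting for any transcription."""
    for chunk in chunks:
        await outgoing.put(chunk)
    await outgoing.put(None)  # signal end of audio

async def receive_transcriptions(incoming: asyncio.Queue, received: list) -> None:
    """Collect BackendTranscriptionBlock messages as they arrive."""
    while True:
        block = await incoming.get()
        if block is None:
            break
        received.append(block)

async def mock_whisper_service(outgoing: asyncio.Queue, incoming: asyncio.Queue) -> None:
    """Stand-in for whisper service: acknowledges each chunk with a finalized block."""
    elapsed = 0.0
    while True:
        chunk = await outgoing.get()
        if chunk is None:
            await incoming.put(None)
            break
        duration = len(chunk) / 2 / SAMPLE_RATE  # 2 bytes per 16-bit sample
        await incoming.put({"type": 0, "text": f"{duration:.1f}s received",
                            "start": elapsed, "end": elapsed + duration})
        elapsed += duration

async def main() -> list:
    outgoing, incoming = asyncio.Queue(), asyncio.Queue()
    received: list = []
    await asyncio.gather(
        send_audio(outgoing, [b"\x00\x00" * SAMPLE_RATE] * 2),  # two 1-second chunks
        mock_whisper_service(outgoing, incoming),
        receive_transcriptions(incoming, received),
    )
    return received
```

Running asyncio.run(main()) returns one block per chunk; the sender never blocks on the receiver, mirroring the protocol above.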