# Documentation
The node server and whisper service Docker containers support all of the same options as detailed in Configuring Node Server and Configuring Whisper Service with the following exceptions:
- For both node server and whisper service:
  - `HOST` and `PORT` are disabled. Docker will expose the service on all hosts the Docker daemon is configured to use; modify the daemon to change which hosts are used.
  - To select which port to use, publish a port when starting the Docker container:

```shell
docker run -p [YOUR PORT]:80 --env-file .env scribear/node-server:main
docker run -p [YOUR PORT]:80 --env-file .env -v ./device_config.json:/app/device_config.json scribear/whisper-service-cpu:main
```
When deploying with Docker Compose, the node server and whisper service containers are configured the same way the containers themselves are configured (see Configuring Docker Containers) with the following exceptions:

- `WHISPER_SERVICE_ENDPOINT` for node server is automatically generated using `API_KEY`.
- `NODE_PORT` is added to automatically select the port Docker exposes node server on.
- `FRONTEND_PORT` is added to automatically select the port Docker exposes the frontend on.
## Configuring Node Server

Node server is configured using environment variables defined in `.env`.
| Option | Values | Default | Description |
|---|---|---|---|
| **Runtime Options** | | | |
| `NODE_ENV` | `development`, `production`, `test` | `production` | Indicates the environment the service is running in. |
| `LOG_LEVEL` | `error`, `warn`, `info`, `debug`, `trace`, `silent` | `info` | Sets the verbosity of logging. |
| **Server Options** | | | |
| `HOST` | string | `127.0.0.1` | The socket the node server will bind to. Use `0.0.0.0` to make it available to the local network, `127.0.0.1` for localhost only. |
| `PORT` | number | `8080` | Port number that node server will listen for connections on. |
| `CORS_ORIGIN` | string | `*` | CORS origin configuration for node server. |
| `SERVER_ADDRESS` | string | `127.0.0.1:8080` | Address the node server is reachable at. Used for the ScribeAR QR code to allow other devices to connect. |
| **Whisper Service Options** | | | |
| `WHISPER_SERVICE_ENDPOINT` | string | Required, no default value | Websocket address of the whisper service endpoint, in the format `ws://${ADDRESS}:${PORT}/sourcesink`, where `ADDRESS` is the address or IP of the whisper service and `PORT` is the port the whisper service is listening on. This should match what the whisper service is configured to use. |
| `API_KEY` | string | Required, no default value | API key for the whisper service. This should match what the whisper service is configured to use. |
| **Authentication Options** | | | |
| `REQUIRE_AUTH` | `true`, `false` | `true` | If true, requires authentication to connect to the node server API; otherwise no authentication is used. See Authentication for details. |
| `SOURCE_TOKEN` | string | Required if `REQUIRE_AUTH=true`, no default value | The key used by the frontend to connect as an audio source. See Authentication for details. |
| `ACCESS_TOKEN_REFRESH_INTERVAL_SEC` | number | Required if `REQUIRE_AUTH=true`, no default value | Number of seconds to wait before generating a new access token. See Authentication for details. |
| `ACCESS_TOKEN_BYTES` | number | Required if `REQUIRE_AUTH=true`, no default value | The number of random bytes used to generate access tokens. |
| `ACCESS_TOKEN_VALID_PERIOD_SEC` | number | Required if `REQUIRE_AUTH=true`, no default value | Number of seconds a newly generated access token is valid for. See Authentication for details. |
| `SESSION_TOKEN_BYTES` | number | Required if `REQUIRE_AUTH=true`, no default value | The number of random bytes used to generate session tokens. |
| `SESSION_LENGTH_SEC` | number | Required if `REQUIRE_AUTH=true`, no default value | Number of seconds a newly generated session token is valid for. See Authentication for details. |
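Putting these options together, a minimal `.env` for node server might look like the following. All values below are illustrative examples, not recommendations:

```
NODE_ENV=production
LOG_LEVEL=info
HOST=0.0.0.0
PORT=8080
CORS_ORIGIN=*
SERVER_ADDRESS=192.168.1.50:8080
WHISPER_SERVICE_ENDPOINT=ws://127.0.0.1:8000/sourcesink
API_KEY=change-me
REQUIRE_AUTH=true
SOURCE_TOKEN=change-me-too
ACCESS_TOKEN_REFRESH_INTERVAL_SEC=60
ACCESS_TOKEN_BYTES=32
ACCESS_TOKEN_VALID_PERIOD_SEC=300
SESSION_TOKEN_BYTES=32
SESSION_LENGTH_SEC=7200
```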
## Configuring Whisper Service

There are two files that configure whisper service: `.env` and `device_config.json`. `.env` configures the web server whisper service uses, while `device_config.json` configures the models a particular deployment of whisper service will support.
| Option | Values | Default | Description |
|---|---|---|---|
| `LOG_LEVEL` | `info`, `debug`, `trace` | `info` | Sets the verbosity of logging. |
| `API_KEY` | string | `<empty string>` | The API key that must be passed to whisper service in order to establish a connection. Should match the `api_key=` URL parameter for node server. |
| `HOST` | string | `127.0.0.1` | The socket the whisper service will bind to. Use `0.0.0.0` to make it available to the local network, `127.0.0.1` for localhost only. |
| `PORT` | number | `8000` | Port number that whisper service will listen for connections on. Should match the port node server is trying to connect to. |
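A minimal `.env` for whisper service might look like the following (values are illustrative; `API_KEY` must match the `API_KEY` node server is configured with):

```
LOG_LEVEL=info
API_KEY=change-me
HOST=0.0.0.0
PORT=8000
```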
When node server or a user connects to whisper service, they must specify a model_key (see Model Selection). This uniquely identifies a particular model implementation and configuration (see Model Implementations and Configurations) to be run by whisper service. This is configured for each individual device whisper service runs on to account for varying hardware capabilities. device_config.json configures this mapping from model_key to an implementation and configuration.
device_config.json is a JSON file (https://www.json.org/json-en.html). Think of it as a nested series of key-value pairs.
```json
{
  "model_key": {
    "display_name": "string",
    "description": "string",
    "implementation_id": "string",
    "implementation_configuration": {},
    "available_features": {}
  },
  ...create as many as desired
}
```

- `model_key` - The unique identifier used by node server or ScribeAR to select which model whisper service should run. These can be whatever string you'd like, and you can define as many unique model_keys as you'd like.
- `display_name` - A friendly name that the ScribeAR frontend can display to the user.
- `description` - A description that the ScribeAR frontend can display to the user.
- `implementation_id` - A unique identifier for the model implementation. See Model Implementations and Configurations for a list of available implementations.
- `implementation_configuration` - A configuration object that configures a model implementation. This is unique for each implementation. See Model Implementations and Configurations for the configuration options for your selected model implementation.
- `available_features` - An object that defines which features this model supports.
Example config:

```json
{
  "mock_transcription_duration": {
    "display_name": "Sanity Test",
    "description": "Returns how many seconds of audio was received by whisper service.",
    "implementation_id": "mock_transcription_duration",
    "implementation_configuration": {},
    "available_features": {}
  },
  "faster-whisper:cpu-tiny-en": {
    "display_name": "Faster Whisper - tiny.en",
    "description": "Faster Whisper implementation of Open AI Whisper tiny.en model.",
    "implementation_id": "faster_whisper",
    "implementation_configuration": {
      "model": "tiny.en",
      "device": "cpu",
      "local_agree_dim": 2,
      "min_new_samples": 48000,
      "max_segment_samples": 480000
    },
    "available_features": {}
  }
}
```

- Every model implementation (e.g. Faster Whisper, whisper.cpp) that whisper service supports is given a unique `implementation_id`.
- Each model implementation has unique configuration options to tune its performance, accuracy, and more.
- A list of `implementation_id`s and their configuration schemas can be found below.
### `mock_transcription_duration`

- A sanity-check model implementation. It returns "transcripts" stating how many seconds of audio have been received.
- Configuration: `{}` (this implementation has no configurable options).
### `faster_whisper`

- Uses Faster Whisper, an implementation of OpenAI's Whisper model built on CTranslate2.
- Configuration:

```json
{
  "model": "string",
  "device": "string",
  "local_agree_dim": "number",
  "min_new_samples": "number",
  "max_segment_samples": "number"
}
```

- `model` - The CTranslate2 whisper model that Faster Whisper should use.
  - Models are automatically downloaded from HuggingFace: https://huggingface.co/Systran
  - Models can also be loaded from a file path to a CTranslate2 model.
- `device` - The device to run the model on.
  - Options are: `cpu`, `cuda`, `auto`.
- `local_agree_dim` - The number of times the model needs to agree with itself on the same segment of audio for a transcription to be considered "final".
  - Must be an integer of at least `1`.
- `min_new_samples` - The minimum number of new audio samples to buffer before rerunning the model.
  - A smaller value means the model runs more frequently, lowering latency but increasing resource usage.
  - Note: audio is currently processed at 16,000 samples per second, so `min_new_samples: 16000` corresponds to running the model approximately every second.
  - Must be an integer of at least 1.
- `max_segment_samples` - The maximum number of audio samples a single run of the model is allowed to contain.
  - Note: current whisper models can process a maximum of 30 seconds of audio.
  - Must be an integer of at least `min_new_samples`.
- Example:

```json
{
  "model": "tiny.en",
  "device": "cpu",
  "local_agree_dim": 2,
  "min_new_samples": 48000,
  "max_segment_samples": 480000
}
```
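The constraints above (`local_agree_dim >= 1`, `min_new_samples >= 1`, `max_segment_samples >= min_new_samples`) can be checked before deploying. The sketch below is an illustrative validator, not part of whisper service; the field names come from the docs, but the validation logic itself is ours:

```python
import json

# Keys every device_config.json entry must have, per the schema above.
REQUIRED_KEYS = {
    "display_name", "description", "implementation_id",
    "implementation_configuration", "available_features",
}

def validate_device_config(config: dict) -> list[str]:
    """Return a list of human-readable problems (empty list means OK)."""
    problems = []
    for model_key, entry in config.items():
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"{model_key}: missing keys {sorted(missing)}")
            continue
        if entry["implementation_id"] == "faster_whisper":
            cfg = entry["implementation_configuration"]
            if cfg.get("local_agree_dim", 0) < 1:
                problems.append(f"{model_key}: local_agree_dim must be >= 1")
            if cfg.get("min_new_samples", 0) < 1:
                problems.append(f"{model_key}: min_new_samples must be >= 1")
            if cfg.get("max_segment_samples", 0) < cfg.get("min_new_samples", 0):
                problems.append(f"{model_key}: max_segment_samples must be >= min_new_samples")
    return problems

# The faster_whisper example from above, embedded for illustration.
example = json.loads("""
{
  "faster-whisper:cpu-tiny-en": {
    "display_name": "Faster Whisper - tiny.en",
    "description": "Faster Whisper implementation of Open AI Whisper tiny.en model.",
    "implementation_id": "faster_whisper",
    "implementation_configuration": {
      "model": "tiny.en",
      "device": "cpu",
      "local_agree_dim": 2,
      "min_new_samples": 48000,
      "max_segment_samples": 480000
    },
    "available_features": {}
  }
}
""")
```

In practice you would `json.load` the real `device_config.json` instead of the embedded string.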
### Node Server Endpoints

- `GET /healthcheck` - Simple healthcheck endpoint.
  - Used by the Docker container as a health probe to determine whether the container is alive.
- `POST /accessToken` - Fetches the currently active `accessToken`.
  - Only clients with a valid `sourceToken` can access this endpoint.
  - Request body:
    ```json
    { "sourceToken": "string" }
    ```
    - `sourceToken` - The source token used by node server for authentication. See Authentication.
  - Response:
    ```json
    { "accessToken": "string", "serverAddress": "string", "expires": "string" }
    ```
    - `accessToken` - The currently active `accessToken`.
    - `serverAddress` - The configured node server address (see the `SERVER_ADDRESS` option in Configuring Node Server).
    - `expires` - The expiration datetime of `accessToken`, given as an ISO date string.
- `POST /startSession` - Fetches a newly generated `sessionToken`.
  - Only clients with a valid `accessToken` can access this endpoint.
  - Request body:
    ```json
    { "accessToken": "string" }
    ```
    - `accessToken` - The access token used by node server for authentication. See Authentication.
  - Response:
    ```json
    { "sessionToken": "string", "expires": "string" }
    ```
    - `sessionToken` - The newly generated `sessionToken`.
    - `expires` - The expiration datetime of `sessionToken`, given as an ISO date string.
- `WS /api/sourcesink` - Websocket endpoint for sending audio and receiving transcriptions.
  - Client must send a NodeAuthMessage containing a `sourceToken` immediately after connecting.
- `WS /api/source` - Websocket endpoint for sending audio.
  - Client must send a NodeAuthMessage with a valid `sourceToken` immediately after connecting.
- `WS /api/sink` - Websocket endpoint for receiving transcriptions.
  - Client must send a NodeAuthMessage with a valid `sessionToken` or `sourceToken` immediately after connecting.
### Whisper Service Endpoints

- `GET /healthcheck` - Simple healthcheck endpoint.
  - Used by the Docker container as a health probe to determine whether the container is alive.
- `WS /sourcesink` - Websocket endpoint for sending audio and receiving transcriptions.
  - Client must send a JSON message containing an `api_key` immediately after connecting.
Below are the websocket messages that are sent between the frontend, node server, and whisper service.
### NodeAuthMessage

- Frontend -> Node Server
- Schema:
  ```json
  { "accessToken": "string / undefined", "sessionToken": "string / undefined", "sourceToken": "string / undefined" }
  ```
  - `accessToken`, `sessionToken`, `sourceToken` - The different tokens used by node server for authentication. See Authentication.
  - Only one of these needs to be defined.
### AudioChunk

- Frontend -> Node Server -> Whisper Service
- Frontend -> Whisper Service
- WAV audio buffers:
  - PCM 16 (16-bit samples)
  - 16k sample rate
  - 1 channel
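Given this format (16-bit samples, 16 kHz, mono), the size of an audio payload is easy to reason about. A quick sketch (the helper below is illustrative, not part of ScribeAR; note a canonical WAV header adds 44 bytes on top of the raw PCM if included):

```python
# PCM 16-bit, 16 kHz, mono: each sample is 2 bytes.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # 16-bit PCM
CHANNELS = 1

def chunk_size_bytes(duration_sec: float) -> int:
    """Bytes of raw PCM payload for a chunk of the given duration."""
    return int(duration_sec * SAMPLE_RATE_HZ) * BYTES_PER_SAMPLE * CHANNELS
```

For example, one second of audio is 32,000 bytes of PCM payload, and the `min_new_samples: 48000` from the faster_whisper example corresponds to 3 seconds of audio.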
### SelectionOptions

- Whisper Service -> Node Server -> Frontend
- Whisper Service -> Frontend
- Schema:
  ```json
  [
    { "model_key": "string", "display_name": "string", "description": "string", "available_features": {} },
    ...repeated for however many models are supported
  ]
  ```
  - `model_key` - The model key configured in device_config.json.
  - `display_name` - The display name configured in device_config.json.
  - `description` - The description configured in device_config.json.
  - `available_features` - The available features configured in device_config.json.
### BackendTranscriptionBlock

- Whisper Service -> Node Server -> Frontend
- Whisper Service -> Frontend
- Schema:
  ```json
  { "type": "number", "text": "string", "start": "number", "end": "number" }
  ```
  - `type` - The type of transcription block:
    - `0` for a finalized block.
    - `1` for an in-progress block used to lower latency.
  - `text` - The text of the transcription.
  - `start` - The start time of the block, in seconds since the connection was opened.
  - `end` - The end time of the block, in seconds since the connection was opened.
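One plausible way for a client to fold a stream of these blocks into display text is to keep finalized (`type: 0`) blocks permanently and let each in-progress (`type: 1`) block overwrite the previous tentative tail. This merging policy is an assumption for illustration; the protocol itself does not mandate how clients render blocks:

```python
def fold_blocks(blocks: list[dict]) -> str:
    """Fold a stream of BackendTranscriptionBlock messages into display text."""
    finalized = []    # type 0 blocks, kept permanently
    in_progress = ""  # latest type 1 block, overwritten each time
    for block in blocks:
        if block["type"] == 0:
            finalized.append(block["text"])
            in_progress = ""  # a finalized block supersedes the tentative tail
        else:
            in_progress = block["text"]
    return " ".join(finalized + ([in_progress] if in_progress else []))

# Example stream: two in-progress revisions, a finalized block, a new tentative block.
stream = [
    {"type": 1, "text": "hello", "start": 0.0, "end": 0.8},
    {"type": 1, "text": "hello world", "start": 0.0, "end": 1.5},
    {"type": 0, "text": "hello world", "start": 0.0, "end": 1.6},
    {"type": 1, "text": "this", "start": 1.6, "end": 2.0},
]
```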
### WhisperAuthMessage

- Node Server -> Whisper Service
- Frontend -> Whisper Service
- Schema:
  ```json
  { "api_key": "string" }
  ```
  - `api_key` - The API key the client presents to whisper service for authentication. Must match the configured `API_KEY` for whisper service to authenticate successfully.
### SelectedOption

- Frontend -> Node Server -> Whisper Service
- Frontend -> Whisper Service
- Schema:
  ```json
  { "model_key": "string", "feature_selection": {} }
  ```
  - `model_key` - The model key the client is selecting. Should match one of the keys presented by whisper service in SelectionOptions.
  - `feature_selection` - The configuration for the available features the client would like to set. (Currently unused; see issue #11.)
## Authentication

Node server needs to restrict access for clients sending audio and receiving transcriptions, so that only an authorized device can send audio and so that students only have access to the transcript for the class they are in. This is accomplished using three tokens: `sourceToken`, `sessionToken`, and `accessToken`.
### Relevant Entities

- Node Server
  - A node server instance.
- Kiosk Device
  - The kiosk device set up in a classroom to record audio and display transcriptions on a display.
  - This device also shows a QR code that users can scan to connect their own device to node server.
  - Connects to node server to send audio and receive transcriptions.
- User Device
  - A user's personal device.
  - Users can scan the QR code displayed by the kiosk to connect to node server and receive transcriptions.
### Tokens

- `sourceToken`
  - A secret token configured in `.env` for node server.
  - Clients with a valid `sourceToken` are permitted to be an audio source for node server.
  - In addition, clients with a valid `sourceToken` are permitted to retrieve the current `accessToken`.
- `sessionToken`
  - A long-lived token that is randomly generated by node server when a client starts a session.
  - A user must have a valid `accessToken` in order to start a session and receive a `sessionToken`.
  - Clients with a valid `sessionToken` are permitted to receive transcriptions, but not to send audio.
- `accessToken`
  - A rotating token that is randomly generated by node server.
  - This token is short-lived to limit sharing between users.
  - An `accessToken` only allows a client to start a session.
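The `ACCESS_TOKEN_BYTES` and `SESSION_TOKEN_BYTES` options control how many random bytes go into each generated token. A sketch of how such tokens could be generated; the exact encoding node server uses is not specified here, so hex is an assumption, and the byte counts are illustrative:

```python
import secrets

ACCESS_TOKEN_BYTES = 32   # illustrative values for the .env options
SESSION_TOKEN_BYTES = 32

def make_token(num_bytes: int) -> str:
    """Generate a token from num_bytes of cryptographically secure randomness."""
    return secrets.token_hex(num_bytes)  # hex encoding: 2 chars per byte

access_token = make_token(ACCESS_TOKEN_BYTES)
session_token = make_token(SESSION_TOKEN_BYTES)
```

More random bytes make tokens harder to guess at the cost of longer strings (for example, in QR codes).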
### Authentication Flow


1. Kiosk Device opens a websocket connection to Node Server at the `/api/sourcesink` endpoint.
   - Kiosk Device sends a NodeAuthMessage containing `sourceToken`.
   - Node Server verifies the `sourceToken`:
     - If it is valid:
       - The Kiosk Device is now able to send audio to and receive transcriptions from Node Server.
       - Node Server initializes a connection to Whisper Service once a source connects. See Whisper Service Authentication.
     - If it is invalid, Node Server closes the connection.
2. Kiosk Device makes a `POST` request to Node Server at the `/api/accessToken` endpoint.
   - Kiosk Device includes `sourceToken` in the body of the request.
3. Kiosk Device receives the currently active `accessToken` from Node Server.
   - Note: steps 2 and 3 are repeated in the background so that Kiosk Device always knows the currently active `accessToken`.
4. User's Device scans the QR code (or copies the link) displayed by Kiosk Device.
   - The QR code includes the address of Node Server and the currently active `accessToken` from step 3.
5. User's Device makes a `POST` request to Node Server at the `/api/startSession` endpoint.
   - User's Device includes `accessToken` in the body of the request.
6. User's Device receives a newly generated `sessionToken` from Node Server.
7. User's Device opens a websocket connection to Node Server at the `/api/sink` endpoint.
   - User's Device sends a NodeAuthMessage containing `sessionToken`.
   - Node Server verifies the `sessionToken`:
     - If it is valid, the User's Device is now able to receive transcriptions from Node Server for the period the `sessionToken` is valid for.
     - If it is invalid, Node Server closes the connection.
## Whisper Service Authentication

### Relevant Entities

- Node Server or ScribeAR Frontend
  - Both authenticate the same way with Whisper Service.

### Authentication Flow

1. Node Server or the frontend sends a WhisperAuthMessage containing an API key to Whisper Service.
2. Whisper Service checks whether the received API key matches the configured `API_KEY` for Whisper Service (see Configuring Whisper Service).
   - If the keys match, Whisper Service moves on to Model Selection.
   - If the keys don't match, Whisper Service closes the connection.
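The docs only say that the received key must match the configured `API_KEY`. A sketch of how such a check is commonly implemented; constant-time comparison avoids leaking key contents through response timing, though whether whisper service does this is an assumption:

```python
import hmac

def api_key_matches(received: str, configured: str) -> bool:
    """Compare API keys in constant time to avoid timing side channels."""
    return hmac.compare_digest(received.encode(), configured.encode())
```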
## Model Selection

Model selection negotiation occurs between whisper service and the frontend. When node server is used (frontend connected to node server, node server connected to whisper service), node server simply forwards the messages, so it is not relevant to the model selection protocol. Note that this negotiation occurs after Whisper Service Authentication.
### Relevant Entities

- ScribeAR Frontend
  - This could be a kiosk device or a user's device (someone using ScribeAR for themselves).
- Whisper Service
  - A whisper service instance.

### Model Selection Flow

1. Whisper Service sends a SelectionOptions message to the Frontend containing the available models the Frontend can choose from.
2. The Frontend selects one of the models and sends a SelectedOption message to Whisper Service.
3. Whisper Service initializes the selected model. If successful, Whisper Service is ready to receive audio and return transcriptions.
## Transcription

Transcription events occur between whisper service and the frontend. When node server is used (frontend connected to node server, node server connected to whisper service), node server simply forwards the messages, so it is not relevant to the transcription protocol. Note that this protocol occurs after Model Selection with whisper service.
### Relevant Entities

- ScribeAR Frontend
  - This could be a kiosk device or a user's device (someone using ScribeAR for themselves).
- Whisper Service
  - A whisper service instance.

### Transcription Flow

1. The Frontend continuously sends AudioChunk messages to Whisper Service.
2. When Whisper Service has buffered enough audio and generated a block of transcription, it sends a BackendTranscriptionBlock message to the Frontend.
3. These two steps happen asynchronously; the Frontend doesn't have to wait for Whisper Service to return a transcription before sending more audio.