Remote Inference server

林品仲 edited this page Apr 17, 2025 · 1 revision

Overview

Memory and compute resources are very limited on mobile devices. Remote inference lets users offload LLM inference from their devices to a remote server, reducing the load on the device and enabling larger models that give better responses. It also gives users more control over model choice: managed services such as OpenAI and Gemini for ease of use, or self-hosted models such as Llama, Gemma, and Mistral for enhanced privacy.

Spec

The NoteUltra inference server handles two request formats, question and summary, each served on its own path: your.server.com/question and your.server.com/summary. Both exchange data as JSON.
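A minimal sketch of how the two paths might be dispatched on a self-hosted server. The handler names, the placeholder return values, and the error responses for unknown paths or invalid JSON are assumptions for illustration; only the paths and the response keys come from this page.

```python
import json
from typing import Callable, Dict, Tuple

# Hypothetical placeholder handlers; a real server would call a model here.
def answer_question(body: dict) -> dict:
    return {"result": "placeholder answer"}

def summarize(body: dict) -> dict:
    return {"title": "placeholder", "summary": "placeholder summary"}

# One handler per path described in the spec.
ROUTES: Dict[str, Callable[[dict], dict]] = {
    "/question": answer_question,
    "/summary": summarize,
}

def dispatch(path: str, raw_body: bytes) -> Tuple[int, bytes]:
    """Return (HTTP status, JSON body) for a request to the given path."""
    handler = ROUTES.get(path)
    if handler is None:
        # Error shape is an assumption; the spec does not define one.
        return 404, json.dumps({"error": "unknown path"}).encode()
    try:
        body = json.loads(raw_body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return 400, json.dumps({"error": "invalid JSON"}).encode()
    if not isinstance(body, dict):
        return 400, json.dumps({"error": "expected a JSON object"}).encode()
    return 200, json.dumps(handler(body)).encode()
```

The dispatch function is kept independent of any web framework so it can be wired into whichever HTTP server you prefer.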

/question answers the user's questions based on the day's transcription.

  • context contains the information retrieved from the user's vector database. Retrieval is performed entirely on the user's device to ensure privacy. The amount of context is not specified, but you can set a limit on the server to prevent abuse when hosting an inference node publicly.
  • chatHistory contains the user's chat history, capped at a fixed number of entries: the server receives at most 6.
  • question contains the question the user is currently asking.

Input Format:

{
    "context":["meow","meow","meow"],
    "chatHistory": [
      { "USER": "meow" },
      { "Assistant": "meow" }
    ],
    "question": "meow"
}
  • The output contains a single key, result, which holds the AI's response.

Output Format:

{
    "result": "meow"
}
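The /question contract above could be handled along these lines. This is a sketch under stated assumptions: MAX_CONTEXT_ITEMS is a hypothetical server-side limit (the spec leaves the amount of context open), the prompt layout is illustrative only, and generate stands in for whatever model call your server makes.

```python
from typing import Callable

MAX_CONTEXT_ITEMS = 20   # hypothetical limit chosen by the host, not part of the spec
MAX_HISTORY_ENTRIES = 6  # the spec says at most 6 chat history entries are sent

def build_prompt(payload: dict) -> str:
    """Validate a /question payload and flatten it into a single prompt string."""
    context = payload.get("context", [])
    history = payload.get("chatHistory", [])
    question = payload.get("question")
    if not isinstance(context, list) or not all(isinstance(c, str) for c in context):
        raise ValueError("context must be a list of strings")
    if not isinstance(history, list) or not all(isinstance(e, dict) for e in history):
        raise ValueError("chatHistory must be a list of objects")
    if not isinstance(question, str) or not question.strip():
        raise ValueError("question must be a non-empty string")

    lines = ["Context:"]
    lines.extend(context[:MAX_CONTEXT_ITEMS])  # enforce the server-side limit
    lines.append("Conversation:")
    for entry in history[-MAX_HISTORY_ENTRIES:]:  # keep only the newest entries
        # Each entry is a one-key object such as {"USER": ...} or {"Assistant": ...}
        for role, text in entry.items():
            lines.append(f"{role}: {text}")
    lines.append(f"USER: {question}")
    return "\n".join(lines)

def handle_question(payload: dict, generate: Callable[[str], str]) -> dict:
    """Return the response object with the single key the app expects."""
    return {"result": generate(build_prompt(payload))}
```

Passing generate as a parameter keeps the request handling separate from the model backend, so the same function works with a managed service or a self-hosted model.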

/summary summarizes the AI's response generated by /question so that the user can save it as a note.

  • context field contains a single string generated from the AI's response.

Input Format:

{
    "context": "NoteUltra is an app that transcripts voices in the background. Keep user's life organized without any effort. User can simply ask question like \"What item should I buy\". NoteUltra will retrieve the information from the transcription that contains thousands of words in few seconds. And also put user privacy first. Everything stored locally, and user have 100% control over their data."
}
  • The output contains two keys, title and summary.
    • title is the title of the note. It must be short and concise, ideally under 15 characters.
    • summary is the content of the note. There is no limit on its length, but it should be structured and easy to read. NoteUltra currently does not support markdown, so avoid using it.

Output Format:

{
    "title": "About NoteUltra",
    "summary": "NoteUltra is a privacy-first app that uses transcription and AI to help users stay organized."
}
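Since models often emit markdown and long titles, a server may want to post-process the output before returning it. The sketch below is one possible approach, not part of the spec: the helper names are hypothetical, the markdown stripping is a rough heuristic covering only headings, bold markers, and backticks, and hard-truncating the title at 15 characters is an assumption based on the "ideally under 15 characters" guidance.

```python
import re

IDEAL_TITLE_LENGTH = 15  # assumption derived from the title guidance above

def to_plain_text(text: str) -> str:
    """Strip a few common markdown markers; a heuristic, not a full parser."""
    text = re.sub(r"^\s{0,3}#{1,6}\s+", "", text, flags=re.MULTILINE)  # heading prefixes
    text = re.sub(r"\*\*|`", "", text)  # bold markers and backticks
    return text.strip()

def build_summary_response(title: str, summary: str) -> dict:
    """Return the two-key object expected from /summary."""
    title = to_plain_text(title)
    if len(title) > IDEAL_TITLE_LENGTH:
        # Truncation is a blunt fallback; asking the model for a shorter title is nicer.
        title = title[:IDEAL_TITLE_LENGTH].rstrip()
    return {"title": title, "summary": to_plain_text(summary)}
```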