> [!IMPORTANT]
> Please read the DISCLAIMER first.
This project transforms the XIPUAI web service into an OpenAI-compatible API, enabling API calls that support knowledge bases, context, web search, temperature settings, and more.
> [!TIP]
> Click the video link for a preview: https://drive.google.com/file/d/1-OGXcUYPfYZpAO9FbemmOAfxXSMPdFfM/view?usp=sharing
I was inspired by a project that reverse-engineered Google AI Studio into an OpenAI-formatted API service, allowing free use of the popular Gemini 2.5 Pro model (CJackHwang/AIstudioProxyAPI). Seeing expensive models such as Claude 3 Opus and Claude 3.7 Sonnet available on the XJTLU AI platform (playfully nicknamed Xipu AI), I had the same idea.
- An XJTLU student or staff account.
- See libraries for a list of dependencies.
- Developed on Windows 11.
- The `auth` module requires the Google Chrome browser to be installed.
- A chat client that supports OpenAI-compatible APIs, such as Cherry Studio or DeepChat.
- Download the environment.yml file to an empty local folder.
- Open a command prompt or terminal in that folder.
- Create the conda environment from the file: `conda env create -f environment.yml`
- Double-click `run.bat`.
- Activate the environment: `conda activate genai_project`
- In the same terminal window, run the configuration script: `python config.py`. Enter your XJTLU username and password. The script will generate a `.env` file to store your credentials, so you won't have to enter them again.
> [!WARNING]
> Protect your privacy and never share the `.env` file with others!
- Run the authentication script to get a token: `python auth.py`. No user interaction is required during this step; just wait for the terminal to confirm completion. (This requires Google Chrome to be installed.)
- Start the adapter service: `uvicorn adapter:app --reload`. This will also create a `log` folder in the project directory for storing logs.
- Connect your desktop client: create a new provider and select the 'OpenAI Compatible' type. The API key can be any random string of letters. Set the base URL to `http://127.0.0.1:8000/v1/chat/completions` (`http://host.docker.internal:8000/v1` for Dify in Docker). Fetch the available models. The model list is hardcoded in the program and cannot be fetched in real time; directly entering a valid model ID will pass it through to the web service.
- Start a new chat, select a model, and begin your conversation.
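If you want to verify the adapter before wiring up a desktop client, a minimal smoke test from Python looks roughly like this (the model ID is a placeholder; use one from the adapter's hardcoded list):

```python
# quick_test.py - minimal smoke test against the local adapter, assuming the
# standard OpenAI-compatible /v1/chat/completions route described above.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",  # the adapter's local address
    api_key="any-random-string",          # the adapter accepts any key
)

# "claude-3-opus" is a placeholder; substitute a model ID from the hardcoded list.
stream = client.chat.completions.create(
    model="claude-3-opus",
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```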
This is a simple script for storing credentials. The user enters their username and password once, and the script writes them to a .env file in the root directory. This allows for automatic credential loading in the future, avoiding repetitive input.
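A minimal sketch of such a credential-storage script, assuming `.env` keys named `XJTLU_USERNAME` and `XJTLU_PASSWORD` (the real `config.py` may use different names and prompts):

```python
# config_sketch.py - illustrative only; key names are assumptions.
from getpass import getpass
from pathlib import Path

def main() -> None:
    username = input("XJTLU username: ").strip()
    password = getpass("XJTLU password: ")  # not echoed to the terminal

    env_path = Path(__file__).resolve().parent / ".env"
    env_path.write_text(
        f"XJTLU_USERNAME={username}\nXJTLU_PASSWORD={password}\n",
        encoding="utf-8",
    )
    print(f"Credentials written to {env_path}")

if __name__ == "__main__":
    main()
```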
- Retrieves the username and password from the `.env` file.
- Launches a temporary Chrome window.
- Simulates the user entering their credentials and clicking "Log In".
- Intercepts the necessary authentication token and saves it to the `.env` file. Each time a new token is fetched, it overwrites the old one.
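For illustration only, a rough Selenium-based sketch of this flow; the login URL, element locators, and the localStorage key are assumptions and will differ from the real `auth.py`:

```python
# auth_sketch.py - rough illustration of the login automation described above.
import os
import time

from dotenv import load_dotenv, set_key
from selenium import webdriver
from selenium.webdriver.common.by import By

LOGIN_URL = "https://xipuai.xjtlu.edu.cn/"  # placeholder URL

def fetch_token() -> str:
    load_dotenv()
    username = os.environ["XJTLU_USERNAME"]
    password = os.environ["XJTLU_PASSWORD"]

    driver = webdriver.Chrome()  # requires Google Chrome to be installed
    try:
        driver.get(LOGIN_URL)
        # Locators below are hypothetical; inspect the real page to find them.
        driver.find_element(By.ID, "username").send_keys(username)
        driver.find_element(By.ID, "password").send_keys(password)
        driver.find_element(By.ID, "login-button").click()
        time.sleep(5)  # crude wait for the post-login redirect
        # Assume the site keeps its JWT in localStorage under "token".
        token = driver.execute_script("return window.localStorage.getItem('token');")
    finally:
        driver.quit()

    if not token:
        raise RuntimeError("Login did not yield a token")
    set_key(".env", "JWT_TOKEN", token)  # overwrite any previous token
    return token

if __name__ == "__main__":
    print("Token saved:", bool(fetch_token()))
```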
Uses base64 to decode the `JWT_TOKEN` in the `.env` file in the same directory and inspect the information it contains. This is only used for exploration and has no effect on normal operation.
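Since a JWT payload is just base64url-encoded JSON, a few lines are enough to inspect it (assuming the token is stored under the `JWT_TOKEN` key):

```python
# decode_jwt_sketch.py - inspect the JWT payload stored in .env.
import base64
import json
import os

from dotenv import load_dotenv

load_dotenv()
token = os.environ["JWT_TOKEN"]

# A JWT is header.payload.signature; the payload is base64url-encoded JSON.
payload_b64 = token.split(".")[1]
payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
payload = json.loads(base64.urlsafe_b64decode(payload_b64))

print(json.dumps(payload, indent=2))  # e.g. expiry ("exp"), user id, etc.
```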
Runs the required commands in sequence so the user does not have to type them manually (see the flowchart and the precheck sketch below).
```mermaid
graph TD
    subgraph precheck.py
    A{.env file exists?}
    A --exists--> D{username & password exist?}
    A --does not exist--> 1
    D --all exist--> F{tokens exist?}
    D --does not exist--> 1
    F --D.N.E.--> 2
    F --exist--> H{heartbeat session exists?}
    H --exists--> I{tokentest.py}
    H --D.N.E.--> 2
    I --EXPIRED--> 2
    I --valid--> 0
    end
    subgraph run.bat
    a[start] --> b[conda environment check] --> c[activate 'genai_project' environment] --> d[run precheck.py] --> A
    1 --> config.py --> auth.py --> e[uvicorn service start]
    2 --> auth.py
    0 --> e
    end
```
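A rough Python sketch of the precheck logic implied by the flowchart, using exit codes 0/1/2 so that `run.bat` knows what to run next (the `.env` key names are assumptions):

```python
# precheck_sketch.py - illustrative version of the startup checks shown above.
# Exit code 0 = everything valid, 1 = run config.py then auth.py, 2 = run auth.py.
import os
import sys
from pathlib import Path

from dotenv import dotenv_values

def main() -> int:
    env_path = Path(".env")
    if not env_path.exists():
        return 1  # no .env at all -> collect credentials first

    env = dotenv_values(env_path)
    if not env.get("XJTLU_USERNAME") or not env.get("XJTLU_PASSWORD"):
        return 1  # credentials missing -> config.py then auth.py

    if not env.get("JWT_TOKEN"):
        return 2  # credentials present but no token -> auth.py

    if not env.get("HEARTBEAT_SESSION_ID"):
        return 2  # no persistent heartbeat session yet -> auth.py

    # Delegate the live token check (tokentest.py in the flowchart).
    token_is_valid = os.system(f"{sys.executable} tokentest.py") == 0
    return 0 if token_is_valid else 2

if __name__ == "__main__":
    sys.exit(main())
```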
The core concept is to package each user request (including context) into a single block of text and send it to the web service.
When a desktop client (like Cherry Studio) sends a chat request to our adapter, the following sequence of events unfolds:
- Receive the Request (`chat_proxy`):
  - FastAPI receives a `POST /v1/chat/completions` request that conforms to the OpenAI format.
  - We immediately parse the JSON payload, which contains the `model`, `temperature`, `messages` array, and all other necessary information.
- Create an Independent Session (`create_new_session`):
  - This is the cornerstone of the entire process. We immediately send a `POST` request to the school's `saveSession` endpoint.
  - The payload of this request includes all parameters from the client's request, such as `model` and `temperature`.
  - Key Insight: We discovered that this endpoint allows us to configure all parameters at the moment of session creation, getting everything done in one step.
  - The server returns a brand new, unique `sessionId`. This session acts as a "disposable sandbox" dedicated solely to this single chat request.
- Construct the Full Context (`prompt_parts` & `join`):
  - We read the entire `messages` array (including `system`, `user`, and `assistant` roles and their content).
  - We use the simplest, most direct method: concatenating them into a single, massive text string formatted as `Role: Content`, separated by newlines.
  - Key Insight: We confirmed that the backend server is powerful enough to process this "kitchen sink" prompt containing raw text, Markdown, and code snippets. We don't need to perform complex prompt engineering or content truncation on our end.
- The Crucial Strategic Delay (`asyncio.sleep`):
  - After successfully creating the session and preparing the prompt, we intentionally pause the program for a short period (`INTER_REQUEST_DELAY`, typically 1 second).
  - Key Insight: This is the lifeline that ensures the project's stability. It solves the `429 Too Many Requests` error that plagued us for so long. The root cause was a limit on the overall API request frequency, not on any single endpoint. This delay creates a sufficient time gap between the "create session" and "send chat" network requests, mimicking real user behavior and perfectly circumventing the rate limit.
- Send the Chat and Stream the Response (`stream_generator`):
  - After the delay, we send a streaming `POST` request to the school's `completions` endpoint.
  - The payload contains the `sessionId` we just created and the massive `full_prompt`.
  - We receive text chunks from the server in real time, repackage them into OpenAI-formatted `chunks`, and stream them back to the desktop client. This provides the "typewriter" effect for the user.
- Automatic Cleanup (`finally` & `delete_session`):
  - When the chat stream ends (whether normally or due to an error), the `finally` block in the `stream_generator` is triggered.
  - Key Insight: We found that the server limits the total number of active sessions a user can have (to 50). To avoid exhausting this pool, we must "burn after reading."
  - The `finally` block starts a background async task that calls the `delete_session` function.
  - This background task sends a `POST` request to the school's `delSession` endpoint, including the `sessionId` that was just used, to permanently delete it from the server.
  - This "garbage collection" mechanism is the guarantee that our "stateless" model can operate long-term.
This process forms a perfect closed loop: Create -> Use -> Destroy. Every conversation is a new, independent interaction that does not rely on the server's historical state, giving the desktop client full control over the context.
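Under the endpoint names and fields described above (`saveSession`, `completions`, `delSession`, `sessionId`), the loop can be sketched roughly as follows; paths, payload shapes, and headers are illustrative, not the adapter's exact code:

```python
# lifecycle_sketch.py - illustrative create -> use -> destroy loop.
# BASE, endpoint paths, payload fields, and header contents are assumptions.
import asyncio
import httpx

BASE = "https://xipuai.xjtlu.edu.cn/api"   # placeholder base URL
HEADERS = {"Authorization": "Bearer <JWT_TOKEN from .env>"}
INTER_REQUEST_DELAY = 1.0                  # seconds between consecutive API calls

def build_prompt(messages: list[dict]) -> str:
    # Flatten the OpenAI-style messages array into "Role: Content" lines.
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

async def chat_once(messages: list[dict], model: str, temperature: float) -> None:
    async with httpx.AsyncClient(headers=HEADERS, timeout=60) as client:
        # 1. Create a disposable session configured with all parameters up front.
        resp = await client.post(f"{BASE}/saveSession",
                                 json={"model": model, "temperature": temperature})
        session_id = resp.json()["sessionId"]          # assumed response field
        try:
            # 2. Strategic delay: space out consecutive calls to avoid 429 errors.
            await asyncio.sleep(INTER_REQUEST_DELAY)

            # 3. Send the whole context in one shot and stream the reply.
            async with client.stream("POST", f"{BASE}/completions",
                                     json={"sessionId": session_id,
                                           "prompt": build_prompt(messages)}) as response:
                async for chunk in response.aiter_text():
                    # The real adapter re-wraps these as OpenAI-format chunks.
                    print(chunk, end="", flush=True)
        finally:
            # 4. Burn after reading: free one of the ~50 session slots.
            await client.post(f"{BASE}/delSession", json={"sessionId": session_id})

if __name__ == "__main__":
    asyncio.run(chat_once(
        [{"role": "user", "content": "Hello!"}], model="claude-3-opus", temperature=0.7))
```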
The adapter script establishes and reuses a 'Persistent Heartbeat Session.'
The adapter service keeps this session active by sending a token-free `saveSession` request to the web service at a configurable interval (20 minutes).
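A minimal sketch of such a keepalive loop (the endpoint path and payload are assumptions based on the description above):

```python
# heartbeat_sketch.py - keep the persistent heartbeat session alive.
import asyncio
import httpx

HEARTBEAT_INTERVAL = 20 * 60  # seconds; configurable in the real adapter

async def heartbeat_loop(base_url: str, session_id: str) -> None:
    async with httpx.AsyncClient(timeout=30) as client:
        while True:
            try:
                # Token-free ping that keeps the session from expiring.
                await client.post(f"{base_url}/saveSession",
                                  json={"sessionId": session_id})
            except httpx.HTTPError as exc:
                # Log instead of failing silently (see the to-do list below).
                print(f"heartbeat failed: {exc}")
            await asyncio.sleep(HEARTBEAT_INTERVAL)
```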
This journey was like navigating through a thick fog, setting a course based on limited clues, hitting an iceberg, and then recalibrating.
- Symptom: The model seemed to have "amnesia," unable to understand multi-turn conversations or responding with "I don't understand" to long texts from a knowledge base.
- Initial Attempt (v1-v4): We simply concatenated all history into one big string. This worked for simple chats but failed with complex knowledge bases.
- The Wrong Turn (v4 - "Session State" Simulation): We mistakenly assumed we should mimic the web UI's stateful behavior, sending only the last user message and relying on the server's `sessionId` to maintain context. This was our biggest detour, as it contradicted our goal of building a stateless, client-controlled API. It resulted in a model with no context, which naturally couldn't answer correctly.
- Another Attempt (v8 - "Session Injection"): We devised a "clever" method of trying to rebuild context on the server turn-by-turn with multiple POST requests. This failed due to excessive request frequency.
- The Final Epiphany (v10): Your observation—that the web UI could "swallow a huge chunk of text at once"—made us realize that the simplest approach was the right one. The problem wasn't how to send the context, but rather what kind of context the backend could handle and how fast we were sending it.
- Symptom: Frequent `Request too fast` or `429 Too Many Requests` errors.
- Initial Attempt (v2/v3 - The "Ammunition Depot" Model): We designed an elaborate asynchronous "ammo box" (`session_id_ammo_box`) to pre-fetch the next session ID in the background while processing the current request, intending to boost efficiency.
- The Fatal Concurrency: The problem was that the background "pre-fetch" request (`saveSession`) and the foreground "update parameters" request (also `saveSession`) were happening within milliseconds of each other. The server's rate-limiting mechanism immediately flagged this concurrent access as abnormal, causing it to fail.
- The Wrong Solution (v8 - "Session Injection"): We abandoned the "ammo box," but the new "session injection" logic introduced an even denser sequence of requests, triggering the same rate limit.
- The Final Epiphany (v6, v10): We finally understood that the key wasn't concurrency, but frequency. We didn't need a complex async queue; we just needed a simple, human-like `asyncio.sleep()` between consecutive API calls. This tiny delay was the "silver bullet" that solved all our rate-limiting problems.
- Symptom: The 429 error returned by the server included the message `...adjust the length of the context...`.
- The Wrong Assumption (v9): We took this message at face value and assumed the text was too long. This led us to develop an "intelligent truncation" module.
- A Fortunate Coincidence: Although based on a false premise, this truncation module happened to work. We later realized its success was not due to the truncation itself, but because of the incidental request delay it introduced and its potential to unintentionally clean "toxic" characters from the text.
- The Final Epiphany (v10): Your final tests proved that the backend could handle long text without truncation, as long as there was a delay. This taught us a profound lesson: API error messages don't always reveal the root cause of a problem; sometimes, they are a smokescreen. Reverse engineering requires bold hypotheses but, more importantly, careful verification.
| Error | Reason | Measure |
|---|---|---|
| "403" | Token expired | Re-run auth.py |
| "Unfortunately, I don't have any relevant information regarding this matter. However. Please feel free to ask me other questions, and I’ll do my best to help." | Prompt includes "poisonous text", the backend cannot read it | Delete the last conversation of this session |
| 200OK but return nothing. Send the same question to the web service and it will return "Unfortunately, I don't......" | Banned words detected by the school | modify your prompt |
| "Request Err!" | Enter "poisonous text", the backend cannot read it | Delete the last conversation of this session |
| 'INFO: 127.0.0.1:7607 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error' | Token error | re-run auth.py |
| 'Request too fast, please try again later!' | Conflict with scripted automatic messages | Try again in a few seconds |
- `adapter.py`: remove the max-token cut
- `adapter.py`: optimize the input format
- `adapter.py`: isolate the "heartbeat" as a subprocess
- `adapter.py`: fix bug where the keepalive function fails silently when the `HeartbeatSessionID` in `.env` is invalid
- `adapter.py`: support MCP