
Bug + Proposal: ekg-embedding search fails on long Chinese notes due to naive truncation #205

@minkieyume

Description


Hi, I chose EKG because I was tired of the rigid hierarchical structure of traditional file systems: categorization there is one-dimensional, which makes notes hard to retrieve; every note needs a manually chosen title; and it's inconvenient for LLMs to flexibly query and access notes in real time.

However, I've encountered an issue with embeddings:
When a note's content is too long, ekg-embedding-search sometimes fails to return the relevant note, even when the search keyword appears verbatim in the text or in the note's tags.

I suspect the problem lies in the current truncation mechanism in ekg-embedding. The ekg-text-selector function appears to truncate long text at English word delimiters (whitespace), which can cause semantic loss in Chinese, where words are not separated by spaces, and can segment long Chinese passages improperly for embedding.
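
To illustrate why whitespace-based word counting misbehaves on Chinese (a minimal demonstration in Emacs Lisp; split-string splits on whitespace by default):

```elisp
;; An English sentence yields one element per word...
(length (split-string "The quick brown fox jumps over the lazy dog"))
;; => 9

;; ...while a Chinese passage with no spaces counts as a single "word",
;; so a word-based truncation limit never lands on a sensible boundary.
(length (split-string "敏捷的棕色狐狸跳过了那只懒狗这句话完全没有空格"))
;; => 1
```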

Additionally, this relatively simple truncation strategy appears to be tuned for OpenAI models. For local models, differences in tokenizer implementations can produce inconsistent token counts and compatibility issues.

While it is necessary to limit context length for LLMs, forcibly truncating text during vectorization discards context. Inspired by the LlamaIndex project, I'd like to propose the following improvements:

Allow users to define a custom tokenizer, and provide a Python-compatible tokenizer invocation module that lets users calculate token counts based on their model of choice.
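
For example, a minimal Emacs Lisp sketch of what this could look like. The option name ekg-embedding-token-counter and the helper are hypothetical, not existing ekg APIs; it assumes python3 with the tiktoken library installed, but any tokenizer matching the user's model could be substituted:

```elisp
(defvar ekg-embedding-token-counter #'my/ekg-count-tokens-tiktoken
  "Hypothetical option: a function taking a string and returning its
token count, so users can match their model's actual tokenizer.")

(defun my/ekg-count-tokens-tiktoken (text)
  "Count tokens in TEXT by piping it to Python's tiktoken library.
The encoding name would itself be user-configurable."
  (with-temp-buffer
    (insert text)
    (call-process-region
     (point-min) (point-max) "python3" t t nil "-c"
     (concat "import sys, tiktoken;"
             " enc = tiktoken.get_encoding('cl100k_base');"
             " print(len(enc.encode(sys.stdin.read())))"))
    (string-to-number (buffer-string))))
```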

Change the note processing logic: When a note exceeds the token limit, split it into chunks. Each chunk should retain metadata linking it to its original note. All chunks can then be embedded independently. This should help mitigate context loss and improve local retrieval reliability.
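
Sketched below; the function name and plist keys are hypothetical shapes, not ekg's current data model. Each chunk carries its parent note's id plus an index, and each chunk's :text would then get its own embedding, so a search hit on any chunk resolves back to the full note:

```elisp
(require 'seq)

(defun my/ekg-chunk-note (note-id pieces)
  "Wrap each string in PIECES as a chunk plist tied to NOTE-ID.
Every chunk keeps enough metadata (:note-id plus a :chunk index)
that an embedding hit on any chunk can be traced to the original
note; each chunk's :text is embedded independently."
  (seq-map-indexed
   (lambda (piece idx)
     (list :note-id note-id :chunk idx :text piece))
   pieces))

;; (my/ekg-chunk-note 42 '("first piece" "second piece"))
;; => ((:note-id 42 :chunk 0 :text "first piece")
;;     (:note-id 42 :chunk 1 :text "second piece"))
```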

Support flexible text splitting methods, such as by paragraphs, Chinese/English punctuation marks, or even individual Chinese characters. When the resulting chunks still exceed the token limit, they could be further sliced using word spaces or character-based delimiters.
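
A sketch of such a recursive splitter (hypothetical names; a real version would preserve the delimiters it splits on, and the separator list would be a user option):

```elisp
(require 'seq)

(defvar my/ekg-separators '("\n\n" "[。！？；]" "[.!?;]" "[ \t\n]")
  "Split levels from coarsest to finest: paragraph breaks, Chinese
sentence punctuation, English sentence punctuation, whitespace.")

(defun my/ekg-split (text max-tokens count-fn seps)
  "Split TEXT into pieces of at most MAX-TOKENS tokens per COUNT-FN.
SEPS is a list of separator regexps tried from coarsest to finest;
pieces that still do not fit after the last separator fall back to
fixed-size character runs (roughly assuming one token per character;
a real version would re-check each run with COUNT-FN)."
  (cond
   ((<= (funcall count-fn text) max-tokens)
    (list text))
   ((null seps)
    ;; Last resort: individual characters, regrouped greedily.
    (mapcar #'concat (seq-partition (string-to-list text) max-tokens)))
   (t
    (seq-mapcat
     (lambda (piece)
       (my/ekg-split piece max-tokens count-fn (cdr seps)))
     (split-string text (car seps) t)))))

;; (my/ekg-split long-note-text 512 token-counter my/ekg-separators)
```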

Include a small number of trailing words or characters from one chunk at the beginning of the next, to prevent semantic breakage. This overlap size should be a tunable parameter, with a suggested default of around 5–10 words or characters.
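
For instance (hypothetical helper; the overlap here is counted in characters, which suits unspaced Chinese text):

```elisp
(defun my/ekg-add-overlap (chunks &optional overlap)
  "Prepend the last OVERLAP characters of each chunk to the next one.
CHUNKS is a list of strings; the first chunk is left as-is.
OVERLAP defaults to 8, within the 5-10 range suggested above."
  (let ((overlap (or overlap 8))
        (prev nil))
    (mapcar (lambda (chunk)
              (prog1 (if prev
                         (concat (substring prev (- (min overlap (length prev))))
                                 chunk)
                       chunk)
                (setq prev chunk)))
            chunks)))

;; (my/ekg-add-overlap '("第一块的内容" "第二块的内容") 2)
;; => ("第一块的内容" "内容第二块的内容")
```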
