
Conversation

@wannaphong

Description

Add Thai language

Tests

Screenshots

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main or master).
  • My commit messages follow the contribution guidelines.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no
    visible errors.

Developer Certificate of Origin

For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

Further Conditions

(e) This contribution does not grant me any rights to use the
    name of the project, the name of the author, the name of the
    organization, or the name of the author's organization in any
    advertising or publicity relating to this project or the
    contribution.

(f) This contribution does not grant me rights to any trademark or service
    mark belonging to the project, the author, the organization, or the
    author's organization.

This module provides functions for detecting and segmenting Thai text using the PyThaiNLP library.
Mark Thai script as completed and add usage note.

Copilot AI left a comment


Pull request overview

This PR adds support for Thai language segmentation to the words-segmentation library by integrating the PyThaiNLP library for Thai text tokenization.

Key Changes:

  • Added a new thai.py module with Thai text detection and segmentation functions
  • Registered Thai in the language specification system with Unicode Thai script support
  • Added PyThaiNLP as a project dependency

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description

  • words_segmentation/thai.py: New module implementing Thai text detection (has_thai), segmenter initialization (get_thai_segmenter), and word segmentation (segment_thai) using PyThaiNLP
  • words_segmentation/languages.py: Registers the Thai language with the Thai Unicode script and associates it with the segment_thai callback function
  • pyproject.toml: Adds the pythainlp dependency for Thai word segmentation
  • README.md: Updates documentation to mark the Thai script as implemented and references the PyThaiNLP library
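For context, the three functions named above could look roughly like the following. This is a minimal sketch, not the PR's actual code: the regex, the lazy import, and the caching are assumptions about how such a module is typically structured.

```python
import re
from functools import lru_cache

# Thai Unicode block: U+0E00-U+0E7F
_THAI_RE = re.compile(r"[\u0E00-\u0E7F]")


def has_thai(text: str) -> bool:
    """Return True if the text contains at least one Thai character."""
    return bool(_THAI_RE.search(text))


@lru_cache(maxsize=1)
def get_thai_segmenter():
    """Lazily import and return the pythainlp module used for tokenization."""
    import pythainlp  # deferred so the dependency is only needed for Thai input
    return pythainlp


def segment_thai(text: str) -> list[str]:
    """Segment Thai text into a list of words using PyThaiNLP."""
    pythainlp = get_thai_segmenter()
    return list(pythainlp.word_tokenize(text))
```

The lazy import in get_thai_segmenter keeps pythainlp from loading for callers that never see Thai input; has_thai needs only the standard library.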


# Use pythainlp.word_tokenize() for precise segmentation and join with spaces
segments = pythainlp.word_tokenize(text)
# Filter out empty segments and join with single spaces
return list(segments)

Copilot AI Dec 3, 2025


The comment states "Filter out empty segments and join with single spaces" but the code only converts the segments to a list without any filtering. This comment appears to be copied from the Chinese module but doesn't match the implementation. The comment should either be removed or the code should be updated to filter empty segments if needed:

# Return the list of word segments
return list(segments)

Or if filtering is needed:

# Filter out empty segments
return [seg for seg in segments if seg]
Suggested change:
- return list(segments)
+ return [seg for seg in segments if seg]
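The difference only matters when the tokenizer emits empty tokens. A stand-alone illustration (the token list here is made up, not real tokenizer output):

```python
# Hypothetical tokenizer output containing an empty segment
segments = ["สวัสดี", "", "โลก"]

# list() keeps the empty string as-is
as_list = list(segments)

# The suggested comprehension drops falsy (empty) segments
filtered = [seg for seg in segments if seg]

print(as_list)   # ['สวัสดี', '', 'โลก']
print(filtered)  # ['สวัสดี', 'โลก']
```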

Comment on lines +2 to +4
Chinese text pretokenization utilities.

This module provides functions for detecting and segmenting Chinese text using the jieba

Copilot AI Dec 3, 2025


The module docstring incorrectly describes this as a "Chinese text pretokenization" module when it should describe Thai text pretokenization. This should be updated to:

"""
Thai text pretokenization utilities.

This module provides functions for detecting and segmenting Thai text using the pythainlp
library for word segmentation.
"""
Suggested change:
- Chinese text pretokenization utilities.
- This module provides functions for detecting and segmenting Chinese text using the jieba
+ Thai text pretokenization utilities.
+ This module provides functions for detecting and segmenting Thai text using the pythainlp

Returns:
True if Thai characters are found, False otherwise
"""
# Match any Han ideograph using Unicode property

Copilot AI Dec 3, 2025


The comment incorrectly states "Match any Han ideograph" which is copy-pasted from the Chinese module. This should be updated to describe Thai Unicode range:

# Match Thai characters in the Unicode Thai block (U+0E00-U+0E7F)
Suggested change:
- # Match any Han ideograph using Unicode property
+ # Match Thai characters in the Unicode Thai block (U+0E00-U+0E7F)
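The U+0E00-U+0E7F range in that comment can also be checked directly against codepoints, without a regex. A stdlib-only sketch (the function name is illustrative, not from the PR):

```python
def is_thai_char(ch: str) -> bool:
    """True if ch falls in the Unicode Thai block (U+0E00-U+0E7F)."""
    return 0x0E00 <= ord(ch) <= 0x0E7F

print(is_thai_char("ส"))  # True: THAI CHARACTER SO SUA is U+0E2A, inside the block
print(is_thai_char("A"))  # False: Latin letters are outside it
```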


Example:
>>> segment_thai("สวัสดีโลก")
"สวัสดี โลก"

Copilot AI Dec 3, 2025


The example shows the output as a string "สวัสดี โลก" but the function returns list[str]. While this matches the pattern in the Chinese and Japanese modules, the example should be updated to reflect the actual return type:

Example:
    >>> segment_thai("สวัสดีโลก")
    ["สวัสดี", "โลก"]
Suggested change:
- "สวัสดี โลก"
+ ["สวัสดี", "โลก"]
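To make the return-type distinction concrete (the token list below is taken from the suggested docstring example, not produced by actually running PyThaiNLP): the function yields a list, and a caller who wants the space-joined string from the old docstring can join it explicitly.

```python
# Return value shown in the corrected docstring example: a list of tokens
tokens = ["สวัสดี", "โลก"]

# A caller wanting the old docstring's space-joined form does the join themselves
joined = " ".join(tokens)
print(joined)  # สวัสดี โลก
```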

"สวัสดี โลก"
"""
pythainlp = get_thai_segmenter()
# Use pythainlp.word_tokenize() for precise segmentation and join with spaces

Copilot AI Dec 3, 2025


[nitpick] The comment on line 66 names the specific method ("Use pythainlp.word_tokenize()") and mentions joining with spaces, which the code on line 67 does not do. This also differs from the pattern used in the Chinese and Japanese modules, where comments describe what the code does rather than naming a specific method. For consistency, consider updating to:

# Tokenize Thai text into words using PyThaiNLP
Suggested change:
- # Use pythainlp.word_tokenize() for precise segmentation and join with spaces
+ # Tokenize Thai text into words using PyThaiNLP
