[WIP] Add Thai language #6
base: main
Conversation
This module provides functions for detecting and segmenting Thai text using the PyThaiNLP library.
Mark Thai script as completed and add usage note.
Pull request overview
This PR adds support for Thai language segmentation to the words-segmentation library by integrating the PyThaiNLP library for Thai text tokenization.
Key Changes:
- Added a new `thai.py` module with Thai text detection and segmentation functions
- Registered Thai in the language specification system with Unicode Thai script support
- Added PyThaiNLP as a project dependency
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `words_segmentation/thai.py` | New module implementing Thai text detection (`has_thai`), segmenter initialization (`get_thai_segmenter`), and word segmentation (`segment_thai`) using PyThaiNLP |
| `words_segmentation/languages.py` | Registers the Thai language with the Thai Unicode script and associates it with the `segment_thai` callback function |
| `pyproject.toml` | Adds the `pythainlp` dependency for Thai word segmentation |
| `README.md` | Updates documentation to mark the Thai script as implemented and reference the PyThaiNLP library |
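For orientation, here is a minimal sketch of how the pieces in the table could fit together. The function names (`has_thai`, `get_thai_segmenter`, `segment_thai`) come from the table above, and the lazy-getter shape matches the `pythainlp = get_thai_segmenter()` call visible in the review excerpts below, but the bodies are illustrative assumptions, not the actual diff:

```python
import re

# Thai Unicode block: U+0E00-U+0E7F (assumption: detection is regex-based)
_THAI_RE = re.compile(r"[\u0E00-\u0E7F]")

# Cached module handle so pythainlp is only imported on first use
_segmenter = None


def has_thai(text: str) -> bool:
    """Return True if the text contains at least one Thai character."""
    return bool(_THAI_RE.search(text))


def get_thai_segmenter():
    """Lazily import pythainlp so the dependency loads only when needed."""
    global _segmenter
    if _segmenter is None:
        import pythainlp
        _segmenter = pythainlp
    return _segmenter


def segment_thai(text: str) -> list[str]:
    """Segment Thai text into words using PyThaiNLP.

    >>> segment_thai("สวัสดีโลก")
    ['สวัสดี', 'โลก']
    """
    pythainlp = get_thai_segmenter()
    segments = pythainlp.word_tokenize(text)
    return [seg for seg in segments if seg]
```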
```python
# Use pythainlp.word_tokenize() for precise segmentation and join with spaces
segments = pythainlp.word_tokenize(text)
# Filter out empty segments and join with single spaces
return list(segments)
```
Copilot AI · Dec 3, 2025
The comment states "Filter out empty segments and join with single spaces" but the code only converts the segments to a list without any filtering. This comment appears to be copied from the Chinese module but doesn't match the implementation. The comment should either be removed or the code should be updated to filter empty segments if needed:

```python
# Return the list of word segments
return list(segments)
```

Or, if filtering is needed:

```python
# Filter out empty segments
return [seg for seg in segments if seg]
```

Suggested change:

```diff
-    return list(segments)
+    return [seg for seg in segments if seg]
```
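A side note on why filtering can matter here (an observation about PyThaiNLP's defaults, not something the diff shows): `word_tokenize` keeps whitespace as its own tokens by default via its `keep_whitespace=True` parameter, so a truthiness filter like `if seg` drops only empty strings, not space tokens. A quick sketch of the difference:

```python
from pythainlp import word_tokenize

text = "สวัสดี โลก"  # Thai greeting with an explicit space

# Default behavior keeps the space as a token, e.g. ['สวัสดี', ' ', 'โลก']
print(word_tokenize(text))

# keep_whitespace=False drops space tokens at the source
print(word_tokenize(text, keep_whitespace=False))

# The review's filter removes empty strings only; a blank token ' ' survives it
print([seg for seg in word_tokenize(text) if seg])

# Dropping blank tokens as well would need a stricter predicate
print([seg for seg in word_tokenize(text) if seg.strip()])
```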
```python
Chinese text pretokenization utilities.

This module provides functions for detecting and segmenting Chinese text using the jieba
```
Copilot AI · Dec 3, 2025
The module docstring incorrectly describes this as a "Chinese text pretokenization" module when it should describe Thai text pretokenization. This should be updated to:

```python
"""
Thai text pretokenization utilities.

This module provides functions for detecting and segmenting Thai text using the pythainlp
library for word segmentation.
"""
```

Suggested change:

```diff
-Chinese text pretokenization utilities.
-This module provides functions for detecting and segmenting Chinese text using the jieba
+Thai text pretokenization utilities.
+This module provides functions for detecting and segmenting Thai text using the pythainlp
```
```python
Returns:
    True if Thai characters are found, False otherwise
"""
# Match any Han ideograph using Unicode property
```
Copilot AI · Dec 3, 2025
The comment incorrectly states "Match any Han ideograph", which is copy-pasted from the Chinese module. This should be updated to describe the Thai Unicode range:

```python
# Match Thai characters in the Unicode Thai block (U+0E00-U+0E7F)
```

Suggested change:

```diff
-# Match any Han ideograph using Unicode property
+# Match Thai characters in the Unicode Thai block (U+0E00-U+0E7F)
```
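To make the corrected comment concrete, here is a standalone check against the Thai block (a sketch; the module's actual pattern isn't shown in the excerpt):

```python
import re

# Thai block per the Unicode standard: U+0E00-U+0E7F
THAI_CHAR = re.compile(r"[\u0E00-\u0E7F]")

print(bool(THAI_CHAR.search("สวัสดี")))   # True: Thai letters
print(bool(THAI_CHAR.search("hello")))    # False: Latin only
print(bool(THAI_CHAR.search("ราคา ๕๐")))  # True: Thai digits U+0E50-U+0E59 are in the block
```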
```python
Example:
    >>> segment_thai("สวัสดีโลก")
    "สวัสดี โลก"
```
Copilot AI · Dec 3, 2025
The example shows the output as a string "สวัสดี โลก" but the function returns list[str]. While this matches the pattern in the Chinese and Japanese modules, the example should be updated to reflect the actual return type:

```python
Example:
    >>> segment_thai("สวัสดีโลก")
    ["สวัสดี", "โลก"]
```

Suggested change:

```diff
-    "สวัสดี โลก"
+    ["สวัสดี", "โลก"]
```
| "สวัสดี โลก" | ||
| """ | ||
| pythainlp = get_thai_segmenter() | ||
| # Use pythainlp.word_tokenize() for precise segmentation and join with spaces |
Copilot AI · Dec 3, 2025
[nitpick] The comment on line 66 states "Use pythainlp.word_tokenize()" but the actual method being called on line 67 is pythainlp.word_tokenize. While this is technically correct, it differs from the pattern used in the Chinese and Japanese modules, where the comment describes what the function does rather than naming the specific method. For consistency, consider updating to:

```python
# Tokenize Thai text into words using PyThaiNLP
```

Suggested change:

```diff
-    # Use pythainlp.word_tokenize() for precise segmentation and join with spaces
+    # Tokenize Thai text into words using PyThaiNLP
```
Description
Add Thai language
Developer Certificate of Origin
For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."