[WIP] Add Thai language #6
base: main
Conversation
This module provides functions for detecting and segmenting Thai text using the PyThaiNLP library.
Mark Thai script as completed and add usage note.
Pull request overview
This PR adds support for Thai language segmentation to the words-segmentation library by integrating the PyThaiNLP library for Thai text tokenization.
Key Changes:
- Added a new `thai.py` module with Thai text detection and segmentation functions
- Registered Thai in the language specification system with Unicode Thai script support
- Added PyThaiNLP as a project dependency
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `words_segmentation/thai.py` | New module implementing Thai text detection (`has_thai`), segmenter initialization (`get_thai_segmenter`), and word segmentation (`segment_thai`) using PyThaiNLP |
| `words_segmentation/languages.py` | Registers the Thai language with the Thai Unicode script and associates it with the `segment_thai` callback function |
| `pyproject.toml` | Adds the `pythainlp` dependency for Thai word segmentation |
| `README.md` | Updates documentation to mark the Thai script as implemented and reference the PyThaiNLP library |
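For orientation, here is a minimal sketch of how the pieces in the table could fit together. The function names (`has_thai`, `get_thai_segmenter`, `segment_thai`) come from the table above, and the lazy-getter shape matches the `pythainlp = get_thai_segmenter()` call visible in the review excerpts below, but the bodies are illustrative assumptions, not the actual diff:

```python
import re

# Thai Unicode block: U+0E00-U+0E7F (assumption: detection is regex-based)
_THAI_RE = re.compile(r"[\u0E00-\u0E7F]")

# Cached module handle so pythainlp is only imported on first use
_segmenter = None


def has_thai(text: str) -> bool:
    """Return True if the text contains at least one Thai character."""
    return bool(_THAI_RE.search(text))


def get_thai_segmenter():
    """Lazily import pythainlp so the dependency loads only when needed."""
    global _segmenter
    if _segmenter is None:
        import pythainlp
        _segmenter = pythainlp
    return _segmenter


def segment_thai(text: str) -> list[str]:
    """Segment Thai text into words using PyThaiNLP.

    >>> segment_thai("สวัสดีโลก")
    ['สวัสดี', 'โลก']
    """
    pythainlp = get_thai_segmenter()
    segments = pythainlp.word_tokenize(text)
    return [seg for seg in segments if seg]
```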
```python
# Use pythainlp.word_tokenize() for precise segmentation and join with spaces
segments = pythainlp.word_tokenize(text)
# Filter out empty segments and join with single spaces
return list(segments)
```
Copilot AI · Dec 3, 2025
The comment states "Filter out empty segments and join with single spaces" but the code only converts the segments to a list without any filtering. This comment appears to be copied from the Chinese module but doesn't match the implementation. The comment should either be removed or the code should be updated to filter empty segments if needed:

```python
# Return the list of word segments
return list(segments)
```

Or, if filtering is needed:

```python
# Filter out empty segments
return [seg for seg in segments if seg]
```

Suggested change:

```diff
-    return list(segments)
+    return [seg for seg in segments if seg]
```
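A side note on why filtering can matter here (an observation about PyThaiNLP's defaults, not something the diff shows): `word_tokenize` keeps whitespace as its own tokens by default via its `keep_whitespace=True` parameter, so a truthiness filter like `if seg` drops only empty strings, not space tokens. A quick sketch of the difference:

```python
from pythainlp import word_tokenize

text = "สวัสดี โลก"  # Thai greeting with an explicit space

# Default behavior keeps the space as a token, e.g. ['สวัสดี', ' ', 'โลก']
print(word_tokenize(text))

# keep_whitespace=False drops space tokens at the source
print(word_tokenize(text, keep_whitespace=False))

# The review's filter removes empty strings only; a blank token ' ' survives it
print([seg for seg in word_tokenize(text) if seg])

# Dropping blank tokens as well would need a stricter predicate
print([seg for seg in word_tokenize(text) if seg.strip()])
```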
```python
Chinese text pretokenization utilities.

This module provides functions for detecting and segmenting Chinese text using the jieba
```
Copilot AI · Dec 3, 2025
The module docstring incorrectly describes this as a "Chinese text pretokenization" module when it should describe Thai text pretokenization. This should be updated to:

```python
"""
Thai text pretokenization utilities.

This module provides functions for detecting and segmenting Thai text using the pythainlp
library for word segmentation.
"""
```

Suggested change:

```diff
-Chinese text pretokenization utilities.
-This module provides functions for detecting and segmenting Chinese text using the jieba
+Thai text pretokenization utilities.
+This module provides functions for detecting and segmenting Thai text using the pythainlp
```
```python
Returns:
    True if Thai characters are found, False otherwise
"""
# Match any Han ideograph using Unicode property
```
Copilot AI · Dec 3, 2025
The comment incorrectly states "Match any Han ideograph", which is copy-pasted from the Chinese module. This should be updated to describe the Thai Unicode range:

```python
# Match Thai characters in the Unicode Thai block (U+0E00-U+0E7F)
```

Suggested change:

```diff
-# Match any Han ideograph using Unicode property
+# Match Thai characters in the Unicode Thai block (U+0E00-U+0E7F)
```
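To make the corrected comment concrete, here is a standalone check against the Thai block (a sketch; the module's actual pattern isn't shown in the excerpt):

```python
import re

# Thai block per the Unicode standard: U+0E00-U+0E7F
THAI_CHAR = re.compile(r"[\u0E00-\u0E7F]")

print(bool(THAI_CHAR.search("สวัสดี")))   # True: Thai letters
print(bool(THAI_CHAR.search("hello")))    # False: Latin only
print(bool(THAI_CHAR.search("ราคา ๕๐")))  # True: Thai digits U+0E50-U+0E59 are in the block
```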
```python
Example:
    >>> segment_thai("สวัสดีโลก")
    "สวัสดี โลก"
```
Copilot AI · Dec 3, 2025
The example shows the output as a string "สวัสดี โลก" but the function returns list[str]. While this matches the pattern in the Chinese and Japanese modules, the example should be updated to reflect the actual return type:

```python
Example:
    >>> segment_thai("สวัสดีโลก")
    ["สวัสดี", "โลก"]
```

Suggested change:

```diff
-    "สวัสดี โลก"
+    ["สวัสดี", "โลก"]
```
| "สวัสดี โลก" | ||
| """ | ||
| pythainlp = get_thai_segmenter() | ||
| # Use pythainlp.word_tokenize() for precise segmentation and join with spaces |
Copilot AI · Dec 3, 2025
[nitpick] The comment on line 66 states "Use pythainlp.word_tokenize()" but the actual method being called on line 67 is pythainlp.word_tokenize. While this is technically correct, it differs from the pattern used in the Chinese and Japanese modules, where the comment describes what the function does rather than naming the specific method. For consistency, consider updating to:

```python
# Tokenize Thai text into words using PyThaiNLP
```

Suggested change:

```diff
-    # Use pythainlp.word_tokenize() for precise segmentation and join with spaces
+    # Tokenize Thai text into words using PyThaiNLP
```
Description
Add Thai language
Developer Certificate of Origin
For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."