2 changes: 1 addition & 1 deletion README.md
@@ -3,4 +3,4 @@
A Python library for processing text in the Maltese language.

- PyPI: https://pypi.org/project/malti/
- Documentation: https://malti.readthedocs.io/
- Documentation: https://malti.readthedocs.io/
1 change: 1 addition & 0 deletions docs/source/malti.rst
@@ -6,4 +6,5 @@ Top-level package.
.. toctree::
    :maxdepth: 1

    malti/sent_splitter
    malti/tokeniser
10 changes: 10 additions & 0 deletions docs/source/malti/sent_splitter.rst
@@ -0,0 +1,10 @@
sent_splitter
=============

Sentence splitters for Maltese text.

.. toctree::
    :maxdepth: 1

    sent_splitter/sent_splitter.rst
    sent_splitter/km_sent_splitter
9 changes: 9 additions & 0 deletions docs/source/malti/sent_splitter/km_sent_splitter.rst
@@ -0,0 +1,9 @@
km_sent_splitter
================

The MLRS Korpus Malti's sentence splitter.

.. toctree::
    :maxdepth: 1

    km_sent_splitter/km_sent_splitter.rst
@@ -0,0 +1,10 @@
km_sent_splitter.py
===================

.. automodule:: malti.sent_splitter.km_sent_splitter.km_sent_splitter
    :members:
    :show-inheritance:
    :inherited-members:
    :special-members:
    :exclude-members: __weakref__

10 changes: 10 additions & 0 deletions docs/source/malti/sent_splitter/sent_splitter.rst
@@ -0,0 +1,10 @@
sent_splitter.py
================

.. automodule:: malti.sent_splitter.sent_splitter
    :members:
    :show-inheritance:
    :inherited-members:
    :special-members:
    :exclude-members: __weakref__

1 change: 1 addition & 0 deletions docs/source/usage.rst
@@ -10,3 +10,4 @@ Select a topic to learn about below:

usage/install
usage/tokenisers
usage/sentence_splitters
53 changes: 53 additions & 0 deletions docs/source/usage/sentence_splitters.rst
@@ -0,0 +1,53 @@
Sentence splitters
==================

Sentence splitters are used to break up text represented as a single string (such as from a text file) into a list of sentences.


The ``split`` function
----------------------

The simplest way to split a text into sentences with ``malti`` is as follows:

.. code-block:: python
    :linenos:

    import malti.sent_splitter

    text = 'Eżempju ta\' sentenza. Eżempju ta\' sentenza oħra.'
    sentences = malti.sent_splitter.split(text)
    print(sentences)

.. code-block:: python

    ['Eżempju ta\' sentenza.', 'Eżempju ta\' sentenza oħra.']


The ``SentSplitter`` class
--------------------------

The ``split`` function above is a convenience wrapper around a default sentence splitter (``KMSentSplitter`` in this version).
To access all the features of a sentence splitter, instantiate its class and call its ``split`` method, for example:

.. code-block:: python
    :linenos:

    import malti.sent_splitter

    splitter = malti.sent_splitter.KMSentSplitter()

    text = 'Eżempju ta\' sentenza. Eżempju ta\' sentenza oħra.'
    sentences = splitter.split(text)
    print(sentences)

.. code-block:: python

    ['Eżempju ta\' sentenza.', 'Eżempju ta\' sentenza oħra.']
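
The relationship between the convenience function and the class form can be sketched without installing ``malti``. The block below is only an illustration: ``SentSplitterSketch`` and ``FullStopSplitter`` are hypothetical stand-ins, and the toy splitting rule is not the one used by ``KMSentSplitter``. The only details taken from the library are that a splitter exposes ``split(text)`` returning a list of strings and that the module-level ``split`` delegates to a default splitter.

```python
from abc import ABC, abstractmethod


class SentSplitterSketch(ABC):
    '''Hypothetical stand-in for the sentence splitter interface.'''

    @abstractmethod
    def split(self, text: str) -> list[str]:
        '''Split a text into a list of sentences.'''


class FullStopSplitter(SentSplitterSketch):
    '''Toy splitter that breaks after every full stop (not KMSentSplitter).'''

    def split(self, text: str) -> list[str]:
        # Drop empty pieces and restore the full stop removed by str.split.
        parts = [part.strip() for part in text.split('.')]
        return [part + '.' for part in parts if part]


def split(text: str) -> list[str]:
    '''Convenience function that delegates to a default splitter.'''
    return FullStopSplitter().split(text)


text = "Eżempju ta' sentenza. Eżempju ta' sentenza oħra."
sentences = split(text)
print(sentences)
```

Under this design, choosing a different default only requires changing the class instantiated inside the convenience function, while callers who need more control keep their own splitter instance.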


Available sentence splitters
----------------------------

The following sentence splitters are available:

* ``malti.sent_splitter.KMSentSplitter`` (:doc:`../malti/sent_splitter/km_sent_splitter/km_sent_splitter`): A ``SentSplitter`` that is equivalent to the one used to split sentences in the `Korpus Malti <https://mlrs.research.um.edu.mt/CQPweb/>`_.
16 changes: 16 additions & 0 deletions docs/source/usage/tokenisers.rst
@@ -63,6 +63,22 @@ Apart from ``tokenise``, every tokeniser can also return a list of indices of th

This tells you that the first word is found at ``sentence[0:7]``, the second word at ``sentence[8:11]``, and so on.
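
As a rough illustration of how such indices are used, the sketch below slices a sentence with a list of ``(start, end)`` pairs. The sentence and the full list of spans are assumptions chosen to match the ``sentence[0:7]`` and ``sentence[8:11]`` pattern described above; they are not output copied from a tokeniser.

```python
# Assumed example sentence and spans, consistent with sentence[0:7]
# being the first word and sentence[8:11] the second.
sentence = "Eżempju ta' sentenza."
spans = [(0, 7), (8, 11), (12, 20), (20, 21)]

# Each token is the slice of the sentence between its start and end index.
tokens = [sentence[start:end] for start, end in spans]

# The text between consecutive spans is what the token list does not keep.
gaps = [
    sentence[prev_end:next_start]
    for (_, prev_end), (next_start, _) in zip(spans, spans[1:])
]

print(tokens)
print(gaps)
```

The gaps are the characters that a plain token list drops, which is why indices are useful whenever tokens need to be mapped back to positions in the original text.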

There is also a ``detokenise`` method that *approximately* inverts ``tokenise`` by turning a list of tokens back into text. Tokenisation is generally a lossy transformation, so there is no guarantee that the original text is recovered exactly:

.. code-block:: python
    :linenos:

    import malti.tokeniser

    tokeniser = malti.tokeniser.KMTokeniser()

    tokens = ['Eżempju', "ta'", 'sentenza', '.']
    text = tokeniser.detokenise(tokens)
    print(text)

.. code-block:: python

    'Eżempju ta\' sentenza.'
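
To see why the round trip is only approximate, consider a deliberately simple sketch. The two functions below are toy stand-ins (whitespace splitting and joining with single spaces), not the behaviour of ``KMTokeniser``; they only show how whitespace information can be lost once a text is reduced to tokens.

```python
def toy_tokenise(text: str) -> list[str]:
    '''Toy whitespace tokeniser (hypothetical, not KMTokeniser).'''
    return text.split()


def toy_detokenise(tokens: list[str]) -> str:
    '''Toy inverse that joins tokens with single spaces.'''
    return ' '.join(tokens)


# The original contains a double space and a tab between words.
original = "Eżempju  ta'\tsentenza"
round_trip = toy_detokenise(toy_tokenise(original))

print(repr(original))
print(repr(round_trip))
print(round_trip == original)
```

The tokens are identical for both strings, so no detokeniser could tell which spacing the original used.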

Available tokenisers
--------------------
1 change: 1 addition & 0 deletions requirements.txt
@@ -0,0 +1 @@
sentence_splitter==1.4
2 changes: 1 addition & 1 deletion src/malti/__init__.py
@@ -4,6 +4,6 @@

import os

__version__ = '0.1.0'
__version__ = '0.2.0'

path = os.path.dirname(os.path.abspath(__file__))
20 changes: 20 additions & 0 deletions src/malti/sent_splitter/__init__.py
@@ -0,0 +1,20 @@
'''
Sentence splitters for Maltese text.
'''

from malti.sent_splitter.sent_splitter import SentSplitter
from malti.sent_splitter.km_sent_splitter.km_sent_splitter import KMSentSplitter


#######################################################
Review comment (Collaborator): Remove these comments; they add unnecessary clutter.

def split(
    text: str,
) -> list[str]:
    '''
    Default sentence splitter.
    In this version, ``KMSentSplitter`` is used.

    :param text: The text to split.
    :return: The list of sentences.
    '''
    return KMSentSplitter().split(text)
3 changes: 3 additions & 0 deletions src/malti/sent_splitter/km_sent_splitter/__init__.py
@@ -0,0 +1,3 @@
'''
The MLRS Korpus Malti's sentence splitter.
'''
49 changes: 49 additions & 0 deletions src/malti/sent_splitter/km_sent_splitter/km_sent_splitter.py
@@ -0,0 +1,49 @@
'''
Korpus Malti sentence splitter.
'''

import os
import sentence_splitter
import malti
from malti.sent_splitter.sent_splitter import SentSplitter


__all__ = [
    'KMSentSplitter',
]


#######################################################
class KMSentSplitter(SentSplitter):
Review comment (@KurtMica, Collaborator, Jan 7, 2026): Can we use the full word "Sentence" for this? Similarly for the module name, etc.

    '''
    The sentence splitter used by the MLRS Korpus Malti.
    '''

    #######################################################
    def __init__(
        self,
    ) -> None:
        '''
        Constructor.
        '''
        super().__init__()
        self._splitter = sentence_splitter.SentenceSplitter(
            language='it',
            non_breaking_prefix_file=os.path.join(
                malti.path, 'sent_splitter', 'km_sent_splitter',
                'mt_non_breaking_prefixes.txt',
            ),
        )

    #######################################################
    def split(
        self,
        text: str,
    ) -> list[str]:
        '''
        Split a text into a list of sentences.

        :param text: The text to split.
        :return: The list of sentences.
        '''
        return self._splitter.split(text)
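
The constructor above passes a non-breaking prefix file to the underlying splitter. The sketch below illustrates, in a much simplified form, what such prefixes are for: they stop a full stop after an abbreviation from ending a sentence. It is not the algorithm used by the ``sentence_splitter`` package, and the prefix set is hypothetical; the actual contents of ``mt_non_breaking_prefixes.txt`` are not reproduced here.

```python
import re

# Hypothetical prefixes for illustration only.
NON_BREAKING_PREFIXES = {'Dr', 'Prof'}


def naive_split(text: str) -> list[str]:
    '''Split after '.', '!' or '?' when followed by whitespace.'''
    return re.split(r'(?<=[.!?])\s+', text.strip())


def prefix_aware_split(text: str) -> list[str]:
    '''Like naive_split, but re-join pieces that end in a known prefix.'''
    sentences: list[str] = []
    carry = ''
    for piece in naive_split(text):
        if not piece:
            continue
        candidate = f'{carry} {piece}' if carry else piece
        last_word = candidate.split()[-1].rstrip('.')
        if candidate.endswith('.') and last_word in NON_BREAKING_PREFIXES:
            # The full stop belongs to an abbreviation, so keep collecting.
            carry = candidate
        else:
            sentences.append(candidate)
            carry = ''
    if carry:
        sentences.append(carry)
    return sentences


text = 'Dr. Borg wasal. Prof. Vella telaq.'
print(naive_split(text))
print(prefix_aware_split(text))
```

The naive version breaks the text after each abbreviation, while the prefix-aware version keeps each abbreviation attached to the sentence it introduces.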