Update PySBD component to support spaCy v3 #114
Codecov Report
@@ Coverage Diff @@
## master #114 +/- ##
==========================================
- Coverage 98.43% 98.35% -0.09%
==========================================
Files 38 39 +1
Lines 1150 1153 +3
==========================================
+ Hits 1132 1134 +2
- Misses 18 19 +1
    },
    entry_points={
        "spacy_factories": ["pysbd = pysbd.utils:PySBDFactory"]
@rmitsch I would have to remove this entry point now: spaCy v3 requires the @Language.factory decorator to register a custom component, and since PySBDFactory lives in pysbd/utils.py, I would need to add a spacy>=3 requirement to pysbd's setup.py.
I want to keep pysbd lightweight (using only built-in Python modules).
Do you have any thoughts on this? Is there another way around it?
Hm, that's tricky. You could have a look at vendoring @Language.factory. You'd definitely need the registry functionality which can be found in https://github.com/explosion/catalogue now. It's still relatively lightweight, but it's already breaking your requirement of only having inbuilt Python modules.
How is spacy_factories used within PySBD?
It's not used in pysbd.
The pysbd Python library ships a spaCy component named pysbd out of the box via entry points.
Given a python environment with spacy and pysbd installed, nlp.add_pipe("pysbd") will work without importing pysbd explicitly.
More info here: https://spacy.io/usage/saving-loading#entry-points-components
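To make the entry-point mechanism concrete, here is a minimal sketch that lists whatever installed packages advertise under the spacy_factories group, using only the standard library. The helper name spacy_factory_entry_points is made up for illustration, and this is not how spaCy itself performs the lookup (it goes through its own registry); it only shows what the packaging metadata exposes.

```python
from importlib.metadata import entry_points


def spacy_factory_entry_points():
    """Return {component name: "module:attribute"} for the spacy_factories group.

    Hypothetical helper for illustration; spaCy does its own lookup internally.
    """
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ (and the importlib_metadata backport) offer select().
        group = eps.select(group="spacy_factories")
    else:
        # Python 3.8/3.9 return a plain dict keyed by group name.
        group = eps.get("spacy_factories", [])
    return {ep.name: ep.value for ep in group}


if __name__ == "__main__":
    for name, target in spacy_factory_entry_points().items():
        print(name, "->", target)
```

In an environment where pysbd is installed with the entry point shown in the diff above, the mapping would be expected to contain a pysbd key pointing at pysbd.utils:PySBDFactory; in an environment without such packages it is simply empty.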
Alternatively, you could offer pysbd and pysbd[spacy], with only the latter supporting use as a spaCy v3.x component and installing spaCy by default.
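A rough sketch of how the pysbd / pysbd[spacy] split could be declared with setuptools extras. The dictionary name SETUP_KWARGS and the empty install_requires list are assumptions made for illustration, not the project's actual setup.py; only the spacy>=3 pin comes from the discussion above.

```python
# Hypothetical excerpt of the keyword arguments passed to setuptools.setup().
# The base install stays dependency-free; spaCy is pulled in only on request.
SETUP_KWARGS = {
    "name": "pysbd",
    "install_requires": [],  # assumption: no hard third-party dependencies
    "extras_require": {
        # Installed only via: pip install "pysbd[spacy]"
        "spacy": ["spacy>=3"],
    },
}

if __name__ == "__main__":
    print(SETUP_KWARGS["extras_require"])
```

With this layout, a plain pip install pysbd would not install spaCy, while pip install "pysbd[spacy]" would.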
Yes, was thinking of doing this. Will look into it 👍🏼
  print('sent_id', 'sentence', sep='\t|\t')
  for sent_id, sent in enumerate(doc.sents, start=1):
-     print(sent_id, sent.text, sep='\t|\t')
+     print(sent_id, repr(sent.text), sep='\t|\t')
Out of curiosity: why is the repr() necessary here?
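The thread does not record an answer. One plausible reason, offered here only as an assumption, is that repr() makes trailing spaces and newlines in each sentence visible, which plain printing hides. A small sketch with made-up sentences and a hypothetical format_rows helper:

```python
# Made-up sentences; segments that preserve the original text may keep
# trailing whitespace or newlines, as these do.
sentences = ["My name is Jonas E. Smith. ", "Please turn to p. 55.\n"]


def format_rows(sents, use_repr):
    """Build one 'id | text' row per sentence, optionally using repr()."""
    rows = []
    for sent_id, text in enumerate(sents, start=1):
        shown = repr(text) if use_repr else text
        rows.append(f"{sent_id}\t|\t{shown}")
    return rows


if __name__ == "__main__":
    # With repr(), the trailing space and the newline show up explicitly
    # inside the quotes instead of silently affecting the layout.
    for row in format_rows(sentences, use_repr=True):
        print(row)
```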
Are you still working on this? Otherwise I could have a look.
Hey @davidberenstein1957, sure, you can take a look at it. Let me know if you end up working on the recommendations suggested by @rmitsch above.
Here is an option for updating the factory method without making spaCy a hard requirement of pysbd:

PySBD component using Language.factory
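The linked code is not reproduced in the thread. As a hedged sketch of the general idea only, the component below is registered with @Language.factory when both spaCy and pysbd are importable, and the module still imports cleanly otherwise. The function names, the factory name pysbd_sentencizer (chosen to avoid clashing with the existing pysbd entry point), and the language config key are all assumptions, not the contributor's actual code.

```python
import importlib.util

# Check for the optional packages without importing them, so that loading
# this module never fails in a spaCy-free environment.
SPACY_AVAILABLE = (
    importlib.util.find_spec("spacy") is not None
    and importlib.util.find_spec("pysbd") is not None
)


def register_pysbd_factory():
    """Register a sentence-boundary factory with spaCy v3.

    Returns False (and does nothing) when spaCy or pysbd is not installed.
    Intended to be called once per process.
    """
    if not SPACY_AVAILABLE:
        return False

    import pysbd
    from spacy.language import Language

    # "pysbd_sentencizer" is a hypothetical factory name.
    @Language.factory("pysbd_sentencizer", default_config={"language": "en"})
    def create_pysbd_component(nlp, name, language):
        segmenter = pysbd.Segmenter(language=language, clean=False, char_span=True)

        def pysbd_component(doc):
            # Character offsets at which pysbd starts a new sentence.
            starts = {span.start for span in segmenter.segment(doc.text)}
            for token in doc:
                token.is_sent_start = token.idx in starts
            return doc

        return pysbd_component

    return True


if __name__ == "__main__":
    print("factory registered:", register_pysbd_factory())
```

After a successful registration, nlp.add_pipe("pysbd_sentencizer", first=True) on a blank pipeline would be the expected way to use it; whether the real proposal is structured this way is not stated in the thread.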