-
Notifications
You must be signed in to change notification settings - Fork 2
Feat: Improve chardict.py to add n-gram counts
#9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Warning format change to count n-gram occurences Can go up to large n-gram parametter by changing `NGRAM_MAX_LENGTH` constant. Switch from os module to pathlib to manage paths. Fixes OneDeadKey#7
e1a6466 to
cdc8b29
Compare
fabi1cazenave
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work. Sadly, for some reason it doesn’t work from the Makefile yet.
This JSON format change also requires some tweaks in merge.py as well, and they should be addressed in the same PR.
f69cf2a to
e7a27d5
Compare
|
Thinking out loud, I think the
And the |
I think it should have both.
Edit : I will work on
Does that seem right ? |
|
Let’s keep this PR as minimal as possible please. :-)
If you intend to keep the filename it the JSON data, then you can forget the corpora identification I’ve suggested for now. We’ll do that in another PR.
For this PR we just want to support the previous behavior, i.e. making an average of all input n-gram frequencies. |
|
Oh, I didn’t see you already did the job. Forget my previous comment, looking at your work right now. :-) |
fabi1cazenave
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running make does create the set of JSON files we expect in the /json directory but I’m worried by two things:
- the
namefield containsoutput— I’d suggest to drop thisnamefield for this PR - trigrams don’t seem to be sorted by frequencies, and they don’t seem to be rounded to a readable number of decimal digits either.
Besides that, I’m afraid the merge function introduces regressions:
- I’m not sure
mergeis still able to compute average frequencies of N input corpora (= our current need) because of its implementation - it has become too complex to use — I’ve left a comment about that.
I’d suggest to keep a single and simple merge function for this PR and implement other merge methods in a follow-up.
Last but not least, a type description of the new corpus format is needed. It can be a simple addition to the main docstring. (A proper mypy description would be awesome but I don’t know how to do this.)
99a06eb to
f69cf2a
Compare
…mpler and only implement current merge behaviour.
fabi1cazenave
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost there. This looks better and better, congrats! :-)
The two main remaining points are:
- the use of the
namefield as a supposedly unique key - the use of an empty list as a default argument.
Please rebase on the latest main branch when you’re done with my review comments.
| # get all n-grams | ||
| for ngram_start in range(len(txt)): | ||
| for ngram_length in range(NGRAM_MAX_LENGTH): | ||
| _ngram = get_ngram(txt, ngram_start, ngram_length) | ||
|
|
||
| if not _ngram: # _ngram is "" | ||
| continue | ||
|
|
||
| if _ngram not in ngrams[ngram_length]: | ||
| ngrams[ngram_length][_ngram] = 0 | ||
|
|
||
| ngrams[ngram_length][_ngram] += 1 | ||
| ngrams_count[ngram_length] += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works and the logic is simple (I like that), but I’m not a fan of how it’s implemented. If that’s okay with you:
- please address the rest of this review
- rebase on the latest
mainbranch (this PR is based on a previous version) - and I’ll add a commit to your PR to suggest another implementation of the same logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eager to see what changes you are thinking about.
I was pretty sure this was simple enough to be accepted but maybe it’s performance-wise that you would like to improve things ?
d63e3fb to
34b9633
Compare
… empty list argument)
|
thinking back about this pr, there is an inconsistency that bugs me :
I would tend to modify The json format would then be : is it a bad idea ? |
Warning format change to count n-gram occurences
Can go up to large n-gram parametter by changing
NGRAM_MAX_LENGTHconstant.Switch from os module to pathlib to manage paths.
Fixes #7