type hints and migrate to argparse. Simplify async setup. #60
Repsay wants to merge 6 commits into NetherlandsForensicInstitute:master
Hi, thanks for your PR! The input is very much appreciated. For our review process, it speeds things up if you separate your PR into 3 different PRs:
One note on type hinting: I am on the fence about supporting type hints in this case. But having type suggestions on your arguments is something I can agree with. Looking forward to reviewing your PRs!
Hey, I explored the possibility of splitting the request into multiple requests. However, the only part that can be separated from the current request without overhauling the entire codebase is the type hinting, as many of the other changes are interconnected and depend on each other.
| # Punct will be added using rules. | ||
| if len(token) > 1: | ||
| if tag != 'PUNCT' or tag != '.' or tag != '': | ||
| if tag != "PUNCT" and tag != "." and tag != "": |
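A minimal sketch of why the switch from `or` to `and` in this hunk matters. Only the two conditions come from the diff; the function names are hypothetical and not part of demeuk. With `or`, the test is true for every tag, because no single tag can equal all three excluded values at once.

```python
def passes_or_check(tag: str) -> bool:
    # Original condition: at least one of the three inequalities always
    # holds, so this returns True for every possible tag.
    return tag != "PUNCT" or tag != "." or tag != ""


def passes_and_check(tag: str) -> bool:
    # Fixed condition: True only when the tag is none of the excluded values.
    return tag != "PUNCT" and tag != "." and tag != ""


def passes_membership_check(tag: str) -> bool:
    # Equivalent to the fixed condition, written as a membership test.
    return tag not in {"PUNCT", ".", ""}


for tag in ("PUNCT", ".", "", "NOUN"):
    print(repr(tag), passes_or_check(tag), passes_and_check(tag))
```

The membership form is only a stylistic alternative; it behaves the same as the `and` chain.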
bin/demeuk.py
Outdated
| HASH_HEX_REGEX = '^[a-fA-F0-9]+$' | ||
| MAC_REGEX = '^([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})$' | ||
| UUID_REGEX = '^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$' | ||
| EMAIL_REGEX = r"[a-zA-Z0-9][a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]{0,63}@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z]{2,6})+" |
The local part validation is stricter, only allowing valid characters in accordance with standard email formats.
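As a rough illustration of what the stricter local part accepts and rejects, the sketch below applies the `EMAIL_REGEX` from the diff to a few sample addresses. Using `re.fullmatch` is an assumption made for this example; how demeuk itself applies the pattern (for instance with `search` on a longer line) is not shown in this conversation, and the helper name is hypothetical.

```python
import re

# Pattern copied from the diff above.
EMAIL_REGEX = r"[a-zA-Z0-9][a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]{0,63}@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z]{2,6})+"

EMAIL_PATTERN = re.compile(EMAIL_REGEX)


def looks_like_email(candidate: str) -> bool:
    # Hypothetical helper: the whole string must match the pattern.
    return EMAIL_PATTERN.fullmatch(candidate) is not None


samples = [
    "john.doe@example.com",   # plain address
    "a@b.co",                 # shortest shape the pattern allows
    ".john@example.com",      # local part must start with a letter or digit
    "user name@example.com",  # space is not in the allowed character set
    "john@example",           # no top-level domain
]
for sample in samples:
    print(sample, looks_like_email(sample))
```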
| def add_split(line, punctuation=(' ', '-', r'\.')): | ||
| def add_split( |
Added more possible splits by default
| false if line does not match regex | ||
| """ | ||
| for regex in regex: | ||
| for regex in regexes: |
| # Single byte encodings. Also it is better to not include iso encoding by default. | ||
| # https://en.wikipedia.org/wiki/Character_encoding#Common_character_encodings | ||
| # Input_encoding is by default [utf8] | ||
| fallback_encodings = [ |
I have added some comments on the changes that are separate from the migration to argparse and the async rework.
Thanks, I am trying to find a way to review your PR as thoroughly and quickly as possible, so the comments help.
I will give it a try, but be prepared for a lot of comments. I am considering splitting your PR myself. If we start with argparse and follow up from there, I think it is doable.
…with special characters
…both directions without duplicates to lines to avoid redoing the same work in the future.
First part of NetherlandsForensicInstitute#60, which is being split into parts. This part only adds type hinting and makes the code PEP 8 compliant, with small improvements and small bug fixes. It also stops changing global variables.
Some small optimizations. Added fallback encodings. De-nested a lot of code in clean_up. Split bytes processing from string processing to reduce unnecessary re-encoding and decoding. Use continue instead of stop to speed up processing. Instead of changing the locale, use it directly when opening the file. Part of NetherlandsForensicInstitute#60
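The fallback-encoding idea mentioned in this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: the function name, its signature, and the contents of the fallback list are hypothetical, since the actual `fallback_encodings` list is cut off in the diff. Only the default input encoding of utf8 comes from the quoted comment.

```python
from typing import Optional, Sequence, Tuple


def decode_with_fallback(
    raw: bytes,
    input_encodings: Sequence[str] = ("utf8",),      # default named in the diff comment
    fallback_encodings: Sequence[str] = ("cp1252",),  # hypothetical example list
) -> Tuple[Optional[str], Optional[str]]:
    """Return (text, encoding) for the first encoding that decodes raw cleanly."""
    for encoding in (*input_encodings, *fallback_encodings):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # This encoding cannot represent the bytes; try the next one.
            continue
    # No encoding worked; the caller decides what to do with the line.
    return None, None


print(decode_with_fallback(b"caf\xc3\xa9"))  # valid UTF-8
print(decode_with_fallback(b"caf\xe9"))      # invalid UTF-8, decodable as cp1252
```

Trying the preferred encodings first keeps valid UTF-8 input from being misread by a permissive single-byte encoding.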
I have been working with my own improved version of demeuk. This version uses argparse for arguments and makes some small improvements to the line checks by using continue instead of stop. It also uses Manager and Process to simplify the async setup. If anything needs to change before this is pushed to master, please let me know.
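To illustrate the "continue instead of stop" pattern described here, the sketch below rejects a line at the first failing check and moves straight on to the next line, rather than carrying a stop flag through the remaining checks. The function name, the specific checks, and the length limits are hypothetical and chosen only for the example; they are not demeuk's actual checks.

```python
from typing import Iterable, List, Tuple


def filter_lines(
    lines: Iterable[str], min_length: int = 1, max_length: int = 10
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split lines into kept lines and (line, reason) pairs for dropped ones."""
    kept: List[str] = []
    dropped: List[Tuple[str, str]] = []
    for line in lines:
        if len(line) < min_length:
            dropped.append((line, "too short"))
            continue  # skip the remaining checks for this line
        if len(line) > max_length:
            dropped.append((line, "too long"))
            continue
        if not line.isprintable():
            dropped.append((line, "non-printable character"))
            continue
        kept.append(line)
    return kept, dropped


kept, dropped = filter_lines(["", "password", "a" * 11, "bad\x00"])
print(kept)
print(dropped)
```

Each check exits early, so the loop body stays flat and later checks never run on a line that has already been rejected.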