Skip to content

type hints and migrate to argparse. Simplify async setup.#60

Open
Repsay wants to merge 6 commits intoNetherlandsForensicInstitute:masterfrom
Repsay:type-hints-and-improvements
Open

type hints and migrate to argparse. Simplify async setup.#60
Repsay wants to merge 6 commits intoNetherlandsForensicInstitute:masterfrom
Repsay:type-hints-and-improvements

Conversation

@Repsay
Copy link

@Repsay Repsay commented Aug 27, 2024

I have been working with my own improved version of demeuk. This version uses argparse for arguments and made some small improvements in the checking of the lines by using continue instead of stop. Also made use of Manager and Process to simplify the async setup. If any things need to change before push to master please let me know.

@GingerGeneste
Copy link
Contributor

Hi, thanks for your PR! The input is very much appreciated.

For our review process, it speeds things up if you separate your PR into 3 different PRs:

  • docopt -> argparse (wich is awesome btw!)
  • typehinting
  • async setup

One note on typehinting: I am kind of on the fence for supporting typehinting in this case. But having type-suggestions in your arguments is something I can agree with.
One note on async setup: I have not looked into detail of the async setup but we do like to keep the stdin functionality.

Looking forward to review your PRs !

@Repsay
Copy link
Author

Repsay commented Aug 27, 2024

Hi, thanks for your PR! The input is very much appreciated.

For our review process, it speeds things up if you separate your PR into 3 different PRs:

  • docopt -> argparse (wich is awesome btw!)
  • typehinting
  • async setup

One note on typehinting: I am kind of on the fence for supporting typehinting in this case. But having type-suggestions in your arguments is something I can agree with. One note on async setup: I have not looked into detail of the async setup but we do like to keep the stdin functionality.

Looking forward to review your PRs !

Hey, I explored the possibility of splitting the request and changes into multiple requests. However, the only thing that can be separated from the current request without overhauling the entire codebase is the type hinting, as many of the changes are interconnected and dependent on each other.

# Punct will be added using rules.
if len(token) > 1:
if tag != 'PUNCT' or tag != '.' or tag != '':
if tag != "PUNCT" and tag != "." and tag != "":
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug

bin/demeuk.py Outdated
HASH_HEX_REGEX = '^[a-fA-F0-9]+$'
MAC_REGEX = '^([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})$'
UUID_REGEX = '^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$'
EMAIL_REGEX = r"[a-zA-Z0-9][a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]{0,63}@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z]{2,6})+"
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The local part validation is stricter, only allowing valid characters in accordance with standard email formats.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixes #20



def add_split(line, punctuation=(' ', '-', r'\.')):
def add_split(
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added more possible splits by default

false if line does not match regex
"""
for regex in regex:
for regex in regexes:
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug or at least not clear usage

# Single byte encodings. Also it is beter to not include iso encoding by default.
# https://en.wikipedia.org/wiki/Character_encoding#Common_character_encodings
# Input_encoding is by default [utf8]
fallback_encodings = [
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added fallback encoding

@Repsay
Copy link
Author

Repsay commented Aug 27, 2024

I have added some comments on the changes that are different then the migration of argparse and async.

@GingerGeneste
Copy link
Contributor

I have added some comments on the changes that are different then the migration of argparse and async.

Thanks, I am trying to find a way to review your PR as functional and quickly as possible so the comments help.

@GingerGeneste
Copy link
Contributor

Hi, thanks for your PR! The input is very much appreciated.
For our review process, it speeds things up if you separate your PR into 3 different PRs:

  • docopt -> argparse (wich is awesome btw!)
  • typehinting
  • async setup

One note on typehinting: I am kind of on the fence for supporting typehinting in this case. But having type-suggestions in your arguments is something I can agree with. One note on async setup: I have not looked into detail of the async setup but we do like to keep the stdin functionality.
Looking forward to review your PRs !

Hey, I explored the possibility of splitting the request and changes into multiple requests. However, the only thing that can be separated from the current request without overhauling the entire codebase is the type hinting, as many of the changes are interconnected and dependent on each other.

I will give it a try but behold for a lot of comments. I am considering splitting your PR myself. If we start with argparse and follow up from there i think it is doable.

@GingerGeneste GingerGeneste self-requested a review August 27, 2024 11:09
Repsay added a commit to Repsay/demeuk that referenced this pull request Mar 12, 2025
First part of NetherlandsForensicInstitute#60 to split in parts. First just adding type hinting and making PEP8 compliant. Small improvements and fixing small bugs. Removing changing global variables.
@Repsay Repsay mentioned this pull request Mar 12, 2025
Repsay added a commit to Repsay/demeuk that referenced this pull request Mar 12, 2025
Some small improvements for optimizations. Added fallback encodings. de-nest lot of code in clean_up. split between bytes processing and string processing to reduce unnecessary re-encoding/decoding. Instead of stop use continue to speed up processing. Instead of changing locale use it directly in opening the file. Part of NetherlandsForensicInstitute#60
This was referenced Mar 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants