Skip to content

Conversation

@vsoch
Copy link
Member

@vsoch vsoch commented Mar 28, 2022

this optimization does not check duplicate urls between workers. So we might see
a tiny speedup given redudant urls across files! If this works well, it will be a new urlchecker release to replace the one in #768 .

Yeah, had a lot of fun this weekend and it still seems to be going strong! 😆

Signed-off-by: vsoch vsoch@users.noreply.github.com

Description

Motivation and Context

Checklist:

  • I have posted the link for the PR in the usrse slack (#website) to ask for reviewers
  • I have previewed changes locally
  • I have updated the README.md if necessary

cc @usrse-maintainers

vsoch added 2 commits March 28, 2022 05:22
this optimization does not check duplicate urls between workers. So we might see
a tiny speedup given redudant urls across files!

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch
Copy link
Member Author

vsoch commented Mar 28, 2022

@SuperKogito looks like we have a bug where it's not honoring the exclude list - I'll take a look later today!

@SuperKogito
Copy link

oh okay 👍 I will go over the code again later too

@vsoch
Copy link
Member Author

vsoch commented Mar 28, 2022

Actually just found the bug! Going to rebuild now and run again!

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch
Copy link
Member Author

vsoch commented Mar 28, 2022

okay just ran it thrice - 38 seconds 46 and 53 seconds (the second with a failure). For the other PR I saw 36 and 44 seconds. So arguably the change is trivial, at least for the repository here. I think this particular speedup would be more apparent for webby content with huge duplication between files! For our USRSE site I think we have tiny duplication, but it likely doesn't extend beyond the time of the longest worker.

Overall, I'm not convinced this is an improvement - the time it takes to parse the files twice seems to make it equal if not worse! Let's keep the PR open for now, hopefully we can merge #768 and get good testing of our new multiprocessing, and given how that does over a few weeks we can go from there for next steps.

Going back to bed! Chat later, thanks for the fun :)

@vsoch vsoch marked this pull request as draft March 28, 2022 11:47
@vsoch vsoch closed this May 10, 2022
@danielskatz danielskatz deleted the test/unique-url-optimization branch December 6, 2022 18:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants