Conversation

@kwei-zhang (Contributor) commented Mar 21, 2025

Resolved the issue by adding retries when a 429 or 500+ status is returned.

@kwei-zhang marked this pull request as draft March 21, 2025 23:14
@kwei-zhang marked this pull request as ready for review March 21, 2025 23:14
@benoit74 (Collaborator) commented

@kwei-zhang

  • how did you test that this really solves the issue?
  • doesn't this somehow conflict with the _get_retry_adapter, which is already configured?
  • I don't think that retrying on all error codes is a good idea; many errors are not retryable at all (e.g. 404). Our problem is with 429, so I would rather retry only on 429 for now.

@kwei-zhang (Contributor, Author) commented

@kwei-zhang

  • how did you test that this really solves the issue?
  • doesn't this somehow conflict with the _get_retry_adapter, which is already configured?
  • I don't think that retrying on all error codes is a good idea; many errors are not retryable at all (e.g. 404). Our problem is with 429, so I would rather retry only on 429 for now.

Hi, thanks for the feedback!

This is the error report before adding retries:

ifixit2zim --output=output --language="zh"
...
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://zh.ifixit.com/Guide
[MainThread::2025-03-22 14:37:10,548] WARNING:Not supposed to add a redirect for a home item
[MainThread::2025-03-22 14:37:10,549] ERROR:Interrupting process due to error
Traceback (most recent call last):
  File "/Users/kaiwei/ZJN/code/kiwix/ifixit/src/ifixit2zim/scraper.py", line 362, in run
    scraper.scrape_items()
  File "/Users/kaiwei/ZJN/code/kiwix/ifixit/src/ifixit2zim/scraper_generic.py", line 173, in scrape_items
    raise FinalScrapingFailureError(
ifixit2zim.exceptions.FinalScrapingFailureError: Too many homes failed to be processed: 1
[MainThread::2025-03-22 14:37:10,550] INFO:Cleaning up

This error only occurs while fetching the list of guides (https://zh.ifixit.com/api/2.0/guides?limit=200&offset=xx). After adding retries, the scraper runs for around 5 minutes after finishing the fetching, and it works perfectly fine.


Sorry about that, I didn't notice it. I was following the design of another function:

    @backoff.on_exception(
        backoff.expo,
        requests.exceptions.RequestException,
        max_time=16,
        on_backoff=backoff_hdlr,
    )
    def get_api_content(self, path, **params):
        ...
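
As an aside, benoit74's "retry only on 429" point could be expressed with this same backoff decorator through its giveup hook. This is a minimal sketch, not the code in this PR; the base URL, function name, and body are illustrative:

    import backoff
    import requests

    API_URL = "https://zh.ifixit.com/api/2.0"  # illustrative base URL

    def giveup_unless_429(exc: Exception) -> bool:
        # Give up immediately on any HTTP error other than 429;
        # connection/read errors (no response attached) stay retryable.
        if isinstance(exc, requests.exceptions.HTTPError) and exc.response is not None:
            return exc.response.status_code != 429
        return False

    @backoff.on_exception(
        backoff.expo,
        requests.exceptions.RequestException,
        max_time=16,
        giveup=giveup_unless_429,
    )
    def get_api_content(path, **params):
        response = requests.get(f"{API_URL}{path}", params=params)
        response.raise_for_status()  # raises HTTPError on 4xx/5xx
        return response.json()

With this, a 404 fails immediately, while a 429 keeps being retried with exponential backoff until max_time is exhausted.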

Besides that, I think there might be some issue with the _get_retry_adapter function. The function looks like:

def _get_retry_adapter(
    max_retries: Optional[int] = 5,
) -> requests.adapters.BaseAdapter:  # pyright: ignore
    retries = requests.packages.urllib3.util.retry.Retry(  # pyright: ignore
        total=max_retries,  # total number of retries
        connect=max_retries,  # connection errors
        read=max_retries,  # read errors
        status=2,  # failure HTTP statuses (only those below)
        redirect=False,  # don't fail on redirections
        backoff_factor=30,  # sleep factor between retries
        status_forcelist=[
            413,
            429,
            500,
            502,
            503,
            504,
        ],  # force retry on the following codes
    )

    return requests.adapters.HTTPAdapter(max_retries=retries)  # pyright: ignore

but in the urllib3 documentation it seems that backoff_factor is in seconds and will cause the program to sleep for backoff_factor * (2 ** retry_count), which means our program will sleep for [30s, 60s, 120s, ...]. I wonder if this is a bit too long?
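
To make the arithmetic concrete, here is a back-of-the-envelope sketch (not project code) of the sleep series that formula produces; note that exact urllib3 behavior varies slightly across versions, and each individual sleep is capped (Retry.DEFAULT_BACKOFF_MAX defaults to 120 s in urllib3 2.x):

    # Sleep durations implied by backoff_factor * (2 ** retry_count)
    backoff_factor = 30
    sleeps = [backoff_factor * (2 ** n) for n in range(5)]
    print(sleeps)  # [30, 60, 120, 240, 480] seconds, before any cap applies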


It should now be fixed 😄

@benoit74 (Collaborator) commented

which means our program will sleep for [30s, 60s, 120s, ...]. I wonder if this is a bit too long?

This is in fact intentional. When a server sends a 429 error or times out on connection, retrying only a few milliseconds or seconds later has proven not to work at all. We need to let the upstream server "breathe". A full ifixit scraper run typically takes hours, so if on some occasions we need to wait 30s because the server was overwhelmed, it is fine.

I will review your changes
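
As a side note (a sketch assuming the scraper keeps using urllib3's Retry, not code from this PR): urllib3 honors a server-sent Retry-After header by default, so when the server answers a 429 with an explicit wait time, that value is used for the sleep instead of the computed exponential backoff.

    from urllib3.util.retry import Retry

    # respect_retry_after_header=True is urllib3's default, shown explicitly:
    # for 429 (and 503) responses carrying a Retry-After header, the sleep
    # uses the server-requested duration rather than the exponential value.
    retries = Retry(
        total=5,
        status_forcelist=[429],
        backoff_factor=30,
        respect_retry_after_header=True,
    )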

@benoit74 (Collaborator) left a comment

A few more changes, thank you

@benoit74 (Collaborator) commented

Also, please fix your first comment so that the PR is properly linked to the issue: https://docs.github.com/en/issues/tracking-your-work-with-issues/using-issues/linking-a-pull-request-to-an-issue
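
(For instance, including a closing keyword such as "Fixes #<issue-number>" in the PR description makes GitHub link the issue and close it automatically when the PR is merged; the issue number is a placeholder here.)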

@benoit74 self-requested a review March 25, 2025 08:42
@benoit74 (Collaborator) left a comment

LGTM, please:

I will merge it right after

@kwei-zhang (Contributor, Author) commented

Thanks for the review! I have changed the PR message; can you please confirm whether it is linked properly?

@benoit74 (Collaborator) left a comment

LGTM

@benoit74 merged commit ec4cb27 into openzim:main Mar 27, 2025
5 checks passed

Successfully merging this pull request may close these issues:

Problem while scraping home page of zh website