Skip to content

Conversation

@evgeny-leksikov
Copy link
Contributor

What

fix discarding from pending on failed lane

Why ?

bugfix

How ?

  • increase completion counter only when flush cancel operation has been posted
  • purge pending for discarding lane since discard op can be on pending

@evgeny-leksikov evgeny-leksikov requested a review from dmitrygx June 11, 2021 12:23
@evgeny-leksikov evgeny-leksikov force-pushed the ucp_ep_close_err_handler branch from 844990b to 4e65d55 Compare June 21, 2021 10:40
if (ucp_worker_is_uct_ep_discarding(worker, uct_ep)) {
ucs_debug("UCT EP %p is being discarded on UCP Worker %p",
uct_ep, worker);
uct_ep_pending_purge(uct_ep, ucp_ep_err_pending_purge,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why needed?
UCT EP should be purged as a part of discarding procedure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discard itself can be on pending

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discard itself can be on pending

but why we need to remove discarding from the pending?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be removed by discarding

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it does not remove, ucp_ep_err_pending_purge does ucp_request_send_state_ff which posts flush cancel again, that's the fix

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add comment to describe why it's needed

@evgeny-leksikov evgeny-leksikov requested a review from yosefe June 22, 2021 20:35
if (ucp_worker_is_uct_ep_discarding(worker, uct_ep)) {
ucs_debug("UCT EP %p is being discarded on UCP Worker %p",
uct_ep, worker);
uct_ep_pending_purge(uct_ep, ucp_ep_err_pending_purge,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add comment to describe why it's needed

@evgeny-leksikov evgeny-leksikov requested a review from yosefe June 24, 2021 05:31
Comment on lines 683 to 684
* UCS_ERR_NO_RESOURCES, so need to purge the queue to resubmit the
* operation */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it's "abort the operation", not resubmit, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ucp_request_send_state_ff, in case of discard, has to re-submit the operation to avoid reordering, flush cancel must be completed last otherwise we can get error WQE when lanes are destroyed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add to comment:
/* We need to resubmit the FLUSH_CANCEL operation on the same failed lane, in order to make sure all previous outstanding operations are completed before destroying the failed endpoint */

@evgeny-leksikov evgeny-leksikov force-pushed the ucp_ep_close_err_handler branch from 8ba56db to 8bfb149 Compare June 24, 2021 12:14
@evgeny-leksikov evgeny-leksikov requested a review from yosefe June 25, 2021 08:15
@changchengx
Copy link
Contributor

port req->send.state.uct_comp.count = 0 yosefe#223
It also need to port #7332 to yosefe#223 at the same time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants