Skip to content

Conversation

@olliw42
Copy link
Owner

@olliw42 olliw42 commented May 24, 2024

this is a dev/test branch for working on retransmission

this one is simplified in two ways, to not make it too complicated and facilitate easier testing:

  • retransmission handling only for receiver->transmitter link direction. That's the more complicated direction also (entangles with flow control). Once this is bugged out adding tx->rx direction should be easy.
  • lossles transmission, i.e., a frame is send for as long as it is not acked. For low LQ this is bad. Eventually we should go to a scheme there it does only one retransmit attempt, but so it's easier to get going.

@brad112358 this might interest you, given you showed interest in this before. I actually would much appreciate you cheking it out; your strength in finding the little loopholes/bugs would be useful :)

anyone else is of course also massively welcome !!!!!

@olliw42
Copy link
Owner Author

olliw42 commented May 24, 2024

a pic I made to help me
mlrs-arq

@brad112358
Copy link
Contributor

brad112358 commented May 26, 2024

Thanks for including the very helpful and, I think clear, diagram. So far, I've only looked at that, but I already have a few comments/observations/suggestions.

  1. You seem to be sending back only ack or nack and not the received sequence number. I believe this is not optimal as we observe below.

  2. Sequence number 5 shows sub-optimal behavior which is a direct result of 1). In the 8th response, we send a nack with no indication of which sequence number was lost, even though, we have previously received and accepted the frame which was lost.

  3. Instead of responding with just ack or nack, it would be better to send back a received sequence number and to always just acknowledge the last successfully received data. In other words, it doesn't matter much that we haven't received the most recent message. What matters most is what was the last message we successfully did receive. This change would have eliminated an unnecessary retransmission for sequence number 5.

  4. The diagram doesn't specify the size of the sequence number field, but they seem to be at least 3 bits. I think the sequence numbers don't need to be so large. In fact, I think they can be 1 bit. This is because we have alternating transmission direction with no chance of out of order reception. This reduces the overhead for the sequence number and has the added advantage that responding with the last received sequence number rather than ack or nack still only requires 1 bit.

  5. What I describe above is actually the retransmission method ELRS uses (last time I checked) and I think it is probably optimal in terms of overhead. They can only use it in one direction because they don't send an equal number of frames in both directions, but we can use this method in both directions.

  6. There might be some small utility to using sequence numbers larger than 1 bit or to also send the ack/nack flag in terms of allowing each side to more accurately estimate the loss in both directions.

  7. I agree we should limit the number of retransmissions. Even with 1 bit sequence numbers, I think we can limit the number of retransmissions to any value we like as long as both sides agree what the limit is (both sides would need a counter for each direction).

Does this make sense to you? Am I missing something?

@olliw42
Copy link
Owner Author

olliw42 commented May 26, 2024

many thx for the comments

concerning all points related to seq not 1 bit: YES, 1 bit is fully sufficient for a basic mechanism. Eventually we may want to change. However, for purely historical reasons the seq number just happens to be 3 bit, and at this point I see no reason to change that. We know we could/will. Not a relevant point IMHO :)

there is btw always a situation which is not optional. See e.g seq 3

the protocol is just the standard protocol as on any web page (there are various names for it, so no name here)(all 1 bit). There are two versions, those which send an ack, and those which send the next desired number. The main challenge is handling the various non-protocol related states, like disconnects, frames which carry commands not serial data, etc. pp. Hence these states, and I figured that the send ack version is easier to handle these states.

it's possible that with > 1 bit one may reduce few edge cases

the problem with 5 is probably solved by using the other method, to send the next desired no. Hm. Maybe I shoud convert to that.
EDIT: yes, I think that's what we should go to ...

@brad112358
Copy link
Contributor

Can you elaborate a bit on what cases in mLRS the ack/nack is better than sending the sequence number of the last received frame? (You are faster at edits than I am at replys) If I understand your edit correctly, that was my point. Is there any practical difference between sending the next desired number vs the last received number?

@brad112358
Copy link
Contributor

brad112358 commented May 26, 2024

Also, 3 is only sub-optimal if you allow for sending ahead of acknowledged (whether by sequence number or ack) frames which requires more buffering and can result in out of order delivery.

I think such type of protocols are out of question

@olliw42
Copy link
Owner Author

olliw42 commented May 26, 2024

sorry, I edited your post ... grrrr, this damed github, not the first time I stepped into ths trap

@olliw42
Copy link
Owner Author

olliw42 commented May 26, 2024

Is there any practical difference between sending the next desired number vs the last received number?

the book keeping and state handling looked easier to me

@brad112358
Copy link
Contributor

the book keeping and state handling looked easier to me

To me, the opposite. Sending back the last received sequence number means you just check if the acknowledged sequence number matches what you last sent. If so, move on, if not, retransmit. The other way seems to require you to add when you respond and subtract when you compare. But, I suppose it's mostly a matter of how you think about it. Either way, there is not much state involved except for deciding when to give up on retransmission of a given frame.

Many of the algorithms found online and in text books are intended for more complex systems which don't just ping-pong messages at a constant interval like we do, so a very simple method can be optimal for us if we rule out buffering more than the most recent transmission as you have.

@olliw42
Copy link
Owner Author

olliw42 commented May 26, 2024

I was trying this initially, but then concluded for the ack, I will retry, could be that at the early stages I also had too much of how to abstract the code in mind. I didn't had sorted out the states initially. Anyway, it has benefits, so whatever what it's going to be so :)

@olliw42
Copy link
Owner Author

olliw42 commented May 27, 2024

@brad112358
so, changed it now to send the last received seq no as ack, instead of ack/nack-ing reception

seemingly works, in that it connects etc. pp, but the symptom described by @jlpoltrack also exists, i.e. MP shows lost packets ... so, appears something is still not working as expected ...

here the time plan for the changed protocol
(note, ack is only 1 bit, seq is 3 bits)

grafik

@olliw42
Copy link
Owner Author

olliw42 commented May 28, 2024

with

#define USE_ARQ_TX_SIM_MISS 9 //9
#define USE_ARQ_RX_SIM_MISS 5 //5

I do see continuous packet losses in MP, MP tells pretty stably 95-96% link quality, so around every 20iest packet is lost

the mLRS LQ metric on the OLED display tells something around 75% ... not sure if that means that the mechanism is helping

I do my tests btw in 19 Hz mode (with a 2.4 GHz system)

@brad112358
Copy link
Contributor

And you didn't add a retransmission limit yet? So something must be wrong if MP is correct. When I get some time, I'll try to reproduce the problem with QGC.

@brad112358
Copy link
Contributor

Do you use Bluetooth or UDP or TCP WiFi or wired serial for the GCS connection?

@olliw42
Copy link
Owner Author

olliw42 commented May 28, 2024

And you didn't add a retransmission limit yet?

yes, no retransmission limit

So something must be wrong if MP is correct.

yes :)

Do you use Bluetooth or UDP or TCP WiFi or wired serial for the GCS connection?

wired serial connection, from tx serial via usb-ttl to PC

one potential source of problem which I have not yet ruled out is that the stream flow control isn't good enough, so that AP sends too many messages, so that some are dropped every once in a while ... I'm using 19 Hz, so there is some restriction.
It's also curious that 95% is close to 1/19 ... can't see though how that could correlated.
Some more tests should clear up some speculations.

@brad112358
Copy link
Contributor

Looks like that fixed it. QGC is now reporting 0 lost messages

@olliw42
Copy link
Owner Author

olliw42 commented May 28, 2024

Looks like that fixed it. QGC is now reporting 0 lost messages

YES :) @jlpoltrack made the relevant comment

not sure you also follow the discussion at discord

@brad112358
Copy link
Contributor

brad112358 commented May 28, 2024

Well, It ran well for a while and then it started dropping a lot of messages and it seemed to get worse over time. I've power cycled both ends of the link (one at a time) and restarted the GCS, but it hasn't recovered. I'm not sure what happened.

@brad112358
Copy link
Contributor

I had the baud rate too high for the crap R9M inverter with weak pullup. Working fine at 115200 serial speed on the Tx

@olliw42
Copy link
Owner Author

olliw42 commented Jun 11, 2024

I think that the problem is that the whole retry thing doesn't work in the first place, and we are always struck by the issue that the seq is only 3 bit and the ack only 1 bit. Not talking about framelost/reset. I come to the conclusion, we can't ensure that we always do the right thing, once we have failure sequences which are longer than 8 times, the seq_no wraps and we have an issue. So, the only way out is to kind of model the behavior of the other side, and from that reconstruct the actual seq_no. It's in some way what your counter does. The seq_no would have to be longer than the number of frames for disconnect.
I must admit I didn't anticipate how difficult this "catch the frame losses/reset" thing would be.
I also admit that your added byte would resolve that easily. Issue with this is that the code to fill that byte would be pretty nasty.
So, we can have nasty code here or there.

For the moment I did option 1. See next push.

Maybe we also simply need to come to the conclusion that if we have long failure sequences, that we simply have to accept that we can't recover everything to 100 %. I mean, long failure sequences mean pretty low LQ anyway. I think experience is anyway that below ca 40% the link is pretty "erratic" anyway, in the sense that it's not like smoothly around a stationary value. Maybe we just want too much.

@olliw42
Copy link
Owner Author

olliw42 commented Jun 11, 2024

another point:
there are two optins for the returned seq_no. (a) the recipient can return which seq_no it has received, (b) the recipient can return which next seq_no it wants to have. (both are different from the inital ack/nack)
Currently it's (a). I looked at both schemes and concluded they are equal. But I haven't taken into account the frame lost detection. And maybe my conclusion is also wrong. I.e., maybe one should think through that procedure.
just to have mentioned that.

@brad112358
Copy link
Contributor

I don't think your (a) or(b) above can have any real impact on lost frame detection since they convey exactly the same information and either value can trivially be calculated from the other.

With regard to lost frame detection, I agree that finding a workable method depends on the behavior of the sending side so we need to nail that down first.

On the sending side, I think we will want more than a simple fixed limit on the number of retries for each frame.

  1. The two cases where we resend a frame are a) loss of the frame which would have contained the ack and b) receipt of an ack for the older (not most recently sent) frame (call it nac).
  2. When the ack is lost (a), we have no information to tell us if the last frame we sent was received so we resend, but we should probably have a small limit on the number of retries in this case because it is still possible that the frame was already received and we don't want to totally cripple message flow in one direction just because the other direction is unreliable. I would suggest that one or two retries may be all we want to do in this case.
  3. When we do receive an ack which indicates the last frame we sent was lost (b) (nac), the limit on retries should be larger because we know we are not just wasting bandwidth by resending. But, there should still be a limit on the number of retries because the data contained in the frame gets more stale with each resend and eventually it is more useful to give up and send a newer message if we have one waiting to be sent. I'm thinking something on the order of 1 second worth of retries at the most. Or, perhaps how long we try should depend on how many messages are waiting behind it. Perhaps we should even discard more than one message if there are many messages waiting so we send fresher information.
  4. So I think we need to increment two retry counters when we send a frame, one which is reset when we receive any message (ack or nac) and the other which is reset only when we receive an ack (for the last frame sent).
  5. When we give up on retry and drop a frame, if it's because of a nac, we should also flush any partial MAVLink message and start the next frame with a new MAVLink message. If we gave up because of a lost ack, we should not flush because the frame we are discarding might have been received so we should continue to send the rest of the MAVLink message.

Does this sound correct?

@olliw42
Copy link
Owner Author

olliw42 commented Jun 11, 2024

interesting and creative idea, but I think it's way too complicated really. It also doesn't solve the main issue, AFAICS.
I think the general strategy to have just 1 retry strikes a pretty good balance, is simple and also converges well into low LQ area. Doing few more retries when LQ is high or for the high speed link I think is also acceptable, but even here I would think one has to ask if it's worth the added complexity. I would think that the factual improvement in terms of user experience of these super sophisticated schemes will be minor. The big step is going to have one retry at all, IMHO. Retrying for 1 sec is out of my mind.

As regards frame loss, with the simple option 1. added we cover most situations well. It's failures really matter only in very low LQ situations, and here things are not nice anyway.

So, I guess I consider the topic largely "solved", i.e., my focus shifts towards doing more testing of the code.

I'm not sure you could find this agreeable.

My mind is also shifting towards adding a full reparsing to the mavlink parser, as this would resolve the frame loss issue completely. In fact, my mind is slightly different. Since quite some while, I would want to change the transmission such that the mavlink frame marker is not at the end of a mlrs frame payload, or split across two in the case of the two byte frame marker. Unfortunately, because of all the parsing and fifos etc it's not so easy to implement this. As a side effect of that it would also become easy to do the frameloss detection, pretty much like your old idea, namely simply scan the payload for the marker, compare to the parser expectation, and reset if no match. So, my mind is currently on the marker-not-at-end thing :)

@olliw42
Copy link
Owner Author

olliw42 commented Jun 13, 2024

@brad112358
I've pushed a code addition, which currently is nfc, but shows an idea which crossed my mind, to solve the parser reset thing

your old extra byte idea was to know where the next frame marker is, and compare to the parser state, and from that work out if something is missed.
I mentioned in the above the idea, that the same could - in principle - be achieved by scanning the payload for a marker, and then do the same. Problem is a bit that the mavlinkX two-byte marker can be split accross frames (the normal mavlink marker is useless as it is just one byte).
However, I know realized that this issue isn't really an issue because the mavlinkX parser does retract the header. Which means that if the parser is in the state of scanning the data in the mavlink frame, the marker is safely not split across frames. Likewise, if it is split, we can let the parser do it's job. Long story short, we only need to scan the data area of the mavlink frame for whether where is a marker.
The added code does that. It doesn't currently do the actual reset, but the comments say where this should happen.
Wanted to show that for you to see it, and possibly have and opinion on it :)

olliw42 added 5 commits June 20, 2024 06:07
# Conflicts:
#	mLRS/Common/arq.h
#	mLRS/CommonRx/mlrs-rx.cpp
#	mLRS/CommonTx/mlrs-tx.cpp
@olliw42
Copy link
Owner Author

olliw42 commented Jun 22, 2024

merged to main

@olliw42 olliw42 marked this pull request as draft June 22, 2024 16:16
@julled
Copy link

julled commented Jan 12, 2026

interesting work! looking forward to have at mainline at some point

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants