Skip to content

Conversation

@ansiwen
Copy link

@ansiwen ansiwen commented Oct 12, 2025

The old ARP cache implementation is very simple, and keeps entries of reachable but unused neighbors forever in the cache. Also it never probes if entries are still valid before the timeout is reached, which is by default a very long 20 minutes. Combined with the fact that gratuitous ARP requests for existing entries are accepted, this makes a simple DoS attack possible, that only recovers after these 20 minutes.

This is an improved implementation of the ARP cache logic, which handles dynamic network changes a lot better. It is inspired by the NUD of IPv6, but a lot simpler. It only introduces two more cache entry states: Stale and Probing. After the refresh interval (default: 1 minute) a Dynamic entry goes into Stale state. If an entry changes its MAC address because of a gratuitous ARP request, it goes into Stale immendiately. If a Stale entry is queried, it goes into Probing state, where it sends ARP requests similar to the Pending state, but as unicasts to the last known MAC address. If the probing is not successful, the entry is removed and further queries cause a normal broadcast request, otherwise it is replaced with a fresh Dynamic entry. If a Stale entry reaches timeout age (default 20 minutes), because it never has been queried, it is removed without further probing.

Compared to the old implentation the default values create on the happy path only one more unicast ARP request per minute and ARP cache entry.

Depends on #36

@ansiwen
Copy link
Author

ansiwen commented Oct 12, 2025

Here is how the now fast recovery from a bogus gratuitous ARP request looks like. The ping that is running in parallel only lost 9 packets:

2025-10-12T21:59:45-00:00: [WARNING] [ARP] MAC address for 192.168.1.100 moved from 2c:cf:67:3c:fe:4b to 2c:ff:ff:ff:ff:ff, marked as stale
2025-10-12T21:59:45-00:00: [DEBUG] [ARP] request for stale entry 192.168.1.100/2c:ff:ff:ff:ff:ff --> Probing
2025-10-12T21:59:46-00:00: [DEBUG] [ARP] sending unicast ARP probe for 192.168.1.100/2c:ff:ff:ff:ff:ff (5 left)
2025-10-12T21:59:47-00:00: [DEBUG] [ARP] sending unicast ARP probe for 192.168.1.100/2c:ff:ff:ff:ff:ff (4 left)
2025-10-12T21:59:49-00:00: [DEBUG] [ARP] sending unicast ARP probe for 192.168.1.100/2c:ff:ff:ff:ff:ff (3 left)
2025-10-12T21:59:50-00:00: [DEBUG] [ARP] sending unicast ARP probe for 192.168.1.100/2c:ff:ff:ff:ff:ff (2 left)
2025-10-12T21:59:52-00:00: [DEBUG] [ARP] sending unicast ARP probe for 192.168.1.100/2c:ff:ff:ff:ff:ff (1 left)
2025-10-12T21:59:53-00:00: [DEBUG] [ARP] unicast ARP probe timeout after 5 retries for 192.168.1.100/2c:ff:ff:ff:ff:ff failed
2025-10-12T21:59:53-00:00: [DEBUG] [ARP] sending ARP request for 192.168.1.100 --> Pending
2025-10-12T21:59:53-00:00: [DEBUG] [ARP] ARP reply received for 192.168.1.100/2c:cf:67:3c:fe:4b, adding cache entry
2025-10-12T22:00:55-00:00: [DEBUG] [ARP] ARP entry 192.168.1.100 (mac 2c:cf:67:3c:fe:4b) timed out --> Stale
2025-10-12T22:00:55-00:00: [DEBUG] [ARP] request for stale entry 192.168.1.100/2c:cf:67:3c:fe:4b --> Probing
2025-10-12T22:00:56-00:00: [DEBUG] [ARP] sending unicast ARP probe for 192.168.1.100/2c:cf:67:3c:fe:4b (5 left)
2025-10-12T22:00:56-00:00: [DEBUG] [ARP] ARP reply received for 192.168.1.100/2c:cf:67:3c:fe:4b, refreshing cache entry
2025-10-12T22:00:59-00:00: [DEBUG] [ARP] replying to ARP request for 192.168.1.1 from 192.168.1.100 (mac 2c:cf:67:3c:fe:4b)
...

@ansiwen ansiwen force-pushed the add-stale-probing branch 10 times, most recently from e59f514 to 1c8f6cb Compare October 13, 2025 04:08
@hannesm
Copy link
Member

hannesm commented Oct 13, 2025

Thanks for your contribution. Why is it a draft PR (is it ready for review or not)? Could you please keep unrelated changes out of the PR (here, for sure the "let trace = ..." should be removed to keep the diff small).

@hannesm
Copy link
Member

hannesm commented Oct 13, 2025

Is this "improved cach logic" taken from some book/paper/implementation?

@hannesm
Copy link
Member

hannesm commented Oct 13, 2025

simple DoS attack possible, that only recovers after these 20 minutes

As outlined on https://mirageos.org/security, please send security issues to the mail address provided there. Thanks a lot.

From what I understand, please tell me if this is correct or wrong: your worry is that gratuitous ARP packets are accepted and overwrite existing entries? So, if a (local, layer 2) attacker sends an gratuitous ARP packet, a unikernel will use this advertised MAC address to communicate with the IP address provided.

This is an improved implementation of the ARP cache logic, which handles dynamic network changes a lot better.

So, if I understand you correctly, there's now a not-so-obvious state machine in the code, and there's some "Probing" intervals going on (with unicast packets). Now, if someone sends a gratuitous ARP, the unikernel will probe 5 times (unicast), and the remove the entry?

Doesn't this lead to another denial of service attack (amplification - as an attacker I only have to send a single ARP packet, and suddenly the layer 2 network is filled with ARP requests (with an amplification factor of 5))?

--

Please note that I'm sure there are better strategies for how to handle gratuitous ARP, and this ARP implementation is not perfect. But I'm curious about how other implementations do it, and what exactly are the concerns and improvements in this PR.

@dinosaure dinosaure self-requested a review October 13, 2025 14:33
@ansiwen
Copy link
Author

ansiwen commented Oct 14, 2025

Why is it a draft PR (is it ready for review or not)?

That was on purpose. I wanted to see first if the CI is green and also experiment a bit more with the behavior. For me "Draft" means: reviews are welcome, but possibly larger parts of the code will still change or the whole PR might get redacted. But of course that can be different from project to project

Could you please keep unrelated changes out of the PR (here, for sure the "let trace = ..." should be removed to keep the diff small).

In a separate commit, or completely out of the PR? I mean, in this specific case it's not important anyway, can remove it completely. But in general I always think it's a good opportunity to clean up if you touch it anyway. (But in separate commits then, true.)

Is this "improved cach logic" taken from some book/paper/implementation?

As I wrote above it is inspired by NUD in IPv6 (see RFC4861 Sec 7.3) mainly by observing the behaviour of the Linux kernel, which apparently uses the NUD state machine also for IPv4. However NUD assumes access to "confirmations" from upper-level protocols, which we don't have here, so I removed the DELAY state, for example.

simple DoS attack possible, that only recovers after these 20 minutes

As outlined on https://mirageos.org/security, please send security issues to the mail address provided there. Thanks a lot.

Yeah, maybe "DoS" is not the right term, I don't really see it as a security issue, more as an interoperability issue. But right, I should probably be more aware of this.

From what I understand, please tell me if this is correct or wrong: your worry is that gratuitous ARP packets are accepted and overwrite existing entries? So, if a (local, layer 2) attacker sends an gratuitous ARP packet, a unikernel will use this advertised MAC address to communicate with the IP address provided.

My issue is less of a worry, but an observation while I was investigating network issues with the unikernel in a busy network. But yes, your description is correct. If I send a single gratuitous ARP packet with a "wrong" mac address to the unikernel for my local IP address, I can't reach it anymore for 20 minutes. But I can recover from this state by sending another gratuitous ARP packet with the correct MAC address. The issue is that gratuitous ARP packets for known IPs are trusted without confirmation. I compared that with how the Linux kernel behaved, which recovered within 10s by itself, and that's how I learned about the NUD, which it is using.

So, if I understand you correctly, there's now a not-so-obvious state machine in the code, and there's some "Probing" intervals going on (with unicast packets). Now, if someone sends a gratuitous ARP, the unikernel will probe 5 times (unicast), and the remove the entry?

Almost. Unsolicited updates (gratuitous ARP) are accepted as before, but now it changes the entry immediately to STALE state. (That is a MUST in the NUD spec). This means that next time this entry is queried it will go directly into PROBE (I skip the DELAY state, because we don't have the option go get "confirmation" by upper-layers.) In PROBE state it tries for a few times and rate limited to confirm that MAC address with unicast requests. Either they are answered, which refreshes the cache, or not, then the entry is removed after 5 times.

Doesn't this lead to another denial of service attack (amplification - as an attacker I only have to send a single ARP packet, and suddenly the layer 2 network is filled with ARP requests (with an amplification factor of 5))?

I can't imagine how that would work. You only can cause that for already existing ARP cache entries and only once per entry, and it is rate limited. Also it is what the Linux kernel is doing and what is described in RFC4861, so I don't think that can be a real issue.

Please note that I'm sure there are better strategies for how to handle gratuitous ARP, and this ARP implementation is not perfect. But I'm curious about how other implementations do it, and what exactly are the concerns and improvements in this PR.

We experience real-world connectivity issues in certain busy networks (which might contain many ill-behaving nodes), and while investigating, this was the first issue that I could find. (There might be also other issues, I'm still waiting for confirmation, because unfortunately I don't have direct access to that network.) So the goal is to make the ARP cache similarly resilient in such environments as the Linux kernel. In any case I think it makes not so much sense, that ARP cache entries are only probed once every 20 minutes and stay there forever, even if they are never queried again.

@ansiwen ansiwen closed this Oct 14, 2025
@ansiwen ansiwen reopened this Oct 14, 2025
@ansiwen ansiwen marked this pull request as ready for review October 14, 2025 20:59
@ansiwen ansiwen requested a review from hannesm as a code owner October 14, 2025 20:59
@hannesm
Copy link
Member

hannesm commented Oct 20, 2025

So, I'm not very convinced by this PR. I understand the desire, I'm not sure it brings us much. For reference, I studied various ARP implementations:

  • FreeBSD does not accept any gratuitous ARP (so you've to manually remove the entry from the ARP table), default lifetime is 20 minutes
  • OpenBSD does accept the new gratuitous ARP and overwrites (prints, similar to our ARP implementation, a warning)
  • Linux does something similar to what you describe

Putting trust or not into layer 2 packets is hard -- esp. with the different scenarios you may have: some computer with random mac addresses restarting, some local attacker trying to poison your ARP cache.

From your implementation, I still don't understand why there's need for separate Stale and Probing states? And what the timeout vs refresh are meant to be? It may be clearer if you draw a state diagram with under which conditions you expect an entry to move from one to the other states, including failures and timeouts.

It'd as well be helpful if someone else shares their opinion on this PR.

@ansiwen
Copy link
Author

ansiwen commented Oct 20, 2025

So, I'm not very convinced by this PR. I understand the desire, I'm not sure it brings us much.

Ok, so maybe we go step by step and we first check on which points we are on one page before we go into implementation specifics:

Do you also see a problem in the current behaviour, that a gratuitous ARP request for an already present IP in the cache makes the original node unreachable for 20 minutes, if the node doesn't constantly send gratuitous ARP requests itself? Even if this is not considered as a security issue, it is at least not a very robust behaviour.

For reference, I studied various ARP implementations:

  • FreeBSD does not accept any gratuitous ARP (so you've to manually remove the entry from the ARP table), default lifetime is 20 minutes
  • OpenBSD does accept the new gratuitous ARP and overwrites (prints, similar to our ARP implementation, a warning)
  • Linux does something similar to what you describe

Thanks for the research, that's interesting. Linux seems to use NUD for IPv4 since ages (2.2/2.4) and Windows since Vista. I wonder why *BSD didn't follow that.

Putting trust or not into layer 2 packets is hard -- esp. with the different scenarios you may have: some computer with random mac addresses restarting, some local attacker trying to poison your ARP cache.

Right, but isn't that a reason to not blindly trust a gratuitous MAC address update for flat 20 minutes? Even the very old original RFC 1122 (1989!) says regarding "ARP Cache Validation":

  1. Timeout -- Periodically time out cache entries,
    even if they are in use. Note that this timeout
    should be restarted when the cache entry is
    "refreshed" (by observing the source fields,
    regardless of target address, of an ARP broadcast
    from the system in question). For proxy ARP
    situations, the timeout needs to be on the order
    of a minute
    .
  2. Unicast Poll -- Actively poll the remote host by
    periodically sending a point-to-point ARP Request
    to it, and delete the entry if no ARP Reply is
    received from N successive polls. Again, the
    timeout should be on the order of a minute, and
    typically N is 2.

And this is from 35 years ago when the low bandwidth in large collision domains (10BASE2/10BASE5) made too many ARP packets relatively expensive. Today these numbers should be more than safe. And proxy ARP is omnipresent with VM infrastructures etc.

From your implementation, I still don't understand why there's need for separate Stale and Probing states? And what the timeout vs refresh are meant to be? It may be clearer if you draw a state diagram with under which conditions you expect an entry to move from one to the other states, including failures and timeouts.

Let's go into detail here after we agreed on a problem statement, but you can basically use the definitions from RFC 4861:

  STALE       The neighbor is no longer known to be reachable but
              until traffic is sent to the neighbor, no attempt
              should be made to verify its reachability.

  PROBE       The neighbor is no longer known to be reachable, and
              unicast Neighbor Solicitation probes are being sent to
              verify reachability.

The comments of the values should explain that, but here rephrased:

  • refresh is the interval after which Dynamic entries change to Stale
  • timeout is the interval after which Stale entries get deleted

It'd as well be helpful if someone else shares their opinion on this PR.

I also definitely would appreciate that. Since this is a quite delicate low-level interop component, I hesitate to use my implementation as a downstream fork for now, even if the community is not interested in this improvement. It would be great to have some input.

@ansiwen
Copy link
Author

ansiwen commented Oct 25, 2025

@haesbaert maybe you are a good candidate to have a look at this, if we can nerd-snipe you? I remember you helped us out already before with TCP/IP stuff. 😄

@Firobe
Copy link
Member

Firobe commented Oct 27, 2025

@ansiwen we (@hannesm, @reynir, me) discussed this PR at the Mirage meeting. There is not a clear consensus for now, but we all agree that the PR would benefit from:

  • a simple state diagram, as Hannes suggested above
  • a small explanation of what paths of the state machine are checked by the tests you've added, and maybe additional tests going through other paths (it's easy for state machines to go wrong and get stuck in one state).

Could you kindly provide that?

@haesbaert
Copy link
Member

The old ARP cache implementation is very simple, and keeps entries of reachable but unused neighbors forever in the cache. Also it never probes if entries are still valid before the timeout is reached, which is by default a very long 20 minutes. Combined with the fact that gratuitous ARP requests for existing entries are accepted, this makes a simple DoS attack possible, that only recovers after these 20 minutes.

I'm not sure I follow this, you say it keeps forever, but it expires after 20minutes?
If I understand this right, the old/current implemention does not overwrite an arp entry, so you have to wait the 20minutes and then learn the new address. Did I read this right?

I haven't read any arp RFC, so I don't have a lot to add, but here goes some speculation.

This is an improved implementation of the ARP cache logic, which handles dynamic network changes a lot better. It is inspired by the NUD of IPv6, but a lot simpler. It only introduces two more cache entry states: Stale and Probing. After the refresh interval (default: 1 minute) a Dynamic entry goes into Stale state. If an entry changes its MAC address because of a gratuitous ARP request, it goes into Stale immendiately. If a Stale entry is queried, it goes into Probing state, where it sends ARP requests similar to the Pending state, but as unicasts to the last known MAC address. If the probing is not successful, the entry is removed and further queries cause a normal broadcast request, otherwise it is replaced with a fresh Dynamic entry. If a Stale entry reaches timeout age (default 20 minutes), because it never has been queried, it is removed without further probing.

Compared to the old implentation the default values create on the happy path only one more unicast ARP request per minute and ARP cache entry.

While this seem to make sense, I wonder, why not just rely on the refresh interval of 1minute?

Anything that is older than 1minute needs to get probed, if it gets an answer it continues to exist, if not, it's zapped. If, during this 1 minute of staleness, you get a packet saying the IP changed, you mark the existing stale entry as "contested" or whatever, contested means, once it expires, you do a broadcast request asking for whoever the new owner is. With this you don't need two new states, only one. I might be misunderstanding something so take it with a grain of salt.

Concerning why OpenBSD and FreeBSD don't do what linux does, I'd wager it's likely because linux tries to solve everything in the most complex way since it doesn't know how it is supposed to be used, while Unixes compromise so they can have something simpler.

Also other things come to mind, I assume the number of entries is bound as well right? Otherwise one can just populate it with crap until OOM, you know, ocaml style.

@ansiwen
Copy link
Author

ansiwen commented Nov 4, 2025

Thanks a lot for having a look, @haesbaert!

The old ARP cache implementation is very simple, and keeps entries of reachable but unused neighbors forever in the cache. Also it never probes if entries are still valid before the timeout is reached, which is by default a very long 20 minutes. Combined with the fact that gratuitous ARP requests for existing entries are accepted, this makes a simple DoS attack possible, that only recovers after these 20 minutes.

I'm not sure I follow this, you say it keeps forever, but it expires after 20minutes? If I understand this right, the old/current implemention does not overwrite an arp entry, so you have to wait the 20minutes and then learn the new address. Did I read this right?

After 20 minutes it probes an entry exactly one time (but with a normal broadcast, so not exactly a probe), and removes it only, if it is not reachable anymore. It does that not matter if the entry has been in use or not. So, if you contacted another node one time, it will stay in the ARP cache as long as that node is reachable.

This is an improved implementation of the ARP cache logic, which handles dynamic network changes a lot better. It is inspired by the NUD of IPv6, but a lot simpler. It only introduces two more cache entry states: Stale and Probing. After the refresh interval (default: 1 minute) a Dynamic entry goes into Stale state. If an entry changes its MAC address because of a gratuitous ARP request, it goes into Stale immendiately. If a Stale entry is queried, it goes into Probing state, where it sends ARP requests similar to the Pending state, but as unicasts to the last known MAC address. If the probing is not successful, the entry is removed and further queries cause a normal broadcast request, otherwise it is replaced with a fresh Dynamic entry. If a Stale entry reaches timeout age (default 20 minutes), because it never has been queried, it is removed without further probing.
Compared to the old implentation the default values create on the happy path only one more unicast ARP request per minute and ARP cache entry.

While this seem to make sense, I wonder, why not just rely on the refresh interval of 1minute?

Anything that is older than 1minute needs to get probed, if it gets an answer it continues to exist, if not, it's zapped. If, during this 1 minute of staleness, you get a packet saying the IP changed, you mark the existing stale entry as "contested" or whatever, contested means, once it expires, you do a broadcast request asking for whoever the new owner is. With this you don't need two new states, only one. I might be misunderstanding something so take it with a grain of salt.

If you add an internal contested flag into the Stale state, that is the same as having another state, no? Having a separate explicit state is less error prone I would assume. The additional "Probing" state is necessary, because you still want to answer ARP cache queries until the probing is finished, and because you might want to send more than just one unicast probe. (The NUD RFC requires 3 retransmission.) Apart from that, as I wrote before, I didn't want to reinvent the wheel and just reuse what is already existing in the IPv6 world (NUD), which makes it easier for people to understand the code who are familiar with the IPv6 concepts.

Concerning why OpenBSD and FreeBSD don't do what linux does, I'd wager it's likely because linux tries to solve everything in the most complex way since it doesn't know how it is supposed to be used, while Unixes compromise so they can have something simpler.

Well, in this case I would argue, Linux (and Windows) have something simpler, because they have only one implementation, since they use NUD for both IPv4 and IPv6, while the BSDs have two separate implementations. (They still need NUD for IPv6).

Also other things come to mind, I assume the number of entries is bound as well right? Otherwise one can just populate it with crap until OOM, you know, ocaml style.

Yes, that was indeed the case until recently, but since #35 this is mitigated.

@ansiwen
Copy link
Author

ansiwen commented Nov 4, 2025

@ansiwen we (@hannesm, @reynir, me) discussed this PR at the Mirage meeting. There is not a clear consensus for now, but we all agree that the PR would benefit from:

  • a simple state diagram, as Hannes suggested above
  • a small explanation of what paths of the state machine are checked by the tests you've added, and maybe additional tests going through other paths (it's easy for state machines to go wrong and get stuck in one state).

Could you kindly provide that?

Thanks for discussing this PR! Sure, I can do that!

@ansiwen ansiwen force-pushed the add-stale-probing branch 7 times, most recently from 6b10cec to 6cff3cf Compare November 5, 2025 04:36
@ansiwen
Copy link
Author

ansiwen commented Nov 5, 2025

I also reimplemented the racy entries_expire mirage test and made it work on MacOS and Windows. (They have slower ticks in the CI and therefore must wait longer for the timeout.). It's in a separate commit, but can also pull it out in a separate PR, if that is preferred.

@hannesm
Copy link
Member

hannesm commented Nov 7, 2025

I also reimplemented the racy entries_expire mirage test and made it work on MacOS and Windows. (They have slower ticks in the CI and therefore must wait longer for the timeout.). It's in a separate commit, but can also pull it out in a separate PR, if that is preferred.

Yes, a separate PR with the changes for that would be great. Bonus points if the commit targets the main branch, and does not depend on your state machine work.

@hannesm
Copy link
Member

hannesm commented Nov 7, 2025

Thanks for your state transition diagram and your table.

For the events "reply received", I understand it.

Please correct me if I'm wrong in respect to timings: You propose to have (tick is called every 1.5 seconds):

  • "timeout" (only used in Stale state, set to 800 -> 20 minutes),
  • "refresh timeout" (only used in Dynamic state, transitions to Stale state, set to 40 -> 1 minute),
  • "retries exhausted" (used in Pending and Probing state, set to 2 (was 5) -> 3 seconds).

What is the reasoning for reducing the retries?

And please also help me to understand - in Probing you send out ARP requests to the unicast address, how would there a reply with a different mac is received? Is this a gratuitous ARP?

I think I understand now much better what you try to achieve here. It is still the case that we can't do much about ARP spoofing (even with this PR, an attacker can still continue to send lots of ARP replies from e.g. the IP address of the router -- and the ARP implementation will take them as valid) --> but that's how it is, if you're in an adversarial context, you're better off to configure the mac address of the router statically (something that I suspect is not (yet?) exposed in MirageOS -- could be a command-line argument).

let handle_reply t source mac =
let extcache =
let cache = M.add source (Dynamic (mac, t.epoch + t.timeout)) t.cache in
let update_cache () =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this renamed, and moved to be a function?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because extcache is not used in all branches, so I think it is better to do it on demand. For me at least update_cache is easier to understand.

@ansiwen ansiwen force-pushed the add-stale-probing branch 10 times, most recently from e1dce01 to 41fd60f Compare November 7, 2025 16:08
This change stabilizes the `entries_expire` mirage test by removing timing
dependencies, fixes it for CI environments with slow ticks (they seem to
have up to ~20ms ticks) and re-enables it for all platforms again.
The old ARP cache implementation is very simple, and keeps entries of
reachable but unused neighbors forever in the cache.  Also it never probes
if entries are still valid before the timeout is reached, which is by
default a very long 20 minutes.  Combined with the fact that gratuitous ARP
requests for existing entries are accepted, this makes a simple DoS attack
possible, that only recovers after these 20 minutes.

This is an improved implementation of the ARP cache logic, which handles
dynamic network changes a lot better.  It is inspired by the NUD of IPv6,
but a lot simpler.  It only introduces two more cache entry states: Stale
and Probing.  After the `refresh` interval (default: 1 minute) a Dynamic
entry goes into Stale state.  If an entry changes its MAC address because of
a gratuitous ARP request, it goes into Stale immediately.  If a Stale entry
is queried, it goes into Probing state, where it sends ARP requests similar
to the Pending state, but as unicasts to the last known MAC address.  If the
probing is not successful, the entry is removed and further queries cause a
normal broadcast request, otherwise it is replaced with a fresh Dynamic
entry.  If a Stale entry reaches `timeout` age (default 20 minutes), because
it never has been queried, it is removed without further probing.

Compared to the old implentation the default values create on the happy path
only one more unicast ARP request per minute and ARP cache entry.
@ansiwen
Copy link
Author

ansiwen commented Nov 7, 2025

Thanks for your state transition diagram and your table.

For the events "reply received", I understand it.

Please correct me if I'm wrong in respect to timings: You propose to have (tick is called every 1.5 seconds):

  • "timeout" (only used in Stale state, set to 800 -> 20 minutes),

  • "refresh timeout" (only used in Dynamic state, transitions to Stale state, set to 40 -> 1 minute),

  • "retries exhausted" (used in Pending and Probing state, set to 2 (was 5) -> 3 seconds).

That seems correct.

What is the reasoning for reducing the retries?

It was 6 requests before (1 + 5 retries), which felt a bit excessive to me. So I just went with what IPv6 NUD is using, which explicitly uses 3 (1 + 2 retries). See MAX_MULTICAST_SOLICIT and MAX_UNICAST_SOLICIT here. But I have no objections to go higher again, I don't think it actually makes any difference nowadays.

And please also help me to understand - in Probing you send out ARP requests to the unicast address, how would there a reply with a different mac is received? Is this a gratuitous ARP?

Yeah, that is not likely, but the code would still handle it, no matter if it is a real reply packet or a gratuitous ARP announcement (which is technically a request packet) which arrives coincidentally at that moment. In both cases the situation would be unclear and Stale state is therefore appropriate.

I think I understand now much better what you try to achieve here. It is still the case that we can't do much about ARP spoofing (even with this PR, an attacker can still continue to send lots of ARP replies from e.g. the IP address of the router -- and the ARP implementation will take them as valid) --> but that's how it is, if you're in an adversarial context, you're better off to configure the mac address of the router statically (something that I suspect is not (yet?) exposed in MirageOS -- could be a command-line argument).

Right, it's not perfect. But I would argue it is still a lot more resilient than before, because in such situations it would constantly try to verify the entries and therefore also recover a lot quicker. So it makes an attack more expensive and less disruptive.

@ansiwen
Copy link
Author

ansiwen commented Nov 8, 2025

Please correct me if I'm wrong in respect to timings: You propose to have (tick is called every 1.5 seconds):

  • "timeout" (only used in Stale state, set to 800 -> 20 minutes),
  • "refresh timeout" (only used in Dynamic state, transitions to Stale state, set to 40 -> 1 minute),
  • "retries exhausted" (used in Pending and Probing state, set to 2 (was 5) -> 3 seconds).

BTW: if that would be new code I would have called timeout expire instead, but that would change the external interface in a non backward compatible way. And thinking about it more, I wonder if it would be better to use the timeout parameter for the refresh timeout, and add an expire parameter instead. However, since the semantics change anyway, maybe it's a good thing to drop the timeout parameter completely and add both a refresh and expire parameter, so that people who set timeout explicitly before have to think about which setting is more appropriate for their case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants