I’m sure you’ve all logged into a VPN Router once or twice and seen this syslog:
%IOSXE-3-PLATFORM: R0/0: cpp_cp: QFP:0.0 Thread:000 TS: %IPSEC-3-REPLAY_ERROR: IPSec SA receives anti-replay error, DP Handle X, src_addr x.x.x.x, dest_addr y.y.y.y, SPI 0x0
Here is everything you need to know regarding the feature, the causes of the syslog, and the solutions to it.
IPSEC Anti-Replay is an ESP data plane feature that stamps each packet with a sequence number as it is encapsulated. Each new packet is encapsulated/encrypted, gets a sequence number one higher than the previous packet’s (carried in the ESP header), and is sent on.
Basically, this numbering system gives the receiving end protection against replay attacks. Packets are literally marked in the data plane with a sequence number that is NOT encrypted.
Here’s what this looks like in a Wireshark capture (ESP Sequence is the name in the header):
The sequence number is checked once the packet reaches the other end (the receiving end). The receiver compares it against its sliding window.
Here are some examples of what could happen:
Packets 2-69 arrive, so the top (right edge) of the window is now 69 and the 64 packet window covers 6-69. If packet 1 arrived after packet 69, it would be dropped.
Packet 1 has arrived, followed by packet 12 (or even packet 63). If packet 2 arrives after that, it will be accepted as it’s still within the 64 packet window.
The highest sequence number successfully received was 40, so that’s the current top of the window on the receiver. Packet 35 arrives, then packet 30. Both are accepted as they fall within the 64 packet window and have not been seen before.
If packet 40 arrived and was accepted, then somehow arrived again, that second copy would be dropped.
Note: This is specifically about a packet arriving with the same ESP sequence number. It’s not about TCP retransmissions, which get a new sequence number in the ESP header.
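The examples above can be sketched as a small receiver-side simulation. This is only an illustration of the sliding window idea, not Cisco’s implementation: the class name `ReplayWindow` and its attributes are made up for this example, and a real data plane would use a bitmap rather than a Python set.

```python
class ReplayWindow:
    """Receiver-side anti-replay window (illustrative sketch only)."""

    def __init__(self, size=64):
        self.size = size   # window width in packets (64 is the IOS/IOS XE default)
        self.top = 0       # highest sequence number accepted so far
        self.seen = set()  # accepted sequence numbers still inside the window

    def accept(self, seq):
        """Return True if the packet passes the check, False if it is dropped."""
        if seq > self.top:
            # Ahead of the window: accept it and slide the window forward.
            self.top = seq
            self.seen.add(seq)
            low = self.top - self.size + 1
            self.seen = {s for s in self.seen if s >= low}
            return True
        if seq <= self.top - self.size:
            return False   # too old: left of the window
        if seq in self.seen:
            return False   # duplicate sequence number: a replay
        self.seen.add(seq)
        return True


# Packets 2-69 arrive, then packet 1 shows up late.
w = ReplayWindow()
for seq in range(2, 70):
    w.accept(seq)
print(w.accept(1))    # False: the window now covers 6-69

# Packet 1, then packet 63, then packet 2.
w = ReplayWindow()
w.accept(1)
w.accept(63)
print(w.accept(2))    # True: still inside the 64 packet window

# Packet 40 arrives twice.
w = ReplayWindow()
w.accept(40)
print(w.accept(40))   # False: same sequence number seen again
```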
This feature is beneficial in scenarios where a malicious actor is sitting in between or inline with the traffic and is actively spoofing both sides. That position allows for more advanced attacks on both IKEv1 and IKEv2.
Earlier I mentioned this sequence is not encrypted. You might now be thinking, well if it’s not encrypted, what’s to stop the malicious actor from just editing the packets since they are in the middle?
Although the ESP header is NOT encrypted, it is authenticated via the ESP AUTH for both tunnel and transport mode. Here’s what that looks like.
Thus no man (or woman) in the middle can tamper with a packet’s sequence number, as it is authenticated. This also PROBABLY provides a mechanism to verify anti-replay before decryption, which is the most CPU intensive operation on the receiver. Checking the sequence number before decrypting the packet makes sense, or else we would be wasting CPU cycles decrypting packets that could then be dropped for failing the sequence check. Not to mention it helps prevent certain denial-of-service attacks.
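To make the “authenticated but not encrypted” point concrete, here is a rough sketch using Python’s standard `hmac` module. It assumes an HMAC-SHA-256 integrity check computed over the SPI, the sequence number, and the already-encrypted payload. The key, SPI, and payload values are placeholders, and real ESP also covers fields such as the IV and padding and typically truncates the result, so treat this as an illustration of the idea only.

```python
import hashlib
import hmac
import struct


def esp_icv(key, spi, seq, ciphertext):
    """Integrity value over the clear-text ESP header and the encrypted payload."""
    header = struct.pack("!II", spi, seq)  # 32-bit SPI, 32-bit sequence number
    return hmac.new(key, header + ciphertext, hashlib.sha256).digest()


def verify(key, spi, seq, ciphertext, icv):
    """Recompute the integrity value and compare it in constant time."""
    return hmac.compare_digest(esp_icv(key, spi, seq, ciphertext), icv)


key = b"hypothetical-shared-key"   # placeholder, normally negotiated by IKE
payload = b"opaque-ciphertext"     # placeholder for the encrypted data

icv = esp_icv(key, 0x1234, 40, payload)

print(verify(key, 0x1234, 40, payload, icv))  # True: packet is untouched
print(verify(key, 0x1234, 41, payload, icv))  # False: sequence number was edited
```

The attacker can read the sequence number, but changing it without the key makes the integrity check fail, and no decryption is needed to notice.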
Now, back to the IPSEC anti-replay window-size problem and solutions…
Currently, the default ipsec anti-replay window-size on IOS and IOS XE is 64 packets. Thus packets must fall within this 64 sequence sliding window, or else be dropped. However this can easily be changed (increased) or disabled as we’ll see below.
Here are the 6 major causes of the “%IPSEC-3-REPLAY_ERROR: IPSec SA receives anti-replay error” log.
1. Packet loss
If there is congestion on the link, or a reliability issue on the path, then packet loss will be observed. During this period, packets may also arrive at the receiver in an unintended order. Say packets 1-49 made it, so the top of the window is at 49. Packet 160 then comes in. It is ahead of the window, so it is accepted and the window slides forward to cover 97-160. Any of packets 50-96 that show up late after that are dropped. I don’t believe you get a syslog PER packet dropped, as that would introduce a denial of service; I would assume the syslog is rate-limited.
In these scenarios you can disable anti-replay or increase the window size, and it will ease the drops a little, but most likely not enough to be noticeable to end users. The only real solution here is to fix the packet loss. That may include calling the ISP, marking and queuing packets (if available), fixing license issues, or fixing broken links.
Identifying whether this is happening often can be done via:
show crypto ipsec sa peer x.x.x.x | i replay failed
Again, in the case of packet loss, a bigger window probably won’t help much, as you’ll be saving a few packets out of potentially hundreds, but it’s worth mentioning.
Config to increase the window size (shown here in global config, but it can also be done per crypto map or ipsec profile):
crypto ipsec security-association replay window-size <64 | 128 | 256 | 512 | 1024>
or disable it:
crypto ipsec security-association replay disable
2. QoS
Crypto encapsulation happens BEFORE queuing, which is, incidentally, why we have features such as qos pre-classify (which clones the inner headers before encapsulation for post encap queuing).
The way this issue occurs is that QoS can send packets 1 and 2 last, and packets 50 and 51 first (during times of congestion), if the config calls for it. An example would be a priority queue. During congestion, the priority queue will be serviced first, but those voice packets have already been crypto encapsulated (and thus sequenced) along with everything else. That is how sequenced packets end up out of order as they leave the Router. Now not only are you dropping packets due to congestion, you are also dropping packets due to queuing (the very thing trying to handle the congestion). This behavior worsens the situation.
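A quick way to see the effect is to simulate it. The sketch below makes some heavy-handed assumptions: 200 packets are sequenced in one burst, every 4th one is voice, and a strict priority queue sends all the voice packets before any data. The function name `replay_drops` and the traffic mix are invented for the example; real schedulers interleave far more gracefully than this.

```python
def replay_drops(arrival_order, window=64):
    """Return the sequence numbers a receiver would drop for this arrival order."""
    top, seen, dropped = 0, set(), []
    for seq in arrival_order:
        if seq > top:
            top = seq                  # ahead of the window: accept and slide
            seen.add(seq)
        elif seq <= top - window or seq in seen:
            dropped.append(seq)        # too old, or a duplicate
        else:
            seen.add(seq)              # late, but still inside the window
    return dropped


# Encapsulation happens first: packets are sequenced 1..200 in arrival order.
sequenced = list(range(1, 201))
voice = [s for s in sequenced if s % 4 == 0]   # assumed: every 4th packet is voice
data = [s for s in sequenced if s % 4 != 0]

# Queuing happens second: the priority queue is emptied before the data queue.
wire_order = voice + data

print(len(replay_drops(wire_order)))               # 102 data packets dropped
print(len(replay_drops(wire_order, window=1024)))  # 0 with a 1024 packet window
```

Once the last voice packet (200) arrives, the 64 packet window covers 137-200, so every data packet numbered 136 or lower is dropped even though nothing was replayed.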
Here’s a diagram I have in my notes from the multi-sn feature set which visualizes this issue I described above.
Now before we dive deeper, there’s a difference in how IOS and IOS XE handle QoS. In IOS the default class-map queue-limit was 64 packets (the same as the ipsec anti-replay window-size). In IOS XE the default class-map queue-limit is calculated to be 50ms (and can also be configured as a number of packets, like classic IOS used, or even bytes).
A possible solution then becomes making sure your queued packets fall within the 64 packet sliding window by capping your policy-map’s total queue-limit at 64 packets (not a great idea, as we are really limiting queue size).
Here’s what that solution would look like in a policy-map on the Sender (remember the receiver is experiencing the anti-replay issue).
policy-map TEST
 class TEST
  bandwidth percent 30
  queue-limit 21 packets
 class TEST2
  bandwidth percent 10
  queue-limit 21 packets
 class class-default
  queue-limit 21 packets
As you can see above I have 3 queues: TEST, TEST2, and class-default. I simply divided 64 by 3 and discarded the remainder. In this case I did my config in IOS XE (CSR1000V), and I specified the queue-limit in packets, not ms or bytes. This method could be useful if you do not control the other side, or cannot change it.
A more optimal solution is having the receiver increase its window-size, or disable anti-replay.
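The arithmetic behind the 21 packet limit is just integer division. Here it is as a tiny helper in case you have a different number of classes or a different window. The function name is made up for this example.

```python
def per_class_queue_limit(replay_window, num_classes):
    """Split the receiver's replay window evenly across queues, dropping the remainder."""
    return replay_window // num_classes


limit = per_class_queue_limit(64, 3)
print(limit)       # 21 packets per class
print(limit * 3)   # 63 packets queued in total, which stays inside the 64 packet window
```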
crypto ipsec security-association replay window-size <number>
crypto ipsec security-association replay disable
crypto ipsec security-association multi-sn
show crypto ipsec sa peer x.x.x.x platform
That command runs a number of other commands, one being “show plat hard qfp active feature ipsec datapath crypto-sa #”
3. Software bug
A rarer occurrence is that you are hitting a bug in the IPSEC code and require an IOS upgrade. Unfortunately, pinning these bugs down with 100% certainty is very difficult, and most of the time not worth the headache. Start by reading the release notes and doing a CTRL+F for “ipsec” or “replay”. If you suspect that the issue is bug related, just upgrade and test.
4. ESP HA with oversubscription
I’ve never used this feature set, nor do I want to. But oversubscription of the nodes when running ESP HA would also cause packets to fall outside the anti-replay window due to tail-drop/congestion.
5. Security Association Re-key Operations (normal rekeying after lifetime expires).
During SA rekey, both the old and the new SA exist for a short time. After the new SA is up, the old one is deleted. It’s possible that right after this operation a packet will come in for the old SA, and thus be dropped as it’s not part of the current window.
6. QoS changes on active or in use policy-maps
Modifying in-use policy-maps or their class-maps can interrupt the sequence numbering, as the changes take effect right away. For a brief moment the sequencing will be off for the receiving end, and those packets will be dropped.