cross-posted from: https://infosec.pub/post/306795

I am interested in your ways to identify a bottleneck within a network.

In my case, I’ve got 2 locations, one in the UK, one in Germany. Hardware is Fortigates for FW/routing, and the switches are Cisco/HPE. The locations are connected through an IPsec VPN over the internet, and all internet connections have a bandwidth of at least 100 Mbps.

The problem occurs as soon as one client in the UK tries to download data via SSH from a server in Germany. The max download speed is 10 Mbps, and for the duration of the download the whole UK location has problems accessing resources in Germany through the VPN (Citrix, Exchange, Sharepoint, etc).

I’ve changed some information for privacy reasons, but I’d be interested in your first steps on how to tackle such a problem. Do you have some kind of runbook that you follow? What are common errors that you encounter? (independent of my case too, just in general)

EDIT: Current list

  • packet capture on client and server to check for packet loss, latency, etc. - if packets are dropped, check intermediate devices
  • check utilization of intermediate devices (CPU, RAM, etc)
  • check throughput with different tools (iperf3, nc, etc) and protocols (TCP, UDP, etc) and compare
  • check if a traffic shaper or QoS is in place
  • check ports on intermediate devices for port speed mismatch
  • MTU/MSS mismatch
  • is the internet connection affected too, or just traffic through the VPN
  • IPsec configuration
  • temporarily turn off the security functions of the FW and check if it is still reproducible
  • traceroute from A to B, any latency spikes?
  • check RTT, RWND, MSS/MTU, TTL via pcap, on the transferring client itself and reference client, without and while an active data transfer
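For the RTT/RWND item, a rough ceiling is useful when reading the pcaps: a single TCP flow cannot move more than one receive window per round trip. A minimal sketch, where the 64 KiB window and 30 ms RTT are hypothetical values (the thread gives neither) and the function name is made up for illustration:

```python
def window_limited_throughput_mbps(rwnd_bytes: int, rtt_ms: float) -> float:
    """Upper bound on single-flow TCP throughput: one receive window per RTT."""
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    bits_per_second = rwnd_bytes * 8 / (rtt_ms / 1000)
    return bits_per_second / 1_000_000


# Hypothetical values: 65535-byte window (no window scaling), 30 ms RTT.
ceiling = window_limited_throughput_mbps(65535, 30)
print(f"window-limited ceiling: {ceiling:.2f} Mbit/s")
```

If the RWND and RTT seen in the capture give a ceiling well above the observed rate, the window is probably not the limiting factor and loss or shaping is more likely.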

Prob not related but noteworthy:

  • check I/O of server and client

I’ll keep this list updated and appreciate further tips.


Update: I had to postpone the session and will do the stress test on Monday or Tuesday evening. I’ll update you as soon as I have the results.


Update 2: So, I’ll try to keep it short.

The first iperf3 run over TCP (UK < DE) with the same FW rules let me reproduce the problem. Max speed was 10 Mbps, and DE < UK was even slower, down to 1-2 Mbps. The pattern of the test implies an unreliable connection (short bursts up to 30 Mbps, then 0, and so on). Traceroute shows the same hops in both directions and no latency spikes, all good.

BUT ICMP and iperf3 over UDP runs show a packet loss of at least 10% and up to 30% in both directions! Multiple speed tests to endpoints over the internet (UK>Internet) showed a download of 80 Mbps and an upload of around 30 Mbps, which indicates a problem with the IPsec tunnel.
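To put the loss figures in perspective, the Mathis approximation estimates steady-state TCP throughput as roughly (MSS / RTT) * C / sqrt(p). It was derived for small loss rates, so at 10-30% it is only an order-of-magnitude guide. A sketch under stated assumptions: the 1460-byte MSS and 30 ms RTT are hypothetical (the thread does not give them), and C = 1.22 is the commonly quoted constant.

```python
import math


def mathis_throughput_mbps(mss_bytes: int, rtt_ms: float, loss: float,
                           c: float = 1.22) -> float:
    """Rough TCP throughput estimate from MSS, RTT and packet loss probability."""
    if not 0 < loss < 1:
        raise ValueError("loss must be a probability strictly between 0 and 1")
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    bits_per_second = (mss_bytes * 8 / (rtt_ms / 1000)) * c / math.sqrt(loss)
    return bits_per_second / 1_000_000


# Hypothetical MSS and RTT; the loss rates are the ones reported above.
for loss in (0.10, 0.30):
    estimate = mathis_throughput_mbps(1460, 30, loss)
    print(f"loss {loss:.0%}: roughly {estimate:.2f} Mbit/s")
```

Under these assumptions the estimate lands in the low single-digit Mbit/s range, which is at least consistent with a TCP transfer that stalls far below the 100 Mbps line rate.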

Some smaller things we’ve tried without any positive effect:

  • routing changes
  • disabling all security features for affected rule set
  • removed traffic shaper
  • Port speed/duplex negotiations are looking good
  • and some other things that I already forgot

Things we prepared:

  • We have opened some tickets with our ISPs to let them check it on their side > waiting for response
  • Set up smokeping to ping all provider/public/gw/IPsec endpoint/host IPs and see where packets could be dropped (server located in DE)
  • Planned a new session with a Fortigate expert to look in-depth into the IPsec configuration.

Need to do:

  • look through all packet captures (takes some time)
  • MSS/MTU mismatches / DF flags
  • further iperf3 tests with smaller/larger packet sizes
  • double check IPsec configuration
  • QoS on Switches
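For the MSS/MTU and DF-flag item, one way to find the largest packet that crosses the tunnel unfragmented is a binary search over pings with the DF bit set. A minimal sketch, assuming a Linux host with iputils `ping` (`-M do` sets DF, `-s` is the ICMP payload size); the helper names and the 1422-byte value in the demo are hypothetical. With 10-30% loss a single probe can fail for unrelated reasons, so a real probe should retry a few times before declaring a size too large.

```python
import subprocess
from typing import Callable

IPV4_ICMP_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP header


def largest_passing_size(probe: Callable[[int], bool], lo: int, hi: int) -> int:
    """Binary-search the largest size in [lo, hi] for which probe(size) is True.

    Assumes that if a size passes, every smaller size passes too.
    Returns lo - 1 if no size in the range passes.
    """
    best = lo - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if probe(mid):
            best = mid
            lo = mid + 1
        else:
            hi = mid - 1
    return best


def ping_df_probe(host: str) -> Callable[[int], bool]:
    """Build a probe that sends one DF-flagged ping of a given total IPv4 size."""
    def probe(packet_size: int) -> bool:
        payload = packet_size - IPV4_ICMP_OVERHEAD
        result = subprocess.run(
            ["ping", "-M", "do", "-c", "1", "-W", "1", "-s", str(payload), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0
    return probe


if __name__ == "__main__":
    # Offline demo with a fake path MTU instead of real pings.
    fake_path_mtu = 1422  # hypothetical value
    print(largest_passing_size(lambda size: size <= fake_path_mtu, 576, 1500))
```

Against a real host, `largest_passing_size(ping_df_probe("<server in DE>"), 576, 1500)` would run the same search with actual pings.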

I wish I had more time. I’ll keep you updated

  • wop@infosec.pubOP
    1 year ago

    Added the Update 2. Still some things to do, but we know a little bit more now. Feedback and questions are still welcome.

    • phase_change@sh.itjust.works
      1 year ago

      Nice job. Packet loss will definitely cause these issues. Now, you just need to find the source of the packet loss.

      In your situation, I’d first try to figure out if it is ISP/Internet before looking inside either network. I wouldn’t expect it to be internal at these speeds. Though, did you get CPU/RAM readings on the network equipment during these tests? Maxing out either can result in packet loss.

      I’d start with two pairs of packet captures taken while the issue is happening: endpoint to endpoint and edge router to edge router. Figure out if the packet loss is only happening in one direction or not. That is, are all the UK packets reaching DE but not all the DE packets making it back? You should clearly be able to narrow into a TCP conversation with dropped packets. Dropped packets aren’t ones that a system never sent, they’re ones that a system never received. Find some of those and start figuring out where the drop happened.
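The comparison described above boils down to a set difference: anything present in the sender-side capture but absent from the receiver-side capture was dropped in between. A minimal sketch, assuming the segment identifiers have already been exported from both pcaps (how that export is done is out of scope here); `SegmentKey`, `find_lost_segments` and the sample addresses are hypothetical.

```python
from typing import Iterable, NamedTuple


class SegmentKey(NamedTuple):
    """Fields that identify one TCP segment within a capture."""
    src: str
    dst: str
    sport: int
    dport: int
    seq: int


def find_lost_segments(sent: Iterable[SegmentKey],
                       received: Iterable[SegmentKey]) -> list[SegmentKey]:
    """Return segments seen at the sender but never at the receiver, in send order."""
    received_set = set(received)
    # dict.fromkeys de-duplicates retransmissions while keeping the original order.
    return [key for key in dict.fromkeys(sent) if key not in received_set]


# Hypothetical data: three segments leave DE, only two arrive in the UK.
de_side = [
    SegmentKey("192.0.2.10", "198.51.100.20", 22, 50000, 1),
    SegmentKey("192.0.2.10", "198.51.100.20", 22, 50000, 1461),
    SegmentKey("192.0.2.10", "198.51.100.20", 22, 50000, 2921),
]
uk_side = [de_side[0], de_side[2]]

for lost in find_lost_segments(de_side, uk_side):
    print(f"lost between capture points: seq={lost.seq}")
```

Running it once per direction and once per pair of capture points (endpoints, then edge routers) narrows down which segment of the path is dropping.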

      • wop@infosec.pubOP
        1 year ago

        The ISPs are slow to answer if there is no active outage. Will take some time anyway.

        Packets are dropped in both directions. I am currently looking through the pcaps and will do another stress test later - got another window. MTU/MSS is the prio today.

  • taladar@sh.itjust.works
    1 year ago

    Are you sure that the download speed is 10Mbit/s and not 10Mbyte/s which would be close to saturating the 100Mbit/s link and would explain the other symptoms you are seeing?
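The arithmetic behind that question, as a small sketch (the function names are made up for illustration; the 100 Mbit/s link is the minimum bandwidth stated in the post):

```python
def mbyte_per_s_to_mbit_per_s(mbyte_per_s: float) -> float:
    """Convert megabytes per second to megabits per second (8 bits per byte)."""
    return mbyte_per_s * 8


def link_utilisation(rate_mbit_per_s: float, link_mbit_per_s: float) -> float:
    """Fraction of the link capacity used by a transfer."""
    return rate_mbit_per_s / link_mbit_per_s


LINK_MBIT = 100  # minimum internet bandwidth stated in the post

# If the client actually showed 10 MByte/s, that would be 80 Mbit/s.
as_mbyte = mbyte_per_s_to_mbit_per_s(10)
print(f"10 MByte/s = {as_mbyte} Mbit/s, "
      f"{link_utilisation(as_mbyte, LINK_MBIT):.0%} of the link")

# If it really is 10 Mbit/s, the transfer alone uses only a tenth of the link.
print(f"10 Mbit/s = {link_utilisation(10, LINK_MBIT):.0%} of the link")
```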

    • wop@infosec.pubOP
      1 year ago

      Valid question. We’ve checked multiple times, on the client and via monitoring, that it is 10 Mbit/s. Thank you.

      • taladar@sh.itjust.works
        1 year ago

        Have you checked for resent packets or connection resets or similar things that might use up more bandwidth than the successfully received packets? I would probably use Wireguard for that.

        • wop@infosec.pubOP
          1 year ago

          Not yet. Wouldn’t expect it tbh, but you never know. How would you utilize Wireguard for it? I’d like to hear more about it.

          • taladar@sh.itjust.works
            1 year ago

            Oops, I meant Wireshark of course. Basically capture the packets and then check for any with errors.
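In Wireshark itself the display filter `tcp.analysis.retransmission` marks these packets. To quantify how much bandwidth resends consume, a simplified sketch is to count data segments whose sequence number and length were already seen in the same direction of one flow. This is a much cruder heuristic than Wireshark's own analysis, and the function name and sample data are hypothetical.

```python
from typing import Iterable


def retransmission_stats(segments: Iterable[tuple[int, int]]) -> tuple[int, int, int]:
    """Count repeated (seq, length) data segments in one direction of one flow.

    Returns (retransmitted_segments, retransmitted_bytes, total_payload_bytes).
    """
    seen: set[tuple[int, int]] = set()
    retrans_count = retrans_bytes = total_bytes = 0
    for seq, length in segments:
        if length == 0:
            continue  # pure ACKs carry no payload and are not counted
        total_bytes += length
        if (seq, length) in seen:
            retrans_count += 1
            retrans_bytes += length
        else:
            seen.add((seq, length))
    return retrans_count, retrans_bytes, total_bytes


# Hypothetical capture: the segment starting at seq 1001 is sent twice.
sample = [(1, 1000), (1001, 1000), (1001, 1000), (2001, 1000), (3001, 0)]
count, resent, total = retransmission_stats(sample)
print(f"{count} retransmitted segment(s), {resent} of {total} payload bytes resent")
```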

            • wop@infosec.pubOP
              1 year ago

              Gotcha! - I thought Wireguard might have some logging features that could provide some insights. Thank you.

  • Kazaii@sh.itjust.works
    1 year ago

    Pretty good suggestions here. Can’t remember the last time I saw such quality replies on r/networking.

    • wop@infosec.pubOP
      1 year ago

      Ping - Update 2. Your numbers are still missing since I haven’t had time to look into the pcaps yet. I hope I can get it done by the end of the week, but we are a little bit wiser.

    • [email protected]@sh.itjust.works
      1 year ago

      Just saw this part:

      the whole location in UK […]

      Some VPN solutions downgrade the MSS of all VPNs to the lowest common denominator for things like MTU/MSS. I guess that can make sense in a full-mesh, but whatever.
      Take a packet capture of another client while the problem one connects, you’ll likely see something.
      Decrypted traffic is usually easier to analyze.

      Ohhh and you say that’s when they connect through SSH? Check that he’s not TCP-forwarding all traffic through his SSH connection somehow.

      • wop@infosec.pubOP
        1 year ago

        Getting a pcap of another client could bring some insight, yeah.

        SSH is used for the data transfer. Without knowing it at this moment, I’d assume scp or rsync. You mean whether all their internet traffic is routed through the active SSH session?

        • [email protected]@sh.itjust.works
          1 year ago

          I mean that in an SSH connection you can configure it to bind local/remote ports of local/remote IPs.
          The user might have unknowingly or maliciously configured their stuff to either:

          • forward all their traffic through the ssh session, adding more bandwidth than you’re expecting
          • remote port forward something important that’s somehow used by all your users to his machine. This is a bit unlikely, but then your symptoms are a bit weird.

          Unlikely, because they couldn’t bind a port that is already in use on the server. Still, that could technically happen if there’s a misconfigured load balancer, maybe from an old config that was never removed, that has that server as a member and just declares it down/up when that user starts listening on that port.

          That last one is far-fetched.
          I’d start with cpu/mem, mtu/mss, etc.

          I tend to have a bit of a bias towards absolutely far-fetched things because I’m basically the last line of support where I work. This means most of the “normal” problems get filtered out before they get to me, which leaves me with the stuff that’s bananas.
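One quick way to rule the SSH forwarding idea in or out is to look for forwarding directives in the user's SSH client config. A minimal sketch, assuming an OpenSSH-style config file; `LocalForward`, `RemoteForward` and `DynamicForward` are real `ssh_config` keywords, while the function name and sample config are hypothetical. It will not catch forwards given on the command line (`-L`, `-R`, `-D`).

```python
FORWARD_KEYWORDS = {"localforward", "remoteforward", "dynamicforward"}


def find_forward_directives(config_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for every forwarding directive in an ssh config."""
    hits = []
    for lineno, raw in enumerate(config_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Keywords are case-insensitive and may be separated by spaces or one "=".
        parts = line.replace("=", " ", 1).split(None, 1)
        if parts and parts[0].lower() in FORWARD_KEYWORDS:
            hits.append((lineno, line))
    return hits


# Hypothetical client config for illustration.
sample_config = "\n".join([
    "Host de-server",
    "    HostName 192.0.2.10",
    "    DynamicForward 1080",
    "    # LocalForward 8080 localhost:80",
    "    RemoteForward 8443 localhost:443",
])

for lineno, line in find_forward_directives(sample_config):
    print(f"line {lineno}: {line}")
```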

    • wop@infosec.pubOP
      1 year ago

      Not yet. Just got access to the test clients and I have planned a troubleshooting session for tomorrow morning. Not a big fan of stress testing the network on a working day haha