March 19th, 2006
I fancy that I now understand a little bit of what it's like to be a physicist dealing at the subatomic level.
For a while, I've been looking into ways of limiting bandwidth in linux. It's been a fairly long journey. I started off with a couple of userspace utilities and then moved on to the traffic shaper in the kernel and an obscure bit of code that Alan Cox threw out to use with the shaping device.
About a month ago, I stumbled upon the Linux Advanced Routing & Traffic Control HOWTO. I hadn't really dealt with any of the iproute2 tools by that point.
I played with it for a while, using a set of priority queues (pfifo_fast) and marks from iptables to push certain packets through before others. That was OK, but a particular box on our net which is _extremely_ chatty was overwhelming everything else.
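For reference, a rough sketch of that earlier setup -- eth0, the mark value, and the chatty box's address (10.0.0.42 here) are all stand-ins, since the real values aren't in my notes:

```shell
# Mark the chatty box's packets in iptables (mark value is arbitrary)...
iptables -t mangle -A POSTROUTING -s 10.0.0.42 -j MARK --set-mark 2

# ...then attach a prio qdisc (pfifo_fast-style bands) and use the fw
# classifier to steer marked packets into the lowest-priority band.
tc qdisc add dev eth0 root handle 1: prio
tc filter add dev eth0 parent 1: protocol ip prio 1 handle 2 fw flowid 1:3
```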
So then I came across the Token Bucket Filter (tbf) which effectively put a hard limit on anything pushed into it. I slapped one for ~100k on our net and that did the trick rather nicely, except that all NFS access to the box in question was dead 90% of the time. That was affecting the download and install of updated packages and that's No Good(tm).
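The tbf setup looked roughly like this -- the burst and latency values here are illustrative, not necessarily what I actually ran:

```shell
# Hard-cap everything leaving eth0 at ~100kbit. tbf needs a rate, a
# burst (bucket size), and a limit or latency bound.
tc qdisc add dev eth0 root tbf rate 100kbit burst 10kb latency 70ms
```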
With some amount of resignation, I accepted that I was going to have to start using its queuing classes. So I set up a Hierarchical Token Bucket (htb) at the device and started adapting what I knew from tbf. I added a subclass with the same limits as tbf and had everything flowing that way by default. So far, so good, but that's not doing anything different.
So I set up another one with a limit approximating the theoretical speed of the switch (100mbit) and tried to take the easy way out by learning filters and using them to push through the firewall marks I'd applied for pfifo_fast. Somewhere in there, I realized that I ought to be writing real iproute2 filters because a number of them had really cool stuff in them (e.g. a text search on the packet, so any packet containing "cerise" would be pushed ahead of any others).
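That text search, as far as I can tell, is the "basic" classifier's text ematch (assuming your kernel has ematch support compiled in) -- something like this, using the class ids from the setup below:

```shell
# Shove any packet containing "cerise" into the unlimited class.
# kmp names the string-search algorithm (Knuth-Morris-Pratt).
tc filter add dev eth0 parent 1:0 protocol ip prio 2 basic match 'text(kmp cerise)' flowid 1:1
```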
But now on the limited side, certain processes were starving. Ping times through those classes were taking extraordinary amounts of time (which was really fun -- a ping from b to google takes ~70ms. A ping from the limited box to google takes anywhere from 1000 to 4000ms). So I implemented Stochastic Fairness Queuing (sfq) -- a complicated-sounding way of saying that roughly each connection shall be evaluated for speed. With the approximate packet rate taken into account, it'll do its best to ensure that the slower ones are serviced at the approximate rate and limit the fast ones as little as possible to accommodate them.
Otherwise known as, y'know, fair. 8)
So, in effect, you end up with this tree of classes which have the network device as a root. When a packet arrives, it goes from the root through classes (determined by filters and the default for that class) to a node from which no other decisions can be made (a leaf node). When that class decides it's time to send that packet, it pushes it up the line to the next class which eventually sends it out. So on and so forth until it falls out the device and on the wire.
That brings me to the current state of things:
/sbin/tc qdisc add dev eth0 root handle 1: htb default 10
(This sets up the root qdisc. It's an htb which serves only to be the parent of the htb classes to be set up under it. No queueing is involved here.)
/sbin/tc class add dev eth0 parent 1: classid 1:1 htb rate 99mbit burst 0 cburst 0
/sbin/tc class add dev eth0 parent 1: classid 1:10 htb rate 80kbit burst 0 cburst 0
(Now begins the fun! Two classes, both child nodes of the root htb. One has a limit of 99mbit/s, the other a limit of 80kbit/s. The burst and cburst stuff allows periods of faster activity if the line's been slow for a while; I've set those to 0. Note that the default on the root qdisc is "10". That corresponds to classid 1:10 -- the limited class.)
/sbin/tc qdisc add dev eth0 parent 1:10 handle 10 sfq perturb 10
(Use sfq under the limited htb. The perturb argument is complicated -- it involves the hash tables sfq uses to track connections. In this case, every 10 seconds it reseeds the hash to avoid persistent collisions. I only have it on the limited class because I couldn't care less how things go on the fast line. I just want 'em out!)
/sbin/tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip src 10.0.0.0/8 match ip dst 10.0.0.0/8 flowid 1:1
/sbin/tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip src $MY-EXTERNAL-IP match ip dst 10.0.0.0/8 flowid 1:1
/sbin/tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip src 10.0.0.0/8 match ip dst $MY-EXTERNAL-IP flowid 1:1
(filters. These are all installed at the parent htb class. As soon as the parent htb class is ready to toss its packet down the tree to a leaf node, it goes through these filters. If the packet matches its criteria, then it's tossed to the class with the id in "flowid". In this case, the unlimited class. These classify any packet staying in my network and tell them to move along as fast as possible. It took me a bit to realize that I needed to have filters for packets from or to my external IP as well because of my iptables NAT rule. "prio" lets you specify an order if you care. I don't).
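(For completeness, the NAT rule in question is the usual MASQUERADE sort of thing. This is a guess at its shape -- I didn't paste the real rule here, and the interface is assumed:

```shell
# Internal traffic leaving via NAT gets rewritten to my external IP,
# which is why the filters above have to match that address too.
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
```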
/sbin/tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip dport 80 0xFFFF match ip src 10.0.0.0/8 flowid 1:1
/sbin/tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip sport 80 0xFFFF match ip dst 10.0.0.0/8 flowid 1:1
/sbin/tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip dport 22 0xFFFF match ip src 10.0.0.0/8 flowid 1:1
/sbin/tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip sport 22 0xFFFF match ip dst 10.0.0.0/8 flowid 1:1
(hibachi filters. Any ssh packet to or from this box and any web packet to or from this box should ride roughshod over the rest and head for the sunset as fast as possible. This could in theory flood out all other traffic, but because this is a NATted network, all packets going to this box are either ssh, scp, or wgets on the same net, or packets from the outside which I specifically requested by starting an ssh/scp session to elsewhere or pointing a browser somewhere (speaking of which -- I should probably write a pair for https and gopher). In other words, if the network's slow because of these rules, it's because I want it to be.)
In reality, some of these rules are useless -- I have them in for the case where I start ingress queuing alongside my egress queuing (with ingress, packets are coming in too fast, so you throw them away and hope the other box gets a clue; with egress, you send packets out on a schedule that you set up yourself). I doubt I'll ever care, really.
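If I ever do take up ingress queuing, the classic shape of it is a policer hung off the special ingress qdisc -- a sketch, with illustrative numbers:

```shell
# Attach the ingress qdisc, then police everything over ~80kbit by
# dropping it on the floor and letting the sender's TCP figure it out.
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol ip u32 match u32 0 0 police rate 80kbit burst 10k drop flowid :1
```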
Ultimately, the solution I found is niche, arcane black magic that I've come to understand. Now I see it in everything net related in various forms and understand what the optimal application for it is. It's like knowing that there's Leeuwenhoek's 'wee beasties' in everyone or understanding how nucleon-nucleon interactions are mediated by mesons in every atom.
Comment -- March 19th, 2006 11:46 pm (UTC):
I just set up some basic QoS rules, and apart from some weirdness with my emule, it works exactly as I want it to. Your way sounds like a lot more fun though, maybe I'll redo it, just cuz. ;)
I'm curious what you were using to set your QoS rules -- I assume it's some interface on your router.
It's quite probably using the same things -- this is all using kernel stuff. Naturally, that means that in order to use it, you'll have to compile a ton of modules & possibly your kernel.
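In case it helps: assuming a modular kernel, the relevant options live under "QoS and/or fair queueing" in menuconfig, and the resulting modules have names like these:

```shell
# Load the schedulers and classifiers used above (module names from
# the kernel's net/sched directory).
modprobe sch_htb    # Hierarchical Token Bucket
modprobe sch_sfq    # Stochastic Fairness Queueing
modprobe sch_tbf    # Token Bucket Filter
modprobe cls_u32    # u32 filters
```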
I modularize anything that looks even remotely fun on the theory that if I'm ever bored and projectless, I can just skate through a menuconfig sniffing at anything interesting. Occasionally, I even do it.