First, another aside: BGP is a BEAST. If I went very in depth here, this article would be huge. Thus, please don't expect to be able to configure an ISP connected full BGP internet peer after reading this. :-)
BGP - The Border Gateway Protocol
For the uninitiated, BGP is "what the internet runs". Since the internet USED to be called the ARPANET, and since ARPANET was seen as a huge asset for the DoD, BGP was built fundamentally for survivability. I.e.: if the Russkies nuked a major ARPANET hub (say San Francisco), they wanted BGP to quickly and efficiently route around that outage. As ARPANET spread like wildfire after it turned into what we know as the internet today, scalability was added to BGP's fundamental design tenets. As wonky as BGP can be, it lives up to these two disparate design requirements quite well.
BGP v1 was "born" in RFC 1105 in 1989, and the current version is BGP v4, released in RFC 1654 in 1994. It is notable that the internet basically runs on a routing protocol that is pushing one decade in age. Being fairly old, it shows some serious similarities to that other fairly old protocol, RIP. The need for BGP came about when, in the early 1980's, ARPANET admins/designers saw that the protocol they were running, Gateway-to-Gateway Protocol, didn't scale. GGP required every gateway to know about every other gateway, and its networks (routers were called "gateways" back then). It was pretty obvious this thing would just eventually fall over if it kept growing, and they knew it would keep growing. So, the admins floated the idea of an "Autonomous System", wherein only those devices within the AS would know all of that AS's routes. Also, the admins of an AS would be free to run their network how they pleased. To keep track of AS's, each was assigned an AS number (a 16 bit integer, with 64512-65535 reserved a la 10.x.x.x in the IP world) by the same authority that handed out and tracked IP addresses. [A quick aside: I find it ironic that they chose only a 16 bit integer to track each unique AS number. When the whole reason for this new protocol was scalability, why only allow 64511 unique AS's worldwide?] Damnit, elguapo! Enough history, tell us about this BGP!
OK, first, some more concepts:
Autonomous System (AS) - this is simply all of the routers configured and run by a single entity. Take the aforementioned RIP: all of the routers that a company might configure to run RIP could be considered an AS. A company could break their network up into divisions by AS, or not. That's totally determined by their local conditions.
Interior Gateway Protocol (IGP) - This is a protocol designed to run within an AS. RIP, for instance, is an IGP.
Exterior Gateway Protocol (EGP) - This is a protocol that is designed to exchange information between differing AS's. BGP is an EGP. Note: EGP's can run within the AS as well, and in fact this is what creates the two "flavors" of BGP. Internal BGP (iBGP) is when two BGP peers are within the same AS; External BGP (eBGP) is when two peers are in differing AS's.
OK, on to BGP itself. BGP is a Distance Vector Protocol. (See the first article if that term is foreign to you!) It is also, oddly enough, not really a routing protocol at all! You'll see why in a bit. As you'll recall from the prior articles, Distance Vector Protocols use the concept of "Hop Count" to make routing decisions (anal-retentive types will point out that BGP is actually a path-vector protocol. OK). BGP's "Hop Count" is a concept called "AS Path". AS Path is just that: "Which AS's did I have to go through to get to that network?" "Out of the box", BGP simply grabs the shortest AS path, and shoots the packet in question that way. It doesn't take a rocket scientist to see that BGP could easily bite you in the ass. If a network is one AS away over a 56 kbps link, or two AS's away over 45 Mbps DS3s, BGP's sending that puppy over the 56k link every time. I know, make a scrinchy face. But that's BGP. That obviously blows chunks from a logical routing decision standpoint, so BGP gives you lots and lots o' metrics and other tricks to keep that kind of stupid shit from happening. But note: you have to do it! BGP very muchly "gives you enough rope to hang yourself", as it were.
BGP operates on TCP port 179. It uses TCP so that BGP doesn't have to spend its own resources on communications reliability; TCP will handle acks, retransmissions, etc. Here's the rub: since BGP only exchanges AS_PATH and network information, it really doesn't have a mechanism for "finding" its neighbors, unless it has a directly connected route to each and every one of them. Having a directly connected route to each neighbor is feasible if your AS has, say, 10 or fewer BGP peers. What if you have 600? Wiring every one of them to every other one would require 179,700 connections!! (n*(n-1)/2, FYI) Hey, elguapo, I thought you said this thing scaled? Well, it does. It is a very common practice to run an IGP (like OSPF) inside your AS for the sole purpose of having your BGP peers be able to "find" each other, thus eliminating the need for directly connected BGP peers.
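For the record, here is the full-mesh arithmetic as a quick sketch, counting each link between two routers exactly once. The function name full_mesh_links is mine, purely for illustration.

```python
def full_mesh_links(n: int) -> int:
    """Links needed to directly connect each of n routers to every other one."""
    # Each of the n routers pairs up with the other n-1; dividing by 2
    # keeps us from counting the A-B link and the B-A link separately.
    return n * (n - 1) // 2


for peers in (10, 600):
    print(peers, "peers ->", full_mesh_links(peers), "links")
```

Ten peers need 45 links, which is annoying but doable. Six hundred is where you give up and let an IGP do the "finding".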
So, how's it route? A BGP route update will look something like this: 172.18.0.0/16 (325, 127, 1256) That is, the class B network 172.18.x.x is reachable via AS's 325, 127 and 1256 - in that order. The first thing BGP will do with this update is search the AS_PATH for its own AS. If its own AS is in the update, then adding this route to its routing table would cause a loop. So if it is there, it just ignores that update. At this stage, BGP also takes note of whether it learned the route from an iBGP peer or an eBGP peer. If it gets two of the same routes, with identical AS_PATHs, it will choose the eBGP route over the iBGP route. The next thing it does is compare that update to any other BGP routes to that network (it keeps a separate "BGP table" for all the updates). If this one has the shortest AS_PATH, it becomes the best route, replacing whatever was best before. At any one time, there may be countless entries for the same route in the BGP table, but BGP will only populate the "live" routing table with the best one. Maybe now you can see why I said earlier that BGP wasn't really a routing protocol: It's really just a "prefix exchanger". (to borrow a term from a buddy at my previous employer - thanks Bill!!)
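The "out of the box" steps just described can be sketched in a few lines of Python. This is a toy model, not how any real router is coded: the Route class, the function names, and the local AS number 65001 are all made up for illustration, and the eBGP-over-iBGP tiebreak is applied whenever the two AS_PATHs are the same length.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

MY_AS = 65001  # hypothetical local AS number, taken from the private range


@dataclass(frozen=True)
class Route:
    prefix: str
    as_path: Tuple[int, ...]
    from_ebgp: bool  # True if learned from an eBGP peer, False for iBGP


def accept(route: Route, my_as: int = MY_AS) -> bool:
    """Loop check: ignore any update whose AS_PATH already contains our AS."""
    return my_as not in route.as_path


def better(a: Route, b: Route) -> Route:
    """Shorter AS_PATH wins; on a tie, eBGP beats iBGP; otherwise keep a."""
    if len(a.as_path) != len(b.as_path):
        return a if len(a.as_path) < len(b.as_path) else b
    if a.from_ebgp != b.from_ebgp:
        return a if a.from_ebgp else b
    return a


def best_routes(bgp_table: List[Route]) -> Dict[str, Route]:
    """Pick the one route per prefix that goes into the 'live' routing table."""
    best: Dict[str, Route] = {}
    for route in bgp_table:
        if not accept(route):
            continue  # would cause a loop, so drop it
        current = best.get(route.prefix)
        best[route.prefix] = route if current is None else better(current, route)
    return best


updates = [
    Route("172.18.0.0/16", (325, 127, 1256), from_ebgp=True),
    Route("172.18.0.0/16", (400, 1256), from_ebgp=False),
    Route("172.18.0.0/16", (65001, 1256), from_ebgp=True),  # contains our AS
]
print(best_routes(updates)["172.18.0.0/16"].as_path)
```

The third update is thrown away by the loop check, and the two-AS path beats the three-AS path, so the printed AS_PATH is (400, 1256). Note that nothing in here knows or cares how fast any of those links are.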
This brings us to those metrics I mentioned earlier. (The RFCs call them "path attributes", and my take on them is totally Cisco-BGP centric. It's what I know, sorry). There are various types of these metrics, basically resulting from BGP engineers getting continually bitten in the ass, and therefore tacking on one more metric to fix whatever was the "problem du jour". (A full blown assumption on my part there) Those types are:
Well Known Mandatory (WKM) - Well known means all BGP vendors need to support it, and Mandatory means just that, it has to be present in every update.
Well Known Discretionary (WKD) - Again, all vendors need to support it, but Discretionary means it doesn't need to be present in each and every update.
Optional Transitive (OT) - Optional means a vendor can support it, or not. If it chooses not to, then it just ignores that part of the update. Transitive means that if it does choose to ignore it, it should still leave it in the update when it passes that update on to its peers.
Optional Non-Transitive (ONT) - Again, Optional means the same, but Non-transitive means that if you choose to ignore it, you can drop that metric from the updates you forward to your peers.
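Here is one way to picture those four types, as a small Python sketch. The enum and the function name are my own invention for illustration; the only logic in it is the rule for what a router does with a metric it does not understand.

```python
from enum import Enum


class AttrType(Enum):
    WELL_KNOWN_MANDATORY = "WKM"
    WELL_KNOWN_DISCRETIONARY = "WKD"
    OPTIONAL_TRANSITIVE = "OT"
    OPTIONAL_NON_TRANSITIVE = "ONT"


def forward_unrecognized(attr_type: AttrType) -> bool:
    """Should a router that does NOT understand this metric still pass it on?"""
    if attr_type in (AttrType.WELL_KNOWN_MANDATORY,
                     AttrType.WELL_KNOWN_DISCRETIONARY):
        # Every vendor has to support these, so "unrecognized" can't happen.
        raise ValueError("well-known metrics must be understood by everyone")
    # Transitive: leave it in the update. Non-transitive: drop it.
    return attr_type is AttrType.OPTIONAL_TRANSITIVE


print(forward_unrecognized(AttrType.OPTIONAL_TRANSITIVE))
print(forward_unrecognized(AttrType.OPTIONAL_NON_TRANSITIVE))
```

The first call prints True (pass it along untouched) and the second prints False (drop it).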
Now for the metrics:
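- ORIGIN (WKM) - Just that, how the route originally got into BGP: from an IGP, from the old EGP protocol, or "incomplete" (anything else, such as redistribution).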
- AS_PATH (WKM) - Just that, the list of AS's you have to go through to get to that network. If you're sending an update to an eBGP peer, you will prepend your own AS to the AS_PATH.
- NEXT_HOP (WKM) - this is the address of the next-hop router. "Well, WTF? Isn't this always going to be the address of the router that sent the update?" Nope. If you're advertising an eBGP route to an iBGP peer, NEXT_HOP will be the eBGP peer, not the advertising router. Otherwise, yes, it's the advertising router. You can force the advertising router to put itself in as the next hop with "next-hop-self" (on Cisco, "neighbor <address> next-hop-self") when configuring the router.
- LOCAL_PREF (WKD) - This is an attribute that doesn't leave your AS, as it is just what it says. It is your AS's "Local Preference" on how to deal with that route. You could use this attribute to fix the 56k vs DS3 example I used above. Just set the LOCAL_PREF of the DS3 to 200, and the 56k to 100, and your AS will use that DS3 unless it goes away.
- ATOMIC_AGGREGATE (WKD) - This is how a BGP speaker tells its peers that it's summarized numerous smaller routes into one big route. Why does this matter? Because when it does this, it whacks those smaller routes' AS_PATHs, and substitutes its own AS.
- AGGREGATOR (OT) - If a BGP peer does aggregate routes, this is a method for letting them know who did the aggregating.
- COMMUNITY (OT) - this lets you tag routes with BGP "Communities", thereby letting you apply consistent BGP policy to huge swaths of routes, without having to set those metrics for each and every route individually. Having an "iBGP" community and an "eBGP" community is a pretty obvious example.
- MULTI_EXIT_DISC (ONT) - MED is sort of a LOCAL_PREF, but for the outside world. It's a way of telling an eBGP peer how you'd like them to send traffic into your AS.
- ORIGINATOR_ID (ONT) - This prevents route loops when using "route reflectors". Think of a route reflector (RR) as a "route distributor". It kind of hands out routes to its clients, thereby allowing net admins to minimize the number of full BGP peers. If I get a route with the ORIGINATOR_ID set to my router ID, we have a loop.
- CLUSTER_LIST (ONT) - This is a list of RR cluster ID's through which that route has passed. Kind of like an AS_PATH of RR clusters. If a RR sees its own cluster ID in a received route, it knows there's been a loop.
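To make a couple of those metrics concrete, here is a toy Python sketch of LOCAL_PREF fixing the 56k vs DS3 problem, plus the two route reflector loop checks. All names, AS numbers and router IDs are hypothetical, and the default LOCAL_PREF of 100 is the usual Cisco default. Real BGP implementations compare quite a few more things than this.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Path:
    via: str
    as_path: Tuple[int, ...]
    local_pref: int = 100  # usual Cisco default


def prefer(a: Path, b: Path) -> Path:
    """Higher LOCAL_PREF wins first; AS_PATH length only breaks ties."""
    if a.local_pref != b.local_pref:
        return a if a.local_pref > b.local_pref else b
    return a if len(a.as_path) <= len(b.as_path) else b


def originator_loop(originator_id: str, my_router_id: str) -> bool:
    """A route carrying my own router ID as ORIGINATOR_ID has looped back."""
    return originator_id == my_router_id


def cluster_loop(cluster_list: Tuple[str, ...], my_cluster_id: str) -> bool:
    """A RR that finds its own cluster ID in CLUSTER_LIST has found a loop."""
    return my_cluster_id in cluster_list


slow = Path("56k link", (700,), local_pref=100)       # one AS away
fast = Path("DS3 links", (800, 700), local_pref=200)  # two AS's away
print(prefer(slow, fast).via)
print(cluster_loop(("1.1.1.1", "2.2.2.2"), "2.2.2.2"))
```

With LOCAL_PREF set, the DS3 path wins even though its AS_PATH is longer, so the first line printed is "DS3 links". The second prints True, since cluster 2.2.2.2 is seeing its own ID come back.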
Well, that's it. Using and manipulating those metrics is how a net admin can "steer" traffic within his network, and how he can try and steer traffic from the outside world into his AS. Why "try"? Well, MED, for instance, is optional. Your eBGP peers may just ignore the damn thing. If you're multihomed to numerous eBGP peers, and your managers throw out the term "load balancing", run for cover. In my experience, load balancing via BGP blows chunks. Almost always, one of your eBGP ISP peers is going to be "better connected", and thus that link is going to get a majority of your traffic. Note that in the design goals I described above, scalability and rerouting around major outages were why BGP was made in the first place. "Load balancing" was never in the mix, so I can't fault BGP for doing what it was designed to do, I guess.