BGP in the Enterprise via some overlay, underlay, VXLAN and leaf-spine discussion
You have a routing protocol. It works for large scale inter-domain routing (between Autonomous Systems), and inside an AS it expects an n-squared full mesh of iBGP peerings.
It doesn’t have a link metric, it has no real policy metrics, and it converges slowly (the Internet never really converges anymore), but you want to use it in the Enterprise?
Bad idea? Well maybe in a traditional Enterprise network, but with leaf-spine architecture for Data Centres could this be a goer? A good few people think so and are implementing it today. Let’s have a look at some of this.
BGP is effectively built from two pieces, the protocol itself and the TCP transport which is used to carry protocol messages between peers.
Some acronyms follow, but hang on, hopefully this will become clear as you continue to read down.
If you are familiar with leaf-spine architectures there is a lot of noise around using BGP within the fabric, either eBGP or iBGP, (AS per leaf-spine, or everything in the same AS) as a DC routing protocol of choice.
There is also a lot of discussion around Multi-Protocol BGP (MP-BGP) being used with VXLAN for DC interconnect (Ethernet VPN or EVPN). This effectively loads the NLRI portion of BGP with MAC addresses, with the RD (Route Distinguisher) added to ensure uniqueness, ultimately to extend Layer 2 segments across Data Centres over Layer 3.
VXLAN is an L2 extension technology over a shared Layer 3 underlay infrastructure.
As briefly as possible (as this will be covered in more detail throughout), VXLAN will encapsulate the original Layer 2 frame, MAC addresses and all, in a VXLAN header or tunnel (tunnels need an end-point, which is why you have VTEPs, or VXLAN Tunnel End Points). From there, traffic between end hosts in the same VNI (the virtual network you are in) needs to be tunneled through the L3 underlay network, which means that VTEP devices for a given VNI need to be able to tunnel frames to other end hosts in this VNI over Layer 3. You can now also use MP-BGP to augment this and get some extra advantages when connecting Data Centres over L3.
If the above all sounds like gobbledy-gook, hang in there. Let’s go back a few steps and try to make things get a little clearer below.
So how did we get here? We are going to climb a number of steps to get to where we want to go.
First of all back to Layer 2 (think MAC address identifiers and broadcast domains). This is always seen as easy stuff from an application point of view. Why? Because you can keep your IP addresses the same, move wherever you like and the network underneath will just sort itself out. However, a Layer 2 domain (or VLAN) is of course a broadcast domain. Connect this together in the wrong way and you can create loops. Loops mean broadcast storms and network meltdowns – this is bad! This is why you have spanning tree to block these loops. So, in short, Layer 2 doesn’t scale, and you need to take a lot of care to avoid loops, or paths to network Armageddon. But on the surface, from an application viewpoint, it does seem easy.
So why do people want to spread their VLANs or L2 domains everywhere? Like extending L2 across Data Centres? Well, it means as an app developer or server guy or gal, you don’t have to change your IP address whenever you move. If you want to move a Virtual Server to another platform for business and app continuity, you can just vMotion it. You don’t have to change a thing. The traffic will sort itself out and find the new location.
vMotion certainly became a compelling driver here. From a network point of view, there are a bunch of traffic tromboning caveats, and lots to consider, but from a server view you don’t have to touch anything so all seems golden.
This comes down to the classic tension between servers, apps and networks. “Just give me a VLAN for my servers everywhere. I just want to spec and forget.” As an app developer, I may have an awareness of IP addresses, but I really don’t want to be changing these all the time just because I move stuff around, or I move my server/VM. I want to develop my app to work in a world of unlimited resources, memory, compute, infrastructure and bandwidth. I am oversimplifying to illustrate a point of course.
So why do I have to change things like IP address when I move? Well, if I move across a Layer 3 boundary then I hit one of the characteristics of IP addresses. An IP address bundles together two characteristics: 1) identity and 2) location. So if I move a server with the same IP address, my identity has stayed the same, but my location has changed. I now need to play with the network to make sure that traffic for my app gets to the new location. BUT my location info has stayed THE SAME!
Imagine moving house, your name never changes, but you also keep the same home address as well. The postman is going to get pretty confused as to where, physically, to deliver your mail. You will need to put in mechanisms to say, “hey Postman, we are not here anymore we are in another town.. oh and with exactly the same address. What? You don’t cover that district? Ok, you need to tell the post-office to redirect it to the new place, and yes, it has the same address…. I don’t care how you do it, you are a clever postman, I am sure you will work something out…”
At Layer 2, however, you just have an identity (the MAC address), which is mapped to the IP address on the host. Your identifier is no longer just the IP address, so the location can change underneath it and the network just sorts itself out through flood-and-learn semantics.
Ok, so how can you move wherever you like, and how does the network just sort itself out at Layer 2?
Time for some basics…
Very simply, switches are a bunch of ports connected together at Layer 2. Frames arrive on the switch port with a source MAC address and destination MAC address.
In a normal switch infrastructure, if you know the IP address of the device you want to get to at Layer 3, then you ARP for the L2 address or MAC address that is associated with that IP address.
I am an end host or PC and I want to get to another device somewhere, and I know the IP address of the device I want to get to. Well, I send out a broadcast ARP message saying “any of you guys know what MAC address is associated with this IP address?” The switch receives this on the port the host is attached to and, because the frame has a source MAC address, it knows that this identifier/MAC address of the host is attached to this particular switch port. It then makes a note and stores it in its MAC-address table, e.g. MAC A is associated with Port 1 etc. You may also see it referred to as CAM (Content Addressable Memory), which is essentially the structure of the store of information in memory, i.e. where fixed length addresses are stored for fast lookup (the MAC address is a fixed 48-bit, 6-byte address).
If everything is on the same broadcast domain or VLAN, then everything is dandy. The target end host or PC will receive the query (because it will receive the broadcast ARP packet flooded by the switch), and say, “yep that’s me, here is my actual MAC address associated with that IP address”, and reply to the source MAC address. As this reply goes through the switch, the switch says, “cool, I now know which port that specific MAC address is on, let’s log that in my MAC-address table”. If a packet subsequently is addressed to that destination MAC, now the switch knows which port it is on. Additionally if the switch receives traffic for a MAC address it has not seen before (unknown unicast), it floods out of all ports to see if it gets a reply, then makes a note as it passes through.
A confusion I see (often enough to note), is whether a switch broadcasts. In short, switches flood frames, they do not broadcast. If the switch receives a broadcast frame (like an ARP destination broadcast frame), it typically floods this frame out all ports except the receiving port. If you think that end-points send broadcast frames and switches flood frames based on whether it is a broadcast, unknown unicast, or multicast (BUM), then you are good to go in general.
If you ever go on a switch and do an equivalent of show mac-address table it will show you the VLAN, MAC address and associated port.
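The flood-and-learn behaviour described above can be sketched in a few lines. This is a toy model of a single-VLAN switch, not how any real switch is implemented; the class name, port numbers and MAC addresses are made up for illustration.

```python
from typing import Dict, List

BROADCAST = "ff:ff:ff:ff:ff:ff"


class LearningSwitch:
    """Toy single-VLAN switch: learn source MACs, flood what it cannot place."""

    def __init__(self, ports: List[int]) -> None:
        self.ports = ports
        # MAC address -> port it was last seen on (the MAC-address table)
        self.mac_table: Dict[str, int] = {}

    def receive(self, in_port: int, src_mac: str, dst_mac: str) -> List[int]:
        """Process one frame and return the ports it is sent out of."""
        # Learn: the source MAC lives behind the port the frame arrived on.
        self.mac_table[src_mac] = in_port

        out_port = self.mac_table.get(dst_mac)
        if dst_mac != BROADCAST and out_port is not None:
            # Known unicast: one port only (nothing to do if it is the ingress port).
            return [] if out_port == in_port else [out_port]

        # Broadcast or unknown unicast: flood out of all ports except the ingress port.
        return [p for p in self.ports if p != in_port]


switch = LearningSwitch(ports=[1, 2, 3, 4])
# Host A on port 1 sends a broadcast ARP request.
arp_request = switch.receive(1, "aa:aa:aa:aa:aa:aa", BROADCAST)
# Host B on port 3 replies directly to host A.
arp_reply = switch.receive(3, "bb:bb:bb:bb:bb:bb", "aa:aa:aa:aa:aa:aa")

print(arp_request)       # [2, 3, 4]
print(arp_reply)         # [1]
print(switch.mac_table)  # the toy equivalent of the MAC-address table
```

The ARP request is flooded, the reply is forwarded out of one port only, and by the end the table holds both hosts without anyone having configured anything.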
Ok so why this long-winded explanation of the basics?
I mentioned that at Layer 2, a VLAN is a broadcast domain. You can reduce the size of the broadcast domain by limiting the number of hosts in the VLAN, and putting other hosts in other VLANs, so if you want to communicate between the VLANs you now need to go across a router or Layer 3 boundary. Remember routers or Layer 3 boundaries can also be seen as broadcast firewalls.
Say I am a server guy or gal. I like the flexibility of Layer 2 and hate changing IP addresses. (some might remember LAM – Local Area Mobility at this point :-)) Also I don’t want my Virtual Machines to really be that aware of the above gubbins, so wouldn’t it be nice if I could get to any Virtual Machine anywhere in my Datacentre, or even across Data Centres as if it was in the same L2 network? I know, I will ask my network team to just extend L2 Vlans everywhere! Dead easy. Hmmm. that’s odd? The network folk seem to be going very red in the face and screaming “spanning tree” to the heavens. Some of them are even starting to cry… Not sure what I said? Seems simple enough to me?
Enter VXLAN, and I can now do the above without ever talking to these over-emotional jitter-bugs. I can tunnel everything over their network and they don’t have to worry.
I should start by saying that VXLAN is Layer 3 encapsulation of layer 2 traffic (MAC in UDP-IP). It allows you to tunnel your Layer 2 network over a Layer 3 network. So now the network team can have their separate, tidy, layer 3 network within Datacentres, or even between Datacentres, and get rid of spanning tree wherever they like. They don’t have to extend the actual Layer 2, and we can now have our virtual Layer 2 segment over the top of the IP network – an overlay!
The VXLAN header itself is only 8 bytes, but the full encapsulation adds 50-54 bytes once you count the outer headers (so watch that MTU). It has a 24-bit VNI (VXLAN Network Identifier), which enables 16 million+ segments as opposed to 4096 with VLANs. How many Enterprises have you seen exceed 4096 VLANs in their current DC environment? Ok, but at serious scale, for cloud, fair enough.
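To see where the 50-54 bytes and the 16 million figure come from, here is a quick arithmetic sketch. It assumes an outer IPv4 header with no options, treats the extra 4 bytes as an optional 802.1Q tag on the outer frame, and ignores the Ethernet FCS. The constant names are just illustrative.

```python
# Sizes in bytes of each header that VXLAN encapsulation adds around the original frame.
OUTER_ETHERNET = 14  # outer MAC header, untagged
OUTER_DOT1Q = 4      # optional 802.1Q tag on the outer frame
OUTER_IPV4 = 20      # outer IPv4 header, assuming no options
OUTER_UDP = 8        # outer UDP header
VXLAN_HEADER = 8     # the VXLAN header itself

overhead_untagged = OUTER_ETHERNET + OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER
overhead_tagged = overhead_untagged + OUTER_DOT1Q

# A full-size inner frame: 14-byte Ethernet header plus a 1500-byte payload.
inner_frame = 14 + 1500
outer_frame = inner_frame + overhead_untagged

vni_segments = 2 ** 24  # 24-bit VNI
vlan_ids = 2 ** 12      # 12-bit VLAN ID

print(overhead_untagged, overhead_tagged)  # 50 54
print(outer_frame)                         # 1564
print(vni_segments, vlan_ids)              # 16777216 4096
```

So a standard full-size frame no longer fits in a standard-size frame on the underlay, which is why the underlay MTU usually needs raising.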
Physical addresses with VXLAN:
Outer Dst MAC addr (MAC of the tunnel endpoint VTEP)
Outer Src MAC addr (MAC of the tunnel source VTEP)
Outer IP Dst addr (IP of the tunnel endpoint VTEP)
Outer IP Src addr (IP of the tunnel source VTEP)
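Between those outer headers and the original frame sits the 8-byte VXLAN header. As a rough sketch of its layout as I read RFC 7348 (one flags byte with the "VNI present" bit set, 24 reserved bits, the 24-bit VNI, then 8 more reserved bits), here is how it could be packed and unpacked. The function names and the VNI value 5000 are made up for illustration.

```python
import struct

VXLAN_UDP_PORT = 4789        # IANA-assigned destination UDP port for VXLAN
VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag: the VNI field is valid


def pack_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a given VNI."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags in the top byte, 24 reserved bits left as zero.
    # Word 2: VNI in the top 24 bits, 8 reserved bits left as zero.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vni(header: bytes) -> int:
    """Read the VNI back out of a VXLAN header."""
    flags_word, vni_word = struct.unpack("!II", header[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI flag not set")
    return vni_word >> 8


header = pack_vxlan_header(5000)
print(len(header))         # 8
print(header.hex())        # 0800000000138800
print(unpack_vni(header))  # 5000
```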
OK, we now have a small issue. How on earth do you know about the MAC addresses in the same Layer 2 segment or VLAN, that are now on another IP subnet? This is where we use VXLAN (Virtual Extensible LAN), effectively extending the concept of a VLAN across a Layer 3 network through tunneling.
Remember, fundamentally you just want to get a packet from A to B. Always keep this in mind, because no matter what kind of abstraction or fancy acronyms you use, in a packet-switched network you will always be getting a frame or packet from A to B. If you can walk this path you are good to go. (Incidentally when this path seems overly convoluted, hitting lots of different way-points on its travels, it is a sure sign that efficiency is being traded for some other functionality or abstraction).
A VTEP is a VXLAN Tunnel End Point. I mentioned earlier that the MAC address is tunneled in UDP-IP to get across the network. The VTEP provides this association, and has a VTEP IP address associated with itself (source IP address). It also knows which VTEP IP address it needs to go to in order to get the destination MAC to break out of the tunnel where that destination physically lives.
We are getting to some of the meat now relating to the underlying infrastructure.
We still need to find out where all these MAC addresses live, which VTEP they are associated with, so I can forward the frame.
So say I am to ping (ICMP echo) Virtual Machine-2 (VM2) from VM1.
Back to that packet walk. A frame arrives and hits the Vswitch destined for VM2.
If the destination MAC address is local and in the Vswitch MAC-address table, then it simply forwards it out of the local port to that host. If the destination MAC is not local it needs to know which VTEP to forward it to (package it up in a UDP-IP packet, send it to a remote VTEP over Layer 3 to where the MAC resides, and pop it out for normal Ethernet forwarding of the frame at the other end).
The VTEP maintains a MAC address table similar to a standard Ethernet switch. However, instead of just associating an address to an interface, the VTEP additionally associates a Virtual Machine (VM) MAC address to a remote VTEP IP address.
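A minimal sketch of that table might look like the following. It is only a model of the idea (local MACs map to a port, remote MACs map to a VTEP IP, everything keyed per VNI); the class name, addresses and VNI numbers are hypothetical, and how the entries get there in the first place is deliberately left open.

```python
from typing import Dict, Tuple

Key = Tuple[int, str]  # (VNI, VM MAC address)


class VtepTable:
    """Toy VTEP forwarding table: local MACs map to ports, remote MACs to VTEP IPs."""

    def __init__(self, local_vtep_ip: str) -> None:
        self.local_vtep_ip = local_vtep_ip      # outer source IP for anything we tunnel
        self.local_ports: Dict[Key, int] = {}   # MAC is attached to this VTEP
        self.remote_vteps: Dict[Key, str] = {}  # MAC lives behind another VTEP

    def lookup(self, vni: int, dst_mac: str) -> Tuple[str, object]:
        """Decide what to do with a frame for dst_mac in the given VNI."""
        key = (vni, dst_mac)
        if key in self.local_ports:
            # Plain Ethernet forwarding, no tunnel needed.
            return ("local", self.local_ports[key])
        if key in self.remote_vteps:
            # Encapsulate in UDP-IP towards the remote VTEP (outer destination IP).
            return ("tunnel", self.remote_vteps[key])
        # Not known yet: this is where the discovery debate starts.
        return ("unknown", None)


table = VtepTable(local_vtep_ip="192.0.2.1")
table.local_ports[(5000, "00:00:00:00:00:01")] = 7             # VM1 on local port 7
table.remote_vteps[(5000, "00:00:00:00:00:02")] = "192.0.2.2"  # VM2 behind another VTEP

vm1_result = table.lookup(5000, "00:00:00:00:00:01")
vm2_result = table.lookup(5000, "00:00:00:00:00:02")
other_vni = table.lookup(5001, "00:00:00:00:00:02")

print(vm1_result)  # ('local', 7)
print(vm2_result)  # ('tunnel', '192.0.2.2')
print(other_vni)   # ('unknown', None)
```

Keying on the VNI as well as the MAC is what keeps the same MAC in two different virtual networks from colliding.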
As VM1, if I don’t know VM2’s MAC address then I send out an ARP, “does anyone in my network segment at Layer 2 have the MAC for the following IP address? Can you please respond with the associated MAC address?” (I like to think switches remember their manners and say “please”.)
Here is where the debate begins!
The Vswitch either knows where all the MAC addresses live in the system, which VTEP – IP address they are associated with etc. or it needs to go on a journey of discovery.
The debate revolves around how you populate these tables with MAC addresses to VTEP mappings across your infrastructure.
There are a few options here. You can either manually pre-populate the MAC tables and associations of switches and VTEPs, because you, with your all seeing eye, know where everything is on all the switches (manual provisioning).
OR you can query a controller (in an SDN setup, something like OpenDaylight, or an NSX controller for VXLAN), which will populate the MAC addresses statically or dynamically.
Alternatively, on the control-plane discovery side, the way this was initially done with VXLAN was using multicast. Want to know where the MAC addresses are? Get the VTEPs to join an IP multicast group and we will share what we know locally with the other VTEPs in the group.
At this point most people said. “errrr, what? Ok…no.. no, I don’t think I am turning on multicast across my core infrastructure for that.”
IP multicast with all its complexity, security problems, bugs, lack of skill set, vendor support etc. is not something you enable on a whim across your core infrastructure to solve a trivial problem in my view. Finance houses, who arguably need it and have spent countless man-years getting it to work properly for them, often don’t go for reliable multicast as it adds latency, but go for Live-Live either with redundancy on a separate physical network layer path or redundancy on the server side.
If you don’t know what I am talking about and want to get started, then there are worse places than the below
There is a view that if you just understand the complexity of multicast well enough (usually as a result of having been forced to spend far too much time wrestling with it for CCIE, others must share the pain ;-)), and you ignore the proportionally vast number of multicast bugs from vendors over the years, and know how to work around all of them, then it is just fine! I don’t feel I need to go further here, anyone can write a 100 pager on the pros and (many) cons of multicast, but opinion is just that.
My point is that for VXLAN? Erm no thanks!
Ok, so say I am not sold on using an SDN controller to populate this information in my infrastructure (maybe I have decided to use controllers for other reasons, service functions, flow control, orchestration, whatever), and I really don’t want to use multicast. Have I got any other control plane options?
Can I still use a network control plane to get MAC addresses shared around in a leaf-spine architecture or DC interconnect? Oh and by the way, can I keep the number of MAC addresses flying around to a minimum please?
Remember with all the above, if you are not using multicast groups to which all the VTEPs belong, you somehow need the VTEPs to know about other VTEPs and their associations.
A VTEP needs to know which VTEP to send traffic to.
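To make that concrete, here is a small sketch of the simplest possible answer: a manually provisioned list of which VTEPs serve which VNI, used to work out who should get a unicast copy of a broadcast, unknown unicast or multicast frame. This approach is commonly called head-end (or ingress) replication; the table contents and function name below are hypothetical and not from any particular product.

```python
from typing import Dict, List

# Hypothetical, manually provisioned mapping: VNI -> IPs of every VTEP serving that VNI.
VTEPS_BY_VNI: Dict[int, List[str]] = {
    5000: ["192.0.2.1", "192.0.2.2", "192.0.2.3"],
    5001: ["192.0.2.1", "192.0.2.3"],
}


def flood_targets(local_vtep_ip: str, vni: int) -> List[str]:
    """Remote VTEPs that should each receive a unicast copy of a BUM frame for this VNI."""
    return [ip for ip in VTEPS_BY_VNI.get(vni, []) if ip != local_vtep_ip]


targets_5000 = flood_targets("192.0.2.1", 5000)
targets_5001 = flood_targets("192.0.2.3", 5001)

print(targets_5000)  # ['192.0.2.2', '192.0.2.3']
print(targets_5001)  # ['192.0.2.1']
```

It works without multicast, but somebody (or something) still has to keep that list accurate on every VTEP, which is exactly the problem a control plane is meant to solve.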
So what do we need to do now? Man, you are just making this simple thing way complicated again eh? Whether you think this makes everything simpler or more complicated, welcome to the world of abstraction.
Let’s look at physical and then logical underlays…. (PART 2)