Underlay, Overlay, Wombling free.
This is the bit where it can get mighty confusing when sitting in vendor presentations.
Hopefully we are now starting to frame our thinking in terms of:
1) We have physical infrastructure, or physical underlay – e.g. Leaf-Spine
2) Layer 2 underlay. Using something like TRILL or FabricPath (which in turn use IS-IS to make the L2 network look like a routed L2 network without spanning tree), or even SPB
3) Layer 3 underlay – this is your normal routed network applied to a DC design, L3 connections to every leaf-spine. OSPF, IS-IS, EIGRP, iBGP, eBGP
4) Layer 3 overlay – VXLAN, NVGRE
5) A control plane for the L3 overlay – instead of multicast, a controller or manual entry, a dynamic host discovery and distribution method across all VTEPs. This is the MP-BGP EVPN Address Family (AF). EVPN and VXLAN.
EVPN and VXLAN – see Internet Draft https://tools.ietf.org/html/draft-ietf-bess-evpn-overlay-05
So when you listen to presentations by vendors, they just blast past all this as a given and you are left thinking, “hold on, so how does a packet get from A to B again? Which processes are tied together? Are they? Which is the best choice? Wasn’t TRILL all the rage a few years ago? Why didn’t Ethernet in Ethernet take off? Mac in Mac with SPB? Software vs hardware? Wasn’t I being pushed towards FabricPath and OTV not so long ago? Now there are standards forming I should be doing what? Are we done yet? We are all agreed this is the best way?” I think I need a drink.
So why blast past it? I guess because it takes an age to fully explain. Designs are also in a state of flux. You only have to delve into the documentation to see how much of a “work in progress” this is for many vendors. TRILL is popular in the East, L3 software methods more so in the West and the Valley. (Careful not to add an “s”, as I confess I have no clue as to the dominant protocols in the Rhondda).
That is not to say the architectures are not sound, just that it is still relatively early days.
By my reckoning we now just need to cover numbers 3, 4 and 5 from the rough list above.
Number 3 – Layer 3 underlay
Not going to say a great deal here. Just that the leaf-spine is now a routed network with stable links, so you pick whichever routing protocol you prefer. Some people like IS-IS (at L3 this time) as you don’t have to do a global SPF calculation if a link flaps, but OSPF can work here too (point to point, understand the SPF etc.).
Then there is the, perhaps surprising, protocol you can use here – BGP.
BGP AT LAST!
So you can use BGP as your L3 underlay in a leaf-spine – Yes, crazy BGP in the Data Centre.
Why BGP? Well, because all the links are relatively stable and predictable in a leaf-spine DC design and you don’t expect much change or flapping. It also has a number of advantages, such as prefix distribution, prefix filtering, traffic engineering, traffic tagging and stability in a multi-vendor environment while you are building out your leaf-spine.
It is pretty good at prefix filtering, and we have probably all used it for traffic engineering and traffic tagging. You can match on any attribute or prefix, you can even prune prefixes between switches. You also have BGP communities to get lots of information across (traffic tagging with extended communities).
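To make the prefix filtering idea a bit more concrete, here is a rough Python sketch of the kind of decision a prefix filter makes. It is not any vendor's policy language. The function name, the aggregate and the maximum prefix length are all my own assumptions for illustration.

```python
import ipaddress


def filter_prefixes(advertised, allowed_aggregate, max_len):
    """Keep only prefixes inside the aggregate and no longer than max_len.

    Illustrative only: real BGP policy also matches on communities,
    AS path and other attributes.
    """
    aggregate = ipaddress.ip_network(allowed_aggregate)
    kept = []
    for prefix in advertised:
        net = ipaddress.ip_network(prefix)
        if (net.version == aggregate.version
                and net.subnet_of(aggregate)
                and net.prefixlen <= max_len):
            kept.append(prefix)
    return kept


# Hypothetical advertisements from a leaf
print(filter_prefixes(
    ["10.10.14.0/24", "10.10.14.128/25", "192.168.0.0/24"],
    "10.10.0.0/16",
    24,
))
```

Only the first prefix survives here: the /25 is too specific and the 192.168 prefix is outside the aggregate.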
In short, we have loaded BGP with so much over the years that it starts to have other uses outside of the WAN and you can do all these things across different vendors.
The next decision then (if you buy the above) is iBGP vs eBGP. Pros and cons of course, but you at least want to make sure of support for ECMP (Equal Cost Multi-Path) from the Leaves to the Spines – the more Spines the more ECMP. You also need to consider how many peering sessions you want.
If you look at peering, with iBGP each switch requires a BGP session to every switch in the network (remember the iBGP loop-prevention rule: a route learned from one iBGP peer is not passed on to another), so all leaves must peer with each other, and with each spine – not brilliant. There is a workaround: you can use a route reflector in the Spine, the leaves become route reflector clients and you only have to peer with each spine (route reflector). It does however only reflect the best route, which is not great if there are several equally good routes, as in leaf-spine. There is a feature called AddPath but….
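A quick back-of-the-envelope sketch in Python shows why the full mesh is "not brilliant". The fabric size is made up, and I am ignoring any spine-to-spine sessions in the route reflector case to keep it simple.

```python
def full_mesh_sessions(switches: int) -> int:
    """iBGP full mesh: every switch peers with every other switch."""
    return switches * (switches - 1) // 2


def route_reflector_sessions(leaves: int, spines: int) -> int:
    """Each leaf (client) peers only with each spine (route reflector)."""
    return leaves * spines


leaves, spines = 32, 4  # hypothetical fabric size
print(full_mesh_sessions(leaves + spines))       # 630
print(route_reflector_sessions(leaves, spines))  # 128
```

The full mesh grows with the square of the switch count, whereas the route reflector design grows with leaves times spines.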
So eBGP may therefore make more sense, but you then need multipath across different AS paths for ECMP. Traffic engineering is good though. This can all get way more complicated if going beyond a 3-stage Clos design, so something to keep in mind.
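Here is a heavily simplified Python sketch of why the multi-AS point matters. I am assuming each spine sits in its own AS, and I am modelling only AS_PATH length. Real best-path selection looks at local preference, MED and plenty more. The AS numbers and names are invented.

```python
from typing import List, Tuple

Path = Tuple[str, Tuple[int, ...]]  # (next hop, AS_PATH)


def ecmp_next_hops(paths: List[Path], relax: bool) -> List[str]:
    """Return the next hops that would be installed for ECMP.

    relax=False: AS_PATH must be identical to the first shortest path.
    relax=True:  AS_PATH only has to be the same length.
    """
    best_len = min(len(as_path) for _, as_path in paths)
    shortest = [p for p in paths if len(p[1]) == best_len]
    if relax:
        return [next_hop for next_hop, _ in shortest]
    reference = shortest[0][1]
    return [next_hop for next_hop, as_path in shortest if as_path == reference]


# The same prefix learned via two spines in different (hypothetical) ASes
paths = [("spine1", (65001, 65011)), ("spine2", (65002, 65011))]
print(ecmp_next_hops(paths, relax=False))  # ['spine1']
print(ecmp_next_hops(paths, relax=True))   # ['spine1', 'spine2']
```

Without the relaxed comparison only one spine is used, which rather defeats the point of having several.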
I am not going to go into the nitty gritty here, but hopefully you get a feel that it is at least possible and might have some advantages.
So there you have it. A use-case for BGP in the Data Centre.
Number 4 in the list – L3 Overlay. I have already covered this, describing how VXLAN works at a high level earlier. You can use NVGRE instead, although at the moment the world is converging on VXLAN.
One general principle might be to separate your underlay control plane and your overlay control plane.
The underlay control plane is essentially just point to point links, and VTEPs are known within this. It is all about reachability for building the network topology.
The overlay control plane now adds hosts into this mix, so that is all the end-hosts – lots of information to learn.
If you mix the two you are introducing a dependency: you have to wait for everything, all this information, to be learned before you can forward. If you separate them, the overlay control plane won’t even have reachability or peering until the underlay is ready, so you are never trying to send something before you are able to.
Number 5 – another use for BGP!
So finally, how about using MP-BGP EVPN as the control plane for your L3 overlay (VXLAN, MPLS, PBB), to overcome the flood-and-learn behaviour of VXLAN outlined in RFC 7348?
A standards based control plane for VXLAN.
BGP again, but for a different reason; as a method to extend L2 across L3 but without some of the pesky flooding and other interesting features of VXLAN. (OTV is another proprietary method to achieve something similar for DC interconnect).
Basically, dynamic host discovery and distribution across all VTEPs using a control plane: the MP-BGP EVPN Address Family (AF).
Recap – remember I said VXLAN packages up MAC in UDP-IP?
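As a reminder of what that packaging looks like, here is a minimal Python sketch of the 8-byte VXLAN header described in RFC 7348, which sits between the outer UDP header and the original Ethernet frame. The function names are my own, and the outer IP and UDP headers are left out.

```python
import struct

VXLAN_UDP_PORT = 4789        # IANA-assigned destination port
VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag: VNI field is valid


def vxlan_header(vni: int) -> bytes:
    """Build the VXLAN header: flags(8) reserved(24) VNI(24) reserved(8)."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI is a 24-bit value")
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def vxlan_encapsulate(vni: int, inner_frame: bytes) -> bytes:
    """Prepend the VXLAN header to the original Ethernet frame."""
    return vxlan_header(vni) + inner_frame


def vxlan_decapsulate(payload: bytes):
    """Strip the VXLAN header and return (vni, inner_frame)."""
    _flags_word, vni_word = struct.unpack("!II", payload[:8])
    return vni_word >> 8, payload[8:]


print(vxlan_header(5000).hex())  # 0800000000138800
```

The VNI of 5000 is just an example value.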
Can we do a similar type of tunneling that links to VXLAN without having to send all our MAC address info and flooding characteristics across a WAN or DC interconnect? Can I also use something as the network control plane to populate the information the VTEPS need to get to each other?
Well turns out I can use MP-BGP to do this, and then link the processes.
The EVPN part. BGP at last! or MP-BGP
You can do a similar thing to VXLAN with MP-BGP using address-families and NLRI (Network Layer Reachability Information). NLRI sounds pretty scary, but it might be easier to think of this originally as simple reachability of an IP prefix, i.e. 10.10.14.0/24 is reachable.
We can put the MAC in MP-BGP and work with VXLAN to send it to the correct VTEP IP address it learned through MP-BGP – pretty smart eh?
How? We re-use this NLRI field for MAC addresses and not IP addresses, and bingo, you have BGP routing on MAC addresses. You can also add IP at the same time – ah it all sounds so simple. (See RFC 7432 for EVPN and RFC 4760 for the multiprotocol extensions to BGP, which is RFC 4271.)
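For the curious, here is a rough Python sketch of what a MAC-carrying NLRI can look like, based on my reading of the MAC/IP Advertisement route (route type 2) in RFC 7432. It treats the route distinguisher and ESI as opaque bytes, and it assumes the VNI travels in the 3-octet label field, as is done for VXLAN. The class name and the example values are my own.

```python
import ipaddress
import struct
from dataclasses import dataclass
from typing import Optional

AFI_L2VPN = 25          # address family used for EVPN
SAFI_EVPN = 70          # subsequent address family for EVPN
ROUTE_TYPE_MAC_IP = 2   # MAC/IP Advertisement route


@dataclass(frozen=True)
class MacIpRoute:
    rd: bytes            # 8-octet route distinguisher (opaque here)
    esi: bytes           # 10-octet Ethernet Segment Identifier
    ethernet_tag: int
    mac: str             # e.g. "00:11:22:33:44:55"
    ip: Optional[str]    # optional host IP carried alongside the MAC
    vni: int             # assumed to ride in the 3-octet label field

    def __post_init__(self):
        if len(self.rd) != 8 or len(self.esi) != 10:
            raise ValueError("RD must be 8 octets and ESI 10 octets")

    def encode(self) -> bytes:
        mac_bytes = bytes.fromhex(self.mac.replace(":", ""))
        ip_bytes = ipaddress.ip_address(self.ip).packed if self.ip else b""
        body = (
            self.rd
            + self.esi
            + struct.pack("!I", self.ethernet_tag)
            + bytes([48]) + mac_bytes                 # MAC length in bits
            + bytes([len(ip_bytes) * 8]) + ip_bytes   # IP length in bits
            + self.vni.to_bytes(3, "big")
        )
        return bytes([ROUTE_TYPE_MAC_IP, len(body)]) + body


route = MacIpRoute(bytes(8), bytes(10), 0, "00:11:22:33:44:55", "10.10.14.5", 5000)
print(len(route.encode()))  # 39
```

The point is simply that the "prefix" being advertised is now a MAC address, optionally with an IP, plus enough context to make it unique.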
DC-Interconnect and MP-BGP
You can use a number of methods for DC interconnect, from OTV, and even VXLAN with Head End Replication (note, there is no magic, the VTEPS still need to know about each other and their associations, either through distribution via a Virtual Switch Manager, or Static Mappings etc. in order to unicast the flood and learn semantics rather than using multicast).
I want to make my traffic between sites look like a Layer 2 extension without the intermediate network knowing much about it. To achieve this for DC-interconnect you might typically use one of three methods: EVPN with MPLS, PBB (Provider Backbone Bridging), or VXLAN.
We are focusing on VXLAN-EVPN
In VXLAN on its own you are tunneling MAC in IP, so you could just extend VXLAN across DCs, but we still get some of the flooding characteristics we don’t like with Layer 2.
As an alternative I can use EVPN, and push my MAC addresses from VXLAN into MP-BGP to get them across to another DC over Layer 3. I outlined earlier how you can use the NLRI to represent MAC addresses so you have BGP working for MAC addresses.
What you then need to do is have the VXLAN process talk to the EVPN process and populate MP-BGP with the MAC addresses that you want to extend across Data Centres.
So what is EVPN? In short, it is a way of separating the control plane from the data plane to distribute MAC and IP information.
A few key concepts to get us started. This will not be a deep-dive into EVPN but hopefully enough to understand why we are doing it, and roughly how it works so you can add this to your bag of tricks or knowledge to dig deeper when you need to.
EVPN introduces 4 major terms:
EVPN Instance (EVI). An EVPN instance spanning the Provider Edge (PE) devices participating in EVPN. Your routing and forwarding instance that is going to be spread across all your routers.
Ethernet Segment (ES) – This is essentially the physical or logical bundled connection between your Data Centres or CE to PEs. The actual Ethernet links from one DC to another or from the CE to the PE. This is the segment that is going to carry my EVPN traffic.
Ethernet Segment Identifier (ESI). A non-zero identifier of the Ethernet Segment
Ethernet TAG – essentially a VLAN-ID tag, unique identifier for a specific broadcast domain.
Underneath the EVPN protocol we turn on some type of encapsulation, be it MPLS, or VXLAN typically. When we turn this stuff on, we are basically telling the switch, “hey the stuff you learn on your VTEP can you pass that to BGP please?” EVPN then puts this in an NLRI for MP-BGP.
So we need a connection between the VXLAN process and the BGP process, so they can share data. The MAC address is learned as normal at L2 and goes into the MAC table. This info (which is known by the switch that is the VTEP, for the VLANs you have enabled that functionality on) is then used to put MAC addresses into MP-BGP.
Not just the MAC addresses mind, you can add other info, and you would typically do this, e.g. the peer it learned this info from, VXLAN communities, targets etc. An important note – what it does NOT have is the data we are transmitting, just address identifiers with additional info that makes them unique and can be used if we need policy action, e.g. the BGP Community for this VNI, or info from the switch for troubleshooting. Basically a bunch of additional info that might be useful and help functionality.
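In very rough Python terms, the hand-off from the local MAC table to BGP might look something like this. It is a sketch of the idea and not how any particular switch implements it. The names, VLAN-to-VNI mapping and addresses are all invented.

```python
from typing import Dict, List, NamedTuple


class MacEntry(NamedTuple):
    mac: str
    vlan: int
    interface: str


def export_to_evpn(mac_table: List[MacEntry],
                   vlan_to_vni: Dict[int, int],
                   local_vtep_ip: str) -> List[dict]:
    """Turn locally learned MACs into EVPN-style advertisements.

    Only VLANs that have been mapped to a VNI are exported. Note that
    the advertisement carries identifiers only, never user data.
    """
    routes = []
    for entry in mac_table:
        vni = vlan_to_vni.get(entry.vlan)
        if vni is None:
            continue  # this VLAN is not being extended
        routes.append({"mac": entry.mac, "vni": vni, "next_hop": local_vtep_ip})
    return routes


table = [
    MacEntry("00:11:22:33:44:55", 10, "eth1"),
    MacEntry("00:11:22:33:44:66", 20, "eth2"),  # VLAN 20 not extended
]
print(export_to_evpn(table, {10: 5010}, "192.0.2.1"))
```

Only the MAC in VLAN 10 is advertised, with the local VTEP address as the next hop.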
So what does a routing table look like? Very briefly, what does the construct look like?
e.g. you do a kind of sh NLRI BGP summary.
First it might show the address family you are using to transfer across BGP, then you can look at the ethernet switching table which has the MAC address, vlan name, and logical or physical interface I learned this from (physical interface or VTEP interface)
Then you might have your routing and switching tables, so a BGP/EVPN routing table instance and a switching instance with all the info learned from peers – kind of like a virtual switching VRF.
Then you have all the info – the VTEP address, optional communities, the VNI and the MAC address we have learned, the next-hop, AS path – telling it where to send traffic for this MAC address, which is potentially out of the physical interface connecting the DCs together.
With the MP-BGP EVPN control plane, a VTEP device first needs to establish BGP neighbor adjacency with other VTEPs or with Internal BGP (iBGP) route reflectors. This is how it knows where all the VTEPs are in order to forward traffic.
In addition to the BGP updates for end-host NLRI, VTEPs exchange the following information about themselves through BGP:
1) Layer-3 VNI, 2) VTEP address, 3) Router MAC address
The way MAC address distribution happens through EVPN allows unknown unicast flooding in the VXLAN to be reduced or eliminated.
Ok, I think we have some kind of foundation. We’ve laced up our boots, got some mountain snacks, and are ready to start a meandering EVPN packet walk.
I am in one DC and I want to open a connection to a server in another DC. Keep it nice and simple, East-West traffic. I need to make an HTTP request to another server in another DC, so I put out an ARP and say “hey I am looking for the MAC address of this IP address I know about”.
The router or switch that is terminating EVPN will then pick this up and look at its local bridging table. This gets populated by the local devices and what it learns from EVPN, other switches and routers in the network.
The switch will respond back pointing the host to the correct MAC address, which is going to be a proxy-ARP reply so that the frame is sent to the local switch.
Ok, let’s see if I can get this next bit in words.
The host has now got a proxy-ARP response from the local switch and sends out an Ethernet frame with the destination MAC of the switch. The switch receives this and knows that, as it proxy-ARP’d, the real destination MAC is not really local and sits on another VTEP, so it adds a VXLAN header. It therefore needs to send this across the EVPN infrastructure, using what it learned from the EVPN route.
So how does it know this? The BGP route update shows the VTEP we learned that MAC address on, the VNI we need to encapsulate with, and the MAC address itself. It takes the info from the EVPN route and sends this across to the specific remote VTEP.
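The decision the local switch makes at this point can be sketched in a few lines of Python. This is my own simplification: the table layouts, names and addresses are hypothetical, and I am treating "no EVPN route" as "do not tunnel".

```python
from typing import Dict, NamedTuple


class EvpnRoute(NamedTuple):
    remote_vtep: str
    vni: int


def forwarding_decision(dst_mac: str,
                        local_ports: Dict[str, str],
                        evpn_routes: Dict[str, EvpnRoute]) -> str:
    """Decide what to do with a frame based on its real destination MAC."""
    if dst_mac in local_ports:
        return f"bridge locally out {local_ports[dst_mac]}"
    route = evpn_routes.get(dst_mac)
    if route is None:
        return "no EVPN route: do not tunnel"
    return f"encapsulate in VNI {route.vni}, send to VTEP {route.remote_vtep}"


local_ports = {"00:11:22:33:44:55": "eth1"}
evpn_routes = {"00:aa:bb:cc:dd:ee": EvpnRoute("198.51.100.2", 5010)}

print(forwarding_decision("00:aa:bb:cc:dd:ee", local_ports, evpn_routes))
```

Everything needed for the encapsulation (remote VTEP and VNI) comes straight out of the route learned via BGP.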
From this point you have a VTEP address to aim at and you follow standard routing to the other VTEP, BGP, OSPF, whatever.
Our VXLAN packet? We proxy-ARP’d the destination MAC with the local switch’s MAC, but when we wrap the VXLAN header we are back to the real destination MAC of the server in DC2. When it arrives at the remote VTEP it will of course have the original source MAC the traffic came from.
So a VXLAN wrap (UDP with a dynamic source port and the same destination port) just travels across the infrastructure as a regular IP packet, using ECMP where possible.
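The dynamic source port is what gives ECMP something to hash on. RFC 7348 recommends deriving it from a hash of the inner packet and keeping it in the dynamic port range. Here is a small Python sketch of that idea. The choice of CRC32 and of hashing only the inner Ethernet header is my own assumption, as real hardware uses its own hash and usually more fields.

```python
import zlib

DYNAMIC_PORT_MIN = 49152
DYNAMIC_PORT_MAX = 65535


def vxlan_source_port(inner_frame: bytes) -> int:
    """Pick an outer UDP source port from a hash of the inner frame.

    Frames with the same inner header always get the same port, so a
    conversation stays on one path while different conversations spread
    across the ECMP links.
    """
    span = DYNAMIC_PORT_MAX - DYNAMIC_PORT_MIN + 1
    inner_header = inner_frame[:14]  # dst MAC, src MAC, EtherType
    return DYNAMIC_PORT_MIN + zlib.crc32(inner_header) % span


frame = bytes.fromhex("00112233445566778899aabb0800") + b"payload"
print(vxlan_source_port(frame))
```

The destination port stays fixed (4789 for VXLAN), so only the source port varies.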
On the other side (the other DC), it is de-encapsulated revealing an Ethernet frame. The switch just looks in its bridging table and forwards this on to the appropriate port.
If the destination host has never seen the IP address before, it knows to reverse the function we just went through to get back the other way.
There is a load more detail on config etc. way too much to cover here, so please search the web for config guides and explanations to get all the tech deep-dives, but hopefully that rough walk-through gives you a feel.
With EVPN what happens to spanning tree frames? Well, BPDUs are not sent across the EVPN infrastructure. They don’t have MAC addresses in them, they are a unique datagram, so the switch is not going to tunnel these.
Broadcast storm? Well there are timers for learning MAC addresses – this is quite cool. A counter is incremented in EVPN. The timer is 180 secs by default. Every time we learn the MAC route, we increment the counter. If it increments 5 times in 180 secs, the route is suppressed until any loop is sorted out. The MAC will be seen too many times, exceed the threshold, and fall out of the EVPN table. If it is not there, it can’t go across the tunnel!
1) If we learn the MAC route more than 5 times in 180 secs, the route is suppressed.
2) Each device participating in EVPN must learn the MAC address, whether from an EVPN route update or from standard broadcast, before any traffic is ever transferred. So if you pull the MAC address out of the EVPN route table, then it doesn’t get transferred, and is suppressed. It does not replicate out like in VPLS or standard L2.
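The counter-and-timer idea from the two points above can be sketched in Python like this. The class and method names are invented, and I have assumed suppression kicks in once a MAC is learned more than 5 times inside the 180 sec window, staying in place until someone clears it.

```python
from collections import defaultdict, deque


class MacMoveDamper:
    """Suppress a MAC route that is re-learned too often in a time window."""

    def __init__(self, max_learns: int = 5, window: float = 180.0):
        self.max_learns = max_learns
        self.window = window
        self._events = defaultdict(deque)  # mac -> timestamps of learn events
        self.suppressed = set()

    def learn(self, mac: str, now: float) -> bool:
        """Record a learn event; return True if the route may be advertised."""
        if mac in self.suppressed:
            return False
        events = self._events[mac]
        events.append(now)
        # Forget learn events that have aged out of the window
        while events and now - events[0] > self.window:
            events.popleft()
        if len(events) > self.max_learns:
            self.suppressed.add(mac)
            return False
        return True

    def clear(self, mac: str) -> None:
        """Operator has sorted out the loop: allow the MAC again."""
        self.suppressed.discard(mac)
        self._events.pop(mac, None)


damper = MacMoveDamper()
# Six learn events in 50 seconds: the sixth one trips the threshold
print([damper.learn("00:11:22:33:44:55", t) for t in range(0, 60, 10)])
# [True, True, True, True, True, False]
```

Once suppressed, the MAC is not in the EVPN table, so nothing for it goes across the tunnel.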
So there you have it! Another real use for BGP in the Enterprise. DC interconnect.
One last piece of the puzzle. Remember at the very edge, a host usually needs a default gateway to get out onto the IP network? Well, one way to solve this with VXLAN/EVPN is to combine Integrated Routing and Bridging (IRB) with a distributed anycast gateway. Every TOR or top of rack switch where a given VLAN is configured can act as a default gateway. ALL TORs will share the same gateway MAC and IP address, so with mobility, no change is seen from a VM perspective.
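A toy Python model of the anycast gateway makes the mobility point obvious. The gateway address, MAC and switch names are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnycastGateway:
    ip: str
    mac: str


# Hypothetical values: every TOR is configured with the SAME gateway identity
SHARED_GATEWAY = AnycastGateway(ip="10.10.14.1", mac="02:00:00:00:00:01")
tors = {"tor1": SHARED_GATEWAY, "tor2": SHARED_GATEWAY, "tor3": SHARED_GATEWAY}


def gateway_seen_by_vm(attached_tor: str) -> AnycastGateway:
    """The default gateway a VM resolves depends only on the shared config."""
    return tors[attached_tor]


before_move = gateway_seen_by_vm("tor1")
after_move = gateway_seen_by_vm("tor3")
print(before_move == after_move)  # True
```

The VM keeps the same gateway IP and MAC in its ARP cache wherever it lands.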
If you made it this far (you deserve a slice of cake and a cuppa for sure, maybe something stronger) you now should have some sense of why you might be seeing BGP and MP-BGP talked about all over the shop nowadays. (L3 underlay, DC-Interconnect, OTV replacement, VPLS replacement – I haven’t really gone into the VPLS part but also very interesting)
BUT…None of this is set in stone, so it is always worth asking questions of vendors, “why does this make sense now, next year, and in 5 years time? Have we thought this through? Walk me through the rationale.”
It is usually at this point you see that the world is not binary.
Finally, remember, all of this is to make life simple, but simple for who? Are we just moving complexity around? Did the above all sound simple?
So please challenge, question, rinse, and repeat, and hopefully you will land on an architecture that you are happy with and that performs.
The network is there solely to support the application and the business. Whatever we do, we just want it to work!