BGP? In the Enterprise? Part 2

Let’s look at physical and then logical underlays…

We are getting closer to where we need to be for the BGP side of this post.  Hold on though, we are not there yet…



As every good carpet fitter knows, if we have an overlay, we should have a good physical underlay.  Let’s have a look at physical infrastructure design.

You now want to design the physical infrastructure to take best advantage of this new overlay technique (VXLAN).  This overlay, or tunnel, technique gives me, as a network engineer, some freedom to think about the best way to scale the physical infrastructure, as I no longer have to worry so much about the host MAC addresses on the wider physical network (they are all being tunnelled, so I don’t see them).

This has previously been a worry with VMs.  Think about it this way: if you have a relatively small data centre with maybe 8 racks, each holding 40 virtualised servers with 100 VMs each, that is 8 x 40 x 100, so 32,000 MAC addresses to worry about.  Scale this up and you quickly hit the MAC table limits of some switch infrastructure.
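The arithmetic, reading those figures as 40 virtualised servers per rack:

```python
# Numbers from the text: 8 racks, 40 virtualised servers per rack,
# 100 VMs per server, one MAC address per VM.
racks = 8
servers_per_rack = 40
vms_per_server = 100

mac_addresses = racks * servers_per_rack * vms_per_server
print(mac_addresses)  # 32000
```

Many top-of-rack switch MAC tables top out in the low tens of thousands of entries, so even this modest data centre is already brushing against the limit.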

In the past you would need to worry, at Layer 2, that switch architectures would not create loops (loops are very bad).  Traditional Data Centre designs (such as the one below) had an architecture to make this as predictable as possible with the tools available at the time.

Network engineers will be more than familiar with this.  The basic idea is that the Core layer is where you do routing between different sections of the network, maybe between the DC and the Internet or WAN, or between sections within the DC.   Then there is the Aggregation layer, where all the access layers interconnect; if traffic can be switched at this layer, don’t push it to the core (try to keep the core clean).   Finally, the Access layer.  Obviously, this is where you have network access, with the servers plugged in.   This architecture can of course be collapsed depending on scale (collapsed core/distribution/access).

If you want to intercept traffic, maybe to perform a service such as security inspection, then the aggregation layer might be a good place to do it.  The usual design considerations come up here, like how far down you push Layer 2.  Spanning tree considerations arise: look at the links in the diagram, you have potential loops, so you get into Layer 2/Layer 3 boundary and spanning tree blocking decisions.  Where do you draw the line?

Why mention this?  Just so we have a feeling of historical progression.

Then came a number of ways to re-architect, simplify, and avoid the traditional meticulous spanning tree design needed for predictable forwarding.

First we added methods to better control spanning tree convergence: Rapid Spanning Tree (RSTP, using BPDUs as hellos for faster convergence), Rapid Per-VLAN Spanning Tree (RPVST), and Multiple Spanning Tree (MST, consolidating spanning tree topologies into a few instances).

Next, go a step further, and you see MLAG as a step towards getting rid of spanning tree at the access edge of the data centre (when I say MLAG, please substitute vPC and VSS as well here, as I am not trying to explain the technical details, simply express concepts).  It is a way to make sure you get value from all your links by using them all for forwarding, either through a unified control plane (VSS in the Cisco world, Virtual Chassis at Juniper, IRF at HP, etc.) or separate control planes (Arista MLAG, Cisco vPC).

All have a mechanism for distributing MAC learning between the paired switches so that they look like a single switch and avoid spanning tree logic.  The point here?  Diverse physical paths to hosts, and a way of having a VLAN span switches without spanning tree, so no blocked links at this layer!

We make two links to separate switches look like a single virtual link, with no spanning tree loops.  Hurrah!  The general term for this is link aggregation.  Not brilliantly standardised across vendors, but hey…
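As a rough sketch of how traffic is spread across an aggregated link: member selection is typically a per-flow hash, so a given flow always lands on the same physical link (keeping it in order) while different flows spread across the members. This is a toy model; real implementations hash on various combinations of L2/L3/L4 header fields:

```python
import zlib

def pick_member(src_mac: str, dst_mac: str, members: list) -> str:
    """Pick one member link of an aggregate by hashing the flow.

    Toy example: hash only on the MAC pair. Real gear typically
    mixes in IP addresses and L4 ports as well.
    """
    flow = (src_mac + dst_mac).encode()
    return members[zlib.crc32(flow) % len(members)]

members = ["port1", "port2"]

# The same flow always hashes to the same member, so frame
# ordering within the flow is preserved.
a = pick_member("aa:aa:aa:aa:aa:aa", "bb:bb:bb:bb:bb:bb", members)
print(a == pick_member("aa:aa:aa:aa:aa:aa", "bb:bb:bb:bb:bb:bb", members))  # True
```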


What if you can take this concept of a shared control plane between switches at Layer 2 and scale this up?  No spanning tree loops at scale?

This is where solutions like FabricPath or TRILL come in (FabricPath was based on TRILL, which stands for Transparent Interconnection of Lots of Links).

The idea here is that you treat your Layer 2 network like a Layer 3 network, but using MAC addresses as your identifiers rather than IP addresses.

FabricPath and TRILL, as concepts at least, are not that different, and both are based on IS-IS as their routing control plane at Layer 2.   TRILL uses RBridges to terminate the Ethernet cloud, if you like, and FabricPath uses the idea of a switch-id.    I am not going to go into either of these techniques in detail here as I want to get to BGP; maybe another day.

Suffice to say, the ultimate concept is turning your Layer 2 network into something that looks like a routed network rather than one with classic Layer 2 “flood and learn” semantics. Layer 2 routing, if you will, or MAC routing with IS-IS.
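For contrast, the classic “flood and learn” behaviour these techniques replace can be sketched in a few lines (a toy model; the function and port names are just illustrative):

```python
# Minimal sketch of Layer 2 "flood and learn" semantics: learn the
# source MAC against the ingress port, then either forward out the
# one known port or flood to all other ports when the destination
# is unknown.
mac_table = {}  # MAC address -> port

def handle_frame(src_mac, dst_mac, in_port, all_ports):
    mac_table[src_mac] = in_port  # learn (or refresh) the source
    if dst_mac in mac_table:
        return [mac_table[dst_mac]]  # known unicast: single port
    return [p for p in all_ports if p != in_port]  # unknown: flood

ports = [1, 2, 3, 4]
print(handle_frame("aa:aa", "bb:bb", 1, ports))  # unknown dst, flood: [2, 3, 4]
print(handle_frame("bb:bb", "aa:aa", 2, ports))  # aa:aa learned:    [1]
```

The flooding step is exactly what limits scale and demands loop-free (spanning tree) topologies; MAC routing with IS-IS replaces it with a computed forwarding table.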

Let’s take this logic a little further – with this jiggery-pokery can I make a whole lot of switches now physically look like one big switch, or fabric?

We arrive at “leaf-spine” – bear with me.

Leaf-spine is based on a Clos architecture.

Thankfully, for once, Clos is not yet another acronym to hide behind, but is named after an actual real-life person: Charles Clos, a researcher at Bell Labs in the 1950s.  If you have heard of “crossbar” switches, well, this is where it all started, as that was the name for the switching points in this non-blocking switching topology.  The term fabric appears because the topology looks a little like a woven piece of fabric.


The principle here?  Essentially that any ingress port has a non-blocking path to any egress port via a middle-layer switching point (as above), and that switching point is called the crossbar.   The maths around this is simple and pretty clever, but a little too long to go into here, so I will be lazy and point to the Wikipedia entry on non-blocking minimal spanning switches.
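For the curious, the core of Clos’s result can be sketched without the full proof: with n inputs per ingress switch and m middle-stage switches, the three-stage fabric is strictly non-blocking when m >= 2n - 1, and it needs fewer crosspoints than one giant crossbar (the numbers below are just an illustration; the advantage grows with scale):

```python
# Sketch of Clos's 1953 result for a three-stage fabric:
#   r ingress switches, each n x m
#   m middle switches, each r x r
#   r egress switches, each m x n
def strictly_non_blocking(n: int, m: int) -> bool:
    # Strictly non-blocking when m >= 2n - 1 middle-stage switches.
    return m >= 2 * n - 1

def crosspoints(n: int, r: int, m: int) -> int:
    return r * n * m + m * r * r + r * m * n

# Illustrative example: 36 ports arranged as n = r = 6.
n, r = 6, 6
m = 2 * n - 1  # 11 middle switches
print(strictly_non_blocking(n, m))  # True
print(crosspoints(n, r, m))         # 1188, vs 1296 for a full 36x36 crossbar
```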

So how does this relate to leaf-spine network architectures?  Well, you can see from the diagram below that it is exactly the same thing, but vendors typically draw it with all the leaves on one side to make the diagram simpler at scale.  As far as traffic is concerned, it is the same.

A couple of things to note if you are used to other architectures: in general you don’t connect spine switch to spine switch, and you don’t connect leaf to leaf directly (there are cases where you might, but in general you don’t).


That’s right, this now looks a lot like the inside of a Clos-based “crossbar” fabric switch, and I have seen it described as the leaves being a bit like the line cards in a chassis switch.

The aim?  Predictable forwarding paths, with all your links being used at the same time and none blocked by spanning tree.
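The “all links used” point can be made concrete with a toy path count: in a leaf-spine fabric every leaf connects to every spine, so any leaf-to-leaf path is exactly two hops, and the number of equal-cost paths equals the number of spines (the device names below are purely illustrative):

```python
# Toy leaf-spine topology: every leaf has one uplink to every spine.
spines = ["spine1", "spine2", "spine3", "spine4"]
leaves = ["leaf1", "leaf2", "leaf3", "leaf4", "leaf5", "leaf6"]

def paths(src_leaf: str, dst_leaf: str) -> list:
    # Every leaf-to-leaf path is leaf -> some spine -> leaf,
    # giving one equal-cost path per spine.
    return [(src_leaf, spine, dst_leaf) for spine in spines]

print(len(paths("leaf1", "leaf4")))  # 4 equal-cost two-hop paths
```

Add a spine and every leaf pair gains another equal-cost path, which is the horizontal scaling property that makes the design attractive.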

Very useful when you think about how expensive SFPs are.   Copper is cheaper of course, but at 10Gbps it runs to about 10-30 metres, so there are choices to be made.  Support for 10Gig at the server level is more common now, with fibre uplinks of 40Gig to the spine.  Also keep 25G, 50G and 100Gig in mind; the standards, pros and cons are more than enough for an entirely separate blog post.

Finally: no spanning tree, and all the links forwarding predictably, at scale!  Hurrah!  No more expensive wasted links!

No Spanning Tree – but how?

The fundamental problem with spanning tree in a redundant leaf-spine type layout is that one link is forwarding and the rest are blocking.  “So what are all those 10/40Gb ports that I spent so much money on doing?”   “Oh, nothing, resting until they might be needed one day.”   Nope, we are not having that; we need to use those expensive ports.  Figure it out!

So how do you do this at Layer 2 and get rid of spanning tree?

Yes, we can now use standards-based TRILL, or Cisco’s FabricPath, as mentioned earlier (both use IS-IS at Layer 2) to take control of this physical underlay, make it look like a routed network, and get rid of loops for a stable leaf-spine architecture.

This makes Layer 2 learning use the same techniques as routing protocols, so the network behaves like a Layer 3 network while using Layer 2 (MAC) addresses.  (Confused yet?)

Well, there is nothing new in the world, so first let’s look briefly at IS-IS (Intermediate System to Intermediate System) at Layer 2.

IS-IS runs over CLNS rather than over IP, and it is not tied to one protocol, so it can carry different address types.  That means you can carry MAC addresses; you don’t just have to use IP.

If you remember way back to the DECnet protocol days, the device itself would have an address rather than each interface (as in IP).  Well, you can use that idea with the fixed-length System-ID in IS-IS, which can be 6 bytes, which is… oh right, cool: MAC address length.  And as this is system-to-system communication over CLNS, you don’t need to use ARP either.
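As a concrete (and purely illustrative) sketch, here is how a 6-byte MAC address could be dropped straight into the System-ID portion of an IS-IS NET. The area `49.0001` is just the conventional private-use example, and the helper name is my own, not from any vendor CLI:

```python
# Build an IS-IS NET from a 6-byte MAC used as the System-ID.
# NET layout: <area>.<system-id>.<NSEL>, where the System-ID is
# 6 bytes conventionally written as three groups of four hex digits
# and the trailing 00 is the NSEL.
def net_from_mac(mac: str, area: str = "49.0001") -> str:
    digits = mac.replace(":", "").replace(".", "").replace("-", "").lower()
    assert len(digits) == 12, "expected a 6-byte MAC address"
    system_id = ".".join(digits[i:i + 4] for i in range(0, 12, 4))
    return f"{area}.{system_id}.00"

print(net_from_mac("00:1b:2c:3d:4e:5f"))  # 49.0001.001b.2c3d.4e5f.00
```

The neat fit is the whole point: the System-ID field was sized long before anyone thought of reusing it for MAC routing.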

So you can now effectively have IS-IS routing at Layer 2 (no pesky spanning tree).  As I said, both TRILL and FabricPath use this technique.

Next PART 3 – Underlay, Overlay, Wombling free.

