EBITDA, Cashflow and Service Providers

It might seem odd to see a piece about cash flow in a technical blog, but looking at EBITDA and cash flow got me thinking about whether the new wave of service provision (NFV, SDN, SDO, SD-WAN etc.) has any impact on the traditional ways of reporting for Service Providers currently based on high up-front costs for future service and revenue.

For many years EBITDA has been a preferred reporting indicator of Telecommunications Providers, and in turn, Service Providers.  Execs knew the score – grow revenue, sort out your EBITDA, and get good Enterprise Value Multiples.  Enterprise Value is how much it would cost to buy a company’s business free of its debts and liabilities.  Enterprise Value Multiple is your Enterprise Value (EV) divided by your EBITDA earnings number.

But with automatic rapid growth for telecoms providers on hold for now, and the fact that even though companies may be hitting market expectations, they are not always seeing the Enterprise Value they would like in the market, the EV multiples are simply not there.

Based on that, I thought it might be interesting to look at what EBITDA is?   Why the preference exists? And do changing cycles of investment with SDO/SDN/NFV/Agile change anything at all?


First up – What is EBITDA?


EBITDA – Earnings Before Interest, Taxes, Depreciation and Amortisation.

The way to calculate this is:

EBITDA = Income before taxes + Interest Expense + Depreciation + Amortisation

So basically you add the deductions (Interest Expense, Taxes, Depreciation and Amortisation) to your net income.
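
As a minimal sketch of the two equivalent ways of arriving at EBITDA described above, the snippet below uses made-up figures (in $ millions); the function names are purely illustrative.

```python
def ebitda_from_net_income(net_income, interest, taxes, depreciation, amortisation):
    """Add the four deductions back to net income."""
    return net_income + interest + taxes + depreciation + amortisation


def ebitda_from_pretax_income(income_before_taxes, interest, depreciation, amortisation):
    """Same result, starting from income before taxes."""
    return income_before_taxes + interest + depreciation + amortisation


# Hypothetical figures in $ millions, not taken from any real company.
net_income = 50.0
taxes = 15.0
interest = 10.0
depreciation = 20.0
amortisation = 5.0
income_before_taxes = net_income + taxes

ebitda_a = ebitda_from_net_income(net_income, interest, taxes, depreciation, amortisation)
ebitda_b = ebitda_from_pretax_income(income_before_taxes, interest, depreciation, amortisation)
print(ebitda_a, ebitda_b)
```

Both routes give the same number, because income before taxes is just net income with the tax expense added back.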

Let’s break this down a little.  Before the 1980s EBIT (Earnings Before Interest and Tax) was a key indicator of a company’s ability to service its debt.  With the advent of Leveraged Buyouts (LBOs), EBITDA came to be used as a tool to measure a company’s cash. An LBO is essentially a way of acquiring another company with a large amount of borrowed money.  The assets of the acquiring company as well as the assets of the company being acquired are used as collateral for the loans.  This enables acquisitions without committing lots of capital.

Some conflate operating income (or operating cash flow) with EBIT but these are not the same, as EBIT makes adjustments for items not accounted for in operating income and is thus a non-GAAP (Generally Accepted Accounting Principles) measure.  Briefly, operating income is gross income minus operating expenses, depreciation and amortization, and like EBIT it is stated before taxes and interest expenses.


Let’s build this up from Cash Flow and Net Income.

Net income and Operating Cash Flow may seem very similar but they are not really the same.  Operating cash flow, or raw cash flow in general might need converting through accrual accounting to get a more realistic measure of income, namely net income. Accrual accounting?  Very briefly this recognises when an expense is incurred not when that expense is paid (think of buying something one month but then having 60 days before you need to pay it off, well the expense was incurred the moment you bought it really).  As a result businesses can list credit and cash sales in the same reporting period that sales were made.

Cash flow or Operating Cash Flow (OCF) = Revenue minus Operating Expenses

Ideally you want your Operating Cash Flow to be higher than your operating expenses or you might not be in business for too long.  Certainly an important measure.

Net income = Gross Revenue minus Expense.

Net income reflects the profit that is left after accounting for ALL expenditures, every pound/dollar the company earns minus every pound/dollar the company spends.

As I mentioned Cash Flow and Net Income are not the same, with Cash Flow showing not only how much a company earned and how much it spent, but when the cash actually changed hands.  This difference is significant and an illustrative example is below:

Say you sign $20million worth of contracts in a year, complete work on $10million of them, and collect $8million of that work in cash in the year.  If you have also paid out $6million for equipment, your raw cash flow would be $2million.  Net income however might look significantly different.

Say you have provided economic value to your customers (revenue) of $15m of the contracts (i.e. you have completed $10m of the contracts and are 50% of the way through the remaining $10m, so a further $5m of economic value to customers).  And say the $6million of equipment purchased is estimated to last 3 years, so you consume only a third of it this year: $6million divided over 3 years is $2million.  Your net income for the year would then be $15million minus $2million in expenses, so $13million.

$13 million is a very different number than the raw cash flow value of $2million, and perhaps a better indication of the operating income of the company.
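
The arithmetic in this example can be reproduced in a few lines. This is only a sketch using the figures from the text (in $ millions); the variable names are my own, and straight-line depreciation over 3 years is assumed as in the example.

```python
# Figures from the worked example, in $ millions.
contracts_signed = 20.0
completed_work = 10.0
cash_collected = 8.0
equipment_paid = 6.0

# Raw cash flow: cash in minus cash out during the year.
raw_cash_flow = cash_collected - equipment_paid

# Accrual view: recognise revenue for value delivered, whether or not it was paid.
progress_on_remaining = 0.5
remaining_contracts = contracts_signed - completed_work
revenue_recognised = completed_work + remaining_contracts * progress_on_remaining

# Spread the equipment cost over its estimated 3 year life (straight line).
equipment_life_years = 3
depreciation_expense = equipment_paid / equipment_life_years

net_income = revenue_recognised - depreciation_expense

print(f"raw cash flow: ${raw_cash_flow}m")
print(f"net income:    ${net_income}m")
```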

Earnings and cash therefore are not the same thing but that is not to say that operating cash flow is not important.  Companies must still pay interest, and they must pay taxes, and that cash has to come from somewhere.  So it is useful to look at Free Cash Flow (Cash Flow after all capital expenditure), and at working capital to see whether a company can service its short term debts.  A negative working capital might suggest a company will struggle with its short term debts.

Working Capital = Current Assets – Current Liabilities

Current Assets  – assets that can be converted to cash in less than a year.

Current liabilities  – liabilities that must be paid this year.
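
A tiny illustration of the working capital check, with hypothetical figures (in $ millions) chosen only to show a negative result:

```python
def working_capital(current_assets, current_liabilities):
    """Working Capital = Current Assets - Current Liabilities."""
    return current_assets - current_liabilities


# Hypothetical figures in $ millions.
wc = working_capital(current_assets=40.0, current_liabilities=55.0)
struggling = wc < 0  # negative working capital is a warning sign for short term debts
print(wc, struggling)
```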


Now we have had a quick look at why Net Income might be a preferable way to get a read of a company’s operating income over and above raw cash flow, is there a way to calculate this to give us a better standard view of Net Income without all the financing decisions and conditions that are peculiar to each company? i.e. provide a better method of comparison?

Well EBITDA adds back (or deducts from the calculation the expenses associated with) interest, taxes, depreciation and amortization.  This therefore removes the effects of financing and accounting decisions.

The idea being that by ignoring expenses like interest, taxes, depreciation and amortization you strip away the costs that aren’t directly related to the core operations of a company.  The proposition is that what you are left with (EBITDA), is a purer measure of a company’s ability to make money.

Let’s take a brief look at what these costs are to reach EBITDA:

Interest expense – is the interest payable on any borrowings such as bonds, loans, convertible debt or lines of credit (interest rate x outstanding principal amount of debt). You hear this referred to as just the Principal sometimes.

Taxes – Generally refers to income taxes, i.e. those levied by a state or country.  Business taxes (property, payroll taxes etc.) are considered operating expenses and are therefore not factored into EBITDA.

Depreciation – In this sense the reduction of the value of an asset over time.  It applies to tangible assets – both fixed assets (land, buildings, machinery) and current assets (inventory). Basically things that can be physically harmed.  Depreciation is a way of giving a cost to a tangible asset over its useful life, or how much of an asset’s value has been used up over time.

You also have something called intangible assets, which are things you can’t really touch like copyrights, patents, brand recognition etc. (as opposed to equipment, machinery, stocks, bonds and cash for example).  Goodwill is also a part of this, so solid customer base, reputation, employee relations, patents and proprietary technology – an amount you are willing to pay over the book value of the company (the value of the assets the shareholders would theoretically receive if the company was liquidated).

Amortisation – Deals with these intangible assets.  So the paying off of debts such as a loan or mortgage with a fixed regular repayment schedule over a time period.  Also the spreading out of capital expenses of intangible assets over the assets’ useful life, i.e. a fixed period.  Say you spend $10 million on a patent with a useful life of 10 years, then over these 10 years you can spread the cost as a $1million a year amortisation expense.
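
To make the patent example concrete, here is a small sketch of a straight-line schedule. Straight-line is an assumption on my part (the text only says the cost is spread evenly), and the function name is illustrative.

```python
def straight_line_schedule(cost, useful_life_years):
    """Return (year, expense, remaining book value) for each year of the asset's life."""
    annual_expense = cost / useful_life_years
    schedule = []
    for year in range(1, useful_life_years + 1):
        remaining = cost - annual_expense * year
        schedule.append((year, annual_expense, remaining))
    return schedule


# The $10 million patent with a 10 year useful life, in $ millions.
patent_schedule = straight_line_schedule(cost=10.0, useful_life_years=10)
for year, expense, remaining in patent_schedule:
    print(f"year {year}: expense ${expense}m, remaining ${remaining}m")
```

The same function works for depreciating a tangible asset; only the label on the expense changes.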


So EBITDA is not really a good measure of cash flow, but can be a reasonable measure of a company’s profitability or net income.  However it does leave out the cash required to fund working capital and the replacement of old equipment, which can be significant.

This is particularly useful for new companies or those trying to attack a new large market where depreciation and amortisation spreads out the expenses of large capital investments, which can be considerable.

Conversely EBITDA has a bit of a bad rep as it has been used by a bunch of dangerously leveraged companies in boom times, but that is not to say it is all bad by any means.

There are many pros and cons to EBITDA but most are out of scope of what I am trying to convey, namely that telecoms providers tend to like it as a reporting method, have large initial infrastructure outlay to get return on investment over time through telecoms or service provision.  They are also not seeing the Enterprise Value multiples they would like despite a focus on market expectations around EBITDA (which doesn’t penalise a company in the market for having to invest in longer term infrastructure providing earnings are going in the right direction).


As with all accounting it is useful to look under the covers when you get the chance, to get a better read.  In Service Providers or Telecommunications companies where there is significant expense in building out the network, buying frequency, and installing physical towers or radios, it makes a deal of sense to judge them more fairly over a longer term as repeated income will come some time after the initial investment.

In the boom years however, rapid growth was a sure fire way of getting great EV multiples, but with growth in the industry now at GDP levels and below, fixed lines in decline, mobile revenues tailing off and there being no business case in the transport of bits, EBITDA focus does not automatically equate to investor or customer value.

In essence it gets harder and harder to hide behind growth.  Knee jerk reactions from execs can zone in on cash and EBITDA while they under-invest in the very things that provide economic value to customers (your revenue as a provider) i.e. better products and services to facilitate ROI, and lead to better capital returns.  Essentially the focus needs to be on creating value and not just controlling costs.


So my question is, with SDN, NFV, Software Defined Orchestration, Agile product development etc. can you facilitate some of the above?  Are the providers who are simultaneously adopting flexible automation and orchestration to roll out new services faster (services chained to customers’ business logic), with greater visibility and with better quality, likely to be the ones who get the EV multiples they hope for?


“What is that huge fiery technology shift I see?  Should we prepare?”  “Don’t worry it’s miles away….stop thinking, keep cutting costs and we’ll be fine!”

As customers start to play providers off against each other with shorter term contracts to drive down cost and improve service (as with SD-WAN), does flexibility to respond with new services and efficient operations become more important than high sunk costs, long contracts, eventual gain, and a relentless focus on internal cost cutting with traditional technology to justify business as usual?

Is this all doubly worrying with the reduction of some SPs to mere local cloud access providers (just give me a pipe to the nearest cloud provider thanks)?  Is it coincidence that the big cloud providers are the ones physically laying fat cables to connect their services nowadays?

Reducing operating costs and quick iterative roll-outs with new techniques and technology certainly seems to appeal to the largest and more innovative SPs.  Using the latest methods to hold vendors to account and change traditional network operating practices (e.g. vCPE), while simultaneously recognizing they can provide new services to customers with more flexibility, seems primed for technology shifts like 5G.  Will all of this lead to a renewed focus on ROI rather than EBITDA growth or Capex/Sales? At the moment EBITDA growth and Capex/Sales don’t appear to be moving the EV multiple needle.

Are those who are not adopting new ways destined for extinction?

Of course EBITDA isn’t going to be ditched by Telecoms providers anytime soon, but maybe a softening of the relentless internal focus on one reporting measure will only lead to better value for customers, and in return better results for those providers focusing on customer value.  You never know, internal business cases might actually start to incorporate vision.


Adventures in IWAN – Part 2 Intelligent Path Control

Intelligent Path Control

Broadly speaking Intelligent Path Control looks at monitoring traffic live across multiple transport links and making a next-hop decision on the fly to make sure the traffic you define in policy always chooses the best path automagically.  Application aware routing really.

If your expensive backup links are underutilized, or you want to take advantage of multiple WAN transports and do any kind of load-sharing, then traditionally you would be using all your complex BGP tricks at the routing level.  If you wanted to do any kind of application monitoring to connect the two, then there would be some manual Netflow checking, SLA probes etc.  The entire SD-WAN market was spawned, in part, to solve this problem.

Intelligent Path Control looks to remedy this by automatically routing traffic per class based on real-time performance of links to the most optimal transport on the fly.  This is useful, as routing protocols in the main are blissfully unaware of brownouts, soft errors or intermittent short lived flapping etc.

In a nutshell –  Intelligent Path Control is intelligent routing based on performance.

This “Performance Routing (PfR)” is what enables Intelligent Path Control, so here we can end the circuitous marketing and talk about PfR from this point in.

Under the covers PfR consists of Cisco’s SAF (Service Advertisement Framework), Monitoring Mechanisms, NBARv2 and Netflow v9.  It influences forwarding decisions in PfRv3 not by altering the routing table in the main, but through dynamic route maps or Policy Based Routing to change the next hop.  It can also enable some prefix control, injecting of BGP or static routes.

Before PfR, in a Cisco environment you would rely on manually using a bunch of scripts, static routes, PBR etc. to get anything like intelligent path control, and this would be a long way from dynamic or automatic. Then you would somehow be trying to stitch this together with some monitoring maybe using Netflow, or your IPSLA probes.  All very manual, all very labour intensive.

As with DMVPN, PfR is made up of a number of components, these are below and I will cover each one in turn to get an understanding of how this solution all fits together.

  • Performance Routing
  • PfR IWAN components – Controllers and Border Routers
  • Monitoring and Performance routing – NBAR2, Netflow
  • Service Advertisement Framework (SAF)
  • Paths
  • Prefixes
  • Smart Probes (optimisation, Zero SLA)
  • Thresholds
  • Steps to set up PfR – Traffic Classes and Channels
  • Routing
  • Transit site Preference
  • Backup Master controller
  • Controlled and uncontrolled traffic
  • Path of Last Resort.
  • VRFs

Performance Routing (PfR)

PfR (Performance Routing) initially came out of OER (Optimised Edge Routing), and version 1 appeared in the year 2000 with prefix route optimisation.

PfRv2 with application path optimisation then came in 2007.

PfRv3 is the latest version with new functionality, evolving a well-established Cisco IOS feature that has been around for over 10 years.

Essentially PfR monitors network performance and makes a routing decision based on policies for application performance.  It load-balances traffic based on link utilization to use all available WAN links to the best performance per application.

PfR is really a passive engine for monitoring and gives you a superset of Netflow monitoring with around 40+ measurements.

  1. First you start with defining your policies – There are two ways here, either by DSCP or by Application.  If you use application you enable NBARv2.  There are some very handy defaults here.
  2. Then you learn your traffic.
  3. Once the traffic is learned the next step is to measure the traffic flow and network performance and report this to a Master Controller
  4. Finally you choose your path.  Performance Routing, via the Master controller, will change the next hop based on what you have learned about the traffic and how you have set-up your policies for each traffic class and link.

PfR can automatically discover the branch once connectivity is established.  The CPE/Branch (BR) is discovered and is automatically configured – auto starts.  The whole idea was to make the PfR configuration as simple and light-touch as possible.

PfR IWAN components

A device (Cisco ISR/ASR router, virtual or physical) can play one of 4 roles in PfRv3.  It is important to outline these, as the HQ Border Routers are used simultaneously as DMVPN hubs and PfR BRs.  By separating out the functionality you know which design you are talking about –  the transport overlay, or performance routing.

Let’s now look at the components involved in PfR.

Master Controller

The Master Controller (MC) is the decision maker and the Border Router (BR) is the forwarding path.  The MC is roughly analogous to Controllers in other vendors’ SD-WAN architectures; this one just happens to be a router, physical or virtual.

You need a Master Controller at every site, so inevitably there is some confusion when it comes to the Hub Master Controller.  I look at the Hub Master Controller as the place where all the policy is configured and then distributed to the rest of the Master Controllers in your network as they join your IWAN domain.  As the Hub MC is looking after the IWAN domain this will normally be a separate device at the hub (physical or virtual) for scale.  An IWAN domain is basically all your routing devices that participate in the Path Control.

If you look at a typical IWAN design you see a Hub Master Controller at the central site (and a second maybe on a redundant central site or DC).  Cisco should probably have called this an IWAN domain controller or something.  Instead they call these Hub Master Controllers (HUB MCs),  central sites are IWAN POPs (Points of Presence), and each central site has a unique POP-id.

In Cisco IWAN domain you must have a Hub site, you may have a Transit Site and of course you have your Branch sites.

The Master Controller functionality doesn’t do any packet forwarding or inspection, but simply applies policy, verifies and reports.  The MC can also be standalone or combined with a Border Router (BR).  You have to have a Master controller per site (branch, hub etc.) so policy can be applied and verified.  If you only had one router at the branch with 2 transports then the border router would be a combined Master Controller and Border Router MC/BR.

Master Controller as a term comes up in several places then, but all you have to do is differentiate between the functionality of the one at the central site, and the ones everywhere else and it all becomes simpler to understand.

So ultimately there is a central place for configuration, policies, timers etc. (your Hub MC), but a completely distributed MC control plane.  It is important to know that even if you lose your Hub MC, you still have the local MCs optimising and controlling your traffic. A distributed control plane.

Let’s briefly go through each of the components involved in PfR in turn, and then look at the monitoring:

  • Hub Master Controller:  This is where all the policies are configured, and is the Master Controller at the hub site, Data Center or HQ.  Also for this central site it makes any optimisation decisions for traffic on the Border Routers (BRs).
  • Hub Border Router:  At each central site you need Border Routers to terminate WAN interfaces and PfR needs to be enabled on these.  The BR needs to know the address of its local Master Controller (the Hub Master Controller in this case), and you can have several hub BRs and indeed interfaces per BR.  As far as PfR is concerned you also need a path name for external interfaces – I will come onto Paths shortly.
  • Branch Master Controller:  I mentioned you need an MC on each site to make the optimisation decisions, but in this case there is no policy configured as with the MC at the Hub.  Instead it receives the policy from the Hub Master Controller. Obviously it therefore needs the IP address of the hub MC.  At a branch the MC can be on the same device as the border router – an MC/BR.
  • Branch Border Router:  The branch Border Router (BR) terminates the WAN interface and you need to enable PfR on this.  It also needs to know where its MC is (if it is on a separate box).  Once enabled for PfR the WAN interface is detected automatically.  As noted earlier the branch Border Router can also house the Master Controller for the site.

All Master controllers  peer with the hub MC (or IWAN Domain Controller) for their configuration and policies.

Branches will always be DMVPN spokes.

Every site runs PfR to get path control configuration and policies from the IWAN domain controller through the IWAN Peering Service.

A Cisco diagram probably does a better job than a scribbling of mine to get the components across visually.

IWAN domain

One other confusing term in the architecture is that of Transit (this is the problem when you use terms with multiple meanings, even if it accurately describes a functionality).  I understand a transit site as a redundant Hub Data Centre with a redundant Hub MC (IWAN domain controller).  So exactly the same as the Hub site, the only difference is you do not define the policies here, these are copied from the Hub MC.  The Transit site also gets a POP-ID.

Remember each central site gets a POP-id.

Technically traffic can transit a branch site to get to another branch site if you get the routing and advertisements wrong, but this can get pretty messy and is best avoided.

As with most network architectures, a solid predictable routing design where you know your expected routed paths is the key to a stable and robust IWAN deployment.

Monitoring and Performance Routing

Unified Monitoring (Performance Monitor) and application visibility includes bandwidth, performance, correlation to QoS queues, etc. and is responsible for performance collection and traffic statistics.

As I mentioned at the start PfRv3 is an evolution of Optimised Edge Routing (OER) which was prefix route optimisation with traditional Netflow for passive monitoring and IPSLA probes for active monitoring.  This moved to PfRv2 to add application routing based on real-time performance metrics and then PfRv3 which adds a bunch of stuff including smart probes, NBARv2, one touch provisioning, Netflow v9, Service Advertisement Framework, VRF awareness etc.

For application recognition (dynamic or manual) IWAN and PfR use NBAR2.

NBAR2 – Network Based Application Recognition, is a way of inspecting streams of packets up to layer 7 to identify applications.  It provides stateful deep packet inspection on the network device to identify applications, attributes and groupings.

Cisco defines this as a Cisco cross platform protocol classification mechanism.  It can support and identify 1500+ applications and sub-classifications, and Cisco adds new signatures and updates through monthly released protocol packs.  It can identify IPv4, IPv6, IPv6 transition mechanisms, and a bunch of applications like Tor, Skype, MS-Lync, FaceTime, YouTube, Office 365 etc.  You can then configure policy based on this.

(IOS commands)

ip nbar protocol-discovery

ip nbar custom (with the transport keyword)

You match the protocol from NBAR when setting up your QoS policy for IWAN and then your TCA levels (Threshold Crossing Alerts).  TCAs are pretty much how they sound: when traffic crosses predefined thresholds around jitter, loss and delay, an alert is created that the controller can then act upon for a path.
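
The idea of a threshold crossing can be sketched in a few lines of Python. This is a conceptual model only, not how IOS implements it; the class names, field names and threshold values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    max_delay_ms: float
    max_loss_pct: float
    max_jitter_ms: float


@dataclass
class Measurement:
    delay_ms: float
    loss_pct: float
    jitter_ms: float


def threshold_crossings(measured, limits):
    """Return the names of the metrics that exceed their configured threshold."""
    crossings = []
    if measured.delay_ms > limits.max_delay_ms:
        crossings.append("delay")
    if measured.loss_pct > limits.max_loss_pct:
        crossings.append("loss")
    if measured.jitter_ms > limits.max_jitter_ms:
        crossings.append("jitter")
    return crossings


# Hypothetical thresholds for a voice class, and one measurement on a path.
voice_limits = Thresholds(max_delay_ms=150.0, max_loss_pct=1.0, max_jitter_ms=30.0)
sample = Measurement(delay_ms=180.0, loss_pct=0.2, jitter_ms=12.0)

alerts = threshold_crossings(sample, voice_limits)
print(alerts)
```

An empty list means the path is within policy; anything else is the kind of event the controller would act on.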

Netflow – (developed by Cisco) essentially collects IP traffic information and monitors network traffic.  There is Netflow v5 and v9.  Netflow v5 is a fixed packet format and has fields like src and dst IP, src and dst ports, number of packets in the flow, number of L3 bytes in the flow, protocol, Type of Service, TCP flags etc.  PfR also takes advantage of Netflow v9 which adds a cornucopia of extra information to customise and report on, and you can define what you want to report on as well by creating your own custom flexible flow record.
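
Because Netflow v5 is a fixed format, a single flow record can be unpacked with Python's standard struct module. The sketch below assumes the standard 48-byte v5 flow record layout and ignores the 24-byte export header; the sample record is hand-built with made-up values rather than captured from a router.

```python
import socket
import struct

# Netflow v5 flow record layout (48 bytes, network byte order):
# srcaddr, dstaddr, nexthop, input if, output if, packets, bytes,
# first, last, src port, dst port, pad, TCP flags, protocol, ToS,
# src AS, dst AS, src mask, dst mask, pad.
V5_RECORD = struct.Struct("!4s4s4sHHIIIIHHxBBBHHBBxx")


def parse_v5_record(data: bytes) -> dict:
    """Unpack one v5 flow record into a dictionary of the commonly used fields."""
    (src, dst, nexthop, in_if, out_if, packets, octets, first, last,
     src_port, dst_port, tcp_flags, protocol, tos,
     src_as, dst_as, src_mask, dst_mask) = V5_RECORD.unpack(data)
    return {
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
        "packets": packets,
        "bytes": octets,
        "src_port": src_port,
        "dst_port": dst_port,
        "protocol": protocol,
        "tos": tos,
        "dscp": tos >> 2,  # DSCP is the top 6 bits of the ToS byte
        "tcp_flags": tcp_flags,
    }


# A hand-built sample record (made-up values) standing in for exported data.
sample = V5_RECORD.pack(
    socket.inet_aton("10.0.0.1"), socket.inet_aton("10.0.1.1"),
    socket.inet_aton("0.0.0.0"), 1, 2, 10, 1500, 0, 1000,
    16384, 16384, 0, 17, 184, 0, 0, 24, 24,
)
flow = parse_v5_record(sample)
print(flow)
```

Netflow v9 is template based, so it cannot be parsed with one fixed struct like this; the exporter first sends a template describing the fields, which is what makes custom flow records possible.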

For more information see Netflow V5 export format

For more information on Netflow v9 see Netflow v9 Format

IWAN Monitoring – all the PfR stats are exported using Netflow, so it is important to have a monitoring platform that supports Netflow v9 to get the most out of your PfR monitoring and visibility.

Service Advertisement Framework (SAF)

PFRv3  has a concept of a peering domain or Enterprise domain for service coordination and exchange at the wan edge.  SAF is the underlying technology here.

SAF creates the framework for passing service messages between routers, with SAF forwarders and clients – basically a service advertisement protocol.  There is some considerable detail under the covers, but in IWAN it has been made very simple to configure.  (If you have history here you will remember SAF headers, database, client APIs, Client and Forwarding protocols, transitive forwarders, services etc.  Fortunately the underlying mechanisms have been improved and are now hidden.)

In essence the advertisement of the SAF service uses EIGRP as the transport layer for the advertisement, and is completely separate from the IP routing protocol you are using to actually forward packets. It also uses link-local multicast for neighbour discovery.

Since you are using EIGRP as the engine for service advertisements, this also comes with split-horizon and DUAL (Diffusing Update Algorithm) to prevent service advertisement loops.

SAF relies on the underlying network and DMVPN in order to know about and communicate with its peers, as through the tunnel they are effectively one hop away. There can be confusion here again, as it is easy to see EIGRP and assume this is related to underlying network connectivity.  For SAF, provided there is IP connectivity to the domain Border Routers (i.e. through the DMVPN tunnel), peers can communicate and pass service advertisements between each other, as they have established SAF topology awareness and neighbours (peers) through the overlay EIGRP control engine.

Also SAF is efficient in that it only sends out incremental updates when advertisement changes occur and does not periodically broadcast/flood service advertisements.

In IWAN you have the concept of a Path.

A Path name in IWAN identifies a transport.

These Paths are identified with a Path-ID.  You manually define the path name and path-id on the Hub and Transit BRs, they then start sending discovery probes to the branches and these probes contain information about the path:  namely the Path Name, Path-ID and POP-ID.

The Path-ID is unique per site.

Paths in IWAN also have three labels:  Preferred Path, Fallback Path and the Next Fallback Path, and under each of these labels are three actual paths.

For example you might say for this DSCP or this application (e.g. EF) your preferred path is MPLS, for something else the preferred path is Inet1 etc.

Each transport is a Path, and understanding the concept of a Path and Path-ID certainly makes troubleshooting easier when you are looking at traffic that has changed paths, when, and the reason why.

Path of Last Resort

One final bit of confusion in IWAN is that at the Hub you need a BR per transport.  At the branch/CPE you can have 2 transports per router, and if you want to do 3 transports, then you must have a second router for the 3rd transport (there are rumours this will be up to 5 transports per router in a future release, but we wait and see). There is also this thing called Path of Last Resort, which some see as, “great, 3 transports are really supported per router.” Turns out, No!

Basically if all other paths are unreachable, then we can fallback to the path of last resort.  This is not the same as the monitoring and control you get on the other paths.

PfR will not probe as usual (smart probes, instead of being sent at 20pps, are reduced to 1 packet every 10 seconds – so really just a keepalive).  Also path unreachable will be extended to 60 seconds.  So really this is to be used if you have a 4G/LTE connection as a last resort or backup path for your traffic if all else fails.  You add this config on the central site.

Summary – Steps of setting policy for PfR

PfR – First you define a Class (a bit like a class-map if you are familiar with QoS policy config in Cisco IOS), then this has a match on DSCP or Application, then you have your transport preference (Preferred, Fallback, Next Fallback), then your performance threshold based on loss, latency or jitter to decide which is the preferred path.
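
As a rough mental model of that class definition, the Python sketch below holds a match, an ordered transport preference and performance thresholds, then picks the first preferred transport whose measurements are within policy. It is a simplification under my own assumptions (names, values, and a single path per preference label) and does not reflect the actual PfRv3 decision logic or its load balancing.

```python
from dataclasses import dataclass, field


@dataclass
class TrafficClassPolicy:
    name: str
    match_dscp: str
    # Ordered preference: Preferred, Fallback, Next Fallback.
    path_preference: list = field(default_factory=list)
    max_loss_pct: float = 100.0
    max_delay_ms: float = float("inf")
    max_jitter_ms: float = float("inf")


def within_thresholds(policy, metrics):
    """True if a path's measured loss, delay and jitter all satisfy the policy."""
    return (metrics["loss_pct"] <= policy.max_loss_pct
            and metrics["delay_ms"] <= policy.max_delay_ms
            and metrics["jitter_ms"] <= policy.max_jitter_ms)


def choose_path(policy, measurements):
    """Walk the preference order and return the first path that is in policy."""
    for path in policy.path_preference:
        metrics = measurements.get(path)
        if metrics is not None and within_thresholds(policy, metrics):
            return path
    return None  # nothing in policy; traffic would be left to normal routing


# Hypothetical voice class and made-up per-path measurements.
voice = TrafficClassPolicy(
    name="VOICE", match_dscp="ef",
    path_preference=["MPLS", "INET1", "INET2"],
    max_loss_pct=1.0, max_delay_ms=150.0, max_jitter_ms=30.0,
)
measurements = {
    "MPLS": {"loss_pct": 3.0, "delay_ms": 40.0, "jitter_ms": 5.0},
    "INET1": {"loss_pct": 0.1, "delay_ms": 60.0, "jitter_ms": 8.0},
    "INET2": {"loss_pct": 0.0, "delay_ms": 90.0, "jitter_ms": 10.0},
}
selected = choose_path(voice, measurements)
print(selected)
```

Here MPLS is preferred but is losing 3% of packets, so the voice class moves to the fallback transport.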

PfR actually works on the basis of a traffic class which is not an individual flow, but an aggregation of flows.

The traffic class is based on Destination Prefix, DSCP and Application name. (obviously not app name if just DSCP is used).  For each traffic class PfR will look at the individual next hop.

Performance metrics are collected per channel:

Per channel means:

  • per DSCP
  • per Source and Destination site
  • per Interface

A Channel is essentially a path between 2 sites per DSCP, or a path between 2 sites per next hop.

A Traffic Class will be mapped to a channel.

Channels are added containing the unique combination of DSCP value received, site id and exit.
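
One way to picture the relationship between traffic classes and channels is as a lookup keyed on the unique combination described above. The sketch below is an illustration only; the type names, site IDs and interface names are invented, and a real border router keeps far more state per channel.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class TrafficClass(NamedTuple):
    dest_prefix: str
    dscp: str
    app: Optional[str]  # None when the policy matches on DSCP only


class Channel(NamedTuple):
    dscp: str
    site_id: str
    exit_interface: str


def channel_for(tc, site_id, exit_interface):
    """A channel is keyed on DSCP, remote site and exit, not on the prefix."""
    return Channel(tc.dscp, site_id, exit_interface)


# Hypothetical traffic classes towards one remote site over one exit.
classes = [
    TrafficClass("10.1.10.0/24", "ef", None),
    TrafficClass("10.1.20.0/24", "ef", None),
    TrafficClass("10.1.10.0/24", "af31", None),
]

channels = defaultdict(list)
for tc in classes:
    channels[channel_for(tc, site_id="10.255.1.1", exit_interface="Tunnel100")].append(tc)

for channel, members in channels.items():
    print(channel, len(members))
```

The two EF traffic classes share a single channel, so their performance metrics are collected once for that channel rather than per prefix.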

PfR controlled and uncontrolled traffic

There is a concept of PfR controlled and uncontrolled traffic – and if some new traffic is seen for the spoke, then for 30 secs the normal routing table controls the traffic destination.  When this comes under the control of PfR then it abides by Threshold Control and is directed accordingly.

There is also an unreachable timer in PfR, determined by PfR probes, to detect the reachability of a channel.  The channel is seen as down once traffic is not seen for 1 second, or if there is no traffic on the link and smart probes are not seen for 1 second.  These are the defaults I believe, but there is a recommendation to set the timer to 4 secs.  I assume this will become the new default at some point.

So for failover, blackout will be the 4 seconds above, for brownout then this is 30secs by default, but again can be reduced down to 4 or 6 seconds.

Performance Monitoring (PerfMon)

Unified Monitoring under PfR is enabled through Performance Monitor (PerfMon) which has been around a while and you might be familiar with it from Voice roll-outs.  It is responsible for performance collection and traffic statistics.

Application visibility includes bandwidth, performance, and correlation to QoS queues.

When it comes to IWAN domain policies and domain monitoring there are 3 performance monitors to be aware of in PfR:

  • Performance Monitor 1: to learn site prefixes (applied on external interfaces on egress)
  • Performance Monitor 2: to monitor bandwidth on egress (applied on external interfaces on egress)
  • Performance Monitor 3: to monitor performance on ingress (applied on external interfaces on ingress)

IWAN uses these performance monitors to get a view of traffic in flight (traffic flowing through the interfaces) to look for performance violations and to make path decisions based on this.  Border Routers (BRs) collect data from their Performance Monitor cache, along with smart probe results (described below), aggregate the info and then direct traffic down the optimal forwarding path as dictated by the Master Controller.

Monitoring and optimisation – Smart probes

When there is no user traffic on a path (e.g. a backup path), probes are sent to provide the measurements. These are called Smart Probes.

Smart Probes are used to help with the discovery but also for measurement when there is no user traffic. These probes are generated from the dataplane.  Smart Probe traffic is RTP and is measured by Unified Monitoring just like other data traffic.

Smart probes add a small overhead to bandwidth on a link, but this is not performance impacting in general and can be tuned.

The Probes (RTP packets) are sent over added Channels to the sites discovered via the prefix database. Without actual traffic, the BR sends 10 probes spaced 20ms apart in the first 500ms and another similar 10 probes in the next 500ms, thus achieving 20pps for channels without traffic. With actual traffic, a much lower frequency is observed over the channel: probes are sent every 1/3 of [Monitor Interval], i.e. every 10 sec by default.

That is 20pps per channel or per DSCP.  
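To get a feel for what that means on a thin link, the arithmetic above can be sketched in a few lines of Python. Note that the probe size (PROBE_BYTES) and the channel count used below are assumptions of mine for illustration, not figures from Cisco documentation, and the 30 second monitor interval is simply what the "every 10 sec" default above implies.

```python
# Rough smart-probe overhead estimate.
# PROBE_BYTES and the channel count are illustrative assumptions,
# not values taken from Cisco documentation.
PROBES_PER_BURST = 10    # probes per 500 ms burst (from the text above)
BURSTS_PER_SECOND = 2    # one burst in each 500 ms half of a second
PROBE_BYTES = 100        # ASSUMED on-the-wire size of a single probe


def idle_channel_pps() -> int:
    """Probes per second on a channel carrying no user traffic."""
    return PROBES_PER_BURST * BURSTS_PER_SECOND


def idle_overhead_kbps(channels: int, probe_bytes: int = PROBE_BYTES) -> float:
    """Probe bandwidth in kbit/s across a number of idle channels."""
    return channels * idle_channel_pps() * probe_bytes * 8 / 1000


def active_probe_interval(monitor_interval_s: float = 30.0) -> float:
    """Seconds between probes on a channel that is carrying user traffic."""
    return monitor_interval_s / 3


if __name__ == "__main__":
    print(idle_channel_pps())        # probes/sec on one idle channel
    print(idle_overhead_kbps(4))     # kbit/s for 4 idle channels (assumed size)
    print(active_probe_interval())   # seconds between probes with traffic
```

With the assumed 100 byte probe, four idle channels (say four DSCPs on a backup path) work out at 64 kbit/s, which is why the tuning options below matter on 4G or DSL.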

Zero SLA is another feature that is often missed and should be mentioned. If you are concerned about a very low bandwidth link and that you would be sending smart probes per channel or DSCP, then you can configure Zero SLA so only DSCP 0 uses smart probes on secondary paths. All the other channels  now do not get smart probes, only DSCP 0.  If you have a 4G or low bandwidth link and are worried about overhead this is definitely an option to have in the back pocket.

Smart probes are of three types:

  • Active Channel Probe: sent out to measure network delay if no probe has been sent out in the past 10 seconds.
  • Unreachable Probe: used to detect channel reachability when there is no traffic being sent out.
  • Burst Probe: used to calculate delay, loss and jitter on a channel that is not carrying active user traffic.

For low-bandwidth links (e.g. DSL or 4G/LTE) it is possible to tune this further to have even less overhead – for example the command below:

smart-probes burst quick <number-of-packets> packets every <seconds> seconds


The whole point of defining thresholds is to look for a threshold being crossed, i.e. a performance violation – if we see this then an alert (a Threshold Crossing Alert or TCA) is sent from the Border Router to the source Master Controller.  It is at this point that PfR controls the route and changes the next hop to an alternative path as per the policy configuration i.e. the traffic is re-routed to a secondary path.  It is not PBR (policy based routing) as you might already be familiar with, but is similar in that the remote site device knows what to do with this traffic class and routes it accordingly based on policy.  The local Master Controller makes this decision.
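As a way of picturing the threshold check, here is a minimal Python sketch. It is emphatically not how PfR is implemented; the class names, the threshold values and the "first in-policy alternate wins" rule are all my own assumptions, used only to show the shape of the decision.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Thresholds:
    """Per-traffic-class policy limits (values used below are made up)."""
    delay_ms: float
    loss_pct: float
    jitter_ms: float


@dataclass
class ChannelStats:
    """Measurements for one channel over a monitor interval (hypothetical)."""
    path: str
    delay_ms: float
    loss_pct: float
    jitter_ms: float


def violations(stats: ChannelStats, limits: Thresholds) -> List[str]:
    """Return the metrics that crossed their threshold (what a TCA reports)."""
    crossed = []
    if stats.delay_ms > limits.delay_ms:
        crossed.append("delay")
    if stats.loss_pct > limits.loss_pct:
        crossed.append("loss")
    if stats.jitter_ms > limits.jitter_ms:
        crossed.append("jitter")
    return crossed


def choose_path(current: ChannelStats,
                alternates: List[ChannelStats],
                limits: Thresholds) -> str:
    """Stay put unless the current path violates policy and an alternate does not."""
    if not violations(current, limits):
        return current.path
    for alt in alternates:
        if not violations(alt, limits):
            return alt.path
    return current.path  # nothing better available, keep following routing


if __name__ == "__main__":
    voice = Thresholds(delay_ms=150, loss_pct=1.0, jitter_ms=30)
    mpls = ChannelStats("MPLS", delay_ms=180, loss_pct=0.2, jitter_ms=5)
    inet = ChannelStats("INET", delay_ms=40, loss_pct=0.1, jitter_ms=8)
    print(violations(mpls, voice))
    print(choose_path(mpls, [inet], voice))
```

In this toy run the MPLS channel breaches the delay threshold, so the voice traffic class would move to the INET path; if both paths were out of policy it simply stays where it is.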

All the paths are continuously monitored so we have awareness across the transports.

Routing and PfR

In part one we went through some of the choices around routing in DMVPN.  There are additional considerations with PfR.

One of the reasons EIGRP and BGP are preferred for IWAN is that alternate paths are available in the BGP and EIGRP topology table, and as PfR looks into these tables to make decisions and change next hops based on policy, they are well suited.

Scale.  The first thing PfR does is look at the next hop of all paths.  It looks in the BGP or EIGRP table.  If you show your routing table you have one next hop per path, but because PfR looks at both the routing table and the topology table, it sees the next hops for both paths.

With EIGRP you can adjust the delay to prefer MPLS, so this combined with the EIGRP stub feature means you can control routing loops.

With BGP you would have the hubs configured as the route reflector for BGP, and to prefer MPLS you can simply set a high local pref for MPLS.  If you have say OSPF on the branch then you redistribute the BGP into OSPF, and set a route tag on the spokes to identify routes redistributed from BGP.

As ever there are many ways to configure BGP, but the validated designs guide you to one relatively simple way.

If you looked at using OSPF for example, well PfR does not look into the OSPF database, therefore relies on the RIB (Routing Information Base), so in order to support multiple paths for decision making you would need to run ECMP (Equal Cost Multi Path) – far from ideal.


When a site or a branch is part of PfR it advertises its prefix to the HUB-MC, which then forwards it to all the MCs in the domain.

This can be confusing because obviously BGP or EIGRP send prefixes, but PfR also sends prefixes.  One of the performance monitors will collect the source prefix and mask and advertise this to all Master Controllers.  It uses the domain peering with the HUB-MC and then this will reflect this prefix out to all the other MCs in the domain.

Ultimately you end up with a table with a mapping of site-id to site prefix and how this was learned, i.e. learned through IWAN peering (SAF, the Service Advertisement Framework), configured, or shared.

It is important that attention is paid to your routing (of course, it is always important that you pay attention to the routing) because in advertising the prefixes, PfR looks in the topology table based on the IP address and mask to dig out the prefix.

There are two Prefix concepts to be aware of: 1) Enterprise Prefix List and 2) Site-Prefix.

Enterprise Prefix list  is a summary of all your subnets, or all your prefixes in your IWAN domain.  This is defined on the HUB-MC for the domain.

A prefix that is not in this prefix-list is seen as an Internet prefix and load-balanced over the DMVPN tunnels.  This is important, because if there is no Site-id for a destination (the site is not enabled for PfR), then you don’t necessarily want its traffic, such as Voice, to be load-balanced.  So it is important to make sure you have a complete Enterprise Prefix List.  Once a prefix is included in the Enterprise Prefix List, PfR will know that traffic is heading to a site where PfR is not enabled and will subsequently know not to load balance it.
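The three-way decision described above can be sketched with Python's standard ipaddress module. The prefixes, the site-id and the category names below are hypothetical values I have made up for the example; the real logic lives inside PfR, not in a script.

```python
import ipaddress

# HYPOTHETICAL Enterprise Prefix List (in real life, defined on the HUB-MC).
ENTERPRISE_PREFIXES = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
]

# HYPOTHETICAL site prefixes learned through PfR, mapped to a site-id.
SITE_PREFIXES = {
    ipaddress.ip_network("10.1.10.0/24"): "10.255.241.11",
}


def classify(destination: str) -> str:
    """Classify a destination the way the paragraph above describes."""
    addr = ipaddress.ip_address(destination)
    if any(addr in net for net in SITE_PREFIXES):
        return "pfr-site"            # known site-id: PfR controls the traffic
    if any(addr in net for net in ENTERPRISE_PREFIXES):
        return "enterprise-no-pfr"   # internal, no site-id: do not load balance
    return "internet"                # not in the list: load-balanced


if __name__ == "__main__":
    for dest in ("10.1.10.5", "10.2.0.1", "8.8.8.8"):
        print(dest, classify(dest))
```

The middle case is the one that bites: leave 10.2.0.0/16 out of the Enterprise Prefix List and that voice traffic would be treated as Internet traffic and load-balanced.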

Site-Prefix – the site prefix is dynamic in PfR, so that on a site PerfMon will collect the traffic outbound, look at the IP address and mask, and then advertise that prefix up to the hub through PfR.  On the hub and transit sites, however, you want to configure the advertised Site-Prefix manually.

Prefixes specific to sites are advertised along with site-id. The site-prefix to site-id mapping is used in monitoring and optimization.

It is important that the right Site-Prefix is advertised by the right Site-id.

Transit site preference

When you have multiple transit sites (or multiple DCs) with the same set of prefixes advertised from the central sites, you can prefer a specific transit site for a given prefix – the decision is made on routing metrics and advertised masks, and this preference takes priority over path preference.  The feature is called transit site affinity and is on by default (you can turn this off with no transit-site-affinity).

Traffic Class timers – if no traffic is detected for 5 minutes then the dynamic tunnel is torn down.

BACKUP Master Controller

BACKUP Master Controller – you can have a backup Master Controller but it should be noted that today there is no way to provide a stateful replication to the Backup Master Controller – the two are not synched.  The way to do this is to configure the same PfR config on both, and the same loopback address on the backup controller but instead use a /31 mask so that should the primary go away the BRs will detect the failure and reconnect to the Backup Master Controller – so stateless redundancy.

The backup MC will then re-learn the traffic.
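The reason the mask trick works is plain longest-prefix matching, which can be shown with a small Python sketch. The loopback address and the route labels are hypothetical, and this models only the routing lookup, not PfR itself.

```python
import ipaddress
from typing import Dict, Optional

Network = ipaddress.IPv4Network


def longest_match(destination: str, routes: Dict[Network, str]) -> Optional[str]:
    """Return the route label with the longest prefix covering the destination."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in routes if addr in net]
    if not matches:
        return None
    return routes[max(matches, key=lambda net: net.prefixlen)]


# HYPOTHETICAL addressing: both MCs use the same loopback address,
# the primary advertised as a /32 and the backup as a /31.
MC_LOOPBACK = "10.6.32.251"
routes = {
    ipaddress.ip_network("10.6.32.251/32"): "primary-mc",
    ipaddress.ip_network("10.6.32.250/31"): "backup-mc",
}

if __name__ == "__main__":
    print(longest_match(MC_LOOPBACK, routes))   # primary wins while present

    # Primary MC fails and its /32 is withdrawn from routing.
    del routes[ipaddress.ip_network("10.6.32.251/32")]
    print(longest_match(MC_LOOPBACK, routes))   # traffic now reaches the backup
```

While the /32 is in the table it always wins; once it is withdrawn the same address resolves via the /31 and the BRs reconnect to the backup without any config change.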

In the meantime Branch MCs keep the same config and continue to optimise traffic as you would expect – this is effectively Anycast IP.   We follow the routing table and do not drop packets, which is why you set the MPLS preference.

On the branch you need a direct connection between the BRs – on the HUB you just need IP connectivity.

Finally VRFs

VRF-Lite is used with all the same config ideas but per VRF.  Your overlay tunnel is per VRF (DMVPN per VRF) and your overlay routing is also per VRF  (VRFs are not multiplexed in one DMVPN tunnel!).   Under PfR, I mentioned that SAF (Service Advertisement Framework) was part of the magic behind PfR, well the SAF peering for advertisements is also per VRF, as is MC and BR config, and also policies are per VRF.

Monitoring – all the PfR stats are exported using netflow, so it is important to have a monitoring platform that supports Netflow V9 to get the most out of your monitoring for PfR.


Too much to learn

Ok, so that was a lot to take in I agree.  But hopefully by breaking down the component parts a little, next time you look at IWAN you will at least have a place to start, and understand what is actually going on when you select the drop downs in an IWAN GUI.

When you first look at IWAN you have terminology flying at you at an alarming rate. Much of it sounds familiar(ish), and it is easy to leap to a feeling of general understanding, until you realise that you are talking at cross purposes when it comes to EIGRP, or you are not sure exactly what Transit is, or the meaning of a Path.  Hopefully the above provides some context for deployments.

What I would say is that once you understand the components, deployment is surprisingly light touch and easy through your choice of IWAN app and GUI.  In fact it is not too bad without really understanding it all, but it is always best to understand what you have just done.   If you look at other SD-WAN vendors (and I will cover some of the broader protocol choices in another post), the GUIs have abstracted much of the underlying workings.  This makes it all seem “oh so simple”, and to be honest it should be like this.  But as long as you understand that abstractions have been made and that there is no magic, you will quickly get a good feel for the various technologies.   You will understand the protocols and design choices, and be able to identify the innovations that have been sprinkled along the way.

Finally you have a number of options when it comes to monitoring and orchestration with IWAN.  All take the pain away from setup and all work towards enhanced visibility. The fact you have some products marketing IWAN deployments in 10 minutes shows how the mechanisms can be streamlined through abstraction and automation.  In brief, your main options are below.

  • Orchestration – Cisco Prime, Anuta networks, Glue networks, Cisco VMS, Cisco APIC-EM, Cisco Network Service Orchestrator
  • Visualisation/monitoring – Cisco APIC-EM, Living Objects, LiveAction, Cisco Prime/VMS.

Hopefully by now you have enough of a feel for the technology to jump into the validated designs for IWAN productively, and deploy a whizzy tool with growing confidence. You never know, IWAN might be less painful than you might have feared, despite first impressions.

Software Defined Wan (SD-WAN)

With market forecasts ranging from $6bn to $12bn in 2019/20, and Gartner saying 30% of users will be managing their WAN through software in the next 3 years, there is some understandable hype and attention around this term today.

So what is SD-WAN and should we care?

The answer to the second part is probably yes.  Whether it is an Enterprise using software to manage their own Wan, or SPs using software to provide more flexible services to customers, new models will come to the fore.

With the early promise of ease of deployment, central manageability, reduced costs and faster speeds to service provision, who is not going to want some of this?

SDX Central defines SD-WAN as follows “The software-defined wide area network (SD-WAN) is a specific application of software-defined networking (SDN) technology applied to Wan connections, which are used to connect enterprise networks”, so clear as mud then. To be fair it is hard to get a straight answer as to exactly what SDN is, but if you add the word “technology” you are covered, making it possible to at least broadly outline some of the technologies involved.

Of course an SDN isn’t a “thing”.  Not a “thing” you can buy.  It can range from separation and centralisation of the control plane from the data plane in a pure sense, on a controller of sorts, (of which there are many flavours), controlled by software (a variety here, open source, closed, hybrid) to provide programmability of the network (several ways to do this), to the configuration of the network (APIs are the new CLI), and configuration management tools and orchestration engines (Salt, Chef, Puppet, Ansible).

Ultimately much of this is trying to help the network keep up with the speed of change higher up the stack where some of these techniques have been used for a while (server, compute and storage worlds). As we know, the speed of change in networking is traditionally slow.

Much of the above relies on a variety of abstractions and overlays to attempt to streamline services or hide complexity.

It is tempting to drift into a discussion around abstractions, scale, state, control, complexity, speed and intent at this point, but I’ll save that for another day when I have more energy.

Some see SDN as an architecture and SD-WAN as a product you can buy?…ok, maybe.

Or maybe more a solution-set and platform for vendors in this space (Anuta-networks, Talari, Cloudgenix, Viptela, Pertino/Cradlepoint, Velocloud, Glue/Cisco, to name a few.)

The point of all of this?  To make networks more predictable, cheaper, quicker to react, more controllable, stable, service orientated, and accelerate time to market.  The usual stuff then.

What I have seen lately is an increasing acceptance of flexibility and utility in the minds of network engineers. The idea of spinning up a service across the stack as you need it and tearing it down as soon as you don’t, or it becomes too costly to troubleshoot or maintain (it can be easier to spin up a new one than fix the existing). These ideas are seeping slowly into the networking world.  I am not saying it is necessarily the right way, but certainly something I am seeing.

So the next question is, how would we enable good old-fashioned networking to take on such flexible new-age characteristics in the Wan?

One way is to use some of the techniques of SDI–Software Defined Infrastructure. Put simply, “Orchestration and Management software around Storage Compute and Network which Automates provisioning and configuration.”

Combine this with SD-WAN and Network Function Virtualisation and you can easily visualise developing a managed router service deployed with NFV to virtual CPEs (vCPE).

What is NFV? Well NFV decouples network functions, (NAT, DNS, intrusion detection, firewalling, load-balancing etc.) from proprietary hardware appliances so they can run in software.

Ok, now we’ve got that straight…let’s look at some implications for the Wan at a high level

What we are looking for is flexible appropriate network access for the right application at the right time. 

One possibility is to consider a hybrid-Wan. By that I mean dynamically routing traffic over private and public links when it suits the applications e.g.  MPLS and LTE/broadband/wireless respectively. This certainly looks like low-hanging fruit for SD-WAN.

Say you are using an SD-WAN type service connected to private MPLS but with broadband/3G-4G/Wifi as backup, and you can get extended visibility into these networks around reliability and performance.  What some have seen is better performance and reliability in some SP environments over 3G/4G than their existing private expensive MPLS service, so is it beyond the pale to consider flipping the priority and trying public first for high priority apps?

If so, could we see a ramping down of the private MPLS circuit and replacement with DSL internet?

Then along comes 5G…

It is worth asking the question, is your next Wan a 5G network?

5G takes the idea of DSL internet a step further. 5G isn’t simply increased bandwidth.  It will be seen operating in several spaces, from low bandwidth, short transaction IOT (machine to machine) chirp devices, to enhanced high bandwidth applications using some of the advances we have seen on the WiFi side in the last few years (MIMO etc.), and segmentation thereof.  Of course the higher the bandwidth the lower the effective duty cycle for different applications – download your Netflix/Prime/Now/Youtube video quickly and get off my network!   Freeing it up for other uses.

5G has the potential to offer improved indoor coverage, low power, large numbers of connections per cell, and machine to machine expansion.

For this Network Function Virtualisation (NFV) will play an important role with SD-WAN in a 5G network where lower latency will be key for IOT devices at the edge of the network.  Figures being thrown around at the moment are 1ms latency with 5G as opposed to 50ms with 4G, a million connections per km squared, and 10Gbps throughput potentially. Maybe even real-time Cloud-RAN performance of micro rather than milliseconds?  From this you can see how NFV and processing with standard hardware towards an intelligent edge / Fog Computing architectures (Openfog) will start to make sense.

Currently the ETSI Industry Specification Group for Network Functions Virtualization (ETSI ISG NFV) includes all the major players, Telefonica, Verizon, BT, Deutsche Telekom, AT&T, Orange etc. and has grown to include over 230 companies in the interest of trying to drive standard IT virtualisation technologies.  Network functions on industry standard server hardware make it very easy to move locations and reduce the need to install new, expensive, proprietary hardware every time you introduce new services.

C-RAN – SD-WAN and NFV seem a natural fit.

Cloud-RAN (C-RAN or sometimes centralised RAN), and small cell architecture is based around centralisation and virtualisation.  It is therefore easy to see how the above techniques will play an important role in 5G, e.g. 5G small cell deployments in the 30GHz band.

With 5G and C-RAN there will be a diverse range of use cases and requirements that Service Providers will need to be able to respond to quickly.

Cloud-RAN, with the separation of the Base Band Unit (BBU) and the Remote Radio Head (RRH) using fibre, 5G technologies, mmWave (30GHz to 300GHz, the EHF band) with CWDM/DWDM to extend baseband over long distance, will enable the centralisation of control into large scale centralised base band deployment. These technologies enable dynamic resource sharing, virtualisation, low latency, high bandwidth and reliable interconnect to a BBU pool.  We will see more collaborative BBU technologies, and open platform real-time virtualisation technologies.

Different hierarchies will contain different RAN radios of different sizes using multiple data rates.  Cloud-RAN will enable variable rate cells.  How can these be provisioned quickly and cheaply?

How do you enable the various use cases created by retrofitting narrowband to 3G/4G?  Narrowband, as its name suggests, essentially uses a narrow band of frequency spectrum to provide discrete bands (from 20MHz to 200kHz wide) for lower data rate coverage (half or full-duplex), and now that 3GPP Narrowband IOT standardisation is complete, you will surely see a variety of use cases at different speeds for different applications, especially with IOT.


Software Defined Infrastructure (SDI) certainly looks like a good fit here in providing more flexibility and automation to service provision at the edge.

So where does all of this leave the successful SPs of the future?  Providing enhanced visibility across their services allowing Enterprises to make better decisions? Granularity and speed of service provision? Virtualised network functions, security? 5G capabilities?

Long term private Wan contracts should become a thing of the past with this new flexibility, and as the Cloud continues to prove popular as a service model, many customers will simply want “secure, flexible, and reliable access” regardless of how it is delivered.

As ever, it will be the ones who are seriously thinking about all of this now who will come out on top.  With such rapid innovation and development, it is fine not to have the solution today, but we really do need to invest in the problem!


One final note, all these things tend to hype the way to the holy-grail of simplicity.  Smarter, self-aware, self-optimizing, self-scaling, self-healing networking – who wouldn’t want that?

Some of the things covered will certainly help to make some of what we do today simpler, but with increased flexibility will come new services, and at scale this will always be complex.  Networking is complex, we can abstract away some of it (hide it in an abstraction), but there will always be complexity.

The popular TV series “Heroes” had the tag-line “Save the cheerleader, save the world”.  In the less dramatic world of modern networking maybe it should be “Understand your abstractions and interface points, save the world”.