Sponsored No-one said enterprise networking is easy, so it’s understandable that the traditional approach to managing switches has followed the maxim, “keep it simple, stupid”.
For years, NetOps teams have relied on vendors’ own apps to manage their highly complex, and highly expensive, switches, or gone on-device using a command line interface (CLI). Of course, they can turn to approved tools or third-party apps in conjunction with standard protocols, such as SNMP - if the supplier chooses to let them.
Even then, each approach delivers, at best, a subtly different selection of data. Oh, and this “simple, stupid” approach means management is never going to be real time.
This might sound limiting when it comes to managing, optimizing or troubleshooting networking systems, but the approach arguably made sense when stability at all costs was the aim.
Meanwhile, over the last two decades, hyperscalers have built their data centers on standardized, often no-name networking and server silicon, and developed their own tooling to automate management, provisioning, and deployment. It’s no coincidence that this approach to NetOps has emerged alongside, and mirrored, the way that DevOps has revolutionized how forward-thinking companies manage, develop, and deploy software as well as servers and storage.
As Nokia’s principal solutions architect for webscale, Erwan James, explains, hyperscalers don’t want to have a human touch a network device, either physically or virtually: “They want to make sure that whatever is or was historically accessible on the device by the CLI is now accessible off the device because they actually have no interest in logging into these devices to troubleshoot.”
This approach has, in the main, been blisteringly successful. But the hyperscalers have largely kept that expertise to themselves. By comparison, the restricted options available to most enterprises and smaller operators increasingly look less like simple and more like out-and-out stupid. “You're not getting real time telemetry because you have to scrape on a periodic basis,” James says. A CLI is designed for humans, operating at human speed, he points out: “It's not designed for devices to login and issue their commands.”
Just scraping by?
These are some of the problems that Nokia has addressed with its Network Operating System (NOS), Service Router Linux (SR Linux). SR Linux is part of Nokia’s Data Center Switching Fabric, which also spans the Nokia Fabric Services System and Nokia’s own switching hardware.
While it’s not uncommon for networking vendors to leverage Linux in their architectures, Nokia has gone further by fully embracing the open ecosystem ethos. Hence one of the central elements of Nokia’s approach is its NetOps Development Kit (NDK), which enables operators to use SR Linux’s underlying model-driven architecture to create their own tooling for managing and automating their networks.
James describes the NDK as “a set of tools and frameworks including a completely customizable CLI, an SDK for uncompromised access into the NOS, an open set of management interfaces and an operating system architecture which allows operators to make meaningful changes to how they interact with network devices”. And, he adds, “It allows you to customize, change, expose and retrieve data.”
It’s important to understand what “data” means in this context. Firstly, it is configuration data, “where you're trying to push down your BGP configuration, any routing, stack configuration, or ACL configuration.” The second element, James continues, is “stateful data that’s happened as a result of some of the configuration data.” This could encompass interface statistics, or state changes in routing protocols or routing tables.
In contrast to traditional architectures, the entire switch/router data model is accessible via “modern” northbound interfaces, such as gNMI, and can be consumed by off-device tools and applications, whether Nokia’s own offerings, third-party platforms, or – crucially – those a user decides to build themselves.
“Something that's very powerful, once you wrap your head around the concept, is that you can now run an application on the box,” says James, adding that the application can have its own configuration and its own stateful data.
Furthermore, he says, “If you model the data your own application uses for configuration and state in YANG, SR Linux will automatically expose your application’s data through the same CLI and telemetry interfaces as the rest of the NOS applications. Essentially, your application becomes a part of the NOS and is treated as a first-class citizen.”
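To make the split between configuration and state concrete, here is a minimal Python sketch of how an on-box application's data might be organized and addressed by path. It is not the NDK API or a real YANG model: the application name `cabling-check`, its leaves, and the path layout are all hypothetical, chosen only to show the idea of one tree holding both kinds of data.

```python
from typing import Any, Dict, Iterator, Tuple


def flatten(tree: Dict[str, Any], prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield a (path, value) pair for every leaf in a nested dict."""
    for key, value in tree.items():
        path = f"{prefix}/{key}"
        if isinstance(value, dict):
            # Containers are walked recursively; only leaves are emitted.
            yield from flatten(value, path)
        else:
            yield path, value


# Hypothetical application data: "config" is what an operator sets,
# "state" is what the application reports back as a result.
app_data = {
    "cabling-check": {
        "config": {"controller-address": "192.0.2.10", "interval-seconds": 30},
        "state": {"ports-checked": 48, "mismatches": 1},
    }
}

for path, value in flatten(app_data):
    print(path, "=", value)
```

The point of the sketch is that once an application's data lives in a single modelled tree, the same path-based addressing can serve both a CLI and a telemetry subscriber.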
This potentially puts huge power in the hands of NetOps teams – and leaves them facing the dilemma of how to use it appropriately.
I’ve got the power … now what?
More traditional teams, without a strong development competency, and comfortable with working via the command line, might start by simply enhancing how they work via CLI, suggests James.
This can be done using Python, he explains, delivering improvements in the operational experience, without changing the core methodology for operating the network. From there the team can take small steps, working with the data off-box, and progressively building a full-fledged software development team dedicated to NetOps.
At this point, he suggests, they will be saying to themselves, “Okay, we can now build applications.” These can be developed “in any language that you see fit, whether that'd be Python, Go, Java, that can all run on the network switch device. And we can model the application data in YANG and allow the switch to export that data using the same interface we've been using for the rest of the switch data.”
The ultimate destination, he says, is “the really low-level access the hyperscalers of this world want, with direct access to route tables, to ACL tables, and really low-level changes.”
So, what does this all look like in practice?
It’s relatively early days for the NDK, but James cites the example of one customer that has used it to build an application that monitors LLDP (Link Layer Discovery Protocol) neighbors.
The application runs on the switch, and “reaches out to a central controller, which has a single source of truth view of the network and an understanding of what port is connected to another port, and then checks the local data on the switch against the received data from the neighbor using LLDP.”
The app compares this to an off-device single source of truth database, he explains. “If there's something wrong, using gNMI and streaming telemetry it is able to flag and alert the network operations team that ‘hey, there's been some mis-cabling’.”
This is as simple as it gets, he says, but it clearly illustrates the potential benefits for operators.
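The core of such a check is a straightforward comparison, sketched below in Python. This is an illustration, not the customer's code: the `Neighbor` type, the `find_miscabling` function, and the sample port and device names are all assumptions, and a real application would obtain `expected` from the central controller and `observed` from the switch's LLDP state rather than from hard-coded dictionaries.

```python
from typing import Dict, List, NamedTuple, Optional


class Neighbor(NamedTuple):
    """The remote end of a link: which device and which port."""
    device: str
    port: str


def find_miscabling(
    expected: Dict[str, Neighbor],
    observed: Dict[str, Neighbor],
) -> List[str]:
    """Compare LLDP-learned neighbors with the source of truth.

    Both arguments map a local port name to the neighbor on that port.
    Returns one human-readable alert per port that does not match.
    """
    alerts = []
    for port in sorted(expected):
        want = expected[port]
        got: Optional[Neighbor] = observed.get(port)
        if got is None:
            alerts.append(
                f"{port}: expected {want.device}:{want.port}, no LLDP neighbor seen"
            )
        elif got != want:
            alerts.append(
                f"{port}: expected {want.device}:{want.port}, "
                f"found {got.device}:{got.port}"
            )
    return alerts


# Hypothetical sample data for one leaf switch.
expected = {
    "ethernet-1/1": Neighbor("spine1", "ethernet-1/3"),
    "ethernet-1/2": Neighbor("spine2", "ethernet-1/3"),
}
observed = {
    "ethernet-1/1": Neighbor("spine1", "ethernet-1/3"),
    "ethernet-1/2": Neighbor("spine2", "ethernet-1/4"),  # cabled to the wrong port
}

for alert in find_miscabling(expected, observed):
    print(alert)
```

In the scenario James describes, each alert would be written into the application's state data and picked up by streaming telemetry rather than printed.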
Going further, James describes the example of an application on an SR Linux switch that talks to a Kubernetes cluster and understands “the applications that are sitting below it in the rack, and the network constructs that should be coming from the application world into the network.”
He explains, “In Kubernetes, you expose an application using a service IP, and that service IP gets advertised to the fabric. Well, now suddenly, you want to make sure that the operations team for networking and the operations team from the application world speak the same language.”
The on-switch app then can retrieve “the relevant information such as what applications are exposed, what IPs are they using, put that into some sort of meaningful data model into the switch, and then export that from the switch into your network monitoring tools, just as you would with any other network application.”
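As a rough Python sketch of the "meaningful data model" step, the function below reshapes a list of service records into a table keyed by service IP, which is the kind of structure a network monitoring tool could consume. Everything here is assumed for illustration: the record keys (`namespace`, `name`, `ip`, `ports`), the function name, and the sample services. It does not call the Kubernetes API; a real application would fetch these records from the cluster first.

```python
from typing import Dict, List


def build_service_state(services: List[dict]) -> Dict[str, dict]:
    """Reshape service records into a table keyed by service IP.

    Each record is a plain dict with the hypothetical keys "namespace",
    "name", "ip" and "ports". Records without an IP are skipped, since
    they are not advertised to the fabric.
    """
    table: Dict[str, dict] = {}
    for svc in services:
        ip = svc.get("ip")
        if not ip:
            continue
        table[ip] = {
            "application": f'{svc["namespace"]}/{svc["name"]}',
            "ports": sorted(svc.get("ports", [])),
        }
    return table


# Hypothetical records, as they might look after being read from a cluster.
services = [
    {"namespace": "shop", "name": "checkout", "ip": "198.51.100.20", "ports": [443, 80]},
    {"namespace": "shop", "name": "worker", "ip": None, "ports": []},
]

state = build_service_state(services)
print(state)
```

Keying the table by IP reflects the network team's point of view: they see an advertised address first and want to know which application sits behind it.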
Of course, in the traditionally cautious world of enterprise networking, this could all be seen as a gateway to chaos. So, it’s important to remember that Nokia’s Fabric Services System also includes a Digital Sandbox. This allows the creation of a digital twin of the live network which can be used to validate code. For example, says James, “I can validate my new CLI plugin and ensure my new application has the intended output and outcome when my operators log into it.”
James says the key for teams contemplating the technology is to understand precisely what stage they’re at, the capabilities of their teams, and what their underlying infrastructure allows them to do, as they work towards enabling that hyperscale level of automation.
“It's really a four-pronged approach,” he says. “A device needs to be able to export all the data I want to export. I need to have something that can collect the data and connect to all the switches. I need to store this data somewhere. And then the last prong is I need to do something meaningful with the data.”
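The four prongs map naturally onto four small functions, as in the Python sketch below. It is a toy, in-memory illustration under stated assumptions: the switch names, the `in-errors` counter, and the threshold are invented, and the functions stand in for what would in practice be a telemetry interface on the device, a collector, a time-series database, and an alerting or automation layer.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[str, str, float]  # (switch, metric path, value)
History = Dict[Tuple[str, str], List[float]]


def export(switch: str, counters: Dict[str, float]) -> List[Sample]:
    """Prong 1: the device exposes its data as (switch, path, value) samples."""
    return [(switch, path, value) for path, value in counters.items()]


def collect(switches: Dict[str, Dict[str, float]]) -> List[Sample]:
    """Prong 2: one collector gathers samples from every switch."""
    samples: List[Sample] = []
    for name in sorted(switches):
        samples.extend(export(name, switches[name]))
    return samples


def store(db: History, samples: Iterable[Sample]) -> None:
    """Prong 3: append each sample to a per-switch, per-path history."""
    for switch, path, value in samples:
        db[(switch, path)].append(value)


def act(db: History, path: str, threshold: float) -> List[str]:
    """Prong 4: list the switches whose latest value exceeds the threshold."""
    return sorted(
        switch
        for (switch, metric), history in db.items()
        if metric == path and history[-1] > threshold
    )


# Hypothetical fabric with one counter per switch.
fabric = {"leaf1": {"in-errors": 0.0}, "leaf2": {"in-errors": 12.0}}
db: History = defaultdict(list)

store(db, collect(fabric))
print(act(db, "in-errors", threshold=10.0))
```

Keeping the stages separate mirrors James's point: each prong can be assessed, and replaced, on its own as a team's capabilities grow.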
With Nokia’s SR Linux in general, and the NDK in particular, the first three prongs should no longer be an insurmountable challenge. As for doing something meaningful with all that data: in time, your NetOps teams will have all the detailed network views, alerts and insights that they need to help automate network operations completely.
Sponsored by Nokia