One of the largest carriers in Australia provides a variety of network services to a major federal agency in that country; in addition to provisioning circuits, it also provides support for the delivery of business applications integral to the government department. Any significant performance degradation or failure would not only disrupt users and affect the service-level agreement (SLA), it would also quickly become public knowledge, potentially harming the carrier’s reputation.
IT staff began its work for the federal agency using traditional network performance monitoring with send/receive interface utilization graphs and using the IP accounting features on deployed routers to get NetFlow information. With the advent of more heterogeneous applications, including the introduction of voice and video, the department needed a more holistic view of their network traffic and the environment. End users expected these applications to always be available and to be very responsive.
To deliver that level of visibility, the service delivery team decided to go well beyond basic socket or flow-level understanding to application fluency, comprising response-time analysis, dependency characteristics, and volumes and patterns of traffic on both the local and WAN links. This visibility allowed more-proactive troubleshooting of application-performance issues, enabling them to not only reduce mean time to repair (MTTR), but also provide tangible information about applications and services outside of their control.
Manage Increased IT Complexity by Being Application Fluent
With the increasing use of mobile devices and cloud services, as well as big data deployments and custom applications, there is more pressure on network teams than ever before. Complexity increases every day, and enterprise IT infrastructures have become borderless, as data centers are no longer physically constrained but extend past the firewall all the way to the edge of the Internet.
In the past, network teams could overprovision or throw bandwidth at problems and would measure and report on link utilization, network latency and device use. That era is long gone, as applications have become far too complex to be monitored from only one perspective. Blindly applying additional bandwidth rarely solves the problem, because bandwidth is often not the constraining component in the first place, a fact that can only be ascertained by analyzing the in-flight application characteristics. Applications now drive the business, and end users demand instant access from any platform, anywhere. Assessment of application/business health needs to begin with end-user experience.
To effectively understand, manage and optimize how their network is performing, IT teams need to become “application fluent,” a need that is ushering in a new era of application performance management known as application-aware network performance management (AA-NPM). From a network team’s perspective, application fluency means understanding what applications are in use, who’s using them, how and where the data is delivered, where it’s hosted, and, perhaps most importantly, what end-to-end response times are achieved. Network IT professionals can use AA-NPM not only to solve performance problems, but also to become heroes in their organizations.
For the carrier in Australia, AA-NPM was the perfect fit. Because it identifies in-flight contiguous application usage on both the LAN and WAN, IT is now able to provide reporting to its client every week, and not just when applications inadvertently fill individual WAN links. The reports are easy to generate and are presented in a context a user can understand. Network teams are also using the solution to troubleshoot application performance issues. These results made the team a “trusted advisor” and a “source of truth” in the eyes of the client.
From Beleaguered Target to Network Hero
Networking is generally the group business users turn to when applications aren’t behaving well. The group must be able to react and initially prove the delays are outside the network arena and then, through their data, identify where slowdowns are occurring. That’s where they can gain credibility and become experts in the eyes of the rest of the organization.
Network groups have a significant advantage over other groups in fault-domain isolation (identifying the area of delay) because the network commonly touches all tiers of the application. A modern application could have thin client, load balancer, web, firewall, application, middleware, database and storage tiers. The common element in that application-delivery chain is the network; it touches all tiers. A database team sees what’s happening inside a database, which is perfect when the delays are there but represents a disadvantaged starting point otherwise. Likewise, web teams, firewall teams, middleware teams and so on have only a limited vantage point. Each has value in its own arena, but for initial fault-domain isolation, networking is the go-to group.
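As a rough illustration of fault-domain isolation, the sketch below assumes the network team can timestamp a request entering each tier and the matching response leaving it. The `TierObservation` record, its field names and the sample timings are all hypothetical, and a real AA-NPM product would derive these values from captured traffic rather than hand-entered numbers.

```python
from dataclasses import dataclass


@dataclass
class TierObservation:
    """Timing for one tier as seen on the wire (all names are hypothetical)."""
    name: str
    request_in: float    # seconds: request arrives at this tier
    response_out: float  # seconds: response leaves this tier


def exclusive_delays(chain: list[TierObservation]) -> dict[str, float]:
    """Return the time attributable to each tier, excluding the tier behind it.

    `chain` is ordered from the front tier to the last tier, and each tier's
    interval is assumed to contain the next tier's interval.
    """
    delays = {}
    for i, tier in enumerate(chain):
        total = tier.response_out - tier.request_in
        if i + 1 < len(chain):
            downstream = chain[i + 1]
            # Subtract the time spent waiting on the next tier.
            total -= downstream.response_out - downstream.request_in
        delays[tier.name] = round(total, 6)
    return delays


def slowest_tier(chain: list[TierObservation]) -> str:
    """Name the tier that contributes the most delay of its own."""
    delays = exclusive_delays(chain)
    return max(delays, key=delays.get)


# Hypothetical timings for one slow transaction.
chain = [
    TierObservation("web", 0.000, 1.000),
    TierObservation("application", 0.010, 0.980),
    TierObservation("database", 0.030, 0.930),
]
print(exclusive_delays(chain))
print(slowest_tier(chain))
```

Note that in this simplified model the delay attributed to a tier also includes the network transit time to the tier behind it; separating the two would need capture points on both sides of each hop.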
What to Look For in an AA-NPM Solution
IT has to be careful about choosing an appropriate AA-NPM solution, since many provide high-level information but are not truly application fluent. They focus more on the network devices transporting the data (e.g., routers, switches and firewalls) and thus are unable to discover enough detail to solve problems.
AA-NPM solutions may employ SNMP queries, flow information (e.g., NetFlow) and/or network packets to gain visibility. The most comprehensive and valuable data for problem isolation and fault analysis is packet data. The others are useful for higher-level supporting analysis like identifying which ports or protocols are in use or how heavily utilized network resources are, but packets contain enough information in most cases to prove guilt or innocence among application and network tiers.
Many AA-NPM solutions capture and analyze packets, so focus on the ones that provide the deepest decoding/visibility. Decoding at the IP level tells you who is talking to whom. Decoding at the TCP level tells you what ports are in use and can help identify which applications are in flight. It can also measure network-level response times (not to be confused with end-user experience). Decoding all the way up to the application layer provides the strongest visibility and is a huge value in fault-domain isolation. From an Open Systems Interconnection (OSI) stack perspective, the higher you go, the more you know. Application-level decoding also allows the network team to speak the same language as many of the other teams. If you’ve ever been part of a war room, you likely recognized that few groups understand IP or TCP characteristics, but application transactions are well understood. As an example, when a network team can pinpoint specific web pages, web objects, database queries or SAP transaction codes that are problematic, as well as identify the tier/server hosting that component, they gain instant credibility. When they’re able to do this repeatedly, they begin to achieve hero status.
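To make the distinction between network-level and application-level response time concrete, here is a minimal sketch that works on an already-decoded packet list. The `Packet` record, its field names and the sample timestamps are assumptions made for illustration; they do not correspond to any particular capture library or product.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    """A simplified, already-decoded packet (field names are hypothetical)."""
    timestamp: float   # seconds, as captured near the client
    direction: str     # "c2s" (client to server) or "s2c" (server to client)
    flags: str         # TCP flags, e.g. "SYN", "SYN-ACK", "ACK", "PSH-ACK"
    payload_len: int   # bytes of application data carried


def network_round_trip(packets):
    """TCP-level view: time from the client's SYN to the server's SYN-ACK."""
    syn = next(p for p in packets
               if p.direction == "c2s" and p.flags == "SYN")
    syn_ack = next(p for p in packets
                   if p.direction == "s2c" and p.flags == "SYN-ACK")
    return syn_ack.timestamp - syn.timestamp


def application_response_time(packets):
    """Application view: first request payload to first response payload."""
    request = next(p for p in packets
                   if p.direction == "c2s" and p.payload_len > 0)
    response = next(p for p in packets
                    if p.direction == "s2c" and p.payload_len > 0
                    and p.timestamp > request.timestamp)
    return response.timestamp - request.timestamp


# One hypothetical connection: a fast network in front of a slow server.
trace = [
    Packet(0.000, "c2s", "SYN", 0),
    Packet(0.020, "s2c", "SYN-ACK", 0),
    Packet(0.021, "c2s", "ACK", 0),
    Packet(0.022, "c2s", "PSH-ACK", 300),   # request
    Packet(0.042, "s2c", "ACK", 0),         # server acknowledges the request
    Packet(0.522, "s2c", "PSH-ACK", 1400),  # first byte of the response
]

rtt = network_round_trip(trace)
app = application_response_time(trace)
print(f"network round trip: {rtt:.3f}s, application response: {app:.3f}s")
```

In this made-up trace the handshake completes in about 20 ms while the response takes about half a second, which points away from the network and toward the server or the tiers behind it.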
AA-NPM solutions are able to analyze information in real time, which is critical to recognizing problems as they occur rather than waiting for users to call the help desk. This is an important concept for organizations that want to be proactive in their monitoring. It’s useful to store packets for post-incident troubleshooting, but it’s critical that your AA-NPM solution measure application performance in real time and alert the appropriate staff when slowdowns or other issues occur. It’s also important to use solutions that monitor all transactions rather than applying sampling techniques. By monitoring all users and all transactions, IT never misses a thing.
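The idea of measuring every transaction and alerting in real time can be sketched as a small rolling-window check. The `ResponseTimeMonitor` class, its parameters and the threshold values below are hypothetical and chosen only for illustration; commercial tools typically use more sophisticated baselining.

```python
from collections import deque


class ResponseTimeMonitor:
    """Flag a slowdown when too many recent transactions exceed a threshold.

    Every transaction is observed (no sampling). All names and default
    values here are illustrative assumptions.
    """

    def __init__(self, threshold_s, window=100, breach_ratio=0.2):
        self.threshold_s = threshold_s      # what counts as a slow transaction
        self.breach_ratio = breach_ratio    # share of slow ones that triggers an alert
        self.recent = deque(maxlen=window)  # True/False per recent transaction

    def observe(self, response_time_s):
        """Record one transaction; return True if an alert should be raised."""
        self.recent.append(response_time_s > self.threshold_s)
        slow_share = sum(self.recent) / len(self.recent)
        return slow_share >= self.breach_ratio


monitor = ResponseTimeMonitor(threshold_s=1.0, window=10, breach_ratio=0.25)
samples = [0.2] * 7 + [1.5, 1.5, 1.5]   # seven fast transactions, then three slow
alerts = [monitor.observe(t) for t in samples]
print(alerts)
```

With these sample values the alert fires on the third slow transaction, when 3 of the last 10 observations exceed the threshold. A production version would likely also require a minimum number of observations before alerting, so that a single slow transaction at startup does not trigger a false alarm.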
Traditionally, users would call the help desk and complain about a performance issue. With proactive AA-NPM solutions, you’re in a position to recognize problems as they occur, often before users report them, and fix them in minutes or hours, rather than days or weeks. This approach can also significantly reduce the number of complaints from unhappy users.
The Bottom Line
Improving networking with an effective AA-NPM solution enables network teams to solve complex application problems and enhance overall business success. In other words, they become true network heroes.
About the Author
Mike Hicks is Senior Product Manager for Network Performance Monitoring at Dynatrace. In his 30+ years in the industry he has worked closely with the majority of infrastructure vendors in the area of application profiling, network performance and management. In addition, Mike has authored two books, Managing Distributed Applications: Trouble shooting in a heterogeneous environment (Prentice Hall 2000) and Optimising Applications on Cisco Networks (Cisco Press 2004).
Dynatrace is the innovator behind the new generation of application performance management. Its passion: helping customers, large and small, see their applications and digital channels through the lens of end users.