Lack of network and service health data leads to internal communication chaos and angry customers
A provider’s network or platform problem causes enterprise customers to experience service degradation or an outage, but network and IT engineers first learn about it from the sales department rather than from their own Operations Support System (OSS).
You heard right. It sounds a bit archaic, some might even say disgraceful, but guess what? This is exactly what Digital Service Providers (DSPs) encounter, and more often than you might think!
Adding fuel to the fire in this situation, the bigwig responsible for the enterprise segment (let’s say the Chief Marketing Officer (CMO) or Chief Sales Officer (CSO)) might escalate the problem to the Chief Technology Officer (CTO), who’s not even aware that there’s a problem with the network, let alone with a particular customer. The CMO/CSO wants to retain the customer, but without proper technical support, nobody is in a position to detect and fix the problem in time.
If an existing OSS is unable to detect degradations and outages and provide prompt information to isolate problems quickly, maybe it’s time to rethink the system in place rather than wait for an email/call from the customer and point fingers.
Required features of an OSS to run enterprise operations properly
Providing services to enterprise customers is far more intricate than providing commodity services to residential customers, especially when it comes to fixed and cloud/hosting services. Enterprise customers require complex services, involving many different components at many locations:
- L2 connectivity
- L3 connectivity (VPN)
- Locations connected by primary and backup links running on disjoint network segments
- Various access technologies (fiber, microwave, cable, LTE & 5G, FTTH, legacy xDSL, etc.)
- Platform services such as Centrex, hosted call center, IPTV, VAS, DDoS protection, security, etc.
- Data center services: IaaS, PaaS, colocation, managed DC services, etc.
- Cloud connectivity: direct access to AWS or Azure
- Managed cloud services: Microsoft Office365, etc.
From the DSP / Telecom perspective, managing enterprise customers means not only monitoring and managing all of the systems mentioned above, but also correlating all the seemingly unrelated events (alarms) to draw reasonable conclusions about what is happening in the whole system.
Even when it’s understood what the problems are in the system, it’s also necessary to know what customers and services are impacted. The best-case scenario is to receive service degradation/outage alarms from the OSS. This is the only way for engineers to be promptly informed that there is something wrong with the services, inform customers about the situation, start troubleshooting immediately after detecting such an event, and resolve it in minimum time.
When sales is informed about the situation and can in turn advise customers about everything being done to fix the problem, engineers are under less stress, the call center sees call deflection, and the CMO/CSO/CTO all stay out of the picture.
How to choose OSS software – a checklist
Since so many systems are involved, implementing an Operations Support System that utilizes the umbrella concept is a logical choice. Umbrella software can ingest data from all the systems (element managers, specialized NMSs, CRM and billing data, service definitions, etc.), correlate them, and provide tools to quickly detect impacted services and customers. This allows network operations center (NOC) and service operations center (SOC) engineers to manage the situation efficiently.
Since the umbrella concept revolves around the consolidation of data, it’s essential to understand what data should be consolidated and what kind of data processing should be executed. The list is extensive, so let’s focus on the most crucial aspects.
- Network and platforms’ events consolidation and alarm filtering
All communication systems are interrelated. For instance, a fault on the transmission system (say DWDM) impacts connectivity between IP/MPLS nodes of the core or access network. The operation of Softswitch/IMS depends, among other things, on the performance of the virtualization platform running in a data center. This same virtualization platform depends upon the performance of the underlying storage and other systems. Storage system performance depends on the state of the data center environment (power, humidity, temperature …), and so on.
Therefore, having the capability to cross-correlate alarms from different interrelated system domains (IP/MPLS, DWDM, DC infrastructure, computing, SAN, etc.) is a must. In other words, an OSS is completely blind without full fault management capability across all platforms.
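To make the idea of cross-domain correlation a bit more tangible, here is a minimal sketch. It assumes the OSS already knows which resource depends on which (the `DEPENDS_ON` map and all resource names below are made up for illustration), and it simply treats an alarming resource as a symptom whenever something it depends on is alarming too. Real correlation engines are far richer; this only shows the principle.

```python
from collections import deque

# Hypothetical dependency map: each resource lists the resources it depends on.
# Names are illustrative, not taken from any real inventory.
DEPENDS_ON = {
    "ims-core": ["vm-cluster-1"],
    "vm-cluster-1": ["san-1"],
    "san-1": ["dc-power-a"],
    "mpls-pe-1": ["dwdm-span-7"],
    "mpls-pe-2": ["dwdm-span-7"],
}


def upstream(resource):
    """Return every resource that `resource` depends on, directly or indirectly."""
    seen, queue = set(), deque(DEPENDS_ON.get(resource, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(DEPENDS_ON.get(node, []))
    return seen


def correlate(alarming):
    """Split alarming resources into probable root causes and symptoms.

    A resource counts as a symptom if anything it depends on is also
    alarming; otherwise it is reported as a probable root cause.
    """
    alarming = set(alarming)
    roots, symptoms = set(), {}
    for res in alarming:
        causes = upstream(res) & alarming
        if causes:
            symptoms[res] = causes
        else:
            roots.add(res)
    return roots, symptoms


roots, symptoms = correlate(["mpls-pe-1", "mpls-pe-2", "dwdm-span-7", "ims-core"])
print(sorted(roots))  # ['dwdm-span-7', 'ims-core']
```

In this example the two IP/MPLS alarms are folded under the DWDM fault, while the IMS alarm stays a root cause of its own because nothing beneath it is alarming.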
- Performance data consolidation
When there is service degradation or an outage, it’s imperative that the source of such a problem is detected. Alarms are often not enough. Historical performance data from devices and platforms in different domains have to be compared. Again, an OSS must consolidate all the performance data in one place and execute proper performance management across all domains.
- Complete resource inventory and topologies inventory
Faults in the network come from resources (network devices, platforms …) in different domains, and it’s necessary to understand how each fault maps to actual device/system data and how the devices are interrelated. Since fault and performance data must be consolidated as explained before, resource inventory data should also be consolidated in one place to map faults to actual resource inventory data. This is essential if the goal is to troubleshoot efficiently.
The same applies to relations between different components reflected in the different topologies of the system (L2, L3, IS-IS, BGP, OSPF topologies, etc.). Topology maps/schematics should provide a way to understand the cause of the degradation or outage.
- Vendor and protocol agnostic
It is crucial that Operations Support Systems are completely vendor and protocol agnostic. An umbrella system collects and consolidates data from various underlying EMSs and networks/platforms produced by different vendors. Furthermore, these underlying systems use various APIs and protocols (REST, HTTPS, SNMP, Modbus, CORBA, etc.), so an OSS needs to accept any type of data ingest method.
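One common way to stay vendor and protocol agnostic is to put a small adapter in front of each source and normalize everything into one internal alarm model. The sketch below assumes exactly that; the `Alarm` fields, the payload shapes, and the adapter names are all hypothetical and only illustrate the pattern, not any particular product's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    """Common alarm model used inside the umbrella layer (illustrative)."""
    source: str
    resource: str
    severity: str
    text: str


def from_snmp_trap(trap):
    # Assumed trap shape: a dict of already-decoded varbinds.
    severity = {1: "critical", 2: "major", 3: "minor"}.get(trap["sev"], "unknown")
    return Alarm("snmp", trap["agent"], severity, trap["descr"])


def from_rest_event(event):
    # Assumed JSON payload shape from a hypothetical EMS REST API.
    return Alarm("rest", event["node"]["name"], event["level"].lower(), event["message"])


# Supporting a new protocol means registering one more adapter here.
ADAPTERS = {"snmp": from_snmp_trap, "rest": from_rest_event}


def ingest(kind, payload):
    """Normalize a raw payload with the adapter registered for its kind."""
    try:
        adapter = ADAPTERS[kind]
    except KeyError:
        raise ValueError(f"no adapter registered for {kind!r}") from None
    return adapter(payload)


a = ingest("snmp", {"agent": "dwdm-span-7", "sev": 1, "descr": "Loss of signal"})
b = ingest("rest", {"node": {"name": "vm-cluster-1"}, "level": "MAJOR", "message": "Host down"})
print(a.severity, b.severity)  # critical major
```

Everything downstream (correlation, enrichment, reporting) then only ever sees `Alarm` objects, regardless of where the data came from.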
- Complete service inventory
In order to tell which fault and which faulty device impacts the performance of which service and which customer and customer’s location, resources need to be mapped to the CRM’s service instances. For this purpose, a proper and consolidated Service Inventory (SI) that will be used for Service Impact Analysis (SIA) as well as Service Quality Management (SQM) should be established.
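As a rough illustration of what Service Impact Analysis boils down to, the sketch below assumes a consolidated service inventory where each service instance lists the resources it runs on. The inventory content, customer names, and the `impacted_services` helper are invented for the example.

```python
# Hypothetical service inventory: one record per service instance, with the
# customer, the customer site, and the resources the instance depends on.
SERVICE_INVENTORY = [
    {"service": "L3VPN-1001", "customer": "Acme Corp", "site": "Vienna HQ",
     "resources": {"mpls-pe-1", "access-fiber-12"}},
    {"service": "L3VPN-1002", "customer": "Acme Corp", "site": "Graz Branch",
     "resources": {"mpls-pe-2", "access-lte-3"}},
    {"service": "IaaS-2001", "customer": "Globex", "site": "DC-1",
     "resources": {"vm-cluster-1", "san-1"}},
]


def impacted_services(faulty_resources):
    """Return (customer, site, service) for every instance touching a faulty resource."""
    faulty = set(faulty_resources)
    return [
        (si["customer"], si["site"], si["service"])
        for si in SERVICE_INVENTORY
        if si["resources"] & faulty
    ]


impact = impacted_services(["mpls-pe-1", "san-1"])
for customer, site, service in impact:
    print(f"{customer} / {site}: {service}")
```

With the two faulty resources above, the lookup reports the Acme Corp VPN at Vienna HQ and the Globex IaaS instance, and leaves the Graz Branch VPN out.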
- Open architecture
It is extremely important that an OSS has open architecture, for example by allowing software modules to be added to the basic framework. Having an OSS with open architecture also makes life easier when adding new devices or integrating with new systems.
- Integration with non-technical data sources
Technical data must be looked at in context with external non-technical data such as customer data, billing data, ERP data, etc. Without this information, it’s impossible to know which problem is more important, how the problem impacts the telecom’s business, etc. Therefore, the OSS has to be integrated with these external systems to allow such mapping between technical and non-technical data.
- Enrichment of alarms and performance data
When there is an alarm/problem in the network, engineers don’t have time to explore which device is impacted, where it is located, what services are impacted, etc. The OSS must do it for them. Therefore, enriching alarms and performance data with resource inventory, service, and customer data is a must-have in a modern NOC/SOC.
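Enrichment can be pictured as a simple join between the raw alarm and the inventories. The sketch below assumes two lookup tables keyed by resource name; the table contents and the `enrich` function are hypothetical and only show the shape of the result an engineer would see.

```python
# Hypothetical lookup tables; in practice these would be fed from the
# consolidated resource and service inventories.
RESOURCE_INVENTORY = {
    "mpls-pe-1": {"vendor": "VendorA", "location": "PoP Vienna", "role": "PE router"},
}
SERVICES_BY_RESOURCE = {
    "mpls-pe-1": [{"service": "L3VPN-1001", "customer": "Acme Corp"}],
}


def enrich(alarm):
    """Return a copy of the alarm with inventory, service, and customer context."""
    resource = alarm["resource"]
    enriched = dict(alarm)  # leave the raw alarm untouched
    enriched["inventory"] = RESOURCE_INVENTORY.get(resource, {})
    enriched["services"] = SERVICES_BY_RESOURCE.get(resource, [])
    return enriched


raw = {"resource": "mpls-pe-1", "severity": "critical", "text": "Link down"}
enriched = enrich(raw)
print(enriched["inventory"]["location"], enriched["services"][0]["customer"])
# PoP Vienna Acme Corp
```

Unknown resources still pass through, just with empty context, so a gap in the inventory never hides an alarm.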
- Unified fault and performance management
When all the alarms and performance data are consolidated in one system, then it is crucial to have capabilities such as centralized alarm correlation, centralized threshold violation calculations, KPI calculations, and service impact analysis.
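A centralized threshold violation calculation can be as simple as the sketch below. It assumes a time-ordered list of samples for one KPI and only raises a violation when the threshold is exceeded for several consecutive samples, which avoids alarming on a single spike. The function name, the sample values, and the default of three samples are illustrative choices.

```python
def threshold_violations(samples, threshold, min_consecutive=3):
    """Return (start_index, end_index) for each run of samples above the threshold.

    A run is only reported if it lasts at least `min_consecutive` samples.
    """
    runs, start = [], None
    for i, value in enumerate(samples):
        if value > threshold:
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_consecutive:
                runs.append((start, i - 1))
            start = None
    # Close a run that is still open at the end of the series.
    if start is not None and len(samples) - start >= min_consecutive:
        runs.append((start, len(samples) - 1))
    return runs


# Link utilization in percent, one sample per polling interval (made-up data).
utilization = [40, 85, 91, 95, 92, 60, 93, 94, 50, 96, 97, 98]
print(threshold_violations(utilization, 90))  # [(2, 4), (9, 11)]
```

The two-sample burst at indexes 6 and 7 is ignored, while the two longer runs are reported.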
- Automated jobs
Some problems require an engineer to investigate what’s wrong and connect to a console to fix it. Still, the vast majority of faults can be resolved by executing predefined actions (such as shutting a port down and bringing it back up). These actions should be automated because they greatly speed up troubleshooting and dramatically reduce the mean time to repair.
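A minimal sketch of such automation is a playbook that maps alarm types to predefined actions and falls back to a human when no action is known. Everything below is hypothetical: the alarm type, the playbook, and the CLI-style command strings are placeholders, and the code only plans the commands rather than sending them to a device.

```python
def bounce_port(alarm):
    """Plan a port shut / no shut; the command strings are illustrative."""
    return [f"interface {alarm['port']}", "shutdown", "no shutdown"]


def open_ticket(alarm):
    """Fallback when no predefined action exists: hand over to an engineer."""
    return [f"ticket: manual investigation needed on {alarm['resource']}"]


# Hypothetical playbook: alarm type -> predefined action.
PLAYBOOK = {"PORT_ERR_DISABLED": bounce_port}


def plan_remediation(alarm):
    """Return the chosen action name and the steps it would execute."""
    action = PLAYBOOK.get(alarm["type"], open_ticket)
    return action.__name__, action(alarm)


name, steps = plan_remediation(
    {"type": "PORT_ERR_DISABLED", "resource": "sw-12", "port": "GigabitEthernet0/1"}
)
print(name, steps)
```

Keeping the planning step separate from execution makes it easy to review or dry-run an action before letting it touch the network.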
- Service Quality Management
When all the data are in one place alongside service and resources inventory, then all conditions are met to implement proper Service Quality Management (SQM). SQM must provide the means to model services based on catalog and inventory data, as well as calculate KQIs and inform on service instance degradation based on available service inventory (SI) data.
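To give a feel for KQI calculation, the sketch below assumes a KQI defined as the weighted share of KPI targets a service instance currently meets. That definition, the targets, the weights, and the 0.8 degradation limit are all assumptions made for the example; a real SQM model would take them from the service catalog.

```python
# Hypothetical KPI targets and weights for one service type.
TARGETS = {
    "latency_ms": lambda v: v <= 30,
    "packet_loss_pct": lambda v: v <= 0.1,
    "availability_pct": lambda v: v >= 99.9,
}
WEIGHTS = {"latency_ms": 1, "packet_loss_pct": 1, "availability_pct": 2}


def kqi(kpis):
    """Weighted share of KPI targets met, from 0.0 (none) to 1.0 (all)."""
    total = sum(WEIGHTS.values())
    met = sum(w for name, w in WEIGHTS.items() if TARGETS[name](kpis[name]))
    return met / total


def degraded_instances(measurements, min_kqi=0.8):
    """Return the service instances whose KQI falls below `min_kqi`."""
    return sorted(s for s, kpis in measurements.items() if kqi(kpis) < min_kqi)


# Made-up KPI measurements per service instance.
measurements = {
    "L3VPN-1001": {"latency_ms": 45, "packet_loss_pct": 0.05, "availability_pct": 99.95},
    "L3VPN-1002": {"latency_ms": 12, "packet_loss_pct": 0.01, "availability_pct": 99.99},
}
print(degraded_instances(measurements))  # ['L3VPN-1001']
```

Here the first instance misses its latency target, scores 0.75, and is flagged as degraded even though no device is actually down.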
- Integration with service desk/ticketing systems
The network and service assurance system described above is not an isolated island. Based on the situation detected, it’s important to also integrate with external workforce management/ticketing systems in order to activate the technical resources needed to fix any technical problem.
- Automated Reporting
There’s no getting around it, management always requires reports. That’s how management works. Numerous presentations on business unit performance and discussions on how to improve are always on the agenda. Preparing them is normally a time-consuming job, so requiring the OSS to automate this function is a no-brainer.
The same applies to operational reporting like SLA reports for customers.
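As an example of what an automated SLA report has to compute, the sketch below derives monthly availability from a list of outage intervals. It assumes outages do not overlap each other and clips any outage that crosses the reporting period boundary; the function name and the sample outages are made up.

```python
from datetime import datetime


def availability_pct(period_start, period_end, outages):
    """Availability in percent over the period, given (start, end) outages.

    Assumes outages do not overlap; parts outside the period are ignored.
    """
    period = (period_end - period_start).total_seconds()
    downtime = sum(
        (min(end, period_end) - max(start, period_start)).total_seconds()
        for start, end in outages
        if end > period_start and start < period_end
    )
    return 100.0 * (1 - downtime / period)


# Made-up outages for one service instance in April 2024 (a 30-day month).
outages = [
    (datetime(2024, 4, 10, 2, 0), datetime(2024, 4, 10, 2, 30)),         # 30 min
    (datetime(2024, 4, 22, 14, 0), datetime(2024, 4, 22, 14, 13, 12)),  # 13 min 12 s
]
april = availability_pct(datetime(2024, 4, 1), datetime(2024, 5, 1), outages)
print(round(april, 3))  # 99.9
```

A total of 43.2 minutes of downtime in a 30-day month works out to 99.9% availability, which is the kind of figure an SLA report would then compare against the contracted target.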
How to proceed? Remember the tortoise and the hare
Since it’s exciting to overhaul a system, it might be tempting to go all out. However, don’t attempt everything at once. This will fail. Slow and steady wins the race. We’ve seen people leave organizations after huge disappointments caused by missing personal expectations, management’s expectations, or both.
It’s not just about the tool. The company delivering the solution must understand the industry and the nature of the business, and must have previous industry experience. This means that the partner delivering the solution should have people who have worked both in telecom OSS and in integration/vendor environments. This organization will be your best ally.
Start with the basics. Establish unified fault and performance management and inventory management for a quick and satisfying boost of progress.
Looking at a single screen that has all alarms in one place, rather than staring at 18 screens, also saves time and helps your organization be more efficient.
After implementing FM and PM, adding Service Inventory (SI) is the key to advancing from network assurance to service assurance (which obviously encompasses network assurance). Only then are you ready for SQM and the proper tools for your NOC/SOC.
Take it step by step and eventually you’ll eliminate those calls/emails from unsatisfied customers that started your whole quest for a new OSS.