What is a network incident in the telecom sector?
In information technology, an incident is an event outside normal operations that disrupts operational processes. It may stem from the failure of a delivered feature or service, or from some other operational failure. A security incident indicates that an organization's data or operational systems (hardware or software) may have been compromised or misused. Network incidents range from minor disruptions, such as a desktop machine running out of disk space, to major ones, such as data breaches that expose sensitive information. In the telecom sector the term also covers disruptions to internet, mobile, and landline services, and to services provided by commercial and public organizations, all of which can affect the daily lives of consumers and citizens. Non-availability incidents in public telecom services can have a widespread impact, and they are analyzed and resolved by the service provider. Because network incident reports are highly confidential, they are rarely analyzed beyond some elementary descriptive statistics, so a significant opportunity to produce insights that are valuable to other providers and to standardization bodies is lost.
Incident Management is an important part of IT Service Management (ITSM) which involves returning a disrupted service to normal as quickly as possible after an incident, in a way that minimizes any negative impact on the business.
The process of incident management works in the following manner:
I) Identify the incidents that are serious enough to be a cause for concern.
II) Categorize each incident according to the source of the problem.
III) Assess the severity of the problem.
IV) Identify the assets affected during the incident.
V) Identify the roles of the affected personnel.
VI) Record the method of resolution.
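The categories above map naturally onto a small record type. The following is a minimal Python sketch; the names `Incident`, `Severity`, and `is_serious`, and the three severity levels, are illustrative assumptions and not part of any ITSM standard.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Severity(Enum):
    # Hypothetical three-level scale; real ITSM tools define their own.
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Incident:
    summary: str
    source: str                      # source of the problem
    severity: Severity               # severity of the problem
    assets: List[str] = field(default_factory=list)          # assets affected
    affected_roles: List[str] = field(default_factory=list)  # roles of affected personnel
    resolution: Optional[str] = None  # method of resolution, filled in once known


def is_serious(incident: Incident, threshold: Severity = Severity.MEDIUM) -> bool:
    """Return True when the incident is at or above the chosen threshold."""
    return incident.severity.value >= threshold.value
```

For example, `Incident("Core router down", "hardware", Severity.HIGH, assets=["core-rtr-1"])` would pass `is_serious`, while a low-severity disk-space warning would not.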
How can router (and other) network incidents be approached?
When a network incident hits, the first priority is to find the right approach to resolve it as early as possible, without creating chaos. Panicking instead of focusing on the solution during a high-impact incident hampers logical thinking and rational action, and results in a stream of uncoordinated activity. Managers who "demand" an immediate resolution and users who call back-to-back and shout can easily make the situation worse. Under that pressure, teammates with good intentions may try to "fix" the issue by making uncoordinated changes to other network components in the hope of resolving the problem quickly. The steps to follow when approaching a router incident are as follows:
1) Find a trusted source of reference
Checking network connectivity to and from the internet requires a trusted source of reference. The "ping" utility uses the Internet Control Message Protocol (ICMP) to send echo request messages and receive echo reply messages. ICMP has no port numbers; port 7 belongs to the separate TCP/UDP Echo service, not to ping. The "traceroute" utility sends, by default, a sequence of User Datagram Protocol (UDP) packets to destination ports 33434 to 33534.
On Linux, traceroute also includes an option to send ICMP echo request packets (-I), or to use an arbitrary protocol (-P), such as UDP, TCP using TCP SYN packets, or ICMP.
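As a rough illustration, the checks described here can be wrapped in a few Python helpers. This sketch assumes a Linux host with the standard `ping` and `traceroute` tools installed; the helper names (`ping_cmd`, `traceroute_cmd`, `run`) are hypothetical, and the `-I` and `-T` probe types may require elevated privileges on some systems.

```python
import shutil
import subprocess

# Linux traceroute probe flags: UDP is the default, -I sends ICMP echo
# requests, -T sends TCP SYN packets.
PROBE_FLAGS = {"udp": [], "icmp": ["-I"], "tcp": ["-T"]}


def ping_cmd(host, count=4):
    """Build a ping command that sends `count` ICMP echo requests."""
    return ["ping", "-c", str(count), host]


def traceroute_cmd(host, probe="udp"):
    """Build a traceroute command for the chosen probe type."""
    if probe not in PROBE_FLAGS:
        raise ValueError(f"unknown probe type: {probe}")
    return ["traceroute", *PROBE_FLAGS[probe], host]


def run(cmd, timeout=60):
    """Run a command; return (exit_code, output), with exit_code None
    when the tool is missing or the command times out."""
    if shutil.which(cmd[0]) is None:
        return None, f"{cmd[0]} is not installed"
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None, "timed out"
    return done.returncode, done.stdout + done.stderr
```

A call such as `run(ping_cmd("192.0.2.1"))` would then test reachability; the address here is only a documentation placeholder and should be replaced with the trusted reference host.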
2) Communication
The team working on the network incident must stay in sync and keep communicating throughout a high-impact incident. Team members should share information over an alternative channel of communication, such as WhatsApp, that keeps everyone connected for the duration of the incident, and should make clear who is doing what. The whole team may not be in the same room or location, so members should use cell phones, laptops, or whatever else is available to keep the chat going and to share different viewpoints, which is key when troubleshooting network issues.
3) Documentation
Documenting what has been done supports structured troubleshooting and makes it possible to roll back changes if required. It also helps when answering senior technicians who ask what has been done to rectify the issue. Chat logs can serve as part of the documentation, so someone must be assigned to record the chat room history.
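One possible way to keep such a record is a simple timestamped action log. The sketch below is an assumption about how this could be structured, not a prescribed tool; `ActionEntry` and `IncidentLog` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class ActionEntry:
    timestamp: datetime
    who: str
    action: str
    rollback: Optional[str] = None  # how to undo the action, if it changed anything


class IncidentLog:
    def __init__(self):
        self.entries: List[ActionEntry] = []

    def record(self, who, action, rollback=None):
        """Append a timestamped entry describing who did what."""
        entry = ActionEntry(datetime.now(timezone.utc), who, action, rollback)
        self.entries.append(entry)
        return entry

    def rollback_plan(self):
        """Return the rollback steps, newest change first, so that
        changes are undone in the reverse order they were made."""
        return [e.rollback for e in reversed(self.entries) if e.rollback]
```

Read-only actions, such as running a ping, are recorded without a rollback step and therefore do not appear in the rollback plan.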
4) Make use of visual reporting
Visual reporting helps when something has to be explained to a less technical colleague, such as the local manager, who will take care of communications with other units and upper management. It also helps avoid sending unnecessary reports to other layers of management.
5) Work on an action plan
Having a plan of action makes it easier to explain the situation to someone less technical, since describing the incident and the planned resolution in simple terms is challenging. It is better for the affected community to receive messages that outline a basic plan of action and show progress by indicating which steps have been completed than to hear the same line again and again: "we are working on it."
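A status message of this kind can be generated from the plan itself. The following is a minimal sketch; the function name `render_status` and the checklist layout are illustrative choices only.

```python
def render_status(title, steps):
    """Render a plan as a plain-text checklist.

    `steps` is a list of (description, completed) pairs in plan order.
    """
    done = sum(1 for _, completed in steps if completed)
    lines = [f"{title}: {done} of {len(steps)} steps completed"]
    for number, (description, completed) in enumerate(steps, start=1):
        mark = "x" if completed else " "
        lines.append(f"  [{mark}] {number}. {description}")
    return "\n".join(lines)
```

Sending the rendered text after each completed step gives the affected community a visible sense of progress without requiring any technical background.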
6) Apply preventative measures
Preventative measures should be in place before anything happens. Organizations need to know in advance how to act during a critical incident and who will play the role of communicator, sharing information with the rest of the organization on a regular basis. This helps the team focus on the work that needs to be done during the incident. All team members must be aware of their responsibilities and of how and when they need to communicate. Documentation and an automation tool such as Ansible are essential for enforced change control.
Because all software deployments to the router's firmware can be controlled by Ansible, it is considered an essential change control tool. When every change goes through Ansible, there is a record of what was deployed, when, where, and by whom; the same applies to all configurations and subsequent configuration changes. This makes it easy to answer questions such as "Has anything changed, and if so, what?" Ansible Tower is a central server that can host all these scripts (playbooks) for easy access and good security.
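The question "what was changed?" can be illustrated without Ansible at all. The sketch below is not Ansible and does not use its API; it only assumes that two text snapshots of a router configuration are available and compares them with Python's standard `difflib` module. The function name `config_changes` and the sample configuration lines are hypothetical.

```python
import difflib


def config_changes(before, after, name="router.cfg"):
    """Return only the added (+) and removed (-) lines between two
    configuration snapshots given as strings."""
    diff = list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile=f"{name} (before)",
            tofile=f"{name} (after)",
            lineterm="",
        )
    )
    # The first two lines of a unified diff are file headers; hunk
    # markers start with "@@" and context lines start with a space.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


before = "hostname r1\nip route 0.0.0.0/0 10.0.0.1"
after = "hostname r1\nip route 0.0.0.0/0 10.0.0.254"
for line in config_changes(before, after):
    print(line)
```

An empty result means the two snapshots are identical, which answers the first half of the question directly.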