Incident Management: Definition, ITIL 4 Process & Best Practices
Incident Management ITSM: Reduce MTTR, Enforce SLAs and Restore Services
Incident management is the process that determines how fast and how consistently companies restore normal service when things go wrong. Done well, it is invisible to the business. Done poorly, it defines how your IT department is perceived.
SMC Consulting has been designing and implementing incident management processes for over 25 years, across companies of all sizes in Belgium, France, Luxembourg and Switzerland. Here is what a mature process looks like and where most companies fall short.
What Is Incident Management?
- A scheduled maintenance window is not an incident. A problem discovered during that window is.
- Total outage is not required. Slowness, intermittent failures and partial unavailability all qualify.
- The unit of impact is the service experienced by the user, not the technical component that failed.
What incident management is not
Incident management is not about finding root causes — that is problem management. Its sole objective is service restoration. Companies that spend resolution time diagnosing why instead of fixing what consistently miss their SLAs.
It is also distinct from service request management. A request for new software access or a hardware replacement is not an incident. Mixing the two in the same queue degrades the handling quality of both.
The 7 Stages of a Structured Incident Management Process
Detection and Identification
Logging and Registration
Classification and Prioritisation
| Priority | Definition | Target Resolution |
|---|---|---|
| P1 — Critical | Full outage, major business impact | 1–4 hours |
| P2 — High | Significant degradation, large user group affected | 4–8 hours |
| P3 — Medium | Partial disruption, workaround available | 1–3 business days |
| P4 — Low | Minor issue, single user, minimal impact | 3–5 business days |
Initial Diagnosis and Escalation
Investigation and Diagnosis
The assigned team investigates, consulting the CMDB to map service dependencies and reviewing recent change records that may have introduced the issue. Structured access to accurate configuration data makes a measurable difference in how long this stage takes.
Resolution and Recovery
The fix is applied and the user confirms that normal service has resumed. If resolution requires a configuration change, it must go through change management — even in emergency mode — to avoid introducing new incidents.
Closure and Documentation
The KPIs That Define Incident Management Performance
| KPI | What It Measures | Target |
|---|---|---|
| MTTR | Average time from detection to resolution | <4h (P1), <8h (P2) |
| MTTD | Average time from occurrence to detection | As low as possible |
| FCR Rate | % of incidents resolved by L1 without escalation | 70–80% |
| SLA Compliance | % of incidents resolved within contracted SLA | >90% |
| Recurrence Rate | % of closed incidents reopening within 30 days | <10% |
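As a rough illustration, most of these KPIs can be derived from a simple export of closed incidents. The sketch below is in Python and rests on assumptions: the `Incident` record and its field names are hypothetical, not the schema of any particular ITSM tool, and MTTD is left out because it needs an occurrence timestamp that many exports do not carry.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    """Hypothetical closed-incident record; field names are illustrative."""
    detected_at: datetime
    resolved_at: datetime
    sla_target: timedelta                    # contracted resolution time
    resolved_by_l1: bool                     # True if closed without escalation
    reopened_at: Optional[datetime] = None   # None if never reopened


def compute_kpis(incidents: list[Incident]) -> dict[str, float]:
    """Compute MTTR, SLA compliance, FCR rate and 30-day recurrence rate."""
    if not incidents:
        raise ValueError("at least one incident is required")
    n = len(incidents)
    durations = [i.resolved_at - i.detected_at for i in incidents]

    # MTTR: average time from detection to resolution, in hours.
    mttr_hours = sum(d.total_seconds() for d in durations) / n / 3600

    # SLA compliance: incidents resolved within their contracted target.
    within_sla = sum(d <= i.sla_target for d, i in zip(durations, incidents))

    # FCR: incidents resolved by Level 1 without escalation.
    first_contact = sum(i.resolved_by_l1 for i in incidents)

    # Recurrence: incidents reopened within 30 days of resolution.
    reopened = sum(
        i.reopened_at is not None
        and i.reopened_at - i.resolved_at <= timedelta(days=30)
        for i in incidents
    )

    return {
        "mttr_hours": mttr_hours,
        "sla_compliance_pct": 100 * within_sla / n,
        "fcr_rate_pct": 100 * first_contact / n,
        "recurrence_rate_pct": 100 * reopened / n,
    }
```

In practice the MTTR and SLA figures would be computed per priority level rather than across all incidents, since the targets above differ between P1 and P2.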
How SMC Consulting Structures Your Incident Management
When we engage on incident management, we start with the process gaps: undefined responsibilities, missing escalation paths, classification inconsistencies, metrics nobody is tracking. The tool configuration comes second. Here is what our intervention covers:
Process design
We design your classification taxonomy, priority matrix, SLA framework, escalation matrix and closure procedures — tailored to your actual service catalogue, team structure and business constraints. Not a copy of an ITIL template. A process your teams will follow because it reflects how your company actually works.
Platform configuration
Knowledge base and L1 capability
A structured incident process without a usable knowledge base is incomplete. We document resolution procedures for your most frequent incidents and build the L1 capability to resolve them without escalation. This is the lever that moves FCR from 40% to 70%+.
Reporting and continuous improvement
We build the reporting structure that gives IT leadership real visibility: SLA compliance by priority, MTTR trends, backlog evolution, FCR rate and recurring patterns. And we establish the review cadence — weekly operational, monthly performance — that turns data into decisions rather than slides nobody reads.
We have delivered this across companies from SMEs running a 5-person service desk to enterprises managing 10,000+ users across multiple countries.
Incident Management and AI: The Next Layer of Performance
The most forward-looking IT companies are now augmenting their incident management processes with AI, not to replace human judgment, but to eliminate the friction that slows detection, classification and resolution.
In practice, AI applied to incident management enables:
- Automated incident detection from monitoring streams before users report impact
- Intelligent classification and routing based on ticket content — eliminating manual categorisation errors
- Suggested resolutions surfaced from the knowledge base at the moment a ticket is created
- Real-time user communication via voice, chat, email or WhatsApp — handled by an AI agent, not a queue
- Anomaly detection that identifies unusual incident patterns before they escalate
FAQ about Incident Management
What is the difference between an incident and a problem in ITSM?
An incident is an unplanned service interruption; the goal is restoration, as fast as possible. A problem is the underlying root cause of one or more incidents; the goal is permanent elimination. The two processes are separate but linked: a recurring incident must trigger a Problem record. The most common SLA failure pattern we see is teams spending incident resolution time on root cause diagnosis. Those are two different jobs.
What is a Major Incident and how should it be managed?
A Major Incident is a P1 or high-impact P2 requiring a dedicated, accelerated response: a named incident commander, a structured communication rhythm to affected stakeholders, and a mandatory Post-Incident Review after resolution. Companies without a documented Major Incident procedure consistently struggle to coordinate under pressure. The procedure needs to exist before the incident, not during it.
How do you define and assign incident priority?
Priority is calculated by combining urgency (how quickly it must be resolved) and impact (how many users or business processes are affected). The result maps to a P1–P4 level, each with a defined SLA. This matrix must be formally documented and communicated to all service desk staff. Without it, priority assignment is subjective, SLA performance is unmanageable, and disputes with business stakeholders are unavoidable.
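A documented matrix can be as simple as a lookup table. The Python sketch below is illustrative only: the three-step `Level` scale and the cell-by-cell assignments are assumptions made for the example, and each organisation defines its own scale and mapping.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale used for both impact and urgency."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Keys are (impact, urgency); values are the resulting priority.
# The assignments are assumptions for illustration, not a standard.
PRIORITY_MATRIX = {
    (Level.HIGH, Level.HIGH): "P1",
    (Level.HIGH, Level.MEDIUM): "P2",
    (Level.MEDIUM, Level.HIGH): "P2",
    (Level.HIGH, Level.LOW): "P3",
    (Level.MEDIUM, Level.MEDIUM): "P3",
    (Level.LOW, Level.HIGH): "P3",
    (Level.MEDIUM, Level.LOW): "P4",
    (Level.LOW, Level.MEDIUM): "P4",
    (Level.LOW, Level.LOW): "P4",
}


def assign_priority(impact: Level, urgency: Level) -> str:
    """Return the priority for an impact/urgency pair from the matrix."""
    return PRIORITY_MATRIX[(impact, urgency)]
```

Because every combination is listed explicitly, two agents assessing the same impact and urgency always arrive at the same priority, which is the point of documenting the matrix.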
What is FCR and why does it matter more than most companies realise?
First Contact Resolution is the percentage of incidents resolved by Level 1 without escalation. It is one of the most direct indicators of service desk maturity and one of the most cost-sensitive. Every unnecessary escalation to Level 2 costs 3 to 5 times more than a Level 1 resolution. Improving FCR from 40% to 70% has a measurable impact on cost per ticket, MTTR and user satisfaction. The lever is almost always the knowledge base and clear resolution authority given to L1 agents.
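To make the cost effect concrete, here is a back-of-the-envelope calculation in Python. It is a simplified model under stated assumptions: every ticket not resolved at Level 1 is escalated once, an escalated ticket costs a fixed multiple (3x or 5x) of a Level 1 resolution, and costs are expressed in units of one Level 1 resolution. The function name and parameters are hypothetical.

```python
def average_cost_per_ticket(
    fcr_rate: float, escalation_multiplier: float, l1_cost: float = 1.0
) -> float:
    """Blended cost per ticket for a given FCR rate (0.0 to 1.0).

    Assumes non-FCR tickets are escalated and cost
    `escalation_multiplier` times a Level 1 resolution.
    """
    if not 0.0 <= fcr_rate <= 1.0:
        raise ValueError("fcr_rate must be between 0 and 1")
    return l1_cost * (fcr_rate + (1.0 - fcr_rate) * escalation_multiplier)


# Compare FCR of 40% and 70% at both ends of the 3x-5x range.
for multiplier in (3.0, 5.0):
    before = average_cost_per_ticket(0.40, multiplier)
    after = average_cost_per_ticket(0.70, multiplier)
    saving = 1 - after / before
    print(f"x{multiplier:.0f}: {before:.1f} -> {after:.1f} ({saving:.0%} lower)")
```

Under these assumptions, the blended cost per ticket falls from 2.2 to 1.6 units at a 3x multiplier (about 27% lower) and from 3.4 to 2.2 units at 5x (about 35% lower).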
How long does it take to implement a mature incident management process?
A functional baseline (classification, prioritisation, SLA framework, basic reporting) can be operational in 4 to 6 weeks. A fully mature process, with CMDB integration, automated escalation, a structured knowledge base and AI-assisted triage, typically requires 3 to 4 months. The timeline is driven less by technical configuration and more by stakeholder alignment and process documentation.
Should the process be different for small vs. large IT teams?
Absolutely. A 5-person IT team does not need a 4-tier escalation matrix. A 100-person service desk cannot function without one. One of the most common mistakes we see is companies implementing enterprise-grade processes on teams that lack the capacity to sustain them. We design processes that are appropriately complex: rigorous where it matters, lean where it doesn’t.
What role does the CMDB play in incident management?
The CMDB provides the map of IT assets and service dependencies that engineers need during investigation. When a service fails, knowing which configuration items are involved and how they relate to each other significantly reduces diagnosis time. A CMDB integrated with your incident process allows engineers to see related incidents, recent changes and dependency maps directly from the ticket. Without it, investigation is slow, duplicative and heavily reliant on institutional memory.
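The dependency lookup described here can be pictured as a graph walk. The Python sketch below is an assumption-laden illustration: the `DEPENDENTS` map and every configuration item name in it are invented, and a real CMDB would expose this data through its own query interface.

```python
from collections import deque

# Hypothetical dependency map: each configuration item (CI) lists the CIs
# that depend on it directly. All names are made up for illustration.
DEPENDENTS = {
    "db-cluster-01": ["orders-api", "billing-api"],
    "orders-api": ["web-shop"],
    "billing-api": ["web-shop", "finance-portal"],
    "web-shop": [],
    "finance-portal": [],
}


def impacted_items(failed_ci: str, dependents: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every CI downstream of `failed_ci`."""
    seen = {failed_ci}
    queue = deque([failed_ci])
    impacted = []
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:  # skip CIs already reached via another path
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted
```

With this sample map, a failure of `db-cluster-01` reaches both APIs and, through them, the web shop and the finance portal, which is the kind of view an engineer would want on the ticket during investigation.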
How do we reduce incident volume over time?
Incident volume reduction is the output of a well-functioning problem management process, not a direct incident management objective. The mechanism is systematic root cause analysis on recurring incidents, followed by permanent fixes that reduce recurrence. Additionally, proactive monitoring that detects and resolves degradation before it becomes user-impacting reduces reported volume at the source. If your incident volume is growing quarter-on-quarter, it is almost always a signal that Problem Management is absent or ineffective.
Resolve incidents faster with HaloITSM, with less manual triage
HaloITSM gives your service desk the structure to manage incidents consistently and the automation to reduce repetitive work. SMC Consulting configures incident management so routing, SLAs, knowledge, and integrations work together as one operating system.