Stop certificate expiry outages with governed ITSM renewals

Stop “expired cert” outages with governed renewals from ITSM

Certificate expirations are one of those problems everyone knows about… until a weekend outage proves the reminders weren’t enough.

The recurring pattern:

  • Monitoring detects TLS failures / handshake errors
  • Service desk gets flooded with incidents
  • Someone renews the cert manually (or worse: renews the wrong one)
  • No consistent approvals, no deterministic runbook, and no audit trail of who changed what, where

This is a good example of why AI suggestions alone are risky: an LLM can recommend “renew the cert,” but letting it directly change production endpoints without policy checks + approvals is how you get accidental outages.

Autom Mate fits as the execution + control layer between ITSM/AI and the systems that actually change things (load balancers, gateways, secret stores, etc.). Autom Mate is built to orchestrate incident/change workflows across ITSM and collaboration tools, including approvals and rollback patterns.

End-to-end workflow (one blueprint)

1) Trigger

  • Trigger: A ServiceNow incident is created/updated with category Certificate and CI/service metadata (e.g., “API Gateway – Prod”).
  • How it starts:
    • ServiceNow record event → Autom Mate flow starts.
  • Integration: ServiceNow (Autom Mate library)

2) Validation (policy checks)

Autom Mate enriches and validates before any action:

  • Pull CI/service owner + environment (prod/non-prod) from the ticket/CMDB fields.
  • Validate policy:
    • Is this a standard renewal (same SANs, same key type, same endpoint) or a material change?
    • Is the cert within renewal window (e.g., < 30 days) vs already expired?
    • Is the requester/assignee allowed to initiate renewal for this service?
  • If required fields are missing (endpoint, FQDN, owner), Autom Mate posts a comment and pauses.

Why this matters: ITSM workflows often stall when context is incomplete; automation needs deterministic gates, not “best effort.” (blog.invgate.com)
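
As a rough illustration of what a deterministic gate means for the checks above, here is a minimal Python sketch. It is not Autom Mate's API: `RenewalRequest`, `GateResult`, and `validate` are hypothetical names, and the 30-day window simply mirrors the example threshold mentioned earlier.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

RENEWAL_WINDOW_DAYS = 30  # assumption: mirrors the "< 30 days" example
REQUIRED_FIELDS = ("endpoint", "fqdn", "owner")


@dataclass
class RenewalRequest:
    """Hypothetical shape of the data pulled from the ticket/CMDB."""
    endpoint: Optional[str]
    fqdn: Optional[str]
    owner: Optional[str]
    environment: str              # "prod" or "non-prod"
    expires_on: date
    requester: str
    allowed_requesters: frozenset
    sans_changed: bool = False
    key_type_changed: bool = False
    endpoint_changed: bool = False


@dataclass
class GateResult:
    proceed: bool
    renewal_type: str = "unknown"  # "standard" or "material"
    cert_state: str = "unknown"    # "expired", "in_window" or "not_due"
    reasons: list = field(default_factory=list)


def validate(req: RenewalRequest, today: date) -> GateResult:
    # Gate 1: pause when required context is missing.
    missing = [name for name in REQUIRED_FIELDS if not getattr(req, name)]
    if missing:
        return GateResult(False, reasons=[f"missing field: {m}" for m in missing])

    # Gate 2: the requester must be allowed to initiate renewal for this service.
    if req.requester not in req.allowed_requesters:
        return GateResult(False, reasons=["requester not allowed for this service"])

    # Classify the renewal and the certificate state.
    material = req.sans_changed or req.key_type_changed or req.endpoint_changed
    renewal_type = "material" if material else "standard"
    days_left = (req.expires_on - today).days
    if days_left < 0:
        cert_state = "expired"
    elif days_left < RENEWAL_WINDOW_DAYS:
        cert_state = "in_window"
    else:
        cert_state = "not_due"

    # Gate 3: do nothing when the certificate is not yet in the renewal window.
    if cert_state == "not_due":
        return GateResult(False, renewal_type, cert_state,
                          ["certificate is outside the renewal window"])
    return GateResult(True, renewal_type, cert_state)
```

Every path returns an explicit result with reasons, so a paused ticket can carry a comment that says exactly which field or policy blocked it.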

3) Approval (human or rule-based)

  • If non-prod and renewal is “standard”: auto-approve.
  • If prod or renewal is “material change”: require explicit approval.
  • Approvals are requested in Microsoft Teams with a structured summary:
    • impacted service, expiry date, proposed action, rollback plan
  • Integration: Microsoft Teams (Autom Mate library)
    • ServiceNow approvals (Autom Mate library)
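
The routing rule above is small enough to express as a pure function. The sketch below is an assumption about how such a rule could look in plain Python, not Autom Mate configuration; `approval_route` and `approval_summary` are hypothetical names. It fails closed: anything that is not clearly non-prod and standard requires explicit approval.

```python
def approval_route(environment: str, renewal_type: str) -> str:
    """Return "auto" only for non-prod standard renewals; otherwise "explicit"."""
    if environment == "non-prod" and renewal_type == "standard":
        return "auto"
    # Prod, material changes, and any unrecognized value all need a human.
    return "explicit"


def approval_summary(service: str, expiry_date: str,
                     proposed_action: str, rollback_plan: str) -> str:
    """Build the structured summary an approver would see in chat."""
    return "\n".join([
        f"Impacted service: {service}",
        f"Expiry date: {expiry_date}",
        f"Proposed action: {proposed_action}",
        f"Rollback plan: {rollback_plan}",
    ])
```

For example, `approval_route("prod", "standard")` returns `"explicit"`, and so does an unexpected environment label such as `"staging"`.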

4) Deterministic execution across systems

Once approved, Autom Mate executes a fixed runbook (no free-form AI actions):

  • Create a Change record (or link to an existing one) and attach the execution plan.
  • Execute renewal steps via controlled actions:
    • Call internal PKI / certificate service API to request/renew (REST/HTTP/Webhook action)
    • Deploy cert to target (e.g., load balancer / gateway / web server) (REST/HTTP/Webhook action)
    • Restart/reload service if required (REST/HTTP/Webhook action)
  • Post progress back to the incident/change.
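
To make "fixed runbook" concrete, here is a minimal sketch of an ordered step runner in Python. The step names and the `run_runbook` helper are hypothetical; in a real flow each action would wrap one of the REST/HTTP/Webhook calls listed above, which are stubbed out here as plain callables.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_runbook(steps: List[Step],
                post_progress: Callable[[str], None]) -> Tuple[bool, List[str]]:
    """Run steps strictly in order and stop at the first failure.

    Returns (success, names_of_completed_steps) so a caller knows
    exactly how far execution got before deciding on rollback.
    """
    completed: List[str] = []
    for name, action in steps:
        post_progress(f"starting: {name}")
        try:
            ok = action()
        except Exception as exc:  # treat any error as a failed step
            post_progress(f"failed: {name} ({exc})")
            return False, completed
        if not ok:
            post_progress(f"failed: {name}")
            return False, completed
        completed.append(name)
        post_progress(f"done: {name}")
    return True, completed


if __name__ == "__main__":
    # Stub actions standing in for PKI, deploy, and reload calls.
    steps = [
        ("request renewal from PKI", lambda: True),
        ("deploy cert to target", lambda: True),
        ("reload service", lambda: True),
    ]
    print(run_runbook(steps, post_progress=print))
```

The step list is data, not something generated at run time, which is the point: the same ticket type always produces the same sequence of actions.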

Autom Mate is designed to orchestrate change workflows and coordinate execution across tools, including rollback.

5) Logging / audit

Autom Mate writes an audit-friendly trail:

  • Who approved in Teams
  • What ticket/change initiated the action
  • Which endpoints were updated
  • API responses + timestamps
  • Final verification results

This aligns with the need for visibility and auditability when automations touch sensitive systems.
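
One way to picture such a trail is a single structured record per run. The sketch below assumes a JSON-lines style log; the field names and the `audit_entry` helper are illustrative, not Autom Mate's actual log schema.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def audit_entry(approver: str, ticket_id: str, endpoints: list,
                api_responses: list, verification_passed: bool,
                now: Optional[datetime] = None) -> str:
    """Serialize one run as a single JSON line with a UTC timestamp."""
    logged_at = (now or datetime.now(timezone.utc)).isoformat()
    record = {
        "logged_at": logged_at,
        "approved_by": approver,
        "initiated_by": ticket_id,
        "endpoints_updated": sorted(endpoints),
        "api_responses": api_responses,   # e.g. [{"step": ..., "status": ..., "at": ...}]
        "verification_passed": verification_passed,
    }
    # sort_keys keeps the output stable, which makes diffs and searches easier.
    return json.dumps(record, sort_keys=True)
```

Writing one self-contained line per run means an auditor can answer "who approved, what changed, and did it verify" without joining several systems.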

6) Exception handling / rollback

If verification fails (e.g., handshake still failing, health checks red):

  • Autom Mate triggers rollback:
    • Re-deploy last-known-good cert bundle (from your internal store) (REST/HTTP/Webhook action)
    • Revert config and reload service (REST/HTTP/Webhook action)
  • Escalate to on-call in Teams and keep the incident in “Work in Progress / Major Incident” state.

Rollback discipline is repeatedly called out as a key gap in incident/change automation; it must be explicit and rehearsed. (siit.io)
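
A minimal Python sketch of the verify-then-rollback decision follows. `tls_days_remaining` uses only the standard `ssl` and `socket` modules to complete a handshake and read the certificate expiry; `verify_or_rollback` and its callback parameters are hypothetical stand-ins for the rollback and escalation actions described above.

```python
import socket
import ssl
import time
from typing import Callable


def tls_days_remaining(host: str, port: int = 443, timeout: float = 5.0) -> float:
    """Complete a verified TLS handshake and return days until the cert expires.

    Raises an ssl.SSLError or OSError if the handshake fails, for example
    when the served certificate is expired or the chain is not trusted.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires_at = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires_at - time.time()) / 86400


def verify_or_rollback(verify: Callable[[], bool],
                       rollback: Callable[[], None],
                       escalate: Callable[[str], None]) -> str:
    """Run verification; on any failure roll back first, then escalate."""
    try:
        ok = verify()
        detail = "" if ok else "verification returned False"
    except Exception as exc:
        ok, detail = False, f"{type(exc).__name__}: {exc}"
    if ok:
        return "verified"
    rollback()
    escalate(detail)  # pass the exact failure output to on-call
    return "rolled_back"
```

A caller might pass `lambda: tls_days_remaining("api.example.internal") > 30` as `verify`, where the hostname and threshold are placeholders. Rolling back before escalating keeps the service on the last-known-good bundle while a human looks at the failure detail.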


Two mini examples

Mini example 1: “Cert expires in 14 days” (standard renewal)

  • Trigger: ServiceNow incident created from monitoring.
  • Autom Mate validates it’s non-prod + standard renewal.
  • Auto-approves, renews via internal PKI API, deploys, verifies, closes ticket.

Mini example 2: “Cert already expired” (production outage)

  • Trigger: Multiple incidents → major incident declared.
  • Autom Mate requires approval in Teams (prod + outage).
  • Executes renewal + deploy, then runs verification.
  • If verification fails, Autom Mate rolls back and escalates with exact failure output.

Why not let AI do this directly?

  • AI can misread context (wrong endpoint, wrong environment, wrong cert chain).
  • Certificate changes are high-blast-radius.
  • You need policy gates + approvals + deterministic execution.

Autom Mate’s role is to keep AI (or humans) from “winging it,” by enforcing a governed workflow every time.


Discussion questions

  1. For cert renewals, what do you treat as “standard change” vs “normal change” in your org?
  2. Where do you want the approval to happen: inside ITSM, or in Teams with ITSM synced for audit?