
Istio Service Mesh — A Field Guide

A complete, field-tested reference by John Kihiu.

John Kihiu · 12 min read

Operating a service mesh is the work that turns a deploy into a system. The deployment is one moment; the system is the next 18 months of uptime, incidents, and improvements. This guide covers the field-tested patterns for that work: the pipeline, observability, incident response, and the post-mortem.

CI/CD

The CI/CD pipeline is the spine of the system. The right pipeline has six stages:

  1. Lint. Catch syntax errors and style violations in seconds.
  2. Test. Run unit tests, integration tests, and security scans.
  3. Build. Produce the artefact (container, package, binary).
  4. Deploy to staging. Use the same process you use for production.
  5. Smoke test staging. Exercise the top 5 user flows.
  6. Deploy to production. Canary, blue-green, or rolling.
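
GITHUB ACTIONS · 6-STAGE PIPELINE
name: Pipeline
on: [push, pull_request]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run lint
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test
  build:
    needs: [lint, test]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t app:"$GITHUB_SHA" .
  deploy-staging:
    needs: [build]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./deploy.sh staging
  smoke-test:
    needs: [deploy-staging]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./smoke-test.sh https://staging.example.com
  deploy-prod:
    needs: [smoke-test]
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./deploy.sh production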
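
The smoke-test stage is the one most often left as an empty script. A minimal sketch of what it could check is below, using only the Python standard library. The flow paths in FLOWS and the function names are hypothetical placeholders, not part of the pipeline above; swap in the five flows that matter for your application.

```python
from typing import Callable, Dict, List
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# Hypothetical top user flows; replace with the paths that matter for your app.
FLOWS: List[str] = ["/", "/login", "/search?q=test", "/cart", "/checkout"]


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for url, or 0 if the request fails outright."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is still useful.
        return err.code
    except (URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return 0


def smoke_test(
    base_url: str,
    flows: List[str] = FLOWS,
    fetch: Callable[[str], int] = http_status,
) -> Dict[str, int]:
    """Request every flow and return the failures as {path: status}."""
    failures: Dict[str, int] = {}
    for path in flows:
        status = fetch(base_url.rstrip("/") + path)
        if not 200 <= status < 400:
            failures[path] = status
    return failures
```

Calling smoke_test("https://staging.example.com") returns an empty dictionary when every flow responds; a wrapper script can exit non-zero on anything else so the pipeline stops before production. The fetch parameter exists so the check can be unit-tested without a network.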

For the broader CI/CD patterns, see the CI/CD guide.

Observability

The three pillars: metrics, logs, traces. Modern systems need all three.

Pillar  | What it answers                        | Tool
Metrics | How much? How fast? How many?          | Prometheus + Grafana, Datadog, New Relic
Logs    | What happened?                         | Loki, ELK, Datadog Logs
Traces  | Where did the request spend its time?  | Jaeger, Tempo, OpenTelemetry

Logs are the last resort

If you are reading logs to debug, your metrics and traces are not good enough. The right signal is a metric or a trace; logs are for the deep-dive when neither helps.
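
To make the division of labour concrete, here is a minimal in-memory sketch in which one request produces all three signals, tied together by a shared trace ID. The Telemetry class and handle_request function are hypothetical stand-ins written for this example; a real system would send the same data to the tools in the table rather than to Python lists.

```python
import json
import time
import uuid
from collections import defaultdict
from contextlib import contextmanager


class Telemetry:
    """In-memory stand-in for a metrics store, a log sink, and a trace store."""

    def __init__(self):
        self.counters = defaultdict(int)    # metrics: how many?
        self.durations = defaultdict(list)  # metrics: how fast?
        self.logs = []                      # logs: what happened?
        self.spans = []                     # traces: where did the time go?

    def log(self, trace_id, message, **fields):
        # Structured log line carrying the trace ID so it can be joined to a span.
        self.logs.append(json.dumps({"trace_id": trace_id, "msg": message, **fields}))

    @contextmanager
    def span(self, trace_id, name):
        start = time.perf_counter()
        status = "ok"
        try:
            yield
        except Exception:
            status = "error"
            raise
        finally:
            elapsed = time.perf_counter() - start
            self.spans.append(
                {"trace_id": trace_id, "name": name, "seconds": elapsed, "status": status}
            )
            self.counters[f"{name}.{status}"] += 1
            self.durations[name].append(elapsed)


def handle_request(telemetry, work):
    """Run work() inside a span; write a log line only when it fails."""
    trace_id = uuid.uuid4().hex
    try:
        with telemetry.span(trace_id, "handle_request"):
            return work()
    except Exception as exc:
        telemetry.log(trace_id, "request failed", error=repr(exc))
        raise
```

In this sketch the counters answer "how many failed", the span answers "how long did it take", and the log line is written only on failure, where it carries the detail the other two signals cannot.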

For the broader observability patterns, see the monitoring guide and the distributed tracing guide.

Incident response

When the system breaks, the runbook has six steps:

  1. Detect. The alert fires.
  2. Triage. The on-call engineer investigates and assigns a severity.
  3. Mitigate. Restore service (rollback, scale up, fail over).
  4. Communicate. Stakeholders, customers, status page.
  5. Resolve. Permanent fix.
  6. Post-mortem. Blameless. What went well, what didn't, what to change.
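
PYTHON · ALERT ROUTING
def route_alert(alert):
    """Route an alert by severity. The handler functions are assumed to be defined elsewhere."""
    if alert.severity == "critical":
        # Wake a human and open the incident channel.
        page_oncall(alert)
        notify_incident_channel(alert)
    elif alert.severity == "high":
        # Visible to the team, but nobody is paged.
        notify_team_channel(alert)
    elif alert.severity == "medium":
        # Handled during working hours.
        create_ticket(alert)
    else:
        # Low or unrecognised severity: record only.
        log(alert)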
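
The routing rules above call helpers that are not shown, which makes them hard to test. One possible restructuring, offered as a sketch rather than the definitive approach, keeps the same severity rules in a table and takes the handlers as an argument. The Alert dataclass, the ROUTES table, and the handlers dictionary are assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Alert:
    name: str
    severity: str


# Severity -> ordered action names, mirroring the if/elif rules above.
ROUTES: Dict[str, List[str]] = {
    "critical": ["page_oncall", "notify_incident_channel"],
    "high": ["notify_team_channel"],
    "medium": ["create_ticket"],
}
# Anything else, including unrecognised severities, is only logged.
DEFAULT_ROUTE: List[str] = ["log"]


def route_alert(alert: Alert, handlers: Dict[str, Callable[[Alert], None]]) -> List[str]:
    """Run the handlers for the alert's severity and return the action names used."""
    actions = ROUTES.get(alert.severity, DEFAULT_ROUTE)
    for action in actions:
        handlers[action](alert)
    return list(actions)
```

In a test, handlers can be plain functions that append to a list, so the routing rules are verified without paging anyone. In production the same dictionary would map to the real paging, chat, and ticketing clients.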

For the broader incident patterns, see the incident response runbook.

The blameless post-mortem

The most important habit. A post-mortem template covers what went well, what didn't, and what to change.

Blameless is not "no accountability"

Blameless means the system is accountable, not the person. If the same engineer made the same mistake twice, the action item is to fix the system so the next engineer cannot make the same mistake. The person is not named.
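
A minimal sketch of a post-mortem record that follows this rule is below. The section names come from the post-mortem step in the runbook (what went well, what didn't, what to change); the PostMortem and ActionItem classes, the impact and timeline fields, and the plain-text layout are assumptions made for illustration. Each action item names a change to the system and an owning team, never an individual.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionItem:
    change: str       # the change to the system
    owning_team: str  # a team, not a person


@dataclass
class PostMortem:
    title: str
    impact: str
    timeline: List[str] = field(default_factory=list)
    went_well: List[str] = field(default_factory=list)
    went_badly: List[str] = field(default_factory=list)
    action_items: List[ActionItem] = field(default_factory=list)

    def render(self) -> str:
        """Render the record as plain text, one section per heading."""

        def section(heading: str, lines: List[str]) -> List[str]:
            body = [f"  - {line}" for line in lines] or ["  - (none recorded)"]
            return [heading, *body]

        out = [f"Post-mortem: {self.title}", f"Impact: {self.impact}"]
        out += section("Timeline", self.timeline)
        out += section("What went well", self.went_well)
        out += section("What didn't", self.went_badly)
        out += section(
            "What to change",
            [f"{a.change} (owner: {a.owning_team})" for a in self.action_items],
        )
        return "\n".join(out)
```

Empty sections render as "(none recorded)" rather than disappearing, so a missing timeline or an empty list of changes is visible in review.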

Wrapping up

The pipeline, observability, incident response, and the post-mortem: get all four right and the system becomes far more reliable. The discipline is the same as for any production system: fail safely, learn quickly, and improve the system after every incident.

John Kihiu
Acumatica ERP Developer · Laravel Engineer

Independent software engineer in Nairobi specialising in Acumatica customisations, Laravel backends, and tax fiscalisation integrations across East and Southern Africa.