ChainLaunch

ADR 0001 — `chainlaunch doctor` and `chainlaunch support-bundle`

Field Value
Status Proposed
Date 2026-05-13
Owner Platform team
Reviewers Security, Support, Docs
Supersedes
Superseded by
Issue (to file)

TL;DR

Add two top-level CLI commands to chainlaunch-pro:

  • chainlaunch doctor — runs a fast, ordered battery of checks against the local instance and its data plane, prints a human-readable report, and exits non-zero when something is wrong. Designed to be the first thing a user runs before opening a support ticket.
  • chainlaunch support-bundle — collects a redacted tarball of everything support needs to triage an issue: version metadata, host info, live doctor output, sanitized config, last-N MB of server logs, recent container logs per node, audit log tail, Prometheus metrics snapshot, and a redaction manifest. Designed to be the only thing a user attaches to a support ticket.

Both build on the existing pkg/compliance package, the pkg/troubleshooting connectivity service, and pkg/monitoring/diskspace — this is not a greenfield subsystem.

Motivation

"Run chainlaunch doctor and attach the bundle." is what every grown-up infrastructure platform tells customers in the first reply to a support ticket. New Relic, GitLab, Datadog Agent, Vault, Consul, Tailscale — all ship this. ChainLaunch does not.

Today the customer-support loop for ChainLaunch is:

  1. Customer notices something broken (a node won't start, backups are failing, the UI is slow).
  2. Customer files an issue with a vague description and a screenshot.
  3. Support asks for: ChainLaunch version, host OS, output of docker ps, df -h, the last 500 log lines, the contents of ~/.chainlaunch/config.yaml, recent audit events, whether the backup target is healthy, and whether the API can reach each peer.
  4. Customer copy-pastes a subset, often missing the smoking gun. Tickets take 3–5 round-trips to even reproduce the issue.
  5. Each round-trip risks the customer sending unredacted credentials, certs, or PII into the ticket system.

The two commands collapse step 3 into "paste the output of chainlaunch doctor" and step 4 into "attach the bundle." Round-trips drop, sensitive material is redacted at the source, and the customer can self-diagnose the easy 60% of cases before they ever file a ticket.

Equally important — and the reason this belongs in Pro rather than OSS — is the signal a doctor command sends to procurement. Buyers screenshot it. It is one of the cheapest credibility upgrades the platform can ship.

Goals

  1. Single command, deterministic exit code. chainlaunch doctor exits 0 if healthy, 1 if there are warnings, 2 if any check fails, and 3 if doctor itself errors. CI-pipeline friendly, watchdog-friendly, status-page friendly.
  2. No new long-running infrastructure. Reuses pkg/compliance, pkg/troubleshooting, pkg/monitoring, and existing API endpoints. No new daemon, no new tables in v1.
  3. Air-gapped friendly. Both commands work without outbound internet. The bundle is a local tarball; the customer chooses how to ship it.
  4. Redaction by default, not by configuration. The bundle ships with redaction on; an explicit --no-redact flag is the only way to disable it, and it logs a prominent warning.
  5. Sub-30-second doctor runtime. Time budget: 30s total, 5s per check, with explicit timeouts. Past the budget, the check is reported as "TIMEOUT" — not silently hung.
  6. Idempotent and side-effect free. doctor and support-bundle make zero writes to the data plane. The one exception is a compliance scan row, which the existing pkg/compliance/service.go already persists.
  7. Pro-only. Aligns with the rest of the operational tooling. OSS users still get chainlaunch compliance scan, which is a strict subset.

Non-goals

  • Not a replacement for Prometheus / Grafana / OpenTelemetry. Doctor is point-in-time; observability is continuous.
  • Not a backup-restore tester. That is the separate "scheduled restore-test job" feature in Future work.
  • Not a remediation engine. Doctor reports; it never restarts a node, rotates a cert, or changes config. Remediation hints are human-readable strings.
  • Not a network-fixer. If the peer is unreachable, doctor says so; it does not try to repair the network.
  • Not a UI feature in v1. CLI first. A "Run doctor" button in the web UI can come in v1.1; the API endpoint is designed to allow it.

Background — what already exists

A surprising amount of the substrate is already in place:

Component | Location | Reused for
Network compliance checks (Fabric + Besu): orderer quorum, validator count, TLS, cert expiry, node health, endpoint conflicts, backup schedule | pkg/compliance/checks.go | The "blockchain" check category in doctor
Persisted scan history | pkg/compliance/service.go + compliance_scans table | Doctor history endpoint
Connection testing (TCP/HTTP/gRPC), certificate validation, port checks, ping | pkg/troubleshooting/service.go | The "network" check category
Disk-space monitor with thresholds | pkg/monitoring/diskspace.go | The "host" check category
Audit log + audit pruner | pkg/audit/audit_pruner.go | Bundle: audit tail extraction
Backup providers with VerifyBackup interface | pkg/backups/provider/provider.go | The "backup" check category
Notifications + webhooks | pkg/notifications/, pkg/webhooks/ | The "notifications wired up" check
Disk metrics, Prometheus metrics handler | pkg/metrics/prometheus.go | Bundle: metrics snapshot
OIDC token refresher with admin audit | pkg/sso/refresher.go | The "SSO healthy" check
Server-side compliance handler at /compliance/summary, /networks/{id}/compliance | pkg/compliance/handler.go | The doctor command pulls from here when run remotely

What's missing is (a) the host-level checks (chainlaunch process, file permissions, Docker daemon, time skew, config sanity), (b) a single ordered runner that aggregates everything, (c) the bundle format and the redactor, and (d) the CLI surface that ties it together.

Command surface

chainlaunch doctor

chainlaunch doctor [flags]
Flag | Description | Default
-o, --output | text (TTY), json, junit | text if stdout is a TTY, else json
--category | Filter to one or more categories (host, process, db, network, blockchain, security, observability, backup). Repeatable or comma-separated. | all
--include-network | Network ID(s) to scope blockchain checks. Repeatable. | all networks
--timeout | Per-check timeout. | 5s
--budget | Overall wall-clock budget. | 30s
--severity | Minimum severity to print: pass, warn, fail. Display filter only; a failure always produces a non-zero exit code. | warn
--remote | If set, runs against CHAINLAUNCH_API_URL instead of the local process. Host-level checks are skipped (the API can't see the host). | unset (local mode)
--no-color | Disable ANSI colors. Implied when !isatty(stdout). | unset

Exit codes

Code Meaning
0 All checks passed (no fail, no warn).
1 Warnings present, no failures.
2 One or more failures.
3 Doctor itself errored (e.g. couldn't read config, can't connect to API in --remote mode).
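
The exit-code table can be read as a small pure function. The following is a minimal sketch, not the planned implementation: the Summary type and exitCode name are hypothetical, and treating a TIMEOUT like a failure is an assumption, since the table does not say how timeouts count.

```go
package main

import "fmt"

// Summary mirrors the per-severity counts in the report's "summary" object.
// Hypothetical type for illustration only.
type Summary struct {
	Pass, Warn, Fail, Timeout int
}

// exitCode maps a finished run to the exit codes in the table above.
// doctorErrored is true when doctor itself could not run (code 3).
// Assumption: a TIMEOUT counts like a failure; the ADR does not specify this.
func exitCode(s Summary, doctorErrored bool) int {
	switch {
	case doctorErrored:
		return 3
	case s.Fail > 0 || s.Timeout > 0:
		return 2
	case s.Warn > 0:
		return 1
	default:
		return 0
	}
}

func main() {
	// The sample report later in this document: 28 pass, 2 warn, 1 fail.
	fmt.Println(exitCode(Summary{Pass: 28, Warn: 2, Fail: 1}, false))
}
```

Keeping the mapping in one function means the text, JSON, and JUnit renderers cannot disagree about the process exit status.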

chainlaunch support-bundle

chainlaunch support-bundle [flags]
Flag | Description | Default
-o, --output | Output path. - writes to stdout (useful for piping). | ./chainlaunch-bundle-<timestamp>.tar.gz
--include | Repeatable: logs, metrics, audit, config, compliance, host, nodes. | all
--exclude | Repeatable. Wins over --include. | none
--log-tail-mb | How many MB of recent logs to include per source. | 20
--audit-tail | How many audit events to include. | 5000
--node | Limit per-node container logs to these slugs/IDs. Repeatable. | all nodes
--no-redact | Dangerous. Skip redaction entirely. Prints a prominent warning banner. | unset (redaction on)
--redact-policy | Path to a YAML file overriding the built-in redaction rules. | built-in
--manifest-only | Skip the heavy contents, write only the manifest. Fast preview. | unset
--encrypt-with | Path to a PGP public key. Output is <name>.tar.gz.gpg. | unset
--ttl | Duration (e.g. 24h) after which the bundle should be treated as expired; sets the envelope expires_at in the manifest. Not enforced; informational. | 0 (no TTL)
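
The interaction between --include and --exclude ("exclude wins") could be resolved as below. This is a sketch under assumptions: resolveSources and allSources are hypothetical names, and validation of unknown source names is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// allSources lists the values the --include flag accepts.
var allSources = []string{"logs", "metrics", "audit", "config", "compliance", "host", "nodes"}

// resolveSources returns the sources to collect, sorted by name. An empty
// include list means "all"; anything named in exclude is dropped even if it
// was also included.
func resolveSources(include, exclude []string) []string {
	if len(include) == 0 {
		include = allSources
	}
	excluded := make(map[string]bool, len(exclude))
	for _, e := range exclude {
		excluded[e] = true
	}
	seen := make(map[string]bool)
	var out []string
	for _, s := range include {
		if !excluded[s] && !seen[s] {
			seen[s] = true
			out = append(out, s)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// --include logs --include audit --exclude audit
	fmt.Println(resolveSources([]string{"logs", "audit"}, []string{"audit"}))
}
```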

Exit codes

Code Meaning
0 Bundle written.
1 Bundle written but some optional sources failed (e.g. metrics endpoint unreachable). Failures recorded in the manifest.
2 Bundle could not be written.

Check inventory

The check registry is a slice of Check values; each check declares its category, severity defaults, and whether it requires API access. The runner orders them by category, then by name, and runs each with the configured timeout.

Category: host

Check What it asserts Severity if false
host.os_supported OS is in the supported matrix (linux/amd64, linux/arm64, darwin/arm64, darwin/amd64). warn
host.kernel_version Kernel ≥ 5.x on Linux. warn
host.glibc_version glibc ≥ 2.31 on Linux (only meaningful for service mode). warn
host.cpu_count ≥ 2 cores (warn below that); ≥ 4 cores recommended. warn
host.memory_total ≥ 4 GB RAM (warn under 4 GB; fail under 2 GB). warn / fail
host.disk_free Reuses monitoring/diskspace. Fail under 10%; warn under 20%. warn / fail
host.swap_active Swap is configured (warn if absent on a low-RAM host). warn
host.time_skew time.Now() differs from a stable monotonic reference by < 1s (and from API if remote). warn
host.fd_limit ulimit -n ≥ 65535. warn
host.iptables_present iptables exists when --mode docker is in use. warn
host.docker_running docker info returns OK if any node is in docker mode. fail
host.docker_storage_driver Driver is overlay2 or btrfs. warn
host.docker_disk_usage Docker's data-root is on a partition with enough headroom. warn
host.data_dir_perms ~/.chainlaunch (or --data dir) is owned by the running user, mode ≤ 0750. warn
host.binary_path chainlaunch binary is on PATH and matches Version (catches stale-binary issues). warn
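
The two-threshold rows (host.memory_total, host.disk_free) share one shape: fail below a hard floor, warn below a soft floor. A minimal sketch with hypothetical helper names, using the thresholds from the table:

```go
package main

import "fmt"

type Severity string

const (
	SeverityPass Severity = "PASS"
	SeverityWarn Severity = "WARN"
	SeverityFail Severity = "FAIL"
)

// classifyLow returns FAIL when value < failBelow, WARN when value < warnBelow,
// and PASS otherwise. Hypothetical helper; assumes failBelow <= warnBelow.
func classifyLow(value, failBelow, warnBelow float64) Severity {
	switch {
	case value < failBelow:
		return SeverityFail
	case value < warnBelow:
		return SeverityWarn
	default:
		return SeverityPass
	}
}

// diskFreeSeverity applies host.disk_free: fail under 10% free, warn under 20%.
func diskFreeSeverity(freePercent float64) Severity {
	return classifyLow(freePercent, 10, 20)
}

// memoryTotalSeverity applies host.memory_total: fail under 2 GB, warn under 4 GB.
func memoryTotalSeverity(totalGB float64) Severity {
	return classifyLow(totalGB, 2, 4)
}

func main() {
	fmt.Println(diskFreeSeverity(71), memoryTotalSeverity(3.4))
}
```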

Category: process

Check What it asserts
process.server_alive The local chainlaunch serve is responding on its configured port. (skipped in --remote mode)
process.server_version_matches_cli API /version returns the same GitCommit as version.GitCommit.
process.server_uptime Server has been up ≥ 60s (catches crash-loop).
process.api_latency_p50 A trivial /health request returns under 250 ms.
process.outbound_dns Server can resolve registry-1.docker.io (warn-level for air-gapped customers).
process.config_loadable Server's effective config is parseable (uses the existing config endpoint).
process.pending_migrations golang-migrate reports no pending migrations.

Category: db

Check What it asserts
db.reachable SQLite file is readable and PRAGMA integrity_check returns ok.
db.foreign_keys_enabled PRAGMA foreign_keys is 1.
db.wal_size WAL file < 1 GB (signals stuck checkpoint).
db.last_vacuum VACUUM ran in the last 30 days (informational).
db.row_counts Row counts for the 10 largest tables — informational, surfaces explosive growth in audit / metrics tables.

Category: network

Per-node and per-network checks via pkg/troubleshooting.

Check What it asserts
network.node_reachable TCP connect to each node's listen address from the chainlaunch host.
network.node_tls_handshake TLS handshake completes; the cert chain validates against the configured CA.
network.node_grpc_healthy For Fabric peers, the gRPC health probe returns SERVING.
network.node_rpc_healthy For Besu nodes, eth_blockNumber succeeds.
network.peer_endpoint_overrides_consistent If AddressOverrides are configured, they resolve.
network.docker_publish_collisions Reuses pkg/system/http port probe to confirm no other process holds a node's published port.

Category: blockchain

This is the existing pkg/compliance.FabricChecks() / BesuChecks() — orderer quorum, validator count, TLS, cert expiry, node health, endpoint conflicts, backup schedule, org diversity. Doctor calls compliance.Service.ScanAllNetworks directly (no HTTP) when running locally, and the /compliance/summary endpoint when running --remote.

Added in this ADR:

Check What it asserts
blockchain.chaincode_definition_drift For each Fabric channel, the latest local chaincode_definitions row matches the on-channel _lifecycle:QueryChaincodeDefinitions result. Reuses GetCommittedChaincodes from the chaincode CLI.
blockchain.besu_block_height_drift All Besu nodes in a network report block heights within 5 blocks of each other.
blockchain.fabric_channel_height_drift For each Fabric channel, all joined peers report channel-block-height within 2 blocks.
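
Both drift checks reduce to comparing the highest and lowest reported height against a tolerance. A sketch with hypothetical names; treating "within N blocks" as inclusive is an assumption:

```go
package main

import "fmt"

// heightDrift returns the gap between the highest and lowest reported block
// height. It returns 0 for an empty or single-node input.
func heightDrift(heights map[string]uint64) uint64 {
	var lo, hi uint64
	first := true
	for _, h := range heights {
		if first || h < lo {
			lo = h
		}
		if first || h > hi {
			hi = h
		}
		first = false
	}
	return hi - lo
}

// withinDrift reports whether all nodes are within maxDrift blocks of each
// other. Assumption: the bound is inclusive.
func withinDrift(heights map[string]uint64, maxDrift uint64) bool {
	return heightDrift(heights) <= maxDrift
}

func main() {
	besu := map[string]uint64{"validator-1": 1200, "validator-2": 1203, "validator-3": 1205}
	fabric := map[string]uint64{"peer0-org1": 40, "peer0-org2": 43}
	fmt.Println(withinDrift(besu, 5))   // Besu tolerance: 5 blocks
	fmt.Println(withinDrift(fabric, 2)) // Fabric tolerance: 2 blocks
}
```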

Category: security

Check What it asserts
security.tls_listener_enabled The chainlaunch API itself is served over TLS (warn if not, in production-looking configs).
security.default_admin_changed The local admin account's password hash is not the install default.
security.api_key_count At least one non-default API key exists if any external integration is configured (sanity check).
security.key_provider_health Each configured key provider (DB / AWS KMS / Vault) responds to a no-op probe.
security.kms_credentials_present When AWS KMS is configured, the resolved credentials chain works (instance profile / SSO / static — uses the same chain backup uses).
security.vault_token_alive When Vault is configured, the token can auth/token/lookup-self and has ≥ 7 days TTL left.
security.encryption_key_set Encryption-at-rest master key is configured (looks at pkg/encryption).
security.oidc_refresher_healthy When OIDC is configured, the refresher thread last ran < 30 min ago.

Category: observability

Check What it asserts
observability.metrics_port_listening Prometheus endpoint is up.
observability.notifications_configured At least one channel (email / webhook / Slack) is configured.
observability.notifications_test (Opt-in via --include-network) sends a __doctor_test__ event end-to-end through the webhook pipeline.
observability.audit_pruner_recent Audit pruner ran < retention-window ago.

Category: backup

Check What it asserts
backup.target_count ≥ 1 active backup target exists.
backup.schedule_count ≥ 1 active backup schedule exists.
backup.last_successful_within_sla Each target's most recent backup succeeded within 2 * schedule_interval.
backup.target_verify Calls provider.VerifyBackup against the latest backup for each target. Slowest check; gated behind --category backup if --budget is tight.
backup.s3_reachable For S3 targets, can list with the configured credentials.
backup.ebs_role_present For EBS targets, the resolved IAM role exists.
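
The backup.last_successful_within_sla rule is a single comparison against 2 * schedule_interval. A sketch with hypothetical function names; the inclusive boundary is an assumption, and a target with no successful backup at all would need separate handling by the caller:

```go
package main

import (
	"fmt"
	"time"
)

// withinSLA reports whether a backup that last succeeded `age` ago still
// satisfies the SLA of 2 * scheduleInterval. Assumption: the bound is inclusive.
func withinSLA(age, scheduleInterval time.Duration) bool {
	return age <= 2*scheduleInterval
}

// lastSuccessWithinSLA is the same rule expressed with timestamps.
func lastSuccessWithinSLA(now, lastSuccess time.Time, scheduleInterval time.Duration) bool {
	return withinSLA(now.Sub(lastSuccess), scheduleInterval)
}

func main() {
	// A 3h schedule gives a 6h SLA.
	age := 2*time.Hour + 14*time.Minute
	fmt.Println(withinSLA(age, 3*time.Hour))
	fmt.Println(withinSLA(7*time.Hour, 3*time.Hour))

	now := time.Now()
	fmt.Println(lastSuccessWithinSLA(now, now.Add(-age), 3*time.Hour))
}
```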

Runner architecture

┌─────────────────────────────────────────────────────────────────┐
│  cmd/doctor/doctor.go                                           │
│   ├─ parse flags                                                │
│   ├─ build context (timeout, budget, severity, categories)      │
│   ├─ call pkg/doctor.Service.Run(ctx, opts)                     │
│   └─ render(report, --output)                                   │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  pkg/doctor/service.go                                          │
│   ├─ Registry (categories → []Check)                            │
│   ├─ Run(ctx, opts) returns *Report                             │
│   ├─ Each Check runs in its own goroutine, bounded by:          │
│   │    semaphore (limit 8 concurrent),                          │
│   │    per-check timeout,                                       │
│   │    overall budget (cancels the parent ctx).                 │
│   └─ Deterministic ordering of the rendered output regardless   │
│      of completion order.                                       │
└─────────────────────────────────────────────────────────────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        ▼                     ▼                     ▼
┌───────────────┐   ┌──────────────────┐   ┌──────────────────┐
│ pkg/doctor/   │   │ pkg/compliance   │   │ pkg/troubleshoot │
│ checks/       │   │ (existing)       │   │ ing (existing)   │
│   host/       │   │  → blockchain    │   │  → network       │
│   process/    │   │     checks       │   │     checks       │
│   db/         │   └──────────────────┘   └──────────────────┘
│   security/   │
│   backup/     │
│   observ/     │
└───────────────┘

Types

package doctor
 
type Severity string
const (
    SeverityPass Severity = "PASS"
    SeverityWarn Severity = "WARN"
    SeverityFail Severity = "FAIL"
    SeverityTimeout Severity = "TIMEOUT"
    SeveritySkipped Severity = "SKIPPED"
)
 
type Category string
const (
    CategoryHost          Category = "host"
    CategoryProcess       Category = "process"
    CategoryDB            Category = "db"
    CategoryNetwork       Category = "network"
    CategoryBlockchain    Category = "blockchain"
    CategorySecurity      Category = "security"
    CategoryObservability Category = "observability"
    CategoryBackup        Category = "backup"
)
 
type CheckResult struct {
    Name        string         `json:"name"`        // e.g. "host.disk_free"
    Category    Category       `json:"category"`
    Severity    Severity       `json:"severity"`
    Message     string         `json:"message"`     // one-line summary
    Details     map[string]any `json:"details,omitempty"`
    Remediation string         `json:"remediation,omitempty"`
    DurationMS  int64          `json:"duration_ms"`
}
 
type Check interface {
    Name() string
    Category() Category
    RequiresAPI() bool   // true → skipped when local API is unavailable
    Run(ctx context.Context, env *Env) CheckResult
}
 
type Env struct {
    Logger          *logger.Logger
    DB              *sql.DB
    Queries         *db.Queries
    NetworksService *networksservice.NetworkService
    NodesService    *nodesservice.NodeService
    ComplianceSvc   *compliance.Service
    TroubleshootSvc *troubleshooting.Service
    APIClient       *common.Client   // nil in local mode
    DataDir         string
    Mode            Mode             // Local | Remote
}
 
type Report struct {
    Version    string           `json:"version"`    // chainlaunch version
    GitCommit  string           `json:"git_commit"`
    StartedAt  time.Time        `json:"started_at"`
    DurationMS int64            `json:"duration_ms"`
    Status     OverallStatus    `json:"status"`     // HEALTHY|DEGRADED|AT_RISK|CRITICAL
    Summary    Summary          `json:"summary"`
    Categories []CategoryReport `json:"categories"`
}

Concurrency & timeouts

  • A chan struct{} semaphore of size 8 caps parallelism. Even on a 64-core host, exploding parallelism would just thrash the SQLite WAL.
  • Each check gets ctx, cancel := context.WithTimeout(parent, opts.PerCheckTimeout). The parent has the wall budget. If the wall budget fires, the runner cancels everything in flight and emits TIMEOUT for unfinished checks.
  • Network calls inside checks must honor ctx. We'll add a gosec rule (or staticcheck) gate that flags http.Get / net.Dial in check code.
  • Each check's Run is expected to return within its own timeout. If a check panics instead, the runner recovers the panic and records it as FAIL with the recovered error.
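
The bullets above can be sketched as a small runner. This is an illustration under assumptions, not the pkg/doctor code: Check is reduced to a name plus a function, runOne, runAll, and demo are hypothetical names, and a result that arrives after the deadline is reported as TIMEOUT.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

type Severity string

const (
	SeverityPass    Severity = "PASS"
	SeverityFail    Severity = "FAIL"
	SeverityTimeout Severity = "TIMEOUT"
)

type CheckResult struct {
	Name     string
	Severity Severity
	Message  string
}

// Check is a simplified stand-in for the ADR's Check interface.
type Check struct {
	Name string
	Run  func(ctx context.Context) CheckResult
}

// runOne runs a single check under its own timeout and converts panics to FAIL.
func runOne(parent context.Context, c Check, perCheck time.Duration) CheckResult {
	ctx, cancel := context.WithTimeout(parent, perCheck)
	defer cancel()

	done := make(chan CheckResult, 1) // buffered so a late check never blocks
	go func() {
		defer func() {
			if r := recover(); r != nil {
				done <- CheckResult{c.Name, SeverityFail, fmt.Sprintf("panic: %v", r)}
			}
		}()
		done <- c.Run(ctx)
	}()

	select {
	case res := <-done:
		if ctx.Err() == nil {
			return res
		}
		// The result arrived after the deadline; report it as a timeout.
	case <-ctx.Done():
	}
	return CheckResult{c.Name, SeverityTimeout, ctx.Err().Error()}
}

// runAll runs checks concurrently, at most maxParallel at a time, inside one
// overall budget. Results keep the input order regardless of completion order.
func runAll(checks []Check, perCheck, budget time.Duration, maxParallel int) []CheckResult {
	parent, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()

	sem := make(chan struct{}, maxParallel)
	results := make([]CheckResult, len(checks))
	var wg sync.WaitGroup
	for i, c := range checks {
		wg.Add(1)
		go func(i int, c Check) {
			defer wg.Done()
			select {
			case sem <- struct{}{}: // acquire a slot
				defer func() { <-sem }()
				results[i] = runOne(parent, c, perCheck)
			case <-parent.Done(): // budget spent before this check could start
				results[i] = CheckResult{c.Name, SeverityTimeout, "budget exhausted"}
			}
		}(i, c)
	}
	wg.Wait()
	return results
}

// demo runs three toy checks: one passes, one hangs until cancelled, one panics.
func demo() []CheckResult {
	checks := []Check{
		{"ok", func(ctx context.Context) CheckResult { return CheckResult{"ok", SeverityPass, "fine"} }},
		{"slow", func(ctx context.Context) CheckResult {
			<-ctx.Done() // honors ctx, but never finishes in time
			return CheckResult{"slow", SeverityFail, "cancelled"}
		}},
		{"panicky", func(ctx context.Context) CheckResult { panic("boom") }},
	}
	return runAll(checks, 200*time.Millisecond, 2*time.Second, 8)
}

func main() {
	for _, r := range demo() {
		fmt.Printf("%-8s %-8s %s\n", r.Name, r.Severity, r.Message)
	}
}
```

One caveat of this shape: a check that ignores ctx keeps its goroutine alive after the runner has moved on, which is why the lint gate on context-free network calls matters.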

Determinism

Checks complete out of order, but the rendered output is sorted by (category, name) for stable diffs across runs. The JSON output mirrors this so a CI job comparing two reports can diff them directly.
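
A stable ordering can be imposed after all checks finish. In this sketch the category order follows the list given for --category, which is an assumption; a plain string comparison on the category would be the other reading of "(category, name)". The sortResults and rank names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

type CheckResult struct {
	Category, Name string
}

// categoryOrder is an assumption: the order in which --category lists them.
var categoryOrder = []string{
	"host", "process", "db", "network", "blockchain", "security", "observability", "backup",
}

// rank returns the position of a category; unknown categories sort last.
func rank(category string) int {
	for i, c := range categoryOrder {
		if c == category {
			return i
		}
	}
	return len(categoryOrder)
}

// sortResults orders results by category rank, then by check name.
func sortResults(rs []CheckResult) {
	sort.SliceStable(rs, func(i, j int) bool {
		if ri, rj := rank(rs[i].Category), rank(rs[j].Category); ri != rj {
			return ri < rj
		}
		return rs[i].Name < rs[j].Name
	})
}

func main() {
	rs := []CheckResult{
		{"backup", "backup.target_count"},
		{"host", "host.os_supported"},
		{"host", "host.cpu_count"},
		{"db", "db.reachable"},
	}
	sortResults(rs)
	for _, r := range rs {
		fmt.Println(r.Name)
	}
}
```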

Output formats

--output text (default for TTY)

chainlaunch doctor — v0.42.1 (a3f8b2c) — host: ip-10-0-1-23 — 2026-05-13T08:42:11Z

  HOST                                                              4/5 pass, 1 warn
    ✓ host.os_supported       linux/amd64
    ✓ host.cpu_count          8 cores
    ! host.memory_total       3.4 GB available, ≥ 4 GB recommended
    ✓ host.disk_free          71% free on /var/lib/chainlaunch
    ✓ host.docker_running     server 25.0.3, overlay2

  PROCESS                                                           7/7 pass
    ✓ process.server_alive    http://localhost:8100 (uptime 4d 2h)
    ✓ process.api_latency_p50 41 ms
    ✓ process.pending_migrations  none
    …

  BLOCKCHAIN                                                        2/3 pass, 1 fail
    ✓ blockchain.orderer_quorum (network: prod-supply-chain)        3 orderers, tolerates 1
    ✗ blockchain.cert_expiry (network: prod-supply-chain)           peer0.org2 TLS expires in 6 days
      → Renew with: POST /api/v1/nodes/{id}/certificates/renew
    ✓ blockchain.fabric_channel_height_drift                        all peers within 1 block

  BACKUP                                                            3/3 pass
    ✓ backup.last_successful_within_sla   last S3 backup 2h 14m ago (SLA 6h)
    …

  ───────────────────────────────────────────────────────────────────────
  STATUS: AT_RISK   ·   28 pass · 2 warn · 1 fail · 0 timeout   ·   2.4s
  ───────────────────────────────────────────────────────────────────────

  Run with --output json for machine-readable output, or
  chainlaunch support-bundle to package everything for support.

--output json

{
  "version": "0.42.1",
  "git_commit": "a3f8b2c",
  "started_at": "2026-05-13T08:42:11Z",
  "duration_ms": 2412,
  "status": "AT_RISK",
  "summary": {"total": 31, "pass": 28, "warn": 2, "fail": 1, "timeout": 0},
  "categories": [
    {
      "name": "host",
      "summary": {"total": 5, "pass": 4, "warn": 1, "fail": 0, "timeout": 0},
      "checks": [
        {
          "name": "host.memory_total",
          "category": "host",
          "severity": "WARN",
          "message": "3.4 GB available, ≥ 4 GB recommended",
          "details": {"available_gb": 3.4, "recommended_gb": 4},
          "remediation": "Bump host RAM to 4 GB+.",
          "duration_ms": 8
        }
      ]
    }
  ]
}

--output junit

XML matching the JUnit Surefire schema. One <testsuite> per category, one <testcase> per check. WARN becomes <testcase> with a <system-err> block; FAIL becomes <failure>. Lets CI dashboards (GitHub Actions, GitLab, CircleCI) render doctor results natively.
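
The mapping described above (one testsuite per category, WARN as system-err, FAIL as failure) can be sketched with encoding/xml. The struct and function names are hypothetical, and only a few common JUnit attributes are emitted; a real renderer would likely add timing and error counts.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

type CheckResult struct {
	Name, Category, Severity, Message string
}

type testSuites struct {
	XMLName xml.Name    `xml:"testsuites"`
	Suites  []testSuite `xml:"testsuite"`
}

type testSuite struct {
	Name     string     `xml:"name,attr"`
	Tests    int        `xml:"tests,attr"`
	Failures int        `xml:"failures,attr"`
	Cases    []testCase `xml:"testcase"`
}

type testCase struct {
	Name      string   `xml:"name,attr"`
	ClassName string   `xml:"classname,attr"`
	Failure   *failure `xml:"failure,omitempty"`
	SystemErr string   `xml:"system-err,omitempty"`
}

type failure struct {
	Message string `xml:"message,attr"`
}

// buildSuites groups results into one suite per category, in first-seen order.
func buildSuites(results []CheckResult) testSuites {
	var out testSuites
	index := map[string]int{} // category -> position in out.Suites
	for _, r := range results {
		i, ok := index[r.Category]
		if !ok {
			i = len(out.Suites)
			index[r.Category] = i
			out.Suites = append(out.Suites, testSuite{Name: r.Category})
		}
		tc := testCase{Name: r.Name, ClassName: r.Category}
		switch r.Severity {
		case "FAIL":
			tc.Failure = &failure{Message: r.Message}
			out.Suites[i].Failures++
		case "WARN":
			tc.SystemErr = r.Message
		}
		out.Suites[i].Tests++
		out.Suites[i].Cases = append(out.Suites[i].Cases, tc)
	}
	return out
}

// renderJUnit serializes the suites as indented XML with the standard header.
func renderJUnit(results []CheckResult) (string, error) {
	b, err := xml.MarshalIndent(buildSuites(results), "", "  ")
	if err != nil {
		return "", err
	}
	return xml.Header + string(b), nil
}

func main() {
	out, err := renderJUnit([]CheckResult{
		{"host.memory_total", "host", "WARN", "3.4 GB available, 4 GB recommended"},
		{"blockchain.cert_expiry", "blockchain", "FAIL", "peer0.org2 TLS expires in 6 days"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```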

Support-bundle format

A gzipped tar with a fixed layout. The root directory inside the tarball is named chainlaunch-bundle-<timestamp>-<git_commit> so two bundles never collide when unpacked side-by-side.

chainlaunch-bundle-20260513-084211-a3f8b2c/
├── manifest.json                       # always first; signed in v1.1
├── README.txt                          # plain-text "how to read this bundle"
├── doctor.json                         # output of `chainlaunch doctor --output json`
├── doctor.txt                          # output of `chainlaunch doctor --output text --no-color`
├── version.json                        # version, git_commit, build_time, OS, arch, container?
├── host/
│   ├── os.json                         # uname, /etc/os-release
│   ├── cpu.json                        # cores, model
│   ├── mem.json                        # total, free, swap
│   ├── disk.json                       # df -h, df -i (inodes), data-dir size
│   ├── network.json                    # interfaces, routes, /etc/resolv.conf (redacted)
│   ├── docker.json                     # docker version, info, ps --all (no env)
│   └── ulimits.json                    # max files, max procs
├── config/
│   ├── effective.yaml                  # serialized effective config (REDACTED)
│   └── env.json                        # CHAINLAUNCH_* env vars (REDACTED)
├── audit/
│   └── recent.jsonl                    # last N audit events (REDACTED)
├── compliance/
│   ├── summary.json                    # org-wide scan
│   └── networks/<id>.json              # per-network scan
├── nodes/
│   └── <slug>/
│       ├── meta.json                   # node row, redacted
│       ├── logs.txt                    # last log-tail-mb of container/service logs
│       └── docker-inspect.json         # docker inspect (REDACTED)
├── chaincodes/
│   └── <slug>/
│       ├── meta.json
│       ├── timeline.json               # from the existing /timeline endpoint
│       └── logs.txt                    # last N MB of chaincode container logs
├── server/
│   ├── chainlaunch.log                 # last log-tail-mb of the server log
│   ├── chainlaunch.log.1               # rotated prior log if recent
│   └── metrics-snapshot.prom           # one-shot Prometheus scrape
├── db/
│   ├── schema.sql                      # output of `.schema` — no data
│   ├── integrity.txt                   # PRAGMA integrity_check
│   ├── migrations.json                 # applied migration versions
│   └── row-counts.json                 # row counts per table
└── redaction.json                      # what got redacted and the rule that matched

manifest.json

{
  "bundle_version": "1",
  "chainlaunch_version": "0.42.1",
  "git_commit": "a3f8b2c",
  "build_time": "2026-05-09T13:22:00Z",
  "generated_at": "2026-05-13T08:42:11Z",
  "expires_at": "2026-05-14T08:42:11Z",   // when --ttl is set
  "host_id": "sha256:64hex-of-hostname-uuid",
  "os": "linux",
  "arch": "amd64",
  "in_container": false,
  "license_id": "sha256:64hex",            // SHA-256 of license key, never the key itself
  "instance_id": "sha256:64hex",
  "redaction": {
    "enabled": true,
    "policy": "built-in",
    "policy_version": "1",
    "matches": 47
  },
  "contents": [
    {"path": "doctor.json", "bytes": 12834, "sha256": "…"},
    {"path": "nodes/peer0-org1/logs.txt", "bytes": 21037184, "sha256": "…"},

  ],
  "errors": [
    {"source": "metrics-snapshot", "error": "connection refused on :9090"}
  ]
}
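
Each entry in "contents" pairs a path with a byte count and a SHA-256 digest, and the *_id fields carry a digest instead of the raw value. A sketch of how both could be produced with the standard library; ContentEntry, describe, describeBytes, and hashedID are hypothetical names:

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"io"
)

// ContentEntry mirrors one element of the manifest's "contents" array.
type ContentEntry struct {
	Path   string `json:"path"`
	Bytes  int64  `json:"bytes"`
	SHA256 string `json:"sha256"`
}

// describe streams r through SHA-256, so a large log tail is hashed without
// being held in memory.
func describe(path string, r io.Reader) (ContentEntry, error) {
	h := sha256.New()
	n, err := io.Copy(h, r)
	if err != nil {
		return ContentEntry{}, err
	}
	return ContentEntry{Path: path, Bytes: n, SHA256: hex.EncodeToString(h.Sum(nil))}, nil
}

// describeBytes is a convenience wrapper for in-memory content.
func describeBytes(path string, data []byte) (ContentEntry, error) {
	return describe(path, bytes.NewReader(data))
}

// hashedID returns "sha256:" plus the hex digest of secret, so the manifest
// can identify a license or instance without containing the value itself.
func hashedID(secret string) string {
	sum := sha256.Sum256([]byte(secret))
	return "sha256:" + hex.EncodeToString(sum[:])
}

func main() {
	entry, err := describeBytes("doctor.json", []byte(`{"status":"AT_RISK"}`))
	if err != nil {
		panic(err)
	}
	b, _ := json.Marshal(entry)
	fmt.Println(string(b))
	fmt.Println(hashedID("example-license-key"))
}
```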

Redaction rules (built-in v1)

Every text file passes through a streaming redactor. The redactor is one pass with a fixed pattern list (no user regex injection in v1). Each match is replaced with <redacted:<rule-id>> and recorded in redaction.json with the source file, byte offset, and rule ID.

Rule ID Pattern Notes
r.api-key.clpro clpro_[A-Za-z0-9]{32,} ChainLaunch API keys.
r.api-key.bearer Bearer\s+[A-Za-z0-9._-]{20,} Bearer tokens in headers.
r.basic-auth Authorization:\s*Basic\s+[A-Za-z0-9+/=]+ Base64-encoded basic auth.
r.private-key.pem -----BEGIN (?:RSA |EC |ENCRYPTED )?PRIVATE KEY-----[\s\S]*?-----END [^-]+----- All PEM private keys.
r.certificate.pem -----BEGIN CERTIFICATE-----[\s\S]*?-----END CERTIFICATE----- Replaced with a one-line summary (subject + issuer + NotAfter), not removed. Customers and support both want the metadata.
r.password.config YAML/JSON keys matching (?i)(password|passwd|secret|token|api[_-]?key|client[_-]?secret). Value replaced; key kept so the structure is debuggable.
r.aws-credential AKIA[0-9A-Z]{16} and aws_secret_access_key\s*=\s*[A-Za-z0-9/+=]{40}
r.vault-token hvs\.[A-Za-z0-9_-]{24,} or s\.[A-Za-z0-9]{24,} Vault token formats.
r.gcp-sa "private_key":\s*"-----BEGIN PRIVATE KEY-----[\s\S]*?-----END PRIVATE KEY-----" GCP service account JSON.
r.email RFC 5322 email. Optional rule, off by default — emails matter for audit traceability and customers usually want them. Documented; tunable.
r.ip.public Any IPv4 / IPv6 not in RFC 1918 / RFC 4193. Optional, off by default — usually customers want to share these so support can grep.
r.docker.env-var The Env: block of docker inspect output. Replaced with <redacted:r.docker.env-var> because env vars often contain secrets.
r.path.home The chainlaunch user's $HOME and absolute paths under it. Replaced with ~.
r.uuid.session Session cookies.

Two non-pattern rules:

  • YAML structural redaction. The effective-config emitter knows which fields are sensitive (encryption-at-rest master key, KMS credentials, OIDC client secret, etc.) and never serializes them — they appear as <redacted-by-field>. This is more reliable than regex for structured data.
  • JSON structural redaction for docker inspect output: the Env array, HostConfig.Env, and Args are wholesale removed before the regex pass.

The redactor processes files in streaming chunks (line-oriented for .txt/.log, parsed-then-serialized for .json/.yaml). A 1 GB log tail must never load entirely into RAM.
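
A line-oriented redactor along these lines might look as follows. It is a sketch under assumptions: only four single-line rules from the table are included, the rules are joined into one alternation so each line is scanned once, and all type and function names are hypothetical. Multi-line rules such as PEM blocks do not fit a purely line-by-line scan and are not covered here.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"regexp"
	"strings"
)

// A small subset of the built-in rules from the table above.
var rules = []struct{ id, pattern string }{
	{"r.api-key.clpro", `clpro_[A-Za-z0-9]{32,}`},
	{"r.api-key.bearer", `Bearer\s+[A-Za-z0-9._-]{20,}`},
	{"r.aws-credential", `AKIA[0-9A-Z]{16}`},
	{"r.vault-token", `hvs\.[A-Za-z0-9_-]{24,}`},
}

// combined joins the rules into one alternation with exactly one capture
// group per rule, so the group index identifies the rule that matched.
var combined = func() *regexp.Regexp {
	parts := make([]string, len(rules))
	for i, r := range rules {
		parts[i] = "(" + r.pattern + ")"
	}
	return regexp.MustCompile(strings.Join(parts, "|"))
}()

// Match is one entry destined for redaction.json.
type Match struct {
	File   string
	Offset int64 // byte offset of the match in the original file
	RuleID string
}

// redactLine replaces every match in line; base is the line's file offset.
func redactLine(file string, base int64, line string) (string, []Match) {
	var sb strings.Builder
	var matches []Match
	last := 0
	for _, m := range combined.FindAllStringSubmatchIndex(line, -1) {
		for i, r := range rules {
			if m[2+2*i] < 0 {
				continue // this rule's group did not take part in the match
			}
			sb.WriteString(line[last:m[0]])
			sb.WriteString("<redacted:" + r.id + ">")
			last = m[1]
			matches = append(matches, Match{file, base + int64(m[0]), r.id})
			break
		}
	}
	sb.WriteString(line[last:])
	return sb.String(), matches
}

// redactStream copies r to w line by line, so memory use is bounded by the
// longest line rather than by the file size.
func redactStream(file string, r io.Reader, w io.Writer) ([]Match, error) {
	var all []Match
	var offset int64
	br := bufio.NewReader(r)
	for {
		line, err := br.ReadString('\n')
		if len(line) > 0 {
			out, ms := redactLine(file, offset, line)
			all = append(all, ms...)
			offset += int64(len(line))
			if _, werr := io.WriteString(w, out); werr != nil {
				return all, werr
			}
		}
		if err == io.EOF {
			return all, nil
		}
		if err != nil {
			return all, err
		}
	}
}

// redactString is a convenience wrapper for in-memory input.
func redactString(file, in string) (string, []Match, error) {
	var sb strings.Builder
	ms, err := redactStream(file, strings.NewReader(in), &sb)
	return sb.String(), ms, err
}

func main() {
	out, ms, _ := redactString("server/chainlaunch.log", "ok line\nkey=AKIAABCDEFGHIJKLMNOP\n")
	fmt.Print(out)
	fmt.Println(len(ms), "match(es)")
}
```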

--no-redact

Disables the redactor entirely. The CLI prints a 6-line yellow banner before writing the bundle and the manifest records redaction.enabled = false. Any support ticket that comes in with a --no-redact bundle gets a polite "please regenerate without that flag" reply.

--encrypt-with

After tarballing, the file is encrypted with the supplied PGP public key using gopg. Output extension becomes .tar.gz.gpg. Support has a published team key — the standard install includes a chainlaunch support-bundle --encrypt-with /usr/share/chainlaunch/support-team.asc shortcut.

Performance budgets

Operation Budget
doctor total wall-clock 30 s
Per-check timeout 5 s
support-bundle total wall-clock (default tail sizes) 90 s
Tarball size (default tails) < 200 MB compressed
Redaction throughput ≥ 200 MB/s per core (streaming, line-oriented)
Memory ceiling for redactor 256 MB

Doctor runs faster than 30s in practice — the budget exists to keep one slow remote check (a hung peer) from blocking the report.

RBAC & permissions

  • CLI doctor local mode — must be run as the chainlaunch process user. No additional auth.
  • CLI doctor --remote — uses the normal --auth-username / --auth-password / --auth-bearer. Requires the new system:doctor:run permission.
  • API /system/doctor — same permission, plus auth.WithPermission(auth.PermissionSystemDoctor).
  • CLI support-bundle — must be run as the chainlaunch process user. Bundle output is written with 0600 perms. No remote variant in v1 — too easy to leak.

Add to pkg/auth/permissions.go:

const (
    PermissionSystemDoctor        Permission = "system:doctor:run"
    PermissionSystemSupportBundle Permission = "system:support-bundle:generate"
)

Granted to ADMIN by default. OPERATOR gets system:doctor:run but not the support-bundle (the bundle includes audit data that operators shouldn't extract). VIEWER gets neither.
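
The default grants can be written down as a lookup table. This is an illustrative sketch only: the Role type, the defaultGrants map, and the allowed helper are hypothetical and do not describe how pkg/auth stores grants.

```go
package main

import "fmt"

type Permission string

const (
	PermissionSystemDoctor        Permission = "system:doctor:run"
	PermissionSystemSupportBundle Permission = "system:support-bundle:generate"
)

// Role is a hypothetical type for this sketch.
type Role string

const (
	RoleAdmin    Role = "ADMIN"
	RoleOperator Role = "OPERATOR"
	RoleViewer   Role = "VIEWER"
)

// defaultGrants encodes the paragraph above: ADMIN gets both permissions,
// OPERATOR gets doctor only, VIEWER gets neither.
var defaultGrants = map[Role]map[Permission]bool{
	RoleAdmin: {
		PermissionSystemDoctor:        true,
		PermissionSystemSupportBundle: true,
	},
	RoleOperator: {
		PermissionSystemDoctor: true,
	},
	RoleViewer: {},
}

// allowed reports whether role holds permission p by default. Unknown roles
// and missing entries are denied.
func allowed(role Role, p Permission) bool {
	return defaultGrants[role][p]
}

func main() {
	for _, r := range []Role{RoleAdmin, RoleOperator, RoleViewer} {
		fmt.Println(r, allowed(r, PermissionSystemDoctor), allowed(r, PermissionSystemSupportBundle))
	}
}
```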

API surface

POST /api/v1/system/doctor

POST /api/v1/system/doctor
Content-Type: application/json
{
  "categories": ["blockchain", "backup"],
  "per_check_timeout_ms": 5000,
  "budget_ms": 30000
}

Response is the same Report JSON the CLI emits. Synchronous in v1 (doctor is fast enough). Future v2: returns a job_id for long-running variants.

GET /api/v1/system/doctor/history

Returns the last N runs, leveraging the existing compliance_scans table plus a new lightweight doctor_runs table (just id, started_at, duration_ms, status, summary_json, created_by).

POST /api/v1/system/support-bundle

Not exposed in v1. The bundle is local-disk only. The UI may shell out to the binary in v1.1 but the HTTP endpoint waits until v2 (encrypted-streaming-to-S3 is a different feature).

Data model changes

One small table only:

-- pkg/db/migrations/0046_add_doctor_runs.up.sql
CREATE TABLE IF NOT EXISTS doctor_runs (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    started_at   TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    duration_ms  INTEGER NOT NULL,
    status       TEXT NOT NULL,        -- HEALTHY|DEGRADED|AT_RISK|CRITICAL|ERROR
    summary_json TEXT NOT NULL,        -- {"total": …, "pass": …, …}
    report_json  TEXT NOT NULL,        -- full Report; truncated if it exceeds 1 MB
    created_by   INTEGER REFERENCES users(id) ON DELETE SET NULL
);
 
CREATE INDEX IF NOT EXISTS idx_doctor_runs_started_at
    ON doctor_runs(started_at DESC);

Pruned by the existing audit_pruner mechanism (extended to know about doctor_runs). Retention default: 90 days, configurable.

Package layout

pkg/doctor/
├── service.go              # runner: takes Env + opts, returns *Report
├── types.go                # Check, CheckResult, Report, etc.
├── registry.go             # Registers all built-in checks; extension point
├── redactor/
│   ├── redactor.go         # streaming engine
│   ├── rules.go            # built-in rules
│   ├── rules_test.go
│   └── policy.go           # YAML policy loader (--redact-policy)
├── bundle/
│   ├── builder.go          # walks the rules, writes the tarball
│   ├── manifest.go         # manifest schema + writer
│   ├── encrypt.go          # PGP wrapper (--encrypt-with)
│   └── builder_test.go
└── checks/
    ├── host/               # one file per check or per subgroup
    ├── process/
    ├── db/
    ├── network/            # thin wrappers over pkg/troubleshooting
    ├── blockchain/         # thin wrappers over pkg/compliance + new drift checks
    ├── security/
    ├── observability/
    └── backup/             # thin wrappers over pkg/backups

cmd/doctor/
├── doctor.go               # top-level command
├── support_bundle.go       # `chainlaunch support-bundle`
├── render_text.go
├── render_json.go
└── render_junit.go

Wire into cmd/root.go:

rootCmd.AddCommand(doctor.NewDoctorCmd(appLogger))
rootCmd.AddCommand(doctor.NewSupportBundleCmd(appLogger))

Documentation

New documentation deliverables:

  1. chainlaunch-docs/docs/operations/doctor.md — how-to + reference, including the exit-code semantics and CI integration recipes.
  2. chainlaunch-docs/docs/operations/support-bundle.md — what's in the bundle, what's redacted, how to inspect it before sending.
  3. Update chainlaunch-pro-cli.md with new doctor and support-bundle sections.
  4. A docs page specifically for support engineers at chainlaunch-docs/docs/internal/triage.md: "open the bundle, look at these five files in this order." Internal-but-public so customers can read it too — that itself is a trust signal.

Rollout plan

| Phase | What ships | Customer-visible? |
| --- | --- | --- |
| v0: design | This ADR. | No |
| v1: doctor local mode | chainlaunch doctor, JSON + text output, all host/process/db/network checks, reuse of compliance for blockchain checks. No bundle yet. New permission added. | Yes, in next minor release. Beta-flagged. |
| v1.1: support-bundle | chainlaunch support-bundle, full tarball, built-in redactor, manifest, PGP option. No remote bundle. UI button → shells out to binary. | Yes. |
| v1.2: doctor --remote | Doctor works over the API. /api/v1/system/doctor endpoint. JUnit output. doctor_runs table + history endpoint. | Yes. |
| v2: bundle to S3 | chainlaunch support-bundle --to s3://… streaming upload. Job-based async doctor. UI dashboard tile. | Yes. |

Each phase is shippable on its own. v1 is the highest-leverage; v1.1 is the credibility upgrade.

Testing strategy

| Layer | Test |
| --- | --- |
| Unit: checks | One table-driven test per check, hand-crafted Env fixtures. Run under -race. |
| Unit: redactor | Golden files. Every rule has a before.txt / after.txt pair. Adding a rule without a fixture fails CI. |
| Integration | A doctor_integration_test.go boots a real serve against test-1.db, runs doctor, asserts on the report. |
| Smoke | A support_bundle_smoke_test.go generates a bundle in /tmp, extracts it, walks every file, asserts the redactor banned-string list does not appear. |
| Performance | Benchmarks for the redactor at 1 GB of synthetic log input. Asserts ≥ 200 MB/s/core. |
| E2E | A Playwright test in web/ clicks the "Run doctor" button (v1.1) and verifies the report renders. |
| Security | The "doctor against a configured-but-broken instance" test: malformed config, expired certs, dead Vault. Must not crash, must report each. |

The most important test is the redactor "never leak" test: a corpus of real config files contributed by support is run through the redactor and the output is grep'd for every known secret prefix (clpro_, AKIA, hvs., -----BEGIN). Zero matches = ship.

Security considerations

  1. Bundle is dangerous by definition. Even redacted, it tells an attacker a lot: host IPs (unless r.ip.public is enabled), software versions, network topology. The CLI writes it with mode 0600 and records an expires_at in the manifest, and the docs state explicitly that the bundle must be treated like a credential.
  2. Redaction is best-effort. No regex catches every secret. The --encrypt-with flag + a published support PGP key is the belt-and-suspenders defense.
  3. Doctor checks must not log secrets. Code review checklist: every check function is reviewed for log.Printf of any value that crosses a key/secret/credential boundary.
  4. The process.outbound_dns check must use a configurable target. Hard-coding registry-1.docker.io is fine for the default; air-gapped customers will override it via a policy file in the style of --redact-policy or a future chainlaunch.yaml block.
  5. --no-redact writes a sentinel to manifest.json so we can refuse to process such bundles in our own ticket-handling tools.

Alternatives considered

| Alternative | Why rejected |
| --- | --- |
| Just shell out to a script. A support.sh bundled with the install. | No structured output, hard to test, hard to redact reliably, doesn't surface in the UI. |
| Bundle without redaction; rely on the customer to scrub. | Customers paste bundles into Slack within 30 seconds. Redaction must be the default. |
| Make doctor a SaaS service that the binary phones home to. | Defeats the air-gapped/self-hosted promise. Hard pass. |
| Cover this with Prometheus alerting only. | Different shape: Prometheus is "is something broken right now," doctor is "is this install correctly set up." Both matter; they're not substitutes. |
| Vault-style vault operator diagnose clone (exact behavior parity). | Vault is more constrained: single-process, single-binary. ChainLaunch's data plane (Docker, peers, KMS, backups, OIDC) is much wider. Doctor's check inventory reflects that. |

Open questions

  1. --remote mode and host checks. The API can't see the host. Do we hide host checks entirely in --remote mode, or show them as SKIPPED with a reason? Proposal: SKIPPED with the reason so the report layout matches between modes.
  2. Doctor and the audit log. Should each doctor run write an audit event? Proposal: yes for --remote runs (it's an authenticated user action), no for local-only runs (it's a process-user action, gets a doctor_runs row).
  3. Bundle and licensing. Should the bundle include the license object? Proposal: include a SHA-256 of the license key, never the key itself. The hash lets support correlate the bundle to the customer account.
  4. Redaction policy versioning. When we add a rule, do bundles generated against an old policy version need re-redaction before processing? Proposal: yes, support tooling refuses bundles whose redaction.policy_version is older than the currently-deployed policy.
  5. macOS notarization for support-bundle. The PGP path drags in CGO bindings on some platforms. We may need to use a pure-Go OpenPGP library (e.g. github.com/ProtonMail/go-crypto) to keep the binary CGO-free.
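The proposals in questions 3 and 4 are small enough to sketch in Go. Both function names are hypothetical, and representing redaction.policy_version as an integer is an assumption; the ADR only says that older versions are refused.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// LicenseFingerprint returns the hex-encoded SHA-256 of the license key,
// so the bundle can be correlated to an account without carrying the key.
func LicenseFingerprint(licenseKey string) string {
	sum := sha256.Sum256([]byte(licenseKey))
	return hex.EncodeToString(sum[:])
}

// CheckPolicyVersion rejects bundles redacted under a policy older than
// the one support tooling currently runs.
func CheckPolicyVersion(bundleVersion, deployedVersion int) error {
	if bundleVersion < deployedVersion {
		return fmt.Errorf("bundle redacted with policy v%d, deployed policy is v%d: regenerate the bundle",
			bundleVersion, deployedVersion)
	}
	return nil
}

func main() {
	fp := LicenseFingerprint("example-license-key") // placeholder, not a real key format
	fmt.Println("license fingerprint length:", len(fp))
	fmt.Println("policy check:", CheckPolicyVersion(3, 4))
}
```

A bare hash is only safe if license keys carry enough entropy to resist guessing; if they do not, a keyed hash (HMAC) would be the more conservative choice.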

Future work (post-v1)

  • Scheduled doctor. A cron-like schedule that runs doctor every hour and posts results to the notifications pipeline. Closes the loop on continuous health.
  • Backup restore-test job. Bundle a daily / weekly restore test into doctor's backup category. (Separate, larger feature — its own ADR.)
  • Bundle-to-S3. --to s3://bucket/path streaming upload with server-side encryption. Customers love this; security teams love it more.
  • UI dashboard. A "Health" tile that shows the last doctor run's status with drill-down. Renders the JSON report inline. "Run now" button.
  • Differential bundles. chainlaunch support-bundle --since 2026-05-12T00:00:00Z to capture only what changed since the last bundle. Critical for repeat tickets.
  • MCP tool integration. Expose doctor.run and support-bundle.generate as MCP tools (ChainLaunch already has an MCP server in pkg/mcp). Letting Claude run doctor on the user's behalf during a debugging conversation is a huge DX win.
  • Status-page integration. Wire the doctor summary into a public status page if the customer wants one ("show our last 30 days of doctor results").

Appendix A — Sample chainlaunch doctor invocations

# Quick health snapshot, exits non-zero on failure.
chainlaunch doctor
 
# Only the things I can fix in production right now.
chainlaunch doctor --category blockchain,backup --severity fail
 
# CI gate: blow up the pipeline if anything fails.
chainlaunch doctor --output junit > doctor.xml || { echo "::error::doctor failed"; exit 1; }
 
# Remote prod check from a laptop.
CHAINLAUNCH_API_URL=https://prod.example.com/api/v1 \
  chainlaunch doctor --remote --output json | jq '.summary'
 
# Watchdog: keep a status file fresh for an external probe.
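while sleep 60; do
  # Write to a temp file and rename so the probe never reads a partial report.
  chainlaunch doctor --output json > /var/lib/chainlaunch/health.json.tmp
  mv /var/lib/chainlaunch/health.json.tmp /var/lib/chainlaunch/health.json
done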

Appendix B — Sample chainlaunch support-bundle invocations

# Defaults: redacted, 20 MB log tails, every node, no encryption.
chainlaunch support-bundle
 
# Smaller bundle for a quick ticket — just config + compliance + doctor.
chainlaunch support-bundle --exclude logs --exclude metrics --log-tail-mb 0
 
# Encrypted for the support team.
chainlaunch support-bundle \
  --encrypt-with /usr/share/chainlaunch/support-team.asc \
  -o /tmp/ticket-12345.tar.gz.gpg
 
# CI artifact on every test failure.
chainlaunch support-bundle --manifest-only -o - > bundle-manifest.json
 
# Only the nodes that are misbehaving.
chainlaunch support-bundle --node peer0-org1 --node orderer0-org0

Appendix C — Acceptance criteria for v1 release

chainlaunch doctor ships when:

  • All checks in the host, process, db, network, blockchain, security, observability, backup categories have a passing unit test and at least one negative test.
  • doctor --output json emits a Report that parses with jq and validates against the Report schema on both a healthy and an unhealthy fixture.
  • The runner respects --budget (tested by injecting a 5s sleep check with a 1s budget and asserting TIMEOUT).
  • Doctor finishes under 30s wall-clock on the reference test instance (10 nodes, 4 networks, 2 chaincodes).
  • doctor_runs migration applies clean on a 0.41.x → 0.42.x upgrade.
  • CI on every PR runs chainlaunch doctor against a docker-composed reference instance and gates merge on STATUS != CRITICAL.

chainlaunch support-bundle ships when:

  • The "never leak" redactor test passes against the support corpus.
  • A 200 MB log tail redacts in under 1.0s on the reference hardware.
  • The bundle round-trips through tar tzf cleanly and the manifest's per-file SHA-256s verify.
  • --encrypt-with produces a file that decrypts cleanly with the matching private key (round-trip test in CI).
  • The docs include the "what's in the bundle" page with the redaction-rule table.