| Field | Value |
|---|---|
| Status | Proposed |
| Date | 2026-05-13 |
| Owner | Platform team |
| Reviewers | Security, Support, Docs |
| Supersedes | — |
| Superseded by | — |
| Issue | (to file) |
TL;DR
Add two top-level CLI commands to chainlaunch-pro:
- `chainlaunch doctor`: runs a fast, ordered battery of checks against the local instance and its data plane, prints a human-readable report, and exits non-zero when something is wrong. Designed to be the first thing a user runs before opening a support ticket.
- `chainlaunch support-bundle`: collects a redacted tarball of everything support needs to triage an issue: version metadata, host info, live doctor output, sanitized config, last-N MB of server logs, recent container logs per node, audit log tail, Prometheus metrics snapshot, and a redaction manifest. Designed to be the only thing a user attaches to a support ticket.
Both build on the existing pkg/compliance package, the pkg/troubleshooting connectivity service, and pkg/monitoring/diskspace — this is not a greenfield subsystem.
Motivation
"Run `chainlaunch doctor` and attach the bundle" is what every grown-up infrastructure platform tells customers in the first reply to a support ticket. New Relic, GitLab, Datadog Agent, Vault, Consul, Tailscale: all ship this. ChainLaunch does not.
Today the customer-support loop for ChainLaunch is:
1. Customer notices something broken (a node won't start, backups are failing, the UI is slow).
2. Customer files an issue with a vague description and a screenshot.
3. Support asks for: ChainLaunch version, host OS, output of `docker ps`, `df -h`, the last 500 log lines, the contents of `~/.chainlaunch/config.yaml`, recent audit events, whether the backup target is healthy, and whether the API can reach each peer.
4. Customer copy-pastes a subset, often missing the smoking gun. Tickets take 3–5 round-trips to even reproduce the issue.
5. Each round-trip risks the customer sending unredacted credentials, certs, or PII into the ticket system.
The two commands collapse step 3 into "paste the output of chainlaunch doctor" and step 4 into "attach the bundle." Round-trips drop, sensitive material is redacted at the source, and the customer can self-diagnose the easy 60% of cases before they ever file a ticket.
Equally important — and the reason this belongs in Pro rather than OSS — is the signal a doctor command sends to procurement. Buyers screenshot it. It is one of the cheapest credibility upgrades the platform can ship.
Goals
- Single command, deterministic exit code. `chainlaunch doctor` exits 0 if healthy, 1 if degraded, 2 if critical. CI-pipeline friendly, watchdog-friendly, status-page friendly.
- No new long-running infrastructure. Reuses `pkg/compliance`, `pkg/troubleshooting`, `pkg/monitoring`, and existing API endpoints. No new daemon, no new tables in v1.
- Air-gapped friendly. Both commands work without outbound internet. The bundle is a local tarball; the customer chooses how to ship it.
- Redaction by default, not by configuration. The bundle ships with redaction on; an explicit `--no-redact` flag is the only way to disable it, and it logs a prominent warning.
- Sub-30-second `doctor` runtime. Time budget: 30s total, 5s per check, with explicit timeouts. Past the budget, the check is reported as "TIMEOUT", not silently hung.
- Idempotent and side-effect free. `doctor` and `support-bundle` make zero writes to the data plane. They do persist a compliance scan row (the existing `pkg/compliance/service.go` already does this), nothing else.
- Pro-only. Aligns with the rest of the operational tooling. OSS users still get `chainlaunch compliance scan`, which is a strict subset.
Non-goals
- Not a replacement for Prometheus / Grafana / OpenTelemetry. Doctor is point-in-time; observability is continuous.
- Not a backup-restore tester. That is the separate "scheduled restore-test job" feature in Future work.
- Not a remediation engine. Doctor reports; it never restarts a node, rotates a cert, or changes config. Remediation hints are human-readable strings.
- Not a network-fixer. If the peer is unreachable, `doctor` says so; it does not try to repair the network.
- Not a UI feature in v1. CLI first. A "Run doctor" button in the web UI can come in v1.1; the API endpoint is designed to allow it.
Background — what already exists
A surprising amount of the substrate is already in place:
| Component | Location | Reused for |
|---|---|---|
| Network compliance checks (Fabric + Besu): orderer quorum, validator count, TLS, cert expiry, node health, endpoint conflicts, backup schedule | `pkg/compliance/checks.go` | The "blockchain" check category in doctor |
| Persisted scan history | `pkg/compliance/service.go` + `compliance_scans` table | Doctor history endpoint |
| Connection testing (TCP/HTTP/gRPC), certificate validation, port checks, ping | `pkg/troubleshooting/service.go` | The "network" check category |
| Disk-space monitor with thresholds | `pkg/monitoring/diskspace.go` | The "host" check category |
| Audit log + audit pruner | `pkg/audit/audit_pruner.go` | Bundle: audit tail extraction |
| Backup providers with `VerifyBackup` interface | `pkg/backups/provider/provider.go` | The "backup" check category |
| Notifications + webhooks | `pkg/notifications/`, `pkg/webhooks/` | The "notifications wired up" check |
| Disk metrics, Prometheus metrics handler | `pkg/metrics/prometheus.go` | Bundle: metrics snapshot |
| OIDC token refresher with admin audit | `pkg/sso/refresher.go` | The "SSO healthy" check |
| Server-side compliance handler at `/compliance/summary`, `/networks/{id}/compliance` | `pkg/compliance/handler.go` | The doctor command pulls from here when run remotely |
What's missing is (a) the host-level checks (chainlaunch process, file permissions, Docker daemon, time skew, config sanity), (b) a single ordered runner that aggregates everything, (c) the bundle format and the redactor, and (d) the CLI surface that ties it together.
Command surface
chainlaunch doctor
chainlaunch doctor [flags]
| Flag | Description | Default |
|---|---|---|
| `-o, --output` | `text` (TTY), `json`, `junit` | `text` if stdout is a TTY, else `json` |
| `--category` | Filter to one or more categories (`host`, `process`, `db`, `network`, `blockchain`, `security`, `observability`, `backup`). Repeatable or comma-separated. | all |
| `--include-network` | Network ID(s) to scope blockchain checks. Repeatable. | all networks |
| `--timeout` | Per-check timeout. | `5s` |
| `--budget` | Overall wall-clock budget. | `30s` |
| `--severity` | Minimum severity to print: `pass`, `warn`, `fail`. Always exits on fail. | `warn` |
| `--remote` | If set, runs against `CHAINLAUNCH_API_URL` instead of the local process. Host-level checks are skipped (the API can't see the host). | unset (local mode) |
| `--no-color` | Disable ANSI colors. Implied when `!isatty(stdout)`. | unset |
Exit codes
| Code | Meaning |
|---|---|
| 0 | All checks passed (no fail, no warn). |
| 1 | Warnings present, no failures. |
| 2 | One or more failures. |
| 3 | Doctor itself errored (e.g. couldn't read config, can't connect to API in --remote mode). |
chainlaunch support-bundle
chainlaunch support-bundle [flags]
| Flag | Description | Default |
|---|---|---|
| `-o, --output` | Output path. `-` writes to stdout (useful for piping). | `./chainlaunch-bundle-<timestamp>.tar.gz` |
| `--include` | Repeatable: `logs`, `metrics`, `audit`, `config`, `compliance`, `host`, `nodes`. | all |
| `--exclude` | Repeatable. Wins over `--include`. | none |
| `--log-tail-mb` | How many MB of recent logs to include per source. | `20` |
| `--audit-tail` | How many audit events to include. | `5000` |
| `--node` | Limit per-node container logs to these slugs/IDs. Repeatable. | all nodes |
| `--no-redact` | Dangerous. Skip redaction entirely. Logs a big red warning. | unset (redaction on) |
| `--redact-policy` | Path to a YAML file overriding the built-in redaction rules. | built-in |
| `--manifest-only` | Skip the heavy contents, write only the manifest. Fast preview. | unset |
| `--encrypt-with` | Path to a PGP public key. Output is `<name>.tar.gz.gpg`. | unset |
| `--ttl` | Set an envelope `expires_at` 24h from now in the manifest. Not enforced; informational. | `0` (no TTL) |
Exit codes
| Code | Meaning |
|---|---|
| 0 | Bundle written. |
| 1 | Bundle written but some optional sources failed (e.g. metrics endpoint unreachable). Failures recorded in the manifest. |
| 2 | Bundle could not be written. |
Check inventory
The check registry is a slice of Check values; each check declares its category, severity defaults, and whether it requires API access. The runner orders them by category, then by name, and runs each with the configured timeout.
Category: host
| Check | What it asserts | Severity if false |
|---|---|---|
| `host.os_supported` | OS is in the supported matrix (`linux/amd64`, `linux/arm64`, `darwin/arm64`, `darwin/amd64`). | warn |
| `host.kernel_version` | Kernel ≥ 5.x on Linux. | warn |
| `host.glibc_version` | glibc ≥ 2.31 on Linux (only meaningful for service mode). | warn |
| `host.cpu_count` | ≥ 2 cores (warn) or ≥ 4 cores (recommended). | warn |
| `host.memory_total` | ≥ 4 GB RAM (warn under 4 GB; fail under 2 GB). | warn / fail |
| `host.disk_free` | Reuses `monitoring/diskspace`. Fail under 10%; warn under 20%. | warn / fail |
| `host.swap_active` | Swap is configured (warn if absent on a low-RAM host). | warn |
| `host.time_skew` | `time.Now()` differs from a stable monotonic reference by < 1s (and from API if remote). | warn |
| `host.fd_limit` | `ulimit -n` ≥ 65535. | warn |
| `host.iptables_present` | `iptables` exists when `--mode docker` is in use. | warn |
| `host.docker_running` | `docker info` returns OK if any node is in docker mode. | fail |
| `host.docker_storage_driver` | Driver is `overlay2` or `btrfs`. | warn |
| `host.docker_disk_usage` | Docker's data-root is on a partition with enough headroom. | warn |
| `host.data_dir_perms` | `~/.chainlaunch` (or `--data` dir) is owned by the running user, mode ≤ 0750. | warn |
| `host.binary_path` | `chainlaunch` binary is on `PATH` and matches `Version` (catches stale-binary issues). | warn |
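The two-threshold checks in the table (`host.memory_total`, `host.disk_free`) share one shape: fail below a hard floor, warn below a softer one. A minimal sketch, assuming a hypothetical `classifyBelow` helper and a local copy of the severity constants:

```go
package main

import "fmt"

type Severity string

const (
	SeverityPass Severity = "PASS"
	SeverityWarn Severity = "WARN"
	SeverityFail Severity = "FAIL"
)

// classifyBelow returns FAIL when value is under failBelow, WARN when it is
// under warnBelow, and PASS otherwise. Hypothetical helper for illustration.
func classifyBelow(value, warnBelow, failBelow float64) Severity {
	switch {
	case value < failBelow:
		return SeverityFail
	case value < warnBelow:
		return SeverityWarn
	default:
		return SeverityPass
	}
}

func main() {
	fmt.Println(classifyBelow(71, 20, 10)) // disk: 71% free → PASS
	fmt.Println(classifyBelow(3.4, 4, 2))  // memory: 3.4 GB → WARN
	fmt.Println(classifyBelow(8, 20, 10))  // disk: 8% free → FAIL
}
```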
Category: process
| Check | What it asserts |
|---|---|
| `process.server_alive` | The local `chainlaunch serve` is responding on its configured port. (skipped in `--remote` mode) |
| `process.server_version_matches_cli` | API `/version` returns the same `GitCommit` as `version.GitCommit`. |
| `process.server_uptime` | Server has been up ≥ 60s (catches crash-loop). |
| `process.api_latency_p50` | A trivial `/health` request returns under 250 ms. |
| `process.outbound_dns` | Server can resolve `registry-1.docker.io` (warn-level for air-gapped customers). |
| `process.config_loadable` | Server's effective config is parseable (uses the existing config endpoint). |
| `process.pending_migrations` | `golang-migrate` reports no pending migrations. |
Category: db
| Check | What it asserts |
|---|---|
| `db.reachable` | SQLite file is readable, `PRAGMA integrity_check;` returns `ok`. |
| `db.foreign_keys_enabled` | `PRAGMA foreign_keys` is `1`. |
| `db.wal_size` | WAL file < 1 GB (signals stuck checkpoint). |
| `db.last_vacuum` | `VACUUM` ran in the last 30 days (informational). |
| `db.row_counts` | Row counts for the 10 largest tables; informational, surfaces explosive growth in audit / metrics tables. |
Category: network
Per-node and per-network checks via pkg/troubleshooting.
| Check | What it asserts |
|---|---|
| `network.node_reachable` | TCP connect to each node's listen address from the chainlaunch host. |
| `network.node_tls_handshake` | TLS handshake completes; the cert chain validates against the configured CA. |
| `network.node_grpc_healthy` | For Fabric peers, the gRPC health probe returns `SERVING`. |
| `network.node_rpc_healthy` | For Besu nodes, `eth_blockNumber` succeeds. |
| `network.peer_endpoint_overrides_consistent` | If `AddressOverrides` are configured, they resolve. |
| `network.docker_publish_collisions` | Reuses `pkg/system/http` port probe to confirm no other process holds a node's published port. |
Category: blockchain
This is the existing pkg/compliance.FabricChecks() / BesuChecks() — orderer quorum, validator count, TLS, cert expiry, node health, endpoint conflicts, backup schedule, org diversity. Doctor calls compliance.Service.ScanAllNetworks directly (no HTTP) when running locally, and the /compliance/summary endpoint when running --remote.
Added in this ADR:
| Check | What it asserts |
|---|---|
| `blockchain.chaincode_definition_drift` | For each Fabric channel, the latest local `chaincode_definitions` row matches the on-channel `_lifecycle:QueryChaincodeDefinitions` result. Reuses `GetCommittedChaincodes` from the chaincode CLI. |
| `blockchain.besu_block_height_drift` | All Besu nodes in a network report block heights within 5 blocks of each other. |
| `blockchain.fabric_channel_height_drift` | For each Fabric channel, all joined peers report channel-block-height within 2 blocks. |
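Both height-drift checks reduce to "the spread between the highest and lowest reported height stays within a tolerance". A minimal sketch with hypothetical helper names; how heights are fetched from each node is out of scope here, and treating the tolerance as inclusive is an assumption.

```go
package main

import "fmt"

// maxDrift returns the difference between the highest and lowest height.
// An empty or single-element slice has zero drift.
func maxDrift(heights []uint64) uint64 {
	if len(heights) == 0 {
		return 0
	}
	lo, hi := heights[0], heights[0]
	for _, h := range heights[1:] {
		if h < lo {
			lo = h
		}
		if h > hi {
			hi = h
		}
	}
	return hi - lo
}

// withinDrift reports whether all heights are within tolerance blocks of
// each other (5 for Besu, 2 for Fabric channels per the table above).
func withinDrift(heights []uint64, tolerance uint64) bool {
	return maxDrift(heights) <= tolerance
}

func main() {
	besu := []uint64{1204, 1203, 1207}
	fmt.Println(maxDrift(besu), withinDrift(besu, 5)) // 4 true

	fabric := []uint64{88, 91, 90}
	fmt.Println(maxDrift(fabric), withinDrift(fabric, 2)) // 3 false
}
```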
Category: security
| Check | What it asserts |
|---|---|
| `security.tls_listener_enabled` | The chainlaunch API itself is served over TLS (warn if not, in production-looking configs). |
| `security.default_admin_changed` | The local admin account's password hash is not the install default. |
| `security.api_key_count` | At least one non-default API key exists if any external integration is configured (sanity check). |
| `security.key_provider_health` | Each configured key provider (DB / AWS KMS / Vault) responds to a no-op probe. |
| `security.kms_credentials_present` | When AWS KMS is configured, the resolved credentials chain works (instance profile / SSO / static; uses the same chain backup uses). |
| `security.vault_token_alive` | When Vault is configured, the token can `auth/token/lookup-self` and has ≥ 7 days TTL left. |
| `security.encryption_key_set` | Encryption-at-rest master key is configured (looks at `pkg/encryption`). |
| `security.oidc_refresher_healthy` | When OIDC is configured, the refresher thread last ran < 30 min ago. |
Category: observability
| Check | What it asserts |
|---|---|
| `observability.metrics_port_listening` | Prometheus endpoint is up. |
| `observability.notifications_configured` | At least one channel (email / webhook / Slack) is configured. |
| `observability.notifications_test` | (Opt-in via `--include-network`) sends a `__doctor_test__` event end-to-end through the webhook pipeline. |
| `observability.audit_pruner_recent` | Audit pruner ran < retention-window ago. |
Category: backup
| Check | What it asserts |
|---|---|
| `backup.target_count` | ≥ 1 active backup target exists. |
| `backup.schedule_count` | ≥ 1 active backup schedule exists. |
| `backup.last_successful_within_sla` | Each target's most recent backup succeeded within `2 * schedule_interval`. |
| `backup.target_verify` | Calls `provider.VerifyBackup` against the latest backup for each target. Slowest check; gated behind `--include backup` if `--budget` is tight. |
| `backup.s3_reachable` | For S3 targets, can list with the configured credentials. |
| `backup.ebs_role_present` | For EBS targets, the resolved IAM role exists. |
Runner architecture
┌─────────────────────────────────────────────────────────────────┐
│ cmd/doctor/doctor.go │
│ ├─ parse flags │
│ ├─ build context (timeout, budget, severity, categories) │
│ ├─ call pkg/doctor.Service.Run(ctx, opts) │
│ └─ render(report, --output) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ pkg/doctor/service.go │
│ ├─ Registry (categories → []Check) │
│ ├─ Run(ctx, opts) returns *Report │
│ ├─ Each Check runs in its own goroutine, bounded by: │
│ │ semaphore (limit 8 concurrent), │
│ │ per-check timeout, │
│ │ overall budget (cancels the parent ctx). │
│ └─ Deterministic ordering of the rendered output regardless │
│ of completion order. │
└─────────────────────────────────────────────────────────────────┘
│
┌─────────────────────┼─────────────────────┐
▼ ▼ ▼
┌───────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ pkg/doctor/ │ │ pkg/compliance │ │ pkg/troubleshoot │
│ checks/ │ │ (existing) │ │ ing (existing) │
│ host/ │ │ → blockchain │ │ → network │
│ process/ │ │ checks │ │ checks │
│ db/ │ └──────────────────┘ └──────────────────┘
│ security/ │
│ backup/ │
│ observ/ │
└───────────────┘
Types
package doctor
type Severity string
const (
SeverityPass Severity = "PASS"
SeverityWarn Severity = "WARN"
SeverityFail Severity = "FAIL"
SeverityTimeout Severity = "TIMEOUT"
SeveritySkipped Severity = "SKIPPED"
)
type Category string
const (
CategoryHost Category = "host"
CategoryProcess Category = "process"
CategoryDB Category = "db"
CategoryNetwork Category = "network"
CategoryBlockchain Category = "blockchain"
CategorySecurity Category = "security"
CategoryObservability Category = "observability"
CategoryBackup Category = "backup"
)
type CheckResult struct {
Name string `json:"name"` // e.g. "host.disk_free"
Category Category `json:"category"`
Severity Severity `json:"severity"`
Message string `json:"message"` // one-line summary
Details map[string]any `json:"details,omitempty"`
Remediation string `json:"remediation,omitempty"`
DurationMS int64 `json:"duration_ms"`
}
type Check interface {
Name() string
Category() Category
RequiresAPI() bool // true → skipped when local API is unavailable
Run(ctx context.Context, env *Env) CheckResult
}
type Env struct {
Logger *logger.Logger
DB *sql.DB
Queries *db.Queries
NetworksService *networksservice.NetworkService
NodesService *nodesservice.NodeService
ComplianceSvc *compliance.Service
TroubleshootSvc *troubleshooting.Service
APIClient *common.Client // nil in local mode
DataDir string
Mode Mode // Local | Remote
}
type Report struct {
Version string `json:"version"` // chainlaunch version
GitCommit string `json:"git_commit"`
StartedAt time.Time `json:"started_at"`
DurationMS int64 `json:"duration_ms"`
Status OverallStatus `json:"status"` // HEALTHY|DEGRADED|AT_RISK|CRITICAL
Summary Summary `json:"summary"`
Categories []CategoryReport `json:"categories"`
}
Concurrency & timeouts
- A `chan struct{}` semaphore of size 8 caps parallelism. Even on a 64-core host, exploding parallelism would just thrash the SQLite WAL.
- Each check gets `ctx, cancel := context.WithTimeout(parent, opts.PerCheckTimeout)`. The parent has the wall budget. If the wall budget fires, the runner cancels everything in flight and emits `TIMEOUT` for unfinished checks.
- Network calls inside checks must honor `ctx`. We'll add a `gosec` rule (or `staticcheck`) gate that flags `http.Get` / `net.Dial` in check code.
- Each check's `Run` returns within its own timeout or panics; the runner recovers panics and records them as `FAIL` with the recovered error.
Determinism
Checks complete out of order, but the rendered output is sorted by (category, name) for stable diffs across runs. The JSON output mirrors this so a CI job comparing two reports can diff them directly.
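A minimal sketch of the stable ordering, assuming plain lexical comparison on both keys. That comparator is an assumption: the sample text output lists categories in registry order (host, process, ...), so the real runner may rank categories by a fixed order instead. `CheckResult` here is a trimmed stand-in for the type defined under Types.

```go
package main

import (
	"fmt"
	"sort"
)

// CheckResult is a trimmed stand-in carrying only the two sort keys.
type CheckResult struct {
	Name     string
	Category string
}

// sortResults orders results by category, then by name, so that two runs
// render identically regardless of the order in which checks finished.
func sortResults(rs []CheckResult) {
	sort.Slice(rs, func(i, j int) bool {
		if rs[i].Category != rs[j].Category {
			return rs[i].Category < rs[j].Category
		}
		return rs[i].Name < rs[j].Name
	})
}

func main() {
	rs := []CheckResult{
		{Name: "host.disk_free", Category: "host"},
		{Name: "db.reachable", Category: "db"},
		{Name: "host.cpu_count", Category: "host"},
	}
	sortResults(rs)
	for _, r := range rs {
		fmt.Println(r.Category, r.Name)
	}
}
```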
Output formats
--output text (default for TTY)
chainlaunch doctor — v0.42.1 (a3f8b2c) — host: ip-10-0-1-23 — 2026-05-13T08:42:11Z
HOST 4/5 pass, 1 warn
✓ host.os_supported linux/amd64
✓ host.cpu_count 8 cores
! host.memory_total 3.4 GB available, ≥ 4 GB recommended
✓ host.disk_free 71% free on /var/lib/chainlaunch
✓ host.docker_running server 25.0.3, overlay2
PROCESS 7/7 pass
✓ process.server_alive http://localhost:8100 (uptime 4d 2h)
✓ process.api_latency_p50 41 ms
✓ process.pending_migrations none
…
BLOCKCHAIN 2/3 pass, 1 fail
✓ blockchain.orderer_quorum (network: prod-supply-chain) 3 orderers, tolerates 1
✗ blockchain.cert_expiry (network: prod-supply-chain) peer0.org2 TLS expires in 6 days
→ Renew with: POST /api/v1/nodes/{id}/certificates/renew
✓ blockchain.fabric_channel_height_drift all peers within 1 block
BACKUP 3/3 pass
✓ backup.last_successful_within_sla last S3 backup 2h 14m ago (SLA 6h)
…
───────────────────────────────────────────────────────────────────────
STATUS: AT_RISK · 28 pass · 2 warn · 1 fail · 0 timeout · 2.4s
───────────────────────────────────────────────────────────────────────
Run with --output json for machine-readable output, or
chainlaunch support-bundle to package everything for support.
--output json
{
"version": "0.42.1",
"git_commit": "a3f8b2c",
"started_at": "2026-05-13T08:42:11Z",
"duration_ms": 2412,
"status": "AT_RISK",
"summary": {"total": 31, "pass": 28, "warn": 2, "fail": 1, "timeout": 0},
"categories": [
{
"name": "host",
"summary": {"total": 5, "pass": 4, "warn": 1, "fail": 0, "timeout": 0},
"checks": [
{
"name": "host.memory_total",
"category": "host",
"severity": "WARN",
"message": "3.4 GB available, ≥ 4 GB recommended",
"details": {"available_gb": 3.4, "recommended_gb": 4},
"remediation": "Bump host RAM to 4 GB+.",
"duration_ms": 8
}
]
}
]
}
--output junit
XML matching the JUnit Surefire schema. One <testsuite> per category, one <testcase> per check. WARN becomes <testcase> with a <system-err> block; FAIL becomes <failure>. Lets CI dashboards (GitHub Actions, GitLab, CircleCI) render doctor results natively.
Support-bundle format
A gzipped tar with a fixed layout. The root directory inside the tarball is named chainlaunch-bundle-<timestamp>-<git_commit> so two bundles never collide when unpacked side-by-side.
chainlaunch-bundle-20260513-084211-a3f8b2c/
├── manifest.json # always first; signed in v1.1
├── README.txt # plain-text "how to read this bundle"
├── doctor.json # output of `chainlaunch doctor --output json`
├── doctor.txt # output of `chainlaunch doctor --output text --no-color`
├── version.json # version, git_commit, build_time, OS, arch, container?
├── host/
│ ├── os.json # uname, /etc/os-release
│ ├── cpu.json # cores, model
│ ├── mem.json # total, free, swap
│ ├── disk.json # df -h, df -i (inodes), data-dir size
│ ├── network.json # interfaces, routes, /etc/resolv.conf (redacted)
│ ├── docker.json # docker version, info, ps --all (no env)
│ └── ulimits.json # max files, max procs
├── config/
│ ├── effective.yaml # serialized effective config (REDACTED)
│ └── env.json # CHAINLAUNCH_* env vars (REDACTED)
├── audit/
│ └── recent.jsonl # last N audit events (REDACTED)
├── compliance/
│ ├── summary.json # org-wide scan
│ └── networks/<id>.json # per-network scan
├── nodes/
│ └── <slug>/
│ ├── meta.json # node row, redacted
│ ├── logs.txt # last log-tail-mb of container/service logs
│ └── docker-inspect.json # docker inspect (REDACTED)
├── chaincodes/
│ └── <slug>/
│ ├── meta.json
│ ├── timeline.json # from the existing /timeline endpoint
│ └── logs.txt # last N MB of chaincode container logs
├── server/
│ ├── chainlaunch.log # last log-tail-mb of the server log
│ ├── chainlaunch.log.1 # rotated prior log if recent
│ └── metrics-snapshot.prom # one-shot Prometheus scrape
├── db/
│ ├── schema.sql # output of `.schema` — no data
│ ├── integrity.txt # PRAGMA integrity_check
│ ├── migrations.json # applied migration versions
│ └── row-counts.json # row counts per table
└── redaction.json # what got redacted and the rule that matched
manifest.json
{
"bundle_version": "1",
"chainlaunch_version": "0.42.1",
"git_commit": "a3f8b2c",
"build_time": "2026-05-09T13:22:00Z",
"generated_at": "2026-05-13T08:42:11Z",
"expires_at": "2026-05-14T08:42:11Z", // when --ttl is set
"host_id": "sha256:64hex-of-hostname-uuid",
"os": "linux",
"arch": "amd64",
"in_container": false,
"license_id": "sha256:64hex", // SHA-256 of license key, never the key itself
"instance_id": "sha256:64hex",
"redaction": {
"enabled": true,
"policy": "built-in",
"policy_version": "1",
"matches": 47
},
"contents": [
{"path": "doctor.json", "bytes": 12834, "sha256": "…"},
{"path": "nodes/peer0-org1/logs.txt", "bytes": 21037184, "sha256": "…"},
…
],
"errors": [
{"source": "metrics-snapshot", "error": "connection refused on :9090"}
]
}
Redaction rules (built-in v1)
Every text file passes through a streaming redactor. The redactor is one pass with a fixed pattern list (no user regex injection in v1). Each match is replaced with <redacted:<rule-id>> and recorded in redaction.json with the source file, byte offset, and rule ID.
| Rule ID | Pattern | Notes |
|---|---|---|
| `r.api-key.clpro` | `clpro_[A-Za-z0-9]{32,}` | ChainLaunch API keys. |
| `r.api-key.bearer` | `Bearer\s+[A-Za-z0-9._-]{20,}` | Bearer tokens in headers. |
| `r.basic-auth` | `Authorization:\s*Basic\s+[A-Za-z0-9+/=]+` | Base64-encoded basic auth. |
| `r.private-key.pem` | `-----BEGIN (?:RSA \|EC \|ENCRYPTED )?PRIVATE KEY-----[\s\S]*?-----END [^-]+-----` | All PEM private keys. |
| `r.certificate.pem` | `-----BEGIN CERTIFICATE-----[\s\S]*?-----END CERTIFICATE-----` | Replaced with a one-line summary (subject + issuer + NotAfter), not removed. Customers and support both want the metadata. |
| `r.password.config` | YAML/JSON keys matching `(?i)(password\|passwd\|secret\|token\|api[_-]?key\|client[_-]?secret)`. | Value replaced; key kept so the structure is debuggable. |
| `r.aws-credential` | `AKIA[0-9A-Z]{16}` and `aws_secret_access_key\s*=\s*[A-Za-z0-9/+=]{40}` | |
| `r.vault-token` | `hvs\.[A-Za-z0-9_-]{24,}` or `s\.[A-Za-z0-9]{24,}` | Vault token formats. |
| `r.gcp-sa` | `"private_key":\s*"-----BEGIN PRIVATE KEY-----[\s\S]*?-----END PRIVATE KEY-----"` | GCP service account JSON. |
| `r.email` | RFC 5322 email. | Optional rule, off by default; emails matter for audit traceability and customers usually want them. Documented; tunable. |
| `r.ip.public` | Any IPv4 / IPv6 not in RFC 1918 / RFC 4193. | Optional, off by default; usually customers want to share these so support can grep. |
| `r.docker.env-var` | The `Env:` block of `docker inspect` output. | Replaced with `<redacted:r.docker.env-var>` because env vars often contain secrets. |
| `r.path.home` | The chainlaunch user's `$HOME` and absolute paths under it. | Replaced with `~`. |
| `r.uuid.session` | Session cookies. | |
Two non-pattern rules:
- YAML structural redaction. The effective-config emitter knows which fields are sensitive (encryption-at-rest master key, KMS credentials, OIDC client secret, etc.) and never serializes them; they appear as `<redacted-by-field>`. This is more reliable than regex for structured data.
- JSON structural redaction for `docker inspect` output: the `Env` array, `HostConfig.Env`, and `Args` are wholesale removed before the regex pass.
The redactor processes files in streaming chunks (line-oriented for .txt/.log, parsed-then-serialized for .json/.yaml). A 1 GB log tail must never load entirely into RAM.
--no-redact
Disables the redactor entirely. The CLI prints a 6-line yellow banner before writing the bundle and the manifest records redaction.enabled = false. Any support ticket that comes in with a --no-redact bundle gets a polite "please regenerate without that flag" reply.
--encrypt-with
After tarballing, the file is encrypted with the supplied PGP public key using gopg. Output extension becomes .tar.gz.gpg. Support has a published team key — the standard install includes a chainlaunch support-bundle --encrypt-with /usr/share/chainlaunch/support-team.asc shortcut.
Performance budgets
| Operation | Budget |
|---|---|
| `doctor` total wall-clock | 30 s |
| Per-check timeout | 5 s |
| `support-bundle` total wall-clock (default tail sizes) | 90 s |
| Tarball size (default tails) | < 200 MB compressed |
| Redaction throughput | ≥ 200 MB/s per core (streaming, line-oriented) |
| Memory ceiling for redactor | 256 MB |
Doctor runs faster than 30s in practice — the budget exists to keep one slow remote check (a hung peer) from blocking the report.
RBAC & permissions
- CLI `doctor` local mode: must be run as the chainlaunch process user. No additional auth.
- CLI `doctor --remote`: uses the normal `--auth-username` / `--auth-password` / `--auth-bearer`. Requires the new `system:doctor:run` permission.
- API `/system/doctor`: same permission, plus `auth.WithPermission(auth.PermissionSystemDoctor)`.
- CLI `support-bundle`: must be run as the chainlaunch process user. Bundle output is written with 0600 perms. No remote variant in v1; too easy to leak.
Add to pkg/auth/permissions.go:
const (
PermissionSystemDoctor Permission = "system:doctor:run"
PermissionSystemSupportBundle Permission = "system:support-bundle:generate"
)
Granted to ADMIN by default. OPERATOR gets `system:doctor:run` but not the support-bundle permission (the bundle includes audit data that operators shouldn't extract). VIEWER gets neither.
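The default grants can be sketched as a lookup table. The `rolePermissions` map and `hasPermission` helper are hypothetical and only illustrate the ADMIN / OPERATOR / VIEWER split described above; the existing `pkg/auth` package may model roles differently.

```go
package main

import "fmt"

type Permission string

const (
	PermissionSystemDoctor        Permission = "system:doctor:run"
	PermissionSystemSupportBundle Permission = "system:support-bundle:generate"
)

// rolePermissions holds the default grants for the two new permissions.
// Illustrative only; not the real role model.
var rolePermissions = map[string][]Permission{
	"ADMIN":    {PermissionSystemDoctor, PermissionSystemSupportBundle},
	"OPERATOR": {PermissionSystemDoctor},
	"VIEWER":   {},
}

// hasPermission reports whether role is granted p by default.
// Unknown roles get nothing.
func hasPermission(role string, p Permission) bool {
	for _, granted := range rolePermissions[role] {
		if granted == p {
			return true
		}
	}
	return false
}

func main() {
	for _, role := range []string{"ADMIN", "OPERATOR", "VIEWER"} {
		fmt.Printf("%-8s doctor=%v bundle=%v\n", role,
			hasPermission(role, PermissionSystemDoctor),
			hasPermission(role, PermissionSystemSupportBundle))
	}
}
```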
API surface
POST /api/v1/system/doctor
POST /api/v1/system/doctor
Content-Type: application/json
{
"categories": ["blockchain", "backup"],
"per_check_timeout_ms": 5000,
"budget_ms": 30000
}
Response is the same Report JSON the CLI emits. Synchronous in v1 (doctor is fast enough). Future v2: returns a job_id for long-running variants.
GET /api/v1/system/doctor/history
Returns the last N runs, leveraging the existing `compliance_scans` table plus a new lightweight `doctor_runs` table (just `id`, `started_at`, `duration_ms`, `status`, `summary_json`, `report_json`, `created_by`).
POST /api/v1/system/support-bundle
Not exposed in v1. The bundle is local-disk only. The UI may shell out to the binary in v1.1 but the HTTP endpoint waits until v2 (encrypted-streaming-to-S3 is a different feature).
Data model changes
One small table only:
-- pkg/db/migrations/0046_add_doctor_runs.up.sql
CREATE TABLE IF NOT EXISTS doctor_runs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
started_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
duration_ms INTEGER NOT NULL,
status TEXT NOT NULL, -- HEALTHY|DEGRADED|AT_RISK|CRITICAL|ERROR
summary_json TEXT NOT NULL, -- {"total": …, "pass": …, …}
report_json TEXT NOT NULL, -- full Report; capped at 1 MB; truncated otherwise
created_by INTEGER REFERENCES users(id) ON DELETE SET NULL
);
CREATE INDEX IF NOT EXISTS idx_doctor_runs_started_at
ON doctor_runs(started_at DESC);
Pruned by the existing `audit_pruner` mechanism (extended to know about `doctor_runs`). Retention default: 90 days, configurable.
Package layout
pkg/doctor/
├── service.go # runner: takes Env + opts, returns *Report
├── types.go # Check, CheckResult, Report, etc.
├── registry.go # Registers all built-in checks; extension point
├── redactor/
│ ├── redactor.go # streaming engine
│ ├── rules.go # built-in rules
│ ├── rules_test.go
│ └── policy.go # YAML policy loader (--redact-policy)
├── bundle/
│ ├── builder.go # walks the rules, writes the tarball
│ ├── manifest.go # manifest schema + writer
│ ├── encrypt.go # PGP wrapper (--encrypt-with)
│ └── builder_test.go
└── checks/
├── host/ # one file per check or per subgroup
├── process/
├── db/
├── network/ # thin wrappers over pkg/troubleshooting
├── blockchain/ # thin wrappers over pkg/compliance + new drift checks
├── security/
├── observability/
└── backup/ # thin wrappers over pkg/backups
cmd/doctor/
├── doctor.go # top-level command
├── support_bundle.go # `chainlaunch support-bundle`
├── render_text.go
├── render_json.go
└── render_junit.go
Wire into cmd/root.go:
rootCmd.AddCommand(doctor.NewDoctorCmd(appLogger))
rootCmd.AddCommand(doctor.NewSupportBundleCmd(appLogger))
Documentation
New documentation deliverables:
- `chainlaunch-docs/docs/operations/doctor.md`: how-to + reference, including the exit-code semantics and CI integration recipes.
- `chainlaunch-docs/docs/operations/support-bundle.md`: what's in the bundle, what's redacted, how to inspect it before sending.
- Update `chainlaunch-pro-cli.md` with a new `doctor` and `support-bundle` section.
- A docs page specifically for support engineers at `chainlaunch-docs/docs/internal/triage.md`: "open the bundle, look at these five files in this order." Internal-but-public so customers can read it too; that itself is a trust signal.
Rollout plan
| Phase | What ships | Customer-visible? |
|---|---|---|
| v0: design | This ADR. | No |
| v1: doctor local mode | `chainlaunch doctor`, JSON + text output, all host/process/db/network checks, reuse of compliance for blockchain checks. No bundle yet. New permission added. | Yes, in next minor release. Beta-flagged. |
| v1.1: support-bundle | `chainlaunch support-bundle`, full tarball, built-in redactor, manifest, PGP option. No remote bundle. UI button → shells out to binary. | Yes. |
| v1.2: doctor `--remote` | Doctor works over the API. `/api/v1/system/doctor` endpoint. JUnit output. `doctor_runs` table + history endpoint. | Yes. |
| v2: bundle to S3 | `chainlaunch support-bundle --to s3://…` streaming upload. Job-based async doctor. UI dashboard tile. | Yes. |
Each phase is shippable on its own. v1 is the highest-leverage; v1.1 is the credibility upgrade.
Testing strategy
| Layer | Test |
|---|---|
| Unit — checks | One table-driven test per check, hand-crafted Env fixtures. Run under -race. |
| Unit — redactor | Golden files. Every rule has a before.txt / after.txt pair. Adding a rule without a fixture fails CI. |
| Integration | A doctor_integration_test.go boots a real serve against test-1.db, runs doctor, asserts on the report. |
| Smoke | A support_bundle_smoke_test.go generates a bundle in /tmp, extracts it, walks every file, asserts the redactor banned-string list does not appear. |
| Performance | Benchmarks for the redactor at 1 GB of synthetic log input. Asserts ≥ 200 MB/s/core. |
| E2E | A Playwright test in web/ clicks the "Run doctor" button (v1.1) and verifies the report renders. |
| Security | The "doctor against a configured-but-broken instance" test: malformed config, expired certs, dead Vault. Must not crash, must report each. |
The most important test is the redactor "never leak" test: a corpus of real config files contributed by support is run through the redactor and the output is grep'd for every known secret prefix (clpro_, AKIA, hvs., -----BEGIN). Zero matches = ship.
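The core of the "never leak" check can be sketched as a scan of redacted output for the secret prefixes listed above. This is a minimal illustration, not the actual redactor test: the names bannedPrefixes and findLeaks are hypothetical, and a real test would walk every file of the support corpus rather than a single string.

```go
package main

import (
	"fmt"
	"strings"
)

// bannedPrefixes mirrors the secret prefixes named in the testing strategy.
var bannedPrefixes = []string{"clpro_", "AKIA", "hvs.", "-----BEGIN"}

// findLeaks returns one "line N: prefix" entry for every banned prefix
// that still appears in the redacted output. (Hypothetical helper.)
func findLeaks(redacted string) []string {
	var leaks []string
	for i, line := range strings.Split(redacted, "\n") {
		for _, p := range bannedPrefixes {
			if strings.Contains(line, p) {
				leaks = append(leaks, fmt.Sprintf("line %d: %s", i+1, p))
			}
		}
	}
	return leaks
}

func main() {
	clean := "api_key: [REDACTED]\nregion: eu-west-1"
	dirty := "api_key: clpro_abc123\naws: AKIAEXAMPLE"
	// Zero leaks in the clean sample, two in the dirty one.
	fmt.Println(len(findLeaks(clean)), len(findLeaks(dirty)))
}
```

Reporting the line number and the matched prefix, rather than the secret itself, keeps the test log safe to paste into CI output.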
Security considerations
- Bundle is dangerous by definition. Even redacted, it tells an attacker a lot: host IPs (unless r.ip.public is enabled), software versions, network topology. The CLI writes it 0600, records an expires_at in the manifest, and the docs are explicit about treating the bundle like a credential.
- Redaction is best-effort. No regex catches every secret. The --encrypt-with flag + a published support PGP key is the belt-and-suspenders defense.
- Doctor checks must not log secrets. Code review checklist: every check function is reviewed for log.Printf of any value that crosses a key/secret/credential boundary.
- The process.outbound_dns check must use a configurable target. Hard-coding registry-1.docker.io is fine for the default; air-gapped customers will override it via a config mechanism similar to --redact-policy or a future chainlaunch.yaml block.
- --no-redact writes a sentinel to manifest.json so we can refuse to process such bundles in our own ticket-handling tools.
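The intake side of the sentinel and expiry rules above might look like the following sketch. The JSON field names no_redact and expires_at, the Manifest type, and the admit function are assumptions made for illustration; the ADR fixes only the behaviour, not the manifest schema.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// Manifest holds only the fields this sketch needs. The JSON names
// no_redact and expires_at are assumptions, not a published schema.
type Manifest struct {
	NoRedact  bool      `json:"no_redact"`
	ExpiresAt time.Time `json:"expires_at"`
}

var (
	ErrUnredacted = errors.New("bundle was generated with --no-redact")
	ErrExpired    = errors.New("bundle is past its expires_at")
)

// admit decides whether ticket-handling tooling may process a bundle.
// A missing expires_at is treated as expired so the gate fails closed.
func admit(manifestJSON []byte, now time.Time) error {
	var m Manifest
	if err := json.Unmarshal(manifestJSON, &m); err != nil {
		return fmt.Errorf("parse manifest: %w", err)
	}
	if m.NoRedact {
		return ErrUnredacted
	}
	if m.ExpiresAt.IsZero() || now.After(m.ExpiresAt) {
		return ErrExpired
	}
	return nil
}

// mustTime parses an RFC 3339 timestamp and panics on bad input (demo only).
func mustTime(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	now := mustTime("2026-05-20T00:00:00Z")
	redacted := []byte(`{"no_redact":false,"expires_at":"2026-06-13T00:00:00Z"}`)
	raw := []byte(`{"no_redact":true,"expires_at":"2026-06-13T00:00:00Z"}`)
	fmt.Println(admit(redacted, now))
	fmt.Println(admit(raw, now))
}
```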
Alternatives considered
| Alternative | Why rejected |
|---|---|
| Just shell out to a script. A support.sh bundled with the install. | No structured output, hard to test, hard to redact reliably, doesn't surface in the UI. |
| Bundle without redaction; rely on the customer to scrub. | Customers paste bundles into Slack within 30 seconds. Redaction must be the default. |
| Make doctor a SaaS service that the binary phones home to. | Defeats the air-gapped/self-hosted promise. Hard pass. |
| Cover this with Prometheus alerting only. | Different shape: Prometheus is "is something broken right now," doctor is "is this install correctly set up." Both matter; they're not substitutes. |
| Vault-style vault operator diagnose clone (exact behavior parity). | Vault is more constrained: single-process, single-binary. ChainLaunch's data plane (Docker, peers, KMS, backups, OIDC) is much wider. Doctor's check inventory reflects that. |
Open questions
- --remote mode and host checks. The API can't see the host. Do we hide host checks entirely in --remote mode, or show them as SKIPPED with a reason? Proposal: SKIPPED with the reason so the report layout matches between modes.
- Doctor and the audit log. Should each doctor run write an audit event? Proposal: yes for --remote runs (it's an authenticated user action), no for local-only runs (it's a process-user action, gets a doctor_runs row).
- Bundle and licensing. Should the bundle include the license object? Proposal: include a SHA-256 of the license key, never the key itself. The hash lets support correlate the bundle to the customer account.
- Redaction policy versioning. When we add a rule, do bundles generated against an old policy version need re-redaction before processing? Proposal: yes, support tooling refuses bundles whose redaction.policy_version is older than the currently-deployed policy.
- macOS notarization for support-bundle. The PGP path drags in CGO bindings on some platforms. We may need to use a pure-Go OpenPGP library (e.g. github.com/ProtonMail/go-crypto) to keep the binary CGO-free.
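The SKIPPED proposal for --remote mode can be illustrated with a small runner sketch. Everything here is hypothetical: the Check and Result shapes, the check IDs host.disk_space and db.reachable, and the reason string are invented for the example and do not describe the real check inventory.

```go
package main

import "fmt"

// Result and Check are hypothetical shapes used only for this sketch.
type Result struct {
	ID     string
	Status string // e.g. "OK", "FAIL", "SKIPPED"
	Reason string
}

type Check struct {
	ID       string
	Category string
	Run      func() Result
}

// runChecks executes checks in order. In remote mode, host checks are not
// run; they are reported as SKIPPED so both modes list the same check IDs.
func runChecks(checks []Check, remote bool) []Result {
	results := make([]Result, 0, len(checks))
	for _, c := range checks {
		if remote && c.Category == "host" {
			results = append(results, Result{
				ID:     c.ID,
				Status: "SKIPPED",
				Reason: "host is not visible over the API",
			})
			continue
		}
		results = append(results, c.Run())
	}
	return results
}

// sampleChecks returns two placeholder checks; the IDs are invented.
func sampleChecks() []Check {
	ok := func(id string) func() Result {
		return func() Result { return Result{ID: id, Status: "OK"} }
	}
	return []Check{
		{ID: "host.disk_space", Category: "host", Run: ok("host.disk_space")},
		{ID: "db.reachable", Category: "db", Run: ok("db.reachable")},
	}
}

func main() {
	for _, r := range runChecks(sampleChecks(), true) {
		fmt.Printf("%-16s %-8s %s\n", r.ID, r.Status, r.Reason)
	}
}
```

Because the skipped entries keep their position and ID, a renderer or a diff between a local and a remote report sees the same rows in both modes.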
Future work (post-v1)
- Scheduled doctor. A cron-like schedule that runs doctor every hour and posts results to the notifications pipeline. Closes the loop on continuous health.
- Backup restore-test job. Bundle a daily / weekly restore test into doctor's backup category. (Separate, larger feature with its own ADR.)
- Bundle-to-S3. --to s3://bucket/path streaming upload with server-side encryption. Customers love this; security teams love it more.
- UI dashboard. A "Health" tile that shows the last doctor run's status with drill-down. Renders the JSON report inline. "Run now" button.
- Differential bundles. chainlaunch support-bundle --since 2026-05-12T00:00:00Z to capture only what changed since the last bundle. Critical for repeat tickets.
- MCP tool integration. Expose doctor.run and support-bundle.generate as MCP tools (there is already an MCP server in pkg/mcp). Letting Claude run doctor on the user's behalf during a debugging conversation is a huge DX win.
- Status-page integration. Wire the doctor summary into a public status page if the customer wants one ("show our last 30 days of doctor results").
Appendix A — Sample chainlaunch doctor invocations
# Quick health snapshot, exits non-zero on failure.
chainlaunch doctor
# Only the things I can fix in production right now.
chainlaunch doctor --category blockchain,backup --severity fail
# CI gate: blow up the pipeline if anything fails.
chainlaunch doctor --output junit > doctor.xml || { echo "::error::doctor failed"; exit 1; }
# Remote prod check from a laptop.
CHAINLAUNCH_API_URL=https://prod.example.com/api/v1 \
chainlaunch doctor --remote --output json | jq '.summary'
# Watchdog: keep a status file fresh for an external probe.
while sleep 60; do
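  # Write to a temp file, then rename, so the probe never reads a partial report.
  chainlaunch doctor --output json > /var/lib/chainlaunch/health.json.tmp
  mv /var/lib/chainlaunch/health.json.tmp /var/lib/chainlaunch/health.json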
done
Appendix B — Sample chainlaunch support-bundle invocations
# Defaults: redacted, 20 MB log tails, every node, no encryption.
chainlaunch support-bundle
# Smaller bundle for a quick ticket — just config + compliance + doctor.
chainlaunch support-bundle --exclude logs --exclude metrics --log-tail-mb 0
# Encrypted for the support team.
chainlaunch support-bundle \
--encrypt-with /usr/share/chainlaunch/support-team.asc \
-o /tmp/ticket-12345.tar.gz.gpg
# CI artifact on every test failure.
chainlaunch support-bundle --manifest-only -o - > bundle-manifest.json
# Only the nodes that are misbehaving.
chainlaunch support-bundle --node peer0-org1 --node orderer0-org0
Appendix C — Acceptance criteria for v1 release
A chainlaunch doctor ships when:
- All checks in the host, process, db, network, blockchain, security, observability, backup categories have a passing unit test and at least one negative test.
- doctor --output json | jq returns a schema-valid Report on a healthy and an unhealthy fixture.
- The runner respects --budget (tested by injecting a 5s sleep check with a 1s budget and asserting TIMEOUT).
- Doctor finishes under 30s wall-clock on the reference test instance (10 nodes, 4 networks, 2 chaincodes).
- doctor_runs migration applies clean on a 0.41.x → 0.42.x upgrade.
- CI on every PR runs chainlaunch doctor against a docker-composed reference instance and gates merge on STATUS != CRITICAL.
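The --budget criterion above (a slow check must surface as TIMEOUT) can be sketched with a context deadline. This is an assumption-laden illustration: the Check signature, the Status values other than TIMEOUT, and the helper names runWithBudget and sleepCheck are hypothetical, and the durations are scaled down from the 5s/1s pair in the criterion.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Status is a hypothetical result type; the ADR names TIMEOUT but does not
// fix the full set of values.
type Status string

const (
	StatusOK      Status = "OK"
	StatusFail    Status = "FAIL"
	StatusTimeout Status = "TIMEOUT"
)

// Check is an assumed signature: a check receives a context and should stop
// when that context is cancelled.
type Check func(ctx context.Context) error

// runWithBudget runs one check and reports TIMEOUT when it overruns budget.
func runWithBudget(check Check, budget time.Duration) Status {
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()

	// Buffered so a check that finishes late can still send and exit.
	done := make(chan error, 1)
	go func() { done <- check(ctx) }()

	select {
	case err := <-done:
		switch {
		case err == nil:
			return StatusOK
		case errors.Is(err, context.DeadlineExceeded):
			return StatusTimeout
		default:
			return StatusFail
		}
	case <-ctx.Done():
		return StatusTimeout
	}
}

// sleepCheck simulates a slow check that honours cancellation.
func sleepCheck(d time.Duration) Check {
	return func(ctx context.Context) error {
		select {
		case <-time.After(d):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

func main() {
	// Scaled-down version of the "5s sleep, 1s budget" acceptance test.
	fmt.Println(runWithBudget(sleepCheck(500*time.Millisecond), 100*time.Millisecond))
	fmt.Println(runWithBudget(sleepCheck(10*time.Millisecond), 500*time.Millisecond))
}
```

Mapping a returned context.DeadlineExceeded to TIMEOUT as well covers the case where a well-behaved check notices the deadline and returns before the runner's own select fires.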
A chainlaunch support-bundle ships when:
- The "never leak" redactor test passes against the support corpus.
- A 200 MB log tail redacts in under 1.0s on the reference hardware.
- The bundle round-trips through tar tzf cleanly and the manifest's per-file SHA-256s verify.
- --encrypt-with produces a file that decrypts cleanly with the matching private key (round-trip test in CI).
- The docs include the "what's in the bundle" page with the redaction-rule table.
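The per-file SHA-256 criterion could be exercised with a round-trip like the sketch below. It assumes the manifest can be reduced to a map from archive path to hex digest and passes it in separately; the real manifest.json layout is not specified here, and a real verifier would also need to skip the manifest file itself. The helper names verifyBundle and buildBundle are hypothetical.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
)

// verifyBundle walks a .tar.gz and checks every regular file against the
// manifest, modelled here as path -> hex SHA-256 (an assumed shape).
func verifyBundle(bundle []byte, manifest map[string]string) error {
	gz, err := gzip.NewReader(bytes.NewReader(bundle))
	if err != nil {
		return fmt.Errorf("open gzip: %w", err)
	}
	defer gz.Close()

	tr := tar.NewReader(gz)
	seen := 0
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return fmt.Errorf("read tar: %w", err)
		}
		if hdr.Typeflag != tar.TypeReg {
			continue
		}
		want, ok := manifest[hdr.Name]
		if !ok {
			return fmt.Errorf("%s: not listed in manifest", hdr.Name)
		}
		h := sha256.New()
		if _, err := io.Copy(h, tr); err != nil {
			return fmt.Errorf("%s: %w", hdr.Name, err)
		}
		if got := hex.EncodeToString(h.Sum(nil)); got != want {
			return fmt.Errorf("%s: digest mismatch", hdr.Name)
		}
		seen++
	}
	if seen != len(manifest) {
		return fmt.Errorf("manifest lists %d files, bundle has %d", len(manifest), seen)
	}
	return nil
}

// buildBundle creates a small in-memory bundle and its manifest (demo only).
func buildBundle(files map[string]string) ([]byte, map[string]string, error) {
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	tw := tar.NewWriter(gz)
	manifest := make(map[string]string, len(files))
	for name, body := range files {
		hdr := &tar.Header{Name: name, Typeflag: tar.TypeReg, Mode: 0o600, Size: int64(len(body))}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, nil, err
		}
		if _, err := tw.Write([]byte(body)); err != nil {
			return nil, nil, err
		}
		sum := sha256.Sum256([]byte(body))
		manifest[name] = hex.EncodeToString(sum[:])
	}
	if err := tw.Close(); err != nil {
		return nil, nil, err
	}
	if err := gz.Close(); err != nil {
		return nil, nil, err
	}
	return buf.Bytes(), manifest, nil
}

func main() {
	bundle, manifest, err := buildBundle(map[string]string{
		"doctor.json": `{"summary":"ok"}`,
		"config.yaml": "log_level: info\n",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("intact:", verifyBundle(bundle, manifest))

	manifest["config.yaml"] = "deadbeef" // simulate a tampered or stale entry
	fmt.Println("tampered:", verifyBundle(bundle, manifest))
}
```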