Logging

All services use Pino for structured JSON logging. Logs include request IDs, job IDs, and chain identifiers for tracing.

Log Format

{
  "level": "info",
  "time": 1717585200000,
  "requestId": "abc123",
  "chain": "solana",
  "msg": "Request completed",
  "method": "GET",
  "path": "/api/v1/wallets/7xKp.../rating",
  "statusCode": 200,
  "responseTime": 45
}
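As a minimal sketch of how a log line like the one above could be produced, the helper below binds the tracing fields once through a child logger and then logs the per-request fields. `LoggerLike`, `RequestSummary`, `logRequestCompleted`, and `createMemoryLogger` are hypothetical names introduced for illustration. The interface covers only the `child` and `info` calls used here, which a Pino logger also provides.

```typescript
// Minimal logger shape used by this sketch (hypothetical interface).
// A Pino logger offers the same child() and info() calls.
interface LoggerLike {
  child(bindings: Record<string, unknown>): LoggerLike;
  info(fields: Record<string, unknown>, msg: string): void;
}

interface RequestSummary {
  method: string;
  path: string;
  statusCode: number;
  responseTime: number; // milliseconds
}

// Bind the tracing fields once, then log the per-request fields.
function logRequestCompleted(
  logger: LoggerLike,
  requestId: string,
  chain: string,
  summary: RequestSummary,
): void {
  const requestLogger = logger.child({ requestId, chain });
  requestLogger.info({ ...summary }, "Request completed");
}

// In-memory stand-in, used only to show which fields end up on the entry.
function createMemoryLogger(
  sink: Record<string, unknown>[],
  bindings: Record<string, unknown> = {},
): LoggerLike {
  return {
    child: (extra) => createMemoryLogger(sink, { ...bindings, ...extra }),
    info: (fields, msg) => {
      sink.push({ level: "info", ...bindings, ...fields, msg });
    },
  };
}

const captured: Record<string, unknown>[] = [];
logRequestCompleted(createMemoryLogger(captured), "abc123", "solana", {
  method: "GET",
  path: "/api/v1/wallets/7xKp.../rating",
  statusCode: 200,
  responseTime: 45,
});
```

With Pino's default settings, `level` is written as a number (30 for info). Emitting the label "info" as shown in the sample requires the `formatters.level` option.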

Log Levels

Level  Usage
error  Errors requiring attention
warn   Recoverable issues, deprecated usage
info   Request/response, job completion
debug  Detailed debugging (dev only)

Key Metrics

Parser Metrics (Per Chain)

Metric                                Description
parser.{chain}.success_rate           % of transactions parsed successfully
parser.{chain}.unknown_protocol_rate  % of transactions falling back to unknown protocol
parser.{chain}.latency_ms             Time to parse a transaction, in milliseconds
parser.{chain}.mev_detected           MEV attacks detected
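The two percentage metrics can be derived from raw counters. The sketch below assumes a hypothetical `ParserCounts` shape collected per chain over one reporting window. How the counters are actually gathered is not specified here, and `parserMetric` and `parserRates` are illustrative names.

```typescript
// Hypothetical counters for one chain over one reporting window.
interface ParserCounts {
  total: number;           // transactions seen
  succeeded: number;       // transactions parsed successfully
  unknownProtocol: number; // transactions that fell back to unknown
}

// Build a metric name such as "parser.solana.success_rate".
function parserMetric(chain: string, metric: string): string {
  return `parser.${chain}.${metric}`;
}

// Convert raw counts into the two percentage metrics.
function parserRates(chain: string, counts: ParserCounts): Record<string, number> {
  // Multiply before dividing to limit floating-point drift; an empty window reports 0.
  const pct = (n: number): number =>
    counts.total === 0 ? 0 : (n * 100) / counts.total;
  return {
    [parserMetric(chain, "success_rate")]: pct(counts.succeeded),
    [parserMetric(chain, "unknown_protocol_rate")]: pct(counts.unknownProtocol),
  };
}

const solanaRates = parserRates("solana", {
  total: 2000,
  succeeded: 1900,
  unknownProtocol: 150,
});
// solanaRates["parser.solana.success_rate"] is 95
// solanaRates["parser.solana.unknown_protocol_rate"] is 7.5
```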

API Metrics

Metric               Description
api.request_count    Total requests by endpoint
api.error_rate       5xx error rate
api.p99_latency      99th percentile response time
api.rate_limit_hits  Rate limit rejections
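For reference, here is one way the latency percentile and the 5xx error rate could be computed from raw samples. This is a sketch using the nearest-rank percentile definition, which is only one of several common definitions. A metrics backend may interpolate instead. `percentile` and `errorRatePct` are illustrative names.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1]!;
}

// Share of responses with a 5xx status code, as a percentage.
function errorRatePct(statusCodes: number[]): number {
  if (statusCodes.length === 0) return 0;
  const serverErrors = statusCodes.filter((c) => c >= 500 && c <= 599).length;
  return (serverErrors * 100) / statusCodes.length;
}

// 100 response times of 1 ms, 2 ms, ..., 100 ms.
const latenciesMs = Array.from({ length: 100 }, (_, i) => i + 1);
const p99 = percentile(latenciesMs, 99);            // 99
const rate = errorRatePct([200, 200, 500, 404]);    // 25
```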

Worker Metrics

Metric                        Description
worker.job_queue_depth        Pending jobs in queue
worker.job_success_rate       % of jobs completing successfully
worker.backfill_throughput    Transactions processed per minute
worker.webhook_delivery_rate  Successful webhook deliveries

Intelligence Metrics

Metric                          Description
intelligence.wallets_flagged    New wallets flagged per hour
intelligence.copy_detections    Copy trading alerts generated
intelligence.decay_transitions  Sharp → Fading → Dead transitions

Health Checks

Each service exposes a /health endpoint with multi-chain status:

GET /health

{
  "status": "healthy",
  "version": "1.0.0",
  "timestamp": "2026-06-05T12:00:00Z",
  "checks": {
    "database": "ok",
    "job_queue": "ok",
    "chains": {
      "solana": "ok",
      "base": "ok",
      "hyperliquid": "ok"
    }
  }
}
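A response of this shape could be assembled by rolling the individual checks up into one overall status. The sketch below assumes that any check other than "ok" makes the service "unhealthy". The "error" and "unhealthy" labels, and the names `HealthChecks`, `HealthReport`, and `buildHealthReport`, are assumptions made for illustration.

```typescript
type CheckStatus = "ok" | "error"; // "error" is an assumed failure label

interface HealthChecks {
  database: CheckStatus;
  job_queue: CheckStatus;
  chains: Record<string, CheckStatus>;
}

interface HealthReport {
  status: "healthy" | "unhealthy"; // "unhealthy" is an assumed label
  version: string;
  timestamp: string;
  checks: HealthChecks;
}

// Roll individual checks up into a single overall status.
function buildHealthReport(
  version: string,
  checks: HealthChecks,
  now: Date,
): HealthReport {
  const all: CheckStatus[] = [
    checks.database,
    checks.job_queue,
    ...Object.values(checks.chains),
  ];
  const healthy = all.every((s) => s === "ok");
  return {
    status: healthy ? "healthy" : "unhealthy",
    version,
    // Drop milliseconds so the timestamp matches the sample format.
    timestamp: now.toISOString().replace(/\.\d{3}Z$/, "Z"),
    checks,
  };
}

const healthReport = buildHealthReport(
  "1.0.0",
  {
    database: "ok",
    job_queue: "ok",
    chains: { solana: "ok", base: "ok", hyperliquid: "ok" },
  },
  new Date("2026-06-05T12:00:00Z"),
);
```

Reporting each chain separately lets a monitor tell a single-chain RPC problem apart from a service-wide failure.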

Alerting & Alarms

Critical alerts are configured using Cloudflare Workers Alarms and external monitoring.

Critical Alarms (P1)

Alarm                 Threshold          Action
Parser failure rate   > 5%               Page on-call, investigate protocol change
API error rate        > 1%               Page on-call, check deployments
Database connections  > 80%              Scale pool, investigate leaks
Chain RPC down        Any chain offline  Fail over to backup RPC

Warning Alarms (P2)

Alarm                      Threshold  Action
Job queue depth            > 1000     Scale workers, investigate backlog
Webhook delivery failures  > 5%       Retry logic, notify customer
Unknown protocol rate      > 10%      Add new DEX normalizer
API p99 latency            > 5s       Investigate slow queries
API p99 latency > 5s Investigate slow queries

Cloudflare Workers Alarms

The API uses Durable Objects for scheduled alarm handling:

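// Alarm service for scheduled tasks, implemented as a Durable Object
class AlarmService {
  // The runtime passes the Durable Object state to the constructor
  constructor(private readonly state: DurableObjectState) {}

  // Schedule the next decay check; setAlarm accepts epoch milliseconds.
  // A Durable Object holds a single alarm, so a later call replaces an earlier one.
  async scheduleDecayCheck(walletId: string, nextCheck: Date) {
    await this.state.storage.setAlarm(nextCheck.getTime());
  }

  // Called by the runtime when the scheduled alarm fires
  async alarm() {
    // Execute scheduled decay state check
    await this.checkDecayState();
    // Reschedule next check
    await this.scheduleNextCheck();
  }

  // checkDecayState() and scheduleNextCheck() are omitted from this excerpt
}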

Scheduled Alarms

Alarm                 Frequency          Purpose
Decay state check     Hourly per wallet  Update sharp → fading → dead
Cohort snapshot       Daily              Generate cohort retention data
Rating recalculation  Every 6 hours      Update wallet scores
Copy detection scan   Every 15 minutes   Detect new copy patterns

Parser failure rate is the most critical metric. It's the first sign of a protocol change or new transaction type.
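The frequencies in the Scheduled Alarms table translate directly into intervals for rescheduling. The sketch below assumes each alarm is rescheduled one fixed interval after its last run. The key names in `alarmIntervalsMs` and the `nextRun` helper are hypothetical.

```typescript
const HOUR_MS = 60 * 60 * 1000;

// Intervals taken from the Scheduled Alarms table, in milliseconds.
// The key names are illustrative, not identifiers from the codebase.
const alarmIntervalsMs = {
  decayStateCheck: HOUR_MS,
  cohortSnapshot: 24 * HOUR_MS,
  ratingRecalculation: 6 * HOUR_MS,
  copyDetectionScan: 15 * 60 * 1000,
} as const;

type ScheduledAlarm = keyof typeof alarmIntervalsMs;

// Next run time: one interval after the last run.
function nextRun(alarm: ScheduledAlarm, lastRun: Date): Date {
  return new Date(lastRun.getTime() + alarmIntervalsMs[alarm]);
}

const lastScan = new Date("2026-06-05T12:00:00Z");
const nextScan = nextRun("copyDetectionScan", lastScan);
// nextScan.toISOString() is "2026-06-05T12:15:00.000Z"
```

The resulting `Date` (or its `getTime()` value) can be passed to `storage.setAlarm` inside the `alarm()` handler to schedule the following run.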

Recovery Procedures

Parser Regression

  1. Identify failing transaction signatures in logs
  2. Reproduce locally with pnpm --filter cli test-tx <signature>
  3. Check for protocol/program ID changes
  4. Update normalizer or add new one
  5. Deploy and verify success rate recovers

Chain RPC Outage

  1. Alarms trigger on health check failure
  2. Automatic failover to backup RPC (if configured)
  3. Manual intervention if all RPCs fail
  4. Jobs are retried with exponential backoff

Never manually delete jobs from the queue. Use the admin API to mark them as failed if needed.
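The failover and retry steps of the Chain RPC Outage procedure can be sketched as two small helpers. The base delay, the cap, the endpoint URLs, and the names `backoffDelayMs` and `pickRpc` are all assumptions made for illustration. The actual retry parameters are not documented here.

```typescript
// Delay before retry number `attempt` (1-based): base * 2^(attempt - 1),
// capped at maxMs. Default values are assumed, not taken from the system.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 60000): number {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError("attempt must be a positive integer");
  }
  return Math.min(maxMs, baseMs * Math.pow(2, attempt - 1));
}

// Pick the first endpoint that passes the health check.
// Returns undefined when every endpoint fails, which is the
// point where manual intervention is needed.
function pickRpc(
  endpoints: string[],
  isHealthy: (url: string) => boolean,
): string | undefined {
  return endpoints.find(isHealthy);
}

// Hypothetical endpoints: primary first, backup second.
const rpcEndpoints = ["https://rpc-primary.example", "https://rpc-backup.example"];
const activeRpc = pickRpc(rpcEndpoints, (url) => url !== "https://rpc-primary.example");
// activeRpc is "https://rpc-backup.example"

const retryDelays = [1, 2, 3, 4, 5].map((n) => backoffDelayMs(n));
// retryDelays is [1000, 2000, 4000, 8000, 16000]
```

Production retry schemes often add random jitter to each delay so that many jobs do not retry at the same instant. It is left out here to keep the sketch deterministic.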