Tuesday, 5 May 2026

Implementation of Autonomy Intelligence Features

v2.0 — Self-Learning Pipeline

These features transform the tool from a reactive code generator into a self-improving autonomous system. Each capability adds a new dimension of machine learning to the pipeline, reducing human intervention and improving output quality over time.

1. Closed-Loop Outcome Scoring
Foundational Layer
Flow: QA Pipeline (score + findings) → Outcome Card Generator → Outcome DB (tech stack, score trajectory, lessons learned) → Stage 2 Analysis ("What worked before?") — a continuous learning loop.
Technical Architecture
  • Entity: OutcomeCard — JPA entity in H2/PostgreSQL
  • Service: OutcomeScoringService — generates structured outcome cards after QA
  • Trigger: PRCreatedEvent listener auto-generates cards
  • Storage: H2 table outcome_cards with indexed lookups by tech stack
  • AI Integration: Bedrock generates "lessons learned" via meta-analysis prompt
  • Retrieval: During Stage 2, top-N matching cards injected into analysis prompt
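As a rough sketch of that retrieval step — class, record, and method names here are illustrative stand-ins, not the real JPA entity or service — the Stage 2 injection boils down to "filter cards by tech stack, rank by QA score, take the top N, render as a prompt section":

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of the Stage 2 retrieval step: given the requirement's
// tech stack, pick the top-N highest-scoring outcome cards to inject into the
// analysis prompt. Field and class names are illustrative, not the real entity.
public class OutcomeCardRetriever {

    public record OutcomeCard(String techStack, double qaScore, String lesson) {}

    /** Top-N cards for a tech stack, best QA score first. */
    public static List<OutcomeCard> topMatches(List<OutcomeCard> all, String techStack, int n) {
        return all.stream()
                .filter(c -> c.techStack().equalsIgnoreCase(techStack))
                .sorted(Comparator.comparingDouble(OutcomeCard::qaScore).reversed())
                .limit(n)
                .collect(Collectors.toList());
    }

    /** Render the cards as a prompt section ("here's what worked and what didn't"). */
    public static String toPromptSection(List<OutcomeCard> cards) {
        StringBuilder sb = new StringBuilder("PAST OUTCOMES:\n");
        for (OutcomeCard c : cards) {
            sb.append("- [").append(c.qaScore()).append("/10] ").append(c.lesson()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<OutcomeCard> db = List.of(
                new OutcomeCard("spring-boot", 8.5, "Thymeleaf forms need th:action for CSRF"),
                new OutcomeCard("spring-boot", 6.0, "H2 dialect mismatch broke migrations"),
                new OutcomeCard("react", 9.0, "Hooks-based state avoided prop drilling"));
        System.out.print(toPromptSection(topMatches(db, "spring-boot", 2)));
    }
}
```

The real service does the filtering with an indexed H2/PostgreSQL query rather than an in-memory stream, but the ranking contract is the same.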
Business Impact
  • Reduces analysis time — past successes/failures guide decisions instantly
  • Avoids repeated failures — if an approach failed before, system warns proactively
  • Quantifies ROI — tracks time-to-complete and fix iterations across all requirements
  • Enables trend analysis — success rates by tech stack reveal strengths/weaknesses
  • Transforms KB from "here's similar code" to "here's what worked and what didn't"
2. Prompt Evolution via Meta-Learning
Adaptive Intelligence
Flow: 21 prompt templates (system-codegen.txt, …) + QA outcomes (scores + findings) → meta-analysis → amendments (ALWAYS/NEVER rules) staged for human review; approved amendments feed back into the prompts. Blind spot detection example: "CSRF errors in 70% of runs despite prompt rule".
Technical Architecture
  • Service: PromptEvolutionService — tracks prompt→outcome correlations
  • Blind Spot Detection: Categories appearing in >50% of runs flagged as ineffective rules
  • Amendment Generation: Bedrock meta-prompt generates ALWAYS/NEVER rules
  • Safety Gate: All amendments are STAGED for human review before activation
  • Metrics: Per-template avg QA score, avg iterations, total findings tracked
  • Trigger: Auto-runs after every N requirements (configurable threshold)
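The blind-spot check itself is simple frequency counting. A minimal sketch, assuming a simplified in-memory shape (the real service works off persisted QA findings; the class name and the exact >50% threshold wiring are illustrative):

```java
import java.util.*;

// Hypothetical sketch of blind-spot detection: a finding category that keeps
// appearing across runs despite an existing prompt rule marks that rule as
// ineffective. Names and the threshold plumbing are illustrative.
public class BlindSpotDetector {

    /** Returns categories appearing in more than half of the QA runs. */
    public static Set<String> detect(List<Set<String>> categoriesPerRun) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> run : categoriesPerRun)
            for (String cat : run)
                counts.merge(cat, 1, Integer::sum);
        Set<String> blindSpots = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() * 2 > categoriesPerRun.size())  // strictly > 50% of runs
                blindSpots.add(e.getKey());
        return blindSpots;
    }

    public static void main(String[] args) {
        // CSRF shows up in 3 of 4 runs despite a prompt rule -> flag it.
        List<Set<String>> runs = List.of(
                Set.of("CSRF", "API_TYPO"), Set.of("CSRF"),
                Set.of("CSRF", "MISSING_DEP"), Set.of("TEMPLATE"));
        System.out.println(detect(runs));  // [CSRF]
    }
}
```

Flagged categories are then handed to the Bedrock meta-prompt, which proposes the ALWAYS/NEVER amendments for human review.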
Business Impact
  • Self-improving prompts — system gets smarter with every requirement processed
  • Addresses "stale rules" — detects when prompt rules aren't preventing errors
  • Reduces QA iterations — better prompts → fewer errors → less fix loop work
  • Human-in-the-loop safety — amendments never go live without approval
  • Measurable improvement — before/after metrics prove prompt evolution value
3. Proactive Defect Prediction
Pre-Generation Intelligence
Flow: risk scanner → risk assessment (e.g. 🔴 CSRF → HIGH (80%), 🟡 API_TYPO → MEDIUM, 🟡 MISSING_DEP → MEDIUM, 🟢 TEMPLATE → LOW) → code generation with weighted pitfalls, guard code, and high-risk flags → fewer defects at QA scoring.
Technical Architecture
  • Service: DefectPredictionService — pre-generation risk analysis
  • AI Prompt: Bedrock predicts error categories by requirement + tech stack
  • Pattern Matching: Cross-references full Error Pattern Library for weighted prevention
  • Guard Code: Auto-generates defensive code snippets for HIGH-risk areas
  • File Risk Scoring: Identifies historically problematic files for extra scrutiny
  • Integration Point: Runs in CodeGenerationService before story loop starts
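The core of the risk assessment is mapping historical failure frequency to a risk bucket. A sketch under assumed thresholds (the 0.6/0.3 cutoffs and class names below are illustrative, not the production values):

```java
import java.util.*;

// Hypothetical sketch of pre-generation risk scoring: the historical failure
// rate of an error category for a tech stack maps to a risk level; HIGH-risk
// categories would get guard code injected. Thresholds are illustrative.
public class RiskScanner {

    public enum Risk { LOW, MEDIUM, HIGH }

    /** Failure rate (0.0-1.0) -> risk bucket. */
    public static Risk bucket(double failureRate) {
        if (failureRate >= 0.6) return Risk.HIGH;
        if (failureRate >= 0.3) return Risk.MEDIUM;
        return Risk.LOW;
    }

    /** Risk per category from historical counts: failures / runs. */
    public static Map<String, Risk> assess(Map<String, Integer> failures, int totalRuns) {
        Map<String, Risk> out = new TreeMap<>();
        failures.forEach((cat, n) -> out.put(cat, bucket((double) n / totalRuns)));
        return out;
    }

    public static void main(String[] args) {
        Map<String, Risk> risks = assess(Map.of("CSRF", 8, "API_TYPO", 4, "TEMPLATE", 1), 10);
        System.out.println(risks);  // {API_TYPO=MEDIUM, CSRF=HIGH, TEMPLATE=LOW}
    }
}
```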
Business Impact
  • Shift-left quality — catches likely defects BEFORE code is generated
  • Reduces fix iterations — predicted problems are prevented at source
  • Smarter resource allocation — high-risk requirements get more defensive code
  • Quantifiable ROI — track defect prediction accuracy vs actual QA findings
  • Cost savings — fewer LLM calls for fix loops = lower Bedrock costs
4. Autonomous Self-Resolution (Stage 2)
Human Touchpoint Reduction
Flow: analysis engine generates 5 questions → 🤖 self-resolution engine tries, in order: 1. query KB, 2. past Q&A, 3. repo context, 4. AI reasoning. Confidence ≥ 0.7 → ✓ resolved (e.g. 3/5 auto-answered); otherwise ✗ escalate to user review (only 2 questions remain). Impact: before, 5 questions meant roughly a 2-day wait on the user; after, 2 questions — about 60% faster.
Technical Architecture
  • Service: SelfResolutionService — multi-strategy question resolver
  • 4 Resolution Strategies: KB search, past Q&A matching, repo context inference, AI reasoning
  • Confidence Scoring: Each resolution gets a 0.0–1.0 confidence; threshold = 0.7
  • Integration: Hooks into ProposalService.runAnalysis() before user escalation
  • Outcome Cards: Uses historical outcomes for additional reasoning context
  • Safety: All auto-resolutions are marked with source + confidence for audit trail
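A minimal sketch of the resolution loop — strategy ordering and the 0.7 threshold come from the list above, while the record shape and class names are illustrative:

```java
import java.util.*;
import java.util.function.*;

// Hypothetical sketch of the multi-strategy resolver: strategies run in order
// (KB search, past Q&A, repo context, AI reasoning); the first answer at or
// above the 0.7 confidence threshold wins, otherwise the question escalates.
public class SelfResolver {

    public record Resolution(String answer, double confidence, String source) {}

    static final double THRESHOLD = 0.7;

    /** Tries each strategy in order; an empty result means "escalate to the user". */
    public static Optional<Resolution> resolve(String question,
            List<Function<String, Optional<Resolution>>> strategies) {
        for (Function<String, Optional<Resolution>> s : strategies) {
            Optional<Resolution> r = s.apply(question);
            if (r.isPresent() && r.get().confidence() >= THRESHOLD) return r;
        }
        return Optional.empty();  // below threshold everywhere -> human review
    }

    public static void main(String[] args) {
        List<Function<String, Optional<Resolution>>> strategies = List.of(
                q -> Optional.empty(),                                         // KB miss
                q -> Optional.of(new Resolution("Use OAuth2", 0.85, "pastQA")), // strong match
                q -> Optional.of(new Resolution("Guess", 0.4, "ai")));
        System.out.println(resolve("Which auth scheme?", strategies).get().source()); // pastQA
    }
}
```

Keeping the `source` on each resolution is what preserves the audit trail mentioned above: every auto-answer records which strategy produced it and how confident it was.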
Business Impact
  • 60% fewer user interruptions — most questions answered autonomously
  • Faster pipeline throughput — no wait for human Q&A responses
  • Better user experience — users confirm assumptions vs answering from scratch
  • Knowledge compounds — each answered Q becomes future resolution context
  • Audit trail preserved — all auto-resolutions logged with confidence + source
5. Fix Strategy Memory (QA Intelligence)
Fix Loop Acceleration
Flow: QA finding ("Missing CSRF protection") → 🧠 strategy memory (CSRF → "Add @EnableCSRF + th:action on forms", ✓ 8/10 success rate) → fix prompt with proven strategy + context → converges faster: fixed in 1 iteration vs 3, score 8.5 → PASS → success recorded, strategy strengthened.
Technical Architecture
  • Entity: FixStrategy — JPA entity tracking fix approaches
  • Service: FixStrategyService — records successes/failures, retrieves best strategies
  • Integration: QaFixLoopService queries strategy memory before each fix attempt
  • Matching: By finding category + framework, with fallback to category-only
  • Scoring: Tracks success rate + avg score improvement per strategy
  • Prompt Injection: "FIX STRATEGY MEMORY" section with proven patterns
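The matching rule above (category + framework first, category-only fallback, best success rate wins) can be sketched like this — record fields and the class name are illustrative stand-ins for the JPA entity:

```java
import java.util.*;

// Hypothetical sketch of strategy retrieval: match on category + framework
// first, fall back to category only, and prefer the highest success rate.
public class FixStrategyMemory {

    public record FixStrategy(String category, String framework, String fix,
                              int successes, int attempts) {
        double successRate() { return attempts == 0 ? 0 : (double) successes / attempts; }
    }

    public static Optional<FixStrategy> best(List<FixStrategy> all, String category, String framework) {
        Comparator<FixStrategy> byRate = Comparator.comparingDouble(FixStrategy::successRate);
        Optional<FixStrategy> exact = all.stream()
                .filter(s -> s.category().equals(category) && s.framework().equals(framework))
                .max(byRate);
        if (exact.isPresent()) return exact;
        // No framework-specific strategy recorded: fall back to category-only.
        return all.stream().filter(s -> s.category().equals(category)).max(byRate);
    }

    public static void main(String[] args) {
        List<FixStrategy> mem = List.of(
                new FixStrategy("CSRF", "thymeleaf", "Add th:action on forms", 8, 10),
                new FixStrategy("CSRF", "react", "Send X-CSRF-Token header", 5, 6));
        // Unknown framework "vue" -> category-only fallback picks the best CSRF strategy.
        System.out.println(best(mem, "CSRF", "vue").get().fix());
    }
}
```

The winning strategy's text is what gets injected into the "FIX STRATEGY MEMORY" prompt section; recording the subsequent success or failure updates its rate.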
Business Impact
  • Faster convergence — fix loop resolves issues in 1 iteration vs 2-3
  • Reduced Bedrock costs — fewer fix iterations = fewer API calls
  • Knowledge retention — proven fixes persist even as prompts evolve
  • Compounding intelligence — each fix makes future fixes faster
  • Measurable: Track avg iterations before vs after strategy memory activation
6. Cross-Requirement Dependency Graph
Conflict Prevention
Flow: REQ-A (Auth) and REQ-B (Roles) both modify AuthService.java → ⚠️ conflict detected by the 🛡️ dependency graph's file registry, which tracks all in-flight modifications. Resolution strategy: option 1, sequence the work; option 2, inject the pending changes as context → compatible code generation.
Technical Architecture
  • Entity: FileRegistryEntry — tracks file→requirement→branch mappings
  • Service: DependencyGraphService — conflict detection + resolution
  • Registration: Auto-registers files at code generation start
  • Detection: Before generating, queries for overlapping in-flight modifications
  • Resolution: Injects conflict context into generation prompt for compatible output
  • Lifecycle: Files marked COMPLETED after PR merge
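A minimal in-memory sketch of the registry and conflict query (the real service persists FileRegistryEntry rows; this class and its method names are illustrative):

```java
import java.util.*;

// Hypothetical sketch of the file registry: each in-flight requirement registers
// the files it will touch; before generating, overlaps with other requirements
// are reported as conflicts so their pending changes can be injected as context.
public class DependencyGraph {

    private final Map<String, Set<String>> filesByReq = new HashMap<>();

    public void register(String reqId, Collection<String> files) {
        filesByReq.computeIfAbsent(reqId, k -> new HashSet<>()).addAll(files);
    }

    /** Other in-flight requirements touching any of the given files. */
    public Map<String, Set<String>> conflicts(String reqId, Collection<String> files) {
        Map<String, Set<String>> hits = new TreeMap<>();
        filesByReq.forEach((otherReq, otherFiles) -> {
            if (otherReq.equals(reqId)) return;
            Set<String> overlap = new TreeSet<>(otherFiles);
            overlap.retainAll(new HashSet<>(files));
            if (!overlap.isEmpty()) hits.put(otherReq, overlap);
        });
        return hits;
    }

    /** Called after PR merge: the requirement's files are no longer in flight. */
    public void complete(String reqId) { filesByReq.remove(reqId); }

    public static void main(String[] args) {
        DependencyGraph g = new DependencyGraph();
        g.register("REQ-A", List.of("AuthService.java", "User.java"));
        System.out.println(g.conflicts("REQ-B", List.of("AuthService.java"))); // REQ-A overlap
        g.complete("REQ-A");
        System.out.println(g.conflicts("REQ-B", List.of("AuthService.java"))); // {}
    }
}
```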
Business Impact
  • Zero merge conflicts — detected at generation time, not PR time
  • Parallel work enabled — multiple requirements can modify overlapping areas safely
  • DevOps efficiency — no manual conflict resolution, no blocked PRs
  • Visibility — dashboard shows all in-flight modifications at a glance
  • Scales with team size — more concurrent requirements handled safely
Autonomy Maturity Model
L1: Reactive → L2: Proactive → L3: Predictive → L4: Self-Learning. v1.x sat at the lower levels of this ladder; v2.0, with these features, reaches L4: Self-Learning.
Feature | Autonomy Dimension | Human Reduction | Learning Type | Status
Outcome Scoring | Decision Intelligence | Reduces analysis guesswork by 40% | Historical Pattern | Active
Prompt Evolution | Self-Improvement | Auto-discovers blind spots | Meta-Learning | Active
Defect Prediction | Preventive Quality | 30% fewer QA fix iterations | Predictive Risk | Active
Self-Resolution | Autonomous Analysis | 60% fewer clarification questions | Multi-Strategy Reasoning | Active
Fix Strategy Memory | QA Intelligence | Fix loop converges 2x faster | Strategy Reinforcement | Active
Dependency Graph | Coordination Intelligence | Zero merge conflicts | Graph Awareness | Active

Friday, 1 May 2026

Autonomous QA — Architecture Diagram

⚡ Stage 6: Autonomous QA — Architecture

5-Layer Quality Gate Pipeline · Bedrock-Powered · Fully Autonomous

In my last two blog posts we built the autonomous, AI-powered SDLC system and then added the self-healing capability. In this post we will walk through the design of the Autonomous QA architecture.

STAGE 6 — AUTONOMOUS QA PIPELINE
🔔 Trigger: PRCreatedEvent — QaOrchestrationService listens, then polls GitHub Pages. The layers execute sequentially; Layer 1 is pass/fail, Layers 2–5 each score 0–10.
  • LAYER 1 — Structure Check (Java): 📄 index.html exists · 🔗 no broken refs (JS/CSS/img) · 📁 valid file structure · 📐 HTML5 valid · ⚡ HTTP 200 OK. Engine: StructureCheckService.java — pure Java HttpClient + regex scanning, no AI, deterministic rules.
  • LAYER 2 — Security Audit (Hybrid): 🔒 OWASP A01–A10 scan · 🛡️ XSS detection (inline JS) · 🔑 credential exposure · 🚫 open redirect · 📦 CSP headers. Engine: SecurityAuditService.java — static regex rules (fast) + Bedrock deep analysis (qa-security-review.txt) with OWASP Top 10 mapping.
  • LAYER 3 — Functional E2E Tests (Bedrock): 🧪 user flow simulation · 📋 form submit validation · 🔀 navigation paths · 🔐 auth flow check · ⚠️ error handling. Engine: FunctionalTestService.java — Bedrock-simulated (qa-functional-test.txt); AI reads HTML+JS, traces user journeys, reports failures.
  • LAYER 4 — Accessibility Audit (Hybrid): ♿ WCAG 2.1 AA compliance · 🏷️ ARIA labels & roles · 🎨 contrast ratio ≥ 4.5:1 · ⌨️ keyboard nav · 📱 responsive. Engine: AccessibilityAuditService.java — Java rules (contrast calc, ARIA check) + Bedrock deep review (qa-accessibility-review.txt).
  • LAYER 5 — Performance Audit (Bedrock): ⚡ asset size analysis · 🖼️ image optimization · 📦 render-blocking resources · 💾 caching headers · 🚀 load strategy. Engine: PerformanceAuditService.java — Bedrock-simulated (qa-performance-review.txt); AI analyzes the asset graph + render path.
Outputs: 📊 QA report (HTML + JSON) via QaReportBuilder → DB + /api/qa/{reqId} · 📝 PR description updated via GitHub API PATCH (QA badge + findings) · 📡 SSE QA_COMPLETE event for real-time dashboard update.

🔔 Trigger: PRCreatedEvent

The entire QA pipeline is event-driven. When CodeGenerationService creates a pull request on GitHub, Spring publishes a PRCreatedEvent. The QaOrchestrationService listens for this event via @EventListener and kicks off the QA pipeline asynchronously (@Async).

Activation Sequence

Step | Action | Detail
1 | Event received | PRCreatedEvent(reqId, prUrl, pagesUrl) captured by listener
2 | Poll GitHub Pages | HTTP GET to pagesUrl every 15s, up to 3 min timeout, waiting for HTTP 200
3 | Fetch all pages | HttpClient crawls all HTML/JS/CSS from the deployed Pages site
4 | Execute 5 layers | Sequential execution — each layer receives the fetched content + previous layer results
5 | Aggregate & report | QaReportBuilder compiles findings → DB save → PR update → SSE broadcast
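Step 2's polling loop is worth sketching, because it is what makes the pipeline retry-safe while Pages deploys. In this sketch an `IntSupplier` stands in for the real HTTP call so the loop itself is visible; the class name is hypothetical, while the 15s/3-min budget comes from the table above:

```java
import java.util.function.IntSupplier;

// Hypothetical sketch of the "poll GitHub Pages until live" step. A status
// supplier replaces the real HttpClient call so the loop logic is testable.
public class PagesPoller {

    /** Polls until the supplier returns HTTP 200 or the attempt budget runs out. */
    public static boolean waitForLive(IntSupplier statusCheck, int maxAttempts, long sleepMs) {
        for (int i = 0; i < maxAttempts; i++) {
            if (statusCheck.getAsInt() == 200) return true;   // site is up
            try {
                Thread.sleep(sleepMs);                        // not yet: wait and retry
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;  // budget exhausted: report failure instead of hanging
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated deploy: two 404s, then 200. Sleep shortened for the demo;
        // the post uses 15s intervals with a 3-minute budget (12 attempts).
        IntSupplier fakePages = () -> ++calls[0] < 3 ? 404 : 200;
        System.out.println(waitForLive(fakePages, 12, 1));  // true
    }
}
```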

Why Event-Driven?

  • Decoupled — Code generation doesn't wait for QA; QA runs independently
  • Non-blocking — User sees PR created immediately; QA results stream in via SSE
  • Retry-safe — If Pages isn't ready, polling handles the delay gracefully

Layer 1: Structure Check (Pure Java)

The fastest, cheapest gate. Pure deterministic Java rules — no AI, no network calls to Bedrock. Catches deployment-breaking issues in milliseconds.

What It Checks

Check | Rule | Severity on Fail
Entry point exists | index.html must exist at repo root | CRITICAL
Broken references | Every <script src>, <link href>, <img src> must resolve to an existing file | CRITICAL
File structure | All HTML files reference-able from root; no orphaned pages | HIGH
HTML5 validity | <!DOCTYPE html>, <html lang>, <meta charset> present | MEDIUM
HTTP 200 | GitHub Pages URL returns 200 status | CRITICAL

Engine Details

  • Service: StructureCheckService.java
  • Technique: Java HttpClient for live URL checks; regex-based HTML parsing for reference extraction
  • Scoring: Pass/Fail (binary) — any CRITICAL finding = layer fails, pipeline short-circuits with report
  • Performance: Completes in <2 seconds typically
  • Why first? If the site doesn't load or has broken refs, deeper analysis is pointless
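The broken-reference check can be sketched as a regex scan over the fetched HTML, with external URLs and anchors skipped. This is a simplified stand-in for StructureCheckService (the pattern and skip rules here are illustrative):

```java
import java.util.*;
import java.util.regex.*;

// Hypothetical sketch of the broken-reference check: a regex pulls every
// src/href target out of the HTML and flags any that is neither an external
// URL nor a file known to exist in the deployed site.
public class ReferenceScanner {

    private static final Pattern REF =
            Pattern.compile("(?:src|href)\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    public static List<String> brokenRefs(String html, Set<String> deployedFiles) {
        List<String> broken = new ArrayList<>();
        Matcher m = REF.matcher(html);
        while (m.find()) {
            String target = m.group(1);
            if (target.startsWith("http") || target.startsWith("#")) continue; // external/anchor
            if (!deployedFiles.contains(target)) broken.add(target);
        }
        return broken;
    }

    public static void main(String[] args) {
        String html = "<script src=\"app.js\"></script><img src=\"logo.png\">"
                + "<link href=\"style.css\" rel=\"stylesheet\">";
        System.out.println(brokenRefs(html, Set.of("app.js", "style.css"))); // [logo.png]
    }
}
```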

Layer 2: Security Audit (Hybrid)

Two-pass security analysis modeled on OWASP Top 10. First pass: fast static regex rules catch known patterns. Second pass: Bedrock deep analysis for nuanced vulnerabilities that pattern matching misses.

Pass 1: Static Rules (Java)

Rule | Pattern | Maps to OWASP
Inline JavaScript detection | onclick=, javascript:, eval( | A03: Injection / XSS
Credential exposure | password in URL params, hardcoded tokens, localStorage for secrets | A07: Auth Failures
Form action validation | Forms with method="GET" containing password fields | A04: Insecure Design
Open redirect | Unvalidated window.location assignments from URL params | A01: Broken Access
Missing security headers | No CSP meta tag, no X-Frame-Options | A05: Security Misconfig

Pass 2: Bedrock Deep Analysis

  • Prompt: qa-security-review.txt — sends full HTML+JS source to Bedrock
  • AI analyzes: Authentication flow logic, session management, data sanitization patterns, DOM manipulation safety, third-party script risks
  • Output: JSON array of findings with severity, owaspCategory, location, remediation

Scoring

  • Security Score: 0–10 scale (10 = no findings)
  • Each CRITICAL finding: −3 points. HIGH: −2. MEDIUM: −1. LOW: −0.5
  • Gate threshold: Advisory only (no blocking) — but CRITICAL findings highlighted in PR
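The deduction arithmetic above is straightforward; a sketch (class name illustrative), with the floor at zero made explicit:

```java
// Sketch of the security score arithmetic described above:
// start from 10 and subtract per-severity deductions, floored at 0.
public class SecurityScore {

    public static double score(int critical, int high, int medium, int low) {
        double s = 10.0 - 3.0 * critical - 2.0 * high - 1.0 * medium - 0.5 * low;
        return Math.max(0.0, s);  // never below zero
    }

    public static void main(String[] args) {
        // 1 CRITICAL + 2 MEDIUM: 10 - 3 - 2 = 5.0
        System.out.println(score(1, 0, 2, 0));  // 5.0
        // Pathological case floors at 0 rather than going negative.
        System.out.println(score(4, 0, 0, 0));  // 0.0
    }
}
```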

Layer 3: Functional E2E Tests (Bedrock AI)

Since the generated apps are static GitHub Pages sites (HTML/CSS/JS only), traditional browser automation (Selenium/Playwright) is overkill. Instead, Bedrock AI reads the complete source code and mentally simulates user journeys — tracing event handlers, form submissions, navigation flows, and state management.

What Bedrock Simulates

Journey | What AI Traces | Expected Behavior
Login flow | Form submit handler → validation → redirect → session storage | Invalid creds show error; valid creds redirect to home
Navigation | Anchor hrefs, window.location, back/forward logic | All links navigate to existing pages; no dead ends
CRUD operations | DOM manipulation, localStorage read/write, event chains | Add/edit/delete reflect in UI; data persists across page loads
Auth guards | sessionStorage/localStorage checks on page load | Unauthenticated users redirected to login
Error handling | Try/catch blocks, error display elements, edge cases | Graceful degradation; user-visible messages

Why Bedrock-Simulated vs. Real Browser?

  • No infrastructure: No Selenium grid, no headless Chrome, no Docker containers
  • Deeper analysis: AI understands intent, not just DOM state — catches logic errors a click-test would miss
  • Cost-effective: One Bedrock invocation covers dozens of simulated journeys
  • Trade-off: Cannot catch rendering bugs or CSS layout issues (Layer 4 partially covers this)

Scoring

  • Score: 0–10 (10 = all journeys pass)
  • AI returns structured JSON: { journey, steps[], result: "pass"|"fail", issue?, remediation? }

Layer 4: Accessibility Audit (Hybrid)

Ensures WCAG 2.1 Level AA compliance through a combination of deterministic Java checks (machine-verifiable criteria) and Bedrock analysis (human-judgment criteria that require understanding context).

Pass 1: Java Rules (Deterministic)

Check | Implementation | WCAG Criterion
Image alt text | Regex: every <img> must have non-empty alt | 1.1.1 Non-text Content
Form labels | Every <input> has associated <label> or aria-label | 1.3.1 Info and Relationships
Color contrast | Parse CSS color/background-color; compute luminance ratio ≥ 4.5:1 | 1.4.3 Contrast (Minimum)
Heading hierarchy | Verify h1 → h2 → h3 sequence; no skips | 1.3.1 Info and Relationships
Language attribute | <html lang="..."> present | 3.1.1 Language of Page
Focus styles | CSS includes :focus rules; no outline: none without replacement | 2.4.7 Focus Visible
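The contrast calculation is the most interesting deterministic check, and it follows the standard WCAG relative-luminance formula (the class name below is illustrative; the formula itself is the one WCAG defines):

```java
// Sketch of the deterministic contrast check (WCAG 2.1, 1.4.3): relative
// luminance of both colors, then ratio (L1+0.05)/(L2+0.05) compared to 4.5:1.
public class ContrastChecker {

    /** sRGB channel (0-255) -> linear value per the WCAG luminance formula. */
    private static double linear(int c) {
        double v = c / 255.0;
        return v <= 0.03928 ? v / 12.92 : Math.pow((v + 0.055) / 1.055, 2.4);
    }

    public static double luminance(int r, int g, int b) {
        return 0.2126 * linear(r) + 0.7152 * linear(g) + 0.0722 * linear(b);
    }

    /** Contrast ratio between two colors, always >= 1.0, max 21.0. */
    public static double ratio(int[] fg, int[] bg) {
        double l1 = luminance(fg[0], fg[1], fg[2]);
        double l2 = luminance(bg[0], bg[1], bg[2]);
        double lighter = Math.max(l1, l2), darker = Math.min(l1, l2);
        return (lighter + 0.05) / (darker + 0.05);
    }

    public static void main(String[] args) {
        // Black on white: maximum possible contrast, 21:1.
        System.out.printf("%.1f%n", ratio(new int[]{0, 0, 0}, new int[]{255, 255, 255}));
        // #777 grey on white narrowly fails the 4.5:1 AA threshold for normal text.
        System.out.println(ratio(new int[]{119, 119, 119}, new int[]{255, 255, 255}) >= 4.5);
    }
}
```

A failing pair is reported against criterion 1.4.3 with the computed ratio, which makes the finding actionable rather than a bare pass/fail.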

Pass 2: Bedrock Deep Review

  • Prompt: qa-accessibility-review.txt
  • AI evaluates: Semantic HTML usage, ARIA roles/states correctness, keyboard navigation completeness, screen reader experience, touch target sizing, cognitive load assessment
  • Key insight: Many WCAG criteria (e.g., "meaningful sequence", "consistent navigation") require human-level understanding that pure regex cannot provide

Scoring

  • Accessibility Score: 0–10 (weighted: Java checks 40%, Bedrock analysis 60%)
  • Maps each finding to specific WCAG Success Criterion with conformance level (A, AA, AAA)

Layer 5: Performance Audit (Bedrock AI)

Analyzes the asset graph and render path of the deployed site. Since these are static sites without server-side rendering, performance analysis focuses on client-side loading strategy, asset optimization, and perceived performance.

What Bedrock Analyzes

Category | Analysis | Common Findings
Asset size | Total page weight, individual file sizes, unminified detection | Unminified JS >50KB, oversized images
Render blocking | <script> without defer/async, CSS in <head> load order | Render-blocking scripts in <head>
Image optimization | Format analysis (PNG vs WebP), dimensions, lazy loading | Missing loading="lazy", no width/height
Caching | Asset fingerprinting, cache-control headers, CDN usage | No cache busting on CSS/JS filenames
Critical render path | First-paint-blocking resources, inline critical CSS presence | All CSS loaded before any content renders

Why Bedrock Instead of Lighthouse?

  • No headless Chrome needed: Lighthouse requires a browser runtime; Bedrock works from source alone
  • Context-aware: AI understands that a login page's performance profile differs from a dashboard
  • Actionable output: AI provides specific remediation steps, not just scores
  • Trade-off: Cannot measure actual FCP/LCP/CLS metrics — these require real rendering

Scoring

  • Performance Score: 0–10
  • Deductions: unminified assets (−2), render-blocking scripts (−1.5), no lazy loading (−1), missing cache strategy (−1)

📊 Output: Report, PR Update & SSE Broadcast

After all 5 layers complete, QaReportBuilder aggregates findings into a unified report. Three outputs are generated simultaneously:

1. QA Report (Database + API)

  • DB entities: QaReport (one per run) + QaFinding (one per issue) stored via JPA
  • API endpoint: GET /api/qa/{reqId} returns JSON; GET /requirements/{reqId}/qa renders HTML view
  • Schema: Flyway V18__qa_tables.sql → qa_report (id, req_id, overall_score, security_score, accessibility_score, performance_score, functional_score, structure_pass, created_at) + qa_finding (id, report_id, layer, severity, category, description, location, remediation)

2. PR Description Patch

  • Mechanism: GitHub API PATCH /repos/{owner}/{repo}/pulls/{number}
  • Content: Appends QA badge (overall score with color), summary table of findings per layer, and critical findings with remediation steps
  • Advisory only: Does not block merge — provides visibility for human reviewer

3. SSE Broadcast

  • Event: QA_COMPLETE sent via PipelineStreamService
  • Payload: Overall score, per-layer scores, critical finding count
  • Dashboard: Real-time update on requirement detail page — QA section appears with expandable layer results

Composite Scoring

Component | Weight | Range
Structure | Gate (must pass) | Pass / Fail
Security | 30% | 0–10
Functional | 30% | 0–10
Accessibility | 25% | 0–10
Performance | 15% | 0–10
Overall | 100% | 0–10
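The composite calculation follows directly from the weight table: structure acts as a hard gate, and the four scored layers combine as a weighted average. A sketch (class name illustrative):

```java
// Sketch of the composite score from the weight table above: structure is a
// hard gate; the remaining layers combine as a weighted average on 0-10.
public class CompositeScore {

    public static double overall(boolean structurePass, double security,
                                 double functional, double accessibility, double performance) {
        if (!structurePass) return 0.0;  // gate: a broken deployment short-circuits scoring
        return 0.30 * security + 0.30 * functional
             + 0.25 * accessibility + 0.15 * performance;
    }

    public static void main(String[] args) {
        // 8/9/7/6 across the four scored layers -> 0.30*8 + 0.30*9 + 0.25*7 + 0.15*6
        System.out.printf("%.2f%n", overall(true, 8, 9, 7, 6));
        System.out.println(overall(false, 10, 10, 10, 10));  // 0.0 (structure gate failed)
    }
}
```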

Thursday, 30 April 2026

Self-Healing Pipeline Architecture

The last blog discussed the autonomous SDLC system. In this one we make it self-healing: autonomous error detection, classification, recovery, and KB-backed continuous improvement — zero human intervention from failure to resolution.

Solution for Self-Healing — 7 new files plus the V18 migration, built on top of every existing component: PipelineLog checkpoints, WorkflowFailedEvent, KnowledgeBaseService RAG, KnowledgeFeedbackService S3 writes, BedrockClient fallback, and the KbMaintenanceAgent schedule pattern. No new infrastructure required.
L1 DETECT → L2 CKPT → L3 RECOVER → L4 KB → L5 LEARN

LAYER 1 — Detection & Watchdog
Pipeline execution (ProposalService · CodeGenerationService · RepoAnalysisService · AgentLoopService): on exception, stepFailed() persists a FAILED PipelineLog and broadcasts via SSE to all subscribers, then publishes a WorkflowFailedEvent (requirementId · failedStep · errorMessage) as a Spring ApplicationEvent. PipelineHealthMonitor runs @Scheduled every 60s to scan FAILED requirements and every 5 min to scan hung requirements.

LAYER 2 — Checkpoint Ledger (V18 Migration)
PipelineLog (enhanced): step: String · stepOrder: int · status: enum, plus retry_count: int and recovery_session_id: varchar (new in V18). getLastCheckpoint(reqId) finds the last COMPLETED step order in PipelineLog (e.g. story 3 done at order 113 → resume from order 114, story 4). alreadyCompleted(reqId, stepKey) checks PipelineLogRepository for COMPLETED status on the step name; used in the CodeGenerationService and ProposalService loops to prevent duplicate work — stories 1–3 are skipped on resume, story 4 is retried.

LAYER 3 — Recovery Engine: PipelineRecoveryService
@EventListener(WorkflowFailedEvent) · @Async("jarvisTaskExecutor") · retry guard: Set<String> recentlyAttempted with 30-min TTL. ErrorClassifier → RecoveryStrategy:
  • TRANSIENT_BEDROCK (ThrottlingException · timeout · 503) → exponential backoff (2s→4s→8s) + retry same step
  • MALFORMED_AI_RESPONSE → retry Bedrock with suffix "Respond ONLY with valid JSON"
  • GIT_CONFLICT (422 · ref exists) → rename branch {name}-retry-{N}, re-attempt push
  • STALE_CLONE → delete local dir + re-clone from remote
  • JIRA_UNAVAILABLE → skip JIRA calls, continue pipeline (non-critical)
  • CONFIG_MISSING → escalate to Teams · cannot auto-fix creds
  • UNKNOWN → KB lookup first → if hit: apply past fix · else: retry once → escalate
Max retry guard: TRANSIENT max 3 · MALFORMED max 2 · GIT max 3 · UNKNOWN max 1. After max → PipelineRecoveryExhaustedEvent → Teams alert.
Checkpoint resume: (1) getLastCheckpoint(reqId) → stepOrder N; (2) startDevelopment(reqId, resumeFromOrder=N); (3) CodeGenerationService story loop: if alreadyCompleted(reqId, "CODE_GEN_STORY_"+N) continue (skip, no duplicate commit), else generate + commit (first time); (4) the pipeline continues naturally to PR creation.
New event records (v1.3): PipelineRecoveryRequestedEvent (reqId · failedStep · errorMsg · attemptNumber) · PipelineRecoveredEvent (reqId · recoveredStep · strategy · attemptsNeeded · durationMs) · PipelineRecoveryExhaustedEvent (reqId · step · allStrategiesFailed → escalate). All via Spring ApplicationEventPublisher — no new infra needed.

LAYER 4 — Knowledge Base Lookup (existing KnowledgeBaseService)
resolveFromKB("pipeline failure step:"+step+" error:"+errorClass, reqId) queries the Bedrock KB (Retrieve API, k=20, metadata filter: type=pipeline-incident). Score ≥ 0.80 → past fix found → apply the documented strategy immediately (skip trial-and-error). KbIncidentMatch (value object): strategy: RecoveryStrategy · confidence: float · backoffMs: int · maxRetries: int · notes: String (from incident.md frontmatter). KB lookup runs before every retry.

LAYER 5 — Continuous Improvement: KbHealingFeedbackService
Incident document (S3 upload) path: s3://.../learnings/{reqId}/incidents/{ts}-{step}.md. YAML frontmatter: type: pipeline-incident · step · error-class · fix-strategy · attempts · duration. Body: Error · Root Cause · Fix Applied · Prevention. Then KB sync → vector indexed → available for future RAG.
@EventListener coverage: WorkflowFailedEvent → write incident stub immediately · PipelineRecoveredEvent → complete incident doc (fix worked) · PipelineRecoveryExhaustedEvent → escalation doc (fix failed). All @Async("jarvisTaskExecutor") — non-blocking; guarded by isEnabled() — silent skip if KB/S3 not configured.
Continuous improvement loop: 1st occurrence uses the default strategy (trial-and-error); 2nd occurrence hits the KB and applies the pre-validated fix instantly. Each cycle: incident doc → KnowledgeBaseService.startSync(). The KB is continuously enriched with real production failure data, so recovery time decreases with every resolved incident.

Timeline: ① Fail (0s) → ② Detect (≤60s) → ③ KB lookup (1–2s) → ④ Classify + backoff (2–8s) → ⑤ Resume from checkpoint (0s overhead) → ⑥ Recovery complete + KB written → ⑦ Improved.

Component Deep Dive

🔍 Layer 1 — PipelineHealthMonitor

Type: @Component with two @Scheduled jobs
Job 1 — Active Failures (every 60s): Queries all RequirementStatus.FAILED requirements updated in the last 2 hours that haven't been attempted in the last 30 minutes. Publishes PipelineRecoveryRequestedEvent.
Job 2 — Silent Hangs (every 5 min): Finds requirements in ANALYZING or IN_DEVELOPMENT with no PipelineLog update for more than 15 minutes. Synthesizes a WorkflowFailedEvent("TIMEOUT: no pipeline progress for 15m").
Guard: ConcurrentHashMap<String, Instant> recentlyAttempted — TTL 30 min prevents retry storms.

📋 Layer 2 — Checkpoint Ledger

V18 Migration adds: retry_count INT DEFAULT 0 and recovery_session_id VARCHAR(36) to pipeline_logs. Also adds recovery_attempt_count INT and last_recovery_at TIMESTAMP to requirements.
New repo method: findTopByRequirementIdAndStatusOrderByStepOrderDesc(reqId, COMPLETED) — returns last completed step.
Skip logic in loops: alreadyCompleted(reqId, "CODE_GEN_STORY_4") checks for a COMPLETED PipelineLog with that step name. On resume: stories 1–3 are skipped in microseconds, story 4 is retried from scratch.
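The two ledger queries can be sketched against a simplified in-memory log (the record and class names are illustrative stand-ins for the PipelineLog entity and its repository):

```java
import java.util.*;

// Hypothetical sketch of the checkpoint ledger queries: find the last COMPLETED
// step order, and skip any step already marked COMPLETED on resume.
public class CheckpointLedger {

    public record LogEntry(String step, int stepOrder, String status) {}

    /** Highest COMPLETED step order, or -1 if nothing has completed yet. */
    public static int lastCheckpoint(List<LogEntry> logs) {
        return logs.stream()
                .filter(l -> l.status().equals("COMPLETED"))
                .mapToInt(LogEntry::stepOrder)
                .max().orElse(-1);
    }

    public static boolean alreadyCompleted(List<LogEntry> logs, String stepKey) {
        return logs.stream().anyMatch(l -> l.step().equals(stepKey)
                && l.status().equals("COMPLETED"));
    }

    public static void main(String[] args) {
        List<LogEntry> logs = List.of(
                new LogEntry("CODE_GEN_STORY_1", 111, "COMPLETED"),
                new LogEntry("CODE_GEN_STORY_2", 112, "COMPLETED"),
                new LogEntry("CODE_GEN_STORY_3", 113, "COMPLETED"),
                new LogEntry("CODE_GEN_STORY_4", 114, "FAILED"));
        System.out.println(lastCheckpoint(logs));                        // 113 -> resume at 114
        System.out.println(alreadyCompleted(logs, "CODE_GEN_STORY_3"));  // true  (skip)
        System.out.println(alreadyCompleted(logs, "CODE_GEN_STORY_4"));  // false (retry)
    }
}
```

In production the same queries run as indexed repository lookups, which is why skipping completed stories costs microseconds.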

⚙️ Layer 3 — PipelineRecoveryService

Entry point: @EventListener on WorkflowFailedEvent, @Async("jarvisTaskExecutor")
Flow: (1) Check retry guard → (2) Query KB for past fix → (3) ErrorClassifier.classify(step, errorMsg) → RecoveryStrategy enum → (4) Apply strategy with backoff → (5) Validate success → (6) Publish PipelineRecoveredEvent or PipelineRecoveryExhaustedEvent
Max retries per strategy: TRANSIENT=3, MALFORMED=2, GIT_CONFLICT=3, STALE_CLONE=2, UNKNOWN=1
Reuses: BedrockClient.invokeWithFallback(), GitHubClient.createBranch(), existing clone/delete logic

🧠 Layer 4 — KB Lookup

Called before every recovery attempt: knowledgeBaseService.resolveFromKB("pipeline failure step:CODE_GEN_STORY error:TRANSIENT_BEDROCK", reqId)
Metadata filter: source-uri startsWith s3://.../learnings/.../incidents/
Threshold: confidence ≥ 0.80 → use documented strategy; < 0.80 → use default ErrorClassifier strategy
KbIncidentMatch parses frontmatter: fix-strategy, backoff-ms, max-retries → PipelineRecoveryService applies them directly
Zero new AWS resources — reuses existing KB ID, AOSS index, embeddings model

📚 Layer 5 — KbHealingFeedbackService

New class extending the existing S3 upload pattern from KnowledgeFeedbackService.
Listens to 3 events: WorkflowFailedEvent (stub), PipelineRecoveredEvent (complete), PipelineRecoveryExhaustedEvent (escalation).
Upload path: s3://.../learnings/{reqId}/incidents/{yyyyMMdd-HHmmss}-{step}.md
YAML frontmatter enables metadata filtering in KB: type: pipeline-incident, step, error-class, fix-strategy, resolved: true/false
After upload → knowledgeBaseService.startSync() → new incident indexed within ~30s

🗂️ ErrorClassifier (standalone)

Pure utility class — no Spring dependencies, fully unit-testable.
classify(String step, String errorMessage) → RecoveryStrategy
Uses ordered regex/contains checks:
ThrottlingException|Rate exceeded|503 → TRANSIENT_BEDROCK
<|Unexpected character|parse error → MALFORMED_AI_RESPONSE
422|Reference already exists|already exists → GIT_CONFLICT
Remote mismatch|no commits|stale clone → STALE_CLONE
jira|401|403.*atlassian → JIRA_UNAVAILABLE
bucket|credentials|NoSuch.*Key → CONFIG_MISSING
fallthrough → UNKNOWN
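The ordered checks above can be sketched as a first-match-wins rule table. The patterns below mirror the post's list (minus the stray leading alternative in the MALFORMED rule); the exact production patterns may differ:

```java
import java.util.regex.Pattern;

// Sketch of the ordered classifier described above: the first matching rule
// wins, everything else falls through to UNKNOWN.
public class ErrorClassifier {

    public enum RecoveryStrategy {
        TRANSIENT_BEDROCK, MALFORMED_AI_RESPONSE, GIT_CONFLICT,
        STALE_CLONE, JIRA_UNAVAILABLE, CONFIG_MISSING, UNKNOWN
    }

    private record Rule(Pattern pattern, RecoveryStrategy strategy) {}

    // Order matters: specific transient/parse errors before broad config checks.
    private static final Rule[] RULES = {
        new Rule(Pattern.compile("ThrottlingException|Rate exceeded|503"),
                 RecoveryStrategy.TRANSIENT_BEDROCK),
        new Rule(Pattern.compile("Unexpected character|parse error"),
                 RecoveryStrategy.MALFORMED_AI_RESPONSE),
        new Rule(Pattern.compile("422|Reference already exists|already exists"),
                 RecoveryStrategy.GIT_CONFLICT),
        new Rule(Pattern.compile("Remote mismatch|no commits|stale clone"),
                 RecoveryStrategy.STALE_CLONE),
        new Rule(Pattern.compile("jira|401|403.*atlassian"),
                 RecoveryStrategy.JIRA_UNAVAILABLE),
        new Rule(Pattern.compile("bucket|credentials|NoSuch.*Key"),
                 RecoveryStrategy.CONFIG_MISSING),
    };

    public static RecoveryStrategy classify(String errorMessage) {
        for (Rule r : RULES)
            if (r.pattern().matcher(errorMessage).find()) return r.strategy();
        return RecoveryStrategy.UNKNOWN;  // fallthrough
    }

    public static void main(String[] args) {
        System.out.println(classify("ThrottlingException: Rate exceeded")); // TRANSIENT_BEDROCK
        System.out.println(classify("422: Reference already exists"));      // GIT_CONFLICT
        System.out.println(classify("disk quota exceeded"));                // UNKNOWN
    }
}
```

Because there are no Spring dependencies, the whole table is unit-testable with plain assertions, which is exactly the design goal stated above.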

Recovery Time Budget (TRANSIENT_BEDROCK Example)

0s   → CODE_GEN_STORY_4 throws ThrottlingException → stepFailed() persists PipelineLog
0s   → WorkflowFailedEvent published → KbHealingFeedbackService writes incident stub to S3
≤60s → PipelineHealthMonitor detects FAILED status → publishes PipelineRecoveryRequestedEvent
+1s  → PipelineRecoveryService.onRecovery() → KB lookup: "pipeline failure step:CODE_GEN_STORY error:TRANSIENT_BEDROCK"
+2s  → KB HIT (2nd+ occurrence): past incident found, confidence=0.87 → apply: 4s backoff, retry once
       OR KB MISS (1st occurrence): ErrorClassifier → TRANSIENT_BEDROCK → default 2s→4s→8s backoff
+4s  → getLastCheckpoint() → order 113 (story 3 done) → alreadyCompleted() checks stories 1–3 = skip
+5s  → Retry story 4 → Bedrock invoked → success
+8s  → Stories 5–8 continue normally → PR created → PipelineRecoveredEvent published
+9s  → KbHealingFeedbackService completes incident.md → knowledgeBaseService.startSync()
+40s → KB re-indexed → next ThrottlingException anywhere → instant KB-guided recovery

Design Principles

✓ Zero New Infrastructure

Reuses existing thread pool, KB, S3, SSE stream, PipelineLog. Flyway V18 is the only schema change. No Redis, no Kafka, no distributed lock manager needed.

✓ Idempotent Resume

alreadyCompleted() is the single guard. A story that succeeded before recovery will never be re-run, guaranteeing no duplicate commits or duplicate JIRA transitions.

✓ Self-Improving KB

Every incident enriches the KB. First failure is trial-and-error. Every subsequent identical failure is resolved instantly from the KB. Recovery time strictly decreases over time.

⚠️ Non-Blocking Side Effects

All KB writes, incident uploads, and sync triggers are @Async. A KB outage during recovery does NOT block the recovery itself — the pipeline resumes regardless.

⚠️ Guardrail: Max Retries

Strict per-strategy retry caps prevent runaway loops. After exhaustion: RequirementStatus.FAILED is set permanently, Teams alert sent, full incident logged. Human can restart manually.

📡 Full Observability

Every recovery attempt creates a PipelineLog entry visible in the SSE pipeline viewer with step name RECOVERY_ATTEMPT_N. Users see healing happen in real time in the UI.