In the last blog post we discussed an Autonomous SDLC system. In this one we make it autonomous in terms of error detection, classification, recovery, and KB-backed continuous improvement — zero human intervention from failure to resolution.
Solution for Self-Healing — 7 new files plus a V18 migration, built on top of every existing component: PipelineLog checkpoints, WorkflowFailedEvent, KnowledgeBaseService RAG, KnowledgeFeedbackService S3 writes, BedrockClient fallback, and the KbMaintenanceAgent schedule pattern. No new infrastructure required.
Component Deep Dive
🔍 Layer 1 — PipelineHealthMonitor
Type: @Component with two @Scheduled jobs.
Job 1 — Active Failures (every 60s): queries all RequirementStatus.FAILED requirements updated in the last 2 hours that haven't been attempted in the last 30 minutes, then publishes PipelineRecoveryRequestedEvent.
Job 2 — Silent Hangs (every 5 min): finds requirements in ANALYZING or IN_DEVELOPMENT with no PipelineLog update for more than 15 minutes and synthesizes a WorkflowFailedEvent("TIMEOUT: no pipeline progress for 15m").
Guard: ConcurrentHashMap<String, Instant> recentlyAttempted — a 30-minute TTL prevents retry storms.
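The 30-minute TTL guard can be sketched in plain Java. Note that RetryGuard and tryAcquire are illustrative names, not the actual implementation, which also consults database timestamps:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the recentlyAttempted guard: a requirement is only picked
// up for recovery again once its previous attempt is older than the TTL.
class RetryGuard {
    private static final Duration TTL = Duration.ofMinutes(30);
    private final ConcurrentHashMap<String, Instant> recentlyAttempted = new ConcurrentHashMap<>();

    /** Records and allows the attempt only if no attempt happened within the TTL window. */
    boolean tryAcquire(String reqId, Instant now) {
        Instant prev = recentlyAttempted.get(reqId);
        if (prev != null && Duration.between(prev, now).compareTo(TTL) < 0) {
            return false; // attempted < 30 min ago -> skip, prevents retry storms
        }
        recentlyAttempted.put(reqId, now);
        return true;
    }
}
```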
📋 Layer 2 — Checkpoint Ledger
V18 Migration adds: retry_count INT DEFAULT 0 and recovery_session_id VARCHAR(36) to pipeline_logs, plus recovery_attempt_count INT and last_recovery_at TIMESTAMP to requirements.
New repo method: findTopByRequirementIdAndStatusOrderByStepOrderDesc(reqId, COMPLETED) — returns the last completed step.
Skip logic in loops: alreadyCompleted(reqId, "CODE_GEN_STORY_4") checks for a COMPLETED PipelineLog with that step name. On resume, stories 1–3 are skipped in microseconds and story 4 is retried from scratch.
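The resume-time skip can be illustrated with a stub standing in for the PipelineLog query. Here a Set plays the role of COMPLETED rows, and class and method names other than alreadyCompleted are hypothetical:

```java
import java.util.Set;

// Sketch of the checkpoint-ledger resume: a story loop consults the set of
// COMPLETED step names and skips any step already recorded.
class CheckpointResume {
    private final Set<String> completedSteps; // stands in for COMPLETED PipelineLog rows

    CheckpointResume(Set<String> completedSteps) { this.completedSteps = completedSteps; }

    boolean alreadyCompleted(String step) { return completedSteps.contains(step); }

    /** Returns the first story index (1-based) that still needs work. */
    int firstPendingStory(int totalStories) {
        for (int i = 1; i <= totalStories; i++) {
            if (!alreadyCompleted("CODE_GEN_STORY_" + i)) return i;
        }
        return totalStories + 1; // everything already done
    }
}
```

With stories 1–3 completed, the loop resumes directly at story 4 without re-running earlier steps.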
⚙️ Layer 3 — PipelineRecoveryService
Entry point: @EventListener on WorkflowFailedEvent, running on @Async("jarvisTaskExecutor").
Flow: (1) check retry guard → (2) query KB for a past fix → (3) ErrorClassifier.classify(step, errorMsg) → RecoveryStrategy enum → (4) apply strategy with backoff → (5) validate success → (6) publish PipelineRecoveredEvent or PipelineRecoveryExhaustedEvent.
Max retries per strategy: TRANSIENT=3, MALFORMED=2, GIT_CONFLICT=3, STALE_CLONE=2, UNKNOWN=1.
Reuses: BedrockClient.invokeWithFallback(), GitHubClient.createBranch(), existing clone/delete logic.
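The retry caps and exponential backoff above can be sketched as a small policy class. The values come from the text; the class itself is illustrative (the real code uses a RecoveryStrategy enum rather than strings):

```java
import java.util.Map;

// Sketch of the per-strategy retry policy: caps from the design above,
// plus the 2s -> 4s -> 8s exponential backoff used for transient errors.
class RetryPolicy {
    static final Map<String, Integer> MAX_RETRIES = Map.of(
        "TRANSIENT", 3, "MALFORMED", 2, "GIT_CONFLICT", 3, "STALE_CLONE", 2, "UNKNOWN", 1);

    /** Exponential backoff: attempt 1 -> 2000 ms, 2 -> 4000 ms, 3 -> 8000 ms. */
    static long backoffMillis(int attempt) {
        return 2000L << (attempt - 1);
    }

    static boolean canRetry(String strategy, int attemptsSoFar) {
        return attemptsSoFar < MAX_RETRIES.getOrDefault(strategy, 1);
    }
}
```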
🧠 Layer 4 — KB Lookup
Called before every recovery attempt: knowledgeBaseService.resolveFromKB("pipeline failure step:CODE_GEN_STORY error:TRANSIENT_BEDROCK", reqId).
Metadata filter: source-uri startsWith s3://.../learnings/.../incidents/.
Threshold: confidence ≥ 0.80 → use the documented strategy; < 0.80 → fall back to the default ErrorClassifier strategy.
KbIncidentMatch parses the frontmatter fields fix-strategy, backoff-ms, max-retries, and PipelineRecoveryService applies them directly.
Zero new AWS resources — reuses the existing KB ID, AOSS index, and embeddings model.
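The frontmatter extraction can be sketched as below. This is an illustrative hand-rolled parser for simple "key: value" lines only; the actual KbIncidentMatch implementation may well use a YAML library instead:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: pull key-value pairs out of a "---"-delimited YAML frontmatter block
// (fix-strategy, backoff-ms, max-retries) from a retrieved incident document.
class IncidentFrontmatter {
    static Map<String, String> parse(String markdown) {
        Map<String, String> fields = new HashMap<>();
        boolean inFrontmatter = false;
        for (String line : markdown.split("\n")) {
            if (line.trim().equals("---")) {
                if (inFrontmatter) break; // closing delimiter: stop before the body
                inFrontmatter = true;
                continue;
            }
            if (inFrontmatter && line.contains(":")) {
                int i = line.indexOf(':');
                fields.put(line.substring(0, i).trim(), line.substring(i + 1).trim());
            }
        }
        return fields;
    }
}
```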
📚 Layer 5 — KbHealingFeedbackService
New class extending the existing S3 upload pattern from KnowledgeFeedbackService. Listens to 3 events: WorkflowFailedEvent (stub), PipelineRecoveredEvent (complete), PipelineRecoveryExhaustedEvent (escalation).
Upload path: s3://.../learnings/{reqId}/incidents/{yyyyMMdd-HHmmss}-{step}.md
YAML frontmatter enables metadata filtering in the KB: type: pipeline-incident, step, error-class, fix-strategy, resolved: true/false.
After upload → knowledgeBaseService.startSync() → new incident indexed within ~30s
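Building the incident object key from the path pattern above can be sketched as follows. The bucket and prefix stay elided as in the original, so only the key suffix is produced; class and method names are hypothetical:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Sketch: compose the S3 key suffix for an incident document following
// learnings/{reqId}/incidents/{yyyyMMdd-HHmmss}-{step}.md
class IncidentKey {
    private static final DateTimeFormatter TS = DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss");

    static String keyFor(String reqId, String step, LocalDateTime when) {
        return "learnings/" + reqId + "/incidents/" + TS.format(when) + "-" + step + ".md";
    }
}
```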
End-to-end recovery timeline (example: a Bedrock ThrottlingException during story 4):
0s → CODE_GEN_STORY_4 throws ThrottlingException → stepFailed() persists PipelineLog
0s → WorkflowFailedEvent published → KbHealingFeedbackService writes incident stub to S3
≤60s → PipelineHealthMonitor detects FAILED status → publishes PipelineRecoveryRequestedEvent
+1s → PipelineRecoveryService.onRecovery() → KB lookup: "pipeline failure step:CODE_GEN_STORY error:TRANSIENT_BEDROCK"
+2s → KB HIT (2nd+ occurrence): past incident found, confidence=0.87 → apply: 4s backoff, retry once
       OR KB MISS (1st occurrence): ErrorClassifier → TRANSIENT_BEDROCK → default 2s→4s→8s backoff
+4s → getLastCheckpoint() → order 113 (story 3 done) → alreadyCompleted() checks stories 1–3 = skip
+5s → Retry story 4 → Bedrock invoked → success
+8s → Stories 5–8 continue normally → PR created → PipelineRecoveredEvent published
+9s → KbHealingFeedbackService completes incident.md → knowledgeBaseService.startSync()
+40s → KB re-indexed → next ThrottlingException anywhere → instant KB-guided recovery
Design Principles
✓ Zero New Infrastructure
Reuses existing thread pool, KB, S3, SSE stream, PipelineLog. Flyway V18 is the only schema change. No Redis, no Kafka, no distributed lock manager needed.
✓ Idempotent Resume
alreadyCompleted() is the single guard. A story that succeeded before recovery will never be re-run, guaranteeing no duplicate commits or duplicate JIRA transitions.
✓ Self-Improving KB
Every incident enriches the KB. The first failure of a given class is trial-and-error; every subsequent identical failure is resolved instantly from the KB, so recovery time for recurring failures keeps dropping as incidents accumulate.
⚠️ Non-Blocking Side Effects
All KB writes, incident uploads, and sync triggers are @Async. A KB outage during recovery does NOT block the recovery itself — the pipeline resumes regardless.
⚠️ Guardrail: Max Retries
Strict per-strategy retry caps prevent runaway loops. After exhaustion: RequirementStatus.FAILED is set permanently, Teams alert sent, full incident logged. Human can restart manually.
📡 Full Observability
Every recovery attempt creates a PipelineLog entry visible in the SSE pipeline viewer with step name RECOVERY_ATTEMPT_N. Users see healing happen in real time in the UI.
Spring Boot monolith orchestrating AWS AI services, Git providers, and notification channels.
☕ Runtime
Java 21 LTS on Spring Boot 3.3.5. Embedded Tomcat, Spring MVC, Spring Data JPA, Flyway migrations, async event bus.
🤖 AI Engine
Amazon Bedrock with Nova Pro v1 (primary) and Nova Lite v1 (fallback). 5 prompt templates for analysis, options, cost estimation, code generation, and plan generation.
📚 RAG Pipeline
Bedrock Knowledge Base (XXXXXXX) backed by OpenSearch Serverless vector index. Titan Embeddings v2 for semantic code search.
🔀 Git Integration
JGit 6.10 for clone/commit/push. OkHttp 4.12 for GitHub & Bitbucket REST APIs. Auto PR creation with generated code.
💾 Database
H2 in file mode (./data/XXXXX) with Flyway migrations V1–V13. 8 JPA entities. Web console at /h2-console.
🖥️ Frontend
Thymeleaf server-rendered templates. Bootstrap 5 UI, HTMX for dynamic updates, Mermaid.js for diagrams, Prism.js for syntax highlighting, SSE for live pipeline streaming.
Workflow Pipeline (18 States)
End-to-end lifecycle from requirement submission to deployed code with Pull Request.
Status Transition Table
| From | To | Trigger | Service |
|------|----|---------|---------|
| SUBMITTED | ANALYZING_REQUIREMENT | Auto (on submit) | RequirementService |
| ANALYZING_REQUIREMENT | ANALYSIS_COMPLETE | Bedrock response parsed | ProposalService |
| ANALYSIS_COMPLETE | GENERATING_OPTIONS | Auto (event-driven) | ProposalService |
| GENERATING_OPTIONS | OPTIONS_READY | 3 options stored | ProposalService |
| OPTIONS_READY | OPTION_SELECTED | User selects option | RequirementController |
| OPTION_SELECTED | ESTIMATING_COST | Auto (event-driven) | CostEstimationService |
| ESTIMATING_COST | PENDING_APPROVAL | Cost estimate saved | CostEstimationService |
| PENDING_APPROVAL | APPROVED | Admin approval | ApprovalService |
| PENDING_APPROVAL | REJECTED | Admin rejection | ApprovalService |
| APPROVED | PLAN_GENERATION | Auto or manual trigger | CodeGenerationService |
| PLAN_GENERATION | CLONING_REPO | Plan generated | CodeGenerationService |
| CLONING_REPO | INGESTING_TO_KB | Repo cloned + S3 uploaded | GitService + S3 |
| INGESTING_TO_KB | GENERATING_CODE | KB ingestion complete | KnowledgeBaseService |
| GENERATING_CODE | CODE_GENERATED | All files generated | CodeGenerationService |
| CODE_GENERATED | CREATING_PR | Auto | GitService |
| CREATING_PR | COMPLETED | PR created successfully | GitService |
Data Model (8 Entities)
JPA entities with Flyway-managed schema (V1–V13). H2 file-mode database.
Event-Driven Architecture (10 Events)
Spring ApplicationEvents with @Async processing on ThreadPoolTaskExecutor (core=4, max=8).
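The bounded pool above (core=4, max=8) can be approximated with a plain-JDK construction; the app itself uses Spring's ThreadPoolTaskExecutor, and the keep-alive and queue capacity below are assumptions, not values from the source:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Plain-JDK equivalent of the async event bus pool (core=4, max=8).
class EventBusPool {
    static ThreadPoolExecutor create() {
        return new ThreadPoolExecutor(
            4, 8,                            // core and max pool size, as configured
            60L, TimeUnit.SECONDS,           // idle keep-alive for non-core threads (assumed)
            new LinkedBlockingQueue<>(100)); // bounded queue filled before extra threads spawn (assumed capacity)
    }
}
```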
AWS Services (eu-west-1)
All AWS services used, their configuration IDs, and how they connect.
| Prompt Template | Service | Purpose |
|-----------------|---------|---------|
| option-generation.txt | ProposalService | Generate 3 options with Mermaid diagrams, code snippets, RAG-enriched context |
| code-generation.txt | CodeGenerationService | Generate code files using RAG-retrieved codebase patterns and conventions |
| self-review.txt | CodeGenerationService | AI code review with RAG context for consistency validation |
| mvp-breakdown.txt | MvpBreakdownService | Generate MVP tree with RAG-informed story points and task granularity |
| test-generation.txt | CodeGenerationService | Generate tests matching existing test patterns via RAG retrieval |
Knowledge Base & RAG (8 Features)
Retrieval-Augmented Generation — enriching every AI prompt with real codebase context from AWS Bedrock Knowledge Base.
All 8 KB Enhancements — Detailed Breakdown
① RAG Enabled by Default
Config: XXX.rag.enabled flipped from false → true
The entire RAG pipeline — S3 upload → KB ingestion → vector retrieval → prompt injection — was already implemented but gated behind a disabled feature flag. Enabling it activates the full pipeline: every new requirement now has its cloned repository uploaded to S3, synced to KB, and used for vector-searched code retrieval during AI analysis.
The code-generation.txt prompt now includes a {{RAG_CONTEXT}} section. Before generating code, the system retrieves existing code patterns, import styles, error handling conventions, and file structures from the KB. This ensures generated code follows the project's existing conventions rather than generic best practices.
Flow: KB retrieve → inject as "Relevant Code from Knowledge Base" → Bedrock generates consistent code
The mvp-breakdown.txt prompt is now enriched with retrieved code from the KB. When generating the MVP tree (user stories → tasks → subtasks), the AI can see the actual codebase complexity, which results in more accurate story point estimates, better task-to-file mapping, and correct identification of affected files.
Flow: KnowledgeBaseService.retrieveAsContext() → inject into mvp-breakdown prompt → more accurate planning
The test-generation.txt prompt now receives codebase context via RAG. The AI retrieves existing test files to learn the project's test framework choice (JUnit 5, Mockito, etc.), naming conventions (shouldDoX_whenY), assertion styles, and mock patterns. Generated tests then match the project's existing test suite.
Flow: Retrieve existing test files via KB → inject test patterns → Bedrock generates consistent tests
The self-review.txt prompt is enriched with real codebase patterns retrieved from the KB. When the AI reviews its own generated code, it can now compare against the actual project's patterns — catching inconsistencies like different error handling approaches, wrong import styles, or missing patterns that other files in the project use.
Flow: Retrieve codebase patterns → compare against generated code → catch deviations and security issues
Every service that calls Bedrock now has KnowledgeBaseService injected as a dependency. Before each AI invocation, the service calls knowledgeBaseService.retrieveAsContext(query, reqId) to fetch relevant code chunks, which are then passed to the prompt builder's ragContext parameter.
| Service | KB Method Called | When |
|---------|------------------|------|
| ProposalService | retrieveAsContext() | Each analysis + option generation round |
| MvpBreakdownService | retrieveAsContext() | Before MVP tree generation |
| CodeGenerationService | retrieveAsContext() | Before code generation (Phase 2) |
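The injection pattern described above can be sketched as follows. The interface mirrors the KnowledgeBaseService named in the text, but the signature and the PromptAssembler class are assumptions for illustration:

```java
// Sketch: every Bedrock-calling service fetches RAG context first, then hands
// it to the prompt builder, which fills the {{RAG_CONTEXT}} section.
interface KnowledgeBaseService {
    String retrieveAsContext(String query, String reqId);
}

class PromptAssembler {
    private final KnowledgeBaseService kb;

    PromptAssembler(KnowledgeBaseService kb) { this.kb = kb; }

    /** Retrieve relevant code chunks, then inject them into the prompt template. */
    String buildPrompt(String template, String query, String reqId) {
        String ragContext = kb.retrieveAsContext(query, reqId); // called before every AI invocation
        return template.replace("{{RAG_CONTEXT}}", ragContext);
    }
}
```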
⑦ Cross-Requirement Learning
When a requirement reaches the PR_CREATED stage (pipeline completion), the new KnowledgeFeedbackService automatically captures the entire decision trail — requirement description, selected solution option, approach, risk assessment, affected files, and MVP breakdown — as a structured Markdown document and uploads it to S3 under the learnings/ prefix.
After upload, it triggers a KB re-sync job so the learning gets indexed. On future requirements, the KB can now retrieve past decisions: "For a similar feature last month, the team chose approach X with Y story points and Z files were affected."
File: KnowledgeFeedbackService.java — listens for PRCreatedEvent, uploads to S3, triggers KB sync
⑧ KB Admin Dashboard
A new admin page at /kb-admin provides full visibility into the Knowledge Base health. The dashboard includes three status cards (KB connection, S3 storage, cross-requirement learning), a RAG integration map showing all 6 enriched prompts, manual sync trigger, and a live RAG query tester that lets admins search the KB and inspect retrieved chunks with relevance scores.
KB retrieval now scopes vector search to the specific requirement's S3 prefix using the x-amz-bedrock-kb-source-uri metadata field. When analyzing requirement REQ-ABC123, only code chunks from that requirement's repository are returned — preventing cross-contamination when multiple repositories are indexed in the same KB.
Filter: startsWith("s3://bucket/repos/REQ-ABC123/") — falls back gracefully to unfiltered retrieval if not supported.
File: KnowledgeBaseService.retrieve()
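The scoping predicate is easy to show locally. In production the filter is applied server-side via the x-amz-bedrock-kb-source-uri metadata field; this sketch only demonstrates the same startsWith logic, with a placeholder bucket name:

```java
// Sketch: build the per-requirement S3 prefix and check whether a retrieved
// chunk's source URI falls inside that requirement's scope.
class ScopedRetrieval {
    static String prefixFor(String bucket, String reqId) {
        return "s3://" + bucket + "/repos/" + reqId + "/";
    }

    static boolean inScope(String sourceUri, String bucket, String reqId) {
        return sourceUri.startsWith(prefixFor(bucket, reqId));
    }
}
```

This is what prevents cross-contamination: chunks from REQ-XYZ999's repository never match REQ-ABC123's prefix.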
RAG Integration Summary
| Prompt Template | Service | RAG Status | What RAG Provides |
|-----------------|---------|------------|-------------------|
| requirement-analysis.txt | ProposalService | ● Active | Relevant code to assess requirement against codebase |
| option-generation.txt | ProposalService | ● Active | Code patterns for accurate solution proposal generation |
| code-generation.txt | CodeGenerationService | ● Active | Existing conventions for consistent code output |
| self-review.txt | CodeGenerationService | ● Active | Project patterns to catch inconsistencies in generated code |
| mvp-breakdown.txt | MvpBreakdownService | ● Active | Code complexity context for accurate story points |
| test-generation.txt | CodeGenerationService | ● Active | Existing test patterns for framework-consistent tests |
Service Layer (9 Services)
Internal services with their responsibilities and key methods.
Thymeleaf templates with progressive enhancement via HTMX and SSE.
📄 layout.html
Master layout with Bootstrap 5.3, Mermaid.js v10, Prism.js v1.29 (8 languages + line-numbers), dark mode support.
📋 list.html
Dashboard view — requirement cards with status badges, priority indicators, quick actions.
➕ form.html
New requirement submission form with repo URL, branch, priority, description fields.
🔍 detail.html
Requirement detail with status stepper, analysis results, admin approve/reject buttons, audit trail, Mermaid diagrams.
⚖️ compare.html
Side-by-side comparison of 3 AI options. Solution type badge, architecture/data-flow diagrams, expandable code snippets with syntax highlighting, diff view for code changes.
🔄 pipeline.html
Real-time SSE pipeline viewer. Step-by-step progress with animated indicators for each pipeline stage.
💻 generated-code.html
Generated code file viewer with syntax highlighting and copy-to-clipboard.