In the last blog post we discussed an Autonomous SDLC system. In this one we make it autonomous in terms of error detection, classification, recovery, and KB-backed continuous improvement — zero human intervention from failure to resolution.
Solution for Self-Healing — 7 new files plus a V18 migration, built on top of every existing component: PipelineLog checkpoints, WorkflowFailedEvent, KnowledgeBaseService RAG, KnowledgeFeedbackService S3 writes, BedrockClient fallback, and the KbMaintenanceAgent schedule pattern. No new infrastructure required.
Component Deep Dive
🔍 Layer 1 — PipelineHealthMonitor
Type: @Component with two @Scheduled jobs.
Job 1 — Active Failures (every 60s): queries all RequirementStatus.FAILED requirements updated in the last 2 hours that haven't been attempted in the last 30 minutes, then publishes PipelineRecoveryRequestedEvent.
Job 2 — Silent Hangs (every 5 min): finds requirements in ANALYZING or IN_DEVELOPMENT with no PipelineLog update for more than 15 minutes and synthesizes a WorkflowFailedEvent("TIMEOUT: no pipeline progress for 15m").
Guard: ConcurrentHashMap<String, Instant> recentlyAttempted — a 30-minute TTL prevents retry storms.
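The 30-minute TTL guard can be sketched in plain Java. Note that RetryGuard and tryAcquire are illustrative names, not the actual implementation, which also consults database timestamps:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the recentlyAttempted guard: a requirement is only picked
// up for recovery again once its previous attempt is older than the TTL.
class RetryGuard {
    private static final Duration TTL = Duration.ofMinutes(30);
    private final ConcurrentHashMap<String, Instant> recentlyAttempted = new ConcurrentHashMap<>();

    /** Records and allows the attempt only if no attempt happened within the TTL window. */
    boolean tryAcquire(String reqId, Instant now) {
        Instant prev = recentlyAttempted.get(reqId);
        if (prev != null && Duration.between(prev, now).compareTo(TTL) < 0) {
            return false; // attempted < 30 min ago -> skip, prevents retry storms
        }
        recentlyAttempted.put(reqId, now);
        return true;
    }
}
```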
📋 Layer 2 — Checkpoint Ledger
V18 Migration adds: retry_count INT DEFAULT 0 and recovery_session_id VARCHAR(36) to pipeline_logs, plus recovery_attempt_count INT and last_recovery_at TIMESTAMP to requirements.
New repo method: findTopByRequirementIdAndStatusOrderByStepOrderDesc(reqId, COMPLETED) — returns the last completed step.
Skip logic in loops: alreadyCompleted(reqId, "CODE_GEN_STORY_4") checks for a COMPLETED PipelineLog with that step name. On resume, stories 1–3 are skipped in microseconds and story 4 is retried from scratch.
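The resume-time skip can be illustrated with a stub standing in for the PipelineLog query. Here a Set plays the role of COMPLETED rows, and class and method names other than alreadyCompleted are hypothetical:

```java
import java.util.Set;

// Sketch of the checkpoint-ledger resume: a story loop consults the set of
// COMPLETED step names and skips any step already recorded.
class CheckpointResume {
    private final Set<String> completedSteps; // stands in for COMPLETED PipelineLog rows

    CheckpointResume(Set<String> completedSteps) { this.completedSteps = completedSteps; }

    boolean alreadyCompleted(String step) { return completedSteps.contains(step); }

    /** Returns the first story index (1-based) that still needs work. */
    int firstPendingStory(int totalStories) {
        for (int i = 1; i <= totalStories; i++) {
            if (!alreadyCompleted("CODE_GEN_STORY_" + i)) return i;
        }
        return totalStories + 1; // everything already done
    }
}
```

With stories 1–3 completed, the loop resumes directly at story 4 without re-running earlier steps.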
⚙️ Layer 3 — PipelineRecoveryService
Entry point: @EventListener on WorkflowFailedEvent, running on @Async("jarvisTaskExecutor").
Flow: (1) check retry guard → (2) query KB for a past fix → (3) ErrorClassifier.classify(step, errorMsg) → RecoveryStrategy enum → (4) apply strategy with backoff → (5) validate success → (6) publish PipelineRecoveredEvent or PipelineRecoveryExhaustedEvent.
Max retries per strategy: TRANSIENT=3, MALFORMED=2, GIT_CONFLICT=3, STALE_CLONE=2, UNKNOWN=1.
Reuses: BedrockClient.invokeWithFallback(), GitHubClient.createBranch(), existing clone/delete logic.
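The retry caps and exponential backoff above can be sketched as a small policy class. The values come from the text; the class itself is illustrative (the real code uses a RecoveryStrategy enum rather than strings):

```java
import java.util.Map;

// Sketch of the per-strategy retry policy: caps from the design above,
// plus the 2s -> 4s -> 8s exponential backoff used for transient errors.
class RetryPolicy {
    static final Map<String, Integer> MAX_RETRIES = Map.of(
        "TRANSIENT", 3, "MALFORMED", 2, "GIT_CONFLICT", 3, "STALE_CLONE", 2, "UNKNOWN", 1);

    /** Exponential backoff: attempt 1 -> 2000 ms, 2 -> 4000 ms, 3 -> 8000 ms. */
    static long backoffMillis(int attempt) {
        return 2000L << (attempt - 1);
    }

    static boolean canRetry(String strategy, int attemptsSoFar) {
        return attemptsSoFar < MAX_RETRIES.getOrDefault(strategy, 1);
    }
}
```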
🧠 Layer 4 — KB Lookup
Called before every recovery attempt: knowledgeBaseService.resolveFromKB("pipeline failure step:CODE_GEN_STORY error:TRANSIENT_BEDROCK", reqId).
Metadata filter: source-uri startsWith s3://.../learnings/.../incidents/.
Threshold: confidence ≥ 0.80 → use the documented strategy; < 0.80 → fall back to the default ErrorClassifier strategy.
KbIncidentMatch parses the frontmatter fields fix-strategy, backoff-ms, max-retries, and PipelineRecoveryService applies them directly.
Zero new AWS resources — reuses the existing KB ID, AOSS index, and embeddings model.
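The frontmatter extraction can be sketched as below. This is an illustrative hand-rolled parser for simple "key: value" lines only; the actual KbIncidentMatch implementation may well use a YAML library instead:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: pull key-value pairs out of a "---"-delimited YAML frontmatter block
// (fix-strategy, backoff-ms, max-retries) from a retrieved incident document.
class IncidentFrontmatter {
    static Map<String, String> parse(String markdown) {
        Map<String, String> fields = new HashMap<>();
        boolean inFrontmatter = false;
        for (String line : markdown.split("\n")) {
            if (line.trim().equals("---")) {
                if (inFrontmatter) break; // closing delimiter: stop before the body
                inFrontmatter = true;
                continue;
            }
            if (inFrontmatter && line.contains(":")) {
                int i = line.indexOf(':');
                fields.put(line.substring(0, i).trim(), line.substring(i + 1).trim());
            }
        }
        return fields;
    }
}
```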
📚 Layer 5 — KbHealingFeedbackService
New class extending the existing S3 upload pattern from KnowledgeFeedbackService. Listens to 3 events: WorkflowFailedEvent (stub), PipelineRecoveredEvent (complete), PipelineRecoveryExhaustedEvent (escalation).
Upload path: s3://.../learnings/{reqId}/incidents/{yyyyMMdd-HHmmss}-{step}.md
YAML frontmatter enables metadata filtering in the KB: type: pipeline-incident, step, error-class, fix-strategy, resolved: true/false.
After upload → knowledgeBaseService.startSync() → new incident indexed within ~30s
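Building the incident object key from the path pattern above can be sketched as follows. The bucket and prefix stay elided as in the original, so only the key suffix is produced; class and method names are hypothetical:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Sketch: compose the S3 key suffix for an incident document following
// learnings/{reqId}/incidents/{yyyyMMdd-HHmmss}-{step}.md
class IncidentKey {
    private static final DateTimeFormatter TS = DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss");

    static String keyFor(String reqId, String step, LocalDateTime when) {
        return "learnings/" + reqId + "/incidents/" + TS.format(when) + "-" + step + ".md";
    }
}
```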
End-to-end recovery timeline (example: a Bedrock ThrottlingException during story 4):
0s → CODE_GEN_STORY_4 throws ThrottlingException → stepFailed() persists PipelineLog
0s → WorkflowFailedEvent published → KbHealingFeedbackService writes incident stub to S3
≤60s → PipelineHealthMonitor detects FAILED status → publishes PipelineRecoveryRequestedEvent
+1s → PipelineRecoveryService.onRecovery() → KB lookup: "pipeline failure step:CODE_GEN_STORY error:TRANSIENT_BEDROCK"
+2s → KB HIT (2nd+ occurrence): past incident found, confidence=0.87 → apply: 4s backoff, retry once
       OR KB MISS (1st occurrence): ErrorClassifier → TRANSIENT_BEDROCK → default 2s→4s→8s backoff
+4s → getLastCheckpoint() → order 113 (story 3 done) → alreadyCompleted() checks stories 1–3 = skip
+5s → Retry story 4 → Bedrock invoked → success
+8s → Stories 5–8 continue normally → PR created → PipelineRecoveredEvent published
+9s → KbHealingFeedbackService completes incident.md → knowledgeBaseService.startSync()
+40s → KB re-indexed → next ThrottlingException anywhere → instant KB-guided recovery
Design Principles
✓ Zero New Infrastructure
Reuses existing thread pool, KB, S3, SSE stream, PipelineLog. Flyway V18 is the only schema change. No Redis, no Kafka, no distributed lock manager needed.
✓ Idempotent Resume
alreadyCompleted() is the single guard. A story that succeeded before recovery will never be re-run, guaranteeing no duplicate commits or duplicate JIRA transitions.
✓ Self-Improving KB
Every incident enriches the KB. The first failure of a given class is trial-and-error; every subsequent identical failure is resolved instantly from the KB, so recovery time for recurring failures keeps dropping as incidents accumulate.
⚠️ Non-Blocking Side Effects
All KB writes, incident uploads, and sync triggers are @Async. A KB outage during recovery does NOT block the recovery itself — the pipeline resumes regardless.
⚠️ Guardrail: Max Retries
Strict per-strategy retry caps prevent runaway loops. After exhaustion: RequirementStatus.FAILED is set permanently, Teams alert sent, full incident logged. Human can restart manually.
📡 Full Observability
Every recovery attempt creates a PipelineLog entry visible in the SSE pipeline viewer with step name RECOVERY_ATTEMPT_N. Users see healing happen in real time in the UI.
Spring Boot monolith orchestrating AWS AI services, Git providers, and notification channels.
☕ Runtime
Java 21 LTS on Spring Boot 3.3.5. Embedded Tomcat, Spring MVC, Spring Data JPA, Flyway migrations, async event bus.
🤖 AI Engine
Amazon Bedrock with Nova Pro v1 (primary) and Nova Lite v1 (fallback). 5 prompt templates for analysis, options, cost estimation, code generation, and plan generation.
📚 RAG Pipeline
Bedrock Knowledge Base (XXXXXXX) backed by OpenSearch Serverless vector index. Titan Embeddings v2 for semantic code search.
🔀 Git Integration
JGit 6.10 for clone/commit/push. OkHttp 4.12 for GitHub & Bitbucket REST APIs. Auto PR creation with generated code.
💾 Database
H2 in file mode (./data/XXXXX) with Flyway migrations V1–V13. 8 JPA entities. Web console at /h2-console.
🖥️ Frontend
Thymeleaf server-rendered templates. Bootstrap 5 UI, HTMX for dynamic updates, Mermaid.js for diagrams, Prism.js for syntax highlighting, SSE for live pipeline streaming.
Workflow Pipeline (18 States)
End-to-end lifecycle from requirement submission to deployed code with Pull Request.
Status Transition Table
| From | To | Trigger | Service |
|------|----|---------|---------|
| SUBMITTED | ANALYZING_REQUIREMENT | Auto (on submit) | RequirementService |
| ANALYZING_REQUIREMENT | ANALYSIS_COMPLETE | Bedrock response parsed | ProposalService |
| ANALYSIS_COMPLETE | GENERATING_OPTIONS | Auto (event-driven) | ProposalService |
| GENERATING_OPTIONS | OPTIONS_READY | 3 options stored | ProposalService |
| OPTIONS_READY | OPTION_SELECTED | User selects option | RequirementController |
| OPTION_SELECTED | ESTIMATING_COST | Auto (event-driven) | CostEstimationService |
| ESTIMATING_COST | PENDING_APPROVAL | Cost estimate saved | CostEstimationService |
| PENDING_APPROVAL | APPROVED | Admin approval | ApprovalService |
| PENDING_APPROVAL | REJECTED | Admin rejection | ApprovalService |
| APPROVED | PLAN_GENERATION | Auto or manual trigger | CodeGenerationService |
| PLAN_GENERATION | CLONING_REPO | Plan generated | CodeGenerationService |
| CLONING_REPO | INGESTING_TO_KB | Repo cloned + S3 uploaded | GitService + S3 |
| INGESTING_TO_KB | GENERATING_CODE | KB ingestion complete | KnowledgeBaseService |
| GENERATING_CODE | CODE_GENERATED | All files generated | CodeGenerationService |
| CODE_GENERATED | CREATING_PR | Auto | GitService |
| CREATING_PR | COMPLETED | PR created successfully | GitService |
Data Model (8 Entities)
JPA entities with Flyway-managed schema (V1–V13). H2 file-mode database.
Event-Driven Architecture (10 Events)
Spring ApplicationEvents with @Async processing on ThreadPoolTaskExecutor (core=4, max=8).
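The bounded pool above (core=4, max=8) can be approximated with a plain-JDK construction; the app itself uses Spring's ThreadPoolTaskExecutor, and the keep-alive and queue capacity below are assumptions, not values from the source:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Plain-JDK equivalent of the async event bus pool (core=4, max=8).
class EventBusPool {
    static ThreadPoolExecutor create() {
        return new ThreadPoolExecutor(
            4, 8,                            // core and max pool size, as configured
            60L, TimeUnit.SECONDS,           // idle keep-alive for non-core threads (assumed)
            new LinkedBlockingQueue<>(100)); // bounded queue filled before extra threads spawn (assumed capacity)
    }
}
```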
AWS Services (eu-west-1)
All AWS services used, their configuration IDs, and how they connect.
| Prompt Template | Service | Purpose |
|-----------------|---------|---------|
| option-generation.txt | ProposalService | Generate 3 options with Mermaid diagrams, code snippets, RAG-enriched context |
| code-generation.txt | CodeGenerationService | Generate code files using RAG-retrieved codebase patterns and conventions |
| self-review.txt | CodeGenerationService | AI code review with RAG context for consistency validation |
| mvp-breakdown.txt | MvpBreakdownService | Generate MVP tree with RAG-informed story points and task granularity |
| test-generation.txt | CodeGenerationService | Generate tests matching existing test patterns via RAG retrieval |
Knowledge Base & RAG (8 Features)
Retrieval-Augmented Generation — enriching every AI prompt with real codebase context from AWS Bedrock Knowledge Base.
All 8 KB Enhancements — Detailed Breakdown
① RAG Enabled by Default
Config: XXX.rag.enabled flipped from false → true
The entire RAG pipeline — S3 upload → KB ingestion → vector retrieval → prompt injection — was already implemented but gated behind a disabled feature flag. Enabling it activates the full pipeline: every new requirement now has its cloned repository uploaded to S3, synced to KB, and used for vector-searched code retrieval during AI analysis.
The code-generation.txt prompt now includes a {{RAG_CONTEXT}} section. Before generating code, the system retrieves existing code patterns, import styles, error handling conventions, and file structures from the KB. This ensures generated code follows the project's existing conventions rather than generic best practices.
Flow: KB retrieve → inject as "Relevant Code from Knowledge Base" → Bedrock generates consistent code
The mvp-breakdown.txt prompt is now enriched with retrieved code from the KB. When generating the MVP tree (user stories → tasks → subtasks), the AI can see the actual codebase complexity, which results in more accurate story point estimates, better task-to-file mapping, and correct identification of affected files.
Flow: KnowledgeBaseService.retrieveAsContext() → inject into mvp-breakdown prompt → more accurate planning
The test-generation.txt prompt now receives codebase context via RAG. The AI retrieves existing test files to learn the project's test framework choice (JUnit 5, Mockito, etc.), naming conventions (shouldDoX_whenY), assertion styles, and mock patterns. Generated tests then match the project's existing test suite.
Flow: Retrieve existing test files via KB → inject test patterns → Bedrock generates consistent tests
The self-review.txt prompt is enriched with real codebase patterns retrieved from the KB. When the AI reviews its own generated code, it can now compare against the actual project's patterns — catching inconsistencies like different error handling approaches, wrong import styles, or missing patterns that other files in the project use.
Flow: Retrieve codebase patterns → compare against generated code → catch deviations and security issues
Every service that calls Bedrock now has KnowledgeBaseService injected as a dependency. Before each AI invocation, the service calls knowledgeBaseService.retrieveAsContext(query, reqId) to fetch relevant code chunks, which are then passed to the prompt builder's ragContext parameter.
| Service | KB Method Called | When |
|---------|------------------|------|
| ProposalService | retrieveAsContext() | Each analysis + option generation round |
| MvpBreakdownService | retrieveAsContext() | Before MVP tree generation |
| CodeGenerationService | retrieveAsContext() | Before code generation (Phase 2) |
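The injection pattern described above can be sketched as follows. The interface mirrors the KnowledgeBaseService named in the text, but the signature and the PromptAssembler class are assumptions for illustration:

```java
// Sketch: every Bedrock-calling service fetches RAG context first, then hands
// it to the prompt builder, which fills the {{RAG_CONTEXT}} section.
interface KnowledgeBaseService {
    String retrieveAsContext(String query, String reqId);
}

class PromptAssembler {
    private final KnowledgeBaseService kb;

    PromptAssembler(KnowledgeBaseService kb) { this.kb = kb; }

    /** Retrieve relevant code chunks, then inject them into the prompt template. */
    String buildPrompt(String template, String query, String reqId) {
        String ragContext = kb.retrieveAsContext(query, reqId); // called before every AI invocation
        return template.replace("{{RAG_CONTEXT}}", ragContext);
    }
}
```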
⑦ Cross-Requirement Learning
When a requirement reaches the PR_CREATED stage (pipeline completion), the new KnowledgeFeedbackService automatically captures the entire decision trail — requirement description, selected solution option, approach, risk assessment, affected files, and MVP breakdown — as a structured Markdown document and uploads it to S3 under the learnings/ prefix.
After upload, it triggers a KB re-sync job so the learning gets indexed. On future requirements, the KB can now retrieve past decisions: "For a similar feature last month, the team chose approach X with Y story points and Z files were affected."
File: KnowledgeFeedbackService.java — listens for PRCreatedEvent, uploads to S3, triggers KB sync
⑧ KB Admin Dashboard
A new admin page at /kb-admin provides full visibility into the Knowledge Base health. The dashboard includes three status cards (KB connection, S3 storage, cross-requirement learning), a RAG integration map showing all 6 enriched prompts, manual sync trigger, and a live RAG query tester that lets admins search the KB and inspect retrieved chunks with relevance scores.
KB retrieval now scopes vector search to the specific requirement's S3 prefix using the x-amz-bedrock-kb-source-uri metadata field. When analyzing requirement REQ-ABC123, only code chunks from that requirement's repository are returned — preventing cross-contamination when multiple repositories are indexed in the same KB.
Filter: startsWith("s3://bucket/repos/REQ-ABC123/") — falls back gracefully to unfiltered retrieval if not supported.
File: KnowledgeBaseService.retrieve()
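The scoping predicate is easy to show locally. In production the filter is applied server-side via the x-amz-bedrock-kb-source-uri metadata field; this sketch only demonstrates the same startsWith logic, with a placeholder bucket name:

```java
// Sketch: build the per-requirement S3 prefix and check whether a retrieved
// chunk's source URI falls inside that requirement's scope.
class ScopedRetrieval {
    static String prefixFor(String bucket, String reqId) {
        return "s3://" + bucket + "/repos/" + reqId + "/";
    }

    static boolean inScope(String sourceUri, String bucket, String reqId) {
        return sourceUri.startsWith(prefixFor(bucket, reqId));
    }
}
```

This is what prevents cross-contamination: chunks from REQ-XYZ999's repository never match REQ-ABC123's prefix.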
RAG Integration Summary
| Prompt Template | Service | RAG Status | What RAG Provides |
|-----------------|---------|------------|-------------------|
| requirement-analysis.txt | ProposalService | ● Active | Relevant code to assess requirement against codebase |
| option-generation.txt | ProposalService | ● Active | Code patterns for accurate solution proposal generation |
| code-generation.txt | CodeGenerationService | ● Active | Existing conventions for consistent code output |
| self-review.txt | CodeGenerationService | ● Active | Project patterns to catch inconsistencies in generated code |
| mvp-breakdown.txt | MvpBreakdownService | ● Active | Code complexity context for accurate story points |
| test-generation.txt | CodeGenerationService | ● Active | Existing test patterns for framework-consistent tests |
Service Layer (9 Services)
Internal services with their responsibilities and key methods.
Thymeleaf templates with progressive enhancement via HTMX and SSE.
📄 layout.html
Master layout with Bootstrap 5.3, Mermaid.js v10, Prism.js v1.29 (8 languages + line-numbers), dark mode support.
📋 list.html
Dashboard view — requirement cards with status badges, priority indicators, quick actions.
➕ form.html
New requirement submission form with repo URL, branch, priority, description fields.
🔍 detail.html
Requirement detail with status stepper, analysis results, admin approve/reject buttons, audit trail, Mermaid diagrams.
⚖️ compare.html
Side-by-side comparison of 3 AI options. Solution type badge, architecture/data-flow diagrams, expandable code snippets with syntax highlighting, diff view for code changes.
🔄 pipeline.html
Real-time SSE pipeline viewer. Step-by-step progress with animated indicators for each pipeline stage.
💻 generated-code.html
Generated code file viewer with syntax highlighting and copy-to-clipboard.