Hem's Blog: May 2026

Tuesday, 5 May 2026

Implementation of Autonomy Intelligence Features

v2.0 — Self-Learning Pipeline

These features transform the tool from a reactive code generator into a self-improving autonomous system. Each capability feeds the results of earlier or concurrent runs back into the pipeline, reducing human intervention and improving output quality over time.

1. Closed-Loop Outcome Scoring

Foundational Layer

Technical Architecture

Entity: OutcomeCard — JPA entity in H2/PostgreSQL
Service: OutcomeScoringService — generates structured outcome cards after QA
Trigger: PRCreatedEvent listener auto-generates cards
Storage: H2 table outcome_cards with indexed lookups by tech stack
AI Integration: Bedrock generates "lessons learned" via meta-analysis prompt
Retrieval: During Stage 2, top-N matching cards injected into analysis prompt
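A minimal sketch of the Stage 2 retrieval step, assuming each card carries a set of tech-stack tags, a success flag, and a one-line lesson. The `OutcomeCard` record fields, `OutcomeCardRetriever`, and the overlap-count ranking are hypothetical illustrations, not the actual entity or query:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified stand-in for the OutcomeCard JPA entity.
record OutcomeCard(String reqId, Set<String> techStack, boolean success, String lesson) {}

class OutcomeCardRetriever {
    // Ranks cards by how many tech-stack tags they share with the new requirement
    // and returns the top N that share at least one tag.
    static List<OutcomeCard> topMatches(List<OutcomeCard> cards, Set<String> stack, int n) {
        return cards.stream()
                .filter(c -> overlap(c, stack) > 0)
                .sorted(Comparator.comparingLong((OutcomeCard c) -> overlap(c, stack)).reversed())
                .limit(n)
                .toList();
    }

    private static long overlap(OutcomeCard card, Set<String> stack) {
        return card.techStack().stream().filter(stack::contains).count();
    }

    // Renders the selected cards as a prompt section, labelling what worked and what didn't.
    static String toPromptSection(List<OutcomeCard> cards) {
        StringBuilder sb = new StringBuilder("PAST OUTCOMES:\n");
        for (OutcomeCard c : cards) {
            sb.append(c.success() ? "- WORKED: " : "- FAILED: ")
              .append(c.lesson())
              .append('\n');
        }
        return sb.toString();
    }
}
```

Labelling failures explicitly is what turns the knowledge base from "here's similar code" into "here's what worked and what didn't".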

Business Impact

Reduces analysis time — past successes/failures guide decisions instantly
Avoids repeated failures — if an approach failed before, system warns proactively
Quantifies ROI — tracks time-to-complete and fix iterations across all requirements
Enables trend analysis — success rates by tech stack reveal strengths/weaknesses
Transforms KB from "here's similar code" to "here's what worked and what didn't"

2. Prompt Evolution via Meta-Learning

Adaptive Intelligence

Technical Architecture

Service: PromptEvolutionService — tracks prompt→outcome correlations
Blind Spot Detection: Finding categories that still appear in >50% of runs are flagged, because the prompt rules meant to prevent them are evidently ineffective
Amendment Generation: Bedrock meta-prompt generates ALWAYS/NEVER rules
Safety Gate: All amendments are STAGED for human review before activation
Metrics: Per-template avg QA score, avg iterations, total findings tracked
Trigger: Auto-runs after every N requirements (configurable threshold)
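The >50% blind-spot rule can be sketched as a frequency count over recent runs. This assumes each run is summarized as the set of finding categories its QA report contained; `BlindSpotDetector` is a hypothetical name:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class BlindSpotDetector {
    // Returns finding categories that appear in strictly more than half of the runs.
    // Each run is represented by the set of categories its QA report contained.
    static Set<String> blindSpots(List<Set<String>> runs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> run : runs) {
            for (String category : run) {
                counts.merge(category, 1, Integer::sum);
            }
        }
        Set<String> flagged = new TreeSet<>();
        counts.forEach((category, count) -> {
            if (count * 2 > runs.size()) { // integer form of count / runs > 0.5
                flagged.add(category);
            }
        });
        return flagged;
    }
}
```

Flagged categories would then be handed to the Bedrock meta-prompt, and any ALWAYS/NEVER amendments it proposes stay staged until a human approves them.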

Business Impact

Self-improving prompts — system gets smarter with every requirement processed
Addresses "stale rules" — detects when prompt rules aren't preventing errors
Reduces QA iterations — better prompts → fewer errors → less fix loop work
Human-in-the-loop safety — amendments never go live without approval
Measurable improvement — before/after metrics prove prompt evolution value

4. Proactive Defect Prediction

Pre-Generation Intelligence

Technical Architecture

Service: DefectPredictionService — pre-generation risk analysis
AI Prompt: Bedrock predicts error categories by requirement + tech stack
Pattern Matching: Cross-references full Error Pattern Library for weighted prevention
Guard Code: Auto-generates defensive code snippets for HIGH-risk areas
File Risk Scoring: Identifies historically problematic files for extra scrutiny
Integration Point: Runs in CodeGenerationService before story loop starts
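File risk scoring can be approximated by counting how often each file has appeared in past QA findings. This is a sketch under that assumption only; `FileRiskScorer` and the `minFindings` cut-off are hypothetical, and the real service may also weight by severity or recency:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FileRiskScorer {
    // Counts historical QA findings per file (one list entry per finding) and returns
    // files with at least `minFindings` findings, most problematic first.
    static List<String> highRiskFiles(List<String> findingFiles, int minFindings) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String file : findingFiles) {
            counts.merge(file, 1, Integer::sum);
        }
        return counts.entrySet().stream()
                .filter(e -> e.getValue() >= minFindings)
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .toList();
    }
}
```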

Business Impact

Shift-left quality — catches likely defects BEFORE code is generated
Reduces fix iterations — predicted problems are prevented at source
Smarter resource allocation — high-risk requirements get more defensive code
Quantifiable ROI — track defect prediction accuracy vs actual QA findings
Cost savings — fewer LLM calls for fix loops = lower Bedrock costs

5. Autonomous Self-Resolution (Stage 2)

Human Touchpoint Reduction

Technical Architecture

Service: SelfResolutionService — multi-strategy question resolver
4 Resolution Strategies: KB search, past Q&A matching, repo context inference, AI reasoning
Confidence Scoring: Each resolution gets a 0.0–1.0 confidence; threshold = 0.7
Integration: Hooks into ProposalService.runAnalysis() before user escalation
Outcome Cards: Uses historical outcomes for additional reasoning context
Safety: All auto-resolutions are marked with source + confidence for audit trail
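The resolver's control flow can be sketched as follows: run every strategy, keep the most confident answer, and escalate to the user when nothing reaches 0.7. The `Strategy` interface, the `Resolution` record, and treating the threshold as inclusive (`>= 0.7`) are assumptions:

```java
import java.util.List;
import java.util.Optional;

class SelfResolver {
    static final double THRESHOLD = 0.7;

    // An answer plus the strategy that produced it and its confidence (0.0 to 1.0),
    // kept together so every auto-resolution has an audit trail.
    record Resolution(String answer, String source, double confidence) {}

    // One resolution strategy (e.g. KB search, past Q&A matching); empty if it has no answer.
    interface Strategy {
        Optional<Resolution> tryResolve(String question);
    }

    // Returns the most confident resolution, or empty (escalate to the user)
    // when no strategy reaches the threshold.
    static Optional<Resolution> resolve(String question, List<Strategy> strategies) {
        Resolution best = null;
        for (Strategy s : strategies) {
            Optional<Resolution> r = s.tryResolve(question);
            if (r.isPresent() && (best == null || r.get().confidence() > best.confidence())) {
                best = r.get();
            }
        }
        if (best != null && best.confidence() >= THRESHOLD) {
            return Optional.of(best);
        }
        return Optional.empty();
    }
}
```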

Business Impact

60% fewer user interruptions — most questions answered autonomously
Faster pipeline throughput — no wait for human Q&A responses
Better user experience — users confirm assumptions vs answering from scratch
Knowledge compounds — each answered Q becomes future resolution context
Audit trail preserved — all auto-resolutions logged with confidence + source

6. Fix Strategy Memory (QA Intelligence)

Fix Loop Acceleration

Technical Architecture

Entity: FixStrategy — JPA entity tracking fix approaches
Service: FixStrategyService — records successes/failures, retrieves best strategies
Integration: QaFixLoopService queries strategy memory before each fix attempt
Matching: By finding category + framework, with fallback to category-only
Scoring: Tracks success rate + avg score improvement per strategy
Prompt Injection: "FIX STRATEGY MEMORY" section with proven patterns
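A sketch of the lookup described above: exact (category, framework) matches first, falling back to category-only, ranked by success rate. The record fields are assumptions about what the `FixStrategy` entity stores:

```java
import java.util.Comparator;
import java.util.List;

class FixStrategyLookup {
    // Hypothetical, simplified view of a FixStrategy row.
    record FixStrategy(String category, String framework, String approach, int successes, int attempts) {
        double successRate() {
            return attempts == 0 ? 0.0 : (double) successes / attempts;
        }
    }

    // Returns up to `limit` strategies for a finding, preferring exact
    // category+framework matches and falling back to category-only matches.
    static List<FixStrategy> bestStrategies(List<FixStrategy> memory, String category,
                                            String framework, int limit) {
        List<FixStrategy> exact = memory.stream()
                .filter(s -> s.category().equals(category) && s.framework().equals(framework))
                .toList();
        List<FixStrategy> candidates = exact.isEmpty()
                ? memory.stream().filter(s -> s.category().equals(category)).toList()
                : exact;
        return candidates.stream()
                .sorted(Comparator.comparingDouble(FixStrategy::successRate).reversed())
                .limit(limit)
                .toList();
    }
}
```

The returned approaches are what would be rendered into the "FIX STRATEGY MEMORY" prompt section.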

Business Impact

Faster convergence — fix loop resolves issues in 1 iteration vs 2-3
Reduced Bedrock costs — fewer fix iterations = fewer API calls
Knowledge retention — proven fixes persist even as prompts evolve
Compounding intelligence — each fix makes future fixes faster
Measurable: Track avg iterations before vs after strategy memory activation

7. Cross-Requirement Dependency Graph

Conflict Prevention

Technical Architecture

Entity: FileRegistryEntry — tracks file→requirement→branch mappings
Service: DependencyGraphService — conflict detection + resolution
Registration: Auto-registers files at code generation start
Detection: Before generating, queries for overlapping in-flight modifications
Resolution: Injects conflict context into generation prompt for compatible output
Lifecycle: Files marked COMPLETED after PR merge
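The detection step can be sketched over an in-memory list of registry entries; in the described system this would be a query against `FileRegistryEntry` rows. The field names and the `IN_FLIGHT` status are assumptions (only `COMPLETED` is named above):

```java
import java.util.List;
import java.util.Set;

class ConflictDetector {
    enum Status { IN_FLIGHT, COMPLETED }

    // Hypothetical, simplified view of a FileRegistryEntry row.
    record FileRegistryEntry(String path, String reqId, String branch, Status status) {}

    // Returns in-flight entries from other requirements that touch any of the
    // files the new requirement plans to modify.
    static List<FileRegistryEntry> conflicts(List<FileRegistryEntry> registry,
                                             String reqId, Set<String> plannedFiles) {
        return registry.stream()
                .filter(e -> e.status() == Status.IN_FLIGHT)
                .filter(e -> !e.reqId().equals(reqId))
                .filter(e -> plannedFiles.contains(e.path()))
                .toList();
    }
}
```

Each returned entry names the other requirement and branch, which is the context the generation prompt needs to produce compatible changes.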

Business Impact

Zero merge conflicts — detected at generation time, not PR time
Parallel work enabled — multiple requirements can modify overlapping areas safely
DevOps efficiency — no manual conflict resolution, no blocked PRs
Visibility — dashboard shows all in-flight modifications at a glance
Scales with team size — more concurrent requirements handled safely

Autonomy Maturity Model

Feature	Autonomy Dimension	Human Reduction	Learning Type	Status
Outcome Scoring	Decision Intelligence	Reduces analysis guesswork by 40%	Historical Pattern	Active
Prompt Evolution	Self-Improvement	Auto-discovers blind spots	Meta-Learning	Active
Defect Prediction	Preventive Quality	30% fewer QA fix iterations	Predictive Risk	Active
Self-Resolution	Autonomous Analysis	60% fewer clarification questions	Multi-Strategy Reasoning	Active
Fix Strategy Memory	QA Intelligence	Fix loop converges 2x faster	Strategy Reinforcement	Active
Dependency Graph	Coordination Intelligence	Zero merge conflicts	Graph Awareness	Active

Friday, 1 May 2026

Autonomous QA — Architecture Diagram

⚡ Stage 6: Autonomous QA — Architecture

🔔 Trigger: PRCreatedEvent

The entire QA pipeline is event-driven. When CodeGenerationService creates a pull request on GitHub, Spring publishes a PRCreatedEvent. The QaOrchestrationService listens for this event via @EventListener and kicks off the QA pipeline asynchronously (@Async).

Activation Sequence

Step	Action	Detail
1	Event received	`PRCreatedEvent(reqId, prUrl, pagesUrl)` captured by listener
2	Poll GitHub Pages	HTTP GET to `pagesUrl` every 15s, up to 3 min timeout, waiting for HTTP 200
3	Fetch all pages	HttpClient crawls all HTML/JS/CSS from the deployed Pages site
4	Execute 5 layers	Sequential execution — each layer receives the fetched content + previous layer results
5	Aggregate & report	QaReportBuilder compiles findings → DB save → PR update → SSE broadcast
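Step 2 (poll every 15s, give up after 3 min) can be sketched with `java.net.http.HttpClient`. The status check is injected so the loop can be tested without a network; `PagesPoller` and its method names are hypothetical:

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.time.Instant;
import java.util.function.IntSupplier;

class PagesPoller {
    // Polls until the status check returns 200. Gives up (returns false) when the next
    // sleep would overshoot the deadline, or if the thread is interrupted.
    static boolean waitFor200(IntSupplier statusCheck, Duration interval, Duration timeout) {
        Instant deadline = Instant.now().plus(timeout);
        while (true) {
            if (statusCheck.getAsInt() == 200) {
                return true;
            }
            if (Instant.now().plus(interval).isAfter(deadline)) {
                return false;
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
    }

    // Live status check against a GitHub Pages URL; network errors count as "not ready".
    static IntSupplier httpStatus(HttpClient client, String pagesUrl) {
        HttpRequest request = HttpRequest.newBuilder(URI.create(pagesUrl))
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
        return () -> {
            try {
                return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
            } catch (IOException e) {
                return -1;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
        };
    }
}
```

In production this would be called as `waitFor200(httpStatus(HttpClient.newHttpClient(), pagesUrl), Duration.ofSeconds(15), Duration.ofMinutes(3))`.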

Why Event-Driven?

Decoupled — Code generation doesn't wait for QA; QA runs independently
Non-blocking — User sees PR created immediately; QA results stream in via SSE
Retry-safe — If Pages isn't ready, polling handles the delay gracefully

Layer 1: Structure Check (Pure Java)

The fastest, cheapest gate. Pure deterministic Java rules — no AI, no network calls to Bedrock. Catches deployment-breaking issues in milliseconds.

What It Checks

Check	Rule	Severity on Fail
Entry point exists	`index.html` must exist at repo root	CRITICAL
Broken references	Every `<script src>`, `<link href>`, `<img src>` must resolve to an existing file	CRITICAL
File structure	All HTML files reachable from the root; no orphaned pages	HIGH
HTML5 validity	`<!DOCTYPE html>`, `<html lang>`, `<meta charset>` present	MEDIUM
HTTP 200	GitHub Pages URL returns 200 status	CRITICAL

Engine Details

Service: StructureCheckService.java
Technique: Java HttpClient for live URL checks; regex-based HTML parsing for reference extraction
Scoring: Pass/Fail (binary) — any CRITICAL finding = layer fails, pipeline short-circuits with report
Performance: Completes in <2 seconds typically
Why first? If the site doesn't load or has broken refs, deeper analysis is pointless
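The broken-reference check can be sketched with the same regex approach, assuming the repo's file list is already known. The pattern is deliberately simple (quoted attribute values only, paths resolved against the repo root only) and `ReferenceChecker` is a hypothetical name, not the actual rule set in `StructureCheckService`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReferenceChecker {
    // Matches src/href values on <script>, <link> and <img> tags (double or single quotes).
    private static final Pattern REF = Pattern.compile(
            "<(?:script|link|img)\\b[^>]*?\\b(?:src|href)\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns local references that do not resolve to a file in the repo.
    static List<String> brokenRefs(String html, Set<String> repoFiles) {
        List<String> broken = new ArrayList<>();
        Matcher m = REF.matcher(html);
        while (m.find()) {
            String ref = m.group(1);
            if (ref.matches("(?i)^(https?:)?//.*|^data:.*")) {
                continue; // external or inline resources are out of scope here
            }
            String path = ref.startsWith("/") ? ref.substring(1) : ref;
            if (!repoFiles.contains(path)) {
                broken.add(ref);
            }
        }
        return broken;
    }
}
```

Any non-empty result would become a CRITICAL finding, which short-circuits the pipeline.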

Layer 2: Security Audit (Hybrid)

Two-pass security analysis modeled on OWASP Top 10. First pass: fast static regex rules catch known patterns. Second pass: Bedrock deep analysis for nuanced vulnerabilities that pattern matching misses.

Pass 1: Static Rules (Java)

Rule	Pattern	Maps to OWASP
Inline JavaScript detection	`onclick=`, `javascript:`, `eval(`	A03: Injection / XSS
Credential exposure	`password` in URL params, hardcoded tokens, `localStorage` for secrets	A07: Auth Failures
Form action validation	Forms with `method="GET"` containing password fields	A04: Insecure Design
Open redirect	Unvalidated `window.location` assignments from URL params	A01: Broken Access
Missing security headers	No CSP meta tag, no `X-Frame-Options`	A05: Security Misconfig
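Pass 1 amounts to a table of regex rules, each mapped to an OWASP category. This sketch covers only the three patterns in the first row of the table; the record names are hypothetical, and a real rule set would need many more patterns plus false-positive handling:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StaticSecurityRules {
    record Rule(String name, Pattern pattern, String owasp) {}
    record Finding(String rule, String owasp, int offset) {}

    // Illustrative rules only: the inline-JavaScript patterns from the table.
    static final List<Rule> RULES = List.of(
            new Rule("Inline onclick handler",
                    Pattern.compile("\\bonclick\\s*=", Pattern.CASE_INSENSITIVE), "A03: Injection"),
            new Rule("javascript: URL",
                    Pattern.compile("javascript\\s*:", Pattern.CASE_INSENSITIVE), "A03: Injection"),
            new Rule("eval call",
                    Pattern.compile("\\beval\\s*\\("), "A03: Injection"));

    // Scans source text and reports every match with its character offset.
    static List<Finding> scan(String source) {
        List<Finding> findings = new ArrayList<>();
        for (Rule rule : RULES) {
            Matcher m = rule.pattern().matcher(source);
            while (m.find()) {
                findings.add(new Finding(rule.name(), rule.owasp(), m.start()));
            }
        }
        return findings;
    }
}
```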

Pass 2: Bedrock Deep Analysis

Prompt: qa-security-review.txt — sends full HTML+JS source to Bedrock
AI analyzes: Authentication flow logic, session management, data sanitization patterns, DOM manipulation safety, third-party script risks
Output: JSON array of findings with severity, owaspCategory, location, remediation

Scoring

Security Score: 0–10 scale (10 = no findings)
Each finding deducts from a starting score of 10: CRITICAL −3 points, HIGH −2, MEDIUM −1, LOW −0.5; the score cannot drop below 0
Gate threshold: Advisory only (no blocking) — but CRITICAL findings highlighted in PR
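The deduction scheme above as code. Clamping at 0 is inferred from the 0–10 range rather than stated; `SecurityScore` is a hypothetical name:

```java
import java.util.List;

class SecurityScore {
    enum Severity { CRITICAL, HIGH, MEDIUM, LOW }

    static double penalty(Severity s) {
        return switch (s) {
            case CRITICAL -> 3.0;
            case HIGH -> 2.0;
            case MEDIUM -> 1.0;
            case LOW -> 0.5;
        };
    }

    // Starts at 10 (no findings) and subtracts one penalty per finding, never going below 0.
    static double score(List<Severity> findings) {
        double total = 10.0;
        for (Severity s : findings) {
            total -= penalty(s);
        }
        return Math.max(0.0, total);
    }
}
```

For example, one CRITICAL, one HIGH and one LOW finding give 10 − 3 − 2 − 0.5 = 4.5.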

Layer 3: Functional E2E Tests (Bedrock AI)

Since the generated apps are static GitHub Pages sites (HTML/CSS/JS only), traditional browser automation (Selenium/Playwright) is overkill. Instead, Bedrock AI reads the complete source code and mentally simulates user journeys — tracing event handlers, form submissions, navigation flows, and state management.

What Bedrock Simulates

Journey	What AI Traces	Expected Behavior
Login flow	Form submit handler → validation → redirect → session storage	Invalid creds show error; valid creds redirect to home
Navigation	Anchor hrefs, `window.location`, back/forward logic	All links navigate to existing pages; no dead ends
CRUD operations	DOM manipulation, `localStorage` read/write, event chains	Add/edit/delete reflect in UI; data persists across page loads
Auth guards	`sessionStorage`/`localStorage` checks on page load	Unauthenticated users redirected to login
Error handling	Try/catch blocks, error display elements, edge cases	Graceful degradation; user-visible messages

Why Bedrock-Simulated vs. Real Browser?

No infrastructure: No Selenium grid, no headless Chrome, no Docker containers
Deeper analysis: AI understands intent, not just DOM state — catches logic errors a click-test would miss
Cost-effective: One Bedrock invocation covers dozens of simulated journeys
Trade-off: Cannot catch rendering bugs or CSS layout issues (Layer 4 partially covers this)

Scoring

Score: 0–10 (10 = all journeys pass)
AI returns structured JSON: { journey, steps[], result: "pass"|"fail", issue?, remediation? }

Layer 4: Accessibility Audit (Hybrid)

Checks for WCAG 2.1 Level AA conformance by combining deterministic Java checks (machine-verifiable criteria) with Bedrock analysis (criteria that need human-style judgment of context).

Pass 1: Java Rules (Deterministic)

Check	Implementation	WCAG Criterion
Image alt text	Regex: every `<img>` must have non-empty `alt`	1.1.1 Non-text Content
Form labels	Every `<input>` has associated `<label>` or `aria-label`	1.3.1 Info and Relationships
Color contrast	Parse CSS `color`/`background-color`; compute the contrast ratio from relative luminance and require ≥ 4.5:1	1.4.3 Contrast (Minimum)
Heading hierarchy	Verify `h1`→`h2`→`h3` sequence; no skips	1.3.1 Info and Relationships
Language attribute	`<html lang="...">` present	3.1.1 Language of Page
Focus styles	CSS includes `:focus` rules; no `outline: none` without replacement	2.4.7 Focus Visible
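The contrast check reduces to the WCAG 2.1 relative-luminance and contrast-ratio formulas. This sketch handles `#RRGGBB` colors only (parsing real CSS, named colors, and alpha is out of scope); `ContrastRatio` is a hypothetical name:

```java
class ContrastRatio {
    // Linearizes one 8-bit sRGB channel as defined in WCAG 2.1.
    private static double linear(int channel) {
        double c = channel / 255.0;
        return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
    }

    // Relative luminance of a "#RRGGBB" color (0 for black, 1 for white).
    static double luminance(String hex) {
        int rgb = Integer.parseInt(hex.substring(1), 16);
        int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
        return 0.2126 * linear(r) + 0.7152 * linear(g) + 0.0722 * linear(b);
    }

    // Contrast ratio (L1 + 0.05) / (L2 + 0.05), with L1 the lighter color; ranges from 1 to 21.
    static double ratio(String fg, String bg) {
        double l1 = luminance(fg), l2 = luminance(bg);
        return (Math.max(l1, l2) + 0.05) / (Math.min(l1, l2) + 0.05);
    }

    // WCAG 1.4.3 AA threshold for normal-size text.
    static boolean passesAA(String fg, String bg) {
        return ratio(fg, bg) >= 4.5;
    }
}
```

On a white background, `#767676` text passes (about 4.54:1) while `#777777` fails (about 4.48:1). Under the same criterion, large text (18pt, or 14pt bold) only needs 3:1.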

Pass 2: Bedrock Deep Review

Prompt: qa-accessibility-review.txt
AI evaluates: Semantic HTML usage, ARIA roles/states correctness, keyboard navigation completeness, screen reader experience, touch target sizing, cognitive load assessment
Key insight: Many WCAG criteria (e.g., "meaningful sequence", "consistent navigation") require human-level understanding that pure regex cannot provide

Scoring

Accessibility Score: 0–10 (weighted: Java checks 40%, Bedrock analysis 60%)
Maps each finding to specific WCAG Success Criterion with conformance level (A, AA, AAA)

Layer 5: Performance Audit (Bedrock AI)

Analyzes the asset graph and render path of the deployed site. Since these are static sites without server-side rendering, performance analysis focuses on client-side loading strategy, asset optimization, and perceived performance.

What Bedrock Analyzes

Category	Analysis	Common Findings
Asset size	Total page weight, individual file sizes, unminified detection	Unminified JS >50KB, oversized images
Render blocking	`<script>` without `defer`/`async`, CSS in `<head>` load order	Render-blocking scripts in `<head>`
Image optimization	Format analysis (PNG vs WebP), dimensions, lazy loading	Missing `loading="lazy"`, no `width`/`height`
Caching	Asset fingerprinting, cache-control headers, CDN usage	No cache busting on CSS/JS filenames
Critical render path	First paint blocking resources, inline critical CSS presence	All CSS loaded before any content renders

Why Bedrock Instead of Lighthouse?

No headless Chrome needed: Lighthouse requires a browser runtime; Bedrock works from source alone
Context-aware: AI understands that a login page's performance profile differs from a dashboard
Actionable output: AI provides specific remediation steps, not just scores
Trade-off: Cannot measure actual FCP/LCP/CLS metrics — these require real rendering

Scoring

Performance Score: 0–10
Deductions: unminified assets (−2), render-blocking scripts (−1.5), no lazy loading (−1), missing cache strategy (−1)
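The listed deductions as code. Whether a deduction applies once per category or once per offending asset isn't stated; this sketch assumes once per category, starting from 10, and `PerformanceScore` is a hypothetical name:

```java
import java.util.EnumSet;
import java.util.Set;

class PerformanceScore {
    enum Issue {
        UNMINIFIED_ASSETS(2.0),
        RENDER_BLOCKING_SCRIPTS(1.5),
        NO_LAZY_LOADING(1.0),
        MISSING_CACHE_STRATEGY(1.0);

        final double deduction;

        Issue(double deduction) {
            this.deduction = deduction;
        }
    }

    // Deducts each detected issue category once, regardless of how many assets trigger it.
    static double score(Set<Issue> issues) {
        double total = 10.0;
        for (Issue i : issues) {
            total -= i.deduction;
        }
        return Math.max(0.0, total);
    }
}
```

Under this assumption the lowest reachable score is 10 − 5.5 = 4.5, so the clamp at 0 never triggers; a per-asset interpretation would need it.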

📊 Output: Report, PR Update & SSE Broadcast

After all 5 layers complete, QaReportBuilder aggregates the findings into a unified report and produces three outputs:

1. QA Report (Database + API)

DB entities: QaReport (one per run) + QaFinding (one per issue) stored via JPA
API endpoint: GET /api/qa/{reqId} returns JSON; GET /requirements/{reqId}/qa renders HTML view
Schema: Flyway V18__qa_tables.sql — qa_report (id, req_id, overall_score, security_score, accessibility_score, performance_score, functional_score, structure_pass, created_at) + qa_finding (id, report_id, layer, severity, category, description, location, remediation)

2. PR Description Patch

Mechanism: GitHub API PATCH /repos/{owner}/{repo}/pulls/{number}
Content: Appends QA badge (overall score with color), summary table of findings per layer, and critical findings with remediation steps
Advisory only: Does not block merge — provides visibility for human reviewer
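A sketch of the PR patch with `java.net.http`. The endpoint and the JSON `body` field follow GitHub's REST "Update a pull request" API. Because that call replaces the whole description, "appending" means sending the existing body plus the QA section. Token handling and the hand-rolled JSON escaping are simplifications, and `PrDescriptionPatch` is a hypothetical name:

```java
import java.net.URI;
import java.net.http.HttpRequest;

class PrDescriptionPatch {
    // Builds (but does not send) the PATCH request that replaces the PR body.
    static HttpRequest build(String owner, String repo, int number, String newBody, String token) {
        String json = "{\"body\":\"" + escapeJson(newBody) + "\"}";
        return HttpRequest.newBuilder(URI.create(
                        "https://api.github.com/repos/" + owner + "/" + repo + "/pulls/" + number))
                .header("Accept", "application/vnd.github+json")
                .header("Authorization", "Bearer " + token)
                .header("Content-Type", "application/json")
                .method("PATCH", HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    // Minimal escaping for quotes, backslashes and common whitespace;
    // a real service would use a JSON library instead.
    static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"' -> sb.append("\\\"");
                case '\\' -> sb.append("\\\\");
                case '\n' -> sb.append("\\n");
                case '\r' -> sb.append("\\r");
                case '\t' -> sb.append("\\t");
                default -> sb.append(c);
            }
        }
        return sb.toString();
    }
}
```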

3. SSE Broadcast

Event: QA_COMPLETE sent via PipelineStreamService
Payload: Overall score, per-layer scores, critical finding count
Dashboard: Real-time update on requirement detail page — QA section appears with expandable layer results
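On the wire, a QA_COMPLETE event is one Server-Sent Events frame: an `event:` line, a `data:` line, and a terminating blank line. The sketch below only formats that frame. The JSON field names are assumptions, since `PipelineStreamService`'s actual payload and API aren't shown (in Spring such a frame is commonly emitted through `SseEmitter`):

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class QaCompleteEvent {
    // Formats one SSE frame carrying the overall score, per-layer scores and critical count.
    static String frame(double overall, Map<String, Double> layerScores, int criticalCount) {
        StringBuilder json = new StringBuilder("{\"overall\":")
                .append(String.format(Locale.ROOT, "%.1f", overall))
                .append(",\"layers\":{");
        String sep = "";
        for (var e : new TreeMap<>(layerScores).entrySet()) { // sorted for stable output
            json.append(sep).append('"').append(e.getKey()).append("\":")
                .append(String.format(Locale.ROOT, "%.1f", e.getValue()));
            sep = ",";
        }
        json.append("},\"criticalCount\":").append(criticalCount).append('}');
        return "event: QA_COMPLETE\ndata: " + json + "\n\n";
    }
}
```

Because the frame is named, a browser client would subscribe with `addEventListener("QA_COMPLETE", ...)` on its `EventSource` rather than relying on `onmessage`.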

Composite Scoring

Component	Weight	Range
Structure	Gate (must pass)	Pass / Fail
Security	30%	0–10
Functional	30%	0–10
Accessibility	25%	0–10
Performance	15%	0–10
Overall	100%	0–10
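The composite formula with the weights from the table, as a sketch. What the overall score should be when Structure fails isn't stated; this sketch assumes a failed gate yields no composite score. `CompositeScore` is a hypothetical name:

```java
import java.util.OptionalDouble;

class CompositeScore {
    // Weights from the Composite Scoring table; Structure is a gate, not a weight.
    static final double SECURITY = 0.30, FUNCTIONAL = 0.30, ACCESSIBILITY = 0.25, PERFORMANCE = 0.15;

    // Returns the weighted 0-10 score, or empty when the Structure gate failed.
    static OptionalDouble overall(boolean structurePass, double security, double functional,
                                  double accessibility, double performance) {
        if (!structurePass) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(SECURITY * security + FUNCTIONAL * functional
                + ACCESSIBILITY * accessibility + PERFORMANCE * performance);
    }
}
```

For example, layer scores of 8 (security), 9 (functional), 7 (accessibility) and 6 (performance) give 0.30×8 + 0.30×9 + 0.25×7 + 0.15×6 = 7.75.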
