⚡ Stage 6: Autonomous QA — Architecture
5-Layer Quality Gate Pipeline · Bedrock-Powered · Fully Autonomous
In my last two blogs we built the autonomous, AI-powered SDLC system and then added auto-healing. In this blog we will walk through the design of the Autonomous QA architecture.
🔔 Trigger: PRCreatedEvent
The entire QA pipeline is event-driven. When CodeGenerationService creates a pull request on GitHub, Spring publishes a PRCreatedEvent. The QaOrchestrationService listens for this event via @EventListener and kicks off the QA pipeline asynchronously (@Async).
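A minimal sketch of that wiring is shown below. Only PRCreatedEvent (with reqId, prUrl, pagesUrl), QaOrchestrationService, @EventListener, and @Async come from the pipeline as described; the record shape, helper method, and class layout are illustrative.

```java
import org.springframework.context.event.EventListener;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;

// Event payload as described in this post: requirement id, PR URL, GitHub Pages URL.
record PRCreatedEvent(String reqId, String prUrl, String pagesUrl) {}

@Service
class QaOrchestrationService {

    @Async          // QA runs on its own executor (assumes @EnableAsync on a config class)
    @EventListener  // fires when CodeGenerationService publishes the event
    public void onPrCreated(PRCreatedEvent event) {
        // 1. poll event.pagesUrl() until it returns HTTP 200 (sketched further below)
        // 2. fetch the deployed HTML/JS/CSS
        // 3. run the five layers sequentially and aggregate the report
        runQaPipeline(event.reqId(), event.pagesUrl());
    }

    private void runQaPipeline(String reqId, String pagesUrl) {
        // orchestration of the five layers lives here (omitted in this sketch)
    }
}
```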
Activation Sequence
| Step | Action | Detail |
|---|---|---|
| 1 | Event received | PRCreatedEvent(reqId, prUrl, pagesUrl) captured by listener |
| 2 | Poll GitHub Pages | HTTP GET to pagesUrl every 15s, up to 3 min timeout, waiting for HTTP 200 |
| 3 | Fetch all pages | HttpClient crawls all HTML/JS/CSS from the deployed Pages site |
| 4 | Execute 5 layers | Sequential execution — each layer receives the fetched content + previous layer results |
| 5 | Aggregate & report | QaReportBuilder compiles findings → DB save → PR update → SSE broadcast |
Why Event-Driven?
- Decoupled — Code generation doesn't wait for QA; QA runs independently
- Non-blocking — User sees PR created immediately; QA results stream in via SSE
- Retry-safe — If Pages isn't ready, polling handles the delay gracefully
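The polling in step 2 of the activation sequence can be sketched with the JDK HttpClient. The 15-second interval and 3-minute timeout come from the table above; the class and method names are illustrative.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.time.Instant;

// Poll the GitHub Pages URL every 15 s, give up after 3 minutes.
class PagesPoller {

    private static final Duration TIMEOUT = Duration.ofMinutes(3);
    private static final Duration INTERVAL = Duration.ofSeconds(15);

    private final HttpClient http = HttpClient.newHttpClient();

    boolean waitForDeployment(String pagesUrl) throws InterruptedException {
        Instant deadline = Instant.now().plus(TIMEOUT);
        HttpRequest request = HttpRequest.newBuilder(URI.create(pagesUrl)).GET().build();

        while (Instant.now().isBefore(deadline)) {
            try {
                HttpResponse<Void> response =
                        http.send(request, HttpResponse.BodyHandlers.discarding());
                if (response.statusCode() == 200) {
                    return true;            // site is live, the QA layers can start
                }
            } catch (Exception e) {
                // connection errors while Pages is still building: keep polling
            }
            Thread.sleep(INTERVAL.toMillis());
        }
        return false;                       // timed out, report failure
    }
}
```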
Layer 1: Structure Check (Pure Java)
The fastest, cheapest gate. Pure deterministic Java rules — no AI, no network calls to Bedrock. Catches deployment-breaking issues in milliseconds.
What It Checks
| Check | Rule | Severity on Fail |
|---|---|---|
| Entry point exists | index.html must exist at repo root | CRITICAL |
| Broken references | Every <script src>, <link href>, <img src> must resolve to existing file | CRITICAL |
| File structure | All HTML files reachable from the root; no orphaned pages | HIGH |
| HTML5 validity | <!DOCTYPE html>, <html lang>, <meta charset> present | MEDIUM |
| HTTP 200 | GitHub Pages URL returns 200 status | CRITICAL |
Engine Details
- Service: StructureCheckService.java
- Technique: Java HttpClient for live URL checks; regex-based HTML parsing for reference extraction (see the sketch after this list)
- Scoring: Pass/Fail (binary) — any CRITICAL finding = layer fails, pipeline short-circuits with a report
- Performance: Completes in <2 seconds typically
- Why first? If the site doesn't load or has broken refs, deeper analysis is pointless
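Here is a rough illustration of the broken-reference rule, assuming the fetched site is held as a map of repo-relative path to file content. The real internals of StructureCheckService are not shown in this post, so everything here is an approximation of the described technique.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Finds <script src>, <link href>, <img src> (and <a href>) targets that
// don't resolve to a fetched file.
class BrokenReferenceCheck {

    private static final Pattern REF =
            Pattern.compile("(?:src|href)\\s*=\\s*[\"']([^\"'#?]+)[\"']", Pattern.CASE_INSENSITIVE);

    /** @param files map of repo-relative path -> file content for the fetched site */
    List<String> findBrokenRefs(Map<String, String> files) {
        List<String> broken = new ArrayList<>();
        for (Map.Entry<String, String> page : files.entrySet()) {
            if (!page.getKey().endsWith(".html")) continue;
            Matcher m = REF.matcher(page.getValue());
            while (m.find()) {
                String target = m.group(1);
                boolean external = target.startsWith("http") || target.startsWith("//");
                if (!external && !files.containsKey(normalize(target))) {
                    broken.add(page.getKey() + " -> " + target);   // CRITICAL finding
                }
            }
        }
        return broken;
    }

    private String normalize(String path) {
        return path.startsWith("./") ? path.substring(2) : path;
    }
}
```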
Layer 2: Security Audit (Hybrid)
Two-pass security analysis modeled on OWASP Top 10. First pass: fast static regex rules catch known patterns. Second pass: Bedrock deep analysis for nuanced vulnerabilities that pattern matching misses.
Pass 1: Static Rules (Java)
| Rule | Pattern | Maps to OWASP |
|---|---|---|
| Inline JavaScript detection | onclick=, javascript:, eval( | A03: Injection / XSS |
| Credential exposure | password in URL params, hardcoded tokens, localStorage for secrets | A07: Auth Failures |
| Form action validation | Forms with method="GET" containing password fields | A04: Insecure Design |
| Open redirect | Unvalidated window.location assignments from URL params | A01: Broken Access |
| Missing security headers | No CSP meta tag, no X-Frame-Options | A05: Security Misconfig |
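As an illustration, the first pass boils down to a small table of OWASP-mapped regexes plus a couple of absence checks. The exact rule set and the owning class are not published, so the names and patterns below are assumptions based on the table above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// A handful of deterministic, OWASP-mapped rules (presence-based),
// plus one absence-based rule for the missing CSP meta tag.
class StaticSecurityRules {

    private record Rule(String owasp, String description, Pattern pattern) {}

    private static final List<Rule> PRESENCE_RULES = List.of(
            new Rule("A03", "Inline JS / XSS sink",
                    Pattern.compile("onclick\\s*=|javascript:|eval\\s*\\(", Pattern.CASE_INSENSITIVE)),
            new Rule("A07", "Secrets written to localStorage",
                    Pattern.compile("localStorage\\.setItem\\(\\s*['\"](password|token|secret)",
                            Pattern.CASE_INSENSITIVE)),
            new Rule("A04", "GET form containing a password field",
                    Pattern.compile("<form[^>]*method\\s*=\\s*['\"]get['\"][\\s\\S]*?type\\s*=\\s*['\"]password['\"]",
                            Pattern.CASE_INSENSITIVE)));

    List<String> scan(String path, String content) {
        List<String> findings = new ArrayList<>();
        for (Rule rule : PRESENCE_RULES) {
            if (rule.pattern().matcher(content).find()) {
                findings.add(rule.owasp() + " " + rule.description() + " in " + path);
            }
        }
        // Absence rule (A05): an HTML page with no CSP meta tag at all
        if (path.endsWith(".html") && !content.contains("Content-Security-Policy")) {
            findings.add("A05 Missing Content-Security-Policy meta tag in " + path);
        }
        return findings;
    }
}
```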
Pass 2: Bedrock Deep Analysis
- Prompt: qa-security-review.txt — sends full HTML+JS source to Bedrock
- AI analyzes: authentication flow logic, session management, data sanitization patterns, DOM manipulation safety, third-party script risks
- Output: JSON array of findings with severity, owaspCategory, location, remediation
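A hedged sketch of that second pass, assuming the AWS SDK for Java v2 Converse API. The model id and prompt wiring are assumptions; only the qa-security-review.txt template and the finding fields come from the pipeline as described.

```java
import software.amazon.awssdk.services.bedrockruntime.BedrockRuntimeClient;
import software.amazon.awssdk.services.bedrockruntime.model.ContentBlock;
import software.amazon.awssdk.services.bedrockruntime.model.ConversationRole;
import software.amazon.awssdk.services.bedrockruntime.model.ConverseResponse;
import software.amazon.awssdk.services.bedrockruntime.model.Message;

class BedrockSecurityReview {

    private final BedrockRuntimeClient bedrock = BedrockRuntimeClient.create();
    private final String modelId = "anthropic.claude-3-sonnet-20240229-v1:0"; // assumed model id

    /** @param promptTemplate contents of qa-security-review.txt
     *  @param source         concatenated HTML + JS of the deployed site */
    String review(String promptTemplate, String source) {
        String prompt = promptTemplate + "\n\n--- SITE SOURCE ---\n" + source;

        ConverseResponse response = bedrock.converse(req -> req
                .modelId(modelId)
                .messages(Message.builder()
                        .role(ConversationRole.USER)
                        .content(ContentBlock.fromText(prompt))
                        .build()));

        // Expected: JSON array of { severity, owaspCategory, location, remediation }
        return response.output().message().content().get(0).text();
    }
}
```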
Scoring
- Security Score: 0–10 scale (10 = no findings)
- Each CRITICAL finding: −3 points. HIGH: −2. MEDIUM: −1. LOW: −0.5
- Gate threshold: Advisory only (no blocking) — but CRITICAL findings highlighted in PR
Layer 3: Functional E2E Tests (Bedrock AI)
Since the generated apps are static GitHub Pages sites (HTML/CSS/JS only), traditional browser automation (Selenium/Playwright) is overkill. Instead, Bedrock AI reads the complete source code and mentally simulates user journeys — tracing event handlers, form submissions, navigation flows, and state management.
What Bedrock Simulates
| Journey | What AI Traces | Expected Behavior |
|---|---|---|
| Login flow | Form submit handler → validation → redirect → session storage | Invalid creds show error; valid creds redirect to home |
| Navigation | Anchor hrefs, window.location, back/forward logic | All links navigate to existing pages; no dead ends |
| CRUD operations | DOM manipulation, localStorage read/write, event chains | Add/edit/delete reflect in UI; data persists across page loads |
| Auth guards | sessionStorage/localStorage checks on page load | Unauthenticated users redirected to login |
| Error handling | Try/catch blocks, error display elements, edge cases | Graceful degradation; user-visible messages |
Why Bedrock-Simulated vs. Real Browser?
- No infrastructure: No Selenium grid, no headless Chrome, no Docker containers
- Deeper analysis: AI understands intent, not just DOM state — catches logic errors a click-test would miss
- Cost-effective: One Bedrock invocation covers dozens of simulated journeys
- Trade-off: Cannot catch rendering bugs or CSS layout issues (Layer 4 partially covers this)
Scoring
- Score: 0–10 (10 = all journeys pass)
- AI returns structured JSON: { journey, steps[], result: "pass"|"fail", issue?, remediation? }
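That JSON shape maps naturally onto a small Java record. The Jackson wiring, the List<String> type for steps, and the proportional scoring are assumptions; the field names follow the schema above.

```java
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;

class FunctionalResultParser {

    // result is "pass" or "fail"; issue/remediation are present only on failure
    record JourneyResult(String journey, List<String> steps, String result,
                         String issue, String remediation) {}

    private final ObjectMapper mapper = new ObjectMapper();

    List<JourneyResult> parse(String bedrockJson) throws Exception {
        return mapper.readValue(bedrockJson, new TypeReference<List<JourneyResult>>() {});
    }

    double score(List<JourneyResult> results) {
        long passed = results.stream().filter(r -> "pass".equals(r.result())).count();
        return results.isEmpty() ? 0.0 : 10.0 * passed / results.size();  // 0–10 scale
    }
}
```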
Layer 4: Accessibility Audit (Hybrid)
Ensures WCAG 2.1 Level AA compliance through a combination of deterministic Java checks (machine-verifiable criteria) and Bedrock analysis (human-judgment criteria that require understanding context).
Pass 1: Java Rules (Deterministic)
| Check | Implementation | WCAG Criterion |
|---|---|---|
| Image alt text | Regex: every <img> must have non-empty alt | 1.1.1 Non-text Content |
| Form labels | Every <input> has associated <label> or aria-label | 1.3.1 Info and Relationships |
| Color contrast | Parse CSS color/background-color; compute luminance ratio ≥ 4.5:1 | 1.4.3 Contrast (Minimum) |
| Heading hierarchy | Verify h1→h2→h3 sequence; no skips | 1.3.1 Info and Relationships |
| Language attribute | <html lang="..."> present | 3.1.1 Language of Page |
| Focus styles | CSS includes :focus rules; no outline: none without replacement | 2.4.7 Focus Visible |
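The contrast rule is the most involved of these deterministic checks. It relies on the standard WCAG relative-luminance formula; CSS parsing is omitted in this sketch, which assumes the foreground and background colours have already been resolved to sRGB values.

```java
// WCAG 1.4.3 contrast check using the standard relative-luminance formula.
class ContrastCheck {

    /** WCAG relative luminance of an sRGB colour (components 0–255). */
    static double luminance(int r, int g, int b) {
        return 0.2126 * channel(r) + 0.7152 * channel(g) + 0.0722 * channel(b);
    }

    private static double channel(int value) {
        double c = value / 255.0;
        return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
    }

    /** Contrast ratio between two colours; WCAG AA requires >= 4.5 for normal text. */
    static double contrastRatio(int[] fg, int[] bg) {
        double l1 = luminance(fg[0], fg[1], fg[2]);
        double l2 = luminance(bg[0], bg[1], bg[2]);
        double lighter = Math.max(l1, l2), darker = Math.min(l1, l2);
        return (lighter + 0.05) / (darker + 0.05);
    }

    static boolean passesAA(int[] fg, int[] bg) {
        return contrastRatio(fg, bg) >= 4.5;
    }
}
```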
Pass 2: Bedrock Deep Review
- Prompt: qa-accessibility-review.txt
- AI evaluates: semantic HTML usage, ARIA roles/states correctness, keyboard navigation completeness, screen reader experience, touch target sizing, cognitive load assessment
- Key insight: Many WCAG criteria (e.g., "meaningful sequence", "consistent navigation") require human-level understanding that pure regex cannot provide
Scoring
- Accessibility Score: 0–10 (weighted: Java checks 40%, Bedrock analysis 60%)
- Maps each finding to specific WCAG Success Criterion with conformance level (A, AA, AAA)
Layer 5: Performance Audit (Bedrock AI)
Analyzes the asset graph and render path of the deployed site. Since these are static sites without server-side rendering, performance analysis focuses on client-side loading strategy, asset optimization, and perceived performance.
What Bedrock Analyzes
| Category | Analysis | Common Findings |
|---|---|---|
| Asset size | Total page weight, individual file sizes, unminified detection | Unminified JS >50KB, oversized images |
| Render blocking | <script> without defer/async, CSS in <head> load order | Render-blocking scripts in <head> |
| Image optimization | Format analysis (PNG vs WebP), dimensions, lazy loading | Missing loading="lazy", no width/height |
| Caching | Asset fingerprinting, cache-control headers, CDN usage | No cache busting on CSS/JS filenames |
| Critical render path | First paint blocking resources, inline critical CSS presence | All CSS loaded before any content renders |
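For intuition, two of these findings (render-blocking scripts and missing lazy loading) are easy to express as plain-Java checks. To be clear, this is purely illustrative: the pipeline itself sends the source to Bedrock for this layer rather than running regex rules like these.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PerformanceChecks {

    private static final Pattern SCRIPT_TAG =
            Pattern.compile("<script[^>]*src=[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern IMG_TAG =
            Pattern.compile("<img[^>]*>", Pattern.CASE_INSENSITIVE);

    List<String> analyze(String html) {
        List<String> findings = new ArrayList<>();

        Matcher scripts = SCRIPT_TAG.matcher(html);
        while (scripts.find()) {
            String tag = scripts.group();
            if (!tag.contains("defer") && !tag.contains("async")) {
                findings.add("Render-blocking script: " + tag);
            }
        }

        Matcher images = IMG_TAG.matcher(html);
        while (images.find()) {
            String tag = images.group();
            if (!tag.contains("loading=")) {
                findings.add("Image without loading=\"lazy\": " + tag);
            }
        }
        return findings;
    }
}
```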
Why Bedrock Instead of Lighthouse?
- No headless Chrome needed: Lighthouse requires a browser runtime; Bedrock works from source alone
- Context-aware: AI understands that a login page's performance profile differs from a dashboard
- Actionable output: AI provides specific remediation steps, not just scores
- Trade-off: Cannot measure actual FCP/LCP/CLS metrics — these require real rendering
Scoring
- Performance Score: 0–10
- Deductions: unminified assets (−2), render-blocking scripts (−1.5), no lazy loading (−1), missing cache strategy (−1)
📊 Output: Report, PR Update & SSE Broadcast
After all 5 layers complete, QaReportBuilder aggregates findings into a unified report. Three outputs are generated simultaneously:
1. QA Report (Database + API)
- DB entities: QaReport (one per run) + QaFinding (one per issue), stored via JPA
- API endpoint: GET /api/qa/{reqId} returns JSON; GET /requirements/{reqId}/qa renders the HTML view
- Schema: Flyway V18__qa_tables.sql — qa_report(id, req_id, overall_score, security_score, accessibility_score, performance_score, functional_score, structure_pass, created_at) + qa_finding(id, report_id, layer, severity, category, description, location, remediation)
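The two entities implied by that schema might look roughly like this; the column mapping and Java types are assumptions inferred from V18__qa_tables.sql.

```java
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;
import jakarta.persistence.Table;
import java.time.Instant;

@Entity
@Table(name = "qa_report")
class QaReport {
    @Id @GeneratedValue(strategy = GenerationType.IDENTITY)
    Long id;
    String reqId;
    double overallScore;
    double securityScore;
    double accessibilityScore;
    double performanceScore;
    double functionalScore;
    boolean structurePass;
    Instant createdAt;
}

@Entity
@Table(name = "qa_finding")
class QaFinding {
    @Id @GeneratedValue(strategy = GenerationType.IDENTITY)
    Long id;
    Long reportId;      // FK to qa_report.id
    String layer;       // STRUCTURE / SECURITY / FUNCTIONAL / ACCESSIBILITY / PERFORMANCE
    String severity;    // CRITICAL / HIGH / MEDIUM / LOW
    String category;
    String description;
    String location;
    String remediation;
}
```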
2. PR Description Patch
- Mechanism: GitHub API PATCH /repos/{owner}/{repo}/pulls/{number}
- Content: Appends a QA badge (overall score with color), a summary table of findings per layer, and critical findings with remediation steps
- Advisory only: Does not block merge — provides visibility for human reviewer
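A sketch of that PATCH call using the JDK HttpClient. The endpoint and the body field come from the GitHub REST API; the token handling, helper names, and minimal JSON escaping are illustrative.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class PrDescriptionPatcher {

    private final HttpClient http = HttpClient.newHttpClient();

    void appendQaSummary(String owner, String repo, int prNumber,
                         String token, String existingBody, String qaMarkdown) throws Exception {
        // GitHub expects the full body, so the QA summary is appended to the current text
        String newBody = existingBody + "\n\n" + qaMarkdown;
        String json = "{\"body\": " + toJsonString(newBody) + "}";

        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("https://api.github.com/repos/%s/%s/pulls/%d"
                                .formatted(owner, repo, prNumber)))
                .header("Authorization", "Bearer " + token)
                .header("Accept", "application/vnd.github+json")
                .method("PATCH", HttpRequest.BodyPublishers.ofString(json))
                .build();

        http.send(request, HttpResponse.BodyHandlers.ofString());
    }

    private String toJsonString(String s) {
        // minimal escaping for the sketch; a real implementation should use a JSON library
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n") + "\"";
    }
}
```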
3. SSE Broadcast
- Event: QA_COMPLETE sent via PipelineStreamService
- Payload: Overall score, per-layer scores, critical finding count
- Dashboard: Real-time update on requirement detail page — QA section appears with expandable layer results
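PipelineStreamService itself isn't shown in this post; a plausible shape, assuming Spring MVC SseEmitters held per connected dashboard, could look like this. The payload fields follow the list above; everything else is an assumption.

```java
import org.springframework.stereotype.Service;
import org.springframework.web.servlet.mvc.method.annotation.SseEmitter;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CopyOnWriteArrayList;

@Service
class PipelineStreamService {

    private final List<SseEmitter> emitters = new CopyOnWriteArrayList<>();

    SseEmitter subscribe() {
        SseEmitter emitter = new SseEmitter(0L);          // no server-side timeout
        emitters.add(emitter);
        emitter.onCompletion(() -> emitters.remove(emitter));
        return emitter;
    }

    void broadcastQaComplete(String reqId, double overallScore,
                             Map<String, Double> layerScores, long criticalCount) {
        Map<String, Object> payload = Map.of(
                "reqId", reqId,
                "overallScore", overallScore,
                "layerScores", layerScores,
                "criticalFindings", criticalCount);

        for (SseEmitter emitter : emitters) {
            try {
                emitter.send(SseEmitter.event().name("QA_COMPLETE").data(payload));
            } catch (Exception e) {
                emitters.remove(emitter);                 // drop dead connections
            }
        }
    }
}
```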
Composite Scoring
| Component | Weight | Range |
|---|---|---|
| Structure | Gate (must pass) | Pass / Fail |
| Security | 30% | 0–10 |
| Functional | 30% | 0–10 |
| Accessibility | 25% | 0–10 |
| Performance | 15% | 0–10 |
| Overall | 100% | 0–10 |
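Put together, the composite score reduces to a weighted sum behind the structure gate. Treating a failed gate as a zero overall score is an assumption here; the post only says the pipeline short-circuits with a report.

```java
class CompositeScore {

    static double overall(boolean structurePass, double security, double functional,
                          double accessibility, double performance) {
        if (!structurePass) {
            return 0.0;   // gate failed: pipeline already short-circuited with a report
        }
        return 0.30 * security
             + 0.30 * functional
             + 0.25 * accessibility
             + 0.15 * performance;   // each input is on the 0–10 scale
    }
}
```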