Showing posts with label Architecture. Show all posts

Thursday, 30 April 2026

Self-Healing Pipeline Architecture

In the last blog, we discussed an Autonomous SDLC system. In this post, we make it autonomous in terms of error detection, classification, recovery, and KB-backed continuous improvement — zero human intervention from failure to resolution.

Solution for Self-Healing — 7 new files plus a V18 migration, built on top of every existing component: PipelineLog checkpoints, WorkflowFailedEvent, KnowledgeBaseService RAG, KnowledgeFeedbackService S3 writes, BedrockClient fallback, and the KbMaintenanceAgent schedule pattern. No new infrastructure required.
The five layers at a glance:

LAYER 1 — Detection & Watchdog
  • Pipeline Execution (ProposalService · CodeGenerationService · RepoAnalysisService · AgentLoopService): an exception triggers stepFailed(), the PipelineLog entry is persisted as FAILED, and an SSE broadcast goes out to all subscribers.
  • WorkflowFailedEvent: carries requirementId, failedStep, and errorMessage as a plain Spring ApplicationEvent.
  • PipelineHealthMonitor: @Scheduled every 60s — scans FAILED requirements; @Scheduled every 5 min — scans HUNG requirements.

LAYER 2 — Checkpoint Ledger (V18 Migration)
  • PipelineLog (enhanced): step: String · stepOrder: int · status: enum, plus retry_count: int and recovery_session_id: varchar (new in V18).
  • getLastCheckpoint(reqId): finds the last COMPLETED step order in PipelineLog — e.g. story 3 done at order 113 → resume from order 114 (story 4).
  • alreadyCompleted(reqId, stepKey): checks PipelineLogRepository for COMPLETED status on the step name. Used in the CodeGenerationService and ProposalService loops. Prevents duplicate work: stories 1–3 are skipped on resume, story 4 is retried.

LAYER 3 — Recovery Engine: PipelineRecoveryService
  • Entry point: @EventListener(WorkflowFailedEvent) · @Async("jarvisTaskExecutor"), with a retry guard: Set<String> recentlyAttempted + TTL (30 min).
  • ErrorClassifier → RecoveryStrategy:
    1. TRANSIENT_BEDROCK (ThrottlingException · timeout · 503) → exponential backoff (2s→4s→8s) + retry the same step
    2. MALFORMED_AI_RESPONSE → retry Bedrock with the suffix "Respond ONLY with valid JSON"
    3. GIT_CONFLICT (422 · ref exists) → rename branch to {name}-retry-{N}, re-attempt push
    4. STALE_CLONE → delete local dir + re-clone from remote
    5. JIRA_UNAVAILABLE → skip JIRA calls, continue the pipeline (non-critical)
    6. CONFIG_MISSING → escalate to Teams · cannot auto-fix credentials
    7. UNKNOWN → KB lookup first → if hit: apply past fix · else: retry once → escalate
  • Max retry guard: TRANSIENT: max 3 · MALFORMED: max 2 · GIT: max 3 · UNKNOWN: max 1. After max → PipelineRecoveryExhaustedEvent → Teams alert.
  • Checkpoint resume: 1. getLastCheckpoint(reqId) → stepOrder N; 2. startDevelopment(reqId, resumeFromOrder=N); 3. in the CodeGenerationService story loop, if alreadyCompleted(reqId, "CODE_GEN_STORY_"+N) continue ← skip, no duplicate commit; else → generate + commit (first time); 4. the pipeline continues naturally to PR creation.
  • New event records (v1.3): PipelineRecoveryRequestedEvent (reqId · failedStep · errorMsg · attemptNumber), PipelineRecoveredEvent (reqId · recoveredStep · strategy · attemptsNeeded · durationMs), PipelineRecoveryExhaustedEvent (reqId · step · allStrategiesFailed → escalate). All via Spring ApplicationEventPublisher — no new infra needed.

LAYER 4 — Knowledge Base Lookup (existing KnowledgeBaseService)
  • resolveFromKB("pipeline failure step:"+step+" error:"+errorClass, reqId) queries the Bedrock KB via the Retrieve API (k=20) with the metadata filter type=pipeline-incident.
  • Score ≥ 0.80 → past fix found → apply the documented strategy immediately (skip trial-and-error).
  • KbIncidentMatch (value object): strategy: RecoveryStrategy · confidence: float · backoffMs: int · maxRetries: int · notes: String (from the incident.md frontmatter). The KB lookup runs before any retry.

LAYER 5 — Continuous Improvement: KbHealingFeedbackService
  • Incident document (S3 upload) path: s3://.../learnings/{reqId}/incidents/{ts}-{step}.md. YAML frontmatter: type: pipeline-incident · step · error-class · fix-strategy · attempts · duration. Body: Error · Root Cause · Fix Applied · Prevention. Then KB sync → vector indexed → available for future RAG.
  • @EventListener coverage: WorkflowFailedEvent → write an incident stub immediately; PipelineRecoveredEvent → complete the incident doc (fix worked); PipelineRecoveryExhaustedEvent → escalation doc (fix failed). All @Async("jarvisTaskExecutor") — non-blocking. Guarded by isEnabled() — silent skip if KB/S3 is not configured.
  • Continuous improvement loop: 1st occurrence → default strategy (trial-and-error); 2nd occurrence → KB hit → apply the pre-validated fix instantly. Each cycle: incident doc → KnowledgeBaseService.startSync(). The KB is continuously enriched with real production failure data → recovery time decreases with every resolved incident.

End-to-end timeline: ① Fail (0s) → ② Detect (≤60s) → ③ KB lookup (1–2s) → ④ Classify + backoff (2–8s) → ⑤ Resume from checkpoint (0s overhead) → ⑥ Recovery complete + KB written → ⑦ Improved.

Component Deep Dive

🔍 Layer 1 — PipelineHealthMonitor

Type: @Component with two @Scheduled jobs
Job 1 — Active Failures (every 60s): Queries all RequirementStatus.FAILED requirements updated in the last 2 hours that haven't been attempted in the last 30 minutes. Publishes PipelineRecoveryRequestedEvent.
Job 2 — Silent Hangs (every 5 min): Finds requirements in ANALYZING or IN_DEVELOPMENT with no PipelineLog update for more than 15 minutes. Synthesizes a WorkflowFailedEvent("TIMEOUT: no pipeline progress for 15m").
Guard: ConcurrentHashMap<String, Instant> recentlyAttempted — TTL 30 min prevents retry storms.
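The TTL guard can be sketched in plain Java as below — a minimal illustration of the retry-storm protection described above, not the actual source (class and method names are hypothetical):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative TTL guard: an entry in recentlyAttempted blocks further
// recovery attempts for the same requirement for 30 minutes.
public class RecoveryAttemptGuard {
    private static final Duration TTL = Duration.ofMinutes(30);
    private final Map<String, Instant> recentlyAttempted = new ConcurrentHashMap<>();

    /** Records the attempt and returns true only if reqId was not attempted within the TTL. */
    public boolean tryAcquire(String reqId, Instant now) {
        Instant last = recentlyAttempted.get(reqId);
        if (last != null && Duration.between(last, now).compareTo(TTL) < 0) {
            return false; // attempted too recently — skip to avoid a retry storm
        }
        recentlyAttempted.put(reqId, now);
        return true;
    }
}
```

The same instance is shared by both scheduled jobs, so a hang detection and a failure scan can never double-fire recovery for one requirement.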

📋 Layer 2 — Checkpoint Ledger

V18 Migration adds: retry_count INT DEFAULT 0 and recovery_session_id VARCHAR(36) to pipeline_logs. Also adds recovery_attempt_count INT and last_recovery_at TIMESTAMP to requirements.
New repo method: findTopByRequirementIdAndStatusOrderByStepOrderDesc(reqId, COMPLETED) — returns last completed step.
Skip logic in loops: alreadyCompleted(reqId, "CODE_GEN_STORY_4") checks for a COMPLETED PipelineLog with that step name. On resume: stories 1–3 are skipped in microseconds, story 4 is retried from scratch.
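The skip logic can be modeled with an in-memory stand-in for the repository — this is only an illustration of the idempotent-resume idea, with the PipelineLog table replaced by a set of COMPLETED step keys:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative stand-in for the checkpoint ledger: the repository query is
// modeled as membership in a set of "reqId:stepKey" entries.
public class CheckpointLedger {
    private final Set<String> completedSteps = new HashSet<>();

    public void markCompleted(String reqId, String stepKey) {
        completedSteps.add(reqId + ":" + stepKey);
    }

    public boolean alreadyCompleted(String reqId, String stepKey) {
        return completedSteps.contains(reqId + ":" + stepKey);
    }

    /** Returns the story numbers that still need work on resume. */
    public List<Integer> storiesToRun(String reqId, int totalStories) {
        List<Integer> todo = new java.util.ArrayList<>();
        for (int n = 1; n <= totalStories; n++) {
            if (!alreadyCompleted(reqId, "CODE_GEN_STORY_" + n)) {
                todo.add(n); // not checkpointed yet — generate + commit
            }
        }
        return todo;
    }
}
```

With stories 1–3 checkpointed, only stories 4 and 5 come back from storiesToRun — exactly the resume behaviour described above.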

⚙️ Layer 3 — PipelineRecoveryService

Entry point: @EventListener on WorkflowFailedEvent, @Async("jarvisTaskExecutor")
Flow: (1) Check retry guard → (2) Query KB for past fix → (3) ErrorClassifier.classify(step, errorMsg) → RecoveryStrategy enum → (4) Apply strategy with backoff → (5) Validate success → (6) Publish PipelineRecoveredEvent or PipelineRecoveryExhaustedEvent
Max retries per strategy: TRANSIENT=3, MALFORMED=2, GIT_CONFLICT=3, STALE_CLONE=2, UNKNOWN=1
Reuses: BedrockClient.invokeWithFallback(), GitHubClient.createBranch(), existing clone/delete logic
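The per-strategy caps and the 2s→4s→8s backoff are simple enough to sketch directly; the values mirror the post, while the class itself is a hypothetical simplification:

```java
// Sketch of the max-retry guard and exponential backoff from Layer 3.
public class RetryPolicy {
    public enum RecoveryStrategy { TRANSIENT_BEDROCK, MALFORMED_AI_RESPONSE,
                                   GIT_CONFLICT, STALE_CLONE, UNKNOWN }

    /** Per-strategy retry caps as listed above. */
    public static int maxRetries(RecoveryStrategy s) {
        switch (s) {
            case TRANSIENT_BEDROCK:     return 3;
            case MALFORMED_AI_RESPONSE: return 2;
            case GIT_CONFLICT:          return 3;
            case STALE_CLONE:           return 2;
            default:                    return 1; // UNKNOWN
        }
    }

    /** Exponential backoff: attempt 1 → 2000 ms, 2 → 4000 ms, 3 → 8000 ms. */
    public static long backoffMs(int attempt) {
        return 2000L << (attempt - 1);
    }
}
```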

🧠 Layer 4 — KB Lookup

Called before every recovery attempt: knowledgeBaseService.resolveFromKB("pipeline failure step:CODE_GEN_STORY error:TRANSIENT_BEDROCK", reqId)
Metadata filter: source-uri startsWith s3://.../learnings/.../incidents/
Threshold: confidence ≥ 0.80 → use documented strategy; < 0.80 → use default ErrorClassifier strategy
KbIncidentMatch parses frontmatter: fix-strategy, backoff-ms, max-retries → PipelineRecoveryService applies them directly
Zero new AWS resources — reuses existing KB ID, AOSS index, embeddings model
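Extracting the frontmatter fields that KbIncidentMatch consumes can be done with a minimal hand-rolled parse — a real implementation would use a YAML library; this sketch only handles flat "key: value" lines between the --- delimiters:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal parse of the incident.md frontmatter keys the post lists
// (fix-strategy, backoff-ms, max-retries). Illustrative only.
public class KbIncidentMatchParser {
    public static Map<String, String> parseFrontmatter(String markdown) {
        Map<String, String> fields = new HashMap<>();
        boolean inside = false;
        for (String line : markdown.split("\n")) {
            if (line.trim().equals("---")) {
                if (inside) break;   // closing delimiter — stop before the body
                inside = true;       // opening delimiter
                continue;
            }
            if (inside && line.contains(":")) {
                int i = line.indexOf(':');
                fields.put(line.substring(0, i).trim(), line.substring(i + 1).trim());
            }
        }
        return fields;
    }
}
```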

📚 Layer 5 — KbHealingFeedbackService

New class extending the existing S3 upload pattern from KnowledgeFeedbackService.
Listens to 3 events: WorkflowFailedEvent (stub), PipelineRecoveredEvent (complete), PipelineRecoveryExhaustedEvent (escalation).
Upload path: s3://.../learnings/{reqId}/incidents/{yyyyMMdd-HHmmss}-{step}.md
YAML frontmatter enables metadata filtering in KB: type: pipeline-incident, step, error-class, fix-strategy, resolved: true/false
After upload → knowledgeBaseService.startSync() → new incident indexed within ~30s
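Composing the incident document is mostly string assembly; a sketch under the assumption that the frontmatter fields match the ones listed above (the S3 upload and sync trigger are omitted):

```java
// Builds the incident.md content described in Layer 5. Field names follow
// the post's frontmatter; the class and signature are illustrative.
public class IncidentDocBuilder {
    public static String build(String step, String errorClass, String fixStrategy,
                               int attempts, boolean resolved) {
        return "---\n"
             + "type: pipeline-incident\n"
             + "step: " + step + "\n"
             + "error-class: " + errorClass + "\n"
             + "fix-strategy: " + fixStrategy + "\n"
             + "attempts: " + attempts + "\n"
             + "resolved: " + resolved + "\n"
             + "---\n\n"
             + "## Error\n\n## Root Cause\n\n## Fix Applied\n\n## Prevention\n";
    }
}
```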

🗂️ ErrorClassifier (standalone)

Pure utility class — no Spring dependencies, fully unit-testable.
classify(String step, String errorMessage) → RecoveryStrategy
Uses ordered regex/contains checks:
ThrottlingException|Rate exceeded|503 → TRANSIENT_BEDROCK
<|Unexpected character|parse error → MALFORMED_AI_RESPONSE
422|Reference already exists|already exists → GIT_CONFLICT
Remote mismatch|no commits|stale clone → STALE_CLONE
jira|401|403.*atlassian → JIRA_UNAVAILABLE
bucket|credentials|NoSuch.*Key → CONFIG_MISSING
fallthrough → UNKNOWN
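Since the classifier is a pure utility, the ordered rules above translate almost directly into code. A sketch (the exact patterns in the real class may differ; first match wins, checked top-down):

```java
import java.util.regex.Pattern;

// Illustrative version of the ordered classification rules listed above.
public class ErrorClassifier {
    public enum RecoveryStrategy { TRANSIENT_BEDROCK, MALFORMED_AI_RESPONSE, GIT_CONFLICT,
                                   STALE_CLONE, JIRA_UNAVAILABLE, CONFIG_MISSING, UNKNOWN }

    private static boolean matches(String msg, String regex) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(msg).find();
    }

    public static RecoveryStrategy classify(String step, String errorMessage) {
        String msg = errorMessage == null ? "" : errorMessage;
        if (matches(msg, "ThrottlingException|Rate exceeded|503"))        return RecoveryStrategy.TRANSIENT_BEDROCK;
        if (matches(msg, "^<|Unexpected character|parse error"))          return RecoveryStrategy.MALFORMED_AI_RESPONSE;
        if (matches(msg, "422|Reference already exists|already exists"))  return RecoveryStrategy.GIT_CONFLICT;
        if (matches(msg, "Remote mismatch|no commits|stale clone"))       return RecoveryStrategy.STALE_CLONE;
        if (matches(msg, "jira|401|403.*atlassian"))                      return RecoveryStrategy.JIRA_UNAVAILABLE;
        if (matches(msg, "bucket|credentials|NoSuch.*Key"))               return RecoveryStrategy.CONFIG_MISSING;
        return RecoveryStrategy.UNKNOWN;
    }
}
```

Because the checks are ordered, an error message that could match two rules (say, a 503 from JIRA) is resolved by the earlier, more specific rule.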

Recovery Time Budget (TRANSIENT_BEDROCK Example)

0s   → CODE_GEN_STORY_4 throws ThrottlingException → stepFailed() persists PipelineLog
0s   → WorkflowFailedEvent published → KbHealingFeedbackService writes incident stub to S3
≤60s → PipelineHealthMonitor detects FAILED status → publishes PipelineRecoveryRequestedEvent
+1s  → PipelineRecoveryService.onRecovery() → KB lookup: "pipeline failure step:CODE_GEN_STORY error:TRANSIENT_BEDROCK"
+2s  → KB HIT (2nd+ occurrence): past incident found, confidence=0.87 → apply: 4s backoff, retry once
       OR KB MISS (1st occurrence): ErrorClassifier → TRANSIENT_BEDROCK → default 2s→4s→8s backoff
+4s  → getLastCheckpoint() → order 113 (story 3 done) → alreadyCompleted() checks stories 1–3 = skip
+5s  → Retry story 4 → Bedrock invoked → success
+8s  → Stories 5–8 continue normally → PR created → PipelineRecoveredEvent published
+9s  → KbHealingFeedbackService completes incident.md → knowledgeBaseService.startSync()
+40s → KB re-indexed → next ThrottlingException anywhere → instant KB-guided recovery

Design Principles

✓ Zero New Infrastructure

Reuses existing thread pool, KB, S3, SSE stream, PipelineLog. Flyway V18 is the only schema change. No Redis, no Kafka, no distributed lock manager needed.

✓ Idempotent Resume

alreadyCompleted() is the single guard. A story that succeeded before recovery will never be re-run, guaranteeing no duplicate commits or duplicate JIRA transitions.

✓ Self-Improving KB

Every incident enriches the KB. First failure is trial-and-error. Every subsequent identical failure is resolved instantly from the KB. Recovery time strictly decreases over time.

⚠️ Non-Blocking Side Effects

All KB writes, incident uploads, and sync triggers are @Async. A KB outage during recovery does NOT block the recovery itself — the pipeline resumes regardless.

⚠️ Guardrail: Max Retries

Strict per-strategy retry caps prevent runaway loops. After exhaustion: RequirementStatus.FAILED is set permanently, Teams alert sent, full incident logged. Human can restart manually.

📡 Full Observability

Every recovery attempt creates a PipelineLog entry visible in the SSE pipeline viewer with step name RECOVERY_ATTEMPT_N. Users see healing happen in real time in the UI.

Tuesday, 18 October 2022

Microservice Architecture in Java

Microservice Architecture enables large teams to build scalable applications that are composed of multiple small, loosely coupled services. In a Microservice Architecture, each service handles a dedicated function inside a large-scale application.

The challenge we all face when designing a Microservice Architecture is right-sizing the services and identifying their limitations and boundaries.

Some of the most commonly used approaches in the industry:-

  • Domain Driven:- This approach requires good domain knowledge, and it takes a lot of time to reach close alignment with all the business stakeholders to identify the needs and requirements for developing Microservices around business capabilities.
  • Event Storming Sizing:- We conduct a session with all business stakeholders, identify the various events in the system, and based on those events group them into domains.

In the below Microservice Architecture for a Bank, we have (Loan, Card, Account, and Customer) Microservices, along with the other services required for a successful implementation of a Microservice Architecture.


Let's look at the most critical components that are required for Microservice Architecture Implementation. 

The API Gateway handles all incoming requests and routes them to the relevant microservices. The API gateway depends on the Identity Provider service to handle authentication.

To locate the service an incoming request should be routed to, the API Gateway consults the service registry and discovery service. All Microservices register with the Service Registry and discover the locations of other Microservices using the Discovery service.

Let's take a look at the components in detail for a Successful Microservice Architecture and why they are required.
  1. API Gateway (Handles Routing Requirements):- Spring Cloud Gateway is a library for building an API gateway. Spring Cloud Gateway sits between a requester and a resource, where it intercepts and analyzes the request. It is also the preferred API gateway from the Spring Cloud team. It has the following advantages:- 
    1. Built on Spring 5, reactor, and Spring WebFlux.
    2. It also integrates circuit breaking and service discovery with Eureka.
  2. Configuration Service:- We can't hard-code the config details inside the service, and across DTAP environments it would be a nightmare to manage all the config in application properties, plus manage it again whenever a new service joins. So in a Microservice architecture we have a config service that loads and injects the configuration (from a Git repo, file system, or database) into Microservices while they're starting up. Since we are talking about Java, I have used Spring Cloud Config for Configuration Management.
  3. Service Registry and Discovery:- In a Microservice Architecture, how do services locate each other inside a network? How do we tell our application architecture when a new service is onboarded or a new node is added for an existing service, and how will the load balancer work? This all looks very complicated, but we have the Spring Cloud Discovery Service using the Eureka agent. Some advantages of using Service Discovery: 
    1. No Limitation on Availability 
    2. Peer to Peer communication between service Discovery agent
    3. Dynamically Managed IPs, Configurations, and Load Balance.
    4. Fault-tolerance and Resilience 
  4. Resilience Inside Microservices:- Here we make sure that we handle service failures gracefully, avoid cascading effects if one of the services fails, and have self-healing capabilities. For resilience, the Spring Framework supports Resilience4J, a lightweight and easy-to-use fault tolerance library inspired by Netflix Hystrix. Before Resilience4J, Netflix Hystrix was most commonly used for resiliency, but it is now in maintenance mode. Resilience4J offers the following patterns for increasing fault tolerance. 
    1. Circuit Breaking:- Used to stop making a request when a service is failing.
    2. Fallback:- Alternative path to failing service.
    3. Retry:- Retry when a service has failed temporarily.
    4. Rate Limit:- Limit the number of calls a service gets at a time.
    5. Bulkhead:- Limits concurrent calls to a service to avoid overloading it.
  5. Distributed Tracing and Logging:- To debug a problem in a microservice architecture, we need to aggregate all the logs and traces and monitor the chain of service calls. For that we have Spring Cloud Sleuth and Zipkin.
    1. Sleuth provides auto-configuration for distributed logging; it adds a span ID to all the logs by filtering and interacting with other Spring components, and generates the correlation ID that is passed through all the system calls.
    2. Zipkin:- Used for data visualization of the traces.
  6. Monitoring:- Used to monitor service metrics and health checks, and to create alerts based on that monitoring. There are different approaches; let's see the most commonly used ones.
    1. Actuator:- Mainly used to expose operational information like health, dump, info, and memory.
    2. Micrometer:- Exposes Actuator data in a format that can be understood by the monitoring system; all we need to do is add the vendor-specific Micrometer dependency to the service.
    3. Prometheus:- A time-series database that stores metric data and also has data-visualization capability.
    4. Grafana:- Pulls data from various data sources like Prometheus and offers a rich UI to create custom dashboards; it also supports rule-based alerts and notifications.
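The circuit-breaker pattern from point 4 can be illustrated with a minimal hand-rolled sketch. Resilience4J provides a production-ready version of this state machine; the class below is only a toy that models the CLOSED → OPEN transition after a failure threshold, plus the fallback path:

```java
import java.util.function.Supplier;

// Toy circuit breaker: stops calling a failing service after N consecutive
// failures and serves the fallback instead. Half-open recovery is omitted.
public class SimpleCircuitBreaker {
    public enum State { CLOSED, OPEN }

    private final int failureThreshold;
    private int consecutiveFailures = 0;
    private State state = State.CLOSED;

    public SimpleCircuitBreaker(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    public State getState() { return state; }

    /** Runs the call when CLOSED; returns the fallback when OPEN or on failure. */
    public <T> T call(Supplier<T> remoteCall, T fallback) {
        if (state == State.OPEN) return fallback; // stop hitting a failing service
        try {
            T result = remoteCall.get();
            consecutiveFailures = 0;
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) state = State.OPEN;
            return fallback;
        }
    }
}
```

In a real service you would use Resilience4J's @CircuitBreaker annotation with a fallbackMethod instead of writing this by hand; the toy just makes the state machine visible.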

We have covered all the relevant components for a successful Microservice Architecture. I built Microservices using the Spring Framework and all the above components: Code Repo

Happy Coding and Keep Sharing!!
 

Wednesday, 29 September 2021

Event Driven Architecture using Apache Kafka

A quick recap of what we discussed in the previous post about EDA; in this post, we will get more insight. EDA is a pattern that uses events to communicate between decoupled components or services: these events are published to an event broker platform and then sent to the consuming applications.

Event-Driven Architecture comprises three components.

  • Producers:- The apps or services that publish events to an event broker platform.
  • Router:- Routes events to their respective consuming applications.
  • Consumers:- The apps or services that consume a particular topic from the event router.
When designing event-driven systems, we can implement two models.

  • Pub/Sub (Publish/Subscribe):- Events are published to a topic and sent to one or more subscribers; once received, an event cannot be backtracked or reread, and new subscribers do not see it.
  • Event Streaming:- Events are written to a log and ordered within a partition. A client app can read from any part of the stream and replay the events.
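The difference between the two models is easiest to see in code. Below is a tiny in-memory sketch of the event-streaming model (not the Kafka API): events are appended to an ordered log, and any consumer can read from any offset and replay — unlike classic pub/sub, nothing is lost for late subscribers.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a single partition: an append-only, offset-addressed log.
public class EventLogPartition {
    private final List<String> log = new ArrayList<>();

    /** Appends an event and returns its offset in the partition. */
    public long append(String event) {
        log.add(event);
        return log.size() - 1;
    }

    /** Reads all events from the given offset — a late consumer can start at 0. */
    public List<String> readFrom(long offset) {
        return new ArrayList<>(log.subList((int) offset, log.size()));
    }
}
```

Kafka partitions behave like this conceptually, with durability, replication, and consumer-group offset tracking layered on top.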

We are going to use Apache Kafka, an open-source, distributed event streaming platform, to implement the event-driven architecture.

Using Apache Kafka, we can have multiple apps or services that write events to a Kafka cluster, and at the same time multiple consumer apps that subscribe to or stream events from Kafka. A Kafka cluster is a collection of brokers, which could be actual physical servers or a single rack; if you are using Kafka in the cloud or as a PaaS, you don't have to be concerned about that.

Here, ZooKeeper is responsible for cluster and failure management, and it decides which of the replicated brokers becomes the new leader.

A broker can store multiple topics where producers write events, where a topic is a collection of related events or messages. When we produce an event, we need to specify the topic where we want to write or publish it.


In the next post, we will build a Spring Boot API that produces events, and we will see the end-to-end flow of producers, routers, and consumers. Until then,

Happy coding and keep sharing!!
 

Monday, 27 September 2021

An Overview :- What is Event-Driven Architecture ?

Before we jump into EDA, let's understand the standard guidelines for system design, or more generally the Reactive Manifesto, which is a set of community-driven guidelines intended to give a cohesive approach to building systems.

So, the core of the Reactive Manifesto is to make systems message driven, more specifically with "async" messaging.



We want to make the system async-message driven with scalability and resilience, and this helps us build distributed systems (e.g. on K8s). Scalable means our hardware should expand as the workload expands; by resilient we mean there should be no single point of failure, and if one occurs we should be able to handle it elegantly.

Based on the above three foundations, we should be able to build a system that is responsive.   

Now that we have our core setup, let's understand what an event is. In simple words, an event is a statement of fact about something that happened in the past. Let's talk through an example of a retail application.




So, in the application, we have a checkout service that wants to talk to other services such as "Inventory", "Shipping", and "Contact".

In the messaging model, if Inventory wants to know what Checkout is doing, Checkout sends a message directly to Inventory to let it know a checkout happened, and likewise directly to the Shipping and Contact services. These services can also message Checkout directly ("conversational messaging"). Up to this point, the message is sitting on a host machine.

When we design an event-driven application, our event producers might be a web app, a mobile app, etc., and event logs will be produced by all the producing applications.

Event logs can be used to trigger an action. In the case of IoT, when any device turns on, it spins up a pod on the infrastructure; that pod is a Function as a Service (FaaS) that sits on top of serverless infrastructure and shuts down when our function finishes by sending the event.

With event logs we can optimize and customize data persistence. For example, our Inventory service can consume the stream sent by the web application, modify its local data, and produce a new stream to the event backbone; this new stream then gives the most correct inventory data to any other application in the system.

An important thing that happens here is that we can save all our data, raw or transformed, in a Data Lake, which helps data-heavy applications like AI.

Another thing that EDA enables is stream processing, which is built on top of the Apache Kafka Streams API.

Benefits of Event-Driven Architecture
  • Asynchronous
  • Scalable and Failure Independent
  • Auditing and Point-in-time recovery 

Till now we have seen an overview and the benefits of EDA, which sits on top of the Reactive Manifesto's ideas for system design.

In the next post, we will learn more about this in detail with some demo examples. Until then,

Happy coding and keep sharing!!