Cloud Engineer Agent - Complete Architecture

Overview

This architecture describes a cloud engineer agent built on AWS. It uses Slack as the user interface, a Claude model on Amazon Bedrock for reasoning, and MCP servers plus Strands tools for extended functionality.

Architecture Components

┌─────────────┐    ┌─────────────────┐    ┌──────────────────────────────────────────────────────────────┐         ┌────────────────┐
│    Slack    │───▶│   API Gateway   │───▶│                      Lambda Function                         │────────▶│   S3 Vectors   │  
│  Interface  │    │                 │    │                       (AWS Strands)                          │         └────────────────┘
└─────────────┘    └─────────────────┘    │ ┌──────────────────────────────────────────────────────────┐ │         ┌────────────────┐
                                          │ │                       Tools                              │ │────────▶│    DynamoDB    │
                                          │ │┌───────────────┐ ┌───────────────────┐ ┌───────────┐     │ │         └────────────────┘
                   ┌─────────────────┐    │ ││ aws_cloudwatch│ │ aws_cost_explorer │ │ atlassian │     │ │
                   │ CloudWatch Logs │───▶│ │└───────────────┘ └───────────────────┘ └───────────┘     │ │
                   └─────────────────┘    │ │┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌────────┐│ │
                                          │ ││ aws_eks │ │ aws_ecs │ │ aws_ecs │ │ use_aws │ │ memory ││ │
                                          │ │└─────────┘ └─────────┘ └─────────┘ └─────────┘ └────────┘│ │
                                          │ └──────────────────────────────────────────────────────────┘ │
                                          └──────────────────────────────────────────────────────────────┘
                                                                           │
                              ┌────────────────────────────────────────────┼───────────────────────────────────────────┐
                              │                                            │                                           │
                              ▼                                            ▼                                           ▼
   ┌────────────────────────────────────────────────────────┐  ┌──────────────────────────────────┐   ┌────────────────────────────────────┐                                   
   │                       Fargate                          │  │          Amazon Bedrock          │   │            Usage Metrics           │
   │                  ┌─────────────────┐                   │  │                                  │   │                                    │              
   │                  │    mcp-proxy    │                   │  │ ┌─────────────┐  ┌─────────────┐ │   │ ┌───────────────┐ ┌───────────────┐│
   │                  └─────────────────┘                   │  │ │   Model     │  │  Guardrails │ │   │ │ Cost Explorer │ │  CloudWatch   ││      
   │                          │                             │  │ └─────────────┘  └─────────────┘ │   │ └───────────────┘ │  Dashboard    ││     
   │                          │                             │  └──────────────────────────────────┘   │                   └───────────────┘│
   │                          ▼                             │                                         └────────────────────────────────────┘
   │┌─────────────────────────────────────────────────────┐ │                
   ││                    MCP Servers                      │ │         ┌─────────────────┐                                                       
   ││                                                     │ │         │     External    │ 
   ││┌──────────────┐  ┌──────────────┐  ┌──────────────┐ │ │         │ ┌─────────────┐ │                             
   │││   aws-docs   │  │  atlassian   │  │    github    │ │ │         │ │   GitHub    │ │ 
   ││└──────────────┘  └──────────────┘  └──────────────┘ │ │         │ └─────────────┘ │    
   ││┌──────────────┐  ┌──────────────┐  ┌──────────────┐ │ │────────▶│ ┌─────────────┐ │  
   │││ cost-explorer│  │   aws-eks    │  │   aws-ecs    │ │ │         │ │  Atlassian  │ │
   ││└──────────────┘  └──────────────┘  └──────────────┘ │ │         │ └─────────────┘ │   
   ││┌──────────────┐                                     │ │         │ ┌─────────────┐ │ 
   │││  cloudwatch  │                                     │ │         │ │    AWS      │ │    
   ││└──────────────┘                                     │ │         │ │Documentation│ │      
   │└─────────────────────────────────────────────────────┘ │         │ └─────────────┘ │ 
   └────────────────────────────────────────────────────────┘         └─────────────────┘

Generated using aws-diagram MCP server.

System Context

The Lambda function is triggered by two primary sources:

- Slack Messages: User interactions through the Slack interface for cloud engineering queries and operations
- CloudWatch Log Events: Automated error detection and response workflow for infrastructure monitoring

The system behavior is defined in agent/system_prompt.md, which outlines the agent's capabilities for AWS operations management and automated error response workflows.
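
A minimal sketch of how a single handler might tell the two trigger sources apart. It assumes the Slack request arrives as an API Gateway proxy event (with a `body` field) and that log events arrive through a CloudWatch Logs subscription, whose payload is base64-encoded, gzip-compressed JSON under `awslogs.data`. The function names are hypothetical and are not taken from `agent/agent.py`.

```python
import base64
import gzip
import json


def classify_event(event: dict) -> str:
    """Return 'cloudwatch_logs', 'slack', or 'unknown' for a Lambda event."""
    if "awslogs" in event:
        return "cloudwatch_logs"
    if "body" in event:
        return "slack"
    return "unknown"


def decode_cloudwatch_logs(event: dict) -> dict:
    """Decode a CloudWatch Logs subscription payload (base64 + gzip + JSON)."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def decode_slack_body(event: dict) -> dict:
    """Parse the JSON body that API Gateway forwards from the Slack webhook."""
    body = event["body"] or "{}"
    if event.get("isBase64Encoded"):
        body = base64.b64decode(body).decode("utf-8")
    return json.loads(body)
```

A handler could call `classify_event` first and then pass the decoded payload to the agent, so both workflows share one code path after the routing step.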

Enhanced Data Flow

1. Input Sources:
   - Users interact through Slack with cloud engineering queries
   - CloudWatch Logs trigger automated error response workflows
2. API Gateway: The Slack webhook sends a request to Amazon API Gateway
3. Lambda Processing: The AWS Strands-powered Lambda function processes requests using integrated tools:
   - aws_doc_tools: Access to AWS documentation and best practices
   - aws_cdk_tools: CDK-specific operations and guidance
   - github_tools: Repository management and pull request operations
   - atlassian_tools: Jira integration for issue tracking and project management
   - use_aws: Direct AWS service interactions and resource management
   - memory: Context retention and conversation history
4. Service Integration:
   - MCP Proxy (ALB): Load-balanced access to containerized MCP servers running on Fargate
   - Amazon Bedrock: Claude model for AI processing and Guardrails for content safety
   - External APIs: GitHub API, Atlassian API, and AWS Documentation services
5. Response Processing: Lambda aggregates responses from all integrated services and tools
6. Output Delivery: Processed responses flow back to Slack. For error response workflows, the agent also creates a Jira ticket and generates a GitHub PR

Key Components

MCP Servers

AWS Documentation MCP Server: Provides real-time access to AWS documentation, best practices, and technical guides
AWS CDK MCP Server: Offers CDK-specific operations, template generation, and infrastructure as code guidance
GitHub MCP Server: Enables repository management, pull request operations, and version control integration
Atlassian MCP Server: Provides Jira integration for issue tracking, project management, and workflow automation

Strands Tools

use_aws Tool: Enables direct interaction with AWS services for operational tasks, resource management, and configuration changes
memory Tool: Stores user and agent memories across agent runs to provide personalized experiences, using Mem0 and Amazon Bedrock Knowledge Bases

Amazon Bedrock Services

Claude Model: Advanced language model for understanding and generating responses
Guardrails: Content filtering and safety validation
Knowledge Base: RAG implementation with internal knowledge repository (listed as planned under Future Improvements)
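
The components above (MCP servers, Strands tools, and Bedrock) could be wired together roughly as follows. This is a sketch, not the code in `agent/cloud_engineer.py`: the `run_agent` function and all of its parameters are hypothetical, the model ID and guardrail identifiers must be supplied by the caller, and the use of the streamable HTTP transport for reaching the MCP proxy is an assumption (the proxy may expose a different transport).

```python
def run_agent(prompt: str, mcp_url: str, model_id: str, guardrail_id: str,
              guardrail_version: str, system_prompt: str) -> str:
    """Answer one prompt with a Strands agent backed by Bedrock and MCP tools."""
    # Imports are local so this module still loads where the SDKs are absent.
    from mcp.client.streamable_http import streamablehttp_client
    from strands import Agent
    from strands.models import BedrockModel
    from strands.tools.mcp import MCPClient
    from strands_tools import use_aws

    # Bedrock model with a guardrail attached for content safety.
    model = BedrockModel(
        model_id=model_id,
        guardrail_id=guardrail_id,
        guardrail_version=guardrail_version,
    )

    # Assumption: the MCP proxy is reachable over streamable HTTP at mcp_url.
    mcp_client = MCPClient(lambda: streamablehttp_client(mcp_url))

    # MCP tools are only usable while the client session is open.
    with mcp_client:
        tools = [use_aws, *mcp_client.list_tools_sync()]
        agent = Agent(model=model, system_prompt=system_prompt, tools=tools)
        return str(agent(prompt))
```

Keeping the agent call inside the `with` block matters because the MCP-provided tools depend on the open client session.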

DynamoDB Integration

Message Deduplication Table: Prevents duplicate Slack message processing across Lambda executions
- Partition Key: message_id (MD5 hash of timestamp, user, channel, and message text)
- TTL: 1-hour automatic cleanup of processed message records
- Atomic Operations: Conditional writes ensure race-condition-free duplicate detection
- Cross-Lambda Persistence: Maintains deduplication state across multiple Lambda invocations
- Billing: Pay-per-request pricing model for cost-effective operation
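
The deduplication scheme above can be sketched as a hash plus a conditional write. The field order and `|` separator used for hashing, the `expires_at` TTL attribute name, and the helper names are assumptions for illustration. `InMemoryTable` is a stand-in that mimics the relevant behavior of a boto3 DynamoDB `Table` so the sketch runs without AWS access.

```python
import hashlib
import time
from typing import Optional

TTL_SECONDS = 3600  # 1-hour cleanup window


def make_message_id(timestamp: str, user: str, channel: str, text: str) -> str:
    """MD5 hash over the fields that identify one Slack message."""
    raw = "|".join([timestamp, user, channel, text])
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


def claim_message(table, message_id: str, now: Optional[int] = None) -> bool:
    """Return True if this invocation is the first to see the message."""
    now = int(time.time()) if now is None else now
    try:
        # The write succeeds only if no item with this key exists yet.
        table.put_item(
            Item={"message_id": message_id, "expires_at": now + TTL_SECONDS},
            ConditionExpression="attribute_not_exists(message_id)",
        )
        return True
    except Exception as exc:
        # boto3 reports a failed condition as a ClientError with this code.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code == "ConditionalCheckFailedException":
            return False
        raise


class InMemoryTable:
    """Hypothetical stand-in for a DynamoDB table, for local runs only."""

    def __init__(self):
        self.items = {}

    def put_item(self, Item, ConditionExpression=None):
        if Item["message_id"] in self.items:
            err = Exception("conditional check failed")
            err.response = {"Error": {"Code": "ConditionalCheckFailedException"}}
            raise err
        self.items[Item["message_id"]] = Item
```

Because the existence check and the write happen in one conditional `put_item`, two concurrent Lambda invocations cannot both claim the same message. Against a real table, `table` would be a boto3 `Table` resource with TTL enabled on the chosen attribute.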

Enhanced Capabilities

Documentation & Learning

Real-time AWS documentation lookup
Best practices and architectural guidance
Service-specific technical references
Troubleshooting guides and solutions

Cost Management

Real-time cost analysis and reporting
Budget monitoring and alerts
Cost optimization recommendations
Resource utilization insights

AWS Operations

Direct AWS service interactions
Resource provisioning and management
Configuration changes and updates
Infrastructure monitoring and control

AI-Powered Assistance

Natural language query processing
Context-aware responses
Multi-service orchestration
Intelligent recommendation engine

Security & Compliance

Lambda execution environment isolation
Bedrock Guardrails for content safety
AWS IAM for granular access control
Audit logging for all operations

AI Tooling Used in Development

This project leveraged various AI tools throughout the development lifecycle to enhance productivity and code quality:

Product Development & Planning

Claude: Product Requirements Document (PRD) creation and architectural planning
Cline + Mantel API Gateway: Large-scale codebase development, refactoring, and feature implementation

Documentation & Visual Assets

Gemini: README generation and documentation creation based on demo screenshots
aws-diagram-mcp: Automated architecture diagram generation and visualization

Code Development & Maintenance

Amazon Q: Precise, surgical code fixes and targeted problem resolution
GitHub Copilot: Real-time tab completions, inline code suggestions, and automated commit message generation

This multi-AI approach enabled rapid development while maintaining high code quality and comprehensive documentation across the entire cloud engineering solution.

Development Challenges

System Prompt Engineering

Achieving precise, surgical code changes required extensive iteration and refinement of the system prompt. The challenge was balancing comprehensive capabilities with focused execution - ensuring the agent could handle complex scenarios while maintaining minimal, targeted fixes for specific issues.

Multi-Agent Architecture Evaluation

Initial exploration of a multi-agent architecture revealed significant limitations for precision-focused tasks:

Context Fragmentation: Specialized agents (orchestrator, pr-agent, knowledge-base-agent, operations-agent, jira-agent) only saw partial context, leading to suboptimal decisions
Over-Specialization: Individual agents felt compelled to "add value" within their domain, resulting in broader changes than necessary
Communication Overhead: Information loss and transformation occurred during handoffs between agents
Competing Objectives: Different agents had conflicting approaches to problem-solving

Architecture Decision: Single Agent Superiority

In this evaluation, a single-agent architecture with a well-crafted system prompt outperformed the multi-agent approach for surgical infrastructure fixes:

Full Context Awareness: Complete problem visibility without information fragmentation
Clear Single Objective: Direct focus on fixing specific errors without role confusion
Simplified Execution Path: Elimination of complex orchestration overhead
Consistent Precision: Reliable delivery of minimal, targeted changes

This architectural insight proved crucial for achieving the system's core requirement of surgical precision in automated error response workflows.

Demos

Explore the Cloud Engineer Agent capabilities through interactive demonstrations:

Automated Error Response - Complete workflow from CloudWatch error detection to automated Jira ticket creation and GitHub PR generation
Root Cause Analysis - Systematic investigation and diagnosis of complex AWS infrastructure issues, including organizational policy conflicts
AWS Well-Architected Review - Comprehensive infrastructure assessment against the AWS Well-Architected Framework pillars with automated Jira Epic creation
Cloud Operations - Direct AWS service interactions, resource management, and infrastructure operations
General Queries - AWS documentation lookup, best practices guidance, and expert recommendations

File Structure

cloud-engineer/
├── README.md                           # Main project documentation
├── package.json                        # Node.js dependencies and scripts
├── cdk.json                            # CDK configuration
├── tsconfig.json                       # TypeScript configuration
├── jest.config.js                      # Jest testing configuration
├── LICENSE.md                          # Project license
├── .gitignore                          # Git ignore patterns
├── .npmignore                          # NPM ignore patterns
├── cdk.context.json                    # CDK context cache
│
├── agent/                              # Lambda function source code
│   ├── agent.py                        # Main Lambda handler
│   ├── cloud_engineer.py               # Core agent implementation
│   ├── system_prompt.md                # Agent behavior definition
│   ├── requirements.txt                # Python dependencies
│   └── Dockerfile                      # Container configuration
│
├── bin/                                # CDK application entry point
│   └── cloud-engineer.ts               # CDK app definition
│
├── lib/                                # CDK infrastructure code
│   └── cloud-engineer-stack.ts         # Main infrastructure stack
│
├── mcp-proxy/                          # MCP server proxy configuration
│   ├── Dockerfile                      # Proxy container configuration
│   ├── entrypoint.sh                   # Container startup script
│   └── mcp-servers.json                # MCP server definitions
│
├── demos/                              # Demo screenshots and documentation
│   ├── automated-error-response/       # Error response workflow demos
│   ├── cloud-ops/                      # AWS operations demos
│   ├── cost-forecasting/               # AWS cost forecasting demos
│   ├── general-queries/                # Documentation query demos
│   ├── query-jira/                     # Jira query demos
│   ├── root-cause-analysis/            # Infrastructure issue investigation demos
│   └── well-architected-review/        # AWS Well-Architected Framework assessment demos
│
├── generated-diagrams/                 # Architecture diagrams
│   └── cloud-engineer-architecture.png # Current system architecture
│
└── tests/                              # Test files
    └── test_cloud_engineer.py          # Agent unit tests

Scalability & Performance

Auto-scaling Lambda functions
Distributed MCP server architecture

Future Improvements

The following features are planned for future implementation:

Bedrock Knowledge Base or S3 Vectors: RAG implementation with an internal knowledge repository for enhanced contextual responses
Memory Strands Tool: Advanced context retention and conversation history management within AWS Lambda
CloudWatch Dashboard: Cost Explorer integration for monitoring and visualizing inference costs
API Security Implementation: Advanced security measures and authentication
