This architecture represents a comprehensive cloud engineer agent solution built on AWS, using Slack as the user interface, powered by Amazon Bedrock's Claude model, and enhanced with MCP servers and Strands tools for extended functionality.
┌─────────────┐ ┌─────────────────┐ ┌──────────────────────────────────────────────────────────────┐ ┌────────────────┐
│ Slack │───▶│ API Gateway │───▶│ Lambda Function │────────▶│ S3 Vectors │
│ Interface │ │ │ │ (AWS Strands) │ └────────────────┘
└─────────────┘ └─────────────────┘ │ ┌──────────────────────────────────────────────────────────┐ │ ┌────────────────┐
│ │ Tools │ │────────▶│ DynamoDB │
│ │┌───────────────┐ ┌───────────────────┐ ┌───────────┐ │ │ └────────────────┘
┌─────────────────┐ │ ││ aws_cloudwatch│ │ aws_cost_explorer │ │ atlassian │ │ │
│ CloudWatch Logs │───▶│ │└───────────────┘ └───────────────────┘ └───────────┘ │ │
└─────────────────┘ │ │┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌────────┐│ │
│ ││ aws_eks │ │ aws_ecs │ │ aws_ecs │ │ use_aws │ │ memory ││ │
│ │└─────────┘ └─────────┘ └─────────┘ └─────────┘ └────────┘│ │
│ └──────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
│
┌────────────────────────────────────────────┼───────────────────────────────────────────┐
│ │ │
▼ ▼ ▼
┌────────────────────────────────────────────────────────┐ ┌──────────────────────────────────┐ ┌────────────────────────────────────┐
│ Fargate │ │ Amazon Bedrock │ │ Usage Metrics │
│ ┌─────────────────┐ │ │ │ │ │
│ │ mcp-proxy │ │ │ ┌─────────────┐ ┌─────────────┐ │ │ ┌───────────────┐ ┌───────────────┐│
│ └─────────────────┘ │ │ │ Model │ │ Guardrails │ │ │ │ Cost Explorer │ │ CloudWatch ││
│ │ │ │ └─────────────┘ └─────────────┘ │ │ └───────────────┘ │ Dashboard ││
│ │ │ └──────────────────────────────────┘ │ └───────────────┘│
│ ▼ │ └────────────────────────────────────┘
│┌─────────────────────────────────────────────────────┐ │
││ MCP Servers │ │ ┌─────────────────┐
││ │ │ │ External │
││┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │ │ ┌─────────────┐ │
│││ aws-docs │ │ atlassian │ │ github │ │ │ │ │ GitHub │ │
││└──────────────┘ └──────────────┘ └──────────────┘ │ │ │ └─────────────┘ │
││┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │────────▶│ ┌─────────────┐ │
│││ cost-explorer│ │ aws-eks │ │ aws-ecs │ │ │ │ │ Atlassian │ │
││└──────────────┘ └──────────────┘ └──────────────┘ │ │ │ └─────────────┘ │
││┌──────────────┐ │ │ │ ┌─────────────┐ │
│││ cloudwatch │ │ │ │ │ AWS │ │
││└──────────────┘ │ │ │ │Documentation│ │
│└─────────────────────────────────────────────────────┘ │ │ └─────────────┘ │
└────────────────────────────────────────────────────────┘ └─────────────────┘
Generated using aws-diagram MCP server.
This Lambda function is triggered by two primary sources:
- Slack Messages: User interactions through Slack interface for cloud engineering queries and operations
- CloudWatch Log Events: Automated error detection and response workflow for infrastructure monitoring
The system behavior is defined in agent/system_prompt.md, which outlines the agent's capabilities for AWS operations management and automated error response workflows.
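As a rough illustration of the two trigger sources, the sketch below tells them apart by event shape. It assumes Slack traffic arrives as an API Gateway proxy event carrying a Slack Events API payload, and that CloudWatch Logs arrive through a subscription filter, whose payload is base64-encoded and gzip-compressed. The function names are hypothetical and are not taken from agent/agent.py.

```python
import base64
import gzip
import json


def classify_event(event: dict) -> str:
    """Return 'cloudwatch_logs', 'slack', or 'unknown' based on the event shape."""
    # CloudWatch Logs subscription events wrap their payload under awslogs.data.
    if "awslogs" in event and "data" in event["awslogs"]:
        return "cloudwatch_logs"
    # API Gateway proxy events carry the HTTP request body under "body".
    if "body" in event:
        return "slack"
    return "unknown"


def decode_log_events(event: dict) -> list:
    """Decode the base64 + gzip payload and return the raw log messages."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(compressed))
    return [entry["message"] for entry in payload.get("logEvents", [])]


def extract_slack_text(event: dict) -> str:
    """Return the message text from a Slack Events API payload (assumed format)."""
    raw = event.get("body") or "{}"
    if event.get("isBase64Encoded"):
        raw = base64.b64decode(raw).decode("utf-8")
    body = json.loads(raw)
    return body.get("event", {}).get("text", "")
```

A handler could call classify_event first and then pass either the decoded log messages or the Slack text to the agent, so both triggers share one code path after parsing.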
- Input Sources:
  - Users interact through Slack with cloud engineering queries
  - CloudWatch Logs trigger automated error response workflows
- API Gateway: Slack webhook sends a request to AWS API Gateway
- Lambda Processing: AWS Strands-powered Lambda function processes requests using integrated tools:
  - aws_doc_tools: Access to AWS documentation and best practices
  - aws_cdk_tools: CDK-specific operations and guidance
  - github_tools: Repository management and pull request operations
  - atlassian_tools: Jira integration for issue tracking and project management
  - use_aws: Direct AWS service interactions and resource management
  - memory: Context retention and conversation history
- Service Integration:
  - MCP Proxy (ALB): Load-balanced access to containerized MCP servers running on Fargate
  - Amazon Bedrock: Claude model for AI processing and Guardrails for content safety
  - External APIs: GitHub API, Atlassian API, and AWS Documentation services
- Response Processing: Lambda aggregates responses from all integrated services and tools
- Output Delivery: Processed responses flow back to Slack, with automated Jira ticket creation and GitHub PR generation for error response workflows
- AWS Documentation MCP Server: Provides real-time access to AWS documentation, best practices, and technical guides
- AWS CDK MCP Server: Offers CDK-specific operations, template generation, and infrastructure as code guidance
- GitHub MCP Server: Enables repository management, pull request operations, and version control integration
- Atlassian MCP Server: Provides Jira integration for issue tracking, project management, and workflow automation
- use_aws Tool: Enables direct interaction with AWS services for operational tasks, resource management, and configuration changes
- memory Tool: Stores user and agent memories across agent runs to provide personalized experiences, with support for both Mem0 and Amazon Bedrock Knowledge Bases
- Claude Model: Advanced language model for understanding and generating responses
- Guardrails: Content filtering and safety validation
- Knowledge Base: RAG implementation with internal knowledge repository
- Message Deduplication Table: Prevents duplicate Slack message processing across Lambda executions
  - Partition Key: message_id (MD5 hash of timestamp, user, channel, and message text)
  - TTL: 1-hour automatic cleanup of processed message records
  - Atomic Operations: Conditional writes ensure race-condition-free duplicate detection
  - Cross-Lambda Persistence: Maintains deduplication state across multiple Lambda invocations
  - Billing: Pay-per-request pricing model for cost-effective operation
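The deduplication scheme above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the field order and "|" separator in the hash, the expires_at attribute name, and the function names are all assumptions. The table argument is expected to behave like a boto3 DynamoDB Table resource, whose put_item accepts Item and ConditionExpression.

```python
import hashlib
import time

DEDUP_TTL_SECONDS = 3600  # 1-hour cleanup window, matching the TTL described above


def make_message_id(timestamp: str, user: str, channel: str, text: str) -> str:
    """MD5 hash over timestamp, user, channel, and text (separator is an assumption)."""
    raw = "|".join([timestamp, user, channel, text])
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


def build_dedup_put(message_id: str, now=None) -> dict:
    """Build put_item arguments that only succeed if the id is not stored yet."""
    now = int(time.time()) if now is None else now
    return {
        # expires_at is a hypothetical name for the table's TTL attribute.
        "Item": {"message_id": message_id, "expires_at": now + DEDUP_TTL_SECONDS},
        "ConditionExpression": "attribute_not_exists(message_id)",
    }


def claim_message(table, message_id: str, now=None) -> bool:
    """Return True if this invocation is the first to record the message."""
    try:
        table.put_item(**build_dedup_put(message_id, now))
        return True
    except Exception as exc:
        # boto3's ClientError exposes the error code under response["Error"]["Code"].
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code == "ConditionalCheckFailedException":
            return False  # another invocation already claimed this message
        raise
```

Because the existence check and the write happen in a single conditional put, two concurrent Lambda invocations cannot both claim the same message, which is the property the Atomic Operations item relies on.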
- Real-time AWS documentation lookup
- Best practices and architectural guidance
- Service-specific technical references
- Troubleshooting guides and solutions
- Real-time cost analysis and reporting
- Budget monitoring and alerts
- Cost optimization recommendations
- Resource utilization insights
- Direct AWS service interactions
- Resource provisioning and management
- Configuration changes and updates
- Infrastructure monitoring and control
- Natural language query processing
- Context-aware responses
- Multi-service orchestration
- Intelligent recommendation engine
- Lambda execution environment isolation
- Bedrock Guardrails for content safety
- AWS IAM for granular access control
- Audit logging for all operations
This project leveraged various AI tools throughout the development lifecycle to enhance productivity and code quality:
- Claude: Product Requirements Document (PRD) creation and architectural planning
- Cline + Mantel API Gateway: Large-scale codebase development, refactoring, and feature implementation
- Gemini: README generation and documentation creation based on demo screenshots
- aws-diagram-mcp: Automated architecture diagram generation and visualization
- Amazon Q: Precise, surgical code fixes and targeted problem resolution
- GitHub Copilot: Real-time tab completions, inline code suggestions, and automated commit message generation
This multi-AI approach enabled rapid development while maintaining high code quality and comprehensive documentation across the entire cloud engineering solution.
Achieving precise, surgical code changes required extensive iteration and refinement of the system prompt. The challenge was balancing comprehensive capabilities with focused execution - ensuring the agent could handle complex scenarios while maintaining minimal, targeted fixes for specific issues.
Initial exploration of a multi-agent architecture revealed significant limitations for precision-focused tasks:
- Context Fragmentation: Specialized agents (orchestrator, pr-agent, knowledge-base-agent, operations-agent, jira-agent) only saw partial context, leading to suboptimal decisions
- Over-Specialization: Individual agents felt compelled to "add value" within their domain, resulting in broader changes than necessary
- Communication Overhead: Information loss and transformation occurred during handoffs between agents
- Competing Objectives: Different agents had conflicting approaches to problem-solving
The evaluation conclusively demonstrated that a single-agent architecture with a well-crafted system prompt significantly outperformed the multi-agent approach for surgical infrastructure fixes:
- Full Context Awareness: Complete problem visibility without information fragmentation
- Clear Single Objective: Direct focus on fixing specific errors without role confusion
- Simplified Execution Path: Elimination of complex orchestration overhead
- Consistent Precision: Reliable delivery of minimal, targeted changes
This architectural insight proved crucial for achieving the system's core requirement of surgical precision in automated error response workflows.
Explore the Cloud Engineer Agent capabilities through interactive demonstrations:
- Automated Error Response - Complete workflow from CloudWatch error detection to automated Jira ticket creation and GitHub PR generation
- Root Cause Analysis - Systematic investigation and diagnosis of complex AWS infrastructure issues, including organizational policy conflicts
- AWS Well-Architected Review - Comprehensive infrastructure assessment against the Well-Architected Framework pillars with automated Jira Epic creation
- Cloud Operations - Direct AWS service interactions, resource management, and infrastructure operations
- General Queries - AWS documentation lookup, best practices guidance, and expert recommendations
cloud-engineer/
├── README.md # Main project documentation
├── package.json # Node.js dependencies and scripts
├── cdk.json # CDK configuration
├── tsconfig.json # TypeScript configuration
├── jest.config.js # Jest testing configuration
├── LICENSE.md # Project license
├── .gitignore # Git ignore patterns
├── .npmignore # NPM ignore patterns
├── cdk.context.json # CDK context cache
│
├── agent/ # Lambda function source code
│ ├── agent.py # Main Lambda handler
│ ├── cloud_engineer.py # Core agent implementation
│ ├── system_prompt.md # Agent behavior definition
│ ├── requirements.txt # Python dependencies
│ └── Dockerfile # Container configuration
│
├── bin/ # CDK application entry point
│ └── cloud-engineer.ts # CDK app definition
│
├── lib/ # CDK infrastructure code
│ └── cloud-engineer-stack.ts # Main infrastructure stack
│
├── mcp-proxy/ # MCP server proxy configuration
│ ├── Dockerfile # Proxy container configuration
│ ├── entrypoint.sh # Container startup script
│ └── mcp-servers.json # MCP server definitions
│
├── demos/ # Demo screenshots and documentation
│ ├── automated-error-response/ # Error response workflow demos
│ ├── cloud-ops/ # AWS operations demos
│ ├── cost-forecasting/ # AWS cost forecasting demos
│ ├── general-queries/ # Documentation query demos
│ ├── query-jira/ # Jira query demos
│ ├── root-cause-analysis/ # Infrastructure issue investigation demos
│ └── well-architected-review/ # AWS Well-Architected Framework assessment demos
│
├── generated-diagrams/ # Architecture diagrams
│ └── cloud-engineer-architecture.png # Current system architecture
│
└── tests/ # Test files
    └── test_cloud_engineer.py # Agent unit tests
- Auto-scaling Lambda functions
- Distributed MCP server architecture
The following features are planned for future implementation:
- Bedrock Knowledge Base or S3 Vectors: RAG implementation with internal knowledge repository for enhanced contextual responses
- Memory Strands Tool: Advanced context retention and conversation history management within AWS Lambda
- CloudWatch Dashboard: Cost Explorer integration for monitoring and visualizing inference costs
- API Security Implementation: Advanced security measures and authentication
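One measure that could fall under the planned API security item is verifying Slack's request signature before the Lambda acts on a message. The sketch below follows Slack's documented signing scheme (an HMAC-SHA256 over "v0:timestamp:body", compared against the X-Slack-Signature header). The function name and the five-minute replay window are illustrative choices, and nothing here is confirmed as part of the project's roadmap.

```python
import hashlib
import hmac
import time

MAX_SKEW_SECONDS = 60 * 5  # reject requests older than five minutes (replay protection)


def verify_slack_signature(signing_secret: str, timestamp: str, body: str,
                           signature: str, now=None) -> bool:
    """Check a Slack request signature against the raw request body."""
    now = time.time() if now is None else now
    try:
        if abs(now - int(timestamp)) > MAX_SKEW_SECONDS:
            return False
    except ValueError:
        return False  # timestamp header was not a number

    basestring = f"v0:{timestamp}:{body}".encode("utf-8")
    digest = hmac.new(signing_secret.encode("utf-8"), basestring, hashlib.sha256)
    expected = "v0=" + digest.hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

The check needs the raw, unparsed request body, so it would have to run before any JSON decoding of the API Gateway event.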