Skip to content

dinindunz/cloud-engineer

Repository files navigation

Cloud Engineer Agent - Complete Architecture

Overview

This architecture represents a comprehensive cloud engineer agent solution built on AWS, using Slack as the user interface, powered by Amazon Bedrock's Claude model, and enhanced with MCP servers and Strands tools for extended functionality.

Architecture Components

┌─────────────┐    ┌─────────────────┐    ┌──────────────────────────────────────────────────────────────┐         ┌────────────────┐
│    Slack    │───▶│   API Gateway   │───▶│                      Lambda Function                         │────────▶│   S3 Vectors   │  
│  Interface  │    │                 │    │                       (AWS Strands)                          │         └────────────────┘
└─────────────┘    └─────────────────┘    │ ┌──────────────────────────────────────────────────────────┐ │         ┌────────────────┐
                                          │ │                       Tools                              │ │────────▶│    DynamoDB    │
                                          │ │┌───────────────┐ ┌───────────────────┐ ┌───────────┐     │ │         └────────────────┘
                   ┌─────────────────┐    │ ││ aws_cloudwatch│ | aws_cost_explorer │ │ atlassian │     │ │     
                   │ CloudWatch Logs │───▶│ │└───────────────┘ └───────────────────┘ └───────────┘     │ │
                   └─────────────────┘    │ │┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌────────┐│ │
                                          │ ││ aws_eks │ │ aws_ecs │ │ aws_ecs │ │ use_aws │ │ memory ││ │
                                          │ │└─────────┘ └─────────┘ └─────────┘ └─────────┘ └────────┘│ │
                                          │ └──────────────────────────────────────────────────────────┘ │
                                          └──────────────────────────────────────────────────────────────┘
                                                                           │
                              ┌────────────────────────────────────────────┼───────────────────────────────────────────┐
                              │                                            │                                           │
                              ▼                                            ▼                                           ▼
   ┌────────────────────────────────────────────────────────┐  ┌──────────────────────────────────┐   ┌────────────────────────────────────┐                                   
   │                       Fargate                          │  │          Amazon Bedrock          │   │            Usage Metrics           │
   │                  ┌─────────────────┐                   │  │                                  │   │                                    │              
   │                  │    mcp-proxy    │                   │  │ ┌─────────────┐  ┌─────────────┐ │   │ ┌───────────────┐ ┌───────────────┐│
   │                  └─────────────────┘                   │  │ │   Model     │  │  Guardrails │ │   │ │ Cost Explorer │ │  CloudWatch   ││      
   │                          │                             │  │ └─────────────┘  └─────────────┘ │   │ └───────────────┘ │  Dashboard    ││     
   │                          │                             │  └──────────────────────────────────┘   │                   └───────────────┘│
   │                          ▼                             │                                         └────────────────────────────────────┘
   │┌─────────────────────────────────────────────────────┐ │                
   ││                    MCP Servers                      │ │         ┌─────────────────┐                                                       
   ││                                                     │ │         │     External    │ 
   ││┌──────────────┐  ┌──────────────┐  ┌──────────────┐ │ │         │ ┌─────────────┐ │                             
   │││   aws-docs   │  │  atlassian   │  │    github    │ │ │         │ │   GitHub    │ │ 
   ││└──────────────┘  └──────────────┘  └──────────────┘ │ │         │ └─────────────┘ │    
   ││┌──────────────┐  ┌──────────────┐  ┌──────────────┐ │ │────────▶│ ┌─────────────┐ │  
   │││ cost-explorer│  │   aws-eks    │  │   aws-ecs    │ │ │         │ │  Atlassian  │ │
   ││└──────────────┘  └──────────────┘  └──────────────┘ │ │         │ └─────────────┘ │   
   ││┌──────────────┐                                     │ │         │ ┌─────────────┐ │ 
   │││  cloudwatch  │                                     │ │         │ │    AWS      │ │    
   ││└──────────────┘                                     │ │         │ │Documentation│ │      
   │└─────────────────────────────────────────────────────┘ │         │ └─────────────┘ │ 
   └────────────────────────────────────────────────────────┘         └─────────────────┘                                                                 
                                                                         

Architecture Diagram Generated using aws-diagram MCP server.

System Context

This Lambda function is triggered by two primary sources:

  • Slack Messages: User interactions through Slack interface for cloud engineering queries and operations
  • CloudWatch Log Events: Automated error detection and response workflow for infrastructure monitoring

The system behavior is defined in agent/system_prompt.md, which outlines the agent's capabilities for AWS operations management and automated error response workflows.

Enhanced Data Flow

  1. Input Sources:

    • Users interact through Slack with cloud engineering queries
    • CloudWatch Logs trigger automated error response workflows
  2. API Gateway: Slack webhook sends a request to AWS API Gateway

  3. Lambda Processing: AWS Strands-powered Lambda function processes requests using integrated tools:

    • aws_doc_tools: Access to AWS documentation and best practices
    • aws_cdk_tools: CDK-specific operations and guidance
    • github_tools: Repository management and pull request operations
    • atlassian_tools: Jira integration for issue tracking and project management
    • use_aws: Direct AWS service interactions and resource management
    • memory: Context retention and conversation history
  4. Service Integration:

    • MCP Proxy (ALB): Load-balanced access to containerized MCP servers running on Fargate
    • Amazon Bedrock: Claude model for AI processing and Guardrails for content safety
    • External APIs: GitHub API, Atlassian API, and AWS Documentation services
  5. Response Processing: Lambda aggregates responses from all integrated services and tools

  6. Output Delivery: Processed responses flow back to Slack, with automated Jira ticket creation and GitHub PR generation for error response workflows

Key Components

MCP Servers

  • AWS Documentation MCP Server: Provides real-time access to AWS documentation, best practices, and technical guides
  • AWS CDK MCP Server: Offers CDK-specific operations, template generation, and infrastructure as code guidance
  • GitHub MCP Server: Enables repository management, pull request operations, and version control integration
  • Atlassian MCP Server: Provides Jira integration for issue tracking, project management, and workflow automation

Strands Tools

  • use_aws Tool: Enables direct interaction with AWS services for operational tasks, resource management, and configuration changes
  • memory: Store user and agent memories across agent runs to provide personalized experiences with both Mem0 and Amazon Bedrock Knowledge Bases

Amazon Bedrock Services

  • Claude Model: Advanced language model for understanding and generating responses
  • Guardrails: Content filtering and safety validation
  • Knowledge Base: RAG implementation with internal knowledge repository

DynamoDB Integration

  • Message Deduplication Table: Prevents duplicate Slack message processing across Lambda executions
    • Partition Key: message_id (MD5 hash of timestamp, user, channel, and message text)
    • TTL: 1-hour automatic cleanup of processed message records
    • Atomic Operations: Conditional writes ensure race-condition-free duplicate detection
    • Cross-Lambda Persistence: Maintains deduplication state across multiple Lambda invocations
    • Billing: Pay-per-request pricing model for cost-effective operation

Enhanced Capabilities

Documentation & Learning

  • Real-time AWS documentation lookup
  • Best practices and architectural guidance
  • Service-specific technical references
  • Troubleshooting guides and solutions

Cost Management

  • Real-time cost analysis and reporting
  • Budget monitoring and alerts
  • Cost optimization recommendations
  • Resource utilization insights

AWS Operations

  • Direct AWS service interactions
  • Resource provisioning and management
  • Configuration changes and updates
  • Infrastructure monitoring and control

AI-Powered Assistance

  • Natural language query processing
  • Context-aware responses
  • Multi-service orchestration
  • Intelligent recommendation engine

Security & Compliance

  • Lambda execution environment isolation
  • Bedrock Guardrails for content safety
  • AWS IAM for granular access control
  • Audit logging for all operations

AI Tooling Used in Development

This project leveraged various AI tools throughout the development lifecycle to enhance productivity and code quality:

Product Development & Planning

  • Claude: Product Requirements Document (PRD) creation and architectural planning
  • Cline + Mantel API Gateway: Large-scale codebase development, refactoring, and feature implementation

Documentation & Visual Assets

  • Gemini: README generation and documentation creation based on demo screenshots
  • aws-diagram-mcp: Automated architecture diagram generation and visualization

Code Development & Maintenance

  • Amazon Q: Precise, surgical code fixes and targeted problem resolution
  • GitHub Copilot: Real-time tab completions, inline code suggestions, and automated commit message generation

This multi-AI approach enabled rapid development while maintaining high code quality and comprehensive documentation across the entire cloud engineering solution.

Development Challenges

System Prompt Engineering

Achieving precise, surgical code changes required extensive iteration and refinement of the system prompt. The challenge was balancing comprehensive capabilities with focused execution - ensuring the agent could handle complex scenarios while maintaining minimal, targeted fixes for specific issues.

Multi-Agent Architecture Evaluation

Initial exploration of a multi-agent architecture revealed significant limitations for precision-focused tasks:

  • Context Fragmentation: Specialized agents (orchestrator, pr-agent, knowledge-base-agent, operations-agent, jira-agent) only saw partial context, leading to suboptimal decisions
  • Over-Specialization: Individual agents felt compelled to "add value" within their domain, resulting in broader changes than necessary
  • Communication Overhead: Information loss and transformation occurred during handoffs between agents
  • Competing Objectives: Different agents had conflicting approaches to problem-solving

Architecture Decision: Single Agent Superiority

The evaluation conclusively demonstrated that a single-agent architecture with a well-crafted system prompt significantly outperformed the multi-agent approach for surgical infrastructure fixes:

  • Full Context Awareness: Complete problem visibility without information fragmentation
  • Clear Single Objective: Direct focus on fixing specific errors without role confusion
  • Simplified Execution Path: Elimination of complex orchestration overhead
  • Consistent Precision: Reliable delivery of minimal, targeted changes

This architectural insight proved crucial for achieving the system's core requirement of surgical precision in automated error response workflows.

Demos

Explore the Cloud Engineer Agent capabilities through interactive demonstrations:

  • Automated Error Response - Complete workflow from CloudWatch error detection to automated Jira ticket creation and GitHub PR generation
  • Root Cause Analysis - Systematic investigation and diagnosis of complex AWS infrastructure issues, including organizational policy conflicts
  • AWS Well-Architected Review - Comprehensive infrastructure assessment against all five Well-Architected pillars with automated Jira Epic creation
  • Cloud Operations - Direct AWS service interactions, resource management, and infrastructure operations
  • General Queries - AWS documentation lookup, best practices guidance, and expert recommendations

File Structure

cloud-engineer/
├── README.md                           # Main project documentation
├── package.json                        # Node.js dependencies and scripts
├── cdk.json                            # CDK configuration
├── tsconfig.json                       # TypeScript configuration
├── jest.config.js                      # Jest testing configuration
├── LICENSE.md                          # Project license
├── .gitignore                          # Git ignore patterns
├── .npmignore                          # NPM ignore patterns
├── cdk.context.json                    # CDK context cache
│
├── agent/                              # Lambda function source code
│   ├── agent.py                        # Main Lambda handler
│   ├── cloud_engineer.py               # Core agent implementation
│   ├── system_prompt.md                # Agent behavior definition
│   ├── requirements.txt                # Python dependencies
│   └── Dockerfile                      # Container configuration
│
├── bin/                                # CDK application entry point
│   └── cloud-engineer.ts               # CDK app definition
│
├── lib/                                # CDK infrastructure code
│   └── cloud-engineer-stack.ts         # Main infrastructure stack
│
├── mcp-proxy/                          # MCP server proxy configuration
│   ├── Dockerfile                      # Proxy container configuration
│   ├── entrypoint.sh                   # Container startup script
│   └── mcp-servers.json                # MCP server definitions
│
├── demos/                              # Demo screenshots and documentation
│   ├── automated-error-response/       # Error response workflow demos
│   ├── cloud-ops/                      # AWS operations demos
│   ├── cost-forecasting/               # AWS cost forecasting demos
│   ├── general-queries/                # Documentation query demos
│   ├── query-jira/                     # Jira query demos
│   ├── root-cause-analysis/            # Infrastructure issue investigation demos
│   └── well-architected-review/        # AWS Well-Architected Framework assessment demos
│
├── generated-diagrams/                 # Architecture diagrams
│   └── cloud-engineer-architecture.png # Current system architecture
│
└── tests/                              # Test files
    └── test_cloud_engineer.py          # Agent unit tests

Scalability & Performance

  • Auto-scaling Lambda functions
  • Distributed MCP server architecture

Future Improvements

The following features are planned for future implementation:

  • Bedrock Knowledge Base or S3 Vector: RAG implementation with internal knowledge repository for enhanced contextual responses
  • Memory Strands Tool: Advanced context retention and conversation history management within AWS Lambda
  • CloudWatch Dashboard: Comprehensive cost explorer integration for inference cost monitoring and visualization
  • API Security Implementation: Advanced security measures and authentication

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors