feat(devx): local dev setup for control plane and full end-to-end flow (MLI-6681) #823

Open
lilyz-ai wants to merge 8 commits into main from lilyz-ai/mli-6681-control-plane-local-devx

lilyz-ai commented May 7, 2026

Summary

Adds a complete local development workflow for model-engine so developers can iterate on both control plane code and the full endpoint lifecycle without cloud credentials or prod images.

Control-plane-only mode (make dev-server):

  • Spins up Postgres + Redis via docker-compose
  • LOCAL=true activates fake queue/docker/k8s implementations (mirrors CIRCLECI=true)
  • Full gateway API available at :5000 with auth skipped — no k8s cluster needed

Full end-to-end mode (make dev-server-full + make dev-service-builder + make dev-k8s-cacher):

  • make kind-up + make kind-image creates a local kind cluster and loads model-engine:local into it
  • Service Builder picks up endpoint creation tasks from local Redis and creates real k8s Deployments in kind
  • K8s Cacher polls kind and writes endpoint status back to Redis
  • Echo server (model-engine:local) used as the inference container — no GPU required

Code fixes included:

  • service_builder/celery.py + celery_task_queue_gateway.py: onprem cloud provider now uses redis Celery backend instead of s3 — without this, the Service Builder writes results to Redis but the Gateway looks in S3, leaving endpoints stuck in PENDING
  • dependencies.py: LOCAL=true + cloud_provider=onprem falls through to real OnPremQueueEndpointResourceDelegate instead of the fake
  • env_vars.py: GIT_TAG defaults to "local" when LOCAL=true so k8s templates reference the correct model-engine:local image
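
The backend mismatch in the first fix can be pictured with a small sketch. This is not the code from `service_builder/celery.py` or `celery_task_queue_gateway.py`; the function names are hypothetical, and other provider-specific branches are left out. It only illustrates why both processes need to resolve the same result backend for `onprem`.

```python
def backend_before_fix(cloud_provider: str) -> str:
    """Hypothetical pre-fix behaviour: onprem had no branch of its own."""
    # Other provider-specific branches are omitted; onprem fell through to "s3".
    return "s3"


def backend_after_fix(cloud_provider: str) -> str:
    """Hypothetical post-fix behaviour: onprem resolves to the Redis backend."""
    if cloud_provider == "onprem":
        return "redis"
    return "s3"


# The writer and the reader of task results must agree on the backend.
# If only one side is fixed, results land in Redis while the other side polls S3.
service_builder_backend = backend_after_fix("onprem")
gateway_backend_unfixed = backend_before_fix("onprem")
gateway_backend_fixed = backend_after_fix("onprem")

print(service_builder_backend == gateway_backend_unfixed)  # mismatch: endpoint stays PENDING
print(service_builder_backend == gateway_backend_fixed)    # match: status can be read back
```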

New files:

  • docker-compose.local.yml — Postgres 15 + Redis 7 with healthchecks and persistent volume
  • service_configs/service_config_local.yaml — HMI config for local services
  • model_engine_server/core/configs/local-full.yaml — onprem infra config for kind
  • Makefile — all dev targets in one place

Test plan

  • make dev-up && make dev-migrate && make dev-server — gateway starts, GET /v1/model-endpoints returns 200
  • make kind-up && make kind-image — kind cluster created, model-engine:local loaded
  • make dev-server-full + make dev-service-builder + make dev-k8s-cacher — all three processes start cleanly
  • POST a sync CPU endpoint with the echo server image → pod appears in kubectl --context kind-llm-engine get pods -n model-engine and endpoint transitions to READY
  • Existing unit tests pass: make test

Closes MLI-6681

🤖 Generated with Claude Code

Greptile Summary

  • Adds a complete local development workflow with two modes: control-plane-only (make dev-server using fake k8s/queue/docker) and full end-to-end via kind (make dev-server-full + service builder + k8s cacher), with all env vars pinned in the Makefile.
  • Fixes a backend/broker mismatch for cloud_provider=onprem: both celery_task_queue_gateway.py and service_builder/celery.py now use "redis" as the Celery result backend for onprem (previously fell through to "s3"), which resolves endpoints getting stuck in PENDING in the full local flow.
  • dependencies.py is updated so LOCAL=true + onprem correctly uses the real OnPremQueueEndpointResourceDelegate and Redis task queues, while LOCAL=true + any other provider still uses the fake delegate.
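
The delegate selection described in the last bullet might look roughly like the sketch below. The function name and the string comparison are assumptions made for illustration; the actual conditionals in `dependencies.py` are not shown on this page, and the `CIRCLECI=true` path is not modeled.

```python
def uses_fake_queue_delegate(local: bool, cloud_provider: str) -> bool:
    """Return True when the fake queue delegate should be used (illustrative only)."""
    if not local:
        # Outside LOCAL mode the real, provider-specific delegate is used.
        return False
    # LOCAL=true + onprem falls through to the real OnPremQueueEndpointResourceDelegate;
    # LOCAL=true + any other provider keeps the fake.
    return cloud_provider != "onprem"


for provider in ("onprem", "aws"):
    kind = "fake" if uses_fake_queue_delegate(True, provider) else "real"
    print(f"LOCAL=true, cloud_provider={provider}: {kind} delegate")
```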

Confidence Score: 5/5

Safe to merge — all previously flagged issues have been addressed and no new P1/P0 issues were found.

All three previously identified issues (cache_redis_aws_url, ML_INFRA_SERVICES_CONFIG_PATH not pinned, celery backend mismatch) are resolved in this revision. The logic changes in dependencies.py, celery_task_queue_gateway.py, and service_builder/celery.py are consistent and correct. Only dev tooling and config files are affected.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| model-engine/Makefile | New Makefile with dev targets; ML_INFRA_SERVICES_CONFIG_PATH is now pinned in both LOCAL_ENV and FULL_LOCAL_ENV, and docker compose --wait is used instead of manual polling loops. |
| model-engine/model_engine_server/api/dependencies.py | Three conditional blocks updated: fake queue delegate is skipped for LOCAL+onprem, Redis task queues selected for LOCAL mode, and FakeDockerRepository used for LOCAL. Logic is correct for both control-plane-only and full e2e modes. |
| model-engine/model_engine_server/infra/gateways/celery_task_queue_gateway.py | Backend protocol now correctly uses 'redis' for onprem, matching the fix in service_builder/celery.py; resolves the gateway/service-builder backend mismatch that caused endpoints to stick in PENDING. |
| model-engine/model_engine_server/service_builder/celery.py | Backend protocol updated to 'redis' for onprem; broker type for onprem is driven by celery_broker_type_redis: true in local-full.yaml via force_redis, which correctly yields Redis. |
| model-engine/service_configs/service_config_local.yaml | Uses cache_redis_onprem_url (not cache_redis_aws_url), correctly bypassing the cloud-provider assertion in hmi_config.cache_redis_url. |
| model-engine/model_engine_server/core/configs/local-full.yaml | New onprem infra config for kind; celery_broker_type_redis: true is the key flag that routes the service builder broker to Redis instead of SQS for onprem. |

Reviews (6): Last reviewed commit: "fix(test): include otlp.proto.grpc in OT..."

lilyz-ai and others added 2 commits May 7, 2026 01:59
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a one-command local development workflow for the model engine control
plane so developers can iterate on gateway/service-builder code without
building prod images or touching live infra.

- docker-compose.local.yml: spins up Postgres 15 + Redis 7
- service_configs/service_config_local.yaml: HMI config for local services
- Makefile: dev-up / dev-migrate / dev-server / dev-down / test targets
- LOCAL=true env var now activates fake queue/docker implementations
  (parallel to existing CIRCLECI=true path) and skips GIT_TAG requirement
- README: new "Control Plane Local Setup" section with full walkthrough

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread model-engine/service_configs/service_config_local.yaml Outdated
Comment thread model-engine/Makefile Outdated
Comment thread model-engine/Makefile Outdated
…G_PATH

- service_config_local.yaml: switch from cache_redis_aws_url to
  cache_redis_onprem_url so the Redis URL is resolved before the
  cloud_provider assertion fires — fixes startup failure for non-AWS configs
- Makefile: pin ML_INFRA_SERVICES_CONFIG_PATH to default.yaml so local
  dev is not affected by a developer's ambient infra config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
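
A minimal sketch of why the key name matters, assuming the URL property checks the onprem key before it reaches the provider assertion. The class name, field defaults, error messages, and the example URL are hypothetical; only `cache_redis_onprem_url`, `cache_redis_aws_url`, `cache_redis_url`, and the existence of a `cloud_provider` assertion come from the commit message and review summary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HmiConfigSketch:
    """Illustrative stand-in for the HMI config; not the real class."""

    cloud_provider: str
    cache_redis_onprem_url: Optional[str] = None
    cache_redis_aws_url: Optional[str] = None

    @property
    def cache_redis_url(self) -> str:
        # The onprem URL is returned first, so the assertion below is never reached.
        if self.cache_redis_onprem_url:
            return self.cache_redis_onprem_url
        if self.cache_redis_aws_url:
            # Assumed shape of the assertion that broke startup for non-AWS configs.
            assert self.cloud_provider == "aws", "cache_redis_aws_url requires aws"
            return self.cache_redis_aws_url
        raise ValueError("no cache Redis URL configured")


local_cfg = HmiConfigSketch(
    cloud_provider="onprem",
    cache_redis_onprem_url="redis://localhost:6379/0",  # example value only
)
print(local_cfg.cache_redis_url)
```

With the same URL placed under `cache_redis_aws_url` and a non-AWS provider, this sketch raises `AssertionError`, which matches the startup failure the commit describes.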
Comment thread model-engine/README.md
lilyz-ai and others added 2 commits May 7, 2026 02:32
- README: add ML_INFRA_SERVICES_CONFIG_PATH to the manual env-var snippet
  so developers with non-AWS ambient configs don't accidentally hit
  the cloud_provider assertion
- docker-compose.local.yml: mount a named volume for Postgres so the
  database survives dev-down/dev-up cycles

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces the manual until-loops in dev-up with `docker compose up --wait`,
which blocks until healthchecks pass and exits non-zero if they fail —
eliminating the infinite-spin on container crash.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
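
For readers who drive the stack from a script rather than the Makefile, the same behaviour can be sketched in Python. The helper names are hypothetical, and the Makefile's exact invocation is not shown on this page; `docker compose up --wait` itself is a real Compose v2 flag that waits for services to be running or healthy and implies detached mode.

```python
import subprocess
from typing import List


def compose_up_command(compose_file: str) -> List[str]:
    """Build the argv for bringing the stack up and waiting on healthchecks."""
    return ["docker", "compose", "-f", compose_file, "up", "--wait"]


def dev_up(compose_file: str = "docker-compose.local.yml") -> None:
    """Start Postgres and Redis; raises CalledProcessError if a healthcheck fails."""
    # check=True turns a non-zero exit into an exception instead of a silent spin.
    subprocess.run(compose_up_command(compose_file), check=True)


print(" ".join(compose_up_command("docker-compose.local.yml")))
```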

lilyz-ai commented May 7, 2026

@greptile review

lilyz-ai commented May 7, 2026

/greptile

lilyz-ai and others added 2 commits May 7, 2026 03:07
Extends the local dev setup so the complete control plane → Service Builder
→ k8s inference pod flow can be tested locally without cloud credentials.

Changes:
- local-full.yaml: new onprem infra config pointing to localhost Redis/kind
- dependencies.py: LOCAL=true + cloud_provider=onprem falls through to real
  Redis queue delegate instead of the fake (enabling full k8s flow)
- service_builder/celery.py: fix onprem to use redis backend not s3
- env_vars.py: default GIT_TAG to "local" when LOCAL=true so k8s templates
  reference the correct model-engine:local image loaded into kind
- Makefile: kind-up/kind-down/kind-image targets + dev-server-full,
  dev-service-builder, dev-k8s-cacher targets using FULL_LOCAL_ENV
- README: full end-to-end setup section with step-by-step instructions,
  example endpoint creation, and flow table

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
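
The GIT_TAG default from this commit could be sketched as follows. The function names, the `"true"` string check, and the error message are assumptions; the commit only states that GIT_TAG defaults to "local" when LOCAL=true so that templates resolve to `model-engine:local`.

```python
from typing import Mapping


def resolve_git_tag(env: Mapping[str, str]) -> str:
    """Pick the image tag, defaulting to "local" in LOCAL mode (illustrative)."""
    tag = env.get("GIT_TAG")
    if tag:
        # An explicitly set tag is assumed to win over the default.
        return tag
    if env.get("LOCAL") == "true":
        return "local"
    raise ValueError("GIT_TAG must be set when LOCAL is not true")


def image_reference(env: Mapping[str, str]) -> str:
    """Image reference a k8s template would use under this sketch."""
    return f"model-engine:{resolve_git_tag(env)}"


print(image_reference({"LOCAL": "true"}))
```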
The gateway's module-level backend_protocol had the same aws/gcp/azure
mapping as service_builder/celery.py. Without this fix, the Service Builder
writes task results to Redis but the Gateway looks in S3, leaving endpoints
stuck in PENDING under the kind-based full local flow.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

lilyz-ai commented May 7, 2026

@greptile review

lilyz-ai commented May 7, 2026

/greptile

lilyz-ai changed the title from "feat(devx): local control plane dev setup (MLI-6681)" to "feat(devx): local dev setup for control plane and full end-to-end flow (MLI-6681)" on May 7, 2026
The exporter package was imported unconditionally under the OTEL_AVAILABLE
flag which only checked the base SDK, not the exporter. Include it in the
try block so OTEL_AVAILABLE stays False when the exporter is absent, fixing
the ImportError that caused run_unit_tests_server to fail.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
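
The guarded-import pattern this commit describes might look like the sketch below. The module paths are the public OpenTelemetry Python ones; the `build_exporter` helper is hypothetical, and the repository's actual tracing module is not reproduced here.

```python
try:
    # Base API and SDK.
    from opentelemetry import trace  # noqa: F401
    from opentelemetry.sdk.trace import TracerProvider  # noqa: F401

    # The exporter lives in a separate package, so it is imported inside the
    # same try block; if it is missing, the flag below stays False.
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

    OTEL_AVAILABLE = True
except ImportError:
    OTEL_AVAILABLE = False


def build_exporter():
    """Return an OTLP span exporter, or None when OpenTelemetry is not installed."""
    if not OTEL_AVAILABLE:
        return None
    return OTLPSpanExporter()


print(f"OpenTelemetry tracing available: {OTEL_AVAILABLE}")
```

Keeping all three imports in one try block means callers only need to check a single flag before touching any tracing code.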