fix: skip VMs with host passthrough during DRS plan generation#13099
fix: skip VMs with host passthrough during DRS plan generation#13099raniers1 wants to merge 2 commits intoapache:mainfrom
Conversation
When iterating over VMs in getBestMigration, the call to listHostsForMigrationOfVM throws InvalidParameterValueException for VMs using a vGPU passthrough profile. This exception was not caught inside the loop, causing it to propagate up and abort the entire DRS plan generation for the cluster, leaving all other eligible VMs without a migration plan. Fix wraps the listHostsForMigrationOfVM call in a try-catch block. When the exception is caught, the VM is skipped with a debug log message and the loop continues evaluating the remaining VMs. Also adds unit test testGetBestMigrationSkipsPassthroughVm to verify that a VM throwing InvalidParameterValueException is skipped while other eligible VMs are still considered for migration. Fixes: apache#13098
|
Congratulations on your first Pull Request and welcome to the Apache CloudStack community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/cloudstack/blob/main/CONTRIBUTING.md)
|
|
@raniers1 the mentioned fix got reverted during the merge commit, can you check. |
Codecov Report✅ All modified and coverable lines are covered by tests.
Additional details and impacted files@@ Coverage Diff @@
## main #13099 +/- ##
=============================================
- Coverage 18.08% 3.52% -14.56%
=============================================
Files 6037 464 -5573
Lines 542437 40137 -502300
Branches 66420 7555 -58865
=============================================
- Hits 98088 1415 -96673
+ Misses 433333 38534 -394799
+ Partials 11016 188 -10828
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@raniers1 can you create the fix against 4.20 branch. |
There was a problem hiding this comment.
Pull request overview
This PR aims to prevent cluster DRS plan generation from aborting when encountering VMs that cannot be migrated due to host passthrough (e.g., vGPU passthrough), by skipping those VMs and continuing to evaluate the remaining candidates.
Changes:
- Skip VMs that trigger
InvalidParameterValueExceptionduring migration-host evaluation so DRS can still produce a plan for other eligible VMs. - Add a unit test intended to verify passthrough VMs are skipped while other VMs are still considered.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Mockito.when(managementServer.listHostsForMigrationOfVM(vmPassthrough, 0L, 500L, null, vmList)) | ||
| .thenThrow(new InvalidParameterValueException("Unsupported operation, VM uses host passthrough, cannot migrate")); | ||
| Mockito.when(managementServer.listHostsForMigrationOfVM(vmNormal, 0L, 500L, null, vmList)).thenReturn( | ||
| new Ternary<>(new Pair<>(List.of(destHost), 1), List.of(destHost), Map.of(destHost, false))); | ||
| Mockito.when(balancedAlgorithm.getMetrics(cluster, vmNormal, serviceOffering, destHost, new HashMap<>(), | ||
| new HashMap<>(), false)).thenReturn(new Ternary<>(1.0, 0.5, 1.5)); | ||
|
|
||
| Pair<VirtualMachine, Host> bestMigration = clusterDrsService.getBestMigration(cluster, balancedAlgorithm, | ||
| vmList, vmIdServiceOfferingMap, new HashMap<>(), new HashMap<>()); |
|
Closing this PR. The fix was accidentally reverted during conflict resolution. Will reopen targeting the 4.20 branch as requested by @vishesh92 |
|
After further investigation, the fix was already introduced in the 4.20 branch via a prior refactoring of getCompatibleHostAndVmStorageMotionCache, which includes a catch (Exception e) block that prevents InvalidParameterValueException from propagating and aborting the DRS plan generation. The bug specifically affects the 4.21.0.0 release tag, which does not have an active maintenance branch. All currently maintained branches (4.20, 4.22, main) already handle this case. Thank you for the review and guidance, @vishesh92 |
When iterating over VMs in getBestMigration, the call to listHostsForMigrationOfVM throws InvalidParameterValueException for VMs using a GPU passthrough profile. This exception was not caught inside the loop, causing it to propagate up and abort the entire DRS plan generation for the cluster, leaving all other eligible VMs without a migration plan.
Fix wraps the listHostsForMigrationOfVM call in a try-catch block. When the exception is caught, the VM is skipped with a debug log message and the loop continues evaluating the remaining VMs.
Also adds unit test testGetBestMigrationSkipsPassthroughVm to verify that a VM throwing InvalidParameterValueException is skipped while other eligible VMs are still considered for migration.
Fixes: #13098
Types of changes
Bug Severity
How Has This Been Tested?
DEBUG [o.a.c.c.ClusterDrsServiceImpl] Skipping VM VM instance
{"id":621,"instanceName":"i-23-621-VM","state":"Running","type":"User"}
for DRS, unsupported operation: Unsupported operation, VM uses host passthrough, cannot migrate.