[ZSTAC-83890] Ceph HA: sibling-fence VM before HA recreate@@3#3896
[ZSTAC-83890] Ceph HA: sibling-fence VM before HA recreate@@3#3896MatheMatrix wants to merge 1 commit into5.4.8from
Conversation
Walkthrough新增 Ceph sibling-fence SPI、请求/回复消息,并在 CephPrimaryStorageFactory.preInstantiateVmResource 的 watcher 为空路径中调用该 SPI,依据异步回复决定是否允许 VM 启动。 变更详情Ceph Sibling Fence 功能
Sequence Diagram(s)sequenceDiagram
participant Factory as CephPrimaryStorageFactory
participant Plugin as CephSiblingFenceExtensionPoint
participant Completion as completion
Factory->>Factory: 检查 root volume watchers
alt 观察者为空且扩展点存在
Factory->>Plugin: fenceVmOnFailedHost(failedHostUuid, vmUuid, clusterUuid, haTargetHostUuid, completion)
Plugin-->>Factory: SiblingFenceVmOnHostReply
alt reply.alive == false or reply.killed == true
Factory->>Completion: success()
else
Factory->>Completion: fail(ErrorCode with reason)
end
else 扩展点不存在
Factory->>Completion: success()
end
评估代码审查工作量🎯 3 (Moderate) | ⏱️ ~20 分钟 诗歌
Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (1 error, 1 warning)
✅ Passed checks (3 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
header/src/main/java/org/zstack/header/storage/ceph/CephSiblingFenceExtensionPoint.java (1)
5-11: 💤 Low value建议为接口方法添加 Javadoc 注释。
根据编码规范,接口方法必须配有有效的 Javadoc 注释。建议说明各参数的含义,特别是
failedHostUuid与haTargetHostUuid的区别,以及 completion 回调的预期行为。📝 建议的 Javadoc
public interface CephSiblingFenceExtensionPoint { + /** + * Fence a VM on a failed host via sibling host SSH. + * + * `@param` failedHostUuid the UUID of the host that failed + * `@param` vmUuid the UUID of the VM to fence + * `@param` clusterUuid the UUID of the cluster + * `@param` haTargetHostUuid the UUID of the target host for HA migration + * `@param` completion callback with SiblingFenceVmOnHostReply indicating fence result + */ void fenceVmOnFailedHost(String failedHostUuid, String vmUuid, String clusterUuid, String haTargetHostUuid, ReturnValueCompletion<SiblingFenceVmOnHostReply> completion); }As per coding guidelines: "接口方法不应有多余的修饰符(例如
public),且必须配有有效的 Javadoc 注释"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@header/src/main/java/org/zstack/header/storage/ceph/CephSiblingFenceExtensionPoint.java` around lines 5 - 11, Add a Javadoc block to the interface method fenceVmOnFailedHost in CephSiblingFenceExtensionPoint describing the method purpose and each parameter (clarify that failedHostUuid is the host that failed, haTargetHostUuid is the HA target host to migrate/recover the VM to, and explain vmUuid and clusterUuid), and document the expected behavior of the completion callback (ReturnValueCompletion<SiblingFenceVmOnHostReply> completion) including when to call success vs. failure and what fields in SiblingFenceVmOnHostReply should be populated; also remove the redundant public modifier from the interface method signature so it uses the default visibility.header/src/test/java/org/zstack/header/storage/ceph/SiblingFenceVmOnHostMsgTest.java (1)
20-32: 💤 Low value测试覆盖不完整:设置的字段未全部断言验证。
testReplyFields()设置了多个字段但仅验证了isKilled()和getExecutorRole()。建议补充对其他字段的断言以确保完整的测试覆盖。💚 建议补充的断言
r.setReason(""); assertTrue(r.isKilled()); assertEquals("sibling", r.getExecutorRole()); + assertTrue(r.isAlive()); + assertTrue(r.isSshReachable()); + assertTrue(r.isQemuFound()); + assertEquals("h-sibling", r.getExecutorHostUuid()); + assertEquals("", r.getReason()); } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@header/src/test/java/org/zstack/header/storage/ceph/SiblingFenceVmOnHostMsgTest.java` around lines 20 - 32, The test testReplyFields() in SiblingFenceVmOnHostMsgTest sets multiple fields on SiblingFenceVmOnHostReply but only asserts two; update the test method to assert all other set fields: add assertions for r.isAlive(), r.getExecutorHostUuid(), r.isSshReachable(), r.isQemuFound(), and r.getReason() (e.g., assertTrue/assertEquals as appropriate) so every setter in the test is validated.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageFactory.java`:
- Around line 1264-1285: The outcome label "success-other" is misleading because
the branch where sshReachable=true, alive=true, killed=false leads to
completion.fail(); update the outcome string in the branch that currently sets
"success-other" to a failure-oriented name (e.g., "failed-still-alive" or
"failed-fence-incomplete") so the value passed to
fireSiblingFenceAuditEvent(vmUuid, failedHostUuid, clusterUuid,
haTargetHostUuid, outcome, r.getReason()) correctly reflects the failure,
keeping the rest of the logic (checks on r.isKilled(), r.isSshReachable(),
r.isAlive(), completion.success()/completion.fail()) unchanged; ensure you only
change the outcome assignment in CephPrimaryStorageFactory around the
sibling-fence handling block.
---
Nitpick comments:
In
`@header/src/main/java/org/zstack/header/storage/ceph/CephSiblingFenceExtensionPoint.java`:
- Around line 5-11: Add a Javadoc block to the interface method
fenceVmOnFailedHost in CephSiblingFenceExtensionPoint describing the method
purpose and each parameter (clarify that failedHostUuid is the host that failed,
haTargetHostUuid is the HA target host to migrate/recover the VM to, and explain
vmUuid and clusterUuid), and document the expected behavior of the completion
callback (ReturnValueCompletion<SiblingFenceVmOnHostReply> completion) including
when to call success vs. failure and what fields in SiblingFenceVmOnHostReply
should be populated; also remove the redundant public modifier from the
interface method signature so it uses the default visibility.
In
`@header/src/test/java/org/zstack/header/storage/ceph/SiblingFenceVmOnHostMsgTest.java`:
- Around line 20-32: The test testReplyFields() in SiblingFenceVmOnHostMsgTest
sets multiple fields on SiblingFenceVmOnHostReply but only asserts two; update
the test method to assert all other set fields: add assertions for r.isAlive(),
r.getExecutorHostUuid(), r.isSshReachable(), r.isQemuFound(), and r.getReason()
(e.g., assertTrue/assertEquals as appropriate) so every setter in the test is
validated.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)
Review profile: CHILL
Plan: Pro
Run ID: 7852302d-20d8-4541-8fde-7f148df3fa64
📒 Files selected for processing (6)
header/src/main/java/org/zstack/header/storage/ceph/CephSiblingFenceExtensionPoint.javaheader/src/main/java/org/zstack/header/storage/ceph/SiblingFenceVmOnHostMsg.javaheader/src/main/java/org/zstack/header/storage/ceph/SiblingFenceVmOnHostReply.javaheader/src/test/java/org/zstack/header/storage/ceph/CephSiblingFenceExtensionPointTest.javaheader/src/test/java/org/zstack/header/storage/ceph/SiblingFenceVmOnHostMsgTest.javaplugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageFactory.java
12eb41f to
b9569aa
Compare
Resolves: ZSTAC-83890 Change-Id: I652cbc9cc303f788657a1eddf7b706b38ba7adf0
ee30adc to
a12f22c
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageFactory.java`:
- Around line 1261-1269: The success handler for SiblingFenceVmOnHostReply
currently treats only !r.isAlive() || r.isKilled() as success and may
mis-classify ssh-unreachable cases; update the success(SiblingFenceVmOnHostReply
r) branch in CephPrimaryStorageFactory so that r.isSshReachable() is checked
explicitly as part of the decision: treat ssh-unreachable (i.e.,
!r.isSshReachable()) as a success case that calls completion.success(), while
preserving the existing checks for r.isAlive() and r.isKilled() and using
completion.fail(...) only when the reply indicates an actual fence failure
(e.g., r.isAlive() && r.isSshReachable() && !r.isKilled()), referencing the
success(...) method, SiblingFenceVmOnHostReply, r.isSshReachable(), r.isAlive(),
r.isKilled(), completion.success(), and completion.fail().
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)
Review profile: CHILL
Plan: Pro
Run ID: 80110b15-e175-464e-9f33-680138ad8571
📒 Files selected for processing (4)
header/src/main/java/org/zstack/header/storage/ceph/CephSiblingFenceExtensionPoint.javaheader/src/main/java/org/zstack/header/storage/ceph/SiblingFenceVmOnHostMsg.javaheader/src/main/java/org/zstack/header/storage/ceph/SiblingFenceVmOnHostReply.javaplugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageFactory.java
✅ Files skipped from review due to trivial changes (3)
- header/src/main/java/org/zstack/header/storage/ceph/CephSiblingFenceExtensionPoint.java
- header/src/main/java/org/zstack/header/storage/ceph/SiblingFenceVmOnHostMsg.java
- header/src/main/java/org/zstack/header/storage/ceph/SiblingFenceVmOnHostReply.java
| public void success(SiblingFenceVmOnHostReply r) { | ||
| if (!r.isAlive() || r.isKilled()) { | ||
| logger.debug(String.format("ceph sibling-fence succeeded for vm[%s] on failedHost[%s]: killed=%s sshReachable=%s", | ||
| vmUuid, failedHostUuid, r.isKilled(), r.isSshReachable())); | ||
| completion.success(); | ||
| } else { | ||
| logger.warn(String.format("ceph sibling-fence failed for vm[%s] on failedHost[%s]: %s — blocking VM start to prevent split-brain", | ||
| vmUuid, failedHostUuid, r.getReason())); | ||
| completion.fail(operr("sibling fence failed: %s", r.getReason())); |
There was a problem hiding this comment.
未显式放行 SSH 不可达分支,可能误阻断 HA 拉起
Line [1262] 仅判断 !r.isAlive() || r.isKilled()。按本 PR 目标存在 success-ssh-unreachable 成功态;若扩展实现返回 sshReachable=false 且 alive=true(或后续实现调整),这里会错误进入 completion.fail(),导致可恢复场景被阻断。建议把 !r.isSshReachable() 作为明确成功条件,避免依赖实现细节。
🔧 建议修复
- if (!r.isAlive() || r.isKilled()) {
+ if (!r.isSshReachable() || !r.isAlive() || r.isKilled()) {
logger.debug(String.format("ceph sibling-fence succeeded for vm[%s] on failedHost[%s]: killed=%s sshReachable=%s",
vmUuid, failedHostUuid, r.isKilled(), r.isSshReachable()));
completion.success();🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageFactory.java`
around lines 1261 - 1269, The success handler for SiblingFenceVmOnHostReply
currently treats only !r.isAlive() || r.isKilled() as success and may
mis-classify ssh-unreachable cases; update the success(SiblingFenceVmOnHostReply
r) branch in CephPrimaryStorageFactory so that r.isSshReachable() is checked
explicitly as part of the decision: treat ssh-unreachable (i.e.,
!r.isSshReachable()) as a success case that calls completion.success(), while
preserving the existing checks for r.isAlive() and r.isKilled() and using
completion.fail(...) only when the reply indicates an actual fence failure
(e.g., r.isAlive() && r.isSshReachable() && !r.isKilled()), referencing the
success(...) method, SiblingFenceVmOnHostReply, r.isSshReachable(), r.isAlive(),
r.isKilled(), completion.success(), and completion.fail().
Summary
Fixes Ceph HA empty-watcher race that allows VM split-brain (TIC-5513). Adds sibling-host SSH fence in
PreVmInstantiatehook before HA starts a new VM on the failed host's primary storage.Changes (zstack main)
header/: newSiblingFenceVmOnHostMsg+SiblingFenceVmOnHostReplyheader/: new SPICephSiblingFenceExtensionPoint(implementation in premium)plugin/ceph/CephPrimaryStorageFactory.preInstantiateVmResource: whengetVolumeWatchersreturns empty, dispatch sibling-fence via SPI before allowing VM startceph.ha.sibling-fencewith outcome (success-killed / success-already-dead / success-ssh-unreachable / failed)Cross-repo dependencies
CephSiblingFencerSPI/ha/sibling-fence-vmkvmagent endpointTest plan
Resolves: ZSTAC-83890
Customer: TIC-5513
(Supersedes !9777, branch renamed to @@3 with squashed commit)
sync from gitlab !9778