-
Notifications
You must be signed in to change notification settings - Fork 2.2k
fix: asyncio not saving interrupt state #5568
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
I will add some tests soon. |
|
@dobrac I have improved your tests to iterate until pending ops queue is reproduced. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5568 +/- ##
==========================================
- Coverage 83.24% 83.23% -0.01%
==========================================
Files 277 277
Lines 29263 29268 +5
==========================================
+ Hits 24359 24362 +3
- Misses 4904 4906 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
During prepare_save we must unconditionally trigger an interrupt to ensure the guest gets notified after restore. The guest may have suppressed notifications, but after snapshot/restore it needs to be woken up regardless. Fixes firecracker-microvm#5554 Signed-off-by: Constantine Peresypkin <pconstantine@gmail.com>
|
Codecov idea of "coverage" seems incorrect here. Flagging a debug print is not the best use of coverage checks. So, ignored. |
| queue.advance_used_ring_idx(); | ||
|
|
||
| if queue.prepare_kick() { | ||
| if (force_signal && used_any) || queue.prepare_kick() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so the problem seems to be that prepare_kick() returns false because of suppressed notifications from the side of the guest and then we somehow miss the chance to notify the guest.
If that is the case, then snapshot or no snapshot prepare_kick() should eventually return true and we should send the interrupt, which means that there's a problem in the way we save the state of the queue. If that's the case, let's fix that (instead of sending an unsolicited interrupt).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think prepare_kick() will eventually return true.
No new I/O is coming, guest kernel is in HLT state waiting for interrupt (I did a thread dump when it froze)
So it looks like a deadlock to me, thus just interrupting it solves the problem.
We can probably track num_added at the time of prepare_save or do not call prepare_kick at the time of the prepare_save at all.
But I'm not sure it's a better solution.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think prepare_kick() will eventually return true.
Well, it should, right? Although, indirectly. The fact that we don't send the interrupt shouldn't matter because the guest is not expecting it (unless there's a bug in prepare_kick()). So, the guest should see the descriptor we just added (even without the interrupt) and should continue adding more descriptors, so eventually we would call prepare_kick() and it would return true
The thing that is perplexing to me is that I can't understand why your solution fixes the issue. You are injecting an interrupt in the guest while we're taking a snapshot, but we're saving KVM state before we save device state. This means that the guest should never see this interrupt 😝
What I am afraid is that us sending the interrupt while taking the snapshot simply changes the ordering of things and hides the problem rather than actually fixing it.
Not saying that this is definitely the case, but I'm still looking into it. I want to make sure that we fix this properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this bug is all about race condition as adding simple debug prints reduces reproduction rate significantly. So I agree that it looks fishy.
|
@bchalios As you can see both vCPUs are halted waiting for interrupt, so the interrupt is not really "unsolicited" (although technically it is). |
During prepare_save we must unconditionally trigger an interrupt to ensure the guest gets notified after restore. The guest may have suppressed notifications, but after snapshot/restore it needs to be woken up regardless.
Fixes #5554
Changes
Fixes a bug where guest would hang indefinitely on interrupts after resume.
Reason
See above.
License Acceptance
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.PR Checklist
tools/devtool checkbuild --allto verify that the PR passesbuild checks on all supported architectures.
tools/devtool checkstyleto verify that the PR passes theautomated style checks.
how they are solving the problem in a clear and encompassing way.
in the PR.
CHANGELOG.md.Runbook for Firecracker API changes.
integration tests.
TODO.rust-vmm.