feat(metrics): add IO queue len and backend latency metrics #4486
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Coverage Diff (main vs #4486):
- Coverage: 81.61% → 81.65% (+0.03%)
- Files: 243 (unchanged)
- Lines: 29794 → 29825 (+31)
- Hits: 24316 → 24353 (+37)
- Misses: 5478 → 5472 (-6)

Flags with carried forward coverage won't be shown. View the full report in Codecov by Sentry.
Generally LGTM.
I would like it if there were more information in the commit messages about what each commit does and, mainly, why. Right now, the descriptions make sense when reading the PR description, but by themselves they are a bit cryptic.
Other than that, right now, the new latency metrics only measure the backend. I think it would make sense to also measure the time we spend in Firecracker to handle each request. Ideally, this should give us a sense of the "overhead" of Firecracker's processing and could point to low-hanging fruit optimizations in the short term, as well as protect us from doing stupid things in the long term. WDYT?
remaining_reqs_count and tx_remaining_reqs_count could be a bit confusing since they are reset every time we flush. So technically there could be requests still pending in the queue, but the metrics won't report them because they have already been reported. I think it's fine the way they are right now, but we should explicitly call out that these two metrics being 0 does not mean that there are no requests in the queue. This could be confusing if we are debugging a crash where the last metrics showed that there are no requests but there actually were.
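To illustrate the reset-on-flush behaviour discussed here, below is a minimal sketch (not Firecracker's actual metrics code; the type and field names are placeholders) of a counter that only reports the delta accumulated since the previous flush:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical flush-reset counter, sketched only to illustrate the
/// semantics discussed above; Firecracker's real metric types may differ.
struct FlushResetCounter {
    value: AtomicU64,        // total accumulated since boot
    last_flushed: AtomicU64, // snapshot taken at the previous flush
}

impl FlushResetCounter {
    fn add(&self, n: u64) {
        self.value.fetch_add(n, Ordering::Relaxed);
    }

    /// Reports only what was accumulated since the last flush.
    fn flush(&self) -> u64 {
        let current = self.value.load(Ordering::Relaxed);
        let previous = self.last_flushed.swap(current, Ordering::Relaxed);
        current - previous
    }
}
```

With semantics like these, a flushed value of 0 only means nothing new was accumulated since the last flush; requests counted in an earlier flush are not reported again, so 0 does not imply an empty queue.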
Can we document it in the CHANGELOG or in code, saying to divide by the metric?
This is needed to collect metrics that depend on the number of remaining requests in the queue. Signed-off-by: Nikita Kalyazin <[email protected]>
Force-pushed from 5031cf9 to d085a91.
Added a note in the CHANGELOG, but not in the docs for now.
Added more info to the commit messages. I agree, it would be useful to have a pie chart of VMM time split between idling, being busy, and being blocked on lower layers. That could be the next step.
This makes it available in the block and net modules. Signed-off-by: Nikita Kalyazin <[email protected]>
LatencyAggregateMetrics metrics are dictionaries rather than ints, so they require special treatment. Signed-off-by: Nikita Kalyazin <[email protected]>
These metrics measure how long the Firecracker VMM thread is blocked on the read/write syscalls when accessing data in the drive file. By looking at them, we will be able to see what portion of block latency is attributable to factors external to Firecracker. Signed-off-by: Nikita Kalyazin <[email protected]>
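As an illustration of this kind of measurement (the same idea applies to the tap write latency described below), here is a minimal sketch of timing a blocking file read and folding the duration into an aggregate metric. The LatencyAgg type and its fields are assumptions for the sake of the example, not the PR's actual API:

```rust
use std::fs::File;
use std::io::{self, Read};
use std::time::Instant;

/// Hypothetical aggregate latency record (sum/min/max in microseconds),
/// standing in for whatever aggregate metric type the codebase uses.
#[derive(Default)]
struct LatencyAgg {
    sum_us: u64,
    min_us: u64,
    max_us: u64,
}

impl LatencyAgg {
    fn record(&mut self, us: u64) {
        self.sum_us += us;
        self.max_us = self.max_us.max(us);
        self.min_us = if self.min_us == 0 { us } else { self.min_us.min(us) };
    }
}

/// Measures how long the calling thread is blocked in the read syscall
/// and folds the duration into the aggregate metric.
fn timed_read(file: &mut File, buf: &mut [u8], agg: &mut LatencyAgg) -> io::Result<usize> {
    let start = Instant::now();
    let result = file.read(buf);
    agg.record(start.elapsed().as_micros() as u64);
    result
}
```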
This metric accumulates the number of pending requests sitting in the virtio queue while a request is being processed. By looking at this number, one can decide whether an async IO solution would be beneficial. Signed-off-by: Nikita Kalyazin <[email protected]>
This metric measures how long the Firecracker VMM thread is blocked on the write syscall when accessing a tap device. By looking at it, we will be able to see what portion of Tx net latency is attributable to factors external to Firecracker. Signed-off-by: Nikita Kalyazin <[email protected]>
This metric accumulates the number of pending requests sitting in the Tx virtio queue while a request is being processed. This may give insight into how sensible it would be to process ingress traffic in parallel. Signed-off-by: Nikita Kalyazin <[email protected]>
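The sketch below shows how such a backlog counter can be accumulated while draining a virtio queue; it applies equally to the block and Tx queue metrics above. The queue and request types are placeholders, not the PR's actual virtio code:

```rust
use std::collections::VecDeque;

/// Placeholder request and queue types; not Firecracker's actual virtio code.
struct Request;

struct Queue {
    pending: VecDeque<Request>,
}

/// Drains the queue, adding the number of requests still waiting to the
/// backlog counter each time one request is picked up for processing.
fn process_queue(queue: &mut Queue, backlog_counter: &mut u64) {
    while let Some(req) = queue.pending.pop_front() {
        *backlog_counter += queue.pending.len() as u64;
        handle(req);
    }
}

fn handle(_req: Request) {
    // Process a single block/net request here.
}
```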
- block/net file/tap access latencies
- block/net virtio queue backlog lengths

Signed-off-by: Nikita Kalyazin <[email protected]>
LGTM, thanks for adding latency_agg_metrics in the fcmetrics.py tests as well!
Changes
This adds the following accumulated metrics:
- block read/write latencies (read_agg, write_agg)
- block queue backlog length (remaining_reqs_count)
- net tap write latency (tap_agg)
- net Tx queue backlog length (tx_remaining_reqs_count)

The latency metrics show how long the VMM thread spends waiting for requests to be processed by the block/net backend. The queue backlog metrics show how many requests are still waiting to be processed by the VMM thread. They are updated during the processing of every request, so dividing them by the total number of requests shows the average queue length per request processed.
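For example, the post-processing implied by that division can be sketched as below; the argument names are assumptions based on the description above, not an exact metrics schema:

```rust
/// Average number of requests waiting in the queue per processed request,
/// derived from the accumulated backlog counter.
fn average_queue_len(backlog_sum: u64, total_requests: u64) -> f64 {
    if total_requests == 0 {
        return 0.0;
    }
    backlog_sum as f64 / total_requests as f64
}

/// Average time the VMM thread spent blocked on the backend per request,
/// derived from the accumulated latency sum (in microseconds).
fn average_backend_latency_us(latency_sum_us: u64, total_requests: u64) -> f64 {
    if total_requests == 0 {
        return 0.0;
    }
    latency_sum_us as f64 / total_requests as f64
}
```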
Reason
We would like more visibility into how saturated the block and network backends are.
License Acceptance
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist
[ ] If a specific issue led to this PR, this PR closes the issue.
[ ] Any required documentation changes (code comments and docs) are included in this PR.
[ ] API changes follow the Runbook for Firecracker API changes.
[ ] User-facing changes are mentioned in CHANGELOG.md.
[ ] New TODOs link to an issue.
[ ] Commits meet contribution quality standards.
[ ] This functionality cannot be added in rust-vmm.