feat(metrics): add IO queue len and backend latency metrics #4486

Merged

kalyazin merged 9 commits into firecracker-microvm:main from io_metrics on Mar 7, 2024

Conversation

kalyazin
Contributor

@kalyazin kalyazin commented Mar 5, 2024

Changes

This adds the following accumulated metrics:

  • block IO: read and write backend latencies (read_agg, write_agg)
  • block IO: length of the queue backlog (remaining_reqs_count)
  • net: tap write backend latencies (tap_agg)
  • net: length of the Tx queue backlog (tx_remaining_reqs_count)

The latency metrics show how long the VMM thread spends blocked while requests are processed by the block/net backend. The queue backlog metrics show how many requests are still waiting to be processed by the VMM thread. They are updated during the processing of every request, so dividing them by the total number of requests gives the average queue length per request processed.
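To make that arithmetic concrete, here is a minimal Rust sketch of the accounting (the `Counter` type is hypothetical; `remaining_reqs_count` is the new metric from this PR, and `queue_event_count` is the existing per-queue event counter that the review below suggests dividing by):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Minimal stand-in for Firecracker's shared counter metrics (sketch only;
// the real types live in the vmm crate and have a different shape).
#[derive(Default)]
struct Counter(AtomicU64);

impl Counter {
    fn add(&self, v: u64) {
        self.0.fetch_add(v, Ordering::Relaxed);
    }
    fn count(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

fn main() {
    let remaining_reqs_count = Counter::default(); // accumulated backlog
    let queue_event_count = Counter::default(); // requests processed

    // While draining the virtio queue, record how many requests are still
    // pending each time one request finishes processing.
    for pending in [3u64, 2, 1, 0] {
        remaining_reqs_count.add(pending);
        queue_event_count.add(1);
    }

    // Average queue length per request processed, as described above.
    let avg = remaining_reqs_count.count() as f64 / queue_event_count.count() as f64;
    println!("average backlog per request: {avg}"); // prints 1.5
}
```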

Reason

We would like more visibility into how saturated the block and network backends are.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following the Developer Certificate of Origin and signing off your commits, please check CONTRIBUTING.md.

PR Checklist

  • [ ] If a specific issue led to this PR, this PR closes the issue.
  • [x] The description of changes is clear and encompassing.
  • [x] Any required documentation changes (code and docs) are included in this PR.
  • [ ] API changes follow the Runbook for Firecracker API changes.
  • [x] User-facing changes are mentioned in CHANGELOG.md.
  • [x] All added/changed functionality is tested.
  • [ ] New TODOs link to an issue.
  • [x] Commits meet contribution quality standards.
  • [x] This functionality cannot be added in rust-vmm.

codecov bot commented Mar 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.65%. Comparing base (8d09bae) to head (d2131e3).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4486      +/-   ##
==========================================
+ Coverage   81.61%   81.65%   +0.03%     
==========================================
  Files         243      243              
  Lines       29794    29825      +31     
==========================================
+ Hits        24316    24353      +37     
+ Misses       5478     5472       -6     
Flag              Coverage              Δ
4.14-c5n.metal    78.97% <100.00%>      (?)
4.14-c7g.metal    ?
4.14-m5d.metal    ?
4.14-m5n.metal    78.96% <100.00%>      (?)
4.14-m6a.metal    78.10% <100.00%>      (+0.02%) ⬆️
4.14-m6g.metal    77.05% <100.00%>      (+0.02%) ⬆️
4.14-m6i.metal    78.95% <100.00%>      (+0.02%) ⬆️
4.14-m7g.metal    77.05% <100.00%>      (?)
5.10-c5n.metal    81.62% <100.00%>      (?)
5.10-c7g.metal    ?
5.10-m5d.metal    ?
5.10-m5n.metal    81.61% <100.00%>      (?)
5.10-m6a.metal    80.85% <100.00%>      (+0.02%) ⬆️
5.10-m6g.metal    79.94% <100.00%>      (+0.02%) ⬆️
5.10-m6i.metal    81.61% <100.00%>      (+0.02%) ⬆️
5.10-m7g.metal    79.94% <100.00%>      (?)
6.1-c5n.metal     81.62% <100.00%>      (?)
6.1-m5n.metal     81.61% <100.00%>      (?)
6.1-m6a.metal     80.85% <100.00%>      (+0.02%) ⬆️
6.1-m6g.metal     79.94% <100.00%>      (+0.02%) ⬆️
6.1-m6i.metal     81.61% <100.00%>      (+0.02%) ⬆️
6.1-m7g.metal     79.94% <100.00%>      (?)

Flags with carried forward coverage won't be shown.


@kalyazin kalyazin marked this pull request as ready for review March 6, 2024 12:32
@kalyazin kalyazin self-assigned this Mar 6, 2024
@kalyazin kalyazin added the Status: Awaiting review label Mar 6, 2024
Contributor

@bchalios bchalios left a comment

Generally LGTM.

I would like it if there were more information in the commit messages about what each commit does and, mainly, why. Right now, the descriptions make sense when reading the PR description, but by themselves they are a bit cryptic.

Other than that, right now the new latency metrics only measure the backend. I think it would make sense to also measure the time we spend in Firecracker handling each request. Ideally, this would give us a sense of the "overhead" of Firecracker's processing and could point to low-hanging-fruit optimizations in the short term, as well as protect us from doing stupid things in the long term. WDYT?
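For illustration only (a hypothetical sketch of this suggestion, not code from the PR; the variable names `total_agg_us` and `backend_agg_us` are invented), the "overhead" could be derived by also timing the full request handling and subtracting the backend time:

```rust
use std::time::Instant;

// Hypothetical sketch: time the whole request handling as well, so that
// total_agg_us - backend_agg_us approximates Firecracker's own overhead.
fn handle_request(total_agg_us: &mut u64, backend_agg_us: &mut u64) {
    let total_start = Instant::now();

    // ... parse the virtio descriptor chain ...

    let backend_start = Instant::now();
    // ... blocking backend read/write would go here ...
    *backend_agg_us += backend_start.elapsed().as_micros() as u64;

    // ... complete the request and signal the guest ...

    *total_agg_us += total_start.elapsed().as_micros() as u64;
}
```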

@wearyzen
Copy link
Contributor

wearyzen commented Mar 6, 2024

> The queue backlog metrics show how many requests are still waiting to be processed by the VMM thread. They are updated during the processing of every request,

remaining_reqs_count and tx_remaining_reqs_count could be a bit confusing, since they are reset every time we flush. So technically there could be requests still pending in the queue, but the metrics won't report them because they have already been reported. I think it's fine the way they are right now, but we should explicitly call out that these 2 metrics being 0 does not mean that there are no requests in the queue. This could be confusing if we are debugging a crash where the last metrics showed that there were no requests, but there actually were.

> so dividing them by the total number of requests gives the average queue length per request processed.

Can we document it in the CHANGELOG or in code, saying to divide by the metric queue_event_count to get the average queue length?
Maybe it's worth adding a small section in https://github.com./firecracker-microvm/firecracker/blob/main/docs/metrics.md to highlight how these metrics can help with analysing saturation.
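A minimal sketch of the reset-on-flush behaviour described here (assumed counter shape, not Firecracker's actual metric implementation): a flushed value of 0 only means nothing was added since the last flush, not that the queue is empty.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch of a flush-resetting counter: `flush` reports only the delta
// accumulated since the previous flush.
#[derive(Default)]
struct IncMetric {
    value: AtomicU64,
    last_flushed: AtomicU64,
}

impl IncMetric {
    fn add(&self, v: u64) {
        self.value.fetch_add(v, Ordering::Relaxed);
    }
    /// Returns the increase since the last flush and records the new baseline.
    fn flush(&self) -> u64 {
        let now = self.value.load(Ordering::Relaxed);
        let prev = self.last_flushed.swap(now, Ordering::Relaxed);
        now - prev
    }
}

fn main() {
    let remaining_reqs_count = IncMetric::default();
    remaining_reqs_count.add(5); // 5 requests were pending at some point
    assert_eq!(remaining_reqs_count.flush(), 5); // reported once...
    // ...so a later flush reports 0 even if requests are still queued:
    assert_eq!(remaining_reqs_count.flush(), 0);
}
```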

This is needed to collect metrics depending on
the number of remaining requests in the queue.

Signed-off-by: Nikita Kalyazin <[email protected]>
@kalyazin kalyazin force-pushed the io_metrics branch 3 times, most recently from 5031cf9 to d085a91 Compare March 7, 2024 10:26
@kalyazin
Contributor Author

kalyazin commented Mar 7, 2024

> The queue backlog metrics show how many requests are still waiting to be processed by the VMM thread. They are updated during the processing of every request,

> remaining_reqs_count and tx_remaining_reqs_count could be a bit confusing, since they are reset every time we flush. So technically there could be requests still pending in the queue, but the metrics won't report them because they have already been reported. I think it's fine the way they are right now, but we should explicitly call out that these 2 metrics being 0 does not mean that there are no requests in the queue. This could be confusing if we are debugging a crash where the last metrics showed that there were no requests, but there actually were.

> so dividing them by the total number of requests gives the average queue length per request processed.

> Can we document it in the CHANGELOG or in code, saying to divide by the metric queue_event_count to get the average queue length? Maybe it's worth adding a small section in https://github.com./firecracker-microvm/firecracker/blob/main/docs/metrics.md to highlight how these metrics can help with analysing saturation.

Added a note in the CHANGELOG, but not in the docs for now.

@kalyazin
Contributor Author

kalyazin commented Mar 7, 2024

> Generally LGTM.

> I would like it if there were more information in the commit messages about what each commit does and, mainly, why. Right now, the descriptions make sense when reading the PR description, but by themselves they are a bit cryptic.

> Other than that, right now the new latency metrics only measure the backend. I think it would make sense to also measure the time we spend in Firecracker handling each request. Ideally, this would give us a sense of the "overhead" of Firecracker's processing and could point to low-hanging-fruit optimizations in the short term, as well as protect us from doing stupid things in the long term. WDYT?

Added more info in the commit messages. I agree, it would be useful to have a pie chart of VMM time, split between idling, being busy, and being blocked on lower layers. That could be the next step.

kalyazin added 7 commits March 7, 2024 12:16
This is for it to be available in block and net modules.

Signed-off-by: Nikita Kalyazin <[email protected]>
LatencyAggregateMetrics are dictionaries
rather than ints, so they require special treatment.

Signed-off-by: Nikita Kalyazin <[email protected]>
These metrics measure how long the Firecracker VMM thread
is blocked on the read/write syscalls when accessing
data in the drive file.
By looking at them, we will be able to see what portion
of block latency is attributed to factors external
to Firecracker.

Signed-off-by: Nikita Kalyazin <[email protected]>
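Roughly, the measurement could look like the following sketch (the `timed_read` helper and `read_agg_us` accumulator are hypothetical; the actual patch records into the new `read_agg` LatencyAggregateMetrics through Firecracker's metrics system):

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
use std::time::Instant;

// Sketch: time how long the VMM thread is blocked in the backend read
// syscall, then accumulate the latency (here summed into microseconds).
fn timed_read(
    file: &mut File,
    offset: u64,
    buf: &mut [u8],
    read_agg_us: &mut u64,
) -> std::io::Result<usize> {
    file.seek(SeekFrom::Start(offset))?;
    let start = Instant::now();
    let n = file.read(buf)?; // the blocking call being measured
    *read_agg_us += start.elapsed().as_micros() as u64;
    Ok(n)
}
```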
This metric accumulates the number of pending
requests sitting in the virtio queue while a request
is being processed.
By looking at this number, one may decide
whether an async IO solution would be beneficial.

Signed-off-by: Nikita Kalyazin <[email protected]>
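A sketch of that accumulation inside the queue drain loop (`QueueState` and `process_queue` are simplified illustrative types; the real virtio queue and metric plumbing differ):

```rust
// Sketch of the per-request accounting in the virtio queue drain loop.
struct QueueState {
    pending: Vec<u32>, // descriptor heads still to be processed (simplified)
}

impl QueueState {
    fn len(&self) -> usize {
        self.pending.len()
    }
}

fn process_queue(queue: &mut QueueState, remaining_reqs_count: &mut u64) {
    while let Some(_desc) = queue.pending.pop() {
        // ... handle the request ...
        // Accumulate how many requests are still waiting after this one.
        *remaining_reqs_count += queue.len() as u64;
    }
}
```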
This metric measures how long the Firecracker VMM thread
is blocked on the write syscall when accessing a tap device.
By looking at it, we will be able to see what portion
of Tx net latency is attributed to factors external
to Firecracker.

Signed-off-by: Nikita Kalyazin <[email protected]>
This metric accumulates the number of pending
requests sitting in the Tx virtio queue while a request
is being processed.
This may give an insight into how sensible it would
be to process ingress traffic in parallel.

Signed-off-by: Nikita Kalyazin <[email protected]>
 - block/net file/tap access latencies
 - block/net virtio queue backlog lengths

Signed-off-by: Nikita Kalyazin <[email protected]>
Contributor

@wearyzen wearyzen left a comment

LGTM, thanks for adding latency_agg_metrics in the fcmetrics.py tests as well!

@kalyazin kalyazin merged commit 448df60 into firecracker-microvm:main Mar 7, 2024
@kalyazin kalyazin deleted the io_metrics branch March 7, 2024 17:30
@kalyazin kalyazin mentioned this pull request Mar 8, 2024