feat(metrics): add IO queue len and backend latency metrics #4486

Merged

kalyazin merged 9 commits into firecracker-microvm:main from io_metrics on Mar 7, 2024

Conversation

kalyazin
Contributor

@kalyazin kalyazin commented Mar 5, 2024

Changes

This adds the following accumulated metrics:

  • block IO: read and write backend latencies (read_agg, write_agg)
  • block IO: length of the queue backlog (remaining_reqs_count)
  • net: tap write backend latencies (tap_agg)
  • net: length of the Tx queue backlog (tx_remaining_reqs_count)

The latency metrics show how long the VMM thread spends blocked while requests are processed by the block/net backend. The queue backlog metrics show how many requests are still waiting to be processed by the VMM thread. They are updated during the processing of every request, so dividing them by the total number of requests gives the average queue length per request processed.
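To make that arithmetic concrete, here is a minimal Rust sketch of the accounting (the `Counter` type is hypothetical; `remaining_reqs_count` is the new metric from this PR, and `queue_event_count` is the existing per-queue event counter that the review below suggests dividing by):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Minimal stand-in for Firecracker's shared counter metrics (sketch only;
// the real types live in the vmm crate and have a different shape).
#[derive(Default)]
struct Counter(AtomicU64);

impl Counter {
    fn add(&self, v: u64) {
        self.0.fetch_add(v, Ordering::Relaxed);
    }
    fn count(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

fn main() {
    let remaining_reqs_count = Counter::default(); // accumulated backlog
    let queue_event_count = Counter::default(); // requests processed

    // While draining the virtio queue, record how many requests are still
    // pending each time one request finishes processing.
    for pending in [3u64, 2, 1, 0] {
        remaining_reqs_count.add(pending);
        queue_event_count.add(1);
    }

    // Average queue length per request processed, as described above.
    let avg = remaining_reqs_count.count() as f64 / queue_event_count.count() as f64;
    println!("average backlog per request: {avg}"); // prints 1.5
}
```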

Reason

We would like more visibility into how saturated the block and network backends are.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following the Developer Certificate of Origin and signing off your commits, please check CONTRIBUTING.md.

PR Checklist

  • [ ] If a specific issue led to this PR, this PR closes the issue.
  • [x] The description of changes is clear and encompassing.
  • [x] Any required documentation changes (code and docs) are included in this PR.
  • [ ] API changes follow the Runbook for Firecracker API changes.
  • [x] User-facing changes are mentioned in CHANGELOG.md.
  • [x] All added/changed functionality is tested.
  • [ ] New TODOs link to an issue.
  • [x] Commits meet contribution quality standards.
  • [x] This functionality cannot be added in rust-vmm.

codecov bot commented Mar 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.65%. Comparing base (8d09bae) to head (d2131e3).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4486      +/-   ##
==========================================
+ Coverage   81.61%   81.65%   +0.03%     
==========================================
  Files         243      243              
  Lines       29794    29825      +31     
==========================================
+ Hits        24316    24353      +37     
+ Misses       5478     5472       -6     
Flag              Coverage              Δ
4.14-c5n.metal    78.97% <100.00%>      (?)
4.14-c7g.metal    ?
4.14-m5d.metal    ?
4.14-m5n.metal    78.96% <100.00%>      (?)
4.14-m6a.metal    78.10% <100.00%>      (+0.02%) ⬆️
4.14-m6g.metal    77.05% <100.00%>      (+0.02%) ⬆️
4.14-m6i.metal    78.95% <100.00%>      (+0.02%) ⬆️
4.14-m7g.metal    77.05% <100.00%>      (?)
5.10-c5n.metal    81.62% <100.00%>      (?)
5.10-c7g.metal    ?
5.10-m5d.metal    ?
5.10-m5n.metal    81.61% <100.00%>      (?)
5.10-m6a.metal    80.85% <100.00%>      (+0.02%) ⬆️
5.10-m6g.metal    79.94% <100.00%>      (+0.02%) ⬆️
5.10-m6i.metal    81.61% <100.00%>      (+0.02%) ⬆️
5.10-m7g.metal    79.94% <100.00%>      (?)
6.1-c5n.metal     81.62% <100.00%>      (?)
6.1-m5n.metal     81.61% <100.00%>      (?)
6.1-m6a.metal     80.85% <100.00%>      (+0.02%) ⬆️
6.1-m6g.metal     79.94% <100.00%>      (+0.02%) ⬆️
6.1-m6i.metal     81.61% <100.00%>      (+0.02%) ⬆️
6.1-m7g.metal     79.94% <100.00%>      (?)

Flags with carried forward coverage won't be shown.


@kalyazin kalyazin marked this pull request as ready for review March 6, 2024 12:32
@kalyazin kalyazin self-assigned this Mar 6, 2024
@kalyazin kalyazin added the Status: Awaiting review label Mar 6, 2024
Contributor

@bchalios bchalios left a comment

Generally LGTM.

I would like it if there were more information in the commit messages about what each commit does and, mainly, why. Right now, the descriptions make sense when reading the PR description, but by themselves they are a bit cryptic.

Other than that, right now the new latency metrics only measure the backend. I think it would make sense to also measure the time we spend in Firecracker handling each request. Ideally, this would give us a sense of the "overhead" of Firecracker's processing and could point to low-hanging-fruit optimizations in the short term, as well as protect us from doing stupid things in the long term. WDYT?
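For illustration only (a hypothetical sketch of this suggestion, not code from the PR; the variable names `total_agg_us` and `backend_agg_us` are invented), the "overhead" could be derived by also timing the full request handling and subtracting the backend time:

```rust
use std::time::Instant;

// Hypothetical sketch: time the whole request handling as well, so that
// total_agg_us - backend_agg_us approximates Firecracker's own overhead.
fn handle_request(total_agg_us: &mut u64, backend_agg_us: &mut u64) {
    let total_start = Instant::now();

    // ... parse the virtio descriptor chain ...

    let backend_start = Instant::now();
    // ... blocking backend read/write would go here ...
    *backend_agg_us += backend_start.elapsed().as_micros() as u64;

    // ... complete the request and signal the guest ...

    *total_agg_us += total_start.elapsed().as_micros() as u64;
}
```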

@wearyzen
Copy link
Contributor

wearyzen commented Mar 6, 2024

> The queue backlog metrics show how many requests are still waiting to be processed by the VMM thread. They are updated during the processing of every request,

remaining_reqs_count and tx_remaining_reqs_count could be a bit confusing, since they are reset every time we flush. So technically there could be requests still pending in the queue, but the metrics won't report them because they have already been reported. I think it's fine the way they are right now, but we should explicitly call out that these 2 metrics being 0 does not mean that there are no requests in the queue. This could be confusing if we are debugging a crash where the last metrics showed that there were no requests, but there actually were.

> so dividing them by the total number of requests gives the average queue length per request processed.

Can we document it in the CHANGELOG or in code, saying to divide by the metric queue_event_count to get the average queue length?
Maybe it's worth adding a small section in https://github.com./firecracker-microvm/firecracker/blob/main/docs/metrics.md to highlight how these metrics can help with analysing saturation.
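A minimal sketch of the reset-on-flush behaviour described here (assumed counter shape, not Firecracker's actual metric implementation): a flushed value of 0 only means nothing was added since the last flush, not that the queue is empty.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch of a flush-resetting counter: `flush` reports only the delta
// accumulated since the previous flush.
#[derive(Default)]
struct IncMetric {
    value: AtomicU64,
    last_flushed: AtomicU64,
}

impl IncMetric {
    fn add(&self, v: u64) {
        self.value.fetch_add(v, Ordering::Relaxed);
    }
    /// Returns the increase since the last flush and records the new baseline.
    fn flush(&self) -> u64 {
        let now = self.value.load(Ordering::Relaxed);
        let prev = self.last_flushed.swap(now, Ordering::Relaxed);
        now - prev
    }
}

fn main() {
    let remaining_reqs_count = IncMetric::default();
    remaining_reqs_count.add(5); // 5 requests were pending at some point
    assert_eq!(remaining_reqs_count.flush(), 5); // reported once...
    // ...so a later flush reports 0 even if requests are still queued:
    assert_eq!(remaining_reqs_count.flush(), 0);
}
```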

This is needed to collect metrics depending on
the number of remaining requests in the queue.

Signed-off-by: Nikita Kalyazin <[email protected]>
@kalyazin kalyazin force-pushed the io_metrics branch 3 times, most recently from 5031cf9 to d085a91 Compare March 7, 2024 10:26
@kalyazin
Contributor Author

kalyazin commented Mar 7, 2024

> The queue backlog metrics show how many requests are still waiting to be processed by the VMM thread. They are updated during the processing of every request,

> remaining_reqs_count and tx_remaining_reqs_count could be a bit confusing, since they are reset every time we flush. So technically there could be requests still pending in the queue, but the metrics won't report them because they have already been reported. I think it's fine the way they are right now, but we should explicitly call out that these 2 metrics being 0 does not mean that there are no requests in the queue. This could be confusing if we are debugging a crash where the last metrics showed that there were no requests, but there actually were.

> so dividing them by the total number of requests gives the average queue length per request processed.

> Can we document it in the CHANGELOG or in code, saying to divide by the metric queue_event_count to get the average queue length? Maybe it's worth adding a small section in https://github.com./firecracker-microvm/firecracker/blob/main/docs/metrics.md to highlight how these metrics can help with analysing saturation.

Added a note in the CHANGELOG, but not in the docs for now.

@kalyazin
Contributor Author

kalyazin commented Mar 7, 2024

> Generally LGTM.

> I would like it if there were more information in the commit messages about what each commit does and, mainly, why. Right now, the descriptions make sense when reading the PR description, but by themselves they are a bit cryptic.

> Other than that, right now the new latency metrics only measure the backend. I think it would make sense to also measure the time we spend in Firecracker handling each request. Ideally, this would give us a sense of the "overhead" of Firecracker's processing and could point to low-hanging-fruit optimizations in the short term, as well as protect us from doing stupid things in the long term. WDYT?

Added more info in the commit messages. I agree, it would be useful to have a pie chart of VMM time, split between idling, being busy, and being blocked on lower layers. That could be the next step.

kalyazin added 7 commits March 7, 2024 12:16
This is for it to be available in block and net modules.

Signed-off-by: Nikita Kalyazin <[email protected]>
LatencyAggregateMetrics are dictionaries
rather than ints, so they require special treatment.

Signed-off-by: Nikita Kalyazin <[email protected]>
These metrics measure how long the Firecracker VMM thread
is blocked on the read/write syscalls when accessing
data in the drive file.
By looking at them, we will be able to see what portion
of block latency is attributed to factors external
to Firecracker.

Signed-off-by: Nikita Kalyazin <[email protected]>
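Roughly, the measurement could look like the following sketch (the `timed_read` helper and `read_agg_us` accumulator are hypothetical; the actual patch records into the new `read_agg` LatencyAggregateMetrics through Firecracker's metrics system):

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
use std::time::Instant;

// Sketch: time how long the VMM thread is blocked in the backend read
// syscall, then accumulate the latency (here summed into microseconds).
fn timed_read(
    file: &mut File,
    offset: u64,
    buf: &mut [u8],
    read_agg_us: &mut u64,
) -> std::io::Result<usize> {
    file.seek(SeekFrom::Start(offset))?;
    let start = Instant::now();
    let n = file.read(buf)?; // the blocking call being measured
    *read_agg_us += start.elapsed().as_micros() as u64;
    Ok(n)
}
```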
This metric accumulates the number of pending
requests sitting in the virtio queue while a request
is being processed.
By looking at this number, one may decide
whether an async IO solution would be beneficial.

Signed-off-by: Nikita Kalyazin <[email protected]>
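A sketch of that accumulation inside the queue drain loop (`QueueState` and `process_queue` are simplified illustrative types; the real virtio queue and metric plumbing differ):

```rust
// Sketch of the per-request accounting in the virtio queue drain loop.
struct QueueState {
    pending: Vec<u32>, // descriptor heads still to be processed (simplified)
}

impl QueueState {
    fn len(&self) -> usize {
        self.pending.len()
    }
}

fn process_queue(queue: &mut QueueState, remaining_reqs_count: &mut u64) {
    while let Some(_desc) = queue.pending.pop() {
        // ... handle the request ...
        // Accumulate how many requests are still waiting after this one.
        *remaining_reqs_count += queue.len() as u64;
    }
}
```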
This metric measures how long the Firecracker VMM thread
is blocked on the write syscall when accessing a tap device.
By looking at it, we will be able to see what portion
of Tx net latency is attributed to factors external
to Firecracker.

Signed-off-by: Nikita Kalyazin <[email protected]>
This metric accumulates the number of pending
requests sitting in the Tx virtio queue while a request
is being processed.
This may give an insight into how sensible it would
be to process ingress traffic in parallel.

Signed-off-by: Nikita Kalyazin <[email protected]>
 - block/net file/tap access latencies
 - block/net virtio queue backlog lengths

Signed-off-by: Nikita Kalyazin <[email protected]>
Contributor

@wearyzen wearyzen left a comment

LGTM, thanks for adding latency_agg_metrics in the fcmetrics.py tests as well!

@kalyazin kalyazin merged commit 448df60 into firecracker-microvm:main Mar 7, 2024
@kalyazin kalyazin deleted the io_metrics branch March 7, 2024 17:30
@kalyazin kalyazin mentioned this pull request Mar 8, 2024