
[8.x] [ML] Inference duration and error metrics (#115876) #118700

Merged
merged 2 commits into elastic:8.x from backport/8.x/pr-115876 on Dec 13, 2024

Conversation

jonathan-buttner
Contributor

@jonathan-buttner jonathan-buttner commented Dec 13, 2024

Backport

This will backport the following commits from main to 8.x:

- [ML] Inference duration and error metrics (#115876)

I'm backporting this because the unified backport Max and I are working on needs it: #118506
The CI failures here are unrelated to Pat's PR, but as soon as those are fixed you get a bunch of errors like this:

> Task :x-pack:plugin:inference:compileJava
/proj/es/elasticsearch/x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/action/BaseTransportInferenceAction.java:35: error: cannot find symbol
import org.elasticsearch.xpack.inference.telemetry.InferenceTimer;

Questions?

Please refer to the Backport tool documentation

Add `es.inference.requests.time` metric around `infer` API.

As recommended by the OTel spec, errors are indicated by the
presence or absence of the `error.type` attribute on the metric.
`error.type` is set to the HTTP status code (as a string) when one is
available, otherwise to the exception's class name (e.g.
NullPointerException).
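
Below is a minimal, self-contained Java sketch of that convention (illustrative only; the class, method, and attribute names such as `service`, `task_type`, and `status_code` are hypothetical, not the actual Elasticsearch telemetry code). The attribute map carries `error.type` only on failure, using the HTTP status code when one is known and the exception class name otherwise:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating the error.type convention described above.
final class InferenceMetricAttributes {

    private InferenceMetricAttributes() {}

    /**
     * Builds the attribute map recorded with the duration metric.
     * On success the map contains no error.type key; on failure error.type
     * is the HTTP status code (as a string) if known, otherwise the
     * exception's simple class name.
     */
    static Map<String, Object> forResult(String service, String taskType,
                                         Integer httpStatus, Throwable failure) {
        Map<String, Object> attributes = new HashMap<>();
        attributes.put("service", service);
        attributes.put("task_type", taskType);

        if (failure != null) {
            String errorType = httpStatus != null
                ? Integer.toString(httpStatus)
                : failure.getClass().getSimpleName();
            attributes.put("error.type", errorType);
            if (httpStatus != null) {
                attributes.put("status_code", httpStatus);
            }
        }
        return attributes;
    }

    public static void main(String[] args) {
        // Success: no error.type attribute at all.
        System.out.println(forResult("openai", "completion", null, null));
        // Failure with an HTTP status: error.type is "429".
        System.out.println(forResult("openai", "completion", 429, new RuntimeException("throttled")));
        // Failure without a status: error.type is the exception class name.
        System.out.println(forResult("openai", "completion", null, new NullPointerException()));
    }
}
```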

Additional notes:
- ApmInferenceStats is merged into InferenceStats. Originally we planned
  to have multiple implementations, but now we're only using APM.
- Request count is now always recorded, even when there are failures
  loading the endpoint configuration.
- Added a hook in streaming for cancel messages, so we can close the
  metrics when a user cancels the stream (a sketch follows below).

(cherry picked from commit 26870ef)
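
The cancel hook mentioned in the notes is about closing out metrics when the client drops a stream. As a rough, hypothetical illustration using `java.util.concurrent.Flow` (not the code from this PR), a publisher can be wrapped so a callback runs when the downstream subscription is cancelled, which is the point where a duration metric could be recorded:

```java
import java.util.concurrent.Flow;

// Hypothetical wrapper: runs a callback when the downstream subscriber cancels,
// so in-flight metrics (e.g. a request timer) can be closed out.
final class OnCancelPublisher<T> implements Flow.Publisher<T> {

    private final Flow.Publisher<T> delegate;
    private final Runnable onCancel;

    OnCancelPublisher(Flow.Publisher<T> delegate, Runnable onCancel) {
        this.delegate = delegate;
        this.onCancel = onCancel;
    }

    @Override
    public void subscribe(Flow.Subscriber<? super T> subscriber) {
        delegate.subscribe(new Flow.Subscriber<>() {
            @Override
            public void onSubscribe(Flow.Subscription subscription) {
                subscriber.onSubscribe(new Flow.Subscription() {
                    @Override
                    public void request(long n) {
                        subscription.request(n);
                    }

                    @Override
                    public void cancel() {
                        onCancel.run();          // record duration / error attributes here
                        subscription.cancel();
                    }
                });
            }

            @Override
            public void onNext(T item) { subscriber.onNext(item); }

            @Override
            public void onError(Throwable throwable) { subscriber.onError(throwable); }

            @Override
            public void onComplete() { subscriber.onComplete(); }
        });
    }
}
```

Wrapping the response publisher this way lets the same stats fire whether a streaming request completes, fails, or is cancelled midway.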
Contributor

@maxhniebergall maxhniebergall left a comment


LGTM! Thanks for fixing the switch

@maxhniebergall maxhniebergall enabled auto-merge (squash) December 13, 2024 22:17
@maxhniebergall maxhniebergall added the :ml Machine learning label Dec 13, 2024
@maxhniebergall maxhniebergall merged commit 7eaf380 into elastic:8.x Dec 13, 2024
15 checks passed
@jonathan-buttner jonathan-buttner deleted the backport/8.x/pr-115876 branch December 16, 2024 13:27
maxhniebergall pushed a commit to maxhniebergall/elasticsearch that referenced this pull request Dec 16, 2024
…stic#118700)

* [ML] Inference duration and error metrics (elastic#115876)

* fixing switch with class issue

---------

Co-authored-by: Pat Whelan <[email protected]>