
[8.x] [ML] Inference duration and error metrics (#115876) #118700

Merged
merged 2 commits into elastic:8.x from backport/8.x/pr-115876 on Dec 13, 2024

Conversation

jonathan-buttner
Contributor

@jonathan-buttner jonathan-buttner commented Dec 13, 2024

Backport

This will backport the following commits from main to 8.x:

- [ML] Inference duration and error metrics (#115876)

I'm backporting this because the unified backport Max and I are working on needs it: #118506
The CI failures here are unrelated to Pat's PR, but as soon as those are fixed you get a bunch of errors like this:

> Task :x-pack:plugin:inference:compileJava
/proj/es/elasticsearch/x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/action/BaseTransportInferenceAction.java:35: error: cannot find symbol
import org.elasticsearch.xpack.inference.telemetry.InferenceTimer;

Questions?

Please refer to the Backport tool documentation

Add `es.inference.requests.time` metric around `infer` API.

As recommended by the OTel spec, errors are indicated by the
presence or absence of the `error.type` attribute on the metric.
`error.type` is set to the HTTP status code (as a string) when one is
available, otherwise to the exception's class name (e.g.
NullPointerException).
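
Below is a minimal, self-contained Java sketch of that convention (illustrative only; the class, method, and attribute names such as `service`, `task_type`, and `status_code` are hypothetical, not the actual Elasticsearch telemetry code). The attribute map carries `error.type` only on failure, using the HTTP status code when one is known and the exception class name otherwise:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating the error.type convention described above.
final class InferenceMetricAttributes {

    private InferenceMetricAttributes() {}

    /**
     * Builds the attribute map recorded with the duration metric.
     * On success the map contains no error.type key; on failure error.type
     * is the HTTP status code (as a string) if known, otherwise the
     * exception's simple class name.
     */
    static Map<String, Object> forResult(String service, String taskType,
                                         Integer httpStatus, Throwable failure) {
        Map<String, Object> attributes = new HashMap<>();
        attributes.put("service", service);
        attributes.put("task_type", taskType);

        if (failure != null) {
            String errorType = httpStatus != null
                ? Integer.toString(httpStatus)
                : failure.getClass().getSimpleName();
            attributes.put("error.type", errorType);
            if (httpStatus != null) {
                attributes.put("status_code", httpStatus);
            }
        }
        return attributes;
    }

    public static void main(String[] args) {
        // Success: no error.type attribute at all.
        System.out.println(forResult("openai", "completion", null, null));
        // Failure with an HTTP status: error.type is "429".
        System.out.println(forResult("openai", "completion", 429, new RuntimeException("throttled")));
        // Failure without a status: error.type is the exception class name.
        System.out.println(forResult("openai", "completion", null, new NullPointerException()));
    }
}
```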

Additional notes:
- ApmInferenceStats is merged into InferenceStats. Originally we planned
  to have multiple implementations, but now we're only using APM.
- Request count is now always recorded, even when there are failures
  loading the endpoint configuration.
- Added a hook in streaming for cancel messages, so we can close the
  metrics when a user cancels the stream (a sketch follows below).

(cherry picked from commit 26870ef)
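
The cancel hook mentioned in the notes is about closing out metrics when the client drops a stream. As a rough, hypothetical illustration using `java.util.concurrent.Flow` (not the code from this PR), a publisher can be wrapped so a callback runs when the downstream subscription is cancelled, which is the point where a duration metric could be recorded:

```java
import java.util.concurrent.Flow;

// Hypothetical wrapper: runs a callback when the downstream subscriber cancels,
// so in-flight metrics (e.g. a request timer) can be closed out.
final class OnCancelPublisher<T> implements Flow.Publisher<T> {

    private final Flow.Publisher<T> delegate;
    private final Runnable onCancel;

    OnCancelPublisher(Flow.Publisher<T> delegate, Runnable onCancel) {
        this.delegate = delegate;
        this.onCancel = onCancel;
    }

    @Override
    public void subscribe(Flow.Subscriber<? super T> subscriber) {
        delegate.subscribe(new Flow.Subscriber<>() {
            @Override
            public void onSubscribe(Flow.Subscription subscription) {
                subscriber.onSubscribe(new Flow.Subscription() {
                    @Override
                    public void request(long n) {
                        subscription.request(n);
                    }

                    @Override
                    public void cancel() {
                        onCancel.run();          // record duration / error attributes here
                        subscription.cancel();
                    }
                });
            }

            @Override
            public void onNext(T item) { subscriber.onNext(item); }

            @Override
            public void onError(Throwable throwable) { subscriber.onError(throwable); }

            @Override
            public void onComplete() { subscriber.onComplete(); }
        });
    }
}
```

Wrapping the response publisher this way lets the same stats fire whether a streaming request completes, fails, or is cancelled midway.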
Contributor

@maxhniebergall maxhniebergall left a comment


LGTM! Thanks for fixing the switch

@maxhniebergall maxhniebergall enabled auto-merge (squash) December 13, 2024 22:17
@maxhniebergall maxhniebergall added the :ml Machine learning label Dec 13, 2024
@maxhniebergall maxhniebergall merged commit 7eaf380 into elastic:8.x Dec 13, 2024
15 checks passed
@jonathan-buttner jonathan-buttner deleted the backport/8.x/pr-115876 branch December 16, 2024 13:27
maxhniebergall pushed a commit to maxhniebergall/elasticsearch that referenced this pull request Dec 16, 2024
…stic#118700)

* [ML] Inference duration and error metrics (elastic#115876)

* fixing switch with class issue

---------

Co-authored-by: Pat Whelan <[email protected]>