Skip to content

Add application err category #2485

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Apr 25, 2025
Merged

Add application err category #2485

merged 12 commits into from
Apr 25, 2025

Conversation

THardy98
Copy link
Contributor

What was changed

Add support for application error category.

Why?

SDK work for benign exceptions

  1. Closes Apply application failure logging and metrics behaviour according to ApplicationErrorCategory #2475

  2. How was this tested:
    Integration tests

  3. Any docs updates needed?
    Likely

@THardy98 THardy98 requested a review from a team as a code owner April 22, 2025 14:42
@THardy98
Copy link
Contributor Author

Don't like the API changes to ApplicationFailure, but figured I'd get something up first for discussion

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems in other parts of the SDK, e.g. parent close policy, we just expose the raw enum from proto. Should we do the same here? It has "unrecognized", automatically gets new values, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't notice that, have changed to use raw proto

Comment on lines 44 to 47
case UNRECOGNIZED:
default:
// Fallback unrecognized or unspecified proto values as UNSPECIFIED
return UNSPECIFIED;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the concept of unrecognized is ideal and different from unspecified (then again though, arguably we should be using raw API proto enums and not a new enumerate here).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using raw API proto enums now

@@ -218,4 +274,43 @@ private static String getMessage(String message, String type, boolean nonRetryab
+ ", nonRetryable="
+ nonRetryable;
}

public static boolean isBenignApplicationFailure(@Nullable Throwable t) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Easier to justify than the Go equivalent I suppose, but I still think this helper may not be needed and may be more confusing than it's worth

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to an internal failure utils file, it's convenient to have imo

Comment on lines 300 to 301
// Handle ActivityFailure, which wraps the actual ApplicationFailure
if (t instanceof io.temporal.failure.ActivityFailure) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to expose this kind of logic to users. A user checking whether an in-workflow-thrown application failure is benign may not want to have it match when an activity raises it. I think if a user wants to check a category for whatever reason, they can.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, moved to an internal util file

@@ -153,7 +154,9 @@ private void completeWorkflow(@Nullable WorkflowExecutionException failure) {
metricsScope.counter(MetricsType.WORKFLOW_CANCELED_COUNTER).inc(1);
} else if (failure != null) {
workflowStateMachines.failWorkflow(failure.getFailure());
metricsScope.counter(MetricsType.WORKFLOW_FAILED_COUNTER).inc(1);
if (!ApplicationFailure.isBenignApplicationFailure(failure)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think you want to swallow activities that fail because of benign exceptions. They may reach a max attempts or something and throw this out. Just because an activity has a benign exception it doesn't want to be in telemetry doesn't mean a workflow doesn't want it in telemetry.

Copy link
Contributor Author

@THardy98 THardy98 Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this not the same pattern in:
temporalio/sdk-go#1925
temporalio/sdk-core#905
Unless I'm mistaken.

What about if a user throws a benign exception in workflow code, not in an activity?

Copy link
Member

@cretz cretz Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was unclear here. This is basically me restating https://github.com./temporalio/sdk-java/pull/2485/files#r2054385164.

What about if a user throws a benign exception in workflow code, not in an activity?

That is fine to skip metrics, but we don't want to skip metrics if the exception in the workflow code is an activity failure with a benign cause. There is a difference between not recording telemetry on benign exceptions and not recording telemetry on an exception that has a benign cause.

Here are some scenarios:

  • Activity throws benign exception - do not record telemetry
  • Workflow throws benign exception - do not record telemetry
  • Activity or child fails with benign exception from workflow POV (thrown from activity and it is out of max attempts or something) - still record telemetry
  • Activity uses a Temporal client that runs a workflow and that workflow throws benign exception - still record telemetry
  • In Python, an activity has a general catch that caught the benign exception, tried to do cleanup, and raised its own exception (which automatically sets the cause as the current catch exception) - still record telemetry

In the last 3 bullets on certain languages, they will be treated as benign and shouldn't be IMO. Basically, change isBenignApplicationFailure to stop checking causes IMO. IMO throwing a benign exception is benign, but receiving a benign exception wrapped in another exception is not benign.

Is this not the same pattern in:
temporalio/sdk-go#1925
temporalio/sdk-core#905
Unless I'm mistaken.

Not exactly because Go stops at the first application error, not every application error no matter the depth. Also, this isn't recursively checking causes to arbitrary depths like Go. Still, looking back on Go, we should only apply it to the current error and not check cause or errors.As IMO. I added a comment at temporalio/sdk-go#1925 (comment). Didn't check Core, but I think the same applies there too.

If you think about how users use benign exceptions (as control flow), they aren't supposed to wrap them. Open to discussion here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't they inherently be wrapped as other failure types though? Could be wrong here, but for example, an ApplicationFailure thrown from an activity, would it not always (or at least in the general case) be wrapped as an ActivityFailure?

Copy link
Contributor Author

@THardy98 THardy98 Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree in that it doesn't seem relevant/useful to check for benign causes at an arbitrary depth, but I though a depth of 0 (immediate) or 1 (thinly wrapped) would be sufficient.

In Java SDK, how would one receive an ApplicationFailure that is not wrapped?

Copy link
Member

@cretz cretz Apr 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't they inherently be wrapped as other failure types though?

No. They are only benign where they are thrown (impl side) and they are usually only wrapped where they are caught (caller side).

Could be wrong here, but for example, an ApplicationFailure thrown from an activity, would it not always (or at least in the general case) be wrapped as an ActivityFailure?

Not on the activity side, only on the workflow caller side. But if an activity fails because of a benign error, even though it is benign on the activity side, if that activity failure bubbles out of a workflow, that is not benign.

I agree in that it doesn't seem relevant/useful to check for benign causes at an arbitrary depth, but I though a depth of 0 (immediate) or 1 (thinly wrapped) would be sufficient.

Even 1 depth treats a benign activity failure as benign on the workflow side which is incorrect IMO (not to mention Go side is arbitrary depth). I just checked Core, Core does it right in that it doesn't recurse.

In Java SDK, how would one receive an ApplicationFailure that is not wrapped?

What do you mean by "one"? If you mean how does our SDK internals receive an unwrapped thrown error, it is where you have changes in ActivityTaskExecutors.java. If you mean how does a workflow caller receive it unwrapped, they don't, that's the point.

An error is only benign where it is thrown impl side, not where it is caught caller side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification, that all sounds reasonable (and saved me some digging :))
I've removed the nested failure checks, only check immediate failures.

Copy link
Member

@cretz cretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Will want to wait for @Quinn-With-Two-Ns's approval here.

import io.temporal.failure.ApplicationFailure;
import javax.annotation.Nullable;

public class FailureUtils {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pedantic, but would add a private constructor here

@@ -51,13 +52,15 @@
* <li>nonRetryable is set to false
* <li>details are set to null
* <li>stack trace is copied from the original exception
* <li>stack category is set to ApplicationErrorCategory.APPLICATION_ERROR_CATEGORY_UNSPECIFIED
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* <li>stack category is set to ApplicationErrorCategory.APPLICATION_ERROR_CATEGORY_UNSPECIFIED
* <li>category is set to ApplicationErrorCategory.APPLICATION_ERROR_CATEGORY_UNSPECIFIED

@@ -21,6 +21,7 @@
package io.temporal.failure;

import com.google.common.base.Strings;
import io.temporal.api.enums.v1.ApplicationErrorCategory;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't use the proto enum in the user facing API, we should declare our own enum in the Java SDK and convert.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry I didn't notice before you were still using the proto enum

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I formerly had this but: #2485 (comment)

It looked like we use raw protos, when do we use each pattern?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All new features should NOT use raw proto, only old features use raw protos

temp.txt Outdated
@@ -0,0 +1,483 @@
diff --git a/temporal-sdk/src/main/java/io/temporal/failure/ApplicationErrorCategory.java b/temporal-sdk/src/main/java/io/temporal/failure/ApplicationErrorCategory.java
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops - rm'd

package io.temporal.failure;

/**
* Mirrors the proto definition for ApplicationErrorCategory. Used to categorize application
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Mirrors the proto definition for ApplicationErrorCategory. Used to categorize application
* Used to categorize application

I don't think we need to mention the proto def. here the docs are for the end user and that isn't really relevant for them

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@THardy98 THardy98 force-pushed the add_application_err_category branch from 4182485 to 823848e Compare April 25, 2025 22:38
@THardy98 THardy98 merged commit 7460156 into master Apr 25, 2025
12 checks passed
@THardy98 THardy98 deleted the add_application_err_category branch April 25, 2025 23:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Apply application failure logging and metrics behaviour according to ApplicationErrorCategory
3 participants