Add application err category #2485

THardy98 · 2025-04-22T14:42:46Z

What was changed

Add support for application error category.

Why?

SDK work for benign exceptions

Closes Apply application failure logging and metrics behaviour according to ApplicationErrorCategory #2475
How was this tested:
Integration tests
Any docs updates needed?
Likely

THardy98 · 2025-04-22T14:43:44Z

Don't like the API changes to ApplicationFailure, but figured I'd get something up first for discussion

cretz · 2025-04-22T15:38:58Z

temporal-sdk/src/main/java/io/temporal/failure/ApplicationErrorCategory.java

It seems in other parts of the SDK, e.g. parent close policy, we just expose the raw enum from proto. Should we do the same here? It has "unrecognized", automatically gets new values, etc.

Didn't notice that, have changed to use raw proto

cretz · 2025-04-22T15:39:33Z

temporal-sdk/src/main/java/io/temporal/failure/ApplicationErrorCategory.java

+      case UNRECOGNIZED:
+      default:
+        // Fallback unrecognized or unspecified proto values as UNSPECIFIED
+        return UNSPECIFIED;


I think the concept of unrecognized is ideal and different from unspecified (then again though, arguably we should be using raw API proto enums and not a new enumerate here).

using raw API proto enums now

cretz · 2025-04-22T15:41:13Z

temporal-sdk/src/main/java/io/temporal/failure/ApplicationFailure.java

@@ -218,4 +274,43 @@ private static String getMessage(String message, String type, boolean nonRetryab
        + ", nonRetryable="
        + nonRetryable;
  }
+
+  public static boolean isBenignApplicationFailure(@Nullable Throwable t) {


Easier to justify than the Go equivalent I suppose, but I still think this helper may not be needed and may be more confusing than it's worth

Moved to an internal failure utils file, it's convenient to have imo

cretz · 2025-04-22T15:41:51Z

temporal-sdk/src/main/java/io/temporal/failure/ApplicationFailure.java

+    // Handle ActivityFailure, which wraps the actual ApplicationFailure
+    if (t instanceof io.temporal.failure.ActivityFailure) {


I don't think we want to expose this kind of logic to users. A user checking whether an in-workflow-thrown application failure is benign may not want to have it match when an activity raises it. I think if a user wants to check a category for whatever reason, they can.

Agree, moved to an internal util file

cretz · 2025-04-22T15:43:24Z

temporal-sdk/src/main/java/io/temporal/internal/replay/ReplayWorkflowExecutor.java

@@ -153,7 +154,9 @@ private void completeWorkflow(@Nullable WorkflowExecutionException failure) {
      metricsScope.counter(MetricsType.WORKFLOW_CANCELED_COUNTER).inc(1);
    } else if (failure != null) {
      workflowStateMachines.failWorkflow(failure.getFailure());
-      metricsScope.counter(MetricsType.WORKFLOW_FAILED_COUNTER).inc(1);
+      if (!ApplicationFailure.isBenignApplicationFailure(failure)) {


I do not think you want to swallow activities that fail because of benign exceptions. They may reach a max attempts or something and throw this out. Just because an activity has a benign exception it doesn't want to be in telemetry doesn't mean a workflow doesn't want it in telemetry.

Is this not the same pattern in:
temporalio/sdk-go#1925
temporalio/sdk-core#905
Unless I'm mistaken.

What about if a user throws a benign exception in workflow code, not in an activity?

Sorry, I was unclear here. This is basically me restating https://github.com./temporalio/sdk-java/pull/2485/files#r2054385164.

What about if a user throws a benign exception in workflow code, not in an activity?

That is fine to skip metrics, but we don't want to skip metrics if the exception in the workflow code is an activity failure with a benign cause. There is a difference between not recording telemetry on benign exceptions and not recording telemetry on an exception that has a benign cause.

Here are some scenarios:

Activity throws benign exception - do not record telemetry

Workflow throws benign exception - do not record telemetry

Activity or child fails with benign exception from workflow POV (thrown from activity and it is out of max attempts or something) - still record telemetry

Activity uses a Temporal client that runs a workflow and that workflow throws benign exception - still record telemetry

In Python, an activity has a general catch that caught the benign exception, tried to do cleanup, and raised its own exception (which automatically sets the cause as the current catch exception) - still record telemetry

In the last 3 bullets on certain languages, they will be treated as benign and shouldn't be IMO. Basically, change isBenignApplicationFailure to stop checking causes IMO. IMO throwing a benign exception is benign, but receiving a benign exception wrapped in another exception is not benign.

Is this not the same pattern in:
temporalio/sdk-go#1925
temporalio/sdk-core#905
Unless I'm mistaken.

Not exactly because Go stops at the first application error, not every application error no matter the depth. Also, this isn't recursively checking causes to arbitrary depths like Go. Still, looking back on Go, we should only apply it to the current error and not check cause or errors.As IMO. I added a comment at temporalio/sdk-go#1925 (comment). Didn't check Core, but I think the same applies there too.

If you think about how users use benign exceptions (as control flow), they aren't supposed to wrap them. Open to discussion here.

Won't they inherently be wrapped as other failure types though? Could be wrong here, but for example, an ApplicationFailure thrown from an activity, would it not always (or at least in the general case) be wrapped as an ActivityFailure?

I agree in that it doesn't seem relevant/useful to check for benign causes at an arbitrary depth, but I though a depth of 0 (immediate) or 1 (thinly wrapped) would be sufficient.

In Java SDK, how would one receive an ApplicationFailure that is not wrapped?

Won't they inherently be wrapped as other failure types though?

No. They are only benign where they are thrown (impl side) and they are usually only wrapped where they are caught (caller side).

Could be wrong here, but for example, an ApplicationFailure thrown from an activity, would it not always (or at least in the general case) be wrapped as an ActivityFailure?

Not on the activity side, only on the workflow caller side. But if an activity fails because of a benign error, even though it is benign on the activity side, if that activity failure bubbles out of a workflow, that is not benign.

I agree in that it doesn't seem relevant/useful to check for benign causes at an arbitrary depth, but I though a depth of 0 (immediate) or 1 (thinly wrapped) would be sufficient.

Even 1 depth treats a benign activity failure as benign on the workflow side which is incorrect IMO (not to mention Go side is arbitrary depth). I just checked Core, Core does it right in that it doesn't recurse.

In Java SDK, how would one receive an ApplicationFailure that is not wrapped?

What do you mean by "one"? If you mean how does our SDK internals receive an unwrapped thrown error, it is where you have changes in ActivityTaskExecutors.java. If you mean how does a workflow caller receive it unwrapped, they don't, that's the point.

An error is only benign where it is thrown impl side, not where it is caught caller side.

Thanks for the clarification, that all sounds reasonable (and saved me some digging :))
I've removed the nested failure checks, only check immediate failures.

cretz

Looks great! Will want to wait for @Quinn-With-Two-Ns's approval here.

cretz · 2025-04-25T18:18:36Z

temporal-sdk/src/main/java/io/temporal/internal/common/FailureUtils.java

+import io.temporal.failure.ApplicationFailure;
+import javax.annotation.Nullable;
+
+public class FailureUtils {


Pedantic, but would add a private constructor here

Quinn-With-Two-Ns · 2025-04-25T18:22:12Z

temporal-sdk/src/main/java/io/temporal/failure/ApplicationFailure.java

@@ -51,13 +52,15 @@
 *   <li>nonRetryable is set to false
 *   <li>details are set to null
 *   <li>stack trace is copied from the original exception
+ *   <li>stack category is set to ApplicationErrorCategory.APPLICATION_ERROR_CATEGORY_UNSPECIFIED


Suggested change

* <li>stack category is set to ApplicationErrorCategory.APPLICATION_ERROR_CATEGORY_UNSPECIFIED

* <li>category is set to ApplicationErrorCategory.APPLICATION_ERROR_CATEGORY_UNSPECIFIED

Quinn-With-Two-Ns · 2025-04-25T18:25:19Z

temporal-sdk/src/main/java/io/temporal/failure/ApplicationFailure.java

@@ -21,6 +21,7 @@
 package io.temporal.failure;

 import com.google.common.base.Strings;
+import io.temporal.api.enums.v1.ApplicationErrorCategory;


We shouldn't use the proto enum in the user facing API, we should declare our own enum in the Java SDK and convert.

sorry I didn't notice before you were still using the proto enum

I formerly had this but: #2485 (comment)

It looked like we use raw protos, when do we use each pattern?

All new features should NOT use raw proto, only old features use raw protos

Quinn-With-Two-Ns · 2025-04-25T21:35:26Z

temp.txt

@@ -0,0 +1,483 @@
+diff --git a/temporal-sdk/src/main/java/io/temporal/failure/ApplicationErrorCategory.java b/temporal-sdk/src/main/java/io/temporal/failure/ApplicationErrorCategory.java


Please remove

whoops - rm'd

Quinn-With-Two-Ns · 2025-04-25T21:36:32Z

temporal-sdk/src/main/java/io/temporal/failure/ApplicationErrorCategory.java

+package io.temporal.failure;
+
+/**
+ * Mirrors the proto definition for ApplicationErrorCategory. Used to categorize application


Suggested change

* Mirrors the proto definition for ApplicationErrorCategory. Used to categorize application

* Used to categorize application

I don't think we need to mention the proto def. here the docs are for the end user and that isn't really relevant for them

THardy98 added 7 commits April 21, 2025 18:35

update protos

4fd8dd9

add ApplicationErrorCategory enum

71c11a4

Add category field to ApplicationFailure

1858f95

add category proto converter logic

4e8a98e

add logging/metrics checks for benign application failures

433c066

fixes, added test for workflow failure metric

6875557

fixes for activity failures, added test for activity failures

29ff5dc

THardy98 requested a review from a team as a code owner April 22, 2025 14:42

cretz reviewed Apr 22, 2025

View reviewed changes

cretz mentioned this pull request Apr 24, 2025

Add application err category temporalio/sdk-go#1925

Merged

THardy98 added 2 commits April 24, 2025 14:14

address PR review

44de349

only check immediate failure

a64f646

cretz approved these changes Apr 25, 2025

View reviewed changes

Quinn-With-Two-Ns reviewed Apr 25, 2025

View reviewed changes

THardy98 requested a review from Quinn-With-Two-Ns April 25, 2025 20:41

define code enum, convert to/from proto representation

d22a2ad

Quinn-With-Two-Ns reviewed Apr 25, 2025

View reviewed changes

Quinn-With-Two-Ns approved these changes Apr 25, 2025

View reviewed changes

cleanup

823848e

THardy98 force-pushed the add_application_err_category branch from 4182485 to 823848e Compare April 25, 2025 22:38

Merge branch 'master' into add_application_err_category

ab0a99a

THardy98 merged commit 7460156 into master Apr 25, 2025
12 checks passed

THardy98 deleted the add_application_err_category branch April 25, 2025 23:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add application err category #2485

Add application err category #2485

THardy98 commented Apr 22, 2025

THardy98 commented Apr 22, 2025

cretz Apr 22, 2025

THardy98 Apr 24, 2025

cretz Apr 22, 2025

THardy98 Apr 24, 2025

cretz Apr 22, 2025

THardy98 Apr 24, 2025

cretz Apr 22, 2025

THardy98 Apr 24, 2025

cretz Apr 22, 2025

THardy98 Apr 24, 2025 •

edited

Loading

cretz Apr 24, 2025 •

edited

Loading

THardy98 Apr 24, 2025

THardy98 Apr 24, 2025 •

edited

Loading

cretz Apr 25, 2025 •

edited

Loading

THardy98 Apr 25, 2025

cretz left a comment

cretz Apr 25, 2025

Quinn-With-Two-Ns Apr 25, 2025

Quinn-With-Two-Ns Apr 25, 2025

Quinn-With-Two-Ns Apr 25, 2025

THardy98 Apr 25, 2025

Quinn-With-Two-Ns Apr 25, 2025

Quinn-With-Two-Ns Apr 25, 2025

THardy98 Apr 25, 2025

Quinn-With-Two-Ns Apr 25, 2025

THardy98 Apr 25, 2025

		// Handle ActivityFailure, which wraps the actual ApplicationFailure
		if (t instanceof io.temporal.failure.ActivityFailure) {

	* <li>stack category is set to ApplicationErrorCategory.APPLICATION_ERROR_CATEGORY_UNSPECIFIED
	* <li>category is set to ApplicationErrorCategory.APPLICATION_ERROR_CATEGORY_UNSPECIFIED

		@@ -0,0 +1,483 @@
		diff --git a/temporal-sdk/src/main/java/io/temporal/failure/ApplicationErrorCategory.java b/temporal-sdk/src/main/java/io/temporal/failure/ApplicationErrorCategory.java

	* Mirrors the proto definition for ApplicationErrorCategory. Used to categorize application
	* Used to categorize application

Add application err category #2485

Add application err category #2485

Conversation

THardy98 commented Apr 22, 2025

What was changed

Why?

THardy98 commented Apr 22, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

THardy98 Apr 24, 2025 • edited Loading

Choose a reason for hiding this comment

cretz Apr 24, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

THardy98 Apr 24, 2025 • edited Loading

Choose a reason for hiding this comment

cretz Apr 25, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cretz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

THardy98 Apr 24, 2025 •

edited

Loading

cretz Apr 24, 2025 •

edited

Loading

THardy98 Apr 24, 2025 •

edited

Loading

cretz Apr 25, 2025 •

edited

Loading