Skip to content

Add application err category #1925

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Apr 25, 2025
Merged

Add application err category #1925

merged 8 commits into from
Apr 25, 2025

Conversation

THardy98
Copy link
Contributor

What was changed

Upgrade api-go to 1.48.0
Added support for ApplicationErrorCategory, allowing users to specify benign application errors.

Why?

Part of the benign exceptions work.

  1. Closes Apply application failure logging and metrics behaviour according to ApplicationErrorCategory #1908

  2. How was this tested:
    Modify existing failure conversion tests.

  3. Any docs updates needed?
    Not sure

@THardy98 THardy98 requested a review from a team as a code owner April 22, 2025 00:47
@THardy98 THardy98 force-pushed the add_application_err_category branch from dd61c43 to 8700e2a Compare April 22, 2025 00:50
@THardy98 THardy98 force-pushed the add_application_err_category branch 6 times, most recently from 9522536 to 49536fa Compare April 22, 2025 03:12
@THardy98 THardy98 force-pushed the add_application_err_category branch from 49536fa to 388c9ba Compare April 22, 2025 03:58
@@ -137,6 +140,7 @@ type (
//
// NOTE: This option is supported by Temporal Server >= v1.24.2 older version will ignore this value.
NextRetryDelay time.Duration
Category ApplicationErrorCategory
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs docs

"Benign failure",
"",
temporal.ApplicationErrorOptions{
Category: internal.ErrorCategoryBenign,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test should not refer to internal types, you need to export ErrorCategoryBenign

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exported/exposed in common error.

@@ -137,6 +140,7 @@ type (
//
// NOTE: This option is supported by Temporal Server >= v1.24.2 older version will ignore this value.
NextRetryDelay time.Duration
Category ApplicationErrorCategory
Copy link
Contributor

@Quinn-With-Two-Ns Quinn-With-Two-Ns Apr 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the release status of temporalio/features#614? Is it GA?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question - I'll ask drew

@@ -119,6 +119,9 @@ Workflow consumers will get an instance of *WorkflowExecutionError. This error w
*/

type (
// Category of the error. Maps to logging/metrics behaviours.
ApplicationErrorCategory string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to be exposed outside of internal

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exposed in common error

@@ -119,6 +119,9 @@ Workflow consumers will get an instance of *WorkflowExecutionError. This error w
*/

type (
// Category of the error. Maps to logging/metrics behaviours.
ApplicationErrorCategory string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrmm, I wonder if we really want this to be a string vs just a number (based on the proto number) and can have a String() on it. A string may make users think it can accept any string literal. Also other enumerates we traditionally do the iota approach. And if/when you do change to integer here, make sure there is an unspecified form representing 0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah not sure why I opted for a string... changed to an int enum

@@ -380,6 +385,11 @@ var (
ErrMissingWorkflowID = errors.New("workflow ID is unset for Nexus operation")
)

const (
// ErrorCategoryBenign indicates an error that is expected under normal operation and should not trigger alerts.
ErrorCategoryBenign ApplicationErrorCategory = "BENIGN"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ErrorCategoryBenign ApplicationErrorCategory = "BENIGN"
ApplicationErrorCategoryBenign ApplicationErrorCategory = "BENIGN"

Need to qualify the const IMO

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

case "":
// Zero value maps to unspecified
return enumspb.APPLICATION_ERROR_CATEGORY_UNSPECIFIED
default:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If/when we change to integer enums we can/should just use the enum given to us in both directions since both Go and proto allow "open-ended" enums.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, removed

}
}

func IsBenignApplicationError(err error) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func IsBenignApplicationError(err error) bool {
func isBenignApplicationError(err error) bool {

No need to expose I assume unless it's a helper we really need to expose, but I don't believe we do. A user can easily check for any value on application error without a helper.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't mean to export this, unexported.

Copy link
Contributor Author

@THardy98 THardy98 Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you want this removed btw? Just do the check inline wherever it's used?
Alternatively, should we put this check under a different name (i.e. shouldRecordFailureMetric or something similar, probably move it to metrics), in case we adopt more conditional logic for failure metrics in the future?

ts.Error(err)
var appErr *temporal.ApplicationError
ts.True(errors.As(err, &appErr))
ts.False(internal.IsBenignApplicationError(err))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to avoid using internal in integration tests. Users can't use it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done - inline boolean check.

@@ -474,19 +474,21 @@ func (wtp *workflowTaskPoller) RespondTaskCompletedWithMetrics(
) (response *workflowservice.RespondWorkflowTaskCompletedResponse, err error) {
metricsHandler := wtp.metricsHandler.WithTags(metrics.WorkflowTags(task.WorkflowType.GetName()))
if taskErr != nil {
wtp.logger.Warn("Failed to process workflow task.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are any of the changes made inside of this method needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, old change I forgot to revert

@@ -119,6 +119,9 @@ Workflow consumers will get an instance of *WorkflowExecutionError. This error w
*/

type (
// Category of the error. Maps to logging/metrics behaviours.
ApplicationErrorCategory string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is ApplicationErrorCategory intended to be an enum?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

@THardy98 THardy98 force-pushed the add_application_err_category branch 3 times, most recently from c914631 to 8f0368d Compare April 24, 2025 17:41
- fully qualified names ApplicationErrorCategory<x>
- export category for external usage...
- make enum int instead of string
Comment on lines 1060 to 1061
var appError *ApplicationError
return errors.As(err, &appError) && appError.Category() == ApplicationErrorCategoryBenign
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should assume causes no matter the depth as benign. Explained at temporalio/sdk-java#2485 (comment).

Suggested change
var appError *ApplicationError
return errors.As(err, &appError) && appError.Category() == ApplicationErrorCategoryBenign
appError, _ := err.(*ApplicationError)
return appError != nil && appError.Category() == ApplicationErrorCategoryBenign

Copy link
Contributor

@Quinn-With-Two-Ns Quinn-With-Two-Ns Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we should only look at the top level error, no error part of the error is inspected at any depth

@@ -137,6 +137,8 @@ type (
//
// NOTE: This option is supported by Temporal Server >= v1.24.2 older version will ignore this value.
NextRetryDelay time.Duration
// Category of the error. Maps to logging/metrics behaviours.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Category of the error. Maps to logging/metrics behaviours.
// Category of the error. Maps to SDK logging/metrics behaviours.

This Category does not effect server side metrics right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the comment then if this is correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

@cretz cretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look great! Would like @Quinn-With-Two-Ns's approval before merge though.

@THardy98 THardy98 merged commit ee78d25 into master Apr 25, 2025
17 checks passed
@THardy98 THardy98 deleted the add_application_err_category branch April 25, 2025 21:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Apply application failure logging and metrics behaviour according to ApplicationErrorCategory
3 participants