Suppress notifications on Sidekiq retries #1570

Closed · jpcody opened this issue Sep 15, 2021 · 15 comments


jpcody commented Sep 15, 2021

Describe the idea

I'd love it if Sidekiq could report only on the first retry.

Why do you think it's beneficial to most of the users

If retries are failing because of errors in application code or extended outages of third-party dependencies, capturing a series of identical errors isn't useful.

Possible implementation

A config.sidekiq.report_on_retries boolean configuration option could be checked in Sentry::Sidekiq::ErrorHandler. In that class, the .call method could return early if config.sidekiq.report_on_retries is false and retry_count is present in the job context.
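
A minimal sketch of what that check might look like (report_on_retries is the proposed, hypothetical option; the actual ErrorHandler internals may differ):

```ruby
# Hypothetical sketch only: report_on_retries does not exist in
# sentry-sidekiq, and the real ErrorHandler internals may differ.
module Sentry
  module Sidekiq
    class ErrorHandler
      def call(ex, context)
        # Sidekiq only adds "retry_count" to the job hash once a job has
        # failed at least once, so its presence marks a retry.
        is_retry = !context.dig(:job, "retry_count").nil?
        return if is_retry && !Sentry.configuration.sidekiq.report_on_retries

        Sentry.capture_exception(ex)
      end
    end
  end
end
```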

st0012 (Collaborator) commented Sep 16, 2021

Since version 4.7.0, sentry-sidekiq has an option called report_after_job_retries. When it's turned on, the SDK will only report the exception after all retries have failed.
Although it reports the "last" retry instead of the "first" like you proposed, I think it should achieve the behavior you want 🙂
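
For reference, enabling the existing option looks like this (the DSN is a placeholder):

```ruby
# Available since sentry-sidekiq 4.7.0: only report once all retries fail.
Sentry.init do |config|
  config.dsn = "https://examplePublicKey@o0.ingest.sentry.io/0" # placeholder
  config.sidekiq.report_after_job_retries = true
end
```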

jpcody (Author) commented Sep 16, 2021

Thanks for responding, @st0012. I should have clarified that I'm aware of report_after_job_retries and think that the inverse behavior is very valuable.

By default, Sidekiq's exponential backoff starts at about a minute and stretches out over ~20 days. If an exception happened over the weekend in a background job, by the time it was addressed on Monday, Sentry might have 15 errors tracked. But if report_after_job_retries were set, a notification wouldn't arrive for ~20 days.
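
As a back-of-the-envelope check on that window, using Sidekiq's documented default delay of (count ** 4) + 15 seconds plus a small random jitter over the default 25 retries:

```ruby
# Rough check of the ~20-day window using Sidekiq's documented default
# backoff, (count ** 4) + 15 seconds plus jitter, over the default 25
# retries (jitter omitted here).
delays = (0...25).map { |count| count**4 + 15 }
puts (delays.sum / 86_400.0).round(1) # => 20.4 (days)
```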

So the sweet spot really seems to me to be "after the first instance, and only the first instance". Per Sidekiq's docs, this is in line with best practices (though I think the 1st is far more valuable than the 3rd/10th):

Use an error service - Honeybadger, Airbrake, Rollbar, BugSnag, Sentry, Exceptiontrap, Raygun, etc. They're all similar in feature sets and pricing but pick one and use it. The error service will send you an email every time there is an exception in a job (Smarter ones like [redacted] will send email on the 1st, 3rd and 10th identical error so your inbox won't be overwhelmed if 1000s of jobs are failing).

st0012 (Collaborator) commented Sep 16, 2021

Before continuing the discussion, let's make sure we're on the same page. Are these the behaviors you expect?

                                            | report_on_retries = true (default) | report_on_retries = false
report_after_job_retries = false (default)  | Report every failure               | Report the first failure
report_after_job_retries = true             | Report the last failure            | Report the first failure

jpcody (Author) commented Sep 16, 2021

If we're on the same page that this behavior is currently unsupported and would be valuable in the future, then we're good to talk about expected behaviors. Your chart might be correct for my prior thinking, but seeing it laid out shows I wasn't thinking thoroughly enough, and I think that scheme would be confusing for users.

I think it's really three states: reporting after the first instance, reporting after every failure, and reporting after the last retry. So maybe it's one of these two strategies:

Optimized for no new config

Add three new values to the existing report_after_job_retries setting, bringing the possible behaviors to:

setting                            | 1st? | Nth? | Last? | Notes
report_after_job_retries = "first" | Y    | N    | N     | New option
report_after_job_retries = "every" | Y    | Y    | Y     | New option
report_after_job_retries = "last"  | N    | N    | Y     | New option
report_after_job_retries = true    | N    | N    | Y     | Backwards-compat
report_after_job_retries = false   | Y    | Y    | Y     | Previous default

Optimized for future flexibility, taking on the cost of a deprecation:

Introduce a new setting intended to replace report_after_job_retries and take precedence over it, with the intent to deprecate the previous configuration, since the new configuration is a superset of it:

setting                             | 1st? | Nth? | Last? | Notes
report_after_job_failures = "first" | Y    | N    | N     | Takes precedence if set; reports after the first failure, before the first retry
report_after_job_failures = "every" | Y    | Y    | Y     | Takes precedence if set; new default
report_after_job_failures = "last"  | N    | N    | Y     | Takes precedence if set
report_after_job_retries = true     | N    | N    | Y     | Deprecated, backwards-compat
report_after_job_retries = false    | Y    | Y    | Y     | Deprecated, previous default
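
To make the second strategy concrete, usage might look like this (report_after_job_failures is purely hypothetical and does not exist in sentry-sidekiq; shown only to illustrate the proposal):

```ruby
# Hypothetical: report_after_job_failures is a proposed option, not a real one.
Sentry.init do |config|
  config.dsn = "https://examplePublicKey@o0.ingest.sentry.io/0" # placeholder
  # Report on the first failure only, before the first retry:
  config.sidekiq.report_after_job_failures = "first"
end
```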

st0012 (Collaborator) commented Sep 26, 2021

Optimized for no new config

I think I like this solution the most.

But this is actually the first time I've received such a request (either for the old SDK or the current one), so I'm not sure the benefit of this change would be worth the effort. For example:

  • Is it really necessary to receive a job failure event if the job later succeeded?
  • If a job is allowed for 20 retries, it must be very error-tolerant. Then why is it critical to know its first failure?

I'd like to see some real-world cases for this feature before committing to it, because we'll likely have to support it for a long time.

jpcody (Author) commented Sep 26, 2021

Thanks for taking the time to circle back on this, Stan.

Is it really necessary to receive a job failure event if the job later succeeded?

We're often dealing with bulk actions and imports that are still important to customers but too expensive for web requests. So we want to get those failure notices as soon as possible.

If a job is allowed for 20 retries, it must be very error-tolerant. Then why is it critical to know its first failure?

I think this is the real question. We use Sidekiq's default retry configuration, which does seem optimized for intermittent availability issues with third-party systems. But our most common failures are in application code, plus the occasional race condition where a job will succeed after a single failure. Perhaps we should consider being more judicious in our retry settings.

From my end, I've monkeypatched Sentry::Sidekiq::ErrorHandler to get the behavior we would like, so we're good to go. I'm 100% fine closing this issue for now and seeing if future onlookers have the same request.

Again, thanks for taking the time to go around on this one. We just turned off a competitor's service and moved over to a paid plan, but I'll likely have one more sentry-rails question coming your way.

popcorn commented Sep 30, 2021

@jpcody - can you share your monkey patch code?

jpcody (Author) commented Sep 30, 2021

@jpcody - can you share your monkey patch code?

Sure thing: https://gist.github.com./jpcody/dc6704c89d5b130d31dae849a6ccb62d
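
For onlookers, a minimal sketch of the idea (the gist itself may differ; this assumes sentry-sidekiq 4.x, where the error handler receives the exception and a context hash containing the job payload):

```ruby
# Sketch only; the linked gist may differ. Assumes sentry-sidekiq 4.x, where
# Sentry::Sidekiq::ErrorHandler#call receives (exception, context) and the
# job payload lives under context[:job].
module SuppressSentryRetries
  def call(ex, context)
    # "retry_count" appears in the job hash only after the first failure,
    # so returning here reports the first failure and suppresses retries.
    return if context.dig(:job, "retry_count")

    super
  end
end

Sentry::Sidekiq::ErrorHandler.prepend(SuppressSentryRetries)
```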

popcorn commented Sep 30, 2021

@jpcody - can you share your monkey patch code?

Sure thing: https://gist.github.com./jpcody/dc6704c89d5b130d31dae849a6ccb62d

Love the integrity checker. Thanks!

popcorn commented Sep 30, 2021

@jpcody - imagine you have an error in your worker. It fails, the error gets sent to Sentry. You fix the issue, but then a different error happens on the retry. This error wouldn't get sent to Sentry.

It's probably a low-probability scenario, but I'm just thinking through the edge cases.

jpcody (Author) commented Sep 30, 2021

That's actually a great point, and I don't consider that low-probability. While I have never once written a class with two different bugs, I hear that other people have 😆

We typically do manual checks of the retry queue, so we would eventually catch this, but it'd sure be nice to be able to disregard the notification only if it's the same error coming from the same line number. If I can come up with a solution to this, I'll post it here.
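
One hedged idea for that refinement, building on the earlier sketch. This assumes the retried job payload still carries the prior failure's error_class (Sidekiq records it when the job enters the retry set), and line-level matching would additionally require sidekiq_options backtrace: true so that error_backtrace is recorded:

```ruby
# Sketch of a same-error check; assumes the retried payload still carries
# "error_class" from the prior failure. "error_backtrace" is only present
# when the worker sets sidekiq_options backtrace: true, so we fall back to
# class-only matching when it's missing.
module SuppressSameErrorRetries
  def call(ex, context)
    job      = context[:job] || {}
    retrying = job.key?("retry_count")
    same_bug = job["error_class"] == ex.class.name &&
               (job["error_backtrace"].nil? ||
                job["error_backtrace"].first == ex.backtrace&.first)
    # Suppress only retries of the exact same bug; a new error still reports.
    return if retrying && same_bug

    super
  end
end

Sentry::Sidekiq::ErrorHandler.prepend(SuppressSameErrorRetries)
```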

popcorn commented Sep 30, 2021

Sidekiq keeps data about the exception for jobs in the retry set. But once it moves a job from retries back to a real queue, I think that data gets dropped.
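
For reference, the retry set is inspectable through Sidekiq's API, and each entry carries the last failure's details:

```ruby
# Inspecting the retry set; entries carry the last failure's details in
# fields like "error_class", "error_message", "retry_count", and "retried_at".
require "sidekiq/api"

Sidekiq::RetrySet.new.each do |job|
  puts "#{job.klass}: #{job['error_class']} - #{job['error_message']} " \
       "(retry ##{job['retry_count']})"
end
```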

st0012 (Collaborator) commented Oct 2, 2021

Some users have also proposed concepts like config.retryable_exceptions (like this comment), which feels like tackling the same issue from a different angle. Do you think that'll be better than just setting a universal retry-able job index?
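
Sketched loosely, the concept might look like this (retryable_exceptions is not a real sentry-sidekiq option; all names here are illustrative):

```ruby
# Hypothetical illustration of the proposed concept; retryable_exceptions is
# not a real sentry-sidekiq option. The idea: suppress reporting while an
# exception the job is expected to retry through is still being retried.
require "timeout"

RETRYABLE = [Timeout::Error, Errno::ECONNRESET] # hypothetical allow-list

def report_to_sentry?(ex, job)
  retrying = job.key?("retry_count")
  # Only suppress retries of exceptions we expect to be transient.
  !(retrying && RETRYABLE.any? { |klass| ex.is_a?(klass) })
end
```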

jpcody (Author) commented Oct 20, 2021

Some users have also proposed concepts like config.retryable_exceptions (like this comment), which feels like tackling the same issue from a different angle. Do you think that'll be better than just setting a universal retry-able job index?

In our case, I don't think that would cover us, because the failure could be any number of exceptions. But we're currently set with the aforementioned monkeypatch, and I'm a-ok to let this lie and let someone else re-open it if it turns out to be a common request.

st0012 (Collaborator) commented Oct 25, 2021

@jpcody I see. I'll close this for now then, thanks for the thoughtful suggestion 👍
