Suppress notifications on Sidekiq retries #1570

Closed · jpcody opened this issue Sep 15, 2021 · 15 comments


jpcody commented Sep 15, 2021

Describe the idea

I'd love it if Sidekiq could report only on the first retry.

Why do you think it's beneficial to most of the users

If retries are failing because of errors in application code or extended outages of third-party dependencies, capturing a series of identical errors isn't useful.

Possible implementation

A config.sidekiq.report_on_retries boolean configuration option could be checked in Sentry::Sidekiq::ErrorHandler. In that class, the .call method could return early if config.sidekiq.report_on_retries is false and retry_count is present in the job context.
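
A minimal sketch of what that check might look like (report_on_retries is the proposed, hypothetical option; the actual ErrorHandler internals may differ):

```ruby
# Hypothetical sketch only: report_on_retries does not exist in
# sentry-sidekiq, and the real ErrorHandler internals may differ.
module Sentry
  module Sidekiq
    class ErrorHandler
      def call(ex, context)
        # Sidekiq only adds "retry_count" to the job hash once a job has
        # failed at least once, so its presence marks a retry.
        is_retry = !context.dig(:job, "retry_count").nil?
        return if is_retry && !Sentry.configuration.sidekiq.report_on_retries

        Sentry.capture_exception(ex)
      end
    end
  end
end
```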

st0012 (Collaborator) commented Sep 16, 2021

Since version 4.7.0, sentry-sidekiq has an option called report_after_job_retries. When it's turned on, the SDK will only report the exception after all retries have failed.
Although it reports the "last" retry instead of the "first" like you proposed, I think it should achieve the behavior you want 🙂
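
For reference, enabling the existing option looks like this (the DSN is a placeholder):

```ruby
# Available since sentry-sidekiq 4.7.0: only report once all retries fail.
Sentry.init do |config|
  config.dsn = "https://examplePublicKey@o0.ingest.sentry.io/0" # placeholder
  config.sidekiq.report_after_job_retries = true
end
```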

jpcody (Author) commented Sep 16, 2021

Thanks for responding, @st0012. I should have clarified that I'm aware of report_after_job_retries and think that the inverse behavior is very valuable.

By default, Sidekiq's exponential backoff starts at about a minute and stretches out over ~20 days. If an exception happened over the weekend in a background job, by the time it was addressed on Monday, Sentry might have 15 errors tracked. But if report_after_job_retries were set, a notification wouldn't arrive for ~20 days.
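
As a back-of-the-envelope check on that window, using Sidekiq's documented default delay of (count ** 4) + 15 seconds plus a small random jitter over the default 25 retries:

```ruby
# Rough check of the ~20-day window using Sidekiq's documented default
# backoff, (count ** 4) + 15 seconds plus jitter, over the default 25
# retries (jitter omitted here).
delays = (0...25).map { |count| count**4 + 15 }
puts (delays.sum / 86_400.0).round(1) # => 20.4 (days)
```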

So the sweet spot really seems to me to be "after the first instance, and only the first instance". Per Sidekiq's docs, this is in line with best practices (though I think the 1st is far more valuable than the 3rd/10th):

Use an error service - Honeybadger, Airbrake, Rollbar, BugSnag, Sentry, Exceptiontrap, Raygun, etc. They're all similar in feature sets and pricing but pick one and use it. The error service will send you an email every time there is an exception in a job (Smarter ones like [redacted] will send email on the 1st, 3rd and 10th identical error so your inbox won't be overwhelmed if 1000s of jobs are failing).

st0012 (Collaborator) commented Sep 16, 2021

Before continuing the discussion, let's make sure we're on the same page. Are these the behaviors you expect?

                                            | report_on_retries = true (default) | report_on_retries = false
report_after_job_retries = false (default)  | Report every failure               | Report the first failure
report_after_job_retries = true             | Report the last failure            | Report the first failure

jpcody (Author) commented Sep 16, 2021

If we're on the same page that this behavior is currently unsupported and would be valuable in the future, then we're good to talk about expected behaviors. Your chart might be correct for my prior thinking, but seeing it laid out shows I wasn't thinking thoroughly enough, and I think that scheme would be confusing for users.

I think it's really three states: reporting after the first instance, reporting after every failure, and reporting after the last retry. So maybe it's one of these two strategies:

Optimized for no new config

Add three new values to the existing report_after_job_retries setting, bringing the possible behaviors to:

setting                            | 1st? | Nth? | Last? | Notes
report_after_job_retries = "first" | Y    | N    | N     | New option
report_after_job_retries = "every" | Y    | Y    | Y     | New option
report_after_job_retries = "last"  | N    | N    | Y     | New option
report_after_job_retries = true    | N    | N    | Y     | Backwards-compat
report_after_job_retries = false   | Y    | Y    | Y     | Previous default

Optimized for future flexibility, taking on the cost of a deprecation:

Introduce a new setting intended to replace report_after_job_retries and take precedence over it, with the intent to deprecate the previous configuration, since the new configuration is a superset of it:

setting                             | 1st? | Nth? | Last? | Notes
report_after_job_failures = "first" | Y    | N    | N     | Takes precedence if set; reports after the first failure, before the first retry
report_after_job_failures = "every" | Y    | Y    | Y     | Takes precedence if set; new default
report_after_job_failures = "last"  | N    | N    | Y     | Takes precedence if set
report_after_job_retries = true     | N    | N    | Y     | Deprecated, backwards-compat
report_after_job_retries = false    | Y    | Y    | Y     | Deprecated, previous default
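
To make the second strategy concrete, usage might look like this (report_after_job_failures is purely hypothetical and does not exist in sentry-sidekiq; shown only to illustrate the proposal):

```ruby
# Hypothetical: report_after_job_failures is a proposed option, not a real one.
Sentry.init do |config|
  config.dsn = "https://examplePublicKey@o0.ingest.sentry.io/0" # placeholder
  # Report on the first failure only, before the first retry:
  config.sidekiq.report_after_job_failures = "first"
end
```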

st0012 (Collaborator) commented Sep 26, 2021

Optimized for no new config

I think I like this solution the most.

But this is actually the first time I've received such a request (either for the old SDK or the current one), so I'm not sure the benefit of this change would be worth the effort. For example:

  • Is it really necessary to receive a job failure event if the job later succeeded?
  • If a job is allowed for 20 retries, it must be very error-tolerant. Then why is it critical to know its first failure?

I'd like to see some real-world cases for this feature before committing to it, because we'll likely have to support it for a long time.

jpcody (Author) commented Sep 26, 2021

Thanks for taking the time to circle back on this, Stan.

Is it really necessary to receive a job failure event if the job later succeeded?

We're often dealing with bulk actions and imports that are still important to customers but too expensive for web requests. So we want to get those failure notices as soon as possible.

If a job is allowed for 20 retries, it must be very error-tolerant. Then why is it critical to know its first failure?

I think this is the real question. We use Sidekiq's default retry configuration, which does seem optimized for intermittent availability issues with third-party systems. But our most common failures are in application code, plus the occasional race condition where a job will succeed after a single failure. Perhaps we should consider being more judicious in our retry settings.

From my end, I've monkeypatched Sentry::Sidekiq::ErrorHandler to get the behavior we would like, so we're good to go. I'm 100% fine closing this issue for now and seeing if future onlookers have the same request.

Again, thanks for taking the time to go around on this one. We just turned off a competitor's service and moved over to a paid plan, but I'll likely have one more sentry-rails question coming your way.

popcorn commented Sep 30, 2021

@jpcody - can you share your monkey patch code?

jpcody (Author) commented Sep 30, 2021

@jpcody - can you share your monkey patch code?

Sure thing: https://gist.github.com./jpcody/dc6704c89d5b130d31dae849a6ccb62d
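
For onlookers, a minimal sketch of the idea (the gist itself may differ; this assumes sentry-sidekiq 4.x, where the error handler receives the exception and a context hash containing the job payload):

```ruby
# Sketch only; the linked gist may differ. Assumes sentry-sidekiq 4.x, where
# Sentry::Sidekiq::ErrorHandler#call receives (exception, context) and the
# job payload lives under context[:job].
module SuppressSentryRetries
  def call(ex, context)
    # "retry_count" appears in the job hash only after the first failure,
    # so returning here reports the first failure and suppresses retries.
    return if context.dig(:job, "retry_count")

    super
  end
end

Sentry::Sidekiq::ErrorHandler.prepend(SuppressSentryRetries)
```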

popcorn commented Sep 30, 2021

@jpcody - can you share your monkey patch code?

Sure thing: https://gist.github.com./jpcody/dc6704c89d5b130d31dae849a6ccb62d

Love the integrity checker. Thanks!

popcorn commented Sep 30, 2021

@jpcody - imagine you have an error in your worker. It fails, the error gets sent to Sentry. You fix the issue, but then a different error happens on the retry. This error wouldn't get sent to Sentry.

It's probably a low-probability scenario, but I'm just thinking through the edge cases.

jpcody (Author) commented Sep 30, 2021

That's actually a great point, and I don't consider that low-probability. While I have never once written a class with two different bugs, I hear that other people have 😆

We typically do manual checks of the retry queue, so we would eventually catch this, but it'd sure be nice to be able to disregard the notification only if it's the same error coming from the same line number. If I can come up with a solution to this, I'll post it here.
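
One hedged idea for that refinement, building on the earlier sketch. This assumes the retried job payload still carries the prior failure's error_class (Sidekiq records it when the job enters the retry set), and line-level matching would additionally require sidekiq_options backtrace: true so that error_backtrace is recorded:

```ruby
# Sketch of a same-error check; assumes the retried payload still carries
# "error_class" from the prior failure. "error_backtrace" is only present
# when the worker sets sidekiq_options backtrace: true, so we fall back to
# class-only matching when it's missing.
module SuppressSameErrorRetries
  def call(ex, context)
    job      = context[:job] || {}
    retrying = job.key?("retry_count")
    same_bug = job["error_class"] == ex.class.name &&
               (job["error_backtrace"].nil? ||
                job["error_backtrace"].first == ex.backtrace&.first)
    # Suppress only retries of the exact same bug; a new error still reports.
    return if retrying && same_bug

    super
  end
end

Sentry::Sidekiq::ErrorHandler.prepend(SuppressSameErrorRetries)
```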

popcorn commented Sep 30, 2021

Sidekiq keeps data about the exception for jobs in the retry set. But once it moves a job from retries back to a real queue, I think that data gets dropped.
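
For reference, the retry set is inspectable through Sidekiq's API, and each entry carries the last failure's details:

```ruby
# Inspecting the retry set; entries carry the last failure's details in
# fields like "error_class", "error_message", "retry_count", and "retried_at".
require "sidekiq/api"

Sidekiq::RetrySet.new.each do |job|
  puts "#{job.klass}: #{job['error_class']} - #{job['error_message']} " \
       "(retry ##{job['retry_count']})"
end
```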

st0012 (Collaborator) commented Oct 2, 2021

Some users have also proposed concepts like config.retryable_exceptions (like this comment), which feels like tackling the same issue from a different angle. Do you think that'll be better than just setting a universal retry-able job index?
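
Sketched loosely, the concept might look like this (retryable_exceptions is not a real sentry-sidekiq option; all names here are illustrative):

```ruby
# Hypothetical illustration of the proposed concept; retryable_exceptions is
# not a real sentry-sidekiq option. The idea: suppress reporting while an
# exception the job is expected to retry through is still being retried.
require "timeout"

RETRYABLE = [Timeout::Error, Errno::ECONNRESET] # hypothetical allow-list

def report_to_sentry?(ex, job)
  retrying = job.key?("retry_count")
  # Only suppress retries of exceptions we expect to be transient.
  !(retrying && RETRYABLE.any? { |klass| ex.is_a?(klass) })
end
```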

jpcody (Author) commented Oct 20, 2021

Some users have also proposed concepts like config.retryable_exceptions (like this comment), which feels like tackling the same issue from a different angle. Do you think that'll be better than just setting a universal retry-able job index?

In our case, I don't think that would cover us, because the failure could be any number of exceptions. But we're currently set with the aforementioned monkeypatch, and I'm a-ok to let this lie and let someone else re-open it if it turns out to be a common request.

st0012 (Collaborator) commented Oct 25, 2021

@jpcody I see. I'll close this for now then, thanks for the thoughtful suggestion 👍
