res_musiconhold: Appropriately lock channel during start. #1104


Open · wants to merge 3 commits into master

Conversation

@jkroonza (Contributor) commented Jan 31, 2025

This relates to #829

This doesn't fully solve the OP's issue, but it solves the specific crash there. Further PRs to follow.

In the specific crash, the generator was still under construction when MOH was being stopped, which then proceeded to close the stream whilst it was still in use.

This relates to asterisk#829

This doesn't fully solve the OP's issue, but it solves the specific crash
there.  Further PRs to follow.

In the specific crash the generator was still under construction when
MOH was being stopped, which then proceeded to close the stream whilst
it was still in use.

Signed-off-by: Jaco Kroon <[email protected]>
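
For readers following along, here is a minimal sketch of the locking pattern the fix applies; it is illustrative, not the literal diff. ast_channel_lock()/ast_channel_unlock(), the ast_channel_music_state() accessors, and ast_calloc() are Asterisk's real APIs, while the function name and control flow are stand-ins:

```c
#include "asterisk.h"
#include "asterisk/channel.h"

/* Illustrative sketch only (assumes it lives inside res_musiconhold.c,
 * where struct moh_files_state is visible): take the channel lock
 * around every touch of the music state so a concurrent MOH stop
 * cannot free or close a stream that is still being constructed. */
static void moh_start_sketch(struct ast_channel *chan)
{
	struct moh_files_state *state;

	ast_channel_lock(chan);
	state = ast_channel_music_state(chan);
	if (!state) {
		state = ast_calloc(1, sizeof(*state));
		ast_channel_music_state_set(chan, state);
	}
	ast_channel_unlock(chan);

	/* Potentially slow generator construction happens outside the
	 * lock; each later access to the state re-takes it. */
}
```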
@jkroonza (Contributor, Author) commented:

cherry-pick-to: 20
cherry-pick-to: 21
cherry-pick-to: 22

asterisk deleted a comment from github-actions bot · Feb 1, 2025
Comment on lines +1765 to +1766:

    ast_channel_lock(chan);
    state = ast_channel_music_state(chan);

Contributor:

Can we just do this (and the unlock) one time at the top of the function? Or do we need to serialize all access to state?

Contributor (Author):

I suspect there are potentially some long-running sub-tasks, which is why I went a bit more granular.

Member:

Long-running sub-tasks like what?

Contributor (Author):

Good question. You know better than me. I didn't dig into this much further beyond this specific PR; life happened and I deviated from this. My gut was that the construction of some of these generators could be time-consuming, and that that could have an adverse impact on the channel structure.

But yeah, if any of those do take overly long, the channel gets no audio anyway, and it would most likely be signalling stuff that just waits for a "long" duration; that's probably OK if those get blocked for a few hundred ms.

Based on that, sure, I think I can resubmit with "lock at top, release at bottom" and we see what happens?
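
For context, the two locking shapes being weighed look roughly like this; both functions are hypothetical stand-ins for the real start path in res_musiconhold.c:

```c
/* (a) "Lock at top, release at bottom": simple, but holds the channel
 * lock across generator construction, so anything else needing the
 * channel lock (e.g. signalling) blocks for that whole duration. */
static void start_coarse(struct ast_channel *chan)
{
	ast_channel_lock(chan);
	/* ... read/update music state, build generator, open stream ... */
	ast_channel_unlock(chan);
}

/* (b) Granular, as in this PR: hold the lock only around each access
 * to the music state, leaving long-running construction unlocked. */
static void start_granular(struct ast_channel *chan)
{
	ast_channel_lock(chan);
	/* ... read the music state ... */
	ast_channel_unlock(chan);

	/* ... long-running construction without the lock ... */

	ast_channel_lock(chan);
	/* ... publish the fully constructed state ... */
	ast_channel_unlock(chan);
}
```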

Contributor:

If memory serves, when this originally came up I did an audit of this file and I think that the channel lock is serializing access to members of the moh_files_state structure, so the way it is here in this PR is probably the safest approach.

It would be helpful to go through and note which functions are being called with the channel lock already held, but that may be out of scope for this PR.

Contributor:

> It would be helpful to go through and note which functions are being called with the channel lock already held, but that may be out of scope for this PR.

I went ahead and did this and pushed an additional commit that you can throw away or merge - up to you.

After your change, the only place in the module where the music state is manipulated without the channel lock held is local_ast_moh_cleanup, which is only called during channel destruction (but ast_moh_cleanup() is currently public API, so anyone could call it).

I think locking around the state manipulation in local_ast_moh_cleanup is technically the right thing, and the channel lock is still available at the time it is called, so it should be safe; I've added another change to this PR that does that.
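
A sketch of what that additional change might look like; the real local_ast_moh_cleanup() in res_musiconhold.c also drops the MOH class reference and frees format buffers, which is elided here:

```c
static void local_ast_moh_cleanup(struct ast_channel *chan)
{
	struct moh_files_state *state;

	/* Detach the music state under the channel lock so no concurrent
	 * reader can observe a half-torn-down state. */
	ast_channel_lock(chan);
	state = ast_channel_music_state(chan);
	ast_channel_music_state_set(chan, NULL);
	ast_channel_unlock(chan);

	if (state) {
		/* ... drop the class reference, free buffers ... */
		ast_free(state);
	}
}
```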

github-actions bot commented Apr 8, 2025

Attention! This pull request may contain issues that could prevent it from being accepted. Please review the checklist below and take the recommended action. If you believe any of these are not applicable, just add a comment and let us know.

  • There is more than 1 commit in this PR and no PR comment with a multiple-commits: special header. Please squash the commits down into 1 or see the Code Contribution documentation for how to add the multiple-commits: header.


github-actions bot added the has-pr-checklist label · Apr 8, 2025

github-actions bot commented Apr 8, 2025

Workflow PRCheck failed
master-ari1: FAILED TEST: rest_api/bridges/error
master-ari2: FAILED TEST: rest_api/websocket_requests/bad_requests
master-ari2: FAILED TEST: rest_api/websocket_requests/parallel_requests
master-ari2: FAILED TEST: rest_api/websocket_requests/single_requests
master-local_iax2_mgr: FAILED TEST: manager/config/basic
master-local_iax2_mgr: FAILED TEST: manager/config/no_preserve_effective_context
master-local_iax2_mgr: FAILED TEST: manager/config/restricted
master-pjs2: FAILED TEST: channels/pjsip/diversion/diversion_caller_id
master-pjs5: FAILED TEST: channels/pjsip/stir_shaken/stir_shaken_anon_callerid

@ashtonmoomoo commented:

Hi everyone. We appreciate the work that is going into resolving #829.

I wanted to share some very coarse numbers with everybody. We have a script (which we unfortunately can't share, since it interfaces with our application, not Asterisk) that can reproduce #829, and we have been running it against various revisions to see how long it takes before reproduction.

20.9.3 (no patches applied)

| Run | Calls before crash |
| --- | --- |
| 1 | 54 |
| 2 | 292 |
| 3 | 208 |
| 4 | 300 |

20.9.3 (with just 4f5ecca)

| Run | Calls before crash |
| --- | --- |
| 1 | 77 |
| 2 | 486 |
| 3 | 233 |
| 4 | 257 |

20.9.3 (with all commits on this branch up to 36f842b)

| Run | Calls before crash |
| --- | --- |
| 1 | 349 |
| 2 | 29 |
| 3 | 567 |
| 4 | 133 |

Obviously more runs would be desirable in all cases, but I am not sure there's value in letting the script run for longer.

So it seems like there is still something going on. If there's any additional diagnostic information that we can share (other than what we shared in #829), please let us know.

@jkroonza (Contributor, Author) commented Apr 9, 2025

The same details I requested in our email discussion would help: the exact code version (ideally a commit hash, or the Asterisk release version) and the list of additional patches applied. You're welcome to email those to me as before if you feel there is sensitive information in them.

Labels: has-pr-checklist, pr-submit-tests-failed