Skip to content

DRIVERS-2477: use w:'majority' in consecutive resume change stream tests #1323

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

baileympearson
Copy link
Contributor

https://jira.mongodb.org/browse/DRIVERS-2477

POC in Node: mongodb/node-mongodb-native#3451

The change stream consecutive resume test (here) fails intermittently in Node because the test inserts documents with write the default write concern instead of majority.

Change events are not reported by the server until they are majority committed. The test inserts multiple documents and attempts to iterate the change stream, expecting change events. If the documents haven't been committed by the time the test attempts to iterate the stream, no change will be reported and the test will fail.

This only impacts the consecutive resume test and does not impact users because the change event will eventually be reported. This test specifically sets up fail commands on getMores, so that the getMore fail and the subsequent aggregate call succeeds and returns a change event in the firstBatch. The race condition is present only because we expect the aggregate to immediately return documents in the firstBatch.

Please complete the following before merging:

  • Update changelog.
  • Make sure there are generated JSON files from the YAML test files.
  • Test changes in at least one language driver.
  • Test these changes against all server versions and topologies (including standalone, replica set, sharded clusters, and serverless).

@baileympearson baileympearson changed the title test(DRIVERS-2477): use w:'majority' in consecutive resume change stream tests DRIVERS-2477: use w:'majority' in consecutive resume change stream tests Oct 18, 2022
@baileympearson baileympearson marked this pull request as ready for review October 18, 2022 19:07
@baileympearson baileympearson requested review from a team as code owners October 18, 2022 19:07
@baileympearson baileympearson requested review from jmikola and benjirewis and removed request for a team October 18, 2022 19:07
Copy link
Contributor

@benjirewis benjirewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your explanation makes sense, but I'm surprised other tests are not failing in Node (sounds like this is somehow not a problem in other drivers). I may have missed something in your description, but do you know why the lack of propagation to multiple nodes' oplogs is only a problem in this test and not in the others for Node?

@@ -854,14 +854,20 @@ tests:
object: *globalCollection0
arguments:
document: { x: 1 }
writeConcern:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we leave a comment above this writeConcern to explain that this is needed to make sure the insert propagates and a change stream event is present on all oplogs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, will do!

@@ -172,14 +172,20 @@ tests:
object: *collection1
arguments:
document: { x: 1 }
writeConcern:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing here.

@baileympearson
Copy link
Contributor Author

baileympearson commented Oct 19, 2022

Your explanation makes sense, but I'm surprised other tests are not failing in Node (sounds like this is somehow not a problem in other drivers). I may have missed something in your description, but do you know why the lack of propagation to multiple nodes' oplogs is only a problem in this test and not in the others for Node?

@benjirewis I do know why this test fails actually. It's because of the failCommand on getMore set to 2 times (> 1 is what matters).

Here's what we expect to happen when we iterateUntilDocumentOrError (and what happens with w: majority):

  • we try a get more and it fails
  • we attempt to resume
  • the aggregate call on resume comes back with a single document (batch size 1 set on change stream)
  • the change stream yields the document

Here's what actually happens

  • we try a get more and it fails
  • we attempt to resume
  • the aggregate call completes before the writes have been committed, so no change is returned in the aggregate call. but the change stream does resume successfully
  • we try a get more again but it fails again
  • change streams only attempt to resume once, so the error is returned and the test fails

You can also get the test to pass by setting the fail command to only fail once, but then we're not testing consecutive resumes anymore 🙂

I misread your comment! I don't know why this only fails in Node but I can look closely at it

@baileympearson
Copy link
Contributor Author

@benjirewis

The root cause of the issue was a bug Node introduced last spring into our change streams (this test did continue to pass for a long while though, it only started failing recently). You can see the fix PR here - mongodb/node-mongodb-native#3453.

Should we close this PR and the associated drivers ticket?

@benjirewis
Copy link
Contributor

Yeah if it was just a bug in Node's change streams code, then closing this PR and the associated drivers ticket SGTM.

@jmikola jmikola removed their request for review October 23, 2022 13:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants