Summary
When using a SentinelManagedConnection with a retry mechanism, the connection does not detect a new master even after retrying and after SENTINEL_DOWN_AFTER_MILLISECONDS has passed. Instead, it remains stuck inside the super().connect() retry loop, continuing to attempt connections to the old master. Only after exhausting all retry attempts does it finally invoke _connect_retry, which allows get_master_address to retrieve the updated master information.
Expected Behavior
When the master goes down and SENTINEL_DOWN_AFTER_MILLISECONDS elapses, the connection should, at the next opportunity, try to identify a new master and redirect traffic accordingly, minimizing downtime and avoiding writes to outdated instances.
Actual Behavior
The connection persists in the super().connect() (from AbstractConnection) retry loop, attempting to reconnect to the old master. Even after a new master has been elected by Redis, it does not detect this change until all retry attempts to the old master have failed.
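For context, the interplay between the two retry loops looks roughly like this (paraphrased from redis-py's sentinel.py and connection.py, not verbatim):

# redis/sentinel.py (paraphrased)
class SentinelManagedConnection(Connection):
    def connect_to(self, address):
        self.host, self.port = address
        super().connect()  # enters AbstractConnection's own retry loop (below)

    def _connect_retry(self):
        if self._sock:
            return
        if self.connection_pool.is_master:
            # the master address is only (re)resolved here
            self.connect_to(self.connection_pool.get_master_address())

    def connect(self):
        # outer retry loop: _connect_retry (and hence get_master_address)
        # only runs again once the inner loop below has fully given up
        return self.retry.call_with_retry(self._connect_retry, lambda error: None)

# redis/connection.py, AbstractConnection.connect (paraphrased)
def connect(self):
    if self._sock:
        return
    # inner retry loop, pinned to the stale self.host / self.port
    sock = self.retry.call_with_retry(
        lambda: self._connect(), lambda error: self.disconnect(error)
    )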
Dangerous Side Effects
Beyond the unexpected delay, this behavior can cause data integrity issues:
If the old master comes back online at the right moment, it initially starts as a master before being demoted to a slave.
A pending write operation could be executed on this instance before it fully syncs with the other replicas.
Since Redis acknowledges the write call, users may believe the operation succeeded, when in reality the data is lost once the instance syncs and adopts its new slave role.
Reproduce behavior
Reproducing this issue depends somewhat on failover timing, i.e. how long the old master takes to switch to slave versus the backoff time between retries, but here is a draft:
Create a Redis Sentinel cluster, e.g. via a docker-compose.yml that starts a master container named redis-node-1 (plus replicas) and two Sentinels exposed on localhost:36380 and localhost:36381 monitoring the master group "mymaster". Then run a script along the following lines:
import time
from threading import Thread

import docker
from redis.backoff import ExponentialBackoff
from redis.retry import Retry
from redis.sentinel import Sentinel

def restart_container(delay, container):
    time.sleep(delay)
    container.restart()

def problems():
    # Find the Docker container running the current master
    container_to_be_stopped = None
    for el in docker.from_env().containers.list(all=True):
        if "redis-node-1" == el.name:
            container_to_be_stopped = el

    sentinel = Sentinel(
        [("localhost", 36380), ("localhost", 36381)],
        socket_timeout=1,
        socket_connect_timeout=1,
        health_check_interval=5,
        retry=Retry(ExponentialBackoff(10, 0.5), 5),
    )
    redis = sentinel.master_for("mymaster")
    pipeline = redis.pipeline()
    pipeline.hset("myhash", "name", "john doe")

    print("Shutdown master")
    container_to_be_stopped.stop()
    # Bring the old master back up after 5 seconds, while the pipeline retries
    stop_thread = Thread(target=restart_container, args=(5, container_to_be_stopped))
    stop_thread.start()

    resp = pipeline.execute()
    stop_thread.join()
    print(resp)
    # You can also try hgetall("myhash") to verify that nothing was written,
    # despite resp being [1], i.e. claiming one field-value pair was written.

if __name__ == "__main__":
    problems()
Comments:
It might take a few tries to replicate the dangerous side effect mentioned above, but if you inspect the state inside the discover_master function (redis-py/redis/sentinel.py, lines 287 to 308 in ea01a30) right after line 302 and the flags say the master is disconnected but not yet sdown/odown, you are on your way.
Possible solutions
The issue stems from Sentinel's retry-connect logic (redis-py/redis/sentinel.py, lines 36 to 58 in ea01a30), where SentinelManagedConnection.connect_to() calls super().connect() and thereby enters AbstractConnection's own retry loop (redis-py/redis/connection.py, lines 373 to 384 in ea01a30).
In my opinion, SentinelManagedConnection should handle the retries directly, so it can switch to the newly discovered master address without waiting for all retries against the old master to fail. For example, modifying SentinelManagedConnection.connect_to():
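A minimal sketch of that modification (not a verbatim patch; it assumes the retry flag added to connect() below and leaves the existing PING check untouched):

def connect_to(self, address):
    self.host, self.port = address
    # Single attempt: skip AbstractConnection's inner retry loop so that a
    # failure bubbles up to SentinelManagedConnection.connect(), whose next
    # attempt re-enters _connect_retry() and re-resolves the master address.
    super().connect(retry=False)
    ...  # existing PING/role verification stays unchanged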
and updating AbstractConnection.connect() to accept a retry=True parameter:
def connect(self, retry=True):
    "Connects to the Redis server if not already connected"
    if self._sock:
        return
    try:
        if retry:
            sock = self.retry.call_with_retry(
                lambda: self._connect(), lambda error: self.disconnect(error)
            )
        else:
            sock = self._connect()
    ...
However, having flags flying around (😉) might not be very pretty, so a better approach might be for SentinelManagedConnection, instead of calling its superclass implementation, to maintain its own version of the AbstractConnection.connect() logic.
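As a rough sketch of that alternative (the _connect_once helper is hypothetical, and the exception translation and on_connect() handshake performed by AbstractConnection.connect() are only hinted at):

def _connect_once(self):
    # Hypothetical helper: one bare connection attempt, no retry loop.
    # AbstractConnection.connect()'s error translation and handshake
    # would need to be reproduced here as well.
    if self._sock:
        return
    self._sock = self._connect()
    self.on_connect()

def connect_to(self, address):
    self.host, self.port = address
    self._connect_once()  # instead of super().connect()
    ...  # existing PING/role verification stays unchanged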
Additional Comments
This might also happen in the async version since it follows a similar pattern.