SentinelManagedConnection delays new master detection and risks data loss #3560

Open
ManelCoutinhoSensei opened this issue Mar 14, 2025 · 2 comments · May be fixed by #3601


@ManelCoutinhoSensei

Summary

When using a SentinelManagedConnection with a retry mechanism, the connection does not detect a new master even after retrying and SENTINEL_DOWN_AFTER_MILLISECONDS has passed. Instead, it remains stuck inside the super().connect() retry loop, continuing to attempt connections to the old master. Only after exhausting all retry attempts does it finally invoke _connect_retry, which allows get_master_address to retrieve the updated master information.


Expected Behavior

When the master goes down and SENTINEL_DOWN_AFTER_MILLISECONDS elapses, the connection should, at the next opportunity, try to identify a new master and redirect traffic accordingly, minimizing downtime and avoiding writes to outdated instances.

Actual Behavior

The connection persists in the super().connect() (from AbstractConnection) retry loop, attempting to reconnect to the old master. Even after a new master has been elected by Redis, it does not detect this change until all retry attempts to the old master have failed.

Dangerous Side Effects

Beyond the unexpected delay, this behavior can cause data integrity issues:

  • If the old master comes back online at the right moment, it initially starts as a master before being demoted to a slave.
  • A pending write operation could be executed on this instance before it fully syncs with the other replicas.
  • Since Redis acknowledges the write, users may believe the operation succeeded, when in reality the data is lost once the instance syncs and adopts its new slave role.

Reproduce behavior

Reproducing this issue depends somewhat on failover timing and on how long the old master takes to switch to slave relative to the backoff time between retries, but here is a draft:

  1. Create a Redis Sentinel cluster using something like the following docker-compose.yml:
services:

  # Redis Node #1 (initial master) + Sentinel
  redis-node-1:
    container_name: redis-node-1
    image: bitnami/redis:7.2.4
    environment:
      - ALLOW_EMPTY_PASSWORD=yes
      - REDIS_AOF_ENABLED=no
    ports:
      - 6380:6379
    networks:
      - nw

  redis-node-1-sentinel:
    container_name: redis-node-1-sentinel
    image: bitnami/redis-sentinel:7.2.4
    depends_on:
      - redis-node-1
    environment:
      - REDIS_MASTER_HOST=redis-node-1
      - REDIS_SENTINEL_MASTER_NAME=mymaster
      - REDIS_SENTINEL_DOWN_AFTER_MILLISECONDS=5000
      - REDIS_SENTINEL_FAILOVER_TIMEOUT=10000
      - REDIS_SENTINEL_QUORUM=2
    ports:
      - 36380:26379
    networks:
      - nw

  # Redis Node #2  + Sentinel  
  redis-node-2:
    container_name: redis-node-2
    image: bitnami/redis:7.2.4
    environment:
      - ALLOW_EMPTY_PASSWORD=yes
      - REDIS_REPLICATION_MODE=slave
      - REDIS_MASTER_HOST=redis-node-1
      - REDIS_AOF_ENABLED=no
    ports:
      - 6381:6379
    networks:
      - nw

  redis-node-2-sentinel:
    container_name: redis-node-2-sentinel
    image: bitnami/redis-sentinel:7.2.4
    depends_on:
      - redis-node-2
    environment:
      - REDIS_MASTER_HOST=redis-node-1
      - REDIS_SENTINEL_MASTER_NAME=mymaster
      - REDIS_SENTINEL_DOWN_AFTER_MILLISECONDS=5000
      - REDIS_SENTINEL_FAILOVER_TIMEOUT=10000
      - REDIS_SENTINEL_QUORUM=2
    ports:
      - 36381:26379
    networks:
      - nw

  # Redis Node #3 + Sentinel 
  redis-node-3:
    container_name: redis-node-3
    image: bitnami/redis:7.2.4
    environment:
      - ALLOW_EMPTY_PASSWORD=yes
      - REDIS_REPLICATION_MODE=slave
      - REDIS_MASTER_HOST=redis-node-1
      - REDIS_AOF_ENABLED=no
    ports:
      - 6382:6379
    networks:
      - nw

  redis-node-3-sentinel:
    container_name: redis-node-3-sentinel
    image: bitnami/redis-sentinel:7.2.4
    depends_on:
      - redis-node-3
    environment:
      - REDIS_MASTER_HOST=redis-node-1
      - REDIS_SENTINEL_MASTER_NAME=mymaster
      - REDIS_SENTINEL_DOWN_AFTER_MILLISECONDS=5000
      - REDIS_SENTINEL_FAILOVER_TIMEOUT=10000
      - REDIS_SENTINEL_QUORUM=2
    ports:
      - 36382:26379
    networks:
      - nw

networks:
  nw:
    driver: bridge
  2. Run this Python script:
import time
from threading import Thread

import docker
from redis.backoff import ExponentialBackoff
from redis.retry import Retry
from redis.sentinel import Sentinel


def restart_container(delay, container):
    time.sleep(delay)
    container.restart()

def problems():
    for el in docker.from_env().containers.list(all=True):
        if "redis-node-1" == el.name:
            container_to_be_stopped = el

    sentinel = Sentinel(
        [("localhost", 36380), ("localhost", 36381)],
        socket_timeout=1,
        socket_connect_timeout=1,
        health_check_interval=5,
        retry=Retry(ExponentialBackoff(10, 0.5), 5),
    )
    redis = sentinel.master_for("mymaster")

    pipeline = redis.pipeline()
    pipeline.hset("myhash", "name", "john doe")

    print("Shutdown master")
    container_to_be_stopped.stop()

    stop_thread = Thread(target=restart_container, args=(5, container_to_be_stopped))
    stop_thread.start()
    resp = pipeline.execute()

    stop_thread.join()

    print(resp)
    # You can also try hgetall("myhash") to verify that nothing was written,
    # despite resp being [1] - i.e. Redis claims one field-value pair was written.

if __name__ == "__main__":
    problems()

Comments:

  • It might take you a few tries to replicate the dangerous side effect mentioned above, but if, inside the discover_master function, you check the state after line 302 and the flags say the master is disconnected but not yet sdown/odown, you are on your way.

    redis-py/redis/sentinel.py

    Lines 287 to 308 in ea01a30

    def discover_master(self, service_name):
        """
        Asks sentinel servers for the Redis master's address corresponding
        to the service labeled ``service_name``.

        Returns a pair (address, port) or raises MasterNotFoundError if no
        master is found.
        """
        collected_errors = list()
        for sentinel_no, sentinel in enumerate(self.sentinels):
            try:
                masters = sentinel.sentinel_masters()
            except (ConnectionError, TimeoutError) as e:
                collected_errors.append(f"{sentinel} - {e!r}")
                continue
            state = masters.get(service_name)
            if state and self.check_master_state(state, service_name):
                # Put this sentinel at the top of the list
                self.sentinels[0], self.sentinels[sentinel_no] = (
                    sentinel,
                    self.sentinels[0],
                )
  • If you remove the thread part, you'll see the behavior described in Actual Behavior: a new master is only found after every retry attempt against the old one has been made.
  • Sockets must have a timeout; otherwise you won't see the problem.

Possible solutions

The issue stems from the retry-connect logic in SentinelManagedConnection:

    def connect_to(self, address):
        self.host, self.port = address
        super().connect()
        if self.connection_pool.check_connection:
            self.send_command("PING")
            if str_if_bytes(self.read_response()) != "PONG":
                raise ConnectionError("PING failed")

    def _connect_retry(self):
        if self._sock:
            return  # already connected
        if self.connection_pool.is_master:
            self.connect_to(self.connection_pool.get_master_address())
        else:
            for slave in self.connection_pool.rotate_slaves():
                try:
                    return self.connect_to(slave)
                except ConnectionError:
                    continue
            raise SlaveNotFoundError  # Never be here

    def connect(self):
        return self.retry.call_with_retry(self._connect_retry, lambda error: None)

combined with the retry loop inside AbstractConnection.connect(), which connect_to invokes via super().connect():

    def connect(self):
        "Connects to the Redis server if not already connected"
        if self._sock:
            return
        try:
            sock = self.retry.call_with_retry(
                lambda: self._connect(), lambda error: self.disconnect(error)
            )
        except socket.timeout:
            raise TimeoutError("Timeout connecting to server")
        except OSError as e:
            raise ConnectionError(self._error_message(e))
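
To make the interaction concrete, here is a small self-contained sketch of the nesting (not redis-py internals: fake_get_master_address and fake_socket_connect are stand-ins used only to count calls). It shows that the inner retry exhausts all of its attempts against the old address before the outer retry gets another chance to re-query Sentinel for the master:

from redis.backoff import ExponentialBackoff
from redis.exceptions import ConnectionError
from redis.retry import Retry

attempts_on_old_master = 0
master_lookups = 0

def fake_get_master_address():
    # stand-in for connection_pool.get_master_address()
    global master_lookups
    master_lookups += 1
    return ("old-master", 6379)

def fake_socket_connect():
    # stand-in for AbstractConnection._connect(); the old master is down
    global attempts_on_old_master
    attempts_on_old_master += 1
    raise ConnectionError("old master is down")

# one Retry object per layer, mirroring the nested call_with_retry calls above
inner_retry = Retry(ExponentialBackoff(0.01, 0.001), 3)  # AbstractConnection.connect()
outer_retry = Retry(ExponentialBackoff(0.01, 0.001), 3)  # SentinelManagedConnection.connect()

def inner_connect():
    # like AbstractConnection.connect(): retries the *same* host/port
    inner_retry.call_with_retry(fake_socket_connect, lambda error: None)

def outer_connect_retry():
    # like _connect_retry(): the master address is only refreshed here
    fake_get_master_address()
    inner_connect()

try:
    outer_retry.call_with_retry(outer_connect_retry, lambda error: None)
except ConnectionError:
    pass

print(f"master lookups: {master_lookups}, attempts on old master: {attempts_on_old_master}")
# With 3 retries per layer this prints "master lookups: 4, attempts on old master: 16":
# the old address is dialed four times for every single chance to notice the new master.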

In my opinion, SentinelManagedConnection should handle the retries directly, so it can switch to the newly discovered master address without waiting for all retries against the old master to fail. For example, modifying:

    def connect_to(self, address):
        self.host, self.port = address
        super().connect(retry=False)  # <- changed here
        if self.connection_pool.check_connection:
            self.send_command("PING")
            if str_if_bytes(self.read_response()) != "PONG":
                raise ConnectionError("PING failed")

and updating AbstractConnection.connect() to accept a retry=True parameter:

    def connect(self, retry=True):
        "Connects to the Redis server if not already connected"
        if self._sock:
            return

        try:
            if retry:
                sock = self.retry.call_with_retry(
                    lambda: self._connect(), lambda error: self.disconnect(error)
                )
            else:
                sock = self._connect()
       ...

However, having flags flying around (😉) might not be very pretty, so a better approach might be for SentinelManagedConnection to maintain its own version of the AbstractConnection.connect() logic instead of calling its superclass implementation.
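
For illustration only, a rough sketch of that alternative (it reuses the AbstractConnection internals already quoted above, _connect, _sock and _error_message, and deliberately leaves out the rest of the real connect(), such as the post-connect handshake) might look like:

    def connect_to(self, address):
        self.host, self.port = address
        if self._sock:
            return
        try:
            # a single attempt against this address; all retrying stays in
            # _connect_retry(), which re-resolves the master on every attempt
            sock = self._connect()
        except socket.timeout:
            raise TimeoutError("Timeout connecting to server")
        except OSError as e:
            raise ConnectionError(self._error_message(e))
        self._sock = sock
        # ...plus whatever else AbstractConnection.connect() does after the
        # socket is established (handshake, callbacks), omitted in this sketch
        if self.connection_pool.check_connection:
            self.send_command("PING")
            if str_if_bytes(self.read_response()) != "PONG":
                raise ConnectionError("PING failed")

This keeps a single retry loop: _connect_retry() remains the only place that retries, and it calls get_master_address() before every attempt, so a failover is picked up as soon as the Sentinels report it.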

Additional Comments

This might also happen in the async version since it follows a similar pattern.
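Assuming the async SentinelManagedConnection in redis.asyncio.sentinel mirrors the sync connect_to/_connect_retry structure, the same (hypothetical) retry=False idea would translate roughly to:

    async def connect_to(self, address):
        self.host, self.port = address
        # hypothetical retry=False parameter, mirroring the sync proposal above
        await super().connect(retry=False)
        if self.connection_pool.check_connection:
            await self.send_command("PING")
            if str_if_bytes(await self.read_response()) != "PONG":
                raise ConnectionError("PING failed")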

@petyaslavova
Collaborator

Hi @ManelCoutinhoSensei! We'll have a look at it shortly.

@ManelCoutinhoSensei
Author

Hi @petyaslavova, I opened a PR for this issue as well.
