DRAFT: How to write reliable socket servers that survive crashes and restarts?

Published on 7 June 2022.

This is a work in progress that will change. Would you like to see it finished? Let me know by sending me an email.

This past weekend I was researching how to do zero-downtime deployments and found the wonderful blog post Dream Deploys: Atomic, Zero-Downtime Deployments.

In it, Alan describes how separating listening on a socket and accepting connections on it into different processes can keep a socket “live” at all times even during a restart.

In this blog post I want to document that trick and my understanding of it.

TODO

The problem with a crashing server

To illustrate the problem with a crashing server, we use the example below.

  1. server-listen.py
import socket

with socket.socket() as s:
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("localhost", 9000))
    print("listening on port 9000")
    s.listen()
    print("accepting connections")
    while True:
        conn, addr = s.accept()
        with conn:
            data = conn.recv(100)
            number = int(data)
            conn.sendall(f"got number {number}\n".encode("ascii"))

This is a TCP server, listening on port 9000, reading numbers from clients, and echoing them back. It assumes that the data received can be parsed as an integer. If the parsing fails, the server crashes.

To test the behavior of the server, we use the following client:

  1. client.py
import socket
import time

for i in range(21):
    if i > 0:
        try:
            with socket.socket() as s:
                s.connect(("localhost", 9000))
                if i == 5:
                    s.sendall(b"five\n")
                else:
                    s.sendall(f"{i}\n".encode("ascii"))
                message = s.recv(100).decode("ascii").rstrip()
                diff = int((time.perf_counter() - prev) * 1000)
                print(f"{diff}ms {message}")
        except OSError:
            print(f"{i} failed")
        time.sleep(0.01)
    prev = time.perf_counter()

It sends 20 requests to the server with a 0.01s delay between them. However, on the fifth request, instead of sending the number 5 it sends the string five to cause the server to crash.

If we start the server, then the client, the output looks as follows:

  1. server output
$ python server-listen.py 
listening on port 9000
accepting connections
Traceback (most recent call last):
  File "/home/rick/rickardlindberg.me/writing/reliable-socket-servers/server-listen.py", line 13, in <module>
    number = int(data)
ValueError: invalid literal for int() with base 10: b'five\n'
  1. client output
$ python client.py 
0ms got number 1
0ms got number 2
0ms got number 3
0ms got number 4
0ms 
6 failed
7 failed
8 failed
9 failed
10 failed
11 failed
12 failed
13 failed
14 failed
15 failed
16 failed
17 failed
18 failed
19 failed
20 failed

In the client output, we see that request number five gets an empty response because the server crashed before replying, and that subsequent requests fail because there is no longer anyone listening on port 9000.

Solution: restart server in loop

In order for subsequent requests to succeed, we need to start the server again after it has crashed. One way to do that is to run the server program in an infinite loop using a script like the one below:

  1. loop.sh
while true; do
    echo "$@"
    "$@" || true
    echo "restarting"
done

This Bash script takes a command as its arguments and runs that command in a loop, ignoring its exit code.

Invoking the server and client again, we get the following output:

  1. server output
$ bash loop.sh python server-listen.py 
python server-listen.py
listening on port 9000
accepting connections
Traceback (most recent call last):
  File "/home/rick/rickardlindberg.me/writing/reliable-socket-servers/server-listen.py", line 13, in <module>
    number = int(data)
ValueError: invalid literal for int() with base 10: b'five\n'
restarting
python server-listen.py
listening on port 9000
accepting connections
  1. client output
$ python client.py 
0ms got number 1
0ms got number 2
0ms got number 3
0ms got number 4
0ms 
6 failed
7 failed
8 failed
9 failed
10 failed
11 failed
12 failed
13 failed
14 failed
15 failed
16 failed
0ms got number 17
0ms got number 18
0ms got number 19
0ms got number 20

In the server output, we see that the server starts again after the crash and resumes listening on port 9000.

In the client output, we see that request five fails the same way, but after a few more failed requests, the client starts getting responses again at request 17.

The problem with a restarting server

Running the server in a loop is an improvement. Instead of dropping all subsequent requests, we only drop a few.

But between the server crash and a new process being started, there is no one listening on port 9000, and we still drop connections.

How can we make sure to answer all connections?

Solution: separate listening on a socket and accepting connections

The trick, as also demonstrated in the blog post, is to listen on the socket in one process and to accept connections and process requests in another process. That way, if processing fails and that process dies, the socket still stays open because it is managed by another process.

Here is a program, server-listen-loop.py, that listens on a socket and then spawns another process in a loop to accept connections:

  1. server-listen-loop.py
import os
import socket

with socket.socket() as s:
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("localhost", 9000))
    print("listening on port 9000")
    s.listen()
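    # Make the listening socket file descriptor 0 (stdin) of this process.
    # Unlike the original descriptor, the duplicate survives the exec below.
    os.dup2(s.fileno(), 0)
    os.close(s.fileno())
    # Replace this process with the restart loop; it inherits descriptor 0.
    os.execvp("bash", ["bash", "loop.sh", "python", "server-accept.py"])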

The first part of this program creates a socket and starts listening.

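The last part makes the socket available as file descriptor 0 and then replaces the process with the command bash loop.sh python server-accept.py. The socket is already listening at this point, and it stays open across restarts because the loop keeps it as its stdin while it starts the server-accept.py program over and over.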

The server-accept.py program is similar to server-listen.py, but instead of listening on port 9000, it just accepts connections on the socket, which is passed to it as file descriptor 0 (stdin):

  1. server-accept.py
import socket

with socket.socket(fileno=0) as s:
    print("accepting connections")
    while True:
        conn, addr = s.accept()
        with conn:
            data = conn.recv(100)
            number = int(data)
            conn.sendall(f"got number {number}\n".encode("ascii"))

Invoking the server and client again, we get the following output:

  1. server output
$ python server-listen-loop.py 
listening on port 9000
python server-accept.py
accepting connections
Traceback (most recent call last):
  File "/home/rick/rickardlindberg.me/writing/reliable-socket-servers/server-accept.py", line 9, in <module>
    number = int(data)
ValueError: invalid literal for int() with base 10: b'five\n'
restarting
python server-accept.py
accepting connections
  1. client output
$ python client.py 
0ms got number 1
0ms got number 2
0ms got number 3
0ms got number 4
0ms 
108ms got number 6
0ms got number 7
1ms got number 8
0ms got number 9
0ms got number 10
0ms got number 11
0ms got number 12
0ms got number 13
0ms got number 14
0ms got number 15
0ms got number 16
0ms got number 17
0ms got number 18
0ms got number 19
0ms got number 20

Now every request after the crash gets a response. Request number six takes longer to complete because a new server process has to start before it can accept the connection. In the meantime, the connection waits in the queue of the listening socket instead of being refused, so the client gets no connection errors.
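
The reason request six waits instead of failing can be seen in a small, self-contained experiment. The sketch below is not one of the programs above. It listens on a port chosen by the OS (an assumption, so that it does not collide with port 9000), lets a client connect and send its request before anyone calls accept, and only then accepts the connection.

```python
import socket

# Listen, but do not accept yet. Port 0 asks the OS for any free port
# (an assumption made here so the sketch does not collide with port 9000).
listener = socket.socket()
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("localhost", 0))
listener.listen()
port = listener.getsockname()[1]

# The client connects and sends its request while nobody is accepting.
client = socket.socket()
client.connect(("localhost", port))
client.sendall(b"6\n")

# Later (think: after the restart), the connection is accepted and served.
conn, addr = listener.accept()
with conn:
    number = int(conn.recv(100))
    conn.sendall(f"got number {number}\n".encode("ascii"))

reply = client.recv(100).decode("ascii").rstrip()
print(reply)
client.close()
listener.close()
```

The connect call succeeds even though no accept is pending, because the kernel completes the connection on behalf of the listening socket and holds it until someone accepts it.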

And this is one way to write reliable socket servers that survive crashes and restarts.

Questions & Answers

How long will a socket wait before timing out?

Can we improve on the long startup time?

Why the socket option SO_REUSEADDR?

Is this how supervisor works?

It seems to close the socket upon restart:

[fcgi-program:test]
socket=tcp://localhost:9000
command=python /home/rick/rickardlindberg.me/writing/reliable-socket-servers/server-accept.py

2022-05-10 21:46:28,734 INFO exited: test (exit status 1; not expected)
2022-05-10 21:46:28,734 INFO Closing socket tcp://localhost:9000
2022-05-10 21:46:29,736 INFO Creating socket tcp://localhost:9000
2022-05-10 21:46:29,737 INFO spawned: 'test' with pid 561624
2022-05-10 21:46:30,740 INFO success: test entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

Can we solve it with a dummy process just to keep the socket open?

It can call loop.sh python server-accept.py instead. Of course! But then again, we might as well use a regular program and create the socket ourselves.
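
As a rough sketch of what such a regular program might look like, the code below keeps the listening socket itself and starts the accepting process with the socket as its stdin, without going through loop.sh. It is only an illustration under a few assumptions that are not in the programs above: the worker is an inline stand-in for server-accept.py that handles a single request and then exits, the port is chosen by the OS, and the names ACCEPT_ONCE, run_workers, and request are made up for this example.

```python
import socket
import subprocess
import sys

# Stand-in for server-accept.py that handles one request and exits, so the
# sketch terminates. It is passed inline to keep the example in one file.
ACCEPT_ONCE = r"""
import socket
with socket.socket(fileno=0) as s:
    conn, addr = s.accept()
    with conn:
        number = int(conn.recv(100))
        conn.sendall(f"got number {number}\n".encode("ascii"))
"""

def run_workers(listener, runs):
    """Start the worker `runs` times with the listening socket as stdin."""
    exit_codes = []
    for _ in range(runs):
        result = subprocess.run(
            [sys.executable, "-c", ACCEPT_ONCE],
            stdin=listener.fileno(),
            stderr=subprocess.DEVNULL,  # hide the worker's traceback
        )
        exit_codes.append(result.returncode)
    return exit_codes

def request(port, payload):
    """Connect and send one request; return the connected socket."""
    c = socket.socket()
    c.connect(("localhost", port))
    c.sendall(payload)
    return c

with socket.socket() as listener:
    listener.bind(("localhost", 0))  # port 0: any free port (assumption)
    listener.listen()
    port = listener.getsockname()[1]
    # Both requests are queued before any worker exists.
    bad = request(port, b"five\n")
    good = request(port, b"6\n")
    exit_codes = run_workers(listener, runs=2)
    bad_reply = bad.recv(100)
    good_reply = good.recv(100)
    bad.close()
    good.close()

print(exit_codes, bad_reply, good_reply)
```

Both requests are sent before any worker has started. The worker that receives five exits with an error, yet the other request is still answered, because the socket lives in the supervising process the whole time.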

Why sleep in loop?

Is dup needed?

Python file descriptors not inheritable.
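
A small experiment hints at the answer. Since Python 3.4 (PEP 446), file descriptors created by Python are non-inheritable, so the listening socket would be closed by the exec if it were left on its original descriptor. os.dup2 makes the destination descriptor inheritable by default, which is what lets descriptor 0 survive. The sketch below checks this with os.get_inheritable. It duplicates onto a spare descriptor rather than onto 0, an assumption made so that the example leaves the real stdin alone.

```python
import os
import socket

with socket.socket() as s:
    # Descriptors created by Python are non-inheritable by default (PEP 446).
    before = os.get_inheritable(s.fileno())

    # Reserve a spare descriptor number to play the role of fd 0, so the
    # sketch does not clobber the real stdin.
    target = os.dup(s.fileno())

    # os.dup2 makes the destination inheritable unless inheritable=False.
    os.dup2(s.fileno(), target)
    after = os.get_inheritable(target)
    os.close(target)

print(before, after)
```

Calling os.set_inheritable on the original descriptor would presumably work too, but then the accepting program would have to be told which descriptor number to use, whereas descriptor 0 is a fixed convention.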

Is asyncio more reliable?

Don’t kill server if client request failed
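
One way to read this note: the server could answer a malformed request with an error instead of letting the exception end the process. A minimal sketch, assuming a helper called handle (a hypothetical name) and an error message that is not part of the original protocol:

```python
def handle(data):
    """Turn one request into a response without raising on bad input."""
    try:
        number = int(data)
    except ValueError:
        # Hypothetical error reply; the original server crashes here instead.
        return b"error: not a number\n"
    return f"got number {number}\n".encode("ascii")
```

In server-accept.py, the body of the loop could then become conn.sendall(handle(conn.recv(100))). Crashes from other causes would still be covered by the restart loop.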

Can this mechanism be used for zero-downtime deploys?

Well, yes, that is how I learned about it in the blog post.

Can we use this technique to create a load balancer?

Unix domain socket vs. TCP socket

