If `zulip-puppet-apply` is run during an upgrade, the upgrade
immediately runs `stop-server` again before running migrations. If the
last step of the puppet application was to restart `supervisor`, it may
not yet be listening on its UNIX socket, in which case
`socket.connect()` raises a `FileNotFoundError`:
```
Traceback (most recent call last):
  File "./scripts/stop-server", line 53, in <module>
    services = list_supervisor_processes(services, only_running=True)
  File "./scripts/lib/supervisor.py", line 34, in list_supervisor_processes
    processes = rpc().supervisor.getAllProcessInfo()
  File "/usr/lib/python3.9/xmlrpc/client.py", line 1116, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib/python3.9/xmlrpc/client.py", line 1456, in __request
    response = self.__transport.request(
  File "/usr/lib/python3.9/xmlrpc/client.py", line 1160, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib/python3.9/xmlrpc/client.py", line 1172, in single_request
    http_conn = self.send_request(host, handler, request_body, verbose)
  File "/usr/lib/python3.9/xmlrpc/client.py", line 1285, in send_request
    self.send_content(connection, request_body)
  File "/usr/lib/python3.9/xmlrpc/client.py", line 1315, in send_content
    connection.endheaders(request_body)
  File "/usr/lib/python3.9/http/client.py", line 1250, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.9/http/client.py", line 1010, in _send_output
    self.send(msg)
  File "/usr/lib/python3.9/http/client.py", line 950, in send
    self.connect()
  File "./scripts/lib/supervisor.py", line 10, in connect
    self.sock.connect(self.host)
FileNotFoundError: [Errno 2] No such file or directory
```
Catch the `FileNotFoundError` and retry twice more, with backoff. If
it fails repeatedly, point to `service supervisor status` for further
debugging, as `FileNotFoundError` is rather misleading: the file
exists, but it is not yet accepting connections.
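
The retry behaviour described above could be sketched roughly as follows. This is a minimal illustration, not the actual change: the helper name `call_with_retries`, the delay values, and the error message wording are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    attempts: int = 3,  # one initial try plus two retries (assumed default)
    initial_delay: float = 1.0,  # hypothetical backoff start, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying on FileNotFoundError with doubling delays."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except FileNotFoundError as e:
            if attempt == attempts:
                # Out of retries: the bare FileNotFoundError is misleading,
                # so point the operator at the supervisor service instead.
                raise RuntimeError(
                    "Could not connect to supervisor; "
                    "check `service supervisor status`"
                ) from e
            sleep(delay)
            delay *= 2
    raise AssertionError("unreachable")
```

A caller would wrap the RPC call, for example `call_with_retries(lambda: rpc().supervisor.getAllProcessInfo())`. The `sleep` parameter exists only so the backoff can be exercised without actually waiting.
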
7c4293a7d3 switched to checking if the service was already running,
and using `supervisorctl start` if it was not.
Unfortunately, `list_supervisor_processes("zulip-tornado:*")` did not
include `zulip-tornado`, and as such a non-sharded process was always
considered to _not_ be running, and was thus started, not restarted.
Starting an already-started service is a no-op, and thus non-sharded
tornado processes were never restarted.
The observed behaviour is that requests to the tornado process attempt
to load the user from the cache, with a different prefix from Django,
and immediately invalidate the session and eject the user back to the
login page.
Fix the `list_supervisor_processes` logic to match without the
trailing `:*`.
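
As a rough illustration of the intended matching, the sketch below treats a `group:*` filter as matching both the bare process name and any `group:name` member. The function `matches_filter` is hypothetical, and it assumes that a process whose group equals its name is referred to without a group prefix; the real `list_supervisor_processes` may be structured differently.

```python
def matches_filter(group: str, name: str, filter_name: str) -> bool:
    """Return True if the process identified by (group, name) matches filter_name."""
    # Assumption: an ungrouped process is reported with group == name and is
    # referred to by the bare name rather than "name:name".
    full_name = name if group == name else f"{group}:{name}"
    if filter_name.endswith(":*"):
        prefix = filter_name[: -len(":*")]
        # Match the non-sharded process ("zulip-tornado") as well as any
        # sharded member of the group ("zulip-tornado:zulip-tornado-port-9800").
        return full_name == prefix or full_name.startswith(prefix + ":")
    return full_name == filter_name
```

With this logic, `matches_filter("zulip-tornado", "zulip-tornado", "zulip-tornado:*")` is true, so a non-sharded tornado process is recognised as running and gets restarted rather than started.
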
For many uses, shelling out to `supervisorctl` is going to produce
better error messages. However, for instances where we wish to parse
the output of `supervisorctl`, using the API directly is less brittle.
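
For reference, talking to the supervisor XML-RPC API over its UNIX socket can be sketched with the standard library alone. The class names and the default socket path below are assumptions for illustration; only `supervisor.getAllProcessInfo()` is taken from the traceback above.

```python
import http.client
import socket
import xmlrpc.client


class UnixStreamHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that connects to a UNIX socket path instead of TCP."""

    def connect(self) -> None:
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        # self.host holds the socket path passed to the constructor.
        self.sock.connect(self.host)


class UnixStreamTransport(xmlrpc.client.Transport):
    """XML-RPC transport that routes requests over a UNIX socket."""

    def __init__(self, socket_path: str) -> None:
        self.socket_path = socket_path
        super().__init__()

    def make_connection(self, host):
        # The host from the URL is ignored; always use the socket path.
        return UnixStreamHTTPConnection(self.socket_path)


def rpc(socket_path: str = "/var/run/supervisor.sock") -> xmlrpc.client.ServerProxy:
    # The default path is an assumption; it depends on the supervisor config.
    return xmlrpc.client.ServerProxy(
        "http://localhost", transport=UnixStreamTransport(socket_path)
    )
```

Calling `rpc().supervisor.getAllProcessInfo()` then returns structured process data rather than text that needs parsing, which is what makes this route less brittle.
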