puppet: Use lazy-apps and uwsgi control sockets for rolling reloads.

Restarting the uwsgi processes by way of supervisor opens a window
during which nginx 502's all responses. uwsgi has a configuration
called "chain reloading" which allows for rolling restart of the uwsgi
processes, such that only one process at a time is unavailable; see
uwsgi documentation ([1]).

The tradeoff is that this requires that the uwsgi processes load the
libraries after forking, rather than before ("lazy apps"); in theory
this can lead to larger memory footprints, since the loaded libraries
are not shared between processes. In practice, as Django defers much
of the loading, this is not as much of an issue.

In a very basic test of memory consumption (measured by total memory -
free - caches - buffers; 6 uwsgi workers; values in kB), both
immediately after restarting Django, and after requesting `/` 60 times
with 6 concurrent requests:

                      |  Non-lazy  |  Lazy app  | Difference
    ------------------+------------+------------+-------------
    Fresh             |  2,827,216 |  2,870,480 |   +43,264
    After 60 requests |  3,332,284 |  3,409,608 |   +77,324
    ..................|............|............|.............
    Difference        |   +505,068 |   +539,128 |   +34,060

That is, "lazy app" loading increased the footprint pre-requests by
43MB, and after 60 requests grew the memory footprint by 539MB, as
opposed to non-lazy loading, which grew it by 505MB. Using uwsgi "lazy
app" loading does increase the memory footprint, but not by a large
percentage.

The other effect is that requests may be served by either old or new
code during the restart window. This may cause transient failures
when new frontend code talks to old backend code.

Enable chain-reloading during graceful, puppetless restarts, but only
if enabled via a zulip.conf configuration flag.

Fixes #2559.

[1]: https://uwsgi-docs.readthedocs.io/en/latest/articles/TheArtOfGracefulReloading.html#chain-reloading-lazy-apps
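
As a rough check of the "not by a large percentage" claim, the
figures from the table above can be turned into relative overheads.
This is only an illustrative sketch: the variable and function names
are made up here, and the values are assumed to be in kB.

```python
# Memory figures copied from the table in the commit message (assumed kB).
fresh = {"non_lazy": 2_827_216, "lazy": 2_870_480}
after_60 = {"non_lazy": 3_332_284, "lazy": 3_409_608}


def pct_increase(base: int, new: int) -> float:
    """Relative increase of `new` over `base`, in percent."""
    return (new - base) / base * 100


# Absolute overhead of lazy-apps relative to preforking, at each point in time.
fresh_overhead = fresh["lazy"] - fresh["non_lazy"]
after_overhead = after_60["lazy"] - after_60["non_lazy"]

# The same overheads expressed as a percentage of the non-lazy footprint.
fresh_overhead_pct = pct_increase(fresh["non_lazy"], fresh["lazy"])
after_overhead_pct = pct_increase(after_60["non_lazy"], after_60["lazy"])

print(f"fresh: +{fresh_overhead:,} ({fresh_overhead_pct:.1f}%)")
print(f"after 60 requests: +{after_overhead:,} ({after_overhead_pct:.1f}%)")
```

On these numbers the lazy-apps overhead stays in the low single-digit
percent range both before and after the test requests.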
2021-12-31 20:20:49 -08:00 · 2021-12-31 20:20:49 -08:00 · 6218ed91c2
parent 4aaa250623
commit 6218ed91c2
4 changed files with 45 additions and 2 deletions
--- a/docs/production/deployment.md
+++ b/docs/production/deployment.md
@@ -623,6 +623,14 @@ override is useful both Docker systems (where the above algorithm
 might see the host's memory, not the container's) and/or when using
 remote servers for postgres, memcached, redis, and RabbitMQ.

+#### `rolling_restart`
+
+If set to a non-empty value, when using `./scripts/restart-server` to
+restart Zulip, restart the uwsgi processes one-at-a-time, instead of
+all at once. This decreases the number of 502's served to clients, at
+the cost of slightly increased memory usage, and the possibility that
+different requests will be served by different versions of the code.
+
 #### `uwsgi_buffer_size`

 Override the default uwsgi buffer size of 8192.
--- a/puppet/zulip/manifests/app_frontend_base.pp
+++ b/puppet/zulip/manifests/app_frontend_base.pp
@@ -119,6 +119,12 @@ class zulip::app_frontend_base {
    notify  => Service[$zulip::common::supervisor_service],
  }

+  $uwsgi_rolling_restart = zulipconf('application_server', 'rolling_restart', '')
+  if $uwsgi_rolling_restart == '' {
+    file { '/home/zulip/deployments/uwsgi-control':
+      ensure => absent,
+    }
+  }
  $uwsgi_listen_backlog_limit = zulipconf('application_server', 'uwsgi_listen_backlog_limit', 128)
  $uwsgi_buffer_size = zulipconf('application_server', 'uwsgi_buffer_size', 8192)
  $uwsgi_processes = zulipconf('application_server', 'uwsgi_processes', $uwsgi_default_processes)
--- a/puppet/zulip/templates/uwsgi.ini.template.erb
+++ b/puppet/zulip/templates/uwsgi.ini.template.erb
@@ -16,6 +16,13 @@ gid=zulip

 stats=/home/zulip/deployments/uwsgi-stats

+<% if @uwsgi_rolling_restart != '' -%>
+master-fifo=/home/zulip/deployments/uwsgi-control
+# lazy-apps are required for rolling restarts:
+# https://uwsgi-docs.readthedocs.io/en/latest/articles/TheArtOfGracefulReloading.html#preforking-vs-lazy-apps-vs-lazy
+lazy-apps=true
+<% end -%>
+
 ignore-sigpipe = true
 ignore-write-errors = true
 disable-write-exception = true
--- a/scripts/restart-server
+++ b/scripts/restart-server
@@ -13,6 +13,7 @@ from scripts.lib.zulip_tools import (
    ENDC,
    OKGREEN,
    WARNING,
+    get_config,
    get_config_file,
    get_tornado_ports,
    has_application_server,
@@ -128,8 +129,29 @@ if has_application_server():
        subprocess.check_call(["supervisorctl", action, "zulip-tornado:*"])

    # Finally, restart the Django uWSGI processes.
-    logging.info("%s django server", verbing)
-    subprocess.check_call(["supervisorctl", action, "zulip-django"])
+    if (
+        action == "restart"
+        and not args.less_graceful
+        and get_config(config_file, "application_server", "rolling_restart") != ""
+        and os.path.exists("/home/zulip/deployments/uwsgi-control")
+    ):
+        # See if it's currently running
+        uwsgi_status = subprocess.run(
+            ["supervisorctl", "status", "zulip-django"],
+            stdout=subprocess.DEVNULL,
+        )
+        if uwsgi_status.returncode == 0:
+            logging.info("Starting rolling restart of django server")
+            with open("/home/zulip/deployments/uwsgi-control", "w") as control_socket:
+                # "c" is chain-reloading:
+                # https://uwsgi-docs.readthedocs.io/en/latest/MasterFIFO.html#available-commands
+                control_socket.write("c")
+        else:
+            logging.info("Starting django server")
+            subprocess.check_call(["supervisorctl", "start", "zulip-django"])
+    else:
+        logging.info("%s django server", verbing)
+        subprocess.check_call(["supervisorctl", action, "zulip-django"])

    using_sso = subprocess.check_output(["./scripts/get-django-setting", "USING_APACHE_SSO"])
    if using_sso.strip() == b"True":
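
Note on the rolling-restart branch in `scripts/restart-server`: on
POSIX systems, opening a FIFO for writing blocks until some process
has it open for reading, which is presumably why the script first
checks `supervisorctl status` before writing "c". A possible
alternative, sketched below under that assumption, is to open the
FIFO with `O_NONBLOCK`, so that a missing reader is reported as an
error instead of a hang. The helper name `request_chain_reload` is
hypothetical and is not part of this commit.

```python
import errno
import os


def request_chain_reload(fifo_path: str) -> bool:
    """Write uwsgi's chain-reload command ("c") to a master FIFO.

    Hypothetical helper, not part of the commit. Returns False instead
    of blocking when the FIFO is missing or nothing is reading from it
    (for example, because uwsgi is not running).
    """
    try:
        # With O_NONBLOCK, opening a FIFO write-only fails with ENXIO
        # when no process has it open for reading.
        fd = os.open(fifo_path, os.O_WRONLY | os.O_NONBLOCK)
    except OSError as e:
        if e.errno in (errno.ENOENT, errno.ENXIO):
            return False
        raise
    try:
        os.write(fd, b"c")
    finally:
        os.close(fd)
    return True
```

A caller could fall back to `supervisorctl start zulip-django` when
this returns False, mirroring the `else` branch in the diff above.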