#!/usr/bin/env python
from __future__ import print_function
import os
import sys
import pwd
import subprocess
import logging
import time
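
# Make the repository root importable so zulip_tools (used below) can be found.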
sys.path.append(os.path.join(os.path.dirname(__file__), '..'))
from zulip_tools import ENDC, OKGREEN, DEPLOYMENTS_DIR
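
# Tag all log output with a timestamp and the script name.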
logging.basicConfig(format="%(asctime)s restart-server: %(message)s",
                    level=logging.INFO)
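
# Run everything from the root of the deployment this script belongs to.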
deploy_path = os.path.realpath(os.path.join(os.path.dirname(__file__), '..'))
os.chdir(deploy_path)
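
# The server processes run as the 'zulip' user; refuse to run as anyone else.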
if pwd.getpwuid(os.getuid()).pw_name != "zulip":
    logging.error("Must be run as user 'zulip'.")
    sys.exit(1)

# Send a statsd event on restarting the server
subprocess.check_call(["python", "./manage.py", "send_stats", "incr", "events.server_restart", str(int(time.time()))])
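
# Warm the memcached caches so the restarted processes do not start cold.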
logging.info("Filling memcached caches")
subprocess.check_call(["python", "./manage.py", "fill_memcached_caches"])

# Restart the FastCGI and related processes via supervisorctl.  Stop the less
# important workers first and the core server second, then bring the core back
# up before the workers: supervisor is slow at restarting many daemons at once,
# and this ordering keeps downtime for the user-facing Django/Tornado daemons
# to a little over 5 seconds instead of the ~13 seconds a single combined
# restart took.
logging.info("Stopping workers")
subprocess.check_call(["supervisorctl", "stop", "zulip-workers:*"])
logging.info("Stopping server core")
subprocess.check_call(["supervisorctl", "stop", "zulip-senders:* zulip-django zulip-tornado"])
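
# Update the 'current' deployment symlink, preserving its previous target as 'last'.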
current_symlink = os.path.join(DEPLOYMENTS_DIR, "current")
last_symlink = os.path.join(DEPLOYMENTS_DIR, "last")
if os.readlink(current_symlink) != deploy_path:
    subprocess.check_call(["ln", '-nsf', os.readlink(current_symlink), last_symlink])
    subprocess.check_call(["ln", '-nsf', deploy_path, current_symlink])
|
[manual] restart-server: Minimize downtime for message sender worker.
The manual step here is that we need to do the `puppet apply` before
pushing this commit, or `restart-server` will crash.
Previously we shut down everything in one group, which performed
poorly with supervisor's bad performance on restarting many daemons at
once. Now we shut down the unimportant stuff, then the important
stuff, bring back the important stuff, and then bring back the
unimportant stuff.
This new model has a little over 5s of downtime for the core
user-facing daemons -- which is still far more than would be ideal,
but a lot less than the 13s or so that we had before.
Here's some logs with the current setup for the tornado/django downtime:
2013-12-19 20:16:51,995 restart-server: Stopping daemons
2013-12-19 20:16:53,461 restart-server: Starting daemons
2013-12-19 20:16:57,146 restart-server: Starting workers
Compare with the behavior on master today:
2013-12-19 20:21:45,281 restart-server: Stopping daemons
2013-12-19 20:21:49,225 restart-server: Starting daemons
2013-12-19 20:21:58,463 restart-server: Done!
(imported from commit b2c1ba77f3dc989551d0939779208465a8410435)
2013-12-19 21:07:02 +01:00
|
|
|
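
# Bring the user-facing core daemons back before the less critical workers.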
logging.info("Starting server core")
subprocess.check_call(["supervisorctl", "start", "zulip-tornado zulip-django zulip-senders:*"])
logging.info("Starting workers")
subprocess.check_call(["supervisorctl", "start", "zulip-workers:*"])
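
# With Apache-based SSO, the Apache WSGI processes (running as 'zulip') must
# also be restarted so they pick up the new code.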
using_sso = subprocess.check_output(['./scripts/get-django-setting', 'USING_APACHE_SSO'])
if using_sso.strip() == b'True':
    logging.info("Restarting Apache WSGI process...")
    subprocess.check_call(["pkill", "-f", "apache2", "-u", "zulip"])

logging.info("Done!")
print(OKGREEN + "Application restarted successfully!" + ENDC)