The previous model for these Nagios checks was kinda crazy -- every
minute, we'd run a full `rabbitmctl list_consumers` for each of the
dozen+ consumers that we have, and then do the exact same parsing
logic for each to determine whether the target queue has a running
consumer to write out a state file.
Because `rabbitmctl list_consumers` takes a small amount of resources,
on systems where CPU is very limited (e.g. t2 style AWS instances),
this minor CPU wastage could be problematic.
Now we just do that `rabbitmqctl list_consumers` once per minute, and
output all the state files from a single command.
Further TODO items on this front include removing the hardcoded list
of queues.
Because rabbitmq doesn't support changing the nodename of a running
rabbitmq node, Zulip installations suffered a plague of issues where
e.g. a Zulip server would reboot, the hostname would change, and
suddenly the local rabbitmq instance being used by Zulip would stop
working.
We address this problem by using, by default, a fixed rabbitmq
nodename, but providing server administrators the option to set the
rabbitmq nodename used by Zulip however they choose.
To upgrade an existing server to use this new configuration, one will
need to add something like the following to /etc/zulip/zulip.conf:
[rabbitmq]
nodename = zulip@localhost
However, I don't believe we have the puppet code in place to make this
work correctly at initial installation without rabbitmq-server being
already installed (but off), as we can easily setup in Travis CI but I
haven't been willing to do for the installer. So for now, this just
fixes our Travis CI problems.
Fixes: #1579.
This reverts commit 3f95e567c1.
Apparently `apt-add-repository` fails periodically in CI. I suspect
this is some sort of silly networking problem, but given that all
we're saving is a few lines of code, the old version was better if
this fails basically ever.
Previously, the install script would fail if you passed various
non-default puppet rules, since the code to configure and restart
services that runs later on in the install script largely ran
unconditionally, regardless of whether the relevant service was
actually installed on the target system.
This should make the main install script reusable for installing
e.g. a dedicated Postgres server for use with Zulip.
This reverts commit f1f48f305e.
The use of sklearn unfortunately caused a substantial slowdown to the
Zulip provisioning process, which didn't seem worth it for a
relatively minor feature.
Apparently, puppet has messed up exit codes and doesn't by default
return the usual 0=success, nonzero=failure codes. By default, it
seems to always return 0; and with `--detailed-exitcodes`, it returns
the complicated thing documented in the comments.
We fix this by checking the exit code and translating it to what we
actually care about, namely whether errors occurred.
See https://tickets.puppetlabs.com/browse/PUP-2754 for details.
Fixes#1094.
In python 3, subprocess uses bytes for input and output if
universal_newlines=False (the default). It uses str for input and
output if universal_newlines=True.
Since we're dealing with strings here, add universal_newlines=True
to subprocess.check_output calls.
This is important for both ensuring the Nagios checks work correctly
in production, as well as making sure the `zulip` user can access the
virtualenv (owned by the `travis` user) in Travis CI.
The manage.py change effectively switches the Zulip production server
to use the virtualenv, since all of our supervisord commands for the
various Python services go through manage.py.
Additionally, this migrates the production scripts and Nagios plugins
to use the virtualenv as well.
Apparently, c74a74dc74 introduced a bug
where we are no longer correctly depending on build-essential as part
of the Zulip development environment installation process.
Fixes#1111.
This is needed because hash_reqs.py is used to create a virtualenv.
Currently we only use virtualenv in development, but we will soon
start using it in production. Scripts used in production should be
put in scripts/.
Camo is a caching image proxy, used in Zulip to avoid mixed-content
warnings by proxying HTTP image content over HTTPS. We've been using
it in zulip.com production for years; this change makes it available
in standalone Zulip deployments.
The main function of prompting inside `manage.py migrate` is to ask
the user if they want to delete stale content-types, which is
unimportant and likely scary, so we disable doing so.
This automatically loads settings, zerver.models.* and
zerver.lib.actions.* when you start `manage.py shell`, which should
save a bit of time basically every time someone uses it.
Fixes#275.
Previously, we used shell quoting that would result in the shell variable not
being substituted. Instead, we use `"`s that will allow for variable
substitution.
Previously these were hardcoded in zproject/settings.py to be accessed
on localhost.
[Modified by Tim Abbott to adjust comments and fix configure-rabbitmq]
A common issue when doing a Zulip upgrade is trying to pass
upgrade-zulip a tarball path under /root, which doesn't work because
the Zulip user doesn't have permission to read the tarball. We
could fix this by just unpacking the tarballs as root, but it seemed
like a nicer approach would be to archive the release tarballs
somewhere readable by the Zulip user (/home/zulip/archives) and unpack
them from there.
Fixes#208.
The point of the lock is to prevent two deployments happening at the
same time and racing with each other, not to prevent doing any future
deployments after an error happens (which is what the current
implementation does in practice).
Addresses part of #208.
The #! line processing interpreted the argument to pass to `env` as
"python2.7 -u", which obviously isn't a real program.
We fix this by setting the PYTHONUNBUFFERED environment variable
inside the program, which has the same effect.
Thanks to Dan Fedele for the bug report and suggested solution!
With this change, we are now testing the production static asset
pipeline and installation process in a new testing job (and also run
the frontend/backend tests separately).
This means that changes that break the Zulip static asset pipeline or
production installation process are more likely to fail tests. The
testing is imperfect in that it does not have proper isolation -- we
build a complete Zulip development environment and then install a
Zulip production environment on top of it, so e.g. any apt
dependencies installed for Zulip development will still be available
for the Zulip production environment. But, it's better than nothing!
A good v2 of this would be to have the production setup process just
install the minimum stuff needed to run `build-release-tarball` and
then uninstall it / clean it up so that we can do a more clear
production installation, but that's more work.
This fixes an annoying issue where one tries to rebuild the database,
and it fails due to there being existing connections.
The one thing that is potentially scary about this implementation is
that it means it's now a lot easier to accidentally drop your
production database by running the wrong script; might be worth adding
a "--force" flag controlling this behavior or something.
Thanks to Nemanja Stanarevic and Neeraj Wahi for prototypes of this
implementation! They did most of the work and testing for this.
This fixes some issues that we've had where commands will fail is
confusing ways after the database is rebuilt because data from before
the database was dropped is still in the memcached cache.
This fixes issue #123. Namely, the script in scripts/setup/install was
returning 0. Adding `set -e` and `set -o pipeline` causes the install
script to exit and return 1 if any part fails, including piping output
(`set -o pipeline` does this).
Most of our installation process is idempotent, but this step in
particular is not, so it's important to provide a clear error message
about how to proceed.
While the docu on https://www.zulip.org/server.html says:
```
cd /root/zulip
./scripts/setup/install
```
This script downloads the `python-django-guardian_1.3-1~zulip4_all.deb` file to current working dir (`/root/zulip` if you follow the docu), but tries to install it from /root/.
This fails obviously. So i changed the download location to /tmp/.
If there's a problem with Django settings then RMQPW would just be
empty, causing more confusing errors downstream.
(imported from commit 5948b1a15eb92fc032ea02e499be58365d8e9ecb)
Source LOCAL_DATABASE_PASSWORD and INITIAL_PASSWORD_SALT from the secrets file.
Fix the creation of pgpass file.
Tim's note: This will definitely break the original purpose of the
tool but it should be pretty easy to add that back as an option.
(imported from commit 8ab31ea2b7cbc80a4ad2e843a2529313fad8f5cf)
The manual step here is that we need to do the `puppet apply` before
pushing this commit, or `restart-server` will crash.
Previously we shut down everything in one group, which performed
poorly with supervisor's bad performance on restarting many daemons at
once. Now we shut down the unimportant stuff, then the important
stuff, bring back the important stuff, and then bring back the
unimportant stuff.
This new model has a little over 5s of downtime for the core
user-facing daemons -- which is still far more than would be ideal,
but a lot less than the 13s or so that we had before.
Here's some logs with the current setup for the tornado/django downtime:
2013-12-19 20:16:51,995 restart-server: Stopping daemons
2013-12-19 20:16:53,461 restart-server: Starting daemons
2013-12-19 20:16:57,146 restart-server: Starting workers
Compare with the behavior on master today:
2013-12-19 20:21:45,281 restart-server: Stopping daemons
2013-12-19 20:21:49,225 restart-server: Starting daemons
2013-12-19 20:21:58,463 restart-server: Done!
(imported from commit b2c1ba77f3dc989551d0939779208465a8410435)
This reverts commit acef4c0027b77053497ef6e9f7aa4b61703205c3.
Despite the lower total downtime, this caused more user-facing downtime.
(imported from commit 5cce032bb20abe83853a65ee72bf0bb28af403cc)
This shaves about 1.5 seconds off our restart time on ls-dev (9s ->
7.5s). Still too slow, but it's a little bit better.
(imported from commit acef4c0027b77053497ef6e9f7aa4b61703205c3)
We don't use apache in the main app -- only for the SSO situation --
this code was just copied from our own install script. And it caused
problems at CUSTOMER13 because they installed Apache in preparation for
the SSO integration, but restarting it failed.
(imported from commit 3f2961574134847c836e8b69736f60d9f8790201)
supervisord may start up during the install process and do a bunch of
incorrect stuff, with the net effect of creating files in there owned
by root.
(imported from commit 28379af9680bf9d3c72da196f329abdf8c82c6be)
We can probably later merge the create-database code with that of our
internal do-destroy-rebuild-database.
(imported from commit 323932dbf2eb916545d6ebdda70eb1f5e1abb181)
We really should fix this in supervisor itself, since in particular we
lose this setting every time the system is rebooted.
(imported from commit a700078b158808340f5f30812235449c74508cde)