Our previous dependencies on the `/root/zulip` path should all be
long gone at this point. Run our production-install test suite through
a fresh temporary path instead, mainly just to avoid causing any confusion
over whether that's quite the case.
Apparently Travis CI has a very strange issue today that causes our
Nagios/E2E tests to have Tornado failing to connect to RabbitMQ.
Causes unknown, but I've spent a day trying to debug this without
luck, and we need our test suites passing in the meantime.
- Shell command `sort` depends on system locale and
symbol `_` (underscore) is located after letters
in the default locale of Travis instances. Otherwise,
python sorting uses its own symbols ordering and
underscore is located before letters. Changing the
locale for `sort` shell command with adding environment
variable doesn' t work on Travis instances. That's why
It was decided to apply the same sorting command to both
comparing lists.
* Now queue_workers.py sorts queue names and prints them on their own
line. Previously it's output was nondeterministic.
* Simplified grep strategy for removing the "test" worker.
This list was likely to end up out of date quickly, since it wasn't
documented that you need to update it when adding a queue. The best
solution is to just not require it to be updated.
- Add websocket client to create connection with SockJS websocket server.
It contains callback method to launch after connection setup.
- Add '--websocket' parameter to 'check_send_receive_time' script to
check websocket connection.
- Add testing websocket connection to production installation checking.
- Add cronjob to launch websocket connection nagios test.
This makes it possible for Zulip Nagios monitoring to check for
problems impacting the websockets sending code path, which is what all
web users use.
Previously, we were doing this request to the production server before
waiting for all the supervisord processes to start; it's possible this
could cause failures where we hit the server before the Django uwsgi
processes are up.
Hopefully fixes#2723.
The previous model for these Nagios checks was kinda crazy -- every
minute, we'd run a full `rabbitmctl list_consumers` for each of the
dozen+ consumers that we have, and then do the exact same parsing
logic for each to determine whether the target queue has a running
consumer to write out a state file.
Because `rabbitmctl list_consumers` takes a small amount of resources,
on systems where CPU is very limited (e.g. t2 style AWS instances),
this minor CPU wastage could be problematic.
Now we just do that `rabbitmqctl list_consumers` once per minute, and
output all the state files from a single command.
Further TODO items on this front include removing the hardcoded list
of queues.
Because rabbitmq doesn't support changing the nodename of a running
rabbitmq node, Zulip installations suffered a plague of issues where
e.g. a Zulip server would reboot, the hostname would change, and
suddenly the local rabbitmq instance being used by Zulip would stop
working.
We address this problem by using, by default, a fixed rabbitmq
nodename, but providing server administrators the option to set the
rabbitmq nodename used by Zulip however they choose.
To upgrade an existing server to use this new configuration, one will
need to add something like the following to /etc/zulip/zulip.conf:
[rabbitmq]
nodename = zulip@localhost
However, I don't believe we have the puppet code in place to make this
work correctly at initial installation without rabbitmq-server being
already installed (but off), as we can easily setup in Travis CI but I
haven't been willing to do for the installer. So for now, this just
fixes our Travis CI problems.
Fixes: #1579.
Travis CI seems to have changed the way the snakeoil SSL certs are
generated in their infrastructure, so we need to update our expected
"success" HTTP headers accordingly.
Run '/puppet/zulip/files/nagios_plugins/zulip_app_frontend/check_send_receive_time'
script as 'zulip' user so that the connection to the database can be
made correctly.
Previously, we were wasting time every time we installed packages,
because `apt-get upgrade` would only install most of the packages
`apt-get dist-upgrade` would.
Camo is a caching image proxy, used in Zulip to avoid mixed-content
warnings by proxying HTTP image content over HTTPS. We've been using
it in zulip.com production for years; this change makes it available
in standalone Zulip deployments.
This should save several minutes off the Travis CI `production`
suite's runtime, since previously we were doing the full apt upgrade
process twice, resulting in things like multiple expensive rebuilds of
the initramfs.
Travis CI's model of installing every version of postgres on the test
VM and then shutting all the versions other than the one requested
down seems to not work very well with doing apt upgrades. It seems
the best way to resolve this is to just uninstall the versions we
don't need.
We ran into a bug with the Travis CI infrastructure where it postgres
9.1 is installed on the system, and so when we'd do an apt upgrade
with a new version of 9.1, the 9.1 daemon would end up getting started
and conflict with the 9.3 daemon we were trying to run.
With this change, we are now testing the production static asset
pipeline and installation process in a new testing job (and also run
the frontend/backend tests separately).
This means that changes that break the Zulip static asset pipeline or
production installation process are more likely to fail tests. The
testing is imperfect in that it does not have proper isolation -- we
build a complete Zulip development environment and then install a
Zulip production environment on top of it, so e.g. any apt
dependencies installed for Zulip development will still be available
for the Zulip production environment. But, it's better than nothing!
A good v2 of this would be to have the production setup process just
install the minimum stuff needed to run `build-release-tarball` and
then uninstall it / clean it up so that we can do a more clear
production installation, but that's more work.