46 Commits

Author SHA1 Message Date
Tim Abbott e9b3ac3f34 travis: Remove now-unnecessary items from apt-mark hold lists.
This seems to have been causing the travis production suite to fail.
It's a direct consequence of removing travis' giant library of apt
sources.list files; now that those are gone, there aren't copies of
all these extra packages available anyway.
2017-11-30 14:56:36 -08:00
Tim Abbott 054952a44a docs: Update links from codebase to point to ReadTheDocs. 2017-11-16 10:53:49 -08:00
Greg Price 835c2ca1ce travis: Cut /root/zulip out of prod test path, just to be sure.
Our previous dependencies on the `/root/zulip` path should all be
long gone at this point.  Run our production-install test suite through
a fresh temporary path instead, mainly just to avoid causing any confusion
over whether that's quite the case.
2017-08-15 17:41:07 -07:00
Tim Abbott 84da22da67 travis: Improve production test for HTTP success to not check length.
This will make life a lot easier when iterating on the login page.
2017-08-15 10:51:29 -07:00
Rishi Gupta 85d38bd17b emails: Remove DEFAULT_FROM_EMAIL from prod_settings_template. 2017-06-29 17:54:33 -07:00
Aditya Bansal b19134d4c2 travis: Remove docker-engine from held package list in production.
We are doing this to avoid trouble when Travis updates its trusty
images. See the blog post:
https://blog.travis-ci.com/2017-06-19-trusty-updates-2017-Q2
2017-06-20 06:45:40 -04:00
Aditya Bansal fab4a30cce production-helper: Add wrapper to retry apt-get dist-upgrade.
This commit adds a conditional inside production-helper to re-run
`apt-get dist-upgrade` if it fails the first time.
2017-06-19 16:14:11 -04:00
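The retry described in fab4a30cce can be sketched roughly as follows. This is a minimal illustration in Python, not the actual production-helper code; the function name `run_with_retry` and its parameters are hypothetical.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=2, delay=0.0):
    """Run cmd, re-running it if it exits non-zero; return the final exit code."""
    returncode = 1
    for attempt in range(1, attempts + 1):
        returncode = subprocess.run(cmd).returncode
        if returncode == 0:
            break
        if attempt < attempts:
            # Optional pause before the next try (hypothetical knob).
            time.sleep(delay)
    return returncode
```

A caller would pass the command as a list, for example `run_with_retry(["apt-get", "-y", "dist-upgrade"])`, and treat a non-zero return value as a real failure.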
Aditya Bansal 8e33d9e48b travis: Update held packages list in production tests. 2017-06-13 11:12:26 -07:00
Tim Abbott 2215af4b57 docs: Add a bunch of documentation on Travis CI. 2017-06-06 13:39:51 -07:00
Tim Abbott 5db3f60c7d travis: Temporarily disable failing Nagios tests.
Apparently Travis CI has a very strange issue today that causes our
Nagios/E2E tests to have Tornado failing to connect to RabbitMQ.
Causes unknown, but I've spent a day trying to debug this without
luck, and we need our test suites passing in the meantime.
2017-03-15 22:01:04 -07:00
K.Kanakhin ea4b9cb609 production-helper: Fix sorting queue workers by name.
- The shell command `sort` depends on the system locale, and the
  symbol `_` (underscore) sorts after letters in the default locale
  of Travis instances. Python sorting, by contrast, uses its own
  symbol ordering, in which underscore sorts before letters.
  Changing the locale for the `sort` shell command by adding an
  environment variable doesn't work on Travis instances, so it was
  decided to apply the same sorting command to both of the lists
  being compared.
2017-03-07 20:05:53 -08:00
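The ordering issue in ea4b9cb609 can be illustrated from the Python side. Python compares strings by code point, so `_` (U+005F) sorts before lowercase letters. The sketch below uses made-up queue names and a hypothetical helper, `same_queues`, to show the general fix of applying one ordering rule to both lists before comparing them.

```python
def same_queues(expected, actual):
    """Compare two lists of queue names using the same ordering rule for both."""
    return sorted(expected) == sorted(actual)


# Hypothetical queue names, chosen so that the underscore decides the order.
names = ["useractivity", "user_presence"]

# Code-point ordering puts "user_presence" first because "_" < "a".
print(sorted(names))
```

Sorting both inputs with the same function sidesteps any disagreement between the locale-aware `sort` command and Python's ordering.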
Tim Abbott aa6567ee34 queue_workers: Fix confusing --queue_type argument name. 2017-02-22 00:23:26 -08:00
Tim Abbott 2768be7fdf travis: Add logs to rabbitmq consumer debug output. 2017-02-22 00:21:22 -08:00
Tim Abbott 19896460f0 nagios: Fix RabbitMQ Nagios checks running Django as root.
This can cause problems by making the /var/log/zulip files owned by
root (not zulip) and thus not writable by the Zulip user.
2017-02-22 00:20:57 -08:00
Tim Abbott 306b29d414 travis: Improve debuggability of rabbitmq errors. 2017-02-19 23:44:40 -08:00
Tim Abbott 4f2a2a39f9 travis: Increase the supervisor sleep to 15s for now.
The hope is this will help us investigate failures like
https://travis-ci.org/zulip/zulip/jobs/203335213.
2017-02-19 23:41:42 -08:00
Tim Abbott 620f1e444e travis: Fix various bugs in new queue worker test.
* Now queue_workers.py sorts queue names and prints them on their own
  line.  Previously its output was nondeterministic.
* Simplified grep strategy for removing the "test" worker.
2017-02-19 21:17:42 -08:00
Tim Abbott 42cbce9ad6 travis: Add special check for queue processor lists matching. 2017-02-19 16:19:55 -08:00
Tim Abbott d6bbcd2737 travis: Automate updates to production-helper Nagios test.
This list was likely to end up out of date quickly, since it wasn't
documented that you need to update it when adding a queue.  The best
solution is to just not require it to be updated.
2017-02-19 16:19:53 -08:00
Tim Abbott 3a3a1872e7 travis: Prevent upgrading oracle-java9-installer.
This fixes a minor performance problem, and also avoids errors when
Oracle's Java installer site is down.
2017-01-06 19:30:16 -08:00
K.Kanakhin 0d8c18a6dd nagios-plugins: Add websocket checking to nagios message sending test.
- Add websocket client to create connection with SockJS websocket server.
  It contains callback method to launch after connection setup.
- Add '--websocket' parameter to 'check_send_receive_time' script to
  check websocket connection.
- Add websocket connection testing to the production installation checks.
- Add cronjob to launch websocket connection nagios test.

This makes it possible for Zulip Nagios monitoring to check for
problems impacting the websockets sending code path, which is what all
web users use.
2016-12-30 15:36:37 -08:00
Umair Khan 6c1d805495 travis: Fix production suite flakiness.
Previously, we were doing this request to the production server before
waiting for all the supervisord processes to start; it's possible this
could cause failures where we hit the server before the Django uwsgi
processes are up.

Hopefully fixes #2723.
2016-12-15 22:04:57 -08:00
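The fix in 6c1d805495 amounts to waiting for readiness before making the request. A generic polling helper in that spirit might look like the sketch below; `wait_until` and its parameters are hypothetical and are not taken from the Zulip test scripts.

```python
import time


def wait_until(predicate, timeout=60.0, interval=1.0):
    """Poll predicate() until it returns True or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The test would call something like `wait_until(all_processes_running)` first, where `all_processes_running` is a caller-supplied check, and only then send the HTTP request.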
Tim Abbott 15bfedec99 travis: Improve debuggability of server wget failures.
The main improvement here is causing `wget` errors to be ignored so
that we see the server logs in the event of a `wget` failure.
2016-12-03 20:48:57 -08:00
Tim Abbott 1a8a329b44 production-helper: Expand the apt-mark hold list. 2016-12-01 12:29:31 -08:00
Vishnu Ks a7ead9e99d settings: Eliminate ADMIN_DOMAIN for creating initial realm.
We now use `./manage.py generate_realm_creation_link` as the flow
for creating one's first realm.
2016-08-25 09:37:33 -07:00
Tim Abbott 88a123d5e0 Fix excessive CPU usage by rabbitmq-numconsumers Nagios checks.
The previous model for these Nagios checks was kinda crazy -- every
minute, we'd run a full `rabbitmqctl list_consumers` for each of the
dozen+ consumers that we have, and then do the exact same parsing
logic for each to determine whether the target queue has a running
consumer to write out a state file.

Because `rabbitmqctl list_consumers` takes a small amount of resources,
on systems where CPU is very limited (e.g. t2 style AWS instances),
this minor CPU wastage could be problematic.

Now we just do that `rabbitmqctl list_consumers` once per minute, and
output all the state files from a single command.

Further TODO items on this front include removing the hardcoded list
of queues.
2016-08-12 14:09:36 -07:00
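The single-pass approach from 88a123d5e0 can be sketched as below. This assumes, for illustration only, that the captured command output is tab-separated with the queue name in the first column; the function name `count_consumers` and the queue names in the usage note are hypothetical.

```python
from collections import Counter


def count_consumers(output, queues):
    """Count consumers per queue from one captured command output."""
    seen = Counter()
    for line in output.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip banner or blank lines (assumed to have no tabs)
        seen[fields[0]] += 1
    # Queues with no matching line get a count of 0.
    return {queue: seen[queue] for queue in queues}
```

With the counts for every queue available from one call, a caller can write out each queue's state file in a single loop instead of re-running the command per queue.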
Tim Abbott 6496fe2a53 travis: Remove rabbitmq nodename dependency on hostname.
Because rabbitmq doesn't support changing the nodename of a running
rabbitmq node, Zulip installations suffered a plague of issues where
e.g. a Zulip server would reboot, the hostname would change, and
suddenly the local rabbitmq instance being used by Zulip would stop
working.

We address this problem by using, by default, a fixed rabbitmq
nodename, but providing server administrators the option to set the
rabbitmq nodename used by Zulip however they choose.

To upgrade an existing server to use this new configuration, one will
need to add something like the following to /etc/zulip/zulip.conf:

[rabbitmq]
nodename = zulip@localhost

However, I don't believe we have the puppet code in place to make this
work correctly at initial installation unless rabbitmq-server is
already installed (but off), which we can easily set up in Travis CI
but which I haven't been willing to do for the installer.  So for now,
this just fixes our Travis CI problems.

Fixes: #1579.
2016-08-12 09:38:23 -07:00
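Reading the `[rabbitmq]` setting from 6496fe2a53 with a fallback could look like the sketch below. It uses Python's standard `configparser`; the helper name `rabbitmq_nodename` is hypothetical, and the default value is an assumption borrowed from the example in the commit message rather than a confirmed Zulip default.

```python
import configparser

# Assumed default, taken from the example in the commit message.
DEFAULT_NODENAME = "zulip@localhost"


def rabbitmq_nodename(conf_text):
    """Return the configured nodename, or the default if none is set."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.get("rabbitmq", "nodename", fallback=DEFAULT_NODENAME)
```

An administrator's override in the config text wins; a file without the section falls back to the fixed name.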
Tim Abbott c7059c9751 travis: Update success-http-headers to match current certs.
Travis CI seems to have changed the way the snakeoil SSL certs are
generated in their infrastructure, so we need to update our expected
"success" HTTP headers accordingly.
2016-08-12 09:35:41 -07:00
Tim Abbott a648513580 production-helper: Remove /root/zulip during setup process.
This fixes a problem that caused production-helper to not be
idempotent.
2016-08-11 22:21:13 -07:00
Tim Abbott 7011c94465 production-helper: Use ln -nsf to install snakeoil symlinks.
This fixes a problem where production-helper was not idempotent.
2016-08-11 22:20:24 -07:00
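The idempotency point in 7011c94465 is that creating a symlink fails on a second run if the link already exists, unless the existing link is replaced. A rough Python analogue of that behavior is sketched below; `force_symlink` is a hypothetical name, and the sketch does not reproduce every detail of `ln -nsf` (for instance, it raises an error if the path is a real directory).

```python
import os


def force_symlink(target, link_path):
    """Create link_path -> target, replacing any existing file or symlink."""
    # lexists() is True even for a dangling symlink, which exists() would miss.
    if os.path.lexists(link_path):
        os.remove(link_path)
    os.symlink(target, link_path)
```

Calling the function twice with different targets leaves the link pointing at the most recent one, so a setup script that uses it can be re-run safely.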
Tim Abbott b3a768f4b2 settings: Improve ALLOWED_HOSTS defaults logic and docs.
This removes the requirement for the user to put localhost/127.0.0.1
in their ALLOWED_HOSTS list, since it is now added automatically.

Fixes: #1358.
2016-08-05 21:25:29 -07:00
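The behavior described in b3a768f4b2 can be pictured as follows. This is an illustrative sketch only, not the actual settings code; the function name `build_allowed_hosts` and the example hostname are hypothetical.

```python
def build_allowed_hosts(configured):
    """Return the configured hosts plus the loopback names, without duplicates."""
    hosts = list(configured)
    for loopback in ("localhost", "127.0.0.1"):
        if loopback not in hosts:
            hosts.append(loopback)
    return hosts
```

For example, `build_allowed_hosts(["zulip.example.com"])` yields the configured host followed by `localhost` and `127.0.0.1`, so the administrator no longer has to list the loopback entries by hand.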
Umair Khan 1a6e8282c8 Run 'check_send_receive_time' as 'zulip' user.
Run '/puppet/zulip/files/nagios_plugins/zulip_app_frontend/check_send_receive_time'
script as 'zulip' user so that the connection to the database can be
made correctly.
2016-07-28 13:39:29 -07:00
Tim Abbott 039c175d68 production-helper: Hold tons of packages.
This saves almost a minute doing apt upgrades in the production test
suite.
2016-06-22 10:41:09 -07:00
Tim Abbott 0f2729f5fb production-helper: use dist-upgrade to match install script.
Previously, we were wasting time every time we installed packages,
because `apt-get upgrade` would only install most of the packages
`apt-get dist-upgrade` would.
2016-06-22 10:38:27 -07:00
Tim Abbott 6c744564a7 travis: Add debugging code for rabbitmq nagios failures. 2016-05-09 09:55:18 -07:00
Tim Abbott 804dad42e6 travis: Run various Nagios checks in production tests. 2016-05-08 17:35:50 -07:00
Tim Abbott 744e8ad0e3 travis: Set prod EXTERNAL_HOST to resolve correctly.
This is needed to use check_send_receive_time in the tests.
2016-05-08 17:35:50 -07:00
Tim Abbott e4c098fba4 travis: Verify all supervisord jobs are running in production test.
This requires a bit of complexity since supervisord automatically
restarts failing jobs.
2016-05-08 17:35:50 -07:00
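One way to cope with the automatic restarts mentioned in e4c098fba4 is to take two status samples and flag any job that was not running in both, or whose pid changed in between. The sketch below assumes `supervisorctl status` style lines (name, state, then `pid N,` for running jobs); the helper names and job names are hypothetical, and this is not necessarily the approach the commit took.

```python
import re

# name, state, and an optional "pid N," field for running jobs
STATUS_LINE = re.compile(r"^(\S+)\s+(\S+)(?:\s+pid (\d+),)?")


def parse_status(output):
    """Map each job name to its (state, pid) pair; pid is None if absent."""
    jobs = {}
    for line in output.splitlines():
        match = STATUS_LINE.match(line)
        if match:
            name, state, pid = match.groups()
            jobs[name] = (state, pid)
    return jobs


def unstable_jobs(first, second):
    """Names not RUNNING in both samples, or whose pid changed between them."""
    before, after = parse_status(first), parse_status(second)
    bad = []
    for name in sorted(set(before) | set(after)):
        state_a, pid_a = before.get(name, ("MISSING", None))
        state_b, pid_b = after.get(name, ("MISSING", None))
        if state_a != "RUNNING" or state_b != "RUNNING" or pid_a != pid_b:
            bad.append(name)
    return bad
```

A changed pid between samples indicates that the job was restarted, even though both samples report it as running.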
Tim Abbott 40de75d9e6 travis: Verify the server doesn't 500 in production test. 2016-05-08 17:35:50 -07:00
Tim Abbott 52c1e8ac7d Run a local camo server in voyager production environments.
Camo is a caching image proxy, used in Zulip to avoid mixed-content
warnings by proxying HTTP image content over HTTPS.  We've been using
it in zulip.com production for years; this change makes it available
in standalone Zulip deployments.
2016-05-02 17:21:31 -07:00
Tim Abbott 48a578d003 travis: hold expensive-to-upgrade packages in Travis CI.
This should save a few minutes of time running the production test
suite.  This is part of solving #722.
2016-05-02 16:59:21 -07:00
Tim Abbott 79327a61ae travis: Do an apt-get update before the apt upgrade.
This should save several minutes off the Travis CI `production`
suite's runtime, since previously we were doing the full apt upgrade
process twice, resulting in things like multiple expensive rebuilds of
the initramfs.
2016-05-02 16:35:46 -07:00
Tim Abbott 6943a142ea Fix postgres errors in Travis CI again.
Travis CI's model of installing every version of postgres on the test
VM and then shutting all the versions other than the one requested
down seems to not work very well with doing apt upgrades.  It seems
the best way to resolve this is to just uninstall the versions we
don't need.
2016-01-21 22:07:10 -08:00
Tim Abbott a98b0cf35d travis: Workaround postgres 9.1 conflict issues on trusty.
We ran into a bug with the Travis CI infrastructure where postgres
9.1 is installed on the system, and so when we'd do an apt upgrade
with a new version of 9.1, the 9.1 daemon would end up getting started
and conflict with the 9.3 daemon we were trying to run.
2016-01-09 16:59:43 -08:00
Tim Abbott 2be7ac8d70 travis: Fix prompting for user input in production-helper. 2015-12-07 20:33:36 -08:00
Tim Abbott 6eb670097c Expand testing done via Travis CI to cover production pipeline.
With this change, we are now testing the production static asset
pipeline and installation process in a new testing job (and also run
the frontend/backend tests separately).

This means that changes that break the Zulip static asset pipeline or
production installation process are more likely to fail tests.  The
testing is imperfect in that it does not have proper isolation -- we
build a complete Zulip development environment and then install a
Zulip production environment on top of it, so e.g. any apt
dependencies installed for Zulip development will still be available
for the Zulip production environment.  But, it's better than nothing!

A good v2 of this would be to have the production setup process just
install the minimum stuff needed to run `build-release-tarball` and
then uninstall it / clean it up so that we can do a cleaner
production installation, but that's more work.
2015-11-01 18:11:39 -08:00