# Troubleshooting and monitoring

Zulip uses [Supervisor](http://supervisord.org/index.html) to monitor
and control its many Python services. Read the next section, [Using
supervisorctl](#using-supervisorctl), to learn how to use the
Supervisor client to monitor and manage services.

If you haven't already, now might be a good time to read Zulip's
[architectural overview](../overview/architecture-overview.md),
particularly the
[Components](../overview/architecture-overview.html#components)
section. This will help you understand the many services Zulip uses.

If you encounter issues while running Zulip, take a look at Zulip's
logs, which are located in `/var/log/zulip/`. That directory contains
one log file for each service, plus `errors.log` (all errors),
`server.log` (logs from the Django and Tornado servers), and
`workers.log` (combined logs from the queue workers).

The section [troubleshooting services](#troubleshooting-services) on
this page includes details about how to fix common issues with Zulip
services.
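
When chasing a problem across services, it helps to be able to pull
the most recent tracebacks out of a log file quickly. A small sketch;
the helper name `last_tracebacks` is ours, not something Zulip ships,
and it assumes the standard log layout described above:

```
# Print the last few "Traceback" lines (with line numbers) from a log
# file, so you can jump straight to the most recent failures.
last_tracebacks() {
    grep -n "Traceback" "$1" | tail -n 5
}

# Typical use on a Zulip host:
#   last_tracebacks /var/log/zulip/errors.log
```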

If you run into additional problems, [please report
them](https://github.com/zulip/zulip/issues) so that we can update
this page! The Zulip installation script logs its full output to
`/var/log/zulip/install.log`, so please include the context for any
tracebacks from that log.

## Using supervisorctl

To see which Zulip-related services are configured to use Supervisor,
look at `/etc/supervisor/conf.d/zulip.conf` and
`/etc/supervisor/conf.d/zulip-db.conf`.

Use the Supervisor client `supervisorctl` to list the status of, stop,
start, and restart various services.

### Checking status with `supervisorctl status`

You can check if the Zulip application is running using:

```
supervisorctl status
```

When everything is running as expected, you will see something like
this:

```
process-fts-updates RUNNING pid 2194, uptime 1:13:11
zulip-django RUNNING pid 2192, uptime 1:13:11
zulip-tornado RUNNING pid 2193, uptime 1:13:11
zulip-workers:zulip-events-confirmation-emails RUNNING pid 2199, uptime 1:13:11
zulip-workers:zulip-events-digest_emails RUNNING pid 2205, uptime 1:13:11
zulip-workers:zulip-events-email_mirror RUNNING pid 2203, uptime 1:13:11
zulip-workers:zulip-events-error_reports RUNNING pid 2200, uptime 1:13:11
zulip-workers:zulip-events-missedmessage_mobile_notifications RUNNING pid 2204, uptime 1:13:11
zulip-workers:zulip-events-missedmessage_reminders RUNNING pid 2206, uptime 1:13:11
zulip-workers:zulip-events-signups RUNNING pid 2198, uptime 1:13:11
zulip-workers:zulip-events-slowqueries RUNNING pid 2202, uptime 1:13:11
zulip-workers:zulip-events-user-activity RUNNING pid 2197, uptime 1:13:11
zulip-workers:zulip-events-user-activity-interval RUNNING pid 2196, uptime 1:13:11
zulip-workers:zulip-events-user-presence RUNNING pid 2195, uptime 1:13:11
```

If you see any services showing a status other than `RUNNING`, or you
see an uptime under 5 seconds (which indicates it's crashing
immediately after startup and repeatedly restarting), that service
isn't running. If you don't see relevant logs in
`/var/log/zulip/errors.log`, check the log file declared via
`stdout_logfile` for that service's entry in
`/etc/supervisor/conf.d/zulip.conf` for details. Logs only make it to
`/var/log/zulip/errors.log` once a service has started fully.
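
If you want to automate this check (say, from cron or a monitoring
script), a minimal sketch that flags any non-`RUNNING` service; the
`check_running` helper is illustrative, not something Zulip ships:

```
# Read `supervisorctl status` output on stdin and report any service
# whose state column is not RUNNING; exits nonzero if one is found.
check_running() {
    awk '$2 != "RUNNING" { print "not running: " $1; bad = 1 }
         END { exit bad }'
}

# Typical use on a Zulip host:
#   supervisorctl status | check_running || echo "investigate!"
```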

### Restarting services with `supervisorctl restart all`

After you change configuration in `/etc/zulip/settings.py` or fix a
misconfiguration, you will often want to restart the Zulip
application. You can restart Zulip using:

```
supervisorctl restart all
```

### Stopping services with `supervisorctl stop all`

Similarly, you can stop Zulip using:

```
supervisorctl stop all
```

## Troubleshooting services

The Zulip application uses several major open source services to store
and cache data, queue messages, and otherwise support the Zulip
application:

* PostgreSQL
* RabbitMQ
* Nginx
* Redis
* memcached
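
A quick way to check all of these at once is to ask systemd for each
unit's state. This is a sketch, not an official Zulip tool, and the
unit names below match a stock Ubuntu install (adjust for your
distribution):

```
# Print each supporting service with its systemd state ("active",
# "inactive", "failed", or blank if systemctl is unavailable).
check_system_services() {
    for svc in postgresql rabbitmq-server nginx redis-server memcached; do
        printf '%-16s %s\n' "$svc" "$(systemctl is-active "$svc" 2>/dev/null || true)"
    done
}

# On a healthy Zulip host, every line should read "active".
```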

If one of these services is not installed or functioning correctly,
Zulip will not work. Below we detail some common configuration
problems and how to resolve them:

* If your browser reports no webserver is running, that is likely
  because nginx is not configured properly and thus failed to start.
  nginx will fail to start if you configured SSL incorrectly or did
  not provide SSL certificates. To fix this, configure them properly
  and then run:
  ```
  service nginx restart
  ```

* If your host is being port scanned by unauthorized users, you may
  see messages in `/var/log/zulip/server.log` like
  ```
  2017-02-22 14:11:33,537 ERROR Invalid HTTP_HOST header: '10.2.3.4'. You may need to add u'10.2.3.4' to ALLOWED_HOSTS.
  ```
  Django uses the hostnames configured in `ALLOWED_HOSTS` to identify
  legitimate requests and block others. When an incoming request does
  not have the correct HTTP Host header, Django rejects it and logs
  the attempt. For more on this issue, see the [Django release notes
  on Host header
  poisoning](https://www.djangoproject.com/weblog/2013/feb/19/security/#s-issue-host-header-poisoning).

* An `AMQPConnectionError` traceback or an error running `rabbitmqctl`
  usually means that RabbitMQ is not running; to fix this, try:
  ```
  service rabbitmq-server restart
  ```
  If RabbitMQ fails to start, the problem is often that you are using
  a virtual machine with broken DNS configuration; you can often
  correct this by configuring `/etc/hosts` properly.
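
To check whether the DNS issue described above applies to your
machine, verify that its hostname resolves (RabbitMQ derives its node
name from the hostname). A sketch; `host_resolves` is our name, not a
standard utility:

```
# Succeeds (exit 0) when the given name resolves via the system
# resolver, which consults /etc/hosts as well as DNS.
host_resolves() {
    getent hosts "$1" > /dev/null
}

# Typical check on the server:
#   host_resolves "$(hostname)" || echo "add $(hostname) to /etc/hosts"
```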

### Restrict unattended upgrades

```eval_rst
.. important::
    We recommend that you `disable or limit Ubuntu's unattended-upgrades
    to skip some server packages
    <https://linoxide.com/ubuntu-how-to/enable-disable-unattended-upgrades-ubuntu-16-04/>`__;
    if you disable them, do not forget to regularly install apt upgrades
    manually. With unattended upgrades enabled but not limited, the
    moment a new Postgres release is published, your Zulip server will
    have its Postgres server upgraded (and thus restarted).
```

Restarting one of the system services that Zulip uses (Postgres,
memcached, Redis, or RabbitMQ) will drop the connections that Zulip
processes have to the service, resulting in future operations on those
connections throwing errors.

Zulip is designed to recover from system service downtime by creating
new connections once the system service is back up, so the Zulip
outage will end once the system service finishes restarting. But
you'll get a bunch of error emails during the system service outage
whenever one of the Zulip server's ~20 workers attempts to access the
system service.

An unplanned outage will also result in an annoying (and potentially
confusing) trickle of error emails over the following hours or days.
These emails happen because a worker only learns its connection was
dropped when it next tries to access the connection (at which point
it'll send an error email and make a new connection), and several
workers are commonly idle for periods of hours or days at a time.

You can prevent this trickle when doing a planned upgrade by
restarting the Zulip server with
`/home/zulip/deployments/current/scripts/restart-server` after
installing system package updates to Postgres, memcached, RabbitMQ, or
Redis.

You can ensure that the `unattended-upgrades` package never upgrades
PostgreSQL, memcached, Redis, or RabbitMQ by adding the following to
`/etc/apt/apt.conf.d/50unattended-upgrades`:

```
// Python regular expressions, matching packages to exclude from upgrading
Unattended-Upgrade::Package-Blacklist {
    "libc\d+";
    "memcached$";
    "nginx-full$";
    "postgresql-\d+$";
    "rabbitmq-server$";
    "redis-server$";
    "supervisor$";
};
```

## Monitoring

Chat is mission-critical to many organizations. This section contains
advice on monitoring your Zulip server to minimize downtime.

First, we should highlight that Zulip sends Django error emails to
`ZULIP_ADMINISTRATOR` for any backend exceptions. A properly
functioning Zulip server shouldn't send any such emails, so it's worth
reporting/investigating any that you do see.

Beyond that, the most important monitoring for a Zulip server is
standard stuff:

* Basic host health monitoring for issues running out of disk space,
  especially for the database and where uploads are stored.
* Service uptime and standard monitoring for the [services Zulip
  depends on](#troubleshooting-services). Most monitoring software
  has standard plugins for Nginx, Postgres, Redis, RabbitMQ, and
  memcached, and those will work well with Zulip.
* `supervisorctl status` showing all services `RUNNING`.
* Checking for processes being OOM killed.
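
The disk-space item above can be scripted in a few lines; the
`disk_usage_pct` helper below is illustrative (any real monitoring
system has an equivalent built in):

```
# Report the percentage used on the filesystem holding a path, using
# POSIX `df -P` output (column 5 is capacity, e.g. "42%").
disk_usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Example cron-style alert (path and threshold are illustrative):
#   [ "$(disk_usage_pct /var/lib/postgresql)" -lt 90 ] || echo "disk nearly full"
```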

Beyond that, Zulip ships a few application-specific end-to-end health
checks. The Nagios plugins `check_send_receive_time`,
`check_rabbitmq_queues`, and `check_rabbitmq_consumers` are generally
sufficient to point to the cause of any Zulip production issue. See
the next section for details.

### Nagios configuration

The complete Nagios configuration (sans secret keys) used to monitor
zulip.com is available under `puppet/zulip_ops` in the Zulip Git
repository (those files are not installed in the release tarballs).

The Nagios plugins used by that configuration are installed
automatically by the Zulip installation process in subdirectories
under `/usr/lib/nagios/plugins/`. The following is a summary of the
useful Nagios plugins included with Zulip and what they check:

Application server and queue worker monitoring:

* `check_send_receive_time`: Sends a test message through the system
  between two bot users to check that end-to-end message sending
  works. An effective end-to-end check for Zulip's Django and Tornado
  systems being healthy.
* `check_rabbitmq_consumers` and `check_rabbitmq_queues`: Effective
  checks for Zulip's RabbitMQ-based queuing systems being healthy.
* `check_worker_memory`: Monitors for memory leaks in queue workers.
* `check_email_deliverer_backlog` and `check_email_deliverer_process`:
  Monitors whether scheduled outgoing emails (e.g. invitation
  reminders) are being sent properly.

Database monitoring:

* `check_fts_update_log`: Checks whether full-text search updates are
  being processed properly or getting backlogged.
* `check_postgres`: General checks for database health.
* `check_postgresql_backup`: Checks the status of Postgres backups.
* `check_postgres_replication_lag`: Checks whether Postgres streaming
  replication is up to date.

Standard server monitoring:

* `check_website_response.sh`: Basic HTTP check.
* `check_debian_packages`: Checks whether the system is behind on
  `apt upgrade`.

If you're using these plugins, bug reports and pull requests to make
it easier to monitor Zulip and maintain it in production are
encouraged!

## Memory leak mitigation

As a measure to mitigate the potential impact of any future memory
leak bugs in one of the Zulip daemons, the Zulip service automatically
restarts itself early every Sunday morning. See
`/etc/cron.d/restart-zulip` for the precise configuration.