Commit Graph

1425 Commits

Author SHA1 Message Date
Matt Keller 91e5ae84ac uwsgi: Increase timeout before harakiri.
Some legitimate requests in Zulip can take more than 20s to be
processed, and we don't have a current problem where having a 20s
limit here is preventing a problem.
2022-08-23 15:28:10 -07:00
Alex Vandiver a9183d2208 grafana: Enable auto-sign-up.
This avoids the need to explicitly create new users in Grafana, by
simply trusting Teleport.
2022-07-19 17:52:17 -07:00
Alex Vandiver 9bd88a93e2 puppet: Tell needrestart to not default to restarting core services.
The `needrestart` tool added in 22.04 is useful in terms of listing
which services may need to be restarted to pick up updated libraries.
However, it prompts about the current state of services needing
restart for *every* subsequent `apt-get upgrade`, and defaulting core
services to restarting requires carefully manually excluding them
every time, at risk of causing an unscheduled outage.

Build a list of default-off services based on the list in
unattended-upgrades.
2022-07-19 17:51:18 -07:00
Alex Vandiver 7ae3708c02 teleport: Add explicit WebAuthn config, not just U2F.
WebAuthn is the default, replacing U2F, in Teleport 10 and above[1].
While Teleport can derive a WebAuthn configuration from a U2F
configuration[2], it's useful to be explicit.

[1]: https://goteleport.com/docs/access-controls/guides/webauthn/
[2]: https://goteleport.com/docs/access-controls/guides/webauthn/#u2f
2022-07-18 11:41:00 -07:00
Alex Vandiver 9d29c46078 puppet: Upgrade Grafana, Prometheus and redis_exporter. 2022-07-15 09:18:58 -07:00
Alex Vandiver 42dc5d003e puppet: Upgrade Smokescreen and golang. 2022-07-15 09:18:58 -07:00
Alex Vandiver 120de1dca9 zephyr: Write out unix timestamp in check, as check_cron_file expects.
A follow-up fix to 8bc26aab08.
2022-06-30 11:12:26 -07:00
Alex Vandiver 4fd51cb5ad uwsgi: Increase request buffer size to 64k, from 8k default.
The default value in uwsgi is 4k; receiving more than this amount from
nginx leads to a 502 response (though, happily, the backend uwsgi does not
terminate).

ab18dbfde5 originally increased it from the unstated uwsgi default
of 4096, to 8192; b1da797955 made it configurable, in order to allow
requests from clients with many cookies, without causing 502's[1].

nginx defaults to a limitation of 1k, with 4 additional 8k header
lines allowed[2]; any request larger than that returns a response of
`400 Request Header Or Cookie Too Large`.  The largest header size
theoretically possible from nginx, by default, is thus 33k, though
that would require packing four separate headers to exactly 8k each.

Remove the gap between nginx's limit and uwsgi's, which could trigger
502s, by removing the uwsgi configurability, and setting a 64k size in
uwsgi (the max allowable), which is larger than nginx's default limit.

uWSGI's documentation of `buffer-size` ([3], [4]) also notes that "It
is a security measure too, so adapt to your app needs instead of
maxing it out."  Python has no security issues with buffers of 64k,
and there is no appreciable memory footprint difference to having a
larger buffer available in uwsgi.

[1]: https://chat.zulip.org/#narrow/stream/31-production-help/topic/works.20in.20Edge.20not.20Chrome/near/719523
[2]: https://nginx.org/en/docs/http/ngx_http_core_module.html#client_header_buffer_size
[3]: https://uwsgi-docs.readthedocs.io/en/latest/ThingsToKnow.html
[4]: https://uwsgi-docs.readthedocs.io/en/latest/Options.html#buffer-size
2022-06-28 16:14:24 -07:00
Anders Kaseorg ef3510fa6d nginx: Remove legacy X-XSS-Protection header.
Support for this header was removed in Chrome 78, Safari 15.4, and
Edge 17.  It was never supported in Firefox.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-06-27 17:38:18 -07:00
Alex Vandiver 8577adcf2e cron: Remove unused STATE_FILE environment variable. 2022-06-22 12:07:38 -07:00
Alex Vandiver 8bc26aab08 nagios: Switch check_user_zephyr_mirror_liveness to run via cron.
This check loads Django, and as such must be run as the zulip user.
Repeat the same pattern used elsewhere in nagios, of writing a state
file, which is read by `check_cron_file`.
2022-06-22 12:07:38 -07:00
Alex Vandiver 41deef40cf nagios: Switch to generic check_cron_file for queues and consumers.
These share a common root; 91da4bd59b duplicated the code, but
didn't move the existing uses to the new utility.
2022-06-22 12:07:38 -07:00
Alex Vandiver b2d0bad9af check_cron_file: Remove unnecessary quotes. 2022-06-22 12:07:38 -07:00
Alex Vandiver 41b7ae4e44 check_cron_file: Don't crash on missing cron file.
This is 5050fb19f6, but for `check_cron_file`, which was introduced
in 91da4bd59b.
2022-06-22 12:07:38 -07:00
Alex Vandiver 8fbde9b8c5 nagios: Only run check_fts_update_log on one PostgreSQL host.
The data is the same in the table in all replicas -- there is no need
to alert on all of them.
2022-06-22 12:07:38 -07:00
Alex Vandiver 499284d2fd nagios: Split postgresql into primary and replica.
Replication checks should only run on primary and replicas, not
standalone hosts; while `autovac_freeze` currently only runs on
primary hosts, it functions identically on replicas, and is fine to
run there.

Make `autovac_freeze` run on all `postgresql` hosts, and make
standalone hosts no longer `postgres_primary`, so they do not fail the
replication tests.
2022-06-22 12:07:38 -07:00
Alex Vandiver 38e435347b nagios: Add missing queue consumer checks. 2022-06-22 12:07:38 -07:00
Alex Vandiver e01a4242aa nagios: Sort queue consumer checks. 2022-06-22 12:07:38 -07:00
Alex Vandiver 2c90c7a010 nagios: Switch `check_remote_arg_string` queue checks to consumer checks.
These style of checks just look for matching process names using
`check_remote_arg_string`, which dates to 8edbd64bb8.  These were
added because the original two (`missedmessage_emails` and
`slow_queries`) did not create consumers, instead polling for events.

Switch these to checking the queue consumer counts that the
`check-rabbitmq-consumers` check is already writing out.  Since the
`missedmessage_emails` was _already_ checked via the consumer check, a
duplicate is not added.
2022-06-22 12:07:38 -07:00
Alex Vandiver f48d543d9b nagios: Make and use a "rabbitmq-consumer-service" template service. 2022-06-22 12:07:38 -07:00
Alex Vandiver 775a084d0f nagios: Add a catchall "other" set. 2022-06-22 12:07:38 -07:00
Alex Vandiver 83c82c8e15 nagios: Adjust load alerting by hostgroup.
Even the `pageable_servers` group did not page for high load -- in
part because what was "high" depends on the servers.  Set slightly
better limits based on server role.
2022-06-22 12:07:38 -07:00
Alex Vandiver 2a14aa5180 nagios: Add a `fullstack` hostgroup.
This will be used to apply checks only to czo.
2022-06-22 12:07:38 -07:00
Alex Vandiver b5ecfc327f nagios: Remove unnecessary `web` hostgroup.
This had identical membership to `frontends`.
2022-06-22 12:07:38 -07:00
Alex Vandiver 4be9025212 nagios: Remove redundant `postgresql` hostgroup.
This is implied by `postgresql_primary`.
2022-06-22 12:07:38 -07:00
Alex Vandiver d9d0014fb4 nagios: Rename `zmirror_main` into `zmirror` hostgroup.
`zmirror` itself was `zmirror_main` + `zmirrorp` but was unused; we
consistently just use the term `zmirror` for the non-personals server,
so use it as the hostgroup name.
2022-06-22 12:07:38 -07:00
Alex Vandiver 70c36985b4 nagios: Remove frontends from redis group.
The Redis nagios checks themselves are done against `redis` +
`frontends` groups, so there is no need to misleadingly place
`frontends` in the `redis` hostgroup.
2022-06-22 12:07:38 -07:00
Alex Vandiver 08127086bc nagios: Remove misleading "staging_frontends" from standalone.
No services are tested for the `staging_frontends` hostgroup, so this
does not alter the checks.
2022-06-22 12:07:38 -07:00
Alex Vandiver d804de871d nagios: Move staging and prod hostgroups adjacent. 2022-06-22 12:07:38 -07:00
Alex Vandiver 4c17f2bccc nagios: The frontends hostgroup now includes prod and staging frontends.
This lets the config file remove some repetition.
2022-06-22 12:07:38 -07:00
Alex Vandiver 1e81775fa0 nagios: Drop unhelpful hostgroup comment. 2022-06-22 12:07:38 -07:00
Alex Vandiver 7b584401ac nagios: Reformat hostgroups. 2022-06-22 12:07:38 -07:00
Alex Vandiver 93bcb86345 nagios: Reorder service checks. 2022-06-22 12:07:38 -07:00
Alex Vandiver eaaa2fbff8 nagios: Use canonical "hostgroup_name" consistently. 2022-06-22 12:07:38 -07:00
Alex Vandiver e8996b53a5 nagios: Remove unused has_swap hostgroup. 2022-06-22 12:07:38 -07:00
Alex Vandiver 33472ee9ff nagios: Remove unused stats host set. 2022-06-22 12:07:38 -07:00
Alex Vandiver bc4f4b4862 nagios: Make the pageable/not/flaky tri-state clearer. 2022-06-22 12:07:38 -07:00
Alex Vandiver c74f195fba nagios: Split AWS and non-AWS hosts, for ntp checks.
The non-AWS hosts cannot use the AWS ntp server for their check.
2022-06-22 12:07:38 -07:00
Alex Vandiver 872efdee58 nagios: Fold single- and multitornado_frontends back into frontends.
5abf4dee92 made this distinction, then multitornado_frontends was
never used; the singletornado_frontends alerting worked even for the
multiple-Tornado instances.

Remove the useless and misleading distinction.
2022-06-22 12:07:38 -07:00
Anders Kaseorg dc6af98e52 nginx: Add Cache-Control headers for Django-hashed static files.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-06-21 17:26:23 -07:00
Alex Vandiver 0645656fd8 process_fts_updates: Nagios may lack permissions to load Django config.
Even if Django and PostgreSQL are on the same host, the `nagios` user
may lack permissions to read accessory configuration files needed to
load the Django configuration (e.g. authentication keys).

Catch those failures, and switch to loading the required settings from
`/etc/zulip/zulip.conf`.
2022-06-21 12:50:13 -07:00
Anders Kaseorg a7f9c4f958 logging: Pass more format arguments to logging.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-06-03 12:27:23 -07:00
Alex Vandiver aa46d8d2a8 puppet: Enable strict typo checking in uwsgi. 2022-06-02 13:20:48 -07:00
Alex Vandiver 18ec3b6215 puppet: Enable background worker threads in uwsgi.
Without this, uwsgi does not release the GIL before going back into
`epoll_wait` to wait for the next request.  This results in any
background threads languishing, unserviced.[1]

Practically, this results in Sentry background reporter threads timing
out when attempting to post results -- but only in situations with low
traffic, as in those significant time is spent in `epoll_wait`.  This
is seen in logs as:

    WARN [urllib3.connectionpool] Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1131)'))': /api/123456789/envelope/

Or:

    WARN [urllib3.connectionpool] Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', RemoteDisconnected('Remote end closed connection without response'))': /api/123456789/envelope/

Sentry attempts to detect this and warn, but due to startup ordering,
the warning is not printed without lazy-loading.

Enable threads, at a miniscule performance cost, in order to support
background workers like Sentry[2].

[1] https://github.com/unbit/uwsgi/issues/1141#issuecomment-169042767
[2] https://docs.sentry.io/clients/python/advanced/#a-note-on-uwsgi
2022-06-02 13:20:48 -07:00
Alex Vandiver 919c904091 puppet: Give the uwsgi processes a shorter process name.
Previously, the complete command line, which is quite long, is shown:

    3963143 ?        SN     0:00 /home/zulip/deployments/current/zulip-current-venv/bin/uwsgi --ini /etc/zulip/uwsgi.ini
    3963144 ?        SN     0:03  \_ /home/zulip/deployments/current/zulip-current-venv/bin/uwsgi --ini /etc/zulip/uwsgi.ini
    3963145 ?        SN     0:03  \_ /home/zulip/deployments/current/zulip-current-venv/bin/uwsgi --ini /etc/zulip/uwsgi.ini
    3963146 ?        SN     0:03  \_ /home/zulip/deployments/current/zulip-current-venv/bin/uwsgi --ini /etc/zulip/uwsgi.ini
    3963147 ?        SN     0:03  \_ /home/zulip/deployments/current/zulip-current-venv/bin/uwsgi --ini /etc/zulip/uwsgi.ini
    3963148 ?        SN     0:03  \_ /home/zulip/deployments/current/zulip-current-venv/bin/uwsgi --ini /etc/zulip/uwsgi.ini
    3963149 ?        SN     0:03  \_ /home/zulip/deployments/current/zulip-current-venv/bin/uwsgi --ini /etc/zulip/uwsgi.ini

Configure uwsgi to rename and number the processes.  This results in:

    3907613 ?        SN     0:00 zulip-django uWSGI master
    3907614 ?        SN     0:05  \_ zulip-django uWSGI worker 1
    3907615 ?        SN     0:03  \_ zulip-django uWSGI worker 2
    3907616 ?        SN     0:05  \_ zulip-django uWSGI worker 3
    3907617 ?        SN     0:05  \_ zulip-django uWSGI worker 4
    3907618 ?        SN     0:05  \_ zulip-django uWSGI worker 5
    3907619 ?        SN     0:05  \_ zulip-django uWSGI worker 6
2022-06-02 13:20:48 -07:00
Alex Vandiver a522ad1d9a puppet: Always create a uwsgi master control socket.
This is potentially useful even with rolling restarts disabled.
2022-06-02 13:20:48 -07:00
Alex Vandiver 721a101f12 puppet: Reorganize and comment uwsgi.ini file.
As the uwsgi documentation is somewhat obtuse, more comments are added
here than might usually be.
2022-06-02 13:20:48 -07:00
Alex Vandiver 3741c1c034 puppet: Switch to checking time against the AWS timeserver.
Since this is what chrony is sync'ing to, it lessens the chance of
spurious firings of this alert.

See https://aws.amazon.com/blogs/aws/keeping-time-with-amazon-time-sync-service/
2022-05-31 22:57:32 -07:00
Alex Vandiver a201e3b25b puppet: Upgrade wal-g to 2.0.0. 2022-05-22 14:51:18 -07:00
Alex Vandiver c8ee53619d puppet: Upgrade go and smokescreen. 2022-05-22 14:51:18 -07:00