Commit Graph

182 Commits

Author SHA1 Message Date
Alex Vandiver 31d80a77d4 puppet: Update nagios check_postgres_replication_lag to be on DB hosts
7d4a370a57 attempted to move the replication check to on the
PostgreSQL hosts.  While it updated the _check_ to assume it was
running and talking to a local PostgreSQL instance, the configuration
and installation for the check were not updated.  As such, the check
ran on the nagios host for each DB host, and produced no output.

Start distributing the check to all apopdb hosts, and configure nagios
to use the SSH tunnel to get there.
2020-07-14 16:27:18 -07:00
Alex Vandiver 6c27f07c1d puppet: Move PostgreSQL backups to their own class.
wal-g was used in `puppet/zulip` by env-wal-g, but only installed in
`puppet/zulip_ops`.

Merge all of the dependencies of doing backups using wal-g (wal-g
installation, the pg_backup_and_purge job, the nagios plugin that
verifies it happens) into a common base class in `puppet/zulip`, since
it is generally useful.
2020-07-14 00:40:25 -07:00
Alex Vandiver 3691a94efe puppet: Configure munin and nagios under apache with puppet.
This swaps in the actually-in-use munin configuiration file;
otherwise, it is an implementation of the configuration as it exists
on the machine.
2020-07-13 13:23:11 -07:00
Alex Vandiver 7c7b5fcd6f munin: Deal with spaces in the channel names. 2020-07-13 12:49:28 -07:00
Alex Vandiver ddc7bb5a45 munin: Fix the path to check_send_receive_time. 2020-07-13 12:49:28 -07:00
Alex Vandiver 8be544e7eb munin: Rename monitoring plugin to use zulip name, not humbug. 2020-07-13 12:49:28 -07:00
Alex Vandiver a4e7c7a27e nagios: Remove check_memcached.
check_memcached does not support memcached authentication even in its
latest release (it’s in a TODO item comment, and that’s it), and was
never particularly useful.
2020-07-10 00:12:48 -07:00
Alex Vandiver a21a086f5c puppet: nagios-plugins-basic is replaced by monitoring-plugins-basic.
In Bionic, nagios-plugins-basic is a transitional package which
depends on monitoring-plugins-basic.  In Focal, it is a virtual
package, which means that every time puppet runs, it tries to
re-install the nagios-plugins-basic package.

Switch all instances to referring to `$zulip::common::nagios_plugins`,
and repoint that to monitoring-plugins-basic.
2020-06-29 14:58:01 -07:00
Alex Vandiver 876ee4a8ed installer: Remove code specific to stretch or xenial.
Support for Xenial and Stretch was removed (5154ddafca, 0f4b1076ad,
8944e0ad53, 79acd5ae40, 1219a2e854), but not all codepaths were
updated to remove their conditionals on it.

Remove all code predicated on Xenial or Stretch.  debathena support
was migrated to Bionic, since that appears to be the current state of
existing debathena servers.
2020-06-24 12:57:38 -07:00
Alex Vandiver 5f433d6eeb puppet: Remove vestigial check_postgres.pl.
65774e1c4f switched from using the bundled check_postgres.pl to using
the version from packages; the file itself remained, however.

Remove it, and clean up references to it.

Fixes #15389.
2020-06-15 16:18:07 -07:00
Alex Vandiver 7d4a370a57 puppet: Move monitoring of pg replication to the pg hosts.
Instead of SSH'ing around to them, run directly on the database hosts.
This means that the replicas do not know how many bytes behind they
are in _receiving_ the wall logs; thus, the monitoring also extends to
the primary database, which knows that information for each replica.
This also allows for detecting when there are too few active replicas.
2020-06-15 16:18:07 -07:00
Anders Kaseorg 74c17bf94a python: Convert more percent formatting to Python 3.6 f-strings.
Generated by pyupgrade --py36-plus.

Now including %d, %i, %u, and multi-line strings.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-06-14 23:27:22 -07:00
Anders Kaseorg 91a86c24f5 python: Replace None defaults with empty collections where appropriate.
Use read-only types (List ↦ Sequence, Dict ↦ Mapping, Set ↦
AbstractSet) to guard against accidental mutation of the default
value.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-06-13 15:31:27 -07:00
Alex Vandiver 55bd31721d puppet: Remove custom `vm.dirty_ratio` and `vm.dirty_background_ratio`.
These values differed between the primary and secondary database
hosts, for unclear reasons.  The differences date back to their
introduction in 387f63deaa.  As the comment in the replica
confguration notes, settings of `vm.dirty_ratio = 10` and
`vm.dirty_background_ratio = 5` matched the kernel defaults for
"newer" kernels; however, kernel 2.6.30 bumped those to 20 and 10,
respectively[1], as a fix for underlying logic now being more correct.

Remove these overrides; they should at very least be consistent across
roles, and the previous values look to be an attempt to tune for a
very much older version of the Linux kernel, which was using an
different, buggier, algorithm under the hood.

[1] 1b5e62b42b
2020-06-12 14:57:46 -07:00
Alex Vandiver f39816e768 puppet: Stop distributing recovery.conf file.
This file controls streaming replication, and recovery using wal-g on
the secondary.  The `primary_conninfo` data needs to change on short
notice when database failover happens, in a way that is not suitable
for being controlled by puppet.

PostgreSQL 12, in fact, removes the use of the `recovery.conf` file[1];
the `primary_conninfo` and `restore_command` information goes into the
main `postgresql.conf` file, and the standby status is controlled by
the presence of absence of an empty `standby.signal` file.

Remove the puppet control of the `recovery.conf` file.

[1] https://pgstef.github.io/2018/11/26/postgresql12_preview_recovery_conf_disappears.html
2020-06-12 14:57:46 -07:00
Alex Vandiver 316498a169 puppet: Remove unnecessary nagios authentication setup.
Since the nagios authentication is stored _in the database_, it is
unnecessary to run if the database is simply a replica of the
production database.  The only case in which this statement would have
an effect is if the postgres node contains a _different_ (or empty)
database, which `setup_disks` now effectively prevents.

Remove the unnecessary step.
2020-06-11 21:01:49 -07:00
Alex Vandiver 1dc2de5026 puppet: Update setup-disks to be idempotent.
The end state it produces is _either_:

 - `/srv/postgresql` already existed, which was symlinked into
   `/var/lib/postgresql`; postgres is left untouched.  This is the
   situation if `setup_disks` is run on the database primary, or a
   replica which was correctly configured.

 - An empty `/srv/postgresql` now exists, symlinked into
   `/var/lib/postgresql`, and postgres is stopped.  This is the
   situation if `puppet` was just run on a new host, or a
   previously-configured host was rebooted (clearing the temporary
   disk in `/dev/nvme0`)

In the latter case, where `/srv/postgresql` is now empty, any previous
contents of `/var/lib/postgresql` are placed under `/root`,
timestamped for uniqueness.

In either case, the tool should now be idempotent.
2020-06-11 21:01:49 -07:00
Anders Kaseorg 365fe0b3d5 python: Sort imports with isort.
Fixes #2665.

Regenerated by tabbott with `lint --fix` after a rebase and change in
parameters.

Note from tabbott: In a few cases, this converts technical debt in the
form of unsorted imports into different technical debt in the form of
our largest files having very long, ugly import sequences at the
start.  I expect this change will increase pressure for us to split
those files, which isn't a bad thing.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-06-11 16:45:32 -07:00
Anders Kaseorg 69730a78cc python: Use trailing commas consistently.
Automatically generated by the following script, based on the output
of lint with flake8-comma:

import re
import sys

last_filename = None
last_row = None
lines = []

for msg in sys.stdin:
    m = re.match(
        r"\x1b\[35mflake8    \|\x1b\[0m \x1b\[1;31m(.+):(\d+):(\d+): (\w+)", msg
    )
    if m:
        filename, row_str, col_str, err = m.groups()
        row, col = int(row_str), int(col_str)

        if filename == last_filename:
            assert last_row != row
        else:
            if last_filename is not None:
                with open(last_filename, "w") as f:
                    f.writelines(lines)

            with open(filename) as f:
                lines = f.readlines()
            last_filename = filename
        last_row = row

        line = lines[row - 1]
        if err in ["C812", "C815"]:
            lines[row - 1] = line[: col - 1] + "," + line[col - 1 :]
        elif err in ["C819"]:
            assert line[col - 2] == ","
            lines[row - 1] = line[: col - 2] + line[col - 1 :].lstrip(" ")

if last_filename is not None:
    with open(last_filename, "w") as f:
        f.writelines(lines)

Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
2020-06-11 16:04:12 -07:00
Alex Vandiver b114eb2f10 puppet: Rename env-wal-e to env-wal-g.
It runs wal-g now, not wal-e; make its name respect that.
2020-06-11 15:52:43 -07:00
Anders Kaseorg 67e7a3631d python: Convert percent formatting to Python 3.6 f-strings.
Generated by pyupgrade --py36-plus.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-06-10 15:02:09 -07:00
Alex Vandiver 8b1d49dbc7 puppet: Rename "wiki" realm to "monitoring".
This is vestigial.

It requires manually altering the `htdigest` file (not stored in this
repo) to change the digest realm from `wiki` to `monitoring`, and will
re-prompt users for their passwords if the browsers currently store
them.
2020-05-30 12:26:21 -07:00
Alex Vandiver b33aa8da7f postgresql: Update setup-disks to use `service postgresql`.
Using `service postgresql` makes it no longer linked to the specific
version/cluster that is on the host.
2020-05-30 12:14:24 -07:00
Alex Vandiver 4e370cda75 postgresql: Update setup-disks to drop /mnt disabling.
Hosts do not start out with a `/mnt`; there is no need to disable it.
2020-05-30 12:14:24 -07:00
Alex Vandiver a7d85b7e69 postgresql: Update setup-disks to not move /tmp.
Drop the change to move `/tmp` onto the local disk.  Doing this move
confuses `resolved` until there is a restart, and has no clear
benefits.  The change came in during bf82fadc95, but does not describe
the reasoning; it is particularly puzzling, since postgresql stores
its temporary files under `$PGDATA/base/pgsql_tmp`.
2020-05-30 12:14:24 -07:00
Alex Vandiver 481613a344 postgresql: Update setup-disks to not use RAID.
Do not RAID the disks together.  This was previously done when they
were spinning media, for reliability; running them on an SSD obviates
this sufficiently.  This means that updating the initramfs is also not
necessary.
2020-05-30 12:14:24 -07:00
Alex Vandiver b537563bc1 postgresql: Set the current primary host. 2020-05-30 12:14:24 -07:00
Alex Vandiver ad2918ea51 puppet: Remove `postgres_other` nagios hostgroup.
This no longer has any rules specific to it.  We leave the `postgres`
munin group (which now only contains `postgres_appdb`) as
future-proofing, and so that `postgres_appdb` matches to the puppet
manifest of the same name.
2020-05-28 17:24:35 -07:00
Alex Vandiver 2c73fbdcb6 puppet: Remove munin monitoring for no-longer-used "postgres_other".
The `wiki` and `trac` products are no longer used.
2020-05-28 17:24:35 -07:00
Anders Kaseorg f5b33f9398 python: Further pyupgrade changes.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-05-26 11:43:40 -07:00
Tim Abbott 220620e7cf sharding: Add basic sharding configuration for Tornado.
This allows straight-forward configuration of realm-based Tornado
sharding through simply editing /etc/zulip/zulip.conf to configure
shards and running scripts/refresh-sharding-and-restart.

Co-Author-By: Mateusz Mandera <mateusz.mandera@zulip.com>
2020-05-20 13:47:20 -07:00
Mateusz Mandera dd40649e04 queue_processors: Remove the slow_queries queue.
While this functionality to post slow queries to a Zulip stream was
very useful in the early days of Zulip, when there were only a few
hundred accounts, it's long since been useless since (1) the total
request volume on larger Zulip servers run by Zulip developers, and
(2) other server operators don't want real-time notifications of slow
backend queries.  The right structure for this is just a log file.

We get rid of the queue and replace it with a "zulip.slow_queries"
logger, which will still log to /var/log/zulip/slow_queries.log for
ease of access to this information and propagate to the other logging
handlers.  Reducing the amount of queues is good for lowering zulip's
memory footprint and restart performance, since we run at least one
dedicated queue worker process for each one in most configurations.
2020-05-11 00:45:13 -07:00
Anders Kaseorg c0ffa71fa9 nginx: Replace unanchored regexes in location directives.
We could anchor the regexes, but there’s no need for the power (and
responsibility) of regexes here.

Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
2020-04-24 16:58:19 -07:00
Anders Kaseorg 5e01a0ae8b zulip-ec2-configure-interfaces: Convert function type annotations.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
2020-04-24 13:06:54 -07:00
Anders Kaseorg f8339f019d python: Convert assignment type annotations to Python 3.6 style.
Commit split by tabbott; this has changes to scripts/, tools/, and
puppet/.

scripts/lib/hash_reqs.py, scripts/lib/setup_venv.py,
scripts/lib/zulip_tools.py, and tools/lib/provision.py are excluded so
tools/provision still gives the right error message on Ubuntu 16.04
with Python 3.5.

Generated by com2ann, with whitespace fixes and various manual fixes
for runtime issues:

-shebang_rules: List[Rule] = [
+shebang_rules: List["Rule"] = [

-trailing_whitespace_rule: Rule = {
+trailing_whitespace_rule: "Rule" = {

-whitespace_rules: List[Rule] = [
+whitespace_rules: List["Rule"] = [

-comma_whitespace_rule: List[Rule] = [
+comma_whitespace_rule: List["Rule"] = [

-prose_style_rules: List[Rule] = [
+prose_style_rules: List["Rule"] = [

-html_rules: List[Rule] = whitespace_rules + prose_style_rules + [
+html_rules: List["Rule"] = whitespace_rules + prose_style_rules + [

-    target_port: int = None
+    target_port: int

Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
2020-04-24 13:06:54 -07:00
Aman Agrawal 2dc6d09c2a python3-upgrade: Move python2 scripts to run on python3. 2020-04-22 16:13:15 -07:00
Anders Kaseorg 5901e7ba7e python: Convert function type annotations to Python 3 style.
Generated by com2ann (slightly patched to avoid also converting
assignment type annotations, which require Python 3.6), followed by
some manual whitespace adjustment, and six fixes for runtime issues:

-    def __init__(self, token: Token, parent: Optional[Node]) -> None:
+    def __init__(self, token: Token, parent: "Optional[Node]") -> None:

-def main(options: argparse.Namespace) -> NoReturn:
+def main(options: argparse.Namespace) -> "NoReturn":

-def fetch_request(url: str, callback: Any, **kwargs: Any) -> Generator[Callable[..., Any], Any, None]:
+def fetch_request(url: str, callback: Any, **kwargs: Any) -> "Generator[Callable[..., Any], Any, None]":

-def assert_server_running(server: subprocess.Popen[bytes], log_file: Optional[str]) -> None:
+def assert_server_running(server: "subprocess.Popen[bytes]", log_file: Optional[str]) -> None:

-def server_is_up(server: subprocess.Popen[bytes], log_file: Optional[str]) -> bool:
+def server_is_up(server: "subprocess.Popen[bytes]", log_file: Optional[str]) -> bool:

-    method_kwarg_pairs: List[FuncKwargPair],
+    method_kwarg_pairs: "List[FuncKwargPair]",

Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
2020-04-18 20:42:48 -07:00
Tim Abbott e1ce53ac46 puppet: Update nagios checks for disk to exclude kernel filesystems.
The fact that we have to explicitly list these is almost certainly a
bug in check_disk, but at least this works.
2020-04-16 17:49:29 -07:00
Tim Abbott cfbb617f5c puppet: Update nagios configuration for checking local disk. 2020-04-16 17:48:36 -07:00
Tim Abbott 8e5a866122 puppet: Update tuning for load average monitoring. 2020-04-16 16:47:05 -07:00
Anders Kaseorg c734bbd95d python: Modernize legacy Python 2 syntax with pyupgrade.
Generated by `pyupgrade --py3-plus --keep-percent-format` on all our
Python code except `zthumbor` and `zulip-ec2-configure-interfaces`,
followed by manual indentation fixes.

Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
2020-04-09 16:43:22 -07:00
Vishnu KS 449f7e2d4b team: Generate team page data using cron job.
This eliminates the contributors data as a possible source of
flakiness when installing Zulip from Git.

Fixes #14351.
2020-04-08 12:52:31 -07:00
Stefan Weil d2fa058cc1
text: Fix some typos (most of them found and fixed by codespell).
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-03-27 17:25:56 -07:00
Anders Kaseorg 7ff9b22500 docs: Convert many http URLs to https.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
2020-03-26 21:35:32 -07:00
Anders Kaseorg 687553a661 setup_path_on_import: Replace with setup_path function.
isort 5 knows not to reorder imports across function calls, so this
will stop isort from breaking our code.

Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
2020-02-25 15:40:21 -08:00
Mateusz Mandera 4c5a8e6f0c queue: Remove missedmessage_email_senders. 2020-01-31 12:13:51 -08:00
Tim Abbott dd969b5339 install: Remove references to "Zulip Voyager".
"Zulip Voyager" was a name invented during the Hack Week to open
source Zulip for what a single-system Zulip server might be called, as
a Star Trek pun on the code it was based on, "Zulip Enterprise".

At the time, we just needed a name quickly, but it was never a good
name, just a placeholder.  This removes that placeholder name from
much of the codebase.  A bit more work will be required to transition
the `zulip::voyager` Puppet class, as that has some migration work
involved.
2020-01-30 12:40:41 -08:00
Tim Abbott d70e799466 bots: Remove FEEDBACK_BOT implementation.
This legacy cross-realm bot hasn't been used in several years, as far
as I know.  If we wanted to re-introduce it, I'd want to implement it
as an embedded bot using those common APIs, rather than the totally
custom hacky code used for it that involves unnecessary queue workers
and similar details.

Fixes #13533.
2020-01-25 22:41:39 -08:00
Anders Kaseorg ea6934c26d dependencies: Remove WebSockets system for sending messages.
Zulip has had a small use of WebSockets (specifically, for the code
path of sending messages, via the webapp only) since ~2013.  We
originally added this use of WebSockets in the hope that the latency
benefits of doing so would allow us to avoid implementing a markdown
local echo; they were not.  Further, HTTP/2 may have eliminated the
latency difference we hoped to exploit by using WebSockets in any
case.

While we’d originally imagined using WebSockets for other endpoints,
there was never a good justification for moving more components to the
WebSockets system.

This WebSockets code path had a lot of downsides/complexity,
including:

* The messy hack involving constructing an emulated request object to
  hook into doing Django requests.
* The `message_senders` queue processor system, which increases RAM
  needs and must be provisioned independently from the rest of the
  server).
* A duplicate check_send_receive_time Nagios test specific to
  WebSockets.
* The requirement for users to have their firewalls/NATs allow
  WebSocket connections, and a setting to disable them for networks
  where WebSockets don’t work.
* Dependencies on the SockJS family of libraries, which has at times
  been poorly maintained, and periodically throws random JavaScript
  exceptions in our production environments without a deep enough
  traceback to effectively investigate.
* A total of about 1600 lines of our code related to the feature.
* Increased load on the Tornado system, especially around a Zulip
  server restart, and especially for large installations like
  zulipchat.com, resulting in extra delay before messages can be sent
  again.

As detailed in
https://github.com/zulip/zulip/pull/12862#issuecomment-536152397, it
appears that removing WebSockets moderately increases the time it
takes for the `send_message` API query to return from the server, but
does not significantly change the time between when a message is sent
and when it is received by clients.  We don’t understand the reason
for that change (suggesting the possibility of a measurement error),
and even if it is a real change, we consider that potential small
latency regression to be acceptable.

If we later want WebSockets, we’ll likely want to just use Django
Channels.

Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
2020-01-14 22:34:00 -08:00
Tim Abbott f84c037225 puppet: Tune check_postgres_locks parameters.
This has been a spurious alert for a long time.

It's unclear that this check is useful at all, but if it spikes
dramatically above what's normal, there's perhaps still utility in
being alerted.
2019-10-23 15:04:38 -07:00