65774e1c4f switched from using the bundled check_postgres.pl to using
the version from packages; the file itself remained, however.
Remove it, and clean up references to it.
Fixes#15389.
Instead of SSH'ing around to them, run directly on the database hosts.
This means that the replicas do not know how many bytes behind they
are in _receiving_ the wall logs; thus, the monitoring also extends to
the primary database, which knows that information for each replica.
This also allows for detecting when there are too few active replicas.
Use read-only types (List ↦ Sequence, Dict ↦ Mapping, Set ↦
AbstractSet) to guard against accidental mutation of the default
value.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
These values differed between the primary and secondary database
hosts, for unclear reasons. The differences date back to their
introduction in 387f63deaa. As the comment in the replica
confguration notes, settings of `vm.dirty_ratio = 10` and
`vm.dirty_background_ratio = 5` matched the kernel defaults for
"newer" kernels; however, kernel 2.6.30 bumped those to 20 and 10,
respectively[1], as a fix for underlying logic now being more correct.
Remove these overrides; they should at very least be consistent across
roles, and the previous values look to be an attempt to tune for a
very much older version of the Linux kernel, which was using an
different, buggier, algorithm under the hood.
[1] 1b5e62b42b
This file controls streaming replication, and recovery using wal-g on
the secondary. The `primary_conninfo` data needs to change on short
notice when database failover happens, in a way that is not suitable
for being controlled by puppet.
PostgreSQL 12, in fact, removes the use of the `recovery.conf` file[1];
the `primary_conninfo` and `restore_command` information goes into the
main `postgresql.conf` file, and the standby status is controlled by
the presence of absence of an empty `standby.signal` file.
Remove the puppet control of the `recovery.conf` file.
[1] https://pgstef.github.io/2018/11/26/postgresql12_preview_recovery_conf_disappears.html
Since the nagios authentication is stored _in the database_, it is
unnecessary to run if the database is simply a replica of the
production database. The only case in which this statement would have
an effect is if the postgres node contains a _different_ (or empty)
database, which `setup_disks` now effectively prevents.
Remove the unnecessary step.
The end state it produces is _either_:
- `/srv/postgresql` already existed, which was symlinked into
`/var/lib/postgresql`; postgres is left untouched. This is the
situation if `setup_disks` is run on the database primary, or a
replica which was correctly configured.
- An empty `/srv/postgresql` now exists, symlinked into
`/var/lib/postgresql`, and postgres is stopped. This is the
situation if `puppet` was just run on a new host, or a
previously-configured host was rebooted (clearing the temporary
disk in `/dev/nvme0`)
In the latter case, where `/srv/postgresql` is now empty, any previous
contents of `/var/lib/postgresql` are placed under `/root`,
timestamped for uniqueness.
In either case, the tool should now be idempotent.
Fixes#2665.
Regenerated by tabbott with `lint --fix` after a rebase and change in
parameters.
Note from tabbott: In a few cases, this converts technical debt in the
form of unsorted imports into different technical debt in the form of
our largest files having very long, ugly import sequences at the
start. I expect this change will increase pressure for us to split
those files, which isn't a bad thing.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
Automatically generated by the following script, based on the output
of lint with flake8-comma:
import re
import sys
last_filename = None
last_row = None
lines = []
for msg in sys.stdin:
m = re.match(
r"\x1b\[35mflake8 \|\x1b\[0m \x1b\[1;31m(.+):(\d+):(\d+): (\w+)", msg
)
if m:
filename, row_str, col_str, err = m.groups()
row, col = int(row_str), int(col_str)
if filename == last_filename:
assert last_row != row
else:
if last_filename is not None:
with open(last_filename, "w") as f:
f.writelines(lines)
with open(filename) as f:
lines = f.readlines()
last_filename = filename
last_row = row
line = lines[row - 1]
if err in ["C812", "C815"]:
lines[row - 1] = line[: col - 1] + "," + line[col - 1 :]
elif err in ["C819"]:
assert line[col - 2] == ","
lines[row - 1] = line[: col - 2] + line[col - 1 :].lstrip(" ")
if last_filename is not None:
with open(last_filename, "w") as f:
f.writelines(lines)
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
This is vestigial.
It requires manually altering the `htdigest` file (not stored in this
repo) to change the digest realm from `wiki` to `monitoring`, and will
re-prompt users for their passwords if the browsers currently store
them.
Drop the change to move `/tmp` onto the local disk. Doing this move
confuses `resolved` until there is a restart, and has no clear
benefits. The change came in during bf82fadc95, but does not describe
the reasoning; it is particularly puzzling, since postgresql stores
its temporary files under `$PGDATA/base/pgsql_tmp`.
Do not RAID the disks together. This was previously done when they
were spinning media, for reliability; running them on an SSD obviates
this sufficiently. This means that updating the initramfs is also not
necessary.
This no longer has any rules specific to it. We leave the `postgres`
munin group (which now only contains `postgres_appdb`) as
future-proofing, and so that `postgres_appdb` matches to the puppet
manifest of the same name.
This allows straight-forward configuration of realm-based Tornado
sharding through simply editing /etc/zulip/zulip.conf to configure
shards and running scripts/refresh-sharding-and-restart.
Co-Author-By: Mateusz Mandera <mateusz.mandera@zulip.com>
While this functionality to post slow queries to a Zulip stream was
very useful in the early days of Zulip, when there were only a few
hundred accounts, it's long since been useless since (1) the total
request volume on larger Zulip servers run by Zulip developers, and
(2) other server operators don't want real-time notifications of slow
backend queries. The right structure for this is just a log file.
We get rid of the queue and replace it with a "zulip.slow_queries"
logger, which will still log to /var/log/zulip/slow_queries.log for
ease of access to this information and propagate to the other logging
handlers. Reducing the amount of queues is good for lowering zulip's
memory footprint and restart performance, since we run at least one
dedicated queue worker process for each one in most configurations.
We could anchor the regexes, but there’s no need for the power (and
responsibility) of regexes here.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
Generated by `pyupgrade --py3-plus --keep-percent-format` on all our
Python code except `zthumbor` and `zulip-ec2-configure-interfaces`,
followed by manual indentation fixes.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
isort 5 knows not to reorder imports across function calls, so this
will stop isort from breaking our code.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
"Zulip Voyager" was a name invented during the Hack Week to open
source Zulip for what a single-system Zulip server might be called, as
a Star Trek pun on the code it was based on, "Zulip Enterprise".
At the time, we just needed a name quickly, but it was never a good
name, just a placeholder. This removes that placeholder name from
much of the codebase. A bit more work will be required to transition
the `zulip::voyager` Puppet class, as that has some migration work
involved.
This legacy cross-realm bot hasn't been used in several years, as far
as I know. If we wanted to re-introduce it, I'd want to implement it
as an embedded bot using those common APIs, rather than the totally
custom hacky code used for it that involves unnecessary queue workers
and similar details.
Fixes#13533.
Zulip has had a small use of WebSockets (specifically, for the code
path of sending messages, via the webapp only) since ~2013. We
originally added this use of WebSockets in the hope that the latency
benefits of doing so would allow us to avoid implementing a markdown
local echo; they were not. Further, HTTP/2 may have eliminated the
latency difference we hoped to exploit by using WebSockets in any
case.
While we’d originally imagined using WebSockets for other endpoints,
there was never a good justification for moving more components to the
WebSockets system.
This WebSockets code path had a lot of downsides/complexity,
including:
* The messy hack involving constructing an emulated request object to
hook into doing Django requests.
* The `message_senders` queue processor system, which increases RAM
needs and must be provisioned independently from the rest of the
server).
* A duplicate check_send_receive_time Nagios test specific to
WebSockets.
* The requirement for users to have their firewalls/NATs allow
WebSocket connections, and a setting to disable them for networks
where WebSockets don’t work.
* Dependencies on the SockJS family of libraries, which has at times
been poorly maintained, and periodically throws random JavaScript
exceptions in our production environments without a deep enough
traceback to effectively investigate.
* A total of about 1600 lines of our code related to the feature.
* Increased load on the Tornado system, especially around a Zulip
server restart, and especially for large installations like
zulipchat.com, resulting in extra delay before messages can be sent
again.
As detailed in
https://github.com/zulip/zulip/pull/12862#issuecomment-536152397, it
appears that removing WebSockets moderately increases the time it
takes for the `send_message` API query to return from the server, but
does not significantly change the time between when a message is sent
and when it is received by clients. We don’t understand the reason
for that change (suggesting the possibility of a measurement error),
and even if it is a real change, we consider that potential small
latency regression to be acceptable.
If we later want WebSockets, we’ll likely want to just use Django
Channels.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
This has been a spurious alert for a long time.
It's unclear that this check is useful at all, but if it spikes
dramatically above what's normal, there's perhaps still utility in
being alerted.
Since LoopQueueProcessingWorker jobs cannot be monitored by checking
for connected consumers (since they poll, rather than consuming as
events arrive), they can't be monitored with check_consumers. It's
OK, because that monitoring was redundant with monitoring for
potential growth in their queue that we have as well.
Also clean up the block comments for the two other similar queue
procesors.
Delete trailing newlines from all files, except
tools/ci/success-http-headers.txt and tools/setup/dev-motd, where they
are significant, and static/third, where we want to stay close to
upstream.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
Previous cleanups (mostly the removals of Python __future__ imports)
were done in a way that introduced leading newlines. Delete leading
newlines from all files, except static/assets/zulip-emoji/NOTICE,
which is a verbatim copy of the Apache 2.0 license.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
As a result of dropping support for trusty, we can remove our old
pattern of putting `if False` before importing the typing module,
which was essential for Python 3.4 support, but not required and maybe
harmful on newer versions.
cron_file_helper
check_rabbitmq_consumers
hash_reqs
check_zephyr_mirror
check_personal_zephyr_mirrors
check_cron_file
zulip_tools
check_postgres_replication_lag
api_test_helpers
purge-old-deployments
setup_venv
node_cache
clean_venv_cache
clean_node_cache
clean_emoji_cache
pg_backup_and_purge
restore-backup
generate_secrets
zulip-ec2-configure-interfaces
diagnose
check_user_zephyr_mirror_liveness