1f565a9f41 removed the `package` lines which install
`python-dateutil`, but not the line in `puppet_ops` that reference it;
as such, Puppet manifests in puppet_ops fail to compile.
Remove the stale reference to `python-dateutil`, which is unnecessary
since the code is python3, not python2.
The zulip user needs to be able to read the file, when running the
backup tool.
We put root:root as owner on other nginx config files, so it's probably
correct to keep the ownership as it is, and set the mode to 0644.
certbot-auto doesn’t work on Ubuntu 20.04, and won’t be updated; we
migrate to instead using the certbot package shipped with the OS
instead. Also made sure that sure certbot gets installed when running
zulip-puppet-apply, to handle existing systems.
We're migrating to using the cleaner zulip.com domain, which involves
changing all of our links from ReadTheDocs and other places to point
to the cleaner URL.
datetime.timezone is available in Python ≥ 3.2. This also lets us
remove a pytz dependency from the PostgreSQL scripts.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
This is vestigial.
It requires manually altering the `htdigest` file (not stored in this
repo) to change the digest realm from `wiki` to `monitoring`, and will
re-prompt users for their passwords if the browsers currently store
them.
Drop the change to move `/tmp` onto the local disk. Doing this move
confuses `resolved` until there is a restart, and has no clear
benefits. The change came in during bf82fadc95, but does not describe
the reasoning; it is particularly puzzling, since postgresql stores
its temporary files under `$PGDATA/base/pgsql_tmp`.
Do not RAID the disks together. This was previously done when they
were spinning media, for reliability; running them on an SSD obviates
this sufficiently. This means that updating the initramfs is also not
necessary.
This no longer has any rules specific to it. We leave the `postgres`
munin group (which now only contains `postgres_appdb`) as
future-proofing, and so that `postgres_appdb` matches to the puppet
manifest of the same name.
This allows straight-forward configuration of realm-based Tornado
sharding through simply editing /etc/zulip/zulip.conf to configure
shards and running scripts/refresh-sharding-and-restart.
Co-Author-By: Mateusz Mandera <mateusz.mandera@zulip.com>
memcached 1.5.22 in Ubuntu 20.04 has a bug where it looks for its SASL
configuration at /etc/sasl2/memcached.conf/memcached.conf instead of
/etc/sasl2/memcached.conf.
https://bugs.launchpad.net/ubuntu/+source/memcached/+bug/1878721
Signed-off-by: Anders Kaseorg <anders@zulip.com>
Using the `host` virtual package confused Puppet into reporting it was
doing work every time one did a puppet run, resulting in unnecessarily
spammy output.
While this functionality to post slow queries to a Zulip stream was
very useful in the early days of Zulip, when there were only a few
hundred accounts, it's long since been useless since (1) the total
request volume on larger Zulip servers run by Zulip developers, and
(2) other server operators don't want real-time notifications of slow
backend queries. The right structure for this is just a log file.
We get rid of the queue and replace it with a "zulip.slow_queries"
logger, which will still log to /var/log/zulip/slow_queries.log for
ease of access to this information and propagate to the other logging
handlers. Reducing the amount of queues is good for lowering zulip's
memory footprint and restart performance, since we run at least one
dedicated queue worker process for each one in most configurations.
Our priority hierarchy is:
(1) Tornado and base services like memcached, redis, etc.
(2) Django and message sender queue workers.
(3) Everything else.
Ideally, we'd have something a bit more fine-grained (e.g. some queue
workers are potentially in the sending path, while others aren't), but
this should have a big impact on ensuring Tornado gets the resources
it needs during load spikes.
I think this has a good chance of causing some load spikes that would
previously have resulted in a user-facing delivery delays no longer
having any significant user-facing impact.
Currently when the user uploads files with ".jpe" file extension, the
markdown is converted to link but the image is not embedded.
This commit adds the support for ".jpe" file extension.
Fixes#14863
We could anchor the regexes, but there’s no need for the power (and
responsibility) of regexes here.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
This PR updates the puppet manifest to allow /etc/zulip to be a symlink. The current behaviour overwrites /etc/zulip if it is link to another directory, which is problematic with docker-zulip and
in particular the `LINK_SETTINGS_TO_DATA` setting.
Generated by `pyupgrade --py3-plus --keep-percent-format` on all our
Python code except `zthumbor` and `zulip-ec2-configure-interfaces`,
followed by manual indentation fixes.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
The nginx ‘add_header’ directive doesn’t inherit the way you’d
want (https://trac.nginx.org/nginx/ticket/854), so we need to manually
simulate inheritance using ‘include’, like we previously did with
api_headers.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
isort 5 knows not to reorder imports across function calls, so this
will stop isort from breaking our code.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
Puppet doesn’t re-run an exec blocks that’s declared as creating an
existing file, even if it’s notified. Remove the creates declaration.
Fixes#13730.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
"Zulip Voyager" was a name invented during the Hack Week to open
source Zulip for what a single-system Zulip server might be called, as
a Star Trek pun on the code it was based on, "Zulip Enterprise".
At the time, we just needed a name quickly, but it was never a good
name, just a placeholder. This removes that placeholder name from
much of the codebase. A bit more work will be required to transition
the `zulip::voyager` Puppet class, as that has some migration work
involved.
This legacy cross-realm bot hasn't been used in several years, as far
as I know. If we wanted to re-introduce it, I'd want to implement it
as an embedded bot using those common APIs, rather than the totally
custom hacky code used for it that involves unnecessary queue workers
and similar details.
Fixes#13533.
Zulip has had a small use of WebSockets (specifically, for the code
path of sending messages, via the webapp only) since ~2013. We
originally added this use of WebSockets in the hope that the latency
benefits of doing so would allow us to avoid implementing a markdown
local echo; they were not. Further, HTTP/2 may have eliminated the
latency difference we hoped to exploit by using WebSockets in any
case.
While we’d originally imagined using WebSockets for other endpoints,
there was never a good justification for moving more components to the
WebSockets system.
This WebSockets code path had a lot of downsides/complexity,
including:
* The messy hack involving constructing an emulated request object to
hook into doing Django requests.
* The `message_senders` queue processor system, which increases RAM
needs and must be provisioned independently from the rest of the
server).
* A duplicate check_send_receive_time Nagios test specific to
WebSockets.
* The requirement for users to have their firewalls/NATs allow
WebSocket connections, and a setting to disable them for networks
where WebSockets don’t work.
* Dependencies on the SockJS family of libraries, which has at times
been poorly maintained, and periodically throws random JavaScript
exceptions in our production environments without a deep enough
traceback to effectively investigate.
* A total of about 1600 lines of our code related to the feature.
* Increased load on the Tornado system, especially around a Zulip
server restart, and especially for large installations like
zulipchat.com, resulting in extra delay before messages can be sent
again.
As detailed in
https://github.com/zulip/zulip/pull/12862#issuecomment-536152397, it
appears that removing WebSockets moderately increases the time it
takes for the `send_message` API query to return from the server, but
does not significantly change the time between when a message is sent
and when it is received by clients. We don’t understand the reason
for that change (suggesting the possibility of a measurement error),
and even if it is a real change, we consider that potential small
latency regression to be acceptable.
If we later want WebSockets, we’ll likely want to just use Django
Channels.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
Resolves these warnings from puppet-lint.
puppet-lint| puppet/zulip/manifests/app_frontend_base.pp - WARNING: double quoted string containing no variables on line 14
puppet-lint| puppet/zulip/manifests/app_frontend_base.pp - WARNING: double quoted string containing no variables on line 19
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
This has been a spurious alert for a long time.
It's unclear that this check is useful at all, but if it spikes
dramatically above what's normal, there's perhaps still utility in
being alerted.
Since LoopQueueProcessingWorker jobs cannot be monitored by checking
for connected consumers (since they poll, rather than consuming as
events arrive), they can't be monitored with check_consumers. It's
OK, because that monitoring was redundant with monitoring for
potential growth in their queue that we have as well.
Also clean up the block comments for the two other similar queue
procesors.
Now that we're implemented tsearch_extras in pure postgres, we no
longer need a custom extension. This should help us considerably, as
it means we no longer need to ship custom apt packages at all.
Fixes#467.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
This commit finishes adding end-to-end support for the install script
on Debian Buster (making it production ready). Some support for this
was already added in prior commits such as
99414e2d96.
We plan to revert the postgres hunks of this once we've built
tsearch_extras for our packagecloud archive.
Fixes#9828.
The issue here was that the '.' character was unescaped and the
regex was not anchored with a terminal '$'. This was detected by
Anders Kaseorg.
Co-authored-by: Anders Kaseorg <anders@zulipchat.com>
`/etc/postfix/virtual` is of `regexp:` type, not `hash:` type, so
running `postmap` on it has no effect; we need to reload Postfix when
it changes.
http://www.postfix.org/DATABASE_README.html#detect
In the interest of forcing a reload now, optimize the regexes by
eliding the unanchored `.*`s at the beginnings and ends.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
Introduced by #12966.
puppet/zulip/manifests/base.pp - WARNING: double quoted string containing no variables on line 93
puppet/zulip/manifests/base.pp - WARNING: string containing only a variable on line 93
scanf doesn’t accept a number as input, so uh, add a dummy space
character.
What. You can’t give me a bad language and then complain when I write
bad programs in it.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
Fixes this warning:
Warning: The string '8167976' was automatically coerced to the numerical value 8167976 (file: /root/zulip/puppet/zulip/manifests/base.pp, line: 93, column: 19)
Fixes#9682.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
Delete trailing newlines from all files, except
tools/ci/success-http-headers.txt and tools/setup/dev-motd, where they
are significant, and static/third, where we want to stay close to
upstream.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
Previous cleanups (mostly the removals of Python __future__ imports)
were done in a way that introduced leading newlines. Delete leading
newlines from all files, except static/assets/zulip-emoji/NOTICE,
which is a verbatim copy of the Apache 2.0 license.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
This gives us access to typing_extensions.Deque, which was not added
to typing until 3.5.4.
(PROVISION_VERSION is not bumped because the transitive dependency set
in dev.txt hasn’t changed.)
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
As a result of dropping support for trusty, we can remove our old
pattern of putting `if False` before importing the typing module,
which was essential for Python 3.4 support, but not required and maybe
harmful on newer versions.
cron_file_helper
check_rabbitmq_consumers
hash_reqs
check_zephyr_mirror
check_personal_zephyr_mirrors
check_cron_file
zulip_tools
check_postgres_replication_lag
api_test_helpers
purge-old-deployments
setup_venv
node_cache
clean_venv_cache
clean_node_cache
clean_emoji_cache
pg_backup_and_purge
restore-backup
generate_secrets
zulip-ec2-configure-interfaces
diagnose
check_user_zephyr_mirror_liveness
Since 204 responses don’t contain a payload body, Content-Type is
neither required nor encouraged (RFC 7231 §3.1.1.5), and ours was
missing a semicolon to boot; Content-Length is expressly
forbidden (RFC 7230 §3.3.2).
Furthermore, these add_header directives were silencing the CORS
headers set in api_headers, because add_header inheritance doesn’t
work the way you think it does, as was known before commit
5614d51afc.
Fixes: #12902.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
Disable TLS 1.0 and TLS 1.1. (We no longer need to support IE8 on
Windows XP.)
Prefer client-selected cipher order. (Now that all enabled ciphers
provide good security, this allows mobile clients lacking AES hardware
acceleration to pick ChaCha20 for better performance.)
Disable session tickets. (Mozilla discourages them based on
https://www.imperialviolet.org/2013/06/27/botchingpfs.html.)
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
Since we no longer support Ubuntu Trusty, we no longer need this
backwards-compatibility cruft (which we only kept around to avoid
randomizing configuration for existing systems).
This was only used in Ubuntu 14.04 Trusty.
Removing this also finally lets us simplify our security model
discussion of uploaded files.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
One of smtpd_relay_restrictions or smtpd_recipient_restrictions is
required by postfix ≥ 2.10 (see
http://www.postfix.org/SMTPD_ACCESS_README.html).
This is important for using the email mirror on Ubuntu Bionic.
Our priority hierarchy is:
(1) Tornado and base services like memcached, redis, etc.
(2) Django and message sender queue workers.
(3) Everything else.
Ideally, we'd have something a bit more fine-grained (e.g. some queue
workers are potentially in the sending path, while others aren't), but
this should have a big impact on ensuring Tornado gets the resources
it needs during load spikes.
I think this has a good chance of causing some load spikes that would
previously have resulted in a user-facing delivery delays no longer
having any significant user-facing impact.
Previously, if process_fts_updates ended up very far behind
(e.g. 100,000s of messages), it was unable to recover without doing
some very expensive databsae operations to fetch and then delete the
list of message IDs needing updates. This change fixes that issue by
doing the catch-up work in batches.
Also use psql -e (--echo-queries) in scripts that use ‘set -x’, so
errors can be traced to a specific query from the output.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
The commit 87d1809657 changed the time when
digests are sent by 3 hours to account for moving from the US East Coast to the
West Coast, but didn't change the time period exception in the
`check-rabbitmq-queue` script.
Closes#5415
The construction `su postgres -c -- bash -c 'psql …'` didn’t behave the
way it reads, and only worked by accident:
1. `-c --` sets the command to `--`.
2. `bash` sets the first argument to `bash`.
3. `-c 'psql …'` replaces the command with `psql …`.
Thus, `su` ended up executing `<shell> -c 'psql …' bash`, where
`<shell>` is the `postgres` user’s login shell, usually also `bash`,
which then executed 'psql …' and ignored the extra `bash`.
Unconfuse this construction.
Note from tabbott: The old code didn't even work by accident, it was
just broken. The right fix is to move the quoting around properly.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
Lengthen the session timeout and enlarge the session cache. Upgrade
Diffie-Hellman parameters from fixed 1024-bit to custom 2048-bit.
Enable OCSP stapling.
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
Apparently, our testing environment for this configuration was broken
and didn't test the code we thought it did; as a result, a variable
redefinition bug slipped through.
Fixes#11786.
The overall goal of this change is to fix an issue where on Ubuntu
Trusty, we were accidentally overriding the configuration to serve
uploads from disk with the regular expressions for adding access
control headers.
However, while investigating this, it became clear that we could
considerably simplify the mental energy required to understand this
system by making the uploads-route file be unconditionally available
and included from `zulip-include/app` (which means the zulip_ops code
can share behavior here).
We also move the Access-Control-Allow-* headers to a separate include
file, to avoid duplicating it in 5 places. Fixing this duplication
discovered a potential bug in the settings used for Tornado, where
DELETE was not allowed on a route that definitely expects DELETE.
Fixes#11758.
For users putting Zulip behind certain proxies (and potentially some
third-party API clients), buffer sizes can exceed the uwsgi default of
4096. Since we aren't doing such high-throughput APIs that a small
buffer size is valuable, we should just raise this for everyone.
Nowm unless you specify `--fill-cache`, memcached caches will not be
pre-filled after a server restart. This will be helpful when someone
is in a hurry (e.g. if the server is down right now, or if he/she
testing a configuration change in a newly setup server), it's best to
just restart without pre-filling the cache.
Fixes: #10900.
Update the list of ciphers that nginx will use to the current
Mozilla recommended ones.
These are Intermediate compatibility ones suitable for clients
running anything newer than Firefox 1, Chrome 1, IE 7, Opera 5
and Safari 1. Modern compatibility is not suitable as it excludes
Andriod 4 which is still seen on ~1% of traffic.
More info: https://wiki.mozilla.org/Security/Server_Side_TLS
It hasn't been working for years, but more importantly, it spams up
root's mail queue so that one can't find important things in there
(e.g. the fact that the long-term-idle cron job was failing).
This now checks if the user is zulip, and if not, switches to the
zulip user, making it possible to run it as root.
Significantly modified by tabbott to not break existing behavior.
/bin/sh and /usr/bin/env are the only two binaries that NixOS provides
at a fixed path (outside a buildFHSUserEnv sandbox).
This discussion was split from #11004.
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
This is a common bug that users might be tempated to introduce.
And also fix two instances of this bug that were present in our
codebase, including an important one in our upgrade code path.
This commit works by vendoring the couple functions we still use from
puppetlabs stdlib (join and range), but removing the rest of the
puppetlabs codebase, and of course cleaning up our linter rules in the
process.
Fixes#7423.
This should be a nice performance improvement for browsers that
support it.
We can't yet enabled this in the Zulip on-premise nginx configuration,
because that still has to support Trusty.
This isn't super required, in that we add these repositories via
`setup-apt-repo` in any case, but the previous code was wrong and
worth fixing in any case.
This fixes a bug where our API routes for uploaded files (where we
need to use a consistent URL between session auth and API auth) were
not properly configured to pass through the API authentication headers
(and otherwise provide REST endpoint settings).
In particular, this prevented the Zulip mobile apps from being able to
access authenticated image files using these URLs.
Apparently, we can use the process group naming style of having dashes
in the names without using the explicit nun_procs feature of
supervisord configuration.
The new configuration is perfectly satisfactory, so there's no real
reason to prefer the old approach.
Previously, this script needed access to Django settings, which in
turn required access to /etc/zulip/zulip-secrets.conf. Since that
isn't world-readable, this meant that this couldn't run as an
unprivileged `nagios` user.
Fix that by just hardcoding the appropriate path under /var/log/.
When using the Python 3 typing style, Python scripts can't import from
typing inside an `if False` (in contrast, one needs to import inside
an `if False` to support the Python 3 syntax without needing
python-typing installed). So this was just incorrectly half-converted
from the Python 2 style to the Python 3 style.
Apparently, `puppet-lint` on Ubuntu trusty throws warnings for certain
quoting patterns that are OK in modern `puppet-lint`. I believe the
old Zulip code was actually correct (i.e. the old `puppet-lint`
implementation was the problem), but it seems worth changing anyway to
suppress the warnings.
We also exclude more of puppet-apt from linting, since it's
third-party code.
This was converted to Python 3 incorrectly, in a way that actually
completely broke the script (the .decode() that this adds is critical,
since 'f' != b'f').
We fix this, and also add an assert that makes the parsing code
safer against future refactors.
We fix "ERROR: safepackage not in autoload module layout" error
which was caused by a defined type "safepackage" definitation
lying in the wrong place. We refactor to create the defined type
according to puppet guidelines. Link below:
https://docs.puppet.com/puppet/2.7/lang_defined_types.html
We fix these by adding ignore statements in a bunch of files
where this error popped up. We target only specific lines using
the ignore statements and not the entire files.
In puppet/zulip_ops/files/postgresql/setup_disks.sh line 15:
array_name=$(mdadm --examine --scan | sed 's/.*name=//')
^-- SC2034: array_name appears unused. Verify use (or export if used externally).
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
In puppet/zulip_ops/files/munin-plugins/rabbitmq_connections line 66:
echo "connections.value $(HOME=$HOME rabbitmqctl list_connections | grep -v "^Listing" | grep -v "done.$" | wc -l)"
^-- SC2126: Consider using grep -c instead of grep|wc -l.
In puppet/zulip_ops/files/munin-plugins/rabbitmq_consumers line 32:
VHOST=${vhost:-"/"}
^-- SC2034: VHOST appears unused. Verify use (or export if used externally).
In puppet/zulip_ops/files/munin-plugins/rabbitmq_messages line 32:
VHOST=${vhost:-"/"}
^-- SC2034: VHOST appears unused. Verify use (or export if used externally).
In puppet/zulip_ops/files/munin-plugins/rabbitmq_messages_unacknowledged line 32:
VHOST=${vhost:-"/"}
^-- SC2034: VHOST appears unused. Verify use (or export if used externally).
In puppet/zulip_ops/files/munin-plugins/rabbitmq_messages_uncommitted line 32:
VHOST=${vhost:-"/"}
^-- SC2034: VHOST appears unused. Verify use (or export if used externally).
In puppet/zulip_ops/files/munin-plugins/rabbitmq_queue_memory line 32:
VHOST=${vhost:-"/"}
^-- SC2034: VHOST appears unused. Verify use (or export if used externally).
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
In puppet/zulip/files/postgresql/env-wal-e line 6:
export AWS_ACCESS_KEY_ID=$(crudini --get "$ZULIP_SECRETS_CONF" secrets s3_backups_key)
^-- SC2155: Declare and assign separately to avoid masking return values.
In puppet/zulip/files/postgresql/env-wal-e line 7:
export AWS_SECRET_ACCESS_KEY=$(crudini --get "$ZULIP_SECRETS_CONF" secrets s3_backups_secret_key)
^-- SC2155: Declare and assign separately to avoid masking return values.
In puppet/zulip/files/postgresql/env-wal-e line 9:
if [ $? -ne 0 ]; then
^-- SC2181: Check exit code directly with e.g. 'if mycmd;', not indirectly with $?.
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
In puppet/zulip/files/nagios_plugins/zulip_app_frontend/check_email_deliverer_process line 16:
elif [ "$(echo "$STATUS" | egrep '(STOPPED)|(STARTING)|(BACKOFF)|(STOPPING)|(EXITED)|(FATAL)|(UNKNOWN)$')" ]
^-- SC2143: Use egrep -q instead of comparing output with [ -n .. ].
^-- SC2196: egrep is non-standard and deprecated. Use grep -E instead.
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
In puppet/zulip/files/nagios_plugins/zulip_app_frontend/check_email_deliverer_backlog line 8:
cd /home/zulip/deployments/current
^-- SC2164: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
With this change, all that one needs to do to start using thumbor in
production is to set the `THUMBOR_URL` setting.
Since without THUMBOR_URL enabled, the thumbor service doesn't
actually do anything, this is pretty safe.
As part of our effort to change the data model away from each user
having a single API key, we're eliminating the couple requests that
were made from Django to Tornado (as part of a /register or home
request) where we used the user's API key grabbed from the database
for authentication.
Instead, we use the (already existing) internal_notify_view
authentication mechanism, which uses the SHARED_SECRET setting for
security, for these requests, and just fetch the user object using
get_user_profile_by_id directly.
Tweaked by Yago to include the new /api/v1/events/internal endpoint in
the exempt_patterns list in test_helpers, since it's an endpoint we call
through Tornado. Also added a couple missing return type annotations.
This commits adds the necessary puppet configuration and
installer/upgrade code for installing and managing the thumbor service
in production. This configuration is gated by the 'thumbor.pp'
manifest being enabled (which is not yet the default), and so this
commit should have no effect in a default Zulip production environment
(or in the long term, in any Zulip production server that isn't using
thumbor).
Credit for this effort is shared by @TigorC (who initiated the work on
this project), @joshland (who did a great deal of work on this and got
it working during PyCon 2017) and @adnrs96, who completed the work.
While there are legitimate use cases for embedded Zulip in an iFrame,
they're rare, and it's more important to prevent this category of
attack by default.
Sysadmins can switch this to a whitelist when they want to use frames.
We no longer need or use these, since Zulip installs a pinned version
of node directly with the scripts/setup/install-node tool.
Noticed because in the effort of adding Ubuntu bionic support, we
noticed the package names changed again.
Apparently, our nginx configuration's use of "localhost", combined
with the default in modern Linux of having localhost resolve to both
the IPv4 and IPv6 addresses on a given machine, resulted in `nginx`
load-balancing requests to a given Zulip server between the IPv4 and
IPv6 addresses. This, in turn, resulted in irrelevant 502 errors
problems every few minutes on the /events endpoints for some clients.
Disabling IPv6 on the server resolved the problem, as does simply
spelling localhost as 127.0.0.1 for the `nginx` upstreams that we
declare for proxying to non-Django services on localhost.
Now, one can just set `no_serve_uploads` in `zulip.conf` to prevent
`nginx` from serving locally uploaded files.
This should help simplify the S3 integration setup process.
This option is intended to support situations like a quick Docker
setup where doing HTTPS adds more setup overhead than it's worth.
It's not intended to be used in actual production environments.
This is preferred, since we don't currently have a way to run Django
logic on the postgres hosts with the Docker implementation.
This is a necessary part of removing the need for the docker-zulip
package to patch this file to make Zulip work with Docker.
The main purpose of this change is to make it guaranteed that
`manage.py register_server --rotate-key` can edit the
/etc/zulip/zulip-secrets.conf configuration via crudini.
But it also adds value by ensuring zulip-secrets.conf is not readable
by other users.
With modern apt-key, the fingerprints are displayed in the more fully
written-out format with spaces, and so `apt-key add` was being run
every time.
This fixes some unnecessary work being done on each puppet run on
Debian stretch.
I would have preferred to not need to do this by upgrading to
upstream, but see #7423 for notes on why that isn't going to work
(basically they broke support for puppet older than 4).
Apparently, these confused the puppet template parser, since they are
somewhat similar to its syntax, resulting in errors trying to use
these templates. It's easy enough to just remove the example
content from the base postgres config file.
We can't really do this in the zulip manifests (since it's sorta a
sysadmin policy decision), but these scripts can cause significant
load when Nagios logs into a server (because many of them take 50ms or
more of work to run). So we just get rid of them.
It seems unlikely we're going to add support for additional older
Debian-based distributions, so it makes sense to just use an else
statement. This should save a bit of busywork every time we add a new
distro.
Mostly, this involves adding the big block at the bottom and making
10 a variable so that it's easier to compare different versions of
these.
I did an audit of the configuration changes between 9.6 and 10, so
this should be fine, but it hasn't been tested yet.
Our recent addition of Content-Security-Policy to the file uploads
backend broke in-browser previews of PDFs.
The content-types change in the last commit fixed loading PDFs for
most users; but the result was ugly, because e.g. Chrome would put the
PDF previewer into a frame (so there were 2 left scrollbars).
There were two changes needed to fix this:
* Loading the style to use the plugin. We corrected this by adding
`style-src 'self' 'unsafe-inline';`
* Loading the plugin. Our CSP blocked loading the PDf viewer plugin.
To correct this, we add object-src 'self', and then limit the
plugin-type to just the one for application/pdf.
We verified this new CSP using https://csp-evaluator.withgoogle.com/
in addition to manual testing.
Previously, user-uploaded PDF files were not properly rendered by
browsers with the local uploads backend, because we weren't setting
the correct content-type.
This adds a basic Content-Security-Policy for user-uploaded avatars
served by the LOCAL_UPLOADS backend.
I think this is for now an unnecessary follow-up to
d608a9d315, but is worth doing because
we may later change what can be uploaded in the avatars directory.
This adds a basic Content-Security-Policy for user-uploaded files with local uploads.
While over time, we plan to add CSP for the main site as well, this CSP is particularly
important for the local-uploads backend, which often shares a domain with the main site.
Running this on additional machines would be redundant; additionally,
the FillState checker cron job runs only on cron systems, so this will
crash on other app frontends.
While this is a different system than I'd written up in #8004, I think
this is a better solution to the general problem of cron jobs to run
on just one server.
Fixes#8004.
Revert c8f034e9a "queue: Remove missedmessage_email_senders code."
As the comment in the code says, it ensures a smooth upgrade path
from 1.7.x; we can delete it in master after 1.8.0 is released.
The removal commit was merged early due to a communication failure.
From here on we start to authenticate uploaded file request before
serving this files in production. This involves allowing NGINX to
pass on these file requests to Django for authentication and then
serve these files by making use on internal redirect requests having
x-accel-redirect field. The redirection on requests and loading
of x-accel-redirect param is handled by django-sendfile.
NOTE: This commit starts to authenticate these requests for Zulip
servers running platforms either Ubuntu Xenial (16.04) or above.
Fixes: #320 and #291 partially.
This should make it possible to use the zulip_ops base rules
successfully on chat.zulip.org. Many of the changes in this commit
are hacks and probably can be cleaned up later, but given that we plan
to drop trusty support soon, it's likely that most of them will simply
be deleted then.
We've been running this change on zulipchat.com for a couple of months
now. Before then, we used to regularly get exceptions like this:
File "./zerver/views/messages.py", line 749, in get_messages_backend
setter=stringify_message_dict)
File "./zerver/lib/cache.py", line 275, in generic_bulk_cached_fetch
cache_set_many(items_for_remote_cache)
File "./zerver/lib/cache.py", line 215, in cache_set_many
get_cache_backend(cache_name).set_many(items, timeout=timeout)
File "/home/zulip/deployments/2017-09-28-21-04-12/zulip-py3-venv/lib/python3.5/site-packages/django/core/cache/backends/memcached.py", line 150, in set_many
self._cache.set_multi(safe_data, self.get_backend_timeout(timeout))
pylibmc.Error: error 48 from memcached_set_multi
This error means memcached was unable to find space for the new value.
You might think that because memcached provides an LRU cache, this
shouldn't happen because it would just evict something... but in fact
* memcached splits its data into "slabs" by object size, and
* until recently, once a 1MiB "chunk" is allocated to a given "slab"
i.e. size class, it wouldn't be reclaimed to allocate to another.
So once the cache has been filled up with objects of some distribution
of sizes, if some objects come in that would go in a different size
class, we have no chunks for that size class / slab, and can't get one.
And that's exactly what was happening on zulipchat.com.
Useful background can be found in
https://github.com/memcached/memcached/wiki/ServerMaint#slab-imbalancehttps://github.com/memcached/memcached/wiki/ReleaseNotes1411https://github.com/memcached/memcached/wiki/ReleaseNotes1425https://github.com/memcached/memcached/wiki/ReleaseNotes150
We're already running v1.4.25, which provides an "automover" that should
be well equipped to fix this; v1.5.0 turns it on by default.
With this commit, adopt the "modern start line" recommended in the
release notes for our v1.4.25, including turning on the automover.
This doesn't yet pass all Nagios checks correctly, and still has a few
flaws:
* The ideal setup code for the `nagios` user in the database isn't included.
* Some of the other details are a bit off; we need to split some host roles.
But it's better than nothing, and we can iterate from here.
This commit just copies all the code from MissedMessageSendingWorker
class to a new EmailSendingWorker class. All the logic to send an email
through a queue was already there. This commit only makes the logic
generic. It does so by creating a special purpose queue called
'email_senders' to send any type of email. To make
MissedMessageSendingWorker still work we derive it from
EmailSendingWorker. All the tests that were testing
MissedMessageSendingWorker now run against EmailSendingWorker.
This fixes a bug where, when a user is unsubscribed from a stream,
they might have unread messages on that stream leak. While it might
seem to be a minor problem, it can cause significant problems for
computing the `unread_msgs` data structures, since it means we need to
add an extra filter for whether the user is still subscribed, either
in the backend or in the UI.
Fixes#7095.
This causes the cron job to run only when a Zulip-managed certbot
install is actually set up.
Inside `install`, zulip.conf doesn't yet exist when we run
setup-certbot, so we write the setting later. But we also give
setup-certbot the ability to write the setting itself, so that we
can recommend it in instructions for adopting certbot in an
existing Zulip installation.
If we were making an old-fashioned webroot where hand-written static
HTML files went, somewhere under `/srv` would be most appropriate.
Here, this webroot is really more of an implementation detail of the
certbot set up by the Zulip installer/packaging, containing transient
state. So someplace under `/var` is appropriate, and specifically
under `/var/lib/zulip` in order to properly namespace it.
For background on `/var/www` and friends, see the top couple of answers
on
https://unix.stackexchange.com/questions/47436/why-web-server-var-www
For some reason, we have the USING_PGROONGA setting on in development
right now. I'm going to disable that in another commit to match what
we're doing in production, but we'll still want that setting to work
in development.
The problem here was that process_fts_updates only attempted to read
the USING_PGROONGA setting from a /etc/zulip/zulip.conf source, and
thus would just not be updating the index in development.
We weren't compressing SVG, while at the same time were incorrectly
compressing octet-stream (Which meant downloading .tar.gz files in
Chrome would get double-compressed).
Sparkle was the auto-update system used by the legacy desktop app. We
haven't been capable of using it for auto-update in years, so there's
no reason to keep around the configuration.
The new Electron app uses a different system anyway.
Whatever dist/ functionality this had in 2014 is now served by
zulip.org, and since this serves as a sample, it should be as simple
as possible.
Previously, this was more cluttered than it needed to be.
The old limits were such that these would sometimes oscillated too
high and page erroneously. The purpose of this check is to prevent
large memory leaks, and will still achieve that with a higher limit.
This allows the Nagios user to access redis without having full access
to the redis system. Ideally, this would eventually use a password
that only has statistics read access, but I'm not sure redis supports
that.
This old puppet configuration was never really used, and regardless
hardcoded an ancient zulip.net hostname. We fix this to use the
zulipconf system to get the host domain (though not, at present, the
hostname).
If a machine is configured with no swap intentationally, that
shouldn't be a Nagios problem. This alert is intended to flag
machines which are swapping.
Arguably, we should make this a symlink, but it's probably a good idea
to have every change in the production Nagios configuration go through
the zulip-puppet-apply diff experience.
Since we've found that it's fairly frequent that we want to recommend
to developers that they upgrade to a version of Zulip from Git, it
makes sense to include that by default.
It's needed by scripts/install-yarn.sh. This hadn't been discovered
because most systems end up having curl installed even though it isn't
technically a required package.
This code empirically doesn't work. It's not entirely clear why, even
having done quite a bit of debugging; partly because the code is quite
convoluted, and because it shows the symptoms of people making changes
over time without really understanding how it was supposed to work.
Moreover, this code targets an old version of the APNs provider API.
Apple deprecated that in 2015, in favor of a shiny new one which uses
HTTP/2 to meet the same needs for concurrency and scale that the old
one had to do a bunch of ad-hoc protocol design for.
So, rip this code out. We'll build a pathway to the new API from
scratch; it's not that complicated.
Whenever you restarted supervisord services, we'd end up leaking one
process from the process_queue group, eventually resulting in running
out of memory.
Fixes#6184.
This causes `upgrade-zulip-from-git`, as well as a no-option run of
`tools/build-release-tarball`, to produce a Zulip install running
Python 3, rather than Python 2. In particular this means that the
virtualenv we create, in which all application code runs, is Python 3.
One shebang line, on `zulip-ec2-configure-interfaces`, explicitly
keeps Python 2, and at least one external ops script, `wal-e`, also
still runs on Python 2. See discussion on the respective previous
commits that made those explicit. There may also be some other
third-party scripts we use, outside of this source tree and running
outside our virtualenv, that still run on Python 2.
The Zulip server's settings are only available if process-fts-updates
is running is on the same server as a Zulip production deployment. So
we instead check whether we have pgroonga configured in
/etc/zulip/zulip.conf.
On `trusty` there is no package for `boto` or `gevent` on Python 3, both
of which are dependencies of `wal-e` (at the version we've pinned.) This
is something used only on database servers and only in a replication
scenario, and it doesn't involve any of our code outside the wal-e repo,
so the Python version it uses is quite independent of the Zulip
application server itself and the rest of our code. For now, keep it
explicitly on Python 2 while we move forward for most everything else.
This script in `zulip_ops` is handy for managing EC2 instances. It uses
`boto`, which isn't available in `trusty` for Python 3. The use of
`boto` here isn't particularly deep, so we could replace it with some
more manual HTTP calls if it comes to that. For now, just mark it to
stay on Python 2 while we move the app and all the rest of the ops code
(except this and another straggler or two) to Python 3.
Also make a comment on this package in the Puppet manifest clearer
about what it specifically refers to.
This consists of the `zulip_ops::stats` Puppet class, which has apparently
not been used since 2014, and a number of files that I believe were
only used for that. Also a couple of tiny loose ends in other files.
This is only actually used in our `wal-e` setup, which is in
zulip_ops::postgres_common. (In fact the only mentions of `gevent` in
our whole Git history are for `wal-e`.) So remove where we mention it
on the broader zulip::postgres_common module, and move it where it's
needed.
This follows up on 98cef0ab4 by eliminating the only dependency
outside of the `zulip_ops` Puppet tree on a system Python-library
package which isn't available in `trusty` for Python 3.
This follows up on 207cf6302 from last year to clean up cases that
have apparently popped up since then. Invoking the scripts directly
makes a cleaner command line in any case, and moreover is essential
to how we control running a Zulip install as either Python 2 or 3
(soon, how we always ensure it runs as Python 3.)
One exception: we're currently forcing `provision` in dev to run
Python 3, while still running both Python 2 and Python 3 jobs in CI.
We use a non-shebang invocation to do the forcing of Python 3.
In some of these contexts, we may still be *using* the Python 2
version, but at least this should eliminate running into
`ImportError`s one by one in scripts that run outside a virtualenv,
as we update their shebangs to refer to Python 3.
Several Python libraries we use don't come in Python 3 versions on
trusty: gevent, boto, twisted, django, django-tagging, whisper.
The latter two don't come in Python 3 versions even on xenial.
So some work required before we can actually switch the code that
relies on those libraries to run as Python 3 -- probably the best
solution will be to backport them all in our apt repo. (All but
`whisper` are packaged in zesty; `whisper` upstream just grew Python 3
support this year.)
These are no longer useful, with our spiffy new analytics framework,
and we haven't in fact been using them for some time, while the
`active-user-stats` cron job does cause regular mail from cron.
Just delete them.
It's rare that there's value in having the log files get this big, and
these changes mean these log files should never consume more than a
few gigabytes.
And in particular, the server.log is far more important than the other
log files, and grows much faster, so we might as well spend most of
the space we are spending on that.
I estimate that the total size of log files from this is going to be
under 1-2GB, since 75MB (compressed size) * 10 (compressed logs) +
500MB (uncompressed size) = 1.25GB from server.log, and the rest is
negligible.
Fixes part of #5724.
Most of these log files are useless except a few minutes after an
event happens, and the aggregate effect of the originals size limits
meant that Zulip's logs could consume many gigabytes of disk.
The new logging strategy should limit our usage from supervisor logs
to at most 3 Gigabytes:
* 20 * 3 = 60MB per queue worker => <1GB.
* 100 * 10 = 1GB for Django and Tornado logs.
Fixes part of #5724.
While running queue processors multithreaded will limit the
performance available to very small systems, it's easy to fix that by
adding more RAM, and previously, Zulip didn't work on such systems at
all, so this is unambiguously an improvement there.
Fixes#32.
Fixes#34.
(Commit message expanded significantly by tabbott.)
Also puts them into a processing queue, though the queue processor
does nothing.
Rewritten by tabbott to avoid unnecessary database queries in
do_send_messages.
This fixes a performance problem where we were previously starting up
a full Django process (~0.7s even on a fast machine) every time a new
email came in, potentially allowing users to accidentally DoS a Zulip
server. Now, we just post over HTTPS, allowing the existing thread
pool support to do its job.
- Add script wrapper to communicate postfix pipe with django web server
over HTTP(S). It uses shared_secret authentication mode.
- Add django view to process messages from email mirror server.
- Clean management command `email-mirror`. Left just functional
for cron email processing.
- Add routes for new tornado view.
- Change pipe script in master process postfix config template
based on updated script.
- Add tests.
Tweaked by tabbott to adjust the directory and set better defaults.
Fixes#2421.
- Enable `master` parameter for `uswgi` configuration.
It allows cleaning leaked processes if the parent
process is closed unexpectedly or with SIGKILL command.
Child processes follow to the master and kill themselves
after the main process.
Fixes#3855
- Add new 'missedmessage_email_senders' queue for sending missed messages emails.
- Add the new worker to process 'missedmessage_email_senders' queue.
- Split aggregation missed messages and sending missed messages email
to separate queue workers.
- Adapt tests for sending missed emails to the new logic.
Fixes#2607
Using `supervisorctl restart all` carried longer downtime (since it
just restarts everything at the same moment) and was less under our
control; I'm not sure it had any advantages.
Since browser clients send messages via websockets and not the API,
this is an important element in making sure mission-critical Zulip
functionality is working.
I'm not altogether happy with this (a better solution would be
database-level locking), but I think it solves the immediate problem
of folks with 2 servers being very likely to run analytics on both of
them.
This results in a brief service interruption (not a graceful restart),
but fixes a bug where on a `supervisorctl restart zulip-django`, we'd
end up leaking a bunch of uwsgi processes.
The mechanism was that sending SIGHUP to uwsgi was a command for it to
gracefully restart, so it'd start doing that (whereas supervisor
expected it to be dying)... and then supervisor would start up the new
uwsgi process group, resulting in 2 uwsgi process groups running.
This, in turn, led to a memory leak that could eventually result in
OOM kills.
The old zulip_ops Nagios configuration depended on Nagios having the
ability to login as the zulip user (with essentially full write
access); this configuration is helpful for limiting nagios to special
"nagios" user with more limited credentials.