This code empirically doesn't work. It's not entirely clear why, even
having done quite a bit of debugging; partly because the code is quite
convoluted, and because it shows the symptoms of people making changes
over time without really understanding how it was supposed to work.
Moreover, this code targets an old version of the APNs provider API.
Apple deprecated that in 2015, in favor of a shiny new one which uses
HTTP/2 to meet the same needs for concurrency and scale that the old
one had to do a bunch of ad-hoc protocol design for.
So, rip this code out. We'll build a pathway to the new API from
scratch; it's not that complicated.
Whenever you restarted supervisord services, we'd end up leaking one
process from the process_queue group, eventually resulting in running
out of memory.
Fixes#6184.
This causes `upgrade-zulip-from-git`, as well as a no-option run of
`tools/build-release-tarball`, to produce a Zulip install running
Python 3, rather than Python 2. In particular this means that the
virtualenv we create, in which all application code runs, is Python 3.
One shebang line, on `zulip-ec2-configure-interfaces`, explicitly
keeps Python 2, and at least one external ops script, `wal-e`, also
still runs on Python 2. See discussion on the respective previous
commits that made those explicit. There may also be some other
third-party scripts we use, outside of this source tree and running
outside our virtualenv, that still run on Python 2.
The Zulip server's settings are only available if process-fts-updates
is running is on the same server as a Zulip production deployment. So
we instead check whether we have pgroonga configured in
/etc/zulip/zulip.conf.
On `trusty` there is no package for `boto` or `gevent` on Python 3, both
of which are dependencies of `wal-e` (at the version we've pinned.) This
is something used only on database servers and only in a replication
scenario, and it doesn't involve any of our code outside the wal-e repo,
so the Python version it uses is quite independent of the Zulip
application server itself and the rest of our code. For now, keep it
explicitly on Python 2 while we move forward for most everything else.
This script in `zulip_ops` is handy for managing EC2 instances. It uses
`boto`, which isn't available in `trusty` for Python 3. The use of
`boto` here isn't particularly deep, so we could replace it with some
more manual HTTP calls if it comes to that. For now, just mark it to
stay on Python 2 while we move the app and all the rest of the ops code
(except this and another straggler or two) to Python 3.
Also make a comment on this package in the Puppet manifest clearer
about what it specifically refers to.
This consists of the `zulip_ops::stats` Puppet class, which has apparently
not been used since 2014, and a number of files that I believe were
only used for that. Also a couple of tiny loose ends in other files.
This is only actually used in our `wal-e` setup, which is in
zulip_ops::postgres_common. (In fact the only mentions of `gevent` in
our whole Git history are for `wal-e`.) So remove where we mention it
on the broader zulip::postgres_common module, and move it where it's
needed.
This follows up on 98cef0ab4 by eliminating the only dependency
outside of the `zulip_ops` Puppet tree on a system Python-library
package which isn't available in `trusty` for Python 3.
This follows up on 207cf6302 from last year to clean up cases that
have apparently popped up since then. Invoking the scripts directly
makes a cleaner command line in any case, and moreover is essential
to how we control running a Zulip install as either Python 2 or 3
(soon, how we always ensure it runs as Python 3.)
One exception: we're currently forcing `provision` in dev to run
Python 3, while still running both Python 2 and Python 3 jobs in CI.
We use a non-shebang invocation to do the forcing of Python 3.
In some of these contexts, we may still be *using* the Python 2
version, but at least this should eliminate running into
`ImportError`s one by one in scripts that run outside a virtualenv,
as we update their shebangs to refer to Python 3.
Several Python libraries we use don't come in Python 3 versions on
trusty: gevent, boto, twisted, django, django-tagging, whisper.
The latter two don't come in Python 3 versions even on xenial.
So some work required before we can actually switch the code that
relies on those libraries to run as Python 3 -- probably the best
solution will be to backport them all in our apt repo. (All but
`whisper` are packaged in zesty; `whisper` upstream just grew Python 3
support this year.)
These are no longer useful, with our spiffy new analytics framework,
and we haven't in fact been using them for some time, while the
`active-user-stats` cron job does cause regular mail from cron.
Just delete them.
It's rare that there's value in having the log files get this big, and
these changes mean these log files should never consume more than a
few gigabytes.
And in particular, the server.log is far more important than the other
log files, and grows much faster, so we might as well spend most of
the space we are spending on that.
I estimate that the total size of log files from this is going to be
under 1-2GB, since 75MB (compressed size) * 10 (compressed logs) +
500MB (uncompressed size) = 1.25GB from server.log, and the rest is
negligible.
Fixes part of #5724.
Most of these log files are useless except a few minutes after an
event happens, and the aggregate effect of the originals size limits
meant that Zulip's logs could consume many gigabytes of disk.
The new logging strategy should limit our usage from supervisor logs
to at most 3 Gigabytes:
* 20 * 3 = 60MB per queue worker => <1GB.
* 100 * 10 = 1GB for Django and Tornado logs.
Fixes part of #5724.
While running queue processors multithreaded will limit the
performance available to very small systems, it's easy to fix that by
adding more RAM, and previously, Zulip didn't work on such systems at
all, so this is unambiguously an improvement there.
Fixes#32.
Fixes#34.
(Commit message expanded significantly by tabbott.)
Also puts them into a processing queue, though the queue processor
does nothing.
Rewritten by tabbott to avoid unnecessary database queries in
do_send_messages.
This fixes a performance problem where we were previously starting up
a full Django process (~0.7s even on a fast machine) every time a new
email came in, potentially allowing users to accidentally DoS a Zulip
server. Now, we just post over HTTPS, allowing the existing thread
pool support to do its job.
- Add script wrapper to communicate postfix pipe with django web server
over HTTP(S). It uses shared_secret authentication mode.
- Add django view to process messages from email mirror server.
- Clean management command `email-mirror`. Left just functional
for cron email processing.
- Add routes for new tornado view.
- Change pipe script in master process postfix config template
based on updated script.
- Add tests.
Tweaked by tabbott to adjust the directory and set better defaults.
Fixes#2421.
- Enable `master` parameter for `uswgi` configuration.
It allows cleaning leaked processes if the parent
process is closed unexpectedly or with SIGKILL command.
Child processes follow to the master and kill themselves
after the main process.
Fixes#3855
- Add new 'missedmessage_email_senders' queue for sending missed messages emails.
- Add the new worker to process 'missedmessage_email_senders' queue.
- Split aggregation missed messages and sending missed messages email
to separate queue workers.
- Adapt tests for sending missed emails to the new logic.
Fixes#2607
Using `supervisorctl restart all` carried longer downtime (since it
just restarts everything at the same moment) and was less under our
control; I'm not sure it had any advantages.
Since browser clients send messages via websockets and not the API,
this is an important element in making sure mission-critical Zulip
functionality is working.
I'm not altogether happy with this (a better solution would be
database-level locking), but I think it solves the immediate problem
of folks with 2 servers being very likely to run analytics on both of
them.
This results in a brief service interruption (not a graceful restart),
but fixes a bug where on a `supervisorctl restart zulip-django`, we'd
end up leaking a bunch of uwsgi processes.
The mechanism was that sending SIGHUP to uwsgi was a command for it to
gracefully restart, so it'd start doing that (whereas supervisor
expected it to be dying)... and then supervisor would start up the new
uwsgi process group, resulting in 2 uwsgi process groups running.
This, in turn, led to a memory leak that could eventually result in
OOM kills.
The old zulip_ops Nagios configuration depended on Nagios having the
ability to login as the zulip user (with essentially full write
access); this configuration is helpful for limiting nagios to special
"nagios" user with more limited credentials.
Previously, the CRITICAL state would never fire (because x > 6 =>
x > 3). Additionally, 6s is not so unusually high as to deserve being
immediately pageable.
- Add websocket client to create connection with SockJS websocket server.
It contains callback method to launch after connection setup.
- Add '--websocket' parameter to 'check_send_receive_time' script to
check websocket connection.
- Add testing websocket connection to production installation checking.
- Add cronjob to launch websocket connection nagios test.
This makes it possible for Zulip Nagios monitoring to check for
problems impacting the websockets sending code path, which is what all
web users use.
This change adds support for displaying inline open graph previews for
links posted into Zulip.
It is designed to interact correctly with message editing.
This adds the new settings.INLINE_URL_EMBED_PREVIEW setting to control
whether this feature is enabled.
By default, this setting is currently disabled, so that we can burn it
in for a bit before it impacts users more broadly.
Eventually, we may want to make this manageable via a (set of?)
per-realm settings. E.g. I can imagine a realm wanting to be able to
enable/disable it for certain URLs.