The Zulip email mirror script called by postfix had performance/load
issues, because it spent so much time on startup/import due to use of
the Zulip virtualenv.
The script was rewritten using pure python (no Django) to improve
performance.
The install script was failing on 2nd+ attempts if the first attempt
was interrupted.
This failure happened because zulip-venv already existed at
`current_venv_path`. Changing the `ln` command's flags from `-s` to
`-nsf` should make this part of the script idempotent.
Now, generate_secrets.py will never overwrite existing secrets. In
addition to being a safer model in generate, this fixes 2 significant
issues:
(1) It makes it much easier to preserve secrets like Oauth tokens in a
development environment (previously, provision would destroy them).
(2) It makes it possible to automatically add new secrets as part of
the upgrade process. In particular, this is useful for the
zulip_org_id settings.
Fixes#4797.
This fixes a significant performance issue with LaTeX rendering (and
other things that invoked node) where starting up node took a few
hundred milliseconds due to nvm initialization.
Tweaked by tabbott to avoid copying the node binary itself, instead
using a tiny wrapper script.
This is important primarily because it's possible a future version of
node will expect to find libraries/dependencies/etc. installed via NVM
at some path related to the path of the node binary itself, and that's
more guaranteed with this new model.
Fixes#4618.
Also puts them into a processing queue, though the queue processor
does nothing.
Rewritten by tabbott to avoid unnecessary database queries in
do_send_messages.
This fixes a performance problem where we were previously starting up
a full Django process (~0.7s even on a fast machine) every time a new
email came in, potentially allowing users to accidentally DoS a Zulip
server. Now, we just post over HTTPS, allowing the existing thread
pool support to do its job.
- Add script wrapper to communicate postfix pipe with django web server
over HTTP(S). It uses shared_secret authentication mode.
- Add django view to process messages from email mirror server.
- Clean management command `email-mirror`. Left just functional
for cron email processing.
- Add routes for new tornado view.
- Change pipe script in master process postfix config template
based on updated script.
- Add tests.
Tweaked by tabbott to adjust the directory and set better defaults.
Fixes#2421.
Follow-on from #2373/ PR https://github.com/zulip/zulip/pull/4316, to set an
appropriate umask also when upgrading so files have appropriate permissions.
I've tested this by starting from a clean install, deleting /srv/* so new
files are downloaded, and then doing an upgrade. It worked starting with both
a current version from master and an older release installed with a less
restrictive umask and then the umask changed.
Fixes#2373.
- Add new 'missedmessage_email_senders' queue for sending missed messages emails.
- Add the new worker to process 'missedmessage_email_senders' queue.
- Split aggregation missed messages and sending missed messages email
to separate queue workers.
- Adapt tests for sending missed emails to the new logic.
Fixes#2607
* Now queue_workers.py sorts queue names and prints them on their own
line. Previously it's output was nondeterministic.
* Simplified grep strategy for removing the "test" worker.
This list was likely to end up out of date quickly, since it wasn't
documented that you need to update it when adding a queue. The best
solution is to just not require it to be updated.
Now that we no longer use node_modules at all in production (it's only
used to generate static assets), we don't include `node_modules` in
the production tarballs, and thus we shouldn't attempt to copy
`node_modules` out of the production tarballs when installing.
Fixes a regression introduced in
d71f2e7b9b.
This saves about a minute of downtime when using
upgrade-zulip-from-git in the default configuration.
It should also save several seconds of downtime when upgrading to a
production release tarball as well.
This indirectly causes the RabbitMQ node name for new Zulip
installations to default to zulip@localhost, which would eliminate the
persistent problems we have had
Fixes#194, #465, #1375, #1751.
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
This adds a dependency on the realpath package on trusty; we could try
to remove it if needed, but given that realpath is included in
coreutils on Xenial (and presumably anything else modern), I think
it's reasonable to add it.
Fixes#1797.
Previously, success_stamp was touched whenever we used a particular
node_modules version; it makes more sense to only touch it when the
node_modules directory has actually changed.
get_package_names did not correctly strip the GitHub URLs from package
names, resulting in the "package names" for our dependencies installed
from Git being tracked with the complete sha1sum included in the name.
This meant that upgrading our virtualenvs incorrectly ended up
resorting to creating an entirely new virtualenv whenever we changed a
dependency that had previously been installed from GitHub URLs.
Now that we're no longer actively debugging this tool, there's no need
to have it print everything it's doing.
This will make `test-backend` a lot nicer to use.
generate-secrets.py now requires --development for development environment
setup or --production for production environment setup (and one of these
options is mandatory).
This solves the problem that it was somewhat easy to accidentally run
generate-secrets.py without the `-d` option while doing manual development
environment setup.
Fixes: #1911.
This is a first pass at building a framework for collecting various
stats about realms, users, streams, etc. Includes:
* New analytics tables for storing counts data
* Raw SQL queries for pulling data from zerver/models.py tables
* Aggregation functions for aggregating hourly stats into daily stats, and
aggregating user/stream level stats into realm level stats
* A management command for pulling the data
Note that counts.py was added to the linter exclude list due to errors
around %%s.
NVM takes a specific node version and installs the node package and
a corresponding compatible npm package.
We use it in a somewhat hackish way to install node/npm globally with
a pinned version, since that's how we actually want to consume node in
our development environment.
Other details:
- Travis CI now is configured to use the version of node installed by
provision; the easiest way to do this was to sabotage the existing node
installation.
- jsdom is upgraded to a current version, which both requires recent
node and also is required for the tests to pass with recent node.
This fixes running the node tests on Xenial.
Fixes#1498.
[tweaked by tabbott]
This adds a new system for copying packages from old virtualenvs that
are sufficiently similar to the new virtualenv required.
In practice, this results in a huge performance improvement for
re-provisioning Zulip development environments when the requirements
files have changed (which is the dominant performance problem with
provision today).
Fixes: #1507.
Between releases 1.3.13 and 1.4.0, local_settings.py was renamed to
prod_settings.py. The upgrade scripts were adjusted to reflect this name
change. But because the first part of the upgrade script is run with the
currently installed version's code, the symlink to /etc/zulip/settings.py is
created with the old name. This was causing upgrade-zulip-stage-2 to fail.
Now upgrade-zulip-stage-2 creates the symlink at zproject/prod_settings.py
if it doesn't already exist.
Fixes#1731.
The previous model for these Nagios checks was kinda crazy -- every
minute, we'd run a full `rabbitmctl list_consumers` for each of the
dozen+ consumers that we have, and then do the exact same parsing
logic for each to determine whether the target queue has a running
consumer to write out a state file.
Because `rabbitmctl list_consumers` takes a small amount of resources,
on systems where CPU is very limited (e.g. t2 style AWS instances),
this minor CPU wastage could be problematic.
Now we just do that `rabbitmqctl list_consumers` once per minute, and
output all the state files from a single command.
Further TODO items on this front include removing the hardcoded list
of queues.
Because rabbitmq doesn't support changing the nodename of a running
rabbitmq node, Zulip installations suffered a plague of issues where
e.g. a Zulip server would reboot, the hostname would change, and
suddenly the local rabbitmq instance being used by Zulip would stop
working.
We address this problem by using, by default, a fixed rabbitmq
nodename, but providing server administrators the option to set the
rabbitmq nodename used by Zulip however they choose.
To upgrade an existing server to use this new configuration, one will
need to add something like the following to /etc/zulip/zulip.conf:
[rabbitmq]
nodename = zulip@localhost
However, I don't believe we have the puppet code in place to make this
work correctly at initial installation without rabbitmq-server being
already installed (but off), as we can easily setup in Travis CI but I
haven't been willing to do for the installer. So for now, this just
fixes our Travis CI problems.
Fixes: #1579.
This reverts commit 3f95e567c1.
Apparently `apt-add-repository` fails periodically in CI. I suspect
this is some sort of silly networking problem, but given that all
we're saving is a few lines of code, the old version was better if
this fails basically ever.
Previously, the install script would fail if you passed various
non-default puppet rules, since the code to configure and restart
services that runs later on in the install script largely ran
unconditionally, regardless of whether the relevant service was
actually installed on the target system.
This should make the main install script reusable for installing
e.g. a dedicated Postgres server for use with Zulip.
This reverts commit f1f48f305e.
The use of sklearn unfortunately caused a substantial slowdown to the
Zulip provisioning process, which didn't seem worth it for a
relatively minor feature.
Apparently, puppet has messed up exit codes and doesn't by default
return the usual 0=success, nonzero=failure codes. By default, it
seems to always return 0; and with `--detailed-exitcodes`, it returns
the complicated thing documented in the comments.
We fix this by checking the exit code and translating it to what we
actually care about, namely whether errors occurred.
See https://tickets.puppetlabs.com/browse/PUP-2754 for details.
Fixes#1094.
In python 3, subprocess uses bytes for input and output if
universal_newlines=False (the default). It uses str for input and
output if universal_newlines=True.
Since we're dealing with strings here, add universal_newlines=True
to subprocess.check_output calls.
This is important for both ensuring the Nagios checks work correctly
in production, as well as making sure the `zulip` user can access the
virtualenv (owned by the `travis` user) in Travis CI.
The manage.py change effectively switches the Zulip production server
to use the virtualenv, since all of our supervisord commands for the
various Python services go through manage.py.
Additionally, this migrates the production scripts and Nagios plugins
to use the virtualenv as well.
Apparently, c74a74dc74 introduced a bug
where we are no longer correctly depending on build-essential as part
of the Zulip development environment installation process.
Fixes#1111.
This is needed because hash_reqs.py is used to create a virtualenv.
Currently we only use virtualenv in development, but we will soon
start using it in production. Scripts used in production should be
put in scripts/.
Camo is a caching image proxy, used in Zulip to avoid mixed-content
warnings by proxying HTTP image content over HTTPS. We've been using
it in zulip.com production for years; this change makes it available
in standalone Zulip deployments.
The main function of prompting inside `manage.py migrate` is to ask
the user if they want to delete stale content-types, which is
unimportant and likely scary, so we disable doing so.
This automatically loads settings, zerver.models.* and
zerver.lib.actions.* when you start `manage.py shell`, which should
save a bit of time basically every time someone uses it.
Fixes#275.
Previously, we used shell quoting that would result in the shell variable not
being substituted. Instead, we use `"`s that will allow for variable
substitution.
Previously these were hardcoded in zproject/settings.py to be accessed
on localhost.
[Modified by Tim Abbott to adjust comments and fix configure-rabbitmq]
A common issue when doing a Zulip upgrade is trying to pass
upgrade-zulip a tarball path under /root, which doesn't work because
the Zulip user doesn't have permission to read the tarball. We
could fix this by just unpacking the tarballs as root, but it seemed
like a nicer approach would be to archive the release tarballs
somewhere readable by the Zulip user (/home/zulip/archives) and unpack
them from there.
Fixes#208.
The point of the lock is to prevent two deployments happening at the
same time and racing with each other, not to prevent doing any future
deployments after an error happens (which is what the current
implementation does in practice).
Addresses part of #208.
The #! line processing interpreted the argument to pass to `env` as
"python2.7 -u", which obviously isn't a real program.
We fix this by setting the PYTHONUNBUFFERED environment variable
inside the program, which has the same effect.
Thanks to Dan Fedele for the bug report and suggested solution!
With this change, we are now testing the production static asset
pipeline and installation process in a new testing job (and also run
the frontend/backend tests separately).
This means that changes that break the Zulip static asset pipeline or
production installation process are more likely to fail tests. The
testing is imperfect in that it does not have proper isolation -- we
build a complete Zulip development environment and then install a
Zulip production environment on top of it, so e.g. any apt
dependencies installed for Zulip development will still be available
for the Zulip production environment. But, it's better than nothing!
A good v2 of this would be to have the production setup process just
install the minimum stuff needed to run `build-release-tarball` and
then uninstall it / clean it up so that we can do a more clear
production installation, but that's more work.
This fixes an annoying issue where one tries to rebuild the database,
and it fails due to there being existing connections.
The one thing that is potentially scary about this implementation is
that it means it's now a lot easier to accidentally drop your
production database by running the wrong script; might be worth adding
a "--force" flag controlling this behavior or something.
Thanks to Nemanja Stanarevic and Neeraj Wahi for prototypes of this
implementation! They did most of the work and testing for this.
This fixes some issues that we've had where commands will fail is
confusing ways after the database is rebuilt because data from before
the database was dropped is still in the memcached cache.
This fixes issue #123. Namely, the script in scripts/setup/install was
returning 0. Adding `set -e` and `set -o pipeline` causes the install
script to exit and return 1 if any part fails, including piping output
(`set -o pipeline` does this).
Most of our installation process is idempotent, but this step in
particular is not, so it's important to provide a clear error message
about how to proceed.
While the docu on https://www.zulip.org/server.html says:
```
cd /root/zulip
./scripts/setup/install
```
This script downloads the `python-django-guardian_1.3-1~zulip4_all.deb` file to current working dir (`/root/zulip` if you follow the docu), but tries to install it from /root/.
This fails obviously. So i changed the download location to /tmp/.
If there's a problem with Django settings then RMQPW would just be
empty, causing more confusing errors downstream.
(imported from commit 5948b1a15eb92fc032ea02e499be58365d8e9ecb)
Source LOCAL_DATABASE_PASSWORD and INITIAL_PASSWORD_SALT from the secrets file.
Fix the creation of pgpass file.
Tim's note: This will definitely break the original purpose of the
tool but it should be pretty easy to add that back as an option.
(imported from commit 8ab31ea2b7cbc80a4ad2e843a2529313fad8f5cf)
The manual step here is that we need to do the `puppet apply` before
pushing this commit, or `restart-server` will crash.
Previously we shut down everything in one group, which performed
poorly with supervisor's bad performance on restarting many daemons at
once. Now we shut down the unimportant stuff, then the important
stuff, bring back the important stuff, and then bring back the
unimportant stuff.
This new model has a little over 5s of downtime for the core
user-facing daemons -- which is still far more than would be ideal,
but a lot less than the 13s or so that we had before.
Here's some logs with the current setup for the tornado/django downtime:
2013-12-19 20:16:51,995 restart-server: Stopping daemons
2013-12-19 20:16:53,461 restart-server: Starting daemons
2013-12-19 20:16:57,146 restart-server: Starting workers
Compare with the behavior on master today:
2013-12-19 20:21:45,281 restart-server: Stopping daemons
2013-12-19 20:21:49,225 restart-server: Starting daemons
2013-12-19 20:21:58,463 restart-server: Done!
(imported from commit b2c1ba77f3dc989551d0939779208465a8410435)