This is the analog of 7b2c9223e7 for the
emoji cache; the only difference is that the existing code was working
correctly. It's still worth changing for improved robustness.
We saw issues with /srv/zulip_npm_cache being cleaned incorrectly by
this tool in production (more precisely, we noticed broken symlinks to
those directories, even from the current deployment). Print-debugging
showed that older deployments were indeed being ignored, because the
logic in `get_caches_in_use` was totally broken (this was somewhat
masked because we also keep the last week's deployments).
The specific bug here turned out to be that we weren't passing the
`production` argument to generate_sha1sum_node_modules, but the
broader problem is that this logic isn't robust to changes in the
hashing algorithm.
Fix this by replacing the broken logic that tried to recompute the
correct hash for each deployment with simply checking the symlink
inside the deployment, letting it self-report which cache it uses.
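A minimal sketch of the symlink-based approach (function and path
names here are illustrative, not the exact code):

    import os

    def get_caches_in_use(deployments_dir):
        caches_in_use = set()
        for deployment in os.listdir(deployments_dir):
            node_modules = os.path.join(deployments_dir, deployment,
                                        "node_modules")
            if os.path.islink(node_modules):
                # The symlink target is the cache this deployment uses,
                # so we never need to recompute its hash.
                caches_in_use.add(os.path.realpath(node_modules))
        return caches_in_use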
We can't easily do this same change for clean-venv-cache, because we
use multiple virtualenvs there. But a similar change could be useful
for the emoji cache as well.
Fixes #8116.
This is apparently installed by the perl package; I hadn't even known
it existed. We of course want to use the sha1sum command from
coreutils.
Fixes #8836.
This commit switches our emoji infrastructure to use 256-color indexed
64px spritesheets. Earlier we were using non-indexed 32px
spritesheets, which were blurry on high-DPI displays. These indexed
spritesheets not only provide a crisper display but are also smaller
in size.
This commit also removes the `emoji-datasource` package as a dependency
as all the data is now sourced from individual datasource packages.
Fixes: #7862.
The installation isn't really complete here, and wasn't even when this
was the only success case; the instructions we're giving are for the
next step in the installation.
These instructions don't say what to do in an actual use case for this
option, but decent instructions there will require having a concrete
use case in front of us and designing the flow for it. At this stage,
just say where we are in the normal flow, and an admin who's chosen to
go off that flow can figure out how they want to vary it from there.
This flips the experimental `--express` option to be the default.
We retain the old behavior, where the script exits before
`initialize-database`, as an option `--no-init-db`; it might be useful
in e.g. a migration scenario (from a Zulip install elsewhere, or
another chat system) where the admin wants to set up the database
separately.
The install instructions are adjusted to match, getting shorter by two
steps and a bunch of words. I think this opens up opportunities to
refactor the text to simplify things further, too, but I'm leaving
that for another commit.
Also tweak the "production" test suite to match.
Kind of unfortunate because the `sudo` interface for running a command
is objectively better -- a list of arguments, rather than a string to
be re-parsed by the shell. But some bare-bones machine images lack
`sudo`, so this makes things a bit more portable.
We do the following here:
* Remove libjasper-dev from THUMBOR_VENV_DEPENDENCIES.
  Reason: We never really needed this dependency for using thumbor.
  It was a dependency for using OpenCV as the imaging engine in
  thumbor, but we use PIL (now Pillow) as the imaging engine.
* Add zlib1g-dev and libfreetype6-dev to THUMBOR_VENV_DEPENDENCIES.
  Reason: These are dependencies of Pillow which are required for
  Pillow to function. Since we use Pillow as thumbor's imaging
  engine, we need these. Things didn't break before this because we
  also use Pillow in the development environment and have these
  dependencies installed via VENV_DEPENDENCIES as well.
We'll make this the normal behavior soon, once we're satisfied with
our arrangements for sending the admin straight to realm creation and
using the app without configuring email. The instructions in the docs
will also have to change accordingly, of course.
This causes us to give an error if you pass the installer any
positional arguments, e.g. with `--`. There's no reason you'd want
to do this, but I accidentally did it by passing an extra `--` to
the `test-install/install` wrapper and spent a few minutes on
confused debugging.
This gives us just one way of adopting a self-signed cert, rather than
one script which would generate a new one and an option to another
which would symlink to the system's snakeoil cert. Now those two
codepaths converge, and do the same thing.
The small advantage of generating our own over the alternative is that
it lets us set the name in the cert to EXTERNAL_HOST, rather than the
system's hostname as embedded in the system snakeoil certs. Not a big
deal, but might make things go slightly smoother if some browsers are
lenient (in a way that they probably shouldn't be).
Before this fix, the installer has an extremely annoying bug where
when run inside a container with `lxc-attach`, when the installer
finishes, the `lxc-attach` just hangs and doesn't respond even to
C-c or C-z. The only way to get the terminal back is to root around
from some other terminal to find the PID and kill it; then run
something like `stty sane` to fix the messed-up terminal settings
left behind.
After bisecting pieces of the install script to locate which step
was causing the issue, it comes down to the `service camo restart`.
The comment here indicates that we knew about an annoying bug here
years ago, and just swept it under the rug by skipping this step
when in Travis. >_<
The issue can be reproduced by running simply `service camo restart`
under `lxc-attach` instead of the installer; or `service camo start`,
following a `service camo stop`. If `lxc-attach` is used to get an
interactive shell, these commands appear to work fine; but then when
that shell exits, the same hang appears. So, when we start camo
we're evidently leaving some kind of mess that entangles the daemon
with our shell.
Looking at the camo initscript where it starts the daemon, there's
not much code, and one flag jumps out as suspicious:
    start-stop-daemon --start --quiet --pidfile $PIDFILE -bm \
        --exec $DAEMON --no-close -c nobody --test > /dev/null 2>&1 \
        || return 1
    start-stop-daemon --start --quiet --pidfile $PIDFILE -bm \
        --no-close -c nobody --exec $DAEMON -- \
        $DAEMON_ARGS >> /var/log/camo/camo.log 2>&1 \
        || return 2
What does `--no-close` do?
    -C, --no-close
        Do not close any file descriptor when forcing the daemon
        into the background (since version 1.16.5). Used for
        debugging purposes to see the process output, or to
        redirect file descriptors to log the process output.
And in fact, looking in /proc/PID/fd while a hang is happening finds
that fd 0 on the camo daemon process, aka stdin, is connected to our
terminal.
So, stop that by denying the initscript our stdin in the first place.
This fixes the problem.
The Debian maintainer turns out to be "Zulip Debian Packaging Team",
at debian@zulip.com; so this package and its bugs are basically ours.
This provides a major simplification for non-production installs,
including our own testing (it's already in both the test-install
harness script and the "production" test suite) as well as potential
admins evaluating Zulip.
Ultimately this should probably be the default behavior, with perhaps
something shown to admins on the web as a reminder and link to help on
installing a better certificate. For now, pending working through
that, just get the behavior in and leave it opt-in.
The third-party `install-yarn.sh` script uses `curl`, and we invoke it
in `install-node`. So we need to install it as a dependency.
We've mostly gotten away with this because it's common for `curl` to
already be installed; but it isn't always.
Apparently, this was checking the wrong path in Travis CI, and thus
never actually running (meaning we'd accumulate every `node_modules`
directory ever in the Travis caches, which in turn resulted in very
slow builds).
This updates commit 11ab545f3 "install: Set the locale ..."
to be somewhat cleaner, and to explain more in the commit message.
In some environments, either pip itself fails or some packages fail to
install, and setting the locale to en_US.UTF-8 resolves the issue.
We heard reports of this kind of behavior with at least two different
sets of symptoms, with 1.7.0 or its release candidates:
https://chat.zulip.org/#narrow/stream/general/subject/Trusty.201.2E7.20Upgrade/near/302214
https://chat.zulip.org/#narrow/stream/production.20help/subject/1.2E6.20to.201.2E7/near/306250
In all reported cases, commit 11ab545f3 or equivalent fixed the issue.
Setting LC_CTYPE is redundant when also setting LC_ALL, because LC_ALL
overrides all `LC_*` environment variables; so skip that. Also move
the line in `install` to a more appropriate spot, and adjust the
comments.
This commit renames the various source requirements files like
`dev.txt`, `mypy.txt`, etc., to `dev.in`, `mypy.in`, etc., and the
various locked requirements files like `dev_lock.txt`,
`mypy_lock.txt`, etc., to `dev.txt`, `mypy.txt`, etc. This should
help emphasize to the user that the `*.in` files are the actual input
to the `update-locked-requirements` tool, which should be run after
updating any of them.
This commit adds the new dependencies needed for running thumbor,
along with the script for creating the virtualenv for thumbor.
Note: Thumbor uses python2 and thus has a separate virtualenv
dedicated to it.
Credits to @TigorC and @joshland as well for their work on this.
This allows the installer to continue using this script for the
`standalone` method, while the no-argument form now uses the same
`webroot` method as the renewal cron job, suitable for running
by hand to adopt Certbot after initial install.
Certbot replaces the cert files under /etc/letsencrypt/live/, which
our nginx config refers to via symlinks; but it doesn't tell nginx
there's been an update, so nginx keeps serving the old cert.
This is fine as long as nginx is restarted, or just told to
reload its config, at some point before the cert actually
expires about 30 days later. Which is probably the common
case, but of course we should make it just work. So, if we
actually renew a cert, tell nginx to reload its config now.
This causes the cron job to run only when a Zulip-managed certbot
install is actually set up.
Inside `install`, zulip.conf doesn't yet exist when we run
setup-certbot, so we write the setting later. But we also give
setup-certbot the ability to write the setting itself, so that we
can recommend it in instructions for adopting certbot in an
existing Zulip installation.
Except in:
- docs/writing-bots-guide.md, because bots are supposed to be Python 2
compatible
- puppet/zulip_ops/files/zulip-ec2-configure-interfaces, because this
script is still on python2.7
- tools/lint
- tools/linter_lib
- tools/lister.py
For the latter two, because they might be yanked away to a separate repo
for general use with other FLOSS projects.
This didn't work at all when one did a `vagrant destroy` and then
`vagrant up`, because the cache state would be preserved even though
the machine is gone.
Fixes #5981.
This should make it easier to script the installation process;
conveniently, these are also the options one would want for the
--certbot option.
Significantly modified by tabbott to have a sane interface, include
--help, and avoid printing all the `set -x` garbage before the usage
notices.
Based on #450, with commits
restructured by Rein Zustand.
Tweaks by Rein Zustand:
- Replace configure-cert with generate-self-signed-certs
- `mv scripts/lib/create-zulip-admin.sh scripts/lib/create-zulip-admin`
We were checking whether an item in the deployments directory is a
directory, but were using its relative path; so the check returned
False for every item, regardless of whether it was a directory,
whenever the script was invoked from somewhere other than the
deployments directory.
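The shape of the bug, reduced to a hypothetical snippet:

    import os

    DEPLOYMENTS_DIR = "/home/zulip/deployments"  # illustrative path

    for entry in os.listdir(DEPLOYMENTS_DIR):
        # Broken: a bare name resolves relative to the current working
        # directory, so this was False for everything unless the script
        # happened to be invoked from DEPLOYMENTS_DIR:
        #     os.path.isdir(entry)
        # Fixed: check the full path instead.
        if os.path.isdir(os.path.join(DEPLOYMENTS_DIR, entry)):
            pass  # treat as a deployment directory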
This commit re-arranges the arguments of the `purge_unused_caches()`
function, in order to remain consistent with other similar functions
in the library like `may_be_perform_caching()`.
This function will replace the repetitive definitions of
`parse_args()` in the various cache-cleaning scripts. It also adds a
`--verbose` argument to the parser.
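A sketch of the shared parser (the flag names other than `--verbose`
are assumptions based on the surrounding commits):

    import argparse

    def parse_cache_script_args(description):
        parser = argparse.ArgumentParser(description=description)
        parser.add_argument("--threshold", dest="threshold_days",
                            type=int, default=14,
                            help="Threshold in days for purging a cache")
        parser.add_argument("--dry-run", action="store_true",
                            help="Print what would be removed, "
                                 "without removing it")
        parser.add_argument("--verbose", action="store_true",
                            help="Print detail about what is kept "
                                 "and purged")
        return parser.parse_args()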
Historically, one needed to build a release tarball in order to
use/test the Zulip installer, even though one could upgrade a Zulip
server from Git. However, the only reason for that requirement was
that we didn't run `tools/update-prod-static` as part of the install
script when it's required. A good test for that case is whether we're
in a Git repository, but a better one is to check whether the
prod-static content exists in the tarball paths.
Fixes #3704.
This enforces our use of a consistent style in how we access Python
modules; "from os.path import dirname" is a particularly popular
abbreviation inconsistent with our style, and so it deserves a lint
rule.
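A hedged sketch of what such a rule might look like (the exact
rule-dict format in tools/linter_lib may differ):

    # Hypothetical rule entry; pattern and description are illustrative.
    python_rules = [
        {'pattern': r'^from os\.path import',
         'description': "Use 'import os.path' and the os.path.foo form "
                        "rather than 'from os.path import ...'"},
    ]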
Commit message and error text tweaked by tabbott.
Fixes #6543.
Based on the `dry_run` flag, this function either purges the list of
directories passed to it, or prints a listing of the directories it
would have purged or kept back, had the `dry_run` flag been false.
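A sketch matching that description (function and argument names
assumed):

    import shutil

    def purge_directories(dirs_to_purge, dirs_to_keep, dry_run):
        if dry_run:
            print("Performing a dry run...")
        for directory in dirs_to_purge:
            if dry_run:
                print("Would have removed %s" % (directory,))
            else:
                shutil.rmtree(directory)
        if dry_run:
            for directory in dirs_to_keep:
                print("Would have kept back %s" % (directory,))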
Apparently, the refactoring to make this script only run when changes
are present was buggy: if `apt-get update` failed, running provision
again wouldn't rerun `apt-get update`, resulting in a broken state
that requires expertise to fix. This closes that gap by using a stamp
file to ensure we always successfully update apt before proceeding.
It doesn't fix existing installations.
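The idea, in a simplified sketch (the stamp path and helper name are
hypothetical):

    import os
    import subprocess

    APT_STAMP = "/var/lib/zulip/apt-update-stamp"  # hypothetical path

    def run_apt_update_if_needed(sources_changed):
        # A missing stamp means the last apt-get update never ran or
        # failed partway through, so retry even if sources are unchanged.
        if not sources_changed and os.path.exists(APT_STAMP):
            return
        if os.path.exists(APT_STAMP):
            os.remove(APT_STAMP)
        subprocess.check_call(["apt-get", "update"])
        # Write the stamp only after apt-get update succeeds.
        open(APT_STAMP, "w").close()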
Modify `generate_sha1sum_node_modules()` such that it can calculate
the hash for a particular installation.
Tweaked by tabbott to use os.path.realpath in the setup_dir
calculation, to ensure it's consistent.
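Illustrative shape of the function after the change (the real hash
covers more inputs; this is a sketch, not the actual implementation):

    import hashlib
    import os

    def generate_sha1sum_node_modules(setup_dir=None, production=False):
        if setup_dir is None:
            # realpath keeps the hash consistent however the caller
            # referred to the directory.
            setup_dir = os.path.realpath(os.getcwd())
        sha1sum = hashlib.sha1()
        with open(os.path.join(setup_dir, "package.json"), "rb") as f:
            sha1sum.update(f.read())
        # The production flag changes what gets installed, so it must
        # feed into the hash too.
        sha1sum.update(str(production).encode("utf-8"))
        return sha1sum.hexdigest()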
This should make it much more likely that users see this before
waiting a long time for other things to happen, since the `apt-get
dist-upgrade` step is really slow. We can't move it any further
toward the top, since this check requires `lsb_release` to be
installed.
Given the path of the directory containing all the caches, a list of
caches in use, and a threshold in days, this function returns a list
of caches which can be removed safely.
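A sketch of the signature and logic (details assumed):

    import os
    import time

    def get_caches_to_be_purged(caches_dir, caches_in_use, threshold_days):
        caches_to_purge = []
        threshold_time = time.time() - 60 * 60 * 24 * threshold_days
        for cache_name in os.listdir(caches_dir):
            cache = os.path.join(caches_dir, cache_name)
            if cache in caches_in_use:
                continue
            # Keep recently created caches even if nothing uses them yet.
            if os.path.getctime(cache) < threshold_time:
                caches_to_purge.append(cache)
        return caches_to_purge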
This function returns a list of all the deployment directories which
are newer than some threshold number of days, including the
`/root/zulip` directory if it exists.
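A sketch of the described behavior (the timestamp format for
deployment directory names is an assumption):

    import datetime
    import os

    TIMESTAMP_FORMAT = "%Y-%m-%d-%H-%M-%S"  # assumed name format

    def get_recent_deployments(deployments_dir, threshold_days):
        recent = []
        threshold = (datetime.datetime.now()
                     - datetime.timedelta(days=threshold_days))
        for dir_name in os.listdir(deployments_dir):
            target_dir = os.path.join(deployments_dir, dir_name)
            if not os.path.isdir(target_dir):
                continue
            try:
                date = datetime.datetime.strptime(dir_name, TIMESTAMP_FORMAT)
                if date >= threshold:
                    recent.append(target_dir)
            except ValueError:
                # Not a timestamped deployment directory; err on the
                # side of keeping it.
                recent.append(target_dir)
        if os.path.exists("/root/zulip"):
            recent.append("/root/zulip")
        return recent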
This saves us from spending 200-250ms of CPU time importing Django
again just to log that we're running a management command. On
`scripts/restart-server`, this saves us from one thundering herd of
Django startups when all the queue workers are restarted; but there's
still the Django startup for the `manage.py` process itself for each
worker, so on a machine with e.g. 2 (virtual) cores the restart is
still painful.