**Background**
User groups are expected to comply with the DAG constraint for the
many-to-many inter-group membership. The check for this constraint has
to be performed recursively so that we can find all direct and indirect
subgroups of the user group to be added.
This kind of check is vulnerable to phantom reads which is possible at
the default read committed isolation level because we cannot guarantee
that the check is still valid when we are adding the subgroups to the
user group.
**Solution**
To avoid having another transaction concurrently update one of the
to-be-subgroup after the recursive check is done, and before the subgroup
is added, we use SELECT FOR UPDATE to lock the user group rows.
The lock needs to be acquired before a group membership change is about
to occur before any check has been conducted.
Suppose that we are adding subgroup B to supergroup A, the locking protocol
is specified as follows:
1. Acquire a lock for B and all its direct and indirect subgroups.
2. Acquire a lock for A.
For the removal of user groups, we acquire a lock for the user group to
be removed with all its direct and indirect subgroups. This is the special
case A=B, which is still complaint with the protocol.
**Error handling**
We currently rely on Postgres' deadlock detection to abort transactions
and show an error for the users. In the future, we might need some
recovery mechanism or at least better error handling.
**Notes**
An important note is that we need to reuse the recursive CTE query that
finds the direct and indirect subgroups when applying the lock on the
rows. And the lock needs to be acquired the same way for the addition and
removal of direct subgroups.
User membership change (as opposed to user group membership) is not
affected. Read-only queries aren't either. The locks only protect
critical regions where the user group dependency graph might violate
the DAG constraint, where users are not participating.
**Testing**
We implement a transaction test case targeting some typical scenarios
when an internal server error is expected to happen (this means that the
user group view makes the correct decision to abort the transaction when
something goes wrong with locks).
To achieve this, we add a development view intended only for unit tests.
It has a global BARRIER that can be shared across threads, so that we
can synchronize them to consistently reproduce certain potential race
conditions prevented by the database locks.
The transaction test case lanuches pairs of threads initiating possibly
conflicting requests at the same time. The tests are set up such that exactly N
of them are expected to succeed with a certain error message (while we don't
know each one).
**Security notes**
get_recursive_subgroups_for_groups will no longer fetch user groups from
other realms. As a result, trying to add/remove a subgroup from another
realm results in a UserGroup not found error response.
We also implement subgroup-specific checks in has_user_group_access to
keep permission managing in a single place. Do note that the API
currently don't have a way to violate that check because we are only
checking the realm ID now.
The comment logic doesn’t make sense. Every build gets to write to
the caches; some builds do in fact add new items, and without
clean_unused_caches.py there’s no way for them to remove items.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
As discussed in the comment, it doesn't really make sense for our 4
jobs that we run in parallel for different platforms to all start with
running the backend tests. While it's true that puppeteer will likely
fail if the backend doesn't run, and thus there's a mild prerequisite
relationship there, what is far more common is the node tests fail and
the user doesn't get that input for 10 minutes unnecessarily while all
the backend jobs run, and this change lets us avoid that.
This would ordinarily be determined by running ‘pnpm store path’, but
pnpm is not installed yet at that point.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
New in pnpm 8.3.0, this replaces the yarn-deduplicate check that was
removed in commit 3a27b12a7d (#24731).
Signed-off-by: Anders Kaseorg <anders@zulip.com>
Ever since we started bundling the app with webpack, there’s been less
and less overlap between our ‘static’ directory (files belonging to
the frontend app) and Django’s interpretation of the ‘static’
directory (files served directly to the web).
Split the app out to its own ‘web’ directory outside of ‘static’, and
remove all the custom collectstatic --ignore rules. This makes it
much clearer what’s actually being served to the web, and what’s being
bundled by webpack. It also shrinks the release tarball by 3%.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
Using curl to POST to the CircleCI workflow endpoint on CZO:
- Doesn't work on zulip/zulip@main (CZO runs a revert)
- Sets a bad example for other orgs
- Robs us of an opportunity to dogfood our own zulip/github-actions-zulip
Refactor the Actions workflows in this repo to report failure states
using the Zulip Action, and reimplement the related helper scripts in
Python, since they'd previously mostly shelled out to Python anyway.
We’ve always been running CI on both push events and pull_request
events, which means it runs twice for commits that are pushed to a
pull request.
Filter the push events by branch name. Add the workflow_dispatch
event in case developers want to manually run CI on some other branch
that isn’t a pull request.
https://docs.github.com/en/actions/managing-workflow-runs/manually-running-a-workflow
Signed-off-by: Anders Kaseorg <anders@zulip.com>
Comments out the steps in 'Create cache directories' that use
`actions/cache@2` so that the CI and production build can pass
while Github support issue is processed.
See https://github.com/actions/cache/issues/794 for an upstream report.
As a consequence:
• Bump minimum supported Python version to 3.8.
• Move Vagrant environment to Ubuntu 20.04, which has Python 3.8.
• Move CI frontend tests to Ubuntu 20.04.
• Move production build test to Ubuntu 20.04.
• Move 3.4 upgrade test to Ubuntu 20.04.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
As a consequence:
• Bump minimum supported Python version to 3.7.
• Move Vagrant environment to Debian 10, which has Python 3.7.
• Move CI frontend tests to Debian 10.
• Move production build test to Debian 10.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
This tool helps catch common typos in code and documentation, which is
particularly useful for our many contributors who are not native
English speakers.
The config is based on the codespell that I ran in
https://github.com/zulip/zulip/pull/18535.
We convert the `clean-unused-caches` script to a
python file so we can run it in provision by importing it
instead of running the script, hence saving some time.
Thumbor and tc-aws have been dragging their feet on Python 3 support
for years, and even the alphas and unofficial forks we’ve been running
don’t seem to be maintained anymore. Depending on these projects is
no longer viable for us.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
I have made `tools/setup/optimize-svg` do the SVG optimization
automatically rather than just telling you the command to run if they
need optimizing. This included adding a `--check` parameter to use in
CI to only check as we previously did rather than actually running the
optimization.
I have also made `tools/setup/optimize-svg` execute
`tools/setup/generate_integration_bots_avatars.py` once it has run the
optimization to ensure it is always ran.
This makes it one less command to run when creating an integration,
but also means that we catch instances where a PNG has just been
copied into the `static/images/integrations/bot_avatars` folder as the
only instance where this won't be run is if `optimize-svg` has not
been run which would be caught in CI.
Fixes#18183. Fixes#18184.
We had used 2>&1 to redirect stderr to stdout so it could be piped
into ts, but commit dd3cdd6ec5 (#17611)
removed ts, so we no longer need the redirection.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
This helps us reduce time to update dependencies on every CI
build since the previous containers used to take about 1 minute.
`sudo` had a bug due to which we were not able to create directories.
See https://github.com/sudo-project/sudo/issues/42.
We used these directories to restore caches.
Upgrading the focal dependencies via this commit naturally fixes that
bug.
Fixes#17854
GitHub Actions gives us 2 cpus (probably shared) to run the
jobs. Specifying 6 processes here doesn't make a difference
since both jobs run in around 5 minutes right now.