**Background**
User groups are expected to comply with the DAG constraint on their
many-to-many inter-group memberships. The check for this constraint has
to be performed recursively, so that we can find all direct and indirect
subgroups of the user group to be added.
This kind of check is vulnerable to phantom reads, which are possible at
the default READ COMMITTED isolation level, because we cannot guarantee
that the check is still valid by the time we add the subgroups to the
user group.
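For illustration, here is a minimal sketch of the unsafe flow under
READ COMMITTED; `is_reachable` and `GroupGroupMembership` are
hypothetical stand-ins, not the actual Zulip helpers and models:

```python
from django.db import transaction

def add_subgroup_unsafe(supergroup, subgroup):
    # Hypothetical names, for illustration only.
    with transaction.atomic():
        # Recursive check: the new edge must not create a cycle, i.e.
        # `supergroup` must not already be a direct or indirect
        # subgroup of `subgroup`.
        if is_reachable(from_group=subgroup, to_group=supergroup):
            raise ValueError("would create a cycle")
        # A concurrent transaction can commit a conflicting membership
        # row right here. Under READ COMMITTED, the check above never
        # saw that row (a phantom read), so the insert below may close
        # a cycle even though the check passed.
        GroupGroupMembership.objects.create(
            supergroup=supergroup, subgroup=subgroup
        )
```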
**Solution**
To avoid having another transaction concurrently update one of the
to-be-subgroups after the recursive check is done, but before the subgroup
is added, we use SELECT FOR UPDATE to lock the user group rows.
The lock needs to be acquired before any check has been conducted, at the
point where a group membership change is about to occur.
Suppose that we are adding subgroup B to supergroup A; the locking protocol
is as follows:
1. Acquire a lock for B and all its direct and indirect subgroups.
2. Acquire a lock for A.
For the removal of user groups, we acquire a lock on the user group to
be removed together with all its direct and indirect subgroups. This is
the special case A=B, which is still compliant with the protocol.
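A rough sketch of the addition path using Django's select_for_update()
(the names get_recursive_subgroups_with_self, UserGroup, and
direct_subgroups are placeholders with assumed semantics, not
necessarily the real Zulip code):

```python
from django.db import transaction

def add_subgroup(supergroup_id, subgroup_id):
    with transaction.atomic():
        # Step 1: lock B together with all of its direct and indirect
        # subgroups (SELECT ... FOR UPDATE on the user group rows).
        locked_subtree = list(
            get_recursive_subgroups_with_self(subgroup_id).select_for_update()
        )
        # Step 2: lock the supergroup A itself.
        supergroup = UserGroup.objects.select_for_update().get(id=supergroup_id)
        # Only now, with the rows locked, run the DAG check and apply
        # the membership change.
        if supergroup_id in {group.id for group in locked_subtree}:
            raise ValueError("would create a cycle")
        supergroup.direct_subgroups.add(subgroup_id)
```

The removal path acquires the same subtree lock for the group being
removed, matching the A=B special case above.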
**Error handling**
We currently rely on Postgres' deadlock detection to abort transactions
and show an error to the user. In the future, we might need some
recovery mechanism, or at least better error handling.
**Notes**
An important note is that we need to reuse the recursive CTE query that
finds the direct and indirect subgroups when applying the lock to the
rows, and the lock needs to be acquired in the same way for both the
addition and removal of direct subgroups.
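For instance, the combined query could take roughly the following SQL
shape (illustrative only; the table and column names are approximations
of the schema, embedded here as a raw query string):

```python
# Find the groups and all of their direct and indirect subgroups via a
# recursive CTE, and lock those user group rows in the same statement.
LOCK_RECURSIVE_SUBGROUPS_SQL = """
WITH RECURSIVE subgroups AS (
    SELECT id FROM zerver_usergroup WHERE id = ANY(%(group_ids)s)
    UNION
    SELECT m.subgroup_id
    FROM zerver_groupgroupmembership m
    JOIN subgroups s ON m.supergroup_id = s.id
)
SELECT g.id
FROM zerver_usergroup g
JOIN subgroups s ON g.id = s.id
ORDER BY g.id
FOR UPDATE OF g
"""

def lock_recursive_subgroups(cursor, group_ids):
    cursor.execute(LOCK_RECURSIVE_SUBGROUPS_SQL, {"group_ids": group_ids})
    return [row[0] for row in cursor.fetchall()]
```

PostgreSQL does not allow FOR UPDATE together with the UNION inside the
recursive CTE, which is why the outer query joins the CTE back against
the user group table and takes the locks there (FOR UPDATE OF g).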
User membership changes (as opposed to user group membership changes) are
not affected, and neither are read-only queries. The locks only protect
the critical regions where the user group dependency graph might violate
the DAG constraint; individual users are not involved in those regions.
**Testing**
We implement a transaction test case targeting some typical scenarios
in which an internal server error is expected to happen (meaning that the
user group view makes the correct decision to abort the transaction when
something goes wrong with the locks).
To achieve this, we add a development view intended only for unit tests.
It has a global BARRIER that can be shared across threads, so that we
can synchronize them to consistently reproduce certain potential race
conditions prevented by the database locks.
The transaction test case launches pairs of threads initiating possibly
conflicting requests at the same time. The tests are set up such that
exactly N of them are expected to fail with a certain error message
(though we don't know in advance which ones).
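A toy, self-contained version of the synchronization idea (plain
threading, no Django or test framework; the real development view and
test case are more involved):

```python
import threading

BARRIER = threading.Barrier(2)  # shared by the pair of "requests"
results = []

def fake_request(name):
    try:
        # Both threads block here until the other arrives, so the
        # possibly conflicting operations are issued at (nearly) the
        # same time.
        BARRIER.wait(timeout=5)
        # ... perform the possibly conflicting membership change ...
        results.append((name, "success"))
    except Exception as error:
        results.append((name, f"error: {error}"))

threads = [
    threading.Thread(target=fake_request, args=(name,))
    for name in ("request_a", "request_b")
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

# The real tests then assert that exactly the expected number of
# requests ended with the expected error message.
print(results)
```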
**Security notes**
get_recursive_subgroups_for_groups will no longer fetch user groups from
other realms. As a result, trying to add/remove a subgroup from another
realm results in a "UserGroup not found" error response.
We also implement subgroup-specific checks in has_user_group_access to
keep permission management in a single place. Note that the API
currently doesn't have a way to violate that check, because we are only
checking the realm ID now.
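Schematically, the subgroup-specific part of the check reduces to a
realm comparison, roughly as below (the signature and the as_subgroup
parameter are hypothetical simplifications, not the actual Zulip API):

```python
def has_user_group_access(user_group, user_profile, *, as_subgroup=False) -> bool:
    # Hypothetical simplification: for subgroup operations, the only
    # restriction the API can currently exercise is that the group
    # belongs to the acting user's realm; groups from other realms are
    # treated as nonexistent ("UserGroup not found").
    if as_subgroup:
        return user_group.realm_id == user_profile.realm_id
    # ... other access checks for the non-subgroup case ...
    return True
```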
The comment logic doesn’t make sense. Every build gets to write to
the caches; some builds do in fact add new items, and without
clean_unused_caches.py there’s no way for them to remove items.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
As discussed in the comment, it doesn't really make sense for our 4
jobs that we run in parallel for different platforms to all start by
running the backend tests. While it's true that Puppeteer will likely
fail if the backend doesn't run, and thus there's a mild prerequisite
relationship there, what is far more common is that the node tests fail
and the user doesn't get that feedback for 10 minutes unnecessarily
while all the backend jobs run; this change lets us avoid that.
This would ordinarily be determined by running ‘pnpm store path’, but
pnpm is not installed yet at that point.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
New in pnpm 8.3.0, this replaces the yarn-deduplicate check that was
removed in commit 3a27b12a7d (#24731).
Signed-off-by: Anders Kaseorg <anders@zulip.com>
Ever since we started bundling the app with webpack, there’s been less
and less overlap between our ‘static’ directory (files belonging to
the frontend app) and Django’s interpretation of the ‘static’
directory (files served directly to the web).
Split the app out to its own ‘web’ directory outside of ‘static’, and
remove all the custom collectstatic --ignore rules. This makes it
much clearer what’s actually being served to the web, and what’s being
bundled by webpack. It also shrinks the release tarball by 3%.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
Using curl to POST to the CircleCI workflow endpoint on CZO:
- Doesn't work on zulip/zulip@main (CZO runs a revert)
- Sets a bad example for other orgs
- Robs us of an opportunity to dogfood our own zulip/github-actions-zulip
Refactor the Actions workflows in this repo to report failure states
using the Zulip Action, and reimplement the related helper scripts in
Python, since they'd previously mostly shelled out to Python anyway.
Before Zulip 4.9, the Zulip install process left any already-installed
rabbitmq with whatever nodename it had previously configured. Since
this encodes the name of the host when it was installed, it does not
function well with containers.
Leave rabbitmq-server uninstalled, which lets the Zulip installation
process set the nodename to `localhost`, which ensures that it is
usable across container restarts.
Silences “Warning: 1 issue was detected with this workflow: Please
make sure that every branch in on.pull_request is also in on.push so
that Code Scanning can compare pull requests against the state of the
base branch.”
Signed-off-by: Anders Kaseorg <anders@zulip.com>
We’ve always been running CI on both push events and pull_request
events, which means it runs twice for commits that are pushed to a
pull request.
Filter the push events by branch name. Add the workflow_dispatch
event in case developers want to manually run CI on some other branch
that isn’t a pull request.
https://docs.github.com/en/actions/managing-workflow-runs/manually-running-a-workflow
Signed-off-by: Anders Kaseorg <anders@zulip.com>
Comments out the steps in 'Create cache directories' that use
`actions/cache@2`, so that CI and the production build can pass
while the GitHub support issue is processed.
See https://github.com/actions/cache/issues/794 for an upstream report.
As a consequence:
• Bump minimum supported Python version to 3.8.
• Move Vagrant environment to Ubuntu 20.04, which has Python 3.8.
• Move CI frontend tests to Ubuntu 20.04.
• Move production build test to Ubuntu 20.04.
• Move 3.4 upgrade test to Ubuntu 20.04.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
The job name is just the constant `production_build`. Renaming it to
include the OS in the key ensures that it is not shared across OSes (for
instance between `4.x` and `main`, which are now bionic and buster,
respectively), and also allows it to share caches with the install
step, which uses the OS name in that place.
As a consequence:
• Bump minimum supported Python version to 3.7.
• Move Vagrant environment to Debian 10, which has Python 3.7.
• Move CI frontend tests to Debian 10.
• Move production build test to Debian 10.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
It should not use the configured zulip username, but should instead
pull from the login user (likely `nagios`), or an explicitly provided
alternate PostgreSQL username. Failure to do so results in Nagios
failures, because the `nagios` login does not have permission to
authenticate as the `zulip` PostgreSQL user.
This requires CI changes, as the install tests install as the `zulip`
login username, which allowed Nagios tests to pass previously; with
the custom database and username, however, they must be passed to
process_fts_updates explicitly when validating the install.
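A sketch of the intended username selection (an illustrative helper,
not the actual process_fts_updates argument handling):

```python
import getpass

def choose_postgres_user(explicit_pg_user=None):
    # Prefer an explicitly provided PostgreSQL username; otherwise fall
    # back to the login user (e.g. `nagios` when run by the Nagios
    # checks), rather than the configured Zulip username.
    if explicit_pg_user is not None:
        return explicit_pg_user
    return getpass.getuser()
```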