Migrate the `id` column of every table which is not referenced by a
foreign key from the Message or UserMessage tables (and would thus
require walking those) to `bigint`. This is done by removing explicit
`BigAutoField`s, trading them for explicit `AutoField`s on the tables
which are not to be migrated, while updating `DEFAULT_AUTO_FIELD` to
the new default.
In general, the tables adjusted in this commit are small tables -- at
least compared to Messages and UserMessages.
Many-to-many tables without their own model class are adjusted by a
custom Operation, since they do not automatically pick up migrations
when `DEFAULT_AUTO_FIELD` changes[^1].
Note that this does multiple scans over tables to update foreign
keys[^2]. Large installs may wish to hand-optimize this using the
output of `./manage.py sqlmigrate` to join multiple `ALTER TABLE`
statements into one, to speed up the migration. This is unfortunately
not possible to do generically, as constraint names may differ between
installations.
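
As a rough illustration of that hand-optimization, the sketch below
groups simple `ALTER TABLE` statements by table and joins their actions
with commas, a form PostgreSQL accepts in a single `ALTER TABLE`. The
helper name `merge_alter_tables` and the sample table and column names
are hypothetical; real `sqlmigrate` output also contains other
statements, and the merged result should be reviewed by hand.

```python
import re

# Matches statements of the form: ALTER TABLE "name" <action>;
ALTER_RE = re.compile(
    r'^ALTER TABLE\s+("?[\w.]+"?)\s+(.*?);?\s*$', re.IGNORECASE | re.DOTALL
)


def merge_alter_tables(statements):
    """Combine ALTER TABLE statements on the same table into one each.

    Hypothetical helper, not part of Zulip or Django. Tables keep the
    order in which they first appear; actions keep their order per table.
    """
    actions = {}
    for statement in statements:
        match = ALTER_RE.match(statement.strip())
        if match is None:
            raise ValueError(f"not a simple ALTER TABLE statement: {statement!r}")
        table, action = match.groups()
        actions.setdefault(table, []).append(action)
    return [
        "ALTER TABLE " + table + " " + ", ".join(acts) + ";"
        for table, acts in actions.items()
    ]


# Made-up statements in the style of `./manage.py sqlmigrate` output.
statements = [
    'ALTER TABLE "zerver_example" ALTER COLUMN "id" TYPE bigint;',
    'ALTER TABLE "zerver_other" ALTER COLUMN "example_id" TYPE bigint;',
    'ALTER TABLE "zerver_example" ALTER COLUMN "parent_id" TYPE bigint;',
]
merged = merge_alter_tables(statements)
for line in merged:
    print(line)
```

Grouping changes the relative order of statements across tables, so
this is only safe when the statements do not depend on each other.
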
This leaves the following primary keys as non-`bigint`:
- `auth_group.id`
- `auth_group_permissions.id`
- `auth_permission.id`
- `django_content_type.id`
- `django_migrations.id`
- `otp_static_staticdevice.id`
- `otp_static_statictoken.id`
- `otp_totp_totpdevice.id`
- `two_factor_phonedevice.id`
- `zerver_archivedmessage.id`
- `zerver_client.id`
- `zerver_message.id`
- `zerver_realm.id`
- `zerver_recipient.id`
- `zerver_userprofile.id`
[^1]: https://code.djangoproject.com/ticket/32674
[^2]: https://code.djangoproject.com/ticket/24203
This helps prevent wraparound on exceedingly large and old installs,
particularly Zulip Cloud. These are relatively simple migrations
since they are not referenced by any other tables; however, they are
quite large, and are actively used from Django by running servers,
making this not a migration which is possible to run without stopping
the server.
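
For a sense of scale, the arithmetic below compares the id space of a
4-byte signed `integer` with that of an 8-byte signed `bigint`, as in
PostgreSQL. The insert rate of 100 rows per second is a made-up figure
for illustration, not a measurement from any installation.

```python
INT_MAX = 2**31 - 1     # largest value of a 4-byte signed integer
BIGINT_MAX = 2**63 - 1  # largest value of an 8-byte signed integer

SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def years_until_exhausted(max_id, rows_per_second):
    """Years of continuous inserts before ids reach max_id."""
    return max_id / rows_per_second / SECONDS_PER_YEAR


# Hypothetical sustained insert rate.
rate = 100
print(f"integer: {years_until_exhausted(INT_MAX, rate):.2f} years")
print(f"bigint:  {years_until_exhausted(BIGINT_MAX, rate):.3g} years")
```
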
Use the escape hatch in the previous commit to temporarily pause
analytics writes while the migration happens. This should make the
migration transparent to users, at the small cost of an artificial dip
in statistics (specifically, to push notification counts, and unread
message counts) while the migration runs.
With `realm_active_humans` no longer dependent on the per-user rows,
there is no reason to preserve them -- any measure of "was a user
active" should look directly at the much richer RealmAuditLog. This
removes the bulk of the UserCount table, since all of the remaining
rows require user interaction of some sort to be produced.
Due to a bug[^1] in Django 4.2, fixed in 4.2.6, queries using
`__isnull` added an unnecessary cast. This cast was _also_ used in
`WHERE` clauses for partial indexes. This means that partial indexes
created before Zulip was using Django 4.2 (i.e. before Zulip Server
7.0 or 2c20028aa4) will not be used when the server is using Django
4.2.0 through 4.2.5 -- and, conversely, that indexes created while
Zulip had those versions of Django (i.e. Zulip Server 7.0 through 7.4
or 7807bff526) will not be used later.
We re-create the indexes, to ensure that users that installed Zulip
after Zulip Server 7.0 / 2c20028aa4 and before Zulip Server 7.5 /
7807bff526 have indexes which can be used by current Django. This
is useless work for some installations, but most analytics tables are
not large enough for this to take significant time.
[^1]: https://code.djangoproject.com/ticket/34840
Black 23 enforces some slightly more specific rules about empty line
counts and redundant parenthesis removal, but the result is still
compatible with Black 22.
(This does not actually upgrade our Python environment to Black 23
yet.)
Signed-off-by: Anders Kaseorg <anders@zulip.com>
This commit updates the 0015_clear_duplicate_counts migration to also
remove duplicate counts in StreamCount, UserCount, and InstallationCount.
Fixes https://github.com/zulip/docker-zulip/issues/266
Fixes #2665.
Regenerated by tabbott with `lint --fix` after a rebase and change in
parameters.
Note from tabbott: In a few cases, this converts technical debt in the
form of unsorted imports into different technical debt in the form of
our largest files having very long, ugly import sequences at the
start. I expect this change will increase pressure for us to split
those files, which isn't a bad thing.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
Generated by `pyupgrade --py3-plus --keep-percent-format` on all our
Python code except `zthumbor` and `zulip-ec2-configure-interfaces`,
followed by manual indentation fixes.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
Replaced unique_together with UniqueConstraint in models where it
covered nullable fields, since the database indexes created by
unique_together do not enforce uniqueness where subgroup=None. Added
conditional unique indexes to reject such invalid duplicate Count data.
Added 0015_clear_duplicate_counts migration to handle existing
data that violates the constraints.
Also corrected a test case in test_counts.py which didn't clear its
state properly and thus was accidentally taking advantage of this
database schema bug.
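
The NULL behavior behind this change can be reproduced with the
standard library's `sqlite3` module, which, like PostgreSQL by default,
treats NULLs as distinct in a plain UNIQUE constraint. This is a
minimal sketch: the `realm_count` table, its columns, and the index
name are simplified stand-ins, and keeping the lowest `id` of each
duplicate group is an assumption, not necessarily what the real
migration does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE realm_count (
        id INTEGER PRIMARY KEY,
        property TEXT NOT NULL,
        subgroup TEXT,
        end_time TEXT NOT NULL,
        value INTEGER NOT NULL,
        UNIQUE (property, subgroup, end_time)
    )
    """
)

insert = (
    "INSERT INTO realm_count (property, subgroup, end_time, value) "
    "VALUES (?, ?, ?, ?)"
)
row = ("example_stat", None, "2020-01-01T00:00", 5)
conn.execute(insert, row)
conn.execute(insert, row)  # accepted: the NULL subgroups count as distinct
duplicates_before = conn.execute("SELECT COUNT(*) FROM realm_count").fetchone()[0]

# Existing duplicates must be cleared before a stricter index can be built.
# Assumption: keep the row with the lowest id in each group.
conn.execute(
    """
    DELETE FROM realm_count WHERE id NOT IN (
        SELECT MIN(id) FROM realm_count GROUP BY property, subgroup, end_time
    )
    """
)

# Conditional (partial) unique index covering only rows with a NULL subgroup.
conn.execute(
    """
    CREATE UNIQUE INDEX unique_realm_count_null_subgroup
    ON realm_count (property, end_time) WHERE subgroup IS NULL
    """
)

try:
    conn.execute(insert, row)
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

rows_after = conn.execute("SELECT COUNT(*) FROM realm_count").fetchone()[0]
```
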
This field wasn't used for anything, and I think it has very limited
use for debugging, since fundamentally, it'll almost always have a
value within the hour of the actual timestamp in FillState, and any
more fine-grained logging we might want would be available in the
analytics job's own logs.
The proximate reason to remove it is that apparently Django's
model_to_dict doesn't support auto_now fields, and that caused some
trouble when working on adding more complete import/export support for
analytics data.
This is a preparatory commit for using isort for sorting all of our
imports, merging changes to files where we can easily review the
changes as something we're happy with.
These are also files with relatively little active development, which
means we don't expect much merge conflict risk from these changes.
Analytics database tables are getting big, and so we're likely moving to a
model where ~all stats are day stats, and we keep hourly stats only for the
last N days.
Also changed the name because:
* messages_sent_* suggests the counts (summed over subgroup) should be the
same as the other messages_sent stats, but they are different (these don't
include PMs).
* messages_sent_by_stream:is_bot:day is longer than 32 characters, the max
allowable length for a BaseCount.property.
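
The length problem is easy to check directly. The constant name below
is hypothetical; the 32-character limit is the one stated above.

```python
MAX_PROPERTY_LENGTH = 32  # limit quoted in this commit message

old_name = "messages_sent_by_stream:is_bot:day"
excess = len(old_name) - MAX_PROPERTY_LENGTH
print(len(old_name), excess)
```
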
Includes a database migration to remove the old stat from the analytics
tables.
Includes a database migration. The interval field was originally there to
facilitate time aggregation (e.g. aggregate_hour_to_day), but we now do such
aggregations in views code or in the frontend.
This is a major change to the analytics schema, and is the first step in a
number of refactorings and performance improvements. For instance, it allows:
* Grouping sets of similar CountStats in the *Count tables. For instance,
active{_humans,_bots} will now have the same property, but have different
subgroup values.
* Combining queries that differ only in the value of one filter clause, so
that we make fewer passes through the zerver tables. For instance, instead
of running a query for each of messages_sent_to_public_streams and
messages_sent_to_private_streams, we can now run a single query with a
group by on Stream.invite_only, and store the group by value in the
subgroup column.
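
The second point can be sketched with the standard library's `sqlite3`
module. The `stream` and `message` tables here are toy stand-ins for
the zerver tables, and the "true"/"false" subgroup labels are an
assumption about how the group-by value might be stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stream (id INTEGER PRIMARY KEY, invite_only INTEGER NOT NULL);
    CREATE TABLE message (
        id INTEGER PRIMARY KEY,
        stream_id INTEGER NOT NULL REFERENCES stream (id)
    );
    INSERT INTO stream (id, invite_only) VALUES (1, 0), (2, 1);
    INSERT INTO message (stream_id) VALUES (1), (1), (1), (2);
    """
)


def count_with_filter(invite_only):
    """Old style: one pass over the tables per filter value."""
    return conn.execute(
        """
        SELECT COUNT(*) FROM message
        JOIN stream ON stream.id = message.stream_id
        WHERE stream.invite_only = ?
        """,
        (invite_only,),
    ).fetchone()[0]


separate = {"false": count_with_filter(0), "true": count_with_filter(1)}

# New style: a single pass, with the group-by value kept as the subgroup.
rows = conn.execute(
    """
    SELECT stream.invite_only, COUNT(*) FROM message
    JOIN stream ON stream.id = message.stream_id
    GROUP BY stream.invite_only
    """
).fetchall()
combined = {("true" if invite_only else "false"): count for invite_only, count in rows}
```

Both approaches yield the same counts; the grouped query reads the
tables once instead of once per filter value.
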
Adds two simplifying assumptions to how we process analytics stats:
* Sets the atomic unit of work to: a stat processed at an hour boundary.
* For any given stat, only allows these atomic units of work to be processed
in chronological order.
Adds a table FillState that, for each stat, keeps track of the last unit of
work that was processed.
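
A minimal sketch of those two assumptions follows. The dictionary
standing in for the FillState table, and the names `process_stat` and
`process_hour`, are hypothetical; the real implementation is not shown
in this message.

```python
from datetime import datetime, timedelta, timezone

HOUR = timedelta(hours=1)


def floor_to_hour(t):
    """Round a datetime down to the most recent hour boundary."""
    return t.replace(minute=0, second=0, microsecond=0)


def process_stat(fill_state, stat, fill_to_time, process_hour):
    """Process each unprocessed hour boundary for `stat`, oldest first.

    `fill_state` maps a stat name to the last hour boundary processed.
    """
    next_time = fill_state[stat] + HOUR
    while next_time <= floor_to_hour(fill_to_time):
        process_hour(stat, next_time)  # one atomic unit of work
        fill_state[stat] = next_time   # record progress only once it is done
        next_time += HOUR


start = datetime(2016, 1, 1, 0, tzinfo=timezone.utc)
fill_state = {"example_stat": start}
processed = []
process_stat(
    fill_state,
    "example_stat",
    datetime(2016, 1, 1, 3, 30, tzinfo=timezone.utc),
    lambda stat, end_time: processed.append(end_time),
)
```

Because progress is recorded per unit, an interrupted run can resume
from the last completed hour instead of starting over.
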
This is primarily implemented through altering the migration file in
order to move the columns, but we also try to make the defaults a
little better for future tables inherited from BaseCount.
This is a first pass at building a framework for collecting various
stats about realms, users, streams, etc. Includes:
* New analytics tables for storing counts data
* Raw SQL queries for pulling data from zerver/models.py tables
* Aggregation functions for aggregating hourly stats into daily stats, and
aggregating user/stream level stats into realm level stats
* A management command for pulling the data
Note that counts.py was added to the linter exclude list due to errors
around %%s.
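
The two kinds of aggregation listed above can be illustrated in plain
Python. The real framework uses raw SQL against the analytics tables;
the function names and in-memory tuples here are hypothetical and only
show the shape of the computation.

```python
from collections import defaultdict
from datetime import datetime


def sum_hours_into_days(hourly_counts):
    """Aggregate (hour_start, value) pairs into a per-day total."""
    daily = defaultdict(int)
    for hour_start, value in hourly_counts:
        daily[hour_start.date()] += value
    return dict(daily)


def sum_users_into_realms(user_counts, realm_of_user):
    """Aggregate (user_id, value) pairs into a per-realm total."""
    per_realm = defaultdict(int)
    for user_id, value in user_counts:
        per_realm[realm_of_user[user_id]] += value
    return dict(per_realm)


hourly = [
    (datetime(2016, 1, 1, 9), 4),
    (datetime(2016, 1, 1, 10), 6),
    (datetime(2016, 1, 2, 9), 1),
]
daily = sum_hours_into_days(hourly)

realm_totals = sum_users_into_realms(
    [(1, 3), (2, 5), (3, 7)],
    {1: "realm_a", 2: "realm_a", 3: "realm_b"},
)
```
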