When archiving Messages, we stop relying on LEFT JOIN ... IS NULL to
avoid duplicates when INSERTing. Instead we use ON CONFLICT DO UPDATE
(added in postgresql 9.5) to, in case of archiving a Message that
already has a corresponding archived objects (this happens if a Message
gets archived, restored and then archived again), re-assign the existing
ArchivedMessage to the new transaction.
This also allows us to fix test_archiving_messages_second_time, which
was temporarily disable a few commits before.
We combine run_message_batch_query and run_archiving_in_chunks
functions, which makes the code simpler and more readable - we get rid
of hacky generator usage, for example.
In the process, move_expired_messages_* functions are adjusted, and now
they archive Messages as well as their related objects.
Appropriate adjustments in reaction to this are made in the main
archiving functions which call move_expired_messages_* (they no longer
need to call move_related_objects_to_archive).
Instead of having a bunch of custom code in the function, we make it use
run_message_batch_query and run_archiving_in_chunks to do the necessary
operations in a consistent way, using the same codepaths as the rest of
the archiving system.
This breaks test_archiving_messages_second_time temporarily, but we will
fix it and re-enable the test in the next commits, where we'll address
various other issues with re-archiving of messages.
We also remove the @transaction.atomic wrapper, because atomicity is
handled by the logic inside run_archiving_in_chunks.
We add a new model, ArchiveTransaction, to tie archived objects together
in a coherent way, according to the batches in which they are archived.
This enables making a better system for restoring from archive, and it
seems just more sensible to tie the archived objects in this way, rather
the somewhat vague setting of archive_timestamp to each object using
timezone_now().
Rather than relying on the CASCADING property of the ForeignKey to the
Message table to clean up these objects, we delete them in the same
query as we archive them - since it's guaranteed that any of these
objects that we archive will be deleted due to their Message being
deleted later.
We don't have this guarantee for Attachment objects, which is why we
can't apply this scheme to them.
To ensure the database retains a consistent state if archiving gets
interrupted, we process each Messages chunk together with related
objects in a single atomic transaction.
We had two duplicate functions for archiving zerver_attachment_messages
rows, doing the same thing - archiving by message_id. One of them had a
redundant INNER JOIN, so we get rid of that too.
Since we loop over realms in the functions for archiving stream messages
and then personal+huddle messages, and also want to split cleaning up
attachments by realm - it makes sense to do it all in one single loop.
We batch queries that archive Messages, to limit the maximum amount of
Message objects archived in a single query. This leads to the archiving
of other related objects being batched as well, because we loop over
chunks of archived messages and archive their related objects per-chunk.
We add the following behavior:
If stream has message_retention_days set to -1, archiving for it is
disabled.
If stream has message_retention_days set to null, use the realm's
policy. If the realm has no policy, we don't archive for this stream.
UserMessages no longer need special handling, they can be archived by
move_models_with_message_key_to_archive and automatically cleaned up
like the other models with a message key with CASCADING=True.
We change the archiving scheme to allow having stream based retention
policies. In the first step of the archiving process, we loop over
streams and archive their expired messages and related objects.
Then we separately archive all expired personal and huddle messages and
related objects. As the last step, we scan for redundant attachments
which can now be deleted.
To achieve this, we have to rewrite a significant portion of the
retention code and rework some of the database queries.
For the sake of simplicity, we neither archive nor delete cross-realm
messages, except cross-realm stream messages – in their case they can
be processed in the same manner as ordinary stream messages.
In the query for archiving personal and huddle messages we simply
exclude those sent by cross-realm bots.
We change the tests to adapt to these modifications.
Since we archive attachments and attachment_messages tied to a list of
ids of Messages that we just archived (so from the current realm), it's
unnecessary to check their realm in the queries. This could potentially
cause archiving of an attachment with realm_id of another realm, but
this isn't an issue, as long as we make sure we don't end up deleting
the original Attachment object incorrectly - but realm_id check is
included in delete_expired_attachments() to ensure that.
We add RETURNING to fetch relevant message and usermessage ids in
archiving queries and use them to make other queries faster and slower.
A side-effect of this implementation is that with cross-realm messages,
the UserMessage of the recipient and the Message will not be deleted -
but cross-realm messages are rare, will still get correctly put in the
archive tables and so failing to delete should not be a problem for now.
They will be fully handled later.
zerver_archivedmessage is already INNER JOIN-ed earlier in the query, so
we check the pub_date in it, instead of joining zerver_message, which
would just redundantly join the analogical rows.
We add general code that will archive models that are tied to a specific
Message (such as Reactions and SubMessages). Certain details of the
model are grabbed from a list models_with_message_key, and then used to
create queries that will archive these database tables.
We put Reaction in that list in this commit, and add appropriate tests.
To have archiving of other analogical models (for example SubMessage),
one only needs to make an appropriate entry in the
models_with_message_key list.
We split archive_messages code into two functions: moving to archive and
cleanup. This allows cleaning up the tests - they can call
these functions directly instead of copying several lines of
archive_messages here and there in multiple tests.
This is probably a good idea for the production use case, since then
there's some consistency of behavior, and if we extend logging, one
knows exactly which realms were or were not executed before a logged
failure.
This fixes the nondeterministic test failures we've been seeing in CI:
if you use `-id` in that order_by, it happens consistently.
This is a very old commit for #106, which has been on hiatus for a few
years. It was significantly modified by tabbott to:
* Improve coding style and variable names
* Update mypy annotations style
* Clean up the testing logic
* Update for API changes elsewhere in our system
But the actual runtime code is essentially unmodified from the
original work by Kirill.
It contains basic support for archiving Messages, UserMessages, and
Attachments with a nice test suite. It's still not usable in
production (e.g. it will probably break Reactions, SubMessages, etc.),
but upcoming commits will address that.
This makes it possible for Zulip administrators to delete messages.
This is primarily intended for use in deleting early test messages,
but it can solve other problems as well.
Later we'll want to play with the permissions model for this, but for
now, the goal is just to integrate the feature.
Note that it saves the deleted messages for some time using the same
approach as Zulip's message retention policy feature.
Fixes#135.
This is a first step towards implementing a message retention policy
feature.
- Add Realm model message_retention_days field to setup
messages expired period for realm.
- Add migration.
- Add tool to get expired messages for each Realm.
- Add tests to cover tool for getting expired messages.