Commit Graph

100 Commits

Author SHA1 Message Date
Mateusz Mandera 24ec1c7aa1 Revert "retention: Delete objects tied to a Message in one query with archiving."
This reverts commit 8f15884c7d. Using the
WITH ( ) ... DELETE method leads to a small performance drop, while
probably not offering many positives, so it seems appropriate to go to
the simpler case of just letting things get cleaned up by CASCADE.
2019-07-08 16:35:53 -07:00
Mateusz Mandera b55f24e07c retention: Avoid "SELECT realm" being run by Django ORM for each stream.
The  way the code changed in this commit was written caused Django to
fetch stream.realm from the database for every stream, leading to
redundant, identical queries. Each stream's realm is already known, so
we use that information.
2019-07-08 16:32:38 -07:00
Mateusz Mandera 89ba6d7941 retention: Improve the placement of logging in run_archiving_in_chunks. 2019-07-02 17:33:53 -07:00
Mateusz Mandera 17c5398703 retention: Filter also by transaction type when restoring by realm. 2019-07-02 17:31:07 -07:00
Mateusz Mandera 627238ebe1 retention: Replace all LEFT JOIN IS NULL with ON CONFLICT handling.
Duplicate handling when INSERTing is switched from "LEFT JOIN ... id IS
NULL" approach to "ON CONFLICT (id) DO NOTHING", since we now have
postgresql 9.5. The ON CONFLICT approach is more natural as well as also
potentially being faster,
2019-07-02 17:25:31 -07:00
Mateusz Mandera 9f452a6e9a retention: Add logging to the restoring and archive cleaning functions. 2019-07-02 17:25:31 -07:00
Mateusz Mandera 7950aaea1e retention: Add code for deleting old archive data. 2019-06-26 12:24:47 -07:00
Mateusz Mandera 3ac11a3fc5 retention: Use ON CONFLICT DO UPDATE to handle re-archiving properly.
When archiving Messages, we stop relying on LEFT JOIN ... IS NULL to
avoid duplicates when INSERTing. Instead we use ON CONFLICT DO UPDATE
(added in postgresql 9.5) to, in case of archiving a Message that
already has a corresponding archived objects (this happens if a Message
gets archived, restored and then archived again), re-assign the existing
ArchivedMessage to the new transaction.

This also allows us to fix test_archiving_messages_second_time, which
was temporarily disable a few commits before.
2019-06-26 12:05:59 -07:00
Mateusz Mandera 7b2b4435ed retention: Combine run_message_batch_query and run_archiving_in_chunks.
We combine run_message_batch_query and run_archiving_in_chunks
functions, which makes the code simpler and more readable - we get rid
of hacky generator usage, for example.
In the process, move_expired_messages_* functions are adjusted, and now
they archive Messages as well as their related objects.
Appropriate adjustments in reaction to this are made in the main
archiving functions which call move_expired_messages_* (they no longer
need to call move_related_objects_to_archive).
2019-06-26 12:05:59 -07:00
Mateusz Mandera 6e46c6d752 retention: Add functions for restoring archived data.
Functions for restoring archived data are added and existing tests are
expanded to restore data they archived and check correctness.
2019-06-26 12:05:59 -07:00
Mateusz Mandera 9acd3b0f46 retention: Rewrite move_messages_to_archive to use existing functions.
Instead of having a bunch of custom code in the function, we make it use
run_message_batch_query and run_archiving_in_chunks to do the necessary
operations in a consistent way, using the same codepaths as the rest of
the archiving system.
This breaks test_archiving_messages_second_time temporarily, but we will
fix it and re-enable the test in the next commits, where we'll address
various other issues with re-archiving of messages.

We also remove the @transaction.atomic wrapper, because atomicity is
handled by the logic inside run_archiving_in_chunks.
2019-06-26 12:05:59 -07:00
Mateusz Mandera 80b834dd1b retention: Update move_rows() function code.
We make minor changes to the move_rows() function to allow its use in
the code for restoring from the archive.
2019-06-26 12:05:59 -07:00
Mateusz Mandera e3fe66a084 retention: Set savepoint=False on atomic wrapper on move_rows().
Savepoints create unnecessary overhead, and there's no benefit from
them, with the way we use this function.
2019-06-26 12:05:59 -07:00
Mateusz Mandera 5d8d5910a8 retention: Log archive_transaction id and information. 2019-06-26 12:05:59 -07:00
Mateusz Mandera a2cce62c1c retention: Use new ArchiveTransaction model.
We add a new model, ArchiveTransaction, to tie archived objects together
in a coherent way, according to the batches in which they are archived.
This enables making a better system for restoring from archive, and it
seems just more sensible to tie the archived objects in this way, rather
the somewhat vague setting of archive_timestamp to each object using
timezone_now().
2019-06-26 12:05:59 -07:00
Mateusz Mandera 8f15884c7d retention: Delete objects tied to a Message in one query with archiving.
Rather than relying on the CASCADING property of the ForeignKey to the
Message table to clean up these objects, we delete them in the same
query as we archive them - since it's guaranteed that any of these
objects that we archive will be deleted due to their Message being
deleted later.
We don't have this guarantee for Attachment objects, which is why we
can't apply this scheme to them.
2019-06-13 11:18:11 -07:00
Mateusz Mandera 25810752fe retention: Fully process each Message chunk in a transaction.
To ensure the database retains a consistent state if archiving gets
interrupted, we process each Messages chunk together with related
objects in a single atomic transaction.
2019-06-13 11:17:54 -07:00
Mateusz Mandera 55eb46433b retention: Use yield when batching instead of returning a list of lists.
This generator architecture will be cleaner for supporting the
transactionality model we want.
2019-06-13 11:11:34 -07:00
Mateusz Mandera 37a22844b9 retention: Clean up code of move_messages_to_archive(). 2019-06-13 11:02:11 -07:00
Mateusz Mandera a68c460a14 retention: Clean up code for archiving attachment_messages.
We had two duplicate functions for archiving zerver_attachment_messages
rows, doing the same thing - archiving by message_id. One of them had a
redundant INNER JOIN, so we get rid of that too.
2019-06-13 11:02:11 -07:00
Mateusz Mandera cbee5beeac retention: Log progress through the archiving process. 2019-06-13 11:02:11 -07:00
Mateusz Mandera e3c7a5d896 retention: Loop over realms in archive_messages.
Since we loop over realms in the functions for archiving stream messages
and then personal+huddle messages, and also want to split cleaning up
attachments by realm - it makes sense to do it all in one single loop.
2019-06-13 11:02:11 -07:00
Mateusz Mandera 5b8140cf75 retention: Group stream message archiving by realm.
We group the process of archiving stream message by realm, to allow
logging and keeping track of time taken per realm.
2019-06-11 09:25:25 -07:00
Mateusz Mandera f06a4b4eab retention: Batch Message archiving queries.
We batch queries that archive Messages, to limit the maximum amount of
Message objects archived in a single query. This leads to the archiving
of other related objects being batched as well, because we loop over
chunks of archived messages and archive their related objects per-chunk.
2019-06-11 09:25:25 -07:00
Tim Abbott 065575debf retention: Add a quick comment explaining how deletion works. 2019-06-06 11:41:07 -07:00
Mateusz Mandera 323be57151 retention: If stream has no retention policy set, use realm policy.
We add the following behavior:
If stream has message_retention_days set to -1, archiving for it is
disabled.
If stream has message_retention_days set to null, use the realm's
policy. If the realm has no policy, we don't archive for this stream.
2019-06-06 11:17:42 -07:00
Mateusz Mandera 8bef82c7f9 retention: Clean up redundant code for special handling of UserMessages.
UserMessages no longer need special handling, they can be archived by
move_models_with_message_key_to_archive and automatically cleaned up
like the other models with a message key with CASCADING=True.
2019-06-06 11:17:42 -07:00
Mateusz Mandera 0e9fa4f028 retention: Support stream-based retention policies.
We change the archiving scheme to allow having stream based retention
policies. In the first step of the archiving process, we loop over
streams and archive their expired messages and related objects.
Then we separately archive all expired personal and huddle messages and
related objects. As the last step, we scan for redundant attachments
which can now be deleted.
To achieve this, we have to rewrite a significant portion of the
retention code and rework some of the database queries.
For the sake of simplicity, we neither archive nor delete cross-realm
messages, except cross-realm stream messages – in their case they can
be processed in the same manner as ordinary stream messages.
In the query for archiving personal and huddle messages we simply
exclude those sent by cross-realm bots.
We change the tests to adapt to these modifications.
2019-06-06 11:17:42 -07:00
Mateusz Mandera aa45325b5f retention: Rename move_expired_rows to move_rows. 2019-06-06 11:17:42 -07:00
Mateusz Mandera d373a16910 retention: Remove realm_id check when archiving attachments.
Since we archive attachments and attachment_messages tied to a list of
ids of Messages that we just archived (so from the current realm), it's
unnecessary to check their realm in the queries. This could potentially
cause archiving of an attachment with realm_id of another realm, but
this isn't an issue, as long as we make sure we don't end up deleting
the original Attachment object incorrectly - but realm_id check is
included in delete_expired_attachments() to ensure that.
2019-06-06 11:17:42 -07:00
Mateusz Mandera 6c3ba25474 retention: Use RETURNING to speed up database queries.
We add RETURNING to fetch relevant message and usermessage ids in
archiving queries and use them to make other queries faster and slower.
A side-effect of this implementation is that with cross-realm messages,
the UserMessage of the recipient and the Message will not be deleted -
but cross-realm messages are rare, will still get correctly put in the
archive tables and so failing to delete should not be a problem for now.
They will be fully handled later.
2019-06-02 14:55:14 -07:00
Mateusz Mandera 426e3bbbd9 retention: Remove redundant LEFT JOIN in archiving UserMessages.
zerver_archivedmessage is already INNER JOIN-ed earlier in the query, so
we check the pub_date in it, instead of joining zerver_message, which
would just redundantly join the analogical rows.
2019-06-02 14:55:14 -07:00
Mateusz Mandera 4facc93670 retention: Add archiving of SubMessages. 2019-05-30 11:40:20 -07:00
Mateusz Mandera 37c42a09e5 retention: Archiving of models tied to a Message, applied to Reactions.
We add general code that will archive models that are tied to a specific
Message (such as Reactions and SubMessages). Certain details of the
model are grabbed from a list models_with_message_key, and then used to
create queries that will archive these database tables.
We put Reaction in that list in this commit, and add appropriate tests.
To have archiving of other analogical models (for example SubMessage),
one only needs to make an appropriate entry in the
models_with_message_key list.
2019-05-30 11:40:20 -07:00
Mateusz Mandera 2bc6d52c72 retention: Fix name of move_attachment_message_to_archive_by_message.
The first instance of the word "message" should be in plural. We rename
to move_attachment_message_to_archive_by_message.
2019-05-29 16:26:11 -07:00
Mateusz Mandera 2ca650be4d retention: Clean up move_messages_to_archive() for more clarity. 2019-05-29 16:26:11 -07:00
Mateusz Mandera c5ac66b9c8 retention: Split archive_messages code into two functions.
We split archive_messages code into two functions: moving to archive and
cleanup. This allows cleaning up the tests - they can call
these functions directly instead of copying several lines of
archive_messages here and there in multiple tests.
2019-05-27 12:53:32 -07:00
Tim Abbott 3d1aa98b2e retention: Use a consistent ordering for processing realms.
This is probably a good idea for the production use case, since then
there's some consistency of behavior, and if we extend logging, one
knows exactly which realms were or were not executed before a logged
failure.

This fixes the nondeterministic test failures we've been seeing in CI:
if you use `-id` in that order_by, it happens consistently.
2019-05-22 10:48:53 -07:00
K.Kanakhin e930851d16 retention-period: Add more core code for retention policy.
This is a very old commit for #106, which has been on hiatus for a few
years.  It was significantly modified by tabbott to:
* Improve coding style and variable names
* Update mypy annotations style
* Clean up the testing logic
* Update for API changes elsewhere in our system

But the actual runtime code is essentially unmodified from the
original work by Kirill.

It contains basic support for archiving Messages, UserMessages, and
Attachments with a nice test suite.  It's still not usable in
production (e.g. it will probably break Reactions, SubMessages, etc.),
but upcoming commits will address that.
2019-05-19 20:22:47 -07:00
Anders Kaseorg f0ecb93515 zerver core: Remove unused imports.
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
2019-02-02 17:41:24 -08:00
Vishnu Ks 962d72b58b retention: move_messages_to_archive should accept multiple message ids.
This will speed up the scrub realm management command. Calling the
function with a single message_id in a loop was extremely inefficient.
2018-10-11 15:31:12 -07:00
rht ee546a33a3 zerver/lib: Use python 3 syntax for typing.
Edited by tabbott to improve various line-wrapping decisions.
2017-11-28 17:15:14 -08:00
rht 2e12fe5e2e zerver/lib: Remove print_function. 2017-09-27 18:05:45 -07:00
rht f43e54d352 zerver/lib: Remove absolute_import. 2017-09-27 10:00:39 -07:00
K.Kanakhin 2434f2d96c messages: Add support for admins deleting messages.
This makes it possible for Zulip administrators to delete messages.
This is primarily intended for use in deleting early test messages,
but it can solve other problems as well.

Later we'll want to play with the permissions model for this, but for
now, the goal is just to integrate the feature.

Note that it saves the deleted messages for some time using the same
approach as Zulip's message retention policy feature.

Fixes #135.
2017-05-29 21:59:38 -07:00
hackerkid b2504084ab Replace timezone.now with timezone_now. 2017-04-16 12:28:56 -07:00
Rishi Gupta af4718c50c retention.py: Remove use of domain from get_expired_messages.
get_expired_messages seems to only be used by tests anyway.
2017-03-14 17:17:42 -07:00
Raghav Jajodia a3a03bd6a5 mypy: Added Dict, List and Set imports.
Fixed mypy errors associated with the upgrade.
2017-03-04 14:33:44 -08:00
Bickio e009383460 pep8: Fix E231. 2016-11-30 19:59:25 -08:00
K.Kanakhin 39e0886361 retention-policy: Add tool to determine expired messages.
This is a first step towards implementing a message retention policy
feature.

- Add Realm model message_retention_days field to setup
  messages expired period for realm.
- Add migration.
- Add tool to get expired messages for each Realm.
- Add tests to cover tool for getting expired messages.
2016-10-25 15:38:08 -07:00