zulip

Commit Graph

Author	SHA1	Message	Date
Mateusz Mandera	890cafac11	retention: Use batch size of 100 for stream messages. Streams can have lots of subscribers, meaning that the archiving process will be moving tons of UserMessages per message. For that reason, using a smaller batch size for stream messages is justified. Some personal messages need to be added in test_scrub_realm to have coverage of do_delete_messages_by_sender after these changes.	2020-06-24 10:41:00 -07:00
Mateusz Mandera	0c6497d43a	retention: Add restore_retention_policy_deletions_for_stream function.	2020-06-24 10:40:38 -07:00
Mateusz Mandera	468f8cf488	retention: Improve logging of transactions.	2020-06-24 10:40:38 -07:00
Pragati Agrawal	1562ec758e	org settings: Use 'forever' value instead of -1 for message_retention_days. Currently, we use -1 as the Realm.message_retention_days value to retain messages forever unless specified at stream level for a particular stream, that is, no policy set at the realm level. But this is incoherent with what we use for Stream.message_retention_days, where -1 means "disable retention policy for this stream unconditionally", which can be confusing from an API standpoint. So instead of trying some hack to reset the value to NULL, or using some other value like -2 for RETAIN_MESSAGE_FOREVER in the API, it is much more intuitive to use a string like 'forever' that can be mapped to RETAIN_MESSAGE_FOREVER at the backend. And this is similar to what we use for streams settings as well.	2020-06-24 10:38:58 -07:00
Mateusz Mandera	7a03e2a7fe	retention: Replace Realm.message_retention_days None value with -1. To be more consistent with the meaning in the Stream model, and to make it easier to have a reasonable settings API, we get rid of the None value for Realm.message_retention_days in favor of the value -1 to represent the "don't delete messages" default policy.	2020-06-24 10:33:21 -07:00
Tim Abbott	d503549f0b	retention: Add some more system documentation.	2020-06-20 17:35:07 -07:00
Anders Kaseorg	365fe0b3d5	python: Sort imports with isort. Fixes #2665. Regenerated by tabbott with `lint --fix` after a rebase and change in parameters. Note from tabbott: In a few cases, this converts technical debt in the form of unsorted imports into different technical debt in the form of our largest files having very long, ugly import sequences at the start. I expect this change will increase pressure for us to split those files, which isn't a bad thing. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2020-06-11 16:45:32 -07:00
Anders Kaseorg	69730a78cc	python: Use trailing commas consistently. Automatically generated by the following script, based on the output of lint with flake8-comma: import re import sys last_filename = None last_row = None lines = [] for msg in sys.stdin: m = re.match( r"\x1b\[35mflake8 \\|\x1b\[0m \x1b\[1;31m(.+):(\d+):(\d+): (\w+)", msg ) if m: filename, row_str, col_str, err = m.groups() row, col = int(row_str), int(col_str) if filename == last_filename: assert last_row != row else: if last_filename is not None: with open(last_filename, "w") as f: f.writelines(lines) with open(filename) as f: lines = f.readlines() last_filename = filename last_row = row line = lines[row - 1] if err in ["C812", "C815"]: lines[row - 1] = line[: col - 1] + "," + line[col - 1 :] elif err in ["C819"]: assert line[col - 2] == "," lines[row - 1] = line[: col - 2] + line[col - 1 :].lstrip(" ") if last_filename is not None: with open(last_filename, "w") as f: f.writelines(lines) Signed-off-by: Anders Kaseorg <anders@zulipchat.com>	2020-06-11 16:04:12 -07:00
Mateusz Mandera	b234fe8ccb	retention: Pass optional realm argument to move_messages_to_archive. This allows having the realm field of ArchiveTransaction set instead of NULL when using move_messages_to_archive.	2020-05-16 14:46:56 -07:00
Mateusz Mandera	7d8a3581a5	retention: Clarify the status of cross-realm huddles in a comment.	2020-05-16 14:42:40 -07:00
Tim Abbott	f10f2600e0	retention: Fix OOM issues when deleting large numbers of transactions. For unknown reasons, deleting 10,000s of ArchiveTransaction objects results in rapidly growing memory in the job making the request in the Django process, eventually leading to an OOM kill. I don't understand why Django behaves that way; I would have expected the failure mode to instead be a serious load problem on the database server, but perhaps the way Django's internal deletion logic handles cascading the deletes to many millions of ArchiveMessages and other ForeignKey objects requires tracking a lot of data in memory. The solution is the same in any case, which is to batch the deletions to execute a reasonable number of them at once. Doing a single ArchiveTransaction at a time would likely result in huge numbers of database queries in a loop, which performs very poorly. So we balance by batching deletions in groups of 100 ArchiveTransactions; testing this in production, I saw no spike of memory usage materially beyond that of a normal Django process, and each bulk-deletion transaction takes several seconds to process (meaning per-transaction overhead is negligible).	2020-05-15 17:10:19 -07:00
Mateusz Mandera	812ac4714f	retention: Optimize fetching of realms and streams with retention policy.	2020-05-07 16:28:05 -07:00
Anders Kaseorg	fd65511fe9	retention: Improve move_rows escaping correctness with psycopg2.sql. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2020-05-04 09:35:30 -07:00
Tim Abbott	341787a5e0	retention: Use logging API in a more standard way.	2020-05-03 10:57:23 -07:00
Mateusz Mandera	0d7cbc71dd	retention: Make logging less unnecessarily verbose. For realms with no retention policy on themselves or any of their streams, no archiving happens, but 3 lines of logs would be generated. That's redundant and we make changes in this commit to avoid logging those lines if nothing of interest is happening.	2020-05-03 19:24:00 +02:00
Anders Kaseorg	bdc365d0fe	logging: Pass format arguments to logging. https://docs.python.org/3/howto/logging.html#optimization Signed-off-by: Anders Kaseorg <anders@zulip.com>	2020-05-02 10:18:02 -07:00
Anders Kaseorg	fead14951c	python: Convert assignment type annotations to Python 3.6 style. This commit was split by tabbott; this piece covers the vast majority of files in Zulip, but excludes scripts/, tools/, and puppet/ to help ensure we at least show the right error messages for Xenial systems. We can likely further refine the remaining pieces with some testing. Generated by com2ann, with whitespace fixes and various manual fixes for runtime issues: - invoiced_through: Optional[LicenseLedger] = models.ForeignKey( + invoiced_through: Optional["LicenseLedger"] = models.ForeignKey( -_apns_client: Optional[APNsClient] = None +_apns_client: Optional["APNsClient"] = None - notifications_stream: Optional[Stream] = models.ForeignKey('Stream', related_name='+', null=True, blank=True, on_delete=CASCADE) - signup_notifications_stream: Optional[Stream] = models.ForeignKey('Stream', related_name='+', null=True, blank=True, on_delete=CASCADE) + notifications_stream: Optional["Stream"] = models.ForeignKey('Stream', related_name='+', null=True, blank=True, on_delete=CASCADE) + signup_notifications_stream: Optional["Stream"] = models.ForeignKey('Stream', related_name='+', null=True, blank=True, on_delete=CASCADE) - author: Optional[UserProfile] = models.ForeignKey('UserProfile', blank=True, null=True, on_delete=CASCADE) + author: Optional["UserProfile"] = models.ForeignKey('UserProfile', blank=True, null=True, on_delete=CASCADE) - bot_owner: Optional[UserProfile] = models.ForeignKey('self', null=True, on_delete=models.SET_NULL) + bot_owner: Optional["UserProfile"] = models.ForeignKey('self', null=True, on_delete=models.SET_NULL) - default_sending_stream: Optional[Stream] = models.ForeignKey('zerver.Stream', null=True, related_name='+', on_delete=CASCADE) - default_events_register_stream: Optional[Stream] = models.ForeignKey('zerver.Stream', null=True, related_name='+', on_delete=CASCADE) + default_sending_stream: Optional["Stream"] = models.ForeignKey('zerver.Stream', null=True, related_name='+', on_delete=CASCADE) + default_events_register_stream: Optional["Stream"] = models.ForeignKey('zerver.Stream', null=True, related_name='+', on_delete=CASCADE) -descriptors_by_handler_id: Dict[int, ClientDescriptor] = {} +descriptors_by_handler_id: Dict[int, "ClientDescriptor"] = {} -worker_classes: Dict[str, Type[QueueProcessingWorker]] = {} -queues: Dict[str, Dict[str, Type[QueueProcessingWorker]]] = {} +worker_classes: Dict[str, Type["QueueProcessingWorker"]] = {} +queues: Dict[str, Dict[str, Type["QueueProcessingWorker"]]] = {} -AUTH_LDAP_REVERSE_EMAIL_SEARCH: Optional[LDAPSearch] = None +AUTH_LDAP_REVERSE_EMAIL_SEARCH: Optional["LDAPSearch"] = None Signed-off-by: Anders Kaseorg <anders@zulipchat.com>	2020-04-22 11:02:32 -07:00
Mateusz Mandera	cbdfef28a8	retention: Update to account for the zulipinternal realm. In https://github.com/zulip/zulip/pull/12823 some changes to the realms structure have been made, so now both in production and development cross-realm bots live in the realm with string_id "zulipinternal". There was a TODO in retention code to eliminate a conditional in a query that became redundant with this change, and also the zulipinternal realm should be omitted from the archiving process in archive_messages().	2020-02-14 17:15:26 -08:00
Mateusz Mandera	9a42a83e15	streams: Remove get_stream_recipients function and its uses. With the recipient field being denormalized into the UserProfile and Streams models, all current uses of get_stream_recipients can be done more efficiently, by simply checking the .recipient_id attribute on the appropriate objects.	2019-12-12 12:05:42 -08:00
Mateusz Mandera	dbe508bb91	models: Migration of Message.pub_date to date_sent, part 2. Fixes #1727. With the server down, apply migrations 0245 and 0246. 0246 will remove the pub_date column, so it's essential that the previous migrations ran correctly to copy data before running this.	2019-10-05 19:01:34 -07:00
Anders Kaseorg	becef760bf	cleanup: Delete leading newlines. Previous cleanups (mostly the removals of Python __future__ imports) were done in a way that introduced leading newlines. Delete leading newlines from all files, except static/assets/zulip-emoji/NOTICE, which is a verbatim copy of the Apache 2.0 license. Signed-off-by: Anders Kaseorg <anders@zulipchat.com>	2019-08-06 23:29:11 -07:00
Mateusz Mandera	d1c2185c81	retention: Archive cross-realm personal messages. We can simply archive cross-realm personal messages according to the retention policy of the recipient's realm. It requires adding another message-archiving query for this case however. What remains is to figure out how to treat cross-realm huddle messages.	2019-07-08 20:03:20 -07:00
Mateusz Mandera	24ec1c7aa1	Revert "retention: Delete objects tied to a Message in one query with archiving." This reverts commit `8f15884c7d`. Using the WITH ( ) ... DELETE method leads to a small performance drop, while probably not offering many positives, so it seems appropriate to go to the simpler case of just letting things get cleaned up by CASCADE.	2019-07-08 16:35:53 -07:00
Mateusz Mandera	b55f24e07c	retention: Avoid "SELECT realm" being run by Django ORM for each stream. The way the code changed in this commit was written caused Django to fetch stream.realm from the database for every stream, leading to redundant, identical queries. Each stream's realm is already known, so we use that information.	2019-07-08 16:32:38 -07:00
Mateusz Mandera	89ba6d7941	retention: Improve the placement of logging in run_archiving_in_chunks.	2019-07-02 17:33:53 -07:00
Mateusz Mandera	17c5398703	retention: Filter also by transaction type when restoring by realm.	2019-07-02 17:31:07 -07:00
Mateusz Mandera	627238ebe1	retention: Replace all LEFT JOIN IS NULL with ON CONFLICT handling. Duplicate handling when INSERTing is switched from the "LEFT JOIN ... id IS NULL" approach to "ON CONFLICT (id) DO NOTHING", since we now have postgresql 9.5. The ON CONFLICT approach is more natural and potentially faster.	2019-07-02 17:25:31 -07:00
Mateusz Mandera	9f452a6e9a	retention: Add logging to the restoring and archive cleaning functions.	2019-07-02 17:25:31 -07:00
Mateusz Mandera	7950aaea1e	retention: Add code for deleting old archive data.	2019-06-26 12:24:47 -07:00
Mateusz Mandera	3ac11a3fc5	retention: Use ON CONFLICT DO UPDATE to handle re-archiving properly. When archiving Messages, we stop relying on LEFT JOIN ... IS NULL to avoid duplicates when INSERTing. Instead we use ON CONFLICT DO UPDATE (added in postgresql 9.5) to, in case of archiving a Message that already has a corresponding archived object (this happens if a Message gets archived, restored and then archived again), re-assign the existing ArchivedMessage to the new transaction. This also allows us to fix test_archiving_messages_second_time, which was temporarily disabled a few commits earlier.	2019-06-26 12:05:59 -07:00
Mateusz Mandera	7b2b4435ed	retention: Combine run_message_batch_query and run_archiving_in_chunks. We combine run_message_batch_query and run_archiving_in_chunks functions, which makes the code simpler and more readable - we get rid of hacky generator usage, for example. In the process, move_expired_messages_* functions are adjusted, and now they archive Messages as well as their related objects. Appropriate adjustments in reaction to this are made in the main archiving functions which call move_expired_messages_* (they no longer need to call move_related_objects_to_archive).	2019-06-26 12:05:59 -07:00
Mateusz Mandera	6e46c6d752	retention: Add functions for restoring archived data. Functions for restoring archived data are added and existing tests are expanded to restore data they archived and check correctness.	2019-06-26 12:05:59 -07:00
Mateusz Mandera	9acd3b0f46	retention: Rewrite move_messages_to_archive to use existing functions. Instead of having a bunch of custom code in the function, we make it use run_message_batch_query and run_archiving_in_chunks to do the necessary operations in a consistent way, using the same codepaths as the rest of the archiving system. This breaks test_archiving_messages_second_time temporarily, but we will fix it and re-enable the test in the next commits, where we'll address various other issues with re-archiving of messages. We also remove the @transaction.atomic wrapper, because atomicity is handled by the logic inside run_archiving_in_chunks.	2019-06-26 12:05:59 -07:00
Mateusz Mandera	80b834dd1b	retention: Update move_rows() function code. We make minor changes to the move_rows() function to allow its use in the code for restoring from the archive.	2019-06-26 12:05:59 -07:00
Mateusz Mandera	e3fe66a084	retention: Set savepoint=False on atomic wrapper on move_rows(). Savepoints create unnecessary overhead, and there's no benefit from them, with the way we use this function.	2019-06-26 12:05:59 -07:00
Mateusz Mandera	5d8d5910a8	retention: Log archive_transaction id and information.	2019-06-26 12:05:59 -07:00
Mateusz Mandera	a2cce62c1c	retention: Use new ArchiveTransaction model. We add a new model, ArchiveTransaction, to tie archived objects together in a coherent way, according to the batches in which they are archived. This enables making a better system for restoring from archive, and it seems just more sensible to tie the archived objects in this way, rather than the somewhat vague setting of archive_timestamp to each object using timezone_now().	2019-06-26 12:05:59 -07:00
Mateusz Mandera	8f15884c7d	retention: Delete objects tied to a Message in one query with archiving. Rather than relying on the CASCADING property of the ForeignKey to the Message table to clean up these objects, we delete them in the same query as we archive them - since it's guaranteed that any of these objects that we archive will be deleted due to their Message being deleted later. We don't have this guarantee for Attachment objects, which is why we can't apply this scheme to them.	2019-06-13 11:18:11 -07:00
Mateusz Mandera	25810752fe	retention: Fully process each Message chunk in a transaction. To ensure the database retains a consistent state if archiving gets interrupted, we process each Messages chunk together with related objects in a single atomic transaction.	2019-06-13 11:17:54 -07:00
Mateusz Mandera	55eb46433b	retention: Use yield when batching instead of returning a list of lists. This generator architecture will be cleaner for supporting the transactionality model we want.	2019-06-13 11:11:34 -07:00
Mateusz Mandera	37a22844b9	retention: Clean up code of move_messages_to_archive().	2019-06-13 11:02:11 -07:00
Mateusz Mandera	a68c460a14	retention: Clean up code for archiving attachment_messages. We had two duplicate functions for archiving zerver_attachment_messages rows, doing the same thing - archiving by message_id. One of them had a redundant INNER JOIN, so we get rid of that too.	2019-06-13 11:02:11 -07:00
Mateusz Mandera	cbee5beeac	retention: Log progress through the archiving process.	2019-06-13 11:02:11 -07:00
Mateusz Mandera	e3c7a5d896	retention: Loop over realms in archive_messages. Since we loop over realms in the functions for archiving stream messages and then personal+huddle messages, and also want to split cleaning up attachments by realm - it makes sense to do it all in one single loop.	2019-06-13 11:02:11 -07:00
Mateusz Mandera	5b8140cf75	retention: Group stream message archiving by realm. We group the process of archiving stream message by realm, to allow logging and keeping track of time taken per realm.	2019-06-11 09:25:25 -07:00
Mateusz Mandera	f06a4b4eab	retention: Batch Message archiving queries. We batch queries that archive Messages, to limit the maximum amount of Message objects archived in a single query. This leads to the archiving of other related objects being batched as well, because we loop over chunks of archived messages and archive their related objects per-chunk.	2019-06-11 09:25:25 -07:00
Tim Abbott	065575debf	retention: Add a quick comment explaining how deletion works.	2019-06-06 11:41:07 -07:00
Mateusz Mandera	323be57151	retention: If stream has no retention policy set, use realm policy. We add the following behavior: If stream has message_retention_days set to -1, archiving for it is disabled. If stream has message_retention_days set to null, use the realm's policy. If the realm has no policy, we don't archive for this stream.	2019-06-06 11:17:42 -07:00
Mateusz Mandera	8bef82c7f9	retention: Clean up redundant code for special handling of UserMessages. UserMessages no longer need special handling, they can be archived by move_models_with_message_key_to_archive and automatically cleaned up like the other models with a message key with CASCADING=True.	2019-06-06 11:17:42 -07:00
Mateusz Mandera	0e9fa4f028	retention: Support stream-based retention policies. We change the archiving scheme to allow having stream based retention policies. In the first step of the archiving process, we loop over streams and archive their expired messages and related objects. Then we separately archive all expired personal and huddle messages and related objects. As the last step, we scan for redundant attachments which can now be deleted. To achieve this, we have to rewrite a significant portion of the retention code and rework some of the database queries. For the sake of simplicity, we neither archive nor delete cross-realm messages, except cross-realm stream messages – in their case they can be processed in the same manner as ordinary stream messages. In the query for archiving personal and huddle messages we simply exclude those sent by cross-realm bots. We change the tests to adapt to these modifications.	2019-06-06 11:17:42 -07:00
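
The retention-policy rules described in commits 323be57151, 7a03e2a7fe and 1562ec758e can be summarized in a small Python sketch. This is an illustration under stated assumptions, not the code from those commits: the helper names parse_retention_setting and effective_retention_days are hypothetical, and only the -1 sentinel, the 'forever' string, and the "null on a stream means use the realm policy" rule come from the commit messages above.

```python
from typing import Optional, Union

# Sentinel named in the commit messages; -1 is the value they describe.
RETAIN_MESSAGE_FOREVER = -1


def parse_retention_setting(value: Union[int, str]) -> int:
    """Hypothetical helper: map an API value to the stored integer.

    'forever' maps to RETAIN_MESSAGE_FOREVER; otherwise a positive
    number of days is expected.
    """
    if value == "forever":
        return RETAIN_MESSAGE_FOREVER
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        raise ValueError(f"Bad retention value: {value!r}")
    return value


def effective_retention_days(
    stream_days: Optional[int], realm_days: int
) -> Optional[int]:
    """Hypothetical helper: days after which a stream's messages are
    archived, or None if they are never archived.

    A stream value of None falls back to the realm policy; a value of -1
    at either level means messages are kept.
    """
    policy = realm_days if stream_days is None else stream_days
    if policy == RETAIN_MESSAGE_FOREVER:
        return None
    return policy
```

For example, a stream with no policy of its own in a realm with a 30-day policy resolves to 30 days, while a stream set to -1 is never archived regardless of the realm setting.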

72 Commits