zulip/zerver
Alex Vandiver 102481bc47 migrations: Adjust stats size for tsvector to 10k, from 100.
PostgreSQL's `default_statistics_target` controls how many
"most common values" ("MCVs") are tracked for a column when performing
an `ANALYZE`.  For `tsvector` columns, the number of values is actually
10x this number, because each row contains multiple values for the
column[1].  The `default_statistics_target` defaults to 100[2], and
Zulip does not adjust this at the server level.

This translates to 1000 entries in the MCV list for tsvectors.  For
large tables like `zerver_message`, a too-small value can cause
poorly-chosen query plans.  The query planner assumes that any
entry *not* found in the MCV list is *half* as likely as the
least-likely value in it.  If the table is large, and the MCV list is
too short (as 1000 values is for large deployments), arbitrary
not-in-the-MCV words will often be estimated by the query planner to
occur comparatively quite frequently in the index.  Based on this, the
planner will choose to scan all messages accessible by the user,
filtering by word in tsvector, rather than using the tsvector index
and filtering by being accessible to the user.  This results in
degraded performance for word searching.
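
As a rough illustration of the heuristic described above, the sketch
below computes how many rows the planner would expect to match a word
that is missing from the MCV list.  The function name, the table size,
and the two frequencies are hypothetical numbers chosen for the
example; they are not measurements from a real deployment, and the
real planner applies further adjustments not modeled here.

```python
def estimated_rows_for_unlisted_word(total_rows: int, least_mcv_frequency: float) -> float:
    """Rows expected to match a word that is absent from the MCV list.

    Implements only the rule stated in the commit message: an unlisted
    entry is assumed to be half as likely as the least-likely listed one.
    """
    unlisted_frequency = least_mcv_frequency / 2
    return total_rows * unlisted_frequency


# Hypothetical table of 100 million messages.
total_rows = 100_000_000

# Assumption: with a short MCV list, the least-common listed word is
# still fairly common; with a long list, it is much rarer.
short_list_estimate = estimated_rows_for_unlisted_word(total_rows, 1e-3)
long_list_estimate = estimated_rows_for_unlisted_word(total_rows, 1e-5)

print(f"short MCV list: ~{short_list_estimate:,.0f} rows expected")
print(f"long MCV list:  ~{long_list_estimate:,.0f} rows expected")
```

Under these assumed numbers, the longer list lowers the estimate for a
rare word by two orders of magnitude, which is what makes the tsvector
index look attractive to the planner again.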

However, PostgreSQL allows adjustment of this value on a per-column
basis.  Add a migration to adjust the value up to 10k for
`search_tsvector` on `zerver_message`, which results in 100k entries
in that MCV list.
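
A minimal sketch of the per-column adjustment, assuming it is issued
as a plain `ALTER TABLE ... SET STATISTICS` statement (for example
from a raw-SQL migration operation).  The helper names below are
hypothetical and are not taken from the migration itself; the 10x
multiplier is the one described in this message.

```python
# PostgreSQL accepts per-column statistics targets from 0 to 10000.
MAX_STATISTICS_TARGET = 10_000
# Multiplier for tsvector columns, as described in the commit message.
TSVECTOR_ENTRIES_PER_TARGET = 10


def set_statistics_sql(table: str, column: str, target: int) -> str:
    """Build the statement that sets a per-column statistics target.

    The table and column names are interpolated directly, so they must
    be trusted constants rather than user input.
    """
    if not 0 <= target <= MAX_STATISTICS_TARGET:
        raise ValueError(f"statistics target out of range: {target}")
    return f"ALTER TABLE {table} ALTER COLUMN {column} SET STATISTICS {target};"


def tsvector_mcv_entries(target: int) -> int:
    """Number of MCV entries kept for a tsvector column at this target."""
    return target * TSVECTOR_ENTRIES_PER_TARGET


sql = set_statistics_sql("zerver_message", "search_tsvector", 10_000)
print(sql)
print(tsvector_mcv_entries(100), "entries at the default target")
print(tsvector_mcv_entries(10_000), "entries at the new target")
```

The new target only affects statistics gathered after the change, so
it takes effect the next time the table is analyzed.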

PostgreSQL's documentation says[3]:

> Raising the limit might allow more accurate planner estimates to be
> made, particularly for columns with irregular data distributions, at
> the price of consuming more space in `pg_statistic` and slightly
> more time to compute the estimates.

These costs seem acceptable given the utility of having better search.
In the event that the pgroonga backend is in use, these larger index
statistics are simply wasted space and `VACUUM` computational time,
but the costs are likely still reasonable -- even 100k values are
dwarfed by the size of the database needed to generate 100k unique
entries in tsvectors.

[1]: https://github.com/postgres/postgres/blob/REL_14_4/src/backend/utils/adt/array_typanalyze.c#L261-L267
[2]: https://www.postgresql.org/docs/14/runtime-config-query.html#GUC-DEFAULT-STATISTICS-TARGET
[3]: https://www.postgresql.org/docs/14/planner-stats.html#id-1.5.13.5.3
2022-07-19 09:24:06 -07:00
actions create_user: Improve comment about prereg_user handling. 2022-07-18 12:16:20 -07:00
data_import user_profile: Fallback to "" for timezone upon creation. 2022-06-28 16:05:24 -07:00
integration_fixtures/nagios
lib response: Replace json_unauthorized with UnauthorizedError. 2022-07-18 18:01:42 -07:00
management typing: Add assertions before accessing settings. 2022-07-15 14:00:56 -07:00
migrations migrations: Adjust stats size for tsvector to 10k, from 100. 2022-07-19 09:24:06 -07:00
openapi populate_db: Fix data for "Favorite editor" custom field. 2022-07-15 16:51:24 -07:00
tests test_message_fetch: Verify the value of WWW-Authenticate. 2022-07-18 18:01:42 -07:00
tornado tornado: Ignore StreamClosedError. 2022-06-28 16:35:49 -07:00
views users: Tighten the type annotation of clean_profile_data. 2022-07-15 14:55:03 -07:00
webhooks integrations: Add RhodeCode webhook integration. 2022-07-13 14:10:00 -07:00
worker message_send: Remove unnecessary user_ids argument. 2022-05-04 14:45:18 -07:00
__init__.py
apps.py caching: Make sender type optional for flush_cache. 2021-07-26 14:48:07 -07:00
context_processors.py middleware: Reorder middleware to avoid hasattr checks. 2022-07-14 17:24:24 -07:00
decorator.py response: Replace json_unauthorized with UnauthorizedError. 2022-07-18 18:01:42 -07:00
filters.py typing: Fix function signatures. 2021-08-20 05:54:19 -07:00
forms.py integrations: Fix wrong type annotation. 2022-07-15 14:00:56 -07:00
logging_handlers.py python: Use Python 3.8 typing.{Protocol,TypedDict}. 2022-04-27 12:57:49 -07:00
middleware.py middleware: Add isinstance check before retrieving content. 2022-07-15 14:00:56 -07:00
models.py typing: Add assertions for Optional values. 2022-07-15 14:00:56 -07:00
signals.py requirements: Upgrade to Django 4.0. 2022-07-13 16:07:17 -07:00