zulip/zerver/migrations/0398_tsvector_statistics.py

27 lines
922 B
Python
Raw Normal View History

migrations: Adjust stats size for tsvector to 10k, from 100. PostgreSQL's `default_statistics_target` is used to track how many "most common values" ("MCVs") for a column when performing an `ANALYZE`. For `tsvector` columns, the number of values is actually 10x this number, because each row contains multiple values for the column[1]. The `default_statistics_target` defaults to 100[2], and Zulip does not adjust this at the server level. This translates to 1000 entries in the MCV for tsvectors. For large tables like `zerver_messages`, a too-small value can cause mis-planned query plans. The query planner assumes that any entry *not* found in the MCV list is *half* as likely as the least-likely value in it. If the table is large, and the MCV list is too short (as 1000 values is for large deployments), arbitrary no-in-the-MCV words will often be estimated by the query planner to occur comparatively quite frequently in the index. Based on this, the planner will instead choose to scan all messages accessible by the user, filtering by word in tsvector, instead of using the tsvector index and filtering by being accessible to the user. This results in degraded performance for word searching. However, PostgreSQL allows adjustment of this value on a per-column basis. Add a migration to adjust the value up to 10k for `search_tsvector` on `zerver_message`, which results in 100k entries in that MCV list. PostgreSQL's documentation says[3]: > Raising the limit might allow more accurate planner estimates to be > made, particularly for columns with irregular data distributions, at > the price of consuming more space in `pg_statistic` and slightly > more time to compute the estimates. These costs seem adequate for the utility of having better search. In the event that the pgroonga backend is in use, these larger index statistics are simply wasted space and `VACUUM` computational time, but the costs are likely still reasonable -- even 100k values are dwarfed by the size of the database needed to generate 100k unique entries in tsvectors. [1]: https://github.com/postgres/postgres/blob/REL_14_4/src/backend/utils/adt/array_typanalyze.c#L261-L267 [2]: https://www.postgresql.org/docs/14/runtime-config-query.html#GUC-DEFAULT-STATISTICS-TARGET [3]: https://www.postgresql.org/docs/14/planner-stats.html#id-1.5.13.5.3
2022-07-19 01:27:45 +02:00
# Generated by Django 4.0.6 on 2022-07-18 23:22
from django.db import migrations
class Migration(migrations.Migration):
dependencies = [
("zerver", "0397_remove_custom_field_values_for_deleted_options"),
]
operations = [
# The "most common values" list for a tsvector is 10x this
# number, which defaults to 100. Increasing it allows for
# better query planning, at a small cost of size, and
# `ANALYZE` time. It only takes effect after the next
# `ANALYZE`, which we run immediately.
migrations.RunSQL(
sql="ALTER TABLE zerver_message ALTER COLUMN search_tsvector SET STATISTICS 10000",
reverse_sql="ALTER TABLE zerver_message ALTER COLUMNS search_tsvector SET STATISTICS -1",
),
migrations.RunSQL(
sql="ANALYZE zerver_message",
reverse_sql=migrations.RunSQL.noop,
),
]