Whenever we use API queries to mark messages as read we now increment
two new LoggingCount stats, messages_read::hour and
messages_read_interactions::hour.
We add an early return in do_increment_logging_stat function if there
are no changes (increment is 0), as an optimization to avoid
unnecessary database queries.
We also log messages_read_interactions::hour Logging stat
as the number of API queries to mark messages as read.
We don't include tests for the case where do_update_pointer is called
because do_update_pointer will most likely be removed from the
codebase in the near future.
Fixes#2665.
Regenerated by tabbott with `lint --fix` after a rebase and change in
parameters.
Note from tabbott: In a few cases, this converts technical debt in the
form of unsorted imports into different technical debt in the form of
our largest files having very long, ugly import sequences at the
start. I expect this change will increase pressure for us to split
those files, which isn't a bad thing.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
Automatically generated by the following script, based on the output
of lint with flake8-comma:
import re
import sys
last_filename = None
last_row = None
lines = []
for msg in sys.stdin:
m = re.match(
r"\x1b\[35mflake8 \|\x1b\[0m \x1b\[1;31m(.+):(\d+):(\d+): (\w+)", msg
)
if m:
filename, row_str, col_str, err = m.groups()
row, col = int(row_str), int(col_str)
if filename == last_filename:
assert last_row != row
else:
if last_filename is not None:
with open(last_filename, "w") as f:
f.writelines(lines)
with open(filename) as f:
lines = f.readlines()
last_filename = filename
last_row = row
line = lines[row - 1]
if err in ["C812", "C815"]:
lines[row - 1] = line[: col - 1] + "," + line[col - 1 :]
elif err in ["C819"]:
assert line[col - 2] == ","
lines[row - 1] = line[: col - 2] + line[col - 1 :].lstrip(" ")
if last_filename is not None:
with open(last_filename, "w") as f:
f.writelines(lines)
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
A few comments are added to explain more clearly the changes made in
b23a5431cd namely about not using realm
arguments in LoggingCount Stats and the need to pass realm argument in
pull function.
The comments were tweaked by tabbott for readability.
This changeset is prepartory work for doing something reasonable with
analytics data during the zulip -> zulip data import process (and
potentially e.g. slack -> Zulip as well).
To support that, we need to make it possible to do our analytics
calculations for a single realm.
We do this while maintaining backwards compatibility and avoiding
massive duplicated code by adding an optional `realm` argument to the
entrypoints to the analytics system, especially process_count_stat.
More work involving restructuring FillState will be required for this
to be actually usable for its intented purpose, but this commit is a
nice checkpoint along the way.
Tweaked by tabbott to adjust comments and disable InstallationCount
updates when a realm argument is specified.
Fixes#1727.
With the server down, apply migrations 0245 and 0246. 0246 will remove
the pub_date column, so it's essential that the previous migrations
ran correctly to copy data before running this.
The name `create_logger` suggests something much bigger than what this
function actually does -- the logger doesn't any more or less exist
after the function is called than before. Its one real function is to
send logs to a specific file.
So, pull out that logic to an appropriately-named function just for
it. We already use `logging.getLogger` in a number of places to
simply get a logger by name, and the old `create_logger` callsites can
do the same.
This is already the loglevel we set on the root logger, so this has no
effect -- except in tests, where `test_settings.py` attempts to set
some of these same loggers to higher loglevels. Because the
`create_logger` call generally runs after we've configured settings,
it clobbers that effect.
The code in `test_settings.py` that tries to suppress logs only works
because it also sets `propagate=False`, which has nothing to do with
loglevels but does cause logs at this logger (and descendants) to be
dropped completely unless we've configured handlers for this logger
(or one of its relevant descendants.)
Except in:
- docs/writing-bots-guide.md, because bots are supposed to be Python 2
compatible
- puppet/zulip_ops/files/zulip-ec2-configure-interfaces, because this
script is still on python2.7
- tools/lint
- tools/linter_lib
- tools/lister.py
For the latter two, because they might be yanked away to a separate repo
for general use with other FLOSS projects.
Sort of a hacky hammer, but
* The original design of the analytics system mistakenly attempted to play
nicely with non-UTC datetimes.
* Timezone errors are really hard to find and debug, and don't jump out that
easily when reading code.
I don't know of any outstanding errors, but putting a few "assert this
timezone is in UTC" around will hopefully reduce the chance that there are
any current or future timezone errors.
Note that none of these functions are called outside of the analytics code
(and tests). This commit also doesn't change any current behavior, assuming
a database where all datetimes have been being stored in UTC.
Previously we would update FillState for daily stats on hourly boundaries as
well. This would create two extra queries on the FillState table every hour
(for each CountStat), which adds roughly 50ms of extra processing for each
CountStat each day, as well as two extra lines each hour in the analytics
log. This can be a minor annoyance when backfilling stats.
I think this is more pythonic?
We could also get rid of LoggingCountStats altogether, since it's now just a
special case of CountStat (is_logging == data_collector.pull_function is None).
But I think it's nice to keep the distinction since they behave so differently.
Removes the circular dependency of CountStat containing a DataCollector, and
DataCollector containing a function that takes a CountStat as an argument.
This will allow us to appropriately generalize CountStat to include
LoggingCountStat and CustomPullCountStat. It'll also make life easier when
we introduce DependentCountStat.
The previous zerver_* names were unwieldy and not very readable. This also
puts more of the useful information in one place; in particular, makes it
easier to skim a CountStat declaration and see if we're collecting it at a
user/stream granularity or a realm granularity.
It turned out to not be that useful once we added subgroup. The previous
design of the CountStat object also assumed more reuseability of the *_query
strings than what ended up happening.
The filter_args also had some carrying costs:
* It's hard to be confident that filter_args other than the ones explicitly
in our tests would have had expected behavior.
* The filter_args/join_args system is the most complex part of the CountStat
object, and makes understanding the *_query strings unnecessarily
difficult for a new contributor.
Groundwork for allowing stats like "Monthly Active Users".
CountStat.interval is no longer as clean a value as before, so removed it
from views.get_chart_data. It wasn't being used by the frontend anyway.
Removing interval from logger calls in counts.py is not a big loss since we
now include the frequency (which is typically also the interval) in
CountStat.property.
count_message_type_by_user_query is in a different format (no WHERE clause)
from the rest since I'm having a hard time reasoning about how that would
interact with the LEFT JOIN, especially given that there are %(join_args)s.
Analytics database tables are getting big, and so we're likely moving to a
model where ~all stats are day stats, and we keep hourly stats only for the
last N days.
Also changed the name because:
* messages_sent_* suggests the counts (summed over subgroup) should be the
same as the other messages_sent stats, but they are different (these don't
include PMs).
* messages_sent_by_stream:is_bot:day is longer than 32 characters, the max
allowable length for a BaseCount.property.
Includes a database migration to remove the old stat from the analytics
tables.
Having both messages_sent:hour and messages_sent:is_bot:day is confusing,
since a single messages_sent:is_bot:hour would have a superset of the
information and take less total space. This commit and its parent together
replace the two stats with a single messages_sent:is_bot:hour.
Includes a database migration. The interval field was originally there to
facilitate time aggregation (e.g. aggregate_hour_to_day), but we now do such
aggregations in views code or in the frontend.
A few reasons:
* Our two other subgroup'd message stats in UserCount are at CountStat.DAY
frequency (messages_sent:is_bot and messages_sent:message_type).
* Keeping this stat at hourly frequency would likely double the size of our
analytics table, given the current stats. (Counterpoint: if there are
roughly as many active streams as active users, and we keep
messages_sent_to_stream:is_bot at hourly frequency, then maybe this stat
is only a 30% or 50% increase).
* We're currently only showing this on the frontend as a pie chart anyway.
Previously, this function seemed ambivalent about whether it was generating
a series of abstract data points or a series of data points that would
correspond to times. Switch firmly to the latter, so e.g. if the frequency
changes, so will the length of the output sequence.
Fixes an error in the definition of
COUNT_STATS['messages_sent_to_stream:is_bot']. The CountStat needs a
group_by argument since it is supposed to group by UserProfile.is_bot.
This query counts the number of messages each user has sent, subgroup'd by
whether the message was a private_message (PM or sent to a huddle), sent to
a 'private_stream', or sent to a 'public_stream'.
We need to join on zerver_stream to find out whether stream messages were
sent to public streams or private streams, but it needs to be a LEFT JOIN
rather than a JOIN so that we preserve the messages sent to non-streams.
We were updating FillState with FillState.objects.filter(..).update(..),
which does not update the last_modified field (which has auto_now=True).
The correct incantation is the save() method of the actual FillState
object.
interval refers to a time interval, and frequency refers to something that
semantically means something closer to 'hourly' or 'daily'.
Currently, interval can have values 'hour', 'day', or 'gauge', and frequency
can only have values 'hour' and 'day'.
It seems unlikely we will need count_message_by_stream without the
UserProfile table in the future, so write count_message_by_stream_and_is_bot
in the usual query form and replace count_message_by_stream with it.
This also has the benefit of shortening our list of "special case" queries
from two to one.
The pathways of the removed test will be covered more thoroughly in the new
TestCountStats tests.