This commit upgrades 0015_clear_duplicate_counts migration to remove
duplicate count in StreamCount, UserCount, InstallationCount as well.
Fixes https://github.com/zulip/docker-zulip/issues/266
Unlike stream_stats, I'm not aware of any of these having been used in
the last few years, and it's basically just really bad subsets of the
data in /activity, which also doesn't require shell access to use.
These haven't had real work or usage, AFAIK, since 2013.
A few major themes here:
- We remove short_name from UserProfile
and add the appropriate migration.
- We remove short_name from various
cache-related lists of fields.
- We allow import tools to continue to
write short_name to their export files,
and then we simply ignore the field
at import time.
- We change functions like do_create_user,
create_user_profile, etc.
- We keep short_name in the /json/bots
API. (It actually gets turned into
an email.)
- We don't modify our LDAP code much
here.
The forms to change plan_type, add discount, scrub_realm etc
all post to the same endpoint.
Our frontend code is written so that only one form posts at a time.
But there should be no harm in enforcing the same in backend as well.
This commit changes the PreregistrationUser.invite_as dict to have
same set of values as we have for UserProfile.role.
This also adds a data migration to update the already exisiting
PreregistrationUser and MultiuseInvite objects.
Whenever we use API queries to mark messages as read we now increment
two new LoggingCount stats, messages_read::hour and
messages_read_interactions::hour.
We add an early return in do_increment_logging_stat function if there
are no changes (increment is 0), as an optimization to avoid
unnecessary database queries.
We also log messages_read_interactions::hour Logging stat
as the number of API queries to mark messages as read.
We don't include tests for the case where do_update_pointer is called
because do_update_pointer will most likely be removed from the
codebase in the near future.
There seems to have been a confusion between two different uses of the
word “optional”:
• An optional parameter may be omitted and replaced with a default
value.
• An Optional type has None as a possible value.
Sometimes an optional parameter has a default value of None, or None
is otherwise a meaningful value to provide, in which case it makes
sense for the optional parameter to have an Optional type. But in
other cases, optional parameters should not have Optional type. Fix
them.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
Fixes#2665.
Regenerated by tabbott with `lint --fix` after a rebase and change in
parameters.
Note from tabbott: In a few cases, this converts technical debt in the
form of unsorted imports into different technical debt in the form of
our largest files having very long, ugly import sequences at the
start. I expect this change will increase pressure for us to split
those files, which isn't a bad thing.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
Automatically generated by the following script, based on the output
of lint with flake8-comma:
import re
import sys
last_filename = None
last_row = None
lines = []
for msg in sys.stdin:
m = re.match(
r"\x1b\[35mflake8 \|\x1b\[0m \x1b\[1;31m(.+):(\d+):(\d+): (\w+)", msg
)
if m:
filename, row_str, col_str, err = m.groups()
row, col = int(row_str), int(col_str)
if filename == last_filename:
assert last_row != row
else:
if last_filename is not None:
with open(last_filename, "w") as f:
f.writelines(lines)
with open(filename) as f:
lines = f.readlines()
last_filename = filename
last_row = row
line = lines[row - 1]
if err in ["C812", "C815"]:
lines[row - 1] = line[: col - 1] + "," + line[col - 1 :]
elif err in ["C819"]:
assert line[col - 2] == ","
lines[row - 1] = line[: col - 2] + line[col - 1 :].lstrip(" ")
if last_filename is not None:
with open(last_filename, "w") as f:
f.writelines(lines)
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
Generated by pyupgrade --py36-plus --keep-percent-format, but with the
NamedTuple changes reverted (see commit
ba7906a3c6, #15132).
Signed-off-by: Anders Kaseorg <anders@zulip.com>
At the time of creating streams in test_counts.py we earlier did not saved
recipient in the stream object.
stream.recipient is used in many functions so they would throw error.
The right long-term fix here is probably to just use the standard
stream creation functions rather than having a hacky duplicate
here.
datetime.timezone is available in Python ≥ 3.2. This also lets us
remove a pytz dependency from the PostgreSQL scripts.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
We change do_create_user and create_user to accept
role as a parameter instead of 'is_realm_admin' and 'is_guest'.
These changes are done to minimize data conversions between
role and boolean fields.
mock is just a backport of the standard library’s unittest.mock now.
The SAMLAuthBackendTest change is needed because
MagicMock.call_args.args wasn’t introduced until Python
3.8 (https://bugs.python.org/issue21269).
The PROVISION_VERSION bump is skipped because mock is still an
indirect dev requirement via moto.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
This commit merges do_change_is_admin and do_change_is_guest to a
single function do_change_user_role which will be used for changing
role of users.
do_change_is_api_super_user is added as a separate function for
changing is_api_super_user field of UserProfile.
Since production testing of `message_retention_days` is finished, we can
enable this feature in the organization settings page. We already had this
setting in frontend but it was bit rotten and not rendered in templates.
Here we replaced our past text-input based setting with a
dropdown-with-text-input setting approach which is more consistent with our
existing UI.
Along with frontend changes, we also incorporated a backend change to
handle making retention period forever. This change introduces a new
convertor `to_positive_or_allowed_int` which only allows positive integers
and an allowed value for settings like `message_retention_days` which can
be a positive integer or has the value `Realm.RETAIN_MESSAGE_FOREVER` when
we change the setting to retain message forever.
This change made `to_not_negative_int_or_none` redundant so removed it as
well.
Fixes: #14854
Generated by autopep8, with the setup.cfg configuration from #14532.
I’m not sure why pycodestyle didn’t already flag these.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
Generated by `pyupgrade --py3-plus --keep-percent-format` on all our
Python code except `zthumbor` and `zulip-ec2-configure-interfaces`,
followed by manual indentation fixes.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
The value of start would be set to realm creation date or installation
date unless a value is explicitly provided by the user. We don't
want to show analytics data missing error in both these cases if
start is less than 24.5 hours from timezone_now() since some of the
stats are populated by a daily cron job that may take a couple of
minutes to run.
The same case applies if the value of the start is passed in the
request.
We try to use the correct variation of `email`
or `delivery_email`, even though in some
databases they are the same.
(To find the differences, I temporarily hacked
populate_db to use different values for email
and delivery_email, and reduced email visibility
in the zulip realm to admins only.)
In places where we want the "normal" realm
behavior of showing emails (and having `email`
be the same as `delivery_email`), we use
the new `reset_emails_in_zulip_realm` helper.
A couple random things:
- I fixed any error messages that were leaking
the wrong email
- a test that claimed to rely on the order
of emails no longer does (we sort user_ids
instead)
- we now use user_ids in some place where we used
to use emails
- for IRC mirrors I just punted and used
`reset_emails_in_zulip_realm` in most places
- for MIT-related tests, I didn't fix email
vs. delivery_email unless it was obvious
I also explicitly reset the realm to a "normal"
realm for a couple tests that I frankly just didn't
have the energy to debug. (Also, we do want some
coverage on the normal case, even though it is
"easier" for tests to pass if you mix up `email`
and `delivery_email`.)
In particular, I just reset data for the analytics
and corporate tests.
We now have this API...
If you really just need to log in
and not do anything with the actual
user:
self.login('hamlet')
If you're gonna use the user in the
rest of the test:
hamlet = self.example_user('hamlet')
self.login_user(hamlet)
If you are specifically testing
email/password logins (used only in 4 places):
self.login_by_email(email, password)
And for failures uses this (used twice):
self.assert_login_failure(email)
Replaced unique_together with UniqueConstraint in models that
covered nullable fields as in unique_together database indexes
don't work where subgroup=None. So added conditional unique
index handling invalid duplicate Count data.
Added 0015_clear_duplicate_counts migration to handle existing
data that violates the constraints.
Also corrected a test case in test_counts.py which didn't clear its
state properly and thus was accidentally taking advantage of this
database schema bug.
A few comments are added to explain more clearly the changes made in
b23a5431cd namely about not using realm
arguments in LoggingCount Stats and the need to pass realm argument in
pull function.
The comments were tweaked by tabbott for readability.
This field wasn't used for anything, and I think it has very limited
use for debugging, since fundamentally, it'll almost always have a
value within the hour of the actual timestamp in FillState, and any
more fine-grained logging we might want would be available in the
analytics job's own logs.
The proximal reason to remove it is that apparently Django's
model_to_dict doesn't support auto_now fields, and that caused some
trouble when working on adding more complete import/export support for
analytics data.
This changeset is prepartory work for doing something reasonable with
analytics data during the zulip -> zulip data import process (and
potentially e.g. slack -> Zulip as well).
To support that, we need to make it possible to do our analytics
calculations for a single realm.
We do this while maintaining backwards compatibility and avoiding
massive duplicated code by adding an optional `realm` argument to the
entrypoints to the analytics system, especially process_count_stat.
More work involving restructuring FillState will be required for this
to be actually usable for its intented purpose, but this commit is a
nice checkpoint along the way.
Tweaked by tabbott to adjust comments and disable InstallationCount
updates when a realm argument is specified.
This is a preparatory commit for using isort for sorting all of our
imports, merging changes to files where we can easily review the
changes as something we're happy with.
These are also files with relatively little active development, which
means we don't expect much merge conflict risk from these changes.
This is adds foreign keys to the corresponding Recipient object in the
UserProfile on Stream tables, a denormalization intended to improve
performance as this is a common query.
In the migration for setting the field correctly for existing users,
we do a direct SQL query (because Django 1.11 doesn't provide any good
method for doing it properly in bulk using the ORM.).
A consequence of this change to the model is that a bit of code needs
to be added to the functions responsible for creating new users (to
set the field after the Recipient object gets created). Fortunately,
there's only a few code paths for doing that.
Also an adjustment is needed in the import system - this introduces a
circular relation between Recipient and UserProfile. The field cannot be
set until the Recipient objects have been created, but UserProfiles need
to be created before their corresponding Recipients. We deal with this
by first importing UserProfiles same way as before, but we leave the
personal_recipient field uninitialized. After creating the Recipient
objects, we call a function to set the field for all the imported users
in bulk.
A similar change is made for managing Stream objects.
We currently have code to calculate the value of realm_icon_url,
admin_emails and default_discount in two diffrent places. With
the addition of showing confirmation links it would become three.
The easiest way to deduplicate the code and make the view cleaner
is by doing the calculations in template. Alternatively one can
write a function that takes users, realms and confirmations as
arguments and sets the value of realm_icon_url, admin_emails and
default_discount appropriately in realm object according to the
type of the confirmation. But that seems more messy than passing
the functions directly to template approach.
MigrationsTestCase is intentionally omitted from this, since migrations
tests are different in their nature and so whatever setUp()
ZulipTestCase may do in the future, MigrationsTestCase may not
necessarily want to replicate.
Fixes#1727.
With the server down, apply migrations 0245 and 0246. 0246 will remove
the pub_date column, so it's essential that the previous migrations
ran correctly to copy data before running this.
This sidesteps tricky escaping issues, and will make it easier to
build a strict Content-Security-Policy.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
Previous cleanups (mostly the removals of Python __future__ imports)
were done in a way that introduced leading newlines. Delete leading
newlines from all files, except static/assets/zulip-emoji/NOTICE,
which is a verbatim copy of the Apache 2.0 license.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
We simply state that certain options are `Optional`.
The following files are affected:
add_users_to_mailing_list
send_to_email_mirror
fill_memcached_caches
client_activity
This function is an alternative to get_admin_users that we use in all
places where we explicitly want only human administrative users (not
administrative bots). The following commits will rename
get_admin_users for better clarity.
This renames Subscription.in_home_view field to is_muted, for greater
clarity as to what it does just from seeing the setting name, without
having to look it up.
Also disabled an obsolete test_migrations test.
Fixes#10042.
This makes the implementation of `get_realm` consistent with its
declared return type of `Realm` rather than `Optional[Realm]`.
Fixes#12263.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
Using sys.exit in a management command makes it impossible
to unit test the code in question. The correct approach to do the same
thing in Django management commands is to raise CommandError.
Followup of b570c0dafa
AssertionErrors were raised when attempting to run manual comparison
tests to ensure correctness when exporting the analytics realm using
export_from_config. This was caused by this populate_analytics_db
stream being created without any subscribers, which violates an
invariant.
We fix this by simply subscribing the 'shylock' user to that stream.
In ebdd55814c, we added zilencer imports
without using the proper mocking procedure for when zilencer is not enabled.
This whole setup is a mess and probably we should enable zilencer
unconditionally in a future version.
This is a major rewrite of the billing system. It moves subscription
information off of stripe Subscriptions and into a local CustomerPlan
table.
To keep this manageable, it leaves several things unimplemented
(downgrading, etc), and a variety of other TODOs in the code. There are also
some known regressions, e.g. error-handling on /upgrade is broken.
A key part of this is the new helper, get_user_by_delivery_email. Its
verbose name is important for clarity; it should help avoid blind
copy-pasting of get_user (which we'll also want to rename).
Unfortunately, it requires detailed understanding of the context to
figure out which one to use; each is used in about half of call sites.
Another important note is that this PR doesn't migrate get_user calls
in the tests except where not doing so would cause the tests to fail.
This probably deserves a follow-up refactor to avoid bugs here.
This deletes the unused Subscription.notifications field and removes
it from some testing and analytics code (which should not have been
using it in the first place).
Fixes#10042.
This fixes a subtle bug where if you reran populate_analytics_db
directly, we'd end up in a weird state where memcached fetched the
"old" pre-flush UserProfile object for shylock when loading /stats,
which ultimately would result in /stats appearing totally broken.
This code is going to end up pretty complex -- each stat has multiple levels
of aggregation (UserCount, RealmCount, InstallationCount), and refinement
(subgroups), and soon we'll have charts that take data from multiple stats
as input.
Not sure what the best way to present it is, but hopefully this simplifies
it a bit.
We use "Everyone" for the button labels already.
Soon we'll support "Everyone" meaning either the installation or the realm,
depending on the URL route used to access the stats.
In this commit:
Two new URLs are added, to make all realms accessible for server
admins. One is for the stats page itself and another for getting
chart data i.e. chart data API requests.
For the above two new URLs corresponding two view functions are
added.
The name `create_logger` suggests something much bigger than what this
function actually does -- the logger doesn't any more or less exist
after the function is called than before. Its one real function is to
send logs to a specific file.
So, pull out that logic to an appropriately-named function just for
it. We already use `logging.getLogger` in a number of places to
simply get a logger by name, and the old `create_logger` callsites can
do the same.
This is already the loglevel we set on the root logger, so this has no
effect -- except in tests, where `test_settings.py` attempts to set
some of these same loggers to higher loglevels. Because the
`create_logger` call generally runs after we've configured settings,
it clobbers that effect.
The code in `test_settings.py` that tries to suppress logs only works
because it also sets `propagate=False`, which has nothing to do with
loglevels but does cause logs at this logger (and descendants) to be
dropped completely unless we've configured handlers for this logger
(or one of its relevant descendants.)
I've wanted this when looking at a tab from the day before.
Also provides the date and time in UTC, which is handy for
interpreting some of the data.
Pretty sure this is not the world's cleanest way to do this in the
front-end code. It'll do for now.
Substantively, this makes the table more readable by grouping users
into expanding sets by level of activity: active in last day, active
in last week, have an account at all. The class "active in last week",
as opposed to "active in last week but not in last day", makes more
natural comparisons both between realms and for one realm through time,
and it's less sensitive to the details of our definitions.
This also makes the terminology more standard. We already made that
change in the display, in the previous commit; as we go through the
logic here, we adjust the terminology in the code too.
Except in:
- docs/writing-bots-guide.md, because bots are supposed to be Python 2
compatible
- puppet/zulip_ops/files/zulip-ec2-configure-interfaces, because this
script is still on python2.7
- tools/lint
- tools/linter_lib
- tools/lister.py
For the latter two, because they might be yanked away to a separate repo
for general use with other FLOSS projects.
Sort of a hacky hammer, but
* The original design of the analytics system mistakenly attempted to play
nicely with non-UTC datetimes.
* Timezone errors are really hard to find and debug, and don't jump out that
easily when reading code.
I don't know of any outstanding errors, but putting a few "assert this
timezone is in UTC" around will hopefully reduce the chance that there are
any current or future timezone errors.
Note that none of these functions are called outside of the analytics code
(and tests). This commit also doesn't change any current behavior, assuming
a database where all datetimes have been being stored in UTC.
Previously, entering a non-UTC end time for a daily stat would give you
incorrect results. This is because:
* All daily stats are collected at and have end_times in the database in
midnight UTC.
* For daily stats, time_range returns a list of datetimes at midnight in the
timezone of its end argument. These datetimes are the only ones we look
for when looking for rows corresponding to the stat in the database.
* Previously, we passed on the end argument from the API to time_range,
without modification.
This consists of the `zulip_ops::stats` Puppet class, which has apparently
not been used since 2014, and a number of files that I believe were
only used for that. Also a couple of tiny loose ends in other files.
These are no longer useful, with our spiffy new analytics framework,
and we haven't in fact been using them for some time, while the
`active-user-stats` cron job does cause regular mail from cron.
Just delete them.
sort_client_labels sorts first by total, and then to ensure deterministic
outcomes, sorts (reverse) alphabetically by label.
Fixes regression introduced in 0c0e539.
Previously we showed the total number of users with an active account. This
changes it to show only the number of users that have logged in in the past
two weeks.
Rename 'zulip_internal' decorator to 'require_server_admin', add
documentation for 'server_admin', explaining how to give permission
for ./activity page.
Fixes: #1463.
Previously we would update FillState for daily stats on hourly boundaries as
well. This would create two extra queries on the FillState table every hour
(for each CountStat), which adds roughly 50ms of extra processing for each
CountStat each day, as well as two extra lines each hour in the analytics
log. This can be a minor annoyance when backfilling stats.
I think this is more pythonic?
We could also get rid of LoggingCountStats altogether, since it's now just a
special case of CountStat (is_logging == data_collector.pull_function is None).
But I think it's nice to keep the distinction since they behave so differently.
Removes the circular dependency of CountStat containing a DataCollector, and
DataCollector containing a function that takes a CountStat as an argument.
This will allow us to appropriately generalize CountStat to include
LoggingCountStat and CustomPullCountStat. It'll also make life easier when
we introduce DependentCountStat.
The previous zerver_* names were unwieldy and not very readable. This also
puts more of the useful information in one place; in particular, makes it
easier to skim a CountStat declaration and see if we're collecting it at a
user/stream granularity or a realm granularity.
It turned out to not be that useful once we added subgroup. The previous
design of the CountStat object also assumed more reuseability of the *_query
strings than what ended up happening.
The filter_args also had some carrying costs:
* It's hard to be confident that filter_args other than the ones explicitly
in our tests would have had expected behavior.
* The filter_args/join_args system is the most complex part of the CountStat
object, and makes understanding the *_query strings unnecessarily
difficult for a new contributor.
Groundwork for allowing stats like "Monthly Active Users".
CountStat.interval is no longer as clean a value as before, so removed it
from views.get_chart_data. It wasn't being used by the frontend anyway.
Removing interval from logger calls in counts.py is not a big loss since we
now include the frequency (which is typically also the interval) in
CountStat.property.
The code in fixtures.py is only called from populate_analytics_db, and is
only used for generating pretty fixture data for manual testing. This commit
adds tests for a few things that were easy to add tests for, and provides
some minimal coverage of the file, but is not meant to be comprehensive.
Originally, all the client names in populate_analytics_db started with
underscores to make it easy to selectively delete and regenerate them when
re-running populate_analytics_db.
We eventually want to merge populate_analytics_db into populate_db though,
in which case it makes more sense for them to share client names, and not
worry about the case where we run (or re-run) populate_analytics_db
independently of populate_db.
It will simplify the logic needed to process the "Sent by Me" view in
Messages Sent Over Time in stats.js.
Also, we gzip the data sent from our server, so there is little additional
network usage by doing this.
Django 1.10 has changed the implementation of this function to
match our custom implementation; in addition to this, we prefer
render().
Fixes#1914 via #4093.
count_message_type_by_user_query is in a different format (no WHERE clause)
from the rest since I'm having a hard time reasoning about how that would
interact with the LEFT JOIN, especially given that there are %(join_args)s.
Analytics database tables are getting big, and so we're likely moving to a
model where ~all stats are day stats, and we keep hourly stats only for the
last N days.
Also changed the name because:
* messages_sent_* suggests the counts (summed over subgroup) should be the
same as the other messages_sent stats, but they are different (these don't
include PMs).
* messages_sent_by_stream:is_bot:day is longer than 32 characters, the max
allowable length for a BaseCount.property.
Includes a database migration to remove the old stat from the analytics
tables.
When you pass a naive datetime to the Django ORM, it uses settings.TIME_ZONE
for the time zone. In the development environment, both settings.TIME_ZONE
and datetime.now() use 'America/New_York', so there is no change in behavior
there. (fromtimestamp with no tz argument uses the same timezone as
datetime.now)
We are soon going to change settings.TIME_ZONE to UTC, so need to remove
naive datetimes from queries to the ORM.
This actually fixes previously broken behavior, since 'date' here gets
turned into the 'day' argument of seconds_active_during_day(day), where
tzinfo is set to UTC.
API: Adds a "display_order" to the response, which is a suggested order of
importance for the clients or recipient types respectively.
frontend: Changes messages_sent_by_{client,recipient_type} to use a fixed
order for any given user.
Also includes a number of changes to messages_sent_by_recipient_type that
were convenient to do at the same time, since the two charts share a lot of
code.
This adds a frontend for the analytics system we've had for a few
months, showing several graphs of the data in Zulip.
There's a ton more that we can do with this tooling, but this initial
version is enough to provide users with a pretty good experience.
Fixes#2052.
Makes a number of simplications to the analytics views code. The main one is
that we now return the entire data series, even if the data is eventually
going to go into a pie chart. This was prompted by us wanting several
different pie charts for each stat (one for last 30 days, one for all time,
etc), but I think it is also a more natural API. The total amount of data
being sent for the pie charts now is maybe half of what is being sent for
our single 'hourly' stat, or maybe up to 10,000 ints per year the
organization has been around.
The other big change is that the data being sent back is now always explicit
about whether it is data about the realm (stored in data['realm'], or data
about the user (stored in data['user']).
Having both messages_sent:hour and messages_sent:is_bot:day is confusing,
since a single messages_sent:is_bot:hour would have a superset of the
information and take less total space. This commit and its parent together
replace the two stats with a single messages_sent:is_bot:hour.
Includes a database migration. The interval field was originally there to
facilitate time aggregation (e.g. aggregate_hour_to_day), but we now do such
aggregations in views code or in the frontend.
A few reasons:
* Our two other subgroup'd message stats in UserCount are at CountStat.DAY
frequency (messages_sent:is_bot and messages_sent:message_type).
* Keeping this stat at hourly frequency would likely double the size of our
analytics table, given the current stats. (Counterpoint: if there are
roughly as many active streams as active users, and we keep
messages_sent_to_stream:is_bot at hourly frequency, then maybe this stat
is only a 30% or 50% increase).
* We're currently only showing this on the frontend as a pie chart anyway.
Previously, this function seemed ambivalent about whether it was generating
a series of abstract data points or a series of data points that would
correspond to times. Switch firmly to the latter, so e.g. if the frequency
changes, so will the length of the output sequence.
Not sure if this would actually be a performance problem in practice, but
this was originally making a database query for each subgroup (instead of
just a single query getting data for all the subgroups).
Also removed the filter against the interval column, which will soon not be
needed (interval will be uniquely determined by the property).
Adds two things to TestCountStats.setUp():
* A realm with no messages, that generally should not show up in *Count
tables,
* Users/streams/messages created at 0, 1, 61, and 1441 (just over a day)
minutes ago (previously was 0, 60), to better test the start_time/end_time
in the queries, and the frequency/interval setting in the CountStats.
Fixes an error in the definition of
COUNT_STATS['messages_sent_to_stream:is_bot']. The CountStat needs a
group_by argument since it is supposed to group by UserProfile.is_bot.
This query counts the number of messages each user has sent, subgroup'd by
whether the message was a private_message (PM or sent to a huddle), sent to
a 'private_stream', or sent to a 'public_stream'.
We need to join on zerver_stream to find out whether stream messages were
sent to public streams or private streams, but it needs to be a LEFT JOIN
rather than a JOIN so that we preserve the messages sent to non-streams.
We were updating FillState with FillState.objects.filter(..).update(..),
which does not update the last_modified field (which has auto_now=True).
The correct incantation is the save() method of the actual FillState
object.
interval refers to a time interval, and frequency refers to something that
semantically means something closer to 'hourly' or 'daily'.
Currently, interval can have values 'hour', 'day', or 'gauge', and frequency
can only have values 'hour' and 'day'.
Finishes the refactoring started in c1bbd8d. The goal of the refactoring is
to change the argument to get_realm from a Realm.domain to a
Realm.string_id. The steps were
* Add a new function, get_realm_by_string_id.
* Change all calls to get_realm to use get_realm_by_string_id instead.
* Remove get_realm.
* (This commit) Rename get_realm_by_string_id to get_realm.
Part of a larger migration to remove the Realm.domain field entirely.
Was enabled by commit 41e8ee3 where we moved TIME_ZERO to before the realms
created by populate_db.py.
Also removes the stub for TestAggregates, since the remaining thing to be
tested was the aggregation from RealmCount to InstallationCount, and the end
to end checks provided by the TestCountStat tests should be sufficient.
In a previous design, there was no FillState table, and one could run any
CountStat at any time. This is no longer supported.
This test was making sure that if one ran a CountStat at a certain hour, and
then ran it at a previous hour, the old rows would still be there.
It seems unlikely we will need count_message_by_stream without the
UserProfile table in the future, so write count_message_by_stream_and_is_bot
in the usual query form and replace count_message_by_stream with it.
This also has the benefit of shortening our list of "special case" queries
from two to one.
The pathways of the removed test will be covered more thoroughly in the new
TestCountStats tests.
The filter args dictionary applies to the X table in a count X by Y query,
which in this case is the zerver_message table. This stat had an incorrect set
of arguments meant for the zerver_userprofile table.
We alter the behavior of our queries to no longer write rows with 0 counts
to the db, and pad with 0s in the related views code. As a result we are
also able to combine the where and join clause conditions in the sql
queries. This new behavior is also updated in our tests.
Adds a database migration, adds a new string_id argument to the management
realm creation command, and adds a short name field to the web realm
creation form when REALMS_HAVE_SUBDOMAINS is False.
Adds a count_X_by_Y_query to counts.py, similar in spirit to a
count_recipient_by_user query, where we would join on the Message,
Recipient, and UserProfile table. Here, we also join on the Stream table in
order to distinguish private and public streams, and we merge the counts for
PM and Huddle type messages into a single subgroup.
This is a major change to the analytics schema, and is the first step in a
number of refactorings and performance improvements. For instance, it allows
* Grouping sets of similar CountStats in the *Count tables. For instance,
active{_humans,_bots} will now have the same property, but have different
subgroup values.
* Combining queries that differ only in their value on 1 filter clause, so
that we make fewer passes through the zerver tables. For instance, instead
of running a query for each of messages_sent_to_public_streams and
messages_sent_to_private_streams, we can now run a single query with a
group by on Stream.invite_only, and store the group by value in the
subgroup column.
For each database query made by an analytics function, log time spent and
the number of rows changed to var/logs/analytics.log.
In the spirit of write ahead logging, for each (stat, end_time)
update, log the start and end of the "transaction", as well as time
spent.
Change the CountStat object to take an is_gauge variable instead of a
smallest_interval variable. Previously, (smallest_interval, frequency)
could be any of (hour, hour), (hour, day), (hour, gauge), (day, hour),
(day, day), or (day, gauge).
The current change is equivalent to excluding (hour, day) and (day, hour)
from the list above.
This change, along with other recent changes, allows us to simplify how we
handle time intervals. This commit also removes the TimeInterval object.
Adding FillState, removing do_aggregate_hour_to_day, and disallowing unused
(interval, frequency) pairs removes the need for the nested for loops in
do_fill_count_stat_at_hour. This commit replaces that control flow with a
simpler equivalent.
The functionality provided is more naturally done in the views code. It also
allows us to aggregate using day boundaries from the local timezone, rather
than UTC.
Adds two simplifying assumptions to how we process analytics stats:
* Sets the atomic unit of work to: a stat processed at an hour boundary.
* For any given stat, only allows these atomic units of work to be processed
in chronological order.
Adds a table FillState that, for each stat, keeps track of the last unit of
work that was processed.
Previously, if a Realm had no users (or no streams),
do_aggregate_to_summary_table would fail to add a row with value 0. This
commit fixes the issue and also simplifies the do_aggregate_to_summary_table
logic.
Previously we showed both the value and the id of the BaseCount record,
which is confusing in a typical case where you only care about the value,
and both the value and id are smallish ints.
Refactor the current analytics tests into the following classes:
* TestUpdateAnalyticsCounts, which will eventually test the management
command, backfilling, what happens when new tests are added, etc.
* TestProcessCountStat, which tests the ins and outs of propagating the
value of a single stat up through the various *Count tables.
* TestAggregates, which tests the do_aggregate_* methods.
* TestXByYQueries, which tests the count_X_by_Y_query SQL snippets.
* TestCountStats, which has tests for individual CountStats.
This commit does not change the name or contents of any individual test.
Many tests are structured to run some process, and then check a count in a
BaseCount record using default values for realm, property, interval, and
end_time. This commit adds a new assertCountEquals method to
AnalyticsTestCase, and simplifies other assert calls as appropriate.
Add a default_realm object to AnalyticsTestCase, created 2 days before
AnalyticsTestCase.TIME_ZERO.
Add lightweight create_user, create_stream, and create_message methods to
AnalyticsTestCase, with sensible defaults. In particular, all objects are by
default created at AnalyticsTestCase.TIME_LAST_HOUR, so that they are
included when running AnalyticsTestCase.process_last_hour.
Previously, analytics tests used timezone.now or custom datetime objects
when creating new realms, users, and streams.
This commit adds a fixed TIME_ZERO and a process_last_hour helper function
in a new AnalyticsTestCase class, and modifies the existing tests to use
them.
This is primarily implemented through altering the migration file in
order to move the columns, but also we try to make the defaults a
little better for future tables inherited from BaseCount.
There are a number of different stats that need to be propagated from
UserCount and StreamCount to RealmCount, and from RealmCount to
InstallationCount. Stats with hour intervals also need to have their day
values propagated. This commit fixes a bug in the summary table aggregation
logic so that for a given interval on a CountStat object we pull the correct
counts for the interval as well as do the day aggregation if required. We Also
ensure that any aggregation then done from the realmcount
table to the installationcount table follows the same aggregation logic
for intervals.
This is a first pass at building a framework for collecting various
stats about realms, users, streams, etc. Includes:
* New analytics tables for storing counts data
* Raw SQL queries for pulling data from zerver/models.py tables
* Aggregation functions for aggregating hourly stats into daily stats, and
aggregating user/stream level stats into realm level stats
* A management command for pulling the data
Note that counts.py was added to the linter exclude list due to errors
around %%s.
Type of parameter for function `is_recent`(line no.812) is `datetime`.
MyPy errors out, however, when the parameter is defined as `datetime`.
To get around, type `Any` is used.
This results in a substantial performance improvement for all of
Zulip's backend templates.
Changes in templates:
- Change `block.super` to `super()`.
- Remove `load` tag because Jinja2 doesn't support it.
- Use `minified_js()|safe` instead of `{% minified_js %}`.
- Use `compressed_css()|safe` instead of `{% compressed_css %}`.
- `forloop.first` -> `loop.first`.
- Use `{{ csrf_input }}` instead of `{% csrf_token %}`.
- Use `{# ... #}` instead of `{% comment %}`.
- Use `url()` instead of `{% url %}`.
- Use `_()` instead of `{% trans %}` because in Jinja `trans` is a block tag.
- Use `{% trans %}` instead of `{% blocktrans %}`.
- Use `{% raw %}` instead of `{% verbatim %}`.
Changes in tools:
- Check for `trans` block in `check-templates` instead of `blocktrans`
Changes in backend:
- Create custom `render_to_response` function which takes `request` objects
instead of `RequestContext` object. There are two reasons to do this:
1. `RequestContext` is not compatible with Jinja2
2. `RequestContext` in `render_to_response` is deprecated.
- Add Jinja2 related support files in zproject/jinja2 directory. It
includes a custom backend and a template renderer, compressors for js
and css and Jinja2 environment handler.
- Enable `slugify` and `pluralize` filters in Jinja2 environment.
Fixes#620.
get_realm is better in two key ways:
* It uses memcached to fetch the data from the cache and thus is faster.
* It does a case-insensitive query and thus is more safe.
To make this give accurate numbers, we need to filter out the
automated traffic from Zephyr mirroring and our internal monitoring.
(imported from commit 83642bc9a9d8d01dd9dc5dc7b3e3dee6c9705162)
This shows the number of messages sent by humans for the last
eight 24-hour periods, for each realm. "Messages sent" isn't a
perfect metric of activity, but it's easier to query with our
current data model than certain other statistics.
(imported from commit 9de3c479640a0b9dbc017b245dda21d951f4efa4)