149 Commits

Author SHA1 Message Date
Tim Abbott a0cfe45150 analytics: Wrap some longer lines. 2017-11-17 13:19:48 -08:00
rht d1689b5884 analytics: Use python 3 syntax for typing. 2017-11-17 13:16:49 -08:00
Tim Abbott 2b43a0302a python: Sort imports in smaller apps. 2017-11-15 15:55:49 -08:00
rht 51c1a6dfc9 analytics: Text-wrap long lines exceeding 110.
License: Apache-2.0
Signed-off-by: rht <rhtbot@protonmail.com>
2017-11-10 16:22:00 -08:00
rht b557b02f2f analytics/lib: Remove unused imports (F401). 2017-11-07 16:37:07 -08:00
rht ec5120e807 refactor: Remove six.moves.zip import. 2017-11-07 10:46:42 -08:00
rht 5cfffb0e51 analytics: Remove inheritance from object. 2017-11-06 08:53:48 -08:00
rht dcc831f767 refactor: Replace all __unicode__ methods with __str__.
Close #6627.
2017-11-02 11:01:47 -07:00
rht 691598a88b py3: Remove "from six.moves import range".
This is no longer required, since in Python 3, this is what the range
built-in does.
2017-10-17 23:28:14 -07:00
rht 2f3ae84e5a py3: Remove all `__future__ import division`. 2017-10-17 23:09:12 -07:00
rht a603a4f9f5 Remove `from __future__ import absolute_import`.
Except in:
- docs/writing-bots-guide.md, because bots are supposed to be Python 2
  compatible
- puppet/zulip_ops/files/zulip-ec2-configure-interfaces, because this
  script is still on python2.7
- tools/lint
- tools/linter_lib
- tools/lister.py

For the latter two, because they might be yanked away to a separate repo
for general use with other FLOSS projects.
2017-10-17 22:59:42 -07:00
Rishi Gupta c7bdabbda8 analytics: Disallow non-UTC fill times in process_count_stat.
No change in behavior, but we aren't supporting non-UTC times in analytics
as a whole any more, so might as well change this check as well.
2017-10-05 11:22:06 -07:00
Rishi Gupta 0596c4a810 analytics: Enforce various datetime arguments are in UTC.
Sort of a hacky hammer, but
* The original design of the analytics system mistakenly attempted to play
  nicely with non-UTC datetimes.
* Timezone errors are really hard to find and debug, and don't jump out that
  easily when reading code.

I don't know of any outstanding errors, but putting a few "assert this
timezone is in UTC" around will hopefully reduce the chance that there are
any current or future timezone errors.

Note that none of these functions are called outside of the analytics code
(and tests). This commit also doesn't change any current behavior, assuming
a database where all datetimes have been stored in UTC.
2017-10-05 11:22:06 -07:00
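
The 'assert this timezone is in UTC' checks described in the commit above could look roughly like the following. This is a minimal sketch using only the standard library; the names verify_utc and floor_to_hour are hypothetical and are not taken from the analytics code.

```python
from datetime import datetime, timedelta, timezone


def verify_utc(dt: datetime) -> None:
    """Raise if dt is naive or has a non-zero UTC offset (hypothetical helper).

    Note that this checks the offset only, not the name of the time zone.
    """
    if dt.tzinfo is None or dt.utcoffset() != timedelta(0):
        raise AssertionError(f"Datetime {dt!r} is not in UTC.")


def floor_to_hour(dt: datetime) -> datetime:
    """Example of a function that guards its argument before using it."""
    verify_utc(dt)
    return dt.replace(minute=0, second=0, microsecond=0)
```

Raising AssertionError here matches the convention mentioned elsewhere in this log that pathways catching only internal code errors should use AssertionError.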
Rishi Gupta 0f31cddf49 analytics: Add management command to clear single stat. 2017-10-05 11:22:06 -07:00
Aditya Bansal d9c9bfe7f6 logger: Add new create_logger abstraction to simplify logging.
This deduplicates a ton of Python logger-creation code to use a single
standard implementation, so we can avoid copy-paste problems.
2017-08-27 18:31:53 -07:00
umkay d9b23b39d3 mypy: Fix strict-optional in analytics. 2017-05-26 15:39:39 -07:00
Aditya Bansal 27b87943af pep8: Add compliance with rule E261 to counts.py. 2017-05-07 23:21:50 -07:00
Rishi Gupta 61bf445da4 analytics: Restrict fill_to_time to hour boundaries in process_count_stat. 2017-04-28 16:15:07 -07:00
Rishi Gupta 5e49da9285 analytics: Only update daily stats on day boundaries.
Previously we would update FillState for daily stats on hourly boundaries as
well. This would create two extra queries on the FillState table every hour
(for each CountStat), which adds roughly 50ms of extra processing for each
CountStat each day, as well as two extra lines each hour in the analytics
log. This can be a minor annoyance when backfilling stats.
2017-04-18 11:02:51 -07:00
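
A rough sketch of the rule described in the commit above: hourly stats are due at every hour boundary, daily stats only at day boundaries. The function name should_update and the string frequencies are assumptions made for illustration, as is treating midnight UTC as the day boundary.

```python
from datetime import datetime, timezone


def should_update(frequency: str, fill_to_time: datetime) -> bool:
    """Return True if a stat with this frequency is due at fill_to_time.

    Assumes fill_to_time is a UTC datetime that already sits on an hour boundary.
    """
    if frequency == "hour":
        return True
    if frequency == "day":
        # Daily stats are skipped on every hour except midnight UTC.
        return fill_to_time.hour == 0
    raise AssertionError(f"Unknown frequency: {frequency}")
```

With a check like this in front of the FillState lookup, a daily stat would cost no FillState queries on the 23 non-midnight hours of each day.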
Rishi Gupta c5f1398052 analytics: Add section comments in counts.count_stats_.
Also reorders the stats a bit.
2017-04-18 11:02:51 -07:00
Rishi Gupta b335ad2794 models: Add MIN_INTERVAL_LENGTH to UserActivityInterval.
Was previously a floating magic number appearing in both
zerver/lib/actions.py and analytics/lib/counts.py.
2017-04-18 11:02:51 -07:00
hackerkid 5c8f011d66 Remove unused timezone import. 2017-04-16 12:28:56 -07:00
Rishi Gupta 49bd330304 analytics: Add class DependentCountStat and stat realm_active_humans::day. 2017-04-14 11:41:07 -07:00
Rishi Gupta 1e8d2b984d counts.py: Rename DataCollector-level operations to be more generic.
We're about to use these for DependentCountStats that will run SQL queries
on the analytics tables instead of the zerver tables.
2017-04-14 11:41:07 -07:00
Rishi Gupta 47cf1d15ba counts.py: Move performance logging call out of pull_functions.
Makes it less likely someone will write a pull function in the future and
forget.
2017-04-14 11:41:07 -07:00
Rishi Gupta 6dff22cbaf counts.py: Change check for LoggingCountStat to use isinstance.
I think this is more pythonic?

We could also get rid of LoggingCountStats altogether, since it's now just a
special case of CountStat (is_logging == data_collector.pull_function is None).
But I think it's nice to keep the distinction since they behave so differently.
2017-04-14 11:41:07 -07:00
Rishi Gupta b45185562a counts.py: Fix out of date comments. 2017-04-14 11:41:07 -07:00
Rishi Gupta ac2cc9e2da counts.py: Reorganize file into logical sections.
No changes to code or behavior.
2017-04-14 11:41:07 -07:00
Rishi Gupta 50868b98a9 counts.py: Change pull_function to take a property instead of a full stat.
Removes the circular dependency of CountStat containing a DataCollector, and
DataCollector containing a function that takes a CountStat as an argument.
2017-04-14 11:41:07 -07:00
Rishi Gupta eadfc743c8 counts.py: Remove CustomPullCountStat. 2017-04-14 11:41:07 -07:00
Rishi Gupta 118b44d4f0 counts.py: Change DataCollector to take a pull_function argument.
This will allow us to appropriately generalize CountStat to include
LoggingCountStat and CustomPullCountStat. It'll also make life easier when
we introduce DependentCountStat.
2017-04-14 11:41:07 -07:00
Rishi Gupta f9e56ad25d counts.py: Move DataCollector declarations into CountStat declarations.
The previous zerver_* names were unwieldy and not very readable. This also
puts more of the useful information in one place; in particular, makes it
easier to skim a CountStat declaration and see if we're collecting it at a
user/stream granularity or a realm granularity.
2017-04-14 11:41:07 -07:00
Rishi Gupta c20e79ab1f counts.py: Rename DataCollector.analytics_table to output_table. 2017-04-14 11:41:07 -07:00
Rishi Gupta 6369d23633 counts.py: Rename ZerverCountQuery to DataCollector.
Not the final form of DataCollector, but the name change causes a big diff
so separating it out.
2017-04-14 11:41:07 -07:00
Rishi Gupta b3991e2557 counts.py: Move CountStat.group_by into ZerverCountQuery.
Part of a larger refactoring to reduce cyclic dependencies between CountStat
and DataCollector (coming soon).
2017-04-14 11:41:07 -07:00
Rishi Gupta 341e1b54fc counts.py: Remove zerver_table from ZerverCountQuery.
Was only needed for filter_args, which are now gone.
2017-04-14 11:41:07 -07:00
Rishi Gupta 661de6bf25 counts.py: Remove filter_args argument from CountStat definition.
It turned out to not be that useful once we added subgroup. The previous
design of the CountStat object also assumed more reusability of the *_query
strings than what ended up happening.

The filter_args also had some carrying costs:

* It's hard to be confident that filter_args other than the ones explicitly
  in our tests would have had expected behavior.
* The filter_args/join_args system is the most complex part of the CountStat
  object, and makes understanding the *_query strings unnecessarily
  difficult for a new contributor.
2017-04-14 11:41:07 -07:00
Rishi Gupta 4dfadba244 counts.py: Hardcode is_active=true in count_user_by_realm_query.
A step towards removing filter_args from the CountStat object.
2017-04-14 11:41:07 -07:00
Rishi Gupta 6bb97db136 analytics: Add active_users_audit:is_bot:day. 2017-04-14 11:41:07 -07:00
Rishi Gupta cc75d83b74 counts.py: Reorder count_stats_ to put similar stats together. 2017-04-14 11:41:07 -07:00
Rishi Gupta 2f74ccabf9 analytics: Add 15day_actives CountStat. 2017-04-14 11:41:07 -07:00
Rishi Gupta 9b661ca91f analytics: Replace CountStat.is_gauge with interval.
Groundwork for allowing stats like "Monthly Active Users".

CountStat.interval is no longer as clean a value as before, so removed it
from views.get_chart_data. It wasn't being used by the frontend anyway.

Removing interval from logger calls in counts.py is not a big loss since we
now include the frequency (which is typically also the interval) in
CountStat.property.
2017-04-14 11:41:07 -07:00
Rishi Gupta d6c5c672d3 analytics: Add minutes_active CountStat. 2017-04-14 11:41:07 -07:00
hollywoodno dd067c761a analytics: Separate private messages from group private messages.
This makes it possible for our graphs to show the group private
message counts as separate from 1:1 private messages.

Fixes #4102.
2017-03-20 11:46:29 -07:00
Rishi Gupta 7c6f0033ed analytics: Add test for do_drop_all_analytics_tables. 2017-03-14 16:59:54 -07:00
Rishi Gupta 87981a2bf1 analytics: Fix direct import of models in migrations. 2017-03-14 16:59:54 -07:00
Rishi Gupta ebebd04587 analytics: Fix ValueErrors affecting test coverage.
Pathways that only catch internal code errors should use AssertionError so
that they are not included when computing test coverage.
2017-03-14 16:59:54 -07:00
Rishi Gupta b18bfe6771 analytics: Standardize format of zerver count queries.
count_message_type_by_user_query is in a different format (no WHERE clause)
from the rest since I'm having a hard time reasoning about how that would
interact with the LEFT JOIN, especially given that there are %(join_args)s.
2017-03-14 16:59:54 -07:00
Rishi Gupta 8feea6c598 analytics: Add LoggingCountStat for number of users. 2017-03-04 16:46:09 -08:00
Raghav Jajodia a3a03bd6a5 mypy: Added Dict, List and Set imports.
Fixed mypy errors associated with the upgrade.
2017-03-04 14:33:44 -08:00
Rishi Gupta 8bea47d6b5 analytics: Do a stylistic cleanup of TestProcessCountStat. 2017-03-03 16:12:12 -08:00
Rishi Gupta 6c784d6321 analytics: Refactor COUNT_STATS declaration to not repeat itself. 2017-03-03 16:11:28 -08:00
Rishi Gupta 20255e48a4 analytics: Change messages_sent_to_stream to a daily stat.
Analytics database tables are getting big, and so we're likely moving to a
model where ~all stats are day stats, and we keep hourly stats only for the
last N days.

Also changed the name because:
* messages_sent_* suggests the counts (summed over subgroup) should be the
  same as the other messages_sent stats, but they are different (these don't
  include PMs).
* messages_sent_by_stream:is_bot:day is longer than 32 characters, the max
  allowable length for a BaseCount.property.

Includes a database migration to remove the old stat from the analytics
tables.
2017-03-03 16:11:28 -08:00
Rishi Gupta 5eb5fa3f31 analytics: Change time_range to not include current day/hour.
Current day/hour will always be 0, since we haven't computed it yet for the
CountStat tables.
2017-02-02 10:59:52 -08:00
Tim Abbott d6e38e2a5c lint: Clean up E123 PEP-8 rule. 2017-01-23 21:34:26 -08:00
Rishi Gupta 734ca4644c analytics: Add random_seed argument to generate_time_series_data. 2017-01-17 15:54:57 -08:00
Rishi Gupta 37bdc7c010 analytics: Remove COUNT_STATS['messages_sent:hour'].
Having both messages_sent:hour and messages_sent:is_bot:day is confusing,
since a single messages_sent:is_bot:hour would have a superset of the
information and take less total space. This commit and its parent together
replace the two stats with a single messages_sent:is_bot:hour.
2017-01-17 15:54:57 -08:00
Rishi Gupta b593ac9d7c analytics: Change messages_sent:is_bot to hourly frequency.
In preparation for replacing messages_sent.
2017-01-17 15:54:57 -08:00
Rishi Gupta 68fcb4152f analytics: Remove interval field from *Count tables.
Includes a database migration. The interval field was originally there to
facilitate time aggregation (e.g. aggregate_hour_to_day), but we now do such
aggregations in views code or in the frontend.
2017-01-17 15:54:57 -08:00
Rishi Gupta a8f2ebb443 analytics: Include interval in COUNT_STATS property names. 2017-01-17 15:54:57 -08:00
Rishi Gupta 12d277d4f4 analytics: Change messages_sent:client stat to daily frequency.
A few reasons:
* Our two other subgroup'd message stats in UserCount are at CountStat.DAY
  frequency (messages_sent:is_bot and messages_sent:message_type).
* Keeping this stat at hourly frequency would likely double the size of our
  analytics table, given the current stats. (Counterpoint: if there are
  roughly as many active streams as active users, and we keep
  messages_sent_to_stream:is_bot at hourly frequency, then maybe this stat
  is only a 30% or 50% increase).
* We're currently only showing this on the frontend as a pie chart anyway.
2017-01-17 15:54:57 -08:00
Rishi Gupta 2710a944e8 analytics: Refactor fixture creation to make it more general.
Also less verbose, in preparation for adding a bunch more fixtures.
2017-01-17 15:54:57 -08:00
Rishi Gupta 680e7f75e1 analytics: Change generate_time_series_data argument from length to days.
Previously, this function seemed ambivalent about whether it was generating
a series of abstract data points or a series of data points that would
correspond to times. Switch firmly to the latter, so e.g. if the frequency
changes, so will the length of the output sequence.
2017-01-17 15:54:57 -08:00
Rishi Gupta 3712fda30d analytics: Ensure fixture data points are non-negative. 2017-01-17 15:54:57 -08:00
Rishi Gupta 3f2a002c6e analytics/lib/counts.py: Fix one of the COUNT_STATS definitions.
Fixes an error in the definition of
COUNT_STATS['messages_sent_to_stream:is_bot']. The CountStat needs a
group_by argument since it is supposed to group by UserProfile.is_bot.
2017-01-10 20:41:07 -08:00
Rishi Gupta 977f5b9178 analytics/lib/counts.py: Fix error in count_message_type_by_user_query.
This query counts the number of messages each user has sent, subgroup'd by
whether the message was a private_message (PM or sent to a huddle), sent to
a 'private_stream', or sent to a 'public_stream'.

We need to join on zerver_stream to find out whether stream messages were
sent to public streams or private streams, but it needs to be a LEFT JOIN
rather than a JOIN so that we preserve the messages sent to non-streams.
2017-01-10 20:41:07 -08:00
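
The JOIN versus LEFT JOIN point in the commit above can be reproduced with a toy example. The schema below is a heavily simplified assumption (a message row points directly at a stream, with NULL meaning a non-stream message) and is not the real zerver schema; it uses an in-memory SQLite database only so that the sketch is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stream (id INTEGER PRIMARY KEY, invite_only INTEGER);
    -- In this toy schema, stream_id is NULL for messages not sent to a stream.
    CREATE TABLE message (id INTEGER PRIMARY KEY, sender_id INTEGER, stream_id INTEGER);
    INSERT INTO stream VALUES (1, 0), (2, 1);
    INSERT INTO message VALUES (1, 10, 1), (2, 10, 2), (3, 10, NULL);
    """
)

QUERY = """
    SELECT
        CASE
            WHEN stream.id IS NULL THEN 'private_message'
            WHEN stream.invite_only = 1 THEN 'private_stream'
            ELSE 'public_stream'
        END AS subgroup,
        COUNT(*)
    FROM message
    {join} stream ON message.stream_id = stream.id
    GROUP BY subgroup
    ORDER BY subgroup
"""

# An inner JOIN silently drops message 3, which has no matching stream row.
inner_rows = conn.execute(QUERY.format(join="JOIN")).fetchall()
# A LEFT JOIN keeps it, with NULLs in the stream columns.
left_rows = conn.execute(QUERY.format(join="LEFT JOIN")).fetchall()
```

Here inner_rows contains only the two stream subgroups, while left_rows also contains a private_message row.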
Rishi Gupta 6374596a77 analytics: Add initial fixture for testing views. 2017-01-10 17:48:07 -08:00
Rishi Gupta 552d626ef2 analytics: Fix FillState.last_modified not being updated.
We were updating FillState with FillState.objects.filter(..).update(..),
which does not update the last_modified field (which has auto_now=True).
The correct incantation is the save() method of the actual FillState
object.
2017-01-08 23:36:34 -08:00
Rishi Gupta 190d320afa analytics: Change CountStat.property from Text to str. 2017-01-08 17:24:51 -08:00
Rishi Gupta f8962d521d analytics: Fix uses of 'interval' in arguments and variable names.
interval refers to a time interval, and frequency refers to something that
semantically means something closer to 'hourly' or 'daily'.

Currently, interval can have values 'hour', 'day', or 'gauge', and frequency
can only have values 'hour' and 'day'.
2017-01-08 17:24:51 -08:00
Rishi Gupta f5899dd14b analytics: Add lib/ function to drop all analytics tables. 2017-01-08 17:24:51 -08:00
Rishi Gupta 73dc904e9c analytics: Move time_range from views.py to lib/time_utils.py 2017-01-08 17:24:51 -08:00
Rishi Gupta 2211b8b102 analytics: Change count_message_by_stream to join on UserProfile.
It seems unlikely we will need count_message_by_stream without the
UserProfile table in the future, so write count_message_by_stream_and_is_bot
in the usual query form and replace count_message_by_stream with it.
This also has the benefit of shortening our list of "special case" queries
from two to one.

The pathways of the removed test will be covered more thoroughly in the new
TestCountStats tests.
2016-12-20 12:03:23 -08:00
Rishi Gupta 6992f9784c analytics: Update TestCountStat prototype. 2016-12-20 12:03:23 -08:00
Rishi Gupta 93a10a475a counts.py: Fix count_message_type_by_user_query. 2016-12-15 16:02:12 -08:00
Rishi Gupta 4f3e1b2ece analytics/lib/counts.py: Fix messages_sent_to_stream:is_bot.
Adds a new query.
2016-12-15 16:02:12 -08:00
Rishi Gupta 87b47ec283 analytics: Add __unicode__ method to the CountStat object. 2016-12-15 16:02:12 -08:00
anirudhjain75 beaa62cafa mypy: Convert several directories to use typing.Text.
Specifically, these directories are converted: [analytics/, scripts/,
tools/, zerver/management/, zilencer/, zproject/]
2016-12-07 20:51:05 -08:00
nikolay abc2ff4a06 pep8: Fix many rule E128 violations.
[Tweaked by tabbott to adjust some approaches used in wrapping]
2016-12-03 13:33:31 -08:00
bulat22101 adebc75740 pep8: Fix E502 violations 2016-12-03 10:56:36 -08:00
AZtheAsian 1ba150fa85 pep8: Fix E203 violations 2016-12-01 20:37:57 -08:00
Rafid Aslam c5316b4002 lint: Fix E127 pep8 violations.
Fix pep8: E127 continuation line over-indented for visual indent
style issue.
2016-12-01 10:23:55 -08:00
umkay dc8463e09c analytics: Remove incorrect filter args for stat.
The filter args dictionary applies to the X table in a count X by Y query,
which in this case is the zerver_message table. This stat had an incorrect set
of arguments meant for the zerver_userprofile table.
2016-11-10 12:25:21 -08:00
umkay e6ac8c3543 analytics: Add extra count stats.
Fill in remaining CountStats in counts.py for our intended use cases.
2016-11-03 16:50:39 -07:00
umkay 298890d125 analytics: Rename count stats and associated properties.
Our current naming convention is getting unwieldy. The subgroup now goes
on the right side of the colon.
2016-11-03 16:50:39 -07:00
umkay 5490442580 analytics: Replace all joins in raw SQL with natural joins.
We alter the behavior of our queries to no longer write rows with 0 counts
to the db, and pad with 0s in the related views code. As a result we are
also able to combine the where and join clause conditions in the sql
queries. This new behavior is also updated in our tests.
2016-11-03 16:50:39 -07:00
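
Since rows with 0 counts are no longer written, the code reading the tables has to fill the gaps. A minimal sketch of such padding follows; the function name pad_with_zeros and its signature are hypothetical, not the actual views code.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def pad_with_zeros(
    counts: Dict[datetime, int], start: datetime, end: datetime, step: timedelta
) -> List[int]:
    """Return one value per time bucket in [start, end], using 0 where no row exists."""
    values = []
    current = start
    while current <= end:
        # A missing key means the query wrote no row for this bucket.
        values.append(counts.get(current, 0))
        current += step
    return values
```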
umkay 5e5a0d4db9 analytics: Add user-level count query for messages sent to {PMs, streams}.
Adds a count_X_by_Y_query to counts.py, similar in spirit to a
count_recipient_by_user query, where we would join on the Message,
Recipient, and UserProfile table. Here, we also join on the Stream table in
order to distinguish private and public streams, and we merge the counts for
PM and Huddle type messages into a single subgroup.
2016-11-01 17:00:43 -07:00
umkay 610e92b94e analytics: Add subgroup column to analytics tables.
This is a major change to the analytics schema, and is the first step in a
number of refactorings and performance improvements. For instance, it allows

* Grouping sets of similar CountStats in the *Count tables. For instance,
  active{_humans,_bots} will now have the same property, but have different
  subgroup values.

* Combining queries that differ only in their value on 1 filter clause, so
  that we make fewer passes through the zerver tables. For instance, instead
  of running a query for each of messages_sent_to_public_streams and
  messages_sent_to_private_streams, we can now run a single query with a
  group by on Stream.invite_only, and store the group by value in the
  subgroup column.
2016-10-27 16:33:58 -07:00
Rishi Gupta 54016e1096 analytics: Remove outdated comment in counts.py. 2016-10-25 13:42:55 -07:00
umkay 87d22c9e4d analytics: Fix count_stream_by_realm.
Add a join clause on zerver_message in count_stream_by_realm,
otherwise we only output the final total streamcount for a realm
for every time entry.
2016-10-22 19:10:36 -07:00
umkay 906a4e3b26 analytics: Add performance and transaction logging to counts.py.
For each database query made by an analytics function, log time spent and
the number of rows changed to var/logs/analytics.log.
In the spirit of write-ahead logging, for each (stat, end_time)
update, log the start and end of the "transaction", as well as time
spent.
2016-10-17 16:10:03 -07:00
Rishi Gupta 82b814a1cd analytics: Simplify frequency and measurement interval options.
Change the CountStat object to take an is_gauge variable instead of a
smallest_interval variable. Previously, (smallest_interval, frequency)
could be any of (hour, hour), (hour, day), (hour, gauge), (day, hour),
(day, day), or (day, gauge).
The current change is equivalent to excluding (hour, day) and (day, hour)
from the list above.

This change, along with other recent changes, allows us to simplify how we
handle time intervals. This commit also removes the TimeInterval object.
2016-10-14 10:18:37 -07:00
Rishi Gupta 807520411b analytics: Simplify logic in do_fill_count_stat_at_hour.
Adding FillState, removing do_aggregate_hour_to_day, and disallowing unused
(interval, frequency) pairs removes the need for the nested for loops in
do_fill_count_stat_at_hour. This commit replaces that control flow with a
simpler equivalent.
2016-10-14 10:18:37 -07:00
Rishi Gupta 27d1360e1d analytics: Remove do_aggregate_hour_to_day.
The functionality provided is more naturally done in the views code. It also
allows us to aggregate using day boundaries from the local timezone, rather
than UTC.
2016-10-14 10:18:37 -07:00
Rishi Gupta 655ee51e35 analytics: Add table to keep track of fill state.
Adds two simplifying assumptions to how we process analytics stats:
* Sets the atomic unit of work to: a stat processed at an hour boundary.
* For any given stat, only allows these atomic units of work to be processed
  in chronological order.

Adds a table FillState that, for each stat, keeps track of the last unit of
work that was processed.
2016-10-14 10:18:37 -07:00
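
The two assumptions in the commit above (one unit of work per stat per hour boundary, processed strictly in chronological order) suggest a loop like the one below. This is a sketch only: process_stat, first_hour and do_fill are hypothetical names, and the returned value stands in for what would be saved to the FillState table.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


def process_stat(
    last_filled: Optional[datetime],
    first_hour: datetime,
    fill_to_time: datetime,
    do_fill: Callable[[datetime], None],
) -> Optional[datetime]:
    """Process every hour boundary after last_filled, oldest first.

    Returns the new fill state: the last boundary that was processed.
    """
    one_hour = timedelta(hours=1)
    current = first_hour if last_filled is None else last_filled + one_hour
    while current <= fill_to_time:
        do_fill(current)  # one atomic unit of work
        last_filled = current  # stand-in for saving a FillState row
        current += one_hour
    return last_filled
```

Because only the last processed boundary is tracked, an interrupted run can resume from the stored value without redoing or skipping any hour.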
umkay 721529b782 analytics: Remove HuddleCount for now.
Planned changes to the underlying analytics model will require potentially
complicated changes to huddle queries.
2016-10-14 10:18:37 -07:00
umkay 7e2340155d analytics: Fix aggregation to RealmCount for realms with no users.
Previously, if a Realm had no users (or no streams),
do_aggregate_to_summary_table would fail to add a row with value 0. This
commit fixes the issue and also simplifies the do_aggregate_to_summary_table
logic.
2016-10-11 18:20:58 -07:00
umkay 01324f2afe Fix aggregation to analytics summary tables.
There are a number of different stats that need to be propagated from
UserCount and StreamCount to RealmCount, and from RealmCount to
InstallationCount. Stats with hour intervals also need to have their day
values propagated. This commit fixes a bug in the summary table aggregation
logic so that for a given interval on a CountStat object we pull the correct
counts for the interval as well as do the day aggregation if required. We also
ensure that any aggregation then done from the RealmCount table to the
InstallationCount table follows the same aggregation logic for intervals.
2016-10-06 08:46:33 -07:00
umkay d260a22637 Add a new statistics/analytics framework.
This is a first pass at building a framework for collecting various
stats about realms, users, streams, etc. Includes:
* New analytics tables for storing counts data
* Raw SQL queries for pulling data from zerver/models.py tables
* Aggregation functions for aggregating hourly stats into daily stats, and
  aggregating user/stream level stats into realm level stats
* A management command for pulling the data

Note that counts.py was added to the linter exclude list due to errors
around %%s.
2016-10-04 17:18:54 -07:00