zulip/scripts/nagios/check-rabbitmq-consumers

77 lines
2.2 KiB
Plaintext
Raw Normal View History

#!/usr/bin/env python3
import argparse
import os
import subprocess
import sys
import time
from collections import defaultdict
from typing import Dict
ZULIP_PATH = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.append(ZULIP_PATH)
from scripts.lib.check_rabbitmq_queue import normal_queues
from scripts.lib.zulip_tools import get_config_file, get_tornado_ports
states = {
0: "OK",
1: "WARNING",
2: "CRITICAL",
3: "UNKNOWN",
}
if "USER" in os.environ and not os.environ["USER"] in ["root", "rabbitmq"]:
print("This script must be run as the root or rabbitmq user")
usage = """Usage: check-rabbitmq-consumers --queue=[queue-name] --min-threshold=[min-threshold]"""
parser = argparse.ArgumentParser(usage=usage)
parser.add_argument("--min-threshold", dest="min_count", type=int, default=1)
options = parser.parse_args()
config_file = get_config_file()
TORNADO_PROCESSES = len(get_tornado_ports(config_file))
output = subprocess.check_output(["/usr/sbin/rabbitmqctl", "list_consumers"], text=True)
consumers: Dict[str, int] = defaultdict(int)
queues = {
*normal_queues,
# These queues may not be present if settings.TORNADO_PROCESSES > 1
"notify_tornado",
}
for queue_name in queues:
queue_name = queue_name.strip()
consumers[queue_name] = 0
for line in output.split("\n"):
parts = line.split("\t")
if len(parts) >= 2:
queue_name = parts[0]
if queue_name.startswith("notify_tornado_"):
queue_name = "notify_tornado"
consumers[queue_name] += 1
now = int(time.time())
for queue_name in consumers.keys():
state_file_path = "/var/lib/nagios_state/check-rabbitmq-consumers-" + queue_name
state_file_tmp = state_file_path + "-tmp"
target_count = options.min_count
dependencies: Remove WebSockets system for sending messages. Zulip has had a small use of WebSockets (specifically, for the code path of sending messages, via the webapp only) since ~2013. We originally added this use of WebSockets in the hope that the latency benefits of doing so would allow us to avoid implementing a markdown local echo; they were not. Further, HTTP/2 may have eliminated the latency difference we hoped to exploit by using WebSockets in any case. While we’d originally imagined using WebSockets for other endpoints, there was never a good justification for moving more components to the WebSockets system. This WebSockets code path had a lot of downsides/complexity, including: * The messy hack involving constructing an emulated request object to hook into doing Django requests. * The `message_senders` queue processor system, which increases RAM needs and must be provisioned independently from the rest of the server). * A duplicate check_send_receive_time Nagios test specific to WebSockets. * The requirement for users to have their firewalls/NATs allow WebSocket connections, and a setting to disable them for networks where WebSockets don’t work. * Dependencies on the SockJS family of libraries, which has at times been poorly maintained, and periodically throws random JavaScript exceptions in our production environments without a deep enough traceback to effectively investigate. * A total of about 1600 lines of our code related to the feature. * Increased load on the Tornado system, especially around a Zulip server restart, and especially for large installations like zulipchat.com, resulting in extra delay before messages can be sent again. As detailed in https://github.com/zulip/zulip/pull/12862#issuecomment-536152397, it appears that removing WebSockets moderately increases the time it takes for the `send_message` API query to return from the server, but does not significantly change the time between when a message is sent and when it is received by clients. We don’t understand the reason for that change (suggesting the possibility of a measurement error), and even if it is a real change, we consider that potential small latency regression to be acceptable. If we later want WebSockets, we’ll likely want to just use Django Channels. Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
2019-07-23 01:43:40 +02:00
if queue_name == "notify_tornado":
target_count = TORNADO_PROCESSES
if consumers[queue_name] < target_count:
status = 2
else:
status = 0
with open(state_file_tmp, "w") as f:
f.write(
f"{now}|{status}|{states[status]}|queue {queue_name} has {consumers[queue_name]} consumers, needs {target_count}\n"
)
os.rename(state_file_tmp, state_file_path)