Commit Graph

16 Commits (main)

Author SHA1 Message Date
Nick Khyl 1ccece0f78 util/eventbus: use unbounded event queues for DeliveredEvents in subscribers
Bounded DeliveredEvent queues reduce memory usage, but they can deadlock under load.
Two common scenarios trigger deadlocks when the number of events published in a short
period exceeds twice the queue capacity (there's a PublishedEvent queue of the same size):
 - a subscriber tries to acquire the same mutex as held by a publisher, or
 - a subscriber for A events publishes B events

Avoiding these scenarios is not practical and would limit eventbus usefulness and reduce its adoption,
pushing us back to callbacks and other legacy mechanisms. These deadlocks already occurred in customer
devices, dev machines, and tests. They also make it harder to identify and fix slow subscribers and similar
issues we have been seeing recently.

Choosing an arbitrary large fixed queue capacity would only mask the problem. A client running
on a sufficiently large and complex customer environment can exceed any meaningful constant limit,
since event volume depends on the number of peers and other factors. Behavior also changes
based on scheduling of publishers and subscribers by the Go runtime, OS, and hardware, as the issue
is essentially a race between publishers and subscribers. Additionally, on lower-end devices,
an unreasonably high constant capacity is practically the same as using unbounded queues.

Therefore, this PR changes the event queue implementation to be unbounded by default.
The PublishedEvent queue keeps its existing capacity of 16 items, while subscribers'
DeliveredEvent queues become unbounded.

This change fixes known deadlocks and makes the system stable under load,
at the cost of higher potential memory usage, including cases where a queue grows
during an event burst and does not shrink when load decreases.

Further improvements can be implemented in the future as needed.

Fixes #17973
Fixes #18012

Signed-off-by: Nick Khyl <nickk@tailscale.com>
2 weeks ago
Brad Fitzpatrick 6ac4356bce util/eventbus: simplify some reflect in Bus.pump
Updates #cleanup

Change-Id: Ib7b497e22c6cdd80578c69cf728d45754e6f909e
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2 weeks ago
Brad Fitzpatrick 99b06eac49 syncs: add Mutex/RWMutex alias/wrappers for future mutex debugging
Updates #17852

Change-Id: I477340fb8e40686870e981ade11cd61597c34a20
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2 weeks ago
M. J. Fromberger 061e6266cf
util/eventbus: allow logging of slow subscribers (#17705)
Add options to the eventbus.Bus to plumb in a logger.

Route that logger in to the subscriber machinery, and trigger a log message to
it when a subscriber fails to respond to its delivered events for 5s or more.

The log message includes the package, filename, and line number of the call
site that created the subscription.

Add tests that verify this works.

Updates #17680

Change-Id: I0546516476b1e13e6a9cf79f19db2fe55e56c698
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
1 month ago
Brad Fitzpatrick a40f23ad4a util/eventbus: flesh out docs a bit
Updates #cleanup

Change-Id: Ia6b0e4b0426be1dd10a777aff0a81d4dd6b69b01
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2 months ago
Claus Lensbøl 53f67c4396
util/eventbus: fix docstrings (#16401)
Updates #15160

Signed-off-by: Claus Lensbøl <claus@tailscale.com>
5 months ago
David Anderson d83024a63f util/eventbus: add a debug HTTP handler for the bus
Updates #15160

Signed-off-by: David Anderson <dave@tailscale.com>
9 months ago
David Anderson e71e95b841 util/eventbus: don't allow publishers to skip events while debugging
If any debugging hook might see an event, Publisher.ShouldPublish should
tell its caller to publish even if there are no ordinary subscribers.

Updates #15160

Signed-off-by: David Anderson <dave@tailscale.com>
9 months ago
David Anderson 853abf8661 util/eventbus: initial debugging facilities for the event bus
Enables monitoring events as they flow, listing bus clients, and
snapshotting internal queues to troubleshoot stalls.

Updates #15160

Signed-off-by: David Anderson <dave@tailscale.com>
9 months ago
David Anderson e80d2b4ad1 util/eventbus: add debug hooks to snoop on bus traffic
Updates #15160

Signed-off-by: David Anderson <dave@tailscale.com>
9 months ago
David Anderson cf5c788cf1 util/eventbus: track additional event context in subscribe queue
Updates #15160

Signed-off-by: David Anderson <dave@tailscale.com>
9 months ago
David Anderson a1192dd686 util/eventbus: track additional event context in publish queue
Updates #15160

Signed-off-by: David Anderson <dave@tailscale.com>
9 months ago
David Anderson bf40bc4fa0 util/eventbus: make internal queue a generic type
In preparation for making the queues carry additional event metadata.

Updates #15160

Signed-off-by: David Anderson <dave@tailscale.com>
9 months ago
David Anderson 24d4846f00 util/eventbus: adjust worker goroutine management helpers
This makes the helpers closer in behavior to cancelable contexts
and taskgroup.Single, and makes the worker code use a more normal
and easier to reason about context.Context for shutdown.

Updates #15160

Signed-off-by: David Anderson <dave@tailscale.com>
9 months ago
David Anderson 3e18434595 util/eventbus: rework to have a Client abstraction
The Client carries both publishers and subscribers for a single
actor. This makes the APIs for publish and subscribe look more
similar, and this structure is a better fit for upcoming debug
facilities.

Updates #15160

Signed-off-by: David Anderson <dave@tailscale.com>
9 months ago
David Anderson ef906763ee util/eventbus: initial implementation of an in-process event bus
Updates #15160

Signed-off-by: David Anderson <dave@tailscale.com>
Co-authored-by: M. J. Fromberger <fromberger@tailscale.com>
9 months ago