Commit Graph

49 Commits (cmol/resolveconf_trample_trample_back)

Author SHA1 Message Date
Nick Khyl 1ccece0f78 util/eventbus: use unbounded event queues for DeliveredEvents in subscribers
Bounded DeliveredEvent queues reduce memory usage, but they can deadlock under load.
Two common scenarios trigger deadlocks when the number of events published in a short
period exceeds twice the queue capacity (there's a PublishedEvent queue of the same size):
 - a subscriber tries to acquire the same mutex as held by a publisher, or
 - a subscriber for A events publishes B events

Avoiding these scenarios is not practical and would limit eventbus usefulness and reduce its adoption,
pushing us back to callbacks and other legacy mechanisms. These deadlocks already occurred in customer
devices, dev machines, and tests. They also make it harder to identify and fix slow subscribers and similar
issues we have been seeing recently.

Choosing an arbitrary large fixed queue capacity would only mask the problem. A client running
on a sufficiently large and complex customer environment can exceed any meaningful constant limit,
since event volume depends on the number of peers and other factors. Behavior also changes
based on scheduling of publishers and subscribers by the Go runtime, OS, and hardware, as the issue
is essentially a race between publishers and subscribers. Additionally, on lower-end devices,
an unreasonably high constant capacity is practically the same as using unbounded queues.

Therefore, this PR changes the event queue implementation to be unbounded by default.
The PublishedEvent queue keeps its existing capacity of 16 items, while subscribers'
DeliveredEvent queues become unbounded.

This change fixes known deadlocks and makes the system stable under load,
at the cost of higher potential memory usage, including cases where a queue grows
during an event burst and does not shrink when load decreases.

Further improvements can be implemented in the future as needed.

Fixes #17973
Fixes #18012

Signed-off-by: Nick Khyl <nickk@tailscale.com>
2 weeks ago
Nick Khyl 3780f25d51 util/eventbus: add tests for a subscriber publishing events
As of 2025-11-20, publishing more events than the eventbus's
internal queues can hold may deadlock if a subscriber tries
to publish events itself.

This commit adds a test that demonstrates this deadlock,
and skips it until the bug is fixed.

Updates #18012

Signed-off-by: Nick Khyl <nickk@tailscale.com>
2 weeks ago
Nick Khyl 016ccae2da util/eventbus: add tests for a subscriber trying to acquire the same mutex as a publisher
As of 2025-11-20, publishing more events than the eventbus's
internal queues can hold may deadlock if a subscriber tries
to acquire a mutex that can also be held by a publisher.

This commit adds a test that demonstrates this deadlock,
and skips it until the bug is fixed.

Updates #17973

Signed-off-by: Nick Khyl <nickk@tailscale.com>
2 weeks ago
Brad Fitzpatrick 6ac4356bce util/eventbus: simplify some reflect in Bus.pump
Updates #cleanup

Change-Id: Ib7b497e22c6cdd80578c69cf728d45754e6f909e
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2 weeks ago
Brad Fitzpatrick 99b06eac49 syncs: add Mutex/RWMutex alias/wrappers for future mutex debugging
Updates #17852

Change-Id: I477340fb8e40686870e981ade11cd61597c34a20
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2 weeks ago
Brad Fitzpatrick 1eba5b0cbd util/eventbus: log goroutine stacks when hung in CI
Updates #17680

Change-Id: Ie48dc2d64b7583d68578a28af52f6926f903ca4f
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
3 weeks ago
M. J. Fromberger 4c856078e4
util/eventbus: block for the subscriber during SubscribeFunc close (#17642)
Prior to this change a SubscriberFunc treated the call to the subscriber's
function as the completion of delivery. But that means when we are closing the
subscriber, that callback could continue to execute for some time after the
close returns.

For channel-based subscribers that works OK because the close takes effect
before the subscriber ever sees the event. To make the two subscriber types
symmetric, we should also wait for the callback to finish before returning.
This ensures that a Close of the client means the same thing with both kinds of
subscriber.

Updates #17638

Change-Id: I82fd31bcaa4e92fab07981ac0e57e6e3a7d9d60b
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
1 month ago
M. J. Fromberger 061e6266cf
util/eventbus: allow logging of slow subscribers (#17705)
Add options to the eventbus.Bus to plumb in a logger.

Route that logger in to the subscriber machinery, and trigger a log message to
it when a subscriber fails to respond to its delivered events for 5s or more.

The log message includes the package, filename, and line number of the call
site that created the subscription.

Add tests that verify this works.

Updates #17680

Change-Id: I0546516476b1e13e6a9cf79f19db2fe55e56c698
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
1 month ago
Claus Lensbøl 005e264b54
util/eventbus/eventbustest: add support for synctest instead of timers (#17522)
Before synctest, timers was needed to allow the events to flow into the
test bus. There is still a timer, but this one is not derived from the
test deadline and it is mostly arbitrary as synctest will render it
practically non-existent.

With this approach, tests that do not need to test for the absence of
events do not rely on synctest.

Updates #15160

Signed-off-by: Claus Lensbøl <claus@tailscale.com>
2 months ago
M. J. Fromberger 0a33aae823
util/eventbus: run subscriber functions in a goroutine (#17510)
With a channel subscriber, the subscription processing always occurs on another
goroutine. The SubscriberFunc (prior to this commit) runs its callbacks on the
client's own goroutine. This changes the semantics, though: In addition to more
directly pushing back on the publisher, a publisher and subscriber can deadlock
in a SubscriberFunc but succeed on a Subscriber. They should behave
equivalently regardless which interface they use.

Arguably the caller should deal with this by creating its own goroutine if it
needs to. However, that loses much of the benefit of the SubscriberFunc API, as
it will need to manage the lifecycle of that goroutine. So, for practical
ergonomics, let's make the SubscriberFunc do this management on the user's
behalf. (We discussed doing this in #17432, but decided not to do it yet).  We
can optimize this approach further, if we need to, without changing the API.

Updates #17487

Change-Id: I19ea9e8f246f7b406711f5a16518ef7ff21a1ac9
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
2 months ago
M. J. Fromberger ad6cf2f8f3
util/eventbus: add a function-based subscriber type (#17432)
Originally proposed by @bradfitz in #17413.

In practice, a lot of subscribers have only one event type of interest, or a
small number of mostly independent ones. In that case, the overhead of running
and maintaining a goroutine to select on multiple channels winds up being more
noisy than we'd like for the user of the API.

For this common case, add a new SubscriberFunc[T] type that delivers events to
a callback owned by the subscriber, directly on the goroutine belonging to the
client itself. This frees the consumer from the need to maintain their own
goroutine to pull events from the channel, and to watch for closure of the
subscriber.

Before:

     s := eventbus.Subscribe[T](eventClient)
     go func() {
       for {
          select {
          case <-s.Done():
            return
          case e := <-s.Events():
            doSomethingWith(e)
          }
       }
     }()
     // ...
     s.Close()

After:

     func doSomethingWithT(e T) { ... }
     s := eventbus.SubscribeFunc(eventClient, doSomethingWithT)
     // ...
     s.Close()

Moreover, unless the caller wants to explicitly stop the subscriber separately
from its governing client, it need not capture the SubscriberFunc value at all.

One downside of this approach is that a slow or deadlocked callback could block
client's service routine and thus stall all other subscriptions on that client,
However, this can already happen more broadly if a subscriber fails to service
its delivery channel in a timely manner, it just feeds back more immediately.

Updates #17487

Change-Id: I64592d786005177aa9fd445c263178ed415784d5
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
2 months ago
Brad Fitzpatrick 9386a101d8 cmd/tailscaled, ipn/localapi, util/eventbus: don't link in regexp when debug is omitted
Saves 442 KB. Lock it with a new min test.

Updates #12614

Change-Id: Ia7bf6f797b6cbf08ea65419ade2f359d390f8e91
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2 months ago
Brad Fitzpatrick be6cfa00cb util/eventbus: when ts_omit_debugeventbus is set, don't import tsweb
I'm trying to remove the "regexp" and "regexp/syntax" packages from
our minimal builds. But tsweb pulls in regexp (via net/http/pprof etc)
and util/eventbus was importing the tsweb for no reason.

Updates #12614

Change-Id: Ifa8c371ece348f1dbf80d6b251381f3ed39d5fbd
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2 months ago
Brad Fitzpatrick a40f23ad4a util/eventbus: flesh out docs a bit
Updates #cleanup

Change-Id: Ia6b0e4b0426be1dd10a777aff0a81d4dd6b69b01
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2 months ago
M. J. Fromberger df747f1c1b
util/eventbus: add a Done method to the Monitor type (#17263)
Some systems need to tell whether the monitored goroutine has finished
alongside other channel operations (notably in this case the relay server, but
there seem likely to be others similarly situated).

Updates #15160

Change-Id: I5f0f3fae827b07f9b7102a3b08f60cda9737fe28
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
2 months ago
M. J. Fromberger e59fbaab64
util/eventbus: give a nicer error when attempting to use a closed client (#17208)
It is a programming error to Publish or Subscribe on a closed Client, but now
the way you discover that is by getting a panic from down in the machinery of
the bus after the client state has been cleaned up.

To provide a more helpful error, let's panic explicitly when that happens and
say what went wrong ("the client is closed"), by preventing subscriptions from
interleaving with closure of the client. With this change, either an attachment
fails outright (because the client is already closed) or completes and then
shuts down in good order in the normal course.

This does not change the semantics of the client, publishers, or subscribers,
it's just making the failure more eager so we can attach explanatory text.

Updates #15160

Change-Id: Ia492f4c1dea7535aec2cdcc2e5ea5410ed5218d2
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
2 months ago
M. J. Fromberger ca9d795006
util/eventbus: add a Monitor type to manage subscriber goroutines (#17127)
A common pattern in event bus usage is to run a goroutine to service a
collection of subscribers on a single bus client. To have an orderly shutdown,
however, we need a way to wait for such a goroutine to be finished.

This commit adds a Monitor type that makes this pattern easier to wire up:
rather than having to track all the subscribers and an extra channel, the
component need only track the client and the monitor.  For example:

   cli := bus.Client("example")
   m := cli.Monitor(func(c *eventbus.Client) {
     s1 := eventbus.Subscribe[T](cli)
     s2 := eventbus.Subscribe[U](cli)
     for {
       select {
       case <-c.Done():
         return
       case t := <-s1.Events():
          processT(t)
       case u := <-s2.Events():
          processU(u)
       }
     }
   })

To shut down the client and wait for the goroutine, the caller can write:

   m.Close()

which closes cli and waits for the goroutine to finish. Or, separately:

   cli.Close()
   // do other stuff
   m.Wait()

While the goroutine management is not explicitly tied to subscriptions, it is a
common enough pattern that this seems like a useful simplification in use.

Updates #15160

Change-Id: I657afda1cfaf03465a9dce1336e9fd518a968bca
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
3 months ago
Claus Lensbøl 009d702adf
health: remove direct callback and replace with eventbus (#17199)
Pulls out the last callback logic and ensures timers are still running.

The eventbustest package is updated support the absence of events.

Updates #15160

Signed-off-by: Claus Lensbøl <claus@tailscale.com>
3 months ago
Brad Fitzpatrick d559a21418 util/eventbus/eventbustest: fix typo of test name
And another case of the same typo in a comment elsewhere.

Updates #cleanup

Change-Id: Iaa9d865a1cf83318d4a30263c691451b5d708c9c
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
3 months ago
M. J. Fromberger fc9a74a405
util/eventbus: fix flakes in eventbustest tests (#17198)
When tests run in parallel, events from multiple tests on the same bus can
intercede with each other. This is working as intended, but for the test cases
we want to control exactly what goes through the bus.

To fix that, allocate a fresh bus for each subtest.

Fixes #17197

Change-Id: I53f285ebed8da82e72a2ed136a61884667ef9a5e
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
3 months ago
M. J. Fromberger 4f211ea5c5
util/eventbus: add a LogAllEvents helper for testing (#17187)
When developing (and debugging) tests, it is useful to be able to see all the
traffic that transits the event bus during the execution of a test.

Updates #15160

Change-Id: I929aee62ccf13bdd4bd07d786924ce9a74acd17a
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
3 months ago
M. J. Fromberger 6992f958fc
util/eventbus: add an EqualTo helper for testing (#17178)
For a common case of events being simple struct types with some exported
fields, add a helper to check (reflectively) for equal values using cmp.Diff so
that a failed comparison gives a useful diff in the test output.

More complex uses will still want to provide their own comparisons; this
(intentionally) does not export diff options or other hooks from the cmp
package.

Updates #15160

Change-Id: I86bee1771cad7debd9e3491aa6713afe6fd577a6
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
3 months ago
M. J. Fromberger 48029a897d
util/eventbus: allow test expectations reporting only an error (#17146)
Extend the Expect method of a Watcher to allow filter functions that report
only an error value, and which "pass" when the reported error is nil.

Updates #15160

Change-Id: I582d804554bd1066a9e499c1f3992d068c9e8148
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
3 months ago
Claus Lensbøl 2015ce4081
health,ipn/ipnlocal: introduce eventbus in heath.Tracker (#17085)
The Tracker was using direct callbacks to ipnlocal. This PR moves those
to be triggered via the eventbus.

Additionally, the eventbus is now closed on exit from tailscaled
explicitly, and health is now a SubSystem in tsd.

Updates #15160

Signed-off-by: Claus Lensbøl <claus@tailscale.com>
3 months ago
M. J. Fromberger 5b5ae2b2ee
util/eventbus: add a Done channel to the Client (#17118)
Subscribers already have a Done channel that the caller can use to detect when
the subscriber has been closed. Typically this happens when the governing
Client closes, which in turn is typically because the Bus closed.

But clients and subscribers can stop at other times too, and a caller has no
good way to tell the difference between "this subscriber closed but the rest
are OK" and "the client closed and all these subscribers are finished".

We've worked around this in practice by knowing the closure of one subscriber
implies the fate of the rest, but we can do better: Add a Done method to the
Client that allows us to tell when that has been closed explicitly, after all
the publishers and subscribers associated with that client have been closed.
This allows the caller to be sure that, by the time that occurs, no further
pending events are forthcoming on that client.

Updates #15160

Change-Id: Id601a79ba043365ecdb47dd035f1fdadd984f303
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
3 months ago
Claus Lensbøl b816fd7117
control/controlclient: introduce eventbus messages instead of callbacks (#16956)
This is a small introduction of the eventbus into controlclient that
communicates with mainly ipnlocal. While ipnlocal is a complicated part
of the codebase, the subscribers here are from the perspective of
ipnlocal already called async.

Updates #15160

Signed-off-by: Claus Lensbøl <claus@tailscale.com>
3 months ago
Brad Fitzpatrick ffc82ad820 util/eventbus: add ts_omit_debugeventbus
Updates #17063

Change-Id: Ibc98dd2088f82c829effa71f72f3e2a5abda5038
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
3 months ago
Claus Lensbøl 5bb42e3018
wgengine/router: rely on events for deleted IP rules (#16744)
Adds the eventbus to the router subsystem.

The event is currently only used on linux.

Also includes facilities to inject events into the bus.

Updates #15160

Signed-off-by: Claus Lensbøl <claus@tailscale.com>
4 months ago
Claus Lensbøl d334d9ba07
client/local,cmd/tailscale/cli,ipn/localapi: expose eventbus graph (#16597)
Make it possible to dump the eventbus graph as JSON or DOT to both debug
and document what is communicated via the bus.

Updates #15160

Signed-off-by: Claus Lensbøl <claus@tailscale.com>
5 months ago
Claus Lensbøl 53f67c4396
util/eventbus: fix docstrings (#16401)
Updates #15160

Signed-off-by: Claus Lensbøl <claus@tailscale.com>
5 months ago
Claus Lensbøl f2f1236ad4
util/eventbus: add test helpers to simplify testing events (#16294)
Instead of every module having to come up with a set of test methods for
the event bus, this handful of test helpers hides a lot of the needed
setup for the testing of the event bus.

The tests in portmapper is also ported over to the new helpers.

Updates #15160

Signed-off-by: Claus Lensbøl <claus@tailscale.com>
5 months ago
Nick Khyl 866614202c util/eventbus: remove redundant code from eventbus.Publish
eventbus.Publish() calls newPublisher(), which in turn invokes (*Client).addPublisher().
That method adds the new publisher to c.pub, so we don’t need to add it again in eventbus.Publish.

Updates #cleanup

Signed-off-by: Nick Khyl <nickk@tailscale.com>
6 months ago
Claus Lensbøl 6010812f0c
ipn/localapi,client/local: add debug watcher for bus events (#16239)
Updates: #15160

Signed-off-by: Claus Lensbøl <claus@tailscale.com>
6 months ago
Brad Fitzpatrick e2814871a7 util/eventbus: also disable websocket debug on Android
So tsnet-on-Android is smaller, like iOS.

Updates #12614
Updates #15297

Change-Id: I97ae997f5d17576024470fe5fea93d9f5f134bde
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
7 months ago
David Anderson e091e71937 util/eventbus: remove debug UI from iOS build
The use of html/template causes reflect-based linker bloat. Longer
term we have options to bring the UI back to iOS, but for now, cut
it out.

Updates #15297

Signed-off-by: David Anderson <dave@tailscale.com>
9 months ago
M. J. Fromberger 0663412559
util/eventbus: add basic throughput benchmarks (#15284)
Shovel small events through the pipeine as fast as possible in a few basic
configurations, to establish some baseline performance numbers.

Updates #15160

Change-Id: I1dcbbd1109abb7b93aa4dcb70da57f183eb0e60e
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
9 months ago
David Anderson 6d217d81d1 util/eventbus: add a helper program for bus development
The demo program generates a stream of made up bus events between
a number of bus actors, as a way to generate some interesting activity
to show on the bus debug page.

Signed-off-by: David Anderson <dave@tailscale.com>
9 months ago
David Anderson d83024a63f util/eventbus: add a debug HTTP handler for the bus
Updates #15160

Signed-off-by: David Anderson <dave@tailscale.com>
9 months ago
David Anderson 346a35f612 util/eventbus: add debugger methods to list pub/sub types
This lets debug tools list the types that clients are wielding, so
that they can build a dataflow graph and other debugging views.

Updates #15160

Signed-off-by: David Anderson <dave@tailscale.com>
9 months ago
David Anderson e71e95b841 util/eventbus: don't allow publishers to skip events while debugging
If any debugging hook might see an event, Publisher.ShouldPublish should
tell its caller to publish even if there are no ordinary subscribers.

Updates #15160

Signed-off-by: David Anderson <dave@tailscale.com>
9 months ago
David Anderson 853abf8661 util/eventbus: initial debugging facilities for the event bus
Enables monitoring events as they flow, listing bus clients, and
snapshotting internal queues to troubleshoot stalls.

Updates #15160

Signed-off-by: David Anderson <dave@tailscale.com>
9 months ago
David Anderson e80d2b4ad1 util/eventbus: add debug hooks to snoop on bus traffic
Updates #15160

Signed-off-by: David Anderson <dave@tailscale.com>
9 months ago
David Anderson dd7166cb8e util/eventbus: add internal hook type for debugging
Publicly exposed debugging functions will use these hooks to
observe dataflow in the bus.

Updates #15160

Signed-off-by: David Anderson <dave@tailscale.com>
9 months ago
David Anderson cf5c788cf1 util/eventbus: track additional event context in subscribe queue
Updates #15160

Signed-off-by: David Anderson <dave@tailscale.com>
9 months ago
David Anderson a1192dd686 util/eventbus: track additional event context in publish queue
Updates #15160

Signed-off-by: David Anderson <dave@tailscale.com>
9 months ago
David Anderson bf40bc4fa0 util/eventbus: make internal queue a generic type
In preparation for making the queues carry additional event metadata.

Updates #15160

Signed-off-by: David Anderson <dave@tailscale.com>
9 months ago
David Anderson 24d4846f00 util/eventbus: adjust worker goroutine management helpers
This makes the helpers closer in behavior to cancelable contexts
and taskgroup.Single, and makes the worker code use a more normal
and easier to reason about context.Context for shutdown.

Updates #15160

Signed-off-by: David Anderson <dave@tailscale.com>
9 months ago
David Anderson 3e18434595 util/eventbus: rework to have a Client abstraction
The Client carries both publishers and subscribers for a single
actor. This makes the APIs for publish and subscribe look more
similar, and this structure is a better fit for upcoming debug
facilities.

Updates #15160

Signed-off-by: David Anderson <dave@tailscale.com>
9 months ago
David Anderson ef906763ee util/eventbus: initial implementation of an in-process event bus
Updates #15160

Signed-off-by: David Anderson <dave@tailscale.com>
Co-authored-by: M. J. Fromberger <fromberger@tailscale.com>
9 months ago