Commit Graph

9815 Commits (66826a496b678495bfe70225b261288edcb26edf)
 

Author SHA1 Message Date
Nick Khyl 66826a496b
VERSION.txt: this is v1.90.9
Signed-off-by: Nick Khyl <nickk@tailscale.com>
1 week ago
Claus Lensbøl d7cf0cfdb4 wgengine/userspace: run link change subscribers in eventqueue (#18024)
Updates #17996

Signed-off-by: Claus Lensbøl <claus@tailscale.com>
(cherry picked from commit e7f5ca1d5e)
1 week ago
Nick Khyl f58cbffda1 util/eventbus: use unbounded event queues for DeliveredEvents in subscribers
Bounded DeliveredEvent queues reduce memory usage, but they can deadlock under load.
Two common scenarios trigger deadlocks when the number of events published in a short
period exceeds twice the queue capacity (there's a PublishedEvent queue of the same size):
 - a subscriber tries to acquire the same mutex as held by a publisher, or
 - a subscriber for A events publishes B events

Avoiding these scenarios is not practical and would limit eventbus usefulness and reduce its adoption,
pushing us back to callbacks and other legacy mechanisms. These deadlocks already occurred in customer
devices, dev machines, and tests. They also make it harder to identify and fix slow subscribers and similar
issues we have been seeing recently.

Choosing an arbitrary large fixed queue capacity would only mask the problem. A client running
on a sufficiently large and complex customer environment can exceed any meaningful constant limit,
since event volume depends on the number of peers and other factors. Behavior also changes
based on scheduling of publishers and subscribers by the Go runtime, OS, and hardware, as the issue
is essentially a race between publishers and subscribers. Additionally, on lower-end devices,
an unreasonably high constant capacity is practically the same as using unbounded queues.

Therefore, this PR changes the event queue implementation to be unbounded by default.
The PublishedEvent queue keeps its existing capacity of 16 items, while subscribers'
DeliveredEvent queues become unbounded.

This change fixes known deadlocks and makes the system stable under load,
at the cost of higher potential memory usage, including cases where a queue grows
during an event burst and does not shrink when load decreases.

Further improvements can be implemented in the future as needed.

Fixes #17973
Fixes #18012

Signed-off-by: Nick Khyl <nickk@tailscale.com>
(cherry picked from commit 1ccece0f78)
1 week ago
Nick Khyl f2100e2e1d util/eventbus: add tests for a subscriber publishing events
As of 2025-11-20, publishing more events than the eventbus's
internal queues can hold may deadlock if a subscriber tries
to publish events itself.

This commit adds a test that demonstrates this deadlock,
and skips it until the bug is fixed.

Updates #18012

Signed-off-by: Nick Khyl <nickk@tailscale.com>
(cherry picked from commit 3780f25d51)
1 week ago
Nick Khyl 6bc07d3c24 util/eventbus: add tests for a subscriber trying to acquire the same mutex as a publisher
As of 2025-11-20, publishing more events than the eventbus's
internal queues can hold may deadlock if a subscriber tries
to acquire a mutex that can also be held by a publisher.

This commit adds a test that demonstrates this deadlock,
and skips it until the bug is fixed.

Updates #17973

Signed-off-by: Nick Khyl <nickk@tailscale.com>
(cherry picked from commit 016ccae2da)
1 week ago
Nick Khyl ccf4f3c7ce
VERSION.txt: this is v1.90.8
Signed-off-by: Nick Khyl <nickk@tailscale.com>
2 weeks ago
Alex Chan 6b0fbffd4f tka: move RemoveAll() to CompactableChonk
I added a RemoveAll() method on tka.Chonk in #17946, but it's only used
in the node to purge local AUMs. We don't need it in the SQLite storage,
which currently implements tka.Chonk, so move it to CompactableChonk
instead.

Also add some automated tests, as a safety net.

Updates tailscale/corp#33599

Change-Id: I54de9ccf1d6a3d29b36a94eccb0ebd235acd4ebc
Signed-off-by: Alex Chan <alexc@tailscale.com>
(cherry picked from commit c17ba64129)
2 weeks ago
Nick Khyl 90d3cb3c95
VERSION.txt: this is v1.90.7
Signed-off-by: Nick Khyl <nickk@tailscale.com>
2 weeks ago
Alex Chan 37b63eff1c ipn/ipnlocal: use an in-memory TKA store if FS is unavailable
This requires making the internals of LocalBackend a bit more generic,
and implementing the `tka.CompactableChonk` interface for `tka.Mem`.

Signed-off-by: Alex Chan <alexc@tailscale.com>

Updates https://github.com/tailscale/corp/issues/33599

(cherry picked from commit 1723cb83ed)
2 weeks ago
Alex Chan 43ab8b4b18 tka: rename a mutex to `mu` instead of single-letter `l`
See http://go/no-ell

Updates tailscale/corp#33846

Signed-off-by: Alex Chan <alexc@tailscale.com>

Change-Id: I88ecd9db847e04237c1feab9dfcede5ca1050cc5
(cherry picked from commit fca66fb51a)
2 weeks ago
Alex Chan 6b64718bb9 tka: don't try to read AUMs which are partway through being written
Fixes https://github.com/tailscale/tailscale/issues/17600

Signed-off-by: Alex Chan <alexc@tailscale.com>
(cherry picked from commit 23359dc727)
2 weeks ago
Jonathan Nobels fa514c7280
net/netmon: do not abandon a subscriber when exiting early (#17899) (#17905)
LinkChangeLogLimiter keeps a subscription to track rate limits for log
messages.  But when its context ended, it would exit the subscription loop,
leaving the subscriber still alive. Ensure the subscriber gets cleaned up
when the context ends, so we don't stall event processing.

Updates tailscale/corp#34311

Change-Id: I82749e482e9a00dfc47f04afbc69dd0237537cb2

(cherry picked from commit ab4b990d51)

Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
Co-authored-by: M. J. Fromberger <fromberger@tailscale.com>
2 weeks ago
Jordan Whited ea8eeeb2f7 feature/relayserver: fix Shutdown() deadlock (#17898)
Updates #17894

Signed-off-by: Jordan Whited <jordan@tailscale.com>
(cherry picked from commit 0285e1d5fb)
3 weeks ago
Jordan Whited 0f421d3def feature/relayserver,ipn/ipnlocal,net/udprelay: plumb DERPMap (#17881)
This commit replaces usage of local.Client in net/udprelay with DERPMap
plumbing over the eventbus. This has been a longstanding TODO. This work
was also accelerated by a memory leak in net/http when using
local.Client over long periods of time. So, this commit also addresses
said leak.

Updates #17801

Signed-off-by: Jordan Whited <jordan@tailscale.com>
(cherry picked from commit 9e4d1fd87f)
3 weeks ago
Jordan Whited eb03b354f6 net/udprelay: replace VNI pool with selection algorithm (#17868)
This reduces memory usage when tailscaled is acting as a peer relay.

Updates #17801

Signed-off-by: Jordan Whited <jordan@tailscale.com>
(cherry picked from commit f4f9dd7f8c)
3 weeks ago
Jordan Whited 771a9d29ff wgengine/magicsock: fix UDPRelayAllocReq/Resp deadlock (#17831)
Updates #17830

Signed-off-by: Jordan Whited <jordan@tailscale.com>
(cherry picked from commit 2ad2d4d409)
3 weeks ago
Jordan Whited e602907cf5 wgengine/magicsock: validate endpoint.derpAddr in Conn.onUDPRelayAllocResp (#17828)
Otherwise a zero value will panic in Conn.sendUDPStd.

Updates #17827

Signed-off-by: Jordan Whited <jordan@tailscale.com>
(cherry picked from commit 18806de400)
3 weeks ago
Nick Khyl 28f6c2dbfc
VERSION.txt: this is v1.90.6
Signed-off-by: Nick Khyl <nickk@tailscale.com>
1 month ago
M. J. Fromberger b6eabd4038 util/eventbus: allow logging of slow subscribers (#17705)
Add options to the eventbus.Bus to plumb in a logger.

Route that logger in to the subscriber machinery, and trigger a log message to
it when a subscriber fails to respond to its delivered events for 5s or more.

The log message includes the package, filename, and line number of the call
site that created the subscription.

Add tests that verify this works.

Updates #17680

Change-Id: I0546516476b1e13e6a9cf79f19db2fe55e56c698
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
(cherry picked from commit 061e6266cf)
1 month ago
M. J. Fromberger 6e2f2bb31a ipn/ipnlocal: do not stall event processing for appc route updates (#17663)
A follow-up to #17411. Put AppConnector events into a task queue, as they may
take some time to process. Ensure that the queue is stopped at shutdown so that
cleanup will remain orderly.

Because events are delivered on a separate goroutine, slow processing of an
event does not cause an immediate problem; however, a subscriber that blocks
for a long time will push back on the bus as a whole. See
https://godoc.org/tailscale.com/util/eventbus#hdr-Expected_subscriber_behavior
for more discussion.

Updates #17192
Updates #15160

Change-Id: Ib313cc68aec273daf2b1ad79538266c81ef063e3
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
(cherry picked from commit 06b092388e)
1 month ago
Alex Chan faca4c08b7
.github/workflows: pin the google/oss-fuzz GitHub Actions
Updates https://github.com/tailscale/corp/issues/31017

Signed-off-by: Alex Chan <alexc@tailscale.com>
(cherry picked from commit 3944809a11)
1 month ago
Nick Khyl 63242007ae
VERSION.txt: this is v1.90.5
Signed-off-by: Nick Khyl <nickk@tailscale.com>
1 month ago
Brad Fitzpatrick 300e6062bf cmd/k8s-operator/generate: skip tests if no network or Helm is down
Updates helm/helm#31434

Change-Id: I5eb20e97ff543f883d5646c9324f50f54180851d
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
(cherry picked from commit d5a40c01ab)
1 month ago
Brad Fitzpatrick 1a6c31538e sessionrecording: fix regression in recent http2 package change
In 3f5c560fd4 I changed to use std net/http's HTTP/2 support,
instead of pulling in x/net/http2.

But I forgot to update DialTLSContext to DialContext, which meant it
was falling back to using the std net.Dialer for its dials, instead
of the passed-in one.

The tests only passed because they were using localhost addresses, so
the std net.Dialer worked. But in prod, where a tsnet Dialer would be
needed, it didn't work, and would time out for 10 seconds before
resorting to the old protocol.

So this fixes the tests to use an isolated in-memory network to prevent
that class of problem in the future. With the test change, the old code
fails and the new code passes.

Thanks to @jasonodonnell for debugging!

Updates #17304
Updates 3f5c560fd4

Change-Id: I3602bafd07dc6548e2c62985af9ac0afb3a0e967
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
(cherry picked from commit 8996254647)
1 month ago
Nick Khyl 68cba300e4
VERSION.txt: this is v1.90.4
Signed-off-by: Nick Khyl <nickk@tailscale.com>
1 month ago
M. J. Fromberger 2dd72f6ec2
Revert "logtail: avoid racing eventbus subscriptions with Shutdown (#17639)" (#17684)
This reverts commit 4346615d77.
We averted the shutdown race, but will need to service the subscriber even when
we are not waiting for a change so that we do not delay the bus as a whole.

Updates #17638

Change-Id: I5488466ed83f5ad1141c95267f5ae54878a24657
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
(cherry picked from commit db5815fb97)
1 month ago
Brad Fitzpatrick 53004dded1 wgengine/magicsock: fix js/wasm crash regression loading non-existent portmapper
Thanks for the report, @Need-an-AwP!

Fixes #17681
Updates #9394

Change-Id: I2e0b722ef9b460bd7e79499192d1a315504ca84c
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
(cherry picked from commit edb11e0e60)
1 month ago
srwareham 033adc398c cmd/tailscale/cli: move JetKVM scripts to /userdata/init.d for persistence (#17610)
Updates #16524
Updates jetkvm/rv1106-system#34

Signed-off-by: srwareham <ebriouscoding@gmail.com>
(cherry picked from commit f4e2720821)
1 month ago
Max Coulombe bad03eefa1 feature/identityfederation: strip query params on clientID (#17666)
Updates #9192

Change-Id: I35c88df8a0242ecc19a23265d392ef78ac176b9d
Signed-off-by: mcoulombe <max@tailscale.com>
(cherry picked from commit 34e992f59d)
1 month ago
Patrick O'Doherty dc3c15b4c6
control/controlclient: back out HW key attestation (#17664)
Temporarily back out the TPM-based hw attestation code while we debug
Windows exceptions.

Updates tailscale/corp#31269

Signed-off-by: Patrick O'Doherty <patrick@tailscale.com>
(cherry picked from commit a760cbe33f)
1 month ago
Nick Khyl c50fe71822
VERSION.txt: this is v1.90.3
Signed-off-by: Nick Khyl <nickk@tailscale.com>
1 month ago
M. J. Fromberger 597acd8663
logtail: avoid racing eventbus subscriptions with Shutdown (#17639)
When the eventbus is enabled, set up the subscription for change deltas at the
beginning when the client is created, rather than waiting for the first
awaitInternetUp check.

Otherwise, it is possible for a check to race with the client close in
Shutdown, which triggers a panic.

Updates #17638

Change-Id: I461c07939eca46699072b14b1814ecf28eec750c
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
(cherry picked from commit 4346615d77)
1 month ago
Claus Lensbøl e6a3669277
net/tsdial: do not panic if setting the same eventbus twice (#17640)
Updates #17638

Signed-off-by: Claus Lensbøl <claus@tailscale.com>
(cherry picked from commit fd0e541e5d)
1 month ago
Nick Khyl 8bcd44ecf0
VERSION.txt: this is v1.90.2
Signed-off-by: Nick Khyl <nickk@tailscale.com>
1 month ago
Claus Lensbøl b0f0bce928 health: compare warnable codes to avoid errors on release branch (#17637)
This compares the warnings we actually care about and skips the unstable
warnings and the changes with no warnings.

Fixes #17635

Signed-off-by: Claus Lensbøl <claus@tailscale.com>
(cherry picked from commit 7418583e47)
1 month ago
Brad Fitzpatrick c81ef9055b util/linuxfw: fix 32-bit arm regression with iptables
This fixes a regression from dd615c8fdd that moved the
newIPTablesRunner constructor from a any-Linux-GOARCH file to one that
was only amd64 and arm64, thus breaking iptables on other platforms
(notably 32-bit "arm", as seen on older Pis running Buster with
iptables)

Tested by hand on a Raspberry Pi 2 w/ Buster + iptables for now, for
lack of automated 32-bit arm tests at the moment. But filed #17629.

Fixes #17623
Updates #17629

Change-Id: Iac1a3d78f35d8428821b46f0fed3f3717891c1bd
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
(cherry picked from commit 8576a802ca)
1 month ago
Patrick O'Doherty 9fe44b3718 feature/tpm: use withSRK to probe TPM availability (#17627)
On some platforms e.g. ChromeOS the owner hierarchy might not always be
available to us. To avoid stale sealing exceptions later we probe to
confirm it's working rather than rely solely on family indicator status.

Updates #17622

Signed-off-by: Patrick O'Doherty <patrick@tailscale.com>
(cherry picked from commit 672b1f0e76)
1 month ago
Patrick O'Doherty a8ae316858 feature/tpm: check TPM family data for compatibility (#17624)
Check that the TPM we have opened is advertised as a 2.0 family device
before using it for state sealing / hardware attestation.

Updates #17622

Signed-off-by: Patrick O'Doherty <patrick@tailscale.com>
(cherry picked from commit 36ad24b20f)
1 month ago
Nick Khyl 75b0c6f164 VERSION.txt: this is v1.90.1
Signed-off-by: Nick Khyl <nickk@tailscale.com>
1 month ago
Nick Khyl 3c78146ece VERSION.txt: this is v1.90.0
Signed-off-by: Nick Khyl <nickk@tailscale.com>
1 month ago
License Updater 4e1c270f90 licenses: update license notices
Signed-off-by: License Updater <noreply+license-updater@tailscale.com>
1 month ago
Alex Chan 4673992b96 tka: created a shared testing library for Chonk
This patch creates a set of tests that should be true for all implementations of Chonk and CompactableChonk, which we can share with the SQLite implementation in corp.

It includes all the existing tests, plus a test for LastActiveAncestor which was in corp but not in oss.

Updates https://github.com/tailscale/corp/issues/33465

Signed-off-by: Alex Chan <alexc@tailscale.com>
1 month ago
Alex Chan c961d58091 cmd/tailscale: improve the error message for `lock log` with no lock
Previously, running `tailscale lock log` in a tailnet without Tailnet
Lock enabled would return a potentially confusing error:

    $ tailscale lock log
    2025/10/20 11:07:09 failed to connect to local Tailscale service; is Tailscale running?

It would return this error even if Tailscale was running.

This patch fixes the error to be:

    $ tailscale lock log
    Tailnet Lock is not enabled

Fixes #17586

Signed-off-by: Alex Chan <alexc@tailscale.com>
1 month ago
Max Coulombe 6a73c0bdf5
cmd/tailscale/cli,feature: add support for identity federation (#17529)
Add new arguments to `tailscale up` so authkeys can be generated dynamically via identity federation.

Updates #9192

Signed-off-by: mcoulombe <max@tailscale.com>
2 months ago
Brad Fitzpatrick 54cee33bae go.toolchain.rev: update to Go 1.25.3
Updates tailscale/go#140
Updates tailscale/go#142
Updates tailscale/go#138

Change-Id: Id25b6fa4e31eee243fec17667f14cdc48243c59e
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2 months ago
David Bond 9083ef1ac4
cmd/k8s-operator: allow pod tolerations on nameservers (#17260)
This commit modifies the `DNSConfig` custom resource to allow specifying
[tolerations](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/)
on the nameserver pods.

This will allow users to dictate where their nameserver pods are located
within their clusters.

Fixes: https://github.com/tailscale/tailscale/issues/17092

Signed-off-by: David Bond <davidsbond93@gmail.com>
2 months ago
Andrew Lytvynov 6493206ac7
.github/workflows: pin nix-related github actions (#17574)
Updates #cleanup

Signed-off-by: Andrew Lytvynov <awly@tailscale.com>
2 months ago
Alex Chan 8d119f62ee wgengine/magicsock: minor tidies in Test_endpoint_maybeProbeUDPLifetimeLocked
* Remove a couple of single-letter `l` variables
* Use named struct parameters in the test cases for readability
* Delete `wantAfterInactivityForFn` parameter when it returns the
  default zero

Updates #cleanup

Signed-off-by: Alex Chan <alexc@tailscale.com>
2 months ago
Alex Chan 55a43c3736 tka: don't look up parent/child information from purged AUMs
We soft-delete AUMs when they're purged, but when we call `ChildAUMs()`,
we look up soft-deleted AUMs to find the `Children` field.

This patch changes the behaviour of `ChildAUMs()` so it only looks at
not-deleted AUMs. This means we don't need to record child information
on AUMs any more, which is a minor space saving for any newly-recorded
AUMs.

Updates https://github.com/tailscale/tailscale/issues/17566
Updates https://github.com/tailscale/corp/issues/27166

Signed-off-by: Alex Chan <alexc@tailscale.com>
2 months ago
Alex Chan c3acf25d62 tka: remove an unused Mem.Orphans() method
This method was added in cca25f6 in the initial in-memory implementation
of Chonk, but it's not part of the Chonk interface and isn't implemented
or used anywhere else. Let's get rid of it.

Updates https://github.com/tailscale/corp/issues/33465

Signed-off-by: Alex Chan <alexc@tailscale.com>
2 months ago