Commit Graph

2500 Commits (f21982f854990baf0f0f85eac1c9369a3d05f9f2)
 

Author SHA1 Message Date
Josh Bleecher Snyder 744de615f1 health, wgenegine: fix receive func health checks for the fourth time
The old implementation knew too much about how wireguard-go worked.
As a result, it missed genuine problems that occurred due to unrelated bugs.

This fourth attempt to fix the health checks takes a black box approach.
A receive func is healthy if one (or both) of these conditions holds:

* It is currently running and blocked.
* It has been executed recently.

The second condition is required because receive functions
are not continuously executing. wireguard-go calls them and then
processes their results before calling them again.

There is a theoretical false positive if wireguard-go go takes
longer than one minute to process the results of a receive func execution.
If that happens, we have other problems.

Updates #1790

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
4 years ago
Josh Bleecher Snyder 0d4c8cb2e1 health: delete ReceiveFunc health checks
They were not doing their job.
They need yet another conceptual re-think.
Start by clearing the decks.

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
4 years ago
Josh Bleecher Snyder 99705aa6b7 net/tstun: split TUN events channel into up/down and MTU
We had a long-standing bug in which our TUN events channel
was being received from simultaneously in two places.

The first is wireguard-go.

At wgengine/userspace.go:366, we pass e.tundev to wireguard-go,
which starts a goroutine (RoutineTUNEventReader)
that receives from that channel and uses events to adjust the MTU
and bring the device up/down.

At wgengine/userspace.go:374, we launch a goroutine that
receives from e.tundev, logs MTU changes, and triggers
state updates when up/down changes occur.

Events were getting delivered haphazardly between the two of them.

We don't really want wireguard-go to receive the up/down events;
we control the state of the device explicitly by calling device.Up.
And the userspace.go loop MTU logging duplicates logging that
wireguard-go does when it received MTU updates.

So this change splits the single TUN events channel into up/down
and other (aka MTU), and sends them to the parties that ought
to receive them.

I'm actually a bit surprised that this hasn't caused more visible trouble.
If a down event went to wireguard-go but the subsequent up event
went to userspace.go, we could end up with the wireguard-go device disappearing.

I believe that this may also (somewhat accidentally) be a fix for #1790.

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
4 years ago
David Anderson 97d2fa2f56 net/dns: work around WSL DNS implementation flaws.
Fixes tailscale/corp#1662

Signed-off-by: David Anderson <danderson@tailscale.com>
4 years ago
Brad Fitzpatrick ffe6c8e335 cmd/tailscale/cli: don't do a simple up when in state NeedsLogin
Fixes #1780

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
4 years ago
Brad Fitzpatrick 138921ae40 ipn/ipnlocal: always write files to partial files, even in buffered mode
The intention was always that files only get written to *.partial
files and renamed at the end once fully received, but somewhere in the
process that got lost in buffered mode and *.partial files were only
being used in direct receive mode. This fix prevents WaitingFiles
from returning files that are still being transferred.

Updates tailscale/corp#1626

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
4 years ago
Brad Fitzpatrick 5e268e6153 ipn/ipnlocal: use delete marker files to work around Windows delete problems
If DeleteFile fails on Windows due to another process (anti-virus,
probably) having our file open, instead leave a marker file that the
file is logically deleted, and remove it from API calls and clean it
up lazily later.

Updates tailscale/corp#1626

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
4 years ago
Avery Pennarun a7fe1d7c46 wgengine/bench: improved rate selection.
The old decay-based one took a while to converge. This new one (based
very loosely on TCP BBR) seems to converge quickly on what seems to be
the best speed.

Signed-off-by: Avery Pennarun <apenwarr@tailscale.com>
4 years ago
Avery Pennarun a92b9647c5 wgengine/bench: speed test for channels, sockets, and wireguard-go.
This tries to generate traffic at a rate that will saturate the
receiver, without overdoing it, even in the event of packet loss. It's
unrealistically more aggressive than TCP (which will back off quickly
in case of packet loss) but less silly than a blind test that just
generates packets as fast as it can (which can cause all the CPU to be
absorbed by the transmitter, giving an incorrect impression of how much
capacity the total system has).

Initial indications are that a syscall about every 10 packets (TCP bulk
delivery) is roughly the same speed as sending every packet through a
channel. A syscall per packet is about 5x-10x slower than that.

The whole tailscale wireguard-go + magicsock + packet filter
combination is about 4x slower again, which is better than I thought
we'd do, but probably has room for improvement.

Note that in "full" tailscale, there is also a tundev read/write for
every packet, effectively doubling the syscall overhead per packet.

Given these numbers, it seems like read/write syscalls are only 25-40%
of the total CPU time used in tailscale proper, so we do have
significant non-syscall optimization work to do too.

Sample output:

$ GOMAXPROCS=2 go test -bench . -benchtime 5s ./cmd/tailbench
goos: linux
goarch: amd64
pkg: tailscale.com/cmd/tailbench
cpu: Intel(R) Core(TM) i7-4785T CPU @ 2.20GHz
BenchmarkTrivialNoAlloc/32-2         	56340248	        93.85 ns/op	 340.98 MB/s	         0 %lost	       0 B/op	       0 allocs/op
BenchmarkTrivialNoAlloc/124-2        	57527490	        99.27 ns/op	1249.10 MB/s	         0 %lost	       0 B/op	       0 allocs/op
BenchmarkTrivialNoAlloc/1024-2       	52537773	       111.3 ns/op	9200.39 MB/s	         0 %lost	       0 B/op	       0 allocs/op
BenchmarkTrivial/32-2                	41878063	       135.6 ns/op	 236.04 MB/s	         0 %lost	       0 B/op	       0 allocs/op
BenchmarkTrivial/124-2               	41270439	       138.4 ns/op	 896.02 MB/s	         0 %lost	       0 B/op	       0 allocs/op
BenchmarkTrivial/1024-2              	36337252	       154.3 ns/op	6635.30 MB/s	         0 %lost	       0 B/op	       0 allocs/op
BenchmarkBlockingChannel/32-2           12171654	       494.3 ns/op	  64.74 MB/s	         0 %lost	    1791 B/op	       0 allocs/op
BenchmarkBlockingChannel/124-2          12149956	       507.8 ns/op	 244.17 MB/s	         0 %lost	    1792 B/op	       1 allocs/op
BenchmarkBlockingChannel/1024-2         11034754	       528.8 ns/op	1936.42 MB/s	         0 %lost	    1792 B/op	       1 allocs/op
BenchmarkNonlockingChannel/32-2          8960622	      2195 ns/op	  14.58 MB/s	         8.825 %lost	    1792 B/op	       1 allocs/op
BenchmarkNonlockingChannel/124-2         3014614	      2224 ns/op	  55.75 MB/s	        11.18 %lost	    1792 B/op	       1 allocs/op
BenchmarkNonlockingChannel/1024-2        3234915	      1688 ns/op	 606.53 MB/s	         3.765 %lost	    1792 B/op	       1 allocs/op
BenchmarkDoubleChannel/32-2          	 8457559	       764.1 ns/op	  41.88 MB/s	         5.945 %lost	    1792 B/op	       1 allocs/op
BenchmarkDoubleChannel/124-2         	 5497726	      1030 ns/op	 120.38 MB/s	        12.14 %lost	    1792 B/op	       1 allocs/op
BenchmarkDoubleChannel/1024-2        	 7985656	      1360 ns/op	 752.86 MB/s	        13.57 %lost	    1792 B/op	       1 allocs/op
BenchmarkUDP/32-2                    	 1652134	      3695 ns/op	   8.66 MB/s	         0 %lost	     176 B/op	       3 allocs/op
BenchmarkUDP/124-2                   	 1621024	      3765 ns/op	  32.94 MB/s	         0 %lost	     176 B/op	       3 allocs/op
BenchmarkUDP/1024-2                  	 1553750	      3825 ns/op	 267.72 MB/s	         0 %lost	     176 B/op	       3 allocs/op
BenchmarkTCP/32-2                    	11056336	       503.2 ns/op	  63.60 MB/s	         0 %lost	       0 B/op	       0 allocs/op
BenchmarkTCP/124-2                   	11074869	       533.7 ns/op	 232.32 MB/s	         0 %lost	       0 B/op	       0 allocs/op
BenchmarkTCP/1024-2                  	 8934968	       671.4 ns/op	1525.20 MB/s	         0 %lost	       0 B/op	       0 allocs/op
BenchmarkWireGuardTest/32-2          	 1403702	      4547 ns/op	   7.04 MB/s	        14.37 %lost	     467 B/op	       3 allocs/op
BenchmarkWireGuardTest/124-2         	  780645	      7927 ns/op	  15.64 MB/s	         1.537 %lost	     420 B/op	       3 allocs/op
BenchmarkWireGuardTest/1024-2        	  512671	     11791 ns/op	  86.85 MB/s	         0.5206 %lost	     411 B/op	       3 allocs/op
PASS
ok  	tailscale.com/wgengine/bench	195.724s

Updates #414.

Signed-off-by: Avery Pennarun <apenwarr@tailscale.com>
4 years ago
Maisem Ali 590792915a wgengine/router{win}: ignore broadcast routes added by Windows when removing routes.
Signed-off-by: Maisem Ali <maisem@tailscale.com>
4 years ago
David Anderson f6b7d08aea net/dns: work around new NetworkManager in other selection paths.
Further bits of #1788

Signed-off-by: David Anderson <danderson@tailscale.com>
4 years ago
David Anderson 25ce9885a2 net/dns: don't use NM+resolved for NM >=1.26.6.
NetworkManager fixed the bug that forced us to use NetworkManager
if it's programming systemd-resolved, and in the same release also
made NetworkManager ignore DNS settings provided for unmanaged
interfaces... Which breaks what we used to do. So, with versions
1.26.6 and above, we MUST NOT use NetworkManager to indirectly
program systemd-resolved, but thankfully we can talk to resolved
directly and get the right outcome.

Fixes #1788

Signed-off-by: David Anderson <danderson@tailscale.com>
4 years ago
David Anderson 31f81b782e util/cmpver: move into OSS from corp repo.
Signed-off-by: David Anderson <danderson@tailscale.com>
4 years ago
Aleksandar Pesic 7c985e4944 ipn/ipnlocal: add file sharing to windows shell
Updates: tailscale/winmin#33

Signed-off-by: Aleksandar Pesic <peske.nis@gmail.com>
4 years ago
Brad Fitzpatrick e41075dd4a net/interfaces: work around race fetching routing table
Fixes #1345
Updates golang/go#45736

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
4 years ago
Brad Fitzpatrick fe53a714bd ipn/ipnlocal: add a LocalBackend.Start fast path if already running
Updates tailscale/corp#1621

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
4 years ago
Brad Fitzpatrick ad1a595a75 ipn/ipnlocal: close peer API listeners on transition away from Running
Updates tailscale/corp#1621

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
4 years ago
Brad Fitzpatrick d94ed7310b cmd/tailscale/cli: add test for already-submitted #1777
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
4 years ago
Josh Bleecher Snyder 8d7f7fc7ce health, wgenegine: fix receive func health checks yet again
The existing implementation was completely, embarrassingly conceptually broken.

We aren't able to see whether wireguard-go's receive function goroutines
are running or not. All we can do is model that based on what we have done.
This commit fixes that model.

Fixes #1781

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
4 years ago
David Anderson 30f5d706a1 net/dns/resolver: remove unnecessary/racy WaitGroup.
Fixes #1663

Signed-off-by: David Anderson <danderson@tailscale.com>
4 years ago
Brad Fitzpatrick 8a449c4dcd ipn: define NewBackendServer nil as not affecting Backend's NotifyCallback
Updates tailscale/corp#1646

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
4 years ago
David Anderson 30629c430a cmd/tailscale/cli: don't force an interactive login on --reset.
Fixes #1778

Signed-off-by: David Anderson <danderson@tailscale.com>
4 years ago
David Anderson 36d030cc36 ipn/ipnlocal: use fallback default DNS whenever exit nodes are on.
Fixes #1625

Signed-off-by: David Anderson <danderson@tailscale.com>
4 years ago
David Anderson 67ba6aa9fd cmd/tailscale/cli: fix typo in ExitNodeID mapping.
Prevented turning off exit nodes.

Fixes #1777

Signed-off-by: David Anderson <danderson@tailscale.com>
4 years ago
Brad Fitzpatrick 86e85d8934 ipn/ipnlocal: add peerapi goroutine fetch
Between owners.

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
4 years ago
Josh Bleecher Snyder 5835a3f553 health, wgengine/magicsock: avoid receive function false positives
Avery reported a sub-ms health transition from "receiveIPv4 not running" to "ok".

To avoid these transient false-positives, be more precise about
the expected lifetime of receive funcs. The problematic case is one in which
they were started but exited prior to a call to connBind.Close.
Explicitly represent started vs running state, taking care with the order of updates.

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
4 years ago
Brad Fitzpatrick 3411bb959a control/controlclient: fix signRegisterRequest log suppression check on Windows
Fixes #1774

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
4 years ago
Brad Fitzpatrick 2d786821f6 ipn/ipnlocal: put a retry loop around Windows file deletes
oh, Windows.

Updates tailscale/corp#1626

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
4 years ago
Brad Fitzpatrick 11780a4503 cmd/tailscale: only send file basename in push
Fixes #1640

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
4 years ago
Josh Bleecher Snyder f845aae761 health: track whether magicsock receive functions are running
Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
4 years ago
Brad Fitzpatrick 529ef98b2a ipn/ipnlocal: fix approxSize operator precedence
Whoops.

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
4 years ago
Brad Fitzpatrick 820952daba cmd/tailscale: don't print out old authURL on up --force-reauth
Fixes #1671

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
4 years ago
Brad Fitzpatrick 12b4672add wgengine: quiet connection failure diagnostics for exit nodes
The connection failure diagnostic code was never updated enough for
exit nodes, so disable its misleading output when the node it picks
(incorrectly) to diagnose is only an exit node.

Fixes #1754

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
4 years ago
Brad Fitzpatrick b03c23d2ed ipn/ipnlocal: log on DeleteFile error
Updates tailscale/corp#1626

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
4 years ago
Brad Fitzpatrick 6f52fa02a3 control/controlclient, tailcfg: add Debug.SleepSeconds (mapver 19)
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
4 years ago
Brad Fitzpatrick c91a22c82e cmd/tailscale: don't print auth URL when using a --authkey
Fixes #1755

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
4 years ago
Brad Fitzpatrick e40e5429c2 cmd/tailscale/cli: make 'tailscale up' protect --advertise-exit-node removal
The new "tailscale up" checks previously didn't protect against
--advertise-exit-node being omitted in the case that
--advertise-routes was also provided. It wasn't done before because
there is no corresponding pref for "--advertise-exit-node"; it's a
helper flag that augments --advertise-routes. But that's an
implementation detail and we can still help users. We just have to
special case that pref and look whether the current routes include
both the v4 and v6 /0 routes.

Fixes #1767

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
4 years ago
Brad Fitzpatrick a16eb6ac41 cmd/tailscale/cli: show online/offline status in push --file-targets
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
4 years ago
Brad Fitzpatrick dedbd483ea cmd/tailscale/cli: don't require explicit --operator if it matches $USER
This doesn't make --operator implicit (which we might do in the
future), but it at least doesn't require repeating it in the future
when it already matches $USER.

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
4 years ago
Brad Fitzpatrick 2f17a34242 ipn/ipnlocal: fix tailscale status --json AuthURL field
It was getting cleared on notify.

Document that authURL is cleared on notify and add a new field that
isn't, using the new field for the JSON status.

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
4 years ago
Brad Fitzpatrick 09891b9868 ipn/ipnlocal: on fresh lazy-connecting install, start in state NeedsLogin
Fixes #1759

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
4 years ago
Josh Bleecher Snyder a29b0cf55f wgengine/wglog: allow wireguard-go receive routines to log
I've spent two days searching for a theoretical wireguard-go bug
around receive functions exiting early.

I've found many bugs, but none of the flavor we're looking for.

Restore wireguard-go's logging around starting and stopping receive functions,
so that we can definitively rule in or out this particular theory.

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
4 years ago
Josh Bleecher Snyder eb2a9d4ce3 wgengine/netstack: log error when acceptUDP fails
I see a bunch of these in some logs I'm looking at,
separated only by a few seconds.
Log the error so we can tell what's going on here.

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
4 years ago
Naman Sood 4a90a91d29
wgengine/netstack: log ForwarderRequest in readable form, only in debug mode (#1758)
* wgengine/netstack: log ForwarderRequest in readable form, only in debug mode

Fixes #1757

Signed-off-by: Naman Sood <mail@nsood.in>
4 years ago
Josh Bleecher Snyder 07c95a0219 wgengine/wgcfg/nmcfg: consolidate exit node log lines
These were getting rate-limited for nodes with many peers.
Consolate the output into single lines, which are nicer anyway.

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
4 years ago
Brad Fitzpatrick 3d4d97601a derp/derpmap: add São Paulo (derp11)
Updates #1499
4 years ago
Brad Fitzpatrick 91c9c33036 cmd/tailscaled: don't block ipnserver startup behind engine init on Windows
With this change, the ipnserver's safesocket.Listen (the localhost
tcp.Listen) happens right away, before any synchronous
TUN/DNS/Engine/etc setup work, which might be slow, especially on
early boot on Windows.

Because the safesocket.Listen starts up early, that means localhost
TCP dials (the safesocket.Connect from the GUI) complete successfully
and thus the GUI avoids the MessageBox error. (I verified that
pacifies it, even without a Listener.Accept; I'd feared that Windows
localhost was maybe special and avoided the normal listener backlog).

Once the GUI can then connect immediately without errors, the various
timeouts then matter less, because the backend is no longer trying to
race against the GUI's timeout. So keep retrying on errors for a
minute, or 10 minutes if the system just booted in the past 10
minutes.

This should fix the problem with Windows 10 desktops auto-logging in
and starting the Tailscale frontend which was then showing a
MessageBox error about failing to connect to tailscaled, which was
slow coming up because the Windows networking stack wasn't up
yet. Fingers crossed.

Fixes #1313 (previously #1187, etc)

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
4 years ago
Alex Brainman 7d8f082ff7 .github/workflows: add --race tests on Linux and Windows
Updates #50
Updates #833

Signed-off-by: Alex Brainman <alex.brainman@gmail.com>
4 years ago
Alex Brainman 7689213aaa cmd/tailscaled: add subcommands to install and remove tailscaled Windows service
This change implements Windows version of install-system-daemon and
uninstall-system-daemon subcommands. When running the commands the
user will install or remove Tailscale Windows service.

Updates #1232

Signed-off-by: Alex Brainman <alex.brainman@gmail.com>
4 years ago
David Anderson 6fd9e28bd0 ipn/ipnlocal: add arpa suffixes to MagicDNS for reverse lookups.
This used to not be necessary, because MagicDNS always did full proxying.
But with split DNS, we need to know which names to route to our resolver,
otherwise reverse lookups break.

This captures the entire CGNAT range, as well as our Tailscale ULA.

Signed-off-by: David Anderson <danderson@tailscale.com>
4 years ago