March 2023 Android Update Post-mortem

or, “My 3 weeks of hell”

Brenton Bostick

Thursday Apr 6 2023

Overview

PostMortem_1.gif

Timeline

mon Feb 27 2023:

finished work

https://bostick.github.io/zerotier/android-report/report.htm

fri Mar 3 2023:

1.10.3-1 tag

mon Mar 6 2023:

initial release

all hell broke loose

PostMortem_2.gif

tues Mar 7 2023:

got Samsung A13 from Micro Center

1.10.3-2 tag

wed Mar 8 2023:

actual fix for ANDROID-56: crash inside newNetworkConfig
cast all arguments to varargs functions as good style

The source for the crashes of the Android is fundamentally due to calling JNI functions like `NewObject()` with incorrect sized arguments. Functions like `NewObject()` are necessarily vararg, and there is no way for C to know that the programmer is providing the correct arguments.

Arguments that were being treated as booleans on the Java side were being passed in as ints on the C side, and the crucial part is that JNI assumes booleans being passed in are only 1 byte.

In the case of calling `NewObject()` for `VirtualNetworkConfig`, the bad booleans were then causing subsequent vararg arguments to be corrupted, i.e., the 0th byte of the boolean argument was treated as the boolean, but then the 1st, 2nd, and 3rd bytes of the boolean argument were being treated as the 0th, 1st, and 2nd bytes of the next argument. So absolute chaos.

Why was this so hard to find?

All of my devices were ARMv8 and believe this shows up on ARMv7 devices. The Samsung A13 I got on Tuesday is ARMv7. I haven’t checked deeper on ABI differences to be definitive.

Why did this not happen before my changes?

A part of my changes was the move to using single calls to `NewObject()` to construct objects. Before my changes, objects tended to be constructed empty, and then have the fields set individually with e.g. `SetBooleanField()`. So no varargs problem.

I’m changing my JNI style to now always cast arguments to the JNI types when calling vararg functions like `NewObject()`, even it is completely redundant, just to be completely safe.

thurs Mar 9 2023:

fixed fromInt exception being thrown:

//
// Before the 1.10.3-1 release,
// NetworkType had these values:
// UNKNOWN(0),
// PRIVATE(1),
// PUBLIC(2);
//
// After 1.10.3-1,
// NetworkType had these values:
// PRIVATE(0),
// PUBLIC(1);
//
// There was a bug:
// when older versions were updated to newer versions,
// the serialized version of NetworkType in the database was sent to fromInt,
// and if the NetworkType had been PUBLIC(2), then this would now throw an exception
//
// This was not originally tested because the NetworkType property of NetworkConfiguration
// is now deprecated in preference to using VirtualNetworkType in VirtualNetworkConfig directly
//
// So make sure to handle the old value of PUBLIC(2)
//
// This is ok to do because this value is not used anywhere, we just need to make sure to not
// throw an exception.
//

1.10.4-1 tag

mon Mar 13 2023:

1.10.4-2 tag

tues Mar 14 2023:

actual fix for ANDROID-63:
do not keep VNC objects as property of Network
GC causes Network objects to be collected and lose VNC objects
Add a map to ZeroTierOneService for keeping VNC objects
Also, NetworkDetailsStatusFragment now needs to bind to ZeroTierOneService
to get access to VNC map

1.10.5-1 tag

wed Mar 15 2023:

fix ANDROID-66: addressToNetmask for IPv6

Xiaomi phone delivered

1.10.5-2 tag

thurs Mar 16 2023:

1.10.5-3 tag

fri Mar 17 2023:

fix ANDROID-67: ANRs due to no network connectivity

//
// First, completely disconnect from previous service
//
// There was a bug:
// Even with no network connectivity, NetworkInfoUtils.getNetworkInfoCurrentConnection()
// was not returning CurrentConnection.CONNECTION_NONE
//
// This was because a previous service may still be connected and that VPN connection is treated
// as a network connection (even if not on WiFi or Cell)
//
// So the gate-keeping for network connectivity failed here.
// But then the service is actually destroyed here before creating a new one.
// So no VPN connection any more.
// So the gate-keeping for network connectivity in ZeroTierOneService then causes a
// 30-second hang on main thread while waiting for connectivity.
//

sat Mar 18 2023:

Motorola Moto G from Micro Center

tues Mar 21 2023:

1.10.5-4 tag

wed Mar 22 2023:

fix ANDROID-79: handle non-empty networks.d after a crash

fix ANDROID-75: race causing NPE in TunTapAdapter$1.run

ThreadMode.BACKGROUND:
EventBus uses a single background thread that will deliver all its events sequentially.

This is important.

hreadMode.ASYNC works from a thread pool and may end up executing on different threads

thurs Mar 23 2023:

1.10.6-1 tag

fri Mar 24 2023:

fix ANDROID-73: relay through main thread to guarantee posting to “the” background thread

1.10.6-2 tag

discuss threads

https://discuss.zerotier.com/t/android-zerotier-keep-stop-working-after-1-10-3-update/12184

https://discuss.zerotier.com/t/zerotier-android-client-keeps-disconnecting/12230

https://discuss.zerotier.com/t/android-1-10-3-problems-continued/12260

https://discuss.zerotier.com/t/android-app-battery-use-wifi-vs-cell/12535

https://discuss.zerotier.com/t/zerotier-is-connected-on-android-device-no-vpn/12601

Biggest Issues

Could not reproduce problems

ARMv7

32-bit

garbage collection

not fully understanding that greenDAO keeps WeakReferences, and Transient fields may be lost at any time

EventBus BACKGROUND

not fully understanding EventBus BACKGROUND

from:
https://greenrobot.org/eventbus/documentation/delivery-threads-threadmode/#ThreadMode_BACKGROUND

Subscribers will be called in a background thread. On Android, if the posting thread is not the main thread, event handler methods will be called directly in the posting thread. If the posting thread is the main thread, EventBus uses a single background thread that will deliver all its events sequentially. Event handlers using this mode should try to return quickly to avoid blocking the background thread.

Future steps

releases: smaller audience, spread over time

more dogfooding

reduce number of moving parts

ONLY update Android code or ONLY update ZeroTierOne code

more QA

testing upgrading from previous versions

more devices

various tricks for forcing GC (Battery Saver)

Beta program

Created with the Wolfram Language