Getting ANRs because (I think) the main thread's waiting for the write
thread to die and now the write thread's doing a ton of work
sometimes. So move the threads into a standalone object that can be
allowed to die on its own time without anybody waiting.
I *think* the reason I'm occasionally seeing toasts about not finding a
move is that when the engine's interrupted by there being a UI event in
the queue that error is posted. Instead try posting only when at the end
of the search nothing's been found.
Having reconfigured to use non-existent relay port as a test of falling
back to the web apis, tweak stuff: send the packets that have been
accumulated when an EOQ is found (rather than dropping all of them
immediately) before exiting the write thread; and start the threads up
when posting a packet in case they aren't (they may not be when the post
happens via timer firing.)
Seemed to be causing ANRs. Integrate instead into outgoing message queue
by using poll(timeout) then checking for unack'd packets every time
through the loop (but not more than once/3 seconds or so.)
Presence of timestamp instead of a boolean determines whether packet
should next via Web. Timestamps might also allow to process a larger
number of unacked packets in a single timer fire....
Track ack'd and unack'd packets. When there are ten more of the latter,
skip the UDP-send step. This is probably not the algorithm I'll settle
on (an explicit PING to the relay over UDP might be simpler), but it's
simple and easy.
Send each packet via UDP if that's thought to be working (always is,
now) and start a 10-second timer. If it hasn't been ack'd by then,
resend via Web API. Tested by configuring to use a UDP socket that the
relay isn't listening on. Only problem is that the backoff timers are
broken: never stops sending every few seconds.
The plan's to use the native relay protocol first, then to fall back to
the slower but more reliable (esp. on paranoid/block-everything wifi
networks) webAPI. This is the name change without behavior
change (except that the native kill() to report deleted games is gone.)
When grouping to allow multiple packets per outbound API call I forgot
that some are there to mark the end-of-queue: can't be sent! Trying
caused a NPE. Now if any EOQ is found in the queue that batch is dropped
and the thread's exited.
AddrInfo now has ref()/unref() and keeps a global socket->refcount map
(since actual AddrInfo instances come and go.) When the
count drops to 0, the existing CloseSocket() method is called. This
seems to fix a bunch of race conditions that had a socket being closed
and reused while old code was still expecting to write to the device
attached to the socket the first time (along with lots of calls to close()
already-closed sockets, attempts to write() to closed sockets, etc.)
Use of mutex logging recurses infinitely if config uses mlock, so remove
that. And don't free sockets after handling their messages as they may
be in use elsewhere. This likely introduces a leak of sockets.
Haven't seen it happen, but I think there was a bug that could have led
to all the sockets coming back as ready from poll() being dropped. Fixed
that and added/cleaned up some logging.
translate the most-used features of my too-big-for-bash script into python3,
which is clearly much better suited. Tried to keep the structure and variable
names intact so that diff has a chance of showing something, but when a class
replaces a bunch of arrays that were being kept in sync there's only so much
you can to. Currently doesn't support stuff like app upgrades and switching
from tcp to udp, but those should be relatively easy to bring over from the
.sh when/if I need them.
translate the most-used features of my too-big-for-bash script into python3,
which is clearly much better suited. Tried to keep the structure and variable
names intact so that diff has a chance of showing something, but when a class
replaces a bunch of arrays that were being kept in sync there's only so much
you can to. Currently doesn't support stuff like app upgrades and switching
from tcp to udp, but those should be relatively easy to bring over from the
.sh when/if I need them.
I'm not quite sure why it was firing, but the pattern of a FindGame
failing due to a race condition and requiring a retry exists elsewhere
in the code. Lots of tests pass once this change is made. :-)
The hack I came up with for storing them in memory isn't working. Even
if it's just that I don't understand C++ maps they need to be cleared
when the DB's wiped (my favorite test these days) and I don't want to
spend the time now.
It breaks rematch that "dict" is being passed to the Android client from
the linux side, and this is easier than figuring out how and when to
dereference the link.
Making the right_side elem match its parent height prevents the
lower-right region of game list items from falling through and
triggering a toggle-selection event.
Making the right_side elem match its parent height prevents the
lower-right region of game list items from falling through and
triggering a toggle-selection event.
As expected, moves are no longer received instantly because the UDP
socket isn't available for the relay to write too once the URL
handler (relay.py) finishes.
Ideally the comms module wouldn't go through its connecting routine in
order to join a game. To that end I added a join() method to relay.py
and code to call it. Joins happen (pairing games, starting new ones,
etc.), but after that communication doesn't. First part of fixing that
would be to make cookieID persistent and transmit it back with the rest
of what join sends (since it's used by all the messages currently sent
in a connected state), but I suspect there's more to be done, and even
that requires a fair number of changes on the relay side. So all that's
wrapped in #ifdef RELAY_VIA_HTTP (and turned off.)
Was coming in as a string when called via curl. This may be a problem
for other ints if I go with lots of params instead of a json, which is
looking less likely.
join's how devices will create or join existing games. It more
compilicated than I'd like but seems to work except that once a slot's
assigned it's unavailable to anybody else even if the other fails ever
to respond (i.e. needs the ACK function of the c++ relay.)
If I want to move relaycon into common so Android can use it (assuming
the jni code starts including json-c and libcurl so it can handle
networking) I'll need a replacement for GSList. This is a start.
With the new http stuff, at least for now, it takes longer to get things
communicated and so killing games after 2 seconds of runtime meant no
moves ever got made. Making it configurable, and passing 10 (seconds)
means nearly all games in a large test run complete reasonably quickly.
ACK doesn't need to wait 2 seconds for a reply, and when it does so the
next send waits too. Eventually we'll want to combine messages already
in the queue into a single send. For now, this makes things better.