Fig. 3. Graph zoomed in. Each node corresponds to a tile on the board, each edge represents that either nodes are spatial neighbors, or they are the same tiles but at adjacent timestamps. Graph is oriented, nodes from future graphs are pointing at nodes at previous steps. Thus, all the information flows to current board representation which will be used for actions generation later.

Node input features $\mathbf{v}_i$ contain kore amount on the tile, timestamp, total amount of kore collected by player and its opponent. Also, if any fleet is located on the tile - its cargo, collection rate and number of ships. If any shipyard is on the tile - its number of ships, maximum plan length for this number and maximum number of ships, which can be spawned.

The edge input features $\mathbf{e}_{ji}$ are $(1, 0, 0)^T,ドル $(0, 1, 0)^T$ and $(0, 0, 1)^T$ for temporal, lateral and longitudinal edges respectively. Basically, they are one-hot representations of these edge classes. Why treat spatial edges differently? When you create a flight plan, its length increases when fleet makes a turn: while for both "N1S" and "NEWS" fleet moves 2 tiles from shipyard and returns back, the first one is smaller and requires fewer ships.

Graph Encoder Architecture

If we had to work with standard representations of board, we would most likely use convolutional layers. There exists an analogue of convolutions in a graph world, called, big surprise, graph convolutions. Similarly to ResNet architecture we can build a ResGCN encoder:

Fig. 4. Bird’s eye view of graph encoder architecture. The summation of ResGCN input and output graphs is performed both over node and over edge features.

It is also extremely important to make our graph neural network anisotropic, which means that each neighbor should have a different effect on the node depending on the weight of the edge between them. The idea is that the neural network transforms the nodes in such a way that the encoded features of the nodes lying on the agent’s path are more similar to each other than to nodes that are not.

Fig. 5. ResGCN architecture. First, node features are encoded through several independent ResGCN blocks. Each block is a mapping $\mathbb{R}^d \rightarrow \mathbb{R}^{d/n},ドル where $n$ is a number of blocks. Outputs are concatenated together, constructing vector of the same size as the input. Then to get new edge features we pass outputs through feed-forward layers: $\operatorname{Query}, \operatorname{Key} \in \mathbb{R}^{d \times 3},ドル followed by element-wise multiplication and $\operatorname{tanh}$ activation.

ResGCN block consists of sequential graph convolutional layers:

\[\operatorname{GCN}(v_i, e) = \Theta^T \sum_{j\in\mathcal{N}(i) \cup \lbrace i \rbrace} \frac{e_{ji}}{\sqrt{\hat{d}_i \hat{d}_j}} v_j\]

with \(\hat{d}_i = 1 + \sum_{j \in \mathcal{N}(i)}e_{ji}\) and $\mathcal{N}(i)$ - set of all neighbors for node $v_i$.

Fig. 6. ResGCN block schema. GraphNorm layer normalizes node features over each graph in a batch.

Imitation learning

Now, we can train our network to imitate actions of best agents on a leaderboard. Each turn for each node with player shipyard on it, we have to decide for two things:

One can train a network with cross-entropy loss for the action and mean-squared loss for the ships number. However, due to the fact that flight plan length depends on discretized 2ドル \log$ of fleet size, we can end up with errors like predicting number of ships to be 20ドル$ with true number 21ドル$ and thus having maximum plan length of 6ドル.99$ and not being able to build a path with desired length of 7ドル$. To avoid this we must split our policy into multiple classes, each representing maximum flight plan length:

Here in total we have 16 classes, but this amount can be reduced or increased, depending on what engineer thinks is a reasonable number. To choose a fleet size we can set our target to be a ratio of ships to total amount of ships in a shipyard (or to maximum spawn number in case of ‘spawn’ action).

Fig. 7. Main policy heads. Encoded embedding is taken from node of current board graph representation, where shipyard is located on. This embedding is passed through linear layer to predict an expert action. In parallel, it is concatenated with one-hot representation of expert action and then goes through another feed-forward layers to predict expert spawn/launch ships ratio. On inference, $\arg\max$ of agent predictions can be used instead of expert action.

Finally, we face path generation task. The idea is that for each node, starting from shipyard, we take all of its neighbors and itself, and predict next node in a flight plan. If it chooses node itself, we convert ships to a new shipyard (we can mask such action if fleet size is less than 50 or if fleet is already on a shipyard). To make the prediction dependent not only on the current node, but also on the entire sequence of previously selected nodes, we can use recurrent layers. It is also important to consider amount of space we have left for path generation.

Fig. 8. Path policy head. We start at shipyard node. At each step all neighbor candidates, including node itself, are concatenated with one-hot representation of a ‘space left’ for plan generation. Then they are passed through recurrent layer. Received neighbor embeddings are passed through linear layer $\operatorname{Query}$ and multiplied with node features passed through $\operatorname{Key}$ layer. Resulting vector represents probabilities of each neighbor to be the next node. Then we move along path defined by an expert agent and repeat the same procedure.

Now having probabilities for each move, we can generate path by greedy or beam search, cutting path when it reaches shipyard or when there is no more space left. There is also a case, when path stuck in a loop like ‘NW999..’, and it can take forever until we won’t have enough space. For such thing it is useful to make a stop by some maximum-moves threshold.

Bonus: reinforcement learning

In my experiments, imitation learning was good enough for an agent to start making good moves, however, there was constant instability on inference. At some point an agent could make a ridiculously stupid move (with small probability, but nevertheless), effecting balance of the game dramatically. This move could’ve never been made by an expert agent, therefore we end up with board state which neural network didn’t see in training data. So the chance of making wrong moves increases at each step and finally imitation agent loses.

To tackle this issue I’ve tried off-policy reinforcement learning. The architecture of policy neural network was the same, critic neural network had similar but smaller encoder. Reward was +1 for winning, -1 for losing. I used TD($\lambda$) target $v_s$ for value prediction and policy loss with clipped importance sampling:

\[\mathcal{L}_{value} = (v_s - V_\omega(s))^2, \quad \mathcal{L}_{policy} = -\log \pi_\theta(a|s) \min \big(1, \frac{\pi_\theta(a|s)}{\beta(a|s)} \big) A_\omega(a, s),\]

where $A_\omega(a, s)$ is an advantage, obtained by UPGO method.

RL helped to stabilize inference, but sadly it didn’t go far beyond that. New agent was able to beat rule-based agents, which were on top of Kore Beta competition ~80% of the time.

Results and conclusions

Unfortunately, best of my agents were able to reach only up to 30-ish places, but still, it was a fun ride. What could’ve been done better? Here is my list:

Hopefully I’ll take this experience into account in the next Kaggle simulation competition.

Graph Neural Networks
kaggle reinforcement learning
This post is licensed under CC BY 4.0 by the author.
Share

AltStyle によって変換されたページ (->オリジナル) /