The story of the non deterministic Replay

This is the story of how I discovered my simplified replay system wasn’t so deterministic as I believed because I had an ugly bug, but read the post if you want to know where exactly.

While integrating the lockstep engine on the game I am working on, I decided to do something to save and load replays to be able to easily reproduce some bugs I was experimenting. After I had that done and working, it was pretty awesome to see I can replay the same game multiple times (food for another blog post by the way), however, I thought it could be fun and easy to play them faster, why not.

Since I have a fixed time step logic, it should should be pretty straightforward, simply use a multiplied time and then the fixed timestep logic would do all the work and the game logic shouldn't notice the change. I decided to give it a try and it worked…. almost, when playing the replays at higher speeds I noticed some visual differences but I wasn’t totally sure (it could be interpolation code).

To verify, I went back to the test project, where I had the moving box, and test it there, but I needed some way to be sure. Since I have already a way to calculate checksums of the game state, I used that to verify the game states when playing replays at different speeds (from 2x to 16x).

It failed, even though it only failed to validate some frames, following frames were not necessarily invalid (this is something important to consider).

Image 1: It shows one of the best tools in the world to check game states when replaying a game.

So, I was right, I saw some differences, I could be sure that something was happening. The thing was, with only the checksums I couldn't know what the real difference was. Next step, making something to detect it.

In order to do that, I had to change to start saving (at least for debug) the game state, not only the checksum, and to check differences between stored game states in the replay and the current game state (when replaying the game) when checksum validation fails. It worked too, now I have the exact place where the differences are.

Image 2: It shows why serializing all the game state in one string is the best thing to do in your life.

After testing it a bit, I noticed another curious thing, the game validation wasn’t always failing given the same replay and the same speed. That gave me a hint that the problem was probably not related with the game code itself (the moving box).

So, if I played the replay at 1x, it was validated properly. If I played the replay at 8x, it failed, most of the time, but not always. So, it seems there is something related with speed I don’t understand yet.

I decided to test the same replay but with Unity timescale modified, my first test was using 1x for replay but 5x for the timescale, validation failed, then the opposite, 10x for replay but 0.1x for timescale, and it worked well. So the problem seems to be related with my accumulator logic inside the fixed timestep logic?

Some test cycles later, it turns out that, it was indeed a bug in one of the core classes of the engine!

The problem was on my class LockstepFixedUpdate, the first version was overriding the Update() method and performing lockstep logic, it worked ok if only at most one fixed update is processed, but in the case a big delta time arrives, it only process lockstep logic once for the first fixed update and never again.

That means that, in case replay commands were to be processed in frame 3, we are in frame 1 and a big dt of 10 frames arrives, then lockstep logic checks only in frame one and never again until all 10 frames were processed. This bug even bypass the lockstep!

Since I made a test to replicate the bug, it was really easy to fix it, I changed to process the lockstep logic with each fixed step updates and it works fine now, I have high speed replays!! YEAH!!

Conclusion

In the process of finding this bug I started to expand the engine support and create better tools, this is really important if I want to build something solid over it.

The only way to detect issues as soon as possible is to iterate over the engine as soon as possible and to do that, use cases are needed and games provide the best use cases. In my case, I am using not only the game I am trying to make but also other similar games as references when deciding what I want and how I want to test it, for example, having replays, being able to play the replay at different speeds, being able to save the replays, etc. Also, being able to replicate a bug in a small test case where you can iterate quickly to fix it is super useful.

Detecting (and having) problems like this in a small and simple game gives the idea of the complexity of a medium to big game, all the variables and the difficulties, it is not something to underestimate, so when developers say they couldn't add multiplayer features to their game because it was really hard to do it, it is not a lie.

I love all of this stuff, even though I understand it is not an easy path.

To complete this post, here is a video showing a prototype of how I load and play a replay which was created by playing with two players in LAN, one was my computer and the other was my phone:

The quote of the day is 'Fail as much as possible, as soon as possible to avoid failing when it is too late'.

Hope you enjoyed the journey.

VN:F [1.9.22_1171]

What to consider as game state when validating a simulated synchronization.

One really important thing when performing a synchronized simulation is having a way to check if all clients have the same simulation state. In order to do that, it must be clear which parts of the game state are important for the synchronized simulation and which parts aren’t. Obviously, this depends a lot on the game.

The idea of this blog post is to talk briefly about what is normally an important value for a game, and what isn't.

What is not important for the game state

Let's talk first about what is not important for your game. Even though it depends on the game, there are some common things that shouldn’t be considered as part of the important game state.

These things are normally all the stuff that doesn’t affect the game outcome, in other words, all the stuff that could be removed from the game and the game will be the same. For example, the smoke particles coming out from a bonfire, or even the current playtime of fire sound effect.

Also, things that could be derived from important gamestate even though they could affect the outcome in some way. For example, the units health bar current fill percentage could be derived from the unit health, that value could be used by the game logic but since it is directly derived from other values, it is not considered important.

What is important for the game state

All the things that could affect directly the game outcome are important for the game state. If you have a way to restore the game from a saved state, you have to be able to finish the game the same way each time you restore it.

For example, in a game like Doom, if you save and load you have to have the same health, armor, weapons and ammo, you should be in the same world position you were and enemies too, with same health, etc.

In Starcraft, what is important is your current vespene gas and your minerals since that decide what you can build or not, your current units with their health, your structures and their queued orders, the revealed map, etc. Things that are not so important include the blood in the ground when your units die (if you are playing Terran of course), your current units selection, the exact frame of a unit animation (unless the game depends on that to fire a skill).

If you are just saving the game state to be restored locally, you may want to store more stuff that you need since that shouldn’t matter. However, in the case of checking state synchronization between machines, having stuff that could differ and doesn’t affect the game outcome is not a good thing to have since you are not only wasting bandwidth but also risking the synchronization check.

Next time

Currently I am working on a way to represent the game state of a game I am working on and I want to have both features, a way to check synchronization between machines as well as having a way to save and restore the game state. I hope to talk about the benefits and problems of those features in the next posts, as well as having a more detailed information on the concepts of this blog post too.

Even though this blog post was a bit basic, I hope you liked it. I just wanted to talk a bit about the challenges I am dealing with right now.

VN:F [1.9.22_1171]

Lockstep multiplayer first steps

In this Gamasutra article, Tundra developers explain their approach to minimize problems when developing a lockstep multiplayer platform used for their game Rapture - World Conquest. Based on that, my first approach was to create and understand a deterministic lockstep logic without even considering networking yet, to focus on one problem at a time.

My objective is to have a reproducible game that given the start state and all the players actions during time, the game can be played again and the final state will be the same.

Lockstep logic

The first step was start by creating a logic to encapsulate the fixed game state for physics and game logic, and a lockstep to perform player actions. As I started without considering networking yet, all the player actions are directly enqueued locally by the user input or by a saved replay.

A lockstep logic means that there are some conditions that, if not met, the game should pause the simulation and wait for them. In the case of multiplayer games this is when the game have to wait for other players actions that didn't arrive in time (and where the waiting for other players dialog is shown in some games).

My code looks something like this pseudo code:

```update(dt) {
if (lockstepLogic.IsLockstepTurn()) {
return;
lockstepLogic.Process();
}
// normal accumulator logic for the fixed gamestep
}```

Since I create all player actions locally, the lockstep never happens right now but it is a good practice to simulate and test it anyways.

Testing it

My first prototype using this logic is a box that moves over the screen. When the right mouse button is pressed, the game enqueues a player action to move the box to the specified position, then when the lockstep logic is processed the box receives the command and start moving to that position.

Interpolation

The moving logic was pretty simple, based on start and destination positions and a speed, the box moves itself to the destination in a straight line. Since that logic is performed in each fixed gamestep, the player see the box jumping between positions, it works but it doesn't look so good. To improve that and make the movement smoother I created a simple interpolation code.

Interpolation depends a lot on what you are trying to interpolate, in some cases could be a simple as the code I used there but there are other cases like the bouncing ball which need more data to create a better interpolation.

Note: adding interpolation means we have now a delay of one fixed gamestep between the box position being rendered and the real position in the game since we are using previous and current positions info for the interpolation.

My first replays

By recording all the player actions with the fixed gamestep frame they were executed, I created a basic way to replay the "game" by just reseting the game state to the initial state and start enqueing all the recorded actions in each corresponding frame.

Here is a video of the progress so far:

Yeah, I know, it's not a great thing, but at least I am starting to understand some of the challenges of making multiplayer games.

Validating simulation

To validate the game is always performing the same simulation given the same input, I added a checksum calculation based on the game state (for now just the moving box state) which is saved from time to time to use later as validation when simulating the game again. The idea was to start defining an API to get the important game state to consider when validating the simulation, and also to start testing game state validation. The code looks like this:

```update(frame) {
if (IsChecksumFrame(frame)) {
if (recording) {
checksumRecorder.RecordState(frame, CalculateChecksum());
} else {
if (!checksumValidator.IsValid(frame,
checksumRecorder.SavedChecksums, CalculateChecksum())) {
throw Exception("current game state is not valid!");
}
}
}
}```

In my first tests I wasn't reseting the game to the initial state properly so my game was producing different checksums when reproducing the saved player actions, checksums validation started to work after that was fixed.

How am I calculating the Checksum? Right now I am not sure exactly what algorithm to use nor which game state should I consider for the checksum, so what I did was to encapsulate that in some interfaces and implemented a basic way to get both. For the checksum I am using a simple MD5 over a string, and for the game state, well..., a string with all important values concatenated (the moving box position, destination, if it is moving or not, etc).

```string CalculateChecksum() {
string gameStateValue = game.GetGameState();
return MD5.CalculateHash(gameStateValue);
}

string GetGameState() {
string gameState = "";
foreach (object in gameObjects)
gameState += object.GetGameState();
return gameState;
}```

For a future implementation what I know is that the game state should be composed with important values that can affect the simulation, so I shouldn't care about about audiovisual stuff like particles, effects and sounds.

Also, the game state concept could be used to reset to the initial game state or even for saved game states (to easily replicate some bug for example), and finally, it could be used to synchronize and validate state if for some reason I end up using a client/server architecture.

Next frontier: Determinism?

For now I didn't explore determinism realm because the solution really depends on the game logic but at the same time I must have it clear before starting the game code. One of the next steps is probably start testing with fixed point math, not sure yet, the idea is try to follow an approach similar to the gamasutra article's of reducing non determinism problem to the minimum before going multiplayer.

If you want to take a look all the code used for this blog post, here is the link.