The Two Generals’ Problem (And What You Can Do About It)

02.03.20

 


On 6th September 2018, Deliveroo had a particularly bad day thanks to a bug in their app. Hundreds of customers received multiple identical orders or no food at all. Their system was in chaos.

But how did it happen?

It was a case of the Two Generals’ Problem, also known as the Two Generals’ Paradox, a classic issue that plagues computer systems.

Grab a coffee, put your feet up and engage your imagination for a moment. We need to tell you a story to explain this one.

The story of the Two Generals’ Problem

Imagine you live in the Land of Computing and you’re a general of the army. Your title is General A. You and your soldiers are strategically positioned in a valley to the west of the enemy’s fortified stronghold. General B is in the valley to the east. The only way you can defeat the enemy is if your soldiers and General B’s soldiers make a coordinated attack from opposite sides of the valley. But in the Land of Computing, there is only one way to communicate with General B to set a time for the attack. You have to send messengers on a high-risk path through enemy lines.

Now we have a problem, General. There’s no way of knowing whether your messengers have completed their mission safely. If General B sends a confirmation message back to you, your runners must return through enemy territory. They might not make it, but General B won’t know that. He’ll go ahead with the attack at the agreed time while you, still waiting for his confirmation, hold your position. He attacks alone and is defeated.

What’s a General to do? It doesn’t matter how many messengers you send back and forth; the problem remains the same, because every confirmation needs a confirmation of its own. It’s unsolvable. In the Land of Computing, there’s no radio communication, carrier pigeon, binoculars or semaphore to help you with this operation.

The Real-Life Paradox

Let’s apply this situation to two computers. When messages can be lost in transit, there is no way for Computer A and Computer B to guarantee that the data they exchange is both received and acknowledged. So, sometimes that data gets captured (lost) and doesn’t reach its destination, and the sender can’t tell whether it was the message or the acknowledgement that went missing. It’s a simple problem, but it takes a clever engineer to find a workaround.
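To make that concrete, here is a minimal sketch in Python of an unreliable channel. Everything in it (`send_order`, `drop_request`, `drop_ack`) is a made-up name for illustration, not anything from a real system. It only aims to show that, from Computer A's side, a lost message and a lost acknowledgement look exactly the same: silence.

```python
from typing import List, Optional


def send_order(order: str, received: List[str],
               drop_request: bool, drop_ack: bool) -> Optional[str]:
    """Send one order over an unreliable channel (hypothetical model).

    Returns "ack" if the acknowledgement makes it back to A, otherwise None.
    """
    if drop_request:
        return None  # the messenger was captured on the way out
    received.append(order)  # Computer B processes the order
    if drop_ack:
        return None  # the reply was captured on the way back
    return "ack"


# Case 1: the request is lost, so B never sees the order.
b_log_1: List[str] = []
result_1 = send_order("special fried rice", b_log_1,
                      drop_request=True, drop_ack=False)

# Case 2: only the acknowledgement is lost, so B has processed the order.
b_log_2: List[str] = []
result_2 = send_order("special fried rice", b_log_2,
                      drop_request=False, drop_ack=True)

# From A's side the two cases are identical: no reply either way.
print(result_1, result_2)          # None None
print(len(b_log_1), len(b_log_2))  # 0 1
```

If Computer A reacts to the silence by sending the order again, the first case is repaired but the second produces a duplicate.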

This is what happened with Deliveroo. People ordered their meals and received confirmation. Happy days. Then the app sent them a message to say that there had been a problem and the order hadn’t gone through. Some people, probably the computer geeks who know a bit about apps, ignored the error message and checked their order summary. Seeing that the order had been processed, they sat tight in the hope that their food would arrive. Others placed their order again and again until they realised they were getting nowhere. What they didn’t realise was that Deliveroo was receiving and processing everything as usual. It was an error in the logic pathway of the app that was telling customers the process had failed. Computer B (Deliveroo’s system) couldn’t confirm receipt of the message from Computer A (the customer’s device).

Then human error entered the mix. Some restaurant staff recognised that there was a problem. They either cancelled all orders, so their customers didn’t receive anything, or rectified the situation by cancelling and refunding duplicate orders. Others didn’t pick up on the issue at all. Deliveroo’s customer services line was on fire with complaints from angry, hungry customers.

Most of us have fallen foul of the Paradox at some point. Perhaps you haven’t ever ordered one special fried rice from Deliveroo and received seven, but you’ve probably had a print job that disappeared into the ether. It’s the same issue at work. In a digital world where signals can be interrupted frequently (not necessarily maliciously), this is a real problem for many businesses.

What’s the answer?

The problem doesn’t have a solution, but there are processes that businesses can put in place to remain operational when Computer A fails to get a message to Computer B.

There are a couple of options. You could apply an idempotency token. Simply put, each file (message) is given a unique ID. If Computer B receives multiple messages with the same ID, it only counts one of them. Alternatively, you could set up content scanning to check that everything Computer B receives is new information. Whichever safeguard your engineers use, make sure they sacrifice a few files to enemy territory and test the system.
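Both safeguards can be sketched in a few lines of Python. This is an illustration under assumptions, not how Deliveroo or any particular system does it: the class names `IdempotentReceiver` and `ContentScanningReceiver` are hypothetical, the "database" is just an in-memory dictionary or set, and the content scan is simplified to a SHA-256 hash of the message text.

```python
import hashlib
import uuid
from typing import Dict, Set


class IdempotentReceiver:
    """Counts each idempotency token once, however many copies arrive."""

    def __init__(self) -> None:
        self.processed: Dict[str, str] = {}  # token -> order

    def receive(self, token: str, order: str) -> bool:
        """Return True if the order was processed, False if it was a duplicate."""
        if token in self.processed:
            return False
        self.processed[token] = order
        return True


class ContentScanningReceiver:
    """Rejects any message whose content has been seen before."""

    def __init__(self) -> None:
        self.seen_digests: Set[str] = set()

    def receive(self, order: str) -> bool:
        """Return True if the content is new, False if it has been seen."""
        digest = hashlib.sha256(order.encode("utf-8")).hexdigest()
        if digest in self.seen_digests:
            return False
        self.seen_digests.add(digest)
        return True


# Computer A creates the token once and reuses it for every retry.
token = str(uuid.uuid4())
by_token = IdempotentReceiver()
token_results = [by_token.receive(token, "special fried rice") for _ in range(7)]

by_content = ContentScanningReceiver()
content_results = [by_content.receive("special fried rice") for _ in range(7)]

# Seven copies arrive, one order is counted in each case.
print(token_results.count(True), content_results.count(True))  # 1 1
```

One design point to weigh up: a pure content check can't tell a retry from a customer who genuinely wants the same thing twice, whereas a token is tied to one attempt, so a deliberate second order simply gets a new token.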