7

Say I've got a third-party service called Cool.io that provides a RESTful API but often goes down. My applications consume that API, but when Cool.io goes down, my app really can't do much... but it should!

Say I create a proxy to Cool.io; this proxy stores a copy of Cool.io's data and provides RESTful endpoints for my application. Let's assume that my proxy has 100% reliability, or at least better reliability than Cool.io. Let's also assume that my proxy has no problem propagating PUTs/POSTs to Cool.io, and no problems receiving the same from Cool.io (in other words, there are no data sync issues between my proxy and Cool.io).

This decouples my application from Cool.io, which goes down all the time... but it doesn't decouple the proxy from Cool.io. As such, when my application modifies something on the proxy, the proxy's data changes and the proxy tries to push that change to Cool.io.

But, if Cool.io is down, that request just results in an unavailability error (or a failed connection).

What's the proper way to architect something like this? How do guys like Amazon or Netflix decouple their services such that an outage doesn't affect the consumer, or if it does, it does so gracefully?

Is a message queue in order here?

Is there a simpler solution than creating a proxy to the unreliable service?

asked May 25, 2012 at 20:38
4
  • What type of outages is Cool.io experiencing? Is it down for a few minutes a few times a day? Does the network just occasionally lag? Does Cool.io have a database that routinely goes down, returning weird errors every few weeks and staying down for hours at a time? Commented May 25, 2012 at 20:46
  • Is Cool.io write-mostly, read-mostly, or an exactly even split? When it is down, do all relevant clients see it as down? What are your consistency needs? In the case of Amazon, they likely scale out on this service so that loss of any one instance does not affect the service adversely. Can your apps degrade gracefully (i.e. implement partial functionality while Cool.io is unavailable)? Commented May 25, 2012 at 22:24
  • @aceinthehole Intermittent outages for maintenance, sometimes for hours. Commented May 27, 2012 at 14:52
  • @JamesYoungman More read-intensive. Consistency is important but not as critical as availability. As for graceful degradation, that's something I need to figure out... but not really. For example, my app depends on fetching a collection of data from the Cool.io service, and without this data it can't really do much. Commented May 27, 2012 at 14:55

3 Answers 3

5

I think you answered it yourself: a message queue. You need your proxy to be able to wait for the service to come back and then resubmit your requests. So, push the requests into a queue. Have the queue manager regularly ping the service to check its status, and send the messages from the queue when it's up. Don't remove a message from the queue until you get a positive response that the service processed it.
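A minimal sketch of that loop in Python, assuming a `send_fn` that returns a positive acknowledgement and a `ping_fn` for the status check (both names are hypothetical; a real system would use a durable broker rather than an in-memory deque):

```python
from collections import deque

class RetryingProxyQueue:
    """Sketch: hold requests until the backend is reachable, and only
    drop a request after a positive acknowledgement from the backend."""

    def __init__(self, send_fn, ping_fn):
        self._queue = deque()
        self._send = send_fn    # callable(request) -> True on confirmed success
        self._ping = ping_fn    # callable() -> True if the backend is up

    def submit(self, request):
        self._queue.append(request)

    def drain(self):
        """Call periodically (e.g. from a timer). Sends queued requests
        in order; stops at the first failure so ordering is preserved."""
        if not self._ping():
            return
        while self._queue:
            request = self._queue[0]
            if self._send(request):
                self._queue.popleft()   # remove only after confirmed success
            else:
                break                   # backend flaked mid-drain; retry later
```

The key property is that `popleft()` happens only after a positive response, so a crash or outage mid-send leaves the request queued for the next drain.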

answered May 25, 2012 at 20:49
3
  • Depending on the app, it may also be necessary for the proxy to serve a view of the data which is consistent with the updates queued in the message queue (since the client thinks they have been applied). This is hard, especially with geographically widely distributed clients which are performance-demanding. Commented May 25, 2012 at 22:22
  • @JamesYoungman True. I wouldn't attempt to do what you suggest unless I absolutely had to--especially seeing that the remote service might disallow an action when it comes back up. That said, you still have to decide how to handle duplicate requests added to the queue while waiting for the service to come back. Commented May 25, 2012 at 22:27
  • @MatthewFlynn Sounds like I need to educate myself on implementing a messaging layer. Commented May 27, 2012 at 14:56
2

Since your service is read-intensive, using a caching proxy sounds like a good idea. But beware: the more you try to retain the semantics of the original system when the backend is down, the more complex your system becomes. And usually, the more surprising its failure modes are to the end-user. Both of those factors will often motivate a decision to adopt a simple read-only proxy.

Even a read-only proxy should return a timestamp with the data, to indicate when the data was last fresh. For HTTP you can encode this with Last-Modified/If-Modified-Since or an ETag; for other systems it will depend on the protocol you are using. Refusing to serve very stale data from the proxy is a choice you might, but don't have to, make.
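A rough sketch in Python of what such a freshness-aware cache entry might look like; the class name and the `max_age_seconds` knob are hypothetical, and a real proxy would hang this behind its HTTP layer:

```python
import time
from email.utils import formatdate

class ReadOnlyCache:
    """Sketch: a read-only cache entry that remembers when it was last
    fresh, so the proxy can report staleness and optionally refuse to
    serve very old data."""

    def __init__(self, max_age_seconds=None):
        self._data = None
        self._fetched_at = None
        self._max_age = max_age_seconds  # None = always serve, however stale

    def refresh(self, data):
        """Called whenever the proxy successfully reads from the backend."""
        self._data = data
        self._fetched_at = time.time()

    def get(self):
        """Return (data, headers), or raise if the copy is too stale to serve."""
        if self._data is None:
            raise LookupError("no cached copy available")
        age = time.time() - self._fetched_at
        if self._max_age is not None and age > self._max_age:
            raise LookupError("cached copy is too stale to serve")
        headers = {
            # Tell clients when this copy was last known to be fresh.
            "Last-Modified": formatdate(self._fetched_at, usegmt=True),
            "Age": str(int(age)),
        }
        return self._data, headers
```

Whether `get()` refuses stale data or serves it with an honest `Age` header is exactly the policy choice described above.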

In your application layer, you will need to decide what to do when the user wants to perform a mutation-like action on data of age T seconds.

I think it is normally best to just accept the change, and return an error if the (Cool.io) backend failed to process the request. If you choose instead to decline without trying, you will hit cases where the pre-mutation check finds the backend in one state but the attempt to actually apply the change finds it in another. That situation is hard to cover in your system's regression test suite, so my advice is not to build a system that tries to do this.

If the backend cannot apply your change, your system could treat this error as final or it could offer to try to apply the change when the service returns.

As a user-experience optimisation, when you know that the proxy has been unable to get service from the backend, you can display warnings in the user interface, so that the user can avoid time-consuming data entry only to find the backend is down.

If you do offer to apply a failed change later, the changes the user wanted to apply but which could not be applied in the short term will need to be stored. You can store them in a queue, but as @matthew-flynn pointed out, you will need to handle duplicate (perhaps conflicting) queued changes. Hence you will probably need to "queue" the changes in a queryable way, such as in a database table of unreconciled changes. The simple thing to do there is to reject new changes to data whose earlier changes haven't yet been applied to the backend. Otherwise, a failure to apply one change to the backend may require more than one user-level change to be rejected.
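One way the "queryable queue with the simple rejection policy" might be sketched, using an in-memory map standing in for the database table of unreconciled changes (all names here are hypothetical):

```python
class PendingChanges:
    """Sketch: store deferred changes in a queryable structure so the
    proxy can reject a new change to a record that already has an
    unreconciled change waiting for the backend."""

    def __init__(self):
        self._pending = {}   # record_id -> change payload

    def submit(self, record_id, change):
        if record_id in self._pending:
            # The simple policy: refuse to pile a second change on data
            # that hasn't really reached the backend yet.
            raise ValueError(f"record {record_id} has an unreconciled change")
        self._pending[record_id] = change

    def reconcile(self, apply_fn):
        """Try to push each pending change to the backend; keep the ones
        that still fail, for user-level reconciliation later."""
        failed = {}
        for record_id, change in self._pending.items():
            if not apply_fn(record_id, change):
                failed[record_id] = change
        self._pending = failed
        return failed
```

Because the pending set is queryable by record, the rejection check in `submit` is a single lookup rather than a scan of an opaque queue.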

If it is possible for a queued change to fail to apply you are going to need to provide some kind of functionality in which the user reconciles the failed changes.

A particularly interesting case is where the backend has just come up and another user has submitted a conflicting change, live. That is, a new change has "overtaken" a queued change. You might consider blocking all changes by users when there are queued pending changes. One way to achieve that is for all changes, live or deferred, to use the same queue. If you do that, be very sure that problem changes cannot get stuck at the head of the queue.

As you will note from the above, all this clearly requires changes in the semantics of the application. You can't just do it invisibly in a proxy layer, unless you reject mutations that couldn't immediately be applied to the backend. And if it makes a difference to the user whether the data is fresh or not, you may need to warn them that they're looking at stale data.

You also asked how large services deal with this. One of the popular ways is for the backends to be sharded by user-id so that if a given part of the service is down, only some users are affected. This is easy to do for things like serving static data (which mostly won't care who you are) but much harder for services in which users have N-to-N relationships (for example things like Twitter - though in the case of Twitter the complexities around failed mutations are mostly absent).

answered May 27, 2012 at 17:26
0

Your services should be autonomous, and they should communicate with each other via event-based messages. Besides using a queue-based solution, your data should be decentralized. And avoid service orchestration.

In order to achieve that with the minimum amount of pain, decompose your system into services along your business capabilities. There are different ways to identify them, but probably the most straightforward one is treating your services as the steps your business walks through in order to deliver business value. Here is an example of using this technique.
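As an illustration only, a toy in-process event bus shows the shape of the idea; a real system would use a durable broker, and every name below is hypothetical:

```python
from collections import defaultdict

class EventBus:
    """Sketch: autonomous services communicate via events instead of
    calling each other directly, so a dead or failing consumer does
    not take down the producer or the other consumers."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        delivered = 0
        for handler in self._subscribers[event_type]:
            try:
                handler(payload)        # each service reacts on its own
                delivered += 1
            except Exception:
                pass                    # one failing consumer doesn't stop the rest
        return delivered
```

The producer never waits on, or even knows about, its consumers; that isolation is what "autonomous, event-based, no orchestration" buys you.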

answered Nov 14, 2017 at 19:19
