Friday, 26 August 2011

Designing for Database Replication

If you are writing a web application by yourself or in a small team, the temptation is usually to get it finished as soon as possible: even small sites take a while to build, and finishing brings a sense of satisfaction. The problem is that many things aren't really considered early on. Sites like Twitter didn't expect so many people to sign up so quickly, and if the site hadn't been able to scale, people would have got fed up and left or gone to a rival site. We should be thinking about scalability early in the design and production cycle.
You might decide that, due to the nature of the system, it is unlikely ever to exceed a certain size. Perhaps it is an internal company site that will never have more than, say, 1000 users doing simple things. If so, fine. However, any public site runs the risk of becoming popular, and an increase in user numbers means an increase in database, memory, network and processor load. In that case, you should design in the ability to scale as early as possible, even if your site will not be scaled initially.
One realistic possibility is that you will need multiple database servers, and if you need the ability to write to more than one of them at the same time, you will need some way to replicate the data. How difficult this is depends largely on the design and, with some forethought, it might not be too bad.
For example, take the most basic replication issue: multiple inserts into the same table. Suppose you have an integer ID column (which you should have in some form) and you create user John Smith with ID 1000 in DB1 while someone else creates user Michael Jones with ID 1000 in DB2. What happens when these databases replicate? Here we might presume that the different names are different people, but it would be perfectly possible for two John Smiths to create accounts at the same time (in the gap between replication cycles). In this scenario, the best idea is to use a GUID for the ID instead of an integer. Since GUIDs are globally unique, you would never end up with two users sharing ID 1000 on two different databases. The same applies to foreign keys, which would also use the GUID and so would not be in danger of linking to the wrong row when copied from one table to its counterpart in the second database.
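As a minimal sketch of that idea in Python (the users table and column names are invented purely for illustration), generating the ID as a GUID on insert means two servers can hand out IDs with no coordination at all:

    import sqlite3
    import uuid

    # Invented schema: the primary key is stored as a GUID (UUID) rather than
    # an auto-incrementing integer, so rows created on different servers can
    # never collide on the same ID.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE users (
            id   TEXT PRIMARY KEY,   -- GUID, globally unique across all databases
            name TEXT NOT NULL
        )
    """)

    def create_user(conn, name):
        # uuid4() is generated locally with no coordination between servers,
        # yet the chance of two servers producing the same value is negligible.
        user_id = str(uuid.uuid4())
        conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (user_id, name))
        return user_id

    print(create_user(conn, "John Smith"))  # an ID that cannot clash with one created on DB2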
Updates can be trickier. If someone sets a piece of data in the row for John Smith in DB1 at the same time as someone changes it in DB2, replication will do one of two things. If different fields have changed and this can be detected, it will simply take both changes and merge them. If the same field has changed, or there is no way for your database to detect which field changed, you have the following options:

  1. Flag it to a user who needs to manually decide what to do
  2. Accept the low risk of this happening and simply take the latest change as the correct one (see the sketch after this list)
  3. Create a manual system which records the changes so they can be automatically or manually merged
  4. Apply a design which avoids the chance of this happening.
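A minimal sketch of option 2, assuming each row carries an updated_at timestamp recorded when the change was made (the row layout and field names here are invented for illustration):

    from datetime import datetime, timezone

    def resolve_conflict(local_row, remote_row):
        # Last-write-wins (option 2): whichever version of the row was changed
        # most recently is kept wholesale; the older change is silently dropped.
        if remote_row["updated_at"] > local_row["updated_at"]:
            return remote_row
        return local_row

    # Two replicas changed the same field in the gap between replication cycles.
    db1_version = {"id": "same-guid", "surname": "Smith-Jones",
                   "updated_at": datetime(2011, 8, 26, 10, 0, tzinfo=timezone.utc)}
    db2_version = {"id": "same-guid", "surname": "Smith",
                   "updated_at": datetime(2011, 8, 26, 10, 5, tzinfo=timezone.utc)}

    print(resolve_conflict(db1_version, db2_version))  # keeps the 10:05 change

The appeal of this option is that it needs no human intervention, but be clear that the older change really is thrown away.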
Designing these scenarios out is really always best because it means less maintenance. In our example, suppose we run a site like Facebook. One way to avoid the problem is to allow writes on only a single database server and make the others read-only. A second option would be to ensure that updates to user "John Smith" are only permitted on a specific 'home' database, in which case multiple changes are applied sequentially there and then replicated correctly to the other databases.
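The 'home' database idea might look something like this: a rough sketch that routes all writes for a user to one server chosen by a stable hash of their GUID (the server names are made up):

    import hashlib

    # Invented pool of database servers that accept writes.
    WRITE_SERVERS = ["db1.internal", "db2.internal", "db3.internal"]

    def home_database(user_id):
        # A stable hash of the user's GUID means every write for John Smith
        # lands on the same server, is applied sequentially there, and is then
        # replicated outwards, so concurrent edits cannot conflict on replication.
        digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
        return WRITE_SERVERS[int(digest, 16) % len(WRITE_SERVERS)]

    print(home_database("0f8fad5b-d9cb-469f-a165-70867728950e"))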
Really, the right option depends heavily on the type of application you are creating. You might need to perform some load testing on your environment to find out where it is stressed under high load. Don't spend time improving things that are working OK.

Tuesday, 23 August 2011

Disaster Recovery

It's quite fashionable nowadays to talk about DR (disaster recovery), or what happens when something goes wrong. Many people, though, only think about backups and the money to buy a new server if one breaks, but DR has to encompass much more than this. It is all very well having backups, but do you ever restore them to make sure they work? There is too much at stake to assume it will be alright. What happens after you make an upgrade to your database servers or web servers? Do you restore the backups again to make sure they still work? You might have UPSs on your servers, but what happens when the power goes off? Do you assume it will not be off for more than 5 minutes? Do your UPSs shut the servers down cleanly, or do you just wait for the UPS to run out and the servers to die? What would happen if your offices burned down? Would you be able to continue business, or would all of your important information be lost?
Fortunately, it isn't rocket science. You simply need to perform a fault analysis of every element in your network (not every workstation, but a representative workstation), servers, power supplies, UPSs, backups, internet connection etc., and then consider what would happen, how you would know and what you would do about it.
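As a rough illustration only, such a fault analysis can be as simple as a table of element / what happens / how we'd know / what we'd do. The entries below are invented examples, not a complete list:

    # Invented examples only - a real register would cover every element you rely on.
    fault_analysis = [
        {"element": "database server",
         "what happens": "site goes down",
         "how we know": "monitoring alert / users phone in",
         "what we do": "restore the last verified backup onto the standby box"},
        {"element": "UPS / mains power",
         "what happens": "servers lose power once the batteries run out",
         "how we know": "UPS management software raises a shutdown warning",
         "what we do": "trigger a clean shutdown, bring servers back up when power returns"},
        {"element": "office building",
         "what happens": "on-site kit and paperwork destroyed",
         "how we know": "obvious!",
         "what we do": "restore off-site backups and work from another location"},
    ]

    for entry in fault_analysis:
        print(entry["element"], "->", entry["what we do"])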

Finding code faults

I read a great computer science book once (whose name I can't remember!) which talked about having certain 'rules' when we write software, and one of them was: "when a defect occurs, it should be obvious where the fault is and it should only need fixing in one place". This rule implies encapsulation and, more importantly, well-defined and thought-out fault handling.
Ignoring the debate of exceptions vs other error-handling mechanisms, it is obvious that most software still does not have a very well thought out system of fault handling. I have to support a live system, and many defects that occur are related to specific scenarios and are extremely difficult to track down. Some of them we assume are environmental because we cannot see any code path that would make sense, yet all of these should be obvious: if something encounters an unexpected scenario, it should log it or display an error (or both). I think the underlying problem is that we might review code for neatness and for the main logic being correct, but we rarely review code looking for what might go wrong. How often have you asked what errors a service call might throw (does the developer even know?) and how you are going to handle them?
We can save hours or even days with a simple error message like "X failed because the Y object was not populated", but we need a process to ensure this happens. Leave it to people and they will forget, or not bother because of the pressures of implementing functionality.
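As a small sketch of the kind of explicit failure I mean (the Order class and field names are invented), the idea is to fail loudly at the exact point an assumption breaks, rather than letting a vague null-reference error surface three layers up:

    class Order:
        # Invented domain object - stands in for any object your code depends on.
        def __init__(self, customer=None):
            self.customer = customer

    def dispatch_order(order):
        # Fail loudly and specifically at the point the assumption breaks.
        if order.customer is None:
            raise ValueError("dispatch_order failed because the customer object "
                             f"was not populated (order={order!r})")
        # ... the real work would go here ...

    try:
        dispatch_order(Order())
    except ValueError as err:
        print(err)  # in a real system this would go to the log, not the console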
The other big saver is making a method either a sequence of calls to other methods or a single piece of functional work, not both. This allows sequences to be reviewed easily by eye and keeps the functional methods small, concise and easier to refactor.
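Here is a rough sketch of that split, with invented function names: the orchestrator reads as a plain sequence of steps, while each worker does one small piece of functional work:

    def process_application(application):
        # Orchestrator: nothing but a sequence of calls, so the flow can be
        # reviewed at a glance.
        validate(application)
        score = calculate_score(application)
        store_result(application, score)
        notify_applicant(application, score)

    def calculate_score(application):
        # Worker: one small piece of functional work, easy to test and refactor.
        base = 100 if application.get("employed") else 50
        return base + min(application.get("deposit", 0) // 1000, 50)

    # The remaining workers would follow the same single-purpose pattern.
    def validate(application): ...
    def store_result(application, score): ...
    def notify_applicant(application, score): ...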

Sunday, 14 August 2011

Time outs and Application Integrity

When we design software systems, we tend to think of perfect real-world scenarios where all connections succeed, where services are always available and where the network is always up and running quickly. Sadly, many of us don't really consider what happens when some of these are not true at any given point. Perhaps in our system it is acceptable to simply allow an error or timeout and for the user to hit refresh (as long as this actually works!), but for some systems, especially ones that require many screens of information to be navigated and saved into the session, this might not be adequate.
The company I work for writes a system to apply for mortgages. Various screens of information are captured and various service calls are made along the way. We had a defect raised the other day which appears to have occurred because something took slightly too long and the page displayed an error. The problem was that the system was then left in an indeterminate state. The user interface was happy to try the screens again, but the underlying system had already completed what it was supposed to do and therefore would not return a 'success' code. There was no easy way to fix the problem; it would have required some direct hacking of a database to force the case through. In the end, the customer simply re-keyed the whole case, but this is not always easy and is not popular if it happens too frequently.
This was a case where we hadn't properly thought through the possible error conditions. In our case, the risk is increased by the number of people using the web app, so it is not surprising that these things happen. Not only does it not look good, but it takes time and money to investigate the problem even if we end up not being able to fix it. It is also not great, as an engineer, to be guessing at the cause in the absence of any other discernible problem.
The solution, as with all good design, is to consider every aspect of the workflow that is outside of our control. This means the server(s) hosting the various parts of the system, the network, and any third-party calls to services or other apps. We need to consider not only the timeouts, which in our case we do, but also what happens if something times out at roughly the same time the underlying system completes successfully. This is especially true if you are using multi-threaded applications (which for web apps you usually will be). Ideally, your system should be able to go back to a known-good state from all pages, but certainly from those which are higher risk (the ones which call onto third-party services etc.): cancel and delete all current activity and take the user back to a place that you know the system can recover from. This might be slightly annoying for the user, but much less so than a full-on crash. You can also mitigate the risk by displaying a user-friendly message such as, "the third-party service was unavailable. Please press retry to re-key the details and try again".
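As a rough sketch of that idea (the service URL and endpoints here are placeholders, not a real API), a timeout is treated as 'unknown' rather than 'failed', and the system checks whether the call actually completed before deciding whether to offer a retry:

    import requests

    SERVICE_URL = "https://thirdparty.example.com/api"  # placeholder, not a real service

    def submit_case(case_id, payload):
        # Treat a timeout as 'unknown', not 'failed': the far side may well have
        # finished the work even though we never saw the response.
        try:
            response = requests.post(f"{SERVICE_URL}/cases/{case_id}",
                                     json=payload, timeout=10)
            response.raise_for_status()
            return "success"
        except requests.Timeout:
            # Ask the service whether the case actually went through before
            # deciding what to tell the user.
            status = requests.get(f"{SERVICE_URL}/cases/{case_id}/status",
                                  timeout=10).json()
            if status.get("state") == "completed":
                return "success"   # it did finish; do not re-submit
            return "retry"         # roll back to a known-good state and let the user retry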