Lately, I have been involved in writing some huge distributed programs that run on hundreds of machines. It's a known fact that writing such programs is non-trivial, but just how non-trivial is something you don't really appreciate until you actually attempt it. Here are some of the lessons I learned (the hard way) that make life a bit easier, written down lest I forget them again (not likely to ;-), it's been a lot of pain).
1. Log everything: The first thing everyone knows about computers is that they act weird sometimes, and the more computers you have, the more likely it is that something goes wrong. Logging what goes on is the one and only way to find out what went wrong where, especially when dealing with a huge number of machines, since it is impossible to monitor them all by hand at once.
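To make this concrete, here is a minimal sketch in Python of the kind of logging I mean: every process writes timestamped, hostname-tagged lines to its own file, so the logs can be collected and grepped afterwards. The job name, file naming and line format are just illustrative choices, not anything prescribed.

```python
import logging
import socket

def setup_logger(job_name):
    # one log file per job per machine, tagged with the hostname
    host = socket.gethostname()
    logger = logging.getLogger(job_name)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(f"{job_name}.{host}.log")
    handler.setFormatter(
        logging.Formatter("%(asctime)s " + host + " %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger

log = setup_logger("bigjob")          # "bigjob" is a made-up job name
log.info("starting work on chunk 17")
log.error("ran out of disk while writing chunk 17")
```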
2. Making code restartable and idempotent: Closely tied to the point above, life becomes a lot easier if the code is restartable and idempotent. This matters when machines die in the middle of a lengthy process; being able to simply restart the process makes managing the machines relatively painless. The log serves as a useful record of exactly where the process was when it died, and reading it back is a useful feedback mechanism.
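As a rough sketch of what restartable and idempotent can look like in practice (in Python, with process_item standing in for the real work): each completed unit of work leaves a marker behind, so a freshly restarted process simply skips everything that was already finished.

```python
import os

def process_item(item):
    ...  # the real work goes here; it should be safe to redo if we crash mid-way

def process_all(items, done_dir="done"):
    os.makedirs(done_dir, exist_ok=True)
    for item in items:
        marker = os.path.join(done_dir, f"{item}.done")
        if os.path.exists(marker):
            continue                    # already finished before the last crash
        process_item(item)
        open(marker, "w").close()       # commit: mark this item as done
```

Recovering from a failure is then just a matter of running the same command again; the log tells you how far it got, and the markers make rerunning harmless.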
3. Keeping synchronization to a minimum: When multiple unreliable machines have to coordinate with each other, life becomes hard. That's when you understand that animal trainers have a tough job ;-). If at all possible, it is much better to divide a complex job so that each computer works on an independent chunk of the whole, rather than on parts that need to be passed around between machines.
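One simple way to carve out such independent chunks, sketched in Python (machine_id and num_machines would come from however you launch the jobs): give machine i every i-th piece of the input, so no machine ever has to hand work to, or wait for, another.

```python
def my_chunk(all_items, machine_id, num_machines):
    # machine k takes items k, k + num_machines, k + 2*num_machines, ...
    return [x for i, x in enumerate(all_items) if i % num_machines == machine_id]

# e.g. machine 2 out of 100 quietly works through its own slice:
work = my_chunk(range(1_000_000), machine_id=2, num_machines=100)
```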
4. Taking Advantage of Memory: Disks are slow - really, really slow - compared to memory. For small programs you can take it or leave it, but for large programs the amount of disk you touch has a significant impact on performance. It is always a good idea to take advantage of the large amounts of memory available on most computers today.
So touching one disk file instead of many small ones is better (because the operating system's buffer cache kicks in and reads ahead from disk to memory in anticipation), but best of all is to gobble the entire file into memory explicitly before processing.
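As a small illustration (the file name and handle() are made up): read the whole file in one go, then make as many passes over it in memory as you like.

```python
def handle(line):
    ...  # per-record processing goes here

with open("records.txt") as f:
    lines = f.readlines()    # one big sequential read; the whole file now sits in memory

for line in lines:           # every subsequent pass hits memory, not the disk
    handle(line)
```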
The tricky part is writing: if you process too much without committing to disk, you run the risk of losing all your computation if the computer fails. On the other hand, very frequent writing is bound to cost you in performance. This is a delicate balance to strike, but with the CPU speeds we have today, my own rule of thumb is to favour repeating computation over frequent writes.
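Here is a toy sketch of that rule of thumb (the checkpoint interval, file name, and compute() are all placeholders): results are committed to disk only every so many items, so at worst that many items' worth of computation gets repeated after a crash.

```python
CHECKPOINT_EVERY = 1000          # arbitrary; tune to how cheap recomputation is

def compute(item):
    return item * item           # stand-in for the real computation

items = range(10_000)
pending = []
for i, item in enumerate(items):
    pending.append(compute(item))
    if (i + 1) % CHECKPOINT_EVERY == 0:
        with open("checkpoint.out", "a") as out:   # commit a batch, not every item
            out.write("\n".join(str(r) for r in pending) + "\n")
        pending.clear()
```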
5. Reducing communication costs: Communication setup costs sometimes dominate the cost of the actual transfer. For example, setting up an ssh or scp session is quite expensive (roughly a second to establish a connection). Considering this, it is better to maximize the amount of data transferred per connection setup.
So secure-copying a tarred directory is better than transferring the contents of the directory one file at a time.
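In Python that might look like the following sketch, shelling out to tar and scp (the host and paths are placeholders); the point is that the expensive connection setup is paid once for the whole bundle rather than once per file.

```python
import subprocess

def push_directory(local_dir, remote_host, remote_path):
    tarball = local_dir.rstrip("/") + ".tar.gz"
    # bundle everything locally first...
    subprocess.run(["tar", "czf", tarball, local_dir], check=True)
    # ...then pay the scp setup cost exactly once
    subprocess.run(["scp", tarball, f"{remote_host}:{remote_path}"], check=True)

push_directory("results", "worker42.example.com", "/data/incoming/")
```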
6. Algorithms Matter: The lesson is that the easiest way to code a problem may not be the most efficient, and efficiency counts for long-running jobs.
As a very vivid example, I came across some code that a friend had written that inserted entries into a database table. For those not in the know, a database table has a primary key (a number) to identify each row (just like the row numbers that MS Excel has).
However, the code to insert a row (or tuple, as a purist would put it) was written so that it read through all previous rows to find the last row number, and then inserted the new row with the largest row number + 1. So each time the nth row had to be inserted, all n-1 previous rows had to be read. One can quickly see how this would ruin insert performance. With this algorithm, it took the code 2.5 hours to insert 10,000 entries into the table.
After changing the code so that the last primary key was stored in a separate table, performance improved substantially. Just how substantially? It inserted 10,000 entries in under 30 seconds!
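To show the difference in code rather than words, here is a rough reconstruction using Python's sqlite3 (the table names are invented and this may not be the database the original code used): insert_slow mimics the scan-everything approach, while insert_fast keeps the last key in a one-row table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE counter (last_id INTEGER)")
conn.execute("INSERT INTO counter VALUES (0)")

def insert_slow(payload):
    # mimics the original: read every existing row just to find the last id -> O(n) per insert
    ids = [row[0] for row in conn.execute("SELECT id FROM entries")]
    last = max(ids) if ids else 0
    conn.execute("INSERT INTO entries VALUES (?, ?)", (last + 1, payload))

def insert_fast(payload):
    # look up the last key in a one-row table instead of scanning -> constant work per insert
    last = conn.execute("SELECT last_id FROM counter").fetchone()[0]
    conn.execute("INSERT INTO entries VALUES (?, ?)", (last + 1, payload))
    conn.execute("UPDATE counter SET last_id = ?", (last + 1,))
```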
So it's much better to go for something slightly more painful if it's algorithmically better, especially considering that you'll probably deploy the code on multiple machines, and it's hard to change code again and again.