Sometimes a web server and a database are enough to meet our project requirements. But if the project scales, we probably need to think about a clustered solution. This post is an unsorted list of ideas about working with clustered PHP applications. Perhaps more than a list of ideas, it is a list of problems you will face when moving from a standalone server to a cluster.
Source code
If you’re going to spread your code among different servers, one important rule is that the source code must always be the same on every node. You cannot allow differences in the source code. If you need node-specific settings, use environment variables and point your code at them instead of deploying different source code to each node. Changing the code on one node may be the fastest way to get things done initially, but it will break your deploy system or at least turn it into a nightmare.
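As a minimal sketch of that idea, the same file can be deployed to every node and read its node-specific values from the environment. The variable names APP_NODE_NAME and APP_LOG_DIR and the helper node_setting() are made up for this example; they are not part of any standard.

```php
<?php
// Hypothetical helper: read a node-specific setting from the environment,
// falling back to a default when the variable is missing or empty.
function node_setting(string $name, string $default): string
{
    $value = getenv($name);
    // getenv() returns false when the variable is not set.
    return ($value === false || $value === '') ? $default : $value;
}

// Identical source on every node; only the environment differs.
$nodeName = node_setting('APP_NODE_NAME', 'node-unknown');
$logDir   = node_setting('APP_LOG_DIR', sys_get_temp_dir());
```

Each node then sets its own values in the web server or process manager configuration, and the deployed code stays byte-for-byte identical.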
Databases
Here is the biggest problem. Clustering databases is hard. Almost every database has some kind of replication system, but replication as a way of solving problems can backfire on you. The easiest form of database replication is master-slave: inserts go to one node and selects can go to all of them. This scenario is easy to implement and not very hard to maintain, but maybe you need multi-master replication (inserts on all nodes), and that can turn your database administration into a nightmare. One solution is to use NoSQL databases. Some of them, such as CouchDB, have a simple and useful multi-master replication system. My recommendation is to use a NoSQL database for the kind of information that must be distributed to all nodes. I know it isn’t always straightforward, but the work done in this direction will be worthwhile. In fact, the main purpose of NoSQL is data scalability, but there are also people who think NoSQL databases are just hype. Who is right?
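The master-slave scenario ("inserts in one node and selects in all") could be sketched on the application side like this. The host names, the constants and both functions are hypothetical, and the check for read queries is deliberately naive.

```php
<?php
// Hypothetical DSNs; in a real setup they would come from configuration.
const MASTER_DSN = 'mysql:host=db-master;dbname=app';
const REPLICA_DSNS = [
    'mysql:host=db-replica-1;dbname=app',
    'mysql:host=db-replica-2;dbname=app',
];

// Naive classification: only plain SELECT/SHOW statements count as reads.
function is_read_query(string $sql): bool
{
    return preg_match('/^\s*(SELECT|SHOW)\b/i', $sql) === 1;
}

// Writes always go to the master; reads go to a randomly chosen replica.
function dsn_for_query(string $sql): string
{
    if (!is_read_query($sql)) {
        return MASTER_DSN;
    }
    return REPLICA_DSNS[array_rand(REPLICA_DSNS)];
}
```

The returned DSN would then be passed to new PDO(...). A real router also has to send reads inside a transaction, and reads that must see a write that was just made, to the master, because replicas can lag behind.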
File-systems
Problems again. Imagine you need to write a simple log system in your application. fopen, fwrite, a few PHP lines and your log is ready. But where is it located? Probably on your local filesystem. If you want to run your application on a distributed cluster, you cannot rely on those simple functions alone. You must consider a distributed filesystem. Maybe a simple rsync would work. Maybe a single file server mounted on every node of your cluster. You need to weigh the options and choose the solution that fits your requirements. One smart and simple solution is to take advantage of CouchDB’s attachments together with its replication system. (Can you tell I like CouchDB?)
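Whichever storage you choose, it helps if every log entry says which node wrote it, so that files collected from several nodes (with rsync, for example) can be merged later. The following is only a sketch under that assumption; the function names and the JSON-lines layout are invented for this example.

```php
<?php
// Hypothetical format: one JSON object per line, tagged with the node name.
function format_log_entry(string $node, string $message, int $timestamp): string
{
    return json_encode([
        'ts'   => gmdate('c', $timestamp), // ISO 8601 timestamp in UTC
        'node' => $node,
        'msg'  => $message,
    ]);
}

// Append one entry to a log file; LOCK_EX avoids interleaved writes
// from concurrent requests on the same node.
function write_log(string $file, string $node, string $message): bool
{
    $line = format_log_entry($node, $message, time()) . PHP_EOL;
    return file_put_contents($file, $line, FILE_APPEND | LOCK_EX) !== false;
}
```

If the log later moves from the filesystem to a database, only write_log() has to change and the rest of the application keeps calling the same function.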
Deploy
That’s a mandatory point. You must have an effective deploy system. You can’t rely on yourself to deploy the code across the cluster by hand. This work must be automated. Phing, rsync, pull/push actions in a distributed revision control system (e.g. Mercurial) or maybe a simple bash script can be enough. But never do it by hand.
Authentication and Authorization
If your application is protected by an authentication mechanism, you must ensure that all nodes authenticate against the same database. A good pattern is to work with an external authentication mechanism such as OAuth on a separate server. If that’s not possible, you must consider where the user/password database is located. CouchDB is a good solution.
After your user is authenticated you need to check the authorization. OAuth servers can help you with authentication, but your application must deal with authorization (aka user roles and role groups) anyway. My preferred choice here is CouchDB, but you can use relational databases too. As I said in the Databases section, I avoid multi-master replication with relational databases like the plague, but if you need relational replication for the authorization database you can use master-slave replication and update all roles and authorizations in the master database.
Sessions, Logs and Cache files
Which session handler do you use in your application?
File based (go to the File-systems section of this post) or database (go to the Databases section of this post).
The same goes for cache and logs. With cache files you can consider keeping them local. If your cache system works unattended, you can maintain the cache on the local node instead of replicating it across the cluster. You save bandwidth.
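To make the session point concrete, here is a rough sketch of a session handler that keeps its data in a store shared by all nodes instead of in local files. SessionHandlerInterface and session_set_save_handler() are standard PHP; SharedStore, ArrayStore and SharedSessionHandler are hypothetical names, and ArrayStore is only an in-process stand-in for a real shared backend (a database, for instance).

```php
<?php
// Hypothetical abstraction over whatever store all the nodes can reach.
interface SharedStore
{
    public function get(string $key): ?string;
    public function set(string $key, string $value): void;
    public function delete(string $key): void;
}

// Stand-in that lives only in this process, so the sketch runs as-is.
final class ArrayStore implements SharedStore
{
    private $items = [];
    public function get(string $key): ?string { return $this->items[$key] ?? null; }
    public function set(string $key, string $value): void { $this->items[$key] = $value; }
    public function delete(string $key): void { unset($this->items[$key]); }
}

final class SharedSessionHandler implements SessionHandlerInterface
{
    private $store;
    public function __construct(SharedStore $store) { $this->store = $store; }
    public function open($path, $name): bool { return true; }
    public function close(): bool { return true; }
    // An unknown session id must yield an empty string.
    public function read($id): string { return $this->store->get("sess:$id") ?? ''; }
    public function write($id, $data): bool { $this->store->set("sess:$id", $data); return true; }
    public function destroy($id): bool { $this->store->delete("sess:$id"); return true; }
    // Expiry is left to the backing store in this sketch.
    public function gc($max_lifetime): int { return 0; }
}
```

It would be registered before session_start() with session_set_save_handler(new SharedSessionHandler($store), true). Because every node talks to the same store, a user can land on any node and still find the session.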
For authentication, authorization and sessions I use memcached.
Memcached is a great tool. I use it only for caching. Do you use only memcached, or do you store the information in a database and use memcached for caching? E.g. are new users added only to memcached?
Thanks, nice info, and your article is very educational.
A good overview – but there are lots of other ways to skin a cat.
Source Code – sometimes env vars are not the right place to hold node / environment specific data. There’s a lengthy discussion about using a hierarchy of include paths in the 2009 September issue of PHPArchitect (http://www.phparch.com/magazine/2009/september/)
Databases – the noSQL dbs can be very efficient and very, very scalable. But for anything other than very simple ORM they quickly become very hard to support across the full life-cycle. These are mostly NOT relational databases (no integrity constraints).
File-Systems – it’s important that data replication does not overlap with code replication (or that it does so in very controlled ways). When you’ve got any additional medium (like a database) which constrains code, the problem becomes even murkier if deployments are not managed. But having said that, there are lots of solutions for on-demand, bulk and off-line replication, including AFS, (network or direct attached) shared filesystems and others.
The central issue with authentication is to maintain a consolidated source of authentication data. Authorization is a completely different kettle of fish. People run into problems here because they don’t separate the 2 objectives, and try to extend simple session cookie based systems across multiple domains.
Many thanks.
The main aim of this post is to show that things become different when we move from a standalone server to a cluster. E.g. things that are trivial on a standalone server, such as an access log, need to be carefully evaluated within a cluster. There are always different ways to face the problems, and they must be evaluated according to the cluster installation. One important thing, at least for me, is to reset my mind when facing the new environment. Returning to the access log example: the first approach may be to think of a distributed filesystem (because our original log was on a local filesystem). It can be a good approach because we don’t need to alter the application source code. But if we move the log from the filesystem to a CouchDB database (I like CouchDB 😉), the solution becomes different: easy to develop, but now we need to change the source code.
Very nice.
I’m setting up a distributed PHP environment.
I’m using memcached to replace the file-based session handler.
Thanks for the tips.
There are a lot of people using memcached to store sessions. I don’t use it because memcached doesn’t have any replication system (tell me if I’m wrong). If we want roaming between servers we need to code it or use another solution like repcached (a patch for memcached that adds replication). But if we don’t need roaming, memcached is another good solution.
Good article, clear and very understandable, but I think one important subject is missing: the hosting.
I am looking for information about NoSQL solutions and it is hard to find out whether hosting providers support them or not. Mostly they only support MySQL (at least in Switzerland).
Yes, hosting is a problem. PHP is cool because there is a lot of cheap hosting around the world. MySQL is no problem, but more “exotic” things such as NoSQL are a bit more difficult to find. The best solution is dedicated servers (Amazon and similar) and building everything on our own. But that means we need to administer all the services. Anyway, if we’re speaking only about NoSQL, there are services such as http://www.couchbase.com/ with NoSQL database hosting plans.
A good response on this matter, with real arguments and a description of the whole thing.