Archive for Tutorials

Coding in the Cloud - Rule 4 - Avoid Unnecessary External Dependencies

// August 6th, 2009 // 13 Comments » // Community, Development, Tutorials

Coding in the Cloud

By Adrian Otto

This continues my series, Rules for Coding in the Cloud - rules I’ve developed after watching applications encounter problems at scale when deployed on Cloud Sites.

Avoid Unnecessary External Dependencies

Time after time on Cloud Sites, a new site will come online that displays information from another web site, like, say, stock quotes.  Let’s say the site sells dump trucks, and there are stock quotes for CAT and other equipment manufacturers they sell, and they want to show those stock quotes on their web site.  Every time there’s a page view, the site makes an outgoing HTTP connection to a stock web site, downloads the stock ticker data for those companies and then displays it as part of the HTML output of their own web site.

This works just fine—provided you’re not doing a whole lot of it.  But if your site suddenly becomes exceedingly popular because of press mentions, links from very busy web sites or Twitter, all of a sudden two million people are trying to access your site (and consequently the stock site), which can crash the stock site and take yours down with it.

The first thing the frustrated customer does is ask us why their site crashed. When we look, we see that it jammed up waiting for stocks.whatever.com to respond. The load did not crash the site running on Cloud Sites; it crashed the remote site, the stock site in this scenario, and the dependency on that site caused a train wreck that ended in customer frustration.

Lesson learned: be smart about external dependencies. Eliminate every external dependency you don’t need, and be careful with the ones you keep, whether they are stock quotes, geolocation services, or anything else that requires you to call somebody else’s web service, because you just can’t trust that their site is going to scale as well as your own. This can happen no matter how large the external site is. We’ve seen it happen when the external site was big, like stocks.yahoo.com. In some cases we’ve clogged stocks.yahoo.com in this very way: it sees all of our requests coming from a single place, and it becomes completely unreachable from our network because of the way the request routing works. You must not assume that because the remote web site is big, or hosted by a big company, it runs on infrastructure that will scale when you access it from your web app. That’s not necessarily the case.

An increasingly popular feature to add to sites is geolocation, where you get the location of the person browsing your site. You go to a site, and it might say, “Thanks for browsing from San Antonio. We have a special offer for you in our store at River Center Mall.” These services work by looking up the user’s IP address and using it to determine the user’s location. Some geolocation services are free and not very accurate; others are available for a price and tend to be more accurate. Regardless, this is just the kind of external dependency that can bring down your site. The service starts responding slowly. Since we charge for the time that your application is running, that slowness translates directly into dollars. Now you’re paying a premium to have geolocation on your site. If you really must have geolocation, don’t do it with a remote web service. Do it with some kind of local lookup, such as a database that you consult directly and that’s under your own control.
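As a rough illustration of a local lookup, here is a minimal Python sketch that keeps IP ranges in memory and finds a visitor's city with a binary search. The class name, the sample ranges and the city names are hypothetical placeholders; a real table would be loaded from whatever geolocation data file you store with your application.

```python
import bisect
import ipaddress

# Hypothetical sample rows: (first address, last address, city).
# A real table would be loaded from a data file you keep with your app.
SAMPLE_RANGES = [
    ("10.0.0.0", "10.0.255.255", "San Antonio"),
    ("10.1.0.0", "10.1.255.255", "Austin"),
    ("192.168.0.0", "192.168.0.255", "Dallas"),
]


class LocalGeoTable:
    """Sorted, non-overlapping IPv4 ranges searched with binary search."""

    def __init__(self, rows):
        self._rows = sorted(
            (int(ipaddress.IPv4Address(first)), int(ipaddress.IPv4Address(last)), city)
            for first, last, city in rows
        )
        self._starts = [first for first, _, _ in self._rows]

    def lookup(self, ip, default=None):
        value = int(ipaddress.IPv4Address(ip))
        # Find the last range whose first address is <= the visitor's address.
        index = bisect.bisect_right(self._starts, value) - 1
        if index < 0:
            return default
        _, last, city = self._rows[index]
        return city if value <= last else default


table = LocalGeoTable(SAMPLE_RANGES)
print(table.lookup("10.1.2.3"))
print(table.lookup("8.8.8.8", default="unknown"))
```

The lookup never leaves your own server, so a slow or unreachable third party cannot stall the page.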

Mashups are another popular use case for external dependencies, and they don’t scale well unless you have a way of caching the results from the dependent web site.  If you include a mashup that passes all of your traffic through a remote site, you are trusting that site to scale as well as yours will. Unfortunately, unless that remote site is running on Cloud Sites, it’s probably not going to scale well, simply because it’s not backed by hundreds of servers.

If you must reference external data, be smart about it. I wrote a piece of software that you can use as an example of how to mitigate this problem. This PHP code allows you to display information from another site, but, because it uses a caching approach, it can get fresh remote data in a generally non-blocking fashion and at reasonable time intervals. You can configure the refresh interval to suit your needs. It allows you to have a remote dependency on your site by limiting the frequency with which you interact with the remote site.
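The PHP code itself is not reproduced on this page, so the following is not that software. It is a minimal Python sketch of the same general idea, under stated assumptions: contact the remote site at most once per refresh interval, and keep serving the last good copy if the remote site fails. The class name, the fake fetch function and the quote value are all hypothetical. Unlike the non-blocking approach described above, this simple version does block the one request that triggers a refresh, so a real fetch function should use a short timeout.

```python
import time


class CachedRemoteValue:
    """Serve a cached copy of remote data, refreshing at most once per interval."""

    def __init__(self, fetch, refresh_interval=300.0, clock=time.monotonic):
        self._fetch = fetch                      # callable that contacts the remote site
        self._refresh_interval = refresh_interval
        self._clock = clock
        self._value = None
        self._attempted_at = None

    def get(self):
        now = self._clock()
        is_stale = (
            self._attempted_at is None
            or now - self._attempted_at >= self._refresh_interval
        )
        if is_stale:
            try:
                self._value = self._fetch()
            except Exception:
                pass  # remote site failed: keep serving the previous copy
            # Record the attempt either way, so a failing remote site is
            # retried once per interval rather than on every page view.
            self._attempted_at = now
        return self._value


# Demonstration with a fake remote call and a fake clock (both hypothetical).
calls = []


def fake_fetch_quotes():
    calls.append(1)
    if len(calls) == 2:
        raise TimeoutError("remote site not responding")
    return {"CAT": 40.0 + len(calls)}


fake_now = [0.0]
quotes = CachedRemoteValue(fake_fetch_quotes, refresh_interval=60.0,
                           clock=lambda: fake_now[0])
first = quotes.get()    # first view: contacts the remote site
second = quotes.get()   # same interval: served from the cache
fake_now[0] = 61.0
third = quotes.get()    # refresh attempt fails: the old copy is served
```

On a platform with many web nodes, the cached copy would more usefully live in a shared store such as memcached than in one process's memory.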

The bottom line with external dependencies is that they are evil when used blindly. Do everything you can to avoid them, or put a suitable buffer between your web app and any external dependency so that if the remote site does crash, your web app can still run.

Setting up memcached on Cloud Servers

// July 30th, 2009 // 1 Comment » // Community, Development, Tutorials

By: Adrian Otto

This tutorial explains how to set up memcached for RHEL5 or CentOS5 on Cloud Servers. Although the example is provided for PHP, you can access a memcached server from practically any language using one of the memcached client APIs.

Step 1: Start up a Cloud Server

Use your Control Panel to start a Cloud Server. For an example, see this video demo.

Step 2: Log Into your Cloud Server using SSH

You may connect as root if your server configuration allows it, or connect as a user and use ‘su’ to get a root shell. I use my Terminal from OSX or PuTTY on Windows.

$ ssh root@12.34.56.7
root@12.34.56.7's password:
[root@mcd-demo ~]#

Step 3: Set up rpmforge

The rpmforge package allows you to download and install third party packages from the DAG RPM archive using the ‘yum’ utility on any RHEL5 or CentOS5 system. First visit this URL and determine the current rpmforge package for your system. Note that all Cloud Servers are 64 bit.

http://dag.wieers.com/rpm/packages/rpmforge-release

Once you have determined the current release for your system, use the ‘rpm’ tool to install it from the correct URL. Example:

# wget http://dag.wieers.com/redhat/el5/en/x86_64/rpmforge/RPMS/rpmforge-release-0.3.6-1.el5.rf.x86_64.rpm
# rpm -ivh rpmforge-release-0.3.6-1.el5.rf.x86_64.rpm

Step 4: Install memcached

Install memcached from rpmforge using yum, set it to run at system boot, and start it up.

# yum -y install memcached
# chkconfig memcached on
# service memcached start

Step 5: Configure memcached

You can skip this step if you want to use a default installation. Otherwise, you can edit /etc/sysconfig/memcached on your server. You will find these settings by default:

PORT="11211"
USER="nobody"
MAXCONN="1024"
CACHESIZE="64"
OPTIONS=""

If you plan on using memcached locally on this same server, you can restrict it to only listen on a local UNIX socket and disable networking.

OPTIONS="-s /var/run/memcached/memcached.sock"

If you do use a socket file, make sure that its directory exists and is owned by the user specified by the USER setting.

# mkdir -p /var/run/memcached
# chown nobody /var/run/memcached

If you plan on using memcached over the network, you may want to run it on a non-standard port number of your choice.

PORT="22222"

You can also tune the slab size so that it’s suitable for the size of the data you plan on storing in memcached. The default is 48 bytes. This example sets it to 512 bytes:

OPTIONS="-n 512"

To enable UDP, add this option for the UDP port number you want it to listen on:

OPTIONS="-U 22222"

If you want it to use more memory, you can change this setting to the number of megabytes it should use:

CACHESIZE="64"

On a dedicated 1GB Cloud Server, I set mine to 512MB. This allows a lot of memory for connection handling, and the OS, and guarantees that I won’t exhaust the memory and start running from swap:

CACHESIZE="512"

If you have a busy site and expect more than 1024 concurrent connections, you can safely set that maximum much higher:

MAXCONN="20480"

So here is an example settings file that has a bunch of options changed from the default:

PORT="22222"
USER="nobody"
MAXCONN="20480"
CACHESIZE="512"
OPTIONS="-U 22222 -n 512 -l 10.1.2.3"

Note: The above example says it should only listen on the private IP address 10.1.2.3. This will help avoid unauthorized access to your memcached for setups where you access it from your other Cloud Servers. If you want to limit it only to local clients on the same server, then use “-l 127.0.0.1” instead.

For more information, consult the man page:

# man memcached

Now save the file and restart the memcached daemon:

# service memcached restart

Step 6: Set up a test application

Now let’s set up a simple PHP application that uses memcached to show that it works. First let’s set up an Apache+PHP environment:

# yum -y install httpd httpd-devel php php-pecl-memcache
# service httpd start
# chkconfig iptables off
# service iptables stop

Here you have two options: an easier installation that gives you the Memcache class in your PHP script, or a more complex installation that gives you the superior Memcached class. The more complex option requires that you upgrade to PHP 5.2.0 or newer. For this tutorial we will show you the easier way and set up the older Memcache class. In this example, I just download the test script from this blog.

# wget -O /var/www/html/test.php http://blog.mosso.com/wp-content/uploads/2009/07/example.txt

Now browse to your server (use the IP address if you have not set up any DNS for it yet):

Example: http://12.34.56.7/test.php

If you visit the page multiple times within 5 seconds, you will only see the cached result. If you wait more than 5 seconds between visits, you will see it insert new data into the cache and fetch it back.

Look at the source code for the example.
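The example script is linked rather than shown here, so what follows is not that script. It is a minimal Python sketch of the same check-then-store pattern, speaking the memcached text protocol directly over a socket. The host, port, key name and function names are assumptions for illustration; adjust them to match your /etc/sysconfig/memcached settings.

```python
import socket
import time

HOST, PORT = "127.0.0.1", 11211     # assumption: default port on the local server
KEY, TTL_SECONDS = "demo_timestamp", 5


def build_set(key, value, ttl):
    """Text protocol: set <key> <flags> <exptime> <bytes>, then the data."""
    data = value.encode("utf-8")
    return b"set %s 0 %d %d\r\n%s\r\n" % (key.encode("ascii"), ttl, len(data), data)


def build_get(key):
    return b"get %s\r\n" % key.encode("ascii")


def parse_get(response):
    """Return the stored string, or None on a cache miss (reply is just END)."""
    if not response.startswith(b"VALUE "):
        return None
    header, _, rest = response.partition(b"\r\n")
    size = int(header.split()[3])   # header is: VALUE <key> <flags> <bytes>
    return rest[:size].decode("utf-8")


def read_until(sock, terminator):
    buffer = b""
    while not buffer.endswith(terminator):
        chunk = sock.recv(4096)
        if not chunk:
            break
        buffer += chunk
    return buffer


def cached_timestamp():
    """Return the cached value, or store a fresh one that expires after the TTL."""
    with socket.create_connection((HOST, PORT), timeout=2) as sock:
        sock.sendall(build_get(KEY))
        value = parse_get(read_until(sock, b"END\r\n"))
        if value is not None:
            return "cache hit: " + value
        fresh = time.strftime("%H:%M:%S")
        sock.sendall(build_set(KEY, fresh, TTL_SECONDS))
        read_until(sock, b"\r\n")   # server replies STORED on success
        return "cache miss, stored: " + fresh
```

Calling cached_timestamp() twice within 5 seconds against a running memcached should report a miss followed by a hit; after the TTL passes, the next call stores a new value. In a real application you would normally use a maintained client library instead of raw sockets.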

Completion

Congratulations! You’ve set up memcached on a Cloud Server! Now it’s time to begin using it in your application to add speed and scalability to your application and start saving money!

Setting up memcached on Cloud Sites

// July 30th, 2009 // 1 Comment » // Community, Development, Press Releases, Tutorials

By: Adrian Otto

This tutorial explains how to access a memcached server running on one or more Cloud Servers from Cloud Sites. Using this approach you can leverage all of the features of the Cloud Sites application platform, and all its related scalability, while still enjoying the benefits of memcached. Note that Cloud Sites and Cloud Servers are currently provisioned in the same data center, so network latency will be low, which means throughput will be high.

Step 1: Set up memcached on Cloud Servers

Use my tutorial on setting up memcached on cloud servers to complete this step. You can skip step 6. You can also skip step 5 if you want to run a default configuration.

Step 2: Set up a Test Script

I have included an example PHP script for using memcached from Cloud Sites. You can download it, edit the $server_hostname variable in the script to refer to the address of your Cloud Server, rename it to example.php, and upload it to your Cloud Sites account using SFTP or FTP. Once it’s uploaded, you can see how the caching works.

About Security

You must recognize that memcached comes with no security controls. It’s possible for a hacker to dump the contents of your cache, or potentially access or change the data in the cache, if they know the address and port of your memcached server and the keys you are using. I suggest that you use a non-standard port number for memcached, and prefix all of your keys with a 10+ digit string that you keep secret. If you are highly motivated, you can make a custom version of memcached that has the ‘flush_all’ command disabled.

I can save you a bit of work. Here is a custom patched memcached 1.4.0 x86_64 RPM I wrote that adds a command line option ‘-S’ to disable ‘flush_all’ and ‘stats detail on’. The original 1.4 source, a SPEC file for RHEL5 and CentOS5, and the patch are all included in the SRPM. By disabling these commands with the -S option in /etc/sysconfig/memcached (OPTIONS="-S") you can prevent would-be hackers from dropping all your cached items, or finding out the names of the keys you are using. The memcached maintainers want to do this a different way, so this patch won’t be included in the base memcached source tree.

You might also be considering restricting access to your memcached instance by IP address. If you plan to use it from Cloud Sites, that will be difficult because you won’t know what IP addresses your connections will come from, and they could change without notice. Furthermore, any other user of Cloud Sites would be coming from the same IP addresses. For this reason, it’s best to simply use the custom memcached version mentioned above and a secret string that you prepend to all of your keys.
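The secret key prefix can be applied in one small helper so that no part of the application builds a cache key without it. This is a minimal Python sketch; the function name and the placeholder secret are hypothetical. The fallback to a SHA-256 digest is an addition of this sketch, not part of the advice above: memcached rejects keys longer than 250 bytes or containing whitespace or control characters, so the helper hashes any key that would be invalid.

```python
import hashlib

# Assumption: a secret string of your own choosing, kept out of public code.
SECRET_PREFIX = "replace-with-your-own-secret"
MAX_KEY_BYTES = 250     # memcached rejects longer keys


def namespaced_key(name, secret=SECRET_PREFIX):
    """Prefix a cache key with the secret; hash it if memcached would reject it."""
    encoded = (secret + name).encode("utf-8")
    is_valid = len(encoded) <= MAX_KEY_BYTES and not any(
        byte <= 32 or byte == 127 for byte in encoded   # whitespace and control bytes
    )
    if is_valid:
        return secret + name
    # The digest is 64 hex characters, so it is always a legal key, and it
    # still cannot be guessed without knowing the secret.
    return hashlib.sha256(encoded).hexdigest()


print(namespaced_key("stock_quotes"))
```

Every read and write then goes through namespaced_key(), so an outsider who finds the port still has to guess the prefix before any key name is useful.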

Completion

Congratulations! You’ve set up memcached on your Cloud Sites account! Now it’s time to begin using it in your web application to add speed and scalability to your application and start saving money!

Coding in the Cloud - Rule 3 - Use a “Stateless” design whenever possible

// July 17th, 2009 // 4 Comments » // Community, Development, Tutorials

Coding in the Cloud

By Adrian Otto

This continues my series on Rules for Coding in the Cloud, rules I’ve developed after watching applications encounter problems at scale when deployed on Cloud Sites.

The use of Sessions

The Web was originally designed to be stateless, that is, not to rely on information about users being stored between interactions. Sessions are a way to save the state of the application for the user between requests (shopping carts being the classic example of this). Programmers, feeling that they need to save information about particular users between interactions, find all sorts of ways to use sessions that are not really appropriate.  As a result, the concept of sessions gets abused and the scalability of the applications suffers due to requiring sessions. If you can design your application not to use sessions at all and be completely stateless, it can be scaled horizontally much more easily and the application will run much better under high load than if you try to keep track of which user needs to be reconnected with which session settings.

The error that we often see is that the application programmer makes a dependency on sessions central to the design of the software, and later when it becomes identified as a scalability bottleneck, it cannot be removed from the application because it was designed in at such a low level.

The reason why sessions are so evil is that most developers store the session information in the database. To get the information you need out of the session, you need to query the database and read the information. And because sessions change so often, such applications require lots and lots of writing to the database to keep the state of the session. You can end up with a read/write ratio of about 50/50 because of sessions, which kills scalability.

Working with REST

One way to make sure you don’t rely on sessions is to use Representational State Transfer (REST) architectures, which are by definition stateless.

If your application can work with REST, sessions are a non-issue.  If you have a REST architecture, you can put Ajax on top of REST and have a very useful application. Ajax can maintain information on the client side, which is fine for many applications. Going back to the shopping cart example, you could implement a shopping cart using Ajax. You could work around a lot of the storage that would be required by storing it in the JavaScript app that’s running on the client and give it to the server at the end of the interaction.

Some types of information are stateful by nature, such as something that requires an open network connection to monitor progress of a remote operation that does not have stateless features. In such cases, using a stateless design may not be practical, or even possible. When you have more clients than servers it makes sense to do as much work on the clients as you reasonably can. If your application must save state, save it on the client rather than on the server. Consider using a browser cookie, or some Ajax. This will greatly improve your scalability. Furthermore, you pay for the server side resources you consume, but client side resources are free!
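One way to keep state in a browser cookie without a server-side session table is to pack the state into the cookie value itself. The following Python sketch shows the idea under stated assumptions: the function names and the signing key are hypothetical, and the HMAC signature is an addition of this sketch so the server can detect a cookie that was edited on the client. The contents are signed, not encrypted, so nothing secret belongs in them.

```python
import base64
import hashlib
import hmac
import json

# Assumption: a server-side secret used only for signing cookies.
SIGNING_KEY = b"replace-with-a-long-random-secret"


def encode_state(state, key=SIGNING_KEY):
    """Pack state into a cookie-safe string: <base64 payload>.<hex signature>."""
    raw = json.dumps(state, sort_keys=True).encode("utf-8")
    payload = base64.urlsafe_b64encode(raw).decode("ascii")
    signature = hmac.new(key, payload.encode("ascii"), hashlib.sha256).hexdigest()
    return payload + "." + signature


def decode_state(cookie_value, key=SIGNING_KEY):
    """Return the state dict, or None if the cookie is malformed or altered."""
    payload, separator, signature = cookie_value.rpartition(".")
    if not separator:
        return None
    expected = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected.encode("ascii"), signature.encode("utf-8")):
        return None
    try:
        return json.loads(base64.urlsafe_b64decode(payload.encode("ascii")))
    except ValueError:
        return None


cart = {"items": {"sku-1": 2, "sku-7": 1}}
cookie = encode_state(cart)
restored = decode_state(cookie)
```

Each request carries the cart with it, so any web node can handle it with no database read or write. Cookies are small (roughly 4 KB in common browsers), so this suits compact state such as a cart of item IDs and quantities.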

Bottom line: Stateless apps are ideal. Server-side database writes are your enemy. If you must save state, save it on the client. Use cookies and Ajax where appropriate. If you follow this rule, you will multiply your scalability.

Coding in the Cloud - Rule 2 - Don’t write to the database in real time

// July 9th, 2009 // 14 Comments » // Community, Development, Tutorials

Coding in the Cloud
By Adrian Otto

This continues my series on Rules for Coding in the Cloud, rules I’ve developed after watching applications encounter problems at scale when deployed on Cloud Sites.

People think about the cloud as an unlimited resource, but there are certain limits and you will reach those limits when you try to do something like writing multiple rows into the database for every single hit to your web site.  For example, if your site gets a million hits in an hour and you write four rows for each hit, you’d write four million rows of new data into your database in an hour.  That use pattern will cause lots of blockage and lots of wasted money. And in some cases, you can produce a write use pattern that can quickly exceed the capacity of the database server to write since writes take on average 10 times longer than any equally sized read.

So if you can avoid it, don’t write to the database.  When you must write to the database, do it very infrequently.  Don’t write based on the access pattern of your web site, or your entire application will fail when it’s under high load. Instead, find a way to individually queue that data in a scalable fashion, summarize it and then add it to the database at an infrequent interval.

Don’t use the database as a web log. If you do, and you have an application that’s running on hundreds of thousands of nodes in parallel, you’ll be unpleasantly surprised by the outcome. It will fail.

So what kinds of applications tend to break this rule?

Ad networks, for one. Ad networks are designed to track where ads come from in real time so they can get up-to-date intelligence about the performance of ads. The critical error in logic that some ad network developers have made is assuming that to get real time data you need real time logging. You don’t. All you need is real time summary data. You need the detail level for archival, but you don’t need it to get real time intelligence. So what you really want when serving an ad network is summary counters of the performance of all the various objects that you’re serving, and you want those counters updated in memory, not in the database. Every few minutes, read that information out of the memory counters and write it into the database for permanent storage. This gives you a real time view of the ad network without writing multiple rows into a database for every single access to every single ad. We’ve seen multiple ad networks make this mistake, run into scalability constraints, and have to redesign the way their systems work. The principle here applies to all sites, though, not just ad networks.

Because of the way Cloud Sites works, storing something in memory as a summary value may seem rather tricky. The best way to do this is to use a memcached instance running on Cloud Servers. From a PHP application you can use the Memcached class for this. It supports an increment method that allows you to safely increment the value of a given key from numerous servers simultaneously.
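The count-in-memory, write-in-batches pattern can be sketched in a few lines of Python. Everything named here is a hypothetical stand-in: an in-process Counter plays the role of the memcached counters, an in-memory SQLite database plays the role of your real database, and the table and ad names are made up. On Cloud Sites the counters would need to live in a shared memcached instance, because requests are served by many web nodes.

```python
import sqlite3
from collections import Counter


class BatchedCounters:
    """Count events in memory; write summary rows to the database in batches."""

    def __init__(self, connection):
        self._db = connection
        self._pending = Counter()
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS ad_stats ("
            "ad_id TEXT PRIMARY KEY, views INTEGER NOT NULL)"
        )

    def record_view(self, ad_id):
        # Called on every hit: touches memory only, never the database.
        self._pending[ad_id] += 1

    def flush(self):
        # Called every few minutes: one transaction, one write per ad.
        batch, self._pending = self._pending, Counter()
        with self._db:
            for ad_id, count in batch.items():
                updated = self._db.execute(
                    "UPDATE ad_stats SET views = views + ? WHERE ad_id = ?",
                    (count, ad_id),
                )
                if updated.rowcount == 0:
                    self._db.execute(
                        "INSERT INTO ad_stats (ad_id, views) VALUES (?, ?)",
                        (ad_id, count),
                    )
        return len(batch)


db = sqlite3.connect(":memory:")
counters = BatchedCounters(db)
for _ in range(1000):
    counters.record_view("banner-a")
counters.record_view("banner-b")
rows_written = counters.flush()
```

Here 1,001 page views turn into two database writes. The flush interval sets how stale the permanent copy may be, while the in-memory counters still give the real time view.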

Bottom line: don’t try to write to the database in real time.  Writing to the database in real time is a recipe that will fail at scale. If you’re going to write to the database, do it asynchronously. Take the data in a batched format and save it to the database at regular intervals.