Archive for Development

How do you put a Database in the Cloud?

// October 28th, 2009 // 3 Comments » // Development

By Jonathan Bryce, Founder

There’s a lot of buzz in the cloud world around Amazon’s new Relational Database Service. With this move, Amazon inches up one level from pure infrastructure to also owning the operating system and base server software (you can’t SSH into an RDS EC2 instance). More interesting than the announcement itself is the discussion it’s generated, a frequent question being, “Is it really cloud?”

TechCrunch provides coverage and an intelligent discussion in the comments.

RightScale makes the point that RDS instances are basically MySQL appliances: at their core, just EC2 instances running MySQL. This is a capability RightScale has offered for years on top of the same infrastructure. The RDS instances then have some valuable automated services layered on top to back up and scale the resources available to that EC2 instance. This is similar to the value-added services RightScale has offered, and to the snapshot backups and in-place scaling that Cloud Servers offers for all server types. As a side note, this step will obviously worry some of the Amazon partners who are building businesses on top of Amazon’s infrastructure services.

Database != Cloud?

Back to the original question: How do you do databases in the cloud? This is a question we’ve been consumed with for years. We are running thousands and thousands of applications, most of which are backed by a MySQL or Microsoft SQL Server database. At Rackspace, we have a few basic philosophies that influence how we approach our product offerings, from managed hosting to cloud to email.

First, we want to give users a variety of options that start very low in the stack and go all the way up to software services. Customers can pick where in the stack they want to work to match their needs for customization, ease of use, and required technical skill.

Second, we want to try to make the transition from one type of service to another as smooth as possible. I applaud Amazon for implementing this in a way that preserves the standard MySQL protocol. We’ve taken the same approach with our Cloud Sites database capabilities.

The third goal, though, is the hardest: smooth scalability. One of the primary promises of the cloud is elasticity. For something like our web application servers, we run custom versions of web server software that allow us to reach a level of scale that will meet practically any need. Relational databases, though, are much more difficult to scale infinitely. Amazon’s approach is to give their users the building blocks and ask them to handle the scaling. We’ve taken a different tack, trying to handle scaling as seamlessly as possible.

Along the way, we’ve learned many lessons about scaling, especially databases. The first lesson is that there’s still a limit to how far you can stretch today’s relational database software. We’ve been able to create a MySQL offering for Cloud Sites that has elastically handled massive volume, but we’ve also reached upper bounds and had to help a small number of customers deploy in other configurations. You can throw bigger, beefier hardware at it, but eventually you can’t go any farther vertically. You can scale horizontally using projects like mysql-proxy to load-balance queries, but again, you will run into problems like maintaining consistency across all your nodes. For the vast majority of database usage out there, these problems never appear on the horizon and the work we’ve already done on MySQL is elastic enough. For those cases where the database needs to do more, we’ve been working with two interesting new technologies.
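To make the horizontal-scaling idea above concrete, here is a minimal sketch of read/write splitting in Python. It is not how mysql-proxy itself is configured, and the class name, node names, and verb list are all assumptions made for illustration. It only decides which node a statement should go to. The consistency problem mentioned above shows up exactly here: a read routed to a replica may not yet see a write that just went to the primary.

```python
import itertools


class QueryRouter:
    """Toy read/write splitter (hypothetical, for illustration only).

    Writes go to the primary; reads rotate over the replicas.
    """

    # Statement verbs treated as writes in this sketch (an assumption).
    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE", "CREATE", "ALTER", "DROP")

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def route(self, sql):
        """Return the name of the node that should run this statement."""
        words = sql.split()
        verb = words[0].upper() if words else ""
        if verb in self.WRITE_VERBS:
            return self.primary
        return next(self._replicas)


# Node names are placeholders, not real hosts.
router = QueryRouter("db-primary", ["db-replica-1", "db-replica-2"])
targets = [
    router.route(query)
    for query in (
        "INSERT INTO posts (title) VALUES ('hello')",
        "SELECT * FROM posts",
        "SELECT * FROM posts",
        "SELECT * FROM posts",
    )
]
print(targets)
```

The second statement in the example is a read that immediately follows a write, yet it lands on a replica. Unless replication has caught up, that read returns stale data, which is the kind of consistency issue that limits this approach.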

Drizzle

Drizzle is a project that is one step removed from MySQL. It’s primarily worked on by developers who also worked on MySQL, with the goal being to modularize every component of MySQL and build a scalable, cloud-friendly version of the world’s most popular RDBMS. Drizzle is still in the early stages of development, but it shows a lot of promise. Bringing real horizontal scaling to MySQL while maintaining compatibility and relational database functionality will be a huge step forward. And if it doesn’t require a complete rework of the RDBMS-backed applications that decades of development time have gone into, that is a big bonus.

Cassandra

Beyond Drizzle, we are actively contributing to and working on the Cassandra distributed database system. Cassandra goes beyond trying to scale a traditional relational database. It removes many of those traditional concepts and places a priority on scaling. When you are dealing with billions of writes and terabytes of data, you’ve moved into a realm of technology needs that requires you to adopt some new concepts. Truly web-scale applications, like Digg, have reached this point and started to make the shift. The possibility of reaching hundreds of millions of users worldwide with online applications creates scale problems like never before. We see distributed databases as a key component in solving the next wave of scaling problems, and that’s why we are investing heavily in them. If you put in a little bit of extra development effort upfront, they offer the potential for truly elastic cloud database services.

If the idea of creating infinitely scalable database technology consumes your thoughts, send us your resume - jobs@rackspacecloud.com. We are always looking to add skilled engineers and developers, and will provide an opportunity to work on some of the largest infrastructure systems in the world, handling billions of transactions every month.

Related Posts: The Cassandra Project

Amazon CloudFront vs. Rackspace CloudFiles CDN Performance

// October 23rd, 2009 // 9 Comments » // Community, Development

Angela Bartels, Cloud Maven

It’s not enough for us to tell you why you should use our services. We believe that it is more powerful when you hear real stories from customers about how they are using our cloud computing platform. It’s even better when customers are not only using us but also trying out another provider so they can do a comparative analysis. This is the cool stuff.

Chris Meller, an Amazon S3 and Rackspace Cloud Servers customer, chose Amazon CloudFront as his first choice for offloading all his static files to a CDN. Being a fan of our Cloud Servers offering, he didn’t want to rule out Rackspace Cloud Files. So he tried both CloudFront and Cloud Files and compared his results.

He found annoyances on both sides, as he writes:

“So with one minor annoyance on each side of the aisle I turned to hard quantifiable data, something every programmer loves. I loaded up my stylesheet on both CDNs and pointed a Pingdom check at each. The results were surprising.”

Click here to see his results.

Security Series: 3 Tips on How to Prevent an Attack on the User Environment

// October 21st, 2009 // No Comments » // Development

By Chad Wilson, Security Engineer

Securing your development and deployment environment is as important as securing your web server. Over the past few years, attackers have shifted away from compromising servers directly and now focus heavily on client-side vulnerabilities via code injection and cross-site scripting tactics.

This post aims to raise awareness and recommend actions you can take to prevent the attack scenario described below.

Scenario of an Attack on the User Environment

This is a scenario of an attack targeted at end users resulting in the compromise of their web site. The attack might start when the victim is browsing a legitimate web site that contains hidden iframes, preventing the site owner or visitor from noticing that the site has actually been compromised. The hidden iframe may contain JavaScript logic that determines the victim’s browser version, OS version, etc. to launch an exploit intelligently. In this example, the JavaScript determines that the victim might be vulnerable to an exploit in the Adobe Acrobat browser plug-in. This exploit will lead to arbitrary execution of code. In this case, shellcode is executed that installs a backdoor and registers itself as part of the botnet.

Once the backdoor/rootkit is in place, it registers with the Command and Control (CnC) server. The CnC server pushes a key logger and a program that searches for stored FTP user credentials.

Anything that is found is sent back to the CnC server. The CnC server stores the credentials from the zombie computer and uses them to connect to the victim’s FTP server. If only the FTP server name and user name are collected, brute-force password guessing can be used to break into accounts with weak and predictable passwords. Once an account is validated, it can be loaded into a program that connects and recursively infects all web site content by inserting hidden malicious code.

After infecting the victim’s computer and web site, the attacker now has one more computer in the botnet, and one more web site helping spread the malware. It doesn’t take long before thousands of websites and computers are compromised. These botnets are rented and sold on black markets for pay-per-click advertising and other profitable uses.

Here are 3 Quick Tips on How to Prevent this Attack:

1) Turn on automatic updates for your browser and plug-ins

Drive-by downloads and other variants of cross site scripting (XSS) attacks are commonly the first stage of an attack on end users, not servers, in an attempt to infect their computers with malware. In the example above, the victim simply visited a legitimate, but compromised, web site. Due to an unpatched browser plug-in, the victim’s computer and web site became compromised.

The popular browsers offer the ability to integrate add-ons and plug-ins. A browser plug-in is a handler for a media type that the browser itself cannot render, such as Adobe Flash or Apple QuickTime. Browsers usually have an “auto update” option or annoying pop-ups to help alert users about new patches and updates. Historically, plug-ins have been less “active” in notifying the end user about upgrades and updates. Because they are rarely patched and updated, yet are common to almost all browsers, plug-ins are attractive targets for attackers.

Client-side interpreters like JavaScript, ActiveX, and Java all add significant security risk. Be cautious when browsing sites that require them. Although these interpreters are limited in what they can do outside of the browser environment, it is trivial for attackers to write exploits that steal cookie and session data. Also, as in the example provided, these languages are commonly used to trigger an event that can be exploited.

Bottom line: Turn on automatic updates for your browser and plug-ins. Only use trusted plug-ins, and only allow client side execution from trusted sites.

2) Install Anti-Malware

Client-side protection is more important now than before, but the reason why is slightly different. Traditionally, people installed anti-virus software to protect their computers from destructive viruses that resulted in loss of services and/or data. These types of viruses and worms have not completely disappeared, but attackers have found that the same methods of propagation can be used for profit, for example by partnering with pay-per-click advertising. Rogue software, like spyware and adware, can be delivered along with the virus and can, for example, open pop-up ads when a user browses certain sites or types certain keywords.

Early spyware and adware collected browsing trends and sent them off to marketing agencies. This malware was usually bundled with other software that a user willingly installed. Today, the malware that infects a host usually comes from browsing a web site that was forced on the end user. Instead of creating benign tracking cookies, the malicious sites now aim to exploit a vulnerability in the browser or one of its plug-ins to allow execution of arbitrary code of the attacker’s choice. Once exploited, the workstation is compromised. Rootkits, key loggers, and trojans help maintain control over the host. Attackers have been using browser add-ons (not plug-ins) to conceal their rogue applications. Browser add-ons load when the browser starts. By staying hidden inside the browser executable, attackers can keep rogue programs out of the list of processes running on the host.

Some anti-malware can warn the user about web sites that are suspected to perform malicious activity.  Black lists created by Google Safe Browsing, Norton Safeweb, and McAfee Site Advisor often identify these websites.

Bottom Line: Anti-malware software protects your computer from being compromised by malware. Once compromised, it is easy to harvest information from the host about FTP credentials, CMS credentials, stored passwords, etc. All this information could be used to compromise your website.
Extra: Anti-malware products are often the target of attackers. Anti-malware software is common, and by exploiting it, the attacker can bypass all detection. It is for this reason that some people run two different brands of anti-malware software.

3) Secure FTP

FTP credentials are the primary target of many new malware attacks. After harvesting FTP credentials, the malware sends them off to a command and control server. The CnC server connects periodically to the FTP server and executes a search-and-inject routine finding all HTML documents and injecting malicious content. This aids in infecting other hosts, steadily recruiting more zombies for a botnet. (see illustration above)

Key loggers or similar daemons can be planted on a compromised/infected host so that updates are sent to the CnC server whenever passwords are changed or added. Changing the passwords is futile in this case for as long as the host remains compromised.

Malware is not the only concern with FTP. Attackers constantly hammer away with brute-force password attacks at FTP servers that they discover through various non-malicious methods. This is why choosing a strong password is important.

Finally, the FTP protocol is in itself insecure. Login credentials are passed over the Internet unencrypted. Attackers that have access to a subnet this data traverses can deploy packet sniffers and trivially harvest FTP credentials. Always use a secure channel for file transfers, such as SFTP. This protects you from such attacks.

Bottom Line: Choose strong passwords for FTP. If you are unsure about what qualifies as a strong password, use publicly available password strength analysis tools or password generators, but make sure they come from a trusted site. Configure your FTP client to always use SFTP instead of plain FTP. Do not store/save your FTP password in your FTP client unless you are positive that it is stored using strong encryption.
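If you would rather not rely on a third-party site, a password generator can also be run locally. The following is a minimal Python sketch using the standard library's `secrets` module. The function name, the default length of 20, and the rule that every character class must appear are assumptions chosen for illustration, not an official password policy.

```python
import secrets
import string


def generate_password(length=20):
    """Return a random password drawn from letters, digits, and punctuation.

    Hypothetical helper: the minimum length and the "one of each class"
    rule are illustrative choices, not a formal standard.
    """
    if length < 12:
        raise ValueError("use at least 12 characters")
    alphabet = string.ascii_letters + string.digits + string.punctuation
    while True:
        # secrets.choice draws from the operating system's secure random source.
        candidate = "".join(secrets.choice(alphabet) for _ in range(length))
        if (
            any(c.islower() for c in candidate)
            and any(c.isupper() for c in candidate)
            and any(c.isdigit() for c in candidate)
            and any(c in string.punctuation for c in candidate)
        ):
            return candidate


print(generate_password())
```

A password generated this way is hard to guess by brute force, but it offers no protection if a key logger on a compromised workstation captures it, so the earlier tips still apply.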

This completes this security series post. Look for more security series posts in the near future. If you have comments, questions, or suggestions, please comment here and I will respond quickly.

As always, our support team is available 24/7 via live chat, phone and email so please don’t hesitate to contact us to ask questions specific to your set up.

The Cassandra Project

// September 23rd, 2009 // 7 Comments » // Development

By Jonathan Ellis, System Architect

You may have heard about the Cassandra distributed database in recent articles or conferences. I’d like to explain what advantages Cassandra offers over traditional relational databases like MySQL or Oracle and why Rackspace has committed resources to the Cassandra project.

The Cassandra project was started by Facebook in 2007 to scale their internal applications, particularly Inbox Search. Earlier this year, they released it to the Apache Incubator, where other people from the community could become involved and start contributing. This allowed the project to move in a direction that serves general needs rather than just Facebook’s.

In March, I became the first outside committer to this Apache Incubator project. Eric Evans from Rackspace and Jun Rao from IBM Research soon followed, and we recently added Chris Goffinet from Digg. The community has grown from 5 people in the IRC channel in December to over 60.

Distributed vs. Relational Databases

Traditional relational databases are 30 years old, are well understood, and have a huge ecosystem of tools around them. For that reason, they are a compelling option when building your application. Postgres, MySQL, and Oracle are all relational databases that model a schema on entities and the relations between those entities. That’s a good, powerful programming model with interesting theoretical properties. But companies with large amounts of data have already gone past what you can reasonably fit on a single machine, even on high-end hardware, and it’s provably impossible to keep the traditional relational model, in particular the ACID properties, while scaling across multiple machines. Even if you’re willing to give up availability, scaling reads (via caching and replication) is difficult with relational databases, and scaling writes by partitioning is either very expensive, very painful from an application programming and operations standpoint, or both.

Cassandra is taking the approach that, given that you’re going to have to give up some parts of the relational model to scale, let’s start over and rethink things. Let’s add things like transparent replication and failover, built-in partitioning and load balancing, multiple data center support, and the ability to add capacity without ever disturbing applications running against the database.

Rackspace’s Involvement

The original Facebook team has been busy elsewhere, so the community has had to step up and take the initiative in moving Cassandra forward.  Cassandra is open source and I don’t want to downplay others’ contributions, including those from IBM Research, Digg, and Twitter as well as other companies and individuals, but I’m proud that Rackspace’s support has been instrumental in adding many important new features, fixing bugs, and getting out new releases.

Here are 3 reasons why Rackspace has committed resources:

1- As stated in previous posts by Erik Carlin, we are committed to an Open Cloud. With Amazon’s SimpleDB or Google App Engine’s datastore, you’re locked in. Cassandra presents an open alternative: you can write against Cassandra and deploy anywhere. That’s important.

2- We have a suite of Cloud products that are productized beyond just the raw Cloud Servers. Cassandra is interesting to us because we can use it under the hood to improve Cloud Sites and Cloud Files. And people are already starting to ask, “When can I just go to Rackspace and deploy a preconfigured Cassandra cluster?” It’s still early, but that’s definitely something we’re looking at.

3- Rackspace itself has a ton of data that we generate from our switches and routers and the rest of our infrastructure. Right now we are getting by with traditional monitoring and logging technologies, searching those logs and so forth. Cassandra will help us a lot with that as our volumes continue to increase. Our Mail & Apps teams are also very interested in using Cassandra to store mail messages and other data.

Finally, I want to emphasize that Cassandra is not a magic bullet. You can’t just take your SQL app and put it on Cassandra and expect it to work. It’s a different programming model: instead of modeling entities and relationships and just adding indexes to get performance, you need to think at a more basic level, asking “What information do I need to retrieve from each query?” and modeling your Cassandra schema accordingly. It’s a different way of thinking and does require new code to be written. It’s very much for people who have a lot of data that doesn’t fit on a single machine and are feeling the pain of traditional approaches to scaling it.
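To illustrate what "model around the query" can mean, here is a small Python sketch that uses plain dictionaries rather than any real Cassandra client API. The data, field names, and helper functions are all hypothetical. The query being served is "show a user's most recent message subjects", so the data is stored keyed by user, with one entry per timestamp, instead of as normalized rows that need an index and a sort at read time.

```python
from collections import defaultdict

# Relational-style rows (made-up sample data).
messages = [
    {"user": "alice", "sent_at": 1001, "subject": "Welcome"},
    {"user": "bob", "sent_at": 1002, "subject": "Invoice"},
    {"user": "alice", "sent_at": 1003, "subject": "Lunch?"},
    {"user": "alice", "sent_at": 1005, "subject": "Re: Lunch?"},
]


def build_inbox(rows):
    """Reshape rows into {user: {sent_at: subject}}, the shape the query needs."""
    inbox = defaultdict(dict)
    for row in rows:
        inbox[row["user"]][row["sent_at"]] = row["subject"]
    return inbox


def latest_subjects(inbox, user, limit):
    """Answer the one query this layout was designed for."""
    columns = inbox.get(user, {})
    newest_first = sorted(columns, reverse=True)[:limit]
    return [columns[sent_at] for sent_at in newest_first]


inbox = build_inbox(messages)
print(latest_subjects(inbox, "alice", 2))
```

The trade-off is that a second query, such as "all messages sent in a given hour across users", would need its own layout and its own writes. That is the new code the paragraph above refers to.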

We plan to write some other posts in the future detailing what a switch might look like for some sample applications.

Coding in the Cloud - Rule 6 - HTTP Includes

// September 22nd, 2009 // 1 Comment » // Development

By Adrian Otto

This continues my series on Rules for Coding in the Cloud. These are rules I’ve developed after watching applications encounter problems at scale when deployed on Cloud Sites.

Rule 6:  Never use HTTP include. Let me explain.

How does an HTTP include work?

You tell your PHP application, “I want to include a file.” For the file name, you supply a URL, which the server must download.  A client makes a connection to a PHP web server, the PHP web server runs an application, the application opens a file, and the file type is a URL. The server makes contact with another server, downloads this URL and puts the output into the PHP script.

Why is this a problem?

This results in not only a huge security problem, but also a performance problem. It also opens the door to a potentially disastrous outcome: an infinite loop in an elastic server environment. You can accidentally create an HTTP include which includes something from your own site, which includes something from your site, which includes something from your site, and… well, you get the idea. If you do that, you’ll get a single client connection, which will open a connection to itself, over and over, until you have 50,000 of them running in parallel. The last connection will then hit the limit on connections you’re allowed to create and the entire thing will roll all the way back. You’ll get a failure, and the whole application will proceed as if it never happened. Unfortunately, you will not be aware of this issue until you receive your bill with an outrageous amount of compute cycle usage. The cloud had to do huge amounts of work that you couldn’t even see! That’s really the scary part about this scenario, because the site looks like it’s working just fine. When you browse through your site, it comes up relatively quickly because the extra load simply scales out across the entire system. Meanwhile, The Rackspace Cloud is receiving alerts. You may not even know that your site has done the equivalent of 50,000 hits for every single hit.

In addition, you may also inadvertently involve someone else’s site. If you have two interdependent sites, the two may end up fighting back and forth, creating a massive loop.  And because the server is making the HTTP connection, the browser is completely unaware of it, so the browser’s anti-loop code won’t prevent it.  There’s no way to break the loop because there’s no way to see where it starts.
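The amplification described above can be modeled in a few lines. This Python sketch is a toy simulation, not a measurement of Cloud Sites. The page names and the `includes` mapping are made up, and the limit of 50,000 simply reuses the figure from the scenario above. It counts how many server-side requests a single client hit triggers when pages include each other over HTTP.

```python
def count_requests(includes, page, limit):
    """Count the requests caused by one client hit on `page`.

    `includes` maps each page to the pages it pulls in via HTTP include.
    `limit` caps how deep the chain of nested connections may go, standing
    in for the platform's connection limit (an assumption of this model).
    """
    total = 0
    stack = [(page, 1)]  # (page to request, nesting depth of that request)
    while stack:
        current, depth = stack.pop()
        total += 1
        if depth < limit:
            for child in includes.get(current, []):
                stack.append((child, depth + 1))
    return total


# A harmless layout: the index page includes a header that includes nothing.
healthy = {"/index.php": ["/header.php"]}

# A circular layout: the header includes the index page again.
circular = {"/index.php": ["/header.php"], "/header.php": ["/index.php"]}

print(count_requests(healthy, "/index.php", 50_000))
print(count_requests(circular, "/index.php", 50_000))
```

In this model the healthy layout costs 2 requests per hit, while the circular one only stops when it reaches the limit, so one visitor costs 50,000 requests. The visitor's browser sees a single page load in both cases.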

There is more than one way to do an HTTP include. One of them actually allows you to include PHP code from a remote URL and execute it as part of the local application. This feature (gaping hole) in PHP is actually disabled on Cloud Sites. What does work is using an fopen() call where the argument is a URL. This allows you to read data from that file handle and process it (potentially just printing it out to the browser). Try not to be tempted to eval() any of that output.

This may strike you as familiar advice. I mentioned a similar subject in Rule 4 - Avoid External Dependencies and included a code example of how to download content from a remote site on demand, cache a local copy, and provide non-blocking access to that data. This is a separate rule because I’ve seen it broken repeatedly, and not as an external dependency: the risk here is a circular internal (or external) dependency. People find reasons to HTTP include content from their own site, but please try not to! What seems like an innocent include eventually leads to the infinite loop situation described above.
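If an application really must read remote data on the server side, one cheap safeguard is to refuse any URL that points back at the site itself. The sketch below shows that check in Python rather than PHP. The function name and the `own_hosts` set are hypothetical, and the check does not catch the case of two different sites including each other, so it reduces the risk rather than removing it.

```python
from urllib.parse import urlsplit


def is_safe_remote_fetch(url, own_hosts):
    """Return True only for http(s) URLs whose host is not one of our own.

    `own_hosts` is a hypothetical set of lowercase host names served by
    this application.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname  # already lowercased by urlsplit, None if missing
    if not host:
        return False
    return host not in own_hosts


# example.com and example.org are placeholder domains.
own_hosts = {"www.example.com", "example.com"}

print(is_safe_remote_fetch("http://www.example.com/header.php", own_hosts))
print(is_safe_remote_fetch("http://feeds.example.org/news.xml", own_hosts))
```

The first call is rejected because it would make the site request itself. The second is allowed, and its result should still be cached and treated as untrusted data.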

Bottom line: Never use HTTP include.
