SAN FRANCISCO — For many MySQL database admins, Amazon Web Services represents the brave new world of cloud computing — one fraught with disappearing servers, disk I/O variability, and other shared resource challenges, not to mention questions regarding the security of data.
But for those seeking to tap the powerful flexibility that cloud computing affords MySQL database deployments, AWS should be viewed simply as the next natural step in data center evolution — one in which virtualization and commodity hardware are a given, and scalability, performance, and high availability are readily attained.
To help DBAs take advantage of what Amazon.com offers, we’ve compiled the following primer on managing MySQL databases in Amazon’s cloud. Along the way, you’ll find essential tools and techniques for migrating databases to AWS, tuning them for high performance and high availability, and avoiding the pitfalls of computing in the cloud.
Demystifying deployment and disappearing servers
Of all the steps to ensure a successful AWS deployment, spinning up a basic instance is the simplest. More challenging is dealing with the new reality of disappearing servers.
To get started, download and install the Amazon EC2 API tools. Next, set up your environment variables (EC2_HOME, EC2_PRIVATE_KEY, EC2_CERT; you may also need to set JAVA_HOME and PATH), spin up your instance, fetch the name of your new instance, and connect:
$ ec2-run-instances ami-31814f58 -k my-keypair -t t1.micro
$ ec2-describe-instances
$ ssh -i my-keypair ec2-user@ec2-108-22-252-188.compute-1.amazonaws.com
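Note that the API tools expect those environment variables to be exported before you run any commands. A minimal setup might look something like the following; the paths are placeholders for wherever you unpacked the tools and stored your key and certificate:
$ export EC2_HOME=/opt/ec2-api-tools
$ export EC2_PRIVATE_KEY=~/.ec2/pk-mykey.pem
$ export EC2_CERT=~/.ec2/cert-mykey.pem
$ export JAVA_HOME=/usr/lib/jvm/jre
$ export PATH=$PATH:$EC2_HOME/bin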
Next you’ll want to set up MySQL and a few base packages. Here we recommend Percona Server, Percona’s free distribution of MySQL:
$ rpm -Uhv http://www.percona.com/downloads/percona-release/percona-release-0.0-1.x86_64.rpm
$ yum install -y Percona-Server-shared-compat
$ yum install -y Percona-Server-server-55
$ yum install -y Percona-Server-client-55
$ yum install -y libdbi-dbd-mysql
Set the root password on your new MySQL instance, and you’re ready to begin.
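On a fresh install, that can be as simple as starting the service and setting a password; substitute your own password, or run mysql_secure_installation for a more thorough first pass:
$ service mysql start
$ mysqladmin -u root password 'your-new-password'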
Perhaps the most difficult shift you’ll make in adapting your thinking to cloud computing is around virtual machines themselves. AWS instances are built on virtualization technology, and although they sit on top of physical hardware that behaves much like the servers you’re used to, the virtual machines are not as reliable as physical ones. These machines can disappear out from under you and your application without notice. As such, redundancy, high availability, and scripted automation are key. Such pressures also put disaster recovery front and center: no longer relegated to a best-practices list of tasks you’ll get to when more pressing problems are resolved, it becomes an urgent priority.
Take, for example, what the operations team at Netflix decided to do. They wanted to meet this server reliability question head on, so they built a piece of software that plays Russian roulette with their servers. The resulting Chaos Monkey randomly knocks out servers in their production environment in the middle of the day. The exercise neatly illustrates the two sides of the AWS coin: on one hand, the servers aren’t as reliable; on the other, Amazon provides the tools with which to build in all the redundancy you need.
For example, Amazon makes using multiple data centers seamless. It organizes the objects in your environment (AMIs, snapshots, instances, and so forth) around availability zones and regions. There are currently seven regions to choose from outside of AWS GovCloud: Virginia, Oregon, California, Ireland, Singapore, Japan, and Brazil. Each region includes multiple data centers. Replicate your database data between regions, keep your server images fresh, and automate push-button rebuilds, and you’ll run on the most robust and fault-tolerant infrastructure possible.
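You can list the regions and their availability zones directly from the API tools; for example:
$ ec2-describe-regions
$ ec2-describe-availability-zones --region us-west-2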
Beware disk I/O variability
Relational databases often appear as unnecessarily complex beasts. But they’ve evolved the way they have to provide an array of great features. We can load them full of data, then mix and match that data asking complicated questions and selecting slices based on an endless set of conditions.
Behind the SQL language we use to fetch data and make changes, the underlying engine of a relational database — whether it’s MySQL, Oracle, or SQL Server — has the sole job of reading and writing data to disk, keeping the hottest (most requested) bits in memory, and finally protecting against server failure.
That said, disk I/O (the speed with which you read and write to that underlying storage system) is crucial. In the early days of Oracle, for example, before RAID was common, the database engine offered ways to stripe data across many disks, and it emphasized placing redo logs, which protect against data loss, on dedicated disks of their own. When RAID became widely available, DBAs could simply place all their data files on one volume and let the RAID array take care of the rest.
Enter the present day, where Amazon EBS (Elastic Block Store) virtualizes storage, allowing you to cut a slice out of a RAID array that sits somewhere on the network and attach it to any instance. This greatly enhances operational flexibility, allowing easy programmatic changes to hardware, but as with any new solution there are challenges.
EBS grapples with the limitations of a shared network resource. Many servers and many customers all use that network storage, so their resource usage can potentially impact your server. Amazon’s SLAs promise an average disk I/O throughput; however, that throughput can rise and fall dramatically in a given time period. This door swings both ways: when the disk subsystems are overused by multiple tenants, you’ll receive less of the resource; when they’re underutilized, you’ll receive more.
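You can watch this variability on a live instance with standard Linux tools. For instance, sampling extended I/O statistics every five seconds and keeping an eye on the await and %util columns will show the swings over the course of a day (the iostat utility ships in the sysstat package):
$ iostat -x 5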
Keep offsite backups
Replication is great for a high-availability solution that will cover you should your primary database fail you. But what if Amazon suffers a sitewide failure that affects networks or EBS in general? These scenarios can and do happen.
To guard against this, take advantage of Amazon’s global proliferation of data centers. Place your objects and instances in a variety of AWS regions. If you run your database on Amazon RDS, you can choose the Multi-AZ option, which replicates your data automatically to another availability zone. Or you can build your own MySQL primary database in your main availability zone and a replica slave in another region.
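Wiring up such a cross-region replica uses standard MySQL replication. A minimal sketch, assuming a replication user already exists on the primary and treating the host name and binary log coordinates below as placeholders taken from your own primary:
mysql> CHANGE MASTER TO MASTER_HOST='primary.example.com',
    ->   MASTER_USER='repl', MASTER_PASSWORD='secret',
    ->   MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=107;
mysql> START SLAVE;
mysql> SHOW SLAVE STATUS\G
The Seconds_Behind_Master field in the status output tells you how far the replica is lagging behind the primary.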
Further, consider a scenario where a court action or subpoena impacts Amazon. In the discovery phase, an expansive net could accidentally draw your servers into the mess, interrupting your business.
In either case, you’ll want a last-ditch insurance policy for restoring your application. The scripting involved varies with the complexity of your environment. For just a couple of servers, you can build your image the way you like it, then snapshot it and use that as your gold-standard server. When the instance spins up, it can run a user data script you supply to handle final steps or additional configuration.
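For instance, the API tools let you pass such a script at launch time. The AMI ID here reuses the earlier example, and the script contents are purely illustrative:
$ cat bootstrap.sh
#!/bin/bash
# Illustrative last-mile configuration: install packages, start services
yum install -y Percona-Server-server-55
service mysql start
$ ec2-run-instances ami-31814f58 -k my-keypair -t t1.micro -f bootstrap.sh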
For more complex environments, Scalr or RightScale can provide a templating solution for your automation needs. For even more sophisticated environments, or for operations teams ready to embrace configuration management to the fullest, Chef and Puppet may be options for you. With your automation scripts built, you can deploy a new server either in Amazon or with another cloud provider, then deploy your code and configurations. As a final step, you’ll want to restore your data; with an offsite backup, you’ll have that base covered.
Cache aggressively
Caching is crucial in the cloud, and you can ensure high performance through four modes of caching: caching in the browser, caching objects, caching queries, and caching data.
Caching in the browser: All items in a page that the browser fetches can be cached. This includes images, HTML objects, code, and so forth. Although many browsers are configured to perform a lot of caching by default, this is often not sufficient. That’s because Web pages are returned to the browser with a cache-control setting that is primarily commanded and configured by the Web server itself. This is good news for the operations team, as it is another switch or dial that can be turned to speed things up.
Typically, cache control should be set with a maximum age of one week. This becomes the default for objects sent to the browser. They can still be expired manually in code and application logic as necessary.
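How you set that depends on your Web server. As one illustration, assuming Apache with mod_expires enabled, a one-week default looks like this:
ExpiresActive On
ExpiresDefault "access plus 1 week"
Individual content types can still be given shorter or longer lifetimes with ExpiresByType.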
Caching objects: Object caches are a great addition to the caching layer at the Web server tier. Application languages such as PHP and Ruby include libraries for making use of memcache, a popular object cache that stores name-value pairs. Whenever data is requested from the database, the object cache is checked first. If the data is there, it can be returned as much as 100 times faster: no network round trip to the database box and no complex queries to execute.
If the data is not available in the object cache, a cache miss happens, and a request to the database will follow. The results returned will then be placed in the object cache for future use.
In AWS, a systemwide solution called ElastiCache is available. This technology is built on memcache but does not require individual configuration, monitoring, or tuning. Simply configure your application for this object cache — you’re off and running. Scalability is built in as a bonus.
Caching queries: Oh those queries, requiring heaps of attention to get them just right and as much attention to keep them working properly! That’s why the query cache should be another tool in your arsenal of performance equipment. Be sure it is enabled in MySQL and configured with sufficient memory so that useful results aren’t evicted before their useful life has ended.
The query cache stores the text of each query together with its result set. If the database finds an exact match for a query it has already run, it sends the cached results back to the Web server without doing any real work at all.
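A reasonable starting configuration, with the sizes here as placeholders to tune against your own workload, is to enable the cache in my.cnf:
query_cache_type = 1
query_cache_size = 64M
Then watch its effectiveness over time:
$ mysql -u root -p -e "SHOW STATUS LIKE 'Qcache%'"
A steadily rising Qcache_lowmem_prunes count suggests the cache is too small for your working set.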
Caching database data: MySQL’s most important setting here is the buffer pool, sized by innodb_buffer_pool_size, which caches index and table data. Data that is already cached incurs only logical reads, so you avoid a ton of I/O against your data files. As we’ve mentioned, because of the challenges of EBS virtualized storage, this is doubly important in the Amazon AWS environment.
Scaling the database tier
The Web server tier can be set up with autoscaling in AWS because the data on each server isn’t changing dynamically. Sure, some files and images get uploaded or new code deployed, but that happens intermittently.
Scaling the database tier, however, typically involves a bit of finesse and careful execution, as it poses specific challenges. The most widely used method is to maintain a single master database. This database receives all changes, such as INSERT, UPDATE, and DELETE statements, from the application. All SELECT activity from the application gets sent to one or more slaves. These slaves can be reimaged regularly, so you always have an up-to-date image of your MySQL slave at the ready. When you need to add read capacity, simply spin up the latest image; the instance will boot, MySQL will start, and the new slave will connect to the single master just as the others do.
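Keeping that slave image fresh can be as simple as periodically imaging a healthy replica; the instance ID and names below are placeholders:
$ ec2-create-image i-1a2b3c4d -n "mysql-slave-gold" -d "latest MySQL slave image"
By default this reboots the instance, so image a slave you can briefly take out of rotation, or pass --no-reboot and accept a crash-consistent image.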
If the master database is running out of capacity, you can scale that server vertically. Set up a new, larger EC2 instance in AWS and configure it to replicate from the current master. Once it has caught up, point all of your existing slaves at the new master and fail the application over to the new, larger instance. Voilà, you’ve scaled vertically with zero downtime.
Actionable advice for databases on AWS
Now that you have a handle on the AWS environment and some of the hurdles you will encounter along the way, here is a targeted list of considerations for your database migration to the cloud, in particular around performance and disaster recovery.
Performance tips
You’ll want to manage performance even more aggressively than you already do, to maintain the same level of response your application and users already expect. Here are four tips to help keep your database instances humming:
1. Tune your MySQL server settings. When tuning your server, be sure your tables are InnoDB. Use innodb_file_per_table to avoid contention on a single tablespace datafile. Set innodb_buffer_pool_size to use a large portion of physical memory. Set innodb_log_file_size and innodb_log_buffer_size appropriately. Also set tmp_table_size and max_heap_table_size sufficiently to keep temp tables in memory. In addition, set your per-session controls: sort_buffer_size, join_buffer_size, read_buffer_size, and read_rnd_buffer_size.
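As a rough starting point, assuming a dedicated database instance with around 8GB of RAM (every value here is a placeholder to tune against your own workload), the relevant my.cnf section might look like this:
[mysqld]
innodb_file_per_table
innodb_buffer_pool_size = 6G
innodb_log_file_size    = 256M
innodb_log_buffer_size  = 8M
tmp_table_size          = 64M
max_heap_table_size     = 64M
sort_buffer_size        = 2M
join_buffer_size        = 2M
read_buffer_size        = 1M
read_rnd_buffer_size    = 2M
Keep in mind that on MySQL 5.5, changing innodb_log_file_size requires a clean shutdown and removal of the old log files before the server will start with the new size.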
2. Tune queries aggressively. This is far and away the most important aspect of keeping your server humming. It also tends to be a challenging one. Start by enabling the slow query log to capture queries that run longer than, say, 0.5 second. Lower the threshold as you clean out the biggest culprits and heaviest resource consumers. Use the pt-query-digest tool to summarize the log and isolate the worst offenders. Add indexes as appropriate, and consider multicolumn and covering indexes to avoid additional table lookups.
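For example, you can enable the slow query log at runtime and then summarize it with pt-query-digest; the log path is a placeholder for wherever your server writes it:
$ mysql -u root -p -e "SET GLOBAL slow_query_log = 1; SET GLOBAL long_query_time = 0.5"
$ pt-query-digest /var/lib/mysql/slow-query.log > slow-report.txt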
3. Use xfs and RAID 10 with your EBS volumes. Although ext3 is a very common file system and pretty much the standard for Linux servers, xfs is a much higher-performance file system, and it will make a difference. It’s a small step when you first set up your volumes and requires little additional work to maintain afterward.
In addition, you can use software RAID over EBS volumes to mitigate much of the disk I/O throughput variability. As with physical RAID arrays, EBS volumes benefit from RAID 10, which lets you stripe across a set of virtual disks and swap disks out of the set as necessary.
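A rough sketch of assembling such an array, assuming four EBS volumes already attached to the instance as /dev/xvdf through /dev/xvdi and the mdadm and xfsprogs packages installed (device names and the mount point are placeholders):
$ mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi
$ mkfs.xfs /dev/md0
$ mkdir -p /var/lib/mysql
$ mount -t xfs -o noatime /dev/md0 /var/lib/mysql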
4. Use Percona Server. Percona ships its own distribution of the MySQL server, which is fully drop-in compatible with the Community Edition shipped by Oracle. What’s more, Percona Server incorporates additional performance tweaks, bug fixes, and community patches that have often not yet made it into Oracle’s shipping version. The server is rock-solid and runs blazingly fast.
Disaster recovery tips
As mentioned, disaster recovery pushes to the foreground when computing in the cloud. You’ll want to emphasize automation and take advantage of Amazon’s availability zones in your DR plan. Here are three essential tips for keeping your environment online:
1. Automate push-button server rebuilds. As we mentioned previously, virtual instances can and will fail. It’s essential that you automate the process of rebuilding those instances. The upside is that this forces best practices around disaster recovery. What’s more, you get essentially free disaster recovery, as you don’t need to keep those backup servers sitting idle until that fateful day.
2. Replicate to alternate availability zones and regions. Amazon has had real availability zone outages, and customers who built everything in a single availability zone were hit the hardest. In the AWS world, you have multiple data centers at your fingertips; use them wisely and you’ll build additional resilience into your environment and protect against even the most serious outages.
3. Keep offsite backups of your data. Yes, you want to keep a replicated version of your database, but it’s also a good idea to keep regular offsite backups. If you really want to take it to the next level of reliability, engage a second cloud provider, and keep a live copy of your database refreshing there. Then build automation scripts to bring up your entire infrastructure in the second provider’s cloud.
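A bare-bones version of that offsite backup might be a nightly logical dump shipped to a host outside your primary provider; the paths and host name here are placeholders, and for large datasets a hot-backup tool such as Percona XtraBackup is a better fit than mysqldump:
$ mysqldump --single-transaction --all-databases | gzip > /backups/nightly-$(date +%F).sql.gz
$ rsync -avz /backups/ backup@offsite.example.com:/backups/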