The Disk’s The Thing! Exchange 2010 Storage Essays, part 2

Greetings, readers! When I first posted From Whence Redundancy? (part 1 of this series of essays on Exchange 2010 storage) I’d intended to follow up with other posts a bit faster than I have been. So much for intentions; let us carry on.

In part 1, I began the process of talking about how I think the new Exchange 2010 storage options will play out in live Exchange deployments over the next several years. The first essay in this series discussed what is I believe the fundamental question at the heart an Exchange 2010 storage design: at what level will you ensure the redundancy of your Exchange mailbox databases? The traditional approach has used RAID at the disk level, but Exchange 2010 DAGs allow you to deploy mailbox databases in JBOD configurations. While I firmly believe that’s the central question, answering it requires us to dig under the hood of storage.

With Exchange 2010, Microsoft specifically designed Exchange mailbox servers to be capable of using the lowest common denominator of server storage: a directly attached storage (DAS) array of 7200 RPM SATA disks in a Just a Box of Disks (JBOD) configuration (what I call DJS). Understanding why they’ve made this shift requires us to understand more about the disk drive technology. In this essay, part 2 of this series, let’s talk about disk technology and find out how Fibre Channel (FC), Serially Attached SCSI (SAS), and Serial Advanced Technology Attachment (SATA) disk drives are the same – and more importantly, what slight differences they have and what that means for your Exchange systems.

Exchange Storage SATA vs SAS

So here’s the first dirty little secret: for the most part, all disks are the same. Regardless of what type of bus they use, what form factor they are, what capacity they are, and what speed they rotate at, all modern disks use the same construction and principles:

  • They all have one or more thin rotating platters coated with magnetic media; the exact number varies by form factor and capacity. Platters look like mini CD-ROM disks, but unlike CDs, platters are typically double-sided. Platters have a rotational speed measured in revolutions per minute (RPMs).
  • Each side of a platter has an associated read-write head. These heads are on a single-track arm that moves in toward the hub of the platter or out towards the rim. The heads do not touch the platter, but float very close to the surface. It takes a measurable fraction of a second for the head to relocate from one position to another; this is called its seek time.
  • The circle described by the head’s position on the platter is called a track. In a multi-platter disk, the heads move in synchronization (there’s no independent tracking per platter or side). As a result, each head is on the same track at the same time, describing a cylinder.
  • Each drive unit has embedded electronics that implement the bus protocol, control the rotational speed of the platters, and translate I/O requests into the appropriate commands to the heads. Even though there are different flavors, they all perform the same basic functions.

If you would like a more in-depth primer on how disks work, I recommend starting with this article. I’ll wait for you.

Good? Great! So that’s how all drives are the same. It’s time to dig into the differences. They’re relatively small, but small differences have a way of piling up. Take a look at Table 1 which summarizes the differences between various FC, SATA, and SAS disks, compared with legacy PATA 133 (commonly but mistakenly referred to as IDE) and SCSI Ultra 320 disks:

Table 1: Disk parameter differences by disk bus type

Type Max wire bandwidth(Mbit/s) Max data transfer(MB/s)
PATA 133 1,064 133.5
SCSI Ultra 320 2,560 320
SATA-I 1,500 150
SATA-II 3,000 300
SATA 6 Gb/s 6,000 600
SAS 150 1,500 150
SAS 300 3,000 300
FC (copper) 4,000 400
FC (optic) 10,520 2,000

 

As of this writing, the most common drive types you’ll see for servers are SATA-II, SAS 300, and FC over copper. Note that while SCSI Ultra 320 drives in theory have a maximum data transfer higher than either SATA-II or SAS 300, in reality that bandwidth is shared among all the devices connected to the SCSI bus; both SATA and SAS have a one-to-one connection between disk and controller, removing contention. Also remember that SATA is only a half-duplex protocol, while SAS is a full-duplex protocol. SAS and FC disks use the full SCSI command set to allow better performance when multiple I/O requests are queued for the drive, whereas SATA uses the ATA command set. Both SAS and SATA implement tagged queuing, although they use two different standards (each of which has its pros and cons).

The second big difference is the average access time of the drive, which is the sum of multiple factors:

  • The average seek time of the heads. The actuator motors that move the heads from track to track are largely the same from drive to drive and thus the time contributed to the drive’s average seek time by just the head movements is roughly the same from drive to drive. What varies is the length of the head move; is it moving to a neighboring track, or is it moving across the entire surface? We can average out small track changes with large track changes to come up with idealized numbers.
  • The average latency of the platter. How fast the platters are spinning determines how quickly a given sector containing the data to be read (or where new data will be written) will move into position under the head once it’s in the proper track. This is a simple calculation based on the RPM of the platter and the observed average drive latency. We can assume that a given sector will move into position, on average, in no more than half a rotation. This gives us 30 seconds out of each minute of rotation, or 30,000 ms, into which we can divide the drive’s actual rotation.
  • The overhead caused by the various electronics and queuing mechanisms of the drive electronics, including any power saving measures such as reducing the spin rate of the drive platters. Although electricity is pretty fast and on-board electronics are relatively small circuits, there may be other factors (depending on the drive type) that may introduce delays into the process of fulfilling the I/O request received from the host server.

What has the biggest impact is how fast the platter is spinning, as shown in Table 2:

Table 2: Average latency caused by rotation speed

Platter RPM Average latency in ms
7,200 4.17
10,000 3
12,000 2.5
15,000 2

 

(As an exercise, do the same math on the disk speeds for the average laptop drives. This helps explain why laptop drives are so much slower than even low-end 7,200 RPM SATA desktop drives.)

Rather than painfully take you through the result of all of these tables and calculations step by step, I’m simply going to refer you to work that’s already been done. Once we know the various averages and performance metrics, we can figure out how many I/O operations per second (IOPS) a given drive can sustain on average, according to the type, RPMs, and nature of the I/O (sequential or random). Since Microsoft has already done that work for us as part of the Exchange 2010 Mailbox Role Calculator (version 6.3 as of this writing, I’m going to simply use the values there. Let’s take a look at how all this plays out in Table 3 by selecting some representative values.

Table 3: Drive IOPS by type and RPM

Size Type RPM Average Random IOPS
3.5” SATA 5,400 50
2.5” SATA 5,400 55
3.5” SAS 5,400 52.5
3.5” SAS 5,900 52.5
3.5” SATA 7,200 55
2.5” SATA 7,200 60
3.5” SAS 7,200 57.5
2.5” SAS 7,200 62.5
3.5” FC/SCSI/SAS 10,000 130
2.5” SAS 10,000 165
3.5” FC/SCSI/SAS 15,000 180
2.5” SAS 15,000 230

 

There are three things to note about Table 3.

  1. These numbers come from Microsoft’s Exchange 2010 Mailbox Sizing Calculator and are validated across vendors through extensive testing in an Exchange environment. While there may be minor variances between drive model and manufacturers and these number may seem pessimistic according to calculated IOPS number published for individual drives, these are good figures to use in the real world. Using calculated IOPS numbers can lead both to a range of figures, depending on the specific drive model and manufacturer, as well as to overestimating the amount of IOPS the drive will actually provide to Exchange.
  2. For the most part, SAS and FC are indistinguishable from the IOPs point of view. Regardless of the difference between the electrical interfaces, the drive mechanisms and I/O behaviors are comparable.
  3. Sequential IOPS are not listed; they will be quite a bit higher than the random IOPS (that same 7,200RPM SATA drive can provide 300+ IOPS for sequential operations). The reason is simple; although a lot of Exchange 2010 I/O has been converted from random to sequential, there’s still some random I/O going on. That’s going to be the limiting factor.

The IOPS listed are per-drive IOPS. When you’re measuring your drive system, remember that the various RAID configurations have their own IOPS overhead factor that will consume a certain number

There are of course some other factors that we need to consider, such as form factor and storage capacity. We can address these according to some generalizations:

  • Since SAS and FC tend to have the same performance characteristics, the storage enclosure tends to differentiate between which technology is used. SAS enclosures can often be used for SATA drives as well, giving more flexibility to the operator. SAN vendors are increasingly offering SAS/SATA disk shelves for their systems because paying the FC toll can be a deal-breaker for new storage systems.
  • SATA disks tend to have a larger storage capacity than SAS or FC disks. There are reasons for this, but the easiest one to understand is that SAS, being traditionally a consumer technology, has a lower duty cycle and therefore lower quality control specifications that must be met.
  • SATA disks tend to be offered with lower RPMs than SAS and FC disks. Again, we can acknowledge that quality control plays a part here – the faster a platter spins, the more stringently the drive components need to meet their specifications for a longer period of time.
  • 2.5” drives tend to have lower capacity than their 3.5” counterparts. This makes sense – they have smaller platters (and may have fewer platters in the drive).
  • 2.5” drives tend to use less power and generate less heat than equivalent 3.5” drives. This too makes sense – the smaller platters have less mass, requiring less energy to sustain rotation.
  • 2.5” drives tend to permit a higher drive density in a given storage chassis while using only fractionally more power. Again, this makes sense based on the previous two points; I can physically fit more drives into a given space, sometimes dramatically so.

Let’s look at an example. A Supermicro SC826 chassis holds 12 3.5” drives with a minimum of 800W power while the equivalent Supermicro SC216 chassis holds 24 2.5” drives with a minimum of 900W of power in the same 2Us of rack space. Doubling the number of drives makes up for the capacity difference between the 2.5” and 3.5” drives, provides twice as many spindles and allows a greater aggregate IOPS for the array, and only requires 12.5% more power.

The careful reader has noted that I’ve had very little to say about capacity in this essay, other than the observation above that SATA drives tend to have larger capacities, and that 3.5” drives tend to be larger than 2.5” drives. From what I’ve seen in the field, the majority of shops are just now looking at 2.5” drive shelves, so it’s safe to assume 3.5” is the norm. As a result, the 3.5” 7,200 RPM SATA drive represents the lowest common denominator for server storage, and that’s why the Exchange product team chose that drive as the performance bar for DJS configurations.

Exchange has been limited by performance (IOPS) requirements for most of its lifetime; by going after DJS, the product team has been able to take advantage of the fact that the capacity of these drives is the first to grow. This is why I think that Microsoft is betting that you’re going to want to simplify your deployment, aim for big, cheap, slow disks, and let Exchange DAGs do the work of replicating your data.

Now that we’ve talked about RAID vs. JBOD and SATA vs. SAS/FC, we’ll need to examine the final topic: SAN vs. DAS. Look for that discussion in Part 3, which will be forthcoming.

7 Comments



  1. Andy Schan

    Great stuff, Devin – thanks!


  2. Pavel

    Any updates on your Exchange 2010 Storage Essays, Part 3 post?

  3. Russell Wilson

    Excellent blog is part 3 available yet?

Add to the legend