Things They Forgot

Pat Robertson’s comments on Haiti basically boil down to “they got what was coming to them.” Mr. Robertson, I think you forgot Matthew 25:34-46 (KJV):

34Then shall the King say unto them on his right hand, Come, ye blessed of my Father, inherit the kingdom prepared for you from the foundation of the world: 35For I was an hungred, and ye gave me meat: I was thirsty, and ye gave me drink: I was a stranger, and ye took me in: 36Naked, and ye clothed me: I was sick, and ye visited me: I was in prison, and ye came unto me. 37Then shall the righteous answer him, saying, Lord, when saw we thee an hungred, and fed thee? or thirsty, and gave thee drink? 38When saw we thee a stranger, and took thee in? or naked, and clothed thee? 39Or when saw we thee sick, or in prison, and came unto thee? 40And the King shall answer and say unto them, Verily I say unto you, Inasmuch as ye have done it unto one of the least of these my brethren, ye have done it unto me.

41Then shall he say also unto them on the left hand, Depart from me, ye cursed, into everlasting fire, prepared for the devil and his angels: 42For I was an hungred, and ye gave me no meat: I was thirsty, and ye gave me no drink: 43I was a stranger, and ye took me not in: naked, and ye clothed me not: sick, and in prison, and ye visited me not. 44Then shall they also answer him, saying, Lord, when saw we thee an hungred, or athirst, or a stranger, or naked, or sick, or in prison, and did not minister unto thee? 45Then shall he answer them, saying, Verily I say unto you, Inasmuch as ye did it not to one of the least of these, ye did it not to me. 46And these shall go away into everlasting punishment: but the righteous into life eternal.

Rush Limbaugh may have forgotten the above as well. His claims that Obama is using humanitarian aid for political profit definitely seem to have forgotten Matthew 7:15-20:

15 Beware of false prophets, which come to you in sheep’s clothing, but inwardly they are ravening wolves. 16 Ye shall know them by their fruits. Do men gather grapes of thorns, or figs of thistles? 17 Even so every good tree bringeth forth good fruit; but a corrupt tree bringeth forth evil fruit. 18 A good tree cannot bring forth evil fruit, neither can a corrupt tree bring forth good fruit. 19 Every tree that bringeth not forth good fruit is hewn down, and cast into the fire. 20 Wherefore by their fruits ye shall know them.

If that last passage seems a bit murky, here’s a quote from C. S. Lewis’s The Last Battle (the last book of the Chronicles of Narnia) that I have always loved. The speaker is a Calormene soldier, Emeth, who has had a life-changing encounter with Aslan during the last hours of Narnia:

He answered, Child, all the service thou hast done to Tash, I account as service done to me. Then by reasons of my great desire for wisdom and understanding, I overcame my fear and questioned the Glorious One and said, Lord, is it then true, as the Ape said, that thou and Tash are one? The Lion growled so that the earth shook (but his wrath was not against me) and said, It is false. Not because he and I are one, but because we are opposites, I take to me the services which thou hast done to him. For I and he are of such different kinds that no service which is vile can be done to me, and none which is not vile can be done to him. Therefore if any man swear by Tash and keep his oath for the oath’s sake, it is by me that he had truly sworn, though he know it not, and it is I who reward him. And if any man do a cruelty in my name, then, though he says the name Aslan, it is Tash whom he serves and by Tash his deed is accepted. Dost thou understand, Child?

By their fruits ye shall know them…whatever their claims.

Poor Google? Not.

Since yesterday, the Net has been abuzz because of Google’s blog posting about their discovery they were being hacked by China. Almost every response I’ve seen has focused on the attempted hacking of the mailboxes of Chinese human rights activists.

That’s exactly where Google wants you to focus.

Let’s take a closer look at their blog post.

Paragraph 1:

In mid-December, we detected a highly sophisticated and targeted attack on our corporate infrastructure originating from China that resulted in the theft of intellectual property from Google.

Paragraph 2:

As part of our investigation we have discovered that at least twenty other large companies from a wide range of businesses–including the Internet, finance, technology, media and chemical sectors–have been similarly targeted.

Whoa. That’s some heavy-league stuff right there. Coordinated, targeted commercial espionage across a variety of vertical industries. Google first accuses China of stealing its intellectual property, then says that they weren’t the only ones. Mind you, industry experts – including the United States governmenthave been saying the same thing for years. Cries of ‘China hacked us!” happen relatively frequently in the IT security industry, enough so that it blends into the background noise after awhile.

My question is why, exactly, Google thought this wouldn’t happen to them? They’re a big fat juicy target on many levels. Gmail with thousands upon thousands of juicy mailboxes? Check! Search engine code and data that allows sophisticated monitoring and manipulation of Internet queries? Check! Cloud-based office documents that just might contain some competitive value? Check!

My second question is, why, exactly, is Google trying to shift the focus of the story from the IP theft (which by their own press report was successful) and cloak their actions in the “oh, noes, China tried to grab dissidents’ email” moral veil they’re using?

Paragraph 3:

Second, we have evidence to suggest that a primary goal of the attackers was accessing the Gmail accounts of Chinese human rights activists. Based on our investigation to date we believe their attack did not achieve that objective. Only two Gmail accounts appear to have been accessed, and that activity was limited to account information (such as the date the account was created) and subject line, rather than the content of emails themselves.

Two accounts, people, and the attempt wasn’t even fully successful. And the moral outrage shimmering from the screen in Paragraph 4, when Google says that “dozens” of accounts were accessed by third parties not through any sort of security flaw in Google, but rather through what is probably malware, is enough to knock you over.

Really, Google? You’re just now tumbling to the fact that people’s GMail accounts are getting hacked through malware?

I don’t buy the moral outrage. I think the meat of the matter is back in paragraph 1. I believe that the rest of the outrage is a smokescreen to repaint Google into the moral high ground for their actions, when from the sidelines here it certainly looks like Google chose knowingly to play with fire and is now suddenly outraged that they, too, got burned.

Google, you have enough people willing to play along with your attempt to be the victim. I’m not one of them. You compromised human rights principles in 2006 and knowingly put your users into harm’s way. “Do no evil,” my ass.

From Whence Redundancy? Exchange 2010 Storage Essays, part 1

Updated 4/13 with improved reseed time data provided by item #4 in the Top 10 Exchange Storage Myths blog post from the Exchange team.

Over the next couple of months, I’d like to slowly sketch out some of the thoughts and impressions that I’ve been gathering about Exchange 2010 storage over the last year or so and combine them with the specific insights that I’m gaining at my new job. In this inaugural post, I want to tackle what I have come to view as the fundamental question that will drive the heart of your Exchange 2010 storage strategy: will you use a RAID configuration or will you use a JBOD configuration?

In the interests of full disclosure, the company I work for now is a strong NetApp reseller, so of course my work environment is conducive to designing Exchange in ways that make it easy to sell the strengths of NetApp kit. However, part of the reason I picked this job is precisely because I agree with how they address Exchange storage and how I think the Exchange storage paradigm is going to shake out in the next 3-5 years as more people start deploying Exchange 2010.

In Exchange 2010, Microsoft re-designed the Exchange storage system to target what we can now consider to be the lowest common denominator of server storage: a directly attached storage (DAS) array of 7200 RPM SATA disks in a Just a Box of Disks (JBOD) configuration. This DAS/JBOD/SATA (what I will now call DJS) configuration has been an unworkable configuration for Exchange for almost its entire lifetime:

  • The DAS piece certainly worked for the initial versions of Exchange; that’s what almost all storage was back then. Big centralized SANs weren’t part of the commodity IT server world, reserved instead for the mainframe world. Server administrators managed server storage. The question was what kind of bus you used to attach the array to the server. However, as Exchange moved to clustering, it required some sort of shared storage. While a shared SCSI bus was possible, it not only felt like a hack, but also didn’t scale well beyond two nodes.
  • SATA, of course, wasn’t around back in 1996; you had either IDE or SCSI. SCSI was the serious server administrator’s choice, providing better I/O performance for server applications, as well as faster bus speeds. SATA, and its big brother SAS, both are derived from the lessons that years of SCSI deployments have provided. Even for Exchange 2007, though, SATA’s poor random I/O performance made it unsuitable for Exchange storage. You had to use either SAS or FC drives.
  • RAID has been a requirement for Exchange deployments, historically, for two reasons: to combine enough drive spindles together for acceptable I/O performance (back when disks were smaller than mailbox databases), and to ensure basic data redundancy. Redundancy was especially important once Exchange began supporting shared storage clustering and required both aggregate I/O performance only achievable with expensive disks and interfaces as well as the reduced chance of a storage failure being a single point of failure.

If you look at the marketing material for Exchange 2010, you would certainly be forgiven for thinking that DJS is the only smart way to deploy Exchange 2010, with SAN, RAID, and non-SATA systems supported only for those companies caught in the mire of legacy deployments. However, this isn’t at all true. There are a growing number of Exchange experts (and not just those of us who either work for storage vendors or resell their products) who think that while DJS is certainly an interesting option, it’s not one that’s a good match for every customer.

In order to understand why DJS is truly possible in Exchange 2010, and more importantly begin to understand where DJS configurations are a good fit and what underlying conditions and assumptions you need to meet in order to get the most value from DJS, we need to separate these three dimensions and discuss them separately.

JBOD vs RAID

While I will go into more detail on all three dimensions at later date, I want to focus on the JBOD vs.. RAID question now. If you need some summaries, then check out fellow Exchange MVP (and NetApp consultant) John Fullbright’s post on the economics of DAS vs. SAN as well as Microsoft’s Matt Gossage and his TechEd 2009 session on Exchange 2010 storage. Although there are good arguments for diving into drive technology or storage connection debates, I’ve come to believe that the central philosophy question you must answer in your Exchange 2010 design is at what level you will keep your data redundant. Until Exchange 2007, you had only one option: keeping your data redundant at the disk controller level. Using RAID technologies, you had two copies of your data[1]. Because you had a second copy of the data, shared storage clustering solutions could be used to provide availability for the mailbox service.

With Exchange 2007’s continuous replication features, you could add in data redundancy at the application level and avoid the dependency of shared storage; CCR creates two copies, and SCR can be used to create one or more additional copies off-site. However, given the realities of Exchange storage, for all but the smallest deployments, you had to use RAID to provide the required number of disk spindles for performance. With CCR, this really meant you were creating four copies; with SCR, you were creating an additional two copies for each target replica you created.

This is where Exchange 2010 throws a wrench into the works. By virtue of a re-architected storage engine, it’s possible under specific circumstances to design a mailbox database that will fit on a single drive while still providing acceptable performance. The reworked continuous replication options, now simplified into the DAG functionality, create additional copies on the application level. If you hit that sweet spot of the 1:1 database to disk ratio, then you only have a single copy of the data per replica and can get an n-1 level of redundancy, where n is the number of replicas you have. This is clearly far more efficient for disk usage…or is it? The full answer is complex, the simple answer is, “In some cases.”

In order to get the 1:1 database to disk ratio, you have to follow several guidelines:

  1. Have at least three replicas of the database in the DAG, regardless of which sites they are in. Doing so allows you to place both the EDB and transaction log files on the same physical drive, rather than separating them as you did in previous versions of Exchange.
  2. Ensure that you have at least two replicas per site. The reason for this is that unlike Exchange 2007, you can reseed a failed replica from another passive copy. This allows you to avoid reseeding over your WAN, which is something you do not want to do.
  3. Size your mailbox databases to include no more users than will fit in the drive’s performance envelope. Although Exchange 2010 converts many of the random I/O patterns to sequential, giving better performance, not all has been converted, so you still have to plan against the random I/O specs.
  4. Ensure that write transactions can get written successfully to disk. Use a battery-backed caching controller for your storage array to ensure the best possible performance from the disks. Use write caching for the physical disks, which means ensuring each server hosting a replica has a UPS.

At this point, you probably have disk capacity to spare, which is why Exchange 2010 allows the creation of archive mailboxes in the same mailbox database. All of the user’s data is kept at the same level of redundancy, and the archived data – which is less frequently accessed than the mainline data – is stored without additional significant disk or I/O penalty. This all seems to indicate that JBOD is the way to go, yes? Two copies in the main site, two off-site DR copies, and I’m using cheaper storage with larger mailboxes and only four copies of my data instead of the minimum of six I’d have with CCR+SCR (or the equivalent DAG setup) on RAID configurations.

Not so fast. Microsoft’s claims around DJS configurations usually talk about the up-front capital expenditures. There’s more to a solid design than just the up-front storage price tag, and even if the DJS solution does provide savings in your situation, that is only the start. You also need to think about the lifetime of your storage and all the operational costs. For instance, what happens when one of those 1:1 drives fails?

Well, if you bought a really cheap DAS array, your first indication will be when Exchange starts throwing errors and the active copy moves to one of the other replicas. (You are monitoring your Exchange servers, right?) More expensive DAS arrays usually directly let you know that a disk failed. Either way, you have to replace the disk. Again, with a cheap white-box array, you’re on your own to buy replacement disks, while a good DAS vendor will provide replacements within the warranty/maintenance period. Once the disk is replaced, you have to re-establish the database replica. This brings us to the wonderful manual process known as database reseeding, which is not only a manual task, but can take quite a significant amount of time – especially if you made use of archival mailboxes and stuffed that DJS configuration full of data. Let’s take a closer look at what this means to you.

[Begin 4/13 update]

There’s a dearth of hard information out there about what types of reseed throughputs we can achieve in the real world, and my initial version of this post where I assumed 20GB/hour as an “educated guess” earned me a bit of ribbing in some quarters. In my initial example, I said that if we can reseed 20GB of data per hour (from a local passive copy to avoid the I/O hit to the active copy), that’s 10 hours for a 200GB database, 30 hours for a 600GB database, or 60 hours –two and a half days! – for a 1.2 TB database[2].

According to the Top 10 Exchange Storage Myths post on the Exchange team blog, 20GB/hour is way too low; in their internal deployments, they’re seeing between 35-70GB per hour. How would these speeds affect reseed times in my examples above? Well, let’s look at Table 1:

Table 1: Example Exchange 2010 Mailbox Database reseed times

Database Size Reseed Throughput Reseed Time
200GB 20GB/hr 10 hours
200GB 35GB/hr 7 hours
200GB 50GB/hr 4 hours
200GB 70GB/hr 3 hours
600GB 20GB/hr 30 hours
600GB 35GB/hr 18 hours
600GB 50GB per hour 12 hours
600GB 70GB per hour 9 hours
1.2TB 20GB/hr 60 hours
1.2TB 35GB/hr 35 hours
1.2TB 50GB/hr 24 hours
1.2TB 70GB/hr 18 hours

As you can see, reseed time can be a key variable in a DJS design. In some cases, depending on your business needs, these times could make or break whether this is a good design. I’ve done some talking around and found out that reseed times in the field are all over the charts. I had several people talk to me at the MVP Summit and ask me under what conditions I’d seen 20GB/hour, as that was too high. Astrid McClean and Matt Gossage of Microsoft had a great discussion with me and obviously felt that 20GB/hour is way too low.

Since then, I’ve received a lot of feedback and like I said, it’s all over the map. However, I’ve yet to hear anyone outside of Microsoft publicly state a reseed throughput higher than 20GB/hour. What this says to me is that getting the proper network design in place to support a good reseed rate hasn’t been a big point in deployments so far, and that in order to make a DJS design work, this may need to be an additional consideration.

If your replication network is designed to handle the amount of traffic required for normal DAG replication and doesn’t have sufficient throughput to handle reseed operations, you may be hurting yourself in the unlikely event of suffering multiple simultaneous replica failures on the same mailbox database.

This is a bigger concern for shops that have a small tolerance for any given drive failure. In most environments, one of the unspoken effects of a DJS DAG design is that you are trading number of replicas – and database-level failover – for replica rebuild time. If you’re reduced from four replicas down to three, or three down to two during the time it takes to detect the disk failure, replace the disk, and complete the reseed, you’ll probably be okay with that taking a longer period of time as long as you have sufficient replicas.

All during the reseed time, you have one fewer replica of that database to protect you. If your business processes and requirements don’t give you that amount of leeway, you either have to design smaller databases (and waste the disk capacity, which brings us right back to the good old bad days of Exchange 2000/2003 storage design) or use RAID.

[End 4/13 update]

Now, with a RAID solution, we don’t have that same problem. We still have a RAID volume rebuild penalty, but that’s happening inside the disk shelf at the controller, not across our network between Exchange servers. And with a well-designed RAID solution such as generic RAID 10 (1+0) or NetApp’s RAID-DP, you can actually survive the loss of more disks at the same time. Plus, a RAID solution gives me the flexibility to populate my databases with smaller or larger mailboxes as I need, and aggregate out the capacity and performance across my disks and databases. Sure, I don’t get that nice 1:1 disk to database ratio, but I have a lot more administrative flexibility and can survive disk loss without automatically having to begin the reseed dance.

Don’t get me wrong – I’m wildly enthusiastic that I as an Exchange architect have the option of designing to JBOD configurations. I like having choices, because that helps me make the right decisions to meet my customers’ needs. And that, in the end, is the point of a well-designed Exchange deployment – to meet your needs. Not the needs of Microsoft, and not the needs of your storage or server vendors. While I’m fairly confident that starting with a default NetApp storage solution is the right choice for many of the environments I’ll be facing, I also know how to ask the questions that lead me to consider DJS instead. There’s still a place for RAID at the Exchange storage table.

In further installments over the next few months, I’ll begin to address the SATA vs. SAS/FC and DAS vs. SAN arguments as well. I’ll then try to wrap it up with a practical and realistic set of design examples that pull all the pieces together.

[1] RAID-1 (mirroring) and RAID-10 (striping and mirroring) both create two physical copies of the data. RAID-5 does not, but it allows the loss of a single drive failure — effectively giving you a virtual second copy of the data.

[2] Curious why picked these database sizes?  200GB is the recommended maximum size for Exchange 2007 (due to backup limitations), and 600GB/1.2TB are the realistic recommended maximums you can get from 1TB and 2TB disks today in a DJS replica-per-disk deployment; you need to leave room for the content index, transaction logs, and free space.

A Virtualization Metaphor

This is a rare kind of blog post for me, because I’m basically copying a discussion that rose from one of my Twitter/Facebook status updates earlier today:

I wish I could change the RAM, CPU configuration on running VMs in #VMWare and have the changes apply on next reboot.

This prompted one of my nieces, a lovely and intelligent young lady in high school, to ask me to say that in English.

I pondered just hand waving it, but I was loathe to do so. Like I said, she’s intelligent. I firmly believe that kids live up to your expectations; if you talk down to them and treat them like they’re dumb because that’s what you expect, they’re happy to be that way. On the other hand, if you expect them to be able to understand concepts with the proper explanations, even if they may not immediately grasp the fine points, I’ve found that kids are actually quite able to do so – better than many adults, truth be told.

So, this is my answer:

The physical machinery of computers is called hardware. The programs that run on them (Windows, games, etc.) is software.
VMware is software that allows you to create virtual machines. That is, instead of buying (for example) 10 computers to do different tasks and have most of them have unused memory and processor power, you buy one or two really beefy computers and run VMWare. That allows you to create a virtual machine in software, so those two computers become 10. I don’t have to buy quite as much hardware because each virtual machine only uses the resources it needs, leaving the rest for the other virtual machines.

However, one of the problems with VMWare currently is that if you find you’ve given a virtual machine too much memory or processor (or not enough), you have to shut it down, make the change, then start it back up. I want the software to be smart enough to take the change *now* and automatically apply it when it can, such as when the virtual machine is rebooting. For a physical computer, it makes sense — I have to power it down, crack the case open, put memory in, etc. — but for a virtual computer, it should be able to be done in software.

Think of it this way: hardware is like a closet. You can build a big closet or a small closet or a medium closet, but each closet holds a finite amount of stuff. Software is the stuff you put in the closet — clothes, shoes, linens, etc. You can dump a bunch of stuff into a big closet, but doing so makes it cluttered and hard to use. So if you use multiple smaller closets, you’re wasting space because you probably won’t fill every one exactly.

In this metaphor, virtualization is like a closet organizer system. You can add a clothing rod here to hang dresses and blouses on, and underneath that add a shelf or two for shoes, while to the side you have more shelves for pants and towels and other stuff. You waste a little bit of your closet space for the organizer, but you keep everything organized and clutter-free, which means you’re better off and take less time to keep everything up.

Of course, this metaphor fails on my original point, because it totally makes sense you have to take all the stuff off shelves before moving those shelves around. In the world of software, though, it doesn’t necessarily make sense — it’s just the right people didn’t think of it at the right time.

Clear?

I came close to busting out Visio and starting to diagram some of this. I decided not to.

Edit: I don’t have to diagram it! Thank you, Ikea, and your lovely KOMPLEMENT wardrobe organizer line!

Ikea KOMPLEMENT organizer as virtualization software