Keeping up with infrastructure innovation; using resources efficiently; examining management interfaces and touchpoints; selecting effective policies.
Fact: Virtualization fundamentally and permanently changed IT and the data center. Today, most services are running inside virtual environments, and IT often takes a “virtualized first” approach to new application and service deployment. That is, administrators consider the virtual environment for running new applications rather than just building a new physical environment. Although virtualization offers significant benefits, it also introduces challenges that IT must overcome to help propel the business forward. This chapter describes those challenges.
Every time some startup company releases a great new product, enterprises scramble to implement that solution. The proliferation of purpose-built devices has created unnecessary complexity — and the result has been data center chaos. Innovation is great, and we all want it to continue, but eventually, data centers have so much stuff that they become unmanageable. It’s time to clean out the closet, so to speak.
Over the past decade, IT departments have focused on solving the storage capacity problem, deploying all kinds of technology to tame the capacity beast, such as WAN optimization hardware and backup deduplication appliances. As a result, data efficiency technologies have become standard features of many different products. But what happens when you put these products together in the data center? You end up constantly deduplicating and hydrating data as it moves between devices. Storage deduplicates data; then you read the data to back it up, at which point it must be hydrated (returned to a state the backup application understands) and often re-deduplicated somewhere in the backup data path. The CPU cost of reprocessing the same data is enormous, not to mention the bandwidth cost of moving all that hydrated data.
Tip: Deduplication is a process in which data is examined for common blocks. When common blocks are identified, they’re replaced with a small pointer to the unique copy of the data already stored on disk, which takes up significantly less capacity when written to storage. Deduplication delivers tremendous savings in storage capacity and, importantly, in input/output operations per second (IOPS), because fewer writes reach the disk. Hydration reverses the deduplication process, such as when data moves to a new system that doesn’t support deduplicated data.
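The idea can be illustrated with a short sketch. The block size, hashing scheme, and in-memory "store" here are assumptions made for illustration only, not any vendor’s implementation:

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block size

def deduplicate(data: bytes):
    """Split data into blocks; store each unique block once and
    record a small pointer (the block's hash) for every logical block."""
    store = {}      # hash -> unique block (stands in for "on disk")
    pointers = []   # one small pointer per logical block
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:
            store[digest] = block   # first copy: actually written
        pointers.append(digest)     # repeats cost only a pointer
    return store, pointers

def hydrate(store, pointers) -> bytes:
    """Reverse the process: follow each pointer back to its unique block."""
    return b"".join(store[p] for p in pointers)

# 10 logical blocks, but only 2 distinct patterns:
data = b"A" * BLOCK_SIZE * 8 + b"B" * BLOCK_SIZE * 2
store, pointers = deduplicate(data)
print(len(pointers), "logical blocks,", len(store), "unique blocks stored")
assert hydrate(store, pointers) == data  # hydration restores the original
```

Note that only the unique blocks ever reach storage, which is where the capacity and IOPS savings come from.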
Virtualization helped organizations consolidate many of their servers to run on a common platform: the hypervisor software layer. This move has helped IT departments make much better use of their server resources. Before virtualization, it was common for server utilization to average just 15 percent. Virtualization has pushed that average much higher. As a result, organizations now enjoy a much better return on their server investments. Moreover, they usually don’t need to buy as many physical servers as they did in the past. Virtualization has changed the game when it comes to server resources.
Unfortunately, IT departments often need to maintain separate groups of people to manage separate hardware resources. One group manages storage, for example; another group manages the server side; a third group handles networking. When an issue arises, it’s not uncommon to see a lot of finger-pointing. Further, emerging workloads are creating resource challenges that push IT departments to create infrastructure environments on a per-service basis. Virtual desktop infrastructure (VDI) environments, for example, have vastly different resource usage patterns from server virtualization projects.
To meet user expectations with VDI, IT professionals often implement completely separate environments, from servers on down to storage. Aren’t resource islands the very problems that virtualization is meant to solve? These islands are among the biggest culprits of underutilization. Virtualization is supposed to result in a single resource pool from which resources are carved out to meet application needs, thereby maximizing the use of those resources.
Multiple Management Interfaces
Storage devices. Optimizers. Hypervisors. Load balancers. What do they have in common? Each of these disparate components features its own management interface. If you use multiple components, each with separate management consoles (and policy engines) rather than a single, centralized, easy-to-use administrative system, you may experience the following challenges:
Vendors blaming each other when something goes wrong.
The inability to scale your data center environment easily and linearly.
Greater complexity due to policies and management being tied to IT components versus workloads.
Deployment Difficulty and Delays
Resource challenges represent the number one reason why organizations continue to have problems deploying new applications and services. A close second is administrative overhead. Allow me to explain.
Converting to flat IT
The legacy data center is delicate in many ways: any change at any level has the potential to disrupt the overall structure. Applying lessons and tactics learned from the big cloud vendors, hyperconvergence vendors are replacing tiered, resource-siloed data centers with a much flatter IT structure.

As practically all of the formerly separate data center hardware gets folded into the hyperconverged environment, the IT department needs to shift its focus, changing its resourcing structures and skill sets. Rather than staffing deep subject-matter experts in each resource area, hyperconvergence gives rise to infrastructure generalists: people with broad knowledge of all resource areas rather than deep knowledge of any single one. They don’t need that depth, because in a hyperconverged world, the most complex work is handled under the hood. Infrastructure generalists need enough broad knowledge to meet business needs and to manage the entire environment through a single administrative interface. In many ways, these people are far more application-focused than their island-based predecessors were; they just need to know how to apply infrastructure resources to meet individual application needs.

This development offers several pieces of good news for IT departments that have struggled to align IT operations with business needs:
This new structure paves the way to eliminating the inefficient resource management islands that have emerged in IT.
A flat data center managed by an infrastructure engineering group provides improved economies of scale compared with old resource islands.
Infrastructure generalists are far closer to applications than specialists of old were.
Flash arrays as quick fixes
As flash-based (and really fast) storage has emerged at relatively affordable prices, new arms of the storage market have sprung up. One such arm provides storage arrays based solely on flash storage. Although vendors in this all-flash space offer compelling products, many of these products are quick fixes designed to solve single-application problems; think VDI and Big Data analytics. For the vast majority of enterprise workloads, though, all-flash arrays are the very definition of throwing hardware at a performance problem. Storage solutions based on a combination of flash storage and spinning disks provide a far more balanced and reasonable approach to meeting workload needs. In addition, the cost per gigabyte of flash storage is pretty expensive compared with other storage options. That said, for applications that need to achieve hundreds of thousands or even millions of IOPS in a tiny amount of rack space, all-flash arrays can’t be beat. For everything else, consider more balanced storage options. Remember, a flash array really is a higher-performing storage array. It doesn’t address the resource islands, infrastructure management, interoperability challenges, or scalability issues of the modern data center.

Multiple challenges exist on the resource front, including the following:
IO blender: The consolidation of virtual machines (VMs) contributes to a random IO workload — each with its own pattern for reading/writing data to storage. I discuss the IO blender in detail later in this chapter.
Capacity: Another challenge is ensuring adequate capacity as the organization grows. With resources divvied up and islands of resources strewn about the data center, managing ongoing capacity so that there’s enough to go around becomes increasingly difficult.
Overhead: Even if you have enough resources to deploy a new application (see the preceding bullet), the administrative overhead involved in the process presents its own challenges:
A new logical unit number (LUN) must be provisioned to support the new application. If tiers of storage are involved, this process could require multiple steps.
One or more new VMs must be provisioned.
Networking for those new VMs has to be configured.
Load balancers and wide-area network (WAN) optimization devices need to be managed to support the new VMs.
Data protection mechanisms must be implemented for the new services.
Whew! That’s a lot to do. All of it is time-consuming, and all of it involves different teams of people in IT. Good luck!
Virtualization is heavily dependent on storage, but this resource has wreaked havoc in companies that are working hard to achieve 100 percent virtualized status. Here’s why. Consider your old-school, physical server-based workloads. When you built those application environments, you carefully tailored each server to meet the unique requirements for each individual application. Database servers were awarded two sets of disks — one for database files and one for log files — with different redundant array of independent disks (RAID) structures. File servers got RAID 5 to maximize capacity, while still providing data protection. Now consider your virtualized environment. You’ve taken all these carefully constructed application environments and chucked them into a single shared-storage environment. Each application still has specific storage needs, but you’ve basically asked the storage to sort out everything for you, and it hasn’t always done a good job.
In the old days, storage systems were optimized around LUN management. LUNs were replicated from a controller in one storage array to a LUN attached to a controller in a second array. The storage systems took snapshots of LUNs, and LUNs could be moved from one host to another.

Today, servers have been replaced by VMs; many VMs run on a single host, and many hosts use a single LUN to store VMs. This means that the storage system has dozens (or hundreds) of logical servers (VMs) all stored in the same LUN, so a single application, host, or VM can no longer be managed from the storage system’s perspective. When all of your applications attempt to work together on the same LUN, the result is the IO blender: a term coined to describe environments in which mixed IO patterns vie for limited storage resources.

A VM-centric platform cuts through the IO blender and allows you to optimize individual VMs. Policies can be applied to individual VMs. Performance can be optimized for individual VMs. Backups can be managed per VM, and replication is configured per VM. Do you see a pattern emerging here? Here are some ways that common services contribute to the IO blender:
Databases: Databases feature random IO patterns. The system has to jump all over the disk to find what you’re looking for.
Database log files: Log files are sequential in nature. Usually, you just write to log files — again, sequentially.
Random file storage: File servers are very random when it comes to IO. You never know when a user will be saving a new file or opening an old one.
Enterprise-level applications: Applications such as Microsoft Exchange and SharePoint are sensitive in terms of storage configuration, and each application often includes a mix of random and sequential IO.
VDI: VDI is one of the primary disruptors in the storage market. VDI storage needs are varied. Sometimes, you need only 10 to 20 IOPS per user. At other times, such as when you’re handling boot storms and login storms, IOPS needs can skyrocket.
What the industry has done over the years is combine all these varied workloads. In other words, its very efforts to consolidate environments have created a storage monster. Many storage-area network (SAN)-based storage environments suffer big-time problems due to this IO blender:
Continued consolidation of VMs contributes to random IO workloads, each with its own pattern for reading and writing data to underlying storage.
Highly random IO streams adversely affect overall performance as VMs contend for disk resources.
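A tiny simulation makes the blending effect concrete. The VM counts, block ranges, and interleaving model here are invented for illustration; the point is that each VM is perfectly sequential on its own, yet the shared LUN sees an almost entirely random stream:

```python
import random

# Each of three hypothetical VMs issues purely sequential writes
# within its own virtual-disk region of the shared LUN.
streams = {vm: list(range(vm * 100_000, vm * 100_000 + 5)) for vm in range(3)}

random.seed(7)
arrival_order = []  # the interleaved stream the shared LUN actually sees
while any(streams.values()):
    # The hypervisor forwards whichever VM's request arrives next.
    vm = random.choice([v for v, s in streams.items() if s])
    arrival_order.append(streams[vm].pop(0))

# Count how many adjacent requests are still sequential after blending.
sequential = sum(1 for a, b in zip(arrival_order, arrival_order[1:]) if b == a + 1)
print(f"{sequential} of {len(arrival_order) - 1} adjacent request pairs are sequential")
```

Even in this toy model, most adjacent pairs at the LUN are non-sequential, which is exactly the pattern that punishes spinning disks.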
One situation that perfectly demonstrates a relatively new phenomenon in storage is the VDI boot storm, which occurs when many users attempt to boot their virtual desktops at the same time. The result: Storage devices can’t keep up with the sheer number of requests. It’s the beginning of the day that really kills storage. As a computer boots, the operating system has to read a ton of data and move it to memory so that the system can be used. Now imagine what happens when hundreds or thousands of users boot their virtual desktops at the same time. Legacy storage systems crumble under the IO weight, and users can end up waiting a long time for their systems to fully boot.
Tip: The situation is largely mitigated by using solid-state storage as a caching layer. Adding this kind of service without considering the administrative inefficiencies it introduces has been standard operating procedure for quite a while, and it’s one of the main reasons people implement resource islands when they want to do VDI.
In traditional data center environments with shared storage, the difference in performance between reading and writing data is incredible. Reading generally is quick and can be accomplished with a single IO operation. Writing data is a different story; it can take up to six IO operations to accomplish. As administrators have moved from RAID 5 to RAID 6 for better data protection, they have introduced additional overhead to the storage equation. A RAID 6 small write requires six IOs to complete: the old data and both parity blocks must be read, and then the new data and both updated parity blocks must be written. RAID calculations also tend to require a lot of CPU overhead to perform the actual parity calculations needed for data protection.
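A back-of-envelope model shows how the write penalty eats into usable IOPS. The penalty figures are the commonly cited textbook values for small random writes, and the disk counts and workload mix are invented for illustration:

```python
# Commonly cited small-random-write penalties (backend IOs per frontend write).
RAID_WRITE_PENALTY = {"RAID 1": 2, "RAID 5": 4, "RAID 6": 6}

def effective_iops(backend_iops: float, write_fraction: float, raid_level: str) -> float:
    """Frontend IOPS a RAID set can sustain, assuming reads cost one
    backend IO and writes cost the RAID level's penalty."""
    penalty = RAID_WRITE_PENALTY[raid_level]
    cost_per_frontend_io = (1 - write_fraction) * 1 + write_fraction * penalty
    return backend_iops / cost_per_frontend_io

# Ten hypothetical 180-IOPS disks (1,800 backend IOPS), 50% write workload:
for level in RAID_WRITE_PENALTY:
    print(level, round(effective_iops(1800, 0.5, level)))
# RAID 1 sustains 1200, RAID 5 sustains 720, RAID 6 only ~514.
```

The same ten disks deliver less than a third of their raw IOPS once RAID 6 write amplification is factored in, which is why parity overhead matters so much under mixed workloads.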
Administrators can work inside their legacy environments in various ways to try to solve the serious IO issues that arise with shared workloads. Following are a few of those ways:
Buy a separate environment to support each application.
Buy complex standalone storage devices that include automatic tiering features.
Buy multiple tiers of storage and manage them separately.
What do these mitigation techniques have in common? They require administrators to overprovision storage, which requires more investment in storage hardware. They also require additional time by the administrator to configure and manage. Eventually, these models become unsustainable.
Touching data many times in a virtualized environment isn’t so great. Consider the following scenario: A legacy, but heavily virtualized, data center has many VMware vSphere servers connected to a SAN. The SAN has data deduplication mechanisms. The company backs up its data by using a local disk-to-disk-to-tape method; it also copies certain VMs to a remote data center each day. This way, the company maintains local backups and gains disaster recovery (DR) capabilities. Quite a bit of redundancy is built into this scenario. Figure 2-1 examines the path the data travels as it wends its way through the various processes associated with the scenario:
Figure 2-1: Hyperconverged infrastructure requires far less CPU power and network bandwidth than nonconverged systems.
Every time the data has to be hydrated and then re-deduplicated as it makes its way to different components, the CPU must be engaged. Deduplication can be an expensive operation, and constantly treating data in different locations has several drawbacks:
Constant CPU use to treat data multiple times limits the number of VMs that can be run in the environment.
Data takes more time to traverse the network because it isn’t in reduced form as it moves between services.
WAN bandwidth costs are significant as data travels across wide-area connections in unreduced form.
It gets worse. Many storage systems — including those related to data protection — use a post-process deduplication method, as opposed to what is known as an inline deduplication process. Post-process deduplication means that data is not deduplicated until after it’s actually been written to disk. Here are the steps:
1. Write data to disk undeduplicated. This requires available capacity and consumes IOPS.
2. Read data back from disk later. The data then needs to be touched again by the post-process deduplication engine, consuming yet more IOPS and CPU resources.
3. Invest CPU to deduplicate or compress. Once read, the data then needs to be processed again using more CPU.
This means that the data is replicated before deduplication, and then all the dedupe work to save capacity must happen at both the primary and disaster recovery sites. This consumes additional resources: CPU, IOPS, capacity, and network bandwidth. Post-process deduplication invests all these resources to get only a reduction in storage capacity. The tradeoff isn’t a positive one. The results: greater costs and lower efficiency.

The best outcome in any environment is to eliminate writes to disk before they even happen. In a hyperconverged environment, because of caching in RAM, many operations never have to touch storage.

In the modern data center, data efficiency is about IOPS and WAN bandwidth, not storage capacity. Capacity has become plentiful as vendors release bigger drives (6TB and more!). Originally, data efficiency technologies were focused on the backup market, where the objective was to provide an alternative to tape-based technology. To make the economics work, the primary goal was to fit more data on the disk while delivering the throughput needed for backups; in short, to pack 20 pounds of data into a 5-pound bag. That was the right solution at the time. But while disks have gotten bigger, drive performance has barely improved.
Organizations don’t have a capacity problem; they have an IOPS problem, which manifests as poor performance. With the addition of DR in most customer environments, the demand for WAN bandwidth has increased, so there’s a bandwidth challenge as well. Data reduction technologies, such as deduplication, are intended to address these emerging resource challenges, including WAN bandwidth needs.

Given this reality, in a primary storage environment, the infrastructure needs to optimize for performance and latency, not capacity and throughput. This requires new technology and an approach to data efficiency that is systemic, which is one of the hallmarks of hyperconverged infrastructure. Inline deduplication provides the level of efficiency needed and consists of only two steps: process data and write data. Inline deduplication invests CPU just once and gets a reduction in IOPS, WAN bandwidth, and storage capacity. These critical resources are conserved only when data efficiency is delivered inline.

In the modern data center, data efficiency is also about mobility and data protection, particularly for online backup and restore. Data efficiency saves all the IO traditionally required to perform backup and restore operations.
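To make the tradeoff concrete, here is a rough sketch (hypothetical operation counters, not a real storage engine) tallying what each approach spends on the same workload:

```python
from collections import Counter

def post_process(blocks):
    """Post-process: write everything to disk first, then re-read it
    and deduplicate after the fact."""
    ops = Counter()
    ops["write_io"] += len(blocks)      # step 1: land undeduplicated on disk
    seen = set()
    for b in blocks:
        ops["read_io"] += 1             # step 2: read it all back later
        ops["cpu_hash"] += 1            # step 3: invest CPU to deduplicate
        seen.add(b)
    return ops

def inline(blocks):
    """Inline: process each block first, write only blocks never seen."""
    ops = Counter()
    seen = set()
    for b in blocks:
        ops["cpu_hash"] += 1            # process data...
        if b not in seen:
            ops["write_io"] += 1        # ...then write only unique data
            seen.add(b)
    return ops

workload = ["a", "b", "a", "a", "c", "b"]   # 6 logical blocks, 3 unique
print("post-process:", dict(post_process(workload)))
print("inline:      ", dict(inline(workload)))
```

Both approaches invest the same CPU, but inline deduplication avoids every redundant write and the entire read-back pass, which is exactly the IOPS saving described above.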
In addition to facing the performance challenges of the post-virtualization world, virtualized organizations face policy challenges in both the physical and virtual worlds.
Physical: Physical servers have a direct mapping from application to server to storage array to LUN to storage policy. This results in an environment where an application policy is directly linked to an internal construct of the storage array. There is no abstraction. This approach is what makes storage upgrades so complex. For example, a replication policy is applied to a LUN in storage array X at IP address Y and tells that LUN to replicate to storage array A at IP Address B. Imagine the complexity of an array replacement when there are a couple of arrays in a couple of locations, and the replication policies are all tangled together. No wonder there are so many storage administrators in IT.
Virtual: In the virtualized world, there are many applications on a host and many hosts on a single LUN. It isn’t efficient to apply a policy to a single LUN if that LUN holds the data for many applications (and hosts). In a hyperconverged environment, backup and replication policies are applied directly to an individual application (or VM). There are no LUNs or RAID sets to manage. Replication policies specify a destination (in this case, a data center) that is abstracted away from the infrastructure, which allows an administrator to perform a platform upgrade without any policy reconfiguration or data migration, increasing efficiency and decreasing risk.
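A hypothetical sketch (all names, addresses, and fields are invented for illustration) contrasts the two policy shapes and shows why an array swap is painful in one model and a non-event in the other:

```python
# Legacy model: the replication policy is welded to array internals.
# Replace an array and every one of these fields must be re-plumbed.
lun_policy = {
    "source": {"array": "X", "ip": "10.0.0.1", "lun": 17},
    "target": {"array": "A", "ip": "10.9.0.1", "lun": 42},
    "schedule": "hourly",
}

# VM-centric model: the policy names a VM and a destination data center;
# the platform resolves where the bits actually live.
vm_policy = {
    "vm": "erp-db-01",
    "replicate_to": "datacenter-west",
    "schedule": "hourly",
}

def array_specific_fields(policy):
    """List the fields that an array replacement would invalidate."""
    return [k for side in ("source", "target")
            for k in policy.get(side, {})]

print("to re-map on upgrade (legacy):    ", array_specific_fields(lun_policy))
print("to re-map on upgrade (VM-centric):", array_specific_fields(vm_policy))
```

The VM-centric policy carries no array-specific state at all, which is what lets a platform upgrade proceed without policy reconfiguration.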