Category Archives: In Depth

Understanding Business Intelligence

Though Business Intelligence (BI) is technology driven, it is more about business requirements than about technology. The BI champion or sponsor in the organization defines the vision and mission.

The leader must have the ability to precisely define the organization's Business Intelligence requirements: the format of the reports, the relationships between the different data elements, and the version of the data to be used. The leader also needs to specify how much history needs to be included and how often the data needs to be provided to the different stakeholders of the process.

BI is then driven by the business objective, which is not always to reduce cost or to improve the top line or bottom line of the organization; it may instead be to use data analysis to bring efficiencies to processes or to enhance the customer's service experience.

The impetus for the BI initiative is the availability of data and technology for data analysis.  Ideally the organization should have at least two or three years of data in electronic format for meaningful analysis. The larger the volume of core data available in the systems, the more meaningful will be the output that can be expected from Business Intelligence.

However, it must be recognized that data may not always be structured, and data coming from multiple sources may not be in the same format. Data will have to be interpreted on the basis of logical assumptions, and data gaps will have to be identified upfront so that changes can be made to business processes at the point of data capture.

Management must be made aware that the resource and time commitment for BI is much higher than the resource commitment for IT projects, and there must be a high level of management commitment to making the BI project a success.

  1. Business leaders and IT analysts will have to allocate a substantial amount of time to mapping business requirements to IT system capabilities during both the design and implementation phases of the project. The leaders and IT developers will have to interact constantly to properly represent the business questions as analytical outputs.

  2. IT professionals involved in the process will have to explain to the business leadership the nature and meaning of the data that is available in the systems. They will have to clarify how exceptions are to be handled by the leadership, and also the limitations of the systems. The implication is that IT resources will also have to make a time commitment to BI.

  3. Subject matter experts also have a significant role to play in Business Intelligence projects. As the ultimate users of the system, they are in the best position to test the outputs and validate the significance of the queries and the outputs derived from them. They have the experience and the ability to flag exceptions and help fix them. This implies that subject matter experts too need to contribute a significant amount of time for the success of the project. They may even need to join the project team on a full-time basis during the testing phase.

The project plan must reflect the level of business personnel who will be involved, and the organization must be willing to release these people to join the project team as and when their services are requested.

It also follows that the project team and all those involved in BI must be open to learning and acquiring the skills that are essential for the effective functioning of the BI system that is being put in place.

Having said all this, it is necessary to point out that Business Intelligence cannot be a time bound project. Nor can the teams be disbanded with the first successful run of a data query or queries.

Query design, testing, redesign and use of new toolsets are inevitable. In short, BI is an evolving system that cannot be pinned down and bounded by traditional project management definitions.

The BI requirements change as business requirements evolve and change. No single requirement definition can be characterized as a permanent, unchanging requirement.  The Business leaders, IT analysts and Subject matter experts will have to be constantly engaged in designing and developing queries; testing the outputs on field formations; obtaining feedback on the usefulness or otherwise of the outputs and exceptions that need to be handled.  It is an iterative process.

It requires the institution of an agile system development and process management approach. Highly skilled personnel must constantly and continuously work together to deliver on the objectives of BI, with few requirements defined upfront and more requirements designed and refined on the go.

Scope of work will have to be “time-boxed” to each cycle of the project and goals and objectives of each time box will have to be specified separately.  Cycles which cannot be completed within the time specified will have to be deferred and included as part of the future development cycles. As a result, traditional project management methodologies will fail.

Since change is the only constant in BI, defining a change management strategy is imperative. The strategy must be built around the recognition that business requirements change, which changes the BI requirements and queries, which in turn changes the toolsets used and the skillsets required of BI stakeholders.

It should be remembered that people are resistant to change. Consumers of BI reports are no exception. They need to be educated about the benefits of the exercise, and the producers of the data must be aligned, by placing appropriate controls in the systems, to ensure data quality is never compromised.

Training needs must be studied and documented, and training must be organized whenever there is a change in one or more of the components that operationalize the Business Intelligence unit.

The organization must also recognize that reporting requirements and formats will change with every change in the BI requirement. Not all reporting formats can be axed at once, nor can they all change overnight. The change must be planned and initiated based on schedules that have been agreed upon by the different stakeholders.

Existing reports and tools should be retired gradually, and transition periods must be orchestrated carefully and thoughtfully. Training must be organized to transition all stakeholders to the new formats. Organizing workshops and change management seminars must be part and parcel of the Business Intelligence unit's functioning.

Finally, it must be reiterated that BI is not a project. It is a program.  The solution must dovetail into the existing environment and reinforce the business processes that are in use.  Any data extraction exercise for BI must be done without disturbing the workflow in the organization or impacting the reliability of the information that is being gathered during business operations.

In Depth Look: Private Cloud Infrastructure as a Service Capabilities

The primary purpose of a Private Cloud Infrastructure as a Service capability is to provide well managed infrastructure services to the Platform and Software Layers. To achieve this, the Infrastructure Layer, highlighted in the Private Cloud Reference Model diagram below, includes five capabilities.


Figure 1: Private Cloud Reference Model

This document describes these Infrastructure Layer capabilities and the impact of Private Cloud Infrastructure as a Service (IaaS) patterns on their planning and design. These patterns are defined in the Private Cloud Principles, Concepts, and Patterns document and are summarized here:

  • Resource Pooling: Divides resources into partitions for management purposes.
  • Physical Fault Domain: The group of physical resources dependent on a single point of failure such as an Uninterruptible Power Supply (UPS).
  • Upgrade Domain: A group of resources upgraded as a single unit.
  • Reserve Capacity: Unallocated resources, which take over service in the event of a failed Physical Fault Domain.
  • Scale Unit: A collection of resources treated as a single unit of additional capacity.
  • Capacity Plan: A model that enables a private cloud to deliver the perception of infinite capacity.
  • Health Model: Defines how a service or system may remain healthy.
  • Service Class: Defines services delivered by Infrastructure as a Service.
  • Cost Model: The financial breakdown of a private cloud and its services.

The Health Model, Service Class, and Scale Unit patterns directly affect Infrastructure and are detailed in the relevant sections later. Conversely, private cloud infrastructure design directly affects Physical Fault Domains, Upgrade Domains, and the Cost Model. These relationships are shown in Figure 2 below.


Figure 2: Infrastructure Relationship with Patterns
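
As an illustration of how these patterns can relate to one another, the following Python sketch models Resource Pools, Physical Fault Domains, and Reserve Capacity as simple data structures. The class and field names, the sizes, and the reserve rule are assumptions made for this sketch only; they are not prescribed by the reference model.

```python
# A minimal, illustrative model of how the patterns above relate.
# All class and field names are assumptions made for this sketch; they are
# not part of the reference architecture itself.
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Server:
    name: str
    cpu_cores: int
    ram_gb: int


@dataclass
class FaultDomain:
    """A group of physical resources behind a single point of failure (e.g. one UPS)."""
    name: str
    servers: List[Server] = field(default_factory=list)


@dataclass
class ResourcePool:
    """A management partition made up of one or more fault domains."""
    name: str
    fault_domains: List[FaultDomain] = field(default_factory=list)
    reserve_domains: int = 1  # Reserve Capacity: domains kept unallocated for failover

    def usable_capacity(self) -> int:
        """Cores available to consumers once Reserve Capacity is held back."""
        domains = sorted(self.fault_domains, key=lambda d: sum(s.cpu_cores for s in d.servers))
        # Hold back the largest domains as reserve so a single-domain failure can be absorbed.
        usable = domains[: max(len(domains) - self.reserve_domains, 0)]
        return sum(s.cpu_cores for d in usable for s in d.servers)


# A Scale Unit is simply the fixed block of resources added in one step.
if __name__ == "__main__":
    scale_unit = [Server(f"host{i}", cpu_cores=16, ram_gb=192) for i in range(8)]
    pool = ResourcePool(
        "pool-01",
        fault_domains=[FaultDomain("rack-A", scale_unit[:4]), FaultDomain("rack-B", scale_unit[4:])],
    )
    print("Usable cores after reserve:", pool.usable_capacity())
```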

Background

The private cloud principles “perception of continuous availability” and “resiliency over redundancy mindset” are designed to make a private cloud architect think differently.

Traditional solutions rely heavily on redundancy to achieve high availability and avoid failure. But redundancy at the facility (power) and infrastructure (network, server, and storage) layers is very costly. Modern cloud applications are designed with a different, holistic approach to achieving availability. This means shifting focus from building redundancy into the facility and physical infrastructure to engineering the entire solution to handle failures — eliminating them, or at least minimizing their impact.

This approach to availability relies on resilience as well as redundancy. Resilience means rapid, and ideally automatic, recovery from a failure. Redundancy is typically achieved at the application level. (A non-cloud example is Active Directory®, where redundancy is achieved by providing more domain controllers than is needed to handle the load.)

Customer interest in cost reduction will help drive adoption of this approach over the medium term. Removing power redundancy from racks or co-location rooms has a big impact on operational expenses, but this typically occurs only when the hosted application doesn’t have to be highly available, or when high availability is achieved through redundancy at the application layer – for example, Active Directory replication, or application layer mirroring such as SQL Server™ mirroring. Combining reductions in physical redundancy with virtualization results in lower capital and operational expenditure compared to a highly redundant infrastructure.

Applications that depend on a highly available infrastructure will not achieve their Service-Level Agreement (SLA) when placed on the type of infrastructure defined earlier. Customers are therefore likely to develop two environments when designing their private cloud: a standard environment with reduced facility and infrastructure redundancy, and a high-availability environment with traditional levels of redundancy.

Standard Environment:

  • No power redundancy to the rack (for example, one in-rack UPS)
  • No network redundancy to the servers (redundant core network)
  • Local storage, possibly redundant storage and storage network
  • Ideally no migration, or possibly quick migration

High-Availability Environment:

  • Redundant power to each server
  • Redundant network connections to each server
  • Redundant storage presented to each server
  • Live Migration

These two environments allow an architect to differentiate service classifications from a high-availability perspective. The standard environment is appropriate for stateless workloads; stateful workloads will require the high-availability environment. Stateful and stateless machines are managed differently, and statefulness will likely appear as a characteristic of the service classifications.

Stateless workloads (web servers, for example) are typically redundant at the server level via a load-balanced farm. These servers could easily be hosted in the Standard Environment. If all stateless workloads had an automated build, the Standard Environment could do away with any form of VM migration – and simply deploy another VM after destroying the existing one, thereby saving the cost of shared storage.

Stateful workloads, on the other hand, require a specific management approach and impose higher costs on the consumer. Unless designed for high availability at the application level, they will require some form of redundancy in the infrastructure. Further, the High-Availability Environment requires Live Migration to enable maintenance of the underlying fabric and load balancing of the VMs.
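
The placement rule described above can be expressed very simply. The Python sketch below is a hedged illustration only: the Workload type, its fields, and the environment names are assumptions for this example, not part of the reference architecture.

```python
# Stateless workloads go to the Standard Environment; stateful workloads go to the
# High-Availability Environment, unless they already achieve HA at the application layer.
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    stateful: bool
    ha_at_app_layer: bool = False  # e.g. AD replication or SQL Server mirroring


def choose_environment(w: Workload) -> str:
    # Stateless workloads, or stateful ones that are already redundant at the
    # application layer, can live on the cheaper Standard Environment.
    if not w.stateful or w.ha_at_app_layer:
        return "Standard Environment"
    return "High-Availability Environment"


if __name__ == "__main__":
    for w in [Workload("web-farm", stateful=False),
              Workload("sql-mirrored", stateful=True, ha_at_app_layer=True),
              Workload("legacy-erp", stateful=True)]:
        print(w.name, "->", choose_environment(w))
```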

Security

The number one concern of customers considering moving services to the cloud is security. Concerns expressed in industry forums are well founded and are good reasons to think through the end-to-end scenarios and attack surfaces presented when deploying multiple services from various departments of an organization on a private cloud.

In a cloud-based platform, whether private or public, customers will be working in an essentially virtualized environment. The platform or software will run on top of a shared physical infrastructure managed internally or by the service provider. The security architecture used by the applications will need to move up from the infrastructure to the platform and application layers; in a private cloud, this provides security in addition to that of the perimeter network.

Public cloud involves handing over control to a third party and sharing services with unrelated business entities or even competitors, and it requires a high degree of trust in the provider's security model and practices. In many ways the security concerns of a private cloud are similar to those of a self-hosted or outsourced datacenter; however, the move to the virtualized, self-service, service-oriented paradigm inherent in private cloud computing introduces some additional security concerns.

First is the isolation of tenants from each other and from the hosting infrastructure at both the compute and network layers. Virtualization is a part of any private cloud strategy, and the security of this model depends entirely on the ability to isolate one tenant from another and to prevent a careless or malicious tenant from impacting the stability of the core infrastructure upon which all tenants rely.

Another concern is Authentication, Authorization and Auditing of access to the cloud services. Self-service implies that tenant administrators can initiate management processes and workflows that were previously accomplished through IT. Any misconfiguration or excessive permissions granted to these users can impact the stability or security of the cloud solution.

Many private cloud security concerns are also shared by the traditional datacenter environment, which is not surprising since the private cloud is simply an evolution of the traditional datacenter model. These include:

  • Impact to confidentiality, integrity, or availability through exploitation of software vulnerabilities.
  • Unauthorized access due to weak authentication or misconfiguration.
  • Impact to confidentiality, integrity, or availability by malicious code.
  • Impact to confidentiality, integrity, or availability of data.
  • Compliance with internal or industry-specific regulations and standards.

Secure Virtualization Platform

The biggest risk in running in a multi-tenant virtualized environment is that a tenant running services on the same physical infrastructure as you can break out of its isolating partition and impact the confidentiality, integrity or availability of your workload and data. The security of the virtualization platform is therefore key to the isolation of, and non-interference between, the individual virtual machines running on the infrastructure.

Highly Automated Management, Monitoring and Reporting

Many management tasks involve multiple steps that must be completed in the proper sequence by multiple administrators across multiple systems. Any shortcuts, omissions or errors can leave assets vulnerable to unauthorized access or affect the reliability of components within a solution. Orchestrating discrete management and monitoring tasks into workflows that require proper authorization and approval greatly diminishes the chance of mistakes that affect the security of the solution.

Authentication, Authorization and Auditing

Most organizations have a common capability that provides an overarching framework for authentication and access control. A private cloud introduces additional elements to bring under that framework: the hosting infrastructure and the virtual machine workloads that run on it. This framework must be designed, and possibly extended, to provide a single point for managing identities and credentials, authentication services, and a common security model for access to resources across the private cloud.

Multi-layer Security

Moving to a cloud-based platform requires a change in mind-set of developers and IT security professionals. Some of the risks of the public cloud are mitigated by using a private cloud architecture; however, the perimeter security protecting a private cloud should be seen as an addition to public cloud security practices, not an alternative. You cannot apply traditional defense-in-depth security models directly to cloud computing, but you should still apply the principle of multiple layers of security. By taking a fresh look at security when you move to a cloud-based model, you should aim for a more secure system rather than accepting security at current levels.

Security Governance

Enterprise IT systems are now typically well regulated and controlled. The security risks are well documented and therefore proper processes are put in place to develop new applications and systems, or to provision them from 3rd party vendors. It is very unlikely that a department manager would be able to purchase and install software without approval from the IT department.

With public cloud systems and Web browser clients however, it is possible that individual department managers could bypass the IT department and provision public cloud-based software. Indeed, they might use free cloud storage systems as a convenient means to synchronize documents without even considering that they are using public cloud services. Public cloud systems might be appealing to a manager as they could very quickly provision a new system and remove what they might see as unnecessary bureaucracy. They may even be unaware of the security and compliance policies that are in place to protect the organization. In a cloud-based landscape, we must protect corporate systems and data from these unauthorized, untested systems.

Facilities

Facilities represent the physical components – buildings, racks, power, cooling, and physical interconnects – that house or support a private cloud. It is beyond the scope of this document to provide detailed guidance on facilities, but the private cloud principles affect facility design.

The definition of a Scale Unit impacts power, cooling, space, racking, and cabling requirements. The team that defines a Scale Unit should include personnel that design and manage these aspects of the facility in addition to the procurement, Capacity Planning, and Service Delivery teams. The following table lists some ramifications of Scale Unit size choices from a facilities perspective.

Small Scale Unit

Benefit:

  • Lower amount of physical labor needed to add a Scale Unit

Trade-off:

  • Complicates the Resource Pool, Fault Domain, and Reserve Capacity equation
  • Inefficient
  • Stranded power (un-utilized power)
  • Un-utilized space

Large Scale Unit

Benefit:

  • Allocation of full facilities units (for example, UPS, Rack, and Co-location Room) is easy to cost and engineer
  • Reduces under-utilization of power, cooling, and space

Trade-off:

  • Higher amount of labor to commission

Knowing how much power, cooling, and space each Scale Unit will consume enables the facilities team to perform effective Capacity Planning and the engineering team to effectively plan resources.
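
As a hedged illustration of this kind of facilities Capacity Planning, the following arithmetic shows how an assumed per-server power draw and rack budget determine how many Scale Units fit in a rack, and which constraint (power or space) binds first. All figures are hypothetical.

```python
# Illustrative arithmetic only: estimate how many Scale Units a rack can hold.
SERVERS_PER_SCALE_UNIT = 16
WATTS_PER_SERVER = 400          # assumed average draw
RACK_POWER_BUDGET_W = 8000      # assumed per-rack power budget
RACK_UNITS_PER_SERVER = 1
RACK_UNITS_AVAILABLE = 42

scale_unit_power = SERVERS_PER_SCALE_UNIT * WATTS_PER_SERVER
scale_unit_space = SERVERS_PER_SCALE_UNIT * RACK_UNITS_PER_SERVER

# The binding constraint may be power rather than space (leaving "stranded" space), or vice versa.
units_by_power = RACK_POWER_BUDGET_W // scale_unit_power
units_by_space = RACK_UNITS_AVAILABLE // scale_unit_space
print(f"A Scale Unit draws {scale_unit_power} W and occupies {scale_unit_space} U")
print(f"Scale Units per rack: {min(units_by_power, units_by_space)} "
      f"(power allows {units_by_power}, space allows {units_by_space})")
```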

Compute, Network, and Storage Fabric

The term Fabric defines a collection of interconnected compute, network, and storage resources.

The concept of homogenous physical infrastructure, introduced in the Private Cloud Principles, Concepts, and Patterns guide, stipulates that all servers in a Resource Pool should be identical. Homogenizing the compute, storage, and network components in servers allows for predictable scale and performance. In other words, every server in a Resource Pool should have the same processor characteristics such as family (Intel/AMD), number of cores/CPUs, and generation (Xeon 2.6 Gigahertz (GHz)). The homogenized compute concept also stipulates that each server have the same amount of Random Access Memory (RAM) and the same number of connections to Resource Pool storage and networks. With these specifications met, any virtualized service could relocate from one failing or failed physical server to another physical server and continue to function identically.
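
A minimal sketch of checking this homogeneity rule is shown below; the ServerSpec fields simply mirror the characteristics listed above (CPU family, cores, generation, RAM, and connection counts) and are assumptions for illustration.

```python
# Every server in a Resource Pool must share the same specification.
from collections import namedtuple

ServerSpec = namedtuple("ServerSpec", "cpu_family cores cpu_generation ram_gb storage_paths nics")


def is_homogeneous(pool: list) -> bool:
    # Identical specs collapse to a single set entry.
    return len(set(pool)) <= 1


pool = [
    ServerSpec("Intel", 16, "Xeon 2.6 GHz", 192, 2, 4),
    ServerSpec("Intel", 16, "Xeon 2.6 GHz", 192, 2, 4),
]
print("Pool is homogeneous:", is_homogeneous(pool))
```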

Physical Server

The physical server hosts the hypervisor and provides access to the network and shared storage. In the Standard Environment, the facilities do not provide power redundancy, so the servers do not require dual power supplies.

Every server will be a member of a single compute Resource Pool and a single Physical Fault Domain. Assuming all servers are homogeneous (as recommended), they will all be members of a single Upgrade Domain.

Capacity Planning must be done for each server specification, as its size (CPU and RAM specification) will determine how many virtual machines it is able to host. This is covered in greater detail in the Private Cloud Planning Guide for Service Delivery.
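
The following sketch illustrates the kind of per-specification Capacity Planning described here: given an assumed parent-partition reservation and an assumed service class, it estimates how many VMs a host can carry. None of the numbers come from the planning guide; they are placeholders.

```python
# Illustrative only: how many VMs of a given service class fit on one host after
# the parent partition and hypervisor take their share.
HOST_CORES = 16
HOST_RAM_GB = 192
PARENT_PARTITION_CORES = 2      # assumed reservation for parent partition + hypervisor
PARENT_PARTITION_RAM_GB = 8

VM_VCPUS = 2                    # assumed "medium" service class
VM_RAM_GB = 8
VCPU_TO_CORE_RATIO = 4          # assumed oversubscription when reservations are not used

usable_cores = HOST_CORES - PARENT_PARTITION_CORES
usable_ram = HOST_RAM_GB - PARENT_PARTITION_RAM_GB

vms_by_cpu = (usable_cores * VCPU_TO_CORE_RATIO) // VM_VCPUS
vms_by_ram = usable_ram // VM_RAM_GB
print(f"VMs per host: {min(vms_by_cpu, vms_by_ram)} "
      f"(CPU allows {vms_by_cpu}, RAM allows {vms_by_ram})")
```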

Server specification selection impacts the Scale Unit, Cost Model, and service class. Scale Units have a finite amount of power and cooling, so server efficiency has an impact on a private cloud. It may be that all power in a Scale Unit is consumed before all physical space. The cost of servers impacts the Cost Model irrespective of whether this cost is passed onto the consumer. Selecting only small one-unit servers will limit the architect’s ability to define a range of service classifications. The server needs to accommodate the largest service classification after the parent partition and hypervisor consume their resources.

Microsoft research shows servers with processors one or two models behind the latest versions offer a better price, performance, and power consumption ratio than the newer processors.

The Private Cloud Reference Architecture dictates that the “concept of homogenization of physical infrastructure” be adopted for each Resource Pool. Server specifications (CPU, RAM) may vary between Resource Pools, but this complicates Fabric Management (defined in the Private Cloud Planning Guide for Systems Management), which spans Resource Pools and Capacity Planning, and may necessitate different service classes for each pool.

Delivering IaaS requires that the service is pre-defined and delivered consistently. To achieve consistent performance, the VMs must have equal resources available to them from each server, in other words, the same CPU cycles and RAM. If servers within a Resource Pool do not provide homogeneous performance and RAM, consistent performance cannot be guaranteed.

Absolute homogenization may be hard to maintain over the long term as server models may be discontinued by the vendor; therefore relationships between Resource Pools, Scale Units, and server model longevity must be considered carefully.

The following table lays out some of the benefits and trade-offs of homogeneous and heterogeneous Resource Pools.

Homogeneous Physical Infrastructure

Benefit:

  • Predictable performance within a Resource Pool
  • Guaranteed Live Migration across the fabric

Trade-off:

  • Reuse of existing equipment may not be possible

Heterogeneous Physical Infrastructure

Benefit:

  • Possible reuse of existing equipment
  • Allows for a broader range of server classes

Trade-off:

  • VMs cannot be moved between Resource Pools
  • More upfront work to make sure Live Migration will work appropriately

In addition, servers should support the following requirements to achieve an automated infrastructure and resiliency:

Automated Infrastructure

  • Wake On Local Area Network
  • Remote BIOS Upgrades/Configurations
  • Boot from Flash
  • Pre-Boot Execution Environment (PXE) for remote imaging
  • Virtualization Support
    • Data Execution Prevention (DEP)
    • 64 bit CPUs
  • Standard Environment: 2 Network adapters that support TCP offload (TOE)
    • Management x 1
    • Consumer x 1
  • High-Availability Environment: 4 or 6 redundant network adapters that support TOE
    • Management x 2: Could be teamed for redundancy
    • Live Migration x 2: Could be teamed for redundancy
    • Consumer x 2: Could be teamed for resiliency
  • Standard Environment: Storage connections that meet the required service classification
    • For Internet Small Computer System Interface (iSCSI), 1 x Hardware iSCSI initiators: Could use vendor-specific software to achieve resiliency
    • For Fiber Channel, 1 x Fiber Channel host bus adapter (HBA): Could use vendor-specific software to achieve resiliency
  • High-Availability Environment: Redundant storage connections that meet the required service classification
    • For iSCSI, 2 x Hardware iSCSI initiators: Could use vendor-specific software to achieve resiliency
    • For Fiber Channel, 2x Fiber Channel HBAs: Could use vendor-specific software to achieve resiliency

To dynamically initiate remediation events in case of failure or impending failure of server components, each server is required to display warnings, errors, and state information for the following:

  • CPU
    • State (Busy/Ready)
    • Utilization
    • Heat
    • Fans
  • RAM
    • Utilization
    • Error-Correcting Code (ECC) Errors
  • Storage
    • Read/Write Failures
    • Predictive Failures
  • Network Interface Cards (NICs)
    • Port State
    • Send/Receive Errors
  • Motherboard
    • Server Post Errors
  • Power Supply
    • State
    • Active / Passive
    • Power Output Variations
  • Fans
    • Speed
    • State
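
As a hedged sketch of how such a health model might be consumed, the following Python fragment classifies raw component readings into warning and error states and hands anything unhealthy to a remediation step. The thresholds and the print-based remediation are illustrative assumptions only.

```python
# Classify component telemetry and trigger remediation when a component is unhealthy.
from enum import Enum


class Health(Enum):
    OK = "ok"
    WARNING = "warning"
    ERROR = "error"


def classify_cpu(utilization_pct: float, temp_c: float) -> Health:
    # Thresholds are assumptions chosen for illustration.
    if temp_c > 90 or utilization_pct > 98:
        return Health.ERROR
    if temp_c > 80 or utilization_pct > 90:
        return Health.WARNING
    return Health.OK


def remediate(server: str, component: str, state: Health) -> None:
    # In a real fabric this would raise an alert or start an orchestrated workflow,
    # for example draining the host ahead of a predicted failure.
    if state is not Health.OK:
        print(f"{server}: {component} is {state.value} - initiating remediation workflow")


remediate("host01", "CPU", classify_cpu(utilization_pct=95, temp_c=78))
```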

Storage

To achieve the perception of infinite capacity, proactive Capacity Management must be performed, and storage capacity added ahead of demand. The amount of storage added as a single unit (a Storage Scale Unit) will depend on the rate of storage consumption, hardware vendor lead time, and the level of risk the business wishes to assume (that is, weighing remaining unallocated capacity against the possibility of exhausting all capacity). This is detailed in Private Cloud Planning Guide for Service Delivery.
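
A simple illustration of this sizing decision, using assumed figures for consumption rate, vendor lead time, and risk buffer:

```python
# Illustrative arithmetic for sizing a Storage Scale Unit: order enough capacity to cover
# consumption during the vendor lead time plus a risk buffer. All figures are assumptions.
consumption_tb_per_week = 4.0
vendor_lead_time_weeks = 6
risk_buffer_weeks = 4            # headroom the business wants beyond the lead time

reorder_threshold_tb = consumption_tb_per_week * (vendor_lead_time_weeks + risk_buffer_weeks)
print(f"Order the next Storage Scale Unit when free capacity drops below {reorder_threshold_tb} TB")
```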

Storage will be placed in Storage Resource Pools, from which it is automatically allocated to consumers. Though Resource Pools are not a new concept for Storage Area Networks (SANs), allowing the infrastructure to allocate storage on-demand based on policy may be a new approach for many organizations. Further, the SAN must present an application programming interface (API) to Fabric Management to allow automation of allocation and provisioning.

The storage provided within a private cloud must be consistent in performance and availability. This means the Input/output (I/O) Operations per Second (IOPS) cannot vary significantly. If there is a need to make different levels of storage performance available to users of a private cloud, it can be accomplished through multiple service classifications. A private cloud is intended, however, to provide a limited set of standardized services; therefore, variances should be carefully considered.

The cost of providing the storage within a private cloud should be clearly defined. This permits metering, and possibly allocation of costs to consumers. If different classes of storage are provided for different levels of performance, their costs should be differentiated. For example, if a SAN is being used in an environment, it is possible to have storage tiers where faster Solid State Drives (SSDs) are used for the most critical workloads. Less-critical workloads can be placed on Tier 2 Serial Attached SCSI (SAS) drives, and even less-critical workloads on Tier 3 SATA drives.
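
As a small illustration of differentiated storage costs, the sketch below prices the tiers named above; the per-gigabyte rates are hypothetical.

```python
# Differentiated cost per storage tier; tier names follow the text, prices are assumed.
STORAGE_TIERS = {
    "tier1-ssd": 0.90,   # cost per GB per month (assumed)
    "tier2-sas": 0.30,
    "tier3-sata": 0.10,
}


def monthly_storage_cost(tier: str, allocated_gb: int) -> float:
    return STORAGE_TIERS[tier] * allocated_gb


print(f"500 GB on tier2-sas costs {monthly_storage_cost('tier2-sas', 500):.2f} per month")
```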

The Private Cloud Reference Architecture assumes the storage arrays and the storage network are redundant, with no single point of failure beyond the array itself. In this regard, the storage array can be considered a Fault Domain.

The design should adopt some form of de-duplication technology to reduce storage consumption.

As the storage array is a single point of failure, it should display health information to the systems monitoring service to make sure that any outages and their impact are quickly identified. Providing snapshots and mirroring between arrays for continuity is beyond the scope of this guidance.

Physical Storage Switches

If an architect follows the recommendation to allow any VM to execute on any server in a Resource Pool, Virtual Hard Disks (VHDs) should reside on a SAN. While it is possible to host VHDs locally, the guidance assumes that they are hosted on a SAN.

A key decision in private cloud design is whether to use iSCSI or Fiber Channel for storage. If iSCSI is utilized to house virtual workload storage, it is suggested that each virtualization host include iSCSI HBAs instead of standard NICs for performance reasons.

The purpose of a storage switch is to provide resilient and flexible connectivity between shared storage and physical servers. The storage switch must meet peak storage I/O requirements for the virtual services. In addition, the interconnect speeds between switches should be evaluated to determine the maximum throughput for switch-to-switch communications. This may limit the maximum number of hosts that can be placed on each switch.

While switch throughput is important, attention should also be paid to the number of available switch ports needed to support the physical virtualization hosts. Refer to the switch hardware vendor to make sure it meets these requirements.

Physical storage switch requirements include:

  • Dedicated switch port on each switch for each host and storage processor connection. This is needed for redundancy and I/O optimization.
  • iSCSI traffic separated from all other IP traffic, preferably on its own switched infrastructure or logically through a virtual local area network (VLAN) on a shared IP switch. This segregates data access from traditional network communications for host-to-host and workload operations and provides data security.
  • Redundant power supplies and cooling fans increase the number of faults the storage switch can withstand.
  • Programmatic interface to support firmware upgrades and configuration.

Physical Storage Subsystem

Stateless workloads can be hosted on Direct-Attached Storage (DAS) instead of SAN, driving down the cost of service. The downside is that Fabric Management has to handle transitioning active user connections between VMs homed on different hosts, as VM migration is impossible. This may mean tighter integration with the network than is specified in this document (in order to know when all connections to a VM have been abandoned or terminated before stopping the VM, for example).

SAN storage, while more expensive, provides advantages:

  • The VM can be re-homed to other servers.
  • Live Migration can be employed.
  • Backup (of the VM) can occur out of band (for example, taking snapshots).
  • Capacity can be increased almost limitlessly.

The logical storage configuration (or storage classification) should be designed to meet requirements in the following areas:

  • Capacity: To provide the required storage space for the virtual service data and backups.
  • Performance Delivery: To support the required number of IOPS and throughput.
  • Fault Tolerance: To provide the desired level of protection against hardware failures. If a SAN is used, this may include redundant HBA and switches.
  • Manageability: To provide a high degree of platform self-management. This requires a programmatic interface to provide automated configuration and firmware upgrades.

Additionally, a private cloud must meet the following requirements to make sure that it is highly available and well-managed:

  • Multiple paths to the disk array for redundancy. Should a disk fail, hot or warm spare disks can provide resiliency in the provisioned storage. Consult the storage vendor for specific recommendations.
  • A storage system with automatic data recovery, to allow an automatic background process to rebuild data onto a spare or replacement disk drive when another disk drive in the array fails.
  • Redundant power supplies and cooling fans, to increase the number of faults the Storage Array can withstand.

Network

The Private Cloud Reference Architecture assumes that the network presented to servers is not redundant for the Standard Environment and is redundant for the High-Availability Environment.

The network is tightly coupled with physical servers. Each Compute Resource Pool includes the network switches necessary for the servers to operate; each Scale Unit includes a pre-defined and fixed number of servers and switches.

The switches must be monitored to make sure no workloads saturate the network. A private cloud is designed as a general-purpose infrastructure. Workloads that challenge the network with high utilization may not be good candidates for a private cloud unless separate Resource Pools are created specifically to handle these workloads.

Switches are members of network upgrade domains, but the definition and membership of upgrade domains will likely vary depending on the nature of the upgrade. If switches are not redundant (for example, in the Standard Environment), the whole Resource Pool will need to be taken offline for switch maintenance, which requires switch reboots.

Network hardware (switches and load balancers) must display an API to Fabric Management that enables automated management of networks such as creation of VLANs, Virtual IP addresses (VIP), and addition or removal of hosts from the VIP.
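
The sketch below illustrates the shape such an API contract might take from Fabric Management's point of view. It is a hypothetical interface only; real switches and load balancers expose vendor-specific interfaces (for example SNMP, NETCONF, or REST) rather than these method names.

```python
# A hypothetical network-device API that Fabric Management could automate against.
from abc import ABC, abstractmethod


class NetworkDeviceAPI(ABC):
    @abstractmethod
    def create_vlan(self, vlan_id: int, name: str) -> None: ...

    @abstractmethod
    def create_vip(self, vip_address: str, port: int) -> None: ...

    @abstractmethod
    def add_host_to_vip(self, vip_address: str, host_address: str) -> None: ...

    @abstractmethod
    def remove_host_from_vip(self, vip_address: str, host_address: str) -> None: ...


class LoggingLoadBalancer(NetworkDeviceAPI):
    """Stand-in implementation that only records the calls Fabric Management would make."""

    def create_vlan(self, vlan_id: int, name: str) -> None:
        print(f"create VLAN {vlan_id} ({name})")

    def create_vip(self, vip_address: str, port: int) -> None:
        print(f"create VIP {vip_address}:{port}")

    def add_host_to_vip(self, vip_address: str, host_address: str) -> None:
        print(f"add {host_address} to VIP {vip_address}")

    def remove_host_from_vip(self, vip_address: str, host_address: str) -> None:
        print(f"remove {host_address} from VIP {vip_address}")


lb = LoggingLoadBalancer()
lb.create_vlan(101, "consumer")
lb.create_vip("10.0.0.10", 443)
lb.add_host_to_vip("10.0.0.10", "10.0.1.21")
```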

Physical Network

Some key decisions that affect the bandwidth of the physical networks relate to the use of Live Migration, the requirements for port security, and the need for link aggregation. The following table shows the benefits and trade-offs of using Live Migration:

Use Live Migration

Benefit:

  • Transparent movement of stateful applications
  • Transparent infrastructure upgrades

Trade-off:

  • Additional network switch ports will be required
  • More network adapters are required per virtualization host
  • Greater Reserve Capacity may be required because of cluster size limitations of 16 nodes

Do Not Use Live Migration

Benefit:

  • Fewer switch ports are required
  • Fewer network adapters are required per virtualization host
  • Ideal for stateless applications

Trade-off:

  • No transparent movement of stateful applications
  • For stateful applications, infrastructure upgrades will need to be coordinated with VM owners

To support the dynamic characteristics of a private cloud, a network switch should support a remote programmatic interface – for firmware upgrades, and prioritization of traffic for quality of service. These switches should be dedicated for a private cloud to maintain predictable performance and to minimize risks associated with human interaction. As defined earlier, the servers need to be connected to at least two networks, management and consumer, with live migration (if required). The connections should always be the same; for example, network adapter 1 to management, network adapter 2 to consumer, and network adapter 3 to Live Migration.

If iSCSI is chosen for the storage interconnects, iSCSI traffic should reside in an isolated VLAN in order to maintain security and performance levels. This iSCSI traffic should not share a network adaptor with other traffic, for example the management or consumer network traffic.

The interconnect speeds between switches should be evaluated to determine the maximum bandwidth for communications. This could affect the maximum number of hosts which can be placed on each switch.

When designing network connectivity for a well-managed infrastructure, the virtualization hosts should have the following specific networking requirements:

  • Support for 802.1Q VLAN Tagging: To provide network segmentation for the virtualization hosts, supporting management infrastructure and workloads. This is the preferred method to help secure and isolate data traffic for a private cloud.
  • Remote Out-of-band Management Capability: To monitor and manage servers remotely over the network regardless of whether the server is turned on or off.
  • Support for PXE Version 2 or Later: To facilitate automated server provisioning.

To dynamically initiate remediation events in response to the failure or impending failure of network switch components, each switch is required to display warnings, errors, and state information for the following:

  • CPU
    • Utilization
    • Temperature
  • Flash Memory
    • Utilization
  • Interface Details
    • Port State
    • Port Errors
    • Bandwidth Utilization
  • Power Supply
    • State
    • Active / Passive
    • Power Output Variations
  • Fans
    • Speed
    • State

Storage Switch/Subsystem Health Model

To dynamically initiate remediation events in response to either the failure or impending failure of storage switches and storage subsystem components, each component is required to display warnings, errors, and state information for the following:

Storage Switch

  • CPU
    • Utilization
    • Temperature
  • Flash Memory
    • Utilization
  • Interface Details
    • Port State
    • Port Errors
    • Bandwidth Utilization
  • Power Supply
    • State
    • Active / Passive
    • Power Output Variation
  • Fans
    • Speed
    • State

Storage Subsystem

  • CPU
    • Utilization
    • Temperature
  • Flash Memory
    • Utilization
  • Service Processor
    • State
    • Errors
    • IOPS
  • Disks
    • Read / Write Failures
    • Predictive Failures
  • Power Supply
    • State
    • Active / Passive
    • Power Output Variations
  • Fans
    • Speed
    • State

Hypervisor

The hypervisor exposes the VM services to consumers. It needs to be configured identically on all hosts in a Resource Pool, and ideally all hosts in the private cloud. Fabric Management will orchestrate the addition of virtual switches, machines, and disks.

An architect needs to decide whether the private cloud should use CPU Resource Reservations to make sure of predictable performance of VMs. This table lists the benefits and trade-offs:

Use CPU Resource Reservations

Benefit:

  • Consistent VM performance for consumers

Trade-off:

  • Fixed number of VMs per host might lead to low utilization of resources
  • Toolset may not set resource reservations

Do Not Use CPU Resource Reservations

Benefit:

  • Variable number of VMs per host means resource utilization can be maximized

Trade-off:

  • Consumers do not experience consistent VM performance
  • One VM can adversely affect the processing performance of others

The decision is driven by whether efficiency or consistency is more important for the private cloud.

The architect could elect to provide different classes of service: one which uses resource reservations to deliver predictability, and another which shares the resources. Separate Resource Pools could be deployed accordingly, along with differential pricing to incent the consumers to exhibit desired behavior.

Resource reservations will not prevent a host from saturating the network and crippling the performance of other hosts. As stated in the Network section earlier, this needs monitoring.

Parent Partition

The parent partition provides the hypervisor with access to physical resources such as network and storage. It also hosts the hypervisor management interfaces. The parent partition needs to be configured identically on all servers in a Resource Pool.

If an architect elects to create a service classification which depends on consuming LUNs directly (not via the parent partition), the parent partition must be configured to present the pass-through for this storage. Further, this storage must be available to all parent partitions in that Resource Pool to enable VM portability between hosts.

The parent partition displays health information for the server, the parent partition operating system, and the hypervisor. The health monitoring system, in turn, consumes this information to enable Capacity Management and Fabric Management.

Management Layers

Task Execution

Task execution covers the low-level management operations that can be performed on a platform, generally surfaced through the command line or an Application Programming Interface (API). The capability to execute tasks must not only exist; the usage semantics should also be consistent across the members of a fault domain so that automation can use a common format. Where semantics differ, the automation layer is forced to compensate through custom code in the orchestration, or even to use different execution hosts or engines within a fault domain.

Automation

The automation layer is made up of the foundational automation technology plus a series of single purpose commands and scripts that perform operations such as starting or stopping a virtual machine, restarting a server, or applying a software update. These atomic units of automation are combined and executed by higher-level management systems. The modularity of this layered approach dramatically simplifies development, debugging, and maintenance.
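
A minimal sketch of this layering: each function below is a single-purpose automation unit, and the workflow at the end composes them into an end-to-end process (here, patching a host). The function names, and the use of print statements instead of calls to real management APIs, are assumptions for illustration.

```python
# Atomic automation units...
def stop_vm(vm: str) -> None:
    print(f"stopping {vm}")


def apply_update(host: str, update_id: str) -> None:
    print(f"applying {update_id} to {host}")


def restart_server(host: str) -> None:
    print(f"restarting {host}")


def start_vm(vm: str) -> None:
    print(f"starting {vm}")


# ...combined and executed by a higher-level workflow (orchestration).
def patch_host_workflow(host: str, vms: list, update_id: str) -> None:
    """Compose the atomic units into an end-to-end IT process."""
    for vm in vms:
        stop_vm(vm)
    apply_update(host, update_id)
    restart_server(host)
    for vm in vms:
        start_vm(vm)


patch_host_workflow("host01", ["vm-a", "vm-b"], "update-001")  # placeholder update id
```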

Orchestration

In much the same way that an enterprise resource planning (ERP) system manages a business process such as order fulfillment and handles exceptions such as inventory shortages, the orchestration layer provides an engine for IT-process automation and workflow. The orchestration layer is the critical interface between the IT organization and its infrastructure and transforms intent into workflow and automation.

Ideally, the orchestration layer provides a graphical user interface in which complex workflows that consist of events and activities across multiple management-system components can be combined, to form an end-to-end IT business process such as automated patch management or automatic power management. The orchestration layer must provide the ability to design, test, implement, and monitor these IT workflows.

Service Management

Service management provides the means for automating and adapting IT service management best practices, such as those found in the IT Infrastructure Library (ITIL), to provide built-in processes for incident resolution, problem resolution, and change control.

Self Service

Self-service capability is a characteristic of private cloud computing and must be present in any implementation. The intent is that users approach the self-service capability and are presented with the provisioning options available in the organization. The capability may be basic, offering only the provisioning of a virtual machine with a pre-defined configuration, or more advanced, allowing configuration options on top of the base configuration and extending up to a platform capability or service.

Self-service capability is a critical business driver that enables members of an organization to become more agile in responding to business needs with IT capabilities, in a manner that aligns and conforms with internal business IT requirements and governance.

This means the interface between IT and the business is abstracted to a simple, well-defined, and approved set of service options, presented as a menu in a portal or available from the command line. The business selects these services from the catalog, begins the provisioning process, and is notified upon completion; the business is then charged only for what it actually uses.

This is analogous to the capabilities available on public cloud platforms.

The entities that consume self-service capabilities in an organization are individual business units, project teams, or any other department that has a need to provision IT resources. These entities are referred to as tenants. In a private cloud, tenants are granted the ability to provision compute and storage resources as they need them to run their workloads. Connectivity to these resources is managed behind the scenes by the fabric management layer of the private cloud.

Tenant administrators are granted access to a self-service portal where they can initiate workflows to provision virtualized services in the appropriate configuration and capacity. For example, compute resources may be available in small, medium, or large instance capacities, along with storage of the appropriate size and performance characteristics. Resources are provisioned without any intervention from infrastructure personnel in IT, and the overall progress is tracked by the fabric management layer and reported through the portal.

A chargeback model defines how tenants will be charged for using cloud resources; the charge is typically the number and size of resources provisioned multiplied by the amount of time they are provisioned for. This information is available to tenant administrators through the self-service portal, as is the ability to produce cost reports.
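
As an illustration of such a chargeback calculation, the sketch below multiplies the number and size of provisioned resources by the time they were provisioned; the instance sizes and hourly rates are hypothetical.

```python
# Illustrative chargeback: size x count x hours, with assumed hourly rates.
HOURLY_RATE = {"small": 0.05, "medium": 0.10, "large": 0.20}  # per VM-hour (assumed)


def chargeback(provisioned: list) -> float:
    """provisioned: (instance_size, vm_count, hours_provisioned) per line item."""
    return sum(HOURLY_RATE[size] * count * hours for size, count, hours in provisioned)


tenant_usage = [("medium", 4, 720.0), ("large", 1, 240.0)]   # e.g. one month of usage
print(f"Tenant charge: {chargeback(tenant_usage):.2f}")
```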

Tenants are granted the ability to manage, monitor and report on the resources that they have provisioned.