This section describes best practices for administering your system.
Best practices for system sizing and scaling
These best practices apply specifically to scaling product services across master instances (nodes). If you do not plan to scale product services across master instances, you can skip this section.
Running product services on master instances can maximize the use of all physical resources in the site. However, since master instances are critical to the orchestration and management of HCP for cloud scale, it is important to avoid any impact on them. Here are best practices for master instances.
First, do not over-provision master instances:
- CPU/RAM - HCP for cloud scale can partition the usage of these resources, but an instance can still exhaust them.
- Network - HCP for cloud scale cannot control usage of network resources.
- Disk space - HCP for cloud scale cannot control usage of disk space. One service might consume all the free space and thus affect other services. This is most likely to happen with the Metadata Gateway and Message Queue services, but all stateful services consume disk space.
Second, consider the impact of running product services on master instances:
- When scaling product services across master instances, you must take into account the resources (CPU, memory, etc.) already consumed there by system services.
- If product services run into an issue, resolution could require restarting services (including system services) or even restarting the host (operating system). This can affect system services and thus the operation of the entire site.
- When designing a solution, consider hardware failures and other causes of outages. If you plan to use the master instances for product services, ensure there are enough instances (nodes) that the cluster will still be able to manage both existing objects and the anticipated throughput (addition and deletion) of objects if one or even two hosts fail.
You should consider the differing impacts of stateful and stateless services on resource consumption when planning how to scale system and worker service instances across your HCP for cloud scale system.
A stateful service is a service that permanently saves data to disk. Any data that is stored by a stateful service is critical for the operation of the system and the integrity of customer data, so all data that is stored by a stateful service is protected by having three identical copies. Each copy of the data resides on a different instance (node). Each instance of a stateful service runs on a different physical instance.
Stateful services are also persistent. A persistent service runs on a specific instance (node) that you designate. If an instance of a persistent service fails, HCP for cloud scale restarts the instance on the same node. (In other words, HCP for cloud scale does not automatically bring up a new stateful service instance on a different node.)
Because every stateful service is also persistent, a failure or even a planned outage of an instance (node) affects the copies of data held by all stateful service instances that were running on that instance (node). For more information, see Service failure recovery and Scaling Metadata Gateway instances.
Also, stateful services typically require more computing power to process, store, and read the data on disk securely, efficiently, and with high performance.
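The three-copy protection described above amounts to a placement rule: each copy of a stateful service's data must reside on a different instance (node), so a cluster needs at least three instances to hold the copies. The following is a minimal illustrative sketch of that rule only; the node names and the helper function are hypothetical, not HCP for cloud scale APIs.

```python
# Illustrative sketch: models the three-copy placement rule for data
# stored by a stateful service. This helper and the node names are
# hypothetical; they are not part of HCP for cloud scale.

REPLICA_COUNT = 3  # every piece of stateful data has three identical copies


def place_replicas(nodes):
    """Choose distinct nodes for the three copies; fail if there are too few."""
    if len(nodes) < REPLICA_COUNT:
        raise ValueError(
            f"need at least {REPLICA_COUNT} instances, have {len(nodes)}"
        )
    # Each copy resides on a different instance (node).
    return nodes[:REPLICA_COUNT]


placement = place_replicas(["node-1", "node-2", "node-3", "node-4"])
print(placement)  # three distinct nodes hold the three copies
```

The point of the sketch is the constraint, not the selection strategy: losing one node therefore always affects one copy of every piece of stateful data that node held.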
A stateless service is a service that does not save data to disk. Stateless services are usually also floating. A floating service can run on any node assigned to a pool of instances associated with the floating service. If an instance of a floating service fails, HCP for cloud scale restarts the instance on any node in the instance pool. Therefore, stateless floating service instances have less resource impact, and are typically easier to manage and recover, than stateful persistent service instances.
The impact of stateful and stateless services is as follows:
- The number of floating/stateless service instances affects the speed of operations. If there are too few, operations slow down.
- The number of persistent/stateful service instances affects both the speed of operations and availability. If there are too few, operations slow down and can fail entirely.
- Not every stateful service is critical to application-facing operations (for example, S3 operations).
- Not every stateful service is resource intensive.
- Not every stateless service is lightweight in its resource usage.
The following services can be resource intensive and should be scaled across worker and master instances carefully:
- Data Lifecycle
- Metadata Gateway
- Message Queue
- Policy Engine
- S3 Gateway
Consider the following use cases:
- Ring buffer - Data has a short life cycle (a few weeks) and will be deleted after a certain period of time
- Synchronization - Buckets are configured to synchronize (mirror) data out to or in from external services
- Performance sensitive - The planned usage must achieve a specific level of performance (an SLA)
If your planned usage of the HCP for cloud scale system matches one of these use cases, it is best to size and scale the system as follows:
- The minimum cluster size is six instances (nodes).
- With fewer than 12 instances (nodes), do not scale resource-intensive services across more than one master node.
- With 12 or more instances (nodes), do not scale resource-intensive services across any master nodes.
If, however, your planned usage of the HCP for cloud scale system does not match any of these use cases, it is best to size and scale the system as follows:
- The minimum cluster size is four instances (nodes).
- With four or five instances (nodes), do not scale resource-intensive services across more than two master nodes.
- With six or seven instances (nodes), do not scale resource-intensive services across more than one master node.
- With eight or more instances (nodes), do not scale resource-intensive services across any master nodes.
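The two sets of sizing rules above can be expressed as a single decision function. The following is an illustrative sketch only; the function is hypothetical and not part of HCP for cloud scale.

```python
# Illustrative sketch of the sizing rules above. This helper is
# hypothetical; it is not an HCP for cloud scale API.

def max_masters_for_intensive_services(instances, matches_use_case):
    """Return the maximum number of master nodes across which
    resource-intensive services should be scaled.

    instances        -- total number of instances (nodes) in the cluster
    matches_use_case -- True if the planned usage matches one of the listed
                        use cases (ring buffer, synchronization, or
                        performance sensitive)
    """
    if matches_use_case:
        if instances < 6:
            raise ValueError("minimum cluster size is six instances")
        if instances < 12:
            return 1   # fewer than 12 nodes: at most one master node
        return 0       # 12 or more nodes: no master nodes
    if instances < 4:
        raise ValueError("minimum cluster size is four instances")
    if instances <= 5:
        return 2       # four or five nodes: at most two master nodes
    if instances <= 7:
        return 1       # six or seven nodes: at most one master node
    return 0           # eight or more nodes: no master nodes


print(max_masters_for_intensive_services(10, matches_use_case=True))   # 1
print(max_masters_for_intensive_services(8, matches_use_case=False))   # 0
```

Note that the demanding use cases both raise the minimum cluster size (six instances instead of four) and reach "no master nodes" at a smaller cluster size (12 instances instead of eight).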
Best practices for maintaining system availability
For a multi-instance system, master instances should run on separate physical hardware. If your instances run on virtual machines, run the master instances on separate physical hosts.
In a multi-instance system, you can choose which instances, and how many, each service runs on. For redundancy, run each service on more than one instance.