Monitoring
Your system provides several mechanisms for monitoring the health and performance of the system and all of its instances and services.
Monitoring instances
You can use the Admin App, CLI, and REST API to view a list of all instances in the system.
Viewing all instances
To view all instances, in the Admin App, click Dashboard > Instances.
The page shows all instances in the system. Each instance is identified by its IP address.
This table describes the information shown for each instance.
Property | Description |
State |
|
Services | The number of services running on the instance. |
Service Units | The total number of service units for all services and job types running on the instance, out of the best-practice service unit limit for the instance. An instance with a higher number of service units is likely to be more heavily used by the system than an instance with a lower number of service units. The Instances page displays a blue bar for instances running less than the best-practice service unit limit. The Instances page displays a red bar for instances running more than the best-practice service unit limit. |
Load Average | The load averages for the instance for the past one, five, and ten minutes. |
CPU | The sum of the percentage utilization for each CPU core in the instance. |
Memory Allocated |
This section shows both:
|
Memory Total | The total amount of RAM for the instance. |
Disk Used | The current amount of disk space that your system is using in the partition on which it is installed. |
Disk Free | The amount of free disk space in the partition in which your system is installed. |
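The bar-color rule described for Service Units can be sketched as a small function. This is an illustration only: the function name is hypothetical, and because the description does not say which color is shown when usage exactly equals the limit, the sketch treats that case as blue by assumption.

```python
def service_unit_bar_color(units_in_use: int, best_practice_limit: int) -> str:
    """Return the bar color the Instances page is described as showing."""
    # Red is documented for instances running MORE than the limit.
    if units_in_use > best_practice_limit:
        return "red"
    # Blue is documented for instances running LESS than the limit.
    # Treating the exactly-equal case as blue is an assumption.
    return "blue"
```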
Viewing the services running on an instance
To view the services running on an individual instance, in the Admin App:
Procedure
Click Dashboard > Instances.
Select the instance you want.
The page lists all services running on the instance.
For each service, the page shows:
- The service name
- The service state:
- Healthy: The service is running normally.
- Unconfigured: The service has yet to be configured and deployed.
- Deploying: The system is currently starting or restarting the service. This can happen when:
- You move the service to run on a completely different set of instances.
- You repair a service.
- Balancing: The service is running normally, but performing background maintenance.
- Under-protected: In a multi-instance system, one or more of the instances on which a service is configured to run are offline.
- Failed: The service is not running or the system cannot communicate with the service.
- CPU Usage: The current percentage CPU usage for the service across all instances on which it's running.
- Memory: The current RAM usage for the service across all instances on which it's running.
- Disk Used: The current total amount of disk space that the service is using across all instances on which it's running.
Related CLI commands
getInstance
listInstances
Related REST API methods
GET /instances
GET /instances/{uuid}
You can get help on specific REST API methods for the Admin App at REST API - Admin.
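As a rough sketch, the two REST methods above could be called from Python as follows. The base URL, the bearer-token header, and the example UUID are placeholders assumed for illustration; check REST API - Admin for the actual address and authentication scheme of your system. The code only builds the requests and does not send them.

```python
import urllib.request

# Placeholder values (assumptions): replace with your system's real
# API base URL and a valid access token.
BASE_URL = "https://admin.example.com/api"
TOKEN = "example-token"


def build_get(path: str, base_url: str = BASE_URL, token: str = TOKEN) -> urllib.request.Request:
    """Build (but do not send) a GET request for a REST API path."""
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        headers={
            "Accept": "application/json",
            # Bearer-token authentication is an assumption of this sketch.
            "Authorization": f"Bearer {token}",
        },
        method="GET",
    )


# GET /instances: list all instances.
list_request = build_get("/instances")

# GET /instances/{uuid}: one instance; this UUID is a made-up example.
one_request = build_get("/instances/00000000-0000-0000-0000-000000000000")
```

To send a request, pass it to `urllib.request.urlopen()` and decode the response body as JSON.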
Monitoring services
You can use the Admin App, CLI, and REST API to view the status of all services for the system.
Viewing all services
To view the status of all services, in the Admin App, click Services.
For each service, the page shows:
- The service name
- The service state:
- Healthy: The service is running normally.
- Unconfigured: The service has yet to be configured and deployed.
- Deploying: The system is currently starting or restarting the service. This can happen when:
- You move the service to run on a completely different set of instances.
- You repair a service.
- Balancing: The service is running normally, but performing some background maintenance operations.
- Under-protected: In a multi-instance system, one or more of the instances on which a service is configured to run are offline.
- Failed: The service is not running or the system cannot communicate with the service.
- CPU Usage: The current percentage CPU usage for the service across all instances on which it's running.
- Memory: The current RAM usage for the service across all instances on which it's running.
- Disk Used: The current total amount of disk space that the service is using across all instances on which it's running.
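The service states listed above lend themselves to a simple health filter. The sketch below is a minimal illustration: the dictionary shape and the sample data are hypothetical (not the format returned by the CLI or REST API), and flagging only Under-protected and Failed is a choice made for this example.

```python
# States that this sketch treats as needing operator attention.
ATTENTION_STATES = {"Under-protected", "Failed"}


def needs_attention(services):
    """Return the names of services whose state is Under-protected or Failed."""
    return [s["name"] for s in services if s["state"] in ATTENTION_STATES]


# Hypothetical sample data for illustration.
sample = [
    {"name": "Database", "state": "Healthy"},
    {"name": "Metrics", "state": "Under-protected"},
    {"name": "Index", "state": "Failed"},
    {"name": "Scheduling", "state": "Balancing"},
]

print(needs_attention(sample))  # ['Metrics', 'Index']
```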
Viewing individual service status
To view the detailed status for an individual service, select the service on the Services page.
In addition to the status information, the page shows:
- Instances: A list of all instances on which the service is running.
- Volumes: To view a list of volumes used by the service, select the row for an instance in the Instances section.
- Network: [Internal|External]: Which network type this service uses to receive communications.
This section also displays a list of the ports that the service uses.
- Configuration settings: The settings you can configure for the service.
- Service Units: The total number of service units currently being spent to run this service. This value is equal to the service's service unit cost times the number of instances on which the service is running.
- Service unit cost: The number of service units required to run the service on one instance.
- Service Instance Types: For services that have multiple types, the types that are currently running.
- Instance Pool: For floating services, the instances that this service is eligible to run on.
- Events: A list of all system events for the service.
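The Service Units value is a simple product, as the following sketch shows. The function name and the numbers in the usage note are illustrative, not values taken from a real system.

```python
def service_units_spent(service_unit_cost: int, instance_count: int) -> int:
    """Service units spent = per-instance cost x number of instances."""
    return service_unit_cost * instance_count
```

For example, a service with a hypothetical service unit cost of 5 running on 3 instances spends `service_units_spent(5, 3)`, which is 15 service units.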
Related CLI commands
getService
listServices
Related REST API methods
GET /services
GET /services/{id}
You can get help on specific REST API methods for the Admin App at REST API - Admin.
Monitoring processes
The Processes page lets you view information about what the system is doing. This includes any service operations you started and any internal maintenance processes the system needs to run.
Monitoring service operations
You can use the Admin App, CLI, and REST API to monitor all service operations. These include:
- The initial deployments of services when the system was installed.
- Service relocations that you begin.
For each one, the system shows:
- The name of the service involved
- The status of the operation
- The number of steps completed out of the total number of steps
Procedure
Select the Processes window.
Related CLI commands
listSystemTasks
getSystemTask
Related REST API methods
GET /tasks/system
GET /tasks/system/{uuid}
You can get help on specific REST API methods for the Admin App at REST API - Admin.
Monitoring system processes
You can use the Admin App, REST API, and CLI to view the progress of internal system processes. These include package installation tasks and regularly scheduled system maintenance activities such as log rotation.
For each process, your system shows:
- The process name
- The process state
- The times at which each step in the process run occurred
Procedure
In the Admin App, select Processes.
To view the currently running processes, select the System tab.
To view the scheduled processes, select the Scheduled tab.
Related CLI commands
listSystemTasks
getSystemTask
Related REST API methods
GET /tasks/system
GET /tasks/system/{uuid}
You can get help on specific REST API methods for the Admin App at REST API - Admin.
System events
Your system maintains a log of system events that you can view through the Admin App, CLI, and REST API.
Procedure
To view all system events, in the Admin App, click Events.
Related CLI commands
queryEvents
To view events through the CLI, your requests need to specify which events you want to retrieve.
For example, this JSON request body searches the event log for all events that have a severity level of warning:
{ "severities": [ "warning" ] }
Related REST API methods
POST /events
To view events through the REST API, your requests need to specify which events you want to retrieve.
For example, this JSON request body searches the event log for all events that have a severity level of warning:
{ "severities": [ "warning" ] }
You can get help on specific REST API methods for the Admin App at REST API - Admin.
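The request body shown above can be generated and attached to a POST /events request as in this sketch. The base URL is a placeholder assumed for illustration, and authentication headers are omitted; the code builds the request without sending it.

```python
import json
import urllib.request


def build_events_query(severities, base_url="https://admin.example.com/api"):
    """Build (but do not send) a POST /events request filtered by severity.

    base_url is a placeholder; substitute your system's API address.
    """
    body = json.dumps({"severities": list(severities)}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/events",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_events_query(["warning"])
print(request.data.decode("utf-8"))  # {"severities": ["warning"]}
```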
HCP for cloud scale events
Most events are generated by and reported through the System Management application.
Events specific to HCP for cloud scale are reported with the IDs 6006 (informational), 6007 (warning), and 6008 (severe). These events are:
ID | Severity | Message | Description |
6006 | INFO | Service Information: Job Configuration 'id' of type 'job_type' updated with status 'status' | The policy configuration has changed and the status of job id of type job_type is one of the following:
|
6006 | INFO | Service Information: Job Configuration 'id' started | Policy configuration for job id has started. |
6006 | INFO | Service Information: Lifecycle policy {CREATE|UPDATE|DELETE} bucket 'bucket_name' | The lifecycle policy for bucket bucket_name has been either created, updated, or deleted. |
6006 | INFO | Service Information: Lifecycle policy deleted for bucket 'bucket_name' | The lifecycle policy for bucket bucket_name has been deleted. |
6006 | INFO | Service Information: Serial number updated to value | The HCP for cloud scale serial number has been changed to value. |
6006 | INFO | Service Information: setting was set to value | The S3 setting setting has been changed to value. |
6006 | INFO | Service Warning: Storage component 'id' created | The storage component id has been created. |
6006 | INFO | Service Warning: Storage component 'id' is now state | The storage component id is in one of the following states:
|
6006 | INFO | Service Warning: Storage component 'id' updated: configuration | The storage component id has been updated. configuration lists the changes. |
6007 | WARNING | Service Warning: Certificate for Storage component 'id' is about to expire in 'n' days | The SSL certificate for the storage component id is set to expire in n days. If the certificate expires, HCP for cloud scale will not be able to read from or write to the storage component. |
6007 | WARNING | Service Warning: Storage component 'id' is now INACCESSIBLE | The storage component id is inaccessible. HCP for cloud scale cannot read from or write to the storage component. |
6008 | SEVERE | Service Error: Storage Component Certificate Expired. | The SSL certificate for a storage component has expired. HCP for cloud scale cannot read from or write to the storage component. |
6008 | SEVERE | Service Error: Vault Connection Issue | The active vault node can't be reached. |
6008 | SEVERE | Service Error: Vault Service Completely Sealed | The vault service (Key Management Server service) is completely sealed. Unseal it using the unseal keys you obtained when you turned on encryption. |
6008 | SEVERE | Service Error: Vault Service Node Error. IP: ip_address | One of the vault nodes can't be reached. If other active nodes are available, service continues, but attend to this issue immediately. |
6008 | SEVERE | Service Error: Vault Service Node Sealed. IP: ip_address | One of the vault nodes is sealed. If other active nodes are available, service continues, but attend to this issue immediately. Unseal it using the unseal keys you obtained when you turned on encryption. |
Alerts
Alert messages notify you of situations that need your attention. Alerts can have a severity of Info, Warning, Severe, or Critical. You can view system alerts through the Admin App, CLI, or REST API, and storage component alerts through the Object Storage Management app.
Each alert corresponds to a system event.
Severity | Alert Description | Action |
Severe | Instance ip-address disk usage severe threshold |
The specified instance has less than 10% free disk space. Add additional storage to the instance. Important: If an instance runs out of disk space, the system can become unresponsive. |
Severe | Master Instance ip-address is down |
Do one of these:
|
Severe | Service is down |
Verify the health of your instances. If one is down, do one of these:
Otherwise, if your instances are healthy and the problem persists, contact Support. |
Severe | Worker Instance ip-address is down |
Do one of these:
|
Warning | Instance ip-address disk usage warning threshold |
The specified instance has less than 25% free disk space. Add additional storage to the instance. Important: If an instance runs out of disk space, the system can become unresponsive. |
Warning | Package installation failed |
Your system failed to install a package that you uploaded. |
Warning | Service below recommendation |
The service is currently running on fewer than the minimum number of instances. Configure this service to run on additional instances. |
Warning | Service under-protected |
A service has lost redundancy; that is, one or more instances on which that service is running are unresponsive. Verify the health of your instances. If one is down, do one of these:
Otherwise, if your instances are healthy and the problem persists, contact Support. |
Warning | SSL server certificate chain expires soon |
A certificate in the SSL server certificate chain for this system expires soon. If the certificate chain expires, users can't access the system. |
Warning | SSL server certificate chain expired |
The SSL server certificate chain for this system contains an expired certificate. Users cannot access the system until the certificate chain is replaced. |
Info | Package installation in progress |
Your system is currently installing a package that you uploaded. Depending on the contents of the package, this might take a while. |
Warning | The certificate for the storage component (storage-id) is about to expire in n days | Renew the storage component certificate. |
Info | The storage component (storage-id) is unavailable | Verify that the storage component ID is correct and valid and that the storage component is active. |
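The two disk usage thresholds in the table above (less than 25% free for Warning, less than 10% free for Severe) can be checked ahead of time with a sketch like the following. The function name is hypothetical, and the system's own threshold evaluation might differ in detail, for example in rounding.

```python
def disk_alert_severity(free_bytes: int, total_bytes: int):
    """Return "Severe", "Warning", or None based on the free-space percentage."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_percent = 100.0 * free_bytes / total_bytes
    if free_percent < 10:
        return "Severe"   # less than 10% free disk space
    if free_percent < 25:
        return "Warning"  # less than 25% free disk space
    return None           # no disk usage alert expected
```

On an instance, the inputs could come from `shutil.disk_usage(path)`, which returns the total, used, and free bytes for the partition containing `path`.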
Severity | Message | Description |
Warning | Certificate for Storage component id is about to expire in n days | The SSL certificate for the storage component id is set to expire in n days. If the certificate expires, HCP for cloud scale will not be able to read from or write to the storage component. |
Warning | Storage component id is now inaccessible | The storage component id is in the state INACCESSIBLE. HCP for cloud scale cannot read from or write to the storage component. |
Severe | Certificate for Storage component id expired | The SSL certificate for the storage component id has expired. HCP for cloud scale cannot read from or write to the storage component. Install a new certificate. |
Severe | Error communicating with a vault node. Node IP: ip_address | One of the vault nodes can't be reached. If other active nodes are available, service continues, but attend to this issue immediately. Examine the vault instance logs to determine the cause of this issue. |
Severe | Failed to connect to KMS server | One of the vault nodes can't be reached. If other active nodes are available, service continues, but attend to this issue immediately. If ingest is halted, investigate why the KMS service is failing to run on all nodes. If ingest is still working, the original active node has failed over. Examine the vault instance logs to determine the cause of the failure. |
Severe | Failed to connect to KMS server as it is completely sealed | The vault service (Key Management Server service) is completely sealed. Unseal it using the unseal keys you obtained when you turned on encryption. |
Severe | Service error: There is a critical issue with the Metadata Gateway database. Shutting down the Metadata Gateway Service. | A Metadata Gateway instance has encountered an issue and shut down. Use the System Management Services function Repair to restart it. If restarting the service doesn't resolve the issue, contact Support. |
Severe | Vault node is sealed. Node IP: ip_address | One of the vault nodes is sealed. If other active nodes are available, service continues, but attend to this issue immediately. Unseal it using the unseal keys you obtained when you turned on encryption. |
Critical | Failed to connect to KMS server | The Key Management System service is not available. Until the service is available, data on encrypted storage components can't be read or written. When the KMS service restarts, if there is only one active instance, log in to HCP for cloud scale on port 8200 and provide unseal keys to reopen the vault. |
Critical | Failed verification for retrieved encryption key for StorageComponent_ID{uuid=uuid} | The encryption key returned from the Key Management System server doesn't match the key for the storage component uuid. Verify that the KMS service is available. If the service is available, verify that you have provided the service with a quorum of unseal keys. If objects on the storage component still can't be read, contact Support. |
Critical | Metadata-Coordination cannot communicate with Sentinel service to get state information | The Sentinel service is not responding to requests for state information. Using the System Management application, immediately review the health of the Metadata-Coordination and Sentinel services and ensure that the Sentinel container has adequate heap size for the configuration of the cluster. |
Viewing alerts
Procedure
To view alerts, click the user icon in the top right corner of any Admin App page and then click Notifications.
Object Storage Management application instructions
The Object Storage Management application displays alerts about storage components. If an alert is raised, the alert icon turns red and displays a badge with the number of active alerts.
Click the icon to display a window listing alert text.
Related CLI commands
listAlerts
Related REST API methods
GET /alerts
You can get help on specific REST API methods for the Admin App at REST API - Admin.
Related REST API methods (HCP for cloud scale)
POST /alert/list
For information about specific HCP for cloud scale API methods, see the MAPI Reference or, in the Object Storage Management application, click the profile icon and select REST API.
Email notification rules
For the system to send email notifications, you need to create a rule that specifies who to email, what email server to use, what events to send emails about, and what information to include in email messages.
- Enable: Turns on email notifications.
- Host: The hostname or IP address of the email server.
- Port: The port on which the email server listens for email messages.
- Security: The security protocol used by the email server (SSL or STARTTLS) or None if the email server doesn’t use a security protocol.
- Authenticated: Enable this if the email server needs authentication, then specify:
- In the Username field, the username for an email account that’s authorized to establish the connection between the system and the email server.
- In the Password field, the password for the email account.
You use the email notification message settings to configure a template for formatting all email notifications sent by the system.
- From: The email address from which you want email notifications to be sent.
- Subject: The email subject.
- Body: The email message body.
This table lists the variables you can use in the email notification template. When the system sends an email notification, it replaces the variables in the template with event-specific information.
Variable | Description |
$severity | Event severity: INFO, WARNING, or SEVERE. |
$subject | A short description of the event. |
$message | Event message text. |
$userName | Name of the user responsible for the event. |
$objectId | Unique identifier for component affected by the event. |
$subsystem | Category for the component affected by the event. |
$objectSourceId | Unique identifier of the internal system component or process that was the source of the event. Value is [unknown] for most events. |
- Email addresses: A comma-separated list of email addresses to send notification emails to.
- Severity Filter: The event severities about which to send email notifications. Can be one or more of these: INFO, WARNING, SEVERE.
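To preview how a template built from these variables might look once filled in, you can approximate the substitution with Python's `string.Template`, which happens to use the same `$name` placeholder syntax. This is only an approximation for drafting templates: the system's own substitution engine might behave differently, and the event values below are invented.

```python
from string import Template

# Draft Subject and Body templates using the documented variables.
subject_template = Template("[$severity] $subject")
body_template = Template(
    "$message\n"
    "User: $userName\n"
    "Subsystem: $subsystem\n"
    "Object: $objectId"
)

# Invented event values for illustration only.
event = {
    "severity": "WARNING",
    "subject": "Service under-protected",
    "message": "A service has lost redundancy.",
    "userName": "admin",
    "subsystem": "services",
    "objectId": "example-object-id",
}

print(subject_template.substitute(event))  # [WARNING] Service under-protected
print(body_template.substitute(event))
```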
Creating email notification rules
Procedure
Select the Configuration window.
Click Notifications.
Click Create.
In the Type field, select Email.
Type a name for the notification rule.
Configure the SMTP settings and message settings for the notification rule.
Specify a comma-separated list of emails to send notifications to.
Click Create.
Related CLI commands
createNotificationRule
Related REST API methods
POST /notifications
You can get help on specific REST API methods for the Admin App at REST API - Admin.
Syslog notification rules
When you create a syslog notification rule, the system sends log messages to your syslog server for each applicable system event.
- Enable: Turns on syslog notifications
- Host: The hostname or IP address of the syslog server
- Port: The port on which the syslog server listens for log messages
- Facility: Category for the messages sent by this notification rule
You use the syslog notification message settings to configure a template for formatting all syslog notifications sent by this notification rule.
- Message: The message to send. You can use these variables as part of the message:
Variable | Description |
$severity | Event severity: INFO, WARNING, or SEVERE. |
$subject | A short description of the event. |
$message | Event message text. |
$time | Time at which the event occurred. |
$userName | Name of the user responsible for the event. |
$subsystem | Category for the component affected by the event. |
$objectId | Unique identifier for component affected by the event. |
$objectType | The type of the component affected by the event. |
$objectSourceId | Unique identifier of the internal system component or process that was the source of the event. Value is [unknown] for most events. |
$objectSourceType | Type of the internal system component or process that was the source of the event. Value is [unknown] for most events. |
- Sender Identity: Identity of the sender for the event. Sent with every syslog message.
- Severity Filter: The event severities about which to send syslog notifications. Can be one or more of these: INFO, WARNING, or SEVERE.
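For background on the Facility setting: in the standard syslog protocol (RFC 5424), each message starts with a priority value computed as facility code times 8 plus severity code. The sketch below shows that arithmetic. The facility and severity codes are the standard ones, but the mapping from this system's event severities to syslog severity codes is an assumption made for illustration; the source does not document which codes the system actually sends.

```python
# Standard syslog facility code for local0 (RFC 5424).
FACILITY_LOCAL0 = 16

# ASSUMED mapping from event severities to standard syslog severity codes
# (6 = informational, 4 = warning, 2 = critical).
SYSLOG_SEVERITY = {"INFO": 6, "WARNING": 4, "SEVERE": 2}


def syslog_priority(facility: int, event_severity: str) -> int:
    """Compute the syslog PRI value: facility * 8 + severity."""
    return facility * 8 + SYSLOG_SEVERITY[event_severity]


def frame_message(facility: int, event_severity: str, text: str) -> str:
    """Prefix a rendered message with its <PRI> header."""
    return f"<{syslog_priority(facility, event_severity)}>{text}"


print(frame_message(FACILITY_LOCAL0, "WARNING", "Service under-protected"))
# <132>Service under-protected
```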
Creating syslog notification rules
Procedure
Select the Configuration window.
Click Notifications.
Click Create.
In the Type field, select Syslog.
Type a name for the notification rule.
Configure the settings for the notification rule.
Specify a severity filter for the notification rule.
Click Create.
Related CLI commands
createNotificationRule
Related REST API methods
POST /notifications
You can get help on specific REST API methods for the Admin App at REST API - Admin.
Logs and diagnostic information
Each service maintains its own set of logs. By default, log files are maintained in the folder install_path/log on each instance in the system. During installation, you can configure each service to store its log files in a different, non-default location.
This table lists the available log levels.
Level | Levels included |
ALL | FATAL, ERROR, WARN, INFO, DEBUG, TRACE |
TRACE | FATAL, ERROR, WARN, INFO, DEBUG, TRACE |
DEBUG | FATAL, ERROR, WARN, INFO, DEBUG |
INFO | FATAL, ERROR, WARN, INFO |
WARN | FATAL, ERROR, WARN (default) |
ERROR | FATAL, ERROR |
FATAL | FATAL |
OFF | None |
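The table follows a simple pattern: each level includes itself and every more severe level. A minimal sketch of that rule, with a hypothetical function name:

```python
# Levels ordered from most to least severe, as in the table.
LEVELS = ["FATAL", "ERROR", "WARN", "INFO", "DEBUG", "TRACE"]


def included_levels(configured_level: str) -> list:
    """Return the message levels written when a log level is configured."""
    if configured_level == "OFF":
        return []
    if configured_level == "ALL":
        return list(LEVELS)
    # A level includes itself and everything more severe than it.
    return LEVELS[: LEVELS.index(configured_level) + 1]


print(included_levels("WARN"))  # ['FATAL', 'ERROR', 'WARN']
```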
You can manage any of the log files yourself. That is, you can delete or archive them as necessary.
System logs are managed automatically in one or more of these ways:
- All log files are periodically added to a compressed file and moved to install_path/retired/. This occurs at least one time per day, but can also occur at other times. For example:
- Whenever you run the log_download script.
- Hourly, if the system instance's disk space is more than 60% full.
- At the optimum time for a specific service.
- When a log file grows larger than an optimum size, the system stops writing to that file, renames it, and begins writing to a new file. For example, if the file exampleService.log.0 grows to 10MB, it is renamed to exampleService.log.1 and the system creates a new file named exampleService.log.0 to write to.
- When an optimum number of log files for a specific service is reached, the system can overwrite the oldest file. For example, if a service is limited to 20 log files, when the file exampleService.log.19 is filled, the system overwrites the file named exampleService.log.0.
The tool log_download lets you easily retrieve logs and diagnostic information from all instances in the system. This tool is located at this path on each instance:
install_path/bin/log_download
For information about running the tool, use this command:
install_path/bin/log_download -h
- When using the tool log_download, if you specify the option --output, do not specify an output path that contains colons, spaces, or symbolic links. If you omit the option --output, you cannot run the script from within a folder path that contains colons, spaces, or symbolic links.
- When you run the script log_download, all log files are automatically compressed and moved to the folder install_path/retired/.
- If an instance is down, you need to specify the option --offline to collect the logs from that instance. If your whole system is down, you need to run the log_download script with the --offline option on each instance.
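Before running log_download, you could pre-check a candidate output path against the restrictions above. The following is a sketch under stated assumptions: the function name is hypothetical, it is not part of the tool, and it reflects one reading of the rule (no colons, no spaces, and no symbolic link anywhere in the path).

```python
import os


def is_valid_output_path(path: str) -> bool:
    """Return True if path has no colons, spaces, or symbolic links."""
    # Reject colons and spaces anywhere in the path string.
    if ":" in path or " " in path:
        return False
    # realpath() resolves symbolic links in every path component, so a
    # path free of symbolic links is unchanged by it.
    absolute = os.path.abspath(path)
    return os.path.realpath(absolute) == absolute
```

For example, `is_valid_output_path("/tmp/my logs")` returns False because the path contains a space.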
By default, each service stores its logs in its own folder at this path:
install_path/log
This table shows the default log folder names for each service. Depending on how your system was configured when first deployed, your system's logs might not be stored in these directories.
Default log folder name | Related service | Contains information about |
com.hds.ensemble.plugins.service.adminApp | Admin-App | The System Management application. |
com.hds.ensemble.plugins.service.cassandra | Database |
|
com.hds.ensemble.plugins.service.chronos | Scheduling | Workflow task scheduling. |
com.hds.ensemble.plugins.service.elasticsearch | Metrics | The storage and indexing of:
|
com.hds.ensemble.plugins.service.haproxy | Network-Proxy | Network requests between instances. |
com.hds.ensemble.plugins.service.logstash | Logging | The transport of system events and workflow task metrics to the Metrics service. |
com.hds.ensemble.plugins.service.marathon | Service-Deployment | The deployment of high-level services across system instances. High-level services are the ones that you can move and configure, not the services grouped under System Services. |
com.hds.ensemble.plugins.service.mesosAgent | Cluster-Worker | Work ordered by the Cluster-Coordination service. |
com.hds.ensemble.plugins.service.mesosMaster | Cluster-Coordination | Hardware resource allocation. |
com.hds.ensemble.plugins.service.remoteAction | Watchdog | Internal system processes. |
com.hds.ensemble.plugins.service.sentinel | Sentinel | Internal system processes. |
com.hds.ensemble.plugins.service.solr | Index | Index collections and search indexes. |
com.hds.ensemble.plugins.service.watchdog | Watchdog | General diagnostic information. |
com.hds.ensemble.plugins.service.zookeeper | Synchronization | Coordination of actions and database operations across instances. |
com.hitachi.aspen.foundry.service.mapi.gateway | MAPI-Gateway | Management of API requests. |
com.hitachi.aspen.foundry.service.rabbitmq.server | Message Queue | Transmission of data between instances. |
com.hitachi.aspen.foundry.service.metadata.coordination | Metadata-Coordination | Transmission of system metadata between instances. |
com.hitachi.aspen.foundry.service.metadata.gateway | Metadata-Gateway | Transmission of system metadata between instances. |
com.hitachi.aspen.foundry.service.metadata.async.policy.engine | Policy-Engine | Updates. |
com.hitachi.aspen.foundry.service.clientacess.data | S3-Gateway | Transmission of S3 requests to endpoints. |
com.hitachi.aspen.foundry.service.jaeger.agent | Tracing-Agent | Usage of traces. |
com.hitachi.aspen.foundry.service.jaeger.collector | Tracing-Collector | Usage of tracing collections. |
com.hitachi.aspen.foundry.service.jaeger.query | Tracing-Query | Usage of tracing queries. |