Monitoring

Your system provides several mechanisms for monitoring the health and performance of the system and all of its instances and services.

Monitoring instances

You can use the Admin App, CLI, and REST API to view a list of all instances in the system.

Viewing all instances

To view all instances, in the Admin App, click Dashboard > Instances.

The page shows all instances in the system. Each instance is identified by its IP address.

This table describes the information shown for each instance.

Property | Description
State |
  • Up: The instance is reachable by other instances in the system.
  • Down: The instance cannot be reached by other instances in the system.
Services | The number of services running on the instance.
Service Units | The total number of service units for all services and job types running on the instance, out of the best-practice service unit limit for the instance.

An instance with a higher number of service units is likely to be more heavily used by the system than an instance with a lower number of service units.

The Instances page displays a blue bar for instances running less than the best-practice service unit limit.

The Instances page displays a red bar for instances running more than the best-practice service unit limit.

Load Average | The load averages for the instance for the past one, five, and ten minutes.
CPU | The sum of the percentage utilization for each CPU core in the instance.
Memory Allocated | This section shows both:
  • The amount of RAM on the instance that's allocated to all services running on that instance.
  • The allocated RAM as a percentage of the total RAM for the instance.
Memory Total | The total amount of RAM for the instance.
Disk Used | The current amount of disk space that your system is using in the partition on which it is installed.
Disk Free | The amount of free disk space in the partition on which your system is installed.

Viewing the services running on an instance

To view the services running on an individual instance, in the Admin App:

Procedure

  1. Click Dashboard > Instances.

  2. Select the instance you want.

    The page lists all services running on the instance.

    For each service, the page shows:

    • The service name
    • The service state:
      • Healthy: The service is running normally.
      • Unconfigured: The service has yet to be configured and deployed.
      • Deploying: The system is currently starting or restarting the service. This can happen when:
        • You move the service to run on a completely different set of instances.
        • You repair a service.
      • Balancing: The service is running normally, but performing background maintenance.
      • Under-protected: In a multi-instance system, one or more of the instances on which a service is configured to run are offline.
      • Failed: The service is not running or the system cannot communicate with the service.
    • CPU Usage: The current percentage CPU usage for the service across all instances on which it's running.
    • Memory: The current RAM usage for the service across all instances on which it's running.
    • Disk Used: The current total amount of disk space that the service is using across all instances on which it's running.

Related CLI commands

getInstance

listInstances

Related REST API methods

GET /instances

GET /instances/{uuid}

You can get help on specific REST API methods for the Admin App at REST API - Admin.
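
As a rough illustration of the REST methods listed above, the following Python sketch builds (but does not send) requests for GET /instances and GET /instances/{uuid}. The base URL, the bearer-token Authorization header, and the helper name build_instance_request are assumptions made for this example; see REST API - Admin for the actual host, port, and authentication scheme.

```python
import urllib.request

# Hypothetical base URL; substitute the address of your own Admin App API.
BASE_URL = "https://admin.example.com/api/admin"


def build_instance_request(token, uuid=None):
    """Build a GET request for all instances, or for one instance by UUID."""
    path = "/instances" if uuid is None else f"/instances/{uuid}"
    return urllib.request.Request(
        BASE_URL + path,
        method="GET",
        headers={
            "Accept": "application/json",
            # Assumed bearer-token scheme; your system may use something else.
            "Authorization": f"Bearer {token}",
        },
    )


if __name__ == "__main__":
    # Only prints the request line; nothing is sent over the network.
    req = build_instance_request("my-token")
    print(req.get_method(), req.full_url)
```

To send a request built this way, you could pass it to urllib.request.urlopen once the URL and credentials match your system.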

Monitoring services

You can use the Admin App, CLI, and REST API to view the status of all services for the system.

Viewing all services

To view the status of all services, in the Admin App, click Services.

For each service, the page shows:

  • The service name
  • The service state:
    • Healthy: The service is running normally.
    • Unconfigured: The service has yet to be configured and deployed.
    • Deploying: The system is currently starting or restarting the service. This can happen when:
      • You move the service to run on a completely different set of instances.
      • You repair a service.
    • Balancing: The service is running normally, but performing some background maintenance operations.
    • Under-protected: In a multi-instance system, one or more of the instances on which a service is configured to run are offline.
    • Failed: The service is not running or the system cannot communicate with the service.
  • CPU Usage: The current percentage CPU usage for the service across all instances on which it's running.
  • Memory: The current RAM usage for the service across all instances on which it's running.
  • Disk Used: The current total amount of disk space that the service is using across all instances on which it's running.

Viewing individual service status

To view the detailed status for an individual service, select the service on the Services page.

In addition to the status information, the page shows:

  • Instances: A list of all instances on which the service is running.
  • Volumes: To view a list of volumes used by the service, select the row for an instance in the Instances section.
  • Network: [Internal|External]: Which network type this service uses to receive communications.

    This section also displays a list of the ports that the service uses.

  • Configuration settings: The settings you can configure for the service.
  • Service Units: The total number of service units currently being spent to run this service. This value is equal to the service's service unit cost times the number of instances on which the service is running.
  • Service unit cost: The number of service units required to run the service on one instance.
  • Service Instance Types: For services that have multiple types, the types that are currently running.
  • Instance Pool: For floating services, the instances that this service is eligible to run on.
  • Events: A list of all system events for the service.
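
The Service Units value described above is a simple product. A minimal sketch of that arithmetic, using a hypothetical function name and made-up numbers:

```python
def service_units_spent(unit_cost, instance_count):
    """Service Units = service unit cost x number of instances running the service."""
    if unit_cost < 0 or instance_count < 0:
        raise ValueError("unit cost and instance count must be non-negative")
    return unit_cost * instance_count


# Made-up example: a service that costs 10 service units and runs on 3 instances.
print(service_units_spent(10, 3))  # 30
```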

Related CLI commands

getService

listServices

Related REST API methods

GET /services

GET /services/{id}

You can get help on specific REST API methods for the Admin App at REST API - Admin.

Monitoring processes

The Processes page lets you view information about what the system is doing. This includes any service operations you started and any internal maintenance processes the system needs to run.

Monitoring service operations

You can use the Admin App, CLI, and REST API to monitor all service operations. These include:

  • The initial deployments of services when the system was installed.
  • Service relocations that you begin.

For each one, the system shows:

  • The name of the service involved
  • The status of the operation
  • The number of steps completed out of the total number of steps

Admin App instructions

Procedure

  1. Select the Processes window.

Results

The Service Operations tab shows information about in-progress and completed service operations.

Related CLI commands

listSystemTasks

getSystemTask

Related REST API methods

GET /tasks/system

GET /tasks/system/{uuid}

You can get help on specific REST API methods for the Admin App at REST API - Admin.

Monitoring system processes

You can use the Admin App, REST API, and CLI to view the progress of internal system processes. These include package installation tasks and regularly scheduled system maintenance activities such as log rotation.

For each process, your system shows:

  • The process name
  • The process state
  • The times at which each step of the process occurred

Note: System processes have a type of SCHEDULED or ONE-TIME.

Admin App instructions

Procedure

  1. In the Admin App, select Processes.

  2. To view the currently running processes, select the System tab.

  3. To view the scheduled processes, select the Scheduled tab.

Related CLI commands

listSystemTasks

getSystemTask

Related REST API methods

GET /tasks/system

GET /tasks/system/{uuid}

You can get help on specific REST API methods for the Admin App at REST API - Admin.

System events

Your system maintains a log of system events that you can view through the Admin App, CLI, and REST API.

Admin App instructions

Procedure

  1. To view all system events, in the Admin App, click Events.

Related CLI commands

queryEvents

To view events through the CLI, your requests need to specify which events you want to retrieve.

For example, this JSON request body searches the event log for all events that have a severity level of warning:

{
  "severities": [
    "warning"
  ]
}

Related REST API methods

POST /events

To view events through the REST API, your requests need to specify which events you want to retrieve.

For example, this JSON request body searches the event log for all events that have a severity level of warning:

{
  "severities": [
    "warning"
  ]
}

You can get help on specific REST API methods for the Admin App at REST API - Admin.
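
The following Python sketch shows one way the request body above could be attached to a POST /events request. It only builds the request and does not send it. The base URL, the bearer-token header, and the function name build_event_query are assumptions for illustration, not part of the documented API.

```python
import json
import urllib.request

# Hypothetical base URL; replace with the address of your own system.
BASE_URL = "https://admin.example.com/api/admin"


def build_event_query(severities, token):
    """Build a POST /events request that filters events by severity."""
    body = json.dumps({"severities": list(severities)}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/events",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed authentication scheme.
            "Authorization": f"Bearer {token}",
        },
    )


if __name__ == "__main__":
    req = build_event_query(["warning"], "my-token")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```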

HCP for cloud scale events

Most events are generated by and reported through the System Management application.

Events specific to HCP for cloud scale are reported with the IDs 6006 (informational), 6007 (warning), and 6008 (severe). These events are:

ID | Severity | Message | Description
6006 | INFO | Service Information: Job Configuration 'id' of type 'job_type' updated with status 'status' | The policy configuration has changed and the status of job id of type job_type is one of the following:
  • ENABLED
  • DISABLED
6006 | INFO | Service Information: Job Configuration 'id' started | Policy configuration for job id has started.
6006 | INFO | Service Information: Lifecycle policy {CREATE|UPDATE|DELETE} bucket 'bucket_name' | The lifecycle policy for bucket bucket_name has been either created, updated, or deleted.
6006 | INFO | Service Information: Lifecycle policy deleted for bucket 'bucket_name' | The lifecycle policy for bucket bucket_name has been deleted.
6006 | INFO | Service Information: Serial number updated to value | The HCP for cloud scale serial number has been changed to value.
6006 | INFO | Service Information: setting was set to value | The S3 setting setting has been changed to value.
6006 | INFO | Service Warning: Storage component 'id' created | The storage component id has been created.
6006 | INFO | Service Warning: Storage component 'id' is now state | The storage component id is in one of the following states:
  • ACTIVE
  • INACTIVE
  • UNVERIFIED
6006 | INFO | Service Warning: Storage component 'id' updated: configuration | The storage component id has been updated. configuration lists the changes.
6007 | WARNING | Service Warning: Certificate for Storage component 'id' is about to expire in 'n' days | The SSL certificate for the storage component id is set to expire in n days. If the certificate expires, HCP for cloud scale will not be able to read from or write to the storage component.
6007 | WARNING | Service Warning: Storage component 'id' is now INACCESSIBLE | The storage component id is inaccessible. HCP for cloud scale cannot read from or write to the storage component.
6008 | SEVERE | Service Error: Storage Component Certificate Expired. | The SSL certificate for a storage component has expired. HCP for cloud scale cannot read from or write to the storage component.
6008 | SEVERE | Service Error: Vault Connection Issue | The active vault node can't be reached.
6008 | SEVERE | Service Error: Vault Service Completely Sealed | The vault service (Key Management Server service) is completely sealed. Unseal it using the unseal keys you obtained when you turned on encryption.
6008 | SEVERE | Service Error: Vault Service Node Error. IP: ip_address | One of the vault nodes can't be reached. If other active nodes are available, service continues, but attend to this issue immediately.
6008 | SEVERE | Service Error: Vault Service Node Sealed. IP: ip_address | One of the vault nodes is sealed. If other active nodes are available, service continues, but attend to this issue immediately. Unseal it using the unseal keys you obtained when you turned on encryption.

Alerts

Alert messages notify you of situations that need your attention. Alerts can have a severity of Info, Warning, Severe, or Critical. You can view system alerts through the Admin App, CLI, or REST API, and storage component alerts through the Object Storage Management app.

Each alert corresponds to a system event.

System alerts

Severity | Alert Description | Action
Severe | Instance ip-address disk usage severe threshold

The specified instance has less than 10% free disk space. Add additional storage to the instance.

Important: If an instance runs out of disk space, the system can become unresponsive.

Severe | Master Instance ip-address is down

Do one of these:

  • Restart the instance hardware or virtual machine.
  • Restart the script run on the instance. This script is located in the folder bin in the installation folder.

Severe | Service is down

Verify the health of your instances. If one is down, do one of these:

  • Restart the instance hardware or virtual machine.
  • Restart the script run on the instance. This script is located in the folder bin in the installation folder.

Otherwise, if your instances are healthy and the problem persists, contact Support.

Severe | Worker Instance ip-address is down

Do one of these:

  • Restart the instance hardware or virtual machine.
  • Restart the script run on the instance. This script is located in the folder bin in the installation folder.

Warning | Instance ip-address disk usage warning threshold

The specified instance has less than 25% free disk space. Add additional storage to the instance.

Important: If an instance runs out of disk space, the system can become unresponsive.

Warning | Package installation failed

Your system failed to install a package that you uploaded.

Warning | Service below recommendation

The service is currently running on fewer than the minimum number of instances. Configure this service to run on additional instances.

Warning | Service under-protected

A service has lost redundancy; that is, one or more instances on which that service is running are unresponsive.

Verify the health of your instances. If one is down, do one of these:

  • Restart the instance hardware or virtual machine.
  • Restart the script run on the instance. This script is located in the folder bin in the installation folder.

Otherwise, if your instances are healthy and the problem persists, contact Support.

Warning | SSL server certificate chain expires soon

A certificate in the SSL server certificate chain for this system expires soon. If the certificate chain expires, users can't access the system.

Warning | SSL server certificate chain expired

The SSL server certificate chain for this system contains an expired certificate. Users cannot access the system until the certificate chain is replaced.

Info | Package installation in progress

Your system is currently installing a package that you uploaded. Depending on the contents of the package, this might take a while.

Warning | The certificate for the storage component (storage-id) is about to expire in n days | Renew the storage component certificate.
Info | The storage component (storage-id) is unavailable | Verify that the storage component ID is correct and valid and that the storage component is active.
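
The disk usage alerts in the table above are tied to two free-space thresholds: less than 25% free (warning) and less than 10% free (severe). The Python sketch below mirrors those thresholds for a quick local check of a partition. It is an illustration only, not the system's own alerting logic; the function name and the "OK" label are assumptions.

```python
import shutil

WARNING_FREE = 0.25  # warning alert: less than 25% free disk space
SEVERE_FREE = 0.10   # severe alert: less than 10% free disk space


def classify_disk(free_bytes, total_bytes):
    """Map free disk space to the alert severity it would correspond to."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_fraction = free_bytes / total_bytes
    if free_fraction < SEVERE_FREE:
        return "SEVERE"
    if free_fraction < WARNING_FREE:
        return "WARNING"
    return "OK"  # hypothetical label for "no disk usage alert"


if __name__ == "__main__":
    # Check the partition that holds the root folder; adjust the path
    # to the partition on which your system is installed.
    usage = shutil.disk_usage("/")
    print(classify_disk(usage.free, usage.total))
```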

Storage component alerts

Severity | Message | Description
Warning | Certificate for Storage component id is about to expire in n days | The SSL certificate for the storage component id is set to expire in n days. If the certificate expires, HCP for cloud scale will not be able to read from or write to the storage component.
Warning | Storage component id is now inaccessible | The storage component id is in the state INACCESSIBLE. HCP for cloud scale cannot read from or write to the storage component.
Severe | Certificate for Storage component id expired | The SSL certificate for the storage component id has expired. HCP for cloud scale cannot read from or write to the storage component. Install a new certificate.
Severe | Error communicating with a vault node. Node IP: ip_address | One of the vault nodes can't be reached. If other active nodes are available, service continues, but attend to this issue immediately.

Examine the vault instance logs to determine the cause of this issue.

Severe | Failed to connect to KMS server | One of the vault nodes can't be reached. If other active nodes are available, service continues, but attend to this issue immediately.

If ingest is halted, then investigate why the KMS service is failing to run on all nodes. If ingest is still working, the original active node has failed over. Examine the vault instance logs to determine the cause of the failure.

Severe | Failed to connect to KMS server as it is completely sealed | The vault service (Key Management Server service) is completely sealed.

Unseal it using the unseal keys you obtained when you turned on encryption.

Severe | Service error: There is a critical issue with the Metadata Gateway database. Shutting down the Metadata Gateway Service.

A Metadata Gateway instance has encountered an issue and shut down. Use the System Management Services function Repair to restart it.

If restarting the service doesn't resolve the issue, contact Support.

Severe | Vault node is sealed. Node IP: ip_address | One of the vault nodes is sealed. If other active nodes are available, service continues, but attend to this issue immediately.

Unseal it using the unseal keys you obtained when you turned on encryption.

Critical | Failed to connect to KMS server | The Key Management System service is not available. Until the service is available, data on encrypted storage components can't be read or written.

When the KMS service restarts, if there is only one active instance, log in to HCP for cloud scale on port 8200 and provide unseal keys to reopen the vault.

Critical | Failed verification for retrieved encryption key for StorageComponent_ID{uuid=uuid} | The encryption key returned from the Key Management System server doesn't match the key for the storage component uuid.

Verify that the KMS service is available. If the service is available, verify that you have provided the service with a quorum of unseal keys. If objects on the storage component still can't be read, contact Support.

Critical | Metadata-Coordination cannot communicate with Sentinel service to get state information | The Sentinel service is not responding to requests for state information. Using the System Management application, immediately review the health of the Metadata-Coordination and Sentinel services and ensure that the Sentinel container has adequate heap size for the configuration of the cluster.

Viewing alerts

Admin App instructions

Procedure

  1. To view alerts, click the user icon in the top right corner of any Admin App page and then click Notifications.

Object Storage Management application instructions

The Object Storage Management application displays alerts about storage components. If an alert is raised, the alert icon turns red and displays a badge with the number of active alerts. For example:

Alert badge showing one active alert

Click the icon to display a window listing alert text.

Related CLI commands

listAlerts

Related REST API methods

GET /alerts

You can get help on specific REST API methods for the Admin App at REST API - Admin.

Related REST API methods (HCP for cloud scale)

POST /alert/list

For information about specific HCP for cloud scale API methods, see the MAPI Reference or, in the Object Storage Management application, click the profile icon and select REST API.

Email notification rules

For the system to send email notifications, you need to create a rule that specifies who to email, what email server to use, what events to send emails about, and what information to include in email messages.

SMTP settings

  • Enable: Turns on email notifications.
  • Host: The hostname or IP address of the email server.
  • Port: The port on which the email server listens for email messages.
  • Security: The security protocol used by the email server (SSL or STARTTLS) or None if the email server doesn’t use a security protocol.
  • Authenticated: Enable this if the email server needs authentication, then specify:
    • In the Username field, the username for an email account that’s authorized to establish the connection between the system and the email server.
    • In the Password field, the password for the email account.

Message settings

You use the email notification message settings to configure a template for formatting all email notifications sent by the system.

  • From: The email address from which you want email notifications to be sent.
  • Subject: The email subject.
  • Body: The email message body.

Message variables

This table lists the variables you can use in the email notification template. When the system sends an email notification, it replaces the variables in the notification with event-specific information.

Variable | Description
$severity | Event severity: INFO, WARNING, or SEVERE.
$subject | A short description of the event.
$message | Event message text.
$userName | Name of the user responsible for the event.
$objectId | Unique identifier for component affected by the event.
$subsystem | Category for the component affected by the event.
$objectSourceId | Unique identifier of the internal system component or process that was the source of the event. Value is [unknown] for most events.

Recipient settings

  • Email addresses: A comma-separated list of email addresses to send notification emails to.
  • Severity Filter: The event severities about which to send email notifications. Can be one or more of these: INFO, WARNING, SEVERE.
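
To illustrate how message variables are replaced with event-specific information, the sketch below uses Python's string.Template, which happens to use the same $name syntax. This is only an analogy: how the system performs the substitution internally is not described here, and the function name and event values are made up for the example.

```python
from string import Template


def render_notification(template_text, event):
    """Replace $variables in the template with values from the event.

    safe_substitute leaves any variable without a value unchanged
    instead of raising an error.
    """
    return Template(template_text).safe_substitute(event)


if __name__ == "__main__":
    # Made-up event values for illustration.
    event = {
        "severity": "WARNING",
        "subject": "Service under-protected",
        "message": "A service has lost redundancy",
        "userName": "admin",
    }
    body = "[$severity] $subject: $message (user: $userName, source: $objectSourceId)"
    print(render_notification(body, event))
```

Because the example event has no objectSourceId value, that variable is left as written in the rendered text.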

Creating email notification rules

Admin App instructions

Procedure

  1. Select the Configuration window.

  2. Click Notifications.

  3. Click Create.

  4. In the Type field, select Email.

  5. Type a name for the notification rule.

  6. Configure the SMTP settings and message settings for the notification rule.

  7. Specify a comma-separated list of email addresses to send notifications to.

  8. Click Create.

Related CLI commands

createNotificationRule

Related REST API methods

POST /notifications

You can get help on specific REST API methods for the Admin App at REST API - Admin.

Syslog notification rules

When you create a syslog notification rule, the system sends log messages to your syslog server for each applicable system event.

Syslog settings

  • Enable: Turns on syslog notifications
  • Host: The hostname or IP address of the syslog server
  • Port: The port on which the syslog server listens for log messages
  • Facility: Category for the messages sent by this notification rule

Message settings

You use the syslog notification message settings to configure a template for formatting all syslog notifications sent by this notification rule.

  • Message: The message to send. You can use these variables as part of the message:
    Variable | Description
    $severity | Event severity: INFO, WARNING, or SEVERE
    $subject | A short description of the event
    $message | Event message text
    $time | Time at which the event occurred
    $userName | Name of the user responsible for the event
    $subsystem | Category for the component affected by the event
    $objectId | Unique identifier for component affected by the event
    $objectType | The type of the component affected by the event.
    $objectSourceId | Unique identifier of the internal system component or process that was the source of the event. Value is [unknown] for most events.
    $objectSourceType | Type of the internal system component or process that was the source of the event. Value is [unknown] for most events.
  • Sender Identity: Identity of the sender for the event. Sent with every syslog message.

Severity filter

The event severities about which to send syslog notifications. Can be one or more of these: INFO, WARNING, or SEVERE.

Creating syslog notification rules

Admin App instructions

Procedure

  1. Select the Configuration window.

  2. Click Notifications.

  3. Click Create.

  4. In the Type field, select Syslog.

  5. Type a name for the notification rule.

  6. Configure the settings for the notification rule.

  7. Specify a severity filter for the notification rule.

  8. Click Create.

Related CLI commands

createNotificationRule

Related REST API methods

POST /notifications

You can get help on specific REST API methods for the Admin App at REST API - Admin.

Logs and diagnostic information

Each service maintains its own set of logs. By default, log files are maintained in the folder install_path/log on each instance in the system. During installation, you can configure each service to store its log files in a different, non-default location.

Log levels

This table lists the available log levels.

Note: Raising the log level (for example, from WARN to INFO) writes more data to the log file, so the file grows more rapidly. Lowering the log level (for example, from WARN to ERROR) writes less data to the log file, so the file grows more slowly.

Level | Levels included
ALL | FATAL, ERROR, WARN, INFO, DEBUG, TRACE
TRACE | FATAL, ERROR, WARN, INFO, DEBUG, TRACE
DEBUG | FATAL, ERROR, WARN, INFO, DEBUG
INFO | FATAL, ERROR, WARN, INFO
WARN | FATAL, ERROR, WARN (default)
ERROR | FATAL, ERROR
FATAL | FATAL
OFF | None
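
The table above can be read as a cumulative ordering: each level includes itself and every more severe level. The short Python sketch below encodes that reading. The function names are hypothetical and the code is not taken from the product.

```python
# Levels ordered from most severe to least severe, as in the table.
LEVELS = ["FATAL", "ERROR", "WARN", "INFO", "DEBUG", "TRACE"]


def levels_included(configured):
    """Return the message levels written when the log level is `configured`."""
    if configured == "OFF":
        return []
    if configured == "ALL":
        return list(LEVELS)
    # list.index raises ValueError for a level name that is not in the table.
    return LEVELS[: LEVELS.index(configured) + 1]


def is_logged(message_level, configured="WARN"):
    """Check whether a message is written; WARN is the default level."""
    return message_level in levels_included(configured)


if __name__ == "__main__":
    print(levels_included("INFO"))
    print(is_logged("DEBUG"))
```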

Log management

You can manage any of the log files yourself. That is, you can delete or archive them as necessary.

Caution: Deleting log files can make it more difficult for support personnel to resolve issues you might encounter.

System logs are managed automatically in one or more of these ways:

  • All log files are periodically added to a compressed file and moved to install_path/retired/. This occurs at least one time per day, but can also occur at other times. For example:
    • Whenever you run the log_download script.
    • Hourly, if the system instance's disk space is more than 60% full.
    • At the optimum time for a specific service.
  • When a log file grows larger than an optimum size, the system stops writing to that file, renames it, and begins writing to a new file. For example, if the file exampleService.log.0 grows to 10MB, it is renamed to exampleService.log.1 and the system creates a new file named exampleService.log.0 to write to.
  • When an optimum number of log files for a specific service is reached, the system can overwrite the oldest file. For example, if a service is limited to 20 log files, when the file exampleService.log.19 is filled, the system overwrites the file named exampleService.log.0.
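
Size-based rotation with a bounded number of files, as described in the last two items above, is a common logging pattern. As an analogy only, the sketch below reproduces it with Python's standard RotatingFileHandler. It is not the system's implementation, and Python's file naming differs slightly: the active file has no numeric suffix, and older files are named .1, .2, and so on. The sizes and counts are arbitrary demonstration values.

```python
import logging
import logging.handlers
import os
import tempfile


def make_rotating_logger(path, max_bytes, backup_count):
    """Create a logger that rolls over at max_bytes and keeps backup_count old files."""
    logger = logging.getLogger(f"rotation-demo-{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backup_count
    )
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "exampleService.log")
        # Tiny limits so rotation is visible after a few messages.
        logger = make_rotating_logger(log_path, max_bytes=200, backup_count=3)
        for i in range(50):
            logger.info("message %d with some padding text", i)
        logging.shutdown()  # close the log file before the folder is removed
        print(sorted(os.listdir(tmp)))
```

Once the backup count is reached, the oldest file is discarded on each rollover, so the total number of files stays bounded.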

Retrieving logs and diagnostic information

The tool log_download lets you easily retrieve logs and diagnostic information from all instances in the system. This tool is located at this path on each instance:

install_path/bin/log_download

For information about running the tool, use this command:

install_path/bin/log_download -h

Note
  • When using the tool log_download, if you specify the option --output, do not specify an output path that contains colons, spaces, or symbolic links. If you omit the option --output, you cannot run the script from within a folder path that contains colons, spaces, or symbolic links.
  • When you run the script log_download, all log files are automatically compressed and moved to the folder install_path/retired/.
  • If an instance is down, you need to specify the option --offline to collect the logs from that instance. If your whole system is down, you need to run the log_download script with the --offline option on each instance.
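
Before running log_download with the option --output, you might want to check a candidate output path against the restrictions in the note above. The following Python sketch is one way to do that, assuming POSIX-style paths. The function name is hypothetical and the check is not part of the log_download tool.

```python
import os


def is_valid_output_path(path):
    """Return True if the path has no colons, spaces, or symbolic links."""
    full = os.path.abspath(path)
    if ":" in full or " " in full:
        return False
    # Walk from the path up to the filesystem root, checking each component.
    current = full
    while True:
        if os.path.islink(current):
            return False
        parent = os.path.dirname(current)
        if parent == current:  # reached the root
            return True
        current = parent


if __name__ == "__main__":
    print(is_valid_output_path("/tmp/hcp logs"))  # False: the path contains a space
```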

Default log locations

By default, each service stores its logs in its own folder at this path:

install_path/log

This table shows the default log folder names for each service. Depending on how your system was configured when first deployed, your system's logs might not be stored in these directories.

Default log folder name | Related service | Contains information about
com.hds.ensemble.plugins.service.adminApp | Admin-App | The System Management application.
com.hds.ensemble.plugins.service.cassandra | Database |
  • System configuration data.
  • Document fields and values.
com.hds.ensemble.plugins.service.chronos | Scheduling | Workflow task scheduling.
com.hds.ensemble.plugins.service.elasticsearch | Metrics | The storage and indexing of:
  • System events
  • Performance and failure metrics for workflow tasks
com.hds.ensemble.plugins.service.haproxy | Network-Proxy | Network requests between instances.
com.hds.ensemble.plugins.service.logstash | Logging | The transport of system events and workflow task metrics to the Metrics service.
com.hds.ensemble.plugins.service.marathon | Service-Deployment | The deployment of high-level services across system instances. High-level services are the ones that you can move and configure, not the services grouped under System Services.
com.hds.ensemble.plugins.service.mesosAgent | Cluster-Worker | Work ordered by the Cluster-Coordination service.
com.hds.ensemble.plugins.service.mesosMaster | Cluster-Coordination | Hardware resource allocation.
com.hds.ensemble.plugins.service.remoteAction | Watchdog | Internal system processes.
com.hds.ensemble.plugins.service.sentinel | Sentinel | Internal system processes.
com.hds.ensemble.plugins.service.solr | Index | Index collections and search indexes.
com.hds.ensemble.plugins.service.watchdog | Watchdog | General diagnostic information.
com.hds.ensemble.plugins.service.zookeeper | Synchronization | Coordination of actions and database operations across instances.
com.hitachi.aspen.foundry.service.mapi.gateway | MAPI-Gateway | Management of API requests.
com.hitachi.aspen.foundry.service.rabbitmq.server | Message Queue | Transmission of data between instances.
com.hitachi.aspen.foundry.service.metadata.coordination | Metadata-Coordination | Transmission of system metadata between instances.
com.hitachi.aspen.foundry.service.metadata.gateway | Metadata-Gateway | Transmission of system metadata between instances.
com.hitachi.aspen.foundry.service.metadata.async.policy.engine | Policy-Engine | Updates.
com.hitachi.aspen.foundry.service.clientacess.data | S3-Gateway | Transmission of S3 requests to endpoints.
com.hitachi.aspen.foundry.service.jaeger.agent | Tracing-Agent | Usage of traces.
com.hitachi.aspen.foundry.service.jaeger.collector | Tracing-Collector | Usage of tracing collections.
com.hitachi.aspen.foundry.service.jaeger.query | Tracing-Query | Usage of tracing queries.

 
