What Information Do I Need to Gather to Allow GSC to Diagnose an HNAS Performance Problem
Content

Question

What Information Do I Need to Gather to Allow GSC to Diagnose an HNAS "Performance Problem"?

See also:

Environment

  • Hitachi Network Attached Storage (HNAS)
    • 3100/3200
    • 3080/3090
    • 4000 series

Answer

A standard set of HNAS diagnostics taken after the event is usually insufficient to diagnose the cause of an HNAS "performance problem."  If an HNAS system is having a performance problem, the high-level process is the following:

  • Understand what options are available for getting help with performance problems.
  • Verify that the system in question is not impacted by any of the common causes of performance problems.
  • Collect performance data from the Hitachi storage array.
  • Gather a performance-info-report (PIR) from the HNAS on the impacted file system(s).
  • Whilst the PIR is running, gather a short (~30 second) packet capture from one of the impacted clients.
  • Gather the additional required data.
  • Answer the performance problem questionnaire to provide the context for the problem.

Please note: it is important that the performance data described below, such as the PIR and packet capture, is collected while the performance problem is happening.  Data collected outside of the problem window will provide no insight into what is causing it.

What help can be provided for my performance problem?

In reality, there is no such thing as a "performance problem"; there are only "lack of capacity problems."  As such, there are three possible approaches that can be taken to address these:

  • Perform a "performance tuning exercise,"
  • Perform a "system sizing exercise,"
  • Request a product enhancement to the system so that it has increased capacity without additional hardware by making changes to hardware or software of the product.

Performance tuning

A performance tuning exercise is something that can be led by Hitachi GSC.  There are three things to bear in mind before going down this path:

  1. The purpose of a performance tuning exercise is simply to determine whether any changes could be made to the environment that would increase the capacity available from a given system.  Whether or not any such changes are considered feasible is a customer business decision.
  2. Since capacity is heavily dependent on the specific workload, GSC are unable to advise whether any particular load is within the expected capacity range of any particular system configuration (both hardware and system settings/configuration).
  3. As a result, GSC are unable to guarantee that an acceptable level of performance can be achieved for any particular workload on any particular hardware configuration.  (Or indeed whether any such configuration is even possible.)

The way this process proceeds is as follows:

  1. Performance data for the HNAS and associated storage is captured that covers a time period when the perceived problem is occurring.
  2. Hitachi GSC evaluate this data and see whether any recommendations can be made for changes to the system which may allow additional load to be sustained on the existing hardware with improved performance.

The changes recommended may be:

  • Simple configuration changes to the file/block storage system (usually bringing the system into a "best practice" state).  It is rare, however, that these changes will have a significant impact, and we would generally expect systems to be set up to best practice at install time.
  • Changes to the way the file/block storage is used, to try and achieve a more efficient use of the resources that are available.  These changes will usually require some data migration, moving from the sub-optimal configuration to a hopefully more efficient one.
  • Suggested changes to client behavior that would make more efficient use of the available storage resources.

We do not, however, want to propose too many changes at once, for the following reasons:

  • Some changes may actually cause the available capacity to decrease due to unanticipated workload-related factors,
  • It becomes difficult to tell which changes were useful and which were counter-productive and need to be reversed.

As a result, a performance tuning exercise is usually iterative in nature:

  • Data is gathered and analyzed,
  • Suggested changes are proposed and applied,
  • New data is gathered and analyzed to see whether additional changes may be worthwhile.

The performance tuning loop finishes when:

  1. An acceptable level of performance is achieved,
  2. The customer no longer wishes to apply suggested changes,
  3. There are no further suggested changes.

In the event of 2 & 3 this may mean that a system sizing exercise or product enhancement request is then required.

System sizing

System sizing would be carried out by your Hitachi account team and/or Hitachi GSS, and there are two aspects to this:

  1. Understanding what load the system is required to sustain (including peak loads) and including margin for growth,
  2. Determining which system configuration(s) would be suitable for handling those loads.

Once this has been carried out and the necessary additional capacity has been provisioned, you can plan and implement a migration from the old capacity to the new.  This approach is particularly suitable for customers who want "one set of changes that is going to resolve my problem."

Product enhancements

Under certain circumstances it may be possible to "tune" the hardware/software of a product so that it can accommodate a greater load without requiring any additional hardware resources.  The process for asking whether any such "tuning" may be possible is to raise a Product Enhancement Request (PER).

An example of a product enhancement might be changing the system so that it can handle additional load before it exhausts the available CPU capacity.  Within Hitachi, product enhancements are considered a sales rather than a support function and should be requested through your account team.

Suggested product enhancements are reviewed by product management and, if deemed reasonable, are added to the engineering backlog for possible scheduling and implementation.  In general the lead time for product enhancement requests is quite long, and requests may be rejected if they are not considered suitable.

Verify system not impacted by common causes of performance problems

Common causes of HNAS performance problems are documented in the article What Are Common Causes of Performance Problems in HNAS Systems?  Before escalating to Hitachi, you should identify and resolve any of the listed problems.

Storage Performance Data Collection

Kick off performance data collection from the Hitachi storage array for 60 minutes at 1-minute intervals (below are the links with detailed instructions for Hitachi Midrange and Enterprise Storage):

DF subsystems support Open Systems applications.  Performance Monitor is not constantly running on DF subsystems.

Below are links for collecting the required data:

HNAS Performance Data Collection

While the above storage data collection is happening, kick off a 10-minute Performance Information Report (PIR) on the HNAS cluster, specifying the file system that is currently not performing as expected:

See also How To: Collect a PIR if HNAS is Not Configured to Send via Email.
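
For illustration only, the command below is a minimal sketch of starting a focused PIR from the HNAS CLI.  It assumes the impacted file system is named "fs1" (a placeholder) and that the default 10-minute report length is acceptable; check the performance-info-report man page on your firmware version for the exact options.

    # Start a default-length (10 minute) PIR, focused on the impacted file system
    # (-f focuses the report on the named file system; "fs1" is a placeholder)
    performance-info-report -f fs1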

Packet Capture on Impacted Client

Whilst the PIR above is being collected please gather a short (~30 second) packet capture on an impacted client as per the guidance in How to Collect Packet Captures for Troubleshooting HNAS Problems.
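
As an illustrative sketch only (the linked article remains the authoritative guidance): on a Linux client you could gather such a capture with tcpdump, assuming "eth0" is the client's network interface and 10.0.0.50 is the EVS IP address (both placeholders).

    # Capture ~30 seconds of traffic to/from the EVS, with full (untruncated) packets
    # -s 0 : capture entire packets, -w : write a Wireshark-readable capture file
    sudo timeout 30 tcpdump -i eth0 -s 0 -w hnas_client_capture.pcap host 10.0.0.50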

Please also provide:

  • The IP address of the client the capture was taken from,
  • The IP addresses of the EVS(es) that the client should be accessing,
  • A description of what operations were being undertaken on the client whilst the packet capture was being gathered.

Additional Required Data Collection

After the PFM (storage) and PIR (HNAS) data collections have completed, gather a simple trace (Midrange) or dump (Enterprise) from the array, plus HNAS diagnostics, and upload everything collected to TUF:

Performance Problem Questionnaire

Once the performance data collection is underway, please look at and provide answers for the HNAS Performance Issues Questionnaire:

Additional Notes

Which file system should I focus the PIR on?

If the file system to focus on is not obvious from the context of the problem, try to "focus" the PIR (-f switch) on the busiest file system on the impacted EVS or storage pool (span).  You can determine the busy file systems using the process described in:

How To: Determine Busy File Systems (HNAS)

How can I identify the busiest clients using the HNAS?

Please refer to the knowledgebase article:

What if my performance problem is intermittent?

If your performance problem is intermittent then we recommend using the HNAS crontab CLI command to start a PIR on the impacted file system at 00, 15, 30 and 45 minutes past every hour.  You can then collect the PIRs and, when the problem recurs, send the PIR covering the time period in question to GSC.  A default-length "10 minute" PIR takes approximately 13 minutes to run, so starting one every 15 minutes means the previous one will have completed whilst still giving good coverage.  A hedged example schedule is shown below.
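
As an illustration only, and assuming the HNAS crontab command accepts a standard cron-style schedule (check the crontab man page on your firmware for the exact syntax), such an entry might look like the following, with "fs1" as a placeholder file system name:

    # Start a focused PIR at 00, 15, 30 and 45 minutes past every hour
    # (standard cron field order: minute hour day-of-month month day-of-week)
    0,15,30,45 * * * * performance-info-report -f fs1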

If you are using HNAS firmware 12.5 or later then you may also be able to use "continuous PIR" - see the performance-info-report HNAS CLI command man page for additional details.

What if a particular HNAS event seems to mark the start of the performance problem?

If a particular HNAS event seems to mark the start of the performance problem then it may be useful to trigger the start of a PIR when that event occurs.  The procedure for doing that is documented in:

Performance Data Collection

 

Attachments
Service Partner Notes

Internal Notes

This section is only visible to Hitachi Employees.

Example Escalations

Before escalating an HNAS performance case to ES, the following must have been completed:

  • You should have developed a good problem statement of what the issue is, using the answers to the questions in the HNAS Performance Issues Questionnaire.  Specifically, it should be clear in any escalation:
    • What the symptoms are and any related error messages.
    • Which shares/exports on the HNAS are impacted.
    • Which EVSes are impacted.
    • Which HNAS nodes are currently hosting those EVSes.
    • Which file systems are impacted.
    • Which storage pools (spans) are impacted.
    • Which system drives are in the impacted span and which storage array those system drives come from.
    • The timings of when the problem was first noticed and when it has been seen.
    • Any other information which is deemed useful and related to the problem.
  • For the time periods when the customer said the problem was occurring, you should have reviewed the HNAS eventlog and debug logs on the impacted nodes to see if there are any error messages that may be related to the problem and which might suggest a resolution.
    • Examples of any error messages which you think might be related should be included in any escalation, with a description of why you think they might be related and which files those error messages were taken from.
    • You should have searched for those error messages in the knowledgebase to see if there are any known resolutions - include the searches you used in your escalation description.
    • If you have access to wroggler you should also have searched that to see if there are any known resolutions - include the searches you used in your escalation description.
  • You should have verified that the problem the customer is reporting is not caused by any of the common causes of performance problems in HNAS systems.  If it is, you should have worked with the customer to address those issues first, and only escalate for further assistance if the problem is still not resolved.
  • You should have worked with the customer to gather the following data:
    • A single PIR that covers a time period when the performance problem was occurring.  If there are HNAS events or debug log messages related to the problem, you should verify that these messages occurred during the duration of the PIR.  The PIR should be focused on one of the file systems that the customer reports is having problems.
    • A single short packet capture (~30 seconds) taken from an impacted client, readable in Wireshark, that doesn't have truncated packet contents, that was taken when the above PIR was running.  Also the following information about the capture should be included:
      • The IP address of the client the capture was taken from,
      • The IP addresses of the EVS(es) that the client should be accessing,
      • A description of what operations were being undertaken on the client whilst the packet capture was being gathered.
    • Performance data etc. for the storage array which is hosting the system drives in the storage pools that the customer is reporting are impacted.  This performance data must cover the time period covered by the above PIR.
  • If you have access to dportal you should upload the PIR and packet capture to the same dportal entry.
Employee Notes
Support Center Notes
CXone Metadata

Tags: Diagnosis,Q&A,hnas,Performance

PageID: 2907