What Information Do I Need to Gather to Allow GSC to Diagnose an HNAS Performance Problem
Content

Question

What Information Do I Need to Gather to Allow GSC to Diagnose an HNAS "Performance Problem"?

See also:

Environment

  • Hitachi Network Attached Storage (HNAS)
    • 3100/3200
    • 3080/3090
    • 4000 series

Answer

A standard set of HNAS diagnostics taken after the event is usually insufficient to diagnose the cause of an HNAS "performance problem."  If an HNAS system is having a performance problem, the high-level process is the following:

  • Understand what options are available for getting help with performance problems.
  • Verify that the system in question is not impacted by any of the common causes of performance problems.
  • Collect performance data from the Hitachi storage array.
  • Gather a performance-info-report (PIR) from the HNAS on the impacted file system(s).
  • Whilst the PIR is running, gather a short (~30 second) packet capture from one of the impacted clients.
  • Gather the additional required data.
  • Answer the performance problem questionnaire to provide the context for the problem.

Please note: it is important that the performance data described below, such as the PIR and packet capture, is collected while the performance problem is happening.  Data collected outside of the problem window will provide no insight into what is causing it.

What help can be provided for my performance problem?

In reality, there is no such thing as a "performance problem"; there are only "lack of capacity problems."  As such, there are three possible approaches that can be taken to address these:

  • Perform a "performance tuning exercise,"
  • Perform a "system sizing exercise,"
  • Request a product enhancement to the system so that it has increased capacity without additional hardware by making changes to hardware or software of the product.

Performance tuning

A performance tuning exercise is something that can be led by Hitachi GSC.  There are three things to bear in mind before going down this path:

  1. The purpose of a performance tuning exercise is simply to determine whether any changes could be made to the environment that would increase the capacity available from a given system.  Whether or not any such changes are considered feasible is a customer business decision.
  2. Since capacity is heavily dependent on the specific workload, GSC are unable to advise whether any particular load is within the expected capacity range of any particular system configuration (both hardware and system settings/configuration).
  3. As a result, GSC are unable to guarantee that an acceptable level of performance can be achieved for any particular workload on any particular hardware configuration.  (Or indeed whether any such configuration is even possible.)

The way this process proceeds is as follows:

  1. Performance data for the HNAS and associated storage is captured that covers a time period when the perceived problem is occurring.
  2. Hitachi GSC evaluate this data and see whether any recommendations can be made for changes to the system which may allow additional load to be sustained on the existing hardware with improved performance.

The changes recommended may be:

  • Simple configuration changes to the file/block storage system (usually bringing the system into a "best practice" state).  It is rare, however, that these changes will have a significant impact, and we would generally expect systems to be set up to best practice at install time.
  • Changes to the way the file/block storage is used, to try and achieve a more efficient use of the resources that are available.  These changes will usually require some data migration, moving from the sub-optimal configuration to a hopefully more efficient one.
  • Suggested changes to client behavior that would make more efficient use of the available storage resources.

We do not, however, want to propose too many changes at once, for the following reasons:

  • Some changes may actually cause the available capacity to decrease due to unanticipated workload-related factors,
  • It becomes difficult to tell which changes were useful and which were counter-productive and need to be reversed.

As a result, a performance tuning exercise is usually iterative in nature:

  • Data is gathered and analyzed,
  • Suggested changes are proposed and applied,
  • New data is gathered and analyzed to see whether additional changes may be worthwhile.

The performance tuning loop finishes when:

  1. An acceptable level of performance is achieved,
  2. The customer no longer wishes to apply suggested changes,
  3. There are no further suggested changes.

In the event of 2 & 3 this may mean that a system sizing exercise or product enhancement request is then required.

System sizing

System sizing would be carried out by your Hitachi account team and/or Hitachi GSS, and there are two aspects to this:

  1. Understanding what load the system is required to sustain (including peak loads) and including margin for growth,
  2. Determining which system configuration(s) would be suitable for handling those loads.

Once this has been carried out and the necessary additional capacity has been provisioned, you can plan and implement a migration from the old capacity to the new.  This approach is particularly suitable for customers who want "one set of changes that is going to resolve my problem."

Product enhancements

Under certain circumstances it may be possible to "tune" the hardware/software of a product so that it can accommodate a greater load without requiring any additional hardware resources.  The process for asking whether any such "tuning" may be possible is to raise a Product Enhancement Request (PER).

An example of a product enhancement might be changing the system so that it can handle additional load before it exhausts the available CPU capacity.  Within Hitachi, product enhancements are considered a sales rather than a support function and should be requested through your account team.

Suggested product enhancements are reviewed by product management and, if deemed reasonable, are added to the engineering backlog for possible scheduling and implementation.  In general the lead time for product enhancement requests is quite long, and requests may be rejected if they are not considered suitable.

Verify system not impacted by common causes of performance problems

Common causes of HNAS performance problems are documented in the article What Are Common Causes of Performance Problems in HNAS Systems?  Before escalating to Hitachi, you should identify and resolve any of the listed problems.

Storage Performance Data Collection

Kick off performance data collection from the Hitachi storage array for 60 minutes at 1-minute intervals (below are the links with detailed instructions for Hitachi Midrange and Enterprise Storage):

DF subsystems support Open Systems applications.  Performance Monitor is not constantly running on DF subsystems.

Below are links for collecting the required data:

HNAS Performance Data Collection

While the above storage data collection is happening, kick off a 10-minute Performance Information Report (PIR) on the HNAS cluster, specifying the file system that is currently not performing as expected:

See also How To: Collect a PIR if HNAS is Not Configured to Send via Email.
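
For illustration only, the command below is a minimal sketch of starting a focused PIR from the HNAS CLI.  It assumes the impacted file system is named "fs1" (a placeholder) and that the default 10-minute report length is acceptable; check the performance-info-report man page on your firmware version for the exact options.

    # Start a default-length (10 minute) PIR, focused on the impacted file system
    # (-f focuses the report on the named file system; "fs1" is a placeholder)
    performance-info-report -f fs1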

Packet Capture on Impacted Client

Whilst the PIR above is being collected please gather a short (~30 second) packet capture on an impacted client as per the guidance in How to Collect Packet Captures for Troubleshooting HNAS Problems.
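
As an illustrative sketch only (the linked article remains the authoritative guidance): on a Linux client you could gather such a capture with tcpdump, assuming "eth0" is the client's network interface and 10.0.0.50 is the EVS IP address (both placeholders).

    # Capture ~30 seconds of traffic to/from the EVS, with full (untruncated) packets
    # -s 0 : capture entire packets, -w : write a Wireshark-readable capture file
    sudo timeout 30 tcpdump -i eth0 -s 0 -w hnas_client_capture.pcap host 10.0.0.50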

Please also provide:

  • The IP address of the client the capture was taken from,
  • The IP addresses of the EVS(es) that the client should be accessing,
  • A description of what operations were being undertaken on the client whilst the packet capture was being gathered.

Additional Required Data Collection

After the PFM (storage) and PIR (HNAS) data collections have completed, gather a simple trace (Midrange) or dump (Enterprise) from the array, plus HNAS diagnostics, and upload everything collected to TUF:

Performance Problem Questionnaire

Once the performance data collection is underway, please look at and provide answers for the HNAS Performance Issues Questionnaire:

Additional Notes

Which file system should I focus the PIR on?

If the file system to focus on is not obvious from the context of the problem, try to "focus" the PIR (-f switch) on the busiest file system on the impacted EVS or storage pool (span).  You can determine the busy file systems using the process described in:

How To: Determine Busy File Systems (HNAS)

How can I identify the busiest clients using the HNAS?

Please refer to the knowledgebase article:

What if my performance problem is intermittent?

If your performance problem is intermittent then we recommend using the HNAS crontab CLI command to start a PIR on the impacted file system at 00, 15, 30 and 45 minutes past every hour.  You can then collect the PIRs and, when the problem recurs, send the PIR covering the time period in question to GSC.  A default-length "10 minute" PIR takes approximately 13 minutes to run, so starting one every 15 minutes means the previous one will have completed whilst still giving good coverage.  A hedged example schedule is shown below.
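
As an illustration only, and assuming the HNAS crontab command accepts a standard cron-style schedule (check the crontab man page on your firmware for the exact syntax), such an entry might look like the following, with "fs1" as a placeholder file system name:

    # Start a focused PIR at 00, 15, 30 and 45 minutes past every hour
    # (standard cron field order: minute hour day-of-month month day-of-week)
    0,15,30,45 * * * * performance-info-report -f fs1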

If you are using HNAS firmware 12.5 or later then you may also be able to use "continuous PIR" - see the performance-info-report HNAS CLI command man page for additional details.

What if a particular HNAS event seems to mark the start of the performance problem?

If a particular HNAS event seems to mark the start of the performance problem then it may be useful to trigger the start of a PIR when that event occurs.  The procedure for doing that is documented in:

Performance Data Collection

 

Attachments
Service Partner Notes

Internal Notes

This section is only visible to Hitachi Employees.

Example Escalations

Before escalating an HNAS performance case to ES, the following must have been completed:

  • You should have developed a good problem statement of what the issue is, using the answers to the questions in the HNAS Performance Issues Questionnaire.  Specifically, it should be clear in any escalation:
    • What the symptoms are and any related error messages.
    • Which shares/exports on the HNAS are impacted.
    • Which EVSes are impacted.
    • Which HNAS nodes are currently hosting those EVSes.
    • Which file systems are impacted.
    • Which storage pools (spans) are impacted.
    • Which system drives are in the impacted span and which storage array those system drives come from.
    • The timings of when the problem was first noticed and when it has been seen.
    • Any other information which is deemed useful and related to the problem.
  • For the time periods when the customer said the problem was occurring, you should have reviewed the HNAS eventlog and debug logs on the impacted nodes to see if there are any error messages that may be related to the problem and which might suggest a resolution.
    • Examples of any error messages which you think might be related should be included in any escalation, with a description of why you think they might be related and which files those error messages were taken from.
    • You should have searched for those error messages in the knowledgebase to see if there are any known resolutions - include the searches you used in your escalation description.
    • If you have access to wroggler you should also have searched that to see if there are any known resolutions - include the searches you used in your escalation description.
  • You should have verified that the problem the customer is reporting is not caused by any of the common causes of performance problems in HNAS systems.  If it is, you should have worked with the customer to address those issues first, and only escalate for further assistance if the problem is still not resolved.
  • You should have worked with the customer to gather the following data:
    • A single PIR that covers a time period when the performance problem was occurring.  If there are HNAS events or debug log messages related to the problem, you should verify that these messages occurred during the duration of the PIR.  The PIR should be focused on one of the file systems that the customer reports is having problems.
    • A single short packet capture (~30 seconds) taken from an impacted client, readable in Wireshark, that doesn't have truncated packet contents, that was taken when the above PIR was running.  Also the following information about the capture should be included:
      • The IP address of the client the capture was taken from,
      • The IP addresses of the EVS(es) that the client should be accessing,
      • A description of what operations were being undertaken on the client whilst the packet capture was being gathered.
    • Performance data etc. for the storage array which is hosting the system drives in the storage pools that the customer is reporting are impacted.  This performance data must cover the time period covered by the above PIR.
  • If you have access to dportal you should upload the PIR and packet capture to the same dportal entry.
Employee Notes
Support Center Notes
CXone Metadata

Tags: Diagnosis,Q&A,hnas,Performance

PageID: 2907