Hitachi Vantara Knowledge

About capacity saving

 

Capacity saving function: data deduplication and compression

 

When the capacity saving function is in use, the controller of the storage system performs data deduplication and compression to reduce the size of data to be stored. Capacity saving can be enabled on DP-VOLs in Dynamic Provisioning pools. You can use the capacity saving function on internal flash drives only, including data stored on encrypted flash drives.

How capacity saving works

The capacity saving function includes deduplication and compression:

  • Deduplication

    The data deduplication function deletes duplicate copies of data written to different addresses in the same pool. Deduplication is enabled on the desired DP-VOLs in the pool. When deduplication is enabled, duplicate copies of data across the DP-VOLs assigned to that pool are removed.

    When you create DP-VOLs with the deduplication function enabled, the deduplication system data volumes (fingerprint) and the deduplication system data volumes (data store) are automatically created. The deduplication system data volumes (fingerprint) store the search table used to find deduplicated data. Four deduplication system data volumes (fingerprint) are created per pool. The deduplication system data volumes (data store) store the source data for duplicated data. Four deduplication system data volumes (data store) are created per pool. When deduplication is disabled for all DP-VOLs in a pool, all deduplication system data volumes in the pool are automatically deleted.

  • Compression

    The data compression function utilizes the LZ4 compression algorithm to compress the data. The compression function is also enabled per DP-VOL.

The following figure illustrates the capacity saving function.

[Figure: the capacity saving function]

When data is updated in a DP-VOL for which Compression is enabled, or when non-deduplicated data is updated in a DP-VOL for which Deduplication and Compression is enabled, the pre-update data in the storage area is no longer required. This kind of data is called garbage data. The used capacity of the pool increases until garbage collection reclaims the old data that is no longer required. The pool capacity that is eventually required is the sum of the physical data capacity after capacity saving plus the amount of metadata.

Note
  • The temporary area and the data storage area are not assigned fixed capacities. They share the pool and use the pool as needed.

The capacity overheads associated with the capacity saving function include the following:

  • Capacity consumed by metadata

    The capacity consumed by metadata for the capacity saving function (deduplication and compression) is approximately 3% of the consumed DP-VOL capacity that has been processed by capacity saving. For example, if the consumed capacity of a DP-VOL is 150 TB and the capacity saving feature has processed 100 TB of the 150 TB consumed capacity and reduced it to 30 TB, the capacity consumed by metadata for capacity saving function is approximately 3 TB (3% of 100 TB). The total consumed capacity of this DP-VOL at this instant is 83 TB (30 TB + 50 TB + 3 TB).

  • Capacity consumed by garbage (invalid) data

    The capacity consumed by garbage data is approximately 7% of the total consumed capacity of all DP-VOLs with capacity saving enabled. The capacity is dynamically consumed based on garbage data created by the capacity saving process and cleaned by the background garbage collection process. The garbage collection process is a background process with a lower priority than host I/O, so the capacity consumed by garbage data depends on both the garbage created and the host I/O rate.

The total capacity consumed by these overheads is about 10% (3% for metadata + 7% for garbage data) of the consumed capacity of DP-VOLs with capacity saving enabled. During periods of high write activity from the host, this capacity might increase over 10% temporarily, and then it returns to around 10% when host write activity decreases.
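The overhead arithmetic above can be sketched in a few lines of Python. This is only an illustration of the worked example (150 TB consumed, 100 TB processed and reduced to 30 TB); the function and constant names are hypothetical and are not part of any Hitachi tool, and the ratios are the approximate values quoted in the text.

```python
# Approximate overhead ratios quoted in the text above (assumed constant here).
METADATA_RATIO = 0.03  # of the consumed capacity already processed by capacity saving
GARBAGE_RATIO = 0.07   # of the total consumed capacity of capacity-saving-enabled DP-VOLs


def dpvol_consumed_tb(consumed_tb: float, processed_tb: float, reduced_tb: float) -> float:
    """Consumed capacity of a DP-VOL at one instant, following the worked example.

    consumed_tb:  capacity consumed before capacity saving
    processed_tb: part of consumed_tb already processed by capacity saving
    reduced_tb:   size of the processed part after capacity saving
    """
    if not 0 <= processed_tb <= consumed_tb:
        raise ValueError("processed_tb must be between 0 and consumed_tb")
    unprocessed_tb = consumed_tb - processed_tb
    metadata_tb = METADATA_RATIO * processed_tb
    return reduced_tb + unprocessed_tb + metadata_tb


def steady_state_overhead_tb(consumed_tb: float) -> float:
    """Approximate metadata plus garbage overhead (about 10%) outside write bursts."""
    return (METADATA_RATIO + GARBAGE_RATIO) * consumed_tb


if __name__ == "__main__":
    # Worked example from the text: 30 TB + 50 TB + 3 TB.
    print(dpvol_consumed_tb(150, 100, 30))
    print(steady_state_overhead_tb(100))
```

Because garbage data grows during periods of high host write activity, the 10% figure is better treated as a planning baseline than as an upper bound.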

Capacity saving processing for existing data

For data that already exists on a DP-VOL, compression and deduplication processing is performed asynchronously on the pages that store the data. This increases the free area of the pool and can reduce the cost of purchasing drives over time.

[Figure: applying capacity saving]
Capacity saving processing for new write data

The capacity saving mode of a DP-VOL (post-process mode or inline mode) determines how capacity saving is applied to new write data from the host:

  • Post-process mode

    When you apply capacity saving with the post-process mode to a DP-VOL, the compression and deduplication processing are performed asynchronously for new write data. Since capacity saving processing is not performed at the time the new data is written, the post-process mode can reduce the impact of capacity saving processing on I/O performance, but pool capacity is required to store the new write data until the capacity saving processing is performed.

    When you enable capacity saving on a DP-VOL using Device Manager - Storage Navigator, post-process mode is applied.

  • Inline mode (CCI only)

    When you apply capacity saving with the inline mode to a DP-VOL, the compression and deduplication processing are performed synchronously for new write data. The inline mode minimizes the pool capacity required to store new write data but can impact I/O performance more than the post-process mode. The inline mode should be applied when writing data with sequential I/Os, for example, when writing data to target volumes of data migration or secondary volumes of copy pairs. When the data migration or copy pair creation has completed, the mode should be changed from the inline mode to the post-process mode.

    If you want to use inline mode, you must use CCI (raidcom add ldev [-capacity_saving_mode <saving mode>] or raidcom modify ldev [-capacity_saving_mode <saving mode>]).

The following example illustrates how the pool used capacity changes over time when performing data migration. The red line shows the capacity when the post-process mode is applied, and the black line shows the capacity when the inline mode is applied. This example assumes that the writing speed (GB/h) for the new data is faster than the initial capacity saving processing (GB/h).

[Figure: change in pool used capacity over time for inline and post-process modes]

When the inline mode is applied, capacity saving processing is performed synchronously with the writing of data. When the post-process mode is applied, capacity saving processing is performed asynchronously, and a temporary storage area is required for the write data. The capacity required for the temporary storage area depends on the writing speed of the new data, or on the frequency of data updates during migration.
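To get a rough feel for the temporary storage area in post-process mode, the following sketch uses a deliberately simplified linear model. The model is an assumption made for illustration, not a documented formula: data arrives at a constant write rate, is processed at a constant capacity saving rate, and anything not yet processed sits in the temporary area. The function name is hypothetical.

```python
def temporary_area_gb(write_rate_gb_h: float, saving_rate_gb_h: float, hours: float) -> float:
    """Estimate of data written but not yet processed in post-process mode.

    Simplified linear model (an assumption): the backlog grows at the
    difference between the host write rate and the capacity saving
    processing rate, and never goes below zero.
    """
    if min(write_rate_gb_h, saving_rate_gb_h, hours) < 0:
        raise ValueError("rates and duration must be non-negative")
    backlog_rate_gb_h = max(write_rate_gb_h - saving_rate_gb_h, 0.0)
    return backlog_rate_gb_h * hours


if __name__ == "__main__":
    # Migration writing faster than capacity saving can process, as in the example above.
    print(temporary_area_gb(write_rate_gb_h=500, saving_rate_gb_h=300, hours=10))
```

In inline mode the same migration would not build up this backlog, which is why the text recommends inline mode for sequential bulk writes.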

The following table shows the processing method (synchronous or asynchronous) for initial data, new write data, and update data. For new write data, the capacity saving processing is performed at different times for the post-process mode and the inline mode.

Mode | Initial data* | New write data: compression processing | New write data: deduplication processing | Updated write data: compression processing | Updated write data: deduplication processing
Post-process mode | Asynchronous | Asynchronous | Asynchronous | Synchronous when compressed data is updated; asynchronous when uncompressed data is updated | Asynchronous
Inline mode | Asynchronous | Synchronous | Synchronous for data whose transfer length is 256 KB or more; asynchronous for data whose transfer length is less than 256 KB | Synchronous when compressed data is updated; asynchronous when uncompressed data is updated | Asynchronous

* The initial data is the existing data on the DP-VOL when the capacity saving function is enabled. Both compression and deduplication processing are performed asynchronously for the initial data.
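The table above can be read as a small decision function. The sketch below encodes it in Python for illustration only; the function name, argument names, and string values are hypothetical and do not correspond to any CCI or Storage Navigator interface.

```python
INLINE_DEDUP_SYNC_MIN_KB = 256  # transfer length threshold from the table


def processing_method(mode: str, data_kind: str, function: str,
                      transfer_kb: int = 0, was_compressed: bool = False) -> str:
    """Return 'synchronous' or 'asynchronous' for one cell of the table.

    mode:           'post-process' or 'inline'
    data_kind:      'initial', 'new', or 'update'
    function:       'compression' or 'deduplication'
    transfer_kb:    transfer length of a new write (used for inline deduplication)
    was_compressed: whether the data being updated is already compressed
    """
    if mode not in ("post-process", "inline"):
        raise ValueError(f"unknown mode: {mode}")
    if data_kind not in ("initial", "new", "update"):
        raise ValueError(f"unknown data kind: {data_kind}")
    if function not in ("compression", "deduplication"):
        raise ValueError(f"unknown function: {function}")

    if data_kind == "initial":
        return "asynchronous"
    if data_kind == "update":
        # Same behavior in both modes.
        if function == "compression" and was_compressed:
            return "synchronous"
        return "asynchronous"
    # New write data: the only place where the two modes differ.
    if mode == "post-process":
        return "asynchronous"
    if function == "compression":
        return "synchronous"
    return "synchronous" if transfer_kb >= INLINE_DEDUP_SYNC_MIN_KB else "asynchronous"


if __name__ == "__main__":
    print(processing_method("inline", "new", "deduplication", transfer_kb=512))
    print(processing_method("inline", "new", "deduplication", transfer_kb=64))
```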

Use cases for capacity saving

 

The results of enabling the capacity saving functions of deduplication and compression depend on the properties and access patterns of the stored data. In addition, when capacity saving is enabled, some storage behaviors differ from conventional behaviors because data scanning and the garbage collection triggered by data updates increase the processing load on the storage controller. Before implementing capacity saving, confirm whether it should be applied to your specific storage environment.

The following table lists several storage use cases and describes the application of capacity saving to each use case.

Use case | Settings | Description
Office | Deduplication and compression | Because there are many identical file copies, deduplication is effective.
VDI | Deduplication and compression | Deduplication is very effective because of OS area cloning.
Database | Compression | Deduplication is not effective because the database has unique information for each block.
Image/video | Not suitable (Disable) | Compressed by application.
Backup/archive | Deduplication and compression | Deduplication is effective between backups.

Caution
  • I/O performance for data with compression and deduplication enabled is degraded. Before using the capacity saving function, verify the performance by following best practices or by using the Cache Optimization Tool (COT).
  • Because approximately 10% of capacity is used for metadata and garbage data, capacity saving should be applied only when the expected capacity saving rate is 20% or higher.
  • Deduplication and compression are processed in units of 8 KB. Therefore, if the block size of the file system is an integral multiple of 8 KB, capacity saving is likely to be effective.
  • The capacity saving function is not a good fit for high-write workloads. If the write workload rate is higher than the garbage collection throughput, Cache Write Pending increases, causing performance degradation. Contact customer support to determine the garbage collection throughput for your configuration.
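The numeric rules in the cautions above can be collected into a simple pre-check. This is a hypothetical sketch: the function names are made up for illustration, and the garbage collection throughput must come from customer support, so it appears here only as an input.

```python
CAPACITY_SAVING_UNIT_KB = 8      # processing unit for deduplication and compression
MIN_EXPECTED_SAVING_PCT = 20.0   # threshold below which the ~10% overhead outweighs the benefit


def block_size_is_aligned(fs_block_kb: int) -> bool:
    """True if the file system block size is an integral multiple of 8 KB."""
    return fs_block_kb > 0 and fs_block_kb % CAPACITY_SAVING_UNIT_KB == 0


def saving_is_worthwhile(expected_saving_pct: float) -> bool:
    """True if the expected capacity saving rate meets the 20% guideline."""
    return expected_saving_pct >= MIN_EXPECTED_SAVING_PCT


def garbage_collection_keeps_up(write_rate: float, gc_throughput: float) -> bool:
    """True if the write workload does not exceed garbage collection throughput.

    Both values must use the same unit; the garbage collection figure is
    configuration specific and is assumed to be obtained from customer support.
    """
    return write_rate <= gc_throughput


if __name__ == "__main__":
    print(block_size_is_aligned(64), saving_is_worthwhile(35.0),
          garbage_collection_keeps_up(write_rate=200.0, gc_throughput=350.0))
```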

Usage planning requirements for the capacity saving function

 

The following table outlines the items to review and plan for before using the capacity saving function.

Classification Item Remarks
Implementation Implementation method
  • New implementation
  • Changing DP-VOL to DRD-VOL
  • Migrating from old model (using a Program Product)
  • Migrating from old model (through a server)
Capacity Total used capacity of DRD-VOL Total used capacity, before capacity saving, of the volumes (DRD-VOLs) to which the capacity saving function is applied.
Capacity saving ratio [%]

If the data to which capacity saving is applied already exists, you can run the data reduction estimation tool.

If the data to which capacity saving is applied does not exist, you can estimate the capacity saving ratio by checking the system configuration guidelines for Hitachi Virtual Storage Platform F350, VSP F370, VSP F700, VSP F900; VSP G350, VSP G370, VSP G700, VSP G900.

Capacity saving ratio shown as N:1 can be converted to the capacity saving rate in % by using the following formula:

Capacity saving rate [%] = (1 - 1 ÷ N) × 100

Total used capacity of DP-VOL (Total used DP-VOL capacity) Total used capacity of DP-VOL to which the capacity saving function is not applied.
Configuration Storage system model When planning the pool to meet both capacity and performance requirements, consider which model is suitable.
RAID level RAID 1, RAID 5, or RAID 6 can be used.
Drive type Having the same drive type (including rotational speed) in a pool is recommended.
Capacity of one parity group None.
Performance Requirement for throughput (IOPS)

Consider these items when planning the pool to meet both capacity and performance requirements. You can use Performance Monitor output to determine them. The average I/O size can be calculated as follows:

Average I/O size [KB] = Average throughput [MB/s] ÷ Average throughput [IOPS] × 1024

Read/Write ratio
Average I/O size [KB]
Performance boundary for one parity group [IOPS]

Calculate the performance boundary for one parity group by using performance information:

  • Drive type: Consider the drive you plan to use.
  • Read/Write ratio: Consider the throughput requirements.
  • I/O size: Consider the throughput requirements.
Other requirement Use of encryption When encryption is used, accelerated compression cannot be used.
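The two formulas in the table above (converting an N:1 capacity saving ratio to a percentage, and deriving the average I/O size from Performance Monitor throughput figures) are straightforward to compute. The helper names below are hypothetical and serve only to make the arithmetic concrete.

```python
def saving_rate_percent(n: float) -> float:
    """Convert a capacity saving ratio of N:1 to a capacity saving rate in percent."""
    if n <= 0:
        raise ValueError("N must be positive")
    return (1 - 1 / n) * 100


def average_io_size_kb(throughput_mb_s: float, throughput_iops: float) -> float:
    """Average I/O size in KB from average throughput in MB/s and in IOPS."""
    if throughput_iops <= 0:
        raise ValueError("IOPS must be positive")
    return throughput_mb_s / throughput_iops * 1024


if __name__ == "__main__":
    print(saving_rate_percent(2))          # a 2:1 ratio
    print(average_io_size_kb(100, 12800))  # 100 MB/s at 12,800 IOPS
```

For example, a 2:1 ratio corresponds to a 50% saving rate and a 4:1 ratio to 75%, so a higher ratio yields progressively smaller gains in percentage terms.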

Storage planning considerations for the capacity saving function

 

Review the following table for information about settings, configuration, and performance considerations when using the capacity saving function.

Category Item Remarks
Setting Capacity saving setting Determine which capacity saving function to use based on the capacity saving rate (%), either as estimated by the data reduction estimation tool or as your own estimate.
Configuration Volume capacity

Estimate the number of volumes and the volume capacity provided to the host. For DRD-VOLs, we recommend creating volumes smaller than 2.4 TB. When a volume of 2.4 TB or larger is created, the efficiency of capacity saving processing and of garbage collection is degraded because of the limit on cache management device capacity, and the data reduction effect is reduced.

When the number of volumes is small, the following might not reach full performance: host I/O, post-process initial capacity saving, garbage collection, inline data migration, disabling of the capacity saving function, LDEV format, LDEV removal, and initial copy. To fully achieve garbage collection and post-process initial capacity saving performance, have at least 40, 24, 20, and 12 volumes on G900, G700, G370, and G350, respectively.

Number of parity groups

Determine the number of parity groups when designing a pool. You can base the number of parity groups on either of the following:

  • Capacity alone
  • Capacity and performance

For details, contact customer support.

Cache memory capacity Determine the cache memory capacity to be installed based on the total used DRD-VOL capacity. For details, contact customer support.
Shared memory capacity Determine the shared memory capacity to be installed based on the total used DRD-VOL capacity. For details, contact customer support.
Performance Estimated performance value Estimate the average write throughput for the customer use case and confirm that garbage data does not keep increasing under the workload. Estimate the average write throughput over an operation cycle (for example, 1 day to 1 week), using the information output by Performance Monitor. If garbage data increases constantly, the capacity saving function cannot be applied.
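The volume sizing guidance in the table above (volumes smaller than 2.4 TB, and a minimum volume count per model) lends itself to a quick plan review. The sketch below is illustrative only; the function name and message wording are hypothetical, and only the four G models named in the text are covered.

```python
from typing import List, Sequence

MAX_RECOMMENDED_VOLUME_TB = 2.4  # volumes should be smaller than this
MIN_VOLUMES_BY_MODEL = {"G900": 40, "G700": 24, "G370": 20, "G350": 12}


def review_volume_plan(model: str, volume_sizes_tb: Sequence[float]) -> List[str]:
    """Return a list of findings for a planned set of DRD-VOLs (empty if none)."""
    if model not in MIN_VOLUMES_BY_MODEL:
        raise ValueError(f"no guidance recorded for model {model}")
    findings = []
    oversized = [size for size in volume_sizes_tb if size >= MAX_RECOMMENDED_VOLUME_TB]
    if oversized:
        findings.append(f"{len(oversized)} volume(s) are 2.4 TB or larger")
    minimum = MIN_VOLUMES_BY_MODEL[model]
    if len(volume_sizes_tb) < minimum:
        findings.append(
            f"{len(volume_sizes_tb)} volume(s) planned; at least {minimum} recommended for {model}"
        )
    return findings


if __name__ == "__main__":
    for finding in review_volume_plan("G700", [2.0] * 10 + [3.0]):
        print(finding)
```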

Pool capacity consumed by metadata

 

When you use the capacity saving function, the following capacities are consumed for the pool capacity:

  • Used capacity of the pool consumed by user data
  • Used capacity of the pool consumed by garbage data
  • Used capacity of the pool consumed by metadata

Metadata for the compression function. When the compression function is enabled, 2% of the total used capacity of the compression-enabled DP-VOLs is consumed as the metadata for the compression function. The capacity of the metadata for the compression function is added to the used capacity of the pool. To view the used capacity of the pool, see Pool Capacity (Used/Total) in the Pools window. To view the system data capacity for a pool, see the System Data item in the Pools window. The system data capacity indicates the total capacity of metadata and garbage data.

Metadata for the deduplication function. When the deduplication function is enabled, 3% of the total used capacity of the deduplication-enabled DP-VOLs is consumed as the metadata for the deduplication function. To view the capacity of the metadata for the deduplication function, see the capacity of the deduplication system data volumes (fingerprint). The capacity of the metadata for the deduplication function is added to the used capacity of the pool. To view the used capacity of the pool, see Pool Capacity (Used/Total) in the Pools window. To view the system data capacity for a pool, see the System Data item in the Pools window. The system data capacity indicates the total capacity of metadata and garbage data.

Deduplication system data volume specifications and requirements

 

The following table lists the requirements for the deduplication system data volume (fingerprint).

 

Item Description
Volume type DP-VOL (V-VOL). When DP-VOLs whose capacity saving setting is Deduplication and Compression are created, the deduplication system data volumes are automatically created.
Number per pool 4 deduplication system data volumes (fingerprint) are associated with a pool.
Volume capacity
  • Virtual Storage Platform G350 or Virtual Storage Platform F350: 10 TB
  • Virtual Storage Platform G370 or Virtual Storage Platform F370: 10 TB
  • Virtual Storage Platform G700 or Virtual Storage Platform F700: 10 TB
  • Virtual Storage Platform G900 or Virtual Storage Platform F900: 10 TB
Cache management devices
  • Virtual Storage Platform F350: 1 deduplication system data volume (fingerprint) uses 1 cache management device.
  • Virtual Storage Platform G350 or Virtual Storage Platform F350: 1 deduplication system data volume (fingerprint) uses 4 cache management devices.
  • Virtual Storage Platform G370 or Virtual Storage Platform F370: 1 deduplication system data volume (fingerprint) uses 4 cache management devices.
  • Virtual Storage Platform G700 or Virtual Storage Platform F700: 1 deduplication system data volume (fingerprint) uses 4 cache management devices.
  • Virtual Storage Platform G900 or Virtual Storage Platform F900: 1 deduplication system data volume (fingerprint) uses 4 cache management devices.
Path definition Not available
LDEV format Not available

The following table lists the requirements for the deduplication system data volume (data store).

 

Item Description
Volume type DP-VOL (V-VOL). When DP-VOLs whose capacity saving setting is Deduplication and Compression are created, the deduplication system data volumes (data store) are automatically created.
Emulation type OPEN-V
Number per pool 4 deduplication system data volumes (data store) are associated with a pool.
Volume capacity

From 5.98 TB to 256 TB

The subsequent table describes the maximum capacity of deduplication system data volumes (data store) for a pool and a storage system.

The capacity when the volumes are initially created is the same as the total capacity of pool volumes in the pool. When pool operation starts, the total capacity of the 4 deduplication system data volumes (data store) automatically expands to match the total capacity of the pool volumes. However, you can also manually expand the capacities of these volumes.

Cache management devices Number of cache management devices that are used for 1 deduplication system data volume is from 4 to 172.
Path definition Not available
LDEV format

Available.

However, the LDEV format operation can be performed only for LDEVs in a pool for which the duplicated data has been initialized.

The following table lists the maximum capacity of deduplication system data volumes (data store) for a pool and a storage system.

 

Storage system | Added shared memories | Maximum capacity of deduplication system data volumes (data store) for a pool (PB) | Maximum capacity of deduplication system data volumes (data store) for a storage system (PB)
Virtual Storage Platform G350 or Virtual Storage Platform F350 | Base | Smaller of 0.0725 or the total capacity of pool volumes in a pool | 0.0725
Virtual Storage Platform G350 or Virtual Storage Platform F350 | Extension 1 | Smaller of 0.4 or the total capacity of pool volumes in a pool | 0.4
Virtual Storage Platform G350 or Virtual Storage Platform F350 | Extension 2 | Smaller of 1.0 or the total capacity of pool volumes in a pool | 1.1
Virtual Storage Platform G350 or Virtual Storage Platform F350 | Extension 3 | Not available | Not available
Virtual Storage Platform G370 or Virtual Storage Platform F370 | Base | Smaller of 0.4 or the total capacity of pool volumes in a pool | 0.4
Virtual Storage Platform G370 or Virtual Storage Platform F370 | Extension 1 | Smaller of 1.0 or the total capacity of pool volumes in a pool | 1.1
Virtual Storage Platform G370 or Virtual Storage Platform F370 | Extension 2 | Smaller of 1.0 or the total capacity of pool volumes in a pool | 2.0125
Virtual Storage Platform G370 or Virtual Storage Platform F370 | Extension 3 | Not available | Not available
Virtual Storage Platform G700 or Virtual Storage Platform F700 | Base | Smaller of 0.4 or the total capacity of pool volumes in a pool | 0.4
Virtual Storage Platform G700 or Virtual Storage Platform F700 | Extension 1 | Smaller of 1.0 or the total capacity of pool volumes in a pool | 1.1
Virtual Storage Platform G700 or Virtual Storage Platform F700 | Extension 2 | Smaller of 1.0 or the total capacity of pool volumes in a pool | 2.0125
Virtual Storage Platform G700 or Virtual Storage Platform F700 | Extension 3 | Smaller of 1.0 or the total capacity of pool volumes in a pool | 3.125
Virtual Storage Platform G900 or Virtual Storage Platform F900 | Base | Smaller of 1.0 or the total capacity of pool volumes in a pool | 1.1
Virtual Storage Platform G900 or Virtual Storage Platform F900 | Extension 1 | Smaller of 1.0 or the total capacity of pool volumes in a pool | 2.0125
Virtual Storage Platform G900 or Virtual Storage Platform F900 | Extension 2 | Smaller of 1.0 or the total capacity of pool volumes in a pool | 3.125
Virtual Storage Platform G900 or Virtual Storage Platform F900 | Extension 3 | Smaller of 1.0 or the total capacity of pool volumes in a pool | 4.15
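The "smaller capacity of the following" rule in the table above can be expressed as a lookup plus a minimum. The sketch below transcribes the table values into a Python dictionary for illustration; the model keys, function name, and the reading of the last column as the per-storage-system limit (as the sentence introducing the table describes) are assumptions of this example.

```python
# (per-pool limit in PB, per-storage-system limit in PB), transcribed from the table above.
# None marks combinations listed as "Not available".
DATA_STORE_LIMITS_PB = {
    "G350/F350": {"Base": (0.0725, 0.0725), "Extension 1": (0.4, 0.4),
                  "Extension 2": (1.0, 1.1), "Extension 3": None},
    "G370/F370": {"Base": (0.4, 0.4), "Extension 1": (1.0, 1.1),
                  "Extension 2": (1.0, 2.0125), "Extension 3": None},
    "G700/F700": {"Base": (0.4, 0.4), "Extension 1": (1.0, 1.1),
                  "Extension 2": (1.0, 2.0125), "Extension 3": (1.0, 3.125)},
    "G900/F900": {"Base": (1.0, 1.1), "Extension 1": (1.0, 2.0125),
                  "Extension 2": (1.0, 3.125), "Extension 3": (1.0, 4.15)},
}


def max_data_store_capacity_for_pool_pb(model: str, shared_memory: str,
                                        pool_volume_total_pb: float) -> float:
    """Maximum data store capacity for one pool: the smaller of the table
    limit and the total capacity of pool volumes in the pool."""
    limits = DATA_STORE_LIMITS_PB[model][shared_memory]
    if limits is None:
        raise ValueError(f"{shared_memory} is not available on {model}")
    pool_limit_pb, _system_limit_pb = limits
    return min(pool_limit_pb, pool_volume_total_pb)


if __name__ == "__main__":
    print(max_data_store_capacity_for_pool_pb("G700/F700", "Extension 1", 0.6))
```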

Reviewing compatibility with the capacity saving function

 

The capacity saving function cannot be used with certain program products or functions.

The following table lists the program products or functions that cannot be used with the capacity saving function.

Program product Restrictions when using the capacity saving function
Dynamic Provisioning The V-VOL full allocation function cannot be used. To prevent write failures caused by a full pool, you must consider monitoring the free space of the pool.
Dynamic Tiering Dynamic Tiering cannot be used. You must use separate pools for tier management and for the capacity saving function.
Active flash The active flash function cannot be used. You must use separate pools for tier management and for the capacity saving function.
Universal Volume Manager Data direct mapping cannot be set on a DP-VOL for which the capacity saving function is enabled. You must use separate pools for data direct mapping and for the capacity saving function.
ShadowImage The quick restore function cannot be used. Therefore, restoring backup data and then resuming the application takes time.
Volume Migration Volume Migration cannot be used. To migrate DP-VOLs for which capacity saving is enabled, consider other methods, such as migrating through a server.
Accelerated compression

The capacity saving function can be used. However, accelerated compression is effective only for certain tasks. In such cases, there is no advantage in using the capacity saving function together with accelerated compression. You must select the appropriate function according to the circumstances.

The following table lists the behavioral combinations between the capacity saving function and accelerated compression.

Capacity saving (Compression) Capacity saving (Deduplication and compression) Accelerated compression (capacity expansion by FMD) Behavior
Disabled Disabled Enabled Only accelerated compression is performed. The storage controller does not perform the compression/deduplication processing. Because the overhead of the processing by the storage controller is not generated, I/O can be processed quickly. It cannot be used in conjunction with the encryption function.
Enabled Disabled Disabled

The storage controller compresses data and stores the compressed data in the pool. It can be used in conjunction with the encryption function that cannot use accelerated compression.

Software compression and accelerated compression can be used simultaneously, but this is not recommended because performance is degraded compared to using accelerated compression alone.

Disabled Enabled Enabled

When identical data is stored in a pool, the storage controller keeps only one of them (Deduplication). For compression, the storage controller automatically determines that accelerated compression can be used and uses it.

The pool needs to consist of only FMD, and accelerated compression needs to be enabled in all parity groups in the pool.

It cannot be used in conjunction with the encryption function.

Disabled Enabled Disabled The storage controller performs the deduplication and compression processing. It can be also used in conjunction with the encryption function. The storage controller has the largest overhead of the capacity saving processing.

 
