
Dedupe File Systems

Deduplication is a file system feature, incorporating enhancements to both the file system and the application layer, that reduces redundancy in stored data blocks. All data in the specified file system is scanned at intervals, and duplicate blocks are removed, reclaiming disk space. All dedupe activity, including the elimination of redundant blocks, is transparent to users.

Base deduplication is enabled by default and does not require a license key. This is a dedupe feature with a single SHA-256 engine, capable of indexing data at a rate of up to 120 MB per second.

Premium deduplication is a licensed feature and must be installed before deduplication can be performed. This is a dedupe feature with four SHA-256 engines, capable of indexing data at a much faster rate. Contact your Hitachi Vantara representative for more information.
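As a rough illustration of what the indexing rate means in practice, the following Python sketch estimates how long a full scan might take at the base engine's 120 MB per second. The figures are illustrative only: the estimate assumes the engine sustains its peak rate and ignores QoS throttling, so real scans will take longer.

```python
# Rough scan-time estimate; assumes the engine sustains its peak rate
# and ignores QoS throttling, so real scans will take longer.
MB_PER_TB = 1024 * 1024

def scan_hours(data_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to index data_tb terabytes at rate_mb_per_s MB/s."""
    seconds = data_tb * MB_PER_TB / rate_mb_per_s
    return seconds / 3600

# Base dedupe, single engine at 120 MB/s, scanning 10 TB of data
print(round(scan_hours(10, 120), 1))  # about 24.3 hours
```

Doubling the effective indexing rate halves the scan time, which is why the multi-engine premium feature indexes data much faster.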

For license key information, see the Server and Cluster Administration Guide.

Note: Do not use NAS deduplication and storage-based deduplication (capacity saving) on the same LUs, because the additional processing reduces I/O performance.
Note: Deduplication of object replication target file systems is supported from release 13.6.

Deduplication characteristics

The deduplication feature has the following characteristics:

  • Only user data blocks are deduplicated.
  • Dedupe is a post-process that performs fixed-block deduplication. It is not an inline dedupe process.
  • Data is deduped within a given file system and not across multiple file systems.
  • Dedupe has been designed with quality of service (QoS) as a key component. File system activity takes precedence over dedupe activity when file serving load goes beyond 50 percent of the available IOPS or throughput capacity. The deduplication process throttles back to allow the additional file serving traffic load to continue without impacting performance.
  • You can configure a new file system to support and enable dedupe.
  • An existing WFS-2 file system can be converted to be dedupe-enabled.
  • File systems with support for dedupe can be dedupe-enabled or dedupe-disabled.
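The fixed-block, post-process behavior described above can be sketched in Python. This is a minimal illustration of the idea (hash each fixed-size block with SHA-256 and keep one physical copy per digest), not the platform's actual implementation; the 4 KiB block size is an assumption for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, for illustration only

def dedupe_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Post-process dedupe sketch: index fixed-size blocks by SHA-256
    and keep a single physical copy per distinct digest."""
    store = {}      # digest -> the one physical copy of that block
    block_map = []  # logical block order -> digest (shared references)
    for off in range(0, len(data), block_size):
        block = data[off:off + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # duplicates share the first copy
        block_map.append(digest)
    return store, block_map

data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096  # 4 blocks, 2 distinct
store, block_map = dedupe_blocks(data)
print(len(block_map), len(store))  # 4 logical blocks, 2 physical blocks
```

Reading the file back simply follows `block_map` into `store`, which is why dedupe is transparent to the user: the logical view is unchanged even though duplicates share physical blocks.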

Deduplication interoperability with existing NAS Platform features

The following list describes how dedupe interoperates with existing applications:

  • Object Replication: File data is not retained in the deduplicated state when a file system is replicated with the Object Replication option. Deduplication is supported on object replication targets from release 13.6.
  • File Replication: File data is not retained in the deduplicated state during file replication. The target file system can be dedupe-enabled, in which case the files on the target are eventually deduped.
  • Snapshots: Deduped file systems do not report exact physical snapshot block usage. There is no way to determine how much physical space a given snapshot occupies and, therefore, how much space is freed when a snapshot is deleted. Additionally, running kill-snapshots on a deduped file system leaks all snapshot blocks, and further measures are required to reclaim the leaked space. snapshot-delete-all is the preferred tool for deleting all snapshots: it does not require the file system to be unmounted, it leaks no space, and it does not affect checkpoint selection.
  • Data Migration: With both local and external migrations, migrated files are rehydrated.
  • NDMP file-based backup: Files that were deep copied during an NDMP file-based backup are restored from tape as single, rehydrated files; an NDMP recovery does not restore the deduplicated state.
  • Sync Image backup: Sync Image backups do not retain their deduplicated state.
  • Quotas: Quota calculations are based on the length of the file (size-based quotas) or the rehydrated usage of a file (usage-based quotas).
  • Tiered File Systems: User data can be deduplicated; metadata is not.

Calculating deduplication space savings

The following example describes how the deduplication space savings are calculated.

Suppose the physical and logical composition of 100 TB of data before deduplication is as follows:

  • Group A: 30 TB of distinct data
  • Group B: 70 TB of duplicated data that contains only 10 TB of unique data blocks.
    • Given an arbitrary data block in Group B, there may be one or more identical data blocks in Group B (but none in Group A), whereas an arbitrary data block in Group A has no identical data block in either group.

If both Group A and Group B have gone through the dedupe process:

  • Group A had no duplicates removed and consumed the same 30 TB.
  • Group B had duplicates removed and consumed only 10 TB to hold the unique data blocks.
  • Group B (70 TB) = {Group C (10 TB raw remaining)} + {Group D (60 TB deduped and now sharing or pointing to physical blocks of group C)}
  • The original 100 TB of data now requires only 40 TB (30 plus 10) of physical blocks because all duplicates were removed. However, the logical data size is 100 TB (30 plus 70), which is the amount of space needed if the data were not deduped. The results are outlined in the following table:

    Used space: The amount of physical disk space used by the file system; in this example, Group A + Group C = 30 + 10 = 40 TB
    Deduped space: The amount of duplicate data that does not occupy its own physical disk space, but has been deduped to share existing physical blocks; Group D = 60 TB
    Logical space: The amount of physical disk space that would be required if the data were not deduped; {used space} + {deduped space} = 40 + 60 = 100 TB

Based on the example presented, the dedupe percentage gives the amount of physical disk space saved by removing the duplicates, measured against the amount of space that would be required if the data were not deduped: dedupe percentage = {deduped space} / {logical space} = 60 / 100 = 60 percent.
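The worked example above can be reproduced in a few lines of Python. The variable names are ours, but the figures and the formulas come directly from the example:

```python
# Figures from the Group A/B example above (all in TB)
distinct = 30        # Group A: data with no duplicates anywhere
group_c_unique = 10  # Group C: the unique blocks within Group B
group_d_deduped = 60 # Group D: blocks now sharing Group C's physical blocks

used_space = distinct + group_c_unique      # physical disk space consumed
deduped_space = group_d_deduped             # duplicates sharing existing blocks
logical_space = used_space + deduped_space  # space needed without dedupe

dedupe_pct = 100 * deduped_space / logical_space
print(used_space, deduped_space, logical_space, dedupe_pct)  # 40 60 100 60.0
```

The same relationships explain the df output described in the next section: the reported deduped amount is exactly the physical space the feature has saved.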

Viewing deduped file system usage

The df command reports deduplication information for a file system. The Deduped column reports the amount of data that has been deduped in the file system, which is the amount of disk space that was saved by using this feature.


  • All columns except Snapshots and Deduped have the same meaning as a normal file system:
    • Size column: The formatted capacity of the file system.
    • Used column: The amount (and percentage) of formatted capacity used by live and snapshot data.
    • Avail column: The amount of formatted capacity available for further data.
  • Deduped column
    • This column reports the amount (and percentage) of deduped data in the file system.
  • Snapshots column
    • This column normally reports the amount (and percentage) of logical space used by snapshots.
    • On file systems that do not support dedupe, the logical size of data is equal to the physical space used by that data. However, on file systems that support dedupe, the logical space used by snapshots can exceed the physical space used, due to block sharing through dedupe.
    • In some cases, snapshot space usage can even exceed the total formatted capacity of the file system. To avoid confusion, the Snapshots column displays NA for file systems that support dedupe.

 
