About the Content Software for File system

The Content Software for File solution enables the implementation of a shareable, scalable, distributed file storage system.

Basic Content Software for File system deployment

The basic Content Software for File deployment model involves the creation of a shareable filesystem to be used by the application servers. This requires the installation of Content Software for File client software, which implements a POSIX filesystem driver, on each application server intended to access data. This filesystem driver enables each application server to access the Content Software for File system as if it were a local drive, perceiving it as a locally attached filesystem device while it is actually shared among multiple application servers.

The file services are implemented by a group of backend hosts running the Content Software for File software and fully dedicated to the Content Software for File system. SSD drives for storing the data are installed on these servers. The resultant storage system is scalable to hundreds of backends and thousands of clients.


The Content Software for File backends are configured as a cluster which, together with the Content Software for File clients installed on the application servers, forms one large shareable, distributed and scalable file storage system:

  • Shareable

    All clients can share the same filesystems, so any file written by any of the clients is immediately available to any client reading the data. In technical terms, this means that Content Software for File is a strongly-consistent, POSIX-compliant system.

  • Distributed

    A Content Software for File system is formed as a cluster of multiple backends, each of which provides services concurrently.

  • Scalable

    The Content Software for File system's performance scales linearly with the size of the cluster: a cluster of size 2x delivers twice the performance of a cluster of size x. This applies to both data and metadata.

Features

The Content Software for File system provides a number of unique features and functions.

Protection

The Content Software for File system is fully protected with an N+2 or N+4 scheme, meaning that any two (or, with N+4, any four) concurrent drive or backend failures cause no data loss and leave the system up and running to provide continuous service. This is achieved through a distributed protection scheme that is determined when the cluster is formed. The data portion can range from 3 to 16, and the protection portion can be either 2 or 4; for example, clusters can be configured as 3+2, 10+2, or even 16+4 for a large cluster of backend hosts.
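
The capacity arithmetic implied by such a scheme is straightforward. The following sketch is illustrative only; the function name and values are ours, not part of the product. It shows the fraction of each stripe that carries user data for the layouts mentioned above.

# Illustrative only: usable data fraction of a stripe for several
# data+protection layouts (names and values are examples, not product API).
def usable_fraction(data_blocks: int, protection_blocks: int) -> float:
    """Fraction of each stripe that holds user data rather than protection."""
    return data_blocks / (data_blocks + protection_blocks)

for data, protection in [(3, 2), (10, 2), (16, 4)]:
    print(f"{data}+{protection}: {usable_fraction(data, protection):.0%} of the stripe is user data")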

Distributed network scheme

The Content Software for File system implements an any-to-any protection scheme: if a backend fails, a rebuild process is performed using all other backends, taking the data that resided on the failed backend and recreating it from redundancy on other backends in the cluster. Redundancy is therefore not achieved across fixed groups of backends, but through groups of data sets that protect each other across the whole cluster. In this way, if one backend fails in a cluster of 100 backends, all the other 99 backends participate in the rebuild process, simultaneously reading and writing. This makes the Content Software for File rebuild process extremely fast, unlike traditional storage architectures in which only a small portion of the backends or drives participates in the rebuild. Furthermore, the bigger the cluster, the faster the rebuild process.

Failed component replacement as a background process

Hot spare capacity in Content Software for File clusters is provided as extra capacity across the cluster, sufficient to return to full redundancy after a rebuild, rather than by dedicating specific physical components as hot spares, as in traditional approaches. Consequently, a cluster of 100 backends is configured with sufficient capacity to rebuild the data and return to full redundancy even following two failures, after which it can still withstand another two failures.

This strategy for replacing a failed component does not leave the system vulnerable. Following a failure, it is not necessary to replace the failed component before the data can be recreated. In the Content Software for File system, data is recreated immediately, leaving the physical replacement of the failed component as a background process.

Failure domains

Failure domains are groups of backends that may fail because of a single root cause. For example, all servers in a rack can be considered a failure domain if they are all powered through a single power circuit, or all connected through a single top-of-rack (TOR) switch. Consider a setup of 10 such racks with a cluster of 50 Content Software for File backends (five backends in each rack). During formation of the cluster, it is possible to configure 6+2 protection and make the Content Software for File system aware of these failure domains by forming each protection stripe across racks. In this way, each 6+2 stripe is spread across different racks, ensuring that the system remains operational during a full rack failure and that no data is lost.

For failure domains, the stripe width must be less than or equal to the failure domain count: if there are 10 racks and each rack represents a single point of failure, 16+4 cluster protection is not possible. Consequently, the protection and support of failure domains depend on the stripe width, the protection level, and the number of hot spares required.
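
A minimal sketch of this constraint, assuming only the simple rule stated above (the function and names are illustrative, not a product API):

# Illustrative check: a stripe (data + protection blocks) must not be wider
# than the number of failure domains it has to be spread across.
def stripe_fits(data: int, protection: int, failure_domains: int) -> bool:
    return (data + protection) <= failure_domains

print(stripe_fits(6, 2, 10))    # True:  6+2 fits across 10 racks
print(stripe_fits(16, 4, 10))   # False: 16+4 would need at least 20 failure domains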

Prioritized data rebuild process

When a failure occurs, the data rebuild process begins by reading all the stripes where the failure occurred, rebuilding the data and returning to full protection. If a second failure occurs, there will actually be three possible types of stripes:

  1. Stripes not affected by either of the failed components – no action required.
  2. Stripes affected by only one of the failed components.
  3. Stripes affected by both the failed components.

Naturally, the number of stripes affected by both failed components is much smaller than the number of stripes affected by only one. However, as long as stripes affected by both failed components have yet to be rebuilt, a third component failure would expose the Content Software for File system to data loss.

To reduce this risk, the Content Software for File system prioritizes the rebuild process, starting first with stripes affected by two component failures. Since the number of such stripes is much smaller, this rebuild process is performed very quickly, within minutes or less. The Content Software for File system then returns to the rebuild of stripes affected by only one failed component, and can still withstand another concurrent failure without any loss of data. This prioritized approach to the rebuild process ensures that data is almost never lost, and that service and data are always available.
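
The prioritization can be pictured with a small sketch. This is a conceptual model of the ordering described above, not the actual rebuild scheduler, and all names are invented for illustration.

# Conceptual model only: stripes touched by more failed components are
# rebuilt first; stripes touched by none need no action.
failed_components = {"backend-7", "backend-42"}

def failures_in(stripe_members: set) -> int:
    """Number of failed components that this stripe depends on."""
    return len(stripe_members & failed_components)

stripes = [
    {"backend-1", "backend-7", "backend-9", "backend-13", "backend-20"},
    {"backend-3", "backend-7", "backend-42", "backend-11", "backend-30"},
    {"backend-2", "backend-5", "backend-8", "backend-15", "backend-27"},
]

# Rebuild order: doubly-affected stripes first, untouched stripes skipped.
for stripe in sorted(stripes, key=failures_in, reverse=True):
    if failures_in(stripe):
        print(f"rebuild (priority {failures_in(stripe)}): {sorted(stripe)}")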

Seamless distribution, scale, and performance

Each Content Software for File client installed on an application server directly accesses the backend host that stores the relevant data; a client does not send requests to one backend that then forwards them to another. Content Software for File clients hold a fully synchronized map of which backend stores which data, a joint configuration that all clients and backends share.

When a Content Software for File client tries to access a certain file or an offset in a file, a cryptographic hash function indicates which backend holds the required file or offset. When a cluster expansion is performed or a component failure occurs, backend responsibilities and capabilities are instantly redistributed among the various components. This is the basic mechanism that allows the Content Software for File system to grow performance linearly and is the key to scaling size and performance together. If, for example, backends are added to double the size of a cluster, different parts of the filesystems are redistributed to the new backends, thereby instantly delivering twice the performance.
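
The placement idea can be illustrated with a simplified sketch. This is not the product's algorithm: the real system uses its own cryptographic hash and a synchronized cluster map, and redistributes only a fraction of the data when the cluster changes, whereas the naive modulo mapping below would move far more. All names and the chunk size are assumptions.

# Simplified illustration: every client computes the owning backend for a
# (file, offset) pair from the same hash, so no intermediary backend is needed.
import hashlib

PLACEMENT_GRANULARITY = 1 << 20  # assume 1 MiB chunks for this sketch

def owning_backend(file_id: str, offset: int, backends: list) -> str:
    chunk = offset // PLACEMENT_GRANULARITY
    digest = hashlib.sha256(f"{file_id}:{chunk}".encode()).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]

cluster = [f"backend-{i}" for i in range(100)]
print(owning_backend("inode-1234", 5 * PLACEMENT_GRANULARITY, cluster))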

Furthermore, if a cluster is grown only modestly, for example from 100 to 110 backends, it is not necessary to redistribute all the data; only 10% of the existing data is copied to the new backends in order to redistribute the data equally across all backends. This balancing of data, which extends participation of all backends in all read operations, is important for scaled performance, ensuring that there are no idle or full backends and that each backend in the cluster stores the same amount of data.

The duration of all these completely seamless operations depends on the capacity of the backends and the network bandwidth. Ongoing operations are not affected, and performance improves as the redistribution of data is executed. Completion of the redistribution process delivers optimal capacity and performance.

Data reduction

Our enhanced data reduction maintains exceptional performance while delivering significant reductions on various workloads. The Content Software for File system looks for blocks of data that are similar to each other (they do not need to be 100% identical, as in traditional data reduction techniques) and reduces them, storing any differences separately.

Data reduction can be enabled per filesystem. Compression ratios are workload-dependent and are excellent with text-based data, large-scale unstructured datasets, log analysis, databases, code repositories, and sensor data. A Data Reduction Estimation Tool (DRET) is available that can be run on existing filesystems to calculate the reduction rate of your datasets. For more information, contact the Customer Success Team.

Converged Content Software for File system deployment

The Content Software for File system can be deployed in a converged configuration. As an alternative to the basic system deployment, this enables hundreds of application servers running user applications, each installed with a Content Software for File client, to access the cluster. Instead of provisioning servers fully dedicated to backends, a client is installed on each application server, along with one or more SSDs and backend processes on those same servers. In such a configuration, the Content Software for File backend processes operate as one big cluster, take over the local SSDs, and form a shareable, distributed, and scalable filesystem available to the application servers, in the same way as in the basic system deployment. The only difference is that instead of installing SSDs on backends dedicated to the Content Software for File system, the backends share the same physical infrastructure with the application servers.

This mixture of different storage and computation abilities delivers more effective performance and a better utilization of resources. However, unlike the basic Content Software for File system deployment, where an application server failure has no effect on the other backends, here the cluster will be affected if an application server is rebooted or fails. The cluster is still protected by the N+2 scheme, and can withstand two such concurrent failures. Consequently, converged Content Software for File deployments require more careful integration, as well as more detailed awareness between computation and storage management practices.

Otherwise, this is technically the same solution as the basic Content Software for File system deployment, with all the same system functionality features for protection, redundancy, failed component replacement, failure domains, prioritized data rebuilds and seamless distribution, scale and performance. Some of the servers may be installed with a Content Software for File backend process and a local SSD, while others may have clients only. This means that there can be a cluster of application servers with Content Software for File software installed on some and clients installed on others.

Selecting a redundancy scheme

Redundancy schemes in Content Software for File system deployments can range from 3+2 to 16+4. There are a number of considerations for selecting the most suitable configuration, which depends on the required redundancy, the data stripe width, the hot spare capacity, and the performance required during a rebuild from a failure.

  • Redundancy

    Redundancy can be N+2 or N+4 and impacts both capacity and performance. A redundancy of 2 is sufficient for the majority of configurations. A redundancy of 4 is usually used for clusters of 100 or more backends, or for extremely critical data.

  • Data Stripe Width

    The number of data components, which can range from 3 to 16. The bigger the data stripe, the better the eventual net capacity. Consideration has to be given to both raw and net capacity. Raw capacity is the total capacity of the SSDs in the deployment. Net capacity is how much is actually available for the storage of user data. Consequently, bigger stripe widths provide more net capacity but may impact performance during a rebuild, as discussed below in Performance required during a rebuild from a failure.

    For extremely critical data, it is recommended to contact your customer support representative to determine whether the stripe width matches the resiliency requirements.

    Note: The active failure domain count cannot drop below the stripe width; for example, if two failure domains become unavailable under 3+2 protection with 6 failure domains, the Content Software for File cluster is left vulnerable and unable to rebuild. In such situations, contact your customer support representative.
  • Hot Spare Capacity

    This is an IT consideration relating to the time required to replace faulty components. The faster IT processes failures or guarantees the replacement of faulty components, the lower the required hot spare capacity. The more relaxed, and hence more cost-effective, the component replacement schedule, the more hot spare capacity is required. For example, remotely located systems visited once a quarter to replace failed drives require more hot spares than systems with guaranteed 24/7 service.

  • Performance required during a rebuild from a failure

    Rebuild performance is impacted only by read operations. Unlike other storage systems, write performance is unaffected by failures and rebuilds, since the Content Software for File system continues writing to the functioning backends in the cluster. Read performance can be affected, however, because data residing on a failed component has to be reconstructed by reading the whole stripe, which competes with ongoing reads and triggers an immediate, prioritized rebuild. If, for example, one failure occurs in a cluster of 100 backends, performance is reduced by about 1%; in a cluster of 100 backends with a stripe width of 16, however, performance is reduced by up to 16% at the beginning of the rebuild. Naturally, the cluster size can exceed the stripe width or the number of failure domains. Consequently, for large clusters it is recommended that the stripe width does not exceed 25% of the cluster size; for example, for a cluster of 40 backends, 8+2 protection is recommended so that if a failure occurs, the impact on performance does not exceed 25% (see the sketch after this list).

  • Write Performance

    Write performance is generally better the larger the stripe width, since the system computes a smaller proportion of protection data relative to real data. This is particularly applicable to large writes in a system accumulating data for the first time.
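
The 25% guideline above can be approximated with a rough calculation. The formula below is inferred from the examples in this list and is our own simplification, not an official sizing tool.

# Rough estimate inferred from the examples above: at the start of a rebuild,
# read performance drops by roughly stripe_width / cluster_size.
def rebuild_read_impact(stripe_width: int, cluster_size: int) -> float:
    return min(1.0, stripe_width / cluster_size)

print(f"{rebuild_read_impact(16, 100):.0%}")  # ~16% for 16-wide stripes on 100 backends
print(f"{rebuild_read_impact(10, 40):.0%}")   # ~25% for 8+2 stripes on 40 backends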

SSD capacity management

This section describes the terminology relating to Content Software for File system capacity management and provides the formula for calculating the system's net data storage capacity.

  • Raw capacity

    Raw capacity is the total capacity on all the SSDs assigned to a Content Software for File system cluster, e.g., 10 SSDs of 1 terabyte each have a total raw capacity of 10 terabytes. This is the total capacity available for the Content Software for File system. This will change automatically if more hosts or SSDs are added to the system.

  • Net capacity

    Net capacity is the amount of space available for user data on the SSDs in a configured Content Software for File system. It is based on the raw capacity minus the Content Software for File filesystem overheads for redundancy protection and other needs. This will change automatically if more hosts or SSDs are added to the system.

  • Stripe width

    The stripe width is the number of blocks that share a common protection set, which can range from 3 to 16. The Content Software for File system has distributed any-to-any protection. Consequently, in a system with a stripe width of 8, many groups of 8 data units spread on various hosts protect each other (rather than a group of 8 hosts forming a protection group). The stripe width is set during the cluster formation and cannot be changed. Stripe width choice impacts performance and space.

    Note: If not configured, the stripe width is automatically set to the number of failure domains minus the protection level.
  • Protection level

    The protection level is the number of additional protection blocks added to each stripe, which can be either 2 or 4. A system with a protection level of 2 can survive 2 concurrent failures. With a protection level of 4, data is protected against any 4 concurrent host/disk failures, and availability is protected against any 4 concurrent disk failures or 2 concurrent host failures. A larger protection level has space and performance implications. The protection level is set during cluster formation and cannot be changed.

    Note: If not configured, the protection level (the number of protection blocks per stripe) is automatically set to 2.
  • Failure domain (optional)

    A failure domain is a group of Content Software for File hosts, all of which can fail concurrently due to a single root cause, such as a power circuit or network switch failure. A cluster can be configured with explicit or implicit failure domains. For a system with explicit failure domains, each group of blocks that protect each other is spread on different failure domains. For a system with implicit failure domains, the group of blocks is spread on different hosts and each host is a failure domain. Additional failure domains can be added, and new hosts can be added to any existing or new failure domain.

    Note: This documentation relates to a homogeneous Content Software for File system deployment, that is, the same number of hosts per failure domain (if any) and the same SSD capacity per host. For information about heterogeneous Content Software for File system configurations, contact customer support.
  • Hot spare

    A hot spare is the number of failure domains that the system can lose, undergo a complete rebuild of data, and still maintain the same net capacity. All failure domains are always participating in storing the data, and the hot spare capacity is evenly spread within all failure domains.

    The higher the hot spare count, the more hardware required to obtain the same net capacity. On the other hand, the higher the hot spare count, the more relaxed the IT maintenance schedule for replacements. The hot spare is defined during cluster formation and can be reconfigured at any time.

    Note: If not configured, the hot spare is automatically set to 1.
  • Content Software for File filesystem overhead

    After deducting the capacity for the protection and hot spares, only 90% of the remaining capacity can be used as net user capacity, with the other 10% of capacity reserved for the Content Software for File filesystems. This is a fixed formula that cannot be configured.

  • Provisioned capacity

    The provisioned capacity is the total capacity assigned to filesystems. This includes both SSD and object store capacity.

  • Available capacity

    The available capacity is the total capacity that can be used for the allocation of new filesystems, which is net capacity minus provisioned capacity.

Deductions from raw capacity to obtain net storage capacity

The net capacity of the Content Software for File system is obtained after the following three deductions performed during configuration:

  1. Level of protection required, which is the amount of storage capacity to be dedicated for system protection.
  2. Hot spare(s), that is the amount of storage capacity to be set aside for redundancy and to allow for rebuilding following a component failure.
  3. Content Software for File filesystem overhead, in order to improve overall performance.

Formula for calculating SSD net storage capacity

SSDNetCapacity = RawCapacity * (FailureDomains - HotSpares) / FailureDomains * StripeWidth / (StripeWidth + ProtectionLevel) * 0.9
Scenario 1

A homogeneous system of 10 hosts, each with 1 terabyte of Raw SSD Capacity, 1 hot spare, and a protection scheme of 6+2.

SSDNetCapacity = 10TB * (10-1) / 10 * 6/(6+2) * 0.9 = 6.075TB
Scenario 2

A homogeneous system of 20 hosts, each with 1 terabyte of Raw SSD Capacity, 2 hot spares, and a protection scheme of 16+2.

SSDNetCapacity = 20TB * (20-2) / 20 * 16/(16+2) * 0.9 = 14.4TB
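
The same formula can be expressed as a small helper. The variable names are ours, and the two scenarios above use implicit failure domains, so the failure domain count equals the host count.

# Net SSD capacity following the formula above (helper and names are ours).
def ssd_net_capacity(raw_tb: float, failure_domains: int, hot_spares: int,
                     stripe_width: int, protection: int) -> float:
    spare_factor = (failure_domains - hot_spares) / failure_domains
    protection_factor = stripe_width / (stripe_width + protection)
    filesystem_overhead = 0.9   # 10% reserved for filesystem overhead
    return raw_tb * spare_factor * protection_factor * filesystem_overhead

print(ssd_net_capacity(10, 10, 1, 6, 2))    # ~6.075 TB (Scenario 1)
print(ssd_net_capacity(20, 20, 2, 16, 2))   # ~14.4 TB  (Scenario 2)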

Filesystems, object stores, and filesystem groups

There are three types of entities relevant to data storage in the Content Software for File system: filesystems, object stores, and filesystem groups.

About filesystems

A Content Software for File filesystem is similar to a regular on-disk filesystem, but is distributed across all the hosts in the cluster. Consequently, filesystems are not associated with any particular physical object in the Content Software for File system and act as root directories with space limitations.

The system supports up to 1024 filesystems, all of which are equally balanced across all SSDs and CPU cores assigned to the system. This means that allocating a new filesystem or resizing a filesystem is an instant management operation performed without any constraints.

A filesystem has a defined capacity limit and is associated with a predefined filesystem group. A filesystem that belongs to a tiered filesystem group must have a total capacity limit and an SSD capacity cap. All filesystems' available SSD capacity cannot exceed the total SSD net capacity.

Thin provisioning

Thin provisioning is a method of on-demand SSD capacity allocation based on user requirements. In thin provisioning, the filesystem capacity is defined by a minimum guaranteed capacity and a maximum capacity (which can virtually exceed the available SSD capacity).

When users consume their guaranteed minimum capacity, the system allocates more capacity for them (up to the total available SSD capacity). Conversely, when they free up space by deleting files or transferring data, the idle space is reclaimed, repurposed, and used for other workloads that need the SSD capacity.

Thin provisioning is beneficial in various use cases:

  • Tiered filesystems: On tiered filesystems, available SSD capacity is leveraged for extra performance and released to the object store once needed by other filesystems.
  • Auto-scaling groups: When using auto-scaling groups, thin provisioning can help to automatically expand and shrink the filesystem's SSD capacity for extra performance.
  • Separation of projects to filesystems: If it is required to create a separate filesystem for each project, and the administrator doesn't expect all filesystems to be fully utilized simultaneously, creating a thin provisioned filesystem for each project is a good solution. Each filesystem is allocated with a minimum capacity but can consume more when needed based on the actual available SSD capacity.

Filesystem limits

  • Number of files or directories: Up to 6.4 trillion (6.4 * 10^12)
  • Number of files in a single directory: Up to 6.4 billion (6.4 * 10^9)
  • Total capacity with object store: Up to 14 EB
  • Total SSD capacity: Up to 512 PB
  • File size: Up to 4 PB

Encrypted filesystems

Both data at rest (residing on SSD and object store) and data in transit can be encrypted. This is achieved by enabling the filesystem encryption feature. A decision on whether a filesystem is to be encrypted is made when creating the filesystem.

To create encrypted filesystems, deploy a Key Management System (KMS).

Note: You can only set the data encryption when creating a filesystem.

Metadata limitations

In addition to the capacity limitation, each filesystem has a limitation on the amount of metadata. The system-wide metadata limit is determined by the SSD capacity allocated to the Content Software for File system and the RAM resources allocated to the Content Software for File system processes.

The Content Software for File system tracks metadata units in RAM. If it reaches the RAM limit, it pages these metadata tracking units to the SSD and raises an alert. This leaves enough time for the administrator to increase system resources, as the system keeps serving IOs with minimal performance impact.

By default, the metadata limit associated with a filesystem is proportional to the filesystem SSD size. It is possible to override this default by defining a filesystem-specific max-files parameter. The filesystem limit is a logical limit to control the specific filesystem usage and can be updated by the administrator when necessary.

The total metadata limits across all filesystems can exceed the amount of metadata that fits in RAM. In such a case, the least-recently-used units are paged to disk as necessary, for minimal impact.

Metadata units calculation

Each metadata unit consumes 4 KB of SSD space (not tiered) and 20 bytes of RAM.

Throughout this documentation, the metadata limitation per filesystem is referred to as the max-files parameter, which specifies the number of metadata units (not the number of files). This parameter encapsulates both the file count and the file sizes.

The following table specifies the required number of metadata units according to the file size. These specifications apply to files residing on SSDs or tiered to object stores.

File size | Number of metadata units | Example
< 0.5 MB | 1 | A filesystem with 1 billion files of 64 KB each requires 1 billion metadata units.
0.5 MB - 1 MB | 2 | A filesystem with 1 million files of 750 KB each requires 2 million metadata units.
> 1 MB | 2 for the first 1 MB, plus 1 per additional MB | Examples:
  • A filesystem with 1 million files of 129 MB each requires 130 million metadata units (2 units for the first 1 MB plus 1 unit per MB for the remaining 128 MB).
  • A filesystem with 10 million files of 1.5 MB each requires 30 million units.
  • A filesystem with 10 million files of 3 MB each requires 40 million units.
Note: Each directory requires two metadata units, compared to one for a small file.
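
A small helper reproducing the arithmetic in the table above. This is our own sketch; it treats 1 MB as 2^20 bytes, which is an assumption.

# Sketch of the metadata-unit arithmetic shown in the table above.
import math

MB = 1 << 20  # assumption: 1 MB = 2**20 bytes

def metadata_units(file_size_bytes: int) -> int:
    if file_size_bytes < 0.5 * MB:
        return 1
    if file_size_bytes <= MB:
        return 2
    return 2 + math.ceil((file_size_bytes - MB) / MB)

print(metadata_units(64 * 1024))                 # 1
print(metadata_units(750 * 1024))                # 2
print(metadata_units(129 * MB))                  # 130
print(10_000_000 * metadata_units(3 * MB))       # 40000000 (10M files of 3 MB each)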

About object stores

In the Content Software for File system, object stores represent an optional external storage media, ideal for storing warm data. Object stores used in tiered Content Software for File system configurations can be cloud-based, located in the same location (local), or at a remote location.

Content Software for File supports object stores for tiering (tiering and local snapshots) and backup (snapshots only). Both tiering and backup can be used for the same filesystem.

Object store buckets are used optimally when a cost-effective data storage tier is required at a price point that server-based SSDs cannot satisfy.

An object store bucket definition contains the object store DNS name, bucket identifier, and access credentials. The bucket must be dedicated to the Content Software for File system and not be accessible by other applications.

Filesystem connectivity to object store buckets can be used in the data lifecycle management and Snap-to-Object features.

About filesystem groups

In the Content Software for File system, filesystems are grouped into a maximum of eight filesystem groups.

Each filesystem group has tiering control parameters. Although each tiered filesystem has its own object store, the tiering policy is the same for every tiered filesystem in the same filesystem group.

Networking

This page reviews the theory of operation for Content Software for File networking.

Overview

The Content Software for File system supports the following types of networking technologies:

  • InfiniBand (IB)
  • Ethernet‌

The currently-available networking infrastructure dictates the choice between the two. If a Content Software for File cluster is connected to both infrastructures, it is possible to connect Content Software for File clients from both networks to the same cluster.

Content Software for File system networking can be configured in one of two ways: performance-optimized, where CPU cores are dedicated to Content Software for File and DPDK networking is used, or CPU-optimized, where cores are not dedicated and either DPDK (when supported by the NIC drivers) or in-kernel networking (UDP mode) is used.

Performance-optimized networking (DPDK)

For performance-optimized networking, the Content Software for File system does not use standard kernel-based TCP/IP services, but a proprietary infrastructure based on the following:

  • Use of DPDK to map the network device into user space and use it without any context switches and with zero-copy access. Bypassing the kernel stack eliminates the consumption of kernel resources for networking operations and scales across multiple hosts. It applies to both backend and client hosts and enables the Content Software for File system to fully saturate 200 Gb/s links.
  • Implementation of a proprietary Content Software for File protocol over UDP, meaning that the underlying network may involve routing between subnets or any other networking infrastructure that supports UDP.

The use of DPDK delivers operations with extremely low latency and high throughput. Low latency is achieved by bypassing the kernel and sending and receiving packets directly from the NIC. High throughput is achieved because multiple cores in the same host can work in parallel, without a common bottleneck.

Before proceeding, it is important to understand several key terms used in this section, namely DPDK and SR-IOV.

DPDK

Data Plane Development Kit (DPDK) is a set of libraries and network drivers for highly efficient, low latency packet processing. This is achieved through several techniques, such as kernel TCP/IP bypass, NUMA locality, multi-core processing, and device access via polling to eliminate the performance overhead of interrupt processing. In addition, DPDK ensures transmission reliability, handles retransmission, and controls congestion.

DPDK implementations are available from several sources. OS vendors such as Red Hat and Ubuntu provide their own DPDK implementations through their distribution channels. The Mellanox OpenFabrics Enterprise Distribution for Linux (Mellanox OFED), a suite of libraries, tools, and drivers supporting Mellanox NICs, offers its own DPDK implementation.

SR-IOV

Single Root I/O Virtualization (SR-IOV) is an extension to the PCI Express (PCIe) specification that enables PCIe virtualization. It works by allowing a PCIe device, such as a network adapter, to appear as multiple PCIe devices, or functions. There are two categories of functions: Physical Function (PF) and Virtual Function (VF). A PF is a full-fledged PCIe function that can also be used for configuration. A VF is a virtualized instance of the same PCIe device, created by sending appropriate commands to the device's PF. Typically, there are many VFs but only one PF per physical PCIe device. Once a new VF is created, it can be mapped by an object such as a virtual machine, a container, or, in the Content Software for File system, by a 'compute' process.

Both the software and the hardware must support SR-IOV in order to take advantage of it. Software support is included in the Linux kernel, as well as in the Content Software for File system software. Hardware support is provided by the computer BIOS and the network adapter, but is usually disabled from the factory. Consequently, it should be enabled before installing the Content Software for File system software.
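
As a general Linux illustration (not specific to Content Software for File, and with an example interface name), the number of VFs exposed by a PF can usually be inspected and set through sysfs:

# General Linux illustration: inspect and (as root) set the number of SR-IOV
# VFs for a physical function. "ens1f0" is an example interface name.
from pathlib import Path

pf_device = Path("/sys/class/net/ens1f0/device")
total_vfs = int((pf_device / "sriov_totalvfs").read_text())
print(f"PF supports up to {total_vfs} VFs")

# (pf_device / "sriov_numvfs").write_text("4")   # would create 4 VFs; requires root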

CPU-optimized networking

For CPU-optimized networking, Content Software for File can yield CPU resources to other applications. This is useful when the extra CPU cores are needed for other purposes. However, not dedicating CPU resources to the Content Software for File system comes at the expense of reduced overall performance.

DPDK without core dedication

For CPU-optimized networking, when mounting filesystems using stateless clients, it is possible to use DPDK networking without dedicating cores. This mode is recommended when available and supported by the NIC drivers. In this mode, DPDK networking uses RX interrupts instead of dedicated cores.

Note: This mode is supported by most NIC drivers, but not all; consult https://doc.dpdk.org/guides-18.11/nics/overview.html for compatibility.

AWS (ENA drivers) does not support this mode; for CPU-optimized networking in AWS, use UDP mode.

UDP mode

Content Software for File can also use in-kernel processing and UDP as the transport protocol. This mode of operation is commonly referred to as the 'UDP mode'.

Since UDP mode uses in-kernel processing, it is compatible with older platforms lacking support for kernel-offloading technologies (DPDK) or virtualization (SR-IOV), such as legacy hardware like the Mellanox CX3 family of NICs.

Data lifecycle management

This section describes the principles of data lifecycle management and how data storage is managed in SSD-only and tiered Content Software for File system configurations.

Media options for data storage in the Content Software for File system

In the Content Software for File system, data can be stored on two forms of media:

  1. On locally-attached SSDs, which are an integral part of the Content Software for File system configuration.
  2. On object-store systems external to the Content Software for File system, which are either third-party solutions, cloud services, or part of the Content Software for File system.

The Content Software for File system can be configured either as an SSD-only system or as a data management system consisting of both SSDs and object stores. By nature, SSDs provide high-performance, low-latency storage, while object stores compromise performance and latency but are the most cost-effective storage solution available. Consequently, users focused purely on high performance should consider an SSD-only Content Software for File system configuration, while users seeking to balance performance and cost should consider a tiered data management system, with the assurance that the Content Software for File system controls the allocation of hot data on SSDs and warm data on object stores, thereby optimizing both the overall user experience and the budget.

Note: In SSD-only configurations, the Content Software for File system will sometimes use an external object store for backup, as explained in Snap-To-Object Data Lifecycle Management.

Guidelines for data storage in tiered Content Software for File system configurations

In tiered Content Software for File system configurations, there are various locations for data storage as follows:

  1. Metadata is stored only on SSDs.
  2. Writing new files, adding data to existing files, or modifying the content of files is always performed on the SSD, irrespective of whether the file is currently stored on the SSD or tiered to an object store.
  3. When reading the content of a file, data can be accessed from either the SSD (if it is available on the SSD) or rehydrated from the object store (if it is not available on the SSD).

This data management approach to data storage on one of two possible media requires system planning to ensure that the most commonly used data (hot data) resides on the SSD for high performance, while less-used data (warm data) is stored on the object store. In the Content Software for File system, this determination of the data storage media is a completely seamless, automatic, and transparent process; users and applications are unaware of the transfer of data from SSDs to object stores, or from object stores to SSDs. The data is accessible at all times through the same strongly-consistent POSIX filesystem API, irrespective of where it is stored. Only latency, throughput, and IOPS are affected by the actual storage media.

Furthermore, the Content Software for File system tiers data in chunks rather than complete files. This enables smart tiering of subsets of a file (and not only complete files) between SSDs and object stores.

The network resources allocated to the object store connections can be controlled. This enables cost control when using cloud-based object storage services since the cost of data stored in the cloud depends on the quantity stored and the number of requests for access made.

States in the Content Software for File system data management storage process

Data management represents the media being used for the storage of data. In tiered Content Software for File system configurations, data can exist in one of three possible states:

  1. SSD-only: When data is created, it exists only on the SSDs.
  2. SSD-cached: A tiered copy of the data exists on both the SSD and the object store.
  3. Object Store only: Data resides only on the object store.
Note: These states represent the lifecycle of data, and not the lifecycle of a file. When a file is modified, each modification creates a separate data lifecycle for the modified data.
(Figure: Data lifecycle diagram)

The Data Lifecycle Diagram represents the transitions of data between the above states. #1 represents the Tiering operation, #2 represents the Releasing operation, and #3 represents the Rehydrating operation:

  1. Tiering of data from the SSD to create a replica in the object store. Tiering is guided by a user-defined, time-based policy, the Tiering Cue.
  2. Releasing data from the SSD, leaving only the object store copy (based on the demand for more space for data on the SSD). Release is guided by a user-defined, time-based policy, the Retention Period.
  3. Rehydrating data from the object store to the SSD for the purpose of data access.

In order to read data residing only on an object store, the data must first be rehydrated back to the SSD.

In the Content Software for File system, file modification is never implemented as in-place write, but rather as a write to a new area located on the SSD, and the relevant modification of the meta-data. Consequently, write operations are never associated with object store operations.

The role of SSDs in tiered Content Software for File configurations

All writing in the Content Software for File system is performed to SSDs. The data residing on SSDs is hot data, that is data that is currently in use. In tiered Content Software for File configurations, SSDs have three primary roles in accelerating performance: metadata processing, a staging area for writing, and as a cache for read performance.

Metadata processing

Since filesystem metadata is by nature a large number of update operations each with a small number of bytes, the embedding of metadata on SSDs serves to accelerate file operations in the Content Software for File system.

SSD as a staging area

Writing directly to an object store incurs high latency while waiting for acknowledgment that the data has been written, so the Content Software for File system never writes directly to object stores. Instead, writes go to the SSDs, with very low latency and therefore much better performance. Consequently, the SSDs serve as a staging area, providing a buffer large enough to hold data until it is later tiered to the object store. On completion of writing, the Content Software for File system is responsible for tiering the data to the object store and for releasing it from the SSD.

SSD as a cache

Recently accessed or modified data is stored on SSDs, and most read operations access such data and are served from SSDs. This is based on a single, large least-recently-used (LRU) cache eviction policy that ensures optimal read performance.

Note: On a tiered filesystem, the total capacity determines the maximum capacity that will be used to store data. Because of the SSD roles described above and the time-based policies described below, it is possible for all of that data to reside on the object store.

For example, consider a 100 TB filesystem (total capacity) with 10 TB of SSD capacity. It is possible for all the data to reside on the object store, in which case no new writes are allowed even though the SSD space is not completely used (until files are deleted or the filesystem total size is increased), leaving the SSD for metadata and cache only.

Time-based policies for the control of a data storage location

The Content Software for File system includes user-defined policies which serve as guidelines to control the data storage management. They are derived from a number of factors:

  1. The rate at which data is written to the system and the quantity of data.
  2. The capacity of the SSDs configured to the Content Software for File system.
  3. The speed of the network between the Content Software for File system and the object store, and the performance capabilities of the object store itself, for example, how much the object store can actually contain.

These policies are defined per filesystem group, and a tiered filesystem is placed in a filesystem group according to the desired policy.

For tiered filesystems, the following parameters should be defined per filesystem:

  1. The size of the filesystem.
  2. The amount of filesystem data to be stored on the SSD.

The following parameters should be defined per filesystem group:

  1. The Data Retention Period policy: a time-based policy defining the target time for data to be stored on an SSD after creation, modification, or access, before being released from the SSD even if it is already tiered to the object store, for metadata processing and SSD caching purposes. (This is only a target; the actual release schedule depends on the amount of available space.)
  2. The Tiering Cue policy: a time-based policy defining the minimum amount of time that data remains on an SSD before it is considered for tiering to the object store. As a rule of thumb, configure this to a third of the Retention Period; in most cases this works well. The Tiering Cue is important because it is pointless to tier a file that is about to be modified or deleted.

For example, when writing log files which are processed every month but retained forever, it is recommended to define a Retention Period of 1 month, a Tiering Cue of 1 day, and ensure that there is sufficient SSD capacity to hold 1 month of log files.

When storing genomic data which is frequently accessed during the first 3 months after creation, requires a scratch space for 6 hours of processing, and requires output to be retained forever: It is recommended to define a Retention Period of 3 months and to allocate an SSD capacity that will be sufficient for 3 months of output data and the scratch space. The Tiering Cue should be defined as 1 day, in order to avoid a situation where the scratch space data is tiered to an object store and released from the SSD immediately afterwards.
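
A back-of-the-envelope sizing calculation for the log-file example above; the daily ingest rate below is a made-up figure used only for illustration.

# Illustrative sizing only: SSD capacity needed to hold one Retention Period
# of log data. The ingest rate is an assumed example value.
daily_log_ingest_tb = 0.2     # assumed: 0.2 TB of new logs per day
retention_days = 30           # Retention Period of roughly one month

ssd_capacity_needed_tb = daily_log_ingest_tb * retention_days
print(f"SSD capacity for the retention window: ~{ssd_capacity_needed_tb:.1f} TB")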

Note: Using the Snap-To-Object feature causes data to be tiered regardless of the tiering policies. Snap-To-Object enables all the data of a specific snapshot (including metadata and every file) to be committed to an object store.

Bypassing the time-based policies

Regardless of the time-based policies, it is possible to use the special mount option obs_direct to bypass them. Any file created or written from a mount point with this option is marked for release as soon as possible, before the retention policy of other files is taken into account.

For more information, see Advanced Data Lifecycle Management.

Content Software for File client and mount modes

This section describes the Content Software for File system client and the possible mount modes of operation in relation to the page cache.

The Content Software for File system client

The Content Software for File system client is a standard, POSIX-compliant filesystem driver installed on application servers that enables file access to the filesystems. Like any other filesystem driver, the client intercepts and executes all filesystem operations. This enables the Content Software for File system to provide applications with local filesystem semantics and performance (as opposed to NFS mounts) while providing centrally managed, shareable, resilient storage.

The Content Software for File system client is tightly integrated with the Linux operating system page cache, which is a transparent caching mechanism that stores parts of the filesystem content in the client host RAM. The operating system maintains a page cache in unused RAM capacity of the application server, delivering quick access to the contents of the cached pages and overall performance improvements.

The page cache is implemented in the Linux kernel and is fully transparent to applications. All physical memory not directly allocated to applications is used by the operating system for the page cache. Since the memory would otherwise be idle and is easily reclaimed when requested by applications, there is usually no associated performance penalty and the operating system might even report such memory as "free" or "available". For a more detailed description of the page cache, see Page Cache, the Affair Between Memory and Files.

The Content Software for File client can control the information stored in the page cache and also invalidate it, if necessary. Consequently, the system can utilize the page cache for cached high-performance data access while maintaining data consistency across multiple hosts.

Each filesystem can be mounted in one of two modes of operation in relation to the page cache:

  • Read cache

    Only read operations use the page cache; file data is coherent across hosts and resilient to client failures.

  • Write cache (default)

    Both read and write operations use the page cache, while keeping data coherent across hosts; this provides the highest data performance.

Note: Symbolic links are always cached in all cache modes.
Note: Unlike actual file data, file metadata is managed in the Linux operating system by the dentry (directory entry) cache, which maximizes efficiency in handling directory entries but is not strongly consistent across Content Software for File client hosts. At the cost of some performance, metadata can be made strongly consistent by mounting without the dentry cache (using the dentry_max_age_positive=0 and dentry_max_age_negative=0 mount options) if metadata consistency is critical for the application.

Read cache mount mode

When mounting in this mode, the page cache operates in write-through mode, so any write is acknowledged to the application only after being safely stored on resilient storage. This applies to both data and metadata writes. Consequently, only read operations are accelerated by the page cache.

In the Content Software for File system, by default, any data read or written by applications is stored in the read page cache of the local host. As a shareable filesystem, the Content Software for File system monitors whether another host tries to read or write the same data and, if necessary, invalidates the cache entry. Such invalidation may occur in two cases:

  • If a file that is being written by one client host is currently being read or written by another client host.
  • If a file that is being read by one host is currently being written from another host.

This mechanism ensures coherence, providing the Content Software for File system with full page cache utilization whenever only a single host or multiple hosts access a file for read-only purposes. If multiple hosts access a file and at least one of them is writing to the file, the page cache is not used and any IO operation is handled by the backends. Conversely, when either a single host or multiple hosts open a file for read-only purposes, the page cache is fully utilized by the Content Software for File client, enabling read operations from memory without accessing the backend hosts.

Note: A host is defined as writing to a file on the actual first write operation, and not based on the read/write flags of the open system call.
Note: In some scenarios, particularly random reads of small blocks of data from large files, enabling the read cache can create read amplification due to the Linux operating system prefetch mechanism. For details about this scenario, see Understanding the bdi identifier.

Write cache mount mode (default)

In this mount mode, the Linux page cache is used in write-back mode rather than write-through; that is, the write operation is acknowledged immediately by the Content Software for File client and is committed to resilient storage as a background operation.

This mode can provide significantly higher performance, particularly for write latency, while keeping data coherent: if a file is accessed through another host, the local cache is invalidated and the data is synced to provide a coherent view of the file.

To sync the filesystem and commit all changes in the write cache, use the following system calls: sync, syncfs, and fsync.
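
For example, an application can force its writes out of the write cache with the standard calls named above; the mount path below is hypothetical and used only for illustration.

# Standard POSIX/Python calls for committing write-cache data; the path is a
# hypothetical mount point.
import os

with open("/mnt/csf/results.dat", "wb") as f:
    f.write(b"important results")
    f.flush()              # push Python's buffer to the kernel
    os.fsync(f.fileno())   # commit this file's data to resilient storage
os.sync()                  # flush all remaining dirty data system-wide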

Multiple mounts on a single host

The Content Software for File client supports multiple mount points of the same file system on the same host, even with different mount modes. This can be effective in environments such as containers where different processes in the host need to have different definitions of read/write access or caching schemes.

Note: Two mounts on the same host are treated as two different hosts with respect to cache consistency, as described above. For example, two mounts on the same host, both mounted in write cache mode, might have different data at the same point in time.

Key terms

  • Agent: The Content Software for File agent is software installed on user application servers that need access to the Content Software for File file services. When using the Stateless Client feature, the agent is responsible for ensuring that the correct client software version is installed (depending on the cluster version) and that the client connects to the correct cluster.
  • Backend Host: A host that runs the Content Software for File software and is installed with SSD drives dedicated to the Content Software for File system, providing services to client hosts. A group of backend hosts forms a storage cluster.
  • Client: The Content Software for File client is software installed on user application servers that need access to Content Software for File file services. The Content Software for File client implements a kernel-based filesystem driver and the logic and networking stack to connect to the Content Software for File backend hosts and be part of a cluster. In general industry terms, "client" may also refer to an NFS, SMB, or S3 client that uses those protocols to access the Content Software for File filesystem. For NFS, SMB, and S3, the Content Software for File client is not required to be installed in conjunction with those protocols.
  • Cluster: A collection of Content Software for File backend hosts, together with Content Software for File clients installed on the application servers, forming one sharable, distributed, and scalable file storage system.
  • Container: Content Software for File uses Linux containers (LXC) as the mechanism for holding one node or keeping multiple nodes together. Containers can have different nodes within them. They can have frontend nodes and associated DPDK libraries within the container, or backend nodes, drive nodes, management nodes, and DPDK libraries, or can have NFS, SMB, or S3 service nodes running within them. A host can have multiple containers running on it at any time.
  • Converged Deployment: A Content Software for File configuration in which Content Software for File backend nodes run on the same host as applications.
  • Data Retention Period: The target period of time for tiered data to be retained on an SSD.
  • Data Stripe Width: The number of data blocks in each logical data protection group.
  • Dedicated Deployment: A Content Software for File configuration that dedicates complete servers and all of their allocated resources to Content Software for File backends, as opposed to a converged deployment.
  • Failure Domain: A collection of hardware components that can fail together due to a single root cause.
  • Filesystem Group: A collection of filesystems that share a common tiering policy to an object store.
  • Frontend: The collection of Content Software for File software that runs on a client and accesses storage services and IO from the Content Software for File storage cluster. The frontend consists of a frontend node that delivers IO to the Content Software for File driver, a DPDK library, and the Content Software for File POSIX driver.
  • Hot Data: Frequently used data (as opposed to warm data), usually residing on SSDs.
  • Host: A physical or virtual server that has hardware resources allocated to it and software running on it that provides compute or storage services. Content Software for File uses backend hosts in conjunction with clients to deliver storage services. In general industry terms, in a cluster of hosts, "nodes" is sometimes used instead.
  • Net Capacity: The amount of space available for user data on SSDs in a configured Content Software for File system.
  • Node: A software instance that Content Software for File uses to run and manage WekaFS. Nodes are dedicated to managing different functions, such as (1) NVMe drives and IO to the drives, (2) compute nodes for filesystems, cluster-level functions, and IO from clients, (3) frontend nodes for POSIX client access and sending IO to the backend nodes, and (4) management nodes for managing the overall cluster. In general industry terms, a node may also be referenced as a discrete component in a hardware or software cluster. Sometimes, when referring to hardware, the term host may be used instead.
  • POSIX: The Portable Operating System Interface (POSIX) is a family of standards specified by the IEEE for maintaining compatibility between operating systems.
  • Provisioned Capacity: The total capacity that is assigned to filesystems. This includes both SSD and object store capacity.
  • Prefetch: The Content Software for File process of rehydrating data from an object store to an SSD, based on a prediction of future data access.
  • Raw Capacity: Total SSD capacity owned by the user.
  • Retention Period: The target time for data to be stored on SSDs before being released from the SSDs to an object store.
  • Releasing: The deletion of the SSD copy of data that has been tiered to the object store.
  • Rehydrating: The creation of an SSD copy of data stored only on the object store.
  • Server: In Content Software for File terms, a physical or virtual instantiation of hardware on which software runs and provides compute or storage services. In general industry terms, a server may also refer to a software process that provides a service to another process, whether on the same host or to a client (e.g., NFS server, SMB server).
  • Stem Mode: A mode in which the Content Software for File software has been installed and is running, but has not been attached to a cluster.
  • Snap-To-Object: A Content Software for File feature for uploading snapshots to object stores.
  • Tiered Content Software for File Configuration: A Content Software for File configuration consisting of SSDs and object stores for data storage.
  • Tiering: Copying of data to an object store while it still remains on the SSD.
  • Tiering Cue: The minimum time to wait before considering data for tiering from an SSD to an object store.
  • Unprovisioned Capacity: The storage capacity that is available for new filesystems.
  • VF: Virtual Function.
  • Warm Data: Less frequently used data (as opposed to hot data), usually residing on an object store.

 
