This section of the Help contains usage considerations that affect namespace access in general.
Choosing an access protocol
The protocol you choose to use to access a namespace depends on a variety of factors. Some factors have to do with the protocols themselves, others with the environment in which you’re working. For example, your client operating system may dictate the choice of protocol. Or, you may need new applications to be compatible with existing applications that already use a given protocol.
In terms of performance, HTTP is the fastest protocol and WebDAV is a close second. Both are suitable for transferring large amounts of data. CIFS and NFS are significantly slower than HTTP and WebDAV.
With both HTTP and WebDAV:
- Client libraries are available for many different programming languages.
- You can store custom metadata in the namespace.
- You can use SSL security for data transfers. The namespace configuration determines whether this feature is available.
- You can retrieve object data by byte ranges.
- Each operation can be completed in a single transaction, which provides better performance.
- You can override metadata defaults when you add an object to the namespace.
- HCP automatically creates any new directories in the paths for objects you store in the namespace.
- You can change object ownership.
- You can add, replace, or delete ACLs for objects.
- Some operations on directories, such as COPY, MOVE, and DELETE, are performed in a single call.
- You can recursively delete a directory and its subdirectories.
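As a rough illustration of byte-range retrieval over HTTP, the sketch below builds a GET request that carries a standard HTTP Range header. The object URL and the function name are hypothetical, and authentication is left out entirely; a real request also needs whatever credentials your namespace configuration requires.

```python
import urllib.request


def build_range_request(url: str, first_byte: int, last_byte: int) -> urllib.request.Request:
    """Build a GET request for bytes first_byte..last_byte (both inclusive)."""
    if first_byte < 0 or last_byte < first_byte:
        raise ValueError("invalid byte range")
    request = urllib.request.Request(url, method="GET")
    # Standard HTTP Range header; both offsets are inclusive.
    request.add_header("Range", f"bytes={first_byte}-{last_byte}")
    return request


# Hypothetical object URL; substitute a real namespace URL.
req = build_range_request("https://finance.hcp.example.com/rest/reports/q1.pdf", 0, 1023)
print(req.get_header("Range"))
```

The request is only constructed here, not sent, so the sketch can be tried without access to an HCP system.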
With CIFS and NFS:
- You get file-system semantics.
- Multiple concurrent threads can write to the same object.
- CIFS and NFS have lazy close.
- Performance degrades when write operations target directories with large numbers of objects (greater than 100,000).
- You need to use multiple mounts of a namespace to have HCP spread the load across the nodes in the system.
Hostname and IP address considerations
You can access a namespace by specifying either the namespace hostname or an IP address. If your HCP system supports DNS and you specify the hostname, HCP selects the IP address for you from the currently available nodes. HCP uses a round-robin method to ensure that it doesn’t always select the same address.
When you specify IP addresses, your application must take responsibility for balancing the load among nodes. Also, you risk trying to connect (or reconnect) to a node that is not available. However, in several cases using explicit IP addresses to connect to specific nodes can have advantages over using hostnames.
These considerations apply when deciding which technique to use:
- You may be able to improve the performance of GET requests for an object if you use the IP address of a node on which the object is stored in the request URL.
- If your client uses a hosts file to map HCP hostnames to IP addresses, the client system has full responsibility for converting any hostnames to IP addresses. Therefore, HCP cannot spread the load or prevent attempts to connect to an unavailable node.
- If your client caches DNS information, connecting by hostname may result in the same node being used repeatedly.
- When you access the HCP system by hostname, HCP ensures that requests are distributed among nodes, but it does not ensure that the resulting loads on the nodes are evenly balanced.
- When multiple applications access the HCP system by hostname concurrently, HCP is less likely to spread the load evenly across the nodes than with a single application.
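When an application connects by IP address, it has to rotate among nodes itself and avoid nodes that are down. The following is a minimal sketch of one way to do that, assuming the application already has a list of node IP addresses and some way to test availability. The function name and the `is_available` callback are hypothetical and not part of HCP.

```python
import itertools
from typing import Callable, Iterable, Iterator


def rotate_nodes(node_ips: Iterable[str], is_available: Callable[[str], bool]) -> Iterator[str]:
    """Yield node IP addresses in round-robin order, skipping unavailable nodes."""
    ips = list(node_ips)
    if not ips:
        raise ValueError("no node IP addresses given")
    cycle = itertools.cycle(ips)
    while True:
        # Try each node at most once per selection.
        for _ in range(len(ips)):
            ip = next(cycle)
            if is_available(ip):
                yield ip
                break
        else:
            raise RuntimeError("no node is currently available")


nodes = ["192.168.210.16", "192.168.210.17", "192.168.210.18"]
down = {"192.168.210.17"}  # pretend this node is unavailable
picker = rotate_nodes(nodes, lambda ip: ip not in down)
picks = [next(picker) for _ in range(4)]
print(picks)
```

Simple rotation spreads requests across nodes but, as with hostname access, it does not guarantee that the resulting loads are evenly balanced.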
Using a hosts file
Typically, HCP is included as a subdomain in your DNS. If this is not the case, to access the system you need to use the tenant domain name in the URL and use a hosts file to map one or more node IP addresses to that domain name.
The location of the hosts file depends on the client operating system:
- On Windows, by default: c:\windows\system32\drivers\etc\hosts
- On Unix: /etc/hosts
- On Mac OS® X: /private/etc/hosts
Each entry in a hosts file is a mapping of an IP address to a hostname. For an HCP tenant, the hostname must be the fully qualified domain name (FQDN) for the tenant.
Each hosts file entry you create for access to a tenant must include:
- An IP address of a node in the HCP system
- The FQDN of the tenant domain
For example, if the tenant domain name is finance.hcp.example.com and one of the HCP nodes has the IPv4 address 192.168.210.16 and the IPv6 address 2001:0db8::101, you could add either or both of these lines to the hosts file on the client:
192.168.210.16 finance.hcp.example.com
2001:0db8::101 finance.hcp.example.com
You can include comments in a hosts file either on separate lines or following a mapping on the same line. Each comment must start with a number sign (#). Blank lines are ignored.
In the hosts file, you can map IP addresses for any number of nodes to a single domain name. The way a client uses multiple IP address mappings for a single domain name depends on the client platform. For information about how your client handles hosts file entries that define multiple IP address mappings for a single domain name, see your client documentation.
If any of the HCP nodes listed in the hosts file are unavailable, timeouts may occur when you use a hosts file to access the system through the management API.
Here’s a sample hosts file that contains mappings for the Finance tenant for nodes with both IPv4 and IPv6 addresses:
192.168.210.16 finance.hcp.example.com
192.168.210.17 finance.hcp.example.com
192.168.210.18 finance.hcp.example.com
192.168.210.19 finance.hcp.example.com
2001:0db8::101 finance.hcp.example.com
2001:0db8::102 finance.hcp.example.com
2001:0db8::103 finance.hcp.example.com
2001:0db8::104 finance.hcp.example.com
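To check which node addresses a hosts file maps to a tenant domain name, a small parser along the following lines may help. This is a sketch that follows the entry and comment rules described above; the function name is hypothetical, and how a client actually chooses among multiple mappings still depends on the client platform.

```python
from collections import defaultdict


def parse_hosts(text: str) -> dict[str, list[str]]:
    """Map each hostname to the IP addresses listed for it, in file order."""
    mappings: dict[str, list[str]] = defaultdict(list)
    for line in text.splitlines():
        # Drop a trailing or whole-line comment, then surrounding whitespace.
        entry = line.split("#", 1)[0].strip()
        if not entry:
            continue  # blank line or comment-only line
        ip, *hostnames = entry.split()
        for name in hostnames:
            mappings[name.lower()].append(ip)
    return dict(mappings)


sample = """\
# Finance tenant nodes
192.168.210.16 finance.hcp.example.com
192.168.210.17 finance.hcp.example.com  # second node

2001:0db8::101 finance.hcp.example.com
"""
finance_ips = parse_hosts(sample)["finance.hcp.example.com"]
print(finance_ips)
```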
Object naming considerations
When naming objects, directories, and symbolic links, keep these considerations in mind:
- The name of each item must conform to POSIX naming conventions. In particular:
- Object names are case sensitive.
- Object names can include nonprinting characters, such as spaces and line breaks.
- All characters are valid except the NULL character (ASCII 0 (zero)) and the forward slash (ASCII 47 (/)), which is the separator character in directory paths.
- Object names cannot consist of a single period (.) or a single forward slash (/).
- The client operating system, in conjunction with HCP, ensures that object specifications are converted, as needed, to conform to POSIX requirements (for example, when using CIFS, backslashes (\) are converted to forward slashes (/)).
- .directory-metadata is a reserved name.
- The maximum length for the combined directory path and name of an object, symbolic link, or metafile, starting below rest, data, or metadata, including separators, is 4,095 bytes.
- For CIFS and NFS, the maximum length of an individual item name is 255 bytes. This applies not only to naming new objects but also to retrieving existing objects. Therefore, an object stored through HTTP or WebDAV with a longer name may not be accessible through CIFS or NFS.
- Some character-set encoding schemes, such as UTF-8, can require more than one byte to encode a single character. As a result, such encoding can invisibly increase the length of a full object specification (directory path and object name), causing it to exceed the HCP limit of 4,095 bytes.
Note: In some cases, an extremely long object name may prevent a CIFS or NFS client from reading the entire directory that contains the object. When this happens, attempts to list the contents of the directory result in an error.
- When searching namespaces, HDDS and HCP rely on UTF-8 encoding conventions to find objects by name. If the name of an object is not UTF-8 encoded, searches for the object by name may return unexpected results.
- When the metadata query engine or HCP search facility indexes an object with a name that includes certain characters that cannot be UTF-8 encoded, it percent-encodes those characters. Searches for such objects by name must explicitly include the percent-encoded characters in the name.
- Names for email objects stored through SMTP are system-generated.
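Several of the naming rules above can be checked on the client before an object is stored. The sketch below is one possible pre-flight check, assuming names are UTF-8 encoded and that the path passed in starts below rest, data, or metadata. The function name and the problem messages are hypothetical, and the check covers only the limits stated in this section.

```python
MAX_PATH_BYTES = 4095    # combined path and name limit stated above
MAX_ITEM_BYTES_FS = 255  # per-item limit for CIFS and NFS stated above


def check_object_path(path: str, for_cifs_nfs: bool = False) -> list[str]:
    """Return a list of naming problems found in the given path."""
    problems = []
    # Lengths are measured in bytes, not characters.
    if len(path.encode("utf-8")) > MAX_PATH_BYTES:
        problems.append("path exceeds 4,095 bytes when UTF-8 encoded")
    for item in path.strip("/").split("/"):
        if item == ".":
            problems.append("item name is a single period")
        elif item == ".directory-metadata":
            problems.append("item name is reserved")
        if "\x00" in item:
            problems.append("item name contains the NULL character")
        if for_cifs_nfs and len(item.encode("utf-8")) > MAX_ITEM_BYTES_FS:
            problems.append("item name exceeds 255 bytes for CIFS and NFS")
    return problems


# 200 characters that each take two bytes in UTF-8, so 400 bytes in total.
long_name = "\u00e9" * 200
print(check_object_path("reports/" + long_name, for_cifs_nfs=True))
```

The example name is only 200 characters long, yet it is flagged for CIFS and NFS because its UTF-8 encoding takes 400 bytes.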
Naming conventions for email objects
HCP handles email objects the same way it handles other objects, except that for email stored through SMTP, HCP automatically generates directory paths and object names. It generates the paths directly under a parent directory that’s specified in the namespace configuration. To learn the parent directory path, contact your tenant administrator.
The generated path and object name for email stored using SMTP consist of, in order:
- The email path specified in the namespace configuration, ending with a forward slash (/):
- A system-generated numeric ID followed by a forward slash (/):
- The date and time the email was stored, in this format, followed by a hyphen (-):
- An internally generated message ID followed by a hyphen:
- A repeat of the system-generated numeric ID followed by a hyphen:
- A counter to ensure that all objects stored in the same millisecond have unique names, followed by an at sign (@):
- The domain name of the sender contained in the From field of the mail header, followed by a hyphen (-):
- The email suffix specified in the namespace configuration:
Here’s the complete path and object name for a sample email message:
The message ID that the mail server generates for an email ingested through the SMTP protocol can include one or more forward slashes (/) or colons (:). Before storing an email, HCP replaces each such slash or colon with a hyphen (-).
The namespace can be configured to store each email together with or separately from its attachments, if any. When stored together, the result is the single email object named as described above.
When stored separately, each attachment is in the same directory as the email object. The name of the attachment object is formed from the name of the email object (without the suffix) concatenated with a hyphen (-) and the name of the attached file.
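The two naming rules just described, replacing slashes and colons in the message ID and deriving attachment names from the email object name, can be sketched as follows. The function names, the `.eml` suffix, and the sample names are hypothetical; the real suffix comes from the namespace configuration, and HCP performs these steps itself.

```python
def sanitize_message_id(message_id: str) -> str:
    """Replace each forward slash or colon with a hyphen."""
    return message_id.replace("/", "-").replace(":", "-")


def attachment_object_name(email_object_name: str, email_suffix: str, attached_file: str) -> str:
    """Join the email object name (without its suffix) and the attached file name."""
    if email_suffix and email_object_name.endswith(email_suffix):
        base = email_object_name[: -len(email_suffix)]
    else:
        base = email_object_name
    return f"{base}-{attached_file}"


# Made-up values for illustration only, not real HCP-generated names.
print(sanitize_message_id("a/b:c"))
print(attachment_object_name("msg-0001.eml", ".eml", "notes.pdf"))
```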
Here’s an example of the complete path and object names that result from storing two attachments separately from the email with which they arrive:
- First attachment:
/rest/email/365/2013/03/02/17/12/17-12-29.522-4FE72776firstname.lastname@example.org-Wetlands Guidelines 2011-10-01.pdf
- Second attachment:
Because of the way HCP stores objects, the directory structures you create and the way you store objects in them can have an impact on performance. When creating a namespace, you must know how the namespace is intended to be used. Following are some guidelines for effective directory usage:
Balanced directory usage
- Plan your directory structures before storing objects. Make sure all namespace users are aware of these plans.
- Avoid structures that result in a single directory getting a large amount of traffic in a short time. For example, if you ingest objects rapidly, use structures that do not store objects by date and time.
- If you do store objects by date and time, consider the number of objects ingested during a given period of time when planning the directory structure. For example, if you ingest several hundred files per second, you might use a directory structure such as year/month/day/hour/minute/second. If you ingest just a few files per second, a less fine-grained structure would be better.
- Follow these guidelines on directory depth and size:
- Try to balance the namespace directory tree width and depth.
- Do not create directory structures that are more than 20 levels deep. Instead, create flatter directory structures.
- Avoid placing a large number of objects (greater than 100,000) in a single directory. Instead, create multiple directories and evenly distribute the objects among them.
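One common way to distribute objects evenly among multiple directories is to derive the directory from a hash of the object name. This technique is not prescribed by HCP; it is shown here only as a sketch of how an application might keep any single directory from growing too large. The function name and the default of two levels with two hexadecimal characters each are assumptions.

```python
import hashlib


def bucketed_path(object_name: str, levels: int = 2, width: int = 2) -> str:
    """Prefix object_name with directories derived from a hash of the name."""
    digest = hashlib.sha256(object_name.encode("utf-8")).hexdigest()
    # Each level uses `width` hex characters, giving 16**width directories per level.
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return "/".join(parts + [object_name])


path = bucketed_path("invoice-000123.pdf")
print(path)
```

With the defaults, objects spread across 256 x 256 = 65,536 leaf directories only two levels deep, which stays well within the depth guideline above. The trade-off is that the resulting paths carry no meaning for people browsing the namespace.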
Unbalanced directory usage
Avoid numerous nested directories, for example, a structure where each object is stored in its own directory. You must configure the namespace to use unbalanced directory usage if one or more of the following are true:
- Your application uses a flat directory structure.
- Your object access or usage is focused on a limited number of directories, or it is not possible to ensure a balanced directory structure.
- Your application needs more unstructured or semi-structured access to your data.
Multiple objects with the same content should all have the same shred setting. If they don't and you delete the objects, each object is shredded or not shredded according to its shred setting.
The namespace can contain objects that are not WORM:
- Objects that are open for write and have no data are not WORM.
- Objects left by certain failed write operations are not WORM.
Objects that are not WORM are not subject to retention. You can delete these objects or overwrite them without first deleting them.
Moving or renaming objects
You cannot move or rename an object in an HCP namespace. If a client tries either of these operations, the operation fails.
If this occurs, many clients automatically try to copy and delete the object instead. (This is how the HCP WebDAV MOVE method works.) If deletion is not allowed (for example, because the object is under retention), the original object remains in place, regardless of whether the copy is created.
When a copy is created and the original object is deleted, the move or rename operation appears to have been successful.
Deleting objects under repair
If you try to delete an object while HCP is repairing it, HCP returns an error response, and the object is not deleted. For HTTP and WebDAV, the return value is an HTTP 409 (Conflict) error code, and for CIFS and NFS, the request may time out. When you get such errors, wait a few minutes and then try the request again.
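The wait-and-retry advice for HTTP and WebDAV can be sketched as a small helper. It assumes the caller supplies a function that sends the DELETE request and returns the HTTP status code, and it assumes that a 409 here means the object is under repair. The function name, the attempt count, and the three-minute default wait are hypothetical choices.

```python
import time
from typing import Callable


def delete_with_retry(
    send_delete: Callable[[], int],
    attempts: int = 3,
    wait_seconds: float = 180.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send_delete until it returns a status other than 409 or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    status = send_delete()
    for _ in range(attempts - 1):
        if status != 409:
            break
        sleep(wait_seconds)  # the object may still be under repair
        status = send_delete()
    return status


# Simulated responses: two conflicts, then success. No real waiting is done.
responses = iter([409, 409, 200])
final_status = delete_with_retry(lambda: next(responses), sleep=lambda seconds: None)
print(final_status)
```

Passing the `sleep` function as a parameter keeps the helper easy to test without actually waiting several minutes.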
You can delete a directory only when it is empty. Some clients, however, can appear to delete nonempty directories, as long as those directories don’t contain objects under retention. In such cases, what’s really happening is that the client is using a single call to HCP to first delete the objects in the directory and then delete the now empty directory.
HCP lets multiple threads access the namespace simultaneously. Using multiple threads can enhance performance, especially when accessing many small objects across multiple directories.
Here are some guidelines for the effective use of multithreading:
- Concurrent threads, both reads and writes, should be directed against different directories. If that’s not possible, multiple threads working against a single directory is still better than a single thread.
- To the extent possible, concurrent threads should work against different IP addresses. If that’s not possible, multiple threads working against a single IP address is still better than a single thread.
- Only one client can write to a given object at one time. Similarly, a multithreaded client can write to multiple objects at the same time but cannot have multiple threads writing to the same object simultaneously.
- Multiple clients can read the same object concurrently. Similarly, a multithreaded client can use multiple threads to read a single object. However, because the reads can occur out of order, you generally get better performance by using one thread per object.
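The guidelines above suggest one thread per object, with concurrent threads spread across different IP addresses. A minimal sketch of that arrangement follows, assuming the caller provides a `fetch(node_ip, object_path)` function that performs the actual read. That function, the helper name, and the worker count are hypothetical.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def read_all(
    object_paths: list[str],
    node_ips: list[str],
    fetch: Callable[[str, str], bytes],
    max_workers: int = 8,
) -> dict[str, bytes]:
    """Read each object in its own task, rotating tasks across node IP addresses."""
    if not node_ips:
        raise ValueError("no node IP addresses given")
    # Pair each object with the next node IP in rotation.
    assignments = list(zip(object_paths, itertools.cycle(node_ips)))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda pair: fetch(pair[1], pair[0]), assignments)
        return dict(zip(object_paths, results))


def fake_fetch(node_ip: str, object_path: str) -> bytes:
    """Stand-in for a real read; returns a label instead of object data."""
    return f"{node_ip}:{object_path}".encode()


data = read_all(["a/1", "b/2", "c/3"], ["192.168.210.16", "192.168.210.17"], fake_fetch)
print(data)
```

In the example, each object lives in a different directory, which also follows the guideline to direct concurrent threads against different directories where possible.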