Data connection types and settings
By default, your system includes data connections for accessing data sources.
You can write your own data connection plugins to allow the system to connect to different types of data sources. When you do this, you define both the required and optional configuration settings for that data connection type.
Box data connection (Preview Mode)
This data connection allows the system to access files available to all Box Enterprise users, providing HCI Search App capabilities on the crawled and indexed documents. This connector crawls files and folders; only the latest version of a file is considered.
The connector presents the Box file system with the following hierarchy:
- "All Files (Enterprise)"
- All Enterprise Box user folders
- Individual user files and folders
Authentication
To set up your Box connector, you need OAuth 2.0 with JSON Web Token (JWT) credentials from the Box Developer Console. For more information, see Setup with JWT.
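The sketch below is an illustration only and not part of the product: it shows one way to verify a Box JWT application configuration outside HCI, using the official Box Python SDK (boxsdk). The configuration file path and folder ID are hypothetical placeholders.
from boxsdk import JWTAuth, Client
# Authenticate with the JWT app configuration downloaded from the Box Developer Console,
# then list the items visible to the application's service account.
auth = JWTAuth.from_settings_file('/path/to/box_jwt_config.json')  # hypothetical path
client = Client(auth)
for item in client.folder(folder_id='0').get_items():  # '0' is the root folder
    print(item.type, item.name)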
List-based data connection
This is a list-based data connection. This means that when the Check for Updates setting is enabled for a workflow task, this data connection relies on a list kept by the task to determine which files have changed since the last time the data source was read.
This is different from a change-based data connection, such as the HCP MQE data connection, which can ask the data source directly for a list of files that changed during a span of time.
Configuration settings
Setting | Required/Optional | Description |
Name | Required | The name for your Box data source. |
Description | Optional | A description of the data source. |
Authentication Key | Required | The OAuth 2.0 with JSON Web Token (JWT) provided to you by Box. For more information, see Setup with JWT. |
Use Proxy Server | Optional |
Whether to use a proxy server to connect to the data source. When enabled, you need to specify:
|
Filter type | Required |
The filter to use when crawling the Box Enterprise file system. Choose from the following filter types:
|
Max Visited File Size | Required | Sets a maximum crawlable file size. Files larger than this size will be skipped. The default value is 100 GB. |
Supported actions
This data connection does not support any actions. It can only read and process documents from Box.
Document fields
The following document fields are populated with metadata for your Box connector:
Name | Data Type |
HCI_createdDateMillis | Long |
HCI_createdDateString | String |
HCI_displayName | String |
HCI_doc_version | String |
HCI_filename | String |
HCI_id | String |
HCI_modifiedDateMillis | Long |
HCI_modifiedDateString | String |
HCI_relativePath | String |
HCI_owner | String |
HCI_size | Long |
HCI_URI | String |
Microsoft SharePoint data connection (Preview Mode)
This data connection allows the system to access enterprise Microsoft SharePoint files to provide HCI Search App capabilities on the crawled and indexed documents.
This connector only crawls files and folders. Some file types, such as OneNote notebooks, cannot be crawled and are therefore not processed.
The connector presents the SharePoint file system with the following hierarchy:
- "All Sites (Enterprise)"
- All SharePoint sites
- All drives for the selected SharePoint site
- All user files and folders
Authentication
To use this connector, you need an application (client) ID, a directory (tenant) ID, and a client secret obtained through your Microsoft admin account. For more information, see the Microsoft Azure Portal.
The application must also be granted the following permissions (a token-acquisition sketch follows this list):
- Files.Read.All
- Sites.Read.All
- Directory.Read.All
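The following is a minimal, hedged sketch (not the connector's internal implementation) of acquiring a Microsoft Graph token with these registration values through the client-credentials flow, using the MSAL library for Python. The tenant ID, client ID, and client secret are placeholders.
import msal
TENANT_ID = "<directory-tenant-id>"      # placeholder
CLIENT_ID = "<application-client-id>"    # placeholder
CLIENT_SECRET = "<client-secret>"        # placeholder
app = msal.ConfidentialClientApplication(
    CLIENT_ID,
    authority=f"https://login.microsoftonline.com/{TENANT_ID}",
    client_credential=CLIENT_SECRET,
)
# The application permissions (Files.Read.All, Sites.Read.All, Directory.Read.All) are granted
# in Azure AD; the client-credentials flow then requests the .default scope.
result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
print("token acquired" if "access_token" in result else result.get("error_description"))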
List-based data connection
This is a list-based data connection. This means that when the Check for Updates setting is enabled for a workflow task, this data connection relies on a list kept by the task to determine which files have changed since the last time the data source was read.
This is different from a change-based data connection, such as the HCP MQE data connection, which can ask the data source directly for a list of files that changed during a span of time.
Configuration settings
Setting | Required/Optional | Description |
Name | Required | The name for your SharePoint data source. |
Description | Optional | A description of the data source. |
Application (client) ID | Required | Unique values generated and provided to you through your Microsoft Azure AD account. For more information, see the Microsoft Azure Portal. |
Client Secret | Required | |
Directory (tenant) ID | Required | |
Authority URL | Required | The URL that should be used to authenticate the application. The default value is: https://login.microsoftonline.com |
Use Proxy Server | Optional |
Whether to use a proxy server to connect to the data source. When enabled, you need to specify:
|
Max Visited File Size (bytes) | Required | Sets a maximum crawlable file size. Files larger than this size will be skipped. The default value is 100 GB. |
Supported actions
This data connection does not support any actions. It can only read and process documents from Microsoft SharePoint.
Document fields
The following document fields are populated with metadata for your Microsoft SharePoint connector:
Name | Data Type |
driveItemId | String |
driveId | String |
ownerId | String |
HCI_createdDateMillis | Long |
HCI_createdDateString | String |
HCI_displayName | String |
HCI_doc_version | String |
HCI_filename | String |
HCI_id | String |
HCI_modifiedDateMillis | Long |
HCI_modifiedDateString | String |
HCI_relativePath | String |
HCI_siteName | String |
HCI_size | Long |
HCI_URI | String |
Microsoft OneDrive for Business data connection
This data connection allows the system to access enterprise Microsoft files through OneDrive for Business to provide HCI Search App capabilities on the crawled and indexed documents.
This connector only crawls files and folders. Some file types, such as OneNote notebooks, cannot be crawled and are therefore not processed.
The connector presents the OneDrive file system with the following hierarchy:
- All Files (Enterprise)
- All OneDrive folders
- All user files and folders
Authentication
To use this connector, you need an application (client) ID, a directory (tenant) ID, and a client secret obtained through your Microsoft admin account. For more information, see the Microsoft Azure Portal.
The application must also be granted the following permissions (a drive-enumeration sketch follows this list):
- Files.Read.All
- User.Read
- User.Read.All
- Directory.Read.All
- People.Read.All
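As an illustrative sketch only, assuming a Graph access token obtained as in the SharePoint example above, an application holding these permissions can enumerate users and their OneDrive drives through standard Microsoft Graph endpoints. The access token is a placeholder.
import requests
GRAPH = "https://graph.microsoft.com/v1.0"
headers = {"Authorization": "Bearer <access-token>"}  # placeholder token
# List users in the directory (User.Read.All / Directory.Read.All), then fetch each
# user's default OneDrive drive (Files.Read.All).
users = requests.get(f"{GRAPH}/users", headers=headers).json().get("value", [])
for user in users:
    drive = requests.get(f"{GRAPH}/users/{user['id']}/drive", headers=headers)
    if drive.ok:
        print(user.get("userPrincipalName"), drive.json().get("id"))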
List-based data connection
This is a list-based data connection. This means that when the Check for Updates setting is enabled for a workflow task, this data connection relies on a list kept by the task to determine which files have changed since the last time the data source was read.
This is different from a change-based data connection, such as the HCP MQE data connection, which can ask the data source directly for a list of files that changed during a span of time.
Configuration settings
Setting | Required/Optional | Description |
Name | Required | The name for your OneDrive data source. |
Description | Optional | A description of the data source. |
Application (client) ID | Required | Unique values generated and provided to you through your Microsoft Azure AD account. For more information, see the Microsoft Azure Portal. |
Client Secret | Required | |
Directory (tenant) ID | Required | |
Authority URL | Required | The URL that should be used to authenticate the application. The default value is: https://login.microsoftonline.com |
Use Proxy Server | Optional |
Whether to use a proxy server to connect to the data source. When enabled, you need to specify:
|
Max Visited File Size (bytes) | Required | Sets a maximum crawlable file size. Files larger than this size will be skipped. The default value is 100 GB. |
Filter type | Required |
The filter to use when crawling the file system. Choose from the following filter types:
|
Supported actions
This data connection does not support any actions. It can only read and process documents from Microsoft OneDrive.
Document fields
The following document fields are populated with metadata for your OneDrive for Business connector:
Name | Data Type |
driveItemId | String |
driveId | String |
ownerId | String |
HCI_createdDateMillis | Long |
HCI_createdDateString | String |
HCI_displayName | String |
HCI_doc_version | String |
HCI_filename | String |
HCI_id | String |
HCI_modifiedDateMillis | Long |
HCI_modifiedDateString | String |
HCI_owner | String |
HCI_relativePath | String |
HCI_size | Long |
HCI_URI | String |
HCP (Hitachi Content Platform) data connection
This data connection allows access to files on a Hitachi Content Platform (HCP) system.
For information on how this data connection compares to other data connections that can access HCP, see Best practices for connecting to HCP.
HCP system requirements
In order for your system to connect to an HCP namespace, either the HTTP or HTTPS protocol must be enabled for the namespace.
Configuration settings
Setting | Required/Optional | Description |
HCP System Name, HCP Tenant Name, HCP Namespace Name | Required |
Information about the HCP namespace to connect to. You can find this information in the URL for an HCP namespace: https://<hcp-namespace-name>.<hcp-tenant-name>.<hcp-system-name>
|
HCP Root Directory | Required | The directory path to read. Use / (forward slash) to read all files in the namespace. |
Use SSL | Required |
Whether to use SSL to connect to the data source. When this option is enabled, click Test at the bottom of the Add Data connection page to connect to the data source and retrieve its SSL certificate. |
Use Proxy Server | Optional |
Whether to use a proxy server to connect to the data source. When enabled, you need to specify:
|
HCP Authentication Type | Required | The type of authentication to use when connecting to the HCP system. Users can select either local credentials or Active Directory credentials. The default value is Local. |
User Name | Required | Username for an HCP tenant-level user account. Tip: To access HCP anonymously, specify all_users as the user name and leave the password field blank. |
Password | Required | Password for the user account. |
Supported actions
Action name | Description | Configuration settings | HCP Permissions Required |
Copy File | This action issues an HCP REST Put-Copy API request to HCP, allowing users to copy objects between HCP namespaces. This action can be used as a workflow output or in a pipeline Execute Action stage. Note: Copy is possible only from within the same HCP system. |
|
read write |
Delete |
For each document, the action deletes the corresponding object from HCP. If versioning is enabled for the HCP namespace, this action removes only the current version of the object. This operation does not affect objects under retention. This operation does not delete folders from HCP. |
| delete |
Hold |
For each document, the action applies an HCP retention hold value to the corresponding HCP object. Hold values can be either true or false. When this value is true for an object, the object is on hold; it cannot be deleted, even by a privileged operation. Also, new versions of the object cannot be created. |
|
write privileged |
Output File |
Depending on the state of the incoming document, executes either the Write File or Delete action. This action usually executes the Write File action. The Delete action is executed only when both of these conditions are true: The outputFile action is included as a workflow output, not as a pipeline Execute Action stage. A document has an HCI_operation field with a value of DELETED. This indicates that the corresponding HCP object was deleted from the namespace. Such documents do not go through the pipeline; they are sent directly to workflow output. |
For a stream to be written as a custom metadata annotation, the stream name must start with HCP_customMetadata_ and the stream must have a metadata field named HCP_customMetadata with the value of the annotation name. For example: streams { HCP_customMetadata_exampleAnnotation: HCP_customMetadata=exampleAnnotation }; |
delete write privileged (for putting objects on hold) |
Privileged Delete | Same as the regular Delete action, except that this action can delete objects under retention. |
Tip: You can use a Tagging stage to add a reasonForDeletion field to your documents. |
delete privileged |
Privileged Purge |
For each document, the action deletes the corresponding HCP object and all of its versions. This is the same as the regular Purge action except that this action can be performed on objects that are under retention. |
|
delete purge privileged |
Purge |
For each document, the action deletes the corresponding HCP object and all of its versions. This action does not affect objects under retention. To purge those objects, use the privileged purge action. |
|
delete purge |
Retention |
For each document, this action applies an HCP retention setting to the corresponding object in HCP. An HCP object's retention setting determines whether the object is eligible for deletion. When you edit the retention setting for an existing object, HCP allows you only to make the setting longer or more restrictive, not less. For more information on HCP retention settings, see the HCP document Using a Namespace. |
| write |
Write Annotation |
This action takes document streams and writes them as custom metadata annotations to existing HCP objects. This action does not create new objects in HCP. That is, to write annotations to an HCP object, the object must already exist in HCP. |
For a stream to be written as a custom metadata annotation, the stream name must start with HCP_customMetadata_ and the stream must have a metadata field named HCP_customMetadata with the value of the annotation name. For example: streams { HCP_customMetadata_exampleAnnotation: HCP_customMetadata=exampleAnnotation }; Tip: Disable this option if the stream you want to write did not originally exist in the document (that is, the stream was created by a stage in your pipeline).
| write |
Write File |
For each document, the action writes the specified stream to an HCP object. If the object exists and versioning is enabled for the HCP namespace, the system writes a new version of the object. |
For a stream to be written as a custom metadata annotation, the stream name must start with HCP_customMetadata_ and the stream must have a metadata field named HCP_customMetadata with the value of the annotation name. For example: streams { HCP_customMetadata_exampleAnnotation: HCP_customMetadata=exampleAnnotation }; |
write privileged (for putting objects on hold) |
Authentication and action permissions
To configure this data connection, you need the username and password for a tenant-level user account on the HCP system. At a minimum, this user account must have read permission for the namespace you want your system to access.
To perform an action, the user account needs the correct permissions in HCP to perform that action. For example, the user account needs the delete permission to delete objects.
HCP object versioning
This data connection reads only the latest version of each HCP object.
Checking for updates with an HCP connector
This is a list-based data connection. This means that when the Check for Updates setting is enabled for a workflow task, this data connection relies on a list kept by the task to determine which files have changed since the last time the data source was read.
This is different from a change-based data connection, such as the HCP MQE data connection, which can ask the data source directly for a list of files that changed during a span of time.
For information on the Check for Updates setting, see Task settings.
How this data connection determines which file to perform an action on
This syntax shows how this data connection determines where to perform an action (a small path-composition sketch follows the example table below):
<location-specified-by-the-data-connection-used-by-the-action>/<Base Path-from-action-config-(if-specified)>/<Path field-from-action-config>/<Filename field-from-action-config>
This table shows an example of using the Write File action to copy an object named /sourceDir/file.txt from one HCP namespace to another.
Source data connection configuration | Document values | Destination data connection configuration | Action stage configuration | File written to |
Name: sourceDataConnection, Type: HCP, HCP System Name: sourceHcp.example.com, HCP Tenant Name: sourceTenant, HCP Namespace Name: sourceNamespace, HCP Root Directory: /sourceDir | HCI_filename: file.txt, HCI_relativePath: / | Name: destinationDataConnection, Type: HCP, HCP System Name: destinationHcp.example.com, HCP Tenant Name: destinationTenant, HCP Namespace Name: destinationNamespace, HCP Root Directory: /destinationDir | Action Name: Write File, Data connection: destinationDataConnection, Stream: HCI_content, Filename field: HCI_filename, Path field: HCI_relativePath, Base Path: /writtenByHCI | HCP System Name: destinationHcp.example.com, HCP Tenant Name: destinationTenant, HCP Namespace Name: destinationNamespace, Filename and path: /destinationDir/writtenByHCI/file.txt |
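As an illustrative sketch only (the values come from the example table above), the destination location is formed by joining the data connection's root directory, the action's Base Path, the document's path field, and its filename field:
# Hypothetical illustration of the path composition described above.
root = "/destinationDir"      # HCP Root Directory of the destination data connection
base_path = "/writtenByHCI"   # Base Path from the action configuration
rel_path = "/"                # value of the Path field (HCI_relativePath)
filename = "file.txt"         # value of the Filename field (HCI_filename)
parts = [p.strip("/") for p in (root, base_path, rel_path, filename)]
print("/" + "/".join(p for p in parts if p))  # /destinationDir/writtenByHCI/file.txt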
How this data connection populates the HCI_relativePath field
This data connection adds the HCI_relativePath field to each document it creates. By default, data connections use the HCI_relativePath field to determine where actions should be performed.
If this data connection is configured to read objects from the root folder (/) of a data source, the value for the HCI_relativePath field is relative to the root folder.
For example, when the file /logs/March.log is converted to a document, the HCI_relativePath field value for the document is logs/.
If you change the data connection to read from a specific folder (for example, /logs), the HCI_relativePath field value is relative to that folder. For example, in this case, the HCI_relativePath value for /logs/March.log will be /.
Setting up the HCP connector to perform a Copy File action across HCP tenants
The Copy File action across HCP tenants is only possible when using the HCP connector. To complete this action, HCI users need to create a new HCP data connection with the all_users user name and a blank password field.
To set up an HCP connector to perform the Copy File action across HCP tenants:
Procedure
Click Data Connections.
The Add Data Connection button appears. In the Type dropdown, select HCP.
Enter the HCP System Name.
Enter the HCP Tenant Name.
Enter the HCP Namespace Name.
Note: HCP System Name, HCP Tenant Name, and HCP Namespace Name can all be found in the URL for the HCP namespace: https://<hcp-namespace-name>.<hcp-tenant-name>.<hcp-system-name>
Enter the HCP Root Directory path.
To use SSL to communicate with the HCP system, enable Use SSL.
To use a proxy server to connect to the HCP system, enable Use Proxy Server.
For User Name, enter all_users.
For Password, leave the field blank.
Note: The combination of the all_users User Name and blank Password field allows anonymous access to an HCP system. To perform actions, the all_users account must already have the applicable HCP data access permissions assigned to it. When you are finished, click Create.
HCP MQE (Hitachi Content Platform Metadata Query Engine) data connection
This connector allows your system to access objects in a Hitachi Content Platform (HCP) system. With this connector, the system uses the HCP metadata query engine (MQE) to submit operation-based requests for discovering new and changed files.
You can configure an HCP MQE data connection to access:
- All objects in a single namespace folder.
- All objects in a single namespace.
- All objects in a tenant.
- All objects in an HCP system.
HCP system requirements
In order for your system to use this connector to access an HCP namespace:
- The HTTP or HTTPS protocol must be enabled for the namespace.
- The HCP system must be at version 5.0 or later.
- The namespace must be enabled for search in HCP.
HCP MQE object versioning
When creating or editing an HCP MQE connector, the Track HCP Versions setting can be enabled to identify documents by their HCP versions. The version is then appended to the HCI_ID and HCI_URI fields on the document, helping to identify all versions of that document contained within HCP.
When this setting is set to True:
- A new Boolean field (HCI_trackHcpVersion) is added to the document with a value of True.
- When HCI_trackHcpVersion is set to True, HCP delete records are not processed but are instead stored in the index, similar to how create records are handled.
- The HCI_deleted field for each document is set to False for all HCP create records and True for all other records.
- If the HCI_URI document field contains version information, the HCP and HCP MQE connectors' plugin APIs return data or metadata corresponding to that specific HCP version.
Additionally, the MQE connector always adds a new document field, HCI_hcpDocVersion, with the Long data type and the HCP version as the field value.
Authentication and action permissions
To configure this data connection, you need the username and password for a tenant-level user account on the HCP system. At a minimum, this user account must have read permission for the namespace you want your system to access.
To perform an action, the user account needs to have the correct permissions in HCP to perform that action. For example, the user account requires the delete permission to delete objects.
Checking for updates with an MQE connector
This is a change-based connector, which means that a workflow task can submit requests directly to the data source to learn what files have changed.
With list-based connectors, such as the regular HCP data connection, the data connection needs to check a list of files that it has already read to determine whether a file in the data source has changed since it was last read.
Configuration settings
Setting | Required/Optional | Description |
HCP System Name | Required | The HCP system to connect to. |
HCP Tenant Name | Optional |
The name of an HCP tenant. If you omit both this and the HCP Namespace Name, the system reads all files in the HCP system. Note: To read all files in the HCP system, the Use SSL setting must be enabled. |
HCP Namespace Name | Optional |
The name of an HCP namespace. If you omit this:
|
Directories Filter | Optional | A comma-separated list of directories from which the system should read files. If you omit this, the system reads all files in the namespace. Note: The directories you specify here are not used to determine the value for the HCI_relativePath field for each document. |
Use SSL | Optional | Whether to use SSL to connect to the data source. When this option is enabled, click Test at the bottom of the Add Data connection page to connect to the data source and retrieve its SSL certificate. Note: If you've configured this data connection to read all files in the HCP system, this setting must be enabled. |
HCP Authentication Type | Required | The type of authentication to use when connecting to the HCP system. Users can select either local credentials or Active Directory credentials. The default value is Local. |
User Name | Required | Username for an HCP tenant-level or system-level user account. |
Password | Required | Password for the user account. |
Batch Size | Optional | The number of documents to return per MQE request. The default is 500. |
Customize the Query Range | Optional | When enabled, allows you to edit the time period for MQE requests to cover. For example, you can use this setting to process and index only the files that were changed between March 1, 2015 and April 1, 2015. |
Custom query start time (millisec) | Optional | Number of milliseconds since January 1, 1970, 00:00:00 UTC. The system reads only the files that were added or changed at or after this date and time. A sketch for computing this value appears after this table. |
Custom query end time (millisec) | Optional | Number of milliseconds since January 1, 1970, 00:00:00 UTC. The system reads only the files that were added or changed at or before this date and time. |
Include delete operations | Optional |
When enabled, a workflow task processes files that were deleted from the HCP namespace. While this option is enabled, and while the workflow task is configured to scan for updates, this field value pair is added to deleted documents: HCI_deleted:true The deleted file is not removed from the index. This setting is disabled by default. |
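The custom query start and end times above are plain epoch milliseconds. As a small sketch using the dates from the Customize the Query Range example, you can compute the values like this:
from datetime import datetime, timezone
# Convert UTC calendar dates to milliseconds since January 1, 1970, 00:00:00 UTC.
start = int(datetime(2015, 3, 1, tzinfo=timezone.utc).timestamp() * 1000)
end = int(datetime(2015, 4, 1, tzinfo=timezone.utc).timestamp() * 1000)
print(start, end)  # 1425168000000 1427846400000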
Supported actions
Action name | Description | Configuration settings | HCP Permissions Required |
Copy File | This action issues an HCP REST Put-Copy API request to HCP, allowing users to copy objects between HCP namespaces. This action can be used as a workflow output or in a pipeline Execute Action stage. Additionally, this action is available only when the connector is configured with a specific namespace. Note: Copy is possible only from within the same HCP system. |
|
read write |
Delete |
For each document, the action deletes the corresponding object from HCP. If versioning is enabled for the HCP namespace, this action removes only the current version of the object. This operation does not affect objects under retention. This operation does not delete folders from HCP. |
| delete |
Hold |
For each document, the action applies an HCP retention hold value to the corresponding HCP object. Hold values can be either true or false. When this value is true for an object, the object is on hold; it cannot be deleted, even by a privileged operation. Also, new versions of the object cannot be created. |
|
write privileged |
Output File |
Depending on the state of the incoming document, executes either the Write File or Delete action. This action usually executes the Write File action. The Delete action is executed only when both of these conditions are true: The outputFile action is included as a workflow output, not as a pipeline Execute Action stage. A document has an HCI_operation field with a value of DELETED. This indicates that the corresponding HCP object was deleted from the namespace. Such documents do not go through the pipeline; they are sent directly to workflow output. |
For a stream to be written as a custom metadata annotation, the stream name must start with HCP_customMetadata_ and the stream must have a metadata field named HCP_customMetadata with the value of the annotation name. For example: streams { HCP_customMetadata_exampleAnnotation: HCP_customMetadata=exampleAnnotation }; |
delete write privileged (for putting objects on hold) |
Privileged Delete | Same as the regular Delete action, except that this action can delete objects under retention. |
|
delete privileged |
Privileged Purge |
For each document, the action deletes the corresponding HCP object and all of its versions. This is the same as the regular Purge action except that this action can be performed on objects that are under retention. |
|
delete purge privileged |
Purge |
For each document, the action deletes the corresponding HCP object and all of its versions. This action does not affect objects under retention. To purge those objects, use the privileged purge action. |
|
delete purge |
Retention |
For each document, this action applies an HCP retention setting to the corresponding object in HCP. An HCP object's retention setting determines whether the object is eligible for deletion. When you edit the retention setting for an existing object, HCP allows you only to make the setting longer or more restrictive, not less. For more information on HCP retention settings, see the HCP document Using a Namespace. |
| write |
Write Annotation |
This action takes document streams and writes them as custom metadata annotations to existing HCP objects. This action does not create new objects in HCP. That is, to write annotations to an HCP object, the object must already exist in HCP. |
For a stream to be written as a custom metadata annotation, the stream name must start with HCP_customMetadata_ and the stream must have a metadata field named HCP_customMetadata with the value of the annotation name. For example: streams { HCP_customMetadata_exampleAnnotation: HCP_customMetadata=exampleAnnotation };
Note: Disable this option if the stream you want to write did not originally exist in the document (that is, the stream was created by a stage in your pipeline).
| write |
Write File |
For each document, the action writes the specified stream to an HCP object. If the object exists and versioning is enabled for the HCP namespace, the system writes a new version of the object. |
For a stream to be written as a custom metadata annotation, the stream name must start with HCP_customMetadata_ and the stream must have a metadata field named HCP_customMetadata with the value of the annotation name. For example: streams { HCP_customMetadata_exampleAnnotation: HCP_customMetadata=exampleAnnotation }; |
write privileged (for putting objects on hold) |
How the data connection determines which file to perform an action on
This syntax shows how this data connection determines where to perform an action:
<location-specified-by-the-data-connection-used-by-the-action>/<Base Path-from-action-config-(if-specified)>/<Path field-from-action-config>/<Filename field-from-action-config>
This table shows an example of using the Write File action to copy an object named /sourceDir/file.txt from one HCP namespace to another.
Source data connection configuration | Document values | Destination data connection configuration | Action stage configuration | File written to |
Name: sourceDataConnection, Type: HCP MQE, HCP System Name: sourceHcp.example.com, HCP Tenant Name: sourceTenant, HCP Namespace Name: sourceNamespace, Directories: /sourceDir | HCI_filename: file.txt, HCI_relativePath: /sourceDir | Name: destinationDataConnection, Type: HCP MQE, HCP System Name: destinationHcp.example.com, HCP Tenant Name: destinationTenant, HCP Namespace Name: destinationNamespace, Directories: /destinationDir | Action Name: Write File, Data connection: destinationDataConnection, Stream: HCI_content, Filename field: HCI_filename, Path field: HCI_relativePath, Base Path: /writtenByHCI | HCP System Name: destinationHcp.example.com, HCP Tenant Name: destinationTenant, HCP Namespace Name: destinationNamespace, Filename and path: /writtenByHCI/sourceDir/file.txt |
How this data connection populates the HCI_relativePath field
This data connection adds the HCI_relativePath field to each document it creates. By default, data connections use the HCI_relativePath field to determine where actions should be performed.
For this data connection, the value for the HCI_relativePath field is always relative to the root directory, regardless of the Directories Filter setting.
For example, when the file /logs/March.log is converted to a document, the HCI_relativePath field value for the document is logs/.
HCP Anywhere (Single user) data connection
HCP Anywhere is the file synchronization and sharing system from Hitachi Vantara. This data connection allows your system to read files from a single user's HCP Anywhere folder.
To access all files on an entire HCP Anywhere system, use the HCP Anywhere (System-wide) data connection.
HCP Anywhere system requirements
For your system to be able to read data from HCP Anywhere:
- The HCP Anywhere system must be at version 2.1.1 or later.
- The data connection must use an HCP Anywhere user account that has access to the HCP Anywhere File Sync and Share API.
Checking for updates with an HCP Anywhere single user connector
This is a list-based data connection. This means that when the Check for Updates setting is enabled for a workflow task, this data connection relies on a list kept by the task to determine which files have changed since the last time the data source was read.
This is different from a change-based data connection, such as the HCP MQE data connection, which can ask the data source directly for a list of files that changed during a span of time.
Configuration settings
Setting | Required/Optional | Description |
HCP Anywhere System Name | Required | Hostname for the HCP Anywhere system. |
HCP Anywhere Root Directory | Required | The folder path to read. Use / (forward slash) to read all files in the user's HCP Anywhere folder. |
Use Proxy Server | Optional |
Whether to use a proxy server to connect to the data source. When enabled, you need to specify:
|
User Name | Required |
Username for the HCP Anywhere user whose folder you want Hitachi Content Intelligence to read. Note: This user must have permission to use the HCP Anywhere File Sync and Share API. |
Password | Required | Password for the HCP Anywhere user. |
Supported actions
The HCP Anywhere (Single user) data connection does not support any actions. The system can use it only to read files.
How an HCP Anywhere data connection populates HCI_relativePath
This data connection adds the HCI_relativePath field to each document it creates. By default, data connections use the HCI_relativePath field to determine where actions should be performed.
For this data connection, the value for the HCI_relativePath field is always relative to the root folder.
For example, when the file /logs/March.log is converted to a document, the HCI_relativePath field value for the document is logs/.
HCP Anywhere (System-wide) data connection
HCP Anywhere is the file synchronization and sharing system from Hitachi Vantara. This data connection allows your system to read files from an entire HCP Anywhere system.
To access only a single user's files, see HCP Anywhere (Single user) data connection.
HCP Anywhere system and user requirements
For your system to be able to read data from HCP Anywhere, the HCP Anywhere system must be at version 4.0.0 or later and have the Device Management API enabled.
You also need the username and password for an HCP Anywhere user account that has the Administrator and Audit roles and has access to the HCP Anywhere Device Management API.
Checking for updates with an HCP Anywhere system connector
This is a list-based data connection. This means that when the Check for Updates setting is enabled for a workflow task, this data connection relies on a list kept by the task to determine which files have changed since the last time the data source was read.
This is different from a change-based data connection, such as the HCP MQE data connection, which can ask the data source directly for a list of files that changed during a span of time.
Configuration settings
Setting | Required/Optional | Description |
Name | Required | A name for the data connection. |
Description | Optional | A description for the data connection. |
HCP Anywhere System Name | Required | Hostname or IP for the HCP Anywhere system. |
Use Proxy Server | Optional |
Whether to use a proxy server to connect to the data source. When enabled, you need to specify:
The default is false. |
User Name | Required |
Username for an HCP Anywhere user. This user must have access to the Device Management API. The user specified in this field performs all of the supported actions and owns all files in this data connection. |
Password | Required | Password for the HCP Anywhere admin user. |
Filter type | Required |
The filter to use when crawling the HCP Anywhere server. Choose from the following filter types:
|
Batch Size | Required |
The maximum number of files to retrieve from the server. The default is 1000. |
Max visited file size (bytes) | Required |
HCI will create documents only for files smaller than this limit. The default is 107374182400 (100 GB). |
Supported actions
Action name | Description | Configuration settings |
Delete |
For each document, the system deletes the file from a user's file system. Unshared folders take on a different folder structure when they become shared and do not get crawled. This action is available only when the data connection is used by an Execute Action stage, not when it is included as a workflow output. |
|
Output File |
Depending on the state of the incoming document, executes either the Write File or Delete action. This action usually executes the Write File action. The Delete action is executed only when both of these conditions are true: The outputFile action is included as a workflow output, not as a pipeline Execute Action stage. A document has an HCI_operation field with a value of DELETED. This indicates that the corresponding file was deleted from the file system. Such documents do not go through the pipeline; they are sent directly to workflow output. |
|
Write File |
For each document, the action creates a file in an HCP Anywhere file system. Shared folders lose their top-level folder after they are written to a destination, and their paths differ.
|
How this data connection determines which file to perform an action on
This syntax shows how this data connection determines where to perform an action:
<location-specified-by-the-data-connection-used-by-the-action>/<user-filesystem>/<Base Path-from-action-config-(if-specified)>/<Relative path field-from-action-config>/<Filename field-from-action-config>
This table shows an example of using the Write File action to copy an object named /sourceDir/file.txt from one HCP Anywhere file system to another.
Source data connection configuration | Document values | Destination data connection configuration | Action stage configuration | File written to |
Name: sourceDataConnection, Type: HCP Anywhere System-Wide, AW System Name: sourceAW.example.com, Base Directory: /sourceDir | HCI_filename: file.txt, HCI_relativePath: /sourceDir | Name: destinationDataConnection, Type: HCP Anywhere System-Wide, AW System Name: destinationAW.example.com, Base Directory: /destinationDir | Action Name: Write File, Data connection: destinationDataConnection, User: destinationUser, Stream: HCI_content, Filename field: HCI_filename, Path field: HCI_relativePath, Base Path: /writtenByHCI | AW System Name: destinationAW.example.com, Filename and path: /destinationDir/wr |
HCP for Cloud Scale Monitoring data connection (Preview Mode)
HCP for Cloud Scale is a software-defined, massively scalable object storage system from Hitachi Vantara. This data connection gathers monitoring details from a Cloud Scale system through the Prometheus service's metrics, for use in HCM.
It is created automatically by HCM once a Cloud Scale source is added.
Configuration settings
Setting | Required/Optional | Description |
Name | Required | The name for your data source. |
Description | Optional | A description for your data source. |
HCP for Cloud Scale System to Monitor | Required | The domain name of the system you intend to monitor. |
Monitor system via Prometheus | Optional | Collect metrics using the Prometheus API on your system. The Prometheus service must be running on the system in order for this setting to function correctly. |
HCP for Cloud Scale Prometheus Port | Required (if Monitor system via Prometheus is enabled) | The port on which the HCP for Cloud Scale system has configured the Prometheus service to run. |
Time between checks | Required (if Monitor system via Prometheus is enabled) | The minimum interval (in seconds) between metrics collection attempts. |
Amazon® S3 data connection
This data connection allows the system to access the Amazon Simple Storage Service (S3) on Amazon Web Services (AWS).
For information on using this data connection to migrate data from Amazon S3 to HCP, see Copying data from Amazon S3 to HCP.
Authentication
This data connection needs an access key ID and secret access key for an Amazon AWS account.
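As an illustrative sketch only (not the connector's implementation), the same access key pair can be checked outside HCI with the AWS SDK for Python (boto3); the region, bucket, and prefix below are placeholders.
import boto3
# List a few objects in a bucket using the access key ID and secret access key
# that the data connection will use.
s3 = boto3.client(
    "s3",
    region_name="us-east-1",                      # placeholder region
    aws_access_key_id="<access-key-id>",          # placeholder
    aws_secret_access_key="<secret-access-key>",  # placeholder
)
response = s3.list_objects_v2(Bucket="sourceBucket", Prefix="sourceDir/", MaxKeys=5)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])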
Checking for updates with an Amazon S3 connector
This is a list-based data connection. This means that when the Check for Updates setting is enabled for a workflow task, this data connection relies on a list kept by the task to determine which files have changed since the last time the data source was read.
This is different from a change-based data connection, such as the HCP MQE data connection, which can ask the data source directly for a list of files that changed during a span of time.
Configuration settings
Setting | Required/Optional | Description |
Amazon region | Required | A list of Amazon S3 regions. Select the one that you want to connect to. Note: If the region you want is not listed, use the S3 Compatible data connection instead of the Amazon S3 data connection. When configuring the S3 Compatible data connection, specify the applicable endpoint and authentication type for the region you want to connect to. |
Bucket | Required | The name of the S3 bucket to connect to. |
Prefix | Optional | When specified, this data connection retrieves only the files whose names begin with this prefix. |
Prefix delimiter | Required |
The character or characters that the data source uses to separate file prefixes into segments. The default is / (forward slash). |
Include user metadata | Required | When enabled, the system retrieves any user-defined metadata for a file, in addition to the file's contents. |
Use SSL | Optional | Whether to use SSL to connect to the data source. |
Use Proxy Server | Optional |
Whether the system should use a proxy server to connect to the data source. When enabled, you also need to specify the:
|
Use STS Authentication | Optional |
Whether to use the Amazon Web Services Security Token Service (STS) for authentication. When enabled, the system retrieves and uses temporary tokens to authenticate with the data source. For more information, see the Amazon Web Services documentation. A sketch of requesting temporary STS credentials appears after this table. |
STS session timeout | Required if STS Authentication is enabled | Time in seconds before the STS session expires. Valid values range from 900 (15 minutes) to 129600 (36 hours). The default is 900 seconds. |
Access key ID | Required |
One half of the access key that this data connection uses to authenticate with the data source. For information on finding your Amazon AWS account access key ID, see the Amazon Web Services documentation. |
Secret access key | Required |
One half of the access key that this data connection uses to authenticate with the data source. For information on finding your Amazon AWS account secret access key, see the Amazon Web Services documentation. |
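As a hedged sketch of what STS authentication involves (not HCI's internal code), a client exchanges its long-lived access key for temporary credentials and uses them, including the session token, for S3 requests; the 900-second duration matches the default STS session timeout above.
import boto3
# Exchange the long-lived access key for temporary STS credentials.
sts = boto3.client(
    "sts",
    aws_access_key_id="<access-key-id>",          # placeholder
    aws_secret_access_key="<secret-access-key>",  # placeholder
)
creds = sts.get_session_token(DurationSeconds=900)["Credentials"]
# Use the temporary credentials for subsequent S3 requests.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print(s3.list_objects_v2(Bucket="sourceBucket", MaxKeys=1).get("KeyCount"))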
Supported actions
Action name | Description | Configuration settings |
Delete |
For each document, the system deletes the corresponding file from the data source. This operation does not delete folders. |
|
Output File |
Depending on the state of the incoming document, executes either the Write File or Delete action. This action usually executes the Write File action. The Delete action is executed only when both of these conditions are true: The Output File action is included as a workflow output, not as a pipeline Execute Action stage. A document has an HCI_operation field with a value of DELETED. This indicates that the corresponding file was deleted from the data source. Such documents do not go through the pipeline; they are sent directly to workflow output. |
|
Write File | For each document, the system writes a new file to the data source. |
|
How the data connection determines which file to perform an action on
This syntax shows how this data connection determines where to perform an action:
<location-specified-by-the-data-connection-used-by-the-action>/<Base Prefix-from-action-config-(if-specified)>/<Relative Prefix Field-from-action-config>/<Filename field-from-action-config>
This table shows an example of using the Write File action to copy a file named /sourceDir/file.txt from one S3 data source to another.
Source data connection configuration | Document values | Destination data connection configuration | Action stage configuration | File written to |
Name: sourceDataConnection, Type: Amazon S3, Amazon Region: us-east-1, Bucket: sourceBucket, Prefix: sourceDir, Prefix Delimiter: / | HCI_filename: file.txt, HCI_relativePath: / | Name: destinationDataConnection, Type: S3 Compatible, S3 Endpoint: tenant1.hcp.example.com, Bucket: namespace1, Prefix: destinationDir, Prefix Delimiter: / | Action Name: Write File, Data connection: destinationDataConnection, Stream: HCI_content, Filename field: HCI_filename, Relative prefix field: HCI_relativePath, Base Prefix: /writtenByHCI | HCP System Name: hcp.example.com, HCP Tenant Name: tenant1, HCP Namespace Name: namespace1, Filename and path: /destinationDir/writtenByHCI/file.txt |
How this data connection populates the HCI_relativePath field
This data connection adds the HCI_relativePath field to each document it creates. By default, data connections use the HCI_relativePath field to determine where actions should be performed.
If this data connection is configured to read objects from the root folder (/) of a data source, the value for the HCI_relativePath field is relative to the root folder.
For example, when the file /logs/March.log is converted to a document, the HCI_relativePath field value for the document is logs/.
If you change the data connection to read from a specific folder (for example, /logs), the HCI_relativePath field value is relative to that folder. For example, in this case, the HCI_relativePath value for /logs/March.log is /.
Considerations - Amazon S3 connections
When you use this data connection to write files to Amazon S3, the data connection does not create empty files for representing directories.
For example, say that you write a file whose full path and name in the input data source is:
/patients/ChrisGreen/billing/bill-02-02-2015.txt
The Amazon S3 data connection writes only this file. It does not write any of these files:
/patients/
/patients/ChrisGreen/
/patients/ChrisGreen/billing/
Accessing files in Amazon S3 from the Search App
After running a search in the Search App, users can select search result links to download files from your data sources. For users to be able to access files from Amazon S3, the files must have public read permissions.
S3 Compatible data connection
This connector allows your system to access the Amazon Simple Storage Service (S3) on Amazon Web Services (AWS), or any system (such as HCP) that provides an HTTP-based API compatible with the Amazon S3 API.
For information on:
- Using this data connection to migrate data from Amazon S3 to HCP, see Copying data from Amazon S3 to HCP.
- How this data connection compares to other data connections that can access HCP, see Best practices for connecting to HCP.
Note: When connecting to HCP, this data connection reads and writes custom metadata only through the annotation named .metapairs. This data connection cannot read or write any other custom metadata annotations for HCP objects.
Authentication
To access an S3 compatible data source, this data connection needs an access key ID and secret access key for an account on that data source.
Connecting to HCP over S3
You can use this data connection to read data from an HCP namespace.
For HCP namespaces with versioning enabled, this connector reads only the latest version of each object.
Connection requirements:
- The HCP system must be at version 6.0 or later.
- The Hitachi API for Amazon S3 protocol must be enabled for the namespaces that you want to connect to.
- You need an HCP user account with read permission for the namespace you want to connect to. To perform actions, the user account must also have write and delete permissions.
To generate the access key ID and secret access key for an HCP user account, you base64-encode the account username and MD5-hash the password. For example, run this command in a terminal window:
echo `echo -n <username> | base64`:`echo -n <password> | md5sum` | awk '{print $1}'
The command outputs an access key in this format:
<access-key-id>:<secret-access-key>
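The following Python sketch produces the same output as the command above; the username and password values are placeholders.
import base64
import hashlib
username = "<hcp-username>"   # placeholder
password = "<hcp-password>"   # placeholder
access_key_id = base64.b64encode(username.encode()).decode()    # base64 of the username
secret_access_key = hashlib.md5(password.encode()).hexdigest()  # MD5 hex digest of the password
print(f"{access_key_id}:{secret_access_key}")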
Checking for updates with an S3 compatible connector
This is a list-based data connection. This means that when the Check for Updates setting is enabled for a workflow task, this data connection relies on a list kept by the task to determine which files have changed since the last time the data source was read.
This is different from a change-based data connection, such as the HCP MQE data connection, which can ask the data source directly for a list of files that changed during a span of time.
Configuration settings
Setting | Required/Optional | Description |
S3 Endpoint | Required |
The point of entry for the data source that you want to connect to. When connecting to HCP, this is the name of the HCP tenant and system in this format: <tenant-name>.<hcp-system-hostname> For example: financeTenant.hcp.example.com |
Bucket | Required |
The name of the S3 bucket to connect to. When connecting to HCP, this is the name of an HCP namespace. |
Prefix | Optional | When specified, this data connection retrieves only the files whose names begin with this prefix. |
Prefix delimiter | Required |
The character or characters that the data source uses to separate file prefixes into segments. The default is / (forward slash). |
Include user metadata | Optional |
When enabled, the system retrieves any user-defined metadata for a file, in addition to the file's contents. If this data connection connects to an HCP namespace, the system retrieves the custom metadata annotation named .metapairs for each object, if the annotation exists. |
Include S3 object tagging metadata | Optional | Whether to fetch object tagging metadata for each document. |
Object tagging metadata prefix | Required if Include S3 object tagging metadata is enabled | If set, this string is prepended to the object tag key when converting S3 object tags to HCI metadata fields. |
Include S3 object lock metadata | Optional | Whether to fetch object lock metadata for each document. |
Use SSL | Optional | Whether to use SSL to connect to the data source. |
Use Proxy Server | Optional |
Whether the system should use a proxy server to connect to the data source. When enabled, you also need to specify the:
|
Authentication Type | Required |
The process used to sign the access key that this data connection uses to connect to the data source. Options are:
For information on these signing processes, see the Amazon Web Services documentation. |
Use STS Authentication | Optional |
Whether to use Amazon Web Services Security Token Service for authentication. When enabled, the system retrieves and uses temporary tokens to authenticate with the data source. For more information, see the Amazon Web Services documentation. |
STS session timeout | Required if Use STS Authentication is enabled | Time in seconds before the STS session expires. Valid values range from 900 (15 minutes) to 129600 (36 hours). The default is 900 seconds. |
STS Endpoint | Optional | The endpoint for the AWS Security Token Service (AWS STS). |
Access key ID | Required | One half of the access key that this data connection uses to authenticate with the data source. |
Secret access key | Required | One half of the access key that this data connection uses to authenticate with the data source. |
Supported actions
Action name | Description | Configuration settings | HCP Permissions Required |
Delete |
For each document, the system deletes the corresponding file from the data source. This operation does not delete folders. |
| delete |
Delete tags | Performs an S3 operation to delete all tags from the specified object. |
| |
Output File |
Depending on the state of the incoming document, executes either the Write File or Delete action. This action usually executes the Write File action. The Delete action is executed only when both of these conditions are true: The Output File action is included as a workflow output, not as a pipeline Execute Action stage. A document has an HCI_operation field with a value of DELETED. This indicates that the corresponding file was deleted from the data source. Such documents do not go through the pipeline; they are sent directly to workflow output. |
|
delete write |
Set Legal Hold | Performs an S3 operation to set legal hold on the specified object. |
| |
Set Retention | Performs an S3 operation to set retention on the specified object. |
| |
Set Tags | Performs an S3 operation to set the configured metadata fields as object tags on the specified object. |
| |
Write File | For each document, the system writes a new file to the data source. |
| write |
How the data connection determines which file to perform an action on
This syntax shows how this data connection determines where to perform an action:
<location-specified-by-the-data-connection-used-by-the-action>/<Base Prefix-from-action-config-(if-specified)>/<Relative Prefix Field-from-action-config>/<Filename field-from-action-config>
This table shows an example of using the Write File action to copy a file named /sourceDir/file.txt from one S3 data source to another.
Source data connection configuration | Document values | Destination data connection configuration | Action stage configuration | File written to |
Name: sourceDataConnection, Type: Amazon S3, Amazon Region: us-east-1, Bucket: sourceBucket, Prefix: sourceDir, Prefix Delimiter: / | HCI_filename: file.txt, HCI_relativePath: / | Name: destinationDataConnection, Type: S3 Compatible, S3 Endpoint: tenant1.hcp.example.com, Bucket: namespace1, Prefix: destinationDir, Prefix Delimiter: / | Action Name: Write File, Data connection: destinationDataConnection, Stream: HCI_content, Filename field: HCI_filename, Relative prefix field: HCI_relativePath, Base Prefix: /writtenByHCI | HCP System Name: hcp.example.com, HCP Tenant Name: tenant1, HCP Namespace Name: namespace1, Filename and path: /destinationDir/writtenByHCI/file.txt |
How this data connection populates the HCI_relativePath field
This data connection adds the HCI_relativePath field to each document it creates. By default, data connections use the HCI_relativePath field to determine where actions should be performed.
If this data connection is configured to read objects from the root directory (/) of a data source, the value for the HCI_relativePath field is relative to the root directory.
For example, when the file /logs/March.log is converted to a document, the HCI_relativePath field value for the document is logs/.
If you change the data connection to read from a specific directory (for example, /logs), the HCI_relativePath field value is relative to that directory. For example, in this case, the HCI_relativePath value for /logs/March.log would be /.
Considerations - S3 compatible connections
After running a search in the Search App, users can click search result links to download files from your data sources. For users to be able to access files from Amazon S3, the files must have public read permissions.
HCP Monitoring data connection
The HCP Monitoring data connection produces documents containing performance and storage metrics for a Hitachi Content Platform (HCP) system.
This data connection is typically used by the Monitor App, but you can also use it in your workflows to process, archive, and index HCP system information.
How this data connection collects information
This data connection can collect information from any combination of these HCP system resources:
- Node status API: Shows performance metrics for individual HCP nodes.
- MAPI: Shows object storage metrics for individual HCP tenants.
- SNMP: Shows object storage metrics for the HCP system and individual nodes.
Each resource for which you want to collect information must be enabled in HCP.
When used as part of a workflow, the data connection makes regular requests to each enabled resource. Each request produces a document containing information specific to that resource. The Politeness setting determines how often these requests are made.
Supported HCP versions
This data connection is designed to collect information from HCP systems at version 8.0 or later. It can collect information from earlier HCP system versions, but some information might be unavailable. For example, individual HCP node metrics (such as the number of active HTTP connections per node) are available only in HCP version 8.0 and later.
Fields added to all documents
This table lists the fields that appear in all documents produced by this data connection.
Field | Description |
signalTimestamp | The time when metrics were collected. |
signalType |
The source of metrics collection. Select from:
|
signalSystemType | The type of the system from which metrics were collected. For this data connection, the value is always HCP. |
signalSystem | The name of the system from which metrics were collected. |
signalElementType |
The HCP system scope that the metrics apply to. Select from:
|
signalElement | The name of the system element that the document applies to. This field depends on the value for the signalElementType field. For example, if the signalElementType field value is NODE, signalElement is the name of a specific HCP node. |
Fields added by the Node Status API
This table lists the fields added to documents when the Monitor Nodes option is enabled for this data connection.
Field | Description |
beRead | The average number of bytes read from the node per second over the back-end network. |
beWrite | The average number of bytes written to the node per second over the back-end network. |
cpu | The total percentage of CPU capacity used. |
cpuSystem | The percentage of CPU capacity used by the operating system kernel. |
cpuUser | The percentage of CPU capacity used by HCP processes. |
diskRead | The average number of blocks read from the logical volume per second. |
diskUtilization | The usage of the communication channel between the operating system and the logical volume as a percent of the channel bandwidth. |
diskUtilizationAvg | The average value for the diskUtilization field. |
diskUtilizationMdn | The median value for the diskUtilization field. |
diskWrite | The average number of blocks written to the logical volume per second. |
feRead | The average number of bytes read from the node per second over the front-end network. |
feWrite | The average number of bytes written to the node per second over the front-end network. |
httpConnections | The number of HTTP connections. |
ioWait | The percentage of CPU capacity spent waiting to access logical volumes that are in use by other processes. |
swapOut | The average number of pages swapped out of memory per second. |
Fields added by MAPI
This table lists the fields added to documents when the Monitor tenants via MAPI option is enabled for this data connection.
Field | Description |
ingestedVolume | The total size of all objects in this tenant. |
objects | The total number of objects in this tenant. |
objectsCM | The number of objects in the tenant that have custom metadata. |
objectsCompressed | The number of compressed objects in the tenant. |
storageUsed | The amount of storage space used by this tenant. |
Fields added by SNMP
This table lists the fields added to documents when the Monitor tenants via SNMP option is enabled for this data connection.
Field | Description |
objects | The number of objects stored in the system. |
objectsIndexed | The number of objects indexed by the HCP system. |
ingestedVolume | The total size of all objects in the system. |
storageTotal |
Depends on value for the signalElementType field:
|
storageUsed |
Depends on value for the signalElementType field:
|
economyStorageTotal | The total amount of storage space on all S Series nodes in the system. |
economyStorageUsed | The amount of storage space used on all S Series nodes in the system. |
Configuration settings
This table lists the configuration settings for this data connection and, if applicable, the corresponding HCP system configuration required.
Setting | Description | HCP configuration required |
HCP System to Monitor |
The domain name of the HCP system to monitor. For example: corp-hcp.example.com | N/A |
System-level Monitoring | ||
Monitor system via SNMP | Whether to use SNMP (Simple Network Management Protocol) to collect information about the HCP system. |
To enable SNMP in HCP:
|
Version | Version of the SNMP protocol to use: 1, 2c, or 3. Specify the version setting that your HCP system is using. |
Community (SNMP version 1 or 2c) | Specify the Community setting that your HCP system is using. | |
Username, Password (SNMP version 3) | Specify the username and password for the SNMP user account your HCP system is using. | |
Time between checks | The interval, in seconds, between each try to collect information. The default is 180. | N/A |
Tenant-level Monitoring | ||
Monitor tenants via MAPI | Whether to use the HCP Management API (MAPI) to collect information about individual tenants in the HCP system. |
To configure HCP to allow MAPI access:
|
Username, Password | The username and password for a system-level HCP user account. | The user account you want this data connection to use must exist on the HCP system. For information on configuring HCP system-level user accounts, see the HCP documentation. |
HCP Authentication Type | The type of authentication to use when connecting to the HCP system. Users can select either their local credentials or Active Directory credentials. The default value is Local. |
Time between checks | The interval, in seconds, between each try to collect information. The default is 300. | N/A |
Node-level Monitoring | ||
Monitor Nodes | Whether to use the HCP Node Status API to collect information about individual nodes in the system. |
To enable the Node Status API on your HCP system:
|
Time between checks | The interval, in seconds, between each try to collect information. The default is 60. | N/A |
HCP Syslog Kafka Queue data connection
This data connection is a version of the Kafka Queue data connection, specially configured to read and process syslog messages sent by an HCP system to an Apache Kafka message queue.
This data connection reads syslog messages from a specified Apache Kafka queue, not directly from an HCP system.
For more information on Apache Kafka, see http://kafka.apache.org/
This data connection is typically used by the Monitor App, but you can also use it to process, archive, and index HCP syslog messages in your workflows.
HCP requirements
To use this data connection, you need to configure your HCP system to send syslog messages to a Kafka queue.
Checking for updates with this connector
During a workflow task, when this data connection reads messages from a Kafka queue, it continues until all messages in the queue have been read.
If the Check for Updates setting is disabled for the workflow task, the task stops when all messages have been read.
If the Check for Updates setting is enabled, the data connection continuously scans the queue for new messages and reads them as they are added.
Configuration settings
Setting | Required/Optional | Description |
Connection Settings | ||
Kafka Servers | Required |
A comma-separated list of host/port pairs to use for establishing the initial connection to a Kafka cluster. The list should be in the form: <host>:<port>,<host>:<port>,... For example: kafka1.example.com:9092,kafka2.example.com:9092 These servers are used for the initial connection to discover the full cluster membership, which might change dynamically. The list does not need to contain the full set of servers, but you might want to specify additional servers in case one becomes unavailable. |
Security Protocol | Required |
The security protocol used to communicate with the Kafka brokers. Options are:
|
Queue Settings | ||
Kafka Topic | Required | Name of the Kafka topic to connect to. |
Initial timestamp | Optional |
The earliest time after which you want to retrieve messages from the queue. Valid values:
|
Batch size | Optional | The maximum number of messages to retrieve from a queue at one time. |
HCP Settings | ||
HCP System | Required | The domain name of the HCP system to process syslog messages from. This option is used to filter the applicable messages from the queue. |
Testing this connection
When you test this data connection, the system tests that the data connection can connect to the specified Kafka topic. It does not test whether HCP syslog messages are successfully being added to that topic.
Kafka Queue data connection
The Kafka Queue data connection allows messages to be read from and written to Apache Kafka message queues. These queues facilitate the sharing of messages between systems, often in real-time.
For more information on Apache Kafka, see http://kafka.apache.org/
Checking for updates with this connector
During a workflow task, when this data connection reads messages from a Kafka queue, it continues until all messages in the queue have been read.
If the Check for Updates setting is disabled for the workflow task, the task stops when all messages have been read.
If the Check for Updates setting is enabled, the data connection continuously checks the queue for new messages and reads them as they are added.
Configuration settings
Setting | Required/Optional | Description |
Connection Settings | ||
Kafka Servers | Required |
A comma-separated list of host/port pairs to use for establishing the initial connection to a Kafka cluster. The list should be in the form: <host>:<port>,<host>:<port>,... For example: kafka1.example.com:9092,kafka2.example.com:9092 These servers are used for the initial connection to discover the full cluster membership, which might change dynamically. The list does not need to contain the full set of servers, but you might want to specify additional servers in case one becomes unavailable. |
Security Protocol | Required |
The security protocol used to communicate with the Kafka brokers. Options are:
|
Queue Settings | ||
Kafka Topic | Required | Name of the Kafka topic to connect to. |
Initial timestamp | Optional |
The earliest time after which you want to retrieve messages from the queue. Valid values:
|
Batch size | Optional | The maximum number of messages to retrieve from a queue at one time. |
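The settings above map onto the standard Apache Kafka client configuration. The following sketch, which assumes the kafka-clients Java library and uses hypothetical server names and a hypothetical topic, shows one way to read a few messages outside of HCI to confirm that the values you plan to enter are reachable. It illustrates ordinary Kafka consumer usage only; it is not how the data connection itself is implemented.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaQueueCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Same host/port list you would enter for the Kafka Servers setting.
        props.put("bootstrap.servers", "kafka1.example.com:9092,kafka2.example.com:9092");
        // Matches the Security Protocol setting (PLAINTEXT is a hypothetical choice here).
        props.put("security.protocol", "PLAINTEXT");
        props.put("group.id", "hci-connection-check");
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Subscribe to the topic named in the Kafka Topic setting.
            consumer.subscribe(Collections.singletonList("my-topic"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}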
Supported actions
Action name | Description | Configuration settings |
Enqueue Message | For each document, the data connection writes a message to the message queue. |
Message: The contents of the message to enqueue for each document. To include document field values, use this syntax: ${field-name} For example: Document ${HCI_displayName} was processed |
PostgreSQL JDBC data connection
This data connection uses the Java Database Connectivity (JDBC) API to connect to PostgreSQL databases. It uses SQL queries to retrieve documents from specified database tables.
When information from a database table is read into the system, rows become documents and columns become fields within documents.
Authentication
If a database needs authentication, you need to provide the username and password for a PostgreSQL database user account when configuring this data connection. The account must have permission to read the database you want.
Avoid restarting workflow tasks that use this data connection
This data connection does not checkpoint its progress while examining a database. If your workflow uses this data connection, when you pause the workflow task and then resume it, the data connection rereads the entire database.
Checking for updates for an SQL data connector
This is a list-based data connection. This means that when the Check for Updates setting is enabled for a workflow task, this data connection relies on a list kept by the task to determine which files have changed since the last time the data source was read.
This is different from a change-based data connection, such as the HCP MQE data connection, which can ask the data source directly for a list of files that changed during a span of time.
Use the Version setting for this data connection to be able to identify updated rows. If you leave this setting blank, the data connection can identify new and deleted rows, but not updated ones.
Configuration settings
Setting | Required/Optional | Description |
JDBC Settings | ||
JDBC Connection | Required |
Connection string for accessing the database using JDBC. For example: jdbc:postgresql://myserver.example.com:5432/postgres For information on formatting the JDBC connection string for PostgreSQL, see the PostgreSQL documentation. |
User Name | Optional | Username for a database user account. |
Password | Optional | Password for the user account. |
Query batch size | Optional | Number of rows to retrieve from the data source per request. To disable batching, specify -1 or leave this setting blank. |
Query Settings | ||
SELECT columns | Required | SQL SELECT clause, used to specify which columns to include with each row extracted from the database. Valid values include individual column names and column aliases (column AS alias).
For example, if your database contains a column called physician, you can rename it to doctor in all extracted documents by specifying: physician AS doctor
Note: By default, all data connections also add their standard HCI_* fields (such as HCI_id, HCI_displayName, and HCI_doc_version) to the documents they read.
|
FROM | Required | SQL FROM clause, used to specify the database tables to extract documents from. |
WHERE | Optional |
SQL WHERE clause, used to limit which rows are retrieved from the data source. For example, say that you have a database with information on cities, which contains a column called Population. To retrieve only rows for cities with populations of one million or more, specify: Population > 1000000 |
Results Settings | ||
Primary Key | Required |
Comma-separated list of columns that uniquely identify a row in the database. This value is used to populate the HCI_id document field. |
Display Name | Optional |
Comma-separated list of columns to be used as the friendly display name for a row. This value is used to populate the HCI_displayName document field. If you don't specify a value for this setting, the value specified for the Primary Key setting is used. |
Version | Optional |
Comma-separated list of columns whose contents can be used to determine when a row has been changed. Leave this setting blank if no such column exists in the data source. This value is used to populate the HCI_doc_version field. |
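To see how these settings fit together, the following minimal sketch uses the standard JDBC API, assuming the PostgreSQL driver is on the classpath. The host, database, table, columns, and credentials are hypothetical, and the query is simply the SELECT columns, FROM, and WHERE values assembled into an ordinary SQL statement; it illustrates the query the settings describe, not the data connection's internal implementation.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PostgresConnectionCheck {
    public static void main(String[] args) throws Exception {
        // Same value you would enter for the JDBC Connection setting.
        String url = "jdbc:postgresql://myserver.example.com:5432/postgres";

        // Hypothetical SELECT columns, FROM, and WHERE values, combined the way
        // an SQL query is normally assembled from those clauses.
        String query = "SELECT id, name, population FROM cities WHERE population > 1000000";

        try (Connection conn = DriverManager.getConnection(url, "dbuser", "dbpassword");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(query)) {
            while (rs.next()) {
                // Each row would become one document; each column becomes a field.
                System.out.println(rs.getString("id") + " " + rs.getString("name"));
            }
        }
    }
}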
Supported actions
This data connection does not support any actions. It can only read documents.
MySQL and MariaDB JDBC data connection
This data connection uses the Java Database Connectivity (JDBC) API to connect to MySQL and MariaDB databases. It uses SQL queries to retrieve documents from specified database tables.
When information from a database table is read into the system, rows become documents and columns become fields within documents.
Authentication
If a database needs authentication, you need to provide the username and password for a database user account when configuring this data connection. The account must have permission to read the database you want.
Avoid restarting workflow tasks that use this data connection
This data connection does not checkpoint its progress while examining a database. If your workflow uses this data connection, when you pause the workflow task and then resume it, the data connection rereads the entire database.
Checking for updates for an SQL data connector
This is a list-based data connection. This means that when the Check for Updates setting is enabled for a workflow task, this data connection relies on a list kept by the task to determine which files have changed since the last time the data source was read.
This is different from a change-based data connection, such as the HCP MQE data connection, which can ask the data source directly for a list of files that changed during a span of time.
Use the Version setting for this data connection to be able to identify updated rows. If you leave this setting blank, the data connection can identify new and deleted rows, but not updated ones.
Configuration settings for a MySQL and MariaDB JDBC connection
Setting | Required/Optional | Description |
JDBC Settings | ||
JDBC Connection | Required |
Connection string for accessing the database using JDBC. For example: jdbc:mysql://myserver.example.com:3306/mydatabase For more information on:
|
User Name | Optional | Username for a database user account. |
Password | Optional | Password for the user account. |
Query batch size | Optional | Number of rows to retrieve from the data source per request. To disable batching, specify -1 or leave this setting blank. |
Query Settings | ||
SELECT columns | Required | SQL SELECT clause, used to specify which columns to include with each row extracted from the database. Valid values include individual column names and column aliases (column AS alias).
For example, if your database contains a column called physician, you can rename it to doctor in all extracted documents by specifying: physician AS doctor
Note: By default, all data connections also add their standard HCI_* fields (such as HCI_id, HCI_displayName, and HCI_doc_version) to the documents they read.
|
FROM | Required | SQL FROM clause, used to specify the database tables to extract documents from. |
WHERE | Optional |
SQL WHERE clause, used to limit which rows are retrieved from the data source. For example, say that you have a database with information on cities, which contains a column called Population. To retrieve only rows for cities with populations of one million or more, specify: Population > 1000000 |
Results Settings | ||
Primary Key | Required |
Comma-separated list of columns that uniquely identify a row in the database. This value is used to populate the HCI_id document field. |
Display Name | Optional |
Comma-separated list of columns to be used as the friendly display name for a row. This value is used to populate the HCI_displayName document field. If you don't specify a value for this setting, the value specified for the Primary Key setting is used. |
Version | Optional |
Comma-separated list of columns whose contents can be used to determine when a row has been changed. Leave this setting blank if no such column exists in the data source. This value is used to populate the HCI_doc_version field. |
Supported actions
This data connection does not support any actions. It can only read documents.
Solr JDBC data connection
This data connection uses the Java Database Connectivity (JDBC) API to connect to Solr indexes. It uses SQL queries to retrieve documents from specified Solr indexes.
You can use this data connection to retrieve documents from either internal or external Solr indexes.
An index must be at Solr version 6 or later for this data connection to connect to it.
Avoid restarting workflow tasks that use this data connection
This data connection does not checkpoint its progress while examining an index. If your workflow uses this data connection, when you pause the workflow task and then resume it, the data connection rereads the entire index.
Checking for updates for an SQL data connector
This is a list-based data connection. This means that when the Check for Updates setting is enabled for a workflow task, this data connection relies on a list kept by the task to determine which files have changed since the last time the data source was read.
This is different from a change-based data connection, such as the HCP MQE data connection, which can ask the data source directly for a list of files that changed during a span of time.
Use the Version setting for this data connection to be able to identify updated rows. If you leave this setting blank, the data connection can identify new and deleted rows, but not updated ones.
Configuration settings for a Solr JDBC data connection
Setting | Required/Optional | Description |
JDBC Settings | ||
JDBC Connection | Required | Connection string for accessing the Solr index using JDBC. The format is: jdbc:solr://<zookeeper-host>:<zookeeper-port>/<zookeeper-path>?collection=<collection-name> For example: jdbc:solr://mySolr.example.com:2181/solr?collection=myIndex Note: Though you specify a Solr collection here, you can specify a different one for the FROM field. |
Query batch size | Optional | Number of rows to retrieve from the data source per request. To disable batching, specify -1 or leave this setting blank. Important: If you enable batching for this data connection, all index fields you specify for the SELECT columns setting must have the docValues field attribute in the index. |
Query Settings | ||
SELECT columns | Required | SQL SELECT clause, used to specify which columns to include with each row extracted from the database. Valid values include individual column names and column aliases (column AS alias).
For example, if your database contains a column called physician, you can rename it to doctor in all extracted documents by specifying: physician AS doctor
Note: By default, all data connections also add their standard HCI_* fields (such as HCI_id, HCI_displayName, and HCI_doc_version) to the documents they read.
|
FROM | Required | SQL FROM clause, used to specify the database tables to extract documents from. Note: The SQL syntax that you can use is determined by what Solr supports. For information on Solr SQL support, see the applicable Solr documentation. |
WHERE | Optional | SQL WHERE clause, used to limit which rows are retrieved from the data source. For example, say that you have an index of images containing a field called City. To retrieve only the documents for images that were taken in London, specify: City = 'London' Note: The SQL syntax that you can use is determined by what Solr supports. For information on Solr SQL support, see the applicable Solr documentation. |
Results Settings | ||
Primary Key | Required |
Comma-separated list of columns that uniquely identify a row in the index. All fields you specify here must also be specified for the SELECT columns setting. This value is used to populate the HCI_id document field.
Display Name | Optional |
Comma-separated list of fields to be used as the friendly display name for a document. All fields you specify here must also be specified for the SELECT columns setting. This value is used to populate the HCI_displayName document field. If you don't specify a value for this setting, the value specified for the Primary Key setting is used. |
Version | Optional |
Comma-separated list of fields whose contents can be used to determine when a document has been changed. All fields you specify here must also be specified for the SELECT columns setting. Leave this setting blank if no such column exists in the data source. This value is used to populate the HCI_doc_version field. |
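As an illustration of the connection string format, the following sketch assumes the Apache SolrJ JDBC driver (org.apache.solr.client.solrj.io.sql.DriverImpl) is on the classpath and uses a hypothetical ZooKeeper host, collection, and field names. It shows only standard Solr JDBC usage corresponding to these settings, not how the data connection itself runs its queries.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SolrJdbcCheck {
    public static void main(String[] args) throws Exception {
        // Register the SolrJ JDBC driver (ships with the solr-solrj library).
        Class.forName("org.apache.solr.client.solrj.io.sql.DriverImpl");

        // Same form as the JDBC Connection setting for this data connection.
        String url = "jdbc:solr://mySolr.example.com:2181/solr?collection=myIndex";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             // Hypothetical index fields. With batching enabled, every field
             // selected here must have docValues in the index schema.
             ResultSet rs = stmt.executeQuery(
                     "SELECT HCI_id, HCI_displayName, City FROM myIndex WHERE City = 'London'")) {
            while (rs.next()) {
                System.out.println(rs.getString("HCI_id") + " " + rs.getString("City"));
            }
        }
    }
}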
Supported actions
This data connection does not support any actions. It can only read documents.
Hadoop File System data connection
This data connection allows access to Hadoop Distributed File Systems (HDFS).
Authentication
This data connection does not support authentication.
Checking for updates with a Hadoop data connection
This is a list-based data connection. This means that when the Check for Updates setting is enabled for a workflow task, this data connection relies on a list kept by the task to determine which files have changed since the last time the data source was read.
This is different from a change-based data connection, such as the HCP MQE data connection, which can ask the data source directly for a list of files that changed during a span of time.
Configuration settings
Setting | Description |
Name | A name for the data connection. |
Description | An optional description for the data connection. |
HDFS Host | Hostname or IP address for the HDFS NameNode. |
HDFS Port | Port to connect to on the HDFS NameNode. |
Use SSL | Whether to use SSL when connecting to HDFS. |
Base directory | The path on the HDFS system to the folder containing the data you want to process. |
Max visited file size (bytes) |
Hitachi Content Intelligence will create documents only for files smaller than this limit. The default is 107374182400 (100 GB). |
Filter type |
The filter to use when crawling the HDFS system. Choose from the following filter types:
|
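For reference, the HDFS Host, HDFS Port, Base directory, and Max visited file size settings correspond to ordinary Hadoop FileSystem API concepts. The sketch below assumes the hadoop-client library and uses a hypothetical NameNode address and base directory; it lists the files directly under the base directory and flags those under the size limit, which are the kind of files the connection creates documents for. It is not the connector's own crawling code.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsConnectionCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical HDFS Host and HDFS Port values.
        URI nameNode = URI.create("hdfs://namenode.example.com:8020");
        Configuration conf = new Configuration();

        long maxVisitedFileSize = 107374182400L; // default limit: 100 GB

        try (FileSystem fs = FileSystem.get(nameNode, conf)) {
            // Hypothetical Base directory value. listStatus is not recursive;
            // this only inspects the top level of the base directory.
            for (FileStatus status : fs.listStatus(new Path("/data/to-process"))) {
                if (status.isFile() && status.getLen() < maxVisitedFileSize) {
                    System.out.println(status.getPath() + " " + status.getLen());
                }
            }
        }
    }
}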
Supported actions
Action name | Description | Configuration settings |
Delete |
For each document, the system deletes the corresponding file from the HDFS file system. This action is available only when the Hadoop File System data connection is used by an Execute Action stage, not when it is included as a workflow output. |
|
Output File |
Depending on the state of the incoming document, executes either the Write File or Delete action. This action usually executes the Write File action. The Delete action is executed only when both of these conditions are true:
|
|
Write File |
For each document, this action writes the specified stream to a file. If the file doesn't exist, the action creates it. |
|
How this data connection determines which file to perform an action on
This syntax shows how this data connection determines where to perform an action:
<location-specified-by-the-data-connection-used-by-the-action>/<Base-Path-from-action-config-(if-specified)>/<Relative-path-field-from-action-config>/<Filename-field-from-action-config>
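For example, with hypothetical values, if the data connection points at hdfs://namenode.example.com:8020/archive, the action's Base Path is output, the configured Relative path field resolves to logs/, and the Filename field resolves to March.log, the action operates on /archive/output/logs/March.log.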
How this data connection populates the HCI_relativePath field
This data connection adds the HCI_relativePath field to each document it creates. By default, data connections use the HCI_relativePath field to determine where actions should be performed.
If this data connection is configured to read objects from the root directory (/) of a data source, the value for the HCI_relativePath field is relative to the root directory.
For example, when the file /logs/March.log is converted to a document, the HCI_relativePath field value for the document is logs/.
If you change the data connection to read from a specific directory (for example, /logs), the HCI_relativePath field value is relative to that directory. For example, in this case, the HCI_relativePath value for /logs/March.log is /.
Local File System data connection (FOR TEST USE ONLY)
This data connection retrieves files from the local file system (LFS) on each Hitachi Content Intelligence instance.
When configuring this data connection, you specify a path to retrieve files from. This path must be located within the product installation folder and must exist on every instance in the system.
Connecting to NFS file systems
You can use the Local File System data connection to allow Hitachi Content Intelligence to read and perform actions on data stored on remote network file systems.
You do this by mounting a network file system to the same location within the Hitachi Content Intelligence installation folder on each instance in the system.
Procedure
Use SSH to access all instances in the system.
On each instance, create a folder within the Hitachi Content Intelligence installation folder:
mkdir /<install-folder>/hci/<nfs-mount-location>
Important: Do not create your new folder within any of the existing directories under /hci. These directories were created by Hitachi Content Intelligence when it was installed.
Mount the NFS file system you want to access:
mount <hostname>:<path-to-mount> /<install-folder>/hci/<nfs-mount-location>
In the Hitachi Content Intelligence Admin App, create a Local File System data connection. For the Base directory, specify the path to the folder where you mounted the NFS file system.
To access the files on the NFS file system, use the data connection as part of a workflow or Execute Action stage.
Note: If your Local File System data connection cannot access files on the NFS file system, see Troubleshooting.
POSIX metadata
The Local File System data connection collects POSIX filesystem metadata from the files it reads. This metadata is converted to field/value pairs in the resulting documents.
POSIX metadata | Resulting field | Field value type |
size | HCI_size | Long |
uid (ID of the owning user) | HCI_uid | Integer |
gid (ID of the owning group) | HCI_gid | Integer |
mode (file permissions) | HCI_mode | Integer |
ctime (change time) | HCI_createdDateMillis, HCI_createdDateString | Long, String |
atime (access time) | HCI_accessDateMillis, HCI_accessDateString | Long, String |
mtime (modification time) | HCI_modifiedDateMillis, HCI_modifiedDateString | Long, String |
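The following sketch shows where this metadata comes from at the file system level, using the standard Java NIO attribute API. The file path is hypothetical, and the unix:* attributes require a platform that exposes the unix file attribute view (as Linux does). The comments indicate which document field each attribute feeds; this is an illustration, not the connector's own code.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

public class PosixMetadataCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical file inside the configured Base directory.
        Path file = Path.of("/opt/hci/nfs-data/logs/March.log");

        long size = (Long) Files.getAttribute(file, "size");               // -> HCI_size
        int uid   = (Integer) Files.getAttribute(file, "unix:uid");        // -> HCI_uid
        int gid   = (Integer) Files.getAttribute(file, "unix:gid");        // -> HCI_gid
        int mode  = (Integer) Files.getAttribute(file, "unix:mode");       // -> HCI_mode
        FileTime mtime = (FileTime) Files.getAttribute(file, "lastModifiedTime"); // -> HCI_modifiedDate*
        FileTime atime = (FileTime) Files.getAttribute(file, "lastAccessTime");   // -> HCI_accessDate*
        FileTime ctime = (FileTime) Files.getAttribute(file, "unix:ctime");       // -> HCI_createdDate*

        System.out.printf("size=%d uid=%d gid=%d mode=%o%n", size, uid, gid, mode);
        System.out.printf("mtime=%s atime=%s ctime=%s%n", mtime, atime, ctime);
    }
}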
Configuration settings
Setting | Description |
Name | A name for the data connection. |
Description | An optional description for the data connection. |
Base directory |
The path to the folder that contains the data you want to process. Important:
|
Max visited file size (bytes) |
The system will create documents only for files smaller than this limit. The default is 107374182400 (100GB). |
Filter type |
The filter to use when crawling the file system. Choose from the following filter types:
|
Supported actions
Action name | Description | Configuration settings |
Delete |
For each document, the system deletes the corresponding file from the local file system. This action is available only when the Local File System data connection is used by an Execute Action stage, not when it is included as a workflow output. |
|
Output File |
Depending on the state of the incoming document, executes either the Write File or Delete action. This action usually executes the Write File action. The Delete action is executed only when both of these conditions are true:
|
|
Write File |
For each document, this action writes the specified stream to a file. If the file doesn't exist, the action creates it. |
|
How this data connection determines where to perform an action
This syntax shows how this data connection determines where to perform an action:
<location-specified-by-the-data-connection-used-by-the-action>/<Base-Path-from-action-config-(if-specified)>/<Relative-path-field-from-action-config>/<Filename-field-from-action-config>
How this data connection populates the HCI_relativePath field
This data connection adds the HCI_relativePath field to each document it creates. By default, data connections use the HCI_relativePath field to determine where actions should be performed.
If this data connection is configured to read objects from the root directory (/) of a data source, the value for the HCI_relativePath field is relative to the root directory.
For example, when the file /logs/March.log is converted to a document, the HCI_relativePath field value for the document is logs/.
If you change the data connection to read from a specific directory (for example, /logs), the HCI_relativePath field value is relative to that directory. For example, in this case, the HCI_relativePath value for /logs/March.log is /.
CIFS data connection
This data connection allows you to access Common Internet File System (CIFS) 2.x shares. CIFS is a form of Server Message Block (SMB), a network file sharing protocol.
Authentication
To access a CIFS share, you can use local or Active Directory (AD) authentication. If you are using AD, you must specify a domain name. If you do not specify a domain name, you have access to the CIFS share as a local user. If you do not specify a username, you can access the CIFS share anonymously.
Configuration settings
Setting | Description |
Name | A name for the data connection. |
Description | An optional description for the data connection. |
Host | Hostname or IP address for the CIFS server. |
Share name | Name of the share on the CIFS server. |
Base directory |
The path to the folder that contains the data you want to process. Important:
|
Include user and group SIDs |
Whether to query the CIFS server for user and group identifiers. This requires an additional request per file, which slows crawling. When enabled, the HCI_ownerSID and HCI_groupSID fields are added to each document. The default is false.
Username | A username to access the share. |
Domain | The Active Directory domain to access. This must be specified to access the CIFS share using AD authentication. |
Password | Password for the user account. |
Max visited file size (bytes) |
The system will create documents only for files smaller than this limit. The default is 107374182400 (100GB). |
Filter type |
The filter to use when crawling the file system. Choose from the following filter types:
|
Include Hidden Files | When set to true, crawls hidden files on the CIFS share. Default is false. |
Supported actions
Action name | Description | Configuration settings |
Delete |
For each document, the system deletes the corresponding file from the share. This action is available only when the CIFS data connection is used by an Execute Action stage, not when it is included as a workflow output. |
|
Output File |
Depending on the state of the incoming document, executes either the Write File or Delete action. This action usually executes the Write File action. The Delete action is executed only when both of these conditions are true:
|
|
Write File |
For each document, this action writes the specified stream to a file on the share. If the file or its parent directories don't exist, the action creates them.
|
How this data connection determines where to perform an action
This syntax shows how this data connection determines where to perform an action:
<location-specified-by-the-data-connection-used-by-the-action>/<Base-Path-from-action-config-(if-specified)>/<Relative-path-field-from-action-config>/<Filename-field-from-action-config>
How this data connection populates the HCI_relativePath field
This data connection adds the HCI_relativePath field to each document it creates. By default, data connections use the HCI_relativePath field to determine where actions should be performed.
If this data connection is configured to read objects from the root directory (/) of a data source, the value for the HCI_relativePath field is relative to the root directory.
For example, when the file /logs/March.log is converted to a document, the HCI_relativePath field value for the document is logs/.
If you change the data connection to read from a specific directory (for example, /logs), the HCI_relativePath field value is relative to that directory. For example, in this case, the HCI_relativePath value for /logs/March.log is /.
Solr Query data connections
Solr Query data connection
The Solr Query data connection uses Solr's Query API to retrieve documents from internal or external Solr indexes. As this connector pages through the query results, its progress is checkpointed, so workflows that use it can be paused and resumed.
When configuring a Solr Query data connector, you specify:
- The Solr server to connect to.
- The index to read from.
- The filtering criteria for limiting which documents are processed.
- The fields to include with processed documents.
Note: /solr must be used as the path if your Solr index is managed by HCI.
Internal Index Query data connection
The Internal Index Query connection works only with internal indexes. It functions exactly as the Solr Query connector does but needs less user input, as the default values for the configuration options are pulled from the managed HCI internal indexes.
Configuration settings for a Solr Query connector
When configuring a Solr Query connector, you must specify the following:
- Index Settings:
- Solr Connection: URL used to connect to Solr. The format is zkHost:zkPort[/zkPath], using the host and port of the ZooKeeper used by Solr.
- Index: Name of the Solr index.
- Query batch size: Number of documents to be requested from Solr in a single request.
- Query Settings:
- Query: The search query to send to Solr.
- Fields: Comma-separated list of fields to request for the query, or use * to include all fields returned by default.
- Filter Queries:
- Add Item: Add a document filter to your query. After entering a name, you can either add the item or cancel it.
- Select Fields: Select the document fields you want to filter.
- Delete Selected Fields: Delete the document fields you previously selected.
- Results Settings:
- Unique Key: The unique key that identifies the Solr document. For HCI internal indexes, this is generally the HCI_id field.
- Display Name: The display name of the Solr document. For HCI internal indexes, this is generally the HCI_displayName field. If not specified, the Unique Key is used as the display name.
Configuration settings for an Internal Index Query connector
When configuring an Internal Index Query connector, you must specify the following:
- Index Settings:
- Index: Name of the Solr index.
- Query batch size: Number of documents to be requested from Solr in a single request.
- Query Settings:
- Query: The search query to send to Solr.
- Fields: Comma-separated list of fields to request for the query, or use * to include all fields returned by default.
- Filter Queries:
- Add Item: Add a document filter to your query. After entering a name, you can either add the item or cancel it.
- Select Fields: Select the document fields you want to filter.
- Delete Selected Fields: Delete the document fields you previously selected.
Crawling behavior for Solr Query connectors
When crawling, Solr Query connectors execute the configured queries against a Solr index and create an HCI document for each Solr query result. The HCI document retains all fields returned by the query.
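As a rough illustration of what such a query looks like at the Solr level, the sketch below sends a request to Solr's standard /select endpoint using Java's built-in HTTP client. The host, index name, query, fields, and filter query are hypothetical, and the Query, Fields, Filter Queries, and Query batch size options correspond approximately to the q, fl, fq, and rows parameters; this is not the connector's own implementation.

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class SolrQueryCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical Solr host and index name.
        String base = "http://solr.example.com:8983/solr/myIndex/select";

        // q, fl, fq, and rows roughly mirror the Query, Fields, Filter Queries,
        // and Query batch size configuration options.
        String params = "q=" + URLEncoder.encode("*:*", StandardCharsets.UTF_8)
                + "&fl=" + URLEncoder.encode("HCI_id,HCI_displayName", StandardCharsets.UTF_8)
                + "&fq=" + URLEncoder.encode("City:London", StandardCharsets.UTF_8)
                + "&rows=100&wt=json";

        HttpRequest request = HttpRequest.newBuilder(URI.create(base + "?" + params)).GET().build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // Each document in the JSON response would become one HCI document,
        // keeping the fields returned by the query.
        System.out.println(response.body());
    }
}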