Built-in stages
Workflow Designer includes a number of built-in stages that you can use in your pipelines.
You can also write your own custom stage plugins.
Add Metadata stage
The Add Metadata stage adds basic metadata from a data source to a document, referenced by its URI.

- Document URI: The URI used to gather basic metadata from the configured data connection. Supports the ${fieldName} syntax for building a dynamic, per-document URI from field values.
- Data Connection: The data connection from which to request the specified URI.
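For example (illustrative; the host and path are hypothetical), if a document includes the field/value pair HCI_displayName: file1 and the Document URI is set to https://example.com/metadata/${HCI_displayName}, the stage gathers metadata for https://example.com/metadata/file1 from the selected data connection for that document.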
Appender stage
The Appender stage lets you:
- Add text to the end of a field value or to the end of a stream.
- Add a field value to the end of another field value or to the end of a stream.
If a field has multiple values, the Appender stage appends the specified text or field value to each value.
To append one stream to another, use the Mapping stage.

Input: Documents
Output:
- Documents with field/value pairs that include the appended text.
- Documents with streams that reference the document's contents plus the appended text.
- Append Field to Field operation:
- Target Field Name: The field whose values you want to append to.
- Source Field Name: The field whose values you want to append to the target field.
- Append Field to Stream operation:
- Target Stream Name: The stream you want to append to.
- Source Field Name: The field whose values you want to append to the target stream.
- Append Text to Field operation:
- Target Field Name: The field whose values you want to append text to.
- Custom Text: The text you want to append.
- Append Text to Stream operation:
- Target Stream Name: The stream you want to append to.
- Custom Text: The text you want to append.
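For example (illustrative; the field name and text are hypothetical), an Append Text to Field operation with a Target Field Name of title and Custom Text of -archived changes a title value of report to report-archived; if title has several values, the text is appended to each of them.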
Attach Stream stage
The Attach Stream stage lets you add a stream from one document to another document, allowing the two documents to be indexed and returned in search results as a single document.
For example, you can use this stage to associate the contents of an audio recording transcript with the recording itself. That way, users can more easily find the actual recording, not just the transcript, in a search.
You can also use this stage to facilitate writing custom metadata to objects in HCP. For information, see Writing file contents as custom metadata to HCP objects.
In this example, a folder in an HCP namespace includes both an audio file and its associated text transcript file:
https://ns1.ten1.myhcp.example.com/r...recording1.mp3
https://ns1.ten1.myhcp.example.com/r...rding1.mp3.txt
The Attach Stream stage accesses the contents of the text file in the data connection and adds it to the audio file as a new stream called transcript.

Input: Documents
Output: Documents with additional streams.
- Stream Name: The name of the stream to add to each document processed.
- Stream URI: The location in the specified data connection for the data to add as a stream.
- Data connection selection: The data connection from which to find the data to attach as a stream. Options are:
- Use Document Data Connection: Use the connection that the document was originally read from.
- Specify Data Connection: Select an existing data connection.
- Additional Custom Stream Metadata: Key/value pairs to add as metadata on the stream being added.
- Use the ${fieldName} syntax to include document field values in the Stream Name, Stream URI, and Additional Custom Stream Metadata settings. For example, if a document includes this field/value pair:
HCI_displayName: file1
and you specify this for the Stream Name setting:
contents-of-${HCI_displayName}
the stage adds a stream called contents-of-file1 to the document.
- Consider excluding from your workflow the documents that you're reading streams from. Because you're adding the contents of one document to another document, you might not need to index or even process both documents in your workflow.
Cache Stream stage
The Cache Stream stage reads the stream for a document into a temporary file stored on your system. Any subsequent stages that need access to a stream read the stream's contents from the local, temporary file instead of from the original data source. This is useful if your data sources are expensive or slow to read data from.
The temporary files that this stage creates for a document are deleted when the document exits the workflow pipeline.

Input: Documents with content stream references that point to data sources.
Output: Documents with content stream references that point to temporary files stored in the system.
Input stream: The name of the stream to read from and save to a temporary file. The default is HCI_content, which is included by default in each file that the system reads.
Place this stage close to the beginning of your workflow pipeline, before any other stages that read document streams. Ideally, this stage should be the only one that reads document contents from the data source.
Content Class Extraction stage
A Content Class Extraction stage uses a content class to extract additional fields from the text-based documents that pass through it.
For information on content classes, see Content classes.
Take for example this XML file for a patient's blood pressure readings:
<xml>
  <date>2012-09-17</date>
  <patient>
    <name>John Smith</name>
    <age>56</age>
    <sex>M</sex>
  </patient>
  <diastolic>60,mm[Hg]</diastolic>
  <systolic>107,mm[Hg]</systolic>
  <assessment>low</assessment>
</xml>
And take for example a content class named BloodPressureXMLContentClass that contains these content properties:
| Content property name | Content property expression |
| --- | --- |
| diastolic | /xml/diastolic |
| systolic | /xml/systolic |
| assessment | /xml/assessment |
| patientName | /xml/patient/name |
| overFifty | boolean(/xml/patient/age > 50) |
This illustration shows the effect of running the example.xml file through a Content Class Extraction stage that's configured to use BloodPressureXMLContentClass.
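For reference, running example.xml through this stage would add field/value pairs like the following (illustrative; exact formatting can vary):
diastolic: 60,mm[Hg]
systolic: 107,mm[Hg]
assessment: low
patientName: John Smith
overFifty: true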

Input: XML, JSON, or other text-based documents.
Output: XML, JSON, or other text-based documents with additional field/value pairs.
- Content class selection: Select one or more existing content classes for the stage to use.
- Stream selection: Specify the streams that you want to apply the content class to.
The default is HCI_content. The Text and Metadata Extraction stage extracts full document content as a stream named HCI_content, so the default value should cover the majority of cases.
If you have a custom stage that adds streams to documents and you want to apply content classes to those streams, you need to specify them here on the content class stage.
- Metadata field selection: Specify the fields that you want to apply the content class to.
- Stream or field content size limit: The maximum size in bytes for the content that this stage can process. If the value of a specified field or stream is larger than this limit, the Content Class Extraction stage does not process that field or stream.
The default is 100,000,000 bytes.
Specify -1 for no limit.
- Extracted field size limit: The maximum size in bytes for each field added by this stage. Use this to manage how the stage handles extracted fields that contain large values.
This setting cannot be greater than 1048576 (1 MB), which is also the default.
You can configure how this stage should handle fields that exceed the size limit:
- Fail: The stage produces a failure for the document. This is the default.
- Store as stream: The extracted data is stored in the document as a stream, rather than a field.
- Truncate: Only a subset of the data is stored in the document. The subset is equal in size to the Extracted field size limit setting; the remaining data is not saved.
- Ignore: The field is ignored and not added to the document.
- Parse Multivalued Fields: When disabled (default), multiple values are stored as a single string where each value found is separated by a semicolon. When enabled, this string is instead parsed into a typed multivalued field, merging any duplicate values.
Take for example this XML:
<dates>
  <date>2013-04-16T00:00:00-0400</date>
  <date>2013-04-16T00:00:00-0400</date>
  <date>2015-12-16T05:48:45-0400</date>
</dates>
With the Parse Multivalued Fields option disabled, parsing this XML using the content property expression /dates/date yields this field/value pair:
date: "2013-04-16T00:00:00-0400;2013-04-16T00:00:00-0400;2015-12-16T05:48:45-0400"
This field/value pair is not recognized as a date by other stages or index collections. You will need to use the Tokenizer stage to split up the values into separate fields to make them usable. However, with the Parse Multivalued Fields option enabled, the Content Class Extraction stage produces a usable, multivalued field/value pair:
date: Wed Dec 16 09:48:45 UTC 2015 | Tue Apr 16 04:00:00 UTC 2013
This stage can add new fields to documents. For a field to be indexed, its name:
- Cannot contain hyphens (-) or any other special characters.
- Cannot start with underscores (_) or numbers.
Beginning a field name with a dollar sign ($) causes the field to be deleted when the workflow pipeline finishes processing the associated document. The field is not indexed.
Use this technique to prevent unnecessary fields from being indexed. For example, you can add a field called $meetsCondition to a document to satisfy a conditional statement later on in the pipeline, but the field might not include any valuable information for your users to search on.
A Content Class Extraction stage is useful only if your workflow inputs contain XML, JSON, or other text-based files.
Date Conversion stage
A document might fail to be indexed if it contains a field with a date value that does not conform to the ISO 8601 format standard. In a workflow task, such a document might fail with this message:
SolrPlugin error: {"responseHeader":{"status":400,"QTime":1},"error":{"msg":"Invalid Date String:'<date>'","code":400}}
You can ensure that the document is indexed by adding a Date Conversion stage to your pipeline. This stage takes date values for the fields you specify and converts them to the expected format.

Input: Documents with incorrect date values.
Output: Documents with converted date values.
- Fields to Process: A list of fields for which to convert date values. The list contains some default fields, but you can add more.
- Scan Formats: A list mapping regular expressions to date formats. Each list item contains:
- A regular expression used to select strings of characters in a field value.
- A scan pattern that identifies the corresponding date format for the regular expression.
For example, the regular expression ^\d{1,2}-\d{1,2}-\d{4}$ matches a string containing one or two digits, followed by a hyphen, one or two digits, another hyphen, and four digits. This pattern corresponds to the date format d-M-yyyy.
By default, the Date Conversion stage contains a number of scan formats, but you can add more.
Tip: You can use regular expressions with non-capturing groups to omit irrelevant characters from field values. For example, in this date value, the digits 123 are not part of the date and should be excluded from conversion:
19570110_220434.123
To omit the extra digits, use a regular expression that selects only the date portion. For example:
\d{8}_\d{6}(?=\.\d{3})
The corresponding date format for this expression is:
yyyyMMdd_HHmmss
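As a sketch of how a scan format works, the following Nashorn-style JavaScript (illustrative only, not the stage's implementation; the sample value is made up) shows a value that matches the d-M-yyyy example above being parsed with the corresponding Java date pattern and rewritten in ISO 8601 form:
// Illustrative sketch: match a value against a scan format's regular expression,
// then interpret it with the corresponding date pattern.
var SimpleDateFormat = Java.type("java.text.SimpleDateFormat");
var value = "17-9-2012";
if (/^\d{1,2}-\d{1,2}-\d{4}$/.test(value)) {
    var parsed = new SimpleDateFormat("d-M-yyyy").parse(value);  // day 17, month 9, year 2012
    var iso = new SimpleDateFormat("yyyy-MM-dd").format(parsed); // "2012-09-17"
}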
Use this stage after a Text and Metadata Extraction stage to normalize any date fields that stage adds.
Decompression stage
The Decompression stage converts a compressed file into a decompressed file for processing.
This stage can process gzip, bzip2, and xz files.
Input: Compressed files.
Output: Decompressed files.
- Document Stream Name: The name of the stream that you want the stage to examine.
The default stream is HCI_content, which is added to every document that the system reads.
- Include source file: When enabled, the original file itself is output by the stage, in addition to the decompressed file.
- For a document to be processed by this stage, it must contain a Content_Type field added by the MIME Type Detection stage. The value must be application/x-bzip2, application/x-xz, application/x-gzip, or application/gzip.
- You can also change the Content_Type field in archived files to process archived gzip files (.tar.gz), archived bzip2 files (.tar.bz2), and archived xz files (.tar.xz). However, this will not work with files that have a .tgz extension.
- Use the TAR Expansion stage instead of this stage to expand .tgz files.
- To determine the MIME types of new documents created by this stage, you should precede or follow this stage with a MIME Type Detection stage, depending on whether recursion is enabled for the workflow:
- If either the Workflow Agent Recursion or Preprocessing Recursion settings are enabled for the workflow task, precede this stage with a MIME Type Detection stage.
With these settings enabled, all new documents are sent to the beginning of the applicable workflow pipeline, rather than on to the next stage.
- If both recursion settings are disabled, follow this stage with a MIME Type Detection stage.
- If you need to expand very large documents, use the Preprocessing execution mode for the workflow pipeline that contains this stage.
For more information, see Pipeline execution modes.
DICOM Metadata Extraction stage
The DICOM Metadata Extraction stage extracts text metadata fields from files formatted in DICOM, a standard file format for medical images.

Input: DICOM image documents.
Output: DICOM image documents with added fields.
- Input Stream Name: For each document entering the stage, this is the stream from which to extract field/value pairs. The default is HCI_content.
- DICOM field prefix: A string of characters to apply as the prefix to all fields added by the stage. The default is DICOM_.
- Extracted Character Limit: The maximum number of characters to extract from the input stream.
This stage can add new fields to documents. For a field to be indexed, its name:
- Cannot contain hyphens (-) or any other special characters.
- Cannot start with underscores (_) or numbers.
Beginning a field name with a dollar sign ($) causes the field to be deleted when the workflow pipeline finishes processing the associated document. The field is not indexed.
Use this technique to prevent unnecessary fields from being indexed. For example, you can add a field called $meetsCondition to a document to satisfy a conditional statement later on in the pipeline, but the field might not include any valuable information for your users to search on.
Drop Documents stage
All documents sent to this stage are deleted from the pipeline.
The Discoveries window on the task details page shows the number of documents dropped during a task run. For information, see Task details, status, and results.

Input: Documents
Output: None
Settings: None
To control which documents are sent to this stage, enclose the stage within a conditional statement.
Use this stage as early as possible in your pipeline so you can avoid having other stages unnecessarily process your unwanted documents.
Place this stage in one of these positions:
- After a MIME Type Detection stage, so you can drop documents based on their MIME types.
- If your workflow is not recursive, after an expansion stage that's immediately followed by a MIME Type Detection stage. That way, you can drop documents from an archive file based on their MIME types.
Document Security stage
The Document Security stage applies access control lists (ACLs) to documents that pass through it. You can use these ACLs to configure which documents are made available to which users.
Index collection query settings are used to configure whether a search index honors ACLs. For more information, see Configuring access control settings in query settings.
If you apply multiple ACL fields to a document and configure a set of query settings to enforce all of those fields, your system honors the ACL settings in this order: Public, Deny ACLs, Allow ACLs.

In this example, if the applicable query settings are configured to honor deny ACLs, any user in the RestrictedUsers group will not be able to find the archive.txt file.
Input: Documents
Output: Documents with access control fields added:
- HCI_denyACL: Values for this field are one or more unique identifiers for users and groups.
- HCI_allowACL: Values for this field are one or more unique identifiers for users and groups.
- HCI_isPublic: If the Document Security stage marks a document as publicly accessible, the value for this field is true. Otherwise, this field is not added to documents.
- Set Document Visibility:
- Set Public: The stage adds the HCI_isPublic field to a document. Use the Enforce Public Setting in query settings to specify whether documents with this field are displayed in search results.
- Enable ACLs: The stage adds HCI_allowACL and HCI_denyACL fields to a document.
- Add these groups to the Allow ACL: Specifies the groups to add to the ACL that allows access to a document.
When configuring this and the following setting, you can browse through the user groups that your system administrator has added to Hitachi Content Intelligence.
- Add these groups to the Deny ACL: Specifies the groups to add to the ACL that denies access to a document.
- Add these custom tokens to the Allow ACL: Specifies the users to add to the ACL that allows access to a document.
Note: When configuring this and the following setting, you need to specify the unique identifiers (for example, Active Directory SIDs) for user accounts in the identity providers that your system administrator has added to Hitachi Content Intelligence. For more information, see your system administrator.
- Add these custom tokens to the Deny ACL: Specifies the users to add to the ACL that denies access to a document.
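For example (illustrative, using the group from the example above), enabling ACLs and adding the RestrictedUsers group to the Deny ACL results in documents that carry an HCI_denyACL field whose value is that group's unique identifier rather than its display name.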
- Use this as the last stage of a workflow pipeline. That way, you ensure that all documents exiting the pipeline have passed through this stage on their way to the workflow outputs.
Email Expansion stage
The Email Expansion stage creates documents for any attachments it detects in .eml and .msg files. By default, the stage also outputs the original email documents that entered the stage.
This stage affects only .eml and .msg files. It has no effect on other types of files or archives.

Input: Documents for .eml or .msg files.
Output: Documents for .eml or .msg files and documents for any detected attachments.
Documents produced by this stage include these fields, which identify the file from which the documents were expanded:
- HCI_parentDisplay
- HCI_parentId
- HCI_parentUri
- Document Stream Name: The name of the stream that you want the stage to examine.
The default stream is HCI_content, which is added to every document that the system reads.
- Include source email: When enabled, the original email document itself is output by the stage.
- Include embedded non-file streams: By default, the Email Expansion stage extracts and creates documents for email attachments only if it detects that the attachments have filenames. When you enable this option, the stage creates documents for any stream it detects in the original email document.
- Make the attachments field multivalued: When disabled (default), the HCI_attachments field's value is stored as a single string on the original email document. Each value found is separated by a comma. When enabled, this string is instead parsed into a multivalued field, merging any duplicate values.
- Maximum Text Character Length: The maximum number of characters to extract into the HCI_text stream.
- Expanded File Path: Determines the values for the HCI_path and HCI_relativePath fields for documents extracted from an archive file.
The options are:
- Expand to a path named after the archive file: Creates field values based on the name of the archive from which documents were expanded.
For example, with this option selected, a document expanded from an archive file /2016/feburaryMessages.eml will have these field/value pairs:
- HCI_path: "2016/feburaryMessages.eml-expanded/"
- HCI_relativePath: "/feburaryMessages.eml-expanded/"
Use this option to link expanded documents back to the archives from which they were expanded.
- Use original expanded file path: Tries to use the original expanded file path, if found in the archive.
For example, with this option selected, a document expanded from an archive file /2016/feburaryMessages.eml will have these field/value pairs:
- HCI_path: "2016/"_path: "2016/feburaryMessages.eml-expanded/"
- HCI_relativePath: "/"
Use this option when you are writing the expanded documents to a data source.
- Customize the expanded base path: Allows you to specify the expanded file path to use.
For example, if you selected this option and specified a custom path value of myCustomPath, a document expanded from an archive file /2016/feburaryMessages.eml will have these field/value pairs:
- HCI_path: "2016/myCustomPath/"
- HCI_relativePath: "/myCustomPath/"
Use this option when you are writing the expanded documents to a data source.
If you follow this expansion stage with an Execute Action stage that writes files to a data source, the Expanded File Path setting affects how the expanded files are written:
- With the Expand to a path named after the archive file option selected, each email message that the stage processes will be expanded to its own folder.
- With either the Customize the expanded base path or Use original expanded file path option selected, all email messages that the stage processes will be expanded to the same folder.
If multiple email messages contain files that have the same name, the Execute Action stage tries to write multiple different files to the same path in the data source. The results depend on the data source.
For example, if multiple email messages contain a file called notes.txt, the data source might end up with one of the following:
- A single file called notes.txt that has multiple versions.
- A single file called notes.txt, the contents of which match the last notes.txt file to be written to the data source.
- An error writing the file, if the data source does not allow existing files to be overwritten.
Surround this stage with a conditional statement that allows only email message documents (.eml and .msg files) to pass through. That way, you avoid processing documents of the wrong type.
To determine the MIME types of new documents created by this stage, you should precede or follow this stage with a MIME Type Detection stage, depending on whether recursion is enabled for the workflow:
If either the Workflow Agent Recursion or Preprocessing Recursion settings are enabled for the workflow task, precede this stage with a MIME Type Detection stage.
With these settings enabled, all new documents are sent to the beginning of the applicable workflow pipeline, rather than on to the next stage.
If both recursion settings are disabled, follow this stage with a MIME Type Detection stage.
If you need to expand very large email message documents, use the Preprocessing execution mode for the workflow pipeline that contains this stage.
Email Notification stage
The Email Notification stage sends notifications by email while data is being processed. A single notification is sent for every document processed.

Input: Documents
Output: Processed documents and email notifications for each document.
SMTP Settings:
- Host: The hostname or IP address of the SMTP host.
- Port: The SMTP port. The default is SMTP port 25.
- Security: Security options. Choose from None (default), STARTTLS, or SSL.
- Authentication: Enable or disable authentication. By default, this is disabled.
Message Settings:
- From: The source address of the email notification. The default is pipeline@hci.com.
- Subject: The subject of the email notification.
To include a field value in an email notification body or subject, use this syntax:
${<field-name>}
For example:
${HCI_size}
The default is Processed document ${HCI_URI}.
If, for example, you keep the default and the stage processes a document with the URL http://example.com/document.pdf, the resulting message will be:
Processed document http://example.com/document.pdf
To include an aggregation value in an email notification body or subject, use this syntax:
${<aggregation-name>}
For example:
${Extensions}
If the aggregation name contains spaces, replace them with underscores (_). For example:
${Discovered_Fields}
- Body: The body of the email notification. This can be either plain text or HTML.
- Body Format: The format for the contents of the Body field. Either Plain Text or HTML.
- Email Addresses: A comma-separated list of recipient email addresses.
Because this stage sends an email for every document it encounters, you should surround it with a conditional statement to limit the documents that reach the stage.
Execute Action stage
You can use Execute Action stages in your pipelines to perform actions supported by various components. For example, you can use an Execute Action stage with an HCP data connection to edit the metadata for an object in an HCP namespace.
Each component supports its own set of actions, or none at all. When you configure an Execute Action stage in a pipeline, you specify a component, the action you want the stage to perform, and any additional configuration settings required for that action.
This example shows how an HCP data connection can be used in an Execute Action stage to delete files from an HCP namespace when a workflow task runs.

For examples of how to use this stage, see:
- Writing custom metadata to HCP objects.
- Migrating data to HCP from Amazon S3.
- Adding document caching to your workflows.
- Using the Index action.
You use Execute Action stages by adding them to pipelines. For more information, see Adding actions to a pipeline.
The configuration settings for data connection actions differ depending on the data connection. For information on these settings, see the topic for the applicable data connection under Data connection types and settings.
Filter stage
The Filter stage lets you remove unwanted fields and streams from documents.
This stage cannot remove these required fields:
- HCI_id
- HCI_URI
- HCI_displayName
- HCI_dataSourceUuid
- HCI_doc_version

Input: Documents
Output: Documents with field/value pairs or streams removed.
- Filtering type:
- Whitelist: Only the fields you specify are kept for further processing.
- Blacklist: The fields you specify are discarded while all other fields are left intact.
- Filter fields or streams: Option to filter either fields or streams.
- Filter by patterns: Specifies whether the stage uses regular expressions to match field or stream names.
When disabled, the stage filters only the fields or streams whose names exactly match the names you specify.
- Field/Stream Name: A list of field or stream names to filter. If the Filter by patterns option is enabled, you can specify regular expressions in this list.
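For example (illustrative field names), a Whitelist filter on the fields HCI_displayName and HCI_size keeps only those fields (plus the required fields listed above) for further processing, while a Blacklist filter on the same names removes those two fields and leaves everything else intact.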
Field Parser stage
The Field Parser stage takes a specified field, examines that field's values, and splits those values into smaller segments named tokens. You can then select the tokens that you want and add them as values to other fields.
In this example, a CSV file was passed through a Read Lines stage to produce a document for each line in the file. One of those documents is now passed through a Field Parser stage where the value for the $line field is split into separate fields.

In this example, a previous stage has added a from_address field to a document. The Field Parser stage is configured to extract a username from that field.

Input: Documents with fields.
Output: Documents with additional fields.
| Setting | Description |
| --- | --- |
| Input Field Name | The name of the field whose value you want to parse. |
| Tokenize Input Field Value | Whether to split the input field value into multiple tokens. |
| Tokenization Delimiter | If Tokenize Input Field Value is enabled, a character or regular expression used to determine when one token ends and another begins. For example, specify , (comma) to split lines from a CSV file. |
| Substrings to Parse | Use this setting to parse only a subset of the input field value, bounded by Start and End sequences. If you omit a Start sequence, the stage parses the field value from its beginning to the End sequence. If you omit an End sequence, the stage parses the field value from the Start sequence to its end. |
| Token to Field Mapping | A list of token index numbers and the field names that each maps to. Valid Token Index values are integers starting at 0. |
| Add Debug Fields | When enabled, the stage adds $debugParser fields to each document it processes. The stage adds one of these fields for each token produced from the selected text. |
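For example (illustrative; the delimiter and target field names are hypothetical), suppose the input field $line contains 00:01,INFO,resource created. With Tokenize Input Field Value enabled and a Tokenization Delimiter of , (comma), the stage produces tokens 0, 1, and 2; a Token to Field Mapping of 0 to time, 1 to level, and 2 to message would add the field/value pairs time: 00:01, level: INFO, and message: resource created to the document.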
This stage can add new fields to documents. For a field to be indexed, its name:
- Cannot contain hyphens (-) or any other special characters.
- Cannot start with underscores (_) or numbers.
Beginning a field name with a dollar sign ($) causes the field to be deleted when the workflow pipeline finishes processing the associated document. The field is not indexed.
Use this technique to prevent unnecessary fields from being indexed. For example, you can add a field called $meetsCondition to a document to satisfy a conditional statement later on in the pipeline, but the field might not include any valuable information for your users to search on.
Geocoding stage
The Geocoding stage adds new fields to documents that contain latitude and longitude data. For each set of coordinates, the stage adds new fields that contain the corresponding city, state or province, country, and time zone.

Input: Documents with latitude and longitude fields.
Output: Documents with added city, country, state/province, and time zone fields.
- Latitude field name: Specifies which field in the incoming documents contains latitude data.
- Longitude field name: Specifies which field in the incoming documents contains longitude data.
- City field name: The name of a field to add to documents. This field contains the name of a city.
- State/Province field name: The name of a field to add to documents. This field contains the name of a governmental region.
- Country field name: The name of a field to add to documents. This field contains the name of a country.
- Time zone field name: The name of a field to add to documents. This field contains the name of a time zone.
This stage can add new fields to documents. For a field to be indexed, its name:
- Cannot contain hyphens (-) or any other special characters.
- Cannot start with underscores (_) or numbers.
Beginning a field name with a dollar sign ($) causes the field to be deleted when the workflow pipeline finishes processing the associated document. The field is not indexed.
Use this technique to prevent unnecessary fields from being indexed. For example, you could add a field called $meetsCondition to a document in order to satisfy a conditional statement later on in the pipeline, but the field may not include any valuable information for your users to search on.
HCP Access Log Parser stage
The HCP Access Log Parser stage reads an individual line from HCP access log files and turns it into a metric that can be displayed in the Monitor App.
Input: Access log file lines.
Output: Documents with the metrics displayed in the Monitor App.
- Field to Parse: The name of the field to process. The default is line.
- Output Index Name: The name of the index to send processed documents to. The default is monitorAccessLogIndex.
- Signal System: The name of the signal system being monitored. When left blank, the signal system is automatically detected based on information in the logs.
This stage is used by the Monitor App.
HCP System Events Parser stage
The HCP System Events Parser stage reads an individual line from HCP admin events log files and turns it into a metric that can be displayed in the Monitor App.
Input: Admin events log file lines.
Output: Documents with the metrics displayed in the Monitor App.
- Field to Parse: The name of the field to parse. The default is line.
- Output Index Name: The name of the index to send processed documents to. The default is monitorEventsIndex.
- Signal System: The name of the signal system being monitored. When left blank, the signal system is automatically detected based on information in the logs.
This stage is used by the Monitor App.
HCP Chargeback Report Parser stage
The HCP Chargeback Report Parser stage reads an individual line from HCP chargeback log files and turns it into a metric that can be displayed in the Monitor App.
Input: Chargeback log file lines.
Output: Documents with the metrics displayed in the Monitor App.
- Field to Parse: The name of the field to parse. The default is line.
- Output Index Name: The name of the index to send processed documents to. The default is monitorMetricsIndex.
- Signal System: The name of the signal system being monitored. When left blank, the signal system is automatically detected based on information in the logs.
This stage is used by the Monitor App.
HCP Resource Metrics Parser stage
The HCP Resource Metrics Parser stage reads resource metrics files from HCP and turns every line into a metric that can be displayed in the Monitor App.
Input: Resource metrics files.
Output: Documents with the metrics displayed in the Monitor App.
- Input Stream Name: The name of the input stream. The default is HCI_content.
- Output Index Name: The name of the index to send processed documents to. The default is monitorMetricsIndex.
- Signal System: The name of the signal system being monitored. When left blank, the signal system is automatically detected based on information in the logs.
This stage is used by the Monitor App.
JavaScript stage
The JavaScript stage lets you specify your own JavaScript code for making transformations to documents. The stage has two modes:
- Single Field Transformation Mode, for editing individual field values.
- Document Transformation Mode, for adding, removing, and editing multiple fields and streams in a document.
This stage supports JavaScript that conforms to the ECMAScript-262 Edition 5.1 standard.
Single Field Transformation Mode
In this mode, you specify both:
- The name of a field whose value you want to change, and
- An implementation for the transformString() function. Your implementation needs to return a new object for the field value. Hitachi Content Intelligence later converts this returned object to a string.
In this mode, your JavaScript code has access only to the specified field. It cannot read or change any other field in the document.
- Field Name: The name of the field to transform. The value for this field is used as the parameter for the transformString() function.
- JavaScript: Your JavaScript code.
- Allow non-existent values: Determines how the stage behaves if the specified field doesn't exist in an incoming document:
- When enabled, the stage passes null as a parameter into your JavaScript code. The specified field is then added to the document. Its value is the return value of the transformString() function.
- When disabled, the stage reports a document failure.

JavaScript field contents:
/**
 * Transforms a field value
 *
 * @param {String} fieldValue The current value of the field
 * @return {Object} The Object on which toString() will be called to set the new value for the field
 */
function transformString(fieldValue) {
    return fieldValue.toLowerCase();
}
Document Transformation Mode
This mode lets you make transformations to an entire document. For example, you can use this stage to remove fields from a document or add together the values of multiple fields and store the sum in a new field.
In this mode, you provide an implementation for the transformDocument() function. The inputs to this function are:
- callback: A PluginCallback object. Contains utility methods that allow the JavaScript stage to work with documents.
- document: The Document object for the incoming document.
The function must return either a Document object or, in the case that your stage generates additional documents, an Iterator<Document> object.
The PluginCallback objects, Document objects, and others are defined by the Hitachi Content Intelligence plugin SDK. The stage allows your JavaScript code to access the functionality provided by the SDK for working with documents. The SDK is written in Java, not JavaScript. For information on obtaining the plugin SDK and its documentation, see your Hitachi Content Intelligence administrator.
Example: Create a new field that's the product of two existing fields

JavaScript field contents:
/**
 * Transforms a document
 *
 * @param {PluginCallback} callback The callback to access execution context
 * @param {Document} document The document to be transformed (immutable)
 * @return {Iterator<Document> or Document} A single Document, or an iterator
 *         containing the new document that is the result of the transformation
 */
function transformDocument(callback, document) {
    var IntegerDocumentFieldValue = Java.type("com.hds.ensemble.sdk.model.IntegerDocumentFieldValue");
    var heightValue = parseInt(document.getStringMetadataValue("heightInches"));
    var widthValue = parseInt(document.getStringMetadataValue("widthInches"));
    var builder = callback.documentBuilder().copy(document);
    builder.setMetadata("areaInches",
        IntegerDocumentFieldValue.builder().setInteger(heightValue * widthValue).build());
    return builder.build();
}
To get fields from the document, getStringMetadataValue() can be called with the name of the field:
var exampleValue = document.getStringMetadataValue("exampleFieldName");
Setting a field value needs a DocumentFieldValue object, which can be created by calling Java.type() with the full name of the applicable DocumentFieldValue subclass.
For example, to set a String value for a field, you first create a StringDocumentFieldValue object:
var StringDocumentFieldValue = Java.type("com.hds.ensemble.sdk.model.StringDocumentFieldValue");
You can then use this object to build the value:
var exampleStringValue = StringDocumentFieldValue.builder().setString("exampleFieldValue").build()
And finally set the value to a field in the document builder:
var builder = callback.documentBuilder().copy(document);
builder.setMetadata("exampleFieldName", exampleStringValue);
The types of document field values you can set are:
com.hds.ensemble.sdk.model.BooleanDocumentFieldValue
com.hds.ensemble.sdk.model.DateDocumentFieldValue
com.hds.ensemble.sdk.model.DoubleDocumentFieldValue
com.hds.ensemble.sdk.model.FloatDocumentFieldValue
com.hds.ensemble.sdk.model.IntegerDocumentFieldValue
com.hds.ensemble.sdk.model.LongDocumentFieldValue
com.hds.ensemble.sdk.model.StringDocumentFieldValue
com.hds.ensemble.sdk.model.TextDocumentFieldValue
com.hds.ensemble.sdk.model.UuidDocumentFieldValue
Each one includes builder().set<type>(<type>) methods. For example:
var exampleBooleanValue = BooleanDocumentFieldValue.builder().setBoolean(true).build();
var exampleIntegerValue = IntegerDocumentFieldValue.builder().setInteger(7).build();
For more information, see the documentation included with the Hitachi Content Intelligence plugin SDK.
You can use the callback parameter as a factory to build modified documents:
var builder = callback.documentBuilder().copy(document);
This creates a DocumentBuilder class object, which includes a number of methods for working with the contents of a document. Some of these methods are:
- setMetadata(java.lang.String name, DocumentFieldValue<?> value): Add a single field to this Builder. If metadata with this name already exists, the DocumentFieldValue is replaced.
- removeMetadata(java.lang.String name): Remove a field from this Builder.
- copy(Document original): Copy all of the existing fields (and streams, if relevant) from an existing Document.
- removeStream(java.lang.String name): Remove the named stream from this Builder.
After you've modified the document contents, use DocumentBuilder.build() to build the final document:
return builder.build();
For more information, see the documentation on the DocumentBuilder class in the Hitachi Content Intelligence plugin SDK.
Mapping stage
The Mapping stage lets you:
- Rename document fields or streams.
- Copy a field value or stream data to another field or stream.
- Append a field value or stream data to another field or stream.
For example, you have a document that contains a field named doctor and a document that contains a field named physician. These fields contain equivalent information, that is, names of doctors. To index this information using only the doctor field, you can use a Mapping stage to rename all occurrences of physician to doctor.

Alternatively, say you want to index the names of hospitals under both a field named hospital and a field named facility, but the facility field does not exist in your documents. To index this information under multiple field names, you can use a Mapping stage to duplicate the hospital field by copying it to a new field named facility.

In this example, the field value financial report is appended to the stream HCI_content. This means that when the document leaves the pipeline, the value financial report is indexed as part of the full text of the document, along with the data that the HCI_content stream points to. Users can then use the term financial report to find this document using a simple search in the Search App, rather than having to use an advanced query to search based on values for the type field.

Input: Documents
Output: Documents with renamed or copied field names.
- Mapping Action: Which action the stage should perform:
- Rename Field: Changes the name of a field but keeps its original value.
- Rename Stream: Changes the name of a stream, but the stream continues to point to the same data.
- Copy Field to Field: Copies the value of one field to another field.
- Copy Field to Stream: Copies the value of a field to a stream.
- Copy Stream to Field: Takes the data that a stream points to and adds that data as a value for the specified field.
- Copy Stream to Stream: Copies one stream to another.
- Append Field to Stream: Appends a field value to a stream.
- Append Stream to Stream: Appends one stream to another.
- Targets to Map: A list of fields or streams that the stage should affect. For each list item, you specify:
- Source Name: The field or stream to rename, copy, or append to another stream.
- Target Name: The field or stream to which field values or streams will be added, copied, or appended.
- Overwrite Field:
- When this setting is enabled, if a value for the target field exists, the value is overwritten.
- When this setting is disabled, the output for this field is added as an additional value on the target field.
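For example (illustrative values, based on the doctor/physician example above), a Rename Field action with a Source Name of physician and a Target Name of doctor turns a field/value pair physician: Jane Smith into doctor: Jane Smith on each document that passes through the stage.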
This stage can add new fields to documents. For a field to be indexed, its name:
- Cannot contain hyphens (-) or any other special characters.
- Cannot start with underscores (_) or numbers.
Beginning a field name with a dollar sign ($) causes the field to be deleted when the workflow pipeline finishes processing the associated document. The field is not indexed.
Use this technique to prevent unnecessary fields from being indexed. For example, you can add a field called $meetsCondition to a document to satisfy a conditional statement later on in the pipeline, but the field might not include any valuable information for your users to search on.
Place this stage after a Text and Metadata Extraction stage to standardize or rename the fields added by the Text and Metadata Extraction stage.
If you put a Filter stage after this stage, avoid filtering out fields that you had spent time mapping.
Mbox Expansion stage
The Mbox Expansion stage opens Mbox files and creates documents for the emails within.
This stage affects only Mbox files.

Input: Documents for .mbox files.
Output: Documents for emails inside of .mbox files.
Documents produced by this stage include these fields, which identify the file from which the documents were expanded:
- HCI_parentDisplay
- HCI_parentId
- HCI_parentUri
- Document Stream Name: The name of the stream that you want the stage to examine.
The default stream is HCI_content, which is added to every document that the system reads.
- Include source Mbox: When enabled, the original .mbox file itself is output by the stage, in addition to the documents within.
Note: If the stage produces zero documents and encounters zero documents when processing an archive file, it outputs that archive file regardless of this setting. This can happen, for example, when the stage processes an empty archive.
- Maximum Text Character Length: When this stage processes a document, it reads the contents of the document into system memory. For pipelines with many large documents, this can significantly impact processing time.
With this setting, you can limit the number of characters that this stage reads per document; after the limit is reached, the stage stops reading the file and extracting metadata fields from it. To disable this setting, specify -1.
- Expanded File Path: Determines the values for the HCI_path and HCI_relativePath fields for documents extracted from an archive file.
The options are:
- Expand to a path named after the archive file: Creates field values based on the name of the archive from which documents were expanded.
For example, with this option selected, a document expanded from an archive file /2016/feburary.mbox will have these field/value pairs:
- HCI_path: "2016/feburary.mbox-expanded/"
- HCI_relativePath: "/feburary.mbox-expanded/"
- Use original expanded file path: Tries to use the original expanded file path, if found in the archive.
For example, with this option selected, a document expanded from an archive file /2016/feburary.mbox will have these field/value pairs:
- HCI_path: "2016/"
- HCI_relativePath: "/"
Use this option when you are writing the expanded documents to a data source.
- Customize the expanded base path: Allows you to specify the expanded file path to use.
For example, if you selected this option and specified a custom path value of myCustomPath, a document expanded from an archive file /2016/feburary.mbox will have these field/value pairs:
- HCI_path: "2016/myCustomPath/"
- HCI_relativePath: "/myCustomPath/"
- If you follow this expansion stage with an Execute Action stage that writes files to a data source, the Expanded File Path setting affects how the expanded files are written:
- With the Expand to a path named after the archive file option selected, each .mbox that the stage processes will be expanded to its own folder.
- With either the Customize the expanded base path or Use original expanded file path option selected, all .mbox files that the stage processes will be expanded to the same folder.
If multiple .mbox files contain files that have the same name, the Execute Action stage tries to write multiple different files to the same path in the data source. The results depend on the data source. For example, if multiple .mbox files contain a file called notes.txt, the data source might end up with one of the following:
- A single file called notes.txt that has multiple versions.
- A single file called notes.txt, the contents of which match the last notes.txt file to be written to the data source.
- An error writing the file, if the data source does not allow existing files to be overwritten.
- Surround this stage with a conditional statement that allows only .mbox documents to pass through. That way, you avoid processing documents of the wrong type.
- To determine the MIME types of new documents created by this stage, you should precede or follow this stage with a MIME Type Detection stage, depending on whether recursion is enabled for the workflow:
- If either the Workflow Agent Recursion or Preprocessing Recursion settings are enabled for the workflow task, precede this stage with a MIME Type Detection stage.
With these settings enabled, all new documents are sent to the beginning of the applicable workflow pipeline, rather than on to the next stage.
For more information on these settings, see Task settings.
- If both recursion settings are disabled, follow this stage with a MIME Type Detection stage.
- If you need to expand very large .mbox documents, use the Preprocessing execution mode for the workflow pipeline that contains this stage.
MIME Type Detection stage
Detects the Multipurpose Internet Mail Extension (MIME) types of documents. This stage adds a field named Content_Type to the documents it processes.

Input: Documents
Output: Documents with a new Content_Type field/value pair that identifies the MIME type.
- Custom extension mapping: You can specify a file extension and the MIME type that you want to associate with it. You can add any number of these mappings, but each mapping can associate only one extension with only one MIME type.
By default, this stage is configured with custom extension mapping entries for a number of file types that the MIME Type Detection stage might take a long time to process (for example, .zip archives and video files).
- Force Detection: When this setting is enabled, the MIME Type Detection stage always processes incoming documents and overwrites any Content_Type fields already present in a document.
- Overwrite Field:
- When this setting is enabled, if a value for the target field exists, the value is overwritten.
- When this setting is disabled, the output for this field is added as an additional value on the target field.
- Use this option when your documents already have an HCI_filename field/value pair before entering the MIME Type Detection stage (for example, if your data connection adds this field by default to all documents or you have two MIME Type Detection stages in your pipeline).
- Document Stream Name: The name of the stream that you want the stage to examine.
The default stream is HCI_content, which is added to every document that the system reads.
- Use the Custom extension mapping setting to avoid processing documents whose MIME types are time-consuming to discover. Instead of examining a document's contents to determine its MIME type, the stage will automatically assign the specified MIME type to documents that have the specified file extension. However, you should avoid doing this for file extensions that can be mapped to multiple MIME types.
- This stage should be used as the first stage in a pipeline. This allows you to use MIME type as a criterion in conditional statements in subsequent stages.
Minimally, this stage should be used before the Text and Metadata Extraction stage because that stage uses a document's Content_Type value to determine whether to process the document.
- If the Workflow Recursion option is disabled for a task, you should follow each archive expansion stage (for example, TAR Expansion or ZIP Expansion) with a MIME Type Detection stage. For information on workflow recursion, see Task settings.
PST Expansion stage
The PST Expansion stage opens Microsoft Outlook personal storage table (.pst) files and creates documents for the emails and attachments within.
This stage affects only .pst files.
- You need to use a conditional statement to ensure that only the applicable archive files enter this stage.
For an example of how to do this, see Default pipeline.
- This stage does not create separate documents for most types of email attachments. To do that, add an Email Expansion stage to your pipeline. See Email Expansion stage.

Input: Documents for .pst files.
Output: Documents for emails inside of .pst files.
Documents produced by this stage include these fields, which identify the file from which the documents were expanded:
- HCI_parentDisplay
- HCI_parentId
- HCI_parentUri
- Document Stream Name: The name of the stream that you want the stage to examine.
The default stream is HCI_content, which is added to every document that the system reads.
- Include source PST: When enabled, the .pst file itself is output by the stage, in addition to the documents within.
Note: If the stage produces zero documents and encounters zero documents when processing an archive file, it outputs that archive file regardless of this setting. This can happen, for example, when the stage processes an empty archive.
- Make the attachments field multivalued: When disabled (default), the HCI_attachments field's value is stored as a single string on the original email document. Each value found is separated by a comma. When enabled, this string is instead parsed into a multivalued field, merging any duplicate values.
- Maximum Text Character Length: The maximum number of characters to extract into the HCI_text stream.
- Expanded File Path: Determines the values for the HCI_path and HCI_relativePath fields for documents extracted from an archive file.
The options are:
- Expand to a path named after the archive file: Creates field values based on the name of the archive from which documents were expanded.
For example, with this option selected, a document expanded from an archive file /2016/feburary.pst will have these field/value pairs:
HCI_path: "2016/feburary.pst-expanded/"HCI_relativePath: "/feburary.pst-expanded/"
Use this option to link expanded documents back to the archives from which they were expanded.
- Use original expanded file path: Tries to use the original expanded file path, if found in the archive.
For example, with this option selected, a document expanded from an archive file /2016/feburary.pst will have these field/value pairs:
HCI_path: "2016/"HCI_relativePath: "/"
Use this option when you are writing the expanded documents to a data source.
- Customize the expanded base path: Allows you to specify the expanded file path to use.
For example, if you selected this option and specified a custom path value of myCustomPath, a document expanded from an archive file /2016/feburary.pst will have these field/value pairs:
HCI_path: "2016/myCustomPath/"HCI_relativePath: "/myCustomPath/"
Use this option when you are writing the expanded documents to a data source.
If you follow this expansion stage with an Execute Action stage that writes files to a data source, the Expanded File Path setting affects how the expanded files are written:
- With the Expand to a path named after the archive file option selected, each .pst that the stage processes will be expanded to its own folder.
- With either the Customize the expanded base path or Use original expanded file path option selected, all .pst files that the stage processes will be expanded to the same folder.
If multiple .pst files contain files that have the same name, the Execute Action stage tries to write multiple different files to the same path in the data source. The results depend on the data source.
For example, if multiple .pst files contain a file called notes.txt, the data source might end up with one of the following:
- A single file called notes.txt that has multiple versions.
- A single file called notes.txt, the contents of which match the last notes.txt file to be written to the data source.
- An error writing the file, if the data source does not allow existing files to be overwritten.
- Surround this stage with a conditional statement that allows only .pst documents to pass through. That way, you avoid processing documents of the wrong type.
- To determine the MIME types of new documents created by this stage, you should precede or follow this stage with a MIME Type Detection stage, depending on whether recursion is enabled for the workflow:
- If either the Workflow Agent Recursion or Preprocessing Recursion settings are enabled for the workflow task, precede this stage with a MIME Type Detection stage.
With these settings enabled, all new documents are sent to the beginning of the applicable workflow pipeline, rather than on to the next stage.
- If both recursion settings are disabled, follow this stage with a MIME Type Detection stage.
- If you need to expand very large .pst documents, use the Preprocessing execution mode for the workflow pipeline that contains this stage.
Read Lines stage
The Read Lines stage creates a new document for each line in a content stream.
To avoid document failures caused by stages that expect a content stream, use conditional statements to surround all stages that process the HCI_content stream. If the Workflow Agent Recursion task setting is enabled, you also need to surround the Read Lines stage itself with a conditional statement.
Configure the conditional statements with this conditional expression:
Field name: HCI_content
Operator: stream_exists
For information on the Workflow Agent Recursion setting, see Task settings.
In this example, the file 03-03-16.log contains three lines of data:
00:01,INFO,resource created
00:02,INFO,resource edited
00:03,INFO,resource deleted
The Read Lines stage creates a new document for each line.

Documents with streams.
Additional documents, one for each line in the input document stream.
- Input Stream Name: The name of the document stream from which you want to create additional documents.
- Line Output Field Name: For each document produced by this stage, the name of the field in which to store the line of the original document's stream from which this document was created.
The default is $line.
- Number of Lines to Skip: The number of lines to be skipped starting from the beginning of the input stream. This setting allows you to skip document headers.
The default value is 0.
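The behavior described by these settings amounts to something like the following Python sketch. It is illustrative only; the dictionary-based document representation and the function name are assumptions, not the stage's actual implementation.

def read_lines(stream_text, line_field="$line", lines_to_skip=0):
    # Illustration only: create one new document per line of the input stream,
    # after skipping the configured number of header lines.
    new_documents = []
    for line in stream_text.splitlines()[lines_to_skip:]:
        # Each new document carries the line in the configured output field;
        # it does not carry the original document's content streams.
        new_documents.append({line_field: line})
    return new_documents

log_text = ("00:01,INFO,resource created\n"
            "00:02,INFO,resource edited\n"
            "00:03,INFO,resource deleted")
for doc in read_lines(log_text):
    print(doc)
# {'$line': '00:01,INFO,resource created'}
# {'$line': '00:02,INFO,resource edited'}
# {'$line': '00:03,INFO,resource deleted'}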
This stage can add new fields to documents. For a field to be indexed, its name:
- Cannot contain hyphens (-) or any other special characters
- Cannot start with underscores (_) or numbers
Use this technique to prevent unnecessary fields from being indexed. For example, you can add a field called $meetsCondition to a document to satisfy a conditional statement later on in the pipeline, but the field might not include any valuable information for your users to search on.
- To determine the MIME types of new documents created by this stage, you should precede
or follow this stage with a MIME Type Detection stage, depending on whether recursion is
enabled for the workflow:
- If either the Workflow Agent Recursion or Preprocessing Recursion settings are enabled for the workflow task, precede this stage with a MIME Type Detection stage.
With these settings enabled, all new documents are sent to the beginning of the applicable workflow pipeline, rather than on to the next stage.
- If both recursion settings are disabled, follow this stage with a MIME Type Detection stage.
Replace stage
The Replace stage lets you replace all or part of a value for the fields you specify. You can use this stage, for example, to redact sensitive information such as credit card numbers, or to normalize values across multiple fields.
This example uses a regular expression to search the field SSN for social security numbers. Those numbers are then replaced with REDACTED.

This example uses a regular expression to search multiple fields for the value Street and replace each with the abbreviation St..

Documents
Documents with edited field values.
Fields to process: A list of fields in which to replace values.
Values to replace:
- Source expression: A regular expression used to search for the characters you want to replace within a field value. Valid values are regular expressions that are supported by the Java Pattern class.
- Replacement: One or more characters used to replace the characters matched by the corresponding source expression.
- Use %SPACE to specify a single space character in replacement values.
- Document field values can be included in replacement text using ${<fieldName>} syntax, for example ${HCI_displayName}.
For example, use a Source expression value of [0-9]{4}-[0-9]{4}-[0-9]{4}- with a Replacement value of XXXXXX- to redact credit card numbers.
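For reference, the credit card example behaves like the following Python sketch using the re module. The stage itself uses Java Pattern expressions, but this particular pattern is valid in both; the sample field value is hypothetical.

import re

source_expression = r"[0-9]{4}-[0-9]{4}-[0-9]{4}-"   # Source expression from the example above
replacement = "XXXXXX-"                              # Replacement from the example above

value = "Card number 1234-5678-9012-3456 on file"    # hypothetical field value
print(re.sub(source_expression, replacement, value))
# Card number XXXXXX-3456 on file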
Place this stage after a Text and Metadata Extraction stage to standardize or rename the fields added by that stage.
Reject Documents stage
You can debug your pipelines by using conditional statements to filter problem documents into a Reject Documents stage. Any documents that reach this stage are reported as failures in the workflow task status report and are not processed by subsequent stages in your pipeline.
For information on viewing document failures for a task, see Task details, status, and results.
Say, for example, that you expect all of your documents to contain the author field by the time they reach a certain point in your pipeline. You can verify this by inserting a Reject Documents stage at that point and surrounding it with a conditional statement that verifies whether the author field exists.

Documents
None. Any documents entering the stage are reported as failures.
- Rejection Message: The error message to include with each document failure produced by this stage. The default is Document Rejected.
- Always surround this stage with a conditional statement to ensure that only problem documents enter it.
- If you are testing your pipeline with multiple Reject Documents stages, use specific and descriptive rejection messages so you know where each failure came from.
Snippet Extraction stage
The Snippet Extraction stage reads the content stream for a document, extracts a subset of the document's contents, and stores that subset as a new field. You can use this field to provide some sample text for each document in search results. Instead of returning the full contents of each document, use this stage to index and return a more manageable amount of document contents.

Documents with content streams.
Documents with an additional field containing a snippet of the document's full content.
- Text Input Stream: The name of the stream from which to read document contents.
- Snippet Output Field: The name of the field in which to store text extracted from the specified stream. The default is HCI_snippet.
- Maximum Snippet Character Length: The maximum number of characters to extract from each document and store in the snippet field.
- Snippet Start Offset: The number of characters to skip in the document before beginning to extract the snippet text.
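Conceptually, these settings amount to a simple substring operation, as in the following illustrative Python sketch; the function name and sample text are hypothetical.

def extract_snippet(stream_text, max_length, start_offset=0):
    # Illustration only: skip start_offset characters, then keep at most
    # max_length characters as the snippet value.
    return stream_text[start_offset:start_offset + max_length]

text = "Quarterly report. Revenue increased in all regions during the first quarter."
print(extract_snippet(text, max_length=40, start_offset=18))
# prints the 40 characters starting at offset 18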
This stage can add new fields to documents. For a field to be indexed, its name:
- Cannot contain hyphens (-) or any other special characters.
- Cannot start with underscores (_) or numbers.
Use this technique to prevent unnecessary fields from being indexed. For example, you can add a field called $meetsCondition to a document to satisfy a conditional statement later on in the pipeline, but the field might not include any valuable information for your users to search on.
- If you are using this stage to avoid indexing a document's full content, follow this stage with a Filter stage that filters out the stream from which you extracted the snippet (that is, the stream specified for the Text Input Stream field). For information on the Filter stage, see Filter stage.
- You can use this stage as a debugging tool to view the contents of document streams. For information, see Viewing stream contents.
- Use this stage to support search result highlighting for your users. For information, see Search result highlighting.
SRT Text Extraction stage
The SRT Text Extraction stage lets you extract existing subtitle text from video files. For this stage to work on a video file:
- Subtitle text must already exist. That is, this stage doesn't automatically create subtitle text for a video file that has none.
- Subtitle text must be in SubRip Text (SRT) format:
Subtitle ID number
Time range for the subtitle
Subtitle text
<blank line>
For example:
212
00:33:55,170 --> 00:33:57,210
- Who's there?

213
00:33:58,000 --> 00:34:00,455
- Just me.
- Subtitle text must be stored as a content stream associated with a video
file.
For video files from an HCP namespace, for example, this stage processes subtitle text stored as a custom-metadata annotation on the file.
In this example, an HCP object named InstructionalVideo.mp4
contains a stream named HCI_customMetadata_subtitles. This stream is a pointer for Hitachi Content Intelligence to retrieve the video's SRT text from HCP. The SRT Text Extraction stage accesses this stream, extracts SRT text from it, and adds that text to three new document fields for the video: one field for all subtitle text, one field for all subtitle start times, and one field for all subtitle end times.
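To make the extraction concrete, the following Python sketch parses SRT blocks like the example above into subtitle text, start times, and end times. It is an illustration only; the stage's real parser and its handling of malformed SRT data may differ.

def parse_srt(srt_text):
    # Illustration only: return parallel lists of subtitle text, start times,
    # and end times from SRT-formatted text.
    subtitles, start_times, end_times = [], [], []
    for block in srt_text.strip().split("\n\n"):        # blocks are separated by a blank line
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue                                    # skip malformed blocks
        start, _, end = lines[1].partition(" --> ")     # "00:33:55,170 --> 00:33:57,210"
        subtitles.append(" ".join(lines[2:]))           # remaining lines are the subtitle text
        start_times.append(start.strip())
        end_times.append(end.strip())
    return subtitles, start_times, end_times

srt = """212
00:33:55,170 --> 00:33:57,210
- Who's there?

213
00:33:58,000 --> 00:34:00,455
- Just me.
"""
texts, starts, ends = parse_srt(srt)
print(texts)    # ["- Who's there?", '- Just me.']
print(starts)   # ['00:33:55,170', '00:33:58,000']
print(ends)     # ['00:33:57,210', '00:34:00,455']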

Documents for videos that have SRT annotations.
Documents for the input videos. The output videos now include three additional multivalued fields: one for subtitle text values, one for subtitle start times, and one for subtitle end times.
- Single-valued (subtitle text only) or Multi-valued:
- Single-valued (subtitle text only): Specifies that all subtitle text should be stored in one single-valued field. Subtitle time data is not extracted or stored.
Note: Use this option to better support highlighting video subtitle text in search results. For more information, see Search result highlighting.
- Multi-valued: Specifies that subtitle text and time data should be stored in separate, multi-valued fields.
- Stream Name: The name of the content stream in which subtitle text is stored. This tells the stage where to look for a document's SRT data.
- Subtitles Field: The name of the field where you want to store the extracted subtitle text.
- Event Start Time (if Multi-valued is selected): The name of the field where you want to store the extracted subtitle start times.
- Event End Time (if Multi-valued is selected): The name of the field where you want to store the extracted subtitle end times.
This stage can add new fields to documents. For a field to be indexed, its name:
- Cannot contain hyphens (-) or any other special characters.
- Cannot start with underscores (_) or numbers.
Use this technique to prevent unnecessary fields from being indexed. For example, you can add a field called $meetsCondition to a document to satisfy a conditional statement later on in the pipeline, but the field might not include any valuable information for your users to search on.
Syslog Notification stage
The Syslog Notification stage lets you send syslog notifications while data is being processed. A single notification is sent for every document processed.

Documents
Processed documents and a syslog message for each document.
Syslog Settings:
- Host: The hostname or IP address for a syslog server.
- Port: The syslog port. The default is port 514.
- Facility: The syslog facility that the message is sent to. The default is local0, but you can choose from local1 through local7.
- Message: The content of the syslog message.
To include a field value in a syslog message, use this syntax:
${<field-name>}
For example:
${HCI_size}
The default is Processed document ${HCI_URI}.
For example, if you keep the default and the stage processes a document with the URL http://example.com/document.pdf, the resulting message would be:
Processed document http://example.com/document.pdf
To include an aggregation value in a syslog message, use this syntax:
${<aggregation-name>}
For example:
${Extensions}
If the aggregation name contains spaces, replace them with underscores (_). For example:
${Discovered_Fields}
- Severity: The severity of the message. Choose from INFO (default), EMERGENCY, ALERT, CRITICAL, ERROR, WARN, NOTICE, and DEBUG.
- Sender Identity: The identity of the sender for each message. The default is hci.
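For reference, the notification described above is roughly equivalent to this Python sketch using the standard logging module. The syslog host name is a placeholder, the document is hypothetical, and the message template is the default shown above.

import logging
import logging.handlers

# Placeholder syslog server; port 514 and facility local0 are the stage defaults.
handler = logging.handlers.SysLogHandler(
    address=("syslog.example.com", 514),
    facility=logging.handlers.SysLogHandler.LOG_LOCAL0,
)
logger = logging.getLogger("hci")        # "hci" mirrors the default Sender Identity
logger.addHandler(handler)
logger.setLevel(logging.INFO)

document = {"HCI_URI": "http://example.com/document.pdf"}

# Default message template: Processed document ${HCI_URI}
message = "Processed document {HCI_URI}".format(**document)
logger.info(message)                     # sends: Processed document http://example.com/document.pdf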
Because this stage sends a syslog message for every document it encounters, you should surround it with a conditional statement to limit the documents that reach the stage.
Tagging stage
The Tagging stage lets you add new fields and values to a document. The fields and values that you configure for this stage are added to all documents that pass through it.

Documents
Documents with additional pairs of fields and values.
- A list of fields and values to add.
- Overwrite Field:
- When this setting is enabled, if a value for the target field exists, the value is overwritten.
- When this setting is disabled, the new value is added as an additional value on the target field.
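The Overwrite Field behavior can be illustrated with this minimal Python sketch, which treats a document as a dictionary of multivalued fields; the representation is an assumption, not how the stage is implemented.

def tag(document, field, value, overwrite=False):
    # Illustration only: add a field/value pair to a document whose fields
    # hold lists of values.
    if overwrite or field not in document:
        document[field] = [value]        # Overwrite Field enabled: replace any existing values
    else:
        document[field].append(value)    # Overwrite Field disabled: add as an additional value
    return document

print(tag({"department": ["finance"]}, "department", "audit", overwrite=True))
# {'department': ['audit']}
print(tag({"department": ["finance"]}, "department", "audit", overwrite=False))
# {'department': ['finance', 'audit']}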
This stage can add new fields to documents. For a field to be indexed, its name:
- Cannot contain hyphens (-) or any other special characters.
- Cannot start with underscores (_) or numbers.
Use this technique to prevent unnecessary fields from being indexed. For example, you could add a field called $meetsCondition to a document in order to satisfy a conditional statement later on in the pipeline, but the field may not include any valuable information for your users to search on.
To control which documents are sent to this stage, enclose the stage in a conditional statement.
TAR Expansion stage
The TAR Expansion stage opens .tar archives and creates documents for the files within.
This stage can process .tar, .tar.gz, and .tar.bz files.
For an example of how to do this, see Default pipeline.

Documents for .tar archives.
Documents for files inside .tar archives.
Documents produced by this stage include these fields, which identify the file from which the documents were expanded:
- HCI_parentDisplay
- HCI_parentId
- HCI_parentUri
- Document Stream Name: The name of the stream that you want the stage to examine.
The default stream is HCI_content, which is added to every document that the system reads.
- Limit Extracted Files: When enabled, you can specify a maximum number of files that can be extracted from a .tar file.
- Include source TAR: When enabled, the .tar file itself is output by the stage, in addition to the documents within.
- Expanded File Path: Determines the values for the HCI_path and HCI_relativePath fields for documents extracted from an archive file.
The options are:
- Expand to a path named after the archive file: Creates field values based on the name of the archive from which documents were expanded.
For example, with this option selected, a document expanded from an archive file /2016/feburaryLogs.tar will have these field/value pairs:
- HCI_path: "2016/feburaryLogs.tar-expanded/"
- HCI_relativePath: "/feburaryLogs.tar-expanded/"
Use this option to link expanded documents back to the archives from which they were expanded.
- Use original expanded file path: Tries to use the original expanded file path, if found in the archive.
For example, with this option selected, a document expanded from an archive file /2016/feburaryLogs.tar will have these field/value pairs:
- HCI_path: "2016/"
- HCI_relativePath: "/"
Use this option when you are writing the expanded documents to a data source.
- Customize the expanded base path: Allows you to specify the expanded file path to use.
For example, if you selected this option and specified a custom path value of myCustomPath, a document expanded from an archive file /2016/feburaryLogs.zip will have these field/value pairs:
- HCI_path: "2016/myCustomPath/"
- HCI_relativePath: "/myCustomPath/"
Use this option when you are writing the expanded documents to a data source.
If you follow this expansion stage with an Execute Action stage that writes files to a data source, the Expanded File Path setting affects how the expanded files are written:
- With the Expand to a path named after the archive file option selected, each .tar that the stage processes will be expanded to its own folder.
- With either the Customize the expanded base path or Use original expanded file path option selected, all .tar files that the stage processes will be expanded to the same folder.
If multiple .tar files contain files that have the same name, the Execute Action stage tries to write multiple different files to the same path in the data source. The results depend on the data source.
For example, if multiple .tar files contain a file called notes.txt, the data source might end up with either:
- A single file called notes.txt that has multiple versions.
- A single file called notes.txt, the contents of which match the last notes.txt file to be written to the data source.
- An error writing the file, if the data source does not allow existing files to be overwritten.
- Surround this stage with a conditional statement that allows only .tar, .tar.gz, and .tar.bz documents to pass through. That way, you avoid processing documents of the wrong type.
- To determine the MIME types of new documents created by this stage, you should precede
or follow this stage with a MIME Type Detection stage, depending on whether recursion is
enabled for the workflow:
- If either the Workflow Agent Recursion or Preprocessing Recursion settings are enabled for the workflow task, precede this stage with a MIME Type Detection stage.
With these settings enabled, all new documents are sent to the beginning of the applicable workflow pipeline, rather than on to the next stage.
- If both recursion settings are disabled, follow this stage with a MIME Type Detection stage.
- If you need to expand very large .tar, .tar.gz, and .tar.bz documents, use the Preprocessing execution mode for the workflow pipeline that contains this stage.
- Use the Decompression stage instead of this stage to expand Gzip files (.gz).
Text and Metadata Extraction stage
The Text and Metadata Extraction stage has two uses:
- Extracting additional fields from a document's contents.
This is the general purpose metadata extraction stage; it can discover a number of industry-standard metadata fields from hundreds of common document types.
For example, for an email, this stage might extract fields for the email's subject, recipient, and sender.
- Extracting searchable keywords from a document's full content. This is required for your users to be able to search for a document by its content, not just its metadata.
For information on full-content search, see Enabling full-content search.

Documents
Documents with:
- Additional field/value pairs.
- An additional stream that contains all keywords extracted from the input stream.
- Excluded Content-Type Values: A list of MIME types. Documents with these MIME types bypass this stage. By default, the Text and Metadata Extraction stage is configured to exclude archive files such as .zip and .tar files.
This setting depends on the Content-Type field being present in documents. The Content-Type field is added by the MIME Type Detection stage. For information, see MIME Type Detection stage.
Tip: You can use this setting to improve pipeline performance by having the Text and Metadata Extraction stage skip file types that you're not interested in.
- Input Stream Name: For each document entering the stage, this is the stream from which to extract field/value pairs. The default is HCI_content.
- Output Stream Name: For each document exiting the stage, this is the stream in which the extracted text keywords are stored. The default is HCI_text.
Important: By default, to support full-content search, index collections are configured to index HCI_text. If you change the name of the output stream for this stage, you need to add a new field to your index collection schema. The new field must have these settings:
- Name: Same as the Output Stream Name for the Text and Metadata Extraction stage.
- Type: text_hci
- Field attributes:
- indexed
- multiValued
For information on adding fields to an index collection schema, see Adding and editing fields in an index collection schema.
Note: To view the contents of this stream, test your pipeline using the Snippet Extraction stage. For information, see Viewing stream contents.
- Extracted Character Limit: The maximum number of characters to extract from the input stream.
- RFC-822 Email Display Subject: When a workflow task reads a document, it adds a field named HCI_displayName, which contains the filename for the document.
With this option enabled, when the Text and Metadata Extraction stage processes an RFC-822-compliant email document, the stage replaces the contents of the HCI_displayName field with the email's subject.
- Include Unprocessed Fields: When enabled, documents include Message_Raw_Header_ fields, which contain unprocessed metadata. This option defaults to false because many of the unprocessed metadata fields contain duplicate values that are not typically useful.
- Skip Embedded Documents: When enabled, extraction will be skipped for any embedded documents.
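At a high level, the exclusion and character-limit settings behave like the following Python sketch. It is purely illustrative: extract_text is a hypothetical stand-in for the stage's real extraction logic, and the dictionary-based document representation is an assumption.

def extract_text(stream_text):
    # Hypothetical stand-in for the real extraction; here it just passes text through.
    return stream_text

def run_extraction(document, stream_text, excluded_types, output_stream="HCI_text",
                   character_limit=1000000):
    # Illustration only: skip excluded MIME types, otherwise store at most
    # character_limit characters of extracted text in the output stream.
    # The MIME type field is added by the MIME Type Detection stage.
    if document.get("Content-Type") in excluded_types:
        return document                                  # document bypasses the stage
    document[output_stream] = extract_text(stream_text)[:character_limit]
    return document

doc = {"Content-Type": "application/zip"}
print(run_extraction(doc, "...", excluded_types=["application/zip"]))
# {'Content-Type': 'application/zip'} -- archive files bypass the stage by default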
- This stage can add a significant amount of time to a workflow task. Use the Excluded Content-Type Values setting to avoid processing the files for which you don't need to extract keywords or additional fields.
- Because the Excluded Content-Type Values setting needs the Content_Type field to exist in documents, you should always place this stage after the MIME Type Detection stage.
- Follow the Text and Metadata Extraction stage with a Date Conversion stage to normalize any date fields added to your documents.
Tokenizer stage
The Tokenizer stage lets you specify a string of one or more characters, called a token, and then use that token to perform several types of transformations on field values. You can then create new fields to hold the transformed values.
The following sections detail the different Tokenizer operations.
This operation lets you split a single field value into multiple values. Every time the stage encounters the token character or sequence of characters, it creates a new value for the field.
Example

Operation-specific configuration settings
- Tokenizer: A string of one or more characters to use as a delimiter.
This lets you take a field value, replace one token with another, and store the new value in a new field.
Example

Operation-specific configuration settings
- Token to replace: A string of one or more characters to be replaced.
- Replacement text: A string of one or more characters with which to replace the specified token.
This lets you shorten field values. When the stage encounters the token for the first time in a field value, the remainder of the field value is deleted. The shortened value is then stored in a new field.
Example

Operation-specific configuration settings
- Token for substring end position: A string of one or more characters. When the stage encounters this string in a field value, it deletes the remainder of the field value.
This lets you shorten field values. The stage extracts a portion of a field value and saves that portion as a new field.
Example

Operation-specific configuration settings
- Start position: A number of characters to offset from the beginning of a field value. This setting is not inclusive. The default is 0 (that is, the stage starts its extraction from the beginning of the field value).
- End position: A number of characters from the beginning of the field value. When the stage encounters the character at the specified position, it stops extracting characters from the field value. This setting is not inclusive. By default, the stage extracts the entire field value.
Documents
Documents with additional field/value pairs.
These configuration settings are common to all operation types:
- Field Name: The metadata field that you want the Tokenizer stage to act on.
- Tokenized Field Name: The name of the field in which to store the tokenized value.
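The operations described above map onto simple string operations, as in the following Python sketch. It is illustrative only; the operation names and keyword parameters mirror the settings above rather than any real API, and the exact inclusivity of the substring positions may differ from this approximation.

def tokenize(value, operation, **settings):
    # Illustration only: apply one of the Tokenizer operations to a field value.
    if operation == "split":
        # Split into multiple values at each occurrence of the token.
        return value.split(settings["tokenizer"])
    if operation == "replace":
        # Replace one token with another.
        return value.replace(settings["token_to_replace"], settings["replacement_text"])
    if operation == "truncate":
        # Delete everything from the first occurrence of the token onward.
        return value.split(settings["token"], 1)[0]
    if operation == "substring":
        # Extract the characters between the start and end positions.
        return value[settings.get("start", 0):settings.get("end")]
    raise ValueError("unknown operation: " + operation)

print(tokenize("2016-03-03", "split", tokenizer="-"))                                   # ['2016', '03', '03']
print(tokenize("smith_john", "replace", token_to_replace="_", replacement_text=", "))   # smith, john
print(tokenize("report.pdf.bak", "truncate", token=".bak"))                             # report.pdf
print(tokenize("2016-03-03T12:30", "substring", start=0, end=10))                       # 2016-03-03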
URL Encoder/Decoder stage
The URL Encoder/Decoder stage lets you percent-encode or percent-decode field values that contain URLs.

Documents with URL fields.
Documents with additional fields containing percent-encoded or percent-decoded URL values.
- Fields to Decode:
- Existing Field Name: The field containing URLs you want to percent-decode.
- Target Field Name: The field in which to store the percent-decoded URL.
- Fields to Encode:
- Existing Field Name: The field containing URLs that you want to percent-encode.
- Target Field Name: The field in which to store the percent-encoded value.
- Overwrite Field:
- When this setting is enabled, if a value for the target field exists, the value is overwritten.
- When this setting is disabled, the output is added as an additional value on the target field.
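Percent-encoding and decoding correspond to standard URL encoding, as in this Python sketch using urllib.parse; the sample URL is hypothetical.

from urllib.parse import quote, unquote

encoded_url = "https://example.com/docs/annual%20report%202016.pdf"   # hypothetical field value

decoded_url = unquote(encoded_url)              # Fields to Decode: percent-decode the value
print(decoded_url)                              # https://example.com/docs/annual report 2016.pdf

reencoded_url = quote(decoded_url, safe=":/")   # Fields to Encode: percent-encode the value
print(reencoded_url)                            # https://example.com/docs/annual%20report%202016.pdf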
This stage can add new fields to documents. For a field to be indexed, its name:
- Cannot contain hyphens (-) or any other special characters.
- Cannot start with underscores (_) or numbers.
Use this technique to prevent unnecessary fields from being indexed. For example, you can add a field called $meetsCondition to a document to satisfy a conditional statement later on in the pipeline, but the field might not include any valuable information for your users to search on.
XML Formatter stage
The XML Formatter stage takes a list of document fields, formats those fields as XML, and then outputs them to a stream.

In this example, the locationData stream added by the XML Formatter stage contains this XML:
<locationData>
  <city><![CDATA[rome]]></city>
  <country><![CDATA[IT]]></country>
  <region><![CDATA[Latium]]></region>
  <timeZone><![CDATA[Europe/Rome]]></timeZone>
  <subjects>
    <value><![CDATA[Colosseum]]></value>
    <value><![CDATA[Arch of Constantine]]></value>
  </subjects>
</locationData>
Documents with fields.
Documents with added streams that contain XML-formatted data.
- Output stream: A name for the stream that the stage adds to documents that pass through it.
- Root XML element: Optionally, a name for the root XML element in the stream added to the document. If provided, this must be a valid XML element name.
- Fields to format: A list of document fields to format as XML and add to the output stream.
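A rough Python sketch of how output like the example above can be produced from document fields; the stage's real formatting rules may differ, and the helper below is purely illustrative.

def format_fields_as_xml(fields, root_element="locationData"):
    # Illustration only: format document fields as XML, wrapping values in CDATA
    # and writing multivalued fields as repeated <value> elements.
    parts = ["<%s>" % root_element]
    for name, value in fields.items():
        if isinstance(value, list):                                   # multivalued field
            parts.append("  <%s>" % name)
            parts.extend("    <value><![CDATA[%s]]></value>" % item for item in value)
            parts.append("  </%s>" % name)
        else:
            parts.append("  <%s><![CDATA[%s]]></%s>" % (name, value, name))
    parts.append("</%s>" % root_element)
    return "\n".join(parts)

fields = {
    "city": "rome",
    "country": "IT",
    "region": "Latium",
    "timeZone": "Europe/Rome",
    "subjects": ["Colosseum", "Arch of Constantine"],
}
print(format_fields_as_xml(fields))   # produces XML equivalent to the locationData example above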
ZIP Expansion stage
The ZIP Expansion stage opens .zip files and creates documents for each file inside.
This stage affects only .zip files.
For an example of how to do this, see Default pipeline.

Documents for .zip files.
Documents for files inside the .zip file.
Documents produced by this stage include these fields, which identify the file from which the documents were expanded:
- HCI_parentDisplay
- HCI_parentId
- HCI_parentUri
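As an illustration only (not the stage's implementation), the following Python sketch shows how documents for the files inside a .zip archive might be produced, carrying the parent-identification fields listed above; the dictionary representation and the source of the parent values are assumptions, and HCI_id is a hypothetical field name.

import zipfile

def expand_zip(zip_path, parent_document, limit=None):
    # Illustration only: create one document per file in the archive,
    # copying parent-identification values from the archive's own document.
    documents = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            if limit is not None and len(documents) >= limit:   # Limit Extracted Files
                break
            documents.append({
                "HCI_displayName": info.filename,
                "HCI_parentDisplay": parent_document.get("HCI_displayName"),
                "HCI_parentId": parent_document.get("HCI_id"),      # hypothetical field name
                "HCI_parentUri": parent_document.get("HCI_URI"),
            })
    return documents

# Example call (paths and field values are hypothetical):
# docs = expand_zip("/2016/feburaryLogs.zip",
#                   {"HCI_displayName": "feburaryLogs.zip", "HCI_id": "1234",
#                    "HCI_URI": "file:///2016/feburaryLogs.zip"},
#                   limit=100)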
- Document Stream Name: The name of the stream that you want the stage to examine.
The default stream is HCI_content, which is added to every document that the system reads.
- Limit Extracted Files: When enabled, you can specify a maximum number of files that can be extracted from a .zip file.
- Include source ZIP: When enabled, the .zip file itself is output by the stage, in addition to the documents within.
Note: If the stage produces zero documents and encounters zero documents when processing an archive file, it outputs that archive file regardless of this setting. This can happen, for example, when the stage processes an empty archive.
- Expanded File Path: Determines the values for the HCI_path and HCI_relativePath fields for documents extracted from an archive file.
The options are:
- Expand to a path named after the archive file: Creates field values based on the name of the archive from which documents were expanded.
For example, with this option selected, a document expanded from an archive file /2016/feburaryLogs.zip will have these field/value pairs:
- HCI_path: "2016/feburaryLogs.zip-expanded/"
- HCI_relativePath: "/feburaryLogs.zip-expanded/"
Use this option to link expanded documents back to the archives from which they were expanded.
- Use original expanded file path: Tries to use the original expanded file path, if found in the archive.
For example, with this option selected, a document expanded from an archive file /2016/feburaryLogs.zip will have these field/value pairs:
- HCI_path: "2016/"
- HCI_relativePath: "/"
Use this option when you are writing the expanded documents to a data source.
- Customize the expanded base path: Allows you to specify the expanded file path to use.
For example, if you selected this option and specified a custom path value of myCustomPath, a document expanded from an archive file /2016/feburaryLogs.zip will have these field/value pairs:
- HCI_path: "2016/myCustomPath/"
- HCI_relativePath: "/myCustomPath/"
Use this option when you are writing the expanded documents to a data source.
If you follow this expansion stage with an Execute Action stage that writes files to a data source, the Expanded File Path setting affects how the expanded files are written:
- With the Expand to a path named after the archive file option selected, each .zip that the stage processes will be expanded to its own folder.
- With either the Customize the expanded base path or Use original expanded file path option selected, all .zip files that the stage processes will be expanded to the same folder.
If multiple .zip files contain files that have the same name, the Execute Action stage tries to write multiple different files to the same path in the data source. The results depend on the data source.
For example, if multiple .zip files contain a file called notes.txt, the data source might end up with either:
- A single file called notes.txt that has multiple versions.
- A single file called notes.txt, the contents of which match the last notes.txt file to be written to the data source.
- An error writing the file, if the data source does not allow existing files to be overwritten.
- Surround this stage with a conditional statement that allows only .zip documents to pass through. That way, you avoid processing documents of the wrong type.
- To determine the MIME types of new documents created by this stage, you should precede
or follow this stage with a MIME Type Detection stage, depending on whether recursion is
enabled for the workflow:
- If either the Workflow Agent Recursion or Preprocessing Recursion settings are enabled for the workflow task, precede this stage with a MIME Type Detection stage.
With these settings enabled, all new documents are sent to the beginning of the applicable workflow pipeline, rather than on to the next stage.
- If both recursion settings are disabled, follow this stage with a MIME Type Detection stage.
- If you need to expand very large .zip documents, use the Preprocessing execution mode for the workflow pipeline that contains this stage.