Built-in stages

Workflow Designer includes a number of built-in stages that you can use in your pipelines.

You can also write your own custom stage plugins.

Add Metadata stage

The Add Metadata stage adds a data source's basic metadata to a document, referencing the document by its URI.

Configuration settings
  • Document URI: The URI for which to gather basic metadata from the configured data connection. Supports the ${fieldName} syntax for building a dynamic, per-document URI from field values (see the example after this list).
  • Data Connection: The data connection from which to request the specified URI.
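For example, assuming each document carries an HCI_relativePath field (the host name shown here is purely illustrative), a Document URI of:

https://datasource.example.com${HCI_relativePath}

resolves to a different URI for each document, and the stage requests that URI's basic metadata from the selected data connection.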

Appender stage

The Appender stage lets you:

  • Add text to the end of a field value or to the end of a stream.
  • Add a field value to the end of another field value or to the end of a stream.

If a field has multiple values, the Appender stage appends the specified text or field value to each value.

To append one stream to another, use the Mapping stage.

Tip: You can use the Appender stage to add search keywords to documents. When you append text to a stream, the text is indexed as part of a document's contents. When a user performs a simple search in the Search App, the search query is compared against those contents, including the keywords you've appended.
Examples
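As a minimal sketch of an Append Text to Field operation (the field name and text are hypothetical), suppose Target Field Name is set to keywords and Custom Text is set to " confidential" (note the leading space). A document entering the stage with this field/value pair:

keywords: "finance"

leaves the stage with:

keywords: "finance confidential"

If the keywords field had several values, the text would be appended to each of them.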
Input

Documents

Output
  • Documents with field/values that include appended text.
  • Documents with streams that reference the document's contents plus the appended text.
Configuration settings
  • Append Field to Field operation:
    • Target Field Name: The field whose values you want to append to.
    • Source Field Name: The field whose values you want to append to the target field.
  • Append Field to Stream operation:
    • Target Stream Name: The stream you want to append to.
    • Source Field Name: The field whose values you want to append to the target stream.
  • Append Text to Field operation:
    • Target Field Name: The field whose values you want to append text to.
    • Custom Text: The text you want to append.
  • Append Text to Stream operation:
    • Target Stream Name: The stream you want to append to.
    • Custom Text: The text you want to append.

Attach Stream stage

The Attach Stream stage lets you add a stream from one document to another document, allowing the two documents to be indexed and returned in search results as a single document.

For example, you can use this stage to associate the contents of an audio recording transcript with the recording itself. That way, users can more easily find the actual recording, not just the transcript, in a search.

You can also use this stage to facilitate writing custom metadata to objects in HCP. For information, see Writing file contents as custom metadata to HCP objects.

Example

In this example, a folder in an HCP namespace includes both an audio file and its associated text transcript file:

https://ns1.ten1.myhcp.example.com/r...recording1.mp3
https://ns1.ten1.myhcp.example.com/r...rding1.mp3.txt

The Attach Stream stage accesses the contents of the text file in the data connection and adds it to the audio file as a new stream called transcript.

Input

Documents

Output

Documents with additional streams

Configuration settings
  • Stream Name: The name of the stream to add to each document processed.
Tip: Specifying HCI_text as the stream name ensures that the stream contents are automatically indexed by an HCI Index that was created with an initial schema setting of Schemaless or Default. For more information, see Index collection types and settings.
  • Stream URI: The location in the specified data connection for the data to add as a stream.
  • Data connection selection: The data connection from which to find the data to attach as a stream. Options are:
    • Use Document Data Connection: Use the connection that the document was originally read from.
    • Specify Data Connection: Select an existing data connection.
  • Additional Custom Stream Metadata: Key/value pairs to add as metadata on the stream being added.
Notes and best practices
  • Use the ${fieldName} syntax for including document field values in the Stream Name, Stream URI, and Additional Custom Stream Metadata settings.

    For example, if a document includes this field value pair:

    HCI_displayName: file1
    And you specify this for the Stream Name setting:
    contents-of-${HCI_displayName}
    This stage adds a stream called contents-of-file1 to the document.
  • Consider excluding from your workflow the documents that you're reading streams from. Because you're adding the contents of one document to another document, you might not need to index or even process both documents in your workflow.

Cache Stream stage

The Cache Stream stage reads the stream for a document into a temporary file stored on your system. Any subsequent stages that need access to a stream read the stream's contents from the local, temporary file instead of from the original data source. This is useful if your data sources are expensive or slow to read data from.

The temporary files that this stage creates for a document are deleted when the document exits the workflow pipeline.

Important: Using this stage can cause large amounts of data to be temporarily stored on the instances. Ensure that your instances have enough space to handle the content you are processing.
Example
Input

Documents with content stream references that point to data sources.

Output

Documents with content stream references that point to temporary files stored in the system.

Configuration settings

Input stream: The name of the stream to read from and save to a temporary file. The default is HCI_content, which is included by default in each file that the system reads.

Notes and best practices

Place this stage close to the beginning of your workflow pipeline, before any other stages that read document streams. Ideally, this stage should be the only one that reads document contents from the data source.

Content Class Extraction stage

A Content Class Extraction stage uses a content class to extract additional fields from the text-based documents that pass through it.

For information on content classes, see Content classes.

Example

Take for example this XML file for a patient's blood pressure readings:

<xml>
    <date>2012-09-17</date>
    <patient>
        <name>John Smith</name>
        <age>56</age>
        <sex>M</sex>
    </patient>
    <diastolic>60,mm[Hg]</diastolic>
    <systolic>107,mm[Hg]</systolic>
    <assessment>low</assessment>
</xml>

And take for example a content class named BloodPressureXMLContentClass that contains these content properties:

Content property name    Content property expression
diastolic                /xml/diastolic
systolic                 /xml/systolic
assessment               /xml/assessment
patientName              /xml/patient/name
overFifty                boolean(/xml/patient/age > 50)

This illustration shows the effect of running the example.xml file through a Content Class Extraction stage that's configured to use BloodPressureXMLContentClass.

Input

XML, JSON, or other text-based documents.

Output

XML, JSON, or other text-based documents with additional field/value pairs.

Configuration settings
  • Content class selection: Select one or more existing content classes for the stage to use.
  • Stream selection: Specify the streams that you want to apply the content class to.

    The default is HCI_content. The Text and Metadata Extraction stage extracts full document content as a stream named HCI_content, so the default value should cover the majority of cases.

    If you have a custom stage that adds streams to documents and you want to apply content classes to those streams, you need to specify them here on the content class stage.

  • Metadata field selection: Specify the fields that you want to apply the content class to.
  • Stream or field content size limit: The maximum size in bytes for the content that this stage can process. If the value of a specified field or stream is larger than this limit, the Content Class Extraction stage does not process that field or stream.

    The default is 100,000,000 bytes.

    Specify -1 for no limit.

  • Extracted field size limit: The maximum size in bytes for each field added by this stage. Use this to manage how the stage handles extracted fields that contain large values.

    This setting cannot be greater than 1048576 (1 MB), which is also the default.

    You can configure how this stage should handle fields that exceed the size limit:
    • Fail: The stage produces a failure for the document. This is the default.
    • Store as stream: The extracted data is stored in the document as a stream, rather than a field.
    • Truncate: Only a subset of the data is stored in the document. The remaining data is not saved. The subset is equal in size to the Extracted field size limit setting.
    • Ignore: The field is ignored and not added to the document.
  • Parse Multivalued Fields: When disabled (default), multiple values are stored as a single string where each value found is separated by a semicolon. When enabled, this string is instead parsed into a typed multivalued field, merging any duplicate values.

    Take for example this XML:

    <dates>
     <date>2013-04-16T00:00:00-0400</date>
     <date>2013-04-16T00:00:00-0400</date>
     <date>2015-12-16T05:48:45-0400</date>
    </dates>
    With the Parse Multivalued Fields option disabled, parsing this XML using the content property expression /dates/date yields this field/value pair:
    date: "2013-04-16T00:00:00-0400;2013-04-16T00:00:00-0400;2015-12-16T05:48:45-0400"
    This field/value pair is not recognized as a date by other stages or index collections. You will need to use the Tokenizer stage to split up the values into separate fields to make them usable.

    However, with the Parse Multivalued Fields option enabled, a usable, multi-valued field/value pair is produced by the Content Class Extraction stage:

    date: Wed Dec 16 09:48:45 UTC 2015 | Tue Apr 16 04:00:00 UTC 2013 
Field naming considerations

This stage can add new fields to documents. For a field to be indexed, its name:

  • Cannot contain hyphens (-) or any other special characters.
  • Cannot start with underscores (_) or numbers.
Tip Beginning a field name with a dollar sign ($) causes the field to be deleted when the workflow pipeline finishes processing the associated document. That is, the field is not indexed.

Use this technique to prevent unnecessary fields from being indexed. For example, you can add a field called $meetsCondition to a document to satisfy a conditional statement later on in the pipeline, but the field might not include any valuable information for your users to search on.

Notes and best practices

A Content Class Extraction stage is useful only if your workflow inputs contain XML, JSON, or other text-based files.

Date Conversion stage

A document might fail to be indexed if it contains a field with a date value that does not conform to the ISO 8601 format standard. In a workflow task, such a document might fail with this message:

SolrPlugin error: {"responseHeader":{"status":400,"QTime":1},"error":{"msg":"Invalid Date
 String:'<date>'","code":400}}

You can ensure that the document is indexed by adding a Date Conversion stage to your pipeline. This stage takes date values for the fields you specify and converts them to the expected format.

Example
Input

Documents with incorrect date values.

Output

Documents with converted date values.

Configuration settings
  • Fields to Process: A list of fields for which to convert date values. The list contains some default fields, but you can add more.
  • Scan Formats: A list mapping regular expressions to date formats. Each list item contains:
    • A regular expression used to select strings of characters in a field value.
    • A scan pattern that identifies the corresponding date format for the regular expression.

    For example, the regular expression ^\d{1,2}-\d{1,2}-\d{4}$ matches a string containing one or two digits, a hyphen, one or two digits, another hyphen, and four digits. This pattern corresponds to the date format d-M-yyyy.

    By default, the Date Conversion stage contains a number of scan formats, but you can add more.
    Tip: You can use regular expressions with lookahead assertions to omit irrelevant characters from field values. For example, in this date value, the digits 123 are not part of the date and should be excluded from conversion:
    19570110_220434.123
    To omit the extraneous digits, use a regular expression that matches only the date portion. For example:
    \d{8}_\d{6}(?=\.\d{3})
    The corresponding date format for this expression is:
    yyyyMMdd_HHmmss
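    As an end-to-end sketch (the field name and value are hypothetical): if Fields to Process includes a field named lastModified whose value is 3-7-2021, that value matches the scan format shown above and is read using the date format d-M-yyyy, that is, as 3 July 2021. The stage then rewrites the value in the expected ISO 8601 form, for example 2021-07-03T00:00:00Z.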
Notes and best practices

Use this stage after a Text and Metadata Extraction stage to normalize any date fields that stage adds.

Decompression stage

The Decompression stage converts a compressed file into a decompressed file for processing.

This stage can process gzip, bzip2, and xz files.

Input

Compressed files

Output

Decompressed files

Configuration settings
  • Document Stream Name: The name of the stream that you want the stage to examine.

    The default stream is HCI_content, which is added to every document that the system reads.

  • Include source file: When enabled, the original file itself is output by the stage, in addition to the decompressed file.
Notes and best practices
  • For a document to be processed by this stage, it must contain a Content_Type field added by the MIME Type Detection stage. The value must be application/x-bzip2, application/x-xz, application/x-gzip, or application/gzip.
  • You can also change the Content_Type field in archived files to process archived gzip files (.tar.gz), archived bzip2 files (.tar.bz2), and archived xz files (.tar.xz). However, this will not work with files that have a .tgz extension.
  • Use the TAR Expansion stage instead of this stage to expand .tgz files.
  • To determine the MIME types of new documents created by this stage, you should precede or follow this stage with a MIME Type Detection stage, depending on whether recursion is enabled for the workflow:
    • If either the Workflow Agent Recursion or Preprocessing Recursion settings are enabled for the workflow task, precede this stage with a MIME Type Detection stage.

      With these settings enabled, all new documents are sent to the beginning of the applicable workflow pipeline, rather than on to the next stage.

    • If both recursion settings are disabled, follow this stage with a MIME Type Detection stage.
  • If you need to expand very large documents, use the Preprocessing execution mode for the workflow pipeline that contains this stage.

    For more information, see Pipeline execution modes.

DICOM Metadata Extraction stage

The DICOM Metadata Extraction stage extracts text metadata fields from files formatted in DICOM, a standard file format for medical images.

Example
Input

DICOM image documents.

Output

DICOM image documents with added fields.

Configuration settings
  • Input Stream Name: For each document entering the stage, this is the stream from which to extract field/value pairs. The default is HCI_content.
  • DICOM field prefix: A string of characters to apply as the prefix to all fields added by the stage. The default is DICOM_.
  • Extracted Character Limit: The maximum number of characters to extract from the input stream.
Field naming considerations

This stage can add new fields to documents. For a field to be indexed, its name:

  • Cannot contain hyphens (-) or any other special characters.
  • Cannot start with underscores (_) or numbers.
Tip Beginning a field name with a dollar sign ($) causes the field to be deleted when the workflow pipeline finishes processing the associated document. That is, the field is not indexed.

Use this technique to prevent unnecessary fields from being indexed. For example, you can add a field called $meetsCondition to a document to satisfy a conditional statement later on in the pipeline, but the field might not include any valuable information for your users to search on.

Drop Documents stage

All documents sent to this stage are deleted from the pipeline.

The Discoveries window on the task details page shows the number of documents dropped during a task run. For information, see Task details, status, and results.

Example
Input

Documents

Output

None

Configuration settings

None

Notes and best practices

To control which documents are sent to this stage, enclose the stage within a conditional statement.

Use this stage as early as possible in your pipeline so you can avoid having other stages unnecessarily process your unwanted documents.

Place this stage in one of these positions:

  • After a MIME Type Detection stage, so you can drop documents based on their MIME types.
  • If your workflow is not recursive, after an expansion that's immediately preceded by a MIME Type Detection stage. That way, you can drop documents from an archive file based on their MIME types.

Document Security stage

The Document Security stage applies access control lists (ACLs) to documents that pass through it. You can use these ACLs to configure which documents are made available to which users.

Index collection query settings are used to configure whether a search index honors ACLs. For more information, see Configuring access control settings in query settings.

If you apply multiple ACL fields to a document and configure a set of query settings to enforce all of those fields, your system honors the ACL settings in this order: Public, Deny ACLs, Allow ACLs.

Example

In this example, if the applicable query settings are configured to honor deny ACLs, any user in the RestrictedUsers group will not be able to find the archive.txt file.
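For instance, the stage in this example would add a field/value pair along these lines to the archive.txt document (the value shown is a placeholder for the group's actual unique identifier):

HCI_denyACL: "<unique identifier of the RestrictedUsers group>"

Query settings configured to enforce deny ACLs would then hide the document from members of that group.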

Input

Documents

Output

Documents with access control fields added:

  • HCI_denyACL: Values for this field are one or more unique identifiers for users and groups.
  • HCI_allowACL: Values for this field are one or more unique identifiers for users and groups.
  • HCI_isPublic: If the Document Security stage marks a document as publicly accessible, the value for this field is true. Otherwise, this field is not added to documents.
Configuration settings
  • Set Document Visibility:
    • Set Public: The stage adds the HCI_isPublic field to a document. Use the Enforce Public Setting in query settings to specify whether documents with this field are displayed in search results.
    • Enable ACLs: The stage adds HCI_allowACL and HCI_denyACL fields to a document.
  • Add these groups to the Allow ACL: Specifies the groups to add to the ACL that allows access to a document.

    When configuring this and the following setting, you can browse through the user groups that your system administrator has added to Hitachi Content Intelligence.

  • Add these groups to the Deny ACL: Specifies the groups to add to the ACL that denies access to a document.
  • Add these custom tokens to the Allow ACL: Specifies the users to add to the ACL that allows access to a document.
    Note: When configuring this and the following setting, you need to specify the unique identifiers (for example, Active Directory SIDs) for user accounts in the identity providers that your system administrator has added to Hitachi Content Intelligence. For more information, see your system administrator.
  • Add these custom tokens to the Deny ACL: Specifies the users to add to the ACL that denies access to a document.
Notes and best practices
  • Use this as the last stage of a workflow pipeline. That way, you ensure that all documents exiting the pipeline have passed through this stage on their way to the workflow outputs.

Email Expansion stage

The Email Expansion stage creates documents for any attachments it detects in .eml and .msg files. By default, the stage also outputs the email documents that originally entered the stage.

This stage affects only .eml and .msg files. It has no effect on other types of files or archives.

Example
Input

Documents for .eml or .msg files.

Output

Documents for .eml or .msg files and documents for any detected attachments.

Documents produced by this stage include these fields, which identify the file from which the documents were expanded:

  • HCI_parentDisplay
  • HCI_parentId
  • HCI_parentUri
Configuration settings
  • Document Stream Name: The name of the stream that you want the stage to examine.

    The default stream is HCI_content, which is added to every document that the system reads.

  • Include source email: When enabled, the original email document itself is output by the stage.
  • Include embedded non-file streams: By default, the Email Expansion stage extracts and creates documents for email attachments only if it detects that the attachments have filenames. When you enable this option, the stage creates documents for any stream it detects in the original email document.
  • Make the attachments field multivalued: When disabled (default), the HCI_attachments field's value is stored as a single string on the original email document. Each value found is separated by a comma. When enabled, this string is instead parsed into a multivalued field, merging any duplicate values.
  • Maximum Text Character Length: The maximum number of characters to extract into the HCI_text stream.
  • Expanded File Path: Determines the values for the HCI_path and HCI_relativePath fields for documents extracted from an archive file.

    The options are:

    • Expand to a path named after the archive file: Creates field values based on the name of the archive from which documents were expanded.

      For example, with this option selected, a document expanded from an archive file /2016/feburaryMessages.eml will have these field/value pairs:

      • HCI_path: "2016/feburaryMessages.eml-expanded/"
      • HCI_relativePath: "/feburaryMessages.eml-expanded/"

      Use this option to link expanded documents back to the archives from which they were expanded.

    • Use original expanded file path: Tries to use the original expanded file path, if found in the archive.

      For example, with this option selected, a document expanded from an archive file /2016/feburaryMessages.eml will have these field/value pairs:

      • HCI_path: "2016/"
      • HCI_relativePath: "/"

      Use this option when you are writing the expanded documents to a data source.

    • Customize the expanded base path: Allows you to specify the expanded file path to use.

      For example, if you selected this option and specified a custom path value of myCustomPath, a document expanded from an archive file /2016/feburaryMessages.eml will have these field/value pairs:

      • HCI_path: "2016/myCustomPath/"
      • HCI_relativePath: "/myCustomPath/"

      Use this option when you are writing the expanded documents to a data source.

Consideration for writing expanded files to data sources

If you follow this expansion stage with an Execute Action stage that writes files to a data source, the Expanded File Path setting affects how the expanded files are written:

  • With the Expand to a path named after the archive file option selected, each email message that the stage processes will be expanded to its own folder.
  • With either the Customize the expanded base path or Use original expanded file path option selected, all email messages that the stage processes will be expanded to the same folder.

    If multiple email messages contain files that have the same name, the Execute Action stage tries to write multiple different files to the same path in the data source. The results depend on the data source.

    For example, if multiple email messages contain a file called notes.txt, the data source might end up with either:
    • A single file called notes.txt that has multiple versions.
    • A single file called notes.txt, the contents of which match the last notes.txt file to be written to the data source.
    • An error writing the file, if the data source does not allow existing files to be overwritten.
Notes and best practices

Surround this stage with a conditional statement that allows only email message documents (.eml and .msg files) to pass through. That way, you avoid processing documents of the wrong type.

To determine the MIME types of new documents created by this stage, you should precede or follow this stage with a MIME Type Detection stage, depending on whether recursion is enabled for the workflow:

If either the Workflow Agent Recursion or Preprocessing Recursion settings are enabled for the workflow task, precede this stage with a MIME Type Detection stage.

With these settings enabled, all new documents are sent to the beginning of the applicable workflow pipeline, rather than on to the next stage.

If both recursion settings are disabled, follow this stage with a MIME Type Detection stage.

If you need to expand very large email message documents, use the Preprocessing execution mode for the workflow pipeline that contains this stage.

Email Notification stage

The Email Notification stage sends notifications by email while data is being processed. A single notification is sent for every document processed.

Example
Input

Documents

Output

Processed documents and email notifications for each document.

Configuration settings

SMTP Settings:

  • Host: The hostname or IP address of the SMTP host.
  • Port: The SMTP port. The default is 25.
  • Security: Security options. Choose from None (default), STARTTLS, or SSL.
  • Authentication: Enable or disable authentication. By default, this is disabled.

Message Settings:

  • From: The source address of the email notification. The default is pipeline@hci.com.
  • Subject: The subject of the email notification.
Including field values

To include a field value in an email notification body or subject, use this syntax:

${<field-name>}

For example:

${HCI_size}

The default is Processed document ${HCI_URI}.

If, for example, you keep the default and the stage processes a document with the URL http://example.com/document.pdf, the resulting message will be:

Processed document http://example.com/document.pdf
Including aggregation values

To include an aggregation value in an email notification body or subject, use this syntax:

${<aggregation-name>}

For example:

${Extensions}

If the aggregation name contains spaces, replace them with underscores (_). For example:

${Discovered_Fields}
  • Body: The body of the email notification. This can be either plain text or HTML.

  • Body Format: The format for the contents of the Body field. Either Plain Text or HTML.
  • Email Addresses: A comma-separated list of recipient email addresses.
Notes and best practices

Because this stage sends an email for every document it encounters, you should surround it with a conditional statement to limit the documents that reach the stage.

Execute Action stage

You can use Execute Action stages in your pipelines to perform actions supported by various components. For example, you can use an Execute Action stage with an HCP data connection to edit the metadata for an object in an HCP namespace.

Each component supports its own set of actions, or none at all. When you configure an Execute Action stage in a pipeline, you specify a component, the action you want the stage to perform, and any additional configuration settings required for that action.

Example

This example shows how an HCP data connection can be used in an Execute Action stage to delete files from an HCP namespace when a workflow task runs.

Additional Examples
  • Writing custom metadata to HCP objects.
  • Migrating data to HCP from Amazon S3.
  • Adding document caching to your workflows.
  • Using the Index action.
Adding Execute Action stages to pipelines

You use Execute Action stages by adding them to pipelines. For more information, see Adding actions to a pipeline.

Configuration settings

The configuration settings for data connection actions differ depending on the data connection. For information on these settings, see the topic for the applicable data connection under Data connection types and settings.

Filter stage

The Filter stage lets you remove unwanted fields and streams from documents.

This stage cannot remove these required fields:

  • HCI_id
  • HCI_URI
  • HCI_displayName
  • HCI_dataSourceUuid
  • HCI_doc_version
Example
Input

Documents

Output

Documents with field/value pairs or streams removed.

Configuration settings
  • Filtering type:
    • Whitelist: Only the fields you specify are kept for further processing.
    • Blacklist: The fields you specify are discarded while all other fields are left intact.
  • Filter fields or streams: Option to filter either fields or streams.
  • Filter by patterns: Specifies whether the stage uses regular expressions to match field or stream names.

    When disabled, the stage filters only the fields or streams whose names exactly match the names you specify.

  • Field/Stream Name: A list of field or stream names to filter. If the Filter by patterns option is enabled, you can specify regular expressions in this list.
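As a minimal sketch (the non-required field name is hypothetical): with Whitelist filtering selected, Filter fields chosen, and Field/Stream Name entries of HCI_displayName and title, a document that enters the stage with the fields HCI_displayName, title, and tempChecksum leaves it with only HCI_displayName and title, plus the required fields listed above, which are never removed.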

Field Parser stage

The Field Parser stage takes a specified field, examines that field's values, and splits those values into smaller segments named tokens. You can then select the tokens that you want and add them as values to other fields.

Tip: Use this stage in conjunction with the Read Lines stage to process log files or comma-separated values (CSV) files. For more information, see Parsing and indexing CSV and log files.
Note: Both the Field Parser stage and the Tokenizer stage split field values into tokens. However, this stage can output multiple fields while the Tokenizer stage outputs only a single field. For more information, see Tokenizer stage.
Example: Parsing a CSV document

In this example, a CSV file was passed through a Read Lines stage to produce a document for each line in the file. One of those documents is now passed through a Field Parser stage where the value for the $line field is split into separate fields.

Example: Extracting a username from an email address

In this example, a previous stage has added a from_address field to a document. The Field Parser stage is configured to extract a username from that field.

Input

Documents with fields.

Output

Documents with additional fields.

Configuration settings

  • Input Field Name: The name of the field whose value you want to parse.
  • Tokenize Input Field Value: Whether to split the input field value into multiple tokens.
  • Tokenization Delimiter: If Tokenize Input Field Value is enabled, a character or regular expression used to determine when one token ends and another begins. For example, specify , (comma) to split lines from a CSV file.
  • Substrings to Parse: Use this setting to parse a subset of the value for the input field. For example, if the input field contains the value <transmission>automatic</transmission> and you want to parse only automatic, specify a Start sequence of <transmission> and an End sequence of </transmission>.

    If you omit a Start sequence, the stage parses the field value from its beginning to the End sequence. If you omit an End sequence, the stage parses the field value from the Start sequence to its end.

  • Token to Field Mapping: A list of token index numbers and the field names that each maps to. Valid Token Index values are integers starting with 0.
  • Add Debug Fields: When enabled, the stage adds $debugParser fields to each document it processes. The stage adds one of these fields for each token produced from the selected text.
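As a sketch of how these settings fit together (the field names are hypothetical), suppose the $line field of a document contains the value 2016-02-01,server42,ERROR,disk full. With Tokenize Input Field Value enabled and a Tokenization Delimiter of , (comma), the stage produces tokens 0 through 3. A Token to Field Mapping of 0 to logDate, 1 to host, 2 to severity, and 3 to message then adds these field/value pairs to the document:

logDate: "2016-02-01"
host: "server42"
severity: "ERROR"
message: "disk full"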
Field naming considerations

This stage can add new fields to documents. For a field to be indexed, its name:

  • Cannot contain hyphens (-) or any other special characters.
  • Cannot start with underscores (_) or numbers.
Tip Beginning a field name with a dollar sign ($) causes the field to be deleted when the workflow pipeline finishes processing the associated document. The field is not indexed.

Use this technique to prevent unnecessary fields from being indexed. For example, you can add a field called $meetsCondition to a document to satisfy a conditional statement later on in the pipeline, but the field might not include any valuable information for your users to search on.

Geocoding stage

The Geocoding stage adds new fields to documents that contain latitude and longitude data. For each set of coordinates, the stage adds new fields that contain the corresponding city, state or province, country, and time zone.

Input

Documents with latitude and longitude fields.

Output

Documents with added city, country, state/province, and time zone fields.

Configuration settings
  • Latitude field name: Specifies which field in the incoming documents contains latitude data.
  • Longitude field name: Specifies which field in the incoming documents contains longitude data.
  • City field name: The name of a field to add to documents. This field contains the name of a city.
  • State/Province field name: The name of a field to add to documents. This field contains the name of a governmental region.
  • Country field name: The name of a field to add to documents. This field contains the name of a country.
  • Time zone field name: The name of a field to add to documents. This field contains the name of a time zone.
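For example (the coordinate values are illustrative), if the configured latitude field contains 42.3601 and the longitude field contains -71.0589, the stage adds fields, under the names configured above, with values along the lines of Boston for the city, Massachusetts for the state/province, United States for the country, and America/New_York for the time zone.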
Field naming considerations

This stage can add new fields to documents. For a field to be indexed, its name:

  • Cannot contain hyphens (-) or any other special characters.
  • Cannot start with underscores (_) or numbers.
Tip

Beginning a field name with a dollar sign ($) causes the field to be deleted when the workflow pipeline finishes processing the associated document. The field is not indexed.

Use this technique to prevent unnecessary fields from being indexed. For example, you could add a field called $meetsCondition to a document in order to satisfy a conditional statement later on in the pipeline, but the field may not include any valuable information for your users to search on.

HCP Access Log Parser stage

The HCP Access Log Parser stage reads an individual line from HCP access log files and turns it into a metric that can be displayed in the Monitor App.

Input

Access log file lines.

Output

Documents with the metrics displayed in the Monitor App.

Configuration settings
  • Field to Parse: The name of the field to process. The default is line.
  • Output Index Name: The name of the index to send processed documents to. The default is monitorAccessLogIndex.
  • Signal System: The name of the signal system being monitored. When left blank, the signal system is automatically detected based on information in the logs.
Notes and best practices

This stage is used by the Monitor App.

HCP System Events Parser stage

The HCP System Events Parser stage reads an individual line from HCP admin events log files and turns it into a metric that can be displayed in the Monitor App.

Input

Admin events log file lines.

Output

Documents with the metrics displayed in the Monitor App.

Configuration settings
  • Field to Parse: The name of the field to parse. The default is line.
  • Output Index Name: The name of the index to send processed documents to. The default is monitorEventsIndex.
  • Signal System: The name of the signal system being monitored. When left blank, the signal system is automatically detected based on information in the logs.
Notes and best practices

This stage is used by the Monitor App.

HCP Chargeback Report Parser stage

The HCP Chargeback Report Parser stage reads an individual line from HCP chargeback log files and turns it into a metric that can be displayed in the Monitor App.

Input

Chargeback log file lines.

Output

Documents with the metrics displayed in the Monitor App.

Configuration settings
  • Field to Parse: The name of the field to parse. The default is line.
  • Output Index Name: The name of the index to send processed documents to. The default is monitorMetricsIndex.
  • Signal System: The name of the signal system being monitored. When left blank, the signal system is automatically detected based on information in the logs.
Notes and best practices

This stage is used by the Monitor App.

HCP Resource Metrics Parser stage

The HCP Resource Metrics Parser stage reads resource metrics files from HCP and turns every line into a metric that can be displayed in the Monitor App.

Input

Resource metrics files.

Output

Documents with the metrics displayed in the Monitor App.

Configuration settings
  • Input Stream Name: The name of the input stream. The default is HCI_content.
  • Output Index Name: The name of the index to send processed documents to. The default is monitorMetricsIndex.
  • Signal System: The name of the signal system being monitored. When left blank, the signal system is automatically detected based on information in the logs.
Notes and best practices

This stage is used by the Monitor App.

JavaScript stage

The JavaScript stage lets you specify your own JavaScript code for making transformations to documents. The stage has two modes:

  • Single Field Transformation Mode, for editing individual field values.
  • Document Transformation Mode, for adding, removing, and editing multiple fields and streams in a document.

This stage supports JavaScript that conforms to the ECMAScript-262 Edition 5.1 standard.

Tip: You can also define your own pipeline stages by using the Hitachi Content Intelligence plugin software development kit (SDK) to write custom stage plugins.

Single Field Transformation Mode

In this mode, you specify both:

  • The name of a field whose value you want to change, and
  • An implementation for the transformString() function. Your implementation needs to return a new object for the field value. Hitachi Content Intelligence later converts this returned object to a string.

In this mode, your JavaScript code has access only to the specified field. It cannot read or change any other field in the document.

Configuration settings
  • Field Name: The name of the field to transform. The value for this field is used as the parameter for the transformString() function.
  • JavaScript: Your JavaScript code.
  • Allow non-existent values: Determines how the stage behaves if the specified field doesn't exist in an incoming document:
    • When enabled, the stage passes null as a parameter into your JavaScript code. The specified field is then added to the document. Its value is the return value of the transformString() function.
    • When disabled, the stage reports a document failure.
Example: Change a field value to all lower case

JavaScript field contents:

/**
* Transforms a field value
*
* @param {String} fieldValue The current value of the field
* @return {Object} The Object on which toString() will be called to set the new value for the field
*/
function transformString(fieldValue) {
 return fieldValue.toLowerCase();
}

Document Transformation Mode

This mode lets you make transformations to an entire document. For example, you can use this stage to remove fields from a document or add together the values of multiple fields and store the sum in a new field.

In this mode, you provide an implementation for the transformDocument() function. The inputs to this function are:

  • callback: A PluginCallback object. Contains utility methods that allow the JavaScript stage to work with documents.
  • document: The Document object for the incoming document.

The function must return either a Document object or, in the case that your stage generates additional documents, an Iterator<Document> object.

The PluginCallback objects, Document objects, and others are defined by the Hitachi Content Intelligence plugin SDK. The stage allows your JavaScript code to access the functionality provided by the SDK for working with documents. The SDK is written in Java, not JavaScript. For information on obtaining the plugin SDK and its documentation, see your Hitachi Content Intelligence administrator.

Example: Create a new field that's the product of two existing fields


JavaScript field contents:

/**
* Transforms a document
*
* @param {PluginCallback} callback The callback to access execution context
* @param {Document} document The document to be transformed (immutable)
* @return {Iterator<Document> or Document} A single Document, or an iterator
* containing the new document that is the result of the transformation
*/
function transformDocument(callback, document) {
 var IntegerDocumentFieldValue = Java.type("com.hds.ensemble.sdk.model.IntegerDocumentFieldValue");
 var heightValue = parseInt(document.getStringMetadataValue("heightInches")); 
 var widthValue = parseInt(document.getStringMetadataValue("widthInches")); 
 var builder = callback.documentBuilder().copy(document);
 builder.setMetadata("areaInches", IntegerDocumentFieldValue.builder().setInteger(heightValue *
 widthValue).build());
 return builder.build();
}
Accessing field values

To get fields from the document, getStringMetadataValue() can be called with the name of the field:

var exampleValue = document.getStringMetadataValue("exampleFieldName");
Setting field values

Setting a field value requires a DocumentFieldValue object, which can be created by calling Java.type() with the full name of the applicable DocumentFieldValue subclass.

For example, to set a String value for a field, you first create a StringDocumentFieldValue object:

var StringDocumentFieldValue = Java.type("com.hds.ensemble.sdk.model.StringDocumentFieldValue");

You can then use this object to build the value:

var exampleStringValue = StringDocumentFieldValue.builder().setString("exampleFieldValue").build()

And finally set the value to a field in the document builder:

var builder = callback.documentBuilder().copy(document);
builder.setMetadata("exampleFieldName", exampleStringValue);

The types of document field values you can set are:

com.hds.ensemble.sdk.model.BooleanDocumentFieldValue
com.hds.ensemble.sdk.model.DateDocumentFieldValue
com.hds.ensemble.sdk.model.DoubleDocumentFieldValue
com.hds.ensemble.sdk.model.FloatDocumentFieldValue
com.hds.ensemble.sdk.model.IntegerDocumentFieldValue
com.hds.ensemble.sdk.model.LongDocumentFieldValue
com.hds.ensemble.sdk.model.StringDocumentFieldValue
com.hds.ensemble.sdk.model.TextDocumentFieldValue
com.hds.ensemble.sdk.model.UuidDocumentFieldValue

Each one includes builder().set<type>(<type>) methods. For example:

var exampleBooleanValue = BooleanDocumentFieldValue.builder().setBoolean(true).build();
var exampleIntegerValue = IntegerDocumentFieldValue.builder().setInteger(7).build();

For more information, see the documentation included with the Hitachi Content Intelligence plugin SDK.

Building documents

You can use the callback parameter as a factory to build modified documents:

var builder = callback.documentBuilder().copy(document);

This creates a DocumentBuilder class object, which includes a number of methods for working with the contents of a document. Some of these methods are:

  • setMetadata(java.lang.String name, DocumentFieldValue<?> value): Add a single field to this Builder. If metadata with this name already exists, the DocumentFieldValue is replaced.
  • removeMetadata(java.lang.String name): Remove a field from this Builder.
  • copy(Document original): Copy all of the existing fields (and streams, if relevant) from an existing Document.
  • removeStream(java.lang.String name): Remove the named stream from this Builder.

After you've modified the document contents, use DocumentBuilder.build() to build the final document:

return builder.build();

For more information, see the documentation on the DocumentBuilder class in the Hitachi Content Intelligence plugin SDK.
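Putting these methods together, here is a minimal Document Transformation Mode sketch that removes a working field and a stream from a document; the field and stream names are hypothetical:

/**
* Removes a scratch field and a temporary stream from the document
*
* @param {PluginCallback} callback The callback to access execution context
* @param {Document} document The document to be transformed (immutable)
* @return {Document} The modified document
*/
function transformDocument(callback, document) {
 var builder = callback.documentBuilder().copy(document);
 builder.removeMetadata("$scratchField");   // hypothetical field not needed downstream
 builder.removeStream("temporaryStream");   // hypothetical stream that should not be indexed
 return builder.build();
}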

Mapping stage

The Mapping stage lets you:

  • Rename document fields or streams.
  • Copy a field value or stream data to another field or stream.
  • Append a field value or stream data to another field or stream.
Renaming

For example, you have a document that contains a field named doctor and a document that contains a field named physician. These fields contain equivalent information, that is, names of doctors. To index this information using only the doctor field, you can use a Mapping stage to rename all occurrences of physician to doctor.

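For instance (the value shown is hypothetical), a Rename Field action with a Source Name of physician and a Target Name of doctor transforms this field/value pair:

physician: "Anne Jones"

into:

doctor: "Anne Jones"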
Copying

Alternatively, suppose you want to index the names of hospitals using both a field named hospital and a field named facility, but the facility field does not exist in your documents. To index this information under multiple field names, you can use a Mapping stage to duplicate the hospital field by copying it to a new field named facility.

Appending

In this example, the field value financial report is appended to the stream HCI_content. This means that when the document leaves the pipeline, the value financial report is indexed as part of the full text of the document, along with the data that the HCI_content stream points to. Users can then use the term financial report to find this document using a simple search in the Search App, rather than having to use an advanced query to search based on values for the type field.

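In terms of the settings below, this example corresponds to an Append Field to Stream action with a Source Name of type (the field whose value is financial report) and a Target Name of HCI_content.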
Input

Documents

Output

Documents with renamed, copied, or appended fields or streams.

Configuration settings
  • Mapping Action: Which action the stage should perform:
    • Rename Field: Changes the name of a field but keeps its original value.
    • Rename Stream: Changes the name of a stream, but the stream continues to point to the same data.
    • Copy Field to Field: Copies the value of one field to another field.
    • Copy Field to Stream: Copies the value of a field to a stream.
    • Copy Stream to Field: Takes the data that a stream points to and adds that data as a value for the specified field.
    • Copy Stream to Stream: Copies one stream to another.
    • Append Field to Stream: Appends a field value to a stream.
    • Append Stream to Stream: Appends one stream to another.
  • Targets to Map: A list of fields or streams that the stage should affect. For each list item, you specify:
    • Source Name: The field or stream to rename, copy, or append to another stream.
    • Target Name: The field or stream to which field values or streams will be added, copied, or appended.
  • Overwrite Field:
    • When this setting is enabled, if a value for the target field exists, the value is overwritten.
    • When this setting is disabled, the output for this field is added as an additional value on the target field.
Field naming considerations

This stage can add new fields to documents. For a field to be indexed, its name:

  • Cannot contain hyphens (-) or any other special characters.
  • Cannot start with underscores (_) or numbers.
Tip Beginning a field name with a dollar sign ($) causes the field to be deleted when the workflow pipeline finishes processing the associated document. That is, the field is not indexed.

Use this technique to prevent unnecessary fields from being indexed. For example, you can add a field called $meetsCondition to a document to satisfy a conditional statement later on in the pipeline, but the field might not include any valuable information for your users to search on.

Notes and best practices

Place this stage after a Text and Metadata Extraction stage to standardize or rename the fields added by the Text and Metadata Extraction stage.

If you put a Filter stage after this stage, avoid filtering out fields that you had spent time mapping.

Mbox Expansion stage

The Mbox Expansion stage opens Mbox files and creates documents for the emails within.

This stage affects only Mbox files.

Note: You need to use a conditional statement to ensure that only the applicable archive files enter this stage.
Example
Input

Documents for .mbox files.

Output

Documents for emails inside of .mbox files.

Documents produced by this stage include these fields, which identify the file from which the documents were expanded:

  • HCI_parentDisplay
  • HCI_parentId
  • HCI_parentUri
Configuration settings
  • Document Stream Name: The name of the stream that you want the stage to examine.

    The default stream is HCI_content, which is added to every document that the system reads.

  • Include source Mbox: When enabled, the original .mbox file itself is output by the stage, in addition to the documents within.
    Note: If the stage produces zero documents and encounters zero documents when processing an archive file, it outputs that archive file regardless of this setting. This can happen, for example, when the stage processes an empty archive.
  • Maximum Text Character Length: When this stage processes a document, it reads the contents of the document into system memory. For pipelines with many large documents, this can significantly impact processing time.

    With this setting, you can limit the number of characters that this stage reads per document; after the limit is reached, the stage stops reading the file and extracting metadata fields from it. To disable this setting, specify -1.

  • Expanded File Path: Determines the values for the HCI_path and HCI_relativePath fields for documents extracted from an archive file.

    The options are:

    • Expand to a path named after the archive file: Creates field values based on the name of the archive from which documents were expanded.

      For example, with this option selected, a document expanded from an archive file /2016/feburary.mbox will have these field/value pairs:

      • HCI_path: "2016/feburary.mbox-expanded/"
      • HCI_relativePath: "/feburary.mbox-expanded/"
      Use this option to link expanded documents back to the archives from which they were expanded.
    • Use original expanded file path: Tries to use the original expanded file path, if found in the archive.

      For example, with this option selected, a document expanded from an archive file /2016/feburary.mbox will have these field/value pairs:

      • HCI_path: "2016/"
      • HCI_relativePath: "/"

      Use this option when you are writing the expanded documents to a data source.

    • Customize the expanded base path: Allows you to specify the expanded file path to use.

      For example, if you selected this option and specified a custom path value of myCustomPath, a document expanded from an archive file /2016/feburary.mbox will have these field/value pairs:

      • HCI_path: "2016/myCustomPath/"
      • HCI_relativePath: "/myCustomPath/"
      Use this option when you are writing the expanded documents to a data source.
Consideration for writing expanded files to data sources
  • If you follow this expansion stage with an Execute Action stage that writes files to a data source, the Expanded File Path setting affects how the expanded files are written:
  • With the Expand to a path named after the archive file option selected, each .mbox that the stage processes will be expanded to its own folder.

    With either the Customize the expanded base path or Use original expanded file path option selected, all .mbox files that the stage processes will be expanded to the same folder.

    If multiple .mbox files contain files that have the same name, the Execute Action stage tries to write multiple different files to the same path in the data source. The results depend on the data source.

    For example, if multiple .mbox files contain a file called notes.txt, the data source might end up with either:

    • A single file called notes.txt that has multiple versions.
    • A single file called notes.txt, the contents of which match the last notes.txt file to be written to the data source.
    • An error writing the file, if the data source does not allow existing files to be overwritten.
Notes and best practices
  • Surround this stage with a conditional statement that allows only .mbox documents to pass through. That way, you avoid processing documents of the wrong type.
  • To determine the MIME types of new documents created by this stage, you should precede or follow this stage with a MIME Type Detection stage, depending on whether recursion is enabled for the workflow:
    • If either the Workflow Agent Recursion or Preprocessing Recursion settings are enabled for the workflow task, precede this stage with a MIME Type Detection stage.

      With these settings enabled, all new documents are sent to the beginning of the applicable workflow pipeline, rather than on to the next stage.

      For more information on these settings, see Task settings.
    • If both recursion settings are disabled, follow this stage with a MIME Type Detection stage.
  • If you need to expand very large .mbox documents, use the Preprocessing execution mode for the workflow pipeline that contains this stage.

MIME Type Detection stage

Detects the Multipurpose Internet Mail Extension (MIME) types of documents. This stage adds a field named Content_Type to the documents it processes.

Input

Documents

Output

Documents with a new Content_Type field/value pair that identifies the MIME type.

Configuration settings
  • Custom extension mapping: You can specify a file extension and the MIME type that you want to associate with it. You can add any number of these mappings, but each mapping can associate only one extension with only one MIME type.

    By default, this stage is configured with custom extension mapping entries for a number of file types that the MIME Type Detection stage might take a long time to process (for example, .zip archives and video files). An example mapping follows this list.

  • Force Detection: When this setting is enabled, the MIME Type Detection stage always processes incoming documents and overwrites any Content_Type fields already present in a document.
  • Overwrite Field:
    • When this setting is enabled, if a value for the target field exists, the value is overwritten.
    • When this setting is disabled, the output for this field is added as an additional value on the target field.
  • Use this option when your documents already have an HCI_filename field/value before entering the MIME Type Detection stage (for example, if your data connection adds this field by default to all documents or you have two MIME Type Detection stages in your pipeline).

  • Document Stream Name: The name of the stream that you want the stage to examine.

    The default stream is HCI_content, which is added to every document that the system reads.
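
For example, a custom extension mapping might associate the extension .mp4 with the MIME type video/mp4. With that mapping in place, the stage assigns video/mp4 to any document whose file name ends in .mp4 without examining the document's contents. (This mapping is an illustration; choose mappings that suit the file types in your own data sources.)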

Notes and best practices
  • Use the Custom extension mapping setting to avoid processing documents whose MIME types are time-consuming to discover. Instead of examining a document's contents to determine its MIME type, the stage will automatically assign the specified MIME type to documents that have the specified file extension. However, you should avoid doing this for file extensions that can be mapped to multiple MIME types.
  • Use this stage as the first stage in a pipeline. This allows you to use MIME type as a criterion in conditional statements in subsequent stages.

    At a minimum, place this stage before the Text and Metadata Extraction stage because that stage uses a document's Content_Type value to determine whether to process the document.

  • If the Workflow Recursion option is disabled for a task, you should follow each archive expansion stage (for example, TAR Expansion or ZIP Expansion) with a MIME Type Detection stage. For information on workflow recursion, see Task settings.

PST Expansion stage

The PST Expansion stage opens Microsoft Outlook personal storage table (.pst) files and creates documents for the emails and attachments within.

This stage affects only .pst files.

Note
  • You need to use a conditional statement to ensure that only the applicable archive files enter this stage.

    For an example of how to do this, see Default pipeline.

  • This stage does not create separate documents for most types of email attachments. To do that, add an Email Expansion stage to your pipeline. See Email Expansion stage.
Example

GUID-EA338BD6-647E-4D18-9F54-774D96C18D00-low.png
Input

Documents for .pst files

Output

Documents for emails inside of .pst files.

Documents produced by this stage include these fields, which identify the file from which the documents were expanded:

  • HCI_parentDisplay
  • HCI_parentId
  • HCI_parentUri
Configuration settings
  • Document Stream Name: The name of the stream that you want the stage to examine.

    The default stream is HCI_content, which is added to every document that the system reads.

  • Include source PST: When enabled, the .pst file itself is output by the stage, in addition to the documents within.
    NoteIf the stage produces zero documents and encounters zero errors when processing an archive file, it outputs that archive file regardless of this setting. This can happen, for example, when the stage processes an empty archive.
  • Make the attachments field multivalued: When disabled (default), the HCI_attachments field's value is stored as a single string on the original email document. Each value found is separated by a comma. When enabled, this string is instead parsed into a multivalued field, merging any duplicate values.
  • Maximum Text Character Length: The maximum number of characters to extract into the HCI_text stream.
  • Expanded File Path: Determines the values for the HCI_path and HCI_relativePath fields for documents extracted from an archive file.

    The options are:

    • Expand to a path named after the archive file: Creates field values based on the name of the archive from which documents were expanded.

      For example, with this option selected, a document expanded from an archive file /2016/february.pst will have these field/value pairs:

      HCI_path: "2016/february.pst-expanded/"

      HCI_relativePath: "/february.pst-expanded/"

      Use this option to link expanded documents back to the archives from which they were expanded.
    • Use original expanded file path: Tries to use the original expanded file path, if found in the archive.

      For example, with this option selected, a document expanded from an archive file /2016/february.pst will have these field/value pairs:

      HCI_path: "2016/"

      HCI_relativePath: "/"

      Use this option when you are writing the expanded documents to a data source.
    • Customize the expanded base path: Allows you to specify the expanded file path to use.

      For example, if you selected this option and specified a custom path value of myCustomPath, a document expanded from an archive file /2016/february.pst will have these field/value pairs:

      HCI_path: "2016/myCustomPath/"

      HCI_relativePath: "/myCustomPath/"

      Use this option when you are writing the expanded documents to a data source.
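
The following Python sketch shows how the three Expanded File Path options above would derive the HCI_path and HCI_relativePath values for the example archive /2016/february.pst. The function and mode names are illustrative only; the logic is inferred from the examples above and is not the product implementation.

import posixpath

def expanded_paths(archive_path, mode, custom_path="myCustomPath"):
    # Illustrative only: derives HCI_path and HCI_relativePath per the examples above.
    parent = posixpath.dirname(archive_path).lstrip("/") + "/"   # "2016/"
    name = posixpath.basename(archive_path)                      # "february.pst"
    if mode == "archive-named":    # Expand to a path named after the archive file
        return parent + name + "-expanded/", "/" + name + "-expanded/"
    if mode == "original":         # Use original expanded file path
        return parent, "/"
    if mode == "custom":           # Customize the expanded base path
        return parent + custom_path + "/", "/" + custom_path + "/"
    raise ValueError(mode)

print(expanded_paths("/2016/february.pst", "archive-named"))
# ('2016/february.pst-expanded/', '/february.pst-expanded/')
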
Consideration for writing expanded files to data sources

If you follow this expansion stage with an Execute Action stage that writes files to a data source, the Expanded File Path setting affects how the expanded files are written:

  • With the Expand to a path named after the archive file option selected, each .pst that the stage processes will be expanded to its own folder.
  • With either the Customize the expanded base path or Use original expanded file path option selected, all .pst files that the stage processes will be expanded to the same folder.

    If multiple .pst files contain files that have the same name, the Execute Action stage tries to write multiple different files to the same path in the data source. The results depend on the data source.

    For example, if multiple .pst files contain a file called notes.txt, the data source might end up with one of the following:
    • A single file called notes.txt that has multiple versions
    • A single file called notes.txt, the contents of which match the last notes.txt file to be written to the data source
    • An error writing the file, if the data source does not allow existing files to be overwritten
Notes and best practices
  • Surround this stage with a conditional statement that allows only .pst documents to pass through. That way, you avoid processing documents of the wrong type.
  • To determine the MIME types of new documents created by this stage, you should precede or follow this stage with a MIME Type Detection stage, depending on whether recursion is enabled for the workflow:
    • If either the Workflow Agent Recursion or Preprocessing Recursion setting is enabled for the workflow task, precede this stage with a MIME Type Detection stage.

      With these settings enabled, all new documents are sent to the beginning of the applicable workflow pipeline, rather than on to the next stage.

    • If both recursion settings are disabled, follow this stage with a MIME Type Detection stage.
  • If you need to expand very large .pst documents, use the Preprocessing execution mode for the workflow pipeline that contains this stage.

Read Lines stage

The Read Lines stage creates a new document for each line in a content stream.

TipUse this stage in conjunction with the Field Parser stage to process log files or comma-separated values (CSV) files.
Important The Read Lines stage doesn't add the HCI_content stream to the documents that it produces. As a result, document failures occur when those documents are processed by stages (such as the Text and Metadata Extraction stage) that expect the HCI_content stream to exist.

To avoid these failures when using the Read Lines stage, use conditional statements to surround all stages that process the HCI_content stream. If the Workflow Agent Recursion task setting is enabled, you also need to surround the Read Lines stage itself with a conditional statement.

Configure the conditional statements with this conditional expression:

Field name: HCI_content

Operator: stream_exists

For information on the Workflow Agent Recursion setting, see Task settings.

Example

In this example, the file 03-03-16.log contains three lines of data:

00:01,INFO,resource created
00:02,INFO,resource edited
00:03,INFO,resource deleted

The Read Lines stage creates a new document for each line.

GUID-F893A01D-1283-48A3-9E92-A69DEEFC34D9-low.png
Input

Documents with streams.

Output

Additional documents, one for each line in the input document stream.

Configuration settings
  • Input Stream Name: The name of the document stream from which you want to create additional documents.
  • Line Output Field Name: The name of the field in which each document produced by this stage stores the content of the line from which it was created.

    The default is $line.

  • Number of Lines to Skip: The number of lines to be skipped starting from the beginning of the input stream. This setting allows you to skip document headers.

    The default value is 0.
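
As an illustration of the behavior described above, the following Python sketch turns each line of the example log stream into a new document, honoring the Line Output Field Name and Number of Lines to Skip settings. The read_lines function is a placeholder and is not the product implementation.

def read_lines(stream_text, line_field="$line", lines_to_skip=0):
    # One new document per line, as the Read Lines stage does.
    docs = []
    for line in stream_text.splitlines()[lines_to_skip:]:
        # Each produced document carries the line content in the configured field.
        # Note that no HCI_content stream is attached, as described above.
        docs.append({line_field: line})
    return docs

log = "00:01,INFO,resource created\n00:02,INFO,resource edited\n00:03,INFO,resource deleted\n"
for doc in read_lines(log):
    print(doc)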

Field naming considerations

This stage can add new fields to documents. For a field to be indexed, its name:

  • Cannot contain hyphens (-) or any other special characters.
  • Cannot start with underscores (_) or numbers.
TipBeginning a field name with a dollar sign ($) causes the field to be deleted when the workflow pipeline finishes processing the associated document. That is, the field is not indexed.

Use this technique to prevent unnecessary fields from being indexed. For example, you can add a field called $meetsCondition to a document to satisfy a conditional statement later on in the pipeline, but the field might not include any valuable information for your users to search on.

Notes and best practices
  • To determine the MIME types of new documents created by this stage, you should precede or follow this stage with a MIME Type Detection stage, depending on whether recursion is enabled for the workflow:
    • If either the Workflow Agent Recursion or Preprocessing Recursion setting is enabled for the workflow task, precede this stage with a MIME Type Detection stage.

      With these settings enabled, all new documents are sent to the beginning of the applicable workflow pipeline, rather than on to the next stage.

    • If both recursion settings are disabled, follow this stage with a MIME Type Detection stage.

Replace stage

The Replace stage lets you replace all or part of a value for the fields you specify. You can use this stage, for example, to redact sensitive information such as credit card numbers, or to normalize values across multiple fields.

Redaction example

This example uses a regular expression to search the field SSN for social security numbers. Those numbers are then replaced with REDACTED.

GUID-FA1CF82F-44D9-440A-B943-D22E99D4F52F-low.png
Normalization example

This example uses a regular expression to search multiple fields for the value Street and replace each with the abbreviation St..

GUID-1F3EBB81-091B-45D4-B526-96CAEAEAD6A0-low.png
Input

Documents

Output

Documents with edited field values.

Configuration settings

Fields to process: A list of fields in which to replace values.

Values to replace:

  • Source expression: A regular expression used to search for the characters you want to replace within a field value. Valid values are regular expressions that are supported by the Java Pattern class.
  • Replacement: One or more characters used to replace the characters matched by the corresponding source expression.
Tip
  • Use %SPACE to specify a single space character in replacement values.
  • Document field values can be included in replacement text using ${<fieldName>} syntax, for example ${HCI_displayName}.

For example, use a Source expression value of [0-9]{4}-[0-9]{4}-[0-9]{4}- with a Replacement value of XXXXXX- to redact credit card numbers.
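
For illustration only, the following Python sketch reproduces the same replacement behavior with the re module; the stage itself evaluates source expressions with the Java Pattern class, and the replace_in_fields function is a placeholder, not the product implementation.

import re

def replace_in_fields(doc, fields, source_expression, replacement):
    # Applies one source expression/replacement pair to each listed field.
    for field in fields:
        if field in doc:
            doc[field] = re.sub(source_expression, replacement, doc[field])
    return doc

doc = {"SSN": "123-45-6789", "card": "1111-2222-3333-4444"}
replace_in_fields(doc, ["SSN"], r"[0-9]{3}-[0-9]{2}-[0-9]{4}", "REDACTED")
replace_in_fields(doc, ["card"], r"[0-9]{4}-[0-9]{4}-[0-9]{4}-", "XXXXXX-")
print(doc)   # {'SSN': 'REDACTED', 'card': 'XXXXXX-4444'}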

Notes and best practices

Place this stage after a Text and Metadata Extraction stage to standardize or rename the fields added by that stage.

Reject Documents stage

You can debug your pipelines by using conditional statements to filter problem documents into a Reject Documents stage. Any documents that reach this stage are reported as failures in the workflow task status report and are not processed by subsequent stages in your pipeline.

For information on viewing document failures for a task, see Task details, status, and results.

Example

Say, for example, that you expect all of your documents to contain the author field by the time they reach a certain point in your pipeline. You can verify this by inserting a Reject Documents stage at that point and surrounding it with a conditional statement that verifies whether the author field exists.

GUID-F757A137-210A-409B-92B4-9ED48093BBCC-low.png
Input

Documents

Output

None. Any documents entering the stage are reported as failures.

Configuration settings
  • Rejection Message: The error message to include with each document failure produced by this stage. The default is Document Rejected.
Notes and best practices
  • Always surround this stage with a conditional statement to ensure that only problem documents enter it.
  • If you are testing your pipeline with multiple Reject Documents stages, use specific and descriptive rejection messages so you know where each failure came from.

Snippet Extraction stage

The Snippet Extraction stage reads the content stream for a document, extracts a subset of the document's contents, and stores that subset as a new field. You can use this field to provide some sample text for each document in search results. Instead of returning the full contents of each document, use this stage to index and return a more manageable amount of document contents.

GUID-881B6193-A799-4BD9-A7AF-D9A0D4BB8CF0-low.png
Input

Documents with content streams.

Output

Documents with an additional field containing a snippet of the document's full content.

Configuration settings
  • Text Input Stream: The name of the stream from which to read document contents.
  • Snippet Output Field: The name of the field in which to store text extracted from the specified stream. The default is HCI_snippet.
  • Maximum Snippet Character Length: The maximum number of characters to extract from each document and store in the snippet field.
  • Snippet Start Offset: The number of characters to skip in the document before beginning to extract the snippet text.
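
Conceptually, a snippet built from these settings is a bounded substring of the stream contents, roughly as in this Python sketch (the extract_snippet function is illustrative, not the product implementation):

def extract_snippet(stream_text, max_length=300, start_offset=0,
                    output_field="HCI_snippet"):
    # Skip start_offset characters, then keep at most max_length characters.
    return {output_field: stream_text[start_offset:start_offset + max_length]}

print(extract_snippet("The quick brown fox jumps over the lazy dog.",
                      max_length=15, start_offset=4))
# {'HCI_snippet': 'quick brown fox'}
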
Field naming considerations

This stage can add new fields to documents. For a field to be indexed, its name:

  • Cannot contain hyphens (-) or any other special characters.
  • Cannot start with underscores (_) or numbers.
Tip Beginning a field name with a dollar sign ($) causes the field to be deleted when the workflow pipeline finishes processing the associated document. The field is not indexed.

Use this technique to prevent unnecessary fields from being indexed. For example, you can add a field called $meetsCondition to a document to satisfy a conditional statement later on in the pipeline, but the field might not include any valuable information for your users to search on.

Notes and best practices
  • If you are using this stage to avoid indexing a document's full content, follow this stage with a Filter stage that filters out the stream from which you extracted the snippet (that is, the stream specified for the Text Input Stream field). For information on the Filter stage, see Filter stage.
  • You can use this stage as a debugging tool to view the contents of document streams. For information, see Viewing stream contents.
  • Use this stage to support search result highlighting for your users. For information, see Search result highlighting.

SRT Text Extraction stage

The SRT Text Extraction stage lets you extract existing subtitle text from video files. For this stage to work on a video file:

  • Subtitle text must already exist. That is, this stage doesn't automatically create subtitle text for a video file that has none.
  • Subtitle text must be in SubRip Text (SRT) format:
    Subtitle ID number
    Time range for the subtitle
    Subtitle text
    <blank line>

    For example:

    212
    00:33:55,170 --> 00:33:57,210
    - Who's there?

    213
    00:33:58,000 --> 00:34:00,455
    - Just me.
  • Subtitle text must be stored as a content stream associated with a video file.

    For video files from an HCP namespace, for example, this stage processes subtitle text stored as a custom-metadata annotation on the file.

Example

In this example, an HCP object named InstructionalVideo.mp4 contains a stream named HCI_customMetadata_subtitles. This stream is a pointer for Hitachi Content Intelligence to retrieve the video's SRT text from HCP. The SRT Text Extraction stage accesses this stream, extracts SRT text from it, and adds that text to three new document fields for the video: one field for all subtitle text, one field for all subtitle start times, and one field for all subtitle end times.

GUID-1F282462-6EA2-4B52-BCDA-C330B8180274-low.png
Input

Documents for videos that have SRT annotations.

Output

Documents for the input videos. The output videos now include three additional multivalued fields: one for subtitle text values, one for subtitle start times, and one for subtitle end times.

Configuration settings
  • Single-valued (subtitle text only) or Multi-valued:
    • Single-valued (subtitle text only): Specifies that all subtitle text should be stored in one single-valued field. Subtitle time data is not extracted or stored.
      NoteUse this option to better support highlighting video subtitle text in search results. For more information, see Search result highlighting.
    • Multi-valued: Specifies that subtitle text and time data should be stored in separate, multi-valued fields.
  • Stream Name: The name of the content stream in which subtitle text is stored. This tells the stage where to look for a document's SRT data.
  • Subtitles Field: The name of the field where you want to store the extracted subtitle text.
  • Event Start Time (if Multi-valued is selected): The name of the field where you want to store the extracted subtitle start times.
  • Event End Time (if Multi-valued is selected): The name of the field where you want to store the extracted subtitle end times.
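
The following Python sketch shows how an SRT stream in the format above can be split into subtitle text, start time, and end time values, which is what the Multi-valued option stores in the three configured fields. The parse_srt function and the returned key names are illustrative assumptions, not the product implementation.

def parse_srt(srt_text):
    # Returns parallel lists of subtitle text, start times, and end times.
    subtitles, starts, ends = [], [], []
    for block in srt_text.strip().split("\n\n"):
        lines = block.splitlines()
        if len(lines) < 3:
            continue                                 # skip malformed blocks
        start, end = [t.strip() for t in lines[1].split("-->")]
        starts.append(start)
        ends.append(end)
        subtitles.append(" ".join(lines[2:]))        # subtitle text can span lines
    return {"subtitles": subtitles, "startTimes": starts, "endTimes": ends}

example = "212\n00:33:55,170 --> 00:33:57,210\n- Who's there?\n\n213\n00:33:58,000 --> 00:34:00,455\n- Just me."
print(parse_srt(example))
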
Field naming considerations

This stage can add new fields to documents. For a field to be indexed, its name:

  • Cannot contain hyphens (-) or any other special characters.
  • Cannot start with underscores (_) or numbers.
NoteBeginning a field name with a dollar sign ($) causes the field to be deleted when the workflow pipeline finishes processing the associated document. The field is not indexed.

Use this technique to prevent unnecessary fields from being indexed. For example, you can add a field called $meetsCondition to a document to satisfy a conditional statement later on in the pipeline, but the field might not include any valuable information for your users to search on.

Syslog Notification stage

The Syslog Notification stage sends syslog notifications while data is being processed. A single notification is sent for every document processed.

GUID-0B81329C-3944-407B-BA08-6AE17980C7B5-low.png
Input

Documents

Output

Processed documents and a syslog message for each document.

Configuration settings

Syslog Settings:

  • Host: The hostname or IP address for a syslog server.
  • Port: The syslog port. The default is port 514.
  • Facility: The syslog facility that the message is sent to. The default is local0, but you can choose from local1 through local7.
Message Settings
  • Message: The content of the syslog message.
Including field values

To include a field value in a syslog message, use this syntax:

${<field-name>}

For example:

${HCI_size}

The default is Processed document ${HCI_URI}.

For example, if you keep the default and the stage processes a document with the URL http://example.com/document.pdf, the resulting message would be:

Processed document http://example.com/document.pdf
Including aggregation values

To include an aggregation value in a syslog message, use this syntax:

${<aggregation-name>}

For example:

${Extensions}

If the aggregation name contains spaces, replace them with underscores (_). For example:

${Discovered_Fields}
  • Severity: The severity of the message. Choose from INFO (default), EMERGENCY, ALERT, CRITICAL, ERROR, WARN, NOTICE, and DEBUG.
  • Sender Identity: The identity of the sender for each message. The default is hci.
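
For illustration, the following Python sketch substitutes ${field} values into a message template and sends the result with the standard library's syslog handler. It is not the product's syslog client; the localhost address stands in for the configured Host, and the simple string replacement stands in for the stage's field substitution.

import logging
import logging.handlers

def send_syslog(message_template, doc, host="localhost", port=514):
    # Substitute ${field} values, then send one message for the document.
    message = message_template
    for name, value in doc.items():
        message = message.replace("${" + name + "}", str(value))

    handler = logging.handlers.SysLogHandler(
        address=(host, port),
        facility=logging.handlers.SysLogHandler.LOG_LOCAL0)  # default Facility
    logger = logging.getLogger("hci")        # "hci" mirrors the default Sender Identity
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.info(message)                     # INFO mirrors the default Severity

send_syslog("Processed document ${HCI_URI}",
            {"HCI_URI": "http://example.com/document.pdf"})
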
Notes and best practices

Because this stage sends a syslog message for every document it encounters, you should surround it with a conditional statement to limit the documents that reach the stage.

Tagging stage

The Tagging stage lets you add new fields and values to a document. The fields and values that you configure for this stage are added to all documents that pass through it.

GUID-5C171F15-273F-4ACA-A098-D6C5851CC1E7-low.png
Input

Documents

Output

Documents with additional pairs of fields and values.

Configuration settings
  • A list of fields and values to add.
  • Overwrite Field:
    • When this setting is enabled, if a value for the target field exists, the value is overwritten.
    • When this setting is disabled, the output for this field is added as an additional value on the target field.
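
The effect of these settings can be pictured with this small Python sketch; the tag function is illustrative only and is not the product implementation.

def tag(doc, new_fields, overwrite=False):
    # Add the configured field/value pairs to a document.
    for name, value in new_fields.items():
        if name not in doc or overwrite:
            doc[name] = value
        else:
            # Overwrite Field disabled: keep the existing value and add the new one.
            existing = doc[name] if isinstance(doc[name], list) else [doc[name]]
            doc[name] = existing + [value]
    return doc

print(tag({"department": "legal"}, {"department": "finance", "reviewed": "true"}))
# {'department': ['legal', 'finance'], 'reviewed': 'true'}
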
Field naming considerations

This stage can add new fields to documents. For a field to be indexed, its name:

  • Cannot contain hyphens (-) or any other special characters.
  • Cannot start with underscores (_) or numbers.
TipBeginning a field name with a dollar sign ($) causes the field to be deleted when the workflow pipeline finishes processing the associated document. That is, the field is not indexed.

Use this technique to prevent unnecessary fields from being indexed. For example, you could add a field called $meetsCondition to a document in order to satisfy a conditional statement later on in the pipeline, but the field may not include any valuable information for your users to search on.

Notes and best practices

To control which documents are sent to this stage, enclose the stage in a conditional statement.

TAR Expansion stage

The TAR Expansion stage opens .tar archives and creates documents for the files within.

This stage can process .tar, .tar.gz, and .tar.bz files.

NoteYou need to use a conditional statement to ensure that only the applicable archive files enter this stage.

For an example of how to do this, see Default pipeline.

GUID-9F4E13AA-D6C6-4032-9924-7FDA79E6853D-low.png
Input

Documents for .tar archives.

Output

Documents for files inside .tar archives.

Documents produced by this stage include these fields, which identify the file from which the documents were expanded:

  • HCI_parentDisplay
  • HCI_parentId
  • HCI_parentUri
Configuration settings
  • Document Stream Name: The name of the stream that you want the stage to examine.

    The default stream is HCI_content, which is added to every document that the system reads.

  • Limit Extracted Files: When enabled, you can specify a maximum number of files that can be extracted from a .tar file.
  • Include source TAR: When enabled, the .tar file itself is output by the stage, in addition to the documents within.
NoteIf the stage produces zero documents and encounters zero errors when processing an archive file, it outputs that archive file regardless of this setting. This can happen, for example, when the stage processes an empty archive.

Expanded File Path: Determines the values for the HCI_path and HCI_relativePath fields for documents extracted from an archive file.

The options are:

  • Expand to a path named after the archive file: Creates field values based on the name of the archive from which documents were expanded.

    For example, with this option selected, a document expanded from an archive file /2016/februaryLogs.tar will have these field/value pairs:

    • HCI_path: "2016/februaryLogs.tar-expanded/"
    • HCI_relativePath: "/februaryLogs.tar-expanded/"

    Use this option to link expanded documents back to the archives from which they were expanded.

  • Use original expanded file path: Tries to use the original expanded file path, if found in the archive.

    For example, with this option selected, a document expanded from an archive file /2016/februaryLogs.tar will have these field/value pairs:

    • HCI_path: "2016/"
    • HCI_relativePath: "/"

    Use this option when you are writing the expanded documents to a data source.

  • Customize the expanded base path: Allows you to specify the expanded file path to use.

    For example, if you selected this option and specified a custom path value of myCustomPath, a document expanded from an archive file /2016/februaryLogs.tar will have these field/value pairs:

    • HCI_path: "2016/myCustomPath/"
    • HCI_relativePath: "/myCustomPath/"

    Use this option when you are writing the expanded documents to a data source.

Consideration for writing expanded files to data sources

If you follow this expansion stage with an Execute Action stage that writes files to a data source, the Expanded File Path setting affects how the expanded files are written:

  • With the Expand to a path named after the archive file option selected, each .tar that the stage processes will be expanded to its own folder.
  • With either the Customize the expanded base path or Use original expanded file path option selected, all .tar files that the stage processes will be expanded to the same folder.

    If multiple .tar files contain files that have the same name, the Execute Action stage tries to write multiple different files to the same path in the data source. The results depend on the data source.

    For example, if multiple .tar files contain a file called notes.txt, the data source might end up with one of the following:

    • A single file called notes.txt that has multiple versions.
    • A single file called notes.txt, the contents of which match the last notes.txt file to be written to the data source.
    • An error writing the file, if the data source does not allow existing files to be overwritten.
Notes and best practices
  • Surround this stage with a conditional statement that allows only .tar, .tar.gz, and .tar.bz documents to pass through. That way, you avoid processing documents of the wrong type.
  • To determine the MIME types of new documents created by this stage, you should precede or follow this stage with a MIME Type Detection stage, depending on whether recursion is enabled for the workflow:
    • If either the Workflow Agent Recursion or Preprocessing Recursion setting is enabled for the workflow task, precede this stage with a MIME Type Detection stage.

      With these settings enabled, all new documents are sent to the beginning of the applicable workflow pipeline, rather than on to the next stage.

    • If both recursion settings are disabled, follow this stage with a MIME Type Detection stage.
  • If you need to expand very large .tar, .tar.gz, and .tar.bz documents, use the Preprocessing execution mode for the workflow pipeline that contains this stage.
  • Use the Decompression stage instead of this stage to expand Gzip files (.gz).

Text and Metadata Extraction stage

The Text and Metadata Extraction stage has two uses:

  • Extracting additional fields from a document's contents.

    This is the general purpose metadata extraction stage; it can discover a number of industry-standard metadata fields from hundreds of common document types.

    For example, for an email, this stage might extract fields for the email's subject, recipient, and sender.

  • Extracting searchable keywords from a document's full content. This is required for your users to be able to search for a document by its content, not just its metadata.

    For information on full-content search, see Enabling full-content search.

NoteThe Text and Metadata Extraction stage cannot process documents that are password-protected or encrypted.
Example

GUID-74B42218-DF87-42C2-A176-40461B41767F-low.png
Input

Documents

Output

Documents with:

  • Additional field/value pairs.
  • An additional stream that contains all keywords extracted from the input stream.
Configuration settings
  • Excluded Content-Type Values: A list of MIME types. Documents with these MIME types bypass this stage. By default, the Text and Metadata Extraction stage is configured to exclude archive files such as .zip and .tar files.

    This setting depends on the Content_Type field being present in documents. The Content_Type field is added by the MIME Type Detection stage. For information, see MIME Type Detection stage.

    TipYou can use this setting to improve pipeline performance by having the Text and Metadata Extraction stage skip file types that you're not interested in.
  • Input Stream Name: For each document entering the stage, this is the stream from which to extract field/value pairs. The default is HCI_content.
  • Output Stream Name: For each document exiting the stage, this is the stream in which the extracted text keywords are stored. The default is HCI_text.
    ImportantBy default, to support full-content search, index collections are configured to index HCI_text. If you change the name of the output stream for this stage, you need to add a new field to your index collection schema. The new field must have these settings:
    • Name: Same as the Output Stream Name for the Text and Metadata Extraction stage.
    • Type: text_hci
    • Field attributes:
      • indexed
      • multiValued

    For information on adding fields to an index collection schema, see Adding and editing fields in an index collection schema.

    NoteTo view the contents of this stream, test your pipeline using the Snippet Extraction stage. For information, see Viewing stream contents.
  • Extracted Character Limit: The maximum number of characters to extract from the input stream.
  • RFC-822 Email Display Subject: When a workflow task reads a document, it adds a field named HCI_displayName, which contains the filename for the document.

    With this option enabled, when the Text and Metadata Extraction stage processes an RFC-822-compliant email document, the stage replaces the contents of the HCI_displayName field with the email's subject.

  • Include Unprocessed Fields: When enabled, documents include Message_Raw_Header_ fields, which contain unprocessed metadata. This option defaults to false because many of the unprocessed metadata fields contain duplicate values that are not typically useful.
  • Skip Embedded Documents: When enabled, the stage skips extraction for any embedded documents.
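
At a high level, the exclusion check and character limit behave as in this Python sketch. The extract function is a simplification for illustration only: the actual stage extracts metadata fields as well as text, and it reads the configured input stream rather than a plain field.

def extract(doc, excluded_types, character_limit=1000000):
    # Documents with an excluded MIME type bypass the stage unchanged.
    if doc.get("Content_Type") in excluded_types:
        return doc
    text = doc.get("HCI_content", "")        # stands in for reading the input stream
    doc["HCI_text"] = text[:character_limit] # bounded keyword text for indexing
    return doc

print(extract({"Content_Type": "application/zip"}, {"application/zip"}))
print(extract({"Content_Type": "text/plain", "HCI_content": "hello world"},
              {"application/zip"}))
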
Notes and best practices
  • This stage can add a significant amount of time to a workflow task. Use the Excluded Content-Type Values setting to avoid processing the files for which you don't need to extract keywords or additional fields.
  • Because the Excluded Content-Type Values setting needs the Content_Type field to exist in documents, you should always place this stage after the MIME Type Detection stage.
  • Follow the Text and Metadata Extraction stage with a Date Conversion stage to normalize any date fields added to your documents.

Tokenizer stage

The Tokenizer stage lets you specify a string of one or more characters, called a token, and then use that token to perform several types of transformations on field values. You can then create new fields to hold the transformed values.

NoteBoth the Tokenizer stage and the Field Parser stage split field values into tokens. However, the Field Parser stage can output multiple fields, while the Tokenizer stage outputs only a single field.

The following sections detail the different Tokenizer operations.

Tokenize

This operation lets you split a single field value into multiple values. Every time the stage encounters the token character or sequence of characters, it creates a new value for the field.

Example

GUID-6AA300C0-7583-4B16-AFE1-63EA94FE60EA-low.png

Operation-specific configuration settings

  • Tokenizer: A string of one or more characters to use as a delimiter.
Replace

This lets you take a field value, replace one token with another, and store the new value in a new field.

Example

GUID-E3574A8B-6171-400B-B9CA-78C93403CAE1-low.png
TipIf you no longer need the original field, you can add a Filter stage to your pipeline to delete that field. In the example above, you might want to remove the Maintainer field. You can then add a Mapping stage to rename NewMaintainer to Maintainer.

Operation-specific configuration settings

  • Token to replace: A string of one or more characters to be replaced.
  • Replacement text: A string of one or more characters with which to replace the specified token.
Substring to Token

This lets you shorten field values. When the stage encounters the token for the first time in a field value, the remainder of the field value is deleted. The shortened value is then stored in a new field.

Example

GUID-C3353083-40AF-40CC-B1C2-93CBCB9D6D24-low.png

Operation-specific configuration settings

  • Token for substring end position: A string of one or more characters. When the stage encounters this string in a field value, it deletes the remainder of the field value.
Substring to Position

This lets you shorten field values. The stage extracts a portion of a field value and saves that portion as a new field.

Example

GUID-ED760779-04BD-4148-AE5D-F1B79EAF52B6-low.png

Operation-specific configuration settings

  • Start position: A number of characters to offset from the beginning of a field value. This setting is not inclusive. The default is 0 (that is, the stage starts its extraction from the beginning of the field value).
  • End position: A number of characters from the beginning of the field value. When the stage encounters the character at the specified position, it stops extracting characters from the field value. This setting is not inclusive. By default, the stage extracts the entire field value.
Input

Documents

Output

Documents with additional field/value pairs.

Shared Configuration settings

These configuration settings are common to all operation types:

  • Field Name: The metadata field that you want the Tokenizer stage to act on.
  • Tokenized Field Name: The name of the field in which to store the tokenized value.
NoteIf you specify the same field for both Field Name and Tokenized Field Name, the tokenized value does not replace the original value. Instead, the new value is appended to the original value for the specified field.
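
To make the four operations concrete, here is a Python sketch of each transformation. The function names are illustrative; the stage's configuration settings map onto the parameters shown.

def tokenize(value, token):
    # Tokenize: split one value into multiple values at each token.
    return value.split(token)

def replace_token(value, token, replacement):
    # Replace: swap one token for another.
    return value.replace(token, replacement)

def substring_to_token(value, token):
    # Substring to Token: drop everything from the first occurrence of the token onward.
    return value.split(token, 1)[0]

def substring_to_position(value, start=0, end=None):
    # Substring to Position: keep the characters between the two offsets.
    return value[start:end]

print(tokenize("alice,bob,carol", ","))                      # ['alice', 'bob', 'carol']
print(replace_token("jane.doe@example.com", "@", " at "))    # 'jane.doe at example.com'
print(substring_to_token("report.pdf?version=2", "?"))       # 'report.pdf'
print(substring_to_position("2016-02-29T13:00:00", 0, 10))   # '2016-02-29'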

URL Encoder/Decoder stage

The URL Encoder/Decoder stage lets you percent-encode or percent-decode field values that contain URLs.

GUID-679FB39B-D02D-4E89-9FFB-5C1136153A23-low.png
Input

Documents with URL fields.

Output

Documents with additional fields containing percent-encoded or percent-decoded URL values.

Configuration settings
  • Fields to Decode:
    • Existing Field Name: The field containing URLs you want to percent-decode.
    • Target Field Name: The field in which to store the percent-decoded URL.
  • Fields to Encode:
    • Existing Field Name: The field containing URLs that you want to percent-encode.
    • Target Field Name: The field in which to store the percent-encoded value.
  • Overwrite Field:
    • When this setting is enabled, if a value for the target field exists, the value is overwritten.
    • When this setting is disabled, the output for this field is added as an additional value on the target field.
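
Percent-encoding and decoding behave as in this Python sketch; urllib.parse stands in for the stage's implementation, and the field names are placeholders.

from urllib.parse import quote, unquote

doc = {"sourceUrl": "https://example.com/reports/q1 2016.pdf"}

# Fields to Encode: store a percent-encoded copy in the target field.
doc["sourceUrlEncoded"] = quote(doc["sourceUrl"], safe=":/")

# Fields to Decode: store a percent-decoded copy in the target field.
doc["sourceUrlDecoded"] = unquote(doc["sourceUrlEncoded"])

print(doc["sourceUrlEncoded"])   # https://example.com/reports/q1%202016.pdf
print(doc["sourceUrlDecoded"])   # https://example.com/reports/q1 2016.pdf
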
Field naming considerations

This stage can add new fields to documents. For a field to be indexed, its name:

  • Cannot contain hyphens (-) or any other special characters.
  • Cannot start with underscores (_) or numbers.
TipBeginning a field name with a dollar sign ($) causes the field to be deleted when the workflow pipeline finishes processing the associated document. That is, the field is not indexed.

Use this technique to prevent unnecessary fields from being indexed. For example, you can add a field called $meetsCondition to a document to satisfy a conditional statement later on in the pipeline, but the field might not include any valuable information for your users to search on.

XML Formatter stage

The XML Formatter stage takes a list of document fields, formats those fields as XML, and then outputs them to a stream.

TipUse this stage (in conjunction with an Execute Action stage) to write XML-formatted custom metadata to objects in HCP. For information, see Writing Hitachi Content Intelligence-created custom metadata to HCP objects.
GUID-1C1E2FD5-B5AF-4F47-A1CD-0650B8980874-low.png

In this example, the locationData stream added by the XML Formatter stage contains this XML:

<locationData>
<city><![CDATA[rome]]></city>
<country><![CDATA[IT]]></country>
<region><![CDATA[Latium]]></region>
<timeZone><![CDATA[Europe/Rome]]></timeZone>
<subjects>
<value><![CDATA[Colosseum]]></value>
<value><![CDATA[Arch of Constantine]]></value>
</subjects>
</locationData>
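
For illustration, the following Python sketch produces equivalent output from a map of field values; the format_xml function is an assumption based on the example above, with multivalued fields emitted as repeated value elements, and is not the product implementation.

def format_xml(fields, root="locationData"):
    # Wrap each field value in CDATA; list values become <value> children.
    parts = ["<%s>" % root]
    for name, value in fields.items():
        if isinstance(value, list):
            parts.append("<%s>" % name)
            parts.extend("<value><![CDATA[%s]]></value>" % v for v in value)
            parts.append("</%s>" % name)
        else:
            parts.append("<%s><![CDATA[%s]]></%s>" % (name, value, name))
    parts.append("</%s>" % root)
    return "\n".join(parts)

print(format_xml({"city": "rome", "country": "IT",
                  "subjects": ["Colosseum", "Arch of Constantine"]}))
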
Input

Documents with fields.

Output

Documents with added streams that contain XML-formatted data.

Configuration Settings
  • Output stream: A name for the stream that the stage adds to documents that pass through it.
  • Root XML element: Optionally, a name for the root XML element in the stream added to the document. If provided, this must be a valid XML element name.
  • Fields to format: A list of document fields to format as XML and add to the output stream.

ZIP Expansion stage

The ZIP Expansion stage opens .zip files and creates documents for the files within.

This stage affects only .zip files.

NoteYou need to use a conditional statement to ensure that only the applicable archive files enter this stage.

For an example of how to do this, see Default pipeline.

GUID-9F297BE5-60E0-4DA9-93FB-204F4D3943A6-low.png
Input

Documents for .zip files.

Output

Documents for files inside the .zip file.

Documents produced by this stage include these fields, which identify the file from which the documents were expanded:

  • HCI_parentDisplay
  • HCI_parentId
  • HCI_parentUri
Configuration settings
  • Document Stream Name: The name of the stream that you want the stage to examine.

    The default stream is HCI_content, which is added to every document that the system reads.
  • Limit Extracted Files: When enabled, you can specify a maximum number of files that can be extracted from a .zip file.
  • Include source ZIP: When enabled, the .zip file itself is output by the stage, in addition to the documents within.
    NoteIf the stage produces zero documents and encounters zero errors when processing an archive file, it outputs that archive file regardless of this setting. This can happen, for example, when the stage processes an empty archive.
  • Expanded File Path: Determines the values for the HCI_path and HCI_relativePath fields for documents extracted from an archive file.

    The options are:

    • Expand to a path named after the archive file: Creates field values based on the name of the archive from which documents were expanded.

      For example, with this option selected, a document expanded from an archive file /2016/februaryLogs.zip will have these field/value pairs:

      • HCI_path: "2016/februaryLogs.zip-expanded/"
      • HCI_relativePath: "/februaryLogs.zip-expanded/"

      Use this option to link expanded documents back to the archives from which they were expanded.

    • Use original expanded file path: Tries to use the original expanded file path, if found in the archive.

      For example, with this option selected, a document expanded from an archive file /2016/februaryLogs.zip will have these field/value pairs:

      • HCI_path: "2016/"
      • HCI_relativePath: "/"

      Use this option when you are writing the expanded documents to a data source.

    • Customize the expanded base path: Allows you to specify the expanded file path to use.

      For example, if you selected this option and specified a custom path value of myCustomPath, a document expanded from an archive file /2016/februaryLogs.zip will have these field/value pairs:

      • HCI_path: "2016/myCustomPath/"
      • HCI_relativePath: "/myCustomPath/"

      Use this option when you are writing the expanded documents to a data source.

Consideration for writing expanded files to data sources

If you follow this expansion stage with an Execute Action stage that writes files to a data source, the Expanded File Path setting affects how the expanded files are written:

  • With the Expand to a path named after the archive file option selected, each .zip that the stage processes will be expanded to its own folder.
  • With either the Customize the expanded base path or Use original expanded file path option selected, all .zip files that the stage processes will be expanded to the same folder.

    If multiple .zip files contain files that have the same name, the Execute Action stage tries to write multiple different files to the same path in the data source. The results depend on the data source.

    For example, if multiple .zip files contain a file called notes.txt, the data source might end up with one of the following:

    • A single file called notes.txt that has multiple versions.
    • A single file called notes.txt, the contents of which match the last notes.txt file to be written to the data source.
    • An error writing the file, if the data source does not allow existing files to be overwritten.
Notes and best practices
  • Surround this stage with a conditional statement that allows only .zip documents to pass through. That way, you avoid processing documents of the wrong type.
  • To determine the MIME types of new documents created by this stage, you should precede or follow this stage with a MIME Type Detection stage, depending on whether recursion is enabled for the workflow:
    • If either the Workflow Agent Recursion or Preprocessing Recursion setting is enabled for the workflow task, precede this stage with a MIME Type Detection stage.

      With these settings enabled, all new documents are sent to the beginning of the applicable workflow pipeline, rather than on to the next stage.

    • If both recursion settings are disabled, follow this stage with a MIME Type Detection stage.
  • If you need to expand very large .zip documents, use the Preprocessing execution mode for the workflow pipeline that contains this stage.

 
