Special cases and scenarios

This section provides examples of advanced cases and scenarios that you can configure Hitachi Content Search to handle.

Using aggregations and triggers to monitor a workflow

Say, for example, that you have a data source that charges you for each byte read. This data source is constantly growing, but your budget limits you to reading no more than 1 TB from it.

The following procedure demonstrates how you can use aggregations, triggers, and the Email Notification pipeline stage together to notify you when you've reached 75% of your allowed limit (that is, 750 GB). That way, you can stop your workflow and avoid exceeding your budget.

Note: This example assumes that you've already set up a workflow with the data connections, pipelines, and outputs that you want.

This example also uses the Workflow Metrics aggregation, which is created by default for each workflow.

The procedure consists of three steps:

  1. Create the trigger pipeline.
  2. Create the trigger.
  3. Run your workflow.

Create the trigger pipeline

  1. Click the Processing Pipelines window.

  2. Click Create Pipeline.

  3. Enter a name and, optionally, a description for the pipeline.

  4. Click Create.

    The new pipeline appears.
  5. Click Add Stages.

  6. Search for the Email Notification stage and add it to the pipeline.

  7. Configure the stage:

    1. In the SMTP Settings section, specify settings for the email server you want to use.

    2. In the Message Settings section, specify the following; the sketch after this procedure shows how the ${...} placeholders are filled in when the notification is sent:

      • From: triggerExample@hci.com
      • Subject: Trigger ${HCI_triggerName} activated — stop the ${HCI_workflowName} workflow
      • Body: Current number of bytes read: ${bytes_read}
      • Recipients: <your-email-address>
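
The Subject and Body settings above use ${...} placeholders such as ${HCI_triggerName}, ${HCI_workflowName}, and ${bytes_read}, which are filled in when the notification is sent. The sketch below is only an illustration outside of Hitachi Content Intelligence: it uses Python's string.Template, whose ${name} syntax matches these placeholders, to show the kind of message that results. The trigger name, workflow name, and byte count are made-up sample values.

    from string import Template

    # Sample values standing in for what the trigger supplies; the keys come from
    # the placeholders shown above, the values are illustrative only.
    values = {
        "HCI_triggerName": "bytesReadLimit",
        "HCI_workflowName": "billing-ingest",
        "bytes_read": 812345678901,
    }

    subject = Template(
        "Trigger ${HCI_triggerName} activated - stop the ${HCI_workflowName} workflow"
    ).substitute(values)
    body = Template("Current number of bytes read: ${bytes_read}").substitute(values)

    print(subject)  # Trigger bytesReadLimit activated - stop the billing-ingest workflow
    print(body)     # Current number of bytes read: 812345678901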

Create the trigger

  1. Click Workflows.

    The Workflow Designer page opens.
  2. Click the workflow that you want.

  3. Click the Task window.

  4. Click the Triggers window.

  5. Click the Add Trigger + tab.

  6. Name the trigger, and, optionally, write a description for it.

  7. Specify an expression:

    1. Click Add Condition.

    2. Configure the condition:

      • In the Aggregation field, type Workflow Metrics.
      • In the Key field, type bytes_read.
      • From the menu, select greater_than.
      • In the value field, type 750000000000 (that is, 750 GB expressed in bytes; see the sketch after this procedure).
  8. Set up the trigger pipeline:

    1. Click the Add Pipeline + tab.

    2. On the Add Pipeline to Trigger page, select the pipeline you created, and then click Add to Trigger.

  9. Click Add Trigger to add it to your workflow.
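
The sketch below, which is plain Python rather than anything in HCI, shows where the example value 750000000000 comes from: it is 75% of a 1 TB budget using decimal (SI) units. If your provider bills in binary units (TiB/GiB), the threshold value would differ.

    # 75% of a 1 TB read budget, using decimal (SI) units: 1 TB = 10**12 bytes.
    BUDGET_BYTES = 1 * 10**12
    THRESHOLD_BYTES = int(BUDGET_BYTES * 0.75)

    print(THRESHOLD_BYTES)                    # 750000000000
    print(THRESHOLD_BYTES == 750000000000)    # True: matches the trigger condition value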

Run your workflow

Procedure

  1. Start your workflow running. For information, see Running workflow tasks.

Results

After the workflow completes processing a batch of documents, if the total number of bytes read is greater than 750 GB, you begin receiving email notifications and continue to receive them until you pause the workflow. For information, see Pausing and resuming tasks.

Writing HCI custom metadata to HCP objects

This topic describes how to use Hitachi Content Intelligence to extract metadata from your HCP objects, format that metadata as XML, and write it back to each HCP object as a custom metadata annotation.

To write the custom metadata to HCP objects:

Procedure

  1. Create a data connection to the location where the files are stored.

  2. Create an HCP data connection for the location where the output HCP objects will be stored.

    Important: This example does not format custom metadata as XML. To complete these steps, you need to configure the destination HCP namespace to not require custom metadata in XML format.
  3. Create a pipeline.

  4. Add a Drop Documents stage at the beginning of the pipeline.

  5. Surround the Drop Documents stage with a conditional statement that ensures only .srt documents are dropped from the workflow.

  6. If the document needs to be written to a location other than where it is currently stored, add an Execute Action stage with the following configuration:

    • Data connection: The connection where you want to store the transformed HCP objects.
    • Selected Action: Write Annotation
    • Write Annotation Action Config:
      • Stream: HCI_content
      • Filename field: HCI_filename
      • Path field: HCI_relativePath
      • Write all annotations: No
  7. Add an Attach Stream stage to the pipeline and configure it with the following settings:

    • If you want to attach custom metadata from one document to another:
      • Stream Name: HCP_customMetadata_
      • Stream URI: ${HCI_URI}.srt
      • Data Source Selection: Use Document Data Source
      • Additional Custom Stream Metadata:
        • Key: HCP_customMetadata
        • Value: subtitles
    • If you want to attach the content of one document as the custom metadata of another:
      • Stream Name: HCI_content
      • Stream URI: ${HCI_URI}.srt
      • Data Source Selection: Use Document Data Source
      • Additional Custom Stream Metadata:
        • Key: HCP_customMetadata
        • Value: subtitles
  8. Add an Execute Action stage to the end of the pipeline and configure it with the following settings:

    • Data connection: The connection where you want to store the transformed HCP objects.
    • Selected Action: Write Annotation
    • Write Annotation Action Config:
      • Filename field: HCI_filename
      • Path field: HCI_relativePath
      • Write all annotations: Yes
  9. Create a workflow.

  10. Add the data connection created in step 1 as the workflow input.

  11. Add the processing pipeline to the workflow.

  12. Run the workflow task.

Results

The system writes XML as a custom metadata annotation named hciMetadata to the corresponding HCP object for each document.
For example, if the pipeline also includes a Tagging stage that adds a field called reasonForHold with a value of Case_12345, the following custom metadata XML annotation is written to HCP objects:

<hci-metadata>
  <reasonForHold>
    <![CDATA[ Case_12345 ]]>
  </reasonForHold>
</hci-metadata>
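
The sketch below illustrates only the shape of that annotation, not how the stage builds it internally: it assembles the same CDATA-wrapped XML from a hypothetical field name and value.

    # Hypothetical field name and value, matching the Tagging example above.
    field_name = "reasonForHold"
    field_value = "Case_12345"

    annotation = (
        "<hci-metadata>"
        f"<{field_name}><![CDATA[ {field_value} ]]></{field_name}>"
        "</hci-metadata>"
    )
    print(annotation)
    # <hci-metadata><reasonForHold><![CDATA[ Case_12345 ]]></reasonForHold></hci-metadata>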

Writing file contents as custom metadata to HCP objects

This topic provides an example of how you can use Hitachi Content Intelligence to extract the contents of a file and write those contents as custom metadata on an object in HCP.

You may want to do this if you have two closely associated files that you want to store as a single HCP object. For example, a video file and an associated SubRip (.srt) text file:

video1.mp4

video1.mp4.srt

To store these files together as a single HCP object:

Procedure

  1. Create a data connection to the location where the files are stored.

  2. Create an HCP data connection for the location where the output HCP objects will be stored.

    Important: This example does not format custom metadata as XML. To complete these steps, you need to configure the destination HCP namespace to not require custom metadata in XML format.
  3. Create a pipeline.

  4. Add a Drop Documents stage at the beginning of the pipeline.

  5. Surround the Drop Documents stage with a conditional statement that ensures only .srt documents are dropped from the workflow.

  6. If the document needs to be written to a location other than where it is currently stored, add an Execute Action stage with the following configuration:

    • Data connection: The connection where you want to store the transformed HCP objects.
    • Selected Action: Write File
    • Write Annotation Action Config:
      • Stream: HCI_content
      • Filename field: HCI_filename
      • Path field: HCI_relativePath
      • Write all annotations: No
  7. Add an Attach Stream stage to the pipeline and configure it with the following settings:

      • Stream Name: HCI_content
      • Stream URI: ${HCI_URI}.srt
      • Data Source Selection: Use Document Data Connection
      • Additional Custom Stream Metadata:
        • Key: HCP_customMetadata
        • Value: subtitles
  8. Add an Execute Action stage to the end of the pipeline and configure it with the following settings:

    • Data connection: The connection where you want to store the transformed HCP objects.
    • Selected Action: Write Annotation
    • Write Annotation Action Config:
      • Filename field: HCI_filename
      • Path field: HCI_relativePath
      • Write all annotations: Yes
  9. Create a workflow.

  10. Add the data connection created in step 1 as the workflow input.

  11. Add the processing pipeline to the workflow.

  12. Run the workflow task.

Results

Hitachi Content Intelligence writes the video file to HCP. The resulting HCP object has a custom metadata annotation called subtitles, whose contents are the same as those from the .srt file.
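
To verify the result, you can read the annotation back from HCP. The sketch below is an illustration built on assumptions: it uses Python's requests library and assumes that the HCP REST gateway exposes named annotations through the type=custom-metadata and annotation query parameters (check the HCP REST API reference for your release for the exact syntax). The host name, object path, and authorization header are placeholders.

    import requests

    # Placeholder namespace URL and object path; replace with your own values.
    NAMESPACE_URL = "https://ns1.tenant1.hcp.example.com"
    OBJECT_PATH = "/rest/videos/video1.mp4"

    # Assumed query parameters for reading a named annotation; verify against the
    # HCP REST API documentation for your HCP release.
    params = {"type": "custom-metadata", "annotation": "subtitles"}

    resp = requests.get(
        NAMESPACE_URL + OBJECT_PATH,
        params=params,
        headers={"Authorization": "<your HCP authorization header>"},  # placeholder
        verify=False,  # only acceptable in a lab with self-signed certificates
    )
    resp.raise_for_status()
    print(resp.text)  # expected to print the contents of video1.mp4.srt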

Adding document caching to your workflows

Problem

You have a workflow that you need to run many times (for example, because you are actively developing it and need to test your pipelines). However, the inputs for this workflow are expensive to read from:

  • Some data lives in an Amazon S3 bucket, which costs money to access.
  • Some data lives in a data source that’s located far away and to which connections are slow.

Solution

Read the data from your expensive data sources once and copy it to other data sources that are cheap to read from. By using those cheap data sources as your workflow inputs, you can run your workflow as often as you want without ever again reading from the expensive data sources. A sketch after the following procedure illustrates the pattern.

Procedure

  1. Create a data connection for each of your data sources.

  2. Create an empty pipeline.

  3. Create a workflow with these components:

    • Input: Data connections for your expensive-to-read-from data sources.
    • Processing Pipelines: The empty pipeline.
    • Output: Data connection for your cheap-to-read-from data source.
  4. Create a second workflow with these components:

    • Input: Data connection for your cheap-to-read-from data source.
    • Processing Pipelines: The pipelines that you're testing.
    • Output: The index collections that you're testing.
  5. Run the first workflow once to completion.

  6. Run the second workflow as often as you need. Data for this workflow is read only from your cheap-to-read-from data source.
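
Outside of HCI, the same pattern looks like the sketch below: pay the expensive read once, keep a copy in a cheap location, and let every later test run work from the copy. The function names, paths, and payloads are hypothetical; in HCI, the two workflows above play these roles.

    from pathlib import Path

    CACHE_DIR = Path("./cache")  # stands in for the cheap-to-read-from data source

    def expensive_read(name: str) -> bytes:
        # Stands in for reading from a slow or metered data source such as Amazon S3.
        return b"example payload"  # placeholder content

    def populate_cache(names: list) -> None:
        # Run once: copy everything from the expensive source into the cache.
        CACHE_DIR.mkdir(exist_ok=True)
        for name in names:
            (CACHE_DIR / name).write_bytes(expensive_read(name))

    def process_from_cache() -> None:
        # Run as often as needed: reads only the cached copies, never the expensive source.
        for path in CACHE_DIR.iterdir():
            data = path.read_bytes()
            print(path.name, len(data), "bytes")  # stand-in for your real processing

    populate_cache(["doc1.txt", "doc2.txt"])  # like the first workflow: run once
    process_from_cache()                      # like the second workflow: run repeatedly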

Copying data from Amazon S3 to HCP

This topic describes how to use your system to copy all content from an Amazon S3 bucket to an HCP namespace.

Procedure

  1. Create two data connections:

    • An Amazon S3 data connection for the Amazon S3 bucket you want to copy data from.
    • An S3 Compatible data connection for the HCP namespace that you want to copy data to.
  2. Create a pipeline, but do not add any stages to it.

  3. Create a workflow.

  4. Edit the workflow to use the Amazon S3 data connection as an input.

  5. Edit the workflow to use the empty pipeline.

  6. Edit the workflow to use the S3 Compatible data connection as an output. Configure the output to perform the Output File action.

  7. Run the workflow task.

Results

All data is copied from the Amazon S3 bucket to the HCP namespace.
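
If you want to see an equivalent copy expressed outside of HCI, the sketch below does the same job with boto3: it lists every object in the Amazon S3 bucket and writes each one to the HCP namespace through HCP's S3-compatible API. The bucket name, endpoint, and credentials are placeholders, and for large objects you would use multipart transfers rather than this simple get/put loop.

    import boto3

    # Source: the Amazon S3 bucket to copy from (uses your normal AWS credentials).
    src = boto3.client("s3")
    SRC_BUCKET = "my-source-bucket"  # placeholder

    # Destination: the HCP namespace, reached through its S3-compatible endpoint.
    dst = boto3.client(
        "s3",
        endpoint_url="https://tenant1.hcp.example.com",  # placeholder HCP endpoint
        aws_access_key_id="<hcp-access-key>",
        aws_secret_access_key="<hcp-secret-key>",
    )
    DST_BUCKET = "my-namespace"  # the namespace, addressed as a bucket over the S3 API

    # Walk the source bucket and copy each object to the destination namespace.
    paginator = src.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SRC_BUCKET):
        for obj in page.get("Contents", []):
            body = src.get_object(Bucket=SRC_BUCKET, Key=obj["Key"])["Body"].read()
            dst.put_object(Bucket=DST_BUCKET, Key=obj["Key"], Body=body)
            print("copied", obj["Key"])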

Parsing and indexing CSV and log files

You can use the Field Parser and Read Lines stages together in your pipeline to extract fields from files such as CSVs and logs (that is, files where each line represents a separate event occurrence or piece of data).

Log file example

Consider these log file entries:

172.18.10.106,admin,[20/Jul/2016:11:30:40],"POST /auth/oauth/ HTTP/1.1",200
172.18.10.106,admin,[20/Jul/2016:11:30:40],"GET /api/admin/setup HTTP/1.1",200
172.18.10.106,admin,[20/Jul/2016:11:30:40],"GET /api/admin/alerts HTTP/1.1",200
172.18.10.106,admin,[20/Jul/2016:11:30:53],"POST /api/admin/objects HTTP/1.1",201
172.18.10.106,jsmith,[20/Jul/2016:11:30:53],"GET /api/admin/objects/d7dc655f-e42b-4e2e-a48d-72d49beda939 HTTP/1.1",403

Each line in the file contains this information, separated by commas: IP address, username, date and time, API request, and HTTP response code.

You could extract each piece of information as a separate document field and then index those fields. This would allow your users to perform complex queries such as:

  • "Show me all actions that the admin user performed between June and July 2016":

    +eventUser:admin +eventDateTime:[1464739200 TO 1470009599]

  • "Show me events where a user was denied permission to do something":

    +eventReturnCode:403
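
The numeric range in the first query above is an epoch-seconds window covering June and July 2016 in UTC. The following plain-Python check shows where those bounds come from:

    from datetime import datetime, timezone

    start = datetime(2016, 6, 1, tzinfo=timezone.utc)   # 1 June 2016, 00:00:00 UTC
    end = datetime(2016, 8, 1, tzinfo=timezone.utc)     # 1 August 2016, 00:00:00 UTC

    print(int(start.timestamp()))      # 1464739200 -> lower bound of the range query
    print(int(end.timestamp()) - 1)    # 1470009599 -> upper bound (last second of 31 July)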

Example procedure

This procedure demonstrates how you can use the Read Lines and Field Parser stages to extract fields from log files like the one in the example above.

Conditionally separate log files into individual entries

Workflow tasks perform better when processing large numbers of small files rather than small numbers of large files. If your logs exist as large files, you should split them into smaller documents, one for each log entry, before having the system parse and index the entries.

Procedure

  1. Create a workflow.

  2. Create a data connection for the data source that contains your log files.

  3. Add the data connection as a workflow input.

  4. Create another data connection for the data source where you want to store the individual log entry documents.

  5. Add the data connection as a workflow output and configure it to perform the Write File action.

  6. Create a pipeline.

  7. Add the pipeline to the workflow.

    Tip: When your pipeline contains the Read Lines stage, try setting the pipeline execution mode to Preprocessing. In some situations, this can yield faster pipeline performance.
  8. Add a Read Lines stage to the pipeline. Do not change the stage's configuration.

    By default, the Read Lines stage examines the HCI_content stream for each document and creates a new document for each individual line it finds. Each of these new documents contains a field called $line, by default.

  9. Surround the Read Lines stage with a conditional statement.

  10. Configure the conditional statement to check for the existence of the HCI_content stream:

    Field name: HCI_content

    Operator: stream_exists

  11. Run the workflow task.

Results

The task reads the log files from the data source, and the Read Lines stage splits them into individual documents, one for each log entry. Each log entry document is written to the data source that you specified in step 5. The sketch below illustrates the Read Lines behavior outside of HCI.
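
The sketch below, which is plain Python and not the stage's implementation, mirrors that behavior: when a document has a content stream, each non-empty line becomes its own small document carrying a $line field.

    def split_into_line_documents(document: dict) -> list:
        # Mirrors the conditional statement: only documents that have an
        # HCI_content stream are split; anything else passes through untouched.
        content = document.get("HCI_content")
        if content is None:
            return [document]
        return [{"$line": line} for line in content.splitlines() if line.strip()]

    log_file = {
        "HCI_content": (
            '172.18.10.106,admin,[20/Jul/2016:11:30:40],"POST /auth/oauth/ HTTP/1.1",200\n'
            '172.18.10.106,jsmith,[20/Jul/2016:11:30:53],"GET /api/admin/objects HTTP/1.1",403\n'
        )
    }
    for doc in split_into_line_documents(log_file):
        print(doc["$line"])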

Create a workflow to parse and index individual log entries

  1. Create another workflow.

  2. Add the data connection that contains your log entry files as an input to the workflow.

  3. Create a pipeline.

  4. Add the pipeline to the workflow.

  5. Add a Field Parser stage to the pipeline.

  6. Configure the Field Parser stage with these settings:

    • Input Field Name: $line
    • Tokenize Input Field Value: Yes
    • Tokenization Delimiter: , (comma)
    • Substrings to Parse: None
    • Token to Field Mapping:
      Token Index Number    Output Field Name
      0                     eventIpAddress
      1                     eventUser
      2                     eventDateTime
      3                     eventApiRequest
      4                     eventReturnCode
    • Add Debug Fields: No
  7. Add a Date Conversion stage immediately after the Field Parser stage.

  8. Edit the Date Conversion stage by adding the eventDateTime field to the list of Fields to Process.

    This ensures that the eventDateTime field is indexed correctly. The sketch at the end of this procedure shows the equivalent parsing and date conversion in plain Python.
  9. Add the data connection and pipeline to the workflow.

  10. Run the workflow task to have the task discover all fields produced by your pipeline.

  11. Create an index collection. Select Basic for the initial schema.

  12. Add the fields that the workflow task discovered to the index collection schema.

  13. Add the index collection to the workflow.

  14. Run the workflow task again to index the log files.
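
For reference, the sketch below reproduces in plain Python what the Field Parser and Date Conversion stages are configured to do above: split each $line value on commas, map token positions to the field names in the Token to Field Mapping table, and convert the timestamp to epoch seconds so that range queries like the earlier example work. It is an illustration only, not HCI code.

    from datetime import datetime, timezone

    # Token Index Number -> Output Field Name, as configured in the Field Parser stage.
    TOKEN_TO_FIELD = {
        0: "eventIpAddress",
        1: "eventUser",
        2: "eventDateTime",
        3: "eventApiRequest",
        4: "eventReturnCode",
    }

    def parse_log_line(line: str) -> dict:
        # A plain comma split is enough for this log format because the quoted
        # request string contains no commas.
        tokens = line.split(",")
        fields = {name: tokens[index] for index, name in TOKEN_TO_FIELD.items()}
        # Date Conversion: turn [20/Jul/2016:11:30:40] into epoch seconds (UTC assumed).
        parsed = datetime.strptime(fields["eventDateTime"], "[%d/%b/%Y:%H:%M:%S]")
        fields["eventDateTime"] = int(parsed.replace(tzinfo=timezone.utc).timestamp())
        return fields

    print(parse_log_line(
        '172.18.10.106,admin,[20/Jul/2016:11:30:40],"POST /auth/oauth/ HTTP/1.1",200'
    ))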

 
