Special cases and scenarios
This section includes examples of some advanced cases and scenarios that you can configure Hitachi Content Intelligence to handle.
Using aggregations and triggers to monitor a workflow
Say, for example, that you have a data source that charges you for each byte read. This data source is constantly growing, but your budget limits you to reading no more than 1 TB from it.
The following procedure demonstrates how you can use aggregations, triggers, and the Email Notification pipeline stage together to notify you when you've reached 75% of your allowed limit (that is, 750 GB). That way, you can stop your workflow and avoid exceeding your budget.
This example also uses the Workflow Metrics aggregation, which is created by default for each workflow.
The procedure consists of three steps:
- Create the trigger pipeline.
- Create the trigger.
- Run your workflow.
Create the trigger pipeline
Click the Processing Pipelines window.
Click Create Pipeline.
Enter a name and, optionally, a description for the pipeline.
Click Create.
The new pipeline appears.
Click Add Stages.
Search for the Email Notification stage and add it to the pipeline.
Configure the stage:
In the SMTP Settings section, specify settings for the email server you want to use.
In the Message Settings section, specify these:
- From:
triggerExample@hci.com
- Subject:
Trigger ${HCI_triggerName} activated — stop the ${HCI_workflowName} workflow
- Body:
Current number of bytes read: ${bytes_read}
- Recipients:
<your-email-address>
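When the notification is sent, the ${...} placeholders in the message settings are replaced with values from the trigger and workflow context. The substitution can be sketched with Python's string.Template (the context values here are illustrative, not real product output):

```python
from string import Template

# Hypothetical context values; in the product these come from the
# running workflow and the trigger that fired.
context = {
    "HCI_triggerName": "bytesReadLimit",
    "HCI_workflowName": "billing-ingest",
    "bytes_read": 750000000001,
}

subject = Template(
    "Trigger ${HCI_triggerName} activated - stop the ${HCI_workflowName} workflow"
).substitute(context)
body = Template("Current number of bytes read: ${bytes_read}").substitute(context)

print(subject)
print(body)
```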
Create the trigger
Click Workflows.
The Workflow Designer page opens.
Click the workflow that you want.
Click the Task window.
Click the Triggers window.
Click the Add Trigger + tab.
Name the trigger, and, optionally, write a description for it.
Specify an expression:
Click Add Condition.
Configure the condition:
- In the Aggregation field, type Workflow Metrics.
- In the Key field, type bytes_read.
- From the menu, select greater_than.
- In the Value field, type 750000000000 (that is, 750 GB in bytes).
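The condition above amounts to a simple comparison against the running aggregation value. A minimal sketch (the function and metrics dictionary are illustrative, not the product API):

```python
LIMIT_BYTES = 1_000_000_000_000      # 1 TB budget
THRESHOLD_BYTES = 750_000_000_000    # 75% of the budget

def trigger_fires(workflow_metrics: dict) -> bool:
    """Return True when the bytes_read metric exceeds the 75% threshold."""
    return workflow_metrics.get("bytes_read", 0) > THRESHOLD_BYTES

print(trigger_fires({"bytes_read": 500_000_000_000}))  # False
print(trigger_fires({"bytes_read": 800_000_000_000}))  # True
```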
Set up the trigger pipeline:
Click the Add Pipeline + tab.
On the Add Pipeline to Trigger page, select the pipeline you created. Then click Add to Trigger.
Click Add Trigger to add it to your workflow.
Run your workflow
Procedure
Start your workflow running. For information, see Running workflow tasks.
Writing HCI custom metadata to HCP objects
This topic details how you use Hitachi Content Intelligence to extract metadata from your HCP objects, format that metadata as XML, and write it back to each HCP object as custom metadata annotations.
To write the custom metadata to HCP objects:
Procedure
Create a data connection to the location where the files are stored.
Create an HCP data connection for the location where the output HCP objects will be stored.
Important: This example does not format custom metadata as XML. To complete these steps, you need to configure the destination HCP namespace to not require custom metadata in XML format.
Create a pipeline.
Add a Drop Documents stage at the beginning of the pipeline.
Surround the Drop Documents stage with a conditional statement that ensures only .srt documents are dropped from the workflow.
If the document needs to be written to a location other than where it is currently stored, add an Execute Action stage with the following configuration:
- Data connection: The connection where you want to store the transformed HCP objects.
- Selected Action: Write File
- Write File Action Config:
- Stream: HCI_content
- Filename field: HCI_filename
- Path field: HCI_relativePath
- Write all annotations: No
Add an Attach Stream stage to the pipeline and configure it with the following settings:
- If you want to attach custom metadata from one document to another:
- Stream Name: HCP_customMetadata_
- Stream URI: ${HCI_URI}.srt
- Data Source Selection: Use Document Data Source
- Additional Custom Stream Metadata:
- Key: HCP_customMetadata
- Value: subtitles
- If you want to attach the content of one document as the custom metadata of another:
- Stream Name: HCI_content
- Stream URI: ${HCI_URI}.srt
- Data Source Selection: Use Document Data Source
- Additional Custom Stream Metadata:
- Key: HCP_customMetadata
- Value: subtitles
Add an Execute Action stage to the end of the pipeline and configure it with the following settings:
- Data connection: The connection where you want to store the transformed HCP objects.
- Selected Action: Write Annotation
- Write Annotation Action Config:
- Filename field: HCI_filename
- Path field: HCI_relativePath
- Write all annotations: Yes
Create a workflow.
Add the data connection created in step 1 as the workflow input.
Add the processing pipeline to the workflow.
Run the workflow task.
Results
The workflow writes a custom metadata annotation to each HCP object, similar to the following:
<hci-metadata>
<reasonForHold>
<![CDATA[ Case_12345 ]]>
</reasonForHold>
</hci-metadata>
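A small annotation like the one above can be built with plain string formatting. A minimal sketch (element names mirror the example output; this is not a product API):

```python
def build_annotation(reason: str) -> str:
    # Wrap the value in a CDATA section so characters such as & or <
    # need no XML escaping.
    return (
        "<hci-metadata>\n"
        "  <reasonForHold>\n"
        f"    <![CDATA[ {reason} ]]>\n"
        "  </reasonForHold>\n"
        "</hci-metadata>"
    )

print(build_annotation("Case_12345"))
```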
Writing file contents as custom metadata to HCP objects
This topic provides an example of how you could use Hitachi Content Intelligence to extract the contents of a file and write it as custom metadata on an object in HCP.
You may want to do this if you have two closely associated files that you want to store as a single HCP object. For example, a video file and an associated SubRip (.srt) text file:
video1.mp4
video1.mp4.srt
To store these files together as a single HCP object:
Procedure
Create a data connection to the location where the files are stored.
Create an HCP data connection for the location where the output HCP objects will be stored.
Important: This example does not format custom metadata as XML. To complete these steps, you need to configure the destination HCP namespace to not require custom metadata in XML format.
Create a pipeline.
Add a Drop Documents stage at the beginning of the pipeline.
Surround the Drop Documents stage with a conditional statement that ensures only .srt documents are dropped from the workflow.
If the document needs to be written to a location other than where it is currently stored, add an Execute Action stage with the following configuration:
- Data connection: The connection where you want to store the transformed HCP objects.
- Selected Action: Write File
- Write File Action Config:
- Stream: HCI_content
- Filename field: HCI_filename
- Path field: HCI_relativePath
- Write all annotations: No
Add an Attach Stream stage to the pipeline and configure it with the following settings:
- Stream Name: HCI_content
- Stream URI: ${HCI_URI}.srt
- Data Source Selection: Use Document Data Connection
- Additional Custom Stream Metadata:
- Key: HCP_customMetadata
- Value: subtitles
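The Attach Stream settings above pair each video with its sidecar subtitle file by appending .srt to the document's URI. Conceptually (an illustrative sketch, not the stage's actual implementation):

```python
def attach_subtitle_stream(document: dict) -> dict:
    """Point the HCI_content stream at the .srt object that sits
    next to the video, tagging it as the 'subtitles' annotation."""
    stream_uri = document["HCI_URI"] + ".srt"      # Stream URI: ${HCI_URI}.srt
    document["streams"] = {
        "HCI_content": {
            "uri": stream_uri,
            "HCP_customMetadata": "subtitles",     # additional stream metadata
        }
    }
    return document

doc = attach_subtitle_stream({"HCI_URI": "hcp://ns1.example.com/video1.mp4"})
print(doc["streams"]["HCI_content"]["uri"])
```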
Add an Execute Action stage to the end of the pipeline and configure it with the following settings:
- Data connection: The connection where you want to store the transformed HCP objects.
- Selected Action: Write Annotation
- Write Annotation Action Config:
- Filename field: HCI_filename
- Path field: HCI_relativePath
- Write all annotations: Yes
Create a workflow.
Add the data connection created in step 1 as the workflow input.
Add the processing pipeline to the workflow.
Run the workflow task.
Adding document caching to your workflows
Problem
You have a workflow that you need to run many times (for example, because you are actively developing it and need to test your pipelines). However, the inputs for this workflow are expensive to read from:
- Some data lives in an Amazon S3 bucket, which costs money to access.
- Some data lives in a data source that’s located far away and to which connections are slow.
Solution
Read the data from your expensive data sources and copy it to other data sources that are cheap to read from. Using those cheap data sources as your workflow inputs, you can run your workflow as much as you want without ever again reading from the expensive data sources.
Procedure
Create a data connection for each of your data sources.
Create an empty pipeline.
Create a workflow with these components:
- Input: Data connections for your expensive-to-read-from data sources.
- Processing Pipelines: The empty pipeline.
- Output: Data connection for your cheap-to-read-from data source.
Create a second workflow with these components:
- Input: Data connection for your cheap-to-read-from data source.
- Processing Pipelines: The pipelines that you're testing.
- Output: The index collections that you're testing.
Run the first workflow once to completion.
Run the second workflow as often as you need. Data for this workflow is read only from your cheap-to-read data source.
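In miniature, the two workflows implement a one-time copy followed by repeated cheap reads. The sketch below uses plain dictionaries to stand in for the data connections:

```python
expensive_source = {"a.txt": b"alpha", "b.txt": b"beta"}  # e.g. a remote S3 bucket
cheap_cache = {}                                          # e.g. a local store

# Workflow 1: run once, copying everything out of the expensive source.
for name, data in expensive_source.items():
    cheap_cache[name] = data

# Workflow 2: run as often as needed, reading only from the cache.
def process(store: dict) -> dict:
    """Stand-in for the pipelines under test: report each file's size."""
    return {name: len(data) for name, data in store.items()}

print(process(cheap_cache))
```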
Copying data from Amazon S3 to HCP
This topic describes how to use your system to copy all content from an Amazon S3 bucket to an HCP namespace.
Procedure
Create two data connections:
- An Amazon S3 data connection for the Amazon S3 bucket you want to copy data from.
- An S3 Compatible data connection for the HCP namespace that you want to copy data to.
Create a pipeline, but do not add any stages to it.
Create a workflow.
Edit the workflow to use the Amazon S3 data connection as an input.
Edit the workflow to use the empty pipeline.
Edit the workflow to use the S3 Compatible data connection as an output. Configure the output to perform the Output File action.
Run the workflow task.
Parsing and indexing CSV and log files
You can use the Field Parser and Read Lines stages together in your pipeline to extract fields from files such as CSVs and logs (that is, files where each line represents a separate event occurrence or piece of data).
Log file example
Consider these log file entries:
172.18.10.106,admin,[20/Jul/2016:11:30:40],"POST /auth/oauth/ HTTP/1.1",200
172.18.10.106,admin,[20/Jul/2016:11:30:40],"GET /api/admin/setup HTTP/1.1",200
172.18.10.106,admin,[20/Jul/2016:11:30:40],"GET /api/admin/alerts HTTP/1.1",200
172.18.10.106,admin,[20/Jul/2016:11:30:53],"POST /api/admin/objects HTTP/1.1",201
172.18.10.106,jsmith,[20/Jul/2016:11:30:53],"GET /api/admin/objects/d7dc655f-e42b-4e2e-a48d-72d49beda939 HTTP/1.1",403
Each line in the file contains this information, separated by commas: IP address, username, date and time, API request, and HTTP response code.
You could extract each piece of information as a separate document field and then index those fields. This would allow your users to perform complex queries such as:
- "Show me all actions that the admin user performed between June and July 2016":
  +eventUser:admin +eventDateTime:[1464739200 TO 1470009599]
- "Show me events where a user was denied permission to do something":
  +eventReturnCode:403
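The epoch values in the date-range query correspond to 00:00:00 on June 1 and 23:59:59 on July 31, 2016 (UTC). You can verify them with Python:

```python
from datetime import datetime, timezone

start = datetime(2016, 6, 1, 0, 0, 0, tzinfo=timezone.utc)
end = datetime(2016, 7, 31, 23, 59, 59, tzinfo=timezone.utc)

print(int(start.timestamp()))  # 1464739200
print(int(end.timestamp()))    # 1470009599
```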
Example procedure
This procedure demonstrates how you can use the Read Lines and Field Parser stages to extract fields from log files like the example above.
Conditionally separate log files into individual entries
Workflow tasks perform better when processing large numbers of small files rather than small numbers of large files. If your logs exist as large files, you should split them into smaller documents, one for each log entry, before having the system parse and index the entries.
Procedure
Create a workflow.
Create a data connection for the data source that contains your log files.
Add the data connection as a workflow input.
Create another data connection for the data source where you want to store the individual log entry documents.
Add the data connection as a workflow output and configure it to perform the Write File action.
Create a pipeline.
Add the pipeline to the workflow.
Tip: When your pipeline contains the Read Lines stage, try setting the pipeline execution mode to Preprocessing. In some situations, this may yield faster pipeline performance.
Add a Read Lines stage to the pipeline. Do not change the stage's configuration.
By default, the Read Lines stage examines the HCI_content stream for each document and creates a new document for each individual line it finds. Each of these new documents contains a field called $line, by default.
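Conceptually, the Read Lines stage turns one parent document into one child document per line, as in this sketch (the function is illustrative, not the stage's implementation):

```python
def read_lines(content: str) -> list[dict]:
    """Emit one child document, with a $line field, per non-empty line
    of the content stream."""
    return [{"$line": line} for line in content.splitlines() if line]

log = (
    '172.18.10.106,admin,[20/Jul/2016:11:30:40],"POST /auth/oauth/ HTTP/1.1",200\n'
    '172.18.10.106,admin,[20/Jul/2016:11:30:40],"GET /api/admin/setup HTTP/1.1",200\n'
)
docs = read_lines(log)
print(len(docs))  # 2
```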
Surround the Read Lines stage with a conditional statement.
Configure the conditional statement to check for the existence of the HCI_content stream:
Field name: HCI_content
Operator: stream_exists
Run the workflow task.
The task reads the log files from the data source, and the Read Lines stage splits them into individual documents. Each individual log entry document is written to the data source you specified in step 5.
Create a workflow to parse and index individual log entries
Create another workflow.
Add the data connection that contains your log entry files as an input to the workflow.
Create a pipeline.
Add the pipeline to the workflow.
Add a Field Parser stage to the pipeline.
Configure the Field Parser stage with these settings:
- Input Field Name: $line
- Tokenize Input Field Value: Yes
- Tokenization Delimiter: , (comma)
- Substrings to Parse: None
- Token to Field Mapping:
  - Token 0: eventIpAddress
  - Token 1: eventUser
  - Token 2: eventDateTime
  - Token 3: eventApiRequest
  - Token 4: eventReturnCode
- Add Debug Fields: No
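The token-to-field mapping behaves like a positional split on the delimiter. A sketch of the same transformation (the field names match the mapping above; the function itself is illustrative):

```python
FIELD_MAP = ["eventIpAddress", "eventUser", "eventDateTime",
             "eventApiRequest", "eventReturnCode"]

def parse_entry(line: str) -> dict:
    """Split a log entry on commas and map each token index to a field name."""
    tokens = line.split(",")
    return dict(zip(FIELD_MAP, tokens))

entry = parse_entry(
    '172.18.10.106,admin,[20/Jul/2016:11:30:40],"POST /auth/oauth/ HTTP/1.1",200'
)
print(entry["eventUser"])        # admin
print(entry["eventReturnCode"])  # 200
```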
Add a Date Conversion stage immediately after the Field Parser stage.
Edit the Date Conversion stage by adding the eventDateTime field to the list of Fields to Process.
This ensures that the eventDateTime field is indexed correctly.
Add the data connection and pipeline to the workflow.
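The eventDateTime tokens look like [20/Jul/2016:11:30:40]. The conversion the Date Conversion stage performs can be approximated with strptime (the format string and the UTC assumption are mine, inferred from the sample entries):

```python
from datetime import datetime, timezone

def to_epoch(event_date_time: str) -> int:
    """Parse a token like [20/Jul/2016:11:30:40] into epoch seconds,
    assuming the timestamp is UTC."""
    dt = datetime.strptime(event_date_time, "[%d/%b/%Y:%H:%M:%S]")
    return int(dt.replace(tzinfo=timezone.utc).timestamp())

print(to_epoch("[20/Jul/2016:11:30:40]"))
```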
Run the workflow task to have the task discover all fields produced by your pipeline.
Create an index collection. Select Basic for the initial schema.
Add the fields that the workflow task discovered to the index collection schema.
Add the index collection to the workflow.
Run the workflow task again to index the log files.