Test your pipelines to learn how each stage affects your documents
The easiest way to understand what a pipeline does is to send a test document through it. This lets you examine the fields and streams that each stage adds or removes from your documents. You can test individual pipelines or an entire workflow pipeline.
For information, see Testing pipelines.
Order your stages and pipelines correctly
You can add any number of stages to a pipeline and put them in any order. You can also add any number of pipelines to a workflow and put those pipelines in any order. When a task runs for the workflow, documents pass through the entire workflow pipeline stage-by-stage.
To order your pipelines and stages effectively, understand what each one consumes and produces. Each stage or pipeline must be able to handle the documents output by the one before it, and must output documents that the stages or pipelines after it can handle.
For example, say you want to drop all PDFs from your pipeline. You can do this by using a Drop Documents stage surrounded by a conditional statement that filters only PDFs into the stage.
However, this conditional statement needs a document that has a Content_Type field. This field contains a document's MIME type and is added only after a document passes through the MIME Type Detection stage. So for this conditional statement to work correctly, it must be placed somewhere after a MIME Type Detection stage, either in the same pipeline or in another pipeline in the workflow pipeline.
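The ordering requirement above can be sketched in Python. This is a minimal illustration, not the product's API: documents are modeled as plain dicts, and the stage and field logic (except the Content_Type and HCI_filename field names) is invented for the example.

```python
# Illustrative sketch only -- not the product's API.
# A document is modeled as a dict of fields.

def mime_type_detection(doc):
    """Adds a Content_Type field (simplified: based on file extension)."""
    if doc["HCI_filename"].lower().endswith(".pdf"):
        doc["Content_Type"] = "application/pdf"
    else:
        doc["Content_Type"] = "application/octet-stream"
    return doc

def drop_pdfs(docs):
    """Drop Documents stage with a conditional: keep only documents whose
    Content_Type is not application/pdf. The condition can only match if
    mime_type_detection has already run and added the field."""
    return [d for d in docs if d.get("Content_Type") != "application/pdf"]

docs = [{"HCI_filename": "report.pdf"}, {"HCI_filename": "notes.txt"}]
docs = [mime_type_detection(d) for d in docs]  # must come first
survivors = drop_pdfs(docs)
```

If `drop_pdfs` ran before `mime_type_detection`, no document would yet have a Content_Type field, the condition would never match, and no PDFs would be dropped.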
The system includes a built-in pipeline called Default that you can use as a basis for building and ordering your own pipelines. For more information, see Default pipeline.
Identify pipeline performance issues
The system provides workflow task performance information that you can use to determine whether your pipelines are designed to run efficiently. Use these tools to ensure that your tasks run as quickly as possible. For information, see Evaluating pipeline performance.
When you've identified performance issues, try using these best practices to address them:
- Use conditional statements to limit unnecessary document processing.
- Process only the documents that your users want.
- Use Workflow Recursion to simplify your pipelines.
Use conditional statements to limit unnecessary document processing
If your task has no conditional statements, determine whether you are needlessly sending documents through certain stages. For example, you should send only video files through an SRT Text Extraction stage.
Without conditional statements in your pipeline, every document goes through every stage, which might not be necessary. For information, see Conditional statements.
You can also use the Reject Documents stage to test your conditional statements and ensure they match the documents you expect. For information, see Reject Documents stage.
Process only the documents that your users want
If your data sources contain data that your users do not need to search for, don't waste time and disk space by processing and indexing those documents.
To process and index only the data you want, you can:
- Create data connections with narrow scopes. For example, instead of creating an HCP data connection to an entire namespace, you can create data connections for only the relevant directories within that namespace. See Data connection types and settings.
- Use a Drop Documents stage at the very beginning of your workflow pipeline and surround it with a conditional statement that matches all the documents you want to drop from your pipeline. See Drop Documents stage and Conditional statements.
Use the Reject Documents stage to debug your pipelines
You can debug your pipelines by using the Reject Documents stage. This stage produces document failures if your pipeline is not functioning as you expect. For information, see Reject Documents stage.
Use Workflow Recursion to simplify your pipelines
With the Workflow Agent Recursion or Preprocessing Recursion setting enabled for a workflow task, any new documents created by a pipeline stage are sent to the beginning of the applicable set of pipelines, rather than on to the next stage. A document output by a stage is considered new if its HCI_id field value does not match the HCI_id field value of the document that entered the stage.
You should leave these settings enabled; disabling them will likely require you to make your pipelines more complex.
For example, say your workflow includes a pipeline with its execution mode set to Workflow-Agent. The pipeline contains a Text and Metadata Extraction stage followed by a ZIP Expansion stage. When a .zip file passes through the pipeline:
- If Workflow Agent Recursion is enabled, all new documents extracted from the .zip file are sent back to the beginning of the pipeline and are processed by the Text and Metadata Extraction stage.
- If Workflow Agent Recursion is disabled, all new documents extracted from the .zip file continue on to the next stage in the pipeline. For these documents to be processed by the Text and Metadata Extraction stage, you'd have to add a second Text and Metadata Extraction stage to your pipeline. You'd also need to use conditional statements to ensure that only the documents extracted from the .zip file are processed by that second stage.
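The recursion behavior described above can be sketched in Python. This is an illustrative model, not the product's implementation: the HCI_id field and the "new document restarts the pipeline" rule come from this document, while the stage logic and document structure are invented.

```python
# Illustrative sketch only -- not the product's implementation.

def text_and_metadata_extraction(doc):
    doc["has_text"] = True          # stand-in for real extraction
    return [doc]

def zip_expansion(doc):
    # A .zip file expands into new child documents with new HCI_id values.
    if doc["HCI_id"].endswith(".zip"):
        return [doc] + [{"HCI_id": doc["HCI_id"] + "/" + name}
                        for name in ("a.txt", "b.txt")]
    return [doc]

PIPELINE = [text_and_metadata_extraction, zip_expansion]

def run(doc):
    """Run one document through the pipeline. Any output whose HCI_id
    differs from its input's HCI_id is a new document and is sent back
    to the beginning, as with Workflow Agent Recursion enabled."""
    finished = []
    queue = [(doc, 0)]              # (document, index of next stage)
    while queue:
        d, i = queue.pop(0)
        if i == len(PIPELINE):
            finished.append(d)
            continue
        for out in PIPELINE[i](d):
            if out["HCI_id"] != d["HCI_id"]:
                queue.append((out, 0))      # new document: restart pipeline
            else:
                queue.append((out, i + 1))  # same document: next stage
    return finished

results = run({"HCI_id": "archive.zip"})
```

Because the two extracted documents restart the pipeline, all three documents (the archive and both children) pass through the extraction stage without a second copy of that stage or any conditional statements.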
You should leave the recursion settings enabled if your pipelines include any of these stages:
- ZIP Expansion stage
- TAR Expansion stage
- Mbox Expansion stage
- PST Expansion stage
- Email Expansion stage
- Read Lines stage
Use temporary fields
Starting a field name with a dollar sign ($) causes the field to be deleted when the workflow pipeline finishes processing the associated document. That is, the field is not indexed. Use this technique to prevent unnecessary fields from being indexed. For example, you can add a field called $meetsCondition to a document to satisfy a conditional statement later on in the pipeline, but the field might not include any valuable information for your users to search on.
You can use this technique with stages that add new fields to documents, such as the Tagging and Mapping stages.
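The temporary-field convention can be sketched in Python. This is an illustrative model, not the product's code: the $meetsCondition and HCI_filename field names mirror this document, but the stage and indexing logic is invented.

```python
# Illustrative sketch only -- not the product's code.

def tagging_stage(doc):
    # Add a temporary flag, used only by a later conditional statement.
    doc["$meetsCondition"] = True
    return doc

def prepare_for_index(doc):
    """Strip $-prefixed fields before the document is indexed."""
    return {k: v for k, v in doc.items() if not k.startswith("$")}

doc = tagging_stage({"HCI_filename": "notes.txt"})
indexed = prepare_for_index(doc)
```

The $meetsCondition field is available to every stage while the document moves through the pipeline, but never reaches the index, so it can't clutter your users' search results.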