Processing pipelines and stages
Processing pipelines perform operations on the documents that the system extracts from your data sources.
A document is a representation of a piece of data. Every piece of data read by a workflow task, whether it's a picture, PDF, audio file, or any other type of data, is converted into a document.
Documents are made up of:
- Field/value metadata pairs (or fields for short): For example, a medical image can become a document that contains field/value pairs such as doctor:"John Smith" and location:"City Hospital". These fields are the metadata for your files and can be used to create a search index.
- Streams: Pointers to data that lives in another location, not within the document itself. The data referenced by a stream can live either locally on one or more Hitachi Content Intelligence instances, or remotely in a data source.
Streams that point to locally stored data have this format:
<stream-name>: "X-HCI_local-path=<path-to-stream-tmp-file>"
Streams that point to remotely stored data have this format:
<stream-name>: "<pointer-to-stream-location>"
Streams typically point to large pieces of data that can be prohibitively expensive to include as document fields, such as the full content of a PDF file. Rather than spending system resources passing this large amount of data through a pipeline, the system uses streams to access data and read it from where it lives.
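The document structure described above can be pictured with a small Python sketch. This is only an illustration of the concept, not the product's SDK or internal representation: the Document class, the stream names content and thumbnail, the temporary file path, and the remote URL are all hypothetical. Only the X-HCI_local-path= prefix and the example field values come from the text above.

```python
from dataclasses import dataclass, field

# Prefix used by streams that point to locally stored data (from the format above).
LOCAL_PREFIX = "X-HCI_local-path="


@dataclass
class Document:
    """Hypothetical model of a document: metadata fields plus stream pointers."""

    fields: dict = field(default_factory=dict)   # field name -> value
    streams: dict = field(default_factory=dict)  # stream name -> pointer string

    def is_local_stream(self, name: str) -> bool:
        """Return True if the named stream points to locally stored data."""
        return self.streams[name].startswith(LOCAL_PREFIX)


# A medical image represented as a document. The large image data is not
# stored in the document; the streams only point to where it lives.
doc = Document(
    fields={"doctor": "John Smith", "location": "City Hospital"},
    streams={
        # Hypothetical local temp file path.
        "content": LOCAL_PREFIX + "/tmp/stream-0001.tmp",
        # Hypothetical remote pointer; the real pointer format depends on the data source.
        "thumbnail": "https://example.com/scans/0001.png",
    },
)

print(doc.fields["doctor"])
print(doc.is_local_stream("content"))
print(doc.is_local_stream("thumbnail"))
```

Keeping the pointer as a short string means the document stays small no matter how large the referenced data is.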
A processing pipeline is made up of one or more stages, each of which performs a specific type of operation. For example, your system includes stages that can expand .zip files (see ZIP Expansion stage) or add new fields to documents (see Tagging stage).
If you have a stage that you want to affect only certain documents, you can surround that stage with a conditional statement. Documents that do not meet the condition you specify bypass the stage.
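As a rough sketch of how stages and conditional statements interact, the following Python example models a stage as a function that takes a document and returns a document. Everything here is an assumption made for illustration: the function names, the type and tagged fields, and the dictionary-based document are hypothetical and do not reflect the product's actual stage configuration or SDK.

```python
from typing import Callable, Dict, List

Doc = Dict[str, str]
Stage = Callable[[Doc], Doc]


def tagging_stage(doc: Doc) -> Doc:
    """Hypothetical stage that adds a new field to the document."""
    return {**doc, "tagged": "true"}


def conditional(predicate: Callable[[Doc], bool], stage: Stage) -> Stage:
    """Wrap a stage so that documents failing the condition bypass it."""

    def wrapped(doc: Doc) -> Doc:
        return stage(doc) if predicate(doc) else doc

    return wrapped


def run_pipeline(stages: List[Stage], doc: Doc) -> Doc:
    """Send one document through each stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc


# Only documents whose (hypothetical) "type" field is "pdf" are tagged.
pipeline = [conditional(lambda d: d.get("type") == "pdf", tagging_stage)]

pdf_result = run_pipeline(pipeline, {"type": "pdf"})
image_result = run_pipeline(pipeline, {"type": "image"})
print(pdf_result)
print(image_result)
```

The PDF document gains the new field, while the image document passes through unchanged because it does not meet the condition.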
You can link multiple pipelines together by adding them to a workflow, thereby forming a workflow pipeline.
When a workflow task runs, documents are discovered and extracted from the inputs for the workflow. Each of these documents is sent through the workflow pipeline.
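The flow of a workflow task can be sketched in Python as follows. This is a simplified model under stated assumptions, not the system's implementation: run_workflow, the two example stages, and the in-memory lists standing in for inputs are hypothetical, and each input is assumed to yield already-extracted documents as dictionaries.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Doc = Dict[str, str]
Stage = Callable[[Doc], Doc]


def add_owner(doc: Doc) -> Doc:
    """Hypothetical stage from a first pipeline."""
    return {**doc, "owner": "records"}


def add_status(doc: Doc) -> Doc:
    """Hypothetical stage from a second pipeline."""
    return {**doc, "status": "processed"}


def run_workflow(
    inputs: Iterable[Iterable[Doc]], pipelines: List[List[Stage]]
) -> Iterator[Doc]:
    """Send every document from every input through the linked pipelines."""
    # Linking the pipelines in order forms a single workflow pipeline.
    workflow_pipeline = [stage for pipeline in pipelines for stage in pipeline]
    for source in inputs:
        for doc in source:
            for stage in workflow_pipeline:
                doc = stage(doc)
            yield doc


# Two hypothetical inputs that together contain three documents.
inputs = [
    [{"name": "a.pdf"}],
    [{"name": "b.png"}, {"name": "c.mp3"}],
]

results = list(run_workflow(inputs, [[add_owner], [add_status]]))
for result in results:
    print(result)
```

Each document is processed independently, so the order of pipelines matters but the order of inputs does not affect what happens to any single document.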
Your system comes with a number of built-in stages. If these don't cover all your needs, you can use the software development kit (SDK) to write your own custom stage plugins.