Schedule tasks to run when user demand on the system is low
If you have a workflow task that will take a very long time to finish or that runs indefinitely, you should schedule that task to run only when there are relatively few users searching for data. That way, searches are quick to return results because they are not competing with your workflow tasks for system resources.
For more information, see Scheduling tasks.
Favor processing large numbers of small documents
Typically, a workflow task that processes a large number of small documents runs faster than one that processes a small number of large documents. This is because a large number of small documents offers the system more discrete pieces of work that can be performed in parallel.
If you need to process or index large files that contain discrete pieces of information, such as archive files or log files, you should consider splitting those files up before having the system index them.
For example, with an archive file, you can:
- Expand the archive file in the data source and create a data connection to the location where the expanded files are located.
- Create a workflow and pipeline to expand the archive file for you:
  - Create a workflow. See Creating a workflow.
  - Create a data connection for the data source where your archive files are located. See Creating data connections.
  - Add the data connection as a workflow input. See Adding data connections to workflows.
  - Create a data connection to write the expanded files to.
  - Add the data connection as a workflow output and configure it to perform the Write File action. See Adding existing outputs to a workflow.
  - Create a pipeline. See Creating and modifying processing pipelines.
  - Add the applicable archive expansion stage to the pipeline.
  - Disable the Include source setting for the stage to have the archive expansion stage output only the expanded documents.
  - Add the pipeline to the workflow. See Adding pipelines to a workflow.
  - Run the workflow task. See Running workflow tasks.
  - Create a second workflow.
  - Add the data connection that now contains the expanded files as an input to the second workflow.
  - Configure this second workflow to process or index the expanded files as you want.
For information on processing log files, see Parsing and indexing CSV and log files.
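The first option above, expanding the archive in the data source yourself, can be sketched in a few lines of Python. This is a minimal illustration that assumes the archive is a ZIP file and that you have write access to the target location. The function name `expand_archive` and the paths in the usage note are hypothetical and are not part of the product.

```python
import zipfile
from pathlib import Path


def expand_archive(archive_path: str, target_dir: str) -> list[str]:
    """Extract every regular file in a ZIP archive into target_dir.

    Returns the sorted archive member names that were extracted.
    Assumption: the archive is a ZIP file; other formats need other tools.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        # Skip directory entries so the result lists only real files.
        members = [
            info.filename for info in archive.infolist() if not info.is_dir()
        ]
        archive.extractall(target, members=members)
    return sorted(members)
```

For example, `expand_archive("/data/archives/logs.zip", "/data/expanded")` would leave one small file per archive member in `/data/expanded`, which a data connection can then point to.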
Optimize your pipeline before changing task performance settings
The default workflow task performance settings have been configured to give optimal performance in a system that runs one workflow task at a time.
If a workflow task experiences poor performance, before changing task performance settings, you should first evaluate your pipeline to make sure you are not processing documents unnecessarily.
Experiment with task performance setting changes
If after optimizing your pipeline's behavior you decide that you need to change task performance settings, use these strategies to determine what settings yield the best task performance:
- Select a representative subset of your data and process it using multiple workflows, each with different performance settings. Compare the task performance details for each workflow to see which performance settings yield the best improvement for processing your data.
- Enable the Collect Historical Metrics setting to view graphs that show past workflow performance. Run your workflow and periodically change task performance settings, taking note when you do. Then, use the historical metrics graphs to observe the effects of your changes over time.
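When comparing trial workflows, it can help to reduce each run to a single throughput figure. The sketch below assumes you have manually noted the number of documents processed and the elapsed time for each trial from the task performance details. The `TrialRun` class and its field names are hypothetical, and the numbers in the usage note are placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class TrialRun:
    label: str        # hypothetical name for the settings used in this trial
    documents: int    # documents processed during the trial
    seconds: float    # elapsed wall-clock time of the trial

    @property
    def docs_per_second(self) -> float:
        return self.documents / self.seconds


def best_trial(trials: list[TrialRun]) -> TrialRun:
    """Return the trial with the highest documents-per-second rate."""
    return max(trials, key=lambda trial: trial.docs_per_second)
```

For instance, given `TrialRun("default", 1000, 50.0)` and `TrialRun("tuned", 1000, 40.0)`, `best_trial` returns the "tuned" run because it processed the same subset in less time. Use the same representative subset for every trial so that the rates are comparable.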
Run the Workflow-Agent jobs on as many instances as possible
When a Workflow-Agent-type job runs on an instance, that instance's computational resources can be used to perform the work of a workflow task.
To maximize the amount of system resources used to run workflow tasks, allow Workflow-Agent jobs to run on all instances in the system, except for:
- Instances that are running other computationally expensive services, such as the Index service.
- Master instances.
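The selection rule above can be expressed as a simple filter. This sketch assumes you keep your own inventory of instances. The `Instance` class, its fields, and the `HEAVY_SERVICES` set are hypothetical and do not correspond to a product API.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    is_master: bool = False
    services: set[str] = field(default_factory=set)


# Assumption: services you consider computationally expensive.
HEAVY_SERVICES = {"Index"}


def workflow_agent_candidates(instances, heavy_services=HEAVY_SERVICES):
    """Return names of instances suited to running Workflow-Agent jobs.

    Excludes master instances and instances running any heavy service.
    """
    return [
        inst.name
        for inst in instances
        if not inst.is_master and not (inst.services & heavy_services)
    ]
```

Every instance that the filter returns is one where Workflow-Agent jobs can be allowed to run.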
Optionally, avoid processing old versions of HCP objects
In an HCP namespace, objects can have more than one version. The data connection type you are using affects whether a workflow task processes these old versions:
- With the regular HCP data connection, only the latest version of each file is read, processed, and indexed.
- With the HCP MQE data connection, all versions of each file are read and processed. However, only the latest versions are indexed, which means the old versions are processed unnecessarily.
If you don't want to process or index old versions of HCP objects, use the regular HCP data connection, instead of the HCP MQE data connection, for HCP namespaces that have versioning enabled. Note, however, that to track changes with the regular HCP data connection, every document must be included in the database. This increases database usage and reduces performance when re-crawling.
For more information on the HCP data connections, see Data connection types and settings.
Avoid or minimize metrics collection in production
Collecting workflow task metrics can consume a significant amount of memory and decrease task performance. As a result, you should minimize or disable metrics collection for the workflow tasks that you run in production. You can do this by:
- Disabling the Collect Historical Metrics task setting.
- For aggregation metrics collection, either:
  - Disabling the Collect Aggregation Metrics task setting.
  - Enabling the Collect Aggregation Metrics task setting, but limiting the total number of aggregations that the workflow has.
Avoid unnecessary file rereads
If a workflow task is interrupted while reading files, when the task resumes, it starts again at the beginning of the set of files that it was reading when it stopped. This means that the task might reread files. These rereads are counted towards the task's performance, failure, and discovery metrics.
To avoid these unnecessary file rereads:
- Avoid pausing and resuming tasks multiple times, especially if:
  - Your workflow uses list-based data connections to read files from data sources with large directories.
  - Your workflow uses any of the JDBC data connections:
    - MySQL and MariaDB JDBC data connection
    - PostgreSQL JDBC data connection
    - Solr JDBC data connection
    - Internal Index JDBC data connection
  - Your workflow uses change-based data connections that process large batches of files.
- When configuring a task schedule, schedule the task to run in long blocks of time. If your data source directories contain large numbers of files, longer blocks of time increase the likelihood that the task will be able to read the entire contents of a folder before stopping.
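To see why longer schedule blocks reduce rereads, consider a simplified model. It assumes that a task reads directories in order, that each scheduled block has time to read a fixed number of files, and that a directory interrupted at the end of a block is read again from its first file in the next block. The function `total_reads` and its parameters are hypothetical; real task behavior depends on the data connection type.

```python
def total_reads(directory_sizes, files_per_block):
    """Count file reads under the assumed restart-the-directory model.

    directory_sizes: number of files in each directory, in read order.
    files_per_block: files the task can read in one scheduled block.
    Returns None if a directory is too large to finish within one block,
    because under this model the task would never get past it.
    """
    if files_per_block <= 0:
        raise ValueError("files_per_block must be positive")
    if any(size > files_per_block for size in directory_sizes):
        return None

    reads = 0
    remaining = files_per_block  # capacity left in the current block
    for size in directory_sizes:
        if size > remaining:
            # The block ends partway through this directory, so the
            # files read so far are wasted and a new block begins.
            reads += remaining
            remaining = files_per_block
        reads += size
        remaining -= size
    return reads
```

With two directories of 600 files each, a block that can read 1000 files results in 1600 reads, because 400 files of the second directory are read twice. A block that can read 1200 files results in 1200 reads and no rereads.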