Index collection types and settings
This table lists the types of index collections that Hitachi Content Intelligence supports.
Index Type | Settings | Supported Actions |
HCI Index (Internal): The resulting search index is a Solr index that is wholly owned by Hitachi Content Intelligence; Hitachi Content Intelligence builds it, updates it, and stores it on one or more instances in the system. | Initial Visibility setting: For more information, see Controlling user access to search indexes.
Initial Schema setting: For information, see Initial schema options.
Initial Shard Count setting: This determines the number of unique pieces to separate the search index into. The more shards an index has, the larger the index can grow and still be able to service requests quickly. For HCI Indexes, index shards are distributed among the instances that run the Index service. The default number of shards is 3. For more information, see Index shards.
Important: After you create an index collection, you cannot reconfigure the number of shards it has. For guidance on index sharding, see the applicable Apache Solr or Elasticsearch documentation. |
Index: When configured in an Execute Action stage in a pipeline, all documents that pass through the stage are immediately indexed. For information on using this action, see Adding actions to a pipeline. |
HDDS (External): An existing index created and maintained by Hitachi Data Discovery Suite (HDDS). Your system can use this index to search for files, but cannot make any changes to it.
Important: This index collection type has been deprecated as of version 2.2.2 of HCI and will be removed in an upcoming release.
| Initial Visibility setting
| None |
Apache Solr (External): A new or existing index that your system can query and add data to. However, the search index is stored in your Apache Solr infrastructure, not within the system instances. You can also use your Solr tools to manage the index. When creating an Apache Solr index collection, you can configure your system to use an existing Solr index or create a new one. For more information on how Hitachi Content Intelligence interacts with an external Apache Solr index, see Considerations for external Apache Solr index collections. | Initial Visibility setting: For more information, see Controlling user access to search indexes.
Initial Schema setting: For information, see Initial schema options.
Initial Shard Count setting: This determines the number of unique pieces to separate the search index into. The more shards an index has, the larger the index can grow and still be able to service requests quickly. For HCI Indexes, index shards are distributed among the instances that run the Index service. The default number of shards is 3. For more information, see Index shards.
Important: After you create an index collection, you cannot reconfigure the number of shards it has. For guidance on index sharding, see the applicable Apache Solr or Elasticsearch documentation.
For information on unregistering index collections, see Unregistering an index collection. |
Index: When configured in an Execute Action stage in a pipeline, all documents that pass through the stage are immediately indexed. For information on using this action, see Adding actions to a pipeline. |
Elasticsearch™: A new or existing index that your system can query and add data to. However, the search index is stored in your Elasticsearch infrastructure, not within the system instances. You can also use your Elasticsearch tools to manage the index. When creating an Elasticsearch index collection, you can configure your system to use an existing index or create a new one.
Important: This index collection type has been deprecated as of version 2.2.2 of HCI and will be removed in an upcoming release.
| Initial Visibility setting
Initial Schema setting: For information, see Initial schema options.
Initial Shard Count setting: This determines the number of unique pieces to separate the search index into. The more shards an index has, the larger the index can grow and still be able to service requests quickly. For HCI Indexes, index shards are distributed among the instances that run the Index service. The default number of shards is 3. For more information, see Index shards.
Important: After you create an index collection, you cannot reconfigure the number of shards it has. For guidance on index sharding, see the applicable Apache Solr or Elasticsearch documentation.
For information on unregistering index collections, see Unregistering an index collection.
|
Index: When configured in an Execute Action stage in a pipeline, all documents that pass through the stage are immediately indexed. For information on using this action, see Adding actions to a pipeline. |
Initial schema options
When creating an index collection, you need to select an initial schema format. This determines the fields that initially exist in the index collection schema.
You should select a schema format based on your needs:
- If you need to learn about the contents of your data, use Schemaless.
- If you are somewhat familiar with your data but need to know more, use Default.
- If you need to ensure peak performance for a production index, use Basic or Empty.
When created with the Schemaless option, an index collection schema contains a number of dynamic fields. When your system tries to index a document using this schema, it automatically selects a type for each field in the document. This allows data to be indexed automatically; you don't need to manually add fields to the schema or configure their types.
However, automatic field addition and type selection can be problematic because:
- All document fields are added to the index, many of which might not be useful.
- Field types might be selected incorrectly, which can cause documents to be indexed improperly or not at all.
For example, say that a document contains a field named tiffBitsPerSample with a value of 8. When this document is sent to a schemaless index collection, the search engine will automatically create a new field in the schema with the type of int. Any subsequent document indexed with a metadata field of tiffBitsPerSample will have the information indexed into this field as type int.
What happens if another document contains the tiffBitsPerSample field but with a value of eight, not 8? The document will fail to be indexed because the int field type cannot contain letters.
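The failure described above can be reproduced with a small simulation. This is only a sketch: `guess_type` and `index_document` are hypothetical helpers, and the type-guessing rule is an assumption made for illustration, not the logic the search engine actually uses.

```python
def guess_type(value):
    """Guess a field type from a value (hypothetical rule: int if it parses, else string)."""
    try:
        int(value)
        return "int"
    except ValueError:
        return "string"


def index_document(schema, doc):
    """Add unknown fields to the schema, then check each value against the field type."""
    for field, value in doc.items():
        # The type is fixed by the first document that introduces the field.
        field_type = schema.setdefault(field, guess_type(value))
        if field_type == "int":
            int(value)  # raises ValueError if the value is not a whole number
    return True


schema = {}
index_document(schema, {"tiffBitsPerSample": "8"})  # field is created as int

try:
    index_document(schema, {"tiffBitsPerSample": "eight"})
    second_indexed = True
except ValueError:
    second_indexed = False  # "eight" does not fit the int field
```

In this sketch, the first document decides the field type, so the order in which documents arrive determines which later documents fail.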
The Default option tries to mitigate the issues that the Schemaless option can cause when automatically adding fields. The Default option also uses dynamic fields to automatically configure any fields that don't exist, but many common fields have already been added to the schema and their types predefined to avoid type selection problems.
With the Default option, the tiffBitsPerSample field is already added to the schema with a type of string so that it can store both the values eight and 8.
However, because this option still contains dynamic fields, it is possible for the index to automatically add fields you don't want or to select the wrong data types for those fields.
Select the Default option instead of Schemaless if your workflow uses the default pipeline.
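As a rough illustration of how predefined fields and dynamic fields combine under the Default option, consider the following sketch. `PREDEFINED_FIELDS`, `resolve_type`, and `accepts` are hypothetical names, the one-entry schema excerpt is an assumption, and the digit check stands in for whatever type detection the search engine really performs.

```python
# Hypothetical excerpt of a predefined schema; a real Default schema defines many more fields.
PREDEFINED_FIELDS = {"tiffBitsPerSample": "string"}


def looks_like_int(value):
    """Return True if the value is an optionally negative run of digits."""
    return str(value).lstrip("-").isdigit()


def resolve_type(schema, field, value):
    """Use the predefined type if the field exists; otherwise guess one dynamically."""
    if field not in schema:
        schema[field] = "int" if looks_like_int(value) else "string"
    return schema[field]


def accepts(field_type, value):
    """A string field accepts anything; an int field accepts only whole numbers."""
    return field_type == "string" or looks_like_int(value)


schema = dict(PREDEFINED_FIELDS)
results = [
    accepts(resolve_type(schema, "tiffBitsPerSample", v), v) for v in ("8", "eight")
]

# Fields that are not predefined still go through the dynamic fallback.
page_type = resolve_type(schema, "pageCount", "12")
```

Both values are accepted for the predefined field, while the unknown `pageCount` field (a made-up name) is still typed automatically, which is where unwanted fields or wrong types can creep back in.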
The Basic option avoids dynamic fields altogether; fields are not automatically added to the index collection schema so you need to manually add and configure them. When a document is indexed, any fields that don't match the schema are not indexed.
This is the preferred option for a production index collection. Because you define the fields, only the information you want is stored in the index.
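The effect of a schema without dynamic fields can be sketched as a simple filter. The function name, the field names, and the sample document below are all hypothetical; the sketch only mirrors the behavior described above, where fields that don't match the schema are not indexed.

```python
def apply_basic_schema(schema_fields, doc):
    """Split a document into fields that are indexed and fields that are ignored."""
    indexed = {name: value for name, value in doc.items() if name in schema_fields}
    ignored = sorted(name for name in doc if name not in schema_fields)
    return indexed, ignored


# Hypothetical, manually defined schema fields.
schema_fields = {"title", "author"}

doc = {"title": "Q3 report", "author": "jdoe", "tiffBitsPerSample": "8"}
indexed, ignored = apply_basic_schema(schema_fields, doc)
```

Here `tiffBitsPerSample` is left out of the index because it was never added to the schema.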
With the Empty option, the schema is completely empty. It contains no dynamic or defined fields, so you need to add all fields yourself.
Importing fields from a workflow task also has the added benefit of more intelligent type selection. When the task produces a list of fields, it analyzes all possible values for those fields and recommends a field type that's appropriate for the majority of the field values and that needs the least amount of allocated space in the index.
For more information, see Importing fields from a workflow into an index collection.
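The type recommendation described above (a type that fits the majority of values and needs the least space) could work along the following lines. This is a sketch under stated assumptions: the candidate type names, their ordering by space, the 32-bit range check, and the "more than half" threshold are all illustrative choices, not the product's documented algorithm.

```python
def _parses(cast, value):
    """Return True if cast(value) succeeds."""
    try:
        cast(value)
        return True
    except ValueError:
        return False


# Hypothetical candidate types, ordered from least to most allocated space (an assumption).
CANDIDATES = [
    ("int", lambda v: _parses(int, v) and -2**31 <= int(v) < 2**31),
    ("long", lambda v: _parses(int, v)),
    ("double", lambda v: _parses(float, v)),
    ("string", lambda v: True),
]


def recommend_type(values):
    """Return the smallest candidate type that fits more than half of the values."""
    for name, fits in CANDIDATES:
        matching = sum(1 for v in values if fits(v))
        if matching * 2 > len(values):
            return name
    return "string"  # fallback, for example when no values were observed


small = recommend_type(["8", "16", "eight"])
decimal = recommend_type(["1.5", "2.25", "n/a"])
large = recommend_type(["3000000000", "4000000000"])
```

Note that a majority-based recommendation can still leave some values that don't fit the chosen type, such as `eight` in the first sample, so the recommended types are worth reviewing before you index production data.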