
Index collection types and settings

This table lists the types of index collections that Hitachi Content Intelligence supports.

Index Type | Settings | Supported Actions

HCI Index (Internal): The resulting search index is a Solr index that is wholly owned by Hitachi Content Intelligence; Hitachi Content Intelligence builds it, updates it, and stores it on one or more instances in the system.

Initial Visibility setting
  • Public: When selected, the Public query settings are enabled for the index collection. This means that all users automatically receive access to search the index.
  • Private: When selected, the Public query settings are disabled for the index collection. This means that no user has access to the index until you explicitly grant access to those users.

For more information, see Controlling user access to search indexes.

Initial Schema setting

For information, see Initial schema options.

Initial Shard Count setting

This determines the number of unique pieces to separate the search index into. The more shards an index has, the larger the index can grow and still be able to service requests quickly.

For HCI Indexes, index shards are distributed amongst the instances that run the Index service.

The default number of shards is 3.

For more information, see Index shards.

Important: After you create an index collection, you cannot reconfigure the number of shards it has.

For guidance on index sharding, see the applicable Apache Solr or Elasticsearch documentation.
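As a general intuition for why the shard count is fixed at creation time, the following minimal sketch models document placement with a simple hash-modulo rule. This is an assumption made for illustration only: the function `shard_for` and the routing rule are hypothetical, and neither Hitachi Content Intelligence nor Apache Solr is claimed to route documents exactly this way.

```python
import hashlib


def shard_for(doc_id: str, shard_count: int) -> int:
    """Map a document ID to a shard number using a stable hash (simplified model)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Hypothetical document IDs, used only for this illustration.
doc_ids = [f"doc-{i}" for i in range(1000)]

# With the default of 3 shards, every document has one fixed home shard.
placement_3 = {d: shard_for(d, 3) for d in doc_ids}

# Pretend the shard count could be changed to 4 after the index was built.
placement_4 = {d: shard_for(d, 4) for d in doc_ids}

moved = sum(1 for d in doc_ids if placement_3[d] != placement_4[d])
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

In this model, changing the shard count changes where most documents belong, so the existing index data would no longer be where queries expect it.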

Index: When configured in an Execute Action stage in a pipeline, all documents that pass through the stage are immediately indexed.

For information on using this action, see Adding actions to a pipeline.

HDDS (External): An existing index created and maintained by Hitachi Data Discovery Suite (HDDS). Your system can use this index to search for files, but cannot make any changes to it.

Important: This index collection type has been deprecated as of version 2.2.2 of HCI and will be removed in an upcoming release.
Initial Visibility setting
  • Public: When selected, the Public query settings are enabled for the index collection. This means that all users automatically receive access to search the index.
  • Private: When selected, the Public query settings are disabled for the index collection. This means that no user has access to the index until you explicitly grant access to those users.

    For more information, see Controlling user access to search indexes.

  • Connection URL: URL for the HDDS system.
  • Connection Port: Port to connect to on the HDDS system.
  • Username and password for a local HDDS user account or an AD user account with permissions to access the HDDS system
  • Use SSL: Option to enable or disable SSL for the connection to HDDS.
Supported Actions: None

Apache Solr (External): A new or existing index that your system can query and add data to. However, the search index is stored in your Apache Solr infrastructure, not within the system instances. You can also use your Solr tools to manage the index. When creating an Apache Solr index collection, you can configure your system to use an existing Solr index or create a new one.

For more information on how Hitachi Content Intelligence interacts with an external Apache Solr index, see Considerations for external Apache Solr index collections.

Initial Visibility setting
  • Public: When selected, the Public query settings are enabled for the index collection. This means that all users automatically receive access to search the index.
  • Private: When selected, the Public query settings are disabled for the index collection. This means that no user has access to the index until you explicitly grant access to those users.

For more information, see Controlling user access to search indexes.

Initial Schema setting

For information, see Initial schema options.

Initial Shard Count setting

This determines the number of unique pieces to separate the search index into. The more shards an index has, the larger the index can grow and still be able to service requests quickly.

For HCI Indexes, index shards are distributed amongst the instances that run the Index service.

The default number of shards is 3.

For more information, see Index shards.

Important: After you create an index collection, you cannot reconfigure the number of shards it has.

For guidance on index sharding, see the applicable Apache Solr or Elasticsearch documentation.

  • Connection URL: URL for the Solr server. To use SolrCloud, prefix the connection URL with Z:. Otherwise, the format is [http://][<host>]:[<port>]/solr.
  • Management URL: URL used to connect to the Solr server for management operations. The format is [http://][<host>]:[<port>].
  • Whether to create a new index or register an existing one.
Note: If you've unregistered an index collection and want to reconnect it to Hitachi Content Intelligence, select the option to use an existing index and select the index you want to reconnect.

For information on unregistering index collections, see Unregistering an index collection.
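The two URL formats for a standalone (non-SolrCloud) Solr server can be sketched as follows. The helper `solr_urls` and the host name are hypothetical and exist only for this illustration; port 8983 is the usual Solr default, but your server might use a different one. The SolrCloud form is not covered here.

```python
from urllib.parse import urlsplit


def solr_urls(host: str, port: int, scheme: str = "http") -> tuple[str, str]:
    """Return (connection_url, management_url) for a standalone Solr server."""
    if not host:
        raise ValueError("host must not be empty")
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    # Management URL: scheme, host, and port only.
    management_url = f"{scheme}://{host}:{port}"
    # Connection URL: the same address with the /solr path appended.
    connection_url = f"{management_url}/solr"
    return connection_url, management_url


# Hypothetical host name used only as an example.
connection_url, management_url = solr_urls("solr.example.com", 8983)
print(connection_url)
print(management_url)
```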

Index: When configured in an Execute Action stage in a pipeline, all documents that pass through the stage are immediately indexed.

For information on using this action, see Adding actions to a pipeline.

Elasticsearch™: A new or existing index that your system can query and add data to. However, the search index is stored in your Elasticsearch infrastructure, not within the system instances. You can also use your Elasticsearch tools to manage the index. When creating an Elasticsearch index collection, you can configure your system to use an existing index or create a new one.

Important: This index collection type has been deprecated as of version 2.2.2 of HCI and will be removed in an upcoming release.
Initial Visibility setting
  • Public: When selected, the Public query settings are enabled for the index collection. This means that all users automatically receive access to search the index.
  • Private: When selected, the Public query settings are disabled for the index collection. This means that no user has access to the index until you explicitly grant access to those users.

    For more information, see Controlling user access to search indexes.

Initial Schema setting

For information, see Initial schema options.

Initial Shard Count setting

This determines the number of unique pieces to separate the search index into. The more shards an index has, the larger the index can grow and still be able to service requests quickly.

For HCI Indexes, index shards are distributed amongst the instances that run the Index service.

The default number of shards is 3.

For more information, see Index shards.

Important: After you create an index collection, you cannot reconfigure the number of shards it has.

For guidance on index sharding, see the applicable Apache Solr or Elasticsearch documentation.

  • Connection URL: URL for the Elasticsearch server. The format is <host>:<port>.
  • Whether to create a new index or register an existing one.
Tip: If you've unregistered an index collection and want to reconnect it to your system, select the option to use an existing index and select the index you want to reconnect.

For information on unregistering index collections, see Unregistering an index collection.

  • External Collection Name: The name of an existing external index collection to register with your system.

Index: When configured in an Execute Action stage in a pipeline, all documents that pass through the stage are immediately indexed.

For information on using this action, see Adding actions to a pipeline.

Initial schema options

When creating an index collection, you need to select an initial schema format. This determines the fields that initially exist in the index collection schema.

You should select a schema format based on your needs:

  • If you need to learn about the contents of your data, use Schemaless.
  • If you are somewhat familiar with your data but need to know more, use Default.
  • If you need to ensure peak performance for a production index, use Basic or Empty.

Schemaless option

When created with the Schemaless option, an index collection schema contains a number of dynamic fields. When your system tries to index a document using this schema, it automatically selects a type for each field in the document. This allows data to be indexed automatically; you don't need to manually add fields to the schema or configure their types.

However, automatic field addition and type selection can be problematic because:

  • All document fields are added to the index, many of which might not be useful.
  • Field types might be selected incorrectly, which can cause documents to be indexed improperly or not at all.

For example, say that a document contains a field named tiffBitsPerSample with a value of 8. When this document is sent to a schemaless index collection, the search engine will automatically create a new field in the schema with the type of int. Any subsequent document indexed with a metadata field of tiffBitsPerSample will have the information indexed into this field as type int.

What happens if another document contains the tiffBitsPerSample field but with a value of eight, not 8? The document will fail to be indexed because the int field type cannot contain letters.
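The failure described above can be reproduced with a small simulation. This is a minimal sketch, not the search engine's actual logic: the functions `guess_type`, `accepts`, and `index_document` are hypothetical, and the only types modeled are int and string.

```python
def guess_type(value: str) -> str:
    """Guess a field type from the first value seen (hypothetical rule)."""
    try:
        int(value)
        return "int"
    except ValueError:
        return "string"


def accepts(field_type: str, value: str) -> bool:
    """Return True if the value can be stored in a field of the given type."""
    if field_type == "int":
        try:
            int(value)
        except ValueError:
            return False
    return True


def index_document(schema: dict, doc: dict) -> bool:
    """Reject a document that violates the schema; otherwise add any new fields."""
    for name, value in doc.items():
        if name in schema and not accepts(schema[name], value):
            return False  # value does not fit the type chosen earlier
    for name, value in doc.items():
        schema.setdefault(name, guess_type(value))
    return True


schema = {}
first = index_document(schema, {"tiffBitsPerSample": "8"})
second = index_document(schema, {"tiffBitsPerSample": "eight"})
print(schema, first, second)
```

The first document fixes the field type as int, so the second document is rejected even though its value is meaningful.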

Default option

The Default option tries to mitigate the issues that the Schemaless option can cause when automatically adding fields. The Default option also uses dynamic fields to automatically configure any fields that don't exist, but many common fields have already been added to the schema and their types predefined to avoid type selection problems.

With the Default option, the tiffBitsPerSample field is already added to the schema with a type of string so that it can store both the values eight and 8.

However, because this option still contains dynamic fields, it is possible for the index to automatically add fields you don't want or to select the wrong data types for those fields.
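A minimal sketch of this behavior, under stated assumptions: `PREDEFINED` stands in for a small, hypothetical subset of the fields that the Default schema ships with, `pageCount` is a made-up field name, and the type-guessing rule is invented for illustration.

```python
# Hypothetical subset of predefined fields; the real Default schema is larger.
PREDEFINED = {"tiffBitsPerSample": "string"}


def guess_type(value: str) -> str:
    """Guess a type for a field that is not in the schema yet (hypothetical rule)."""
    try:
        int(value)
        return "int"
    except ValueError:
        return "string"


def resolve_type(schema: dict, name: str, value: str) -> str:
    """Use the existing type if the field is known; otherwise guess and store it."""
    if name not in schema:
        schema[name] = guess_type(value)
    return schema[name]


schema = dict(PREDEFINED)
t1 = resolve_type(schema, "tiffBitsPerSample", "8")      # predefined as string
t2 = resolve_type(schema, "tiffBitsPerSample", "eight")  # still string, so it fits
t3 = resolve_type(schema, "pageCount", "12")             # not predefined, so guessed
print(t1, t2, t3)
```

The predefined field accepts both values, while the field that is not predefined still gets a guessed type and remains exposed to the Schemaless problem.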

Select this option instead of Schemaless if your workflow uses the default pipeline.

Basic option

The Basic option avoids dynamic fields altogether; fields are not automatically added to the index collection schema so you need to manually add and configure them. When a document is indexed, any fields that don't match the schema are not indexed.

This is the preferred option for a production index collection. Because you define the fields, only the information you want is stored in the index.
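The effect of a schema without dynamic fields can be sketched as a simple filter. The function `filter_to_schema`, the field names, and the values below are hypothetical and serve only to illustrate the behavior described above.

```python
def filter_to_schema(schema: dict, doc: dict) -> dict:
    """Keep only the fields that the schema defines; drop everything else."""
    return {name: value for name, value in doc.items() if name in schema}


# Fields that were added to the schema manually (hypothetical examples).
schema = {"title": "string", "tiffBitsPerSample": "string"}

# The incoming document carries one field that the schema does not define.
doc = {"title": "scan-001", "tiffBitsPerSample": "8", "cameraModel": "X100"}

indexed = filter_to_schema(schema, doc)
print(indexed)
```

Only the two defined fields reach the index; the undefined field is dropped and the schema is left unchanged.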

Empty option (Elasticsearch indexes only)

With this option, the schema is completely empty. It contains no dynamic or defined fields. You need to add all fields yourself.

Note: Though you need to add fields to the index manually with the Basic and Empty options, you don't have to add them one at a time. You can run a workflow task to produce a list of fields and then add that entire list to the index collection schema.

This method also has the added benefit of more intelligent type selection. When the task produces a list of fields, it analyzes all possible values for those fields and recommends a field type that's appropriate for the majority of the field values and that needs the least amount of allocated space in the index.

For more information, see Importing fields from a workflow into an index collection.
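One way to picture majority-based type recommendation is the sketch below. It is an interpretation, not the workflow task's actual algorithm: the candidate type names, their ordering by assumed storage cost, and the "more than half of the values" threshold are all assumptions made for this example.

```python
# Candidate types ordered from smallest to largest assumed storage cost.
# Both the type names and the ordering are assumptions for this sketch.
CANDIDATES = ["boolean", "int", "long", "double", "string"]


def fits(type_name: str, value: str) -> bool:
    """Return True if the value can be represented by the given type."""
    if type_name == "boolean":
        return value.lower() in ("true", "false")
    if type_name in ("int", "long"):
        try:
            number = int(value)
        except ValueError:
            return False
        limit = 2**31 if type_name == "int" else 2**63
        return -limit <= number < limit
    if type_name == "double":
        try:
            float(value)
            return True
        except ValueError:
            return False
    return True  # any value can be stored as a string


def recommend_type(values: list) -> str:
    """Return the smallest candidate type that fits more than half of the values."""
    for type_name in CANDIDATES:
        matching = sum(1 for v in values if fits(type_name, v))
        if matching * 2 > len(values):
            return type_name
    return "string"


print(recommend_type(["8", "16", "eight"]))
print(recommend_type(["8", "eight", "sixteen"]))
```

With this rule, a field whose values are mostly numeric is recommended as a numeric type, and the few values that do not fit would need to be handled separately.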

 
