Index collection schemas

Each index collection includes a schema, which is a list of fields with configuration settings for each. These settings specify whether a field is indexed and what features, such as sorting, the field supports.

When configuring your index collection, you need to balance the need to provide a rich, useful search experience for your users against the need to keep your search index from becoming bloated and inefficient. You should make your users' search experience as flexible as possible but not so rich that the search index occupies an excessive amount of disk space or is slow to return search results.

For best practices on configuring an index collection schema, see Best practices for designing an index.

Tip: Unlike changes to an index collection schema, changes to query settings take effect immediately and do not require that you reindex your content. So if you need to change your users' search experience after you've finalized your index collection schema, try making those changes through query settings before you reconfigure your schema.

Defined, dynamic, and copy fields

The schema for an index collection groups fields into these categories:

  • Defined: This is the standard field category. Defined fields do not include wildcards in their names.
  • Dynamic: Each field in this category includes a wildcard character in its name. You can create dynamic fields to specify how the system should handle fields that it hasn't seen before or fields that you have not explicitly defined in the schema.

    For example, say your index collection includes a dynamic field called *_coordinate with a type of tdouble and your workflow pipeline discovers or adds a field called left_coordinate to your documents. The index collection schema does not include left_coordinate as a defined field. However, left_coordinate matches the dynamic field definition *_coordinate so, when the document is indexed, left_coordinate gets added as a new defined field with a type of tdouble.

  • Copy: This category contains rules for copying the contents of one field to another. For example, when you create an index collection with an initial schema type of schemaless, the index collection includes these copy rules by default:
    Source: HCI_displayName Destination: HCI_autocomplete
    Source: HCI_snippet Destination: HCI_autocomplete 

    This means that the values for the HCI_displayName and HCI_snippet fields are copied to the HCI_autocomplete field. HCI_autocomplete is the field that gets searched by default when a user performs a basic search in the Search App.

Note: Copy fields are not supported for Elasticsearch indexes.
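The dynamic field and copy rule behavior described above can be sketched in a few lines of Python. This is only an illustration of the matching logic, not how the product implements it: the in-memory dictionaries, the function names, and the types assigned to HCI_displayName and HCI_snippet are all assumptions made for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical in-memory schema. The field types for the two HCI_ fields
# are assumptions chosen only for illustration.
defined_fields = {"HCI_displayName": "string", "HCI_snippet": "text_general"}
dynamic_fields = {"*_coordinate": "tdouble"}
copy_rules = [
    ("HCI_displayName", "HCI_autocomplete"),
    ("HCI_snippet", "HCI_autocomplete"),
]


def resolve_field_type(name):
    """Return the type for a field name, or None if nothing matches.

    A name that matches a dynamic field pattern is added to the
    defined fields with the pattern's type.
    """
    if name in defined_fields:
        return defined_fields[name]
    for pattern, field_type in dynamic_fields.items():
        if fnmatchcase(name, pattern):
            defined_fields[name] = field_type
            return field_type
    return None


def apply_copy_rules(document):
    """Return a new document with each copy rule destination populated."""
    copied = {}
    for source, destination in copy_rules:
        if source in document:
            copied.setdefault(destination, []).append(document[source])
    # Destinations built from copy rules replace any existing value.
    return {**document, **copied}
```

For example, resolve_field_type("left_coordinate") returns "tdouble" and records left_coordinate as a defined field, while a name that matches no pattern returns None.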

Field types

This table describes the different types of data that each field in a document or index collection can contain. Each field type is optimized for a particular type of data.

Tip: You can have your system select field types for you. For information, see Importing fields from a workflow into an index collection.

Note
  • Each field type carries with it a number of field attributes. These attributes and their corresponding use cases are automatically applied to a field when you assign a type to it. For information on field attributes and use cases, see Field attributes and Use cases.
  • The field types that end in s always support multiple values. You cannot sort search results based on fields of these types.
  • Field types that begin with t, as in tlong or tint, are identical to their counterparts, such as long or int, except that the t-prefixed field values are broken down and indexed as smaller terms. This increases the amount of data stored in the index, but increases the likelihood of returning a document based on partial query term match.
  • To minimize the size of your search index, select the appropriate type for each field. For more information, see Use the smallest field types possible for numerical values.

Field type | Multivalued version | Description
ancestor_path | | Values for this field can be file paths.
binary | | Values for this field are binary data.
boolean | booleans | Values for this field can be true or false.
currency | | Values for this field are currency values. These must be numbers that include a decimal (or floating) point.

date, tdate | dates, tdates | Values for this field are dates in this format:

YYYY-MM-DDThh:mm:ssZ

Where:

  • YYYY is the year
  • MM is the month
  • DD is the day
  • T indicates that the next values represent the time of day
  • hh is the hour
  • mm is the minute
  • ss is the second
  • Z represents the offset from UTC and is specified as:

    (+|-)hhmm

date_long | date_longs | Values for this field can be 64-bit positive or negative integers. Useful for date values that are stored in UNIX epoch time format.

For fields with this type, the Search App gives a special date picker mechanism for users to use when selecting values on which to refine their searches or sort search results.

descendent_path | | Values for this field can be file paths.

double, tdouble | doubles, tdoubles | Values for this field can be 64-bit numbers that include a decimal (or floating) point.

float, tfloat | floats, tfloats | Values for this field can be 32-bit numbers that include a decimal (or floating) point.
ignored | | Values for this field are neither indexed nor retrievable in search results.

int, tint | ints, tints | Values for this field can be 32-bit positive or negative integers.
location | | Values for this field can be latitude/longitude coordinates.
location_rpt | | Values for this field can be positional data, including latitude/longitude coordinates.

long, tlong | longs, tlongs | Values for this field can be 64-bit positive or negative integers.
lowercase | | Values for this field are indexed only in lowercase letters.
phonetic_en | | Values for this field are indexed as English-language phonetic approximations.
point | | Values for this field can be positional data, for example a point on a graph.
random | | Used as the field type on the dynamic field random_*. This field can be used for sorting search results in a random order.
string | strings | Values for this field can be any sequence of characters, numbers, or special characters. Values are case sensitive.

Use this field to support exact phrase matches. That is, a search user must query on the entire value for this field to retrieve results.

text_<language-code> | | Values can be basic text of a particular language:

  • ar: Arabic
  • bg: Bulgarian
  • ca: Catalan
  • cjk: Chinese, Japanese, Korean (CJK)
  • cz: Czech
  • da: Danish
  • de: German
  • el: Greek
  • en: English
  • en_splitting: English

    This field type is the same as text_en, but also splits field values based on special characters, case changes, and changes from letters to numbers.

  • en_splitting_tight: English

    Similar to text_en_splitting, but produces fewer false matches.

  • es: Spanish
  • eu: Basque
  • fa: Farsi
  • fi: Finnish
  • fr: French
  • ga: Irish
  • gl: Galician
  • hi: Hindi
  • hu: Hungarian
  • hy: Armenian
  • id: Indonesian
  • it: Italian
  • ja: Japanese
  • lv: Latvian
  • nl: Dutch
  • no: Norwegian
  • pt: Portuguese
  • ro: Romanian
  • ru: Russian
  • sv: Swedish
  • th: Thai
  • tr: Turkish

text_general | | Values can be any general piece of text. Values for this field are not case sensitive.

Use this field to support fuzzy matches, not exact matches. That is, the search user does not need to specify the exact value for a field to retrieve results.

text_general_rev (reverse, not revised) | | Values can be any general piece of text.

This field type is identical to text_general, but also allows each token to be indexed in both forward and reverse order. This improves performance when searching for files using leading wildcards or prefixes.

text_hci | | Values can be any general piece of text.

An extension of text_general, text_hci causes a field value to be indexed as-is and also to be split and indexed as a number of separate tokens. This field type also allows filenames to be indexed separately from their extensions.

Important: Use this field type only on fields that contain small amounts of plain text. Using this field type on a large amount of content or on content with a large number of transitions between letters, numbers, and special characters (for example, JSON-formatted text) can severely impact index performance.

text_suggest | | Special type applied to the HCI_autocomplete field, used to support making query term suggestions to the user. When a user performs a simple search in the Search App, that user is actually searching for values in the HCI_autocomplete field.

For more information, see Query suggestions.

text_ws | | Values can be any general piece of text. Values for a field of this type are split based only on white space characters.
uuid | | Values for this field are universally unique identifier (UUID) strings.
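As a rough illustration of the date and date_long rows in the table above, the following Python sketch formats a timestamp in the YYYY-MM-DDThh:mm:ss pattern followed by a (+|-)hhmm offset, and converts the same timestamp to UNIX epoch time. The function names are hypothetical. The source does not say whether date_long values are expected in seconds or milliseconds, so the use of whole seconds here is an assumption.

```python
from datetime import datetime, timedelta, timezone

# strftime pattern for YYYY-MM-DDThh:mm:ss followed by a (+|-)hhmm offset.
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def to_index_date(moment):
    """Format a timezone-aware datetime as YYYY-MM-DDThh:mm:ss(+|-)hhmm."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return moment.strftime(DATE_FORMAT)


def to_epoch_seconds(moment):
    """Convert a timezone-aware datetime to whole seconds since the UNIX epoch.

    Assumption: seconds rather than milliseconds.
    """
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return int(moment.timestamp())


# A sample value five hours behind UTC.
sample = datetime(2024, 3, 5, 14, 30, 0, tzinfo=timezone(timedelta(hours=-5)))
print(to_index_date(sample))     # 2024-03-05T14:30:00-0500
print(to_epoch_seconds(sample))
```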

Adding and editing fields in an index collection schema

You edit an index collection schema by adding fields to it. For each field, you need to specify the type of data that the field contains and the field attributes to associate with that field. You can also configure the index collection to copy the values for a field to one or more other fields.

In the Admin App, you can associate one or more use cases with a field. A use case is simply a useful grouping of field attributes. You can use these until you are more familiar with the field attributes and how they interact with one another.

Use cases are not available in the CLI or REST API. With these methods, you use field attributes to configure fields.

Tip: For advanced users, your system also allows direct editing of the underlying index collection schema files for internally managed and Apache Solr indexes. For information, see Directly editing configuration files for HCI Indexes.

Note
  • For a field to be indexed, its name:
      • Cannot contain hyphens (-) or any other special characters.
      • Cannot start with underscores (_) or numbers.
  • You cannot delete fields or edit existing fields in an Elasticsearch index schema.
  • If you make changes to the existing fields in an index collection schema, those changes are not reflected in the search index until a workflow task runs and reindexes your files using your new schema settings.
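The field naming rules in this section could be checked before you create a field with a small helper like the one below. This is a sketch under stated assumptions: the function name is hypothetical, and the pattern assumes that only ASCII letters, digits, and underscores are allowed and that a name must start with a letter. The product's actual validation may differ, and the check applies to concrete field names only, not to dynamic field patterns that contain a wildcard.

```python
import re

# Assumption: letters, digits, and underscores only, starting with a letter.
_FIELD_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_indexable_field_name(name):
    """Return True if the name satisfies the naming rules described above."""
    return _FIELD_NAME.fullmatch(name) is not None


for candidate in ("HCI_displayName", "left-coordinate", "_hidden", "9lives"):
    print(candidate, is_indexable_field_name(candidate))
```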

To add and edit fields:

Procedure

  1. Click the Index Collections window.

  2. Select the index collection you want to configure.

  3. Click the Schema tab.

  4. On the Defined Fields tab, click the edit icon for the field you want to configure, or click Create Field to add a new one.

  5. To edit a field:

    1. Specify a name for the field.

      If you add a wildcard character (*) to the field name, the field is listed on the Dynamic Field tab. For more information, see Defined, dynamic, and copy fields.

      Note: You cannot rename existing fields. To replace a field, you can create a new field with the name you want and then delete the old one.
    2. In the Type drop-down, pick the type of data that the field can hold. For information on the available types, see Field types.

    3. Select the use cases or field attributes that you want the field to support. For information, see Field attributes and Use cases.

  6. To copy the values for one field to another field:

    1. On the Copy Fields tab, click Add Copy Field.

    2. In the Source Field menu, select the field you want to copy.

    3. In the Destination Field menu, select the field to which you want to copy the value from the source field.

    4. Optionally, click Add Destination to add additional fields.

    5. Click Create.

  7. To delete a field, click the delete icon for the field.

  8. Click Update Field.

Use cases

This table lists the use cases you can configure for fields in an index collection schema. Each use case corresponds to a combination of field properties.

Tip: Enabling a use case for a field means that information must be stored in the search index to support the use case. Enabling many use cases for many fields can cause an index to grow very large and be slow to return results. The Impact value for a use case specifies the relative cost for enabling the use case.

Generally, you should minimize the number of use cases you enable for a field, and also minimize the number of fields in your index collection.

Note: Field attributes offer more precise field configuration options, but you can use these use cases until you are more familiar with how field attributes interact with one another. For information on field attributes, see Field attributes.

Use case | Description | Field attributes | Impact

Allow field boosts

You can configure how this field is boosted by configuring query settings for the index. For information, see Boosting field relevancy.

Cannot be used with Sort on field.

Field attributes: omitNorms must be disabled
Impact: Medium

Facet on Field

Stores the field in a way most efficient for using the field as a facet. That is, the field appears as a facet, or category, in the search results. The Search App lists the number of search results in which this field appears. Users can use the facet to narrow down a list of search results.

For more information, see Adding facets to the Search App.

Note: The Search within a field use case also allows fields to appear as facets. However, if you need a field to appear as a facet but don't want to allow users to search based on that field, use Facet on Field instead.

Field attributes: docValues
Impact: Medium

Retrieve contents

The contents of the field are returned in the search results.

Use this case to allow images, video, and audio files to be viewable right from the search results page in the Search App. For more information, see Making images, audio, and video appear in search results.

Note: Use query settings to configure how search results are presented to users. For information, see Configuring search result layouts.

Field attributes: stored
Impact: Medium

Search within a field

The field is added to the search index, allowing users to make queries based on it. The field can also appear as a facet for narrowing down a set of search results.

For example, enabling this for a field called author lets your users create a query like this:

author : brontë

And also allows a facet like this to show up in search results:

author

A. Brontë (2)

C. Brontë (4)

E. Brontë (1)

Field attributes: indexed
Impact: High

Sort on field

Users can sort search results based on values for this field.

Cannot be used with Allow field boosts or Multiple values.

Field attributes: omitNorms
Impact: Medium

Keyword Highlighting

Allows for highlighting search result keywords within this field value.

For example, say this use case is enabled for a field called author, which has a value of Smith in a particular document. If a user searches for the field/value pair author:smith, the word Smith will be bolded in the search results.

To enable search result highlighting, you also need to configure the query settings to support it. For information, see Search result highlighting.

Field attributes: indexed, stored
Impact: Medium

Phrase Highlighting

Similar to Keyword Highlighting, but allows for highlighting entire phrases in search results.

To enable search result highlighting, you also need to configure the query settings to support it.

Field attributes: indexed, stored, termPositions, termVectors
Impact: High

Multiple values

Indicates that within a document, multiple values can exist for the field. All values in the document for this field are indexed.

Cannot be used with Sort on field.

Field attributes: multiValued
Impact: High
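Because use cases are not available in the CLI or REST API, it can help to see the table above as a plain mapping from use cases to field attributes. The Python sketch below transcribes that mapping and the two stated incompatibilities. The data structure and function name are hypothetical and are not part of any product API.

```python
# Mapping transcribed from the use case table above.
USE_CASE_ATTRIBUTES = {
    "Allow field boosts": set(),  # requires omitNorms to be disabled
    "Facet on Field": {"docValues"},
    "Retrieve contents": {"stored"},
    "Search within a field": {"indexed"},
    "Sort on field": {"omitNorms"},
    "Keyword Highlighting": {"indexed", "stored"},
    "Phrase Highlighting": {"indexed", "stored", "termPositions", "termVectors"},
    "Multiple values": {"multiValued"},
}

# Pairs of use cases that the table says cannot be combined.
INCOMPATIBLE = [
    ("Sort on field", "Allow field boosts"),
    ("Sort on field", "Multiple values"),
]


def attributes_for(use_cases):
    """Return the set of field attributes needed for the given use cases."""
    selected = set(use_cases)
    unknown = selected.difference(USE_CASE_ATTRIBUTES)
    if unknown:
        raise KeyError(f"unknown use cases: {sorted(unknown)}")
    for first, second in INCOMPATIBLE:
        if first in selected and second in selected:
            raise ValueError(f"{first!r} cannot be used with {second!r}")
    attributes = set()
    for use_case in selected:
        attributes |= USE_CASE_ATTRIBUTES[use_case]
    return attributes


print(sorted(attributes_for(["Keyword Highlighting", "Facet on Field"])))
```

Selecting Sort on field together with Multiple values raises a ValueError in this sketch, mirroring the restriction in the table.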

Handling multivalued fields

Some documents in your data sources might have more than one value for a field. For example, one document might have two or more authors. You need to handle these documents in both your processing pipeline and your index collection to ensure that the system indexes each value separately.

In a pipeline

When you run a document with a multi-valued field through the Text and Metadata Extraction stage in a pipeline, the values for the multi-valued field are extracted as a single value.

For example, a document authored by both A. Smith and B. Smith contains this field/value pair:

Author : "A.Smith;B.Smith"

To split this single value into multiple values, add a Tokenizer stage to your pipeline. You configure this stage to split up field values based on a specified character which, in this example, is a semicolon.

For more information, see Tokenizer stage.
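The effect of splitting the Author value on a semicolon can be pictured with the short Python sketch below. It is not the Tokenizer stage's implementation. The function name is hypothetical, and trimming whitespace and dropping empty parts are assumptions made for the example.

```python
def split_field_value(value, delimiter=";"):
    """Split one extracted value into separate values.

    Assumption: surrounding whitespace is trimmed and empty parts are dropped.
    """
    return [part.strip() for part in value.split(delimiter) if part.strip()]


document = {"Author": "A.Smith;B.Smith"}
document["Author"] = split_field_value(document["Author"])
print(document)  # {'Author': ['A.Smith', 'B.Smith']}
```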

In an index collection

To index all values for a multivalued field, you need to configure the field to support multiple values in the index collection. You can do this by:

  • Setting a plural type for the field, such as strings or tdates.
  • Setting the multiValued field attribute for the field.
  • Setting the Multiple values use case for the field.

For more information, see Adding and editing fields in an index collection schema.

Field attributes

This table describes the attributes that you can enable for fields in an index collection schema. A field's attributes determine how your users can use the field in their queries and how the fields are presented to your users in search results. Some attributes are not compatible with others.

The Impact column indicates the relative effect that enabling the attribute has on the amount of information stored in the index.

Note: Enabling an attribute for a field means that more information must be stored in the search index to support the attribute's functionality. This can cause an index to grow very large and be slow to return results.

Generally, you should minimize the number of attributes you enable for a field, and also minimize the number of fields in your index collection.

Attribute | Definition
indexed | The field is added to the search index, allowing users to make queries based on it. The field can also appear as a facet for narrowing down a set of search results.
stored | The contents of the field are returned in search results.

docValues

The field and its values are stored in a manner that is more efficient for supporting sorting, faceting, and highlighting.

Enable this attribute only for fields that you've also configured to support sorting, faceting, or highlighting.

This attribute is available only for these field types:

  • string
  • uuid
  • date
  • tdouble
  • tfloat
  • tint
  • tlong

For information on field types, see Field types.

multiValued

When enabled for a field, indicates that a single document can contain multiple values for that field.

Multivalued fields cannot be used to sort search results.

omitNorms

When enabled, length normalization and index-time boosting are disabled to save memory. Only full-text fields or index-time boosted fields need norms.

This attribute is enabled by default for all primitive field types, such as int, float, boolean, or string.

termVectors

When enabled, a term vector is created for the field. A vector shows the number of unique terms that occur within a document for the field and the number of occurrences of each unique term.

Used for the Phrase Highlighting use case. For information, see Use cases.

Important: Enabling this attribute has significant impact on index size.
termPositions

When enabled, for each term in the term vector, shows the term's position in the document.

Used for the Phrase Highlighting use case. For information, see Use cases.

Important: Enabling this attribute has significant impact on index size.
required

When true, documents are omitted from the index if they do not have a value for the specified field. The default is false.

Tip: Alternatively, you can use a Filter stage in your processing pipeline to omit documents from the index. For information on this stage, see Filter stage.
uniqueKey

Enabled only for the HCI_id field. You cannot enable this attribute for other fields.

For information on the HCI_id field, see Hitachi Content Intelligence_ fields.
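To make the termVectors and termPositions descriptions more concrete, the Python sketch below builds a simplified term vector for one field value: each unique term, how often it occurs, and where it occurs. The function name is hypothetical. Lowercasing and splitting on white space are simplifying assumptions, and the analysis performed by a real index depends on the field type.

```python
from collections import defaultdict


def build_term_vector(text):
    """Return each unique term with its occurrence count and token positions.

    Assumption: terms are lowercased and split on white space only.
    """
    positions = defaultdict(list)
    for index, token in enumerate(text.lower().split()):
        positions[token].append(index)
    return {
        term: {"count": len(found), "positions": found}
        for term, found in positions.items()
    }


vector = build_term_vector("the quick fox jumps over the lazy fox")
print(vector["fox"])  # {'count': 2, 'positions': [2, 7]}
```

Storing this extra per-term information for every document is why both attributes carry a significant index size cost.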

Importing fields from a workflow into an index collection

You can run a workflow task to automatically discover the fields in your data sources. You can then have the system add those fields to an index collection schema.

This method has these advantages:

  • You can add many fields to an index collection schema together, rather than adding and configuring them individually.
  • The workflow task analyzes all values for each field and recommends the most appropriate type for each one. You don't need to pick the field types yourself or use dynamic fields, which can be error-prone when picking field types.
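The idea of recommending a type from a field's observed values can be illustrated with a simple heuristic. The Python sketch below is not the algorithm the workflow task uses, which the source does not describe. The function names and the order of checks are assumptions, and the sketch covers only a handful of the field types listed earlier.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1      # 32-bit signed range
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1    # 64-bit signed range


def _is_int_in_range(value, low, high):
    """Return True if the text parses as an integer within [low, high]."""
    try:
        return low <= int(value) <= high
    except ValueError:
        return False


def _is_float(value):
    """Return True if the text parses as a floating-point number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def recommend_type(values):
    """Pick the narrowest type that fits every observed value (hypothetical)."""
    texts = [str(value).strip() for value in values]
    if not texts:
        return "string"
    if all(text.lower() in ("true", "false") for text in texts):
        return "boolean"
    if all(_is_int_in_range(text, INT_MIN, INT_MAX) for text in texts):
        return "int"
    if all(_is_int_in_range(text, LONG_MIN, LONG_MAX) for text in texts):
        return "long"
    if all(_is_float(text) for text in texts):
        return "double"
    return "string"


print(recommend_type(["1", "3000000000"]))  # long
```

Checking every value, rather than only the first one seen, avoids the situation in which a later value does not fit the type chosen from earlier values.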

Procedure

  1. Click the Workflow Designer window.

  2. Select the workflow that you want.

  3. Click the Task window.

  4. Click the Aggregations window.

  5. Ensure that the workflow has an aggregation of type Document Field Aggregate. If it doesn't, create one. For information, see Adding aggregations to workflows.

  6. Select the aggregation.

  7. Select the fields that you want.

  8. Click Add Fields to Index and then select the index you want to send the fields to.

Using Language Select

Language Select gives Search App users the opportunity to update the text_ fields of their index to display languages other than the default English. When Solr text_ fields are customized with a new language, Search App users will see them appear in their query results.
Important
  • If you change an index after documents have been indexed, you must reindex your data for the changes to take effect.
  • To revert a language selection, you must change the index manually.

Located on the Workflow Designer App > Index Collections > Index > Schema tab, Language Select supports the following Solr-supported languages:

  • Arabic: ar
  • Armenian: hy
  • Basque: eu
  • Bulgarian: bg
  • Catalan: ca
  • Chinese, Japanese, Korean: cjk
  • Czech: cz
  • Danish: da
  • Dutch: nl
  • English: en
  • Farsi: fa
  • Finnish: fi
  • French: fr
  • Galician: gl
  • German: de
  • Greek: el
  • Hindi: hi
  • Hungarian: hu
  • Indonesian: id
  • Irish: ga
  • Italian: it
  • Japanese: ja
  • Latvian: lv
  • Norwegian: no
  • Portuguese: pt
  • Romanian: ro
  • Russian: ru
  • Serbian: sr
  • Spanish: es
  • Swedish: sv
  • Thai: th
  • Turkish: tr
  • Ukrainian: uk

To use Language Select:

Procedure

  1. Ensure text_ fields exist in the schema.

    1. In the Workflow Designer App, click the Index Collections tab.
    2. Select your index.
    3. Click the Schema tab to view your available text_ fields.
  2. Ensure all text_ fields are set to Refinable.

    1. In the Workflow Designer App, click the Index Collections tab.
    2. Select your index.
    3. Click the Query Settings tab.
    4. For all text_ fields in your index, set the Refinable toggle to Yes.
  3. Go to Workflow Designer App > Index Collections > Index > Schema > Language Select.

  4. From the Language Select dropdown, select your preferred language.

  5. Click the I understand the impact of my changes to the index check box to acknowledge the impact of language changes to an index.

  6. Click Change Language.

  7. To view your updated index, open the Search App and search for a document in your chosen language.

 
