Index collection schemas
Each index collection includes a schema, which is a list of fields with configuration settings for each. These settings specify whether a field is indexed and what features, such as sorting, the field supports.
When configuring your index collection, you need to balance the need to provide a rich, useful search experience for your users against the need to keep your search index from becoming bloated and inefficient. You should make your users' search experience as flexible as possible but not so rich that the search index occupies an excessive amount of disk space or is slow to return search results.
For best practices on configuring an index collection schema, see Best practices for designing an index.
Defined, dynamic, and copy fields
The schema for an index collection groups fields into these categories:
- Defined: This is the standard field category. Defined fields do not include wildcards in their names.
- Dynamic: Each field in this category includes a wildcard character in its name. You can create dynamic fields to specify how the system should handle fields that it hasn't seen before or fields that you have not explicitly defined in the schema.
For example, say your index collection includes a dynamic field called *_coordinate with a type of tdouble and your workflow pipeline discovers or adds a field called left_coordinate to your documents. The index collection schema does not include left_coordinate as a defined field. However, left_coordinate matches the dynamic field definition *_coordinate so, when the document is indexed, left_coordinate gets added as a new defined field with a type of tdouble.
- Copy: This category contains rules for copying the contents of one field to another. For example, when you create an index collection with an initial schema type of schemaless, the index collection includes these copy rules by default:
Source: HCI_displayName Destination: HCI_autocomplete
Source: HCI_snippet Destination: HCI_autocomplete
This means that the values for the HCI_displayName and HCI_snippet fields are copied to the HCI_autocomplete field. HCI_autocomplete is the field that gets searched by default when a user performs a basic search in the Search App.
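The dynamic field and copy rule behavior described above can be sketched in a few lines of Python. This is only an illustration of the matching logic, not the system's implementation: the function names, the dictionary layout, and the string type given to HCI_displayName are assumptions made for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical schema fragments that mirror the examples in the text.
DYNAMIC_FIELDS = {"*_coordinate": "tdouble"}
COPY_RULES = [
    ("HCI_displayName", "HCI_autocomplete"),
    ("HCI_snippet", "HCI_autocomplete"),
]


def resolve_type(field_name, defined_fields, dynamic_fields):
    """Return a field's type: defined fields first, then dynamic patterns."""
    if field_name in defined_fields:
        return defined_fields[field_name]
    for pattern, field_type in dynamic_fields.items():
        if fnmatchcase(field_name, pattern):
            return field_type
    return None  # neither a defined nor a dynamic field matches


def apply_copy_rules(document, copy_rules):
    """Return a copy of the document with source values added to destinations."""
    result = {name: list(values) for name, values in document.items()}
    for source, destination in copy_rules:
        if source in document:
            result.setdefault(destination, []).extend(document[source])
    return result


defined = {"HCI_displayName": "string"}  # assumed type, for illustration only
print(resolve_type("left_coordinate", defined, DYNAMIC_FIELDS))  # tdouble

doc = {"HCI_displayName": ["report.pdf"], "HCI_snippet": ["Quarterly results"]}
print(apply_copy_rules(doc, COPY_RULES)["HCI_autocomplete"])
# ['report.pdf', 'Quarterly results']
```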
Field types
This table describes the different types of data that each field in a document or index collection can contain. Each field type is optimized for a particular type of data.
- Each field type carries with it a number of field attributes. These attributes and their corresponding use cases are automatically applied to a field when you assign a type to it. For information on field attributes and use cases, see Field attributes and Use cases.
- The plural field types listed in the Multivalued Version column, such as strings or tdates, always support multiple values. You cannot sort search results based on fields of these types.
- Field types that begin with t, as in tlong or tint, are identical to their counterparts, such as long or int, except that the t-prefixed field values are broken down and indexed as smaller terms. This increases the amount of data stored in the index, but it also increases the likelihood of returning a document based on a partial query term match.
- To minimize the size of your search index, select the appropriate type for each field. For more information, see Use the smallest field types possible for numerical values.
Field type | Multivalued Version | Description |
ancestor_path | | Values for this field can be file paths. |
binary | | Values for this field are binary data. |
boolean | booleans | Values for this field can be true or false. |
currency | | Values for this field are currency values. These must be numbers that include a decimal (or floating) point. |
date, tdate | dates, tdates | Values for this field are dates in this format: YYYY-MM-DDThh:mm:ssZ |
date_long | date_longs | Values for this field can be 64-bit positive or negative integers. Useful for date values that are stored in UNIX epoch time format. For fields with this type, the Search App gives a special date picker mechanism for users to use when selecting values on which to refine their searches or sort search results. |
descendent_path | | Values for this field can be file paths. |
double, tdouble | doubles, tdoubles | Values for this field can be 64-bit numbers that include a decimal (or floating) point. |
float, tfloat | floats, tfloats | Values for this field can be 32-bit numbers that include a decimal (or floating) point. |
ignored | | Values for this field are neither indexed nor retrievable in search results. |
int, tint | ints, tints | Values for this field can be 32-bit positive or negative integers. |
location | | Values for this field can be latitude/longitude coordinates. |
location_rpt | | Values for this field can be positional data, including latitude/longitude coordinates. |
long, tlong | longs, tlongs | Values for this field can be 64-bit positive or negative integers. |
lowercase | | Values for this field are indexed only in lowercase letters. |
phonetic_en | | Values for this field are indexed as English-language phonetic approximations. |
point | | Values for this field can be positional data, for example a point on a graph. |
random | | Used as the field type on the dynamic field random_*. This field can be used for sorting search results in a random order. |
string | strings | Values for this field can be any sequence of characters, numbers, or special characters. Values are case sensitive. Use this field to support exact phrase matches. That is, a search user must query on the entire value for this field to retrieve results. |
text_<language-code> | | Values can be basic text of a particular language. |
text_general | | Values can be any general piece of text. Values for this field are not case sensitive. Use this field to support fuzzy matches, not exact matches. That is, the search user does not need to specify the exact value for a field to retrieve results. |
text_general_rev (i.e., reverse, not revised) | | Values can be any general piece of text. This field type is identical to text_general, but also allows each token to be indexed in both forward and reverse order. This improves performance when searching for files using leading wildcards or prefixes. |
text_hci | | Values can be any general piece of text. An extension of text_general, text_hci causes a field value to be indexed as-is and also to be split and indexed as a number of separate tokens. This field type also allows filenames to be indexed separately from their extensions. Important: Use this field type only on fields that contain small amounts of plain text. Using this field type on a large amount of content or on content with a large number of transitions between letters, numbers, and special characters (for example, JSON-formatted text) can severely impact index performance. |
text_suggest | | Special type applied to the HCI_autocomplete field, used to support making query term suggestions to the user. When a user performs a simple search in the Search App, that user is actually searching for values in the HCI_autocomplete field. For more information, see Query suggestions. |
text_ws | | Values can be any general piece of text. Values for a field of this type are split based only on white space characters. |
uuid | | Values for this field are universally unique identifier (UUID) strings. |
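As a small illustration of the two date representations in the table, the following Python sketch formats a timestamp as YYYY-MM-DDThh:mm:ssZ for a date field and as UNIX epoch time for a date_long field. The function names are hypothetical, and the choice of whole seconds (rather than milliseconds) for the epoch value is an assumption; check what your data source actually produces.

```python
from datetime import datetime, timezone


def to_date_value(moment):
    """Format a timezone-aware datetime as YYYY-MM-DDThh:mm:ssZ in UTC."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def to_date_long_value(moment):
    """Return whole seconds since the UNIX epoch (assumed unit)."""
    return int(moment.timestamp())


moment = datetime(2024, 3, 15, 9, 30, 0, tzinfo=timezone.utc)
print(to_date_value(moment))       # 2024-03-15T09:30:00Z
print(to_date_long_value(moment))  # 1710495000
```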
Adding and editing fields in an index collection schema
You edit an index collection schema by adding fields to it. For each field, you need to specify the type of data that the field contains and the field attributes to associate with that field. You can also configure the index collection to copy the values for a field to one or more other fields.
In the Admin App, you can associate one or more use cases with a field. A use case is simply a useful grouping of field attributes. You can use these until you are more familiar with the field attributes and how they interact with one another.
Use cases are not available in the CLI or REST API. With these methods, you use field attributes to configure fields.
- You can have your system automatically add fields to your index collection schema. For information, see Importing fields from a workflow into an index collection.
- To figure out what fields are available for indexing, test your workflow pipeline. By doing this you can see all the fields that your pipeline produces from your data. For information, see Testing workflows.
- Field names cannot contain hyphens (-) or any other special characters.
- Field names cannot start with underscores (_) or numbers.
- For advanced users, your system also allows direct editing of the underlying index collection schema files for internally-managed and Apache Solr indexes. For information, see Directly editing configuration files for HCI Indexes.
- You cannot delete fields or edit existing fields in an Elasticsearch index schema.
- If you make changes to the existing fields in an index collection schema, those changes are not reflected in the search index until a workflow task runs and reindexes your files using your new schema settings.
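The naming constraints in the list above (no hyphens or other special characters, no leading underscore or number) can be checked before you create a field. The sketch below reads those constraints as applying to defined field names and assumes that letters, digits, and underscores are the only permitted characters; that character set and the function name are assumptions, not a documented rule.

```python
import re

# Assumption: a defined field name starts with a letter and then contains
# only letters, digits, and underscores. Wildcards (dynamic fields) are
# deliberately out of scope for this check.
_DEFINED_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_defined_field_name(name):
    """Return True if the name satisfies the assumed naming constraints."""
    return _DEFINED_NAME.fullmatch(name) is not None


for candidate in ["HCI_displayName", "left-coordinate", "_hidden", "9lives"]:
    print(candidate, is_valid_defined_field_name(candidate))
```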
To add and edit fields:
Procedure
Click the Index Collections window.
Select the index collection you want to configure.
Click the Schema tab.
On the Defined Fields tab, click the edit icon for the field you want to configure or click Create Field to add a new one.
To edit a field:
Specify a name for the field.
If you add a wildcard character (*) to the field name, the field is listed on the Dynamic Field tab. For more information, see Defined, dynamic, and copy fields.
Note: You cannot rename existing fields. To replace a field, you can create a new field with the name you want and then delete the old one.
In the Type drop-down, pick the type of data that the field can hold. For information on the available types, see Field types.
Select the use cases or field attributes that you want the field to support. For information, see Field attributes and Use cases.
To copy the values for one field to another field:
On the Copy Fields tab, click Add Copy Field.
In the Source Field menu, select the field you want to copy.
In the Destination Field menu, select the field to which you want to copy the value from the source field.
Optionally, click Add Destination to add additional fields.
Click Create.
To delete a field, click the delete icon for the field.
Click Update Field.
Use cases
This table lists the use cases you can configure for fields in an index collection schema. Each use case corresponds to a combination of field properties.
Generally, you should minimize the number of use cases you enable for a field, and also minimize the number of fields in your index collection.
Use case | Description | Field attributes | Impact |
Allow field boosts | You can configure how this field is boosted by configuring query settings for the index. For information, see Boosting field relevancy. Cannot be used with Sort on field. | | Medium |
Facet on Field | Stores the field in a way most efficient for using the field as a facet. That is, the field appears as a facet, or category, in the search results. The Search App lists the number of search results in which this field appears. Users can use the facet to narrow down a list of search results. For more information, see Adding facets to the Search App. Note: The Search within a field use case also allows fields to appear as facets. However, if you need a field to appear as a facet but don't want to allow users to search based on that field, use Facet on Field instead. | | Medium |
Retrieve contents | The contents of the field are returned in the search results. Use this case to allow images, video, and audio files to be viewable right from the search results page in the Search App. For more information, see Making images, audio, and video appear in search results. Note: Use query settings to configure how search results are presented to users. For information, see Configuring search result layouts. | | Medium |
Search within a field | The field is added to the search index, allowing users to make queries based on it. The field can also appear as a facet for narrowing down a set of search results. For example, enabling this for a field called author lets your users create a query like this: author : brontë And also allows a facet like this to show up in search results: author A. Brontë (2) C. Brontë (4) E. Brontë (1) | | High |
Sort on field | Users can sort search results based on values for this field. Cannot be used with Allow field boosts or Multiple values. | | Medium |
Keyword Highlighting | Allows for highlighting search result keywords within this field value. To enable search result highlighting, you also need to configure the query settings to support it. For information, see Search result highlighting. | | Medium |
Phrase Highlighting | Similar to Keyword Highlighting, but allows for highlighting entire phrases in search results. To enable search result highlighting, you also need to configure the query settings to support it. | | High |
Multiple values | Indicates that within a document, multiple values can exist for the field. All values in the document for this field are indexed. Cannot be used with Sort on field. | | High |
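The table notes that Sort on field cannot be combined with Allow field boosts or Multiple values. A minimal sketch of checking a planned set of use cases against those two restrictions might look like this; the function and constant names are hypothetical, and the system may enforce further rules that are not captured here.

```python
# Incompatible pairs taken from the use case descriptions in the table.
INCOMPATIBLE_USE_CASES = {
    frozenset({"Sort on field", "Allow field boosts"}),
    frozenset({"Sort on field", "Multiple values"}),
}


def find_conflicts(use_cases):
    """Return the incompatible pairs present in the selected use cases."""
    selected = set(use_cases)
    return sorted(
        tuple(sorted(pair))
        for pair in INCOMPATIBLE_USE_CASES
        if pair <= selected
    )


print(find_conflicts(["Sort on field", "Multiple values", "Retrieve contents"]))
# [('Multiple values', 'Sort on field')]
```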
Handling multivalued fields
Some documents in your data sources might have more than one value for a field. For example, one document might have two or more authors. You need to handle these documents in both your processing pipeline and your index collection to ensure that the system indexes each value separately.
When you run a document with a multivalued field through the Text and Metadata Extraction stage in a pipeline, the values for the multivalued field are extracted as a single value.
For example, a document authored by both A. Smith and B. Smith contains this field/value pair:
Author : "A.Smith;B.Smith"
To split this single value into multiple values, add a Tokenizer stage to your pipeline. You configure this stage to split up field values based on a specified character which, in this example, is a semicolon.
For more information, see Tokenizer stage.
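The effect of splitting on a semicolon can be illustrated with plain Python. This is not the Tokenizer stage's implementation, only a sketch of the transformation it is configured to perform; the function name is hypothetical, and trimming whitespace and dropping empty parts are assumptions.

```python
def split_field_value(value, delimiter=";"):
    """Split one extracted value into a list of trimmed, non-empty values."""
    return [part.strip() for part in value.split(delimiter) if part.strip()]


document = {"Author": "A.Smith;B.Smith"}
document["Author"] = split_field_value(document["Author"])
print(document["Author"])  # ['A.Smith', 'B.Smith']
```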
To index all values for a multivalued field, you need to configure the field to support multiple values in the index collection. You can do this by:
- Setting a plural type for the field, such as strings or tdates.
- Setting the multiValued field attribute for the field.
- Setting the Multiple values use case for the field.
For more information, see Adding and editing fields in an index collection schema.
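To make the first two options concrete, the sketch below checks whether a field configuration would accept multiple values, either through a plural type or through the multiValued attribute. The set of plural types is copied from the Multivalued Version column of the field types table; the function name and the way attributes are passed in are assumptions made for the example.

```python
# Plural types from the Multivalued Version column of the field types table.
PLURAL_TYPES = {
    "booleans", "dates", "tdates", "date_longs", "doubles", "tdoubles",
    "floats", "tfloats", "ints", "tints", "longs", "tlongs", "strings",
}


def supports_multiple_values(field_type, attributes=()):
    """Return True if the type is plural or multiValued is enabled."""
    return field_type in PLURAL_TYPES or "multiValued" in attributes


print(supports_multiple_values("strings"))                  # True
print(supports_multiple_values("string"))                   # False
print(supports_multiple_values("string", ["multiValued"]))  # True
```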
Field attributes
This table describes the attributes that you can enable for fields in an index collection schema. A field's attributes determine how your users can use the field in their queries and how the fields are presented to your users in search results. Some attributes are not compatible with others.
The Impact column indicates the relative effect that enabling the attribute has on the amount of information stored in the index.
Generally, you should minimize the number of attributes you enable for a field, and also minimize the number of fields in your index collection.
Attribute | Definition |
indexed | The field is added to the search index, allowing users to make queries based on it. The field can also appear as a facet for narrowing down a set of search results. |
stored | The contents of the field are returned in search results. |
docValues |
The field and its values are stored in a manner that is more efficient for supporting sorting, faceting, and highlighting. Enable this value only for fields that you've also configured to support sorting, faceting, or highlighting. This attribute is available only for these field types:
For information on field types, see Field types. |
multiValued |
When enabled for a field, indicates that a single document can contain multiple values for that field. Multivalued fields cannot be used to sort search results. |
omitNorms |
When enabled, length normalization and index-time boosting are disabled to save memory. Only full-text fields or index-time boosted fields need norms. This attribute is enabled by default for all primitive field types, such as int, float, boolean, or string. |
termVectors | When enabled, a term vector is created for the field. A vector shows the number of unique terms that occur within a document for the field and the number of occurrences of each unique term. Used for the Phrase Highlighting use case. For information, see Use cases. Important: Enabling this attribute has significant impact on index size. |
termPositions | When enabled, for each term in the term vector, shows the term's position in the document. Used for the Phrase Highlighting use case. For information, see Use cases. Important: Enabling this attribute has significant impact on index size. |
required | When true, documents are omitted from the index if they do not have a value for the specified field. The default is false. Tip: Alternatively, you can use a Filter stage in your processing pipeline to omit documents from the index. For information on this stage, see Filter stage. |
uniqueKey |
Enabled only for the For information on the |
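The effect of the required attribute can be pictured as a filter over incoming documents: any document that lacks a value for a required field is left out. The following sketch illustrates that behavior only. The function names are hypothetical, and treating an empty string or empty list as "no value" is an assumption.

```python
def has_value(document, field):
    """Return True if the document carries a non-empty value for the field."""
    value = document.get(field)
    return value not in (None, "", [])  # assumed definition of "no value"


def filter_indexable(documents, required_fields):
    """Keep only documents that have a value for every required field."""
    return [
        doc for doc in documents
        if all(has_value(doc, field) for field in required_fields)
    ]


docs = [
    {"title": "Annual report", "author": "A. Smith"},
    {"title": "Untitled draft"},
]
print(filter_indexable(docs, ["author"]))
# [{'title': 'Annual report', 'author': 'A. Smith'}]
```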
Importing fields from a workflow into an index collection
You can run a workflow task to automatically discover the fields in your data sources. You can then have the system add those fields to an index collection schema.
This method has these advantages:
- You can add many fields to an index collection schema together, rather than adding and configuring them individually.
- The workflow task analyzes all values for each field and recommends the most appropriate type for each one. You don't need to pick the field types yourself or use dynamic fields, which can be error-prone when picking field types.
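To give a feel for what recommending a type from observed values can involve, here is a minimal sketch that picks the narrowest of int, long, double, or string that can hold every value of a field. It is not the workflow task's actual algorithm, which is not described here. The function names, the candidate types considered, and the fallback to string are all assumptions.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

# Candidate types ordered from narrowest to widest (assumed ordering).
TYPE_ORDER = ["int", "long", "double", "string"]


def classify(value):
    """Return the narrowest candidate type that can hold one string value."""
    try:
        number = int(value)
    except ValueError:
        number = None
    if number is not None:
        if INT32_MIN <= number <= INT32_MAX:
            return "int"
        if INT64_MIN <= number <= INT64_MAX:
            return "long"
        return "string"  # too large for a 64-bit integer
    try:
        float(value)
    except ValueError:
        return "string"
    return "double"


def recommend_type(values):
    """Return the narrowest type that fits every value; string if none given."""
    widest = max(
        (TYPE_ORDER.index(classify(v)) for v in values),
        default=len(TYPE_ORDER) - 1,
    )
    return TYPE_ORDER[widest]


print(recommend_type(["1", "2", "42"]))     # int
print(recommend_type(["1", "3000000000"]))  # long
print(recommend_type(["1", "2.5"]))         # double
print(recommend_type(["1", "n/a"]))         # string
```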
Procedure
Click the Workflow Designer window.
Select the workflow that you want.
Click the Task window.
Click the Aggregations window.
Ensure that the workflow has an aggregation of type Document Field Aggregate. If it doesn't, create one. For information, see Adding aggregations to workflows.
Select the aggregation.
Select the fields that you want.
Click Add Fields to Index and then select the index you want to send the fields to.
Using Language Select
Language Select lets users customize the text_ fields of their index to display languages other than the default English. When Solr text_ fields are customized with a new language, Search App users will see them appear in their query results.
- If you change an index after documents have been indexed, you must reindex your data for the changes to take effect.
- To revert language selections, changes to the index will need to be made manually.
Located on the Workflow Designer App > Index Collections > Index > Schema tab, Language Select supports the following Solr-supported languages:
- Arabic: ar
- Armenian: hy
- Basque: eu
- Bulgarian: bg
- Catalan: ca
- Chinese, Japanese, Korean: cjk
- Czech: cz
- Danish: da
- Dutch: nl
- English: en
- Farsi: fa
- Finnish: fi
- French: fr
- Galician: gl
- German: de
- Greek: el
- Hindi: hi
- Hungarian: hu
- Indonesian: id
- Irish: ga
- Italian: it
- Japanese: ja
- Latvian: lv
- Norwegian: no
- Portuguese: pt
- Romanian: ro
- Russian: ru
- Serbian: sr
- Spanish: es
- Swedish: sv
- Thai: th
- Turkish: tr
- Ukrainian: uk
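The codes in this list correspond to the text_<language-code> field types described under Field types. As a small illustration, the sketch below builds the field type name for a language from a subset of the list; the dictionary and function names are hypothetical, and only a few languages are included for brevity.

```python
# Subset of the language codes listed above.
LANGUAGE_CODES = {
    "Arabic": "ar",
    "Czech": "cz",
    "English": "en",
    "French": "fr",
    "German": "de",
    "Japanese": "ja",
    "Spanish": "es",
}


def text_field_type(language):
    """Return the text_<language-code> field type name for a language."""
    return f"text_{LANGUAGE_CODES[language]}"


print(text_field_type("French"))  # text_fr
```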
To use Language Select:
Procedure
Ensure text_ fields exist in the schema.
- In the Workflow Designer App, click the Index Collections tab.
- Select your index.
- Click the Schema tab to view your available text_ fields.
Ensure all text_ fields are set to Refinable.
- In the Workflow Designer App, click the Index Collections tab.
- Select your index.
- Click the Query Settings tab.
- For all text_ fields in your index, set the Refinable toggle to Yes.
In the Workflow Designer App, go to Index Collections > Index > Schema > Language Select.
From the Language Select dropdown, select your preferred language.
Click the I understand the impact of my changes to the index check box to acknowledge the impact of language changes to an index.
Click Change Language.
To view your updated index, open the Search App and search for a document in your chosen language.