Search

Abstract

The search service is responsible for metadata and content extraction, stores that data as index and makes it searchable. The following clarifies the extraction terms metadata and content:

Metadata: all data that describes the file like Name, Size, MimeType, Tags and Mtime.
Content: all data that relates to content of the file like words, geo data, exif data etc.

General Considerations
Scaling
Search Engines
Query language
Extraction Engines
Content Extraction
- Basic Extractor
- Tika Extractor
Search Functionality
Manually Trigger Re-Indexing a Space
Notes
Example Yaml Config

General Considerations

To use the search service, an event system needs to be configured for all services like NATS, which is shipped and preconfigured.
The search service consumes events and does not block other tasks.
When looking for content extraction, Apache Tika - a content analysis toolkit can be used but needs to be installed separately.

Extractions are stored as index via the search service. Consider that indexing requires adequate storage capacity - and the space requirement will grow. To avoid filling up the filesystem with the index and rendering Infinite Scale unusable, the index should reside on its own filesystem.

You can change the path to where search maintains its data in case the filesystem gets close to full and you need to relocate the data. Stop the service, move the data, reconfigure the path in the environment variable and restart the service.

When using content extraction, more resources and time are needed, because the content of the file needs to be analyzed. This is especially true for big and multiple concurrent files.

The search service runs out of the box with the shipped default basic configuration. No further configuration is needed, except when using content extraction.

Consider using a dedicated hardware for this service in case more resources are needed.

Scaling

The search service can be scaled by running multiple instances. Some rules apply:

With SEARCH_ENGINE_BLEVE_SCALE=false, which is the default , the search service has exclusive write access to the index. Once the first search process is started, any subsequent {search processes attempting to access the index are locked out.
With SEARCH_ENGINE_BLEVE_SCALE=true, a search service will no longer have exclusive write access to the index. This setting must be enabled for all instances of the {search service.

Search Engines

By default, the search service is shipped with bleve as its primary search engine. The available engines can be extended by implementing the Engine interface and making that engine available.

Query language

By default, KQL is used as query language, for an overview of how the syntax works, please read the microsoft documentation for more details.

Not all parts are supported. The following list gives an overview of parts that are not implemented yet:

Synonym operators
Inclusion and exclusion operators
Dynamic ranking operator
ONEAR operator
NEAR operator
Date intervals

In the following ADR you can read why we chose KQL.

Extraction Engines

The search service provides the following extraction engines and their results are used as index for searching:

The embedded basic configuration provides metadata extraction which is always on.
The tika configuration, which additionally provides content extraction, if installed and configured.

Content Extraction

The search service is able to manage and retrieve many types of information. For this purpose the following content extractors are included:

Basic Extractor

This extractor is the most simple one and just uses the resource information provided by Infinite Scale. It does not do any further analysis.

Tika Extractor

This extractor is more advanced compared to the Basic extractor. The main difference is that this extractor is able to search file contents. However, Apache Tika is required for this task. Read the Getting Started with Apache Tika guide on how to install and run Tika or use a ready to run Tika container. See the Tika container usage document for a quickstart. Note that at the time of writing, containers are only available for the amd64 platform.

As soon as Tika is installed and accessible, the search service must be configured for the use with Tika. The following settings must be set:

SEARCH_EXTRACTOR_TYPE=tika
SEARCH_EXTRACTOR_TIKA_TIKA_URL=http://YOUR-TIKA.URL
FRONTEND_FULL_TEXT_SEARCH_ENABLED=true
When using the Tika extractor, make sure to also set this enironment variable in the frontend service. This will tell the web client that full-text search has been enabled.

When the search service can reach Tika, it begins to read out the content on demand. Note that files must be downloaded during the process, which can lead to delays with larger documents.

Content extraction and handling the extracted content can be very resource intensive. Content extraction is therefore limited to files with a certain file size. The default limit is 20MB and can be configured using the SEARCH_CONTENT_EXTRACTION_SIZE_LIMIT variable.

When extracting content, you can specify whether stop words like I, you, the are ignored or not. Noramlly, these stop words are removed automatically. To keep them, the environment variable SEARCH_EXTRACTOR_TIKA_CLEAN_STOP_WORDS must be set to false.

When using the Tika container and docker-compose, consider the following:

See the ocis_full example.
Containers for the linked service are reachable at a hostname identical to the alias or the service name if no alias was specified.

Search Functionality

The search service consists of two main parts which are file indexing and file search.

Indexing

Every time a resource changes its state, a corresponding event is triggered. Based on the event, the search service processes the file and adds the result to its index. There are a few more steps between accepting the file and updating the index.

IMPORTANT

When using the Tika Extractor, text and other data, such as EXIF data from images, are extracted from documents and written to the bleve index. Currently, this extra data cannot be searched. See the next section for more information.

Search

A query via the search service will return results based on the index created.

IMPORTANT

Though EXIF data can be present in the bleve index, currently, only text-related data can be extracted. Code must be written to make this type of extraction available to users.
Currently, there is no ocis shell command or similar mechanism to view or browse the bleve index. This capability would be highly beneficial for developers and administrators to determine the type of data contained in the index.

State Changes which Trigger Indexing

The following state changes in the life cycle of a file can trigger the creation of an index or an update:

Resource Trashed

The service checks its index to see if the file has been processed. If an index entry exists, the index will be marked as deleted. In consequence, the file won’t appear in search requests anymore. The index entry stays intact and could be restored via Resource Restored.

Resource Deleted

The service checks its index to see if the file has been processed. If an index entry exists, the index will be finally deleted. In consequence, the file won’t appear in search requests anymore.

Resource Restored

This step is the counterpart of Resource Trashed. When a file is deleted, is isn’t removed from the index, instead the service just marks it as deleted. This mark is removed when the file has been restored, and it shows up in search results again.

Resource Moved

This comes into play whenever a file or folder is renamed or moved. The search index then updates the resource location path or starts indexing if no index has been created so far for all items affected. See Notes for an example.

Folder Created

The creation of a folder always triggers indexing. The search service extracts all necessary information and stores it in the search index

File Created

This case is similar to Folder created with the difference that a file can contain far more valuable information. This gets interesting but time-consuming when data content needs to be analyzed and indexed. Content extraction is part of the search service if configured.

File Version Restored

Since Infinite Scale is capable of storing multiple versions of the same file, the search service also needs to take care of those versions. When a file version is restored, the service starts to extract all needed information, creates the index and makes the file discoverable.

Resource Tag Added

Whenever a resource gets a new tag, the service takes care of it and makes that resource discoverable by the tag.

Resource Tag Removed

This is the counterpart of Resource tag added. It takes care that a tag gets unassigned from the referenced resource.

File Uploaded - Synchronous

This case only triggers indexing if async post processing is disabled. If so, the service starts to extract all needed file information, stores it in the index and makes it discoverable.

File Uploaded - Asynchronous

This is exactly the same as File uploaded - synchronous with the only difference that it is used for asynchronous uploads.

Manually Trigger Re-Indexing a Space

The service includes a command-line interface to trigger re-indexing a space:

ocis search index --space $SPACE_ID

It can also be used to re-index all spaces:

ocis search index --all-spaces

Note that either --space $SPACE_ID or --all-spaces must be set.

Notes

The indexing process tries to be self-healing in some situations.

In the following example, let’s assume a file tree foo/bar/baz exists. If the folder bar gets renamed to new-bar, the path to baz is no longer foo/bar/baz but foo/new-bar/baz. The search service checks the change and either just updates the path in the index or creates a new index for all items affected if none was present.

Example Yaml Config

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55


# Autogenerated
# Filename: search-config-example.yaml

tracing:
  enabled: false
  type: ""
  endpoint: ""
  collector: ""
log:
  level: ""
  pretty: false
  color: false
  file: ""
debug:
  addr: 127.0.0.1:9224
  token: ""
  pprof: false
  zpages: false
grpc:
  addr: 127.0.0.1:9220
  tls: null
token_manager:
  jwt_secret: ""
reva:
  address: com.owncloud.api.gateway
  tls:
    mode: ""
    cacert: ""
grpc_client_tls: null
events:
  endpoint: 127.0.0.1:9233
  cluster: ocis-cluster
  async_uploads: true
  num_consumers: 0
  debounce_duration: 1000
  tls_insecure: false
  tls_root_ca_certificate: ""
  enable_tls: false
  username: ""
  password: ""
engine:
  type: bleve
  bleve:
    data_path: /var/lib/ocis/search
    scale: false
extractor:
  type: basic
  cs3_allow_insecure: false
  tika:
    tika_url: http://127.0.0.1:9998
    clean_stop_words: true
content_extraction_size_limit: 20971520
service_account:
  service_account_id: ""
  service_account_secret: ""

Environment Variables

Name	Type	Default Value	Description
OCIS_TRACING_ENABLED SEARCH_TRACING_ENABLED	bool	false	Activates tracing.
OCIS_TRACING_TYPE SEARCH_TRACING_TYPE	string		The type of tracing. Defaults to ‘’, which is the same as ‘jaeger’. Allowed tracing types are ‘jaeger’, ‘otlp’ and ’’ as of now.
OCIS_TRACING_ENDPOINT SEARCH_TRACING_ENDPOINT	string		The endpoint of the tracing agent.
OCIS_TRACING_COLLECTOR SEARCH_TRACING_COLLECTOR	string		The HTTP endpoint for sending spans directly to a collector, i.e. http://jaeger-collector:14268/api/traces. Only used if the tracing endpoint is unset.
OCIS_LOG_LEVEL SEARCH_LOG_LEVEL	string		The log level. Valid values are: ‘panic’, ‘fatal’, ’error’, ‘warn’, ‘info’, ‘debug’, ’trace’.
OCIS_LOG_PRETTY SEARCH_LOG_PRETTY	bool	false	Activates pretty log output.
OCIS_LOG_COLOR SEARCH_LOG_COLOR	bool	false	Activates colorized log output.
OCIS_LOG_FILE SEARCH_LOG_FILE	string		The path to the log file. Activates logging to this file if set.
SEARCH_DEBUG_ADDR	string	127.0.0.1:9224	Bind address of the debug server, where metrics, health, config and debug endpoints will be exposed.
SEARCH_DEBUG_TOKEN	string		Token to secure the metrics endpoint.
SEARCH_DEBUG_PPROF	bool	false	Enables pprof, which can be used for profiling.
SEARCH_DEBUG_ZPAGES	bool	false	Enables zpages, which can be used for collecting and viewing in-memory traces.
SEARCH_GRPC_ADDR	string	127.0.0.1:9220	The bind address of the GRPC service.
OCIS_JWT_SECRET SEARCH_JWT_SECRET	string		The secret to mint and validate jwt tokens.
OCIS_REVA_GATEWAY	string	com.owncloud.api.gateway	The CS3 gateway endpoint.
OCIS_GRPC_CLIENT_TLS_MODE	string		TLS mode for grpc connection to the go-micro based grpc services. Possible values are ‘off’, ‘insecure’ and ‘on’. ‘off’: disables transport security for the clients. ‘insecure’ allows using transport security, but disables certificate verification (to be used with the autogenerated self-signed certificates). ‘on’ enables transport security, including server certificate verification.
OCIS_GRPC_CLIENT_TLS_CACERT	string		Path/File name for the root CA certificate (in PEM format) used to validate TLS server certificates of the go-micro based grpc services.
OCIS_EVENTS_ENDPOINT SEARCH_EVENTS_ENDPOINT	string	127.0.0.1:9233	The address of the event system. The event system is the message queuing service. It is used as message broker for the microservice architecture.
OCIS_EVENTS_CLUSTER SEARCH_EVENTS_CLUSTER	string	ocis-cluster	The clusterID of the event system. The event system is the message queuing service. It is used as message broker for the microservice architecture. Mandatory when using NATS as event system.
OCIS_ASYNC_UPLOADS SEARCH_EVENTS_ASYNC_UPLOADS	bool	true	Enable asynchronous file uploads.
SEARCH_EVENTS_NUM_CONSUMERS	int	0	The amount of concurrent event consumers to start. Event consumers are used for searching files. Multiple consumers increase parallelisation, but will also increase CPU and memory demands. The default value is 0.
SEARCH_EVENTS_REINDEX_DEBOUNCE_DURATION	int	1000	The duration in milliseconds the reindex debouncer waits before triggering a reindex of a space that was modified.
OCIS_INSECURE SEARCH_EVENTS_TLS_INSECURE	bool	false	Whether to verify the server TLS certificates.
OCIS_EVENTS_TLS_ROOT_CA_CERTIFICATE SEARCH_EVENTS_TLS_ROOT_CA_CERTIFICATE	string		The root CA certificate used to validate the server’s TLS certificate. If provided SEARCH_EVENTS_TLS_INSECURE will be seen as false.
OCIS_EVENTS_ENABLE_TLS SEARCH_EVENTS_ENABLE_TLS	bool	false	Enable TLS for the connection to the events broker. The events broker is the ocis service which receives and delivers events between the services.
OCIS_EVENTS_AUTH_USERNAME SEARCH_EVENTS_AUTH_USERNAME	string		The username to authenticate with the events broker. The events broker is the ocis service which receives and delivers events between the services.
OCIS_EVENTS_AUTH_PASSWORD SEARCH_EVENTS_AUTH_PASSWORD	string		The password to authenticate with the events broker. The events broker is the ocis service which receives and delivers events between the services.
SEARCH_ENGINE_TYPE	string	bleve	Defines which search engine to use. Defaults to ‘bleve’. Supported values are: ‘bleve’.
SEARCH_ENGINE_BLEVE_DATA_PATH	string	/var/lib/ocis/search	The directory where the filesystem will store search data. If not defined, the root directory derives from $OCIS_BASE_DATA_PATH/search.
SEARCH_ENGINE_BLEVE_SCALE	bool	false	Enable scaling of the search index (bleve). If set to ’true’, the instance of the search service will no longer have exclusive write access to the index. Note when scaling search, all instances of the search service must be set to true! For ‘false’, which is the default, the running search service has exclusive access to the index as long it is running. This locks out other search processes tying to access the index.
SEARCH_EXTRACTOR_TYPE	string	basic	Defines the content extraction engine. Defaults to ‘basic’. Supported values are: ‘basic’ and ’tika’.
OCIS_INSECURE SEARCH_EXTRACTOR_CS3SOURCE_INSECURE	bool	false	Ignore untrusted SSL certificates when connecting to the CS3 source.
SEARCH_EXTRACTOR_TIKA_TIKA_URL	string	http://127.0.0.1:9998	URL of the tika server.
SEARCH_EXTRACTOR_TIKA_CLEAN_STOP_WORDS	bool	true	Defines if stop words should be cleaned or not. See the documentation for more details.
SEARCH_CONTENT_EXTRACTION_SIZE_LIMIT	uint64	20971520	Maximum file size in bytes that is allowed for content extraction.
OCIS_SERVICE_ACCOUNT_ID SEARCH_SERVICE_ACCOUNT_ID	string		The ID of the service account the service should use. See the ‘auth-service’ service description for more details.
OCIS_SERVICE_ACCOUNT_SECRET SEARCH_SERVICE_ACCOUNT_SECRET	string		The service account secret.

Search

Abstract

Table of Contents

General Considerations

Scaling

Search Engines

Query language

Extraction Engines

Content Extraction

Basic Extractor

Tika Extractor

Search Functionality

Indexing

Search

State Changes which Trigger Indexing

Resource Trashed

Resource Deleted

Resource Restored

Resource Moved

Folder Created

File Created

File Version Restored

Resource Tag Added

Resource Tag Removed

File Uploaded - Synchronous

File Uploaded - Asynchronous

Manually Trigger Re-Indexing a Space

Notes

Example Yaml Config

Environment Variables