19. File Search Index
The ability to find files based on certain search terms is a key requirement for a system that provides the ability to store unstructured data on a large scale.
More sophisticated search capabilities are expected and can be implemented, especially based on metadata.
To trigger the indexing of a file, the search service listens to create, update and delete events on the internal event bus of oCIS.
The events need to contain a valid reference that defines the file space and file id of the file in question. The event only must be sent when the file operation (update, creation, removal) is finished.
Sharing adds more complexity because the index also needs to react to create, delete and modify shares events. Sharing should not duplicate the indexed data, especially within spaces or group shares.
- Have a simple yet powerful, scalable and performant way of finding files in oCIS
- Be able to construct intelligent searches based on metadata
- Allow the user to filter the search queries based on metadata
- Basic File Search needs to be implemented out of the box without external dependencies
- The Search Indexing Service should be replaceable with more sophisticated technologies like Elasticsearch
- Make use of the spaces architecture to shard search indexes by space
- The Search Indexing Service needs to deal with multiple users accessing the same resources due to shares
- The Search Service should be compatible with different search indexing technologies
Chosen option: Bleve Search, because we can fulfill the MVP and include it into the single binary.
- Basic File Search works out of the box
- We do not need heavy external dependencies which need to be deployed alongside
- We need to be aware of the scaling limits
- We need to find a way to work with shares and spaces
- It has a limited query language
- Good, because it is written in GoLang and can be bundled into the single oCIS binary
- Good, because it is a lightweight but powerful solution which could fulfill a lot of use cases
- Bad, because we do not know exactly how we can represent shares in the index without duplicating data
- Bad, because it is a single process
- Bad, because the query language is limited
- Good, because it has become an industry standard
- Good, because it supports a rich query language
- Good, because it has built in cluster support and scales well
- Good, because it has a permission system and supports multiple users and groups to access the same resource
- Bad, because it is a heavy setup and needs extra effort and knowledge