16. Storage for Files Metadata
In addition to the file content, we need to store the metadata attached to a file. Metadata describes additional properties of a file. These properties need to be stored as close as possible to the file content to avoid inconsistencies. Metadata is key to workflows and search. We consider it an additional value that enhances the file content.
- Metadata will become more important in the future
- Metadata are key to automated data processing
- Metadata storage should be as close as possible to the file content
- Metadata should always be in sync with the file content
Considered options:
- Database
- Extended file attributes
- Metadata file next to the file content
- Linked metadata in separate file
Chosen option: “Extended File Attributes”, because it keeps the metadata consistent with the file content and gives us a simple storage mechanism for arbitrary key/value pairs.
- Metadata is always attached to the file itself
- We can store arbitrary key/values
- No external dependencies are needed
Negative consequences:
- The storage inside extended file attributes has limits
- Changes to extended attributes are not atomic and need file locks
Use a Database or an external key/value store to persist metadata.
- Good, because it scales well
- Good, because databases provide efficient lookup mechanisms
- Bad, because the file content and the metadata could get out of sync
- Bad, because a storage backup doesn’t cover the file metadata
Extended File Attributes allow storing arbitrary properties. There are four namespaces: user, system, trusted and security. We can safely use the user namespace. An example attribute name would be user.ocis.owner.id. The Linux kernel has length limits on attribute names and values.
From Wikipedia on Extended file attributes:
The Linux kernel allows extended attribute to have names of up to 255 bytes and values of up to 64 KiB, as do XFS and ReiserFS, but ext2/3/4 and btrfs impose much smaller limits, requiring all the attributes (names and values) of one file to fit in one “filesystem block” (usually 4 KiB). Per POSIX.1e, the names are required to start with one of security, system, trusted, and user plus a period. This defines the four namespaces of extended attributes.
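To make these limits concrete, the following is a minimal sketch of how attribute names and values could be checked before they are written. It assumes Linux and Python's os.setxattr; the helper names (attr_name, check_limits, write_attrs) and the user.ocis. prefix handling are illustrative and not part of any existing implementation. The per-block check only sums name and value bytes, so it is a rough lower bound and ignores the filesystem's own bookkeeping overhead.

```python
import os

XATTR_NAME_MAX = 255        # bytes; Linux kernel limit for an attribute name
XATTR_SIZE_MAX = 64 * 1024  # bytes; Linux kernel limit for a single value
BLOCK_SIZE = 4 * 1024       # bytes; assumed block size for ext2/3/4-style limits


def attr_name(key: str, prefix: str = "user.ocis.") -> str:
    """Build a name in the user namespace, e.g. user.ocis.owner.id."""
    return prefix + key


def check_limits(attrs: dict, block_size: int = BLOCK_SIZE) -> list:
    """Return a list of problems; an empty list means all limits are respected."""
    problems = []
    total = 0
    for name, value in attrs.items():
        name_len = len(name.encode("utf-8"))
        if not name.startswith("user."):
            problems.append(f"{name}: not in the user namespace")
        if name_len > XATTR_NAME_MAX:
            problems.append(f"{name[:20]}...: name is longer than {XATTR_NAME_MAX} bytes")
        if len(value) > XATTR_SIZE_MAX:
            problems.append(f"{name}: value is larger than {XATTR_SIZE_MAX} bytes")
        total += name_len + len(value)
    if total > block_size:
        problems.append(f"total size {total} exceeds one block of {block_size} bytes")
    return problems


def write_attrs(path: str, attrs: dict) -> None:
    """Write all attributes to path, refusing up front if a limit would be hit."""
    problems = check_limits(attrs)
    if problems:
        raise ValueError("; ".join(problems))
    for name, value in attrs.items():
        os.setxattr(path, name, value)  # values must be bytes
```

Checking before writing means a file never ends up with only part of its metadata because a later attribute hit a limit.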
- Good, because metadata is stored in the filesystem
- Good, because consistency is easy to maintain
- Good, because the data is attached to the file and survives file operations like move, and copy when the tool is told to preserve extended attributes
- Good, because a storage backup also covers the file metadata
- Bad, because we could hit the filesystem limit
- Bad, because changes to extended attributes are not atomic
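The locking needed for multi-attribute updates could be sketched as follows. This is only an illustration under stated assumptions: it uses an advisory flock on the file itself, so it protects only against other writers that take the same lock, and the names exclusive_lock and update_attrs are hypothetical. The setter parameter exists so the logic can be exercised without a filesystem that supports extended attributes.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(path):
    """Hold an exclusive advisory lock on path for the duration of the block."""
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until other holders release
        yield fd
    finally:
        os.close(fd)  # closing the descriptor also releases the lock


def update_attrs(path, attrs, setter=None):
    """Apply several attribute changes as one unit with respect to other lockers."""
    if setter is None:
        setter = os.setxattr  # Linux only
    with exclusive_lock(path):
        for name, value in attrs.items():
            setter(path, name, value)
```

Readers that need a consistent view of several attributes would have to take a shared lock (fcntl.LOCK_SH) in the same way.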
We could store metadata in a file next to the file content, in a structured format such as JSON, YAML or TOML. That would give us room for larger amounts of metadata.
- Good, because there are no size limits
- Good, because there is more freedom to the content format
- Good, because a storage backup also covers the file metadata
- Bad, because it doubles the amount of read / write operations
- Bad, because it needs additional measures against concurrent overwriting changes
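As an illustration of this option, a sidecar file could be written as sketched below. The .meta.json suffix and the function names are assumptions made for the example. Writing to a temporary file and renaming it over the target keeps readers from seeing a half-written file, but it does not prevent lost updates: with two concurrent writers the last one still wins, so a lock or version check would be needed on top.

```python
import json
import os
import tempfile


def sidecar_path(content_path: str) -> str:
    """Hypothetical naming scheme: metadata lives next to the content file."""
    return content_path + ".meta.json"


def write_sidecar(content_path: str, metadata: dict) -> None:
    """Replace the sidecar file in one step via a temporary file and rename."""
    target = sidecar_path(content_path)
    directory = os.path.dirname(target) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".meta-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(metadata, f, sort_keys=True)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes are on disk before renaming
        os.replace(tmp, target)   # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def read_sidecar(content_path: str) -> dict:
    """Load the metadata stored next to the content file."""
    with open(sidecar_path(content_path), encoding="utf-8") as f:
        return json.load(f)
```

Every metadata change costs an extra file write and rename, which is the doubled I/O noted above.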
To link metadata to file content, a single extended attribute with a file id (unique per storage space) is sufficient. This would also allow putting metadata in better-suited storage systems like SQLite or a key/value store.
- Good, because it avoids extended attribute limits
- Good, because the same mechanism could be used to look up files by id, when the underlying filesystem is an existing POSIX filesystem.
- Bad, because backup needs to cover the metadata as well. Could be mitigated by sharing metadata per space and doing space wide snapshots.
- Bad, because reading or indexing the metadata takes a bit more effort.
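A minimal sketch of the linked approach, assuming SQLite as the metadata store: the file carries only an id in one extended attribute, and everything else is looked up by that id. The attribute name in ID_ATTR, the table layout and the MetadataStore class are assumptions for illustration, not an existing schema.

```python
import os
import sqlite3

ID_ATTR = "user.ocis.id"  # assumed name of the attribute that holds the file id


def file_id(path: str) -> str:
    """Read the file id from the single extended attribute (Linux only)."""
    return os.getxattr(path, ID_ATTR).decode("utf-8")


class MetadataStore:
    """Key/value metadata per file id, kept in one SQLite database per space."""

    def __init__(self, db_path: str = ":memory:"):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS metadata ("
            " file_id TEXT NOT NULL,"
            " key TEXT NOT NULL,"
            " value TEXT NOT NULL,"
            " PRIMARY KEY (file_id, key))"
        )

    def set(self, fid: str, key: str, value: str) -> None:
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT OR REPLACE INTO metadata (file_id, key, value) VALUES (?, ?, ?)",
                (fid, key, value),
            )

    def get(self, fid: str) -> dict:
        rows = self.db.execute(
            "SELECT key, value FROM metadata WHERE file_id = ?", (fid,)
        )
        return dict(rows)

    def find(self, key: str, value: str) -> list:
        """Look up file ids by metadata value, which plain xattrs cannot do cheaply."""
        rows = self.db.execute(
            "SELECT file_id FROM metadata WHERE key = ? AND value = ? ORDER BY file_id",
            (key, value),
        )
        return [row[0] for row in rows]
```

A typical read would be store.get(file_id(path)). Keeping one database per storage space is what would let a space-wide snapshot cover content and metadata together.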