Bucket Storage

Overview

Bucket storage provides these significant features in Humio:

  • Makes cloud deployments of Humio cheaper, faster, and easier to run by using ephemeral disks as primary storage.

  • Enables recovery from the loss of multiple Humio nodes without any data loss.

  • Allows keeping more events than the local disks can hold, enabling infinite retention.

  • Lets new nodes join the cluster starting from only configuration, Kafka, and bucket storage.

[Diagram: Humio Node 1, Humio Node 2, …, Humio Node n all connect to the same S3 or GCP bucket.] A bucket is shared as storage by a number of Humio nodes.

Humio supports writing a copy of the ingested logs to Amazon S3 and Google Cloud Storage using the native file format of Humio, allowing Humio to fetch those files and search them efficiently if the local copies are lost or deleted.

The ability to recover from the loss of Humio nodes requires the Kafka cluster not to lose recent data. In case of simultaneous failure of Humio and Kafka, recent events may be lost.

Is this not what you were looking for? See S3 Archiving for how to make Humio export ingested events into S3 buckets in an open format that other applications can read, but that is not searchable by Humio.

The copies written to the bucket are considered “safe” compared to a local copy, as the bucket storage provided by S3 and GCP keeps multiple copies in multiple locations. A bucket has unlimited storage and can hold an unlimited number of files. This allows the Humio cluster to fetch a file from the provider if the node(s) holding the local copies are lost. It also allows configuring Humio for infinite retention by letting Humio delete local files that are also in the bucket and fetch the data from the bucket once a search requests them.

Humio manages the set of files in the bucket according to each repository's retention settings and deletes files from the bucket once those settings mark them as no longer relevant.

Using Bucket storage for a cluster in a cloud environment

Bucket storage allows you to run Humio without any persistent disks, and doing so is encouraged in cloud environments: persistent disks there consume network bandwidth that is better spent on bucket storage and internal cluster traffic. The Kafka cluster, however, must remain persistent and independent of the Humio cluster. It is also important to tell Humio that the local disks are not to be trusted by setting USING_EPHEMERAL_DISKS=true, so that Humio does not trust replicas supposedly present on other nodes in the cluster, as they may vanish in this kind of environment.

Using network-attached disks in combination with bucket storage is discouraged in cloud environments, as you often end up using much more network bandwidth that way.

The compression type should be set to “high” with level 9 (the default in 1.7.0 and later) to ensure that the files transferred to and from the bucket are compressed as much as possible. This reduces the storage requirement both locally and in the bucket, reduces network traffic, and improves search performance when data is not present locally.
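
For reference, the relevant settings might look like this in the Humio configuration (assuming the COMPRESSION_TYPE and COMPRESSION_LEVEL variables described on the configuration page):

# Assumed example; “high” at level 9 is the default from 1.7.0 onwards.
COMPRESSION_TYPE=high
COMPRESSION_LEVEL=9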

When using ephemeral disks, the node ID must be assigned in a controlled manner; the usual strategy of storing it in the file system no longer works. See the ZOOKEEPER_PREFIX_FOR_NODE_UUID option on the configuration page for a method of assigning unique node IDs using ZooKeeper.
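
A minimal sketch of that setup could look like the following; the prefix value is only a placeholder, see the configuration page for the exact semantics:

# Let ZooKeeper hand out and persist a unique node ID per node (placeholder prefix).
ZOOKEEPER_PREFIX_FOR_NODE_UUID=/humio_autouuid_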

Combining the “backup to a remote file system” feature with bucket storage is discouraged; bucket storage replaces all functionality of the previous backup system.

Target bucket

Humio has the concept of a “Target bucket”, to which new files are uploaded. A node has a single target at any point in time. All buckets that have received a file at some point will get registered in “global” in the Humio cluster. This allows any node in the cluster to fetch files from all buckets that have ever been a target for any node in the cluster.

Bucket storage should be configured for the entire cluster to ensure that all segment files get uploaded. There are no settings for individual repositories, and all data gets written to a single bucket. It is okay to switch to a “fresh” bucket at any point in time, including on only a subset of the nodes or in a rolling-update fashion. A node uploads to the configured target bucket. Note that nodes will need to be able to fetch files from the previous target buckets for as long as those buckets contain files the nodes need. Do not delete or empty the previous target buckets, and make sure not to revoke any authentication tokens required for reading the existing files in them. When a fresh target is configured, existing files are not re-uploaded to it; Humio trusts the existing copies in the previous targets.

Achieving high-availability using bucket storage

It’s possible to achieve high-availability of ingested events using this feature in combination with a Kafka installation that is not on the same file systems as the Humio segment files. With Humio set up to have multiple digest nodes for each partition, ingested events are either still in the ingest queue in Kafka, or on one of the Humio nodes designated as digest node, or in the bucket storage. Humio can resume from a combination of these after sudden and complete loss of a single Humio node.

The local data storage directory holds the segment files as they get constructed. Once a segment file is complete, it gets uploaded to the bucket. Files copied to bucket storage also stay on the local disk in most cases. See Managing local disk space on how to allow Humio to delete local files that also exist in the bucket.

Humio also uploads the global-snapshot.json files to the bucket. This allows starting a node with an empty local data directory if there is a recent snapshot in the bucket configured as the node's target. The node fetches the snapshot from the bucket and then catches up on the current state by reading the global events from Kafka written since the snapshot was taken.

For more information on the concepts of segment files and datasources, see segment files and datasources.

What about having only one node assigned to each partition?

Partition assignments still apply to segment files when bucket storage is configured. While it is possible to run with just one node assigned in both the digest and storage partition rules, that choice has consequences for availability in case of node failure.

The reason is that only “complete” files get put in the bucket, not the “in-progress work”. The partition scheme decides which hosts are applicable as storage for any particular segment file. Humio does not let every node download any file from the bucket; only the nodes listed on the partitions that apply to a segment may fetch it.

This means that if you set up storage rules to only a single node on each partition, then a single-node failure would make most queries respond with a subset of the events while the node is down. No data would be lost, but the results would be partial until the node returns to the cluster.

If you set up digest rules to have just one node on any partition, then the failure of such a node will result in incoming events on that partition not being processed by Humio but just remaining queued in Kafka. This may eventually lead to data loss as Kafka is not set up for infinite retention by default — and doing so would likely overflow Kafka at some point.

Another consideration is that having 2 replicas in both storage and digest partition rules also makes the system resilient to cases where the bucket storage upload is unable to keep up or even fails to accept uploads at all for a while. Such a scenario is a problem if any partition has only a single node assigned, as that node may fail and there are then no other copies of the data.

For storage partition rules, two replicas are recommended to allow queries to complete without responding with a “partial results” warning in case of a single node failure.

The recommendation is therefore to run with two digest replicas and two storage replicas on all partitions, for better reliability of the system.

A future version of Humio will improve this by allowing queries to complete without partial results after a single-node failure, even when the storage partition replica count is set to just one: any node in the cluster will be able to fetch a segment for the query if the primary node for that segment fails. Digest partitions will still need at least two replicas for ingest to keep flowing through a single-node failure.

Security

The copies in the bucket are encrypted by Humio using AES-256 while being uploaded to ensure that even if read access to the bucket is accidentally allowed, an attacker is not able to read any events or other information from the data in the files. The encryption key is based on the seed key string set in the configuration, and each file gets encrypted using a unique key derived from that seed. The seed key is stored in global along with all the other information required to read and decrypt the files in that bucket. Humio handles changes to the encryption seed key by keeping track of which files were encrypted using what secret.

Humio computes a checksum of the local file (after encryption) while uploading and this checksum must match that computed by the receiver for the upload to complete successfully. This ensures that the bucket does not hold invalid or incomplete copies of the segment files.
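
This is conceptually similar to supplying a Content-MD5 value on an S3 upload, which makes the service reject the object if the received bytes do not match the checksum. The following AWS CLI commands only illustrate that mechanism and are not Humio's implementation; the bucket and file names are hypothetical:

# Compute the base64-encoded MD5 of the (encrypted) file and let S3 verify it on upload.
MD5=$(openssl dgst -md5 -binary segment.dat | base64)
aws s3api put-object --bucket my-humio-bucket --key segments/segment.dat \
    --body segment.dat --content-md5 "$MD5"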

Self-hosted setup for S3

Bucket storage is only available in version 1.7.0 and above.

Humio supports multiple ways of configuring access to the bucket. You can use any of the five methods listed in Using the Default Credential Provider Chain, in addition to the following, which takes precedence if set.

If you run your Humio installation outside AWS, you need an IAM user with write access to the buckets used for storage. That user must have programmatic access to S3, so when adding a new user through the AWS console make sure that programmatic access is ticked:

Adding user with programmatic access.

Later in the process, you can retrieve the access key and secret key:

Access key and secret key.
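
If you prefer the AWS CLI over the console, the same two steps can be performed like this; the user name is only an example:

# Create an IAM user for Humio and generate an access key / secret key pair for it.
aws iam create-user --user-name humio-bucket-storage
aws iam create-access-key --user-name humio-bucket-storage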

These keys are needed in the following Humio configuration:

# These two take precedence over all other AWS access methods.
S3_STORAGE_ACCESSKEY=$ACCESS_KEY
S3_STORAGE_SECRETKEY=$SECRET_KEY

# Also supported:
AWS_ACCESS_KEY_ID=$ACCESS_KEY
AWS_SECRET_ACCESS_KEY=$SECRET_KEY

The keys are used for authenticating the user against the S3 service. For more guidance on how to retrieve S3 access keys, see AWS access keys. More details on creating a new user are available in the IAM documentation.

Configuring the user to have write access to a bucket can be done by attaching a policy to the user.

IAM User Example Policy

The following JSON is an example policy configuration.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::BUCKET_NAME"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
		"s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::BUCKET_NAME/*"
            ]
        }
    ]
}

The policy can be used as an inline policy attached directly to the user through the AWS console:

Attach inline policy.
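
Alternatively, the same policy can be attached with the AWS CLI; the user and policy names are only examples, and the JSON above is assumed to be saved as humio-bucket-policy.json:

# Attach the policy as an inline policy on the IAM user.
aws iam put-user-policy --user-name humio-bucket-storage \
    --policy-name humio-bucket-storage-policy \
    --policy-document file://humio-bucket-policy.json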

You must also tell Humio which bucket to use. These settings must be identical on all nodes across the entire cluster.

S3_STORAGE_BUCKET=$BUCKET_NAME
S3_STORAGE_REGION=$BUCKET_REGION
# The secret can be any UTF-8 string. The suggested value is 64 or more random ASCII characters.
S3_STORAGE_ENCRYPTION_KEY=$ENCRYPTION_SECRET
# Optional prefix for all object keys, empty if not set.
# Allows sharing a bucket among multiple Humio clusters by letting each write to a unique prefix.
# Note! There is a performance penalty when using a non-empty prefix. Humio recommends an unset prefix.
# S3_STORAGE_OBJECT_KEY_PREFIX=/basefolder

# If there are any ephemeral disks in the cluster you must include USING_EPHEMERAL_DISKS=true
USING_EPHEMERAL_DISKS=true
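
As an optional sanity check of the keys, bucket, and region before starting Humio, you can list the bucket with the AWS CLI:

# Should succeed (possibly with empty output) if the credentials and bucket settings are correct.
aws s3 ls s3://$BUCKET_NAME --region $BUCKET_REGION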

Using Google Cloud Storage

You need to create a Google service account that is authorized to manage the contents of the bucket(s) that will hold the data. See Obtaining and providing service account credentials manually for documentation, and go to the Create service account key page.
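
If you use the gcloud and gsutil command-line tools instead of the console, the steps look roughly like this; the account name, project, and role are examples rather than requirements from Humio:

# Create a service account, grant it access to the bucket, and download a JSON key for it.
gcloud iam service-accounts create humio-bucket-storage --project my-project
gsutil iam ch serviceAccount:humio-bucket-storage@my-project.iam.gserviceaccount.com:roles/storage.objectAdmin gs://$BUCKET_NAME
gcloud iam service-accounts keys create GCS-project-example.json \
    --iam-account humio-bucket-storage@my-project.iam.gserviceaccount.com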

Once you have the JSON file with a set of credentials for such an account, place them in the etc directory of each Humio node and provide the full path to the file in the config like this:

GCP_STORAGE_ACCOUNT_JSON_FILE=/path/to/GCS-project-example.json

The JSON file must include the fields project_id, client_email and private_key. Any other field in the file is currently ignored.
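
For reference, the relevant part of such a key file looks like the following skeleton (values redacted; real files contain additional fields, which Humio ignores):

{
  "project_id": "my-project",
  "client_email": "humio-bucket-storage@my-project.iam.gserviceaccount.com",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n"
}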

You must also tell Humio which bucket to use.

GCP_STORAGE_BUCKET=$BUCKET_NAME
# The secret can be any UTF-8 string. The suggested value is 64 or more random ASCII characters.
GCP_STORAGE_ENCRYPTION_KEY=$ENCRYPTION_SECRET
# Optional prefix for all object keys, empty if not set.
# Allows sharing a bucket among multiple Humio clusters by letting each write to a unique prefix.
# Note! There is a performance penalty when using a non-empty prefix. Humio recommends an unset prefix.
# GCP_STORAGE_OBJECT_KEY_PREFIX=/basefolder

# If there are any ephemeral disks in the cluster you must include USING_EPHEMERAL_DISKS=true
USING_EPHEMERAL_DISKS=true

Switching to a fresh bucket

You can change the S3_STORAGE_BUCKET and S3_STORAGE_REGION settings to point to a fresh bucket at any point in time. From that point, Humio writes new files to that bucket while still reading from any previously configured buckets. Existing files already written to a previous bucket are not copied to the new bucket. Humio continues to delete files from the old buckets that match the file names Humio would put there.
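
For example, switching could be as simple as updating these two settings on the nodes and restarting them; the bucket names are hypothetical:

# Previous target: S3_STORAGE_BUCKET=humio-segments-old
S3_STORAGE_BUCKET=humio-segments-new
S3_STORAGE_REGION=$BUCKET_REGION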

Over-committing the local disk space using bucket storage (BETA)

This feature is available for early access testing from version 1.7.0.

Humio can manage which segment files to keep on the local file system based on the amount of disk space used, and delete local files that also exist in bucket storage. This allows you to keep more data files than the local disk has room for, essentially allowing infinite storage of events, at the cost of S3 storage and (potentially) transfer fees when the files required for a search are not present on any node.

A copy in a bucket counts as a “safe replica” that allows the cluster to delete all local copies of the file and still report it as “perfectly replicated” from a cluster management perspective.

Humio keeps track of which segment files are read by searches. When the disk is filled more than the desired percentage, Humio will delete the least recently used segment files to free up disk space. Only files that have been copied to a bucket are ever removed.

If a file is not present locally and a search needs to read it, the search runs on all the parts that are present locally and schedules a download of the missing segment files from the bucket in the background. Once the required files arrive, they get searched as if they had been there from the start. The files are downloaded by (one of) the hosts assigned to the storage partition for the file at the time the search starts.

Files downloaded from the bucket to satisfy a search are kept in the secondary data directory if one is configured, or in the primary data directory otherwise. The downloaded segment files are treated like recently created segment files, and are subject to the same deletion rules once they become the least recently accessed files.

Since Humio can search data much faster than the network and the bucket storage provider can deliver it, searches that need files from the bucket will be orders of magnitude slower than searches that only access local files (on fast drives). The good use case for over-committing the file system like this is lowering the cost of hosting Humio by trading away performance on searches of old events that happen not to be on the local disk.

The local file system needs room to accommodate at least all the mini segments and the most recent merged segments. Providing more local disk space allows Humio to keep more files in the cache, resulting in better search performance.

Hash filter files are also stored in the bucket and also get downloaded when a search requires the segment file.

# Percentage of disk full that Humio aims to keep the disk at.
# BETA feature! Once the feature becomes mature for production, the BETATESTING prefix will be removed.
BETATESTING_LOCAL_STORAGE_PERCENTAGE=80

# Minimum number of days to keep a fresh segment file before allowing
# it to get deleted locally after upload to a bucket.
# Mini segment files are kept in any case until their merge result also exists.
# (The age is determined using the timestamp of the most recent event in the file)
# Make sure to leave most of the free space as room for the system to
# manage as a mix of recent and old files.
# Note! Min age takes precedence over the fill percentage, so increasing this
# implies the risk of overflowing the local file system!
LOCAL_STORAGE_MIN_AGE_DAYS=0

Configuring for use with non-default endpoints

You can point Humio to your own endpoint for bucket storage if you host an S3- or GCP-compatible service.

# Set either of these according to the type of provider you use.
S3_STORAGE_ENDPOINT_BASE=http://my-own-s3:8080
GCP_STORAGE_ENDPOINT_BASE=http://my-own-gcs:8080

MinIO in its default mode does not use MD5 checksums of incoming streams, which makes it incompatible with Humio's client. MinIO provides a workaround: start the server with the --compat option. For example: ./minio --compat server /data

Other configuration options

The following settings allow tuning for performance. Note that there may be costs associated with increasing these, as S3 billing is also based on the number of operations executed.

# How many parallel chunks to split each file into when uploading and downloading.
# Defaults to 4, the maximum is 16.
# (Not applicable to GCP)
S3_STORAGE_CHUNK_COUNT=4


# Maximum number of files that Humio will run concurrent downloads for at a time.
# Default is the number of hyperthreads / 2
# S3_STORAGE_DOWNLOAD_CONCURRENCY=8
# GCP_STORAGE_DOWNLOAD_CONCURRENCY=8

# Maximum number of files that Humio will run concurrent uploads for at a time.
# Default is the number of hyperthreads / 2
# S3_STORAGE_UPLOAD_CONCURRENCY=8
# GCP_STORAGE_UPLOAD_CONCURRENCY=8

# Chunk size for uploads and download ranges. Max 8 MB, which is the default.
# Minimum value is 5 MB.
S3_STORAGE_CHUNK_SIZE=8388608
GCP_STORAGE_CHUNK_SIZE=8388608

# Prefer to fetch data files from the bucket when possible even if another
# node in the Humio cluster also has a copy of the file.
# In some environments, it may be less expensive to do the transfers this way.
# The transfer from the bucket may be billed at a lower cost than a transfer from
# a node in another region or in another data center.  This preference does not
# guarantee that the bucket copy will be used, as the cluster can
# still make internal replicas directly in case the file is not yet in
# a bucket.
# Default is false.
S3_STORAGE_PREFERRED_COPY_SOURCE=false
GCP_STORAGE_PREFERRED_COPY_SOURCE=false