Installing Humio in a Cluster

This section describes how to install Humio configured as a distributed system across multiple machines. Running a distributed Humio setup requires a Kafka cluster. You can set up such a cluster using our Docker images, or you can install Kafka using some other method.

Installation & Configuration Scripts

We have created a GitHub repository with scripts to help install and configure Humio. We suggest you read through the documentation below and have a look at the repository, then check out the scripts and modify them for your environment.

You should also look at the reference Ansible project for an example of this in practice.

Running Zookeeper & Kafka Docker Images

The recommended default is to run three instances of Zookeeper and Kafka with our Docker images. The Zookeeper and Kafka instances must run on ports that the Humio instances can connect to.

The suggested setup below maps the humio user on the host machine to the humio user inside the Docker containers, and runs the Zookeeper, Kafka, and Humio processes as that user. This allows the processes to write to the mounted data directories.

Tailor this to your needs, and make sure the /data/ directories are on a mount point with sufficient storage (probably not /).

The data is split across four mount points; the example configurations below use these prefixes:

  1. /data/logs holds log files from the various processes.
  2. /data/zookeeper-data holds Zookeeper data (not much).
  3. /data/kafka-data holds Kafka data.
  4. /data/humio-data holds Humio data.
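
Once these directories have been created in the steps below, a quick way to confirm that they are on the intended mount points (and not on the root filesystem) is:

$ df -h /data/logs /data/zookeeper-data /data/kafka-data /data/humio-data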

The following shows how to use the humio/zookeeper and humio/kafka images to set up Zookeeper and Kafka in a three-machine cluster. You’ll have to repeat these steps on each machine in the cluster.

Ensure the humio user exists:

# adduser --disabled-password --disabled-login humio

Add the humio user to the docker group so it can run Docker without root privileges (the humio user should not have sudo access):

# usermod -aG docker humio
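
You can confirm the group membership with id; note that any existing login sessions for the humio user only pick up the new group after logging in again:

# id humio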

Create data directories for logs and Zookeeper:

# mkdir -p /data/logs
# chown -R humio:humio /data/logs
# mkdir -p /data/zookeeper-data
# chown -R humio:humio /data/zookeeper-data

Create a configuration file for Zookeeper. Replace the ${HOST_1}, ${HOST_2}, and ${HOST_3} variables with the DNS names or IP addresses of your hosts.

This is the configuration file for ${HOST} (the address of the machine itself); save it in a known location, such as /etc/humio/zookeeper.properties.

dataDir=/data/zookeeper-data
clientPort=2181
clientPortAddress=${HOST}
tickTime=2000
initLimit=5
syncLimit=2
autopurge.purgeInterval=1
admin.enableServer=false
4lw.commands.whitelist=*
server.1=${HOST_1}:2888:3888
server.2=${HOST_2}:2888:3888
server.3=${HOST_3}:2888:3888
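
If you prefer to keep the file above as a template with the ${HOST} placeholders, one way to render it per host is with envsubst from GNU gettext; the template filename and host names below are just examples:

# export HOST=zk01.example.com HOST_1=zk01.example.com HOST_2=zk02.example.com HOST_3=zk03.example.com
# envsubst < zookeeper.properties.template > /etc/humio/zookeeper.properties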

Set the myid file to the ID of the given server as specified in the configuration file above (1, 2, or 3):

# echo 1 > /data/zookeeper-data/myid
# chown humio:humio /data/zookeeper-data/myid

Create a data directory for Kafka:

# mkdir -p /data/kafka-data
# chown -R humio:humio /data/kafka-data

Make sure Kafka’s mount point is on a separate volume from the others. Kafka is notorious for consuming large amounts of disk space, so it’s important to protect the other services from running out of space by using a separate volume in production deployments.
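
For example, assuming a dedicated block device for Kafka (the device name below is hypothetical), an /etc/fstab entry along these lines keeps it on its own volume:

/dev/sdb1  /data/kafka-data  ext4  defaults,noatime  0  2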

Also make sure all volumes are appropriately monitored. If your installation does run out of disk space and gets into a bad state, you can find recovery instructions here.

Create a configuration file for Kafka. Each server needs to have a unique name and a broker.id (1, 2 or 3). Make sure the listener is something the Humio instances can reach. If in doubt, please refer to the Kafka documentation.

This is the configuration file for ${HOST}; remember to set broker.id and listeners accordingly on each host, and save it in a known location, such as /etc/humio/kafka.properties.

broker.id=1
log.dirs=/data/kafka-data
zookeeper.connect=${HOST_1}:2181,${HOST_2}:2181,${HOST_3}:2181
listeners=PLAINTEXT://${HOST}:9092
replica.fetch.max.bytes=104857600
message.max.bytes=104857600
compression.type=producer
num.partitions=1
log.retention.hours=48
log.retention.check.interval.ms=300000
unclean.leader.election.enable=false
broker.id.generation.enable=false
auto.create.topics.enable=false
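
When copying this file to the second and third hosts, remember to change broker.id (and the ${HOST} in listeners); for example, on host 2:

# sed -i 's/^broker\.id=.*/broker.id=2/' /etc/humio/kafka.properties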

Install Docker and pull the latest humio/zookeeper and humio/kafka Docker images:

# docker pull humio/zookeeper
# docker pull humio/kafka

Start the Docker images on each host, mounting the configuration files and data locations created in the previous steps:

# docker run -d  --restart always --net=host \
    -v /etc/humio/zookeeper.properties:/etc/kafka/zookeeper.properties \
    -v /data/logs:/products/kafka/logs \
    -v /data/zookeeper-data:/data/zookeeper-data  \
    --name humio-zookeeper "humio/zookeeper"

# docker run -d  --restart always --net=host \
    -v /etc/humio/kafka.properties:/etc/kafka/kafka.properties \
    -v /data/logs:/products/kafka/logs \
    -v /data/kafka-data:/data/kafka-data  \
    --name humio-kafka "humio/kafka"

Verify Zookeeper & Kafka

Inspect the log files:

  • $ docker logs humio-zookeeper
  • $ docker logs humio-kafka

Use nc to get the status of each Zookeeper instance. The following must report a Mode of either leader or follower for every instance; in a healthy three-node ensemble, one instance reports leader and the other two report follower. Replace 192.168.1.1 with the address of each host:

# echo stat | nc 192.168.1.1 2181 | grep '^Mode: '

Optionally, use your favorite Kafka tools to validate the state of your Kafka cluster. For example, you could list the topics, expecting an empty list since this is a fresh install of Kafka:

# kafka-topics.sh --zookeeper localhost:2181 --list
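
For a slightly stronger end-to-end check, you could create and then delete a throwaway topic; the topic name is arbitrary, the tool path may differ in your installation, and deletion requires topic deletion to be enabled on the brokers:

# kafka-topics.sh --zookeeper localhost:2181 --create --topic cluster-test --partitions 1 --replication-factor 3
# kafka-topics.sh --zookeeper localhost:2181 --delete --topic cluster-test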

Humio Docker Container

Humio is distributed as Docker images; use the humio/humio-core edition for distributed deployments.

Create an empty file on the host machine to store the Humio configuration. For example, humio.conf.

You can use this file to pass on JVM arguments to the Humio Java process.

Enter the following settings into the configuration file and edit them to match your environment:

# The stack size should be at least 2M.
HUMIO_JVM_ARGS=-Xss2M

# Make Humio write a backup of the data files:
# Backup files are written to mount point "/backup".
#BACKUP_NAME=my-backup-name
#BACKUP_KEY=my-secret-key-used-for-encryption

# ID to choose for this server when starting up the first time.
# Leave commented out to autoselect the next available ID.
# If set, the server refuses to run unless the ID matches the state in data.
# If set, must be a (small) positive integer.
#BOOTSTRAP_HOST_ID=1

# The URL that other hosts can use to reach this server. Required.
# Examples: https://humio01.example.com  or  http://humio01:8080
# Security: We recommend using a TLS endpoint.
# If all servers in the Humio cluster share a closed LAN, using plain http endpoints may be okay.
EXTERNAL_URL=https://humio01.example.com

# Kafka bootstrap servers list. Used as `bootstrap.servers` towards Kafka.
# Should be set to a comma-separated list of host:port pairs.
# Example: `my-kafka01:9092` or `kafkahost01:9092,kafkahost02:9092`
KAFKA_SERVERS=kafkahost01:9092,kafkahost02:9092

# Zookeeper servers.
# Defaults to "localhost:2181", which is okay for a single-server system, but
# should be set to a comma-separated list of host:port pairs.
# Example: zoohost01:2181,zoohost02:2181,zoohost03:2181
# Note that there is NO security on the Zookeeper connections; keep them inside a trusted LAN.
ZOOKEEPER_URL=zoohost01:2181,zoohost02:2181

# Select the TCP port to listen for http.
#HUMIO_PORT=8080

# Select the IP to bind the udp/tcp/http listening sockets to.
# Each listener entity has a listen-configuration. This ENV is used when that is not set.
#HUMIO_SOCKET_BIND=0.0.0.0

# Select the IP to bind the http listening socket to. (Defaults to HUMIO_SOCKET_BIND)
#HUMIO_HTTP_BIND=0.0.0.0

For more information on each of these environment variables, see the Configuration reference page.

If you make changes to the settings in your environment file, simply stopping and starting the container will not work. You need to docker rm the container and docker run it again to pick up changes.
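
For example, to apply configuration changes later on, you would remove the existing container and then repeat the docker run command shown below:

# docker stop humio-core
# docker rm humio-core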

Create an empty directory on the host machine to store data for Humio:

# mkdir /data/humio-data

Pull the latest Humio image:

# docker pull humio/humio-core

Run the Humio Docker image as a container:

# docker run -d  --restart always --net=host \
     -v /data/logs:/data/logs \
     -v /data/humio-data:/data/humio-data \
     -v /backup:/backup  \
     --env-file $PATH_TO_CONFIG_FILE --name humio-core humio/humio-core

Replace /data/humio-data before the : with the path to the humio-data directory you created on the host machine, and $PATH_TO_CONFIG_FILE with the path of the configuration file you created.

Verify that Humio is able to start using the configuration provided by looking at the log file. In particular, it should not keep logging problems connecting to Kafka.

# grep 'Humio server is now running!'  /data/logs/humio_std_out.log
# grep -i 'kafka'  /data/logs/humio_std_out.log

Humio is now running. Navigate to http://localhost:8080 to view the Humio Web UI.
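
If the machine has no browser available, you can check from a shell that the UI port responds (expect an HTTP status code in the 2xx or 3xx range):

$ curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080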

In the above example, we started the Humio container with full access to the network of the host machine. In a production environment, you should restrict this access by using a firewall, or adjusting the Docker network configuration.

Starting Humio as a Service

There are different ways of running the Docker container as a service. In the example above, we used Docker’s restart policy. Alternatively, Humio can be started by a process manager such as systemd.
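
As a rough sketch, a systemd unit along the following lines could manage the container created above; the unit name and Docker binary path are assumptions, and if you let systemd manage the container, consider dropping --restart always from the docker run command so Docker and systemd do not both try to restart it. Save the unit as /etc/systemd/system/humio.service:

[Unit]
Description=Humio container
Requires=docker.service
After=docker.service

[Service]
Restart=always
ExecStart=/usr/bin/docker start -a humio-core
ExecStop=/usr/bin/docker stop humio-core

[Install]
WantedBy=multi-user.target

Enable and start it with:

# systemctl enable --now humio.service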

If you see the following warning after starting up the Humio service, please ignore it. It does not affect the Humio service.

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.humio.util.FileUtilsJNA (file:/app/humio/humio-assembly.jar) to field sun.nio.ch.FileChannelImpl.fd
WARNING: Please consider reporting this to the maintainers of com.humio.util.FileUtilsJNA
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release

Nomad

If you want to install Humio on a Nomad cluster, we have an example project that you can use as a reference when creating your Nomad jobspecs.

Configuring Humio

Please refer to the configuration section.

Cluster Management API

Please refer to the API page.

Component Roles

To fully understand the roles of the various components of a Humio cluster, please refer to the Single-Node Setup documentation. The following sections can help you understand the effects of adding more nodes for each component to your cluster.

Zookeeper

A Zookeeper cluster can survive losing less than half its nodes. This means that a 3-node Zookeeper cluster can survive 1 node going offline, a 5-node cluster can survive 2 nodes going offline, and so on. A consequence of this is that you should always have an odd number of Zookeeper nodes.

Neither Humio nor Kafka overly stresses Zookeeper, so you are unlikely to see any difference in Humio’s performance from adding more Zookeeper nodes.

Kafka

Adding more Kafka nodes can alleviate bottlenecks for data ingestion, but will not affect query performance.

More Kafka nodes also give you more resiliency against data loss in case Kafka hosts go offline. Adding extra replicas will slow down ingest somewhat, as data must be duplicated across Kafka nodes. The number of nodes you can lose before data loss occurs depends on Kafka’s configured replication factor; Kafka can survive losing all but one replica. When Humio is allowed to manage Kafka topics on a cluster of 3 or more nodes, it will replicate the global-events and transientChatter-events topics to 3 nodes and require that 2 of those nodes are available at all times.
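
You can inspect how these topics are replicated with your Kafka tooling, for example (adjust the tool path and Zookeeper address for your setup):

# kafka-topics.sh --zookeeper localhost:2181 --describe --topic global-events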

It is often convenient to co-host Zookeeper and Kafka on the same nodes. You might want to host them on different nodes so you can have a different number of each. Since Zookeeper does not need as many nodes to be resilient against downtime, it can make sense to have only a few (e.g. 3 or 5) Zookeeper nodes, but more Kafka nodes.

For low data volumes it is convenient to run Kafka and Humio on the same nodes. However, as both services can be demanding on the local I/O system, we recommend not running Humio and Kafka on the same nodes once the cluster is scaled up.

Humio

Adding more Humio nodes will increase query performance, as the work can be split across more machines. More nodes also allow you to replicate your data, ensuring resiliency against machine failure. Humio can survive losing all but one replica. With bucket storage enabled, Humio can survive losing all nodes, as long as the Kafka cluster does not lose state.