Query Language

The Humio Query Language is the syntax that lets you compose queries to retrieve, process, and analyze data in Humio.

Before reading this section, we recommend that you work through the tutorial. The tutorial introduces you to queries in Humio, and lets you try out sample queries that demonstrate the basic principles.

Principles

The Query Language is built around a ‘chain’ of data-processing commands linked together. Each expression passes its result to the next expression in the sequence, allowing you to create complex queries by combining query expressions.

This architecture is similar to the idea of command pipes, a powerful and flexible mechanism for advanced data analysis in Unix and Linux shells.
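For example, a minimal chained query might filter events and then count them (the loglevel field here is hypothetical):

loglevel=ERROR | count()

The filter on the left passes only matching events to count() on the right.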

Structure of a query

The basic model of a query is that data arrives at the left of the query, and the result comes out at the right of the query. When Humio executes the query, it passes the data through from the left to the right.

Repository --[Events]--> Tag Filters --[Events]--> Filters --[Events]--> Aggregates --[Rows]--> Final Result
Events flow through the query pipeline, starting at the repository on the left. Events are filtered or transformed as they pass through filters and aggregates. After an aggregation, the data is usually no longer events, but one or more simple rows containing the results.

As the data passes through the query, Humio filters, transforms, and aggregates it according to the query expressions.

Expressions are chained using the ‘pipe’ operator |.
This causes Humio to pass the output from one expression (left) into the next expression (right) as input.

For example, the following query has these components

  • Two tag filters
  • One filter expression
  • Two aggregate expressions

    #host=github #parser=json | // <-- Tag Filters
    repo.name=docker/* | // <-- Filter Expression
    groupBy(repo.name, function=count()) | sort() // <-- Aggregates

Free-Text Filters (grepping)

The most basic query in Humio is to search for a particular string in the @rawstring field of events. See the events documentation for more details on @rawstring.

You can perform more complex regular expression searches on the @rawstring field of an event by using the regex function.
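For example, the following sketch (the pattern itself is illustrative) uses the regex function to search @rawstring:

regex("fail(ed|ure)")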

Examples

Query                  Description
foo                    Find all events matching “foo” in the @rawstring field of the events.
"foo bar"              Use quotes if the search string contains white space or special characters, or is a keyword.
"msg: \"welcome\""     You can include quotes in the search string by escaping them with backslashes.

You can also match @rawstring with a regular expression. To do this, write the regex enclosed in forward slashes.

Query      Description
/foo/      Find all events matching “foo” in the @rawstring field of the events.
/foo/i     Find all events matching “foo” in the @rawstring field, ignoring case.

Field Filters

Besides the @rawstring field, you can also query event fields, both as text and as numbers.

Examples

Text fields

Query                   Description
url = *login*           The url field contains login. You can use * as a wildcard.
user = *Turing          The user field ends with Turing.
user = "Alan Turing"    The user field equals Alan Turing.
user != "Alan Turing"   The user field does not equal Alan Turing.
url != *login*          The url field does not contain login.
user = *                Match events that have the field user.
user != *               Match events that do not have the field user.
name = ""               Match events that have a field called name but with the empty string as value.
user="Alan Turing"      You do not need to put spaces around operators (for example, = or !=).

Regex filters

In addition to globbing (* appearing in match strings) you can match fields using regular expressions.

Query                   Description
url = /login/           The url field contains login.
user = /Turing$/        The user field ends with Turing.
loglevel = /error/i     The loglevel field matches error case-insensitively; for example, it could be Error, ERROR, or error.
/user with id (?<id>\S+) logged in/ | top(id)
                        The user id is extracted into a field named id at search time. The id is then used in the top function to find the users that logged in the most. It is possible to extract fields in regular expressions using named groups. See Regex Field Extraction for details.

Comparison operators on numbers

Query                 Description
statuscode < 400      Less than
statuscode <= 400     Less than or equal to
statuscode = 400      Equal to
statuscode != 400     Not equal to
statuscode >= 400     Greater than or equal to
statuscode > 400      Greater than
400 = statuscode      (!) The field ‘400’ is equal to statuscode.
400 > statuscode      (!) This comparison generates an error. You can only perform a comparison between numbers; here 400 is interpreted as a field name, and statuscode is not a number.

The left-hand side of the operator is interpreted as a field name. If you write 200 = statuscode, Humio tries to find a field named 200 and tests whether its value is "statuscode".

If the specified field is not present in an event, then the comparison always fails — unless it is !=. You can use this behavior to match events that do not have a given field, using either not (foo = *) or the equivalent foo != * to find events that do not have the field foo.
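For example, assuming a hypothetical statuscode field, both of the following match events where the field is absent:

statuscode != *         // events that do not have a statuscode field
not (statuscode = *)    // the equivalent form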

Tag Filters

Tag filters are a special kind of field filter. They behave in the same way as regular filters.

In the example shown in the Structure of a query section above, we separated the tag filters from the rest of the query with a pipe character |.

We recommend that you include the pipe character before tag filters to improve readability.

However, these pipe characters are not mandatory. The Humio query engine recognizes tag filters and uses them to narrow down the number of datasources to search, which decreases query time.
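For example, with a hypothetical #type tag, the query below only touches datasources tagged accesslog before the rest of the query runs:

#type=accesslog | statuscode >= 500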

See the tags documentation for more on tags.

Logical Operators: and, or, not, and !

You can combine filters using the and, or, not Boolean operators, and group them with parentheses. ! can also be used as an alternative to unary not.

Examples

Query                                             Description
foo and user=bar                                  Match events with foo in the @rawstring field and a user field matching bar.
foo bar                                           Since the and operator is implicit, you do not need to include it in this simple type of query.
statuscode=404 and (method=GET or method=POST)    Match events with 404 in their statuscode field, and either GET or POST in their method field.
foo not bar                                       Equivalent to the query foo and (not bar).
!bar                                              Equivalent to the query not bar.
not foo bar                                       Equivalent to the query (not foo) and bar, because the not operator has a higher priority than and and or.
foo and not bar or baz                            Equivalent to the query foo and ((not bar) or baz), because Humio has a defined order of precedence for operators and evaluates them from left to right.
foo or not bar and baz                            Equivalent to the query foo or ((not bar) and baz), because Humio has a defined order of precedence for operators and evaluates them from left to right.
foo not statuscode=200                            Equivalent to the query foo and statuscode!=200.

Negating the result of filter functions

The not and ! operators can also be used to negate filter function expressions, which is syntactically cleaner than passing an explicit negate=true argument. For example

... | !cidr(ip, subnet="127.0.0/16") | ...
... | !in(field, values=[a, b, c]) | ...
... | !regex("xxx") | ...

Adding new fields

New fields can be created in two ways

Regex Field Extraction

You can extract new fields from your text data using regular expressions and then test their values. This lets you access data that Humio did not parse when it indexed the data.

For example, if your log entries contain text such as ... disk_free=2000 ..., then you can use a query like the following to find the entries that have less than 1000 free disk space

regex("disk_free=(?<space>[0-9]+)") | space < 1000

Named capturing groups are used to extract fields in regular expressions. The field space is extracted and is then available after the regex function.
The same can be written using a regex literal

/disk_free=(?<space>[0-9]+)/ | space < 1000

In order to use field extraction this way, the regex must be a top-level expression, that is, it must stand on its own between pipes. The following does not work

// DON'T DO THIS - THIS DOES NOT WORK
type=FOO or /disk_free=(?<space>[0-9]+)/ | space < 1000

Since regular expressions require some computing power, it is best to do as much simple filtering as possible earlier in the query chain, before applying the regex function.
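For example, a cheap free-text filter can discard non-matching events before the regex runs (a sketch based on the example above):

"disk_free" | regex("disk_free=(?<space>[0-9]+)") | space < 1000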

Fields produced by functions (as-parameters)

Fields can also be added by functions. By default, most functions put their result in a field whose name is the function name prefixed with an underscore; for example, count() puts its result in a field named _count.

Most functions that produce fields have a parameter called as. By setting this parameter you can specify the name of the output field, for example

count(as=cnt)

This assigns the result of count() to the field named cnt (instead of the default _count).
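For example, in a full query (the tag here is hypothetical):

#type=accesslog | count(as=cnt)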

See also the assignment operator for shorthand syntax for assigning results to a field.

Eval Syntax

The function eval can assign fields while doing numeric computations on the input.

The := syntax is short for eval. Use | between assignments.

... | foo := a + b | bar := a / b |  ...

is short for

... | eval(foo = a + b) | eval(bar = a / b) | ...

Dynamic field names based on input

If you want the field produced by an eval to be named based on the input you can use the backtick syntax. This works for both eval and the := shorthand, and will produce a field named after the value of the field in backticks.

For example, consider events with the following fields (such as the output of top(key))

{ key: "foo", value: "2" }
{ key: "bar", value: "3" }
...

Using

... | `key` := value * 2 | ...

will get you events with

{ key: "foo", value: "2", foo: "4" }
{ key: "bar", value: "3", bar: "6" }
...

Then you can time chart them by doing

timechart( function={ top(key) | `key` := _count } )

This last example uses a composite function call for the function= argument.

Assignment operator

You can use the operator := with functions that take an as-parameter. When the right-hand side of the assignment is a function call, the assignment is rewritten to specify the as= argument which, by convention, is the output field name. For example

... | foo := min(x) | bar := max(x) |  ...

is short for

... | min(x, as=foo) | max(x, as=bar) | ...

Field operator

You can use attr =~ fun() with any function that has a parameter named field. It designates the field=attr argument and lets you write:

... | ip_addr =~ cidr(subnet="127.0.0.1/24") | ...

rather than

... | cidr(subnet="127.0.0.1/24", field=ip_addr) | ...

This also works well with regex and replace. It is just a shorthand, but a very convenient one.
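For example, the same shorthand with replace, assuming a hypothetical url field:

... | url =~ replace("%20", with=" ") | ...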

Conditionals

There is no “if-then-else” syntax in Humio, since the streaming style is not well-suited for procedural-style conditions.

But there are a couple of ways to do conditional evaluation

Case Statements

Using case statements, you can describe alternative flows in your queries. It is similar to the case or cond constructs you might know from other functional programming languages. It essentially allows you to write if-then-else constructs that work on event streams.

The syntax looks like this:

case {
  expression | expression | ...;
  expression | expression | ...;
  expression | expression | ...;
  * | expression | ...
}

You write a sequence of pipeline clauses separated by a semicolon (;). Humio will apply each clause from top to bottom until one returns a value (matches the input).

You can add a wildcard clause case { ... ; * } which matches all events as the “default case”, essentially the else part of an if-statement. If you don’t add a wildcard clause, any events that don’t match any of the explicit clauses will be dropped. You cannot use an empty clause — you must explicitly write * to match all.

Example

Let’s say we have logs from multiple sources that all have a field named time, and we want to get percentiles of the time fields, but one for each kind of source.

First, we try to match some text that distinguishes the different types of lines. Then we create a new field, type, and assign a value that we can use to group by

time=*
| case { "client-side"     | type := "client";
         "frontend-server" | ip != 192.168.1.1 | type := "frontend";
         Database          | type := "db" }
| groupBy(type, function=percentile(time))

Match Statements

Using match statements, you can describe alternative flows in your queries where the conditions all check the same field. It is similar to the match or switch constructs you might recognize from many other programming languages. It essentially allows you to write if-then-else constructs that work on event streams. The matches on the field support the filters listed in Field Filters.

The syntax looks like this

field match {
  value => expression | expression... ;
  /regex/ => expression | ...;
  * => expression | ...
}

You write a sequence of filter and pipeline clauses to run when the filter matches, separated by a semicolon (;). Humio will apply each clause from top to bottom until one returns a value (matches the input).

You can add a wildcard clause match { ... ; * } which matches all events as the “default case”, essentially the else part of an if-statement. If you don’t add a wildcard clause, any events that don’t match any of the explicit clauses will be dropped. You cannot use an empty clause — you must explicitly write * to match all.

Example

Let’s say we have logs from multiple sources that all have a field that holds the time spent on some operation, but in different fields and units. We want to get percentiles of the time fields all in the same unit and in one field.

logtype match {
    "accesslog" => time:=response_time ;       // Access log is in seconds.
    /server_\d+/ => time:=server_time/1000 ;   // These servers log in millis; convert to seconds.
  }
| groupBy(logtype, function=percentile(time))

Setting a field’s default value

You can use the function default to set the value of a missing or empty field.
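For example, a minimal sketch with a hypothetical statuscode field:

... | default(field=statuscode, value="none") | ...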

Composite Function Calls

Whenever a function accepts a function as an argument, there are some special rules. For all variations of groupBy (bucket and timechart) that take a function= argument, you can also use a composite function. Composite functions take the form { f1(..) | f2(..) }, which works like the composition of f1 and f2. For example, you can do

groupBy(type, function={ avgFoo := avg(foo) | outFoo := round(avgFoo) })

You can also use filters inside such composite function calls, but not macros.
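For example, this sketch uses a filter inside the composite to count only error events in each group (the field names are hypothetical):

groupBy(type, function={ statuscode >= 400 | count() })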

User Functions

If you have stored a query as a ‘saved query’, then it can be used as a top-level element of another query, sort of like a function call.

To use a saved query this way, you invoke it using the syntax $"NAME OF SAVED QUERY"() or, if the name does not contain whitespace or special characters, $nameOfSavedQuery() without quotes. A typical use for this is to define a filter or extraction ruleset that you can use as a prefix of another query.

Currently user functions do not support parameters, though this will be part of a future release — that is why we put parentheses at the end.

Example

$"My Saved Query"() | $filterOutFalsePositive() | ...

Comments

Queries can have comments. This is useful for adding a description to long, multi-line queries

#type=accesslog // choose the type
| top(url)  // count urls and choose the most frequently used

The Humio query language supports // single-line and /* multi-line */ comments, as in Java or C++.

Links

When showing search results in a table, it is possible to create a clickable link. If the value of a field looks like a link, the UI makes it clickable. Links can be constructed using the search language; the format function can be handy for this

$extractRepo() | top(repo) | format("https://myhumio/%s", field=repo, as=link)

Joins

New in version 1.7, Humio supports the filter query function join(). The join() function returns a combined result set from two queries.

This function provides two join modes: inner and left. If you’re familiar with SQL, what would be the left table in SQL is referred to as the primary in Humio, and the right table becomes the subquery.

Inner joins

Let’s first consider the inner join, as it is the join() function’s default behavior.

An inner join returns the intersection of matching values between the primary query and the subquery, based upon the criteria you provide.

Left joins

The left join returns everything in the primary and anything that matches from the subquery. In other words, a query using a left join returns the intersection of matching values between the primary query and the subquery (like an inner join), plus all the values of the primary query.

Syntax

The join() function’s signature looks like this

join({subquery}, field=arg1, key=arg2, repo=arg3, mode=inner|left)

The unnamed subquery argument and the named field and key arguments are required. The named repo, include, and mode parameters are optional. There are other options available; see the Query Function reference documentation on join() for more details.

First, let’s consider a join query against a single repository. In that scenario, there is no need to define the optional repo parameter. The query will look like this

primary query | join({subquery}, field=arg1, key=arg2)

So to break down how the join() function is used

  • The primary query is any valid Humio query. This provides the first set of results you wish to work with.
  • The arguments you pass to the join() function define what results should be combined and how.
  • The subquery is any valid Humio query. This provides the second set of results you wish to combine with those from your primary query.
  • The field parameter’s value (arg1) defines the field name in the primary query. This value will be used to match data in the subquery.
  • The key parameter’s value (arg2) defines the field name in the subquery. This value will be used to match data in the primary query.

If the field names being used in field and key are identical, you can use the simplified syntax

primary query | field =~ join({subquery})

In the above example, the field argument would be the actual field name. For example, if both queries share a field named host that you wish to join on, you would use

primary query | host =~ join({subquery})

Joins across two repositories

In addition to joining two queries against a single repository, the join() function can also be used to return a combined result set from two different repositories. To do so, use the optional repo parameter to define the repository (as arg3) you want the subquery to run against.

primary query
| join({subquery}, field=arg1, key=arg2, repo=arg3)

Note that the user running the join() query must have permissions on both repositories.

Example 1: An inner join

Let’s say we’re analyzing two repositories to see what kinds of web browsers are generating errors in our web servers. In this example, we want to return the intersection of two types of data: the status codes that return errors (anything 400 or higher) from the accesslog in the current repository, and the web browser type (the user agent) from the weblogs repository.

We can use the following query

#type=accesslog statuscode>=400
| join({useragent != ""}, field=sourceipaddress, key=browseripaddress, repo=weblogs)

Our primary query is

#type=accesslog statuscode>=400

We’re looking in the current working repository for logs of type accesslog where statuscode >= 400. Next, we pipe the primary query to our join() function

join({useragent != ""}, field=sourceipaddress, key=browseripaddress, repo=weblogs)

We pass in {useragent != ""} as the subquery argument to the join() function. This subquery returns any value for the useragent that is not an empty string.

To join the result sets of these two queries together, we need a way to match them up. If our accesslog and weblogs repositories both contain IP addresses, that is a good candidate, and it is the one we use in this example. However, the two repositories use different field names for the IP address: the accesslog uses sourceipaddress and the weblogs repository uses browseripaddress. We define that relationship by passing in the field used in the primary query and the key that will be matched against in the subquery.

field=sourceipaddress, key=browseripaddress

To define the external repository that the subquery will run against, we simply pass that repository’s name in with the repo parameter repo=weblogs.

What can we expect back as results? The primary query finds all accesslog entries with an error code. The subquery finds all entries in the weblogs repository with a value present for the useragent. However, being an inner join, the combined results will only be returned for entries where the IP addresses match in both queries.

Example 2: A left join

In this next example, we want to get as much information as possible about people who are accessing hosts. The information we have may or may not contain the location of the user, but if it does, we need to include that in our query results.

Since location information is desirable but optional, while all access log entries are required, a left join fits our requirements better than an inner join. Here is what it would look like

#type=accesslog
| join({username=*}, field=host, key=hostname, repo=auditlogs, mode=left)
| table(timestamp, username, host, location)

In this example, we get all results from our current repository using #type=accesslog as the primary query. Our subquery ({username=*}) states that we wish to return all events that have a username. As in the previous example, we need a common point of data to map the relationship between our repositories. Here we use the host, again defining the differing field names used in the two repositories for that data, and pass them into the join() function as field=host, key=hostname. We define our external repository with repo=auditlogs.

Since we want all the results from the primary query, even if there is no match with the subquery, we specify a left join using mode=left.

Finally, we display the results in a table by piping the results to table(timestamp, username, host, location).

Best practices for join queries

Humio recommends the following when running join() queries

  • If your join() query key is constantly changing with live data, use live queries to keep the query updated.
  • When debugging an inner join query, temporarily set the join mode to left (join(mode=left)). This will help you debug your query by showing all results matched or unmatched by the join clause.
  • Another suggestion is to first write the subquery alone to make sure the results are what you expect. Once they are, you can confidently include it in a join, as in the sketch below.
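For example, using the queries from Example 1, first run the subquery on its own against the weblogs repository

useragent != ""

Once the results look right, embed it as the subquery of the join

#type=accesslog statuscode>=400
| join({useragent != ""}, field=sourceipaddress, key=browseripaddress, repo=weblogs)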