Humio supports the joining of queries using the join()
function. It will return a combined results set from two queries it’s given.
This function provides two join modes: inner and left. If you’re familiar with SQL database systems, what would be the left table in SQL is referred to as the primary in Humio, and the right table becomes the subquery.
An inner join returns any intersection of matching values, based upon the criteria you provide, between the primary query and the subquery. This is illustrated in purple in Figure 1 here. Incidentally, the inner join is the join( )
function’s default behavior.
As an example, suppose we’re analyzing two repositories to see what kinds of web browsers are generating errors on our web servers. In this example, we want to return the intersection of two types of data: the status codes that return errors (anything 400 or higher) from the accesslog in the current repository, and the web browser type (the user agent) from the weblogs repository. To do this, we can use the following query:
#type=accesslog statuscode>=400
| join({useragent != ""}, field=sourceipaddress, key=browseripaddress, repo=weblogs)
The primary query here is the first line. It says to search the current working repository for logs of type accesslog
, where statuscode >= 400
.
Next, it pipes the primary query to the join()
function. It passes in {useragent != ""
} as the subquery argument to the join()
function. This subquery returns any value for the useragent
that is not an empty string.
To join the result sets of these two queries together we will need a way to match them. If our accesslog and weblogs both contain IP addresses, that would be a good option and is the one we would use in this example. However, in this example the two repositories use different field names for the IP address. The accesslog
uses sourceipaddress
, while the weblogs
repository uses browseripaddress
. This query defines that relationship by passing in the field used in the primary query, and the key that will be matched against in the subquery — i.e., field=sourceipaddress
, and key=browseripaddress
.
To define the external repository that the subquery will run against, the query here simply passes that repository’s name in with the repo parameter, repo=weblogs
.
As for the results, the primary query will find all accesslog
entries with an error code. The subquery will find all entries in the weblogs
repository with a value present for the useragent
. However, being an inner join, the combined results will only be returned for entries where the IP addresses match in both queries.
As an example of a left join, suppose we want to get as much information as possible about people who are accessing hosts. The information we have may or may not contain the location of the user, but if it does, we need to include that in our query results.
Since location information is desirable, although optional, but all access logs are required a left join query would better fit our requirements than an inner join query. Here is what it would look like
#type=accesslog
| join({username=*}, field=host, key=hostname, repo=auditlogs,
mode=left)
| table(timestamp, username, host, location)
In this example, we get all results from our current repository using #type=accesslog
as the primary query. Our subquery (username=*
) states that we wish to return all usernames. As in the previous example, we need a common point of data to map the relationship between our repositories. We’ll use the host, again, defining the differing field names used in the two repositories for that data. We passed those into the join()
function as field=host, key=hostname
. We define our external repository with repo=auditlogs
.
Since we want all the results from the primary query, even if there is no match with the subquery, we specify a left join using mode=left
.
Finally, we display the results in a table by piping the results to table(timestamp, username, host, location)
.
The join()
function’s signature looks like this
join({subquery}, field=arg1, key=arg2, repo=arg3, mode=inner|left)
The unnamed subquery argument, and the named field
and key
arguments, are required. The named repo
, include
, and mode
parameters are optional. There are other options available. Please see the Join Query Functions on join( ) reference page for more details.
First, let’s consider a join query against a single repository. In that scenario, there is no need to define the optional repo
parameter. The query will look like this
primary query | join({subquery}, field=arg1, key=arg2)
So to break down how the join()
function is used
join()
function define what results should be combined and how.arg1
) defines the field name in the primary query. This value will be used to match data in the subquery.arg2
) defines the field name in the subquery. This value will be used to match data in the primary query.If the field names being used in field and key are identical, you can use the simplified syntax
primary query | field =~ join({subquery})
In the above example, the field argument would be the actual field name. For example, if both queries share a field named host
that you wish to join on, you would use:
primary query | host =~ join({subquery})
In addition to joining two queries against a single repository, the join()
function can also be used to return a result set from more than two repositories. To do so, use the optional repo
parameter to define the repository (as arg3
) you want the subquery to run against.
primary query
| join({subquery}, field=arg1, key=arg2, repo=arg3)
Note that the user running the join()
query must have permissions on both repositories.
Joins are implemented by starting two queries: the main query and the subquery. For static queries, the subquery runs first and then the main query runs. For live queries, the historic part is done as in the static case, whereas the main query and subquery will run simultaneously for the live part.
Because of this and the distributed nature of Humio, subqueries in live queries are not guaranteed to “see” events before the main query does. In practice this can be thought of as there being a small delay (typically in the order of 5-6 seconds) from an event being ingested to it hitting the subquery. This can lead to surprising results when making live joins on events that occur very close in time.
For example, a query that contains only the count( )
function will typically result in a larger count than the query below, which joins events to themselves:
join({*}, field=id) | count()
This is because at the time the main query runs for a given event, the subquery has not seen that event yet and the join will therefore not emit the event. You can avoid this by not joining events that are ingested close in time, using static queries, or by using repeating queries
Humio recommends the following when running join()
queries
join()
query key is constantly changing with live data, use live queries to keep the query updated.join(mode=left)
). This will help you debug your query by showing all results matched or unmatched by the join clause.