This is Part 1 of a two-part blog. The two parts are:
- An overview of why Humio is different from other products.
- An analysis of why the technology makes the Total Cost of Ownership cheaper than Splunk or ELK. Part 2 can be read here.
One thing I’ve heard quite a lot over recent years is that companies would like to collect and analyse more log information, but as data volumes grow exponentially, traditional log management and security information and event management (SIEM) solutions make it cost prohibitive to collect all the required data.
According to Gartner, “Licensing models for many SIEM tools often dissuade customers from collecting all relevant events and logs…Organizations that lack sufficient budgets to expand their existing SIEM solutions must then choose between deprioritizing existing use cases, not adding new use cases or decreasing the scope of monitoring.”
However, there is now an alternative to high-cost products like Splunk and ELK. Humio has been designed from the ground up to leverage technologies that make it cost-effective and highly efficient to collect and search all log data and do so at scale, and in real time. This blog will look at what makes Humio different and why this also makes it cheaper than the alternatives.
In part 1 of this blog we will look at the technology that makes all this possible.
The Store Phase
Unlike most other tools, Humio is built on an index-free architecture, bypassing the overhead of traditional database indexing. For this reason, Humio’s log data is available for dashboard updates, alerts, and searches in near real time, typically 100-300 milliseconds after ingestion. In addition, the traditional indexes that make an entire dataset searchable can become very large, sometimes causing the stored data to grow by up to 300%, which in turn requires additional storage and additional expense.
Here’s how this works:
Log data comes in from either a data shipper or directly from the network device. A data shipper is a tool that functions as the connection between a server and Humio, making it possible to transfer log files and metrics easily and reliably. Using a data shipper allows data to be retransmitted on failure and messages to be batched, which the Ingest API alone does not provide. The process of data arriving at a Humio node and then being presented in search results and saved to disk is called the ingest flow.
Humio also supports Elastic Beats. The OSS Elastic Beats are a cross-platform, lightweight group of data shippers that broaden the data that Humio can collect.
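To make the batching-and-retry behaviour concrete, here is a minimal sketch of a data shipper in Python. This is not Humio’s or Beats’ actual implementation; the `send` callable is a placeholder for whatever transport delivers a batch to the ingest endpoint.

```python
class LogShipper:
    """Toy data shipper: batches events and retries failed sends."""

    def __init__(self, send, batch_size=3, max_retries=2):
        self.send = send              # send(list_of_events), raises on failure
        self.batch_size = batch_size
        self.max_retries = max_retries
        self.buffer = []

    def ship(self, event):
        """Buffer one event; flush automatically when the batch is full."""
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send the current batch, retrying on transient failures."""
        if not self.buffer:
            return
        batch, self.buffer = self.buffer, []
        for attempt in range(self.max_retries + 1):
            try:
                self.send(batch)
                return
            except IOError:
                if attempt == self.max_retries:
                    raise  # give up after exhausting retries
```

Batching amortizes per-request overhead, and keeping the batch until `send` succeeds is what makes retransmission on failure possible.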
The data is normalized via Humio-provided parsers or custom parsers. When a system sends data (logs) to Humio over one of the Ingest APIs or through an ingest listener, the cluster node that receives the request is called the arrival node. The arrival node parses the incoming data (using the configured parsers) and puts the result (called events) in Humio’s humio-ingest Kafka queue.
After the events are placed in the humio-ingest queue, a digest node will grab them off the queue as soon as possible. A queue in Kafka is configured with a number of partitions (parallel streams), and each Kafka partition is consumed by a digest node. A single node can consume multiple partitions, and exactly which node handles which digest partition is defined in the cluster’s Digest Rules.
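As a sketch of how this routing might look, the following Python models an arrival node hashing events onto humio-ingest partitions, with digest rules mapping partitions to nodes. The rule table and the hashing scheme are illustrative assumptions, not Humio’s actual implementation.

```python
import hashlib

# Hypothetical digest rules: which digest node consumes which Kafka partition.
DIGEST_RULES = {0: "digest-node-a", 1: "digest-node-a",
                2: "digest-node-b", 3: "digest-node-b"}
NUM_PARTITIONS = len(DIGEST_RULES)

def partition_for(event_key: str) -> int:
    """Arrival node: deterministically pick a humio-ingest partition for an event."""
    digest = hashlib.sha256(event_key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def digest_node_for(event_key: str) -> str:
    """Look up which digest node will consume the event, per the digest rules."""
    return DIGEST_RULES[partition_for(event_key)]
```

Because the mapping is deterministic, all events with the same key land on the same partition and are therefore digested by the same node, which is what makes per-partition ordering and buffering work.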
Digest nodes are responsible for buffering new events and compiling segment files (the files that are written to disk in the Store phase). Once a segment file is full it is passed on to Storage Nodes in the Store Phase. The context of the data is stored in lightweight tags that summarize where the data came from or how it is used.
Digest nodes also process the real-time part of search results. Whenever a new event is pulled off the humio-ingest queue, the digest node examines it and updates the results of any matching live searches currently in progress. This is what makes events appear in search results almost instantly after arriving in Humio.
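A toy model of this live-search update step (the names `LiveSearch` and `digest_event` are invented for illustration and are not Humio’s API):

```python
class LiveSearch:
    """Toy live search: keeps a running count of events matching a keyword."""

    def __init__(self, keyword):
        self.keyword = keyword
        self.match_count = 0

    def update(self, event):
        if self.keyword in event:
            self.match_count += 1

def digest_event(event, live_searches):
    """What a digest node might do per event pulled off the humio-ingest queue:
    update every in-progress live search before the event is ever written to disk."""
    for search in live_searches:
        search.update(event)
```

The key point is that live results are updated on the ingest path itself, so there is no need to wait for the event to be stored and later re-found by a query.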
The final phase of the ingest flow is saving segment files to storage. These are labelled with information that will tell a subsequent query if the required data could possibly be in that file or not.
These labels include the following:
- Time range of the data in the bucket. While Humio is almost index-free, it does index time series information so all ingested data can be found based on its time. The index information is small enough that it can be stored in memory and it doesn’t slow down the ingest process. For example, to store 30 GB of logs for a month, the index file is less than 1 MB.
- The source and type of data
- A bloom filter computed over the keys and values of the data stored in the bucket
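Together these labels let a query skip segment files entirely without opening them. A hedged sketch of that decision (the field names are invented for illustration):

```python
def segment_may_match(segment, query):
    """Decide from a segment file's labels whether a query needs to open it at
    all. Returns False only when the segment definitely holds no matching data."""
    # 1. Time range of the data in the segment vs. the query's time range.
    if segment["end"] < query["start"] or segment["start"] > query["end"]:
        return False
    # 2. The source/type of the data.
    if query.get("source") is not None and query["source"] != segment["source"]:
        return False
    # 3. Bloom filter over the segment's keys and values: it can answer
    #    "definitely not present" (False) or "may be present" (True).
    return segment["might_contain"](query["value"])
```

Every `False` here is a whole segment that never has to be read from disk, which is why these small labels substitute for a full index.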
What is a Bloom filter?
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that rapidly and memory-efficiently tests whether an element is present in a set. Because it is probabilistic, it can only report that an element is definitely not in the set or that it may be in the set.
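A minimal Bloom filter fits in a few lines of Python. This sketch uses SHA-256 with different seeds as its hash functions, which is illustrative rather than what Humio actually uses:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            h = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely not present"; True means "may be present".
        return all(self.bits[pos] for pos in self._positions(item))
```

The bit array is all that needs to be stored alongside a segment, which is why the filter is so much smaller than an index over the same data.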
Once a segment file has been completed in the digest phase it is saved in a number of replicas; how many depends on how your cluster is configured.
Data that comes into a repository is generally stored locally, on the servers where Humio and the repository are located. This can be on one node or many; having more nodes improves performance and allows for redundancy of data. For data that you don’t want to search often, you can use external storage in one of these options:
- Secondary Storage – When enabled, Humio will move segment files to secondary storage once the primary disk reaches whatever limit you set. Secondary or cold storage is intended for setups where the primary or hot storage is low-latency and fast, such as NVMe, and the secondary is high-latency but very large, such as a SAN.
- Bucket Storage – Similar to secondary storage but uses object storage from cloud providers such as Amazon S3 and Google Cloud Storage. Bucket storage is particularly useful in a cluster, as it makes deployment of nodes easier and helps maintain back-ups in case a node or cluster crashes.
- S3 Archiving – The files Humio writes in this format are not searchable by Humio; the data is meant for other systems to consume.
Once the data is stored, users will need to search it to find answers to questions, set up alerts, and get to the root cause of issues.
The search itself is performed using a brute-force method. Brute-force search is typically considered slow, but in Humio it is incredibly fast: up to 30-40 times the speed of a regular search, and certainly faster than conventional log management systems.
Let’s look at how this works.
A typical Solid-State Drive (SSD) has a read speed of about 1GB/s, which means you can copy file contents from disk into memory at that speed. Reading 1GB from disk therefore takes about 1 second, and the actual search of 1GB of data in memory adds another 1/10th of a second, so a normal search of 1GB takes about 1.1 seconds.
If we compress the input 10 times using LZ4 to 0.1GB, it then takes just 0.1 seconds to read in the 0.1GB. (LZ4 is a lossless compression algorithm, providing compression speeds above 500 MB/s per core, scalable with multi-core CPUs.) Decompressing the data back to 1GB brings the total for reading from disk and decompressing to 0.6s; add the search time of 1/10th of a second and we can now search through 1GB in just 0.7s.
Assuming we’re on a 4-core system, we can split the compressed data into four units of work that are individually decompressed and searched on each core. A quarter of the 0.5s of decompression work is 0.125s per core, and the per-chunk search overlaps with it, giving a total of roughly 0.225 seconds (0.1s read plus 0.125s decompress-and-search), or about 4.4GB/s on a single 4-core machine.
Lastly, rather than working with whole 256 MB segment files, the data is divided into small chunks of compressed data that are moved from main memory into the CPU cache, so they don’t have to be scanned again from main memory. Now the search takes only about 0.1265 seconds.
All of the above assumes that we work in main memory, which is limited to a theoretical ~50 GB/s bandwidth on a modern CPU; in practice we see ~25 GB/s.
Once data is in the CPU’s caches it can be accessed even faster; however, the caches are small. The level-2 cache, for instance, is typically 256 KB, but if we move the data into the level-2 cache in small compressed chunks, so that their decompressed output also fits in the same cache, we can search in an incremental way. Memory accesses to the level-2 cache are about 10 times faster than to main memory, so this speeds up the decompress-and-search phase. To achieve this, Humio pre-processes the input by splitting the 1 GB into chunks of up to 128 KB that are individually compressed.
Adding all this up for the example search of 1GB: 0.1s to read from disk, 0.004s to move the 0.1GB of compressed data from main memory to the cores at 25GB/s, and roughly 0.0225s to decompress and search with the 10x cache speed-up, for a total of about 0.1265 seconds, reaching 7.9GB/s.
In practice, you’re often searching for current and recent data that can be kept in the page cache, so you can get search times down to 0.0265 seconds, which is 30-40 times the speed of a normal search.
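The first two single-core estimates above (1.1s uncompressed, 0.7s compressed) can be reproduced with a small back-of-envelope model. The speeds are the assumptions from the text, except the 2GB/s decompression rate, which is inferred from the 0.6s read-and-decompress figure rather than quoted:

```python
# Assumed speeds: disk read 1 GB/s, in-memory scan 10 GB/s,
# LZ4-style decompression ~2 GB/s per core (inferred), 10x compression ratio.
DISK_GBPS = 1.0
SCAN_GBPS = 10.0
DECOMP_GBPS = 2.0

def naive_search_time(gb: float) -> float:
    """Read raw data from disk, then scan it in memory."""
    return gb / DISK_GBPS + gb / SCAN_GBPS

def compressed_search_time(gb: float, ratio: float = 10.0) -> float:
    """Read the compressed bytes, decompress, then scan the raw data."""
    return (gb / ratio) / DISK_GBPS + gb / DECOMP_GBPS + gb / SCAN_GBPS
```

The model makes the core trade-off visible: compression shrinks the slow disk read by the compression ratio, at the cost of a decompression step that is far cheaper than the read it replaces.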
I’ve already described how Kafka is used in the ingest flow, but it helps in other ways too:
- It enables Humio to process data in a compressed form, which lets us keep it in memory longer, greatly reducing data transfer loads.
- It’s easy to create new partitions to store a lot of data and it’s easy to scale out the number of consumers to do processing. These are two areas that are bottlenecks in large-scale data processing.
- Kafka is also a highly stable way of processing data in real time. If a digest node crashes, processing simply resumes from the last commit marker: any in-progress work that was lost is re-read from the Kafka queue and replayed.
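That recovery behaviour can be modelled with a toy in Python (the class and method names are invented for illustration, not Humio’s or Kafka’s API):

```python
class DigestNode:
    """Toy crash-recovery model: events live durably in a Kafka-style queue, and
    the node only advances its commit marker after a segment is safely written."""

    def __init__(self, queue):
        self.queue = queue     # durable partition contents (survives crashes)
        self.committed = 0     # offset of the commit marker
        self.buffer = []       # in-memory events, lost on crash

    def pull(self, n):
        """Read the next n events from the queue into the in-memory buffer."""
        start = self.committed + len(self.buffer)
        self.buffer.extend(self.queue[start:start + n])

    def commit_segment(self):
        """Pretend the buffer was flushed to a segment file, then advance the marker."""
        written = list(self.buffer)
        self.committed += len(self.buffer)
        self.buffer = []
        return written

    def crash(self):
        self.buffer = []       # only uncommitted in-memory work is lost

    def recover(self):
        """Re-read everything after the commit marker from the durable queue."""
        self.pull(len(self.queue) - self.committed)
```

Because the queue itself is durable and the marker only moves after a successful write, a crash can never lose events, only redo some work.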
Compression is commonly used to optimize storage, but traditionally compressing and decompressing data significantly impacts processing speed. With Humio this isn’t the case. Humio is designed and optimized around advanced compression algorithms which make reading, writing, storing, and moving data faster. The compression ratios achieved are often 15 to 1 and in some cases 30 to 1.
So how does this work?
Humio uses a compression algorithm that is very fast at decompressing data (LZ4). After data is ingested and parsed, the bulk of the parsed events is compressed before going onto the Kafka ingest queue. In-memory buffers within a mini-segment are compressed as well, and are eventually deleted after the data is merged. During the merge, event data is compressed into segments and uploaded to the bucket for storage. A Bloom filter is generated before the segments are uploaded and is stored alongside each segment file; it provides an efficient way to determine whether a value could be present in a segment without reading it. Compressed data not only takes up less disk space, it is also faster to transport in terms of I/O and memory. Because of this, Humio compresses data before it is written to memory and transported.
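To see why log data compresses so well, here is a small demonstration using Python’s zlib as a stand-in for LZ4; the sample log lines are synthetic:

```python
import zlib

def compression_ratio(data: bytes) -> float:
    """Ratio of raw size to compressed size (zlib stands in for LZ4 here)."""
    return len(data) / len(zlib.compress(data))

# Log data is highly repetitive - timestamps, host names, field names and
# status codes recur constantly - so it compresses extremely well.
logs = b"".join(
    b"2021-06-01T12:00:%02dZ host-7 nginx status=200 path=/api/items\n" % (i % 60)
    for i in range(2000)
)
```

Real-world ratios depend on the data, but the repetitive structure of machine-generated logs is exactly what dictionary-based compressors like LZ4 exploit.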
Humio can also be configured to use a compression algorithm that will compress data even more (ZFS compression), but this is a configuration option and should only be enabled according to the available data size in the cluster, and the ratio between disks and CPU resources in the cluster.
Part 2 of this blog, on what makes the Total Cost of Ownership (TCO) of Humio cheaper than Splunk and ELK, can be found here.
If you would like to know more about Humio or start a trial then please email me at firstname.lastname@example.org