Hardware abstractions are great, but we are spoiled.
While building Scanner’s security data lake product, we’ve had to think hard about how to push hardware to its fundamental limits to handle the scale of our users’ security log data. Cloud services like AWS, GCP, and Azure provide a remarkable abstraction layer that allows software engineers to avoid thinking about physical hardware. Unfortunately, this abstraction comes at a cost. If you don’t take the time to understand what your software is doing with the underlying hardware, it is easy to build applications that burn massive amounts of CPU cycles, SSD IOPS, and router operations on unnecessary work.
For example, your microservices might run on servers that sit inches from one another in an AWS datacenter, but it’s easy to write your code such that they communicate over the internet. If you use traceroute, you’ll find that packets sent between your microservices will make up to 30 hops between routers on the internet, traveling across miles of physical wire before turning around and heading right back into the datacenter, reaching the destination server inches away from the source.
If you are building traditional, low-traffic web applications, it is completely acceptable to embrace the high-level abstractions that cloud providers offer you and forget about the hardware. However, if you are building data-intensive applications that operate on petabyte-scale data sets, understanding the fundamental limits of your hardware can mean improving your app’s performance by orders of magnitude, not to mention reducing costs by millions of dollars.
Hardware is sometimes shockingly fast when used optimally. Game engine programmers seem to understand this better than most, and if you want to see something inspiring in this vein, I recommend watching Casey Muratori’s demo showing how insanely fast a terminal can be when hardware is used optimally.
Designing from first principles to push your hardware to its limits: Pretend you have a magic wand
At Scanner, we provide a security data lake product that supports rapid search and threat detections for customers who have petabyte-scale log data sets. Given this significant scale, we had no choice but to design Scanner to push hardware to its fundamental limits.
If you want to design software from first principles, a useful cognitive tool is to pretend you have a magic wand that produces the ideal software system to solve your problem. Forget about how you would implement it. Try to imagine a system that makes optimal use of hardware and ask yourself what that would look like. Then, think about the fastest way you could reach that state.
At Scanner, here is what we asked ourselves:
“If we could wave a magic wand and generate an ideal application to store and rapidly search through one petabyte of logs per customer, what would it look like?”
Here is how we answered this question:
Storage scale
An ideal version of Scanner’s security data lake tool would use a fleet of hard drives large enough to store one petabyte of uncompressed logs for each of 1000 customers. Logs tend to compress well, with a compression factor of roughly 10-20x on average, so we would likely need to store roughly 50-100 petabytes of compressed log data. This data should be replicated multiple times for durability, so we would need something like 500 petabytes of raw storage capacity in our system. Assuming typical 16 TB HDDs like the ones you find in modern datacenters, we would need around 30,000 hard drives in this storage fleet. This would be a massive amount of storage hardware.
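As a quick sanity check on those numbers, here is the back-of-the-envelope arithmetic in Rust; the 10x compression ratio and 5x replication factor are illustrative assumptions, not measurements:

```rust
// Back-of-the-envelope storage sizing. The compression ratio and
// replication factor are illustrative assumptions.
fn main() {
    let customers = 1000.0_f64;
    let uncompressed_pb_per_customer = 1.0_f64;
    let compression_factor = 10.0_f64; // conservative end of the 10-20x range
    let replication_factor = 5.0_f64;  // multiple copies for durability
    let hdd_capacity_tb = 16.0_f64;

    let compressed_pb = customers * uncompressed_pb_per_customer / compression_factor;
    let total_pb = compressed_pb * replication_factor;
    let drives = total_pb * 1000.0 / hdd_capacity_tb;

    println!("compressed logs: {compressed_pb} PB");  // 100 PB
    println!("with replication: {total_pb} PB");      // 500 PB
    println!("16 TB drives needed: {drives:.0}");     // ~31,250, call it 30,000
}
```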
In a perfect world, these drives would be NVMe SSDs with sub-millisecond latency, or at least SATA SSDs with single-digit millisecond latency, but both of those options would be incredibly expensive, costing something like tens of millions of dollars to purchase. Using magnetic hard drives would be far less costly at roughly single-digit millions of dollars, an order of magnitude cheaper.
Indexing
Assuming we used magnetic hard drives, an ideal security data lake tool would be designed in such a way that the high disk latency with magnetic hard drives would not materially affect the user experience. For instance, the ideal application would be able to intelligently minimize the search space across the data set given the information it knows about the data, reducing query durations significantly. Hence, maintaining some sort of efficient index would likely be an important part of an ideal solution.
However, building an index in a naive way would be brutal. A fine-grained index with a postings list mapping each term to the individual log events containing it would be massive. Using projects like Elasticsearch as a reference point, such an index would likely be 3-5x larger than the original data set. Asking users to store a 3-5 petabyte index in addition to 1 petabyte of log data seems unreasonable.
An ideal solution would use a much more efficient index, mapping from terms to large chunks of log events containing those terms. Thanks in part to how much redundancy there is in log data, a coarse-grained index would be 10-20x smaller than the original data set. Using a coarse-grained index would still allow us to dramatically reduce the search space during queries. For instance, during a needle-in-haystack query over one petabyte of logs, we would be able to reduce the search space to 100 GB of logs or less. Also, reading large chunks of neighboring logs is natural to do with magnetic drives, so this coarse-grained index seems like it would be part of an ideal solution.
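To make the coarse-grained idea concrete, here is a minimal Rust sketch of the shape such an index might take; the type and field names are hypothetical, not Scanner’s actual schema:

```rust
use std::collections::BTreeMap;

/// A contiguous chunk of compressed log events in object storage.
/// In a coarse-grained index, this is the unit a query has to scan.
struct ChunkRef {
    s3_key: String, // where the chunk lives
    byte_len: u64,  // how much data a query must read to scan it
}

/// Coarse-grained index: each term maps to the chunks that contain it,
/// not to individual log events, which keeps the index small.
struct CoarseIndex {
    postings: BTreeMap<String, Vec<usize>>, // term -> positions in `chunks`
    chunks: Vec<ChunkRef>,
}

impl CoarseIndex {
    /// Returns the chunks that could contain `term`; every other chunk can
    /// be skipped entirely, shrinking the search space for the query.
    fn candidate_chunks(&self, term: &str) -> Vec<&ChunkRef> {
        self.postings
            .get(term)
            .map(|ids| ids.iter().map(|&i| &self.chunks[i]).collect())
            .unwrap_or_default()
    }
}
```

A needle-in-haystack term appears in only a few chunks, so a query reads just those chunks; a term that appears everywhere degrades into the full scan discussed below.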
The security engineering team at a typical Scanner customer would likely execute on the order of tens or at most hundreds of queries per day, not millions or billions of queries per day. Given the low frequency of queries, it would probably be acceptable for a query that operates on one petabyte of logs to require a few seconds to complete – it need not finish in milliseconds. Therefore, an intelligent index running on top of a fleet of magnetic hard drives seems like it would provide acceptable performance.
Bursty parallel compute
Even with an efficient index that would allow us to reduce the search space dramatically, there could still be a significant amount of data to scan when executing a query across a security data lake. For example, a user could query over their entire data set for a term that appears in every log event, and Scanner would have no choice but to scan the customer’s full one petabyte of log data.
An ideal solution would need to be able to handle this case decently well. It would need to call upon a large number of CPU cores and network interfaces to scan the full data set massively in parallel.
To get a feel for how many cores and network interfaces would be required, we can look at the CPU and networking properties of VM instances in AWS like the c7gd.16xlarge. This instance has 64 vCPU cores and supports up to 30 Gbps of networking bandwidth. Let’s assume that a typical customer’s one petabyte log data set compresses 10x, resulting in 100 TB of compressed logs. If we wanted to read the entire 100 TB compressed log data set in at most 100 seconds using VM instances with 30 Gbps of bandwidth, we would need to run at least 267 such VM instances. Since each instance uses 64 vCPU cores, we would need about 17,000 vCPU cores. This would be a massive amount of compute and network hardware. Furthermore, to handle N such queries concurrently, we would need to burst up to 17,000 * N cores for short periods of time.
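The arithmetic behind those numbers is sketched below; the 10x compression ratio and the 100-second target are the same illustrative assumptions as above:

```rust
// Rough sizing for a worst-case full scan of 100 TB of compressed logs.
fn main() {
    let compressed_bytes = 100e12_f64;  // 100 TB
    let per_instance_gbps = 30.0_f64;   // c7gd.16xlarge network bandwidth
    let target_seconds = 100.0_f64;
    let vcpus_per_instance = 64.0_f64;

    let bits_to_read = compressed_bytes * 8.0;
    let bits_per_instance = per_instance_gbps * 1e9 * target_seconds;
    let instances = (bits_to_read / bits_per_instance).ceil();

    println!("instances needed: {instances}");                  // 267
    println!("vCPU cores: {}", instances * vcpus_per_instance); // ~17,000
}
```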
Given the spiky nature of compute and networking needs, an ideal application would “rent” CPUs and network interfaces for brief periods of time to serve large-scale queries and then relinquish these resources back to a pool to be used by other projects. This way, Scanner would not need to purchase massive compute resources to burst to the maximum capacity we need and then let them sit idle for most of the time. However, an ideal solution would be able to call upon these resources immediately to respond to a user’s query. Users should not need to wait minutes for the needed compute to spin up.
How close can you get to the ideal solution?
Given that we have sketched out the shape of an ideal solution, what could we concretely do to implement something close to the ideal?
Here are the decisions we made for Scanner’s security data lake design.
Use inexpensive object storage – like S3
Building and operating a fleet of 30,000 magnetic hard drives with 16 TB of capacity each, while handling data redundancy, durability, and availability guarantees, would be a brutal engineering task. Thankfully, object storage services like AWS S3 have implemented exactly this for the world, and they can handle the scale we need at reasonably low cost.
Use VPC endpoints to reduce data transfer costs to zero
Customers already have considerable log data in S3. Instead of requiring them to push that data over the internet and incur massive egress fees, we can analyze it directly in the customers’ S3 buckets. Customers can create an IAM role giving Scanner permission to read their S3 data. Since we use VPC endpoints to communicate with S3 in the same region as the customers’ buckets, data transfer cost is zero. Scanner supports common data lake file formats like JSON, CSV, and Parquet, making integration easy.
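As a rough sketch of what that read path can look like with the AWS SDK for Rust (the bucket name and key here are hypothetical, and this is not Scanner’s actual code):

```rust
use aws_config::BehaviorVersion;
use aws_sdk_s3 as s3;

#[tokio::main]
async fn main() -> Result<(), s3::Error> {
    // Credentials come from the cross-account IAM role the customer created
    // with read-only access to their log buckets.
    let config = aws_config::load_defaults(BehaviorVersion::latest()).await;
    let client = s3::Client::new(&config);

    // Hypothetical bucket and key. Because the request goes through an S3
    // VPC endpoint in the same region as the bucket, there is no egress fee.
    let object = client
        .get_object()
        .bucket("customer-security-logs")
        .key("cloudtrail/2024/01/01/events.json.gz")
        .send()
        .await?;

    let bytes = object.body.collect().await.expect("failed to read body").into_bytes();
    println!("read {} bytes of compressed logs", bytes.len());
    Ok(())
}
```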
Maintain an index in S3 that supports massively parallel map-reduce
Even with an efficient index, we will need to scan a lot of data sometimes, and we would like to be able to do that massively in parallel to reduce latency for users. A natural way to support this is to break the index data into separate pieces that are amenable to map-reduce operations. Instead of traversing a single large index sequentially, we would like to be able to traverse many small indices in parallel and rapidly aggregate the final results to send to the user.
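Here is a minimal sketch of that shape using the rayon crate for the parallel map step; the types are hypothetical stand-ins for real index pieces:

```rust
use rayon::prelude::*;

/// Hypothetical handle to one small, independently searchable index piece.
struct IndexPiece {
    s3_key: String,
}

/// Partial result produced by scanning a single index piece.
#[derive(Default)]
struct PartialHits {
    matching_events: u64,
}

fn search_piece(piece: &IndexPiece, term: &str) -> PartialHits {
    // A real implementation would fetch the piece from S3 and scan it;
    // this stub just stands in for that work.
    let _ = (&piece.s3_key, term);
    PartialHits::default()
}

fn search(pieces: &[IndexPiece], term: &str) -> PartialHits {
    pieces
        .par_iter()                         // map: scan index pieces in parallel
        .map(|piece| search_piece(piece, term))
        .reduce(PartialHits::default, |a, b| PartialHits {
            matching_events: a.matching_events + b.matching_events,
        })                                  // reduce: aggregate partial results
}
```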
Additionally, these small indices need to be easy to merge together into larger indices as we encounter new log data from users. Given the scale, it should be possible to maintain these indices without needing to load them entirely into memory.
If we use sorted string term indices, range-based numerical indices, and data sketches that use a constant amount of memory, we can merge smaller indices together into larger indices quickly while using a roughly constant amount of memory.
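For instance, two sorted term lists can be merged in a single streaming pass. In this simplified sketch the merged output is collected into a vector, but a real implementation would stream it straight back to object storage so that memory use stays roughly constant:

```rust
use std::cmp::Ordering;

/// Merge two sorted, deduplicated term lists into one sorted list in a
/// single pass. Only two cursors are needed, so the merge never has to
/// hold more than the current entry from each input in flight.
fn merge_sorted_terms(a: &[String], b: &[String]) -> Vec<String> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::with_capacity(a.len() + b.len());
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => { out.push(a[i].clone()); i += 1; }
            Ordering::Greater => { out.push(b[j].clone()); j += 1; }
            Ordering::Equal => { out.push(a[i].clone()); i += 1; j += 1; }
        }
    }
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}
```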
Launch serverless functions to traverse indices rapidly during queries
Given the bursty compute and network capacity needed to process user queries, it seems reasonable to use serverless functions to quickly spin up the resources we need and immediately spin them back down again when finished. These serverless functions can traverse index files massively in parallel, finishing queries as quickly as possible.
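A hedged sketch of what one such worker could look like as a Rust Lambda function using the lambda_runtime crate; the event shape is hypothetical and the actual index traversal is stubbed out:

```rust
use lambda_runtime::{run, service_fn, Error, LambdaEvent};
use serde::{Deserialize, Serialize};

/// Hypothetical event: which index files this worker should traverse
/// and the term it should look for.
#[derive(Deserialize)]
struct QueryTask {
    index_keys: Vec<String>,
    term: String,
}

#[derive(Serialize)]
struct QueryTaskResult {
    matching_chunks: u64,
}

async fn handler(event: LambdaEvent<QueryTask>) -> Result<QueryTaskResult, Error> {
    let task = event.payload;
    // A real worker would fetch each index file from S3 and scan it for
    // `task.term`; this stub just reports how many files it was assigned.
    let _ = &task.term;
    Ok(QueryTaskResult {
        matching_chunks: task.index_keys.len() as u64,
    })
}

#[tokio::main]
async fn main() -> Result<(), Error> {
    // Many copies of this function run concurrently, each traversing its
    // own slice of the index; the caller aggregates the partial results.
    run(service_fn(handler)).await
}
```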
Depending on the region, each AWS account is allowed to burst between 500 and 3000 concurrent Lambda functions. To ensure that each customer has access to sufficient compute capacity for large queries, and to ensure no customer can interfere with the compute capacity of another customer, creating a dedicated AWS account per customer seems like a reasonable design choice. This has the added benefit that customer environments are strongly isolated from one another.
Use dedicated compute to index new log data
While reading log data is relatively infrequent, writing log data happens continuously at high scale. Because the indexing workload is steady rather than bursty, dedicating continuously running compute resources to index customer logs is cheaper than bursting serverless functions for it. Generating index files uses a decent amount of CPU but a relatively low amount of memory. Thus, we can use highly optimized containers with high CPU capacity and low memory capacity to perform indexing, and costs remain reasonably low.
Use a systems programming language for efficiency – like Rust
Given the scale of the data sets, it would be helpful to use a systems programming language like C, C++, or Rust (possibly Go) to ensure that we can push CPU and RAM hardware to reach maximum efficiency. Since we sometimes need to run a massive amount of parallel compute across many cores, Rust’s thread-safety and memory-safety properties make it an appealing choice to build our application.
Would the design seem reasonable to an advanced civilization?
As a final check to see if our security data lake design is reasonable, we found it helpful to imagine what an advanced civilization would think if they looked at the design from the outside.
To give a concrete example, here is a common data flow that you see for traditional cloud-based log tools and SIEMs. When viewed from the outside, it seems like there is a huge amount of unnecessary duplicate work:
- Logs are generated by servers in an AWS datacenter.
- The servers push the logs over the datacenter’s network to disks in S3. This is a cheap operation.
- The servers also push the logs out of the datacenter over the internet to send them to a cloud-based log tool or SIEM. This is an expensive operation.
- The logs leave the datacenter network and make dozens of hops between routers on the internet, traveling across miles of wire.
- The logs enter another datacenter, or maybe the same datacenter where they originated, hitting a load balancer that expends a lot of energy routing them.
- The logs are buffered on several disks across a fleet of servers working very hard (e.g., Kafka), consuming significant CPU and networking capacity.
- The logs are pulled from these servers and pushed to another fleet of servers, which load them into indices (e.g., Elasticsearch). The logs are redundantly duplicated again across many disks, which requires a huge amount of CPU and networking capacity.
- Eventually, the logs are pushed into cold storage, where they go back onto disks in S3 in the datacenter. They’ve basically arrived where they started.
Scanner’s approach is simpler than the traditional one and gets rid of the unnecessary duplicate work:
- Logs are generated by servers in an AWS datacenter.
- The servers push the logs over the datacenter’s network to disks in S3. This is a cheap operation.
- Other servers in the same datacenter read those logs over the datacenter’s network to generate index files. This is a cheap operation since there is no egress over the internet.
- Those servers send small index files back to those S3 disks over the datacenter’s network.
- When a query occurs, a burst of idle compute in the datacenter is temporarily allocated to traverse the index files rapidly.
- A minuscule amount of data is pushed over the internet to the user’s web browser to render the query results.
We think Scanner’s approach looks more reasonable from the outside than the traditional approach, which wastes a lot of energy routing data over the internet and buffering data on disks multiple times before it returns to where it started.
Pushing hardware to fundamental limits can be highly useful
By imagining an ideal solution, it becomes easier to see how to build your software such that it pushes your hardware to its fundamental limits. As a result, your customers can benefit from dramatically reduced costs or increased speed, or both.
In our case, running Scanner’s security data lake tool on top of a petabyte of logs in your S3 buckets will cost an order of magnitude less than other log tools, and you can avoid all data egress costs. Query performance remains high, with queries completing in seconds instead of the minutes or hours other data lake scanning tools can take.
I encourage everyone to spend more time getting familiar with the capabilities of their hardware and learn what is happening beneath the abstractions. It takes work, but it can be highly worthwhile for your team and your users.
Scanner is a security data lake platform that supercharges security investigations with fast search and detections for petabyte-scale log data sets in AWS S3. It’s 100x faster than Athena and 10x cheaper than traditional tools like Splunk and DataDog.
Scanner can be deployed into your own AWS account or into an AWS account managed by Scanner with read-only permissions to the logs in your S3 buckets. This zero-cost data transfer gives users full control over their data with no vendor lock-in and avoids log shipping over the public internet.
Cliff Crosland
CEO, Co-founder
Scanner, Inc.
Cliff is the CEO and co-founder of Scanner.dev, a security data lake product built for scale, speed, and cost efficiency. Prior to founding Scanner, he was a Principal Engineer at Cisco where he led the backend infrastructure team for the Webex People Graph (previously called Accompany before its acquisition by Cisco). He has a love-hate relationship with Rust, but it’s mostly love these days.