Why We’re Building Scanner: Data Lake Search Must Be Fast


At our previous startup, our application and security logs experienced a rapid increase in volume – and so did our log management bill.

We didn’t want to spend $1M per year on Splunk or other traditional logging tools, so we started deleting logs or moving them into S3. This made investigations painfully difficult. Sometimes the data we wanted to query was missing, and S3 was too slow to search.

We found that, by indexing data in S3 in an intelligent way, we could provide extremely fast search across massive data sets at low cost: for example, 5 seconds, not 50 minutes, to find all activity from a set of IP addresses in 100TB of logs across all data sources. Cyber threats often take months to be detected, and most teams don’t have access to their historical data because it’s exorbitantly expensive to store and access at scale. This forces companies to make a critical trade-off: pay a fortune and make sense of all their data, or selectively retain some of it over a shorter period of time and suffer from blind spots.

That’s why we built Scanner: a security data lake platform that supercharges security investigations with fast search and detections for petabyte-scale log data sets in AWS S3. We believe everyone should be able to store logs at massive scale and low cost, and search over any period of time, without sacrificing speed. Your tools should scale with your data. No security team should have to live with blind spots.

A new way to index data lakes

With log volumes increasing every year, it is getting more and more difficult to store indexes across a large cluster of servers, like Splunk and Elasticsearch do, due to replication bottlenecks and high storage costs. To keep up with scale, we believe that indexes should now be stored in a data lake – in storage like S3 – which is far easier and cheaper to scale.

Accordingly, for an index to operate on top of a data lake, we need to reimagine the way an index should work. Instead of taking the traditional approach where the index contains an entry mapping each token to every matching log event, Scanner’s index maps each token to every region of log events where the token appears.

This makes Scanner’s index files quite small and easy to maintain at scale. While traditional indexes are typically multiple times larger than the original data set, Scanner’s index files are much smaller: only 15% of the size of the original data set.
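To make the idea of a region-level index concrete, here is a minimal sketch in Python. It is not Scanner's actual implementation or file format: the tokenizer, the `REGION_SIZE` constant, and the function names are all assumptions made for illustration. It only shows the core difference from a per-event index: each token points at the regions that contain it, not at individual log events.

```python
import re
from collections import defaultdict

REGION_SIZE = 3  # hypothetical: number of consecutive log events per region


def tokenize(line):
    """Split a log line into lowercase tokens (a simplistic, assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9_.]+", line.lower()))


def build_region_index(lines, region_size=REGION_SIZE):
    """Map each token to the set of region ids in which it appears."""
    index = defaultdict(set)
    for i, line in enumerate(lines):
        region_id = i // region_size  # consecutive events share a region
        for token in tokenize(line):
            index[token].add(region_id)
    return dict(index)


logs = [
    "login ok user=alice ip=10.0.0.1",
    "login ok user=bob ip=10.0.0.2",
    "login failed user=alice ip=10.0.0.1",
    "logout user=bob ip=10.0.0.2",
    "login ok user=carol ip=10.0.0.3",
    "logout user=carol ip=10.0.0.3",
]
index = build_region_index(logs)
```

In this toy data set, a token that appears in many events of the same region still produces only one entry for that region, which is why an index of this shape can stay small relative to the logs it covers.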

Demo of Scanner Search Speed vs. Amazon Athena

Here’s what makes Scanner fast and useful

Scanner might be the fastest data lake search tool on the planet for ad-hoc searches on heterogeneous log data. Security investigations against archived logs sometimes require a lot of rehydration work – which can take days – but Scanner can search through these archives in just a few seconds. Here are some of the features that make our tool particularly good at helping security teams run investigations quickly.

Rapid, serverless search

A Scanner query uses the index files to narrow down the set of log regions to scan, and then it scans these regions to find hits. For example, when you search for an IP address across 100TB of logs in S3, Scanner will reduce the search space to the regions containing hits, which means you only need to scan a few gigabytes of logs. In Amazon Athena, this query may take tens of minutes to run; in Scanner, it takes only a few seconds.
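The two-phase query described above can be sketched roughly as follows. This is an illustrative Python example under stated assumptions, not Scanner's query engine: the in-memory `index` and `regions` structures, the `search` function, and the plain substring match are all hypothetical stand-ins for index files and log data stored in S3.

```python
def search(token, index, regions):
    """Return (hits, regions_scanned) for a single-token query.

    Phase 1 consults the index to pick candidate regions.
    Phase 2 scans only those regions for matching lines.
    """
    candidate_ids = sorted(index.get(token, set()))
    hits = []
    for region_id in candidate_ids:
        for line in regions[region_id]:
            # Plain substring match: a simplification for this sketch.
            if token in line:
                hits.append(line)
    return hits, len(candidate_ids)


# Hypothetical data: three regions of log lines and a token -> region ids index.
regions = [
    ["GET /home ip=10.0.0.1", "GET /about ip=10.0.0.2"],
    ["POST /login ip=10.0.0.3", "GET /home ip=10.0.0.2"],
    ["GET /admin ip=10.0.0.9", "GET /home ip=10.0.0.1"],
]
index = {
    "10.0.0.1": {0, 2},
    "10.0.0.2": {0, 1},
    "10.0.0.3": {1},
    "10.0.0.9": {2},
}

hits, scanned = search("10.0.0.9", index, regions)
```

Here only one of the three regions is read to answer the query. The same pruning step is what lets a search over a very large data set touch only a small fraction of it.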

To make querying fast, Scanner invokes Rust-based AWS Lambda functions to traverse index files at speeds of up to 1TB per second. Costs are kept low because there is no need to maintain idle compute for querying – Lambda capacity can burst from zero to 1000 concurrent functions in tens of milliseconds.
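The fan-out pattern behind this kind of serverless search can be illustrated with a small Python sketch. It makes no AWS calls: local threads stand in for Lambda invocations, and `scan_chunk`, `fan_out_search`, and the sample data are hypothetical names introduced only for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_chunk(chunk, needle):
    """Worker: return the lines in one chunk that contain the needle."""
    return [line for line in chunk if needle in line]


def fan_out_search(chunks, needle, max_workers=8):
    """Scan all chunks concurrently and merge the results in chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each chunk is handled by its own worker, mimicking one function
        # invocation per slice of the search space.
        results = pool.map(lambda c: scan_chunk(c, needle), chunks)
        return [line for chunk_hits in results for line in chunk_hits]


chunks = [
    ["a ip=10.0.0.1", "b ip=10.0.0.2"],
    ["c ip=10.0.0.3"],
    ["d ip=10.0.0.1"],
]
matches = fan_out_search(chunks, "10.0.0.1")
```

The point of the design is that workers exist only for the duration of a query, so nothing sits idle between searches.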



Logs are indexed in-place in your S3 buckets

When a Scanner instance is deployed, compute resources for indexing and querying are instantiated in a new AWS account in your region. The instance uses IAM permissions to read the logs in your S3 buckets and write index files into a new S3 bucket in your account. Your logs are stored in your environment – in your S3 buckets. No need to ship them over the internet to a SaaS vendor, so there is no vendor lock-in.


Detection queries run continuously as logs are indexed

You can configure Scanner to run queries continuously on logs that it indexes. This allows you to generate metrics and – most importantly – trigger alerts when threat indicators are detected. Alerts can be sent to SOAR tools like Tines or Torq, or to Slack, PagerDuty, Jira, or other HTTP-based endpoints.
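As a rough illustration of how a detection might be evaluated against each newly indexed batch of logs, consider the Python sketch below. The rule format, field names, threshold, and `run_detection` function are assumptions for this example and do not reflect Scanner's actual detection rule syntax. The sketch stops at building a JSON payload; delivery to an HTTP-based endpoint is left out.

```python
import json


def run_detection(rule, batch):
    """Return an alert dict if enough events in the batch match, else None."""
    matches = [event for event in batch if rule["predicate"](event)]
    if len(matches) < rule["threshold"]:
        return None
    return {"rule": rule["name"], "count": len(matches), "events": matches}


# Hypothetical rule: alert on three or more failed logins in one batch.
rule = {
    "name": "repeated_failed_logins",
    "predicate": lambda e: e.get("action") == "login" and e.get("outcome") == "failure",
    "threshold": 3,
}

# Hypothetical batch of newly indexed, already-parsed log events.
batch = [
    {"action": "login", "outcome": "failure", "ip": "10.0.0.9"},
    {"action": "login", "outcome": "success", "ip": "10.0.0.2"},
    {"action": "login", "outcome": "failure", "ip": "10.0.0.9"},
    {"action": "login", "outcome": "failure", "ip": "10.0.0.9"},
]

alert = run_detection(rule, batch)
# JSON body that could be POSTed to a webhook-style endpoint.
payload = json.dumps(alert) if alert is not None else None
```

Running the rule as each batch is indexed, instead of on a schedule over the whole data set, is what keeps the delay between a log arriving and an alert firing short.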

Common security log sources are auto-pulled into S3 for you

If you have security logs that you want to pull into your S3 buckets, there is a good chance Scanner has first-class support for the source. It can auto-pull logs for Okta, Office 365, Google Workspace, GitHub, Cloudflare, CrowdStrike, and other sources into your S3 buckets.

Data lakes need to be faster

Given the high costs of log tools like Splunk and Datadog, we believe that high-volume security data sources should be stored in low-cost data lake storage like S3, and that S3 storage needs to be rapidly searchable. Indexing needs to be reimagined for the data scale that we live with: no longer gigabytes per day, but terabytes per day.

If this problem sounds familiar to you, feel free to reach out to book a demo. We’ll deploy a Scanner instance for you in your AWS region for a 30-day free trial. We’ll meet with you to make sure IAM permissions are set up correctly and indexing is operating smoothly, so you can get started with extremely fast search on the logs in your S3 buckets. We’ll also work with you to see how we can reduce your Splunk or Datadog bill by redirecting some of your high-volume log sources to S3, where Scanner can provide search at much lower cost.


Scanner is a security data lake platform that supercharges security investigations with fast search and detections for petabyte-scale log data sets in AWS S3. It’s 100x faster than Athena and 10x cheaper than traditional tools like Splunk and Datadog.

Scanner can be deployed into your own AWS account or into an AWS account managed by Scanner with read-only permissions to the logs in your S3 buckets. Because your logs stay in your S3 buckets, there is no data transfer cost, you keep full control over your data with no vendor lock-in, and no logs are shipped over the public internet.

Cliff Crosland
CEO, Co-founder
Scanner, Inc.

Cliff is the CEO and co-founder of Scanner.dev, a security data lake product built for scale, speed, and cost efficiency. Prior to founding Scanner, he was a Principal Engineer at Cisco where he led the backend infrastructure team for the Webex People Graph. He was also the engineering lead for the data platform team at Accompany before its acquisition by Cisco. He has a love-hate relationship with Rust, but it’s mostly love these days.