A Deep Dive Into An Incident Response With Your Security Data Lake


As almost all security teams will tell you, managing logs can be quite expensive, with common tools like Splunk and Datadog frequently becoming a top five budget item for the team. To reduce costs, teams sometimes move their logs into a data lake built on top of cheap object storage, like S3, and use tools like Amazon Athena to query them. However, the data lake user experience is usually fairly slow and cumbersome: queries over large time ranges can take tens of minutes. We believe that data lakes can be extremely fast if indexed effectively. This is why we built Scanner.

In this post, we will look at the experience of using different tools when investigating a security incident, and we will discuss why we think Scanner’s search improves on the status quo in a significant way.

Problem: Investigating the impact of a leaked secret

For our incident example, let’s say that your company is using GitHub Enterprise and has recently enabled secret scanning for all of your internal repositories. You receive an alert indicating that an AWS access key was accidentally committed to one of your company’s internal repositories and has been sitting there in plain sight for twelve months. Many employees at your company can see this credential, and if any of them is malicious or has been compromised by an attacker, your AWS infrastructure could be at risk.

After invalidating the AWS access key and ensuring the internal repository is updated, it is time to begin investigating your logs to assess what damage has been done.

We will briefly walk through what this investigation looks like with each of the following tools:

  • Splunk Cloud
  • Amazon Athena
  • Scanner

Splunk Cloud

tl;dr: fast on recent logs, but expensive and archive restoration is slow

Let us say that your team uses Splunk Cloud for log management. Since you generate one terabyte of logs per day, this tool will become quite expensive. The AWS Marketplace lists Splunk Cloud at $80k per year for a volume of 100GB per day, and assuming that pricing scales somewhat sub-linearly with volume, Splunk Cloud may cost your team somewhere between $400k and $600k per year. Splunk pricing varies widely between customers, so it is hard to say for sure what the final price will be, but given examples we have seen firsthand, this price range is fairly common.

It is important to note, in the context of our incident example, that Splunk Cloud’s default searchable retention period is 90 days; logs older than that window are archived to a cold storage location like S3.

Let us say that you begin your incident investigation by querying the Splunk index containing your AWS CloudTrail logs, searching for the leaked access key by copy-pasting it into the search box. There are no results. The vulnerability has been present for roughly twelve months, but since Splunk Cloud only retains 90 days of logs in its searchable index, you cannot conclude yet that no malicious activity has occurred using the credential.

Hence, you start restoring your Splunk cold storage logs to continue your investigation into older data. Unfortunately, the restored data size is limited to 10% of your searchable storage entitlement, so you can only restore roughly 9 days of cold storage logs at a time. Each time you restore 9 days of logs, you need to wait for the restoration to complete, which can take up to 24 hours. If you want to thoroughly scan all of your historical logs, you will need to run roughly 30 restorations to cover the remainder of the twelve month window during which the vulnerability was present.

Thus, if you want to query all of your Splunk Cloud cold storage logs to guarantee that all activity related to the leaked credential has been scanned, this process could take around 30 days.
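To make that arithmetic concrete, here is a quick back-of-the-envelope sketch in Python, assuming the figures above (one terabyte per day, 90 days of searchable retention, a restore entitlement of 10% of searchable storage, and up to 24 hours per restoration):

    # Rough math for restoring Splunk cold storage during this investigation.
    # Assumptions: 90 days searchable, restores limited to 10% of the
    # searchable entitlement, and up to 24 hours of waiting per restoration.
    searchable_days = 90
    restore_window_days = searchable_days * 0.10      # ~9 days per restoration
    cold_storage_days = 365 - searchable_days         # ~275 days only in cold storage
    restore_cycles = cold_storage_days / restore_window_days
    worst_case_wait_days = restore_cycles * 1          # up to one day of waiting per cycle

    print(f"{restore_cycles:.1f} restorations needed, up to {worst_case_wait_days:.0f} days of waiting")
    # => roughly 30 restorations and about a month of waiting in the worst case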

It’s clear that, while Splunk’s query performance on recent searchable data is excellent, the search process for archived logs is fairly brutal, and the tool is expensive when you reach high daily log volume. For these reasons, we think that Splunk Cloud does not provide an optimal user experience for this type of incident investigation.

Amazon Athena

tl;dr: reasonable cost, but investigation is slow

Let us say that your team also maintains a data lake in S3 where logs are stored for one year, and you interact with this data lake using Amazon Athena. If you are generating one terabyte of logs per day and retaining them for one year, your data lake will contain 365 TB of uncompressed data. If the data is compressed with zstd or gzip, these logs might consume 30-40 TB of storage space in S3. Using S3 Standard storage pricing of $0.023 per GB per month, the annual storage cost for your data lake will be around $10k. Depending on how much you query with Athena, which costs $5 per TB scanned, your S3 data lake costs will likely land in the tens of thousands of dollars per year, which is an order of magnitude lower than the cost of Splunk Cloud. These significant cost savings make data lakes incredibly appealing.
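Here is a small sketch of that cost math, using the assumptions above (the exact compressed size and query mix will vary):

    # Rough annual S3 storage cost for the data lake, assuming 1 TB/day of
    # uncompressed logs that compress down to roughly 30-40 TB with zstd or gzip.
    compressed_tb = 35                            # midpoint of the 30-40 TB estimate
    s3_standard_per_gb_month = 0.023              # S3 Standard pricing
    annual_storage_cost = compressed_tb * 1024 * s3_standard_per_gb_month * 12
    print(f"~${annual_storage_cost:,.0f} per year of S3 storage")      # => ~$9,900

    # Athena charges $5 per TB scanned, and it scans the compressed bytes,
    # so a query that reads one month of this data costs roughly:
    per_query_cost = (compressed_tb / 12) * 5
    print(f"~${per_query_cost:.0f} per month-sized query")             # => ~$15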

Now, let’s get back to the incident investigation. Let’s say that you dig through a few dozen tables in Athena until you identify the one that contains your AWS CloudTrail logs. Because Athena’s tabular schemas are strict and there is no free-text search interface, you may need to spend a fair amount of time (perhaps ten minutes if you are unfamiliar with the tables) experimenting with SQL queries before you construct the one you need.

Your data set is quite large at roughly 365 terabytes uncompressed, so a query over the full range will likely run for 30 minutes and then time out before it can scan the entire data set. At your scale, you will likely need to limit the search space to roughly one month of logs per query. Hence, to cover the full year of logs, you will need to run twelve queries, with each query executing for roughly 30 minutes and costing around $10 – $15.
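As a rough illustration, one of those month-scoped Athena queries might look something like the following. The table name, partition columns, database, and results bucket are all hypothetical placeholders; your CloudTrail table schema may differ.

    import boto3

    # Hypothetical month-scoped Athena query searching CloudTrail logs for the
    # leaked access key. All identifiers below are illustrative placeholders.
    LEAKED_ACCESS_KEY = "AKIA...EXAMPLE"

    sql = f"""
        SELECT eventtime, eventname, sourceipaddress, useridentity
        FROM cloudtrail_logs
        WHERE useridentity.accesskeyid = '{LEAKED_ACCESS_KEY}'
          AND year = '2023' AND month = '05'
    """

    athena = boto3.client("athena")
    query = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "security_logs"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    # Poll athena.get_query_execution(QueryExecutionId=query["QueryExecutionId"])
    # until the query finishes, then page through get_query_results().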

If you execute the queries sequentially, this could take around six hours to complete and cost $120 – $180. This is far faster than the tedious 30 day archive restoration process with Splunk, but to our team at Scanner, this user experience is still not good enough.

Let’s say that, after six hours, your queries discover that some suspicious AWS API calls using the leaked credential occurred eleven months ago, one month after the credential was leaked. Someone used the credential to create three IAM users and two IAM roles. You want to investigate how these IAM users and roles have been used, and you are obviously not looking forward to executing more slow Athena queries against your CloudTrail logs. Each line of investigation could take several hours to complete.

Scanner

tl;dr: reasonable cost, fast search for all logs, not just recent ones

Now, let’s say your team is using Scanner on top of your S3 data lake. Scanner indexes the log files in your S3 data lake and provides much faster search. Since it uses inexpensive S3 storage and serverless querying, Scanner’s cost is comparable to Amazon Athena’s cost, which is 10x lower than Splunk’s.

Here is how Scanner’s data lake search works and why we think it improves the incident investigation experience dramatically.

Scanner’s skip-list index

Going back to the incident example, Scanner continuously indexes your team’s CloudTrail logs in S3, creating skip-list index files which are also stored in S3. These index files are fairly compact because Scanner maintains a small coarse-grained index instead of a voluminous fine-grained index. Concretely, instead of mapping from each string token to each individual log event where the token occurs, the index maps from each string token to each chunk of roughly 64,000 log events where the token occurs. A typical fine-grained index is larger than the original data set (see Elasticsearch, for example), whereas Scanner’s coarse-grained index is much more compact at roughly 5-10x smaller than the original data set.

The trade-off is that Scanner’s coarse-grained skip-list index can only reduce the search space to some number of large chunks of logs that must each be scanned to look for hits – the index does not point to the hits directly.
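As a toy sketch of the idea (purely illustrative; Scanner’s actual index format and chunking are more sophisticated), a coarse-grained token index might look like this:

    from collections import defaultdict

    CHUNK_SIZE = 64_000   # roughly how many log events share a single index entry

    def build_coarse_index(log_events):
        """Map each token to the set of chunk IDs in which it appears,
        rather than to individual log events."""
        index = defaultdict(set)
        for i, event in enumerate(log_events):
            chunk_id = i // CHUNK_SIZE
            for token in event.split():
                index[token].add(chunk_id)
        return index

    def candidate_chunks(index, token):
        """Return the chunks that still need to be scanned event-by-event;
        the coarse index narrows the search but never points at hits directly."""
        return sorted(index.get(token, set()))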

Nevertheless, this approach still gives remarkably fast results.

Back to the investigation. Let’s say that you type a simple query in Scanner, as simple as the one you used in Splunk. In fact, you just copy-paste the leaked AWS access key into the search box, select a twelve month time range, and submit the query.

Concurrent serverless querying

When the query is executed, Scanner launches one thousand instances of an optimized, Rust-based Lambda function to traverse the skip-list index files in parallel. Within three seconds, the functions narrow the search space down to a few dozen chunks of roughly 64,000 log events each. These chunks are distributed across the Lambda functions and scanned concurrently. To scan a chunk, each Lambda function decompresses it with zstd and decodes it with bincode, a compact binary serialization format that is highly optimized for Rust data structures. This parallel chunk scan completes in roughly two more seconds.

Hence, the total time to finish the query over your full data set is only five seconds, which is a dramatic improvement over Splunk Cloud and Amazon Athena.
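Here is a simplified Python sketch of that fan-out-and-scan step, using a thread pool and JSON-encoded chunks in place of Scanner’s Lambda functions and bincode-encoded Rust structures (the zstandard package stands in for the zstd decompression step; all names here are illustrative):

    import json
    from concurrent.futures import ThreadPoolExecutor

    import zstandard   # pip install zstandard

    def scan_chunk(compressed_chunk, needle):
        """Decompress one chunk of ~64,000 log events and scan it for the needle.
        Scanner's Lambdas do this with zstd + bincode; JSON stands in here."""
        raw = zstandard.ZstdDecompressor().decompress(compressed_chunk)
        events = json.loads(raw)
        return [event for event in events if needle in json.dumps(event)]

    def parallel_scan(candidate_chunk_blobs, needle, workers=32):
        """Fan candidate chunks out across workers and merge the hits, mimicking
        the 1,000-way Lambda fan-out described above (threads stand in for Lambdas)."""
        with ThreadPoolExecutor(max_workers=workers) as pool:
            per_chunk_hits = pool.map(lambda blob: scan_chunk(blob, needle), candidate_chunk_blobs)
        return [hit for hits in per_chunk_hits for hit in hits]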

Unlocking a new, rapid workflow for querying your data lake

Because Scanner is so fast, you do not hesitate to query your data lake immediately again. This unlocks a new data lake workflow where you can run query after query, making rapid progress in your investigations. This is quite different from Amazon Athena’s workflow, which frequently consists of waiting several minutes for each query to finish whenever you need to analyze a large percentage of your data lake.

Now, let’s get back to the investigation. The results of your first query show suspicious API calls from about eleven months ago, where the credential was used to create three IAM users and two IAM roles.

Since you received search results in only a few seconds, you continue to query again and again, making rapid progress.

Simple queries

Let’s say that you aren’t yet familiar with the CloudTrail log format, and you want to find all activity involving the three maliciously created IAM users and the two IAM roles. Thankfully, you can copy-paste the ARNs of the IAM users and roles from your earlier results. Executing this simple query returns the results you need:

    "<iam_account_arn_1>" or "<iam_account_arn_2>" or "<iam_account_arn_3>"
      or "<iam_role_arn_1>" or "<iam_role_arn_2>"

This illustrates that, unlike with Amazon Athena, you do not need to be familiar with your table schemas to craft a query. In Scanner, you can query your data almost as easily as you can query in a search engine like Google.

The new query completes in roughly five seconds again, yielding some more interesting hits. Let’s say that you notice some failed attempts to create IAM policies as well as some failed operations to read data in your S3 buckets.

While you are grateful that the attacker’s attempts to infiltrate AWS seem to have failed, you are now worried that the attacker could have compromised an employee’s account and attempted to access other systems.

You notice in the search results that all of the suspicious CloudTrail activity came from a single source IP address. To determine which employee account is being used by the attacker, you decide to look for activity from this IP address in your Okta auth logs.

Sophisticated queries when you need them

You know that your Okta logs have a field called actor.alternateId, which contains the email address of the user. You decide to look for all of the employee email addresses in Okta log events that contain the suspicious IP address.

    %ingest.source_type: "okta:system" and "<ip_address>"
    | groupbycount actor.alternateId

In the results, you see that all activity from the malicious IP address contains the same actor.alternateId value, which is an employee email address.

Scanner produces aggregation results by sending partial results from each query Lambda function to a monoid data structure server. This server can handle very large result sets that can’t fit in memory by making use of probabilistic data structures, like HyperLogLog, returning useful statistics with only around 1-2% error. (The monoid server will be a fun topic for a future blog post.)
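To illustrate the monoid idea with a toy example: partial results only need an identity element and an associative merge operation, so each Lambda can aggregate its own chunks and the server can combine them in any order. Below is a minimal sketch with exact counters; Scanner’s server substitutes probabilistic structures like HyperLogLog when the result set is too large for memory.

    from collections import Counter

    def partial_group_by_count(events, field):
        """Each query Lambda produces a partial aggregation over its own chunks."""
        return Counter(event.get(field) for event in events)

    def merge_partials(partials):
        """Counter addition has an identity (the empty Counter) and is associative,
        so partial results can be merged in any order -- the monoid property."""
        total = Counter()
        for partial in partials:
            total += partial
        return total

    # e.g. merge_partials(partial_group_by_count(chunk, "actor.alternateId")
    #                     for chunk in scanned_chunks)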

Given the results of the query above, you decide to query your Google Workspace logs to determine whether this user has executed any suspicious actions within your company’s Google Workspace.

    %ingest.source_type: "google:gcp:pubsub:message" and "<email_address>"
    | groupbycount protoPayload.methodName

You see that the malicious actor has executed several different Google methods, including google.admin.AdminService.changePassword. It seems that an attacker has changed the employee’s Google password, which is a decent indicator of compromise. Hence, you determine that the company’s Google Workspace may have been affected.

After disabling the compromised employee’s Google account, you continue using Scanner to dig through the Google logs to see whether the compromised user created any other suspicious Google accounts, whether there was any document data extraction activity, and so on. There are many threads of investigation you can explore quickly across your twelve months of logs.

Because of the ability to rapidly interact with lots of data, we think Scanner’s user experience is far better for this kind of investigation than the status quo.

This new workflow removes blind spots

The investigation above would probably take between fifteen and thirty minutes in Scanner, but with Athena it would likely take somewhere between several hours and a few days. With Splunk, many users would probably give up because the archive restoration cycle is so slow, at up to 24 hours per restoration.

Because analyzing historical logs with tools like Athena and Splunk involves significant friction, teams often have large blind spots in their historical data. Scanner returns results quickly for searches that span the full data lake, including all of your historical data, letting users run many steps of an investigation rapidly and removing these blind spots.

In future posts, we will dig into more technical detail about how Scanner’s indexing, search, and aggregations work.

We believe the status quo in security logging can be improved dramatically by making data lakes much faster. Although Scanner’s development is still early, we think it is already having a meaningful impact on the teams that use it, and we are excited to see all of the new workflows it unlocks.


Scanner is a security data lake tool designed for rapid search and detection over petabyte-scale log data sets in AWS S3. With highly efficient serverless compute, it is 10x cheaper than tools like Splunk and Datadog for large log volumes and 100x faster than tools like Amazon Athena.

Cliff Crosland
CEO, Co-founder
Scanner, Inc.

Cliff is the CEO and co-founder of Scanner.dev, a security data lake product built for scale, speed, and cost efficiency. Prior to founding Scanner, he was a Principal Engineer at Cisco where he led the backend infrastructure team for the Webex People Graph. He was also the engineering lead for the data platform team at Accompany before its acquisition by Cisco. He has a love-hate relationship with Rust, but it’s mostly love these days.