Ready, Set, Cloud! Podcast: How Scanner Built an Ultra-Fast Serverless Data Lake

Podcast Audio

Scanner CEO and Co-founder, Cliff Crosland, joins Ready, Set, Cloud! Podcast host, Allen Helton, for a conversation about how and why we built Scanner’s security data lake, Rust, serverless Lambda functions, and goats.

 

Episode Summary

Have you ever wondered why querying your data lake is so slow? Or, if you’re like Allen, did you ever wonder what a data lake actually is? Join Cliff Crosland as he explains how the Scanner team has changed data lakes forever by going serverless. This episode is a showcase of some brilliant engineering to solve a problem in a serverless manner.

Allen Helton

You are listening to the Ready, Set, Cloud podcast, a show about trending and difficult topics in serverless and in the cloud. Today we’re gonna dive into a serverless showcase app called Scanner. This application has taken full advantage of serverless services to turn data lakes as we know them on their head. By using the fast scaling properties of Lambda, Scanner can run petabyte-scale queries on your data lake in seconds. I brought Cliff Crosland on the show to talk about how he built it and give us a few more details on what a data lake actually is. Ready? Set. Let’s go.


Allen Helton

There are a few words in modern computing that make my eyes gloss over when I hear ’em. Kubernetes and Web 3 are a couple of examples that pop into my head right away. I know what they are, but I don’t really know much about them. Another big one that I’m embarrassed to admit is Data Lake.

I thought it was just a database that had multiple input sources, and it turns out I am very wrong on that. So I brought Cliff Crosland on the show to set the record straight. Cliff, welcome to the show.


Cliff Crosland

Thanks so much, Allen. Great to be on with you.


Allen Helton

It’s great to have you. Now you’re the founder and the CEO at Scanner. So why don’t you tell us a little bit about what it is and what it does, but I would also love to hear about yourself and how you ended up where you are.


Cliff Crosland

Scanner is a really fun project. What we do is we supercharge data lakes, and so we can talk a lot about what a data lake is and what it means to do queries at massive scale. But basically, what Scanner does is it provides really, really fast security investigation search on huge, huge data sets. Security teams often have the requirement to hold around a year or multiple years of data. And if you’re using a really expensive tool like Splunk or Datadog or IBM QRadar, they’re unbelievably expensive because of the old-school, traditional architecture they have. But with serverless tech, which we can get into, there’s a much, much better way to do it, which is using a data lake approach: low-cost, really cheap storage with lots and lots of different data sources in different formats, but made extremely fast.

Usually data lake searching and querying is super slow despite the fact that it’s cheap. But in our case, we’ve discovered a fun way to make the queries unbelievably fast, and we rely a lot on Rust and Lambda functions for that, which we can get into. But yeah, we just try to make querying on huge data sets really, really fast and use really, really low-cost storage for it, like S3, which is pretty common in data lakes. And yeah, I can go into how we got started a little bit. Our story is we were both lead engineers at another startup that was acquired by Cisco, and we had a massive spike in log volume where our Splunk bill grew from about $10,000 a year to $1,000,000 a year.

And so the CFO of the startup was just like, okay, this is not okay, we gotta do something about this. What we ended up doing is we retained about an hour of logs for a while, and that was terrible, because then if you had a problem you’d lose all of the data related to the debugging you were doing or the security logs you were collecting. One hour of retention is clearly crazy. So we ended up dumping logs into S3, which is a very common place to put data lake data. The problem with S3, though, is searching through it is unbelievably slow.

So it’s basically like we moved them into a place where we could store them cheaply, but then it became a huge blind spot. We couldn’t go and search through the logs very quickly at all. At Scanner, my co-founder and I decided, okay, well, what if you could have really, really cheap storage for massive data sets, where you’re generating on the order of one, two, three terabytes a day or more, and still have fast search with long-term retention? That’s where Scanner was born. We played a lot with Rust and Lambda functions and discovered that, okay, we can actually do something really interesting here by changing the way we organize data in S3. But that’s a little bit about what Scanner does and how we got started, and it’s been a lot of fun. The bigger and bigger companies who are working with us are generating like three terabytes, five terabytes a day, which is pretty wild, but they’ve never been able to have fast search on that data.

So it’s been really fun to allow them to have fast search and long-term retention on these massive log data sets without spending millions of dollars a year. We just thought that was totally crazy and something had to be done about it.


Allen Helton

That is so cool. I feel like some of the best products that exist today have emerged from passion projects, which really sounds like the beginnings of Scanner to me. So you explained conceptually what a data lake is, or at least how the data is structured, and I’m still a little bit unclear. The reason I’m unclear is because I think of a data lake as a cloud resource. So in my head I’m lumping it with things like a DynamoDB table or a Lambda function and things kinda like that. Am I thinking about it the wrong way?


Cliff Crosland

That’s a great question. The funny thing about data lakes is there are many different definitions for them, but I’m gonna just give a couple of concrete examples of the different kinds of data lake tools that you might see out there, and maybe it’ll make a little bit more sense. I think the reason why it’s hard to know what a data lake is is because everyone has a different definition. One example is Amazon Security Lake. This is a cool piece of data lake technology, and the idea behind Amazon Security Lake is they pull in logs from many different sources for you.

I think there are like 30 or 40 different vendors that they pull in logs for, and then they store it in S3, and then you can use Amazon Athena to execute queries across these data sets. Another example of a data lake is Databricks. Databricks is a cool company. They have a term they call the data lakehouse, which is a little bit of a blend between a data lake and a data warehouse. A data warehouse is like a giant analytical database where you execute massive-scale queries. An analytical database is going to have massive batch uploads, and then the idea is that your analysts behind the scenes will execute really big queries on this data warehouse data.

I think of a data warehouse as a bunch of different SQL tables, whereas a data lake is different from a data warehouse in that you dump in the data raw. So a data lake might consist of a bunch of different file types in cheap object storage. You might see JSON files in there, CSV files. The idea is you load the data raw into a data lake. There are many different types of data that go into your data lake, and then you run a single tool on top of that data to run queries. And that tool is smart about parsing the different kinds of files that you have in your data lake. I would say also a data lake can be extended beyond just having lots of different file types that you can query in object storage.

So you might have files in S3, you might have some files in Google’s object storage and also Azure Blob Storage, but you also might have connectors to an Oracle database or a MySQL database. And so your data lake is just a mix of lots of different file types, and a data lake tool’s job is to provide a single interface to go and interact with all of those different kinds of data. So you can execute one query, and it will go hit your Oracle database and then join it together with results from scanning JSON in S3, and also join it with data in Kafka. A data lakehouse is basically a blend between the data lake, where it’s just crazy data in lots of different formats, and a data warehouse, where everything is super structured into SQL tables.

A data lakehouse gets your data lake files into a more structured format so that queries on them are more efficient. A common file format for data lakehouses is Parquet, which is a more structured data format than just flat JSON files. But yeah, there are many different ways to think of a data lake. The way I think of it is just a giant, amorphous collection of lots of different data types and lots of different data sources that you try to interact with from one interface. But yeah, that’s a rough outline of how I see data lakes.
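To make the "one query tool on top of raw files in S3" pattern concrete, here is a minimal sketch that runs an Amazon Athena query from Python and polls for the result. The database name, table name, and results bucket are hypothetical placeholders, not anything mentioned in the episode.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

def run_athena_query(sql: str) -> list:
    # Kick off the query; Athena writes results to an S3 location you own.
    start = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "security_lake"},  # hypothetical database
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    query_id = start["QueryExecutionId"]

    # Poll until the query finishes; scans over raw JSON can take a while,
    # which is exactly the slowness discussed in this episode.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]

# Example: search raw CloudTrail JSON in the lake for one IP address.
rows = run_athena_query(
    "SELECT eventname, sourceipaddress, eventtime "
    "FROM cloudtrail_logs "
    "WHERE sourceipaddress = '203.0.113.7' LIMIT 100"
)
```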


Allen Helton

Okay. So in the data lake, the data is not normalized. It stays in its original shape but just gets added to the lake. So which part, and this is gonna sound like a silly question, which part is the data lake? Is it the collection of data, or is it the layer that sits on top that’s responsible for figuring out how to parse and query all these things? Or is it both? I really just don’t know.


Cliff Crosland

Yeah, that’s a good question. I would say that it’s both. You might have a strategy to say as part of our data lake, we’re just gonna dump all of our logs from these different tools into S3 and then we’re going to use Amazon Athena on top to go and query them. So the combination of Athena plus all of the different data sources you have, that might be your data lake altogether and then you plug in other things to your data lake and suddenly that becomes part of your data lake. I think the reason why it’s called a lake is ’cause it’s just very amorphous.


Allen Helton

I remember when I was in physics back in college, a lake was always the example that we used when we were learning about entropy. So we would learn about entropy, we’d learn about how things just get more and more complex the more they move around and get added to, and really, when thinking about it, at least from your definition of a data lake, it sounds like a very appropriate term. It sounds like there’s a lot of entropy that happens over time as you add data, as you add multiple sources in there.


Cliff Crosland

Oh yeah, absolutely. I mean they’re a mess. That’s why there are a lot of tools that try to clean up data lakes and make them easier to interact with. Everyone has a different approach to try to make sense of the massive amounts of data that they have. And the theme that we found is they’re all slow and we wanted to build something that was fast.


Allen Helton

Alright, let’s talk about the secret sauce. How do you do it differently to make it fast?


Cliff Crosland

What we do is we index data in S3. Scanner doesn’t interact with other kinds of databases; it’s not gonna connect to MySQL or Oracle or whatever directly. We tackle the part of your data lake that sits on top of S3, and we’re AWS-only for now. In the future that could change, but we love a lot of the Amazon serverless features. But anyways, what we do is we analyze all of the different files in your data lake. We deploy Scanner in the same AWS region, and you just give us permission to read those buckets. And then we create these index files, and these index files are stored in your S3 bucket as well. And when you execute a search, we spin up a large number of Rust-based Lambda functions that go and traverse through these index files very rapidly, and they narrow down the search space dramatically.

So if you do a query in Amazon CloudWatch or Amazon Athena, it might take 30 minutes because it scans a huge amount of data, especially if your time range is something like, go look for this event that happened over the past six months. But in Scanner we have an index of every single token and every number that appears. We have these summarized index data structures, skip lists and min-max lists and lots of different things on the backend, but they’re all in S3. And then the Lambda functions go and traverse things very rapidly and minimize the search space. So if you have a 100-terabyte data set and you’re searching for a couple of IP addresses, or all of the activity ever from a user, then Scanner will narrow it down to maybe a hundred gigabytes or 50 gigabytes of data.

And then because Rust is so fast and S3 has high bandwidth, that will finish in two or three seconds instead of waiting 30 minutes or something to finish. So that’s, at a high level, how Scanner works and why we love serverless tech so much.
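A minimal sketch of the pruning idea described here, assuming illustrative data structures rather than Scanner’s actual index format: per-page summaries record which tokens appear and the min/max timestamps, so most pages can be skipped before any raw logs are read.

```python
from dataclasses import dataclass

@dataclass
class PageSummary:
    page_id: str
    tokens: set[str]   # tokens that appear anywhere in the page
    min_ts: int        # earliest log timestamp in the page (epoch seconds)
    max_ts: int        # latest log timestamp in the page

def candidate_pages(summaries: list[PageSummary],
                    query_tokens: set[str],
                    start_ts: int, end_ts: int) -> list[str]:
    """Return only the pages that could possibly match the query."""
    hits = []
    for s in summaries:
        # Time pruning: skip pages whose time range doesn't intersect the query window.
        if s.max_ts < start_ts or s.min_ts > end_ts:
            continue
        # Token pruning: every query token must appear somewhere in the page.
        if not query_tokens.issubset(s.tokens):
            continue
        hits.append(s.page_id)
    return hits

# A huge data set can reduce to a small list of candidate pages, which is what
# makes the final raw-log scan fast.
pages = candidate_pages(summaries=[], query_tokens={"203.0.113.7"},
                        start_ts=1_700_000_000, end_ts=1_702_600_000)
```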


Allen Helton

Cool. So we’re just gonna dive in, because you piqued my interest with that explanation. So how do you divvy up this work? Let’s imagine you have a hundred-terabyte data set and you spin up, and I’m gonna throw a random number out there, I don’t know how it actually is, a hundred Lambda functions that go in and divvy up this work. How do you assign the segments or the parts of the index to each one of these Lambda functions so they’re not stomping on each other and looking at what each other’s looking at?


Cliff Crosland

So what we do is we use ECS Fargate to do the indexing. If someone dumps in a massive amount of data or something, that scales really, really nicely, and we don’t have to mess around with Kubernetes to handle it. But anyways, what that does is, as files appear in your S3 buckets, they get indexed, and we generate these small index files that then get merged together over time into larger and larger index files. When a query occurs, there’s a MySQL database that has metadata about all of the index files and the time ranges that they span. And so we say, okay, if your query’s over this particular time span, let’s look at all of the index files whose time ranges intersect with that time span, and then we’ll launch one Lambda function per index file.

And that Lambda function’s job is to figure out, given the query, all of the regions of logs, we call them pages, that contain the tokens in the query in the appropriate way. That’s all done with a Redis queue. And then each of those will spawn a bunch of tasks into Redis again, and all of those tasks are simply to go and scan the particular page of logs that was considered a hit by the query-planning Lambdas. And so then those all happen super, super fast. We’re pretty commonly getting speeds like 500 gigabytes per second.

We get up to a terabyte of data scanned per second sometimes. But the nice thing is that the search space gets shrunk down so far for your typical query that it takes just a handful of seconds to run your query. Even if you have a hundred-plus terabytes, 500 terabytes, a petabyte, it’s still quite fast.
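A hedged sketch of the query-planning step just described: select the index files whose time ranges intersect the query window (the metadata that lives in MySQL in the episode), then invoke one Lambda per index file asynchronously. The Lambda function name and payload shape are assumptions for illustration.

```python
import json
import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

def plan_and_fan_out(index_files: list[dict], query: str,
                     start_ts: int, end_ts: int) -> None:
    # index_files would come from the metadata database, e.g.
    # [{"s3_key": "indexes/0001.idx", "min_ts": 1700000000, "max_ts": 1700003600}, ...]
    relevant = [f for f in index_files
                if not (f["max_ts"] < start_ts or f["min_ts"] > end_ts)]

    for f in relevant:
        # One asynchronous invocation per index file; each Lambda narrows the
        # query down to the pages of logs that could actually contain a hit.
        lambda_client.invoke(
            FunctionName="scanner-query-planner",   # hypothetical function name
            InvocationType="Event",                 # async fire-and-forget
            Payload=json.dumps({"index_file": f["s3_key"], "query": query,
                                "start_ts": start_ts, "end_ts": end_ts}),
        )
```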


Allen Helton

That’s impressive. Lemme throw out what may or may not be an edge case and see what you do for situations like this. So we all know that Lambda has a concurrent execution limit per AWS account. By default it’s a thousand, and for brand-new AWS accounts it’s actually a hundred. How do you go about getting around that? Like, what if somebody wanted to search across all time, across everything in there, and there’s more than a thousand index files? What do you do to make sure that, A, you don’t throttle yourself, but B, you’re not throttling a serverless application that’s also running in that AWS account?


Cliff Crosland

That’s a great question. The way that it works is we launch Scanner into a unique AWS account that we actually manage. And so there are no other Lambda functions running in that particular AWS account, which is nice. But we do use all 1,000 Lambda functions, and in some regions it’s only 500, which is kind of a bummer. But despite that, for queries that have 10,000 index files or tens of thousands of index files to get through, each index file can be analyzed very rapidly, within less than a second. It’s often a few hundred milliseconds. So we can still get through a lot with even just a thousand Lambda functions.

But that is an unfortunate upper limit on one Scanner instance. The cool thing is you can spin up more than one Scanner instance if you want, and we’ll spin it up for you. A Scanner instance is like an AWS account where we deploy Scanner. And so you can run across all of those Scanner instances simultaneously if you want multiple thousands of Lambda functions. But so far, even for people who generate one to two to three terabytes of logs per day, 1,000 Lambda functions is more than enough, which is nice.


Allen Helton

Yeah. So you said something that stuck out to me, and I wanna make sure I get a clarification. Early on, you said that your data is indexed and stored in S3, and that’s stored in my account. So if I was using Scanner, that would live in the Ready, Set, Cloud AWS account. But then you just said that Scanner is running in a Scanner-owned account. So you have cross-AWS-account permissions that are set up to get the Lambda functions over here to communicate with my AWS account, specifically that S3 bucket.


Cliff Crosland

That’s right, yes. All of the compute runs in the AWS account that’s managed by Scanner, and then all of the storage is in the user’s account in the same region. So you don’t have to worry about data transfer costs over the internet or between regions; it’s free data transfer between S3 and the Lambdas, which is nice. But yeah, that’s right. The storage is in the customer’s account, and using IAM roles, we get permission to write to a Scanner index files bucket in their account and then read any of the S3 buckets that they choose to be indexed.
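A minimal sketch of the cross-account arrangement described here, assuming a hypothetical IAM role name and bucket names: compute in one account assumes a role in the customer’s account, reads the source log buckets, and writes index files back into a bucket in that same account and region.

```python
import boto3

sts = boto3.client("sts")

def customer_s3_client(customer_account_id: str):
    # Assume a role the customer created in their own account.
    creds = sts.assume_role(
        RoleArn=f"arn:aws:iam::{customer_account_id}:role/ScannerReadWriteRole",  # hypothetical role
        RoleSessionName="scanner-indexing",
    )["Credentials"]
    # Same-region S3 access, so there are no cross-region transfer costs.
    return boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

s3 = customer_s3_client("123456789012")
# Read a source log object from the customer's bucket, then write an index
# file back into their dedicated index bucket (bucket/key names are made up).
obj = s3.get_object(Bucket="customer-cloudtrail-logs", Key="2024/01/15/logs.json.gz")
s3.put_object(Bucket="customer-scanner-index", Key="indexes/0001.idx", Body=b"...")
```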


Allen Helton

Got it. Okay. Now I like that, strictly from the point of view that you’re moving the compute cost to you versus me. I see a lot of companies, a lot of startups, that deploy resources into your AWS account, and you subsequently have to pay for the service and the compute that they’re running inside of your account. So that’s really nice that you’re taking that on for us. Let’s talk about data ingestion and how your indexes grow over time. Let’s say I had a serverless application that was running a thousand unique Lambda functions that are all doing their own thing. How does it go from execution to being able to query? How long does that process take?


Cliff Crosland

So we capture all of our Lambda function logs in CloudWatch, and we have set it up in our own environment to push all of these logs to S3. There are a couple ways to do that. You can use Kinesis Firehose, or you can even use a Lambda subscription to pull your CloudWatch logs and put them into S3, or something like that.

Kinesis Firehose does have a long batch delay, which we found kind of annoying. It’s sometimes 30 seconds or even a minute before the CloudWatch logs get collected in Kinesis Firehose and then pushed to S3. But as soon as the S3 object appears, we get an SQS notification and we immediately start indexing it. We push that index file into S3 and we store metadata in MySQL. So right away we have a small index file ready to be queried, and then over time that index file is gonna be merged with other small ones, and they get larger and automatically swap in. But on our side, the delay is on the order of a few seconds once the S3 object is available. It just depends. Sometimes there’s some lag we’ve noticed between Amazon creating CloudWatch logs, those logs getting into Kinesis Firehose, and then getting uploaded into S3.

So yeah, it depends on how quickly you can get stuff into your S3 bucket.
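A minimal sketch of the trigger path just described, with a hypothetical queue URL and an index_object placeholder: S3 publishes ObjectCreated notifications to SQS, and a long-running worker (ECS Fargate in Scanner’s case) polls the queue and indexes each new object as it lands.

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/new-log-objects"  # placeholder

def index_object(bucket: str, key: str) -> None:
    # Placeholder for the real work: parse the log file, build a small index
    # file, write it to S3, and record its time range in the metadata database.
    print(f"indexing s3://{bucket}/{key}")

while True:
    # Long polling keeps the worker cheap while the queue is empty.
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                               WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        for record in body.get("Records", []):   # standard S3 event notification shape
            index_object(record["s3"]["bucket"]["name"],
                         record["s3"]["object"]["key"])
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```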


Allen Helton

That’s pretty impressive because you said that the indexing process is run through Fargate, right?


Cliff Crosland

Yes.


Allen Helton

So there’s a trigger, something happens in the source system, a Fargate task is being spun up, it’s consuming the new log files outta S3, indexing them and saving them back into S3. And then it’s also updating that manifest that you said that you’re storing in MySQL or something. And that’s all just happening in a few seconds. That’s pretty impressive.


Cliff Crosland

Thanks. I mean, it’s really Rust. The ECS tasks are kind of always around and they’re just polling SQS, and when an SQS message is available saying there’s a new S3 object, they’ll jump in. And if that SQS queue ever grows, then the ECS task count also grows dynamically, and then when the queue gets drained, the task count gets small again, although Amazon doesn’t drain ECS tasks very quickly, which is annoying. So we proactively say, okay, yes, drain now; we had a giant batch, but now I don’t wanna run a hundred ECS tasks anymore, I wanna go back to one or two.
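A rough sketch of that scale-with-the-queue behavior, assuming made-up cluster, service, and queue names: read the SQS backlog and set the ECS service’s desired task count accordingly, including scaling it back down proactively once the backlog drains.

```python
import boto3

sqs = boto3.client("sqs")
ecs = boto3.client("ecs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/new-log-objects"  # placeholder

def rescale_indexers(min_tasks: int = 1, max_tasks: int = 100,
                     messages_per_task: int = 50) -> None:
    # How deep is the backlog of new S3 objects waiting to be indexed?
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )["Attributes"]
    backlog = int(attrs["ApproximateNumberOfMessages"])

    # Size the Fargate service to the backlog; when the queue is empty this
    # drops straight back to min_tasks instead of waiting around.
    desired = max(min_tasks, min(max_tasks, backlog // messages_per_task + 1))
    ecs.update_service(cluster="indexing-cluster",       # hypothetical cluster
                       service="scanner-indexer",        # hypothetical service
                       desiredCount=desired)
```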


Allen Helton

That’s so fascinating, and what I’m piecing together here is really the secret that makes Scanner so good and so fast: it’s not necessarily the indexing mechanism, but rather how you’re taking advantage of serverless for its fan-out compute capabilities, really on both sides of it, on the indexing side but also on the querying side as well.


Cliff Crosland

Yes, definitely. We just basically see that with the way logs are done traditionally with tools like Splunk or Elasticsearch, you have this really brutal stateful cluster where the compute and the storage are together on the same machine, and with logs the writing is way, way higher than the reading. And so it just makes a lot of sense to put logs into S3, and it doesn’t make sense to keep a huge amount of compute around all the time. That’s why at query time it’s really fun to be able to launch, you know, a thousand Lambda functions from nothing, and Rust cold start time is like 30 milliseconds-ish. They spin up very rapidly and then disappear again.

And so the cost is so tiny. I’ve always been surprised at the amount of Lambda compute cost for even our biggest users, because the queries only take a few seconds; it’s like tens of dollars or something. But if you do a crazy query, like a wildcard over a year of data, that’s gonna consume some Lambda capacity. But if you’re doing smart searching and stuff, serverless is so beautiful at that: you just get compute capacity when you need it, a lot of it, very quickly, and then it goes away, and that keeps everyone’s costs low. Whereas other old-school log tools are unbelievably expensive.

It shouldn’t have to be that way.


Allen Helton

Totally agree with that. So with the speed and the way that you’ve indexed things, and just the abilities of Scanner in general, is there anything now that we can do that historically, or with other products, would either have been too difficult or not possible to do?


Cliff Crosland

Yeah, so there are two really interesting use cases that we’ve seen from the security teams who are using Scanner. One is that there are a couple of different classes of logs. Really important audit logs, like CloudTrail logs and GitHub logs, show you really high-importance security events. Those, people are used to sending to Splunk, or to Datadog, or to Elasticsearch. But other kinds of low-value but extremely high-volume logs, like VPC flow logs, are just dumped into S3, and they’re not really used in the search process. But because Scanner is so much cheaper, since it’s just S3 storage and these elastic Lambda functions, you can now start to do queries where you look for an IP address not only in your login logs, your high-value audit logs, but that IP address can also be cross-correlated with your VPC flow logs, and you can see, okay, what specific network adapters in AWS is this attacker touching, and is their network flow being rejected or accepted?

That’s a new thing. So being able to do these queries on way bigger data sets is really awesome. And the other thing is just being able to query across lots of history. A study from IBM mentioned that it takes on average 270 days to both find and mitigate a breach, which is crazy. A lot of log tools only have 30 to 90 days of history, but in Scanner you’ll have a year or multiple years of history, because it’s just in S3 and it’s pretty cheap to keep it all. And you can say, okay, well, this IP address that’s showing up in my logs over the past couple days and scaring me, I wanna see all of their activity for all time.

And in Scanner you can actually get that data rapidly, even if it’s a petabyte-scale data set; if it’s a hundred terabytes, it might take three or four seconds. And then what you can do is start to push those logs into Splunk or into your expensive log tool to do deeper analysis on a subset. If you have a bunch of stuff set up in your Splunk, you can do that. And so Scanner is very helpful for making sense of long-term history. So, two things that you couldn’t do before. One is lower-value but higher-volume logs that you still wanna look at: you can now finally get fast search on them, billions of log events a day, which would cost you millions of dollars in Datadog.

And the other cool thing we can do is just long history searches that are fast.
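As an illustration of the cross-correlation idea, here is a hedged sketch that checks whether an IP address seen in high-value audit logs also appears in VPC flow logs stored in S3, and whether that traffic was accepted or rejected. The bucket and key are hypothetical; the field positions follow the default VPC flow log format.

```python
import gzip
import boto3

s3 = boto3.client("s3")
SUSPECT_IP = "203.0.113.7"   # an address surfaced by an audit-log search

def scan_flow_log_object(bucket: str, key: str) -> list[str]:
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    lines = gzip.decompress(body).decode("utf-8").splitlines()
    hits = []
    for line in lines[1:]:                # first line of a flow log file is the header
        fields = line.split()
        # Default format: srcaddr is field 3, dstaddr is field 4,
        # action (ACCEPT/REJECT) is field 12.
        if len(fields) > 12 and SUSPECT_IP in (fields[3], fields[4]):
            hits.append(f"{fields[3]} -> {fields[4]}: {fields[12]}")
    return hits

matches = scan_flow_log_object(
    "example-vpc-flow-logs",
    "AWSLogs/123456789012/vpcflowlogs/us-east-1/2024/01/15/flow.log.gz",
)
```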


Allen Helton

Let me ask you a question that’s more existential about Scanner. You’ve mentioned a couple times that it’s a security data lake, but can you use it for things other than security?


Cliff Crosland

Yes, that’s interesting. We actually have a couple of teams who just use it for application logs. We did find it tends to be the case that application developers don’t need a lot more than a few weeks of log retention, and so they tend to be okay with things like Datadog even though it’s kind of expensive. But a couple of the teams keep six months or a year of logs in their S3 buckets, and Scanner indexes them, and then they can answer questions that customers have about activity from months ago. And that can be very helpful, because that’s a very slow process with other tools that slowly scan S3. But yeah, some people definitely do, and I feel like that is maybe the next frontier after we really nail the security use case: expanding to other kinds of logging and observability generally.


Allen Helton

Okay. Yeah, yeah, that makes sense. So let me ask you, what are the features of Scanner that help facilitate security? Like what are the features that you’ve added that make it a security data lake application?


Cliff Crosland

Yes, one is detection rules: the ability to write queries that are running continuously on the data that’s flowing in. Those will trigger detection events, which can then be sent to your team. So if this particular pattern looks scary, or someone is deleting users from AWS, those go to PagerDuty or Slack or to your SOAR, which is a security orchestration and automated response tool. You can send webhook requests off to those. And the other thing that we do for the security use case is just very easy connectors to other tools to pull their security logs into your S3 buckets, and then you can start to build your amorphous data lake with your security log data, and we index that for you.

So basically the alerts and the integrations to pull in logs are the two things that we really focus on for security. But then, generally, you could do detection alerts for observability, like, my error rate is above a certain value, or warnings are higher than expected, and send me an alert when that occurs. Those are things that might be very much applicable outside of security.
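A minimal sketch of the detection-rule pattern in general terms, not Scanner’s actual rule syntax: evaluate a condition against each incoming event and fire a webhook (Slack here, with a placeholder URL) when it matches.

```python
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def check_event(event: dict) -> None:
    # Example rule: alert when an IAM user is deleted, one of the "someone is
    # deleting users from AWS" patterns mentioned above.
    if event.get("eventSource") == "iam.amazonaws.com" and \
       event.get("eventName") == "DeleteUser":
        payload = {"text": f"IAM user deleted by {event.get('userIdentity', {}).get('arn')}"}
        req = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)

check_event({"eventSource": "iam.amazonaws.com", "eventName": "DeleteUser",
             "userIdentity": {"arn": "arn:aws:iam::123456789012:user/admin"}})
```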


Allen Helton

Gotcha. Okay. Cliff, so we’re running a little bit low on time and if people wanted to try out Scanner or if they have a question for you, where should they reach out and how would they best find you and Scanner?


Cliff Crosland

You can check out the website at Scanner.dev. There’s a button to get a demo, which is a meeting with me, and I’d be really excited to show you what we have. What we do is, if you give us the name of the AWS region that you’re in, we deploy an instance of Scanner for you to do a free trial for 30 days, and then we’ll do a concierge onboarding with you to make sure that your S3 logs can be indexed properly with the right permissions. We love to get this in people’s hands and get their feedback and see if it’s helpful. That’s the best place to get in touch, or hit us up on LinkedIn or Twitter.

Maybe you can find us there too.


Allen Helton

Awesome. Well, thank you, Cliff. I have learned so much, but also this has been one of the coolest examples of taking advantage of serverless that I’ve seen in a long time, and you’re doing it in production, like you’re doing it for real. So it’s really nice to see something that’s much more than a proof of concept actually take advantage of it. So well done, and thank you again for your time. I really appreciate it. Alright, we’ll talk to you later.

That’s it for this episode of the Ready, Set, Cloud podcast. Be sure to follow us wherever you listen to podcasts to stay up to date on the latest episodes. For more info on trending cloud topics, be sure to visit readysetcloud.io and sign up for the Serverless Picks of the Week newsletter. I’m Allen Helton, and we’re outta here.


Scanner is a security data lake platform that supercharges security investigations with fast search and detections for petabyte-scale log data sets in AWS S3. It’s 100x faster than Athena and 10x cheaper than traditional tools like Splunk and DataDog.

Scanner can be deployed into your own AWS account or into an AWS account managed by Scanner with read-only permissions to the logs in your S3 buckets. This zero-cost data transfer gives users complete control over their data with no vendor lock-in and avoids log shipping over the public internet.

Cliff Crosland
CEO, Co-founder
Scanner, Inc.

Cliff is the CEO and co-founder of Scanner.dev, a security data lake product built for scale, speed, and cost efficiency. Prior to founding Scanner, he was a Principal Engineer at Cisco where he led the backend infrastructure team for the Webex People Graph. He was also the engineering lead for the data platform team at Accompany before its acquisition by Cisco. He has a love-hate relationship with Rust, but it’s mostly love these days.