Rustacean Station Podcast: A Conversation About Scanner’s Security Data Lake Powered By Rust

Scanner CEO and Co-Founder Cliff Crosland had the pleasure of sitting down with Rustacean Station Podcast host Allen Wyma to discuss Scanner’s Next-gen Security Data Lake tool powered by Rust.

To listen to this episode head over to The Rustacean Station, and you can read the full transcript below.

Allen Wyma

Hello and welcome to another episode of Rustacean Station. I’m your host, Allen Wyma. Today we have Cliff Crosland. He’s the CEO and Co-Founder of Scanner. It’s Scanner.dev, right? Not just Scanner Dev.

Cliff Crosland

Yeah, Scanner.dev. That’s right.

Allen Wyma

So hello and welcome. We were supposed to have your partner on, but something came up, so hopefully we have enough material with you to cover the hour. I think there’s more than enough based on our interactions.

Why don’t you give a quick introduction about yourself and tell us what Scanner.dev is?

Cliff Crosland

Absolutely. So my Co-Founder, Steven Wu, and I were at a startup prior to the one we founded here. We had a massive Splunk bill and we just started to move logs into S3. We felt like there was a much better way to interact with S3 logs and provide extremely fast search on them. And we’ve discovered that security teams really have this problem intensely. And so we founded Scanner.dev, and everything is built in Rust. It’s been a really delightful experience. There are definitely pros and cons for sure, in particular the compilation time. But yeah, Scanner.dev is a serverless log management system that uses S3 to very quickly analyze and search through massive amounts of log data.

And for teams that need a year of logs to do investigations, or that keep a year of logs for compliance purposes, it’s really hard to be a Security Engineer today and to find advanced persistent threats, et cetera. Having a really low cost, extremely efficient solution to scan through lots of data quickly is really important. So we found a really cool fit between AWS Lambda functions and Rust, and also between ECS containers and Rust. Lots of fun stuff. Everything is in Rust except for the front end, and we’re starting to introduce a little bit of Wasm and Rust into the front end in the coming few months as well. So yeah, we’re super huge fans of Rust. There are so many different reasons why we love it, and we’re very grateful to everyone who works on it.

Allen Wyma

We’re talking a little bit about, you know, why Rust, and you had some firsthand experience about why Rust would make more sense. Can you tell us more about that?

Cliff Crosland

At our prior startup, we built this system. It was an intelligent executive assistant called Accompany. It was acquired by Cisco a couple of years ago. One of the products that we focused on was crawling the web and then building these intelligent briefings to send to executives and salespeople, to prepare them for meetings and to do things like, oh, there’s a bunch of news on a particular company you’re about to meet with. There’s a giant news spike. Just highlight interesting insights for people. And we built a lot of our infrastructure in C++.

We had a lot of Google DNA in the founding team. Not me, but our CTO scaled up Google Analytics using a bunch of cool C++ work. But we had a downloading system, a system that interacted with the internet, our crawler microservice, which handled a huge amount of work, and eventually we discovered some really weird looking segfault crashes. As it turned out, there were a couple of memory safety bugs in Curl. We were interacting directly with libcurl in C, and it is not that easy to use. This is something we can talk about a little bit further, where the interfaces between dependencies in Rust are just amazing. It’s much easier to know who owns what, and who’s responsible for memory safety and freeing things and allocating things, and what you have to deal with. But anyway, with Curl, it took us some time to use it in a way that was really fast and used very little CPU and memory to go and crawl the web at a pretty massive scale. And we were seeing some memory leaks and we’re like, okay, maybe that’s our fault.

Maybe that’s Curl. We’re not totally sure. But then once we started seeing these segfaults, we got scared, jumped into the GitHub issues, and discovered other people reproducing a similar kind of thing. We submitted a patch to help out. It was a couple of years ago, so I’m trying to remember the exact details, but it was basically about TTLs and connections being kept alive.

I think Curl is an amazing tool, but it did terrify us that it is definitely possible for even libraries as widely used as Curl to sometimes experience memory safety problems or thread safety problems. No matter how good you are, it’ll happen eventually with C or C++. And so we patched it. We tried to help resolve the memory safety issue that we ran into and fix this time to live problem.

And I think there was another cool patch that we put into Curl for Telnet. Anyways, we have a lot of love for Curl. I’m not trying to dis Curl. I really think it’s an amazing piece of software, but it just demonstrated that it’s very hard to stay memory safe and thread safe. Even the best programmers in the world run into this. So we ended up taking this crawler service and rewriting it from C++ to Rust. And it was awesome. A lot of the memory leaks went away. Now, Rust isn’t perfect for this. You could definitely have memory leaks in Rust, but the way that Rust requires you to think about ownership, and to structure your data as a tree instead of a graph, means that it’s much easier to free things and let things go out of scope that need to go out of scope and keep memory low. But anyways, CPU performance is awesome, memory performance is amazing, and we just stopped having those memory safety issues. So no more weird segfaults that terrified us. It’s like, yeah, we were hitting the internet really hard and we were terrified that this would be an attack vector if someone could manipulate a memory safety bug in a well-known library. This would be really scary.
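
The point about tree-shaped ownership and scope-based freeing can be illustrated with a minimal sketch. This is not code from Accompany or Scanner: `Page`, `CrawlBatch`, and `live_pages_during_and_after` are hypothetical names, and the live-object counter exists only to make the frees observable.

```rust
use std::cell::Cell;

/// Hypothetical downloaded page. It bumps a shared counter when created
/// and decrements it when dropped, so frees can be observed.
struct Page<'a> {
    body: Vec<u8>,
    live: &'a Cell<usize>,
}

impl<'a> Page<'a> {
    fn new(body: Vec<u8>, live: &'a Cell<usize>) -> Self {
        live.set(live.get() + 1);
        Page { body, live }
    }
}

impl Drop for Page<'_> {
    fn drop(&mut self) {
        self.live.set(self.live.get() - 1);
    }
}

/// Hypothetical batch that owns its pages: a tree, where the parent owns the children.
struct CrawlBatch<'a> {
    pages: Vec<Page<'a>>,
}

/// Returns (pages alive while the batch is in scope, pages alive after it left scope).
fn live_pages_during_and_after(bodies: Vec<Vec<u8>>) -> (usize, usize) {
    let live = Cell::new(0);
    let during = {
        let batch = CrawlBatch {
            pages: bodies.into_iter().map(|b| Page::new(b, &live)).collect(),
        };
        // Use the batch while it is alive.
        let _total_bytes: usize = batch.pages.iter().map(|p| p.body.len()).sum();
        live.get()
    }; // `batch` goes out of scope here, and every page it owns is freed with it.
    (during, live.get())
}
```

Because the batch is the single owner of its pages, nothing has to remember to free them: leaving the scope is the free.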

So for the most sensitive part of the system, the part that interacted the most with sketchy servers out there on the internet, we felt like using Rust was a big win for us, because we just felt way more confident, way more secure once we moved from C++ to Rust. And now it’s like, okay, I love everything in Rust. I love to keep staying in that environment. Anyways, we could talk about a lot of cool things about Rust and the crates dependency story, but I’ve definitely converted from C++ to Rust. I think video game developers might be upset with me, but for web development or, I guess, cloud infrastructure projects, I think Rust is a really awesome choice, and also Go. We’ve been super pleased with that at Scanner.

Allen Wyma

I mean, you touched on so many different topics, it’s hard for me to segue into different areas, but I find that one of the most interesting things is that it’s quite important, right? Curl is one of the most popular projects that I can think of. I mean, everything has it in there. Even if it has nothing to do with anything, I don’t even know how to say it, but it always has some piece of Curl in there. And it’s not just the binary that people are used to in the terminal, but also the library itself, right? That’s what you were using, it sounds like.

Cliff Crosland

Yes, that’s true. libcurl directly. Yeah.

Allen Wyma

libcurl. And the crazy part is that this thing, I’m just looking it up over here, is almost 30 years old, right? And it’s so battle tested. When I had Daniel, the creator of Curl, on here for the first episode a long time ago, he was talking about how NASA and others are using it. So all these companies are using it. When you found some issues, I mean, I don’t know, I would have a hard time telling whether it was really me doing it wrong, because it’s so widely used that it seems crazy that you guys would find a bug, right?

I mean, was there a lot of second guessing, like, are we doing something wrong? How did the process go?

Cliff Crosland

We spent so much time digging through our own code, and then eventually just jumping into the GitHub issues, where it seemed like other people had encountered it as well during a particular release. And really, Curl is incredible. It supports not just HTTP, which everyone uses it for, but a lot of legacy protocols. It’s really remarkable and so versatile. And so it kind of makes sense that, at a certain level of complexity, there’s no way to avoid memory safety issues forever. So it took us a while, but I guess it was a combination of what other people were saying, the reproduction test cases that other people in the GitHub issues were bringing up, and our use case looking very similar to that. We were pushing Curl really hard. I’d have to remember all of the funky interface we were using. We were using the most advanced ways that you can use Curl to do a lot of parallel concurrent requests at high scale.

So if someone were to run into it, it would be maybe teams like us who are crawling the web heavily. But we definitely spent a lot of time making sure it wasn’t us. And we definitely found countless memory safety problems in our own C++ code over the years. But yeah, it just demonstrated to us that no matter how good you are and how many eyes you have on a C or C++ project, it is just so much harder than in Rust to ensure memory safety.

You rely on really good code reviews, basically, as patches go in. But in Rust, the compiler also helps you. It’s not perfect, of course. You can definitely use unsafe everywhere. But I think it’s at least an order of magnitude more memory safe. It’s just so much easier. It’s gotta be probably a couple of orders of magnitude more memory safe than C or C++. And I guess another reason why we doubted ourselves for a while and wondered if we were doing it wrong is that in C, the only way you can enforce the contracts of who owns what memory, and the thread safety contracts, like which structures and which pointers are okay to pass between threads, is by reading the documentation.

So you might see a function in libcurl or, you know, something like GraphicsMagick or some of these other C or C++ based libraries that says, cool, for this function you own the pointer, and you’ve got to call this particular function when you’re done to free the memory that the pointer points to. And in other cases it’s like, no, we own this. This is just a reference you can look at, and we will free it later.

It’s not always clear from a code review. You not only have to read the code in C/C++, but you also have to open the documentation, and your reviewers have to be really familiar with the libraries. So enforcing the memory safety and thread safety contracts in C and C++ is just super hard. Whereas in Rust it’s like, oh, I compile this and it runs. If the authors haven’t done anything crazy, we have all done our part appropriately. I’m owning the memory in this case, they’re owning it in that case, and the compiler helps enforce that contract instead of developers.
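
A small sketch of what "the compiler enforces the contract" looks like in practice. The function names (`store`, `checksum`, `append_newline`, `contract_demo`) are made up for illustration. The point is only that the signature, not the documentation, says who owns the buffer.

```rust
/// Takes ownership: the callee is now responsible for the buffer,
/// and it is freed when this function returns.
fn store(body: Vec<u8>) -> usize {
    body.len()
}

/// Borrows read-only: the caller keeps ownership and frees it later.
fn checksum(body: &[u8]) -> u32 {
    body.iter().map(|&b| u32::from(b)).sum()
}

/// Borrows mutably: the callee may modify the buffer but never frees it.
fn append_newline(body: &mut Vec<u8>) {
    body.push(b'\n');
}

fn contract_demo() -> (u32, usize) {
    let mut body = b"hello".to_vec();
    append_newline(&mut body);
    let sum = checksum(&body);
    let stored = store(body); // ownership moves into `store` here
    // checksum(&body); // would not compile: `body` was moved (error E0382)
    (sum, stored)
}
```

In C the same three contracts would typically all be a `char *` parameter plus a sentence in the manual; here a reviewer, and the compiler, can read them off the types.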

We really love that about Rust and have just really invested heavily into using it at Scanner for sure.

Allen Wyma

Okay. Yeah, those are really good points. As we were saying before, so many people keep talking about, oh, you know, it has this, it has that, but you actually have the real world use case. You found stuff, and it definitely influenced your thought process. So when you started this new company, you took all that knowledge from before and said, okay, this is what we’re gonna use. That was just kind of immediate for you? There was no checking around? Because sometimes you learn about Rust, or you enjoyed your time over there, and you start your new company or your new position, and you still have to convince those around you that what you think is good is good.

And sometimes even if it is good, the greater good of working together is more important than that. So you’re still stuck with, you know, what the majority wants to work with, right?

Cliff Crosland

Definitely. I think it is a tall order, as a developer, to introduce Rust into your code base. For tools like ours, so data analysis tools, observability tools, security tools, you’ll see Java a lot and you’ll see Go. But you’re starting to see Rust more and more in our space. But yeah, it’s pretty difficult. And I would definitely agree with the sentiment that it really is beneficial to keep with convention instead of introducing new languages and new frameworks all the time. Even if you’re just talking about a convention that you have in your own code base: if an interface isn’t the best or ideal, then instead of trying to fix some aspect of it and making things inconsistent and confusing to people at your company, it’s very good to accept a convention that everybody shares. And so yeah, I think it can be tough to introduce Rust. At the prior startup, we introduced Rust in just that one microservice, and that was fine.

It was very self-contained and there was a good business use case, which is security. We don’t want memory safety problems there. And also it was small enough that we had a lot of freedom to make that choice.

But at Scanner, and we actually have a blog post about this, the amount of data that security engineers and observability teams have to handle is absurd. It’s like a petabyte, or hundreds of terabytes, potentially, that you have to scan through. The only reasonable low cost place to put that is in object storage somewhere like S3, but scanning through that with Python or something is brutally slow. Even using Java, it’s fine, and there’s a lot of AWS support for things like Java. But what we really felt was necessary is this: instead of allocating a huge number of CPUs that sit there idly until someone executes a query, serverless functions give you the ability to spin up the idle compute in an AWS data center to do a massive job very, very quickly and then spin back down again. So you’re not paying for it all the time, because log queries don’t happen that often. When you’re doing an investigation, it might happen a couple of times a day, so you don’t need a bunch of idle compute sitting there.

We did a bunch of experiments to measure the performance of Python, Java, Go and Rust, and Java was really bad in Lambdas. I was surprised, actually. Even though AWS has some things to make cold starts better with Java, Java really needs a lot of time to boot up and warm up, and it uses a ton of memory. Go and Rust, in a Lambda function, both take maybe 30 to 50 milliseconds to boot up, whereas in Java it was potentially multiple seconds before the program was fully running, and the amount of memory needed was extremely high as well. The JVM is just a big beast, I think.

I think Java’s still a great tool for doing CPU intensive tasks. There are some Java wizards out there who can get a lot of CPU performance out of Java that is just as fast, or close to as fast, as people doing crazy stuff with Rust. I think it’s still easy to do faster stuff in Rust, but it’s almost impossible to keep Java’s memory usage down. And so for us we’re like, well, oh my God, we’re going to launch a lot of Lambda functions at the same time, and you spend money based on, basically, how much memory each Lambda function is using.

Minimizing the memory to basically the essential bare minimum that’s required to do the job was trivial to do in Rust. You could get a really good program running using 20 or 30 megabytes, whereas Java seems to need at least a gig to do anything.
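
To put rough numbers on the memory argument: Lambda’s duration charge is metered in GB-seconds, which is the function’s configured memory multiplied by its run time. The sketch below compares two configurations for the same workload. The figures used in the usage note (1024 MB versus 128 MB, 2 seconds, 1,000 invocations) are made-up illustrations, not measurements from Scanner, and it assumes both versions run for the same duration, which real workloads may not.

```rust
/// Billed compute in GB-seconds: configured memory (in GB) times run time
/// (in seconds), summed over all invocations.
fn gb_seconds(memory_mb: u32, duration_ms: u64, invocations: u64) -> f64 {
    let memory_gb = f64::from(memory_mb) / 1024.0;
    let seconds = duration_ms as f64 / 1000.0;
    memory_gb * seconds * invocations as f64
}

/// How many times more GB-seconds configuration `a` consumes than `b`
/// for the same (hypothetical) duration and invocation count.
fn memory_cost_ratio(a_mb: u32, b_mb: u32, duration_ms: u64, invocations: u64) -> f64 {
    gb_seconds(a_mb, duration_ms, invocations) / gb_seconds(b_mb, duration_ms, invocations)
}
```

With those hypothetical figures, `gb_seconds(1024, 2000, 1000)` is 2000 GB-seconds and `gb_seconds(128, 2000, 1000)` is 250, so the larger configuration costs eight times as much for the duration component, before any difference in cold start time is counted.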

And so it was just, yeah, it’s very hardware efficient. I think Rust was just really great there. So we definitely looked around when we were starting Scanner to see, okay, which language should we try out? And we felt like Rust and Go were very good. Go would probably have been a decent choice for us. But one of the important things that we faced was a huge amount of thread concurrency, and security was really important to us, and we just felt like it’s very easy to get thread safety out of Rust, and you can definitely shoot yourself in the foot with Go more easily. But anyways, I definitely feel like if you’re doing something at scale on a massive dataset, use Go or Rust, and if you need a lot of threads, use Rust. But there are big trade-offs there.

Rust compiles much more slowly than Go, which is the worst thing about Rust, but it’s not that bad for us. I don’t mind. Our code base still compiles very quickly. I could see it being annoying someday, but I think the trade-off is still very much worth it for us.
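
On the thread safety point above, here is a minimal sketch of the kind of guarantee being described, using only the standard library. `count_matches` is a hypothetical function, not Scanner’s search code: it scans chunks of log lines on separate threads and adds up the matches in a shared counter.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Counts lines containing `needle`, scanning each chunk on its own thread.
fn count_matches(chunks: &[Vec<String>], needle: &str) -> usize {
    let total = AtomicUsize::new(0);
    // Scoped threads may borrow `chunks`, `needle`, and `total` because the
    // scope guarantees every thread is joined before those borrows end.
    thread::scope(|s| {
        for chunk in chunks {
            let total = &total;
            s.spawn(move || {
                let n = chunk.iter().filter(|line| line.contains(needle)).count();
                total.fetch_add(n, Ordering::Relaxed);
            });
        }
    });
    total.load(Ordering::Relaxed)
}
```

Swapping the `AtomicUsize` for a `Cell<usize>`, or for a plain `usize` updated with `+=`, is rejected at compile time: `Cell` is not `Sync`, and a plain integer cannot be mutably borrowed by several threads at once. That is the data race being caught before the program ever runs.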

Allen Wyma

Yeah, that’s always one of the biggest criticisms, the compilation time, but it’s like, well, you gotta balance, right? When you’re in development, obviously you can have an unoptimized build. And I think now we have, what do you call that? Not generational, what was it? It’s incremental build, right? Yep. So you already get most of that stuff built to a certain extent. You’ve already got a lot of stuff built, and then it’s like, well, just kick it up to the CI and let that build the release build.

We’re not back to that XKCD comic where everybody’s sword fighting, waiting for this thing to build, anymore. We’re past that time. And I wanna also touch back on the Java stuff.

It’s crazy for me to understand why the hell there are so many jobs. I’m in Hong Kong. It’s a huge banking place where you basically have to work with Java, and I find there are so many jobs out there where, as you were talking about, you have to have these extreme Java skills where you know how to do low latency, low memory Java. It’s like, well, why the hell don’t you just do C or C++ at this point? Because that’s basically what you want. Otherwise you need so much skill to do that, as opposed to using something that is less memory intensive. And then it’s like, oh yeah, you get this JIT. Well, you can get AOT, which I think is a little bit better, right? Nobody wants to have a sudden slowdown and then a speed up. And sure, okay, if you keep it running. But then software mutates and changes over time, so no matter what, you’re gonna have to do an update at some point, and then you still have the stop-the-world garbage collector. I think that hasn’t changed yet. I’m a little bit surprised about that part.

So there are so many benefits. I guess you could theoretically have some benefit where it’s like, well, I’m not releasing my memory now, so I’m not gonna waste CPU cycles on that. Yeah. But then when you do need to release it, you’ll be wasting a lot of time at maybe a very critical moment, right? So I don’t know, you kind of balance it out. I just don’t see Java as being as successful for so many things that people are trying to use it for. It’s like a Swiss army knife. They’re trying to use Java like Python. Python you use in so many different places, but Java’s just not good for everything. It’s still weird, though, because, let’s go back to the data science part, Scala, which is based on the JVM, is actually used in data science, and that part is also a little bit confusing to me. Do you have any insight about what that’s all about?

Cliff Crosland

I mean, that’s a good question. One of our engineers, from Twitter, used to do a bunch of work in Scala. If you thought Rust compilation times are bad, try to compile a massive, multi hundred thousand line Scala project. I think Scala functional programming is really fun. I think Scala provides a really cool story there. I haven’t done a huge amount of data science, but I can see how, if you’re doing just transformation after transformation on data, it can be really fun to write your code in a functional programming paradigm.

I love Clojure, actually. I really enjoyed using that for a while, but again, I got sick of how much memory those programs would use. And I kind of missed the strong typing. So Scala has strong typing plus the benefits of just a huge number of libraries in the JVM ecosystem. And I think really that’s likely why most people stay in the JVM. Plus, it is just frankly easier to learn than Rust. You can let things just get garbage collected. You don’t have to fight against a borrow checker. And also, with generic types, there are some interesting trait things that we do in our system, because when you execute a query, you have to build an abstract syntax tree, and you have a bunch of different abstract components that need to work together. And so we use a lot of traits.
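
As a rough idea of what a trait-based query syntax tree can look like, here is a minimal sketch. It is not Scanner’s query engine: the `Predicate` trait and the `Contains`, `And`, and `Not` node types are hypothetical, and a real engine would have many more node kinds and would not scan lines one at a time.

```rust
/// Hypothetical query node: anything that can decide whether a log line matches.
trait Predicate {
    fn matches(&self, line: &str) -> bool;
}

/// Leaf node: substring match.
struct Contains(String);

/// Interior nodes own their children as boxed trait objects.
struct And(Box<dyn Predicate>, Box<dyn Predicate>);
struct Not(Box<dyn Predicate>);

impl Predicate for Contains {
    fn matches(&self, line: &str) -> bool {
        line.contains(self.0.as_str())
    }
}

impl Predicate for And {
    fn matches(&self, line: &str) -> bool {
        self.0.matches(line) && self.1.matches(line)
    }
}

impl Predicate for Not {
    fn matches(&self, line: &str) -> bool {
        !self.0.matches(line)
    }
}

/// Evaluates the tree against each line and keeps the lines that match.
fn filter_lines<'a>(query: &dyn Predicate, lines: &[&'a str]) -> Vec<&'a str> {
    lines.iter().copied().filter(|l| query.matches(l)).collect()
}
```

A query such as "error and not healthcheck" becomes `And(Box::new(Contains("error".into())), Box::new(Not(Box::new(Contains("healthcheck".into())))))`. Each new node kind needs its own `impl Predicate`, which is the bookkeeping that an interface hierarchy in Java would handle in a similar way.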

I feel like that would be a little bit more straightforward in Java, you know, with interfaces, polymorphism, et cetera. We have to do some weird things with a significant number of traits, all listed out for a given data type. So I do think there are certain things that are a little bit simpler in Java. But I totally agree with you. Once you get to building things that are super low level, where you need really great performance and really efficient memory usage, you have to start doing a bunch of weird tricks in Java to get it to be fast enough and efficient enough, whereas in Rust you just happen to have really good CPU performance, because you’re using L1 and L2 caches really well, because the data structures are well suited to batching things in. You’re not traversing a heap all over the place.

Anyways, there are just a lot of cool things that we really like about Rust. I think maybe Java’s async story is a little bit easier than Rust’s. I think it’s fine for us, honestly. One of the other places where people complain about Rust, in addition to bad compilation times, is the async experience. But I find it awesome. Maybe we just use really good libraries that have a decent async interface, and we don’t really run into too many troubles there.

But yeah, anyways, I think it is interesting. It is too bad that there are so many JVM jobs. I really would love there to be more Rust jobs. But I am very happy to see companies like AWS embrace Rust more. The Firecracker VM that powers Lambdas is written in Rust. I’ve heard a lot of S3’s infrastructure is moving to Rust because it’s just easier not to shoot yourself in the foot. So yeah, I think there’s a good future, but it’ll take some time before people start to do a lot more work outside of the JVM, outside of Java, outside of Scala, and start using Rust more. I hope it moves in that direction.

Allen Wyma

I mean, it always takes time, right? You were talking a little bit about video games before. It’s an interesting small segue we can do.

There’s still so much work to be done. As we already know, C/C++ has taken up the majority of video game development. But I think that for data science-y stuff, Rust and Julia are definitely catching up big, right? Especially Rust. I mean, Julia itself is a data science language, right? So of course it’s gonna be nearly there, if not there, ’cause that’s what it is. But Rust, yeah, it’s definitely catching up, and I guess it’s probably due to all that memory safety and everything else happening. I found an interesting thing.

I guess this is kind of related to Scanner.dev, because some of data science is about searching unstructured data, right? I saw that there was a study, I have to go back and look it up, but something like Rust is the top growing language for data science. I thought that was pretty interesting, because it’s always been C/C++ backing Python or R, or of course Julia rising, but they said something like the fastest growing is Rust. And I find that super amazing. And again, leaning back to an earlier episode, I’m sorry, I don’t remember which episode this was, but I think this has happened quite a few times.

Some people have found that adding Rust to their project is bringing more people to their project to help out, because they find it to be easier. It’s kind of weird. It’s a somewhat easier language, but at the same time it’s also more difficult, because you do have this nagging borrow checker that basically, as I would call it, Stockholm syndromes you into obeying the rules. But it’s kind of weird, because then you see this benefit. There’s a mutual benefit to this.

Cliff Crosland

Yeah, I definitely agree with that. I think one of the things that everyone underestimates, especially in a C or C++ project, is how much testing you have to do, even in production or in staging, before you fully reveal your memory ownership problems. It’s just this extra friction. Maybe you can prototype much more quickly in C or C++ because you don’t have a borrow checker to tell you not to do what you’re doing. And in fact, a lot of undefined behavior kind of works a little bit. If you use a pointer that has just been freed, it might actually work okay for a little while, because maybe the memory that it was pointing to is still there and hasn’t been reused for something.

And so you don’t have this borrow checker yelling at you, and you can make progress faster in C or C++. But for a large scale production system that has to do a lot of data science or data infrastructure engineering, you will spend a lot of time banging your head against your desk trying to understand where things went wrong with your ownership model. Whereas Rust forces all of that to happen right up front. Another example I remember from this downloader service was that sometimes when you’re shutting down the service to deploy a new version or something, shutting it down would segfault every now and then in C or C++, because our ownership model was a little bit off, and was only off for the microseconds as things were winding down. The order in which we were shutting down the different components in our C++ project was a little bit wrong, and things were still available, or were still referenced, that shouldn’t have been.

Most of the time it shut down cleanly. But every now and then we’d see a weird segfault in the logs and be like, what’s happening? And it was really hard to debug. It’s like, okay, well, to debug this, let’s try shutting it down and see what happens. Okay, well, let’s try shutting it down again. It was just really annoying to try to fix that. Whereas in Rust, when we moved over to this new crawler service, the borrow checker immediately said, oh, the way that you’re shutting down is not safe. You must own this entire part of your infrastructure in a different way, such that when it goes out of scope in the main function, everything is okay.

And that was really revealing. Starting to move it to Rust actually helped us debug and understand where the problem was coming from in C or C++. So yeah, it’s really cool. I think it’s interesting to see how much Rust is picking up in the data engineering and infrastructure space.
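
The shutdown-ordering story can be sketched in a few lines. This is an illustration under assumed names, not the actual crawler: `ConnectionPool`, `Downloader`, and `shutdown_order` are hypothetical, and the log only exists so the teardown order can be observed.

```rust
use std::cell::RefCell;

type Log = RefCell<Vec<&'static str>>;

/// Hypothetical shared resource that must outlive everything that uses it.
struct ConnectionPool<'l> {
    log: &'l Log,
}

impl Drop for ConnectionPool<'_> {
    fn drop(&mut self) {
        self.log.borrow_mut().push("pool closed");
    }
}

/// Hypothetical component that borrows the pool, so it cannot outlive it.
struct Downloader<'p, 'l> {
    pool: &'p ConnectionPool<'l>,
}

impl Drop for Downloader<'_, '_> {
    fn drop(&mut self) {
        // Still safe to touch the pool here: the borrow proves it is alive.
        self.pool.log.borrow_mut().push("downloader stopped");
    }
}

fn shutdown_order() -> Vec<&'static str> {
    let log = Log::new(Vec::new());
    {
        let pool = ConnectionPool { log: &log };
        let _downloader = Downloader { pool: &pool };
        // drop(pool); // would not compile: `pool` is still borrowed by `_downloader`
    } // Locals are dropped in reverse declaration order: downloader first, then pool.
    log.into_inner()
}
```

Tearing the pool down before the component that still references it, the C++ mistake described above, is a compile error here rather than an occasional segfault during deploys.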

So you have cool projects. I’m trying to remember them all. Matano is one which is really cool. I think it’s written in Rust, and it helps people transform different log sources and put them in Parquet format in S3. There’s DataFusion, which is really cool. It provides a lot of database-like functionality, where if you’re trying to build your own database from scratch, you can use a lot of functions that DataFusion has to accelerate that process, and that’s all written in Rust. And there are a couple of really cool projects that have recently moved to start using DataFusion. I think InfluxDB is one. So you’re definitely seeing Rust take over a component of a really important project, like the database engine or core engine part that DataFusion, I think, is doing in InfluxDB, if I’m not mistaken. And I think that’s probably a good way for Rust to get introduced into projects and, like you’re saying, make the projects more popular, because I think it is actually kind of a joy to use Rust despite the borrow checker and despite the long compilation times, because you just feel so much more confident that what you push to production is going to work and not have weird memory safety bugs like in C or C++.

It’s pretty scary to put something that interacts with the internet at scale out there and feel like, I think there’s still a segfault problem, I’m not totally sure, and it’s going to take us a really long time to uncover it, and maybe we’re willing to risk it because it only seems to happen during shutdown or something. It feels very uneasy. Whereas with Rust I feel like, if it compiles and it works, there are several classes of bugs you’re not going to have to deal with that are really, really painful in C or C++. And Java kind of solves them just through the way it’s designed.

But then in Java you’re going to have to really smash your head against the desk to get CPU and memory to be efficient, doing kind of weird, unnatural things in Java. So I don’t know, I think Rust forces you to make those investments upfront in your development cycle, and as a result there is more friction. You can’t iterate or prototype quite as quickly in Rust, but you can be a lot more confident in the things that you do deploy. So I think that’s a fine trade-off for us. And once you are really comfortable with the borrow checker, I think prototyping is actually quite fast, and you realize, oh, I can put a lot more in a Vec than I thought.

I don’t have to build a really weird graph with Rc<RefCell>s, et cetera. A lot of stuff works well with simple data structures, better than you’d think: the BTreeMap, et cetera. Yeah, I think there are just a lot of cool things you get out of Rust that require a little bit more time investment during the dev cycle but are worth it in the end. It’s like eating your vegetables. It’s kind of annoying, but you appreciate the health of your code base over time if you do those healthy things upfront.
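As an illustration of the "put more in a Vec" idea, here is a small sketch of a directed graph stored in flat vectors, where nodes refer to each other by index instead of through reference-counted pointers. The `Graph` type and its methods are made up for this example and are not taken from Scanner’s code base.

```rust
/// A directed graph stored in flat `Vec`s. Nodes are identified by their
/// index, so no `Rc<RefCell<...>>` is needed, even when there are cycles.
pub struct Graph {
    labels: Vec<String>,
    edges: Vec<Vec<usize>>,
}

impl Graph {
    pub fn new() -> Self {
        Graph { labels: Vec::new(), edges: Vec::new() }
    }

    /// Adds a node and returns its index.
    pub fn add_node(&mut self, label: &str) -> usize {
        self.labels.push(label.to_string());
        self.edges.push(Vec::new());
        self.labels.len() - 1
    }

    /// Adds a directed edge between two existing node indices.
    pub fn add_edge(&mut self, from: usize, to: usize) {
        self.edges[from].push(to);
    }

    /// Labels of every node reachable from `start`, in depth-first order.
    pub fn reachable(&self, start: usize) -> Vec<&str> {
        let mut seen = vec![false; self.labels.len()];
        let mut stack = vec![start];
        let mut out = Vec::new();
        while let Some(node) = stack.pop() {
            if seen[node] {
                continue;
            }
            seen[node] = true;
            out.push(self.labels[node].as_str());
            // Push in reverse so neighbors are visited in insertion order.
            for &next in self.edges[node].iter().rev() {
                stack.push(next);
            }
        }
        out
    }
}
```

A cycle such as a → b → c → a is just three index entries here, and the borrow checker has nothing to object to because the graph owns all of its data.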

Allen Wyma

Yeah, I mean, we’ve spent so much time talking about Rust itself, and I don’t want to miss the main topic, which is Scanner.dev. I guess we can segue in this way. You’re talking about getting performance and how you can do things. How do you guys choose what to optimize, and how do you actually optimize? And have you found a lot of interesting things that, say, were not documented, that you just found out through trial and error, maybe accidents, or things like that?

Cliff Crosland

That’s a good question. I think one of the things that we found really surprising at first was how absolutely essential serialization and deserialization are, and using the right format there, plus using things like zlib and gzip for compression. Basically the problem that we have is that we have a massive amount of data. We want to compress it because we don’t want to send a huge amount of uncompressed data over the network. We want to really reduce how much time we spend pulling data in. But we also need to minimize how much time we spend decompressing and then parsing that data.

One of the things that we’ve discovered is that in Rust, because of the Serde framework, you have a lot of really cool alternatives that you can very quickly experiment with across different kinds of data formats. In one of our blog posts we talk about how much more quickly you can parse JSON data. And we briefly mention that if you really try to get a data format that’s even faster to serialize and deserialize than JSON, you can get an order of magnitude more performance. So what we found is, okay, let’s build these structures that we have in memory internally in our code base, and let’s make them all derive Serialize and Deserialize from Serde.

One really cool thing that you get out of that is, okay, let’s try this. Let’s try just using JSON serialization and see: if we put a bunch of JSON on disk and make all of these data structures JSON, and we have these index files that we traverse in chunks extremely rapidly, what does the performance look like? And then let’s try a bunch of other options. There are so many cool serialization libraries in Rust. I think a lot of people really love optimizing things with an insane obsession in Rust.

So how can we use SIMD instructions in our CPUs to make serialization and deserialization even faster? We actually landed on bincode, which is a format that was developed by Mozilla, and I think its primary function at Mozilla is to very efficiently send data structures from one process to another, maybe between tabs and tab managers in Firefox. But we discovered, wow, if you use bincode, especially for numerical data (numerical data in JSON is terrible, that’s super inefficient), bincode data is potentially an order of magnitude faster to parse than JSON. And even MessagePack, which is quite fast, still doesn’t hold a candle to bincode.
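To show why numeric data is costly to parse as text, here is a small standard-library-only sketch that encodes the same list of integers two ways: as a JSON-style text array and as fixed-width little-endian bytes with a length prefix. This is not bincode or Serde, and the function names are invented for the example. It only illustrates the general difference: the text decoder has to scan and convert digits, while the binary decoder copies eight bytes per value.

```rust
use std::convert::TryFrom;

/// Encodes values as a JSON-style text array, e.g. "[1,2,3]".
pub fn encode_text(values: &[u64]) -> String {
    let parts: Vec<String> = values.iter().map(|v| v.to_string()).collect();
    format!("[{}]", parts.join(","))
}

/// Parses the text form back; every number is re-parsed from digits.
pub fn decode_text(text: &str) -> Option<Vec<u64>> {
    let inner = text.strip_prefix('[')?.strip_suffix(']')?;
    if inner.is_empty() {
        return Some(Vec::new());
    }
    inner.split(',').map(|s| s.parse::<u64>().ok()).collect()
}

/// Encodes values as a u64 length prefix followed by 8 bytes per value,
/// all little-endian.
pub fn encode_binary(values: &[u64]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + values.len() * 8);
    out.extend_from_slice(&(values.len() as u64).to_le_bytes());
    for v in values {
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

/// Decodes the binary form; no digit parsing, just fixed-size copies.
pub fn decode_binary(bytes: &[u8]) -> Option<Vec<u64>> {
    if bytes.len() < 8 {
        return None;
    }
    let (head, body) = bytes.split_at(8);
    let len = u64::from_le_bytes(<[u8; 8]>::try_from(head).ok()?) as usize;
    if body.len() != len.checked_mul(8)? {
        return None;
    }
    body.chunks_exact(8)
        .map(|chunk| <[u8; 8]>::try_from(chunk).ok().map(u64::from_le_bytes))
        .collect()
}
```

Both forms round-trip the same data. For large values such as `u64::MAX`, the text form needs 20 digits where the binary form always needs 8 bytes. Actual speedups depend on the library and the data, so they would need to be measured rather than assumed.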

There are some even more insane serialization libraries in Rust, which are really fun to read, but which I don’t think we’ll use, like Abomonation.

It literally says, don’t use this, this is dangerous and not production ready because it’s so unsafe. But basically it’s completely zero-copy serialization and deserialization for common Rust data structures like vectors and strings and hash maps, et cetera.

I need to dig into exactly which data structures it supports. But yeah, Rust’s community loves making data fast and using hardware really effectively. Instead of building an entirely proprietary encoding, like you might see in Postgres or something, you can get really, really far using serialization and deserialization libraries that exist out there. So I think that was actually a big surprise to me personally: how absolutely critical it is to be able to parse data and encode data quickly.

There are just a lot of really cool options there in Rust, and way more that we want to continue to experiment with, because I think we can get even faster over time, with slightly better compression too. Every ounce of performance we can eke out of it really counts when you’re dealing with petabyte-scale security data lakes. So yeah, that was one thing about Rust that I was pleasantly surprised by: just how awesome it was at encoding and decoding data, and how many options there were, thanks really to Serde as a framework that everyone seems to adopt.

Allen Wyma

That’s the crazy part about Serde. I don’t know, that might be one of the greatest libraries ever within Rust, because not only is it something that everybody needs to do, but it’s just so versatile. You basically write a couple of, I don’t know what the heck those are even called, but it’s like how to encode and how to decode, and then it just uses the traits and it can do it for you: CSV, JSON. You just change this one bit and then you’re done. It’s like, wow, that was easy. Yeah, it’s insane. But going back to what you’re talking about, the storage format or the encoding and decoding format, to me it makes so much more sense, unless you have a really, really good reason, to use some of these formats rather than, like you said, making your own. I mean, people spent lots of time on these, and open source communities do value working hard and trying to make some kind of standard that makes sense. So why not lean on that, and then you can focus on your work and, like you said, eking out the performance. And if you already have an encoding and decoding format that is that performant, then just leave it. If that becomes a bottleneck later on, then that would be the time to change it, right? Right now you have more important things.

I mean, I think one of the biggest ones, at least in the US and many other countries, is probably the connection between the client and the server. How can we speed that up? How can we compress that part? So is there a kind of double compression, where you compress the connection between the client and those services that have all your data, and the other compression would be the actual binary format that you’re talking about? Do you have this kind of thing going?

Cliff Crosland

That’s really interesting that you mention the web clients. One of the things that was surprising there was this: we use WebSockets to make the performance of the web app extremely fast, to be able to send a huge number of streaming results back so you can immediately start to see hits even if you’re traversing tens of terabytes of data. And we discovered that Rust really outperforms AWS’s services sometimes. Amazon’s API Gateway, which is what we were using initially, would fall over with too much data flowing over it. One of the things that we’re doing now is starting to play with Wasm to send really cool data sketch data structures over the wire to the client. A data sketch is for when you’re given a complex aggregation query that does some cool statistical function over the data. Oftentimes, if you were to store an exact result, you would need to store gigabytes of data somewhere. But you can use a data sketch, which gets you close. It has some error, but it’s close enough that it’s still extremely useful.

A classic example is: give me the IP addresses that have hit my service the most over the past 24 hours. You can use a really efficient data sketch data structure written in Rust, encode it with something like bincode, compress it with gzip or something, send it over the wire, and then use Wasm with Rust on the other side to merge multiple results that are coming in from the backend. So it’s the same exact complex data structure on the client and the server, and you don’t need to send a huge amount of data over the wire. Yeah, I think that’s actually something that’s really fun.
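The transcript does not say which sketch Scanner uses, so the following is only an illustrative stand-in: a Count-Min sketch, a well-known structure for approximate per-key counts. The `CountMin` type and its methods are written for this example. It shows the two properties described above: the structure is small and fixed-size, and partial results from several workers can be merged by adding counters. Finding the actual top IPs would additionally require tracking candidate keys, which is left out here.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A Count-Min sketch: `depth` rows of `width` counters each.
/// Estimates can overcount because of hash collisions, but never undercount.
pub struct CountMin {
    width: usize,
    rows: Vec<Vec<u64>>,
}

impl CountMin {
    pub fn new(depth: usize, width: usize) -> Self {
        CountMin { width, rows: vec![vec![0; width]; depth] }
    }

    /// Column for `key` in a given row. The row number is mixed into the
    /// hash so each row behaves like a different hash function.
    /// Assumption: `DefaultHasher::new()` is fine for a demo, but its
    /// algorithm is not guaranteed to stay the same across Rust releases,
    /// so a real client/server pair would pin a specific hash function.
    fn slot(&self, row: usize, key: &str) -> usize {
        let mut hasher = DefaultHasher::new();
        row.hash(&mut hasher);
        key.hash(&mut hasher);
        (hasher.finish() % self.width as u64) as usize
    }

    /// Records `count` occurrences of `key`.
    pub fn add(&mut self, key: &str, count: u64) {
        for row in 0..self.rows.len() {
            let slot = self.slot(row, key);
            self.rows[row][slot] += count;
        }
    }

    /// Approximate count for `key`: the minimum over all rows.
    pub fn estimate(&self, key: &str) -> u64 {
        (0..self.rows.len())
            .map(|row| self.rows[row][self.slot(row, key)])
            .min()
            .unwrap_or(0)
    }

    /// Adds another sketch's counters into this one.
    /// Returns false (and changes nothing) if the dimensions differ.
    pub fn merge(&mut self, other: &CountMin) -> bool {
        if self.width != other.width || self.rows.len() != other.rows.len() {
            return false;
        }
        for (mine, theirs) in self.rows.iter_mut().zip(&other.rows) {
            for (a, b) in mine.iter_mut().zip(theirs) {
                *a += *b;
            }
        }
        true
    }
}
```

Two backends could each build a `CountMin` over their share of the logs, and the client could call `merge` on the results. The merged estimate for an IP is at least the sum of its true counts on both sides.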

I just imagine if you were to try to represent this in JSON and send that over the wire to JavaScript and try to rebuild a data sketch data structure, it’d probably be a hundred times bigger or something. I mean, I guess you can send byte arrays over the wire and play with JavaScript’s low-level byte array interfaces to deal with it. But it is just really fun to be able to take fairly complex data structures, get good compression performance and good encoding, and use the same format on the frontend and backend. We have a lot more to do there. We haven’t spent a lot of time with Wasm.

We’ve done a lot more in the past experimenting and playing with Wasm, and really enjoying it. And Rust’s Wasm story is great, I think, compared to the C and C++ Wasm story, or the developer experience at least. There’s a lot of really cool innovation you can do by using the same really efficient encoding formats and similar Rust libraries on the backend and frontend, so you don’t have to write your own custom encoding format. It’s pretty sweet.

Allen Wyma

Yeah. I do want to get back on topic again. I don’t know if we actually discussed much about how Scanner.dev works. So you still rely on the cloud infrastructure for storing the data and some other pieces. Do you mind talking about the architecture?

 

Cliff Crosland

Yeah, absolutely. So this was really interesting. We have a recent blog post where we talk about pushing hardware to fundamental limits. We felt like the problem that security teams in particular face is that there is just so much data to handle, and it’s usually a very small team at a big company that has to deal with it. And a lot of tools require you to push data over the internet to another service. If you’re pushing logs to Datadog, you end up sending them probably over HTTPS, or maybe syslog or some other protocol, but you are sending a huge amount of data over the internet and there’s a lot of egress cost. One of the things we thought about is, okay, if you take a big step back and look at how logs are stored now, you have security teams, and teams that are responsible for observability at their companies, storing a lot of cold storage logs in S3.

And what we do is we launch a new Scanner instance in the same AWS region, and you run a Terraform, Pulumi, or CloudFormation template to provide an IAM role that gives us read access to those logs. So instead of pushing a huge amount of data over the internet, which alone is really expensive, we just read it directly from your S3 buckets at zero data transfer cost. We have a bunch of Rust-based containers that build index files from the logs that you have in your S3 bucket. And then we push those index files back into your AWS account, into another S3 bucket that you own.

So all of the data is owned in your AWS environment. It’s in your S3 buckets. You control not just your own data, but also the index files that we generate. Then when a user executes queries, Scanner’s Lambda functions spin up, and they have permission to go and read the index files in your environment. Again, zero data transfer cost, and you own all of your data. I think it’s this interesting new world. A lot of SaaS vendors require you to push all of your data to them, and they own it forever, and you have this lock-in experience. But now it’s sort of a collaboration that we have with our customers, where they have a really cool set of logging infrastructure and a log scheme that they have established in their S3 buckets.

They might have their own custom tools that interact with it, or they might use Athena to go and scan it, but then Scanner can also get access to those logs and make it much, much faster to search them by building these index files and saving them alongside your logs. So instead of pushing your logs to somebody else and spending a lot of money doing that, we read them for you, analyze them, create these index files, and save them in your S3. The index files are also way, way smaller than the original data set.

They’re like 10 to 20 times smaller than the original data set. So it’s not like you have to store a lot of extra data with Scanner.

But yeah, it’s fun. If you think about it, okay, what would be the most efficient thing to do with a massive log data set? It wouldn’t be to push it all the way out of the data center over the internet, travel across miles of wire, come all the way back in again, and then get processed by some vendor. No, everything stays within the network, and we use VPCs in AWS to have this direct connection to S3 with no data transfer costs in the same region. There are just a lot of cool wins. It seems to us that with this approach hardware is used way more effectively, and there’s not all of this extra waste moving data all the way out into the internet and back in again. And it really makes a huge difference, because the scale of the logging that everyone experiences is just rapidly growing year over year.

And so whatever you can do to really efficiently analyze that at massive speeds, for us up to a terabyte per second, is really helpful. So yeah, that’s roughly what the architecture looks like. You can either run Scanner in an AWS account that we manage, or you can deploy Scanner into your own AWS account if you really want to. Most of our users opt for us to manage it, but either way it’s quite flexible. We just deploy alongside you and partner with you, as opposed to you letting go of all of your data and doing a bunch of extra work to push it over the internet.

Allen Wyma

So it sounds like you’re working a lot with AWS. Is AWS the only cloud provider you’re working with? Because, I mean, you’re talking about enterprises, and a lot of enterprises use Azure. In fact, one of my clients is like, oh, we bought into the Microsoft chain, so yeah, we have to use Teams and blah blah blah. And by the way, let’s not use Azure, because it’s such a pain in the ass for us to get our IT department to do it. It can be hosted somewhere else. That’s actually what I hear a lot more. So are we actually able to run this on other cloud platforms such as GCP or Azure?

Cliff Crosland

So right now it’s AWS only. It is really interesting to see that there’s decent feature parity. Every year they all seem to release similar kinds of features. Azure’s blob storage and GCP’s object storage services also have an S3-compatible API, and that’s the most important thing for us, S3 compatibility. They also have the ability to very quickly launch serverless functions. There is a fair amount of work to do to get it to use all of the bespoke cloud services in GCP and Azure. But yeah, it’s really cool. It’s very easy to create containers in Rust and have them be somewhat mobile between the different cloud services.

But there is more work to do before we support GCP and Azure. With AWS, though, it seems to us that we see a lot more pain in the security story, where people feel like using Athena is not fast enough. Running a query over a year of data to generate a compliance report for auditors might take multiple days and cost a huge amount of money, whereas Scanner can make that happen in, you know, 30 seconds or something. I do feel like Azure has a pretty cool security story already with Azure Sentinel, and it’s effective. It’s fairly expensive, I would say, whereas we really care about reducing costs.

We think all this extra cost in the logging space is kind of ridiculous, and we just really care about efficiency. That’s why we picked Rust, and that’s why we use S3 for storage. But yeah, it will come for GCP and Azure in the coming years. It’s not there yet. We are focused on helping people in AWS right now.

Allen Wyma

Okay. Well, I guess we can touch on that a little bit. Are there some customers you lost because you’re not able to deploy on Azure? Or has it just been like, okay, well, you can’t, so let’s just create another account? I’m guessing that the need for this is probably more important, because the bill on Azure will be so much higher than what you would have without it, right?

Cliff Crosland

Yeah, so we do have people who are like, well, you know what we’ll do, we’ll actually push logs into S3, or we’ll centralize our security logs in AWS. They might be in a couple of different clouds, running their services in GCP, Azure, and AWS, but then they’re like, okay, in order to get a really good analysis experience, it’s a lot easier to centralize it as opposed to trying to build something that can read from everywhere. So we do have some people who are pushing things to an S3 bucket and moving and centralizing security analysis in AWS.

But I think someday it’d be really fun for us to say, cool, you have an instance of Scanner that’s very cost efficient in each place, and it’s quite interchangeable with containers. It will take some work for us to translate some of the AWS API calls that we make and use the appropriate SDK for GCP and Azure. But Azure and GCP are actually pretty good at producing S3-compatible APIs.

So it isn’t too terrible. We’ll get there eventually. I would love to be able to say, cool, no matter where the data is, you don’t have to transfer it over the internet to some new data center. Your instance of Scanner in each of your regions will return you the results in a Rust-compatible, Wasm-parsable format, and then your client can aggregate them from each of your instances and give you results. That’s the dream eventually. But right now there are so many companies that need this functionality in AWS, fast investigations and threat detections, that there’s plenty of work for us to do in AWS before we start to expand to other clouds.

Allen Wyma

So my next question would be, what’s the future of Scanner.dev? It seems like you already have a working product, and it’s pretty optimized, though there’s always room to optimize. It sounds like eventually you want to do multi-cloud support, but that’s still farther down the line, or whatever you want to call it. So what is it that you’re working on now and for the near future?

Cliff Crosland

Yeah, so right now I would say Scanner is a really good complement to these more established log tools like Splunk or Datadog, where you might keep 30 days of logs in Splunk or Datadog, and they have really advanced functionality that they’ve been building, in Splunk’s case, for like two decades.

Scanner provides a simpler query language, with simpler kinds of things that you can do with it, but you can execute queries over a massive amount of data. And we are increasing the sophistication of the query language a lot. I think you can really get to 80% of the functionality of some of these really, really expensive tools by implementing just the 20% of the query language features that are the most important. So our focus right now is making the aggregation and analysis aspects of our query language way more powerful.

We can do simple kinds of aggregations right now, like search for all of these hits and then aggregate them into a small number of values. But in our coming release in the next month and a half to two months, we’re making our aggregation system really flexible, so you can generate new tables with each step of your query, transform them, and compute multiple statistical functions over your data at the same time in the same pass. It’s hard to do, but it’s really cool.
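As a small illustration of computing several statistics "in the same pass", here is a sketch of a streaming accumulator that tracks count, minimum, maximum, mean, and variance while visiting each value exactly once. It uses Welford’s online algorithm for the mean and variance. The `Stats` type and `summarize` function are invented for this example and say nothing about how Scanner’s query engine is actually built.

```rust
/// Running statistics updated one value at a time.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Stats {
    pub count: u64,
    pub min: f64,
    pub max: f64,
    pub mean: f64,
    /// Sum of squared deviations from the running mean.
    m2: f64,
}

impl Stats {
    pub fn new() -> Self {
        Stats {
            count: 0,
            min: f64::INFINITY,
            max: f64::NEG_INFINITY,
            mean: 0.0,
            m2: 0.0,
        }
    }

    /// Folds one value into every statistic at once (Welford's update).
    pub fn push(&mut self, x: f64) {
        self.count += 1;
        self.min = self.min.min(x);
        self.max = self.max.max(x);
        let delta = x - self.mean;
        self.mean += delta / self.count as f64;
        self.m2 += delta * (x - self.mean);
    }

    /// Population variance, or `None` if no values were seen.
    pub fn variance(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.m2 / self.count as f64)
        }
    }
}

/// Computes all statistics over a slice in a single pass.
pub fn summarize(values: &[f64]) -> Stats {
    let mut stats = Stats::new();
    for &v in values {
        stats.push(v);
    }
    stats
}
```

Because each value is folded in once and then discarded, the same accumulator works on a stream of log hits without holding them all in memory.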

Rust makes a lot of this really fast and simpler than it would be otherwise, and the libraries there are really cool. So that’s another really important thing. The next is really awesome detection rules. We want to support Sigma, which is this really awesome open source project, a crowdsourced repository on GitHub with a huge list of common detections that you want for different services, and there are lots of other open source detections that we want to support. So really good detections out of the box. And faster data connectors to different log sources are coming, so you don’t have to push logs into S3 yourself. We will help pull them into S3 for you.

And the future: one of the things that we’re researching, and this is way too early, but one of the things that we’re excited about is detection rules that don’t rely on domain knowledge, but can use things like embeddings, using vector databases to find similarities and clustering in log data, so that you can find weird anomalies with these cool new vector databases out there. There are some really cool Rust-based vector databases. I don’t know, we’re maybe a little too biased in favor of Rust, but that could be really exciting to play with. It’s very, very early, but I think that’s where things are heading.

The way that we see things heading is that developers are going to be much more responsible for security, in the same way that they have become more and more responsible for ops, you know, running their infrastructure. They’re going to have to protect it now. It’s very easy to accidentally deploy a Terraform script that opens up a bunch of permissions in your cloud infrastructure that you didn’t realize. And we want to be there to make it so a lot of that’s taken off of your plate. We will automatically enable things like CloudTrail audit logs for you.

We will analyze them for errors and signal weird things that are happening in your environment that seem anomalous, let you investigate them, and integrate with your custom webhooks, with Slack, et cetera, for when these threat detection alerts go off, and help you very quickly identify and fix them. But it is an interesting responsibility and burden that I think developers are going to increasingly carry, which is like, wow, in addition to having to build the application and run it in the cloud and know how to build infrastructure, I’m also going to have to learn how to protect it better. But that’s where things are going to go.

And in the same way that cool infrastructure-as-code tools have made DevOps easier for developers, I think tools like Scanner that are very easy to use, with a lot of power out of the box, are going to help developers protect what they have without spending a huge amount of time on it. It’s just frightening how many breaches there are, and it’s incredibly important to be able to understand massive amounts of data very quickly to stay protected.

Allen Wyma

Awesome. Well, we’re approaching the end of our time. I mean, we could go on and on about this stuff forever, but we have to say goodbye at some point. Is there something that you want us to know about Scanner.dev, or maybe that we should come check it out, or anything that you wanted to say before we sign off?

Cliff Crosland

Yeah, absolutely. If it’s interesting to you, we would love for people to come check it out and try it. We have a lot of fun features coming. But also, if you just want to drop a line on Twitter or LinkedIn and talk about Rust, we love that. We’ve really enjoyed chatting with open source contributors to really amazing crates like regex or Actix. There are some really amazing people in the space. So if you just want to ping us on socials to chat about Rust, we’d love to. And if you have strong opinions about infrastructure at petabyte scale, we always love to hear everyone’s thoughts. And if you would like fast search, even if it’s not for the security use case, if it’s just for application debugging and you don’t want to ship your logs outside of S3 and want to keep them there, come chat with us. I’d be happy to meet up and hang out and get to know your problems. So that’d be awesome.

 

Allen Wyma

Cool. Well, thank you for coming on and talking about your experience with Rust. I almost feel like you like Rust more than you actually like Scanner.dev sometimes. They’re very much related, obviously.

 

Cliff Crosland

Yeah, well, I mean, I think Rust is really cool. We’re passionate about it. It really does enable massive data infrastructure projects like Scanner.dev, and it’s going to be increasingly important in the future. So yeah, it’s fun to see that trend going in the Rust ecosystem, and it really enables us to make a super efficient security data lake solution that’s unbelievably fast compared to, you know, Athena or CloudWatch for a lot of use cases. So we’re just really grateful that Rust allows us to provide users with way better performance and unlock workflows they haven’t had before. So yeah, I do love Rust a lot.

Allen Wyma

Yeah. Well, thank you for coming on, and hopefully we’ll have you back on again in the future.

Cliff Crosland

Awesome, Allen, thanks so much. Take care.


Scanner is a security data lake tool designed for rapid search and detection over petabyte-scale log data sets in AWS S3. With highly efficient serverless compute, it is 10x cheaper than tools like Splunk and Datadog for large log volumes and 100x faster than tools like AWS Athena.

Cliff Crosland
CEO, Co-founder
Scanner, Inc.

Cliff is the CEO and co-founder of Scanner.dev, a security data lake product built for scale, speed, and cost efficiency. Prior to founding Scanner, he was a Principal Engineer at Cisco where he led the backend infrastructure team for the Webex People Graph. He was also the engineering lead for the data platform team at Accompany before its acquisition by Cisco. He has a love-hate relationship with Rust, but it’s mostly love these days.