Less Is More: Behind the Data at Risk I/O, by Data Scientist Michael Roytman

Michael Roytman, Data Scientist, Risk I/O

My name is Michael Roytman and I’m the data scientist at Risk I/O. I came to Risk I/O just when we were starting to build out our predictive analytics functionality, and I have been wrangling that process like a wild animal ever since. When I started, we didn’t have Hadoop clusters, we didn’t have Hive or Spark, we didn’t have custom machines running in our own data center, and we didn’t even think about machine learning. I’m proud to say that my data science operation has kept everything that way… except we recently got rid of MongoDB. “Such wow, so few science,” you might say. You’d be wrong. In fact, very intentionally limiting our complexity, both statistical and technological, while still generating actionable insights and new near-real-time data products, is our biggest win to date.

Less Is More: Behind The Data at Risk I/O

I don’t like dealing in platitudes. Taken at face value, “less is more” is a meaningless statement, and even in something as simple as security, nothing is quite so black and white. But today I want you to consider the situations and impacts of resisting the current trends: “more data, more data science, more Hadoop”. Yes, it’s 2014, and yes, we need new methods for generating value from our data. But that doesn’t mean we need to buy tulips to do it.

Less Tools

Less Data

Less Model Complexity

More Impact

Less Data Scientists

There are four contentions to my rant. First, there is a propensity to implement complex and costly-to-maintain tools without an explicit need for them. I am versed in Hadoop lore, but there’s only a slight difference between foreseeing needs and overspending. Second, data is everywhere and so are data scientists. It’s surprising and wonderful how much data-driven work can be done within an existing environment once the directive is given. Third, collecting all the things is great, but data comes at a cost - the cost of storage, the cost of cleaning, the cost of complexity. It is important to know which questions you’re answering before you begin. And lastly, all of these “constraints” are actually blessings. Much like our beloved Twitter and its 140 characters, limiting the scope and the tools makes for a much more precise and useful product.

Say “Big Data” One More Time

As resistant as people are to change, I hear about a lot of organizations jumping the gun on technological or organizational change. I want to explain an ongoing trend I like to call “knee-jerk Hadoop”. Tons of organizations think they’ll get a fast datastore or a mature analytics practice just because they hired 5 people to run a Hadoop cluster. A horn and a horse does not a unicorn make. Real efficiency comes from understanding the run-time complexity of your backend and making it as efficient as possible, which takes time and specific knowledge of the system.

Cautionary Tales

A great example of this is Value America, of dot-com bubble fame. (There’s a similar Groupon tale, but that one’s still ongoing.) Value America was one of the first just-in-time models, connecting customers directly to manufacturers, much like Dell. It was backed by the founders of Microsoft and FedEx. At the peak of its success, the company was hiring 100 people a month for over a year. When others caught on to the model and hard times hit, it fired 300 people and continued to fire 100 a month - and the morale and communication impacts of the layoffs were the cause of bankruptcy a year later. This is the cost of a bad forecast that changes organizational and technical structure.

“It don’t matta if you win by an inch or a mile - winning’s winning.”

-Vin Diesel, The Fast and the Furious, 2001

Winner, Best Movie, MTV Movie Awards

At Risk I/O we launched our first predictive models back in March of 2013 - they were crude predictions of priority vulnerabilities. They weren’t even written by a proper developer. They ran on a Ruby backend, they took half a day to compute, and even longer to index. They pulled from Mongo, used that to look up aggregations in MySQL, did calculations in Ruby, and pushed back up to Mongo. It was horrible, but it worked. One year later, the same method runs every half an hour, and indexing takes minutes.

We’ve expanded the scope of the model, and it includes 4 times as many inputs. Our code is smarter, our algorithms ignore duplication at every step, we’ve scrapped Mongo, and we’ve modified the query DSL for Ruby to fit our needs. We could have easily said, “This is slow. Ruby sucks. Bring me Hadoop with a side of Python, please.”

Here’s what we gained by NOT doing that:

1. We’ve saved on infrastructure costs.
2. We took the time to understand exactly what the algorithms are doing, where the deltas are, and what kind of behavior we can expect moving forward.
3. We didn’t spend time and energy hiring, and then tasking our engineers with knowledge transfer.
4. We gained a clean and easy-to-expand analytics module which requires no specialized skills to work with. If my plane crashes tomorrow into the side of Ed Bellis’s house, a 20-year-old without prior CS knowledge could scale the algorithms.

And most importantly, we can still make that move when we need to. By now, we have Spark and Julia. New tools have evolved that might serve our purpose better and put us ahead of the status quo.
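One possible reading of “ignore duplication at every step” is that a score is computed once per distinct vulnerability and then reused for every asset that carries it. The notes do not describe the actual mechanism, so the sketch below is an assumption: the function names, the (asset, vulnerability) pair layout, and the placeholder scoring function are all hypothetical, and it is written in Python for brevity rather than in the Ruby the production code uses.

```python
def score_instances(instances, score_fn):
    """Score (asset, vuln_id) pairs, calling score_fn once per distinct vuln_id."""
    cache = {}      # vuln_id -> score, filled lazily
    results = []
    for asset, vuln_id in instances:
        if vuln_id not in cache:
            # Only the first occurrence of a vulnerability pays for scoring.
            cache[vuln_id] = score_fn(vuln_id)
        results.append((asset, vuln_id, cache[vuln_id]))
    return results


# Hypothetical stand-in for an expensive scoring routine; it records its calls
# so the saving is visible.
calls = []

def toy_score(vuln_id):
    calls.append(vuln_id)
    return len(vuln_id)  # placeholder value, not a real risk score


# Made-up identifiers: two assets share the same vulnerability.
instances = [("web-1", "VULN-A"), ("web-2", "VULN-A"), ("db-1", "VULN-BB")]
scored = score_instances(instances, toy_score)
```

With three instances but only two distinct vulnerabilities, the scoring function runs twice instead of three times. On real data, where one vulnerability can appear on many assets, that gap is where the time goes.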

Everyone is a Data Scientist

Don’t Save For Tomorrow What You Can Do In Excel Today.

But also don’t use Excel. I come from an academic background, and I am well versed in R and, more recently, Wizard, a tool that I encourage everyone to look into. However, data science work is largely domain knowledge, and I have the least domain knowledge of anyone. If I ask Andrea, our marketing manager, the right set of questions, she can do 80% of the work in Google Analytics or KISSmetrics. Even Ed Bellis knows how to write a SQL query. I am the only data scientist at Risk I/O, but our data science operation is closer to 3 people: 100% of my time, a quarter of the CEO’s, CTO’s, and marketing manager’s time, and another 10-20% of a security architect and a developer. All this to say, it’s important to look around before expanding data science, and to recognize the importance of specialized domain knowledge.

Take Only What You Need

Not all data is good data. Not all good data is useful data. The New York Times and Business Week view of data science is that you collect a slew of data, unleash a hipster from Brooklyn on it, and voila, insight! I disagree. It is much leaner (in the Deming sense of cutting out useless movements) to first ask the right questions, then collect the right data, and then generate the right answer.

Here’s how that works at Risk I/O. We attempt to solve the contextual problem of which vulnerabilities put an enterprise most at risk. We have access through partner channels and public data to every kind of security data under the sun - yet, when deciding which data partnerships to pursue or which data to use, we have a very strict set of criteria that filters out the noise for us BEFORE we get into the hard work.

Here’s an example of just ONE data source integration I did at the end of last year: [168x167 system of equations per CVE live stream row echelon reduction].

Making quality decisions before you start the process is fundamental to the quality control methods pioneered by Taguchi and Toyota. The same applies to data cleaning. So we only take active attacks, active breaches, or data that we can turn into the two. We don’t care about IP reputation data, malware analysis, or data that overlaps with a public source. That’s because we can’t afford to row reduce your live stream only to find out it’s useless.
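The notes mention row echelon reduction of a 168x167 system but do not say what that system encoded. As a generic illustration of the technique only, here is a minimal sketch of Gaussian elimination to row echelon form on a tiny made-up matrix. The function names and the example matrix are assumptions, not the integration described above.

```python
from fractions import Fraction


def row_echelon(matrix):
    """Return a row echelon form of `matrix` (a list of rows), using exact fractions."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    pivot_row = 0
    for col in range(n_cols):
        if pivot_row == n_rows:
            break
        # Find a row at or below pivot_row with a nonzero entry in this column.
        swap = next((r for r in range(pivot_row, n_rows) if rows[r][col] != 0), None)
        if swap is None:
            continue  # no pivot in this column
        rows[pivot_row], rows[swap] = rows[swap], rows[pivot_row]
        # Scale the pivot row so its leading entry is 1.
        pivot = rows[pivot_row][col]
        rows[pivot_row] = [x / pivot for x in rows[pivot_row]]
        # Eliminate the entries below the pivot.
        for r in range(pivot_row + 1, n_rows):
            factor = rows[r][col]
            if factor != 0:
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[pivot_row])]
        pivot_row += 1
    return rows


def rank(matrix):
    """Number of nonzero rows after reduction, i.e. rows carrying independent information."""
    return sum(1 for row in row_echelon(matrix) if any(x != 0 for x in row))


# Made-up example: the first row is just twice the second, so it adds nothing new.
example = [[2, 4, 6], [1, 2, 3], [0, 1, 1]]
reduced = row_echelon(example)
independent = rank(example)
```

In this toy case the reduction leaves one all-zero row, so only two of the three rows are independent. Finding that out after the expensive work is done is exactly the waste the acceptance criteria are meant to avoid.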

Transparency, et al.

There are huge wins from model simplicity too. For the Fix What Matters story, refer to www.risk.io/data-driven-security. Wins:

1. Transparency
2. Ease of implementation
3. Feedback loop on tools, data

[Chart: Probability A Vuln Having Property X Has Observed Breaches. Bars: Random Vuln, CVSS 10, CVSS 9, CVSS 8, CVSS 6, CVSS 7, CVSS 5, CVSS 4, Has Patch. Axis: 0.000 to 0.040.]

[Chart: Probability A Vuln Having Property X Has Observed Breaches. Bars: Random Vuln, CVSS 10, Exploit DB, Metasploit, MSP+EDB. Axis: 0.0 to 0.3.]
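The quantity charted here is a conditional probability: among vulnerabilities with property X, the share that have observed breaches, compared against the baseline rate for a random vulnerability. A minimal sketch of that calculation follows. The record layout, field names, and sample values are hypothetical and made up for illustration; they are not Risk I/O’s schema or data.

```python
from collections import Counter


def breach_probabilities(vulns):
    """Return {property: share of vulns with that property that have observed breaches}."""
    totals = Counter()
    breached = Counter()
    for vuln in vulns:
        # "random_vuln" is the baseline: every vulnerability counts toward it.
        for prop in set(vuln["properties"]) | {"random_vuln"}:
            totals[prop] += 1
            if vuln["breached"]:
                breached[prop] += 1
    return {prop: breached[prop] / totals[prop] for prop in totals}


# Toy records, invented for this example only.
sample = [
    {"properties": {"cvss_10", "metasploit"}, "breached": True},
    {"properties": {"cvss_10"}, "breached": False},
    {"properties": {"metasploit", "exploit_db"}, "breached": True},
    {"properties": {"exploit_db"}, "breached": False},
    {"properties": {"has_patch"}, "breached": False},
]
probs = breach_probabilities(sample)
```

Nothing more than counting and dividing is involved, which is the point: a model this simple is transparent, easy to implement, and easy to check against the data feeding it.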

Know What You’re After

We have recently been working on a model for risk assessment, the technical documentation for which some of you have seen and which we’ll be releasing shortly. This is a lot more involved than finding one risk factor on a vulnerability, but we’ve structured the effort in a similar manner. We collected a subset of data we knew ahead of time would be relevant. We used the tools and the expertise at our disposal to explore the data. Most of my work in creating the model was done on paper, messing around with algebraic equations until they were simple enough to be understood easily without losing their value.

www.risk.io

@mroytman

Holler!

db.risk.io
