What is a Data Scientist?

This is from Forbes online.

Amazon’s John Rauser on “What Is a Data Scientist?”

Is it possible that a mathematician’s 250-year-old discovery holds the ideal template for a new, future-ready breed of technologist, one who will be capable of extracting value and wisdom from the mounting deluge of “big data” from connected devices? That is the central theory of John Rauser, principal engineer at Amazon.com, who spoke recently at O’Reilly’s Strata Conference in New York.

(This article is the first in a series in which experts in the field answer the question: “What is a data scientist?” For a problem statement about the challenge of growing a data scientist and links to other articles in the series, please see: Growing Your Own Data Scientists)

According to Rauser, 18th-century German astronomer Tobias Mayer was the first “data scientist”: in short, a person who possesses both pure mathematical and applied engineering skills and can apply both of them in a useful way. Rauser believes this skill set is critical for helping enterprises solve and conquer the 21st century’s big-data challenges.

Wobbling on Librations

Mayer published a theory of the moon’s “libration” in 1750. The common theory had been that the moon always presents the same face to the earth. This is not quite true. In fact, the moon wobbles slightly on its axis each month. Additionally, the moon’s orbit is not a perfect circle, but is instead an ellipse, and the axis is not fully perpendicular to the earth’s rotation around the sun. Mayer is the astronomer who figured this out.

Here’s how he did it: Mayer looked at the crater Manilius as his data point. He took multiple observations over the course of a month, tracking the motion of the crater as a proxy for the face of the moon in its entirety. As the face “wobbles,” the crater moves around with it.

Mayer used spherical trigonometry to relate the unknowns, for which he sought estimates, to the known figures, which were the location points of the moving crater. We’ll spare you the equation Rauser showed on the screen, but essentially it is a set of algebraic equations with three unknowns.

The simplest strategy would have been to observe position X, Y, and Z on three occasions, resulting in three equations and three unknowns. But Mayer had 27 observations (we must assume he took between one and three days off, or they were too cloudy to take readings – this is Germany, after all).

What’s the connection to “big data”? Like today’s technologist, Mayer had more detail than he could handle with his current set of tools, “and had to invent his way out of the situation,” Rauser said.

He organized the observations into three groups of nine. The first group had large positive coefficients of alpha (greater than 1), the second had large negative coefficients (less than -1), and the third had values close to 0. He added up each group to arrive back at three equations. Rather than attempt to tackle 27 separate equations, Mayer felt that these three equations could stand in for all 27, because “each of these equations has been formed in the most advantageous manner.”
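The grouping step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Mayer's actual data or equation: the observations are synthetic, each one is assumed to be linear in three unknowns with coefficients (sin t, cos t, 1), and the names `make_observations`, `method_of_averages` and `solve_linear` are hypothetical helpers invented for this example.

```python
import math
import random


def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def make_observations(true_params, noise_sd=0.01, seed=0):
    """Build 27 synthetic equations (hypothetical data, not Mayer's readings)."""
    rng = random.Random(seed)
    coeffs, rhs = [], []
    for i in range(27):
        t = math.radians(-78 + 6 * i)  # assumed angles from -78 to +78 degrees
        row = [math.sin(t), math.cos(t), 1.0]
        value = sum(c * p for c, p in zip(row, true_params))
        coeffs.append(row)
        rhs.append(value + rng.gauss(0.0, noise_sd))
    return coeffs, rhs


def method_of_averages(coeffs, rhs, n_groups=3):
    """Sort equations by first coefficient, split into equal groups, sum each, solve.

    Assumes the number of equations is divisible by n_groups and that
    n_groups equals the number of unknowns.
    """
    order = sorted(range(len(rhs)), key=lambda i: coeffs[i][0], reverse=True)
    size = len(order) // n_groups
    summed_a, summed_b = [], []
    for g in range(n_groups):
        idx = order[g * size:(g + 1) * size]
        summed_a.append([sum(coeffs[i][k] for i in idx) for k in range(len(coeffs[0]))])
        summed_b.append(sum(rhs[i] for i in idx))
    return solve_linear(summed_a, summed_b)


coeffs, rhs = make_observations([1.5, -0.7, 0.3])
print(method_of_averages(coeffs, rhs))
```

Sorting by the first coefficient and cutting the list into thirds yields the same three groups the article describes (large positive, near zero, large negative). The grouping only works when the summed equations remain independent of one another, which is why the synthetic angles here cover less than a full circle.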

More Data is (Quantitatively) Better

This conceptual leap was extraordinary, if slightly flawed: “Because we made 9 times as many observations (as were required) we can conclude they are nine times more accurate.”

“This is the first time in history someone made a quantitative argument that more data is better, which makes Tobias Mayer the first data scientist in my mind,” Rauser said.

For those of you who remember statistics from school (or use it every day), you’ll recognize by now that Mayer got it slightly wrong. Statisticians later determined that the error of an estimate does not drop in direct proportion to the number of observations, but instead with the square root of the number of observations, so his final estimate was at best three times more accurate.
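The square-root rule can be written out as a short worked equation. It assumes, as a simplification not stated in the article, that the n observations have independent errors with a common standard deviation σ; the symbols σ, n and SE (standard error of the average) are introduced here only for illustration.

```latex
\operatorname{SE}(\bar{x}_n) = \frac{\sigma}{\sqrt{n}},
\qquad
\frac{\operatorname{SE}(\bar{x}_1)}{\operatorname{SE}(\bar{x}_9)} = \sqrt{9} = 3 .
```

Under those assumptions, averaging nine observations shrinks the typical error by a factor of three rather than nine, which matches the "at best three times more accurate" figure.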

Contrast Mayer with Leonhard Euler, a Swiss contemporary who is widely considered to be one of the world’s greatest mathematicians. Euler in 1749 was trying to contrast two variations between the orbit of Jupiter and Saturn. When faced with six equations and two unknowns, Euler wrote that “in the combination of two or more equations, the errors of the observations and of the calculations can multiply themselves.”

According to Rauser, “Euler is by far the greater mathematician, but Mayer had an ‘engineering sense’ – he was a working astronomer. Working with his own observations, he understood his instrument, and the kinds of errors it could introduce, and he had an intuitive sense for what was a likely error, while Euler was a mathematician and thinking only of the maximum number of errors that could enter into a calculation.”

Defining the Data Scientist (Dimensions 1-2)

To Rauser, Mayer’s is the set of skills that really opens up possibilities. Data scientists have applied mathematical and engineering (programming) skills. Both fields are important, and the combination of the two is what will be needed to tackle big data.

Without math, you are “just a software engineer,” said Rauser, who tempered his apparent criticism by announcing that he was “only” a software engineer for 10 years.

“You might be able to move around huge volumes of data, but you don’t have the skills to extract insight,” Rauser said. “If you don’t have skills as an engineer, then you only have skills as a statistician (and that’s what I spent the last half of my career doing).”

Insight is much easier to extract today, of course, than it was in Mayer’s time, or even a few years ago. Now, many important problems can be processed in the memory of a single off-the-shelf machine. But engineering skills let you investigate for yourself, making queries and getting answers, “without having to talk to anyone else or wait for a pristine data set to arrive on a silver platter,” Rauser said. Indeed, this is the premise of data analysis software on the market, such as QlikView, Splunk and TIBCO Spotfire. The right kind of mind is still needed to make sense of it all.

Therefore, the ideal data scientist is “someone who has both the engineering skills to acquire and manage large data sets, and also has the statistician’s skills to extract value from the large data sets and present that data to a large audience,” Rauser said.

Creating and Cultivating the Data Scientist (Dimensions 3-5)

So how does one become a data scientist – or hire one?

Rauser said there is really no discipline of data science, per se, currently taught in schools. His own training was dual degrees in aerospace engineering and computer science, followed by 10 years as a software engineer. He joined Amazon in 2003, which “was a really good place for data science, because evidence-based arguments are prized above all else,” he said. “It was at Amazon that I figured out that if you could code and answer business questions with data, people really like that. I taught myself more analytical techniques, like statistical modeling.”

There are degrees in computer science that could be augmented with classes on machine learning. For most firms, it would be easiest to find a promising engineer or statistician, connect them to the right resources, and grow them into the role.

But engineering and math are not the only dimensions that make up a data scientist – there are three other key aspects.

Communication

Communication skills are also key to making sure your insights have an impact. The ideal data scientist is not just a nebulous thinker who speaks in algorithms. He must be able to communicate with the wider world.

“If it is not written down, it never happened,” Rauser said. “If people in the future find your work, and can’t understand it, because the language is so opaque, you might as well have never done it.”

In other words, “the written word scales,” Rauser said, citing the fact that we are talking about Mayer today only because he successfully communicated his ideas in writing.

Skepticism

A healthy dose of skepticism constitutes the fourth dimension of the data scientist.

“If you have a healthy skepticism, you will look as hard for evidence that refutes your thesis as you will for evidence that confirms it,” Rauser said.

There is a reason that “born skeptic” is a common expression. But are all skeptics born, rather than made? How does one acquire skepticism? Can it be taught? Rauser says we’re in luck, citing the applied statistical computing course at Rice University, taught by one Hadley Wickham, creator of the ggplot2 statistical visualization package for the R statistical computing language.

Wickham places a value on skepticism that encourages it as a learned behavior. If a project uncritically accepts its findings, it gets an “F.” If a project is critical of its findings and uses “multiple approaches and techniques to verify unintuitive results,” an “A+” is awarded.

Curiosity

Curiosity is the final plank in the data scientist framework.

“The great data scientist is the one who lies awake at night, rolling around in her head just the right query that will crack a problem open, or the one that forgot to eat lunch because the data that comes in is so interesting,” Rauser said. “But the real reason curiosity is important is that a new dimension shows up – the curious person is excited to race out and become an expert on that topic.”

In order to acquire new domains of knowledge, mastery of one or more areas of expertise will be required. And only an intensely curious person will enjoy that kind of work.

Defining Success, and Happiness

Perhaps the most important characteristic of a data scientist is an indifference to honorifics, and a purity of desire to do some good in the world, Rauser said. “To me, success is happiness. People are happy when they are engaged in meaningful work, making the world a better place, even in small ways,” Rauser said. “Do something important that people care about, and all the trappings of success will follow.”

In case anyone was wondering about the “so what” of Mayer’s discovery, Rauser provided an example of a disaster that could have been avoided with Mayer’s knowledge. In 1707, a British naval flotilla hit rocks and sank off the Isles of Scilly near Cornwall, killing more than 1,400 men. The admiral, Sir Cloudesley Shovell, had thought the flotilla was in the English Channel, but a storm had blown it more than 100 miles off course. The key cause of the disaster was the inability of the ships’ crews to determine their east-west position, a known and vexing issue called the “longitude problem.”

After Mayer’s discoveries were published, navies were able to determine their longitude from the angular distance between the moon and the stars, taking into account the moon’s known orbit. This became a standard of naval navigation for a century. Lives were saved because of Mayer’s embrace of the preponderance of data in the constellations.

The hope, it seems, is that those with the skills to imagine solutions from the constellations of data that now confront us will be inspired to create similarly groundbreaking solutions, and that business leaders will have the foresight to identify and nurture them.

Dan Woods is chief technology officer and editor of CITO Research, a firm focused on the needs of CTOs and CIOs. He consults for many of the companies he writes about. For more stories about how CIOs and CTOs can grow visit CITOResearch.com.

 
