
March 2011. © 2011 The Royal Statistical Society

Hal Varian and the sexy profession

The data revolution is upon us. The data we have and the way we treat it has changed beyond measure – as witness a quote from Hal Varian of Google: “Back in the early days of the Web, every document had at the bottom, ‘Copyright 1997. Do not redistribute.’ Now every document has at the bottom, ‘Copyright 2008. Click here to send to your friends.’” Hal Varian has made another famous quote about statistics in the new data age. Julian Champkin interviewed him.

Hal Varian is not actually a statistician himself. He is an economist. He is in fact the chief economist at Google. As such he is the spokesman for the organisation which is presumably the biggest transmitter, organiser, analyser and general handler of data that the world has ever seen. He is also of course the man who said: “The sexy profession of the next decade will be statistician.” He made the quote in 2008, and it will not let him go.

I came across him at the Joint Statistical Meeting of the American Statistical Association in Vancouver, where he was speaking and recruiting statisticians – whether for sexy or other jobs I did not ask – for Google. He is not sure exactly how many statisticians Google currently employs – it is hard to define a statistician, he says – but out of Google’s 22 000-odd employees 600 subscribe to their internal statistics mailing list, which gives, he supposes, some indication. Google recruited Varian himself nine years ago from Berkeley, where he was Professor and Founding Dean of the School of Information and from where he also penned regular pieces in the New York Times. You can add a couple of best-selling economics textbooks to his record. Think huge-brained polymath, and you will not be far wrong. He is a quiet but witty speaker, courteous, and more helpful than mere politeness demands. But also, as someone who occasionally loses his mobile phone, he may not be too unlike the rest of us.

This series has the title ‘A Life in Statistics’ rather than ‘A Life in Economics’ but, as a man whose one-liner is encouraging a generation to take up a life in statistics, he clearly deserves to be here. But why, when he was starting out in life, did he choose economics rather than statistics? Elsewhere he has written that it was because he wanted to understand people and the way they work. To me he put it like this: “They say economists are people that are good with numbers but don’t have the personality to be accountants. Could this be the explanation? But when I speak of ‘statisticians’ I consider the term broadly to include all the occupations that use quantitative methods to analyse data. So econometricians, psychometricians, operations researchers and many other professions also are included.”

So by that definition he is a statistician after all. At any rate his business certainly handles and analyses data in unimaginable quantities. And if anyone doubts the sheer, unintended, and probably unwanted power that Google’s massive database can unleash – to say nothing of the law of unintended consequences – consider a news story from November 2010: the Google war between Nicaragua and Costa Rica. The San Juan river has been the de facto, but somewhat disputed, border between them; then Google Maps appeared and for a stretch of a mile and a half put the border on the Costa Rican side of the river. The Nicaraguan army promptly moved across the river and set up camp on the apparently Google-approved new bit of Nicaragua. Costa Rica protested vigorously at the incursion and is still protesting; shooting may have been avoided only because Costa Rica is, sweetly, the only nation on earth that does not have an army. None of which is Hal’s or even Google’s fault, but it did lead one commentator to paraphrase Otto von Bismarck’s prophetic words on the First World War: “The next time war comes, it will probably be over some damned stupid thing on Google.”

Hal Varian is, therefore, the ideal man to tell us about massive amounts of data, the opportunities that lie therein for statisticians to mine, analyse and exploit, and what it may all mean for the world.

He has written, in one of his economics textbooks, that when starting out as an economist he unwittingly found himself in the middle of an information revolution. That was then. The revolution, never mind the information, is even greater now. Do such huge quantities of data need new methods to extract truth from them? Are the traditional statistical methods still valid or are new ways needed? Learn, he says, from the computer scientists. “In the last decade we have seen a very fruitful interaction between computer scientists working in machine learning and statisticians. The computer scientists are used to working with vast amounts of data using relatively unstructured models. Statisticians tend to have more complex models but focus on smaller data sets. I think that these two fields have a lot to learn from each other.”

As witness his own early contribution to Google’s fundamental business model. As is well known, there is no standard price for the advertisements on Google; they are sold by auction. “Actually, the ad auction was in place when I joined Google – it was developed by some very talented and astute computer engineers. They were very smart people but didn’t know much about the existing work in the economics of auction design. My first assignment was to examine the economics of the auction, and I think I made some useful contributions there by analysing the auction in terms of game theory and relating it to the classic literature on two-sided matching models.”

He has drawn parallels with the turn of the last (OK, last but one) century, the 19th to the 20th, when the telegraph and telephone between them provided an earlier information revolution – a new network of almost instant worldwide communication. Businesses that adapted their models to handle it flourished; those that did not went under. What is different now, he says, is that data is effectively so cheap as to be free. “And you see big changes when you move from ‘cheap’ to ‘free.’ Because now we really do have essentially free and ubiquitous data.” Businesses cannot therefore sell mere data; they have to find new ways of adding value to that data – cue statisticians, and the value they add by analysing and interpreting it. “It’s not so much a question of what’s owned or what’s not owned. It’s a question of how can you leverage the assets you have to realise the most value.”

Does this change the basics of business? “I don’t think that it’s the laws of business that have changed, just the emphasis. At one point it used to be very expensive to distribute textual material since it all had to be copied by hand. The invention of the printing press made distribution much, much cheaper, but it still cost something. The internet has made the cost of distributing textual information virtually free. It’s a continuation of what has gone before.”

Even so, the name of the data-business game has changed. Your intellectual property is not what it was. Even the balance sheet of what you own and what you can sell is not what it was either. An advertisement then on-line for a statistician to work for Google in Pittsburgh said: “The market space in which Google operates changes very rapidly and has many complex dynamics that are interrelated: in this environment traditional business modeling and analytical frameworks are helpful, but they can no longer fully describe the way our business works.” Which raises the question: what, then, can fully describe the way his business works? “Our point is that traditional spreadsheets and common business analytics aren’t sufficient. You have to use the more sophisticated techniques from scientific data analysis to really understand our business.

“The great thing about Google is that they have built an infrastructure capable of managing large amounts of data in effective ways, so it is a lot easier to manipulate and analyse data here. Without these tools, the data itself would be useless. Lots of businesses are collecting large amounts of data, but they don’t have the tools or the expertise to really extract useful information from the data they have.”

And the things that can be done when such vast amounts of data come together and are usefully extracted from are simply extraordinary. Statistics applied to such huge internet-acquired data sets acquires whole new powers – which Google is at the forefront of exploiting. As witness, for example, their language translation programs – run by statisticians, not linguists. It is not necessary to be able to understand French or Chinese to produce a program that will translate between the two. This, to old ways of thinking, is astonishing. It is fairly astonishing to new ways of thinking as well. As he agrees. “It is amazing how well statistical techniques work for language translation. The key is to have a lot of parallel translations of the same documents for your data. So international organisations like the UN, OECD, and so on have been very helpful.” Line up enough documents, do a statistical search for words which appear in roughly similar places – and hey presto, you can translate the greatest works of man. Perhaps those typewriting monkeys trying to write Shakespeare were onto something after all.
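To make the idea of lining up parallel documents concrete, here is a toy sketch in Python. It is not Google's method, which the interview does not describe in detail; it simply counts how often each English word shares a sentence pair with each French word and keeps the strongest pairing, scored with a Dice coefficient (our choice of score). The four sentence pairs in `PARALLEL_SENTENCES` and the function name `align_words` are invented for illustration.

```python
from collections import Counter
from itertools import product

# Invented toy "parallel corpus": each pair is the same sentence in two languages.
PARALLEL_SENTENCES = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
    ("a car", "une voiture"),
]


def align_words(pairs):
    """Map each source word to the target word it co-occurs with most strongly."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        # Count every (source word, target word) pairing seen in this sentence pair.
        joint.update(product(s_words, t_words))

    best = {}
    for (s, t), n in joint.items():
        # Dice coefficient: 1.0 when the two words always appear together.
        score = 2 * n / (src_count[s] + tgt_count[t])
        if s not in best or score > best[s][1]:
            best[s] = (t, score)
    return {s: t for s, (t, _) in best.items()}


if __name__ == "__main__":
    print(align_words(PARALLEL_SENTENCES))
```

No dictionary or grammar goes in; the pairings fall out of the counts alone, which is the point Varian is making. Real systems need vastly more text and far more elaborate models.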

Another application he is proud of is voice recognition: “The voice recognition team is a heavy user of statistical methods for signal processing. When we started working in this area back in 2006, we had no data. So the group created a ‘directory assistance’ system called GOOG411 that served up phone numbers based on verbal requests. All that data was collected and analysed and the system was continuously improved. Now we have discontinued GOOG411 since we have a rich set of data coming in from Android mobile phones.”

Google even has, famously, driverless cars – I assume their programming also relies heavily on statistics. A less exotic, perhaps more mainstream and traditional use of data is for forecasting. Purchases made on the internet might be used to generate a daily measure of inflation. Preis et al.¹ have shown that a correlation exists between search volumes for company names and the volumes of stock traded in those companies. Click rates can show home-buying on the up or down. Varian calls it “taking the pulse of the economy”. “We think [such data] definitely have predictive power.” All of this implies that the causal laws that link click volumes and various bits of the economy do not have to be understood in order to be able to obtain forecasts from search data; just as in language translation, all that needs to be nailed down is the correlation. Almost any relationship, it seems – causal or not, direct or indirect, first-, second- or third-order – can be exploited in this way.

Is all this leading to a degrading of the importance of understanding the causes that connect two or more phenomena, in favour of merely analysing the correlation between the two? Does the predictive power come at the cost of understanding the physical or economic laws underlying them? “I don’t think so. Most of what we have been doing has focused on ‘predicting the present’ or ‘nowcasting’. So we are looking at contemporaneous indicators. For example, the number of searches on ‘file for unemployment benefits’ could in a given week be a good indicator of the actual number of filings. It turns out that historically there is a strong relationship between the number of filings for unemployment benefits and the unemployment rate 12–18 months later. So knowing the initial claims for unemployment benefits helps you forecast the unemployment rate down the road. But there is still a lot of interesting scientific work in tracing through the causal mechanism in this chain of events. The work we have done in this area has used the publicly available data from Google Search Insights. We did this very deliberately because we wanted to show how this data can be used by anybody who has the right set of skills. Frankly, there are so many applications out there that we can’t do it all, which is why we’ve focused on building the tools and illustrating some of the possibilities.”
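For readers who want to see what ‘nowcasting’ might look like in practice, the following Python sketch fits a straight line relating a weekly search index to the number of benefit filings, then uses this week's search figure to estimate filings before the official count appears. It is a minimal illustration under stated assumptions: the numbers are invented, the function names `fit_line` and `nowcast` are ours, and a single-predictor least-squares line stands in for whatever models Varian's team actually used.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def nowcast(slope, intercept, search_index_now):
    """Estimate the current value of the official series from today's search index."""
    return intercept + slope * search_index_now


if __name__ == "__main__":
    # Invented figures for illustration only, not real search or filing data.
    search_index = [40, 55, 47, 62, 70, 58]               # weekly search volume index
    filings_thousands = [310, 365, 330, 400, 425, 380]    # official filings, published later

    slope, intercept = fit_line(search_index, filings_thousands)
    # This week's searches are known now; the official filings are not yet published.
    print(round(nowcast(slope, intercept, 65)))
```

Note that the fit exploits only the correlation; nothing in it explains why searches and filings move together, which is exactly the question the interviewer raises.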

So instantly available internet purchase data can generate a daily measure of inflation and job search clicks can forecast employment rates which are only officially released weeks or months later. Does he see governments bowing out of producing such figures now that they can be obtained more quickly, more independently and more cheaply by a statistician, a computer, and some internet data? “Most large corporations have data systems that track important business metrics in real time. Think of companies like Visa, Mastercard, Federal Express, UPS, Wal-Mart, Target and dozens of others. This data can be very useful to economic policy-makers, and they are definitely looking at this sort of information. The challenge we face now is integrating this real-time, private sector data with the traditional economic statistics. That’s going to keep us busy for the next decade or so. I think that model selection is going to be more important in the future. In economics we have historically only had a few predictors for a given macroeconomic time series. Now we have millions of potential predictors. How do we decide which are best?”

There can be such a thing as too much data. Hence another value of knowing statistics: “A lot of classical statistics developed techniques to analyse data sets that were very small by today’s standards. Today’s data sets definitely pose challenges. However, we have to remember that we have a very powerful tool for dealing with such data: random sampling. In some cases examples or experiments conducted on 1% of the data can be very useful and much more efficient than trying to deal with the entire data set.”
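The point about random sampling is easy to try for oneself. The Python sketch below draws a 1% simple random sample and computes its mean; in the usage example the "data set" is 100,000 simulated values, an assumption made purely so the example is self-contained. The function name `sample_mean` and the fixed seed are ours.

```python
import random
import statistics


def sample_mean(data, fraction=0.01, seed=0):
    """Estimate the mean of `data` from a simple random sample.

    Returns the estimate and the number of items sampled.
    """
    rng = random.Random(seed)                  # fixed seed so the run is repeatable
    k = max(1, round(len(data) * fraction))    # sample size, at least one item
    sample = rng.sample(data, k)               # sampling without replacement
    return statistics.fmean(sample), k


if __name__ == "__main__":
    # Simulated stand-in for a large data set (assumption for illustration).
    rng = random.Random(1)
    population = [rng.gauss(50, 10) for _ in range(100_000)]

    estimate, k = sample_mean(population)
    print(f"mean of all {len(population)} values: {statistics.fmean(population):.2f}")
    print(f"estimate from {k} sampled values:  {estimate:.2f}")
```

With a spread of about 10 in the underlying values, a sample of 1,000 pins the mean down to within a few tenths of a unit, for one-hundredth of the computation.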

And, having analysed your data, you have to convey what you have found to a mass audience. Here, too, the old ways are inadequate. A picture is worth a thousand words, but a set of graphs is for most of a mass audience a turn-off; there are better ways. “There are two sorts of visualisation: one focuses on data exploration, the other on data presentation. Statisticians have had access to visual data exploration tools for a long time but the internet has enabled mass market publications to make use of interactive visualisation in exciting new ways.” He points to the New York Times Data Visualization Lab (http://open.blogs.nytimes.com/2008/10/27/the-new-york-times-data-visualization-lab/). This magazine points to page 40 and the work of David McCandless, which appears in the Guardian and elsewhere; either way, statisticians on the whole have been shamefully slow to up their presentational game. It is journalists who have had to prod them. Statisticians have rightly been complaining for years about poor understanding of statistics by journalists. Here is a reverse case. Basic tools of communication are too often used badly, or not used at all.

All of the above you could call the new business model for data, or for statistics. All of it depends on data liberation – the data stream being freely available in usable forms. Again, Varian’s books have pointed out how economic forces are working towards this. Even governments, notoriously retentive of information, seem to be making genuine efforts to join in. Data.gov in the US, and data.gov.uk in Britain, are only the start. Is this going to continue more or less automatically and unstoppably, due to the economic rationale? Or is it something that statisticians and others will continue to have to press for to make happen?

“Google has its own Data Liberation Front, whose mission is to make it easier for users to move their data in and out of Google products, as well as a Public Data Explorer that allows for easy access to many public data sets. Most statistical agencies are enthusiastic about making data available to the public, but they usually have resource constraints of one sort or another.” Despite the economic pressures, though, he does not see it as inevitable. “I think that statisticians should continue to make the case about how valuable data can be for understanding our society and how important it is to make it available.” In other words, continue the fight.

That, mind you, was before WikiLeaks began releasing thousands of confidential US State Department diplomatic cables. The US authorities have made it clear that they hope to prosecute Julian Assange, director of WikiLeaks: Attorney General Eric Holder said officials were pursuing a “very serious criminal investigation” into the matter. But on December 31st, 2010, the UK Information Commissioner, Christopher Graham, said “WikiLeaks is part of the phenomenon of the online, empowered citizen … these are facts that aren’t going to go away. Government and authorities need to wise up to that.”

Are we ready for all that data? Is too much information, too soon, making for open government – or is it making the world ungovernable?

Given the passions aroused on both sides – Julian Assange as hero or as traitor, finance networks such as PayPal being boycotted, ambassadors’ positions becoming untenable – the Wikiwar might well happen before the Google war. It is understandable that Hal Varian did not want to be drawn on the subject. All he did say was that, given that there were apparently half-a-million people with access to SIPRNet, which held the cables, it’s amazing we didn’t see more leaks before this.

Let us end on a happier note. Statistics will be the sexy profession of the next decade – but Varian said this a couple of years ago. What will be the sexy profession of the decade after?

“Gosh, we only have 8 years to go! I’ll go out on a limb and speculate that we are going to see some really cool stuff in biotech and human–machine interfaces. If you build a mobile phone into your body, you won’t forget it or lose it any more, and this will be a big boon to some of us.”

It is comforting, in a way, that even Hal Varian can lose his phone.

Reference

1. Preis, T., Reith, D. and Stanley, H. E. (2010) Complex dynamics of our economic life on different scales: insights from search engine query data. Philosophical Transactions of the Royal Society A, 368, 5707–5719.

Google knows where you live. Google car gathering data for Google Street View. Photo: Byrion Smith

Stop Press

As this article was going to press the events in Egypt that ousted President Mubarak were moving towards a climax. A key player was Wael Ghonim, Google’s head of marketing for the Middle East and North Africa. He has been credited with enabling the protest movement to function, to grow and to organise.

As a private Egyptian citizen rather than as part of his Google duties Ghonim set up a Facebook page for protesters, which became the focal point for their communications and their views. As the number of protestors grew, he was arrested in late January and held in custody for 11 days; his release on February 6th was followed by an “explosive response” from supporters, bloggers and pro-democracy activists on the internet on February 7th. A TV interview, in which he wept for those killed in the demonstrations, confirmed him in the protestors’ emotions. The commentator Fouad Ajami wrote:

“No turbaned ayatollah had stepped forth to summon the crowd. A young Google executive, Wael Ghonim, had energized this protest when it might have lost heart.”

Ghonim himself has said: “This revolution started online. This revolution started on Facebook. This revolution started in June 2010 when hundreds of thousands of Egyptians started collaborating content. We would post a video on Facebook that would be shared by 60 000 people on their walls within a few hours. I always said that if you want to liberate a society just give them the Internet…”

Google CEO Eric Schmidt said in a webcast that his company is “very proud” of Ghonim. “Ghonim and others were able to use a set of technologies that included Facebook, Twitter and a number of others to really express the voice of the people. And that is a good example of transparency. And we wish them very much the best. I have talked to him. We are very, very proud of what he has done.”

In the concluding paragraphs above, we wrote that the Wikiwar might well happen before the Google war. The words have been overtaken by events. It would appear that in Egypt the first Google, Facebook and Twitter revolution has already occurred.