Cognitive computing with big data, high tech and low tech approaches


I explain some very approachable methods for analyzing big data via a detour through clipper ships and the 19th century open source scene. Note that I mixed up the route of the Flying Cloud record in this talk. The Flying Cloud's record was actually from New York to San Francisco and was even more impressive than what I said. The usual time had been about 180 days. With Maury's charts, the time was reduced to about 135 days. The Flying Cloud's time was 89 days. Thanks to Chen Kung for noticing my error.


© 2014 MapR Technologies

Cognitive Computing on Hadoop

Low Tech and High Tech Approaches

Ted Dunning


Cognitive Computing on Hadoop with Data


Who I am

Ted Dunning, Chief Applications Architect, MapR Technologies

Email tdunning@mapr.com tdunning@apache.org

Twitter @Ted_Dunning

Apache Mahout https://mahout.apache.org/

Twitter @ApacheMahout


The outline

• The first open source, big data project

• Another big data project

• Conclusions


First:

An apology for going off-script


Now, the story


In 1866, the top finishers in the tea race reached London in 99 days, within 2 hours of each other


But in 1851, the record had been set at 89 days by the Flying Cloud


The difference was due (in part) to big data


These charts were free …

If you donated your data


But how does this apply today?


Key Points of Maury’s Work

• Give to get
  – Give the Abstract Log to captains, get data
• Data consortium wins
  – Merging data gives pictures nobody else can see
• Give back
  – Them that gives, also gets
• But this is just what every data-driven web site does!
  – Just 150 years before everybody else


The Real News in Behavioral Analysis

• Everybody knows that:

• You need ensembles of many models to do recommendations

• You need to use factorization models

• You predict what you observe

• (You should predict ratings)


But …

none of this is really true


In fact,

• Fancy models are rarely useful expenditures of time
• Factorization can be good, but not much better (if at all)
• Ratings are disastrously bad data
• Cross-recommendation and multi-modal recommendations are much more interesting
  – Multiple kinds of input are far better than multiple models
• The UI has a far larger impact than the models
• The best algorithms combine simplicity with accuracy
  – So simple you can embed them in a search engine


Here’s how


Cooccurrence Analysis


How Often Do Items Co-occur


Which Co-occurrences are Interesting?

Each row of indicators becomes a field in a search engine document


Recommendations

Alice got an apple and a puppy

Charles got a bicycle

Bob got an apple

What else would Bob like?

A puppy!

You get the idea of how recommenders can work…


By the way, like me, Bob also wants a pony…


Recommendations

Users: Alice, Bob, Charles, Amelia

What if everybody gets a pony?

What else would you recommend for new user Amelia?

If everybody gets a pony, it’s not a very good indicator of what else to predict...

Problems with Raw Co-occurrence

• Very popular items co-occur with everything, which is why it’s not very helpful to know that everybody wants a pony…
  – Examples: Welcome document; Elevator music
• Very widespread occurrence is not interesting for generating indicators for recommendation
  – Unless you want to offer an item that is constantly desired, such as razor blades (or ponies)
• What we want is anomalous co-occurrence
  – This is the source of interesting indicators of preference on which to base recommendation


Overview: Get Useful Indicators from Behaviors

1. Use log files to build history matrix of users x items
  – Remember: this history of interactions will be sparse compared to all potential combinations

2. Transform to a co-occurrence matrix of items x items

3. Look for useful indicators by identifying anomalous co-occurrences to make an indicator matrix
  – Log Likelihood Ratio (LLR) can be helpful to judge which co-occurrences can with confidence be used as indicators of preference
  – ItemSimilarityJob in Apache Mahout uses LLR
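The first two steps can be sketched in a few lines of Python. This is only an illustrative sketch, not the Mahout implementation: the log records, the function names (`build_history`, `cooccurrence`, `contingency`) and the in-memory dictionaries are all assumptions made for the example. The last helper derives the 2x2 table of counts for a pair of items, which is the input an LLR test needs in step 3.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (user, item) log records, loosely following the toy example in the slides.
log = [
    ("alice", "apple"), ("alice", "puppy"), ("alice", "pony"),
    ("bob", "apple"), ("bob", "pony"),
    ("charles", "bicycle"), ("charles", "pony"),
]

def build_history(records):
    """Step 1: sparse users x items history, stored as user -> set of items."""
    history = defaultdict(set)
    for user, item in records:
        history[user].add(item)
    return dict(history)

def cooccurrence(history):
    """Step 2: items x items counts of users who interacted with both items."""
    counts = defaultdict(int)
    for items in history.values():
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return dict(counts)

def contingency(history, a, b):
    """2x2 table for items a and b: (both, a without b, b without a, neither)."""
    n = len(history)
    both = sum(1 for items in history.values() if a in items and b in items)
    only_a = sum(1 for items in history.values() if a in items) - both
    only_b = sum(1 for items in history.values() if b in items) - both
    return both, only_a, only_b, n - both - only_a - only_b

history = build_history(log)
cooc = cooccurrence(history)
print(cooc[("apple", "puppy")], cooc[("apple", "pony")])
print(contingency(history, "apple", "puppy"))
```

In this toy data the pony co-occurs with every other item simply because everybody has one, which is the problem raw counts have and the reason step 3 scores the 2x2 table instead of using the count directly.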


Which one is the anomalous co-occurrence?

        A       not A
B       13      1000
not B   1000    100,000

        A       not A
B       1       0
not B   0       10,000

        A       not A
B       10      0
not B   0       100,000

        A       not A
B       1       0
not B   0       2


Which one is the anomalous co-occurrence?

        A       not A
B       13      1000
not B   1000    100,000
Score (root LLR): 0.90

        A       not A
B       1       0
not B   0       10,000
Score (root LLR): 4.52

        A       not A
B       10      0
not B   0       100,000
Score (root LLR): 14.3

        A       not A
B       1       0
not B   0       2
Score (root LLR): 1.95
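The four scores on this slide can be reproduced with a short calculation. The sketch below computes the log-likelihood ratio statistic for a 2x2 table from unnormalized entropies and then takes its square root. The function names are my own, and the sign that some implementations attach to the root score is left out as a simplification, since all four tables here show positive association.

```python
import math

def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy: N ln N - sum(k ln k)."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Guard against tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))

def root_llr(k11, k12, k21, k22):
    """Unsigned square root of the LLR statistic."""
    return math.sqrt(llr(k11, k12, k21, k22))

# The four tables from the slide, as (A and B, B without A, A without B, neither).
tables = [
    (13, 1000, 1000, 100_000),
    (1, 0, 0, 10_000),
    (10, 0, 0, 100_000),
    (1, 0, 0, 2),
]
scores = [root_llr(*t) for t in tables]
for table, score in zip(tables, scores):
    print(table, f"{score:.2f}")
```

The first table has the largest raw co-occurrence count (13) but the lowest score, because that many co-occurrences is roughly what chance alone predicts for two items that each appear about a thousand times. The third table, with only 10 co-occurrences and no other occurrences of either item, is the clearly anomalous one.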


Collection of Documents: Insert Meta-Data

Search Technology

Item meta-data

Document for “puppy”
id: t4
title: puppy
desc: The sweetest little puppy ever.
keywords: puppy, dog, pet

Ingest easily via NFS


From Indicator Matrix to New Indicator Field

Solr document for “puppy”
id: t4
title: puppy
desc: The sweetest little puppy ever.
keywords: puppy, dog, pet
indicators: (t1)

Note: data for the indicator field is added directly to the meta-data for a document in an Apache Solr or Elasticsearch index. You don’t need to create a separate index for the indicators.
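To make the idea concrete, here is a small in-memory sketch of recommending by searching the indicator field with a user's recent history. It does not call Solr or Elasticsearch. The overlap count stands in for a real engine's relevance score, and everything except the “puppy” document (id t4, indicators t1) is a made-up assumption, including the other ids, the titles and the function names.

```python
# Hypothetical catalog: only the "puppy" document (id t4, indicators t1) comes
# from the slides; the other ids, titles and indicator lists are invented.
docs = [
    {"id": "t1", "title": "apple", "indicators": ["t4"]},
    {"id": "t2", "title": "bicycle", "indicators": []},
    {"id": "t3", "title": "pony", "indicators": []},
    {"id": "t4", "title": "puppy", "indicators": ["t1"]},
]

def indicator_query(history_ids):
    """Lucene-style query text matching a user's history against the indicators field."""
    return "indicators:(" + " ".join(history_ids) + ")"

def recommend(docs, history_ids, limit=10):
    """Rank unseen documents by how many of their indicators appear in the history.

    A real search engine would apply its own relevance scoring; the plain
    overlap count used here is a simplification for illustration.
    """
    seen = set(history_ids)
    scored = []
    for doc in docs:
        if doc["id"] in seen:
            continue  # do not recommend what the user already has
        overlap = len(seen.intersection(doc["indicators"]))
        if overlap:
            scored.append((overlap, doc["id"]))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:limit]]

# Bob's history is just the apple (t1).
print(indicator_query(["t1"]))
print(recommend(docs, ["t1"]))
```

The point of the design is that the recommender becomes an ordinary search: the user's recent item ids are the query terms, and the indicator field is just one more field on each item document.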


Going Further: Multi-Modal Recommendation


For example

• Users enter queries (A)
  – (actor = user, item = query)
• Users view videos (B)
  – (actor = user, item = video)
• AᵀA gives query recommendation
  – “did you mean to ask for”
• BᵀB gives video recommendation
  – “you might like these videos”


The punch-line

• BᵀA recommends videos in response to a query
  – (isn’t that a search engine?)
  – (not quite, it doesn’t look at content or meta-data)
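A tiny worked example may help show what the cross-occurrence product contains. The matrices below are invented for illustration (three users, two queries, three videos), and the helper names are assumptions. Each entry of the product of B transposed with A counts the users who both entered a given query and viewed a given video.

```python
# Hypothetical data: rows are users, columns are queries (A) or videos (B).
queries = ["paco de lucia", "van halen"]
videos = ["flamenco guitar", "classical guitar", "hard rock"]

A = [  # users x queries
    [1, 0],
    [1, 1],
    [0, 1],
]
B = [  # users x videos
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
]

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(x, y):
    """Plain-Python matrix product, adequate for tiny dense examples."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

# videos x queries: how many users issued the query and also viewed the video.
BtA = matmul(transpose(B), A)

def videos_for_query(query):
    """Videos ranked by raw cross-occurrence with the query, dropping zero counts."""
    q = queries.index(query)
    ranked = sorted(
        ((BtA[v][q], videos[v]) for v in range(len(videos)) if BtA[v][q] > 0),
        key=lambda pair: (-pair[0], pair[1]),
    )
    return [name for _, name in ranked]

print(BtA)
print(videos_for_query("paco de lucia"))
```

Raw counts are used here only to keep the example short. As with ordinary co-occurrence, the cross-occurrence counts would normally be filtered with a test such as LLR before being used as indicators.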


Real-life example

• Query: “Paco de Lucia”
• Conventional meta-data search results:
  – “hombres de paco” times 400
  – not much else
• Recommendation based search:
  – Flamenco guitar and dancers
  – Spanish and classical guitar
  – Van Halen doing a classical/flamenco riff


Hypothetical Example

• Want a navigational ontology?
• Just put labels on a web page with traffic
  – This gives A = users x label clicks
• Remember viewing history
  – This gives B = users x items
• Cross recommend
  – B’A = label to item mapping
• After several users click, results are whatever users think they should be


More Details Available

available for free at

http://www.mapr.com/practical-machine-learning


Q & A

@mapr maprtech

jbates@mapr.com

Engage with us!

MapR

maprtech

mapr-technologies
