Box + Solr = Content Search for Business

Page 1: Box + Solr = Content Search for Business

1

June 2014

Box + Solr = Content Search for Business

Page 2: Box + Solr = Content Search for Business

2

Wei Zhao

Box backend [email protected]

Page 3: Box + Solr = Content Search for Business

3

Box mission

To make organizations more productive, competitive and collaborative by connecting people and their most important information.

Page 4: Box + Solr = Content Search for Business

4

25MM+ Users

225K+ Businesses

99% Fortune 500

Page 5: Box + Solr = Content Search for Business

5

Box's search mission is to make user content easy to discover.

Page 6: Box + Solr = Content Search for Business

6

10 Billion+ Documents

10TB+ Index size

100M+ Daily requests

Box uses Solr for search

Page 7: Box + Solr = Content Search for Business

7

Quick Search

Page 8: Box + Solr = Content Search for Business

8

Quick Search

Page 9: Box + Solr = Content Search for Business

9

Full Search

Page 10: Box + Solr = Content Search for Business

10

Agenda

1 Sharding – splitting the index

2 Highly available search

3 A few more things

4 Currently working on

5 Q&A

Page 11: Box + Solr = Content Search for Business

11

We shard things

Page 12: Box + Solr = Content Search for Business

12

Shard ID = File ID % Total Shards
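The routing rule on this slide can be written as a one-line function. This is a minimal sketch: the function name and the 16-shard example are illustrative assumptions, and the slides do not give the real shard count.

```python
def shard_id(file_id: int, total_shards: int) -> int:
    """Return the shard that holds a file: Shard ID = File ID % Total Shards."""
    if total_shards <= 0:
        raise ValueError("total_shards must be positive")
    return file_id % total_shards


# Hypothetical example: the file ID from the sample document, on 16 shards.
example_shard = shard_id(12345, 16)
print(example_shard)
```

Because the shard depends only on the file ID, every file of every user is spread across all shards, which fits the "one big logical index" picture that follows.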

Page 13: Box + Solr = Content Search for Business

13

Multi-tenant – One big logical index for all users

Solr index

Shard1 Shard2 Shard3 ShardN

Page 14: Box + Solr = Content Search for Business

14

Search scope

Page 15: Box + Solr = Content Search for Business

15

A typical Solr Document

File ID: 12345

Owner ID: user1

Parent Folder IDs: folder1, folder2

File Name: Solr.ppt

File Content: blah......
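As a rough illustration, the document on this slide could be represented as a JSON object before it is sent to Solr. The field names below (`id`, `owner_id`, `parent_folder_ids`, `file_name`, `file_content`) are assumptions made for the sketch; the slides do not show the actual schema.

```python
import json

# Hypothetical field names; values are taken from the slide.
doc = {
    "id": "12345",
    "owner_id": "user1",
    "parent_folder_ids": ["folder1", "folder2"],  # multi-valued field
    "file_name": "Solr.ppt",
    "file_content": "blah......",
}

# Solr's JSON update handler accepts a list of documents.
payload = json.dumps([doc])
print(payload)
```

Keeping the owner and the parent folders on every document is what lets a query be restricted to the files a user is allowed to see.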

Page 16: Box + Solr = Content Search for Business

16

Owner: User1, Parent: Folder1

Owner: User2, Parent: Folder3

Owner: User2, Parent: Folder2

Owner: User1, Parent: Folder1, Folder4

File 1 File 2

File 3 File 4

Page 17: Box + Solr = Content Search for Business

17

User1 with no shared folder

Owner: User1, Parent: Folder1

Owner: User2, Parent: Folder3

Owner: User2, Parent: Folder2

Owner: User1, Parent: Folder1, Folder4

File 1 File 2

File 3 File 4

Page 18: Box + Solr = Content Search for Business

18

User2 shares Folder2 with User1

Owner: User1, Parent: Folder1

Owner: User2, Parent: Folder3

Owner: User2, Parent: Folder2

Owner: User1, Parent: Folder1, Folder4

File 1 File 2

File 3 File 4

Page 19: Box + Solr = Content Search for Business

19

User2 shares Folder2 with User1

Owner: User1, Parent: Folder1

Owner: User2, Parent: Folder3

Owner: User2, Parent: Folder2

Owner: User1, Parent: Folder1, Folder4

File 1 File 2

File 3 File 4

Page 20: Box + Solr = Content Search for Business

20

User2 shares Folder2 with User1

Owner: User1, Parent: Folder1

Owner: User2, Parent: Folder3

Owner: User2, Parent: Folder5

Owner: User1, Parent: Folder1, Folder4

File 1 File 2

File 3 File 4

Removed out of Folder2

Page 21: Box + Solr = Content Search for Business

21

User2 shares Folder2 with User1

Owner: User1, Parent: Folder1

Owner: User2, Parent: Folder3

Owner: User2, Parent: Folder5

Owner: User1, Parent: Folder1, Folder4

File 1 File 2

File 3 File 4

Removed out of Folder2
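The search-scope walkthrough on the preceding slides can be replayed in a few lines. This is a sketch under assumptions: it models scope as "files the user owns, plus files under a folder shared with the user", and the labels A to D are placeholders because the transcript does not show which owner/parent pair belongs to which of File 1 to File 4.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FileDoc:
    name: str
    owner: str
    parents: frozenset


def in_scope(doc, user, shared_folders):
    """A file is searchable if the user owns it or a parent folder is shared."""
    return doc.owner == user or not doc.parents.isdisjoint(shared_folders)


def search_scope(docs, user, shared_folders=()):
    shared = frozenset(shared_folders)
    return sorted(d.name for d in docs if in_scope(d, user, shared))


docs = [
    FileDoc("A", "User1", frozenset({"Folder1"})),
    FileDoc("B", "User2", frozenset({"Folder3"})),
    FileDoc("C", "User2", frozenset({"Folder2"})),
    FileDoc("D", "User1", frozenset({"Folder1", "Folder4"})),
]

# User1 with no shared folder: only User1's own files.
no_share = search_scope(docs, "User1")

# User2 shares Folder2 with User1: the file under Folder2 joins the scope.
with_share = search_scope(docs, "User1", {"Folder2"})

# That file is moved out of Folder2 into Folder5: it leaves the scope again.
moved = [replace(d, parents=frozenset({"Folder5"})) if d.name == "C" else d
         for d in docs]
after_move = search_scope(moved, "User1", {"Folder2"})

print(no_share, with_share, after_move)
```

The last step shows why the parent folder IDs on a document have to be kept up to date when a file moves: a stale value would leave the file visible to a user who no longer has access.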

Page 22: Box + Solr = Content Search for Business

22

Highly Available Search

Page 23: Box + Solr = Content Search for Business

23

• Index is highly available

• Search functionality is highly available

Page 24: Box + Solr = Content Search for Business

24

Index workflow

Page 25: Box + Solr = Content Search for Business

25

Box Front End

Upload

Index Queue

Queue 1

Queue 2

Queue 3

Indexer 1

Indexer 2

Indexer 3

MySQL

Index1

Index2

Index2
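One possible reading of this diagram is that each upload is placed on several queues, and each indexer drains its own queue into its own copy of the index. The sketch below illustrates that reading only; the class and method names are hypothetical, and the slides do not confirm that the queues are fanned out this way.

```python
from collections import deque


class IndexPipeline:
    """Toy model: one queue, one indexer and one index copy per replica."""

    def __init__(self, copies):
        self.queues = [deque() for _ in range(copies)]
        self.indexes = [dict() for _ in range(copies)]

    def upload(self, file_id, doc):
        # Assumption: every upload event is enqueued once per index copy.
        for queue in self.queues:
            queue.append((file_id, doc))

    def run_indexer(self, i):
        # Indexer i consumes only queue i and writes only index copy i.
        processed = 0
        while self.queues[i]:
            file_id, doc = self.queues[i].popleft()
            self.indexes[i][file_id] = doc
            processed += 1
        return processed


pipeline = IndexPipeline(copies=3)
pipeline.upload(12345, {"file_name": "Solr.ppt"})
pipeline.run_indexer(0)
pipeline.run_indexer(1)
# Indexer 2 is "down": its queue keeps the backlog until it comes back.
backlog = len(pipeline.queues[2])
pipeline.run_indexer(2)
print(backlog, pipeline.indexes[2])
```

Under this reading, a slow or failed indexer delays only its own copy; the other copies stay current and the backlog is replayed later.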

Page 26: Box + Solr = Content Search for Business

26

Search workflow

Page 27: Box + Solr = Content Search for Business

27

Box Front End

query

HA Proxy

Head node

HA Proxy

1 2 3 N

Box Front End

query

HA Proxy

Head node

HA Proxy

1 2 3 N

HA Proxy

1 2 3 N

Data center boundary
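The search path in this diagram (a head node that queries shards 1 to N, duplicated on both sides of the data center boundary) can be sketched as a scatter-gather with a fallback. Everything here is an assumption made for illustration: shards are modelled as plain callables returning `(score, file_id)` pairs, the HA Proxy layers are left out, and failing over to the second data center is one plausible use of the duplicate stack, not something the slides state.

```python
import heapq


def head_node_search(shards, query, rows=10):
    """Send the query to every shard and merge the hits by score."""
    hits = []
    for shard in shards:
        hits.extend(shard(query))
    return heapq.nlargest(rows, hits, key=lambda hit: hit[0])


def search_with_failover(data_centers, query, rows=10):
    """Try each data center in order; move on if a shard is unreachable."""
    last_error = None
    for shards in data_centers:
        try:
            return head_node_search(shards, query, rows)
        except ConnectionError as exc:
            last_error = exc
    raise RuntimeError("no data center could serve the query") from last_error


def make_shard(hits):
    return lambda query: hits


def unreachable_shard(query):
    raise ConnectionError("shard unreachable")


dc1 = [make_shard([(0.9, "f1")]), unreachable_shard]
dc2 = [make_shard([(0.9, "f1"), (0.2, "f4")]), make_shard([(0.7, "f2")])]

top_hits = search_with_failover([dc1, dc2], "solr", rows=2)
print(top_hits)
```

Since files are spread over all shards by file ID, the head node has to consult every shard for each query, so a single unreachable shard affects the whole request; that is the case the fallback covers in this sketch.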

Page 28: Box + Solr = Content Search for Business

28

A few more things

Page 29: Box + Solr = Content Search for Business

29

File Content Search

Page 30: Box + Solr = Content Search for Business

30

Box Front End

Upload

MySQL

Box File Storage

Indexer

Solr Index

Text Extraction

Extracted Text

Page 31: Box + Solr = Content Search for Business

31

Multi-language support

Page 32: Box + Solr = Content Search for Business

32

Raw file content

Language detector

English tokenizer

Spanish tokenizer

Japanese tokenizer

German tokenizer

file_content_en

file_content_es {hola}

file_content_ja....

file_content_de
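The routing on this slide (detect the language, then store the text in a per-language field so the matching tokenizer is applied) might look like the sketch below. The detector is a deliberately crude stand-in and is an assumption, as is the fallback to English; only the `file_content_<lang>` field names come from the slide.

```python
SUPPORTED = {"en", "es", "ja", "de"}

# Toy hints for the stand-in detector; a real system would use a proper
# language-detection library.
STOPWORD_HINTS = {
    "es": {"hola", "el", "la", "que"},
    "de": {"und", "der", "das", "nicht"},
}


def detect_language(text):
    # Hiragana or Katakana characters are treated as Japanese.
    if any("\u3040" <= ch <= "\u30ff" for ch in text):
        return "ja"
    words = set(text.lower().split())
    for lang, hints in STOPWORD_HINTS.items():
        if words & hints:
            return lang
    return "en"


def route_content(text, detect=detect_language):
    """Return the single field that should receive the file content."""
    lang = detect(text)
    if lang not in SUPPORTED:
        lang = "en"  # assumption: unsupported languages fall back to English
    return {f"file_content_{lang}": text}


print(route_content("hola mundo"))
```

Writing each document into exactly one language field is also why mixed-language documents appear on the to-do list: a single detected language cannot describe them.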

Page 33: Box + Solr = Content Search for Business

33

To Dos

• Scale language support

• Support documents with mixed languages

Page 34: Box + Solr = Content Search for Business

34

Search Warm-up

Page 35: Box + Solr = Content Search for Business

35

• Front end informs backend to warm up on keyboard focus

• Backend prepares the search filter and caches it in a search session

• Backend sends a warm-up query to Solr
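The three warm-up steps above can be sketched as a small backend class. The class, its method names and the use of the match-all query `*:*` as the warm-up query are assumptions for illustration; the slides do not say what the real warm-up query contains.

```python
class SearchBackend:
    def __init__(self, build_filter, send_to_solr):
        self.build_filter = build_filter    # expensive: resolves the user's scope
        self.send_to_solr = send_to_solr    # callable(query, filter_query)
        self.sessions = {}                  # user -> cached search filter

    def on_keyboard_focus(self, user):
        """Called when the front end reports focus on the search box."""
        search_filter = self.build_filter(user)
        self.sessions[user] = search_filter
        # Warm-up query so the filter is already evaluated when typing starts.
        self.send_to_solr("*:*", search_filter)

    def search(self, user, text):
        search_filter = self.sessions.get(user)
        if search_filter is None:           # no warm-up happened
            search_filter = self.build_filter(user)
            self.sessions[user] = search_filter
        return self.send_to_solr(text, search_filter)


calls = {"filter": 0, "solr": []}


def build_filter(user):
    calls["filter"] += 1
    return f"owner_id:{user}"


def send_to_solr(query, filter_query):
    calls["solr"].append((query, filter_query))
    return []


backend = SearchBackend(build_filter, send_to_solr)
backend.on_keyboard_focus("user1")
backend.search("user1", "solr")
print(calls)
```

In this sketch the filter is built once, on focus, and the real query reuses it from the session, so the work of resolving the user's scope happens before the first keystroke.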

Page 36: Box + Solr = Content Search for Business

36

What we are working on

Page 37: Box + Solr = Content Search for Business

37

Things we are working on

• Search suggestions

• Search operators

• Use machine learning to influence ranking

• Logical sharding

Page 38: Box + Solr = Content Search for Business

38

Questions?

Page 39: Box + Solr = Content Search for Business

39

Contact: [email protected]

We are hiring!