Copyright © President & Fellows of Harvard College
Build Your Own World-Class Directory Search From Α to Ω
Ravi Mynampaty
About Ravi
A hustler making a living by pretending to know more about
Enterprise Search than he actually does...
“I can live on a good compliment two weeks with nothing else to eat...”
@RaviMynampaty
Why the heck should I listen to Ravi?
Agenda
Why?
What’s the fuss about?
What features?
What data?
How?
Where are we going?
Search Index Model & Structure
Raw data
Prototype UI
My goal… (how many iterations?)
Icebreaker
Why?
What’s the big deal?
Records vs. Documents
Person record: family_name, first_name, phone, email, ...
Document: Title: ..., Description: ..., Content: ...
Nicknames
Predictable
Elizabeth → “Beth”, “Bess”, “Betty”, “Liz”
Richard → “Rich”, “Dick”
David → “Dave”
Simple Substrings
Srinivas → “Srini”
Mohammad → “Mo”
Somewhat Predictable
Yakub → “Jacob”
Yusuf → “Joseph”
Xian → “Sean”
Unpredictable
Hanuman → “Hank”
Madhav → “Mike”
Babu → “Bob”
Wongsu → “Richard”
Herman → “Dutch”
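One way to make nicknames searchable is a lookup table that expands a formal first name into its known nicknames at index time. The sketch below is only an illustration: the table `NICKNAMES` and the helpers `expand_name` and `reverse_lookup` are hypothetical names, and the entries are a handful copied from the slides above. Unpredictable nicknames such as Wongsu → “Richard” can only be covered by explicit entries like these.

```python
# Hypothetical nickname table; entries are taken from the slides above.
NICKNAMES = {
    "elizabeth": ["beth", "bess", "betty", "liz"],
    "richard": ["rich", "dick"],
    "david": ["dave"],
    "srinivas": ["srini"],
    "mohammad": ["mo"],
    "yakub": ["jacob"],
    "hanuman": ["hank"],
    "wongsu": ["richard"],
}


def expand_name(first_name):
    """Return the lowercased formal name followed by its known nicknames."""
    key = first_name.strip().lower()
    return [key] + NICKNAMES.get(key, [])


def reverse_lookup(nickname):
    """Return every formal name in the table that lists this nickname."""
    nick = nickname.strip().lower()
    return sorted(name for name, nicks in NICKNAMES.items() if nick in nicks)
```

Indexing the output of `expand_name` in a separate nickname field would let a search for “Liz” find Elizabeth without the user knowing the formal name.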
Abbreviations & Acronyms
Department Names
Information Technology
ITG
Info Tech.
HBS IT
Job Titles
CEO
PM
VP
...
Educational Degrees
PhD
JD
...
ALM → “Master of Liberal Arts”
(magistri in artibus liberalibus studiorum prolatorum)
Substrings
Experiment
I’d like you to meet...
Garoppolo
What was that guy’s name?
What did you search for?
One more...
Roethlisberger
What was that guy’s name?
What did you search for?
My prediction...
G…. & R...
Jimmy Garoppolo
Ben Roethlisberger
Exercise!
Wishlist: How should search work?
Search mechanisms
What should be searchable?
How should users be able to search?
Query interface: What should be supported?
Results interface: What should be displayed?
etc.
Wishlist discussion
Features
Search by:
Name, Department, Email, Job title, Phone number
Nicknames, Aliases
Substrings
Scoped search, Sort options
Faceting/filtering options
Spelling suggestions, Autocomplete, Devices, Voice search
Hands-on!
Let’s create some data
Solr XML
<add>
  <doc>
    <field name="id">1813-05-05</field>
    <field name="LastName">Kierkegaard</field>
    <field name="FirstName">Søren</field>
  </doc>
  <doc>
    <field name="id">1966-12-14</field>
    <field name="LastName">Thorning-Schmidt</field>
    <field name="FirstName">Helle</field>
  </doc>
</add>
That was just for practice
Our Dataset: Members of US Congress
Need to create XML for 400+ people records
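Writing 400+ records by hand is not practical, so the XML is best generated. The following is a minimal sketch, assuming the people records are already available as Python dictionaries; the function name `build_solr_add` is hypothetical, and the output follows the `<add>`/`<doc>`/`<field>` shape of the practice example above.

```python
import xml.etree.ElementTree as ET


def build_solr_add(records):
    """Serialize a list of dicts as a Solr XML <add> document (a string)."""
    add = ET.Element("add")
    for record in records:
        doc = ET.SubElement(add, "doc")
        for name, value in record.items():
            # One <field name="..."> element per dictionary entry.
            field = ET.SubElement(doc, "field", {"name": name})
            field.text = value
    # ElementTree escapes characters such as "&" and "<" in field values.
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    people = [
        {"id": "1813-05-05", "LastName": "Kierkegaard", "FirstName": "Søren"},
        {"id": "1966-12-14", "LastName": "Thorning-Schmidt", "FirstName": "Helle"},
    ]
    print(build_solr_add(people))
```

Using an XML library rather than string concatenation keeps the file well-formed even when a name or committee contains a special character.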
Install JDK 1.8
http://tinyurl.com/ie17java
(set JAVA_HOME env variable)
java -version
javac -version
*nix: echo $JAVA_HOME
Windows: echo %JAVA_HOME%
Install Fusion
https://lucidworks.com/
Download + Unzip
Run it!
Open cmd prompt
cd ...\fusion-3.0.0\fusion\3.0.0\bin
Run it: “fusion.cmd start”
C:\..\Desktop\fusion-3.0.0\fusion\3.0.0\bin>fusion.cmd start
Starting zookeeper..
Successfully started zookeeper on port 9983 (process ID 144
Starting solr..............
Successfully started solr on port 8983 (process ID 19564)
Starting api............................
Successfully started api on port 8765 (process ID 12568)
Starting connectors..........................
Successfully started connectors on port 8984 (process ID 18
Starting ui.............
Successfully started ui on port 8764 (process ID 14096)
Admin UI: http://localhost:8764/
1. Create password
Follow along with me:
1. Quickstart
2. Create a new collection (call it “Test1”)
3. Select a dataset: “Revolution Session Data”
4. Try some searches
5. Add faceted search
Break
1st Matrix
1st Matrix: matrix1.xml
<doc>
  <field name="PersonId">Gabbard, Tulsi</field>
  <field name="LastName">Gabbard</field>
  <field name="FirstName">Tulsi</field>
  <field name="State">Hawaii</field>
  <field name="District">2nd District</field>
  <field name="Room">1433 LHOB</field>
  <field name="Phone">202-225-4906</field>
  <field name="Party">Democratic</field>
  <field name="Committee">Armed Services</field>
  <field name="Email">[email protected]</field>
</doc>
Create Solr collection for US Congress
http://localhost:8764/
Devops → New → Collection Name “house” → Save Collection
Configure Fields
Create Datasource → Add → Filesystem → SolrXML → Datasource ID
Path: set path to XML file on disk: C:\cygwin64\home\rmynampaty\house\matrix1.xml
Start Crawl → (Wait for finish) → Job History → (Observe success/fail)
Let’s search!
Query Workbench
Format Results → Documents
- One Primary Field (which one do you think?)
- One Secondary
- One Other
_s vs. _t
String (_s):
Preserves the value entirely: no tokenizing, case preserved
Text (_t):
Tokenized, stopwords removed, lowercased
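The difference between the two field types can be pictured with a toy analyzer. This is only a sketch of the idea, not how Solr is implemented: the function names and the tiny `STOPWORDS` set are assumptions made for illustration, and real text analysis also deals with punctuation and other details.

```python
# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "of", "the"}


def analyze_string(value):
    """String-style field: the whole value is kept as one unchanged term."""
    return [value]


def analyze_text(value):
    """Text-style field: split on whitespace, lowercase, drop stopwords."""
    tokens = value.lower().split()
    return [token for token in tokens if token not in STOPWORDS]
```

With these rules, “Armed Services” stays a single exact term in a string field but becomes the two terms “armed” and “services” in a text field, which is why string fields suit facets and exact matches while text fields suit free-text search.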
Query Workbench
Try some searches: are they generally working?
Sort: what sort makes sense for people search?
Scoped Search
<field_name>:<search_string>
e.g.,
State_s:Hawaii
Booleans
AND, "+", OR, NOT and "-"
e.g.,
LastName_s:Smith AND Party_s:Republican
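Scoped and Boolean queries like the ones above are plain strings, so they can be assembled in code. The helpers below are a hypothetical sketch: `scoped` and `all_of` are made-up names, `Committee_s` in the usage note is an assumed field name, and escaping of special query characters is left out.

```python
def scoped(field, value):
    """Build a <field_name>:<search_string> clause, quoting multi-word values."""
    if " " in value:
        value = '"' + value + '"'
    return f"{field}:{value}"


def all_of(*clauses):
    """Join clauses so that every one of them must match."""
    return " AND ".join(clauses)


if __name__ == "__main__":
    # Prints: LastName_s:Smith AND Party_s:Republican
    print(all_of(scoped("LastName_s", "Smith"), scoped("Party_s", "Republican")))
```

A multi-word value such as Armed Services would come out as `Committee_s:"Armed Services"`, so the phrase is matched as a whole.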
Fuzzy Matching
e.g.,
Castor~
Castor~0.8
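Fuzzy matching accepts terms that are within a small number of character edits of the query term. The sketch below uses plain Levenshtein edit distance to show the idea; it is an illustration only, the search engine's own implementation differs in its details, and the sketch does not model the numeric parameter in `Castor~0.8`. The names `edit_distance` and `fuzzy_matches` and the `max_edits` default are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(query, names, max_edits=2):
    """Return the names within max_edits of the query, ignoring case."""
    q = query.lower()
    return [name for name in names if edit_distance(q, name.lower()) <= max_edits]
```

Under this measure a query for “Castor” would also accept “Castro”, which is two substitutions away, while a name like “Gabbard” is too far to match.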
Synonyms
No one uses scoped / Boolean :-(
Trick them!
Facets
What facets make sense for people search? Add some.
2nd Matrix
2nd Matrix: matrix2.xml
<doc>
  <field name="PersonId">Gabbard, Tulsi</field>
  <field name="LastInitial">G</field>
  <field name="LastName">Gabbard</field>
  <field name="FirstName">Tulsi</field>
  <field name="Nickname">POTUS2024</field>
  <field name="State">Hawaii</field>
  <field name="District">2nd District</field>
  <field name="Room">1433 LHOB</field>
  <field name="Phone">202-225-4906</field>
  <field name="Party">Democratic</field>
  <field name="Committee">Armed Services</field>
  <field name="Email">[email protected]</field>
</doc>
Recrawl Solr collection for US Congress
http://localhost:8764/
Devops → Collection Name → Datasource → Clear Datasource
Path: set path to XML file on disk: C:\cygwin64\home\rmynampaty\house\matrix2.xml
Start Crawl → (Wait for finish) → Job History → (Observe success/fail)
Search using Query Workbench
Substrings
Did they work?
3rd Matrix
3rd Matrix: matrix3.xml
Gabbard
- Gabbard
- gabbar
- gabba
- gabb
- gab
- ga
- g
Substrings via N-grams
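The prefix list shown for Gabbard can be produced mechanically. The sketch below generates those prefixes (often called edge n-grams) and, for comparison, fixed-length n-grams taken from anywhere in the term. The function names and the `min_len` parameter are assumptions for illustration, and everything is lowercased for simplicity.

```python
def edge_ngrams(term, min_len=1):
    """Return the prefixes of term, longest first, down to min_len characters."""
    term = term.lower()
    return [term[:n] for n in range(len(term), min_len - 1, -1)]


def ngrams(term, n):
    """Return every substring of exactly n characters, left to right."""
    term = term.lower()
    return [term[i:i + n] for i in range(len(term) - n + 1)]


if __name__ == "__main__":
    # Prints: ['gabbard', 'gabbar', 'gabba', 'gabb', 'gab', 'ga', 'g']
    print(edge_ngrams("Gabbard"))
```

Indexing the prefixes lets a partial query such as “gabb” match while the user is still typing; full n-grams additionally allow matches in the middle of a name, at the cost of a larger index.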
Recrawl Solr collection for US Congress
http://localhost:8764/
Devops → Collection Name → Datasource → Clear Datasource
Path: set path to XML file on disk: C:\cygwin64\home\rmynampaty\house\matrix3.xml
Start Crawl → (Wait for finish) → Job History → (Observe success/fail)
Search using Query Workbench
Where are we?
Search Index Model & Structure
Raw data
Prototype UI
Next steps for you
Search Index Model & Structure
End-user UI
Raw data
Prototype UI
Thank you! Questions?
[email protected]
@RaviMynampaty
linkedin.com/in/mynampaty
facebook.com/ravi.mynampaty