Searching and Indexing Your Documents and Source with Full Text Search

"Google" for your docs

1 / 27

About me

Professional.

15 years in the software industry.
Currently working for Goodville Mutual (regional insurance carrier).
10+ years of Java enterprise.
RPGLE for 3 years.
Software architect.

Open Source.

RPGParser -- ANTLR parser for the RPG language.
CFLint -- linter/code analyzer for ColdFusion.

Personal.

Live in rural Vermont.
Married with 4 children.
Enjoy the outdoors and working with my hands.

2 / 27

Documentation

I appreciate good documentation.

I'm not going to spend hours sifting through it for answers.

I'm not going to invest much time finding it.

3 / 27

Overview

1. Introduction
2. Full text search
3. Elasticsearch
4. Index, Type, Mapping, and Documents
5. Search
6. Highlighting
7. Installing Elasticsearch
8. Testing it with the Sense plugin
9. Calaca UI
10. Mapper plugin
11. Demo of searching documentation
12. Demo of searching source code
13. Wrap up

4 / 27

Full text search

"Full-text queries perform linguistic searches against text data in full-text indexes by operating on words and phrases based on rules of a particular language"

Relevance

When a query is made, each document is assigned a score. The documents with the highest score are returned first.

It's not SQL

SELECT * FROM session WHERE presenter = 'Ryan' AND organization = 'NEUGC'

How 'well' does it match?
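To make the idea of a relevance score concrete, here is a toy sketch in Python. It only counts how often the query terms occur in each title, which is an assumption made for illustration and not how Lucene or Elasticsearch actually score documents. The function names `score` and `search` are hypothetical; the titles are taken from the session examples later in this deck.

```python
def score(query: str, text: str) -> int:
    """Count how often each query term occurs in the text (case-insensitive)."""
    words = text.lower().split()
    return sum(words.count(term) for term in query.lower().split())


def search(query: str, documents: dict) -> list:
    """Return the ids of matching documents, highest score first."""
    scored = [(score(query, text), doc_id) for doc_id, text in documents.items()]
    hits = [pair for pair in scored if pair[0] > 0]  # drop non-matching documents
    hits.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep input order
    return [doc_id for _, doc_id in hits]


# Illustrative data only: id -> title
titles = {
    1: "ElasticSearch 101",
    3: "Meeting User's Needs with Free Software on your IBM i",
    5: "Apache Web Server Magic on IBM i",
}

print(search("apache ibm", titles))  # prints [5, 3]
```

Unlike the SQL query, nothing here is a yes/no match: document 5 matches both terms and is ranked ahead of document 3, which matches only one.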

5 / 27

Natural language matching.

We want to capture the intent of the query, not just data that matches the literal values given exactly.

Inflections.

1. Number - programmers, programmer.

2. Verb Tense - programmed, programming.

3. Case - my, me, I

Stemming.

Remove the differences between inflected forms by indexing the root form of the word.

Lemmatization.

A lemma is the canonical, or dictionary, form of a set of related words: the lemma of paying, paid, and pays is pay. Usually the lemma resembles the words it is related to, but sometimes it doesn't: the lemma of is, was, am, and being is be.
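The difference between the two approaches can be sketched in a few lines of Python. This is a deliberately naive illustration: the suffix list and the lemma table are assumptions made up for this example, and real analyzers (such as the stemmers shipped with Lucene) are far more careful.

```python
# Hypothetical suffix list, checked in this order (longest first).
SUFFIXES = ("ers", "ing", "ed", "er", "s")

# Hypothetical lemma dictionary built from the words mentioned above.
LEMMAS = {
    "paying": "pay", "paid": "pay", "pays": "pay",
    "is": "be", "was": "be", "am": "be", "being": "be",
}


def stem(word: str) -> str:
    """Strip one known suffix, keeping at least three characters of the word."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word: str) -> str:
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    word = word.lower()
    return LEMMAS.get(word, word)


for w in ("programmers", "programmer", "programmed", "programming"):
    print(w, "->", stem(w))  # each line ends with "programm"

print(stem("paid"), lemmatize("paid"))  # prints: paid pay
```

The stem does not have to be a real word; it only has to be the same for every inflected form. Irregular forms such as "paid" are where rule-based stemming fails and a dictionary lookup is needed.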

6 / 27

Fuzzy Search.

Words may be "fuzzy" matches. Often these might be typos, misspellings or alternate spellings.

How close are:

color and colour?
iSeries and series?
Bank and blank?

Levenshtein distance.

A measure of how 'close' two words are to each other: the minimum number of single-character insertions, deletions, or substitutions needed to turn one word into the other.

"80% of human misspellings have an edit distance of 1."

Elasticsearch supports a fuzziness setting of 'auto', which allows no edits on really short words and multiple edits on long ones.
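The word pairs above can be measured with a short Python sketch of the classic dynamic-programming Levenshtein algorithm. Lowercasing the words before comparing them is an assumption made here for illustration; without it, "Bank" and "blank" would also differ in the case of the first letter.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter word in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


for left, right in (("color", "colour"), ("iSeries", "series"), ("Bank", "blank")):
    print(left, right, levenshtein(left.lower(), right.lower()))  # each distance is 1
```

All three pairs are a single edit apart, which is why a fuzzy search that tolerates one edit would treat them as matches.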

7 / 27

Full Text Implementations.

1. Elasticsearch. Built on Apache Lucene.
2. Solr. Built on Apache Lucene.
3. 'Traditional' databases.
   1. DB2 Text Extender
   2. PostgreSQL
   3. MongoDB
   4. MS SQL Server

8 / 27

Elasticsearch

Full-Text Search

"Elasticsearch builds distributed capabilities on top of Apache Lucene to provide the most powerful full-text search capabilities available in any open source product. Powerful, developer-friendly query API supports multilingual search, geolocation, contextual did-you-mean suggestions, autocomplete, and result snippets."

Near real time.

Queries should be answered in under 1 second, ideally in less than 100 milliseconds.

High Availability

Clustering

9 / 27

Index

"An index is a collection of documents that have somewhat similar characteristics."

An index is the equivalent of a 'database'. It contains multiple document types.

Example:

PUT http://localhost:9200/neugc/

10 / 27

Mapping Type

"Within an index, you can define one or more mapping types. A type is a logical category/partition of your index"

1. A mapping type is defined for a group of documents that share common fields.

2. A mapping type is like a table in a relational database.

3. A mapping type is defined by a mapping that tells ES how to parse and store it.

11 / 27

Mapping

1. Defines how documents should be indexed/searched.

2. Mappings are defined per index and type.

The entire document is stored by default in the _source field.

Example:

PUT http://localhost:9200/neugc/_mapping/session
{
  "properties" : {
    "title" : { "type" : "string", "store" : true },
    "presentation" : { "type" : "string", "store" : true },
    "room" : { "type" : "string", "store" : true }
  }
}

12 / 27

Document

1. A basic unit of information that can be indexed.

Example: customer, order, product.

2. A JSON document which is stored in Elasticsearch.

3. Equivalent to a row in a table in a relational database.

4. Stored in an index and has a type and an id.

Example:

PUT http://localhost:9200/neugc/session/1
{
  "title": "ElasticSearch 101",
  "presenter": "Ryan Eberly",
  "room": "Middlesex East H",
  "sessionNumber": 24
}

13 / 27

Document (cont)

PUT when you know the id.

PUT http://localhost:9200/neugc/session/3
{
  "title": "Meeting User's Needs with Free Software on your IBM i",
  "presenter": "Jon Paris",
  "room": "Grand South A",
  "sessionNumber": 24
}

PUT http://localhost:9200/neugc/session/4
{
  "title": "Advanced SQL Stored Procedures and Functions",
  "presenter": "Rob Bestgen",
  "track": "Commons II C",
  "sessionNumber": 24
}

PUT http://localhost:9200/neugc/session/5
{
  "title": "Apache Web Server Magic on IBM i",
  "presenter": "Alan Seiden",
  "track": "Commons I D",
  "sessionNumber": 24
}

14 / 27

Document - POST

POST when you want a generated ID.

Request

POST http://localhost:9200/neugc/session
{
  "title": "Open Source RPG Tools",
  "presenter": "Jon Paris",
  "room": "Grand South A",
  "sessionNumber": 11
}

Response

{
  "_index": "neugc",
  "_type": "session",
  "_id": "AU_--pxTQ4ynt6aGjXqa",
  "_version": 1,
  "created": true
}
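The PUT-versus-POST rule can be scripted. The following Python sketch uses only the standard library and assumes the same local server URL as the examples above; the helper name `index_request` is hypothetical. It only builds the request, so it runs without a server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: the local server used in the examples


def index_request(index, doc_type, document, doc_id=None):
    """Build a PUT request when the id is known, a POST request otherwise."""
    url = f"{BASE_URL}/{index}/{doc_type}"
    method = "POST"  # let the server generate the id
    if doc_id is not None:
        url = f"{url}/{doc_id}"
        method = "PUT"  # we choose the id ourselves
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method=method,
        headers={"Content-Type": "application/json"},
    )


session = {
    "title": "ElasticSearch 101",
    "presenter": "Ryan Eberly",
    "room": "Middlesex East H",
    "sessionNumber": 24,
}
request = index_request("neugc", "session", session, doc_id=1)
print(request.get_method(), request.full_url)
```

To actually send it, pass the request to `urllib.request.urlopen(request)` while the server is running.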

15 / 27

Search

Search all fields:

GET http://localhost:9200/neugc/_search?q=Seiden

GET http://localhost:9200/neugc/_search
{
  "query": {
    "query_string" : {
      "query" : "IBM"
    }
  }
}

Search specific field:

GET http://localhost:9200/neugc/_search?q=presenter:Ryan
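When these URLs are built from user input, the query text should be URL-encoded. A minimal Python sketch, assuming the same local server and a hypothetical helper named `search_url`:

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:9200"  # assumption: the local server used in the examples


def search_url(index, text, field=None):
    """Build a URI search; prefix the text with 'field:' to search one field only."""
    query = f"{field}:{text}" if field else text
    return f"{BASE_URL}/{index}/_search?{urlencode({'q': query})}"


print(search_url("neugc", "Seiden"))
print(search_url("neugc", "Ryan", field="presenter"))
```

Note that `urlencode` escapes the colon as `%3A`; the server decodes it back to `presenter:Ryan`.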

16 / 27

Search - options

Paging

GET http://localhost:9200/neugc/_search?q=IBM&from=0&size=1

Sort

Defaults to _score desc

GET http://localhost:9200/neugc/_search
{
  "query": {
    "query_string" : {
      "query" : "IBM"
    }
  },
  "sort": {
    "sessionNumber": { "order": "desc" }
  }
}

What happens if we sort by presenter?
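Paging and sorting can also be combined in the request body. The Python sketch below builds such a body as a plain dictionary; the helper name `search_body` and its parameters are hypothetical, and the page number is translated into the `from` offset.

```python
import json


def search_body(text, page=0, page_size=10, sort_field=None, descending=False):
    """Build a query_string search body with paging and an optional sort."""
    body = {
        "query": {"query_string": {"query": text}},
        "from": page * page_size,  # offset of the first hit on this page
        "size": page_size,
    }
    if sort_field is not None:
        order = "desc" if descending else "asc"
        body["sort"] = {sort_field: {"order": order}}
    return body


body = search_body("IBM", page=2, page_size=20, sort_field="sessionNumber", descending=True)
print(json.dumps(body, indent=2))
```

Leaving `sort_field` unset omits the `sort` key, so the server falls back to ordering by relevance.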

17 / 27

Search - highlighting

Query:

GET http://localhost:9200/neugc/_search
{
  "query": {
    "query_string" : {
      "query" : "IBM",
      "fields" : ["presenter", "title", "room"]
    }
  },
  "highlight": {
    "fields" : {
      "presenter" : {},
      "title" : {},
      "room" : {}
    }
  }
}

Result:

...
"highlight": {
  "title": [
    "Meeting User's Needs with Free Software on your <em>IBM</em> i"
  ]
}
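A client usually wants just the highlighted fragments for each hit. The Python sketch below pulls them out of an already-parsed response; the helper name `highlights` is hypothetical, and `sample_response` is a trimmed, made-up response that assumes the usual `hits.hits` nesting of a search result.

```python
def highlights(response: dict) -> dict:
    """Map each hit's _id to its highlighted fragments, keyed by field name."""
    result = {}
    for hit in response.get("hits", {}).get("hits", []):
        if "highlight" in hit:  # hits without a highlighted field are skipped
            result[hit["_id"]] = hit["highlight"]
    return result


# Hypothetical, trimmed response for illustration only.
sample_response = {
    "hits": {
        "hits": [
            {
                "_id": "3",
                "highlight": {
                    "title": ["Meeting User's Needs with Free Software on your <em>IBM</em> i"]
                },
            },
            {"_id": "1"},
        ]
    }
}

for doc_id, fields in highlights(sample_response).items():
    for field, fragments in fields.items():
        print(doc_id, field, fragments[0])
```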

18 / 27

Installation

Java 8 or higher

check/set JAVA_HOME environment variable

set JAVA_HOME=c:\Progra~1\Java\jdk1.8.0_40

Download/Unzip ElasticSearch

Execute bin/elasticsearch.bat

Install plugins

bin\plugin install mapper-attachments

Edit elasticsearch/config/elasticsearch.yml

To enable cross-domain AJAX requests, add this to the bottom of the config file:

http.cors.enabled : true
http.cors.allow-origin : "*"
http.cors.allow-methods : OPTIONS, HEAD, GET, POST, PUT, DELETE
http.cors.allow-headers : X-Requested-With,X-Auth-Token,Content-Type, Content-Length

19 / 27

Calaca Plugin

Calaca is a simple UI for Elasticsearch.

This works with pre-2.0 Elasticsearch only:

bin/plugin -i romansanchez/calaca

Then go to http://localhost:9200/_plugin/calaca/

With 2.0 and later you can use Calaca directly from the filesystem.

file:////code/calaca_neugc/_site/index.html

21 / 27

Configure Calaca

js/config.js

Add the settings for your server and index name.

var CALACA_CONFIGS = {
    url: "localhost:9200",
    index_name: "neugc",
    type: "",
    size: 20,
    ...
}

index.html

Add the fields you want to view in the results.

<h2>{{result.presenter}}</h2>
<h2>{{result.title}}</h2>

Note: A customized copy of Calaca which includes the highlighting behavior is included with the session materials.

22 / 27

Mapper Plugin

Mapper attachment

Adds support for indexing attachments such as PDF, MS Office, HTML, etc.

Uses Apache Tika to analyze the content

Example

PUT /neugc/session/_mapping
{
  "sessionInfo" : {
    "properties" : {
      "presentation" : { "type" : "attachment" }
    }
  }
}

23 / 27

Mapper attachment (cont)

The content to be loaded must be base64 encoded.

PUT http://localhost:9200/neugc/session/1
{
  "title": "ElasticSearch 101",
  "presenter": "Ryan Eberly",
  "sessionNumber": 24,
  "presentation": "... base64 encoded attachment ..."
}
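The base64 step is easy to script. The Python sketch below encodes raw bytes and places them in the `presentation` field of a document shaped like the one above; the helper name `attachment_document` is hypothetical, and the short byte string stands in for the contents of a real PDF or Office file.

```python
import base64
import json


def attachment_document(title, presenter, session_number, content: bytes) -> dict:
    """Build a session document whose 'presentation' field is base64-encoded content."""
    encoded = base64.b64encode(content).decode("ascii")  # JSON needs text, not bytes
    return {
        "title": title,
        "presenter": presenter,
        "sessionNumber": session_number,
        "presentation": encoded,
    }


# Stand-in bytes; for a real file, read them with pathlib.Path(path).read_bytes().
doc = attachment_document("ElasticSearch 101", "Ryan Eberly", 24, b"hello")
print(json.dumps(doc, indent=2))
```

The resulting dictionary can be serialized with `json.dumps` and sent as the body of the PUT request.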

24 / 27

Demo: Searching your documentation

25 / 27

Demo: Searching your source code

26 / 27

Questions?

27 / 27