35
Lexical and Statistical Semantics in Professional Search Allan Hanbury

Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

Lexical and Statistical Semantics in

Professional Search

Allan Hanbury

Page 2: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research
Page 3: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research
Page 4: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research
Page 5: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

1. A method for scrolling through portions of a data set, said method comprising: receiving a number of units associated with a rotational user input; determining an acceleration factor pertaining to the rotational user input; modifying the number of units by the acceleration factor; determining a next portion of the data set based on the modified number of units; and presenting the next portion of the data set.

2. A method as recited in claim 1, wherein the data set pertains to a list of items, and the portions of the data set include one or more of the items.

3. A method as recited in claim 1, wherein the data set pertains to a media file, and the portions of the data set pertain to one or more sections of the media file.

4. A method as recited in claim 3, wherein the media file is an audio file.

5. A method as recited in claim 1, wherein the rotational user input is provided via a rotational input device.

6. …

Claims

Page 6: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research
Page 7: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

photo: wocintechchat.com

Page 8: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research
Page 9: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

Search Systems

Context Technology

• People• Tasks• Documents• …

• Natural Language Processing

• Data-drivenapproaches

• …

Page 10: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

Natural Language Processing

This is particularly preferable in the case of liquid injection molding (LIM).

This/DT is/VBZ particularly/RB preferable/JJin/IN the/DT case/NN of/IN liquid/JJinjection/NN molding/NN (LIM/NN).

Page 11: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

Population Intervention Comparison

PIC Detection

Page 12: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

Population Intervention Comparison

PIC Detection

Page 13: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research
Page 14: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

Less Effective More Effective

Effectiveness of Asthma Therapies

Page 15: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

2009

Page 16: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

Data-Driven

P (Word1 | Word2)

Page 17: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research
Page 18: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

“Every time I fire a linguist, the performance of the speech recognizer goes up.”

Frederick Jelinek, IBM

Around 1985

Page 19: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research
Page 20: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

Bag of Words

CC image courtesy of Flickr/surrealmuse

Page 21: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

“In most cases, the meaning of a word is its use.”

Ludwig Wittgenstein, Philosophical Investigations (1953)

Page 22: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

A bottle of tesgüino is on the table.

Everybody likes tesgüino.

Tesgüino makes you drunk.

We make tesgüino out of corn.

Example from J. R. Firth

Page 23: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

Statistical semantics extracts lexical information from the statistics of large amounts of text

Page 24: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

Word Embedding

http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

Page 25: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

Input Layer

Output Layer

Hidden Layer

~104−105

inputs

~102 nodes

#outputs= # inputs

Page 26: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research
Page 27: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

What is more similar to a typewriter:(a) a bird, or (b) a snowball?

Page 28: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

Top k

books 0.82

foreword 0.77

author 0.74

published 0.73

preface 0.69

republished 0.68

reprinted 0.68

afterword 0.67

memoir 0.67

book

corpulent 0.44

hideous 0.43

unintelligent 0.42

wizened 0.42

catoblepas 0.42

creature 0.42

humanoid 0.41

grotesquely 0.41

tomtar 0.41

dwarfish

Similarity Threshold

Threshold

Navid Rekabsaz, Mihai Lupu, Allan Hanbury, Guido Zuccon, Exploration of a Threshold for

Similarity based on Uncertainty in Word Embedding. In Proceedings of the European

Conference on Information Retrieval Research (ECIR 2017)

Navid Rekabsaz, Mihai Lupu, Allan Hanbury, Uncertainty in Neural Network Word

Embedding: Exploration of Threshold for Similarity. Proc. Neu-IR Workshop of the ACM

Conference on Research and Development in Information Retrieval (NeuIR-SIGIR 2016)

Similarity Threshold

Page 29: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

Word Embedding in Search

knowledge

knowledge

knowledge

wisdom

understanding

knowledge

knowledge

0.7 × knowledge

0.8 × knowledge

Page 30: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

Navid Rekabsaz, Mihai Lupu, Allan Hanbury, Guido Zuccon, Generalizing Translation Models in the Probabilistic Relevance Framework,Proceedings of ACM International Conference on Information and Knowledge Management (CIKM 2016)

MA

P Im

pro

vem

ent

Page 31: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

CLEF-IP patent collectionResults for Patent Search

Page 32: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

Multi-Word Terms in Patents

coating method

memory data processorcomplex programmable logic devicerear cross frame memberdocument editing device

Document Length

Sentence Complexity

New Terminology

Page 33: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

Search Systems

Context Technology

Evaluation

Page 34: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

Acknowledgements

Linda Andersson, Alexandros Bampoulidis, Tobias Fink, Aldo Lipani, Mihai Lupu, João Palotti, Florina Piroi, Navid Rekabsaz, Abdel Aziz Taha, Markus Zlabinger

Page 35: Lexical and Statistical Semantics in Professional Search · Similarity based on Uncertainty in Word Embedding. In Proceedings of the European Conference on Information Retrieval Research

Univ.-Prof. Dr. Allan Hanbury

Institute for Information Systems Engineering

TU Wien

Favoritenstraße 9-11/194-04

1040 Vienna

Austria

Telephone: +43 1 58801 188310

Mobile: +43 676 978 0991

e-Mail: [email protected]