Issues in Text Similarity and Categorization
Jordan Smith – MUMT 611 – 27 March 2008

Outline
- Why text?
- Text categorization:
  - Some sample problems
  - Comparison to MIR
  - Document indexing
- Detailed example

Why text?
- 28.9% of MIR queries refer to lyric fragments (Bainbridge et al. 2003)
- Easy to collect! (Knees et al. 2005, Geleijnse & Korst 2006)
- Accurate ground truth (Logan et al. 2004)
- Information about mood, “content”

Why text?
Potential applications:
- Genre, mood categorization (Maxwell 2007)
- Similarity searches (Mahadero et al. 2005)
- Hit-song prediction (Dhanaraj & Logan 2005)
- Musical document retrieval (Google)
- Accompany query-by-humming (Suzuki et al. 2007, Fujihara et al. 2006)

Some text categorization problems
- Indexing
- Document organization
- Filtering
- Web content hierarchy
- Language identification
- etc.

What is text categorization?
“Text categorization may be defined as the task of assigning a Boolean value to each pair $\langle d_j, c_i \rangle \in D \times C$, where $D$ is a domain of documents and $C = \{c_1, \ldots, c_{|C|}\}$ is a set of pre-defined categories.”
(Sebastiani 2002)
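
To make the definition concrete, here is a minimal sketch that assigns one Boolean to every (document, category) pair. The toy documents, the category keyword sets, and the `assign` rule are all hypothetical and chosen only for illustration. A real system would learn the decision rule from training data.

```python
from itertools import product

# Hypothetical toy data: a domain D of two documents and a set C of two categories.
documents = {"d1": "love me tender love me true", "d2": "highway to hell"}
categories = {"ballad": {"love", "tender"}, "rock": {"highway", "hell"}}

def assign(doc_text, keywords):
    """Toy decision rule: True if the document shares a word with the keyword set."""
    return bool(set(doc_text.split()) & keywords)

# One Boolean value per pair <d_j, c_i> in D x C.
decisions = {
    (d, c): assign(documents[d], categories[c])
    for d, c in product(documents, categories)
}
```

The dictionary has |D| x |C| entries, so a document may end up in several categories or in none.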

Text vs. music
Music classification:
- extract features
- train classifiers
- evaluate classifier
Text categorization:
- extract features
- train classifiers
- evaluate classifier
Same
Not the same
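
The three stages listed for both tasks can be sketched as follows. This is an illustration only. It assumes that feature extraction is the stage that differs between text and music, and it uses a deliberately simple word-overlap classifier as a stand-in for a real one. All function names and the toy data are hypothetical.

```python
from collections import Counter

def extract_features(text):
    """Text-specific stage: word counts (a music system would use audio features)."""
    return Counter(text.lower().split())

def train(examples):
    """Accumulate feature counts per label (a stand-in for a real classifier)."""
    model = {}
    for text, label in examples:
        model.setdefault(label, Counter()).update(extract_features(text))
    return model

def classify(model, text):
    """Pick the label whose training words overlap most with the document."""
    feats = extract_features(text)
    return max(
        model,
        key=lambda label: sum(min(n, model[label][w]) for w, n in feats.items()),
    )

def evaluate(model, examples):
    """Fraction of held-out examples that are labelled correctly."""
    return sum(classify(model, t) == y for t, y in examples) / len(examples)

train_set = [("love me tender", "ballad"), ("highway to hell", "rock")]
test_set = [("tender love", "ballad"), ("hell bells", "rock")]
model = train(train_set)
accuracy = evaluate(model, test_set)
```

Only `extract_features` would need to change to reuse `train` and `evaluate` on another kind of data.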

Text feature extraction
Convert each document $d_j$ into a vector
$d_j = \langle w_{1j}, w_{2j}, \ldots, w_{|T|j} \rangle$
where $T$ is the set of terms $\{t_1, t_2, \ldots, t_{|T|}\}$.
Different indexing systems:
- Definition of set of terms
- Computation of weights

Indexing techniques
“Set of words” indexing
- Terms: every word that occurs in the corpus
- Weights: binary
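
A minimal sketch of set-of-words indexing, assuming whitespace tokenization and lowercasing (neither is specified on the slide). The helper names and the two-line corpus are hypothetical.

```python
def build_terms(corpus):
    """Return the sorted list of every word that occurs in the corpus."""
    return sorted({word for doc in corpus for word in doc.lower().split()})

def binary_vector(doc, terms):
    """w_kj = 1 if term t_k occurs in the document, else 0."""
    words = set(doc.lower().split())
    return [1 if t in words else 0 for t in terms]

corpus = ["love me do", "love is all you need"]
terms = build_terms(corpus)
vectors = [binary_vector(doc, terms) for doc in corpus]
```

Each vector has one entry per term in the corpus, so repeated words within a document do not change the representation.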

Indexing techniques
“Bag of words” indexing
- Terms: every word that occurs in the corpus
- Weights: tf-idf (term frequency / inverse document frequency):

  $\text{tf-idf}(t_k, d_j) = \#(t_k, d_j) \cdot \log \frac{|Tr|}{\#_{Tr}(t_k)}$

  where $\#(t_k, d_j)$ is the frequency of term $t_k$ in document $d_j$, and $\#_{Tr}(t_k)$ is the number of documents that $t_k$ occurs in.
- Normalization:
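
The normalization formula did not survive in the slide text. The sketch below assumes cosine normalization, which is the form given in Sebastiani (2002): $w_{kj} = \text{tf-idf}(t_k, d_j) / \sqrt{\sum_{s=1}^{|T|} \text{tf-idf}(t_s, d_j)^2}$. It also assumes that $|Tr|$ is the number of documents in the corpus and that tokens are split on whitespace. The function name and the sample corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(corpus):
    """Return (terms, vectors) with cosine-normalized tf-idf weights."""
    docs = [doc.lower().split() for doc in corpus]
    terms = sorted({w for d in docs for w in d})
    n_docs = len(docs)  # assumed to play the role of |Tr|
    # #Tr(t_k): number of documents in which t_k occurs
    doc_freq = Counter(w for d in docs for w in set(d))
    vectors = []
    for d in docs:
        counts = Counter(d)  # #(t_k, d_j)
        raw = [counts[t] * math.log(n_docs / doc_freq[t]) for t in terms]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors.append([x / norm if norm else 0.0 for x in raw])
    return terms, vectors

terms, vectors = tfidf_vectors(["love love me", "love you", "hello goodbye"])
```

A term that appears in every document gets weight zero, because $\log(|Tr| / \#_{Tr}(t_k)) = \log 1 = 0$. Here "me" outweighs "love" in the first document even though "love" occurs twice, since "love" is spread across more documents.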

Indexing techniques
Phrase indexing
- Terms: all word sequences that occur in the corpus
- Weights: binary, tf-idf
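
As a rough sketch of phrase indexing, the term set can be built from contiguous word sequences. The slide says "all word sequences"; the `max_len` cap used here is an assumption added to keep the term set small, and the function name and corpus are hypothetical.

```python
def phrases(doc, max_len=2):
    """Return every contiguous word sequence of length 1..max_len in the document."""
    words = doc.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    }

corpus = ["all you need is love", "love is all"]
terms = sorted(set().union(*(phrases(doc) for doc in corpus)))
```

Unlike single-word indexing, word order now matters: "is love" and "love is" are distinct terms. Binary or tf-idf weights can then be computed over these terms in the same way as for single words.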

Indexing techniques
“The Darmstadt Indexing Approach”
- Terms: properties of the words, documents, categories
- Weights: various

Feature reduction techniques
- Remove function words (the, for, in, etc.)
- Remove words that are least frequent:
  - in each document
  - in the corpus
- Remainder: low and mid-range frequency words
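
A minimal sketch of the two removal steps, shown at the corpus level only. The function-word list is a short illustrative sample rather than a complete stop list, and the `min_doc_freq` threshold is an assumed parameter that does not come from the slides.

```python
from collections import Counter

# Illustrative sample only; a real stop list is much longer.
FUNCTION_WORDS = {"the", "for", "in", "a", "of", "and", "to"}

def reduce_terms(corpus, min_doc_freq=2):
    """Drop function words and words that occur in fewer than min_doc_freq documents."""
    docs = [set(doc.lower().split()) for doc in corpus]
    doc_freq = Counter(w for d in docs for w in d)
    return sorted(
        w for w, df in doc_freq.items()
        if w not in FUNCTION_WORDS and df >= min_doc_freq
    )

corpus = ["the love in the air", "love for the road", "the road to nowhere"]
kept = reduce_terms(corpus)
```

The same thresholding idea could be applied within each document by counting term occurrences instead of document frequency.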

Feature reduction techniques
[Figure: Sebastiani 2002]

Feature reduction techniques
Latent Semantic Analysis (LSA):
- Search: Demographic shifts in the U.S. with economic impact.
- Result: The nation grew to 249.6 million people in the 1980s as more Americans left the industrial and agricultural heartlands for the South and West.
(Sebastiani 2002)
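
The query and the retrieved text above share almost no words, which is the effect LSA is meant to produce. In its standard formulation, LSA takes a truncated singular value decomposition of the term-document matrix and compares queries and documents in the reduced space. The equations below are a sketch of that standard formulation, not a formula taken from the slides. The symbols $X$, $k$, $\hat{q}$, and $\hat{d}_j$ are introduced here for illustration.

```latex
% X: |T| x |D| term-document matrix whose entries are the weights w_{kj}
X = U \Sigma V^{\top}
\quad\Longrightarrow\quad
X_k = U_k \Sigma_k V_k^{\top}, \qquad k \ll \min(|T|, |D|)

% A query vector q (length |T|) is folded into the k-dimensional space
\hat{q} = \Sigma_k^{-1} U_k^{\top} q

% and compared with \hat{d}_j, the j-th row of V_k, by cosine similarity
\mathrm{sim}(\hat{q}, \hat{d}_j) =
  \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
```

Because terms that co-occur across documents are merged into shared dimensions, a document can score highly against a query without containing any of the query's words.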

A word on speech
“Expert” feature reduction:
- Rhymingness
- Iambicness of meter

Example: Hit song prediction
Goal: Measure some unknown, global, intrinsic property
Features:
- Acoustic: Mel-Frequency Cepstral Coefficients
- Lyric: Probabilistic Latent Semantic Analysis
Classifiers:
- Support vector machines
- Boosting classifiers
Corpus: 1700 #1 hits from 1956 to 2004
Dhanaraj, R., and B. Logan. 2005. Automatic prediction of hit songs. International Conference on Music Information Retrieval, London UK. 488–91.

Example: Hit song detection
Results of PLSA:
Best features are for contraindication

Example: Genre classification
Logan, B., A. Kositsky, and P. Moreno. 2004. Semantic analysis of song lyrics. Proceedings of IEEE International Conference on Multimedia and Expo. 1–7.

References
Sebastiani, F. 1999. Machine learning in automated text categorization. Technical report, Consiglio Nazionale delle Ricerche. Pisa, Italy. 1–59.
Dhanaraj, R., and B. Logan. 2005. Automatic prediction of hit songs. International Conference on Music Information Retrieval, London UK. 488–91.
Logan, B., A. Kositsky, and P. Moreno. 2004. Semantic analysis of song lyrics. Proceedings of IEEE International Conference on Multimedia and Expo. 1–7.
Mahadero, J., Á. Martínez, and P. Cano. 2005. Natural language processing of lyrics. Proceedings of the 13th Annual ACM International Conference on Multimedia. 475–8.
Maxwell, T. 2007. Exploring the music genome: Lyric clustering with heterogeneous features. M.Sc. Thesis. University of Edinburgh.