CLiMB: Computational Linguistics for Metadata Building Center for Research on Information Access Columbia University Libraries


Page 1: CLiMB:  Computational Linguistics for  Metadata Building

CLiMB: Computational Linguistics for Metadata Building

Center for Research on Information Access

Columbia University Libraries

Page 2: CLiMB:  Computational Linguistics for  Metadata Building

Judith L. Klavans 2

Goals of Meeting

• Review progress since June 2003 meeting
  – Advisory Board suggestions
  – Select a new collection with narrow criteria
  – Test results outside of image access platform

• Strategize for Next Steps
  – Potential partners
  – Driving questions
  – Selection of project direction(s)

Page 3: CLiMB:  Computational Linguistics for  Metadata Building


Four areas

• Collections

• Technology

• Users and Uses

• Interface Tools

June 2003 to November 2003

Page 4: CLiMB:  Computational Linguistics for  Metadata Building


Problems in Image Access

Cataloging digital images

Traditional approach:

– manual expertise
– labor intensive
– expensive

Can automated techniques help?

Page 5: CLiMB:  Computational Linguistics for  Metadata Building


CLiMB Technical Contribution

CLiMB will identify and extract

• proper nouns

• terms and phrases

from text related to an image:

September 14, 1908, the basis of the Greenes' final design had been worked out. It featured a radically informal, V-shaped plan (that maintained the original angled porch) and interior volumes of various heights, all under a constantly changing roofline that echoed the rise and fall of the mountains behind it. The chimneys and foundation would be constructed of the sandstone boulders that comprised the local geology, and the exterior of the house would be sheathed in stained split-redwood shakes. —Edward R. Bosley. Greene & Greene. London : Phaidon, 2000. p. 127

Page 6: CLiMB:  Computational Linguistics for  Metadata Building


Can we harvest image descriptors?

Page 7: CLiMB:  Computational Linguistics for  Metadata Building


• Collections

• Technology

• Users and Uses

• Interface Tools

Progress and Planning

Page 8: CLiMB:  Computational Linguistics for  Metadata Building


CLiMB Collections

• Greene & Greene Architectural Drawings
  – Complex images
  – Scholarly texts written about the projects
  – Loose association between text and image
  – Columbia owns many images

• Chinese Paper Gods
  – Less complex image
  – Lay description of each image
  – Small, valuable collection scanned for CLiMB
  – Multilingual transcription is non-standard and variable

Page 9: CLiMB:  Computational Linguistics for  Metadata Building


Greene & Greene Architectural Records and Papers Collection

Drawings and Archives

Avery Architectural and Fine Arts Library

Columbia University Libraries

Page 11: CLiMB:  Computational Linguistics for  Metadata Building


  

C.V. Starr East Asian Library, Columbia University

Chinese Paper Gods

Anne S. Goodrich Collection

Page 12: CLiMB:  Computational Linguistics for  Metadata Building


Pan-hu chih-shen

God of tigers

Page 13: CLiMB:  Computational Linguistics for  Metadata Building


New Collection: Desiderata

• Close association between text and image

• Scholarly descriptions well-structured for testing NLP tools

• Clear Target Object Identifiers (TOIs)

• English only

• Intellectual Property Rights

Page 14: CLiMB:  Computational Linguistics for  Metadata Building


Potential Choice

North Carolina Museum of Art : Handbook of the Collections / introduction, Lawrence J. Wheeler ; editor, Rebecca Martin Nagy ; assisted by June Spence ; contributors, Virgina Burden ... [et al.]. Raleigh : The Museum ; New York, NY : Distributed by Hudson Hills Press, 1998.

Page 15: CLiMB:  Computational Linguistics for  Metadata Building


About the Collection

• Available through Saskia

• 70 images

• Good quality images and details

• Well-structured delimited text descriptions

• Rights management still needs to be addressed

Page 16: CLiMB:  Computational Linguistics for  Metadata Building


Alex Katz
American, born 1927
Six Women, 1975
Oil on canvas
114 x 282 in.

Page 17: CLiMB:  Computational Linguistics for  Metadata Building


Alex Katz has developed a remarkable hybrid art that combines the aggressive scale and grandeur of modern abstract painting with a chic, impersonal realism. During the 1950s and 1960s, decades dominated by various modes of abstraction, Katz stubbornly upheld the validity of figurative painting. In major, mature works such as Six Women, the artist distances himself from his subject. Space is flattened, as are the personalities of the women, their features simplified and idealized: Katz’s models are as fetching and vacuous as cover girls. The artist paints them with the authority and license of a master craftsman, but his brush conveys little emotion or personality. In contrast to the turbulent paint effects favored by the abstract expressionist artists, Katz pacifies the surface of his picture. Through the virtuosic technique of painting wet-on-wet, he achieves a level and unifying smoothness. He further “cools” the image by adopting the casually cropped composition and overpowering size and indifference of a highway billboard or big-screen movie.

In Six Women, Katz portrays a gathering of young friends at his Soho loft. The apparent informality of the scene is deceptive. It is, in fact, carefully staged. Note the three pairs of figures: the foreground couple face each other; the middle ground pair alternately look out and into the picture; and the pair in the background stand at matching oblique angles. The artist also arranges the women into two conversational triangles. Katz studied each model separately, then artfully fit the models into the picture. The image suggests an actual event, but the only true event is the play of light. From the open windows, a cordial afternoon sunlight saturates the space, accenting the features of each woman.

http://ncartmuseum.org/collections/offviewcaptions.shtml#alex

Page 18: CLiMB:  Computational Linguistics for  Metadata Building


Frank Philip Stella
American, born 1936
Raqqa II, 1970
Synthetic polymer and graphite on canvas
120 x 300 in.

Page 19: CLiMB:  Computational Linguistics for  Metadata Building


To many artists of Frank Stella’s generation, the highly subjective paintings of the abstract expressionists seemed mannered and self-indulgent. Stella’s response was to systematize the abstract picture using geometry and a strict but arbitrary set of procedures. Explaining that his art “is based on the fact that only what can be seen there is there,” he sought to distill the image to paint and canvas alone. He stripped his paintings of story or statement; even a brushstroke conveyed too much personality. Stella methodically developed images in series, first mapping the designs on paper before transferring them to canvas. Little was left to chance. Raqqa II belongs to Stella’s aptly titled Protractor Series, begun in 1967. Though never completed, the series was to include 31 compositions, each to be carried out in three different formats: interlaces, rainbows and fans. He titled the paintings after ancient, circular-planned cities.

Raqqa II does not lie quietly on the wall. It dominates its surroundings. What at first glance appears a childlike pattern is actually a highly complex exercise in perception. Bright bands of flat color arc and overlap, promising an illusion of receding space. However, their containment within a strict system of seven shaped and framed units confounds that illusion. The monumental scale and aggressive confidence of Raqqa II typify American art during the 1960s.

http://ncartmuseum.org/collections/offviewcaptions.shtml#frank

Page 20: CLiMB:  Computational Linguistics for  Metadata Building


Progress and Planning

• Collections

• Technology

• Users and Uses

• Interface Tools

Page 21: CLiMB:  Computational Linguistics for  Metadata Building


Text Analysis and Filtering

1. Divide text into words and phrases

2. Gather features for each word and phrase
   • E.g. Is it in the AAT? Is it very frequent?

3. Develop formulae using this information

4. Use formulae to rank for usefulness as potential metadata
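The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `AAT_TERMS` is a tiny hypothetical stand-in for an Art & Architecture Thesaurus lookup, candidate phrases are approximated by adjacent word pairs, and the scoring formula and its weights are invented for the example rather than taken from CLiMB.

```python
import re
from collections import Counter

# Hypothetical stand-in for an AAT lookup; the real thesaurus is far larger.
AAT_TERMS = {"redwood", "sandstone", "chimneys", "porch"}

def candidates(text):
    """Step 1: split text into lowercase words and adjacent word pairs."""
    words = re.findall(r"[A-Za-z][A-Za-z'-]*", text.lower())
    pairs = [" ".join(p) for p in zip(words, words[1:])]
    return words + pairs

def features(text):
    """Step 2: gather simple features (AAT membership, count) per candidate."""
    counts = Counter(candidates(text))
    return {
        term: {"in_aat": term in AAT_TERMS, "count": n}
        for term, n in counts.items()
    }

def score(feats):
    """Step 3: an illustrative formula with made-up weights."""
    return (2.0 if feats["in_aat"] else 0.0) + 0.5 * feats["count"]

def rank(text):
    """Step 4: order candidates by score, highest first (ties alphabetical)."""
    feats = features(text)
    return sorted(feats, key=lambda t: (-score(feats[t]), t))
```

With this toy formula a very frequent word such as "the" can still rank just below the thesaurus terms, which is one reason a filter based on general-corpus frequency is useful.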

Page 22: CLiMB:  Computational Linguistics for  Metadata Building


What Features do we Track?

• Lexical features
  – Proper noun, common noun

• Relevancy to domain
  – Target Object Identifier (TOI)
  – Presence in the Art & Architecture Thesaurus
  – Presence in the back-of-book index

• Statistical observations
  – Frequency in the text
  – Frequency across a larger set of texts, within and outside the domain
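One way to turn the statistical observations into a single number is to compare a term's relative frequency inside the domain with its relative frequency outside it. The sketch below is an assumption about how such a feature could be computed, not the CLiMB formula; the function names and the smoothing constant are hypothetical.

```python
from collections import Counter

def relative_freq(term, tokens):
    """Share of tokens equal to term (0.0 for an empty corpus)."""
    return Counter(tokens)[term] / len(tokens) if tokens else 0.0

def domain_ratio(term, domain_tokens, general_tokens, smoothing=1e-6):
    """Ratio of in-domain to out-of-domain relative frequency.

    Values well above 1 suggest the term is characteristic of the domain.
    The smoothing constant is an arbitrary choice that avoids division by zero.
    """
    return (relative_freq(term, domain_tokens) + smoothing) / (
        relative_freq(term, general_tokens) + smoothing
    )
```

A term that is equally common in both corpora gets a ratio of 1, while a term absent from the general corpus gets a very large ratio.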

Page 23: CLiMB:  Computational Linguistics for  Metadata Building


Problem: Too much Data!

• How should the output be filtered?

• What filtering helps additional text processing (e.g. for text segmentation)?

• What filtering matches what users think?

Page 24: CLiMB:  Computational Linguistics for  Metadata Building


Techniques for Filtering

1. Take an initial guess
   • Collect input from users
   • Alter formulae based on feedback

2. Use automatic techniques to guess (machine learning)
   • Collect input from users
   • Run programs to make predictions based on given opinions (Bayesian networks, classifiers, decision trees)

3. The CLiMB approach: Use both techniques!

Page 25: CLiMB:  Computational Linguistics for  Metadata Building


Initial Manual Filter

• Increase score if proper noun;

• Decrease score if very frequent in Brown corpus;

• Increase score if frequent in back-of-book indexes;

• Increase score if particularly frequent in domain specific texts;

• Increase score if present in authority lists
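The five rules above could be written as a simple additive score. The following is a minimal sketch, assuming unit weights and arbitrary cutoffs; the slides give neither, so `TermFeatures`, `manual_score`, and every threshold here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TermFeatures:
    is_proper_noun: bool
    brown_freq: float        # relative frequency in the Brown corpus
    index_count: int         # occurrences in back-of-book indexes
    domain_freq: float       # relative frequency in domain-specific texts
    in_authority_list: bool

def manual_score(f, brown_cutoff=1e-3, index_cutoff=1, domain_cutoff=1e-4):
    """Apply the five rules with unit weights; all cutoffs are assumptions."""
    score = 0
    if f.is_proper_noun:
        score += 1          # proper nouns are promoted
    if f.brown_freq > brown_cutoff:
        score -= 1          # very common general-English words are demoted
    if f.index_count >= index_cutoff:
        score += 1          # indexed terms are promoted
    if f.domain_freq > domain_cutoff:
        score += 1          # domain-typical terms are promoted
    if f.in_authority_list:
        score += 1          # authority-list terms are promoted
    return score
```

For example, a proper noun that appears in an index, in domain texts, and in an authority list scores 4, while a common word with none of those properties scores -1.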

Page 26: CLiMB:  Computational Linguistics for  Metadata Building


Early Results

Cordelia Culbertson
Greene
James Culbertson
James A. Culbertson
house
special furnishings Charles
Cordelia A. Culbertson house
Blacker house
Tichenor house
bedrooms
Greene furniture
Pacific Coast Architect
Culbertson residence
single-story elevation

Page 27: CLiMB:  Computational Linguistics for  Metadata Building


Next Steps

• Filter “given” information (already in catalogue record if you are lucky enough to have one!)

• What does CLiMB get that is new?

• How much is useful?

• What is the “cost”?
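Filtering "given" information amounts to removing extracted terms that the catalogue record already contains, so that what remains is what CLiMB adds. A minimal sketch, assuming exact case-insensitive matching on whole terms (real records would likely need looser matching); `new_terms` is a hypothetical helper name.

```python
def new_terms(extracted, catalogue_record):
    """Return extracted terms not already present in the catalogue record.

    Matching is case-insensitive on whole terms; order of `extracted` is kept.
    """
    given = {t.strip().lower() for t in catalogue_record}
    return [t for t in extracted if t.strip().lower() not in given]
```

The length of the returned list, relative to the number of extracted terms, gives a rough first answer to "what does CLiMB get that is new?"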

Page 28: CLiMB:  Computational Linguistics for  Metadata Building


Segmentation

• Determination of relevant segment

• Difficult for Greene & Greene
  – The exact text related to a given image is difficult to determine
  – Use of TOI to find this text

• Easy for Chinese Paper Gods and for next collection

• Decision: set initial values manually and explore automatic techniques
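As a rough illustration of using a TOI to find the relevant text, the sketch below keeps only the paragraphs that mention one of a TOI's name variants. It assumes paragraphs are separated by blank lines and that plain substring matching is good enough, which the slides do not claim; `segments_for_toi` is a hypothetical name.

```python
def segments_for_toi(text, toi_variants):
    """Return paragraphs that mention any variant of a TOI.

    Paragraphs are split on blank lines; matching is case-insensitive.
    """
    variants = [v.lower() for v in toi_variants]
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [p for p in paragraphs if any(v in p.lower() for v in variants)]
```

A loosely associated text such as the Greene & Greene monographs would also need pronoun and partial-name handling, which this sketch ignores.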

Page 29: CLiMB:  Computational Linguistics for  Metadata Building


Progress and Planning

• Collections

• Technology

• Users and Uses

• Interface Tools

Page 30: CLiMB:  Computational Linguistics for  Metadata Building


Formative Evaluation Meeting

• At the advice of External Advisory Board

• October 17, 2003

• Goals:
  – Get early feedback from many user types
  – Incorporate that feedback into CLiMB toolset
  – Help shape next steps

Page 31: CLiMB:  Computational Linguistics for  Metadata Building


Formative Evaluation - Attendees

• CLiMB Project Team

  – Judith Klavans
  – Roberta Blitz
  – Rebecca Passonneau
  – Angela Giral
  – Vera Horvath
  – David Elson
  – Bob Wolven
  – Stephen Davis
  – Mark Weber

• CLiMB: External Advisory Board
  – Jeff Cohen (Bryn Mawr)
  – Carl Lagoze (Cornell)
  – Merrilee Proffitt (RLG)

• Invitees
  – Robert Carlucci (Columbia)
  – Terry Catapano (Columbia)
  – Paula Gabbard (Columbia)
  – Deborah Kempe (Frick)
  – Doug Oard (UMd)

• Could not Attend
  – Tony Gill (Mellon)
  – Abby Goodrum (Syracuse)
  – Elisa Lanzi (Smith)

Page 32: CLiMB:  Computational Linguistics for  Metadata Building


Research Questions

• Will CLiMB metadata help users get access to the digital images they want?

• Will these terms help catalogers provide this access?

• How well are the CLiMB tools performing in providing required metadata?

Page 33: CLiMB:  Computational Linguistics for  Metadata Building


Formative Evaluation

Agenda:

http://www.columbia.edu/cu/cria/climb/meeting.html

Surveys:

http://www1.cs.columbia.edu/~delson/survey/gg-index.html

http://www1.cs.columbia.edu/~delson/survey/cpg-index.html

Page 34: CLiMB:  Computational Linguistics for  Metadata Building


What phrases do people select?

ridge beams
gunite
Cordelia A. Culbertson house
Ludowici-Celadon Company
Cordelia Culbertson
extensive water gardens
nontimber materials
pergola
'U plan
enclosed court
James Culbertson
James A. Culbertson
single-story elevation
two-story height
Pasadena's Oak Knoll neighborhood
roof over-hangs

Page 35: CLiMB:  Computational Linguistics for  Metadata Building


Results from Formative Evaluation

• Best – Humans select, CLiMB selects
  – Cordelia A. Culbertson

• Better – Humans select, CLiMB might not
  – Ludowici-Celadon Company

• Better – Humans might not but CLiMB selects
  – house, Tichenor house, most significant house

• Good – Humans do not select, CLiMB does not
  – problem, time
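The four outcomes above form a two-by-two comparison of human and CLiMB selections. The sketch below simplifies "might not" to a strict yes/no membership test, which is an assumption; the function name and cell labels are hypothetical.

```python
def agreement(term, human_selected, climb_selected):
    """Place a term in one of four human/CLiMB agreement cells."""
    by_human = term in human_selected
    by_climb = term in climb_selected
    if by_human and by_climb:
        return "both select"
    if by_human:
        return "human only"
    if by_climb:
        return "CLiMB only"
    return "neither selects"
```

Using the examples from the slide, "Cordelia A. Culbertson" falls in the first cell, "Ludowici-Celadon Company" in the second, "house" in the third, and "problem" in the fourth.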

Page 36: CLiMB:  Computational Linguistics for  Metadata Building


Use Results for Improvement

• Determine ways to better filter CLiMB results

• Use input for improving ranking

Page 37: CLiMB:  Computational Linguistics for  Metadata Building


Use Results for Improvement

1. Use initial ranking to collect feedback

2. Compare CLiMB with user survey ranking

3. Analyze performance and study the errors

4. Refine formula

5. Repeat

Beware: Danger of tailoring to test texts
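Step 2, comparing the CLiMB ranking with the survey ranking, needs some measure of agreement. The slides do not name one; Spearman rank correlation is a standard choice and is used here purely as an illustration. Both functions are hypothetical helpers that take dictionaries mapping terms to scores.

```python
def ranks(scores):
    """Map each term to its rank (1 = highest score); tied terms share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    result = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j to cover the run of terms tied with ordered[i].
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for term in ordered[i:j + 1]:
            result[term] = avg
        i = j + 1
    return result

def spearman(a, b):
    """Spearman rank correlation over the terms present in both score dicts.

    Undefined (division by zero) if either side has all scores tied.
    """
    terms = sorted(set(a) & set(b))
    ra = ranks({t: a[t] for t in terms})
    rb = ranks({t: b[t] for t in terms})
    mean = (len(terms) + 1) / 2   # mean of ranks 1..n, also with averaged ties
    cov = sum((ra[t] - mean) * (rb[t] - mean) for t in terms)
    var_a = sum((ra[t] - mean) ** 2 for t in terms)
    var_b = sum((rb[t] - mean) ** 2 for t in terms)
    return cov / (var_a * var_b) ** 0.5
```

To guard against tailoring the formula to the test texts, the correlation would ideally be recomputed on texts that were not used while refining the formula.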

Page 38: CLiMB:  Computational Linguistics for  Metadata Building


Raw Results

• Raw survey results are at www.cs.columbia.edu/~delson/CLiMB/checklist-results.xls

• Survey results joined with CLiMB ranks, sorted by CLiMB score: www.cs.columbia.edu/~delson/CLiMB/gg-joined-results-by-rank.xls

• Survey results joined with CLiMB ranks, sorted by human score: www.cs.columbia.edu/~delson/CLiMB/gg-joined-results-by-survey.xls

• Quantized survey results (High/Medium/Low): www.cs.columbia.edu/~delson/CLiMB/gg-quantized-results.xls

Page 39: CLiMB:  Computational Linguistics for  Metadata Building


Progress and Planning

• Collections

• Technology

• Users and Uses

• Interface Tools

Page 40: CLiMB:  Computational Linguistics for  Metadata Building


Interface Tools

• Planning the new interface for image professionals to prepare CLiMB metadata from texts

• For catalogers / metadata specialists and visual resources professionals

• Goals
  – to provide a platform for a wider community
  – to be able to collect feedback on CLiMB at a wider level
  – to complete the CLiMB interface “deliverable”

Page 41: CLiMB:  Computational Linguistics for  Metadata Building


Interface Tools – Stay Tuned!

• CLiMB toolset currently implemented with textual interface
  – Fully-functional shell

• New graphical user interface (GUI) can be built on top of existing codebase
  – Perl/Tk

• Design
  – Initiating design phase now
  – Consulting metadata and image specialists

Page 42: CLiMB:  Computational Linguistics for  Metadata Building


Next Steps

• External Advisory Board
  – June 2004

• Select project directions

• Potential partners

Page 43: CLiMB:  Computational Linguistics for  Metadata Building

Thank you!

www.columbia.edu/cu/cria