
Automatic Generation of Music Slide Show using Personal Photos

Songhua Xu†,§,‡,* Tao Jin‡ Francis C.M. Lau‡

†: College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang, P.R. China, 310027

§: Department of Computer Science, Yale University, New Haven, Connecticut, USA, 06520-8285

‡: Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong, P.R. China

*: Corresponding author. Contact: songhua DOT xu AT gmail DOT com.

Abstract

We present an algorithmic system capable of automatically generating a music slide show given a piece of music with lyrics. Different from previous approaches, our method generates slide shows using personal photos that lack annotations. We introduce a novel algorithm to infer the relevance of personal photos to the lyrics, based on which personal photos are optimally selected to match the music. The proposed system first detects the keyframes of the input music. For each music keyframe, it optimally selects an image from the personal photo collection via our image content analysis procedure. Once the keyframe images have been selected, the in-between frames are generated via an image morphing process. Experimental results show that our method can successfully generate music slide shows which follow the rhythms of the music and at the same time match the lyrics.

1 INTRODUCTION

Music video is a popular type of media content on the Internet. However, due to its somewhat higher production cost, the amount of online music video is significantly smaller than that of online music. Sight and sound together are certainly more entertaining than pure sound, so many modern music players add visualization to the music being played. In most cases, however, the visualization is pre-programmed to fit any kind of music, and cannot be easily changed to reflect the character (e.g., the mood, rhythm, etc.) of the music. Ideally, every piece of music should have its own custom visualization, that is, visualization that "dances to the tune". Choreographing a visualization to fit a piece of music is however non-trivial, and time-consuming if done manually, even for just a single piece. This motivates our research on the automatic generation of visuals to accompany music.

In this paper, we introduce an algorithmic system to automatically generate a music slide show for a given input piece of music by taking into account the rhythms and the lyrics of the input. The key difference from previous music video or music slide show generation approaches is that our method takes personal photos as the source for generating the slide show video. We pick personal photos because they are part of everybody's life, and showing them along with the music adds a sense of intimacy. The result can also be seen as a way to browse and enjoy personal photos with nice music in the background. However, there are challenges in using personal photos. A big one is that images in a personal photo collection are rarely annotated, which makes automatic image understanding by computers difficult or impossible. And even if some are annotated, the texts are likely not reliable. To solve this problem, we introduce a novel algorithm to infer the relevance of personal photos with respect to the given music lyrics by referring to similar online images with text annotations from the Internet. Benefiting from this algorithmic process, we can optimally select personal photos to match the keyframes in the input music.

The rest of the paper is organized as follows. We survey the most closely related work in Sec. 2. Sec. 3 presents our problem statement and gives an overview of the main ideas behind our algorithm design. Sec. 4 discusses how to determine the image transition points in the music slide show according to the rhythms of the music. Sec. 5 introduces how to optimally select images from a personal photo collection to match the music lyrics. Sec. 6 discusses how to generate images for the gap between adjacent selected images through an image morphing process. Sec. 7 presents some experiment results, and Sec. 8 concludes the paper.

2 RELATED WORK

Below, we survey the work most closely related to the automatic generation of visuals to accompany music.

There have been some attempts to build automatic music video generation systems, e.g., [2], [4]. The system described in [2] takes an audio clip and a video as input. The video is segmented and analyzed, and shots with fierce camera motion or poor contrast are removed. From the remaining video shots, suitable ones are selected to tag the different parts of the audio. The criterion they use in aligning video shots to the audio track is to match the video shot boundaries with the major peaks in the audio. Their system also provides interactive control for the user to edit the video. The music video system proposed in [4] refines the above method by incorporating more elaborate content analysis algorithms. The visual elements used for music video generation in their system are scenes from home videos, which are automatically extracted from the source videos through shot segmentation. Given a clip of home video, their algorithm first applies content analysis to the input video to derive the motion intensities in its shots. Then, for a piece of user-selected music, the music's rhythms and repetitive patterns are analyzed. Their algorithm then assigns input video shots to various parts of the music so that the tempos of the music (in terms of repetitive patterns in the music) are reflected by the motion intensities in the video shots. The main difference between their algorithm and the algorithm we propose in this paper is that they use video shots while we use still photos from a personal photo collection. Both types of visual accompaniment add entertainment value to the original music. We argue, however, that using photos for the visualization makes the resulting system more accessible to more users, simply because photos or still pictures are more abundantly and conveniently available to most users than videos, and photos are much easier for anyone to manipulate. In fact, most users today keep countless personal photos on their computers, but few personal videos, or none at all. Also, videos might contain too much motion and thus distract from listening to the music. For these reasons, we favor music slide shows, and present in this paper a system for automatically generating a slide show from personal photos.

There also exists some work on generating music-accompanying visuals from images. The P-karaoke system by Hua et al. [6] can generate karaoke videos from personal photos. Their system is capable of converting single photos into motion photograph clips using simulated camera motions. The resultant motion photograph clip is then used to garnish the music, namely by aligning video shot boundaries with the music beats. Unlike their approach, which emphasizes animation, our system focuses (through intensive content analysis) on matching personal photos to the lyrics. Later, Hua et al. extended their work by including style and template support [7]. Shamma et al. [10] proposed a personalized music video creation system which can generate a music video for a piece of music using images found through public image search engines, but they did not elaborate on how to find the images most related to the lyrics. Most recently, Cai et al. [1] proposed an automatic music video generation approach using images from the Internet. Their work is the study most closely related to ours. In choosing images for composing the visual contents of music videos, they use a hand-coded heuristic to re-rank images obtained from online image search. The heuristic considers the likelihood that the photo was taken outdoors and the percentage of the image occupied by a human face; their image ranking method is thus biased towards outdoor photos with larger human faces. Unlike their method, we introduce an image content analysis method to infer the relevance of personal photos to text queries arising from the lyrics. Our method thus improves on their image re-ranking method by producing more semantically meaningful music slide shows using personal photos.

Also related is the automatic home video editing system introduced in [5], where home videos are segmented and matched with a music piece according to the beats and tempos of the music as well as the video content. Their system can optimally select a music piece to match a given video, which is in fact the reverse problem of the music slide show generation task we study here.

3 OVERVIEW

This paper studies the problem of generating a music slide show for a given input piece of music. The generation is considered successful if the resultant video appears to synchronize with the rhythms of the music as well as match the lyrics. Although there are other attributes of music (such as tempo, timbre, tone color, mode, dynamics, etc.) which could be used for the synchronization task, we find rhythm to be the most natural guide for the behavior of the accompanying visualization. Here we assume the input music piece contains lyrics that are correctly synchronized with the audio. We stress again that the visual contents used for generating a music slide show in our method come from photos in a personal photo collection.

The key steps of our algorithm to generate a music slide show are as follows. Given a piece of input music, we first detect its keyframes according to the rhythms. For each of the keyframes, the algorithm optimally selects an image from the personal photo collection to match the corresponding lyrics. After every music keyframe has been assigned an optimal image, for each pair of adjacent keyframe images we generate the in-between frames through an image morphing procedure. In the above steps, one key algorithmic decision is the selection of images from the personal photo collection to accompany the music so that the image contents tally with the lyrics. Because images in a typical personal photo collection are mostly unannotated, we introduce a method to infer the contents of these unannotated photos based on images retrieved from the Internet. A sketch of the overall pipeline is given below.
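To make the pipeline concrete, here is a minimal sketch in Python. All of the helper functions (detect_keyframes, segment_lyrics, select_best_photo, morph_frames) are hypothetical names of ours standing in for the stages detailed in Secs. 4 to 6; the paper does not publish code.

```python
from itertools import pairwise  # Python 3.10+

def generate_music_slideshow(music, timed_lyrics, photo_collection):
    # Sec. 4: detect music keyframes; these become photo transition points.
    keyframe_times = detect_keyframes(music)

    # Sec. 5: cluster time-stamped lyric words into one segment per keyframe
    # interval, then pick the best-matching, not-yet-used photo per segment.
    segments = segment_lyrics(timed_lyrics, keyframe_times)
    used, keyframe_images = set(), []
    for segment in segments:
        photo = select_best_photo(segment, photo_collection, exclude=used)
        used.add(photo)
        keyframe_images.append(photo)

    # Sec. 6: synthesize in-between frames by morphing each adjacent pair of
    # keyframe images, with the morph parameter alpha driven by the rhythm.
    frames = []
    for img_a, img_b in pairwise(keyframe_images):
        frames.extend(morph_frames(img_a, img_b, music))
    return frames
```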

4 DETERMINING PHOTO TRANSITION POINTS VIA MUSIC KEYFRAME DETECTION

4.1 Music features and pairwise music distance

Many of the operations we need deal with music features and pairwise music distance. Therefore, we look at these two issues first before discussing the algorithm.

4.1.1 Music features

The speech recognition community has proposed a number of audio features for tackling various audio signal processing problems, such as characterizing and describing sounds, classifying musical instruments by their sounds, and performing psycho-acoustical studies. In developing a music sound classification system [3], Herrera et al. summarized and classified the most popular audio features into six groups [9]. In this paper, we adopt the following audio features proposed in the existing literature: a) Spectral Centroid (SC), the balancing point of the spectrum. b) Spectral Flux (SF), the L2-norm of the difference between the magnitudes of the Short-Time Fourier Transform (STFT) spectra evaluated in two successive sound frames, with energy normalization. c) Mel-Frequency Cepstral Coefficients (MFCC), which are commonly used in speech recognition; they are a perceptually motivated, compact representation of the spectrum. d) Linear Prediction Reflection Coefficients (LPC), which are used in speech research as an estimate of the speech vocal tract filter. e) Root Mean Square (RMS) of the audio signal, defined as the square root of the sum of the squared values of the signal. f) Spectral Rolloff (SR): if the percentage is set to 0.90, for example, the rolloff is the frequency below which 90% of the spectrum's energy lies. The code for calculating these features is available from the open-source software framework Marsyas (Music Analysis, Retrieval and Synthesis for Audio Signals).

For each feature $X$ and each music frame $M_i$, we calculate the feature's mean value $\overline{X}$ and variance $\delta(X)$ over the music duration spanned by the five frames preceding $M_i$ and the five frames following it, and collect these statistics into the vector

$$\mathcal{F}(M_i) \triangleq \big(\overline{SC}, \delta(SC), \overline{SF}, \delta(SF), \overline{MFCC}, \delta(MFCC), \overline{LPC}, \delta(LPC), \overline{RMS}, \delta(RMS), \overline{SR}, \delta(SR)\big),$$

which becomes our audio feature vector for $M_i$. Given this definition of the music features for a single music frame $M_i$, we can easily define the music features for an entire music piece: the formula is the same as for a single frame; only the music duration being spanned differs. For simplicity, we do not differentiate the two types and use the notation $\mathcal{F}(M_i)$ to refer to either a single frame or an entire piece of music.
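The paper computes these features with Marsyas; as an illustrative alternative, the following sketch approximates the same 12-dimensional vector using librosa and NumPy. The STFT parameters, the LPC order, and the use of librosa's prediction coefficients in place of true reflection coefficients are all our assumptions, not details from the paper.

```python
import numpy as np
import librosa

def feature_vector(y, sr, i, half_win=5, n_fft=2048, hop=512):
    """Approximate F(M_i): the mean and variance of six audio features over
    the five frames before and five frames after frame i."""
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    sc = librosa.feature.spectral_centroid(S=S, sr=sr)[0]
    rolloff = librosa.feature.spectral_rolloff(S=S, sr=sr, roll_percent=0.90)[0]
    rms = librosa.feature.rms(S=S)[0]
    # Summarize the 13 MFCCs per frame by their mean, keeping one value per frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop).mean(axis=0)
    # Spectral flux: L2 norm of the frame-to-frame difference of the
    # energy-normalized STFT magnitudes (padded so lengths match).
    Sn = S / (S.sum(axis=0, keepdims=True) + 1e-12)
    flux = np.r_[0.0, np.linalg.norm(Sn[:, 1:] - Sn[:, :-1], axis=0)]
    # LPC per frame; librosa.lpc returns prediction (not reflection)
    # coefficients, used here as a rough stand-in, summarized by their mean.
    frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop)
    lpc = np.array([librosa.lpc(f, order=12).mean() for f in frames.T])

    lo, hi = max(0, i - half_win), i + half_win + 1
    stats = []
    for f in (sc, flux, mfcc, lpc, rms, rolloff):
        win = f[lo:hi]
        stats += [win.mean(), win.var()]
    return np.array(stats)  # 12-D vector: (mean, variance) x 6 features
```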

4.1.2 Pairwise music distance

Given a pair of music pieces $M_x$ and $M_y$ with feature vectors $\mathcal{F}(M_x)$ and $\mathcal{F}(M_y)$, we define the pairwise music distance between them as the Euclidean distance between the two feature vectors, i.e.,

$$\theta_{music}(M_x, M_y) \triangleq \sqrt{\sum_{k=1}^{12} \big(\mathcal{F}_k(M_x) - \mathcal{F}_k(M_y)\big)^2},$$

where $\mathcal{F}_k(M_x)$ and $\mathcal{F}_k(M_y)$ are the $k$-th components of the feature vectors $\mathcal{F}(M_x)$ and $\mathcal{F}(M_y)$, respectively.
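In code this is just the Euclidean norm of the difference between the two feature vectors; a one-line NumPy sketch:

```python
import numpy as np

def theta_music(f_x, f_y):
    """Pairwise music distance between two 12-D audio feature vectors."""
    return np.linalg.norm(np.asarray(f_x) - np.asarray(f_y))
```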

4.2 Music keyframe detection

We need to detect the keyframes of the input music to serve as targets for which our algorithm optimally selects matching images. Keyframes are those musical moments at which the music exhibits some substantial change. For the in-between frames, we generate their images using an image morphing procedure, to be discussed in Sec. 6.

For each frame $M_i$ in a piece of music, we compare its audio features $\mathcal{F}(M_i)$ with the audio features of its ten neighboring music frames. When the average difference exceeds a certain threshold, we regard the frame as a music keyframe. Mathematically, $M_i$ is identified as a music keyframe if

$$\frac{1}{10} \sum_{j=i-5,\, j \neq i}^{j=i+5} \big\| \mathcal{F}(M_j) - \mathcal{F}(M_i) \big\| > \varepsilon, \quad (1)$$

where $\|x\|$ denotes the magnitude of the vector $x$. We empirically set $\varepsilon$ to one tenth of the average magnitude of a music frame's audio feature vector over the whole piece, i.e., $\varepsilon = \frac{1}{10} \cdot \frac{\sum_i \|\mathcal{F}(M_i)\|}{\sum_i 1}$. We also constrain two adjacent music keyframes to be at least 1 second and no more than 5 seconds apart: a minimum of 1 second because the user needs time to see a photo, and a maximum of 5 seconds so that the user will not get bored looking at the same picture for too long.
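The sketch below implements Eq. (1) plus the minimum-spacing constraint, assuming the per-frame feature vectors and their timestamps have been computed already (our assumed inputs). How to handle a gap longer than 5 seconds, e.g., by inserting an extra transition, is not specified in the paper, so it is omitted here.

```python
import numpy as np

def detect_keyframes(features, frame_times, min_gap=1.0):
    """Eq. (1): frame i is a keyframe when the mean feature distance to its
    ten neighbors exceeds epsilon; then enforce the 1-second minimum spacing.
    `features` is a sequence of F(M_i) vectors, `frame_times` their times."""
    F = np.asarray(features)
    # Epsilon: one tenth of the average feature-vector magnitude (Sec. 4.2).
    eps = 0.1 * np.linalg.norm(F, axis=1).mean()

    candidates = []
    for i in range(5, len(F) - 5):
        neighbors = [j for j in range(i - 5, i + 6) if j != i]
        mean_diff = np.mean([np.linalg.norm(F[j] - F[i]) for j in neighbors])
        if mean_diff > eps:
            candidates.append(i)

    # Keep only keyframes at least min_gap seconds after the last kept one.
    kept = []
    for i in candidates:
        if not kept or frame_times[i] - frame_times[kept[-1]] >= min_gap:
            kept.append(i)
    return kept
```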

5 OPTIMALLY ASSOCIATING PERSONAL PHOTOS WITH INPUT MUSIC

5.1 Main idea

In our system, we would like to optimally associate personal photos with the input music so that the content in each photo is the best possible match for the meaning of the corresponding lyrics. Unfortunately, personal photos are seldom annotated, which makes our problem difficult. Even when some of them are annotated, the annotations are usually neither reliable nor comprehensive. To solve this problem, we introduce a novel image content analysis method which can infer image-to-text-query relevance for personal photos, i.e., images with no text annotation. The idea is to find annotated images on the Internet and then compute the image content similarity between these reference images and the images in the personal photo collection. Through this reference, our algorithm can estimate the relevance of a personal photo with respect to a text query.

5.2 Our method

We introduced our music keyframe detection method in Sec. 4.2. Since in our experiments we assume the lyrics of the song are available and in sync with the music, we can cluster all the lyric words according to the detected breakpoints (i.e., the music keyframes). Assuming the music keyframes are at time moments $t_1, t_2, \cdots, t_n$, all the lyric words that fall into the time period $[0, t_1]$ form a word sequence $ws_1 \triangleq \{w_{1,1}, w_{1,2}, \cdots, w_{1,l_1}\}$ (assuming there are $l_1$ such words). Similarly, the lyric words that fall into the time period $[t_{i-1}, t_i]$ form the word sequence $ws_i \triangleq \{w_{i,1}, w_{i,2}, \cdots, w_{i,l_i}\}$ (assuming there are $l_i$ such words). We call each $ws_i$ a lyric segment. In the clustering process, we remove stop words from the lyrics. We then sequentially select images to match each lyric segment: we first optimally select an image from the personal photo collection $\Lambda$ to match the first lyric segment $ws_1$, then optimally select another image to match the second lyric segment $ws_2$, and so on.

For a given lyric segment $ws_i$, we submit it as a search phrase to a commercial image search engine to obtain reference images; in our current experimental setting, we use Google Image Search. Assuming $z(ws_i)$ images are returned as search results, they form a reference image set $\mathcal{I}(ws_i) \triangleq \{I_1(ws_i), I_2(ws_i), \cdots, I_{z(ws_i)}(ws_i)\}$.
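As a sketch of the clustering step, assume the synchronized lyrics arrive as (timestamp, word) pairs; this input format and the toy stop-word list are our assumptions.

```python
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}  # toy list

def segment_lyrics(timed_words, keyframe_times):
    """Group time-stamped lyric words into lyric segments ws_i bounded by
    consecutive music keyframes, dropping stop words."""
    bounds = [0.0, *keyframe_times, float("inf")]
    return [
        [w for t, w in timed_words
         if lo <= t < hi and w.lower() not in STOP_WORDS]
        for lo, hi in zip(bounds[:-1], bounds[1:])
    ]
```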

The image-to-query relevance score between the lyric segment $ws_i$ and an image $I_x$ in the personal photo collection $\Lambda$ is estimated as

$$\eta(I_x, ws_i) \triangleq \frac{\sum_{j=1}^{m} \theta_{img}\big(I_x, I^{ref}_j(I_x, ws_i)\big)}{m}, \quad (2)$$

where $\theta_{img}(I_x, I_y)$ computes the image content similarity between the pair of images $I_x$ and $I_y$ based on Scale-Invariant Feature Transform (SIFT) image features [8], and $I^{ref}_1(I_x, ws_i), I^{ref}_2(I_x, ws_i), \cdots, I^{ref}_m(I_x, ws_i)$ are the $m$ distinct reference images from the set $\mathcal{I}(ws_i)$ that yield the highest image content similarity scores with $I_x$. In our current experiments, we empirically set $m = 3$.

The image $I^{opt}(ws_i)$ that we optimally select from the personal photo collection $\Lambda$ to match the lyric segment $ws_i$ is therefore determined as

$$I^{opt}(ws_i) \triangleq \arg\max_{I_x \in \Lambda} \eta(I_x, ws_i). \quad (3)$$

Since showing the same photo multiple times tends to bore the viewer, in the above optimization we only search among images in $\Lambda$ that have not previously been selected to accompany a lyric segment.
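A sketch of the scoring and selection steps using OpenCV follows. The paper specifies only that $\theta_{img}$ is a SIFT-based similarity [8], so the ratio-test match count used below, its normalization, and all function names are our assumptions.

```python
import cv2

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher()

def theta_img(img_x, img_y):
    """Stand-in for theta_img: the fraction of SIFT keypoints surviving
    Lowe's ratio test between the two (grayscale) images."""
    _, des_x = sift.detectAndCompute(img_x, None)
    _, des_y = sift.detectAndCompute(img_y, None)
    if des_x is None or des_y is None:
        return 0.0
    pairs = matcher.knnMatch(des_x, des_y, k=2)
    good = [p for p in pairs
            if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    return len(good) / max(1, min(len(des_x), len(des_y)))

def relevance(img_x, reference_images, m=3):
    """Eq. (2): mean similarity of img_x to its m most similar references."""
    sims = sorted((theta_img(img_x, ref) for ref in reference_images),
                  reverse=True)
    return sum(sims[:m]) / m

def select_photo(reference_images, collection, used):
    """Eq. (3): argmax of the relevance score over photos not yet used."""
    best = max((idx for idx in range(len(collection)) if idx not in used),
               key=lambda idx: relevance(collection[idx], reference_images))
    used.add(best)
    return collection[best]
```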

6 PRODUCING IN-BETWEEN FRAMES

For each music keyframe, we have optimally selected an image to match it (Sec. 5.2). To generate the in-between frames, we adopt a morphing-based approach. Assume we have two music keyframe images $M_1$ and $M_2$ whose in-between frames need to be generated. We use the multi-resolution image morphing algorithm of [26]. The morphing algorithm operates fully automatically; however, there is a free parameter $\alpha \in [0, 1]$ involved in the morphing process which specifies how closely the generated image resembles each of the two source images: if $\alpha = 0$, the morphing result is identical to $M_1$; if $\alpha = 1$, the morphing result is identical to $M_2$; in general, the larger the $\alpha$ value, the more the morphing result resembles $M_2$. In the original morphing algorithm, the $\alpha$ values are sampled uniformly from the interval $[0, 1]$ to generate the intermediate morphing images; this is the standard practice in morphing algorithms designed for computer graphics applications. In our system, to make the intermediate images better reflect the flow of the underlying music, instead of assigning $\alpha$ values from the standard uniform distribution, we make them depend on the rhythms of the underlying music. More concretely, for an intermediate music frame $M_x$ to be processed, let its music content distances to $M_1$ and $M_2$ be $\theta_{music}(M_x, M_1)$ and $\theta_{music}(M_x, M_2)$ respectively, computed via the method introduced in Sec. 4.1.2. The $\alpha$ value we choose for generating the morphing result for the music frame $M_x$ is then

$$\alpha(M_x) \triangleq \frac{\theta_{music}(M_x, M_1)}{\theta_{music}(M_x, M_1) + \theta_{music}(M_x, M_2)}.$$

To verify that this simple equation satisfies our need: if $M_x$ sounds similar to $M_1$, then $\theta_{music}(M_x, M_1) \rightarrow 0$, so that $\alpha(M_x) \rightarrow 0$; likewise, if $M_x$ sounds similar to $M_2$, then $\alpha(M_x) \rightarrow 1$. Through this equation we can therefore make the images generated by the morphing process for the in-between music frames closely reflect the rhythm of the music piece. Once all the in-between frames between every pair of adjacent music keyframes have been generated, the music slide show for the given music piece is complete.
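The rhythm-driven choice of $\alpha$ is easy to express in code. The sketch below assumes the theta_music distance from Sec. 4.1.2 and takes the feature vectors of the in-between frame and the two keyframes as inputs; the handling of the degenerate zero-distance case is our addition.

```python
def rhythm_alpha(f_x, f_1, f_2):
    """alpha(M_x): normalized music distance from the in-between frame M_x
    to the first keyframe M_1, per the equation above."""
    d1, d2 = theta_music(f_x, f_1), theta_music(f_x, f_2)
    if d1 + d2 == 0.0:  # degenerate case: frame identical to both keyframes
        return 0.5
    return d1 / (d1 + d2)
```

Frames that sound like $M_1$ then get $\alpha$ near 0 and produce morph results close to the first keyframe image, replacing the uniform sampling of $[0, 1]$ used by standard morphing.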

7 EXPERIMENT RESULTS

Figures 1 and 2 present two generation results from our experiments. Both the personal photos identified and the corresponding lyric segments are shown. Due to space limits, only the top two personal photos identified as most relevant to the respective lyrics are included. These personal photos were obtained by mining the personal blogs of graduate students at the authors' institutions as well as the personal blogs of these students' immediate friends. Examining these photos along with their accompanying lyrics, most of the pictures appear to express the semantics of the lyric segment, or of its keyword, vividly. These results confirm that the personal photos identified by our algorithm match the song lyrics satisfactorily, and suggest that the algorithm introduced in this paper can effectively augment the music listening experience by automatically generating music slide shows from personal photos.

In terms of the time performance of our algorithm, downloading the top 200 images ranked by Google Image Search took around 10 minutes over a local consumer ISP connection, while the re-ranking process took less than 10 seconds. The overall time spent is thus dominated by the online image downloading step. This characteristic is advantageous for wide deployment of our algorithm, since in many home and office situations the web access bandwidth is not fully utilized most of the time.

8 CONCLUSION

We have presented an algorithmic system capable of automatically generating a music slide show given a piece of music with lyrics. The key contribution of our work is that our system can generate music slide shows using personal photos, with the visual contents in the generated video corresponding to both the lyrics and the rhythms of the music. This is realized through an image content understanding process which infers the relevance of personal photos without text annotations with respect to a text query, based on reference images obtained through an online image search process. Besides using lyrics to optimally select the photos, we also employ a music rhythm analysis process which detects music keyframes to serve as photo transition points in the music slide show. This analysis is based on music features that collectively represent the rhythms of the music, which is different from prior work, e.g., [1], which simply uses beat positions, tempo, or mode to determine the photo transition points.

References

[1] R. Cai, L. Zhang, F. Jing, W. Lai, and W.-Y. Ma. Automated music video generation using web image resource. In ICASSP '07: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2:737–740, April 2007.

[2] J. Foote, M. Cooper, and A. Girgensohn. Creating music videos using automatic media analysis. In MULTIMEDIA '02: Proceedings of the ACM International Conference on Multimedia, pages 553–560, New York, NY, USA, 2002. ACM.

[3] P. Herrera, G. Peeters, and S. Dubnov. Automatic classification of musical sounds. Journal of New Music Research, 32(1):3–21, 2003.

[4] X.-S. Hua, L. Lu, and H.-J. Zhang. Automatic music video generation based on temporal pattern analysis. In MULTIMEDIA '04: Proceedings of the ACM International Conference on Multimedia, pages 472–475, New York, NY, USA, 2004. ACM.

[5] X.-S. Hua, L. Lu, and H.-J. Zhang. Optimization-based automated home video editing system. IEEE Transactions on Circuits and Systems for Video Technology, 14(5):572–583, 2004.

[6] X.-S. Hua, L. Lu, and H.-J. Zhang. P-karaoke: personalized karaoke system. In MULTIMEDIA '04: Proceedings of the 12th Annual ACM International Conference on Multimedia, pages 172–173, New York, NY, USA, 2004. ACM.

[7] X.-S. Hua, L. Lu, and H.-J. Zhang. Photo2Video: a system for automatically converting photographic series into video. IEEE Transactions on Circuits and Systems for Video Technology, 16(7):803–819, 2006.

[8] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

[9] G. Peeters. A large set of audio features for sound description (similarity and classification) in the CUIDADO project. Technical report, CUIDADO I.S.T., 2004.

[10] D. A. Shamma, B. Pardo, and K. J. Hammond. MusicStory: a personalized music video creator. In MULTIMEDIA '05: Proceedings of the 13th Annual ACM International Conference on Multimedia, pages 563–566, New York, NY, USA, 2005. ACM.


[Figure 1. A music slide show generation experiment using the song "Eyes On Me". Panels pair selected personal photos with the lyric segments "Whenever sang my songs", "On the stage", "on my own", "Whenever said my words", "Wishing they", "would be heard", "I saw", and "you smiling at me".]

[Figure 2. A music slide show generation experiment for the song "Right Here Waiting". Panels pair selected personal photos with the lyric segments "Oceans apart", "day after day", "And I slowly go insane", "I hear your voice", "on the line", "Wherever you go", "Whatever you do", "I will be right here waiting for you", and "Whatever it takes".]
