In Search of a Dataset for Handwritten Optical Music Recognition: … · 2017-03-16 · enough about the sheet music in question, so that the dataset is useful for a wide range of

In Search of a Dataset forHandwritten Optical Music Recognition

Introducing MUSCIMA++Jan Hajic jr

Institute of Formal and Applied LinguisticsFaculty of Mathematics and Physics

Charles UniversityEmail hajicjufalmffcunicz

Pavel PecinaInstitute of Formal and Applied Linguistics

Faculty of Mathematics and PhysicsCharles University

Email pecinaufalmffcunicz

AbstractmdashOptical Music Recognition (OMR) has long beenwithout an adequate dataset and ground truth for evaluatingOMR systems which has been a major problem for establishinga state of the art in the field Furthermore machine learningmethods require training data We analyze how the OMRprocessing pipeline can be expressed in terms of gradually morecomplex ground truth and based on this analysis we present theMUSCIMA++ dataset of handwritten music notation that ad-dresses musical symbol recognition and notation reconstructionThe MUSCIMA++ dataset v09 consists of 140 pages of hand-written music with 91255 manually annotated notation symbolsand 82261 explicitly marked relationships between symbol pairsThe dataset allows training and evaluating models for symbolclassification symbol localization and notation graph assemblyboth in isolation and jointly Open-source tools are provided formanipulating the dataset visualizing the data and annotatingfurther and the dataset itself is made available under an openlicense

I INTRODUCTION WHAT DATASET

Optical Music Recognition (OMR) is a field of documentanalysis that aims to automatically read musical scores Musicnotation encodes the musical information in a graphical formOMR backtracks through this process to extract the musicalinformation from this graphical representation

OMR can perhaps be likened to OCR for the music notationwriting system however it is more difficult [1] and remainsan open problem [2] [3] Common western music notation(CWMN1) is an intricate writing system where both thevertical and horizontal dimensions are salient and used toresolve symbol ambiguity In terms of graphical complexitythe biggest issues are caused by overlapping symbols (in-cluding stafflines) [5] and composite symbol constructionsIn handwritten music the variability of handwriting leadsto a lack of reliable topological properties overall And inpolyphonic music there are multiple sequences written in asense ldquooverrdquo each other ndash as opposed to OCR where theordering of the symbols is linear

1We assume the reader is familiar with this style of music notation andits terminology In case a refresher is needed we recommend the excellentchapter 2 of ldquoMusic Notation by Computerrdquo [4] by Donald Byrd

Moreover the objective of OMR is more ambitious thanOCR recovering not just the locations and classification ofmusical symbols but also ldquothe musicrdquo pitch and durationinformation of individual notes This introduces long-distancerelationships the interpretation of one ldquoletterrdquo of music no-tation may change based on notation events some distanceaway These intricacies of music notation has been thoroughlydiscussed since early attempts at OMR notably by Byrd [4][6]

One of the most persistent problems that have hinderedOMR progress is the lack of datasets These are necessaryto provide ground truth for evaluating OMR systems [1] [6]ndash[12] This is frustrating because the many proposed recog-nition systems cannot be compared Furthermore especiallyfor handwritten notation statistical machine learning methodshave often been used that require training data for parameterestimation [13]ndash[17]

For printed music notation the lack of datasets can bebypassed by rendering sheet music images from synthesizedrepresentations such as LilyPond2 or MEI3 capturing interme-diate steps and using data augmentation techniques to simulatedesired deformations and lighting conditions4 However forhandwritten music no satisfactory synthetic data generatorexists so far and an extensive (and expensive) annotationeffort cannot be avoided Even if a synthesis system wereto be implemented it needs to be evaluated against actualhandwritten data ndash so at least some manual annotation isneeded Therefore in order to best utilize the resources avail-able for creating a dataset we decided to create a dataset ofhandwritten notation

We use the term dataset in the following sense D =〈(xi yi) foralli = 1 n〉 Given a set of inputs xi (in our caseimages of sheet music) the dataset records the desired outputsyi ndash ground truth The quality of OMR systems can thenbe measured by how closely they approximate the ground

2httpwwwlilypondorg3httpwwwmusic-encodingorg4We are aware of one such ongoing effort for the LilyPond format http

lilypond1069038n5nabblecomExtracting-symbol-location-td194857html

(a) Input manuscript image

(b) Replayable output pitches durations onsets Time is thehorizontal axis pitch is the vertical axis This visualization is calleda piano roll

(c) Reprintable output re-typesetting

(d) Reprintable output same music expressed differently

Fig 1 OMR for replayability and reprintability The input (a)encodes the sequence of pitches durations and onsets (b)which can be expressed in different ways (c d)

truth although defining this approximation for the variety ofrepresentations of music is very much an open problem [2][6] [7] [11] [18] [19]

In order to build a dataset of handwritten music we needto decide on two issues

bull What sheet music do we choose as data pointsbull What should the desired output yi be for an image xiThe music scores in the dataset should cover the known

ldquodimensions of difficultyrdquo to allow for assessing OMRsystems with respect to increasingly complex inputs Thereare two main categories of challenges that make OMR lessor more difficult image-related difficulty and complexity ofthe music that we are trying to recognize [6] While image-related difficulties such as uneven lighting or deformations canbe simulated the complexity of notation is not a ldquoknob to turnrdquoand needs to be explicitly relfected by the choice of includedmusic The dataset needs to at least include music of all thefour basic levels of difficulty according to Byrd amp Simonsen[6] single-staff single-voice single-staff multi-voice multi-staff single-voice and ldquopianoformrdquo music

The ground truth must reflect what OMR does Miyaoand Haralick [20] suggest grouping OMR applications intotwo broad groups those that require replayability and thosethat need reprintability Replayability is understood to consistof recovering pitches and durations of individual notes andorganizing them as a time series by note onset (For thepurposes of this article ldquomusical sequencerdquo will refer tothis minimum replayable data) Reprintability means theability to take OMR results as the input to music typesettingsoftware and obtain a result that is equivalent to the inputReprintability implies replayability but not vice versa as one

musical sequence can be encoded by different musical scores(see fig 1)

OMR systems are usually pipelines with four major stages[1] [2] [21]

1) Image preprocessing enhancement binarization scalenormalization

2) Music symbol recognition staffline identification andremoval localization and classification of remaining sym-bols

3) Musical notation reconstruction recovering the logicalstructure of the score

4) Final representation construction depending on the out-put requirements usually inferring pitch and duration(MusicXML MEI MIDI LilyPond etc)

While end-to-end OMR that bypasses some sections of thispipeline is clearly an option (see [10]) and might yet make thispipeline obsolete there is still a need in the field to comparenew systems against more orthodox solutions Therefore todesign a dataset broadly useful to the OMR community wefirst need to examine how the stages of OMR pipelines canbe expressed in terms of inputs and outputs Then we canidentify what ground truth is needed

After the fundamental design decisions are made we alsoneed to consider additional factors

bull Economy which reduces to what is the smallest subsetof the ground truth that absolutely has to be annotatedmanually per item xi

bull Compatibility can existing results on related datasets becompared directly with results on the new dataset

bull Intellectual property rights can the data be released underan open license such as the Creative Commons family

bull Ease of use how difficult is it to take the dataset and runan experiment

Once the dataset is designed we can proceed to build anannotation interface and annotate the data

It should also be understood that while the dataset records itsground truth in some representation it is not trying to enforceit as the representation that should be used for specific ex-periments Rather experiment-specific output representations(such as pitch sequences for end-to-end experiments by Shi etal in [10] or indeed a MIDI file) should be unambiguouslyderivable from the dataset When defining the ground truth weare concerned with what information to record not necessarilyhow to record it However we want to be sure that we recordenough about the sheet music in question so that the datasetis useful for a wide range of purposes The choice of datasetrepresentation is made to give some theoretical guarantees onthis ldquoinformation contentrdquo

A Contributions and outline

The main contributions of this work arebull MUSCIMA++5 ndash an extensive dataset of handwritten

musical symbols6

5Standing for MUsic SCore IMAges credit for abbreviation to [22]6Available from httpsufalmffcuniczmuscima

bull A principled ground truth definition that bridges the gapbetween the graphical expression of music and musicalsemantics enabling evaluation of multiple OMR sub-tasks up to inferring pitch and duration both in isolationand jointly

bull Open-source tools for processing the data visualizing itand annotating more

The rest of this article is organized into five sections Insection II we describe the OMR pipeline through input-outputinterfaces in order to design a good ground truth for thedataset

In section III we discuss what kinds of data should berepresented by the dataset ndash how to choose images to annotate(on a budget) in order to maximize impact for OMR

In section IV we describe existing datasets for musicalsymbol recognition especially those that are publicly avail-able with respect to the requirements formulated in II and IIIWhile there are no datasets that would satisfy both we haveconcluded that the CVC-MUSCIMA dataset of Fornes et al[22] forms a sound basis

In section V we then describe the current MUSCIMA++version 09 in practical terms what images we chose toannotate the data acquisition process tools associated withthe dataset and we discuss its strengths and limitations

Finally in section VI we briefly summarize the work dis-cuss baselines and evaluation procedures for using the datasetin experiments and sketch a roadmap for future versions ofthe dataset

II GROUND TRUTH OF MUSICAL SYMBOLS

The ground truth over a dataset is the desired output of aperfect7 system solving a task In order to design the groundtruth for the dataset we need to understand which task isit designed to simulate How can the OMR sub-tasks beexpressed terms of inputs and outputs

A Interfaces in the OMR pipeline

The image preprocessing stage is mostly practical bymaking the image conform to some assumptions (such asstems are straight attempted by de-skewing) the OMR systemhas less complexity to deal with down the road while verylittle to no information is lost The problems that this stageneeds to handle are mostly related to document quality (degra-dation stains especially bleedthrough) and imperfections inthe imaging process (eg uneven lighting deformations ofthe paper with mobile phone cameras limited depth-of-fieldmay lead to out-of-focus segments of the image) [6] Themost important problem for OMR in this stage is binarization[2] sorting out which pixels belong to the background andwhich actually make up the notation There is evidence thatsheet music has some specifics in this respect [23] and therehave been attempts to tackle binarization for OMR specifically[24]ndash[26] On the other hand other authors have attempted to

7More accurately as perfect as possible given how the ground truth wasacquired See V-D

Fig 2 OMR pipeline from the original image through im-age processing binarization and staff removal While staffremoval is technically part of symbol recognition as stafflinesare symbols as well it is considered practical to think of theimage after staff removal as the input for recognition of othersymbols

bypass binarization especially before staffline detection [27][28]

The input of music symbol recognition is a ldquocleanedrdquoand usually binarized image The output of this stage is alist of musical symbols recording their locations on the pageand their types (eg c-clef beam sharp) Usually there arethree sub-tasks staffline identification and removal symbollocalization (in binary images synonymous with foregroundsegmentation) and symbol classification [2]

Stafflines are usually handled first Removing them thengreatly simplifies the foreground and in turn the remainder ofthe task as it will make connected components a useful (ifimperfect) heuristic for pruning the search space of possiblesegmentations [29] [30] Staffline detection and removal is alarge topic in OMR since its inception [31]ndash[33] and it isthe only one where a competition was successfully organizedby Fornes et al [34] Because errors during staff removalmake further recognition complicated especially by breakingsymbols into multiple connected components with over-eagerremoval algorithms some authors skip this stage Sheridanand George instead add extra stafflines to annul differencesbetween notes on stafflines and between stafflines [35] Pugininterprets the stafflines in a symbolrsquos bounding box to be partof that symbol for the purposes of recognition [36]

Next the page is segmented into musical symbols and thesegments are classified by symbol class While classificationof musical symbols has produced near-perfect scores for

both printed and handwritten musical symbols [30] [37]segmentation of handwritten scores remains elusive as mostsegmentation approaches such as projections [29] [33] [38]rely on topological constraints that do not necessarily holdin handwritten music Morphological skeletons have beenproposed instead [39] [40] as a basis for handwritten OMR

This stage is where the first non-trivial ground truth designdecisions need to be taken the ldquoalphabetrdquo of elementary musicnotation symbols must be defined Some OMR researchersdecompose notation into individual primitives (notehead stemflag) [38] [41]ndash[44] while others retain the ldquonoterdquo as anelementary visual object Beamed groups are decomposed intothe beam(s) and the remaining notehead+stem ldquoquarter-likenotesrdquo [30] [45] [46] or not included [9] [16]

There are some datasets available for symbol recognitionalthough except for staff removal they do not yet respondto the needs of the OMR community especially since mostmachine learning efforts have been directed to this stage seesection IV

In turn the list of locations and classes of symbols on thepage is the input to the music notation reconstruction stageThe outputs are more complex at this stage it is necessaryto recover the relationships among the individual musicalsymbols so that from the resulting representation the ldquomusicalcontentrdquo (most importantly pitch and duration information ndashwhat to play and when) can be unambiguously inferred

There are two important observations to make at stage 3First with respect to pitch and duration the rules of music

notation finally give us something straightforward there isa 11 relationship between a notehead notation primitiveand a note musical object of which pitch and duration areproperties8 The other symbols that relate to a notehead suchas stems stafflines beams accidentals (be they attached to anote or part of a key signature) or clefs inform the readerrsquosdecision to assign the pitch and duration

Second a notable property of this stage is that once inter-symbol relationships are fully recovered (including prece-dence simultaneity and high-level relationships betweenstaves) symbol positions cease to be informative they serveprimarily as features that help music readers infer theserelationships If we wanted to we could forget the input imageafter stage 3

However it is unclear how these relationships should bedefined For instance should there be relationships of multipletypes Is the relationship between a notehead and a stem dif-ferent in principle than between a notehead and an associatedaccidental Or the notehead and the key signature If the noteis written as an f and the key is D major (two sharps on cand on f) does the note relate to the entire key signature orjust to the accidental that directly modifies it What about therelationship between barlines and staves

Instead of a notation reconstruction stage other authors de-fine two levels of symbols low-level primitives that cannot by

8Ornaments such as trills or glissandos do not really count as notes incommon-practice music understanding They are understood as ldquocomplica-tionsrdquo of a note or a note pair

themselves express musical semantics and high-level symbolsthat already do have some semantic interpretation [6] [18] Forinstance the letter p is just a letter from the low-level pointof view but a dynamics sign from the high-level perspectiveThis is a distinction made when discussing evaluation in anattempt to tie errors in semantics to units that can be counted

We view this approach as complementary the high-levelsymbols can also belong to the symbol set of stage 2 and thetwo levels of description can be explicitly linked The fact thatcorrectly interpreting whether the p is a dynamics sign or partof a text (eg presto) requires knowing the positions andclasses of other symbols simply hints that it may be a goodidea to solve stages 2 and 3 jointly

Naturally the set of relationships over a list of elementarymusic notation graphical elements (symbols) can be repre-sented by a graph possibly directed and requiring labelededges The list of symbols from the previous step can be re-purposed as a list of vertices of the graph with the symbolclasses and locations being the vertex attributes We are notaware of a publicly available dataset that addresses this levelalthough at least for handwritten music this is also a non-trivial step Graph-based assembly of music notation primitiveshas been used explicitly eg by [47] [48] and grammar-basedapproaches (eg [33] [41] [43] [49] ) lend themselves to agraph representation as well by recording the parse tree(s)

Finally the chosen output representation of the musicnotation reconstruction step must be a good input for thenext stage encoding the information in the desired outputformat There is a plethora of music encoding formatsfrom the text-based formats such as DARMS9 kern10

LilyPond ABC11 over NIFF12 MIDI13 to XML-based for-mats MusicXML14 and MEI The individual formats are eachsuitable for a different purpose for instance MIDI is mostuseful for interfacing different electronic audio devices MEI isgreat for editorial work LilyPond allows for excellent controlof music engraving Many of these have associated softwaretools that enable rendering the encoded music as a standardmusical score although some ndash notably MIDI ndash do not allowfor a lossless round-trip Furthermore evaluating against themore complex formats is notoriously problematic [11] [19]

Text-based formats are also ripe targets for end-to-endOMR as they reduce the output to a single sequence whichenables the application of OCR text-spotting or even image-captioning models This has been attempted specifically forOMR by Shi et al [10] using a recurrent-convolutional neuralnetwork ndash although with only modest success on a greatlysimplified task However even systems that only use stage 4output and do not use stage 2 and 3 output in an intermediate

9httpwwwccarhorgpublicationsbooksbeyondmidionlinedarms ndash un-der the name Ford-Columbia music representation used as output alreadyin the DO-RE-MI system of Prerau 1971 [32])

10httpwwwmusic-cogohio-stateeduHumdrumrepresentationskernhtml

11httpabcnotationcom12httpwwwmusic-notationinfoenformatsNIFFhtml13httpswwwmidiorg14httpwwwmusicxmlcom

(a) Notation symbols noteheads stems beams ledgerlines a duration dot slur and ornament sign part ofa barline on the lower right Vertices of the notationgraph

(b) Notation graph highlighting noteheads as ldquorootsrdquo ofsubtrees Noteheads share the beam and slur symbols

Fig 3 Visualizing the list of symbols and the notation graph over staff removal output The notation graph in (b) allowsunambiguously inferring pitch and duration

TABLE I OMR Pipeline Inputs and Outputs

Sub-task Input Output

Image Processing Score image ldquoCleanedrdquo imageBinarization ldquoCleanedrdquo image Binary image

Staff ID amp removal Binary image Stafflines listSymbol localization (Staff-less) image Symbol regions

Symbol classification Symbol regions Symbol labels

Notation assembly Symbol regs amp labels Notation graphInfer pitchduration Notation graph Pitchduration attrs

Output conversion Notation graph + attrs MusicXML MIDI

step have to consider stage 2 and 3 information implicitly thatis simply how music notation conveys meaning

That being said stage 4 is ndash or with a good stage 3 outputcould be ndash mostly a technical step The output of the notationreconstruction stage should leave as little ambiguity as possiblefor this last step to handle In effect the combination of outputsof the previous stages should give a potential user enoughinformation to construct the desired representation for any taskthat may come up and does not require more input than theoriginal image after all the musical score contains a finiteamount of information and it can be explicitly represented15

Table I summarizes the steps of OMR and their inputs andoutputs

It is appealing to approach some of these tasks jointly Forinstance classification and notation primitive assembly informeach other a vertical line in the vicinity of an elliptical blobis probably a stem with its notehead rather than a barline16

15This does not mean however that the score contains by itself enoughinformation to perform the music That skill requires years of trainingexperience familiarity with tradition and scholarly expertise and is not agoal of OMR systems

16This is also a good reason for not decomposing notes into stem andnotehead primitives in stage 2 as done by Rebelo et al [45] instead ofdealing with graphically ambiguous symbols define symbols so that they aregraphically more distinct

Certain output formats allow leaving outputs of previous stagesunderspecified if our application requires only ABC notationthen we might not really need to recover regions and labelsfor slurs or articulation marks However the ldquoholy grailrdquo ofhandwritten OMR transcribing manuscripts to a more readable(reprintable) and shareable form requires retaining as muchinformation from the original score as possible and so thereis a need for a benchmark that does not underspecify

B So ndash what should the ground truth be

We believe that there is a strong case for having an OMRdataset with ground truth at stage 3

The key problems that OMR needs to solve reside instages 2 and 3 We argue that we can define stage 3 outputas the point where all ambiguity in the written score isresolved17 That implies that at the end of stage 3 all it takesto create the desired representation in stage 4 is to write adeterministic format conversion from whatever the stage 3output representation is to the desired output format (whichcan nevertheless still be a very complex piece of software)Once ambiguity is resolved in stage 3 the case can be madethat OMR as a field of research has (mostly) done its job

At the same time we cannot call the science finished earlierthan that The notation graph of stage 3 is the first point wherewe have mined all the information from the input image andtherefore we are properly ldquofreerdquo to forget about it if we needto In a sense the notation graph is at the same time a maximalcompression of the musical score and agnostic with respect tosoftware used for rendering the score (or analyzing it further)

Of course because OMR solutions will not be perfectanytime soon there is still the need to address stage 4 in orderto optimize the tradeoffs in stages 2 and 3 for the specificpurpose that drives the adoption of a stage 4 output For

17There is also ambiguity that cannot be resolved using solely the writtenscore for instance how fast is Adagio This may not seem too relevant untilwe attempt searching audio databases using sheet music queries or vice versa

instance a query-by-humming system for sheet music searchmight opt for an OMR component that is not very good atrecognizing barlines or even durations but has very good pitchextraction performance However even such partial OMR willstill need to be evaluated with respect to the individual sub-tasks of stages 2 and 3 even though individual componentsof overall performance may be weighed differently Moreovereven on the way from sheet music to just pitches we need alarge subset of stage 3 output anyway Again one can hardlyerr on the side of explicitness when designing the ground truth

It seems that at least as long as OMR is decomposed intothe stages described in the previous section there is needfor a dataset providing ground truth for the various subtasksall the way to stage 3 We have discussed how to expressthese subtasks in terms of inputs and outputs At the end ouranalysis shows that a good ground truth for OMR shouldcontain

bull The positions and classes of music notation symbolsbull The relationships between these symbols

Stafflines and staves are counted among music notation sym-bols The following design decisions need to be made

bull Defining a vocabulary of music notation symbolsbull Defining the attributes associated with a symbolbull Defining what relationships the symbols can formThere are still multiple ways of defining the stage 3 output

graph in a way that satisfies the disambiguation requirementConsider an isolated 8th note Should edges expressing attach-ment lead from the notehead to the stem and from the stemto the flag or from the notehead to the flag directly Shouldwe perhaps express notation as a constituency tree and insteadhave an overlay ldquo8th noterdquo high-level symbol that has edgesleading to its component notation primitives However asmuch as possible these should be technical choices whicheveroption is selected it should be possible to write an unambigu-ous script to convert it to any of these possible representationsThis space of options equivalent in their information contentis the space in which we apply the secondary criteria of cost-efficient annotation user-friendliness etc

We now have a good understanding of what the ground truthshould entail The second major question is the choice of data

III CHOICE OF DATA

What musical scores should be part of an OMR datasetThe dataset should enable evaluating handwritten OMR with

respect to the ldquoaxes of difficultyrdquo of OMR It should bepossible based on the dataset to judge ndash at least to some extentndash how systems perform in the face of the various challengeswithin OMR However annotating musical scores at the levelrequired by OMR as analyzed in section II is expensive andthe pages need to be chosen with care to maximize impactfor the OMR community We are therefore keen to focus ourefforts on representing in the dataset only those variations onthe spectrum of OMR challenges that have to be providedthrough human labor and cannot be synthesized

What is this ldquochallenge spacerdquo of OMR In their state-of-the-art analysis of the difficulties of OMR Byrd and Simonsen

[6] identify three axes along which musical score imagesbecome less or more challenging inputs for an OMR system

bull Notation complexitybull Image qualitybull Tightness of spacing

Not discussed in [6] is the variability of handwriting styles butwe argue below it is in fact a generalization of the Tightnessof spacing axis

The dataset should also contain a wide variety of musicalsymbols including less frequent items such as tremolos orglissandi to enable differentiating systems also according tothe breath of their vocabulary

A Notation complexity

The axis of notation complexity is structured by [6] intofour levels

1) Single-staff monophonic music (one note at a time)2) Single-staff multi-voice music (chords or multiple simul-

taneous voices)3) Multiple staves monophonic music on each4) Multiple staves ldquopianoformrdquo music

The category of pianoform music is defined as multi-staffpolyphonic with interaction between staves and with norestrictions on how voices appear disappear and interactthroughout

We should also add another level between multi-staff mono-phonic and pianoform music multi-staff music with multi-voice staves but without notation complexities specific to thepiano as described in [6] appendix C Many orchestral scoreswith divisi parts18 would fall into this category as well asa significant portion of pre-19th century music for keyboardinstruments

Each of these levels brings a new challenge Level 1 testsan ldquoOMR minimumrdquo the recognition of individual symbolsfor a single sequence of notes Level 2 tests the ability to dealwith multiple sequences of notes in parallel so eg rhythmicalconstraints based on time signatures [8] [50] are harder to use(but still applicable [51]) Level 3 tests high-level segmentationinto systems and staffs this is arguably easier than dealingwith the polyphony of level 2 [52] as the voices on the stavesare not allowed to interact Our added level 35 multi-staffmulti-voice is just a technical insertion that combines thesetwo difficulties - system segmentation and polyphony Level4 pianoform music then presents the greatest challenge aspiano music has perused the rules of CWMN to their fullest[6] and sometimes beyond19

Can we automatically simulate moving along the axis ofnotation complexity The answer is no While it might bepossible with printed music to some extent there is currentlyno viable system for synthesis of handwritten musical scores

18The instruction divisi is used when the composer intends the playerswithin one orchestral group to split into two or more groups each with theirown voice

19The first author Donald Byrd has been maintaining a list of in-teresting music notation events See httphomessoicindianaedudonbyrdInterestingMusicNotationhtml

Therefore this is an axis of variability that has to be handledthrough data selection

B Image quality

The axis of image quality is discretized in [6] into fivegrades with the assumption of obtaining the image using ascanner We believe that these degradations can be simulatedTheir descriptions essentially provide a how-to increasingsalt-and-pepper noise random pertrubations of object bordersand distortions such as kanungo noise or localized thicken-ingthinning operations While implementing such a simulationis not quite straightforward many morphological distortionshave already been introduced for staff removal data [22] [53]

Because [6] focus is on scanned images as found eg inthe IMSLP database20 the article does not discuss imagestaken by a camera especially a mobile phone even thoughthis mode of obtaining images is widespread and convenientThe difficulties involved in musical score photos are mostlyuneven lighting and page deformation

Uneven lighting is a problem for binarization and givensufficient training data it can quite certainly be defeated byusing convolutional neural networks [28] Anecdotal evidencefrom ongoing experiments suggest that there is little differencebetween handwritten and printed notation in this respect sothe problem of uneven document lighting should be solvableseparately with synthesized notation and equally synthesizedlighting for training data and therefore is not an importantconcern in designing our dataset

Page deformation including 3-D displacement can be syn-thesized as well by mapping the flat surface on a generatedldquolandscaperdquo and simulating perspective Therefore again ourdataset does not necessarily need to include images of scoreswith deformations

A separate concern for historical documents is also back-ground paper degradation and bleedthrough While papertextures can be successfully simulated realistic models ofbleedthrough are to the best of our knowledge not available

Also bearing in mind that the dataset is supposed to addressstages 2 and 3 of the OMR pipeline and the extent to whichdifficulties primarily tackled at the first stage of the OMRpipeline (image processing) can be simulated leads us tobelieve that it is a reasonable choice to annotate binaryimages

C Topological consistency

The tightness of spacing in [6] refers to default horizontaland vertical distances between musical symbols In printedmusic this is a parameter of typesetting that may also changedynamically in order to place line and page breaks for musi-cian comfort The problem with tight spacing is that symbolsstart encroaching into each othersrsquo bounding boxes and hor-izontal distances change while vertical distances donrsquot thusperhaps breaking assumptions about relative notation spacingByrd and Simonsen give an example where the augmentation

20httpwwwimslporg

dot of a preceding note is in a position where it can be easilyconfused with a staccato dot of its following note ( [6] fig 21)This leads to increasingly ambiguous inputs to the primitiveassembly stage

However in handwritten music variability in spacing issuperseded by the variability of handwriting itself which itselfintroduces severe problems with respect to spatial relationshipsof symbols Most importantly handwritten music gives notopological constraints by definition straight lines such asstems become curved noteheads and stems do not touch ac-cidentals and noteheads do touch etc However some writersare more disciplined than others in this respect The variousstyles of handwriting and the ensuing challenges also have tobe represented in the dataset as broadly as possible As there iscurrently no model available to synthesize handwritten musicscores we need to cover this spectrum directly through choiceof data

We find adherence to topological standards to be a moregeneral term that describes this particular class of difficultiesTightness of spacing is a factor that globally influences topo-logical standardization in printed music in handwritten musicthe variability in handwriting styles is the primary source ofinconsistencies with respect to the rules of CWMN topologyThis variability includes the difference between contemporaryand historical music handwriting styles

To summarize it is essential to include the following chal-lenges into the dataset directly through choice of the musicalscore images

bull Notation complexitybull Handwriting stylesbull Realistic historical document degradation

IV EXISTING DATASETS

There are already some OMR datasets for handwrittenmusic What tasks are they for Can we save ourselves somework by building on top of them Is there a dataset thatperhaps already satisfies the requirements of sections II andIII

Reviewing subsec II-A the subtasks at stages 2 and 3 ofthe OMR pipeline are

bull Staffline removalbull Symbol localizationbull Symbol classificationbull Symbol assembly

(See table I for input and output definitions) How are thesetasks served by datasets of handwritten music scores

A Staff removal

For staff removal in handwritten music the best-known andlargest dataset is CVC-MUSCIMA [22] consisting of 1000handwritten scores (20 pages of music each copied by handby 50 musicians) The dataset is distributed with further 11pre-computed distortions for each of these scores according toDalitz et al [53] for a total of 12000 images The state-of-the-art for staff removal has been established with a competitionusing CVC-MUSCIMA [34]

For each input image CVC-MUSCIMA has three binaryimages a ldquofull imagerdquo mask21 which contains all foregroundpixels a ldquoground truthrdquo mask of all pixels that belong toa staffline and at the same time to no other symbol and aldquosymbolsrdquo mask that complementarily contains only pixelsthat belong to some other symbol than a staffline The datasetwas collected by giving a set of 20 pages of sheet music to50 musicians Each was asked to rewrite the same 20 pagesby hand using their natural handwriting style A standardizedpaper and pen was provided for all the writers so thatbinarization and staff removal was done automatically withvery high accuracy and the results were manually checkedand corrected In vocal pieces lyrics were not transcribed

Then the 1000 images were run through the 11 distortionsdefined by Dalitz et al [53] to produce a final set of 12000 images that are since then used to evaluate staff removalalgorithms including a competition at ICDAR [34]

Which of our requirements does the dataset fulfill Withrespect to ground truth it provides staff removal which hasproven to be quite time-intensive to annotate manually but nosymbol annotations (it has been mentioned in [22] as futurework) The dataset also fulfills the requirements for a goodchoice of data as described in III With respect to notationcomplexity the 20 pages of CVC-MUSCIMA include scoresof all 4 levels as summarized in table II (some scores are verynearly single-voice per staff these are marked with a () signto indicate their higher category is mostly undeserved) Thereis also a wide array of music notation symbols across the 20pages including tremolos glissandi an abundance of gracenotes ornaments trills clef changes time and key signaturechanges and even voltas and less common tuples are foundamong the music Handwriting style varies greatly among the50 writers including topological inconsistencies some writerswrite in a way that is hardly distinguishable from printedmusic some write rather extremely disjoint notation and shortdiagonal lines instead of round noteheads as illustrated in fig4

The images are provided as binary so there is no variabilityon the image quality axis of difficulty However in III-B wehave determined this to be acceptable Another limitation isthat all the handwriting is contemporary Finally and impor-tantly it is freely available for download under a CreativeCommons NonCommercial Share-Alike 40 license22

B Symbol classification and localization

Handwritten symbol classification systems can be trainedand evaluated on the HOMUS dataset of Calvo-Zaragozaand Oncina [16] which provides 15200 handwritten musical

21Throughout this work we use the term mask of some graphical elements to denote a binary matrix that is applied by elementwise multiplication(Hadamard product) a value of 0 means ldquothis pixel does not belong to srdquo avalue of 1 means ldquothis pixel belongs to srdquo A mask of all non-staff symbolstherefore has a 1 for each pixel such that it is in the foreground and belongsto one or more symbols that are not stafflines and a 0 in all other cells andif opened in an image viewer it looks like white-on-black results of stafflineremoval

22httpwwwcvcuabescvcmuscimaindex databasehtml

(a) Writer 9 nice handwriting

(b) Writer 49 Disjoint notation primitives

(c) Writer 22 Disjoint primitives and deformed noteheads Somenoteheads will be very hard to distinguish from the stem

Fig 4 Variety of handwriting styles iv the CVC-MUSCIMAdataset

TABLE II CVC-MUSCIMA notation complexity

Complexity level CVC-MUSCIMA pages

Single-staff single-voice 1 2 4 6 13 15Single-staff multi-voice 7 9 () 11 () 19Multi-staff single-voice 3 5 12 16Multi-staff multi-voice 14 () 17 18 20

Pianoform 8 10

symbols (100 writers 32 symbol classes and 4 versions ofa symbol per writer per class with 8 for note-type symbols)HOMUS is also interesting in that the data is captured from atouchscreen it is availabe in online form with x y coordinatesfor the pen at each time slice of 16 ms and for offlinerecognition (ie from a scan) images of music symbols canbe generated from the pen trajectory Together with potentialdata from a recent multimodal recognition (offline + online)experiment [54] these datasets might enable trajectory recon-struction from offline inputs Since online recognition has beenshown to perform better than offline on the dataset [16] sucha component ndash if performing well ndash could lead to better OMR

accuracyOther databases of isolated musical symbols have been

collected but they have not been made available Silva [55]collects a dataset from 50 writers with 84 different symbolseach drawn 3 times for a total of 12600 symbols

However these datasets only contains isolated symbolsnot their positions on a page While it might be possibleto synthesize handwritten music pages from the HOMUSsymbols such a synthetic dataset will be rather limited asHOMUS does not contain beamed groups and chords23

For symbol localization (as well as classification) we areonly aware of a dataset of 3222 handwritten symbols by thegroup of Rebelo et al [30] [45] This dataset is furthermoreonly available upon request not publicly

C Notation reconstruction and final representation

We are not aware of a dataset that explicitly marks therelationships among handwritten musical symbols

Musical content reconstruction is usually understood toentail deriving pitch and relative duration of the written notesIt is possible to mine early music manuscripts in the IMSLPdatabase24 and pair them against their open-source editionswhich are sometimes provided on the website as well orlook for matching encoded data in large repostories such asMutopia25 or KernScores26 however we are not aware of sucha paired collection for OMR or any other available dataset forpitch and duration reconstruction While Bellini et al [18] doperform evaluation of OMR systems with respect to pitch andduration on a limited dataset of 7 pages the evaluation wasdone manually without creating a ground truth for this data

V THE MUSCIMA++ DATASET

We finally describe the data that MUSCIMA++ 09 makesavailable

Our main source of musical score images is the CVC-MUSCIMA dataset described in subsection IV-A The goalfor the first round of MUSCIMA++ annotation was for eachof our annotators to mark one of the 50 versions for each ofthe 20 pages With 7 available annotators this amounted to140 annotated pages of music Furthermore we assigned the140 out of 1000 pages of CVC-MUSCIMA so that all of the50 writers are represented as equally as possible 2 or 3 pagesare annotated from each writer27

There is a total of 91255 symbols marked in the 140annotated pages of music of 107 distinct symbol classesThere are 82261 relationships between pairs of symbols Thetotal number of notes encoded in the dataset is 23352 The set

23It should also be noted that HOMUS is not available under an openlicense so copyright restrictions apply Not even every EU country has theappropriate ldquofair userdquo-like exception for academic research even though it ismandated by the EU Directive 200129EC Article 5(3) (httpeur-lexeuropaeulegal-contentENALLuri=CELEX32001L0029)

24httpswwwimslporg25httpwwwmutopiaprojectorg26httphumdrumccarhorg27With the exception of writer 49 who is represented in 4 images due to a

mistake in the distribution workflow that was only discovered after the imagewas already annotated

TABLE III Symbol frequencies in MUSCIMA++

Symbol Count Symbol (cont) Count

stem 21416 16th flag 495notehead-full 21356 16th rest 436ledger line 6847 g-clef 401beam 6587 grace-notehead-full 348thin barline 3332 f-clef 285measure separator 2854 other text 271slur 2601 hairpin-decr 2688th flag 2198 repeat-dot 263duration-dot 2074 tuple 244sharp 2071 hairpin-cresc 233notehead-empty 1648 half rest 216staccato-dot 1388 accent 2018th rest 1134 other-dot 197flat 1112 time signature 192natural 1089 staff grouping 191quarter rest 804 c-clef 190tie 704 trill 179key signature 695 All letters 4072dynamics text 681 All numerals 594

of symbol classes consists of both notation primitives suchas noteheads or beams and higher-level notation objects suchas key signatures or time signatures The specific choices ofsymbols and ground truth policies is described in subsec V-B

The frequencies of the most significant symbols are de-scribed in table III We can draw two lessons immediatelyfrom the table First even when lyrics are stripped away [22]texts make up a significant portion of music notation ndash nearly5 of symbols are letters Some utilization of handwrittenOCR or at least identifying and removing texts thereforeseems reasonably necessary Second at least among 18th to20th century music some 90 of notes occur as part of abeamed group so works that do not tackle beamed groups arein general greatly restricted (possibly with the exception ofchoral music such as hymnals where isolated notes are stillthe standard)

How does MUSCIMA++ compare to existing datasetsGiven that individual notes are split into primitives andother ground truth policies to obtain a fair comparison weshould subtract the stem count letters and texts and themeasure_separator symbols Some sharps and flats arealso part of key signatures and numerals are part of timesignatures which again leads to two symbols where otherdatasets may only annotate one These subtractions bring usto a more directly comparable symbol count of about 57000

A note on naming conventions CVC-MUSCIMA refersto the set of binary images described in IV-A MUS-CIMA++ 09 refers to the symbol-level ground truth of theselected 140 pages (Future versions of MUSCIMA++ maycontain more types of data such as MEI-encoded musicalinformation or other representations of the encoded musicalsemantics) The term ldquoMUSCIMA++ imagesrdquo refers to those140 undistorted images from CVC-MUSCIMA that have been

annotated so far

A Designated test setsTo give some basic guidelines on comparing trainable

systems over the dataset we designate some images to serveas a test set One can always use a different train-test splithowever we believe our choice is balanced well Similar to[16] we provide a user-independent test set and a user-dependent one Each of these contains 20 images one for eachCVC-MUSCIMA page However they differ in how each ofthese 20 is chosen from the 7 available versions with respectto the individual writers

The user-independent test set evaluates how the systemhandles data form previously unseen writers The imagesare split so that the 2 or 3 MUSCIMA++ images from anyparticular writer are either all in the training portion or all inthe test portion of the data

The user-dependent test set to the contrary contains atmost one image from each writer in its set of the 20 CVC-MUSCIMA pages For each writer in the user-dependent testset there is also at least one image in the training dataThis allows experimenting with at least some amount of useradaptation

Furthermore both test sets are chosen so that the annotatorsare represented as uniformly as possible so that the evalua-tion is not biased towards the idiosyncracies of a particularannotator28

B MUSCIMA++ ground truthHow does MUSCIMA++ implement the requirements de-

scribed in IIOur ground truth is a graph We define a fine-grained

vocabulary of musical symbols as its vertices and we definerelationships between symbols to be expressed as unlabeleddirected edges We then define how ordered pairs participatein relationships (noteheads connect to accidentals numeralscombine to form time signatures etc)29

Each vertex (symbol) furthermore has a set of attributesThese are a superset of the primitive attributes in [20] Foreach symbol we encode

bull its label (notehead sharp g-clef etc)bull its bounding box with respect to the pagebull its mask exactly which pixels in the bounding box

belong to this symbolThe mask is especially important for beams as they are oftenslanted and so their bounding box overlaps with other symbols(esp stems and parallel beams) Slurs also often have thisproblem Annotating the mask enables us to build an accuratemodel of actual symbol shapes

The symbol set includes what [7] [18] and [6] woulddescribe as a mix of low-level symbols as well as high-level symbols but without explicitly labeling the symbols as

28The choice of test set images is provided as supplementary data togetherwith the dataset itself

29The most complete annotation guidelines detailing what the symbolset is and how to deal with individual notations are available onlinehttpmuscimarkerreadthedocsioendevelopinstructionshtml

Fig 5 Two-layer annotation of a triplet Highlighted symbolsare numeral_3 tuple_bracketline and the threenoteheads that form the triplet The tuple symbol itself towhich the noteheads are connected is the darker rectangleencompassing its two components it has relationships leadingto both of them (not highlighted)

either Instead of trying to categorize symbols according towhether they carry semantics or not we chose to express thehigh- vs low-level dichotomy through the rules for formingrelationships This leads to ldquolayeredrdquo annotation For instancea 34 time signature is annotated using three symbols anumeral_3 numeral_4 and a time_signature sym-bol that has outgoing relationships to both of the numeralsinvolved In MUSCIMA++ 09 we do not annnotate invisiblesymbols (eg implicit tuplets) Each symbol has to have atleast one foreground pixel

The policy for making decisions that are arbitrary withrespect to the information content as discussed at the endof subsec II-B was set to stay as close as possible tothe written page rather than the semantics If this principlewas in conflict with the requirement for both reprintabil-ity and replayability introduced in I a symbol class wasadded to the vocabulary to capture the requisite meaningExamples are the key_signature time_signaturetuple or measure_separator These second-layer sym-bols are often composite but not necessarily so for in-stance a single sharp can also form a key_signatureor a measure_separator is expressed by a singlethin_barline An example of this structure for is givenin figure 5

While we take care to define relationships so that the resultis a Directed Acyclic Graph (DAG) there is no hard limiton the maximum oriented path length However in practiceit very rarely goes deeper than 3 as the path in the tuple

example leading from a notehead to the tuple and thento its constitutent numeral_3 or tuple_bracketlineand in most cases this depth is at most 2

On the other hand we do not break down symbols thatconsist of multiple connected components unless it is possiblethat these components can be seen in valid music notationin various configurations An empty notehead may show upwith a stem without one with multiple stems when twovoices share pitch30 or it may share stem with others sowe define these as separate symbols An f-clef dot cannotexist without the rest of the clef and vice versa so we definethe f-clef as a single symbol on the other hand a singlerepeat spanning multiple staves may have a variable number ofrepeat dots so we define a repeat-dot separately This isa different policy from Miyao amp Haralick [20] who split egthe repeat_measure ldquopercent signrdquo into three primitives

We do not define a note symbol Notes are hard to pindown on the written page in the traditional understandingof what is a rdquonoterdquo symbol [30] [16] [7] they consist ofmultiple primitives (notehead and stem and beams or flags)but at the same time multiple notes can share these primitivesincluding noteheads ndash the relationship between high- and low-level symbols has in general an mn cardinality Another high-level symbol may be a key signature which can consist ofmultiple sharps or flats It is not clear how to annotate notes Ifwe follow the rdquosemanticsrdquo criterion for distinguishing betweenthe low- and high-level symbol description of the written pageshould eg an accidental be considered a part of the notebecause it directly influences its pitch31

The policy for defining how relationships should be formedwas to make noteheads independent of each other That is asmuch as possible of the semantics of a note corresponding toa notehead can be inferred based on the explicit relationshipsthat pertain to this particular notehead However this idealis not fully implemented in MUSCIMA++ 09 and possiblycannot be reasonably reached given the rules of music no-tation While it is possible to infer duration from the currentannotation (with the exception of implicit tuples) not so muchpitch First of all one would need to add staffline and staffobjects and link the noteheads to the staff This is not explicitlydone in MUSCIMA++ 09 but given the staff removal groundtruth included with the annotated CVC-MUSCIMA imagesit should be merely a technical step that does not requiremuch manual annotation The other missing piece of the pitchpuzzle are assignment of notes to measures and precedencerelationships Precedence relationships need to include (a)notes to capture effects of accidentals at the measure scope(b) clefs and key signatures which can change within onemeasure32 so it is not sufficient to attach them to measures

30As seen in page 20 of CVC-MUSCIMA31 We would go as far as to say that it is inadequate to try marking rdquonoterdquo

graphical objects in the musical score A note is a basic unit of music butit is not a unit of music notation Music notation encodes notes it does notcontain them

32Even though this does not occur in CVC-MUSCIMA it does happen asillustrated eg by Fig 8 in [6]

Fig 6 MUSCIMarker 11 interface Tool selection on theleft file controls on the right Highlighted relationshipshave been selected (Last bar of from MUSCIMA++ imageW-35_N-08)

Finally in cases where these policy considerations did notprovide clear guidance we made choices in the ground truthdefinition based on aesthetics of the annotation interface sothat annotators could work faster and more accurately

C Available tools

In order to make using the dataset easier we provide twosoftware tools under an open license

First the musicma Python 3 package33 implements theMUSCIMA++ data model which can parse the dataset andenables manipulating the data further (such as assembling therelated primitives into notes to provide a comparison to theexisting datasets with different symbol sets)

Second we provide the MUSCIMarker application34 Thisis the annotation interface used for creating the dataset and itcan visualize the data

D Annotation process

Annotations were done using custom-made MUSCIMarkeropen-source software version 1135 The annotators workedon symbols-only CVC-MUSCIMA images which allowed formore efficient annotation The interface used to add symbolsconsisted of two tools a background-ignoring lasso selectionand connected component selection36 A screenshot of theannotation interface is in fig 6

As annotation was under way due to the prevalence of first-try inaccuracies we added editing tools that enabled annota-tors to rdquofine-tunerdquo the symbol shape by adding or removingarbitrary foreground regions to the symbolrsquos mask Addingsymbol relationships was done by another lasso selection toolAn important speedup was also achieved by providing a rich

33httpsgithubcomhajicjmuscima34httpsgithubcomhajicjMUSCIMarker35httpslindatmffcuniczrepositoryxmluihandle112341-1850 ongoing

development at httpsgithubcomhajicjMUSCIMarker36Basic usage is described in the MUSCIMarker tutorial http

muscimarkerreadthedocsioendeveloptutorialhtml

set of keyboard shortcuts The effect was most acutely feltin the keyboard interface for assigning labels to symbols asthe vocabulary of available labels is rather extensive (158 ofwhich 107 are actually present in the data)

There were seven annotators working on MUSCIMA++Four of these are professional musicians three are experiencedamateur musicians The annotators were asked to completethree progressively more complex training examples Therewere no IT skills required (we walked them through MUS-CIMarker installation and usage the odd troubleshooting wasdone through the TeamViewer37 remote interface) Two of thetraining examples were single measures for basic familiariza-tion with the user interface the third was a full page to ensureunderstanding of the majority of notation situations Based onthis rdquotraining roundrdquo we have also further refined the groundtruth definitions

As noted above each annotator completed one image ofeach of the 20 CVC-MUSCIMA pages The work was dis-patched to annotators in four packages of 5 images eachone package at a time After each package submission by anannotator we checked for correctness Automated validationwas implemented in MUSCIMarker to check for ldquoimpossiblerdquoor missing relationships (eg a stem and a beam should not beconnected a grace notehead has to be connected to a ldquonormalrdquonotehead) However there was still need for manual checksand manually correcting mistakes found in auto-validationas the validation itself was an advisory voice to highlightquestionably annotated symbols not an authoritative response

Manually checking each submitted file also allowed forcontinuous annotator training and filling in ldquoblind spotsrdquo inthe annotation guidelines (such as specifying how to dealwith volta signs or tuples) Some notation events were simplynot anticipated (eg mistakes that CVC-MUSCIMA writersmade in transcription or non-standard symbols) Feedbackwas provided individually after each package was submittedand checked This feedback was the main mechanism forcontinuous training Requests for clarifications of guidelinesfor situations that proved problematic were also disseminatedto the whole group

At first the quality of annotations was inconsistent someannotators performed well some poorly Some required ex-tensive feedback Thanks to the continuous communicationand training the annotators improved and the third andfourth packages required relatively minor corrections Overallhowever only one annotator submitted work at a quality thatrequired practically no further changes during quality controlDifferences in annotator speed did not equalize as much asannotation correctness

Finally after collecting annotations for all 140 images weperformed a second quality control round this time with fur-ther automated checks We checked for disconnected symbolsand for symbols with suspiciously sparse masks (a symbol wasdeemed suspicious if more than 007 of the foreground pixelsin its bounding box were not marked as part of any symbol at

37httpswwwteamviewercom

all) This second round of quality control uncovered yet moreinaccuracies and we also fixed other clearly wrong markings(eg if a significant amount of stem-only pixels was markedas part of a beam)

On average throughout the annotations the fastest annotatormanaged to mark about 6 symbols per minute or one per 10seconds the next fastest came at 46 symbols per minute Twoslowest annotators clocked in around 34 symbols per minuteThe average speed overall was 43 symbols per minute or oneper 14 seconds The ldquoupper boundrdquo on annotation speed wasestablished by the first author (who is intimately familiar withthe most efficient ways of using the MUSCIMarker annotationtool) to be 89 objects per minute (one in 675 seconds) Thesenumbers are computed over the whole time spent annotatingso they include the periods during which annotators weremarking relationships and checking their work in other wordsif you plan to extend the dataset with a comparable annotatorworkforce you can expect an average page of about 650symbols to take about 2 3

4 hoursAnnotating the dataset using the process detailed above took

roughly 400 hours of work the ldquoquality controlrdquo correctnesschecks took an additional 100 ndash 150 The second morecomplete round of quality control took roughly 60 ndash 80 hoursor some 05 hours per image38

E Inter-annotator agreement

In order to assess (a) whether the annotation guidelinesare well-defined and (b) the extent to which we can trustannotators we conducted a test all seven annotators weregiven the same image to annotate and we measured inter-annotator agreement Inter-annotator agreement does explicitlynot decouple the factors (a) and (b) However given that theoverall expected level of ambiguity is relatively low and giventhe learning curve along which the annotators were movingthroughout their work which would as be hard to decouplefrom genuine (a)-type disagreement we opted to not expendresources on annotators re-annotating something which theyhad already done and therefore cannot provide exact intra-annotator agreement data

Another use of inter-annotator agreement is to provide anupper bound on system performance If a system performsbetter than average inter-annotator agreement it may be over-fitting the validation set (On the other hand it may havemerely learned to compensate for annotator mistakes ndash moreanalysis is needed before concluding that the system overfitsBut it is a useful warning one should investigate unexpectedlyhigh performance numbers)

1) Computing agreement In order to evaluate the extent towhich two annotators agree on how a given image should beannotated we perform two steps

bull Align the annotated object sets against each otherbull Compute the macro-averaged f-score over the aligned

object pairs

38For the sake of completeness implementing MUSCIMarker took about600 hours including the learning curve for the GUI framework

TABLE IV Inter-annotator agreement

Setting macro-avg f-score

noQC-noQC (inter-annot) 089noQC-withQC (self) 093

withQC-withQC (inter-annot) 097

Objects that have no counterpart contribute 0 to both precisionand recall

Alignment was done in a greedy fashion For symbol setsS T we first align each t isin T to the s isin S with the highestpairwise f-score F (s t) then vice versa align each s isin Sto the t isin T with the highest pairwise f-score Taking theintersection we then get symbol pairs s t such that they areeach otherrsquos ldquobest friendsrdquo in terms of f-score The symbolsthat do not have such a clear counterpart are left out of thealignment Furthermore symbol pairs that are not labeled withthe same symbol class are removed from the alignment as well

When breaking ties in the pairwise matchings (from bothdirections) symbol classes c(s) c(t) are used If F (s t1) =F (s t2) but c(s) = c(t1) while c(s) 6= c(t2) then (s t1) istaken as an alignment candidate instead of (s t2) (If both t1and t2 have the same class as s then then the tie is brokenrandomly In practice this would be extremely rare and wouldnot influence agreement scores very much)

2) Agreement results The resulting f-scores are summa-rized in table IV We measured inter-annotator agreement bothbefore and after quality control (noQC-noQC and withQC-withQC) and we also measured the extent to which qualitycontrol changed the originally submitted annotations (noQC-withQC) Tentatively the post-QC measurements reflect thelevel of genuine disagreement among the annotators abouthow to lead the boundaries of objects in intersections andthe inconsistency of QC while the pre-QC measurements alsomeasures the extent of actual mistakes that were fixed in QC

Ideally the task of annotating music notation symbols is rel-atively unambiguous Legitimate sources of disagreement lie intwo factors unclear symbol boundaries in intersections andillegible handwriting For relationships ambiguity is mainlyin polyphonic scores where annotators had to decide howto attach noteheads from multiple voices to crescendo anddecrescendo hairpin symbols However after quality controlthere were 689 ndash 691 objects in the image and 613 ndash 637relationships depending on which annotator we asked Thishighlights the limits of both the annotation guidelines andQC the ground truth is probably not entirely unambiguousso various configurations passed QC and additionally the QCprocess itself allows for human error (If we could reallyautomate infallible QC we would also have solved OMR)

At the same time as seen in table IV the two-round qualitycontrol process apparently removed nearly four fifths of alldisagreements bringing the withQC inter-annotator f-score of097 from a noQC f-score of 089 On average quality controlintroduced less change than was originally between individualannotators This statistic seems to suggest that the withQCresults are somewhere in the ldquocenterrdquo of the space of submitted

annotations and therefore the quality control process reallyleads to more accurate annotation instead of merely distortingthe results in its own way

We can conclude that the annotations using the qualitycontrol process is quite reliable even though slight mistakesmay remain Overfitting the test set will likely not be an issue

F Known limitations

The MUSCIMA++ 10 dataset is far from perfect as isalways the case with extensive human-annotated datasets Inthe interest of full disclosure and managing expectations welist the known issues

Annotators also might have made mistakes that slipped boththrough automated validation and manual quality control Inautomated validation there is a tradeoff between catchingerrors and false alarms events like multiple stems per noteheadhappen even in the limited set of 20 pages of MUSCIMA++ Inthe same vein although we did implement automated checksfor highly inaccurate annotations they only catch some of theproblems as well and our manual quality control procedurealso relies on inherently imperfect human judgment All inall the data is not perfect With limited man-hours there isalways a tradeoff between quality and scale

The CVC-MUSCIMA dataset has had staff lines removedautomatically with very high accuracy based on a precisewriting and scanning setup (using a standard notation paperand a specific pen across all 50 writers) However there arestill some errors in staff removal sometimes the staff removalalgorithm took with it some pixels that were also legitimatepart of a symbol This manifests itself most frequently withstems

The relationship model is rather basic Precedence andsimultaneity relationships are not annotated and stafflines andstaves are not among the annotated symbols so notehead-to-staff assignment is not explicit Similarly notehead-to-measure assignment is also not explicitly marked This is alimitation that so far does not enable inferring pitch from theground truth However much of this should be obtainableautomatically from the annotation that is available and theCVC-MUSCIMA staff removal ground truth

There are also some outstanding technical issues in thedetails of how the bridge from graphical expression to in-terpretation ground truth is designed For example there isno good way to encode a 128 time signature The rdquo1rdquo andrdquo2rdquo would currently be annotated as separate numerals andthe fact that they belong together to encode the number 12is not represented explicitly one would have to infer thatfrom knowledge of time signatures A potential fix is tointroduce a ldquonumeric textldquo symbol as an intermediary betweenldquotime signatureldquo and ldquonumeral Xldquo symbols similarly to var-ious ldquo(some) textldquo symbols that group ldquoletter Xldquo symbolsAnother technical problem is that the mask of empty noteheadsthat lie on a ledger line includes the part of the ledger linethat lies within the notehead

Finally there is no good policy on symbols broken intotwo at line breaks They are currently handled as two separatesymbols

VI CONCLUSION

In MUSCIMA++ v09 we provide an OMR dataset ofhandwritten music that allows training and benchmarkingOMR systems tackling the symbol recognition and notationreconstruction stages of the OMR pipeline Building on theCVC-MUSCIMA staff removal ground truth we provideground truth for symbol localization classification and sym-bol graph recovery which is the step that resolves ambigui-ties necessary for inferring pitch and duration Although thedataset does not explicitly record precedence simultaneityand attachment to stafflines and staves this information canbe inferred automatically from the staff removal ground truthfrom CVC-MUSCIMA and the existing MUSCIMA++ symbolannotations

A What can we do with MUSCIMA++

MUSCIMA++ allows evaluating OMR performance on var-ious sub-tasks in isolation

bull Symbol classification use the bounding boxes and sym-bol masks as inputs symbol labels as outputs Use primi-tive relationships to generate a ground truth of compositesymbols for compatibility with datasets of [16] or [45]

bull Symbol localization use the pages (or sub-regions) asinputs the corresponding list of bounding boxes (andoptionally masks) is the outputw

bull Primitives assembly use the bounding boxesmasksand labels as inputs adjacency matrix as output (ForMUSCIMA++ version 09 be aware of the limitationsdiscussed in V-F)

At the same time these inputs and outputs can be chained toevaluate systems tackling these sub-tasks jointly

What about inferring pitch and durationFirst of all we can exploit the 11 correspondence be-

tween notes and noteheads Pitch and duration can thereforebe thought of as extra attributes of notehead-class symbolsDuration of notes (at least relative ndash half quarter etc) canthen already be extracted from the annotated relationships ofMUSCIMA++ 09 Tuplet symbols are also linked explicitlyto the notes (noteheads) they affect To reconstruct pitchthough one needs to go beyond what MUSCIMA++ 09 makesexplicitly available The missing elements of ground truthare relationships that attach key signatures and noteheads tostafflines and secondarily to measures (the scope of ldquoinlinerdquoaccidentals is until the next bar) However these relationshipsshould be relatively straightforward to add automatically usingthe CVC-MUSICMA staffline ground truth

B Future work

MUSCIMA++ in its curent version 09 does not entirelylive up to the requirements discussed in II and III ndash severalareas need improvement

First of all the relationship model is so far rather basicWhile it should be possible to automatically infer precedenceand simultaneity stafflines staves and the relationships ofnoteheads to the staff symbols it is not entirely clear howaccurately it can be done This is the major obstacle toautomatically inferring pitch from the currently available data

Second the source of the data is relatively limited to thecollection effort of CVC-MUSCIMA While the variety ofhandwriting collected by Fornes et al [22] is impressiveit is all contemporary ndash whereas the application domain ofhandwritten OMR is also in early music where differenthandwriting styles have been used The dataset should beenriched by early music sources

Third the current ad hoc data format is not a good long-term solution While it encodes the ground truth informationwell and we do provide tools for parsing visualization andimplement a data model it would be beneficial to re-encodethe data using MEI This does not mean only using thegraphics-independent MEI XML format for recording themusical content ndash MEI also has the capacity to encode thegraphical elements their locations the input pages etc Forrelationships MEI allows user-defined elements Re-encodingthe dataset in MEI would also lift the burden of maintainingan independent data model implementation

Finally now that MUSCIMA++ is available there is suf-ficient data to train models for automating annotation Thiscould significantly speed up future work and enable us to covera greater variety of scores including those from the existingdataset of Rebelo et al [30] The resources freed up throughautomation could also be used to annotate staff removal groundtruth in other scores which by itself takes anecdotally aboutas much time as annotating all the remaining symbols

C Final remarks

In spite of its imperfections the MUSCIMA++ dataset stilloffers the most complete and extensive publicly availableground truth annotation for OMR to date Together withthe provided software it should enable the OMR field toestablish a robust basis for comparing systems and measuringprogress Organizing a competition as called for in [6] wouldbe a logical next step towards this end Although specificevaluation procedures will need to be developed for this datawe believe the fine-grained annotation will enable evaluatingat least the stage 2 and stage 3 tasks in isolation and jointlywith a methodology analogous to those suggested in [6] [7]or [18] Finally it can also serve as the training data forextending the machine learning paradigm of OMR describedby Calvo-Zaragoza et al [56] to symbol recognition andnotation assembly tasks

Our hope is that the MUSCIMA++ dataset will be usefulto the broad OMR community

ACKNOWLEDGMENT

First of all we thank our annotators who did much of thetedious work The authors would also like to express their grat-

itude to Alicia Fornes of CVC UAB39 who generously decidedto share the CVC-MUSICMA dataset under the CC-BY-NC-SA 40 license thus enabling us to share the MUSCIMA++dataset in the same open manner as well

This work is supported by the Czech Science Foundationgrant number P10312G084

REFERENCES

[1] D Bainbridge and T Bell ldquoThe challenge of optical music recognitionrdquoComputers and the Humanities vol 35 pp 95ndash121 2001

[2] Ana Rebelo Ichiro Fujinaga Filipe Paszkiewicz Andre R S MarcalCarlos Guedes and Jaime S Cardoso ldquoOptical Music RecognitionState-of-the-Art and Open Issuesrdquo Int J Multimed Info Retrvol 1 no 3 pp 173ndash190 Mar 2012 [Online] Availablehttpdxdoiorg101007s13735-012-0004-6

[3] Jirı Novotny and Jaroslav Pokorny ldquoIntroduction to Optical MusicRecognition Overview and Practical Challengesrdquo DATESO 2015 Pro-ceedings of the 15th annual international workshop 2015

[4] D Byrd ldquoMusic Notation by Computerrdquo PhD dissertation 1984[5] D Bainbridge and T Bell ldquoDealing with superimposed objects

in optical music recognitionrdquo Proceedings of 6th InternationalConference on Image Processing and its Applications no 443 pp756ndash760 1997 [Online] Available httplinkaiporglinkIEECPSv1997iCP443p756s1ampAgg=doi

[6] Donald Byrd and Jakob Grue Simonsen ldquoTowards a Standard Testbedfor Optical Music Recognition Definitions Metrics and Page ImagesrdquoJournal of New Music Research vol 44 no 3 pp 169ndash195 2015[Online] Available httpdxdoiorg1010800929821520151045424

[7] Michael Droettboom and Ichiro Fujinaga ldquoSymbol-level groundtruthingenvironment for OMRrdquo Proceedings of the 5th International Conferenceon Music Information Retrieval (ISMIR 2004) pp 497ndash500 2004

[8] Victor Padilla Alan Marsden Alex McLean and Kia Ng ldquoImprovingOMR for Digital Music Libraries with Multiple Recognisers andMultiple Sourcesrdquo Proceedings of the 1st International Workshop onDigital Libraries for Musicology - DLfM rsquo14 pp 1ndash8 2014 [Online]Available httpdxdoiorg10114526601682660175

[9] S Chanda D Das U Pal and F Kimura ldquoOffline Hand-Written Musical Symbol Recognitionrdquo 2014 14th InternationalConference on Frontiers in Handwriting Recognition pp 405ndash410sep 2014 [Online] Available httpieeexploreieeeorglpdocsepic03wrapperhtmarnumber=6981053

[10] Baoguang Shi Xiang Bai and Cong Yao ldquoAn End-to-End TrainableNeural Network for Image-based Sequence Recognition and ItsApplication to Scene Text Recognitionrdquo CoRR vol abs1507057172015 [Online] Available httparxivorgabs150705717

[11] J Hajic jr J Novotny P Pecina and J Pokorny ldquoFurtherSteps towards a Standard Testbed for Optical Music Recognitionrdquoin Proceedings of the 17th International Society for MusicInformation Retrieval Conference M Mandel J DevaneyD Turnbull and G Tzanetakis Eds New York UniversityNew York USA New York University 2016 pp 157ndash163 [Online] Available https18798-presscdn-pagelynetdna-sslcomismir2016wp-contentuploadssites2294201607289 Paperpdf

[12] Arnau Baro Pau Riba and Alicia Fornes ldquoTowards the Recognitionof Compound Music Notes in Handwritten Music Scoresrdquo in 15thInternational Conference on Frontiers in Handwriting RecognitionICFHR 2016 Shenzhen China October 23-26 2016 IEEEComputer Society 2016 pp 465ndash470 [Online] Available httpdxdoiorg101109ICFHR20160092

[13] M V Stuckelberg and D Doermann ldquoOn musical score recognitionusing probabilistic reasoningrdquo Proceedings of the Fifth InternationalConference on Document Analysis and Recognition ICDAR rsquo99(Cat NoPR00318) no Did pp 115ndash118 1999 [Online] Availablehttpieeexploreieeeorgxplsabs alljsparnumber=791738

[14] A Rebelo J Tkaczuk R Sousa and J S Cardoso ldquoMetric learning formusic symbol recognitionrdquo Proceedings - 10th International Conferenceon Machine Learning and Applications ICMLA 2011 vol 2 pp 106ndash111 2011

39httpwwwcvcuabespeopleafornes

[15] A Rebelo F Paszkiewicz C Guedes A R S Marcal andJ S Cardoso ldquoA Method for Music Symbols Extraction basedon Musical Rulesrdquo Proceedings of BRIDGES no 1 pp 81ndash882011 [Online] Available httpwwwinescportoptsimjscpublicationsconferences2011ARebeloBRIDGESpdf

[16] Jorge Calvo-Zaragoza and Jose Oncina ldquoRecognition of Pen-Based Music Notation The HOMUS Datasetrdquo 22nd InternationalConference on Pattern Recognition Aug 2014 [Online] Availablehttpdxdoiorg101109ICPR2014524

[17] Cuihong Wen Ana Rebelo Jing Zhang and Jaime Cardoso ldquoA newoptical music recognition system based on combined neural networkrdquoPattern Recognition Letters vol 58 pp 1 ndash 7 2015 [Online] AvailablehttpwwwsciencedirectcomsciencearticlepiiS0167865515000392

[18] Pierfrancesco Bellini Ivan Bruno and Paolo Nesi ldquoAssessing OpticalMusic Recognition Toolsrdquo Computer Music Journal vol 31 no 1 pp68ndash93 Mar 2007 [Online] Available httpdxdoiorg101162comj200731168

[19] M Szwoch ldquoUsing MusicXML to Evaluate Accuracy of OMRSystemsrdquo Proceedings of the 5th International Conference onDiagrammatic Representation and Inference pp 419ndash422 2008[Online] Available httpdxdoiorg101007978-3-540-87730-1 53

[20] H Miyao and R M Haralick ldquoFormat of Ground Truth Data Used inthe Evaluation of the Results of an Optical Music Recognition Systemrdquo2000 p 497506

[21] Andrew Hankinson ldquoOptical music recognition infrastructure for large-scale music document analysisrdquo PhD dissertation 2015 [Online]Available httpdigitoollibrarymcgillcawebclientDeliveryManagerpid=130291ampcustom att 2=direct

[22] A Fornes A Dutta A Gordo and J Llads ldquoCVC-MUSCIMA aground truth of handwritten music score images for writer identificationand staff removalrdquo International Journal on Document Analysis andRecognition (IJDAR) vol 15 no 3 pp 243ndash251 2012 [Online]Available httpdxdoiorg101007s10032-011-0168-2

[23] John Ashley Burgoyne Laurent Pugin Greg and Eustace Ichiro Fu-jinaga ldquoA Comparative Survey of Image Binarisation Algorithms forOptical Recognition on Degraded Musical Sourcesrdquo ACM SIGSOFTSoftware Engineering Notes vol 24 no 1985 pp 1994ndash1997 2008

[24] JaeMyeong Yoo Nguyen Dinh Toan DeokJai Choi HyukRo Parkand Gueesang Lee ldquoAdvanced Binarization Method for Music ScoreRecognition Using Local Thresholdsrdquo 2008 IEEE 8th InternationalConference on Computer and Information Technology Workshops pp417ndash420 2008 [Online] Available httpieeexploreieeeorglpdocsepic03wrapperhtmarnumber=4568540

[25] T Pinto A Rebelo G Giraldi and J S Cardoso ldquoContent AwareMusic Score Binarizationrdquo Tech Rep 2010

[26] mdashmdash ldquoMusic Score Binarization Based on Domain Knowledgerdquo inIberian Conference on Pattern Recognition and Image Analysis vol6669 Berlin Heidelberg Springer Verlag 2011 pp 700ndash708

[27] A Rebelo and J S Cardoso ldquoStaff line detection and removal inthe grayscale domainrdquo Proceedings of the International Conference onDocument Analysis and Recognition ICDAR pp 57ndash61 2013

[28] J Calvo Zaragoza G Vigliensoni and I Fujinaga ldquoDocumentAnalysis for Music Scores via Machine Learningrdquo in Proceedings ofthe 3rd International Workshop on Digital Libraries for Musicologyser DLfM 2016 New York NY USA ACM 2016 pp 37ndash40[Online] Available httpdoiacmorg10114529700442970047

[29] I Fujinaga ldquoAdaptive Optical Music Recognitionrdquo PhD dissertation1996

[30] Ana Rebelo ldquoRobust Optical Recognition of Handwritten MusicalScores based on Domain Knowledgerdquo PhD dissertation 2012 [On-line] Available httpwwwinescportopt7EarebeloarebeloThesispdf

[31] Denis Pruslin ldquoAutomatic recognition of sheet musicrdquo PhD disserta-tion Cambridge Massachusetts USA 1966

[32] D S Prerau ldquoComputer pattern recognition of printed musicrdquo AFIPConference proceedings Vol39 1971 fall joint computer conferencepp 153ndash162 1971

[33] I Fujinaga ldquoOptical Music Recognition using Projectionsrdquo Masterrsquosthesis 1988

[34] A Fornes A Dutta A Gordo and J Llados ldquoThe ICDAR 2011 musicscores competition Staff removal and writer identificationrdquo in DocumentAnalysis and Recognition (ICDAR) 2011 International Conference onIEEE 2011 pp 1511ndash1515

[35] S Sheridan and S E George ldquoDefacing Music Scores for ImprovedRecognitionrdquo Proceedings of the Second Australian Undergraduate

Students Computing Conference no i pp 1ndash7 2004 [Online]Available httpciteseerxistpsueduviewdocsummarydoi=1011722460$delimiterrdquo026E30F$nhttpwwwcsberkeleyedusimbenrpublicationsauscc04paperssheridan-auscc04pdf

[36] Laurent Pugin ldquoOptical Music Recognitoin of Early TypographicPrints using Hidden Markov Modelsrdquo in ISMIR 2006 7th InternationalConference on Music Information Retrieval Victoria Canada 8-12October 2006 Proceedings 2006 pp 53ndash56 [Online] Availablehttpismir2006ismirnetPAPERSISMIR06152 Paperpdf

[37] Cuihong Wen Jing Zhang Ana Rebelo and Fanyong Cheng ldquoADirected Acyclic Graph-Large Margin Distribution Machine Model forMusic Symbol Classificationrdquo PLOS ONE vol 11 no 3 p e0149688mar 2016 [Online] Available httpdxdoiorg101371journalpone0149688

[38] P Bellini I Bruno and P Nesi ldquoOptical music sheet segmentationrdquoin Proceedings First International Conference on WEB Delivering ofMusic WEDELMUSIC 2001 Institute of Electrical amp ElectronicsEngineers (IEEE) 2001 pp 183ndash190 [Online] Available papers2publicationuuidFEF468FA-2244-48DF-94C6-64246D675F15

[39] K Ng D Cooper E Stefani R Boyle and N Bailey ldquoEmbracing theComposer Optical Recognition of Handwritten Manuscriptsrdquo 1999

[40] N Luth ldquoAutomatic Identification of Music Notationsrdquo in Proceedingsof the Second International Conference on WEB Delivering of Music(WDELMUSICrsquo02) 2002

[41] B Couasnon and J Camillerapp ldquoUsing Grammars To Segment andRecognize Music Scoresrdquo Pattern Recognition pp 15ndash27 October1994

[42] D Bainbridge and T Bell ldquoAn extensible Optical Music Recognitionsystemrdquo in Proceedings of the nineteenth Australasian computerscience conference 1997 p 308317 [Online] Available httpwwwcswaikatoacnzsimdavidbpublicationsacsc96

[43] mdashmdash ldquoA music notation construction engine for optical music recogni-tionrdquo Software - Practice and Experience vol 33 no 2 pp 173ndash2002003

[44] A Fornes ldquoAnalysis of Old Handwritten Musical Scoresrdquo PhD dis-sertation 2005

[45] A Rebelo G Capela and J S Cardoso ldquoOptical recognition of musicsymbolsrdquo International Journal on Document Analysis and Recognitionvol 13 pp 19ndash31 2010

[46] V K Pham and G Lee ldquoMusic Score Recognition Based on a Col-laborative Modelrdquo International Journal of Multimedia and UbiquitousEngineering vol 10 no 8 pp 379ndash390 2015

[47] K T Reed and J R Parker ldquoAutomatic computer recognition of printedmusicrdquo Proceedings - International Conference on Pattern Recognitionvol 3 pp 803ndash807 1996

[48] Liang Chen Rong Jin and Christopher Raphael ldquoRenotation fromOptical Music Recognitionrdquo in Mathematics and Computation inMusic Springer Science + Business Media 2015 pp 16ndash26 [Online]Available httpdxdoiorg101007978-3-319-20603-5 2

[49] M Szwoch ldquoGuido A musical score recognition systemrdquo Proceedingsof the International Conference on Document Analysis and RecognitionICDAR vol 2 no 3 pp 809ndash813 2007

[50] Ana Rebelo Andre Marcal and Jaime S Cardoso ldquoGlobal constraintsfor syntactic consistency in OMR an ongoing approachrdquo inProceedings of the International Conference on Image Analysis andRecognition (ICIAR) 2013 [Online] Available httpwwwinescportoptsimjscpublicationsconferences2013ARebeloICIARpdf

[51] Rong Jin and Christopher Raphael ldquoInterpreting Rhythm in OpticalMusic Recognitionrdquo in Proceedings of the 13th InternationalSociety for Music Information Retrieval Conference ISMIR 2012Mosteiro SBento Da Vitoria Porto Portugal October 8-12 2012F Gouyon P Herrera L G Martins and M Muller EdsFEUP Edicoes 2012 pp 151ndash156 [Online] Available httpismir2012ismirneteventpapers151-ismir-2012pdf

[52] Pau Riba Alicia Fornes and Josep Llados ldquoTowards the Alignment ofHandwritten Music Scoresrdquo in Graphic Recognition Current Trendsand Challenges - 11th International Workshop GREC 2015 NancyFrance August 22-23 2015 Revised Selected Papers ser LectureNotes in Computer Science B Lamiroy and R D Lins Edsvol 9657 Springer 2015 pp 103ndash116 [Online] Available httplinkspringercomchapter101007978-3-319-52159-6 8fulltexthtml

[53] Christoph Dalitz Michael Droettboom Bastian Pranzas and IchiroFujinaga ldquoA Comparative Study of Staff Removal AlgorithmsrdquoIEEE Transactions on Pattern Analysis and Machine Intelligence

vol 30 no 5 pp 753ndash766 May 2008 [Online] Availablehttpdxdoiorg101109TPAMI200770749

[54] Jorge Calvo-Zaragoza David Rizo and Jose Manuel Inesta QueredaldquoTwo (Note) Heads Are Better Than One Pen-Based MultimodalInteraction with Music Scoresrdquo in Proceedings of the 17th InternationalSociety for Music Information Retrieval Conference ISMIR 2016 NewYork City United States August 7-11 2016 M I Mandel J DevaneyD Turnbull and G Tzanetakis Eds 2016 pp 509ndash514 [Online]Available httpswpnyueduismir2016wp-contentuploadssites2294201607006 Paperpdf

[55] Rui Miguel Filipe da Silva ldquoMobile framework for recognitionof musical charactersrdquo Masterrsquos thesis 2013 [Online] Availablehttpsrepositorio-abertoupptbitstream1021668500226777pdf

[56] J Calvo Zaragoza G Vigliensoni and I Fujinaga ldquoA machinelearning framework for the categorization of elements in images ofmusical documentsrdquo in Third International Conference on Technologiesfor Music Notation and Representation A Coruna University of ACoruna 2017 [Online] Available httpgrfiadlsiuaesrepositorigrfiapubs360tenor-unified-categorizationpdf

I Introduction what dataset

I-A Contributions and outline

II Ground truth of musical symbols

II-A Interfaces in the OMR pipeline

II-B So ndash what should the ground truth be

III Choice of data

III-A Notation complexity

III-B Image quality

III-C Topological consistency

IV Existing datasets

IV-A Staff removal

IV-B Symbol classification and localization

IV-C Notation reconstruction and final representation

V The MUSCIMA++ dataset

V-A Designated test sets

V-B MUSCIMA++ ground truth

V-C Available tools

V-D Annotation process

V-E Inter-annotator agreement

V-E1 Computing agreement

V-E2 Agreement results

V-F Known limitations

VI Conclusion

VI-A What can we do with MUSCIMA++

VI-B Future work

VI-C Final remarks

References