An object-oriented database for advanced searches of file systems based on metadata

Christian Holm Fogelberg

Kongens Lyngby 2011

IMM-MSC-2011-1

Technical University of Denmark

Informatics and Mathematical Modelling

Building 321, DK-2800 Kongens Lyngby, Denmark

Phone +45 45253351, Fax +45 45882673

www.imm.dtu.dk

Summary

As people are exposed to ever larger amounts of information, the need to archive and retrieve it effectively grows. The hierarchical file system in use today is beginning to show its limitations when handling large numbers of documents. A hierarchical design cannot handle multi-category archiving. Over time the hierarchy becomes either too shallow, listing too many results per folder, or too deep, hiding documents in deep branches. MetaFS, an alternative to the traditional hierarchical file system, is tag based and metadata aware. Files can be organized with tags and found again later by drilling down on the tags through a Virtual Hard Drive (VHD) interface, which keeps the system backwards compatible. For hard-to-find files, metadata search is available; it uses a metadata-driven drill-down technique to narrow the search results.

Resumé

With the growing amount of information people are exposed to, there is a greater need for better archiving and retrieval of documents. The hierarchical file system in use today has begun to show its limitations in handling large numbers of documents. The hierarchical design cannot handle archiving a document under several categories at once. Over time the hierarchy becomes either too general, showing too many documents per category, or too intricate, hiding documents in deep subcategories. As an alternative to the traditional hierarchical file system, MetaFS is tag based and metadata aware. Files can be organized with tags and found again later using tag-based drill-down through a Virtual Hard Drive (VHD) interface, which keeps it compatible with existing software. For files that are hard to find with tags, a metadata search function is available that allows the search result to be filtered on metadata values.

Table of Contents

Summary
Resumé
List of Figures
1 Introduction
1.1 Previous research
1.2 Objectives
1.3 Thesis structure
2 Methodology
2.1 Information Retrieval
2.2 Storage
2.3 Software Engineering
3 Analysis
3.1 Introduction
3.2 Dokan
3.3 Actors
3.4 Objects
3.5 Unique name constraint
3.6 Metadata extraction
3.7 Search
3.7.1 Live search
3.7.2 Structure of search objects
3.7.3 Searchable attributes
3.7.4 Multiple Metadata readers for the same file extension
3.8 Use Cases
4 Design
4.1 Class responsibilities
4.1.1 Committing changes
4.1.2 MFS
4.1.3 MFSIndexManager
4.1.4 MFSDokan
4.1.5 IMFSPlugin
4.2 Special considerations
4.3 FileStore
4.4 Design considerations for each use case
4.4.1 UC4: View Folder
4.5 Database object references
4.6 Class diagrams
4.6.1 Application Logic classes
4.6.2 GUI classes
4.6.3 DB classes
4.6.4 Extension classes
4.6.5 Class connections
4.6.6 Full diagram
4.7 Sequence diagrams
5 Implementation
5.1 Introduction
5.2 Development environment
5.2.1 Introduction
5.2.2 Software + Hardware Setup
5.2.3 Setup notes
5.2.4 Extensions
5.3 Querying by name
5.4 Infinite cleanup loop
5.5 Use Case implementation
5.5.1 UC1: Add Index Location
5.5.2 UC2: Remove Index Location
5.5.3 UC3: View Root
5.5.4 UC4: View Folder
5.5.5 UC5: Create Folder
5.5.6 UC6: Delete Folder
5.5.7 UC7: Rename Folder
5.5.8 UC8: Move Folder
5.5.9 UC9: Create File
5.5.10 UC10: Delete File
5.5.11 UC11: Rename File
5.5.12 UC12: Move File
5.5.13 UC13: Read File
5.5.14 UC14: Write File
5.5.15 UC15: Change FileStore
5.5.16 Searching
5.6 MFSDebug
5.7 MFSFunctions
5.8 MFSDokan
5.9 Deployment
6 Testing & Results
6.1 Introduction
6.2 Unit Test for every use case
6.3 Existing Unit Tests
6.4 Activation Depth
6.5 Query time measuring
6.6 Query time, Winamp vs. MetaFS
6.7 Filter time, Winamp vs. MetaFS
6.8 AddLocation time trace
6.9 DropBox
6.10 Bugs
6.11 Results
6.11.1 Fundamental requirements
6.11.2 Performance
6.12 Screenshots
7 Conclusion
7.1 Future Work
8 Bibliography
9 Appendix
9.1 Data Dictionary
9.2 Use Cases
9.2.1 UC1: Add Index Location
9.2.2 UC2: Remove Index Location
9.2.3 UC3: View Root
9.2.4 UC4: View Folder
9.2.5 UC5: Create Folder
9.2.6 UC6: Delete Folder
9.2.7 UC7: Rename Folder
9.2.8 UC8: Move Folder
9.2.9 UC9: Create File
9.2.10 UC10: Delete File
9.2.11 UC11: Rename File
9.2.12 UC12: Move File
9.2.13 UC13: Read File
9.2.14 UC14: Write File
9.2.15 UC15: Search
9.2.16 UC16: Filter Search
9.2.17 UC17: Change FileStore Location
9.2.18 UC18: Rescan Indexed Location
9.2.19 UC19: View Untagged Files
9.3 Sequence Diagrams
9.3.1 Add Location
9.3.2 Remove Location
9.3.3 View Root
9.3.4 View Folder
9.3.5 Create Folder
9.3.6 Delete Folder
9.3.7 Rename Folder
9.3.8 Move Folder
9.3.9 Create File
9.3.10 Delete File
9.3.11 Rename File
9.3.12 Move File
9.3.13 Read File
9.3.14 Write File
9.3.15 Search
9.3.16 Filter Search
9.4 Dokan calls for a ReadFile request
9.5 Visualization of activation depth
9.6 Search Object Structure
9.7 Object counts with music folder added
9.8 ID3 Data
9.9 AddLocation on 11.253 files
9.10 AddLocation Performance Report
9.11 Activation impact on queries
10 Index

List of Figures

Figure 1 - The 4 most common filter attributes are listed
Figure 2 - 'Type' attribute used, and 'Name' is now presented in its place
Figure 3 - Both dot-extension and full name of file type is presented
Figure 4 - Metadata attribute filtering
Figure 5 - Unsorted elements
Figure 6 - DTU logo
Figure 7 - Sorted by color
Figure 8 - Sorted by size
Figure 9 - Sorted by Shape
Figure 10 - Tag and file connections
Figure 11 - Structure of search objects
Figure 12 - Viewing the Root (X:)
Figure 13 - Viewing X:\CD1
Figure 14 - Viewing X:\Music[51]
Figure 15 - Viewing X:\Music[51]\S&M[22]
Figure 16 - Query time in seconds for Winamp and MetaFS searches
Figure 17 - Query times for large result sets
Figure 18 - Location management and VHD options
Figure 19 - A search for "mp3" with filtered result (Def Leppard selected as performer in 3rd list)
Figure 20 - Debug output of search times
Figure 21 - Debug output of filter times, with all 11253 files as filter base
Figure 22 - No counter on tag-folders
Figure 23 - Counter on tag-folders
Figure 24 - Drilldown: Roskilde (175 files with this tag), Red Hot Chili Peppers (20 files with this tag)
Figure 25 - Activation Depth
Figure 26 - Search Objects example 1
Figure 29 - AddLocation times
Figure 30 - Object activation and DB commit are time consuming when adding a location

Chapter 1

1 Introduction

This thesis involves at least four areas of computer technology: databases, information retrieval (IR), file systems and software engineering, each of which has its own list of acronyms. Some of these acronyms are quite common, like RAM (Random Access Memory), and others are rarer, like OODB (Object-Oriented Database). How familiar an acronym is typically depends on the reader’s line of work and personal interests. Since no two people are (supposedly) alike, all acronyms and terms used throughout this thesis are listed in the appendix, 9.1 Data Dictionary, for those times when the letters IR just don’t ring a bell.

“Usability has never been a priority in file system design. Instead, developers focus mainly on technical

aspects, such as speed, reliability and security.” – Robert Freund [1]

This quote accurately describes the problem with file systems today. Their design comes from a time when speed, reliability and security were more important than usability. The NTFS file system in use on most Windows machines today was introduced in 1993 [2]. While speed might have been an issue 10-15 years ago, this is no longer the case, and a file system that has been around for this long should have well-proven reliability and security by now. The technology is available today to support a file system that focuses on something besides those three aspects. Perhaps now is the time to look into the usability of a file system in the world as it looks today, and not as it looked when floppy disks were the main storage format. Floppy disks and early hard disks had limited storage space, which limited the number of documents that could be stored and thus the need for elaborate archiving and retrieval techniques.

Most file systems today are hierarchical (or flat) in nature, using a tree-like structure to organize files by putting them into folders. Each folder can itself be a subfolder of another folder, giving a child/parent relationship between folders. The file systems that are not hierarchical are typically built by individuals or very small groups in connection with an article or a bachelor's/master's thesis [1], typically using FUSE (Filesystem in Userspace).

The hierarchical way of organizing files into categories and sub-categories and sub-sub-categories comes

from the time when documents were physically stored on paper and archived in a structure like

\finances\year\department. It gave the early computer users a familiar way of storing documents. This

approach to interface development is similar to the horse tractor development process:

“The designers of the Phelps farm tractor in 1901 based their interface on a metaphor with the interface

for the familiar horse: farmers used reins to control the tractor. The tractor was steered by pulling on

the appropriate rein, both reins were loosened to go forward and pulled back to stop, and pulling back

harder on the reins caused the tractor to back up.” Quoted from [3] which refers to [4].

The hierarchical file system is an example of an old interface in a new technology. Files are stored in exactly the same manner as they would be in a paper environment. For accounting there may be a shelf of ring binders for tax-related papers, a shelf for incoming invoices and a shelf for outgoing invoices. Each ring binder could contain one year's worth of incoming invoices, with 12 subsections, one for each month. To find a particular paper document you would go to the incoming-invoices shelf, pick the binder matching the year and flip through the sections until the correct month was located. Exactly the same method is used when a file is saved on the computer; the only difference is that the computer uses folders in place of the shelves, ring binders and ring binder sections.

Programming languages have the concept of pointers, where multiple pointers can point to the same object. The same approach could be used in a file system, allowing the same file to appear in several locations rather than just one. Tagging is a feature that has gained widespread use on the internet, and it is easily implemented using a database. In a discussion of what features a database file system could entail, tags replacing folder hierarchies seem like a natural evolutionary step from the traditional hierarchical file system to a database file system.
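
The pointer analogy can be made concrete with a small sketch. The Python below is illustrative only and is not taken from MetaFS; the names `FileEntry` and `TagIndex` are hypothetical. It shows a single file object being referenced from several tags, so the file can be reached from several "locations" without being copied.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(eq=False)  # compare by identity: an entry equals only itself
class FileEntry:
    """Hypothetical stand-in for a stored file."""
    name: str


class TagIndex:
    """Hypothetical tag index: each tag holds references to file objects."""

    def __init__(self):
        self._files_by_tag = defaultdict(list)

    def tag(self, entry, *tags):
        # Store a reference to the same object under every tag.
        for t in tags:
            if entry not in self._files_by_tag[t]:
                self._files_by_tag[t].append(entry)

    def files(self, tag):
        # Return a copy of the list so callers cannot mutate the index.
        return list(self._files_by_tag.get(tag, []))


index = TagIndex()
invoice = FileEntry("invoice_0042.pdf")
index.tag(invoice, "invoices", "2010", "taxes")

# The same object is reachable through every tag; nothing was copied.
same_object = index.files("invoices")[0] is index.files("taxes")[0]
```

In a folder hierarchy the invoice would have to live under exactly one of these three names, or be duplicated.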

In the last decade, tags have spread to all the major internet sites. Gmail.com uses them for organizing email, flickr.com for photos, and del.icio.us for bookmarks. Early email organization consisted of a hierarchical structure, a sort of mini file system for emails, in which each email was moved to a folder the user found descriptive for that particular email. Photos stored on the user’s computer are typically “organized” in a folder structure by the time the photo was taken or by the name of the vacation or place where it was taken. There are programs that allow tagging of photos stored on the user’s own hard drive (Picasa, Adobe). Bookmarks also traditionally used the hierarchical structure. Each bookmark was saved directly to the hard drive as a file, and the structure in which the bookmarks were presented was stored as a set of folders on the hard drive, i.e. bookmarks were organized using the hierarchical structure of the file system.

So, with all of these mini applications having moved from a hierarchical structure to a tag-based structure, why is our file system still hierarchical and not tag based? Tags seem to work well for the three areas mentioned above: email, photos and bookmarks. Is there a reason tags have not made it into our file system yet?

Tags can be used in place of folders and allow files to be found in a much more fluid way than through the hierarchical structure, which expects you to take a certain path to locate the document needed. Tags allow the same document to be found in multiple locations.

Tags allow for a different approach to locating documents. You start by selecting a tag and then have a

quick look at the result. If there are too many results, another tag is selected to narrow down the result set.

This process is repeated a few times until the result set is small enough that the target document can be

found by quickly skimming the result set. This approach differs from the hierarchical way of locating

documents, in which you cannot see if you chose the right keywords until you get to the end of the tree

structure of the folders.
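
The drill-down process described above can be sketched in a few lines of Python. This is only an illustrative sketch and not the MetaFS implementation: the file names, the tags and the helper names `drill_down` and `remaining_tags` are assumptions made up for the example.

```python
# Hypothetical data: file name -> set of tags applied to that file.
FILES = {
    "taxes_2010.pdf": {"docs", "financial", "taxes"},
    "budget.xls": {"docs", "financial"},
    "holiday.jpg": {"photos", "vacation"},
    "receipt.jpg": {"photos", "financial"},
}


def drill_down(files, selected_tags):
    """Return the files that carry every one of the selected tags."""
    wanted = set(selected_tags)
    return sorted(name for name, tags in files.items() if wanted <= tags)


def remaining_tags(files, selected_tags):
    """Return the tags that can still narrow the current result set."""
    wanted = set(selected_tags)
    result = set()
    for tags in files.values():
        if wanted <= tags:
            result |= tags - wanted
    return sorted(result)


# Each additional tag can only shrink (or keep) the result set.
print(drill_down(FILES, ["financial"]))
print(drill_down(FILES, ["financial", "docs"]))
```

Offering only the tags returned by `remaining_tags` at each step means the user can never drill down into an empty result set.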

The problem with tags is that the root level, from which we start the process of locating a document,

will present all the tags (or just the most used ones) which can be hundreds or even thousands. To

support the use of tags, the tags themselves need to be searchable and when a search is performed it

should be possible to filter the result based on tags.

Mac users tend to go for the Spotlight search (Gorter’s questionnaire results included Mac users and

point out their go-for-the-search behavior [5]). Windows users don’t really have a search function that

searches outside the document folder, so the user must navigate the hierarchical structure in hopes

that the same logic used to locate it now was also used when filing the document days, weeks or years

ago. Windows 7 does have a search feature, “Windows Search”, that allows the user to find documents

from the start menu. This search feature includes metadata for file types that have an iFilter extension

available. Files have to be in a location that is being indexed by Windows for the search to work. The

documents folder is indexed by default and custom locations can be added to the index, should the user

want it. Windows needs a search feature that allows better searching of metadata attributes, outside

the document folder so mp3 files, images, videos, etc. can be located through searching. The search

needs to be fast, with no results popping up 5-10 seconds after the query was sent (Spotlight can occasionally

add new results 5-10 seconds after the first result shows up). Users are impatient, so ideally the full list

of results will be presented within 1 second, preferably faster. Google typically uses 100-300 ms¹;

anything below half a second should be acceptable in the long run, depending on the query. Winamp uses

~400 ms per word searched, so 3 words take roughly 1.2 seconds².

¹ A Google search for “database file system” returns “About 54,800,000 results (0.11 seconds)”.

A Google search for “madam blå tekande” returns “About 3,380 results (0.26 seconds)”.

² In Winamp, using the media library search feature, a search for “rock” returns “2305 items in 0.479 sec.”, “rock

live” returns “169 items in 0.793 sec.” and “rock live acoustic” returns “9 items in 1.314 sec”.

Tested with a media library of 11k mp3 files, on a Windows machine with quad core CPU @ 2.4GHz, 8GB RAM.

It is easy to find information on the internet: it takes under a second to perform a search on Google,

which typically returns several thousand to millions of results, but finding documents on our own hard

drive takes several seconds, sometimes even minutes to locate using search.

There is so much information today that we now have data about data: metadata, little bits of

information about the document, a sort of quick summary of the file, typically only a few words for

documents or tags for images. Despite all this metadata that is available, the typical way of locating a file

is to search by filename, and when that fails, a “grep” and a coffee break. Recently, desktop search

software has emerged (Google Desktop, Spotlight, etc.) that will crawl the local hard drive and create a

searchable index of the local files, based on filename and contents. These search tools effectively allow

files to be searched using the metadata of the files.

Metadata is often auto-generated or assigned by other people, each of which has their own view on the

format of the metadata (article about the utopia of metadata). Users need a way to assign their own

metadata to files (every file type, not just images and documents), so they can use their own

classification for navigating all their data. This is where tags are interesting: they are simple, they are

fully controlled by the user, and they provide the user with a more flexible way of navigating their files,

compared to hierarchical file systems.

About a decade ago a new term reached the public, “database file system”. It was one of the major new

features, WinFS (Windows Future Storage), in the upcoming Windows Vista. At the time it was just

another buzzword with no real specifics of what features it provided. Nevertheless, it is a term that has

been stuck in my head ever since, and I have always hoped that the next Windows would contain an actual

implementation, so that I could see what Microsoft’s interpretation of a database file system is. Over the last

20 years Microsoft has had a couple of different projects regarding new storage engines, none of which

have resulted in a working system available to customers [6].

There are currently a couple of programs that offer one or more of the features one would expect to

find in a database file system (faster search through indexing, alternative presentation, i.e. non

hierarchical, extra search features e.g. metadata search):

• DBFS [5]

"Database File System; It is a new type of file system that does away with places where you

store your files. Actually do not think of it as a file system, instead think of it as a document

system."

• Gnome Storage[7]

"The current implementation offers natural language access, network transparency, and a

number of other features"

• Apple's Spotlight [8]

"Spotlight is a system-wide desktop search feature of Apple's Mac OS X operating system.

Spotlight is a selection-based search system, which creates a virtual index of all items and files

on the system"[9]

• WinFS (beta only) [10]

“With WinFS, we will provide rich new ways to organize and visualize data. And as a final piece,

it's a platform. It's not just for end users: Developers can extend WinFS, integrate their

applications with WinFS, synchronize data between their applications and private databases and

WinFS, and build support for their own data types into WinFS, using full-featured, managed

code APIs [application programming interfaces]." [6]

"A file system (often also written as filesystem) is a method of storing and organizing computer files and

their data. Essentially, it organizes these files into a database for the storage, organization,

manipulation, and retrieval by the computer's operating system." – The Wikipedia definition of a file

system [11].

The programs listed above are mainly focused on organization of data and they are not so much file

systems as they are indexing services. They typically store references to files stored by the normal file

system rather than store the actual file data. Following the Wikipedia definition of a file system (for lack

of better source) the systems mentioned deal with 2 of the 4 terms mentioned, organization and

retrieval. Storage and manipulation are still handled by the underlying file system. The use of a database

allows for more information to be stored (and searched), but there are some parts of a file system that

are not well suited for storage in the database. Storing large quantities of binary data that changes over

time (in both content length and the contents itself) is not something that current database systems

handle very well. Current databases do not support reading or writing only parts of a field (streaming

access); they use an all-or-nothing approach, which is a problem when dealing with large files, since the whole

field has to be rewritten every time a change is made. Changes in file size are problematic because database fields

typically have a fixed length reserved for each field. Db4o (an object-oriented database) uses a blob data

type that is really just a file on a hard drive, so the contents still exists as a file on the hard drive[12]. If

the file content could be stored in the database it would solve the problem of dead references in case

the user bypasses the indexing system when accessing files, moving, deleting or renaming them.

Spotlight is deeply integrated into the file system and every file access notifies Spotlight of any changes

so the search index can be updated if needed. WinFS would likely have been well integrated with the

underlying file system in a similar manner to Spotlight, if it had been fully developed. DBFS and Gnome

Storage are/were one-man projects, and how they handle users that modify files manually, around the

supplied user interface, is unknown.

Versioning is another aspect that the database would make easier. The database could store earlier

versions of files in the database, while only displaying the most recent one, unless explicitly told to list

all versions. The storage part of a database file system is currently not a practical solution. Versioning is

closely related to storage, the older versions of files need to be stored somewhere, and with DB storage

of contents not being practical, there is little left of versioning details to be managed by the database.

Searching for a file by filename in a hierarchically structured file system typically takes a long time (when

not using indexed searches like Spotlight or Windows Search) because of the way the file structure is

represented on the disk. For every sub folder searched, the file allocation table of the disk must be read

to get a list of the files in that particular folder, and then each file name checked for a match. By using a

database to store the file information we can keep an index on file names for faster lookup. The

database is also much better suited to store and query the many to many relations between files and

tags, as opposed to the hierarchical structure, in which every file or folder only has one parent, a

relationship that is easier to store and faster to query than a many to many relationship.
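
The difference between scanning every folder and looking a name up in an index can be illustrated with a small sketch. The folder tree and the function names below are hypothetical, and a Python dictionary merely stands in for the index a database would maintain on the file name attribute.

```python
from collections import defaultdict

# Hypothetical folder tree: folder path -> file names stored in that folder.
TREE = {
    "/": ["readme.txt"],
    "/docs": ["report.doc", "notes.txt"],
    "/docs/2010": ["report.doc"],
    "/music": ["song.mp3"],
}


def scan_search(tree, filename):
    """Unindexed search: visit every folder and compare every file name."""
    hits = []
    for folder, names in tree.items():
        for name in names:
            if name == filename:
                hits.append(folder)
    return sorted(hits)


def build_name_index(tree):
    """Build the index once: file name -> folders containing that name."""
    index = defaultdict(list)
    for folder, names in tree.items():
        for name in names:
            index[name].append(folder)
    return index


def indexed_search(index, filename):
    """Indexed search: a single lookup, independent of the number of folders."""
    return sorted(index.get(filename, []))


NAME_INDEX = build_name_index(TREE)
print(indexed_search(NAME_INDEX, "report.doc"))
```

Both functions return the same folders; the indexed variant pays the cost of walking the tree once, when the index is built, rather than on every search.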

The system needs to be compatible with existing applications. One way to do this is to present the tag

based file system as a Virtual Hard Drive (VHD), using something like the Dokan library (“user mode file system for

Windows”) or FUSE for Linux. The VHD should mimic the hierarchical file system in appearance.

Files and folders are presented, but folders behave differently than they would in a hierarchical system.

The same file can appear in multiple folders on the VHD, and the order of the folders is irrelevant,

/docs/financial/taxes will present the exact same files as /financial/docs/taxes. The VHD should support

the classic create, edit, delete, move, and rename operations on files and folders. The operations would

function slightly differently in a tag based system, but the basic operations for reading and writing and

retrieving a file by a path should be the same. This is a hybrid system, a tag based system presented in a

hierarchical manner, only for backwards compatibility. Interfaces built specifically for a tag based system

could present the tags and files as folders and files, but tag oriented operations would also be available, such

as viewing or editing the total list of tags associated with a single file in a single click, something that is

extremely hard to do in a hierarchical disk browser interface.
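
The order independence of folders on the VHD can be sketched by treating every path component as a tag. The code below is a minimal illustration under that assumption; `path_to_tags`, `list_folder` and the sample data are hypothetical and do not reflect the Dokan or FUSE APIs.

```python
# Hypothetical tagged files: file name -> set of tags.
TAGGED_FILES = {
    "taxes_2010.pdf": {"docs", "financial", "taxes"},
    "taxes_2009.pdf": {"docs", "financial", "taxes"},
    "budget.xls": {"docs", "financial"},
}


def path_to_tags(path):
    """Interpret each path component as a tag; the order is discarded."""
    return frozenset(part for part in path.split("/") if part)


def list_folder(files, path):
    """List the files that carry all tags named in the virtual path."""
    wanted = path_to_tags(path)
    return sorted(name for name, tags in files.items() if wanted <= tags)


# Both virtual paths resolve to the same tag set and thus the same files.
print(list_folder(TAGGED_FILES, "/docs/financial/taxes"))
print(list_folder(TAGGED_FILES, "/financial/docs/taxes"))
```

Because the path is reduced to a set, a path that names the same tag twice also resolves to the same listing.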

1.1 Previous research

In this section a few existing software applications dealing with document retrieval will be presented.

Various operating systems (OS) have a variety of applications, some of which provide part of the

features wanted in a more modern file system.

Mac OS X has Spotlight [13] which is part of the OS and installed by default. Spotlight is a desktop search

application, meaning that it only searches items on the current computer. Spotlight runs a daemon

service which extracts metadata and contents (for document style files e.g. doc, pdf, txt) from files and

emails and internet browsing history and saves all of it in a special format suited for searching. Mac

users will often use Spotlight to locate emails, web pages, or files previously created, instead of using the

Mac “Finder” application to navigate to the file and open it.

Searches performed in Spotlight can be saved and turned into Smart Folders [14] that will list all files

that satisfy the search. The user could set up a “smart folder” to list all documents changed in the last

week, to get a list of the documents that are likely to be opened again to continue work on them. This

avoids the trouble of opening the document folder, then opening 2 more sub folders (assuming a

location like \Documents\Reports\2010\report.doc) to locate the document.
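
The idea behind a saved search is that the query is stored, not its results, so the listing is recomputed every time the folder is opened. The sketch below illustrates this for the “changed in the last week” example; the class and field names are assumptions for the illustration and are unrelated to Spotlight's actual implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileInfo:
    name: str
    modified: datetime


@dataclass
class SmartFolder:
    """A saved search: stores the criterion and re-evaluates it on demand."""
    name: str
    max_age: timedelta

    def contents(self, files, now):
        cutoff = now - self.max_age
        return sorted(f.name for f in files if f.modified >= cutoff)


# Hypothetical sample data.
files = [
    FileInfo("report.doc", datetime(2011, 1, 12)),
    FileInfo("old_notes.doc", datetime(2010, 6, 1)),
]
recent = SmartFolder("Changed in the last week", timedelta(days=7))
print(recent.contents(files, now=datetime(2011, 1, 15)))
```

Passing `now` explicitly keeps the example deterministic; a real system would use the current time.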

Another Mac application is Time Machine[15], which is a backup utility that also allows older versions of

files to be opened from within the application through a special interface, displaying all the earlier

versions in an easily scrollable interface. This feature is implemented by copying the folder structure of

the files marked for backup to a separate drive. Every hour a new time stamped folder is created and

any files changed are copied to the backup disk.

WinFS[6] was an intended feature of Windows Vista that never made it into production; it was supposed

to be a new way of finding documents by their relation to one another, like including relations between

email contacts and the files sent/received from this person, or images of people in the contact list.

Google Desktop [16] is a third party application that will index files by name and metadata (Outlook,

Word, Excel, PDF and many more formats supported [17]). Up to 100,000 files are auto-indexed in the

background; after 100,000 files have been indexed the crawler stops the auto-indexing process, but new

files and file edits will still be added to the index. Only the first 10,000 words of a file are indexed, which

could present a problem with large documents.

The Windows 7 search [18] feature is similar to Spotlight. It is also installed by default, but it will not

index every file on the hard drive like Spotlight does; it only searches files in “indexed locations”, which

by default are: Internet Explorer History, Offline Files, Start Menu, and Users (each user’s document

folder is by default found at C:\Users\<Username>\Documents). Additional locations can be added.

There also existed a Windows Desktop Search (WDS) application for Windows XP and Windows Server

2003, but that application is now obsolete. “Windows Search” and “Windows Desktop Search” are two

separate applications, of which WDS is the obsolete one [19].

The Windows 7 search feature of the Explorer application makes it possible to filter searches using

simple file attributes like modification time, file type or size (see Figure 1, Figure 2, Figure 3). More

advanced metadata attributes can also be specified, like “author:” which will even list all the possible

values for author that appear in the indexed location (see Figure 4). The “author:” filter only worked on

.mp3 files when tested, not on the .pdf files of the articles included in the bibliography. The search

feature does not do a good job of presenting which attribute fields can be used to filter; the knowledge

of “author:” as a filter attribute came from a demo video about the search feature.

Extensions can be written for Windows Search to extract metadata from specific file types like .pdf;

these extensions are referred to as iFilters [20]. Foxit Software has a commercially available iFilter [21].

Adobe also has a PDF iFilter available [22]; it is even bundled with Adobe Acrobat and Adobe Reader.

During testing of the Windows search feature, Foxit Reader was already installed and was believed to have

included an iFilter in the install process. This is one of the dangers of indexing software: you never quite

know which formats are indexed and how much of the data is included when it is indexed.

Figure 1 - The 4 most common filter attributes are listed

Figure 2 - 'Type' attribute used, and 'Name' is now presented in its place

Figure 3 - Both dot-extension and full name of file type is presented

Figure 4 - Metadata attribute filtering

The Windows search feature also allows searches to be saved and accessed later, providing much the

same functionality as that of “smart folders” for Mac.

Only 2 of the desktop search engines support boolean search operators (AND, OR, NOT): Spotlight and

Windows Search.

Windows Search supports wildcard search with ‘*’ for a string of characters (“s*n” matching “sun”, or

“scan”) and ‘?’ for a single character (“s?n” will only match “sun”, not “scan”).

Spotlight supports the ‘*’ string wildcard character, but not the ‘?’ single character wildcard.

Google Desktop does not support boolean operators or wildcard search.
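
The two wildcard characters described above for Windows Search can be tried out with Python's standard `fnmatch` module, which happens to use the same ‘*’ and ‘?’ conventions. This is only a stand-in to illustrate the semantics; it says nothing about how Windows Search or Spotlight implement wildcard matching, and the word list is made up.

```python
from fnmatch import fnmatchcase

# Hypothetical list of indexed words.
WORDS = ["sun", "scan", "son", "snow"]


def wildcard_search(words, pattern):
    """Return the words matching a pattern using '*' and '?' wildcards."""
    return [word for word in words if fnmatchcase(word, pattern)]


# '*' matches any string of characters, '?' exactly one character.
print(wildcard_search(WORDS, "s*n"))
print(wildcard_search(WORDS, "s?n"))
```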

The previous sections covered the shortcomings of existing file systems and available technologies to

alleviate these shortcomings. The next section will introduce the objectives of this thesis.

1.2 Objectives

The aim of this thesis is to bring the following functionality to Windows in one package:

1. Search feature with metadata search capability à la Spotlight (Mac)

2. Saved searches à la Smart Folders (Mac)/Virtual Folders (Win)

3. Tags as known from the internet (Gmail.com, flickr.com, del.icio.us)

4. An alternative way of browsing files: browse by tags instead of folders.

All of these new features for locating documents will be implemented using an Object Oriented

database instead of the traditional Relational database. By using an Object Oriented database the

process of writing metadata extractor plugins will be a lot simpler. There is no need to create new

tables/attributes for new metadata types, and no need for the plugin developer to know anything about

the database to write the extractor.

A file can have many tags applied, and a tag can be applied to many files, which would typically result in

a very large many-to-many table linking the two. The object database allows us to store the links locally

with each object, meaning that the total amount of links of the system will not affect the time it takes to

navigate from object to object to object.
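
Storing the links locally with each object can be sketched with two plain Python classes that hold direct references to each other. This is an in-memory analogy only; the class names `File` and `Tag` and the helper functions are assumptions for the illustration, and no object database is involved.

```python
class Tag:
    def __init__(self, name):
        self.name = name
        self.files = []  # direct references to the File objects carrying this tag


class File:
    def __init__(self, name):
        self.name = name
        self.tags = []  # direct references to the Tag objects applied to this file


def apply_tag(file, tag):
    """Link file and tag in both directions; applying a tag twice is a no-op."""
    if tag not in file.tags:
        file.tags.append(tag)
        tag.files.append(file)


def related_files(file):
    """Navigate file -> tag -> file without consulting any global link table."""
    names = set()
    for tag in file.tags:
        for other in tag.files:
            if other is not file:
                names.add(other.name)
    return sorted(names)


report, budget, photo = File("report.doc"), File("budget.xls"), File("beach.jpg")
work, finance = Tag("work"), Tag("finance")
apply_tag(report, work)
apply_tag(report, finance)
apply_tag(budget, finance)
print(related_files(report))
```

The cost of `related_files` depends only on the links held by the objects that are visited, not on the total number of links in the system, which is the property the paragraph above relies on.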

1.3 Thesis structure

Chapter 2 describes the methodology used in the process of this thesis. Chapter 3 is an analysis of the

domain, a file system, and which objects are needed to model a file system with integrated search

feature. Chapter 4 involves the design decisions for the classes involved. Chapter 5 deals with

interaction of classes and database to achieve use case goals. Chapter 6 contains testing information

and results and comparisons of existing technologies and the result of this thesis. Chapter 7 is the

conclusion and contains a list of improvements.

Chapter 2

2 Methodology

In this chapter we will present the different methodological tools used throughout this thesis:

information retrieval, relational and object-oriented databases and software engineering.

2.1 Information Retrieval

This section only briefly describes the search techniques used in this thesis. For a more detailed

introduction to Information Retrieval (IR), Mac OS X’s Search Kit Programming Guide [23] contains a

section describing terms and notions common to IR systems. The guide covers some more advanced

techniques like inverted indexes, phrase or proximity search, stemming/suffix stripping, stopwords,

synonyms, minimum term frequency, min term length (all of these words are in the data dictionary).

The most common method of providing IR capabilities is with an inverted index. For a text document the

inverted index is built by identifying each word that appears in the document and adding a record to the

database that maps that word to the document. For non-text documents like images, videos and sound

files there are small headers associated with the file describing, for example, the size of the image or video in pixels or

its length. Photos have a field recording the time the picture was taken and the model of the camera

used. ID3 is a metadata container format typically associated with .mp3 files, but used by .ogg and

others as well. The ID3 tag contains information about the music piece of the mp3 file, such as artist,

song title, genre, year, etc. For non-text documents, these metadata attributes are extracted by

applications or extensions that can read the header format, the metadata is then added to the inverted

index in the same way as for the text document.

When a search is performed the search term is looked up in the keyword list and the documents that

the keyword maps to are returned.
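
The construction and lookup described above can be sketched as follows. The sample documents, the simple tokenizer and the function names are assumptions for the illustration; a real IR system would add stemming, stop word handling and similar refinements.

```python
import re
from collections import defaultdict

# Hypothetical corpus: document id -> text content or extracted metadata.
DOCUMENTS = {
    "report.txt": "Annual tax report for 2010",
    "notes.txt": "Notes on the tax meeting",
    "song.mp3": "artist Nirvana genre rock year 1991",  # metadata from a header
}


def tokenize(text):
    """Split text into lower-case words."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(documents):
    """Map every word to the set of documents in which it appears."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, term):
    """Look the search term up in the keyword list and return its documents."""
    return sorted(index.get(term.lower(), set()))


INDEX = build_inverted_index(DOCUMENTS)
print(search(INDEX, "tax"))
print(search(INDEX, "rock"))
```

Metadata extracted from non-text files ends up in the same index as the words of text documents, so both kinds of file are found by the same lookup.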

The inverted index implementation in a relational database would most likely consist of a table of

unique keywords, with an auto-incremented PK number used in another many-to-many table, linking

the keyword to the files/documents in which the keyword was encountered during the indexing.
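
A possible relational layout for such an inverted index is sketched below using Python's built-in `sqlite3` module. The table and column names (`keyword`, `file`, `keyword_file`) are assumptions chosen for the example and are not taken from any of the systems discussed.

```python
import sqlite3


def create_schema(conn):
    """Create a keyword table, a file table and the many-to-many link table."""
    conn.executescript("""
        CREATE TABLE keyword (
            id   INTEGER PRIMARY KEY AUTOINCREMENT,
            word TEXT UNIQUE NOT NULL
        );
        CREATE TABLE file (
            id   INTEGER PRIMARY KEY AUTOINCREMENT,
            path TEXT UNIQUE NOT NULL
        );
        CREATE TABLE keyword_file (
            keyword_id INTEGER NOT NULL REFERENCES keyword(id),
            file_id    INTEGER NOT NULL REFERENCES file(id),
            PRIMARY KEY (keyword_id, file_id)
        );
    """)


def index_file(conn, path, words):
    """Record that each of the given words was encountered in the file."""
    conn.execute("INSERT OR IGNORE INTO file (path) VALUES (?)", (path,))
    file_id = conn.execute(
        "SELECT id FROM file WHERE path = ?", (path,)).fetchone()[0]
    for word in set(words):
        conn.execute("INSERT OR IGNORE INTO keyword (word) VALUES (?)", (word,))
        keyword_id = conn.execute(
            "SELECT id FROM keyword WHERE word = ?", (word,)).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO keyword_file VALUES (?, ?)",
            (keyword_id, file_id))


def files_for(conn, word):
    """Resolve a keyword to file paths through the link table."""
    rows = conn.execute(
        "SELECT f.path FROM keyword k "
        "JOIN keyword_file kf ON kf.keyword_id = k.id "
        "JOIN file f ON f.id = kf.file_id "
        "WHERE k.word = ? ORDER BY f.path", (word,))
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
create_schema(conn)
index_file(conn, "/docs/report.txt", ["tax", "report"])
index_file(conn, "/docs/notes.txt", ["tax", "meeting"])
print(files_for(conn, "tax"))
```

Every lookup requires two joins across the link table, which is the translation overhead the object-oriented approach tries to avoid.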

The classical way of locating a file, when it cannot be found through navigation, is to search by filename

and extension type; sometimes date ranges are included too. Desktop search tools like the ones

mentioned earlier add the ability to search on metadata as well.

Search engines don’t always return the things the user expects. Some of the common reasons why

searches fail are [24]:

• Empty searches

o What happens when pressing search with no search term? Nothing? No results? All

results? Error message?

• Wrong scope

o Is it a single directory being searched or is it the entire hard drive?

• Vocabulary mismatch

o A search for “doctor” does not return documents with “physician”.

• Spelling mistakes

o Searching for locks or looks, lose or loose.

• Query requirements not met

o Search engine assumes automatic AND or OR of search terms

• Problems with query syntax

o Is the search engine syntax “penguin NOT linux” or “penguin AND NOT linux”?

• Capitalization and extended characters

o “Unix” or “UNIX”. Special characters: Æ, ø, å, French accents, ß, etc.

• Stop words

o I, and, the, a, an, are extremely common and provide little search value and are

occasionally omitted from the search index to save space.

• Short words

o Some search engines do an automatic starts-with or ends-with search, which for small

words returns a lot of results. 1-3 character words are typically not indexed in these

cases.

o “to be or not to be” is hard to find using word search because every word is common

and very short (stop word).

• Numbers

o Numbers may not be indexed at all

o Low numbers 1-999 may be omitted because they are considered short words

o Negative numbers: “-” is occasionally interpreted as NOT when searching

o “2 car garage” or “two car garage”
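
A few of the failure reasons listed above (capitalization, extended characters, stop words and short words) can be made concrete with a small query normalization sketch. The stop word list is the one given in the list above; the function names and the accent-stripping policy are assumptions, and other policies are equally defensible.

```python
import unicodedata

# Stop words taken from the list above.
STOP_WORDS = {"i", "and", "the", "a", "an"}


def normalize(term):
    """Case-fold the term and strip combining accents ('Café' -> 'cafe')."""
    decomposed = unicodedata.normalize("NFKD", term.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def parse_query(query, min_length=1):
    """Split a query into normalized terms, dropping stop and short words."""
    terms = [normalize(term) for term in query.split()]
    return [t for t in terms if t not in STOP_WORDS and len(t) >= min_length]


print(parse_query("The UNIX and Linux"))
# With a minimum term length of 3, almost nothing is left of this phrase.
print(parse_query("to be or not to be", min_length=3))
print(parse_query(""))  # an empty search yields no terms; the caller decides what to do
```

The same normalization has to be applied when the index is built, otherwise normalized query terms will not match the stored keywords.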

2.2 Storage

The previous section covered some basics of IR systems and how they typically store their information in

a database. In this section we will show the different methodologies that can be used to develop such a

database, with a particular focus on the physical storage of the information.

We need to store the metadata that we record for each file and we also need somewhere to store our

inverted index. We should expect to store information about up to 100k files [25], along with metadata

about each of these files. Some files contain more metadata than others: MP3 files contain a small

header, whereas text document metadata may involve a list of every word encountered.

We have 3 different types of databases to choose from: the classic Relational Database (RDB), the

Object-Relational (ORDB) and the Object Oriented Database (OODB). The Relational Database allows for

very generic querying of the data stored but domain objects have to be translated to a row/column

format. The OODB specializes in storing objects directly in the database and navigating the references of

these objects. Querying is still possible in the OODB, but somewhat limited by the structure of the

objects. The ORDB is a middle way solution that stores objects in row/column format, but the database

handles the conversion of object to row/column format and back so the developer doesn’t have to.

In OODBs data is often accessed through navigation (I need the object at this address).

In RDBs data is always accessed through declaration (I need the object with this id value).

In ORDBs data is accessed as objects, as in the OODB, but the data is stored in a relational db. The ORDB

translates object requests from the application into relational queries that are run against the DB and the

result is converted into objects and returned to the application.

The system proposed in this thesis contains a lot of relations, and a lot of the work is expected to be

done by traversing the relationships of objects and not as much by querying the database. This makes

the OODB the ideal choice. The OODB also allows for faster prototyping and refactoring by eliminating

the need for mapping domain objects to database tables.

The OODB used herein is “db4o” by Versant, an open-sourced, dual licensed, object oriented database

that supports Java and .NET.

“The db4o project was started in 2000 by Chief Architect Carl Rosenberger, and first shipped in 2001. …

The db4o product was commercially launch [sic] in 2004.”[26]

“Versant Corporation (Nasdaq:VSNT), acquired the assets of the db4o object database business in Dec,

2008. Versant, went public in 1996 and is a [sic] industry leader in specialized data management

software…” [26]

The following is a list of features that OODBs in general make available to the developers.

• ACID properties[27].

The ACID (Atomicity, Consistency, Isolation, and Durability) properties found in RDBs are also

found in OODBs.

• Impedance mismatch problem solved

The OODB will store domain objects as domain objects in the database; there is no conversion

to another (row/column) format and thus the memory and CPU cycles normally spent on

conversion are saved when using an OODB.

• Supports refactoring

Because objects don’t have to be translated to another format, adding or removing fields to

domain objects becomes a lot simpler. Objects stored by earlier versions of a class can still be

read and used so there is no need to convert the database format from old to new before the

refactored application works.

The following is a list of features specific to db4o.

• Native query

Database queries can be written in the host programming language (Java, C#)

instead of the more traditional SQL string format. This allows the IDE (Integrated Development

Environment) to perform type checking of the query at compile time, instead of the traditional error

from the SQL server at runtime.

• Lightweight

Small memory footprint (400-600 kB) [28]

• Embedded

See next point.

• No administration required

The db4o object database is typically embedded in the code of the application, which means that

when the application runs, so does the db4o database: no setup required, no tables to

create/delete/change.

• 4 query languages:

o LINQ (Language-Integrated Query, .NET only)

o QBE (Query-By-Example)

o Native Query (NQ)

o SODA (Simple Object Database Access)

• Simple to use API

It only takes one line to save an object to the database.

About 10 functions supplied in the standard API for working with the DB, more are available for

more advanced use.
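
To give a feel for two of the query styles listed above, the following sketch mimics Query-By-Example and native queries over a plain Python list. It is an analogy only and does not use or reproduce the db4o API (which targets Java and .NET); the names `FileRecord`, `query_by_example` and `native_query` are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class FileRecord:
    name: Optional[str] = None
    extension: Optional[str] = None
    size: Optional[int] = None


# Hypothetical stored objects; a list stands in for the object database.
STORE = [
    FileRecord("report", "doc", 120_000),
    FileRecord("thesis", "pdf", 2_400_000),
    FileRecord("song", "mp3", 4_100_000),
]


def query_by_example(store, example):
    """Return objects equal to the example on every field that is set."""
    constraints = {
        f.name: getattr(example, f.name)
        for f in fields(example)
        if getattr(example, f.name) is not None
    }
    return [obj for obj in store
            if all(getattr(obj, key) == value
                   for key, value in constraints.items())]


def native_query(store, predicate):
    """Return objects for which a host-language predicate holds."""
    return [obj for obj in store if predicate(obj)]


# Query-By-Example: only the extension is constrained.
print(query_by_example(STORE, FileRecord(extension="pdf")))
# Native-query style: the condition is ordinary code, not a query string.
print(native_query(STORE, lambda f: f.size > 1_000_000))
```

Query-By-Example is convenient for exact matches, while the predicate form is needed for ranges and other conditions that cannot be expressed as a prototype object.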

2.3 Software Engineering

This section covers the technique used for developing the software using the IR and storage

functionality covered previously.

Software Engineering is the process of analyzing, designing, implementing, testing and maintaining

software. An important concept in software engineering is iterative development, breaking the problem

down to smaller projects, with each small project ending in a small system that was analyzed, designed,

implemented and is fully working. The next iteration will then add new functionality on top of the

previous iteration, incrementally adding to the complexity of the system, analyzing, designing,

implementing and testing new functionality at each iteration. Object Oriented Analysis and Design

(OOAD) is another important concept. Object Oriented Analysis (OOA) covers the process of identifying

the users, objects and their attributes. Object Oriented Design (OOD) deals with the actions that each

object can perform, and how objects interact with other objects. The design part is typically achieved

using Computer-Aided Software Engineering (CASE) tools for generating and maintaining the visual

models of the target system. The models created follow a standardized notation, UML (Unified Modeling

Language), which specifies different aspects of the system. Class diagrams are used to describe the

structure of the objects in the system, their attributes and functions for interacting with them and the

relationships between classes. Use case diagrams describe which actions are available to which users.

Text based use cases describe the steps of each action available and how it affects the overall system.

Sequence diagrams are generated on a per use case basis and show the interaction of classes to achieve

the goal of the use case. There are several other UML diagram types, but the ones mentioned are the

ones used in this thesis. This section is based on [29,30,31].

Implementation involves turning the diagrams (and use cases) into code that fulfills the goals of the use

cases, and follows the sequence of class/object interactions specified in the sequence diagrams.

Maintenance deals with bug fixing and adding new features.

There are different methods of testing but the most common are probably unit tests and black box tests.

Unit testing involves creating little blocks of code that set up part of the system, and then perform an

action and check that individual attributes of an object have changed as expected. Typically functions

are tested with extreme values (null, empty string, unique constraint testing), trying to provoke an error.

Black box testing tests the entire system as a single unit. It works like a call to an unknown

function: we feed the system an input, “A”, we know what the output should be, and we compare the

expected output with the actual output. There is no knowledge of how the result is computed or the

values of objects.
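
A unit test in the sense described above might look like the following sketch, written with Python's standard `unittest` module. The `TagStore` class is a hypothetical stand-in for a small component of the system; the tests set it up, perform one action and check the resulting state, including the extreme values mentioned (null, empty string, uniqueness).

```python
import unittest


class TagStore:
    """Hypothetical component under test: a set of unique tag names."""

    def __init__(self):
        self._tags = set()

    def add(self, name):
        if not name:
            raise ValueError("tag name must not be empty")
        if name in self._tags:
            raise ValueError("tag already exists")
        self._tags.add(name)

    def count(self):
        return len(self._tags)


class TagStoreTest(unittest.TestCase):
    def setUp(self):
        # Set up the part of the system needed for each test.
        self.store = TagStore()

    def test_add_changes_count(self):
        self.store.add("financial")
        self.assertEqual(self.store.count(), 1)

    def test_none_is_rejected(self):
        with self.assertRaises(ValueError):
            self.store.add(None)

    def test_empty_string_is_rejected(self):
        with self.assertRaises(ValueError):
            self.store.add("")

    def test_duplicate_is_rejected(self):
        self.store.add("financial")
        with self.assertRaises(ValueError):
            self.store.add("financial")
```

Saved in a file, the tests can be run with `python -m unittest`. A black box test would instead drive the complete system through its public interface and compare only the final output with the expected one.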

The 6 best practices described by the Rational Unified Process (RUP) are [29]:

1. Develop iteratively

Split the overall project into minor pieces, repeating the analysis -> design -> implementation ->

testing steps in each iteration

2. Manage requirements

Always remember what the users want

3. Use components

Break the overall problem into smaller parts, concepts, items, actors to support unit testing and

code reuse

4. Model visually

Make UML diagrams of the system, class diagrams, use case diagrams, sequence diagrams.

5. Verify quality

Test the system and possibly individual components to ensure proper functionality

6. Control changes

Development in teams needs a way to control who changes what and ensure that the changes

did not break anything.

Point one ensures that we develop small pieces that work before moving on to the next level. It is easier

to find and fix problems in smaller systems, and once the smaller system is tested and works as

expected, the next iteration is built on top of that.

Point two in our case involves looking at previous research on topics like the hierarchical file system,

tagging, faceted search, metadata file system, searching in file systems and using that as a guide for user

requirements. Since we are developing a new and experimental approach to file system browsing,

searching and filtering, the research takes the place of the user requirements.

Point three makes unit testing and code reuse a lot simpler. The OODB stores objects directly, and

storing and working with smaller objects is faster and simpler.

Point four makes it easier to develop the system further and to gain insight into the relations and

function calls between objects.

Point five helps make sure that things function as expected. For example, when code is changed because a performance problem was encountered, or when new code is added, it needs to be tested to ensure it works as intended.

Point six is mainly concerned with team projects where multiple people work on the same code base (in different code files). Source control software tracks code checkouts and checkins, and ideally unit tests are run regularly to ensure that nobody broke a part of the system with a code update.

Chapter 3

3 Analysis

3.1 Introduction

This section covers the software engineering aspects of the thesis problem, more specifically the analysis part of the problem. It identifies the actors, objects and interactions involved in a hierarchical file system and how these translate to a tag based and metadata aware file system. Possible problems of translating between a hierarchical and a tag based system are also covered.

The use of an iterative development cycle is reflected in this section. Sub sections sometimes refer to

some parts of the system as already implemented, such as the 3.7 Search section referring to already

implemented tag and file objects.

The file system being developed is called MetaFS (Metadata File System) because of its extended use of

metadata to search for and navigate files. Tags are considered as metadata as each individual tag gives a

clue about the contents of the file. Folders can be considered to be metadata as well, but they are very

inflexible compared to tags. To find a file based on folder metadata you need to know all the folders and

their exact order. To find a file based on tag metadata you only need to know one of the tags and your

file will be part of the result presented.

“Hierarchical file systems are dead” is the title of a research paper [32] and a set of slides [33] by Margo Seltzer and Nicholas Murphy. They illustrate the problem with the existing hierarchical classification of files using, in the slides, an example with elements of different color (red, yellow, blue), shape (square, circle, triangle) and size (small, medium, large). The following discussion of how to order these elements reproduces the problem as presented in the slides.

Imagine the elements in Figure 5: 27 elements of different shapes, sizes and colors. How would you sort these elements?

By color (Figure 7), size (Figure 8) or shape (Figure 9)?

It depends on what you’re going to use the elements for. If you were looking for elements matching the DTU logo in color, a red element would be the way to go, suggesting a sort by color. Alternatively a diamond shape might be preferable, in which case a sort by shape would be better (assuming more than the 3 shapes were available). And because the logo is small, you need a small element.

Figure 5 - Unsorted elements

Figure 6 - DTU logo

Figure 7 - Sorted by color

Figure 8 - Sorted by size

Figure 9 - Sorted by Shape

Storing these elements in a hierarchical system you are forced to select a primary attribute, such as

color, resulting in 3 groups at the root, red, green and blue. The second attribute then sub-divides each

color by shape or size, and the end result is a structure like this:

• Red/Triangle/Small

• Blue/Square/Medium

or, with perhaps another attribute order:

• Large/Circle/Green

With a structure like Red/Triangle/Small, to find a small red element to match the DTU logo we would

need to look in three places (or more depending on how many shapes there are):

• Red/Triangle/Small

• Red/Square/Small

• Red/Circle/Small

If we don’t care about color or shape but only need a small element, we need to check 9 locations,

because the size attribute is considered last.
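
The lookup cost described above can be made concrete with a small sketch. The Python below is illustrative only and not part of the MetaFS: the function names `folders_to_check` and `find_by_tags` are hypothetical, and the color names are assumed from the folder example (red, green, blue). It enumerates the Color/Shape/Size folders that must be visited when some attributes are unknown, and contrasts this with a single query over tagged elements.

```python
from itertools import product

# Attribute values; the color names follow the folder example above (assumption).
COLORS = ["Red", "Green", "Blue"]
SHAPES = ["Triangle", "Square", "Circle"]
SIZES = ["Small", "Medium", "Large"]


def folders_to_check(color=None, shape=None, size=None):
    """Return every Color/Shape/Size folder that could hold a matching element."""
    colors = [color] if color else COLORS
    shapes = [shape] if shape else SHAPES
    sizes = [size] if size else SIZES
    return ["/".join(parts) for parts in product(colors, shapes, sizes)]


# Tag based alternative: each element simply carries a set of tags.
ELEMENTS = [frozenset(parts) for parts in product(COLORS, SHAPES, SIZES)]


def find_by_tags(*wanted):
    """Return the elements that carry all of the wanted tags, in any order."""
    wanted = set(wanted)
    return [element for element in ELEMENTS if wanted <= element]


print(folders_to_check(color="Red", size="Small"))  # the three Red/*/Small folders
print(len(folders_to_check(size="Small")))          # 9 folders to visit
print(len(find_by_tags("Small")))                   # 9 elements from one query
```

In the hierarchy the number of folders to visit depends on where the known attribute sits in the path; with tags the order of attributes plays no role.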

The slides also mention that when visiting websites, typically only the www.someplace.com part is entered; auto-complete, bookmarks or searches take us to the information needed, and the hierarchical structure available on web sites is rarely used.

A golden rule of application design that also applies to websites is that any screen/page should be

reachable in 3 clicks, any deeper and the structure becomes too complicated to navigate.

The features for accessing information on the internet are slowly making their way to the desktop to

help manage personal information. Indexed searches on file content are provided by Windows Search,

Mac OS X Spotlight and Google Desktop. Tagging is another feature that originated on the web and is

slowly making an appearance on Desktop computers for photo album management in the form of

Google’s Picasa [34].

In the article by Karl Voit et al [35] some numbers from Bergman et al [36] are presented, describing the

percentage of different methods of file retrieval. 56-68% of users preferred navigation through a folder

hierarchy vs. 4-15% for search.

These numbers support the implementation of the VHD: users are used to accessing files through navigation, and the number of files is only now reaching a stage where the hierarchical system is becoming harder to use for document classification.

Karl Voit et al. [35] propose eight “fundamental requirements for future PIM tools” (Personal

Information Management).

1. Be Compatible with Current User Habits

Existing software like word processors and spreadsheet applications should continue to

function with the introduction of the new PIM.

2. Minimal Interference

The user should not have to learn to use the PIM, it should be simple to install and to

start using, advanced settings hidden from the user.

3. Support Multiple Contexts

It should be possible to find information in multiple locations, e.g. a specific track can be

found whether the user starts with ‘music’ or ‘rock’ as the initial tag.

4. Support Browsing

Browsing is still the preferred method for locating documents, and presents the user

with a set of choices when looking for files.

5. No Unnecessary Limitations

The system should support a large amount of files.

6. Transparency

The user should know where their files are and what happens to them.

7. Provide for Expiry Dates

Automatic archiving of files once they reach the expiration date, set by the user or

automatically.

8. Add Metadata While Storing

Allow the user to explicitly enter metadata when a file is saved, to help in retrieval

through browsing at a later time.

The MetaFS will attempt to cover 7 of the 8 requirements; no. 7 (expiry dates) is the one not included.

What the MetaFS is meant to accomplish is the incorporation of some existing web features into the

desktop environment.

3.2 Dokan

Since we are looking to present files and folders in a new way and at the same time be backwards compatible with existing applications, there seems to be only one option: a Virtual Hard Drive (VHD). For Linux there is FUSE, which has been used a fair bit [1,32,37]. For Windows, finding a VHD library is a lot harder, and the only one that could be found was Dokan [38], a “user mode file system for Windows”. Dokan allows a VHD to be mounted on a Windows machine, and any file system calls sent to this VHD are directed to the user application, where they can be processed.

Being somewhat familiar with the Dokan library helps a lot in identifying what objects are needed and

which interactions are available in a file system. A very simple hierarchical file system was prototyped

with Dokan to test the capabilities of the library and to gain some knowledge of the Dokan object

structure and the interfaces that need to be implemented.

3.3 Actors

Unlike most modern file systems, Dokan does not support ACLs (Access Control Lists), which means that all users have full access to the files on the VHD. This is perfectly fine for our needs, as security tends to be a rather big concept to implement into any system. Our focus is on a single user, and every action is based on the events of this user. The user can create, move, rename, etc. files and folders (which we treat as tags) on the VHD. Because of this single user approach we don’t have any use case diagrams, since there is just one user who can initiate all of the use cases.

3.4 Objects

Since it is a file system we are modeling there are some basic object structures that are unavoidable.

Ideally files will keep working like files do; only folders on the VHD will behave differently.

We need a file object to represent files. The VHD is presented through a hierarchical browser (Windows

Explorer or Mac Finder) so we should at the very least have the attributes available to files in a

hierarchical system. These attributes are:

• Name

• Size

• Creation Date

• Modify Date

• Access Date

Besides these basic attributes we also need:

• Tag list – a list of tags applied to the file

• Path – a link to the actual file location as we only store a reference to it and not the data itself.

• Lowercase Name – later iterations have shown the need for this for fast DB lookup.

• Metadata list – a list of metadata objects with metadata information extracted by extensions.

That covers the attributes of file objects. Hierarchical systems use folders for organization; we will use

tags instead, which on the VHD will look like folders, but their behavior will be different. Once again we

copy the basic attributes, from the hierarchical folder to our tag object. These attributes are:

• Name

• Creation Date

The more specialized attributes, for our system:

• File list – a list of the files this tag is applied to

• Lowercase name – later iterations have shown the need for this for fast DB lookup.

These two objects are all that is needed to support a tag based file system. Later iterations add a metadata interface, specifying which methods the system expects to use for extracting metadata from files. Metadata information is based on information extracted by extensions to the MetaFS. It is therefore impossible, and not needed, to specify what fields the metadata objects contain; it is entirely up to the extension which fields are extracted. We use the metadata interface to create metadata objects and link the file objects to them. By using an OODB we can store a metadata object directly in the DB without having any knowledge of its fields. The OODB takes care of extracting information about fields and saving them in the database by using reflection, an advanced feature of some modern programming languages like C# and Java.

While the tag and file objects may be enough to represent a tag based file system, we are building it on

top of an existing file system (NTFS) and we need a way for the user to specify which files are to be part

of the MetaFS, to be presented on the VHD and have their metadata stored. This is done by allowing the

user to select a folder, and all the files in that folder and sub-folders are added to the MetaFS database

and their metadata is extracted and saved in the DB as well. For this purpose we need an object that can

represent the folder selected by the user, an indexed location object with the following attributes:

• Path – the path on a normal hard drive where the files can be found.

• File object list – a list of the MetaFS file objects that this location is responsible for adding.

The file object list is kept so that when the user decides to remove a location, the MetaFS file objects from that location can easily be removed from the DB.
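
The three objects described in this section can be sketched as plain classes. This is a minimal Python illustration under stated assumptions, not the thesis implementation: the class names, field names and the helper `tag_file` are hypothetical, and the metadata list simply holds whatever objects an extension produces.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(eq=False)
class Tag:
    name: str
    created: datetime
    files: list = field(default_factory=list, repr=False)  # files carrying this tag
    lowercase_name: str = field(init=False)                 # kept for fast lookup

    def __post_init__(self):
        self.lowercase_name = self.name.lower()


@dataclass(eq=False)
class File:
    name: str
    size: int
    created: datetime
    modified: datetime
    accessed: datetime
    path: str                                                  # reference to the real file
    tags: list = field(default_factory=list, repr=False)
    metadata: list = field(default_factory=list, repr=False)  # filled by extensions
    lowercase_name: str = field(init=False)

    def __post_init__(self):
        self.lowercase_name = self.name.lower()


@dataclass(eq=False)
class IndexedLocation:
    path: str                                  # folder selected by the user
    files: list = field(default_factory=list)  # file objects added from this folder


def tag_file(file, tag):
    """Link a file and a tag in both directions (many-to-many)."""
    if tag not in file.tags:
        file.tags.append(tag)
        tag.files.append(file)
```

Keeping the link on both sides means that either "all files with this tag" or "all tags on this file" can be read without a search, at the cost of having to update two lists on every change.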

To be able to search files, tags and metadata, keyword objects are needed, but their design is heavily influenced by the database and the structure of the remaining parts of the system and is covered in detail in the Design chapter.

Simply put, the MetaFS is a set of files and a set of tags with a lot of connections between these two kinds of objects: one tag can point to multiple files and one file can be assigned multiple tags. Figure 10 shows several tags that a user might create to allow for searches like comedy movies with Will Smith, sci-fi movies or perhaps just anything with Will Smith.

Figure 10 - Tag and file connections

The following quotes about the Google Desktop application give an idea of how many files can be

expected to be indexed.

“For most documents, Google Desktop will search about the first 10,000 words.” [39].

“Google Desktop only indexes 100,000 files per drive during the initial indexing period. If you have more

than 100,000 files in a particular drive, Google Desktop won't index all of them during this initial period.

However, Google Desktop adds files to your index during real-time indexing when you move or open

them.”[25].

“However, if you're searching for a word within the file, please note that Google Desktop searches only

about the first 75,000 characters.”[25].

The next few sections discuss possible problem areas that need to be considered in the development

process of the system.

3.5 Unique name constraint

Because of the way Windows treats filenames (case-insensitive), special considerations have to be made

with regards to names of files and tags. Furthermore the file system namespace is shared between files

and folders, so once a file exists with name “myname” there cannot be a subfolder with the same name

in the same folder as the file.

Since our folders are actually tags and not folders nested within each other in a hierarchical structure

every tag needs a unique name as it can appear in any tag-folder. The root of the file system should as a

basis present all the available tags, but because the amount of tags will keep growing throughout the

lifetime of the system, the root level will become cluttered with less frequently used tags. Because of

this the user should be able to set a limit for how many files a tag must reference before being shown at

root level. Because every tag can potentially appear at root level tag names must be unique to prevent

name collisions of tags.

This same problem applies to tag vs. file name collisions. While we don’t expect to list files in the root, it

is still very likely that a tag-folder will end up with a tag and a file with the same name, as files and tags

can potentially appear in any tag folder depending on what relations they have.

File vs. file name collisions are nearly unavoidable, just consider cover.jpg, readme.txt, report.doc files.

This need for uniqueness means that the easiest solution is to have tag names be unique, file names be

unique and file and tag names share the same namespace as they do in the NTFS or FAT32 file systems.

There is however one problem: db4o does not support case insensitive querying without writing a delegate function for a case insensitive match. The db4o SODA query language only supports case sensitive matching of strings, so we would need to write a custom delegate match function that converts both the filename we are looking for and the filename of the db object to lowercase before comparing the two. Delegate functions increase the query time a lot when the query optimizer fails to translate them into a SODA expression. In this case it is better to change the objects a bit, to store names in lowercase, and manually write a SODA expression.
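
The trade-off can be illustrated without any database. The Python sketch below is an analogy only and does not use the db4o API: `scan_find` plays the role of a delegate that is evaluated against every stored object, while the hypothetical `NameIndex` class stores a lowercase key once so that a lookup becomes an exact, case sensitive match on an indexed value.

```python
def scan_find(names, wanted):
    """Delegate-style match: lowercases every stored name on every query."""
    wanted = wanted.lower()
    return [name for name in names if name.lower() == wanted]


class NameIndex:
    """Stores each name under its lowercase form, computed once on insert."""

    def __init__(self):
        self._by_lower = {}  # lowercase name -> (display name, object)

    def add(self, name, obj):
        key = name.lower()
        if key in self._by_lower:
            raise KeyError(f"name already in use: {name!r}")
        self._by_lower[key] = (name, obj)

    def find(self, name):
        """Exact match on the stored lowercase key; returns None if absent."""
        return self._by_lower.get(name.lower())


index = NameIndex()
index.add("Metallica", object())
print(index.find("METALLICA")[0])  # Metallica
```

The extra stored attribute costs some space, but the lowercase conversion is done once per object instead of once per object per query.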

The uniqueness constraint on files and tags has an advantage when it comes to implementation; it

allows the use of dictionaries which can save the loading of tag objects when only the tag name is

needed.

Files are a bit trickier to handle as their name is not unique in a hierarchical file system. It is very likely to

have two or more files called cover.jpg. This lack of uniqueness creates a problem when presenting all

the files with a given tag, where multiple files have the same name. Imagine the user has selected

‘music’ as a tag; this would include all the cover.jpg files from each album (..\Metallica\Master of

Puppets\cover.jpg, ..\Metallica\Reload\cover.jpg, ..\Metallica\S&M\cover.jpg).

There are 2 ways to deal with this name clash:

1. Every file gets a unique name (cover.jpg, cover(1).jpg, cover(2).jpg, …).

When new files are added to the MetaFS, whether through the VHD interface, by adding a new indexed location or by a new file appearing in an already indexed location, the name is checked against existing names. If the name is already used, the MetaFS assigns a unique name by adding a counter, “(n)”, to the filename: cover.jpg becomes cover(1).jpg.

2. Duplicate file names are dealt with as they are encountered.

New files are added to the MetaFS directly with no consideration to their name. When the user

then views a subset of files, limited by the tag(s) chosen, we have to check the files for any

occurrences of duplicate file names, and temporarily assign unique names to the duplicates. This

temporary name is assigned as above, filename + “(n)”.
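
The counter scheme used by both options can be sketched as follows. This is a minimal Python illustration, assuming the counter is placed before the file extension as in the cover(1).jpg example and that names are compared case-insensitively; the function name `unique_name` and the set of used names are hypothetical.

```python
import os


def unique_name(name, used_lower):
    """Return name, or name with the first free "(n)" counter, and reserve it.

    used_lower is a set holding the lowercase form of every name in use.
    """
    candidate = name
    if candidate.lower() in used_lower:
        stem, ext = os.path.splitext(name)
        n = 1
        while f"{stem}({n}){ext}".lower() in used_lower:
            n += 1
        candidate = f"{stem}({n}){ext}"
    used_lower.add(candidate.lower())
    return candidate


used = set()
print(unique_name("cover.jpg", used))  # cover.jpg
print(unique_name("cover.jpg", used))  # cover(1).jpg
print(unique_name("Cover.JPG", used))  # Cover(2).JPG
```

With option 1 the set of used names covers the whole MetaFS and the result is stored; with option 2 the same function would be applied to a fresh set for each view, which is why the counters can shift between views.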

Option 1 has the problem that when viewing files, a lot of them may end with a counter, (n), which may leave the user wondering where files 1 to (n-1) are. E.g. selecting the tags “Music”, “Metallica”, “S&M” would present the file “cover(2).jpg”, which might make the user wonder where “cover.jpg” and “cover(1).jpg” are.

Option 2 has the problem that when we temporarily rename cover.jpg to cover(2).jpg because there happen to be 5 identically named files, we need to make sure that when the user opens cover(2).jpg they actually get the right file. Imagine that the user opens a view for ‘music’ and gets a list of files, then decides to go do something else for a bit and comes back to this view later. All sorts of things could have happened since the view was opened: files being deleted, renamed or moved, all of which affect the temporary naming. The file previously named cover(1).jpg could be gone, so cover(2).jpg would now be cover(1).jpg and cover(3).jpg would be cover(2).jpg. Trying to open cover(2).jpg at this point would result in the wrong file being opened.

It should in theory be possible to keep a bunch of tables in memory, mapping temporary names to the

correct file for each view presented. This approach is very speculative and it seems to be a very

complicated fix to a problem that could also be replicated on a hierarchical file system under special

circumstances.

Example: The user opens Explorer with a view of the folder \documents\finances and sees “taxes.doc”.

The user may find this name to be lacking in description and open it in Word, edit some things and save

it under a new name, “taxes 2010.doc”. The user may then download an attachment “taxes.doc” from

an email and save it at \documents\finances. At this point, if the user was to go back to Explorer, the

“taxes 2010.doc” would not be visible (until a reload is performed), only “taxes.doc” would appear, and

opening that with a double click would open the file from the email.

This is a pretty specific sequence of events, and not something we can protect against, and the user

should be aware that they performed a rename operation and will get a different file now than when

the view was opened. The MetaFS increases the file renaming problem to include all files with the same

name, not just the one file that was renamed. If multiple “taxes.doc” files are indexed by the MetaFS,

and the user has gotten used to opening “taxes (5).doc”, performing a rename/delete on “taxes (3).doc”

will most likely cause “taxes (5).doc” to become “taxes (4).doc”.

The tracking of temporary names seems quite messy and complex, and the fact that files can change

names (the counter changes value) could easily become an annoying ‘feature’ of the MetaFS.

The upside of option 2 is that when the user is looking at files tagged ‘music’ the cover.jpg, cover(1).jpg,

… files are shown, but once tags ‘Metallica’ and ‘Reload’ are chosen, cover.jpg is now the only file with

that name and gets to keep that name, instead of being presented as cover(1).jpg.

Option 1 is definitely the way to go: even when using option 2 we still get an (n) counter on files at times, so it is more intuitive to have the filenames stay constant instead of having them change as the user does a drilldown, selecting more and more tags.

The following problem is somewhat related to the above problem. When viewing a folder, folders and files share the same namespace, meaning that if a folder is called ‘readme’ there cannot be a file called ‘readme’ and the other way around. It is possible to stick with the unique naming scheme from above (option 1), including folders in the check for uniqueness, but this does present one slightly annoying problem. Let’s say we already have the files readme to readme(10); trying to add the tag ‘readme’ would then cause it to be called readme(11). Numbers on tags are not something we want: they can be confusing to the user, and there is already a plan for using numbers in tags to represent how many files carry the tag (optional, via settings).

Tags take precedence for names, which means that they don’t get “(#)” counters; only files do. In the event that a new folder is created using a name that is already used by a file, we have to rename the file (adding a counter) so the tag can get the name instead. Any “last opened files” histories for that file will be broken.
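
The precedence rule can be sketched as below. This is a hedged Python illustration, not the MetaFS code: the shared namespace is modelled as two hypothetical dictionaries keyed by lowercase name, and `create_tag` and `free_file_name` are invented helper names.

```python
import os


def free_file_name(name, tags, files):
    """Return name with the first "(n)" counter not used by any tag or file."""
    stem, ext = os.path.splitext(name)
    n = 1
    while True:
        candidate = f"{stem}({n}){ext}"
        if candidate.lower() not in tags and candidate.lower() not in files:
            return candidate
        n += 1


def create_tag(tag_name, tags, files):
    """Create a tag; a file holding the name is renamed to make room.

    tags and files map lowercase names to display names.
    Returns (old file name, new file name) if a file was renamed, else None.
    """
    key = tag_name.lower()
    if key in tags:
        raise ValueError(f"tag already exists: {tag_name!r}")
    renamed = None
    if key in files:
        old_name = files.pop(key)
        new_name = free_file_name(old_name, tags, files)
        files[new_name.lower()] = new_name
        renamed = (old_name, new_name)
    tags[key] = tag_name
    return renamed
```

Returning the rename makes the side effect explicit, so a caller can, for instance, notify the user that the file is now reachable under a different name.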

Some file systems are case sensitive (Ext3, NTFS), others are case insensitive (FAT32). We are making a

file system for Windows, so following the Windows style of case insensitivity seems the logical choice.

Supposedly NTFS is case sensitive, but the Win32 environment is not (64-bit is not mentioned) [40].

One could argue that since we tell the application which folders exist, the user will have to choose from the tag folders we present, which are in correct case, and thus case mismatching should not occur when we extract tag names from the path while browsing the VHD. This is an erroneous assumption, since it is entirely possible for the user to bookmark the path “C:\Music\Metallica\S&M” to get to that tag folder quickly. If one of the tags is renamed to fix a capitalization error, or just to emphasize a particular tag, say renaming “Metallica” to “METALLICA”, the bookmark would now be invalid under case sensitive matching.

There are a few ways to deal with this:

• Store a lowercase name used for index and querying (extra storage required)

• Programmatically keep an eye on when this problem occurs and fix it in code (extra coding

complexity)

• Settle for using delegates for querying on name (extra db query time)

• First do a query using built in functionality on the (indexed) name attribute; if that fails, do a

delegate query using lowercase comparison (double query from time to time)

There is however one fairly big problem with giving every file a unique name: anything linking to files (LaTeX documents or HTML files) or loading data by filename (e.g. games, applications) will be unable to find the files, because they may have been renamed on the VHD.

Sadly this problem is not easy to fix. It would require file requests from certain applications to be handled in a special manner to ensure file links get the contents of the originally linked file and not one with the same name. An alternative approach could involve reading the LaTeX or HTML document when adding it to the DB, and recording the files linked to by the document in the DB. Depending on the order of file requests from LaTeX it might be possible to return the correct file content based on the file relationship data of the DB.

The problem with broken file links is ignored in this thesis, as it is limited to a few file types; the majority of file types are unaffected by this (.mp3, .doc, .pdf).

3.6 Metadata extraction

One of the most obvious file types to extract metadata from is the mp3 file. Mp3 files are common today and most people have quite a few of them on their hard drive. Typically they are just put in a Music folder somewhere: single files are dumped in the “music root” (the Music folder in this case), while albums typically get folders with artist and album name if the user took a little time to arrange the files. Over time the music folder just grows and grows, and single files may get their own folder with just the artist name. Some bands/artists start their name with “The”, e.g. The Cranberries, where “The” sometimes makes it into the name and other times not, meaning you have to look under both C and T for their music. While the structure of the music folder may be messy, a large number of the files will likely contain ID3 tags, containing metadata about which album the file belongs to, who the artist is, year of release, genre of music, etc. The ID3 tags are filled out by the ripping software by accessing huge databases that typically contain all the info you would want in your ID3 tag, or they are filled in when the files are purchased from various online stores. Either way, most mp3 files have ID3 tags, as they are typically added automatically when the mp3 file is created. These ID3 tags hold a lot more information than can be extracted from the path of the file, and it is this information that will help in locating exactly the file(s) sought after.

The data is already there, all we need to do is read it, link it to the corresponding file object and save the

metadata to the database for fast searching. For this purpose the TagLib# (TagLib Sharp) library will be

used [41].

3.7 Search

3.7.1 Live search.

The MetaFS should support live search, meaning that whenever a document is written to the VHD, metadata is extracted and the inverted index is updated, with values removed or added depending on the change to the metadata. This way searches always return currently relevant results, and not results which contained the information 10 minutes ago, before they were changed.

This feature is a bit tricky to implement. Building the inverted index is not that hard: every file object in the database has its real file examined by any metadata extractor that is available for its particular file type, and every word that appears in the extracted metadata is added to the inverted index with a link to the file. The problem arises when we need to update the inverted index as files are, e.g., renamed, have tags added or removed, or have their content changed, causing a change in metadata. If a file has been tagged by the user with the word “rock”, and the ID3.genre is also “rock”, untagging the file must still leave a reference in the inverted index from “rock” to our file. Should the ID3.genre field later be updated (by using the auto-tag feature) from “rock” to something like “pop”, we need to remove the file reference for the word “rock”. The difficult bit in this example is to determine when the last occurrence of a keyword for a given file is removed.

One way would be to just keep an index from words to file objects and, when extracting metadata words, ensure that each word has a reference to the file. When the file is updated we extract the new metadata and compare the old metadata dictionary with the new metadata dictionary. If a word is in the old metadata dictionary but not in the new one, we need to check whether the word appears in a metadata dictionary for any of the other types of metadata for the file (e.g. tags, filename split, etc.) to determine if other pieces of metadata contain the word we are about to remove from the index. This approach is potentially expensive with respect to the time used by each extractor and the reading of file data from disk, which could be several MB (X metadata extractors on a file of Y MB), since all the metadata extractors for the file have to be run to check if the keyword still appears in another piece of metadata for the file. File writes will have to run all the metadata extractors, but file renames and file moves (tag changes), which do not affect file content, also end up needing to run all the metadata extractors to check if a keyword needs to be removed.

We have no knowledge of what kind of metadata will be extracted by extensions. Image or video analysis could spend a couple of seconds analyzing content for metadata. This time is pure waste in some cases, like file renaming or moving, and it would be better to be able to update the inverted index on a per extractor basis, keeping track of which words were extracted by which extension, e.g. which words by MP3Extension and which by TagExtension. To accomplish this we need to be able to link keywords and extensions, so we can just remove the keyword link for a particular extension in case the new metadata no longer contains that keyword. One way to do this is with an object with 3 attributes: keyword, extension class name and file object. This solution has a redundancy problem, repeating keywords, and needs to be refined, which is described in the following section.
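
The per extractor update can be sketched with the three attribute link described above. The Python below is a minimal illustration under stated assumptions, not the MetaFS implementation: the class name `KeywordLinks` is hypothetical, files are represented by plain identifiers, and each link is a (keyword, extension class name, file) tuple kept in a set.

```python
class KeywordLinks:
    """Inverted index made of (keyword, extension name, file id) links."""

    def __init__(self):
        self._links = set()

    def update(self, extension, file_id, keywords):
        """Replace the keywords one extension reported for one file.

        Only links owned by this extension are touched, so no other
        extractor has to be re-run to decide whether a word may go.
        """
        new = {(word.lower(), extension, file_id) for word in keywords}
        old = {link for link in self._links
               if link[1] == extension and link[2] == file_id}
        self._links -= old - new  # words this extension no longer reports
        self._links |= new - old  # words this extension reports for the first time

    def search(self, word):
        """Return the ids of all files linked to the word by any extension."""
        word = word.lower()
        return {file_id for keyword, _, file_id in self._links if keyword == word}


index = KeywordLinks()
index.update("TagExtension", "song.mp3", ["rock"])
index.update("MP3Extension", "song.mp3", ["rock", "Metallica"])
index.update("TagExtension", "song.mp3", [])                    # user removes the tag
print(index.search("rock"))                                      # {'song.mp3'}
index.update("MP3Extension", "song.mp3", ["pop", "Metallica"])  # genre changed
print(index.search("rock"))                                      # set()
```

The file stays findable under “rock” until the last extension stops reporting the word, which is exactly the case that is hard to detect with a plain word to file index. The repeated keyword in every tuple is the redundancy mentioned above.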

For a metadata extractor, parsing information from a changed file and reporting changes to the inverted index may take some time, depending on the size of the file and the amount of calculation that needs to be performed on it. An image analysis extractor could analyze photos in an attempt to identify photographs of people, nature scenes, houses/structures, etc. This analysis could very likely take a few seconds or longer depending on file size, image size, accuracy in recognition, algorithm, etc. Because of this the live search requirement has to be relaxed a bit and changed to a functionality that works as fast as possible without interfering with the user.

Every user action that causes a change to a file or tag object can cause the inverted index to become outdated. This happens if a piece of metadata is added, removed or changed (same effect as remove & add). Because the metadata extractors are designed to be customizable, the metadata they extract can come from any part of the file, be it file content, file name, file size, file date, etc. (or even tag references). It is necessary to identify every action that can potentially cause a change of metadata and, in case of a change, update the inverted index to ensure accurate search results. The list of available actions primarily consists of those described by the DokanOperations interface, and from that the following operations are identified as candidates that may require inverted index updating.

• CreateFile

• DeleteDirectory

• DeleteFile

• MoveFile

    ◦ File Rename

    ◦ File Move (changing tags associated with file)

    ◦ Tag Rename

• SetFileAttributes (read-only, hidden, archive, encrypt, compress, …)

• SetFileTime

• WriteFile

The adding and removing of indexed locations also causes metadata to be added to or removed from the system; these operations work very similarly to CreateFile and DeleteFile with regards to inverted index updating.

• Add Indexed Location

• Remove Indexed Location

During most (all but DeleteFile and Remove Indexed Location) of these operations it is necessary to run

metadata extraction on the file affected and compare the new metadata with the old and update the

inverted index accordingly, adding and removing entries depending on the change in metadata.

In the case of DeleteFile and Remove Indexed Location the existing metadata is checked and for every

word the inverted index entry is removed.

3.7.2 Structure of search objects.

The first draft for an inverted index implemented in the OODB mimicked the many-to-many table, and

not so much the keyword table. The idea was that we would have “keyword” objects containing the

word (“rock”), the extension class name (“MP3Extension”) and the fieldname (“genre”) in which the

keyword was encountered during indexing. This would cause problems when a word like “rock” would

appear in multiple keyword objects, rock|TagExtension|tagname, rock|MP3Extension|genre, and

rock|ImageExtension|description. The thought was that the extension name and field name would be used

when the user searches for a value in a specific extension field, as opposed to the classical search for a general

occurrence of the word anywhere. By using an MP3Extension search and entering “rock” we can limit the

search to keyword objects that contain the word rock and whose extension is of type MP3Extension

(none of those family photos). The main point of this approach is that a search on a specific extension

can be performed using indexes (on both keyword._word and keyword._extensiontype fields) for faster

lookup, rather than searching the non-indexed fields of extension objects. Adding indexes to every

extension field would severely increase the database size, and the use of these indexes would be limited

as they would not all be used.

The problem with this data structure for keywords is that we get a lot of duplicate values for the value

“rock”. In a music collection of 10,000 tracks, it would not be uncommon for 1,000 of these to include

rock as their genre. This repetition of words in our keyword table will likely affect lookups in a negative

manner because the index would be non-unique and contain 1,000 entries of the same word. It would

be better to have a unique index so once the word is found the index lookup can stop. We need to

eliminate redundancy in the keyword table for faster querying and slightly smaller database size. What

we need is a dictionary style keyword table, where each keyword appears only once, with a list of

objects describing where these words appear.

The remainder of this section about search object structure is based upon the premise that not all

metadata is changed at the same time. Some metadata deals with file contents and is updated only on

file write; other metadata involves the file name and is affected by file rename operations; a third kind is

the tag extension, which is affected by tag rename and deletion and by file moves. It should be possible to

update the search index for each extension individually.

One possibility would be to have a keyword structure consisting of the word and a list of files in which

the word appears. The problem with this approach is that if the word “rock” points to a file with an ID3

tag (MP3Extension consists mainly of an ID3 reader library) containing “rock” as genre and also the file is

tagged with the word “rock”, removing the tag OR updating the ID3 tag’s genre should leave the index

entry, but doing both should remove it. This severely complicates index removal, having to check all the

metadata of the file for an occurrence of the word being removed. The other approach did not have this

problem as each keyword was accompanied by the extension type and field where it occurred, so all

that needed to be done was remove that one entry without having to worry about whether any other

piece of metadata contained the same word.

It was later discovered that this design of the search objects was based on a faulty assumption, namely that

when a file is moved, the TagExtension will cause an update of the search index, and that this update

would either need to be independent in the search index or would require all extensions to re-read

metadata from the file to ensure that a keyword is not removed while it is still available through another

extension. This would have been the case if we did not save the metadata as objects, but since we do

save metadata objects we can simply query these for words that should be in the search index.

This is probably why Windows Search and Spotlight only support one extension per file type: it is a lot

simpler.

We could let the keyword link to metadata objects instead but then the metadata objects would need a

link to the file they contain information about, to be able to get the files containing the word searched

for. Metadata containing a link to the file it contains information about is a bit of an awkward direction

as it separates metadata objects from the tag/file objects. We cannot get to metadata from a file, unless

we double link it in the same way as tags and files are double linked. Ideally we want to keep extensions

from referencing MetaFS objects, as it complicates extension writing, requiring the developer to know

about file objects. A more desirable approach is to just have the metadata extensions extract metadata

based on a filename passed to it, and then store that metadata in simple data types, e.g. string, int, etc.

and return this data in a dictionary when requested.
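An extractor of this kind can be very small. The sketch below is a hypothetical Python illustration of the idea, not the IMFSExtension interface itself: the class name `FileNameSplitSketch`, the method name `extract` and the key `"words"` are assumptions. It receives only a file name and returns simple data types in a dictionary, without touching any MetaFS objects.

```python
import ntpath  # Windows-style path handling, importable on any platform
import re


class FileNameSplitSketch:
    """Hypothetical extractor: splits a file name into lowercase words."""

    def extract(self, filename):
        # Work on the base name, without directory and without file extension.
        stem = ntpath.splitext(ntpath.basename(filename))[0]
        # Split on every run of characters that are not letters or digits.
        words = [w.lower() for w in re.split(r"[\W_]+", stem) if w]
        # Only simple data types are returned; no file or tag objects.
        return {"words": words}


print(FileNameSplitSketch().extract(r"C:\Music\01.Metallica - The Ecstasy Of Gold.mp3"))
```

Because the extractor knows nothing about file or tag objects, linking the returned words to a file remains the responsibility of the caller.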

Figure 11 shows how the objects required for searching are linked. The Keyword object, Object1, which

represents the word “gold”, points to a list of 2 KeyWordDetails objects, corresponding to the

file + extension pairs where the word “gold” can be found. Both of the KeyWordDetails point to the same file,

“01.Metallica – Ecstasy of Gold.mp3”. One detail object comes from the MP3Extension, because the

word “gold” appears in the title of the ID3 tag; the second detail object is there because “gold” is in the

file name, extracted by the FileNameSplitExtension.

A search for the word "gold" finds the following keyword object in the database:

Object1 : MFSSearch.KeyWord
    _details : HashSet<KeyWordDetails>
    _word : string = gold
    _wordReverse : string = dlog

List<KeyWordDetails> (2). Both KeyWordDetails objects point to the same file.

Object2 : MFSSearch.KeyWordDetails
    _file : MFSFile
    _type : Type = MP3Extension

Second MFSSearch.KeyWordDetails object
    _file : MFSFile
    _type : Type = FileNameSplitExtension

MFSFile (referenced by both _file fields)
    _attributes : FileAttributes
    _creationtime : DateTime
    _lastaccess : DateTime
    _lastwrite : DateTime
    _length : long
    _lowercasename : string = _uniquename.ToLower()
    _metadata : Dictionary<Type, IMFSExtension> = MP3Extension, FileNameSplitExtension, TagExtension
    _path : string = C:\Music\Metallica\S&M\CD1\01.Metallica - The Ecstasy Of Gold.mp3
    _tags : Dictionary<string, MFSTag> = music, metallica, s&m, cd1
    _uniquename : string
    _hash : string

Figure 11 - Structure of search objects

This structure means that if the FileNameSplitExtension plugin is ever removed, or the file renamed,

updating the search index is a simple matter of removing the FileNameSplitExtension detail,

without the need to look at the remaining metadata for the file.

When it comes to searching, the detail objects tell us that the word can be found in the ID3 tag, so we

don’t need to examine the metadata objects associated with the file. This means fast results to queries

like “show me files where the ID3 tag contains Lars”, no more results with Word documents where Lars

is the author.
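The keyword structure can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the MetaFS code: files and extension types are represented by plain strings, the class `SearchIndex` and its method names are hypothetical, and `DocSketch` is an invented extension name used only as example data.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class KeyWordDetails:
    file: str      # stands in for the MFSFile reference
    ext_type: str  # name of the extension that produced the word


@dataclass
class KeyWord:
    word: str
    details: set = field(default_factory=set)


class SearchIndex:
    def __init__(self):
        self.keywords = {}  # word -> KeyWord; every word is stored once

    def add(self, word, file, ext_type):
        kw = self.keywords.setdefault(word, KeyWord(word))
        kw.details.add(KeyWordDetails(file, ext_type))

    def remove_extension(self, file, ext_type):
        """Drop every detail that one extension contributed for one file.

        For brevity all keywords are scanned; a real implementation would
        visit only the words found in that extension's stored metadata.
        """
        for word in list(self.keywords):
            kw = self.keywords[word]
            kw.details = {d for d in kw.details
                          if not (d.file == file and d.ext_type == ext_type)}
            if not kw.details:
                del self.keywords[word]

    def search(self, word, ext_type=None):
        """Files containing word, optionally limited to one extension type."""
        kw = self.keywords.get(word)
        if kw is None:
            return set()
        return {d.file for d in kw.details
                if ext_type is None or d.ext_type == ext_type}
```

With this layout, a query such as `search("lars", "MP3Extension")` only returns files where the word came from the ID3 data, and removing the details of one extension leaves the details of all other extensions untouched.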

3.7.3 Searchable attributes.

The ability to search by filename and file extension is mandatory and must be included in any file search

utility. To include date ranges in search typically requires strict formatting of the date strings entered or

a GUI with a friendly date picking ability, and usually requires more time to implement than the simpler

string or value matching.

Folder names are closely related to file names and should also be searchable, meaning that tags must be

searchable.

Extracting metadata from files is of little value if it is not searchable, so metadata must also be

searchable.

Document formats continue to evolve, new attributes are added and new formats appear. To support

this evolution, extension modules can be written to read new or updated formats; even custom

file formats can be supported through extension modules. By default, no attribute is

searchable in the MetaFS; every attribute that needs to be searchable will need an extension to support

that particular attribute.

Three metadata extractor extensions will be written:

• Filename parser – split words in filename.

• Tag – each tag applied to the file.

• ID3 tag – extracts artist, title, album, etc. from mp3 files.

Additional extensions can be written to handle .pdf, .doc, .txt, .jpg, etc. It is up to the extension writer to

decide the level of detail extracted from a file. Section headings and author may be the only interesting

things to extract from a .doc file, or a full text extraction can be performed, recording every word that

appears in the document.

3.7.4 Multiple Metadata readers for the same file extension.

With the mp3 files you would think that it is enough to just have one extension that can extract ID3

data, but the ID3 library currently in use does not include the bitrate of the file (possibly bitrate is part of

the format and not the ID3 tag). Bitrate is a somewhat interesting attribute to be able to filter on so you

can replace all your low-quality sound files with newer ones. Because of situations like this the extension

system needs to be able to handle multiple metadata extractors for the same file type (e.g. 2 or more

extensions for extracting data from mp3 files).
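Supporting several extractors per file type mostly comes down to keeping a list, rather than a single entry, per file extension. The Python sketch below illustrates this; the class `ExtractorRegistry`, its methods and the extractor names used in the example are assumptions and not part of the MetaFS.

```python
from collections import defaultdict


class ExtractorRegistry:
    """Maps a file extension to any number of metadata extractors."""

    def __init__(self):
        self._by_ext = defaultdict(list)

    def register(self, file_ext, name, extract):
        # Several extractors may be registered for the same file type.
        self._by_ext[file_ext.lower()].append((name, extract))

    def extract_all(self, filename):
        """Run every extractor registered for the file's type.

        Returns {extractor name: metadata dict}, so the metadata of each
        extractor stays separate and can be updated independently.
        """
        ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
        return {name: extract(filename)
                for name, extract in self._by_ext.get(ext, [])}


registry = ExtractorRegistry()
registry.register("mp3", "Id3Sketch", lambda f: {"genre": "rock"})
registry.register("mp3", "BitrateSketch", lambda f: {"bitrate": 128})
print(registry.extract_all("song.mp3"))
```

Keeping the results keyed by extractor name matches the idea that the search index is updated per extension rather than per file.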

This approach also allows multiple .doc readers to be written, one that only extracts words in headings,

and another that does a full text extraction. The search interface will then present the user with options

to list the results of the full text extraction or just the heading extraction.

This approach differs from the one taken in the Windows Search application, which only handles a single

IFilter per filetype:

“Although one filter can handle multiple file types, each file type works with only one filter.” [42]

So far search has not received much attention. The only things that are searched so far are the tag and

file names, to ensure their uniqueness when adding new objects. File name and tag name fields are

indexed for faster querying. Searching the 2 indexed name attributes should not be much slower than

maintaining an index mapping keyword to objects and using that for “searching” both at once.

Presumably a keyword mapping index would only be twice as fast as searching the tag and file name

attributes directly, and since these fields are only searched when files or tags are added or renamed the

extra work of maintaining the keyword mapping index would be overcomplicating a simple name

lookup.

All of this is going to change once metadata makes an entry. Actually file and tag name searching will

still be performed directly on the object name attributes. The reason for this is that they are already

implemented that way so it is easier to leave them as they are, fully functional. Furthermore, and more

importantly, as metadata is added to the inverted index, the number of keywords (that need to be

searched) will likely be several times bigger than the number of file/tag names. In a system with 10,000

files and 300 tags, the metadata of these files could potentially contain 20,000-50,000 unique words to

search through. (15324 keyword objects with 11253 files added; see appendix for more details.)

Currently the file and tag objects perform more search queries than are strictly necessary. The classes

were designed to ensure that the database does not end up with duplicate names. When adding a file,

the db is queried to check whether it has already been added; if it has not, we generate a unique name for it, which is

done by querying the database again, with the exact same name.

The search approach described next was an early idea for allowing search on metadata, but it has been

changed to a search-all approach (described shortly) and metadata is used for filtering the search result

further.

The user should be able to select a metadata field and search on that, e.g. the year or genre of an ID3

tag. If we are providing search capability on individual metadata fields then an inverted index is not a

necessity: we can programmatically keep an eye on which metadata fields the user typically uses for

searching and keep an index on the 10 most commonly searched fields. Searching on individual fields is

a nice feature, and definitely one we want, but it does require that the user has some knowledge of the

metadata to know which field to search and that the files contain accurate metadata. Furthermore the

ID3 tag contains fields for performers, album artists and composers, all of which could contain the name

of some person/band you feel like listening to. Music tracks featuring artists are easily missed in a

search for album artists but we want these tracks to appear in the result too.

We currently have no way to search the tags or the filenames. These could be added as specific searches

like the metadata field search, but this is a very limited search approach and a much more appealing

approach is to have a search-all field where you just enter a word and every piece of metadata is

searched for the occurrence of that word. Say the user is in the mood for rock: they could pick a tag

based search, an ID3.Genre based search or just enter it as a general search term and get everything

with the word rock, including “Fraggle Rock” and the vacation photos of the rock of Gibraltar. To get rid

of the “Fraggle Rock” and the photos the user would have to enter “music” into the query too, assuming

all the mp3 files carry the music tag.

The need for a search-all field means that an inverted index becomes unavoidable. The alternative is to

search every field that could possibly contain the value, which means that with just the ID3 metadata

extension, 16 fields need to be searched for the ID3 metadata alone; we also need to search file name and

tag name. When new extensions are added, more fields need to be searched: adding a .doc metadata

plugin would add things like author, version and subject, resulting in at least 3 more fields to be searched.

The end result is that more plugins means more fields to search, resulting in longer search times, for the

same amount of files.

To support the search-all feature we need an inverted index. An inverted index solves the problems

mentioned above. Searching for an artist name using search-all will ensure that if the artist name

appears in any ID3 field, tag, filename, etc. the file is included in the result set. Adding more extensions

will result in more links for each word in the inverted index, and possibly a few new words (document

author and subject are likely already in the inverted index as they occur as tag and file name, but the

version would likely need to be added).

The ability to search directly on a metadata attribute is not available. Instead a search is performed,

using the search-all technique. The result of the search-all then returns a list of files, which can then be

filtered based on metadata values, an approach similar to the one described in “An Intelligent Method

for Searching Metadata Spaces” [43].
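The two steps, search-all followed by filtering on metadata values, can be sketched as follows. The sketch assumes a simplified index (word -> set of file names) and a flat metadata dictionary per file; the function names `search_all` and `filter_results` and the example data are hypothetical.

```python
def search_all(index, term):
    """Return the set of files whose metadata contains term in any field."""
    return set(index.get(term.lower(), ()))


def filter_results(files, metadata, field, value):
    """Keep only the files whose metadata field equals value.

    metadata maps file -> {field name: value}; files that lack the field
    are dropped from the result.
    """
    return {f for f in files if metadata.get(f, {}).get(field) == value}


index = {"rock": {"track.mp3", "fraggle_rock.avi", "gibraltar.jpg"}}
metadata = {"track.mp3": {"ID3.Genre": "rock"}}

hits = search_all(index, "Rock")  # everything containing "rock"
music = filter_results(hits, metadata, "ID3.Genre", "rock")
print(sorted(hits), sorted(music))
```

The inverted index is consulted only once per search word; the filtering step works on the usually much smaller result set.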

The Information Retrieval section (2.1) of the methodology covered a list of common causes for search

failures, some of which have been considered with respect to the MetaFS.

• Capitalization and extended characters

Search should be performed in lowercase and special chars like ôöáé are stripped of accents,

umlauts, etc. to convert them to normal characters.

• Stopwords

Every word is included, until such time as the search index becomes too big (loosely defined).

Some stopwords may be helpful in some searches, e.g. The Cranberries (band) vs. simply

cranberries (recipe or images). It is likely that the tags or ID3 data of some .mp3 files contain just

Cranberries, omitting “the”, so the search might be better if performed with the terms “cranberries”

and “music”. Search is very dependent on correct tagging and information.

• Short words

These are also included until they have too big an effect on the DB size.

With a minimum search word length of 2 or 3 letters, the TV show “ER” is hard to search for.

• Search on numbers.

Numbers are treated as strings, so they are included. Numbers 1-99 or 1-999 are included as

long as there is no minimum limit on words allowed in the inverted index.
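The capitalization and extended character handling from the list above can be sketched with Unicode decomposition. This Python example is one possible way to do it and is not taken from the MetaFS; the function name `normalize_term` is an assumption.

```python
import unicodedata


def normalize_term(text):
    """Lowercase text and strip accents, umlauts and similar marks."""
    # NFKD splits an accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Drop the combining marks and keep only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(normalize_term("Motörhead"))
```

The same function has to be applied both when words are indexed and when search terms are entered, otherwise the two sides will not match. Letters that have no Unicode decomposition, such as ø or æ, pass through unchanged and would need an explicit mapping if they are to be folded as well.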

There are 2 reasons not to index short words. The first is that they can be quite common and may

appear in half or 1/10 of the documents indexed, which means that a short word ends up being

responsible for a large portion of the size of the index while providing little filtering value; so, to save

space, the short words are ignored. The second reason involves the index implementation in a

relational database. An inverted index in a relational database will most likely consist of a table with the

words, a table of the documents, and a many-to-many relation linking the two. This means that every

time ‘an’ appears in a document, the many-to-many relation needs an entry linking the word with the

document, which increases the amount of records to look through when querying for documents with

the given search word. Assuming the many-to-many relation is indexed on the word reference field

using a B*-tree, the ids matching that of the word may be easy to find, but the number of extra records

for short words has added another level to the tree, requiring an additional lookup. If the index is stored

in memory, the extra lookup is not that big an issue, but the amount of extra records is. If the index is stored

on disk, the extra lookup is an issue as it means an additional read on the disk.

The beauty of the object database is that we are unaffected by the second problem, extra records to

search through. We simply store a list of keyword objects, each of which consists of a word and a list of

references to documents containing this word. This means that as soon as we find the word matching

the search word, we get a reference to a list, with references to the documents containing the word,

skipping the step of finding many-to-many records with an id matching that of the search word.

This is a basic version of an OODB inverted index; the MetaFS stores an intermediate object between the

keyword object and the actual document object, because there are multiple types of metadata for each

document and we need to be able to update keywords for one without affecting the other metadata

types.

As the list of MP3 files and their metadata shows (appendix, 9.8 ID3 Data), the CDDB does not contain

every field, and what it contains is not always accurate: the file “03.Metallica – For Whom The Bell Tolls.mp3” has the

ID3 title “For Whom The Bells Tolls”, with an s on bell. Because of typos and slight

variations of words like this, we would only find the file if our search engine supports begins-with search, or if we

index words in file names, a very basic, and must-have, feature of any file system search function.
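A begins-with search does not require scanning every keyword if the words are kept in sorted order. The Python sketch below shows the idea with a sorted list and binary search; it is an illustration only, and the function name `begins_with` is an assumption.

```python
import bisect


def begins_with(sorted_words, prefix):
    """Return all words in sorted_words that start with prefix."""
    # First position whose word is not smaller than the prefix.
    lo = bisect.bisect_left(sorted_words, prefix)
    # The highest code point sorts after any ordinary continuation of the prefix.
    hi = bisect.bisect_right(sorted_words, prefix + "\U0010ffff")
    return sorted_words[lo:hi]


words = sorted(["bell", "bells", "belt", "gold", "tolls"])
print(begins_with(words, "bell"))
```

A search for "bell" then also returns the misspelled "bells". The `_wordReverse` field shown in Figure 11 suggests that the same lookup could serve ends-with searches over reversed words, although that use is an assumption here.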

3.8 Use Cases

In this section we will provide use cases in order to identify the methods of interaction available to the

user. Since the interface will consist, in large part, of a VHD, the interface functions of Dokan will be

used to identify some use cases.

The following Dokan interface functions have been identified as candidates for use cases:

1. CreateDirectory – UC5

2. CreateFile – UC9

3. DeleteDirectory – UC6

4. DeleteFile – UC10

5. FindFiles – UC3+4

6. MoveFile – UC7+8+11+12

7. ReadFile – UC13

8. WriteFile – UC14

The remaining Dokan interface functions are listed below, each with a short note on why it did not translate into a use case:

9. Cleanup – No cleanup required

10. CloseFile – Forward file.Close() call to System

11. FlushFileBuffers – Never seen called, likely forward file.FlushFileBuffers() call to System

12. GetDiskFreeSpace – MetaFS does not have an identifiable free space, const values returned

13. GetFileInformation – Get object by name from db and return values

14. LockFile – Purpose unknown, ignored

15. OpenDirectory – Purpose unknown, ignored

16. SetAllocationSize – Forward file.SetAllocationSize() call to System

17. SetEndOfFile – Forward file.SetEndOfFile() call to System

18. SetFileAttributes – Forward file.SetFileAttributes() call to System

19. SetFileTime – Forward file.SetFileTime() call to System

20. UnlockFile – Purpose unknown, ignored

21. Unmount – Purpose unknown, ignored

Besides the VHD interface we also need to be able to add indexed locations (UC1) and remove indexed

locations (UC2), and what good is all the metadata if we are unable to search it, so a use case for

searching (UC15) and one for filtering (UC16) the search result is needed.

During development some additional use cases were identified, UC17: Change filestore location,

UC18: Rescan indexed location and UC19: View untagged files. These are not currently implemented, but

should be implemented in a future iteration.

Every use case assumes that the user has the Dokan driver installed and has the rights to run the

program.

The frequency of occurrence for each use case is an estimated number which is supposed to give a

rough idea of how often an event occurs and how critical it is that it is designed to be executed fast.

Note: Use cases pack a lot of information into very few lines, and this, together with their point-by-point

layout, makes them tedious to read one after another. They are meant as a guide for development, giving an

initial idea of functionality, and as a later reference to help identify the purpose of a feature. Because of

this, the use cases can be found in the appendix (9.2 Use Cases).

Chapter 4

4 Design

After having presented the functional requirements of the software in the last chapter, we will present

the design of the software.

We already identified some objects and their attributes for our system in the previous chapter. File data

will be stored as MFSFile objects, VHD folders correspond to our tags which are stored as MFSTag

objects. The user needs to be able to select one or more folders from a normal hard drive that are going

to be available through the VHD interface, for this we have MFSIndexedLocation objects. Because we

are using an OODB, navigational access to the MFSIndexedLocation instances is preferred, which is

where the MFSIndexManager comes into play. The MFSIndexManager class is a singleton and besides

the improved DB access it provides it is also responsible for creating and removing MFSIndexedLocation

instances and reassigning MFSFile ownership when indexed locations are added or removed. These

make up the basic building blocks of the tag based file system.

For development purposes MFSDebug handles output of debug text and a GUI for resetting the DB and

debugging new semi-implemented features. MFSDebugOptions is referenced by MFSDebug and

contains information about which operation to output as the user performs operations affecting the

objects of the system.

To be able to present the system as a VHD we need to implement the interface of the Dokan library, this

is done in the class MFSDokan. To interpret communication from the VHD, the MFS class is responsible

for translating VHD commands into operations performed on the MFSFile and MFSTag objects, to

support the tag based navigation and metadata extraction.

For storing metadata an interface is created, IMFSExtension, with a set of functions needed to extract

and store information for each file of a given filetype.

For search purposes a KeyWord class is created; it will be used to represent each unique word

encountered in the metadata. The KeyWord also has a list of KeyWordDetails objects, each of which

contains a reference to a file object for a particular KeyWord.

Due to the complexity of the design of the functional requirements, we have grouped the design of

functionalities into 3 separate iterations.

4.1 Class responsibilities

MFSIndexManager, MFSIndexedLocation, MFSFile, MFSTag, MFSOptions, MFSSearch.KeyWord,

MFSSearch.KeyWordDetails, MFSDebugOptions and classes implementing IMFSExtension are the classes

that are saved in the database and they all have save and load functionality implemented in such a way

as to make the DB use as transparent as possible. Whenever an attribute of any of these classes is

changed, the object saves itself. MFSFile is auto-saved by the constructor as all values are set in the

constructor and not changed after (until a file access occurs through the VHD). The MFSIndexManager

only consists of a list and there is no reason to save an empty list when the constructor is done. Saving

at the end of constructor calls does lead to some extra database activity, but it removes the need for

explicit save calls for objects when they are created, and the extra database activity is not expected to

be an issue as it only happens when indexed locations are added.

The first approach to saving object data to the DB was to let every edit of an attribute also save that

object. This turned out to be a horrible approach to saving, since a newly added file can easily be

linked to 5 tags, which means 6 saves for that one file (1 by the constructor, 5 for addtag() calls).

Minimizing the DB access leads to a lot of extra list iterations, but these are still faster than the database

lookups and saves they prevent.
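The difference in save counts can be illustrated with a small sketch. The class below is a hypothetical stand-in for a persisted object, with a plain list playing the role of the database; the names `FileSketch`, `add_tag` and `add_tags` are assumptions, and batching the tags is only one possible way of reducing the number of saves.

```python
class FileSketch:
    """Hypothetical stand-in for a persisted file object."""

    def __init__(self, name, db):
        self.name = name
        self.tags = []
        self._db = db
        self.save()  # the constructor saves the new object

    def save(self):
        self._db.append(("save", self.name))

    def add_tag(self, tag):
        self.tags.append(tag)
        self.save()  # save-on-every-edit: one save per tag

    def add_tags(self, tags):
        self.tags.extend(tags)
        self.save()  # batched: one save for all tags


tags = ["music", "metallica", "s&m", "cd1", "live"]

db_a = []
a = FileSketch("a.mp3", db_a)
for t in tags:
    a.add_tag(t)
print(len(db_a))  # 6 saves: 1 constructor + 5 add_tag calls

db_b = []
b = FileSketch("b.mp3", db_b)
b.add_tags(tags)
print(len(db_b))  # 2 saves: 1 constructor + 1 batched save
```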

Every class that is stored in the database has a Save() function and every non-singleton class a Delete()

to ensure that all attributes, references, and lists are unreferenced and removed from the database.

4.1.1 Committing changes

DB.commit() is done when the application is exited. This approach only works while testing since it

rarely runs for more than a few minutes at a time. We don’t want to commit too often as it is a

somewhat slow operation, but at the same time, not committing enough could end up losing data if the

program crashes in the middle of a long transaction. Timed commit should help to save the smaller

changes to the system that happen frequently and don’t affect enough objects to be explicitly

committed when they are done. A timed approach requires the use of a timer thread calling the commit

every few minutes to commit any changes that may have occurred since last commit. An alternative

approach is to keep a counter of every database access and call a commit after every 20-50 accesses.

Tracking db access means there is no timer running, and the number of commits corresponds to the

usage of the database: high usage = many commits, low usage = few commits. Use cases that have a

larger effect will get their own explicit commit calls, i.e. UC1: Add Index, UC2: Remove Index, UC6:

Delete Folder, UC18: Rescan Indexed Location.
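The counter-based alternative can be sketched as a small helper that is notified of every database access. This is a Python illustration under assumptions: the class `CommitCounter`, its method names and the `commit` callback are hypothetical and do not refer to a specific database API.

```python
class CommitCounter:
    """Calls commit() after every `threshold` recorded database accesses."""

    def __init__(self, commit, threshold=20):
        self._commit = commit
        self._threshold = threshold
        self._pending = 0

    def record_access(self):
        self._pending += 1
        if self._pending >= self._threshold:
            self.commit_now()

    def commit_now(self):
        # Also usable for the explicit commits of the larger use cases.
        if self._pending:
            self._commit()
            self._pending = 0


commits = []
counter = CommitCounter(lambda: commits.append("commit"), threshold=20)
for _ in range(45):
    counter.record_access()
counter.commit_now()  # explicit commit, e.g. when the application exits
print(len(commits))
```

A final explicit commit is still needed when the application exits, since fewer than `threshold` accesses may be pending at that point.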

4.1.2 MFS

A singleton class.

A lot of create/delete/rename tag and file operations exist here; the same operations exist in the

MFSFile and MFSTag classes. The MFS operations do more than just forward the call. The MFSFile and

MFSTag create/delete/rename are concerned with the object level operations and ensuring naming

conventions are adhered to (name must be unique in the tag/file namespace). The operations at MFS

level work on a logical level: a MoveFile() request from the VHD calls MFS.RenameFile(), which then

calls RenameFile() on a file object. The actual file on the HD may need a rename as well, if it is in the

filestore, which would be handled by MFS.RenameFile().

Dokan.CreateDirectory calls MFS.CreateTag which then calls MFSTag.Create. Since we only display tags if

they contain links to files relevant to the current path, the newly created directory needs a dummy file

created, or the newly created tag-folder will not be shown to the user. This is handled by the MFS class and

not by MFSTag, as that dummy file has nothing to do with the tag object, but rather with our application

logic.

4.1.3 MFSIndexManager

A singleton class.

Iteration 1:

When removing an indexed location, we need to remove all file objects associated with this location;

there are a couple of ways of doing this.

1. We can query the database for file objects with a path that starts with the path of the index

location to be removed. This involves a string.startswith() call on every file object in the

database, not a cheap operation to run on 1000+ objects without an index (and we rarely use

path, so index would be a waste).

An index on path is added in a later iteration to speed up the TagExtension queries. But

string.startswith() queries don’t use the index as some later tests reveal.

2. We can look at the files that exist at the index location that is about to be removed and then do

a lookup on their filenames (and check that their path matches the location being removed).

With an index on filenames in the db this should be faster than method 1. See note below.

3. We let indexed locations keep a reference to the file objects found at the path they are

indexing. This way we have a reference to every file object we need to remove and there is no

need to query the database for string matches on the path.

Note on 2: The following only applies if we are not using (or cannot use, as on FAT32) the

FileSystemWatcher service (see data dictionary). If files have been deleted from the location about to be

removed, this method will leave file objects in the database. The file deleted will not be present in the

location being removed and thus our file to file object mapping will not catch all file objects.

When a file is deleted while the index location remains, or files are created at the indexed location,

bypassing the VHD interface, the MetaFS will not detect a deleted file until it is accessed, and new

files are never detected unless indexed locations are explicitly re-scanned for files. Ideally file changes in

indexed locations will be detected by using the FileSystemWatcher service.

The 3rd option is currently the simplest and most effective approach.

When a file is deleted on the VHD, we remove references from tags and then delete the file object from

the DB. This leaves a link in an indexed location file list. There are 2 ways to find and remove this

reference. One way is to load the file list for every indexed location and check if there is a reference to

our deleted file object. The other way is to look at the path of the file and query the MFSIndexManager

for the indexed location covering our deleted file. The second approach requires less data from the DB

to complete the search. The MFSIndexManager has a list of MFSIndexedLocation objects, all we need to

do is iterate through each of these (accessed by reference in the DB) and find the one with the longest

path matching the file’s path. Doing it this way keeps it to a single iteration of maybe 10 indexed

locations, and we never load the file list of the indexed locations from the database.
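The longest-path match can be sketched as a single loop over the indexed location paths. The Python example below is illustrative only: the function name `find_indexed_location` is an assumption, and locations are represented by their path strings rather than by MFSIndexedLocation objects.

```python
from pathlib import PureWindowsPath


def find_indexed_location(indexed_paths, file_path):
    """Return the indexed location with the longest path covering file_path."""
    target = PureWindowsPath(file_path)
    best = None
    for location in indexed_paths:
        loc = PureWindowsPath(location)
        # A location covers the file if it is one of the file's ancestors.
        # PureWindowsPath compares case-insensitively, like Windows paths.
        if loc in target.parents:
            if best is None or len(loc.parts) > len(PureWindowsPath(best).parts):
                best = location
    return best


locations = [r"C:\Music", r"C:\Music\Metallica", r"D:\Photos"]
print(find_indexed_location(locations, r"C:\Music\Metallica\CD1\01.mp3"))
```

Comparing whole path components, rather than raw string prefixes, avoids treating a folder such as `C:\Music2` as if it were covered by `C:\Music`.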

Iteration 2:

Adding and removing of index locations works by placing file responsibility as low (far from root) as

possible.

When removing an indexed folder from the index list we need to check if the removed index folder is

either a parent or a subfolder of other indexed folders.

Removal of a parent folder: assuming the folder being removed is the topmost (closest to

root) folder, we can just delete it and the file objects referenced by it.

Removal of a child folder: assuming the folder being removed has a parent folder, no file

objects should be deleted, and they should instead have their responsibility transferred to the parent

folder.

Handling the addition of new indexed locations gets a bit more complicated since we need proper file

object tracking for when indexed locations are deleted. There are cases where adding a new indexed

location can be a problem. One is when the parent folder is indexed and one of its subfolders is being

added. The other is when a child folder has been added and now the parent is being added.

When the order is first parent, then child: when the child folder is added, MFSIndexManager should be

queried for any index on a parent folder. If one is present, it returns the indexed location object, so the

child can ask the parent to transfer index responsibility to the child location object.

When the order is first child, then parent: when the parent folder is added, MFSIndexManager should be

queried for any index on child folders. If any are present, the paths of those child folders should be ignored

when directory crawling for new files.
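
The add and remove rules of iteration 2 can be sketched as follows. This is a Python illustration under simplifying assumptions, not the C# implementation: the IndexManager class, its locations dictionary (indexed path -> set of owned file paths) and the crawl callback are hypothetical.

```python
from pathlib import PureWindowsPath


def is_inside(path, root):
    """True if path lies somewhere below root (component-wise comparison)."""
    return PureWindowsPath(root) in PureWindowsPath(path).parents


class IndexManager:
    """Hypothetical stand-in for MFSIndexManager: indexed path -> set of owned files."""

    def __init__(self):
        self.locations = {}

    def nearest_parent(self, path):
        """Deepest indexed location that contains path, or None."""
        parents = [p for p in self.locations if is_inside(path, p)]
        return max(parents, key=lambda p: len(PureWindowsPath(p).parts), default=None)

    def add(self, path, crawl):
        """Index path. crawl() lists every file below path and is only called if needed."""
        parent = self.nearest_parent(path)
        if parent is not None:
            # Parent first, then child: take over the parent's files, no new scan.
            owned = {f for f in self.locations[parent] if is_inside(f, path)}
            self.locations[parent] -= owned
        else:
            # Child first, then parent: ignore paths already covered by a child.
            children = [p for p in self.locations if is_inside(p, path)]
            owned = {f for f in crawl()
                     if not any(is_inside(f, c) for c in children)}
        self.locations[path] = owned

    def remove(self, path):
        """Drop an indexed location and return the files that must be deleted."""
        owned = self.locations.pop(path)
        parent = self.nearest_parent(path)
        if parent is not None:
            self.locations[parent] |= owned  # responsibility moves up
            return set()
        return owned
```

Keeping responsibility as far from the root as possible means that each file is owned by exactly one location, so removing a location never has to consult the file lists of the others.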

Iteration 3:

The next version of the MFSIndexManager should use the FileSystemWatcher service to monitor indexed

locations for file changes, and update file objects and indexed location objects accordingly.

4.1.4 MFSDokan

A singleton class that implements the DokanOperations interface to support the VHD interface. The

short version (no parameters or return values specified) of the interface is:

• Cleanup()

• CloseFile()

• CreateDirectory()

• CreateFile()

• DeleteDirectory()

• DeleteFile()

• FindFiles()

• FlushFileBuffers()

• GetDiskFreeSpace()

• GetFileInformation()

• LockFile()

• MoveFile()

• OpenDirectory()

• ReadFile()

• SetAllocationSize()

• SetEndOfFile()

• SetFileAttributes()

• SetFileTime()

• UnlockFile()

• Unmount()

• WriteFile()

For a more detailed version, including the parameters of each function, the Dokan.Net binding, available

from the web site [44], comes with 2 samples, where each function is implemented with all parameters.

Most of them are somewhat self-explanatory, except CreateFile(), which is used for more than just

creating files, and MoveFile(), which is also used to move directories and rename files and directories.

CreateFile() is the workhorse of the VHD: it handles probes to see if files or directories exist, and it handles

the creation of new files and the opening of file streams when reading or writing is needed later.

4.1.5 IMFSPlugin

This is the interface that extensions will use for handling new types of metadata.

4.2 Special considerations

As stated in UC19: View Untagged Files, files can potentially get “lost” if they don’t have a tag

associated. Turning this problem upside down uncovers another potential problem: tags that don’t have

any associated files. Since the root level of the VHD only presents tags with a minimum number of file

associations, tags with no files will not appear at the root. Because of the way tags are presented

outside the root (only those applied to files in the current folder), the tag with 0 files (dead tag) is never

shown anywhere. This gives rise to two questions: how does the user apply this dead tag to files if it

cannot be seen anywhere, and what happens if the user tries to create a new tag-folder with the name

of the dead tag? These two questions are very closely related. To apply the dead tag to a file, the user

will have to create a new tag-folder with the name of the dead tag. The act of trying to create a new tag-

folder when the name is already in use will simply revive the dead tag by linking it with the dummyfile.

The dummyfile is also linked with the tags in the current path, to ensure the previously dead tag can

now be seen in the working directory.

The Dokan interface is supposed to return error codes, so the result of VHD commands can be

determined. Some of these error codes cover cases such as, “file not found”, “access denied” and

“already exist”. Applications typically react to these error codes that normally come from hierarchical

file systems. Trying to apply the error codes to a non hierarchical system is complicated at best, and at

times the error codes just don’t cover the problem accurately. A file create operation on a hierarchical

system will return an “already exist” error when trying to use a name already taken in the current folder.

The MetaFS will return an “already exists” error code if the name is used by any tag or any file, with no

regard to the current folder. Because of this, error codes are mostly ignored and a standard error is typically

returned, in which the calling application is simply informed that the requested operation failed, with no

reason given.

Rename operations don’t return errors when the target name is already used, instead they auto

generate a new unique name and assign that to the file instead. This might have unforeseen

consequences since the rename operation might not be user initiated, but rather another application

that expects the renamed item to have the exact new name specified and not a new name with a

counter. The uniqueness constraint on the MetaFS makes it more appealing to generate a variation of

the new name rather than return an already-exists error, since it’s near impossible for the user to know

what tag and file names are in use already when many files are indexed. MetaFS is meant for people to

find documents, not for programs to move or rename items, so this renaming issue is ignored.
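
The rename behaviour can be sketched as follows. This is a Python illustration only; the function name unique_name, the case-insensitive comparison and the "name (n).ext" counter format are assumptions, since the text only states that a unique variation with a counter is generated.

```python
import os


def unique_name(requested, names_in_use):
    """Return requested, or a variation with a counter if it is already taken.

    names_in_use is the shared tag/file namespace, compared case-insensitively
    (an assumption; the counter format "name (n).ext" is also hypothetical).
    """
    taken = {name.lower() for name in names_in_use}
    if requested.lower() not in taken:
        return requested
    stem, ext = os.path.splitext(requested)
    counter = 1
    while f"{stem} ({counter}){ext}".lower() in taken:
        counter += 1
    return f"{stem} ({counter}){ext}"
```

A caller that needs the exact requested name can compare the return value with its argument, which is the check an application would otherwise get from an “already exists” error.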

4.3 FileStore

The FileStore is the folder in which files created through the VHD are saved. Every file created on the

VHD is saved in the same folder, which means the filename has to be unique. Since that is also a

constraint of the MetaFS, the filename used on the VHD is the same name used in the FileStore folder.

This approach works as long as the FileStore is empty when the MetaFS database is empty. If there are

files in the FileStore when the MetaFS database is reset, creating new files on the VHD can result in

name collisions with the existing files in the FileStore. It is therefore important that the FileStore is

empty when the MetaFS DB is empty. Any files found in the FileStore, when the application is loaded

and the database contains no file objects, are moved to \FileStore\old.files and given a unique name if

needed. This means that “old.files” cannot be used as name for a file created on the VHD (because of

file/folder shared namespace).

4.4 Design considerations for each use case

Having little experience with db4o and the performance it delivers with respect to queries and object

access, some of the use cases and classes are designed to perform tasks with as little DB interaction as

possible, as is the case in the following example.

4.4.1 UC4: View Folder

When viewing a folder with the path “X:\Music\Metallica\S&M” we get the list of files to display by

getting each of the 3 tags from the database and doing an intersection on their file lists. For calculating the

intersection, we take the shortest of the lists and do a tag.hasfile(file) for each of the tags in the path -

the tags that we have already loaded. Tag objects store file references in dictionaries (filename -> file

object); this allows us to perform the hasfile(file) check using dictionary keys, without activating the file

object, and thus saves some file object activations compared to doing an intersection of the file lists for

each tag.

The list of tags to display comes from the intersection of file lists, including tags that any of the

intersecting files reference, and in the end removing tags that appear in the path of the current

directory.

File objects use a dictionary to store tag references (string tagname -> MFSTag tagobj) which means

that a check on a file to see if it has a tag can be done by name alone (file.HasTag(tagobj)), there is no

need to activate or look at the tag object, and the operation is fast, approaching O(1) [45] as opposed to

the tag.HasFile(fileobj) which uses the List<T>.Contains which is an O(n) operation [46].
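
The folder view described in this section can be sketched as follows. This is a Python illustration, not the C# implementation; the function names and the two dictionaries (tag name -> {file name: file object} and file name -> set of tag names) are hypothetical stand-ins for the tag and file objects in the database.

```python
def files_in_folder(path_tags, tag_files):
    """Names of the files that carry every tag in path_tags (which must be non-empty).

    tag_files maps tag name -> {file name: file object}, mirroring the
    dictionary-based file references described above (hypothetical layout).
    """
    lists = [tag_files[tag] for tag in path_tags]
    shortest = min(lists, key=len)  # fewest candidates to check
    # Key lookups only: no file object is touched during the check.
    return [name for name in shortest if all(name in files for files in lists)]


def tags_to_display(files, file_tags, path_tags):
    """Tags referenced by the intersecting files, minus those already in the path."""
    shown = set()
    for name in files:
        shown.update(file_tags[name])
    return shown - set(path_tags)
```

Starting from the shortest list bounds the number of membership checks by the size of the smallest tag, and each check is a dictionary key lookup rather than a list scan.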

When viewing a folder the MetaFS will have an option to indicate if there should be a counter on the

tag-folder, displaying how many files this tag is applied to.

Figure 12 - Viewing the Root (X:)

Figure 13 - Viewing X:\CD1

The counter at the end of the tag folder is designed in such a way that it will display how many of the

currently presented files are related to the tag and not how many files in total are referenced by the tag.

This helps to show when a tag is useful for a drilldown. Say we start at the root (Figure 12) and select

“CD1 [11]”; we get the result shown in Figure 13. The counter tells us that it will present 11 files. Besides

those files, we also get a couple of tag folders, “Music [11]”, “S&M [11]” and “Metallica [11]”, all of

which have a counter of 11, which means that none of those will be of any help in narrowing down the

selection.

Figure 14 - Viewing X:\Music[51]

Figure 15 - Viewing X:\Music[51]\S&M[22]

Had we instead started with the “Music [51]” folder, Figure 14, we would have seen 51 files and the

folders “CD1 [11]”, “CD2 [10]”, “Metallica [22]”, “Flyleaf-Memento.Mori.2009 [14]”, “S&M [22]”.

Selecting “S&M [22]”, as in Figure 15, will remove the Flyleaf… tag folder and some other tag-folders as well.

As the drill-down process progresses, the counter at the end of the tag folders will continue to decrease

(or stay the same, but never increase): “X:\Music [51]\S&M [22]\CD1 [11]” goes from 51 to 22 to 11.

During most of the development the only item in the Music folder was the Metallica folder, which

means that these two tags ended up referencing the same set of files, so if you opened the Metallica tag

folder there’s little to be gained from also opening the Music tag folder. When presenting folders for

drill-down use we could remove those that don’t reduce the amount of files presented (have the same

counter as the currently viewed folder), but for now we’ll just leave them, to avoid the confusion of

“missing” tag folders.
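
The counters and the optional pruning of tag folders can be sketched as follows. This is a Python illustration; folder_counters, narrowing_tags and the file_tags dictionary (file name -> set of tag names) are hypothetical names, not part of the MetaFS code.

```python
from collections import Counter


def folder_counters(presented_files, file_tags, path_tags):
    """Count, per tag, how many of the currently presented files carry it."""
    counts = Counter()
    for name in presented_files:
        counts.update(file_tags[name])
    for tag in path_tags:  # tags already in the path are not shown again
        counts.pop(tag, None)
    return dict(counts)


def narrowing_tags(counters, presented_count):
    """Tags whose counter is below the current file count, i.e. useful for drill-down."""
    return {tag for tag, count in counters.items() if count < presented_count}
```

Because the counters are computed over the presented files only, a counter can never exceed the counter of the folder being viewed; a tag whose counter equals it would not narrow the selection.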

4.5 Database object references

4.6 Class diagrams

The full class diagram is enormous when including all attributes and connections and has been split into

smaller pieces that will fit on A4 pages.

4.6.1 is the application logic (between the GUI and the database objects)

4.6.2 is the GUI elements

4.6.3 is the DB objects

4.6.4 is the extension objects, which are also stored in the DB.

4.6.5 is the full diagram consisting only of class names to fit all on one page, a miniature version of the

full diagram.

4.6.6 is the full diagram with attributes and is generally unreadable on A4 but it does a better job of

presenting the amount of functionality in each class and how many connections it shares with other

classes.

Legend:

Class Name
+Property() : int
+CTor() : Class Name
-Private function() : int
+Public function() : int
+Static function() : int
-private variable : int
+public variable : int
+static variable : int

"Functions" listed above the constructor (CTor) are properties. Interface functions also appear before CTor, but interface and property methods are usually distinguishable by applying some logic to the method names.

4.6.1 Application Logic classes


4.6.2 GUI classes

The form classes, FormSearch, FormDebug and FormGUI do not have any attributes added because they

consist mainly of buttons, text fields and combo boxes and event handlers.

4.6.3 DB classes


4.6.4 Extension classes

[Class diagram: TagExtension, MP3Extension, FileNameParserExtension and FutureExtension(s), each implementing the IMFSExtension interface]

Public members (listed in full for TagExtension):
+Author() : string
+FileExtensions() : string[]
+Name() : string
+Version() : string
+CTor() : object
+CTor(in path : string) : object
+Save()
+Extract()
+GetDict() : Dictionary<string, string[]>
+GetSimpleValue(in field : string) : string
+GetMultiValue(in fld : string) : string[]
+MultiValueFields() : List<string>
+SimpleFields() : List<string>
+ToString() : string

TagExtension, private fields:
-_name : string
-_version : string
-_author : string
-_extensions : string
-_tags : string[]

MP3Extension, private fields:
-_name : string
-_version : string
-_author : string
-_extensions : string[]
-_trackno : string
-_trackCount : string
-_disc : string
-_discCount : string
-_bpm : string
-_title : string
-_performers : string[]
-_album : string
-_albumArtists : string[]
-_year : string
-_genres : string[]
-_composers : string[]
-_comment : string
-_conductor : string
-_copyright : string
-_lyrics : string

FileNameParserExtension, private fields:
-_name : string
-_version : string
-_author : string
-_extensions : string
-_words : string[]

FutureExtension(s) is shown with public members only.

4.6.5 Class connections

[Class connection diagram: class names only, grouped into GUI Classes, Database Classes and Extension Classes, connected by «uses» relations and one «bind» relation]

Classes shown: FormSearch, FormDebug, FormGUI, MFSInfoRetrMan, MFSDebug, MFSDokan, MFSDokanCache, MFS, MFSSearch, IMFSExtension, MFSExtensionManager, MFSFunctions, MFSSearch.KeyWord, MFSSearch.KeyWordDetails, MFSDebugOptions, MFSOptions, MFSTag, MFSFile, MFSIndexManager, MFSIndexedLocation, TagExtension, FutureExtension(s), FileNameParserExtension.


4.6.6 Full diagram

4.7 Sequence diagrams

When working with an OODB, ensuring that objects are stored, updated and deleted at the right times,

is important to keep performance high. Repeated storing of an object after every change puts unneeded

execution time into the OODB, so we want to only store objects when we are done with a series of

changes on them. This focus is reflected in the sequence diagrams, where the DB.Store() and DB.Delete()

calls are drawn explicitly from the save calls. They are implied by those calls (save = store in DB for

all classes) but the save calls are easily missed in the sequence diagrams.
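
The idea of storing an object once after a series of changes can be sketched as follows. This is a Python illustration; the FakeDb class, the Tag class and the dirty flag are assumptions made for the example and are not taken from the MetaFS code or from the db4o API.

```python
class FakeDb:
    """Counts store calls; stands in for the object database (hypothetical)."""

    def __init__(self):
        self.store_calls = 0

    def store(self, obj):
        self.store_calls += 1


class Tag:
    def __init__(self, name, db):
        self.name = name
        self.files = set()
        self._db = db
        self._dirty = False

    def add_file(self, file_name):
        """Change the object in memory only and remember that it needs saving."""
        self.files.add(file_name)
        self._dirty = True

    def save(self):
        """Store once, after a series of changes; do nothing if unchanged."""
        if self._dirty:
            self._db.store(self)
            self._dirty = False
```

With this pattern, adding three files to a tag and then calling save() costs one store call instead of three.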

Most of the interaction with the MetaFS causes a lot of list iterating, and branching, both of which can

be displayed on the sequence diagrams, but they are kind of heavy in the sense that they take up a lot of

space in the diagram and they draw a lot of attention when reading the diagram. For this reason

branching and iterating is left out most of the time. This means that the sequence diagrams only show

the calls made and in what sequence, and not under which conditions these calls are made.

Note: As with the use cases, the sequence diagrams are mainly for curiosity/development purposes and

are therefore available in the appendix (9.3 Sequence Diagrams).

Chapter 5

5 Implementation

5.1 Introduction

This section gives insight into implementation details and describes some quirks/features of the MetaFS.

Focus is very much on how the implemented file system works and this section is for gaining

understanding of the system and how it handles tag and file creation, deletion, editing, linking. This

section is mainly for development purposes and for the curious reader that wants to know the inner

workings of the system.

5.2 Development environment

5.2.1 Introduction

This section provides information that allows for replication of the development environment, what is

needed to build and run the thesis code and what is needed for development of extensions for

custom/additional file types.

5.2.2 Software + Hardware Setup

The development environment for the MetaFS consists of a Virtual Machine (VM), running in VMware

with a single 2.4 GHz processor and 1.5 GB of RAM. The specs of the host computer of the VM are as

follows: Intel® Core™2 Quad CPU Q6600 @ 2.40 GHz, 8.00 GB RAM, 64-bit Windows 7. Three 7200 RPM hard

disks in a striped software RAID hold the virtual machine settings and hard drive. The host is installed on

its own hard disk, ensuring that hard drive I/O of the host does not affect that of the VM and the other

way around.

There are a couple of reasons for a virtual development environment, the primary one being the end

note of the readme for Dokan: “If there are bugs in Dokan library or file system applications which use the library, you will get the Windows blue screen. Therefore, it is strongly recommended to use Virtual Machine when you develop file system applications.” [38]. Secondly, a VM provides a clean

environment that is not affected by previously installed software. Backup, antivirus and indexing

software can have unpredictable behavior when accessing the virtual hard drive, crawling every possible

subdirectory it can find.

The installed OS on the VM is Windows 7, 64 bit, with the following software:

1. Microsoft Security Essentials [47]

2. db4o 7.12 for .NET 3.5 [48]

3. Dokan library 0.5.3 [44]

4. Dokan .NET binding 0.3.0 [44]

5. Visual Studio 2010 [49]

6. Visual Studio Team Foundation Server 2010 [50]

Item 1 is a piece of antivirus software that is needed because the development machine is connected to

the internet for access to online documentation for the various libraries used, db4o, Dokan, and MSDN

Online access. Also, it could help locate any issues between the MetaFS and antivirus software at an

early stage.

Item 2 is our OODBMS and is a very central part of the MetaFS, storing all our data and providing fast

and easy access to them.

Item 3 is the driver needed to mount a Virtual Hard Drive (VHD). This was the only library available, for

Windows, which allows the presentation of data through a VHD.

Item 4 is the library used to write file system applications, presented to the user as a virtual hard drive,

using the driver above.

Item 5 is the IDE used, for syntax highlighting, intellisense, unit testing, and source control (with 6).

Item 6 is the source control software. Alternatives include Visual SourceSafe, Perforce, CVSNT and

others, but TFS is well integrated with VS.

Items 1 and 6 are optional but included for security + early problem detection and easier software

management respectively.

5.2.3 Setup notes

While the Dokan library does support x64 Windows, the file system application using the Dokan library

must be compiled as an x86 application.

5.2.4 Extensions

The power of the MetaFS is its ability to extract metadata from file headers or file data and allow for

quick, indexed, searches of this data. It is possible to write extensions to support custom file formats

and to allow searching on computed attributes of image files (light, dark, black/white, facial

recognition).

Dokan can be a bit tricky to work with at first because there is very little documentation for it and only 2

small examples, one to mirror the C: drive and one to make the registry browsable as a VHD (examples

are included in the Dokan .Net Binding download[44]). Some of the calls made to the DokanOperations

interface functions are somewhat intuitive, but at times some unexpected calls are made. Typically a

sequence of calls is made; CreateFile->Cleanup->CloseFile is a sequence that fairly quickly becomes a

common occurrence while watching the calls made. CreateFile is used for more than the name would

suggest: it is also used as a probe before other calls are made. Besides file creation, it is used to test

if a file exists, and it is sometimes called on directories too, where a flag has to be set to indicate that it

was called on a folder. The calls are interdependent: when a FindFiles call returns a set of

folders, we need to make sure that those folders are marked as folders when CreateFile is called with a

filename parameter corresponding to any of these folders. This tight coupling of functions means that

they have to be developed in parallel; it is not possible to work on just one call at a time.

5.3 Querying by name

The name query of GetTag takes ~50 ms for a case-insensitive match using the delegate method. Using

the SODA interface it takes only ~3 ms, so a case-sensitive SODA query is always run first, in the hope that it

will save the 50 ms. It does so most of the time: a miss usually means that the tag needs to be created,

which happens far less often than a hit where an existing tag needs to be added to a file. Alternatively, the SODA

constructs StartsWith() and EndsWith() can be used, since their true/false parameter specifies case

sensitivity. We would still have to check the result(s), since “musicmusic” will be found by a search for

“music”, and “xkcd” will be found by “cd”. Like or Contains might work too, with the same extra check of the

results returned. StartsWith, EndsWith, Like and Contains are probably not able to use an index on attributes,

which will likely make them slow.

During testing, StartsWith and EndsWith are found to ignore indexes, making them slow and undesirable

to use.
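
The two-stage lookup can be sketched as follows. This is a Python illustration, not db4o code: the dictionary lookup stands in for the fast case-sensitive indexed query, and the scan stands in for the slow case-insensitive match; get_tag and tags_by_name are hypothetical names.

```python
def get_tag(name, tags_by_name):
    """Look up a tag by name, trying the cheap exact match first.

    tags_by_name stands in for an indexed name field; the dict lookup plays
    the role of the fast case-sensitive query and the scan plays the role of
    the slow case-insensitive match (both are hypothetical simplifications).
    """
    tag = tags_by_name.get(name)  # fast path: exact, case-sensitive
    if tag is not None:
        return tag
    wanted = name.lower()
    for candidate, tag in tags_by_name.items():  # slow path: full scan
        # Whole names are compared, so "musicmusic" and "xkcd" never
        # match a search for "music" or "cd".
        if candidate.lower() == wanted:
            return tag
    return None
```

The slow path only runs on a miss of the fast path, so the common case of an exact hit pays for a single indexed lookup.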

5.4 Infinite cleanup loop

When calling RemoveFile(somefile), the tag object will remove the file and call

somefile.RemoveTag(sometag) to ensure that both of the references between the tag and file object are

removed. The RemoveTag(sometag) is implemented in the same way, calling the RemoveFile(somefile)

on the tag being removed. This approach has the potential to go into an infinite loop if not done

properly. Done properly, it ends with 1 redundant remove-something call on either a tag or file object

(indexmanager.RemoveLocation() calls file.Delete(), which calls tag.RemoveFile() calling

file.RemoveTag() calling tag.RemoveFile() at which point the circle ends). This approach ensures that

there will be no 1-way references, either to existing objects unaware of the 1-way reference or to

nonexistent objects.
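
One way to keep the mutual removal from looping is to act only when the reference still exists. The following Python sketch illustrates that guard; the class and method names mirror the text, but the guard itself is an assumption about how "done properly" could look, not the actual MetaFS code.

```python
class Tag:
    def __init__(self, name):
        self.name = name
        self.files = {}  # file name -> File

    def remove_file(self, file):
        # Only act if the reference still exists; this guard ends the cycle.
        if self.files.pop(file.name, None) is not None:
            file.remove_tag(self)


class File:
    def __init__(self, name):
        self.name = name
        self.tags = {}  # tag name -> Tag

    def remove_tag(self, tag):
        if self.tags.pop(tag.name, None) is not None:
            tag.remove_file(self)


def link(tag, file):
    """Create the two-way reference between a tag and a file."""
    tag.files[file.name] = file
    file.tags[tag.name] = tag
```

Calling tag.remove_file(file) removes the file, calls file.remove_tag(tag), which removes the tag and calls tag.remove_file(file) once more; that last call finds nothing to remove and returns, which is the single redundant call mentioned above.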

5.5 Use Case implementation

5.5.1 UC1: Add Index Location

This seemed like an easy use case to implement: just let the user select a directory, store this

selection, and add the files in the folder (and subfolders) and the folders as tags to the MetaFS. The 2nd

part is what makes it complicated: we are creating a lot of new objects and need to check if some tags

already exist, load them, add links to new files and save them again, or create new tags, add file links

and save.

The reason this is so tricky to implement as the first item is that it is responsible for creating new file

objects and new tag objects and for creating the relationships between them; thus it involves a lot of the

other use cases, create file, create tag, apply tag to file.

If the tags and files were allowed to save themselves at every change, the db would get extremely busy

when adding new indexed locations. We need to finish work on individual tag and file objects before

saving them.

Iteration 1: Just the ability to add directories, we don’t take responsibility for the user adding subfolders

of already indexed locations, or parent folders of already added locations (iteration 2 will deal with

these issues).

Iteration 2:

Indexed location objects have a list of MFSFile objects they are responsible for indexing, to make the

removal of indexed locations easier.

Indexing a subfolder of an already indexed parent folder will transfer “index ownership” of the files in

the subfolder to the indexed location object of the subfolder. We won’t have to do a scan of the folder

since it has already been scanned when the parent was added, so only the transfer of file object

ownership needs to be sorted.

Indexing a parent folder of an already indexed child folder will skip the folder covered by the child's

indexed location when scanning for files.

It is possible to select an indexing folder that is a parent to multiple base folders that are already

indexed and all of these subfolders should be excluded from the scan.

Iteration 3:

Use the FileSystemWatcher service, for monitoring file changes of the indexed location.

5.5.2 UC2: Remove Index Location

Iteration 1: simply remove all file objects referenced by the index location object, no responsibility for

parent child relationships between indexed locations.

Iteration 2: check if there is a parent indexed location to pass on file responsibility to. If no parent index

location exists, use the list of file objects to remove those objects from the db.

5.5.3 UC3: View Root

For this one we need to query the DB for all tag objects. This means that every tag object and its list of

file references will have to be loaded and counted. This is one of very few cases where we actually have

to query the DB for objects, most of the time we are working with the references.


5.5.4 UC4: View Folder

The path parameter is split into tags and each tag is fetched from the database. The intersection of all

the tags is calculated, using the tag with the smallest file list as the base for comparisons. For every file

that has all tags applied, the tag list is added to the list of tag-folders to show. The file intersection and

the list of linked tags is translated to file objects and returned to the VHD for presentation to the user.

5.5.5 UC5: Create Folder

This use case is a bit special because of the way we treat folders. Normally you would expect the newly

created folder to appear in the folder currently being viewed. When a new tag folder is created it will

not have any file references and will therefore not appear in the root (if the threshold is 1 or more) and

no files will reference it so it will not show up when viewing any other tag folder. To allow the new tag

folder to be used from the folder in which it was created we need to add a dummy file that forces the

newly created folder to appear in the current view.

The 1-arg constructor for MFSFile objects works on System.IO.FileInfo objects, which does not allow the

length attribute to be set, so it is not possible to create a FileInfo object for a file that does not exist

because the length attribute will not be set. For testing it is a lot easier if everything that is tested does

not need an actual file to base the test on. The MFSFile class needs a special constructor for testing and

for creating special files that are not based on real files. The MFSFile.Create() used to transform real files

into MFSFile objects also creates tags that don’t already exist. The System.IO.FileInfo used to create

the file object contains a “fullpath” attribute which is used to create tag references, so it is a logical

place to create the tags.

When we get a request to create a new tag folder, every tag in the path will be created if it does not

already exist. The createfolder does not just operate on the last folder of the path, but on every folder.

CreateFile(“\Music\Blues\Metallica\Reload\CD1\Live”) would end up creating 2 new tags “Blues” and

“Live”, and link the remaining tags with the dummy file.

Alternatively we could check that all the other tags do actually exist or return an error if they do not, but

the circumstances where the path will contain nonexistent tags are somewhat special. The user can

issue an MD (make directory) command from a command prompt. Window A is opened, and then

window B, where some tags are deleted, then focus is back on window A where a new directory is

created.

There are likely other ways this can happen, but the 2 mentioned are likely to happen at some point and

when they do the create-all-tags behavior is more of a feature than an unwanted side effect. It is a lot

nicer to actually create the tags the user requested than return an error that the user then has to deal

with and try and figure out what failed.
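
The create-all-tags behaviour can be sketched as follows. This is a Python illustration; create_folder and the tags dictionary (tag name -> set of linked file names) are hypothetical stand-ins for the tag objects in the database, and the constant name for the dummy file is chosen for the example.

```python
DUMMY_FILE = "dummyfile"  # placeholder name for the dummy file in this sketch


def create_folder(path, tags):
    """Create every missing tag in path and link all of them to the dummy file.

    tags maps tag name -> set of linked file names (hypothetical in-memory
    stand-in for the tag objects in the database). Returns the new tag names.
    """
    names = [part for part in path.split("\\") if part]
    created = [name for name in dict.fromkeys(names) if name not in tags]
    for name in names:
        # Existing tags, including tags with 0 files, are reused and linked.
        tags.setdefault(name, set()).add(DUMMY_FILE)
    return created
```

For the path \Music\Blues\Metallica\Reload\CD1\Live with only “Blues” and “Live” missing, the function returns those two names and links all six tags to the dummy file. A tag that exists but has no files is handled by the same code path, so it becomes visible again.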

When testing the code to handle CreateDirectory calls from Dokan we learn a little something about

how Windows goes about creating new folders. First an available name is found based on the template

“New folder (#)”, this is done by probing for the existence of earlier numbers, the VHD shows

CreateFile() calls for New folder (1), then New folder (2), then New folder (3) till it eventually finds New

folder (4) available. The Explorer window will then select the new folder and mark the name for the user

to change. When trying to change the name of the new folder we get a debug output that tells us that

Dokan is calling a not yet implemented function, which turns out to be MoveFile(). So besides

implementing CreateDirectory, we now also have to do part of MoveFile, the part that involves tag

renaming.

While working on MoveFile() to actually have it “move” files (or more precisely add/remove tag links) it

turns out that Explorer once again has implemented a little safety check of its own. Before moving a file,

CreateFolder() is called with the destination folder of the move as the parameter. This call was

unexpected and it causes the dummy file to appear wherever a file is moved to, which is an annoying

side effect. Clearly CreateFolder() needs to be redesigned so it doesn’t drop dummy files all over the

place, where it’s not needed. Another thing that needs to be changed is the name of the dummy file, it

was originally “dummy:” because “:” is not a legal char for a file, so it would not cause any name

collision, but this file gives an error when trying to delete it through Explorer, the delete request never


reaches the VHD. So while in the process of refactoring CreateFolder(), the dummy file becomes a real

file, dummyfile.tmp, stored in the FileStore.

5.5.6 UC6: Delete Folder

This could prove interesting, as folders are deleted recursively in Windows, which means that first every subfolder

is queried to determine its contents and the access to them, and then the folders are deleted. Is a folder delete

issued only on the user-selected folder or also on subfolders? (Explorer would issue it on subfolders.)

Calling for a delete on a folder will check all subfolders (recursively) for access before allowing the

delete. This is potentially a big problem as the number of tags grows, since subfolders are combinations

of tags, with just 7 tags there is the possibility for a folder to have 6! = 720 subfolders. Each of those

queries contains 1-6 tags that are checked in the database - that is a lot of queries. Caching tag queries

could be one possibility, but with 8 tags we have a chance of seeing 7! = 5040 subfolders. That is 5040

potential calls made by Windows Explorer that cannot be avoided, at best they can be cached or an

attempt to detect dir crawling and return an empty list can be made.
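
The figures above can be reproduced with a small calculation. The Python sketch below assumes the worst case, in which every combination of the remaining tags contains at least one file; the second function, which also counts the shallower paths, is an addition for illustration and is not a figure taken from the measurements in this thesis.

```python
from math import factorial, perm


def deepest_orderings(other_tags):
    """Orderings of all remaining tags, the figure quoted in the text (n!)."""
    return factorial(other_tags)


def all_tag_paths(other_tags):
    """Every non-empty ordered path of distinct tags, at any depth."""
    return sum(perm(other_tags, depth) for depth in range(1, other_tags + 1))
```

Under that assumption, deepest_orderings(6) gives the 720 subfolders mentioned above and deepest_orderings(7) gives 5040, while all_tag_paths(6) gives 1956 paths when the intermediate levels are counted as well.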

The problem is that the check of sub directories is done before the DeleteDirectory() call is made, so

there is no warning preceding the directory crawling.

Basically every permutation of tag order is checked by Explorer (those that contain files, as they are the

only ones shown by the MetaFS) which means that every path is unique so caching by path does not

work here.

The solution that will work for now is to keep track of how many FindFiles() calls are made with different

DokanContext numbers. A quick test run shows that most DokanContext numbers are used 1-20 times, a

few see ~40 uses, but only when delete is called are they used over 100 times. With just 11 tags and 52

files deleting one folder at root level takes 5-10 seconds (“please wait” GUI graphics last for 5-10 sec,

which is the time it takes Explorer to crawl all the tag folders), and it caused 1957 calls, split over 2

DokanContext numbers ( ~600/~1300 split - the reason for the split is unknown). Since the crawling is

done using the same 1 or 2 DokanContext numbers we can return empty filelists once the same number

has been used 200 times. Operations other than the delete operation have so far peaked at 40

DokanContext reuses, so 200 seems like a “safe” number, leaving enough margin for normal operations.
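
The crawl cut-off can be sketched as follows. This is a Python illustration; the CrawlGuard class and its list_folder callback are hypothetical, and only the limit of 200 and the idea of counting FindFiles() calls per DokanContext number come from the text.

```python
from collections import Counter

CRAWL_LIMIT = 200  # threshold from the text


class CrawlGuard:
    """Counts FindFiles calls per context number and cuts off runaway crawls."""

    def __init__(self, limit=CRAWL_LIMIT):
        self.limit = limit
        self.calls = Counter()

    def find_files(self, context_id, list_folder):
        """Return list_folder() normally, or an empty list once the limit is passed."""
        self.calls[context_id] += 1
        if self.calls[context_id] > self.limit:
            return []  # looks like directory crawling: stop feeding it
        return list_folder()
```

Counting per context number means that an ordinary Explorer window, which stays far below the limit, is unaffected while a crawl is cut short.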

Another operation that causes all sub folders to be crawled is calling the properties dialog of a folder

since it will count the number of files and folders in the folder, which can only be done by crawling. The

amount of files and folders obtained from recursively crawling makes no sense in a tag based filesystem,

so it’s not a great loss that this functionality is broken by the limit of 200.

Depending on the software used to view the VHD, folder delete actions are handled differently. The

Windows command prompt command “rd” will only delete a folder if it is empty, “deltree” on the other

hand will do a recursive delete. Windows Explorer automatically deletes subfolders when deleting a

folder.


5.5.7 UC7: Rename Folder

Renaming tags has a side effect: MFSFile objects store the list of linked tags in a dictionary using

the tag name as the key and the tag object as the value, so changing the name of a tag object means

that the linked file objects need to have their dictionary updated with the new name.
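
A minimal sketch of this re-keying step, written in Python for illustration. The classes Tag and TaggedFile and the function rename_tag are hypothetical stand-ins and not the actual MFSFile implementation.

```python
class Tag:
    def __init__(self, name):
        self.name = name
        self.files = []  # file objects linked with this tag


class TaggedFile:
    def __init__(self, name):
        self.name = name
        self.tags = {}  # tag name -> Tag object, as described in the text

    def add_tag(self, tag):
        self.tags[tag.name] = tag
        tag.files.append(self)


def rename_tag(tag, new_name):
    """Rename a tag and re-key the dictionary of every linked file."""
    old_name = tag.name
    tag.name = new_name
    for f in tag.files:
        # Without this step the file would still find the tag under its old key.
        f.tags[new_name] = f.tags.pop(old_name)
```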

5.5.8 UC8: Move Folder

This operation has very limited use and is not implemented.

Docplayer [51] uses a tag based approach to classifying files, but it also allows tags to be nested by drag-

n-dropping one tag onto another, creating a tag-based taxonomy. On the other hand it does not

automatically list the other tags applied to the files being viewed, which is what allows drill-down in the MetaFS.

5.5.9 UC9: Create File

For file-exist probing the unique name constraint of the system makes lookup simple: there is no need

to start at the root and crawl directory after directory according to the path specified to get to the file or

folder at the end of the string. Only the last part of the path is checked for a tag or file name, there is no

concept of current directory. The last string of the path is used to query the database for a matching tag

or file and the details used to fill the FileInformation object that is returned.
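
The lookup can be sketched like this, assuming two dictionaries that stand in for the database queries for tags and files. The function names and the returned tuples are hypothetical; the real system fills a FileInformation object instead.

```python
def last_component(path):
    """Return the last non-empty part of a backslash-separated path."""
    parts = [p for p in path.split("\\") if p]
    return parts[-1] if parts else ""


def probe(path, tags_by_name, files_by_name):
    """Look up only the last path component; names are unique across tags and files."""
    name = last_component(path)
    if name in tags_by_name:
        return ("directory", tags_by_name[name])
    if name in files_by_name:
        return ("file", files_by_name[name])
    return None
```

Because names are unique, the folders that precede the last component never have to be visited to answer the probe.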

For actual file creation the file mode flags are simply forwarded to the System.IO.File class. Only file

copying is handled a bit specially by Explorer for which a translation of a file mode flag is performed.

5.5.10 UC10: Delete File

Deleting a file involves removing its tag links and finally deleting the file object from the database.

Removing tag links is accomplished by iterating through the items in the _tags list and calling

RemoveTag(item) on each of them. A tiny issue with this is that RemoveTag() automatically saves the file

object after every tag is removed, so when removing a file with 7 tags, we end up saving the file object 7

times, before finally deleting it. Thus a 2 parameter RemoveTag is implemented that will allow tag

removals without saving. This non-saving function will only be available from within the MFSFile class as

external calls are better off with saves being handled automatically.
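
The difference between the two RemoveTag variants can be sketched as follows. This is an illustrative Python model: the save counter stands in for writes to the database, and the method names are assumptions rather than the actual MFSFile members.

```python
class TaggedFile:
    def __init__(self, name, tags):
        self.name = name
        self._tags = list(tags)
        self.save_count = 0  # stands in for writes to the database

    def _save(self):
        self.save_count += 1

    def remove_tag(self, tag):
        """Public variant: always saves after removing the tag."""
        self._remove_tag(tag, save=True)

    def _remove_tag(self, tag, save):
        """Internal two-parameter variant: the caller decides whether to save."""
        self._tags.remove(tag)
        if save:
            self._save()

    def delete(self, database):
        # Iterate over a copy, since the list shrinks while tags are removed.
        for tag in list(self._tags):
            self._remove_tag(tag, save=False)
        database.remove(self)
```

With this split, deleting a file with 7 tags causes no intermediate saves, while external callers of remove_tag() still get a save after every removal.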

5.5.11 UC11: Rename File

Once the MoveFile() call has been properly processed and determined to be a rename, it is a simple

matter of querying the db for the file or tag object and changing its name. If a tag is renamed and

the new name is in use by a file, the file is renamed, so the tag does not get a counter.


5.5.12 UC12: Move File

The MoveFile() function is called by Dokan for both files and directories so the first thing we need to do

is determine if the target is a file or a tag folder. As the name suggests it is meant to move things, but it

is also used for renaming, so we also need to determine if it is a move or a rename operation that is to

be performed.

After we know if we are working on a file or a tag folder we need to determine if it’s a move or a rename

operation.

In total there are 4 situations in which MoveFile() is called:

1. Move tag

Moving a tag folder does not make much sense in the MetaFS since every tag can appear

anywhere in a path, it is only their presentation that limits where they are shown based on how

they are linked with file objects.

This event has very little use and is ignored until a reasonable use has been found.

2. Rename tag

Make sure the path for filename and the new name are identical except for the last folder,

which is the tag being renamed. Before renaming we need to make sure the name is not already

in use by a file, if it is, rename the file and assign the name to the tag instead (this could possibly

break stuff if it’s currently working on the file and the file goes missing and is replaced by a dir.

But the same thing can happen in a normal file system, move the file and put a dir with same

name in its place, so it is not a problem that is unique to the MetaFS).

If another tag is using the name we have to decide which approach makes most sense:

• Return an error indicating that the directory name already exists.

• Combine the two tags by linking all files with the original tag to the existing “newname” tag.

Undo operations are most likely implemented in the viewing software, which has no clue how

the MetaFS works, so our system has no undo function, and the joining of the tags might not be

what the user wanted, and now has no way to undo the operation.

Returning an error is the safer choice. If the user wants to merge two tags this can be

accomplished by opening the source folder from the root, moving all the files to the target

directory and then deleting the source folder. It is important that the source folder is opened from the

root, if not opened from the root there is no guarantee that all files with the given tag are listed

as there are other tags in the path that potentially limit the files presented.

3. Move file

This is quite a tricky operation since it will do different things depending on where the file is

moved from and to.

• Move to parent folder: (Move UP)


Moving a file to a parent directory will remove the tags of the child folders it is moved from

(moving “C:\Lyrics\Music\Metallica\sandman.txt” to “C:\Lyrics\sandman.txt” will remove

the “Metallica” and “Music” tags from the file).

• Move to child folder: (Move DOWN)

Moving a file to a child directory will add the tags of the child directories (moving

“C:\Lyrics\sandman.txt” to “C:\Lyrics\Music\Metallica\sandman.txt” will apply the tags,

“Metallica” and “Music”).

• Move to arbitrary folder. (Move ACROSS)

Moving a file to any location not covered by the previous cases will only add new tags, not

remove existing ones (moving “C:\Lyrics\Music\Metallica\sandman.txt” to

“C:\Lyrics\Music\Song texts\sandman.txt” will apply the tag “Song texts” and leave the

“Metallica” tag linked with the file).

4. Rename file

Very similar to rename tag, except it cannot “steal” a name already used by a file as the tag

renaming can.
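
The three move-file cases from item 3 can be sketched as a single function over tag sets. This is a hedged Python illustration, not the MetaFS code: the function names are hypothetical, and treating "parent folder" as "the target tags are a proper subset of the source tags" is an assumption made here because tag order in a path is arbitrary.

```python
def path_tags(path):
    """Tags named by the folders of a path; drive letter and file name are dropped."""
    parts = [p for p in path.split("\\") if p]
    return set(parts[1:-1])


def tags_after_move(file_tags, source_path, target_path):
    """Return the file's tag set after moving it from source_path to target_path."""
    src, dst = path_tags(source_path), path_tags(target_path)
    if dst < src:
        # Move UP: the target is a parent folder, drop the tags left behind.
        return set(file_tags) - (src - dst)
    # Move DOWN or ACROSS: only add tags, never remove existing ones.
    return set(file_tags) | dst
```

Move DOWN and Move ACROSS share a branch because both only add the tags named in the target path.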

The MoveFile() call from Dokan also passes a replace parameter, but with files having unique names you

never really know if you are overwriting the correct file on the VHD until you have opened it, so a replace

parameter seems like a bad idea and will be ignored. To rename a file to a name already taken the old

file has to be manually removed first; this keeps the MetaFS from overwriting files the user didn’t mean

to.

Renaming a file on the VHD does not rename the real file unless the real file is in the FileStore, i.e.

files in indexed locations keep their original file names, while FileStore files follow the name they have on the VHD.

It would appear that the process of moving a file will actually call CreateDirectory to ensure that the

target directory exists, in which case this call needs to return an error message of

directory_already_exists.

The CreateDirectory call results in the dummy file being placed in the target directory *sigh*. This

sequence of events caused by a move file event initiated by the user means that the CreateDirectory

needs to be redesigned. Currently CreateDirectory will generate any tag that appears in the path given

and link the dummy file with all of them. During a move operation we do not want this dummy file to

appear at the destination of the move, so we need to check if there exists a file that has all of the tags

given in the CreateDirectory request. If a file exists with the tags provided we do not need to add a

dummy file as the sequence of tags is visible through normal view.

There are 2 ways to approach this do-we-need-a-dummy-file problem.


1. We take a “random” tag (the first one) from the path, query the db for this tag and use the list

of files to compare against the rest of the tags to see if any file is linked with all tags.

+ We do a single search query in the database to get the first tag, everything after that is

lookup by reference.

- The (random) tag we do the query on could be the one with the most files, possibly resulting

in 10.000 files that need to have their tag lists compared against the tags in the path.

2. We do a search query for every tag name in the path, and then pick the shortest file list from

these tag objects and perform the check of the path-tags against these file objects.

+ We get the shortest list of files to compare the path against e.g., a list of 10 vs. a list of

10.000.

- Multiple queries against the database.

We easily run 3+ queries whenever a new file object is added, so the 1-10 queries caused by the

majority of CreateDirectory() calls is not a problem as it is a lot less frequent than file creation.
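
The second approach can be sketched as follows. This is an illustrative Python model in which each dictionary lookup stands in for one database query; the function name and the data layout are assumptions and not the actual MetaFS classes.

```python
def tag_combination_exists(path_tag_names, files_by_tag):
    """True if at least one file is linked with every tag named in the path.

    files_by_tag maps tag name -> {file name: set of that file's tag names};
    each lookup stands in for one database query.
    """
    wanted = set(path_tag_names)
    file_lists = []
    for name in wanted:
        if name not in files_by_tag:
            return False  # an unknown tag cannot be linked with any file
        file_lists.append(files_by_tag[name])
    if not file_lists:
        return False
    # Approach 2: compare the path tags against the shortest file list only.
    shortest = min(file_lists, key=len)
    return any(wanted <= file_tags for file_tags in shortest.values())
```

If this returns true, the tag sequence is already visible through a real file and no dummy file is needed.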

5.5.13 UC13: Read File

Supposedly the ReadFile() and WriteFile() functions of Dokan are used very much like your normal

ReadFile() and WriteFile() calls, so hopefully it will be a “simple” matter of forwarding the read/write

requests. Dokan uses the typical sequence of CreateFile() -> ReadFile() -> Cleanup() -> CloseFile(), which

means that CreateFile will have to open files when a specific set of file access parameters are passed to

CreateFile(). And the file is left open until CloseFile() is called. The calls are linked by the info.InfoID

number.

Before opening a file, GetFileInformation is called on it to retrieve the length of the file. This length is

used by Notepad (and likely other applications) to determine how much data is in the file. When a file is

added (through add index location) the length of the file is stored in the db as part of the file object. We

do this to avoid having to query the original file for length whenever it is presented in a view. The

problem with doing this is that any changes made to the original file on the real hard drive will change

the length of the file. If content is added to the file on the real hard drive, bypassing the VHD, only the

first fileobj.length characters will be loaded, the rest is still in the file, but it is not loaded by the

application when opened from the VHD.

Changes to indexed files are not yet detected by the MetaFS, so to fix this problem we have to check the

length of the real file vs. the length in the file object when a file is opened as it would appear that the


Dokan calls happen in the following order, CreateFile (file mode flag = open), GetFileInformation,

ReadFile, CloseFile, Cleanup.

When opening files the length value of the file is typically used by the application opening the file so we

need to make sure that the length of the file object in the db matches that of the real file when an open

request is made.
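
A minimal sketch of such a length check on open, in Python for illustration. FileRecord stands in for the file object stored in the database, and the save callback stands in for persisting it; both names are hypothetical.

```python
import os


class FileRecord:
    """Stands in for the file object stored in the database."""

    def __init__(self, real_path, length):
        self.real_path = real_path
        self.length = length


def refresh_length_on_open(record, save):
    """Compare the stored length with the real file and update it if stale."""
    actual = os.path.getsize(record.real_path)
    if actual != record.length:
        record.length = actual
        save(record)  # persist the corrected length
        return True
    return False
```

Running the check only when a file is opened keeps directory listings cheap, since they can continue to use the stored length.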

5.5.14 UC14: Write File

Write requests are forwarded to System.IO.File.

5.5.15 UC15: Change FileStore

New filestore location cannot be added as an indexed location.

New filestore location must be empty.

5.5.16 Searching

The db4o native query language has built-in query constraint methods for startswith and endswith

querying, and it seems only natural to use these when implementing the ‘*’ wildcard character for

searching: “hay*” would find keywords that start with “hay” (as in the band “Hayseed Dixie”). It would

seem natural for the startswith method to be able to use the index that exists on the keywords and thus

be able to complete its constraint matching in a time not too far from that of an exact match of the

word “hay”. It turns out that this assumption was erroneous.

Query times were measured in a DB with 931 files added, 2049 keywords, 25228 keyworddetails (this

amount has no impact on the query time, only when presenting the result is the activation time affected

by the amount of detail objects). The initial query to the DB takes a lot longer than subsequent ones

because it has to load the data and possibly activate some linked objects, so the program has to be

restarted before every search to get accurate results.

After startup, a search for “hay” takes 142ms and returns 0 results. Performing the exact same search

again takes ~1ms. Letting the program run and searching for “hayseed” takes 29ms and returns 51

results, repeating the “hayseed” search takes 1-7ms (1ms is achieved around 50% of the time, 7 is an

occasional spike in query time). A search for “hayseed” after a restart takes 170ms.

Restarting the program and doing a search for “hay*” takes 313ms for the first search and returns 51

results, subsequent searches range from 125ms to 137ms. A restart and search for “*seed” results in

similar numbers 291ms initially and around 130ms for the following searches.

Assuming the index on keyword objects is alphabetical it should be usable in startswith queries, but not

in endswith. The numbers above seem to suggest that the index is not used at all when performing

startswith or endswith queries. It is possible that the index is hash based.
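
To illustrate why an alphabetical index would support startswith queries, and how a reversed copy of each word would do the same for endswith, here is a small Python sketch using binary search on sorted lists. It does not reflect how db4o implements its indexes; all function names are hypothetical.

```python
from bisect import bisect_left


def prefix_range(sorted_keys, prefix):
    """Return all keys starting with prefix, using binary search on a sorted list."""
    lo = bisect_left(sorted_keys, prefix)
    # "\uffff" sorts after the characters used in the sample keywords.
    hi = bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]


def build_indexes(keywords):
    """A forward index plus a 'backwards' index holding the reversed words."""
    forward = sorted(keywords)
    backward = sorted(word[::-1] for word in keywords)
    return forward, backward


def starts_with(forward, prefix):
    return prefix_range(forward, prefix)


def ends_with(backward, suffix):
    # A suffix match is a prefix match on the reversed words.
    return sorted(word[::-1] for word in prefix_range(backward, suffix[::-1]))
```

In a sorted index all matches for a prefix are adjacent, so only the two ends of the range have to be located. A hash-based index offers no such ordering.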


The following message supports the lack of index use when performing startswith and endswith queries.

“:: db4o 7.12.156.14667 Diagnostics ::

MetaFileSystem.MFSSearch+KeyWord, MetaFS :: Query candidate set could not be loaded from a field

index

Consider indexing fields that you query for:

Db4o.configure().objectClass([class]).objectField([fieldName]).indexed(true)”

This discovery totally negates the point of the “backwards” attribute of keywords that was supposed to

be able to supply endswith querying supported by an index for faster results.

This is an implementation detail specific to the db4o database and not something we can do anything

about, so when performing wildcard searches the increased query time is unavoidable.

5.6 MFSDebug

During development the database will get filled with lots of useless stuff and the classes are refactored a

lot which can cause problems when testing. This can be fixed by just deleting the database file, but while

the database is open in the program the file cannot be deleted, so either the program has to be closed

or the program has to close the db, delete the file and create a new db. Alternatively the program can

“just” delete every object in the db, which will still leave undo/redo logs intact and possibly some class

information that is not needed will still be present in the db file.

Testing needs a clear database for a reliable result and having to close and reopen the program for every

test is an inconvenience.

To allow for easier testing a reset command has been implemented, that will (re)open the database and

re-initialize every object loaded from the db. This re-initialization process is very important to prevent

duplicate objects in the database.

Early versions of the MFSDebug and MFSDebugOptions caused a lot of problems because when the

program was run the 2nd time it would load the debug settings from 1st run, and re-save them a 2nd time

as a new object, putting 2 identical objects in the database. Changes to one object would be mirrored in

the other object, they were effectively the same object, but occurring twice in the database. The

MFSDebug and MFSDebugOptions are supposed to be singletons, so when starting the program for the 3rd

time, the singleton loader function would complain that there was more than one instance of the

object in the database.

All of this happened because the MFS constructor would open the db, then something would use the

debug class (which would force it to be loaded from the db), then the GUI would call for a MFS.Reset()

as part of the initializing process, forcing the db to be re-opened. This sequence of events led to the


debug class loading its MFSDebugOptions from the first ObjectContainer and saving it to the second,

where the second believed it to be a new object, instead of an existing one.

This problem of one object spanning two object containers took a while to track down and it was

actually a web-comic style guide about db4o that highlighted the multi object container issue; “The

Magic Clone” section of [52].
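
The symptom can be reproduced with a toy model of an object container that recognises stored objects by identity only. This Python sketch is an illustration of the effect described above and makes no claim about how db4o tracks objects internally; ToyContainer and its methods are hypothetical.

```python
import copy


class ToyContainer:
    """Toy object container that recognises stored objects by identity only."""

    def __init__(self, records=None):
        self.records = list(records or [])  # persisted state (plain dicts)
        self._known = {}  # id(live object) -> (record index, live object)

    def load_all(self):
        objects = []
        for index, record in enumerate(self.records):
            obj = copy.copy(record)  # a fresh live object for this container
            self._known[id(obj)] = (index, obj)
            objects.append(obj)
        return objects

    def store(self, obj):
        known = self._known.get(id(obj))
        if known is None:
            # Unknown identity: the container assumes this is a new object.
            self.records.append(dict(obj))
            self._known[id(obj)] = (len(self.records) - 1, obj)
        else:
            self.records[known[0]] = dict(obj)
```

Storing an object in the container it was loaded from updates the existing record, whereas storing the same object in a second container opened on the same data appends a duplicate record.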

5.7 MFSFunctions

This class collects various functionality that is used a lot, as well as functionality that at times needs to be handled differently, like

access to the object database. Test functions and benchmarking should not interfere with the real

database, so when testing the DB property can be set to redirect to another DB object.

5.8 MFSDokan

An interesting thing learnt when implementing and debugging this one is that the command line tool,

“copy” uses CreateFile() calls to check if the target exists and then calls CreateFile() with a file mode of

create, forcing an overwrite of any existing file (if the user confirms the overwrite of course). Windows

Explorer uses a different approach to copying when drag & dropping. The target file is created with a file

mode of CreateNew, which if it fails causes an overwrite/rename/cancel dialog to appear. The problem

here is that we first look if the CreateFile() request is supposed to create a file, and create it at this point

if that is the case. Later in the code we check if the target is a filename and open it, simply forwarding

the file mode parameters to a FileStream constructor to open the file. Simply forwarding “CreateNew”

does not work since we create the file and its corresponding FileStore file earlier, so the FileStream

constructor will throw an exception informing us that the file already exists. “CreateNew” has to be

translated to a “Create” when forwarding to the FileStream constructor. This works because our file

creation part of CreateFile() checks if the file exists when given a “CreateNew” flag, and returns

already_exists_error if it does exist. That way we don’t run the risk of opening an existing file with a

“Create” file mode flag (that we changed from “CreateNew”) and risk an overwrite on its contents.
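
The translation can be sketched as follows. In this Python illustration the strings "CreateNew", "Create" and "Open" stand in for the corresponding .NET file mode values, and create_file is a hypothetical simplification of the CreateFile() handler that performs both the creation step and the opening step.

```python
import os


def create_file(path, mode):
    """Create-then-open sequence with the CreateNew -> Create translation."""
    if mode == "CreateNew":
        if os.path.exists(path):
            # Reject before anything is touched, so nothing gets overwritten.
            raise FileExistsError(path)
        open(path, "xb").close()  # creation step: make the backing file
        mode = "Create"  # translated, because the file now exists
    # Opening step: Create truncates or creates, anything else opens as-is.
    return open(path, "wb" if mode == "Create" else "rb+")
```

The existence check happens before the translation, which is what makes it safe to open with the more permissive mode afterwards.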

For some reason, the more files that are in a folder, the more calls there are to the CreateFile() function.

Approximately 1 call is made per file when viewing a folder, up to 400 calls. These CreateFile() calls are

not probes to check that each file exists; they are all targeted at the folder being viewed.

The reason for this behavior is unknown. This is a lot of calls that normally would query the database,

but since the queries are identical (occasionally different read/share/lock modes are used in the first 5

calls) the return value should be the same for all 400 calls, so to save the db some work we cache the

result to feed the remaining 399 calls. MFSDokanCache is the class responsible for maintaining the cache

values. Cachesize is currently set at 10 and the lifetime is 2 seconds.
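
A cache with these two parameters can be sketched as below. This Python example is an assumption about the general shape of such a cache and not the MFSDokanCache source; in particular the eviction of the oldest entry and the injectable clock are choices made for the illustration.

```python
import time
from collections import OrderedDict


class SmallTtlCache:
    """Bounded cache whose entries expire after a fixed lifetime."""

    def __init__(self, max_size=10, lifetime=2.0, clock=time.monotonic):
        self.max_size = max_size
        self.lifetime = lifetime
        self.clock = clock
        self._entries = OrderedDict()  # key -> (expiry time, value)

    def fetch(self, key, compute):
        """Return the cached value for key, calling compute() on a miss."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()
        self._entries[key] = (now + self.lifetime, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict the oldest entry
        return value
```

A short lifetime is enough here because the repeated calls for one folder arrive in a burst, and it limits how long a stale answer can be served.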


5.9 Deployment

To be able to run the MetaFS application the user must have the Dokan library [38] and .NET 3.5 or higher

installed. The db4o database is embedded into the .exe file.

The MetaFS consists of an executable and a bunch of .dlls and it can in theory be run from any directory,

and will create a new database file in the same directory if an existing one is not found.


Chapter 6

6 Testing & Results

6.1 Introduction

The MetaFS was developed using unit testing for the simple operations that make up the building blocks

of the system, like creating files and tags, linking them, deleting them, moving them, etc.

A lot of the unit tests assume that the C: drive, where the tests are run, contains a set of folders and

files. One or more of the files from the C:\Music\Metallica folder and subfolders are used in the unit

tests, and if they are not present the tests will fail.

While unit tests are great for ensuring developed functions work as expected and that any code

refactoring continues to work as expected, they do not provide much help in identifying performance

problems (as is evident when Adding an Indexed Location). The unit tests perform actions and test that

the result of the action is as expected, they run automatically (on user request or code check-in) and do

not care how long it takes to complete an operation. When manually testing functionality, by entering data

and pressing buttons, you have to wait for a result and if that result takes 100 seconds there must be a

performance problem somewhere that needs to be looked at.

6.2 Unit Test for every use case

Unit tests are a great tool for ensuring functionality as expected, as functions are written and later when

they may be changed, but the unit tests also take time to write, at times longer than the actual code

they test.

The updating of metadata and search index is accomplished by using the same function for 5-6 different

actions changing data, the update process is the same, but to test that the update is correct for each use

case, several unit tests have to be written, to suit each action causing a change. Thus the creation of the

unit tests becomes more time consuming than the actual process of writing the code to update search

and metadata values in the database.


To save time, not every piece of functionality will get a unit test. This means that code changes may end up

breaking things and go unnoticed for a while. This is bad test practice and will usually result in problems

at a later time. But as the system grows and allows for more interaction, so does the time needed to write the

unit tests for every possible interaction, taking time from the actual implementation and using it on

writing unit tests instead. The basic functionality of the system (e.g. create file object, link tag to file,

rename tag/file, name collision resolving, etc.) is sufficiently tested with unit tests that the more

complex operations, that use the simpler and well-tested functionality, can be written and manually

tested. Once time allows it, unit tests should be written for the more complex system operations.

6.3 Existing Unit Tests

Each unit test ensures that, given a valid input some operation performs as expected, and when the

input is invalid the operations can detect it and abort nicely. Listing possible valid and invalid inputs and

operations tested by the unit tests is tedious work and does not make for interesting reading; instead

the source code for each unit test should be consulted to learn what they check for.

For a quick overview of the existing unit tests, here is a list of them.

• MFSDokanCacheTest

o FetchTest()

• MFSFileTest

o TagCombinationExistsTest()

o QueryByPathTest()

• MFSFunctionsTest

o SplitPathTest()

• MFSIndexedLocationTest

o TransferFileResponsibilityTest()

• MFSIndexManagerTest

o AddLocationTest()

o RemoveLocationTest()

o ExistTest()

o GetParentLocationTest()

o GetChildLocationTest()

• MFSTest

o AddIndexedLocationTest()

o FileExistsTest()

o PathExistsTest()

o CreateTagTest()

o RenameTagTest()

• SearchTest

o FindTest()


6.4 Activation Depth

A huge performance problem was found and traced to activation. The tag and file objects are closely

linked and the standard activation depth of 5 can easily end up activating a large portion of the database

objects.

Adding many files

Some of the final tests that were supposed to compare the WinAmp search on the media library of 11k

to a MetaFS search on 11k files have exposed a huge problem. As more and more files (and thus search

entries) are added, the time to add a new file increases. The appendix contains some graphs of how long it

takes for 100 files to be added once x files have already been added, e.g. with 2000 files added, the next

100 files take 160 seconds to add. This is a long time, and using the

performance analyzer tool of Visual Studio it has been traced to the activation of objects (appendix 9.10). But

when adding objects it should be as simple as adding an item to the hashset object we use and store the

updated list and the newly added item. The existing items in the list should not need to be activated,

simply get the keyword object, and insert a new entry into the list of details, but for some reason there

seems to be a lot of time spent on activation.

Removing many files

The same performance problem affects the removal of many files (only tested when removing indexed

location, but delete file is likely affected too). The problem is even bigger when removing file objects

and search entries. This does make sense due to the way search entries are removed. The keyword

object is located and the list of details is traversed to find the detail entry that needs to be removed.

This approach requires that the details of the list are actually loaded from the database. The current way

of looping through each detail for a match may be causing unnecessary activation. It may be possible to

use the database engine to remove the object without having to instantiate all of the list items.

6.5 Query time measuring

To query for an object is fast, but the time it takes to read that object depends on how many linked

objects must be loaded based on the activation depth set for the database or a particular class

(Appendix, 9.11 Activation impact on queries). Because of this, the query time measured when

searching the database is the time it takes to populate a list of file objects based on the result of the

search query. The query time covers querying for the KeyWord object matching the search word and

loading the list of KeyWordDetails and adding an entry for each unique file object entry that appears in


the KeyWordDetails. The file list is a HashSet which means it runs the HashCode and IEquals functions of

file objects, forcing activation of file objects.

6.6 Query time, Winamp vs. MetaFS

Winamp was selected as the target for comparison for a few reasons. It was already installed on the VM

host, to be able to listen to music while working. Winamp was known to already support ID3 reading

and searching, saving time in locating applications with similar features. Winamp has been one of the

primary MP3 players around for the past 10 years (based on personal experience) and one would

assume they know what they’re doing after that long. Google Desktop was installed, but interfered with

the Windows Search functionality and was uninstalled, upon reinstallation it stopped working.

Furthermore it is unknown if Google Desktop indexes ID3 data, the same applies to the Windows Search

feature. Another thing promoting the use of Winamp is the fact that they include the search time for

every search, so it is possible to tell that a search took 411 ms instead of estimating it at ½ or 1 second,

which is what a manual search time estimate would be able to produce when a search is performed.

The approach for testing query times consists of entering the search term and repeating the same

search 5 times in a row in an attempt to find a pattern in search times. Figure 16 and Figure 17 show

the result of these 5 searches, numbered 1 to 5 (one being the first search).

Figure 16 - Query time in seconds for Winamp and MetaFS searches

Figure 16 shows some query times, in seconds, for Winamp and MetaFS for a comparison of their

performance.


The 2.796 second spike in query time of the first MetaFS query for ‘Rock’ is because the application was

just started and only had indexed locations, options and debug options objects loaded from the DB.

Winamp working memory: 108MB.

Winamp Media Library File Size (C:\Users\<user>\AppData\Roaming\Winamp\Plugins\ml\main.dat): 6,24MB

MetaFS working memory: 52MB.

MetaFS Database File Size (Metafs.yap): 61,5MB

Winamp is very consistent with regards to query time, approximately 400ms per search word. The

MetaFS query time depends on the amount of objects in the result, more items in the result means

more objects require activation. With few items in the result and searching on 2 words or more, the

MetaFS is comparable to Winamp on the initial search, and the user rarely does double searches for the

same word. Double searches do come into consideration when doing an initial search, followed by a

search with an added AND ‘someword’, e.g. “Rock” and then “Rock AND Acoustic”, in which case the

items matching Rock are already activated, and only “Acoustic” items need activation. On searches 2-5 a

query for “Rock” takes 21-29 ms, a search for “Acoustic” takes 2-10ms, but a search for both takes 48-

161ms despite the AND operation only consisting of an intersection of 2 HashSet lists. This increase in

search time for 2 words is possibly caused by the following: when the search for the first word, “Rock”, is run, any “Acoustic”

objects activated by the last query are discarded and “Rock” objects activated instead, requiring the 2nd

part of the search to re-activate the “Acoustic” objects. This theory is supported by the 3 word search

being pretty consistent in query time all 5 times, 525-660ms, discarding cached data and re-activating

objects for each word.
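
The AND operation itself can be sketched as a set intersection. In this Python illustration the index maps a keyword to a list of detail entries, each naming the file it belongs to; this layout, the lower-casing of search words and the function names are assumptions and not the actual KeyWord and KeyWordDetails classes.

```python
def files_for_keyword(index, word):
    """Collect the unique files referenced by a keyword's detail entries."""
    # Lower-casing the search word is an assumption of this sketch.
    return {detail["file"] for detail in index.get(word.lower(), [])}


def search_and(index, words):
    """Files matching every word: the intersection of the per-word sets."""
    result = None
    for word in words:
        files = files_for_keyword(index, word)
        result = files if result is None else result & files
    return result or set()
```

The intersection itself is cheap. In the measurements above the cost lies in building the per-word file sets, which is where activation takes place.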

Winamp performs beginswith and endswith searches, which is evident when searching for a word like

“on” (4019 results in Winamp, 492 results in MetaFS). The db4o database seems unable to use the index

on keyword objects when performing beginswith and endswith queries. Possibly the index is hash based,

making it unusable for anything but exact matching.

Figure 17 - Query times for large result sets

Figure 17 shows some bad scenarios for the MetaFS. Handling of large result sets means activating a

large amount of objects, causing an increase in query time.


6.7 Filter time, Winamp vs. MetaFS

Once a search is done we can filter the result by different attributes of the MP3 files.

A search for “rock” and applying a filter like Year=2006 takes 0,369 seconds in Winamp.

In the MetaFS a search for “rock” is performed and the attribute “ExtMp3.Year” is selected, which

spends 1,275 seconds populating the list with years, selecting 2006 returns a result in 4ms.

Selecting another year in the MetaFS returns a new result in 1-3ms, in Winamp, selecting a new year

requires another 350ms.

The MetaFS is slower to populate the list with metadata values initially, but once populated, all the

metadata objects are activated, and all metadata can be accessed in very little time. Changing the

attribute from year to genre populates the list of genre values in 7ms, and selecting a genre returns a

result in 1-3ms.
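
The behaviour measured above, one slow pass followed by fast selections, matches a model where the search result is grouped by attribute value once and later selections are plain lookups. The following Python sketch illustrates that model only; the function name and the dictionary layout of the file entries are hypothetical.

```python
from collections import defaultdict


def group_by_attribute(files, attribute):
    """Build value -> list of file names once; later selections are dictionary lookups."""
    groups = defaultdict(list)
    for f in files:
        value = f["metadata"].get(attribute)
        if value is not None:
            groups[value].append(f["name"])
    return dict(groups)
```

Selecting a year then amounts to reading one entry of the returned dictionary, which corresponds to the fast step in the measurements.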

Winamp performs a new search every time the attribute selection or type is changed, resulting in a filter

time of around 0,400 seconds per search word.

6.8 AddLocation time trace

In the process of comparing the search performance to that of Winamp they need to include the same

files and this requires adding 11.000 files to the MetaFS DB. This uncovered a huge performance

problem where it would take several minutes to add 900 files to an empty database.

Using the Visual Studio Performance Explorer tool, the time-hog is identified to be Activate() calls to the

DB (9.10 AddLocation Performance Report).

Examining the Instrumentation Profiling Report closer the activation problem can be traced to the adding

of search index entries.

6.9 DropBox

During testing the process ID of dropbox.exe was discovered in requests to the VHD, which was

unexpected as it has no reason to look at a different hard drive or folder than the designated DropBox

folder. Interestingly enough, Microsoft Security Essentials has not yet shown itself in VHD requests.

6.10 Bugs

• When in a folder, trying to create a new tag-folder where the name is already used by another

tag-folder, a new tag-folder with the same name is added to the DB, causing a tag name collision

in the DB.

o This was tested to work during implementation, but without a unit test to guard it, it has since been broken,

possibly during a change in the activation level and method of different objects in an

attempt to decrease the activation time problem of AddLocation.


6.11 Results

6.11.1 Fundamental requirements

In the analysis we listed the 8 fundamental requirements suggested for a PIM (Personal Information

Management) system. The following is a re-listing of these and how well they have been fulfilled.

1. Be Compatible with Current User Habits

The VHD allows the use of Windows Explorer to browse files, and files can be read and written on the VHD.

2. Minimal Interference

Adding a location and mounting the VHD is all it takes, but perhaps the current UI

contains too many options that should be hidden away from the average user.

3. Support Multiple Contexts

The tag based browsing of the VHD allows different paths to present the same file.

4. Support Browsing

The browsing of the VHD is how backwards compatibility is achieved, covering this

point.

5. No Unnecessary Limitations

The MetaFS fails on this point: working with small file sets (1,000 files) is no problem, but larger file sets (10,000 files) take a long time to add.

6. Transparency

The original files are left in their folders, only their content is changed on read/write

requests.

7. Provide for Expiry Dates

This feature was never part of the plan.

8. Add Metadata While Storing

By creating new folders when saving files, these new folders are applied to the saved file

as tags.

6.11.2 Performance

Adding files to the MetaFS DB currently takes so long that it is practically unusable.

Once the files have been added, the search performance is comparable to that of Winamp when dealing with multiple search words and result sets of around 600. When dealing with single words or large result sets, the amount of activation required eats up any time saved on the search word lookup.

When it comes to filtering, the performance of MetaFS is superior to that of Winamp. As with searching, the first filtering operation takes some time to complete due to activation, but once it is done, listing values for another attribute or filtering by a selected value takes less than 20 ms (Figure 21).


The VHD browse-by-tag feature seems to be working as expected, but it is hard to predict how useful or annoying it will turn out to be when used on a daily basis in place of the traditional hierarchical file structure. With the 11253 files added there are 705 tags, 558 of which are shown at root level because they have more than 10 files associated with them. With a limit of 20, only 141 are shown, but that is still a lot of folders to look through when browsing from the root. The number of tags may prove too overwhelming to use on a daily basis, and it may be necessary to reduce the number of tags created automatically. Tagclouds are common on tagging websites and could be an option.
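The root-level limit discussed above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the MetaFS implementation; the function name `root_level_tags`, the dictionary layout and the sample files are assumptions made only for this example.

```python
from collections import Counter


def root_level_tags(file_tags, min_files=10):
    """Return the tags referencing more than `min_files` files, sorted by name."""
    # Count, for every tag, how many files carry it.
    counts = Counter(tag for tags in file_tags.values() for tag in set(tags))
    return sorted(tag for tag, count in counts.items() if count > min_files)


# Hypothetical sample: 12 rock files and 3 jazz files, all tagged "Music".
file_tags = {f"rock_{i}.mp3": {"Music", "Rock"} for i in range(12)}
file_tags.update({f"jazz_{i}.mp3": {"Music", "Jazz"} for i in range(3)})

print(root_level_tags(file_tags, min_files=10))  # ['Music', 'Rock']
print(root_level_tags(file_tags, min_files=2))   # ['Jazz', 'Music', 'Rock']
```

Raising the limit hides more of the rarely used tags at the root, at the cost of making the files behind them reachable only through other tags or through search.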

Overall the MetaFS needs some redesigning of the search index to be able to handle file (and thus

keyword) adding much faster.

The initial spikes in search and filter times are problematic and should be improved upon.

There are currently some issues with the MetaFS, some huge, like the adding of files, some small, like the high initial query time. But with most of the system designed with little knowledge of the db4o DB and OODBs in general, there are several possible improvement methods available, e.g. redesigning the search objects, using automatic activation, changing the memory allocated to objects, and pre-loading objects before a search is performed.

Automatic activation is available and might have been a better choice, but manual activation was chosen in an attempt to keep object storing and loading to a minimum, a complex task with the highly connected objects of the MetaFS.

6.12 Screenshots

In this section we present some screenshots that show the use of MetaFS and some of the 5 search

queries used in the comparison between MetaFS and Winamp.

Figure 18 - Location management and VHD options

Figure 19 - A search for "mp3" with filtered result (Def Leppard selected as performer in 3rd list)


Figure 20 - Debug output of search times

Figure 21 - Debug output of filter times, with all 11253 files as filter base

Figure 22 - No counter on tag-folders

Figure 23 - Counter on tag-folders


Figure 24 - Drilldown: Roskilde (175 files with this tag), Red Hot Chili Peppers (20 files with this tag)

Figure 24 is an example of a simple drilldown, using “roskilde” as the first tag and “Red Hot Chili Peppers” as the second. The remaining 2 tags (“Shared Folders” and “vmware-host”) provide no further drilldown, as each of them would list all 20 files already listed by the current view, as indicated by the [#] counter.

Alternatively the “roskilde” tag could have been skipped and “Red Hot Chili Peppers” selected as the

first drilldown, giving the 20 files in a single drilldown action.
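The drilldown in Figure 24 can be sketched as a set intersection over tags. The following Python sketch is only an illustration under stated assumptions: the function name `drill_down` and the data layout are hypothetical, and the 155 non-RHCP files are invented filler so that the counts match the 175 and 20 files mentioned above.

```python
def drill_down(file_tags, selected):
    """Return (files, sub_tags) for the view reached by selecting `selected`.

    files:    the files that carry every selected tag.
    sub_tags: the remaining tags on those files, with the number of listed
              files carrying each one (the [#] counter on a tag-folder).
    """
    selected = set(selected)
    files = sorted(name for name, tags in file_tags.items() if selected <= set(tags))
    sub_tags = {}
    for name in files:
        for tag in set(file_tags[name]) - selected:
            sub_tags[tag] = sub_tags.get(tag, 0) + 1
    return files, sub_tags


# Hypothetical data shaped like Figure 24.
rhcp = {"roskilde", "Red Hot Chili Peppers", "Shared Folders", "vmware-host"}
other = {"roskilde", "Shared Folders", "vmware-host"}
file_tags = {f"rhcp_{i:02}.mp3": rhcp for i in range(20)}
file_tags.update({f"other_{i:03}.mp3": other for i in range(155)})

files, sub_tags = drill_down(file_tags, ["roskilde", "Red Hot Chili Peppers"])
print(len(files), sub_tags)
```

A tag-folder whose counter equals the number of files in the current view, as for the two remaining tags here, cannot narrow the view any further.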

Chapter 7

7 Conclusion

Building a tag based file system and making it available through the traditional hierarchical browse method of Windows Explorer presents some tricky problems, as Explorer is not well suited for presenting a tag based file system. The VHD interface presented here is somewhat limited and at times a bit awkward to use through a hierarchical file browser. It should be seen as a hybrid of hierarchical and tag based file access, intended to support all existing applications that currently only support hierarchical file access. If a tag based file system is ever to gain widespread use it needs an access method of its own, supported by other applications, just as saving/opening a file in a hierarchical file system is standard in most applications today.

A virtual hard drive is a relatively simple way of providing a file system like interface to something that is more than just a normal hard drive, and it is a bit puzzling that nothing official exists for creating VHD interfaces. Dokan is a great library that does the job, but more documentation on the different operations would have been welcome.

The db4o database is easy to work with and removes the impedance mismatch problem, but at the same time it introduces a new problem: activation. The OODB allows for an alternative way of storing data, other than the classical relational approach. This is both a blessing and a curse, as is evident from the design and testing of the inverted index, which was designed to find a single keyword object by searching the DB and to access the remaining objects by reference when searching and filtering.

Looking at the original 4 objectives, 3 of them have been met.

1. Search feature with metadata search capability à la Spotlight (Mac).

Data can be searched and afterwards filtered on selected attributes.

2. Saved searches à la Smart Folder (Mac)/Virtual Folders (Win).

Not implemented.

3. Tags as known from the internet (Gmail.com, flickr.com, del.icio.us).

The VHD can be used to apply and remove tags.

4. An alternative way of browsing files; Browse by tags instead of folders.

The VHD allows browsing files by selecting more and more tags that the target file must

be associated with.

Of the 8 fundamental requirements for PIM, listed in the analysis section, 6 of them have been met

(6.11 Results).

Some work and bug fixing still remains to be done, but overall the MetaFS offers a new way of accessing

files, fulfilling many of the requirements expected of a PIM system. Sadly it only works efficiently on a

small scale (up to 1,000 files) in its current form.

7.1 Future Work

• Using the current Windows Explorer, which was designed for hierarchical file systems, for a tag based system is clumsy at times. Moving files to another folder sometimes means adding tags and sometimes removing tags, but Explorer treats every move as a hierarchical move and removes the files from the current view, even if the operation was an add tag that should not affect the current view. A file browser supporting tag based browsing is required to better support tagging of files, support undo operations and get rid of the unusual behavior that comes with using a hierarchical browser.

• A smart-folder feature is not currently available, but should be added, allowing searches to be

saved and accessed through the VHD.

• Currently the search results from the search window cannot be opened directly; the result is just a list of file names that match the search. A right-click menu or support for double clicking file names should be added so that files can be opened. Additionally, the result of the search could be made available in a special folder on the VHD.

• When viewing a path, prepending a “!” to a tag-folder name should exclude files with that tag from the files presented (e.g. \Rock\!2006).

• Introduce the use of stopwords and possibly min word length on search words to reduce

KeyWordDetails objects. (0 is a bad value to index on due to it being the default value for

integer objects and most MP3 files don’t have a disc value specified, making it default to 0.)

• When a new extension is detected, a scan of the files in the database should be performed to find those supported by the new extension, and their metadata should be extracted using the new extension.
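The proposed “!” exclusion syntax could work roughly as in the Python sketch below. This is an illustration of the idea only, since the feature is not implemented in MetaFS; the function names `parse_tag_path` and `files_for_path` and the sample files are assumptions made for the example.

```python
def parse_tag_path(path):
    """Split a VHD path into (required, excluded) tag sets.

    A path part starting with "!" names a tag that files must NOT have.
    """
    required, excluded = set(), set()
    for part in path.strip("\\").split("\\"):
        if part.startswith("!") and len(part) > 1:
            excluded.add(part[1:])
        elif part and part != "!":
            required.add(part)
    return required, excluded


def files_for_path(file_tags, path):
    """List files that carry all required tags and none of the excluded ones."""
    required, excluded = parse_tag_path(path)
    return sorted(
        name
        for name, tags in file_tags.items()
        if required <= set(tags) and not (excluded & set(tags))
    )


# Hypothetical sample files and tags.
file_tags = {
    "a.mp3": {"Rock", "2006"},
    "b.mp3": {"Rock", "2005"},
    "c.mp3": {"Pop", "2005"},
}
print(files_for_path(file_tags, r"\Rock\!2006"))  # ['b.mp3']
```

One consequence of this design is that a real tag whose name starts with “!” could no longer be addressed through the VHD, so such names would have to be disallowed or escaped.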

Chapter 8

8 Bibliography

[1] Robert Freund, "File Systems and Usability - the Missing Link," University of Osnabrück, Bachelor's

Thesis 2007.

[2] Wikipedia. NTFS. [Online]. http://en.wikipedia.org/wiki/NTFS

[3] Jakob Nielsen and Don Gentner. (1996) The Anti-Mac User Interface. [Online].

http://www.useit.com/papers/anti-mac.html

[4] F. Clymer, "Treasury of Early American Automobiles," 1950.

[5] Onne Gorter. (2004) DBFS - Database File System. [Online]. http://dbfs.sourceforge.net/

[6] Paul Thurrott. (2005, August) Paul Thurrott's SuperSite for Windows: Windows Storage

Foundation (WinFS) Preview. [Online].

http://www.winsupersite.com/showcase/winfs_preview.asp

[7] Seth Nickell. GNOME Storage. [Online]. http://people.gnome.org/~seth/storage/

[8] Apple Inc. (2006, June) Working with Spotlight. [Online].

http://developer.apple.com/macosx/spotlight.html

[9] Wikipedia. Spotlight (software). [Online].

http://en.wikipedia.org/wiki/Spotlight_%28software%29

[10] Microsoft. (2005-2006) WinFS Team Blog. [Online]. http://blogs.msdn.com/winfs/

[11] Wikipedia. File system - Wikipedia, the free encyclopedia. [Online].

http://en.wikipedia.org/wiki/File_system

[12] Versant. db4o Blob Implementation. [Online].

http://developer.db4o.com/Documentation/Reference/db4o-7.12/net35/reference/Content/implementation_strategies/type_handling/blobs/db4o_blob_implementation.htm?SearchType=Stem&Highlight=blobs|Blob|Blobs|blob|BLOBs|BLOB

[13] Apple Inc. Apple - Mac OS X - What is Mac OS X - Spotlight. [Online].

http://www.apple.com/macosx/what-is-macosx/spotlight.html

[14] Apple. The Smart Mac: Smart Folders in OS X: Apple. [Online].

http://gigaom.com/apple/the-smart-mac-smart-folders-in-os-x/

[15] Apple Inc. Apple - Mac OS X - What is Mac OS X - Time Machine. [Online].

http://www.apple.com/macosx/what-is-macosx/time-machine.html

[16] Google. Google Desktop Download. [Online]. http://desktop.google.com/

[17] Google. Google Desktop Filetypes. [Online]. http://desktop.google.com/filetypes.html

[18] Microsoft. Windows Search. [Online].

http://www.microsoft.com/windows/products/winfamily/desktopsearch/default.mspx

[19] Microsoft. Windows Search Overview (Windows). [Online].

http://msdn.microsoft.com/en-us/library/aa965362%28VS.85%29.aspx

[20] Microsoft. IFilter Interface (Windows). [Online].

http://msdn.microsoft.com/en-us/library/bb266451%28v=VS.85%29.aspx

[21] Foxit Corporation. Foxit Software - Foxit PDF IFilter 2.0. [Online].

http://www.foxitsoftware.com/pdf/ifilter/

[22] Adobe System Inc. Adobe - Acrobat : For Windows : Adobe PDF iFilter 9 for 64-bit platforms.

[Online]. http://www.adobe.com/support/downloads/detail.jsp?ftpID=4025

[23] Apple Computer. (2005, Dec) Search Kit Programming Guide: Search Basics. [Online].

http://developer.apple.com/library/mac/#documentation/UserExperience/Conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html

[24] SearchTools.com. Guide to Search Tools: Why Searches Fail. [Online].

http://www.searchtools.com/info/whysearchesfail.html

[25] Google. Files: I can't find: - Desktop for Windows Help. [Online].

http://desktop.google.com/support/bin/answer.py?hl=en&answer=13754

[26] Versant. db4o : Java &.NET Object Database - Open Source Object Database, Open Source

Persistence, Oodb. [Online]. http://db4o.com/about/company/

[27] C.J. Date, "ACID properties," in An introduction to Database Systems.: Addison-Wesley, 2000, ch.

pp. 459; 843; 845-846.

[28] Rick Grehan, "The Database Behind the Brains," db4o Whitepaper, March 2006,

http://www.db4o.com/about/productinformation/whitepapers/.

[29] Wikipedia. IBM Rational Unified Process - Wikipedia, the free encyclopedia. [Online].

http://en.wikipedia.org/wiki/IBM_Rational_Unified_Process

[30] Wikipedia. Unified Modeling Language. [Online].

http://en.wikipedia.org/wiki/Unified_Modeling_Language

[31] Craig Larman, Applying UML and Patterns, 2nd ed.: Prentice Hall PTR, 2002.

[32] Margo Seltzer and Nicholas Murphy, "Hierarchical File Systems are Dead," Harvard School of

Engineering and Applied Sciences,.

[33] Margo Seltzer and Nicholas Murphy. (2009, May) Hierarchical File Systems are Dead Slides.

[Online]. http://www.eecs.harvard.edu/~margo/papers/hotos09/slides.pdf

[34] Google. Tagging photos : Search and Locate - Picasa Help. [Online].

http://picasa.google.com/support/bin/answer.py?hl=en&answer=106209

[35] Karl Voit, Keith Andrews, and Wolfgang Slany, "Why Personal Information Management (PIM)

Technologies Are Not Widespread - And What to do About It".

[36] O Bergman, R Beyth-Marom, R Nachmias, N Gradovitch, and S Whittaker, "Improved Search

Engines and Navigation Preference in Personal Information Management.," Transactions on

Information Systems, pp. 26(4):1-24, September 2008.

[37] Stijn Dekeyser, Richard Watson, and Lasse Motrøen, "A Model, Schema, and Interface for

Metadata File Systems," in Australasian Computer Science Conference (ACSC), Wollongong,

Australia, 2008, pp. 17-26.

[38] Hiroki. Dokan Library. [Online]. http://dokan-dev.net/en/docs/dokan-readme/

[39] Google. Can I search the full text of long documents? [Online].

http://desktop.google.com/support/linux/bin/answer.py?hl=en&answer=62977

[40] Wikipedia. Comparison of file systems - Wikipedia, the free encyclopedia. [Online].

http://en.wikipedia.org/wiki/Comparison_of_file_systems

[41] Unknown Author. TagLib Sharp. [Online].

http://developer.novell.com/wiki/index.php/TagLib_Sharp

[42] Microsoft. Windows Search as a Development Platform (Windows). [Online].

http://msdn.microsoft.com/en-us/library/bb331575%28v=VS.85%29.aspx#adding_new_file_format

[43] David A. Wiley. (1999) An Intelligent Method for Searching Metadata Spaces. [Online].

http://opencontent.org/docs/if-search.pdf

[44] Hiroki. Dokan - user mode file system for windows. [Online]. http://dokan-dev.net/en

[45] Microsoft. Dictionary(TKey, TValue).ContainsKey Method(System.Collections.Generic). [Online].

http://msdn.microsoft.com/query/dev10.query?appId=Dev10IDEF1&l=EN-US&k=k(%22SYSTEM.COLLECTIONS.GENERIC.DICTIONARY%602.CONTAINSKEY%22);k(TargetFrameworkMoniker-%22.NETFRAMEWORK%2cVERSION%3dV4.0%22);k(DevLang-CSHARP)&rd=true

[46] Microsoft. List(T).Contains Method (System.Collections.Generic). [Online].

http://msdn.microsoft.com/query/dev10.query?appId=Dev10IDEF1&l=EN-US&k=k(%22SYSTEM.COLLECTIONS.GENERIC.LIST%601.CONTAINS%22);k(TargetFrameworkMoniker-%22.NETFRAMEWORK%2cVERSION%3dV4.0%22);k(DevLang-CSHARP)&rd=true

[47] Microsoft. Virus, Spyware & Malware Protection | Microsoft Security Essentials. [Online].

http://www.microsoft.com/security_essentials/

[48] Versant. db4o : Java &.NET Object Database - Open Source Object Database, Open Source

Persistence, Oodb. [Online]. http://www.db4o.com/

[49] Microsoft. Microsoft Visual Studio 2010 - The Official Site of Visual Studio 2010. [Online].

http://www.microsoft.com/visualstudio/en-us

[50] Microsoft. Visual Studio Team Foundation Server 2010 | Microsoft Visual Studio. [Online].

http://www.microsoft.com/visualstudio/en-us/products/2010-editions/team-foundation-server

[51] Jody Foo and Kevin McGee, "DocPlayer - Design Insights from Applying the Non-Hierarchical

Media-Player model to Document Management," Linköpings Universitet, Sweden, Master Thesis

2003.

[52] Roman Stoffel. (2009, Sep.) db4o: Object-Identity and Higl-Level-Caching | Gamlor. [Online].

http://www.gamlor.info/wordpress/?p=654

[53] Wikipedia. Filesystem in Userspace. [Online].

http://en.wikipedia.org/wiki/Filesystem_in_Userspace

[54] MusicBrainz. MusicBrainz Picard - MusicBrainz. [Online].

http://musicbrainz.org/doc/MusicBrainz_Picard

[55] Versant. Native Queries. [Online].

http://developer.db4o.com/Documentation/Reference/db4o-7.12/net35/reference/Content/object_lifecycle/querying/native_queries.htm#kanchor33

[56] Apple Computer. Search Kit Programming Guide: Search Basics. [Online].

http://developer.apple.com/library/mac/#documentation/UserExperience/Conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html

[57] Versant. SODA Evaluations. [Online].


7.12/net35/reference/Content/object_lifecycle/querying/soda/soda_evaluations.htm#kanchor38

[58] Versant. SODA Special Cases. [Online].


7.12/net35/reference/Content/object_lifecycle/querying/soda/soda_special_case_examples.htm

#kanchor37

[59] Versant. Constraint (db4o - database for objects - documentation). [Online].

https://developer.db4o.com/Documentation/Reference/db4o-7.4/java/api/com/db4o/query/Constraint.html

[60] SearchTools.com. Background Information About Search Tools. [Online].

http://www.searchtools.com/info/index.html

[61] Matthew Thomas, "When good interfaces go crufty," August 2004.

[62] Jakob Nielsen and Don Gentner. (1996) The Anti-Mac Interface. [Online].

http://www.useit.com/papers/anti-mac.html

[63] William R. Cook and Siddhartha Rai, "Safe Query Objects: Statically Typed Objects as Remotely

Executable Queries".

[64] Richard Pak, Steven Pautz, and Rebecca Iden, "Information organization and retrieval: An

assessment of taxonomical and tagging systems.," Cognitive Technology, no. 12(1), pp. 31-44,

2007.

[65] Shanshan Ma and Susan Wiedenbeck, "File Management with Hierarchical Folders and Tags," in

CHI 2009 ~ Spotlight on Works in Progress ~ Session 1, Boston, USA, April 4-9 2009, pp. 3745-3750.

[66] Christopher Peery, Wei Wang, Amélie Marian, and Thu D. Nguyen, "Multi-Dimensional Search for

Personal Information Management Systems," March 2008.

[67] N. H. Gehani, H. V. Jagadish, and W. D. Roome, "OdeFS: A File System Interface to an Object-Oriented Database," in Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994, pp. 249-260.

[68] Randy B. Singer. Macintosh OS X Routine Maintenance. [Online].

http://www.macattorney.com/ts.html

[69] Jing-Shin Chang. Hierarchical Web Document Classification Based on Hierarchically Trained

Domain Specific Words.

[70] Apple Inc. (2005, Dec.) Search Kit Programming Guide. [Online].

http://developer.apple.com/library/mac/documentation/UserExperience/Conceptual/SearchKitConcepts/SearchKitConcepts.pdf

[71] William R. Cook and Carl Rosenberger, "Native Queries for Persistent Objects A Design White

Paper," August 2005.

[72] Aditya Kashyap, "File System Extensibility and Reliability Using an in-Kernel Database," Stony

Brook University, Master Thesis 2004.

[73] William Denton. (2009, March) How To Make A Faceted Classification And Put It On The Web.

[Online]. http://miskatonic.org/library/facet-web-howto.html

[74] Craig A. N. Soules and Gregory R. Ganger, "Why can't I find my Files? New methods for

automating attribute assignment," in Proceedings of the Ninth Workshop on Hot Topics in

Operating systems, 2003.

[75] Marti A. Hearst, "UIs for Faceted Navigation - Recent Advances and Remaining Open Problems,".

[76] Jakob Nielsen. (1995) Navigating Large Information Spaces. [Online].

http://www.useit.com/papers/navigating_large_information_spaces/

[77] Duen Horng Chau, Brad Myers, and Andrew Faulring, "What to Do When Search Fails: Finding

Information by Association," , 2008.

[78] Alexander K. Ames, "A Free Associative File System".

[79] Wikipedia. Enterprise Objects Framework. [Online].

http://en.wikipedia.org/wiki/Enterprise_Objects_Framework

[80] Rosario De Chiara and Andrew Fish, "Eulerview with projections: non hierarchical visualisation,"

2008.

[81] Versant, "db4o Replication System (dRS)".

[82] Mike Padilla. (2008, April) User Interface Implementations of Faceted Browsing. [Online].

http://www.digital-web.com/articles/user_interface_implementations_of_faceted_browsing/

[83] Kirk McElhearn. Hand-code Smart Folders. [Online].

http://www.macworld.com/article/60386/2007/10/handcode.html

[84] Eric Falsken, "Enabling the Mobile Enterprise with db4o".

[85] Heather Meeker, "db4objects and the Dual Licensing Model".

[86] Rick Grehan, "Complex Object Structures, Persistence, and db4o".

[87] Versant, "db4o Open Source Object Database".

[88] Rick Grehan, "The Database Behind the Brains," March 2006.

[89] Scott Ambler, "Agile Techniques for Object Databases".

[90] Nick Murphy, Mark Tonkelowitz, and Mike Vernal, "The Design and Implementation of the

Database File System," 2002.

[91] Zhichen Xu, Magnus Karlsson, Chunqiang Tang, and Christos Karamanolis, "Towards a Semantic-Aware File Store,".

[92] Seth Nickell, "A Cognitive Defense of Associative Interfaces for Object Reference,".

[93] Casey Marshall, "Birch: A Metadata Search File System," University of California, Santa Cruz, 2006.

[94] Stephan Bloehdorn and Max Völkel, "TagFS - Tag Semantics for Hierarchical File Systems," 2006.

[95] Ka-Ping Yee, Kirsten Swearingen, Kevin Li, and Marti Hearst, "Faceted Metadata for Image Search

and Browsing," in CHI 2003, 2003.

[96] David K. Gifford, Pierre Jouvelot, Mark A. Sheldon, and James W. O'Toole, Jr., "Semantic File Systems,".

[97] Won Kim, Introduction to Object-Oriented Databases.: The MIT Press, 1990.

[98] Francois Bancilhon, Claude Delobel, and Paris Kanellakis, Building An Object-Oriented Database

System - The Story of O2, Bruce M. Spatz, Ed.: Morgan Kaufmann Publishers, Inc., 1992.

9 Appendix

9.1 Data Dictionary

ACID (Atomicity,

Consistency,

Isolation, Durability)

ACID is a set of properties that guarantee database transactions are processed

reliably.

CASE tools

(Computer-aided

Software

Engineering)

Software that assists in the development of other software, by allowing easier

creation of diagrams, flowcharts, program development, testing, etc, (all of the

areas of software engineering).

Crawl (dir crawl) The process of inspecting a folder and subfolders to identify all the files located

inside the folder and its subfolders.

Daemon

A background process.

Dead tag A dead tag is a tag that does not reference any files.

It was either created automatically when indexed locations were added and

then later removed, leaving the tag behind, or it was created by the user and

never assigned any file references.

Drill-Down A term common to Business Intelligence describing the act of narrowing down

the items summarized to get a more detailed view of items. For a company the

top level could be the total sales, drill-down could be performed something like

this: total sales->country->city->store->cash register->day->hour. Tags work in

a similar manner; you select more tags to limit the items presented. As more

and more tags are selected, the amount of items displayed will diminish. The

act of selecting more and more tags is described by the term drill-down, as you

“drill-down” through layers of mostly irrelevant files.

FileStore A folder on a real hard drive where the MetaFS can save files created through

the VHD interface.

FileSystemWatcher Listens to the file system change notifications and raises events when a

directory, or file in a directory, changes.

Finder (Mac) The Mac version of the Windows Explorer, used for viewing files and

directories.

Folder/Directory The terms folder and directory are both used to represent the same thing, and

at times directory may even be shortened to dir (tech people are lazy).

Folders function as containers for files and other folders.

The path “C:\Music\Metallica\Reload” contains the 3 folders: “Music”,

“Metallica” and “Reload”.

FUSE (Filesystem in

Userspace)

“FUSE is a loadable kernel module for Unix-like computer operating systems

that lets non-privileged users create their own file systems without editing

kernel code.” [53]

I/O (input/output)

Reading and writing to/from devices attached to the computer, typically refers

to file operations on a hard disk.

ID3 tag

A format for putting metadata into music files, most commonly mp3 files. It

allows storing information about artist, track title, track no, genre, etc about a

file, inside the file.

IFilter

An extension to Windows search that allows the extraction of metadata from

additional file types.

Impedance mismatch

The problem with mapping relational database data to and from objects used

in the application.

Indexed location or

Indexed folder

An indexed location is a folder on a real hard drive. The files in this folder (and possibly its subfolders) are added to the MetaFS (or monitored by it through the FileSystemWatcher service) so that they can be found by tags and their metadata recorded for searching.

Inverted Index

A term used to describe an index that maps a word to the records containing that particular word, e.g. the entry “metallica” pointing to all files where the word “metallica” is part of the metadata.

Minimum term

frequency

The more times a word appears in a document, the more likely it is that this

document is of interest when that word is used in a search.

Minimum term length

Some search engines automatically do begins-with and ends-with searches, which for short words (length 3 or less) return a lot of results. A minimum length for search words is therefore needed, typically 3 or 4 characters (this does make some things hard to find, like TV or ER, the TV show).

NQ (Native Query)

db4o allows database queries to be written in the programming language used (C#/Java), instead of writing SQL queries.

Object Container The object that handles requests to the db4o database.

OODB

(Object-Oriented

DataBase)

The type of database used in this thesis. The database knows the structure of the classes and saves the actual objects in the same format as the class is defined in the code; there is no translation as in an RDB.

ORDB (Object-

Relational DataBase)

A mix of the RDB and the OODB. Data is saved as in the RDB, but accessed as

objects as in the OODB.

ORM (Object-

Relational Mapping)

The process of converting the object data into relational data that can be saved

in a relational database.

OS

(Operating System)

Windows XP, Windows 2000, Windows Vista, Windows 7, Mac OS X, Linux, etc.

Path

A path is a sequence of folders, uniquely identifying where to find a certain file

or directory.

“C:\Music\Metallica\Reload” would be the path, uniquely identifying the folder

“Reload”.

Phrase or proximity

search

Some search engines support searching for “The Mad Hatter” which will make

sure the words appear next to each other and not just randomly in the result.

PIM (Personal

Information

Management)

Software that helps the user organize and find files on their personal computer.

PK (Primary Key)

A key that uniquely identifies each row in a table.

QBE

(Query By Example)

A method of querying the db4o database. An example object is passed as

parameter and objects in the database with identical values are returned.

RDB

(Relational DataBase)

The most common database where data is stored in tables and relations exist

as PK links to/from other tables.

SODA (Simple Object

Data Access)

SODA is the internal query system of db4o; QBE, NQ and LINQ queries are translated to SODA. SODA is also available as an API, giving another way of querying the db4o DB (LINQ, NQ, QBE, SODA).

Stemming/suffix

stripping

The act of trimming words to their base form, e.g. “swimming” to “swim” or

“stemmer”, “stemming”, “stemmed” to “stem”.

Stopwords

Words like the, is, at, which, on are very common and provide little search

value and can be omitted from search indexes.

Synonyms

Words with the same meaning. A search for doctor will also return results for physician. Other examples are student and pupil, buy and purchase, sick and ill.

Tagcloud A visual representation of the tags in which the most used tags appear larger

than the less frequently used tags.

Tag-folder

The MetaFS presents itself as a virtual hard drive, and the folders on this VHD function differently from normal folders; to avoid confusion they are given their own name. A tag-folder is a folder on the VHD that represents a tag in the MetaFS.

View

For a normal file system a view of a folder is a list of the subfolders and files

within this folder. For the MetaFS, folders correspond to tags and the view will

then show all files associated with the tag and the tag-folders for all tags

associated with any of the files listed.

Virtual Hard Drive

(VHD)

A VHD is a ’fake’ hard drive, emulated by a piece of software (Dokan), allowing

us to present the user with a familiar interface to the MetaFS.

Winamp

An MP3 player for Windows (www.winamp.com).
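The Inverted Index, Stopwords and Minimum term length entries in this dictionary can be illustrated together with a minimal Python sketch. It is not the MetaFS search index, which is built from db4o objects; the names `build_inverted_index` and `search`, the cut-off of 3 characters and the sample metadata are assumptions chosen for the example.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "is", "at", "which", "on"}  # examples from the Stopwords entry
MIN_TERM_LENGTH = 3  # assumed cut-off; shorter words are not indexed


def index_terms(text):
    """Lower-case the text and keep only the words worth indexing."""
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if w not in STOPWORDS and len(w) >= MIN_TERM_LENGTH]


def build_inverted_index(metadata):
    """Map every indexed word to the set of files whose metadata contains it."""
    index = defaultdict(set)
    for file_name, text in metadata.items():
        for word in index_terms(text):
            index[word].add(file_name)
    return index


def search(index, query):
    """Return the files whose metadata contains every indexed word of the query."""
    words = index_terms(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical metadata strings.
metadata = {
    "reload.mp3": "Metallica Reload The Memory Remains",
    "load.mp3": "Metallica Load Until It Sleeps",
    "er.avi": "ER TV show",
}
index = build_inverted_index(metadata)
print(sorted(search(index, "metallica reload")))  # ['reload.mp3']
```

With these settings neither “ER” nor “TV” is indexed, which shows the drawback of a minimum term length mentioned in that entry.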

9.2 Use Cases

9.2.1 UC1: Add Index Location

Primary actor:

User

Interface:

A custom GUI.

Post conditions:

The path chosen is added to the list of indexed locations and the files from this path are assigned tags

corresponding to their real path (files from the path “C:\Music\Rock\Aerosmith” are tagged with

“Music”, “Rock” and “Aerosmith”).

Basic flow:

1. The user clicks the “Add Index Location” button in the program GUI.

2. The user is presented with a browse folder dialog, from which he chooses 1 folder.

3. The selected folder is added to the list of indexed folders (only the parent folder is stored).

4. For every file in the chosen folder and its subfolders a file object is created.

5. All file objects are linked to tags matching the folder names in the path of the chosen folder.

6. Add search index entries for all new files.
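The tagging step of the basic flow above can be sketched as follows. The sketch uses Python's standard `pathlib` module to split a Windows path into folder names; the function names `tags_for_folder` and `tags_for_file` and the example file name are hypothetical and do not come from the MetaFS code.

```python
from pathlib import PureWindowsPath


def tags_for_folder(folder):
    """Return the folder names of a Windows path, leaving out the drive."""
    path = PureWindowsPath(folder)
    return [part for part in path.parts if part != path.anchor]


def tags_for_file(file_path):
    """Return the tags a file receives from the folders it is stored in."""
    return tags_for_folder(PureWindowsPath(file_path).parent)


print(tags_for_folder(r"C:\Music\Rock\Aerosmith"))
# ['Music', 'Rock', 'Aerosmith']
print(tags_for_file(r"C:\Music\Rock\Aerosmith\Crazy.mp3"))
# ['Music', 'Rock', 'Aerosmith']
```

`PureWindowsPath` only manipulates the path string, so the sketch runs on any operating system and does not touch the disk.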

Frequency of occurrence:

0-10 times in a week, for the first month of using the MetaFS.

0-2 times in a month, after the MetaFS has been running for a while.

Once the important areas are chosen, i.e. music, documents, videos, photos, there is not much left

where the MetaFS will be more useful than a standard FS.

Notes:

-

9.2.2 UC2: Remove Index Location

Primary actor:

User

Interface:

A custom GUI.

Post conditions:

The item chosen is removed from the list of indexed locations and the files in that directory (and

subdirectories) are removed from the database. Any tags created when the index was created will

not be deleted, but their references to the files affected by the index are removed.

Basic flow:

1. The user picks one of the locations previously added through UC1: Add Index Location.

2. The user clicks the “Remove Index Location” button in the program GUI.

3. All file objects with a path attribute matching the removed index are deleted.

4. For each file object deleted, remove metadata words from search index.

5. For each file object deleted, all tags linked to the file object have the file object removed from

their reference list.
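Steps 3 and 5 of the flow above, and the dead tags they can leave behind, can be sketched in Python as follows. The search index update of step 4 is left out, and the function name `remove_index_location` and the two dictionaries are assumptions made for the example rather than the MetaFS object model.

```python
def remove_index_location(file_tags, tag_files, location):
    """Remove all files under `location` and return the resulting dead tags.

    file_tags: file path -> set of tags on that file.
    tag_files: tag -> set of file paths referenced by that tag.
    Tags are never deleted here; only their file references are removed.
    """
    prefix = location.rstrip("\\").lower() + "\\"
    removed = [path for path in file_tags if path.lower().startswith(prefix)]
    for path in removed:
        for tag in file_tags.pop(path):
            tag_files.get(tag, set()).discard(path)
    return sorted(tag for tag, files in tag_files.items() if not files)


# Hypothetical database content.
song = r"C:\Music\Rock\a.mp3"
doc = r"D:\Docs\cv.doc"
file_tags = {song: {"Music", "Rock"}, doc: {"Docs"}}
tag_files = {"Music": {song}, "Rock": {song}, "Docs": {doc}}

dead_tags = remove_index_location(file_tags, tag_files, r"C:\Music")
print(dead_tags)  # ['Music', 'Rock']
```

The returned tags still exist and can be applied to other files or deleted by the user, as described in the notes of this use case.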


Frequency of occurrence:

0-10 times in a week, for the first month of using the MetaFS.

0-2 times in a month, after the MetaFS has been running for a while.

In the beginning, poor choices may be made regarding index locations and they will need to be removed. After a while the user will become better at choosing the right folders to index and will eventually reach a point where he is satisfied with the chosen locations, and they will remain unchanged for a “longer” period.

Notes:

Removing an index can leave tags that no longer reference any files. These tags can be removed manually by the user through UC6: Delete Folder, or they can be applied to other files (which is why they are left for the user to remove).

9.2.3 UC3: View Root

Primary actor:

User

Interface:

A virtual hard disk (VHD), viewed through Explorer or command prompt, or other directory

navigating software.

Pre conditions:

The MetaFS application is running and a VHD has been assigned a drive letter and mounted through

the MetaFS GUI. Ideally the user has also added some indexed locations so there is actually

something to display.

Post conditions:

The user is presented with a list of folders corresponding to the tags in the MetaFS. Files are not

shown in the root, there are simply too many to list them all. No change is made to the database

objects or the indexed files; this is simply a presentation of data.

Basic flow:

1. The user runs some application that allows him to browse a hard disk.

2. The user selects the drive letter to which the VHD is mounted.

3. A list of folders is shown, corresponding to the tags in the MetaFS db.


Frequency of occurrence:

0-10 times per hour. Depending on how active the user is, this action will happen quite frequently, ranging from a couple of times a minute, when searching for one or more files, to a couple of times a day, when the user is only working on a handful of files.

Notes:

Listing all the tags of the MetaFS in the root may prove to be a messy approach. As more and more locations are indexed and the number of tags grows, a lot of “noise” will appear in the presented tags: album names, track numbers, years, etc. are not going to be the user’s first choice in a drill-down. There are two ways to solve this problem. The first is to put an extra attribute on each tag to indicate whether it should be displayed at root level. The second is to give the user an option for how many files a tag must reference before it is displayed at root level (a number around 20, changeable by the user, seems appropriate to exclude music album names).

The first option requires the user to actively mark root tags, for every tag and each time new tags are added, which is a lot of work to put on the user. The second option is automatic and adaptive: as a tag is applied to more files and becomes more useful in a drill-down, it becomes visible at root level.
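
The threshold option can be sketched in a few lines. The function name and the dictionary of per-tag file counts are assumptions made for illustration; only the default of 20 comes from the text.

```python
def root_tags(tag_file_counts, min_files=20):
    """Return the tags shown at root level: those referencing at least `min_files` files.

    tag_file_counts: dict mapping tag name -> number of files the tag references
    """
    return sorted(tag for tag, count in tag_file_counts.items() if count >= min_files)


if __name__ == "__main__":
    counts = {"music": 11253, "metallica": 22, "cd1": 12, "2007": 4365}
    print(root_tags(counts))
```

A tag such as an album name stays hidden until enough files carry it, without the user having to mark anything.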


9.2.4 UC4: View Folder

Primary actor:

User


Interface:

A VHD viewed through Explorer or command prompt, or other directory navigating software.

Pre conditions:


The MetaFS application is running and a VHD has been assigned a drive letter and mounted through the MetaFS GUI. To view anything but the root there must be existing tags that can be used for the view, so either an indexed location has been added (and possibly removed, leaving tags behind) or the user has created a folder on the VHD.

Post conditions:

The user is presented with a list of files, where each file has all the tags in the path being viewed on the VHD. In addition, every tag applied to any of the presented files is presented as a folder.

Basic flow:

1. The user runs some application that allows him to browse a hard disk.


2. The user opens a folder to view, either from the root or from a bookmark.

3. A list of files and folders is shown, based on the tags in the path.
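
The listing produced in the last step can be sketched as a set operation. The sketch assumes files are held in a dictionary from file name to tag names, and it leaves out the tags already in the path when building the folder list; that exclusion is an assumption of this example, not something the use case states.

```python
def view_folder(path_tags, files):
    """Return (files, folders) for a tag path on the VHD.

    path_tags: tag names in the path being viewed, e.g. ["music", "metallica"]
    files:     dict mapping file name -> set of tag names
    """
    wanted = set(path_tags)
    # A file is listed only if it carries every tag in the path.
    matches = sorted(name for name, tags in files.items() if wanted <= tags)
    # Every tag on a listed file becomes a folder for further drill-down.
    folders = set()
    for name in matches:
        folders |= files[name]
    # Assumption: tags already in the path are not repeated as sub folders.
    return matches, sorted(folders - wanted)
```

Each drill-down step adds one tag to `path_tags`, so the file list can only shrink or stay the same.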


Frequency of occurrence:

0-100 times per hour.

This is the most common action the user will perform when browsing and searching for files. A typical drill-down for a file will start at the root, and 5-7 tag folders are opened to narrow down the number of files to a satisfactory level where the wanted file can be found. To find just one file, 5-7 iterations of View Folder are to be expected.

Notes:

-

9.2.5 UC5: Create Folder

Primary actor:

User

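Interface:

A VHD viewed through Explorer or command prompt, or other directory navigating software.

Pre conditions:

The MetaFS application is running and a VHD has been assigned a drive letter and mounted through the MetaFS GUI.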

Post conditions:

The new folder is created as a tag and appears in the folder in which it was created.

Basic flow:

1. The user runs some application that allows him to browse a hard disk.


2. The user opens a folder.

3. The user creates a new folder in the currently open folder.

4. The new tag folder is presented in the current view.


Frequency of occurrence:

0-100 times in a week, for the first week of using MetaFS.

0-10 times per month once the files have reached a satisfactory level of accessibility by tag folders.

Tags are applied to files based on their path, but the user likely has additional tags that can help in finding files later. These additional tags are created as folders on the VHD (and files are later moved into them to apply the tag). While the path will typically contain many of the tags that one would wish to apply, the hierarchical structure makes it difficult to use folders like genre, year of production or decade of production. These are areas where the user will create new tag folders to apply more metadata to the file than was possible before.

Notes:

-

9.2.6 UC6: Delete Folder

Primary actor:

User

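Interface:

A VHD viewed through Explorer or command prompt, or other directory navigating software.

Pre conditions:

The MetaFS application is running and a VHD has been assigned a drive letter and mounted through the MetaFS GUI. There exist one or more tag folders, either from adding an indexed location or the user manually creating a folder on the VHD.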

Post conditions:

The tag folder is removed from the DB and all files linked to it have their links removed.

Basic flow:

1. The user runs some application that allows him to browse a hard disk.


a. Alternative: If the folder to be deleted is linked with too few files to appear in the root, one of the folders appearing at root level is opened, and the folder to be deleted is selected from within it.

2. The user issues a delete request on a tag folder and accepts the delete confirmation.

3. For each file associated with the folder, remove metadata words from search index.

4. The tag is removed from the DB and all links to and from it removed.




This use case is loosely related to UC5: Create Folder in the sense that the user is likely to create new

folders and remove some automatically added ones as he fine tunes the tags available.

Notes:

-

9.2.7 UC7: Rename Folder

Primary actor:

User

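Interface:

A VHD viewed through Explorer or command prompt, or other directory navigating software.

Pre conditions:

The MetaFS application is running and a VHD has been assigned a drive letter and mounted through the MetaFS GUI. There exist one or more tag folders, either from adding an indexed location or the user manually creating a folder on the VHD.
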
Post conditions:

The tag folder is renamed. The files in the folder are the same as before the rename operation; no files are added or removed.


Basic flow:

1. The user runs some application that allows him to browse a hard disk.


2. The user selects the folder and gives it a new name.

3. The tag is renamed and is available under a new name.

a. Alternative: A file exists with the new name – the file is renamed, adding a counter to its

name, the tag is renamed to the new name.

b. Alternative: A tag exists with the new name – the rename operation is canceled.

4. New metadata is extracted for files linked with the tag and the search index is updated.




Notes:

Renaming a folder should also support changing the casing, e.g. “MUsic” -> “Music”.

9.2.8 UC8: Move Folder

The location of tag folders has no meaning, only the combination of tag folders does, so moving a folder is an operation that accomplishes nothing.

There is one single case in which moving a folder could make sense. If the user wants to apply the folder to more files at a different location, moving it to the folder with the files to be sorted would save the need to create the folder at that location. But dragging a file from the sorting location to the “moved” folder would accomplish the same thing. There are already two other ways of getting the folder to appear where it is wanted, so this third move option is superfluous.

9.2.9 UC9: Create File

Primary actor:

User

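Interface:

A VHD viewed through Explorer or command prompt, or other directory navigating software.

Pre conditions:

The MetaFS application is running and a VHD has been assigned a drive letter and mounted through the MetaFS GUI. The user has write access to the FileStore folder (on a real HD).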

Post conditions:

The new file is created and tags are linked to it based on the path at which the file was created.

Basic flow:

1. The user runs some application that allows him to browse a hard disk and create files.

2. The user navigates to a folder that has the tags wanted for the new file.

3. The user creates a new file.

a. Alternative: A tag or file exists with the same name – the new file gets a counter

appended.

4. The new file is created on the real HD.

5. The file object is created for the new file and linked with tags in the path where the file was

created.


6. Metadata for the new file is added to the search index.


Frequency of occurrence:

0-100 times in a week. The MetaFS is mainly focused on organizing existing files, but occasionally new files are created.

Notes:

Moving a folder from one drive letter to another will likely use the CreateFile command to create the files on the target drive, followed by a DeleteFile on the source drive. By supporting CreateFile, the user is able to move individual files into the VHD (which are then created as files in the FileStore) if only selected files require indexing and not their entire directory.

9.2.10 UC10: Delete File

Primary actor:

User

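Interface:

A VHD viewed through Explorer or command prompt, or other directory navigating software.

Pre conditions:

The MetaFS application is running and a VHD has been assigned a drive letter and mounted through the MetaFS GUI. The database contains at least one file object and the user has rights to delete the real file pointed to by the file object.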

Post conditions:

The original file is left intact but the corresponding file object is removed from the database and tag

links to the file are removed.

Basic flow:

1. The user runs some application that allows him to browse a hard disk and delete files.

2. The user navigates the tag folders to find the file to delete.

3. The user deletes the file.

4. Metadata words are removed from the search index.

5. The file object is removed from the database.


Frequency of occurrence:

0-100 times in a week. The MetaFS is mainly focused on organizing files, but occasionally files that are no longer needed will show themselves on the VHD.

Notes:

If possible use the OS recycle bin for deleted files instead of the harsh, gone-forever-delete.

Alternatively the file object could be left in the database and marked as inactive.

A third option is to leave the file in the database, but mark it with a deleted tag so we can remove it

when presenting results.

9.2.11 UC11: Rename File

Primary actor:

User

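Interface:

A VHD viewed through Explorer or command prompt, or other directory navigating software.

Pre conditions:

The MetaFS application is running and a VHD has been assigned a drive letter and mounted through the MetaFS GUI. There exists at least one file, either from adding an indexed location or the user manually creating a file on the VHD.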

Post conditions:

The file is renamed on the VHD only; the real file keeps its name.

Basic flow:

1. The user runs some application that allows him to browse a hard disk.


2. The user navigates the folders and finds the file to be renamed.

3. The user selects the file and gives it a new name.

4. The file is renamed and is available under a new name.

a. Alternative: A file exists with the new name – the existing file keeps its name, the file

being renamed gets the new name with a counter added.

b. Alternative: A tag exists with the new name – the tag keeps its name, the file being

renamed gets the new name with a counter added.

5. New metadata is extracted and the search index is updated.
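
The name collision handling in the alternatives above can be sketched as follows. The use case only says that a counter is added; the " (2)" format, the case-insensitive comparison and the function name are assumptions made for this example.

```python
import os


def unique_name(wanted, taken):
    """Return `wanted`, or `wanted` with a counter added if the name is already in use.

    taken: names of the other files and tags visible in the same folder.
    Names are compared case-insensitively, as on a Windows drive.
    """
    taken_lower = {name.lower() for name in taken}
    if wanted.lower() not in taken_lower:
        return wanted
    stem, extension = os.path.splitext(wanted)
    counter = 2
    # Keep counting until a free name is found.
    while f"{stem} ({counter}){extension}".lower() in taken_lower:
        counter += 1
    return f"{stem} ({counter}){extension}"
```

When a file is renamed only to change its casing, its own current name should be left out of `taken`, otherwise the file would collide with itself.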


Frequency of occurrence:

0-100 times in a week.

Notes:

Renaming a file should also support changing the casing, e.g. “readme.txt” -> “ReadMe.txt”.

9.2.12 UC12: Move File

Primary actor:

User

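Interface:

A VHD viewed through Explorer or command prompt, or other directory navigating software.

Pre conditions:

The MetaFS application is running and a VHD has been assigned a drive letter and mounted through the MetaFS GUI. There exist two or more tag folders, either from adding an indexed location or the user manually creating a folder on the VHD.
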
Post conditions:

Tags are added to or removed from one or more files, depending on where the files were moved.

Basic flow:

1. The user runs some application that allows him to browse a hard disk.


2. The user moves a file from one folder to another. This move can be in three directions.

a. UP the path (\music\S&M\CD1 -> \music\S&M)

b. DOWN the path (\music\S&M -> \music\S&M\CD1)

c. ACROSS: up the path and down another branch (\music\S&M\CD1 -> \music\rock)

3. The tag is renamed and is available under a new name.

a. Alternative: A file exists with the new name – the file is renamed, adding a counter to its

name, the tag is renamed to the new name.

b. Alternative: A tag exists with the new name – the rename operation is canceled.

4. New metadata is extracted and the search index is updated.
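
Reading the post condition as "the tags of the moved file follow the target path", the effect of the three move directions can be sketched as a set difference between the source and target paths. The function name and the lower-casing of tags are assumptions of this sketch.

```python
def tag_changes(source_folder, target_folder):
    """Return (added, removed) tag names when a file moves between two VHD folders."""

    def tags_of(folder):
        # Every non-empty path segment is one tag; tags are compared in lower case.
        return {part.lower() for part in folder.split("\\") if part}

    source, target = tags_of(source_folder), tags_of(target_folder)
    return sorted(target - source), sorted(source - target)


if __name__ == "__main__":
    print(tag_changes(r"\music\S&M\CD1", r"\music\S&M"))   # UP
    print(tag_changes(r"\music\S&M", r"\music\S&M\CD1"))   # DOWN
    print(tag_changes(r"\music\S&M\CD1", r"\music\rock"))  # ACROSS
```

In this reading an UP move only removes tags, a DOWN move only adds tags, and an ACROSS move does both.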




Notes:


Renaming a folder should also support changing the casing, i.e. “MUsic” -> “Music”.

9.2.13 UC13: Read File

Primary actor:

Application (initiated by the user selecting open file in said application, or double clicking a file in Explorer)

Interface:

Any program that can read a file can start this event.

Pre conditions:


The MetaFS application is running and a VHD has been assigned a drive letter and mounted through the MetaFS GUI. There exists at least one file on the VHD.

Post conditions:

The content of the file is returned to the calling program.

Basic flow:

1. The user opens a view on the VHD.

2. The user double clicks a file to start the program for the given file type.

a. Alternative: A program is opened, and the open dialog of the program is used to locate and open the file.

3. The program is started and it will send a read request to the VHD.

4. The content of the file is returned to the program.


Frequency of occurrence:

0-10000 times in a day. It is a file system we are making, whose primary task is to store and retrieve data.

Notes:

-

9.2.14 UC14: Write File

In the tag based version of the MetaFS, write requests are just forwarded to the File.Write() function

and the size and timestamp attributes are updated. The metadata version of the MetaFS will need to

read the data after the write has been performed and update any metadata that has been stored for

this file.

Primary actor:

Application (initiated by user pressing save in said application)

Interface:

Any program that can write to a file can start this event.

Pre conditions:


The MetaFS application is running and a VHD has been assigned a drive letter and mounted through the MetaFS GUI.

Post conditions:

The content of the file is updated.

Basic flow:

Flow 1:

1. The user has opened an existing file in a program and made some changes and hits save.

2. The program sends a write request to the VHD.

3. NOOP (fall through to shared flow, point 4)


Flow 2:

1. Alternatively the user wants to save a new document

2. The program sends a write request to the VHD. The write request contains the CreateNew file

mode parameter to indicate that the file is new and does not/should not already exist.

a. If CreateNew file mode is specified and the file exists the write fails.

3. The file is created.

Shared flow:

4. The content of the (new) file is updated with the content from the application.

5. The size and timestamp attributes of the file object are updated in the DB.

6. The MetaFS is checked for metadata extraction add-ins for the extension of the file.

7. The metadata is updated by the add-in if one exists.
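
The two flows and the shared flow can be sketched as below. This is an illustration only: Python's exclusive-create mode "x" is used to stand in for the CreateNew file mode, and the `extractors` dictionary (file extension to extraction function) is a hypothetical stand-in for the metadata extraction add-ins.

```python
import os


def write_file(path, data, create_new=False, extractors=None):
    """Write `data` to `path` and return the updated file object attributes.

    create_new=True mirrors Flow 2: the write fails with FileExistsError
    if the file already exists.
    """
    mode = "xb" if create_new else "wb"
    with open(path, mode) as handle:
        handle.write(data)                       # shared flow, step 4
    info = os.stat(path)
    record = {"size": info.st_size,              # shared flow, step 5
              "timestamp": info.st_mtime,
              "metadata": None}
    extension = os.path.splitext(path)[1].lower()
    extractor = (extractors or {}).get(extension)  # shared flow, step 6
    if extractor is not None:
        record["metadata"] = extractor(path)     # shared flow, step 7
    return record
```

If no add-in is registered for the extension, only the size and timestamp attributes change.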


Frequency of occurrence:

0-1000 times in a day. Writing is a lot less common than reading.

Notes:

-

9.2.15 UC15: Search

Primary actor:

User.

Interface:

A custom search GUI.

Pre conditions:

The MetaFS application is running and one or more indexed locations have been added.

Post conditions:

Files containing the word(s) entered by the user are presented.

Basic flow:

1. The user enters one or more words and clicks search.

2. The files containing the word(s), in any part of the metadata extracted, are presented.


Frequency of occurrence:

0-1000 times in a day.

Notes:

AND, OR, NOT, begins-with (word*) and ends-with (*word) should be supported when searching.
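
One way to support these operators is sketched below, assuming a keyword index that maps each word to the set of files containing it. Begins-with is answered from a sorted word list, and ends-with from a sorted list of reversed words; the class and method names are hypothetical, and AND, OR and NOT are simply set operations on the results.

```python
from bisect import bisect_left


class KeywordIndex:
    def __init__(self, postings):
        # postings: dict mapping word -> iterable of file names
        self.postings = {word.lower(): set(files) for word, files in postings.items()}
        self.words = sorted(self.postings)
        self.reversed_words = sorted(word[::-1] for word in self.postings)

    @staticmethod
    def _with_prefix(sorted_words, prefix):
        """Return all words in a sorted list that start with `prefix`."""
        found = []
        for word in sorted_words[bisect_left(sorted_words, prefix):]:
            if not word.startswith(prefix):
                break
            found.append(word)
        return found

    def lookup(self, term):
        """Return the files for `word`, `word*` (begins-with) or `*word` (ends-with)."""
        term = term.lower()
        if term.endswith("*"):
            words = self._with_prefix(self.words, term[:-1])
        elif term.startswith("*"):
            # An ends-with search is a begins-with search on the reversed words.
            reversed_hits = self._with_prefix(self.reversed_words, term[1:][::-1])
            words = [word[::-1] for word in reversed_hits]
        else:
            words = [term] if term in self.postings else []
        files = set()
        for word in words:
            files |= self.postings[word]
        return files
```

With this sketch, `index.lookup("metal*") & index.lookup("wolf")` is an AND query, `|` gives OR, and `-` gives NOT.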

9.2.16 UC16: Filter Search

Primary actor:

User.

Interface:

A custom search GUI.

Pre conditions:

The MetaFS application is running and one or more indexed locations have been added and a search

has been performed.

Post conditions:

The search result is filtered based on user selection.

Basic flow:


1. The user selects an extension and a field that is to be used for filtering.

2. The value of the metadata field selected above is extracted from all the files of the search result

and all the possible values are presented.

3. The user selects one of the metadata field values and only files containing that particular value are presented.
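
The filtering steps above can be sketched as two small functions. The shape of a search result (a dictionary with "name", "extension" and "metadata" keys) and the function names are assumptions made for illustration.

```python
def filter_values(results, extension, field):
    """Step 2: collect every value of `field` found among results with `extension`."""
    return sorted({item["metadata"][field]
                   for item in results
                   if item["extension"] == extension and field in item["metadata"]})


def apply_filter(results, extension, field, value):
    """Step 3: keep only the results whose `field` has the selected value."""
    return [item for item in results
            if item["extension"] == extension and item["metadata"].get(field) == value]
```

Repeating the two steps on the filtered result narrows the search further, one metadata field at a time.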


Frequency of occurrence:

0-1000 times in a day.

Notes:

This is a somewhat similar approach to filtering as the one presented in “An Intelligent Method for Searching Metadata Spaces” [43].

9.2.17 UC17: Change FileStore Location

Ideally the user will be able to choose where files created on the VHD will be saved, but for now it is

easier if it’s hardcoded so this use case is left for future development.

Changing the filestore location also moves the files from the old location to the new, requiring an

update of file references in the DB.

9.2.18 UC18: Rescan Indexed Location

Since the indexed files are not actively monitored, the user may change files in an indexed location without the MetaFS noticing, so we need to be able to scan the locations again for any changes to files. Changes include added files, deleted files, renamed files (a rename cannot be detected and will appear as a delete followed by an add) and file content changes (a new timestamp or a different file size).
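
The change detection can be sketched as a comparison of two snapshots of an indexed location. The snapshot format (path mapped to a timestamp and size pair) and the function name are assumptions of this example.

```python
def diff_snapshots(old, new):
    """Compare two scans of an indexed location.

    old, new: dict mapping file path -> (timestamp, size)
    Returns (added, deleted, changed) as sorted lists of paths.
    """
    added = sorted(new.keys() - old.keys())
    deleted = sorted(old.keys() - new.keys())
    # Same path in both scans, but a new timestamp or a different size.
    changed = sorted(path for path in old.keys() & new.keys() if old[path] != new[path])
    return added, deleted, changed
```

A renamed file shows up once in `deleted` under its old name and once in `added` under its new name, as noted in the text.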

9.2.19 UC19: View Untagged Files

Since the root folder only shows tags and not files, and the user has to select a tag to be able to view

files, untagged files will not be presented anywhere on the VHD. To present these untagged files, a

special folder should be present in the root of the VHD. This folder should always be present so the user

can check it to make sure there are no untagged files. This special folder will simply be called

“Untagged” and when creating tag names, this name cannot be used. (Note: How will a move operation

deal with “untagged” being the target folder?)


9.3 Sequence Diagrams

9.3.1 Add Location


9.3.2 Remove Location


9.3.3 View Root

9.3.4 View Folder


9.3.5 Create Folder

9.3.6 Delete Folder


9.3.7 Rename Folder

9.3.8 Move Folder

Ignored, see use case for details.

9.3.9 Create File


9.3.10 Delete File

9.3.11 Rename File


9.3.12 Move File

9.3.13 Read File


9.3.14 Write File

9.3.15 Search


9.3.16 Filter Search


9.4 Dokan calls for a ReadFile request.

The following is the debug output for every call made to MFSDokan by Windows when opening the file “lyric1.txt” in Notepad. The actions taken to open the file are these:

1. Open the target folder “\lyrics [5]” and select the target file, “lyric1.txt”.

2. Enable every call from Dokan to print debug text.

3. Switch back to the target folder, clicking on the window header to avoid any additional calls to the VHD.

4. Hit enter to open the “lyric1.txt” file in Notepad.

5. Once Notepad is open, switch to the debug window and disable output.

CreateFile output is: “Filename | access | share | mode | options | “ followed by the info object values

“Context; DeleteOnClose; DokanContext; InfoID; IsDirectory; Nocache; PagingIo; ProcessId;

SynchronousIo; WriteToEndOfFile”.

The output for other calls is filename followed by the info object values mentioned above.

ReadFile output includes the offset position which most of the time is 0, “O:0”, between the filename

and the info object, since most of the testing is done on small files that are all read in a single pass.

PID 1244 is Explorer.exe

PID 2184 is Notepad.exe – These are the interesting calls and have been marked with a grey background
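
To pick out the calls of one process from such output, a single debug call can be split into its named values. The sketch below follows the output format described above (the info object is the last ten semicolon-separated values); the function name is hypothetical and file names containing "," or "|" are not handled.

```python
import re

INFO_FIELDS = ["Context", "DeleteOnClose", "DokanContext", "InfoID", "IsDirectory",
               "Nocache", "PagingIo", "ProcessId", "SynchronousIo", "WriteToEndOfFile"]


def parse_call(text):
    """Split one debug call, e.g. 'Cleanup(\\ , ø;F; ... ;F;F)', into a dictionary."""
    name, _, inner = text.strip().partition("(")
    inner = inner.rstrip(")")
    # The last nine values are separated by ';'; the first value (Context)
    # follows the file name and any call specific values such as 'O:0'.
    head, *tail = inner.rsplit(";", len(INFO_FIELDS) - 1)
    pieces = re.split(r"[|,]", head)
    values = [pieces[-1].strip()] + [value.strip() for value in tail]
    call = {"call": name, "filename": pieces[0].strip()}
    call.update(zip(INFO_FIELDS, values))
    return call


if __name__ == "__main__":
    line = r"ReadFile(\lyrics [5]\lyric1.txt , O:0, System.IO.FileStream;F; 112.685.520 ; 295 ;F;T;T;2184;T;F)"
    parsed = parse_call(line)
    print(parsed["call"], parsed["InfoID"], parsed["ProcessId"])
```

Filtering the parsed calls on `ProcessId == "2184"` leaves only the Notepad calls.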

----------debug block start----------

Changed debug value 'Dokan' to True CreateFile(\lyrics [5] | Read | ReadWrite, Delete | Open | None | ø;F;

112.685.400 ; 288 ;T;F;F;1244;F;F) GetFileInformation(\lyrics [5] , ø;F; 112.685.400 ; 288 ;T;F;F;1244;F;F) Cleanup(\lyrics [5] , ø;F; 112.685.400 ; 288 ;T;F;F;1244;F;F) CloseFile(\lyrics [5] , ø;F; 112.685.400 ; 288 ;T;F;F;1244;F;F) CreateFile(\lyrics [5]\lyric1.txt | Read | ReadWrite, Delete | Open | None | ø;F;

112.685.400 ; 289 ;F;F;F;1244;F;F) GetFileInformation(\lyrics [5]\lyric1.txt, ø;F; 112.685.400 ; 289 ;F;F;F;1244;F;F) Cleanup(\lyrics [5]\lyric1.txt , ø;F; 112.685.400 ; 289 ;F;F;F;1244;F;F) CloseFile(\lyrics [5]\lyric1.txt , ø;F; 112.685.400 ; 289 ;F;F;F;1244;F;F) CreateFile(\lyrics [5]\lyric1.txt | Read | ReadWrite, Delete | Open | None | ø;F;

112.685.400 ; 290 ;F;F;F;1244;F;F) GetFileInformation(\lyrics [5]\lyric1.txt, ø;F; 112.685.400 ; 290 ;F;F;F;1244;F;F) Cleanup(\lyrics [5]\lyric1.txt , ø;F; 112.685.400 ; 290 ;F;F;F;1244;F;F) CloseFile(\lyrics [5]\lyric1.txt , ø;F; 112.685.400 ; 290 ;F;F;F;1244;F;F) OpenDirectory(\lyrics [5] , ø;F; 112.685.400 ; 291 ;T;F;F;1244;F;F) OpenDirectory(\lyrics [5] , ø;F; 112.685.480 ; 292 ;T;F;F;2184;F;F) FindFiles(\lyrics [5] , ø;F; 112.685.400 ; 291 ;T;F;F;1244;F;F) Cleanup(\lyrics [5] , ø;F; 112.685.400 ; 291 ;T;F;F;1244;F;F) OpenDirectory(\lyrics [5] , ø;F; 112.685.520 ; 293 ;T;F;F;2184;F;F) CloseFile(\lyrics [5] , ø;F; 112.685.400 ; 291 ;T;F;F;1244;F;F) FindFiles(\lyrics [5] , ø;F; 112.685.520 ; 293 ;T;F;F;2184;F;F) FindFiles(\ , ø;F; 112.685.400 ; 294 ;T;F;F;1244;F;F) Cleanup(\lyrics [5] , ø;F; 112.685.520 ; 293 ;T;F;F;2184;F;F) Cleanup(\ , ø;F; 112.685.400 ; 294 ;T;F;F;1244;F;F) CloseFile(\lyrics [5] , ø;F; 112.685.520 ; 293 ;T;F;F;2184;F;F) CreateFile(\lyrics [5]\lyric1.txt | Read | ReadWrite | Open | None | ø;F;

112.685.520 ; 295 ;F;F;F;2184;F;F) CloseFile(\ , ø;F; 112.685.400 ; 294 ;T;F;F;1244;F;F) FindFiles(\lyrics [5] , ø;F; 112.685.400 ; 296 ;T;F;F;1244;F;F) Cleanup(\lyrics [5] , ø;F; 112.685.400 ; 296 ;T;F;F;1244;F;F) CloseFile(\lyrics [5] , ø;F; 112.685.400 ; 296 ;T;F;F;1244;F;F) FindFiles(\lyrics [5] , ø;F; 112.685.400 ; 297 ;T;F;F;1244;F;F) Cleanup(\lyrics [5] , ø;F; 112.685.400 ; 297 ;T;F;F;1244;F;F) CloseFile(\lyrics [5] , ø;F; 112.685.400 ; 297 ;T;F;F;1244;F;F) OpenDirectory(\ , ø;F; 112.685.400 ; 298 ;T;F;F;1244;F;F) Cleanup(\ , ø;F; 112.685.400 ; 298 ;T;F;F;1244;F;F) CloseFile(\ , ø;F; 112.685.400 ; 298 ;T;F;F;1244;F;F)

120 Appendix

CreateFile(\lyrics [5]\lyric1.txt | Read | ReadWrite, Delete | Open | None | ø;F; 112.685.400 ; 299 ;F;F;F;1244;F;F)

GetFileInformation(\lyrics [5]\lyric1.txt, ø;F; 112.685.400 ; 299 ;F;F;F;1244;F;F) Cleanup(\lyrics [5]\lyric1.txt , ø;F; 112.685.400 ; 299 ;F;F;F;1244;F;F) CloseFile(\lyrics [5]\lyric1.txt , ø;F; 112.685.400 ; 299 ;F;F;F;1244;F;F) FindFiles(\lyrics [5] , ø;F; 112.685.400 ; 300 ;T;F;F;1244;F;F) Cleanup(\lyrics [5] , ø;F; 112.685.400 ; 300 ;T;F;F;1244;F;F) CloseFile(\lyrics [5] , ø;F; 112.685.400 ; 300 ;T;F;F;1244;F;F) OpenDirectory(\ , ø;F; 112.685.400 ; 301 ;T;F;F;1244;F;F) Cleanup(\ , ø;F; 112.685.400 ; 301 ;T;F;F;1244;F;F) CloseFile(\ , ø;F; 112.685.400 ; 301 ;T;F;F;1244;F;F) CreateFile(\lyrics [5]\lyric1.txt | Read | ReadWrite, Delete | Open | None | ø;F;

112.685.400 ; 302 ;F;F;F;1244;F;F) GetFileInformation(\lyrics [5]\lyric1.txt, ø;F; 112.685.400 ; 302 ;F;F;F;1244;F;F) Cleanup(\lyrics [5]\lyric1.txt , ø;F; 112.685.400 ; 302 ;F;F;F;1244;F;F) CloseFile(\lyrics [5]\lyric1.txt , ø;F; 112.685.400 ; 302 ;F;F;F;1244;F;F) OpenDirectory(\ , ø;F; 112.685.400 ; 303 ;T;F;F;1244;F;F) GetFileInformation(\lyrics [5]\lyric1.txt, System.IO.FileStream;F; 112.685.520 ; 295

;F;F;F;2184;F;F) GetFileInformation(\lyrics [5]\lyric1.txt, System.IO.FileStream;F; 112.685.520 ; 295

;F;F;F;2184;F;F) Cleanup(\ , ø;F; 112.685.400 ; 303 ;T;F;F;1244;F;F) Cleanup(\lyrics [5]\lyric1.txt , System.IO.FileStream;F; 112.685.520 ; 295

;F;F;F;2184;F;F) CloseFile(\ , ø;F; 112.685.400 ; 303 ;T;F;F;1244;F;F) ReadFile(\lyrics [5]\lyric1.txt , O:0, System.IO.FileStream;F; 112.685.520 ; 295

;F;T;T;2184;T;F) OpenDirectory(\lyrics [5] , ø;F; 112.685.560 ; 305 ;T;F;F;2184;F;F) GetFileInformation(\lyrics [5] , ø;F; 112.685.400 ; 304 ;T;F;F;1244;F;F) FindFiles(\lyrics [5] , ø;F; 112.685.560 ; 305 ;T;F;F;2184;F;F) Cleanup(\lyrics [5] , ø;F; 112.685.400 ; 304 ;T;F;F;1244;F;F) Cleanup(\lyrics [5] , ø;F; 112.685.560 ; 305 ;T;F;F;2184;F;F) CloseFile(\lyrics [5] , ø;F; 112.685.400 ; 304 ;T;F;F;1244;F;F) OpenDirectory(\ , ø;F; 112.685.400 ; 306 ;T;F;F;1244;F;F) CloseFile(\lyrics [5] , ø;F; 112.685.560 ; 305 ;T;F;F;2184;F;F) Cleanup(\ , ø;F; 112.685.400 ; 306 ;T;F;F;1244;F;F) CloseFile(\ , ø;F; 112.685.400 ; 306 ;T;F;F;1244;F;F) GetFileInformation(\lyrics [5] , ø;F; 112.685.400 ; 307 ;T;F;F;1244;F;F) Cleanup(\lyrics [5] , ø;F; 112.685.400 ; 307 ;T;F;F;1244;F;F) CloseFile(\lyrics [5] , ø;F; 112.685.400 ; 307 ;T;F;F;1244;F;F) Changed debug value 'Dokan' to False

----------debug block end----------

From the VHD calls made it would appear that Notepad has 3 steps:

(InfoID=292) Check that the directory exists.

(InfoID=293) Check that the file exists.

(InfoID=295) Open the file. The calls made are CreateFile(), GetFileInformation(), GetFileInformation() (yes, 2 calls), Cleanup() and ReadFile(). GetFileInformation is used to get the length of the file, which Notepad uses to determine how much data to read from the file. If the length returned is smaller than the actual file length, the result is truncated data displayed in Notepad.

There is a 4th step (InfoID=305) that appears to check the files in the directory; the purpose of this step is unknown.

The output only captures the opening of the file; the CloseFile() call likely comes later, when Notepad is closed.


9.5 Visualization of activation depth

Figure 25 - Activation Depth

(Object graph used in the test: File1, Tag1_2, File2 and Tag2_3, linked through List<MFSTag> and List<MFSFile> collections and annotated with activation depths 1 to 5.)

objects in test db atm (should be 0):0

objects in test db after save (should be 12):12

activationdepth = 0

IsActive(file1ByQuery)False

file1 name:

file1 list: << FAILED

exception caught (activation depth is the underlying cause):

Object reference not set to an instance of an object.

activationdepth = 1

IsActive(file1ByQuery)True

file1 name: file1

file1 list: 0

IsActive(tag1_2)False

tag1_2 name: << FAILED



activationdepth = 2


file1 name: file1

file1 list: 1

IsActive(tag1_2)False

tag1_2 name:

tag1_2 list: << FAILED



activationdepth = 3



file1 name: file1

file1 list: 1

IsActive(tag1_2)True

tag1_2 name:tag1_2

tag1_2 list:0

file2 name: << FAILED



activationdepth = 4


file1 name: file1

file1 list: 1


tag1_2 name:tag1_2

tag1_2 list:2

file2 name: << FAILED



activationdepth = 5


file1 name: file1

file1 list: 1


tag1_2 name:tag1_2

tag1_2 list:2

file2 name: file2

file2 list: 0

tag2_3 name: << FAILED




9.6 Search Object Structure

Figure 26 - Search Objects example 1

A search for the word "metallica" finds the following keyword object in the database:

Object6 : MFSSearch.KeyWord

_word : string = metallica

_wordReverse : string = acillatem

_details : HashSet<KeyWordDetails> (shown in the figure as List<KeyWordDetails> (64))

Each KeyWordDetails object has two fields, _file : MFSFile and _type : Type, where the type is MP3Extension, FileNameSplitExtension or TagExtension.

Fields of the MFSFile object shown for one of the mp3 files:

_lastwrite : DateTime

_length : long

_metadata : Dictionary<Type, IMFSExtension> = MP3Extension, FileNameSplitExtension, TagExtension

_path : string = C:\Music\Metallica\S&M\CD1\04.Metallica - Of Wolf And Man.mp3

_tags : Dictionary<string, MFSTag> = music, metallica, s&m, cd1

_uniquename : string = 04.Metallica - Of Wolf And Man.mp3

_hash : string

Object16 : MFSSearch.KeyWordDetails (_type : Type = TagExtension) references the MFSFile object of the text file:

_attributes : FileAttributes

_length : long

_metadata : Dictionary<Type, IMFSExtension> = FileNameSplitExtension, TagExtension

_path : string = C:\Music\Metallica\S&M\dont readme.txt

_tags : Dictionary<string, MFSTag> = music, metallica, s&m

_uniquename : string = dont readme.txt

_hash : string

There is a total of 51 files in the database. 21 mp3 files + 1 txt file are in the metallica folder. 64 objects for 22 files seems a bit overwhelming but is due to the separation of metadata objects for each file:

22 files have the tag "metallica" and are picked up by the TagExtension

21 files have ID3 tags containing "metallica" as artist, picked up by MP3Extension

21 files have "metallica" in the file name, picked up by the FileNameSplitExtension

The text file in the Metallica folder is only referenced as a TagExtension type, not MP3Extension or FileNameSplitExtension, because it is not an mp3 file and it does not contain the word Metallica in the file name like the other 21 files.




9.7 Object counts with music folder added

1 : Db4objects.Db4o.Ext.Db4oDatabase, Db4objects.Db4o
9825 : MetaFileSystem.ExtMp3, MetaFS
11253 : MetaFileSystem.FileNameParserExtension, MetaFS
1 : MetaFileSystem.MFSDebugOptions, MetaFS
11253 : MetaFileSystem.MFSFile, MetaFS
4 : MetaFileSystem.MFSIndexedLocation, MetaFS
1 : MetaFileSystem.MFSIndexManager, MetaFS
1 : MetaFileSystem.MFSOptions, MetaFS
15324 : MetaFileSystem.MFSSearch+KeyWord, MetaFS
335232 : MetaFileSystem.MFSSearch+KeyWordDetails, MetaFS
705 : MetaFileSystem.MFSTag, MetaFS
11253 : MetaFileSystem.TagExtension, MetaFS
1 : System.Collections.Generic.Dictionary`2[[MetaFileSystem.MFSDebug+DebugTarget, MetaFS], [System.Boolean, mscorlib]], mscorlib
11253 : System.Collections.Generic.Dictionary`2[[System.String, mscorlib], [MetaFileSystem.MFSTag, MetaFS]], mscorlib
11253 : System.Collections.Generic.Dictionary`2[[System.Type, mscorlib], [MetaFileSystem.IMFSExtension, MFSInterface]], mscorlib
709 : System.Collections.Generic.HashSet`1[[MetaFileSystem.MFSFile, MetaFS]], System.Core
15324 : System.Collections.Generic.HashSet`1[[MetaFileSystem.MFSSearch+KeyWordDetails, MetaFS]], System.Core
1 : System.Collections.Generic.List`1[[MetaFileSystem.MFSIndexedLocation, MetaFS]], mscorlib
6 : System.Reflection.MemberInfo, mscorlib
6 : System.RuntimeType, mscorlib
6 : System.Type, mscorlib

9.8 ID3 Data

9.9 AddLocation on 11.253 files

Figure 29 shows the time it takes to add 100 files to the MetaFS using the Add Location button.

To compare the MetaFS search times with those of Winamp the same music collection needs to be

added to the MetaFS. The test collection consists of 4 folders, 2007 (4365 files), 2006 (3001 files), 2008

(931 files) and “gl musik” (2956 files), totaling 11253 files. The folders were added 1 at a time because

the time to add was discovered to increase with the amount of files already added, so 4 smaller add

operations with commit after each were safer than 1 big one in case of a crash (which did actually

happen, a problem on the host, unrelated to the MetaFS or test VM).

The 2007 folder was added first, which is why it is lowest on the graph. The first few files are added quickly, but with 2900 files added, adding the next 100 takes around 250 seconds.

Then the 2006 folder was added, and with 4365 files already in the database, adding the 2006 files takes around 300 seconds per 100 files right from the start. Adding the folders 2008 and finally “gl musik” only makes the problem worse, peaking at 1473 seconds to add 100 files.

Total time to add files to MetaFS DB: 7043 + 9773 + 5590 + 26879 = 49285 seconds (13.7 hours)

Test machine specs: CPU: 1x2.40GHz, RAM: 1.50GB, 64-bit Windows 7, running MetaFS in debug mode.

Figure 29 - AddLocation times

(Plot with one series per folder: 2007, 2006, 2008 and gl musik. Vertical axis from 0 to 1600, horizontal axis from 0 to 4000.)


9.10 AddLocation Performance Report

Figure 30 - Object activation and DB commit are time consuming when adding a location


9.11 Activation impact on queries

This test was performed before a search index was added and was thus a lot faster than location adding

is currently, but it still shows the impact of activation depth.

ActivationDepth=1

AddIndexedLocation(c:\files.512, true) finished in time: 1.353ms

200 queries on 2.560 file objects: 40ms - index:True - MFSFile query

200 queries on 2.560 file objects: 42ms - index:True - no read of result set

ActivationDepth=2




ActivationDepth=3




ActivationDepth=4


200 queries on 2.560 file objects: 1.777ms - index:True - MFSFile query


ActivationDepth=5




ActivationDepth=6





10 Index

ACID, 14

CASE, 15

daemon, 7

dead tag, 46

drill-down, i, iii, 49, 101

filestore, 43, 70, 109

FileStore, 47, 65, 68, 70, 72, 104, 105, 109

FileSystemWatcher, 44, 45, 63

Finder, 7, 21

FUSE, 2, 6, 20

ID3, 11, 27, 28, 30, 31, 32, 33, 34, 35, 36, 78,

126

iFilter, 3, 8

IFilter, 33

impedance mismatch, 87

Impedance mismatch, 14

inverted index, 11, 12, 13, 28, 29, 30, 34, 35, 36,

87

minimum term frequency, 11

native query, 70

Native query, 14

Native Query, 14

object container, 72

OODB, 1, 13, 14, 16, 22, 30, 36, 41, 58, 87, 98

ORDB, 13, 98

phrase search. proximity search

PIM, 20, 81, 88, 99

PK, 12

proximity search, 11

QBE, 14

RDB, 13, 14

SODA, 14, 24, 61

stemming, 11

stopword, 11, 35, 88

Stopword, 35

suffix stripping. stemming

synonym, 11

Tagcloud, 82

tag-folder, 24, 43, 46, 48, 80, 88, 99

VHD, i, iii, 6, 20, 21, 22, 25, 26, 27, 28, 37, 41,

42, 43, 44, 45, 46, 47, 60, 61, 63, 64, 65, 68,

69, 80, 81, 82, 83, 87, 88, 97, 99, 101, 102,

103, 104, 105, 106, 107, 108, 109, 119, 120

Winamp, 3, 78, 79, 80, 81, 82, 99, 127