718S-3

  • 8/2/2019 718S-3

    1/16

    19

    Chapter 3. Text Document Watermarking

    3.1 Introduction

    To actively embrace the Internet as a communication and content distribution medium, it is necessary to secure Internet content by incorporating digital watermarking methods. Digital watermarking methods for images, audio, and video are already in place and are quite effective. Image watermarking exploits the redundancy that exists in images and the limitations of the human visual system (HVS). Similar properties have been exploited for audio and video watermarking, where watermarks are embedded in video and audio frames so that they remain imperceptible.

    Text is the most extensively used medium of communication over the Internet. The major component of websites, books, newspapers, articles, and legal documents is simply plain text. Therefore, plain text requires the utmost protection and security from copyright violators. In the past, a number of digital watermarking algorithms have been proposed for images, audio, and video; however, digital watermarking algorithms for plain text are inadequate and ineffective.

    Digital watermarking is the process of embedding a unique digital watermark in digital content to protect it from illegal copying and copyright violations. The process of embedding a digital watermark in, and extracting it from, a digital text document so that it uniquely identifies the original copyright owner of that text is called Digital Text Watermarking. Text watermarking abides by the same principles as image, audio, or video watermarking. The watermark should remain resilient to random tampering attacks, undetectable to anybody but the original owner/author of the text, as well as easily and

    20 Copyright Protection of Plain Text using Digital Watermarking

    fully automatically reproducible by the watermark extraction algorithm. The main concern in text watermarking is that plain text contains less redundant information than images, audio, and video, and it is this redundancy that is used for secret communication, as in steganography, and for watermarking.

    Text watermarking techniques should implant unique and invisible watermarks in text documents, and these watermarks should remain intact after diverse tampering attacks of insertion, deletion, and re-ordering. Digital watermarking solutions for text make it easy to send and receive text over the Internet, intranets, extranets, and facsimile. Documents can be evaluated for text confidentiality and copyright protection. Digital text watermarking techniques can also detect any tampering, which makes the text tamper proof.

    This chapter is organized as follows. Section 3.2 briefly describes the application avenues for digital text watermarking. Section 3.3 states the reasons why watermarking text is difficult. Section 3.4 describes the possible attacks on text, together with their volume and nature. Section 3.5 reviews previous approaches to text watermarking, and Section 3.6 discusses the drawbacks of these approaches. The discussion is summarized in the last section.

    3.2 Applications

    Text watermarking can be used for a large number of applications in the real world. With the increasing and widespread use of the Internet all over the world for information sharing, text watermarking has gained more importance. The emerging concepts of digital libraries, e-business, e-learning, e-government, and e-books have made text watermarking a necessity. Legal documents, certificates, web sites, business plans, books, articles, poetry, company documents, confidential contents, SMS, and emails can be protected by text watermarking algorithms.

    Text watermarking can be used for a number of purposes. Authentication, copyright

    protection, copy prevention, covert communication, tamper detection, and fingerprinting

    are some of the applications of text watermarking.

    3.2.1 Authentication

    For authentication, fragile watermarks can be used to detect any tampering of a text document. If the watermark is detected, the text document is genuine; if not, the text has been tampered with and cannot be considered genuine. It is essential to authenticate text, especially when it is used for legal purposes. In sensitive communication, for example in defense applications and in business communication, it is extremely important to authenticate the text messages and to check their reliability and completeness.
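    The genuine/tampered decision described above can be illustrated with a minimal sketch. It is not a watermark embedded in the text: as a stand-in for a fragile watermark it uses a keyed digest (HMAC) kept alongside the document, which likewise stops verifying after any change. The key value and the function names make_tag and is_genuine are assumptions made for illustration.

```python
import hashlib
import hmac


def make_tag(text: str, key: bytes) -> str:
    """Return a keyed digest of the text (a stand-in for a fragile watermark)."""
    return hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()


def is_genuine(text: str, key: bytes, tag: str) -> bool:
    """True only if the text is unchanged since the tag was computed."""
    return hmac.compare_digest(make_tag(text, key), tag)


key = b"owner-secret"  # hypothetical secret held by the owner
original = "The parties agree to the terms stated herein."
tag = make_tag(original, key)
tampered = original.replace("agree", "disagree")
```

    Here is_genuine(original, key, tag) holds, while is_genuine(tampered, key, tag) does not, which mirrors the fragile behaviour: any edit, however small, makes verification fail.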

    3.2.2 Copyright Protection

    Text watermarking can be used to protect the intellectual property rights of plain text. It is essential to protect the copyright of web content, e-books, research papers, journal articles, poetry, quotes, and other documents containing plain text. The content owner/author can embed a watermark representing the copyright information of the data. This watermark can later be extracted to prove ownership if any conflicting copyright claim arises in the future, which can be very helpful in settling copyright disputes in court. It is probably the most prominent use of digital text watermarking.

    3.2.3 Copy Prevention

    Illegal copying and dissemination of text can also be prevented by text watermarking algorithms. The watermarked information can directly control a digital recording/copying device, which can be a printer or simply a copy-paste command. The embedded key can represent a copy-permission bit stream that is detected by the recording software, which then decides whether the copying procedure should go on (if it is allowed) or not (if it is prohibited by the content owner).

    3.2.4 Covert Communication

    The transmission of private data, which can be plain text or an image, is another

    application of text watermarking. Covert communication in this way means implanting a

    strategic/secret message into an innocuous looking text in a way that would prevent any

    unauthorized person from detecting it, while the intended recipient would still be able to recover it. The

    text watermarking algorithms proposed in this thesis can be used for covert

    communication as well.

    3.2.5 Tamper Detection

    Recent text watermarking algorithms can also identify the type, nature, and volume of tampering made by attackers to the original text. Thus, it sometimes becomes possible to predict or sense the intentions of attackers. The plagiarism problems currently faced by researchers can be addressed by efficient tamper detection algorithms.

    3.2.6 Fingerprinting

    In order to trace the source of illegal copies, the text author/owner can embed different watermarking keys in the copies that are supplied to different users. For the owner, embedding a unique, serial-number-like watermark is a good way to detect users who break their license agreement by copying the protected data and supplying it to a third party. Publishing companies can use such fingerprint watermarks to detect copyright violators.

    3.3 Why Is Text Watermarking Difficult?

    Plain text, being the simplest mode of information, brings various challenges when it comes to copyright protection. Text has limited capacity for watermark embedding because it lacks the redundancy that can be found in images, audio, and video. The binary nature with clear demarcation between foreground and background, block/line/word patterning, semantics, structure, style, and language rules are some of the prominent properties of text that need to be addressed in any text watermarking algorithm. Besides, the inherent properties of a generic watermarking scheme, such as imperceptibility, robustness, and security, also need to be satisfied.

    Any transformation on text should preserve the meaning, fluency, grammaticality, and value of the text. The meaning of the text is its value, and it should be preserved through watermarking in order not to disturb the communication. Fluency is required to represent the meaning of the text in a clear and fluent way, most importantly in literary writing. The embedding process should comply with the grammar rules of the language in order to preserve the readability of the text. Preserving the style of the author is very important in some domains, such as literary writing or news channels [37]. The sensitive nature of some documents, such as legal documents, poetry, and quotes, does not allow us to make semantic transformations randomly, because in these forms of text a simple transformation sometimes destroys both the semantic connotation and the value of the text.

    3.4 Attacks

    The cyber community is not very enthusiastic about text watermarking technologies. The reason might be the undisclosed watermarking methods and their lack of robustness against attacks. An attacker may be able to mount a partial attack even when a complete one is out of reach, so it is necessary to analyze each type of attack. Watermark attacks include unauthorized insertion, unauthorized detection, and unauthorized deletion. These unauthorized attacks, their volume, and their nature are described as follows:

    3.4.1 Types

    Generally, text encounters reproduction, synonym substitution, reformatting, paraphrasing, and syntactic transformation attacks. All these attacks can be placed in the following categories: unauthorized insertion, unauthorized detection, unauthorized deletion, re-ordering, and a combination of them all.

    i. Unauthorized Insertion

    Under this form of attack, words and sentences are added to the text to make it look different and sometimes to carry another message/watermark belonging to the attacker. An attacker sometimes inserts text into the original text to add some additional information. This

    kind of attack happens when an attacker is interested in adding some false information, for example in legal documents and cases. Such attacks can be avoided by incorporating a certifying authority in the watermarking architecture, which timestamps the contents in the name of the author with the current date and time. Whenever a dispute over the copyright claim arises, this timestamp is used to identify the author who registered the content first.
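    The role of the certifying authority can be sketched as a small registry that records who registered a piece of content and when, and that resolves a dispute in favour of the earliest registration. The class and method names below are hypothetical, and the sketch matches content by an exact SHA-256 digest, which is an assumption made for brevity; a real authority would sign and persist its records and would compare watermarks rather than raw text.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Registration:
    author: str
    digest: str
    registered_at: datetime


class CertifyingAuthority:
    """Hypothetical in-memory registry of timestamped content registrations."""

    def __init__(self) -> None:
        self._records: List[Registration] = []

    @staticmethod
    def _digest(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def register(self, author: str, text: str, when: datetime) -> Registration:
        """Record that `author` registered `text` at time `when`."""
        record = Registration(author, self._digest(text), when)
        self._records.append(record)
        return record

    def first_registrant(self, text: str) -> Optional[str]:
        """Return the author with the earliest registration of this exact text."""
        digest = self._digest(text)
        matches = [r for r in self._records if r.digest == digest]
        if not matches:
            return None
        return min(matches, key=lambda r: r.registered_at).author
```

    If two parties register the same text, first_registrant returns the one with the earlier timestamp, which is the rule the dispute resolution above relies on.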

    ii. Unauthorized Detection

    In some applications, the ability to detect should be restricted. It is conceivable that the mere ability of an adversary to identify whether or not a mark is present in a given Work will threaten the security of a watermarking system [38].

    iii. Unauthorized Deletion

    A deletion attack means random deletion of words and sentences from the original text. The attacker deletes some information to distract the reader and hide the identity of the original author/owner of the text. Security against unauthorized deletion is required in all watermarking applications. It is necessary to prevent an attacker from recovering the original, but it is more important to prevent removal of the watermark from the text. The watermark should survive, and remain detectable by the extraction algorithm, even if the attacker performs a number of alterations to the text.

    iv. Re-ordering

    The attacker shuffles and reorders the words and sentences of the text to make it look different and to destroy the watermark. In the case of text, the attacker may also rephrase sentences and replace certain words with their synonyms. The intention generally is to destroy the writing style, connotation, and sometimes the meaning of the text.

    3.4.2 Volume

    The volume of attack depends on the attacker's intention. If the attacker is interested in adding some information to, or deleting it from, the text, then the volume of attack will be low.

    However, if the attacker is interested in using some part of the text in his/her own text, then the volume of attack will be high.

    3.4.3 Nature

    A combined insertion, deletion, and re-ordering attack is termed a tampering attack. Tampering can be made at any random location in the text document, and in two ways: dispersed tampering and localized tampering.

    i. Localized

    Localized tampering means insertion or deletion of words or sentences at a single location in the text. This location can be at the beginning, at the end, or anywhere in the text, depending on the attacker's intended use.

    ii. Dispersed

    Dispersed insertion and deletion of sentences and words can be made at multiple locations in the original text. An attacker who is trying to make the text look different tampers with it in this dispersed way. This kind of attack generally occurs in research plagiarism and literary writing.
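    The distinction between localized and dispersed tampering can be made concrete with a word-level diff between the original and a suspect copy: one changed region is localized, several are dispersed. The sketch below uses Python's standard difflib for the comparison, which is an illustrative choice and not a method from the literature reviewed here; the function names are hypothetical, and it assumes the original text is available.

```python
import difflib


def tampering_regions(original: str, suspect: str) -> int:
    """Count separate word-level regions that were inserted, deleted, or replaced."""
    matcher = difflib.SequenceMatcher(
        a=original.split(), b=suspect.split(), autojunk=False
    )
    return sum(1 for tag, *_ in matcher.get_opcodes() if tag != "equal")


def classify_tampering(original: str, suspect: str) -> str:
    """Label the tampering as 'none', 'localized', or 'dispersed'."""
    regions = tampering_regions(original, suspect)
    if regions == 0:
        return "none"
    return "localized" if regions == 1 else "dispersed"
```

    For example, inserting one word in the middle of a sentence is classified as localized, whereas changing the first word and appending another at the end is classified as dispersed.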

    3.5 Literature Review

    Text watermarking is an area of research that emerged after the development of the Internet and communication technologies. The first reported effort to protect the copyright of text was made in 1994 by Brassil et al. [14][15], when an issue of the IEEE Journal on Selected Areas in Communications was scheduled to be published, for the Secure Electronic Publishing Trial. There were over 1,200 registered users in the first month, and each copy of each paper was registered and watermarked with the recipient [39]. It is currently a very active research area, with a number of researchers working on text watermarking for the English language as well as Persian, Turkish, Korean, Urdu, and Arabic.

    Previous work on digital text watermarking can be classified into the following categories: an image-based approach, a syntactic approach, and a semantic approach. Each category and the corresponding work are described below.

    3.5.1 Image-based Approach

    In this approach to digital text watermarking, an image of the text document is used to embed the watermark. Text is difficult to watermark because of its simplicity, sensitivity, and low capacity for watermark embedding. The initial attempts at text watermarking tried to treat text as an image, and the watermark was embedded in the layout and appearance of the text image.

    Brassil et al. proposed several methods for watermarking a text document by using the text image [12-14]. The first method was the line-shift coding algorithm, which alters the document image by moving lines upward or downward (left or right) depending on the binary signal (watermark) to be inserted, as shown in Figure 3.1.

    Figure 3.1 Line shift coding [16]

    The detection algorithm is non-blind, that is, the original document must be available. The second method was the word-shift coding algorithm, which moves the words within the text horizontally, thus expanding spaces, to embed the watermark. This algorithm can operate in both non-blind and blind modes. The third method is the feature coding algorithm, which slightly modifies features such as the pixels of characters or the length of the end lines in characters to encode watermark bits in the text. All these

    proposed techniques discourage unauthorized distribution by embedding a unique codeword in each document. Among the three presented methods, line-shift coding is the most robust solution under diverse attacks, but it too can be easily defeated.
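    The idea behind line-shift coding and its non-blind detection can be sketched on a list of line baseline positions instead of a real page image. This is a simplified illustration and not the published algorithm: the shift size, the choice of every second line as a carrier with unshifted neighbours, and the bit convention (1 = up, 0 = down) are all assumptions.

```python
from typing import List

SHIFT = 1  # hypothetical shift size in pixels


def embed_line_shift(baselines: List[int], bits: List[int]) -> List[int]:
    """Shift carrier lines up (bit 1) or down (bit 0); y grows downward."""
    marked = list(baselines)
    for i, bit in enumerate(bits):
        line = 2 * i + 1  # lines 1, 3, 5, ... each carry one bit
        if line >= len(marked):
            raise ValueError("not enough lines for the watermark")
        marked[line] += -SHIFT if bit else SHIFT
    return marked


def detect_line_shift(original: List[int], marked: List[int], n_bits: int) -> List[int]:
    """Non-blind detection: compare each carrier line with the original layout."""
    return [
        1 if marked[2 * i + 1] < original[2 * i + 1] else 0
        for i in range(n_bits)
    ]
```

    Because detection compares the marked positions against the original ones, the detector needs the original document, which is what makes this variant non-blind.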

    Maxemchuk et al. [39][40][41] analyzed the performance of the above-mentioned methods. Correlation-based and centroid-based detection methods [42] were also suggested: the former treats profiles as a discrete-time signal and looks for the direction of shift, while the latter uses the distances between the centroids of adjacent profile blocks to detect the watermark. Low et al. [42][43] further analyzed the efficiency of the methods.

    Huang and Yan [43] proposed an algorithm based on the average inter-word distance in each line. The distances are adjusted according to a sine wave of a specific phase and frequency. Feature-level and pixel-level algorithms were also developed, which mark the documents by modifying stroke features such as width or serif [45]. An algorithm that uses a gray-scale image of the text was also developed [46], as was an algorithm that watermarks a text document image using an edge direction histogram [47].

    Young-Won Kim et al. proposed a text watermarking algorithm based on word classification and inter-word space statistics [48]. In this approach, all the words in a text document are classified according to some text features; adjacent words are then grouped into segments, and each segment is classified according to the class labels of the words within it. The information is encoded by modifying some statistics of the inter-word spaces of the segments belonging to the same class. Several advantages over the conventional word-shift algorithms are discussed. Adnan M. Alattar et al. proposed an algorithm [49] to watermark electronic text documents containing justified paragraphs and irregular line spacing.

    Algorithms that exploit the printed text document to identify the source printer were also developed [50]. These methods use print quality defects, such as the banding features of a printed text document, as an intrinsic signature of the printer. These features can identify the specific make and model of the device that created the document. Cox et al. [51] described a number of applications of digital watermarking and their common properties

    like robustness, tamper resistance, fidelity, computational cost, and false positive rate.

    They observed that these properties vary greatly depending on the application. They also

    described seven applications of watermarking: broadcast monitoring, owner

    identification, proof of ownership, authentication, transactional watermarks, copy control,

    and covert communication.

    Yang and Kot [52] proposed a method for watermarking text document images to authenticate the owner or authorized user. The method makes use of the integrated inter-character and inter-word spaces for watermark embedding. An overlapping component of size three is utilized, whereby the relationship between the left and right spaces of a character is employed for the watermark embedding. The integrity of the document can be ensured by comparing the hash value of the character components of the document before and after watermark embedding, a check that can be applied to other line-shifting and word-shifting methods as well.

    Chao et al. [53] proposed a steganographic method to embed secret information into text files. This is achieved by making slight modifications to scattered inter-word spaces of the formatted text using the popular typesetting tool TeX. Qadir and Ahmad [54] suggested an intelligent encoding scheme for text watermarking that alters neither the syntax nor the layout of the document, thus providing a layout/format-independent technique in which information within the text is manipulated to hide certain information.

    Abdullah and Wahab [55] presented a text watermarking scheme targeting an object-based environment. At the heart of the proposed solution is the concept of watermarking an object-based text document in which each and every text string is treated as a separate object with its own set of properties. Taking advantage of the z-ordering of objects, the watermark is applied along the z-axis, causing zero fidelity disturbance to the text.

    Villan et al. [56] analyzed the theoretical and practical aspects of text data hiding in printed documents. Mikkilineni et al. [57][58][59] worked to enhance the data hiding and

    watermark embedding capacity of printed paper documents. Micic et al. [60] proposed an algorithm for authentication of text documents using digital watermarking, in which text document images are compared to evaluate changes. Xingming Sun and his team proposed a component-based digital watermarking algorithm for Chinese texts [61]. Li and Dong [62] proposed an algorithm for Chinese text watermarking based on the structure of Chinese characters. Another text watermarking algorithm using eigenvalues was also proposed [63]. Culnane et al. [64] proposed a binary text watermarking algorithm using continuous line embedding. Zhou et al. [65] presented a zero-watermarking algorithm for content authentication of Chinese text documents.

    3.5.2 Syntactic Approach

    Text is made up of characters, words, and sentences, and sentences have different syntactic structures. Applying syntactic transformations to the text structure to embed a watermark has also been one of the approaches to text watermarking in the past.

    Mikhail J. Atallah et al. first proposed a natural language watermarking scheme using the syntactic structure of text [17][18][66], in which the syntactic tree is built and transformations are applied to it to embed the watermark while preserving all inherent properties of the text. They developed techniques for embedding a robust watermark in text by combining a number of information assurance and security techniques with the advances and resources of natural language processing.

    For watermark embedding, they used manipulations of the TMR (text meaning representation), such as grafting, pruning, and substitution. These methods are resistant to many attacks but change the text to a large extent. Hence, they cannot be applied to text of a sensitive nature such as poetry, legal documents, transcripts, and contracts. Natural Language Processing (NLP) techniques are used to analyze the syntactic and semantic structure of the text while performing any transformations to embed the watermark bits.

    Figure 3.2 Syntactic sentence level watermarking [68]

    Kankanhalli and Hau [67] proposed a method to watermark electronic text documents using the ASCII characters and punctuation in the text.

    Hassan et al. proposed a natural language watermarking algorithm that performs morpho-syntactic alterations to the text [68]. The text is first transformed into a syntactic tree diagram in which the hierarchies and the functional dependencies are made explicit, and the watermark is then embedded. The watermarking process is shown in Figure 3.2. The authors stated that agglutinative languages like Turkish are easier to watermark than English. The watermarking solutions for agglutinative languages like Turkish, Korean, Arabic, and Urdu are efficient since these languages provide space for watermark embedding. However, the syntactic solutions for the English language are not adequate.

    Hassan et al. also proposed 21 syntactic tools for text watermarking [69], and Mi-Young Kim [70] recently proposed an algorithm for text watermarking using syntactic analysis of plain text. Kim [71] also proposed a natural language watermarking algorithm for the Korean language using adverbial displacement. Helge Hoehn proposed a natural language watermarking algorithm that relies on semantic and syntactic transformations of the original text content rather than on modifying the text itself [72].

    Murphy and Vogel [73] presented three natural language marking algorithms using shallow parsing techniques, lexical substitutions, and swapping. They also analyzed the significance of automated and reversible syntactic transformations for hiding data in plain text [74].

    3.5.3 Semantic Approach

    Semantic watermarking schemes focus on using the semantic structure of text to embed the watermark. Text contents, verbs, nouns, words and their spellings, acronyms, sentence structure, grammar rules, etc. have been exploited to insert a watermark in the text, but none of these has proved to be resilient, and they degrade the quality of the text to a large extent.

    Atallah et al. were the first to propose semantic watermarking schemes, in 2000 [17][18][75][76]. Later, the synonym substitution method was proposed, in which the watermark is embedded by replacing certain words with their synonyms [19]. Xingming et al. proposed a noun-verb based technique for text watermarking [77], which exploits nouns and verbs in a sentence parsed with a grammar parser using semantic networks.
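    A minimal sketch of synonym substitution is given below. It assumes a tiny, hypothetical table of synonym pairs in which the first word of each pair encodes bit 0 and the second encodes bit 1; real schemes rely on large synonym dictionaries and on checks that the substitute fits the context, neither of which is modelled here. The text is assumed to be lower-case words separated by spaces.

```python
from typing import Dict, List, Tuple

# Hypothetical synonym pairs: first word encodes bit 0, second word bit 1.
PAIRS: List[Tuple[str, str]] = [("big", "large"), ("quick", "fast"), ("begin", "start")]

BIT_OF: Dict[str, int] = {}
PARTNER: Dict[str, str] = {}
for zero_word, one_word in PAIRS:
    BIT_OF[zero_word], BIT_OF[one_word] = 0, 1
    PARTNER[zero_word], PARTNER[one_word] = one_word, zero_word


def embed(text: str, bits: List[int]) -> str:
    """Rewrite each carrier word so that it encodes the next watermark bit."""
    remaining = list(bits)
    out = []
    for word in text.split():
        if word in BIT_OF and remaining:
            bit = remaining.pop(0)
            if BIT_OF[word] != bit:
                word = PARTNER[word]
        out.append(word)
    if remaining:
        raise ValueError("text has too few substitutable words")
    return " ".join(out)


def extract(text: str, n_bits: int) -> List[int]:
    """Read the bits back from the first n_bits carrier words."""
    found = [BIT_OF[w] for w in text.split() if w in BIT_OF]
    return found[:n_bits]
```

    For instance, embedding the bits 1, 0, 0 in "a big dog made a quick start" yields "a large dog made a quick begin". An attacker who swaps synonyms at random flips the same carrier words, so the extracted bits change as well.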

    Mercan et al. proposed a sentence-based text watermarking algorithm [78] which relies on multiple features of each sentence and exploits the notion of orthogonality between features. Later, Mercan et al. proposed a text watermarking algorithm that uses idiosyncrasies to embed the watermark [20]. The algorithm makes clever use of typing errors, acronyms, and abbreviations that are common in cursory text like emails, blogs, chat, and SMS.

    Algorithms were developed to watermark text using the linguistic semantic phenomenon of presupposition [21][22] by observing the discourse meanings and representations. A presupposition is implicit information that is considered well known. Presuppositions are identified, and then transformations like passivization, topicalization, extraposition, and preposing are applied to embed the watermark in the text.

    Text pruning and grafting algorithms were also developed in the past. An algorithm based on text meaning representation (TMR) strings has recently been proposed [79]. Shirali-Shahreza et al. [80] proposed a new method for secret exchange of information through SMS, using abbreviation text steganography based on the invented language of SMS-texting. They also proposed a method for steganography in English texts that relies on the fact that some English words are spelled differently in the US and the UK: the US and UK spellings of such words are substituted for one another in order to hide data in an English text [81]. Later, Rafat [82] proposed an enhanced method for SMS steganography using the SMS-texting language, which removes the static nature of the word-abbreviation list and introduces computationally lightweight XOR encryption.
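    Hiding data through US and UK spellings can be sketched as follows. The word list and the bit convention (US spelling = 0, UK spelling = 1) are assumptions made for this illustration and are not taken from the cited work; the text is assumed to be lower-case words separated by spaces.

```python
from typing import List

# Hypothetical word list: (US spelling, UK spelling). US encodes 0, UK encodes 1.
SPELLINGS = [("color", "colour"), ("center", "centre"), ("favorite", "favourite")]
US_TO_UK = dict(SPELLINGS)
UK_TO_US = {uk: us for us, uk in SPELLINGS}


def hide(text: str, bits: List[int]) -> str:
    """Choose the US or UK spelling of each carrier word according to the bits."""
    queue = list(bits)
    words = []
    for word in text.split():
        us = UK_TO_US.get(word, word)  # normalise a carrier to its US form
        if us in US_TO_UK and queue:
            word = US_TO_UK[us] if queue.pop(0) else us
        words.append(word)
    if queue:
        raise ValueError("cover text has too few US/UK word pairs")
    return " ".join(words)


def reveal(text: str) -> List[int]:
    """Read one bit from every carrier word found in the text."""
    bits = []
    for word in text.split():
        if word in US_TO_UK:
            bits.append(0)
        elif word in UK_TO_US:
            bits.append(1)
    return bits
```

    Hiding the bits 1, 0, 1 in "my favorite color is near the center" gives "my favourite color is near the centre". Note that reveal reads every carrier word, so in this sketch the receiver needs to know the message length when the cover text has more carriers than bits.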

    Das [83] proposed an enhanced buyer-seller watermarking protocol based on a public key encryption standard, which is secure and flexible.

    Jonathan et al. [84] and Robert [85] provided studies and surveys of digital watermarking techniques for text, image, and video documents. Zhang et al. [86] explored the application of text watermarking in digital reading, in which holders are compensated for any copyright violation.

    3.6 Drawbacks

    Text watermarking algorithms using binary text images are not robust against re-typing and text reproduction attacks. With the increasing and efficient use of OCR (Optical Character Recognition) nowadays, these methods fail completely. The use of OCR can destroy the changes made by shifting words upward and downward, as well as changes to the document margins, fonts, serifs, and features of the text. The watermark can also easily be destroyed by a simple attack of copying and pasting the text into Notepad.
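    The fragility of layout-based marks can be demonstrated with a plain-text analogue of word-shift coding. In this sketch, which is an assumption made for illustration and not one of the surveyed image-based algorithms, a bit is carried by the width of the gap between two words (one space = 0, two spaces = 1), and re-typing is modelled by whitespace normalization.

```python
import re
from typing import List


def embed_in_gaps(words: List[str], bits: List[int]) -> str:
    """Encode each bit as a single (0) or double (1) space between words."""
    if len(bits) > len(words) - 1:
        raise ValueError("not enough gaps for the watermark")
    gaps = [" " * (1 + b) for b in bits] + [" "] * (len(words) - 1 - len(bits))
    out = words[0]
    for gap, word in zip(gaps, words[1:]):
        out += gap + word
    return out


def extract_from_gaps(text: str, n_bits: int) -> List[int]:
    """Read the bits back from the widths of the first n_bits gaps."""
    gaps = re.findall(r" +", text)
    return [len(g) - 1 for g in gaps[:n_bits]]


def retype(text: str) -> str:
    """Model a re-typing attack: every gap collapses to a single space."""
    return " ".join(text.split())
```

    After retype, every gap has the same width, so extraction returns only zeros and the embedded bits are lost, even though the visible words are unchanged.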

    Text watermarking that uses syntactic structure combined with natural language processing algorithms is an efficient approach, but research progress in NLP is very slow. Syntactic sentence paraphrasing can make a sentence sound unnatural. Syntactic techniques also require high-performance syntactic analyzers. The transformations applied using NLP algorithms are, most of the time, non-reversible.

    Semantic text watermarking techniques significantly improve the information hiding capacity of English text by modifying the granularity of meaning of individual terms/sentences, but semantic text watermarking schemes are very conceptual and impractical. The synonym-based techniques are not resilient to random synonym substitution attacks. There may be cases where the wrong words get selected for synonym substitution. Moreover, synonym-based methods require a large synonym dictionary and a huge collocation database.

    The sensitive nature of some text, such as legal documents, poetry, and quotations, does not allow us to make random semantic transformations. The reason is the necessity to preserve the semantic connotation as well as the value of the text while performing any transformation.

    In addition, text watermarking based on semantics is language dependent, and language is not something static. Language varies with the passage of time, and hence the security and copyright solution provided by digital watermarking based on semantics will have limited strength. The semantic techniques for digital watermarking use natural language processing algorithms to analyze text meaning and to perform transformations. NLP is an immature area of research; hence, text watermarking using semantics does not provide a practical and complete text watermarking solution.

    3.7 Summary

    In this chapter, text document watermarking is described in detail. The application areas, the challenges, and the possible attacks are also described. It is observed that the text watermarking methods proposed so far for English language text lack robustness, integrity, accuracy, and generality. Also, the amount of work done on text watermarking is very limited and specific.

    Text watermarking algorithms using binary text images are not robust against text reproduction attacks and have limited applicability. Similarly, text watermarking using the syntactic and semantic structure of text is not robust against random tampering attacks.

    These algorithms are application-area and/or language specific, with limited applicability and usability. The previous techniques are computationally expensive and not robust. Text, being an important medium of information exchange, requires complete protection. Text that encounters massive insertion, deletion, and reordering attacks needs to be protected from copyright violators. Therefore, efficient and practical text watermarking algorithms are required.