718S-3

8/2/2019 718S-3

Chapter 3. Text Document Watermarking

3.1 Introduction

To embrace the Internet as a communication and content distribution medium, it is necessary to secure Internet content by incorporating digital watermarking methods. Digital watermarking methods for images, audio, and video are already in place and are quite effective. Image watermarking exploits the redundancy present in images and the limitations of the human visual system (HVS). Similar properties have been exploited for audio and video watermarking, where the watermark is embedded in audio and video frames in a way that remains imperceptible.

Text is the most extensively used medium of communication over the Internet. The major component of websites, books, newspapers, articles, and legal documents is simply plain text. Therefore, plain text requires the utmost protection from copyright violators. In the past, a number of digital watermarking algorithms have been proposed for images, audio, and video; however, the digital watermarking algorithms available for plain text are inadequate and ineffective.

Digital watermarking is the process of embedding a unique digital watermark in digital content to protect it from illegal copying and copyright violations. The process of embedding a digital watermark in a digital text document, and later extracting it, so that it uniquely identifies the original copyright owner of that text is called digital text watermarking. Text watermarking abides by the same principles as image, audio, or video watermarking. The watermark should remain resilient to random tampering attacks, undetectable to anybody but the original owner/author of the text, as well as easily and

20 Copyright Protection of Plain Text using Digital Watermarking

fully automatically reproducible by the watermark extraction algorithm. The main difficulty in text watermarking is that plain text contains far less redundant information than images, audio, and video, and it is this redundancy that steganography and watermarking use to carry hidden information.

Text watermarking techniques should implant unique and invisible watermarks in text documents that remain intact after diverse tampering attacks of insertion, deletion, and re-ordering. Digital watermarking solutions for text make it easy to send and receive text securely over the Internet, intranets, extranets, and facsimile. Documents can be evaluated for text confidentiality and copyright protection. Digital text watermarking techniques can also detect any tampering with a document, making it tamper-evident.

This chapter is organized as follows. Section 3.2 briefly describes the application areas of digital text watermarking. Section 3.3 explains why watermarking text is difficult. Section 3.4 describes the possible attacks on text, together with their volume and nature. Section 3.5 reviews previous approaches to text watermarking, and Section 3.6 discusses the drawbacks of these approaches. Section 3.7 summarizes the discussion.

3.2 Applications

Text watermarking can be used in a large number of real-world applications. With the increasing and widespread use of the Internet for information sharing, text watermarking has gained importance. Emerging concepts such as digital libraries, e-business, e-learning, e-government, and e-books have made text watermarking a necessity. Legal documents, certificates, web sites, business plans, books, articles, poetry, company documents, confidential content, SMS messages, and emails can all be protected by text watermarking algorithms.

Text watermarking can be used for a number of purposes. Authentication, copyright

protection, copy prevention, covert communication, tamper detection, and fingerprinting

are some of the applications of text watermarking.

3.2.1 Authentication

For authentication, fragile watermarks can be used to detect any tampering with a text document. If the watermark is detected, the text document is genuine; if not, the text has been tampered with and cannot be considered authentic. Authenticating text is essential, especially when it is used for legal purposes. In sensitive communication, e.g. in defense applications and in business communication, it is extremely important to authenticate text messages and to check their reliability and completeness.
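
The detect-or-reject behaviour of a fragile mark can be illustrated with a minimal sketch. It is not one of the watermarking schemes reviewed in this chapter: as a stand-in for a fragile watermark it uses a keyed digest (HMAC-SHA256) kept alongside the text, and the function names `make_mark` and `is_genuine` are hypothetical.

```python
import hashlib
import hmac


def make_mark(text: str, key: bytes) -> str:
    """Keyed digest of the text, used here as a stand-in for a fragile mark."""
    return hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()


def is_genuine(text: str, mark: str, key: bytes) -> bool:
    """The text counts as genuine only if its mark can be reproduced exactly."""
    return hmac.compare_digest(make_mark(text, key), mark)


key = b"owner-secret"          # assumed secret held by the owner
contract = "The buyer shall pay 100 units."
mark = make_mark(contract, key)

print(is_genuine(contract, mark, key))
print(is_genuine("The buyer shall pay 900 units.", mark, key))
```

Any change to the text, however small, makes the check fail, which is the property a fragile mark needs for authentication; it offers no robustness at all, so it would not serve for copyright protection.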

3.2.2 Copyright Protection

Text watermarking can be used to protect the intellectual property rights in plain text. It is essential to protect the copyright of web content, e-books, research papers, journal articles, poetry, quotes, and other documents containing plain text. The content owner/author can embed a watermark representing the copyright information of the data. This watermark can later be extracted to prove ownership if a conflicting copyright claim arises, which can be very helpful in settling copyright disputes in court. It is probably the most prominent use of digital text watermarking.

3.2.3 Copy Prevention

Illegal copying and dissemination of text can also be prevented by text watermarking algorithms. The embedded watermark information can directly control a digital recording/copying device, which can be a printer or simply a copy-paste command. The embedded key can represent a copy-permission bit stream that is detected by the recording software, which then decides whether the copying procedure should go on (if it is allowed) or not (if it is prohibited by the content owner).

3.2.4 Covert Communication

The transmission of private data, which can be plain text or an image, is another application of text watermarking. Covert communication in this sense means implanting a strategic/secret message into an innocuous-looking text in a way that would prevent any

unauthorized person from detecting it, while the intended recipient is still able to recover it. The text watermarking algorithms proposed in this thesis can be used for covert communication as well.

3.2.5 Tamper Detection

Recent text watermarking algorithms can also identify the type, nature, and volume of the tampering made by attackers in the original text. Thus, it sometimes becomes possible to infer the intentions of attackers. The plagiarism problems faced by researchers today can be addressed by efficient tamper detection algorithms.

3.2.6 Fingerprinting

In order to trace the source of illegal copies, the text author/owner can embed different watermarking keys in the copies that are supplied to different users. For the owner, embedding a unique watermark resembling a serial number is a good way to detect users who break their license agreement by copying the protected data and supplying it to a third party. Publishing companies can use such fingerprint watermarks to detect copyright violators.
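
The tracing step can be sketched as follows. The sketch assumes that some embedding algorithm has already hidden a fixed-width binary codeword in each copy and that a (possibly damaged) codeword has been extracted from a leaked copy; the recipient names, serial numbers, and the 16-bit width are all hypothetical.

```python
def codeword(serial: int, width: int = 16) -> str:
    """Fixed-width binary codeword for a recipient's serial number."""
    return format(serial, f"0{width}b")


def trace(extracted: str, recipients: dict, width: int = 16) -> str:
    """Return the recipient whose codeword is closest (Hamming distance)."""
    def distance(name: str) -> int:
        expected = codeword(recipients[name], width)
        return sum(a != b for a, b in zip(expected, extracted))
    return min(recipients, key=distance)


recipients = {"alice": 0x1234, "bob": 0xBEEF}   # hypothetical licence holders
leaked = codeword(0xBEEF)
leaked = "0" + leaked[1:]                        # one bit damaged by tampering

print(trace(leaked, recipients))
```

Matching by nearest codeword rather than exact equality lets the owner tolerate a few damaged bits, at the cost of needing serial numbers that are far apart from one another.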

3.3 Why Is Text Watermarking Difficult?

Plain text, being the simplest mode of information, brings various challenges when it comes to copyright protection. Text has limited capacity for watermark embedding because it offers little of the redundancy found in images, audio, and video. Its binary nature with a clear demarcation between foreground and background, block/line/word patterning, semantics, structure, style, and language rules are some of the prominent properties of text that need to be addressed in any text watermarking algorithm. In addition, the inherent requirements of a generic watermarking scheme, such as imperceptibility, robustness, and security, also need to be satisfied.

Any transformation of text should preserve its meaning, fluency, grammaticality, and value. The meaning of the text is its value, and it should be preserved through watermarking so as not to disturb the communication. Fluency is required to convey the meaning of the text clearly, most importantly in literary writing. The embedding process should comply with the grammar rules of the language in order to preserve the readability of the text. Preserving the style of the author is very important in some domains, such as literary writing or news channels [37]. The sensitive nature of some documents, such as legal documents, poetry, and quotes, does not allow semantic transformations to be made at random, because in these forms of text a simple transformation can destroy both the semantic connotation and the value of the text.

3.4 Attacks

The cyber community is not very enthusiastic about text watermarking technologies. The reason might be undisclosed watermarking methods and a lack of robustness against attacks. An attacker who is unable to defeat a watermark completely may still be able to mount a partial attack, so it is necessary to analyze each type of attack. Watermark attacks include unauthorized insertion, unauthorized detection, and unauthorized deletion. These attacks, their volume, and their nature are described as follows:

3.4.1 Types

In general, text encounters reproduction, synonym substitution, reformatting, paraphrasing, and syntactic transformation attacks. All of these attacks can be placed in the following categories: unauthorized insertion, unauthorized detection, unauthorized deletion, re-ordering, and a combination of them all.

i. Unauthorized Insertion

Under this form of attack, words and sentences are added to the text to make it look different and sometimes to carry another message/watermark belonging to the attacker. An attacker sometimes inserts text into the original to add some additional information. This

kind of attack happens when an attacker wants to add false information, for example in legal documents and cases. Such attacks can be countered by incorporating a certifying authority into the watermarking architecture, which timestamps the content in the name of the author with the current date and time. Whenever a dispute over a copyright claim arises, this timestamp is used to identify the author who registered the content first.
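
The role of the certifying authority can be sketched as a small registry. This is only an illustration under stated assumptions: the class name `TimestampRegistry`, the use of a SHA-256 digest to identify content, and the in-memory storage are all hypothetical, and a real authority would also have to cope with content that has been altered before the dispute.

```python
import hashlib
from datetime import datetime, timezone


class TimestampRegistry:
    """Toy certifying authority: remembers who registered a text first."""

    def __init__(self):
        self._records = {}  # SHA-256 digest -> (author, registration time)

    def register(self, author, text, when=None):
        """Record the registration; the earliest timestamp always wins."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        when = when or datetime.now(timezone.utc)
        current = self._records.get(digest)
        if current is None or when < current[1]:
            self._records[digest] = (author, when)
        return self._records[digest]

    def first_owner(self, text):
        """Author with the earliest registration, or None if unregistered."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        record = self._records.get(digest)
        return record[0] if record else None


registry = TimestampRegistry()
registry.register("author", "my poem", datetime(2019, 1, 1, tzinfo=timezone.utc))
registry.register("attacker", "my poem", datetime(2019, 6, 1, tzinfo=timezone.utc))
print(registry.first_owner("my poem"))
```

Because the later registration never overwrites the earlier one, an attacker who re-registers the same content cannot displace the original author.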

ii. Unauthorized Detection

In some applications, the ability to detect should be restricted. It is conceivable that the ability of an adversary merely to identify whether or not a mark is present in a given Work will threaten the security of a watermarking system [38].

iii. Unauthorized Deletion

A deletion attack means the random deletion of words and sentences from the original text. The attacker deletes some information to distract the reader and hide the identity of the original author/owner of the text. Security against unauthorized deletion is required in all watermarking applications. It is necessary to prevent an attacker from recovering the original, but it is more important to prevent removal of the watermark from the text. The watermark should survive even if the attacker makes a number of alterations to the text, and it should remain detectable by the extraction algorithm.

iv. Re-ordering

The attacker shuffles and reorders the words and sentences of the text to make it look different and to destroy the watermark. The attacker may also rephrase the text and replace certain words with their synonyms. The intention generally is to destroy the writing style, the connotation, and sometimes the meaning of the text.

3.4.2 Volume

The volume of an attack depends on the attacker's intention. If the attacker wants to add some information to the text or delete some from it, the volume of the attack will be low.

However, if the attacker wants to use some part of the text in his/her own text, the volume of the attack will be high.

3.4.3 Nature

A combined insertion, deletion, and re-ordering attack is termed a tampering attack. Tampering can be made at any location in the text document, and it can take two forms: localized tampering and dispersed tampering.

i. Localized

Localized tampering means the insertion or deletion of words or sentences at a single location in the text. This location can be at the beginning, at the end, or anywhere in the text, depending on the attacker's intended use.

ii. Dispersed

Dispersed tampering means the insertion and deletion of words and sentences at multiple locations in the original text. An attacker trying to make the text look different tampers with it in this dispersed manner. This kind of attack generally occurs in research plagiarism and literary writing.
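
The distinction between the two forms can be made concrete with a short sketch. It assumes that both the original and the suspect text are available and compares them word by word with Python's standard `difflib` module; the function name `classify_tampering` and the rule "one changed region is localized, more than one is dispersed" are illustrative assumptions, not a method from the literature reviewed here.

```python
from difflib import SequenceMatcher


def classify_tampering(original: str, suspect: str) -> str:
    """Label tampering as 'none', 'localized', or 'dispersed'."""
    a, b = original.split(), suspect.split()
    opcodes = SequenceMatcher(a=a, b=b, autojunk=False).get_opcodes()
    # Every opcode other than 'equal' marks one contiguous changed region.
    regions = [op for op in opcodes if op[0] != "equal"]
    if not regions:
        return "none"
    return "localized" if len(regions) == 1 else "dispersed"


original = "the quick brown fox jumps over the lazy dog"
print(classify_tampering(original, original))
print(classify_tampering(original, original + " today"))
print(classify_tampering(original, "the very quick brown fox jumps over the dog"))
```

The three calls cover an untouched copy, a single insertion at the end, and an insertion plus a deletion at two different places.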

3.5 Literature Review

Text watermarking is an area of research that emerged after the development of the Internet and communication technologies. The first reported effort to protect the copyright of text was made in 1994 by Brassil et al. [14][15], when an issue of the IEEE Journal on Selected Areas in Communications was scheduled to be published as part of a Secure Electronic Publishing Trial. There were over 1,200 registered users in the first month, and each copy of each paper was registered to its recipient and watermarked accordingly [39]. Text watermarking is currently a very active research area, with a number of researchers working on text watermarking for English as well as Persian, Turkish, Korean, Urdu, and Arabic.

Previous work on digital text watermarking can be classified into the following categories: an image-based approach, a syntactic approach, and a semantic approach. Each category and the work done under it are described below.

3.5.1 Image-based Approach

In this approach to digital text watermarking, an image of the text document is used to embed the watermark. Text is difficult to watermark because of its simplicity, sensitivity, and low capacity for watermark embedding. The initial attempts at text watermarking treated text as an image, and the watermark was embedded in the layout and appearance of the text image.

Brassil et al. proposed a few methods to watermark a text document by using the text image [12-14]. The first method was the line-shift coding algorithm, which alters the document image by moving lines upward or downward depending on the binary signal (watermark) to be inserted, as shown in Figure 3.1.

Figure 3.1 Line shift coding [16]

The detection algorithm is non-blind, i.e. the original document must be available. The second method was the word-shift coding algorithm, which moves words horizontally within the text, thus expanding spaces to embed the watermark. This algorithm can operate in both non-blind and blind modes. The third method was the feature coding algorithm, which slightly modifies features such as the pixels of characters or the length of the end lines in characters to encode watermark bits in the text. All these

proposed techniques discourage unauthorized distribution by embedding a unique codeword in each document. Among the three methods, line-shift coding is the most robust under diverse attacks, but it can also be easily defeated.
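
The idea behind line-shift coding and its non-blind detection can be sketched on a list of line baselines instead of a real page image. The sketch makes several assumptions that are not taken from the cited papers: baselines are plain numbers with y growing downward, every second interior line carries one bit while its neighbours stay fixed, and the shift is one unit. The function names are hypothetical.

```python
def embed_line_shift(baselines, bits, delta=1):
    """Return new baselines with carrier lines shifted by +/- delta.

    Assumption: y grows downward, so bit 1 moves a line down and bit 0
    moves it up. Only odd-indexed interior lines carry bits.
    """
    marked = list(baselines)
    carriers = range(1, len(baselines) - 1, 2)
    for idx, bit in zip(carriers, bits):
        marked[idx] += delta if bit else -delta
    return marked


def detect_line_shift(original, marked):
    """Non-blind detection: compare each carrier line with the original."""
    bits = []
    for idx in range(1, len(original) - 1, 2):
        diff = marked[idx] - original[idx]
        if diff != 0:
            bits.append(1 if diff > 0 else 0)
    return bits


baselines = [0, 20, 40, 60, 80, 100, 120]   # seven evenly spaced lines
marked = embed_line_shift(baselines, [1, 0, 1])
print(marked)
print(detect_line_shift(baselines, marked))
```

If the document is re-typed or passed through OCR, the line positions return to a regular grid; in this sketch that corresponds to `detect_line_shift(baselines, baselines)`, which recovers no bits at all.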

Maxemchuk et al. [39][40][41] analyzed the performance of the above-mentioned methods. Correlation-based and centroid-based detection methods [42] were also suggested: the former treats a profile as a discrete-time signal and looks for the direction of shift, while the latter uses the distances between the centroids of adjacent profile blocks to detect the watermark. Low et al. [42][43] further analyzed the efficiency of these methods.

Huang and Yan [43] proposed an algorithm based on the average inter-word distance in each line. The distances are adjusted according to a sine wave of a specific phase and frequency. Feature-level and pixel-level algorithms were also developed, which mark documents by modifying stroke features such as width or serif [45]. An algorithm that utilizes a gray-scale image of the text was also developed [46], as was an algorithm that watermarks a text document image using an edge direction histogram [47].
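
The sine-wave idea can be illustrated with a rough sketch. How the original algorithm maps watermark information to phase and frequency is not described here, so everything below is an assumption: the base spacing, amplitude, frequency, and phase values are hypothetical, and detection is done by a simple correlation of the per-line averages with the expected sine wave.

```python
import math


def target_spacings(n_lines, base=10.0, amplitude=1.0, freq=0.25, phase=0.0):
    """Average inter-word spacing for each line, modulated by a sine wave."""
    return [base + amplitude * math.sin(2 * math.pi * freq * i + phase)
            for i in range(n_lines)]


def correlation(spacings, freq, phase):
    """Correlate mean-removed line spacings with the expected sine wave."""
    mean = sum(spacings) / len(spacings)
    return sum((s - mean) * math.sin(2 * math.pi * freq * i + phase)
               for i, s in enumerate(spacings))


marked = target_spacings(16)      # watermarked page: 16 lines
plain = [10.0] * 16               # unmarked page: uniform spacing

print(correlation(marked, 0.25, 0.0))
print(correlation(plain, 0.25, 0.0))
```

A clearly positive correlation suggests that the expected wave is present, while uniform spacing, as produced by re-typing the text, gives no response.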

Young-Won Kim et al. proposed a text watermarking algorithm based on word classification and inter-word space statistics [48]. In this approach, all the words in a text document are classified according to some text features; adjacent words are then grouped into segments, and each segment is classified according to the class labels of the words within it. The information is encoded by modifying some statistics of the inter-word spaces of the segments belonging to the same class. Several advantages over the conventional word-shift algorithms are discussed. Adnan M. Alattar et al. proposed an algorithm [49] to watermark electronic text documents containing justified paragraphs and irregular line spacing.

Algorithms that examine a printed text document to identify the source printer were also developed [50]. These methods use print quality defects, which appear as banding features in a printed text document, as an intrinsic signature of a printer. These features can identify the specific make and model of the device that created the document. Cox et al. [51] described a number of applications of digital watermarking and their common properties

like robustness, tamper resistance, fidelity, computational cost, and false positive rate. They observed that these properties vary greatly depending on the application. They also described seven applications of watermarking: broadcast monitoring, owner identification, proof of ownership, authentication, transactional watermarks, copy control, and covert communication.

Yang and Kot [52] proposed a method for watermarking text document images to authenticate the owner or authorized user. The method makes use of the integrated inter-character and inter-word spaces for watermark embedding. An overlapping component of size three is utilized, whereby the relationship between the left and right spaces of a character is employed for the watermark embedding. The integrity of the document can be ensured by comparing the hash value of the character components of the document before and after watermark embedding, an idea which can be applied to other line-shifting and word-shifting methods as well.

Chao et al. [53] proposed a steganographic method to embed secret information into text files. This is achieved by making slight modifications to scattered inter-word spaces of the formatted text using the popular typesetting tool TeX. Qadir and Ahmad [54] suggested a novel idea based on an intelligent encoding scheme that alters neither the syntax nor the layout of the document, thus providing a layout/format-independent technique in which information within the text is manipulated to hide certain information.

Abdullah and Wahab [55] presented a text watermarking scheme targeting an object-based environment. The heart of the proposed solution is the concept of watermarking an object-based text document in which each text string is treated as a separate object having its own set of properties. Taking advantage of the z-ordering of objects, the watermark is applied along the z-axis, causing no fidelity disturbance to the text.

Villan et al. [56] analyzed the theoretical and practical aspects of text data hiding in printed documents. Mikkilineni et al. [57][58][59] worked to enhance the data hiding and

watermark embedding capacity of printed paper documents. Micic et al. [60] proposed an algorithm for the authentication of text documents using digital watermarking, in which text document images were compared to evaluate changes. Xingming Sun and his team proposed a component-based digital watermarking algorithm for Chinese texts [61]. Li and Dong [62] proposed an algorithm for Chinese text watermarking based on the structure of Chinese characters. Another text watermarking algorithm using eigenvalues has also been proposed [63]. Culnane et al. [64] proposed a binary text watermarking algorithm using continuous line embedding. Zhou et al. [65] presented a zero-watermarking algorithm for content authentication of Chinese text documents.

3.5.2 Syntactic Approach

Text is made up of characters, words, and sentences, and sentences have different syntactic structures. Applying syntactic transformations to the text structure to embed a watermark has been one of the past approaches to text watermarking.

Mikhail J. Atallah et al. first proposed a natural language watermarking scheme using the syntactic structure of text [17][18][66], in which a syntactic tree is built and transformations are applied to it to embed the watermark while preserving all inherent properties of the text. They developed techniques for embedding a robust watermark in text by combining a number of information assurance and security techniques with the advanced methods and resources of natural language processing.

For watermark embedding, they used manipulations of the TMR (text meaning representation), such as grafting, pruning, and substitution. These methods are resistant to many attacks but change the text to a large extent, and hence cannot be applied to text of a sensitive nature such as poetry, legal documents, transcripts, and contracts. Natural language processing (NLP) techniques are used to analyze the syntactic and semantic structure of the text while performing any transformation to embed the watermark bits.

Figure 3.2 Syntactic sentence level watermarking [68]

Kankanhalli and Hau [67] proposed a method to watermark electronic text documents using the ASCII characters and punctuation in the text.

Hassan et al. proposed a natural language watermarking algorithm that performs morpho-syntactic alterations to the text [68]. The text is first transformed into a syntactic tree diagram in which the hierarchies and the functional dependencies are made explicit, and the watermark is then embedded. The watermarking process is shown in Figure 3.2. The authors stated that agglutinative languages like Turkish are easier to watermark than English. Watermarking solutions for languages such as Turkish, Korean, Arabic, and Urdu are efficient since these languages provide room for watermark embedding. However, the syntactic solutions for the English language are not adequate.

Hassan et al. also proposed 21 syntactic tools for text watermarking [69], and Mi-Young Kim [70] proposed an algorithm for text watermarking using syntactic analysis of plain text. Kim [71] also proposed a natural language watermarking algorithm for the Korean language using adverbial displacement. Helge Hoehn proposed a natural language watermarking algorithm that relies on semantic and syntactic transformations of the original text content rather than modifying the text [72].

Murphy and Vogel [73] presented three natural language marking algorithms using shallow parsing techniques, lexical substitution, and swapping. They also analyzed the significance of automated and reversible syntactic transformations for hiding data in plain text [74].

3.5.3 Semantic Approach

Semantic watermarking schemes focus on using the semantic structure of text to embed the watermark. Text content, verbs, nouns, words and their spellings, acronyms, sentence structure, grammar rules, etc. have been exploited to insert a watermark in the text, but none of these has proved to be resilient, and they degrade the quality of the text to a large extent.

Atallah et al. were the first to propose semantic watermarking schemes, in 2000 [17][18][75][76]. Later, the synonym substitution method was proposed, in which the watermark is embedded by replacing certain words with their synonyms [19]. Xingming et al. proposed a noun-verb based technique for text watermarking [77], which exploits the nouns and verbs in a sentence parsed with a grammar parser using semantic networks. Mercan et al. proposed a sentence-based text watermarking algorithm [78] which relies on multiple features of each sentence and exploits the notion of orthogonality between features. Later, Mercan et al. proposed a text watermarking algorithm that uses idiosyncrasies to embed the watermark [20]. The algorithm makes clever use of typing errors, acronyms, and abbreviations that are common in cursory text such as emails, blogs, chat, and SMS.
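
The synonym substitution idea mentioned above can be sketched in a few lines. The sketch is only illustrative and is not the algorithm of [19]: the three synonym pairs are hypothetical, the first word of each pair is taken to encode bit 0 and the second bit 1, and tokenization is a plain whitespace split.

```python
# Hypothetical synonym pairs: position 0 encodes bit 0, position 1 encodes bit 1.
SYNONYMS = [("big", "large"), ("quick", "fast"), ("begin", "start")]
LOOKUP = {word: (pair, bit) for pair in SYNONYMS for bit, word in enumerate(pair)}


def embed(text, bits):
    """Replace each carrier word with the synonym that encodes the next bit."""
    remaining = list(bits)
    out = []
    for word in text.split():
        if word in LOOKUP and remaining:
            pair, _ = LOOKUP[word]
            word = pair[remaining.pop(0)]
        out.append(word)
    return " ".join(out)


def extract(text):
    """Read one bit from every carrier word; the caller must know the payload length."""
    return [LOOKUP[word][1] for word in text.split() if word in LOOKUP]


marked = embed("we begin with a big and quick test", [1, 0, 1])
print(marked)
print(extract(marked))
```

The sketch also shows why the method is fragile: an attacker who swaps any carrier word for its synonym flips the corresponding bit without changing the meaning of the sentence.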

Algorithms were developed to watermark text using the linguistic semantic phenomenon of presupposition [21][22], by observing the discourse meanings and representations. A presupposition is implicit information that is taken to be well known. Presuppositions are identified, and then transformations such as passivization, topicalization, extraposition, and preposing are applied to embed the watermark in the text.

Text pruning and grafting algorithms were also developed in the past. An algorithm based on text meaning representation (TMR) strings has also been proposed [79]. Shirali-Shahreza et al. [80] proposed a new method for the secret exchange of information through SMS, using abbreviation-based text steganography built on the invented language of SMS texting. They also proposed a method for steganography in English texts that exploits the fact that some English words are spelled differently in the US and the UK: the US and UK spellings of such words are substituted for one another in order to hide data in an English text [81]. Later, Rafat [82] proposed an enhanced method for SMS steganography using the SMS-texting language, which removes the static nature of the word-abbreviation list and introduces computationally lightweight XOR encryption.

Das [83] proposed an enhanced buyer-seller watermarking protocol based on a public-key encryption standard, which is secure and flexible.

Jonathan et al. [84] and Robert [85] provided studies and surveys of digital watermarking techniques for text, image, and video documents. Zhang et al. [86] explored the application of text watermarking in digital reading, in which holders are compensated for any copyright violation.

3.6 Drawbacks

Text watermarking algorithms using binary text images are not robust against re-typing and text reproduction attacks. With the increasing and efficient use of OCR (Optical Character Recognition) nowadays, these methods fail completely. OCR destroys the changes made by shifting lines and words, and the changes made to the document margins, the fonts, the serifs, and other features of the text. The watermark can also be easily destroyed by simply copying the text and pasting it into a plain-text editor such as Notepad.

Text watermarking using syntactic structure combined with natural language processing algorithms is an efficient approach, but research progress in NLP is very slow. Syntactic sentence paraphrasing can make a sentence sound unnatural. Syntactic techniques also require high-performance syntactic analyzers. Moreover, the transformations applied using NLP algorithms are, most of the time, non-reversible.

Semantic text watermarking techniques significantly improve the information hiding capacity of English text by modifying the granularity of meaning of individual terms/sentences, but semantic text watermarking schemes are very conceptual and impractical. Synonym-based techniques are not resilient to random synonym substitution attacks. There may be cases where the wrong words get selected for synonym substitution. Moreover, synonym-based methods require a large synonym dictionary and a huge collocation database.

The sensitive nature of some text, such as legal documents, poetry, and quotations, does not allow random semantic transformations. The reason is the need to preserve both the semantic connotation and the value of the text while performing any transformation.

In addition, text watermarking based on semantics is language dependent, and language is not static. Language changes with the passage of time, and hence the security and copyright solution provided by semantics-based digital watermarking will have limited strength. The semantic techniques for digital watermarking use natural language processing algorithms to analyze the meaning of the text and to perform transformations. NLP is an immature area of research; hence, text watermarking using semantics does not provide a practical and complete text watermarking solution.

3.7 Summary

In this chapter, text document watermarking is described in detail. The application areas, the challenges, and the possible attacks are also described. It is observed that the text watermarking methods proposed so far for English-language text lack robustness, integrity, accuracy, and generality. Also, the amount of work done on text watermarking is very limited and specific.

Text watermarking algorithms using binary text images are not robust against text reproduction attacks and have limited applicability. Similarly, text watermarking using the syntactic and semantic structure of text is not robust against random tampering attacks.

These algorithms are specific to an application area and/or a language, with limited applicability and usability. The previous techniques are computationally expensive and not robust. Text, being an important medium of information exchange, requires complete protection. Text that encounters massive insertion, deletion, and reordering attacks needs to be protected from copyright violators. Therefore, efficient and practical text watermarking algorithms are required.