8/2/2019 718S-3
Chapter 3. Text Document Watermarking
3.1 Introduction
To actively embrace the Internet as a communication and content distribution medium, it is
necessary to secure Internet content by incorporating digital watermarking methods.
Digital watermarking methods for images, audio, and video are already in place and are
quite effective. Image watermarking exploits the existing redundancy in images and the
limitations of the human visual system (HVS). Similar properties have been
exploited for audio and video watermarking, where the watermark is embedded in video and
audio frames in such a way that it remains imperceptible.
Text is the most extensively used medium of communication on the Internet.
The major component of websites, books, newspapers, articles, and legal documents is
simply plain text. Therefore, plain text requires the utmost protection and security from
copyright violators. In the past, a number of digital watermarking algorithms have been
proposed for images, audio, and video; however, digital watermarking algorithms for
plain text are inadequate and ineffective.
Digital watermarking is the process of embedding a unique digital watermark in
digital content to protect it from illegal copying and copyright violations. The process of
embedding in, and extracting from, a digital text document a watermark which
uniquely identifies the original copyright owner of that text is called Digital Text
Watermarking. Text watermarking abides by the same principles as image, audio, or
video watermarking. The watermark should remain resilient to random tampering attacks,
undetectable to anybody but the original owner/author of the text, as well as easily and
20 Copyright Protection of Plain Text using Digital Watermarking
fully automatically reproducible by the watermark extraction algorithm. The main
concern in text watermarking is that plain text contains less redundant information
than images, audio, and video, and it is this redundancy that is used for secret
communication in steganography and for watermarking.
Text watermarking techniques should implant unique and invisible watermarks in text
documents which remain intact after diverse tampering attacks of insertion, deletion, and
re-ordering. Digital watermarking solutions for text make it safer to send and receive
text over the Internet, intranets, extranets, and facsimile. Documents can be evaluated for text
confidentiality and copyright protection. Digital text watermarking techniques can also
be used to detect any tampering with the text.
This chapter is organized as follows. Section 3.2 briefly describes the application
avenues for digital text watermarking. Section 3.3 explains why watermarking text is
difficult. Section 3.4 describes the possible attacks on text, their volume, and their
nature. Section 3.5 reviews previous approaches to text watermarking, and Section 3.6
discusses the drawbacks of these approaches. Section 3.7 summarizes the discussion.
3.2 Applications
Text watermarking can be used for a large number of applications in the real world. With
the increasing and widespread use of the Internet all over the world for information sharing,
text watermarking has gained more importance. The emerging concepts of digital
libraries, e-business, e-learning, e-government, and e-books have made text watermarking
a necessity. Legal documents, certificates, web sites, business plans, books, articles,
poetry, company documents, confidential contents, SMS, and emails can be protected by
text watermarking algorithms.
Text watermarking can be used for a number of purposes. Authentication, copyright
protection, copy prevention, covert communication, tamper detection, and fingerprinting
are some of the applications of text watermarking.
3.2.1 Authentication
For authentication, fragile watermarks can be used to detect any tampering of a text
document. If the watermark is detected, the text document is genuine; if not, the text has
been tampered with and cannot be relied upon. Authenticating text is essential,
especially when it is used for legal purposes. In sensitive communication, e.g. in defense
applications and in business communication, it is extremely important to authenticate
text messages and to check their reliability and completeness.
3.2.2 Copyright Protection
Text watermarking can be used to protect the intellectual property rights of plain text. It
is essential to protect the copyright of web content, e-books, research papers, journal
articles, poetry, quotes, and other documents containing plain text. The content
owner/author can embed a watermark representing the copyright information of the content.
This watermark can later be extracted to prove ownership if any conflicting copyright
claim arises in the future. This can be very helpful in settling copyright disputes in court. It is
probably the most prominent use of digital text watermarking.
3.2.3 Copy Prevention
Illegal copying and dissemination of text can also be prevented by text watermarking
algorithms. The watermarked information can directly control a digital recording/copying
device, which can be a printer or simply a copy-paste command. The embedded key can
represent a copy-permission bit stream that is detected by the recording software, which
then decides whether the copying procedure should go on (if it is allowed) or not (if it is
prohibited by the content owner).
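
The decision step described above can be illustrated with a minimal sketch. It assumes that the copy-permission bit stream has already been extracted from the text and that it has a hypothetical two-bit layout (first bit for copying, second bit for printing); neither the layout nor the names `CopyPolicy`, `parse_policy`, and `may_proceed` come from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CopyPolicy:
    allow_copy: bool
    allow_print: bool


def parse_policy(bits: str) -> CopyPolicy:
    """Interpret a hypothetical 2-bit permission stream (assumed layout):
    bits[0] = copy allowed, bits[1] = print allowed."""
    if len(bits) < 2 or set(bits) - {"0", "1"}:
        raise ValueError("expected at least two binary digits")
    return CopyPolicy(allow_copy=bits[0] == "1", allow_print=bits[1] == "1")


def may_proceed(bits: str, action: str) -> bool:
    """Return True if the requested action ('copy' or 'print') is permitted."""
    policy = parse_policy(bits)
    allowed = {"copy": policy.allow_copy, "print": policy.allow_print}
    return allowed[action]
```

Under these assumptions, `may_proceed("10", "copy")` allows the copy, while `may_proceed("10", "print")` refuses the print request.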
3.2.4 Covert Communication
The transmission of private data, which can be plain text or an image, is another
application of text watermarking. Covert communication in this sense means implanting a
strategic/secret message into an innocuous-looking text in a way that prevents any
unauthorized person from detecting it while allowing the intended recipient to retrieve it. The
text watermarking algorithms proposed in this thesis can be used for covert
communication as well.
3.2.5 Tamper Detection
Recent text watermarking algorithms can also identify the type, nature, and volume
of the tampering made by attackers in the original text. Thus, it sometimes becomes possible
to predict or sense the intentions of attackers. The plagiarism problems faced by
current researchers can be addressed by efficient tamper detection algorithms.
3.2.6 Fingerprinting
In order to trace the source of illegal copies, the text author/owner can embed different
watermarking keys in the copies that are supplied to different users. For the owner,
embedding a unique, serial-number-like watermark is a good way to detect the users who
break their license agreement by copying the protected data and supplying it to a third
party. Publishing companies can use such fingerprint watermarks to detect
copyright violators.
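
As a rough illustration of the serial-number idea, the sketch below derives a distinct bit string for each user and looks up which user a recovered bit string belongs to. It does not embed anything in text. The use of HMAC-SHA256 to derive the serial, and the names `fingerprint_for` and `trace_leak`, are assumptions made for this example and are not taken from the cited work.

```python
import hashlib
import hmac


def fingerprint_for(owner_key: bytes, user_id: str, n_bits: int = 32) -> str:
    """Derive a per-user bit string from the owner's secret key (assumed scheme)."""
    if not 1 <= n_bits <= 64:
        raise ValueError("n_bits must be between 1 and 64")
    digest = hmac.new(owner_key, user_id.encode("utf-8"), hashlib.sha256).digest()
    value = int.from_bytes(digest[:8], "big")
    return format(value, "064b")[:n_bits]


def trace_leak(owner_key: bytes, extracted_bits: str, user_ids):
    """Return the user whose fingerprint matches the extracted bits, or None."""
    for uid in user_ids:
        candidate = fingerprint_for(owner_key, uid, len(extracted_bits))
        if hmac.compare_digest(candidate, extracted_bits):
            return uid
    return None
```

The owner only needs to keep the secret key and the list of user identifiers; the fingerprints themselves can be recomputed on demand.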
3.3 Why Is Text Watermarking Difficult?
Plain text, being the simplest mode of information, brings various challenges when it
comes to copyright protection. Text has limited capacity for watermark embedding since
it has far less redundancy than images, audio, and video. The binary
nature with clear demarcation between foreground and background, block/line/word
patterning, semantics, structure, style, and language rules are some of the salient
properties of text which must be addressed in any text watermarking algorithm.
Besides, the inherent properties of a generic watermarking scheme, like imperceptibility,
robustness, and security, also need to be satisfied.
Any transformation on text should preserve the meaning, fluency, grammaticality, and
the value of text. The meaning of the text is its value, and it should be preserved through
watermarking in order not to disturb the communication. Fluency is required to represent
the meaning of the text in a clear and fluent way, more importantly in literary writings.
The embedding process should comply with the grammar rules of the language in order
to preserve the readability of the text. Preserving the style of the author is very important
in some domains such as literary writing or news channels [37]. The sensitive nature of
some documents, such as legal documents, poetry, and quotes, does not allow us to make
semantic transformations randomly, because in these forms of text a simple
transformation sometimes destroys both the semantic connotation and the value of text.
3.4 Attacks
The cyber community has shown little enthusiasm for text watermarking technologies. The
reason might be the undisclosed watermarking methods and the lack of robustness against
attacks. An attacker may be able to carry out a partial attack even when a complete
attack is not possible, so it is necessary to analyze each type of attack. Watermark attacks
include unauthorized insertion, unauthorized detection, and unauthorized deletion. These
unauthorized attacks, their volume, and nature are described as follows:
3.4.1 Types
Generally, text encounters reproduction, synonym substitution, reformatting,
paraphrasing, and syntactic transformation attacks. All these attacks can be placed in the
following categories: unauthorized insertion, unauthorized detection, unauthorized
deletion, re-ordering, and a combination of all of these.
i. Unauthorized Insertion
Under this form of attack, words and sentences are added to the text to make it look
different and sometimes to carry another message/watermark belonging to the attacker. An attacker
sometimes inserts text into the original to add some additional information. This
kind of attack happens when an attacker wants to add false information, for
example in legal documents and cases. Such attacks can be countered by
incorporating a certifying authority in the watermarking architecture which timestamps
the contents in the name of the author with the current date and time. Whenever a dispute over
a copyright claim arises, this timestamp is used to identify the author who registered the
content first.
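
The role of the certifying authority can be sketched as a small registry that stores a hash of the content together with the author and a timestamp. This is only an illustration under stated assumptions: the class `CertifyingAuthority`, its methods, and the choice of SHA-256 over the exact text are hypothetical and are not the architecture defined in this thesis.

```python
import hashlib
from datetime import datetime, timezone


class CertifyingAuthority:
    """Toy registry: the first registration of a given content hash wins."""

    def __init__(self):
        self._records = {}  # content hash -> (author, timestamp)

    @staticmethod
    def _digest(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def register(self, author: str, text: str):
        """Timestamp the text in the author's name unless already registered."""
        record = (author, datetime.now(timezone.utc))
        # setdefault keeps the existing record if this hash was seen before
        return self._records.setdefault(self._digest(text), record)

    def first_registrant(self, text: str):
        """Return the author who registered this exact text first, or None."""
        record = self._records.get(self._digest(text))
        return record[0] if record else None
```

Because the sketch hashes the exact text, it only resolves disputes over identical content; a tampered copy hashes differently, which is why the registered item in practice would need to be something that survives tampering, such as the watermark itself.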
ii. Unauthorized Detection
In some applications, the ability to detect should be restricted. It is conceivable that the
mere ability of an adversary to identify whether or not a mark is present in a given work
can threaten the security of a watermarking system [38].
iii. Unauthorized Deletion
A deletion attack means random deletion of words and sentences from the original text.
The attacker deletes some information to mislead the reader and hide the identity of the
original author/owner of the text. Security against unauthorized deletion is required in all
watermarking applications. It is necessary to prevent an attacker from recovering the
original, but it is more important to prevent removal of the watermark from the text. The
watermark should survive even if the attacker performs a number of alterations to the text,
and it should remain detectable by the extraction algorithm.
iv. Re-ordering
The attacker shuffles and reorders the words and sentences of the text to make it look
different and to destroy the watermark. In the case of text, the attacker also rephrases and replaces
certain words with their synonyms. The intention generally is to destroy the writing style,
connotation, and sometimes the meaning of the text.
3.4.2 Volume
The volume of attack depends on the attacker's intention. If the attacker is interested in
adding some information to the text or deleting some from it, then the volume of attack will be low.
However, if the attacker is interested in using some part of the text in his/her own text,
then the volume of attack will be high.
3.4.3 Nature
A combined insertion, deletion, and re-ordering attack is termed a tampering attack.
Tampering can be made at any random location in the text document. Tampering can be
made in two ways: dispersed tampering and localized tampering.
i. Localized
Localized tampering means insertion or deletion of words or sentences at a single
location in the text. This location can be at the beginning, at the end, or anywhere in the
text, depending on the attacker's intended use.
ii. Dispersed
Dispersed insertion and deletion of sentences and words can be made at multiple
locations in the original text. An attacker trying to make the text look different makes
dispersed changes throughout the text. This kind of attack generally occurs in research
plagiarism and literary writings.
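
For experiments, the attack types described in this section can be simulated on a tokenized text. The following is a minimal sketch, assuming the text is represented as a list of words; the function names and the seeded random generator are choices made for this example only.

```python
import random


def localized_insertion(words, new_words, position):
    """Insert a block of words at a single location (localized tampering)."""
    return words[:position] + list(new_words) + words[position:]


def dispersed_deletion(words, count, seed=0):
    """Delete `count` words at randomly chosen, scattered locations."""
    rng = random.Random(seed)
    drop = set(rng.sample(range(len(words)), count))
    return [w for i, w in enumerate(words) if i not in drop]


def reorder(words, seed=0):
    """Shuffle the words to simulate a re-ordering attack."""
    rng = random.Random(seed)
    shuffled = list(words)
    rng.shuffle(shuffled)
    return shuffled
```

A combined tampering attack can be simulated by chaining these functions, and the attack volume can be varied through `count` and the length of `new_words`.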
3.5 Literature Review
Text watermarking is an area of research that emerged after the development of the
Internet and communication technologies. The first reported effort to protect the copyright
of text was made in 1994 by Brassil et al. [14][15], when an issue of the IEEE Journal on Selected
Areas in Communications was scheduled for publication, as part of a Secure Electronic
Publishing Trial. There were over 1,200 registered users in the first month, and each copy of
each paper was registered and watermarked for its recipient [39]. Text watermarking is currently
a very active research area, with a number of researchers working on text watermarking for
the English language as well as Persian, Turkish, Korean, Urdu, and Arabic.
The previous work on digital text watermarking can be classified into the following
categories: an image-based approach, a syntactic approach, and a semantic approach.
A description of each category and the corresponding work follows:
3.5.1 Image-based Approach
In this approach towards digital text watermarking, the text document image is used to
embed the watermark. Text is difficult to watermark because of its simplicity,
sensitivity, and low capacity for watermark embedding. The initial attempts in text
watermarking tried to treat text as an image. The watermark was embedded in the layout and
appearance of the text image.
Brassil et al. proposed a few methods to watermark text documents by using the text image
[12-14]. The first method proposed by Brassil was the line-shift coding algorithm, which
alters the document image by moving lines upward or downward (left or right) depending
on the binary signal (watermark) to be inserted, as shown in Figure 3.1.
Figure 3.1 Line shift coding [16]
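
The idea of line-shift coding can be sketched on a simplified model in which a page is just a list of line baseline positions. The sketch below is not the algorithm of Brassil et al.; it assumes that every second line carries one bit, that a 1 moves the line down and a 0 moves it up by `delta` units, and that detection is non-blind (the original baselines are available). All names are hypothetical.

```python
def embed_line_shift(baselines, bits, delta=1):
    """Shift every second line down (bit 1) or up (bit 0) by `delta` units.

    `baselines` holds the y-coordinate of each text line, increasing downward.
    Lines at even indices are left unmoved (assumed convention).
    """
    carriers = range(1, len(baselines), 2)
    if len(bits) > len(carriers):
        raise ValueError("not enough lines to carry the watermark")
    marked = list(baselines)
    for idx, bit in zip(carriers, bits):
        marked[idx] += delta if bit == 1 else -delta
    return marked


def detect_line_shift(original, marked, n_bits):
    """Non-blind detection: compare each carrier line with the original."""
    carriers = list(range(1, len(original), 2))[:n_bits]
    return [1 if marked[i] > original[i] else 0 for i in carriers]
```

For example, with baselines `[0, 20, 40, 60, 80, 100]` and bits `[1, 0]`, the second line moves to 21 and the fourth to 59, and comparing against the original recovers `[1, 0]`.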
The detection algorithm is non-blind, that is, the original document must be
available. The second method was the word-shift coding algorithm, which moves the
words within the text horizontally, thus expanding spaces to embed the watermark. The
algorithm can operate in both non-blind and blind modes. The third method is the feature
coding algorithm, which slightly modifies features such as the pixels of characters or the
length of the end lines in characters to encode watermark bits in the text. All these
proposed techniques discourage unauthorized distribution by embedding a unique
codeword in each document. Among the three presented methods, line-shift coding is the
most robust solution under diverse attacks, but it too can be easily defeated.
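
A blind variant of word-shift coding can be sketched by encoding each bit in the difference between two neighboring inter-word gaps, so that the detector does not need the original document. This differential scheme and the names below are assumptions made for illustration; the sketch presumes the original gaps within each pair are equal, which real typeset text does not guarantee.

```python
def _gap_pairs(n_gaps):
    """Non-overlapping pairs of adjacent gap indices: (0, 1), (2, 3), ..."""
    return [(i, i + 1) for i in range(0, n_gaps - 1, 2)]


def embed_word_shift(gaps, bits, delta=1):
    """Move the word between two gaps right (bit 1) or left (bit 0).

    `gaps` lists the inter-word space widths of one line. Shifting a word
    right widens its left gap and narrows its right gap, so the total line
    length is unchanged.
    """
    pairs = _gap_pairs(len(gaps))
    if len(bits) > len(pairs):
        raise ValueError("not enough words to carry the watermark")
    marked = list(gaps)
    for (left, right), bit in zip(pairs, bits):
        step = delta if bit == 1 else -delta
        marked[left] += step
        marked[right] -= step
    return marked


def detect_word_shift(marked, n_bits):
    """Blind detection: a wider left gap than right gap decodes as 1."""
    pairs = _gap_pairs(len(marked))[:n_bits]
    return [1 if marked[left] > marked[right] else 0 for left, right in pairs]
```

With six gaps of width 10 and bits `[1, 0, 1]`, the marked gaps become `[11, 9, 9, 11, 11, 9]`, and the bits can be read back without the original line.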
Maxemchuk et al. [39][40][41] analyzed the performance of the above-mentioned
methods. Correlation-based and centroid-based detection methods were also suggested [42]:
the former treats a profile as a discrete-time signal and looks for the direction of shift,
while the latter uses the distances between the centroids of adjacent profile blocks to
detect the watermark. Low et al. [42][43] further analyzed the efficiency of the methods.
Huang and Yan [43] proposed an algorithm based on the average inter-word distance in
each line. The distances are adjusted according to a sine wave of a specific phase and
frequency. Feature-level and pixel-level algorithms were also developed, which mark
the documents by modifying stroke features such as width or serif [45]. An algorithm
which uses a grayscale image of the text was also developed [46]. Another algorithm, which
watermarks the text document image using an edge direction histogram, was also proposed [47].
Young-Won Kim et al. proposed a text watermarking algorithm based on word
classification and inter-word space statistics [48]. In this approach, all the words in a text
document are classified depending on some text features; adjacent words then
form a segment, and each segment is classified depending on the class labels of the words
within it. The information is encoded by modifying some statistics of the inter-word
spaces of the segments belonging to the same class. Several advantages over the
conventional word-shift algorithms are discussed. Adnan M. Alattar et al. proposed an
algorithm [49] to watermark electronic text documents containing justified paragraphs
and irregular line spacing.
Algorithms which examine a printed text document to identify the source printer
were also developed [50]. These methods treat print quality defects, such as the banding
features of a printed text document, as an intrinsic signature of the printer. These features can
identify the specific make and model of the device which created it. Cox et al. [51]
described a number of applications of digital watermarking and their common properties
like robustness, tamper resistance, fidelity, computational cost, and false positive rate.
They observed that these properties vary greatly depending on the application. They also
described seven applications of watermarking: broadcast monitoring, owner
identification, proof of ownership, authentication, transactional watermarks, copy control,
and covert communication.
Yang and Kot [52] proposed a method for watermarking text document images to
authenticate the owner or authorized user. The proposed method makes use
of the integrated inter-character and inter-word spaces for watermark embedding. An
overlapping component of size three is utilized, whereby the relationship between the
left and right spaces of the character is employed for the watermark embedding. The
integrity of the document can be ensured by comparing the hash value of the character
components of the document before and after watermark embedding, which can be
applied to other line-shifting and word-shifting methods as well.
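
The integrity check mentioned above can be illustrated with a plain-text analogue. Yang and Kot work on document images; the sketch below instead assumes the document is a character string in which embedding only changes whitespace, so a hash over the non-space characters should be identical before and after embedding. The function name and the use of SHA-256 are assumptions for this example.

```python
import hashlib


def character_hash(text: str) -> str:
    """Hash only the non-whitespace characters, ignoring all spacing."""
    glyphs = "".join(text.split())
    return hashlib.sha256(glyphs.encode("utf-8")).hexdigest()


def embedding_preserved_characters(before: str, after: str) -> bool:
    """True if the two versions differ at most in their spacing."""
    return character_hash(before) == character_hash(after)
```

A space-based embedding step passes this check, whereas any change to a character, whether caused by a faulty embedder or by tampering, fails it.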
Chao et al. [53] proposed a steganographic method to embed secret information into
text files. This is achieved by making slight modifications to scattered inter-word spaces
of the formatted text using the popular typesetting tool TeX. Qadir and Ahmad [54]
suggested an intelligent encoding scheme for text watermarking which alters neither
the syntax nor the layout of the document, thus providing a layout/format-independent
technique in which information within the text is manipulated to hide certain information.
Abdullah and Wahab [55] presented a text watermarking scheme targeting an object-based
environment. The heart of the proposed solution is the concept of
watermarking an object-based text document, where each and every text string is
treated as a separate object having its own set of properties. Taking advantage of the
z-ordering of objects, the watermark is applied along the z-axis, causing no fidelity
disturbance to the text.
Villan et al. [56] analyzed the theoretical and practical aspects of text data hiding in printed
documents. Mikkilineni et al. [57][58][59] worked to enhance data hiding and
watermark embedding capacity of printed paper documents. Micic et al. [60] proposed
an algorithm for authentication of text documents using digital watermarking, in which text document
images are compared to evaluate changes. Xingming Sun and his team proposed a
component-based digital watermarking algorithm for Chinese texts [61]. Li and Dong
[62] proposed an algorithm for Chinese text watermarking based on the structure of Chinese
characters. Another text watermarking algorithm using eigenvalues was also proposed [63].
Culnane et al. [64] proposed a binary text watermarking algorithm using continuous line
embedding. Zhou et al. [65] presented a zero-watermarking algorithm for content
authentication of Chinese text documents.
3.5.2 Syntactic Approach
Text is made up of characters, words, and sentences. Sentences have different syntactic
structures. Applying syntactic transformations on text structure to embed watermark has
also been one of the approaches towards text watermarking in the past.
Mikhail J. Atallah et al. first proposed a natural language watermarking scheme
using the syntactic structure of text [17][18][66], in which a syntactic tree is built and
transformations are applied to it to embed the watermark while preserving all inherent
properties of the text. They developed techniques for embedding a robust watermark in
text by combining a number of information assurance and security techniques with the
advanced resources of natural language processing.
For watermark embedding, they used manipulations of the TMR (text meaning
representation), such as grafting, pruning, and substitution. These methods are resistant
to many attacks but change the text to a large extent. Hence, they cannot be applied to
text of a sensitive nature like poetry, legal documents, transcripts, and contracts.
Natural Language Processing (NLP) techniques are used to analyze the syntactic and the
semantic structure of text while performing any transformations to embed the watermark
bits.
Figure 3.2 Syntactic sentence level watermarking [68]
Kankanhalli and Hau [67] proposed a method to watermark electronic text documents
using the ASCII characters and punctuation in text.
Hassan et al. proposed a natural language watermarking algorithm that performs
morpho-syntactic alterations to the text [68]. The text is first transformed into a syntactic
tree diagram in which the hierarchies and the functional dependencies are made explicit, and then the
watermark is embedded. The watermarking process is shown in Figure 3.2. The authors
stated that agglutinative languages like Turkish are easier to watermark than the English
language. The watermarking solutions for agglutinative languages like Turkish, Korean,
Arabic, and Urdu are efficient since these languages provide space for watermark
embedding. However, the syntactic solutions for the English language are not
adequate.
Hassan et al. also proposed 21 syntactic tools for text watermarking [69], and Mi-Young
Kim [70] recently proposed an algorithm for text watermarking using syntactic analysis
of plain text. Kim [71] also proposed a natural language watermarking algorithm for the
Korean language using adverbial displacement. Helge Hoehn proposed a natural
language watermarking algorithm which uses semantic and syntactic
transformations of the original text contents rather than modifying the text [72].
Murphy and Vogel [73] presented three natural language marking algorithms using
shallow parsing techniques, lexical substitutions, and swapping. They also analyzed the
significance of automated and reversible syntactic transformations for hiding data in plain
text [74].
3.5.3 Semantic Approach
Semantic watermarking schemes focus on using the semantic structure of text to
embed the watermark. Text contents, verbs, nouns, words and their spellings, acronyms,
sentence structure, grammar rules, etc. have been exploited to insert a watermark in the text,
but none of these has proved to be resilient, and they degrade the quality of the text to a large
extent.
Atallah et al. were the first to propose semantic watermarking schemes, in 2000
[17][18][75][76]. Later, the synonym substitution method was proposed, in which the
watermark is embedded by replacing certain words with their synonyms [19]. Xingming
et al. proposed a noun-verb based technique for text watermarking [77] which exploits
nouns and verbs in a sentence parsed with a grammar parser using semantic networks.
Mercan et al. proposed a sentence-based text watermarking algorithm [78] which relies
on multiple features of each sentence and exploits the notion of orthogonality between
features. Later, Mercan et al. proposed a text watermarking algorithm that uses
idiosyncrasies to embed the watermark [20]. The algorithm makes clever use of typing
errors, acronyms, and abbreviations that are common in cursory text like emails, blogs,
chat, SMS, etc.
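
The synonym substitution idea can be sketched as follows. This is a toy illustration, not the method of [19]: it assumes a tiny hand-made table of synonym pairs (`SYNONYMS`), whitespace tokenization, and lowercase matching, with the first word of each pair encoding 0 and the second encoding 1.

```python
# Hypothetical synonym pairs: first member encodes 0, second encodes 1.
SYNONYMS = [("big", "large"), ("quick", "fast"), ("begin", "start")]
_LOOKUP = {word: (pair, bit) for pair in SYNONYMS for bit, word in enumerate(pair)}


def embed(text, bits):
    """Replace carrier words so that each one encodes the next watermark bit."""
    queue = list(bits)
    out = []
    for word in text.split():
        entry = _LOOKUP.get(word.lower())
        if entry and queue:
            pair, _ = entry
            word = pair[queue.pop(0)]
        out.append(word)
    if queue:
        raise ValueError("not enough carrier words for the watermark")
    return " ".join(out)


def extract(text):
    """Read one bit from every carrier word found in the text."""
    words = (w.lower() for w in text.split())
    return [_LOOKUP[w][1] for w in words if w in _LOOKUP]
```

Note that `extract` returns a bit for every carrier word, including those beyond the embedded message, so the receiver needs to know the message length. A practical scheme would also have to handle punctuation, capitalization, and word sense, which this sketch ignores.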
Algorithms were developed to watermark text using the linguistic semantic
phenomenon of presuppositions [21][22] by observing discourse meanings and
representations. A presupposition is implicit information that is considered well known.
Presuppositions are identified, and then transformations like passivization, topicalization,
extraposition, and preposing are applied to embed the watermark in the text.
Text pruning and grafting algorithms were also developed in the past. An
algorithm based on text meaning representation (TMR) strings has recently been
proposed [79]. Shirali-Shahreza et al. [80] proposed a new method for the secret exchange of
information through SMS by using abbreviation text steganography with the
invented language of SMS-texting. They also proposed a method for steganography in
English texts. In this method, the US and UK spellings of words are substituted in order to
hide data in an English text, since in English some words are spelled differently in the US and the UK
[81]. Later, Rafat [82] proposed an enhanced method for SMS steganography using SMS-texting
language, by removing the static nature of the word-abbreviation list and introducing
computationally lightweight XOR encryption.
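
The spelling-based method and the XOR step can be combined in a small sketch. It is an illustration under stated assumptions rather than the cited algorithms: the word list `US_UK` is a tiny hypothetical sample, a US spelling encodes 0 and a UK spelling encodes 1, and the XOR key is a short repeating bit list. Applying XOR to English spelling variants (instead of SMS abbreviations, as in [82]) is a simplification made here.

```python
# Hypothetical sample of US/UK spelling pairs.
US_UK = {"color": "colour", "center": "centre", "favorite": "favourite"}
UK_US = {uk: us for us, uk in US_UK.items()}


def xor_bits(bits, key):
    """XOR the bits with a repeating key; applying it twice restores them."""
    return [b ^ key[i % len(key)] for i, b in enumerate(bits)]


def hide(text, bits):
    """Write each bit as a spelling choice: US spelling = 0, UK spelling = 1."""
    queue = list(bits)
    out = []
    for word in text.split():
        lower = word.lower()
        us = lower if lower in US_UK else UK_US.get(lower)
        if us is not None and queue:
            word = US_UK[us] if queue.pop(0) else us
        out.append(word)
    if queue:
        raise ValueError("not enough carrier words for the message")
    return " ".join(out)


def reveal(text):
    """Read one bit from every word that has a US/UK spelling variant."""
    words = [w.lower() for w in text.split()]
    return [1 if w in UK_US else 0 for w in words if w in US_UK or w in UK_US]
```

A sender would call `hide(cover, xor_bits(secret, key))`, and the receiver recovers the secret with `xor_bits(reveal(stego), key)`.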
Das [83] proposed an enhanced buyer-seller watermarking protocol based on the public
key encryption standard, which is secure and flexible.
Jonathan et al. [84] and Robert [85] provided studies and surveys of digital
watermarking techniques for text, image, and video documents. Zhang et al. [86]
explored the application of text watermarking in digital reading, in which rights holders are
compensated for any copyright violation.
3.6 Drawbacks
Text watermarking algorithms using binary text images are not robust against re-typing
and text reproduction attacks. With the increasing and efficient use of OCR (Optical
Character Recognition) nowadays, these methods fail completely. The use of OCR
can destroy the changes made by shifting lines or words, as well as those made to the document
margins, fonts, serifs, and other features of the text. Also, the watermark can easily be
destroyed by a simple copy-and-paste to Notepad attack.
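
The fragility of layout-based marks can be demonstrated with a plain-text analogue. The sketch below assumes a toy scheme, not taken from the cited work, in which a single space encodes 0 and a double space encodes 1; re-typing or pasting through a tool that normalizes whitespace is modeled by the `retype` function.

```python
import re


def embed_in_spaces(text, bits):
    """Encode bits in inter-word gaps: one space = 0, two spaces = 1."""
    words = text.split()
    if len(bits) > len(words) - 1:
        raise ValueError("not enough gaps to carry the watermark")
    out = words[0]
    for i, word in enumerate(words[1:]):
        gap = "  " if i < len(bits) and bits[i] == 1 else " "
        out += gap + word
    return out


def extract_from_spaces(text, n_bits):
    """Read the first n_bits gaps back as bits."""
    gaps = re.findall(r" +", text)
    return [1 if len(g) == 2 else 0 for g in gaps[:n_bits]]


def retype(text):
    """Model a re-typing attack: every gap collapses to a single space."""
    return " ".join(text.split())
```

The watermark survives as long as the spacing is untouched, but after `retype` every extracted bit is 0, so the mark is lost even though the visible words are unchanged.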
Text watermarking using syntactic structure combined with natural language
processing algorithms is an efficient approach towards text watermarking, but research
progress in NLP is very slow. Syntactic sentence paraphrasing can make a sentence
sound unnatural. Syntactic techniques also require high-performance syntactic analyzers.
The transformations applied using NLP algorithms are, most of the time, non-reversible.
Semantic text watermarking techniques significantly improve the information hiding
capacity of English text by modifying the granularity of meaning of an individual
term or sentence, but semantic text watermarking schemes are very conceptual and
impractical. The synonym-based techniques are not resilient to random synonym
substitution attacks. There may be cases where the wrong words get selected for synonym
substitution. Moreover, synonym-based methods require a large synonym dictionary and
a huge collocation database.
The sensitive nature of some text, like legal documents, poetry, and quotations, does not allow
us to make random semantic transformations. The reason is the necessity to
preserve the semantic connotation as well as the value of the text while performing any
transformation.
In addition, text watermarking based on semantics is language dependent, and
language is not static. Language changes with the passage of time, and hence the
security and copyright solution provided by digital watermarking based on semantics will
have limited strength. The semantic techniques for digital watermarking use natural
language processing algorithms to analyze text meaning and to perform transformations.
NLP is an immature area of research; hence, text watermarking using semantics does not
provide a practical and complete text watermarking solution.
3.7 Summary
In this chapter, text document watermarking is described in detail. The
application areas, the challenges, and the possible attacks are also described. It is observed
that the text watermarking methods proposed so far for English language text lack
robustness, integrity, accuracy, and generality. Also, the amount of work done on text
watermarking is very limited and specific.
Text watermarking algorithms using binary text image are not robust against text
reproduction attacks and have limited applicability. Similarly, text watermarking using
text syntactic and semantic structure is not robust against random tampering attacks.
These algorithms are application area and/or language specific, with limited applicability
and usability. The previous techniques are computationally expensive and not robust.
Text, being an important medium of information exchange, requires complete protection.
Text that encounters massive insertion, deletion, and reordering attacks needs to be protected
from copyright violators. Therefore, efficient and practical text watermarking algorithms
are required.