Mohammed Aabed, Sameh Awaideh, Abdul-Rahman Elshafei

Arabic Diacritics (حركات) Based Steganography

Steganography is the practice of hiding information in the redundant bits of an unremarkable cover medium.

This presentation will discuss new Arabic text steganography schemes.

Difficulties of Text Steganography

In steganography, the cover medium used to hide the message can be a text, image, video or audio file.

Using text as the cover medium is considered the hardest case.

Text contains very little redundant information beyond the essential data.

Fig. 1: Data Hiding in Binary Text Documents

Previous Techniques

Many techniques have been proposed for text steganography that are mostly graphical in nature:

1. Line shifting:

• Text is divided into lines.

• A 1 is encoded by shifting a line by a small fraction that the naked eye cannot detect.

• A 0 is encoded by leaving the line as is.

2. Word shifting:

• Same as the previous technique, but the text is divided into words.

3. Word horizontal shifting:

• Same as the previous technique, but words are shifted left and right to indicate bits.

Original Text: We are embedding a ‘b’ using horizontal word shifting.

Modified Text: We are embedding a ‘b’ using horizontal word shifting.

Hiding ‘b’ = 0x62 = 0 1 1 0 0 0 1 0b

4. White space manipulation:

• White space at the end of a line is not visible.

5. HTML formatting:

• HTML syntax is case-insensitive. This can be used to hide information.
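As a rough illustration of the bit-level idea behind these techniques, the sketch below expands the character ‘b’ (0x62) into its eight bits and hides bits with the white space technique. The convention of one trailing space for a 1 and none for a 0, and the function names, are assumptions made for this example; the cover lines are assumed to have no trailing spaces of their own.

```python
def byte_to_bits(ch):
    """Return the 8 bits of a one-byte character, most significant bit first."""
    value = ord(ch)
    return [(value >> shift) & 1 for shift in range(7, -1, -1)]


def hide_in_trailing_spaces(lines, bits):
    """Append one trailing space to a line for a 1; leave the line as is for a 0.

    Assumed convention for this sketch, one bit per line of cover text.
    """
    if len(bits) > len(lines):
        raise ValueError("cover text has too few lines for the message")
    marked = [line + " " if bit else line for line, bit in zip(lines, bits)]
    return marked + lines[len(bits):]  # lines past the message stay untouched


def recover_from_trailing_spaces(lines, count):
    """Read back `count` bits: a trailing space is a 1, anything else is a 0."""
    return [1 if line.endswith(" ") else 0 for line in lines[:count]]


cover = ["line %d" % i for i in range(10)]
bits = byte_to_bits("b")
stego = hide_in_trailing_spaces(cover, bits)
print(bits, recover_from_trailing_spaces(stego, len(bits)))
```

The receiver has to know how many bits to read, which this sketch passes in explicitly.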

Previous Techniques

• Other variations of the previous techniques have been proposed:

• Pointed letters shifting.

• Kashida insertion.

• Some approaches consider the syntactic structure of the language used.

• For example, ‘run’ can be used instead of ‘sprint’ to carry hidden meaning.

• In summary, the previous techniques exploit one of two areas:

• Limitations of human sight.

• Specific language grammar.

Arabic Based Steganography

Arabic is the largest living member of the Semitic language family in terms of speakers (270 million speakers).

Its alphabet contains 28 characters, 15 of which have points.

Characters with no points: أ ح د ر س ص ط ع ك ل م هـ و

Characters with one point: ب ج خ ذ ز ض ظ غ ف ن

Characters with two points: ت ق ي

Characters with three points: ث ش

Fig. 2: Arabic Alphabet
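The counts quoted for the alphabet can be checked directly against the four groups in Fig. 2. This is only a small sanity check; the dictionary name is an assumption made for the example.

```python
# Letters grouped by number of points, copied from Fig. 2.
LETTERS_BY_POINTS = {
    0: "أ ح د ر س ص ط ع ك ل م هـ و".split(),
    1: "ب ج خ ذ ز ض ظ غ ف ن".split(),
    2: "ت ق ي".split(),
    3: "ث ش".split(),
}

total = sum(len(group) for group in LETTERS_BY_POINTS.values())
pointed = sum(len(group) for points, group in LETTERS_BY_POINTS.items() if points > 0)
print(total, pointed)
```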

Previous Approaches

Vertical displacement of the points in the Arabic alphabet to hide information.

Using letter points and extensions to hide data.

Fig. 3: Using vertical displacement to hide data (M. Hassan Shirali-Shahreza, Mohammad Shirali-Shahreza)

Fig. 4: Using extensions to hide data (A. Gutub)

Diacritics (Harakat – حركات)

The Arabic language uses eight symbols as diacritical marks.

They are used to alter the pronunciation of a phoneme or to distinguish between words with the same spelling.

The use of diacritics in the text is optional in written Standard Arabic.

Fig. 5: Arabic Diacritics

Statistics for Diacritics

First we needed to find the average occurrences of diacritics in a fully diacritized Arabic document.

Then we needed to compare these occurrences to find the best embedding technique available.

Both ambiguity and capacity are important factors to consider.
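A minimal sketch of how such statistics could be gathered is shown below. It assumes the eight diacritics are the Unicode combining marks U+064B to U+0652 and that the document is available as a Unicode string; the names `DIACRITICS` and `diacritic_shares` are hypothetical and not taken from the presentation.

```python
from collections import Counter

# Assumption: the eight marks are the Unicode code points U+064B..U+0652.
DIACRITICS = {
    "\u064B": "Fathatan",
    "\u064C": "Dammatan",
    "\u064D": "Kasratan",
    "\u064E": "Fatha",
    "\u064F": "Damma",
    "\u0650": "Kasra",
    "\u0651": "Shadda",
    "\u0652": "Sukun",
}


def diacritic_shares(text):
    """Return each diacritic's share of all diacritics found in `text`."""
    counts = Counter(DIACRITICS[ch] for ch in text if ch in DIACRITICS)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}


# Two short words: "kutiba" (damma, kasra, fatha) and "kataba" (three fathas).
sample = "\u0643\u064F\u062A\u0650\u0628\u064E \u0643\u064E\u062A\u064E\u0628\u064E"
print(diacritic_shares(sample))
```

Run over a large, fully diacritized corpus, the share reported for Fatha is the figure that matters when choosing how to map diacritics to bits.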

Fig. 6: Sample of diacritized Arabic text

Fig. 7: Statistics

Using Diacritics To Hide Data

Analysis indicates that in Standard Arabic the frequency of one diacritic, namely Fatha, is almost equal to the combined frequency of the other seven diacritics.

Assign a 1 to the Fatha and let the remaining seven diacritics represent a 0.

Use a cover medium that contains no diacritics.

Fig. 8: Diacritized and non-diacritized text

To encode a value of 1 the algorithm looks for the first location where a Fatha can be placed and inserts the diacritic Fatha in the text.

Location determination is based on the rules defined by the Standard Arabic language grammar and syntax.

Alternatively, the locations can be read from a copy of the cover medium that is already diacritized, which is faster and less complex.

Syntactically Correct

Implementation Example

Next, the algorithm looks for the next location where a Fatha can be placed if another 1 needs to be inserted and adds the Fatha.

Otherwise, to insert a bit value of 0, the algorithm locates the next position where any of the other diacritics can be inserted and adds that diacritic.

This process is repeated for as long as there are bits remaining to be hidden.
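The procedure can be sketched as follows, using the simpler variant that works from an already diacritized copy of the cover text: a diacritic is kept when it encodes the next bit and dropped otherwise, which is equivalent to inserting it into a bare cover at that position. The code point range U+064B to U+0652 and the names `embed` and `extract` are assumptions made for this sketch, and each combining mark is treated independently (for example, a Shadda and the vowel on the same letter are two separate marks).

```python
FATHA = "\u064E"
# Assumption: the eight diacritics are the code points U+064B..U+0652.
DIACRITICS = frozenset(chr(cp) for cp in range(0x064B, 0x0653))


def embed(diacritized_cover, bits):
    """Keep a Fatha for each 1 and any other diacritic for each 0, in order."""
    pending = list(bits)
    out = []
    for ch in diacritized_cover:
        if ch not in DIACRITICS:
            out.append(ch)  # letters, spaces and punctuation always stay
            continue
        if pending and (ch == FATHA) == (pending[0] == 1):
            out.append(ch)  # this mark encodes the next bit
            pending.pop(0)
        # Any other mark is dropped, including all marks after the last bit.
    if pending:
        raise ValueError("cover text has too few suitable diacritics")
    return "".join(out)


def extract(stego):
    """Read the diacritics in order: Fatha is 1, every other diacritic is 0."""
    return [1 if ch == FATHA else 0 for ch in stego if ch in DIACRITICS]


cover = "\u0643\u064F\u062A\u0650\u0628\u064E \u0643\u064E\u062A\u064E\u0628\u064E"
print(extract(embed(cover, [0, 1, 1])))
```

Because every diacritic left in the output carries a bit, the receiver needs no separate length field in this sketch; a real scheme might prefer to keep more marks so the text looks less unusual.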

Fig. 9: Encoding the sequence 10101110101110000 using diacritics

قـال الشيـخ اإلمـام الـحـافظ أبـو عبـد الـلـه محمد بن إسماعيـل بن إبراهـيـم بن الـمـغيرة الـبخاري رحـمـه الـلـه تعـالـى

آمـين

Fig. 10: Encoding the same sequence using Kashida

Reusing The Cover Media

The output file will have fewer diacritics than the original cover medium (because of deletion).

Reusing the same document more than once therefore means less capacity each time.

A research group at IBM has proposed techniques for restoration of Arabic diacritics based on maximum entropy.

Fig. 11: Error rate in % for n-gram diacritic restoration

Results

Compared to other techniques, capacity is the highest if a fully diacritized document is used as the cover medium.

Ambiguity depends on the reader’s familiarity with the Arabic language.

Robustness is high since it can withstand:

• Printing

• Retyping

• Font changing

• OCR

Table 1: Diacritics Technique

File Type    File Size (Bytes)    Cover Size (Bytes)    Capacity (%)
.txt         10,356               318,632               3.250
.wav         43,468               1,334,865             3.256
.jpg         23,796               717,135               3.318
.cpp         10,356               318,216               3.254
Average                                                 3.27

Table 2: Kashida Technique

File Type    File Size (Bytes)    Cover Size (Bytes)    Capacity (%)
.txt         4,439                365,181               1.215
.html        4,439                378,589               1.172
.cpp         10,127               799,577               1.266
.gif         188                  15,112                1.244
Average                                                 1.22
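The capacity figures appear to be the hidden file size divided by the cover size, expressed as a percentage. That formula is an inference from the numbers rather than something the slides state, and the sketch below simply recomputes the two averages under that assumption. The variable names are hypothetical.

```python
def capacity_percent(file_bytes, cover_bytes):
    """Hidden file size as a percentage of the cover size (assumed definition)."""
    return 100.0 * file_bytes / cover_bytes


def average_capacity(rows):
    """Mean capacity over (file type, file size, cover size) rows."""
    return sum(capacity_percent(f, c) for _, f, c in rows) / len(rows)


# Rows copied from the two results tables, in the order they appear.
FIRST_TABLE = [
    (".txt", 10356, 318632),
    (".wav", 43468, 1334865),
    (".jpg", 23796, 717135),
    (".cpp", 10356, 318216),
]
SECOND_TABLE = [
    (".txt", 4439, 365181),
    (".html", 4439, 378589),
    (".cpp", 10127, 799577),
    (".gif", 188, 15112),
]

print(round(average_capacity(FIRST_TABLE), 2))
print(round(average_capacity(SECOND_TABLE), 2))
```

Under this definition, the first table's average is a little under three times the second's.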

Analysis

Advantages

Approach is easily implemented using software.

It produces high capacity.

Can be modified for more ambiguity, for example by using one of the diacritics as a dummy diacritic or as a switching diacritic.

Fairly robust. Can withstand OCR, retyping, printing and font changing.

Disadvantages

Medium to low ambiguity.

Sending an Arabic message with diacritics might raise suspicion nowadays.

Arabic fonts have different encodings on different machines, so the result can be machine-dependent.