String Analysis and its Application in Software Internationalization Lu Zhang, Peking University

Source: lcs.ios.ac.cn/~zj/20091212/zhanglu_StringAnalysis.pdf · 2009-12-15


String Analysis: Introduction

String analysis was first developed by A. S. Christensen et al. in 2003 to predict the possible values of a string variable. Minamide enhanced string analysis in 2005 by adding finite state transducers (FSTs).

Input: code, and a string variable V whose possible values are requested

Output: a grammar whose language approximates the set of possible values of V

String Analysis: Overview

1. Code to SSA Form
2. SSA Form to Extended-CFG
3. Extended-CFG to CFG

From Code to SSA Form

Code:

    String x = "abc";
    for (int i = 0; i < n; i++)
        x = "0" + x + "1";
    String s = x.replace("00", "0");
    System.out.print(s);

SSA Form:

    x = "abc"
    for (i = 0; i < n; i++)
        x1 = "0" . φ(x1, x) . "1";
    x2 = φ(x1, x);
    s = x2.replace("00", "0");
    System.out.print(s);

SSA Form to Extended CFG

Rules:

    x = expression            =>  x -> expression
    φ(x1, x2)                 =>  x1 | x2
    x1 + x2 (concatenation)   =>  x1 x2
    String operations         =>  add the invoking object as an argument

SSA Form:

    x = "abc"
    for (i = 0; i < n; i++)
        x1 = "0" . φ(x1, x) . "1";
    x2 = φ(x1, x);
    s = x2.replace("00", "0");
    System.out.print(s);

Extended CFG:

    X  -> abc
    X1 -> 0 X 1 | 0 X1 1
    X2 -> X1 | X
    S  -> str_replace("00", "0", X2)
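To see which strings these productions describe, the grammar can be written down as data and expanded to a bounded depth. The following is a minimal sketch, not the analysis tool's implementation: the class name `ExtendedCfgSketch`, the map-based grammar encoding, and the depth bound are assumptions made for illustration, and str_replace is evaluated with `String.replace` on the enumerated strings instead of being modeled inside the grammar.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ExtendedCfgSketch {
    // Productions for X, X1 and X2; symbols without productions are terminals.
    static Map<String, List<List<String>>> grammar() {
        Map<String, List<List<String>>> g = new LinkedHashMap<>();
        g.put("X", List.of(List.of("abc")));
        g.put("X1", List.of(List.of("0", "X", "1"), List.of("0", "X1", "1")));
        g.put("X2", List.of(List.of("X1"), List.of("X")));
        return g;
    }

    // All strings derivable from sym with at most `depth` nested production steps.
    static Set<String> derive(Map<String, List<List<String>>> g, String sym, int depth) {
        if (!g.containsKey(sym)) {
            return Set.of(sym);                 // terminal: derives itself
        }
        Set<String> result = new TreeSet<>();
        if (depth == 0) {
            return result;                      // bound reached, drop this branch
        }
        for (List<String> rhs : g.get(sym)) {
            Set<String> prefixes = new TreeSet<>(Set.of(""));
            for (String part : rhs) {
                Set<String> extended = new TreeSet<>();
                for (String prefix : prefixes) {
                    for (String suffix : derive(g, part, depth - 1)) {
                        extended.add(prefix + suffix);
                    }
                }
                prefixes = extended;
            }
            result.addAll(prefixes);
        }
        return result;
    }

    // S -> str_replace("00", "0", X2), evaluated on the bounded enumeration of X2.
    static Set<String> valuesOfS(int depth) {
        Set<String> values = new TreeSet<>();
        for (String x2 : derive(grammar(), "X2", depth)) {
            values.add(x2.replace("00", "0"));
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(valuesOfS(4));
    }
}
```

The bound only limits what the sketch prints; the grammar itself describes the values for every loop count n.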

Finite State Transducer

A finite state transducer (FST) is like a finite automaton, but it not only accepts input strings; it also emits an output string for each input string it accepts.

The FST in the figure simulates the function str_replace("00", "0", x).

[Figure: FST for str_replace("00", "0", x)]
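Since the figure is not part of this text, here is one way such a transducer could look, written as a small state machine. This is a sketch under assumptions: the two-state design and the names `ReplaceFst` and `transduce` are illustrative and need not match the machine drawn on the slide. State 0 means no character is held back; state 1 means one '0' has been read but not yet emitted.

```java
class ReplaceFst {
    // Runs the transducer on the whole input and returns the emitted output.
    static String transduce(String input) {
        StringBuilder out = new StringBuilder();
        int state = 0;
        for (char c : input.toCharArray()) {
            if (state == 0) {
                if (c == '0') {
                    state = 1;                      // hold the '0', emit nothing yet
                } else {
                    out.append(c);                  // copy any other character
                }
            } else {
                if (c == '0') {
                    out.append('0');                // "00" was read: emit a single "0"
                } else {
                    out.append('0').append(c);      // lone '0': emit it, then c
                }
                state = 0;
            }
        }
        if (state == 1) {
            out.append('0');                        // flush a '0' still held at the end
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "000abc111";
        System.out.println(transduce(input) + " " + input.replace("00", "0"));
    }
}
```

On every input the sketch is meant to agree with Java's `String.replace("00", "0")`, which scans left to right and replaces non-overlapping matches.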

Extended-CFG to CFG

The set of output strings produced when the language of a CFG is passed through an FST can again be described by a CFG.

The algorithm that computes the output CFG is similar to the one that computes the intersection of a CFG and a DFA.

Algorithm:

1. Convert the CFG to a PDA.

2. Build PDA' as the product of the PDA and the finite automaton underlying the FST.

3. Convert PDA' to CFG'. When turning the transitions of PDA' into productions of CFG', use the output terminal of the FST instead of the input terminal.
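The same idea can be sketched without the detour through a PDA. The code below is an illustration under assumptions, not the slide's algorithm: it works directly on the grammar with the classic triple construction (a non-terminal `<p,A,q>` derives the output for inputs of A that take the FST from state p to state q), it hard-codes a two-state FST for str_replace("00", "0", x), and it requires single-character terminals. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class CfgThroughFst {
    // Two-state FST for str_replace("00", "0", x); state 1 = one '0' is held back.
    static int next(int q, char c) { return (q == 0 && c == '0') ? 1 : 0; }
    static String out(int q, char c) {
        if (q == 0) return c == '0' ? "" : String.valueOf(c);
        return c == '0' ? "0" : "0" + c;
    }
    static String flush(int q) { return q == 1 ? "0" : ""; }
    static String triple(int p, String a, int q) { return "<" + p + "," + a + "," + q + ">"; }

    // Builds the output grammar with start symbol S'.
    static Map<String, List<List<String>>> image(Map<String, List<List<String>>> g, String start) {
        Map<String, List<List<String>>> h = new LinkedHashMap<>();
        for (Map.Entry<String, List<List<String>>> e : g.entrySet()) {
            for (int p = 0; p < 2; p++) {
                for (int q = 0; q < 2; q++) {
                    List<List<String>> alts = new ArrayList<>();
                    for (List<String> rhs : e.getValue()) {
                        expand(g, rhs, 0, p, q, new ArrayList<>(), alts);
                    }
                    h.put(triple(p, e.getKey(), q), alts);
                }
            }
        }
        List<List<String>> top = new ArrayList<>();
        for (int q = 0; q < 2; q++) {
            top.add(List.of(triple(0, start, q), flush(q)));   // append the end-of-input output
        }
        h.put("S'", top);
        return h;
    }

    // Threads FST states through rhs[i..]: currently in state p, must finish in state q.
    static void expand(Map<String, List<List<String>>> g, List<String> rhs, int i, int p, int q,
                       List<String> acc, List<List<String>> alts) {
        if (i == rhs.size()) {
            if (p == q) alts.add(new ArrayList<>(acc));
            return;
        }
        String sym = rhs.get(i);
        if (g.containsKey(sym)) {                   // non-terminal: try every state r it may end in
            for (int r = 0; r < 2; r++) {
                acc.add(triple(p, sym, r));
                expand(g, rhs, i + 1, r, q, acc, alts);
                acc.remove(acc.size() - 1);
            }
        } else {                                    // terminal: emit the FST output, not the input
            char c = sym.charAt(0);
            acc.add(out(p, c));
            expand(g, rhs, i + 1, next(p, c), q, acc, alts);
            acc.remove(acc.size() - 1);
        }
    }

    // Bounded enumeration, used only to inspect the resulting grammar.
    static Set<String> derive(Map<String, List<List<String>>> g, String sym, int depth) {
        if (!g.containsKey(sym)) return Set.of(sym);
        Set<String> result = new TreeSet<>();
        if (depth == 0) return result;
        for (List<String> rhs : g.get(sym)) {
            Set<String> prefixes = new TreeSet<>(Set.of(""));
            for (String part : rhs) {
                Set<String> extended = new TreeSet<>();
                for (String prefix : prefixes)
                    for (String suffix : derive(g, part, depth - 1)) extended.add(prefix + suffix);
                prefixes = extended;
            }
            result.addAll(prefixes);
        }
        return result;
    }

    // X -> a b c, X1 -> 0 X 1 | 0 X1 1, X2 -> X1 | X, pushed through the FST.
    static Set<String> demo() {
        Map<String, List<List<String>>> g = new LinkedHashMap<>();
        g.put("X", List.of(List.of("a", "b", "c")));
        g.put("X1", List.of(List.of("0", "X", "1"), List.of("0", "X1", "1")));
        g.put("X2", List.of(List.of("X1"), List.of("X")));
        return derive(image(g, "X2"), "S'", 5);
    }
}
```

Triples that no input can realize simply end up with no alternatives, so they derive nothing; a production-quality version would prune them.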

String Taint Analysis

Developed by Wassermann and Su in 2007.

Add a tag to unsafe terminals and propagate the tags through the CFG to predict whether the values of a string variable come from an unsafe source.

Basic idea:

    for S -> B C ...
        if (B has tag | C has tag | ...) {
            add tag to S;
        }

During the conversion from extended CFG to CFG through FSTs, every newly added non-terminal that stems from a tagged non-terminal is tagged as well.
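The tag propagation can be sketched as a fixpoint loop over the productions. This is a minimal illustration, assuming a map-based grammar encoding and hypothetical names (`TaintPropagation`, `taggedSymbols`); the example productions are taken from the Risk grammar shown later in these slides, with &FileInput treated as the unsafe terminal.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TaintPropagation {
    // Returns every symbol that carries a tag once no production can add another one.
    static Set<String> taggedSymbols(Map<String, List<List<String>>> g, Set<String> unsafeTerminals) {
        Set<String> tagged = new TreeSet<>(unsafeTerminals);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Map.Entry<String, List<List<String>>> e : g.entrySet()) {
                if (tagged.contains(e.getKey())) {
                    continue;                       // already tagged
                }
                for (List<String> rhs : e.getValue()) {
                    // S -> B C ...: tag S if any symbol on the right-hand side is tagged
                    if (rhs.stream().anyMatch(tagged::contains)) {
                        tagged.add(e.getKey());
                        changed = true;
                        break;
                    }
                }
            }
        }
        return tagged;
    }

    static Set<String> demo() {
        Map<String, List<List<String>>> g = Map.of(
                "return1", List.of(List.of("wildcard")),
                "return2", List.of(List.of("&FileInput")),
                "return3", List.of(List.of("return1"), List.of("return2")),
                "parseCard", List.of(List.of("CARD ", "return3")));
        return taggedSymbols(g, Set.of("&FileInput"));
    }
}
```

In the demo, return1 stays untagged because it only derives the constant wildcard.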

Some Applications of String Analysis

• SQL Injection Detection
• Cross-site Scripting Detection
• Impact Analysis of Database Schema Changes

Application of String Taint Analysis in Software Internationalization

Introduction

Example

Approach

Experiments

Globalization Process

Two steps:

• Internationalization (I18n): the developer turns the one-language version into an internationalized version, in which all language-specific code elements are externalized to property files
• Localization (L10n): a property file is provided for each language (English, German, Chinese)

I18n is conducted for:

• Old software projects
• New projects with no global plan at first
• Projects using old components

Example of I18n and L10n

[Figure: original code elements, externalized code elements, and property files]
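As a rough idea of what externalization looks like in Java, a hard-coded message can be replaced by a lookup in a `java.util.ResourceBundle`. This is a sketch under assumptions: the key `card.new`, the German wording, and the class name are made up for illustration, and in-memory `ListResourceBundle` objects stand in for the property files so that the example is self-contained.

```java
import java.text.MessageFormat;
import java.util.ListResourceBundle;
import java.util.ResourceBundle;

class ExternalizedMessage {
    // Stand-in for an English property file (hypothetical key and content).
    static ResourceBundle english() {
        return new ListResourceBundle() {
            @Override
            protected Object[][] getContents() {
                return new Object[][] {{"card.new", "You got a new card: \"{0}\""}};
            }
        };
    }

    // Stand-in for a German property file (illustrative translation).
    static ResourceBundle german() {
        return new ListResourceBundle() {
            @Override
            protected Object[][] getContents() {
                return new Object[][] {{"card.new", "Neue Karte erhalten: \"{0}\""}};
            }
        };
    }

    // Internationalized code: the constant string is looked up instead of hard-coded.
    static String newCardMessage(ResourceBundle bundle, String cardName) {
        return MessageFormat.format(bundle.getString("card.new"), cardName);
    }

    public static void main(String[] args) {
        System.out.println(newCardMessage(english(), "wildcard"));
        System.out.println(newCardMessage(german(), "wildcard"));
    }
}
```

With real property files, `ResourceBundle.getBundle` would select the file by locale; only the lookup key stays in the code.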

Language Specific Code Elements

• Constant Strings

• Date/Number Formats

• Currency/Measures

• Writing Direction

• Color/Culture related elements

• …

Constant strings are the most numerous, and some of them are very hard to locate.

Motivation of our work

There are a lot of constant strings

We should not translate all of them

It is sometimes hard to decide whether a string needs to be translated

Application/Version             #LOC   #Constant Strings   #Need-to-Translate Strings (not externalized in the subsequent version)
RText 0.8.6.9 (Core Package)    17k    1252                408 (121)
Risk 1.0.7.5                    19k    1510                509 (55)
ArtOfIllusion 1.1               71k    2889                1221 (816)
Megamek 0.29.72                 110k   10464               1734 (678)


Example (1)

Risk project: Risk.java and RiskGame.java

    public class Risk {
        public void GameParser(String mem) {
            message = mem;                                                    // (5)
            StringTokenizer StringT = new StringTokenizer(message, " ");      // (4)
            String addr = StringT.nextToken();                                // (4-1)
            ...
            if (addr.equals("CARD")) {
                if (StringT.hasMoreTokens()) {
                    String name = StringT.nextToken();                        // (3)
                    String cardName;
                    ...
                    if (name.equals("wildcard"))
                        cardName = name;                                      // (2)
                    gui.sendMessage("You got a new card:\""
                            + cardName + "\"", false, false);                 // (1)
                } ...
            }
        }

Example (2)

        public void DoEndGo(String mem) {
            ...
            GameParser("CARD " + game.getDeservedCard());                     // (6)
            ...
        }
    }

    public class RiskGame {
        public String getDeservedCard() {
            Card c = cards.elementAt(r.nextInt(cards.size()));
            if (c.getCountry() == null)
                return "wildcard";                                            // (7)
            else
                return c.getCountry().getName();
            ...
        }
    }


Basic Idea

We assume that the need-to-translate strings are exactly the strings that are sent to the GUI.

[Figure: Constant Strings, String Variables/Expressions, GUI]

Challenges

• String operations (concatenate, tokenize, substring, etc.)
  [Figure: String1, String1:part1, String1:part2, GUI]
• String transmissions
  [Figure: Client GUI, network, Server, Client GUI]
• String comparisons
  [Figure: String1, String2, Comparison, GUI]
• Trivial strings: "123", " ", "Risk", …

Approach

Collect output API methods

Locate initial output strings

Adapted String Taint Analysis

String Transmission Analysis

String Comparison Analysis

Filtering

Output API Methods

Output API methods are methods that pass at least one of their parameters to the GUI.

Example:

java.awt.Graphics2D.drawString(java.lang.String, int, int) drawString 1 false 0

Initial output strings are the arguments sent to output API methods:

g.drawString(weaponMessage, 30, 20)

We locate these strings using the Eclipse API search engine.

String Analysis

Determine the possible values of a string variable in the code, represented as CFGs and DFAs.

    return1   → wildcard
    return2   → &FileInput
    return3   → return1 | return2
    parseCard → CARD return3
    message   → parseCard | ...
    StringT   → message
    addr      → nextToken(StringT, " ")
    StringT1  → reduceToken(StringT, " ")
    name      → nextToken(StringT1, " ")
    StringT2  → reduceToken(StringT1, " ")
    output    → You got a new card: name      (Start)
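The operations nextToken and reduceToken in this grammar model what `StringTokenizer` does in the Risk code. Their exact semantics are not spelled out on the slide, so the following is a sketch under the assumption that nextToken yields the first token and reduceToken yields what remains after that token and its delimiter. The class and helper names are hypothetical, and the delimiter is assumed to be a non-empty string.

```java
class TokenOps {
    // Drops leading delimiters, as StringTokenizer skips them before a token.
    static String skipDelimiters(String s, String delim) {
        while (s.startsWith(delim)) {
            s = s.substring(delim.length());
        }
        return s;
    }

    // The first token of s.
    static String nextToken(String s, String delim) {
        String t = skipDelimiters(s, delim);
        int i = t.indexOf(delim);
        return i < 0 ? t : t.substring(0, i);
    }

    // What is left of s once the first token and the delimiter after it are consumed.
    static String reduceToken(String s, String delim) {
        String t = skipDelimiters(s, delim);
        int i = t.indexOf(delim);
        return i < 0 ? "" : t.substring(i + delim.length());
    }

    public static void main(String[] args) {
        String message = "CARD wildcard";
        String addr = nextToken(message, " ");
        String stringT1 = reduceToken(message, " ");
        String name = nextToken(stringT1, " ");
        System.out.println(addr + " / " + name);
    }
}
```

In the analysis itself these operations are applied to grammars rather than to single strings, which is where FSTs come in.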

String Taint Analysis

Determine whether a part of a string comes from an unsafe source.

    return1   → wildcard
    return2   → &FileInput
    return3   → return1 | return2
    parseCard → CARD return3
    message   → parseCard | ...
    StringT   → message
    addr      → nextToken(StringT, " ")
    StringT1  → reduceToken(StringT, " ")
    name      → nextToken(StringT1, " ")
    StringT2  → reduceToken(StringT1, " ")
    output    → You got a new card: name      (Start)

Adapted String Taint Analysis

Propagate lists of originating positions as the tags of the non-terminals.

[Figure: position tags for the production parseCard → CARD return3. Three "Positions" boxes are shown: (Risk:8922, RiskGame:6767, extern), (Risk:6767, extern), and (1-5:Risk:8922)]
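Compared with plain taint tags, the adapted analysis carries sets of source positions. A minimal sketch of that propagation follows; the map-based encoding, the names `PositionTags` and `propagate`, and the assignment of the positions Risk:8922, RiskGame:6767 and extern to the three constants are assumptions based on one reading of the figure labels.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class PositionTags {
    // For each non-terminal, the union of the positions of all constants it can derive.
    static Map<String, Set<String>> propagate(Map<String, List<List<String>>> g,
                                              Map<String, Set<String>> constantPositions) {
        Map<String, Set<String>> tags = new TreeMap<>();
        for (String nonTerminal : g.keySet()) {
            tags.put(nonTerminal, new TreeSet<>());
        }
        boolean changed = true;
        while (changed) {                           // repeat until no tag set grows
            changed = false;
            for (Map.Entry<String, List<List<String>>> e : g.entrySet()) {
                Set<String> target = tags.get(e.getKey());
                for (List<String> rhs : e.getValue()) {
                    for (String sym : rhs) {
                        Set<String> source = g.containsKey(sym)
                                ? tags.get(sym)
                                : constantPositions.getOrDefault(sym, Set.of());
                        // copy first so that a self-recursive production is safe
                        if (target.addAll(new ArrayList<>(source))) {
                            changed = true;
                        }
                    }
                }
            }
        }
        return tags;
    }

    static Map<String, Set<String>> demo() {
        Map<String, List<List<String>>> g = Map.of(
                "return1", List.of(List.of("wildcard")),
                "return2", List.of(List.of("&FileInput")),
                "return3", List.of(List.of("return1"), List.of("return2")),
                "parseCard", List.of(List.of("CARD ", "return3")));
        Map<String, Set<String>> positions = Map.of(
                "CARD ", Set.of("Risk:8922"),           // assumed position of the constant
                "wildcard", Set.of("RiskGame:6767"),    // assumed position of the constant
                "&FileInput", Set.of("extern"));        // value that comes from outside the code
        return propagate(g, positions);
    }
}
```

Keeping positions instead of a boolean tag is what lets the analysis report which constant strings in the source have to be externalized.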

String Transmission Analysis

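Scenario:

[Figure: the server side sends packets through a socket to the client side. Packet1 carries "Label: Info", the string "Client A kills you", and other fields. Packet2 carries "Label: Command", the string "quit", and other fields. On the client side the figure shows a comparison on the labels, together with GUI, Control, and Logs.]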

String Comparison Analysis

Locate all string comparison operations: String.equals(), String.startsWith(), String.endsWith(), String.compareTo(), etc.

Run string taint analysis on both sides of each operation.

If one side contains a need-to-translate string, mark the constant strings on the other side as need-to-translate.

[Figure: String1, String2, Comparison, GUI]
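The marking rule can be sketched as follows. The representation is an assumption: each comparison site is reduced to the two sets of constant strings that may reach its operands, constants are identified by made-up labels such as RiskGame:wildcard, and the rule is repeated until nothing changes (the slide does not say whether the original tool iterates).

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class ComparisonAnalysis {
    // One comparison site: the constants that may flow into each operand.
    static final class Comparison {
        final Set<String> left;
        final Set<String> right;

        Comparison(Set<String> left, Set<String> right) {
            this.left = left;
            this.right = right;
        }
    }

    // Extends the need-to-translate set across comparisons until it stops growing.
    static Set<String> extend(Set<String> needToTranslate, List<Comparison> comparisons) {
        Set<String> result = new TreeSet<>(needToTranslate);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Comparison c : comparisons) {
                if (!Collections.disjoint(c.left, result)) {
                    changed |= result.addAll(c.right);  // mark the other side
                }
                if (!Collections.disjoint(c.right, result)) {
                    changed |= result.addAll(c.left);
                }
            }
        }
        return result;
    }

    static Set<String> demo() {
        // name.equals("wildcard"): name may hold the constant from RiskGame.
        Comparison nameCheck = new Comparison(Set.of("RiskGame:wildcard"), Set.of("Risk:wildcard"));
        // An unrelated comparison whose operands never reach the GUI.
        Comparison other = new Comparison(Set.of("Risk:quit"), Set.of("Risk:exit"));
        return extend(Set.of("RiskGame:wildcard"), List.of(nameCheck, other));
    }
}
```

If the constant in RiskGame is translated, the constant it is compared against must be translated consistently, or the comparison would fail at run time.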

Filtering

Remove strings that contain no letters

Remove strings that are identical to the project name
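Both filters fit in a few lines. This is a sketch with hypothetical names; it assumes that "letter" means `Character.isLetter` and that the project name is matched exactly, neither of which is stated on the slide.

```java
import java.util.ArrayList;
import java.util.List;

class TrivialStringFilter {
    // Keeps a candidate only if it has at least one letter and is not the project name.
    static boolean keep(String candidate, String projectName) {
        boolean hasLetter = candidate.chars().anyMatch(Character::isLetter);
        return hasLetter && !candidate.equals(projectName);
    }

    static List<String> filter(List<String> candidates, String projectName) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (keep(candidate, projectName)) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filter(List.of("123", " ", "Risk", "You got a new card:"), "Risk"));
    }
}
```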


Experimental subjects

RText : Simple Editor

Risk : Board Game

ArtOfIllusion : Graph Drawing Project

Megamek : Big Real Time Strategy Game

Application/Version   Starting Month   #Developers   #LOC   #Files   #Constant Strings
RText 0.8.6.9         11/2003          16            17k    55       1252
Risk 1.0.7.5          05/2004          4             19k    38       1510
AOI 1.1               11/2000          2             71k    258      2889
Megamek 0.29.72       02/2002          33            110k   338      10464

Experimental Results

Best results:

App       Need-to-Trans (not externalized in subsequent version)   Located   FN   FP
RText     408 (121)                                                445       0    37
Risk      509 (55)                                                 498       18   7
AOI       1221 (816)                                               1280      6    65
Megamek   1734 (678)                                               1765      10   41

Turning String Transmission Analysis on and off:

App            Need-to-Trans   Located   FN    FP
Megamek        1734            1765      10    41
Megamek(NT)    1734            1188      585   39
Megamek(ALL)   1734            1777      10    53

String transmission analysis reduces FN significantly and reduces some FP.
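The Located, FN, and FP columns are tied together by simple arithmetic: the correctly located strings are Located minus FP, and adding FN to them gives the number of need-to-translate strings (for RText, 445 - 37 = 408 and 408 + 0 = 408). Precision and recall are not reported on the slides; the sketch below derives them from the table under the usual definitions, with hypothetical class and method names.

```java
class ResultMetrics {
    // Correctly located strings: everything located minus the false positives.
    static int truePositives(int located, int fp) {
        return located - fp;
    }

    // Share of the located strings that really need translation.
    static double precision(int located, int fp) {
        return (double) truePositives(located, fp) / located;
    }

    // Share of the need-to-translate strings that were located.
    static double recall(int located, int fn, int fp) {
        int tp = truePositives(located, fp);
        return (double) tp / (tp + fn);
    }

    public static void main(String[] args) {
        String[] apps = {"RText", "Risk", "AOI", "Megamek"};
        int[][] rows = {{445, 0, 37}, {498, 18, 7}, {1280, 6, 65}, {1765, 10, 41}};  // Located, FN, FP
        for (int i = 0; i < apps.length; i++) {
            int located = rows[i][0], fn = rows[i][1], fp = rows[i][2];
            System.out.printf("%s precision=%.3f recall=%.3f%n",
                    apps[i], precision(located, fp), recall(located, fn, fp));
        }
    }
}
```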

Experimental Results cont.

Turning string comparison analysis on and off:

App           Located   FN   FP
RText         445       0    37
RText(NC)     445       0    37
Risk          498       18   7
Risk(NC)      474       42   7
AOI           1280      6    65
AOI(NC)       1280      6    65
Megamek       1765      10   41
Megamek(NC)   1730      36   32

String comparison analysis reduces only some FN, but very important ones.

Turning the filter on and off (the (NC) rows are the runs without the filter):

App           Located   FN   FP
RText         445       0    37
RText(NC)     581       0    173
Risk          498       18   7
Risk(NC)      532       18   41
AOI           1280      6    65
AOI(NC)       1487      6    272
Megamek       1765      10   41
Megamek(NC)   2080      10   356

The filter significantly reduces FP.

Bugs found

We found 17 need-to-translate strings that were not externalized in the latest version of Megamek and reported them as report 2085049. The developers confirmed and externalized them.

Thank you!