Internet‐scale Source Code Sharing and Analysis Framework Iman Keivanloo MOSART 2011 Montreal Software Analysis Research Talks


Internet‐scale Source Code Sharing and Analysis Framework

Iman Keivanloo

MOSART 2011 Montreal Software Analysis Research Talks


Agenda

•Software Analysis Steps

•Motivation: Sharing and Integration

•LinkedData as an enabling factor

•SeCold research project

•Showcase: Copyright violation detection


Software Analysis Story (1)


Issue Tracker Source Code Mailing List Versioning Control …

Some output

Some analysis


Software Analysis Story (2)


Issue Tracker Source Code Mailing List Versioning Control …

Extraction Process

Raw Data

Structured Internal Data

Representation

Analysis Process

Structured Output

[Source code analysis: a roadmap, FOSE’07]


Sharing (Idea #1)


Issue Tracker Source Code Mailing List Versioning Control …

Extraction Process

Raw Data

Structured Internal Data

Representation

Analysis Process

Structured Output

[Source code analysis: a roadmap, FOSE’07] [Fostering synergies: how … ICSE-SUITE’10]


Sharing for whom?


The Challenge #1


Integration (Idea #2)

[Figure: four pipelines, each shown as Internal Data, Analysis Process, Output, drawn over the sources Issue Tracker, Source Code, Mailing List, Versioning Control, …; the connecting labels read "Alignment" and "Inter-dataset Analysis".]

How to align?


The Challenge #2


Dataset A

Dataset B


History of Data Sharing (Options)


Linked Data (Chosen Option)

• By-product of Semantic Web

• A method to interlink data and publish data on the Web

• For both humans and machines/applications


Linked Data is about being …

Online: a URL for each fact!

Standard: uses HTTP, XML, HTML and …

Open: usable by both humans and machines

NOT static: data and schema are editable

Graph-based: graph of triples vs. XML (tree)

Integrating: integrated/linked on the fly
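
A minimal sketch of the "graph of triples" and "linked on the fly" points, in Python. All URLs and predicate names are hypothetical placeholders, not SeCold's vocabulary; the only idea illustrated is that two independently published datasets join automatically when they use the same URL for the same entity.

```python
# Facts are (subject, predicate, object) triples.
# Every URL and predicate name here is an illustrative assumption.
FILE = "http://example.org/resource/file/Foo.java"

dataset_a = {
    (FILE, "hasLicense", "http://example.org/license/GPL-2.0"),
}
dataset_b = {
    (FILE, "fixedByCommit", "http://example.org/commit/42"),
}


def merge(*datasets):
    """Union of triple sets; shared URLs act as the join points."""
    merged = set()
    for triples in datasets:
        merged |= triples
    return merged


def describe(graph, subject):
    """Return all (predicate, object) pairs known about one subject."""
    return sorted((p, o) for s, p, o in graph if s == subject)


graph = merge(dataset_a, dataset_b)
facts = describe(graph, FILE)
for predicate, obj in facts:
    print(predicate, obj)
```

No schema mapping step is needed in this toy case because both datasets already agree on the subject URL.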


LinkedData is happening

[Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/, as of Sept 2011]

Diagram labels: Publication, Life Science, Government, Media

Circle size by triple count:

• Very large: >1B

• Large: 1B-10M

• Medium: 10M-500k

• Small: 500k-10k

• Very small: <10k


Our Contribution

[Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/, as of Sept 2011]


SeCold Project

A Linked Data repository/framework for the Software Analysis Community

• Internet-scale Sharing

• On-the-fly integration (no sync.)

The first Linked Data Source Code repository

Each fact has a unique, online URL

Online: access/browse/query/download


[SeCold-2011]


SeCold Project (2)

A Linked Data repository/framework for the Software Analysis Community

Billions of facts:

~ 18,000 Java projects.

~ 1,500,000 unique Java classes

~ 300 Million LOC

~ 1.5 Billion facts (triples)


SeCold Project (3)

Fine-grained, integrated facts from Source Code, Versioning, and Bug Repositories

+ Your data: you can upload, integrate, attach, extend, update …


Research Steps

1. Model/Vocabulary Set/Ontology

2. URL/ID generation

3. Fact extraction (e.g., similar lines/files, related commits and bugs)


Source code modeling (No proper model!)


Approaching…

1- Vocabulary Set (aka Schema, Data Model, Ontology)

Source Code Ecosystem Ontology Family (SECON)

SOCON, VERON, METON, ISSUEON, LICENSON, CLON: online at secold.org


Approaching…

2- URL/ID Generation Schema

Goal: A URL for each piece of fact (e.g. var. def. stmt)

Integration Challenge: Several ways to generate URLs (e.g. random)

New idea: REPRODUCIBLE IDENTIFIERS


Reproducibility

REPRODUCIBLE IDENTIFIERS

On-the-fly data integration with no sync.

1. Context independency rule

2. Right granularity rule

3. Abstraction level dependency rule

Sample: http://aseg.cs.concordia.ca/secold/resource/variation/jvalog/Uci_58_922/http____code_google_com_p_jvalog__content_src_net_asfun_jvalog_service_Jdoer_java
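
A minimal Python sketch of what a reproducible identifier could look like. The slides do not specify the encoding behind the sample URL, so the namespace `BASE`, the function name `reproducible_id`, and the "replace every non-alphanumeric character with an underscore" rule are all assumptions made for illustration. The point shown is only the context independency idea: the identifier depends on nothing but the entity's own location, so two parties generate the same URL without coordinating.

```python
import re

# Hypothetical namespace; not SeCold's actual base URL.
BASE = "http://example.org/resource/file/"


def reproducible_id(source_url: str) -> str:
    """Derive an identifier from the entity's own location only.

    No counters, timestamps, or random values are used, so the result
    is the same no matter who runs this or when (assumed scheme).
    """
    # Replace every character outside [A-Za-z0-9] with "_".
    return BASE + re.sub(r"[^A-Za-z0-9]", "_", source_url)


url = "http://example.org/p/demo/src/net/Demo.java"
id_from_team_a = reproducible_id(url)
id_from_team_b = reproducible_id(url)  # computed independently
print(id_from_team_a)
```

A randomly generated identifier would fail this check, which is the integration challenge the previous slide names.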


Steps 1 and 2 Overview


Approaching…

3- Fact Extractor Modules

[Architecture figure; recoverable labels, in extraction order:]

• Output formats: RDF, HTML

• HTTP GET: url, ACCEPT: text/html

• HTTP GET: url, ACCEPT: application/rdf+xml

• Public General Purpose Independent Services

• URL Generation API, Version Control LinkedData Publisher, Source Code LinkedData Publisher

• URL Generation Schema, Common Vocabulary, Guidelines/Samples/Documents

• SECOLD Publication Framework

• Triple Store & Query Engine

• Syntax & Presentation Layer

• Web Crawler, Formatter, Tokenizer, AST Builder

• Semantic Layer

• Call Graph, Similar Code, FQN Extractor, Inheritance Tree, ...

• Virtual Triple Generator

• SECOLD Web Server

• Dump (RDF/Ntriples)
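
The two `HTTP GET` labels in the figure describe content negotiation: the same URL is requested with a different `Accept` header depending on whether a browser or an application is asking. A minimal Python sketch using only the standard library follows; the resource URL and the helper name `build_request` are hypothetical, and the requests are only constructed here, not sent.

```python
from urllib.request import Request


def build_request(url: str, want_rdf: bool) -> Request:
    """Build a GET request for one resource URL.

    The Accept header selects the representation:
    text/html for people, application/rdf+xml for applications.
    """
    accept = "application/rdf+xml" if want_rdf else "text/html"
    return Request(url, headers={"Accept": accept}, method="GET")


# Hypothetical resource URL, used for illustration only.
resource = "http://example.org/resource/file/Demo_java"

html_req = build_request(resource, want_rdf=False)
rdf_req = build_request(resource, want_rdf=True)

# Passing either object to urllib.request.urlopen would send it.
print(html_req.get_header("Accept"))
print(rdf_req.get_header("Accept"))
```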


Approaching…

3- Fact Extractor Modules


•Source Code: IJaDataset (Source Code Dataset)

•Lexical/presentation: tokens, lines

•Syntax: AST nodes

•Semantic: Call graph links, FQNs

•etc.: Code fingerprints (file/line) for clone type-1, 2, and 3 [SeClone]

•License information per file and per project

•IssueTracker: IssueZilla

•VersionControl: SVN

•Bug/Commit/SourceCode integration
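
To make the "code fingerprints (file/line)" item concrete, here is a rough Python sketch of line-level fingerprinting. It is not SeClone's algorithm, which the slides do not describe; the function names and the normalization rule (drop all whitespace, then hash) are assumptions. It only covers lines that differ in whitespace, which is one part of what type-1 clone detection tolerates.

```python
import hashlib


def line_fingerprint(line: str) -> str:
    """Hash a line after removing all whitespace (assumed normalization)."""
    normalized = "".join(line.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def file_fingerprints(source: str) -> set:
    """Fingerprints of all non-blank lines in a source file."""
    return {line_fingerprint(l) for l in source.splitlines() if l.strip()}


# Two snippets that differ only in layout.
snippet_a = "int x = 1;\nreturn x;\n"
snippet_b = "int  x=1;\n\n    return x;\n"

shared = file_fingerprints(snippet_a) & file_fingerprints(snippet_b)
print(len(shared), "shared line fingerprints")
```

Removing all whitespace is deliberately crude (it would also merge `int x` with `intx`); a real extractor would normalize at the token level.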


Showcase (1) (Copyright violation detection)


[Figure: the source code of 18K projects feeds two analyses, each shown as Internal Data, Analysis Process, Output:]

• Ninka etc. [A sentence-matching …, ASE’10]: license per file

• SeClone [SeClone, ICPC’11 & WCRE’11]: line-level fingerprints, clones (Type 1, 2 and 3)

[Both outputs lead to "Upload".]

Showcase (2) (Copyright violation detection)


Copyright violation detection could be done by running the following query on SeCold:

select ?fileA ?fileB where {
  ?fileA testxi ?fingerprint .
  ?fileB testxi ?fingerprint .
  ?fileA hasLicense ?la .
  ?fileB hasLicense ?lb .
  Filter (?la != ?lb) }
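
The logic of the query can be sketched in plain Python over an in-memory list of triples. The file names, license names, and the predicate `hasFingerprint` (standing in for the slide's `testxi`) are hypothetical. Unlike the SPARQL query, which would return each match in both orders, this sketch reports each unordered pair once.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical facts; "hasFingerprint" stands in for the slide's predicate.
triples = [
    ("fileA.java", "hasFingerprint", "fp1"),
    ("fileB.java", "hasFingerprint", "fp1"),
    ("fileC.java", "hasFingerprint", "fp2"),
    ("fileA.java", "hasLicense", "GPL"),
    ("fileB.java", "hasLicense", "Apache"),
    ("fileC.java", "hasLicense", "GPL"),
]


def license_conflicts(facts):
    """Pairs of files sharing a fingerprint but carrying different licenses."""
    files_by_fp = defaultdict(set)
    licenses = defaultdict(set)
    for s, p, o in facts:
        if p == "hasFingerprint":
            files_by_fp[o].add(s)
        elif p == "hasLicense":
            licenses[s].add(o)

    conflicts = set()
    for files in files_by_fp.values():
        for a, b in combinations(sorted(files), 2):
            # Mirrors Filter (?la != ?lb): some license pair differs.
            if any(la != lb for la in licenses[a] for lb in licenses[b]):
                conflicts.add((a, b))
    return sorted(conflicts)


conflicts = license_conflicts(triples)
print(conflicts)
```

A reported pair is a candidate for manual review, not proof of a violation: shared code under different licenses can be legitimate.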


Questions?

http://secold.org