Internet‐scale Source Code Sharing and Analysis Framework
Iman Keivanloo
MOSART 2011 Montreal Software Analysis Research Talks
Agenda
•Software Analysis Steps
•Motivation: Sharing and Integration
•LinkedData as an enabling factor
•SeCold research project
•Showcase: Copyright violation detection
Software Analysis Story (1)
Issue Tracker, Source Code, Mailing List, Versioning Control, …
Some analysis
Some output
Software Analysis Story (2)
Issue Tracker, Source Code, Mailing List, Versioning Control, …
Extraction Process
Raw Data
Structured Internal Data Representation
Analysis Process
Structured Output
[Source code analysis: a roadmap, FOSE’07]
Sharing (Idea #1)
Issue Tracker, Source Code, Mailing List, Versioning Control, …
Extraction Process
Raw Data
Structured Internal Data Representation
Analysis Process
Structured Output
[Source code analysis: a roadmap, FOSE’07] [Fostering synergies: how … ICSE-SUITE’10]
Integration (Idea #2)
Issue Tracker, Source Code, Mailing List, Versioning Control, …
Internal Data, Analysis Process, Output (shown four times in the diagram)
Alignment
Inter-dataset Analysis
How to align?
The Challenge #2
Dataset A
Dataset B
History of Data Sharing (Options)
Linked Data (Chosen Option)
• By-product of Semantic Web
• A method to
  • interlink data
  • publish data on the Web
    • Human
    • Machines/Applications
Linked Data is about being …
• Online: a URL for each fact!
• Standard: uses HTTP, XML, HTML and …
• Open: usable for both humans and machines
• NOT Static: data and schema are editable
• Graph-based: graph of triples vs. XML (tree)
• Integrating: integrated/linked on the fly
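A minimal sketch of the "graph of triples" and "linked on the fly" points, assuming facts are plain (subject, predicate, object) tuples. All URLs and predicate names here are made up for illustration; they are not SeCold identifiers.

```python
# Toy illustration: facts as (subject, predicate, object) triples.
# Every URL and predicate name below is hypothetical.
FILE = "http://example.org/resource/file/Foo_java"

dataset_a = {
    (FILE, "hasLicense", "GPL-2.0"),
    (FILE, "partOf", "http://example.org/resource/project/demo"),
}
dataset_b = {
    (FILE, "fixedBy", "http://example.org/resource/commit/42"),
}


def merge(*datasets):
    """Integrate datasets by set union; shared URLs link the facts."""
    merged = set()
    for dataset in datasets:
        merged |= dataset
    return merged


def describe(graph, subject):
    """Return all (predicate, object) pairs known about one subject."""
    return sorted((p, o) for s, p, o in graph if s == subject)


graph = merge(dataset_a, dataset_b)
facts = describe(graph, FILE)
```

Because both datasets use the same URL for the file, the union needs no mapping table: the facts about the file end up attached to one node.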
LinkedData is happening
[Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/, as of Sept 2011]
Diagram legend: Publication, Life Science, Government, Media
Circle size: Triple count
Very large: >1B
Large: 1B-10M
Medium: 10M-500k
Small: 500k-10k
Very small: <10k
Our Contribution
[Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/, as of Sept 2011]
SeCold Project
A Linked Data repository/framework for the Software Analysis Community
• Internet-scale Sharing
• On-the-fly integration (no sync.)
The first Linked Data Source Code repository
Each fact has a unique, online URL
Online: access/browse/query/download
[SeCold-2011]
SeCold Project (2)
Billions of facts:
~ 18,000 Java projects.
~ 1,500,000 unique Java classes
~ 300 Million LOC
~ 1.5 Billion facts (triples)
SeCold Project (3)
fine-grained integrated facts from
Source Code, Versioning,
and Bug Repositories
+ Your data: you can upload, integrate, attach, extend, update …
Research Steps
1. Model/Vocabulary Set/Ontology
2. URL/ID generation
3. Fact extraction e.g., similar lines/files, related commits and bugs
Approaching…
1- Vocabulary Set (aka Schema, Data Model, Ontology)
Source Code Ecosystem Ontology Family
(SECON)
SOCON, VERON, METON, ISSUEON, LICENSON, CLON online at secold.org
Approaching…
2- URL/ID Generation Schema
Goal: A URL for each piece of fact (e.g. var. def. stmt)
Integration Challenge: Several ways to generate URLs (e.g. random)
New idea REPRODUCIBLE IDENTIFIERS
Reproducibility
REPRODUCIBLE IDENTIFIERS
On-the-fly data integration with no synch.
1. Context independency rule
2. Right granularity rule
3. Abstraction level dependency rule
Sample: http://aseg.cs.concordia.ca/secold/resource/variation/jvalog/Uci_58_922/http____code_google_com_p_jvalog__content_src_net_asfun_jvalog_service_Jdoer_java
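A rough sketch of the idea behind reproducible identifiers: the URL is derived only from properties of the entity itself, so two parties who see the same entity produce the same URL without talking to each other. The three rules are only named on the slide, so this does not claim to implement them. The base URL, the path layout and the escaping (every non-alphanumeric character becomes "_") are assumptions and may differ from what SeCold actually does.

```python
import re

# Hypothetical base URL, not the real SeCold service.
BASE = "http://example.org/secold/resource"


def escape(text):
    """Replace every character outside [A-Za-z0-9] with '_' (assumed escaping)."""
    return re.sub(r"[^A-Za-z0-9]", "_", text)


def make_url(kind, project, version, source_location):
    """Build an identifier from the entity's own properties only.

    No random numbers, counters or timestamps are involved, so the
    result is the same no matter who computes it or when.
    """
    parts = [kind, escape(project), escape(version), escape(source_location)]
    return BASE + "/" + "/".join(parts)


# Two independent "extractors" looking at the same file.
url_a = make_url("file", "demo", "r1", "http://host/src/Foo.java")
url_b = make_url("file", "demo", "r1", "http://host/src/Foo.java")
```

With random or sequential IDs the two extractors would need a shared registry to agree; here `url_a` and `url_b` are equal by construction, which is what allows integration with no synchronization.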
Approaching…
3- Fact Extractor Modules
RDF
HTML
HTTP GET: url, ACCEPT: text/html
HTTP GET: url, ACCEPT: application/rdf+xml
Public General Purpose Independent Services
URL Generation API
Version Control LinkedData Publisher
Source Code LinkedData Publisher
URL Generation Schema
Common Vocabulary
Guidelines/Samples/Documents
SECOLD Publication Framework
Triple Store & Query Engine
Syntax & Presentation Layer
Web Crawler
Formatter, Tokenizer, AST Builder
Semantic Layer
Call Graph, Similar Code
FQN Extractor
Inheritance Tree
...
Virtual Triple Generator
SECOLD Web Server
Dump (RDF/Ntriples)
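The two HTTP GET variants in the diagram are content negotiation: the same URL is requested with a different Accept header depending on whether a human (HTML) or an application (RDF) is asking. A small sketch with Python's standard library; the resource URL is hypothetical and the requests are only prepared, not sent.

```python
from urllib.request import Request


def build_request(url, want_rdf):
    """Prepare (but do not send) a GET request for one resource URL."""
    accept = "application/rdf+xml" if want_rdf else "text/html"
    return Request(url, headers={"Accept": accept}, method="GET")


# Hypothetical resource URL, used only to show the two request shapes.
url = "http://example.org/resource/file/Foo_java"
html_req = build_request(url, want_rdf=False)  # for a browser / human
rdf_req = build_request(url, want_rdf=True)    # for a tool / application
```

Passing either object to `urllib.request.urlopen` would perform the actual request; what the server returns for each header depends on the server.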
Approaching…
3- Fact Extractor Modules
• Source Code: IJaDataset (Source Code Dataset)
• Lexical/presentation: tokens, lines
• Syntax: AST nodes
• Semantic: Call graph links, FQNs
• etc.: Code fingerprints (file/line) for clone type-1, 2, and 3 [SeClone]
• License information per file and per project
• IssueTracker: IssueZilla
• VersionControl: SVN
• Bug/Commit/SourceCode integration
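To make "line-level fingerprint" concrete, here is a deliberately crude sketch for the type-1 case only (clones that differ in whitespace). How SeClone actually computes fingerprints is not described on these slides; the normalization (drop all whitespace) and the choice of SHA-1 are assumptions.

```python
import hashlib


def line_fingerprint(line):
    """Hash a line after removing all whitespace (assumed normalization)."""
    normalized = "".join(line.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def file_fingerprints(source):
    """Fingerprints of the non-blank lines of a source text, in order."""
    return [line_fingerprint(l) for l in source.splitlines() if l.strip()]


# Same code, different layout: the fingerprints should agree.
a = file_fingerprints("int x = 1;\n\nreturn x;")
b = file_fingerprints("int  x=1;\n    return x;")
c = file_fingerprints("int y = 2;")
```

Dropping all whitespace is too aggressive for real use (it merges tokens), and type-2 and type-3 clones need identifier normalization and tolerance for edits, which this sketch does not attempt.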
Showcase (1) (Copyright violation detection)
Source Code of 18K projects
Analysis Process: Ninka etc. [A sentence-matching …, ASE’10]
Output: License per file
Analysis Process: SeClone [SeClone - ICPC’11 & WCRE’11]
Output: Line level fingerprints, Clone (Type 1, 2 and 3)
Upload
Showcase (2) (Copyright violation detection)
Copyright violation detection could be done by running the following query on SeCold:
select ?fileA ?fileB where {
  # two files that share the same fingerprint
  ?fileA testxi ?fingerprint .
  ?fileB testxi ?fingerprint .
  # ... and carry different licenses
  ?fileA hasLicense ?la .
  ?fileB hasLicense ?lb .
  Filter (?la != ?lb) }
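The logic of the query can be sketched over a handful of in-memory triples: group files by fingerprint, then keep pairs whose licenses differ. File names, fingerprints and licenses below are invented; only the predicate names `testxi` and `hasLicense` come from the query. Unlike the query, this sketch reports each pair once and never pairs a file with itself.

```python
from collections import defaultdict
from itertools import combinations

# Toy (subject, predicate, object) facts, invented for illustration.
triples = [
    ("A.java", "testxi", "fp1"),
    ("B.java", "testxi", "fp1"),
    ("C.java", "testxi", "fp2"),
    ("A.java", "hasLicense", "GPL"),
    ("B.java", "hasLicense", "BSD"),
    ("C.java", "hasLicense", "GPL"),
]


def suspicious_pairs(facts):
    """Pairs of files that share a fingerprint but differ in license."""
    by_fingerprint = defaultdict(set)
    licenses = defaultdict(set)
    for s, p, o in facts:
        if p == "testxi":
            by_fingerprint[o].add(s)
        elif p == "hasLicense":
            licenses[s].add(o)

    pairs = set()
    for files in by_fingerprint.values():
        for a, b in combinations(sorted(files), 2):
            # Same test as Filter (?la != ?lb): any differing license pair.
            if any(la != lb for la in licenses[a] for lb in licenses[b]):
                pairs.add((a, b))
    return sorted(pairs)


result = suspicious_pairs(triples)
```

A hit is only a candidate: a shared fingerprint with different licenses still needs manual review before calling it a violation.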
Sneak peek (Browsing)
http://secold.org
http://aseg.cs.concordia.ca/secold/resource/line/ecyberpunk/Uci_40_995/http____sourceforge_net_cvs__group_id__62743_ecyberpunk_src_org_ecyberpunk_platform_PlatformFrame_java/232
ICSE SUITE'11 - SeCold.org