
Reasoning for Complex Data (RECOD) Lab. Institute of Computing,

University of Campinas (Unicamp)

Av. Albert Einstein, 1251 - Cidade Universitária CEP 13083-970 • Campinas/SP - Brasil

Digital Forensics MO447 / MC919

* Painting by Rajib Roy, Case Investigation - 2012

Prof. Dr. Anderson Rocha

Microsoft Research Faculty Fellow

Affiliate Member, Brazilian Academy of Sciences

Reasoning for Complex Data (Recod) Lab.

anderson.rocha@ic.unicamp.br

http://www.ic.unicamp.br/~rocha

File Carving & Smart File Carving

A. Rocha, 2014 – Digital Forensics (MO447/MC919) 3

Based on “The evolution of file carving – the benefits and problems of forensics recovery.” Anandabrata Pal and Nasir Memon. IEEE Signal Processing Magazine, 26(2):59–71, March 2009.


Organization

‣ Intro and Terminology

‣ Traditional File Recovery

‣ File Carving

‣ Smart Carving

‣ Conclusions

‣ References

Introduction and Terminology


What is File Carving?

‣ Denotes the extraction and recovery of files based on their structure


Why File Carving?

Massive amounts of data are subject to:

‣ File system corruption

‣ Device formatting

‣ Unknown proprietary formats

‣ Files removed or deleted (intentionally or not)


Storage (1)

‣ Hard disks and SSDs are divided into clusters

‣ A cluster is formed by sectors and is the atomic unit of storage allocation

‣ Cluster sizes vary from 512 bytes to 32 KB


Storage (2)

‣ The file systems

• manage the files

• allocate blocks (clusters)

‣ Allocation may or may not be sequential

• Non-sequential allocation => fragmentation


Example: Storing a file

File (data of a file from the vantage point of an application)

↓

A1 A2 A3 A4 A5 A6 A7 A8 A9 (data of the file on disk, split into blocks)


Fragmentation

‣ The fragmentation level depends on:

• File system

• File size

• Cluster size

‣ Once again: non-sequential allocation => fragmentation

‣ Fragments may appear in any order


Fragmentation example

A1 A2 B1 B2 B3 C2 C1 C3 A3

In this example, each cell represents a cluster. There are three files (A, B and C), each with three clusters. Clusters 1, 2 and 3 are, respectively, the beginning, middle and end of a file.


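Terminology

A1 A2 B1 B2 B3 C2 C1 C3 A3

‣ Taking A as an example:

• A1 and A2 form the base fragment (the first fragment, which starts with the header)

• A2 is the fragmentation point (the last cluster before the file is interrupted)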
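The terminology can be checked mechanically on the layout A1 A2 B1 B2 B3 C2 C1 C3 A3. The following is a minimal sketch, assuming each cluster is labelled with a file letter followed by its position inside the file; the `fragments` helper is hypothetical and only serves this illustration.

```python
def fragments(layout, name):
    """Split the clusters of file `name` into on-disk fragments."""
    # (index inside the file, position on disk), sorted by file order.
    clusters = sorted(
        (int(label[1:]), disk_pos)
        for disk_pos, label in enumerate(layout)
        if label[0] == name
    )
    runs = [[clusters[0]]]
    for prev, cur in zip(clusters, clusters[1:]):
        if cur[1] == prev[1] + 1:   # also adjacent on disk: same fragment
            runs[-1].append(cur)
        else:                       # jump on disk: a new fragment starts
            runs.append([cur])
    return [[f"{name}{idx}" for idx, _ in run] for run in runs]


layout = "A1 A2 B1 B2 B3 C2 C1 C3 A3".split()
base_fragment = fragments(layout, "A")[0]    # ['A1', 'A2']
fragmentation_point = base_fragment[-1]      # 'A2'
```

File B comes out as a single fragment, while file C splits into three fragments because C2 is stored before C1.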

Traditional data recovery


Traditional recovery

‣ Relies on the structures present in the file system, for instance, file allocation tables

• On deletion, file systems normally only mark the entry as removed

‣ Allows fast recovery of files as long as their entries are still present in those structures

‣ Avoids searching the unallocated areas of the disk


Traditional recovery

‣ FAT32

• 4 GB file size limit

‣ NTFS came to overcome this problem

• Uses B-trees to store the information related to files (not the actual content of files)

Inserting a file - FAT32

Traditional recovery

Deleting a file - FAT32
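The insert and delete behaviour can be sketched with a toy allocation table. This is a simplified model and not the real FAT32 on-disk layout: the `ToyFat` class, its markers and its directory format are assumptions made for illustration. It only captures the idea that deletion marks the entry and frees the chain while the cluster contents stay on disk.

```python
FREE, EOC = 0, -1   # toy markers: free cluster, end of chain


class ToyFat:
    def __init__(self, n_clusters):
        self.fat = [FREE] * n_clusters    # next-cluster table
        self.data = [b""] * n_clusters    # cluster contents
        self.directory = {}               # name -> [start, count, deleted]

    def insert(self, name, chunks):
        free = [i for i, nxt in enumerate(self.fat) if nxt == FREE]
        used = free[:len(chunks)]         # first free clusters, in disk order
        for pos, cluster in enumerate(used):
            self.data[cluster] = chunks[pos]
            last = pos + 1 == len(used)
            self.fat[cluster] = EOC if last else used[pos + 1]
        self.directory[name] = [used[0], len(used), False]

    def read(self, name):
        out, cluster = [], self.directory[name][0]
        while cluster != EOC:             # follow the chain in the table
            out.append(self.data[cluster])
            cluster = self.fat[cluster]
        return b"".join(out)

    def delete(self, name):
        entry = self.directory[name]
        cluster = entry[0]
        while cluster != EOC:             # the chain is freed ...
            cluster, self.fat[cluster] = self.fat[cluster], FREE
        entry[2] = True                   # ... the entry is only marked

    def undelete(self, name):
        # Traditional recovery: the chain is gone, so assume the file
        # occupied `count` consecutive clusters from `start`.
        start, count, _ = self.directory[name]
        return b"".join(self.data[start:start + count])


fs = ToyFat(8)
fs.insert("x", [b"X1"])
fs.insert("y", [b"Y1"])
fs.delete("x")
fs.insert("z", [b"Z1", b"Z2"])   # lands in clusters 0 and 2: fragmented
fs.delete("z")
recovered = fs.undelete("z")     # b"Z1Y1": wrong, the file was fragmented
```

The last line shows why table-based recovery breaks down once the chain is lost and the file was not stored contiguously.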

Some discoveries

‣ An analysis of ~350 hard disks (FAT, NTFS, UFS) found that:

• Overall fragmentation is low, but it exists

• It is high for user files (MS Office, e-mail, JPEG)

‣ Share of fragmented files by type:

• JPEG = 16%

• MS Word = 17%

• AVI = 22%

• MS Outlook PST = 58%

Some discoveries

‣ The Amiga Smart File System moves an entire file upon each edit

‣ The Unix File System (UFS) anticipates file growth by leaving some free clusters available to a file

‣ XFS and ZFS delay the physical allocation of blocks until the OS forces a flush

Some discoveries

‣ SSDs tend to increase fragmentation regardless of the file system used due to wear-leveling techniques

‣ If the controller is damaged or compromised, traditional recovery techniques no longer work and only file carving can be used

‣ In some cases, the file system itself forces fragmentation (UFS does so for large files and for files whose trailing bytes do not fit into an even number of sectors).

File Carving


File Carving: General Rules

‣ Does not rely directly on the information present in file system structures

‣ Normally identifies known files by means of hashes (e.g., MD5) and keywords

File Carving based on Structure


File Carving: Recovery based on Structure

‣ Searches for files based on “magic numbers” (i.e., sequences of bytes at known positions)

• Header and footer (e.g., JPEG), or

• Header and file size (e.g., BMP)

‣ More advanced techniques also use the file content

‣ A file is assumed to be formed by all clusters between a header and a footer
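Header and footer carving can be sketched on a raw byte buffer. The sketch below assumes JPEG data, whose start-of-image marker is FF D8 and end-of-image marker is FF D9; the `carve_jpegs` function is hypothetical and ignores cluster alignment and embedded thumbnails, both of which real carvers have to handle.

```python
JPEG_HEADER = b"\xff\xd8\xff"   # start-of-image marker plus next marker prefix
JPEG_FOOTER = b"\xff\xd9"       # end-of-image marker


def carve_jpegs(image: bytes):
    """Return (start, end) byte ranges running from a header to the next footer."""
    ranges = []
    start = image.find(JPEG_HEADER)
    while start != -1:
        end = image.find(JPEG_FOOTER, start + len(JPEG_HEADER))
        if end == -1:           # header without a footer: stop searching
            break
        end += len(JPEG_FOOTER)
        ranges.append((start, end))
        start = image.find(JPEG_HEADER, end)
    return ranges


disk = (b"\x00" * 10 + b"\xff\xd8\xff\xe0abc\xff\xd9"
        + b"\x00" * 5 + b"\xff\xd8\xff\xe1xyz\xff\xd9")
found = carve_jpegs(disk)       # [(10, 19), (24, 33)]
```

Everything between the two markers is returned as one file, which is correct only when the file is stored contiguously.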


File Carving: Recovery based on Structure

A1 A2 B1 B2 B3 C2 C1 C3 A3

File A: A1+A2+B1+B2+B3+C2+C1+C3+A3 (wrong: includes the clusters of B and C)

File B: B1+B2+B3 (correct)

File C: C1+C3 (wrong: C2 is missing)
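The three results above follow directly from taking every cluster between a header and a footer. A minimal sketch, assuming cluster 1 of each file is its header and cluster 3 its footer (the `carve_between` helper is hypothetical):

```python
def carve_between(layout, name):
    """Naive carving: every cluster from the header of `name` to its footer."""
    start = layout.index(f"{name}1")    # header cluster
    end = layout.index(f"{name}3")      # footer cluster
    return layout[start:end + 1]


layout = "A1 A2 B1 B2 B3 C2 C1 C3 A3".split()
file_a = carve_between(layout, "A")   # whole disk: B and C are swept in
file_b = carve_between(layout, "B")   # contiguous file: recovered correctly
file_c = carve_between(layout, "C")   # C2 sits before the header: lost
```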

File Carving based on Graph Theory


File Carving based on Graph Theory

‣ Formulates the recovery problem as a graph built from the blocks of the files

‣ The blocks represent the vertices

‣ Weighted edges represent the similarity between blocks

‣ How to define the similarity?


Graphs: Hamiltonian Path

‣ Technique presented by Shanmugasundaram et al.

‣ Finds the permutation of the n blocks belonging to a file A that restores the original order of A

• The weights between blocks represent the probability of them being adjacent

• The correct permutation is likely the one that maximizes the sum of weights

‣ The set of all weights forms the adjacency matrix of a complete graph of n vertices.

• The correct sequence is a maximum-weight Hamiltonian path in the graph.
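For a handful of blocks, the maximum-weight Hamiltonian path can be found by brute force. The sketch below is only an illustration under assumed weights: `best_order` is a hypothetical helper, the matrix values are made up, and the factorial cost makes this approach unusable beyond very small n.

```python
from itertools import permutations


def best_order(weights, header=0):
    """Brute-force the maximum-weight Hamiltonian path that starts at `header`.

    weights[i][j] is the assumed likelihood that block j follows block i.
    """
    n = len(weights)
    others = [b for b in range(n) if b != header]
    best_path, best_score = None, float("-inf")
    for perm in permutations(others):
        path = (header, *perm)
        score = sum(weights[a][b] for a, b in zip(path, path[1:]))
        if score > best_score:
            best_path, best_score = path, score
    return list(best_path), best_score


# Made-up weights in which the true order is 0 -> 2 -> 1 -> 3.
W = [
    [0.0, 0.1, 0.9, 0.2],
    [0.1, 0.0, 0.2, 0.8],
    [0.3, 0.7, 0.0, 0.1],
    [0.2, 0.1, 0.3, 0.0],
]
order, score = best_order(W)    # order == [0, 2, 1, 3]
```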


Graphs: Hamiltonian Path

‣ The question is: how to determine the weight between two blocks (clusters)?

• Prediction by partial matching (PPM) for texts (Shanmugasundaram and Memon)

• Border comparison for images (Pal et al.)
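A border comparison can be sketched as the sum of absolute differences across the seam between two blocks. This is a simplified stand-in and not the exact measure of Pal et al.: it assumes the blocks are already decoded into rows of grayscale pixels of equal width, and `border_cost` is a hypothetical name.

```python
def border_cost(block_a, block_b):
    """Sum of absolute differences across the seam between two blocks.

    Each block is a list of pixel rows (grayscale ints) of equal width.
    A lower cost suggests block_b is more likely to follow block_a.
    """
    last_row, first_row = block_a[-1], block_b[0]
    return sum(abs(p - q) for p, q in zip(last_row, first_row))


a = [[10, 10, 10], [12, 12, 12]]
b = [[13, 12, 11], [14, 13, 12]]    # continues `a` smoothly
c = [[200, 0, 200], [200, 0, 200]]  # unrelated content
cost_ab = border_cost(a, b)         # 2
cost_ac = border_cost(a, c)         # 388
```

Since the graph formulation maximizes weights, such a cost would have to be turned into a weight, for example by negating it.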

Figure: prediction by partial matching (PPM)

Figure: border of blocks for images

Graphs: Hamiltonian Path

‣ Problems?

• It does not consider that, in real systems, many files can be fragmented at the same time

• Statistics of multiple files could be helpful


Graph: k-Vertex Disjoint Path

‣ Refinement of the Hamiltonian Path method by Pal et al.

• In real cases, many files are fragmented simultaneously

• This technique uses the statistics of such files

‣ Each vertex represents a block


Graph: k-Vertex Disjoint Path

‣ We start with k files identified by their headers

• There exist exactly k vertex-disjoint paths, since (usually) each block belongs to a single file

‣ It is an NP-hard problem

‣ Several algorithms were proposed for this case; the unique path (UP) algorithms stand out


Unique Path (UP) Algorithms

‣ Realistic assumption: each cluster usually belongs to a single file

‣ The problem: errors propagate in cascade

• A wrongly assigned cluster corrupts two files: the one that received it and the one that lost it


File Carving: PUP

PUP: Parallel Unique Path

1. Start with a set S of k headers (s1, s2, ..., sk), one for each of the k files.

2. Find the set T of k clusters (t1, t2, ..., tk), where ti is the best match for si. Select the ti with the best match among all of T.

i. Add ti to the path of the i-th file

ii. Replace the current cluster of the i-th file in S (si = ti)

iii. Find a new set T of best matches

iv. Select the element ti with the best match among all of T

v. Repeat from (i) until all files are complete
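The steps above can be sketched as a greedy loop over all open files. This is an illustrative sketch, not the authors' implementation: the `pup` function, the `match` score (higher is better) and the `is_footer` predicate are assumptions, and the toy match function simply knows the true successor of each cluster.

```python
def pup(headers, clusters, match, is_footer):
    """Parallel Unique Path: grow all files at once, greedily.

    headers   -- one header cluster per file
    clusters  -- the remaining, unassigned clusters
    match     -- match(a, b): higher means b more likely follows a
    is_footer -- is_footer(c): True when c completes a file
    """
    paths = [[h] for h in headers]
    available = set(clusters)
    open_files = {i for i, h in enumerate(headers) if not is_footer(h)}
    while open_files and available:
        # Best candidate t_i for the current end s_i of each open file.
        best = {
            i: max(available, key=lambda c: match(paths[i][-1], c))
            for i in open_files
        }
        # Pick the single strongest (file, cluster) pair among all of T.
        i = max(open_files, key=lambda f: match(paths[f][-1], best[f]))
        t = best[i]
        paths[i].append(t)      # extend the path, so that s_i = t_i
        available.remove(t)     # unique path: each cluster is used once
        if is_footer(t):
            open_files.remove(i)
    return paths


def toy_match(a, b):
    # Toy score: 1.0 for the true successor, 0.0 otherwise.
    same_file = a[0] == b[0]
    return 1.0 if same_file and int(b[1:]) == int(a[1:]) + 1 else 0.0


paths = pup(
    headers=["A1", "B1", "C1"],
    clusters=["A2", "B2", "B3", "C2", "C3", "A3"],
    match=toy_match,
    is_footer=lambda c: c.endswith("3"),
)
```

With a real, imperfect match function a single wrong pick removes a cluster from the pool, which is how errors cascade between files.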

Example: PUP

File Carving: SPF

SPF: Shortest Path First

1. Shortest path first (SPF) is an algorithm that assumes that the best recoveries have the lowest average path costs.

2. This algorithm reconstructs each image one at a time.

3. However, after an image is reconstructed the clusters assigned to the image are not removed, only the average path cost is calculated

4. All the clusters in the reconstruction of the image are still available for the reconstruction of the remainder of the images


File Carving: SPF

SPF: Shortest Path First

1. This process is repeated until all the image average path costs are calculated.

2. Then the image with the lowest path cost is assumed to be the best recovery and the clusters assigned to its reconstruction are removed.

3. Each of the remaining images that used the clusters removed have to redo their reassemblies with the remaining clusters and their new average path cost is calculated.

4. Once this process is completed for the remaining images, the one with the lowest average path cost is again removed, and this process continues until all images are recovered.


File Carving: SPF

SPF: Shortest Path First

1. For each image still to be reconstructed:

i. From the available set of clusters, reconstruct its path and compute the average path cost

2. Find, among all paths, the one with the lowest average cost and accept the reconstruction of that image

3. Remove the clusters used in step (2) from the set available to the other images

4. Repeat from step (1) until all images are reconstructed
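The four steps can be sketched as follows. This is a simplified illustration under assumed inputs, not the original implementation: `greedy_path`, `spf`, the `cost` function (lower is better) and the toy costs are all hypothetical.

```python
def greedy_path(header, available, cost, is_footer):
    """Follow the cheapest next cluster until a footer is reached."""
    path, pool, total = [header], set(available), 0.0
    while not is_footer(path[-1]) and pool:
        nxt = min(pool, key=lambda c: cost(path[-1], c))
        total += cost(path[-1], nxt)
        path.append(nxt)
        pool.remove(nxt)
    return path, total / max(len(path) - 1, 1)


def spf(headers, clusters, cost, is_footer):
    available, pending, done = set(clusters), list(headers), {}
    while pending:
        # Step 1: rebuild every pending image from the same shared pool.
        trials = {h: greedy_path(h, available, cost, is_footer)
                  for h in pending}
        # Step 2: accept only the image with the lowest average cost.
        winner = min(pending, key=lambda h: trials[h][1])
        done[winner] = trials[winner][0]
        # Step 3: its clusters are no longer available to the others.
        available -= set(done[winner][1:])
        pending.remove(winner)
    return done


def toy_cost(a, b):
    # Made-up costs: from C1, the foreign cluster A2 looks better than C2.
    special = {("C1", "A2"): 0.25, ("C1", "C2"): 0.3}
    if (a, b) in special:
        return special[(a, b)]
    same_file = a[0] == b[0]
    return 0.1 if same_file and int(b[1:]) == int(a[1:]) + 1 else 1.0


result = spf(
    ["A1", "B1", "C1"],
    ["A2", "B2", "B3", "C2", "C3", "A3"],
    toy_cost,
    lambda c: c.endswith("3"),
)
```

In the first round image C is misled into the path C1, A2, A3, but its average cost is the highest, so A and B are accepted first and C is rebuilt correctly from what remains.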


File Carving: PUP vs. SPF

‣ SPF reconstructs up to 88% of the files, against 83% for PUP

‣ SPF is slower and scales worse than PUP

‣ The edge weights are pre-computed to speed up the search, but this step has complexity O(n^2 log n)

‣ Modern disks contain millions of clusters, so pre-computing all the weights is infeasible
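A quick calculation shows why the pre-computation does not scale. The figures below are assumptions chosen for illustration: a 1 TB disk, 4 KiB clusters and 4 bytes per stored weight; `weight_matrix_cost` is a hypothetical helper.

```python
def weight_matrix_cost(disk_bytes, cluster_bytes=4096, bytes_per_weight=4):
    """Number of clusters, ordered cluster pairs and bytes to store all weights."""
    n = disk_bytes // cluster_bytes
    pairs = n * (n - 1)                 # ordered pairs (a, b) with a != b
    return n, pairs, pairs * bytes_per_weight


n, pairs, storage = weight_matrix_cost(10**12)
# n is about 2.4e8 clusters and pairs about 6e16, so the weight table
# alone would be several orders of magnitude larger than the disk.
```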

Bifragment Gap Carving

‣ Fast Object Validation for files with headers and footers

‣ Files must be decodable (JPEG, MPEG, ZIP, etc.)

‣ A validator will show if a sequence is valid or not for a specific file type

‣ For instance, PNG stores a CRC (an error-detecting code) at the end of each chunk

‣ Plain text and BMP files cannot be validated, so they cannot be recovered this way

Bifragment Gap Carving

‣ Bifragment Gap Carving (BGC) recovers a file split into two fragments by exhaustively searching for the gap between them, validating each candidate

Bifragment Gap Carving

‣ Let bh be the header cluster, bf the fragmentation point (last cluster of the first fragment), bs the first cluster of the fragment that holds the footer and bz the footer cluster

‣ For each gap size g, starting at 1, all combinations of bf and bs that are exactly g clusters apart (s - f = g) are tried

‣ Disadvantages:

• This technique does not scale for larger gaps

• It only works for files of two fragments

• It only works for files that can be validated

• A successful validation does not guarantee that the recovered file is correct
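The search can be sketched with a toy validator. Everything in this sketch is an assumption made for illustration: the toy file format (a 4-byte length, a body and a trailing CRC-32) stands in for a decodable format such as JPEG or ZIP, `make_file`, `is_valid` and `bifragment_gap_carve` are hypothetical names, and the gap is counted here as the number of clusters skipped between the two fragments.

```python
import zlib


def make_file(body: bytes) -> bytes:
    """Toy format: 4-byte total length, body, CRC-32 of everything before it."""
    head = (len(body) + 8).to_bytes(4, "big") + body
    return head + zlib.crc32(head).to_bytes(4, "big")


def is_valid(candidate: bytes) -> bool:
    """Toy validator: the length field and the trailing CRC must both match."""
    if len(candidate) < 8:
        return False
    if int.from_bytes(candidate[:4], "big") != len(candidate):
        return False
    return zlib.crc32(candidate[:-4]).to_bytes(4, "big") == candidate[-4:]


def bifragment_gap_carve(clusters, h, z, validate=is_valid):
    """Try every gap between header cluster h and footer cluster z.

    Returns (f, s) such that the file is clusters[h..f] + clusters[s..z],
    or None when no candidate validates.
    """
    span = z - h + 1
    for gap in range(1, span - 1):        # clusters skipped in the middle
        for f in range(h, z - gap):       # last cluster of the first fragment
            s = f + gap + 1               # first cluster of the second fragment
            candidate = b"".join(clusters[h:f + 1] + clusters[s:z + 1])
            if validate(candidate):
                return f, s
    return None


data = make_file(bytes(range(16)))                     # 24 bytes
f1, f2, f3 = data[0:8], data[8:16], data[16:24]        # three 8-byte clusters
disk = [f1, b"XXXXXXXX", b"YYYYYYYY", f2, f3]          # two foreign clusters
split = bifragment_gap_carve(disk, h=0, z=4)           # (0, 3)
```

The nested loops make the cost grow quickly with the distance between header and footer, which is the scaling problem listed above.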


Smart File Carving


Smart Carving

Proposed by Pal et al.

‣ Aims at solving scalability problems

‣ Takes into consideration the typical behavior of fragmentation in disks

‣ Steps:

• Pre-processing

• Collating (classification)

• Reassembly/Reconstruction


Smart Carving: Pre-processing

‣ Decompresses or decrypts the data when the disk is compressed or encrypted

‣ Optionally removes the clusters known to be allocated (using additional information from the allocation table, for instance)


Smart Carving: Collating

‣ Classifies the clusters by file type

• Keywords (e.g., <HTML>, <IMG>)

• ASCII char frequency

• Entropy

• “File prints” (e.g., histogram of bytes in files)
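Two of these features are easy to compute per cluster. The sketch below shows a normalized byte histogram and the Shannon entropy derived from it; the function names are hypothetical and a non-empty cluster is assumed.

```python
import math
from collections import Counter


def byte_histogram(cluster: bytes):
    """Relative frequency of each of the 256 byte values in a cluster."""
    counts = Counter(cluster)
    return [counts.get(b, 0) / len(cluster) for b in range(256)]


def entropy(cluster: bytes) -> float:
    """Shannon entropy in bits per byte (0 = constant data, 8 = uniform)."""
    return -sum(p * math.log2(p) for p in byte_histogram(cluster) if p > 0)


zeros = entropy(b"\x00" * 512)                            # 0: constant data
uniform = entropy(bytes(range(256)) * 2)                  # 8: every value equally likely
markup = entropy(b"<HTML><BODY>hello</BODY></HTML>")      # in between
```

Compressed or encrypted clusters tend towards the upper end of the scale, while text and markup sit noticeably lower.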


Smart Carving: “File print”

‣ McDaniel and Heydari proposed 3 algorithms:

• Byte frequency distribution (BFD): average of the byte histograms of many example files of each type

• Byte frequency cross-correlation (BFC): correlation among byte values

• Inclusion of header and footer

‣ Accuracy: 30% (BFD), 45% (BFC) and 95% when headers and footers are also considered

‣ Does not work for classifying individual blocks.

‣ Back to the drawing board!


Smart Carving: “File print”

Proposed by Wang and Stolfo

‣ Uses a set of BFD models and standard deviations

‣ Higher accuracy: between 75% and 100%

‣ Accuracy decreases as the number of bytes analyzed grows


Smart Carving: “File print”

Karresand et al. proposed the Oscar method

‣ Uses a centroid model based on the mean and standard deviation of each byte value

• 97% accuracy

‣ Improved by adding a measure of byte ordering: the absolute difference between adjacent bytes

• 99% accuracy for JPEG files
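Both ingredients can be sketched in a few lines. This is a simplified stand-in and not the Oscar implementation: `rate_of_change` reduces the byte-ordering measure to a single mean value, and `centroid_distance` is an assumed variance-weighted squared distance with a small smoothing term `eps`.

```python
def rate_of_change(cluster: bytes) -> float:
    """Mean absolute difference between consecutive byte values."""
    if len(cluster) < 2:
        return 0.0
    diffs = (abs(a - b) for a, b in zip(cluster, cluster[1:]))
    return sum(diffs) / (len(cluster) - 1)


def centroid_distance(hist, mean, std, eps=1e-6):
    """Squared distance to a centroid, down-weighting high-variance bins."""
    return sum((h - m) ** 2 / (s + eps) for h, m, s in zip(hist, mean, std))


flat = rate_of_change(b"\x05" * 8)                 # 0.0: constant data
jumpy = rate_of_change(bytes([0, 10, 0, 10]))      # 10.0
near = centroid_distance([0.5, 0.5], [0.5, 0.5], [0.1, 0.1])
far = centroid_distance([1.0, 0.0], [0.5, 0.5], [0.1, 0.1])
```

A cluster would be assigned to the file type whose centroid gives the smallest distance.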


Smart Carving: Reconstruction

‣ Aims at finding the fragmentation point of a file

‣ Previous studies have shown that files rarely split into more than three fragments

‣ Reconstruction consists of finding the base fragment and its last cluster (the fragmentation point)


Smart Carving: SHT-PUP

‣ Modification of PUP by Pal et al.

‣ SHT: Sequential Hypotheses Testing

‣ Each file has a specific hypothesis

‣ Clusters are combined until a hypothesis is confirmed or refuted

‣ Only implemented for JPEG files


1. Start with a set S of k headers (s1, s2, ..., sk), one for each of the k files.

2. Find the set T of k clusters, where ti is the best match for si. Select the ti with the best match among all of T.

i. Add ti to the path of the i-th file

ii. Replace the current cluster of the i-th file in S (si = ti)

iii. Sequentially analyze the clusters immediately after ti until a fragmentation point tf is detected or the file is complete. This is the hypothesis test.

iv. Replace the current cluster in S with tf (si = tf)

v. Find the new set T of best matches

vi. Select the element ti with the best match among all of T

vii. Repeat from (i) until all files are complete
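The hypothesis test of step (iii) can be sketched with Wald's sequential probability ratio test. This is a generic sketch and not the exact formulation of Pal et al.: each following cluster is reduced to a boolean observation, and the function name, the two probabilities and the error rates `alpha` and `beta` are assumptions.

```python
import math


def sprt_fragmentation_point(observations, p_match_h0, p_match_h1,
                             alpha=0.01, beta=0.01):
    """Sequential probability ratio test over a stream of clusters.

    observations -- booleans, True when a cluster looks like it continues
                    the file (for instance, its border cost is low)
    p_match_h0   -- P(True | the cluster belongs to the file)
    p_match_h1   -- P(True | the cluster does not belong to the file)
    Returns ("fragmentation", i), ("belongs", i) or ("undecided", n).
    """
    upper = math.log((1 - beta) / alpha)    # accept H1 at or above this
    lower = math.log(beta / (1 - alpha))    # accept H0 at or below this
    llr = 0.0                               # accumulated log-likelihood ratio
    for i, looks_ok in enumerate(observations):
        if looks_ok:
            llr += math.log(p_match_h1 / p_match_h0)
        else:
            llr += math.log((1 - p_match_h1) / (1 - p_match_h0))
        if llr >= upper:
            return "fragmentation", i
        if llr <= lower:
            return "belongs", i
    return "undecided", len(observations)


merged = sprt_fragmentation_point([True] * 4, 0.9, 0.2)     # ("belongs", 3)
broken = sprt_fragmentation_point([False] * 3, 0.9, 0.2)    # ("fragmentation", 2)
unsure = sprt_fragmentation_point([True, False], 0.9, 0.2)  # ("undecided", 2)
```

The appeal of a sequential test is that it stops as soon as the evidence is strong enough, so most clusters are examined only briefly.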


Example: SHT-PUP

Figure with three panels, (a) to (c). In each panel the headers A1, B1 and C1 point down to the clusters A2, B2 and C3; panels (b) and (c) additionally show cluster B3.



Conclusions

We have shown the benefits and problems of the current techniques for recovering files.

There is still a lot of research to be done in this area of data recovery.

While the techniques of Pal et al. are useful for recovering text and images, new weighting techniques need to be created for video, audio, executable and other file formats, so that recovery can be extended to those formats.

References

1. A. Pal, H. T. Sencar, and N. Memon, “Detecting file fragmentation point using sequential hypothesis testing,” Digital Investigation, vol. 5, no. 1, pp. S2–S13, Sept. 2008.

2. A. Pal and N. Memon, “The evolution of file carving – the benefits and problems of forensics recovery,” IEEE Signal Processing Magazine, vol. 26, no. 2, pp. 59–71, Mar. 2009.

3. A. Pal and N. Memon, “Automated reassembly of file fragmented images using greedy algorithms,” IEEE Trans. Image Processing, vol. 15, no. 2, pp. 385–393, Feb. 2006.

4. K. Shanmugasundaram and N. Memon, “Automatic reassembly of document fragments via data compression,” presented at the 2nd Digital Forensics Research Workshop, Syracuse, NY, July 2002.

5. A. Pal, K. Shanmugasundaram, and N. Memon, “Reassembling image fragments,” in Proc. ICASSP, Hong Kong, Apr. 2003, vol. 4, pp. IV–732-5.

6. M. McDaniel and M. Heydari, “Content based file type detection algorithms,” in Proc. 36th Annu. Hawaii Int. Conf. System Sciences (HICSS’03), Track 9, IEEE Computer Society, Washington, D.C., 2003, p. 332.1.

7. K. Wang and S. Stolfo, “Anomalous payload-based network intrusion detection,” in Recent Advances in Intrusion Detection (Lecture Notes in Computer Science), vol. 3224. New York: Springer-Verlag, 2004, pp. 203–222.

8. M. Karresand and N. Shahmehri, “Oscar file type identification of binary data in disk clusters and RAM pages,” in Proc. IFIP Security and Privacy in Dynamic Environments, vol. 201, 2006, pp. 413–424.

9. M. Karresand and N. Shahmehri, “File type identification of data fragments by their binary structure,” in Proc. IEEE Information Assurance Workshop, June 2006, pp. 140–147.

NOTE: the papers of A. Pal et al. can be obtained at http://digital-assembly.com/technology/

Thank you!
