April 29th, 2013 Prof. John Kubiatowicz cs194-24

CS194-24
Advanced Operating Systems Structures and Implementation
Lecture 23

Application-Specific File Systems
Deep Archival Storage
Security and Protection

April 29th, 2013
Prof. John Kubiatowicz

http://inst.eecs.berkeley.edu/~cs194-24

Lec 23.2 4/29/13 Kubiatowicz CS194-24 ©UCB Fall 2013

Goals for Today

• Application-specific File Systems
  – Dynamo, Haystack
• Deep Archival Storage
  – OceanStore
• Security and Protection

Interactive is important! Ask Questions!

Note: Some slides and/or pictures in the following are adapted from Bovet, “Understanding the Linux Kernel”, 3rd edition, 2005


Recall: VFS Common File Model

• Four primary object types for VFS:
  – superblock object: represents a specific mounted filesystem
  – inode object: represents a specific file
  – dentry object: represents a directory entry
  – file object: represents open file associated with process
• There is no specific directory object (VFS treats directories as files)
• May need to fit the model by faking it
  – Example: make it look like directories are files
  – Example: make it look like it has inodes, superblocks, etc.


Recall: Data-based Caching (Data “De-Duplication”)

• Use a sliding-window hash function to break files into chunks
  – Rabin Fingerprint: randomized function of data window
    » Pick sensitivity: e.g. hash 48 bytes at a time and declare a chunk boundary when the lower 13 bits = 0; this has probability 2^-13 of happening at each position, so the expected chunk size is 8192 bytes
    » Need minimum and maximum chunk sizes
  – Now, if data stays same, chunk stays the same
• Blocks named by cryptographic hashes such as SHA-256
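The chunking idea above can be sketched in a few lines. This is a minimal illustration, not the scheme any particular system uses: it swaps in a simple polynomial rolling hash for a true Rabin fingerprint, and the names `chunk`, `chunk_name`, `MIN_CHUNK`, `MAX_CHUNK`, `BASE`, and `MOD` are assumptions made for the example. Only the 48-byte window, the 13-bit mask, and SHA-256 naming come from the slide.

```python
import hashlib

WINDOW = 48            # bytes in the sliding window (from the slide)
MASK = (1 << 13) - 1   # boundary when the low 13 bits of the hash are zero
MIN_CHUNK = 2048       # assumed minimum chunk size
MAX_CHUNK = 65536      # assumed maximum chunk size
BASE = 257             # assumed polynomial base (not a real Rabin polynomial)
MOD = (1 << 61) - 1    # assumed modulus
POW_OUT = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window


def chunk(data: bytes) -> list[bytes]:
    """Split data at content-defined boundaries."""
    chunks = []
    start = 0  # index where the current chunk begins
    h = 0      # rolling hash of the last WINDOW bytes of the current chunk
    for i, b in enumerate(data):
        if i - start >= WINDOW:
            # Slide the window: drop the oldest byte before adding the new one.
            h = (h - data[i - WINDOW] * POW_OUT) % MOD
        h = (h * BASE + b) % MOD
        size = i - start + 1
        at_boundary = size >= WINDOW and (h & MASK) == 0
        if (size >= MIN_CHUNK and at_boundary) or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def chunk_name(piece: bytes) -> str:
    """Name a chunk by its SHA-256 hash, as on the slide."""
    return hashlib.sha256(piece).hexdigest()
```

Because a boundary depends only on the last 48 bytes, an insertion near the start of a file usually shifts only the first chunk; later chunks tend to keep the same content and therefore the same name.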


Recall: Peer-to-Peer: Fully equivalent components

• Peer-to-Peer has many interacting components
  – View system as a set of equivalent nodes
    » “All nodes are created equal”
  – Any structure on system must be self-organizing
    » Not based on physical characteristics, location, or ownership


Recall: Lookup with Leaf Set (Chord)

[Figure: lookup diagram; labels: 0…, 10…, 110…, 111…, Lookup ID, Source, Response]

• Assign IDs to nodes
  – Map hash values to node with closest ID
• Leaf set is successors and predecessors
  – All that’s needed for correctness
• Routing table matches successively longer prefixes
  – Allows efficient lookups
• Data Replication:
  – On leaf set


Advantages/Disadvantages of Consistent Hashing

• Advantages:
  – Automatically adapts data partitioning as node membership changes
  – Node given random key value automatically “knows” how to participate in routing and data management
  – Random key assignment gives approximation to load balance
• Disadvantages:
  – Uneven distribution of key storage is a natural consequence of random node names; this leads to uneven query load
  – Key management can be expensive when nodes transiently fail
    » Assuming that we immediately respond to node failure, must transfer state to new node set
    » Then when node returns, must transfer state back
    » Can be a significant cost if transient failure common
• Disadvantages of “Scalable” routing algorithms
  – More than one hop to find data: O(log N) or worse
  – Number of hops unpredictable and almost always > 1
    » Node failure, randomness, etc.
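To make the "automatically adapts data partitioning" point concrete, here is a minimal sketch of a consistent-hashing ring. It assumes a successor rule (a key belongs to the first node clockwise from its position) rather than a strict closest-ID rule, and the names `Ring`, `ring_id`, and the 64-bit ring size are illustrative choices, not taken from any system discussed here.

```python
import bisect
import hashlib


def ring_id(name: str) -> int:
    """Hash a name onto a 2**64-point ring (ring size is an assumption)."""
    digest = hashlib.sha256(name.encode()).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    def __init__(self, nodes):
        # Sorted list of (position, node name) pairs.
        self._points = sorted((ring_id(n), n) for n in nodes)

    def owner(self, key: str) -> str:
        """Return the first node clockwise from the key's position."""
        ids = [p for p, _ in self._points]
        i = bisect.bisect_left(ids, ring_id(key)) % len(ids)
        return self._points[i][1]

    def add(self, node: str) -> None:
        """Insert a node; only keys between it and its predecessor move."""
        bisect.insort(self._points, (ring_id(node), node))
```

When a node joins, every key either stays with its previous owner or moves to the new node; no key moves between two old nodes. Counting owners over many keys also shows the uneven load that random node positions produce.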


Dynamo Assumptions

• Query Model
  – Simple interface exposed to application level
  – Get(), Put()
  – No Delete()
  – No transactions, no complex queries
• Atomicity, Consistency, Isolation, Durability
  – Operations either succeed or fail, no middle ground
  – System will be eventually consistent, no sacrifice of availability to assure consistency
  – Conflicts can occur while updates propagate through system
  – System can still function while entire sections of network are down
• Efficiency
  – Measure system by the 99.9th percentile
  – Important with millions of users, 0.1% can be in the 10,000s
• Non Hostile Environment
  – No need to authenticate query, no malicious queries
  – Behind web services, not in front of them


Service Level Agreements (SLA)

• Application can deliver its functionality in a bounded time:
  – Every dependency in the platform needs to deliver its functionality with even tighter bounds.
• Example: service guaranteeing that it will provide a response within 300ms for 99.9% of its requests for a peak client load of 500 requests per second
• Contrast to services which focus on mean response time

[Figure: Service-oriented architecture of Amazon’s platform]
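A small worked example shows why a percentile-based SLA differs from a mean-based one. This is a sketch under assumptions: `percentile` uses the nearest-rank definition, and `meets_sla` with its default 300 ms bound mirrors the example service above; neither name comes from the Dynamo paper.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_sla(latencies_ms, bound_ms=300, p=99.9):
    """True if the p-th percentile latency is within the bound."""
    return percentile(latencies_ms, p) <= bound_ms


# 995 fast requests and 5 very slow ones (made-up numbers).
latencies = [20] * 995 + [2000] * 5
mean_ms = sum(latencies) / len(latencies)
```

Here the mean is about 30 ms, comfortably under 300 ms, yet the 99.9th percentile is 2000 ms, so the service would miss the SLA even though its average looks healthy.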


Replication

• Each data item is replicated at N hosts
• “preference list”: The list of nodes responsible for storing a particular key
  – Successive nodes not guaranteed to be on different physical nodes
  – Thus preference list includes physically distinct nodes
• Sloppy Quorum
  – R (or W) is the minimum number of nodes that must participate in a successful read (or write) operation.
  – Setting R + W > N yields a quorum-like system.
  – Latency of a get (or put) is dictated by the slowest of the R (or W) replicas. For this reason, R and W are usually configured to be less than N, to provide better latency.
• Replicas synchronized via anti-entropy protocol
  – Use of Merkle tree for each unique range
  – Nodes exchange root of trees for shared key range
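The R + W > N condition can be checked by brute force for small N. The sketch below enumerates every possible read set and write set and tests whether they always share a replica. The function name `quorums_always_overlap` is hypothetical, and the check covers only the strict-quorum property over a fixed set of N replicas, not the "sloppy" substitution of other healthy nodes.

```python
from itertools import combinations


def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """True if every read set of size r intersects every write set of size w."""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(nodes, r)
        for write_set in combinations(nodes, w)
    )
```

With N=3, R=2, W=2 the result is true, so a read always contacts at least one replica that saw the latest write; with N=3, R=1, W=1 it is false.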


Administrivia

• Get moving on Lab 4
  – Will require you to read a bunch of code to digest the VFS layer
  – Design due this Thursday!
    » So that Palmer can have design reviews on Friday
    » Focus on behavioral aspects
      • Mounting, File operations, Etc
• Don’t forget final Lecture during RRR
  – Monday 5/6
  – Send me final topics


Data Versioning

• A put() call may return to its caller before the update has been applied at all the replicas

• A get() call may return many versions of the same object.

• Challenge: an object having distinct version sub-histories, which the system will need to reconcile in the future.

• Solution: uses vector clocks in order to capture causality between different versions of the same object
  – A vector clock is a list of (node, counter) pairs
  – Every version of every object is associated with one vector clock
  – If the counters on the first object’s clock are less than or equal to the corresponding counters for all of the nodes in the second clock, then the first is an ancestor of the second and can be forgotten.
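The ancestor rule above can be sketched directly. In this illustration a clock is a dict from node name to counter, with a missing node treated as counter 0; the names `descends` and `reconcile` and the node labels `Sx`, `Sy`, `Sz` are assumptions chosen for the example.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock a is at least as new as clock b on every node in b."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def reconcile(versions):
    """Drop versions that are strict ancestors of another; keep concurrent ones.

    versions is a list of (clock, value) pairs.
    """
    kept = []
    for i, (clock, value) in enumerate(versions):
        dominated = any(
            j != i and other != clock and descends(other, clock)
            for j, (other, _) in enumerate(versions)
        )
        if not dominated:
            kept.append((clock, value))
    return kept


d1 = {"Sx": 1}
d2 = {"Sx": 2}           # written after d1 by the same node
d3 = {"Sx": 2, "Sy": 1}  # branch handled by node Sy
d4 = {"Sx": 2, "Sz": 1}  # concurrent branch handled by node Sz
survivors = reconcile([(d1, "a"), (d2, "b"), (d3, "c"), (d4, "d")])
```

Here `d1` and `d2` are forgotten because `d3` and `d4` descend from them, while `d3` and `d4` both survive: neither descends from the other, so the client has to reconcile them.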


Vector clock example


Conflicts (multiversion data)

• Client must resolve conflicts
  – Only resolve conflicts on reads
  – Different resolution options:
    » Use vector clocks to decide based on history
    » Use timestamps to pick latest version
  – Examples given in paper:
    » For shopping cart, simply merge different versions
    » For customer’s session information, use latest version
  – Stale versions returned on reads are updated (“read repair”)
• Vary N, R, W to match requirements of applications
  – High performance reads: R=1, W=N
  – Fast writes with possible inconsistency: W=1
  – Common configuration: N=3, R=2, W=2
• When do branches occur?
  – Branches uncommon: 0.06% of requests saw > 1 version over 24 hours
  – Divergence occurs because of high write rate (more coordinators), not necessarily because of failure
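The interaction between N, R, W and read repair can be sketched with a toy replica set. Everything here is a simplification for illustration: the class `Replicas` is hypothetical, versions are plain integers (the "use timestamps to pick latest version" option) rather than vector clocks, and writes deliberately go to the first W replicas while reads contact the last R, which is the worst case for overlap.

```python
class Replicas:
    def __init__(self, n):
        self.stores = [{} for _ in range(n)]  # one dict per replica

    def put(self, key, value, version, w):
        """Write to the first w replicas only; the others lag behind."""
        for store in self.stores[:w]:
            store[key] = (version, value)

    def get(self, key, r):
        """Read the last r replicas, return the newest value, repair stale copies."""
        contacted = self.stores[-r:]
        found = [s[key] for s in contacted if key in s]
        if not found:
            return None
        newest = max(found, key=lambda pair: pair[0])
        for s in contacted:
            if s.get(key) != newest:
                s[key] = newest  # read repair
        return newest[1]
```

With N=3, W=2, R=2 the read set overlaps the write set, so the read returns the latest value and brings the lagging replica up to date. With W=1, R=1 the same read can return a stale value.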


Haystack File System

• Does it ever make sense to adapt a file system to a particular usage pattern?
  – Perhaps
• Good example: Facebook’s “Haystack” filesystem
  – Specific application (Photo Sharing)
    » Large files!, Many files!
    » 260 Billion images, 20 PetaBytes (1 PetaByte = 10^15 bytes!)
    » One billion new photos a week (60 TeraBytes)
  – Presence of Content Delivery Network (CDN)
    » Distributed caching and distribution network
    » Facebook web servers return special URLs that encode requests to CDN
    » Pay for service by bandwidth
  – Specific usage patterns:
    » New photos accessed a lot (cache well)
    » Old photos accessed little, but likely to be requested at any time: NEEDLES

[Figure: number of photos requested in day]


Old Solution: NFS

• Issues with this design?
• Long tail: caching does not work for most photos
  – Every access to back end storage must be fast without benefit of caching!
• Linear Directory scheme works badly for many photos/directory
  – Many disk operations to find even a single photo
  – Directory’s block map too big to cache in memory
  – “Fixed” by reducing directory size, however still not great
• Meta-Data (FFS) requires ≥ 3 disk accesses per lookup
  – Caching all iNodes in memory might help, but iNodes are big
• Fundamentally, Photo Storage different from other storage:
  – Normal file systems fine for developers, databases, etc


New Solution: Haystack

• Finding a needle (old photo) in Haystack
• Differentiate between old and new photos
  – How? By looking at “Writeable” vs “Read-only” volumes
  – New Photos go to Writeable volumes
• Directory: Help locate photos
  – Name (URL) of photo has embedded volume and photo ID
• Let CDN or Haystack Cache serve new photos
  – rather than forwarding them to Writeable volumes
• Haystack Store: Multiple “Physical Volumes”
  – Physical volume is large file (100 GB) which stores millions of photos
  – Data Accessed by Volume ID with offset into file
  – Since Physical Volumes are large files, use XFS which is optimized for large files


Haystack Details

• Each physical volume is stored as single file in XFS
  – Superblock: General information about the volume
  – Each photo (a “needle”) stored by appending to file
• Needles stored sequentially in file
  – Naming: [Volume ID, Key, Alternate Key, Cookie]
  – Cookie: random value to avoid guessing attacks
  – Key: Unique 64-bit photo ID
  – Alternate Key: four different sizes, ‘n’, ‘a’, ‘s’, ‘t’
• Deleted Needle simply marked as “deleted”
  – Overwritten Needle: new version appended at end


Haystack Details (Con’t)

• Replication for reliability and performance:
  – Multiple physical volumes combined into logical volume
    » Factor of 3
  – Four different sizes
    » Thumbnails, Small, Medium, Large
• Lookup
  – User requests Webpage
  – Webserver returns URL of form:
    » http://<CDN>/<Cache>/<Machine id>/<Logical volume,photo>
    » Possibly reference cache only if old image
  – CDN will strip off CDN reference if missing, forward to cache
  – Cache will strip off cache reference and forward to Store
• In-memory index on Store for each volume maps [Key, Alternate Key] to Offset
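The append-only volume with an in-memory index can be sketched as follows. This is a toy model under stated assumptions, not Haystack's on-disk format: the class `Volume`, the header layout in `HEADER`, and the use of `io.BytesIO` in place of a large XFS file are all choices made for the example. Deletion here just drops the index entry, whereas the slides describe marking the needle as deleted.

```python
import io
import struct

# Assumed needle header: 64-bit key, 1-byte alternate key, 64-bit cookie, 32-bit length.
HEADER = struct.Struct("<QBQI")


class Volume:
    def __init__(self):
        self._file = io.BytesIO()  # stands in for the large volume file
        self._index = {}           # (key, alt_key) -> offset of the latest needle

    def put(self, key, alt_key, cookie, data):
        """Append a needle; an overwrite simply appends a newer version."""
        offset = self._file.seek(0, io.SEEK_END)
        self._file.write(HEADER.pack(key, ord(alt_key), cookie, len(data)))
        self._file.write(data)
        self._index[(key, alt_key)] = offset

    def get(self, key, alt_key, cookie):
        """One index lookup plus one seek and read; a wrong cookie returns None."""
        offset = self._index.get((key, alt_key))
        if offset is None:
            return None
        self._file.seek(offset)
        _, _, stored_cookie, length = HEADER.unpack(self._file.read(HEADER.size))
        if stored_cookie != cookie:
            return None
        return self._file.read(length)

    def delete(self, key, alt_key):
        """Simplification: forget the needle instead of flagging it on disk."""
        self._index.pop((key, alt_key), None)
```

The point of the design is that a lookup needs no directory or inode traversal: the in-memory map gives the offset, so a read is a single seek into the volume file.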


What about Protection?

• Start by asking some high-level questions…
  – What do we expect of our systems?
    » Won’t leak our information
    » Won’t lose our information
    » Will always work when we need them
    » Won’t launch attacks against other people
  – How can we prevent systems from misbehaving?
    » Never connect them to the network?
    » Always authenticate users?
    » Never use them?
• Protection: use of one or more mechanisms for controlling the access of programs, processes, or users to resources
  – Page Table Mechanism
  – File Access Mechanism
  – On-disk encryption
• Can use lots of Protection but still have an insecure system!
  – Bugs, back doors, viruses, poorly defined policy, inside man
  – Denial of service, …


Protection vs Security

• Security is a very complex topic: see, e.g., CS161
  – Security is about Policy, i.e. what human-centered properties do we want from our system
    » Usually with reference to an attack model
  – Security is achieved through a series of Mechanisms, i.e. individual elements of the system combined together to achieve a security policy
• Security: use of protection mechanisms to prevent misuse of resources
  – Misuse defined with respect to policy
    » E.g.: prevent exposure of certain sensitive information
    » E.g.: prevent unauthorized modification/deletion of data
  – Requires consideration of the external environment within which the system operates
    » Even the most well-constructed system cannot protect information if user accidentally reveals password


Preventing Misuse

• Types of Misuse:
  – Accidental:
    » If I delete shell, can’t log in to fix it!
    » Could make it more difficult by asking: “do you really want to delete the shell?”
  – Intentional:
    » Some high school brat who can’t get a date, so instead he transfers $3 billion from B to A.
    » Doesn’t help to ask if they want to do it (of course!)
• Three Pieces to Security
  – Authentication: who the user actually is
  – Authorization: who is allowed to do what
  – Enforcement: make sure people do only what they are supposed to do
• Loopholes in any carefully constructed system:
  – Log in as superuser and you’ve circumvented authentication
  – Log in as self and can do anything with your resources; for instance: run program that erases all of your files
  – Can you trust software to correctly enforce Authentication and Authorization?


Authentication: Identifying Users

• How to identify users to the system?
  – Passwords
    » Shared secret between two parties
    » Since only user knows password, someone who types the correct password must be the user
    » Very common technique
  – Smart Cards
    » Electronics embedded in card capable of providing long passwords or satisfying challenge-response queries
    » May have display to allow reading of password
    » Or can be plugged in directly; several credit cards now in this category
  – Biometrics
    » Use of one or more intrinsic physical or behavioral traits to identify someone
    » Examples: fingerprint reader, palm reader, retinal scan
    » Becoming quite a bit more common
• What else?
  – Consider the “Swarm” and “Un-pad” views


Timing Attacks: Tenex Password Checking

• Tenex – early 70’s, BBN
  – Most popular system at universities before UNIX
  – Thought to be very secure, gave “red team” all the source code and documentation (want code to be publicly available, as in UNIX)
  – In 48 hours, they figured out how to get every password in the system
• Here’s the code for the password check:

    /* compare the 8-character password one character at a time */
    for (i = 0; i < 8; i++)
        if (userPasswd[i] != realPasswd[i])
            goto error;  /* stops at the first mismatch */

• How many combinations of passwords?
  – 256^8?
  – Wrong!


Defeating Password Checking

• Tenex used VM, and it interacts badly with the above code
  – Key idea: force page faults at inopportune times to break passwords quickly
• Arrange 1st char in string to be last char in pg, rest on next pg
  – Then arrange for pg with 1st char to be in memory, and rest to be on disk (e.g., ref lots of other pgs, then ref 1st page)

                 a | aaaaaa
    page in memory | page on disk

• Time password check to determine if first character is correct!
  – If fast, 1st char is wrong
  – If slow, 1st char is right, pg fault, one of the others wrong
  – So try all first characters, until one is slow
  – Repeat with first two characters in memory, rest on disk
• Only 256 * 8 attempts to crack passwords
  – Fix is easy, don’t stop until you look at all the characters
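The arithmetic and the attack can be sketched in Python. This is an illustration under assumptions, not a reconstruction of the Tenex code: the page-fault timing signal is replaced by a returned count of bytes examined, and the names `leaky_check`, `crack`, and `safe_check` are hypothetical. The fix uses `hmac.compare_digest` from the Python standard library, which is designed to avoid an early exit at the first differing byte.

```python
import hmac

ALPHABET = 256  # possible values per character
LENGTH = 8      # password length on the slide

brute_force_guesses = ALPHABET ** LENGTH   # 256^8 = 2^64
timing_attack_guesses = ALPHABET * LENGTH  # 256 * 8 = 2048


def leaky_check(guess: bytes, real: bytes) -> tuple[bool, int]:
    """Early-exit comparison; the second value stands in for the timing leak."""
    for i in range(len(real)):
        if i >= len(guess) or guess[i] != real[i]:
            return False, i + 1  # bytes examined before giving up
    return len(guess) == len(real), len(real)


def crack(real: bytes) -> bytes:
    """Recover the password one position at a time using only the leak."""
    known = b""
    for pos in range(len(real)):
        for c in range(ALPHABET):
            guess = known + bytes([c]) + bytes(len(real) - pos - 1)
            ok, examined = leaky_check(guess, real)
            if ok or examined > pos + 1:  # check went past pos, so c is right
                known += bytes([c])
                break
    return known


def safe_check(guess: bytes, real: bytes) -> bool:
    """Comparison that does not stop at the first mismatch."""
    return hmac.compare_digest(guess, real)
```

The attacker needs at most 256 guesses per position, which is where the 256 * 8 figure comes from, compared with 256^8 for blind guessing.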


Recall: Authorization: Who Can Do What?

• How do we decide who is authorized to do actions in the system?
• Access Control Matrix: contains all permissions in the system
  – Resources across top
    » Files, Devices, etc…
  – Domains in rows
    » A domain might be a user or a group of permissions
    » E.g. above: User D3 can read F2 or execute F3
  – In practice, table would be huge and sparse!
• Two approaches to implementation
  – Access Control Lists: store permissions with each object
    » Still might be lots of users!
    » UNIX limits each file to: r,w,x for owner, group, world
    » More recent systems allow definition of groups of users and permissions for each group
  – Capability List: each process tracks which objects it has permission to touch
    » Popular in the past, idea out of favor today
    » Consider page table: Each process has list of pages it has access to, not each page has list of processes …
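The two implementation approaches are two ways of slicing the same sparse matrix, which a short sketch makes visible. The D3 entries follow the example on the slide; the D1 and D4 entries and all function names are made up for illustration.

```python
from collections import defaultdict

# Sparse access matrix: (domain, object) -> set of rights.
matrix = {
    ("D3", "F2"): {"read"},
    ("D3", "F3"): {"execute"},
    ("D1", "F1"): {"read"},            # hypothetical entry
    ("D4", "F1"): {"read", "write"},   # hypothetical entry
}


def allowed(domain, obj, right):
    """Look up one cell of the matrix; empty cells grant nothing."""
    return right in matrix.get((domain, obj), set())


def access_control_lists(m):
    """Per-object view: each object lists the domains that may touch it."""
    acls = defaultdict(dict)
    for (domain, obj), rights in m.items():
        acls[obj][domain] = rights
    return dict(acls)


def capability_lists(m):
    """Per-domain view: each domain lists the objects it may touch."""
    caps = defaultdict(dict)
    for (domain, obj), rights in m.items():
        caps[domain][obj] = rights
    return dict(caps)
```

An access control list answers "who may touch this object?" with one lookup, while a capability list answers "what may this domain touch?"; revoking a right is correspondingly easy in one view and harder in the other.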


Authorization Continued

• Principle of least privilege: programs, users, and systems should get only enough privileges to perform their tasks
  – Very hard to do in practice
    » How do you figure out what the minimum set of privileges is needed to run your programs?
  – People often run at higher privilege than necessary
    » Such as the “administrator” privilege under Windows
• One solution: Signed Software
  – Only use software from sources that you trust, thereby dealing with the problem by means of authentication
  – Fine for big, established firms such as Microsoft, since they can make their signing keys well known and people trust them
    » Actually, not always fine: recently, one of Microsoft’s signing keys was compromised, leading to malicious software that looked valid
  – What about new startups?
    » Who “validates” them?
    » How easy is it to fool them?


Mandatory Access Control (MAC)

• Mandatory Access Control (MAC)
  – “A type of access control by which the operating system constrains the ability of a subject or initiator to access or generally perform some sort of operation on an object or target.” From Wikipedia
  – Subject: a process or thread
  – Object: files, directories, TCP/UDP ports, etc
  – Security policy is centrally controlled by a security policy administrator: users not allowed to operate outside the policy
  – Examples: SELinux, HiStar, etc.
• Contrast: Discretionary Access Control (DAC)
  – Access restricted based on the identity of subjects and/or groups to which they belong
  – Controls are discretionary: a subject with a certain access permission is capable of passing that permission on to any other subject
  – Standard UNIX model


Data Centric Access Control (DCAC?)

• Problem with many current models:
  – If you break into OS, data is compromised
  – In reality, it is the data that matters: hardware is somewhat irrelevant (and ubiquitous)
• Data-Centric Access Control (DCAC)
  – I just made this term up, but you get the idea
  – Protect data at all costs, assume that software might be compromised
  – Requires encryption and sandboxing techniques
  – If hardware (or virtual machine) has the right cryptographic keys, then data is released
• All of the previous authorization and enforcement mechanisms reduce to key distribution and protection
  – Never let decrypted data or keys outside sandbox
  – Examples: Use of TPM, virtual machine mechanisms


Enforcement

• Enforcer checks passwords, ACLs, etc
  – Makes sure that only authorized actions take place
  – Bugs in enforcer give malicious users things to exploit
• Normally, in UNIX, superuser can do anything
  – Because of coarse-grained access control, lots of stuff has to run as superuser in order to work
  – If there is a bug in any one of these programs, you lose!
• Paradox
  – Bullet-proof enforcer
    » Only known way is to make enforcer as small as possible
    » Easier to make correct, but simple-minded protection model
  – Fancy protection
    » Tries to adhere to principle of least privilege
    » Really hard to get right
• Same argument for Java or C++: What do you make private vs public?
  – Hard to make sure that code is usable but only necessary modules are public
  – Pick something in middle? Get bugs and weak protection!


Summary

• Peer-to-Peer:
  – Use of 100s or 1000s of nodes to keep higher performance or greater availability
  – May need to relax consistency for better performance
• Application-Specific File Systems (e.g. Haystack):
  – Optimize system for particular usage pattern
• Security: use of protection mechanisms to prevent misuse of resources
  – Represents Human-Centered Policy as opposed to mechanism
• Three Pieces to Security
  – Authentication: who the user actually is
  – Authorization: who is allowed to do what
  – Enforcement: make sure people do only what they are supposed to do
• Principle of least privilege: programs, users, and systems should get only enough privileges to perform their tasks
