Hadoop Requirements v16

7/31/2019 Hadoop Requirements v16


Hadoop File System as part of a CMS Storage Element

1.Introduction

In the last year, several new storage technologies have matured to the point where they have become viable candidates for use at CMS Tier2 sites. One particular technology is HDFS, the distributed filesystem used by the Hadoop data processing system.

2.The HDFS SE

HDFS is a file system; as such, it must be complemented by other components in order to build a grid SE. We consider a minimal set of additional components to be:

a) FUSE / FUSE-DFS: FUSE, a standard Linux kernel module, allows filesystems to be written in userspace. The FUSE-DFS library makes HDFS into a FUSE filesystem. This allows a POSIX-like interface to HDFS, necessary for user applications.

b) Globus GridFTP: Provides a WAN transfer protocol. Using Globus GridFTP with HDFS requires a plugin developed by Nebraska.

c) BeStMan: Provides an SRMv2 interface. UCSD and Nebraska have implemented plugins for BeStMan to allow smarter selection of GridFTP servers.

There are optional components that may be layered on top of HDFS, including XrootD, Apache HTTP, and FDT.

3.Requirements

This document aims to show that the combination of HDFS, FUSE, Globus GridFTP, BeStMan, and some plugins developed by CMS Tier-2 teams at Nebraska, Caltech, and UCSD meets the SE requirements set forth by USCMS Tier 2 management. The requirements are typeset in italics, and the responses are given below them in normal typesetting. Throughout this document, we use Hadoop and HDFS interchangeably; generally, Hadoop refers to the entire data processing system. In this context, we take it to only refer to the filesystem components.

4.Management of the SE

Requirement 1. A SE technology must have a credible support model that meets the reliability, availability, and security expectations consistent with the area in the CMS computing infrastructure in which the SE will be deployed.

Support for this SE solution is provided by a combination of OSG, LBNL, Globus, the Apache Software Foundation (ASF), DISUN, and possibly the US CMS Tier-2 program as follows.

BeStMan is supported by LBNL, GridFTP by Globus. Both are part of the OSG portfolio of storage solutions. HDFS is supported by ASF, as elaborated below, and FUSE is part of the standard Linux distribution. This leaves us with two types of plugins that are required to integrate all the pieces into a system, as well as the packaging support. The first is a plugin to BeStMan to pick a GridFTP server from a list of available GridFTP servers; this list is reloaded every 30 seconds from disk and servers are randomly selected (otherwise, the default policy is a simple round-robin and it requires an SRM restart to alter the GridFTP server list). The second is a GridFTP plugin to interface to HDFS. We propose that both of these types of plugins continue to be supported by their developers from DISUN and US CMS, until OSG has had an opportunity to gain sufficient experience to adopt the source, and own them. Similarly, packaging support in form of RPMs is to be provided initially by DISUN/Caltech, and later by US CMS/Caltech, until OSG has had an opportunity to adopt it as part of a larger migration towards providing native packaging in form of RPMs. OSG ownership of these software artifacts has been agreed upon in principle, but not yet formally. OSG support for this solution will start with Year 4 of OSG (October 1st, 2009), and will initially be restricted to:

a) Pick a set of RPMs twice a year, verify that this set is completely consistent, providing a well integrated system. We refer to this as the golden set.

b) Document installation instructions for this golden set.

c) Do a simple validation test on supported platforms (with the validation preferably automatic).

d) Performance test the golden set, and document that test. This performance test will be in-depth.

e) Provide operations support for two golden sets at a time. This means that there is a staff person on OSG who is responsible for tracking support requests, answering simple questions, and finding solutions to difficult questions via the community support group organized in the [email protected] listserv.

f) OSG will provide updates to a golden set only for important bug and security fixes; these critical patches will go through validation tests, but not performance tests.

Support will be official for RHEL4 and RHEL5 derivatives on both 32-bit and 64-bit platforms. The core HDFS software (the namenode and datanode) is usable on any platform providing the Java 1.6 JDK. Currently, Caltech and Nebraska both run datanodes on Solaris. The main limitation for FUSE clients is support for the FUSE kernel module; this is supported on any Linux 2.6 series kernel, and FUSE was merged into the kernel itself in 2.6.14.

Upgrades in HDFS are covered in this wiki document: http://wiki.apache.org/hadoop/Hadoop_Upgrade . On our supported version, upgrades on the same major version will only require a yum update. For major version upgrades, the procedure is:

a) Shut down the cluster.

b) Upgrade via yum or RPMs.

c) Start the namenode manually with the -upgrade flag.

d) Start the cluster. The cluster will stay in read-only mode.

e) Once the cluster's health has been verified, issue the hadoop dfsadmin -finalizeUpgrade command. After the command has been issued, no rollback may be performed.

The wiki explains additional recommended safety precautions.

Hadoop is an open-source project hosted by the Apache Software Foundation. ASF hosts multiple mailing lists for both users and developers, which are actively watched by both the Hadoop developers and the larger Hadoop community. Members of this community include Yahoo employees who use Hadoop in their workplace, as well as employees of Cloudera, which provides commercial packaging and support for Hadoop. As it is a top-level Apache project, it has contributors from at least three companies and strong project management. Yahoo has stated that it has invested millions of dollars into the project and intends to continue doing so (http://developer.yahoo.net/blogs/hadoop).

Hadoop depends heavily upon its JIRA instance, http://issues.apache.org/jira/browse/HADOOP , for bug reporting and tracking. As it is an open-source project, there is no guarantee of response time to issues. However, we have found high-priority issues (those that may lead to data loss) get solved quickly because bugs affecting a T2 site have a high probability of affecting Yahoo's production infrastructure.

Requirement 2. The SE technology must demonstrate the ability to interface with the global data transfer system PhEDEx and the transfer technologies of SRM tools and FTS as well as demonstrate the ability to interface to the CMSSW application locally through ROOT.

Caltech is using Hadoop exclusively for OSG RSV tests, PhEDEx load tests, user stageouts via SRM, ProdAgent stageout via SRM and POSIX-like access via FUSE. User analysis jobs submitted to Caltech with CRAB have been running against data stored in Hadoop for several months.

UNL has been using HDFS at large scale since approximately late February 2009. It is used for:

PhEDEx data transfers for all links (transfers are done with SRM and FTS).

OSG RSV tests (used to meet the WLCG MoU availability requirements)

Monte Carlo simulation and merging.

User analysis via CMSSW (both grid-based submissions using CRAB and local interactive analysis).

UCSD has been using HDFS for serving /store/user.

Caltech, UNL, and UCSD have been leading the effort to demonstrate the scalability of the BeStMan SRM server. This has recently resulted in an srmLs processing rate of 200Hz in a single server; this is 4 times larger than the rate required by CMS for FY2010, and is in fact the most scalable SRM solution deployed in global CMS.

Requirement 3. There must be sufficient documentation of the SE so that it can be installed and operated by a site with minimal support from the original developers (i.e. nothing more than "best effort"). This documentation should be posted on the OSG Web site, and any specific issues in interfacing the external product to CMS product should be highlighted.

Installation, operation, and troubleshooting directions can be found at http://twiki.grid.iu.edu/bin/view/Storage/Hadoop . This has already been discussed in Requirement 1, and we elaborate further here. It should be plausible for any site admin to install HDFS and the corresponding grid components without support from the original developers. The packaging provides a coherent install experience; all components, including BeStMan and GridFTP, are available as RPMs. Admins experienced with the RedHat tool yum will find that the SE is installable via a simple yum install hadoop. The DISUN/Caltech packaging also provides useful logging defaults that enable one to easily centrally log errors happening in HDFS; this greatly aids admins in troubleshooting.

With the OSG 1.2 release, there are no specific issues pertaining to using HDFS with CMS.

Experience shows that operational overhead at Caltech has been equivalent to approximately 1 FTE, and that includes R&D activities for packaging and testing. Going forward, we believe the overhead will decrease, as the R&D portions will be greatly reduced. Nebraska, which has reached a stable state with Hadoop for several months, shows that operational overhead is less than 1 FTE. UCSD, which is presently supporting HDFS for /store/user (113TB) and dCache for all else (273TB), reports the same experience. The main reason why this solution is experienced as less costly to operate is that it has many fewer moving parts. This much reduced complexity results in significantly lower operational overhead.

Requirement 4. There must be a documented procedure for how problems are reported to the developers of those products, and how these problems are subsequently fixed.

Starting with Year 4 of OSG, all problems are reported via OSG. OSG then uses the support mechanisms as discussed in Requirement 1.

In particular, Hadoop has an online ticket system called JIRA. JIRA is heavily used by the developers to receive and track bugs and feature requests for Hadoop and Hadoop-related projects. JIRA is open for viewing by everyone, and requires a simple account registration for posting comments and new tickets. All commits in the project must be traceable via JIRA and go through the quality control process (which includes code review by a different developer and passing automated tests).

The JIRA system can be found at http://issues.apache.org/jira/browse/HADOOP . The Hadoop community has also written a guide to filing bug reports, http://wiki.apache.org/hadoop/HowToContribute.

Requirement 5. Source code required to interface the external product to CMS products must be made available so that site operators can understand what they are operating. If at all possible, source code for the external product itself should also be available.

All software components are open source. In particular, Hadoop source code can be downloaded from http://www.apache.org/dyn/closer.cgi/hadoop/core/. Patches for specific problems in a release can be downloaded from JIRA (see above). Any patches currently applied to the Caltech Hadoop distribution have been submitted to JIRA, and we have tried to make sure that they get committed in a timely manner. This helps us minimize the costs of maintaining our own RPM installs.

Yahoo has publicly committed to releasing its stable patch set to the world beginning with 0.20.0. They are committed to keeping all patches used publicly available in the JIRA; the only added knowledge is which patches are stable enough to be applied to current releases. When we upgrade to this version, we will be able to tap into this significant resource; Yahoo has a committed QA team and a test cluster an order of magnitude larger than our production cluster.

The BeStMan source code is open source for academic users; the OSG is working on clearing up the licensing of this software and making the code more freely available. Each of the plug-ins used (for BeStMan and GridFTP) is available in the Nebraska SVN repository, which allows anonymous access.

5.Reliability of the SE

Requirement 6. The SE must have well-defined and reliable behavior for recovery from the failure of any hardware components. This behavior should be tested and documented.

We broadly classify the HDFS grid SE into three parts: metadata components, data components, and grid components. Below, we document the risks a failure in each poses and the suggested recovery mechanisms.

Metadata components: The two major metadata components for HDFS are the namenode and the secondary namenode (backup) servers. The namenode is the single point of failure for normal user operation. Because of this, there are several built-in protections each site is recommended to take:

o Write out multiple copies of the journal and file system image. HDFS heavily relies on the metadata journal as a log of all the operations that alter the namespace. HDFS config files allow for writing the journal onto multiple partitions. It is recommended to write the journal on two separate physical disks in the namenode, and suggested that a third copy be written to an NFS server.

o The secondary namenode / backup server allows the site admin to create checkpoint files at regular intervals (default is every hour or whenever the journal reaches 64MB in size). It is strongly recommended to run the secondary namenode on a separate physical host. The last two checkpoints are automatically kept by the secondary namenode.

o The checkpoints should be archived, preferably off-site.
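The checkpoint trigger described above (every hour, or once the journal reaches 64MB) can be written as a one-line rule. The following is a minimal sketch for illustration only; the function and constant names are hypothetical and are not HDFS configuration keys.

```python
# Defaults quoted in the text above; names are illustrative, not HDFS settings.
CHECKPOINT_PERIOD_S = 60 * 60              # one hour
CHECKPOINT_JOURNAL_BYTES = 64 * 1024 ** 2  # 64MB of journal


def checkpoint_due(seconds_since_last, journal_bytes,
                   period_s=CHECKPOINT_PERIOD_S,
                   max_journal_bytes=CHECKPOINT_JOURNAL_BYTES):
    """Return True when either trigger (elapsed time or journal size) has fired."""
    return seconds_since_last >= period_s or journal_bytes >= max_journal_bytes
```

A site that wants checkpoints more often than hourly would pass a smaller period_s; the cost is more frequent merges on the secondary namenode.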

Future versions of HDFS (0.21.x) plan on having an offline checkpoint verification tool and on streaming journal information to the backup node in real-time, as opposed to an hourly checkpoint.

In the case of namenode failure, the following remedies are suggested:

1. Restart the namenode from the file system image and journal found on the namenode's disk. The only resulting file loss will be the files being written at the time of namenode failure. This action will work as long as the image and journal are not corrupted.

2. Copy the checkpoint file from the secondary namenode to the namenode. Any files created between the time the checkpoint was made and the namenode failure will be lost. This action will work as long as the checkpoint has not been corrupted.

3. Use an archived checkpoint. Any files created between the checkpoint creation and the namenode failure will be lost. This action will work as long as one good checkpoint exists.

Note that all the namenode information is kept in two files written directly to the Linux filesystem.

If none of these actions work, the entire file system will be lost; this is why we place such importance on backup creation. In addition to the normal preventative measures, the following can be done:

1. Have hot-spare hardware available. HDFS is offline when the namenode is offline. If the failure does not have an immediate, obvious cause, we recommend using the standby hardware instead of prolonging the downtime by troubleshooting the issue.

2. Create a high-availability (HA) setup. It is possible, using DRBD and Heartbeat, to completely automate recovery from a namenode failure using a primary-secondary setup. This would allow services to automatically be restored on the order of a minute with the loss of only the files which were being created at the time of the failure. As most of the CMS tools automatically retry writing into the SE, this means that the actual file loss would be minimal. These high-availability setups have been used by external companies, but not yet by a grid site; the extra setup complexity is not yet perceived as worth the time.

Data components: Datanode loss is an expected occurrence in HDFS and there are multiple layers of protection against this.

1. The first layer of protection is block replication. Hadoop has a robust block replication feature that ensures that duplicate file blocks are placed on separate nodes and even separate racks in the cluster. This helps ensure that complete copies of every file are available in the case that a single node becomes unavailable, and even if an entire rack becomes unavailable.

2. Hadoop periodically requests an entire block report from every data node. This protects against synchronization bugs where the namenode's view of the datanode's contents is different from reality.

3. It is often possible to have a bad hard drive causing corruption issues, but not have that hard drive fail. Each datanode schedules each block to be checksummed once every 2 weeks. If the block fails the checksum, it will be deleted and the namenode notified, allowing automatic healing from hard drive errors.
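The block-report check in item 2 above amounts to comparing two sets of block IDs. The sketch below illustrates that comparison only; it is not the namenode's implementation, and the function name and the integer block IDs are hypothetical.

```python
def reconcile_block_report(expected, reported):
    """Compare the namenode's view of one datanode with that node's block report.

    expected: block IDs the namenode believes the datanode holds.
    reported: block IDs the datanode actually lists in its report.
    """
    expected, reported = set(expected), set(reported)
    return {
        # Blocks the namenode expects but the node no longer has:
        # candidates for re-replication from another copy.
        "missing": sorted(expected - reported),
        # Blocks on disk that the namenode does not attribute to this node:
        # for example, replicas of files that were deleted in the meantime.
        "unexpected": sorted(reported - expected),
    }
```

For instance, reconcile_block_report([1, 2, 3], [2, 3, 4]) flags block 1 as missing and block 4 as unexpected.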

Grid components: Grid components require the least amount of failure planning because they are stateless for the HDFS SE. Multiple instances of the grid components (GridFTP, BeStMan SRM) can be installed and used in a failover fashion using Linux LVS or round-robin DNS.

We are aware of one significant failure mode. BeStMan SRM is known to lock up under very heavy load (over 1000 concurrent SRM requests, which is at least 10 fold what has been observed in production), and requires a restart when this happens. We believe this to be a problem with the Globus java container. BeStMan is scheduled to transition to a different container technology in Fall 2009 in BeStMan 2. OSG is committed to follow this issue, and validate the new version when it is released. However, this issue manifests itself only under extreme loads at the existing deployments, and is thus presently not an operational problem.

In addition, there is a minor issue with the way HDFS handles deployments with server nodes that serve space of largely varying sizes. HDFS needs to be routinely re-balanced, which is typically done via crontab in extreme cases (e.g. Nebraska), and manually once every two weeks or so for deployments where disk space differs by no more than a factor 2-4 or so between nodes (e.g. UCSD). There is no significant operational impact from routinely running the balancer beyond network traffic; HDFS allows the site to throttle the per-node rate going to the balancer. This imbalance arises because the selection of servers for writing files is mostly random, while the distribution of free space is not necessarily random.

Requirement 7. The SE must have a well-defined and reliable method of replicating files to protect against the loss of any individual hardware system. Alternately there should be a documented hardware requirement of using only systems that protect against data loss at the hardware level.

Hadoop uses block decomposition to store its data, breaking each file up into blocks of 64MB apiece (this block size is a per-file setting; the default is 64MB, but most sites have increased this to 128MB for new files). Each file has a replication factor, and the HDFS namenode attempts to keep the correct number of replicas of each of the blocks in the file. The replication policy is:

a) First replica goes to the closest datanode: the local node is the highest priority, followed by a datanode on the same rack, followed by any data node.

b) Second replica goes to a random datanode on a different rack.

c) Third replica (if requested) goes to a random datanode on the second rack, i.e. the same rack as the second replica.

If two replicas are requested, they will end up on two separate racks; if three replicas are requested, they will also end up on two separate racks.
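The placement rules above can be illustrated with a small simulation. This is a sketch of the policy as stated in this document, not the namenode's actual code; the function name, the racks dictionary layout, and the node names are hypothetical. It assumes at least two racks, and at least two datanodes on the rack chosen for the second replica.

```python
import random


def place_replicas(racks, writer, writer_rack, replication, rng=random):
    """Pick datanodes for one block.

    racks: dict mapping rack name -> list of datanode names.
    writer / writer_rack: the client's host name and rack.
    """
    node_rack = {node: rack for rack, nodes in racks.items() for node in nodes}

    # a) Closest datanode: the local node, else same rack, else any node.
    if writer in node_rack:
        first = writer
    elif racks.get(writer_rack):
        first = rng.choice(racks[writer_rack])
    else:
        first = rng.choice(sorted(node_rack))
    chosen = [first]

    # b) Second replica: a random datanode on a different rack.
    if replication >= 2:
        remote = sorted(n for n, r in node_rack.items() if r != node_rack[first])
        second = rng.choice(remote)
        chosen.append(second)

    # c) Third replica: another random datanode on the second replica's rack.
    if replication >= 3:
        peers = [n for n in racks[node_rack[second]] if n not in chosen]
        chosen.append(rng.choice(peers))

    return chosen
```

With racks = {"r1": ["a", "b"], "r2": ["c", "d"]} and a writer on node "a", three replicas always land on "a" plus both nodes of "r2", which is the two-rack outcome stated above.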

Replication level is set by the client at the time it creates the block. The replication level may be increased or decreased by the admin at any time per-file, and can be done recursively. The namenode attempts to satisfy the client's request; as long as the number of successfully created replicas is between the namenode's configured minimum and maximum, HDFS considers the write a success. Because the client requests a replication level for each write, one cannot set a default replication for a directory tree.


For example, at Caltech, a cron script automatically sets the replication level on known directories in order to ensure clients request the desired replication level. They currently use:

hadoop fs -setrep -R 3 /store/user
hadoop fs -setrep -R 3 /store/unmerged
hadoop fs -setrep -R 1 /store/data/CRUZET09
hadoop fs -setrep -R 1 /store/data/CRAFT09

It is not possible to pin files to specific datanodes, or to set replication based on the datanodes where the files will be located. Hadoop treats all datanodes as equally unreliable.
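A cron job like the one at Caltech could keep the directory-to-replication mapping in one table and generate the hadoop fs -setrep commands from it. The sketch below only builds the command lines; the table and function names are hypothetical, and this is not Caltech's actual script.

```python
# Directory -> desired replication level, taken from the commands listed above.
REPLICATION_BY_PATH = {
    "/store/user": 3,
    "/store/unmerged": 3,
    "/store/data/CRUZET09": 1,
    "/store/data/CRAFT09": 1,
}


def setrep_commands(policy):
    """Build one 'hadoop fs -setrep -R <level> <path>' argument list per entry."""
    return [["hadoop", "fs", "-setrep", "-R", str(level), path]
            for path, level in policy.items()]


if __name__ == "__main__":
    for cmd in setrep_commands(REPLICATION_BY_PATH):
        print(" ".join(cmd))
```

On a host with the Hadoop client installed, each argument list could be passed to subprocess.check_call instead of being printed.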

The namenode keeps track of all block locations in the system, and will automatically delete replicas or create new copies when needed. Replicas may be deleted either when the associated file is removed or when the block has more replicas than desired. Datanodes send in a heartbeat signal once every 3 seconds, allowing the server to keep up-to-date with the system status.

For example, when a datanode fails, the server allows it to miss up to 10 minutes of heartbeats (setting configurable). Once it is declared dead, the namenode starts to make inter-node transfer requests to bring the blocks that were on the datanode back up to the desired level. This often is quite quick; all the desired replicas can be done in around an hour per TB (the majority of the transfers go faster than that; a good portion of the hour is transfer tails). When decommissioning is started, the system does not prefer to copy data off the decommissioned node. Assuming the number of replicas is greater than one, the load of replication is distributed randomly throughout the cluster. The percentage of transfers the decommissioned node will get is (# of TB of files on datanode)/(# of TB in entire system). Because older nodes typically have a smaller disk size, they will comparatively get less load.
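The timing rules in the two paragraphs above (a heartbeat every 3 seconds, a node declared dead after 10 minutes of silence, roughly one hour per TB to re-replicate, and the transfer share formula) can be put into a few helper functions. This is an illustrative sketch; the function names are hypothetical and the constants are the defaults quoted in this document.

```python
HEARTBEAT_INTERVAL_S = 3   # datanodes report every 3 seconds
DEAD_AFTER_S = 10 * 60     # configurable; 10 minutes per the text above


def dead_datanodes(last_heartbeat, now, dead_after_s=DEAD_AFTER_S):
    """Return nodes whose last heartbeat is older than the dead threshold.

    last_heartbeat: dict of node name -> timestamp (seconds) of last heartbeat.
    """
    return sorted(n for n, t in last_heartbeat.items() if now - t > dead_after_s)


def rereplication_hours(lost_tb, hours_per_tb=1.0):
    """Rule-of-thumb recovery time: about one hour per TB lost."""
    return lost_tb * hours_per_tb


def decommission_transfer_share(node_tb, system_tb):
    """Fraction of replication transfers served by the decommissioned node."""
    return node_tb / system_tb
```

For example, a 6TB node in a 342.64TB system would serve under 2% of the transfers during its own decommissioning, with the rest spread across the cluster.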

If a dead node reappears (host rebooted/fixed, disk is physically moved to a different node), the blocks it previously hosted will now be overreplicated. The namenode will then reduce the number of replicas in the system, starting with replicas located on nodes with the least amount of free space (as a percentage of total space on the node). If the blocks belonged to files which have been deleted, the node will be instructed to delete them in the response to its block report.
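The rule for trimming overreplicated blocks (drop the replica on the node with the least free space as a percentage of its total) can be sketched as below. The function name and data layout are hypothetical; this illustrates the selection rule, not the namenode's code.

```python
def trim_replicas(holders, free_fraction, desired):
    """Drop excess replicas, fullest node first.

    holders: nodes currently holding a replica of the block.
    free_fraction: dict of node -> free space / total space on that node.
    desired: target replication level.
    Returns (remaining holders, nodes told to delete their replica).
    """
    holders = list(holders)
    dropped = []
    while len(holders) > desired:
        # The node with the smallest free fraction gives up its replica first.
        victim = min(holders, key=lambda n: free_fraction[n])
        holders.remove(victim)
        dropped.append(victim)
    return holders, dropped
```

With replicas on nodes a, b and c that are 50%, 10% and 30% free, and a target of 2, node b is the one asked to delete its copy.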

The number of under-replicated blocks can be seen by viewing the system report using fsck or by looking at the namenode's Ganglia statistics.

The success of Hadoop's automatic block replication was seen when Caltech suffered a simultaneous failure of 3 large (6TB) datanodes on the evening of Sunday, Jul. 12:

Within about an hour of installing a faulty Nagios probe, 3 of the 2U datanodes had crashed, all within minutes of each other. Each of the 2U datanodes hosts just over 6TB of raid-5 data. Nagios started sending alarms indicating that we had lost our datanodes. We had seen kernel panics caused by this Nagios probe before, so we had no trouble locating the cause of the problem. The immediate corrective action was to disable this Nagios probe. This was done immediately to avoid further loss of datanodes.

Ganglia showed Hadoop reporting ~80k underreplicated blocks, and Hadoop started replicating them after the datanodes failed to check in after ~10 minutes. The network activity on the cluster jumped from ~200MB/s to 2GB/s. Since we run Rocks, 2 of the machines went into a reinstall immediately after dumping the kernel core file. Within an hour they were back up and running. Hadoop did not start automatically, however, due to a Rocks misconfiguration. I started Hadoop manually on these two nodes, which caused our underreplicated block count to drop from ~20k to ~4k. Within ~90 minutes the number of underreplicated blocks was back to zero.

At this point we still had one datanode that had not recovered. I checked the system console and discovered that the system disk had also died. Since it was late on a Sunday evening, and we run 2X replication in Hadoop, I decided we could leave the datanode offline and wait until the following morning to replace the disk.

After the replication was done, I ran hadoop fsck / to check the health of HDFS. To my surprise, hadoop was reporting 40 missing blocks and 33 corrupted files. This seemed strange because all of our files (aside from LoadTest data) are replicated 2x, so the loss of a single datanode should not have caused the loss of any files (aside from LoadTest data). After parsing the output of hadoop fsck /, I found that we had been accidentally setting the replication on /store/relval to 1, instead of leaving it at the default of 2. This was fixed.

The next morning we replaced the system disk in the dead datanode and waited for it to reinstall (which took a few hours longer than it ought to have). Almost immediately after starting Hadoop on this last datanode, hadoop fsck reported the filesystem was clean again. By 2:15 the following afternoon everything had returned to normal and hadoop was healthy again.

Requirement 8. The SE must have a well-defined and reliable procedure for decommissioning hardware which is being removed from the cluster; this procedure should ensure that no files are lost when the decommissioned hardware is removed. This procedure should be tested and documented.

The process of decommissioning hardware is documented in the Hadoop twiki under the Operations guide. The process goes approximately like this:

1. Edit the hosts exclude file to exclude the to-be-decommissioned host from the cluster.

2. Issue the refreshNodes command in the Hadoop CLI to get the namenode to re-read the file. The node should show up as Decommissioning in the web interface at this point.

3. Watch the web interface or the report command in the Hadoop CLI and wait until the node is listed as dead.
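The first two steps above lend themselves to a small helper. The sketch below assumes the exclude file is a plain text file with one hostname per line, and that the refreshNodes and report commands are invoked as hadoop dfsadmin -refreshNodes and hadoop dfsadmin -report; the function name and the file path passed to it are hypothetical, and the commands are only listed here, not executed.

```python
import os


def add_to_exclude_file(exclude_path, host):
    """Append host to the exclude file (one hostname per line) if not yet listed."""
    hosts = []
    if os.path.exists(exclude_path):
        with open(exclude_path) as f:
            hosts = [line.strip() for line in f if line.strip()]
    if host not in hosts:
        hosts.append(host)
    with open(exclude_path, "w") as f:
        f.write("\n".join(hosts) + "\n")
    return hosts


# Step 2: ask the namenode to re-read the exclude file.
REFRESH_CMD = ["hadoop", "dfsadmin", "-refreshNodes"]
# Step 3: poll the cluster report until the node's state changes.
REPORT_CMD = ["hadoop", "dfsadmin", "-report"]
```

Calling the helper twice with the same host leaves a single entry, so it is safe to run from a script that may be retried.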

This process is not only straightforward, but a very routine process at each site. Decommissioning is done whenever a node needs to be taken offline for any upgrade lasting more than 10 minutes at Nebraska.

Requirement 9. The SE must have a well-defined and reliable procedure for site operators to regularly check the integrity of all files in the SE. This should include basic file existence tests as well as the comparison against a registered checksum to avoid data corruption. The impact of this operation (e.g. load on system) should be documented.

Hadoop's command line utility allows site admins to regularly check the file integrity of the system. It can be viewed using hadoop fsck /. At the end of the output, it will either say the file system is HEALTHY or CORRUPTED. If it is corrupted, it provides the output necessary to repair or remove broken files.
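Because the verdict appears at the end of the fsck output, a monitoring probe can classify the result by scanning the last few lines. The sketch below is an assumption-laden illustration: the exact wording of the fsck summary line is not given in this document, so the code only looks for the substrings HEALTHY and CORRUPT, and the function names are hypothetical. The exit codes follow the usual Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
def fsck_status(output):
    """Classify captured 'hadoop fsck /' output as HEALTHY, CORRUPT or UNKNOWN."""
    tail = output.strip().splitlines()[-5:]  # the verdict is near the end
    for line in reversed(tail):
        if "HEALTHY" in line:
            return "HEALTHY"
        if "CORRUPT" in line:  # also matches "CORRUPTED"
            return "CORRUPT"
    return "UNKNOWN"


def nagios_exit_code(status):
    """Map the fsck verdict to a Nagios plugin exit code."""
    return {"HEALTHY": 0, "CORRUPT": 2}.get(status, 3)
```

A probe would capture the command's standard output, pass it to fsck_status, and exit with nagios_exit_code of the result.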

HDFS registers a checksum at the block level within the block's metadata. HDFS automatically schedules background checksum verifications (default is to have every block scanned once every 2 weeks) and automatically invalidates any block with an incorrect checksum. The checksumming interval can be adjusted downward at the cost of increased background activity on the cluster. We do not currently have statistics on the rate of failures avoided by checksumming.

Whenever a file is read by a client (even partially; checksums are kept for every 4KB), the client receives both data and checksum and verifies the validity of the data on the client side. Similarly, when a block is transferred (for example, through rebalancing), the checksum is computed by the receiving node and compared to the sender's data.
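Per-chunk checksumming is what makes partial reads verifiable: the reader only needs the checksums of the chunks it touched. The sketch below illustrates the idea with CRC32 from Python's zlib module over 4KB chunks. The choice of CRC32 and the function names are assumptions made for this example; this document does not describe HDFS's on-disk checksum format.

```python
import zlib

CHUNK = 4096  # one checksum per 4KB, as described above


def chunk_checksums(data, chunk=CHUNK):
    """Compute one CRC32 per chunk of the data (writer side)."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def first_bad_chunk(data, checksums, chunk=CHUNK):
    """Recompute checksums on the reader side; return the first mismatching
    chunk index, or None if every chunk verifies."""
    for idx, expected in enumerate(checksums):
        if zlib.crc32(data[idx * chunk:(idx + 1) * chunk]) != expected:
            return idx
    return None
```

Flipping a single byte at offset 5000 of a 10KB buffer is reported as a mismatch in chunk 1, while chunks 0 and 2 still verify.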

Note about catastrophic loss:

We have emphasized that with 2 replicas, file loss is very rare because:

Failures separated by more than 2 hours cause little to no loss because the re-replication in the file system is extremely fast; one guidance is to expect 1TB per hour to be re-replicated.

Multi-disk failures happening within an hour usually are due to some common piece of equipment (such as the rack switch or PDU). Rack-awareness prevents an entire rack disappearing from causing file loss.

However, what happens if we make the assumption that all safeguards are bypassed and 2 disks are lost? This is not without precedent; at Caltech, a misconfiguration told Hadoop 2 nodes on the same rack were on different racks. This bypassed the normal protections from rack awareness. The rack's PDU failed and two disks failed to come back up. Caltech lost 54 file blocks.

Using the binomial distribution, the expected number of blocks lost is:

(# of blocks lost) = (# of blocks) * P(single block loss)

The binomial distribution is appropriate because the loss of one block does not affect the probability of another block loss. The standard deviation is approximately the square root of the number of blocks lost. The probability of a single block loss is

P(single block loss) = P(block on node 1) * P(block on node 2)

The probability a block is on a given node is approximately:

P(block on node 1) = (replication level)*(size of node)/(size of HDFS)

assuming that the cluster is well-balanced and blocks are randomly distributed. Both assumptions appear to be safe in currently-deployed clusters.

Plugging in Caltech's numbers (1,540,263 blocks; 342.64 TB in the system, each lost disk was 1TB), the expected number of lost blocks was 52.4 with a standard deviation of 7.2. This is strikingly close to the actual loss, 54 blocks.
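The arithmetic can be reproduced directly from the formulas above. The sketch below assumes a replication level of 2 (the level Caltech reports running) and treats the two lost disks as the two "nodes" in the formula; the function name is hypothetical. The result should agree with the quoted 52.4 and 7.2 to within rounding.

```python
import math


def expected_block_loss(n_blocks, replication, node_tb, system_tb, failed_nodes=2):
    """Expected number of lost blocks and its approximate standard deviation.

    P(block on node) = replication * node_tb / system_tb
    P(single block loss) = P(block on node) ** failed_nodes
    """
    p_on_node = replication * node_tb / system_tb
    p_loss = p_on_node ** failed_nodes
    expected = n_blocks * p_loss
    # For small p_loss the binomial standard deviation is close to sqrt(expected).
    return expected, math.sqrt(expected)


if __name__ == "__main__":
    expected, sigma = expected_block_loss(1_540_263, 2, 1.0, 342.64)
    print(f"expected lost blocks: {expected:.1f} +/- {sigma:.1f}")
```

The observed loss of 54 blocks lies well within one standard deviation of this estimate.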

If only complete files were written (i.e., no block decomposition), then the expected loss would be

(# of files lost) = (# of files) * P(single block loss)

So, assuming 128MB blocks and experimental files of around 1GB, the number of files lost would be 10x lower. In the end, a CMS site would lose 10x more files using HDFS. We believe this is an acceptable risk, especially as the recovery procedure for 5 files versus 50 files is similar. In the case of a simultaneous triple-disk failure on triple-replicated files, the expected loss would be less than 1 file for Caltech's HDFS instance.

Requirement 10. The SE must have well-defined interfaces to monitoring systems such as Nagios so that site operators can be notified if there are any hardware or software failures.


HDFS integrates with Ganglia; provided that the site admin points HDFS to the right Ganglia endpoint, many relevant statistics for the namenode and datanodes appear in the Ganglia gmetad webpages. Many monitoring and notification applications can set up alerts based on this.

Caltech has also contributed several HDFS-Nagios plugins to the public that monitor various aspects of the health of the system directly. They have released a TCL-based desktop application, gridftpspy, which monitors the health and activity of the Globus GridFTP servers. Some of these are based on the JMX (Java Management eXtensions) interface into HDFS. JMX can integrate with a wider range of monitoring systems. There is also an external project providing Cacti templates for monitoring HDFS. The Nagios and gridftpspy components are packaged in the Caltech yum repository, but not officially integrated; we foresee labeling them experimental for the OSG-supported first release.

Finally, Caltech has developed the Hadoop Chronicle, a nightly email that sends administrators the basic Hadoop usage statistics. This has an appropriate level of detail to inform site executives about Hadoop's usage. The Hadoop Chronicle is now part of the OSG Storage Operations toolkit. This is currently in use at Caltech and in testing at Nebraska.

Note about admin intervention:

The previous two requirements start to cover the question of which HDFS activities site admins engage in, and at what interval. We have the following feedback from Nebraska and Caltech site admins, respectively:

Nebraska:

o Daily tasks: Check Hadoop Chronicle, look at RSV monitoring.

o Once a week: Clean up dead hardware, restart dead components. The component which crashes most often is BeStMan, at about once every 2 weeks.

o Once every 2 months: Some sort of data recovery or in-depth maintenance. Examples include debugging an underreplicated block or recovering a corrupted file.

Caltech (note: Caltech runs an experimental kernel, which may explain why there's more kernel-related maintenance than at Nebraska):

o Continuously: Wait for Nagios alerts.

o Hourly tasks: Check namenode web pages and gridftp logs via gridftpspy (admittedly a bit excessive).

o Daily tasks: Read Hadoop Chronicle, browse PhEDEx rate/error pages.

o Weekly tasks: Reboot nodes due to kernel panic, adjust gridftp server list (BeStMan plugin currently not used), track down lost blocks (for datasets replicated once), maintain ROCKS configuration.

o Once a month: Reboot namenode with new kernel, reinstall data nodes with bugfix update.

6.Performance of the SE

All aspects of performance must be documented.

Requirement 11. The SE must be capable of delivering at least 1 MB/s/batch slot for CMS applications such as CMSSW. If at all possible, this should be tested in a cluster on the scale of a current US CMS Tier-2 system.



To test this requirement, Caltech ran a test using dd to read from HDFS through the FUSE mount on each of the 89 worker nodes on the Tier2 cluster. dd was used to maximize the throughput from the storage system. We acknowledge that the IO characteristics of dd are not identical to those of CMSSW applications, which tend to read smaller chunks of data in random patterns. Each worker node ran 8 dd processes in parallel, one per core. Each dd process/batch slot on a single worker node read a different 2.6GB file from HDFS 10 times in sequence. The same 8 files were read from each of the 89 worker nodes. At the end of each file read, dd reported the rate at which the file was read. A total of 18.1TB was read during this test. The final dd finished approximately 4.25 hours after the test was started.

The average read rate reported by dd was 2.3MB/s ± 1.5MB/s. The fastest read was 22.8MB/s and the slowest was 330KB/s.

The aggregate rate delivered from HDFS was 18.1TB/4.25 hours = 1238MB/s, or approximately 155MB/s (1Gbps) per HDFS file. The test was run more as a test of how the system behaves for the 'hot file' problem. As such, this test shows that HDFS can deliver even 'hot files' to the batch slots at the required rates.
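The figures quoted for the dd test can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; it assumes binary units (1TB = 1024GB, 1GB = 1024MB), which is our assumption and not something the test description specifies. The per-slot figure is a derived time-average over the whole 4.25 hours, so it sits below the per-read rates reported by dd.

```python
# Inputs quoted in the description of the Caltech dd test.
WORKER_NODES = 89
DD_PER_NODE = 8          # one dd process per core / batch slot
FILE_GB = 2.6
READS_PER_FILE = 10
ELAPSED_HOURS = 4.25
DISTINCT_FILES = 8       # the same 8 files were read on every node

# Assumption: binary units for GB -> TB and GB -> MB conversions.
total_gb = WORKER_NODES * DD_PER_NODE * FILE_GB * READS_PER_FILE
total_tb = total_gb / 1024
elapsed_s = ELAPSED_HOURS * 3600

aggregate_mb_s = total_gb * 1024 / elapsed_s
per_file_mb_s = aggregate_mb_s / DISTINCT_FILES
per_slot_mb_s = aggregate_mb_s / (WORKER_NODES * DD_PER_NODE)

print(f"total read:      {total_tb:.1f} TB")
print(f"aggregate rate:  {aggregate_mb_s:.0f} MB/s")
print(f"per hot file:    {per_file_mb_s:.0f} MB/s")
print(f"per batch slot:  {per_slot_mb_s:.2f} MB/s")
```

Under these assumptions the totals agree with the 18.1TB, roughly 1238MB/s, and roughly 155MB/s per file quoted above, and the time-averaged rate per slot stays above the 1MB/s requirement.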

It should be noted that this test was run while the cluster was also 100% full with CMS production and CMS analysis jobs, most of which were also reading and writing to Hadoop at the same time. The background HDFS traffic from this CMS activity was not included in these results.

UCSD ran a separate test with a standard CMSSW application consuming physics data. The same application has been used for computing challenges and scalability exercises. The application is very I/O intensive. Here we mainly focused on the application reading data that is located locally in Hadoop.

During the tests, 15 datanodes held the data files, each 1GB in size. The block size in UCSD's Hadoop is 128 MB. The replication of the data files was set to 2. For each file, there are therefore 16 block replicas (8 blocks, each stored twice), well distributed across all the datanodes. The application was configured to run against 1 file or 10 files per job slot. The number of jobs running simultaneously ranged from 20 to 200. The maximal number of jobs running simultaneously was 250, which is roughly a quarter of the available job slots at UCSD at that time. The rest of the slots were running production or user analysis jobs, so the test was running under very typical Tier-2 conditions. The test application itself did not significantly change the overall condition of the cluster.

The ratio of average job slots running the tests to the number of Hadoop datanodes ranged from 10 to 20. Eventually this ratio will be 8 if all the WNs are configured as Hadoop datanodes and each WN runs 8 slots. This will increase the I/O capability per job slot by 50-100% over the results we measured in the test.

The average processing time per job was 200 and 4000 seconds for the application processing 1 and 10 GB of data, respectively. The average I/O rates in reading the data are shown in the following figures: average I/O for the application consuming 1 GB (left) and 10 GB (right). The test shows the 1MB/s per slot requirement is at the low end of the rate that is actually delivered by HDFS. The average is ~2-3 MB/s per job.



Requirement 12. The SE must be capable of writing files from the wide area network at a performance of at least 125MB/s while simultaneously writing data from the local farm at an average rate of 20MB/s.

Below is a graph for the Nebraska worker node cluster

During this time, HDFS was servicing user requests at a rate of about 2500/sec (as determined by syslog monitoring using the HadoopViz application). Each user request is a minimum of 32KB, so this is at least 80MB/s of internal traffic. At the same time, we were writing in excess of 100MB/s as measured by PhEDEx.
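The lower bound on internal traffic follows directly from the request rate and the minimum request size. A minimal sketch of that arithmetic, assuming decimal units (1MB = 1000KB); the function name is ours and purely illustrative:

```python
def min_bandwidth_mb_s(requests_per_s, min_request_kb, kb_per_mb=1000):
    """Lower bound on bandwidth implied by a request rate and a minimum request size."""
    return requests_per_s * min_request_kb / kb_per_mb


# Figures quoted in the text: about 2500 requests/sec, at least 32KB each.
internal_mb_s = min_bandwidth_mb_s(2500, 32)
print(f"internal traffic: at least {internal_mb_s:.0f} MB/s")
```

Because every request is at least 32KB, the real internal traffic can only be higher than this bound.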

Below is an example of HDFS serving data to a CRAB-based analysis launched by an external user. At the time (December 2008), the read-ahead was set to 10MB. This provided an impressive amount of network bandwidth (about 8GB/s) to the local farm, but is not an everyday occurrence. The currently recommended read-ahead size is 32KB.



Requirement 13. The SE must be capable of serving as an SRM endpoint that can send and receive files across the WAN to/from other CMS sites. The SRM must meet all WLCG and/or CMS requirements for such endpoints. File transfer rates within the PhEDEx system should reach at least 125MB/s between the two endpoints for both inbound and outbound transfers.

During Aug. 20-24, Caltech and Nebraska ran inter-site load tests using PhEDEx to exercise the gridftp-hdfs servers.



During this time period, PhEDEx recorded a 48-hour average of 171MB/s coming into the Caltech Hadoop SE, with files primarily originating from UNL. Peak rates of up to 300MB/s were observed. There was a temporary drop to zero at ~23:00 Aug. 24 due to an expired CERN CRL.

During this same time period, Caltech was exporting files at an average rate of 140MB/s, with files primarily destined for UNL. For several hours during this time period the transfer rates exceeded 200MB/s.

It must be noted that the PhEDEx import/export load tests were not run in isolation. While these PhEDEx load tests were running, Caltech was downloading multi-TB datasets from FNAL, CNAF, and other sites with an average rate of 115MB/s and peaks reaching almost 200MB/s.



UCSD has additionally been working on an in-depth study of the scalability of BeStMan, especially at different levels of concurrency. The graph below shows how the effective processing rate has scaled with the increasing number of concurrent clients.

The operation used was srmLs without full details; this causes a stat operation on the file system, but reduces the amount of XML generated by the BeStMan server. This demonstrates processing rates well above the levels currently needed for USCMS. It is sufficient for high-rate transfers of gigabyte-sized files and uncontrolled chaotic analysis.



7.Site-specific Requirements

Note: We believe the requirements set out here cover a subset of the functionality required at a CMS T2 site. We believe that the better test has been putting the storage elements into production at several sites; the combination of all activities and chaotic loads appears to be a better test than artificial tests. An additional test that we recommend below is replacing the skims (which are bandwidth-heavy and IOPS-light, unlike most T2 activities) with a few analysis jobs (which are bandwidth-medium and IOPS-heavy).

Note: We've done the best we could without owning more storage (by the end of 2010, each site will probably double in size). We believe we have demonstrated that the potential bottlenecks (the namenode) scale out for what we'll need in the next three years. As long as the ratio of cores to usable terabytes stays on the order of 1 to 1 and not 1 to 10, we believe IOPS will scale as demonstrated. We believe the fact that Yahoo has demonstrated multi-petabyte clusters shows the number of raw terabytes will scale.

Note: We believe that the architectures deployed at the current T2 sites (UCSD, Caltech, and Nebraska) can be repeated at others; in particular, at any site that does not rely entirely on a small number of RAID arrays. It is applicable for sites having issues with reliability or site admin availability.

Requirement 14. A candidate SE should be subject to all of the regular, low-stress tests that are performed by CMS. These include appropriate SAM tests, job-robot submissions, and PhEDEx load tests. The SE should pass these tests 80% of the time over a period of two weeks. (This is also the level needed to maintain commissioned status.)

The below chart shows the status of the site commissioning tests from CMS, which is a combination of all the regular low-stress tests performed.

Additionally, Caltech's use of a Hadoop SE has maintained a 100% Commissioned site status for the two weeks prior to Aug. 17:

http://lhcweb.pic.es/cms/SiteReadinessReports/SiteReadinessReport_20090817.html#T2_US_Caltech .

Requirement 15. The new storage element should be filled to 90% with CMS data. These datasets should be chosen such that they are currently "popular" in CMS and will thus attract a significant number of user jobs. Failures of jobs due to failure to open the file or deliver the data products from the storage systems (as opposed to user error, CE issues, etc.) should be at a level of less than 1 in 10^5.

A suggested test would be a simple "bomb" of scripts that repeatedly opens random files and reads a few bytes from them with a high parallelism; for the 10^5 test, it's not necessary to do it through CMSSW or CRAB. An example would be to have 200 worker nodes open 500 random files each and read a few bytes from the middle of the file.

This was performed using the se_punch.py tool found in Nebraska's se_testkit. There were no file access failures. This script implemented the suggested test: all worker nodes in the Nebraska cluster simultaneously started opening random files and reading a few bytes from the middle of each.
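The shape of such a test can be sketched in a few lines. The following is a hypothetical illustration of the suggested "open random files and read a few bytes from the middle" pattern, not the se_punch.py tool itself; the function names and defaults are our own, and the file list would in practice come from the FUSE mount of the SE.

```python
import os
import random


def punch_file(path, n_bytes=16):
    """Open one file, seek to its middle, and read a few bytes.

    Returns True if the read succeeds and returns the expected number of bytes.
    """
    try:
        size = os.path.getsize(path)
        middle = size // 2
        with open(path, "rb") as handle:
            handle.seek(middle)
            data = handle.read(n_bytes)
        return len(data) == min(n_bytes, size - middle)
    except OSError:
        return False


def punch_random(paths, n_files=500, seed=None):
    """Punch n_files randomly chosen files (with repetition) and collect failures."""
    rng = random.Random(seed)
    sample = [rng.choice(paths) for _ in range(n_files)]
    failures = [path for path in sample if not punch_file(path)]
    return len(sample), failures
```

Each worker node would run something like `punch_random(file_list, n_files=500)` at the same time and report its failure list; the failure rate is then the total number of failures divided by the total number of opens.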

Nebraska is now working on a script utilizing PyROOT (which is distributed with CMSSW) that opens all files on the SE with ROOT. This not only verifies files can be opened, but demonstrates a minimal level of validity of the contents of the file. Opening with ROOT should fully protect against truncation (as the metadata required to open the file is written at the end of the file) and whole-file corruption. It does not detect corruptions in the middle of the file, but built-in HDFS protections should detect these.

Nebraska ran with HDFS over 90% full during May 2009 and encountered no significant problems other than writes failing when all space was exhausted. Caltech also experienced some corrupted blocks when HDFS was filled to 96.8% and certain datanodes reached 100% capacity. Some combination of failed writes, rebalancing, and failing disks resulted in two corrupted blocks and two corrupted files. These files had to be invalidated and retransferred to the site. This is the only time that Caltech has lost data in HDFS since putting it into production 6 months ago. There are a few recommendations to help avoid this situation in the future:

1) Run the balancer often enough to prevent any datanode from reaching 100%.

2) Don't allow HDFS to fill up enough that an individual datanode partition reaches 100%.

3) If using multiple data partitions on a single datanode, make them of equal size, or merge them into a single RAID device so that Hadoop sees only a single partition.
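One simple way to act on the second recommendation is to watch per-partition usage on each datanode and raise an alert before any partition fills. The sketch below is a hypothetical helper, not part of Hadoop or of any site's tooling; the partition paths and the 95% threshold are placeholders, and it relies only on Python's standard `shutil.disk_usage`.

```python
import shutil


def partitions_near_full(paths, threshold=0.95, usage=shutil.disk_usage):
    """Return (path, used_fraction) for every partition at or above the threshold.

    `usage` is injectable so the check can be exercised without real partitions.
    """
    flagged = []
    for path in paths:
        stats = usage(path)
        used_fraction = stats.used / stats.total
        if used_fraction >= threshold:
            flagged.append((path, used_fraction))
    return flagged
```

A cron job or Nagios check could call, for example, `partitions_near_full(["/hadoop/data1", "/hadoop/data2"])` (hypothetical paths) and alert when the returned list is non-empty.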

Future versions of Hadoop (0.20) have a more robust API to help manage datanode partitions that have been completely filled to 100%.

Requirement 16. In addition, there should be a stress test of the SE using these same files. Over the course of two weeks, priority should be given to skimming applications that will stress the IO system.

Specific CMS skim workflows were run at Nebraska on June 6. However, the results of these were not interesting as the workflows only lasted 8 hours (no significant failures occurred).

However, the stress of the skim tests is far less than the stress of user jobs (especially PAT-based analysis) due to the number of active branches in ROOT; see CMS Internal Note 2009-18. Many active branches in ROOT result in a large number of small reads; a CMS job on an idle system will typically read no more than 32KB per read and achieve 1MB/s. Hence, 1000 jobs will achieve about 30,000 IOPS if they are not bound by the underlying disk system. Because the HDFS installs have relatively high bandwidth due to the large number of data nodes, but the same number of hard drives as other systems, bandwidth is usually not a concern while I/O operations per second (IOPS) is. See the below graphs demonstrating a large number of IOPS; even at the max request rate, the corresponding bandwidth required is only 5Gbps. For the hard drives deployed at the time the graph was generated, this represented about 60 IOPS per hard drive, which matched independent benchmarks of the hard drives. The bandwidth usage of 5Gbps represents only a fraction of the bandwidth available to HDFS.
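The IOPS estimate in this paragraph is a short piece of arithmetic, sketched below. The function names are ours, and binary units (1MB = 1024KB) are an assumption; the 32KB read size, 1MB/s per job, and 60 IOPS per drive are the figures quoted in the text.

```python
import math


def reads_per_second(rate_mb_s, read_kb, kb_per_mb=1024):
    """Number of reads per second one job issues at a given rate and read size."""
    return rate_mb_s * kb_per_mb / read_kb


def aggregate_iops(n_jobs, rate_mb_s=1.0, read_kb=32):
    """Total read operations per second for n_jobs identical jobs."""
    return n_jobs * reads_per_second(rate_mb_s, read_kb)


def drives_needed(iops, iops_per_drive=60):
    """Minimum number of drives required to sustain a given IOPS load."""
    return math.ceil(iops / iops_per_drive)


total_iops = aggregate_iops(1000)
print(f"1000 jobs: {total_iops:.0f} IOPS, at least {drives_needed(total_iops)} drives at 60 IOPS each")
```

With these inputs 1000 jobs issue on the order of 30,000 reads per second, which is why the number of spindles, rather than network bandwidth, tends to set the limit.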

Because HDFS approaches the underlying hardware limits of the system during production, we consider typical user jobs to be the best stressor of the system. Such stress tests occur in large batches on a weekly basis at both Nebraska and Caltech. During the tests in this requirement and others, Nebraska's and Caltech's systems were in full production for CMS simulation, analysis, and WAN transfer, and often the batch slots were 100% utilized. By default, data went to HDFS and only a few datasets were kept on dCache. UCSD's system was smaller and shared the CMS activities with a dCache instance.



Requirement 17. As part of the stress tests, the site should intentionally cause failures in various parts of the storage system, to demonstrate the recovery mechanisms.

As noted in Requirement 16, an HDFS instance in large-scale production is sufficient for demonstrating stress. During production at Nebraska and at Caltech, we have observed failures of the following components:

Namenode: When a namenode dies, the only currently used recovery mechanism is to replace the server (or fix the existing server) and copy a checkpoint file into the appropriate directory. This has been demonstrated in production at Nebraska and Caltech. A high-availability setup has not yet been investigated by our production sites, mostly due to the perceived complexity for little perceived benefit (namenode failure is rare). When the namenode fails, writes will not continue and reads will fail if the client had not yet cached the block locations for open files.

Datanode: Datanode failures are designed to be an everyday occurrence, and they have indeed occurred at both Nebraska and Caltech. The largest operational impact is the amount of traffic generated by the system while it is re-replicating blocks to new hosts.

Globus GridFTP servers: Each transfer is spawned as a separate process on the host by xinetd. This results in the server being extremely reliable in the face of failures or bugs in the GridFTP server. When a GridFTP host dies, others may be used by SRM. Nebraska and UCSD have implemented schemes where the SRM server stops sending new transfers to the failed GridFTP server. Caltech has also implemented a GridFTP appliance integrated with the Rocks cluster management software that can be used to install and configure a new GridFTP server in 10 minutes.

SRM server: When the SRM server fails, all SRM-based transfers will fail until it has been restarted manually (the service health is monitored via RSV). This happens infrequently enough in production that no automated system has been implemented, although LVS-based failover and load-balancing is plausible because BeStMan is stateless. Caltech has implemented a BeStMan appliance integrated with the Rocks cluster management software that can be used to install and configure a new BeStMan server in 10 minutes.

8.Security Concerns

HDFS

HDFS has unix-like user/group authorization, but no strict authentication. HDFS should only be exposed to a secure internal network which only non-malicious users are able to access. For users with unrestricted access to the local cluster, it is not difficult at all to bypass authentication. There is no encryption or strong authentication between the client and server, meaning that one must have both a trusted server and a trusted client. This is the primary reason why HDFS must be segregated onto an internal network.

It is possible to reasonably lock down access by:

1. Preventing unknown or untrusted machines from accessing the internal network. This requirement can be removed by turning on SSL sockets in lieu of regular sockets for inter-process communication. We have not pursued this method due to the perceived performance penalty.

a. Untrusted machines include end users' laptops or desktops. Access from such machines could instead be allowed via Xrootd redirectors (for ROOT-based analysis) or by exporting the file system via HTTPS (allowing whole-file download).

2. Preventing non-FUSE users from accessing HDFS ports on the known machines on the network. This means only the HDFS FUSE process will be able to access the datanodes and namenode; this allows the Linux filesystem interface to sanitize requests and prevents users from having TCP-level access to HDFS.

It's important to point out that in (2), we are relying on the security of the clients on the network. If a host is compromised at the root level, the attacker can perform any arbitrary action with sufficient effort. During the various tests outlined above, the sites' security was based on either the internal NAT (Caltech and Nebraska) or firewalls eliminating access to the outside world (UCSD).

Security concerns are actively being worked on by Yahoo. The progress can be followed on this master JIRA issue:

https://issues.apache.org/jira/browse/HADOOP-4487

In release 0.21.0, access tokens issued by the namenode prevent clients from accessing arbitrary data on the datanode (currently, one only needs to know the block ID to access it). Also in 0.21.0, the transition to the Java Authentication and Authorization Service has begun; this will provide the building blocks for Kerberos-based access (Yahoo's eventual end goal). Judging by current progress, transitioning to Kerberos-based components could happen during 2010.

If a vulnerability is discovered, we would release updated RPMs within one workweek (sooner if the packaging is handled by the VDT). This probably will not be necessary as the security model is already very permissive. Security vulnerabilities are one of the few reasons we will update the golden set of RPMs.

Note: Example damage a rogue batch job could do

To demonstrate the security model, we give a few examples of what a rogue job could do:

Excessive memory usage by the rogue job could starve the datanode process and cause it to crash. Most sites limit the amount of memory allowed for individual batch jobs, so this is not a big concern.



If the rogue job has write access to the datanode partition, then it could fill up the partition with garbage, which would prevent the datanode from writing any further blocks. This will not cause the datanode to fail, but will cause a loss of usable space in the SE.

o Most sites use Unix file system permissions to prevent this.

A malicious batch job with telnet access to the Hadoop datanode could request any block of data if it knows the block ID. This is fixed in the HDFS 0.21.0 branch (to be released approximately in November).

A malicious batch job with telnet access to the Hadoop namenode could perform arbitrary file system commands. This could result in a lot of damage to the storage system, and is why we recommend client-side firewalls.

o This is a known weakness in the current security model and is being addressed in current Hadoop development.

Grid Components (GridFTP and BeStMan)

Globus GridFTP and BeStMan both use standard GSI security with VOMS extensions; we assume this is familiar to both CMS and FNAL. Because both components are well-known, we do not examine their security models here.

If a vulnerability is discovered in any of these components, we would release an RPM update once our upstream source (the VDT) has this update. The target response time would be one workweek while packaging is done at Caltech, and in lockstep with the VDT update when that team does packaging.

9.Risk Analysis

In this section, we analyze different risks that are posed to the different pieces of the HDFS-based SE. We attempt to present the most pressing risks in the proposed solution (both technical and organizational), and point out any mitigating factors.

HDFS

HDFS is both the core component and a component external to grid computing. Hence, its risk must be examined most closely.

1. Health of Hadoop project: HDFS is completely dependent on the existence and continued maintenance of the Hadoop project. Continued development and growth of this project is critical. Hadoop is a top-level project of the Apache Software Foundation; in order to achieve this status, the following requirements were necessary:

a. Legal

i. All code ASL'ed (Apache Software License, a highly permissive open-source license).

ii. The code base must contain only ASL or ASL-compatible dependencies.

iii. License grant complete.

iv. Contributor License Agreement on file.

v. Check of project name for trademark issues.

This legal legwork protects us from code licensing issues and various other legal issues.

b. Meritocracy / Community

i. Demonstrate an active and diverse development community.

ii. The project is not highly dependent on any single contributor (there are at least 3 legally independent committers and there is no single company or entity that is vital to the success of the project).

iii. The above implies that new committers are admitted according to ASF practices.

iv. ASF-style voting has been adopted and is standard practice.

v. Demonstrate ability to tolerate and resolve conflict within the community.

vi. Release plans are developed and executed in public by the community.

vii. The ASF Board has voted for final acceptance as a Top Level Project.

The ASF has shown that these community guidelines and requirements are hallmarks of a good open source project.

The fact that HDFS is an ASF project and not a Yahoo corporation project means that it is not tied to the health of Yahoo. The current HDFS lead is employed by Facebook, not Yahoo. At this point in the project's life, about 40% of the patches come from non-Yahoo employees. Relevant to the recent change to Microsoft as the company's search engine provider, Yahoo has made public statements that:

Hadoop is used for almost every piece of the Yahoo infrastructure, including: spam fighting, ads, news, and analytics.

Hadoop is critical to Yahoo as a company, and is not a subproject of the search engine.

It is possible that money previously invested into the search engine technology will now be invested into Hadoop.

Cloudera has received about $16 million in start-up capital and employs several key developers, including Doug Cutting, the original author of the system.

Hadoop maintains a listing of web sites and companies utilizing its technology, http://wiki.apache.org/hadoop/PoweredBy .

Condor currently funds a developer working on Hadoop, and is investigating the use of HDFSas a core component.

While we believe these reasons mitigate the risk of HDFS development becoming stagnant, we believe this is the top long-term risk associated with the project.

2. Hadoop support / resolution of bugs: There is no direct monetary support for large-scale HDFS development, nor is the success of HDFS dependent upon WLCG usage. We have no paid support for HDFS (although it can be purchased). This is mitigated by:

a. Paid support is available: We have good contacts with the Cloudera technical staff, and would be able to purchase development support as needed. Several project committers are on Cloudera staff.

b. Critical bugs affect large corporations: Any bug we are exposed to affects Yahoo and Facebook, whose businesses depend on HDFS. Hence, any data loss bug we discover will be of immediate interest to their development teams. When Nebraska started with HDFS, we had issues with blocks truncated by ext3 file system recovery. This triggered a long investigation by a member of the Yahoo HDFS team, resulting in many patches for 0.19.0. Since that version, we have not seen the truncation issue again.

c. Acceptance of patches: Nebraska has contributed on the order of 5 patches to HDFS, and has not had issues with getting patches accepted by the upstream project. The major issue has been passing the acceptance criteria: each patch must meet coding guidelines, pass code review from a different coder, and come with a unit test (or an explanation of why a new unit test is not needed).

i. We have opened 30 issues. 10 of these issues have been fixed. 4 have been closed as duplicate. 4 have been closed as invalid. 12 remain open; 6 of these have a patch available, but have not been committed. Of the remaining open issues, only 1 is applied to our local distribution (the same patch is also applied to the Cloudera distribution).

d. Large number of unit tests: HDFS core has good unit test coverage (Clover coverage of 76%, http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/clover/). All nontrivial commits require a unit test to be committed along with them. Because of the initial difficulties in getting completely safe sync/append functionality, a large set of new unit tests was developed for 0.21.0 based on a fault-injection framework. The fault injection framework provides developers with the ability to better demonstrate not only correct behaviors, but correct behaviors under a variety of fault conditions.

The unit tests are run nightly using Apache Hudson (http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/) and take several hours.

Each point helps to mitigate the issue, but does not completely remove the issue. In the extreme case, we are prepared to run on locally-developed patches that are not accepted by the upstream project. This would hurt our efforts to keep support costs under control, so we would avoid this situation.

3. Hadoop feature set: We believe HDFS currently has all the features necessary for adoption. We do not believe that any new features are required in the core. However, it should be pointed out the system is not fully POSIX compliant. Specifically, the following is missing:

a. File update support: Once a file is closed, it cannot be altered. In HDFS 0.21.0, append support will be enabled. We do not believe this will ever be necessary for USCMS.

b. Multiple write streams / random writes: Only a single stream of data can write to an open file and doing a seek() during write is not supported. This means that one may not write a TFile directly to HDFS using ROOT; first, the file must be written to disk, then copied to HDFS. We developed in-memory stream reordering in GridFTP in order to avoid this limitation. If USCMS decides to write files directly to the SE and not use local scratch, HDFS will not be immediately supported. We believe this to be a low risk.

c. Flush and append: A file is not guaranteed to be fully visible until it has been closed. Until it is closed, it is not defined how much data a reader will see if they attempt to read the file. Flush and append support will be available for HDFS 0.21.0, which will provide guaranteed semantics about when data will be available to readers. We do not believe this will be an issue for CMS.
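The idea behind in-memory stream reordering, mentioned under item (b), can be illustrated with a short sketch. This is not the GridFTP plugin code; it is a hypothetical Python illustration, with names of our own choosing, of how chunks arriving out of order can be buffered and handed to a write-once, append-only sink strictly in offset order, so that no seek() is ever needed on the destination.

```python
import io


class SequentialReorderer:
    """Buffer out-of-order (offset, data) chunks and write them in offset order."""

    def __init__(self, sink):
        self.sink = sink          # any object with a write(bytes) method
        self.next_offset = 0      # offset of the next byte the sink expects
        self.pending = {}         # offset -> chunk, held until its turn comes

    def add(self, offset, data):
        """Accept one chunk; flush every chunk that is now contiguous."""
        if offset < self.next_offset or offset in self.pending:
            raise ValueError("chunk at offset %d was already received" % offset)
        self.pending[offset] = data
        while self.next_offset in self.pending:
            chunk = self.pending.pop(self.next_offset)
            self.sink.write(chunk)
            self.next_offset += len(chunk)

    def close(self):
        """Fail loudly if the stream ended with a gap still unfilled."""
        if self.pending:
            raise IOError("stream ended with a gap at offset %d" % self.next_offset)


# Example: the second half of the data arrives before the first half.
sink = io.BytesIO()
reorderer = SequentialReorderer(sink)
reorderer.add(4, b"efgh")   # held in memory, nothing written yet
reorderer.add(0, b"abcd")   # fills the gap, both chunks are flushed in order
reorderer.close()
```

The memory cost of this approach is bounded by how far ahead of the write position chunks are allowed to arrive, which is the trade-off a real implementation has to manage.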


FUSE/FUSE-DFS

FUSE-DFS, as a contributed project in Hadoop, shares many risks with HDFS. There are a few concerns we believe relevant enough to merit their own category.

1. FUSE support: FUSE has no commercial company providing support. However, it is part of the mainline Linux kernel, is over 5 years old, and has had a stable interface for quite a while. We have never seen any issue from FUSE itself. We believe the FUSE kernel module is a risk because OSG has less experience packaging kernel modules, and kernel modifications often result in support issues. This is mitigated by the fact that OSG-supported Xrootd requires the FUSE kernel module (meaning that HDFS isn't unique in this situation) and that the ATrpms repository provides a FUSE kernel module and tools to build the RPM module for non-standard kernels. UCSD and Caltech both build their own kernel modules; Nebraska uses the ATrpms ones.

2. FUSE-DFS support: FUSE-DFS is the name of the userland library that implements the FUSE filesystem. This was originally implemented by Facebook and is in the HDFS SVN repository as a contributed module. It does not have the same level of support as HDFS Core because it is contributed; it also does not have as many companies using it in production. Through the process of adopting HDFS, we have discovered critical bugs and have submitted, and had accepted, several FUSE-DFS related patches. We have even recently found memory leak bugs in the libhdfs C wrappers (this had not been previously discovered because it is only noticeable when things are in continuous production). We believe this component is the short-term highest-risk software component of the entire solution. The mitigating factors for FUSE-DFS are:

a. Small, stable codebase: FUSE-DFS is basically a small layer of glue between FUSE and the libhdfs library (a core HDFS component, used by Yahoo). The entire code base is around 2,000 lines, about 4% of the total HDFS size. During our usage of HDFS, neither the libhdfs nor the FUSE API has changed. This limits the number of undiscovered bugs and the rate of bug introduction. We believe the majority of the possible issues are fixed.

b. Production experience: We have been running FUSE for more than 8 months, and feellike we have a good understanding of possible production issues. As of the latestrelease, the largest outstanding issue is the fact that FUSE must be remounted wheneverusers are added or removed from groups (user-to-group mappings are currently cachedindefinitely). This is well understood and possible to work-around. This bug may bemitigated in future versions of HDFS, as it will be necessary to the future Kerberos-based authorization / authentication.

c. Extensive debugging experience available: The last FUSE-DFS memory leak bugtackled required in-depth debugging at Nebraska. We believe we have the experienceand tools necessary to handle any future bugs. We intend to make sure that any locallydeveloped patches are upstreamed to the HDFS project.

There have been other FUSE binding attempts, but this is the only one that has been supported or developed by a major company (Facebook) and committed as a part of the HDFS project. The other attempts appear to have never been completed or kept up-to-date with HDFS.

BeStMan

BeStMan is an already-supported component of the OSG. We have identified the associated risks:

1. BeStMan runs out of funding: As BeStMan is quickly becoming an essential OSG package, we believe that it will always meet the needs of USLHC, even if it is not funded at LBNL.

2. BeStMan currently uses Globus 3 container: The Globus 3 web services container was never in large-scale use, and currently suffers from debilitating bugs and an unmaintained architecture. The BeStMan team is currently devoting most of its effort to replacing this with an industry-standard Tomcat webapp container. This should be delivered fall-winter 2009. We believe this will remove many bugs and improve the overall source code. It would also make it possible for external parties to submit improvements.

Globus GridFTP

Globus GridFTP is an already-supported component of the OSG. We have identified the associated risks:

1. Globus GridFTP runs out of funding: Globus GridFTP is an essential component of the OSG. If it runs out of funding, we will use whatever future solution the OSG adopts.

2. Globus GridFTP model possibly not satisfactory: The Globus GridFTP model is based on processes being launched by xinetd. Because each transfer is a separate process, issues affecting one transfer are well isolated from other transfers. However, this makes it extremely hard to enforce limits on the number of active transfers per node. This can lead to either instability issues (by having no limit) or odd errors (globus-url-copy does not gracefully report when xinetd refuses to start new servers). We would like to investigate multi-threaded daemon-mode Globus GridFTP, but have not identified effort yet. Current T2 sites mitigate this mostly by controlling the number of concurrent transfers (except CRAB stageouts) and providing sufficient hardware to accommodate an influx of transfers.
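The mitigation of controlling the number of concurrent transfers can be sketched on the submitting side as follows. This is a minimal illustration under stated assumptions, not how any T2 site actually schedules transfers: TransferLimiter, MAX_ACTIVE_TRANSFERS, run_batch, and fake_transfer are hypothetical names, and fake_transfer merely stands in for launching a real transfer (for example, via globus-url-copy).

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_ACTIVE_TRANSFERS = 4  # hypothetical per-node cap, chosen for illustration


class TransferLimiter:
    """Caps the number of transfers running at once and records the peak."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def run(self, transfer, *args):
        with self._slots:  # blocks until one of the `limit` slots is free
            with self._lock:
                self.active += 1
                self.peak = max(self.peak, self.active)
            try:
                return transfer(*args)
            finally:
                with self._lock:
                    self.active -= 1


def fake_transfer(name):
    # Stand-in for a real transfer; sleeps briefly so transfers overlap.
    time.sleep(0.01)
    return name


def run_batch(names, limit=MAX_ACTIVE_TRANSFERS, workers=16):
    """Submit all transfers at once; the limiter keeps at most `limit` active."""
    limiter = TransferLimiter(limit)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda n: limiter.run(fake_transfer, n), names))
    return results, limiter.peak
```

Throttling before the request reaches the server means excess transfers wait in a queue rather than being refused by xinetd, which avoids the ungraceful error reporting mentioned above.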

Component Plug-ins

Both BeStMan and GridFTP require plug-ins in order to achieve the desired level of functionality in this SE. We have identified the associated risks:

1. Future changes in versions of underlying components: We may have to update plug-in code if the related component changes its interface. For example, BeStMan2 may require a new Java interface to implement GridFTP selector plug-ins. Even if the API remains the same, it is possible for the underlying assumptions to change, e.g., if the GridFTP plug-in needed to become thread-safe.

2. Original author leaves USCMS: If the original author leaves USCMS, then much knowledge would be lost, even if the effort is replaced. This is why focus is being put into clean packaging, documentation, and ownership by an organization (OSG) as opposed to just one person. The BeStMan component is relatively simple and straightforward, mitigating this concern. The GridFTP component is not, due to the complexity of the Globus DSI interface (by far, the most complex interface in the SEs). This is high-performance C code and difficult to change. If the original author left and the Globus DSI module changed significantly, USCMS would need to invest about 1 man-month of effort to perform the upgrade. This is mitigated by the fact that the current system does not need any GridFTP feature upgrades; USCMS can run on the same plug-in for a significant amount of time.

Packaging

We have worked hard to provide packaging for the entire solution. The current packaging does offer a few pitfalls:

1. Original author leaves USCMS: The setup at Caltech is based on mock, the standard Fedora/Redhat build tool. The VDT does not currently have the processes in place to package RPMs effectively, but this is a planned development for Year 4. Until the packaging duties can be transferred from Caltech to VDT (perhaps late Year 4), we will be dependent on the setup there. We are attempting to get it better documented in order to mitigate risk.

2. Patches fail to get upstreamed: It is crucial to send patches upstream and maintain the minimum number of changes from the base install. We must remain diligent in committing fixes upstream for any bugs we find.

3. Rate of change: Even with only bug fix updates for golden releases, the rate of updates is always worrying. Most of the updates recently have been related to packaging issues, especially for platforms not present at any production T2 cluster. We hope that the added OSG effort in Year 4 will enable us to drastically reduce the rate of change.

4. Update mechanisms for ROCKS clusters: Currently, doing a yum install is the correct way to install the latest version of the software. However, when an administrator adds the RPMs to a ROCKS roll, they get locked into that specific version and must manually take action to upgrade the RPMs. This means there will always be significant resistance to changing versions. This makes decreasing the rate of updates even more important.



Experts and Funding

Much of this work was done using several CMS experts. We outline two risks:

a) Loss of experts: As mentioned above, we take a significant hit if our experts leave the organization. We are focusing heavily on documentation, packaging, and finishing off development (in fact, preparation for this review has prompted us to clear several long-standing issues). This will allow us to do the first golden set, but also increase the length of time HDFS can be maintained between experts.

   a. A significant amount of CMS T2 funding comes from the DISUN project, which ends in Spring 2010. DISUN personnel contribute to the HDFS effort. This is an ongoing concern for the HDFS effort and the CMS T2 program as a whole.

b) Loss of OSG: Much of the risk and effort is being shouldered with the OSG to leverage their packaging expertise. Having HDFS in the OSG taps into an additional pool of human resources outside the experts in USCMS. However, the current funding for the OSG runs out in 2 years (and is reduced in 1 year). If the OSG funding is lost, then we will have to again rely internally on USCMS personnel, similar to FY2009.

The catastrophe scenario for HDFS adoption is the loss of both the OSG funding and the experts. In this case, the survival plan would be:

Identify funding for new experts (from experience, it takes about 6 months to train a new expert once they are in place). New experts can be drawn from the pool of HDFS sysadmins; as HDFS gains wider use, the pool of potential experts is broadened.

No new golden set until a packaging, testing, and integration program can be re-established. If this becomes a chronic problem, a hard focus would be placed on switching entirely to Cloudera's distribution in order to offload the Q/A testing of major changes to an external organization.

No new USCMS-specific features. We believe that HDFS has all the necessary major features for CMS adaptation, but we do find small useful ones (an example would be the development of Ganglia 3.1 compatibility). Without a local expert, developing these for CMS would not be possible, and running with any patches not accepted by the upstream project becomes increasingly dangerous.
