GAIA Tech1 Data Repositories Meeting. Ingrid Bàrcena, HPC and Storage services manager; Ricard de la Vega, Portals and Repositories manager. GAIA Tech1 meeting, Madrid, May 24, 2011.


Page 1: GAIA Tech1 Data Repositories Meeting

GAIA Tech1 Data Repositories Meeting

Ingrid Bàrcena, HPC and Storage services manager

Ricard de la Vega, Portals and Repositories manager

GAIA Tech1 meeting

Madrid May 24 2011

Page 2: GAIA Tech1 Data Repositories Meeting

Outline

1. What is CESCA?
2. CESCA services
   ▪ HPC and Storage
   ▪ Network
   ▪ University e-Administration
   ▪ Portals and Repositories
3. Digital Repositories
   ▪ Overview
   ▪ Two examples: DSpace and web archiving
   ▪ Long term preservation
4. CESCA and GAIA
   ▪ What is done
   ▪ What could be done

Page 3: GAIA Tech1 Data Repositories Meeting

Centre de Supercomputació de Catalunya

▪ Patrons:
• Generalitat de Catalunya
• Fundació Catalana per a la Recerca i la Innovació
• Universitat de Barcelona
• Universitat Autònoma de Barcelona
• Universitat Politècnica de Catalunya
• Universitat Pompeu Fabra
• Universitat de Girona
• Universitat Rovira i Virgili
• Universitat de Lleida
• Universitat Oberta de Catalunya
• Universitat Ramon Llull
• Consell Superior d’Investigacions Científiques

▪ Public Consortium created in 1991
▪ ICTS since 2000

Page 4: GAIA Tech1 Data Repositories Meeting

Our Services

Page 5: GAIA Tech1 Data Repositories Meeting

HPC and Storage

HPC Service
19.48 Tflop/s peak performance
50 research projects (203 users)
Main areas:
• Materials Science (31%)
• Life Science (32%)
• Environmental Science (28%)
• Astronomy and Astrophysics (5%)
+ 3.5 HC used during 2010
+ 50 scientific applications available

Storage Service
Disk Library
NetApp FAS3170
150 TB
21 TB FC drives
126 TB SATA drives
Tape Library
ADIC i2000
156 TB
6 LTO-4 drives
300 slots
NetBackup 6.5

Drug Design Service
6 Pharma Labs
10 Academic research groups
2 Software Packages

Page 6: GAIA Tech1 Data Repositories Meeting

Network services

+80 connected institutions
2 core nodes at 10 Gbps
Flexible bandwidth
Services: IPv6, multimedia, Remote Access Service, Voice over IP, Eduroam, Security...

21 institutions in Catalonia
40 countries
24 ISPs and operators
Services: Multicast, IPv6, NTP Server, F root server (A and J, .com and .net coming soon)...

Page 7: GAIA Tech1 Data Repositories Meeting

University e-Administration Projects

e-Register
• URV: production 02-01-11
• UdL: production 03-14-11
• Sadiel: 32,692 €

e-Vote
• Bid price: 405,000 €
• Awarded (03-18-10): Scytl, 345,000 €
• Production: 02-01-11

SCD (e-Identity and e-Signature)
• Available: EC-UR and EC-URV; ER-CESCA, -URV, -UPC, -UdL, -UPF
• In development: ER-UdG, ER-UB, ER-UAB and ER-UVic

GPI
Improvements (02-03-11)
• Inteum Sentinel and Technology Publisher
• Office 2007; separate VMs per university; e-mail sending
Licence renewal: UB and UPC
Investment: 1,046.97 €

e-Archive
• Transfer agreement: 12-7-10
• Inst. ATLAS: 17,800 €
• Integr. Doc. Mgt: Award: IECI 51,920 € (02-12-11)
• Production: 06-01-11

Cluster: 15 BL460c G6 (2 x Intel Xeon E5530 QC); 480 GB; 4.3 TB; Citrix XenServer; 2 F5 BIG-IP 1600 load balancers; 110,487 €

Data layer

F5 BIG-IP load balancers

Page 8: GAIA Tech1 Data Repositories Meeting

31-03-11

Portals and Repositories

Since 2001

18 universities

10,577 doctoral theses

www.tdx.cat

Since 2005

22 institutions

24,564 research papers, eprints…

www.recercat.cat

Since 2006

328 journals

129,235 articles

www.raco.cat

Since 2009

10 universities

1,814 learning objects

www.mdx.cat

Since 2006

39,587 websites crawled

118,039 versions crawled

249M files in 7.5 TB

www.padicat.cat

Since 2010

22 institutions

24,564 research papers, eprints…

www.recercat.cat

Since 2006

Turnkey development

Evolutionary maintenance

http://recyt.fecyt.es

Pilot 2009-10

420 websites crawled

790 versions crawled

http://recyt.fecyt.es

(restricted IP address)

Page 9: GAIA Tech1 Data Repositories Meeting

Outline

1. What is CESCA?
2. CESCA services
   ▪ HPC and Storage
   ▪ Network
   ▪ University e-Administration
   ▪ Portals and Repositories
3. Digital Repositories
   ▪ Overview
   ▪ Two examples: DSpace and web archiving
   ▪ Long term preservation
4. CESCA and GAIA
   ▪ What is done
   ▪ What could be done

Page 10: GAIA Tech1 Data Repositories Meeting

Digital Repositories

▪ A repository captures, stores, indexes, preserves and distributes digital content.
▪ Data + Metadata
• Dublin Core (DC)
• METS, MODS, MARC21…
• VO?
• Astronomical?
▪ Main issues
• Access (search / browse)
• Preservation
• Interoperability
– Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) (based on Dublin Core metadata)
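As a minimal sketch of what an OAI-PMH harvest looks like, the Python below builds a `ListRecords` request for Dublin Core metadata and pulls the identifier and title out of a response. The base URL `https://repository.example.org/oai/request` and the sample record are invented for illustration; only the verb name, the `metadataPrefix=oai_dc` argument and the XML namespaces come from the OAI-PMH 2.0 and Dublin Core specifications. No network call is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build the URL of a ListRecords request (nothing is fetched here)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def extract_titles(response_xml):
    """Return (identifier, title) pairs found in a ListRecords response."""
    root = ET.fromstring(response_xml)
    pairs = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS)
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        pairs.append((identifier, title))
    return pairs


# Hypothetical response with a single Dublin Core record.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:1234</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example doctoral thesis</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()

if __name__ == "__main__":
    print(list_records_url("https://repository.example.org/oai/request"))
    print(extract_titles(SAMPLE_RESPONSE))
```

A real harvester would also follow the `resumptionToken` that large repositories return to page through results; that part is left out of this sketch.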

Page 11: GAIA Tech1 Data Repositories Meeting

Repositories taxonomy

Towards a European e-Infrastructure for e-Science Digital Repositories. 7th e-Concentration Meeting, Brussels, 12-14th October, 2009

Page 12: GAIA Tech1 Data Repositories Meeting

Repositories Hardware

▪ High availability
▪ Load balancing
▪ Easy scalability
▪ 24x7 monitoring

Layers (diagram): Balancers; Services; Data; Storage Area Network (Disk, Tape)

Page 13: GAIA Tech1 Data Repositories Meeting

Repositories Software

▪ For general purpose
• DSpace, EPrints, Fedora, Islandora…
• Implemented in

▪ For journal management
• Open Journal Systems (OJS)
• Implemented in

▪ For web archives preservation
• Heritrix, NutchWAX, WERA, Wayback, Webcurator…
• Implemented in

Page 14: GAIA Tech1 Data Repositories Meeting

Example of a general purpose repository (DSpace)

▪ For digital objects, like PDF, images, videos, data…
▪ Indexes metadata and PDF content for searching

Page 15: GAIA Tech1 Data Repositories Meeting

Example of a web archive (PADICAT)

▪ PADICAT collects, processes and provides permanent access to the entire cultural, scientific and general output of Catalonia in digital format. It is the archive of Catalan web sites.

Other web archives:

Archive      Scope      Begin  N. websites    N. crawls    Space    Date
PANDORA      Australia  1996   26,630         60,276       4.63 TB  16-12-2010
UK ARCHIVE   UK         2004   8,308          32,618       7.59 TB  12-01-2011
IA           World      1996   -              150 billion  -        13-12-2011
VEFSAFN      Iceland    2004   -              -            -        13-01-2011
BNF          France     2002   -              -            180 TB   13-01-2011
Kulturarw3   Sweden     1996   -              -            -        26-11-2010
Netarchive   Denmark    2005   > 1.1 million  4.5 billion  155 TB   08-2010

The table also has rows for Open access, Search by URL, Search by keyword and Directory; the per-archive marks are not legible here, except the note "since 2009" on the Open access mark for VEFSAFN.

PADICAT:
- Open Access
- Search by URL and keyword
- Catalogue and thematic directory

www.padicat.cat

Since 2006
- 39,587 websites crawled
- 118,039 versions crawled
- 249M files in 7.5 TB

Page 16: GAIA Tech1 Data Repositories Meeting

Web archive software architecture

1. Harvest
• HERITRIX, writing ARC files
2. Index and search
• ARCINDEXER: index for URL searching, viewed with WAYBACK
• HADOOP + NUTCHWAX: index for keyword searching, viewed with WERA
3. Catalogue and browse
• WEB CURATOR TOOL
• CATALOG DATABASE (Crawl Metadata)

Page 17: GAIA Tech1 Data Repositories Meeting

PADICAT’s indexes

▪ Until now (< 100,000 website versions crawled)
• For search by URL (like Internet Archive)
– Index with ArcIndexer (~100 GB) + visualize with Wayback √
• For search by keyword
– Index with Hadoop+NutchWAX + visualize with WERA √

▪ Now (120,000 website versions crawled)
• Performance problems for keyword indexing
• Two solutions under evaluation:
– Index with a new version of NutchWAX + visualize with TNH (The New Hotness, from IA)
– Index with JB (James Brown, from IA) + visualize with TNH
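To make the difference between the two kinds of index concrete, here is a toy Python sketch. It is not how ArcIndexer or NutchWAX work internally; the capture data, the function names and the in-memory dictionaries are all assumptions chosen for illustration. A URL index maps each address to the dates it was crawled, while a keyword index maps every word of every captured page to the captures that contain it.

```python
from collections import defaultdict
import re

# Hypothetical captures: (url, crawl timestamp, extracted text).
CAPTURES = [
    ("http://example.cat/", "20060915", "Benvinguts al portal de recerca"),
    ("http://example.cat/", "20100301", "Portal de recerca i innovació"),
    ("http://example.cat/news", "20100301", "Notícies de recerca"),
]


def build_url_index(captures):
    """Map each URL to the sorted list of timestamps at which it was crawled."""
    index = defaultdict(list)
    for url, timestamp, _text in captures:
        index[url].append(timestamp)
    return {url: sorted(stamps) for url, stamps in index.items()}


def build_keyword_index(captures):
    """Map each lower-cased word to the set of (url, timestamp) captures containing it."""
    index = defaultdict(set)
    for url, timestamp, text in captures:
        for word in re.findall(r"\w+", text.lower()):
            index[word].add((url, timestamp))
    return dict(index)


if __name__ == "__main__":
    print(build_url_index(CAPTURES)["http://example.cat/"])
    print(sorted(build_keyword_index(CAPTURES)["recerca"]))
```

In this toy model the URL index gets one entry per capture, whereas the keyword index gets one entry per distinct word per capture, which gives a feel for why keyword indexing is the part that becomes expensive as the archive grows.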

Page 18: GAIA Tech1 Data Repositories Meeting

Long term preservation

▪ The e-infrastructure must ensure long term access to the data, without failure.
▪ To succeed, the following must be taken into account:
• Replication (more than one copy)
• Media refresh
• Format migration
• Data integrity (checksums)
• Contingency and recovery plan
• Preservation plan
• ...
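The data integrity item can be illustrated with a small fixity check. The Python below is a sketch under assumptions: SHA-256 as the checksum, a plain dictionary as the manifest, and the function names `build_manifest` and `audit` are all choices made for this example, not what DSpace or any CESCA system is stated to use. The idea is to record a checksum per file once, then recompute it periodically and report anything missing or changed.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map every file under root (by relative POSIX path) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def audit(root, manifest):
    """Return (relative path, problem) pairs for missing or altered files."""
    root = Path(root)
    problems = []
    for relative, expected in manifest.items():
        target = root / relative
        if not target.is_file():
            problems.append((relative, "missing"))
        elif sha256_of(target) != expected:
            problems.append((relative, "checksum mismatch"))
    return problems
```

The manifest would normally be stored apart from the data it describes, so that a single failure cannot damage both.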

Page 19: GAIA Tech1 Data Repositories Meeting

An example of long term preservation

The “preservation history” of TDX (doctoral theses)…

▪ 2001 – 80 GB, 8,000 access hits
• SW: ETDdb (+ MySQL, Glimpse…) from Virginia Tech
• HW: HP V2500 with 16 processors, 4 GB memory, 227 GB disk
• HW: StorageTek TimberWolf 9740 with 2.7 TB of 9840 tapes

Born in a supercomputer!

Page 20: GAIA Tech1 Data Repositories Meeting

An example of long term preservation

The “preservation history” of TDX (doctoral theses)…

▪ Hardware migrations
• 2003 (cpu + disk)
– HP rp5430 with 2 processors, 704 GB memory
– HP EVA V.2 with 2.8 TB disk
• 2006 (cpu + tape)
– High availability HP cluster with 32 ProLiant DL360 nodes
– ADIC Scalar i2000 (from 9840 tapes to LTO3 tapes)
• 2009 (disk)
– NetApp FAS3170 with 60 TB disk

▪ Software migrations
• 2010 – DSpace (+ PostgreSQL, Java, Solr, …) from MIT & HP Labs

Page 21: GAIA Tech1 Data Repositories Meeting

An example of long term preservation

The “preservation history” of TDX (doctoral theses)…

▪ Replication
• On disk - Online version (1)
• One backup on the tape library (2)
• Another backup in a fireproof cabinet (3)
• Another backup at a remote centre 50 km away (4)
• A dark copy in the MetaArchive Cooperative
– Private LOCKSS (Lots of Copies Keep Stuff Safe) Network
– 10 more copies around the world (14)

▪ Data Integrity
• Checksums on DSpace (online version)
• Checksums on LOCKSS (dark copies)
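Replication and checksums work together: with several copies, a damaged one can be spotted by comparing digests. The Python below is a deliberately simplified illustration of that idea using a majority vote. It is not the LOCKSS polling protocol, which is considerably more elaborate, and the replica names and the function `find_damaged_replicas` are hypothetical.

```python
from collections import Counter
import hashlib


def digest(data):
    """SHA-256 hex digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def find_damaged_replicas(replicas):
    """Given {replica name: content bytes}, return (agreed digest, damaged names).

    The agreed digest is the one held by a strict majority of replicas;
    a ValueError is raised when there is no such majority.
    """
    if not replicas:
        raise ValueError("no replicas given")
    digests = {name: digest(content) for name, content in replicas.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(replicas) / 2:
        raise ValueError("no majority among replicas")
    damaged = sorted(name for name, value in digests.items() if value != winner)
    return winner, damaged


if __name__ == "__main__":
    copies = {
        "online": b"thesis content",
        "tape": b"thesis content",
        "remote": b"thesis c0ntent",  # simulated corruption
    }
    print(find_damaged_replicas(copies)[1])
```

A damaged copy found this way would then be repaired from one of the copies that agree with the majority.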

Page 22: GAIA Tech1 Data Repositories Meeting

An example of long term preservation

The “preservation history” of TDX (doctoral theses)…

▪ 2011 – 300 GB, more than 3.5 million access hits
• SW: DSpace (+ PostgreSQL, Java, Solr, …) from MIT & HP Labs
• HW: High availability HP cluster with 32 ProLiant DL360 nodes
• HW: NetApp FAS3170 with 60 TB disk
• HW: ADIC Scalar i2000
• SW: LOCKSS (+ Conspectus...)
• HW: HP DL380 (LOCKSS cache)

▪ xxxx – …

www.tdx.cat

Page 23: GAIA Tech1 Data Repositories Meeting

Outline

1. What is CESCA?
2. CESCA services
   ▪ HPC and Storage
   ▪ Network
   ▪ University e-Administration
   ▪ Portals and Repositories
3. Digital Repositories
   ▪ Overview
   ▪ Two examples: DSpace and web archiving
   ▪ Long term preservation
4. CESCA and GAIA
   ▪ What is done
   ▪ What could be done

Page 24: GAIA Tech1 Data Repositories Meeting

GAIA at CESCA: what is done

Timeline from 2000 to 2011 (figure). Rows: Data Processing, IDT/IDU, Storage, Database, GDASS/COG, Backup

Page 25: GAIA Tech1 Data Repositories Meeting

GAIA and CESCA: what could be done

Data processing
Database
Preservation: Dark copy, …
Data Repository
Large data transfer
Powerful Searches and interoperability
Storage and Backup

Page 26: GAIA Tech1 Data Repositories Meeting

Thank you!

Questions?

[email protected]

[email protected]