Introduction to NCI - Part 2
National Computational Infrastructure
Download training materials here: http://nci.org.au/services-support/training/
Filesystems II VDI Data Collections
Massdata
The Mass Data Store was migrated to a new SGI Hierarchical Storage Management System in January 2012.
MDSS is used for long term storage of large datasets.
Every project has a directory on the MDSS.
All members of the project group have read and write access to the top project directory.
If you have numerous small files to archive - bundle them into a tarfile FIRST.
mdss dmls -l shows which files are online (in the disk cache) and which are on tape.
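The advice to bundle many small files into a tarfile before archiving can be sketched as follows. The directory and file names (small_files, part_N.dat, small_files.tar) are made up for illustration, and the final mdss step is shown only as a comment because it needs access to the MDSS.

```sh
# Work in a scratch directory so the sketch does not touch real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create a few small example files (hypothetical names).
mkdir small_files
for i in 1 2 3; do
    echo "sample $i" > "small_files/part_$i.dat"
done

# Bundle the whole directory into a single tarfile FIRST ...
tar cf small_files.tar small_files

# ... then count the data files inside the archive as a sanity check.
file_count=$(tar tf small_files.tar | grep -c '\.dat$')
echo "archived $file_count files"

# On Raijin the single tarfile would then be archived with something like:
#   mdss put small_files.tar
```

Archiving one tarfile rather than many small files keeps the number of objects on tape low, which is the point of the FIRST in the slide.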
Using the MDSS
The mdss command can be used to “get” and “put” data between the interactive nodes of Raijin and the MDSS, as well as to list files and directories on the MDSS.
netcp and netmv can be used from within batch jobs to
Generate a batch script for copying/moving files to the MDSS
Submit the generated batch script to the special copyq, which runs the copy/move job on an interactive node.
netcp and netmv can also be used interactively to save you the work of creating tarfiles and generating mdss commands.
-t create a tarfile to transfer
-z/-Z gzip/compress the file to be transferred
Caution!
Always use -l other=mdss when using mdss commands in copyq. This ensures that jobs only run when the mdss system is available.
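As a rough sketch of the caution above, a copyq job script that archives a tarfile might look like the one written out below. The walltime and the tarfile name results.tar are placeholders; only the copyq queue, the -l other=mdss request and the mdss "put" operation come from the slides. The sketch writes the job script to a file rather than submitting it.

```sh
# Write a hypothetical copyq job script to a temporary file.
job_script=$(mktemp)
cat > "$job_script" <<'EOF'
#!/bin/bash
#PBS -q copyq
#PBS -l other=mdss
#PBS -l walltime=00:30:00

# Start in the directory the job was submitted from.
cd "$PBS_O_WORKDIR" || exit 1

# results.tar is a placeholder name.
mdss put results.tar "$USER"
EOF

# Check that the mdss resource request made it into the script.
mdss_requests=$(grep -c '^#PBS -l other=mdss$' "$job_script")
echo "other=mdss lines: $mdss_requests"

# On Raijin the script would then be submitted with: qsub "$job_script"
```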
Exercise 4: Using the MDSS
To see these commands in action do
cd /short/$PROJECT/$USER
mdss get Data/data.tar
ls -l
tar xvf data.tar
ls
rm data.tar
mdss mkdir $USER
netmv -t $USER.tar DATA $USER
watch qstat -u $USER
... (wait until job finishes, use Ctrl+C to quit) ...
less DATA.o*
mdss ls $USER
mdss rm $USER/$USER.tar
Using /jobfs
Only available through the queueing system:
Request like -ljobfs=1GB
Access via the $PBS_JOBFS environment variable
All files are deleted at end of job. Copy what you need to /short or another global filesystem in the job script.
Cannot use mdss or netcp commands for files on /jobfs.
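The pattern described above (work on /jobfs, then copy results out before the job ends) might be sketched like this. The file names and the fallback to a temporary directory when $PBS_JOBFS is unset are assumptions added so the sketch also runs outside the queueing system; inside a job submitted with -ljobfs=1GB, $PBS_JOBFS would point at the job's own /jobfs area.

```sh
# Use the job's /jobfs area if PBS set it, else a temporary directory
# (the fallback is an assumption for trying this outside a batch job).
jobfs=${PBS_JOBFS:-$(mktemp -d)}

# Stand-in for a global filesystem directory such as /short/$PROJECT/$USER.
save_dir=$(mktemp -d)

# Stage a placeholder input file in and "run" a trivial computation on it.
echo "1 2 3" > "$save_dir/input.txt"
cp "$save_dir/input.txt" "$jobfs/"
tr ' ' '\n' < "$jobfs/input.txt" | wc -l | tr -d ' ' > "$jobfs/output.txt"

# Copy results out BEFORE the job ends: /jobfs is deleted afterwards.
cp "$jobfs/output.txt" "$save_dir/"
result=$(cat "$save_dir/output.txt")
echo "values counted: $result"
```

Because mdss and netcp cannot read from /jobfs, any archiving step would operate on the copy in the global filesystem, not on the /jobfs original.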
Exercise 5: Managing Files between /short, /jobfs and MDSS
Submit a batch job with a /jobfs request, where the job:
Copies an input file from /short to /jobfs
Runs a code to use the input file and generate some output
Saves the output data back to the /short area
Uses the netcp command to archive the data to the MDSS
Read the runjobfs script then submit it to the queueing system, monitor the job with qstat, and examine the job output files:
cd /short/$PROJECT/$USER/INTRO_COURSE
qsub runjobfs
watch qstat -u $USER
... (wait until job finishes, use Ctrl+C to quit) ...
cat runjobfs.e*
cat runjobfs.o*
Exercise 5: Managing Files between /short, /jobfs and MDSS (cont)
Check out the output file that this job created on /short and the copy on the MDSS
cd /short/$PROJECT/$USER
ls -ltr
less save_data.o*
mdss ls $USER
mdss rm -r $USER
What is a virtual laboratory?
A Virtual Laboratory is an interactive environment for creating and conducting simulated experiments via a computer interface. It provides a range of domain-specific digitally enabled data, programs and tools.
The National eResearch Collaboration Tools and Resources (NeCTAR) project provides infrastructure and project funding enabling Virtual Laboratories and other eResearch tools.
CSIRO - Virtual Geophysics Laboratory
Genomics Virtual Laboratory
University of Tasmania - Marine Virtual Laboratory
The All Sky Virtual Observatory
CWSLab Climate and Weather Science Laboratory
Humanities Networked Infrastructure (HuNI) unlocking and uniting Australia’s cultural data
The Characterisation Virtual Laboratory: research environments for exploring inner space
Endocrine Genomics Virtual Laboratory (EndoVL)
Biodiversity and Climate Change Virtual Laboratory
Above and beyond speech, language and music: a Virtual Laboratory for human communication science
The Industrial Ecology Virtual Laboratory
The Virtual Desktop Infrastructure (VDI)
The Virtual Desktop Infrastructure at NCI offers Australian researchers
access to spatial data analysis software
an extensive data library including climate, weather and satellite data
new research, analysis and visualisation tools
integration with NCI’s HPC infrastructure
a platform for sharing code and results across the community.
The VDI makes it easier for new scientists to get started at NCI, and to collaborate with others.
Why use the VDI?
NCI provides a Virtual Desktop Infrastructure (VDI) supporting virtual laboratories including the CWSLab and the AGDC. It provides useful systems and tools for Earth systems related researchers:
The VDI allows users to interact with parts of the NCI system as though they were on a local machine, with fast scratch space and dedicated CPUs
Interact with data - perform analyses and plot data in real time without relying on qsub -I jobs on Raijin
Familiar desktop environment for code development etc.
Not a supercomputer - think of it as your own desktop, but connected to NCI systems.
Nodes are no more powerful than a standard PC, but they are connected to /g/data and share standard Raijin modules for Python, R, Matlab, QGIS, etc.
VDI Desktop Specs
Each node has 8 vCPUs, 32GB RAM and 148GB local space.
Looks like (and is) a normal Linux UI - CentOS6
Some software packages are available through menus, others as command line modules
Session time limits apply, see help documentation for details
How to access the VDI
Need an NCI account and access to relevant project(s) and datasets
Note only a limited number of NCI projects can access the VDIs at this time
Install TurboVNC and Strudel software (already done for this course)
Follow instructions here http://training.nci.org.au/
For assistance or to request additional data mounts or software packages please contact [email protected]
Really, do work through the VDI training course or read the Google document, it has the answers to lots of questions!
Exercise 6: Let’s get acquainted!
Start Strudel
select site “NCI Virtual Desktop”
supply your username
click “login”.
select “Don’t remember me” when in a lab. On your own computer, create a passphrase at this step instead.
VNC session starts in a new window
Start a terminal by navigating
Applications → System Tools → Terminal
The VDIs share some modules with Raijin, commonly used in HPC Earth Systems work.
In the terminal, type module avail
Is the software you’re likely to want listed there?
Python (and iPython notebooks), Matlab, R, GDAL, QGIS, NCO, CDO, Ferret, UV-CDAT...
Caveats
/home IS NOT THE SAME
/home on Raijin is different to /home on the VDI. Within the VDI, /home is shared, so you can log out and back in and your data/code will still be there. However, the content of your Raijin /home will not be visible from the cloud and must be copied across as needed.
The same goes for /local and other temporary space...
Remember /home and /g/data are the only persistent spaces on the VDI; everything else is wiped on log-out.
Quotas apply to /home. Do not ignore any quota warnings which appear when you connect to a desktop session. Take action immediately - after the grace period expires you will be unable to start a new session.
Caveats (cont)
The VDI can access /g/data, but submitting PBS jobs to the Raijin queue requires an ssh qsub script. This means all dependencies MUST be in standard modules (on both systems) and /g/data.
If developing code to run on Raijin, the code needs to be copied back to Raijin’s /home and run from there, or run from a /g/data space via an ssh script. Note that not all libraries are the same, so further development and testing is likely to be required.
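One possible shape for the "ssh qsub script" mentioned above is a small wrapper that runs qsub on Raijin over ssh. The function name, the DRY_RUN switch, the RAIJIN_HOST override and the example job path are all hypothetical; the sketch also assumes that qsub is on the PATH of a non-interactive ssh session, which may not hold on every system.

```sh
# Submit a job script that lives on a shared filesystem (e.g. /g/data) to Raijin.
# With DRY_RUN=1 the command is printed instead of executed.
submit_to_raijin() {
    job_path=$1
    host=${RAIJIN_HOST:-raijin.nci.org.au}
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "ssh $host qsub $job_path"
    else
        ssh "$host" qsub "$job_path"
    fi
}

# Example dry run; the project directory and script name are placeholders.
DRY_RUN=1
preview=$(submit_to_raijin /g/data/ab1/myjob.pbs)
echo "$preview"
```

Keeping the job script and its inputs under /g/data means the same path resolves on both the VDI and Raijin, which is why the slide insists that dependencies live in standard modules and /g/data.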
Limited availability
Only 32 desktops are available on the current system. Multiple users may be allocated to share resources after this limit is reached. This will not be apparent to the user but performance may be affected. Maximum session times apply; please log out completely whenever the desktop is not in ongoing use.
Note
If the “submit a debug report” dialog appears when using Strudel, clicking the “Submit” button does not send the report to NCI and so we never see it. Contact us if you are having problems.
Exercise 7: Finding Data
Easiest through the command line
(You can also use the file system explorer GUI, and we’re working on web services like GeoNetwork and command line search tools)
Change to where the data lives:
cd /g/data2/rr5/satellite/obs/himawari8/FLDK/
Can find the data we want from here
cd 2015/09/23/0900
ls gives a list of all files for this time step.
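The example path above appears to follow a year/month/day/time layout under the FLDK directory. Assuming that layout holds for other time steps (an inference from this single example, not something the slides state), the directory for a given timestamp could be built like this; the variable names are illustrative.

```sh
base=/g/data2/rr5/satellite/obs/himawari8/FLDK
stamp=201509230900   # YYYYMMDDhhmm

# Slice the timestamp into its parts with cut.
year=$(echo "$stamp" | cut -c1-4)
month=$(echo "$stamp" | cut -c5-6)
day=$(echo "$stamp" | cut -c7-8)
hhmm=$(echo "$stamp" | cut -c9-12)

data_dir="$base/$year/$month/$day/$hhmm"
echo "$data_dir"
```

On the VDI, `ls "$data_dir"` would then list the files for that time step.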
Exercise 8: Interacting with Data
Let’s plot some data!
We’ll look at Himawari8 satellite data.
First quickly with ncview
module load netcdf
ncview 20150923090000-P1S-ABOM_OBS_B09-PRJ_GEOS141_2000-HIMAWARI8-AHI.nc
Using Python in the VDI
You’re likely to want to do analysis in something like Python.
May also want the python NetCDF4 library, other non-standard libraries, to update versions, etc.
The VDI has python module ’virtualenv’, which lets you define isolated python environments
Different environments can be defined for different projects or analysis workflows
This is particularly helpful when some libraries conflict with one another for a particular task, but not others
If something breaks or goes wrong within an environment, you can just delete it and start over
Exercise 9: Interacting with Data - Python virtualenv
Create a ’virtualenv’
On the VDI, load the ’python’ module followed by the virtualenv module.
$ module load python
$ module load virtualenv
Make a directory for the virtualenv (give it any name you’d like)
$ mkdir <directory>
Create the virtualenv inside the new directory. Note the name you enter here for <venv> will be the name that your terminal displays when you have activated this virtualenv.
$ cd <directory>
$ virtualenv <venv>
Exercise 9: Interacting with Data - Virtualenv (cont)
Activate/deactivate ’virtualenv’
To activate (to enter the virtualenv):
$ source <directory>/<venv>/bin/activate
To deactivate (leave the virtualenv, not delete it):
$ deactivate
Exercise 9: Interacting with Data - Virtualenv (cont)
At some point, you will probably need to update a library or install one that is not already included.
Updating a python library (e.g., NumPy) within the ’virtualenv’:
Using ’pip install’ along with ’--upgrade’ or, alternatively, ’--ignore-installed’
$ pip install numpy --upgrade
Installing a new python library (e.g., netCDF4) within the ’virtualenv’:
This package requires additional modules within the VDI to build.
$ module load netcdf/4.3.3.1
$ module load hdf5/1.8.14
$ module load szip
Use ’pip install’ along with defined paths to dependent libraries
$ export HDF5_DIR=/apps/hdf5/1.8.14/
$ export NETCDF4_DIR=/apps/netcdf/4.3.3.1/
$ pip install netCDF4
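One detail worth checking here: a variable assigned on its own line is visible to the current shell only, and a child process such as pip sees it only once it has been exported (or when it is set on the same command line as the command). A minimal sketch with a throwaway variable name, DEMO_DIR, chosen purely for illustration:

```sh
# Assigned but not exported: child processes do not inherit it.
unset DEMO_DIR
DEMO_DIR=/apps/hdf5/1.8.14/
without_export=$(sh -c 'echo "${DEMO_DIR:-unset}"')

# Exported: child processes (pip, compilers, ...) can read it.
export DEMO_DIR
with_export=$(sh -c 'echo "${DEMO_DIR:-unset}"')

echo "without export: $without_export"
echo "with export:    $with_export"
```

For the netCDF4 build this suggests either exporting HDF5_DIR and NETCDF4_DIR before running pip, or writing both assignments on the same line as pip install.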
Exercise 9: Interacting with Data - **Notes (cont)
If you are not using ’virtualenv’:
Remember to include ’--user’ and ’--build’ to install locally.
$ export HDF5_DIR=/apps/hdf5/1.8.14/
$ export NETCDF4_DIR=/apps/netcdf/4.3.3.1/
$ pip install --user --build $TMPDIR/pip_build netcdf4
Exercise 10: VDI - iPython Notebook
Need two more python libraries (inside your virtualenv):
Update ’ipython’
$ pip install --upgrade ipython
Install ’jupyter’ notebook
$ pip install jupyter
Now let’s look at an IPython Notebook example:
/home/900/kad900/NCI_Training
To start notebook:
$ jupyter notebook
NCI data collections
Datasets which are of national significance, or are otherwise useful reference data which should be securely stored and assigned a DOI, may be hosted at NCI via the RDSI project.
Data is transferred to NCI, and for RDSI projects is stored in /g/data
Data must be curated - a Data Management Plan is required, made in conjunction with Jingbo Wang at NCI and Irina Bastrakova at GA
https://datamgt.nci.org.au
Hosted data collections appear in our collection level GeoNetwork (a web based tool for searching data holdings at NCI)
http://geonetwork.nci.org.au/
Some data collections also have their own geonetwork for data, e.g. http://geonetworkrs0.nci.org.au/
For data publishing and geonetwork requests, [email protected]
Research data collections at NCI
Collection Name  Research Data Approved (TB)
Australian Data Archive (Social Sciences)  4
TERN eMAST Data Assimilation  110
Phenology Monitoring: Near Surface Remote Sensing  12
Satellite Soil Moisture Products  5
Global Navigation Satellite System (Geodesy) Data Archive  5
Australian Natural Hazards Data Archive - Tropical Cyclone, Earthquakes, Tsunami  27
Synthetic Aperture Radar Data  118
Key Water Assets  44
CSIRO Coastal Modelling Products  2
High Altitude Ice Crystals - High Ice Water Content  2
3D Geological Models of Australia  3
Australian Marine Video and Imagery Collection  7
Digitised Australian Aerial Survey Photography Collection  74
Models of Land and Water Dynamics from Space Data Collection  22
National CT-Lab Tomographic Data Collection  205
SkyMapper Southern Sky Survey  227
Plant Phenomics Digital Data Repository  10
Ocean Model for the Earth Simulator Re-analysis Datasets  27
Year Of Tropical Convection Re-analysis Datasets  90
CORDEX Australasia  57
Australian Bathymetry and Elevation Reference Dataset  113
Australian Geophysical Data Collection  175
Severe Weather Case Studies  50
Tropical Cyclone Scenarios  250
Atmospheric Forcing Products  5
ACCESS-CM 0.25 degrees Simulations  30
ACCESS Numerical Weather Prediction Models  3000
Research data collections at NCI (cont.)
National Resource Management data (post-processed CMIP5)  4
Remote and In-situ Observations Products for Earth System Modelling  366
Ocean and Marine Modelling and Forecast Products  220
Ocean Forecasting Australia Model  150
Atmospheric Re-analysis Products  2
ARC Centre of Excellence for Climate System Science Datasets Collection  166
Seasonal Climate Prediction Data Collection  595
Ecosystem Modelling and Scaling Infrastructure Facility (eMAST) Data  90
Australian Earth Observation Data (Landsat)  1474
Australian Moderate Resolution Satellite Products (NOAA/AVHRR, MODIS, VIIRS and AusCover)  428
Bioplatforms Australia Melanoma Collection  278
Coupled Model Intercomparison Project (CMIP5)  2322
Community Atmosphere Biosphere Land Exchange (CABLE) Model Collection  9
NCI data collections (cont)
NCI already hosts a number of data collections, many of which (though not all!) are of interest to GA.
Earth system sciences, climate and weather model data assets and products
Earth and marine observations and products
Geosciences
Terrestrial ecosystems
Water management and hydrology
Astronomy, social sciences and biosciences
NCI data collections (cont)
License (Thanks to Irina for providing the following information):
GA will be transitioning from Creative Commons Attribution 3.0 Australia (CC-BY 3.0) to Creative Commons Attribution 4.0 International (CC-BY 4.0) from 31 March 2015. The CC-BY 4.0 licence is compatible internationally and is similar to other international copyright licences.
You can go and view the information on the new licence and if you have any queries please contact [email protected] or phone Elizabeth Fredericks on ext 9367 or Jeanette Holland ext 9731
Backup and recovery
The backup strategy is set out in forms signed by the CI/data managers (a big thank-you to all RDS data managers in GA and Irina Bastrakova).
Backups are performed by the operational team at NCI at the requested frequency.
By default, the latest backup copy replaces the older copy. Multiple copies can be maintained on request, depending on the storage available.
NCI will inform the data managers about the backup status in due course.
Exercise 11a: Data discovery
Visit our data collections geonetwork at geonetwork.nci.org.au
Select Advanced Search tab, click Search
Shows 40+ data collections, including 4000+ records.
Try searching for data that may be of interest to you, can you find...?
MODIS satellite data
Elevation or bathymetry data
Hazards (eg earthquake) data
Enter a catalogue entry to see metadata
Exercise 11b: Interrogating metadata
Select a GeoNetwork entry to see collection metadata
Eg Water observations from space (WOfS)
Models of Land and Water Dynamics from Space Data Collection
Note fields include abstract, custodial information, keywords, access and use constraints (licence info), geographic extent, access information, and a heap of other data - all data provided for the DMP and reflected in the Geonetwork is ISO19115 or ISO19139 compliant.
Ways to access data
There are a number of ways data held at NCI may be accessed
Via a web service (if published)
Filesystem on Raijin
In a virtual desktop environment (e.g. AGDC, CWSLab)
The data listed in the geonetwork are all technically public; some may be published via a service like THREDDS or Geoserver: http://dap.nci.org.au
If the data are not published online or you need to access the data on Raijin for computing purposes, you will need to request membership of the appropriate project
Check in the collection geonetwork for the dataset of interest
Under ”transfer options” there now appears a field like:
Transfer options
OnLine resource http://dap.nci.org.au
OnLine resource The data is available at NCI raijin.nci.org.au:/g/data2/<proj>
where the 2nd link shows the project that needs to be joined in order to access the data (last 3 characters).
If in doubt talk to the custodian or email [email protected].
Join data projects as needed using Mancini: https://my.nci.org.au/mancini/project
Add <project code>/join to go directly to the membership request page.
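Putting the two steps above together: the project code is the path component that follows /g/data1/ or /g/data2/, and the membership page is the Mancini project URL with the project code and /join appended. A small sketch, using the rr5 path from Exercise 7 as the example and assuming the URL pattern is exactly as described on the slide:

```sh
# Example data location (from the Himawari-8 exercise).
data_path=/g/data2/rr5/satellite/obs/himawari8/FLDK

# Strip the leading /g/dataN/ and keep the next path component.
rest=${data_path#/g/data*/}
project=${rest%%/*}

join_url="https://my.nci.org.au/mancini/project/$project/join"
echo "$join_url"
```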
Exercise 12: Data collections
On Raijin, data collections (excluding long term storage data on MDSS) are found in /g/data
ssh raijin.nci.org.au -l abc123
cd /g/data1/rr4
cd /g/data1/rr9
cd /g/data2/u39
cd /g/data2/rs0
ls
Explore data on file-system
Note
Not all datasets are globally readable, in general you will need to be in the appropriate group to do this.
Published data spaces should be well maintained, generally not appropriate areas to create private working directories (be careful with permissions, too).