Google Genomics Documentation, Release v1beta2
Cassie
March 16, 2015
Contents
1 Discover Public Data
    1.1 Google Genomics Public Data
    1.2 Annotate Variants with Tute Genomics
    1.3 PGP data in Google Cloud Storage
2 Load Data into Google Genomics
    2.1 Loading Genomic Variants
    2.2 Troubleshooting Job failures
3 Browse Genomic Data
4 Quality Control
5 Annotate Variants
    5.1 Annotate Variants with BioConductor
    5.2 Annotate Variants with Tute Genomics
    5.3 Annotate Variants with Google Genomics
6 Analyze Variants
7 Compute Principal Coordinate Analysis
8 Compute Identity By State
9 Build your own Google Genomics API Client
    9.1 Important constants and links
    9.2 Common API workflows
    9.3 API authorization requirements
    9.4 The java client
    9.5 The python client
    9.6 The R client
    9.7 Migrating from v1beta to v1beta2
10 The mailing list
Here you will find task-oriented documentation. What do you want to do today?
CHAPTER 1
Discover Public Data
1.1 Google Genomics Public Data
See https://cloud.google.com/genomics/public-data
1.2 Annotate Variants with Tute Genomics
Tute Genomics has made available to the community annotations for all hg19 SNPs as a BigQuery table.
See Tute’s documentation for more details about the annotation databases included and sample queries on public data.
To use these annotations with your own data:
1. Load Data into Google Genomics
2. Use the BigQuery JOIN command to join the Tute table with your variants and materialize the result to a new table.
TODO: actual example with bq tool
1.3 PGP data in Google Cloud Storage
Google is hosting a copy of the PGP Harvard data in Google Cloud Storage. All of the data is in this bucket:
gs://pgp-harvard-data-public
If you wish to browse the data you will need to install gsutil.
Once installed, you can run the ls command on the pgp bucket:
$ gsutil ls gs://pgp-harvard-data-public
gs://pgp-harvard-data-public/cgi_disk_20130601_00C68/
gs://pgp-harvard-data-public/hu011C57/
gs://pgp-harvard-data-public/hu016B28/
....lots more....
The sub folders are PGP IDs, so if we ls a specific one:
$ gsutil ls gs://pgp-harvard-data-public/hu011C57/
gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/
If you keep descending through the structure, you can end up here:
$ gsutil ls gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/
gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/dbSNPAnnotated-GS000015172-ASM.tsv.bz2
gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/gene-GS000015172-ASM.tsv.bz2
... and more ...
Your genome data is located at: gs://pgp-harvard-data-public/{YOUR_PGP_ID}
If you do not see the data you are looking for, you should contact PGP directly through your web profile.
CHAPTER 2
Load Data into Google Genomics
2.1 Loading Genomic Variants
Contents
• Loading Genomic Variants
  – Prerequisites
  – Step 1: Upload variants to Google Cloud Storage
    * Transfer the data.
    * Check the data.
  – Step 2. Import variants to Google Genomics
    * Create a Google Genomics dataset to hold your data.
    * Import your VCFs from Google Cloud Storage to your Google Genomics Dataset.
    * Check the import job for completion.
  – Step 3. Export variants to Google BigQuery
    * Create a BigQuery dataset in the web UI to hold the data.
    * Export variants to BigQuery.
    * Check the export job for completion.
2.1.1 Prerequisites
1. Sign up for Google Genomics by doing all the steps in Google Genomics: Try it now.
2. Sign up for Google Cloud Storage by doing all the steps in Google Cloud Storage: Try it now.
2.1.2 Step 1: Upload variants to Google Cloud Storage
For the purposes of this example, let’s assume you have a local copy of the Illumina Platinum Genomes variants that you would like to load.
Note: Google Genomics cannot load compressed VCFs, so for now be sure to uncompress them before uploading them to Cloud Storage. We expect to support compressed VCFs soon.
Transfer the data.
To transfer a glob of files:
gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp *.vcf \
  gs://YOUR_BUCKET/platinum-genomes/vcf/
Or to transfer a directory tree of files:
gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp -R YOUR_DIRECTORY_OF_VCFS \
  gs://YOUR_BUCKET/platinum-genomes/
If any failures occur due to temporary network issues, re-run with the no-clobber flag to transmit just the missing files:
gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp -n -R YOUR_DIRECTORY_OF_VCFS \
  gs://YOUR_BUCKET/platinum-genomes/
For more detail, see the gsutil cp command.
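The three invocations above differ only in the -R and -n flags. As a minimal sketch, assuming you want to assemble the same command from a script, the following Python builds the argument list. The function name gsutil_cp_args is hypothetical, and the sketch only builds the command; running it (for example with subprocess.run) is left to the caller.

```python
def gsutil_cp_args(sources, destination, recursive=False, no_clobber=False,
                   composite_threshold="150M"):
    """Build the argv list for the gsutil cp invocations shown above.

    gsutil_cp_args is a hypothetical helper, not part of gsutil.
    """
    args = [
        "gsutil", "-m",
        "-o", "GSUtil:parallel_composite_upload_threshold=" + composite_threshold,
        "cp",
    ]
    if no_clobber:
        args.append("-n")   # skip files that already exist at the destination
    if recursive:
        args.append("-R")   # copy a whole directory tree
    args.extend(sources)
    args.append(destination)
    return args


# Re-run of a directory upload that transmits only the missing files.
retry_args = gsutil_cp_args(["YOUR_DIRECTORY_OF_VCFS"],
                            "gs://YOUR_BUCKET/platinum-genomes/",
                            recursive=True, no_clobber=True)
print(" ".join(retry_args))
```

Note that an argument list passed without a shell does not expand *.vcf; in that case the caller would expand the pattern first (for example with glob.glob) and pass the resulting file names as sources.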
Check the data.
When you are done, the bucket will have contents similar to this but with your own bucket’s name:
$ gsutil ls gs://genomics-public-data/platinum-genomes/vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12877_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12878_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12879_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12880_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12881_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12882_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12883_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12884_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12885_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12886_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12887_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12888_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12889_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12890_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12891_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12892_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12893_S1.genome.vcf
For more detail, see the gsutil ls command.
2.1.3 Step 2. Import variants to Google Genomics
Create a Google Genomics dataset to hold your data.
• YOUR_DATASET_NAME: This can be any name you like such as “My Copy of Platinum Genomes”.
• YOUR_GOOGLE_CLOUD_PLATFORM_PROJECT_NUMBER: You can find your Google Cloud Platform project number towards the top of the Google Developers Console page.
$ java -jar genomics-tools-client-java-v1beta2.jar createdataset --name YOUR_DATASET_NAME \
  --project_number YOUR_GOOGLE_CLOUD_PLATFORM_PROJECT_NUMBER
{
  "id" : "THE_NEW_DATASET_ID",
  "isPublic" : false,
  "name" : "YOUR_DATASET_NAME",
  "projectNumber" : "YOUR_GOOGLE_CLOUD_PLATFORM_PROJECT_NUMBER"
}
For more detail, see managing datasets.
Import your VCFs from Google Cloud Storage to your Google Genomics Dataset.
• THE_NEW_DATASET_ID: This was returned in the output of the prior command.
$ java -jar genomics-tools-client-java-v1beta2.jar importvariants \
  --variant_set_id THE_NEW_DATASET_ID \
  --vcf_file gs://YOUR_BUCKET/platinum-genomes/vcf/*.vcf
Import job: {
  "id" : "THE_NEW_IMPORT_JOB_ID",
  "status" : "pending"
}
For more detail, see managing variants.
Check the import job for completion.
• THE_NEW_IMPORT_JOB_ID: This was returned in the output of the prior command.
$ java -jar genomics-tools-client-java-v1beta2.jar getjob --poll --job_id THE_NEW_IMPORT_JOB_ID
Waiting for job: job_id...
{
  "status" : "success",
  "importedIds" : ["call_set_id", "call_set_id"],
  "warnings" : []
}
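If you call the API from your own code rather than through getjob --poll, the same wait can be written as a small loop. This is a minimal sketch under stated assumptions: get_job is a hypothetical callable standing in for whatever fetches the job, and treating "running" as an in-progress status is an assumption (only "pending" and "success" appear in the output above).

```python
import time


def wait_for_job(get_job, job_id, poll_seconds=10.0, max_polls=360,
                 in_progress=("pending", "running")):
    """Poll until the job leaves an in-progress status, then return it.

    get_job is a hypothetical callable: job_id -> dict with a "status" key.
    The in_progress tuple is an assumption; adjust it to the statuses the
    API actually reports.
    """
    for _ in range(max_polls):
        job = get_job(job_id)
        if job["status"] not in in_progress:
            return job
        time.sleep(poll_seconds)
    raise TimeoutError("job %s still in progress after %d polls"
                       % (job_id, max_polls))


# Usage with a stand-in for the real API call.
_responses = iter([{"status": "pending"},
                   {"status": "pending"},
                   {"status": "success", "warnings": []}])
final = wait_for_job(lambda job_id: next(_responses),
                     "THE_NEW_IMPORT_JOB_ID", poll_seconds=0)
print(final["status"])
```

The max_polls bound keeps a stuck job from blocking the caller forever; the returned job still needs its status checked, since the loop stops on any status that is not in progress.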
2.1.4 Step 3. Export variants to Google BigQuery
Create a BigQuery dataset in the web UI to hold the data.
1. Open the BigQuery web UI.
2. Click the down arrow icon next to your project name in the navigation, then click Create new dataset.
3. Input a dataset ID.
Export variants to BigQuery.
• THE_NEW_DATASET_ID: This was returned in the output of the createdataset command.
• YOUR_BIGQUERY_DATASET: This is the dataset ID you created in the prior step.
• YOUR_BIGQUERY_TABLE: This can be any ID you like such as “platinum_genomes_variants”.
$ java -jar genomics-tools-client-java-v1beta2.jar exportvariants \
  --project_id YOUR_GOOGLE_CLOUD_PLATFORM_PROJECT_NUMBER \
  --variant_set_id THE_NEW_DATASET_ID \
  --bigquery_dataset YOUR_BIGQUERY_DATASET \
  --bigquery_table YOUR_BIGQUERY_TABLE
Export job: {
  "id" : "THE_NEW_EXPORT_JOB_ID",
  "status" : "pending"
}
For more detail, see variant exports.
Check the export job for completion.
• THE_NEW_EXPORT_JOB_ID: This was returned in the output of the prior command.
$ java -jar genomics-tools-client-java-v1beta2.jar getjob --poll --job_id THE_NEW_EXPORT_JOB_ID
Waiting for job: job_id...
{
  "status" : "success",
  "importedIds" : ["call_set_id", "call_set_id"],
  "warnings" : []
}
Now you are ready to start querying your variants!
2.2 Troubleshooting Job failures
If you were redirected to this page from a Job failure, that means your Job failed for an unknown reason.
Either the failure was transient (which occasionally happens) and the Job should be retried, or there is a bug in our implementation which is causing an unexpected exception.
Rest assured that we keep track of all failed Jobs, and will track down the bug if there is one. In a perfect world, you would never need to see this page.
Because you are here though, please try the following:
• Re-launch your Job once more.
• If the Job fails a second time, please email [email protected] with both of your JobIDs.
Sorry for the failure - we’ll do better next time.
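For scripted pipelines, the advice above (re-launch once, then report both Job IDs) can be sketched as follows. launch_job is a hypothetical callable that starts a job, waits for it to finish, and returns its ID and final status; it is not part of any client library.

```python
def launch_with_one_retry(launch_job):
    """Run a job, re-launching it once on failure.

    launch_job is a hypothetical callable returning (job_id, status).
    Returns (succeeded, job_ids_tried) so that both IDs are at hand
    if the second attempt also fails.
    """
    job_ids = []
    for _ in range(2):           # the original attempt plus one re-launch
        job_id, status = launch_job()
        job_ids.append(job_id)
        if status == "success":
            return True, job_ids
    return False, job_ids


# Usage with canned results: the first attempt fails, the retry succeeds.
attempts = iter([("job-1", "failure"), ("job-2", "success")])
ok, tried = launch_with_one_retry(lambda: next(attempts))
print(ok, tried)
```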
CHAPTER 3
Browse Genomic Data
TODO: GABROWSE and IGV
To browse your own data . . .
CHAPTER 4
Quality Control
TODO: qc codelab
To run this on your own data . . .
CHAPTER 5
Annotate Variants
5.1 Annotate Variants with BioConductor
TODO: point to annotation vignette on BioConductor for an example on public data
to annotate your own variants . . .
5.2 Annotate Variants with Tute Genomics
Tute Genomics has made available to the community annotations for all hg19 SNPs as a BigQuery table.
See Tute’s documentation for more details about the annotation databases included and sample queries on public data.
To use these annotations with your own data:
1. Load Data into Google Genomics
2. Use the BigQuery JOIN command to join the Tute table with your variants and materialize the result to a new table.
TODO: actual example with bq tool
5.3 Annotate Variants with Google Genomics
TODO: command line for new AnnotateVariants Dataflow job
CHAPTER 6
Analyze Variants
TODO: All Modalities Codelab
To run this on your own data . . .
CHAPTER 7
Compute Principal Coordinate Analysis
TODO: spark and dataflow instructions
CHAPTER 8
Compute Identity By State
TODO: dataflow instructions
CHAPTER 9
Build your own Google Genomics API Client
The tools for working with the Google Genomics API are all open source and available on GitHub.
This documentation covers how to get started with the available tools as well as how you might build your own code which uses the API.
All improvements to these docs are welcome! You can file an issue or submit a pull request.
9.1 Important constants and links
Google’s base API url is: https://www.googleapis.com/genomics/v1beta2
More information on the API can be found at: http://cloud.google.com/genomics and http://ga4gh.org
To test Google’s compliance with the GA4GH API, you can use the compliance tests: http://ga4gh.org/#/compliance
To get a list of public datasets that can be used with Google’s API calls, you can use the APIs explorer or Google Genomics Public Data.
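As a minimal sketch of how request URLs can be assembled from the base URL, the helper below joins an endpoint onto it and optionally adds the key and fields query parameters. The name api_url is hypothetical, and YOUR_API_KEY is a placeholder.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.googleapis.com/genomics/v1beta2"


def api_url(endpoint, api_key=None, fields=None):
    """Join an endpoint onto the base URL, adding optional query parameters.

    api_url is a hypothetical helper. key identifies the calling project;
    fields asks for a partial response.
    """
    params = {}
    if api_key is not None:
        params["key"] = api_key
    if fields is not None:
        params["fields"] = fields
    url = BASE_URL + "/" + endpoint.lstrip("/")
    if params:
        url += "?" + urlencode(params)   # percent-encodes commas etc.
    return url


print(api_url("datasets", api_key="YOUR_API_KEY"))
```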
9.2 Common API workflows
There are many genomics-related APIs documented at cloud.google.com/genomics/v1beta2/reference.
Of the available calls, there are some very common patterns that can be useful when developing your own code.
The following sections describe these workflows using plain URLs and simplified request bodies. Each step should map one-to-one onto a method in each of the auto-generated client libraries.
9.2.1 Browsing read data
• GET /datasets
List all available datasets that the current user has access to (or all public datasets when not using OAuth). Choose one datasetId from the result.
Note: Currently, this call only returns public datasets! It is not able to return any private datasets. For now, you may need to ask a user for a datasetId or readGroupSetId directly.
• POST /readgroupsets/search {datasetIds: [<datasetId>]}
Search for read group sets in a particular dataset. Choose one readGroupSetId from the result.
Note: This is a good place to use a partial request to only ask for the id and name fields on a read group set. Then you can follow up with a GET /readgroupsets/<readGroupSetId> call to get the complete read group set data.
• GET /readgroupsets/<readGroupSetId>/coveragebuckets
Get coverage information for a particular read group set. This will tell you where the read data is located, and which referenceNames should be used in the next step.
• POST /reads/search {readGroupSetIds: [<readGroupSetId>]}
Get reads for a particular read group set.
Note: The call also requires referenceName, start and end. The referenceName can be chosen from the coverage buckets by the user, along with the start and end coordinates they wish to view. The API uses 0-based coordinates.
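The four steps above can be written down as a list of requests. This is a sketch only: the function name browse_reads_requests is hypothetical, the IDs are assumed to have already been picked from earlier responses, and sending the requests is left to whichever HTTP or client library you use.

```python
def browse_reads_requests(dataset_id, read_group_set_id,
                          reference_name, start, end):
    """Return (method, path, body) triples for the read-browsing workflow.

    Hypothetical helper: paths are relative to the API base URL, and the
    IDs are assumed to have been chosen from the earlier responses.
    """
    return [
        ("GET", "/datasets", None),
        ("POST", "/readgroupsets/search", {"datasetIds": [dataset_id]}),
        ("GET", "/readgroupsets/%s/coveragebuckets" % read_group_set_id, None),
        ("POST", "/reads/search", {
            "readGroupSetIds": [read_group_set_id],
            "referenceName": reference_name,
            "start": start,   # 0-based coordinate
            "end": end,
        }),
    ]


for method, path, body in browse_reads_requests(
        "DATASET_ID", "READ_GROUP_SET_ID", "14", 25419886, 25419986):
    print(method, path, body)
```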
9.2.2 Map reducing over read data within a readset
• GET /readgroupsets/<readGroupSetId>/coveragebuckets
First get coverage information for the read group set you are working with.
Iterate over the coverageBuckets array. For each bucket, there is a field range.end. Using this field, and the number of shards you wish to have, you can calculate sharding bounds.
Let’s say there are 23 references, and you want 115 shards. The easiest math would have us creating 5 shards per reference, each with a start of i * range.end/5 (for i from 0 to 4) and an end of min(range.end, start + range.end/5)
• POST /reads/search {readGroupSetIds: [x], referenceName: shard.refName, start: shard.start, end: shard.end}
Once you have your shard bounds, each shard will then do a reads search to get data. (Don’t forget to use a partial request.)
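A minimal sketch of the shard arithmetic follows. It is a slight variant of the formula above: each boundary is computed as i * upper_bound // n, so that with integer division the last shard still reaches the upper bound instead of dropping the remainder. The function name shard_bounds and the reference names in the usage example are hypothetical.

```python
def shard_bounds(reference_name, upper_bound, shards_per_reference):
    """Split [0, upper_bound) into contiguous shards with integer bounds.

    Hypothetical helper. Computing both bounds from i keeps the shards
    gap-free and makes the last one end exactly at upper_bound.
    """
    shards = []
    for i in range(shards_per_reference):
        start = i * upper_bound // shards_per_reference
        end = (i + 1) * upper_bound // shards_per_reference
        if end > start:          # skip empty shards on very short references
            shards.append((reference_name, start, end))
    return shards


# 23 references with 5 shards each gives the 115 shards from the text.
# The reference names and the length of 1000 are made up for illustration.
all_shards = []
for name in ["ref%d" % n for n in range(1, 24)]:
    all_shards.extend(shard_bounds(name, 1000, 5))
print(len(all_shards))
print(shard_bounds("ref1", 12, 5))
```

Each resulting tuple supplies the referenceName, start and end for one reads search request.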
9.2.3 Map reducing over variant data
• GET /variantsets/<datasetId>
First get a summary of the variants you are working with. This includes the references that have data, as well astheir upper bounds.
Iterate over the referenceBounds array. For each reference, there is a field upperBound. Using this field,and the number of shards you wish to have, you can calculate sharding bounds.
Let’s say there are 23 references, and you want 115 shards. The easiest math would have us creating 5 shards per reference, each with a start of i * referenceBound.upperBound/5 (for i from 0 to 4) and an end of min(referenceBound.upperBound, start + referenceBound.upperBound/5)
• POST /variants/search {variantSetIds: [x], referenceName: shard.refName, start: shard.start, end: shard.end}
Once you have your shard bounds, each shard will then do a variants search to get data. (Don’t forget to use a partial request.)
If you only want to look at certain call sets, you can include the callSetIds: ["id1", "id2"] field on the search request. Only call information for those call sets will be returned. Variants without any of the requested call sets won’t be included at all.
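Putting the variant steps together, the sketch below turns a variant set summary into one search body per shard, optionally restricted to certain call sets. The function name is hypothetical, and the shape assumed for each referenceBounds entry (a referenceName plus an upperBound that may arrive as a string) is an assumption to check against the API reference.

```python
def variant_shard_requests(variant_set, shards_per_reference, call_set_ids=None):
    """Build one /variants/search body per shard.

    Hypothetical helper. variant_set is assumed to look like
    {"id": ..., "referenceBounds": [{"referenceName": ..., "upperBound": ...}]}.
    """
    bodies = []
    for bound in variant_set["referenceBounds"]:
        upper = int(bound["upperBound"])     # accept "123" as well as 123
        for i in range(shards_per_reference):
            start = i * upper // shards_per_reference
            end = (i + 1) * upper // shards_per_reference
            if end <= start:
                continue                     # nothing to search in this shard
            body = {
                "variantSetIds": [variant_set["id"]],
                "referenceName": bound["referenceName"],
                "start": start,
                "end": end,
            }
            if call_set_ids:
                body["callSetIds"] = list(call_set_ids)
            bodies.append(body)
    return bodies


summary = {"id": "VARIANT_SET_ID",
           "referenceBounds": [{"referenceName": "1", "upperBound": "10"},
                               {"referenceName": "2", "upperBound": 4}]}
for body in variant_shard_requests(summary, 2, call_set_ids=["id1"]):
    print(body)
```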
9.3 API authorization requirements
Calls to the Google Genomics API can be made with OAuth or with an API key.
• To access private data or to make any write calls, an API request needs to be authenticated with OAuth.
• Read-only calls to public data only require an API key to identify the calling project. (OAuth will also work)
Some APIs are still in the testing phase. The following lays out where each API call stands and also indicates whether a call supports requests without OAuth.
9.3.1 Available APIs
API method                                       OAuth required
Get, List and Search methods (except on Jobs)    False
Create, Delete, Patch and Update methods         True
Import and Export methods                        True
All Job methods                                  True
9.3.2 APIs in testing
API method                                       OAuth required
genomics.experimental.*                          True
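The two tables can be encoded as a small lookup, which is handy when a client has to decide whether an API key is enough. This is a sketch only: the function name and the way resources and verbs are spelled are hypothetical, and it covers public data only, since access to private data needs OAuth regardless of the method.

```python
READ_VERBS = {"get", "list", "search"}


def oauth_required(resource, verb):
    """Return True when the tables above mark the call as needing OAuth.

    Hypothetical helper; resource and verb spellings are illustrative.
    Applies to public data only: private data always needs OAuth.
    """
    resource = resource.lower()
    if resource.startswith("experimental"):
        return True                      # APIs in testing
    if resource == "jobs":
        return True                      # all Job methods
    # Create, Delete, Patch, Update, Import and Export all need OAuth.
    return verb.lower() not in READ_VERBS


print(oauth_required("readgroupsets", "search"))
print(oauth_required("jobs", "get"))
```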
9.4 The java client
The api-client-java project provides a command line interface for API queries in Java.
9.4.1 Command line options for api-client-java
The command line is now the best place for help. Executing without any parameters:
java -jar target/genomics-tools-client-java-v1beta2.jar
will print out all the available commands. To get help on a specific command, append the command followed by help. For example, to get help on the searchreads command:
java -jar target/genomics-tools-client-java-v1beta2.jar searchreads help
All the request types map to Genomics API calls. You can read the API documentation for more information about the various objects, and what each method is doing.
The custom command
If you wish to call an API method that doesn’t have a pre-defined request type, or if you wish to pass in additional JSON fields that aren’t supported with the existing options, then you can issue a fully custom request with the following parameters:
--custom_endpoint Required. The API endpoint to query. This is relative to the base URL and shouldn’t start with a /. Example: readgroupsets/search
--custom_method The HTTP method to query with. Defaults to POST. Other valid examples are GET, PATCH, DELETE.
--custom_body If the API endpoint you are hitting requires an HTTP body, use this parameter to pass in a JSON object as a string. It should look something like {"key":"value"}
Putting these pieces together, if you wanted to do a read group sets search with name filtering (which isn’t supported through the other options) you could do so with this query:
java -jar target/genomics-tools-client-java-v1beta2.jar custom \
  --custom_endpoint "readgroupsets/search" \
  --custom_body '{"datasetIds": ["10473108253681171589"], "name": "NA1287"}' \
  --fields "readGroupSets(id,name)" --pretty_print
If instead you wanted to make a GET call, your custom request could look like this:
java -jar target/genomics-tools-client-java-v1beta2.jar custom --custom_endpoint "readgroupsets/CMvnhpKTFhD04eLE-q2yxnU" --custom_method "GET" --fields "id,name" --pretty_print
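When the custom command is driven from a script, hand-quoting the JSON body is the error-prone part. A minimal sketch, assuming a hypothetical helper named custom_command_args, serializes the body with json.dumps and returns an argument list that can be run without a shell:

```python
import json

JAR = "target/genomics-tools-client-java-v1beta2.jar"


def custom_command_args(endpoint, method=None, body=None, fields=None,
                        pretty_print=True):
    """Build the argv list for the custom command described above.

    Hypothetical helper; it only uses the flags documented in this section.
    """
    args = ["java", "-jar", JAR, "custom",
            "--custom_endpoint", endpoint.lstrip("/")]  # must not start with /
    if method is not None:
        args += ["--custom_method", method]             # the client defaults to POST
    if body is not None:
        args += ["--custom_body", json.dumps(body)]     # JSON object as a string
    if fields is not None:
        args += ["--fields", fields]
    if pretty_print:
        args.append("--pretty_print")
    return args


search_args = custom_command_args(
    "readgroupsets/search",
    body={"datasetIds": ["10473108253681171589"], "name": "NA1287"},
    fields="readGroupSets(id,name)")
print(search_args)
```

Because the result is a list, it can be handed to subprocess.run directly, which avoids shell quoting of the JSON altogether.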
9.4.2 Clearing stored credentials
The first time the Java client makes an API request, it authenticates the caller with OAuth and stores the resulting credentials for all future API calls.
If you wish to remove these stored credentials (to authenticate with a different client secrets file, or as a different user, etc.), you will need to remove the stored credential file with this command:
rm ~/.store/genomics_java_client/StoredCredential
The next request made to the Java client will then require a browser to open the OAuth pages.
The java client uses Google’s java client library to get data from the Google Genomics APIs. See the java docs for more details.
9.5 The python client
The api-client-python project provides a simple genome browser that pulls data from the Genomics API.
9.5.1 Setting up the python client on Windows
• To set up Python 2.7 for Windows, first download it from https://www.python.org/downloads/
• After installing Python, add to your PATH the location of the Python directory and the Scripts directory within it.
For example, if Python is installed in C:\Python27, proceed by right-clicking on My Computer on the Start Menu and select “Properties”. Select “Advanced system settings” and then click on the “Environment Variables” button. In the window that comes up, append the following to the system variable PATH (if you chose a different installation location, change this path accordingly):
;C:\Python27\;C:\Python27\Scripts\
• Get the api-client-python code onto your machine by cloning the repository:
git clone https://github.com/googlegenomics/api-client-python.git
Running the client with App Engine
Only follow the instructions in this section if you want to run the python client with App Engine.
• Download the “Google App Engine SDK for Python” for Windows from https://developers.google.com/appengine/downloads and install it.
• From within the api-client-python directory that you cloned, run the dev_appserver.py script. If we assume the installation directory for your app engine SDK was C:\Google\google_appengine, then you would run the following command:
python C:\Google\google_appengine\dev_appserver.py .
If you get an error like google.appengine.tools.devappserver2.wsgi_server.BindError: Unable to bind localhost:8000, try specifying a specific port with this command:
python C:\Google\google_appengine\dev_appserver.py --admin_port=12000 .
• To view your running server, open your browser to localhost:8080.
Running the client without App Engine
Only follow the instructions in this section if you do not want to use App Engine. See the section above for App Engine instructions.
• First you will need to download Pip from https://raw.github.com/pypa/pip/master/contrib/get-pip.py
• To install Pip, open up a cmd.exe window by selecting Start->Run->cmd and type the following (replace directory_of_get-pip.py with the location of where get-pip.py resides):
cd directory_of_get-pip.py
python get-pip.py
• Afterwards in the same command window, type the following command to update your Python environment with the required modules:
pip install WebOb Paste webapp2 jinja2
• You should then be able to run the localserver with the following commands:
cd api-client-python
python localserver.py
Enabling the Google API provider
If you want to pull in data from the Google Genomics API, you will need to set API_KEY in main.py to a valid Google API key.
• First apply for access to the Genomics API by following the instructions at https://developers.google.com/genomics/
• Then create a project in the Google Developers Console or select an existing one.
• On the APIs & auth tab, select APIs and turn the Genomics API to ON
• On the Credentials tab, click create new key under the Public API access section.
• Select Server key in the dialog that pops up, and then click Create. (You don’t need to enter anything in the text box)
• Copy the API key field value that now appears in the Public API access section into the top of the main.py file inside of your api-client-python directory. It should look something like this:
API_KEY = "abcdef12345abcdef"
Note: You can also reuse an existing API key if you have one. Just make sure the Genomics API is turned on.
• Run your server as before, and view your server at localhost:8080.
• Google should now show up as an option in the Readset choosing dialog.
9.5.2 GABrowse URL format
The genome browser code supports direct linking to specific backends, readsets, and genomic positions.
These parameters are set using the URL hash. The format is simple: three supported key-value pairs, with pairs separated by & and each key separated from its value by =:
• backend
The backend to use for API calls. example: GOOGLE or NCBI
• readsetId
The ID of the readset that should be loaded. See Important constants and links for more information.
• location
The genomic position to display at. Takes the form of <chromosome>:<base pair position>. Example: 14:25419886. This can also be an RS ID or a string that will be searched on snpedia.
As you navigate in the browser (either locally or at http://gabrowse.appspot.com), the hash will automatically populateto include these parameters. But you can also manually create a direct link without having to go through the UI.
Putting all the pieces together, here is what a valid url looks like:
http://gabrowse.appspot.com/#backend=GOOGLE&readsetId=CPHG3MzoCRDY5IrcqZq8hMIB&location=14:25419886
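A direct link can also be assembled programmatically. The sketch below builds and parses the hash exactly as described above; the function names are hypothetical, and no escaping is applied, on the assumption that IDs and locations contain no & or = characters.

```python
def gabrowse_url(backend, readset_id, location,
                 base="http://gabrowse.appspot.com/"):
    """Build a direct link from the three supported hash parameters.

    Hypothetical helper; values are inserted as-is, without escaping.
    """
    pairs = [("backend", backend),
             ("readsetId", readset_id),
             ("location", location)]
    return base + "#" + "&".join("%s=%s" % pair for pair in pairs)


def parse_gabrowse_hash(url):
    """Return the key-value pairs found after the # in a browser URL."""
    fragment = url.split("#", 1)[1] if "#" in url else ""
    return dict(pair.split("=", 1)
                for pair in fragment.split("&") if "=" in pair)


link = gabrowse_url("GOOGLE", "CPHG3MzoCRDY5IrcqZq8hMIB", "14:25419886")
print(link)
print(parse_gabrowse_hash(link))
```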
The python client does not currently use Google’s python client library. If you want to use the client library, the method documentation for genomics can be very useful.
9.6 The R client
The api-client-r project provides an R package with methods to search for Reads and Variants stored in the Google Genomics API. Additionally it provides converters to BioConductor datatypes such as GAlignments, GRanges, and VRanges.
9.7 Migrating from v1beta to v1beta2
The v1beta2 version of the Google Genomics API is now available and all client code should migrate to it by the end of 2014.
If you are using the genomics-tools-client-java jar from the command line, upgrading is as easy as downloading a new jar (or running git pull; mvn package from your git client).
For all other integrations: v1beta2 matches the GA4GH API v0.5.1, which means that there are quite a few method and field renames to deal with. This page summarizes all the changes necessary to move to the latest API.
9.7.1 new version notes
General
• maxResults is now pageSize, and is an integer
Datasets and Jobs
• All usages of projectId should be replaced by projectNumber
• job.description is now job.detailedStatus
Variants
• The variant objects have not changed.
• The import and export methods have slightly different URLs. /variants/import is now /variantsets/<variantSetId>/importVariants and /variants/export is now /variantsets/<variantSetId>/export. These affect the generated client libraries slightly.
Readsets/Readgroupsets
• readset has now been renamed to readgroupset. This is mostly a straightforward replacement of the term.
• readset.fileData[0].fileUri is now readgroupset.filename
• readset.fileData[0].refSequences is replaced by readgroupset.referenceSetId
• The rest of the readset.fileData field has been replaced by information within the readgroupset.readgroups array.
Reads
• All read positions are now 0-based longs, just like the variant positions.
• originalBases is now alignedSequence
• alignedBases (originalBases with the cigar applied) has been removed
• baseQuality is now an int array called alignedQuality. You no longer need to subtract 33 or deal with ASCII conversion.
• name is now fragmentName
• templateLength is now fragmentLength
• tags is now info
• position is now alignment.position.position. The alignment object now contains all alignment-related information, including the cigar, reference name, and whether the read is on the reverse strand.
• The old cigar string is now the structured field alignment.cigar. To get an old-style cigar string, iterate over each element in the array, and concatenate the operationLength with a mapped version of operation. Pseudocode:
cigar_enums = {"ALIGNMENT_MATCH": "M", "CLIP_HARD": "H", "CLIP_SOFT": "S", "DELETE": "D",
               "INSERT": "I", "PAD": "P", "SEQUENCE_MATCH": "=", "SEQUENCE_MISMATCH": "X", "SKIP": "N"}
# operationLength is numeric, so convert it to text before concatenating
cigar_string = "".join(str(c.operationLength) + cigar_enums[c.operation] for c in read.alignment.cigar)
• The old flags integer is now represented by many different first class fields. To reconstruct a flags value,you need code similar to this pseudocode:
flags = 0
flags += read.numberReads == 2 ? 1 : 0 #read_paired
flags += read.properPlacement ? 2 : 0 #read_proper_pair
flags += read.alignment.position.position == null ? 4 : 0 #read_unmapped
flags += read.nextMatePosition.position == null ? 8 : 0 #mate_unmapped
flags += read.alignment.position.reverseStrand ? 16 : 0 #read_reverse_strand
flags += read.nextMatePosition.reverseStrand ? 32 : 0 #mate_reverse_strand
flags += read.readNumber == 0 ? 64 : 0 #first_in_pair
flags += read.readNumber == 1 ? 128 : 0 #second_in_pair
flags += read.secondaryAlignment ? 256 : 0 #secondary_alignment
flags += read.failedVendorQualityChecks ? 512 : 0 #failed_quality_check
flags += read.duplicateFragment ? 1024 : 0 #duplicate_read
flags += read.supplementaryAlignment ? 2048 : 0 #supplementary_alignment
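For code that still needs the old representations, the flags pseudocode above can be written as runnable Python, together with the reverse of the quality change (adding 33 back to recover the old ASCII string). This is a sketch under stated assumptions: reads are plain dicts as decoded from the JSON response, missing fields are treated as null or false, and the function names are hypothetical.

```python
def _get(obj, *keys):
    """Follow nested dict keys, returning None if any level is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


# (bit value, test) pairs mirroring the pseudocode above, line by line.
FLAG_TESTS = [
    (1, lambda r: r.get("numberReads") == 2),
    (2, lambda r: bool(r.get("properPlacement"))),
    (4, lambda r: _get(r, "alignment", "position", "position") is None),
    (8, lambda r: _get(r, "nextMatePosition", "position") is None),
    (16, lambda r: bool(_get(r, "alignment", "position", "reverseStrand"))),
    (32, lambda r: bool(_get(r, "nextMatePosition", "reverseStrand"))),
    (64, lambda r: r.get("readNumber") == 0),
    (128, lambda r: r.get("readNumber") == 1),
    (256, lambda r: bool(r.get("secondaryAlignment"))),
    (512, lambda r: bool(r.get("failedVendorQualityChecks"))),
    (1024, lambda r: bool(r.get("duplicateFragment"))),
    (2048, lambda r: bool(r.get("supplementaryAlignment"))),
]


def old_style_flags(read):
    """Reconstruct the old integer flags value from a read dict."""
    return sum(bit for bit, test in FLAG_TESTS if test(read))


def old_style_base_quality(read):
    """Turn the alignedQuality int array back into a +33 ASCII string."""
    return "".join(chr(q + 33) for q in read.get("alignedQuality", []))


example_read = {
    "numberReads": 2, "readNumber": 0, "properPlacement": True,
    "alignment": {"position": {"position": 100, "reverseStrand": False}},
    "nextMatePosition": {"position": 300, "reverseStrand": True},
    "alignedQuality": [0, 30, 40],
}
print(old_style_flags(example_read))         # 1 + 2 + 32 + 64
print(old_style_base_quality(example_read))
```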
reads/search
• sequenceName is now referenceName
• sequenceStart is now start
• sequenceEnd is now end
• The response from reads/search now returns a field called alignments rather than reads
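A minimal sketch of applying the request renames listed in this section to an existing reads/search body follows. It covers only sequenceName, sequenceStart, sequenceEnd and maxResults; it does not adjust coordinates for the move to 0-based positions, and it leaves all other keys untouched. The function names are hypothetical.

```python
REQUEST_RENAMES = {
    "sequenceName": "referenceName",
    "sequenceStart": "start",
    "sequenceEnd": "end",
    "maxResults": "pageSize",
}


def migrate_reads_search_request(old_body):
    """Rename v1beta reads/search request fields to their v1beta2 names.

    Hypothetical helper: renames keys only. Coordinates are copied
    unchanged, so any offset adjustment is up to the caller.
    """
    new_body = {}
    for key, value in old_body.items():
        new_key = REQUEST_RENAMES.get(key, key)
        if new_key == "pageSize":
            value = int(value)       # pageSize is an integer in v1beta2
        new_body[new_key] = value
    return new_body


def reads_from_response(response):
    """v1beta2 returns the reads under "alignments" rather than "reads"."""
    return response.get("alignments", [])


old = {"sequenceName": "chr1", "sequenceStart": 10, "sequenceEnd": 20,
       "maxResults": "256"}
print(migrate_reads_search_request(old))
```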
CHAPTER 10
The mailing list
The Google Genomics Discuss mailing list is a good way to sync up with other people who use genomics-tools, including the core developers. You can subscribe by sending an email to [email protected] or just post using the web forum page.
All improvements to these docs are welcome! You can file an issue or submit a pull request.