
WHITEPAPER

PerkinElmer Signals™ Notebook ChemSearch: Not Your Father’s Chemistry Cartridge

Chemical substructure and similarity searching is the cornerstone of most cheminformatics systems and a critical feature in electronic lab notebooks (ELN). To casual users it can be mystifying that we have been able to train computers to understand chemical structure diagrams, and to perform complex matching algorithms to identify compounds of interest from millions of potential molecules or chemical reactions. It’s a technology not unlike the fingerprint match databases seen in crime movies. However, as the scale of cheminformatics systems has grown, power users are often frustrated with the performance of their chemical queries. They often ask: “How can Google search billions of documents and give me an answer with sub-second response times, but it takes seconds or even minutes to find compounds in my corporate database or ELN?”

The detailed answer is technical, but at its core it comes down to knowing how to break the problem into smaller pieces and work on those pieces in parallel. Most corporate intranet systems built over the past decades were based on relational database technologies such as Oracle® or Microsoft® SQL Server. These SQL databases put a premium on data consistency, so they typically store and process all their content on a single physical computer. The more data you add to the system, the bigger the computer you have to buy to maintain acceptable query speeds.

We all know that small computers have become very cheap (just look in your pocket). Very big computers, however, remain disproportionately expensive.

Google and the social media companies figured out early on that to store and quickly search billions of documents, posts, or Tweets, they had to distribute data across hundreds or thousands of cheaper computers. This strategy had several advantages. The first is obvious: the more computers work on resolving a search, the quicker the user gets the results. The second is that the performance of the system can be maintained dynamically by adding more computers as the data grows or the number of queries increases. Finally, the overall cost of the system is reduced, especially for very large datasets, which would otherwise require a prohibitively expensive and sophisticated single computer system. These new and scalable solutions, powering the social media revolution, came to be known as NoSQL databases.
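The distribution strategy described above is often called scatter-gather: the data is split into shards, every shard answers the query on its own, and the partial answers are merged. The sketch below is only an illustration of that pattern, not how any particular product implements it. The function names and the sample records are hypothetical, and the threads merely stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor


def make_shards(records, shard_count):
    """Distribute records round-robin across `shard_count` lists."""
    shards = [[] for _ in range(shard_count)]
    for position, record in enumerate(records):
        shards[position % shard_count].append(record)
    return shards


def search_shard(shard, term):
    """Scan a single shard; in a real cluster each shard lives on its own machine."""
    return [record for record in shard if term in record]


def scatter_gather(shards, term):
    """Send the query to every shard concurrently, then merge the partial hits."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial_results = pool.map(lambda shard: search_shard(shard, term), shards)
        return sorted(hit for hits in partial_results for hit in hits)


# Hypothetical sample data, purely for illustration.
records = ["aspirin", "caffeine", "ibuprofen", "paracetamol", "codeine", "morphine"]
shards = make_shards(records, shard_count=3)
hits = scatter_gather(shards, "ine")
print(hits)
```

Each shard only scans its own fraction of the data, so adding shards (and machines) as the dataset grows keeps the work per machine roughly constant.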

“500 milliseconds to search 500 million compounds”

David Gosalvez, Ph.D.


While specialized chemical substructure search databases date back to the early 1970s, the proliferation of enterprise cheminformatics systems, such as electronic lab notebooks, did not happen until the late 1990s, when we figured out how to teach chemistry to the standard SQL database systems demanded by corporate IT. Oracle, in particular, was the only database vendor that provided us with a technical approach to integrate chemical search algorithms into a standard database. The technology was referred to as an Oracle® Chemistry Cartridge. It allowed, for the first time, both chemical and non-chemical enterprise data to be stored, processed, and backed up using a single, trusted, and non-proprietary database back-end. This explains why virtually all cheminformatics systems in production today still use the Oracle® product as their storage and search engine.

Figure 1. Early Systems: MACCS c.1979.

Figure 2. ChemBase c.1986.

Unfortunately, it also explains why those systems have not been able to keep up with end users’ performance expectations. Mature ELNs in production for well over a decade at global pharmaceutical and chemical companies currently store tens of millions of chemical structures and reactions. The development of new parallel synthesis techniques has recently exploded the size of commercially available compound databases, from millions to hundreds of millions of records. We are quickly approaching an unprecedented milestone: one billion accessible commercial reagents. Even with significant corporate investment in high-end single database systems, such as Oracle’s Exadata® platform, cartridge-based systems cannot deliver the performance and scale demanded by end users, who increasingly expect their corporate systems to behave more like their social media applications.

PerkinElmer has been the industry leader in adopting internet and social media technologies to build the next-generation, cloud-based scientific computing platform. So it is not surprising that we were the first to develop and file patents for the integration of chemical search capabilities with horizontally scalable NoSQL systems. In particular, PerkinElmer’s ChemSearch NoSQL Chemistry Cartridge was recently integrated into the PerkinElmer Signals Notebook, a cloud-native ELN, where it offers exact, substructure, and similarity search capabilities with performance at scale.

The new search engine is based on the open source and battle-tested Elasticsearch technology. This is the search engine used by many online shopping, airline, hotel reservation, and social media sites. We simply figured out how to teach it chemical searching. ChemSearch allows Signals Notebook to provide sub-second response times on databases with hundreds of millions of molecules and reactions. In the process of developing ChemSearch, we did not have to change the trusted chemical search algorithms originally developed by CambridgeSoft, Inc., now PerkinElmer. Chemical queries performed against the new engine are guaranteed to give the same results as our Oracle® Cartridge, only with orders of magnitude greater speed.

Elasticsearch is not a database. It is an index (think Google). It consumes documents and other electronic content, parses them, and organizes them in a way that makes them easier to find. Just like the index in the back of a book, it’s a sort of electronic table that makes it fast and easy to find the original documents, which are stored somewhere else. Chemical substructure and similarity searching are well suited to this type of indexing technology. The first step in chemical searching involves classifying all the molecules in the database by assigning them fingerprints (a set of characteristics that distinguish the molecule). The fingerprints are well suited to serve as the entries in the index. Elasticsearch is incredibly efficient at filtering out index entries (screening molecules) based on the fingerprint. It is not intimidated by having to handle billions of molecules.
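To make the screening idea concrete, here is a minimal sketch of fingerprint-based filtering. It is not the ChemSearch fingerprint scheme, which the paper does not describe. The feature vocabulary and the hand-assigned feature sets below are assumptions made for illustration, with each fingerprint stored as an integer whose bits mark the features present. A molecule can only contain the query as a substructure if it has every bit the query has, so a single bitwise test discards most of the database. The same bits also support a simple similarity measure (Tanimoto: shared bits divided by bits set in either fingerprint).

```python
# Hypothetical feature vocabulary: one bit position per structural feature.
VOCABULARY = {
    "benzene_ring": 0,
    "carboxylic_acid": 1,
    "ester": 2,
    "phenolic_hydroxyl": 3,
    "amide": 4,
}


def fingerprint(features):
    """Encode a set of structural features as an integer bit mask."""
    bits = 0
    for feature in features:
        bits |= 1 << VOCABULARY[feature]
    return bits


def passes_screen(query_fp, molecule_fp):
    """A molecule can contain the query only if every query bit is also set in it."""
    return (query_fp & molecule_fp) == query_fp


def tanimoto(fp_a, fp_b):
    """Similarity score: bits shared by both / bits set in either fingerprint."""
    union = bin(fp_a | fp_b).count("1")
    if union == 0:
        return 0.0
    return bin(fp_a & fp_b).count("1") / union


# Feature sets assigned by hand for illustration only.
database = {
    "aspirin": fingerprint({"benzene_ring", "carboxylic_acid", "ester"}),
    "salicylic acid": fingerprint({"benzene_ring", "carboxylic_acid", "phenolic_hydroxyl"}),
    "paracetamol": fingerprint({"benzene_ring", "phenolic_hydroxyl", "amide"}),
    "acetic acid": fingerprint({"carboxylic_acid"}),
}

# Query: a benzoic-acid-like fragment (aromatic ring plus carboxylic acid).
query_fp = fingerprint({"benzene_ring", "carboxylic_acid"})
candidates = [name for name, fp in database.items() if passes_screen(query_fp, fp)]
print(candidates)
print(tanimoto(database["aspirin"], database["salicylic acid"]))
```

A screen of this kind can let through molecules that do not actually match (false positives), but it never rejects a true match, which is why a second, exact verification step is still needed.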

Figure 3. Structure searching in Signals Notebook - apply filters for reaction properties and easily toggle between full, exact, similar or substructure searching.

Indexing systems are also very efficient at assigning relevance to results. It is not enough to return results quickly. The results must be valid, and the most important ones should be returned first. The second step in chemical searching is to evaluate which of the screened molecules really contains the query structure as a substructure. This is a complex and computationally expensive process. It’s the secret sauce developed by CambridgeSoft and PerkinElmer over 30 years of heuristic experience. You can think of it as a pass/fail relevance test. Matching molecules are assigned a high relevance; failing ones are dropped from the list of results. The good news, again, is that Elasticsearch is particularly well suited to scoring the results via a relevance function. It efficiently parallelizes the scoring process by allowing hundreds or even thousands of small computers to each handle a fraction of the results to score. The net result is a system that is vastly more scalable and performant than older SQL Cartridges.
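The two-step search described above (a cheap screen followed by an expensive pass/fail verification) can be sketched as follows. This is a toy illustration under loud assumptions and is not the CambridgeSoft/PerkinElmer matching algorithm: molecules are hypothetical hand-built graphs of heavy atoms only, bond orders, hydrogens, and stereochemistry are ignored, the fingerprint simply records which elements are present, and verification is a naive backtracking subgraph match.

```python
ELEMENT_BITS = {"C": 1, "O": 2, "N": 4}  # hypothetical one-bit-per-element fingerprint


def element_fingerprint(atoms):
    """Step 1 helper: set one bit for each element that occurs in the molecule."""
    bits = 0
    for element in atoms:
        bits |= ELEMENT_BITS[element]
    return bits


def is_substructure(query, target):
    """Step 2: map every query atom to a distinct target atom of the same element
    such that every query bond is also a bond in the target (backtracking)."""
    q_atoms, q_bonds = query
    t_atoms, t_bonds = target
    t_bond_set = {frozenset(bond) for bond in t_bonds}

    def bonds_ok(mapping, q_index, t_index):
        # Check bonds between the new query atom and query atoms mapped earlier.
        for a, b in q_bonds:
            if q_index not in (a, b):
                continue
            other = b if a == q_index else a
            if other < q_index and frozenset((mapping[other], t_index)) not in t_bond_set:
                return False
        return True

    def extend(mapping):
        q_index = len(mapping)  # mapping[i] is the target atom chosen for query atom i
        if q_index == len(q_atoms):
            return True
        for t_index, element in enumerate(t_atoms):
            if element != q_atoms[q_index] or t_index in mapping:
                continue
            if bonds_ok(mapping, q_index, t_index) and extend(mapping + [t_index]):
                return True
        return False

    return extend([])


def substructure_search(query, database):
    """Screen by fingerprint first, then keep only candidates that pass verification."""
    query_fp = element_fingerprint(query[0])
    results = []
    for name, molecule in database.items():
        if (query_fp & element_fingerprint(molecule[0])) != query_fp:
            continue  # failed the cheap screen
        if is_substructure(query, molecule):
            results.append(name)  # pass: high relevance; failures are dropped
    return results


# Hypothetical heavy-atom graphs: (atoms, bonds as index pairs).
database = {
    "acetic acid": (["C", "C", "O", "O"], [(0, 1), (1, 2), (1, 3)]),
    "ethanol": (["C", "C", "O"], [(0, 1), (1, 2)]),
    "methylamine": (["C", "N"], [(0, 1)]),
    "glycine": (["N", "C", "C", "O", "O"], [(0, 1), (1, 2), (2, 3), (2, 4)]),
}
query = (["C", "O", "O"], [(0, 1), (0, 2)])  # a carbon bonded to two oxygens
matches = substructure_search(query, database)
print(matches)
```

In this toy data, ethanol passes the screen (it contains both carbon and oxygen) but fails verification, which is exactly the false-positive case the second step exists to remove. Because each candidate is verified independently of the others, the verification work can be divided among many machines.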

ChemSearch is not only used to search all chemical documents and reaction schemes stored in Signals Notebook. It also powers the new ChemACX® Explorer panel. With over 10 million commercially available, off-the-shelf reagents from nearly 1000 vendors, PerkinElmer’s ChemACX® database has become the industry standard for procurement of fine and research chemicals. With the sub-second substructure query response times afforded by our new engine, the ACX Explorer panel offers a powerful and innovative user experience. It is a sort of “chemical type-ahead”: as the user draws structures into the reaction scheme, the panel provides real-time feedback on available reagents, their vendors, chemical properties, and safety data.

Figure 4. Signals Notebook ACX Explorer Panel.


For a complete listing of our global offices, visit www.perkinelmer.com/ContactUs

Copyright ©2018, PerkinElmer, Inc. All rights reserved. PerkinElmer® is a registered trademark of PerkinElmer, Inc. All other trademarks are the property of their respective owners. 014409A_01 PKI

PerkinElmer, Inc. 940 Winter Street Waltham, MA 02451 USA P: (800) 762-4000 or (+1) 203-925-4602 www.perkinelmer.com

ChemSearch is also used in the upcoming Signals™ Notebook archive feature. Customers migrating from CambridgeSoft E-Notebook will be able to query and view all their closed E-Notebook experiments via the Signals Notebook application. Performance is expected to be orders of magnitude better than native E-Notebook searching, while supporting complex combinations of chemical, text, and property queries via a modern web-based search interface that feels a bit like shopping on Amazon.

But ChemACX and the E-Notebook archive are not the only large-scale chemical data sources of interest to Signals Notebook users. We expect that over the coming weeks and months our ChemSearch chemical indexing engine will be used for integrating other public data sources, such as PubChem or ChEMBL, or internal proprietary systems such as in-house chemical inventory, sample management, and registration systems. The pattern of exposing external data sources in Signals Notebook via the advanced search screen or via custom helper panels à la ChemACX will enable Signals Notebook to become a central portal from which to access scientifically relevant content.

About Signals Notebook - The Only ELN with ChemDraw Built In

Signals Notebook, the cloud electronic lab notebook from PerkinElmer, helps scientists capture, store, share, and search virtually any type of data. All through an interface that’s so intuitive you’ll be up and running in no time. Plus you will benefit from embedded ChemDraw® – quickly search for available compounds and add them to your reaction scheme with one click.

• Full Microsoft Office® and Microsoft Office® Online Integration

• Industry-renowned chemical drawing software, ChemDraw, embedded at no extra charge

• No hardware or software to install, download, or maintain

Request your 30-day free trial today. Please visit https://signalsnotebook.perkinelmer.cloud/trial/. Or for more information, visit https://bit.ly/2zRbu20