petermurrayrust
RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs
CANARY pipeline: Open NORMA-lized Scientific Literature + Facts
(Collaboration with Open Access Button)
[Diagram] Crawl / Feed -> quickscrape -> Norma -> AMI (Index & Transform)
• Supporting components: Queues, Repos, Science Plugins, Science Volunteers, CAT-alogue index
• Inputs: URL, DOI; scientific literature; repositories
• Formats: XML, DOC, CSV, sHTML, bad HTML, OCR, diagrams
• Scrapers: bespoke, XPath, per-journal
• Taggers: per-journal
• Plugins: regex, sequences, species, metadata, chemistry, phylogenetics, farming
Starting points
• Search/Crawl/Feed -> PMCID, DOI, URL -> quickscrape -> CMDir(PDF, HTML, XML, images/, meta) -> Norma -> CMDir(sHTML|TXT|SVG): good
• PDF, XML, HTML -> Norma -> CMDir(PDF, rawHTML, TXT, images/, meta?) -> NormaOCR -> CMDir(sHTML, TXT, SVG): variable
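The first starting point takes a mixed stream of PMCIDs, DOIs, and URLs. A minimal sketch of sorting such a stream before it is handed to a scraper is given here; the regular expressions and function names are assumptions made for illustration and are not taken from quickscrape or any other tool named in these slides.

```python
import re

# Hypothetical patterns for illustration only; real-world DOIs and URLs
# have edge cases that these simple expressions do not cover.
PMCID_RE = re.compile(r"^PMC\d+$", re.IGNORECASE)
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
URL_RE = re.compile(r"^https?://\S+$", re.IGNORECASE)


def classify_identifier(token: str) -> str:
    """Return 'PMCID', 'DOI', 'URL', or 'unknown' for one input token."""
    token = token.strip()
    if PMCID_RE.match(token):
        return "PMCID"
    if DOI_RE.match(token):
        return "DOI"
    if URL_RE.match(token):
        return "URL"
    return "unknown"


def group_identifiers(tokens):
    """Group tokens by identifier kind, keeping input order within each kind."""
    groups = {"PMCID": [], "DOI": [], "URL": [], "unknown": []}
    for token in tokens:
        groups[classify_identifier(token)].append(token.strip())
    return groups
```

A DOI written as a resolver link (an `https://` address) is classified as a URL in this sketch; whether that is the desired behaviour depends on how the downstream scraper is configured.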
Conversions
• Paper -> Scanned -> TIFF (avoid)
• PDF, TIFF, PNG -> Tesseract-N -> HTML, SVG: fast, variable
• PDF -> PDF2SVG-N -> sHTML, SVG, images/: slow, accurate-ish
• PDF -> PDF2TXT-N -> TXT: fast, variable
• PDF -> PDF2Image-N -> PNG: fast, accurate
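The conversion routes above amount to a small lookup table. The sketch below transcribes them into a data structure and adds a hypothetical helper, `find_converters`, that lists which converters can take a given input format to a given output format. The class and function names are assumptions for illustration; only the converter labels, formats, and speed/accuracy notes come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Converter:
    name: str
    inputs: frozenset
    outputs: frozenset
    speed: str
    accuracy: str


# Entries transcribed from the conversions list; the "-N" names are the slide's labels.
CONVERTERS = [
    Converter("Tesseract-N", frozenset({"PDF", "TIFF", "PNG"}),
              frozenset({"HTML", "SVG"}), "fast", "variable"),
    Converter("PDF2SVG-N", frozenset({"PDF"}),
              frozenset({"sHTML", "SVG", "images/"}), "slow", "accurate-ish"),
    Converter("PDF2TXT-N", frozenset({"PDF"}),
              frozenset({"TXT"}), "fast", "variable"),
    Converter("PDF2Image-N", frozenset({"PDF"}),
              frozenset({"PNG"}), "fast", "accurate"),
]


def find_converters(source: str, target: str):
    """Return names of converters that accept `source` and emit `target`."""
    return [c.name for c in CONVERTERS
            if source in c.inputs and target in c.outputs]
```

For example, `find_converters("PDF", "SVG")` returns both Tesseract-N and PDF2SVG-N, which makes the speed versus accuracy trade-off between the two routes explicit.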
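[Diagram: NORMA, HTML and XML to ScholarlyHTML]
• Raw HTML (not well-formed, bad character semantics) -> HtmlTidy, Jsoup, HtmlUnit -> well-formed XHTML
• Well-formed XHTML -> XSLT1/2 (per-journal stylesheets) -> ScholarlyHTML
• XML -> XSLT1/2 -> ScholarlyHTML
• ScholarlyHTML: tagged sections, captioned figures, tables, captioned tables, PNG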
End points
• Norma -> CMDir(OpenSHTML-SVG)
• Norma -> CMDir(sHTML sections) -> AMI -> all text + species, chemistry, sequences
• Norma -> CMDir(TXT (unsectioned)) -> AMI -> bagOfWords, regex
• Norma -> CMDir(PNG) -> AMI -> phylo, bar/xy-plots
• Norma -> CMDir(SVG) -> AMI -> phylo, bar/xy-plots, chemistry
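The unsectioned-text end point feeds a bag-of-words count and regex searches. A minimal sketch of those two operations on plain text is given here; it is not AMI's implementation, and the function names and the example DNA pattern are assumptions chosen for illustration.

```python
import re
from collections import Counter

# Alphabetic tokens only; a deliberately simple tokenizer for this sketch.
WORD_RE = re.compile(r"[A-Za-z]+")


def bag_of_words(text: str) -> Counter:
    """Count lower-cased alphabetic tokens in unsectioned text."""
    return Counter(w.lower() for w in WORD_RE.findall(text))


def regex_hits(text: str, patterns: dict) -> dict:
    """Map each named pattern to the list of substrings it matches.

    Patterns should not contain capture groups, because re.findall
    would then return the groups instead of the whole match.
    """
    return {name: re.findall(pattern, text) for name, pattern in patterns.items()}


if __name__ == "__main__":
    sample = "Primer ACGTACGTAC was used; the primer binds DNA."
    print(bag_of_words(sample).most_common(3))
    # Hypothetical pattern: runs of eight or more nucleotide letters.
    print(regex_hits(sample, {"dna": r"\b[ACGT]{8,}\b"}))
```

Sectioned sHTML would allow the same counts to be restricted to, say, the methods section, which is the difference between the second and third end points.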
[Diagram: NORMA, PDF to ScholarlyHTML]
• PDF (non-Unicode, pixel glyphs, no words, no structures) -> PDF2SVG -> SVG characters
• SVG characters -> sentences, paras, tables
• PNG -> OCR
• SVGBuilder -> high-level graphics
• XSLT1/2 -> ScholarlyHTML: tagged sections, captioned figures
NORMALIZE
[Diagram]
• Norma: convert PDF, XML to sHTML; tag sections
• Norma resources: PDF2SVG, XSL stylesheets, taggers, normalization parameters
• Normalized Scientific Literature
• AMI: index, transform, extract, search (plugins, regex)
• Stores: “permanent” filestore, temporary filestore
• Outputs: extracted facts, indexes
[Diagram] quickscrape -> Norma -> AMI (Index & Transform)
• Inputs: URL, DOI
• Formats: XML, DOC, CSV, sHTML, bad HTML, OCR, diagrams
• Scrapers: bespoke, XPath
• Taggers: per-journal
• Plugins: sequences, species, chemistry, phylogenetics, farming
CAT-alogue index
[Diagram]
• getpapers query -> titles + links
• Daily crawl/feed: EuPMC, JToCs
• catalogue
[Diagram: CONTENTMINING overview]
• catalogue: getpapers query; daily crawl of EuPMC, arXiv, CORE, HAL, (UNIV repos); ToC services
• crawl: quickscrape over publisher sites (PLoS ONE, BMC, PeerJ… Nature, IEEE, Elsevier…) and UNIV repos
• Formats: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS, URLs, DOIs
• norma: Normalizer, Sectioner, Semantic Tagger -> text, data, figures
• Sections tagged: abstract, methods, references, captioned figures (Fig. 1), HTML tables
• ami: search, lookup
• COMMUNITY plugins: Chem, Phylo, Trials, Crystal, Plants; visualization and analysis
• Community-maintained resources: scrapers, queries, taggers
• Throughput: 30,000 pages/day
• Outputs: Semantic ScholarlyHTML, facts