Case Study in Big Data: the Socio-Technical Issues of HathiTrust Digital Texts. Women’s Institute for Summer Enrichment, Cornell University, Jun 16, 2014. Beth Plale, Professor, School of Informatics and Computing; Director, Data To Insight Center, Indiana University. HathiTrust Research Center

Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

DESCRIPTION

Invited talk at the TRUST Women’s Institute for Summer Enrichment (WISE), Cornell, NY, Jun 16, 2014. Infrastructure support for text mining research on a big data repository like HathiTrust raises challenges in access and security when the bulk of the repository is protected by copyright.

Citation preview

Page 1: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

Case Study in Big Data: the Socio-Technical Issues of HathiTrust Digital Texts

Women’s Institute for Summer Enrichment, Cornell University, Jun 16, 2014

Beth Plale

Professor, School of Informatics and Computing; Director, Data To Insight Center

Indiana University

HATHI TRUST RESEARCH CENTER

Page 2: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

• Who are the Players? HathiTrust, Google, Authors Guild

• The Object of Attention: 11 M books from university libraries

• Rulings around copyright

• HTRC, or why I care

• Is security of HTRC Data Capsule good enough?

Page 3: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

The  Players  

Page 4: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

Books Digitization Project (2007)

Libraries of U Michigan, U California, Virginia, Wisconsin, Indiana, …

[Diagram labels: digitize; digitized books]

Page 5: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

Google/Authors Guild

[Diagram labels: digitized books; Legal action]

Mar 2011: New York federal judge rejected a $125 million legal settlement that Google had worked out with the authors and publishers over the copyright issues

Nov 2013: same judge issued a ruling saying that Google's use of the works was a "fair use" under copyright law

Page 6: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

• June 2014: 2nd Circuit Court of Appeals ruling on Authors Guild versus HathiTrust (Cornell, U Michigan, U California, U Wisconsin, Indiana) is a major victory for fair use

[Diagram labels: digitized books; Legal action]

Page 7: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

Highlights 2014 ruling

• With respect to the full-text database, the court found that although a copy of the entire work is made, the purpose of a full-text searchable database is so different from that of the underlying works that the use must be considered transformative. In fact, the court wrote, "the creation of a full-text searchable database is a quintessentially transformative use".

June 10, 2014 | By Parker Higgins | Another Fair Use Victory for Book Scanning in HathiTrust

Page 8: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

Highlights 2014 ruling

• The Authors Guild argued that HathiTrust's use of an identical server and two tape back-ups constituted "excessive" copying.

• Thankfully, the court rejected that premise, acknowledging that when it comes to digital technology, an approach that focuses only on individual copies made is insufficient.

June 10, 2014 | By Parker Higgins | Another Fair Use Victory for Book Scanning in HathiTrust

Page 9: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

Does Authors Guild Represent All Authors?

• The Authors Guild members are overwhelmingly trade-book authors; the books scanned by the HathiTrust are overwhelmingly scholarly books written as part of an academic tradition that takes free access and sharing as its foundation.

• The Authors Alliance: new organization representing authors who are primarily concerned with being read.

Court finds full-book scanning is fair use | Cory Doctorow at 3:00 pm Sat, Jun 14, 2014

Page 10: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

Highlight 2014 Ruling

• Given that consistent fair use record for book digitization, today's ruling might not be totally surprising. Still, the text of the opinion is encouraging, and reflects a court that respects the Constitutional purpose of copyright as a tool to promote the progress of science and the useful arts, not a blunt instrument for rightsholders to regulate all downstream uses.

June 10, 2014 | By Parker Higgins | Another Fair Use Victory for Book Scanning in HathiTrust

Page 11: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

• Who are the Players? HathiTrust, Google, Authors Guild

• The Object of Attention: 11 M books from university libraries

• Rulings around copyright

• HTRC, or why I care

• Is security of HTRC Data Capsule good enough?

Page 12: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

HTRC, or why I care:

HathiTrust digital library is “big data”; and text mining is the new library catalog search

Page 13: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

Similar  model,  different  ends  

$$  

HTRC  goes  beyond  “full  text  searchable  database”  

Scholarly  search  

Scholarly  mining  

Page 14: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

#HTRC    @HathiTrust  

HathiTrust  

• HathiTrust is a consortium of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world.
  – Founding members: University of Michigan, Indiana University, University of California, and University of Virginia

http://www.hathitrust.org/htrc

http://www.hathitrust.org

→ Distinguished from

Page 15: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

Page 16: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

Content  of  HathiTrust  

• Books and journals
  – Plus pilots around images, audio, born-digital

• Digitization sources
  – Google (96.8%, 10,162,104)
  – Internet Archive (2.9%, 301,972)
  – Local (0.3%, 31,840)
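The percentages on this slide can be reproduced from the volume counts. The quick arithmetic check below uses only the three counts given above; the variable names are illustrative.

```python
# Volume counts by digitization source, copied from the slide.
sources = {
    "Google": 10_162_104,
    "Internet Archive": 301_972,
    "Local": 31_840,
}

total = sum(sources.values())

# Share of each source as a percentage, rounded to one decimal place.
shares = {name: round(100 * count / total, 1) for name, count in sources.items()}

for name, share in shares.items():
    print(f"{name}: {share}% of {total:,} volumes")
```

The three counts add up to 10,495,916 volumes, which is in line with the roughly 11 M books quoted in the outline.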

Page 17: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

Content  Sources  

Page 18: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

Content distribution

360,000  volumes  in  Spanish  

Page 19: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

Motivation for HTRC

→ HathiTrust repository is massive in scale: a latent goldmine for text-based research

→ Restricted nature of parts of HathiTrust content suggests a need for new forms of access that preserve the intimate nature of interaction with texts while at the same time honoring restrictions on access

→ Size and restrictions demand a new paradigm: computation moves to the data (not vice versa)

Page 20: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

HathiTrust Research Center

• The HathiTrust Research Center (HTRC) was established in 2011 to enable computational research across a comprehensive body of published works, for the purposes of scholarship, education, and invention.

• HTRC Executive Committee
  – Beth Plale, co-Director, Professor of Informatics and Computing, Indiana University
  – J. Stephen Downie, co-Director, Professor of Information Science, University of Illinois
  – Robert McDonald, Indiana University Libraries
  – Beth Sandore Namachchivaya, University of Illinois Library
  – John Unsworth, CIO, Dean of Library, Brandeis University

   

Page 21: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

HTRC  system    

Complexity  hiding  interface  

The  complexity  

Tabular  info  

Statistical plots

Spatial plots

Request  

Page 22: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

   

Complexity hiding interface

   

Page 23: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

Text  mining  at  scale:  quick  tutorial  on  topic  modeling  of  texts  

Page 24: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

Topic  Modeling  

• Can answer more complex or nuanced questions
  – What are the primary themes of an author?
  – What are the primary themes of a research domain?
  – When did a new topic enter a research domain?

• Provides more data than word counts
  – 100s of topics can be extracted.
  – Underlying data (topics, volume, and page) is available
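The slides do not say which topic-modeling algorithm HTRC uses. As a rough illustration, the sketch below assumes latent Dirichlet allocation (LDA) fitted by collapsed Gibbs sampling, a common choice, and runs it on a tiny made-up corpus. The function name `lda_gibbs`, the hyperparameters, and the toy documents are all assumptions for the example.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA on tokenized documents with collapsed Gibbs sampling."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(n_topics)]  # count of word w in topic k
    topic_total = [0] * n_topics                       # tokens in topic k
    assign = []
    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Per-document topic proportions and the top words of each topic.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    top_words = [[w for w, c in tw.most_common(3) if c > 0] for tw in topic_word]
    return theta, top_words


# Toy corpus (made up for illustration, not HathiTrust data).
docs = [
    "illustrations volume literature edition volume".split(),
    "letters household house letters family".split(),
    "volume literature illustrations edition".split(),
    "house letters household family".split(),
]
theta, top_words = lda_gibbs(docs, n_topics=2)
for k, words in enumerate(top_words):
    print(f"topic {k}: {words}")
```

The returned `theta` rows are the kind of "underlying data" the slide mentions: one topic mixture per document, which could equally be computed per volume or per page.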

Page 25: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

Themes for Authors

Two topics with identical centralities (e.g., Dickens) but separate themes

More strongly focused on book (illustrations, volume, literature)

More strongly focused on author himself (letters, household, house)

Page 26: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

Ted Underwood, Univ of Illinois

Page 27: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

Digging  into  philosophy  of  science  

Establish points of contact between philosophy and science: where philosophical arguments on anthropomorphism appear in science texts

Colin  Allen,  IU  

Page 28: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

The  How  

• 1315 volumes from HTRC selected using keyword search for ‘darwin’, ‘romanes’, ‘anthropomorphism’, and ‘comparative psychology’

• Set contains lots of uninteresting books, e.g., college course catalogs

• Apply topic modeling on an 86-volume subset

• Using IPython Notebook
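The keyword-selection step can be sketched as a simple case-insensitive substring filter. The four keywords come from the slide. The helper names, the toy volumes, and the `min_hits` threshold are hypothetical, and the slides do not state how the 86-volume subset was actually chosen.

```python
# Keywords from the slide; everything else here is illustrative.
KEYWORDS = ["darwin", "romanes", "anthropomorphism", "comparative psychology"]


def matches_keywords(text, keywords=KEYWORDS):
    """Return the keywords that occur in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return [kw for kw in keywords if kw in lowered]


def select_volumes(volumes, min_hits=1):
    """Keep ids of volumes whose text contains at least min_hits distinct keywords."""
    return [
        vid for vid, text in volumes.items()
        if len(matches_keywords(text)) >= min_hits
    ]


# Toy volumes (made up): a relevant text, a course catalog, and an unrelated book.
volumes = {
    "vol-a": "Romanes replied to Darwin on anthropomorphism in comparative psychology.",
    "vol-b": "Course catalog: PSY 210 Comparative Psychology, 3 credits.",
    "vol-c": "A history of railway engineering.",
}

broad = select_volumes(volumes)               # any keyword: the catalog slips in
narrow = select_volumes(volumes, min_hits=2)  # hypothetical stricter filter
print(broad, narrow)
```

The toy catalog entry shows why a plain keyword search pulls in uninteresting volumes: a single matching phrase is enough to be selected.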

Page 29: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

..  Of  set  of  topics,  choose  ‘16’  as  best  

Page 30: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

Volumes  most  similar  to  topic  16  
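The slide does not define how similarity between a volume and a topic is measured. A minimal sketch, assuming it means the weight a volume's topic mixture gives to the chosen topic, ranks volumes by that weight. The function name, the toy mixtures, and the topic index are hypothetical.

```python
def volumes_most_similar(theta, topic, top_n=3):
    """Rank volume ids by the weight they give to one topic, highest first."""
    ranked = sorted(theta.items(), key=lambda item: item[1][topic], reverse=True)
    return [(vid, weights[topic]) for vid, weights in ranked[:top_n]]


# Toy per-volume topic mixtures over three topics (made up for illustration).
theta = {
    "vol-a": [0.7, 0.2, 0.1],
    "vol-b": [0.1, 0.8, 0.1],
    "vol-c": [0.3, 0.5, 0.2],
}

ranking = volumes_most_similar(theta, topic=1, top_n=2)
print(ranking)
```

With a real model the same call would use the index of the chosen topic (16 in the talk) and the mixtures of all 86 volumes.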

Page 31: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
Page 32: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
Page 33: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

Copyright: A Reality

Full text download is limited by both size and by copyright

Page 34: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

HTRC solution to fully flexible text mining research on the entire HT digital repository: HTRC Data Capsule

Funded by Alfred P. Sloan Foundation; in collaboration with Atul Prakash, University of Michigan

Page 35: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

Questions driving HTRC Data Capsule

• Non-consumptive use: can framework provide safe handling of large amounts of protected data?

• Openness: can framework support user-contributed analysis without resorting to code walkthroughs prior to acceptance?

• Large-scale and low cost: can protections be extended to utilization of large-scale national (public) computational resources?

Page 36: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

HTRC  Data  Capsules  

•  Trusts  text  mining  researcher  to  not  deliberately  leak  repository  data  

• Prevents malware acting on user’s behalf from leaking data.

•  V1.0  limits  analysis  to  running    within  single  VM  

Page 37: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

HTRC Data Capsule Architectural Components

[Diagram components: VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; Secure Capsule cluster; Registry Services, worksets; Researcher (SSH, research results)]

Page 38: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

HTRC Data Capsule interaction

[Diagram components: VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; Registry Services, worksets; Researcher (SSH, research results)]

[Diagram annotations, labeled 1) to 4) in the figure:]

• Researcher requests new VM of type X

• Image instance is created

• Researcher installs tools onto VM through window on her desktop.

• Upon run, Secure Capsule controls I/O behind the scenes

• Final location of results is registry
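The interaction annotations on this slide can be read as a small lifecycle. The sketch below is not HTRC's implementation. It only models one plausible ordering of the steps: request, instance creation, tool installation, then a controlled run whose result lands in a registry. The class, method, and phase names are hypothetical, and the step order is an assumption because the figure's numbering did not survive extraction.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Phase(Enum):
    REQUESTED = auto()  # researcher has asked for a VM of some type
    SETUP = auto()      # instance exists; researcher may install tools
    DONE = auto()       # analysis has run; results are in the registry


@dataclass
class Capsule:
    vm_type: str
    phase: Phase = Phase.REQUESTED
    tools: list = field(default_factory=list)
    registry: list = field(default_factory=list)  # stand-in for the results registry

    def _expect(self, phase):
        if self.phase is not phase:
            raise RuntimeError(f"expected {phase.name}, capsule is in {self.phase.name}")

    def create_instance(self):
        self._expect(Phase.REQUESTED)
        self.phase = Phase.SETUP

    def install(self, tool):
        # Tools can only be added before the analysis runs.
        self._expect(Phase.SETUP)
        self.tools.append(tool)

    def run(self, analysis, data):
        self._expect(Phase.SETUP)
        result = analysis(data)
        # The result is written to the registry and not returned to the caller.
        self.registry.append(result)
        self.phase = Phase.DONE


capsule = Capsule(vm_type="X")
capsule.create_instance()
capsule.install("word-counter")
capsule.run(lambda texts: sum(len(t.split()) for t in texts), ["a b c", "d e"])
print(capsule.registry)
```

Returning nothing from `run` mirrors the slide's point that the final location of results is the registry, so the researcher never receives output directly from the running capsule.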

Page 39: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

setup  

Page 40: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
Page 41: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

HTRC  secure  data  capsule:  view  from  researcher  desktop  

Page 42: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

Thanks  to  our  sponsors  

Page 43: Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

HTRC goes beyond “full text searchable database”. Security has to be a top concern.

scholarly  research  
