My Little Data in a Big Data World


A Poster by Candida Haynes - PyData NYC 2015

Description

You share on social media. You use email. Big data machines “think” they know you once they have analyzed some of your patterns. Now imagine “writing” a deliberate data story.

This poster describes an early process of:

1) retrieving personal data,
2) securing and exploring it, and
3) identifying the tools that will allow individuals to start cultivating and/or preserving their data identities with code.

Considerations at Each Step:

Does it expose personal data or thought processes to the web? To a platform that does not already have the data?

Friction for beginners, i.e. how difficult the step is and how much information is available online for someone with limited knowledge of jargon.

More on Friction

Most documentation describes how to pull information from distributed APIs.

Required precautions might delay beginners and ultimately lose them by seeming (or being) too far from their project goals.

Problem-Solving Strategy: Local Security


Computed behind the firewall that was already on my system.

Encrypted hard drive, protected via password.

Storage and Memory Management

USB drive, 4 gigabytes.

I did not encrypt the USB drive, but will consider it in the next iteration to discover if and how that changes the process.

Getting Your Data

Some social media sites have a page where you can request a download file of your data. I chose to use Twitter and found my request link here: https://twitter.com/settings/account

Timing: The packaging and prep of your data download can take several minutes, several hours, or a few days. The email alert that the data was ready for download took less than two minutes to arrive this time.

Twitter Data Characteristics

Personal information that is / was already consumable by the public.

Delivered as a ZIP file with Twitter's encryption while in transit.

The Twitter archive file, which had a numerical name, included a tweets.csv file, which matched the data type in Bokeh's example.

Each word in a tweet had its own column, which made counting easier.
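For readers who would rather inspect tweets.csv with code than with a spreadsheet, here is a minimal sketch using only Python's standard library. It is not the process used for this poster. The column names `timestamp` and `text` and the second sample row are assumptions made for illustration, so check the header row of your own file.

```python
import csv
import io

def load_tweets(csv_file):
    """Read tweet rows from an open CSV file into a list of dicts."""
    return list(csv.DictReader(csv_file))

# Stand-in for open("tweets.csv", newline="", encoding="utf-8").
# The header and the second row are made up for illustration.
SAMPLE_CSV = io.StringIO(
    "timestamp,text\n"
    "2008-12-18 00:00:00 +0000,I need cookies.\n"
    '2009-01-02 12:30:00 +0000,"Hello, world"\n'
)

tweets = load_tweets(SAMPLE_CSV)
columns = list(tweets[0].keys())
print(columns)      # column names taken from the header row
print(len(tweets))  # number of tweets in the file
```

Reading the file locally like this keeps the data on your own machine, which fits the local-security approach described earlier.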


The data I downloaded appeared in a folder once I unzipped the file.

    Grailbird.data.tweets_2008_12 =
    [ {
      "source" : "\u003Ca href=\"http:\/\/twitter.com\" rel=\"nofollow\"\u003ETwitter Web Client\u003C\/a\u003E",
      "entities" : {
        "user_mentions" : [ ],
        "media" : [ ],
        "hashtags" : [ ],
        "urls" : [ ]
      },
      "geo" : { },
      "id_str" : "1064099455",
      "text" : "I need cookies.",
      "id" : 1064099455,
      "created_at" : "2008-12-18 00:00:00 +0000",
      "user" : {
        "name" : "Dida Lakes",
        "screen_name" : "dihaynes",
        "protected" : false,
        "id_str" : "18206386",
        "profile_image_url_https" : "https:\/\/pbs.twimg.com\/profile_images\/654911288917659648\/QKsP0wHR_normal.jpg",
        "id" : 18206386,

This is a JSON file of my first tweet!
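The excerpt starts with a JavaScript assignment (`Grailbird.data.tweets_2008_12 =`) in front of the JSON array, so a plain JSON parser will reject the file as-is. One possible workaround, sketched below, is to drop everything up to the first `=` before parsing. The helper name `parse_grailbird` is hypothetical, and the sample is a trimmed version of the record shown above.

```python
import json

def parse_grailbird(js_text):
    """Return the list of tweets from the text of a Grailbird-style .js file.

    The file is a JavaScript assignment ("Grailbird.data.<name> = [ ... ]"),
    so everything up to the first "=" is dropped before parsing as JSON.
    """
    _, _, payload = js_text.partition("=")
    return json.loads(payload)

# Trimmed sample based on the record shown above.
SAMPLE_JS = '''Grailbird.data.tweets_2008_12 =
[ {
  "id_str" : "1064099455",
  "text" : "I need cookies.",
  "created_at" : "2008-12-18 00:00:00 +0000"
} ]'''

tweets = parse_grailbird(SAMPLE_JS)
print(tweets[0]["text"])  # prints: I need cookies.
```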

Understanding Data: OpenOffice Calc as a Problem-Solving Tool

Handled a .CSV file with > 7k tweets.

Adequately displayed data.

Sorting and counting tools did not require code.

Problem: Dependencies. Solution: Anaconda

Not having the right software and configurations for the new software you are installing causes errors. Anaconda resolves a lot of those problems and is recommended in the Bokeh documentation.

Word Cloud: Visual Alternative to Calc

Word cloud from @dihaynes on Twitter (sans cleansing) via Jason Davies.

You can explore more word cloud generators at http://worditout.com/ and http://tagcrowd.com.
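A word cloud is driven by word frequencies, which can also be counted locally without sending tweets to a web tool. The sketch below is one way to do that with Python's standard library. It is not how the word cloud above was produced. The tokenizing rule and the second sample tweet are assumptions made for illustration.

```python
import re
from collections import Counter

def word_counts(texts):
    """Count lowercase words across an iterable of tweet texts."""
    counts = Counter()
    for text in texts:
        # Keep letters, digits, and apostrophes, plus a leading @ or #
        # so that handles and hashtags stay intact.
        counts.update(re.findall(r"[@#]?[a-z0-9']+", text.lower()))
    return counts

# The second text is made up for illustration.
SAMPLE_TEXTS = ["I need cookies.", "Cookies, cookies, and #data"]

counts = word_counts(SAMPLE_TEXTS)
print(counts.most_common(1))  # prints: [('cookies', 3)]
```

No cleansing is applied here either, so common words such as "and" are counted alongside everything else.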

Data Visualization Tool: Bokeh

Data that matched my interest in employment data for future projects.

Learning opportunities via dependencies that I could use to interact with tools that had previously posed too much friction.

It had a presentation format that would allow for quick interactions.

“Employment Sample” visualization had colors that resonated with me.

Visual Inspiration

Data visualization from Bokeh sample: http://bokeh.pydata.org/en/latest/docs/gallery/unemployment.html
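The Bokeh unemployment sample, as I understand it, plots a grid of years by months. Tweet timestamps could be shaped into a similar grid before any plotting is attempted. The sketch below does only that shaping step with the standard library and makes no Bokeh calls. The timestamp format follows the `created_at` value shown in the JSON excerpt, and the second and third sample timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime

def tweets_per_month(timestamps):
    """Map (year, month) to the number of tweets posted in that month."""
    counts = Counter()
    for stamp in timestamps:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S %z")
        counts[(when.year, when.month)] += 1
    return counts

def as_grid(counts):
    """Build one row of 12 monthly counts per year, ready for a heatmap."""
    years = sorted({year for year, _ in counts})
    return {
        year: [counts.get((year, month), 0) for month in range(1, 13)]
        for year in years
    }

SAMPLE_TIMESTAMPS = [
    "2008-12-18 00:00:00 +0000",  # from the archive excerpt
    "2008-12-20 08:15:00 +0000",  # made up for illustration
    "2009-01-02 12:30:00 +0000",  # made up for illustration
]

grid = as_grid(tweets_per_month(SAMPLE_TIMESTAMPS))
for year, row in grid.items():
    print(year, row)
```

Each row holds January through December, so a month with no tweets shows up as 0 rather than being left out.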

Questions for Discussion?

What is the role of data science in society?

What would you add to the story of this project?

What are some moments when data science and storytelling are at odds? When are they not?

What questions do you still have about the data? About the process?
