
Big Data Interview




Data Scientist Interview

I applied online and the process took 4 weeks - interviewed at LinkedIn.

Interview Details – There should be two rounds of phone interviews and one on-site interview. I applied for the position in LinkedIn career page. A recruiter contacted me the next day and scheduled my first phone screen. After the first phone screen, the recruiter contacted me regarding the next round of the phone screen.

Interview Question – I passed the first phone screen (basic data mining questions, covering the concepts of classification and clustering, plus a simple dynamic programming question quite similar to "Climbing Stairs"), and failed the second one right after I came back from another state (basic NLP questions, such as named entity extraction; basic data mining questions, such as SVM and Naive Bayes; and a sampling question quite similar to reservoir sampling).
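For reference, reservoir sampling (which the second screen's sampling question resembled) keeps a uniform random sample of size k from a stream whose length is not known in advance. A minimal Python sketch (function and variable names are illustrative, not from the interview):

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from an iterable
    whose total length is not known up front."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = random.randint(0, i)    # slot chosen uniformly in [0, i]
            if j < k:
                reservoir[j] = item     # replace an item with prob. k/(i+1)
    return reservoir
```

Each item ends up in the sample with probability k/n, which is exactly the property a sampling question like this is checking for.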

Interview Details – Asked questions about SQL and data mining.

Interview Question – The questions were quite standard.

I applied online and the process took 3 weeks - interviewed at LinkedIn in July 2012.

Interview Details – I was first contacted by a recruiter. Two phone interviews were arranged. The first interviewer asked some basic questions about my resume and then we went into the technical questions. Two questions were asked, both about searching in sorted arrays of numbers. The second interview was almost identical, except that the question was more about algorithm design, which required general problem-solving skills.
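Questions about searching in sorted arrays of numbers almost always reduce to binary search or one of its variants (first/last occurrence, rotated array, etc.). A minimal sketch of the base version:

```python
def binary_search(arr, target):
    """Return the index of target in a sorted list arr, or -1 if absent.
    Runs in O(log n) comparisons."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1    # target can only be in the right half
        else:
            hi = mid - 1    # target can only be in the left half
    return -1
```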

Interview Question – The questions are not difficult. It is important to review basic algorithm design, to know how to talk through the interview, and to know when to ask for help.

Interview Details – Had two phone interviews with the data science group. One was more design questions, a machine learning background check, etc. The second interview was strictly coding and algorithms.

Interview Questions

Implement the pow function.

Segment a long string into a set of valid words using a dictionary. Return false if the string cannot be segmented. What is the complexity of your solution?
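Hedged sketches of both questions in Python: pow via exponentiation by squaring, and the segmentation question as the classic word-break dynamic program.

```python
def my_pow(x, n):
    """Compute x**n using exponentiation by squaring:
    O(log n) multiplications instead of O(n)."""
    if n < 0:
        x, n = 1.0 / x, -n
    result = 1.0
    while n:
        if n & 1:          # odd exponent: fold the current base in
            result *= x
        x *= x             # square the base
        n >>= 1
    return result

def word_break(s, dictionary):
    """Return True if s can be segmented into words from dictionary,
    False otherwise."""
    words = set(dictionary)
    ok = [True] + [False] * len(s)   # ok[i]: the prefix s[:i] is segmentable
    for i in range(1, len(s) + 1):
        for j in range(i):
            if ok[j] and s[j:i] in words:
                ok[i] = True
                break
    return ok[len(s)]
```

For the complexity follow-up: the DP tries O(n^2) (start, end) pairs, and each substring lookup costs up to O(n), so O(n^3) overall (O(n^2) if word length is bounded by a constant).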



Common Analytics Interview Questions

Posted by Sarita Digumarti on October 11, 2013

You are excited. You have got that much-awaited interview call for that dream analytics job. You are confident you will be perfect for the job. Now all that remains is convincing the interviewer. Don't you wish you knew what kind of questions they are going to ask?

As co-founder and one of the chief trainers at Jigsaw Academy, an online analytics training institute, I regularly get calls from our students in the days before their scheduled interviews asking me just this. I am going to share with you just what I share with them. Below are a few of the more popular questions you could get asked, and the corresponding answers in a nutshell.

Question 1. Can you outline the various steps in an analytics project?

Broadly speaking, these are the steps. Of course, they may vary slightly depending on the type of problem, the data, the tools available, etc.

1. Problem definition – The first step is of course to understand the business problem. What is the problem you are trying to solve – what is the business context? Very often, however, your client may just give you a whole lot of data and ask you to do something with it. In such a case you would need to take a more exploratory look at the data. Nevertheless, if the client has a specific problem that needs to be tackled, then the first step is to clearly define and understand the problem. You will then need to convert the business problem into an analytics problem. In other words, you need to understand exactly what you are going to predict with the model you build. There is no point in building a fabulous model, only to realise later that what it is predicting is not exactly what the business needs.

2. Data Exploration - Once you have the problem defined, the next step is to explore the data and become more familiar with it. This is especially important when dealing with a completely new data set.


3. Data Preparation – Now that you have a good understanding of the data, you will need to prepare it for modelling. You will identify and treat missing values, detect outliers, transform variables, create binary variables if required, and so on. This stage is heavily influenced by the modelling technique you will use at the next stage. For example, regression involves a fair amount of data preparation, decision trees may need less prep, whereas clustering requires a whole different kind of prep compared to other techniques.

4. Modelling – Once the data is prepared, you can begin modelling. This is usually an iterative process: you run a model, evaluate the results, tweak your approach, run another model, evaluate the results again, re-tweak, and so on. You keep going until you arrive at a model you are satisfied with, or at what you feel is the best possible result with the given data.

5. Validation – The final model (or maybe the best 2-3 models) should then be put through the validation process. In this process, you test the model using a completely new data set, i.e. data that was not used to build the model. This ensures that your model is a good model in general, and not just a very good model for the specific data used earlier (technically, this is called avoiding overfitting).

6. Implementation and tracking – The final model is chosen after validation. You then start implementing the model and tracking the results. You need to track results to see the performance of the model over time. In general, the accuracy of a model goes down over time. How much time really depends on the variables – how dynamic or static they are – and on the general environment – how static or dynamic that is.

 

Question 2.   What do you do in data exploration?

Data exploration is done to become familiar with the data. This step is especially important when dealing with new data. There are a number of things you will want to do in this step –

a. What is in the data – look at the list of all the variables in the data set. Understand the meaning of each variable using the data dictionary. Go back to the business for more information in case of any confusion.

b. How much data is there – look at the volume of the data (how many records) and at the time frame of the data (last 3 months, last 6 months, etc.).

c. Quality of the data – how much information is missing, and the quality of the data in each variable. Are all fields usable? If a field has data for only 10% of the observations, then maybe that field is not usable.

d. You will also identify some important variables and may do a deeper investigation of these, such as looking at averages, min and max values, and maybe the 10th and 90th percentiles as well.

e. You may also identify fields that you need to transform in the data prep stage.


 

Question 3: What do you do in data preparation?

In data preparation, you will prepare the data for the next stage i.e. the modelling stage. What you do here is influenced by the choice of technique you use in the next stage.

But some things are done in most cases – for example, identifying missing values and treating them, identifying outlier values (unusual values) and treating them, transforming variables, creating binary variables if required, etc.

This is also the stage where you will partition the data, i.e. create a training data set (to do the modelling) and a validation data set (to do the validation).
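The partitioning step can be sketched in a few lines of plain Python (the 70/30 split ratio here is just a common convention, not a rule):

```python
import random

def train_validation_split(records, train_fraction=0.7, seed=42):
    """Randomly partition records into a training set (used to build
    the model) and a validation set (held out to test it later)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # fixed seed: reproducible split
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```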

 

Question 4: How will you treat missing values?

The first step is to identify variables with missing values. Assess the extent of missing values. Is there a pattern in missing values? If yes, try and identify the pattern. It may lead to interesting insights.

If no pattern, then we can either ignore missing values (SAS will not use any observation with missing data) or impute the missing values.

Simple imputation – substitute with mean or median values

OR

Case-wise imputation – for example, if we have missing values in the income field, we can substitute a value based on similar cases, such as the median income within the same customer segment.
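Both imputation styles can be illustrated with a small standard-library sketch (the income/segment data here is made up for illustration):

```python
from statistics import median

incomes  = [52000, None, 61000, None, 48000]
segments = ["A",   "A",  "B",   "B",  "B"]

# Simple imputation: substitute the overall median for every missing value
overall = median(v for v in incomes if v is not None)
simple = [v if v is not None else overall for v in incomes]

# Case-wise imputation: substitute the median of the record's own segment
by_segment = {}
for seg, v in zip(segments, incomes):
    if v is not None:
        by_segment.setdefault(seg, []).append(v)
segment_median = {s: median(vs) for s, vs in by_segment.items()}
casewise = [v if v is not None else segment_median[s]
            for s, v in zip(segments, incomes)]
```

Case-wise imputation only makes sense when the grouping variable genuinely relates to the missing field, which is why spotting a pattern in the missing values matters first.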

 

Question 5: How will you treat outlier values?

You can identify outliers using graphical analysis and univariate analysis. If there are only a few outliers, you can assess them individually. If there are many, you may want to substitute the outlier values with the 1st percentile or the 99th percentile values.

If there is a lot of data, you may decide to ignore records with outliers.

Not all extreme values are outliers. Not all outliers are extreme values.
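The percentile substitution described above is often called winsorizing. A standard-library sketch (the percentile lookup here is a simple nearest-rank approximation, not an exact interpolated percentile):

```python
def winsorize(values, lower_pct=1, upper_pct=99):
    """Cap values below the lower percentile and above the upper
    percentile at those percentile values (nearest-rank method)."""
    ranked = sorted(values)
    n = len(ranked)
    lo = ranked[max(0, int(n * lower_pct / 100) - 1)]
    hi = ranked[min(n - 1, int(n * upper_pct / 100) - 1)]
    return [min(max(v, lo), hi) for v in values]
```

For example, an income of 1,000,000 in a field whose 99th percentile is 250,000 would be replaced by 250,000, while the rest of the record is kept.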

 

Question 6: How do you assess the results of a logistic regression analysis?


You can use different methods to assess how good a logistic model is.

a. Concordance – This tells you about the ability of the model to discriminate between the event happening and not happening.

b. Lift – It helps you assess how much better the model is compared to random selection.

c. Classification matrix – helps you look at the true/false positives and true/false negatives.
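Point (c) is straightforward to compute by hand; a sketch of a classification (confusion) matrix at an assumed 0.5 probability cut-off:

```python
def classification_matrix(actual, predicted_prob, cutoff=0.5):
    """Tabulate true/false positives and negatives for a logistic
    model's predicted probabilities at a chosen cut-off."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, p in zip(actual, predicted_prob):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["fn" if y == 1 else "tn"] += 1
    return counts
```

From these four counts you can derive accuracy, sensitivity and specificity; varying the cut-off is also how lift and concordance-style comparisons are usually explored.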

Some other general questions you will most likely be asked:

What have you done to improve your data analytics knowledge in the past year? What are your career goals?

Why do you want a career in data analytics?

The answers to these questions will have to be unique to the person answering it. The key is to show confidence and give well thought out answers that demonstrate you are knowledgeable about the industry and have the conviction to work hard and excel as a data analyst.

The Top 5 Questions a Data Scientist Should Ask During a Job Interview

Posted on July 29, 2013 by Sean Murphy

The data science job market is hot and an incredible number of companies, large and small, are advertising a desperate need for talent.

Before jumping on the first 6-figure offer you get, it would be wise to ask the penetrating questions below to make sure that the seemingly golden opportunity in front of you isn’t actually pyrite.

1) Do they have data?

You might get a good laugh at this one and probably assume that this company interviewing you must have data as they are interviewing you (a data scientist). However, you know what they say about ass-u-ming, right?


If the company tells you that the data is coming (similar to the “check is in the mail”), start asking a lot more questions. Ask if the needed data sharing agreements have been signed and even ask to see them. If not, ask what the backup plan is for if (or when) the data does not arrive. Trust me, it always takes longer than everyone thinks.

To be an entrepreneur means to be an optimist at some level because otherwise no one would do something with such a low probability of success. Thus, it is pretty easy for an entrepreneur to assume that getting data will not be that hard. It will only be after months of stalled negotiations and several failures that they will give up on getting the data or, in startup parlance, pivot. In the meantime, you best figure out some other ways of being useful and creating value for your new organization.

2) Who will you report to and what is her or his background?

So, really what you are asking is: does the person who will claim me as a minion actually have experience with data and do they understand the amount of time that wrangling data can take?

If you are reporting to a Management/Executive type, this question is all-important, and your very survival likely depends on the answer.

First, go read the Gervais Principle at ribbonfarm. From my experience, the ideas aren’t too far off of the mark.

Second, many data-related tasks are conceptually trivial. However, these tasks can take an amount of time seemingly inversely proportional to their simplicity. Or, even worse, something that is conceptually very simple may be mathematically or statistically very challenging or require many difficult and time-consuming steps. Something like count the number of tweets for or against a particular topic is trivial for people but less so for algorithms.

Further, as everyone knows, data wrangling on any project can consume 80% or more of the total project time and, unless that manager has worked with data, she or he may not understand this reality. The rule of thumb never to forget is that if someone does not understand something, that person will almost always underappreciate it. I swear there must be a class in American MBA programs that teaches that if you don't understand something, it must be simple and take only five minutes.

If you are reporting to a CTO-type, the situation may seem better but it actually might be worse. Software engineering and development do not equal data science. Technical experience, most of the time, does not equal data experience. Having gone through a few semesters of calculus does not a statistics background make. Hopefully, I have made my point. There is a reason we call the fields software **engineering** (nice and predictable) and data **science** (conducting experiments to test hypotheses). However, many technically-oriented people may believe they know more than they actually do.

Third, your communications strategy will change radically depending on your boss' background. Do they want the sordid details of how you worked through the data, or do they just want the bottom-line impact?

Short version for #2: time expectations are important to flesh out up front and are highly dependent on your boss' background.

3) How will my progress and/or performance be measured?

Knowing how to succeed in your new workplace is pretty important, and the expectations surrounding data science are stratospheric at the moment. Keep your eyes peeled for a good quick win that would let you demonstrate your value (and this is a question that I would directly ask).

The giant red flag here is if you will be included in an “agile” software process with data-work shoehorned into short-term sprints along with the engineering or development team. Data Science is science and many tasks will often have you dealing with the dreaded unknown unknown. In other words, you are exploring terra incognita, a process that is unpredictable at best. Managing data scientists is very different than managing software engineers.

4) How many other data scientists/practitioners will you be working with, and how many are in the company overall?

What you are trying to understand here is how data-driven (versus ego-driven) the company that you are thinking of joining is.

If the company has existed for more than a few years and has few data science or analyst types, it is probably ego driven. Put another way, decisions are made by the HiPPOs (the HIghest Paid Person’s Opinions). If your data analyses are going to be used for internal decision making, this possibly puts you, the new hire, directly against the HiPPOs. Guess who will win that fight?  If you are going into this position, make sure you will be arming the HiPPO with knowledge as opposed to fighting directly against other HiPPOs.

5) Has anyone ever run analyses on the company’s data?

This one is critical if you will be doing any type of retrospective analysis based on previously collected data. If you simply ask the company whether they have ever looked at their data, the answer is often yes regardless of whether they have, as most companies don't want to admit that they haven't. Instead, ask what types of analyses the company has done on its data, whether the examination covered all of the company's data, and who did the work (being careful to inquire about that person's background and credentials).

The reason this line of questioning is so important is that the first time you plumb the depths of a company's database, you are likely to dig up some skeletons. And by likely I really mean certainly. In fact, going through historically collected data is much like an archeological excavation. As you go further back into the database, you go through deeper layers of the history of the organization and will learn much. You might find out when they changed contractors or when they decided to stop collecting a particular field that you just happen to need. You might see when the servers went down for a day or when a particularly well hidden bug prevented database writes for a few weeks. The important point here is that you might uncover issues that some people still present in the company would prefer not to be unearthed. My simple advice: tread lightly.

Nailing the Tech Interview

Jessica Kirkpatrick is the Director of Data Science at InstaEDU, and formerly a data scientist on the analytics team at Yammer (Microsoft). Before that she was an Astrophysicist at UC Berkeley and has also been an Insight mentor since the program's founding. Below is a guest post, originally appearing on the Women in Astronomy blog, where Jessica shares her tips on doing well in technical job interviews.

A year ago, I made the transition from astrophysicist to data scientist. One of the harder parts of making the transition was convincing a tech company (during the interview process) that I could do the job. Having now been on both sides of the interview table, I'd like to share some advice to those wishing to break into the tech/data science industry. While this advice is applicable to candidates in general, I'm going to be gearing it towards applicants coming from academia / PhD programs.

Most tech companies are interested in smart, talented people who can learn quickly and have good problem solving skills. We see academics as having these skills. Therefore, if you apply for internships or jobs at tech companies, you will most likely get a response from a recruiter. The problem is that once you get an interview, there are a lot of industry-specific skills that the company will try to assess, skills that you may or may not have already.

Below are some of the traits we look for when recruiting for the Yammer analytics/data team, descriptions of how we try to determine if a candidate has these traits, and what you should do to 'nail' this aspect of the interview.

1. Interest in the Position This sounds like a no-brainer, but you would be surprised at how many candidates haven't done proper research about the company or the position. It is especially important for people coming from academic backgrounds to demonstrate why they are interested in making this transition and why they are specifically interested in this opportunity.

When I ask a candidate "Why are you interested in joining my team?" I often get responses like "I really want to move to San Francisco" or "I'm sick of my research." Neither of these responses demonstrates specific interest in my team or my company.

How to Nail It: Do research about the position you are applying for. Understand what the role entails, the company's goals and priorities, and the product(s) that you will be working on. Have a convincing story for why you are making this career change or why you want to leave your current position. Show enthusiasm for the opportunity – every interviewer should think that their position is your number one choice and that you can't wait to join their team. More importantly, only apply for roles that you genuinely find interesting.

2. Excellent Problem Solving Skills One of the most challenging aspects of the analyst/data scientist role is taking a vague question posed by someone within the company, and figuring out how to best answer it using our data sets. Testing (and demonstrating) this skill in an interview is very difficult.

At Yammer we try to test this skill by asking a combination of open-ended problems, brain teasers, and scenarios similar to those we deal with on a regular basis. For many of these questions there isn't a right or wrong answer; we are more interested in the way the candidate constrains the problem, articulates her thought process, and how efficiently she gets to a solution. For some data science positions you will be asked to do coding problems. Familiarize yourself with some of the standard coding algorithms and questions.

How to Nail It: These types of problems are asked by many tech companies and there are plenty of examples of them on the web. Practice constraining the problem, coming up with a clear game plan, articulating that plan, and then following through in a methodical way. Many problems are hard to answer as posed, so trying simpler versions of the problem or looking at edge cases can give you insight into how to find patterns. Sometimes not all the relevant information is given by the interviewer; don't be afraid to ask clarifying questions or turn the process into a discussion. If the interviewer tries to give you hints or tips, take them. There is nothing more frustrating (as an interviewer) than trying to guide a candidate back on track and having her ignore your help (it also doesn't bode well for the interviewee's ability to work well with others).

3. Communication Skills As I said in a previous post, communication is key. We are looking for someone who can clearly articulate her thought process, and who can be convincing that her approach is correct even when being challenged by the interviewer. A standard way we test this is by posing an open-ended question; when the interviewee gives a reasonable answer, we give a reason why that isn't right, then she comes up with a different explanation, and we negate it again. We keep going to see how she deals with having to switch directions, and how she balances defending her answer with being flexible and taking the interviewer's suggestions.

How to Nail It: Practice articulating your approach and methods for the above 'technical' interview questions. Come up with a big-picture game plan for approaching the problem, be clear about that plan, have a methodical approach, and then execute it, all the while articulating your thought process as much as possible. If the interviewer tries to make you change directions, it's OK to defend your approach, but you don't want to be too rigid; they might be trying to keep you from going down the wrong path. Try to make the interaction as pleasant and warm as possible. Avoid getting defensive, frustrated, or just giving up. It is a very hard balance, but practice (especially with another person who can give you feedback) makes perfect.

4. Culture Fit In tech companies you work collaboratively on projects in tight-knit teams. We are going to be spending a lot of time with a candidate if we hire her; we want to enjoy that time together. Therefore we are also trying to assess whether the interviewee would be a good coworker at an interpersonal level. Is she friendly? Does she work well on teams? Does she have the right balance of being opinionated but not domineering? Is she an interesting person? What are her passions and goals?

I can't tell you how many times I've asked a candidate "What do you like to do for fun?" and they answer: "I like to read programming books." Is that really what you like to do for fun? Or do you just think that is what you are supposed to say in a tech interview?

How to Nail It: Remember that your interviewer is a person too, and interact with them as a person. Try to show some of your personality, passion, sense of humor, and uniqueness in the interview. It's hard to be relaxed in these situations, but personality goes a long way.

5. Ask Good Questions At the end of the interview you will typically have a chance to ask questions. This is your time to take control of the process and turn the tables on the interviewer. Sometimes I learn the most about a candidate from how she uses this portion of the interview. An interviewee I am on the fence about can really tip the decision one way or the other by asking intelligent, thought-provoking, and engaging questions at the end of an interview (or boring, uninformed, or generic ones).

How to Nail It: Use this as an opportunity to communicate things you weren't able to show in other parts of the interview. Demonstrate that you have researched the company, that you understand their business goals and the way you could contribute. Ask thoughtful questions about the role, demonstrate that you want something that is challenging and discuss types of skills you want to learn or apply. Use this opportunity to show the interviewer what skills you can bring to the role. If applicable, try to relate what you are learning about the job/company to what you've done in the past. Prepare tons of questions, write them down ahead of time and bring them to the interview. You shouldn't run out of questions or have to repeat them over the course of the day.

The above is by no means an exhaustive list of everything a tech company is looking for, and of course different companies have different approaches. When I interviewed for my current job (I recently moved to the education start-up InstaEDU), most of the interview involved discussing my previous projects, the problems that the company was facing, and how I could provide value to them as a data scientist. It was a very different experience interviewing for my second job in the tech industry than for my first. However, I do hope that the above demystifies the tech interview process, gives you insight into how one company goes about hiring data people, and helps you understand what we are looking for on the other side of the table.



The Business Analyst job description may vary from one company to another. The job requirements of a person filling a business analyst position depend on the business nature of a given company.

Therefore, each Business Analyst job interview can be completely different.

Note: a Business Analyst is sometimes referred to as a Systems Analyst, an Engineering Analyst, or an IT Business Analyst.

This article provides samples of job interview questions for a business analyst position.

In general, the business analyst job description is:

Effectively translate business needs to applications and operations. Challenge cross-company units and provide the requirements for the R&D team.

Produce use cases, survey/scenario analysis and workflow analysis; evaluate information; act as a focal point for internal and external customers; define users' needs and convert them into business cases.

The skills requirements for a business analyst are:

Leadership

Decision making

Conflict resolution

Presentation skills

Excellent verbal and written communication skills

Interpersonal communication skills

Analytical thinking and negotiation skills

Sample Business Analyst Interview Questions

1. Describe your responsibilities as a business analyst in your last job.

2. What BI (Business Intelligence) reporting tools have you used for a given project? Describe the project, the BI tool(s) and the reports extracted.

3. How do you select which BI tool to implement? How do you decide on the report frequency, update frequency and user needs based on the objectives you want to achieve for that report? Refer to BI tools such as Cognos, Discoverer, Business Objects, Crystal Reports, etc.

4. Can you list the skills needed to perform effectively as a Business Analyst?

5. What types of modelling requirements are used in the business applications of the analyst?

6. If two companies are merging, explain which tasks you would implement, and how, to ensure a successful union.

7. Have you been responsible for assigning tasks to testers? How were you required to integrate the results found? How do you coordinate these responsibilities with the team and your management?

8. Can you explain the term "push back" in relation to business users? What does it mean to you?

9. When working on a project, at what point would a Traceability Matrix be implemented, and for what purpose?

10. For process testing, can you explain the role of the Business Analyst?

11. When working with specific document requirements, can you explain or define the steps to create Use Cases?

12. At what point of a project is the Use Case system complete? What are the next steps in the project phase?

Some other technical questions that may be asked during the job interview include:

Technical terms and Technical questions

Some technical terms that may be used by the interviewer to verify your knowledge and competencies would be:

UML modeling

GAP analysis

SDLC methodologies

Traceability Matrix

RUP (Rational Unified Process) implementation

UI Designs and UI Design Patterns

System Design Document (SDD)

Requirement Management Tool, Requirements Modeling

Use Case and Test Case

Risk Management

Business plan

Data mapping


Black box testing and White box testing

Push Back from Business Users

Waterfall Method and Prototyping Model and their hybrid

Interface / Integration mapping

Functional requirements: FSD (Functional Spec Document) or FRS/MRD

End user support and user acceptance testing (UAT)

Validation of the requirements

Determining ROI, cash flow, break-even point, fixed/variable costs and sale price

Answer: This is your time to review your knowledge of the business terms above. Make sure you have a good understanding of these buzzwords of the industry you are interested in. Read the job description from the opening again; if you find special requirements, prepare yourself to answer related questions. If you have expertise in several specific areas, find the right time during the interview to speak about the professional expertise you have gained in your career.

How to Face an Interview for a Data Analyst

Facing a job interview for a data analyst position, sometimes referred to as a statistician position, can be intimidating. Analysts often have to evaluate, sort and report on data that is incomplete or erroneous, so an interviewer will likely ask how you handle those assignments. Don't get rattled by tough questions. Stay positive and use personal examples from previous projects to support your skills and experience.

Data-Gathering Experience

Data analysts are often responsible for gathering and compiling data from various reputable sources before making evaluations, drawing conclusions and issuing reports. Expect the hiring manager to ask questions like, "How do you go about collecting information to support your analyses?" or, "What types of data have you researched and analyzed in the past?" The employer might need data analyses to create new advertising strategies, prepare short and long-term finance budgets, or determine which company products are most profitable. Answer data-collecting questions with specific examples of how you successfully used group samples, conducted market research, reviewed financial reports or analyzed surveys to make fair and consistent assessments.


Validity of Data

Data isn't always accurate, complete, understandable, consistent, predictable or beneficial to meeting a company's goals, so expect interview questions about your methods for verifying and validating information. You might discuss ways you take averages, find medians, double-check questionable entries, find alternative research to support your findings or consult specialists. Most importantly, you want to show the interviewer that you are an effective problem-solver, troubleshooter and decision-maker so she has no reason to question your skills or capabilities.

Software

The hiring manager will likely ask about your computer skills and experience using analytic software. Data analysts process collected data and reach conclusions with the help of computer software, according to the U.S. Bureau of Labor Statistics. Discuss any experience you've had with statistical software, such as Stata, RStudio, PSPP or GMDH Shell. If most of your previous work has been with inter-office spreadsheets or Microsoft Excel files, assure the interviewer that you are proficient with those types of data files and would be willing to learn any new software programs necessary for the job.

Communication and Presentation Skills

Data analysts must communicate results, findings and future goals using visual aids, such as charts, graphs and infographics. The interviewer will likely ask, "What are your communication strengths?" or, "Explain how you organize and create presentations to report analytical findings." Answer these questions with specific examples of presentations, reports and seminars you've created or hosted. The interviewer wants assurance that you have the people skills and interpersonal strengths to effectively relay your analyses and results.

References U.S. Bureau of Labor Statistics: Statistician

Resources Forbes: Analytics is Fast Becoming a Core Competency for Business Professionals Psychology Today: Secrets to a Successful Job Interview

About the Author

Kristine Tucker has been writing articles on finance, politics, humanities and interior design since 2001. Her articles have been featured in many online publications. Tucker's experience as an English teacher has given her the opportunity to read many wonderful masterpieces. She holds a degree in political science with a minor in international studies.

 



Retiring a Great Interview Problem

August 8th, 2011 · 105 Comments · General

Interviewing software engineers is hard. Jeff Atwood bemoans how difficult it is to find candidates who can write code. The tech press sporadically publishes “best” interview questions that make me cringe — though I love the IKEA question. Startups like Codility and Interview Street see this challenge as an opportunity, offering hiring managers the prospect of outsourcing their coding interviews. Meanwhile, Diego Basch and others are urging us to stop subjecting candidates to whiteboard coding exercises.

I don’t have a silver bullet to offer. I agree that IQ tests and gotcha questions are a terrible way to assess software engineering candidates. At best, they test only one desirable attribute; at worst, they are a crapshoot as to whether a candidate has seen a similar problem or stumbles into the key insight. Coding questions are a much better tool for assessing people whose day job will be coding, but conventional interviews — whether by phone or in person — are a suboptimal way to test coding strength. Also, it’s not clear whether a coding question should assess problem-solving, pure translation of a solution into working code, or both.

In the face of all of these challenges, I came up with an interview problem that has served me and others well for a few years at Endeca, Google, and LinkedIn. It is with a heavy heart that I retire it, for reasons I’ll discuss at the end of the post. But first let me describe the problem and explain why it has been so effective.

The Problem

I call it the “word break” problem and describe it as follows:

Given an input string and a dictionary of words, segment the input string into a space-separated sequence of dictionary words if possible. For example, if the input string is "applepie" and the dictionary contains a standard set of English words, then we would return the string "apple pie" as output.

Note that I’ve deliberately left some aspects of this problem vague or underspecified, giving the candidate an opportunity to flesh them out. Here are examples of questions a candidate might ask, and how I would answer them:

Q: What if the input string is already a word in the dictionary?
A: A single word is a special case of a space-separated sequence of words.

Q: Should I only consider segmentations into two words?
A: No, but start with that case if it's easier.

Q: What if the input string cannot be segmented into a sequence of words in the dictionary?
A: Then return null or something equivalent.

Q: What about stemming, spelling correction, etc.?
A: Just segment the exact input string into a sequence of exact words in the dictionary.

Q: What if there are multiple valid segmentations?
A: Just return any valid segmentation if there is one.

Q: I'm thinking of implementing the dictionary as a trie, suffix tree, Fibonacci heap, ...
A: You don't need to implement the dictionary. Just assume access to a reasonable implementation.

Q: What operations does the dictionary support?
A: Exact string lookup. That's all you need.

Q: How big is the dictionary?
A: Assume it's much bigger than the input string, but that it fits in memory.

Seeing how a candidate negotiates these details is instructive: it offers you a sense of the candidate’s communication skills and attention to detail, not to mention the candidate’s basic understanding of data structures and algorithms.

A FizzBuzz Solution

Enough with the problem specification and on to the solution. Some candidates start with the simplified version of the problem that only considers segmentations into two words. I consider this a FizzBuzz problem, and I expect any competent software engineer to produce the equivalent of the following in their programming language of choice. I’ll use Java in my example solutions.

String SegmentString(String input, Set<String> dict) {
    int len = input.length();
    for (int i = 1; i < len; i++) {
        String prefix = input.substring(0, i);
        if (dict.contains(prefix)) {
            String suffix = input.substring(i, len);
            if (dict.contains(suffix)) {
                return prefix + " " + suffix;
            }
        }
    }
    return null;
}

I have interviewed candidates who could not produce the above — including candidates who had passed a technical phone screen at Google. As Jeff Atwood says, FizzBuzz problems are a great way to keep interviewers from wasting their time interviewing programmers who can’t program.

A General Solution

Of course, the more interesting problem is the general case, where the input string may be segmented into any number of dictionary words. There are a number of ways to approach this problem, but the most straightforward is recursive backtracking. Here is a typical solution that builds on the previous one:

String SegmentString(String input, Set<String> dict) {
    if (dict.contains(input)) return input;
    int len = input.length();
    for (int i = 1; i < len; i++) {
        String prefix = input.substring(0, i);
        if (dict.contains(prefix)) {
            String suffix = input.substring(i, len);
            String segSuffix = SegmentString(suffix, dict);
            if (segSuffix != null) {
                return prefix + " " + segSuffix;
            }
        }
    }
    return null;
}

Many candidates for software engineering positions cannot come up with the above or an equivalent (e.g., a solution that uses an explicit stack) in half an hour. I’m sure that many of them are competent and productive. But I would not hire them to work on information retrieval or machine learning problems, especially at a company that delivers search functionality on a massive scale.
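The post mentions that a solution using an explicit stack is an acceptable equivalent. As one illustration (my own sketch, not code from the post), the recursion can be replaced by a depth-first search over start positions, with a parent pointer per position so a successful search can be unwound into the actual segmentation:

```java
import java.util.*;

public class WordBreakIterative {
    // Iterative word break using an explicit stack instead of recursion.
    // parent[j] records the start of the word that first reached position j,
    // so a successful search can be unwound into the segmentation.
    static String segmentString(String input, Set<String> dict) {
        int n = input.length();
        int[] parent = new int[n + 1];
        boolean[] seen = new boolean[n + 1];
        Deque<Integer> stack = new ArrayDeque<>();
        seen[0] = true;
        stack.push(0);
        while (!stack.isEmpty()) {
            int start = stack.pop();
            if (start == n) {
                // Walk the parent pointers back from the end of the string.
                List<String> words = new ArrayList<>();
                for (int end = n; end != 0; end = parent[end])
                    words.add(input.substring(parent[end], end));
                Collections.reverse(words);
                return String.join(" ", words);
            }
            for (int end = start + 1; end <= n; end++) {
                if (!seen[end] && dict.contains(input.substring(start, end))) {
                    seen[end] = true;      // each position is expanded at most once
                    parent[end] = start;
                    stack.push(end);
                }
            }
        }
        return null; // no segmentation exists
    }
}
```

Marking each position the first time it is reached means no start position is expanded twice, which already avoids the exponential blowup analyzed below.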

Analyzing the Running Time

But wait, there's more! When a candidate does arrive at a solution like the above, I ask for a big-O analysis of its worst-case running time as a function of n, the length of the input string. I've heard candidates respond with everything from O(n) to O(n!).

I typically offer the following hint:


Consider a pathological dictionary containing the words "a", "aa", "aaa", ..., i.e., words composed solely of the letter 'a'. What happens when the input string is a sequence of n-1 'a's followed by a 'b'?

Hopefully the candidate can figure out that the recursive backtracking solution will explore every possible segmentation of this input string, which reduces the analysis to determining the number of possible segmentations. I leave it as an exercise to the reader (with this hint) to determine that this number is O(2^n).
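To make the exercise concrete, here is a small counting sketch (mine, not from the post) that enumerates segmentations against the pathological all-'a' dictionary. The count doubles with each added character, since each of the k-1 gaps in a run of k 'a's is independently a break or not:

```java
import java.util.*;

public class SegmentationCount {
    // Counts how many distinct segmentations of s into dictionary words exist.
    // This mirrors the work naive backtracking does in the worst case.
    static long countSegmentations(String s, Set<String> dict) {
        if (s.isEmpty()) return 1; // one way to segment the empty string
        long count = 0;
        for (int i = 1; i <= s.length(); i++)
            if (dict.contains(s.substring(0, i)))
                count += countSegmentations(s.substring(i), dict);
        return count;
    }

    public static void main(String[] args) {
        // Pathological dictionary: "a", "aa", ..., up to ten a's.
        Set<String> dict = new HashSet<>();
        StringBuilder w = new StringBuilder();
        for (int i = 0; i < 10; i++) dict.add(w.append('a').toString());
        for (int k = 1; k <= 10; k++)
            System.out.println(k + " a's -> " + countSegmentations("a".repeat(k), dict));
        // counts are 1, 2, 4, 8, ..., 512: i.e. 2^(k-1)
    }
}
```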

An Efficient Solution

If a candidate gets this far, I ask if it is possible to do better than O(2^n). Most candidates realize this is a loaded question, and strong ones recognize the opportunity to apply dynamic programming or memoization. Here is a solution using memoization:

Map<String, String> memoized;

String SegmentString(String input, Set<String> dict) {
    if (dict.contains(input)) return input;
    if (memoized.containsKey(input)) {
        return memoized.get(input);
    }
    int len = input.length();
    for (int i = 1; i < len; i++) {
        String prefix = input.substring(0, i);
        if (dict.contains(prefix)) {
            String suffix = input.substring(i, len);
            String segSuffix = SegmentString(suffix, dict);
            if (segSuffix != null) {
                memoized.put(input, prefix + " " + segSuffix);
                return prefix + " " + segSuffix;
            }
        }
    }
    memoized.put(input, null);
    return null;
}

Again the candidate should be able to perform the worst-case analysis. The key insight is that SegmentString is only called on suffixes of the original input string, and that there are only O(n) suffixes. I leave it as an exercise to the reader to determine that the worst-case running time of the memoized solution above is O(n^2), assuming that the substring operation only requires constant time (a discussion which itself makes for an interesting tangent).

Why I Love This Problem

There are lots of reasons I love this problem. I'll enumerate a few:


It is a real problem that came up in the course of developing production software. I developed Endeca's original implementation for rewriting search queries, and this problem came up in the context of spelling correction and thesaurus expansion.

It does not require any specialized knowledge -- just strings, sets, maps, recursion, and a simple application of dynamic programming / memoization: basics that are covered in a first- or second-year undergraduate course in computer science.

The code is non-trivial but compact enough to use under the tight conditions of a 45-minute interview, whether in person or over the phone using a tool like Collabedit.

The problem is challenging, but it isn't a gotcha problem. Rather, it requires a methodical analysis of the problem and the application of basic computer science tools.

The candidate's performance on the problem isn't binary. The worst candidates don't even manage to implement the FizzBuzz solution in 45 minutes. The best implement a memoized solution in 10 minutes, allowing you to make the problem even more interesting, e.g., asking how they would handle a dictionary too large to fit in main memory. Most candidates perform somewhere in the middle.

Happy Retirement

Unfortunately, all good things come to an end. I recently discovered that a candidate posted this problem on Glassdoor. The solution posted there hardly goes into the level of detail I've provided in this post, but I decided that a problem this good deserved to retire in style.

It's hard to come up with good interview problems, and it's also hard to keep secrets. The secret may be to keep fewer secrets. An ideal interview question is one for which advance knowledge has limited value. I'm working with my colleagues on such an approach. Naturally, I'll share more if and when we deploy it.

In the meantime, I hope that everyone who experienced the word break problem appreciated it as a worthy test of their skills. No problem is perfect, nor can performance on a single interview question ever be a perfect predictor of how well a candidate will perform as an engineer. Still, this one was pretty good, and I know that a bunch of us will miss it.

105 responses so far ↓

1 rp1 // Aug 8, 2011 at 2:59 am

Why is this not just n lookups? Start with one character. If that fails, look up the first two, etc. When a lookup succeeds, insert a space and set the string to start at the next unexamined character.

Thus “aaaaa” becomes “a a a a a”. Works great with a trie.

2 facepalm // Aug 8, 2011 at 3:26 am


rp1, you’d fail the interview

the author already explained why in the article.

3 binarysolo // Aug 8, 2011 at 3:40 am

rp1 – You need to consider cases where a word is composed of a valid word base with a suffix that is not a standalone word.

Say, valid word + an invalid word, such as: "shorter" -> short + er = (valid) + (invalid).

Recursion just makes the most sense as the author wrote; reducing the problem by accounting for the string from the back. As he points out, the bad case is when you have tons of viable character combinations that remain workable with prior chars in combinations, and one persistent incompatibility that forces the exploration of the entire space of solutions.

Er, I’m not articulating it well; just read his answer which explains it a lot better.

4 Grant Husbands // Aug 8, 2011 at 3:43 am

@rp1, you’re failing to think about backtracking; imagine the input is “aaaaab”.

5 Seems easy // Aug 8, 2011 at 3:44 am

rp1's solution seems perfectly fine to me. The author did not explain why I couldn't use it.

If you think differently explain why.

6 Rick Williams // Aug 8, 2011 at 3:46 am

You make “interview candidates” sign NDAs in order to interview. That’s amazing. Mind sharing the text of the NDA, I’d like to see what it covers. Thanks in advance.

7 binarysolo // Aug 8, 2011 at 3:46 am

Incidentally I’m kinda surprised this is considered a worthy-enough interview prob. I’m not a programmer by trade, though I enjoyed the CS106x+107 sequence at a li’l school near Daniel’s location and would think that a freshman with some amount of CS thinking/experience would trivially solve this.

8 zui // Aug 8, 2011 at 3:54 am

the main issue with such interviews is that 50% of the candidates have trouble answering these under stress, no matter how simple.


i know its an issue hard to solve.

interviewers don't care about finding the proper candidate. they care that a candidate passed tests, so they can't be blamed if the candidate does not perform well enough at work.

interviewers care about a level of assurance, if you prefer. hey, it's not as bad as it sounds: the candidate is indeed likely to fit the position.

will he really fit, will he like it, and is he actually good at solving *new* problems or just solving classic interview quizzes? who cares, that's a risk the interviewer is willing to take.

A true interviewer, that is, someone whose main care is that the candidate is going to be helping out the company, is not going to give all these stupid questions the same way. A true interviewer, while he has a lot of questions and steps prepared, will not just serve them to the candidate. He will get to know the candidate in a short amount of time and ask the right questions, ones which may not even have been prepared before.

a true interviewer sees the challenge in every candidate he interviews, and not just the opposite (where the candidate is challenged by the interviewer’s questions)

a true interviewer disregards the crap (hello, best-questions lists) and focuses on what matters.

finally, new talents in the IT world are HARD to come by, so there's a lot of competition to recruit them. well, let me tell you this straight: there are many talents which are just discarded by the review/interview process that are there for you to pick from.

that's how we find gems that everyone else passed on. then people wonder why my team always outperforms theirs.

9 tzs // Aug 8, 2011 at 4:32 am

Interesting problem, which I don’t think I’ve seen before. Here’s the O(n^2) solution I’d have come up with if confronted with this thing.

Build a directed graph consisting of vertices labeled 0, 1, 2, …, n, where n is the length of the string, where there is an edge from k to j if and only if there is a dictionary word of length j-k starting at position k in the string.

This can be done in O(n^2).

Solutions to the problem then correspond to paths through the graph from 0 to n. Use something like Dijkstra’s algorithm to find a minimal path in O(n^2), which corresponds to a solution that uses the smallest number of dictionary words to exactly cover the string.
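tzs's construction can be sketched as follows (my illustration, not tzs's code). Since every edge costs exactly one word, a plain breadth-first search from vertex 0 already finds a minimal path; Dijkstra's algorithm works too but isn't needed for unit weights:

```java
import java.util.*;

public class GraphWordBreak {
    // Sketch of the graph formulation: vertices 0..n, an edge i -> j whenever
    // input[i..j) is a dictionary word. Every edge costs one word, so BFS from
    // vertex 0 reaches vertex n via a path using the fewest dictionary words.
    static String fewestWords(String input, Set<String> dict) {
        int n = input.length();
        int[] parent = new int[n + 1];
        Arrays.fill(parent, -1);
        parent[0] = 0;
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(0);
        while (!queue.isEmpty()) {
            int i = queue.poll();
            for (int j = i + 1; j <= n; j++) {
                if (parent[j] == -1 && dict.contains(input.substring(i, j))) {
                    parent[j] = i; // first visit in BFS order = fewest words
                    queue.add(j);
                }
            }
        }
        if (parent[n] == -1) return null; // vertex n unreachable: no segmentation
        List<String> words = new ArrayList<>();
        for (int j = n; j != 0; j = parent[j])
            words.add(input.substring(parent[j], j));
        Collections.reverse(words);
        return String.join(" ", words);
    }
}
```

Building all the edges costs O(n^2) substring lookups, matching tzs's analysis.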

10 Benjash // Aug 8, 2011 at 4:53 am


Im not a fully blown coder by any means.

But, I was thinking a solution would be to break the string into consonant clusters (syllables maybe?) and group them into weighted groups. Or some algorithm measuring lexical units within the string.

some database of word clusters like:

String = applepieisgreat

3-letter clusters = "app", "eat", "ple", "ond", etc.
2-letter clusters = "ap", "le", "pi", "is", etc.

Then build up the words: app + le / ap + ple / ple + pie / ple + pi / pie + is / etc.

Then, working back from the biggest clusters, run lookups against the dictionary to see if it's a word. Then it would be a simple jigsaw puzzle of what words fit in the string.

11 Bob // Aug 8, 2011 at 4:54 am

@Seems easy

Let’s assume that this is the dictionary:

this
text
is
short
shorter

Now let's apply rp1's solution to the problem. It will walk along the string until it finds a match and apply that match to the output. It produces this:

this text is short er

“short” is obviously a match for the first 5 characters of “shorter”. Since the solution analyses those 5 characters first and finds a match, it happily accepts the word “short”. It now has the string “er” to analyse. “er” isn’t in the dictionary, so we’ll assume it just gets tagged on the end instead of discarded.

The correct behaviour here would be to start backtracking. If we have "er" as an unmatched string, it is possible that it connects with something we've matched previously. So, let's try "ter". Nope, no match. Now "rter". Nope, no match. However, when we eventually try "shorter", we've got a match. With backtracking, we've found a solution that works for everything in the string:

“this text is shorter”

Clearly this isn't O(n) as we have to re-examine components of the string multiple times. In fact, rp1's algorithm is O(n^2), not O(n) as he suggests.

12 rcf // Aug 8, 2011 at 5:02 am

@ Seems easy

Suppose the words are a, aa, aaa, ab. Suppose the input is aaab. The proper segmentation is aa+ab, but if you do it rp1's way you'd try a+? and fail to find a segmentation. If you also started with the longer string aaa you'd still fail. In the worst-case scenario there's a tiny chance you'd get the right answer greedily.
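rcf's counterexample is easy to check mechanically. The sketch below (mine, not from the thread) pits a greedy longest-match scanner against full backtracking on input "aaab" with dictionary {"a", "aa", "aaa", "ab"}: greedy gets stuck on the trailing 'b', while backtracking finds a valid split. A shortest-match-first greedy fails the same way, consuming the third 'a' alone instead of starting "ab":

```java
import java.util.*;

public class GreedyVsBacktracking {
    // Greedy scanner: always take the longest dictionary word at the
    // current position, with no backtracking.
    static String greedyLongest(String input, Set<String> dict) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            int match = -1;
            for (int j = input.length(); j > i; j--)   // longest match first
                if (dict.contains(input.substring(i, j))) { match = j; break; }
            if (match == -1) return null;              // stuck: cannot recover
            if (out.length() > 0) out.append(' ');
            out.append(input, i, match);
            i = match;
        }
        return out.toString();
    }

    // Full recursive backtracking, as in the original post's general solution.
    static String backtrack(String input, Set<String> dict) {
        if (input.isEmpty()) return null;
        if (dict.contains(input)) return input;
        for (int i = 1; i < input.length(); i++) {
            String rest;
            if (dict.contains(input.substring(0, i))
                    && (rest = backtrack(input.substring(i), dict)) != null)
                return input.substring(0, i) + " " + rest;
        }
        return null;
    }
}
```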

13 Tom White // Aug 8, 2011 at 5:17 am

Coder interviews are “rocket science”. How to detect the best programmers is a hard problem.

But, this coding test is for a programmer who is going to be hired to write a string-processing library (or similar). That might not be your goal.

There is also a pretty high degree of quirks in this problem. Most programmers come from a background of processing whole lines, whole sentences, whole words, etc. This seems to be asking the programmer to make one dictionary lookup *per letter*. Perhaps too much of a brain shift for a high-pressure interview situation.

Then, you think you speed it up by creating a O(n) cache to hold words you have seen when the O(log n) dictionary has all the words in memory (and the virtual memory pages that get loaded wouldn’t get paged out to disk during the one millisecond this runs). Your cache can only really help if it is the case that the program, the cache, and the input string are quite small and fit in the fastest level of processor cache.

This article and exercise confirms my experience that most interviewers are not good enough programmers themselves and unbiased enough to recognize the great programmers when they pass by. But, that’s OK, the HR department has usually already rejected them due to some “problem with their resume”.

You are correct that a coding problem is absolutely necessary. It needs to be maximally understandable so as to overcome the candidate's anxiety. It needs to start easy and work up via interactive discussion to show the programmer's actual level of competence. The automated websites for this look interesting but can only do a first level of checking for basic programming ability.

We all need to step outside ourselves and understand what makes someone a good programmer: smart, quick learner, and good to work with.

14 rcf // Aug 8, 2011 at 5:53 am

@Tom White

The cache has nothing to do with speeding up lookups or the size of the dictionary. If you have a string of length L, then the standard backtracking solution will in the worst case do 2^L dictionary lookups. By keeping track of which substrings are possible to segment, we can reduce this to ~L lookups.
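rcf's point can be sketched as follows (my illustration, not rcf's code): it is enough to memoize failures. Each start position that cannot be segmented is recorded once, so the search never re-explores it and the 2^L blowup disappears:

```java
import java.util.*;

public class WordBreakDeadEnds {
    // Backtracking with a "dead end" set: start positions known to be
    // unsegmentable are recorded, so each position fails at most once.
    static String segment(String input, Set<String> dict) {
        return helper(input, 0, dict, new HashSet<>());
    }

    private static String helper(String s, int start, Set<String> dict,
                                 Set<Integer> deadEnds) {
        if (start == s.length()) return "";          // segmented everything
        if (deadEnds.contains(start)) return null;   // known failure: prune
        for (int end = start + 1; end <= s.length(); end++) {
            if (dict.contains(s.substring(start, end))) {
                String rest = helper(s, end, dict, deadEnds);
                if (rest != null)
                    return rest.isEmpty() ? s.substring(start, end)
                                          : s.substring(start, end) + " " + rest;
            }
        }
        deadEnds.add(start); // nothing worked from here; never retry
        return null;
    }
}
```

This records the complement of what the post's memoized solution stores, but has the same effect on the worst case: only O(L) start positions can ever fail.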

15 Ian Wright // Aug 8, 2011 at 5:57 am

As someone who hasn't coded since high school, I'm curious what legal actions you take against sites like Glassdoor that publish information barred by NDAs. I assume you would need to know who actually provided the information to the site to bring any sort of penalties against them. At the end of the day you're just closing the stable door after the horse has already bolted.

16 rocketsurgeon // Aug 8, 2011 at 7:34 am

There’s a bug in the general solution, it doesn’t work as written. Maybe your interview questions should include a section on testing?

17 Retric // Aug 8, 2011 at 7:43 am

Reading this I can't help but think you're falling into the Java trap of trying to make a ridiculously generic solution, which I would consider a danger sign but plenty of people love to see.

Anyway, assuming a real world input and real world dictionary you can try plenty of things that break down for a dictionary that includes four hundred A’s but are actually valid solutions. Also, if you want a fast real world solution then sticking with a pure lookup dictionary would slow things down. EX: Being able to toss 5 letters into a lookup table that says the longest and shortest words that start with those letters would save a lot of time. Basically, ‘xxyzy’ = null, null saves you from looking up x to the maximum word length in your dictionary. Secondly sanitizing inputs is going to be both safer and faster. Granted anything you are coding in 45 minutes is going to be a long way from production ready code.


PS: Even with backtracking you can code an O(n) solution for very large input sets. Just keep track of the K best output sequences that are aka N, N-1, N-2, …, N-K. (K being the length of the longest word in your dictionary.)

18 jeremy // Aug 8, 2011 at 8:03 am

Many candidates for software engineering positions cannot come up with the above or an equivalent (e.g., a solution that uses an explicit stack) in half an hour. I’m sure that many of them are competent and productive. But I would not hire them to work on information retrieval or machine learning problems, especially at a company that delivers search functionality on a massive scale.

I have a comment about the relationship of this question to “massive scale” information retrieval and machine learning problems. Naturally, no one wants a O(2^n) solution. But even when the solution is O(n^2), if the size of your input n is 10 billion (aka massive scale, web scale, big data), even an elegant, memoized algorithm isn’t tractable anyway, correct?

So while I indeed like this question, because it has a “progressive reveal” in levels of thinking, does someone working on web scale (massive) data really ever need to implement the highest level of thinking? Or is it that you just want a programmer who is aware of the issue, even if she or he never has to use it?

19 The “word break” problem | en_GB@blog // Aug 8, 2011 at 8:09 am

[...] Retiring a Great Interview Problem (via Retiring a Great Interview Problem), this is the best I could get in less than 30 [...]

20 betacoder // Aug 8, 2011 at 8:15 am

Trivial but noted: The simplified solution also has an off-by-one error.

21 Daniel Tunkelang // Aug 8, 2011 at 8:21 am

To those who noted the off-by-one error: thanks, it’s fixed now.

22 Daniel Tunkelang // Aug 8, 2011 at 8:23 am

@binarysolo

I also used to think this problem was too easy. Experience proved otherwise. In fact, I found at Google that performance on this problem correlated strongly to overall performance in the interview process, albeit with a limited sample size.

23 rcf // Aug 8, 2011 at 8:25 am


@Retric congratulations, you've reinvented dynamic programming with your O(n) solution. That'd be a perfect solution. Also, if your map is stored in sorted order (like in C++), a binary search will find you the longest word that fits the remainder of the string. (To also efficiently find all of the words that fit will require something extra.)

24 Daniel Tunkelang // Aug 8, 2011 at 8:27 am

@Rick Williams

It’s pretty common for companies to make candidates sign NDAs. I’ve learned about confidential information from several companies I interviewed with, typically as part of the sell of exciting things I could work on.

But forget the legalities, which are unenforceable in practice. Instead think about how you’d like yourself and other engineers to be assessed. Part of the point of interview problems is to make hiring about more than a person’s resume. It’s not clear who gains when interview problems are retired because they’ve been disclosed. Still, I recognize practically that it is impossible to maintain secrets once enough people know them. Hence my comment at the end of the post.

25 jdavid.net // Aug 8, 2011 at 8:29 am

Questions like this are grand if you are looking for a 'Search Engineer', but if you are looking for a front-end developer or someone who understands how to design a REST API, or something else useful, this is a bogus question.

All too often I have seen interviewers use a one-question-fits-all approach to interviewing. I have been out of college for a decade and have significant work experience in producing web pages and frameworks that respond to a browser or app in a specific way.

Questions like these filter me out because I have not been doing college papers or playing around in a programming languages class, but rather have been building real tools.

Questions like "how would you design your own web framework" might be better. Why is jQuery popular? How does it differ from other JS frameworks?

Can you write a delegated click handler? How does it differ from an assigned click handler?

Please ask a question relative to an applicant’s role/ experience.

There are plenty of questions out there that are easy for college grads to answer but hard for experienced programmers, and vice versa.

26 The Word Break Problem « Plus 1 Lab // Aug 8, 2011 at 8:42 am


[...] an interesting post on the Noisy Channel blog describing what author Daniel Tunkelang calls the word break problem.  I didn’t find this post interesting because the problem is a good interview question, I [...]

27 Daniel Tunkelang // Aug 8, 2011 at 9:00 am

@zui

No interview process is perfect. It's tempting to only hire people you've worked with — indeed, I'd be inclined to do that as a founder. But that doesn't scale, and it also biases the process towards people who are already well connected. Another approach is to focus almost entirely on the resume, but that has its own problems, from resume inflation to again favoring those who started with advantages.

I’m by no means perfect but I think I give candidates as fair a shot as I know how. I hired a candidate with no college degree and a high-school equivalency — she performed phenomenally as a software engineer and is now a managing director at Goldman. I try to put nervous candidates at ease. But there’s no getting around that an interview is an assessment process, and not everyone does well when they are being assessed.

I’m curious to hear the details of how you or your company interview candidates. Everyone benefits if we can make this process better.

28 Geek Reading August 8, 2011 | Regular Geek // Aug 8, 2011 at 9:02 am

[...] Retiring a Great Interview Problem [...]

29 Daniel Tunkelang // Aug 8, 2011 at 9:03 am

@Retric

Is it really so generic? I’ve simplified the problem a little, but it’s pretty close to a real problem I had to solve for implementing search features that would be deployed in a very broad range of domains, including for part-number search. I had the opportunity to see domain-specific heuristics break down.

30 Daniel Tunkelang // Aug 8, 2011 at 9:08 am

@jeremy

To clarify, the n here is the size of the input string, rather than the number of users or documents. But algorithms like these run in a high-traffic environment, with requests being processed concurrently. A feature whose cost blows up exponentially with a bad input can take a site down. Granted, there are other ways to guard against such failures (e.g., time-outs), but it's good to use algorithms that don't have such blow-up. And even better to understand the behavior of algorithms before rather than after those algorithms make it to production.

31 Grant Husbands // Aug 8, 2011 at 9:09 am

@rcf, @Retric: You don’t need to invent new structures for the dictionary; lots of variants of trie/NFA/DFA will do just fine. However, the question explicitly disallows changing the dictionary and its API. I’m sure Daniel is aware that O(N) solutions exist if the dictionary is improved, but interview questions are for exploring the abilities of a candidate rather than finding the best possible solution.

32 jeremy // Aug 8, 2011 at 9:33 am

Right, I understand that this example problem is a small input string. Perhaps what I was trying to ask is how indicative of a real world massive scale machine learning or information retrieval problem this example problem really is.

Agreed about being able to recognize that an algorithm might have quadratic or exponential blowup. But again, I'm asking about how realistically often one has to implement solutions that need dynamic programming when doing information retrieval machine learning on a web scale. I am assuming that you're not just talking about parsing the input. I assume when you say massive scale information retrieval and machine learning, you're working on algorithms to extract patterns from the data, to find relationships, to do (set-based) extraction of related entities. For example.

And in that case, the size of the input isn’t 100 or 1000. But millions. Billions. So again, how often is memoization necessary, in practice?

(And recognizing that something is going to be NP complete, or has a quadratic-time DP solution, or a quadratic time approximation, is different than being able to code that solution, during a half hour interview. So might it perhaps be better to test one’s ability to recognize say 7 or 8 different problems as to their potential complexity, rather than having a candidate write code for a single example?)

33 Daniel Tunkelang // Aug 8, 2011 at 9:50 am

@jeremy

Different problems test different skills. I’ve never used this question as the only determinant in assessing a candidate, but I’ve found that it provides more bits of information than most.

34 Retric // Aug 8, 2011 at 10:01 am

@Grant Husbands I was not suggesting you needed to recreate a dictionary. However, the ideal program for a human-language dictionary, where the longest word is 32-ish characters, versus some input text is very different from something you would use if you had 200 different ~100,000-character DNA sequences in your dictionary. Consider: with a long enough input string, iterating over the full dictionary and creating a quick index could easily reduce runtimes from years to seconds. And with a short enough input string, any optimizations are basically pointless. E.g., 'cat'.

35 Earl // Aug 8, 2011 at 10:01 am

Daniel,

I think you have a boundary problem in your example: the substring function in Java takes (to my mind) weird arguments. You have to call substring until the endIndex / right argument is equal to the string length. E.g.:

String test = "0123456";
puts("%s length %d", test, test.length());
for (int i = 1; i <= test.length(); i++)
    puts("prefix [%d, %d] -> %s", 1, i, test.substring(0, i));

produces

prefix [1, 1] -> 0
prefix [1, 2] -> 01
prefix [1, 3] -> 012
prefix [1, 4] -> 0123
prefix [1, 5] -> 01234
prefix [1, 6] -> 012345
prefix [1, 7] -> 0123456

36 Earl // Aug 8, 2011 at 10:05 am

Blech, html

http://codepad.org/Ald1l9vb

37 Retric // Aug 8, 2011 at 10:11 am

PS: By index I mean find size of the longest word, and or do other preprocessing.

38 Daniel Tunkelang // Aug 8, 2011 at 10:15 am

@Earl

Are you sure? Change the for loop to

for (int i = 1; i <= test.length(); i++)
    puts("prefix [%d, %d] -> %s", 0, i, test.substring(0, i));

produces

prefix [0, 1] -> 0
prefix [0, 2] -> 01
prefix [0, 3] -> 012
prefix [0, 4] -> 0123
prefix [0, 5] -> 01234
prefix [0, 6] -> 012345
prefix [0, 7] -> 0123456

39 Grant Husbands // Aug 8, 2011 at 10:19 am

@Retric: Your suggested 5-letter lookup was essentially a change to the dictionary API; it is to that that I was referring. For any lack of clarity on my part, I apologise. Anyway, there are plenty of ways of preprocessing the dictionary, and I mentioned common ones, but none of them fit the problem description, which explicitly disallows such preprocessing, making this debate irrelevant.

40 John H // Aug 8, 2011 at 10:33 am

Why are you retiring it? Because it’s out on the web?

I don't think having knowledge of the interview question ahead of time necessarily precludes its usefulness. Being able to deliver the solution quickly and efficiently is still a valuable assessment of skill. Also, assuming the candidate has to answer 4-6 diverse problems over the course of the day, it's still a pretty good screen to have them 'reproduce memorized answers' (assuming they knew them ahead of time). This is further mitigated by having a couple of problems you can switch between; now the candidate would need to have 25-50 questions memorized.

Honestly, having the solution to 50 interview questions to CS problems memorized is pretty good. On top of that, very few candidates do the research needed to find the problems ahead of time.

PS, Thanks for the outline, I’m going to use this question for my candidates in the future.

41 Charles Scalfani // Aug 8, 2011 at 2:40 pm

I have always hated interview questions, which is why I don't use them. I'd rather solve a problem WITH the candidate, or, if I'm being interviewed, with the interviewer. I want real-world problems that have NOT been pre-solved. I want to have a design discussion with them and talk through the design and implementation issues. Coding is trivial after that.

I would rather have the candidate bring in an example of code that is non-proprietary and something they’re particularly proud of. Then I review the code with them like I would if they worked for me.

I’d much rather have an interview as a pseudo-working session. Then I can see how it would be to actually work with that person. There’s no better way to see how someone thinks than making them work WITH you; not jump through your artificial hoops.

42 binarysolo // Aug 8, 2011 at 3:24 pm

@Daniel

Well, I still dunno if this is a great interview problem per se, but it sure is a great conversation starter, given its very *accessible* nature: even we laymen without much programming background can access the question and think of reasonable implementations.

I'm not familiar enough with the programming world, but I'd imagine that the value of clever thinking and efficient structural thinking >> some technical, nuanced aspect of some computer language, which is what this problem filters for.

43 jeremy // Aug 8, 2011 at 5:16 pm

Different problems test different skills. I’ve never used this question as the only determinant in assessing a candidate, but I’ve found that it provides more bits of information than most.

Again, yes.. but you specifically said that this specific question didn’t just test a candidate’s ability to think generally, computer-scientifically. But a candidate’s ability to think specifically in terms of massive scale, machine learning and information retrieval.

And it’s that specific connection — between DP and massive scale in the context of ML and IR — that I’m struggling to understand, rather than the broader question of whether a candidate is a good computer scientist.

It’s just that.. oh, nevermind. I’ll take it offline.

44 Rick Williams // Aug 8, 2011 at 7:06 pm

Thank you for the response about NDAs. I agree completely that it is unprofessional to publish secret interview questions after an interview.


But that’s different from the NDA issue. I’ve been on dozens of interviews and have never been asked to sign an NDA for interviewing, nor have I required one of anyone I am interviewing. NDAs are something clients sign before being informed of proprietary business trade secret information. This is done only when absolutely necessary since it is much better simply not to reveal trade secrets to outside parties in the first place. It’s quite bizarre to hear of them being required for interviews and I don’t really believe that this is a common practice. If it is common in some segment of the field, then it is an ill advised practice.

45 Golam Kawsar // Aug 8, 2011 at 7:35 pm

Great interview question, but in my experience as a programmer, I have not seen many programmers who can cook up dynamic programming solutions to problems like these, let alone during the stress of an interview. Even recognizing this as a dynamic programming question will be hard for many.

Thanks for such a nice post Daniel. Enjoyed reading it a lot!

46 Debnath // Aug 8, 2011 at 8:13 pm

The general solution will terminate if it finds the word in the dictionary, but should it still not continue? I mean, there could be a word like "endgame", for lack of a better example (or "backtracking"), which might be part of the dictionary but whose components are also individual words… I can understand they may not be popular in IR, though, since they are usually an abbreviation of the sub-words, and the sub-words don't really make sense independently, but given the problem definition…

47 A great interview Problem | Phanikumar // Aug 8, 2011 at 10:20 pm

[...] problem with the code and runtime of the algorithm mentioned in the post. http://thenoisychannel.com/2011/08/08/retiring-a-great-interview-problem/ This entry was posted in Code by phani. Bookmark the [...]

48 Daniel Tunkelang // Aug 8, 2011 at 11:37 pm

@John H

I hope this question serves you well. It’s time for me to move on from it, and I thought the best way to retire this problem was to do so in a way that others would learn from it.

49 Daniel Tunkelang // Aug 8, 2011 at 11:42 pm

@Charles Scalfani

We do have collaborative problem solving and product design discussions as part of the interview process. But I also want to see how a candidate writes code. Reviewing code they've written before is an option, but it's tricky, especially for a candidate who has not written non-proprietary code in a long time.

50 Daniel Tunkelang // Aug 8, 2011 at 11:44 pm

@Debnath

The termination condition is part of the problem statement. The problem could be changed to require outputting all valid segmentations, but there could be an exponential number of them. We could also require the “best” segmentation, which is an interesting design question as to what constitutes “best”.

51 On “Retiring a Great Interview Problem” « Will.Whim // Aug 9, 2011 at 1:49 am

[...] Tunkelang wrote an interesting blog post, "Retiring a Great Interview Problem" on an interview problem that he has, in the past, posed to interviewees, but which he has [...]

52 Will Fitzgerald // Aug 9, 2011 at 1:51 am

I wrote a response to this, “On ‘Retiring a Great Interview Problem’” at http://t.co/uXQf4oh.

53 Daniel Tunkelang // Aug 9, 2011 at 6:41 am

Will, I read your response. A short one here: assessment of candidates on a problem like this isn’t binary. Rather, the point is to get as holistic a picture as is possible in the time constraints of how well a candidate solves an algorithmic problem and implements it. It’s not a perfect tool, but no tool is. I’m always on the lookout for better ones — don’t forget that I’m retiring this problem.

And you allude to a candidate’s nervousness under interview conditions. Part of the interviewer’s job is to put the candidate at ease. That isn’t unique to questions that involve coding, and it doesn’t always work out. No interviewer or hiring process is perfect.

Finally, while I share your concerns about using whiteboard coding in interviews (see my link to Diego’s post), I disagree that it discriminates against older candidates. That very hypothesis strikes me as ageist, at least in the absence of supporting data.

54 Patrick Tilland // Aug 9, 2011 at 11:06 am

As rocketsurgeon mentioned, there is one bug and also one typo in the code. There is a parenthesis missing in this line:

if (dict.contains(input) return input;


And the loop condition never reaches the next to last character in the input string:

for (int i = 1; i < len - 1; i++)

55 Daniel Tunkelang // Aug 9, 2011 at 11:14 am

@rocketsurgeon@Patrick Tilland

Bugs fixed. Thanks guys!

56 Sonic Charmer // Aug 10, 2011 at 3:48 am

I won't write this out in Java/whatever syntax, as I am too lazy and also not a programmer, so not interested in memorizing the syntax of this or that language. But, as stated, I am lazy, so I would want a 'good' segmentation, not just any (I don't want a lot of "a"s if there's an "aaaa" available). So I would check the longest-length word in the dictionary, say that length is k (of course, cap this at the input-string length and discard all dictionary words greater than this; I'll assume the dictionary has easy capability both of this max-length check and of 'discarding'/ignoring all words longer than k). Then, starting from i=1, search all (i, i+k-1) substrings of the input unless/till I find a match; if none, reduce k (discard more dictionary!) and try again; if so, parse out into 'match', prefix and suffix (as applicable) and recurse onto both of the latter. Details too boring/obvious to spec out.

Personally I dislike cutesy ‘interview questions’ and am instinctively distrustful of the filter they implicitly apply to candidates. When I’m interviewing someone I do what is known as ‘talk to’ the person, one might even say I ‘have a conversation’ with them. That may/may not be better but my way at least I don’t think someone could slide through the interview filter by memorizing some stuff off websites.

57 Stavros Macrakis // Aug 10, 2011 at 7:59 am

Many skills go into a good software engineer, and different interviewers will probe different skills. This problem emphasizes algorithmic thinking and coding — and I ask a similar interview question myself. But algorithmic coding is becoming surprisingly uncommon in many environments because the non-trivial algorithms are wrapped in libraries. Of course, there are places where understanding all this is crucial — someone has to write those libraries, and some people really do have Big Problems that the libraries don’t address — but it doesn’t seem to be a central skill that all programmers need to master.

Is this good or bad? Consider A.N. Whitehead: “Civilization advances by extending the number of important operations which we can perform without thinking about them.”

58 Stavros Macrakis // Aug 10, 2011 at 8:13 am


One thing I’ve learned about interview questions is that you have to lead up to them in steps if you want to determine where a candidate’s understanding trails off — which may be very soon.

For example, many candidates claim on their resumes that they know SQL. I used to ask such candidates how they would determine if person A was an ancestor of person B given a table of parent-child relations. This requires the (advanced) SQL feature of recursive queries (and I’d actually be happy if they could explain why it couldn’t be done in SQL, as it can’t in basic SQL). Now, I ask the question in stages:

* In SQL, how would you represent parent-child relations?
* How would you find X's parents?
* How would you find X's grandparents?
* How would you find all of X's ancestors?
* What if you wanted to do all this for a high-volume genealogy Web site: would you change your table design or queries, or use some technology other than SQL?

I was shocked to discover that many candidates who listed SQL on their resumes couldn’t do *any* of this, and many required considerable coaching to do it. One candidate didn’t even remember that SQL queries start with SELECT — I would have forgiven this if he’d had conceptual understanding but had just forgotten the keyword, but he had zero conceptual understanding as well.

All this to say that you can’t really trust the self-reporting on a resume and you’ve got to probe to understand what the candidate actually knows.

59 Daniel Tunkelang // Aug 10, 2011 at 8:42 am

@Sonic Charmer

Your sketch is actually a reasonable start toward what I'd expect of a candidate. In fact, I think I could persuade you that your assumptions would have to be general enough to at least solve the interview problem as a special case, where the min word length is one and the max is as large as the length of the input string. Please bear in mind that this "cutesy" problem is a simplification of one I had to solve to deliver software that has been deployed to hundreds of millions of people!

@Stavros

You're right that not all software engineers need to have a strong command of algorithms. But I do require that strength of the folks I hire, given the problems that my team solves. The same applied at Google and Endeca. And yes, self-reporting on a resume is always subject to the maxim of "trust but verify".

60 Word Breaks « Programming Praxis // Aug 12, 2011 at 2:03 am


[...] Tunkelang posted this interview question to his blog: Given an input string and a dictionary of words, segment the input string into a space-separated [...]

61 Daniel Tunkelang // Aug 12, 2011 at 6:24 am

@Programming Praxis

A solution in Scheme. Nice!

62 Ben Mabey // Aug 14, 2011 at 4:16 pm

Thanks for sharing this problem! I did a Clojure and Ruby solution and discussed the differences in lazy (as in lazy lists) and non-lazy solutions: http://benmabey.com/2011/08/14/word-break-in-clojure-and-ruby-and-laziness-in-ruby.html

63 Daniel Tunkelang // Aug 14, 2011 at 8:32 pm

Ben, thank you! I’m honored to have inspired such an insightful and elegant post.

64 Conducting a Remote Technical Interview | Hiring Tech Talent // Aug 15, 2011 at 9:19 am

[...] hacked, as the wise candidate can research typical questions ahead of time. Daniel Tunkelang has a great post on this, where he found that one of his best questions was posted on [...]

65 Raymond Moore // Aug 17, 2011 at 3:21 pm

Excellent article / topic. We are currently recruiting for a half dozen C++ / Web UI positions, and for many companies it is extremely difficult to determine who is talking the talk and who can write VERY clean code and think logically when faced with difficult programming requests. As an example, a Sales Director will hand a candidate for a sales position a phone and say "make this call and pitch them" just to see what they have; this is partly what interviewing has become like in the IT world.

66 Daniel Tunkelang // Aug 17, 2011 at 6:07 pm

Raymond, thanks. The problem with all interviewing — and with interviewing software engineers in particular — is that it’s hard to extract a reliable signal under interview conditions.

Ultimately the best solution may be to change the interview conditions. But the approach has to be both effective and efficient. It’s a great research problem — and I’ll let readers here know what my colleagues and I come up with.


Of course, I’d love to hear what others are doing.

67 State of Technology #21 « Dr Data's Blog // Aug 18, 2011 at 10:55 pm

[...] – How to retire a great Interview problem – “word break” problem described as [...]

68 Anatoly Karp // Aug 21, 2011 at 12:20 am

As a side note, there is a discussion of a slightly more general problem in Peter Norvig’s excellent chapter from O’Reilly book “Beautiful Data” – http://norvig.com/ngrams/ch14.pdf (see section “Word Segmentation”). His remark that the memoized solution can be thought of as the Viterbi algorithm from the HMM theory is nicely illuminating (and of course obvious upon a moment’s thought).

69 Daniel Tunkelang // Aug 21, 2011 at 9:31 am

Indeed. Peter emailed me his “naive” solution — unfortunately, I don’t think he’s on the market.

70 Attention CMU Students! // Sep 7, 2011 at 7:04 pm

[...] and Wednesday. And of course LinkedIn will be conducting on-campus interviews: those will take place all day on Thursday, September [...]

71 William // Nov 4, 2011 at 10:57 pm

Hi Daniel: how could you change your code so that it can find all valid segmentations of the whole string? E.g., string "aaa" with dict {"a", "aa"} yields segmentations {"a a a", "a aa", "aa a"}.

72 Daniel Tunkelang // Nov 5, 2011 at 1:44 pm

William, interesting question. For starters, the number of valid segmentations may be exponential in the length of the string; in fact, that will be the case for a string of n a's if every sequence of a's is a dictionary word. One could still use memoization / dynamic programming to avoid repeating work, but storing sets of segmentations rather than a single one.
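A minimal sketch of the extension Daniel describes (hypothetical code, not from the original post): the memo maps a suffix start index to the list of segmentations of that suffix, so shared suffixes are computed only once, even though the output itself can be exponentially large.

```java
import java.util.*;

// Hypothetical sketch (not the post's code): enumerate *all* valid
// segmentations, memoizing per suffix start index.
public class AllSegmentations {
    public static List<String> segmentAll(String input, Set<String> dict) {
        return helper(input, 0, dict, new HashMap<>());
    }

    private static List<String> helper(String input, int start, Set<String> dict,
                                       Map<Integer, List<String>> memo) {
        List<String> cached = memo.get(start);
        if (cached != null) return cached;
        List<String> results = new ArrayList<>();
        if (start == input.length()) {
            results.add("");    // the empty suffix has exactly one (empty) segmentation
        } else {
            for (int end = start + 1; end <= input.length(); end++) {
                String word = input.substring(start, end);
                if (!dict.contains(word)) continue;
                // every segmentation of the rest extends this prefix word
                for (String rest : helper(input, end, dict, memo)) {
                    results.add(rest.isEmpty() ? word : word + " " + rest);
                }
            }
        }
        memo.put(start, results);
        return results;
    }

    public static void main(String[] args) {
        // William's example: "aaa" with {"a", "aa"}
        Set<String> dict = new HashSet<>(Arrays.asList("a", "aa"));
        System.out.println(segmentAll("aaa", dict));
    }
}
```

On William's example this produces exactly the three segmentations "a a a", "a aa", and "aa a".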

73 Roberto Lupi // Dec 17, 2011 at 2:41 pm

It can be done in O(n), n = length of the string, with some pretty relaxed assumptions given the nature of the problem: we just need a preprocessing step on the dictionary.

The idea is to build a set of rolling hashing function using the Rabin-Karp algorithm, for each word length in the dictionary, and the corresponding hash value for each word.


To segment the string, we loop over it once, updating the rolling hash values (for each length) and if we find a match in our set of hash values from the dictionary, we have a potential match. We still have to check the actual dictionary to confirm the match, avoiding false positives.

This design has the added advantage that the dictionary can be larger than what can fit into memory. We only need to store the hash values for each word in memory.
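A rough sketch of this idea (my own illustrative code, not Roberto's; it assumes a simple polynomial rolling hash): one Rabin-Karp hash per distinct word length, a scan per length over the input, and confirmation of each hash hit against the real dictionary to rule out false positives. Only the match-finding pass is shown.

```java
import java.util.*;

// Illustrative sketch of the rolling-hash matching pass: for each distinct
// dictionary word length, slide a window over the input, updating a
// Rabin-Karp hash in O(1) per step, and report confirmed dictionary matches.
public class RollingMatcher {
    private static final long BASE = 257, MOD = 1_000_000_007L;

    // Returns {start, length} pairs for every dictionary word occurring in input.
    public static List<int[]> findMatches(String input, Set<String> dict) {
        // Preprocessing: group dictionary word hashes by word length.
        Map<Integer, Set<Long>> hashesByLen = new HashMap<>();
        for (String w : dict) {
            hashesByLen.computeIfAbsent(w.length(), k -> new HashSet<>()).add(hash(w));
        }
        List<int[]> matches = new ArrayList<>();
        for (Map.Entry<Integer, Set<Long>> e : hashesByLen.entrySet()) {
            int len = e.getKey();
            if (len == 0 || len > input.length()) continue;
            long pow = 1;                                 // BASE^(len-1) mod MOD
            for (int i = 1; i < len; i++) pow = pow * BASE % MOD;
            long h = hash(input.substring(0, len));
            for (int i = 0; ; i++) {
                if (e.getValue().contains(h)
                        && dict.contains(input.substring(i, i + len))) {
                    matches.add(new int[]{i, len});       // confirmed match
                }
                if (i + len >= input.length()) break;
                // Roll the window: drop charAt(i), append charAt(i + len).
                h = ((h - input.charAt(i) * pow % MOD + MOD) % MOD * BASE
                        + input.charAt(i + len)) % MOD;
            }
        }
        return matches;
    }

    private static long hash(String s) {
        long h = 0;
        for (int i = 0; i < s.length(); i++) h = (h * BASE + s.charAt(i)) % MOD;
        return h;
    }
}
```

A segmentation pass (e.g., the dynamic program from the post) can then run over the reported (start, length) matches instead of probing every substring.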

74 Daniel Tunkelang // Dec 17, 2011 at 3:01 pm

Roberto, one person I presented the problem to did suggest an approach along these lines: since dictionary membership is a regular language, just build a finite state machine. I was impressed by the ingenuity, but I then enforced the constraint that the dictionary only supported a constant-time membership test.

By the way, I’ve since moved on to less exciting coding problems that require less ingenuity and are more a test of basics (though not quite as elementary as fizzbuzz). I’m still surprised at how many candidates with strong resumes fail at these.

75 Roberto Lupi // Dec 18, 2011 at 12:27 pm

Maybe stronger candidates tend to overcomplicate problems: instead of solving them in the simplest way, they search for a clever one and get lost.

"[T]he stupider one is, the closer one is to reality. The stupider one is, the clearer one is. Stupidity is brief and artless, while intelligence wriggles and hides itself. Intelligence is a knave, but stupidity is honest and straightforward." — Dostoevsky (The Brothers Karamazov)

76 Daniel Tunkelang // Dec 18, 2011 at 12:33 pm

Some strong candidates assume that an easy solution must be too naive and therefore wrong. For that reason, it’s important to set expectations at the beginning. If a problem is basic, I tell the candidate as much — which also helps avoid a candidate feeling insulted or worrying that the bar is too low. And for all problems, I urge candidates to come up with a working solution before optimizing it.

77 Alex // Jan 10, 2012 at 11:23 am

At first, it sounds depressing. I’m writing compilers and OSes, among other things, but don’t think I’d pass this interview.

However, the more I read, the more I realize I'm a different kind of programmer than who is sought here. I don't cite CLR by heart; I solve real problems in a real-world manner (i.e., including asking others, not to mention using books, the Internet, etc.) and am generally not good at "coding". I guess my skills aren't very marketable, in this approach. It might be a better choice not to program for somebody else.

78 Daniel Tunkelang // Jan 10, 2012 at 9:12 pm

Alex, I've actually switched to using problems that are a bit less CLR-ish. I still think it's reasonable to expect someone to be able to solve real problems like this one, though I realize it's unnatural to solve problems under interview conditions.

An alternative approach would be to put less emphasis on the accuracy of the interviewing process and treat the first few months as a trial period. Unfortunately, that’s not the cultural norm, so instead we try to squeeze all the risk out of the hiring process.

Anyway, if you’re able to write a compiler or OS, you should have no problem finding work you’re great at and enjoy.

79 dbt // Jan 11, 2012 at 2:50 pm

Alex, I have heard that lament before. I work at a place that uses similarly algorithmic questions (and other questions too — algorithms aren’t everything, but they’re important) and I sometimes hear my coworkers lament that they couldn’t get hired with the standards we have today. Which is, of course, nonsense.

What’s important to realize about these questions is that in an hour, you have time to make an attempt at an answer, get feedback, and improve your solution. It is a conversation, and not just a blank whiteboard, an interviewer in the corner tapping a ruler on the table every time you make a mistake, and a disappointing early trip home.

80 Daniel Tunkelang // Jan 11, 2012 at 3:27 pm

Tapping a ruler on the table? More like rapping you on the knuckles! OK, the nuns didn’t really do that to us.

Seriously, dbt is right. Good interviews are a conversation. Otherwise, it would be better to make them non-interactive tests.

81 Don’t write on the whiteboard – The Princeton Entrepreneurship Club // Jan 28, 2012 at 2:33 pm

[...] me to write tests for my code, find corner cases. He then asked me 3 other problems. They were Dan Tunkelang type problems. He ran out of problems and there were 15 minutes left. “Normally there’s not enough [...]

82 Hiring: you are doing it wrong | @abahgat's blog // Feb 22, 2012 at 3:27 am


[...] is a challenge. A lot has been written about the process itself and its quirks, ranging from programming puzzles to whiteboard interviews. However, there are still a few details that are often overlooked by [...]

83 Strata 2012: Big Data is Bigger than Ever // Mar 2, 2012 at 12:58 am

[...] minutes extended into three hours of conversation about everything from normalized KL divergence to interview problems — and segued into a reception with specialty big-data cocktails. By the time I got back to my [...]

84 man2code // Mar 18, 2012 at 1:18 am

if (segSuffix != null) {
    memoized.put(input, prefix + " " + segSuffix);
    return prefix + " " + segSuffix;
}

This if should have an else branch, as below, to show the words matched before a non-dictionary word:

if (segSuffix != null) {
    memoized.put(input, prefix + " " + segSuffix);
    return prefix + " " + segSuffix;
} else {
    return prefix;
}

85 Jp Bida // Jun 28, 2012 at 11:21 am

Would using human cycles with captchas get you closer to O(n)? Or does big-O analysis only apply when we actually understand the details of the algorithm?

86 netfootprint // Jul 1, 2012 at 1:41 am

I was thinking about the same problem. Quite surprised to see the same thing appear here and asked in interviews. Shouldn't "aaaaab" be "aaaaa" + "b" if a, aa, aaa, aaaa, aaaaa are in the dict?
Step 1: check for "aaaaab"; not in dictionary.
Step 2: check for "aaaaa"; found in dictionary.
Word segments are "aaaaa" + "b".
I wonder why we need all combinations in a "real-world" problem!
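For what it's worth, "aaaaab" happens to segment correctly under a greedy longest-prefix rule, but greed fails in general; a small hypothetical illustration:

```java
import java.util.*;

// Hypothetical illustration of why greedy longest-prefix segmentation is
// not enough in general: with dict {"a", "ab", "bc"} and input "abc",
// greedily taking the longest prefix "ab" strands the unmatchable suffix
// "c", even though the valid segmentation "a bc" exists.
public class GreedyFails {
    static String greedyLongest(String input, Set<String> dict) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            int best = -1;
            for (int j = input.length(); j > i; j--) {       // longest prefix first
                if (dict.contains(input.substring(i, j))) { best = j; break; }
            }
            if (best < 0) return null;                       // greedy dead end
            if (out.length() > 0) out.append(' ');
            out.append(input, i, best);
            i = best;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("a", "ab", "bc"));
        System.out.println(greedyLongest("abc", dict));  // null: greedy fails here
    }
}
```

On "aaaaab" with {"a", "aa", "aaa", "aaaa", "aaaaa", "b"}, the same greedy rule does return "aaaaa b", which is why the example in the comment above seems to work without backtracking.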

87 Algorithms: What is the most efficient algorithm to separate connected words? assuming all the constituent words are in the vocabulary (and also assume for simplicity that there aren't any spelling mistakes) - Quora // Jul 3, 2012 at 4:23 am


[...] is a thorough discussion of this problem as an interview question by Daniel Tunkelang: [1][1] http://thenoisychannel.com/2011/… [...]

88 Rob // Aug 27, 2012 at 8:30 pm

I love the problem… but I can't help noticing that the interviewer who has used this question to filter many fine candidates over the years can't even write out a working solution without bugs, given unlimited time, years of experience asking others this question, and the context of writing an in-depth analysis teaching the unwashed masses how great an interview problem this is. Can it be such a great question if you can't even get it right?

The biggest problem with questions like this is that this type of programming is a highly perishable skill. I have played the guitar for 10 years but lately have spent more time singing acoustically; trying to fit my fingers to a Bach piece that used to come to me written on the back of my eyelids is now impossible.

Now if I were playing Bach every day, it would be a different story. Same with this problem. You are only going to hire the guy who happened to solve several problems like this last week because he ran into some issue where it was relevant.

89 Daniel Tunkelang // Aug 27, 2012 at 8:41 pm

Criticism noted. But bear in mind that the question isn’t a binary filter. It’s a test of algorithmic thinking and even of working out the problem requirements.

I’m curious what you mean by “this type of programming” being highly perishable. If you mean solving a basic algorithmic problem that comes up in the course of real work, then I strongly object. As I said in the post, the problem isn’t from a textbook — it’s a simplified version of a real issue that has come up for me and others in the course of writing production software.

But I grant that people don’t always have opportunities to use dynamic programming and perhaps even recursion. With that in mind, I’ve switched to interview problems that are less reliant on these. But I still expect the people I hire to be able to apply these fundamental computer science techniques with confidence.

Of course there’s a risk with this and any interview problem of overfitting to someone’s recent experience. That’s why it’s good to use a diverse set of interview questions. If someone aces the interviews because she solved all of those problems last week, I’ll take my chances and hire her!

Finally, if you have suggestions about how to interview more effectively, I’m all ears.

90 Tapori // Sep 2, 2012 at 5:23 pm


One problem with this is that good programmers tend to avoid recursion (no data to support it though).

So while you are expecting a 10 min recursive version, the good programmer is trying to bring out a non recursive version and might fail. (Of course, non recursive version for this can be done in an interview).

btw, this is a textbook exercise problem.

I believe Sedgewick’s book (or perhaps CLR) has it.

91 Daniel Tunkelang // Sep 2, 2012 at 5:30 pm

I concede that a lot of good programmers may instinctively avoid recursion, although in this case I'd say that makes them worse programmers. The whole point of learning a set of software engineering tools is to apply the right one to the right problem, and this problem is most naturally defined and solved recursively.

As for it being a textbook exercise, good to know. As I said in the original post, I first encountered this problem in the course of writing production software.

92 Tapori // Sep 2, 2012 at 7:21 pm

The problem (pun intended) is that the problem is still incompletely defined. For instance, we have no idea about the expected target hardware. Does it have limited memory? Limited stack space? Does the language we are to use support recursion? Etc. OK, maybe the last one (or any of them) is not relevant these days, but you get my point.

Basically we have no idea what we need to try and optimize for. Granted, a good candidate might and probably should try and clarify that, but in the face of ambiguity, good programmers follow “(instinctive) good practices” which they gained through experience etc.

For instance, you will use quadratic memory and linear stack in the recursive version. A good programmer might instinctively try avoid the cost of the stack.

But, if the goal is to optimize the time to write the code, then a recursive version will be faster (and I suppose is an implicit goal in the interviews, but almost never the case in critical production code).

So calling them worse programmers for not using recursion is not right, IMO.

btw, you seem to have ignored the cost of looking up the memoized structure. Since you are looking up the string, your recursive version is cubic, in spite of the substring and dictionary lookup being O(1). Of course, that can be avoided if we represent the string for lookup by its end-point indexes (i, j), rather than the string itself.
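A sketch of the index-based variant Tapori suggests (illustrative code, not the post's original): the memo is keyed by the suffix's start index, an int, so memo lookups no longer hash O(n)-length string keys. (The substring and dictionary probes still cost whatever they cost; only the memo-key overhead goes away.)

```java
import java.util.*;

// Illustrative sketch: first-success word break with the memo keyed by the
// suffix start index instead of the suffix string itself.
public class IndexMemoSegmenter {
    public static String segment(String input, Set<String> dict) {
        // memo maps start index -> segmentation of input.substring(start),
        // or null if that suffix is unsegmentable; unvisited entries are absent.
        return helper(input, 0, dict, new HashMap<>());
    }

    private static String helper(String input, int start, Set<String> dict,
                                 Map<Integer, String> memo) {
        if (start == input.length()) return "";
        if (memo.containsKey(start)) return memo.get(start);
        String result = null;
        for (int end = start + 1; end <= input.length(); end++) {
            String word = input.substring(start, end);
            if (!dict.contains(word)) continue;
            String rest = helper(input, end, dict, memo);
            if (rest != null) {
                result = rest.isEmpty() ? word : word + " " + rest;
                break;                       // first success wins
            }
        }
        memo.put(start, result);             // cache successes and dead ends
        return result;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("lead", "leader", "ship"));
        System.out.println(segment("leadership", dict));  // leader ship
    }
}
```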


Sorry for the long post.

93 Daniel Tunkelang // Sep 2, 2012 at 8:15 pm

No need to apologize — I appreciate the discussion!

And several of the points you've raised have come up when I've used this problem in interviews. I've seen candidates implement a stack-based approach without recursion. I've also had discussions about nuances of scale and performance, including whether limited memory requires the dictionary to be stored out of core (a great motivation for using a Bloom filter) and whether the cost of creating a read-only copy of a substring is constant (i.e., represented by the end-point indexes) or linear in the length of the substring (because the substring is actually created as a new string).

So you’re right that not all good programmers will jump to recursion — though I do think that is the simplest path for most. And in an interview I urge candidates to start with the simplest solution that works. That’s not only a good idea during an interview, but a good idea in practice, to avoid premature optimization.

Regardless of the choice of interview problem, the interviewer has to be competent and flexible. I’m sure an interviewer can butcher an interview with even the best problem. But some interview problems are better than others. And I strongly favor interview problems that are based on real problems, don’t require specialized knowledge, and give candidates options to succeed without depending entirely on arriving at a single insight.

94 Tapori // Sep 2, 2012 at 10:16 pm

Completely agree with the comment about competent and flexible interviewers.

95 Job Interviews: What is your favourite interview question for a software engineer? - Quora // Sep 3, 2012 at 12:07 pm

[...] is your favourite interview question for a software engineer? This blog post has a nice question – http://thenoisychannel.com/2011/… . What are your favourite interviewing questions? Add [...]

96 mm // Sep 22, 2012 at 7:41 pm

Due to this question I had an interview with whitepages.com, and this is why I couldn’t get the job… Great answer, hopefully I won’t see this question again in my interviews. LOL

97 Quora // Sep 29, 2012 at 10:37 am

Why do topmost tech companies give more priority to algorithms during the recruitment process?…


Not all top tech companies. At LinkedIn, we put a heavy emphasis on the ability to think through the problems we work on. For example, if someone claims expertise in machine learning, we ask them to apply it to one of our recommendation problems. A…

98 techguy // Oct 16, 2012 at 11:51 pm

I have a question about the memoized solution. I understand the advantage of saving the results of a dead end computation, where the code reads:

memoized.put(input, null);

But I don’t understand the advantage to memoizing here:

memoized.put(input, prefix + " " + segSuffix);

Since segSuffix has already been found to be not null, that means we have reached the end of the input string somewhere deeper in the call stack, and are now just unwinding back to the top? Maybe I’ve missed something, it’s late at night for me, but I can’t see it any other way right now. Thanks.

99 Daniel Tunkelang // Oct 17, 2012 at 10:06 pm

I think you’re right — we don’t need to memoize the non-null values. I’ve amended the code accordingly.

100 CIKM 2012: Notes from a Conference in Paradise // Nov 12, 2012 at 7:00 am

[...] sessions of the conference. There was a talk on query segmentation, a topic responsible for my most popular blog post. Also a great talk on identifying good abandonment, a problem I’ve been interested in ever [...]

101 Thought this was cool: CIKM 2012: Notes from a Conference in Paradise « CWYAlpha // Nov 13, 2012 at 7:11 am

[...] sessions of the conference. There was a talk on query segmentation, a topic responsible for my most popular blog post. Also a great talk on identifying good abandonment, a problem I’ve been interested in ever since [...]

102 Dirk Gorissen // Jan 10, 2013 at 4:41 am

As an aside, as far as I can see, the given code will fail to find a complete solution in cases like this:

dict = ["the", "big", "cat"], string = "thebiggercat" or "thebigfoocat"


Also, if you call it with the string “cats” it will return null instead of cat.

103 Daniel Tunkelang // Jan 10, 2013 at 7:25 am

Dirk, that is how the problem is set up.

From the post:

Q: What about stemming, spelling correction, etc.?

A: Just segment the exact input string into a sequence of exact words in the dictionary.

Of course you can generalize the problem to make it more interesting, especially if a candidate solves the original problem with time to spare.

104 Dirk Gorissen // Jan 10, 2013 at 7:36 am

Indeed, missed that, sorry. Thanks for a great post btw.

105 Daniel Tunkelang // Jan 10, 2013 at 7:38 am

My pleasure, glad you enjoyed it!

How to Prepare for an Interview as an Entry Level Data Analyst

by Rick Leander, Demand Media

Data analyst jobs vary greatly between industries, so preparation is key to landing the job. Find out about the company, its products and services, the industry, the analysis team, and the types of analyses used. In many cases, much of this information can be found on the internet, but it also helps to talk with people inside the company to gain a competitive advantage.

Step 1

Study the job posting. Go online and find the job listing on the company’s website or, if not online, pull the application from your files. Write down each requirement of the job, then list relevant experience or training that relates to the item. If the listing mentions survey analysis, describe a class project that involved survey preparation, and describe how the survey was administered and the statistical techniques used to analyze the results. Repeat this for each job requirement.

Step 2


Know the company. Study the company’s website, paying close attention to the “About Us” pages. Find out about the company’s products or services, the largest customers, and the backgrounds of the management team. For government or research companies, study their research presentations. These pages will offer quite a bit of insight about their data analysis techniques and practices.

Related Reading: Accounting Entry Level Interview Questions

Step 3

Ask for an informational interview. When possible, ask to call and talk with the hiring manager or a lead analyst to find out more about the work being done. By understanding what the team does, you can align your training and experience to better match their needs. When topics arise that are unfamiliar, take time to research and fill in these deficiencies.

Step 4

Prepare to answer common interview questions. Almost every interview starts with a question like “tell me about yourself”, so be ready with a concise thirty to sixty second answer that includes a summary of your training, work experience, and one or two of your personal interests. Other standard questions will include why you want to work at this company, your strengths and weaknesses, and what you can offer to the company. Prepare short answers for each of these questions then practice answering them out loud.

Step 5

Be ready for the tough questions. Look through your school transcript and resume for weaknesses that the interviewer may probe, such as dropped classes or low grades. Many employers run background and credit checks, so be ready to address any financial or law enforcement issues. Explain the circumstances, then show how you learned and grew from these experiences.

Step 6

Find the interview location. Unless the company sits directly across the street, drive to the interview location ahead of time. Find the main entrance and visitor parking then take time to observe how the staff dresses for work. On the day of the interview, dress just a bit better than you observed. For example, if the dress is business casual, wear a sport coat and tie. Arrive for the interview early, take a minute or two to check your appearance, then go in with a positive attitude, knowing you are well prepared.



About the Author

Rick Leander lives in the Denver area and has written about software development since 1998. He is the author of “Building Application Servers” and co-author of “Professional J2EE EAI.” Leander is a professional software developer and has a Master of Arts in computer information systems from Webster University.

66 job interview questions for data scientists

1. What is the biggest data set that you processed, and how did you process it? What were the results?

2. Tell me two success stories about your analytic or computer science projects. How was lift (or success) measured?

3. What is: lift, KPI, robustness, model fitting, design of experiments, 80/20 rule?

4. What is: collaborative filtering, n-grams, map reduce, cosine distance?
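As a quick illustration of one of the definitions in question 4 (my own sketch, not part of the original list), cosine distance between two vectors can be computed as:

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine of the angle between two equal-length vectors:
    0 for identical directions, 1 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)
```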

5. How to optimize a web crawler to run much faster, extract better information, and better summarize data to produce cleaner databases?

6. How would you come up with a solution to identify plagiarism?

7. How to detect individual paid accounts shared by multiple users?

8. Should click data be handled in real time? Why? In which contexts?

9. What is better: good data or good models? And how do you define "good"? Is there a universal good model? Are there any models that are definitely not so good?

10. What is probabilistic merging (AKA fuzzy merging)? Is it easier to handle with SQL or other languages? Which languages would you choose for semi-structured text data reconciliation? 

11. How do you handle missing data? What imputation techniques do you recommend?

12. What is your favorite programming language / vendor? Why?

13. Tell me 3 things positive and 3 things negative about your favorite statistical software.

14. Compare SAS, R, Python, Perl

15. What is the curse of big data?


16. Have you been involved in database design and data modeling?

17. Have you been involved in dashboard creation and metric selection? What do you think about Birt?

18. What features of Teradata do you like?

19. You are about to send one million emails (marketing campaign). How do you optimize delivery? How do you optimize response? Can you optimize both separately? (Answer: not really.)

20. Toad or Brio or any other similar clients are quite inefficient for querying Oracle databases. Why? What would you do to increase speed by a factor of 10, and be able to handle far bigger outputs?

21. How would you turn unstructured data into structured data? Is it really necessary? Is it OK to store data as flat text files rather than in an SQL-powered RDBMS?

22. What are hash table collisions? How are they avoided? How frequently do they happen?
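A minimal illustration of question 22 (my own sketch): a collision occurs when two keys hash to the same bucket, and separate chaining is one common way to handle it.

```python
class ChainedHashTable:
    """Tiny hash table using separate chaining: keys that hash to the
    same bucket coexist in a list instead of overwriting each other."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # update an existing key
                return
        bucket.append((key, value))       # collision: chain a new entry

    def get(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:
            if k == key:
                return v
        raise KeyError(key)
```

With `num_buckets=1` every key collides, yet all entries remain retrievable.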

23. How to make sure a MapReduce application has good load balance? What is load balance?

24. Examples where MapReduce does not work? Examples where it works very well? What are the security issues involved with the cloud? What do you think of EMC's solution offering a hybrid approach - both internal and external cloud - to mitigate the risks and offer other advantages (which ones)?

25. Is it better to have 100 small hash tables or one big hash table, in memory, in terms of access speed (assuming both fit within RAM)? What do you think about in-database analytics?

26. Why is naive Bayes so bad? How would you improve a spam detection algorithm that uses naive Bayes?

27. Have you been working with white lists? Positive rules? (In the context of fraud or spam detection)

28. What is star schema? Lookup tables? 

29. Can you perform logistic regression with Excel? (yes) How? (use linest on log-transformed data)? Would the result be good? (Excel has numerical issues, but it's very interactive)

30. Have you optimized code or algorithms for speed: in SQL, Perl, C++, Python etc. How, and by how much?

31. Is it better to spend 5 days developing a 90% accurate solution, or 10 days for 100% accuracy? Depends on the context?


32. Define: quality assurance, six sigma, design of experiments. Give examples of good and bad designs of experiments.

33. What are the drawbacks of general linear model? Are you familiar with alternatives (Lasso, ridge regression, boosted trees)?

34. Do you think 50 small decision trees are better than a large one? Why?

35. Is actuarial science not a branch of statistics (survival analysis)? If not, how so?

36. Give examples of data that follows neither a Gaussian nor a log-normal distribution. Give examples of data that has a very chaotic distribution.

37. Why is mean square error a bad measure of model performance? What would you suggest instead?

38. How can you prove that one improvement you've brought to an algorithm is really an improvement over not doing anything? Are you familiar with A/B testing?

39. What is sensitivity analysis? Is it better to have low sensitivity (that is, great robustness) and low predictive power, or the other way around? How to perform good cross-validation? What do you think about the idea of injecting noise in your data set to test the sensitivity of your models?

40. Compare logistic regression with decision trees and neural networks. How have these technologies been improved over the last 15 years?

41. Do you know of / have you used data reduction techniques other than PCA? What do you think of step-wise regression? What kind of step-wise techniques are you familiar with? When is full data better than reduced data or a sample?

42. How would you build non parametric confidence intervals, e.g. for scores? (see the AnalyticBridge theorem)

43. Are you familiar with extreme value theory, Monte Carlo simulations or mathematical statistics (or anything else) for correctly estimating the chance of a very rare event?

44. What is root cause analysis? How to identify a cause vs. a correlation? Give examples.

45. How would you define and measure the predictive power of a metric?

46. How to detect the best rule set for a fraud detection scoring technology? How do you deal with rule redundancy, rule discovery, and the combinatorial nature of the problem (for finding optimum rule set - the one with best predictive power)? Can an approximate solution to the rule set problem be OK? How would you find an OK approximate solution? How would you decide it is good enough and stop looking for a better one?

47. How to create a keyword taxonomy?


48. What is a Botnet? How can it be detected?

49. Any experience with using APIs? Programming APIs? Google or Amazon APIs? AaaS (Analytics as a Service)?

50. When is it better to write your own code than using a data science software package?

51. Which tools do you use for visualization? What do you think of Tableau? R? SAS? (for graphs). How to efficiently represent 5 dimensions in a chart (or in a video)?

52. What is POC (proof of concept)?

53. What types of clients have you been working with: internal, external, sales / finance / marketing / IT people? Consulting experience? Dealing with vendors, including vendor selection and testing?

54. Are you familiar with software life cycle? With IT project life cycle - from gathering requests to maintenance? 

55. What is a cron job? 

56. Are you a lone coder? A production guy (developer)? Or a designer (architect)?

57. Is it better to have too many false positives, or too many false negatives?

58. Are you familiar with pricing optimization, price elasticity, inventory management, competitive intelligence? Give examples. 

59. How does Zillow's algorithm work? (to estimate the value of any home in US)

60. How to detect bogus reviews, or bogus Facebook accounts used for bad purposes?

61. How would you create a new anonymous digital currency?

62. Have you ever thought about creating a startup? Around which idea / concept?

63. Do you think that typed login / password will disappear? How could they be replaced?

64. Have you used time series models? Cross-correlations with time lags? Correlograms? Spectral analysis? Signal processing and filtering techniques? In which context?

65. Which data scientists do you admire most? Which startups?

66. How did you become interested in data science?

67. What is an efficiency curve? What are its drawbacks, and how can they be overcome?


68. What is a recommendation engine? How does it work?

69. What is an exact test? How and when can simulations help us when we do not use an exact test?

70. What do you think makes a good data scientist?

71. Do you think data science is an art or a science?

72. What is the computational complexity of a good, fast clustering algorithm? What is a good clustering algorithm? How do you determine the number of clusters? How would you perform clustering on one million unique keywords, assuming you have 10 million data points - each one consisting of two keywords, and a metric measuring how similar these two keywords are? How would you create this 10 million data points table in the first place?

73. Give a few examples of "best practices" in data science.

74. What could make a chart misleading, difficult to read or interpret? What features should a useful chart have?

75. Do you know a few "rules of thumb" used in statistical or computer science? Or in business analytics?

76. What are your top 5 predictions for the next 20 years?

77. How do you immediately know when statistics published in an article (e.g. newspaper) are either wrong or presented to support the author's point of view, rather than correct, comprehensive factual information on a specific subject? For instance, what do you think about the official monthly unemployment statistics regularly discussed in the press? What could make them more accurate?

78. Testing your analytic intuition: look at these three charts. Two of them exhibit patterns. Which ones? Do you know that these charts are called scatter-plots? Are there other ways to visually represent this type of data?

79. You design a robust non-parametric statistic (metric) to replace correlation or R square, that (1) is independent of sample size, (2) always between -1 and +1, and (3) based on rank statistics. How do you normalize for sample size? Write an algorithm that computes all permutations of n elements. How do you sample permutations (that is, generate tons of random permutations) when n is large, to estimate the asymptotic distribution for your newly created metric? You may use this asymptotic distribution for normalizing your metric. Do you think that an exact theoretical distribution might exist, and therefore, we should find it, and use it rather than wasting our time trying to estimate the asymptotic distribution using simulations? 
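For the "generate tons of random permutations" part of question 79, the standard tool is the Fisher-Yates shuffle, which produces a uniformly random permutation in O(n) time. A sketch (my own illustration, not part of the original list):

```python
import random

def random_permutation(n, rng=random):
    """Fisher-Yates shuffle: each of the n! permutations of range(n)
    is produced with equal probability."""
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i + 1)          # pick uniformly from perm[0..i]
        perm[i], perm[j] = perm[j], perm[i]
    return perm
```

Sampling many such permutations gives an empirical estimate of the asymptotic distribution of the metric.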

80. A more difficult, technical question related to the previous one. There is an obvious one-to-one correspondence between permutations of n elements and integers between 1 and n!. Design an algorithm that encodes an integer less than n! as a permutation of n elements. What would be the reverse algorithm, used to decode a permutation and transform it back into a number? Hint: an intermediate step is to use the factorial number system representation of an integer. Feel free to check this reference online to answer the question. Even better, feel free to browse the web to find the full answer (this will test the candidate's ability to quickly search online and find a solution to a problem without spending hours reinventing the wheel).
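One answer to question 80 is the Lehmer code: the factorial-base digits of the integer select, in order, which of the remaining elements comes next. A sketch (my own implementation, assuming 0-based integers, i.e. 0 <= k < n!):

```python
def int_to_permutation(k, n):
    """Decode an integer 0 <= k < n! into the k-th lexicographic
    permutation of range(n), via the factorial number system."""
    # factorial-base digits of k, least significant first
    digits = []
    for radix in range(1, n + 1):
        digits.append(k % radix)
        k //= radix
    items = list(range(n))
    # the most significant digit picks among the remaining items
    return [items.pop(d) for d in reversed(digits)]

def permutation_to_int(perm):
    """Encode a permutation back into its integer (the inverse)."""
    items = sorted(perm)
    k = 0
    for p in perm:
        i = items.index(p)
        k = k * len(items) + i
        items.pop(i)
    return k
```

For example, `int_to_permutation(3, 3)` gives `[1, 2, 0]`, the fourth permutation of three elements in lexicographic order.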

81. How many "useful" votes will a Yelp review receive? My answer:

* Eliminate bogus accounts (read this article) and competitor reviews. How to detect them: use taxonomy to classify users, plus location - two Italian restaurants in the same zip code could badmouth each other and write great comments for themselves.
* Detect fake likes: some companies (e.g. FanMeNow.com) will charge you to produce fake accounts and fake likes.
* Eliminate prolific users who like everything, and those who hate everything.
* Have a blacklist of keywords to filter fake reviews.
* See if the reviewer's IP address or IP block is in a blacklist such as "Stop Forum Spam".
* Create a honeypot to catch fraudsters.
* Watch out for disgruntled employees badmouthing their former employer.
* Watch out for 2 or 3 similar comments posted the same day by 3 users regarding a company that receives very few reviews. Is it a brand new company?
* Add more weight to trusted users (create a category of trusted users).
* Flag all reviews that are identical (or nearly identical) and come from the same IP address or same user.
* Create a metric to measure the distance between two pieces of text (reviews).
* Create a review or reviewer taxonomy.
* Use hidden decision trees to rate or score reviews and reviewers.

82. What did you do today? Or what did you do this week / last week?

83. What/when is the latest data mining book / article you read? What/when is the latest data mining conference / webinar / class / workshop / training you attended? What/when is the most recent programming skill that you acquired?

84. What are your favorite data science websites? Who do you admire most in the data science community, and why? Which company do you admire most?

85. What/when/where is the last data science blog post you wrote? 

86. In your opinion, what is data science? Machine learning? Data mining?

87. Who are the best people you recruited and where are they today?

88. Can you estimate and forecast sales for any book, based on Amazon public data? Hint: read this article.

89. What's wrong with this picture?

90. Should removing stop words be Step 1 rather than Step 3 in the search engine algorithm described here? Answer: have you thought about the fact that mine and yours could also be stop words? In a bad implementation, data mining would become data mine after stemming, and then data. In practice, you remove stop words before stemming, so Step 3 should indeed become Step 1.
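The ordering issue can be made concrete with a toy sketch (my own illustration; the "stemmer" here is a hypothetical lookup table standing in for a real stemmer such as Porter's):

```python
# Toy lookup-table "stemmer" (a real system would use e.g. a Porter stemmer)
STEMS = {"mining": "mine"}

def toy_stem(word):
    return STEMS.get(word, word)

STOP_WORDS = {"mine", "yours", "the", "a"}

query = ["data", "mining"]

# Wrong order: stem first, then remove stop words. "mining" stems to
# "mine", which is a stop word, so the query collapses to just ["data"].
stem_then_stop = [w for w in (toy_stem(q) for q in query) if w not in STOP_WORDS]

# Right order: remove stop words first, then stem; both terms survive.
stop_then_stem = [toy_stem(q) for q in query if q not in STOP_WORDS]
```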

91. Experimental design and a bit of computer science with Lego's


Comment by vishali rajiv on November 18, 2013 at 11:35pm

@Vincent

Can I get the possible answers for the above interview questions?

Vishali

Comment by Vincent Granville on September 12, 2013 at 9:26am

I have added one new question - question #90.

Comment by Vincent Granville on May 5, 2013 at 7:44pm

Someone wrote: 

Looks like hiring managers expect data scientists to have expertise in machine learning, statistics, business intelligence, database design, data munging, data visualization and programming. Are not these requirements too excessive?

My answer:

I think being familiar with all these domains (add computer science, map reduce) is necessary, as well as expertise in some of these domains. Mastering two programming languages (Java, Python) is a must, as well as familiarity with R and SQL. Visualization is easy to acquire.

Knowing how to quickly and independently find, learn (or if necessary, invent) and assess usefulness of the techniques needed to handle the problems, is critical, and MORE important than "knowing" the techniques in the first place. A good amount of experience with some techniques is necessary. 

But you don't need to be an expert in everything. For instance, I've never had to use about 90% of what I learned in statistics courses to solve business problems. So why learn it in the first place? Also, machine learning (in my opinion) is a subset of statistics focusing on clustering, pattern recognition and association rules.

The mistake that many hiring managers make is looking for someone who is an expert in everything.

Comment by Joe M on April 21, 2013 at 5:06pm

Are questions like this actually asked in high-level interviews?  All I ever got when I was starting out was "What was your most satisfying experience?" and "Other than for the money, why do you want this job?"

Comment by Ritendra on February 25, 2013 at 12:44pm

Vincent,

The questions are good, and many of them are based on practical experience too.

Thanks for sharing the comments from Allan Engelhardt; they provide better context.

As an answer to #76, I would love to see #1 be VA -> Visual Analytics at the top of the list.

#2 should be: a large amount of the data replaced by videos.

Comment by Mars Ma on February 20, 2013 at 5:51am

@Vincent Granville, really useful questions, I like them, thanks a lot !~~

Comment by Amy on February 19, 2013 at 11:40am

@Craig: If the client returns results in your browser, you can handle only as much data as your browser can. In most cases, an 80,000-row table will crash your browser. Just access Oracle directly via Python or Perl, and you can handle (extract and save) gigabytes of data quite easily. And far, far faster.

Comment by Amy on February 19, 2013 at 11:37am

What makes you a data scientist? You just need to know how to gather and turn data into money - nothing more, nothing less. No degree needed, you can learn some techniques by reading material online, but much of what makes a successful data scientist (data/business craftsmanship) is not found in any curriculum or published article.

Comment by craig chambers on February 19, 2013 at 10:22am

Can someone help me with #20 - "Toad or Brio or any other similar clients are quite inefficient to query Oracle databases. Why? How would you do to increase speed by a factor 10, and be able to handle far bigger outputs? "  I didn't know that SQL clients really affected actual query efficiency.  Thanks!

Comment by Vincent Granville on February 16, 2013 at 7:19am

Here's a potential answer to question #10 (probabilistic merging). Feel free to add your answers to any of these questions.

Answer to question #10:

Not sure if the problem of fuzzy merging can be addressed within the framework of traditional databases. Say you have a table A with 10,000 users (key is user ID) and a table B with 50,000 users (key is user ID). You could create a user mapping table C with three fields:

1. userID (= key),

2. Alternate_UserID (this field would also be a user ID), and

3. Probability (the probability that userID = Alternate_UserID).

This table would be populated after some machine learning algorithm had been applied to tables A and B to identify similar users and the probability they match. Make sure that you only include (in table C) records where probability is above (say) 0.25, otherwise you risk exploding your database.
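The mapping-table idea can be sketched in a few lines. Here difflib's string similarity is a stand-in for whatever matching model is actually trained; the table shapes and the 0.25 threshold follow the comment above, but everything else (names, similarity choice) is my own illustration:

```python
from difflib import SequenceMatcher

def build_mapping(table_a, table_b, threshold=0.25):
    """Populate mapping table C with rows
    (userID, Alternate_UserID, Probability), keeping only pairs whose
    estimated match probability exceeds the threshold."""
    mapping = []
    for uid_a, name_a in table_a.items():
        for uid_b, name_b in table_b.items():
            # stand-in for a trained match-probability model
            prob = SequenceMatcher(None, name_a, name_b).ratio()
            if prob > threshold:
                mapping.append((uid_a, uid_b, prob))
    return mapping
```

For instance, matching `{1: "john smith"}` against `{101: "jon smith", 102: "xzqv"}` keeps only the (1, 101) pair; the dissimilar one falls below the threshold and never bloats table C.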


Comment by Vincent Granville on February 16, 2013 at 6:54am

Also, my 70 questions focus mostly on the tech aspects of being a data scientist. And these are high-level questions, aimed at senior professionals (I think there is no such thing as a junior data scientist; they would be called a data analyst, software engineer, statistician or computer scientist instead). I did not include questions about soft skills - that would be another set of 70 questions.

I will add a new one: do you think data science is an art or a science? The answer, as always, is "both". Then you can dig deeper and ask whether you are more of an artist than a scientist. My answer would be: it's more craftsmanship than art, but in my case, being a designer/architect, it's a tiny bit closer to art than to science. Certainly a blend of both.

And when bringing up the issue of art vs. science, I would also add that I like to build solutions that are elegant in the way they contribute to ROI / lift, but not in the way they contribute to statistical theory and the beauty of science. I like a dirty, ugly, imperfect solution better than a "great model" if it is more scalable, simple, efficient, easy to implement and robust.

Comment by Richard Giambrone on February 15, 2013 at 12:51pm

Vincent,  I like these questions.  They are good questions to ask yourself even if you're not interviewing.  Understanding what you do is different from being able to explain what you do.

Rich Giambrone

Comment by Vincent Granville on February 15, 2013 at 10:22am

A data scientist is a bit of everything (statistician, software engineer, business analyst, computer scientist, six sigma, consultant, communicator), but most importantly she is a senior analytic practitioner:

* with a very good sense for business data and business optimization at large, and knowledge of big data - both its drawbacks and its potential (and able to leverage that potential)

* who enjoys swimming in unstructured data and fuzzy non-SQL "joins"

* who knows the limitations of old statistics (regression etc.) yet knows how to correctly do sampling, cross-validation, Monte Carlo simulations and design of experiments, assess lift, and identify good metrics

* who knows the limitations of MapReduce, and how they can be overcome

* who can design and develop robust, simple, efficient, reliable, scalable, useful predictive algorithms - whether or not based on statistical theory

A data scientist may not know much (but at least a little) about linear regression, statistical distributions, the complexity of the quicksort (sorting) algorithm or the limit theorems. Her knowledge of SQL can be a bit elementary, although she can run a big SQL query 10 times faster than business analysts who use tools such as Toad or Brio. Her strengths, skills and knowledge are briefly outlined above.

Comment by Vincent Granville on February 15, 2013 at 8:24am

Interviewers would pick a small subset; there's not enough time in a one-day interview to ask all these questions. Also, several of these questions are about relevant projects (e.g. questions #1 and #2). Of course, these are not yes/no questions, and one would expect to spend 10-15 minutes and go into some depth answering them. Not being able to answer one question is no big deal - this set has 70 questions, and the interviewer can easily pick another one. Indeed, this is the purpose of my list.

Comment by Vincent Granville on February 14, 2013 at 6:52am

Here's a comment from one of our readers:

Some suggestions for structure you may want to apply to your own list:

* Tools (#13, 14, 18, etc.)
* Algorithms (#26, 33, etc.)
* Statistics (#35, 36, 37, etc.)
* Techniques (#3, 4, 10, etc.)
* Data Structures (#21, 22, 25, etc.)
* Experience (#1, 2, etc.)
* Business language (#52, 54)
* Domain-specific (#5, 6, 7, 8, 10, 19, 20, 21, 24b, 27, 46, 55, 59, and probably others)
* Plain weirdness (#5, 48, 59, 61, 63)

It is probably worth thinking about the areas that are important to you and manage a list based on those. I don't think Vincent expects us to just use the list except for inspiration.

My favourites from the list (for senior people) are #2, 9 (my answer: “valuable actions are best”) and 62. Which ones are your favourites?

By Allan Engelhardt


Interview Questions for Data Scientists

Posted: January 3, 2013 | Author: Hilary Mason | Filed under: blog | Tags: datascience, hiring, startups | 27 Comments »

Great data scientists come from such diverse backgrounds that it can be difficult to get a sense of whether someone is up to the job in just a short interview. In addition to the technical questions, I find it useful to have a few questions that draw out the more creative and less discrete elements of a candidate’s personality. Here are a few of my favorite questions.

1. What was the last thing that you made for fun?

This is my favorite question by far — I want to work with the kind of people who don’t turn their brains off when they go home. It’s also a great way to learn what gets people excited.

2. What’s your favorite algorithm? Can you explain it to me?

I don’t know any data scientists who haven’t fallen in love with an algorithm, and I want to see both that enthusiasm and that the candidate can explain it to a knowledgeable audience.

Update: As Drew pointed out on Twitter, do be aware of hammer syndrome: when someone falls so in love with one algorithm that they try to apply it to everything, even when better choices are available.

3. Tell me about a data project you’ve done that was successful. How did you add unique value?

This is a chance for the candidate to walk us through a success and show off a bit. It’s also a great gateway into talking about their process and preferred tools and experience.

4. Tell me about something that failed. What would you change if you had to do it over again?

This is a tricky question, and sometimes it takes people a few tries to get to a complete answer. It’s worth asking, though, to see that people have the confidence to talk about something that went awry, and the wisdom to have recognized when something they did was not optimal.

5. You clearly know a bit about our data and our work. When you look around, what’s the first thing that comes to mind as “why haven’t you done X”?!

Technical competence is useless without the creativity to know where to focus it. I love when people come in with questions and ideas.

6. What’s the best interview question anyone has ever asked you?

I’d like to wish for more wishes, please.

I’m always looking for new and interesting things to add to my list, and I’d love to hear your suggestions.

Algorithms Every Data Scientist Should Know: Reservoir Sampling

by Josh Wills (@josh_wills) April 23, 2013

Data scientists, that peculiar mix of software engineer and statistician, are notoriously difficult to interview. One approach that I’ve used over the years is to pose a problem that requires some mixture of algorithm design and probability theory in order to come up with an answer. Here’s an example of this type of question that has been popular in Silicon Valley for a number of years: 

Say you have a stream of items of large and unknown length that we can only iterate over once. Create an algorithm that randomly chooses an item from this stream such that each item is equally likely to be selected.

The first thing to do when you find yourself confronted with such a question is to stay calm. The data scientist who is interviewing you isn’t trying to trick you by asking you to do something that is impossible. In fact, this data scientist is desperate to hire you. She is buried under a pile of analysis requests, her ETL pipeline is broken, and her machine learning model is failing to converge. Her only hope is to hire smart people such as yourself to come in and help. She wants you to succeed.

Remember: Stay Calm.

The second thing to do is to think deeply about the question. Assume that you are talking to a good person who has read Daniel Tunkelang’s excellent advice about interviewing data scientists. This means that this interview question probably originated in a real problem that this data scientist has encountered in her work. Therefore, a simple answer like, “I would put all of the items in a list and then select one at random once the stream ended,” would be a bad thing for you to say, because it would mean that you didn’t think deeply about what would happen if there were more items in the stream than would fit in memory (or even on disk!) on a single computer.

The third thing to do is to create a simple example problem that allows you to work through what should happen for several concrete instances of the problem. The vast majority of humans do a much better job of solving problems when they work with concrete examples instead of abstractions, so making the problem concrete can go a long way toward helping you find a solution.

A Primer on Reservoir Sampling

For this problem, the simplest concrete example would be a stream that only contained a single item. In this case, our algorithm should return this single element with probability 1. Now let’s try a slightly harder problem, a stream with exactly two elements. We know that we have to hold on to the first element we see from this stream, because we don’t know if we’re in the case that the stream only has one element. When the second element comes along, we know that we want to return one of the two elements, each with probability 1/2. So let’s generate a random number R between 0 and 1, and return the first element if R is less than 0.5 and return the second element if R is greater than 0.5.

Now let’s try to generalize this approach to a stream with three elements. After we’ve seen the second element in the stream, we’re now holding on to either the first element or the second element, each with probability 1/2. When the third element arrives, what should we do? Well, if we know that there are only three elements in the stream, we need to return this third element with probability 1/3, which means that we’ll return the other element we’re holding with probability 1 – 1/3 = 2/3. That means that the probability of returning each element in the stream is as follows:

1. First Element: (1/2) * (2/3) = 1/3

2. Second Element: (1/2) * (2/3) = 1/3

3. Third Element: 1/3

By considering the stream of three elements, we see how to generalize this algorithm to any N: at step N, keep the new element with probability 1/N, which means we keep the element we are currently holding with probability (N-1)/N. Since the held element is any particular earlier element with probability 1/(N-1), each earlier element ends up selected with probability (1/(N-1)) * ((N-1)/N) = 1/N.
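In Python, the single-item version of this algorithm takes only a few lines (a minimal sketch of the technique; the function name is mine):

```python
import random

def reservoir_sample_one(stream):
    """Select one item uniformly at random from a stream of unknown length."""
    chosen = None
    for n, item in enumerate(stream, start=1):
        # Keep the nth element with probability 1/n; otherwise keep
        # whatever element we are currently holding on to.
        if random.random() < 1.0 / n:
            chosen = item
    return chosen
```

Note that the stream is consumed exactly once and only a single item is ever held in memory, which is exactly what the interview question demands.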

This general technique is called reservoir sampling, and it is useful in a number of applications that require us to analyze very large data sets. You can find an excellent overview of a set of algorithms for performing reservoir sampling in this blog post by Greg Grothaus. I’d like to focus on two of those algorithms in particular, and talk about how they are used in Cloudera ML, our open-source collection of data preparation and machine learning algorithms for Hadoop.

Applied Reservoir Sampling in Cloudera ML

The first of the algorithms Greg describes is a distributed reservoir sampling algorithm. You’ll note that for the algorithm we described above to work, all of the elements in the stream must be read sequentially. To create a distributed reservoir sample of size K, we use a MapReduce analogue of the ORDER BY RAND() trick/anti-pattern from SQL: for each element in the set, we generate a random number R between 0 and 1, and keep the K elements that have the largest values of R. This trick is especially useful when we want to create stratified samples from a large dataset. Each stratum is a specific combination of categorical variables that is important for an analysis, such as gender, age, or geographical location. If there is significant skew in our input data set, it’s possible that a naive random sampling of observations will underrepresent certain strata in the dataset. Cloudera ML has a sample command that can be used to create stratified samples for text files and Hive tables (via the HCatalog interface to the Hive Metastore) such that N records will be selected for every combination of the categorical variables that define the strata.
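The largest-R trick can be sketched on a single machine with a min-heap; in the distributed setting, each mapper would run something like this over its partition and the reducer would re-take the top K of the merged results (function name and structure are my own illustration, not Cloudera ML’s code):

```python
import heapq
import random

def reservoir_sample_k(stream, k):
    """Keep the k items with the largest random tags R in (0, 1).

    Since each item's tag is an independent uniform draw, the k largest
    tags identify a uniform random subset of size k. Per-partition top-k
    results can be merged by re-taking the top k of their union, which is
    what makes this trick MapReduce-friendly.
    """
    heap = []  # min-heap of (R, item); the root is the smallest tag kept
    for item in stream:
        r = random.random()
        if len(heap) < k:
            heapq.heappush(heap, (r, item))
        elif r > heap[0][0]:
            heapq.heapreplace(heap, (r, item))
    return [item for _, item in heap]
```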

The second algorithm is even more interesting: a weighted distributed reservoir sample, where every item in the set has an associated weight, and we want to sample such that the probability that an item is selected is proportional to its weight. It wasn’t clear whether this was even possible until Pavlos Efraimidis and Paul Spirakis figured out a way to do it and published it in the 2005 paper “Weighted Random Sampling with a Reservoir.” The solution is as simple as it is elegant, and it is based on the same idea as the distributed reservoir sampling algorithm described above. For each item in the stream, we compute a score as follows: first, generate a random number R between 0 and 1, and then take the wth root of R, where w is the weight of the current item; that is, the score is R^(1/w). Return the K items with the highest scores as the sample. Items with higher weights will tend to have scores closer to 1, and are thus more likely to be picked than items with smaller weights.
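The Efraimidis-Spirakis scoring rule drops straight into the same heap structure (again an illustrative sketch under my own naming, not the Cloudera ML implementation):

```python
import heapq
import random

def weighted_reservoir_sample(weighted_stream, k):
    """Efraimidis-Spirakis weighted sampling: score each (item, weight)
    pair as R ** (1 / weight) for R uniform in (0, 1), and keep the k
    highest-scoring items."""
    heap = []  # min-heap of (score, item); the root is the worst kept score
    for item, weight in weighted_stream:
        score = random.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (score, item))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, item))
    return [item for _, item in heap]
```

With k = 1, an item of weight 10 should beat an item of weight 1 roughly ten times out of eleven, which matches the proportional-to-weight guarantee.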

In Cloudera ML, we use the weighted reservoir sampling algorithm in order to cut down on the number of passes over the input data that the scalable k-means++ algorithm needs to perform.

The ksketch command runs the k-means++ initialization procedure, performing a small number of iterations over the input data set to select points that form a representative sample (or sketch) of the overall data set. For each iteration, the probability that a given point should be added to the sketch is proportional to its distance from the closest point in the current sketch. By using the weighted reservoir sampling algorithm, we can select the points to add to the next sketch in a single pass over the input data, instead of one pass to compute the overall cost of the clustering and a second pass to select the points based on those cost calculations.
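To make the idea concrete, here is a hypothetical single iteration of that selection step, using distance-to-the-nearest-sketch-point as the weight in the scoring rule described above; `sketch_update` and `dist` are illustrative names for a 1-D toy version, not the actual ksketch code:

```python
import heapq
import random

def sketch_update(points, current_sketch, k, dist=lambda a, b: abs(a - b)):
    """One k-means++-style sketch iteration (toy illustration): in a single
    pass, weight each point by its distance to the nearest sketch point,
    then draw k new sketch points with a weighted reservoir sample."""
    heap = []  # min-heap of (score, point) holding the k best candidates
    for p in points:
        weight = min(dist(p, c) for c in current_sketch)
        if weight <= 0:
            continue  # point already coincides with a sketch point
        score = random.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (score, p))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, p))
    return current_sketch + [p for _, p in heap]
```

Because the weighted sample is drawn while the distances are being computed, the cost calculation and the point selection share one pass over the data, which is the saving the paragraph above describes.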

These Books Behind Me Don’t Just Make The Office Look Good

Interesting algorithms aren’t just for the engineers building distributed file systems and search engines; they can also come in handy when you’re working on large-scale data analysis and statistical modeling problems. I’ll try to write some additional posts on algorithms that are interesting as well as useful for data scientists to learn, but in the meantime, it never hurts to brush up on your Knuth.

How to hire data scientists, and get hired as one

As you might have heard before, if you read McKinsey reports, the New York Times or just about any technology news site, data scientists are in high demand. Heck, the Harvard Business Review called it the sexiest job of the 21st century. But landing a gig as a data scientist isn’t easy — especially a top-notch gig at a major web or e-commerce company where merely talented people are a dime a dozen.

However, companies are starting to talk openly about what they look for in data scientists, including the skills someone should have and what they’ll need to know to survive an interview. I spent a day at the Predictive Analytics World conference on Monday and heard both Netflix and Orbitz give their two cents. That’s also the same day Hortonworks published a blog post about how to build a data science team.

Granted that “data scientist” is a nebulous term — perhaps as much so as “big data” — these tips (a mashup of all three sources) are still broadly applicable. If you want to make the leap from guy who knows data to data scientist, I suggest paying attention.

1. Know the core competencies.

For most of us, there’s readin’, ’ritin’ and ’rithmetic. For data scientists, there’s SQL, statistics, predictive modeling and programming (probably Python). If you don’t have at least a grounding in these skills, you’re probably not getting through the door, in part because they form a common language that lets people from different backgrounds talk to each other.

Hortonworks’ Ofer Mendelevitch describes the ideal data scientist as occupying a place on the spectrum between a software engineer and a research scientist. In distinguishing a great engineer, mathematician or data analyst from a data scientist, programming skills are probably the biggest variable. That’s because being able to write code means you’ll have an easier time testing out your hypotheses and algorithms, hacking through certain problems and generally thinking in ways that actually relate to the products your employer is building.

Source: Hortonworks

Chris Pouliot, director of algorithms and analytics at Netflix, said even being able to “pseudo-code” might be good enough if someone is otherwise a strong candidate. You can pick up SQL or Python or whatever you need pretty quickly, he noted.

Or, hinted Orbitz VP of Advanced Analytics Sameer Chopra, you could just suck it up and learn Python now: “If you were to leave today and ask ‘What specific skills should I learn?’: Python.”

2. Know a little more.

Of course, just meeting the minimum requirements never got anybody a job (well, almost nobody). What Pouliot is really looking for in a candidate is: an advanced degree in a quantitative field; hands-on experience hacking data (ideally using Hive, Pig, SQL or Python); good exploratory analysis skills; the ability to work with engineering teams; and the ability to create algorithms and models rather than relying on out-of-the-box ones.

Chopra’s advice was to get up to speed on machine learning, especially if you want to work in Silicon Valley, where machine learning has exploded in popularity. He’s also a big fan of honing those hacking skills because data munging is such a valuable skill when you’re dealing with so many types of data that you need to process so they work together. If you can do quality analytics across myriad data sources, Chopra said, “you can write your own ticket in this day and age.”

Oh, and if you’re planning to work at a startup, he added, R is almost a must-know for anyone whose job will entail statistical analysis.

3. Embrace online learning.

If it all sounds a little daunting, don’t be too worried, Chopra advised. That’s because there are plenty of opportunities to learn these new skills online, via both massive open online courses (he’s particularly keen on Udacity’s Computer Science 101 and Andrew Ng’s machine learning course on Coursera) and universities’ own online curricula. Chopra also suggested joining professional groups on LinkedIn, participating in Kaggle competitions and maybe even getting out of the house by going to meetups.

Whatever you’re curious about, though — text mining, natural language processing, deep learning — you can probably find someone willing to teach you for free or nearly free, and any additional skills will help set you apart from the crowd.

4. Learn to tell a story.

Last month at Structure: Data, DJ Patil told me that one of the biggest skill shortcomings in data science is the ability to tell a story with data beyond just pointing to the numbers. Chopra agreed, noting that today’s new visualization tools make it easier to display data in formats that non-scientists might be able to (or at least want to) consume. A corollary of storytelling is good, old-fashioned communication: All the charts in the world won’t make a difference if you can’t communicate to product managers or executives why your findings matter.

Pouliot is a little less sold on communication skills, though — at least sometimes. If you’re an engineer primarily talking to other engineers, he told the room, you can probably speak all the jargon you want. It’s only when someone has a business-facing role that communication really becomes important.

5. Prepare to be tested (aka “Your pedigree means nothing”).

After you’ve learned all these skills, added them to your résumé and talked to a hiring manager about how good you are at them, it’s likely testing time. Prospective Netflix data scientists go through a battery of exercises, Pouliot says, including explaining projects they’ve worked on and questions to determine the depth of their knowledge. They’ll also be asked to devise a framework that solves a problem of the interviewer’s choice.

Chris Pouliot

One thing Pouliot warned about is an over-reliance on what’s on your résumé. Right off the bat, for example, he’ll test the heck out of the skills or knowledge that someone claims, to ensure they really know it.

Having a Stanford degree and work experience at Google don’t necessarily make someone a shoo-in, either. Pouliot acknowledged during a quick chat after his presentation that he’s been seduced by the perfect resume before — even going so far as to cut a few corners to get someone in for an interview — only to be disappointed in the end. Everyone has to pass the tests, he said, and some of the best applicants on paper crashed and burned very early in the process.

6. Exercise creativity.

It’s during the testing phase at places like Netflix that all those personal skills and experience can come into play. There’s often no right answer when it comes to answering the hypotheticals an interviewer like Pouliot might ask, and he gives bonus points for solutions he’s never seen before. “Creativity is one of the biggest things to look for when hiring data scientists,” he said. Later, he added, “Creativity is king, I think, for a great data scientist.”

Bonus tips for anyone hiring and managing data scientists

Technically, Pouliot’s talk at Predictive Analytics World was about hiring data scientists, but many of the insights were probably more valuable to aspiring data scientists. Some of them, though, were definitely for management, possibly at the C-level. A few points to consider:

Netflix has a standalone data science team that works closely with other departments but ultimately answers to itself. This helps the data scientists collaborate with one another, gives them upward mobility (i.e., they might never become director of marketing, but they could become director of data science) and makes it easier to manage them because everyone speaks the same language so an employee knows his boss knows his stuff.

However, he noted, the alternative approach of embedding data scientists within other departments does bring its own benefits. That type of setup can result in better alignment of research efforts and business needs, and it can help products get built faster because everyone is on the same page. Pouliot suggested one compromise might be to keep a centralized data science team but locate it physically near the other teams it will be interacting with most often; another is simply to ensure you have representatives from every stakeholder department present for meetings and problem-solving exercises.

Actually, if you just cannot hire data scientists with all the skills you want them to have, Mendelevitch from Hortonworks suggests a similar tactic. It can be difficult to teach applied math to software engineers and vice versa, so, he writes, “[S]imply build a Hadoop data science team that combines data engineers and applied scientists, working in tandem to build your data products. Back when I was at Yahoo!, that’s exactly the structure we had: applied scientists working together with data engineers to build large-scale computational advertising systems.”

If you want to retain your good data scientists once you’ve hired them — especially in Silicon Valley where they can walk out the door and get five offers — paying them the market rate is a good start. Additionally, Pouliot said, letting them work on challenging products will keep them happy. Micro-managing them will not.