3 Common Data Science Concepts Tested on All Interviews

There are usually several concepts interviewers are testing for on data science interviews but since they might only have time to ask 1-2 questions, they’ll try to pack the concepts into one question. So it’s important to know what these concepts are so you can look out for them in an interview.

So what are they really testing for? Really what an interviewer is looking for are interviewees with an in-depth understanding of metric design and implementation of a real-world scenarios that would be present in the data. The key phrase here is “real-world scenario”, which means that there are probably going to be multiple edge cases and scenarios you’ll need to think through to solve the problem. There are 3 common concepts that they test for that test your understanding of how to implement code that solves real-world scenarios.

Since they only have time to ask 1-2 questions in an interview before their time is up, you’ll often see all 3 concepts wrapped in one question. I see this question, or a version of this question, ( platform.stratascratch.com/coding-question?id=10300&python= ) on almost every interview I’ve been on or given. Follow along with me and see if you would be able to answer this question.

The 3 concepts you need to know are CASE statements, JOINs, and subqueries/CTEs. Let’s go through a real interview question that cover these 3 concepts and talk about them in-depth. The link to the question is here ((platform.stratascratch.com/coding-question?id=10300&python=) if you want to follow along.

Aggregates from CASE STATEMENTs

You’ll likely get some sort of categorization question where you need to categorize data based on values you see in the table. This is super common in practice and you’ll likely always be categorizing and cleaning up data. So a CASE statement is the simplest technique to test for.

Add the addition of aggregates like sum() and count() and they’ll be testing to see if you actually know what is being returned in a case when, not just the implementation of it. Based on the case statements, you can always add an aggregate functions like a count or a sum.

Here is an example of a CASE statement with a simple aggregation in the SELECT clause for the question.

You see in the CASe statement below, we’re categorizing users based on if they are paying customers or mot. We then apply a sum() as it’s a quick way to count the number of paying customers vs non-paying customers in one simple query. If we did not have the CASE statement, it would take us two queries to find both numbers.

SELECT date, sum(CASE

WHEN paying_customer = ‘yes’ THEN downloads

END) AS paying,

sum(CASE

WHEN paying_customer = ‘no’ THEN downloads

END) AS non_paying

FROM ms_user_dimension a

JOINs

The 2nd concept is JOINing tables. Can you join tables? This is the lowest bar you need to jump over to be an analyst, much less a data scientist. This bar is basically on the ground so you can really just step over it.

So on interviews — do they usually do a LEFT JOIN, CROSS JOIN, INNER JOIN? Most of your work will be using a LEFT JOIN so they’re testing you based on practicality. You’ll almost never use a cross join. You’ll use an inner join quite a bit but left join is slightly more complicated so they’ll use that just as an additional filter.

Self joins are common because it’s not always obvious you’d be using that. But they’re common in practice.

In the below example, we’re joining tables to the CASE statement. We’re joining two tables to our main table using a LEFT JOIN.

SELECT date, sum(CASE

WHEN paying_customer = ‘yes’ THEN downloads

END) AS paying,

sum(CASE

WHEN paying_customer = ‘no’ THEN downloads

END) AS non_paying

FROM ms_user_dimension a

LEFT JOIN ms_acc_dimension b ON a.acc_id = b.acc_id

LEFT JOIN ms_download_facts c ON a.user_id=c.user_id

GROUP BY date

ORDER BY date

Subquery/CTE

The last common concept is a subquery/CTE, basically some concept where you’re doing some work and then need to do more work on it. This is testing to see if you can break up your problem into logical steps. Some solutions take more than one step to solve so they’re testing to see if you can write code that follows a logical flow. Not necessarily complicated or complex, but multi-step and pragmatic. This is especially useful in practice because you’ll 100% be writing code that’s over hundreds of lines long and you need to be able to create solutions that follow a good flow.

In the below example, I’m taking the query we wrote above and putting it in a subquery so that we can query its data. This way we can apply an additional filter in the HAVING clause and keep the entire solution to one query.

SELECT date, non_paying,

paying

FROM

(SELECT date, sum(CASE

WHEN paying_customer = ‘yes’ THEN downloads

END) AS paying,

sum(CASE

WHEN paying_customer = ‘no’ THEN downloads

END) AS non_paying

FROM ms_user_dimension a

LEFT JOIN ms_acc_dimension b ON a.acc_id = b.acc_id

LEFT JOIN ms_download_facts c ON a.user_id=c.user_id

GROUP BY date

ORDER BY date) t

GROUP BY t.date,

t.paying,

t.non_paying

HAVING (non_paying – paying) >0

ORDER BY t.date ASC



Source by Rakib Raihan