6 mins read

3 Common Data Science Concepts Tested Across All Interviews

There are usually several concepts that interviewers are testing in data science interviews, but since they may only have time to ask 1 or 2 questions, they will try to bundle the concepts into a single question. Therefore, it is important to know what these concepts are in order to look for them in an interview.

So what are they really testing for? Really, what an interviewer is looking for is interviewees with a deep understanding of metric design and implementation of real-world scenarios that would be present in the data. The key phrase here is “real world scenario,” which means there will likely be multiple edge cases and scenarios that you’ll need to think about to solve the problem. There are 3 common concepts that they test to test your understanding of how to implement code that solves real-world scenarios.

Since they only have time to ask 1 or 2 questions in an interview before time runs out, you’ll often see all 3 concepts wrapped up in one question. I see this question, or a version of this question, ( platform.stratascratch.com/coding-question?id=10300&python= ) in almost every interview I’ve been in or been given. Follow me and see if you can answer this question.

The 3 concepts you need to know are CASE statements, JOIN statements, and subqueries/CTEs. Let’s break down a real interview question that covers these 3 concepts and talk about them in depth. The link to the question is here ((platform.stratascratch.com/coding-question?id=10300&python=) if you want to follow along.
Aggregates of CASE STATEMENTS
You probably have some kind of categorization problem where you need to categorize the data based on the earnings you see in the table. This is very common in practice and you will always be likely to categorize and clean data. So a CASE statement is the simplest technique to test.

Throw in the addition of aggregates like sum() and count() and you’re testing to see if you really know what gets returned in a case where it’s not just implemented. Depending on the case statements, you can always add aggregate functions like a count or a sum.

Here is an example of a CASE statement with a simple aggregation in the SELECT clause for the question.

You’ll see in the CASE statement below, we’re classifying users based on whether or not they’re paying customers. We then apply a sum() as it is a quick way to count the number of paying customers vs. non-paying customers in a simple query. If we didn’t have the CASE statement, it would take us two queries to find both numbers.

SELECT date, sum (CASE
WHEN customer_payment = ‘yes’ THEN downloads
END) AS paying,
sum(CASE
WHEN customer_payment = ‘no’ THEN downloads
END) LIKE non_paying
FROM ms_user_dimension TO

Unions
The second concept is to JOIN tables. Can you join tables? This is the lowest bar you need to jump to be an analyst, let alone a data scientist. This bar is basically on the ground, so you can really walk over it.

So, in interviews, do you usually do LEFT JOIN, CROSS JOIN, INNER JOIN? Most of your work will use a LEFT JOIN, so you’ll be judged on practicality. You will almost never use a cross join. You’ll use an inner join quite a bit, but the left join is a bit more complicated, so they’ll use it just as an additional filter.

Auto joins are common because it’s not always obvious that you would use them. But they are common in practice.

In the following example, we are joining tables to the CASE statement. We are joining two tables to our table using LEFT JOIN.

SELECT date, sum (CASE
WHEN customer_payment = ‘yes’ THEN downloads
END) AS paying,
sum(CASE
WHEN customer_payment = ‘no’ THEN downloads
END) LIKE non_paying
FROM ms_user_dimension TO
LEFT JOIN ms_acc_dimension b ON a.acc_id = b.acc_id
LEFT JOIN ms_download_facts c ON a.user_id=c.user_id
GROUP BY date
ORDER BY date

Subquery/CTE
The last common concept is a subquery/CTE, basically a concept that you’re working on and then need to work on more. This is a test to see if you can break your problem down into logical steps. Some solutions require more than one step to solve, so you are testing to see if you can write code that follows a logical flow. Not necessarily complicated or complex, but multi-step and pragmatic. This is especially useful in practice because you will be 100% writing code that is over hundreds of lines and you need to be able to create solutions that follow a good flow.

In the example below, I take the query we wrote earlier and put it in a subquery so we can query its data. This way we can apply an additional filter on the HAVING clause and keep the entire solution in a single query.

SELECT date, non_paying,
payment
FROM
(SELECT date, sum(CASE
WHEN customer_payment = ‘yes’ THEN downloads
END) AS paying,
sum(CASE
WHEN customer_payment = ‘no’ THEN downloads
END) LIKE non_paying
FROM ms_user_dimension TO
LEFT JOIN ms_acc_dimension b ON a.acc_id = b.acc_id
LEFT JOIN ms_download_facts c ON a.user_id=c.user_id
GROUP BY date
ORDER BY date) t
GROUP BY t.date,
t.paying,
t.non_paying
HAVING (not_paying – paying) >0
SORT BY t.date ASC

Leave a Reply

Your email address will not be published. Required fields are marked *