Data Engineering Interview Questions: SQL and Databases I
Starting a career in data engineering is both exciting and challenging. As data continues to be the core of decision-makers in organizations, the role of a data engineer becomes increasingly important. It’s not about moving data from point A to point B. It’s about writing efficient SQL queries, building scalable databases, and understanding the inner workings of database engines to optimize performance. In the competitive world of data engineering interviews, you need more than just a little knowledge. You must demonstrate proficiency in writing optimized SQL, designing data warehouses that can scale with growing data demands, and show an understanding of at least one database engine’s internals. This depth of knowledge not only sets you apart but also prepares you to face real-world challenges effectively.
As a data engineer, your ability to write efficient SQL and structure your databases in a scalable way is your edge. Understanding the internals of database engines gives you an advantage, allowing you to optimize and troubleshoot with confidence. Whether you’re preparing for your first interview or looking to refresh your knowledge, this guide help you with the concepts and tools needed to improve. It will not only cover the questions you might face but also provide the context and explanations to stretch your comprehension. So, let’s go ahead and give you the skills that make you a great candidate in the field of data engineering.
What is the difference between OLAP and OLTP databases?
- The primary distinction between the two is that one uses data to uncover insightful knowledge while the other is just operational.
- Traditional OLAP systems involve an understanding of data modelling and, in many situations, coordination across many business divisions. On the other hand, OLTP systems are business-specific, with any outage resulting in disrupted transactions and revenue and brand damage.
- OLTP systems optimize write operations and exhibit low data latency, ensuring efficient handling of real-time transactions. OLAP systems enhance query response times and improve performance for complex queries. These systems are optimized for read operations and exhibit high data latency.
What are the different kinds of JOIN clauses, you know? When do you use them?
Let’s say we have a table of Instagram friends, and the other table of LinkedIn connections.
- CROSS join: How many combinations of friends and connections do I have?
- UNION: How many friends do my Instagram friends have and how many connections do my LinkedIn connections have?
- LEFT join: How many friends and connections do my Instagram friends have? (Regardless of if they are on LinkedIn)
- RIGHT join: How many friends and connections do my LinkedIn connections have? (Regardless of if they are on Instagram)
- INNER join: How many friends and connections do my friends who are on both Instagram and LinkedIn have?
- FULL OUTER join: How many friends and connections do my Instagram friends or LinkedIn connections have?
How do you use the HAVING clause?
In SQL, the HAVING
clause is used to filter groups of data after they have been aggregated using the GROUP BY
clause. While the WHERE
clause filters individual rows before any grouping occurs, the HAVING
clause filters the grouped records based on aggregate conditions.
Basic Concepts:
- WHERE Clause: Filters individual rows before grouping. Cannot use aggregate functions.
- GROUP BY Clause: Groups rows with the same values in specified columns.
- HAVING Clause:
- Filters groups after aggregation. Can include aggregate functions. The
HAVING
clause is executed afterGROUP BY
and beforeORDER BY
. - Aggregate Functions in HAVING. You can use functions like
SUM()
,COUNT()
,AVG()
,MIN()
, andMAX()
in theHAVING
clause. - Multiple Conditions in HAVING. You can include multiple conditions using
AND
,OR
, etc.
Example:
Suppose you have a table named Sales
with the following columns:
SaleID
: Unique identifier for each sale.SalesPerson
: Name of the salesperson.Region
: Sales region.Amount
: Sales amount.
Data:
| SaleID | SalesPerson | Region | Amount |
|--------|-------------|---------|--------|
| 1 | Alice | East | 500 |
| 2 | Bob | West | 700 |
| 3 | Alice | East | 600 |
| 4 | Charlie | East | 800 |
| 5 | Bob | West | 400 |
| 6 | Alice | East | 300 |
| 7 | Charlie | East | 700 |
| 8 | Bob | West | 600 |
| 9 | Alice | East | 200 |
| 10 | Charlie | East | 500 |
Find all salespersons in the ‘East’ region whose total sales exceed $1,500.
SQL:
SELECT
SalesPerson,
SUM(Amount) AS TotalSales
FROM
Sales
WHERE
Region = 'East' -- Filters base data
GROUP BY
SalesPerson -- Groups data by SalesPerson
HAVING
SUM(Amount) > 1500 -- Filters aggregated data
ORDER BY
TotalSales DESC; -- Orders results
The meaning of the code:
WHERE Region = 'East'
: Filters the rows to include only sales from the 'East' region.GROUP BY SalesPerson
: Groups the sales data by each salesperson.SUM(Amount) AS TotalSales
: Calculates the total sales amount for each salesperson.HAVING SUM(Amount) > 1500
: Filters out groups where the total sales are $1,500 or less.ORDER BY TotalSales DESC
: Sorts the results in descending order of total sales.
Query Result:
| SalesPerson | TotalSales |
|-------------|------------|
| Charlie | 2000 |
| Alice | 1600 |
Sanity Check:
- Charlie: Total sales in ‘East’ = 800 + 700 + 500 = 2,000
- Alice: Total sales in ‘East’ = 500 + 600 + 300 + 200 = 1,600
More Difficult Example
Suppose you want to find salespersons who’s total sales are between $1,500 and $2,500 in the ‘East’ region.
SELECT
SalesPerson,
SUM(Amount) AS TotalSales
FROM
Sales
WHERE
Region = 'East'
GROUP BY
SalesPerson
HAVING
SUM(Amount) > 1500 AND SUM(Amount) < 2500;
Query Result:
| SalesPerson | TotalSales |
|-------------|------------|
| Charlie | 2000 |
| Alice | 1600 |
The last Example using COUNT()
Suppose you have a Customers
table with a Country
column, and you want to find countries with more than 5 customers.
SELECT
Country,
COUNT(CustomerID) AS NumberOfCustomers
FROM
Customers
GROUP BY
Country
HAVING
COUNT(CustomerID) > 5;
Explanation:
- Groups customers by country.
- Counts the number of customers in each country.
- Uses
HAVING
to filter out countries with 5 or fewer customers.
What views are, and why would you use one instead of a table?
Advantages
- Views provide an abstraction over tables. You can add/remove fields easily in a view without modifying your underlying schema
- Views can model complex joins easily.
- You can easily manage your GRANTS directly on views rather than the actual tables. This is especially helpful if you know a certain user may only access a view.
- Views can help you with backwards compatibility. You can change the underlying schema, but the views can hide those facts from a certain client.
Disadvantages
- You lose information about relations (primary keys, foreign keys)
- It’s not obvious whether you will be able to insert/update a view, because the view hides its underlying joins from you
What are primary and foreign keys?
- Primary keys are used to identify and index each row within a single table uniquely.
- Foreign keys are used to link rows in two different tables such that a row can only be added or updated in Table_A if the value in its foreign key column exists in the relevant column of Table_B
Summary
Data engineering interviews require a strong understanding of SQL, database scalability, and engine optimization. Key concepts include OLAP vs. OLTP systems, JOIN clauses for combining data, and the HAVING clause to filter aggregated data. Views are useful for abstracting tables and managing access but may limit relational information. Understanding primary and foreign keys is fundamental for linking tables and maintaining data integrity.
Conclusion
Mastering these SQL concepts — JOINS, HAVING, views, and keys — gives you an edge in data engineering, helping you not only pass interviews but also address real-world challenges effectively. Stay tuned for the next chapter, where we dive even deeper into advanced data engineering techniques.