3 Hard Data Science Python Interview Questions - Test your knowledge

Python knowledge is a must for any data scientist. Test yourself with the questions below and see if you are ready for a real interview.

How would you write unit tests for a data processing function that uses pandas DataFrames? Which libraries would you use?

Explanation

Unit testing data processing functions is crucial for ensuring their correctness and reliability, especially in production pipelines. The process involves defining expected inputs and outputs for various scenarios.

Libraries:

  • pytest: The de facto standard testing framework in Python. It offers a simple syntax for writing tests, powerful fixture support, and a rich plugin ecosystem.
  • pandas.testing: This module provides specialized functions for comparing pandas objects, which is far more informative than using == or .equals(). For example, pd.testing.assert_frame_equal() checks for equality while tolerating minor floating-point differences, and it produces a detailed error message when the DataFrames don't match. A short sketch of the difference follows this list.
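
This minimal sketch, using only standard pandas calls, shows == stumbling over floating-point noise that assert_frame_equal() tolerates by default:

Python
import pandas as pd

left = pd.DataFrame({'x': [0.1 + 0.2, 1.0]})
right = pd.DataFrame({'x': [0.3, 1.0]})

# Element-wise == is tripped up by float noise: 0.1 + 0.2 != 0.3 exactly
print((left == right).all().all())  # False

# assert_frame_equal() tolerates tiny floating-point differences by default
# and raises with a detailed diff only on a real mismatch
pd.testing.assert_frame_equal(left, right)  # passes silently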

Steps to Write a Unit Test:

  1. Structure: Create a separate test file (e.g., test_my_processing.py).
  2. Define a Test Function: Create a function that starts with test_.
  3. Arrange: Inside the function, create a sample input DataFrame that represents a specific case (e.g., normal data, data with NaNs, empty DataFrame).
  4. Act: Call the data processing function you want to test with the sample input.
  5. Assert: Create an "expected output" DataFrame. Use a function like pd.testing.assert_frame_equal() to compare the actual output from your function with the expected output.

Code Example

Python
# --- my_processing.py (the function to be tested) ---
import pandas as pd

def process_data(df: pd.DataFrame) -> pd.DataFrame:
    """Calculates a new column and drops rows with NaNs."""
    if df.empty:
        return df
    df_processed = df.copy()
    df_processed['c'] = df_processed['a'] + df_processed['b']
    df_processed = df_processed.dropna()
    return df_processed

# --- test_my_processing.py (the unit test file) ---
import pandas as pd
import numpy as np
# from my_processing import process_data # Uncomment when in separate files

def test_process_data_normal_case():
    # 1. Arrange
    input_df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
    expected_output_df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [5, 7, 9]})

    # 2. Act
    actual_output_df = process_data(input_df)

    # 3. Assert
    pd.testing.assert_frame_equal(actual_output_df, expected_output_df)

def test_process_data_with_nans():
    # 1. Arrange
    input_df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [4, 5, 6]})
    # 'b' stays int64 after dropna; only 'a' and 'c' are floats because of the NaN
    expected_output_df = pd.DataFrame({'a': [1.0, 2.0], 'b': [4, 5], 'c': [5.0, 7.0]}, index=[0, 1])

    # 2. Act
    actual_output_df = process_data(input_df)

    # 3. Assert
    pd.testing.assert_frame_equal(actual_output_df, expected_output_df)

# To run tests from the command line: pytest
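
Since the answer mentions pytest's fixture support, here is a minimal sketch of how a fixture could share a sample DataFrame across tests; the fixture name sample_df is just an illustration:

Python
import pandas as pd
import pytest

@pytest.fixture
def sample_df():
    # Rebuilt fresh for each test function that requests it
    return pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

def test_adds_column(sample_df):
    result = process_data(sample_df)
    assert 'c' in result.columns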

Explain what closures and lexical scoping are in Python. Provide a practical example of where this could be useful in a data analysis script.

Explanation

  • Lexical Scoping: This means that the scope of a variable is determined by where it is defined in the source code. An inner function can access variables from its outer (enclosing) function's scope, even after the outer function has finished executing.

  • Closure: A closure is a function object that "remembers" the variables from its enclosing lexical scope, even after the scope that created them has finished executing. It's an inner function bundled with access to its outer function's variables; in essence, it "closes over" the free variables of its environment.

Practical Use Case in Data Analysis:
Closures are excellent for creating custom, configurable functions. For example, you can create a "function factory" that generates specialized data transformation functions based on certain parameters. This is useful for creating reusable components in a data pipeline.

Code Example (Function Factory for Power Transformation)

Python
import pandas as pd

def make_power_transformer(exponent):
    """
    This is an outer function (a factory). It takes an exponent.
    """
    def power_transform(series):
        """
        This is the inner function. It "closes over" the `exponent`
        variable from its enclosing scope.
        """
        return series ** exponent

    # The factory returns the inner function
    return power_transform

# --- Using the factory to create specialized functions ---
square = make_power_transformer(2)
cube = make_power_transformer(3)

# Now, `square` and `cube` are functions that remember their respective exponents
data = pd.Series([1, 2, 3, 4])
print("--- Original Data ---")
print(data)

print("\n--- Squared Data ---")
print(square(data))

print("\n--- Cubed Data ---")
print(cube(data))
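
One way to confirm that each returned function really "remembers" its own exponent is to inspect the closure cells that CPython attaches to the function object, continuing the example above:

Python
# Each closure stores its `exponent` in a cell on the function object
print(square.__closure__[0].cell_contents)  # 2
print(cube.__closure__[0].cell_contents)    # 3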

How does NumPy's memory layout for arrays (C-style vs. Fortran-style) affect performance, especially in matrix operations?

Explanation

The memory layout of a multi-dimensional array determines how its elements are stored in a contiguous, one-dimensional block of memory.

  • C-style (Row-major order): This is the default in NumPy. Moving along a row means accessing adjacent memory locations; the last index changes the fastest.
    • Example (3,3) array: A[0,0], A[0,1], A[0,2], A[1,0], A[1,1]...
  • Fortran-style (Column-major order): Moving down a column means accessing adjacent memory locations; the first index changes the fastest.
    • Example (3,3) array: A[0,0], A[1,0], A[2,0], A[0,1], A[1,1]...
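
A quick way to see the two layouts from Python is to inspect an array's flags and strides (the byte step between consecutive elements along each axis):

Python
import numpy as np

a = np.arange(9, dtype=np.float64).reshape(3, 3)  # C-contiguous by default
print(a.flags['C_CONTIGUOUS'])  # True
print(a.strides)                # (24, 8): row neighbors are 8 bytes apart

f = np.asfortranarray(a)        # copy the same values into column-major order
print(f.flags['F_CONTIGUOUS'])  # True
print(f.strides)                # (8, 24): column neighbors are adjacent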

Performance Impact:
Modern CPUs are much faster at processing data that is sequential in memory due to a mechanism called caching. When the CPU requests a piece of memory, it also pulls in a larger, nearby block into a very fast cache. If the next piece of data you need is already in the cache (a "cache hit"), the operation is extremely fast. If it's not (a "cache miss"), the CPU has to fetch it from the much slower main RAM.

Therefore, for optimal performance, your algorithm should access array elements in the same order they are stored in memory.

  • For a C-style array, iterating row-by-row is faster.
  • For a Fortran-style array, iterating column-by-column is faster.

This is especially significant in matrix operations such as matrix multiplication, where the underlying BLAS/LAPACK routines are often optimized for a specific memory layout.

Code Example

Python
import numpy as np
import time

# Create large arrays in both C and Fortran order
size = 5000
c_array = np.zeros((size, size), order='C')
f_array = np.zeros((size, size), order='F')

# --- Summing along rows (favors C-style) ---
start = time.time()
np.sum(c_array, axis=1)
end = time.time()
print(f"Summing rows of C-style array: {end - start:.4f}s (FAST)")

start = time.time()
np.sum(f_array, axis=1)
end = time.time()
print(f"Summing rows of F-style array: {end - start:.4f}s (SLOWER)\n")

# --- Summing along columns (favors Fortran-style) ---
start = time.time()
np.sum(c_array, axis=0)
end = time.time()
print(f"Summing columns of C-style array: {end - start:.4f}s (SLOWER)")

start = time.time()
np.sum(f_array, axis=0)
end = time.time()
print(f"Summing columns of F-style array: {end - start:.4f}s (FAST)")
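
The same effect can be probed for matrix multiplication, though the timings depend heavily on which BLAS build NumPy is linked against; treat this as a sketch to experiment with rather than a prediction of which layout wins:

Python
import numpy as np
import time

n = 2000
a_c = np.random.rand(n, n)    # C-order by default
a_f = np.asfortranarray(a_c)  # same values, column-major copy

for name, a in [("C-order", a_c), ("F-order", a_f)]:
    start = time.time()
    a @ a                     # dispatched to the underlying BLAS routine
    print(f"{name} matmul: {time.time() - start:.3f}s")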



Our ebook with 100 questions and answers just like these is coming soon!
