PySpark Exercises – 101 PySpark Exercises for Data Analysis
101 PySpark exercises are designed to challenge your logical muscle and to help you internalize data manipulation with PySpark, the Python API for Apache Spark. The questions come in three difficulty levels, with L1 being the easiest and L3 the hardest.
1. How to import PySpark and check the version?
Difficulty Level: L1
2. How to convert the index of a PySpark DataFrame into a column?
Difficulty Level: L1
Hint: A PySpark DataFrame doesn’t have an explicit index like a Pandas DataFrame. However, you can add a new column that acts as a row number.
Input:
python
# Assuming df is your DataFrame
df = spark.createDataFrame([
("Alice", 1),
("Bob", 2),
("Charlie", 3),
], ["Name", "Value"])
df.show()
+-------+-----+
| Name|Value|
+-------+-----+
| Alice| 1|
| Bob| 2|
|Charlie| 3|
+-------+-----+
Expected Output:
python
+-------+-----+-----+
| Name|Value|index|
+-------+-----+-----+
| Alice| 1| 0|
| Bob| 2| 1|
|Charlie| 3| 2|
+-------+-----+-----+
3. How to combine many lists to form a PySpark DataFrame?
Difficulty Level: L1
Create a PySpark DataFrame from list1 and list2.
Hint: To create a DataFrame from multiple lists, first create an RDD (Resilient Distributed Dataset) from those lists and then convert the RDD to a DataFrame.
Input:
python
# Define your lists
list1 = ["a", "b", "c", "d"]
list2 = [1, 2, 3, 4]
4. How to get the items of list A not present in list B?
Difficulty Level: L2
To get the items of list_A not present in list_B, use the subtract operation on RDDs (Resilient Distributed Datasets).
Input:
python
list_A = [1, 2, 3, 4, 5]
list_B = [4, 5, 6, 7, 8]
Expected Output:
python
#> [1, 2, 3]
5. How to get the items not common to both list A and list B?
Difficulty Level: L2
Get all items of list_A and list_B not common to both.
Input:
python
list_A = [1, 2, 3, 4, 5]
list_B = [4, 5, 6, 7, 8]
6. How to get the minimum, 25th percentile, median, 75th, and max of a numeric column?
Difficulty Level: L2
Compute the minimum, 25th percentile, median, 75th percentile, and maximum of column Age.
Input
python
# Create a sample DataFrame
data = [("A", 10), ("B", 20), ("C", 30), ("D", 40), ("E", 50), ("F", 15), ("G", 28), ("H", 54), ("I", 41), ("J", 86)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()
python
+----+---+
|Name|Age|
+----+---+
| A| 10|
| B| 20|
| C| 30|
| D| 40|
| E| 50|
| F| 15|
| G| 28|
| H| 54|
| I| 41|
| J| 86|
+----+---+
7. How to get frequency counts of unique items of a column?
Difficulty Level: L1
Calculate the frequency counts of each unique value.
Input
python
from pyspark.sql import Row
# Sample data
data = [
Row(name='John', job='Engineer'),
Row(name='John', job='Engineer'),
Row(name='Mary', job='Scientist'),
Row(name='Bob', job='Engineer'),
Row(name='Bob', job='Engineer'),
Row(name='Bob', job='Scientist'),
Row(name='Sam', job='Doctor'),
]
# create DataFrame
df = spark.createDataFrame(data)
# show DataFrame
df.show()
python
+----+---------+
|name| job|
+----+---------+
|John| Engineer|
|John| Engineer|
|Mary|Scientist|
| Bob| Engineer|
| Bob| Engineer|
| Bob|Scientist|
| Sam| Doctor|
+----+---------+
8. How to keep only top 2 most frequent values as it is and replace everything else as ‘Other’?
Difficulty Level: L3
Input
python
from pyspark.sql import Row
# Sample data
data = [
Row(name='John', job='Engineer'),
Row(name='John', job='Engineer'),
Row(name='Mary', job='Scientist'),
Row(name='Bob', job='Engineer'),
Row(name='Bob', job='Engineer'),
Row(name='Bob', job='Scientist'),
Row(name='Sam', job='Doctor'),
]
# create DataFrame
df = spark.createDataFrame(data)
# show DataFrame
df.show()
python
+----+---------+
|name| job|
+----+---------+
|John| Engineer|
|John| Engineer|
|Mary|Scientist|
| Bob| Engineer|
| Bob| Engineer|
| Bob|Scientist|
| Sam| Doctor|
+----+---------+
9. How to Drop rows with NA values specific to a particular column?
Difficulty Level: L1
Input
python
# Assuming df is your DataFrame
df = spark.createDataFrame([
("A", 1, None),
("B", None, "123" ),
("B", 3, "456"),
("D", None, None),
], ["Name", "Value", "id"])
df.show()
python
+----+-----+----+
|Name|Value| id|
+----+-----+----+
| A| 1|null|
| B| null| 123|
| B| 3| 456|
| D| null|null|
+----+-----+----+
10. How to rename columns of a PySpark DataFrame using two lists – one containing the old column names and the other containing the new column names?
Difficulty Level: L2
Input
python
# suppose you have the following DataFrame
df = spark.createDataFrame([(1, 2, 3), (4, 5, 6)], ["col1", "col2", "col3"])
# old column names
old_names = ["col1", "col2", "col3"]
# new column names
new_names = ["new_col1", "new_col2", "new_col3"]
df.show()
python
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 2| 3|
| 4| 5| 6|
+----+----+----+
11. How to bin a numeric list to 10 groups of equal size?
Difficulty Level: L2
Input
python
from pyspark.sql.functions import rand
from pyspark.ml.feature import Bucketizer
# Create a DataFrame with a single column "values" filled with random numbers
num_items = 100
df = spark.range(num_items).select(rand(seed=42).alias("values"))
df.show(5)
python
+-------------------+
| values|
+-------------------+
| 0.619189370225301|
| 0.5096018842446481|
| 0.8325259388871524|
|0.26322809041172357|
| 0.6702867696264135|
+-------------------+
only showing top 5 rows
12. How to create a contingency table?
Difficulty Level: L1
Input
python
# Example DataFrame
data = [("A", "X"), ("A", "Y"), ("A", "X"), ("B", "Y"), ("B", "X"), ("C", "X"), ("C", "X"), ("C", "Y")]
df = spark.createDataFrame(data, ["category1", "category2"])
df.show()
python
+---------+---------+
|category1|category2|
+---------+---------+
| A| X|
| A| Y|
| A| X|
| B| Y|
| B| X|
| C| X|
| C| X|
| C| Y|
+---------+---------+
13. How to find the numbers that are multiples of 3 from a column?
Difficulty Level: L2
Input
python
from pyspark.sql.functions import rand
# Generate a DataFrame with a single column "id" with 10 rows
df = spark.range(10)
# Generate a random float between 0 and 1, scale and shift it to get a random integer between 1 and 10
df = df.withColumn("random", ((rand(seed=42) * 10) + 1).cast("int"))
# Show the DataFrame
df.show()
python
+---+------+
| id|random|
+---+------+
| 0| 7|
| 1| 6|
| 2| 9|
| 3| 7|
| 4| 3|
| 5| 8|
| 6| 9|
| 7| 8|
| 8| 3|
| 9| 8|
+---+------+
14. How to extract items at given positions from a column?
Difficulty Level: L2
Input
python
from pyspark.sql.functions import rand
# Generate a DataFrame with a single column "id" with 10 rows
df = spark.range(10)
# Generate a random float between 0 and 1, scale and shift it to get a random integer between 1 and 10
df = df.withColumn("random", ((rand(seed=42) * 10) + 1).cast("int"))
# Show the DataFrame
df.show()
pos = [0, 4, 8, 5]
python
+---+------+
| id|random|
+---+------+
| 0| 7|
| 1| 6|
| 2| 9|
| 3| 7|
| 4| 3|
| 5| 8|
| 6| 9|
| 7| 8|
| 8| 3|
| 9| 8|
+---+------+
15. How to stack two DataFrames vertically?
Difficulty Level: L1
Input
python
# Create DataFrame for region A
df_A = spark.createDataFrame([("apple", 3, 5), ("banana", 1, 10), ("orange", 2, 8)], ["Name", "Col_1", "Col_2"])
df_A.show()
# Create DataFrame for region B
df_B = spark.createDataFrame([("apple", 3, 5), ("banana", 1, 15), ("grape", 4, 6)], ["Name", "Col_1", "Col_3"])
df_B.show()
python
+------+-----+-----+
| Name|Col_1|Col_2|
+------+-----+-----+
| apple| 3| 5|
|banana| 1| 10|
|orange| 2| 8|
+------+-----+-----+
+------+-----+-----+
| Name|Col_1|Col_3|
+------+-----+-----+
| apple| 3| 5|
|banana| 1| 15|
| grape| 4| 6|
+------+-----+-----+
16. How to compute the mean squared error on a truth and predicted columns?
Difficulty Level: L2
Input
python
# Assume you have a DataFrame df with two columns "actual" and "predicted"
# For the sake of example, we'll create a sample DataFrame
data = [(1, 1), (2, 4), (3, 9), (4, 16), (5, 25)]
df = spark.createDataFrame(data, ["actual", "predicted"])
df.show()
python
+------+---------+
|actual|predicted|
+------+---------+
| 1| 1|
| 2| 4|
| 3| 9|
| 4| 16|
| 5| 25|
+------+---------+
17. How to convert the first character of each element in a series to uppercase?
Difficulty Level: L1
python
# Suppose you have the following DataFrame
data = [("john",), ("alice",), ("bob",)]
df = spark.createDataFrame(data, ["name"])
df.show()
python
+-----+
| name|
+-----+
| john|
|alice|
| bob|
+-----+
18. How to compute summary statistics for all columns in a dataframe?
Difficulty Level: L1
python
# For the sake of example, we'll create a sample DataFrame
data = [('James', 34, 55000),
('Michael', 30, 70000),
('Robert', 37, 60000),
('Maria', 29, 80000),
('Jen', 32, 65000)]
df = spark.createDataFrame(data, ["name", "age" , "salary"])
df.show()
python
+-------+---+------+
| name|age|salary|
+-------+---+------+
| James| 34| 55000|
|Michael| 30| 70000|
| Robert| 37| 60000|
| Maria| 29| 80000|
| Jen| 32| 65000|
+-------+---+------+
19. How to calculate the number of characters in each word in a column?
Difficulty Level: L1
python
# Suppose you have the following DataFrame
data = [("john",), ("alice",), ("bob",)]
df = spark.createDataFrame(data, ["name"])
df.show()
python
+-----+
| name|
+-----+
| john|
|alice|
| bob|
+-----+
20. How to compute difference of differences between consecutive numbers of a column?
Difficulty Level: L2
python
# For the sake of example, we'll create a sample DataFrame
data = [('James', 34, 55000),
('Michael', 30, 70000),
('Robert', 37, 60000),
('Maria', 29, 80000),
('Jen', 32, 65000)]
df = spark.createDataFrame(data, ["name", "age" , "salary"])
df.show()
python
+-------+---+------+
| name|age|salary|
+-------+---+------+
| James| 34| 55000|
|Michael| 30| 70000|
| Robert| 37| 60000|
| Maria| 29| 80000|
| Jen| 32| 65000|
+-------+---+------+
21. How to get the day of month, week number, day of year and day of week from date strings?
Difficulty Level: L2
python
# example data
data = [("2023-05-18","01 Jan 2010",), ("2023-12-31", "01 Jan 2010",)]
df = spark.createDataFrame(data, ["date_str_1", "date_str_2"])
df.show()
python
+----------+-----------+
|date_str_1| date_str_2|
+----------+-----------+
|2023-05-18|01 Jan 2010|
|2023-12-31|01 Jan 2010|
+----------+-----------+
22. How to convert year-month string to dates corresponding to the 4th day of the month?
Difficulty Level: L2
python
# example dataframe
df = spark.createDataFrame([('Jan 2010',), ('Feb 2011',), ('Mar 2012',)], ['MonthYear'])
df.show()
python
+---------+
|MonthYear|
+---------+
| Jan 2010|
| Feb 2011|
| Mar 2012|
+---------+
23. How to filter words that contain at least 2 vowels from a series?
Difficulty Level: L3
python
# example dataframe
df = spark.createDataFrame([('Apple',), ('Orange',), ('Plan',) , ('Python',) , ('Money',)], ['Word'])
df.show()
python
+------+
| Word|
+------+
| Apple|
|Orange|
| Plan|
|Python|
| Money|
+------+
24. How to filter valid emails from a list?
Difficulty Level: L3
python
# Create a list
data = ['buying books at amazom.com', 'rameses@egypt.com', 'matt@t.co', 'narendra@modi.com']
# Convert the list to DataFrame
df = spark.createDataFrame(data, "string")
df.show(truncate=False)
python
+--------------------------+
|value |
+--------------------------+
|buying books at amazom.com|
|rameses@egypt.com |
|matt@t.co |
|narendra@modi.com |
+--------------------------+
25. How to Pivot PySpark DataFrame?
Difficulty Level: L3
Convert region categories to columns and sum the revenue.
python
# Sample data
data = [
(2021, 1, "US", 5000),
(2021, 1, "EU", 4000),
(2021, 2, "US", 5500),
(2021, 2, "EU", 4500),
(2021, 3, "US", 6000),
(2021, 3, "EU", 5000),
(2021, 4, "US", 7000),
(2021, 4, "EU", 6000),
]
# Create DataFrame
columns = ["year", "quarter", "region", "revenue"]
df = spark.createDataFrame(data, columns)
df.show()
python
+----+-------+------+-------+
|year|quarter|region|revenue|
+----+-------+------+-------+
|2021| 1| US| 5000|
|2021| 1| EU| 4000|
|2021| 2| US| 5500|
|2021| 2| EU| 4500|
|2021| 3| US| 6000|
|2021| 3| EU| 5000|
|2021| 4| US| 7000|
|2021| 4| EU| 6000|
+----+-------+------+-------+
26. How to get the mean of a variable grouped by another variable?
Difficulty Level: L3
python
# Sample data
data = [("1001", "Laptop", 1000),
("1002", "Mouse", 50),
("1003", "Laptop", 1200),
("1004", "Mouse", 30),
("1005", "Smartphone", 700)]
# Create DataFrame
columns = ["OrderID", "Product", "Price"]
df = spark.createDataFrame(data, columns)
df.show()
python
+-------+----------+-----+
|OrderID| Product|Price|
+-------+----------+-----+
| 1001| Laptop| 1000|
| 1002| Mouse| 50|
| 1003| Laptop| 1200|
| 1004| Mouse| 30|
| 1005|Smartphone| 700|
+-------+----------+-----+
27. How to compute the Euclidean distance between two columns?
Difficulty Level: L3
Compute the Euclidean distance between series (points) p and q, without using a packaged formula.
python
# Define your series
data = [(1, 10), (2, 9), (3, 8), (4, 7), (5, 6), (6, 5), (7, 4), (8, 3), (9, 2), (10, 1)]
# Convert list to DataFrame
df = spark.createDataFrame(data, ["series1", "series2"])
df.show()
python
+-------+-------+
|series1|series2|
+-------+-------+
| 1| 10|
| 2| 9|
| 3| 8|
| 4| 7|
| 5| 6|
| 6| 5|
| 7| 4|
| 8| 3|
| 9| 2|
| 10| 1|
+-------+-------+
28. How to replace missing spaces in a string with the least frequent character?
Difficulty Level: L3
Replace the spaces in my_str with the least frequent character.
python
#Sample DataFrame
df = spark.createDataFrame([('dbc deb abed gade',),], ["string"])
df.show()
python
+-----------------+
| string|
+-----------------+
|dbc deb abed gade|
+-----------------+
Desired output
python
+-----------------+-----------------+
| string| modified_string|
+-----------------+-----------------+
|dbc deb abed gade|dbccdebcabedcgade|
+-----------------+-----------------+
29. How to create a TimeSeries starting ‘2000-01-01’ that contains 10 consecutive Saturdays with random numbers as values?
Difficulty Level: L3
Desired output
Values can be random.
python
+----------+--------------+
| date|random_numbers|
+----------+--------------+
|2000-01-01| 8|
|2000-01-08| 3|
|2000-01-15| 8|
|2000-01-22| 5|
|2000-01-29| 4|
|2000-02-05| 6|
|2000-02-12| 8|
|2000-02-19| 1|
|2000-02-26| 9|
|2000-03-04| 3|
+----------+--------------+
30. How to get the nrows, ncolumns, datatype of a dataframe?
Difficulty Level: L1
Get the number of rows, columns, the datatype and summary statistics of each column of the Churn_Modelling dataset. Also get the NumPy array and list equivalents of the dataframe.
python
url = "https://raw.githubusercontent.com/selva86/datasets/master/Churn_Modelling.csv"
spark.sparkContext.addFile(url)
df = spark.read.csv(SparkFiles.get("Churn_Modelling.csv"), header=True, inferSchema=True)
df.show(5, truncate=False)
python
+---------+----------+--------+-----------+---------+------+---+------+---------+-------------+---------+--------------+---------------+------+
|RowNumber|CustomerId|Surname |CreditScore|Geography|Gender|Age|Tenure|Balance |NumOfProducts|HasCrCard|IsActiveMember|EstimatedSalary|Exited|
+---------+----------+--------+-----------+---------+------+---+------+---------+-------------+---------+--------------+---------------+------+
|1 |15634602 |Hargrave|619 |France |Female|42 |2 |0.0 |1 |1 |1 |101348.88 |1 |
|2 |15647311 |Hill |608 |Spain |Female|41 |1 |83807.86 |1 |0 |1 |112542.58 |0 |
|3 |15619304 |Onio |502 |France |Female|42 |8 |159660.8 |3 |1 |0 |113931.57 |1 |
|4 |15701354 |Boni |699 |France |Female|39 |1 |0.0 |2 |0 |0 |93826.63 |0 |
|5 |15737888 |Mitchell|850 |Spain |Female|43 |2 |125510.82|1 |1 |1 |79084.1 |0 |
+---------+----------+--------+-----------+---------+------+---+------+---------+-------------+---------+--------------+---------------+------+
only showing top 5 rows
31. How to rename a specific columns in a dataframe?
Difficulty Level: L2
Input
python
# Suppose you have the following DataFrame
df = spark.createDataFrame([('Alice', 1, 30),('Bob', 2, 35)], ["name", "age", "qty"])
df.show()
# Rename lists for specific columns
old_names = ["qty", "age"]
new_names = ["user_qty", "user_age"]
python
+-----+---+---+
| name|age|qty|
+-----+---+---+
|Alice| 1| 30|
| Bob| 2| 35|
+-----+---+---+
32. How to check if a dataframe has any missing values and count of missing values in each column?
Difficulty Level: L2
Input
python
# Assuming df is your DataFrame
df = spark.createDataFrame([
("A", 1, None),
("B", None, "123" ),
("B", 3, "456"),
("D", None, None),
], ["Name", "Value", "id"])
df.show()
python
+----+-----+----+
|Name|Value| id|
+----+-----+----+
| A| 1|null|
| B| null| 123|
| B| 3| 456|
| D| null|null|
+----+-----+----+
33. How to replace missing values of multiple numeric columns with the mean?
Difficulty Level: L2
Input
python
df = spark.createDataFrame([
("A", 1, None),
("B", None, 123 ),
("B", 3, 456),
("D", 6, None),
], ["Name", "var1", "var2"])
df.show()
python
+----+----+----+
|Name|var1|var2|
+----+----+----+
| A| 1|null|
| B|null| 123|
| B| 3| 456|
| D| 6|null|
+----+----+----+
34. How to change the order of columns of a dataframe?
Difficulty Level: L1
Input
python
# Sample data
data = [("John", "Doe", 30), ("Jane", "Doe", 25), ("Alice", "Smith", 22)]
# Create DataFrame from the data
df = spark.createDataFrame(data, ["First_Name", "Last_Name", "Age"])
# Show the DataFrame
df.show()
python
+----------+---------+---+
|First_Name|Last_Name|Age|
+----------+---------+---+
| John| Doe| 30|
| Jane| Doe| 25|
| Alice| Smith| 22|
+----------+---------+---+
35. How to format or suppress scientific notation in a PySpark DataFrame?
python
# Assuming you have a DataFrame df and the column you want to format is 'your_column'
df = spark.createDataFrame([(1, 0.000000123), (2, 0.000023456), (3, 0.000345678)], ["id", "your_column"])
df.show()
python
+---+-----------+
| id|your_column|
+---+-----------+
| 1| 1.23E-7|
| 2| 2.3456E-5|
| 3| 3.45678E-4|
+---+-----------+
36. How to format all the values in a dataframe as percentages?
Difficulty Level: L2
Input
python
# Sample data
data = [(0.1, .08), (0.2, .06), (0.33, .02)]
df = spark.createDataFrame(data, ["numbers_1", "numbers_2"])
df.show()
python
+---------+---------+
|numbers_1|numbers_2|
+---------+---------+
| 0.1| 0.08|
| 0.2| 0.06|
| 0.33| 0.02|
+---------+---------+
37. How to filter every nth row in a dataframe?
Difficulty Level: L2
Input
python
# Sample data
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3), ("Dave", 4), ("Eve", 5),
("Frank", 6), ("Grace", 7), ("Hannah", 8), ("Igor", 9), ("Jack", 10)]
# Create DataFrame
df = spark.createDataFrame(data, ["Name", "Number"])
df.show()
python
+-------+------+
| Name|Number|
+-------+------+
| Alice| 1|
| Bob| 2|
|Charlie| 3|
| Dave| 4|
| Eve| 5|
| Frank| 6|
| Grace| 7|
| Hannah| 8|
| Igor| 9|
| Jack| 10|
+-------+------+
38 How to get the row number of the nth largest value in a column?
Difficulty Level: L2
Input
python
from pyspark.sql import Row
# Sample Data
data = [
Row(id=1, column1=5),
Row(id=2, column1=8),
Row(id=3, column1=12),
Row(id=4, column1=1),
Row(id=5, column1=15),
Row(id=6, column1=7),
]
df = spark.createDataFrame(data)
df.show()
python
+---+-------+
| id|column1|
+---+-------+
| 1| 5|
| 2| 8|
| 3| 12|
| 4| 1|
| 5| 15|
| 6| 7|
+---+-------+
39. How to get the last n rows of a dataframe with row sum > 100?
Difficulty Level: L2
Input
python
# Sample data
data = [(10, 25, 70),
(40, 5, 20),
(70, 80, 100),
(10, 2, 60),
(40, 50, 20)]
# Create DataFrame
df = spark.createDataFrame(data, ["col1", "col2", "col3"])
# Display original DataFrame
df.show()
python
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 10| 25| 70|
| 40| 5| 20|
| 70| 80| 100|
| 10| 2| 60|
| 40| 50| 20|
+----+----+----+
40. How to create a column containing the minimum by maximum of each row?
Difficulty Level: L2
Input
python
# Sample Data
data = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)]
# Create DataFrame
df = spark.createDataFrame(data, ["col1", "col2", "col3"])
df.show()
python
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 2| 3|
| 4| 5| 6|
| 7| 8| 9|
| 10| 11| 12|
+----+----+----+
41. How to create a column that contains the penultimate value in each row?
Difficulty Level: L2
Create a new column ‘penultimate’ which has the second largest value of each row of df
Input
python
data = [(10, 20, 30),
(40, 60, 50),
(80, 70, 90)]
df = spark.createDataFrame(data, ["Column1", "Column2", "Column3"])
df.show()
python
+-------+-------+-------+
|Column1|Column2|Column3|
+-------+-------+-------+
| 10| 20| 30|
| 40| 60| 50|
| 80| 70| 90|
+-------+-------+-------+
42. How to normalize all columns in a dataframe?
Difficulty Level: L2
Normalize all columns of df by subtracting the column mean and dividing by the standard deviation.
Then rescale all columns of df so that the minimum value in each column is 0 and the maximum is 1.
Input
python
# create a sample dataframe
data = [(1, 2, 3),
(2, 3, 4),
(3, 4, 5),
(4, 5, 6)]
df = spark.createDataFrame(data, ["Col1", "Col2", "Col3"])
df.show()
python
+----+----+----+
|Col1|Col2|Col3|
+----+----+----+
| 1| 2| 3|
| 2| 3| 4|
| 3| 4| 5|
| 4| 5| 6|
+----+----+----+
43. How to get the positions where values of two columns match?
Difficulty Level: L1
Input
python
# Create sample DataFrame
data = [("John", "John"), ("Lily", "Lucy"), ("Sam", "Sam"), ("Lucy", "Lily")]
df = spark.createDataFrame(data, ["Name1", "Name2"])
df.show()
python
+-----+-----+
|Name1|Name2|
+-----+-----+
| John| John|
| Lily| Lucy|
| Sam| Sam|
| Lucy| Lily|
+-----+-----+
44. How to create lags and leads of a column by group in a dataframe?
Difficulty Level: L2
Input
python
# Create a sample DataFrame
data = [("2023-01-01", "Store1", 100),
("2023-01-02", "Store1", 150),
("2023-01-03", "Store1", 200),
("2023-01-04", "Store1", 250),
("2023-01-05", "Store1", 300),
("2023-01-01", "Store2", 50),
("2023-01-02", "Store2", 60),
("2023-01-03", "Store2", 80),
("2023-01-04", "Store2", 90),
("2023-01-05", "Store2", 120)]
df = spark.createDataFrame(data, ["Date", "Store", "Sales"])
df.show()
python
+----------+------+-----+
| Date| Store|Sales|
+----------+------+-----+
|2023-01-01|Store1| 100|
|2023-01-02|Store1| 150|
|2023-01-03|Store1| 200|
|2023-01-04|Store1| 250|
|2023-01-05|Store1| 300|
|2023-01-01|Store2| 50|
|2023-01-02|Store2| 60|
|2023-01-03|Store2| 80|
|2023-01-04|Store2| 90|
|2023-01-05|Store2| 120|
+----------+------+-----+
45. How to get the frequency of unique values in the entire dataframe?
Difficulty Level: L3
Get the frequency of unique values in the entire dataframe df.
Input
python
# Create a numeric DataFrame
data = [(1, 2, 3),
(2, 3, 4),
(1, 2, 3),
(4, 5, 6),
(2, 3, 4)]
df = spark.createDataFrame(data, ["Column1", "Column2", "Column3"])
# Print DataFrame
df.show()
python
+-------+-------+-------+
|Column1|Column2|Column3|
+-------+-------+-------+
| 1| 2| 3|
| 2| 3| 4|
| 1| 2| 3|
| 4| 5| 6|
| 2| 3| 4|
+-------+-------+-------+
46. How to replace both the diagonals of dataframe with 0?
Difficulty Level: L3
Replace the values in both diagonals of df with 0.
Input
python
# Create a numeric DataFrame
data = [(1, 2, 3, 4),
(2, 3, 4, 5),
(1, 2, 3, 4),
(4, 5, 6, 7)]
df = spark.createDataFrame(data, ["col_1", "col_2", "col_3", "col_4"])
# Print DataFrame
df.show()
python
+-----+-----+-----+-----+
|col_1|col_2|col_3|col_4|
+-----+-----+-----+-----+
| 1| 2| 3| 4|
| 2| 3| 4| 5|
| 1| 2| 3| 4|
| 4| 5| 6| 7|
+-----+-----+-----+-----+
47. How to reverse the rows of a dataframe?
Difficulty Level: L2
Reverse all the rows of dataframe df.
Input
python
# Create a numeric DataFrame
data = [(1, 2, 3, 4),
(2, 3, 4, 5),
(3, 4, 5, 6),
(4, 5, 6, 7)]
df = spark.createDataFrame(data, ["col_1", "col_2", "col_3", "col_4"])
# Print DataFrame
df.show()
python
+-----+-----+-----+-----+
|col_1|col_2|col_3|col_4|
+-----+-----+-----+-----+
| 1| 2| 3| 4|
| 2| 3| 4| 5|
| 3| 4| 5| 6|
| 4| 5| 6| 7|
+-----+-----+-----+-----+
48. How to create one-hot encodings of a categorical variable (dummy variables)?
Difficulty Level: L2
Get one-hot encodings for column Categories in the dataframe df and append it as columns.
Input
python
data = [("A", 10),("A", 20),("B", 30),("B", 20),("B", 30),("C", 40),("C", 10),("D", 10)]
columns = ["Categories", "Value"]
df = spark.createDataFrame(data, columns)
df.show()
python
+----------+-----+
|Categories|Value|
+----------+-----+
| A| 10|
| A| 20|
| B| 30|
| B| 20|
| B| 30|
| C| 40|
| C| 10|
| D| 10|
+----------+-----+
49. How to pivot the dataframe (converting rows into columns)?
Difficulty Level: L2
Convert the region column categories to columns.
Input
python
# Sample data
data = [
(2021, 1, "US", 5000),
(2021, 1, "EU", 4000),
(2021, 2, "US", 5500),
(2021, 2, "EU", 4500),
(2021, 3, "US", 6000),
(2021, 3, "EU", 5000),
(2021, 4, "US", 7000),
(2021, 4, "EU", 6000),
]
# Create DataFrame
columns = ["year", "quarter", "region", "revenue"]
df = spark.createDataFrame(data, columns)
50. How to unpivot the dataframe (converting columns into rows)?
Difficulty Level: L2
Unpivot the EU and US columns into region and revenue columns.
Input
python
# Sample data
data = [(2021, 2, 4500, 5500),
(2021, 1, 4000, 5000),
(2021, 3, 5000, 6000),
(2021, 4, 6000, 7000)]
# Create DataFrame
columns = ["year", "quarter", "EU", "US"]
df = spark.createDataFrame(data, columns)
df.show()
python
+----+-------+----+----+
|year|quarter| EU| US|
+----+-------+----+----+
|2021| 2|4500|5500|
|2021| 1|4000|5000|
|2021| 3|5000|6000|
|2021| 4|6000|7000|
+----+-------+----+----+
Expected Output
python
+----+-------+------+-------+
|year|quarter|region|revenue|
+----+-------+------+-------+
|2021| 2| EU| 4500|
|2021| 2| US| 5500|
|2021| 1| EU| 4000|
|2021| 1| US| 5000|
|2021| 3| EU| 5000|
|2021| 3| US| 6000|
|2021| 4| EU| 6000|
|2021| 4| US| 7000|
+----+-------+------+-------+
51. How to impute missing values with Zero?
Difficulty Level: L1
Input
python
# Suppose df is your DataFrame
df = spark.createDataFrame([(1, None), (None, 2), (3, 4), (5, None)], ["a", "b"])
df.show()
python
+----+----+
| a| b|
+----+----+
| 1|null|
|null| 2|
| 3| 4|
| 5|null|
+----+----+
52. How to identify continuous variables in a dataframe and create a list of those column names?
Difficulty Level: L3
Input
python
url = "https://raw.githubusercontent.com/selva86/datasets/master/Churn_Modelling_m.csv"
spark.sparkContext.addFile(url)
df = spark.read.csv(SparkFiles.get("Churn_Modelling_m.csv"), header=True, inferSchema=True)
df.show(2, truncate=False)
python
+---------+----------+--------+-----------+---------+------+---+------+--------+-------------+---------+--------------+---------------+------+
|RowNumber|CustomerId|Surname |CreditScore|Geography|Gender|Age|Tenure|Balance |NumOfProducts|HasCrCard|IsActiveMember|EstimatedSalary|Exited|
+---------+----------+--------+-----------+---------+------+---+------+--------+-------------+---------+--------------+---------------+------+
|1 |15634602 |Hargrave|619 |France |Female|42 |2 |0.0 |1 |1 |1 |101348.88 |1 |
|2 |15647311 |Hill |608 |Spain |Female|41 |1 |83807.86|1 |0 |1 |112542.58 |0 |
+---------+----------+--------+-----------+---------+------+---+------+--------+-------------+---------+--------------+---------------+------+
only showing top 2 rows
53. How to calculate Mode of a PySpark DataFrame column?
Difficulty Level: L1
Input
python
# Create a sample DataFrame
data = [(1, 2, 3), (2, 2, 3), (2, 2, 4), (1, 2, 3), (1, 1, 3)]
columns = ["col1", "col2", "col3"]
df = spark.createDataFrame(data, columns)
df.show()
python
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 2| 3|
| 2| 2| 3|
| 2| 2| 4|
| 1| 2| 3|
| 1| 1| 3|
+----+----+----+
54. How to find the installed location of Apache Spark and PySpark?
Difficulty Level: L1
55. How to convert a column to lower case using UDF?
Difficulty Level: L2
Input
python
# Create a DataFrame to test
data = [('John Doe', 'NEW YORK'),
('Jane Doe', 'LOS ANGELES'),
('Mike Johnson', 'CHICAGO'),
('Sara Smith', 'SAN FRANCISCO')]
df = spark.createDataFrame(data, ['Name', 'City'])
df.show()
python
+------------+-------------+
| Name| City|
+------------+-------------+
| John Doe| NEW YORK|
| Jane Doe| LOS ANGELES|
|Mike Johnson| CHICAGO|
| Sara Smith|SAN FRANCISCO|
+------------+-------------+
56. How to convert a PySpark DataFrame to a Pandas DataFrame?
Difficulty Level: L1
Input
python
# Create a DataFrame to test
data = [('John Doe', 'NEW YORK'),
('Jane Doe', 'LOS ANGELES'),
('Mike Johnson', 'CHICAGO'),
('Sara Smith', 'SAN FRANCISCO')]
pysparkDF = spark.createDataFrame(data, ['Name', 'City'])
pysparkDF.show()
python
+------------+-------------+
| Name| City|
+------------+-------------+
| John Doe| NEW YORK|
| Jane Doe| LOS ANGELES|
|Mike Johnson| CHICAGO|
| Sara Smith|SAN FRANCISCO|
+------------+-------------+
57. How to View PySpark Cluster Details?
Difficulty Level: L1
58. How to View PySpark Cluster Configuration Details?
Difficulty Level: L1
59. How to restrict PySpark to use only a given number of cores on the system?
Difficulty Level: L1
60. How to cache PySpark DataFrame or objects and delete cache?
Difficulty Level: L2
In PySpark, caching or persisting data is done to speed up data retrieval during iterative and interactive computations.
61. How to Divide a PySpark DataFrame randomly in a given ratio (0.8, 0.2)?
Difficulty Level: L1
62. How to build logistic regression in PySpark?
Difficulty Level: L2
Input
python
# Create a sample dataframe
data = spark.createDataFrame([
(0, 1.0, -1.0),
(1, 2.0, 1.0),
(1, 3.0, -2.0),
(0, 4.0, 1.0),
(1, 5.0, -3.0),
(0, 6.0, 2.0),
(1, 7.0, -1.0),
(0, 8.0, 3.0),
(1, 9.0, -2.0),
(0, 10.0, 2.0),
(1, 11.0, -3.0),
(0, 12.0, 1.0),
(1, 13.0, -1.0),
(0, 14.0, 2.0),
(1, 15.0, -2.0),
(0, 16.0, 3.0),
(1, 17.0, -3.0),
(0, 18.0, 1.0),
(1, 19.0, -1.0),
(0, 20.0, 2.0)
], ["label", "feat1", "feat2"])
63. How to convert the categorical string data into numerical data or index?
Difficulty Level: L2
Input
python
# Create a sample DataFrame
data = [('cat',), ('dog',), ('mouse',), ('fish',), ('dog',), ('cat',), ('mouse',)]
df = spark.createDataFrame(data, ["animal"])
df.show()
python
+------+
|animal|
+------+
| cat|
| dog|
| mouse|
| fish|
| dog|
| cat|
| mouse|
+------+
64. How to calculate Correlation of two variables in a DataFrame?
Difficulty Level: L1
Input
python
# Create a sample dataframe
data = [Row(feature1=5, feature2=10, feature3=25),
Row(feature1=6, feature2=15, feature3=35),
Row(feature1=7, feature2=25, feature3=30),
Row(feature1=8, feature2=20, feature3=60),
Row(feature1=9, feature2=30, feature3=70)]
df = spark.createDataFrame(data)
df.show()
python
+--------+--------+--------+
|feature1|feature2|feature3|
+--------+--------+--------+
| 5| 10| 25|
| 6| 15| 35|
| 7| 25| 30|
| 8| 20| 60|
| 9| 30| 70|
+--------+--------+--------+
65. How to calculate Correlation Matrix?
Difficulty Level: L2
Input
python
# Create a sample dataframe
data = [Row(feature1=5, feature2=10, feature3=25),
Row(feature1=6, feature2=15, feature3=35),
Row(feature1=7, feature2=25, feature3=30),
Row(feature1=8, feature2=20, feature3=60),
Row(feature1=9, feature2=30, feature3=70)]
df = spark.createDataFrame(data)
df.show()
python
+--------+--------+--------+
|feature1|feature2|feature3|
+--------+--------+--------+
| 5| 10| 25|
| 6| 15| 35|
| 7| 25| 30|
| 8| 20| 60|
| 9| 30| 70|
+--------+--------+--------+
66. How to calculate VIF (Variance Inflation Factor ) for set of variables in a DataFrame?
Difficulty Level: L3
Input
python
# Create a sample dataframe
data = [Row(feature1=5, feature2=10, feature3=25),
Row(feature1=6, feature2=15, feature3=35),
Row(feature1=7, feature2=25, feature3=30),
Row(feature1=8, feature2=20, feature3=60),
Row(feature1=9, feature2=30, feature3=70)]
df = spark.createDataFrame(data)
df.show()
python
+--------+--------+--------+
|feature1|feature2|feature3|
+--------+--------+--------+
| 5| 10| 25|
| 6| 15| 35|
| 7| 25| 30|
| 8| 20| 60|
| 9| 30| 70|
+--------+--------+--------+
67. How to perform Chi-Square test?
Difficulty Level: L2
Input
python
# Create a sample dataframe
data = [(1, 0, 0, 1, 1),
(2, 0, 1, 0, 0),
(3, 1, 0, 0, 0),
(4, 0, 0, 1, 1),
(5, 0, 1, 1, 0)]
df = spark.createDataFrame(data, ["id", "feature1", "feature2", "feature3", "label"])
df.show()
python
+---+--------+--------+--------+-----+
| id|feature1|feature2|feature3|label|
+---+--------+--------+--------+-----+
| 1| 0| 0| 1| 1|
| 2| 0| 1| 0| 0|
| 3| 1| 0| 0| 0|
| 4| 0| 0| 1| 1|
| 5| 0| 1| 1| 0|
+---+--------+--------+--------+-----+
68. How to calculate the Standard Deviation?
Difficulty Level: L1
Input
python
# Sample data
data = [("James", "Sales", 3000),
("Michael", "Sales", 4600),
("Robert", "Sales", 4100),
("Maria", "Finance", 3000),
("James", "Sales", 3000),
("Scott", "Finance", 3300),
("Jen", "Finance", 3900),
("Jeff", "Marketing", 3000),
("Kumar", "Marketing", 2000),
("Saif", "Sales", 4100)]
# Create DataFrame
df = spark.createDataFrame(data, ["Employee", "Department", "Salary"])
df.show()
python
+--------+----------+------+
|Employee|Department|Salary|
+--------+----------+------+
| James| Sales| 3000|
| Michael| Sales| 4600|
| Robert| Sales| 4100|
| Maria| Finance| 3000|
| James| Sales| 3000|
| Scott| Finance| 3300|
| Jen| Finance| 3900|
| Jeff| Marketing| 3000|
| Kumar| Marketing| 2000|
| Saif| Sales| 4100|
+--------+----------+------+
69. How to calculate missing value percentage in each column?
Difficulty Level: L3
Input
python
# Create a sample dataframe
data = [("John", "Doe", None),
(None, "Smith", "New York"),
("Mike", "Smith", None),
("Anna", "Smith", "Boston"),
(None, None, None)]
df = spark.createDataFrame(data, ["FirstName", "LastName", "City"])
df.show()
python
+---------+--------+--------+
|FirstName|LastName| City|
+---------+--------+--------+
| John| Doe| null|
| null| Smith|New York|
| Mike| Smith| null|
| Anna| Smith| Boston|
| null| null| null|
+---------+--------+--------+
70. How to get the names of DataFrame objects that have been created in an environment?
Difficulty Level: L2