
DataFrames in Julia

Written by Kabir | 6 min read

A DataFrame is a two-dimensional, mutable data structure used for handling tabular data. Unlike arrays and matrices, which are normally used with a single element type, a DataFrame can hold columns of different data types. The DataFrames package in Julia provides the DataFrame object, which is used to hold and manipulate tabular data in a flexible and convenient way. Mastering DataFrames is essential for data analysis, building machine learning models and other scientific computing. In this tutorial, I explain how to work with DataFrames in Julia.

Content

  1. Install DataFrames package in Julia
  2. Create new dataframe
  3. Import Data
  4. Data Exploration
  5. Indexing and filtering the DataFrame (multiple methods)
  6. Grouping and Summarizing Data
  7. Join Data
  8. Export DataFrames

1. Install DataFrames package in Julia

You can install any registered package in Julia with the Pkg.add() command. Let’s install DataFrames:
julia
using Pkg
Pkg.add("DataFrames")

2. Create new dataframe

In Julia, you can create a DataFrame in multiple ways: (i) in a single statement using DataFrame(), (ii) column by column, (iii) row by row.

(i) Create new dataframe in a single statement

You can create one using the DataFrame() function by passing the column names and values to it as keyword arguments. To do that, you first need to load the package by writing using DataFrames.
julia
using DataFrames

df = DataFrame(A = 1:5, B = ["A", "B", "C", "D", "E"])
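To see that each column really keeps its own data type, you can inspect the element type of every column. This is a minimal sketch on the same toy table; the variable name col_types is only for illustration.

```julia
using DataFrames

# Same toy table as in the text: an integer column and a string column
df = DataFrame(A = 1:5, B = ["A", "B", "C", "D", "E"])

# Collect the element type of each column, in column order
col_types = [eltype(col) for col in eachcol(df)]

println(col_types)
```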

(ii) Create new dataframe column by column

Alternatively, you can create an empty DataFrame using the DataFrame() function and then add columns one by one. Since DataFrames are mutable, you can modify them afterward as well.
julia

# Initialize Empty DataFrame
df = DataFrame()

# Add Columns
df.A = 1:5

df.B = ["A", "B", "C", "D", "E"]

df

(iii) Create new dataframe row by row

Create an empty DataFrame by passing the column names, each with an empty vector of the column’s data type, to the DataFrame() function. Now you can add rows one by one using the push!() function. This is like row binding.
julia
# Initialize empty DataFrame with columns
df = DataFrame(A = Int[], B = String[])

# Add Rows
push!(df, (1, "A"))
push!(df, (2, "B"))
push!(df, (3, "C"))
push!(df, (4, "D"))
push!(df, (5, "E"))

df
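Besides plain tuples, push!() also accepts named tuples, in which case the values are matched to columns by name rather than by position. The sketch below assumes a recent DataFrames release and uses made-up values.

```julia
using DataFrames

# Empty DataFrame with typed columns
df = DataFrame(A = Int[], B = String[])

# Positional tuple: values are matched to columns by position
push!(df, (1, "A"))

# Named tuple: values are matched to columns by name, so the order may differ
push!(df, (B = "B", A = 2))

println(df)
```

Matching by name is handy when rows arrive from a source whose field order you do not control.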

3. Import Data

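You have seen how to create DataFrames. Now, let’s see how to import an existing file into Julia as a DataFrame. There are different ways to import a dataset file. Let’s go through two of the most popular ones: (i) Using CSV.read() (ii) Using CSV.File() and converting the result to a DataFrame

3.1 Using CSV.read() function from CSV package

Older versions of the DataFrames package shipped a readtable() function for reading CSV-like files, but it has been removed from current releases. The CSV.read() function from the CSV package does the same job: it reads a CSV-like file directly into a DataFrame. The CSV package has to be installed first with Pkg.add("CSV").
julia
# Import data using CSV.read
using CSV, DataFrames
df = CSV.read("Data/insurance.csv", DataFrame)
first(df, 6)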

3.2 Using CSV package and later on converting the file to DataFrame

Read the file with the CSV.File() function, then convert it to a DataFrame using the DataFrame() constructor.
julia
# Add "CSV" package
using Pkg
Pkg.add("CSV")
julia
# Read the file using CSV.File and convert it to DataFrame
using CSV
df = DataFrame(CSV.File("Data/insurance.csv"))
first(df, 6)
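If you don’t have the insurance file at hand, you can try the import on a small file you write yourself. The sketch below assumes the CSV package is installed; the file name sample.csv and its three rows are invented for the example and only imitate the shape of the real data.

```julia
using CSV, DataFrames

# Hypothetical sample data written to a temporary file, standing in for insurance.csv
path = joinpath(mktempdir(), "sample.csv")
write(path, "age,sex,bmi\n19,female,27.9\n18,male,33.8\n28,male,33.0\n")

# Read the file straight into a DataFrame; column types are inferred from the values
df = CSV.read(path, DataFrame)

println(first(df, 2))
println(size(df))
```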

4. Data Exploration

Now that you know how to create and read DataFrames, how about exploring them a bit? Let’s see some useful examples.

4.1 Show all rows and columns of DataFrame

By default, Julia doesn’t print all the rows and columns of a large DataFrame; the output is truncated to fit the display. But if you want to see all the rows and columns, you can use the show() function with the allrows and allcols keyword arguments.
julia
# Show all rows of DataFrame
show(df, allrows=true)

# Show all columns of DataFrame
show(df, allcols=true)
I am not printing it here because of space constraints. But trust me, it works.

4.2 Show top/bottom rows of DataFrame

The first function is used to see the top n rows of a DataFrame. Older DataFrames versions also had head for this, but it has been removed.
julia
# Print first n rows of df
first(df, 10)
The last function is used to see the bottom n rows of a DataFrame. It replaces the older tail function.
julia
# Print last n rows of df
last(df, 10)

4.3 Size of DataFrame

The size function is used to see the size of a DataFrame. By size, I mean the number of rows and columns of the DataFrame.
julia
# Shape and size of the data
println(size(df))

# Summary function also serves the purpose
println(summary(df))
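Since size returns a (rows, columns) tuple, you can unpack it, or ask for one dimension at a time with nrow and ncol. A small sketch on a toy table:

```julia
using DataFrames

df = DataFrame(A = 1:5, B = ["A", "B", "C", "D", "E"])

# Destructure the (rows, columns) tuple returned by size
n_rows, n_cols = size(df)
println(n_rows, " rows, ", n_cols, " columns")

# Each dimension can also be requested on its own
println(nrow(df), " ", ncol(df))
println(size(df, 1), " ", size(df, 2))
```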

4.4 Get column names of DataFrame

The names function is used to get the column names of a DataFrame.
julia
# Column names of the data
names(df)

4.5 Description of DataFrame

The describe function is used to get a description of a DataFrame. It reports metrics such as the element type, mean, median, minimum, maximum and number of missing values for every column in the DataFrame and, depending on the DataFrames version and the statistics requested, the number of unique values.
julia
# Describe the data
describe(df)
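If you only care about a few of these metrics, describe also accepts the names of the statistics as extra arguments. The sketch below assumes a recent DataFrames release and uses a made-up four-row table.

```julia
using DataFrames

# Hypothetical numeric data, only for illustration
df = DataFrame(age = [19, 18, 28, 33], bmi = [27.9, 33.8, 33.0, 22.7])

# Request only the statistics of interest by passing their names as symbols
stats = describe(df, :mean, :min, :max, :nmissing)

println(stats)
```

The result is itself a DataFrame with one row per column of df, so you can index and filter it like any other table.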

5. Indexing and filtering the DataFrame (multiple methods)

Julia follows “1-based indexing”, i.e. the first element has index 1. R is also a 1-based indexing language, while Python is 0-based.

Selecting specific rows

Rows and columns in a DataFrame can be indexed by enclosing the index numbers inside [] square brackets: rows first, then columns.
julia
# Select specific rows in the data
df[1:5,:]

Selecting specific rows and columns

When you want to select all the rows or all the columns, use a : colon. That is how I selected all the columns in the example above. Julia supports name-based indexing as well, i.e. you can specify columns by name, written as symbols with a leading : colon.
julia
# Select specific rows and columns by column names in the data
df[1:5,[:age,:sex]]
julia
# Select only one column, as a dataframe
df[1:5, [:age]]
You must have noticed that I wrapped the column name in another pair of [] square brackets. This is required if you want to get a DataFrame back. If you wish to get a vector instead, remove the inner [] square brackets. Let’s see it with an example.
julia
# Select only one column, as a vector
df[1:5,:age]
#> Returns a vector
You can use the select function as well to subset the columns of the data.
julia
# Select columns using select() function
first(select(df, :age),5)
julia
# Select all the column except "age" column
first(select(df, Not(:age)),5)
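select understands a few more column selectors than plain names. The sketch below shows some that I find useful, on a made-up two-row table; the new column name age_years is just an example.

```julia
using DataFrames

# Hypothetical data shaped like the first columns of the insurance table
df = DataFrame(age = [19, 18], sex = ["female", "male"], bmi = [27.9, 33.8])

keep_two   = select(df, :age, :bmi)          # keep columns by name
drop_one   = select(df, Not(:sex))           # keep everything except one column
by_pattern = select(df, r"^b")               # columns whose name matches a regex
renamed    = select(df, :age => :age_years)  # select and rename in one step

println(names(renamed))
```

select returns a new DataFrame and leaves df untouched; select! is the in-place variant.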
You can subset the data based on values as well. The comparison has to be applied element by element, so put a . (dot) before the less than / greater than / equal to operator to broadcast it over the column. Let’s see it with an example.
julia
# Filter based on values. Select all the rows where age is greater than 20.
first(df[df.age .> 20,:],5)
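The same row filter can be written in a few equivalent ways. Here is a sketch, assuming DataFrames 1.x (subset is not available in older releases), on a made-up five-row table.

```julia
using DataFrames

# Hypothetical data, only for illustration
df = DataFrame(age = [19, 18, 28, 33, 20],
               sex = ["female", "male", "male", "male", "female"])

# Broadcasted comparison builds a Boolean mask, one entry per row
mask = df.age .> 20
by_mask = df[mask, :]

# filter applies the predicate to each value of the :age column
by_filter = filter(:age => >(20), df)

# subset expects a whole-column function, so ByRow wraps the scalar predicate
by_subset = subset(df, :age => ByRow(>(20)))

println(by_mask)
```

All three return a new DataFrame with the rows where age is greater than 20, so which one you pick is mostly a matter of taste.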

6. Grouping and Summarizing Data

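You can summarize the data using the combine function, which replaces the aggregate function found in older DataFrames versions. You need to specify the aggregating operation you want to perform on each column. Combined with groupby, the same function summarizes each group separately. Let’s see it with examples. I am going to get the sum and product of all the values in the columns.
julia
# Subset the numeric columns
num_df = df[:,[:age,:bmi,:children,:expenses]]

# Apply sum to every column to find the sum of all the column values
println(combine(num_df, names(num_df) .=> sum), "\n\n")

# Apply prod to every column to find the product of all the column values
# (the product of a long integer column can overflow Int64)
println(combine(num_df, names(num_df) .=> prod))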
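To summarize per group rather than over the whole table, one common pattern is groupby followed by combine. The sketch below assumes DataFrames 1.x; the four rows and the output column names (mean_age, total_expenses, count) are made up for the example.

```julia
using DataFrames
using Statistics  # provides mean

# Hypothetical data, only for illustration
df = DataFrame(sex = ["female", "male", "male", "female"],
               age = [20, 30, 40, 40],
               expenses = [100.0, 200.0, 400.0, 300.0])

# Split the rows into one group per distinct value of :sex
grouped = groupby(df, :sex)

# One output row per group: source column => function => new column name
summary_df = combine(grouped,
                     :age => mean => :mean_age,
                     :expenses => sum => :total_expenses,
                     nrow => :count)

println(summary_df)
```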
You can get a basic summary of any of the columns using the describe function, which I explained earlier.
julia
# summarize data
describe(df[!,[:age]])
! can also be used in place of : when you wish to index all the rows. The difference is that : copies the selected columns, while ! returns them without copying.
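The difference between ! and : shows up as soon as you modify what you got back. A minimal sketch on a made-up column:

```julia
using DataFrames

df = DataFrame(age = [19, 18, 28])

copied = df[:, :age]   # a fresh copy of the column
shared = df[!, :age]   # the very vector stored inside df

copied[1] = 0          # changes only the copy
shared[2] = 99         # changes df as well

println(df.age)
```

So : is the safe choice when you plan to modify the result, and ! avoids the copy when you only read from it.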

7. Join Data

Using the join functions, you can join two DataFrames into a single DataFrame based on a common column. Each type of join has its own function: innerjoin, leftjoin, rightjoin and outerjoin. Older DataFrames versions used a single join function with a kind argument, which has been removed.
julia
# Define 2 datasets
employee_df = DataFrame(emp_code = ["N_1", "N_2", "N_3", "N_4"], 
                        emp_name = ["Kabir", "Aryan", "Khushi", "Mehak"])

designation_df = DataFrame(emp_code = ["N_1", "N_2", "N_3"], 
                           emp_designation = ["Data Scientist", "Data Engineer", "VP"])

# Create a new DataFrame by joining the two DataFrames, the key being "emp_code"
innerjoin(employee_df, designation_df, on = :emp_code)
An inner join keeps only the rows whose key appears in both DataFrames. Let’s see how to perform a left join, which keeps every row of the first DataFrame.
julia
# left join
leftjoin(employee_df, designation_df, on = :emp_code)
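In DataFrames 1.x each join type is a separate function, and comparing their row counts is a quick way to see what each one keeps. In the sketch below the code N_5 and its designation are invented so that both tables contain keys the other one lacks.

```julia
using DataFrames

employee_df = DataFrame(emp_code = ["N_1", "N_2", "N_3", "N_4"],
                        emp_name = ["Kabir", "Aryan", "Khushi", "Mehak"])

# Hypothetical designations: "N_5" has no matching employee
designation_df = DataFrame(emp_code = ["N_1", "N_2", "N_5"],
                           emp_designation = ["Data Scientist", "Data Engineer", "VP"])

inner = innerjoin(employee_df, designation_df, on = :emp_code)  # keys present in both
left  = leftjoin(employee_df, designation_df, on = :emp_code)   # all employees
right = rightjoin(employee_df, designation_df, on = :emp_code)  # all designations
outer = outerjoin(employee_df, designation_df, on = :emp_code)  # keys from either side
anti  = antijoin(employee_df, designation_df, on = :emp_code)   # employees without a match

println(nrow(inner), " ", nrow(left), " ", nrow(right), " ", nrow(outer), " ", nrow(anti))
```

Rows without a match get missing in the columns that come from the other table.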

8. Export DataFrames

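The CSV.write function from the CSV package is used to export the data. The writetable function of older DataFrames versions has been removed.
julia
using CSV
CSV.write("output.csv", df)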
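A simple way to check an export is to write the file and read it back. The sketch below assumes the CSV package is installed and writes to a temporary directory, so nothing in your working folder is touched; the table contents are made up.

```julia
using CSV, DataFrames

# Hypothetical data, only for illustration
df = DataFrame(A = 1:3, B = ["x", "y", "z"])

# Write to a temporary location instead of the working directory
path = joinpath(mktempdir(), "output.csv")
CSV.write(path, df)

# Read the file back to confirm that the values survived the round trip
roundtrip = CSV.read(path, DataFrame)

println(roundtrip)
```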

Conclusion

So, now you should have a fair idea of how to handle DataFrames in Julia. Next, I will see you with more Data Science oriented topics in Julia.