top of page

Pandas Tutorial: DataFrames in Python

Updated: May 26, 2023

What are pandas DataFrames?


Before you start, let’s have a brief recap of what DataFrames are.

Those who are familiar with R know the data frame as a way to store data in rectangular grids that can easily be overviewed. Each row of these grids corresponds to measurements or values of an instance, while each column is a vector containing data for a specific variable. This means that a data frame’s rows do not need to contain, but can contain, the same type of values: they can be numeric, character, logical, etc.

Now, DataFrames in Python are very similar: they come with the pandas library, and they are defined as two-dimensional labeled data structures with columns of potentially different types.

In general, you could say that the pandas DataFrame consists of three main components: the data, the index, and the columns.


There are several ways to create a pandas DataFrame. In most cases, you’ll use the DataFrame constructor and provide the data, labels, and other information. You can pass the data as a two-dimensional list, tuple, or NumPy array. You can also pass it as a dictionary or pandas Series instance, or as one of several other data types not covered in this tutorial.

For this example, assume you’re using a dictionary to pass the data:

data = {
    'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
    'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
             'Manchester', 'Cairo', 'Osaka'],
    'age': [41, 28, 33, 34, 38, 31, 37],
    'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
}

row_labels = [101, 102, 103, 104, 105, 106, 107]

data is a Python variable that refers to the dictionary that holds your candidate data. It also contains the labels of the columns:

  • 'name'

  • 'city'

  • 'age'

  • 'py-score'

Finally, row_labels refers to a list that contains the labels of the rows, which are numbers ranging from 101 to 107.

Now you’re ready to create a pandas DataFrame:

 df = pd.DataFrame(data=data, index=row_labels)
 df
 output
       name         city  age  py-score
101  Xavier  Mexico City   41      88.0
102     Ann      Toronto   28      79.0
103    Jana       Prague   33      81.0
104      Yi     Shanghai   34      80.0
105   Robin   Manchester   38      68.0
106    Amal        Cairo   31      61.0
107    Nori        Osaka   37      84.0

That’s it! df is a variable that holds the reference to your pandas DataFrame. This pandas DataFrame looks just like the candidate table above and has the following features:

  • Row labels from 101 to 107

  • Column labels such as 'name', 'city', 'age', and 'py-score'

  • Data such as candidate names, cities, ages, and Python test scores

This figure shows the labels and data from df:


The row labels are outlined in blue, whereas the column labels are outlined in red, and the data values are outlined in purple.

pandas DataFrames can sometimes be very large, making it impractical to look at all the rows at once. You can use .head() to show the first few items and .tail() to show the last few items:

df.head(n=2)
output
       name         city  age  py-score
101  Xavier  Mexico City   41      88.0
102     Ann      Toronto   28      79.0

df.tail(n=2)
output
     name   city  age  py-score
106  Amal  Cairo   31      61.0
107  Nori  Osaka   37      84.0

That’s how you can show just the beginning or end of a pandas DataFrame. The parameter n specifies the number of rows to show.

Note: It may be helpful to think of the pandas DataFrame as a dictionary of columns, or pandas Series, with many additional features.

You can access a column in a pandas DataFrame the same way you would get a value from a dictionary:


cities = df['city']
cities
output
101    Mexico City
102        Toronto
103         Prague
104       Shanghai
105     Manchester
106          Cairo
107          Osaka
Name: city, dtype: object

This is the most convenient way to get a column from a pandas DataFrame.

If the name of the column is a string that is a valid Python identifier, then you can use dot notation to access it. That is, you can access the column the same way you would get the attribute of a class instance:

df.city
output
101    Mexico City
102        Toronto
103         Prague
104       Shanghai
105     Manchester
106          Cairo
107          Osaka
Name: city, dtype: object

That’s how you get a particular column. You’ve extracted the column that corresponds with the label 'city', which contains the locations of all your job candidates.

It’s important to notice that you’ve extracted both the data and the corresponding row labels:


Each column of a pandas DataFrame is an instance of pandas.Series, a structure that holds one-dimensional data and their labels. You can get a single item of a Series object the same way you would with a dictionary, by using its label as a key:

cities[102]
Output
'Toronto'

In this case, 'Toronto' is the data value and 102 is the corresponding label. As you’ll see in a later section, there are other ways to get a particular item in a pandas DataFrame.

You can also access a whole row with the accessor .loc[]:

df.loc[103]
output
name          Jana
city        Prague
age             33
py-score        81
Name: 103, dtype: object

This time, you’ve extracted the row that corresponds to the label 103, which contains the data for the candidate named Jana. In addition to the data values from this row, you’ve extracted the labels of the corresponding columns:


The returned row is also an instance of pandas.Series.



Related Posts

See All

Mastering Lists in Python

Introduction Lists are one of the most fundamental and versatile data structures in Python. They allow you to store and manipulate collections of items, making them an essential tool for any Python pr

Python Anonymous Function

In Python, an anonymous function is a function that is defined without a name. It is also known as a lambda function because it is created using the `lambda` keyword. Anonymous functions are typically

Comments


bottom of page