What are pandas DataFrames?
Before you start, let’s have a brief recap of what DataFrames are.
Those who are familiar with R know the data frame as a way to store data in rectangular grids that are easy to inspect. Each row of these grids corresponds to the measurements or values of one instance, while each column is a vector containing data for one variable. This means that the values within a single row can share a type but do not need to: they can be numeric, character, logical, and so on.
Now, DataFrames in Python are very similar: they come with the pandas library, and they are defined as two-dimensional labeled data structures with columns of potentially different types.
In general, you could say that the pandas DataFrame consists of three main components: the data, the index, and the columns.
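To make the three components concrete, here is a minimal sketch. The two-row frame and the variable names (small, values, index_labels, column_labels) are made up for illustration and are not part of the tutorial's candidate data.

```python
import pandas as pd

# Hypothetical two-row frame, used only to illustrate the three components.
small = pd.DataFrame(
    data={'name': ['Xavier', 'Ann'], 'age': [41, 28]},
    index=[101, 102],
)

values = small.to_numpy()               # the data, as a two-dimensional NumPy array
index_labels = small.index.tolist()     # the index (row labels)
column_labels = small.columns.tolist()  # the columns (column labels)

print(values.shape)    # (2, 2)
print(index_labels)    # [101, 102]
print(column_labels)   # ['name', 'age']
```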
There are several ways to create a pandas DataFrame. In most cases, you’ll use the DataFrame constructor and provide the data, labels, and other information. You can pass the data as a two-dimensional list, tuple, or NumPy array. You can also pass it as a dictionary or pandas Series instance, or as one of several other data types not covered in this tutorial.
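As a brief aside before the main example, the sketch below shows the constructor accepting a two-dimensional list and a two-dimensional NumPy array. The values and the column labels 'test-1' and 'test-2' are invented for illustration.

```python
import numpy as np
import pandas as pd

# A two-dimensional list: each inner list becomes one row.
rows = [['Xavier', 41], ['Ann', 28]]
from_list = pd.DataFrame(rows, columns=['name', 'age'], index=[101, 102])

# A two-dimensional NumPy array; with no index given, pandas uses 0, 1, ...
scores = np.array([[88.0, 79.0], [81.0, 80.0]])
from_array = pd.DataFrame(scores, columns=['test-1', 'test-2'])

print(from_list.shape)            # (2, 2)
print(from_array.index.tolist())  # [0, 1]
```

When you pass a list or an array, the column labels have to be supplied separately; a dictionary carries them as its keys.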
For this example, assume you’re using a dictionary to pass the data:
data = {
'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
'Manchester', 'Cairo', 'Osaka'],
'age': [41, 28, 33, 34, 38, 31, 37],
'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
}
row_labels = [101, 102, 103, 104, 105, 106, 107]
data is a Python variable that refers to the dictionary that holds your candidate data. It also contains the labels of the columns:
'name'
'city'
'age'
'py-score'
Finally, row_labels refers to a list that contains the labels of the rows, which are numbers ranging from 101 to 107.
Now you’re ready to create a pandas DataFrame:
import pandas as pd
df = pd.DataFrame(data=data, index=row_labels)
df
output
name city age py-score
101 Xavier Mexico City 41 88.0
102 Ann Toronto 28 79.0
103 Jana Prague 33 81.0
104 Yi Shanghai 34 80.0
105 Robin Manchester 38 68.0
106 Amal Cairo 31 61.0
107 Nori Osaka 37 84.0
That’s it! df is a variable that holds the reference to your pandas DataFrame. It has the following features:
Row labels from 101 to 107
Column labels such as 'name', 'city', 'age', and 'py-score'
Data such as candidate names, cities, ages, and Python test scores
This figure shows the labels and data from df:
The row labels are outlined in blue, whereas the column labels are outlined in red, and the data values are outlined in purple.
pandas DataFrames can sometimes be very large, making it impractical to look at all the rows at once. You can use .head() to show the first few rows and .tail() to show the last few rows:
df.head(n=2)
output
name city age py-score
101 Xavier Mexico City 41 88.0
102 Ann Toronto 28 79.0
df.tail(n=2)
output
name city age py-score
106 Amal Cairo 31 61.0
107 Nori Osaka 37 84.0
That’s how you can show just the beginning or end of a pandas DataFrame. The parameter n specifies the number of rows to show.
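If you leave n out, both methods fall back to a default of five rows. The sketch below uses a small stand-in frame with the same row labels; the name frame is only illustrative.

```python
import pandas as pd

# Stand-in frame with seven rows, labeled 101 through 107.
frame = pd.DataFrame(
    {'age': [41, 28, 33, 34, 38, 31, 37]},
    index=[101, 102, 103, 104, 105, 106, 107],
)

first = frame.head()   # n defaults to 5
last = frame.tail(3)   # n can also be passed positionally

print(first.index.tolist())  # [101, 102, 103, 104, 105]
print(last.index.tolist())   # [105, 106, 107]
```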
Note: It may be helpful to think of the pandas DataFrame as a dictionary of columns, or pandas Series, with many additional features.
You can access a column in a pandas DataFrame the same way you would get a value from a dictionary:
cities = df['city']
cities
output
101 Mexico City
102 Toronto
103 Prague
104 Shanghai
105 Manchester
106 Cairo
107 Osaka
Name: city, dtype: object
This is the most convenient way to get a column from a pandas DataFrame.
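The dictionary analogy goes a little further than bracket access. As a rough sketch on a made-up two-row frame, membership tests, .keys(), and .items() all work on the column labels:

```python
import pandas as pd

# Hypothetical frame used only to show the dictionary-like behavior.
frame = pd.DataFrame(
    {'name': ['Xavier', 'Ann'], 'city': ['Mexico City', 'Toronto']},
    index=[101, 102],
)

has_city = 'city' in frame      # membership checks column labels, not values
labels = frame.keys().tolist()  # .keys() returns the column labels

# .items() yields (column label, Series) pairs, like dict.items().
column_types = {label: type(column).__name__ for label, column in frame.items()}

print(has_city)      # True
print(labels)        # ['name', 'city']
print(column_types)  # {'name': 'Series', 'city': 'Series'}
```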
If the name of the column is a string that is a valid Python identifier, then you can use dot notation to access it. That is, you can access the column the same way you would get the attribute of a class instance:
df.city
output
101 Mexico City
102 Toronto
103 Prague
104 Shanghai
105 Manchester
106 Cairo
107 Osaka
Name: city, dtype: object
That’s how you get a particular column. You’ve extracted the column that corresponds with the label 'city', which contains the locations of all your job candidates.
It’s important to notice that you’ve extracted both the data and the corresponding row labels.
Each column of a pandas DataFrame is an instance of pandas.Series, a structure that holds one-dimensional data and their labels. You can get a single item of a Series object the same way you would with a dictionary, by using its label as a key:
cities[102]
output
'Toronto'
In this case, 'Toronto' is the data value and 102 is the corresponding label. As you’ll see in a later section, there are other ways to get a particular item in a pandas DataFrame.
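Two of those other ways are the .loc[] and .iloc[] accessors, which also work on a Series. Here is a minimal sketch on a shortened stand-in Series (sample_cities is an illustrative name) that makes the label-versus-position distinction explicit:

```python
import pandas as pd

# Shortened stand-in for the 'city' column, with integer row labels.
sample_cities = pd.Series(
    ['Mexico City', 'Toronto', 'Prague'],
    index=[101, 102, 103],
    name='city',
)

by_label = sample_cities.loc[102]    # look up by row label
by_position = sample_cities.iloc[1]  # look up by zero-based position

print(by_label)     # Toronto
print(by_position)  # Toronto
```

With integer row labels, spelling out .loc[] or .iloc[] removes any doubt about whether a number is meant as a label or as a position.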
You can also access a whole row with the accessor .loc[]:
df.loc[103]
output
name Jana
city Prague
age 33
py-score 81.0
Name: 103, dtype: object
This time, you’ve extracted the row that corresponds to the label 103, which contains the data for the candidate named Jana. In addition to the data values from this row, you’ve extracted the labels of the corresponding columns.
The returned row is also an instance of pandas.Series.
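You can check this yourself. In the sketch below, which uses a two-row stand-in frame, the row comes back as a Series whose labels are the column labels of the frame and whose name is the row label:

```python
import pandas as pd

# Stand-in frame with two of the candidate rows.
frame = pd.DataFrame(
    {'name': ['Jana', 'Yi'], 'age': [33, 34]},
    index=[103, 104],
)

row = frame.loc[103]

print(type(row).__name__)  # Series
print(row.index.tolist())  # ['name', 'age']
print(row.name)            # 103
print(row['name'])         # Jana
```

Note that row.name is the Series attribute holding the row label, while row['name'] looks up the value stored under the column label 'name'.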