Pandas is an open-source, BSD-licensed Python library which provides the Python programming language with high-performance, easy-to-use data structures and data analysis tools.
Python with Pandas is used in a wide variety of academic and commercial domains, including banking, economics, mathematics, and analytics.
In this tutorial we will explore some of the functionality of Pandas and see how to use it in practice.
Pandas read_csv()
Let’s say we have a CSV file “iris.csv” with the following content:
150,4,setosa,versicolor,virginica
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
4.7,3.2,1.3,0.2,0
4.6,3.1,1.5,0.2,0
5.0,3.6,1.4,0.2,0
5.4,3.9,1.7,0.4,0
4.6,3.4,1.4,0.3,0
5.0,3.4,1.5,0.2,0
4.4,2.9,1.4,0.2,0
4.9,3.1,1.5,0.1,0
5.4,3.7,1.5,0.2,0
4.8,3.4,1.6,0.2,0
4.8,3.0,1.4,0.1,0
4.3,3.0,1.1,0.1,0
5.8,4.0,1.2,0.2,0
5.7,4.4,1.5,0.4,0
5.4,3.9,1.3,0.4,0
5.1,3.5,1.4,0.3,0
I found it in the sklearn dataset folder. If you have sklearn, you probably have this file too.
Let’s see how to read it into a DataFrame using the Pandas read_csv() function.
import pandas
df = pandas.read_csv('iris.csv')
print(df)
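If iris.csv is not at hand, the same call can be tried against an in-memory string. The sketch below is only an illustration: csv_text is a hypothetical stand-in holding the first few lines shown above, and io.StringIO is used so that read_csv() has a file-like object to read from.

```python
import io
import pandas

# Hypothetical in-memory stand-in for the first lines of iris.csv.
csv_text = """150,4,setosa,versicolor,virginica
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
4.7,3.2,1.3,0.2,0
"""

# read_csv() accepts a file-like object as well as a path.
df = pandas.read_csv(io.StringIO(csv_text))

print(df.shape)          # (rows, columns); the header line is not counted as a row
print(list(df.columns))  # column names are taken from the first line
print(df.dtypes)         # a type is inferred for each column
```

Note that the first line of the file is used for the column names, so the columns end up being called '150', '4', 'setosa' and so on.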
Specifying Delimiter with Pandas read_csv() function
The default delimiter of a CSV file is a comma, but read_csv() can handle other separators too. Let’s assume our CSV file uses / as the delimiter:
150/4/setosa/versicolor/virginica
5.1/3.5/1.4/0.2/0
4.9/3.0/1.4/0.2/0
4.7/3.2/1.3/0.2/0
4.6/3.1/1.5/0.2/0
5.0/3.6/1.4/0.2/0
5.4/3.9/1.7/0.4/0
4.6/3.4/1.4/0.3/0
5.0/3.4/1.5/0.2/0
4.4/2.9/1.4/0.2/0
4.9/3.1/1.5/0.1/0
5.4/3.7/1.5/0.2/0
4.8/3.4/1.6/0.2/0
4.8/3.0/1.4/0.1/0
4.3/3.0/1.1/0.1/0
5.8/4.0/1.2/0.2/0
5.7/4.4/1.5/0.4/0
We can specify the delimiter using the sep argument:
import pandas
df = pandas.read_csv('iris.csv', sep="/")
print(df)
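To see what the sep argument changes, it can help to parse the same text with and without it. This is a small sketch on a hypothetical in-memory sample (slash_text), not on the real file:

```python
import io
import pandas

# Hypothetical sample using "/" as the field separator.
slash_text = """150/4/setosa/versicolor/virginica
5.1/3.5/1.4/0.2/0
4.9/3.0/1.4/0.2/0
"""

# With the default comma separator, each line is read as one single field.
wrong = pandas.read_csv(io.StringIO(slash_text))

# With sep="/", each line is split into five fields.
right = pandas.read_csv(io.StringIO(slash_text), sep="/")

print(wrong.shape)
print(right.shape)
```

A DataFrame with a single column is often the first hint that the separator does not match the file.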
Reading specific Columns from the CSV File
To read only particular columns from a CSV file, we can pass the usecols parameter.
This is very useful when the CSV file has many columns but we are only interested in a couple of them.
import pandas
df = pandas.read_csv('iris.csv', usecols=['versicolor', 'virginica'])
print(df)
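One detail worth knowing is that usecols selects columns but does not reorder them: the result keeps the order found in the file. The following sketch shows this on a hypothetical in-memory sample (csv_text):

```python
import io
import pandas

# Hypothetical in-memory stand-in for the first lines of iris.csv.
csv_text = """150,4,setosa,versicolor,virginica
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
"""

# The names are given in a different order than in the file.
df = pandas.read_csv(io.StringIO(csv_text), usecols=['virginica', 'setosa'])

# The columns come back in file order, not in the order of the list.
print(list(df.columns))

# To get a specific order, reorder after reading.
reordered = df[['virginica', 'setosa']]
print(list(reordered.columns))
```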
Reading CSV File without Header
A header row is not compulsory in a CSV file. If the file doesn’t have one, we can still read it by passing header=None to the read_csv() function.
import pandas
df = pandas.read_csv('iris.csv', header=None)
print(df)
The column headers get auto-assigned as integers from 0 to N-1, where N is the number of columns. We can then pass these integers in the usecols parameter to read specific columns.
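As a rough illustration of combining header=None with integer positions in usecols, here is a sketch on a hypothetical headerless sample (no_header_text):

```python
import io
import pandas

# Hypothetical sample without a header line.
no_header_text = """5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
4.7,3.2,1.3,0.2,0
"""

# header=None: every line is data and the columns are labelled 0, 1, 2, ...
full = pandas.read_csv(io.StringIO(no_header_text), header=None)
print(list(full.columns))

# Integer positions in usecols pick the first and the last column.
subset = pandas.read_csv(io.StringIO(no_header_text), header=None, usecols=[0, 4])
print(subset)
```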
Skipping CSV Rows
To skip rows of the CSV file, we can use the skiprows parameter. Let’s assume we want to skip the third and fourth lines of our original CSV file. Line numbers are 0-indexed and the header counts as line 0, so we pass [2, 3]:
import pandas
df = pandas.read_csv('iris.csv', skiprows=[2, 3])
print(df)
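Because the header occupies line 0, it is easy to be off by one with skiprows. The sketch below uses a hypothetical in-memory sample (csv_text) to show which data rows survive:

```python
import io
import pandas

# Hypothetical in-memory stand-in for the first lines of iris.csv.
# File line 0 is the header; lines 1 to 5 are data.
csv_text = """150,4,setosa,versicolor,virginica
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
4.7,3.2,1.3,0.2,0
4.6,3.1,1.5,0.2,0
5.0,3.6,1.4,0.2,0
"""

# Skip file lines 2 and 3, i.e. the second and third data rows.
df = pandas.read_csv(io.StringIO(csv_text), skiprows=[2, 3])

# The first column now holds the values from file lines 1, 4 and 5.
print(df['150'].tolist())
```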
Pandas concat
The pandas concat() function is used to concatenate pandas objects such as DataFrame and Series.
We can pass several parameters to change how the concatenation behaves.
The concat() method syntax is:
concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)
The arguments are:
- objs: a sequence of pandas objects to concatenate.
- axis: the axis to concatenate along. 0 (the default) stacks the objects as rows, 1 places them side by side as columns.
- join: an optional parameter that determines how the indexes on the other axis are handled. The valid values are ‘inner’ and ‘outer’.
- join_axes: deprecated in version 0.25.0 and removed in version 1.0.
- ignore_index: if True, the indexes of the source objects are ignored and the output is given a fresh index 0, 1, 2, ..., n-1.
- keys: a sequence of identifiers to add as an outer level of the resulting index. It is useful for marking the source objects in the output.
- levels: a sequence defining the specific levels to use when constructing the multi-index.
- names: names for the levels of the resulting hierarchical index.
- verify_integrity: check whether the new concatenated axis contains duplicates. This can be an expensive operation.
- sort: sort the non-concatenation axis if it is not already aligned when join is ‘outer’. It was added in version 0.23.0.
- copy: if False, do not copy data unnecessarily.
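A couple of these parameters are easier to understand with a small experiment. The sketch below is only an illustration with made-up frames (a hypothetical "Role" column and a deliberately repeated index label 2); it compares join='outer' with join='inner' and shows verify_integrity rejecting a duplicate index.

```python
import pandas

df1 = pandas.DataFrame({"Name": ["Arka", "Jake"], "ID": [1, 2]}, index=[1, 2])
# Hypothetical second frame: shares only the "Name" column and reuses index 2.
df2 = pandas.DataFrame({"Name": ["Roger"], "Role": ["Admin"]}, index=[2])

# join="outer" keeps the union of the columns; missing cells become NaN.
outer = pandas.concat([df1, df2], join="outer")
print(outer)

# join="inner" keeps only the columns present in every object.
inner = pandas.concat([df1, df2], join="inner")
print(inner)

# verify_integrity=True raises ValueError because index label 2 occurs twice.
try:
    pandas.concat([df1, df2], verify_integrity=True)
    duplicate_found = False
except ValueError as err:
    duplicate_found = True
    print("Rejected:", err)
```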
Concat along rows
Let’s look at a simple example to concatenate two DataFrame objects.
import pandas
d1 = {"Name": ["Arka", "Jake"], "ID": [1, 2]}
d2 = {"Name": "Roger", "ID": 3}
# Use lists for the index: a set is unordered and newer pandas versions reject it.
df1 = pandas.DataFrame(d1, index=[1, 2])
df2 = pandas.DataFrame(d2, index=[3])
print('***\n', df1)
print('***\n', df2)
df3 = pandas.concat([df1, df2])
print('***\n', df3)
Output:
***
    Name  ID
1  Arka   1
2  Jake   2
***
     Name  ID
3  Roger   3
***
     Name  ID
1   Arka   1
2   Jake   2
3  Roger   3
Note that the concatenation is done row-wise by default, i.e. along axis 0. The indexes of the source DataFrame objects are also retained in the output.
Concat along column
import pandas
d1 = {"Name": ["Arka", "Jake"], "ID": [1, 2]}
d2 = {"Role": ["Admin", "Editor"]}
# Both frames share the index labels 1 and 2, so the rows line up.
df1 = pandas.DataFrame(d1, index=[1, 2])
df2 = pandas.DataFrame(d2, index=[1, 2])
df3 = pandas.concat([df1, df2], axis=1)
print('********\n', df3)
Output:
********
    Name  ID    Role
1  Arka   1   Admin
2  Jake   2  Editor
Concatenating along the columns makes sense when the source objects hold different pieces of data about the same entities, i.e. when they share the same index.
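When the indexes only partly overlap, column-wise concatenation aligns rows by index label and fills the gaps. The following sketch uses made-up frames (df2 deliberately starts at label 2) to show the default outer join next to join='inner':

```python
import pandas

df1 = pandas.DataFrame({"Name": ["Arka", "Jake"], "ID": [1, 2]}, index=[1, 2])
# Hypothetical frame whose index only partly overlaps with df1.
df2 = pandas.DataFrame({"Role": ["Admin", "Editor"]}, index=[2, 3])

# Default join="outer": all index labels are kept, gaps are filled with NaN.
wide = pandas.concat([df1, df2], axis=1)
print(wide)

# join="inner": only the labels present in both frames are kept.
common = pandas.concat([df1, df2], axis=1, join="inner")
print(common)
```

In the first result, label 1 has no Role and label 3 has no Name or ID; in the second, only label 2 remains.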
Assign Keys to Concat DF Index
import pandas
d1 = {"Name": ["Arka", "Jake"], "ID": [1, 2]}
d2 = {"Name": "Roger", "ID": 3}
df1 = pandas.DataFrame(d1, index=[1, 2])
df2 = pandas.DataFrame(d2, index=[3])
# Each key becomes the outer index level for the rows of its source frame.
df3 = pandas.concat([df1, df2], keys=["DF1", "DF2"])
print('********\n', df3)
Output:
********
         Name  ID
DF1 1   Arka   1
    2   Jake   2
DF2 3  Roger   3
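The keys end up as the outer level of a multi-index, which makes it possible to pull the rows of one source back out later. Here is a minimal sketch of that idea; the level names "Source" and "Row" are made up for the illustration and are passed through the names parameter.

```python
import pandas

df1 = pandas.DataFrame({"Name": ["Arka", "Jake"], "ID": [1, 2]}, index=[1, 2])
df2 = pandas.DataFrame({"Name": ["Roger"], "ID": [3]}, index=[3])

# "Source" and "Row" are hypothetical names for the two index levels.
df3 = pandas.concat([df1, df2], keys=["DF1", "DF2"], names=["Source", "Row"])
print(df3)

# Selecting on the outer level returns the rows that came from one source.
from_df2 = df3.loc["DF2"]
print(from_df2)
```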
Ignore Source DF Indexes in Concat
import pandas
d1 = {"Name": ["Arka", "Jake"], "ID": [1, 2]}
d2 = {"Name": "Roger", "ID": 3}
df1 = pandas.DataFrame(d1, index=[10, 20])
df2 = pandas.DataFrame(d2, index=[30])
# ignore_index=True drops the labels 10, 20, 30 and renumbers the rows from 0.
df3 = pandas.concat([df1, df2], ignore_index=True)
print('********\n', df3)
Output:
********
     Name  ID
0   Arka   1
1   Jake   2
2  Roger   3
This is useful when the indexes of the source objects carry no meaning. We can then ignore them and let the output DataFrame use the default index.
Ending Note
If you liked reading this article and want to read more, continue to follow codegigs. Stay tuned for many such interesting articles in the coming few days!
Happy learning! 🙂