Python Pandas - reading files as dataframes

Pandas is an open-source, BSD-licensed Python library which provides the Python programming language with high-performance, easy-to-use data structures and data analysis tools.

Python with Pandas is used in a wide variety of academic and commercial fields, including banking, economics, mathematics, and analytics.

In this tutorial we will look at two Pandas features, reading CSV files with read_csv() and combining DataFrames with concat(), and at how to use them in practice.

Pandas read_csv()

Let’s say we have a CSV file “iris.csv” with the following content:

150,4,setosa,versicolor,virginica
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
4.7,3.2,1.3,0.2,0
4.6,3.1,1.5,0.2,0
5.0,3.6,1.4,0.2,0
5.4,3.9,1.7,0.4,0
4.6,3.4,1.4,0.3,0
5.0,3.4,1.5,0.2,0
4.4,2.9,1.4,0.2,0
4.9,3.1,1.5,0.1,0
5.4,3.7,1.5,0.2,0
4.8,3.4,1.6,0.2,0
4.8,3.0,1.4,0.1,0
4.3,3.0,1.1,0.1,0
5.8,4.0,1.2,0.2,0
5.7,4.4,1.5,0.4,0
5.4,3.9,1.3,0.4,0
5.1,3.5,1.4,0.3,0

I found it in the sklearn dataset folder. If you have sklearn, you probably have this file too.

Let’s see how to read it into a DataFrame using the Pandas read_csv() function.

import pandas
df = pandas.read_csv('iris.csv')
print(df)
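
To see what read_csv() does with the first line of the file, here is a small sketch. It assumes a shortened copy of the data held in a string and read through io.StringIO, so it runs without iris.csv on disk; the name csv_text is only for illustration.

```python
import io
import pandas

# A shortened, in-memory stand-in for iris.csv: the first line plus three data rows.
csv_text = """150,4,setosa,versicolor,virginica
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
4.7,3.2,1.3,0.2,0
"""

df = pandas.read_csv(io.StringIO(csv_text))

# By default the first line is used as the column names, not as data.
print(list(df.columns))  # ['150', '4', 'setosa', 'versicolor', 'virginica']
print(df.shape)          # (3, 5)
```

Note that the column names are strings, so the first column is selected with df['150'], not df[150].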

Specifying Delimiter with Pandas read_csv() function

By default, read_csv() expects the values in a CSV file to be separated by commas, but a file can use another separator. Let’s assume the delimiter in our CSV file is / instead:

150/4/setosa/versicolor/virginica
5.1/3.5/1.4/0.2/0
4.9/3.0/1.4/0.2/0
4.7/3.2/1.3/0.2/0
4.6/3.1/1.5/0.2/0
5.0/3.6/1.4/0.2/0
5.4/3.9/1.7/0.4/0
4.6/3.4/1.4/0.3/0
5.0/3.4/1.5/0.2/0
4.4/2.9/1.4/0.2/0
4.9/3.1/1.5/0.1/0
5.4/3.7/1.5/0.2/0
4.8/3.4/1.6/0.2/0
4.8/3.0/1.4/0.1/0
4.3/3.0/1.1/0.1/0
5.8/4.0/1.2/0.2/0
5.7/4.4/1.5/0.4/0

We can specify the delimiter using the sep argument:

import pandas
df = pandas.read_csv('iris.csv', sep="/")
print(df)
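
The effect of sep is easiest to see by reading the same text with and without it. The sketch below assumes a two-row, in-memory copy of the slash-delimited data (slash_text is an illustrative name).

```python
import io
import pandas

# In-memory stand-in for the slash-delimited file.
slash_text = """150/4/setosa/versicolor/virginica
5.1/3.5/1.4/0.2/0
4.9/3.0/1.4/0.2/0
"""

# Without sep, pandas looks for commas, finds none, and reads one wide column.
df_default = pandas.read_csv(io.StringIO(slash_text))
print(df_default.shape)  # (2, 1)

# With sep="/", each line is split into five columns.
df_slash = pandas.read_csv(io.StringIO(slash_text), sep="/")
print(df_slash.shape)  # (2, 5)
```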

Reading specific Columns from the CSV File

To read only particular columns from a CSV file, we can pass the usecols parameter.

This is very useful when the CSV file has many columns but we are only interested in a few of them.

import pandas
df = pandas.read_csv('iris.csv', usecols=['versicolor', 'virginica'])
print(df)
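
One detail worth knowing is that usecols selects columns but does not reorder them. The following sketch, again using an assumed in-memory copy of the data, shows that the result keeps the column order of the file.

```python
import io
import pandas

# In-memory stand-in for iris.csv.
csv_text = """150,4,setosa,versicolor,virginica
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
"""

# The names are listed in reverse order on purpose.
df = pandas.read_csv(io.StringIO(csv_text), usecols=["virginica", "versicolor"])

# The columns come back in file order, not in the order given to usecols.
print(list(df.columns))  # ['versicolor', 'virginica']

# To get a specific order, reindex the columns after reading.
reordered = df[["virginica", "versicolor"]]
print(list(reordered.columns))  # ['virginica', 'versicolor']
```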

Reading CSV File without Header

A CSV file does not have to contain a header row. If the file doesn’t have one, we can still read it by passing header=None to the read_csv() function.

import pandas
df = pandas.read_csv('iris.csv',header=None)
print(df)

The column headers are then auto-assigned as integers from 0 to N-1, where N is the number of columns. We can pass these integer labels in the usecols parameter to read specific columns.
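
As a minimal sketch of that combination, the example below reads a headerless, in-memory sample (headerless_text is an illustrative name) and then selects the first and last columns by position.

```python
import io
import pandas

# Two data rows with no header line.
headerless_text = """5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
"""

df = pandas.read_csv(io.StringIO(headerless_text), header=None)
print(list(df.columns))  # [0, 1, 2, 3, 4]
print(df.shape)          # (2, 5): the first line is kept as data

# Select the first and last columns by their integer positions.
subset = pandas.read_csv(io.StringIO(headerless_text), header=None, usecols=[0, 4])
print(list(subset.columns))  # [0, 4]
```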

Skipping CSV Rows

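To skip rows of the CSV file, we can use the skiprows parameter. Let’s assume we want to skip the 3rd and 4th lines of our original CSV file. The line numbers in the list are 0-indexed and the header line counts as line 0, so we pass [2, 3].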

import pandas
df = pandas.read_csv('iris.csv', skiprows=[2, 3])
print(df)
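
To check which lines are actually dropped, here is a small sketch on an assumed five-line, in-memory copy of the data. The comments number the lines the way skiprows counts them.

```python
import io
import pandas

csv_text = """150,4,setosa,versicolor,virginica
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
4.7,3.2,1.3,0.2,0
4.6,3.1,1.5,0.2,0
"""
# line 0: header, line 1: 5.1..., line 2: 4.9..., line 3: 4.7..., line 4: 4.6...

df = pandas.read_csv(io.StringIO(csv_text), skiprows=[2, 3])

# Lines 2 and 3 are gone; the header and lines 1 and 4 remain.
print(df["150"].tolist())  # [5.1, 4.6]
```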

Pandas concat

The pandas concat() function is used to concatenate pandas objects such as DataFrame and Series.

We can pass several parameters to change how the concatenation behaves.

The concat() method syntax is:

concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
           keys=None, levels=None, names=None, verify_integrity=False,
           sort=None, copy=True)

The arguments are:

objs: a list of pandas objects to concatenate.
axis: the axis to concatenate along, 0 for rows (the default) or 1 for columns.
join: an optional parameter that determines how the indexes on the other axis are handled. Valid values are ‘inner’ and ‘outer’.
join_axes: deprecated in version 0.25.0 and removed in version 1.0.0.
ignore_index: if True, the indexes of the source objects are ignored and the output is given a new index 0, 1, 2, ..., n-1.
keys: a list of identifiers to add as the outermost level of the index. It is useful for marking the source objects in the output.
levels: a sequence defining the specific levels to use when building the multi-index.
names: names for the levels in the resulting hierarchical index.
verify_integrity: check whether the new concatenated axis contains duplicates. This can be an expensive operation.
sort: sort the non-concatenation axis if it is not already aligned when join is ‘outer’. It was added in version 0.23.0.
copy: if False, do not copy data unnecessarily.
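
The join and verify_integrity arguments are easier to understand with a small experiment. The sketch below uses two hypothetical DataFrames (left and right) that share only the Name column and both carry the default index starting at 0. sort=False is passed explicitly so the column order does not depend on the pandas version.

```python
import pandas

left = pandas.DataFrame({"Name": ["Arka", "Jake"], "ID": [1, 2]})
right = pandas.DataFrame({"Name": ["Roger"], "Role": ["Admin"]})

# 'outer' keeps the union of the columns and fills the gaps with NaN.
outer = pandas.concat([left, right], join="outer", sort=False)
print(list(outer.columns))  # ['Name', 'ID', 'Role']

# 'inner' keeps only the columns present in every input.
inner = pandas.concat([left, right], join="inner")
print(list(inner.columns))  # ['Name']

# Both inputs contain the index label 0, so verify_integrity raises ValueError.
try:
    pandas.concat([left, right], verify_integrity=True)
    duplicate_found = False
except ValueError:
    duplicate_found = True
print(duplicate_found)  # True
```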

Concat along rows

Let’s look at a simple example to concatenate two DataFrame objects.

import pandas
d1 = {"Name": ["Arka", "Jake"], "ID": [1, 2]}
d2 = {"Name": "Roger", "ID": 3}

# Use lists for the index: sets are unordered, so the row order would not be guaranteed.
df1 = pandas.DataFrame(d1, index=[1, 2])
df2 = pandas.DataFrame(d2, index=[3])

print('***\n', df1)
print('***\n', df2)

df3 = pandas.concat([df1, df2])

print('***\n', df3)


Output:

***
    Name  ID
1  Arka   1
2  Jake   2
***
     Name  ID
3  Roger   3
***
     Name  ID
1   Arka   1
2   Jake   2
3  Roger   3


Note that by default the concatenation is done row-wise, i.e. along axis 0. The indexes of the source DataFrame objects are also retained in the output.

Concat along column

import pandas
d1 = {"Name": ["Arka", "Jake"], "ID": [1, 2]}
d2 = {"Role": ["Admin", "Editor"]}

# Both frames share the index labels 1 and 2, so their rows line up.
df1 = pandas.DataFrame(d1, index=[1, 2])
df2 = pandas.DataFrame(d2, index=[1, 2])

df3 = pandas.concat([df1, df2], axis=1)
print('********\n', df3)


Output:

********
    Name  ID    Role
1  Arka   1   Admin
2  Jake   2  Editor


Concatenating along the columns makes sense when the source objects hold different kinds of data about the same entities, such as names in one DataFrame and roles in another.
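
Column-wise concatenation matches rows by index label, not by position. The following sketch uses two hypothetical DataFrames whose indexes only partly overlap to show what happens to the unmatched rows.

```python
import pandas

df1 = pandas.DataFrame({"Name": ["Arka", "Jake"]}, index=[1, 2])
df2 = pandas.DataFrame({"Role": ["Admin", "Editor"]}, index=[2, 3])

# Default join='outer': every index label is kept, missing cells become NaN.
wide = pandas.concat([df1, df2], axis=1)
print(wide.index.tolist())           # [1, 2, 3]
print(int(wide.isna().sum().sum()))  # 2

# join='inner' keeps only the labels present in both inputs.
matched = pandas.concat([df1, df2], axis=1, join="inner")
print(matched.index.tolist())        # [2]
```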

Assign Keys to Concat DF Index

import pandas
d1 = {"Name": ["Arka", "Jake"], "ID": [1, 2]}
d2 = {"Name": "Roger", "ID": 3}

df1 = pandas.DataFrame(d1, index=[1, 2])
df2 = pandas.DataFrame(d2, index=[3])

# Each key labels the rows that came from the corresponding source DataFrame.
df3 = pandas.concat([df1, df2], keys=["DF1", "DF2"])
print('********\n', df3)


Output:

********
         Name  ID
DF1 1   Arka   1
    2   Jake   2
DF2 3  Roger   3
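
Once the keys are part of the index, they can be used to pull the rows of one source back out. Here is a minimal sketch that rebuilds the same data and selects by key with .loc (the variable from_df1 is only illustrative).

```python
import pandas

df1 = pandas.DataFrame({"Name": ["Arka", "Jake"], "ID": [1, 2]}, index=[1, 2])
df2 = pandas.DataFrame({"Name": ["Roger"], "ID": [3]}, index=[3])

df3 = pandas.concat([df1, df2], keys=["DF1", "DF2"])

# The result has a two-level index: (key, original index label).
print(df3.index.tolist())  # [('DF1', 1), ('DF1', 2), ('DF2', 3)]

# Selecting on the outer level returns the rows that came from df1.
from_df1 = df3.loc["DF1"]
print(from_df1["Name"].tolist())  # ['Arka', 'Jake']
```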

Ignore Source DF Indexes in Concat

import pandas
d1 = {"Name": ["Arka", "Jake"], "ID": [1, 2]}
d2 = {"Name": "Roger", "ID": 3}

df1 = pandas.DataFrame(d1, index=[10, 20])
df2 = pandas.DataFrame(d2, index=[30])

df3 = pandas.concat([df1, df2], ignore_index=True)
print('********\n', df3)


Output:

********
     Name  ID
0   Arka   1
1   Jake   2
2  Roger   3


This is useful when the indexes of the source objects carry no meaning. We can then ignore them and give the output DataFrame a default index.
