Analyse Your Data with Pandas

Here’s an introduction to Pandas, an open source software library written in Python for data manipulation and analysis. Pandas facilitates the manipulation of numerical tables and time series.

Data has become an increasingly important resource. With the Internet boom, large volumes of it are being generated every second, and companies need efficient ways of analysing it to stay ahead of the competition. Much of this data can be represented as a matrix using Python’s numerical package, NumPy.

The problem with NumPy is that it doesn’t have sufficient data analysis tools built into it. This is where Pandas comes in. It is a data analysis package, which is built to integrate with NumPy arrays. Pandas has a lot of functionality, but we will cover only a small portion of it in this article.

Getting started
Installing Pandas is a one-step process if you use Pip. Run the following command to install Pandas.

sudo pip install pandas

If you face any difficulties, visit http://pandas.pydata.org/pandas-docs/stable/install.html. You can now try importing Pandas into your Python environment by issuing the following command:

import pandas

In this tutorial, we will be using data from Weather Underground. The dataset for this article can be downloaded from http://www.synesthesiam.com/assets/weather_year.csv and can be imported into Pandas using:

data = pandas.read_csv('weather_year.csv')

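The read_csv function creates a dataframe, which is a tabular representation of the data read. In the Pandas release used for this article, printing the object gives a summary of the dataset; recent releases print the rows themselves and provide the summary through data.info(). The output of the print is as follows: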

data

<class 'pandas.core.frame.DataFrame'>
Int64Index: 366 entries, 0 to 365
Data columns:
EDT 366 non-null values
Max TemperatureF 366 non-null values
Mean TemperatureF 366 non-null values
Min TemperatureF 366 non-null values
Max Dew PointF 366 non-null values
MeanDew PointF 366 non-null values
Min DewpointF 366 non-null values
Max Humidity 366 non-null values
Mean Humidity 366 non-null values
Min Humidity 366 non-null values
Max Sea Level PressureIn 366 non-null values
Mean Sea Level PressureIn 366 non-null values
Min Sea Level PressureIn 366 non-null values
Max VisibilityMiles 366 non-null values
Mean VisibilityMiles 366 non-null values
Min VisibilityMiles 366 non-null values
Max Wind SpeedMPH 366 non-null values
Mean Wind SpeedMPH 366 non-null values
Max Gust SpeedMPH 365 non-null values
PrecipitationIn 366 non-null values
CloudCover 366 non-null values
Events 162 non-null values
WindDirDegrees 366 non-null values
dtypes: float64(4), int64(16), object(3)

As you can see, there are 366 entries in the given dataframe. You can get the column names using data.columns.
The output of the command is given below:

data.columns

Index([EDT, Max TemperatureF, Mean TemperatureF, Min TemperatureF, Max Dew PointF, MeanDew PointF, Min DewpointF, Max Humidity,  Mean Humidity,  Min Humidity,  Max Sea Level PressureIn,  Mean Sea Level PressureIn,  Min Sea Level PressureIn,  Max VisibilityMiles,  Mean VisibilityMiles,  Min VisibilityMiles,  Max Wind SpeedMPH,  Mean Wind SpeedMPH,  Max Gust SpeedMPH, PrecipitationIn,  CloudCover,  Events,  WindDirDegrees], dtype=object)
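The loading and inspection steps above can be tried without downloading anything. The following is a minimal sketch, assuming a small made-up CSV held in a string; the sample rows and the name sample are illustrative and are not taken from weather_year.csv. On recent Pandas releases, info() prints the kind of per-column summary shown earlier.

```python
import io

import pandas

# Hypothetical three-row sample standing in for weather_year.csv.
csv_text = """EDT,Max TemperatureF,Max Humidity,Events
2012-3-10,56,74,
2012-3-11,67,78,Rain
2012-3-12,71,90,Rain
"""

# read_csv accepts a file path or any file-like object.
sample = pandas.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple.
print(sample.shape)

# columns holds the column labels in file order.
print(list(sample.columns))

# info() prints the non-null count and dtype of every column.
sample.info()
```

The empty Events field in the first row is read as NaN, which is why that column reports fewer non-null values than the others.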

To print a particular column of the dataframe, you can simply index it as data['EDT'] for a single column or data[['EDT','Max Humidity']] for multiple columns. The output for data['EDT'] is:

data['EDT']

0      2012-3-10
1      2012-3-11
2      2012-3-12
3      2012-3-13
4      2012-3-14
5      2012-3-15
6      2012-3-16
...
361     2013-3-6
362     2013-3-7
363     2013-3-8
364     2013-3-9
365    2013-3-10
Name: EDT, Length: 366

And the output for data[['EDT','Max Humidity']] is:

data[['EDT','Max Humidity']]

<class 'pandas.core.frame.DataFrame'>
Int64Index: 366 entries, 0 to 365
Data columns:
EDT 366 non-null values
Max Humidity 366 non-null values
dtypes: int64(1), object(1)
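The difference between the two forms of indexing is that a single label returns a Series, while a list of labels returns a dataframe. Here is a short sketch, assuming a hypothetical three-row dataframe named sample with illustrative values:

```python
import pandas

# Hypothetical sample; the values are illustrative only.
sample = pandas.DataFrame({
    'EDT': ['2012-3-10', '2012-3-11', '2012-3-12'],
    'Max Humidity': [74, 78, 90],
    'Min Humidity': [26, 35, 40],
})

# A single label selects one column as a Series.
one_column = sample['EDT']

# A list of labels selects several columns as a new DataFrame.
two_columns = sample[['EDT', 'Max Humidity']]

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(list(two_columns.columns))
```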

Sometimes, it may be useful to only view a part of the data, just so that you can get a sense of what kind of data you are dealing with. Here you can use the head and tail functions to view the start and end of your dataframe:

data['Max Humidity'].head()

0    74
1    78
2    90
3    93
4    93
Name: Max Humidity

Note: The head and tail functions take a parameter that sets the number of rows to be displayed. They can be used as data['Max Humidity'].head(n), where ‘n’ is the number of rows. The default is 5.
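As a small illustration of the row-count parameter, here is a sketch on a hypothetical seven-element Series; the name humidity and the last value are assumptions made for the example.

```python
import pandas

# Hypothetical series; the values are illustrative only.
humidity = pandas.Series([74, 78, 90, 93, 93, 90, 84], name='Max Humidity')

# With no argument, head() returns the first five rows.
first_five = humidity.head()

# tail(n) returns the last n rows and keeps their original index labels.
last_two = humidity.tail(2)

print(first_five)
print(last_two)
```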

Working with columns
Now that we have a basis on which to work with our dataframe, we can explore various useful functions provided by Pandas, such as std to compute the standard deviation, mean to compute the average value and sum to compute the sum of all elements in a column. So if you want to compute these statistics for the Max Humidity column, for instance, you can use the following commands:

data['Max Humidity'].mean()

90.027322404371589

data['Max Humidity'].sum()

32950

data['Max Humidity'].std()

9.10843757197798

Note: Most of the Pandas functions ignore NaNs by default. These regularly occur in data, and a convenient way of handling them must be established. This topic is covered in more detail later in this article.

Rather than running these functions on individual columns, you can also run them on the entire dataframe, as follows:

data.mean()

Max TemperatureF 66.803279
Mean TemperatureF 55.683060
Min TemperatureF 44.101093
Max Dew PointF 49.549180
MeanDew PointF 44.057377
Min DewpointF 37.980874
Max Humidity 90.027322
Mean Humidity 67.860656
Min Humidity 45.193989
Max Sea Level PressureIn 30.108907
Mean Sea Level PressureIn 30.022705
Min Sea Level PressureIn 29.936831
Max VisibilityMiles 9.994536
Mean VisibilityMiles 8.732240
Min VisibilityMiles 5.797814
Max Wind SpeedMPH 16.418033
Mean Wind SpeedMPH 6.057377
Max Gust SpeedMPH 22.764384
CloudCover 2.885246
WindDirDegrees 189.704918
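The sketch below shows the same aggregations on a hypothetical four-row dataframe (the values are illustrative). It assumes a recent Pandas release, where a dataframe that contains text columns such as EDT needs numeric_only=True for a whole-frame mean; older releases skipped such columns silently.

```python
import pandas

# Hypothetical sample; None in a numeric column becomes NaN.
sample = pandas.DataFrame({
    'EDT': ['2012-3-10', '2012-3-11', '2012-3-12', '2012-3-13'],
    'Max Humidity': [74, 78, 90, 93],
    'Max Gust SpeedMPH': [17.0, 32.0, None, 24.0],
})

# Aggregations on a single column.
column_mean = sample['Max Humidity'].mean()
column_sum = sample['Max Humidity'].sum()
column_std = sample['Max Humidity'].std()

# NaN values are skipped by default, so this averages three values.
gust_mean = sample['Max Gust SpeedMPH'].mean()

# Whole-frame mean, restricted to the numeric columns.
all_means = sample.mean(numeric_only=True)

print(column_mean, column_sum, column_std)
print(gust_mean)
print(all_means)
```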

Using apply for bulk operations
As we have already seen, functions like mean, std and sum work on entire columns, but sometimes it may be useful to apply our own functions to entire columns of the dataframe. For this purpose, Pandas provides the apply function, which takes a function (here, an anonymous one) as a parameter and applies it to every element in the column. In this example, let us try to get the square of every element in a column. We can do this with the following code:

data['Max Humidity'].apply(lambda d: d**2)

0      5476
1      6084
2      8100
3      8649
4      8649
5      8100
...
361     8464
362     7225
363     7744
364     5625
365     2916
Name: Max Humidity, Length: 366

Note: In the lambda function, the parameter d is passed in by Pandas and holds each element of the column in turn.

Now you may wonder why you can’t just do this with a loop. Well, the answer is that this operation was written in one single line, which saves code writing time and is much easier to read.
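For simple arithmetic like squaring, Pandas also supports operators directly on a column, which gives the same values as apply. A minimal sketch comparing the two, assuming a hypothetical Series named humidity:

```python
import pandas

# Hypothetical series; the values are illustrative only.
humidity = pandas.Series([74, 78, 90, 93], name='Max Humidity')

# apply calls the lambda once for each element.
squared_apply = humidity.apply(lambda d: d ** 2)

# The ** operator works on the whole column at once.
squared_vector = humidity ** 2

print(list(squared_apply))
print(bool((squared_apply == squared_vector).all()))
```

apply remains the more general tool, since it accepts any function of one element, not just arithmetic.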

Dealing with NaN values
Pandas provides a function called isnull, which returns a ‘True’ or ‘False’ value depending on whether the value of an element in the column is NaN or None. These values are treated as missing values from the dataset, and so it is always convenient to deal with them separately. We can use the apply function to test every element in a column to see if any NaNs are present. Note that in this dataset the name of the Events column begins with a space (' Events'), so the space must be included when indexing it. You can use the following command:

e = data[' Events'].apply(lambda d: pandas.isnull(d))

e

0       True
1      False
2      False
3       True
4       True
5      False
...
361    False
362     True
363     True
364     True
365     True
Name:  Events, Length: 366

As you can see, a Series of Booleans was returned, with True marking the values that are NaN. There are two ways of dealing with the NaN values. First, you can choose to drop all rows with NaN values using the dropna function, in the following manner:

data.dropna(subset=[' Events'])

<class 'pandas.core.frame.DataFrame'>
Int64Index: 162 entries, 1 to 361
Data columns:
EDT                         162  non-null values
Max TemperatureF            162  non-null values
Mean TemperatureF           162  non-null values
Min TemperatureF            162  non-null values
Max Dew PointF              162  non-null values
MeanDew PointF              162  non-null values
Min DewpointF               162  non-null values
Max Humidity                162  non-null values
 Mean Humidity              162  non-null values
 Min Humidity               162  non-null values
 Max Sea Level PressureIn   162  non-null values
 Mean Sea Level PressureIn  162  non-null values
 Min Sea Level PressureIn   162  non-null values
 Max VisibilityMiles        162  non-null values
 Mean VisibilityMiles       162  non-null values
 Min VisibilityMiles        162  non-null values
 Max Wind SpeedMPH          162  non-null values
 Mean Wind SpeedMPH         162  non-null values
 Max Gust SpeedMPH          162  non-null values
PrecipitationIn             162  non-null values
 CloudCover                 162  non-null values
 Events                     162  non-null values
 WindDirDegrees             162  non-null values
dtypes: float64(4), int64(16), object(3)

As you can see, only 162 rows remain, none of which contains a NaN in the Events column. The other option you have is to replace the NaN values with something easier to deal with, such as an empty string, using the fillna function. You can do this in the following manner:

data[' Events'].fillna('')

0
1                   Rain
2                   Rain
3
4
5      Rain-Thunderstorm
6
7       Fog-Thunderstorm
8                   Rain
...
362
363
364
365
Name:  Events, Length: 366
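The three missing-value tools in this section can be compared side by side. The following sketch assumes a hypothetical four-row dataframe whose Events column has no leading space in its name; the values are illustrative.

```python
import pandas

# Hypothetical sample; None marks a day with no recorded event.
sample = pandas.DataFrame({
    'EDT': ['2012-3-10', '2012-3-11', '2012-3-12', '2012-3-13'],
    'Events': [None, 'Rain', 'Rain', None],
})

# isnull() on a column returns a Boolean Series, True where a value is missing.
missing = sample['Events'].isnull()

# dropna keeps only the rows that have a value in the listed columns.
with_events = sample.dropna(subset=['Events'])

# fillna replaces each missing value with the given one, here an empty string.
filled = sample['Events'].fillna('')

# A Boolean Series can also be used to select rows, here the missing ones.
only_missing = sample[missing]

print(list(missing))
print(with_events)
print(list(filled))
print(only_missing)
```

Both dropna and fillna return new objects here, so sample itself still contains its NaN values afterwards.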

Accessing individual rows
So far we have discussed methods dealing with indexing entire columns, but what if you want to access a specific row in your dataframe? Well, Pandas provides the iloc indexer, which lets you get a specific row by its integer position. (Older releases offered a function called irow for this purpose, but it has since been removed.) You can use it as follows:

data.iloc[0]

EDT                          2012-3-10
Max TemperatureF                    56
Mean TemperatureF                   40
Min TemperatureF                    24
Max Dew PointF                      24
MeanDew PointF                      20
Min DewpointF                       16
Max Humidity                        74
 Mean Humidity                      50
 Min Humidity                       26
 Max Sea Level PressureIn        30.53
 Mean Sea Level PressureIn       30.45
 Min Sea Level PressureIn        30.34
 Max VisibilityMiles                10
 Mean VisibilityMiles               10
 Min VisibilityMiles                10
 Max Wind SpeedMPH                  13
 Mean Wind SpeedMPH                  6
 Max Gust SpeedMPH                  17
PrecipitationIn                   0.00
 CloudCover                          0
 Events                            NaN
 WindDirDegrees                    138
Name: 0

Note: Indices start from 0 for indexing the rows.
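On current Pandas releases, positional row access is done with the iloc indexer. Here is a minimal sketch, assuming a hypothetical three-row dataframe named sample with illustrative values:

```python
import pandas

# Hypothetical sample; the values are illustrative only.
sample = pandas.DataFrame({
    'EDT': ['2012-3-10', '2012-3-11', '2012-3-12'],
    'Max TemperatureF': [56, 67, 71],
})

# iloc[0] returns the first row as a Series indexed by column name.
first_row = sample.iloc[0]

# Negative positions count from the end, as with Python lists.
last_row = sample.iloc[-1]

# A slice returns a DataFrame holding the selected rows.
first_two = sample.iloc[0:2]

print(first_row)
print(last_row['EDT'])
print(len(first_two))
```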

Filtering
Sometimes you may need to find rows of special interest to you. Let’s suppose we want to find the data points in our dataframe that have a mean temperature greater than 40 and less than 50. You can filter out values from your dataframe using the following syntax:

data[(data['Mean TemperatureF']>40) & (data['Mean TemperatureF']<50)]

<class 'pandas.core.frame.DataFrame'>
Int64Index: 51 entries, 1 to 364
Data columns:
EDT                         51  non-null values
Max TemperatureF            51  non-null values
Mean TemperatureF           51  non-null values
Min TemperatureF            51  non-null values
Max Dew PointF              51  non-null values
MeanDew PointF              51  non-null values
Min DewpointF               51  non-null values
Max Humidity                51  non-null values
 Mean Humidity              51  non-null values
 Min Humidity               51  non-null values
 Max Sea Level PressureIn   51  non-null values
 Mean Sea Level PressureIn  51  non-null values
 Min Sea Level PressureIn   51  non-null values
 Max VisibilityMiles        51  non-null values
 Mean VisibilityMiles       51  non-null values
 Min VisibilityMiles        51  non-null values
 Max Wind SpeedMPH          51  non-null values
 Mean Wind SpeedMPH         51  non-null values
 Max Gust SpeedMPH          51  non-null values
PrecipitationIn             51  non-null values
 CloudCover                 51  non-null values
 Events                     23  non-null values
 WindDirDegrees             51  non-null values
dtypes: float64(4), int64(16), object(3)

Note: The conditions data['Mean TemperatureF']>40 and data['Mean TemperatureF']<50 each return a Boolean Series. You must wrap each condition in brackets before combining them with the & operator, because & binds more tightly than the comparison operators; otherwise you will get an error message saying that the truth value of the expression is ambiguous.

Now you can easily get meaningful data from your dataframe by simply filtering out the data that you aren’t interested in. This is a very powerful technique that you can use in conjunction with higher-level Pandas functions to understand your data.
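To see how the filter is built up, the two conditions can be stored as separate Boolean masks before being combined. The sketch below assumes a hypothetical five-row dataframe; the temperatures are illustrative.

```python
import pandas

# Hypothetical sample; the values are illustrative only.
sample = pandas.DataFrame({
    'EDT': ['2012-3-10', '2012-3-11', '2012-3-12', '2012-3-13', '2012-3-14'],
    'Mean TemperatureF': [40, 45, 52, 49, 50],
})

# Each comparison yields a Boolean Series with one entry per row.
above_40 = sample['Mean TemperatureF'] > 40
below_50 = sample['Mean TemperatureF'] < 50

# & combines the masks element by element; rows where the result is True are kept.
mild_days = sample[above_40 & below_50]

print(list(above_40))
print(list(below_50))
print(mild_days)
```

Both comparisons are strict, so the rows with exactly 40 and exactly 50 are excluded.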

Getting data out
You can easily write your data out as a CSV file using the to_csv function.

data.to_csv('weather-mod.csv')

Want tab-separated values instead? No problem.

data.to_csv('data/weather-mod.tsv', sep='\t')
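To check what to_csv produces without touching the disk, the output can be written to an in-memory buffer and read back. This is a sketch under stated assumptions: the two-row dataframe is hypothetical, and index=False is an optional setting, not used in the commands above, that leaves the row index out of the file.

```python
import io

import pandas

# Hypothetical sample; the values are illustrative only.
sample = pandas.DataFrame({
    'EDT': ['2012-3-10', '2012-3-11'],
    'Max Humidity': [74, 78],
})

# to_csv accepts a path or a file-like object; sep sets the field delimiter.
buffer = io.StringIO()
sample.to_csv(buffer, sep='\t', index=False)
tsv_text = buffer.getvalue()

# Reading the text back with the same separator recovers the columns.
round_trip = pandas.read_csv(io.StringIO(tsv_text), sep='\t')

print(tsv_text)
print(round_trip)
```

Without index=False, the row index is written as an extra first column, which then appears as an unnamed column when the file is read back.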

Note: Generally, the dataframe can be indexed by any Boolean NumPy array or Series, and only the rows whose value is True will be retained. For example, take the variable e (e = data[' Events'].apply(lambda d: pandas.isnull(d))), which marks all the rows that have NaN values for data[' Events']. If we use it as data[e], we will get a dataframe containing only the rows that have NaN values for data[' Events'].

Educational Revolutioners

Tuesday, 20 January 2015
