Analyse Your Data with Pandas
Here’s an introduction to Pandas, an open source software library that’s written in Python for data manipulation and analysis. Pandas facilitates the manipulation of numerical tables and the time series.
In recent times, it has been proven again and again that data has become an increasingly important resource. Now, with the Internet boom, large volumes of data are being generated every second. To stay ahead of the competition, companies need efficient ways of analysing this data, which can be represented as a matrix, using Python’s mathematical package, NumPy.
The problem with NumPy is that it doesn’t have sufficient data analysis tools built into it. This is where Pandas comes in. It is a data analysis package, which is built to integrate with NumPy arrays. Pandas has a lot of functionality, but we will cover only a small portion of it in this article.
Getting started
Installing Pandas is a one-step process if you use Pip. Run the following command to install Pandas.
Installing Pandas is a one-step process if you use Pip. Run the following command to install Pandas.
sudo pip install pandas |
If you face any difficulties, visit http://pandas.pydata.org/pandas-docs/stable/install.html. You can now try importing Pandas into your Python environment by issuing the following command:
import pandas |
In this tutorial, we will be using data from Weather Underground. The dataset for this article can be downloaded from http://www.synesthesiam.com/assets/weather_year.csv and can be imported into Pandas using:
data = pandas.read_csv(weather_year.csv) |
The read_csv function creates a dataframe. A dataframe is a tabular representation of the data read. You can get a summary of the dataset by printing the object. The output of the print is as follows:
data<class pandas.core.frame.DataFrame>Int64Index: 366 entries, 0 to 365Data columns:EDT 366 non-null valuesMax TemperatureF 366 non-null valuesMean TemperatureF 366 non-null valuesMin TemperatureF 366 non-null valuesMax Dew PointF 366 non-null valuesMeanDew PointF 366 non-null valuesMin DewpointF 366 non-null valuesMax Humidity 366 non-null valuesMean Humidity 366 non-null valuesMin Humidity 366 non-null valuesMax Sea Level PressureIn 366 non-null valuesMean Sea Level PressureIn 366 non-null valuesMin Sea Level PressureIn 366 non-null valuesMax VisibilityMiles 366 non-null valuesMean VisibilityMiles 366 non-null valuesMin VisibilityMiles 366 non-null valuesMax Wind SpeedMPH 366 non-null valuesMean Wind SpeedMPH 366 non-null valuesMax Gust SpeedMPH 365 non-null valuesPrecipitationIn 366 non-null valuesCloudCover 366 non-null valuesEvents 162 non-null valuesWindDirDegrees 366 non-null valuesdtypes: float64(4), int64(16), object(3) |
As you can see, there are 366 entries in the given dataframe. You can get the column names using data.columns.
The output of the command is given below:
The output of the command is given below:
data.columns |
Index([EDT, Max TemperatureF, Mean TemperatureF, Min TemperatureF, Max Dew PointF, MeanDew PointF, Min DewpointF, Max Humidity, Mean Humidity, Min Humidity, Max Sea Level PressureIn, Mean Sea Level PressureIn, Min Sea Level PressureIn, Max VisibilityMiles, Mean VisibilityMiles, Min VisibilityMiles, Max Wind SpeedMPH, Mean Wind SpeedMPH, Max Gust SpeedMPH, PrecipitationIn, CloudCover, Events, WindDirDegrees], dtype=object) |
To print a particular column of the dataframe, you can simply index it as data['EDT'] for a single column or data[['EDT','Max Humidity']] for multiple columns. The output for data['EDT'] is:
data[EDT]0 2012-3-101 2012-3-112 2012-3-123 2012-3-134 2012-3-145 2012-3-156 2012-3-16.........361 2013-3-6362 2013-3-7363 2013-3-8364 2013-3-9365 2013-3-10Name: EDT, Length: 366 |
And the output for data[[EDT,Max Humidity]] is:
data[[EDT,Max Humidity]] |
<class pandas.core.frame.DataFrame>Int64Index: 366 entries, 0 to 365Data columns:EDT 366 non-null valuesMax Humidity 366 non-null valuesdtypes: int64(1), object(1) |
Sometimes, it may be useful to only view a part of the data, just so that you can get a sense of what kind of data you are dealing with. Here you can use the head and tail functions to view the start and end of your dataframe:
data[Max Humidity].head() |
0 741 782 903 934 93Name: Max Humidity |
Note: The head and tail functions take a parameter which sets the number of rows to be displayed. And can be used as data[Max Humidity].head(n), where ‘n’ is the number of rows. The default is 5.
Working with columns
Now that we have a basis on which to work with our dataframe, we can explore various useful functions provided by Pandas like std to compute the standard deviation, mean to compute the average value, sum to compute the sum of all elements in a column, etc. So if you want to compute the mean of the Max Humidity column, for instance, you can use the following commands:
Now that we have a basis on which to work with our dataframe, we can explore various useful functions provided by Pandas like std to compute the standard deviation, mean to compute the average value, sum to compute the sum of all elements in a column, etc. So if you want to compute the mean of the Max Humidity column, for instance, you can use the following commands:
data['Max Humidity'].mean()
90.027322404371589data['Max Humidity'].sum()32950data['Max Humidity'].std()9.10843757197798 |
Note: Most of the Pandas functions ignore NaNs, by default. These regularly occur in data and a convenient way of handling them must be established. This topic is covered more in detail later in this article.
The std and sum function can be used in a similar manner. Also, rather than running these functions on individual columns, you can run them on the entire dataframe, as follows:
data.mean() |
Max TemperatureF 66.803279Mean TemperatureF 55.683060Min TemperatureF 44.101093Max Dew PointF 49.549180MeanDew PointF 44.057377Min DewpointF 37.980874Max Humidity 90.027322Mean Humidity 67.860656Min Humidity 45.193989Max Sea Level PressureIn 30.108907Mean Sea Level PressureIn 30.022705Min Sea Level PressureIn 29.936831Max VisibilityMiles 9.994536Mean VisibilityMiles 8.732240Min VisibilityMiles 5.797814Max Wind SpeedMPH 16.418033Mean Wind SpeedMPH 6.057377Max Gust SpeedMPH 22.764384CloudCover 2.885246WindDirDegrees 189.704918 |
Using apply for bulk operations
As we have already seen, functions like mean, std and sum work on entire columns, but sometimes it may be useful to apply our own functions to entire columns of the dataframe. For this purpose, Pandas provides the apply function, which takes an anonymous function as a parameter and applies to every element in the column. In this example, let us try to get the square of every element in a column. We can do this with the following code:
As we have already seen, functions like mean, std and sum work on entire columns, but sometimes it may be useful to apply our own functions to entire columns of the dataframe. For this purpose, Pandas provides the apply function, which takes an anonymous function as a parameter and applies to every element in the column. In this example, let us try to get the square of every element in a column. We can do this with the following code:
data[Max Humidity].apply(lambda d: d**2)0 54761 60842 81003 86494 86495 8100.........361 8464362 7225363 7744364 5625365 2916Name: Max Humidity, Length: 366 |
Note: In the Lambda function, the parameter d is implicitly passed to it by Pandas, and contains each element of the a column.
Now you may wonder why you can’t just do this with a loop. Well, the answer is that this operation was written in one single line, which saves code writing time and is much easier to read.
Dealing with NaN values
Pandas provides a function called isnull, which returns a ‘True’ or ‘False’ value depending on whether the value of an element in the column is NaN or None. These values are treated as missing values from the dataset, and so it is always convenient to deal with them separately. We can use the apply function to test every element in a column to see if any NaNs are present. You can use the following command:
Pandas provides a function called isnull, which returns a ‘True’ or ‘False’ value depending on whether the value of an element in the column is NaN or None. These values are treated as missing values from the dataset, and so it is always convenient to deal with them separately. We can use the apply function to test every element in a column to see if any NaNs are present. You can use the following command:
e = data[Events].apply(lambda d: pandas.isnull(d))e0 True1 False2 False3 True4 True5 False...361 False362 True363 True364 True365 TrueName: Events, Length: 366 |
As you can see, a list of Booleans was returned, representing values that are NaN. Now there are two options of how to deal with the NaN values. First, you can choose to drop all rows with NaN values using the dropna function, in the following manner:
data.dropna(subset=[Events])<class pandas.core.frame.DataFrame>Int64Index: 162 entries, 1 to 361Data columns:EDT 162 non-null valuesMax TemperatureF 162 non-null valuesMean TemperatureF 162 non-null valuesMin TemperatureF 162 non-null valuesMax Dew PointF 162 non-null valuesMeanDew PointF 162 non-null valuesMin DewpointF 162 non-null valuesMax Humidity 162 non-null values Mean Humidity 162 non-null values Min Humidity 162 non-null values Max Sea Level PressureIn 162 non-null values Mean Sea Level PressureIn 162 non-null values Min Sea Level PressureIn 162 non-null values Max VisibilityMiles 162 non-null values Mean VisibilityMiles 162 non-null values Min VisibilityMiles 162 non-null values Max Wind SpeedMPH 162 non-null values Mean Wind SpeedMPH 162 non-null values Max Gust SpeedMPH 162 non-null valuesPrecipitationIn 162 non-null values CloudCover 162 non-null values Events 162 non-null values WindDirDegrees 162 non-null valuesdtypes: float64(4), int64(16), object(3) |
As you can see, there are only 162 rows, which don’t contain NaNs in the column Events. The other option you have is to replace the NaN values with something easier to deal with using the fillna function. You can do this in the following manner:
data[Events].fillna()01 Rain2 Rain345 Rain-Thunderstorm67 Fog-Thunderstorm8 Rain362363364365Name: Events, Length: 366 |
Accessing individual rows
So far we have discussed methods dealing with indexing entire columns, but what if you want to access a specific row in your dataframe? Well, Pandas provides a function called irow, which lets you get the value of a specific row. You can use it as follows:
So far we have discussed methods dealing with indexing entire columns, but what if you want to access a specific row in your dataframe? Well, Pandas provides a function called irow, which lets you get the value of a specific row. You can use it as follows:
data.irow(0)EDT 2012-3-10Max TemperatureF 56Mean TemperatureF 40Min TemperatureF 24Max Dew PointF 24MeanDew PointF 20Min DewpointF 16Max Humidity 74 Mean Humidity 50 Min Humidity 26 Max Sea Level PressureIn 30.53 Mean Sea Level PressureIn 30.45 Min Sea Level PressureIn 30.34 Max VisibilityMiles 10 Mean VisibilityMiles 10 Min VisibilityMiles 10 Max Wind SpeedMPH 13 Mean Wind SpeedMPH 6 Max Gust SpeedMPH 17PrecipitationIn 0.00 CloudCover 0 Events NaN WindDirDegrees 138Name: 0 |
Note: Indices start from 0 for indexing the rows.
Filtering
Sometimes you may need to find rows of special interest to you. Lets suppose we want to find out data points in our data frame, which have a mean temperature greater than 40 and less than 50.You can filter out values from your dataframe using the following syntax:
Sometimes you may need to find rows of special interest to you. Lets suppose we want to find out data points in our data frame, which have a mean temperature greater than 40 and less than 50.You can filter out values from your dataframe using the following syntax:
data[(data['Mean TemperatureF']>40) & (data['Mean TemperatureF']<50)]<class 'pandas.core.frame.DataFrame'>Int64Index: 51 entries, 1 to 364Data columns:EDT 51 non-null valuesMax TemperatureF 51 non-null valuesMean TemperatureF 51 non-null valuesMin TemperatureF 51 non-null valuesMax Dew PointF 51 non-null valuesMeanDew PointF 51 non-null valuesMin DewpointF 51 non-null valuesMax Humidity 51 non-null values Mean Humidity 51 non-null values Min Humidity 51 non-null values Max Sea Level PressureIn 51 non-null values Mean Sea Level PressureIn 51 non-null values Min Sea Level PressureIn 51 non-null values Max VisibilityMiles 51 non-null values Mean VisibilityMiles 51 non-null values Min VisibilityMiles 51 non-null values Max Wind SpeedMPH 51 non-null values Mean Wind SpeedMPH 51 non-null values Max Gust SpeedMPH 51 non-null valuesPrecipitationIn 51 non-null values CloudCover 51 non-null values Events 23 non-null values WindDirDegrees 51 non-null valuesdtypes: float64(4), int64(16), object(3) |
Note: The output of the condition data[Mean TemperatureF]>40 and data[Mean TemperatureF]<50 return a NumPy array, and we must use the brackets to separate them before using the & operator, or else you will get an error message saying that the expression is ambiguous.
Now you can easily get meaningful data from your dataframe by simply filtering out the data that you aren’t interested in. This provides you with a very powerful technique that you can use in conjunction with higher Pandas functions to understand your data.
Getting data out
You can easily write data out by using the to_csv function to write your data out as a csv file.
You can easily write data out by using the to_csv function to write your data out as a csv file.
data.to_csv(weather-mod.csv)Want to make a separate tab? No problem. data.to_csv(data/weather-mod.tsv, sep=\t) |
Note: Generally, the dataframe can be indexed by any Boolean NumPy array. In a sense, only values that are true will be retained. For example, if we use the variable e, (e = data[Events].apply(lambda d: pandas.isnull(d))) which contains the list of all rows that have NaN values for data[Events], as data[e], we will get a dataframe which has rows that only have NaN values for data[Events]

No comments:
Post a Comment