Through these examples, we will learn some of the most basic functionality that the pandas
library offers.
Let's start by importing all the libraries we will use:
import pandas as pd
import pysal as ps
import numpy as np
Let's get income per capita data for the continental US from 1929 up to 2009. PySAL ships this dataset in its examples folder, so we can access it easily. We then create a DataFrame straight from the csv. Since the first column of our csv is the state name, let's set it as the index.
data_path = ps.examples.get_path('usjoin.csv')
df = pd.read_csv(data_path, index_col=0)
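To see what `index_col=0` does without the real usjoin.csv at hand, here is a minimal sketch with a made-up three-row CSV (the numbers are illustrative, not the actual dataset):

```python
from io import StringIO

import pandas as pd

# A tiny stand-in for usjoin.csv: the first column holds the state
# name, so we promote it to the index with index_col=0.
csv_text = """Name,1929,2009
Alabama,323,32274
Arizona,600,32935
Arkansas,310,31946
"""
df = pd.read_csv(StringIO(csv_text), index_col=0)
print(df.index.tolist())    # ['Alabama', 'Arizona', 'Arkansas']
print(df.columns.tolist())  # ['1929', '2009'] -- headers stay strings
```

Note that the year headers come in as strings, which is why the real data frame is indexed with `df['2009']` rather than `df[2009]`.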
A DataFrame object has a number of useful attributes; two of the most important are the index and the columns:
df.index
df.columns
Now let's pull out the data for 2009:
y = df['2009']
As you can see, this is a pandas Series object:
type(y)
And you can perform simple operations like summing it:
y.sum()
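Summing is only one of many reductions a Series supports. A quick sketch on a made-up Series (the values are illustrative, not the real 2009 figures):

```python
import pandas as pd

y = pd.Series({'Alabama': 32274, 'Arizona': 32935, 'Arkansas': 31946},
              name='2009')
# Reductions work directly on the Series, respecting the index:
print(y.sum())     # 97155
print(y.mean())    # 32385.0
print(y.idxmax())  # 'Arizona' -- label of the largest value
```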
We can also extract a row:
row = df.xs('Arizona')
We can also pull out a full block/subset of the data frame (e.g. a handful of states and the years 2001, 2003 and 2008), which is itself a data frame:
block = df[['2001', '2003', '2008']][10:15]
See how we can get a correlation matrix:
block.corr()
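To make the behaviour of `corr()` concrete, here is a sketch on a toy frame with made-up columns whose relationships are known in advance:

```python
import pandas as pd

df = pd.DataFrame({'2001': [1.0, 2.0, 3.0, 4.0],
                   '2003': [2.0, 4.0, 6.0, 8.0],   # exactly 2x '2001'
                   '2008': [4.0, 3.0, 2.0, 1.0]})  # exactly reversed
c = df.corr()
# Pearson correlation by default: perfectly linear pairs hit +/-1.
print(c.loc['2001', '2003'])  # approximately 1.0
print(c.loc['2001', '2008'])  # approximately -1.0
```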
If you are missing numpy array-like slicing of a data frame, check the iloc
attribute (the older ix accessor has since been removed from pandas):
ba = df.iloc[3:5, 2:8]
ba
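The difference between positional and label-based slicing is easy to trip on, so here is a hedged sketch on a toy frame: `iloc` slices by integer position (end-exclusive, like numpy), while `loc` slices by label (end-inclusive):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [5, 6, 7, 8],
                   'c': [9, 10, 11, 12]},
                  index=['w', 'x', 'y', 'z'])
# Positional: rows 1-2, columns 0-1 (end-exclusive, like numpy).
print(df.iloc[1:3, 0:2])
# Label-based: rows 'x' through 'y', columns 'a' through 'b' (inclusive).
print(df.loc['x':'y', 'a':'b'])
# Both select the same 2x2 block here.
```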
If you need a data frame as a numpy array, you can convert it fairly efficiently:
a = ba.to_numpy()
a
It is of course also possible to add and delete columns. Let's suppose we get an update from Oregon on the per capita income for 2010:
df['2010'] = pd.Series({'Oregon': 42000})
df['nothing'] = 'nothing'
del df['nothing']
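The assignment above relies on index alignment: labels missing from the assigned Series are filled with NaN. A minimal sketch on a two-state toy frame (made-up values):

```python
import pandas as pd

df = pd.DataFrame({'2009': [32274, 32935]}, index=['Alabama', 'Oregon'])
# Assigning a Series aligns on the index: 'Alabama' is missing from
# the Series, so its entry in the new column becomes NaN.
df['2010'] = pd.Series({'Oregon': 42000})
has_gap = df['2010'].isna()['Alabama']
print(df)
# Columns are deleted much like dictionary keys:
del df['2010']
print(df.columns.tolist())  # ['2009']
```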
If you check the data frame now, you'll see there is a '2010' column with only one non-missing value:
df
It is cool to see that pysal operations that work on vectors also work on pandas series. For instance, let's create some spatial weights and calculate Moran's I for 2009:
w = ps.rook_from_shapefile(ps.examples.get_path('us48.shp'))
mi = ps.Moran(y, w)
print("Moran's I: %f\tp-value: %f" % (mi.I, mi.p_sim))
And finally, some of the goodies that you also get for free, like matplotlib
integration. Let's see how the per capita income has evolved over the years for California and Arizona:
import matplotlib.pyplot as plt
evol = df.loc[['California', 'Arizona'], :]
del evol['STATE_FIPS']
evol = evol.T
evol.plot()
plt.legend(loc='upper left')
plt.show()
This is only a first introduction; there is much more functionality in the library, particularly related to database-style operations (joins, merges, etc.), so don't stop here and go to the main website for more info!
In this section we will showcase the use of the panel
structure. Code provided by Dave.
Construct a random DataFrame object for population:
np.random.seed(10)
pop = np.random.randint(0, 4000, (len(df.index), len(df.columns)))
pop = pd.DataFrame(pop, index=df.index, columns=df.columns)
One method of constructing a panel is to pass a dictionary of data frames:
panel = pd.Panel({'inc':df, 'pop':pop})
print(panel)
A panel is like a dictionary of dataframes:
population = panel['pop']
print(population)
Add another attribute to the panel:
rate = np.random.uniform(0, 1, (len(df.index), len(df.columns)))
rate = pd.DataFrame(rate, index=df.index, columns=df.columns)
panel['rate'] = rate
print(panel)
Grab a spatial subset of the panel:
alabama = panel.major_xs('Alabama')
print(alabama)
a_states = panel.loc[:, ['Alabama', 'Arizona', 'Arkansas'], :]
print(a_states)
Grab a temporal subset of the panel:
y1994 = panel.minor_xs('1994')
print(y1994)
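Note that the Panel type was removed in pandas 0.25, so in later releases the same dictionary-of-DataFrames idea is usually expressed as a single DataFrame with a MultiIndex, built with pd.concat. A minimal sketch with made-up numbers (the 'state' level name is ours, not part of the original data):

```python
import pandas as pd

inc = pd.DataFrame([[1, 2], [3, 4]],
                   index=['Alabama', 'Arizona'], columns=['1994', '1995'])
pop = pd.DataFrame([[10, 20], [30, 40]],
                   index=['Alabama', 'Arizona'], columns=['1994', '1995'])
# Stack the frames into one DataFrame keyed by item name, mirroring
# the {'inc': df, 'pop': pop} dictionary used for the panel above.
stacked = pd.concat({'inc': inc, 'pop': pop}, names=['item', 'state'])
# Spatial cross-section, analogous to major_xs('Alabama'):
print(stacked.xs('Alabama', level='state'))
# Temporal cross-section, analogous to minor_xs('1994'):
print(stacked['1994'])
```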