Intro to pandas

Basic structures

Through these examples, we will learn some of the most basic functionality that the pandas library offers.

Let's start by importing all the libraries we will use:

In [6]:
import pandas as pd
import pysal as ps
import numpy as np

Let's get income per capita data for the continental US from 1929 to 2009. PySAL ships this in its examples folder, so we can access it easily. We then create a DataFrame straight from the csv. Since the first column of our csv is the state name, let's set it as the index.

In [7]:
data_path = ps.examples.get_path('usjoin.csv')
df = pd.read_csv(data_path, index_col=0)

A DataFrame object has a bunch of attributes; particularly interesting for many purposes are the index and the columns:

In [8]:
df.index
Out[8]:
Index([Alabama, Arizona, Arkansas, California, Colorado, Connecticut,
       Delaware, Florida, Georgia, Idaho, Illinois, Indiana, Iowa, Kansas,
       Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan,
       Minnesota, Mississippi, Missouri, Montana, Nebraska, Nevada,
       New Hampshire, New Jersey, New Mexico, New York, North Carolina,
       North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island,
       South Carolina, South Dakota, Tennessee, Texas, Utah, Vermont,
       Virginia, Washington, West Virginia, Wisconsin, Wyoming], dtype=object)
In [9]:
df.columns
Out[9]:
Index([STATE_FIPS, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937,
       1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948,
       1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959,
       1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970,
       1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981,
       1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992,
       1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
       2004, 2005, 2006, 2007, 2008, 2009], dtype=object)

Now let's pull out the data from 2009:

In [10]:
y = df['2009']

As you can see, this is a pandas Series object:

In [11]:
type(y)
Out[11]:
pandas.core.series.Series

And you can perform simple operations like summing it:

In [12]:
y.sum()
Out[12]:
1785841
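Sum is only one of many reductions a Series supports; the same pattern works for the mean, maximum, and so on. A quick sketch with toy numbers:

```python
import pandas as pd

s = pd.Series([10, 20, 30])   # toy values, not the income data
print(s.sum())    # 60
print(s.mean())   # 20.0
print(s.max())    # 30
```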

We can also extract a row:

In [13]:
row = df.xs('Arizona')
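As a side note, `.xs` takes a cross-section by index label; on a hypothetical mini-frame, label-based `.loc` indexing gives the same row in current pandas:

```python
import pandas as pd

# toy frame standing in for df
toy = pd.DataFrame({'2001': [1, 2], '2003': [3, 4]},
                   index=['Arizona', 'Arkansas'])
row_xs = toy.xs('Arizona')    # cross-section by index label
row_loc = toy.loc['Arizona']  # label-based indexing, same result
print(row_xs.equals(row_loc))  # True
```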

We can also get a full block/subset of the data frame (e.g. rows 10 to 14 and the years 2001, 2003 and 2008), which is itself a data frame:

In [14]:
block = df[['2001', '2003', '2008']][10:15]
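The chained expression above selects columns first and then slices rows by position. A self-contained sketch with a toy frame (made-up labels and values):

```python
import pandas as pd

# column selection first, then a positional row slice
toy = pd.DataFrame({'2001': range(5), '2003': range(5, 10), '2008': range(10, 15)},
                   index=list('abcde'))
sub = toy[['2001', '2008']][1:3]
print(sub.shape)          # (2, 2): two rows, two columns
print(list(sub.index))    # ['b', 'c']
```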

See how we can get a correlation matrix:

In [15]:
block.corr()
Out[15]:
      2001    2003    2008  
2001  1.0000  0.9972  0.9564
2003  0.9972  1.0000  0.9553
2008  0.9564  0.9553  1.0000
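`corr()` computes pairwise Pearson correlations between columns; on toy data you can check that it agrees with numpy's `corrcoef` up to floating point:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)            # toy random data
toy = pd.DataFrame(rng.rand(10, 2), columns=['x', 'y'])
c = toy.corr()                            # pairwise Pearson correlations
print(np.allclose(c.values, np.corrcoef(toy['x'], toy['y'])))  # True
```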

If you are missing numpy array-like slicing of a data frame, check the ix attribute:

In [16]:
ba = df.ix[3:5, 2:8]
ba
Out[16]:
            1930  1931  1932  1933  1934  1935
Name                                          
California  887   749   580   546   603   660 
Colorado    578   471   354   353   368   444 
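Note that `.ix` was deprecated and removed in later pandas versions; `.iloc` now does this kind of purely positional slicing. A sketch on a toy frame:

```python
import pandas as pd

# toy frame with made-up labels; .iloc slices by position only
toy = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                   index=['a', 'b', 'c'], columns=['x', 'y', 'z'])
ba2 = toy.iloc[0:2, 1:3]   # rows 0-1, columns 1-2
print(ba2)
```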

If you need a data frame as a numpy array, you can convert it fairly efficiently:

In [17]:
a = ba.as_matrix()
a
Out[17]:
array([[887, 749, 580, 546, 603, 660],
       [578, 471, 354, 353, 368, 444]], dtype=int64)
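In later pandas versions `.as_matrix()` was removed in favour of `.values` (and, more recently, `.to_numpy()`); a toy sketch:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])  # toy frame
arr = toy.values      # the frame's data as a plain numpy array
print(type(arr))      # numpy.ndarray
print(arr.shape)      # (2, 2)
```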

It is of course also possible to add and delete columns. Let's suppose we get an update from Oregon on the per capita income for 2010:

In [18]:
df['2010'] = pd.Series({'Oregon': 42000})
df['nothing'] = 'nothing'
del df['nothing']
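The key point in the assignment above is that a Series assigned to a column is aligned on the index: labels that don't match get NaN. A toy sketch of that behaviour:

```python
import numpy as np
import pandas as pd

# toy frame; assigning a Series aligns on index labels,
# so only the matching row ('Oregon') receives a value
toy = pd.DataFrame({'2009': [1, 2, 3]}, index=['Oregon', 'Ohio', 'Utah'])
toy['2010'] = pd.Series({'Oregon': 42000})
print(toy['2010'].notnull().sum())   # 1: only Oregon matched
```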

If you check the data frame now, you'll see there is a '2010' column with only one non-missing value:

In [19]:
df
Out[19]:
<class 'pandas.core.frame.DataFrame'>
Index: 48 entries, Alabama to Wyoming
Data columns:
STATE_FIPS    48  non-null values
1929          48  non-null values
1930          48  non-null values
1931          48  non-null values
1932          48  non-null values
1933          48  non-null values
1934          48  non-null values
1935          48  non-null values
1936          48  non-null values
1937          48  non-null values
1938          48  non-null values
1939          48  non-null values
1940          48  non-null values
1941          48  non-null values
1942          48  non-null values
1943          48  non-null values
1944          48  non-null values
1945          48  non-null values
1946          48  non-null values
1947          48  non-null values
1948          48  non-null values
1949          48  non-null values
1950          48  non-null values
1951          48  non-null values
1952          48  non-null values
1953          48  non-null values
1954          48  non-null values
1955          48  non-null values
1956          48  non-null values
1957          48  non-null values
1958          48  non-null values
1959          48  non-null values
1960          48  non-null values
1961          48  non-null values
1962          48  non-null values
1963          48  non-null values
1964          48  non-null values
1965          48  non-null values
1966          48  non-null values
1967          48  non-null values
1968          48  non-null values
1969          48  non-null values
1970          48  non-null values
1971          48  non-null values
1972          48  non-null values
1973          48  non-null values
1974          48  non-null values
1975          48  non-null values
1976          48  non-null values
1977          48  non-null values
1978          48  non-null values
1979          48  non-null values
1980          48  non-null values
1981          48  non-null values
1982          48  non-null values
1983          48  non-null values
1984          48  non-null values
1985          48  non-null values
1986          48  non-null values
1987          48  non-null values
1988          48  non-null values
1989          48  non-null values
1990          48  non-null values
1991          48  non-null values
1992          48  non-null values
1993          48  non-null values
1994          48  non-null values
1995          48  non-null values
1996          48  non-null values
1997          48  non-null values
1998          48  non-null values
1999          48  non-null values
2000          48  non-null values
2001          48  non-null values
2002          48  non-null values
2003          48  non-null values
2004          48  non-null values
2005          48  non-null values
2006          48  non-null values
2007          48  non-null values
2008          48  non-null values
2009          48  non-null values
2010          1  non-null values
dtypes: int64(82), float64(1)

It is cool to see that PySAL operations that work on vectors also work on pandas Series. For instance, let's create some spatial weights and calculate Moran's I for 2009:

In [20]:
w = ps.rook_from_shapefile(ps.examples.get_path('us48.shp'))
mi = ps.Moran(y, w)
print "Moran's I: %f\tp-value: %f"%(mi.I, mi.p_sim)
Moran's I: 0.040711	p-value: 0.252000
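For intuition on what PySAL is computing here (this is a numpy-only sketch of the textbook formula, not PySAL's implementation, and the weights matrix is a toy example):

```python
import numpy as np

def morans_i(y, W):
    """Moran's I from its textbook formula, with W a dense weights matrix.

    I = (n / S0) * (z' W z) / (z' z), where z are the mean-centred
    values and S0 is the sum of all weights.
    """
    y = np.asarray(y, dtype=float)
    z = y - y.mean()
    s0 = W.sum()
    n = len(y)
    return (n / s0) * z.dot(W).dot(z) / z.dot(z)

# toy example: three areas on a line with binary rook-style contiguity
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
print(morans_i([1.0, 2.0, 1.0], W))  # alternating values: strong negative autocorrelation
```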

And finally, some of the goodies that you also get for free, like matplotlib integration. Let's see how per capita income has evolved over the years for California and Arizona:

In [21]:
import matplotlib.pyplot as plt

evol = df.ix[['California', 'Arizona'], :]
del evol['STATE_FIPS']
evol = evol.T
evol.plot()
plt.legend(loc='upper left')
plt.show()

This is only a first introduction; there is much more functionality in the library, particularly related to database operations (joins, merges, etc.), so don't stop here and go to the main website for more info!

Panel structures

In this section we will showcase the use of the panel structure. Code provided by Dave.

Construct a random DataFrame object for population (we will add rate further down):

In [23]:
np.random.seed(10)
pop = np.random.randint(0, 4000, (len(df.index), len(df.columns)))
pop = pd.DataFrame(pop, index=df.index, columns=df.columns)

One method of constructing a panel is to pass a dictionary of DataFrames:

In [24]:
panel = pd.Panel({'inc': df, 'pop': pop})
print panel
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 48 (major) x 83 (minor)
Items: inc to pop
Major axis: Alabama to Wyoming
Minor axis: STATE_FIPS to 2010

A panel behaves like a dictionary of DataFrames:

In [25]:
population = panel['pop']
print population
<class 'pandas.core.frame.DataFrame'>
Index: 48 entries, Alabama to Wyoming
Data columns:
STATE_FIPS    48  non-null values
1929          48  non-null values
1930          48  non-null values
1931          48  non-null values
1932          48  non-null values
1933          48  non-null values
1934          48  non-null values
1935          48  non-null values
1936          48  non-null values
1937          48  non-null values
1938          48  non-null values
1939          48  non-null values
1940          48  non-null values
1941          48  non-null values
1942          48  non-null values
1943          48  non-null values
1944          48  non-null values
1945          48  non-null values
1946          48  non-null values
1947          48  non-null values
1948          48  non-null values
1949          48  non-null values
1950          48  non-null values
1951          48  non-null values
1952          48  non-null values
1953          48  non-null values
1954          48  non-null values
1955          48  non-null values
1956          48  non-null values
1957          48  non-null values
1958          48  non-null values
1959          48  non-null values
1960          48  non-null values
1961          48  non-null values
1962          48  non-null values
1963          48  non-null values
1964          48  non-null values
1965          48  non-null values
1966          48  non-null values
1967          48  non-null values
1968          48  non-null values
1969          48  non-null values
1970          48  non-null values
1971          48  non-null values
1972          48  non-null values
1973          48  non-null values
1974          48  non-null values
1975          48  non-null values
1976          48  non-null values
1977          48  non-null values
1978          48  non-null values
1979          48  non-null values
1980          48  non-null values
1981          48  non-null values
1982          48  non-null values
1983          48  non-null values
1984          48  non-null values
1985          48  non-null values
1986          48  non-null values
1987          48  non-null values
1988          48  non-null values
1989          48  non-null values
1990          48  non-null values
1991          48  non-null values
1992          48  non-null values
1993          48  non-null values
1994          48  non-null values
1995          48  non-null values
1996          48  non-null values
1997          48  non-null values
1998          48  non-null values
1999          48  non-null values
2000          48  non-null values
2001          48  non-null values
2002          48  non-null values
2003          48  non-null values
2004          48  non-null values
2005          48  non-null values
2006          48  non-null values
2007          48  non-null values
2008          48  non-null values
2009          48  non-null values
2010          48  non-null values
dtypes: int64(83)
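As an aside, `pd.Panel` was eventually removed from pandas; the same dict-of-DataFrames idea is now expressed with `pd.concat`, which stacks the items into an outer index level. A sketch with toy frames and made-up numbers:

```python
import pandas as pd

# two toy "items", standing in for the income and population frames
inc = pd.DataFrame({'2009': [1, 2]}, index=['Alabama', 'Arizona'])
pop2 = pd.DataFrame({'2009': [10, 20]}, index=['Alabama', 'Arizona'])

# the dict keys become an outer index level named 'item'
panel_like = pd.concat({'inc': inc, 'pop': pop2}, names=['item'])
print(panel_like.loc['pop'].equals(pop2))  # recover an item, like panel['pop']
```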

Add another attribute to the panel:

In [27]:
rate = np.random.uniform(0, 1, (len(df.index), len(df.columns)))
rate = pd.DataFrame(rate, index=df.index, columns=df.columns)
panel['rate'] = rate
print panel
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 48 (major) x 83 (minor)
Items: inc to rate
Major axis: Alabama to Wyoming
Minor axis: STATE_FIPS to 2010

Grab a spatial subset of the panel:

In [28]:
alabama = panel.major_xs('Alabama')
print alabama
a_states = panel.ix[:, ['Alabama', 'Arizona', 'Arkansas'], :]
print a_states
            inc        pop   rate    
STATE_FIPS  1.0000000  1289  0.662636
1929        323.00000  3197  0.987605
1930        267.00000  527   0.803344
1931        224.00000  1344  0.876402
1932        162.00000  3441  0.629589
1933        166.00000  3195  0.630957
1934        211.00000  1180  0.858671
1935        217.00000  2009  0.827018
1936        251.00000  733   0.199480
1937        267.00000  1520  0.657321
1938        244.00000  1949  0.182582
1939        252.00000  3953  0.243080
1940        281.00000  2042  0.971408
1941        375.00000  1032  0.728128
1942        518.00000  1097  0.093333
1943        658.00000  2304  0.005975
1944        738.00000  2538  0.621661
1945        784.00000  40    0.101825
1946        754.00000  2550  0.277261
1947        805.00000  3492  0.279899
1948        881.00000  2419  0.334058
1949        833.00000  1552  0.040447
1950        909.00000  356   0.077944
1951        1045.0000  239   0.599454
1952        1106.0000  2443  0.575130
1953        1161.0000  2102  0.164323
1954        1139.0000  3416  0.748306
1955        1273.0000  3435  0.841941
1956        1356.0000  3194  0.665922
1957        1421.0000  574   0.001476
1958        1468.0000  3617  0.611659
1959        1526.0000  1224  0.781603
1960        1558.0000  3940  0.013469
1961        1587.0000  974   0.497504
1962        1667.0000  1713  0.295069
1963        1758.0000  3763  0.123912
1964        1890.0000  3126  0.799341
1965        2030.0000  77    0.445302
1966        2169.0000  3525  0.107034
1967        2294.0000  13    0.431859
1968        2516.0000  2803  0.060275
1969        2748.0000  409   0.846316
1970        2979.0000  1406  0.163761
1971        3225.0000  653   0.503197
1972        3544.0000  3932  0.482173
1973        3960.0000  1366  0.744714
1974        4351.0000  3870  0.545451
1975        4765.0000  2718  0.864570
1976        5323.0000  89    0.843613
1977        5817.0000  2700  0.247699
1978        6500.0000  3393  0.859065
1979        7199.0000  2591  0.290108
1980        7892.0000  2873  0.636296
1981        8712.0000  1828  0.254420
1982        9185.0000  283   0.908914
1983        9783.0000  3602  0.875070
1984        1.080e+04  93    0.838669
1985        1.158e+04  77    0.440742
1986        1.220e+04  2454  0.096371
1987        1.291e+04  2967  0.509761
1988        1.384e+04  2654  0.746588
1989        1.490e+04  2808  0.553091
1990        1.583e+04  395   0.499230
1991        1.654e+04  1692  0.845371
1992        1.746e+04  3146  0.915802
1993        1.799e+04  1240  0.265301
1994        1.886e+04  3465  0.936786
1995        1.968e+04  2063  0.627056
1996        2.033e+04  3602  0.981020
1997        2.113e+04  3536  0.645274
1998        2.212e+04  630   0.909217
1999        2.299e+04  1095  0.720750
2000        2.347e+04  3416  0.532705
2001        2.447e+04  1419  0.545948
2002        2.516e+04  1681  0.404441
2003        2.606e+04  1454  0.271924
2004        2.766e+04  1159  0.035921
2005        2.910e+04  2763  0.549628
2006        3.063e+04  796   0.235501
2007        3.199e+04  2849  0.177639
2008        3.282e+04  1492  0.591674
2009        3.227e+04  2810  0.933103
2010        NaN        3936  0.573377
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 3 (major) x 83 (minor)
Items: inc to rate
Major axis: Alabama to Arkansas
Minor axis: STATE_FIPS to 2010

Grab a temporal subset of the panel:

In [29]:
y1994 = panel.minor_xs('1994')
print y1994
                inc        pop   rate   
Name                                    
Alabama         1.886e+04  3465  0.93679
Arizona         1.977e+04  2512  0.75694
Arkansas        1.775e+04  3843  0.32183
California      2.347e+04  209   0.47235
Colorado        2.350e+04  2590  0.84379
Connecticut     3.053e+04  1502  0.81217
Delaware        2.453e+04  3901  0.02673
Florida         2.234e+04  1895  0.62025
Georgia         2.117e+04  1473  0.63873
Idaho           1.885e+04  1648  0.98899
Illinois        2.444e+04  568   0.82968
Indiana         2.115e+04  2677  0.28288
Iowa            2.050e+04  2062  0.89557
Kansas          2.135e+04  2105  0.36717
Kentucky        1.851e+04  644   0.17393
Louisiana       1.878e+04  1321  0.16996
Maine           1.953e+04  294   0.30187
Maryland        2.605e+04  2389  0.35742
Massachusetts   2.684e+04  1517  0.53566
Michigan        2.286e+04  251   0.22109
Minnesota       2.347e+04  1183  0.22630
Mississippi     1.655e+04  269   0.09927
Missouri        2.127e+04  976   0.72481
Montana         1.813e+04  1333  0.16582
Nebraska        2.117e+04  1777  0.39216
Nevada          2.464e+04  1730  0.51121
New Hampshire   2.382e+04  508   0.79152
New Jersey      2.788e+04  268   0.89463
New Mexico      1.795e+04  307   0.06218
New York        2.636e+04  1131  0.53872
North Carolina  2.093e+04  2165  0.75427
North Dakota    1.903e+04  3978  0.02904
Ohio            2.206e+04  1703  0.79136
Oklahoma        1.873e+04  2295  0.96429
Oregon          2.142e+04  2486  0.35017
Pennsylvania    2.286e+04  1223  0.68071
Rhode Island    2.276e+04  2620  0.92750
South Carolina  1.869e+04  415   0.56871
South Dakota    1.961e+04  3670  0.99405
Tennessee       2.070e+04  957   0.15895
Texas           2.059e+04  3661  0.73914
Utah            1.791e+04  3994  0.49425
Vermont         2.055e+04  1091  0.60881
Virginia        2.371e+04  3927  0.55392
Washington      2.312e+04  3013  0.21798
West Virginia   1.741e+04  3385  0.01580
Wisconsin       2.170e+04  3775  0.30247
Wyoming         2.096e+04  853   0.17367
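`major_xs` and `minor_xs` take cross-sections along the panel's axes; on the MultiIndexed layout that later replaced Panel, `.xs` with a `level` argument does the same job. A toy sketch with made-up numbers:

```python
import pandas as pd

# toy panel-like frame: items on the outer level, states on the inner
idx = pd.MultiIndex.from_product([['inc', 'pop'], ['Alabama', 'Arizona']],
                                 names=['item', 'state'])
panel_like = pd.DataFrame({'1994': [1, 2, 10, 20]}, index=idx)

# cross-section on the inner level, akin to panel.major_xs('Alabama')
alabama2 = panel_like.xs('Alabama', level='state')
print(list(alabama2['1994']))  # [1, 10]: one value per item
```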