pandas is a software library written for the Python programming language for data manipulation and analysis. In many organizations, it is common to research, prototype, and test new ideas using
a more domain-specific computing language such as MATLAB or R, and then later port those ideas into a larger production system written in, say, Java, C#, or C++.
Increasingly, people are finding that Python is suitable not only for research and prototyping but also for building the production systems themselves.

pandas provides data structures and operations for manipulating numerical tables and time series. I came across pandas while I was researching big data. It saved me hours of work, so I decided to write a blog post about it. Its main features include:

  • Data structures
  • Date range generation
  • Index objects (simple axis indexing and multi-level / hierarchical axis indexing)
  • Data Wrangling (Clean, Transform, Merge, Reshape)
  • Grouping (aggregating and transforming data sets)
  • Interacting with the data/files (tabular data and flat files (CSV, delimited, Excel))
  • Statistical functions (Rolling statistics/ moments)
  • Static and moving window linear and panel regression
  • Plotting and Visualization

 

Let's write some code. First, import the libraries as follows:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

1. Data Generation

1.1 Creating a Series by passing a list of values

# np.nan marks a missing value
series = pd.Series([1, 2, 4, np.nan, 5, 7])
print(series)

 

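As a small follow-up sketch, the same Series can be inspected to see how pandas treats the missing value. The checks chosen here (dtype, isna, count, mean) are my own picks for illustration, not something the original example ran:

```python
import numpy as np
import pandas as pd

# Same values as the Series above; np.nan marks a missing entry.
series = pd.Series([1, 2, 4, np.nan, 5, 7])

print(series.dtype)         # float64, because NaN forces a float dtype
print(series.isna().sum())  # number of missing values
print(series.count())       # number of non-missing values
print(series.mean())        # NaN is skipped by default when averaging
```

Reductions such as mean skip NaN by default, which is why the average is taken over five values rather than six.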


1.2 Random sample values


The numpy.random module supplements Python's built-in random module with functions for
efficiently generating whole arrays of sample values from many kinds of probability distributions.


# 4x4 array of draws from the standard normal distribution
samples = np.random.normal(size=(4, 4))
print(samples)

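Because the samples are random, every run prints different numbers. One way to make results repeatable is a seeded generator. The sketch below swaps in NumPy's newer Generator interface (np.random.default_rng) instead of the np.random.normal call used above, and the seed value 42 is an arbitrary choice of mine:

```python
import numpy as np

# A seeded generator; the seed value 42 is arbitrary.
rng = np.random.default_rng(seed=42)

# 4x4 array drawn from the standard normal distribution
samples = rng.normal(size=(4, 4))

# Three draws from a uniform distribution on [0, 1)
uniform = rng.uniform(low=0.0, high=1.0, size=3)

# A second generator with the same seed reproduces the same normal draws.
same = np.random.default_rng(seed=42).normal(size=(4, 4))

print(samples.shape)
print(np.array_equal(samples, same))
```

Re-creating the generator with the same seed gives the same array, which is handy when you want a blog example or a test to be reproducible.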


1.3 Creating a DataFrame by passing a numpy array, with a datetime index and labeled columns.


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Five consecutive daily dates starting on 2015-02-01
dates = pd.date_range('20150201', periods=5)

# 5x3 frame of random values, one column per stock
df = pd.DataFrame(np.random.randn(5, 3), index=dates, columns=['stock A', 'stock B', 'stock C'])
print("Colombo Stock Exchange Growth - 2015 Feb")
print(45 * "=")
print(df)

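Once the frame has a datetime index and labeled columns, rows and columns can be picked out by label. The sketch below uses fixed, made-up values (not real market data) instead of random ones so the selections are easy to follow:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20150201', periods=5)

# Made-up values 0..14 laid out as 5 rows x 3 columns (not real market data)
values = np.arange(15, dtype=float).reshape(5, 3)
df = pd.DataFrame(values, index=dates, columns=['stock A', 'stock B', 'stock C'])

# One row, selected by its date label
row = df.loc[pd.Timestamp('2015-02-03')]
print(row)

# One column, restricted to an inclusive range of dates
window = df.loc['2015-02-02':'2015-02-04', 'stock A']
print(window)
```

Note that label-based slices with loc include both endpoints, unlike ordinary Python slices.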


1.4 Statistic summary


The describe method gives a quick statistical summary of your data:


print(df.describe())


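The result of describe is itself a DataFrame, so individual statistics can be looked up by label. Here is a minimal sketch with a small hypothetical frame (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical values, chosen so the statistics are easy to check by hand
df = pd.DataFrame({
    'stock A': [1.0, 2.0, 3.0, 4.0],
    'stock B': [10.0, 20.0, 30.0, 40.0],
})

summary = df.describe()

# The row labels of the summary are the statistic names
print(summary.index.tolist())

# Pull out a whole row of the summary, or a single cell
print(summary.loc['mean'])
print(summary.loc['50%', 'stock A'])
```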


1.5 Sorting


To sort by the values in one or more columns, pass the column names to the 'by' option of sort_values. (The older df.sort_index(by=...) form is no longer supported in current pandas releases.)
e.g. to sort the data in ascending order of 'stock A':


df.sort_values(by='stock A')




To sort by multiple columns, pass a list of names:
df.sort_values(by=['stock A', 'stock B'])


API
df.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None) [2]
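To make the sorting options concrete, here is a small sketch using sort_values, the call current pandas releases provide for sorting by column values. The frame and its values are hypothetical; 'stock A' contains a tie on purpose so the second sort key matters:

```python
import pandas as pd

# Hypothetical values; 'stock A' has a tie (0.3 twice)
df = pd.DataFrame(
    {'stock A': [0.3, -1.2, 0.3, 2.0], 'stock B': [1.5, 0.4, -0.7, 0.0]},
    index=pd.date_range('20150201', periods=4),
)

# Ascending by one column (the default order)
by_a = df.sort_values(by='stock A')

# Descending by one column
by_a_desc = df.sort_values(by='stock A', ascending=False)

# Ties in 'stock A' are broken by 'stock B'
by_a_then_b = df.sort_values(by=['stock A', 'stock B'])

# sort_index sorts by the index labels (here, the dates) instead of the values
by_date_desc = df.sort_index(ascending=False)

print(by_a_then_b)
```

All of these return a new DataFrame and leave df unchanged unless inplace=True is passed.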


1.6 Ranking
A DataFrame can compute ranks over the rows or the columns:
df.rank(axis=0,method='min')




API
df.rank(axis=0, method='average', numeric_only=False, na_option='keep', ascending=True, pct=False) [3]


[Note] Tie-breaking methods with rank

  • 'average' - Default: assign the average rank to each entry in the equal group.
  • 'min' - Use the minimum rank for the whole group.
  • 'max' - Use the maximum rank for the whole group.
  • 'first' - Assign ranks in the order the values appear in the data.
  • 'dense' - Like 'min', but rank always increases by 1 between groups.
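The difference between the tie-breaking methods is easiest to see side by side. This sketch ranks a small hypothetical Series (the values are made up, with two tied groups) once per method:

```python
import pandas as pd

# Hypothetical values with two tied groups: (3, 3) and (7, 7)
s = pd.Series([7, 3, 7, 1, 3])

# One column per tie-breaking method, for side-by-side comparison
methods = ['average', 'min', 'max', 'first', 'dense']
ranks = pd.DataFrame({m: s.rank(method=m) for m in methods})
ranks.insert(0, 'value', s)

print(ranks)
```

For the two 7s, which occupy positions 4 and 5 in sorted order, 'average' gives 4.5 to both, 'min' gives 4, 'max' gives 5, 'first' gives 4 and 5 in order of appearance, and 'dense' gives 3 because it does not leave gaps after the tied 3s.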

1.7 Descriptive Statistics Methods
pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods
that extract a single value (like the sum or mean).


e.g. with the stock DataFrame df built in section 1.3:


To get each day's total change across stock A, stock B, and stock C, sum across the columns:
print(df.sum(axis=1))


To find the date on which each stock had its largest increase:
print(df.idxmax())
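Since the original frame holds random numbers, here is the same pair of reductions on a small frame with fixed, made-up values, so the results can be checked by hand:

```python
import pandas as pd

# Made-up daily changes for three stocks (not real market data)
df = pd.DataFrame(
    {
        'stock A': [1.0, -2.0, 3.0],
        'stock B': [0.5, 4.0, -1.0],
        'stock C': [2.0, 0.0, 1.0],
    },
    index=pd.date_range('20150201', periods=3),
)

# axis=1 reduces across the columns: one total per day
daily_total = df.sum(axis=1)

# The default axis=0 reduces down the rows: one total per stock
stock_total = df.sum()

# Date of the largest value in each column
best_day = df.idxmax()

# Column name of the largest value in each row
best_stock = df.idxmax(axis=1)

print(daily_total)
print(best_day)
```

The axis argument names the direction that gets collapsed, so axis=1 leaves one value per row and axis=0 leaves one value per column.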


[Note] Descriptive and summary statistics

Method            Description
count             Number of non-NA values
describe          Compute set of summary statistics for Series or each DataFrame column
min, max          Compute minimum and maximum values
argmin, argmax    Compute index locations (integers) at which minimum or maximum value obtained, respectively
idxmin, idxmax    Compute index values at which minimum or maximum value obtained, respectively
quantile          Compute sample quantile ranging from 0 to 1
sum               Sum of values
mean              Mean of values
median            Arithmetic median (50% quantile) of values
mad               Mean absolute deviation from mean value
var               Sample variance of values
std               Sample standard deviation of values
skew              Sample skewness (3rd moment) of values
kurt              Sample kurtosis (4th moment) of values
cumsum            Cumulative sum of values
cummin, cummax    Cumulative minimum or maximum of values, respectively
cumprod           Cumulative product of values
diff              Compute 1st arithmetic difference (useful for time series)
pct_change        Compute percent changes

There are many more features; I will go through them in my next posts.
