Pandas is a software library written for the Python programming language for data manipulation and analysis. It contains data structures and operations for manipulating numerical tables and time series. In many organizations, it is common to research, prototype, and test new ideas in a more domain-specific computing language such as MATLAB or R, and then port those ideas into a larger production system written in, say, Java, C#, or C++. People are increasingly finding that Python is suitable not only for research and prototyping but also for building the production systems themselves.
I came across pandas while I was researching big data. It saved me hours of research, so I decided to write a blog post about it. Pandas provides:
- Data structures
- Date range generation
- Index objects (simple axis indexing and multi-level / hierarchical axis indexing)
- Data Wrangling (Clean, Transform, Merge, Reshape)
- Grouping (aggregating and transforming data sets)
- Interacting with the data/files (tabular data and flat files (CSV, delimited, Excel))
- Statistical functions (Rolling statistics/ moments)
- Static and moving window linear and panel regression
- Plotting and Visualization
Let's start coding. First, we import the libraries as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
1. Data Generation
1.1 Creating a Series by passing a list of values
series = pd.Series([1, 2, 4, np.nan, 5, 7])
print(series)
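As a small sketch of how the missing value (np.nan) in the Series above behaves, the lines below check the dtype and a couple of reductions. The values are the same as in the example; nothing else is assumed.

```python
import numpy as np
import pandas as pd

# Same values as the Series above; np.nan marks a missing entry.
series = pd.Series([1, 2, 4, np.nan, 5, 7])

# The presence of NaN forces a floating-point dtype.
print(series.dtype)    # float64

# count() skips missing values, len() does not.
print(series.count())  # 5
print(len(series))     # 6

# Reductions such as sum() ignore NaN by default.
print(series.sum())    # 19.0
```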
1.2 Random sample values
The numpy.random module supplements the built-in Python random module with functions for efficiently generating whole arrays of sample values from many kinds of probability distributions.
samples = np.random.normal(size=(4, 4))
print(samples)
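Random samples differ on every run, which makes results hard to compare. One way around this, sketched below, is a seeded generator from np.random.default_rng (available in NumPy 1.17 and later). The seed value 42 and the variable names are arbitrary choices for illustration.

```python
import numpy as np

# A seeded generator makes the random draws reproducible between runs.
rng = np.random.default_rng(seed=42)  # seed value chosen arbitrarily

# 4x4 array drawn from the standard normal distribution (mean 0, std 1).
normal_samples = rng.normal(size=(4, 4))

# 4x4 array drawn uniformly from the half-open interval [0, 1).
uniform_samples = rng.uniform(size=(4, 4))

print(normal_samples.shape)  # (4, 4)
print(uniform_samples.min() >= 0.0, uniform_samples.max() < 1.0)  # True True
```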
1.3 Creating a DataFrame by passing a numpy array, with a datetime index and labeled columns.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Five consecutive days starting on 1 February 2015
dates = pd.date_range('20150201', periods=5)

# Random values, one row per date and one column per stock
df = pd.DataFrame(np.random.randn(5, 3), index=dates, columns=['stock A', 'stock B', 'stock C'])
print("Colombo Stock Exchange Growth - 2015 Feb")
print(45 * "=")
print(df)
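Once a DataFrame has a datetime index, rows can be picked out by date. The sketch below uses fixed, made-up values instead of random ones so that the results are predictable; the numbers carry no meaning.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20150201', periods=5)

# Fixed, made-up values (0.0 to 14.0) so the results are reproducible.
values = np.arange(15, dtype=float).reshape(5, 3)
df = pd.DataFrame(values, index=dates, columns=['stock A', 'stock B', 'stock C'])

# Select a single column (returns a Series indexed by date).
stock_a = df['stock A']

# Select one row by its date label.
third_day = df.loc[pd.Timestamp('2015-02-03')]
print(third_day['stock B'])  # 7.0

# Slice a range of dates; label slices include both end points.
window = df.loc['2015-02-02':'2015-02-04']
print(len(window))  # 3
```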
1.4 Statistical summary
We can view a quick statistical summary of the data with describe:
print(df.describe())
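The result of describe is itself a DataFrame, so individual statistics can be looked up by label. A minimal sketch with made-up numbers (the column values are assumptions chosen to keep the arithmetic easy to check):

```python
import pandas as pd

# Small made-up data set so the summary is easy to verify by hand.
df = pd.DataFrame({'stock A': [1.0, 2.0, 3.0, 4.0],
                   'stock B': [10.0, 20.0, 30.0, 40.0]})

summary = df.describe()

# The rows of the summary are the names of the statistics.
print(list(summary.index))
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

# Look up a single statistic for a single column.
print(summary.loc['mean', 'stock A'])  # 2.5

# Or pull out a few rows at once.
print(summary.loc[['min', 'max']])
```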
1.5 Sorting
To sort by the values in one or more columns, pass the column names to the 'by' option of sort_values. (Older pandas releases used df.sort_index(by=...) and df.sort(...) for this; both have since been removed.)
eg: We can sort the data in ascending order of 'stock A' as below
df.sort_values(by='stock A')
To sort by multiple columns, pass a list of names:
df.sort_values(by=['stock A', 'stock B'])
API
df.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last') [2]
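Here is a small sketch of sorting by two columns with a different direction for each, using sort_values, the current pandas method for sorting by column values. The data and the index labels 'w' to 'z' are made up for illustration.

```python
import pandas as pd

# Made-up values with ties in 'stock A' so the second sort key matters.
df = pd.DataFrame({'stock A': [2, 1, 2, 1],
                   'stock B': [0.3, 0.9, 0.1, 0.5]},
                  index=list('wxyz'))

# Ascending on 'stock A'; ties are broken by 'stock B' in descending order.
result = df.sort_values(by=['stock A', 'stock B'], ascending=[True, False])
print(list(result.index))  # ['x', 'z', 'w', 'y']

# inplace defaults to False, so the original frame keeps its order.
print(list(df.index))      # ['w', 'x', 'y', 'z']
```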
1.6 Ranking
A DataFrame can compute ranks over the rows or the columns:
df.rank(axis=0,method='min')
API
df.rank(axis=0, numeric_only=None, method='average', na_option='keep', ascending=True, pct=False) [3]
[Note] Tie-breaking methods with rank
- 'average' - Default: assign the average rank to each entry in the equal group.
- 'min' - Use the minimum rank for the whole group.
- 'max' - Use the maximum rank for the whole group.
- 'first' - Assign ranks in the order the values appear in the data.
- 'dense' - Like 'min', but the rank always increases by 1 between groups.
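The difference between the tie-breaking methods is easiest to see on a Series with one tie. The sketch below uses made-up values in which 7 appears twice; the expected ranks are shown in the comments.

```python
import pandas as pd

# Made-up values: 7 appears twice, so it forms a tied group.
s = pd.Series([7, 3, 7, 1, 9])

# Rank the same data with every tie-breaking method.
ranks = {}
for method in ['average', 'min', 'max', 'first', 'dense']:
    ranks[method] = s.rank(method=method).tolist()
    print(method, ranks[method])

# average [3.5, 2.0, 3.5, 1.0, 5.0]
# min     [3.0, 2.0, 3.0, 1.0, 5.0]
# max     [4.0, 2.0, 4.0, 1.0, 5.0]
# first   [3.0, 2.0, 4.0, 1.0, 5.0]
# dense   [3.0, 2.0, 3.0, 1.0, 4.0]
```

Note how 'min' leaves a gap after the tied group (the value 9 gets rank 5), while 'dense' does not (9 gets rank 4).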
1.7 Descriptive Statistics Methods
pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods
that extract a single value (like the sum or mean).
eg:
To get each day's total increment across stock A, stock B and stock C:
print(df.sum(axis=1))
To find the date on which each stock had its highest increment:
print(df.idxmax())
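Because the DataFrame used so far holds random values, the two reductions are easier to follow on fixed numbers. This is a minimal sketch with made-up values for two stocks over three days.

```python
import pandas as pd

dates = pd.date_range('20150201', periods=3)

# Made-up daily increments for two stocks.
df = pd.DataFrame({'stock A': [1.0, -2.0, 3.0],
                   'stock B': [0.5, 4.0, -1.0]},
                  index=dates)

# axis=1 sums across the columns: one total per date.
daily_total = df.sum(axis=1)
print(daily_total.tolist())   # [1.5, 2.0, 2.0]

# The default axis=0 sums down the rows: one total per stock.
column_total = df.sum()
print(column_total.tolist())  # [2.0, 3.5]

# idxmax returns, for each column, the index label of its largest value.
best_day = df.idxmax()
print(best_day['stock A'])    # 2015-02-03 00:00:00
print(best_day['stock B'])    # 2015-02-02 00:00:00
```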
[Note]
Descriptive and summary statistics
Method | Description
count | Number of non-NA values
describe | Compute set of summary statistics for Series or each DataFrame column
min, max | Compute minimum and maximum values
argmin, argmax | Compute index locations (integers) at which the minimum or maximum value is obtained, respectively
idxmin, idxmax | Compute index values at which the minimum or maximum value is obtained, respectively
quantile | Compute sample quantile ranging from 0 to 1
sum | Sum of values
mean | Mean of values
median | Arithmetic median (50% quantile) of values
mad | Mean absolute deviation from mean value (removed in pandas 2.0)
var | Sample variance of values
std | Sample standard deviation of values
skew | Sample skewness (3rd moment) of values
kurt | Sample kurtosis (4th moment) of values
cumsum | Cumulative sum of values
cummin, cummax | Cumulative minimum or maximum of values, respectively
cumprod | Cumulative product of values
diff | Compute 1st arithmetic difference (useful for time series)
pct_change | Compute percent changes
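Not every method in the table reduces the data to a single value. The cumulative and differencing methods return a result of the same length as the input. A short sketch on a made-up price series (the numbers are assumptions for illustration):

```python
import pandas as pd

# Made-up prices for four consecutive days.
prices = pd.Series([100.0, 110.0, 99.0, 120.0])

# Running total and running maximum keep the original length.
print(prices.cumsum().tolist())  # [100.0, 210.0, 309.0, 429.0]
print(prices.cummax().tolist())  # [100.0, 110.0, 110.0, 120.0]

# diff() subtracts the previous value; the first entry has none, so it is NaN.
changes = prices.diff()
print(changes.tolist())          # [nan, 10.0, -11.0, 21.0]

# pct_change() divides that difference by the previous value.
# Rounded to 4 decimals to hide floating-point noise.
pct = prices.pct_change().round(4)
print(pct.tolist())              # [nan, 0.1, -0.1, 0.2121]
```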
There are many more features; I will go through them in my next posts.