Wednesday, March 7, 2012

It's about time Python got serious with statistics! Introducing scikits.statsmodels (Linux Ubuntu), Part 1

You can of course perform powerful statistical routines from Python using the RPy module and R. But greater joy is now coming to pure Python users with the scikits.statsmodels package.


  • Install the scikits.statsmodels module
  • Be sure your versions of Python, NumPy, and SciPy meet the following minimums:
    Python >= 2.5
    NumPy >= 1.4.0 
    SciPy >= 0.7
    
    You can install or update them by typing: sudo apt-get install python python-setuptools python-numpy python-scipy
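Once NumPy and SciPy are installed, you can confirm their versions from Python itself; a quick sketch:

```python
# Print the installed NumPy and SciPy versions so you can check them
# against the minimums above (NumPy >= 1.4.0, SciPy >= 0.7).
import numpy
import scipy

print(numpy.__version__)
print(scipy.__version__)
```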
  • Download the software scikits.statsmodels module
  • In a terminal, type wget -ct 0 http://pypi.python.org/packages/source/s/scikits.statsmodels/scikits.statsmodels-0.3.1.tar.gz#md5=1f55b53d161544b95ca2709c9731c00c to download the archive. Untar it, then descend into the directory and install:
    tar -xvvf scikits.statsmodels-0.3.1.tar.gz
    cd scikits.statsmodels-0.3.1
    sudo python setup.py install
    
  • Test the installed module
  • Let's try this from ipython (again, do a sudo apt-get install ipython if it is not yet installed):
    import scikits.statsmodels.api as sm
    help(sm)
    
    The following will be printed:
    Help on module scikits.statsmodels.api in scikits.statsmodels:

    NAME
        scikits.statsmodels.api - Statistical models

    FILE
        /home/toto/testsoftware/scikits.statsmodels-0.3.1/scikits/statsmodels/api.py

    DESCRIPTION
        - standard `regression` models
          - `GLS` (generalized least squares regression)
          - `OLS` (ordinary least square regression)
          - `WLS` (weighted least square regression)
          - `GLSAR` (GLS with autoregressive errors model)
        - `GLM` (generalized linear models)
        - robust statistical models
          - `RLM` (robust linear models using M estimators)
          - `robust.norms` estimates
          - `robust.scale` estimates (MAD, Huber's proposal 2)
        - sandbox models
          - `mixed` effects models
          - `gam` (generalized additive model)
    What is available by importing sm? Type dir(sm) and the following functions or attributes will be listed:
    In [17]: dir(sm)
    Out[17]: 
    ['GLM',
     'GLS',
     'GLSAR',
     'Logit',
     'MNLogit',
     'OLS',
     'Poisson',
     'Probit',
     'RLM',
     'WLS',
     '__builtins__',
     '__doc__',
     '__file__',
     '__name__',
     '__package__',
     'add_constant',
     'categorical',
     'datasets',
     'families',
     'iolib',
     'nonparametric',
     'regression',
     'robust',
     'test',
     'tools',
     'tsa',
     'version']
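    Among the helpers listed, add_constant builds the design matrix for models with an intercept; in this version of scikits.statsmodels it appends the column of ones last, which is why `const` appears as the final coefficient in the regression example below. Here is a plain-NumPy sketch of the same idea (`add_constant_sketch` is a hypothetical name for illustration, not part of the package):

```python
import numpy as np

def add_constant_sketch(X):
    """Append a column of ones to X, mimicking what sm.add_constant
    does in scikits.statsmodels 0.3.1 (the constant goes last)."""
    X = np.asarray(X)
    ones = np.ones((X.shape[0], 1))
    return np.hstack([X, ones])

x = np.arange(5.0)
X = add_constant_sketch(np.column_stack((x, x**2)))
print(X.shape)  # (5, 3): columns are x, x**2, constant
```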
    
  • Additional resources
  • Home page of pystatsmodels: http://statsmodels.sourceforge.net/
  • Subscribe to the Google group: http://groups.google.com/group/pystatsmodels
  • The regression example from the homepage
  • It is time to show what statsmodels can do. Here is a regression example from the home page: http://statsmodels.sourceforge.net/
    import numpy as np
    import scikits.statsmodels.api as sm
    
    
    # get data
    nsample = 100
    x = np.linspace(0, 10, nsample)
    X = sm.add_constant(np.column_stack((x, x**2)))
    beta = np.array([1, 0.1, 10])
    y = np.dot(X, beta) + np.random.normal(size=nsample)
    
    
    # run the regression
    results = sm.OLS(y, X).fit()
    
    # look at the results
    print results.summary()
    
    Here are the results from running the published example above:
         Summary of Regression Results     
    =======================================
    | Dependent Variable:            ['y']|
    | Model:                           OLS|
    | Method:                Least Squares|
    | Date:               Wed, 07 Mar 2012|
    | Time:                       17:58:02|
    | # obs:                         100.0|
    | Df residuals:                   97.0|
    | Df model:                        2.0|
    ==============================================================================
    |                   coefficient     std. error    t-statistic          prob. |
    ------------------------------------------------------------------------------
    | x1                      1.060         0.1353         7.8344         0.0000 |
    | x2                    0.09398        0.01309         7.1802         0.0000 |
    | const                   9.848         0.2927        33.6490         0.0000 |
    ==============================================================================
    |                          Models stats                      Residual stats  |
    ------------------------------------------------------------------------------
    | R-squared:                     0.9729   Durbin-Watson:              1.803  |
    | Adjusted R-squared:            0.9724   Omnibus:                    3.108  |
    | F-statistic:                    1742.   Prob(Omnibus):             0.2114  |
    | Prob (F-statistic):         9.776e-77   JB:                         2.791  |
    | Log likelihood:                -139.9   Prob(JB):                  0.2477  |
    | AIC criterion:                  285.8   Skew:                     -0.4176  |
    | BIC criterion:                  293.6   Kurtosis:                   2.927  |
    ------------------------------------------------------------------------------
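    Under the hood, OLS finds the coefficients that minimize the sum of squared residuals. The point estimates in the summary above can be reproduced with plain NumPy's least-squares solver; this is a sketch, and since the example draws random noise (here with an arbitrary fixed seed for reproducibility), the numbers match the table only approximately:

```python
import numpy as np

# Rebuild the design matrix from the example: x, x**2, constant last
nsample = 100
x = np.linspace(0, 10, nsample)
X = np.column_stack((x, x**2, np.ones(nsample)))
beta = np.array([1.0, 0.1, 10.0])

np.random.seed(42)  # arbitrary seed, chosen only for reproducibility
y = X.dot(beta) + np.random.normal(size=nsample)

# OLS point estimates: solve min_b ||y - X b||^2
beta_hat, res, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # roughly [1, 0.1, 10], up to sampling noise
```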
    
    
    So far so good. We hope that the statsmodels module becomes more robust, easier to use, and inspiring enough for Python users to contribute code!
