Interpreting and Visualizing AutoCorrelation

By Jithin J and Karthik Ravindra, Byte Academy

Analyzing time series data needs special attention. Here, we explore working with time series data and identify the effect of autocorrelation, in order to arrive at a more practical approach to working with linear regression models. Autocorrelation is a common feature when data is used to estimate some value, say equity prices. It is defined as the situation in which the error terms of the linear regression model are correlated with one another. So, if one error term is positive (or negative), and this makes the next error term more likely to be positive (or negative) as well, we say that the model suffers from autocorrelation. It is a serious problem because it violates the common assumption that the error terms are random and uncorrelated with each other. Keeping the error term free of such structure is important for the integrity of a linear regression; otherwise the model's estimates and their standard errors can be misleading.
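To make the definition concrete, here is a minimal sketch on synthetic data (not the S&P data used later in this article). It fits a straight line to a series whose errors are deliberately built to depend on the previous error, then measures the lag-1 autocorrelation of the residuals. The helper name `lag1_autocorr`, the coefficient 0.8, and all generated values are assumptions made purely for illustration.

```python
import numpy as np

def lag1_autocorr(residuals):
    """Sample lag-1 autocorrelation of a residual series."""
    e = np.asarray(residuals, dtype=float)
    e = e - e.mean()
    return float(np.sum(e[1:] * e[:-1]) / np.sum(e ** 2))

rng = np.random.default_rng(0)
n = 500
x = np.arange(n, dtype=float)

# Synthetic AR(1) errors: each error carries over 0.8 of the previous one.
shocks = rng.normal(size=n)
errors = np.zeros(n)
for t in range(1, n):
    errors[t] = 0.8 * errors[t - 1] + shocks[t]

# Linear relationship plus autocorrelated errors.
y = 2.0 + 0.5 * x + errors

# Ordinary least squares fit of y on x (polyfit returns slope first).
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

rho_correlated = lag1_autocorr(residuals)   # residuals from AR(1) errors
rho_independent = lag1_autocorr(shocks)     # independent noise for comparison

print(f"lag-1 autocorrelation, AR(1) residuals: {rho_correlated:.2f}")
print(f"lag-1 autocorrelation, independent noise: {rho_independent:.2f}")
```

A value close to zero is what independent errors would produce; a value well above zero indicates that a positive error tends to be followed by another positive error.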

Let’s take an example of some financial data during a stock market crash. The crash on day one increases the likelihood of observing a downward trend for the next few days, perhaps even weeks. If the model suffers from autocorrelation and is used for extrapolation, it will tend to project a similar stock market crash into the future as well. Therefore, we must first be able to identify the presence of this trend.

To prepare this article, we decided to pick a financial data set. After some quick research, we chose to work with the Shiller P/E ratio and use it to estimate the movement of the S&P monthly closing price. The data was taken from:

Domain Knowledge

The Shiller P/E is a valuation measure usually applied to the US S&P 500 equity market. It is defined as price divided by the average of ten years of earnings (moving average), adjusted for inflation. As such, it is principally used to assess likely future returns from equities over timescales of 10 to 20 years, with higher than average values implying lower than average long-term annual average returns.
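The definition above can be sketched as a rolling calculation. The example below is only an approximation under stated assumptions: it uses synthetic, already inflation-adjusted monthly series (the names `real_price` and `real_earnings` and all values are made up for illustration) and averages earnings over a 120-month window to stand in for the ten-year moving average.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data: 11 years, so that a few 10-year windows are complete.
months = pd.date_range("1990-01-01", periods=132, freq="MS")
real_earnings = pd.Series(np.linspace(40.0, 60.0, len(months)), index=months)
real_price = pd.Series(np.linspace(800.0, 1400.0, len(months)), index=months)

# Ten-year (120-month) moving average of inflation-adjusted earnings.
avg_earnings_10y = real_earnings.rolling(window=120).mean()

# Price divided by the averaged earnings; NaN until a full window is available.
shiller_pe = real_price / avg_earnings_10y

print(shiller_pe.dropna().round(2).tail())
```

The first 119 months have no value because a full ten-year window is not yet available, which is one reason a ratio series like this can contain missing entries.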

Web scraping

We start by scraping the Shiller P/E ratio and S&P closing prices from . If you are interested in web scraping, the Python code is here:

Once the data has been extracted, we store it in pandas DataFrames. We create a pandas DataFrame with the time series as its index column and the S&P closing price and Shiller ratio as its columns.

Once the data is stored, we need to clean and prepare it for analysis.

Data Preparation and Data Cleaning using Pandas library: Creating a Time Series



At this point the Shiller ratio data and the S&P closing prices sit in two different DataFrames. Let’s perform a lookup to bring the Shiller P/E ratio for each month into the closing price DataFrame.
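One way to perform such a lookup in pandas is a left merge on the date columns. The sketch below is not the authors' original code: the column names `SandPDate`, `shDate`, and `sh_Ratio` follow the article, while `SandPClose` and every value in the two small frames are placeholders assumed for illustration.

```python
import pandas as pd

# Placeholder S&P closing prices (values are illustrative, not real data).
sp = pd.DataFrame({
    "SandPDate": ["2000-01-01", "2000-02-01", "2000-03-01"],
    "SandPClose": [100.0, 101.5, 99.0],
})

# Placeholder Shiller ratios; February is deliberately missing.
sh = pd.DataFrame({
    "shDate": ["2000-01-01", "2000-03-01"],
    "sh_Ratio": [20.0, 21.0],
})

# A left merge keeps every closing-price row and looks up the matching ratio.
merged = sp.merge(sh, how="left", left_on="SandPDate", right_on="shDate")
print(merged)
```

Months with no matching ratio keep their closing price but get a missing value in `shDate` and `sh_Ratio`.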


We have 1769 entries and 4 columns. SandPDate and shDate are both date columns, so we can safely drop one of them. We also need to check for null values.
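A minimal sketch of this inspection step, using a small placeholder frame in place of the real 1769-row data (the `SandPClose` column name and all values are assumptions):

```python
import pandas as pd

# Small stand-in for the merged frame; values are placeholders.
merged = pd.DataFrame({
    "SandPDate": ["2000-01-01", "2000-02-01", "2000-03-01"],
    "SandPClose": [100.0, 101.5, 99.0],
    "shDate": ["2000-01-01", None, "2000-03-01"],
    "sh_Ratio": [20.0, None, 21.0],
})

# Row count, column dtypes, and non-null counts in one summary.
merged.info()

# The two date columns carry the same information, so keep only one.
prices = merged.drop(columns=["shDate"])
print(prices.columns.tolist())
```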


sh_Ratio has 120 null values. We can drop these rows from our dataset, as they account for only about 7% of the 1769 rows.
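Counting and dropping the missing ratios might look like the following sketch. The frame is again a small placeholder (the `SandPClose` name and all values are assumed), so the printed share will not match the real data.

```python
import pandas as pd

# Placeholder frame with one missing Shiller ratio.
prices = pd.DataFrame({
    "SandPDate": ["2000-01-01", "2000-02-01", "2000-03-01", "2000-04-01"],
    "SandPClose": [100.0, 101.5, 99.0, 102.0],
    "sh_Ratio": [20.0, None, 21.0, 21.5],
})

# Number of nulls per column, and the share of rows missing a ratio.
null_counts = prices.isnull().sum()
null_share = prices["sh_Ratio"].isnull().mean()
print(null_counts)
print(f"share of rows without sh_Ratio: {null_share:.1%}")

# Drop only the rows where the ratio is missing.
cleaned = prices.dropna(subset=["sh_Ratio"]).reset_index(drop=True)
```

Checking the share before dropping makes it easy to judge whether removing the rows is acceptable for the analysis.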


Now we create the time series. To do so, the S&P date column must be parsed correctly so that each column is assigned the right data type.
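A sketch of that conversion, assuming the dates arrive as ISO-formatted strings (the real scraped format may differ and could need a `format=` argument to `pd.to_datetime`); `SandPClose` and the values are placeholders:

```python
import pandas as pd

# Placeholder frame with dates stored as strings, out of order on purpose.
cleaned = pd.DataFrame({
    "SandPDate": ["2000-03-01", "2000-01-01", "2000-02-01"],
    "SandPClose": [99.0, 100.0, 101.5],
    "sh_Ratio": [21.0, 20.0, 20.5],
})

# Parse the strings into datetime64 values.
cleaned["SandPDate"] = pd.to_datetime(cleaned["SandPDate"])

# Use the dates as the index and sort chronologically.
ts = cleaned.set_index("SandPDate").sort_index()

print(ts.dtypes)
print(ts)
```

With a `DatetimeIndex` in place, pandas time series tools such as date-based slicing and resampling become available.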


Our DataFrame is now in a time series format and ready for further analysis.

Stay tuned for the next post in this series, in which we will discuss Time Series Analysis.

Any trading symbols displayed are for illustrative purposes only and are not intended to portray recommendations. Originally posted on the IBKR Quant Blog here.
