In today’s post, I’ll explore how you can use Azure Machine Learning to perform regression analysis on a dataset.
Regression analysis is a data mining process that finds a model describing the relationship between an outcome and one or more predictors. To explore this idea, the first thing we need is data!
One compelling source of data is the NYC Open Data portal. The portal features datasets covering a wide range of topics and categories related to New York City. Information about schools, restaurant ratings, potholes and many more topics is available via the portal.
While browsing the data I came across the 311 Service Requests dataset. This dataset contains all the 311 service requests from 2010 to the present. If you are not familiar with 311, think of it as a customer service line for all non-emergency requests to New York City’s government. So I decided to upload the dataset to Azure ML Studio and use the visualization tools to find the most frequent complaint among vocal New Yorkers.
Note: To upload a dataset in CSV format you will need to create an ML workspace via the Azure Portal, sign in to ML Studio and create a new Experiment. More information here.
For more information about how to visualize your data, see here.
To my surprise, I learned that the most frequent complaints in 2013 were heating related.
I thought it’d be interesting to create a model to predict the number of heating complaints. I think you would agree that there must be a correlation between the temperature and the number of these complaints. So next, I needed to find the average temperature on the day each complaint was created. And fortunately, there’s a dataset for that: the NOAA National Climatic Data Center.
After creating a database on Azure, importing the data and a bit of T-SQL aerobics (joining, data cleaning, etc.), I ended up with a dataset containing two columns: the number of heating complaints and the average temperature when they occurred. I added a Reader module (to read data from my Azure SQL table) to my experiment and removed the original dataset. The figure below shows the visualization of the output of the Reader module.
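For readers who would like to see the join spelled out, here is a rough Python sketch of the same idea. I did this step in T-SQL, so this is not the query I ran. The sample rows, the temperature values, the helper name build_dataset and the assumption that everything is aggregated per day are all made up for illustration. Only the column names AvgDailyTemp and HeatingComplaints come from my dataset.

```python
from collections import Counter
from datetime import date

# Hypothetical sample rows: (created_date, complaint_type). Not real 311 data.
requests = [
    (date(2013, 1, 1), "HEATING"),
    (date(2013, 1, 1), "HEATING"),
    (date(2013, 1, 1), "NOISE"),
    (date(2013, 1, 2), "HEATING"),
    (date(2013, 7, 1), "NOISE"),
]

# Hypothetical daily average temperatures, keyed by date.
avg_daily_temp = {
    date(2013, 1, 1): 28.0,
    date(2013, 1, 2): 31.5,
    date(2013, 7, 1): 80.0,
}


def build_dataset(requests, avg_daily_temp, complaint_type="HEATING"):
    """Count complaints of one type per day and join them to the daily temperature."""
    counts = Counter(day for day, kind in requests if kind == complaint_type)
    # Inner join: keep only days that have both a complaint count and a temperature.
    return [
        {"AvgDailyTemp": avg_daily_temp[day], "HeatingComplaints": n}
        for day, n in sorted(counts.items())
        if day in avg_daily_temp
    ]


dataset = build_dataset(requests, avg_daily_temp)
for row in dataset:
    print(row)
```

Note that a day with no heating complaints simply drops out of this sketch. Whether such days should appear with a count of zero is a modeling decision worth making deliberately.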
Let’s start with the simplest regression: linear regression. In short, a linear model (identified through the process of regression) assumes that there’s a straight-line relationship between the predictor (AvgDailyTemp) and the outcome (HeatingComplaints). So let’s train a linear model using Azure ML and our dataset.
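To make the idea concrete outside of Azure ML, here is a minimal sketch of ordinary least squares for a single predictor in plain Python. The numbers are invented so that they fall exactly on a line, and fit_line is a hypothetical helper name. It illustrates the math only and says nothing about how the Azure ML module is implemented.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up values standing in for AvgDailyTemp and HeatingComplaints.
temps = [20.0, 30.0, 40.0, 50.0, 60.0]
complaints = [900.0, 700.0, 500.0, 300.0, 100.0]

intercept, slope = fit_line(temps, complaints)
print(f"HeatingComplaints ~ {intercept:.1f} + ({slope:.1f}) * AvgDailyTemp")
```

A negative slope here means that the model predicts fewer complaints as the temperature rises, which is the relationship we expect.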
The modeling process in Azure ML, at a high level, consists of three steps: finding the parameters of your model from your dataset (Train Model), checking the predictions of your model against a dataset (Score Model), and then obtaining key quality metrics for the results or comparing them against another model (Evaluate Model).
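The same three steps can be sketched in a few lines of Python. This is only an analogy for the workflow, with invented data, hypothetical function names (train, score, evaluate) and a small set of common regression metrics that I picked for the example.

```python
import math


def train(rows):
    """Train step: estimate (intercept, slope) by ordinary least squares."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def score(model, xs):
    """Score step: apply the trained model to predictor values."""
    intercept, slope = model
    return [intercept + slope * x for x in xs]


def evaluate(actual, predicted):
    """Evaluate step: summarize the prediction errors."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / len(actual)
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "mae": sum(abs(e) for e in errors) / len(errors),
        "rmse": math.sqrt(ss_res / len(errors)),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Made-up (AvgDailyTemp, HeatingComplaints) rows, split by hand.
train_rows = [(20, 880), (30, 720), (40, 490), (50, 310), (60, 100)]
test_rows = [(25, 800), (45, 400), (55, 210)]

model = train(train_rows)
predictions = score(model, [x for x, _ in test_rows])
metrics = evaluate([y for _, y in test_rows], predictions)
print(metrics)
```

Scoring against rows that were not used for training, as in this sketch, gives a more honest picture of how the model generalizes.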
To do this, you need to drop the Train Model, Score Model, Evaluate Model and Linear Regression modules into the experiment and connect them as depicted below.
Note: You can fine-tune the linear regression process by changing the Solution Method (Ordinary Least Squares or Online Gradient Descent), whether to include an intercept, and other parameters. For more information see here.
Next you need to configure the Train Model node to select the HeatingComplaints column. This is how you tell Azure ML which column is the outcome of your model.
So let’s see how the model did. The following image shows Azure ML’s visualization of the predicted values next to the scatter plot of the actual values.
Poisson Regression and Log Scale
From the scatter plot of the actual values, you can tell that the plot resembles an exponential function. However, switching to an exponential model would mean losing the simplicity of fitting a linear model. Fortunately, we can have both. We do this by transforming the outcome so that, instead of the number of heating complaints, the model predicts the logarithm of the number of heating complaints. Let’s plot that scenario.
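Here is a small sketch of why the log transform helps. The data is synthetic, generated from an exact exponential curve whose coefficients I chose for the example, and fit_line is a hypothetical helper. Fitting a straight line to the logarithm of the outcome recovers those coefficients, and applying exp to the prediction brings it back to the original scale.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic data: complaints = exp(8 - 0.05 * temp). These are not real counts.
temps = [20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
complaints = [math.exp(8.0 - 0.05 * t) for t in temps]

# Fit the line on the log scale, where the trend is linear.
log_complaints = [math.log(c) for c in complaints]
intercept, slope = fit_line(temps, log_complaints)


def predict(temp):
    """Undo the log transform to predict on the original scale."""
    return math.exp(intercept + slope * temp)


print(intercept, slope, predict(45.0))
```

One caveat of this approach is that log is undefined for a count of zero, so days without any complaints need special handling before the transform.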
As you can see from the plot, once we apply the transformation, the trend is linear. So we can either go back to our data and apply the log function to the HeatingComplaints column, or we can use Poisson regression. Poisson regression models the natural log of the expected value of the outcome variable as a linear function of the predictors. Your experiment should look similar to the picture below.
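For the curious, this is roughly what a Poisson regression with a single predictor looks like when written by hand. It is a minimal sketch that maximizes the Poisson log-likelihood with Newton's method. The counts are invented, fit_poisson is a hypothetical name, and I am not claiming that Azure ML uses this particular optimizer.

```python
import math


def fit_poisson(xs, ys, iterations=50, tol=1e-10):
    """Fit log(E[y]) = b0 + b1 * x by Newton's method on the Poisson log-likelihood."""
    b0 = math.log(sum(ys) / len(ys))  # start from the constant-rate model
    b1 = 0.0
    for _ in range(iterations):
        mus = [math.exp(b0 + b1 * x) for x in xs]
        # Gradient of the log-likelihood with respect to (b0, b1).
        g0 = sum(y - mu for y, mu in zip(ys, mus))
        g1 = sum((y - mu) * x for x, y, mu in zip(xs, ys, mus))
        # Entries of the negated Hessian, a symmetric 2x2 matrix.
        h00 = sum(mus)
        h01 = sum(mu * x for x, mu in zip(xs, mus))
        h11 = sum(mu * x * x for x, mu in zip(xs, mus))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * step = gradient.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) < tol and abs(step1) < tol:
            break
    return b0, b1


# Made-up counts that roughly follow exp(7 - 0.05 * temp).
temps = [20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
complaints = [403, 245, 148, 90, 55, 33]

b0, b1 = fit_poisson(temps, complaints)


def predict(temp):
    """Expected number of complaints at a given temperature."""
    return math.exp(b0 + b1 * temp)


print(b0, b1, predict(45.0))
```

Unlike the log-transform approach, this works directly on the counts, so a day with zero complaints is not a problem.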
Now if we plot the scored model against the actual data, we get the following.
The new graph seems to be a better reflection of the actual shape and scale of values.
In this blog post I showed you how to visualize data and create linear and Poisson regressions using Azure ML.
Next, I’ll show you how you can use decision forests to get a model that fits the data better.