Analytical Problem Solving- Types of Reasoning

To solve any problem, we need a way to break it down. There are two main styles of reasoning for doing so-

  • Deductive Reasoning– Also called the “Top Down” or “Formal Logic” approach. You start from hypotheses that could explain a phenomenon and then test each one against the evidence, accepting or rejecting it. Because every conclusion is tied to an explicit test, the conclusions and recommendations from this kind of reasoning tend to be more certain.
    • For example, let’s say you are trying to explain why a certain car gets fewer miles per gallon. Because you know the business and have context on the problem, you can start with potential hypotheses-
      • Weight of the car is high
      • Car has a higher number of cylinders
      • Car has higher horsepower
      • and so on…

You then check each of the above hypotheses and reach a definite conclusion.

  • Inductive Reasoning– This, on the other hand, is a “Bottom Up” or “Informal Logic” approach. It is more exploratory in nature: the end goal is to form hypotheses that offer possible explanations for a phenomenon.
    • For example, let’s say you are trying to explain why the sales of an eCommerce company have gone down in a particular quarter. You may begin with an exploratory analysis of potential drivers such as-
      • Marketing spend of the company
      • Pricing
      • Competitive landscape
      • Macroeconomic factors

You then analyze the data to see how each of these factors correlates with sales, and use what you find to build potential hypotheses to be tested further.
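As a rough illustration of this exploratory step, here is a minimal sketch that ranks candidate drivers by how strongly they correlate with sales. The quarterly figures and the driver names are made up for illustration, and the Pearson correlation is written out by hand so that only the standard library is needed.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up quarterly figures, for illustration only.
sales = [120, 135, 150, 160, 140, 110]
drivers = {
    "marketing_spend": [30, 34, 40, 43, 36, 25],
    "average_price": [20, 20, 19, 19, 21, 23],
    "competitor_count": [5, 5, 6, 6, 6, 7],
}

# Rank the candidate drivers by strength of linear association with sales.
correlations = {name: pearson(values, sales) for name, values in drivers.items()}
for name, r in sorted(correlations.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name}: r = {r:+.2f}")
```

A strong correlation here is only a starting point. Each driver that stands out becomes a hypothesis to be tested further, since correlation alone does not establish a cause.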

Cheers!

 

Lasso, Ridge and Elastic Net Regularization

Regularization techniques are commonly applied when fitting Generalized Linear Models (GLMs). GLMs themselves do not assume a normally distributed dependent variable (DV); the DV can follow distributions such as normal, binomial or Poisson, hence the name. On top of that, a regularization technique helps in the following main ways-

  1. Addresses the variance-bias trade-off. It generally lowers the variance of the model at the cost of a small increase in bias
  2. Is more robust to multicollinearity
  3. Handles high-dimensional data (observations < features) better
  4. Performs natural feature selection (L1 penalties in particular)
  5. Gives more accurate predictions on new data, as it reduces overfitting on the training data
  6. Often gives output that is easier to interpret, since fewer features carry non-zero coefficients

And so on…

Overfitting

CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=263773

What is a regularization technique, you may ask? In simple terms, it is a penalty mechanism that shrinks the model coefficients (drives them closer to zero) to build a more robust and parsimonious model. Although there are many ways to regularize a model, a few of the common ones are-

  1. L1 Regularization aka Lasso Regularization– This adds a penalty term to the model that is a function of the absolute values of the coefficients. Coefficients can be driven exactly to zero during the regularization process, so this technique can be used for feature selection and for generating a more parsimonious model
  2. L2 Regularization aka Ridge Regularization– This adds a penalty term that is a function of the squares of the coefficients. Coefficients are shrunk towards zero but generally do not become exactly zero
  3. Combination of the above two, such as Elastic Net– This adds a penalty term that combines both L1 and L2 regularization
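Written out, the three penalties look roughly as follows. The symbols are assumptions introduced here for illustration: \mathcal{L}(\beta) is the unpenalized loss (for example the residual sum of squares or a negative log-likelihood), \beta_j are the p coefficients, and \lambda, \lambda_1, \lambda_2 \ge 0 control the penalty strength. Libraries differ in how they scale these terms.

```latex
\begin{aligned}
\text{Lasso (L1):} \quad & \min_{\beta}\; \mathcal{L}(\beta) + \lambda \sum_{j=1}^{p} |\beta_j| \\
\text{Ridge (L2):} \quad & \min_{\beta}\; \mathcal{L}(\beta) + \lambda \sum_{j=1}^{p} \beta_j^2 \\
\text{Elastic Net:} \quad & \min_{\beta}\; \mathcal{L}(\beta) + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2
\end{aligned}
```

Setting \lambda_2 = 0 in the Elastic Net recovers the Lasso, and setting \lambda_1 = 0 recovers Ridge.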

For more on the regularization techniques you can visit this paper.

Scikit help on Lasso Regression

Here is a working code example on the Boston Housing data. Note that it is generally advisable to scale the variables before fitting a regularized GLM, because the penalty depends on the size of the coefficients. In the example below, however, we keep the variables on their original scale to demonstrate how each algorithm works.
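Independently of the Boston Housing example, the difference between the three penalties can be seen on a single coefficient. The sketch below is a simplified illustration, not a full model fit: it assumes one standardized predictor (or an orthonormal design), where each penalized estimate has a closed form in terms of the unpenalized estimate b. The function names and the sample coefficients are hypothetical.

```python
def soft_threshold(b, lam):
    """Lasso solution of 0.5*(b - beta)**2 + lam*abs(beta)."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0  # small coefficients are set exactly to zero

def ridge_shrink(b, lam):
    """Ridge solution of 0.5*(b - beta)**2 + 0.5*lam*beta**2."""
    return b / (1.0 + lam)

def elastic_net_shrink(b, lam1, lam2):
    """Elastic net solution: soft-threshold (L1 part), then rescale (L2 part)."""
    return soft_threshold(b, lam1) / (1.0 + lam2)

ols_coefs = [3.0, -1.5, 0.4, -0.2]  # hypothetical unpenalized estimates
lam = 0.5
lasso = [soft_threshold(b, lam) for b in ols_coefs]
ridge = [ridge_shrink(b, lam) for b in ols_coefs]
enet = [elastic_net_shrink(b, lam, lam) for b in ols_coefs]

print("lasso:", lasso)
print("ridge:", ridge)
print("elastic net:", enet)
```

In this setting the lasso zeroes out the two small coefficients, while ridge shrinks every coefficient by the same factor and leaves none of them exactly at zero.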


Cheers!

Model Persistence Using Python Pickle

After you have built a machine learning model that does a great job at prediction, you don’t have to retrain it again and again for future use. Instead, you can use Python pickle serialization to save the model, reuse it later and transfer it into a production environment where non-modelers can also use it to make predictions.

 


By Renee Comet (photographer) [Public domain], via Wikimedia Commons

First let’s look at how Wikipedia defines a pickle

Pickling is the process of preserving or expanding the lifespan of food by either anaerobic fermentation in brine or immersion in vinegar. The resulting food is called a pickle.

Python pickling is the same idea without the brine or vinegar: you pickle your model for later use so that you do not need to recook (retrain) it. In the “Pickling” process a Python object is converted into a byte stream. In the “Unpickling” process, the byte stream is converted back into a Python object.

I strongly recommend that you read Python Official Documentation on this topic before moving forward.

Now let’s see this live in action. We will first look at a simple example and then look at a model example.

Example 1- Here we will pickle and un-pickle a simple Python list
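A minimal sketch of this round trip, using only the standard library pickle module. The list contents are arbitrary, and the bytes are kept in memory with dumps and loads rather than written to a file.

```python
import pickle

original = ["lasso", "ridge", "elastic net", 42, 3.14]

# Pickling: Python object -> byte stream.
payload = pickle.dumps(original)

# Unpickling: byte stream -> Python object.
restored = pickle.loads(payload)

print(type(payload).__name__)
print(restored == original, restored is original)
```

The restored list is equal to the original but is a new object, which is what you would expect from a serialize and deserialize cycle.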


Example 2- Here we will pickle and un-pickle a Decision Tree classifier and use it later for making predictions on new data
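The sketch below shows the save, load and predict workflow. To keep it runnable with the standard library alone, a hypothetical one-split decision stump stored as a plain dict stands in for the fitted classifier; the feature values, threshold and labels are made up. With a fitted scikit-learn estimator, the same pickle.dump and pickle.load calls would be applied to the estimator object instead.

```python
import os
import pickle
import tempfile

# Hypothetical "trained model": a one-split decision stump stored as plain data.
model = {
    "feature_index": 0,
    "threshold": 2.5,
    "left_label": "setosa",
    "right_label": "other",
}

def predict(stump, rows):
    """Label each row by comparing one feature against the stump's threshold."""
    return [
        stump["left_label"]
        if row[stump["feature_index"]] <= stump["threshold"]
        else stump["right_label"]
        for row in rows
    ]

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "model.pkl")
    # Persist the model to disk as a byte stream.
    with open(path, "wb") as fh:
        pickle.dump(model, fh)
    # Later (or in another process): load it back without retraining.
    with open(path, "rb") as fh:
        loaded_model = pickle.load(fh)

new_data = [[1.4, 0.2], [4.7, 1.4]]
predictions = predict(loaded_model, new_data)
print(predictions)
```

One caveat from the official documentation applies to any pickle file: only unpickle data from sources you trust, because loading a pickle can execute arbitrary code.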


For more details, do check out this excellent presentation.

Cheers!

 

Recurrent Neural Network (RNN) in Python

Recurrent Neural Networks (RNNs) are a type of neural network used for sequential data, where the inputs are not independent of each other and need not be of fixed length, as is assumed in feed-forward networks such as the MLP. Unlike a feed-forward network, an RNN has recurrent connections that carry a hidden state from one time step to the next. In other words, what happens at time ‘t’ may depend on what happened at time ‘t-1’, ‘t-2’ and so on.

These are also called ‘memory’ networks as previous inputs and states persist in the model for doing a more optimal sequential analysis. They can have both short term and long term time dependence. Due to their capabilities of handling sequential data very well, these networks are typically very suitable for speech recognition, sentiment analysis, forecasting, language translation and other such applications.

Let’s now spend some time looking at how an RNN works-


Recurrent Neural Network (RNN)

As you may recall, in a typical feed-forward neural network the input is fed in at the beginning, the hidden layers do the processing and the output layer finally produces the output. In an RNN, on the other hand, we will generally have a different input, output and cost for each time step. However, the same weight matrices are shared across all time steps.
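To make the weight sharing concrete, here is a minimal sketch of the forward pass of a vanilla RNN. It is deliberately reduced to a single scalar hidden state so that no matrix library is needed; the parameter names w_x, w_h and b and the toy input sequence are assumptions for illustration.

```python
import math

def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a scalar vanilla RNN; the same w_x, w_h and b are reused at every step."""
    h = h0
    states = []
    for x_t in inputs:
        # The new state depends on the current input and on the previous state.
        h = math.tanh(w_x * x_t + w_h * h + b)
        states.append(h)
    return states

# A single pulse followed by zeros: later states still carry a trace of it.
sequence = [1.0, 0.0, 0.0, 0.0]
states = rnn_forward(sequence, w_x=1.0, w_h=0.8, b=0.0)
for t, h_t in enumerate(states, start=1):
    print(f"t={t}: h={h_t:.4f}")
```

Even though the input is zero after the first step, the hidden state stays positive and fades gradually, which is the "memory" carried along the time dimension.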

One point to note is that RNNs are also trained using backward propagation of errors and gradient descent to minimize the cost function. However, backward propagation in an RNN happens across time steps, and hence it is called Backward Propagation Through Time (BPTT). A typical RNN may be unrolled over hundreds or even thousands of time steps, so the gradient is multiplied by the recurrent weights again and again. Therein lies the problem of vanishing or exploding gradients, to which plain vanilla RNNs are particularly susceptible.
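A small numeric sketch of why this happens. It assumes a scalar RNN with a linear activation, so that sending a gradient back one time step simply multiplies it by the recurrent weight w_h; real networks also multiply by the activation's derivative, which this simplification leaves out.

```python
def gradient_scale(w_h, steps):
    """Factor multiplying a gradient sent back `steps` time steps in a linear scalar RNN."""
    scale = 1.0
    for _ in range(steps):
        scale *= w_h  # one multiplication per time step travelled backwards
    return scale

vanishing = gradient_scale(0.9, 100)
exploding = gradient_scale(1.1, 100)
print(f"w_h = 0.9 over 100 steps: {vanishing:.3e}")
print(f"w_h = 1.1 over 100 steps: {exploding:.3e}")
```

A recurrent weight only slightly below 1 makes the signal from distant time steps all but disappear, while a weight only slightly above 1 makes it blow up.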

There are various remedies: techniques such as gradient clipping, which guards against exploding gradients, and architectures such as Long Short Term Memory (LSTM) or Gated Recurrent Unit (GRU), which help with vanishing gradients. We will delve deeper into how an LSTM works.
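Gradient clipping itself is simple. The sketch below shows one common variant, clipping by the L2 norm; the function name and the threshold value are chosen for illustration, and deep learning libraries ship their own implementations.

```python
import math

def clip_by_norm(gradients, max_norm):
    """Rescale the gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)  # small gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in gradients]

clipped = clip_by_norm([30.0, -40.0], max_norm=5.0)
untouched = clip_by_norm([0.3, 0.4], max_norm=5.0)
print(clipped, untouched)
```

Clipping changes the length of the gradient but not its direction, so the update still points the same way, only with a bounded step.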

An LSTM network consists of hidden layers that have many LSTM blocks or units. Each LSTM unit, in turn, has the following components-

  • Memory Cell- The component that remembers values over a period of time. Its contents are passed through a tanh activation when they are read out
  • Input gate- Controls how much new information is added to the memory cell. The gate itself generally has a sigmoid activation (values between 0 and 1); the candidate values it admits are squashed between -1 and +1 by a tanh activation
  • Forget gate- Controls what is removed from or retained in the memory cell. It generally has a sigmoid activation, so its output ranges between 0 and 1. If the gate is fully on (1), all memories are retained. If it is fully off (0), all values are removed
  • Output gate- Controls how much of the memory cell, passed through the tanh activation, is exposed as the unit’s output. It generally has a sigmoid activation

Long Short Term Memory Cell or Block (Source- Wiki)
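The components above can be sketched as a single time step of one LSTM unit. This is a scalar simplification of the standard LSTM equations, with hypothetical parameter values picked so that the forget gate is fully open and the input gate fully shut; a real layer uses weight matrices and vectors in place of these scalars.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM unit; p maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, bias = p[name]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre("forget"))        # 0 = erase the cell, 1 = keep it
    i = sigmoid(pre("input"))         # how much new information to admit
    g = math.tanh(pre("candidate"))   # candidate values in (-1, 1)
    o = sigmoid(pre("output"))        # how much of the cell to expose
    c = f * c_prev + i * g            # updated memory cell
    h = o * math.tanh(c)              # unit output
    return h, c

# Hypothetical parameters: forget gate saturated open, input gate saturated shut.
params = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 20.0),
}
h, c = lstm_step(x=5.0, h_prev=0.0, c_prev=0.7, p=params)
print(f"cell = {c:.6f}, output = {h:.6f}")
```

With these settings the cell keeps its previous value of about 0.7 and ignores the large input, which is how an LSTM can hold on to information over many time steps.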

Let’s work through an example which we used in a previous article.


Here is an excellent article in case you want to explore more.

Cheers!

 

Ensemble Modeling using Python

Ensemble models are a great tool for managing the variance-bias trade-off that a typical machine learning model faces, i.e. when you try to lower bias, variance tends to go up and vice-versa. Either way, the total error of a single model suffers.

Total Error in Model = Bias² + Variance + Irreducible Error (Random Noise)


Variance and Bias Trade-off

Ensemble models typically combine several weak learners to build a stronger model, which can reduce variance, bias or both. Since ensemble models follow a community learning or divide and conquer approach, the output of a majority-vote ensemble will be wrong only when the majority of the underlying learners are wrong.
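The majority argument can be checked with a short calculation. The sketch below assumes that the learners make their errors independently and that each is right with the same probability, which real learners trained on the same data rarely satisfy, so treat the numbers as an upper bound on the benefit.

```python
from math import comb

def majority_vote_accuracy(n_learners, p_correct):
    """P(majority of n independent learners is right), each right with prob p_correct."""
    needed = n_learners // 2 + 1
    return sum(
        comb(n_learners, k) * p_correct**k * (1 - p_correct) ** (n_learners - k)
        for k in range(needed, n_learners + 1)
    )

single = 0.6  # accuracy of one weak learner
ensemble_11 = majority_vote_accuracy(11, single)
ensemble_101 = majority_vote_accuracy(101, single)
print(f"1 learner: {single:.3f}")
print(f"11 learners: {ensemble_11:.3f}")
print(f"101 learners: {ensemble_101:.3f}")
```

An odd number of learners is used so that a vote can never tie. Under these assumptions, accuracy climbs steadily as more weak learners are added.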

One of the biggest drawbacks of ensemble models is that they may become a “Black Box” and are less explainable than a simple machine learning model. However, the gains in model performance often outweigh the loss in transparency. That is why the top performing models in many high-ranking competitions are generally ensemble models.

Ensemble models can be broken down into the following three main categories-

  1. Bagging
  2. Boosting
  3. Stacking

Let’s look at each one of them-

Bagging-

  • One good example of such model is Random Forest
  • These types of ensemble models work on reducing the variance by removing instability in the underlying complex models
  • Each learner is trained on its own random sample of the data and does the classification or regression independently and in parallel. The outputs of all the learners are then combined by voting or averaging to create the final output
  • Since these ensemble models predominantly focus on reducing the variance, the underlying models are fairly complex (such as a Decision Tree or Neural Network) so that they start with low bias
  • An underlying decision tree will have higher depth and many branches. In other words, the tree will be deep and dense, with lower bias
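The bagging recipe can be sketched in a few lines. This is a toy illustration rather than a Random Forest: the base learner is a 1-nearest-neighbour classifier on a single feature (a flexible, low-bias model), each learner sees its own bootstrap sample, and the final answer is a majority vote. The data and function names are made up, and the learners are fitted in a simple loop although they could run in parallel.

```python
import random
from collections import Counter

def nearest_neighbor_label(sample, x):
    """1-nearest-neighbour prediction: label of the closest training point."""
    return min(sample, key=lambda point: abs(point[0] - x))[1]

def bagged_predict(data, x, n_learners=25, seed=0):
    """Majority vote of 1-NN learners, each fitted on its own bootstrap sample."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_learners):
        # Bootstrap: draw len(data) points with replacement.
        bootstrap = [rng.choice(data) for _ in data]
        votes.append(nearest_neighbor_label(bootstrap, x))
    return Counter(votes).most_common(1)[0][0]

# Toy 1-D data: (feature, label); the label is 1 when the feature is 5 or more.
data = [(float(x), int(x >= 5)) for x in range(10)]
low = bagged_predict(data, 1.0)
high = bagged_predict(data, 8.0)
print(low, high)
```

For a regression problem the vote would be replaced by an average of the learners' outputs.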

Boosting-

  • Some good examples of these types of models are Gradient Boosting Tree, AdaBoost and XGBoost, among others
  • These ensemble models work with weak learners and try to improve bias, and often variance as well, by training the learners sequentially
  • These are also called adaptive learners, as the learning of one learner depends on how the previous learners performed. For example, if a certain set of samples has a higher mis-classification rate, the weight of those samples in the overall learning is increased so that the subsequent learners focus more on correctly classifying the tougher samples
  • An underlying decision tree will be shallow, a weak learner with higher bias
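The reweighting idea can be illustrated with one round of the classic (discrete) AdaBoost update for labels in {-1, +1}. The labels and predictions below are made up, and the sketch assumes the weak learner's weighted error is strictly between 0 and 1; other boosting methods such as gradient boosting use a different mechanism (fitting to residuals or gradients).

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost reweighting step for labels in {-1, +1}."""
    # Weighted error of the current weak learner.
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # The learner's say in the final vote: larger when the error is smaller.
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Misclassified samples (t * p == -1) are scaled up, correct ones down.
    updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

y_true = [+1, +1, -1, -1, +1]
y_pred = [+1, +1, -1, -1, -1]  # the weak learner misses the last sample
weights = [0.2] * 5
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(f"alpha = {alpha:.3f}")
print([round(w, 3) for w in new_weights])
```

After the update the single misclassified sample carries half of the total weight, so the next learner is pushed to get it right.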

There are various approaches for building a bagging model, such as pasting, bagging, random subspaces and random patches. You can find all details over here.

Stacking-

  • These meta learning models are what the name suggests: they are stacked models. In other words, a particular learner’s output becomes an input to another model, and so on.
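A minimal sketch of the stacking idea for regression. The two base learners are hypothetical fixed functions standing in for already-trained models, and the meta learner is a least-squares fit (without intercept) on their outputs, solved by hand for the two-weight case. In practice the meta learner is usually trained on out-of-fold predictions of the base learners to avoid leakage, which this toy example skips.

```python
def base_model_a(x):
    """Hypothetical base learner that underestimates the target."""
    return 1.5 * x

def base_model_b(x):
    """Hypothetical base learner that overestimates the target."""
    return 3.0 * x + 1.0

# Made-up training data for the meta learner.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 0.5 for x in xs]

# Level 1: the base learners' outputs become the meta learner's inputs.
a = [base_model_a(x) for x in xs]
b = [base_model_b(x) for x in xs]

# Level 2: least-squares weights for y ~ w_a * a + w_b * b (2x2 normal equations).
s_aa = sum(u * u for u in a)
s_bb = sum(v * v for v in b)
s_ab = sum(u * v for u, v in zip(a, b))
s_ay = sum(u * y for u, y in zip(a, ys))
s_by = sum(v * y for v, y in zip(b, ys))
det = s_aa * s_bb - s_ab * s_ab
w_a = (s_ay * s_bb - s_by * s_ab) / det
w_b = (s_by * s_aa - s_ay * s_ab) / det

def stacked_predict(x):
    """Combine the base learners' outputs with the learned meta weights."""
    return w_a * base_model_a(x) + w_b * base_model_b(x)

print(f"w_a = {w_a:.3f}, w_b = {w_b:.3f}, prediction at 10 = {stacked_predict(10.0):.2f}")
```

Neither base learner is accurate on its own here, but the meta learner finds a blend of the two that recovers the underlying relationship.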

Working examples-

First install xgboost via conda install-

Step 1- search for packages using “anaconda search -t conda xgboost”

Step 2- install a particular package, such as “conda install py-xgboost”

Coming soon….

Cheers!