Decision Tree Regression using scikit-learn

In this exercise we will build a Decision Tree Regression model to find the key variables that affect credit card balances.

This data is taken from “An Introduction to Statistical Learning with Applications in R” available at http://www-bcf.usc.edu/~gareth/ISL/index.html

  • Import the basic packages and the interactive shell, so that the output of several statements can be printed from a single cell.
  • Import the data set and check its ‘head’ and ‘info’. In the ‘info’ output, categorical variables are shown as “object” and numerical variables as “int” or “float”.
  • Remove unnecessary columns using ‘iloc’ and print a random sample of the data using the ‘sample’ method.
  • Create a new variable in the data frame. This is the dependent variable.
  • Drop variables from the data frame using ‘.drop’; here ‘Cards’ and ‘Balance’ are dropped.
  • Explore the data frame a bit more: describe, info, shape, inserting blank lines in print statements, counting missing values, column names, etc.
  • Find the mean and median values of the numerical variables for each categorical variable by running a simple ‘for loop’.
  • Plot the data using the Seaborn (sns) package, here with a Seaborn pairplot. There are many possible values for the color palette (the exact set depends on the installed matplotlib and seaborn versions): Accent, Accent_r, Blues, Blues_r, BrBG, BrBG_r, BuGn, BuGn_r, BuPu, BuPu_r, CMRmap, CMRmap_r, Dark2, Dark2_r, GnBu, GnBu_r, Greens, Greens_r, Greys, Greys_r, OrRd, OrRd_r, Oranges, Oranges_r, PRGn, PRGn_r, Paired, Paired_r, Pastel1, Pastel1_r, Pastel2, Pastel2_r, PiYG, PiYG_r, PuBu, PuBuGn, PuBuGn_r, PuBu_r, PuOr, PuOr_r, PuRd, PuRd_r, Purples, Purples_r, RdBu, RdBu_r, RdGy, RdGy_r, RdPu, RdPu_r, RdYlBu, RdYlBu_r, RdYlGn, RdYlGn_r, Reds, Reds_r, Set1, Set1_r, Set2, Set2_r, Set3, Set3_r, Spectral, Spectral_r, Vega10, Vega10_r, Vega20, Vega20_r, Vega20b, Vega20b_r, Vega20c, Vega20c_r, Wistia, Wistia_r, YlGn, YlGnBu, YlGnBu_r, YlGn_r, YlOrBr, YlOrBr_r, YlOrRd, YlOrRd_r, afmhot, afmhot_r, autumn, autumn_r, binary, binary_r, bone, bone_r, brg, brg_r, bwr, bwr_r, cool, cool_r, coolwarm, coolwarm_r, copper, copper_r, cubehelix, cubehelix_r, flag, flag_r, gist_earth, gist_earth_r, gist_gray, gist_gray_r, gist_heat, gist_heat_r, gist_ncar, gist_ncar_r, gist_rainbow, gist_rainbow_r, gist_stern, gist_stern_r, gist_yarg, gist_yarg_r, gnuplot, gnuplot2, gnuplot2_r, gnuplot_r, gray, gray_r, hot, hot_r, hsv, hsv_r, icefire, icefire_r, inferno, inferno_r, jet, jet_r, magma, magma_r, mako, mako_r, nipy_spectral, nipy_spectral_r, ocean, ocean_r, pink, pink_r, plasma, plasma_r, prism, prism_r, rainbow, rainbow_r, rocket, rocket_r, seismic, seismic_r, spectral, spectral_r, spring, spring_r, summer, summer_r, tab10, tab10_r, tab20, tab20_r, tab20b, tab20b_r, tab20c, tab20c_r, terrain, terrain_r, viridis, viridis_r, vlag, vlag_r, winter, winter_r
  • Run a correlation heatmap and try different color palettes through ‘cmap’. The possible values for cmap are the same colormap names listed above for the pairplot palette.
  • Do dummy coding using a ‘for loop’.
  • Create features and labels for the decision tree regression using ‘.drop’.
  • Import the Decision Tree Regression object from sklearn and set the minimum leaf size to 30. Fit the tree on the overall data.
  • Visualize the tree using graphviz within the Jupyter notebook and also export the decision tree as a PDF using ‘.render’.
  • Find the predicted values using the tree.
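
The original code for these steps was shown only as screenshots, so here is a minimal sketch of the later steps (new variable, drop, dummy coding, fitting and predicting). It is not the author's code. It runs on a small synthetic stand-in for the Credit data with made-up values, it assumes the dependent variable is balance per card (the column name ‘Balance_per_Card’ is hypothetical), and it prints the tree as text with ‘export_text’ instead of graphviz so that no extra install is needed.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in for the Credit data; all values are made up.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "Income": rng.uniform(10, 180, n),
    "Limit": rng.uniform(1000, 13000, n),
    "Cards": rng.integers(1, 8, n),            # 1 to 7 cards, never zero
    "Student": rng.choice(["Yes", "No"], n),
    "Gender": rng.choice(["Male", "Female"], n),
})
raw_balance = 0.2 * df["Limit"] - 5 * df["Income"] + rng.normal(0, 100, n)
df["Balance"] = raw_balance.clip(lower=0)

# Dependent variable (assumed definition): balance per card.
df["Balance_per_Card"] = df["Balance"] / df["Cards"]
df = df.drop(columns=["Cards", "Balance"])

# Dummy-code each categorical column in a for loop.
categorical_cols = ["Student", "Gender"]
for col in categorical_cols:
    dummies = pd.get_dummies(df[col], prefix=col, drop_first=True, dtype=float)
    df = pd.concat([df.drop(columns=col), dummies], axis=1)

# Features and label.
X = df.drop(columns="Balance_per_Card")
y = df["Balance_per_Card"]

# Every leaf must hold at least 30 observations.
tree = DecisionTreeRegressor(min_samples_leaf=30, random_state=0)
tree.fit(X, y)

# Text rendering of the fitted tree.
print(export_text(tree, feature_names=list(X.columns)))

# Predicted value for each row (the mean of its leaf).
df["Predicted"] = tree.predict(X)
print(df.sample(5, random_state=0))
```

A larger ‘min_samples_leaf’ gives a smaller, easier to read tree at the cost of coarser predictions, since each prediction is simply the average of the training rows in that leaf.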

As you can see from the above decision tree, Limit, Income and Rating come out as the most important variables in predicting the “Balances/Card”.
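
Besides reading the splits off the tree diagram, one possible way to rank the variables is the ‘feature_importances_’ attribute of the fitted tree. The sketch below uses synthetic data with made-up values, not the Credit data, and the ‘Age’ column is deliberately unrelated to the target so that it should rank last or near last.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic data; all values are made up for illustration.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "Limit": rng.uniform(1000, 12000, n),
    "Income": rng.uniform(10, 150, n),
    "Age": rng.uniform(20, 80, n),   # deliberately unrelated to the target
})
y = 0.05 * X["Limit"] - 1.0 * X["Income"] + rng.normal(0, 20, n)

tree = DecisionTreeRegressor(min_samples_leaf=30, random_state=0).fit(X, y)

# Importances are normalized to sum to 1; higher means the variable
# contributed more to reducing the squared error across the splits.
importances = pd.Series(tree.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```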

The highest balance is for customers who have a credit limit of more than $6,232 and an income of less than $69K. This makes sense, as these people have higher credit lines available to buy items on and, at the same time, lower incomes that prompt them to borrow more.

Thanks for reading! Please don’t forget to like and share with others!!

Support Vector Machine (SVM)

What is Support Vector Machine?

Support Vector Machines are supervised machine learning algorithms used mainly for classification and regression tasks. When an SVM is used for classification, it is called a Support Vector Classifier (SVC). Similarly, for regression it is called a Support Vector Regressor (SVR).

Where is SVM used?

SVM can be used wherever we use other machine learning techniques such as Logistic Regression, Decision Trees, Linear Regression, Naive Bayes Classifier etc. However, SVM may be particularly suitable in the following cases:

  • Sparse data
  • High Dimensional data
  • Text Classification
  • Data is nonlinear
  • Image classification
  • Data has complex patterns
  • Etc.

How does an SVM work?

A support vector machine separates the classes in the data by drawing a linear separating hyperplane in a high-dimensional space. For example, in the 2D image below, we need to separate the green points from the red points. We can draw many hyperplanes, such as H1, H2, H3 and H4. They all separate the points.

Slide1

However, since there are many possible hyperplanes, as shown in the image above, which hyperplane should be chosen? The answer is the hyperplane that maximizes the separation between the green and the red points. In this case, it happens to be H3.

What happens if the data is not linearly separable?

The kernel trick, or kernel function, transforms the original non-linearly separable data into a higher-dimensional space where it can be linearly separated. See the image below.

Slide2
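
To make the idea concrete, here is a small sketch on toy data (two concentric rings from ‘make_circles’; the data and parameter values are my own choices for illustration). A linear SVM does poorly on the raw 2D points, does well once a third feature, the squared radius, is added by hand, and an RBF kernel reaches a similar result without building that feature explicitly.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Toy data: two concentric rings, which no straight line can separate.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# 1) Linear SVM on the original 2D points.
linear_2d = SVC(kernel="linear").fit(X, y)
acc_linear_2d = linear_2d.score(X, y)

# 2) Explicit lift to 3D: append the squared radius x1^2 + x2^2.
#    In this space a flat plane can separate the inner ring from the outer one.
X_lifted = np.column_stack([X, (X ** 2).sum(axis=1)])
linear_3d = SVC(kernel="linear").fit(X_lifted, y)
acc_linear_3d = linear_3d.score(X_lifted, y)

# 3) RBF kernel on the original 2D points: the higher-dimensional
#    mapping is implicit, no extra feature is constructed.
rbf_2d = SVC(kernel="rbf").fit(X, y)
acc_rbf_2d = rbf_2d.score(X, y)

print(f"linear SVM, original 2D : {acc_linear_2d:.2f}")
print(f"linear SVM, lifted to 3D: {acc_linear_3d:.2f}")
print(f"RBF SVM, original 2D    : {acc_rbf_2d:.2f}")
```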

What is the best hyperplane?

As we discussed earlier, the best hyperplane is the one that maximizes the distance between the classes (you can think of it as the width of the road), as shown below.

Slide3
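
As a rough illustration of the "width of the road", the sketch below fits a linear SVC on six hand-picked, clearly separable points (my own toy data). For a linear SVM with weight vector w, the margin width is 2 / ||w||, and the points that sit on the edges of the road are the support vectors. A large C is used here to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Six hand-picked, linearly separable points (toy data).
X = np.array([[0, 0], [0, 1], [1, 0],
              [3, 3], [3, 4], [4, 3]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A large C penalizes margin violations heavily (close to a hard margin).
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]          # normal vector of the separating hyperplane
b = clf.intercept_[0]     # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, " b =", b)
print("support vectors:")
print(clf.support_vectors_)
# For these points the geometry gives 5 / sqrt(2), roughly 3.54.
print("margin width:", margin_width)
```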

How to build this in Python?

Slide4Slide5Slide6Slide7Slide8Slide9Slide10Slide11Slide12Slide13
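
The slides above contained the original code as images. As a stand-in, here is a minimal sketch of one common way to build an SVC with scikit-learn; it is not the author's code. It assumes the breast cancer data set bundled with scikit-learn, scales the features (SVMs are sensitive to feature scale), and tunes C and gamma with a small grid search; the grid values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Bundled example data set (an assumption; any labeled data would do).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale the features, then fit an RBF-kernel classifier.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Small illustrative grid; step names come from make_pipeline.
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01, 0.1],
}
search = GridSearchCV(model, param_grid, cv=5)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("best parameters:", search.best_params_)
print(f"test accuracy: {test_accuracy:.3f}")
```

Putting the scaler inside the pipeline means it is refit on each cross-validation fold, so the held-out fold does not leak into the scaling.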

Here is an excellent link on hyper-parameter optimisation:

https://medium.com/all-things-ai/in-depth-parameter-tuning-for-svc-758215394769

For more info on the sklearn library, refer to the links below:

http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html

http://scikit-learn.org/stable/modules/svm.html

Thanks for reading!

 

How to Change Browser for Jupyter Notebook

Here are step-by-step directions for opening Jupyter Notebook in the browser of your preference.

Step 1- Go to Anaconda Navigator and start Jupyter Notebook

Slide1

Step 2- Go to Anaconda Prompt to grab the URL

Step 3- Put the URL (it will be unique for your application) into the browser of your choice

Slide2

That’s it. You will have Jupyter notebook open in the browser of your choice.
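
If you prefer not to paste the URL by hand, the same step can be sketched with Python's standard ‘webbrowser’ module. This is an alternative to the manual steps above, not part of them. The helper name ‘open_in_browser’ is hypothetical, and both values in the example call are placeholders: use the URL shown in your own Anaconda Prompt and a browser name that ‘webbrowser’ recognises on your machine.

```python
import webbrowser


def open_in_browser(url, browser_name):
    """Open `url` in the browser registered under `browser_name`.

    Raises webbrowser.Error if Python cannot locate that browser.
    """
    controller = webbrowser.get(browser_name)
    return controller.open(url)


# Example call (not executed here). Both values are placeholders:
# open_in_browser("http://localhost:8888/?token=<your-token>", "firefox")
```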

To clear the Anaconda Prompt screen, simply type ‘cls’ at the command prompt.

Thanks for reading!

Overview of Banking and Financial Services Industry

What is BFSI?

  • BFSI is an acronym for Banking, Financial Services and Insurance. This covers a whole gamut of activities and business models.
  • Wikipedia defines it as follows: “BFSI comprises commercial banks, insurance companies, non-banking financial companies, cooperatives, pensions funds, mutual funds and other smaller financial entities. Banking may include core banking, retail, private, corporate, investment, cards and the like.”

Slide3

Slide4Slide6Slide7

Activity- Explore the pages below. List the different products and services that you see.

Thank you!