Ensemble methods in machine learning are algorithms that combine more than one model to produce improved predictions. This post will serve as an introduction to tree-based ensemble methods. We will first go over how they draw on the Delphi method to improve predictive power with Bootstrap Aggregation (Bagging for short). Then we will move into boosting, a technique where algorithms combine weak learners to boost performance. The following ensemble algorithms from scikit-learn will be covered:
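To make the bagging/boosting distinction concrete, here is a minimal sketch using two of scikit-learn's ensemble classes. The toy dataset and hyperparameters are illustrative, not taken from the post itself.

```python
# Bagging vs. boosting on a synthetic dataset (illustrative values only).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging: full trees fit on bootstrap samples, predictions aggregated by vote
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)

# Boosting: weak learners fit sequentially, each focusing on the last one's errors
boosting = AdaBoostClassifier(n_estimators=50, random_state=42)

for model in (bagging, boosting):
    model.fit(X_train, y_train)
    print(type(model).__name__, round(model.score(X_test, y_test), 3))
```

Both ensembles wrap decision trees; the difference is that bagging trains its trees independently, while boosting trains them in sequence.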
This post will serve as a high-level overview of decision trees. It will cover how decision trees train with recursive binary splitting, and how features are selected using "information gain" and the "Gini index". I will also be tuning hyperparameters and pruning a decision tree for optimization. The two decision tree algorithms covered in this post are CART (Classification and Regression Trees) and ID3 (Iterative Dichotomiser 3).
Decision trees are very popular for predictive modeling and perform both classification and regression. Decision trees are highly interpretable and provide a foundation for more complex algorithms, e.g., random forest.
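A quick sketch of those ideas with scikit-learn's CART-style tree: the `criterion` parameter switches between the Gini index and entropy (the information-gain measure ID3 uses), and `max_depth` acts as a simple pruning knob. The iris data here is just for illustration.

```python
# CART-style decision tree: Gini splits plus depth-based pruning.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion="gini" uses the Gini index; criterion="entropy" uses information gain
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

print("depth:", tree.get_depth())
print("training accuracy:", round(tree.score(X, y), 3))
```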
K-Nearest Neighbors (KNN) is a classification and regression algorithm that uses nearby points to generate predictions. It takes a point, finds the K nearest points to it, and predicts a label for that point, where K is user-defined (e.g., 1, 2, 6). For classification, the algorithm uses the most frequent class among the neighbors. For regression, it averages the targets of the K nearest points to predict a continuous value. KNN is a supervised learning algorithm that is, in a sense, lazy when it comes to training (more on this later).
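Both behaviors can be sketched in a few lines with scikit-learn; the tiny one-feature dataset and K=3 below are made up for illustration.

```python
# KNN for classification (majority vote) and regression (neighbor average).
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = [[0], [1], [2], [10], [11], [12]]
y_class = [0, 0, 0, 1, 1, 1]
y_reg = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]

# "Lazy" training: fit() essentially just stores the data
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_reg)

print(clf.predict([[1.5]]))   # most frequent class among the 3 nearest points
print(reg.predict([[11.0]]))  # average target of the 3 nearest points
```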
Last week I wrote an overview of Linear Regression and what’s happening under the hood of OLS regression from statsmodels. This post will serve as a high-level overview of Logistic Regression to perform classification tasks. Logistic Regression is a great first model to learn when introduced to classification.
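As a taste of what that looks like in practice, here is a minimal logistic-regression sketch with scikit-learn; the synthetic dataset stands in for whatever data the post works with.

```python
# Logistic regression for a binary classification task (toy data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=1)
model = LogisticRegression().fit(X, y)

# predict() gives class labels; predict_proba() gives the underlying
# sigmoid-squashed probabilities for each class
print(model.predict(X[:5]))
print(model.predict_proba(X[:1]).round(3))
```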
Supervised Learning is a term referring to machine learning algorithms that have the ability to "learn" from a labeled (ground truth) dataset. The data needs to be labeled so that supervised learning algorithms can evaluate their performance. Performance is evaluated by comparing the predictions with the actual labels for the training…
Regression analysis is a statistical methodology that allows us to determine the strength of the relationship between variables. Regression is not limited to two variables; we could have two or more variables showing a relationship. The results from the regression help in predicting an unknown value based on its relationship with the predictor variables. For example, someone's height and weight usually have a relationship: generally, taller people tend to weigh more. We could use regression analysis to help predict the weight of an individual, given their height.
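The height/weight example can be sketched directly with a simple linear regression; the numbers below are made up for demonstration.

```python
# Predicting weight from height with ordinary least squares (invented data).
from sklearn.linear_model import LinearRegression

heights = [[150], [160], [170], [180], [190]]  # cm
weights = [55, 62, 70, 78, 88]                 # kg

model = LinearRegression().fit(heights, weights)

# The fitted slope quantifies the strength of the height-weight relationship
print("slope:", round(model.coef_[0], 2), "kg per cm")
print("predicted weight at 175 cm:", round(model.predict([[175]])[0], 1))
```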
Natural Language Processing (NLP) is the study of how computers interact with humans (i.e., understand, interpret, manipulate) through language (e.g., speech, text). NLP got its start in the field of linguistics, the study of language, focusing primarily on semantics, phonetics, and grammar. Before machine learning began to show success on NLP tasks, practitioners mainly programmed algorithms with rule-based methods from linguistics. Machine learning methods provided better accuracy, faster processing times, and greater dependability, and the rule-based approaches took a back seat.
I will be going through the NLP process from data preprocessing to model evaluation and selection…
Interested in deep learning and artificial intelligence? PyTorch is a Python-based computing library that harnesses the power of graphics processing units (GPUs). It is preferred by many as a deep learning research platform. Here's a little insight into PyTorch and some possible real-world applications.
First, let me start by explaining how PyTorch will become useful to you. PyTorch serves two main purposes: it is a replacement for NumPy that uses the power of GPUs, and a deep learning research platform that provides flexibility and speed.
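The NumPy-replacement side can be shown in a few lines: tensors support the familiar array operations, and the same code runs on a GPU when one is available. This is a minimal sketch, not from the post itself.

```python
# PyTorch tensors as a NumPy-like array library with optional GPU support.
import torch

# Fall back to CPU when no GPU is present, so the same code runs anywhere
device = "cuda" if torch.cuda.is_available() else "cpu"

a = torch.ones(3, 3, device=device)
b = torch.arange(9, dtype=torch.float32, device=device).reshape(3, 3)

c = a + b  # elementwise addition, NumPy-style
print(c.sum().item())
```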
This post will serve as a step-by-step guide to building pipelines that streamline the machine learning workflow. I will be using the famous Titanic dataset for this tutorial. The dataset was obtained from Kaggle. The goal is to predict whether a given person survived or not. I will be implementing various classification algorithms, as well as grid searching and cross-validation. This dataset holds a record for each passenger, consisting of 10 variables (see data dictionary below). …
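The skeleton of that workflow can be sketched as follows; I substitute a synthetic dataset for the Titanic data so the example is self-contained, and the steps and parameter grid are illustrative.

```python
# Pipeline + grid search + cross-validation skeleton (synthetic stand-in data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),    # preprocessing step
    ("clf", LogisticRegression()),  # model step
])

# Grid search cross-validates every parameter combination over the whole pipeline;
# "clf__C" addresses the C parameter of the "clf" step
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.score(X_test, y_test), 3))
```

Because the scaler lives inside the pipeline, it is refit on each cross-validation fold, which avoids leaking test-fold statistics into training.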
Tutorial for Matching Sequences With the FuzzyWuzzy Library
This tutorial will go over how to match strings by their similarity. FuzzyWuzzy can save you ample time during the data science process by providing tools such as Levenshtein distance calculations. Along with examples, I will also include some helpful tips for getting the most out of FuzzyWuzzy.
String matching can be useful for a variety of situations, for example, joining two tables by an athlete’s name when it is spelled or punctuated differently in both tables. This is where FuzzyWuzzy comes in and saves the day! Instead of…
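FuzzyWuzzy's similarity scores are built on the Levenshtein distance mentioned above. As a sketch of what is being computed under the hood, here is a minimal pure-Python version of that distance (FuzzyWuzzy itself wraps an optimized implementation; the athlete names are invented).

```python
# Levenshtein distance: the edit count behind FuzzyWuzzy's similarity scores.
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Lebron James", "LeBron James"))  # one substitution -> 1
print(levenshtein("kitten", "sitting"))             # classic example -> 3
```

A small distance (relative to string length) is what lets two differently punctuated spellings of the same name be treated as a match.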
This post will serve as a tutorial for querying data with SQL (Structured Query Language). For the purposes of this tutorial, I will be using the SQLite3 library, which provides a relational database management system. For examples, I will be using the Chinook Database, a sample database that represents a digital media store, including tables for artists, albums, etc. For more details, take a look over the documentation here.
Today, SQL is the standard when it comes to manipulating and querying data. One of its key benefits is that SQL allows users to quickly and efficiently input and retrieve information from…
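A minimal sketch of that workflow with Python's built-in SQLite3 library: I use a tiny in-memory `artists` table in the spirit of the Chinook schema (the rows here are my own, not Chinook data) so the example runs anywhere.

```python
# Querying with sqlite3: create a table, insert rows, run a SELECT.
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()
cur.execute("CREATE TABLE artists (ArtistId INTEGER PRIMARY KEY, Name TEXT)")
cur.executemany("INSERT INTO artists (Name) VALUES (?)",
                [("AC/DC",), ("Aerosmith",), ("Audioslave",)])

# SELECT ... WHERE ... ORDER BY: the bread and butter of querying
cur.execute("SELECT Name FROM artists WHERE Name LIKE 'A%' ORDER BY Name")
rows = cur.fetchall()
print(rows)
conn.close()
```

The `?` placeholders are sqlite3's parameter substitution, which is both safer and cleaner than formatting values into the SQL string.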
Data Scientist with a passion for statistical analysis and machine learning