Data Science Vocabulary For Beginners

(A Continuously Updated Post)
Some of the definitions here cover statistical concepts, Python or SQL terms, and other terms relevant to Data Science. I will write these definitions as I understand them, but they will be heavily based on the sources where I learned them.



0-9

5 NUMBER SUMMARY - A way to quickly get a sense of the range, centrality, and spread of a data set. It is made from the quartiles of the data set (Q1, Q2, Q3) along with the minimum and maximum of the data set.



A

ALGORITHM - An algorithm is a set of instructions we give a computer so it can take values and manipulate them into a usable form.

ANALYTICS - The process of inspecting, cleaning, transforming, and modeling data in order to identify useful information, suggest conclusions, and support decision-making.

ARRAYS - A list-like structure that stores values in an organized way. In data science, NumPy arrays additionally require every value to share the same type, which makes numerical operations on them fast.



B

BAYESIAN STATISTICS - From the 18th century mathematician and theologian Thomas Bayes. Bayesian statistics is a mathematical procedure that applies probabilities to statistical problems. It gives people the tools to update their beliefs in light of new evidence or data.

BEAUTIFUL SOUP - A Python library that makes it easy for us to traverse an HTML page and pull out the parts we’re interested in.

BIG DATA - Data is considered “big” if it is too unwieldy to work with on a single machine, so that it requires distributed computing.

BINARY CLASSIFICATION - The act of deciding which of two classes a data sample belongs to.

BINOMIAL DISTRIBUTION - Describes how likely a certain number of “successes” is, given a probability of success on each trial and a number of trials. NumPy has a function for generating binomial distributions, np.random.binomial(), which we can use to estimate the probability of different outcomes.
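
A minimal sketch of how np.random.binomial() might be used; the success probability, number of trials, and threshold below are all invented for illustration:

```python
import numpy as np

# Simulate 10,000 experiments of 500 trials each, where each trial
# "succeeds" with probability 0.1 (all numbers here are made up).
experiments = np.random.binomial(n=500, p=0.1, size=10_000)

# Estimate the probability of seeing 40 or fewer successes in 500 trials.
print(np.mean(experiments <= 40))
```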

BOOLEAN - True or False. In programming, a binary variable, having two possible values called “true” and “false.”



C

CALCULUS - The science of measuring continuous change across time or space. Many things in data science change in a continuous way. Models themselves, for example, get more accurate over more iterations, and the gradient of their increasing accuracy can be modeled so that you may identify and pluck the most accurate version of the model from the field of every possible version.

CENTRAL LIMIT THEOREM - The central limit theorem states that as larger samples are collected from a population, the distribution of the sample means approaches a normal distribution whose mean equals the population mean, no matter how the population itself is distributed (uniform, binomial, etc.). This allows us to perform tests, make inferences, and solve problems using the normal distribution, even when the population is not normally distributed.
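
A quick illustration with NumPy (the population, sample size, and number of samples are arbitrary): even though the population below is uniform, the sample means cluster around the population mean in a roughly bell-shaped way.

```python
import numpy as np

rng = np.random.default_rng(0)

# Population that is clearly not normal: a uniform distribution on [0, 10).
population = rng.uniform(0, 10, size=100_000)

# Draw 1,000 samples of size 50 and record each sample's mean.
sample_means = [rng.choice(population, size=50).mean() for _ in range(1_000)]

print(np.mean(sample_means))  # close to the population mean (~5)
print(np.std(sample_means))   # much smaller than the population's spread
```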

CLASSIFICATION - Used to predict a discrete label. The outputs fall under a finite set of possible outcomes. For example, “Predict an email for being SPAM or NOT.”

CLASSIFICATION THRESHOLD - The point at which we decide which class a sample belongs to. The default threshold for many algorithms is 0.5. If the predicted probability is greater than or equal to the threshold, the sample is in the positive class; otherwise, it is in the negative class.

CONDITIONAL PROBABILITY - Conditional probabilities allow us to account for information we have about our system of interest. For example, we might expect the probability that it will rain tomorrow (in general) to be smaller than the probability it will rain tomorrow given that it is cloudy today. This latter probability is a conditional probability, since it accounts for relevant information that we possess.

CONFIDENCE INTERVAL - An interval estimate of a parameter obtained via statistical inference: [point_estimate - cv*sd, point_estimate + cv*sd], where “cv” is the critical value taken from the sampling distribution and “sd” is the standard deviation of that sampling distribution (the standard error of the estimate).
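
A minimal sketch of a 95% confidence interval for a mean, assuming the normal critical value 1.96 is appropriate and using the standard error of the mean as the spread term; the data is invented:

```python
import numpy as np

sample = np.array([4.1, 5.3, 4.8, 5.0, 4.6, 5.2, 4.9, 5.1])  # invented data

point_estimate = sample.mean()
# Standard error of the mean; ddof=1 uses the sample standard deviation.
standard_error = sample.std(ddof=1) / np.sqrt(len(sample))
cv = 1.96  # critical value for a 95% confidence level (normal approximation)

interval = (point_estimate - cv * standard_error,
            point_estimate + cv * standard_error)
print(interval)
```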

CONFIDENCE LEVEL - In hypothesis testing, the confidence level is the probability of not rejecting the null hypothesis given that it is true. The formula is P(Not Rejecting H0 | H0 is True) = 1 - P(Rejecting H0 | H0 is True). The default confidence level is usually 95 percent.

CROSS-VALIDATION - In machine learning, we run our modeling process on different subsets of the data to get multiple measures of model quality. This gives a more accurate measure of model quality, which is especially important if you are making a lot of modeling decisions. However, it can take longer to run.
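
A small sketch using scikit-learn’s cross_val_score; the dataset and model (the built-in iris data and a decision tree) are just placeholders for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Score the same model on 5 different train/validation splits (folds).
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores)         # one accuracy score per fold
print(scores.mean())  # the averaged measure of model quality
```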



D

DATAFRAME - A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.
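
A tiny sketch of building a DataFrame in pandas (the column names and values are made up):

```python
import pandas as pd

# Each key becomes a column; each list element becomes a row entry.
df = pd.DataFrame({
    "name": ["Ada", "Ben", "Cleo"],
    "score": [91, 78, 85],
})

print(df.loc[1, "score"])  # the entry at row 1, column "score" -> 78
```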

DATA IMPUTATION - The substitution of estimated values for missing or inconsistent data items (fields). The substituted values are intended to create a data record that does not fail edits.

DATA LEAKAGE - Leakage happens when your training data contains information about the target, but similar data will not be available when the model is used for prediction. In other words, leakage causes a model to look accurate until you start making decisions with the model, and then the model becomes very inaccurate. There are two main types of leakage: target leakage and train-test contamination.

DATA VISUALIZATION - The art of communicating meaningful data visually. This can involve infographics, traditional plots, and dashboards.

DATA WRANGLING - The process of gathering, selecting, cleaning, structuring, and enriching raw data into the desired format for better decision making in less time.

DECISION BOUNDARY - Used by Support Vector Machines to classify points of data. For a decision boundary using two features, the boundary is called a separating line. For three features, it is called a separating plane. For more than three features, the decision boundary is called a separating hyperplane.

DESCRIPTIVE ANALYTICS - Descriptive analytics helps answer questions about what has happened based on historical data. Descriptive analytics techniques summarize large datasets to describe outcomes to stakeholders.



E

EUCLIDEAN DISTANCE - The most commonly used formula to calculate the distance between points. To find the Euclidean Distance between two points, we first calculate the squared difference along each dimension. If we add up all of these squared differences and take the square root, we have calculated the Euclidean Distance.
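
A minimal NumPy sketch with two invented 3-dimensional points:

```python
import numpy as np

point_a = np.array([1.0, 2.0, 3.0])
point_b = np.array([4.0, 6.0, 3.0])

# Squared difference along each dimension, summed, then square-rooted.
distance = np.sqrt(np.sum((point_a - point_b) ** 2))
print(distance)  # 5.0 for this made-up example
```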



F

FALSE NEGATIVE - See TYPE II ERROR

FALSE POSITIVE - See TYPE I ERROR

FUNCTIONS - In Python, a function is some code that can be reused. It performs a “function.” It can take parameters and return values.



G

GRADIENT DESCENT - An iterative algorithm used to tune the parameters in regression models for minimum loss.
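
A bare-bones sketch of gradient descent fitting the slope and intercept of a line by minimizing mean squared error; the data, learning rate, and iteration count are all invented:

```python
import numpy as np

# Invented data that roughly follows y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

m, b = 0.0, 0.0          # parameters to tune
learning_rate = 0.01

for _ in range(2_000):
    error = (m * x + b) - y
    # Gradients of mean squared error with respect to m and b.
    m_gradient = 2 * np.mean(error * x)
    b_gradient = 2 * np.mean(error)
    m -= learning_rate * m_gradient
    b -= learning_rate * b_gradient

print(m, b)  # should land near 2 and 1
```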

GRADIENT BOOSTING - A method that goes through cycles to iteratively add models into an ensemble. This method achieves state-of-the-art results on a variety of datasets.



H

HYPOTHESIS TESTING - A method of statistical inference in which you calculate the probability (p-value) of observing a statistic at least as extreme as the one computed from your data, assuming the null hypothesis is true. Based on this, you decide whether to reject the null hypothesis by comparing the p-value to the significance level. Hypothesis testing is mainly used to test for the existence of an effect.

HYPOTHESIS TEST P-VALUE - Statistical hypothesis tests return a p-value, which is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. If the p-value is less than or equal to the significance level, the null hypothesis is rejected in favor of the alternative hypothesis. If the p-value is greater than the significance level, the null hypothesis is not rejected.



I

INNER MERGE - In SQL, this refers to joining two tables so that only the rows that match in both tables are kept.

INTERQUARTILE RANGE (IQR) - The difference between quartile 3 (Q3) and quartile 1 (Q1): IQR = Q3 - Q1. It “is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles, or between upper and lower quartiles”; these quartiles can be seen clearly on a box plot of the data. Q1 separates the first 25% of the data from the rest, while Q3 separates the first 75%. The main takeaway is that the IQR, like the range, is a statistic that helps describe the spread of the center of the data. Unlike the range, the IQR is robust. A statistic is robust when outliers have little impact on it.



J

JACCARD INDEX - A measure of similarity between two sets.

JAVA - A general-purpose programming language that is also used for data analysis and machine learning.

JSON - A data interchange format used for transferring data between systems.

JUNCTION TREE - A graphical model used in probabilistic reasoning and decision making.

JUPYTER - An open-source, web-based platform for interactive computing and data visualization.




K

K-NEAREST NEIGHBORS - (KNN) - A classification algorithm. Data points with similar attributes fall into similar categories. K represents the number of neighbors used to classify your sample. The algorithm looks at the K nearest data points to your sample and makes a classification based on those neighbors.
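
A small sketch with scikit-learn’s KNeighborsClassifier; the built-in iris dataset and K = 5 are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# K = 5: each prediction is a vote among the 5 nearest training points.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # accuracy on held-out data
```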




L

LEFT JOIN - In SQL, this is a way to merge two tables where all rows from the first (left) table are included, but only matching rows from the second (right) table are included.

len( ) - In Python, len() returns the number of items in the object being queried.

LINEAR REGRESSION - An algorithm used when we want to predict the values of a variable from its relationship with other variables. There are two different types of linear regression models, simple linear regression and multiple linear regression. Multiple Linear Regression uses two or more independent variables to predict the values of a dependent variable.
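
A minimal multiple linear regression sketch with scikit-learn; the two independent variables and the target values below are invented:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: two independent variables predicting one dependent variable.
X = np.array([[1, 4], [2, 3], [3, 7], [4, 6], [5, 9]])
y = np.array([10, 11, 18, 19, 25])

model = LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)  # the fitted relationship
print(model.predict([[6, 8]]))        # prediction for a new sample
```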

LIST COMPREHENSIONS - In Python, they are convenient ways to generate or extract information from lists.
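
Two quick examples (the temperature values are made up):

```python
temperatures_f = [68, 71, 90, 54, 77]

# Build a new list by transforming each element...
temperatures_c = [(t - 32) * 5 / 9 for t in temperatures_f]

# ...or by filtering with a condition.
hot_days = [t for t in temperatures_f if t > 75]

print(temperatures_c)
print(hot_days)
```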

LISTS - A Python data type that holds an ordered collection of values, which can be of any type. Lists are Python’s ordered mutable data type. Unlike tuples, lists can be modified in-place.

LOGISTIC REGRESSION - A supervised machine learning algorithm that uses regression to predict the continuous probability, ranging from 0 to 1, of a data sample belonging to a specific category or class. Based on that probability, the sample is classified as belonging to the more probable class, ultimately making Logistic Regression a classification algorithm.




M

MACHINE LEARNING - A process where a computer uses an algorithm to gain understanding about a set of data, then makes predictions based on its understanding. There are many types of machine learning techniques; most are classified as either supervised or unsupervised techniques. It is the science of getting computers to learn and act like humans do, and improve their learning over time in autonomous fashion, by feeding them data and information in the form of observations and real-world interactions.

MATPLOTLIB - A plotting library for Python. “Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.” It allows people to create line charts, bar charts, pie charts, and more. It gives precise control over colors and labels so people can create the perfect chart to communicate findings.
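
A tiny sketch of a labeled line chart; the months, sales figures, and color are placeholders:

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 150, 170]

plt.plot(months, sales, color="teal", label="Sales")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.title("Monthly sales")
plt.legend()
plt.show()
```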

MEAN - The mean is the sum of a list of values divided by the number of values in that list.

MEDIAN - In a set of values listed in order, the median is whatever value is in the middle. If there is an even number of values, the median is the average of the two middle values.

MIN-MAX NORMALIZATION - One of the most common ways to normalize data. For every feature, the minimum value of that feature gets transformed to a 0 and the maximum to a 1. Every other value of that feature gets transformed to a value between 0 and 1.
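
A one-line NumPy sketch of the transformation, using an invented feature column:

```python
import numpy as np

feature = np.array([10.0, 20.0, 25.0, 40.0])

# (value - min) / (max - min) maps the minimum to 0 and the maximum to 1.
normalized = (feature - feature.min()) / (feature.max() - feature.min())
print(normalized)  # [0.         0.33333333 0.5        1.        ]
```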




N

NEURAL NETWORK - An artificial neural network is an interconnected group of nodes, akin to the vast network of neurons in a brain.

NORMAL DISTRIBUTION - The most common distribution in statistics is known as the normal distribution, which is a symmetric, unimodal distribution. Normal Distributions are defined by their mean and standard deviation. The mean sets the “middle” of the distribution, and the standard deviation sets the “width” of the distribution. A larger standard deviation leads to a wider distribution. A smaller standard deviation leads to a skinnier distribution. We can generate our own normally distributed datasets using NumPy. In order to create these datasets, we need to use a random number generator. The NumPy library has several functions for generating random numbers, including one specifically built to generate a set of numbers that fit a normal distribution: np.random.normal().
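
A quick sketch of np.random.normal(); the mean of 100 and standard deviation of 15 are arbitrary:

```python
import numpy as np

# 10,000 values drawn from a normal distribution with mean 100
# and standard deviation 15 (both numbers are arbitrary).
samples = np.random.normal(loc=100, scale=15, size=10_000)

print(samples.mean())  # close to 100
print(samples.std())   # close to 15
```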

NORMALIZATION - The goal of normalization is to make every data point have the same scale, so each feature is equally important.

NULL HYPOTHESIS - A null hypothesis is a statement that the observed difference is the result of chance. In other words, there is no significant difference.




O

OUTER JOIN - In SQL, this is used to join two tables together where all rows from both tables are included, even if they don’t match. Any missing values are filled with None or nan (not a number).
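
Although inner, left, and outer joins are SQL concepts, a rough pandas equivalent using pd.merge on two invented tables may help make the differences concrete:

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 4], "amount": [50, 75, 20]})
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ada", "Ben", "Cleo"]})

inner = pd.merge(orders, customers, on="customer_id", how="inner")  # matching rows only
left = pd.merge(orders, customers, on="customer_id", how="left")    # all orders kept
outer = pd.merge(orders, customers, on="customer_id", how="outer")  # everything kept

print(outer)  # unmatched fields show up as NaN
```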

OVERFITTING - This occurs when you rely too heavily on your training data. You assume that data in the real world will always behave exactly like your training data. In K-Nearest Neighbors, overfitting happens when you don’t consider enough neighbors.




P

P-VALUE - Tells us whether or not we can reject a null hypothesis. Generally, if we receive a p-value of less than 0.05, we can reject the null hypothesis and state that there is a significant difference.

PANDAS - A Python library for manipulating data and data frames. “A fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.” Pandas can do a lot of the things SQL can do, but it’s also backed by the power of Python, so we can easily transition from analyzing our data with Pandas to visualizing it using other Python tools. Pandas is the primary tool data scientists use for exploring and manipulating data. Most people abbreviate pandas in their code as pd.

POLYNOMIAL KERNEL - This is used by a Support Vector Machine when the data points cannot be separated in a linear way. It implicitly projects the data into a higher-dimensional space (for example, from two dimensions into three) where a linear decision boundary can separate the classes.

PIPELINES - A simple way to keep your data preprocessing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step. The benefits include cleaner code and fewer bugs. Pipelines are easier to productionize and give us more options for model validation.
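
A minimal scikit-learn Pipeline sketch; the tiny dataset, the imputation/scaling steps, and the logistic regression model are placeholder choices for illustration:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Tiny invented dataset with a missing value (np.nan).
X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])
y = np.array([0, 0, 1, 1])

# Preprocessing (imputation, scaling) and the model bundled as one object.
pipeline = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

pipeline.fit(X, y)
print(pipeline.predict([[2.5, np.nan]]))  # preprocessing is applied automatically
```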

print( ) - In Python, print() is a function that displays the output of a program.

PROBABILITY THEORY - Probability theory is the mathematical framework that allows us to analyze chance events in a logically sound manner.

PYTHON - A general-purpose programming language. It can do almost all of what other languages can do, and its large ecosystem of libraries makes development fast. It is often chosen by Data Analysts and Data Scientists for prototyping, visualization, and execution of data analysis on data sets. Many highly trafficked websites, such as YouTube, are created using Python.




Q

QUARTILES - A common way to split the data into four groups of equal size. Quartile 2 (Q2) is the median of the data set. Q1 is the median of the first-half of the data set, and Q3 is the median of the second-half of the data set. Quartiles are so commonly used that they (Q1, Q2, Q3) along with the min and max of the data set are called the 5 Number Summary.




R

RADIAL BASIS FUNCTION KERNEL - (rbf) - The most commonly used kernel in SVMs. It is the default for scikit-learn’s SVC. The rbf kernel transforms the data into infinite dimensions.

REGRESSION - Used to predict outputs that are continuous. The outputs are quantities that can be flexibly determined based on the inputs of the model rather than being confined to a set of possible labels. For example, “Predict the height of a potted plant from the amount of rain.”

RELATIONAL DATABASE - A type of database that uses a structure that allows us to identify and access data in relation to another piece of data. Often, data in a relational database is organized in tables.

RELATIONAL DATABASE MANAGEMENT SYSTEM (RDBMS) - A program that allows you to create, update, and administer a relational database. Most relational database management systems use the SQL language to access the database.

range() - In Python, the range() function returns a sequence of integers defined by the arguments passed to it (in Python 3 it returns a range object rather than a list).




S

SCHEMA - Related to databases, a schema shows the columns and data types of a table.

SCIKIT-LEARN - A Python library that helps build, train, and evaluate Machine Learning Models.

SEABORN - A Python data visualization library that provides simple code to create elegant visualizations for statistical exploration and insight. When compared to Matplotlib, Seaborn provides a more visually appealing plotting style and concise syntax. It natively understands Pandas DataFrames, making it easier to plot data directly from .csv files. Seaborn can easily summarize Pandas DataFrames with many rows of data into aggregated charts.

SIGMOID FUNCTION - A mathematical function having a characteristic “S”-shaped (sigmoid) curve. The most common example is the logistic function, σ(x) = 1 / (1 + e^(-x)).

SPREAD - When analyzing data, spread describes how far apart the values are, for example the distance between the smallest and largest values (the range).

SQL - Structured Query Language, or SQL, is a programming language used to perform tasks such as updating or retrieving data in a database.

STANDARD DEVIATION - The standard deviation of a set of values helps us understand how spread out those values are. This statistic is more useful than the variance because it’s expressed in the same units as the values themselves. Mathematically, the standard deviation is the square root of the variance of a set. It’s often represented by the Greek symbol sigma, σ. You can usually expect around 68% of your data to fall within one standard deviation of the mean, 95% of your data within two standard deviations, and 99.7% of your data to fall within three standard deviations from the mean.

STATISTICAL INFERENCE - Facts and figures plus guessing. It attempts to make predictions about, and find correlations between, populations. Because it uses probability, it is never exact or definite; it is a different kind of mathematics than algebra and calculus. It tells you what is more or less likely within a realm of possibility.

SUPPORT VECTOR - The points in the training set closest to the decision boundary. The distance between a support vector and a decision boundary is called the margin. We want to make the margin as large as possible.

SUPPORT VECTOR CLASSIFIER - (SVC) - The name used by scikit-learn for a support vector machine object. See definition below for Support Vector Machines.

SUPPORT VECTOR MACHINES - (SVM) - A powerful supervised learning model used for classification. An SVM makes classifications by defining a decision boundary and then seeing what side of the boundary an unclassified point falls on.




T

TUPLES - A Python data type that holds an ordered collection of values, which can be of any type. Python tuples are “immutable,” meaning that they cannot be changed once created.

TYPE I ERROR - Also known as a False Positive, this is the error of rejecting a null hypothesis when it is actually true. It can be viewed as a miss being registered as a hit. The acceptable rate of this type of error is called the significance level and is usually set to 0.05 (5%) or 0.01 (1%).

TYPE II ERROR - Also known as a False Negative, this is the error of not rejecting a null hypothesis when the alternative hypothesis is true. It can be viewed as a hit being registered as a miss. Depending on the purpose of testing, testers decide which type of error to be more concerned with; usually, Type I errors are considered more serious than Type II errors.




U

UNDERFITTING - This occurs when your model is too general and doesn’t pay enough attention to the training data, so it misses the underlying pattern. In K-Nearest Neighbors, underfitting happens when you consider too many neighbors.

UNIVARIATE T-TEST - Compares a sample mean to a hypothetical population mean. It answers the question, “What is the probability that a sample came from a distribution with the desired mean?”
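
A minimal sketch with SciPy’s one-sample t-test; the sample values and the hypothetical population mean of 30 are invented:

```python
import numpy as np
from scipy.stats import ttest_1samp

# Invented sample: did these measurements come from a population with mean 30?
sample = np.array([28.1, 29.4, 31.2, 30.5, 27.9, 29.8, 30.2, 28.7])

t_statistic, p_value = ttest_1samp(sample, popmean=30)
print(p_value)  # a small p-value would suggest the population mean is not 30
```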




V

VARIABLES - In Python, variables are assigned values using the = operator, which is not to be confused with the == operator used for testing equality. A variable can hold almost any type of value, such as lists, dictionaries, and functions.

VARIANCE - The variance of a set of values measures how spread out those values are. Mathematically, it is the average of the squared differences between the individual values and the mean of the set.




W

WEB SCRAPING - The process of extracting data from websites.

WEIGHTED AVERAGE - A measure of central tendency calculated by multiplying each data point by a weight, summing the results, and dividing by the sum of the weights.

WEIGHT MATRIX - A matrix used in machine learning to represent the strength of the connections between neurons in a neural network.

WORD CLOUD - A visual representation of the frequency of words in a text.





X

XML - A markup language and data interchange format used for transferring data between systems.




Y




Z

Z-SCORE - A measure of how many standard deviations a data point is from the mean.


Thank you for reading this article by Case Muller at Muller Industries. If you liked it, you can find more articles about data science, future technology and more at https://muller-industries.com.
