Supervised learning uses labeled data, where each input has an associated output label. It aims to learn a mapping between inputs and outputs to make predictions on new data. Common tasks are classification (categorical labels) and regression (continuous values).
Unsupervised learning, by contrast, works with unlabeled data. The model identifies patterns, clusters, or associations within the data without predefined labels. Clustering and dimensionality reduction are common unsupervised tasks.
Classification models categorize data into discrete classes or labels (e.g., spam or not spam). Common algorithms include decision trees, logistic regression, and support vector machines.
Regression models predict continuous values (e.g., predicting house prices). Linear regression and polynomial regression are examples. In general, classification answers “which category” and regression answers “how much.”
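As a minimal sketch (assuming scikit-learn is available and using made-up toy data), the same feature matrix can feed either a classifier or a regressor; only the type of target changes:

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[1], [2], [3], [4], [5], [6]]           # one numeric feature
y_class = [0, 0, 0, 1, 1, 1]                 # categorical labels -> classification
y_value = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]     # continuous targets -> regression

clf = LogisticRegression().fit(X, y_class)
reg = LinearRegression().fit(X, y_value)
print(clf.predict([[3.5]]))   # answers "which category?"
print(reg.predict([[3.5]]))   # answers "how much?"
```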
Overfitting happens when a model learns the training data too closely, including noise, resulting in high accuracy on the training set but poor performance on new data. This is often due to an overly complex model.
Underfitting occurs when a model is too simple and cannot capture the underlying patterns in the data, leading to poor performance on both training and testing data. Balancing model complexity helps avoid both overfitting and underfitting.
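A rough illustration of both failure modes (scikit-learn assumed; the dataset and neighbor counts are arbitrary choices): a 1-nearest-neighbor model memorizes the training set, while a very large neighborhood is too simple to fit it.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 15, 300):   # likely to overfit, reasonable, likely to underfit
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(k, round(model.score(X_tr, y_tr), 3), round(model.score(X_te, y_te), 3))
```

Overfitting shows up as a large gap between training and test accuracy; underfitting as low accuracy on both.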
Cross-validation splits the dataset into multiple parts (folds). In each round, the model trains on all but one fold and is evaluated on the held-out fold. This assesses performance across different subsets of the data, helping detect overfitting and giving a more reliable estimate of model accuracy than a single train/test split.
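A minimal 5-fold example, assuming scikit-learn and its bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)          # one accuracy score per fold
print(scores.mean())   # the cross-validated estimate
```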
Feature scaling ensures features contribute equally to a model’s output, especially important for algorithms sensitive to distance (e.g., KNN, SVM). Two common methods are:
Normalization: Scales features between 0 and 1, maintaining relative distances.
Standardization: Centers features around 0 with a standard deviation of 1, useful for normally distributed data.
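A small sketch of both scalers on a toy matrix (scikit-learn assumed):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 600.0]])
print(MinMaxScaler().fit_transform(X))    # normalization: each column in [0, 1]
print(StandardScaler().fit_transform(X))  # standardization: mean 0, std 1 per column
```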
A confusion matrix is a table used to evaluate classification model performance, showing the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). It provides a more detailed view than simple accuracy, helping calculate precision, recall, and F1-score.
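A minimal sketch (scikit-learn assumed, labels invented) that pulls the four counts out of the matrix and derives the related metrics:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                      # 3 1 1 3
print(precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))            # all 0.75 for this toy example
```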
A decision tree splits data into branches based on feature values, creating a tree-like structure. Each internal node represents a decision rule based on a feature, and each leaf node represents a class or value prediction. Decision trees are intuitive and useful for both classification and regression tasks.
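A shallow tree on the iris dataset, with its learned rules printed out (scikit-learn assumed):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))     # each internal node is a feature-threshold rule
print(tree.predict(X[:3]))   # leaf predictions for the first three samples
```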
K-means is an unsupervised algorithm that partitions data into k clusters. It randomly initializes cluster centers, assigns each data point to the nearest center, then recalculates each center as the mean of its assigned points, repeating until the assignments stop changing. It’s commonly used for customer segmentation, image compression, and pattern recognition.
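A sketch on two obvious blobs of 2-D points (scikit-learn assumed; the data is made up):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [1, 1.5],    # one blob
              [8, 8], [8.5, 9], [9, 8]])     # another blob
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # cluster index assigned to each point
print(km.cluster_centers_)   # final centers, roughly the blob means
```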
A p-value measures the probability of observing results at least as extreme as the sample data, assuming the null hypothesis is true. A low p-value (commonly < 0.05) indicates strong evidence against the null hypothesis, suggesting a statistically significant result.
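For instance (SciPy assumed, numbers invented), a two-sample t-test returns the p-value for the null hypothesis that two groups share the same mean:

```python
from scipy import stats

group_a = [5.1, 4.9, 5.3, 5.0, 5.2]
group_b = [5.8, 6.0, 5.9, 6.1, 5.7]
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)   # p < 0.05 would reject the null at the 5% level
```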
Mean is the average value (sum of all values divided by the number of values).
Median is the middle value when data is sorted; it’s useful when data contains outliers.
Mode is the most frequently occurring value, helpful in categorical data analysis.
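A quick check with Python's built-in statistics module, using a toy list where one large value drags the mean but not the median:

```python
import statistics

data = [2, 3, 3, 5, 9, 100]      # 100 is an outlier
print(statistics.mean(data))      # ~20.3, pulled up by the outlier
print(statistics.median(data))    # 4.0, robust to the outlier
print(statistics.mode(data))      # 3, the most frequent value
```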
The bias-variance tradeoff balances a model’s error from two sources:
Bias: Error from overly simple models that fail to capture data patterns (underfitting).
Variance: Error from overly complex models sensitive to data fluctuations (overfitting).
An optimal model achieves low bias and low variance for accurate predictions.
Bagging (Bootstrap Aggregating) trains multiple models on random bootstrap samples of the data and aggregates their predictions (by averaging or voting), reducing variance and increasing stability. Random Forest is a popular bagging technique.
Boosting sequentially trains models, where each model focuses on errors of the previous model, reducing bias. Examples include AdaBoost and Gradient Boosting.
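A side-by-side sketch of one bagging and one boosting ensemble on the same dataset (scikit-learn assumed; the dataset choice is arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
for model in (RandomForestClassifier(random_state=0),       # bagging
              GradientBoostingClassifier(random_state=0)):  # boosting
    print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())
```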
Principal Component Analysis (PCA) reduces high-dimensional data by transforming it into a smaller number of uncorrelated components, capturing maximum variance. It’s used for dimensionality reduction to simplify models, reduce overfitting, and improve visualization.
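A minimal sketch reducing the 4-dimensional iris features to 2 components (scikit-learn assumed):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)                      # (150, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component
```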
Imbalanced datasets can be addressed by:
Resampling: Oversampling the minority class or undersampling the majority class.
Alternative metrics: Using metrics such as F1-score or AUC-ROC that account for class imbalance.
Algorithm tuning: Using algorithms with class-weight adjustments or synthetic sampling (e.g., SMOTE).
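A sketch of two of these options on a synthetic imbalanced dataset (scikit-learn assumed; SMOTE lives in the separate imbalanced-learn package and is not shown): class weighting plus an F1-based comparison.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for weight in (None, "balanced"):
    clf = LogisticRegression(class_weight=weight, max_iter=1000).fit(X_tr, y_tr)
    print(weight, f1_score(y_te, clf.predict(X_te)))   # F1 on the minority class
```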
Precision: The proportion of true positives out of all predicted positives. High precision reduces false positives.
Recall: The proportion of true positives out of all actual positives. High recall reduces false negatives.
F1-score: The harmonic mean of precision and recall, balancing both metrics in a single measure.
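Worked directly from hypothetical confusion-matrix counts:

```python
tp, fp, fn = 40, 10, 20                              # invented counts
precision = tp / (tp + fp)                           # 0.8
recall = tp / (tp + fn)                              # ~0.667
f1 = 2 * precision * recall / (precision + recall)   # ~0.727
print(precision, recall, f1)
```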
Multicollinearity occurs when independent variables are highly correlated, making it hard to distinguish their effects. It leads to unstable regression coefficients, making predictions less reliable.
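One quick way to spot it is a correlation matrix of the predictors; a sketch with synthetic data (NumPy assumed) where one variable is almost a copy of another:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)   # nearly a linear copy of x1
x3 = rng.normal(size=200)
print(np.round(np.corrcoef([x1, x2, x3]), 2))   # the ~1.0 off-diagonal entry flags trouble
```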
ReLU (Rectified Linear Unit): Outputs max(0, x); it is cheap to compute and helps mitigate vanishing gradients.
Sigmoid: Maps input to a range of 0 to 1, suitable for binary classification.
Tanh: Maps input to a range of -1 to 1, offering better gradient flow than sigmoid in some cases.
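The three functions written out with NumPy, evaluated on a few sample inputs:

```python
import numpy as np

relu = lambda x: np.maximum(0, x)
sigmoid = lambda x: 1 / (1 + np.exp(-x))
tanh = np.tanh

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))      # [0. 0. 2.]
print(sigmoid(x))   # ~[0.12 0.5  0.88]
print(tanh(x))      # ~[-0.96 0.  0.96]
```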
A/B testing compares two versions of a variable to see which performs better. Statistical significance is determined by calculating p-values or confidence intervals, indicating if observed differences are likely due to chance.
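A sketch of a two-proportion z-test for conversion rates (assuming statsmodels is available; the counts are invented):

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

conversions = np.array([120, 150])   # successes in variants A and B
visitors = np.array([2400, 2500])    # sample sizes
stat, p_value = proportions_ztest(conversions, visitors)
print(stat, p_value)   # a small p-value would suggest the difference is not due to chance
```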
Regularization prevents overfitting by adding a penalty term to the loss function.
L1 Regularization (Lasso): Adds absolute weight values, leading to sparse models by zeroing some weights.
L2 Regularization (Ridge): Adds squared weight values, reducing but not eliminating weights, often providing stability.
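A sketch contrasting the two on synthetic data with only a few informative features (scikit-learn assumed):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5, random_state=0)
print(np.round(Lasso(alpha=1.0).fit(X, y).coef_, 2))   # several exact zeros (sparse)
print(np.round(Ridge(alpha=1.0).fit(X, y).coef_, 2))   # shrunk but nonzero
```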
Cross-entropy is a loss function for classification, measuring the difference between actual and predicted probability distributions. It’s widely used in logistic regression and neural networks for binary or multi-class classification.
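Binary cross-entropy written out directly in NumPy as a sketch:

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    y_prob = np.clip(y_prob, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 1])
print(binary_cross_entropy(y_true, np.array([0.9, 0.1, 0.8, 0.7])))   # confident and correct: low loss
print(binary_cross_entropy(y_true, np.array([0.2, 0.9, 0.3, 0.4])))   # mostly wrong: high loss
```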
The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces. As the number of dimensions increases, the volume of the space increases, making data points sparse. This sparsity can lead to challenges in distance-based algorithms like KNN and increase the risk of overfitting, since models may struggle to generalize from limited data in high-dimensional settings.
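A small experiment (NumPy assumed) showing the distance-concentration effect: as dimensions grow, the nearest and farthest points from a query end up almost the same distance away.

```python
import numpy as np

rng = np.random.default_rng(0)
for dims in (2, 10, 100, 1000):
    points = rng.uniform(size=(500, dims))
    dists = np.linalg.norm(points - points[0], axis=1)[1:]   # distances from one point
    print(dims, round(dists.min() / dists.max(), 3))          # ratio creeps toward 1
```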
Ensemble methods combine multiple models to improve overall performance compared to individual models. They are useful because they can reduce variance through bagging, reduce bias through boosting, or improve predictions through stacking. Common ensemble methods include Random Forest (which uses bagging), AdaBoost (which uses boosting), and stacking, where predictions from multiple models are combined to make the final prediction.
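A stacking sketch (assuming scikit-learn's StackingClassifier): two base models whose predictions are combined by a logistic-regression meta-model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("dt", DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
)
print(cross_val_score(stack, X, y, cv=5).mean())
```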
Feature engineering involves creating new input features from existing data to improve model performance. This can include techniques like encoding categorical variables, normalizing numerical values, creating interaction terms, or extracting date components. Good feature engineering can significantly enhance a model’s ability to capture relevant patterns in data.
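A sketch with pandas on an invented table: a date component, a ratio feature, and one-hot encoding of a categorical column.

```python
import pandas as pd

df = pd.DataFrame({
    "signup_date": pd.to_datetime(["2024-01-05", "2024-02-14", "2024-03-20"]),
    "plan": ["free", "pro", "free"],
    "visits": [3, 40, 7],
    "purchases": [1, 12, 2],
})
df["signup_month"] = df["signup_date"].dt.month       # extracted date component
df["purchase_rate"] = df["purchases"] / df["visits"]  # ratio/interaction feature
df = pd.get_dummies(df, columns=["plan"])             # encoded categorical variable
print(df)
```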
A Receiver Operating Characteristic (ROC) curve is a graphical representation of a classifier’s performance across different thresholds. It plots the True Positive Rate (sensitivity) against the False Positive Rate (1 - specificity). The area under the ROC curve (AUC) quantifies the overall performance: an AUC of 0.5 suggests no discriminative power, while an AUC of 1.0 indicates perfect discrimination.
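A minimal sketch (scikit-learn assumed, scores invented) computing the ROC points and the AUC from predicted probabilities:

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5]
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
print(list(zip(fpr, tpr)))              # points along the ROC curve
print(roc_auc_score(y_true, y_prob))    # 0.5 = chance, 1.0 = perfect
```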
Gradient descent is an optimization algorithm used to minimize a loss function in machine learning. It works by iteratively updating model parameters in the direction of the negative gradient of the loss function with respect to those parameters. The process continues until convergence, i.e., until it reaches a point where the loss is (at least locally) minimized.
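A from-scratch sketch (NumPy assumed) fitting y ≈ w*x + b by repeatedly stepping both parameters against the gradient of the mean squared error:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 * x + 2 + rng.normal(0, 1, 100)   # true w = 3, b = 2, plus noise

w, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    error = (w * x + b) - y
    w -= lr * 2 * np.mean(error * x)    # dL/dw
    b -= lr * 2 * np.mean(error)        # dL/db
print(round(w, 2), round(b, 2))          # close to 3 and 2
```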
Hyperparameters are configuration settings used to control the training process of a model but are not learned during training. Examples include the learning rate, the number of trees in a forest, or the number of hidden layers in a neural network. Hyperparameter tuning can be performed using techniques like grid search, random search, or Bayesian optimization to find the optimal combination that improves model performance.
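A grid-search sketch over two hyperparameters, scored with 5-fold cross-validation (scikit-learn assumed; the grid values are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 4, None]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```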
Time series analysis involves analyzing data points collected or recorded at specific time intervals to identify trends, seasonal patterns, and cycles. Common techniques include ARIMA (AutoRegressive Integrated Moving Average), a statistical model for forecasting; Seasonal Decomposition, which breaks a time series into trend, seasonality, and residuals; and Exponential Smoothing, a forecasting method that applies decreasing weights to past observations.
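As a sketch of the simplest of these ideas, simple exponential smoothing written out by hand on invented sales figures; each smoothed value blends the newest observation with the previous smoothed value:

```python
def exponential_smoothing(series, alpha=0.3):
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

sales = [10, 12, 13, 12, 15, 16, 18, 17]
print(exponential_smoothing(sales))   # the last value serves as a one-step-ahead forecast
```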
A primary key uniquely identifies each record in a database table, ensuring no two rows have the same key value, and it cannot contain NULL values. A foreign key, on the other hand, is a field in one table that links to the primary key of another table, establishing a relationship between the two tables. This helps maintain data integrity and allows for relational database operations.
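A sketch using Python's built-in sqlite3 module and invented tables: each table gets a primary key, and orders.customer_id is a foreign key into customers.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")   # SQLite only enforces FKs when this is on
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
con.execute("""CREATE TABLE orders (
                   id INTEGER PRIMARY KEY,
                   customer_id INTEGER NOT NULL REFERENCES customers(id),
                   amount REAL)""")
con.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
con.execute("INSERT INTO orders (customer_id, amount) VALUES (1, 9.99)")       # valid link
try:
    con.execute("INSERT INTO orders (customer_id, amount) VALUES (42, 5.00)")  # no such customer
except sqlite3.IntegrityError as err:
    print("rejected:", err)   # the foreign key preserves integrity
```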
Outliers are data points that differ significantly from other observations. They can indicate variability in the measurement, experimental errors, or novel insights. The treatment of outliers depends on the context: they may be removed to improve model accuracy or investigated to gain valuable insights into unique phenomena.
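A common screening rule is 1.5 times the interquartile range; a NumPy sketch on invented readings:

```python
import numpy as np

data = np.array([10, 11, 12, 12, 13, 12, 11, 14, 13, 95])   # 95 looks suspicious
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])                  # -> [95]
```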
L1 Regularization, also known as Lasso, adds the absolute value of the coefficients to the loss function, promoting sparsity in the model by potentially driving some coefficients to zero; this can be useful for feature selection. L2 Regularization, also known as Ridge, adds the squared value of the coefficients to the loss function, which discourages large coefficients but does not eliminate them. It generally leads to more stable models and helps mitigate multicollinearity.