feature engineering
/ˈfiːtʃər ˌendʒɪˈnɪərɪŋ/ Creating new input variables from raw data
“Feature engineering extracted meaningful signals from the timestamp data.”
Origin: From Latin `factura` (a making) + Old French `engin` (skill), from Latin `ingenium` (cleverness)
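As an illustration of the definition above, here is a minimal sketch that derives a few calendar features from a raw timestamp. The function name `timestamp_features` and the chosen features are assumptions made for this example, not a prescribed recipe.

```python
from datetime import datetime

def timestamp_features(ts: str) -> dict:
    """Derive simple calendar features from an ISO 8601 timestamp string."""
    dt = datetime.fromisoformat(ts)
    return {
        "hour": dt.hour,
        "day_of_week": dt.weekday(),       # Monday = 0, Sunday = 6
        "is_weekend": dt.weekday() >= 5,   # Saturday or Sunday
        "month": dt.month,
    }

features = timestamp_features("2024-03-16T14:30:00")
print(features)
```

Which derived features are useful depends on the problem; hour and weekday are only common starting points.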
dimensionality reduction
/daɪˌmenʃəˈnælɪti rɪˈdʌkʃən/ Reducing the number of variables while preserving as much information as possible
“Dimensionality reduction made the dataset manageable for visualization.”
Origin: From Latin `dimensio` (a measuring), from `dimetiri` (to measure out)
cross-validation
/ˌkrɒs ˌvælɪˈdeɪʃən/ Evaluating a model by repeatedly training on part of the data and testing on the held-out remainder
“Cross-validation gave a more reliable estimate of the model's generalization performance.”
Origin: From Latin `crux` (cross) + `validus` (strong, effective), from `valere` (to be strong)
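The splitting step behind k-fold cross-validation can be sketched as follows. This is a simplified illustration: the helper `k_fold_indices` is hypothetical, it uses contiguous folds without shuffling, and the model training and scoring that would happen for each fold are left out.

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample lands in exactly one test fold, so every data point is used for evaluation once.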
precision
The proportion of positive predictions that are correct
“High precision means few false positives in our spam detection.”
Origin: From Latin `praecisio` (a cutting off), from `praecidere` (to cut off), from `prae-` (before) + `caedere` (to cut)
recall
The proportion of actual positives correctly identified
“High recall ensures we catch most fraudulent transactions.”
Origin: From Latin `re-` (again, back) + `call`, from Old Norse `kalla` (to cry out, summon)
F1 score
/ˌef ˈwʌn ˌskɔːr/ The harmonic mean of precision and recall
“The F1 score balances precision and recall into a single metric.”
Origin: `F` from F-measure, a weighted harmonic mean of precision and recall; `F1` is the case where the weight β equals 1, so both count equally
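Precision, recall, and the F1 score can all be computed from three counts: true positives, false positives, and false negatives. The sketch below assumes those counts are already available; the function name and the choice to return 0.0 when a denominator is zero are conventions picked for this example.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(p, r, f1)
```

Because the harmonic mean is pulled toward the smaller value, F1 here sits closer to the recall of 0.5 than to the precision of 0.8.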
ROC curve
/ˌɑːr oʊ ˈsiː ˌkɜːrv/ A graph of a classifier's true positive rate against its false positive rate as the decision threshold varies
“The ROC curve demonstrated excellent discrimination between classes.”
Origin: Acronym for `Receiver Operating Characteristic`, from signal detection theory in the 1940s
AUC
Area Under the Curve: the area under the ROC curve, summarizing classifier performance across all thresholds
“An AUC of 0.95 indicates excellent predictive ability.”
Origin: Acronym from Latin `area` (open space) + Old English `under` + Latin `curvus` (bent)
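One way to compute the area under the ROC curve without drawing the curve is its pairwise interpretation: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The sketch below assumes binary labels coded as 0 and 1; `auc_from_scores` is a hypothetical helper, and the quadratic loop is meant for clarity rather than speed.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = auc_from_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```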
confusion matrix
/kənˈfjuːʒən ˌmeɪtrɪks/ A table showing prediction results versus actual values
“The confusion matrix revealed the model confused cats with dogs.”
Origin: From Latin `confusio` (mixing together) + `matrix` (womb, breeding female), from `mater` (mother)
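A confusion matrix can be built by counting (actual, predicted) label pairs. The sketch below uses made-up cat/dog labels purely for illustration, with rows for actual classes and columns for predicted classes.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count (actual, predicted) label pairs; each pair is one matrix cell."""
    return Counter(zip(actual, predicted))

# Illustrative labels only, not real data.
actual    = ["cat", "cat", "dog", "dog", "dog"]
predicted = ["cat", "dog", "dog", "dog", "cat"]

cells = confusion_counts(actual, predicted)
labels = sorted(set(actual) | set(predicted))
# Rows are actual classes, columns are predicted classes.
matrix = [[cells[(a, p)] for p in labels] for a in labels]
print(labels, matrix)
```

The diagonal holds the correct predictions; the off-diagonal cells show which classes are confused with which.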
bias-variance tradeoff
/ˌbaɪəs ˈveəriəns ˌtreɪdɒf/ The balance between underfitting (high bias) and overfitting (high variance)
“Understanding the bias-variance tradeoff guides model complexity decisions.”
Origin: From French `biais` (slant) + Latin `variare` (to change) + Middle Low German `trade` (track, course; cognate with Old English `tredan`, to tread) + Old English `of` (away, off)
regularization
/ˌreɡjʊləraɪˈzeɪʃən/ Techniques to prevent overfitting by penalizing complexity
“Regularization prevented the model from memorizing training data.”
Origin: From Latin `regula` (rule, straight piece of wood) + `-ization` suffix
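To show how a complexity penalty acts, here is a minimal sketch of an L2 (ridge) penalty on a one-variable linear fit with no intercept. Minimizing sum((y - w*x)^2) + lam * w^2 gives w = sum(x*y) / (sum(x*x) + lam). The function name and the toy data are assumptions for this example.

```python
def ridge_slope(xs, ys, lam: float) -> float:
    """Slope w of y ≈ w * x (no intercept) minimizing squared error plus lam * w**2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs, ys = [1, 2, 3], [2, 4, 6]
unpenalized = ridge_slope(xs, ys, lam=0.0)   # ordinary least squares
penalized = ridge_slope(xs, ys, lam=14.0)    # penalty shrinks the slope toward 0
print(unpenalized, penalized)
```

A larger `lam` shrinks the slope further, trading some fit on the training data for a simpler model.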
normalization
/ˌnɔɹməɫɪˈzeɪʃən/ Scaling data to a standard range
“Normalization ensured all features contributed equally to the model.”
Origin: From Latin `norma` (carpenter's square, rule) + `-ization` suffix
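One common form of normalization is min-max scaling, which maps values onto the range 0 to 1. The sketch below is one possible implementation; mapping a constant column to all zeros is a convention chosen here to avoid dividing by zero.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no spread to scale, so return zeros by convention.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([10, 20, 40])
print(scaled)
```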
imputation
/ˌɪmpjəˈteɪʃən/ Filling in missing data values
“Mean imputation replaced missing values with column averages.”
Origin: From Latin `imputare` (to reckon, charge), from `in-` (in) + `putare` (to reckon, think)
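Mean imputation, as in the example sentence above, can be sketched in a few lines. This version assumes missing values are represented as `None`; real datasets may use other markers such as NaN.

```python
def impute_mean(column):
    """Replace None entries with the mean of the observed (non-None) values."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

filled = impute_mean([2.0, None, 4.0, None])
print(filled)
```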
outlier detection
/ˈaʊtlaɪər dɪˌtekʃən/ Identifying data points that differ significantly from others
“Outlier detection flagged suspicious transactions for review.”
Origin: From `out` (Old English `ut`) + `lie` (Old English `licgan`) + Latin `detectio` (uncovering)
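A simple rule of thumb for outlier detection flags values more than 1.5 interquartile ranges outside the middle half of the data. The sketch below is one of several possible rules, not the only definition of an outlier; the factor 1.5 and the sample amounts are illustrative assumptions.

```python
import statistics

def iqr_outliers(values, factor: float = 1.5):
    """Return values lying more than factor * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]

# Illustrative amounts only, not real data.
amounts = [10, 12, 11, 13, 12, 11, 95]
flagged = iqr_outliers(amounts)
print(flagged)
```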
clustering
Grouping similar data points together
“Clustering revealed three distinct customer segments.”
Origin: From Old English `cluster` (bunch, group), related to `clot`
classification
/ˌkɫæsəfəˈkeɪʃən/ Predicting which category a data point belongs to
“Classification determines whether an email is spam or legitimate.”
Origin: From Latin `classis` (class, division) + `facere` (to make)
regression
Predicting a continuous numerical value
“Regression models forecast next quarter's sales figures.”
Origin: From Latin `regredi` (to go back), from `re-` (back) + `gradi` (to step, walk)
time series
/ˈtaɪm ˌsɪəriːz/ Data points indexed in time order
“Time series analysis detected seasonal patterns in demand.”
Origin: From Old English `tima` (time) + Latin `series` (row, chain), from `serere` (to join)
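Because a time series is ordered, many of its operations work on a sliding window of consecutive points. As a small illustration, the sketch below computes a simple moving average, a common way to smooth short-term noise; the helper name and the sample values are assumptions for this example.

```python
def moving_average(series, window: int):
    """Average each run of `window` consecutive points, in time order."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

smoothed = moving_average([3, 5, 7, 9, 11], window=3)
print(smoothed)
```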
anomaly detection
/əˈnɒməli dɪˌtekʃən/ Identifying unusual patterns that don't conform to expected behavior
“Anomaly detection caught the security breach within minutes.”
Origin: From Greek `anomalia` (unevenness), from `an-` (not) + `homalos` (even)
ETL
Extract, Transform, Load: the process of pulling data from source systems, reshaping it, and loading it into a target store
“The ETL pipeline processes millions of records nightly.”
Origin: Acronym from Latin `extractus` (drawn out) + `transformare` (change shape) + Old English `hladan` (to load)
pipeline
A sequence of processing steps in which the output of each step becomes the input to the next
“The data pipeline cleans and transforms the raw input.”
Origin: From `pipe` (Old English `pipe` from Latin `pipare` to chirp) + `line` (Latin `linea`)
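A pipeline can be modeled as a list of functions applied in order, each consuming the previous step's output. The sketch below is a toy illustration; `make_pipeline`, `drop_blanks`, and `normalize_case` are hypothetical names invented for this example, not the API of any particular library.

```python
from functools import reduce

def make_pipeline(*steps):
    """Compose steps so each one receives the previous step's output."""
    def run(data):
        return reduce(lambda acc, step: step(acc), steps, data)
    return run

def drop_blanks(rows):
    """Cleaning step: remove rows that are empty or whitespace only."""
    return [r for r in rows if r.strip()]

def normalize_case(rows):
    """Transform step: trim whitespace and lowercase each row."""
    return [r.strip().lower() for r in rows]

clean = make_pipeline(drop_blanks, normalize_case)
result = clean(["  Alice ", "", "BOB", "   "])
print(result)
```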
visualization
/ˌvɪʒwəɫəˈzeɪʃən/ The representation of an object, situation, or set of information as a chart or other image
“Data visualization helps in identifying trends.”
Origin: From Latin `visualis` (of sight), from `visus` (sight), from `videre` (to see)