Overview
1. Feature Engineering Basics
2. Embedding
3. Training & Optimization Tricks
4. Coding and Analysis Principles
Feature Engineering Basics
Discretizer: bin continuous data into intervals (see the sketch after this list).
1. Uniform – all bins in each feature have identical widths.
2. Quantile – all bins in each feature contain the same number of points.
3. KMeans – values in each bin share the same nearest center of a 1D k-means clustering.
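All three strategies are implemented in scikit-learn's KBinsDiscretizer; a minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.random.exponential(scale=2.0, size=(1000, 1))  # a skewed continuous feature

for strategy in ("uniform", "quantile", "kmeans"):
    disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy=strategy)
    X_binned = disc.fit_transform(X)
    # bin_edges_ shows how each strategy cuts the same feature differently
    print(strategy, disc.bin_edges_[0].round(2))
```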
Binarizer: encode categorical features (sketch below).
OneHotEncoder, OrdinalEncoder, feature hashing (the hashing trick), etc.
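A quick comparison of the three encoders in scikit-learn:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame({"city": ["tokyo", "paris", "tokyo", "london"]})

onehot = OneHotEncoder(handle_unknown="ignore").fit_transform(df[["city"]])
ordinal = OrdinalEncoder().fit_transform(df[["city"]])

# Hashing trick: fixed output width, no fitted vocabulary, collisions possible.
hasher = FeatureHasher(n_features=8, input_type="string")
hashed = hasher.transform(df["city"].map(lambda v: [v]))
```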
Scaler
Linear Scaler
- Min-Max, MaxAbs, Normalize, Robust (clip with a quantile range, e.g. the IQR).
Non-Linear Scaler
- Gaussian-like scalers: Box-Cox, Yeo-Johnson, RankGauss (sketch below).
- Distribution-like scalers: use the inverse CDF of a target distribution to transform cdf(x).
- Non-linear transforms such as Box-Cox or Yeo-Johnson may also be useful on the target y: they add diversity for ensembling, sometimes with better performance.
- Linear transforms such as standard scaling usually work better for X.
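A minimal scikit-learn sketch of these scalers; RankGauss is approximated here with QuantileTransformer, which rank-transforms and maps through the normal inverse CDF:

```python
import numpy as np
from sklearn.preprocessing import (
    MinMaxScaler, RobustScaler, PowerTransformer, QuantileTransformer
)

X = np.random.lognormal(size=(1000, 1))  # a heavy-tailed feature

X_minmax = MinMaxScaler().fit_transform(X)
X_robust = RobustScaler(quantile_range=(5.0, 95.0)).fit_transform(X)

# Gaussian-like scalers
X_yj = PowerTransformer(method="yeo-johnson").fit_transform(X)
X_rankgauss = QuantileTransformer(output_distribution="normal").fit_transform(X)
```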
Embedding
Learn a lookup table as the category representation.
Category Embedding / Entity Embedding (PyTorch sketch below)
Numerical Embedding
- Bin numerical data into intervals.
- Treat each bin as a category and embed it.
- Add vicinal (neighboring-bin) information.
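A minimal entity-embedding sketch in PyTorch; the dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class EntityEmbedding(nn.Module):
    def __init__(self, n_categories: int, emb_dim: int = 8):
        super().__init__()
        # Lookup table: one learnable vector per category (or numeric bin).
        self.emb = nn.Embedding(n_categories, emb_dim)

    def forward(self, cat_ids: torch.Tensor) -> torch.Tensor:
        return self.emb(cat_ids)

# A binned numerical feature is used exactly like a categorical one:
bin_ids = torch.tensor([0, 3, 3, 9])                 # e.g. discretizer output
vectors = EntityEmbedding(n_categories=10)(bin_ids)  # shape (4, 8)
```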
Key-Value Based (pandas sketch below)
- df.groupby(B)[A].agg(func); func can be anything that maps a subgroup of A to a scalar, F(subgroup(A)) -> x.
- Common aggregate funcs: mean, std, skew, kurtosis, entropy, min, max, median, frequency, size, etc.
- Residuals and weights of least-squares fits work well on sequential values.
- Higher-order interactions: aggregate an aggregate, i.e. nested groupby().
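A pandas sketch of key-value aggregation features, including a custom least-squares func (column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "user": ["a", "a", "b", "b", "b"],
    "amount": [10.0, 30.0, 5.0, 7.0, 9.0],
})

# One statistic per key, merged back onto each row as a feature.
agg = df.groupby("user")["amount"].agg(["mean", "std", "max", "size"])
agg.columns = [f"user_amount_{c}" for c in agg.columns]
df = df.merge(agg.reset_index(), on="user", how="left")

# Custom func: slope of a least-squares fit over each key's value sequence.
slope = df.groupby("user")["amount"].agg(
    lambda s: np.polyfit(np.arange(len(s)), s.to_numpy(), 1)[0]
)
```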
Latent Representation
Use decomposition algorithms to extract latent features. They tend to work better in neural networks than in GBDT. Simply fit all handcrafted features with different decomposition algorithms, as sketched below.
- Topic models (LDA, SVD, NMF, etc.)
- Manifold learning (t-SNE)
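A minimal sketch stacking several decompositions over a handcrafted-feature block:

```python
import numpy as np
from sklearn.decomposition import NMF, TruncatedSVD, LatentDirichletAllocation

X = np.abs(np.random.randn(500, 50))  # stand-in for non-negative handcrafted features

svd_feats = TruncatedSVD(n_components=10).fit_transform(X)
nmf_feats = NMF(n_components=10, init="nndsvda", max_iter=500).fit_transform(X)
lda_feats = LatentDirichletAllocation(n_components=10).fit_transform(X)

# Concatenate the latent blocks as extra inputs (usually for NN models).
X_latent = np.hstack([svd_feats, nmf_feats, lda_feats])
```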
Symbolic Learning
Use a set of symbolic operators (+ - * / cos sin tan ^ log min max neg …) as the function set to represent a relationship (sketch below).
- Totally different interaction features.
- Automated feature learning.
- Works in many natural-science settings, such as astronomical, geological, and biological tasks.
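One common implementation is genetic programming, e.g. gplearn's SymbolicTransformer; a sketch with illustrative hyperparameters:

```python
import numpy as np
from gplearn.genetic import SymbolicTransformer

X = np.random.randn(500, 5)
y = X[:, 0] * X[:, 1] + np.sin(X[:, 2])  # hidden symbolic relationship

st = SymbolicTransformer(
    generations=10,
    population_size=1000,
    function_set=("add", "sub", "mul", "div", "sin", "cos", "log", "min", "max", "neg"),
    n_components=5,   # number of evolved features to keep
    random_state=0,
)
X_sym = st.fit_transform(X, y)  # evolved interaction features
```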
Auto Encoders
- Learn to reconstruct clean samples from corrupted ones (denoising; sketch below).
- Regularize the latent space to match a prior (usually a multivariate normal distribution, as in a VAE).
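A minimal denoising autoencoder sketch in PyTorch (layer sizes and noise level are illustrative):

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self, n_features: int, latent_dim: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = DenoisingAE(n_features=30)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(256, 30)                  # clean batch
x_noisy = x + 0.1 * torch.randn_like(x)   # corrupted input
loss = nn.functional.mse_loss(model(x_noisy), x)  # reconstruct the clean x
loss.backward()
opt.step()
# After training, model.encoder(x) serves as a latent feature extractor.
```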
Graphs
- FeatureA – FeatureB – aggregate: almost anything can be turned into a graph.
- Maximize the likelihood of preserving the network neighborhoods of nodes.
- Unsupervised learning methods: DeepWalk, node2vec (sketch below).
- Semi-supervised or supervised learning: GCN.
- DeepWalk and node2vec work in many tasks, such as CTR prediction and fraud detection.
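A minimal DeepWalk-style sketch (uniform random walks fed to skip-gram Word2Vec; walk counts and dimensions are illustrative):

```python
import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.karate_club_graph()  # stand-in for a graph built from feature co-occurrence

def random_walk(g, start, length=10):
    walk = [start]
    for _ in range(length - 1):
        neighbors = list(g.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return [str(n) for n in walk]

walks = [random_walk(G, n) for n in G.nodes() for _ in range(10)]
# Skip-gram over walks maximizes the likelihood of each node's walk neighborhood.
w2v = Word2Vec(walks, vector_size=32, window=5, sg=1, min_count=1)
node_vec = w2v.wv["0"]  # embedding of node 0, usable as a downstream feature
```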
End2End Model
- Online learning models: FTRL, mini-batch FM, etc.
- DeepFM, DeepFFM, xDeepFM, etc.
- Attention-based interaction models.
Target Mean Encoding
- Leave-one-out / K-fold / historical sliding-window schemes to avoid overfitting (K-fold sketch below).
- Posterior-prior (smoothing) scheme.
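A K-fold target-encoding sketch with smoothing toward the global prior (prior_weight is an illustrative hyperparameter):

```python
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(df, cat_col, target_col, n_splits=5, prior_weight=10.0):
    """Out-of-fold target mean encoding, smoothed toward the global prior."""
    encoded = pd.Series(index=df.index, dtype=float)
    prior = df[target_col].mean()
    for tr_idx, val_idx in KFold(n_splits, shuffle=True, random_state=0).split(df):
        stats = df.iloc[tr_idx].groupby(cat_col)[target_col].agg(["mean", "size"])
        # Small categories are pulled toward the prior to reduce overfitting.
        smooth = (stats["mean"] * stats["size"] + prior * prior_weight) \
                 / (stats["size"] + prior_weight)
        encoded.iloc[val_idx] = df.iloc[val_idx][cat_col].map(smooth).fillna(prior).to_numpy()
    return encoded

df = pd.DataFrame({"cat": list("aabbbcc"), "y": [1, 0, 1, 1, 0, 0, 1]})
df["cat_te"] = kfold_target_encode(df, "cat", "y")
```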
Model-Based Target Encoding
- Use a base model to predict the target with k-folds; use the out-of-fold probabilities as features (sketch below).
- Use a base model to predict residuals (target - probs).
- Sometimes using level-N stacking probabilities works better.
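A sketch of out-of-fold base-model probabilities used as features (the base model here is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Out-of-fold probabilities from a base model, k-folded to avoid leakage.
oof_prob = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
)[:, 1]

residual = y - oof_prob                     # target for a residual model
X_stacked = np.column_stack([X, oof_prob])  # probs appended as a feature
```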
Training & Optimization Tricks
Common Tricks
- Learning rate schedule with linear scaling: for SGD, a reference lr of 0.1 at batch size 256 scales as lr = 0.1 × batch_size / 256 (sketch below).
- Label smoothing.
- No bias decay.
- Initialize the bias with the prior distribution.
- Focal loss is a good measure for estimating a model's capability.
- Combining multiple losses lets the model converge rapidly toward better performance.
- Use a standard scaler transform first.
- Sometimes useful, but it all depends on the data.
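A PyTorch sketch combining several of these tricks: linear lr scaling, no bias decay, warm-up plus cosine decay, and label smoothing (epoch counts and hyperparameters are illustrative):

```python
import math
import torch

batch_size = 1024
base_lr = 0.1 * batch_size / 256   # linear scaling rule

model = torch.nn.Linear(10, 2)
# No bias decay: weight decay only on weight tensors, not biases.
params = [
    {"params": [p for n, p in model.named_parameters() if "bias" not in n],
     "weight_decay": 1e-4},
    {"params": [p for n, p in model.named_parameters() if "bias" in n],
     "weight_decay": 0.0},
]
opt = torch.optim.SGD(params, lr=base_lr, momentum=0.9)

warmup, total = 5, 100   # epochs
def lr_lambda(epoch):
    if epoch < warmup:
        return (epoch + 1) / warmup              # linear warm-up
    t = (epoch - warmup) / (total - warmup)
    return 0.5 * (1 + math.cos(math.pi * t))     # cosine decay

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
loss_fn = torch.nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing
```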
NLP Task Training Tricks
- Diverse pre-trained embeddings.
- More than 90% of a model's complexity can reside in the embedding layer.
- Contextual-level embeddings always help.
- But now we have BERT.
- Dynamic padding boosts training speed and model performance (sketch below).
- Add test OOV vocabulary to the embedding matrix.
- Replace OOV tokens with "something".
- AdamW really works in many NLP tasks.
- Spatial dropout and shuffle noise in classification tasks.
- Translation augmentation: En-Fr, En-Fr-En (back-translation).
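A minimal dynamic-padding collate function for PyTorch DataLoaders; each batch is padded only to its own longest sequence:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def dynamic_pad_collate(batch):
    """Pad each batch to its own max length instead of a global max length."""
    seqs, labels = zip(*batch)
    padded = pad_sequence([torch.as_tensor(s) for s in seqs],
                          batch_first=True, padding_value=0)
    return padded, torch.as_tensor(labels)

batch = [([5, 2, 9], 1), ([7, 1], 0), ([3, 3, 3, 3], 1)]
x, y = dynamic_pad_collate(batch)  # x.shape == (3, 4): padded per batch
# Use as DataLoader(dataset, batch_size=..., collate_fn=dynamic_pad_collate).
```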
CV Task Training Tricks
- ResNet is always a good baseline model for experiments.
- Warm-up learning rate + cosine learning rate decay.
- Manual learning rate schedules.
- Accumulate gradients across batches with different losses.
- Use more shape-relevant augmentation.
- CNNs tend to learn texture information rather than shape information.
- Auto augmentation (sketch below).
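An illustrative torchvision pipeline with standard crops/flips plus the learned AutoAugment policy:

```python
import torchvision.transforms as T

# ImageNet-style training transforms; AutoAugment applies a learned policy.
train_tf = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.AutoAugment(T.AutoAugmentPolicy.IMAGENET),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```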
Pseudo Learning
- Filter hard examples in the training dataset using OOF predictions.
- Global variance information leakage.
- Add the most confident test predictions to the training dataset.
Real Tricks
- Test distribution leakage, data extension.
- Simple mini-batch training models.
- Use a single model to label the test dataset.
- Retrain the model, adding 10~30% pseudo-labeled test data to each batch (sketch below).
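A minimal pseudo-labeling sketch; the base model and the 20% confidence cutoff are illustrative choices within the 10~30% range above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X_train, y_train = make_classification(n_samples=1000, random_state=0)
X_test, _ = make_classification(n_samples=500, random_state=1)

model = GradientBoostingClassifier().fit(X_train, y_train)
prob = model.predict_proba(X_test)[:, 1]

# Keep only the most confident test predictions as pseudo-labels.
conf = np.maximum(prob, 1 - prob)
keep = np.argsort(conf)[-int(0.2 * len(X_test)):]

X_aug = np.vstack([X_train, X_test[keep]])
y_aug = np.concatenate([y_train, (prob[keep] > 0.5).astype(int)])
model = GradientBoostingClassifier().fit(X_aug, y_aug)  # retrain with pseudo labels
```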
Coding and Analysis Principles
1. Logical rigor (clear code architecture, automated frameworks, MECE feature engineering)
2. Code reusability (don't reinvent the wheel)
3. Business-grounded analysis (bad-case analysis, Kernels, Discussions, similar competitions)
4. Reflection and summarization
Please credit the source when reposting: goldandrabbit.github.io