Tricks in Data Mining Competitions

Overview

1. Feature Engineering Basics
2. Embedding
3. Training & Optimization Tricks
4. Coding and Analysis Principles

Feature Engineering Basics

Discretizer: bin continuous data into intervals (a sketch follows the list).

1. Uniform – all bins in each feature have identical widths.
2. Quantile – all bins in each feature contain the same number of points.
3. KMeans – values in each bin share the same nearest center of a 1D k-means clustering.
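A minimal sketch of all three strategies, assuming scikit-learn's KBinsDiscretizer (the data here is made up for illustration):

    import numpy as np
    from sklearn.preprocessing import KBinsDiscretizer

    X = np.random.exponential(scale=2.0, size=(1000, 1))  # a skewed continuous feature

    for strategy in ("uniform", "quantile", "kmeans"):
        # encode='ordinal' returns bin indices; 'onehot' would return a sparse matrix
        disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy=strategy)
        X_binned = disc.fit_transform(X)
        print(strategy, "bin edges:", disc.bin_edges_[0])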

Binarizer: encode categorical features

OneHotEncoder, OrdinalEncoder, feature hashing, etc.
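For instance, with scikit-learn (FeatureHasher stands in here for the hashing-style encoder; the toy column is made up):

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
    from sklearn.feature_extraction import FeatureHasher

    df = pd.DataFrame({"city": ["beijing", "shanghai", "beijing", "shenzhen"]})

    # One-hot: one binary column per category; unseen categories become all-zero
    onehot = OneHotEncoder(handle_unknown="ignore").fit_transform(df[["city"]])

    # Ordinal: map each category to an integer id
    ordinal = OrdinalEncoder().fit_transform(df[["city"]])

    # Hashing: fixed-size output with no fitted vocabulary (good for high cardinality)
    hashed = FeatureHasher(n_features=8, input_type="string").transform(
        df["city"].map(lambda v: [v])
    )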

Scaler

Linear Scaler

  • Min-Max, MaxAbs, Normalize, Robust (clip with a quantile range, e.g. the IQR)

Non-Linear Scaler

  • Gaussian-like scalers (see the sketch after this list):
    Box-Cox / Yeo-Johnson.
    RankGauss.
  • Distribution-matching scalers:
    use the inverse CDF (icdf) of a target distribution to transform the empirical CDF of X.
  • Non-linear transforms such as Box-Cox and Yeo-Johnson may be useful for transforming y; they add diversity for ensembles and sometimes improve performance.
  • Linear transforms such as standard normalization usually work better for X.
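A sketch of the Gaussian-like transforms with scikit-learn; QuantileTransformer with a normal output distribution is essentially RankGauss:

    import numpy as np
    from sklearn.preprocessing import PowerTransformer, QuantileTransformer

    X = np.random.lognormal(size=(1000, 1))  # heavily right-skewed values

    # Box-Cox requires strictly positive inputs; Yeo-Johnson handles any sign
    boxcox = PowerTransformer(method="box-cox").fit_transform(X)
    yeojohnson = PowerTransformer(method="yeo-johnson").fit_transform(X)

    # RankGauss: rank the values, then map the ranks through the normal icdf
    rankgauss = QuantileTransformer(
        n_quantiles=100, output_distribution="normal"
    ).fit_transform(X)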

Embedding

Learn a lookup table as the category representation

Category Embedding / Entity Embedding

Numerical Embedding

  • Bin numerical data into intervals.
  • Treat the bins as categories and embed them.
  • Add vicinal information (e.g., neighboring bins); see the sketch below.
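A minimal PyTorch sketch of the bin-then-embed idea (the bin count and embedding size are arbitrary choices):

    import torch
    import torch.nn as nn

    n_bins, emb_dim = 32, 8
    embedding = nn.Embedding(n_bins, emb_dim)   # the learned lookup table

    x = torch.rand(16)                          # a batch of one numerical feature
    # Equal-width bins over [0, 1); bucketize maps each value to its bin index
    edges = torch.linspace(0, 1, n_bins + 1)[1:-1]
    bin_ids = torch.bucketize(x, edges)
    vec = embedding(bin_ids)                    # (16, 8) dense representation
    # Vicinal info could be added by also averaging neighboring bin embeddings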

Key – Value Based

  • df.groupby(B)[A].agg(func); func can be anything, F(subgroup of A) -> x (see the sketch after this list).
  • Common aggregate funcs: mean, std, skew, kurt, entropy, min, max, median, frequency, size, etc.
  • Residuals and coefficients of least-squares fits work well on sequential values.
  • Higher-order interactions: nested group-bys, groupby().groupby().
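For instance, in pandas (the column names are made up for illustration):

    import pandas as pd

    df = pd.DataFrame({
        "user_id": [1, 1, 2, 2, 2],
        "amount":  [10.0, 30.0, 5.0, 7.0, 9.0],
    })

    # Key-value features: aggregate `amount` within each `user_id` group
    agg = df.groupby("user_id")["amount"].agg(
        ["mean", "std", "min", "max", "median", "size", "skew"]
    )
    df = df.merge(agg.add_prefix("amount_"), left_on="user_id",
                  right_index=True, how="left")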

Latent Representation

Use decomposition algorithms to extract latent features. These tend to work better with neural networks than with GBDTs. Simply fit all handcrafted features with different decomposition algorithms; a sketch follows the list.

  • Topic models (LDA, SVD, NMF, etc.).
  • Manifold learning (t-SNE).
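A sketch with scikit-learn, stacking the outputs of several decomposition algorithms into one latent feature block:

    import numpy as np
    from sklearn.decomposition import NMF, TruncatedSVD, LatentDirichletAllocation

    X = np.abs(np.random.randn(500, 100))  # NMF and LDA require non-negative input

    latent = np.hstack([
        TruncatedSVD(n_components=10, random_state=0).fit_transform(X),
        NMF(n_components=10, random_state=0, max_iter=500).fit_transform(X),
        LatentDirichletAllocation(n_components=10, random_state=0).fit_transform(X),
    ])  # (500, 30) latent features for a downstream model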

Symbolic Learning

Use symbolic operators (+ - * / cos sin tan ^ log min max neg ...) as the function set to represent a relationship.

  • Produces interaction features totally different from standard ones.
  • Automated feature learning (see the sketch below).
  • Works well in many natural-science domains, such as astronomical, geological, and biological tasks.
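One way to automate this is genetic programming, for example with the gplearn library (a sketch; the hyperparameters are illustrative, not tuned):

    import numpy as np
    from gplearn.genetic import SymbolicTransformer

    X, y = np.random.randn(500, 10), np.random.randn(500)

    # Evolve formulas over the function set that correlate with the target
    st = SymbolicTransformer(
        generations=10,
        population_size=1000,
        n_components=10,
        function_set=("add", "sub", "mul", "div", "sqrt", "log", "min", "max", "neg"),
        random_state=0,
    )
    symbolic_features = st.fit_transform(X, y)  # (500, 10) new interaction features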

Auto Encoders

  • Learn to reconstruct clean samples from corrupted ones (denoising).
  • Regularize the latent space to match a prior (usually a multivariate normal distribution).
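A minimal PyTorch sketch of the denoising variant on tabular features (the architecture and noise level are arbitrary):

    import torch
    import torch.nn as nn

    class DenoisingAE(nn.Module):
        def __init__(self, n_features, n_latent=16):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                                         nn.Linear(64, n_latent))
            self.decoder = nn.Sequential(nn.Linear(n_latent, 64), nn.ReLU(),
                                         nn.Linear(64, n_features))

        def forward(self, x):
            noisy = x + 0.1 * torch.randn_like(x)  # corrupt the input
            z = self.encoder(noisy)                # latent features for downstream use
            return self.decoder(z)

    model = DenoisingAE(n_features=32)
    x = torch.randn(8, 32)
    loss = nn.functional.mse_loss(model(x), x)     # reconstruct the clean input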

Graphs

  • FeatureA – FeatureB – aggregate: almost anything can be used to construct a graph.
  • Maximize the likelihood of preserving network neighborhoods of nodes.
  • Unsupervised learning methods: DeepWalk, node2vec.
  • Semi-supervised or supervised learning: GCN.
  • DeepWalk and node2vec work in many tasks, like CTR prediction and fraud detection (see the sketch below).
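A compact DeepWalk-style sketch, assuming networkx for the graph and gensim for the skip-gram model (the graph is a toy stand-in):

    import random
    import networkx as nx
    from gensim.models import Word2Vec

    G = nx.karate_club_graph()  # stand-in for e.g. a user-item interaction graph

    def random_walk(g, start, length=10):
        walk = [start]
        while len(walk) < length:
            neighbors = list(g.neighbors(walk[-1]))
            if not neighbors:
                break
            walk.append(random.choice(neighbors))
        return [str(n) for n in walk]

    walks = [random_walk(G, n) for _ in range(10) for n in G.nodes()]

    # Skip-gram (sg=1) over walks: nodes that co-occur get similar vectors
    model = Word2Vec(walks, vector_size=64, window=5, min_count=0, sg=1, epochs=5)
    node_vec = model.wv["0"]  # the 64-d embedding of node 0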

End2End Models

  • Online learning models: FTRL, mini-batch FM, etc.
  • DeepFM, DeepFFM, xDeepFM, etc. (see the FM sketch after this list).
  • Attention-based interaction models.
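For flavor, the second-order FM interaction at the core of FM/DeepFM-style models fits in a few lines of PyTorch (a sketch, not a full model):

    import torch
    import torch.nn as nn

    class FMInteraction(nn.Module):
        # Second-order FM term: 0.5 * ((sum_i v_i)^2 - sum_i v_i^2), summed over dims
        def forward(self, v):             # v: (batch, n_fields, emb_dim)
            sum_sq = v.sum(dim=1) ** 2    # square of the sum
            sq_sum = (v ** 2).sum(dim=1)  # sum of the squares
            return 0.5 * (sum_sq - sq_sum).sum(dim=1, keepdim=True)

    emb = torch.randn(8, 10, 16)     # 8 samples, 10 fields, 16-d embeddings
    fm_term = FMInteraction()(emb)   # (8, 1) pairwise-interaction score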

Target Mean Encoding

  • Leave-one-out / K-fold / sliding history-window schemes to avoid overfitting (see the sketch after this list).
  • Posterior-prior smoothing scheme (blend the category posterior with the global prior).
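A K-fold target-encoding sketch with pandas and scikit-learn (the toy data is made up; categories unseen in a fold fall back to the prior):

    import pandas as pd
    from sklearn.model_selection import KFold

    df = pd.DataFrame({
        "cat":    ["a", "b", "a", "b", "a", "c", "b", "a"],
        "target": [1, 0, 1, 1, 0, 1, 0, 1],
    })

    prior = df["target"].mean()
    df["cat_te"] = prior
    for tr_idx, val_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(df):
        # Category means computed only on the training fold, applied out-of-fold
        fold_means = df.iloc[tr_idx].groupby("cat")["target"].mean()
        df.loc[df.index[val_idx], "cat_te"] = (
            df.iloc[val_idx]["cat"].map(fold_means).fillna(prior).values
        )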

Model based target encoding

  • Train a base model to predict the target with K folds; use the out-of-fold probabilities as features (see the sketch after this list).
  • Train a base model to predict the residuals (target - probs).
  • Sometimes, using level-N stacking probabilities works better.
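With scikit-learn, the out-of-fold probabilities can be produced directly (a sketch with a stand-in base model):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    X, y = np.random.randn(500, 10), np.random.randint(0, 2, 500)

    # Each sample is scored by a model that never saw it, so the
    # probabilities can safely be fed to a second-level model as features
    oof_probs = cross_val_predict(
        LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
    )[:, 1]
    residuals = y - oof_probs  # target for a residual-predicting base model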

Training & Optimization Tricks

Common Tricks

  • Learning-rate schedule: with SGD, scale the learning rate linearly with batch size, lr = 0.1 × batch_size / 256 (baseline lr = 0.1 at batch size 256).
  • Label smoothing.
  • No bias decay (see the sketch after this list).
  • Initialize the bias with the prior distribution.
  • Focal loss is a good probe for estimating a model's capability on hard examples.
  • Combining multiple losses lets the model converge rapidly toward better performance.
  • Apply a standard-scaler transform first.
  • These are sometimes useful, but it all depends on the data.
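For example, the no-bias-decay trick in PyTorch puts biases and normalization parameters into a weight-decay-free group (a sketch; the model is a stand-in):

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(32, 64), nn.BatchNorm1d(64), nn.Linear(64, 1))

    decay, no_decay = [], []
    for name, p in model.named_parameters():
        # Biases and 1-d (norm-layer) parameters get no weight decay
        (no_decay if p.ndim == 1 or name.endswith(".bias") else decay).append(p)

    optimizer = torch.optim.SGD(
        [{"params": decay, "weight_decay": 1e-4},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=0.1, momentum=0.9,
    )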

NLP Task Training Tricks

  • Diverse pre-trained embeddings.
  • More than 90% of a model's complexity resides in the embedding layer.
  • Contextual embeddings always help.
  • But now we have BERT.
  • Dynamic padding boosts training speed and model performance (see the sketch after this list).
  • Add test-set OOV vocabulary to the embedding matrix.
  • Replace OOV tokens with "something".
  • AdamW really works in many NLP tasks.
  • Spatial dropout and shuffle noise in classification tasks.
  • Back-translation augmentation: En-Fr, En-Fr-En.
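Dynamic padding amounts to a custom collate_fn that pads each batch only to its own longest sequence (a PyTorch sketch with toy data):

    import torch
    from torch.nn.utils.rnn import pad_sequence
    from torch.utils.data import DataLoader

    def collate(batch):
        seqs, labels = zip(*batch)
        # Pad to the longest sequence in this batch, not in the whole dataset
        return (pad_sequence(list(seqs), batch_first=True, padding_value=0),
                torch.stack(list(labels)))

    data = [(torch.randint(1, 100, (n,)), torch.tensor(0.0)) for n in (5, 12, 7, 20)]
    loader = DataLoader(data, batch_size=2, collate_fn=collate)
    for x, y in loader:
        print(x.shape)  # (2, 12) then (2, 20)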

CV Task Training Tricks

  • ResNet is always a good baseline model for experiments.
  • Warm-up learning rate + cosine learning-rate decay (see the sketch after this list).
  • Manual learning-rate schedules.
  • Accumulate gradients over batches with different losses.
  • Use more shape-relevant augmentations.
  • CNNs tend to learn texture information rather than shape information.
  • Auto-augmentation.
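Warm-up plus cosine decay via LambdaLR in PyTorch (a sketch; the epoch counts are illustrative):

    import math
    import torch

    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    warmup_epochs, total_epochs = 5, 100

    def lr_lambda(epoch):
        if epoch < warmup_epochs:                        # linear warm-up
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
        return 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay to 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    for epoch in range(total_epochs):
        # ... train one epoch ...
        scheduler.step()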

Pseudo Learning

  • Filter hard examples in the training dataset using OOF predictions.
  • Beware of global variance information leakage.
  • Add the most confident test predictions to the training dataset (see the sketch below).
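The confident-predictions variant in a few lines of scikit-learn (the threshold and model are illustrative):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X_train, y_train = np.random.randn(500, 10), np.random.randint(0, 2, 500)
    X_test = np.random.randn(200, 10)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    probs = model.predict_proba(X_test)[:, 1]

    # Keep only the most confident test predictions as pseudo-labels
    confident = (probs > 0.95) | (probs < 0.05)
    X_aug = np.vstack([X_train, X_test[confident]])
    y_aug = np.concatenate([y_train, (probs[confident] > 0.5).astype(int)])
    model = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)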

Real Tricks

  • Test-distribution leakage; extend the data with it.
  • Simple to apply in batch-trained models.
  • Use a single model to label the test dataset.
  • Retrain the model, adding 10~30% pseudo-labeled test data to each batch.

Coding and Analysis Principles

1. Logical rigor (a clear code architecture, an automated framework, MECE feature engineering)
2. Code reuse (don't reinvent the wheel)
3. Business-grounded analysis (bad-case analysis, Kernels, Discussions, similar competitions)
4. Reflection and summarizing


Please credit the source when reposting: goldandrabbit.github.io
