Published at January 22, 2020 · 4 min read · by Flavio Clesio Silva de Souza
Facebook FastText – Automatic Hyperparameter optimization with Autotune
Disclaimer: some of the information in this blog post might be incorrect, and because FastText changes quickly, parts of this post may soon be out of date. If you have corrections or feedback, feel free to comment.
As some of you may know, FastText is a tool provided by Facebook Research for text classification and for training embeddings.
One of its most recent features is Autotune, which performs automatic hyperparameter optimization during the training phase.
It is no secret that one of the biggest limitations of FastText is the lack of a proper module for hyperparameter optimization like the one in Scikit-Learn. This feature seems to be a good start toward overcoming that problem, but first, let's understand what Autotune is:
What is Autotune?
The press release describes Autotune as follows:
- […] This feature automatically determines the best hyperparameters for your data set in order to build an efficient text classifier […].
- […] FastText then uses the allotted time to search for the hyperparameters that give the best performance on the validation set. […].
- […] Our strategy to explore various hyperparameters is inspired by existing tools, such as Nevergrad, but tailored to fastText by leveraging the specific structure of models. Our autotune explores hyperparameters by sampling, initially in a large domain that shrinks around the best combinations found over time […]
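The strategy quoted above can be pictured as a time-boxed loop that keeps the best candidate seen so far. The following is a toy sketch under stated assumptions, not FastText's actual code: `timeboxed_search`, `toy_propose` and `toy_evaluate` are hypothetical names, and the "model" is a single number instead of a text classifier.

```python
import random
import time


def timeboxed_search(start, propose, evaluate, budget_seconds):
    """Keep the best-scoring candidate found before the time budget runs out."""
    started = time.monotonic()
    best, best_score = start, evaluate(start)
    trials = 1
    while True:
        elapsed = time.monotonic() - started
        if elapsed >= budget_seconds:
            break
        progress = elapsed / budget_seconds  # 0.0 at the start, close to 1.0 at the end
        candidate = propose(best, progress)  # sample around the best candidate so far
        score = evaluate(candidate)
        trials += 1
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score, trials


# Toy problem (hypothetical): find x close to 3.0; a higher score is better.
def toy_evaluate(x):
    return -(x - 3.0) ** 2


rng = random.Random(0)


def toy_propose(best, progress):
    # The sampling window shrinks from +/-5.0 to +/-0.5 as time passes.
    width = 5.0 - 4.5 * progress
    return best + rng.uniform(-width, width)


best_x, best_score, trials = timeboxed_search(0.0, toy_propose, toy_evaluate, 0.05)
print(f"best x = {best_x:.3f} after {trials} trials")
```

In Autotune the candidate would be a full set of training arguments and the score would come from the validation set, but the shape of the loop is the same: a fixed budget, a shrinking search domain, and one best result kept at the end.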
For every parameter, Autotune has an updater (the method updateArgGauss()) that draws a random number (coeff) from a Gaussian distribution. The standard deviation of that distribution lies between the parameters startSigma and endSigma, and the parameter is updated based on the value drawn.
Each parameter has its own startSigma and endSigma, and both are hardcoded for updateArgGauss.
Updates can be linear (updateCoeff = coeff, and the new value is updateCoeff + val) or power-based (updateCoeff = pow(2.0, coeff), and the new value is updateCoeff * val). In both cases the size of the step depends on the Gaussian draw, and therefore on the current standard deviation.
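To make the two update modes concrete, here is a minimal Python sketch of a Gaussian updater. It follows the description above rather than the C++ source: the linear interpolation of sigma between `start_sigma` and `end_sigma` is an assumption of this sketch, and the sigma values in the usage loop are illustrative, not the ones hardcoded in FastText.

```python
import random


def update_arg_gauss(val, lo, hi, start_sigma, end_sigma, progress, linear, rng):
    """Perturb val with Gaussian noise and clamp the result to [lo, hi].

    progress runs from 0.0 to 1.0. Interpolating sigma linearly between
    start_sigma and end_sigma is an assumption made for this sketch.
    """
    sigma = start_sigma + (end_sigma - start_sigma) * progress
    coeff = rng.gauss(0.0, sigma)
    if linear:
        new_val = val + coeff            # additive step: updateCoeff + val
    else:
        new_val = val * (2.0 ** coeff)   # multiplicative step: pow(2.0, coeff) * val
    new_val = min(max(new_val, lo), hi)  # never leave the allowed range
    return type(val)(new_val)            # keep ints as ints (truncating), floats as floats


rng = random.Random(42)
lr = 0.1
for step in range(5):
    # Illustrative sigmas (2.0 shrinking to 1.0); bounds match the learning rate range.
    lr = update_arg_gauss(lr, 0.01, 5.0, 2.0, 1.0, step / 4, linear=False, rng=rng)
    print(round(lr, 4))
```

The power mode suits parameters that span several orders of magnitude, because each step multiplies the value instead of shifting it by a fixed amount.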
After each validation run (each one uses a different combination of parameters), a score is stored (F1-score only), and the best-scoring combination of parameters is then used to train the full model. The ranges for the arguments are the following:
epoch: 1 to 100
learning rate: 0.01 to 5.00
dimensions: 1 to 1000
wordNgrams: 1 to 5
loss: Only softmax
bucket size: 10000 to 10000000
minn (min length of char ngram): 1 to 3
maxn (max length of char ngram): 1 to minn + 3
dsub (size of each sub-vector): 1 to 4
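The ranges above can be written down as a small lookup table, which is handy for checking whether a hand-picked configuration falls inside the space that Autotune explores. This is only a sketch: `SEARCH_RANGES` and `out_of_range` are hypothetical helpers, the keys follow the FastText argument names, and the numbers are copied from the list above rather than read from the library.

```python
# Search ranges as listed in this post: (min, max), both ends inclusive.
SEARCH_RANGES = {
    "epoch": (1, 100),
    "lr": (0.01, 5.0),
    "dim": (1, 1000),
    "wordNgrams": (1, 5),
    "bucket": (10_000, 10_000_000),
    "minn": (1, 3),
    "dsub": (1, 4),
}


def out_of_range(params):
    """Return the names of parameters that fall outside the listed ranges."""
    bad = [
        name
        for name, value in params.items()
        if name in SEARCH_RANGES
        and not SEARCH_RANGES[name][0] <= value <= SEARCH_RANGES[name][1]
    ]
    # maxn is bounded relative to minn: 1 to minn + 3.
    if "maxn" in params and "minn" in params:
        if not 1 <= params["maxn"] <= params["minn"] + 3:
            bad.append("maxn")
    return bad


print(out_of_range({"epoch": 25, "lr": 10.0, "wordNgrams": 70}))
# prints ['lr', 'wordNgrams']
```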
More details can be found in the issues of the FastText project.
In terms of optimization metrics, only f1score and labelf1score are available. This means that other metrics, such as recall or precision, cannot be optimized separately, and there are no ranking metrics such as nDCG (relevant in cases like ours, where multilabel classification is used for ranking).
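Even though the search itself only optimizes F1, precision and recall can still be inspected after training. The sketch below assumes the `test` method of the FastText Python bindings, which returns the number of examples, precision at k and recall at k; `f1_score` and `report` are hypothetical helper names.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def report(model, validation_path, k=1):
    """Print precision, recall and F1 for a trained FastText model."""
    # model.test returns (number of examples, precision at k, recall at k).
    n, precision, recall = model.test(validation_path, k)
    print(
        f"N={n} P@{k}={precision:.3f} R@{k}={recall:.3f} "
        f"F1={f1_score(precision, recall):.3f}"
    )


print(f1_score(0.5, 0.5))
```

Because F1 is a harmonic mean, a model with very high precision and poor recall (or the reverse) scores low, which is why a single F1 target cannot stand in for a precision-only or recall-only objective.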
During a small PoC that we ran with Autotune, we found the following advantages and disadvantages.
Advantages:
- In domains where the FastText models are not so critical in terms of accuracy/recall/precision, the time-boxed optimization can be very useful.
- Extremely simple to use: it only takes a few extra arguments in the call to train_supervised().
- The source code is transparent, so we can check some of the behaviors.
- The search strategy is simple and has boundaries that cut off extreme training parameters (e.g. learning rate = 10.0, epoch = 10000, wordNgrams = 70).
Disadvantages:
- FastText still doesn't provide any log about convergence. A log entry for each model tested would be nice.
- The search strategy could be documented more clearly in terms of boundaries, parameter initialization and so on.
- The boundary parameters startSigma and endSigma follow a Gaussian distribution, and I think this could be explained in the docs.
- The same goes for the hardcoded values that define the boundaries for each parameter. Something like: "Based on some empirical tests, we got these values. However, you can test a certain number of combinations and open a PR if you find some good intervals."
- Autotune may go through many combinations with poor parameters before it reaches a good sequence of optimization (e.g. with a search budget of 100 combinations, the first 70 may not be very useful). The main idea of Autotune is to be "automatic", but it would be useful to have an option to choose between a broader and a more optimized search configuration.
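For reference, the extra arguments to train_supervised() mentioned in the list above look roughly like this in the Python bindings. This is a minimal sketch: the file names are placeholders, `autotune_kwargs` and `train_with_autotune` are hypothetical helpers, and the argument names (autotuneValidationFile, autotuneDuration, autotuneMetric, autotuneModelSize) are the ones I know from the FastText Autotune documentation, so check them against the version you have installed.

```python
def autotune_kwargs(train_path, validation_path, duration_seconds=300,
                    metric="f1", model_size=None):
    """Build keyword arguments for fasttext.train_supervised with Autotune on."""
    kwargs = {
        "input": train_path,
        "autotuneValidationFile": validation_path,  # enables Autotune
        "autotuneDuration": duration_seconds,       # time budget in seconds
        "autotuneMetric": metric,                   # "f1" or "f1:__label__<name>"
    }
    if model_size is not None:
        kwargs["autotuneModelSize"] = model_size    # e.g. "2M" to cap the model size
    return kwargs


def train_with_autotune(**kwargs):
    """Run the training; requires the fasttext package and real data files."""
    import fasttext  # imported lazily so the helper above works without it
    return fasttext.train_supervised(**kwargs)


kwargs = autotune_kwargs("train.txt", "valid.txt", duration_seconds=600)
print(kwargs)
# model = train_with_autotune(**kwargs)  # uncomment once fasttext and the files exist
```

Keeping the arguments in a plain dictionary makes it easy to log the exact budget and metric used for each run, which partly compensates for the missing convergence log.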