Hyperparameter Optimisation at scale
TL;DR: Use Ray Tune or NNI; they provide SOTA algorithms out of the box for efficient HPO.
Deep neural networks have many tunable hyperparameters that need to be selected to get maximum accuracy out of your datasets and models. Hyperparameter optimisation (HPO) can be used to find the best neural architecture (NAS) by optimising layer choice, number of layers, and layer width, or to find the best learning algorithm by optimising learning rate, momentum, optimizer choice, data augmentation, etc.
We will cover:
- When to use HPO
- How HPO works
- State-of-art HPO algorithms
- How to use - Ray Tune and NNI
When to use HPO
First, let’s cover where not to use HPO. If your dataset is similar to the datasets the pre-trained models were trained on and you are not hunting for the last 3-4% of accuracy, you should not use HPO: it will consume 10x more resources and time for such gains. In my experience, good authors and libraries provide good default values, which are the result of both intensive HPO and domain expertise. Also, architecture HPO should not be the primary way to optimize the accuracy-performance trade-off for standard problems. A wide range of network architectures and sub-architectures is already available that trades accuracy for performance; for example, in vision classification you can choose from ViT, EfficientNet, or MobileNet, and from the small, medium, and large sub-architectures within each. They provide 10-100x performance gains at a cost of 3-10% accuracy. Optimisation techniques are the next best choice. However, HPO can still be used to evaluate all of these choices.
But if you are an academic user for whom a 1-2% gain decides SOTA, or a 3-4% improvement has a significant impact on your business, then HPO is a really good choice. HPO delivers very good results when the standard models and configurations are not working, and it shines when you are in no-man’s land, such as new algorithm development or custom model architecture design, where manually tuning a large number of parameters is impossible. I have used it to design a custom model architecture that provides 10x performance while retaining accuracy for specific segmentation tasks.
How HPO works
HPO consists of two parts: a search algorithm and a scheduler. The search algorithm selects hyperparameter values for a trial from the complete parameter space, and the scheduler decides the run duration or resource allocation for each trial.
Two basic search algorithms are grid search and random search. Random search works better than grid search and is also the best choice for embarrassingly parallel optimisation. One state-of-the-art search algorithm is Bayesian Optimisation. It fits a surrogate Bayesian model over the search space to efficiently find optimal values. It can select the next value based on maximum uncertainty (exploration) or maximum expected gain (exploitation) from the Bayesian-modelled search space, and then refines the model based on trial results. Two major drawbacks are that the algorithm in its naïve form is sequential (selecting the next value depends on evaluating the current one) and that it only supports continuous values. However, advanced implementations such as TPE and BOHB address these limitations.
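To make the explore/exploit loop concrete, here is a minimal, self-contained sketch of Bayesian optimisation over a single hyperparameter (log10 of the learning rate). It is only an illustration under stated assumptions: it uses scikit-learn's Gaussian process as the surrogate, an upper-confidence-bound acquisition, and a toy objective `f` standing in for "train a model and return its validation score" (none of these names come from any HPO library). Note how each new candidate depends on refitting the surrogate to all previous results, which is why naïve Bayesian optimisation does not parallelise well.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def f(log_lr):
    # Toy stand-in for "train a model with this learning rate and return its score".
    return -(log_lr + 3.0) ** 2 + 1.0

rng = np.random.default_rng(0)
candidates = np.linspace(-8, -2, 200).reshape(-1, 1)   # search space: log10(lr) in [-8, -2]
X = rng.uniform(-8, -2, size=(3, 1))                    # a few random initial trials
y = np.array([f(x[0]) for x in X])

for _ in range(10):
    gp = GaussianProcessRegressor().fit(X, y)            # surrogate model of the search space
    mean, std = gp.predict(candidates, return_std=True)
    ucb = mean + 1.0 * std                                # exploitation (mean) + exploration (uncertainty)
    x_next = candidates[np.argmax(ucb)]                   # most promising point under the acquisition
    X = np.vstack([X, x_next])                            # "run the trial" and record the result
    y = np.append(y, f(x_next[0]))

print("best log10(lr) found:", X[np.argmax(y)][0])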
A naive scheduler runs all trials for the complete duration and evaluates them afterwards. Successive Halving, and its Asynchronous Successive Halving (ASHA) variant, keeps the better half of the trials and discards the half that is not performing well, continuing until only a single configuration remains. This directs resources to good trials, resulting in faster training. The Hyperband scheduler improves on this further by starting new configuration trials in place of discarded ones, which increases the number of configurations evaluated and yields better hyperparameters. ASHA and Hyperband require that every configuration can be stopped early and that an intermediate validation score can be obtained. A small sketch of the idea follows.
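Below is a minimal sketch of the synchronous successive-halving idea described above: run every surviving configuration for the current budget, keep the better half, double the budget, and repeat. The `train_for` function is a hypothetical stand-in for "resume training this configuration for a given number of epochs and return a validation score"; ASHA does the same thing asynchronously so workers never wait for a whole rung to finish.

def train_for(config, epochs):
    # Toy stand-in: pretend configurations with lr closer to 3e-4 score higher.
    return epochs * (1.0 - abs(config["lr"] - 3e-4) / 3e-4)

def successive_halving(configs, min_epochs=1, max_epochs=16):
    budget = min_epochs
    while len(configs) > 1 and budget <= max_epochs:
        scores = [train_for(cfg, budget) for cfg in configs]                   # evaluate every survivor
        ranked = sorted(zip(scores, configs), key=lambda pair: pair[0], reverse=True)
        configs = [cfg for _, cfg in ranked[: max(1, len(configs) // 2)]]      # discard the worse half
        budget *= 2                                                            # survivors get a bigger budget
    return configs[0]

best = successive_halving([{"lr": lr} for lr in (1e-5, 1e-4, 3e-4, 1e-3, 1e-2)])
print(best)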
State-of-art HPO algorithms
State-of-the-art methods such as BOHB, which combines Bayesian optimisation with Hyperband, can lead to a 5-20x speed-up on HPO. Population Based Training (PBT), FLAML, and Fabolas (all linked in the references below) are other strong options worth evaluating.
How to use - Ray Tune
Ray Tune is a really good tool for HPO. It is simple and powerful, and as part of the Ray ecosystem it scales to multi-GPU and distributed environments. It provides all of the algorithms above plus additional SOTA ones; you can find more details in the Ray Tune documentation linked in the references below. Microsoft NNI is also a really good choice if you are in the PyTorch ecosystem, and it is even better for NAS use cases, but its support for non-PyTorch frameworks is limited.
HPO with Bayesian-style search (TPE via HyperOpt) and an asynchronous HyperBand scheduler can be implemented quickly in Ray Tune, as in the following reference snippet:
from ray import tune
from ray.tune.search.hyperopt import HyperOptSearch
from ray.tune.search import ConcurrencyLimiter
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray.air import session
# 1. Define an objective function.
def trainable(config):
    # Import torch/tf/keras inside the trainable if you use them; this avoids a known issue with GPU trials.
    for x in range(20):  # "Train" for 20 epochs.
        one_epoch_training(model, config["lr"], config["a"])  # placeholder for your per-epoch training step
        accuracy = calc_accuracy(model)  # placeholder for your validation step
        session.report({"accuracy": accuracy})  # Report after every epoch so the scheduler can stop bad trials early.
# 2. Define a search space.
search_space = {
    "lr": tune.loguniform(1e-8, 1e-2, base=10),
    "a": tune.choice([1, 2, 3]),
}
# 3. Define Search Algo and Scheduler
search_algo = HyperOptSearch()
search_algo = ConcurrencyLimiter(search_algo, max_concurrent=4) # Limit concurrent trials since BO doesn't parallelize very well
scheduler = AsyncHyperBandScheduler(grace_period=5)  # metric/mode are set once in TuneConfig below
# 4. Start a Tune run that maximizes accuracy.
tune_config = tune.TuneConfig(
    search_alg=search_algo,
    scheduler=scheduler,
    metric="accuracy", mode="max",
    num_samples=20,  # Number of trials
)
tuner = tune.Tuner(
    trainable,
    tune_config=tune_config,
    param_space=search_space,
)
results = tuner.fit()
print(results.get_best_result(metric="accuracy", mode="max").config)
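If you want to try the BOHB combination mentioned earlier, Ray Tune also ships a BOHB search algorithm and scheduler pair. The following is a hedged sketch, not a drop-in recipe: it reuses the `trainable` and `search_space` defined above and assumes the optional ConfigSpace and hpbandster dependencies are installed.

from ray import tune
from ray.tune.search.bohb import TuneBOHB
from ray.tune.schedulers import HyperBandForBOHB

bohb_tuner = tune.Tuner(
    trainable,                                   # same trainable as in the snippet above
    param_space=search_space,                    # same search space as above
    tune_config=tune.TuneConfig(
        search_alg=TuneBOHB(),                   # Bayesian optimisation over the search space
        scheduler=HyperBandForBOHB(max_t=20),    # HyperBand-style early stopping, budget = 20 reported epochs
        metric="accuracy",
        mode="max",
        num_samples=20,
    ),
)
bohb_results = bohb_tuner.fit()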
This should get you started on the journey of optimizing hyperparameters efficiently with state-of-the-art algorithms. Combining HPO with techniques like µTransfer makes it even more promising. OpenAI tuned a 40-million-parameter proxy GPT-3 model before transferring the optimal hyperparameters to the 6.7B-parameter variant; with only a 7% extra training budget for hyperparameter search, it outperformed the 13B variant. To learn more, these are some good references:
- Ray Tune: https://docs.ray.io/en/latest/tune/key-concepts.html
- Good blogs for reference: Blog 1, Blog 2
- FLAML: https://github.com/microsoft/FLAML
- Fabolas: https://arxiv.org/abs/1605.07079
- PBT: https://www.deepmind.com/blog/population-based-training-of-neural-networks and https://www.deepmind.com/publications/faster-improvement-rate-population-based-training
- µTransfer: https://decentdescent.org/tp5.html