To investigate why qNEHVI achieved a higher HV improvement than U-NSGA-III, we plotted the optimisation trajectory to observe the solutions in objective space, as shown in Figure 4. We set the number of dimensions to 8, which is representative of the number of experimental parameters that materials scientists would consider practical. We first performed a single optimisation run of 100 iterations × 8 points per batch. The evaluated solutions are plotted in objective space and coloured by their respective iteration, from dark (early) to bright (late).
The general observations in Figure 4 a)-h) comparing qNEHVI to U-NSGA-III are consistent with the results previously reported in Figure 3, specifically in terms of HV scores and convergence rate. In all sub-figures, qNEHVI was able to propose solutions at the PF within the first 20 iterations, as shown by the darker colour of points along the red line (true PF). This suggests that it is very sample-efficient. However, it was unable to fully exploit the region of objective space close to the PF, and solutions in later iterations are non-optimal. In fact, in Figure 4 b) and d), for ZDT1 and ZDT2 respectively, a large portion of solutions lie along the f1=x1=0 line. This is explained by the choice of reference point, which we explore in more detail in SI 1.
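To make the f1=x1=0 line concrete, the following is a minimal sketch of the standard ZDT1 and ZDT2 benchmark definitions with 8 decision variables. The function names `zdt1` and `zdt2` and the test inputs are illustrative choices, not taken from the code used in this work. In both problems f1 equals x1, so any solution with x1=0 lands on the vertical line f1=0, and its f2 value equals the distance function g, which reaches the PF only when all remaining variables are zero.

```python
import math


def _g(x):
    # Distance function shared by ZDT1 and ZDT2; g = 1 on the true PF.
    n = len(x)
    return 1.0 + 9.0 * sum(x[1:]) / (n - 1)


def zdt1(x):
    """Standard ZDT1 (convex PF); x is a sequence of values in [0, 1]."""
    f1 = x[0]
    g = _g(x)
    f2 = g * (1.0 - math.sqrt(f1 / g))
    return f1, f2


def zdt2(x):
    """Standard ZDT2 (concave PF); x is a sequence of values in [0, 1]."""
    f1 = x[0]
    g = _g(x)
    f2 = g * (1.0 - (f1 / g) ** 2)
    return f1, f2


# Illustrative inputs with 8 variables (assumed values, not experimental data).
on_pf = [0.0] + [0.0] * 7   # x1 = 0 and g = 1: the PF extremum at f1 = 0
off_pf = [0.0] + [1.0] * 7  # x1 = 0 but g = 10: on the f1 = 0 line, far from the PF

print(zdt1(on_pf), zdt1(off_pf))
print(zdt2(on_pf), zdt2(off_pf))
```

Both problems give the same objective values whenever x1=0, which is consistent with the similar behaviour seen for ZDT1 and ZDT2 along this line.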
We hypothesise that qNEHVI is unable to identify multiple bi-objective points along the PF because the underlying GP surrogate model did not accurately model the PF for ZDT1-3. As for MW7, despite the algorithm being able to propose many solutions near the unconstrained PF, it failed to overcome the constraints, as seen by its failure to adjust to the new PF imposed by the constraints (dotted red line). We therefore suggest that qNEHVI’s superior HV score (Figure 3) can be attributed to the stochastic nature of QMC sampling, which provides a pool of candidates from which the surrogate model and acquisition function determine the next ‘best’ batch of points to evaluate. This hypothesis is supported by the results reported in SI 2, where it can be observed that the GP model did not fully learn the objective function.
In contrast, U-NSGA-III, while requiring a significantly larger number of iterations to reach the PF, had a more consistent optimisation trajectory towards the PF, as seen by the gradual colour gradient in Figure 4 a), c), e) and g). This suggests that there are fewer wasted evaluations for MOEAs, as the later iterations are targeted towards the PF. However, despite having more solutions near the PF, U-NSGA-III has a lower HV score than qNEHVI. This is a limitation of using HV as a performance metric: it strictly rewards non-dominated solutions across the entire search space, i.e. a handful of solutions at the PF extrema are preferred, as shown previously in Figure 3, where U-NSGA-III showed poorer HV improvement compared to qNEHVI for ZDT1, ZDT3 and MW7.
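This limitation of HV can be illustrated with a minimal sketch of a two-objective hypervolume calculation for minimisation problems. The function name `hypervolume_2d`, the linear toy front, the reference point and all point values below are assumptions chosen for illustration, and are not data from this study. The sketch compares two solutions at the extrema of the front against three solutions clustered around its centre.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref` (both objectives minimised)."""
    # Only points that are strictly better than the reference point contribute.
    pts = sorted((f1, f2) for f1, f2 in points if f1 < ref[0] and f2 < ref[1])
    hv = 0.0
    best_f2 = ref[1]
    # Sweep in ascending f1; each non-dominated point adds one horizontal strip.
    for f1, f2 in pts:
        if f2 < best_f2:
            hv += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return hv


# Toy linear front f2 = 1 - f1 with an assumed reference point of (2, 2).
ref = (2.0, 2.0)
extrema = [(0.0, 1.0), (1.0, 0.0)]                 # two points at the PF extrema
clustered = [(0.4, 0.6), (0.5, 0.5), (0.6, 0.4)]   # three points near the centre

print(hypervolume_2d(extrema, ref))
print(hypervolume_2d(clustered, ref))
```

In this toy setting, the two extremal points dominate a larger area than the three clustered points, even though all five lie on the same front. The size of this effect depends on where the reference point is placed.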
Notably, we observe in Figure 4 e) and g) that the disconnected PFs for ZDT3 and MW7 can lead to entire regions of objective space being omitted. This is clearly seen in both sub-figures, where solutions follow only a single trajectory towards the nearest PF region. We previously noted, based on the results reported in Figure 3 c) and d) for the same synthetic problems, that results for disconnected spaces are strongly influenced by initialisation, as U-NSGA-III’s mechanism of tournament selection rewards immediate gain over coverage, i.e. exploitation over exploration. This is both a strength and a weakness of U-NSGA-III in comparison to qNEHVI, whose stochastic QMC sampling enables greater exploration of the overall search space, but not of the PF.