## Abstract

In this work, an unsupervised model selection procedure for identifying data-driven forecast models for urban drainage systems is proposed and evaluated. Specifically, we consider the case of predicting inflows to wastewater treatment plants for activating wet weather operation (aeration tank settling, ATS) using Box–Jenkins models. The model selection procedure considers different model structures and different objective functions. The hyperparameter search space is constrained based on the time of concentration in the catchment. Objective function criteria that minimize one-step-ahead as well as multi-step prediction errors are considered. Finally, we consider two criteria for unsupervised selection of the best-performing model. These measure the agreement of observed and predicted hydrographs (persistence index), as well as the binary exceedance of critical flow thresholds (critical success index (CSI)). Our work shows that forecast models can be developed in an unsupervised manner, and ATS activation is correctly forecasted in 60–90% of the events. The selected model structures reflect the physical behaviour of the catchment. Models should not be selected on operational criteria like the CSI due to a risk of overfitting. The degree to which rainfall input improves forecasts depends on the specific catchment, and the objective function criterion that should be used for coefficient estimation depends on the application context.

## HIGHLIGHTS

Unsupervised model selection procedure for influent forecasting with ARIMA-type models.

Development of hyperparameter selection criteria, i.e., physical/operational decision criterion.

Comparisons of objective function criteria, i.e., single/multi-step forecasts.

Evaluation of the impact of rainfall input on flow forecast quality.

### Graphical Abstract

## INTRODUCTION

Urban drainage systems (UDS) are increasingly challenged as a result of on-going urbanization (European Commission 2016) and changing rainfall patterns (Arnbjerg-Nielsen 2012). Addressing these challenges with new infrastructural solutions (increased conveyance and storage) alone implies considerable costs and non-sustainable environmental impacts (Casal-Campos *et al.* 2015; Brudler *et al.* 2016). Green infrastructure, ‘Water Sensitive Cities’ and ‘Sponge Cities’ are, therefore, advocated, where water management is interlinked with many other services that a city provides to its citizens (Lund *et al.* 2019a; Wong *et al.* 2020). This development in urban water management occurs in parallel with the digitalization agenda, where the increasing availability of cheap sensor data enables new opportunities to exploit multifunctional aspects of urban water infrastructure through smart operation in real time (Kerkez *et al.* 2016; Sarni *et al.* 2019).

The potential of real-time operation of integrated urban wastewater infrastructures (Lund *et al.* 2019a; Wong *et al.* 2020) has been demonstrated in various contexts (García *et al.* 2015; Shishegar *et al.* 2018). These include using urban surfaces as a multifunction retention space for rainwater during heavy rainfall (Lund *et al.* 2019a), dynamically optimizing water distribution within the sewer system to minimize emissions to the environment (Fradet *et al.* 2011; Langeveld *et al.* 2013; Löwe *et al.* 2016), reducing the energy demand from sewer operations (Kroll *et al.* 2018) while maximizing the use of electricity from renewable sources (Stentoft *et al.* 2020), and increasing the treatment capacity of wastewater treatment plants (WWTPs) during and after wet weather events (e.g. aeration tank settling, ATS; Sharma *et al.* 2013). All of these applications either require or can be improved by the availability of a forecast of expected sewer flows, typically over forecast horizons ranging from 30 min up to 24 h.

Forecasting flows in UDS corresponds to translating observations and forecasts of rainfall into expected flows at specific locations, typically where the actuators used by real-time controls are placed. As described in García *et al.* (2015), UDS models can be subdivided into simulation- and control-oriented models. The first category includes detailed hydraulic models such as MIKE URBAN, SWMM, or Infoworks (so-called white-box models). Given their level of detail, these models are frequently considered too tedious and too slow to be applied for real-time operations, where decisions need to be taken within seconds to minutes. Control-oriented models include simplified conceptual hydrological models and data-driven models. Simplified conceptual models ensure important reductions in simulation times while still incorporating an understanding of the physical system into the model equations. These models have been applied both in a deterministic form (Vanrolleghem *et al.* 2005; Wolfs *et al.* 2013; Thrysøe *et al.* 2019) and in combination with data assimilation algorithms that enable the exploitation of flow sensor information to improve forecasts (Breinholt *et al.* 2011; Bach *et al.* 2016; Löwe *et al.* 2016; Lund *et al.* 2019b).

Data-driven models are a class of models without physical interpretation but much reduced computational requirements. These models are particularly useful to generate accurate forecasts in locations where sensor data are available. Their speed makes them an attractive choice in automated model selection procedures, where many models need to be tested. In urban hydrology, Box–Jenkins time-series models were used in a number of studies to forecast sewer flows and inflows to WWTPs (Tan *et al.* 1991; Pleau *et al.* 2005; Boyd *et al.* 2019). Zhang *et al.* (2018) used long short-term memory neural networks to predict inflow to WWTPs. Finally, Maleki *et al.* (2018) compared Box–Jenkins models and neural networks for predictions of WWTP inflow characteristics and found comparable performance between these model types. Data-driven models have also been used in catchment hydrology. For example, Phan & Nguyen (2020) used a combination of Box–Jenkins models and neural networks to forecast river water levels, and Valipour *et al.* (2013) applied Box–Jenkins models and neural networks to forecast reservoir inflows.

The aim of this work is to develop a procedure for the automated (or unsupervised) identification of data-driven models for forecasting flows in the UDS and specifically inflow to WWTPs. The purpose of the deveoped models is to control the activation of wet weather operation (ATS, see the section ‘Data and catchments’) at WWTPs. We choose to implement the model development procedure using Box–Jenkins models because they are widely used and readily available in the mathematical modelling software. The developed principles, however, can be transferable to other modelling approaches. Compared to existing procedures for the data-driven model development, we provide a number of major improvements:

- 1.
A new approach to restrict the search space for rainfall input based on physical system understanding. Indeed, most of the existing studies in hydrology (Tan

*et al.*1991; Pleau*et al.*2005; Boyd*et al.*2019) select the inputs and the structure of the forecast model manually based on expert knowledge. For Box–Jenkins models, an inspection of autocorrelation functions would typically be performed (Box*et al.*2008). In an unsupervised setting, this approach needs to be replaced by systematic tests of different model structures. Box–Jenkins models traditionally offer three hyperparameters that can be used to systematically traverse the modelling space (Box*et al.*2008; Hyndman & Khandakar 2008). Thus, when considering rainfall as an input to the model, the search space increases substantially and models may be difficult to identify. Therefore, we develop a new approach to restrict the search space for rainfall input based on physical system understanding. - 2.
A set of performance criteria to perfrom unsupervised model evaluation based on the intended model application. Criteria such as AIC that balance improved model fit on the training dataset against an increase in the number of model parameters are frequently used for Box–Jenkins models (Hyndman & Khandakar 2008) and have also been used in hydrology (Phan & Nguyen 2020). However, such criteria are not necessarily optimal for applications involving multi-step forecasting (Löwe

*et al.*2014). In general, different applications are dominated by different parts of flow hydrographs. For example, control schemes that minimize flooding require accurate forecasts of peak flows, while schemes for operating treatment plant inflow require accurate forecasts of water volume over some hours. Therefore, forecast models for different applications need to be selected based on different criteria. In this study, we focus on two criteria that, respectively, aim at reproducing the observed flow hydrograph and at maximizing the operational performance in the specific application, i.e., the frequency of how often ATS is activated correctly. - 3.
A comparison of different model error measures used to evaluate model performance. Given a model structure, the model coefficients (e.g., the coefficients in a time-series model, or the weights in a neural network) of the model need to be tuned such that the flow forecast is optimized for the specific case. Data-driven models come with standardized tuning criteria, such as maximum likelihood estimation, or the optimization of mean squared error or entropy (Tan

*et al.*1991; Boyd*et al.*2019; Phan & Nguyen 2020). Similar to the previous point, the choice of tuning criterion, however, depends on the specific application. In particular, when developing a forecast model for flows in the UDS, the widely applied tuning procedures based on one-step-ahead predictions may not be robust, because sensor issues, pumping operation, and semi-random behaviour in water consumption cause noise with varying frequencies (Löwe*et al.*2014). This work compares the effect on model calibration when using traditional error measures based on single-step forecasts or average forecast error over multiple lead times is minimized. - 4.
A model evaluation is based on performance indicators that are relevant for the final users of the model. Specifically, we evaluate the performance of our forecast models in terms of frequency of activation of wet weather operation (ATS) at a WWTP in Copenhagen, Denmark. The operation can, therefore, better relate to the forecast performance compared to when using traditional error measures based on residuals, such as SSE or RMSE, that were applied in similar studies (Boyd

*et al.*2019).

## MATERIALS AND METHODS

In this section, we first provide an overview of the considered catchments, the decision-making problem, and the considered datasets. Subsequently, we provide details on the automated model selection procedure. In the end, a brief summary of the technical implementation is provided.

### Data and catchments

#### Catchments and forecasting problems

In this work, models were fitted and evaluated on data from two locations in the Damhusåen catchment in Copenhagen, Denmark (Figure 1): at an upstream location (Dæmningen) and at a downstream location at the WWTP inlet (Damhusåen). Dæmningen is used as a wet weather control for Damhusåen WWTP, as it gives a warning of about 30 min to the plant operator (Breinholt & Sharma 2010). This transport time is confirmed by cross-correlation (see Figure S1 in Supplementary Material). The hydrographs at the two locations vary strongly. At the upstream location, the hydrographs are relatively smooth, driven mainly by gravity flow in pipe systems in suburban areas. At the downstream location, the hydrographs are more erratic due to inflows from dense city areas with fast response time and locally controlled ‘on-off’ pumps.

Historical rainfall observations were available in the form of C-band radar rainfall observations. The radar is located in Stevns, approximately 50 km south-west from Copenhagen. Data were gathered from the Danish Meteorological Institute (DMI). Historical forecast data were not available. Models were, therefore, fitted with ‘perfect rainfall forecasts’ (ex-post hindcasting) as well as entirely without rainfall input. A realistic forecasting performance can be expected to lie between these extremes.

The aim of the forecast models is to support the WWTP in initiating a process that can allow the WWTP to handle larger inflows than in normal operation, thus minimizing the risk of bypass and combined sewer overflow (CSO). This process is called ATS because the sludge is allowed to settle in the aeration tanks (Sharma *et al.* 2013). ATS increases the inflow capacity of WWTPs but decreases their treatment efficiency. This creates a decision-making problem where failure to activate ATS leads to unnecessary risk of CSO (because the inflow to the WWTP remains limited), while unnecessary activation of ATS leads to unwanted increased pollutant concentrations in the WWTP outflow (due to the reduced treatment efficiency). In practice, operators apply thresholds on the rate of inflow to the WWTP to decide whether ATS should be activated. The process requires between 30 and 45 min to activate and the threshold exceedance, therefore, needs to be forecasted with sufficient lead time.

#### Data

Data used in this study consist of both flow measurements and radar rainfall data that range from August 16th, 2017 to the end of December 31st, 2018. Flow measurements had a 2-min resolution, while rainfall data had a 10-min resolution. To synchronize these two temporal resolutions, the flow measurements were averaged to 10 min resolution.

Models were trained and evaluated only on rain events that are relevant for UDS operation. To identify these events, a rain intensity threshold of 0.01 mm/min was used to identify the start and stop times of specific rain events, i.e., when the intensity surpasses and falls below the threshold. This threshold was selected by assessing whether a reasonable selection of rain events was obtained in a visual inspection of the time series. Similar thresholds have been recommended in Löwe *et al.* (2016) and DWA (2006).

Rainfall events closer than 12 h to each other were merged to allow for the emptying of basins in the UDS and to return to dry-weather operation before the start of the next event. This event separation period was recommended by the operators of the system and validated by manually inspecting the flow time series. Figure 1 provides an example of a flow time series for a 5-day period.

Anomalies such as flatlines in inflow measurements and missing values were removed from the datasets. Around 3 and 9% of inflow and rainfall data, respectively, consisted of missing values. After pretreatment, all data were normalized by subtracting the lowest value from all data points in the time series and dividing them by the difference between the maximum and minimum values.

For an unbiased validation of the model fit, the datasets were split up into a training set and a validation set. The training set ranged from August 16th, 2017 to December 31st, 2017, while the validation set was the whole year of 2018. The training set contained 61 rain events covering both summer and winter periods. A training period of 4.5 months was considered a reasonable trade-off between the reducing computational expense and ensuring that the relevant hydrological variations are present in the training data. Subsequently, the model was validated on 105 events from an entire year to verify model performance across seasons. Training data of similar or shorter size has been used previously in previous work (Tan *et al.* 1991; Löwe *et al.* 2014).

### Framework for unsupervised development of WWTP inflow forecast models

#### Hyperparameter search space in Box and Jenkins time-series models

*et al.*2008; Madsen 2008). In their general form, these are illustrated in Equation (1), where is the flow value to be forecasted, assuming that observations are available until time step

*t*. is the flow observation for time step

*t*, is the one-step-ahead forecast error of the model observed at time step

*t*, and is the rainfall observation at time step

*t*. The equation has four terms, the mean , an autoregressive (AR) term with parameter , a moving average term (MA) with parameter , and an exogenous input term (

*X*) with parameter . The parameters (sometimes referred to as coefficients) , , , and are dimensionless and need to be estimated. The model orders

*p*,

*q*, and

*r*can be seen as hyperparameters that define the length of the polynomials. An additional model parameter

*d*describes the order of differencing (Madsen 2008) of the flow time series, which is a process frequently employed to remove non-stationarity from the series.

*p*,

*q*, and

*r*, and the models can, therefore, in principle contain an infinite number of parameters. In practice, most processes can, however, be described with

*p*and

*q*below 10. The rainfall inputs that should be included in the model will depend on the time delay between rainfall and runoff in the specific catchment, as well as on the shape of the typical rainfall–runoff hydrograph. Larger time delays imply that for small time lags

*k*are not relevant to include in the model. In addition, flow from catchments that cover a larger area will be sensitive to rainfall from both the more distant and the near past, while the reaction from smaller catchments can be characterized by a narrow rainfall interval. We exploited these effects by introducing hyperparameters that, depending on the characteristics of a catchment, define the minimal value that should be considered for the time lag

*k*, and the number of rainfall input terms that should be included in the model. This approach limits the number of coefficients that need to be estimated for the rainfall inputs to a window of size that covers the time period that is relevant for the catchment and enables the automated search for optimal model structures based on these hyperparameters with a limited range of values. Equation (2) summarizes the structure of the resulting model.

#### Criteria for estimating model parameters (coefficients)

*t*in a time series of length

*n*.

*SC*presented in Löwe

_{t}*et al.*(2014) was used, which takes a weighted average over several forecast horizons (

*h*= 1, 2, … ,

*k*) (Equation (4)). The maximum forecast horizon was set to

*k*= 10. This criterion encourages forecast models that reflect the physical system behaviour by penalizing deviations from the hydrograph on longer forecast horizons.

The implementation of the criterion in Equation (4) requires that forecasts covering multiple forecast horizons are generated for each time step during model training. This approach cannot be implemented efficiently using standard time-series modelling packages in R. We, therefore, implemented an approach based on matrix multiplication that exploits that ARIMA models in multi-step predictions re-use information from previous forecast horizons in an iterative manner. This approach is documented in the Supplementary Material (S3, Figure S3).

The optimization method used to minimize the objective functions was the dynamically dimensioned search algorithm (DDS) presented in Tolson & Shoemaker (2017). The DDS search is a heuristic algorithm, specifically designed to solve computationally expensive optimization problems with many parameters. The search starts as a global search where all model parameters are randomly varied but then narrows down to a local search, eventually varying only one parameter. All parameters were constrained to the interval −5 and 5, and starting parameters were selected randomly from a Gaussian distribution with a mean of 0 and a variance of 1. DDS requires the user to define the number of objective function evaluations, which was set to 2,500 for all models.

#### Criteria for selecting the best-performing model

*et al.*2013), which measures whether the model can reproduce the observed flow hydrograph by comparing the forecast error against a benchmark forecast defined as the last available flow observation:where represents a prediction of inflow

*y*at time step

*t*, in a time series of length

*n*, and

*h*represents the forecast horizon in time steps. A perfect model yields , while suggests that the forecast model cannot outperform the constant value benchmark.

The second criterion focuses on the application of the forecast model to the decision problem of activating wet weather operation of the WWTP (ATS). The threshold for ATS activation was defined together with the operators of the WWTP and set to at both locations. As unnecessary activation and failure to activate ATS in due time leads to unwanted pollution of the environment, evaluating the frequency of these undesirable situations enables the assessment of the model performance in an operational setting. This assessment was implemented as a contingency table with three states that were defined together with the operators (Courdent *et al.* 2017, 2018):

*Correct forecast (TP, true positive)*: If ATS activation was predicted within a 75 min interval (from 60 min before measured flow exceeded the threshold to 15 min after).*False alarm (FP, false positive)*: ATS event forecasted, but the measured flow did not exceed the threshold. This case leads to suboptimal WWTP performance with a worsening of its pollutant removal efficiency.*Missed (FN, false negative)*: if ATS activation was predicted more than 15 min after the measured flow exceeded the threshold or if ATS activation was not predicted at all. This case can negatively affect the WWTP performance by causing discharge of partially untreated wastewater (bypass), reduction of the efficiency of biological removal processes, loss of active biomass, and consequent long-term worsening of the WWTP operational capability.

The time windows used in the classification of the ATS events are schematized in Figure 2 and were defined in collaboration with the plant operator. Forecasting an ATS too early results in FP.

*et al.*2013) was used to measure how often ATS events were forecasted correctly and would ideally take a value of 1:

Both model selection criteria (Equations (5) and (6)) can be evaluated on different forecast horizons. Selecting models on longer forecast horizons is likely to encourage more physically realistic behaviour of the forecasts because high forecast skill cannot be obtained without generating multi-step flow predictions that follow the general shape of the observed hydrograph (Löwe *et al.* 2014). On the other hand, this approach might compromise forecast performance on shorter horizons. To identify which trade-off is reasonable for our problem, we computed PI and CSI for lead times of 30, 60, and 90 min as well as the average over these horizons (AVG-PI). The PI criterion generally increases with a larger forecasting horizon, and selecting a model based on the average PI across different forecasting horizons is done to help lower the risk of non-identifiability. Subsequently, we analysed the performance of models selected based on different lead times.

### Technical implementation of model selection

ARIMA and ARIMAX-type models were identified by performing a meta-model search (meta-optimization) on a pre-defined hyperparameter search space. The search was composed of three main steps:

Selecting various sets of hyperparameters (

*p*,*d*,*q*, ) from a pre-defined hyperparameter search space, generating various ‘hyper-models’.Estimating coefficients of the hypermodels by minimizing an objective function (SSE or SC).

Evaluating the calibrated models on a validation dataset based on two evaluation criteria (PI and CSI).

The hyperparameter search space was defined as , , , , and . The process was carried out for models with (ARIMAX) and without (ARIMA) rainfall input, as well as for different objective function criteria (single/multi-step forecasts) to allow for a systematic comparison of model types. The models were implemented in R and trained in a high-performance cluster environment where up to 120 hyperparameter combinations could be evaluated in parallel. Calibration for each model generally took under 10 min. In total, 2,592 models were calibrated for each catchment, resulting in a total computation time of approximately 100 CPU h.

## RESULTS AND DISCUSSION

In this section, the effect of different model configurations on forecast performance is evaluated. The first subsection provides a visual overview of the effect of different procedures for selecting and training forecast models. The subsequent sections then go in-depth with each of the analysed configurations. Sections S4–S6 in the Supplementary Material provide tables with a detailed overview of model structures and performance when applying different criteria for the estimation and selection of forecast models.

### Physical forecast behaviour for different models

Figure 3 illustrates multi-step forecasts for a single rain event in the two considered catchments. Panel (a) compares the best-performing models with and without external rainfall input. Panel (b) compares the best-performing models that were trained using single and multi-step criteria (SSE and SC), and panel (c) compares models that were selected considering the persistence index and critical success index (PI and CSI), respectively (PI was applied as a selection criterion in panels (a) and (b)). All models shown in Figure 3 were selected on a 90-min forecast horizon.

Figure 3(a) shows that models without rainfall input can behave similar to models with rainfall input, but cannot very well forecast the increasing limb of the hydrograph. This can be clearly seen from the hydrographs from Damhusåen (Figure 3(a)). Where precipitation is increasing (for example, around 00:00–06:00), the ARIMAX models can quite successfully forecast the increasing runoff, while the ARIMA models remain more stable. From Figure 3(b), it can be seen that models trained using a multi-step criterion (Equation (4)) seem to better reproduce the physical behaviour of the system on longer forecast horizons. As before, this is more obvious for the catchment of Damhusåen. From Figure 3(c), it is clear that if models are selected based on the CSI score, there is a high risk of overfitting and forecasts not reflecting the physical behaviour of the system. This is clearly illustrated as models selected on the CSI are just sharp peaks while selecting on PI results in a model behaviour that much more closely resembles the physical behaviour and shape of the hydrograph.

Structural differences between models selected on different criteria were also identified. In Sections S5 and S6 in the Supplementary Material, this is shown by selecting the 10 best-performing models based on both error metrics and various forecast horizons. When selected on PI, the best-performing models have a higher number of MA parameters (*q*) compared to the number of AR parameters (*p*) which is usually quite low. Additionally, most of the models selected on PI are differentiated, although some 60-min forecast models for Damhusåen are undifferentiated. When selected on the CSI, the only clear structure that can be seen is that most of the models are not differentiated, resulting in non-stationary behaviour.

### Value of rainfall input for forecasting

Figure 4 shows the average PI and CSI skill scores of the top-5 performing models without (ARIMA) and with (ARIMAX) rainfall input in the Dæmningen and Damhusåen catchments. The figure illustrates that in all cases, models using rainfall input are superior when compared to the ones that use no rainfall input. The difference is more significant in Damhusåen compared to Dæmningen, as the Damhusåen catchment exhibits more pronounced peaks and a generally less smooth behaviour of the flow series, caused partly by fast rainfall–runoff from dense city areas (cf. Figures 1 and 3).

Considering the model structures with the highest forecast performance (Supplementary Material, Sections S5 and S6), we see that the selected hyperparameters ( and ) reflect the behaviour of the catchment. The pronounced peaks in Damhusåen are mostly described by short lag times ( in the order of 5) and narrow window sizes ( in the order of 2–4), while slightly longer lag times ( in the order of 5–15) and wider window sizes for the rainfall input ( in the order of 2–10) are favoured in the Dæmningen catchment.

### Effects of single- vs. multi-step parameter calibration

Figure 5 compares models fitted on the single-step objective function criterion to models fitted on the multi-step objective function criterion . Considering selecting on PI skill scores in panel (a), the traditional single-step objective function criterion generally results in higher score values for shorter forecasting horizons (30 min) while for longer forecasting horizons, a multi-step objective function criterion is preferred. In addition, PI values in the Damhusåen catchment are generally very low for short forecast horizons, while they are more balanced in the Dæmningen catchment. This implies that the model selection procedure favours multi-step models in the Damhusåen catchment where 7 of the 10 best-performing models were trained using the criterion, while the opposite result is obtained in the Dæmningen catchment (Supplementary Material, Section S4 and Figure S4).

From Figure 5(b), where models are selected on CSI (in contrast to AVG-PI in Figure 4), it can be observed that the CSI is lower for models fitted with the multi-step objective function criterion (SC) compared to models fitted with the single-step objective function criterion (SSE). This is explained by the fact that the multi-step objective function criterion prevents overfitting and punishes models that do not follow the shape of the hydrograph (Figure 3(b)). The single-step models are more prone to overfit, which is clear from the reduced PI values for longer lead times.

### Efficient model selection criteria

Figure 6 shows how models selected on PI estimated for different forecast horizons (30, 60, and 90 min, AVG-PI) perform on other forecast horizons. Each model is selected on a specific forecasting horizon and shown as a certain point shape, and a line connects its performance for different forecasting horizons.

We have not considered models selected based on the CSI criterion, as the previous sections clearly illustrated that this criterion favours overfitted models that perform strongly for this specific criterion and a specific forecast horizon. Performance strongly degrades when considering other forecast horizons or evaluation criteria. In addition, no clear tendencies in the selected model hyperparameters can be identified if the model selection is performed based on CSI (Supplementary Material, Section S4). This underlines that the criterion leads to models that do not reflect the physical system behaviour.

We observe a general trend of the PI skill score to increase with the forecast horizon. The reason for this behaviour is that the error of the benchmark forecast (last known observation) increases for increasing forecast horizons. As a result, models selected on PI60 or PI90 may be overfitted and perform poorly on shorter forecast horizons (Supplementary Material, Sections S2 and S3). Selecting models based on their average PI for different forecast horizons (AVG-PI) reduces the overfit and leads to more robust models with robust CSI values.

## DISCUSSION

### Recommendations for the automated identification of flow forecast models

Our results suggest that it is possible to identify flow forecast models in an automated manner. Although models were trained on rain events that differed in size, no significant deviation in forecasting performance was observed (Supplementary Material, Section 1 and Figure S1). In terms of the criterion for parameter estimation, Figure 5 suggests that the most suitable approach may depend on the specific application. Generally, the multi-step estimation criterion (SC) favours models that better reflect the physical behaviour of the system, but the CSI of the forecast models will, in some cases, be inferior to those estimated using the traditional single-step procedure (SSE). The SSE criterion allows for greater flexibility of the predicted hydrographs. This may or may not be desirable, as the forecasts achieve higher skill in an operational sense. However, the forecasted hydrograph over multiple time steps will not necessarily reflect the shape of the observed hydrograph.

The selection of the best-performing model should, in an unsupervised procedure, use a criterion that is based on measuring the deviation from the hydrograph (such as PI) rather than an operational decision criterion such as CSI, as the latter carries a high risk of overfitting. The PI criterion generally increases as a function of the forecast horizon, as the performance of the constant value benchmark decreases. This implies a risk of non-identifiability if the forecast model is selected based on PI scores obtained for large forecast horizons. We were able to mitigate this problem by selecting models based on their average performance for different forecast horizons.

The hyperparameters of the selected models exhibited clear trends. For instance, selected models were mostly differentiated (*d* = 1) and had low AR term (*p*) and high MA term (*q*). Additionally, rainfall input improved forecasts and was mostly centred around specific lag values, that, however, differed between the catchments. Hence, recalibration of the models can likely be done more efficiently by considering a smaller search space.

### Limitations

This work focused on investigating what factors need to be considered when selecting Box–Jenkins-type models in an unsupervised manner. The work focused on flow measurements of two locations in Copenhagen, but the presented approach will likely be applicable to other catchments with similar physical characteristics.

Due to various changing factors such as changing temperature and groundwater levels, hydrological models occasionally need to be recalibrated (Troutman *et al.* 2017). Future work could investigate how often a recalibration should be performed and how much data is needed.

There are known data uncertainties, for example in the rainfall data, that will always be present in a real-world context. Radar rainfall input used in this work does not necessarily correspond to the true rainfall that is observed at the surface and information on the estimate of the uncertainty is not provided to the model. Such uncertainties imply that models that react less strongly to rainfall input will be selected and increase the uncertainty of flow forecasts. If desired, uncertainty bands for the forecasts can be generated based on the variance of model residuals (Madsen 2008). Previous work demonstrated that the uncertainty of rainfall forecasts can reduce the efficiency of control schemes for the sewer system by 20–30% (Löwe *et al.* 2016).

### Outlook

We applied a simple grid search to systematically evaluate the hyperparameter search space at the cost of computational expense. Hyperparameter optimization routines would enable a reduction of the computation time required for training the models. Heuristic search algorithms such as DDS (Tolson & Shoemaker 2017) or DREAM (Vrugt 2016) could be employed. Additionally, other criteria for model selection such as hydrological signatures (Shafii & Tolson 2015) could be promising. We would, however, expect that the consideration of such criteria is subject to similar challenges as identified in our work, i.e., selecting models based on very specific hydrograph features leads to a high risk of overfitting. Finally, as rain input does not need to be limited to a temporal resolution of the flow, rainfall-based daily intensities could be investigated. This might help the model capturing seasonal variations in flows.

We employed Box–Jenkins models for forecasting, but our procedures for defining rainfall input variables, tuning the model, and evaluating model performance are generic and can be transferred directly to other data-driven forecasting approaches such as artificial neural networks (Zhang *et al.* 2018; Crotti *et al.* 2020).

## CONCLUSIONS

We developed a method for unsupervised (automated) identification of data-driven models that forecast inflow to wastewater treatment facilities. Based on the results, we conclude that:

- 1.
We can obtain reliable forecasts of WWTP inflow with models that were identified in an unsupervised manner.

- 2.
To facilitate unsupervised model identification, the search space for precipitation variables that should be included in the forecast model can be constrained using meta-variables for the number of input variables and the time lag that should be considered.

- 3.
There is a high risk of overfitting if best-performing models are selected based on operational criteria (e.g. the frequency of flow threshold exceedance). Model selection criteria should instead ensure that the model generates physically meaningful hydrographs.

- 4.
Different approaches for tuning model coefficients need to be considered in initial stages of an unsupervised model selection procedure. A multi-step objective function leads to a more physically realistic behaviour of the forecasts but can lead to reduced forecast performance in terms of operational criteria. Thus, different objective functions may be optimal in different applications.

- 5.
Precipitation as an external regressor improves the performance of the models, but the improvement can be very small if the time of concentration of a catchment is greater than the considered forecast horizon. Precipitation input may, thus, not be necessary to generate short-term forecasts (<2 h) for larger urban catchments.

- 6.
Unsupervised model selection involves substantial computational efforts. However, once a suitable model was identified, recalibration to new observations can be performed quickly, because the hyperparameters selected in the unsupervised procedure exhibit clear trends that are linked to catchment characteristics.

## ACKNOWLEDGEMENTS

We thank Carsten Thirsing and BIOFOS for the provision of flow data and the DMI for the provision of radar rainfall data. Both datasets were made available as part of the Water Smart Cities project, which was supported by Innovation Fund Denmark (grant no. 5157-00009B).

## DATA AVAILABILITY STATEMENT

Data cannot be made publicly available; readers should contact the corresponding author for details.