The ROMS ocean model on AWS
Riha, S.
1 Introduction
NOTE (2020/03/22):
- This article is outdated with respect to the currently available IaaS services offered by AWS.
- The interpretation of Figure 2 may be incorrect or misleading. Here, we attribute the poor scaling to networking, but it may be present in shared-memory computations as well, given the very small tile size resulting from the domain partition of the benchmark3 application. The experiment should be repeated with a larger domain, and the scaling of shared-memory computations should be compared to distributed-memory computations.
The objective of this brief note is to provide a rough estimate of the costs involved in running a numerical ocean model on the Elastic Compute Cloud (EC2) infrastructure of Amazon Web Services (AWS). We use the Regional Ocean Modeling System (ROMS, Shchepetkin and McWilliams, 2005) on an EC2 CfnCluster and measure the execution time of pre-configured ROMS benchmark problems, yielding an approximate relationship between computational and monetary cost. Based on a recently published peer-reviewed study, we attempt to estimate the order of magnitude of the total computational and monetary cost involved in a typical numerical modeling project.
2 Method
2.1 Benchmark tests
ROMS provides three benchmark tests, consisting of an idealized model of the Southern Ocean with three grid sizes:
- benchmark1: 512 × 64 × 30 grid points
- benchmark2: 1024 × 128 × 30 grid points
- benchmark3: 2048 × 256 × 30 grid points
All experiments are integrated for 200 time steps; no grid data is read from or written to disk.
2.2 Hardware
Computations are performed on C4 instances of Amazon Web Services, which are described as featuring the highest performing processors and the lowest price/compute performance in AWS’ product range (Amazon Web Services, 2016c). These instances feature custom Intel Xeon E5-2666 v3 (Haswell) processors. For C4 instances, AWS defines a “vCPU” as a hyperthread of the Intel Xeon processor (Amazon Web Services, 2016c). Note that the comparable stock Xeon E5-2660 v3 has 10 cores with 2 threads per core (Intel, 2016), whereas the E5-2666 v3 is a customized version used by AWS. The two largest C4 instance types are c4.4xlarge and c4.8xlarge, which differ in the number of vCPUs and the amount of memory:
Instance type | vCPUs | Memory |
---|---|---|
c4.4xlarge | 16 | 30 GiB |
c4.8xlarge | 36 | 60 GiB |
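Since AWS counts each hyperthread as a vCPU, the number of physical cores is roughly half the vCPU count. A minimal Python sketch of this bookkeeping (not part of the original setup; it assumes two hyperthreads per physical core, as for the stock Haswell Xeons):

```python
# Rough sketch: derive physical core counts from vCPU counts,
# assuming 2 hyperthreads per physical core (assumption, not an AWS specification).
C4_INSTANCES = {
    "c4.4xlarge": {"vcpus": 16, "memory_gib": 30},
    "c4.8xlarge": {"vcpus": 36, "memory_gib": 60},
}

for name, spec in C4_INSTANCES.items():
    physical_cores = spec["vcpus"] // 2  # 2 threads per core (assumed)
    print(f"{name}: {spec['vcpus']} vCPUs ~ {physical_cores} physical cores, "
          f"{spec['memory_gib']} GiB memory")
```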
2.3 Software
For hardware and software provisioning, we use the CfnCluster tool provided by AWS (Amazon Web Services, 2016a). The tool provides a range of virtual-machine images, which are pre-configured for cluster computing on AWS’ hardware. Pre-installed packages include MPI libraries and schedulers. For the experiments described below, it was necessary to manually install the NetCDF Fortran libraries as a dependency for ROMS (although no NetCDF I/O is performed in the experiments). All results shown were produced with ROMS compiled with Open MPI using the GNU Fortran compiler (GFortran).
Operating system: | CentOS Linux release 7.2.1511 |
---|---|
Linux kernel: | v3.10.0 x86_64 |
Fortran compiler: | gcc-gfortran 4.8.5 |
MPI library: | openmpi 1.10.0 |
3 Results
3.1 Benchmark tests
Fig. 1 shows execution time for benchmark2 for various tiling configurations on a single c4.4xlarge instance (16 vCPUs, 30 GiB memory). The cause of the increase in execution time from the 1-core experiment to the 2-core experiments is unclear to us. For the remaining experiments, the data shows the expected scaling. Note that parallelization is achieved with the MPI library, but computations are performed on a single node, so no influence of networking bandwidth or latency is expected.
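A convenient way to read such single-node scaling curves is to convert the timings into speedup and parallel efficiency relative to the 1-core run. The sketch below illustrates this with placeholder timings; the values are hypothetical and would have to be replaced by the measurements shown in Fig. 1:

```python
def speedup_and_efficiency(t_serial, t_parallel, n_cores):
    """Return (speedup, parallel efficiency) relative to the 1-core run."""
    speedup = t_serial / t_parallel
    return speedup, speedup / n_cores

# Placeholder timings in seconds (hypothetical values, not the Fig. 1 measurements).
timings = {1: 400.0, 2: 420.0, 4: 210.0, 8: 110.0, 16: 60.0}

t1 = timings[1]
for n, t in sorted(timings.items()):
    s, e = speedup_and_efficiency(t1, t, n)
    print(f"{n:2d} cores: {t:6.1f} s, speedup {s:4.1f}, efficiency {e:4.2f}")
```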
Fig. 2 shows results from benchmark3, which has an execution time of roughly 230 s on 32 vCPUs of a c4.8xlarge instance using the GFortran compiler. The number of time steps integrated in benchmark3 is 200, i.e. one time step is processed every 1.15 s. Note that we use only 32 of the 36 vCPUs of the c4.8xlarge to avoid potential problems with some Linux operating systems, which have a vCPU limit of 32 (Amazon Web Services, 2016b). Fig. 2 shows the overhead induced by distributing the computation amongst several nodes. The networking configuration used in the experiment was the default configuration provided by CfnCluster. In particular, we did not verify whether the modifications described by Howard (2015) (see also IBM Corporation, 2010) had been applied before the experiment was conducted. We hypothesize that optimizing the infrastructure beyond the configuration provided by CfnCluster would likely not decrease execution time by an order of magnitude; if it did, we assume that such an optimization would be adopted by CfnCluster, or at least be described in its documentation.
3.2 Computational and monetary costs of a realistic study
The preliminary tests above allow a rough estimate of the computational and monetary costs involved in a realistic study. By monetary cost we mean the financial cost of renting AWS’ hardware to conduct a study of a given computational cost. As an example we consider Kumar et al. (2015), who validate the mid-shelf and surfzone circulation generated by a numerical model against observations. They use a coupled ROMS-SWAN model and compare it to observations from the 2006 Huntington Beach (San Pedro Bay, California, U.S.A.) experiment. We assume that such studies, which investigate transport processes directly adjacent to the coast, are particularly important for commercial applications. Whether it is actually technically feasible to conduct such a highly complex, state-of-the-art simulation on AWS’ cloud infrastructure is not addressed here. Instead, the objective is to gain a (very rough) order-of-magnitude estimate of the cost involved, assuming technical feasibility.
Components of the numerical experiment
The numerical study of Kumar et al. (2015) uses
- Wind forcing,
- Wave forcing,
- Tide forcing, and
- Buoyancy forcing.
All of these are coupled by the open-source Coupled Ocean-Atmosphere-Wave-Sediment Transport (COAWST) model (Warner et al., 2008; Woods Hole Coastal and Marine Science Center, 2016). To arrive at a rough estimate of the total computational cost, we make a couple of very crude approximations. First, we focus only on the ocean component of the model. Furthermore, within the ocean component, we only consider the computational cost of the spin-up phase. Kumar et al. (2015) state that the model is spun up for 15 years with climatological surface forcing, and we assume that this is the most costly part of their study. Note that in operational forecasting, no spin-up phase may be necessary, hence the results obtained here may not be representative of such studies. Kumar et al. (2015) use the following nested grids:
Region | Domain size \((km \times km)\) | Resolution \(\Delta x (m)\) | Number of grid points |
---|---|---|---|
U.S. West Coast and eastern Pacific (L0) | 4000×3000 | 5000 | 800×600×40 |
Southern California Bight (L1) | 800×700 | 1000 | 800×700×40 |
Interior bight region (L2) | 500×300 | 250 | 2000×1200×40 |
San Pedro Bay (L3) | 80×70 | 75 | 1067×934×32 |
Huntington Beach, Newport Beach (shelf break to inner shelf and surfzone, L4) | 15×30 | 50 | 214×428×20 |
Huntington Beach (L5) | 6×6 | 10 | 600×600×20 |
Let's assume here that L0 to L3 need to be spun up for 15 years. Unfortunately, Kumar et al. (2015) do not report the time steps used for the simulations on these grids. Hence, we have to make some further assumptions.
Computational cost of the spin-up phase
Assuming an ocean depth of H = 2000 m, the barotropic time step \(\Delta t_{bt}\) is constrained by the CFL condition to roughly \(\mathcal{O}\)(10 s) for L0 and L1, and \(\mathcal{O}\)(1 s) for L2 and L3, according to the simplified formula (which does not hold strictly in this practical application)
\[ \Delta t_{bt} \approx \frac{\Delta x}{\sqrt{gH}}, \]
where \(\Delta t_{bt}\) is the barotropic time step, \(\Delta x\) is the lateral grid spacing, and g is standard gravity (Tab. 1).
\(\Delta x (m)\) | \(\Delta t_{bt} (s)\) | \(\Delta t (s)\) | \( N(10^6) \) |
---|---|---|---|
5000 | 36 | 535 | 0.9 |
1000 | 7 | 107 | 4.4 |
250 | 1.8 | 27 | 18 |
75 | 0.5 | 8 | 59 |
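The values in Tab. 1 can be reproduced from the simplified CFL formula above; the sketch below assumes a baroclinic-to-barotropic step ratio of 15 (the assumed range is discussed in the next paragraph) and a 15-year spin-up:

```python
import math

G = 9.81          # standard gravity (m/s^2)
H = 2000.0        # assumed ocean depth (m)
RATIO = 15        # assumed baroclinic/barotropic time-step ratio
SPINUP_S = 15 * 365.25 * 86400.0   # 15-year spin-up in seconds

for dx in (5000.0, 1000.0, 250.0, 75.0):   # lateral grid spacing (m)
    dt_bt = dx / math.sqrt(G * H)          # barotropic step from the CFL estimate
    dt = RATIO * dt_bt                     # baroclinic step
    n_steps = SPINUP_S / dt                # baroclinic steps for 15 years
    print(f"dx={dx:6.0f} m  dt_bt={dt_bt:5.1f} s  dt={dt:5.0f} s  N={n_steps/1e6:5.1f}e6")
```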
Assuming that the baroclinic time step \(\Delta t\) is a factor of 10-20 longer than the barotropic time step, this yields \(\Delta t = \mathcal{O}\)(100 s) for L0 and L1, and \(\Delta t = \mathcal{O}\)(10 s) for L2 and L3. For a 15 yr simulation, N = \(\mathcal{O}(10^6)\) (N = \(\mathcal{O}(10^7)\)) baroclinic time steps are required for L0 and L1 (L2 and L3). Clearly, these estimates are very rough. For example, to account for the shallower depth of the L2 and L3 grids, one may assume a depth decrease by a factor of 10, and accordingly increase \(\Delta t_{bt}\) by a factor of 3 for L2 and L3.
So far, no results are available measuring execution time for grids of similar size as L0-L3. The closest match is benchmark3, with a grid size of 2048×256×30 points, which is roughly 80% (15%) of the number of spatial grid points in L0 (L2). Note that it is always possible to assume identical physical dimensions of the grid spacing (including the temporal lattice) between benchmark3 and L0-L3, since the benchmarking results are independent of physical dimensions and depend only on the number of grid points. To estimate the associated monetary cost, we consider the results from the 16×2 tiling configuration. Since AWS charges fees based on units of time, this is the most cost-effective option, under the condition that there is no additional value gained by obtaining the results quickly. In other words, if one can afford to wait for the modeling results, one can save money by eliminating all networking overhead. Note, however, that this is unlikely to be the case in e.g. operational modeling studies, where it is desirable to obtain forecasts as quickly as possible. Our results showed that in benchmark3, the processing time for one time step is about 1.15 s using 32 vCPUs of a single c4.8xlarge instance. It follows that a 15 yr spin-up of benchmark3 (assumed to consist of \(10^6\)-\(10^7\) baroclinic time steps of the lengths displayed in Tab. 1) takes a minimum of \(\mathcal{O}(10)\) days to complete.
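As a cross-check of the \(\mathcal{O}(10)\)-day figure, the following sketch combines the measured 1.15 s per time step with the assumed range of \(10^6\)-\(10^7\) baroclinic steps, optimistically assuming the full grids run at the per-step cost of benchmark3 on a single c4.8xlarge:

```python
SECONDS_PER_STEP = 230.0 / 200.0   # benchmark3: 230 s for 200 time steps on 32 vCPUs

for n_steps in (1e6, 1e7):         # assumed range of baroclinic steps for a 15 yr spin-up
    wallclock_days = n_steps * SECONDS_PER_STEP / 86400.0
    print(f"{n_steps:.0e} steps -> ~{wallclock_days:.0f} days of wall-clock time")
```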
Monetary cost
The instance type c4.8xlarge is currently (2017/01/12) priced at $1.591 ($2.085) per hour for on-demand use in the US East (Asia Pacific, Sydney) region, resulting in a total of roughly \(\mathcal{O}\)($500) for a 10-day lease. Given that high-resolution, state-of-the-art regional and coastal numerical studies such as Kumar et al. (2015) are conducted on multiple different large grids, some of which require an extensive spin-up phase, we estimate that the total monetary cost of cloud-computing infrastructure will typically be on the order of \(\mathcal{O}\)($10,000) to \(\mathcal{O}\)($100,000). Note that this estimate refers to the final experiment(s) whose results are published. Importantly, it does not include the testing phase, which may well increase the total cost of a study by a factor of 10 or more.
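A minimal sketch of the arithmetic behind the \(\mathcal{O}\)($500) figure, using the on-demand prices quoted above for a 10-day lease of a single c4.8xlarge instance:

```python
# On-demand c4.8xlarge prices as quoted in the text (USD per hour, 2017/01/12).
HOURLY_RATE_USD = {"US East": 1.591, "Asia Pacific (Sydney)": 2.085}
LEASE_DAYS = 10

for region, rate in HOURLY_RATE_USD.items():
    cost = rate * 24 * LEASE_DAYS
    print(f"{region}: ~${cost:,.0f} for a {LEASE_DAYS}-day on-demand lease")
```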
4 Discussion
We found the cost of leasing infrastructure from AWS surprisingly high, although this statement by itself is of limited meaning without a comparison to on-premise infrastructure. Especially in meteorology and oceanography, computing infrastructure is mostly funded by governmental agencies, which makes it much more difficult to obtain a value/cost relationship. I welcome feedback on this note, preferably via email.
References
- Amazon Web Services, 2016a: CfnCluster
- Amazon Web Services, 2016b: Compute optimized instances
- Amazon Web Services, 2016c: EC2 Instance Types
- Howard, A., 2015: Running MPI applications in Amazon EC2
- IBM Corporation, 2010: TCP_MEM change for TCP connection errors running MPI jobs - IBM Cluster 1350
- Intel, 2016: Intel Xeon Processor E5-2660 v3 (25M Cache, 2.60 GHz)
- Kumar, N., Feddersen, F., Uchiyama, Y., McWilliams, J., O’Reilly, W., 2015: Midshelf to Surfzone Coupled ROMS-SWAN Model Data Comparison of Waves, Currents, and Temperature: Diagnosis of Subtidal Forcings and Response
- Shchepetkin, A.F., McWilliams, J.C., 2005: The regional oceanic modeling system (ROMS): a split-explicit, free-surface, topography-following-coordinate oceanic model
- Warner, J.C., Perlin, N., Skyllingstad, E.D., 2008: Using the Model Coupling Toolkit to couple earth system models
- Woods Hole Coastal and Marine Science Center, 2016: COAWST: A Coupled-Ocean-Atmosphere-Wave-Sediment Transport Modeling System