The ROMS ocean model on AWS
Riha, S.
1 Introduction
NOTE (2020/03/22):
- This article is outdated with respect to the currently available IaaS services offered by AWS.
- The interpretation of Figure 2 may be incorrect or misleading. Here, we attribute the poor scaling to networking, but it may be present in shared-memory computations as well, given the very small tile size resulting from the domain partition of the benchmark3 application. The experiment should be repeated with a larger domain, and the scaling of shared-memory computations should be compared to distributed-memory computations.
The objective of this brief note is to provide a rough estimate of the costs involved in running a numerical ocean model on the Elastic Compute Cloud (EC2) infrastructure of Amazon Web Services (AWS). We use the Regional Ocean Modeling System (ROMS, Shchepetkin and McWilliams, 2005) on an EC2 CfnCluster and measure the execution time of pre-configured ROMS benchmark problems, yielding an approximate relationship between computational and monetary cost. Based on a recently published peer-reviewed study, we attempt to estimate the order of magnitude of the total computational and monetary cost involved in a typical numerical modeling project.
2 Method
2.1 Benchmark tests
ROMS provides three benchmark tests, consisting of an idealized model of the Southern Ocean with three grid sizes:
- benchmark1: 512 × 64 × 30 grid points
- benchmark2: 1024 × 128 × 30 grid points
- benchmark3: 2048 × 256 × 30 grid points
All experiments are integrated for 200 time steps; no grid data is read from or written to disk.
2.2 Hardware
Computations are performed on C4 instances of Amazon Web Services, which are described as featuring the highest performing processors and the lowest price/compute performance in AWS’ product range (Amazon Web Services, 2016c). These instances feature custom Intel Xeon E5-2666 v3 (Haswell) processors. For C4 instances, AWS defines a “vCPU” as a hyperthread of the Intel Xeon processor (Amazon Web Services, 2016c). Note that the comparable stock Xeon E5-2660 v3 has 10 cores with 2 threads per core (Intel, 2016), whereas the E5-2666 v3 is a customized version used by AWS. The two largest C4 instance types are c4.4xlarge and c4.8xlarge, which differ in the number of vCPUs and the amount of memory:
Instance type | vCPUs | Memory |
---|---|---|
c4.4xlarge | 16 | 30 GiB |
c4.8xlarge | 36 | 60 GiB |
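Since AWS counts each hyperthread as a vCPU, the number of physical cores is roughly half the vCPU count. A minimal Python sketch of this bookkeeping (not part of the original setup; it assumes two hyperthreads per physical core, as for the stock Haswell Xeons):

```python
# Rough sketch: derive physical core counts from vCPU counts,
# assuming 2 hyperthreads per physical core (assumption, not an AWS specification).
C4_INSTANCES = {
    "c4.4xlarge": {"vcpus": 16, "memory_gib": 30},
    "c4.8xlarge": {"vcpus": 36, "memory_gib": 60},
}

for name, spec in C4_INSTANCES.items():
    physical_cores = spec["vcpus"] // 2  # 2 threads per core (assumed)
    print(f"{name}: {spec['vcpus']} vCPUs ~ {physical_cores} physical cores, "
          f"{spec['memory_gib']} GiB memory")
```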
2.3 Software
For hardware and software provisioning, we use the CfnCluster tool provided by AWS (Amazon Web Services, 2016a). The tool provides a range of virtual-machine images, which are pre-configured for cluster computing on AWS’ hardware. Pre-installed packages include MPI libraries and schedulers. For the experiments described below, it was necessary to manually install the NetCDF Fortran libraries as a dependency for ROMS (although no NetCDF I/O is performed in the experiments). All results shown were produced with ROMS compiled with Open MPI using the GNU Fortran compiler (GFortran).
Operating system: | CentOS Linux release 7.2.1511 |
---|---|
Linux kernel: | v3.10.0 x86_64 |
Fortran compiler: | gcc-gfortran 4.8.5 |
MPI library: | openmpi 1.10.0 |
3 Results
3.1 Benchmark tests
Fig. 1 shows execution time for benchmark2 for various tiling configurations on a single c4.4xlarge instance (16 vCPUs, 30 GiB memory). The cause of the increase in execution time from the 1-core experiment to the 2-core experiments is unclear to us. For the remaining experiments, the data shows the expected scaling. Note that parallelization is achieved with the MPI library, but computations are performed on a single node, so no influence of networking bandwidth or latency is expected.
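A convenient way to read such single-node scaling curves is to convert the timings into speedup and parallel efficiency relative to the 1-core run. The sketch below illustrates this with placeholder timings; the values are hypothetical and would have to be replaced by the measurements shown in Fig. 1:

```python
def speedup_and_efficiency(t_serial, t_parallel, n_cores):
    """Return (speedup, parallel efficiency) relative to the 1-core run."""
    speedup = t_serial / t_parallel
    return speedup, speedup / n_cores

# Placeholder timings in seconds (hypothetical values, not the Fig. 1 measurements).
timings = {1: 400.0, 2: 420.0, 4: 210.0, 8: 110.0, 16: 60.0}

t1 = timings[1]
for n, t in sorted(timings.items()):
    s, e = speedup_and_efficiency(t1, t, n)
    print(f"{n:2d} cores: {t:6.1f} s, speedup {s:4.1f}, efficiency {e:4.2f}")
```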
Fig. 2 shows results from benchmark3, which has an execution time of roughly 230 s on 32 vCPUs of a c4.8xlarge instance using the GFortran compiler. The number of time steps integrated in benchmark3 is 200, i.e. one time step is processed every 1.15 s. Note that we use only 32 of the 36 vCPUs of the c4.8xlarge to avoid potential problems with some Linux operating systems, which have a vCPU limit of 32 (Amazon Web Services, 2016b). Fig. 2 shows the overhead induced by distributing the computation amongst several nodes. The networking configuration used in the experiment was the default configuration provided by CfnCluster. In particular, we did not verify whether the modifications described by Howard (2015) (see also IBM Corporation, 2010) had been applied before the experiment was conducted. We hypothesize that optimizing the infrastructure beyond the configuration provided by CfnCluster would likely not decrease execution time by an order of magnitude; if it did, we assume that such an optimization would be adopted by CfnCluster, or at least be described in its documentation.
3.2 Computational and monetary costs of a realistic study
The preliminary tests above allow a rough estimate of the computational and monetary costs involved in a realistic study. By monetary cost we mean the financial cost of renting AWS’ hardware to conduct a study of a given computational cost. As an example we consider Kumar et al. (2015), who validate the mid-shelf and surfzone circulation generated by a numerical model against observations. They use a coupled ROMS-SWAN model and compare it to observations from the 2006 Huntington Beach (San Pedro Bay, California, U.S.A.) experiment. We assume that such studies, which investigate transport processes directly adjacent to the coast, are particularly important for commercial applications. Whether it is actually technically feasible to conduct such a highly complex, state-of-the-art simulation on AWS’ cloud infrastructure is not addressed here. Instead, the objective is to gain a (very rough) order-of-magnitude estimate of the cost involved, assuming technical feasibility.
Components of the numerical experiment
The numerical study of Kumar et al. (2015) uses
- Wind forcing,
- Wave forcing,
- Tide forcing, and
- Buoyancy forcing.
All of these are coupled by the open-source Coupled Ocean-Atmosphere-Wave-Sediment Transport (COAWST) model (Warner et al., 2008; Woods Hole Coastal and Marine Science Center, 2016). To arrive at a rough estimate of the total computational cost, we make a couple of very crude approximations. First, we focus only on the ocean component of the model. Furthermore, within the ocean component, we only consider the computational cost of the spin-up phase. Kumar et al. (2015) state that the model is spun up for 15 years with climatological surface forcing, and we assume that this is the most costly part of their study. Note that in operational forecasting, no spin-up phase may be necessary, hence the results obtained here may not be representative of such studies. Kumar et al. (2015) use the following nested grids:
Region | Domain size \((km \times km)\) | Resolution \(\Delta x (m)\) | Number of grid points |
---|---|---|---|
U.S. West Coast and eastern Pacific (L0) | 4000×3000 | 5000 | 800×600×40 |
Southern California Bight (L1) | 800×700 | 1000 | 800×700×40 |
Interior bight region (L2) | 500×300 | 250 | 2000×1200×40 |
San Pedro Bay (L3) | 80×70 | 75 | 1067×934×32 |
Huntington Beach, Newport Beach (shelf break to inner shelf and surfzone, L4) | 15×30 | 50 | 214×428×20 |
Huntington Beach (L5) | 6×6 | 10 | 600×600×20 |
Let's assume here that L0 to L3 need to be spun up for 15 years. Unfortunately, Kumar et al. (2015) do not report the time steps used for the simulations on these grids. Hence, we have to make some further assumptions.
Computational cost of the spin-up phase
Assuming an ocean depth of H = 2000 m, the barotropic time step \(\Delta t_{bt}\) is constrained by the CFL condition to roughly \(\mathcal{O}\)(10 s) for L0 and L1, and \(\mathcal{O}\)(1 s) for L2 and L3, according to the simplified formula (which does not hold strictly in this practical application)
\[ \Delta t_{bt} \approx \frac{\Delta x}{\sqrt{gH}}, \]
where \(\Delta t_{bt}\) is the barotropic time step, \(\Delta x\) is the lateral grid spacing, and g is standard gravity (Tab. 1).
\(\Delta x (m)\) | \(\Delta t_{bt} (s)\) | \(\Delta t (s)\) | \( N(10^6) \) |
---|---|---|---|
5000 | 36 | 535 | 0.9 |
1000 | 7 | 107 | 4.4 |
250 | 1.8 | 27 | 18 |
75 | 0.5 | 8 | 59 |
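The values in Tab. 1 can be reproduced from the simplified CFL formula above; the sketch below assumes a baroclinic-to-barotropic step ratio of 15 (the assumed range is discussed in the next paragraph) and a 15-year spin-up:

```python
import math

G = 9.81          # standard gravity (m/s^2)
H = 2000.0        # assumed ocean depth (m)
RATIO = 15        # assumed baroclinic/barotropic time-step ratio
SPINUP_S = 15 * 365.25 * 86400.0   # 15-year spin-up in seconds

for dx in (5000.0, 1000.0, 250.0, 75.0):   # lateral grid spacing (m)
    dt_bt = dx / math.sqrt(G * H)          # barotropic step from the CFL estimate
    dt = RATIO * dt_bt                     # baroclinic step
    n_steps = SPINUP_S / dt                # baroclinic steps for 15 years
    print(f"dx={dx:6.0f} m  dt_bt={dt_bt:5.1f} s  dt={dt:5.0f} s  N={n_steps/1e6:5.1f}e6")
```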
Assuming that the baroclinic time step \(\Delta t\) is a factor of 10-20 longer than the barotropic time step, this yields \(\Delta t = \mathcal{O}\)(100 s) for L0 and L1, and \(\Delta t = \mathcal{O}\)(10 s) for L2 and L3. For a 15 yr simulation, N = \(\mathcal{O}(10^6)\) (N = \(\mathcal{O}(10^7)\)) baroclinic time steps are required for L0 and L1 (L2 and L3). Clearly, these estimates are very rough. For example, to account for the shallower depth of the L2 and L3 grids, one may assume a depth decrease by a factor of 10, and accordingly increase \(\Delta t_{bt}\) by a factor of 3 for L2 and L3.
So far, no results are available measuring execution time for grids of similar size as L0-L3. The closest match is benchmark3, with a grid size of 2048×256×30 points, which is roughly 80% (15%) of the number of spatial grid points in L0 (L2). Note that it is always possible to assume identical physical dimensions of the grid spacing (including the temporal lattice) between benchmark3 and L0-L3, since the benchmarking results are independent of physical dimensions and depend only on the number of grid points. To estimate the associated monetary cost, we consider the results from the 16×2 tiling configuration. Since AWS charges fees based on units of time, this is the most cost-effective option, under the condition that there is no additional value gained by obtaining the results quickly. In other words, if one can afford to wait for the modeling results, one can save money by eliminating all networking overhead. Note, however, that this is unlikely to be the case in e.g. operational modeling studies, where it is desirable to obtain forecasts as quickly as possible. Our results showed that in benchmark3, the processing time for one time step is about 1.15 s using 32 vCPUs of a single c4.8xlarge instance. It follows that a 15 yr spin-up of benchmark3 (assumed to consist of \(10^6\)-\(10^7\) baroclinic time steps of the lengths displayed in Tab. 1) takes a minimum of \(\mathcal{O}(10)\) days to complete.
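As a cross-check of the \(\mathcal{O}(10)\)-day figure, the following sketch combines the measured 1.15 s per time step with the assumed range of \(10^6\)-\(10^7\) baroclinic steps, optimistically assuming the full grids run at the per-step cost of benchmark3 on a single c4.8xlarge:

```python
SECONDS_PER_STEP = 230.0 / 200.0   # benchmark3: 230 s for 200 time steps on 32 vCPUs

for n_steps in (1e6, 1e7):         # assumed range of baroclinic steps for a 15 yr spin-up
    wallclock_days = n_steps * SECONDS_PER_STEP / 86400.0
    print(f"{n_steps:.0e} steps -> ~{wallclock_days:.0f} days of wall-clock time")
```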
Monetary cost
The instance type c4.8xlarge is currently (2017/01/12) priced at $1.591 ($2.085) per hour for on-demand use in the US East (Asia Pacific, Sydney) region, resulting in a total of roughly \(\mathcal{O}\)($500) for a 10-day lease. Given that high-resolution, state-of-the-art regional and coastal numerical studies such as Kumar et al. (2015) are conducted on multiple different large grids, some of which require an extensive spin-up phase, we estimate that the total monetary cost of cloud-computing infrastructure will typically be on the order of \(\mathcal{O}\)($10,000) to \(\mathcal{O}\)($100,000). Note that this estimate refers to the final experiment(s) whose results are published. Importantly, it does not include the testing phase, which may well increase the total cost of a study by a factor of 10 or more.
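A minimal sketch of the arithmetic behind the \(\mathcal{O}\)($500) figure, using the on-demand prices quoted above for a 10-day lease of a single c4.8xlarge instance:

```python
# On-demand c4.8xlarge prices as quoted in the text (USD per hour, 2017/01/12).
HOURLY_RATE_USD = {"US East": 1.591, "Asia Pacific (Sydney)": 2.085}
LEASE_DAYS = 10

for region, rate in HOURLY_RATE_USD.items():
    cost = rate * 24 * LEASE_DAYS
    print(f"{region}: ~${cost:,.0f} for a {LEASE_DAYS}-day on-demand lease")
```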
4 Discussion
We found the cost of leasing infrastructure from AWS surprisingly high, although this statement by itself is of limited meaning without a comparison to on-premise infrastructure. Especially in meteorology and oceanography, computing infrastructure is mostly funded by governmental agencies, which makes it much more difficult to obtain a value/cost relationship. I welcome feedback on this note, preferably via email.
References
- Amazon Web Services, 2016a: CfnCluster
- Amazon Web Services, 2016b: Compute optimized instances
- Amazon Web Services, 2016c: EC2 Instance Types
- Howard, A., 2015: Running MPI applications in Amazon EC2
- IBM Corporation, 2010: TCP_MEM change for TCP connection errors running MPI jobs - IBM Cluster 1350
- Intel, 2016: Intel Xeon Processor E5-2660 v3 (25M Cache, 2.60 GHz)
- Kumar, N., Feddersen, F., Uchiyama, Y., McWilliams, J., O’Reilly, W., 2015: Midshelf to Surfzone Coupled ROMS-SWAN Model Data Comparison of Waves, Currents, and Temperature: Diagnosis of Subtidal Forcings and Response
- Shchepetkin, A.F., McWilliams, J.C., 2005: The regional oceanic modeling system (ROMS): a split-explicit, free-surface, topography-following-coordinate oceanic model
- Warner, J.C., Perlin, N., Skyllingstad, E.D., 2008: Using the Model Coupling Toolkit to couple earth system models
- Woods Hole Coastal and Marine Science Center, 2016: COAWST: A Coupled-Ocean-Atmosphere-Wave-Sediment Transport Modeling System