Software

I strongly advocate for the use of open source software. Throughout my PhD, I have interacted with several industrial partners who relied on proprietary software while still expecting thorough, comparative results. Despite these challenges, I have consistently promoted the use of open source software and am proud of my contributions towards improving scientific computing software, however small they may be. All of my repositories can be found on my GitHub.

Currently, I manage the software stack of our lab at Georgia Tech, which includes continuous integration (CI), a custom Julia package registry, and more. I am also the main developer of several open source projects focusing on wave-equation based inversion, data interpolation, and machine learning for the scientific community. Our team aims to create portable and extensible software, and you can find all of our projects at SlimGroup.

Below are some of my most relevant contributions to open source software:

Main contributor

Devito

Devito is a finite-difference/stencil DSL and just-in-time compiler for high-performance stencil computations. Devito's interface and high-level API are built on SymPy for symbolic manipulation. The compiler generates high-performance C code and supports OpenMP, MPI, GPU offloading via OpenACC or OpenMP, and combined MPI+GPU execution.

Devito provides a functional language to implement sophisticated operators that can be made up of multiple stencil computations, boundary conditions, sparse operations (e.g., interpolation), and much more. A typical use case is explicit finite difference methods for approximating partial differential equations. For example, a 2D diffusion operator may be implemented with Devito as follows

from devito import *
# Cartesian 10x10 grid and a time-dependent field with a second-order spatial discretization
grid = Grid(shape=(10, 10))
f = TimeFunction(name='f', grid=grid, space_order=2)
# Diffusion equation df/dt = 0.5*laplace(f), rearranged into an update for f at the next time step
eqn = Eq(f.dt, 0.5 * f.laplace)
op = Operator(Eq(f.forward, solve(eqn, f.forward)))

An Operator generates low-level code from an ordered collection of Eq objects (the example above containing a single equation). The code generated for the above operator can be inspected with

print(op)

which prints the generated code that will be compiled at runtime (in this case, C code with OpenMP from the default Devito CPU configuration):

#define _POSIX_C_SOURCE 200809L
#define START_TIMER(S) struct timeval start_ ## S , end_ ## S ; gettimeofday(&start_ ## S , NULL);
#define STOP_TIMER(S,T) gettimeofday(&end_ ## S, NULL); T->S += (double)(end_ ## S .tv_sec-start_ ## S.tv_sec)+(double)(end_ ## S .tv_usec-start_ ## S .tv_usec)/1000000;

#include "stdlib.h"
#include "math.h"
#include "sys/time.h"
#include "omp.h"

struct dataobj
{
  void *restrict data;
  unsigned long * size;
  unsigned long * npsize;
  unsigned long * dsize;
  int * hsize;
  int * hofs;
  int * oofs;
  void * dmap;
} ;

struct profiler
{
  double section0;
} ;


int Kernel(struct dataobj *restrict f_vec, const float dt, const float h_x, const float h_y, const int time_M, const int time_m, const int x_M, const int x_m, const int y_M, const int y_m, const int nthreads, struct profiler * timers)
{
  float (*restrict f)[f_vec->size[1]][f_vec->size[2]] __attribute__ ((aligned (64))) = (float (*)[f_vec->size[1]][f_vec->size[2]]) f_vec->data;

  float r0 = 1.0F/(h_x*h_x);
  float r1 = 1.0F/(h_y*h_y);
  float r2 = 1.0F/dt;

  for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2))
  {
    /* Begin section0 */
    START_TIMER(section0)
    #pragma omp parallel num_threads(nthreads)
    {
      #pragma omp for schedule(dynamic,1)
      for (int x = x_m; x <= x_M; x += 1)
      {
        #pragma omp simd aligned(f:16)
        for (int y = y_m; y <= y_M; y += 1)
        {
          f[t1][x + 2][y + 2] = dt*(-r0*f[t0][x + 2][y + 2] - r1*f[t0][x + 2][y + 2] + r2*f[t0][x + 2][y + 2] + 5.0e-1F*(r0*f[t0][x + 1][y + 2] + r0*f[t0][x + 3][y + 2] + r1*f[t0][x + 2][y + 1] + r1*f[t0][x + 2][y + 3]));
        }
      }
    }
    STOP_TIMER(section0,timers)
    /* End section0 */
  }

  return 0;
}

The operator can then be compiled and executed by calling it with the runtime arguments, such as the number of time steps and the time-step size:

op(t=timesteps, dt=dt)

There is virtually no limit to the complexity of an Operator – the Devito compiler will automatically analyze the input, detect and apply optimizations (including single- and multi-node parallelism), and eventually generate code with suitable loops and expressions.
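
As a minimal sketch, the snippet below builds a single Operator from several equations: the diffusion update from the example above, a Dirichlet boundary condition written with indexed access, and a sparse interpolation that samples the field at a few arbitrary coordinates; the receiver setup is made up for illustration.

import numpy as np
from devito import Grid, TimeFunction, SparseTimeFunction, Eq, Operator, solve

grid = Grid(shape=(10, 10))
x, y = grid.dimensions
t = grid.stepping_dim

f = TimeFunction(name='f', grid=grid, space_order=2)
# Four "receiver" points at random coordinates inside the unit-box domain, sampled over 100 time steps
rec = SparseTimeFunction(name='rec', grid=grid, npoint=4, nt=100,
                         coordinates=np.random.rand(4, 2))

# Diffusion update, as in the example above
update = Eq(f.forward, solve(Eq(f.dt, 0.5 * f.laplace), f.forward))
# Dirichlet boundary condition on the left edge
bc = Eq(f[t + 1, 0, y], 0.)

# One Operator combining the stencil update, the boundary condition and the sparse
# interpolation; the compiler schedules and optimizes all of them together
op = Operator([update, bc] + rec.interpolate(expr=f))

When executed, the same compiled kernel advances the field, enforces the boundary condition, and records the interpolated values at the receiver locations.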

Key features include:

  • A functional language to express finite difference operators.
  • Straightforward mechanisms to adjust the discretization.
  • Constructs to express sparse operators (e.g., interpolation), classic linear operators (e.g., convolutions), and tensor contractions.
  • Seamless support for boundary conditions and adjoint operators.
  • A flexible API to define custom stencils, sub-domains, sub-sampling, and staggered grids.
  • Generation of highly optimized parallel code (SIMD vectorization, CPU and GPU parallelism via OpenMP and OpenACC, multi-node parallelism via MPI, blocking, aggressive symbolic transformations for FLOP reduction, etc.).
  • Distributed NumPy arrays over multi-node (MPI) domain decompositions.
  • Inspection and customization of the generated code.
  • Autotuning framework to ease performance tuning.
  • Smooth integration with popular Python packages such as NumPy, SymPy, Dask, and SciPy, as well as machine learning frameworks such as TensorFlow and PyTorch.

JUDI

JUDI and its derived packages form a Julia framework for PDE-constrained optimization that provides a high-level, linear-algebra-style user interface while delivering state-of-the-art computational performance through Devito.

As an illustrative example, we show below how JUDI can be used to solve a linearized least-squares inversion problem for the acoustic wave equation. You can find more information in the JUDI Documentation.

# Assumes the model `model0`, source `q`, observed data `dD`, and `info` structure have been set up beforehand
using JUDI, LinearAlgebra, Random

# Set up matrix-free linear operators
opt = Options(optimal_checkpointing = true)    # set to false to disable optimal checkpointing
F = judiModeling(model0, q.geometry, dD.geometry; options=opt)
J = judiJacobian(F, q)

# Right-hand preconditioners (model topmute)
Mr = judiTopmute(model0.n, 52, 10)  # mute up to grid point 52, with 10 point taper

# Left-hand preconditioners
Ml = judiDataMute(q.geometry, dD.geometry; t0=.120) # data topmute starting at 120ms (30 samples)

# Stochastic gradient
x = zeros(Float32, info.n)  # zero initial guess
batchsize = 10  # use subset of 10 shots per iteration
niter = 32
fval = zeros(Float32, niter)

for j=1:niter
    println("Iteration: ", j)

    # Select batch and set up left-hand preconditioner
    i = randperm(dD.nsrc)[1:batchsize]

    # Compute residual and gradient
    r = Ml[i]*J[i]*Mr*x - Ml[i]*dD[i]
    g = adjoint(Mr)*adjoint(J[i])*adjoint(Ml[i])*r

    # Step size and update variable
    fval[j] = .5f0*norm(r)^2
    t = norm(r)^2/norm(g)^2
    global x -= t*g
end

From this base package, a few extensions have been developed to broaden its capabilities and range of applications.

JUDI4Cloud

JUDI4Cloud is an extension of JUDI that runs its core computation, the task-parallel wave-equation solves, on Azure using Azure Batch. It interfaces with AzureClusterlessHPC.jl, which provides the Julia interface to Azure Batch.

TimeProbeSeismic

TimeProbeSeismic implements wave-equation based inversion with randomized trace estimation, also known as time probing. Instead of storing the full time history of the forward wavefield, the gradient's zero-lag correlation over time is approximated by probing the time axis with a small number of random vectors, drastically reducing the memory footprint of seismic inversion at the cost of a controlled approximation error. It is built on top of JUDI and Devito.
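
The idea can be illustrated with plain NumPy; this is a generic sketch of randomized trace estimation with made-up sizes, not the package's interface:

import numpy as np

rng = np.random.default_rng(0)
nx, nt, p = 500, 1000, 256                     # grid points, time steps, probing vectors

U = rng.standard_normal((nx, nt))              # "forward" wavefield (normally too large to store)
V = U + 0.1 * rng.standard_normal((nx, nt))    # "adjoint" wavefield, correlated with U

# Exact zero-lag correlation over time, per grid point
g_exact = np.sum(U * V, axis=1)

# Probed estimate: Z satisfies E[Z @ Z.T] = I, so the estimate is unbiased.
# Only U @ Z, of size (nx, p), needs to be kept instead of the full (nx, nt) wavefield.
Z = rng.choice([-1.0, 1.0], size=(nt, p)) / np.sqrt(p)
g_probe = np.sum((U @ Z) * (V @ Z), axis=1)

print(np.linalg.norm(g_probe - g_exact) / np.linalg.norm(g_exact))   # error decreases with p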

ImageGather

ImageGather is a lightweight package for computing surface offset gathers, used for quality control of inverted seismic models. It implements the double-migration method, using JUDI and Devito, for surface offset gathers and the standard extended Born modeling for subsurface offset gathers.

PhotoAcoustic.jl

Photoacoustic imaging and inversion software. This package is built on top of JUDI.jl and Devito to offer a high-level interface and state-of-the-art computational performance. It is also designed to be used in combination with machine learning frameworks, with differentiation rules defined so that the photoacoustic propagators can be used as network layers.
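
As a generic illustration of this pattern (shown here in PyTorch rather than the package's Julia interface, with a made-up dense matrix standing in for the propagator), a linear operator becomes a differentiable layer once its forward action and its adjoint are registered as the forward and backward rules:

import torch

class LinearPropagator(torch.autograd.Function):
    # Generic linear operator A wrapped as a differentiable layer:
    # forward applies A, backward applies its adjoint A^T
    @staticmethod
    def forward(ctx, x, A):
        ctx.save_for_backward(A)
        return A @ x

    @staticmethod
    def backward(ctx, grad_out):
        (A,) = ctx.saved_tensors
        return A.T @ grad_out, None

A = torch.randn(20, 50)                      # stand-in for a propagator
x = torch.randn(50, requires_grad=True)
loss = 0.5 * (LinearPropagator.apply(x, A) ** 2).sum()
loss.backward()
print(torch.allclose(x.grad, (A.T @ (A @ x)).detach()))   # True: adjoint rule gives the exact gradient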

Others

XConv: This repository implements a high-performance, low-memory convolutional layer in both Julia (for Flux.jl) and Python (for PyTorch). The convolutional layer has a virtually zero memory footprint for training, using randomized linear algebra to compute an unbiased estimate of the gradient with respect to the weights. Additionally, a byte-only implementation of the ReLU layer reduces the memory used by full networks by a factor of two.
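
This randomized estimate can be sketched in plain NumPy for the simplified case of a 1x1 convolution, i.e. a dense layer applied per pixel; the sizes are made up and this is not the XConv API:

import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out, p = 1024, 16, 8, 32    # batch x pixels, channels in/out, probing vectors

X = rng.standard_normal((n, d_in))     # layer input, saved during the forward pass
G = rng.standard_normal((n, d_out))    # gradient with respect to the layer output

# Exact weight gradient of Y = X @ W requires keeping the full (n, d_in) input
dW_exact = X.T @ G

def probed_gradient():
    # Rademacher probes scaled so that E[Z @ Z.T] = I
    Z = rng.choice([-1.0, 1.0], size=(n, p)) / np.sqrt(p)
    # Only X.T @ Z, of size (d_in, p), has to be stored instead of X
    return (X.T @ Z) @ (Z.T @ G)

# The estimate is unbiased: averaging independent draws converges to dW_exact
avg = np.mean([probed_gradient() for _ in range(500)], axis=0)
print(np.linalg.norm(avg - dW_exact) / np.linalg.norm(dW_exact))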

SlimOptim: This software is a Julia implementation of three of our main optimization algorithms: SPG, PQN, and linearized Bregman. The implementation does not aim to be computationally optimal for small problems; it targets PDE-constrained optimization, where the PDE solves dominate the computational cost. The SPG and PQN implementations are inspired by and adapted from minConf.
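
For illustration, here is a generic sketch of the textbook linearized Bregman iteration for sparsity-promoting recovery; the toy problem is made up and this is not the SlimOptim interface:

import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def linearized_bregman(A, b, lam, step, iters=2000):
    # Solves min lam*||x||_1 + 0.5*||x||_2^2  subject to  A x = b
    z = np.zeros(A.shape[1])             # dual variable
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z -= step * (A.T @ (A @ x - b))  # gradient step on the data residual
        x = soft_threshold(z, lam)       # primal update by soft thresholding
    return x

# Toy problem: recover a sparse vector from underdetermined Gaussian measurements
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 200))
x_true = np.zeros(200)
x_true[rng.choice(200, size=5, replace=False)] = 1.0
b = A @ x_true
step = 1.0 / np.linalg.norm(A, 2) ** 2
x_rec = linearized_bregman(A, b, lam=2.0, step=step)   # approaches the sparse solution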

Contributor

InvertibleNetworks: Julia implementation of invertible neural networks. Invertible networks are fully reversible and do not need to store intermediate state variables for training, which makes them virtually memory-free.
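
A generic sketch of an additive coupling layer, a standard invertible building block (in NumPy, not the InvertibleNetworks.jl API), shows why no activations need to be stored: the input can be recomputed exactly from the output during the backward pass.

import numpy as np

def transform(x, W, b):
    # Arbitrary (non-invertible) network used inside the coupling layer
    return np.tanh(x @ W + b)

def coupling_forward(x1, x2, W, b):
    # Additive coupling: y1 = x1, y2 = x2 + t(x1); invertible for any t
    return x1, x2 + transform(x1, W, b)

def coupling_inverse(y1, y2, W, b):
    # Exact inverse: recompute t(y1) and subtract
    return y1, y2 - transform(y1, W, b)

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal((8, 4)), rng.standard_normal((8, 4))
W, b = rng.standard_normal((4, 4)), rng.standard_normal(4)

y1, y2 = coupling_forward(x1, x2, W, b)
z1, z2 = coupling_inverse(y1, y2, W, b)
print(np.allclose(x1, z1) and np.allclose(x2, z2))   # True: inputs recovered exactly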

SLIM software: I am the main maintainer and manager of our software stack (CI, versioning, Julia registration, …).

Minor contributions

SymPy: Minor pull-request contributions over the years, mostly related to bugs found through Devito.

  • Various minor bug fixes in several Julia packages