PyCSEP provides routines to evaluate both gridded and catalog-based earthquake forecasts. This page explains how to use the forecast evaluation routines and also how to build “mock” forecast and catalog classes to accommodate different custom forecasts and catalogs.

Gridded-forecast evaluations

Grid-based earthquake forecasts assume earthquakes occur in discrete space-time-magnitude bins and their rate-of- occurrence can be defined using a single number in each magnitude bin. Each space-time-magnitude bin is assumed to be an independent Poisson random variable. Therefore, we use likelihood-based evaluation metrics to compare these forecasts against observations.

PyCSEP provides two groups of evaluation metrics for grid-based earthquake forecasts. The first are known as consistency tests and they verify whether a forecast in consistent with an observation. The second are comparative tests that can be used to compare the performance of two (or more) competing forecasts. PyCSEP implements the following evaluation routines for grid-based forecasts. These functions are intended to work with GriddedForecasts and CSEPCatalogs`. Visit the catalogs reference and the forecasts reference to learn more about to import your forecasts and catalogs into PyCSEP.


Grid-based forecast evaluations act directly on the forecasts and catalogs as they are supplied to the function. Any filtering of catalogs and/or scaling of forecasts must be done before calling the function. This must be done before evaluating the forecast and should be done consistently between all forecasts that are being compared.

See the example for gridded forecast evaluation for an end-to-end walkthrough on how to evaluate a gridded earthquake forecast.

Consistency tests

number_test(gridded_forecast, observed_catalog)

Computes "N-Test" on a gridded forecast.

magnitude_test(gridded_forecast, ...[, ...])

Performs the Magnitude Test on a Gridded Forecast using an observed catalog.

spatial_test(gridded_forecast, observed_catalog)

Performs the Spatial Test on the Forecast using the Observed Catalogs.

likelihood_test(gridded_forecast, ...[, ...])

Performs the likelihood test on Gridded Forecast using an Observed Catalog.

conditional_likelihood_test(...[, ...])

Performs the conditional likelihood test on Gridded Forecast using an Observed Catalog.

Comparative tests

paired_t_test(forecast, benchmark_forecast, ...)

Computes the t-test for gridded earthquake forecasts.

w_test(gridded_forecast1, gridded_forecast2, ...)

Calculate the Single Sample Wilcoxon signed-rank test between two gridded forecasts.

Publication references

  1. Number test (Schorlemmer et al., 2007; Zechar et al., 2010)

  2. Magnitude test (Zechar et al., 2010)

  3. Spatial test (Zechar et al., 2010)

  4. Likelihood test (Schorlemmer et al., 2007; Zechar et al., 2010)

  5. Conditional likelihood test (Werner et al., 2011)

  6. Paired t test (Rhoades et al., 2011)

  7. Wilcoxon signed-rank test (Rhoades et al., 2011)

Catalog-based forecast evaluations

Catalog-based forecasts are issued as a family of stochastic event sets (synthetic earthquake catalogs) and can express the full uncertainty of the forecasting model. Additionally, these forecasts retain the inter-event dependencies that are lost when using discrete space-time-magnitude grids. This problem can impact the evaluation performance of time-dependent forecasts like the epidemic type aftershock sequence model (ETAS).

In order to support generative or simulator-based models, we define a suite of consistency tests that compare forecasted distributions against observations without the use of a parametric likelihood function. These evaluations take advantage of the fact that the forecast and the observations are both earthquake catalogs. Therefore, we can compute identical statistics from these catalogs and compare them against one another.

We provide four statistics that probe fundamental aspects of the earthquake forecasts. Please see Savran et al., 2020 for a complete description of the individual tests. For the implementation details please follow the links below and see the example for catalog-based forecast evaluation for an end-to-end walk through.

Consistency tests

number_test(forecast, observed_catalog[, ...])

Performs the number test on a catalog-based forecast.

spatial_test(forecast, observed_catalog[, ...])

Performs spatial test for catalog-based forecasts.

magnitude_test(forecast, observed_catalog[, ...])

Performs magnitude test for catalog-based forecasts

pseudolikelihood_test(forecast, observed_catalog)

Performs the spatial pseudolikelihood test for catalog forecasts.

calibration_test(evaluation_results[, delta_1])

Perform the calibration test by computing a Kilmogorov-Smirnov test of the observed quantiles against a uniform distribution.

Publication reference

  1. Number test (Savran et al., 2020)

  2. Spatial test (Savran et al., 2020)

  3. Magnitude test (Savran et al., 2020)

  4. Pseudolikelihood test (Savran et al., 2020)

  5. Calibration test (Savran et al., 2020)

Preparing evaluation catalog

The evaluations in PyCSEP do not implicitly filter the observed catalogs or modify the forecast data when called. For most cases, the observation catalog should be filtered according to:

  1. Magnitude range of the forecast

  2. Spatial region of the forecast

  3. Start and end-time of the forecast

Once the observed catalog is filtered so it is consistent in space, time, and magnitude as the forecast, it can be used to evaluate a forecast. A single evaluation catalog can be used to evaluate multiple forecasts so long as they all cover the same space, time, and magnitude region.

Building mock classes

Python is a duck-typed language which means that it doesn’t care what the object type is only that it has the methods or functions that are expected when that object is used. This can come in handy if you want to use the evaluation methods, but do not have a forecast that completely fits with the forecast classes (or catalog classes) provided by PyCSEP.


Something about great power and great responsibility… For the most reliable results, write a loader function that can ingest your forecast into the model provided by PyCSEP. Mock-classes can work, but should only be used in certain circumstances. In particular, they are very useful for writing software tests or to prototype features that can be added into the package.

This section will walk you through how to compare two forecasts using the paired_t_test with mock forecast and catalog classes. This sounds much more complex than it really is, and it gives you the flexibility to use your own formats and interact with the tools provided by PyCSEP.


The simulation-based Poisson tests (magnitude_test, likelihood_test, conditional_likelihood_test, and spatial_test) are optimized to work with forecasts that contain equal-sized spatial bins. If your forecast uses variable sized spatial bins you will get incorrect results. If you are working with forecasts that have variable spatial bins, create an issue on GitHub because we’d like to implement this feature into the toolkit and we’d love your help.

If we look at the paired_t_test we see that it has the following code

def paired_t_test(gridded_forecast1, gridded_forecast2, observed_catalog, alpha=0.05, scale=False):
    """ Computes the t-test for gridded earthquake forecasts.

        gridded_forecast_1 (csep.core.forecasts.GriddedForecast): nd-array storing gridded rates, axis=-1 should be the magnitude column
        gridded_forecast_2 (csep.core.forecasts.GriddedForecast): nd-array storing gridded rates, axis=-1 should be the magnitude column
        observed_catalog (csep.core.catalogs.AbstractBaseCatalog): number of observed earthquakes, should be whole number and >= zero.
        alpha (float): tolerance level for the type-i error rate of the statistical test
        scale (bool): if true, scale forecasted rates down to a single day

        evaluation_result: csep.core.evaluations.EvaluationResult

    # needs some pre-processing to put the forecasts in the context that is required for the t-test. this is different
    # for cumulative forecasts (eg, multiple time-horizons) and static file-based forecasts.
    target_event_rate_forecast1, n_fore1 = gridded_forecast1.target_event_rates(observed_catalog, scale=scale)
    target_event_rate_forecast2, n_fore2 = gridded_forecast2.target_event_rates(observed_catalog, scale=scale)

    # call the primative version operating on ndarray
    out = _t_test_ndarray(target_event_rate_forecast1, target_event_rate_forecast2, observed_catalog.event_count, n_fore1, n_fore2,

    # prepare evaluation result object
    result = EvaluationResult()
    result.name = 'Paired T-Test'
    result.test_distribution = (out['ig_lower'], out['ig_upper'])
    result.observed_statistic = out['information_gain']
    result.quantile = (out['t_statistic'], out['t_critical'])
    result.sim_name = (gridded_forecast1.name, gridded_forecast2.name)
    result.obs_name = observed_catalog.name
    result.status = 'normal'
    result.min_mw = numpy.min(gridded_forecast1.magnitudes)

Notice that the function expects two forecast objects and one catalog object. The paired_t_test function calls a method on the forecast objects named target_event_rates that returns a tuple (numpy.ndarray, float) consisting of the target event rates and the expected number of events from the forecast.


The target event rate is the expected rate for an observed event in the observed catalog assuming that the forecast is true. For a simple example, if we forecast a rate of 0.3 events per year in some bin of a forecast, each event that occurs within that bin has a target event rate of 0.3 events per year. The expected number of events in the forecast can be determined by summing over all bins in the gridded forecast.

We can also see that the paired_t_test function uses the gridded_forecast1.name and calls the numpy.min() on the gridded_forecast1.magnitudes. Using this information, we can create a mock-class that implements these methods that can be used by this function.


If you are creating mock-classes to use with evaluation functions, make sure that you visit the corresponding documentation and source-code to make sure that your methods return values that are expected by the function. In this case, it expects the tuple (target_event_rates, expected_forecast_count). This will not always be the case. If you need help, please create an issue on the GitHub page.

Here we show an implementation of a mock forecast class that can work with the paired_t_test function.

class MockForecast:

    def __init__(self, data=None, name='my-mock-forecast', magnitudes=(4.95)):

        # data is not necessary, but might be helpful for implementing target_event_rates(...)
        self.data = data
        self.name = name
        # this should be an array or list. it can be as simple as the default argument.
        self.magnitudes = magnitudes

    def target_event_rates(catalog, scale=None):
        """ Notice we added the dummy argument scale. This function stub should match what is called paired_t_test """

        # whatever custom logic you need to return these target event rates given your catalog can go here
        # of course, this should work with whatever catalog you decide to pass into this function

        # this returns the tuple that paired_t_test expects
        return (ndarray_of_target_event_rates, expected_number_of_events)

You’ll notice that paired_t_test expects a catalog class. Looking back at the function definition we can see that it needs observed_catalog.event_count and observed_catalog.name. Therefore the mock class for the catalog would look something like this

class MockCatalog:

    def __init__(self, event_count, data=None, name='my-mock-catalog'):

        # this is not necessary, but adding data might be helpful for implementing the
        # logic needed for the target_event_rates(...) function in the MockForecast class.
        self.data = data
        self.name = name
        self.event_count = event_count

Now using these two objects you can call the paired_t_test directly without having to modify any of the source code.

# create your forecasts
mock_forecast_1 = MockForecast(some_forecast_data1)
mock_forecast_2 = MockForecast(some_forecast_data2)

# lets assume that catalog_data is an array that contains the catalog data
catalog = MockCatalog(len(catalog_data))

# call the function using your classes
eval_result = paired_t_test(mock_forecast_1, mock_forecast_2, catalog)

The only requirement for this approach is that you implement the methods on the class that the calling function expects. You can add anything else that you need in order to make those functions work properly. This example is about as simple as it gets.


If you want to use mock-forecasts and mock-catalogs for other evaluations. You can just add the additional methods that are needed onto the mock classes you have already built.