Evaluations
PyCSEP provides routines to evaluate both gridded and catalogbased earthquake forecasts. This page explains how to use the forecast evaluation routines and also how to build “mock” forecast and catalog classes to accommodate different custom forecasts and catalogs.
Table of Contents
Griddedforecast evaluations
Gridbased earthquake forecasts assume earthquakes occur in discrete spacetimemagnitude bins and their rateof occurrence can be defined using a single number in each magnitude bin. Each spacetimemagnitude bin is assumed to be an independent Poisson random variable. Therefore, we use likelihoodbased evaluation metrics to compare these forecasts against observations.
PyCSEP provides two groups of evaluation metrics for gridbased earthquake forecasts. The first are known as
consistency tests and they verify whether a forecast in consistent with an observation. The second are comparative tests
that can be used to compare the performance of two (or more) competing forecasts.
PyCSEP implements the following evaluation routines for gridbased forecasts. These functions are intended to work with
GriddedForecasts
and CSEPCatalogs`
.
Visit the catalogs reference and the forecasts reference to learn
more about to import your forecasts and catalogs into PyCSEP.
Note
Gridbased forecast evaluations act directly on the forecasts and catalogs as they are supplied to the function. Any filtering of catalogs and/or scaling of forecasts must be done before calling the function. This must be done before evaluating the forecast and should be done consistently between all forecasts that are being compared.
See the example for gridded forecast evaluation for an endtoend walkthrough on how to evaluate a gridded earthquake forecast.
Consistency tests

Computes "NTest" on a gridded forecast. 

Performs the Magnitude Test on a Gridded Forecast using an observed catalog. 

Performs the Spatial Test on the Forecast using the Observed Catalogs. 

Performs the likelihood test on Gridded Forecast using an Observed Catalog. 

Performs the conditional likelihood test on Gridded Forecast using an Observed Catalog. 
Comparative tests

Computes the ttest for gridded earthquake forecasts. 

Calculate the Single Sample Wilcoxon signedrank test between two gridded forecasts. 
Publication references
Number test (Schorlemmer et al., 2007; Zechar et al., 2010)
Magnitude test (Zechar et al., 2010)
Spatial test (Zechar et al., 2010)
Likelihood test (Schorlemmer et al., 2007; Zechar et al., 2010)
Conditional likelihood test (Werner et al., 2011)
Paired t test (Rhoades et al., 2011)
Wilcoxon signedrank test (Rhoades et al., 2011)
Catalogbased forecast evaluations
Catalogbased forecasts are issued as a family of stochastic event sets (synthetic earthquake catalogs) and can express the full uncertainty of the forecasting model. Additionally, these forecasts retain the interevent dependencies that are lost when using discrete spacetimemagnitude grids. This problem can impact the evaluation performance of timedependent forecasts like the epidemic type aftershock sequence model (ETAS).
In order to support generative or simulatorbased models, we define a suite of consistency tests that compare forecasted distributions against observations without the use of a parametric likelihood function. These evaluations take advantage of the fact that the forecast and the observations are both earthquake catalogs. Therefore, we can compute identical statistics from these catalogs and compare them against one another.
We provide four statistics that probe fundamental aspects of the earthquake forecasts. Please see Savran et al., 2020 for a complete description of the individual tests. For the implementation details please follow the links below and see the example for catalogbased forecast evaluation for an endtoend walk through.
Consistency tests

Performs the number test on a catalogbased forecast. 

Performs spatial test for catalogbased forecasts. 

Performs magnitude test for catalogbased forecasts 

Performs the spatial pseudolikelihood test for catalog forecasts. 

Perform the calibration test by computing a KilmogorovSmirnov test of the observed quantiles against a uniform distribution. 
Publication reference
Number test (Savran et al., 2020)
Spatial test (Savran et al., 2020)
Magnitude test (Savran et al., 2020)
Pseudolikelihood test (Savran et al., 2020)
Calibration test (Savran et al., 2020)
Preparing evaluation catalog
The evaluations in PyCSEP do not implicitly filter the observed catalogs or modify the forecast data when called. For most cases, the observation catalog should be filtered according to:
Magnitude range of the forecast
Spatial region of the forecast
Start and endtime of the forecast
Once the observed catalog is filtered so it is consistent in space, time, and magnitude as the forecast, it can be used to evaluate a forecast. A single evaluation catalog can be used to evaluate multiple forecasts so long as they all cover the same space, time, and magnitude region.
Building mock classes
Python is a ducktyped language which means that it doesn’t care what the object type is only that it has the methods or functions that are expected when that object is used. This can come in handy if you want to use the evaluation methods, but do not have a forecast that completely fits with the forecast classes (or catalog classes) provided by PyCSEP.
Note
Something about great power and great responsibility… For the most reliable results, write a loader function that can ingest your forecast into the model provided by PyCSEP. Mockclasses can work, but should only be used in certain circumstances. In particular, they are very useful for writing software tests or to prototype features that can be added into the package.
This section will walk you through how to compare two forecasts using the paired_t_test
with mock forecast and catalog classes. This sounds much more complex than it really is, and it gives you the flexibility
to use your own formats and interact with the tools provided by PyCSEP.
Warning
The simulationbased Poisson tests (magnitude_test, likelihood_test, conditional_likelihood_test, and spatial_test) are optimized to work with forecasts that contain equalsized spatial bins. If your forecast uses variable sized spatial bins you will get incorrect results. If you are working with forecasts that have variable spatial bins, create an issue on GitHub because we’d like to implement this feature into the toolkit and we’d love your help.
If we look at the paired_t_test
we see that it has the following code
def paired_t_test(gridded_forecast1, gridded_forecast2, observed_catalog, alpha=0.05, scale=False):
""" Computes the ttest for gridded earthquake forecasts.
Args:
gridded_forecast_1 (csep.core.forecasts.GriddedForecast): ndarray storing gridded rates, axis=1 should be the magnitude column
gridded_forecast_2 (csep.core.forecasts.GriddedForecast): ndarray storing gridded rates, axis=1 should be the magnitude column
observed_catalog (csep.core.catalogs.AbstractBaseCatalog): number of observed earthquakes, should be whole number and >= zero.
alpha (float): tolerance level for the typei error rate of the statistical test
scale (bool): if true, scale forecasted rates down to a single day
Returns:
evaluation_result: csep.core.evaluations.EvaluationResult
"""
# needs some preprocessing to put the forecasts in the context that is required for the ttest. this is different
# for cumulative forecasts (eg, multiple timehorizons) and static filebased forecasts.
target_event_rate_forecast1, n_fore1 = gridded_forecast1.target_event_rates(observed_catalog, scale=scale)
target_event_rate_forecast2, n_fore2 = gridded_forecast2.target_event_rates(observed_catalog, scale=scale)
# call the primative version operating on ndarray
out = _t_test_ndarray(target_event_rate_forecast1, target_event_rate_forecast2, observed_catalog.event_count, n_fore1, n_fore2,
alpha=alpha)
# prepare evaluation result object
result = EvaluationResult()
result.name = 'Paired TTest'
result.test_distribution = (out['ig_lower'], out['ig_upper'])
result.observed_statistic = out['information_gain']
result.quantile = (out['t_statistic'], out['t_critical'])
result.sim_name = (gridded_forecast1.name, gridded_forecast2.name)
result.obs_name = observed_catalog.name
result.status = 'normal'
result.min_mw = numpy.min(gridded_forecast1.magnitudes)
Notice that the function expects two forecast objects and one catalog object. The paired_t_test
function calls a
method on the forecast objects named target_event_rates
that returns a tuple (numpy.ndarray
, float) consisting of the target event rates and the expected number of events
from the forecast.
Note
The target event rate is the expected rate for an observed event in the observed catalog assuming that the forecast is true. For a simple example, if we forecast a rate of 0.3 events per year in some bin of a forecast, each event that occurs within that bin has a target event rate of 0.3 events per year. The expected number of events in the forecast can be determined by summing over all bins in the gridded forecast.
We can also see that the paired_t_test
function uses the gridded_forecast1.name
and calls the numpy.min()
on the gridded_forecast1.magnitudes
. Using this information, we can create a mockclass that implements these methods
that can be used by this function.
Warning
If you are creating mockclasses to use with evaluation functions, make sure that you visit the corresponding documentation and sourcecode to make sure that your methods return values that are expected by the function. In this case, it expects the tuple (target_event_rates, expected_forecast_count). This will not always be the case. If you need help, please create an issue on the GitHub page.
Here we show an implementation of a mock forecast class that can work with the
paired_t_test
function.
class MockForecast:
def __init__(self, data=None, name='mymockforecast', magnitudes=(4.95)):
# data is not necessary, but might be helpful for implementing target_event_rates(...)
self.data = data
self.name = name
# this should be an array or list. it can be as simple as the default argument.
self.magnitudes = magnitudes
def target_event_rates(catalog, scale=None):
""" Notice we added the dummy argument scale. This function stub should match what is called paired_t_test """
# whatever custom logic you need to return these target event rates given your catalog can go here
# of course, this should work with whatever catalog you decide to pass into this function
# this returns the tuple that paired_t_test expects
return (ndarray_of_target_event_rates, expected_number_of_events)
You’ll notice that paired_t_test
expects a catalog class. Looking back
at the function definition we can see that it needs observed_catalog.event_count
and observed_catalog.name
. Therefore
the mock class for the catalog would look something like this
class MockCatalog:
def __init__(self, event_count, data=None, name='mymockcatalog'):
# this is not necessary, but adding data might be helpful for implementing the
# logic needed for the target_event_rates(...) function in the MockForecast class.
self.data = data
self.name = name
self.event_count = event_count
Now using these two objects you can call the paired_t_test
directly
without having to modify any of the source code.
# create your forecasts
mock_forecast_1 = MockForecast(some_forecast_data1)
mock_forecast_2 = MockForecast(some_forecast_data2)
# lets assume that catalog_data is an array that contains the catalog data
catalog = MockCatalog(len(catalog_data))
# call the function using your classes
eval_result = paired_t_test(mock_forecast_1, mock_forecast_2, catalog)
The only requirement for this approach is that you implement the methods on the class that the calling function expects. You can add anything else that you need in order to make those functions work properly. This example is about as simple as it gets.
Note
If you want to use mockforecasts and mockcatalogs for other evaluations. You can just add the additional methods that are needed onto the mock classes you have already built.