As part of re-architecting Specify into a Java platform, we sought to improve Specify's handling of paleontological data. The Specify 5.x information model for Paleo collections was incomplete, and clients told us about additional requirements they had to accommodate their task workflows. An important database issue for Paleo users is the appropriate modeling and handling of stratigraphic information, in particular biostratigraphy, chronostratigraphy, and lithostratigraphy. Paleo collections database users conceptualize stratigraphy as another index upon which to query, classify, and report on their collection holdings. As Paleo researchers know, this dimension is complex and dynamic. Stratigraphies, like taxonomic classifications, are tree-structured data, and like biological taxonomies, stratigraphic classifications vary through time as different authorities update reference standards on regular or irregular schedules. Stratigraphic usage also varies through space, because different countries use different classifications.
The discussion below roughly follows the intellectual path we took to derive a new data model for handling paleontological Localities and associated stratigraphic data. In our exploration, assisted by colleagues, we noted that paleo Localities are not the same kind of concept as 'Collecting Localities'; they are not just points or polygons with x,y coordinate descriptors describing a space where something was found or where something happened. More appropriately thought of as 'Paleo Contexts', they combine spatial and temporal dimensions with an element of ordinal position and rank (the stratigraphies). The bottom line for the Specify 6 data model is that we will treat Paleo Contexts and stratigraphic information as direct relations of Collection Objects. In earlier models we considered these paleo attributes to be indirectly related attributes of Collection Objects, linked through a Collecting Event or through a Collection Locality. That older approach would not allow us to unambiguously record the geological properties of individual Collection Objects, given our standard definitions of Collecting Events and Collection Localities.
The following discussion walks through the data modeling issues for paleontological 'locality' data, which led us in the end to a new arrangement of data table relationships. We begin with the limitations of the Specify 5.x model.
In Specify 5.x, our Paleo Collection model consists of Collection Object, Collecting Event, (Collecting) Locality, Stratigraphy, and Geologic Time Period data objects (see Figure 1). Here is a summary of those data objects:
| Data Object | Description | Relationship |
| --- | --- | --- |
| Collection Object | Represents a collected specimen or core | Collection Object is related to Collecting Event as Many:One |
| Collecting Event | Represents a person or group performing a collecting 'action', such as taking a core sample or digging and retrieving a specimen | A Collecting Event refers to a Locality and a Stratigraphy |
| Locality | Represents a place, with a description of the Locality and an optional latitude and longitude | |
| Stratigraphy | Represents a Lithostratigraphy. Contains these fields: superGroup, lithoGroup, formation, member, bed | Refers to a Geologic Time Period |
| Geologic Time Period | Represents time (ChronoStratigraphy) and has these fields: rankId, name, fullName, standard, startPeriod, startUncertainty, endPeriod, endUncertainty | |
But as discussed below, the current data model is inadequate. For example, it doesn't support sampling with cores, where a collector needs to capture multiple 'pieces' of time per Locality and Collecting Event. It also ties the LithoStratigraphy (Stratigraphy) to the ChronoStratigraphy (Geologic Time Period); see Figure 1.
Specify 5.x defines Locality as a modern-day place on the earth where something biological was collected; it is usually characterized by latitude and longitude, or UTM coordinates. Specify 5 also accommodates depth and height as a third dimension, but for most current uses of a specimen's Locality (querying on it, etc.) only x,y coordinates on the earth's surface are important. The 5.x Locality is just that and only that: the Locality where a collection was taken in a Collecting Event. It is not defined by any other broader or orthogonal concept. It has a very simple, practical function: to identify the place on earth where a collection object came from. Figure 1 illustrates the current Specify 5.x relationships among Collection Object, Collecting Event, Locality, Stratigraphy, and Geologic Time Period.

In words: one or more collection objects can be taken during the course of a Collecting Event, from a single Locality. In our database usage, if the Locality changes, by definition, a new collecting event is created. A given Collecting Event can have only one Locality, which also means that a Collection Object can have only one Locality in the database. This model has worked for several years for neontological collections. For disciplines like botany, which do not recognize the importance or even existence of the Collecting Event concept in their work, we simply minimize or remove Collecting Event from the User Interface, but the relationship is maintained in the database without the knowledge of the user.
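The chain of Many:One relationships described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Specify's actual schema or code: the class names follow the data objects in the text, but the field names, the helper `locality_of`, and the sample values are hypothetical, and Stratigraphy and Geologic Time Period are left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Locality:
    # Hypothetical fields: a place name plus optional coordinates.
    name: str
    latitude: Optional[float] = None
    longitude: Optional[float] = None


@dataclass(frozen=True)
class CollectingEvent:
    collector: str
    locality: Locality  # M:1, exactly one Locality per Collecting Event


@dataclass(frozen=True)
class CollectionObject:
    catalog_number: str
    collecting_event: CollectingEvent  # M:1, one event per object


def locality_of(obj: CollectionObject) -> Locality:
    """A Collection Object reaches its Locality only through its event."""
    return obj.collecting_event.locality


# Made-up sample data: three objects taken in one Collecting Event.
site = Locality("Example creek bank", 38.95, -95.25)
event = CollectingEvent("A. Collector", site)
specimens = [CollectionObject(f"CO-{n}", event) for n in (1, 2, 3)]

# Every object from the event resolves to the same single Locality.
same_place = all(locality_of(s) is site for s in specimens)
```

Because the only path from a Collection Object to a Locality runs through its Collecting Event, every object from one event necessarily shares one Locality, which is the constraint the rest of the discussion runs into.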
Paleo, however, has additional 'Localization' needs.

As a first attempt, we add a relationship between Locality and Paleo Context as M:1, because by definition a Locality can have only one Paleo Context (a core slice), while a Paleo Context may have one or more Localities in the database; for example, a collection may have multiple Localities from the Cretaceous.
But that does not work, because the 5.x model does not accommodate multiple Localities for a Collecting Event, and we need that if we are going to define Locality with Paleo Context attributes. Each slice of the core would have a different Paleo Context and a correspondingly different Locality, as Locality and Paleo Context are related M:1.
This problem can be solved if we change the Specify 5 model to make the relationship between Collecting Event and Locality Many:Many, as in Figure 3.

Doing that would mean that Specify databases for all disciplines would have an additional join table between Collecting Event and Locality. The relationship between the Collecting Event table and the Locality table is at the core of most database transactions within Specify. Most database queries use this relationship, and changing it from M:1 to M:M adds another join between the tables, plus another table as a 'join table' to make the M:M relationship work. Adding a join table would have significant negative performance consequences for all Specify users for most operations, but that is not the most serious problem. The biggest problem is that there is then no way, with a set operation, to determine which Collection Object came from which Paleo Context for a particular Collecting Event. The M:1 between Collection Object and Collecting Event means that there is only one Collecting Event record for all of the Collection Objects from that event, so there is no way to point individual Collection Objects to particular Locality records (and thus to particular Paleo Context records). Collection Objects from a particular slice of the core cannot be linked to a particular Paleo Context from that core. This solution solves one problem, but it creates a performance issue and a second modeling problem that makes it unworkable.
One way to fix the modeling problem, as shown in Figure 4, would be to make the relationship between Collection Object and Collecting Event Many:Many.
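The ambiguity can be reproduced with a small in-memory SQLite sketch. The table and column names below are hypothetical stand-ins for the data objects in the text, not the real Specify schema, and the two core slices and two Collection Objects are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE paleo_context (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE locality (
    id INTEGER PRIMARY KEY,
    name TEXT,
    paleo_context_id INTEGER REFERENCES paleo_context(id));  -- M:1
CREATE TABLE collecting_event (id INTEGER PRIMARY KEY, name TEXT);
-- Join table that makes Collecting Event : Locality many-to-many.
CREATE TABLE collecting_event_locality (
    collecting_event_id INTEGER REFERENCES collecting_event(id),
    locality_id INTEGER REFERENCES locality(id));
CREATE TABLE collection_object (
    id INTEGER PRIMARY KEY,
    catalog_number TEXT,
    collecting_event_id INTEGER REFERENCES collecting_event(id));  -- M:1

INSERT INTO paleo_context VALUES (1, 'slice A'), (2, 'slice B');
INSERT INTO locality VALUES
    (1, 'core site, slice A', 1), (2, 'core site, slice B', 2);
INSERT INTO collecting_event VALUES (1, 'core extraction');
INSERT INTO collecting_event_locality VALUES (1, 1), (1, 2);
INSERT INTO collection_object VALUES (1, 'CO-1', 1), (2, 'CO-2', 1);
""")

# Try to find each object's Paleo Context by following the joins.
rows = conn.execute("""
SELECT co.catalog_number, pc.name
FROM collection_object co
JOIN collecting_event_locality cel
  ON cel.collecting_event_id = co.collecting_event_id
JOIN locality l ON l.id = cel.locality_id
JOIN paleo_context pc ON pc.id = l.paleo_context_id
ORDER BY co.catalog_number, pc.name
""").fetchall()

for catalog_number, context in rows:
    print(catalog_number, context)
```

Each Collection Object is paired with both slices, because nothing in this arrangement records which slice a given object came from.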

But then we need to add another join table between Collection Object and Collecting Event. Empirically, when we tested this on a neontological (Entomology) database, performance degraded by more than 30% compared with keeping Collection Object:Collecting Event as M:1. We would make all disciplines pay a steep performance price to accommodate this model, for what is probably the most important relationship in the database in terms of core functions.
Another way to solve this problem would be to make a new relationship (Figure 5) between Collection Object and Paleo Context, so that each Collection Object would have a Paleo Context explicitly assigned as an M:1 relationship. A Collection Object would have one Paleo Context, but a particular Paleo Context could describe multiple Collection Objects in the collection. This presents many challenges when thinking about the query execution paths and joins that would be needed to put a UI on it, not to mention the performance issues, again.

All of the modeling options above brought us back to the starting point: our base information model. We asked: Is Paleo Context actually related to Specify's concept of (Collecting) Locality? Is it a type of Locality? A refinement of it? An attribute of Locality?
For the following reasons, we decided that Paleo Context is not directly related to Locality.
Paleo Context has attributes that are unique to Paleo collections; it incorporates the time dimension as part of its definition of space. A Paleo Locality in the Specify 5.x sense is not simply the x,y coordinate on the surface of the earth where the specimen came from. It is defined by the geological context, which may be time based, or biologically based, or geologically based, or some or all of those strata, or even others (Chemostrata, etc.). Labeling that concept 'Paleo Context' helps to disambiguate it from 'Locality' when thinking about Paleo Collecting Events and Localities and the geological dimension.
If we don't attach the Paleo Context to Locality in the Specify model, then where do we put it? As a table linked to Collecting Event or to Collection Object? Or possibly as attributes of Collecting Event or Collection Object?
The case for linking Paleo Context to Collecting Event (Figure 6):

Pro: This fits the scientific conceptualization that a Collecting Event can generate multiple Collection Objects all from the same Paleo Context, and that a particular Paleo Context could have multiple Collecting Events (in the collection).
Con: This breaks the core sampling use case, where one has one Collecting Event producing many Collection Objects (slices of the core or fossils from different slices of the core), and each Collection Object has its own Paleo Context. Each Collecting Event will have multiple Paleo Contexts. But we cannot support that with this model, as Collecting Event:Paleo Context is M:1. If we left it like this, in practice, a new Collecting Event record would need to be created for each unique combination of Collection Object and Paleo Context, which redefines Collecting Event.
We could solve that problem by making Collecting Event:Paleo Context an M:M relationship, but then we would need to redefine Collecting Event, not as one event of extracting the entire core, but as the process of taking each slice. Collecting Event would become a join table for Paleo Context and Collection Object.
The case for linking Paleo Context to Collection Object (Figure 7):

Pro: The core sampling use case works. Each Collection Object from a core slice, or from a quarry dig, has its own Paleo Context, allowing for one Collecting Event and one Locality for the entire core or quarry dig. For example: we went to Jones quarry, found three strata, and collected two fossils from each stratum, resulting in six Collection Objects, one Locality, one Collecting Event, and three Paleo Contexts.
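The Jones quarry example can be checked with a small in-memory SQLite sketch in which Collection Object points directly at Paleo Context. The table and column names are hypothetical, not the Specify 6 schema, and the stratum names and catalog numbers are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE locality (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE collecting_event (
    id INTEGER PRIMARY KEY,
    locality_id INTEGER REFERENCES locality(id));            -- M:1
CREATE TABLE paleo_context (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE collection_object (
    id INTEGER PRIMARY KEY,
    catalog_number TEXT,
    collecting_event_id INTEGER REFERENCES collecting_event(id),  -- M:1
    paleo_context_id INTEGER REFERENCES paleo_context(id));       -- M:1

INSERT INTO locality VALUES (1, 'Jones quarry');
INSERT INTO collecting_event VALUES (1, 1);
INSERT INTO paleo_context VALUES
    (1, 'stratum 1'), (2, 'stratum 2'), (3, 'stratum 3');
""")

# Two fossils per stratum: objects 1-2 -> stratum 1, 3-4 -> 2, 5-6 -> 3.
conn.executemany(
    "INSERT INTO collection_object VALUES (?, ?, 1, ?)",
    [(n, f"CO-{n}", (n + 1) // 2) for n in range(1, 7)],
)

# One set operation now answers "which objects came from which context".
per_context = conn.execute("""
SELECT pc.name, COUNT(*)
FROM collection_object co
JOIN paleo_context pc ON pc.id = co.paleo_context_id
GROUP BY pc.id, pc.name
ORDER BY pc.id
""").fetchall()

# Objects, Collecting Events, Localities, Paleo Contexts.
totals = conn.execute("""
SELECT COUNT(*),
       COUNT(DISTINCT co.collecting_event_id),
       COUNT(DISTINCT ce.locality_id),
       COUNT(DISTINCT co.paleo_context_id)
FROM collection_object co
JOIN collecting_event ce ON ce.id = co.collecting_event_id
""").fetchone()

print(per_context)
print(totals)
```

No join table is needed, and the Collection Object to Collecting Event to Locality path is left exactly as it is for other disciplines.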
One could make the case that the Paleo Context data are simply attributes of Collection Objects and forget about Paleo Context as a table. This would be a denormalized model, as a Paleo Context could have multiple Collection Objects, and Paleo Context data would likely be unnecessarily duplicated. Also, all neontological Specify collections would have those fields in the Collection Object table, which would carry some performance hit; more importantly, the fields are not useful for those collections. For those reasons, a separate table seems optimal.
Con: We can't think of any. Paleo Context is conceptualized as a time (chronostrat), substrate (lithostrat), or biological association (biostrat) dimension, and not a 'space' dimension. All types of stratigraphy are relative indexes of time or of association. ChronoStrat is obvious. LithoStrat does not describe where the stratum physically lies relative to the surface, as depth and canopy height would (those are spatial dimensions); instead, Litho is a 'space' defined by a geological order dimension. A stratum does occupy space at each site, but a descriptor such as 'Devonian shale' characterizes a time dimension and not a physical space dimension, although the stratum does have spatial attributes. In other words, 'Devonian shale' is not a property value of a spatial dimension. The same holds for Biostratigraphy, which deals with biological associations, not Localities in a spatial sense; for example, a biostrat labeled 'Trilobite level' is not describing a spatial dimension.
In conclusion, we decided that Paleo Context does not have a direct relationship with Locality in Specify's traditional definition of a (collecting) Locality as a place on earth where something was collected. Paleo Context and Locality are two very different things. If we try to keep them artificially directly linked, major data model problems emerge. By linking Paleo Context as a separate table to Collection Object, every Paleo field collection use case we know of can be supported, and only Paleo database users pay the price for any performance hit, as there would be no costly additional M:M join tables among Collection Object, Collecting Event, and Locality.
There is an interesting use case where the spatial description of the collecting Locality becomes congruent with the Paleo Context. When researchers identify the original location of a paleo organism on the earth's surface, based on the paleo period during which the organism lived and on models of the movement of land masses, the Locality for a Paleo Collection Object merges with its Paleo Context. Simulation data from those projections of the global position of localities in geological time are not currently stored in our schema, but they could be added if it becomes useful for Paleontological collections to cache modeled georeferenced localities of the past.
Figure 8 shows the Specify 6 data model with the new and rearranged data objects.

Figure 8. Specify 6 Proposed Paleo Context and Stratigraphy Data Model
We added a 'parent pointer' to Locality. This provides a way to subdivide Localities with little overhead or impact on other disciplines. The recursive parent/child pointer enables Locality/sub-Locality relationships and allows needed flexibility for other field collection localization methods. For example, with sub-Localities we can accommodate precise locality descriptions from points along transects, or from field site grids and subgrid units.
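As a rough illustration of the parent pointer, the sketch below models a Locality whose optional `parent` refers to another Locality. The class, the `parent` field name, the `path` helper, and the field site, grid, and subgrid names are all hypothetical; the sketch only shows how a recursive pointer yields Locality/sub-Locality chains.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Locality:
    name: str
    parent: Optional[Locality] = None  # recursive parent pointer

    def path(self) -> list[str]:
        """Names from the top-level Locality down to this one."""
        names = []
        node: Optional[Locality] = self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return list(reversed(names))


# A field site subdivided into a grid and a subgrid unit (made-up names).
site = Locality("Field site")
grid = Locality("Grid 3", parent=site)
subgrid = Locality("Subgrid 3B", parent=grid)

print(" > ".join(subgrid.path()))
```

Localities that are never subdivided simply leave the pointer empty, which is why the addition has little impact on other disciplines.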
Most non-Paleo collections take a very Collection Object-centric view of the data. This approach is limiting for Paleo collections, which would rather view the information from a Collecting Event or Stratigraphic perspective. Figure 9 shows a Collecting Event view of the new data model:
Figure 9. Prototype Data Form for Paleo Collecting Events