Description and discussion of ESSC data-tree structure

Purpose and Structure of ESSC data tree

To accommodate the wide range of scales and geographic coverages of the geographically referenced data sets required by various ESSC investigators, the ESSC database is structured as a tree, with lower levels at correspondingly finer scales. Each level corresponds to geographic regions in a given approximate size range. Very roughly, these levels are:

Global;
Continents, oceans, or subdivisions thereof with dimensions greater than about 5000 km;
Medium-sized countries, regions of large countries, groups of small countries, seas, and other regions with dimensions on the order of 1000 km to 5000 km;
States, provinces, small countries, or other regions with dimensions on the order of 200 km to 1000 km;
1 degree by 1 degree squares or other regions with dimensions on the order of 50 km to 200 km;
7.5-minute map quads or other regions with dimensions on the order of 10 km to 50 km;
Regions with dimensions less than about 10 km.
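
The size ranges in the list above can be read as a simple lookup from a region's approximate dimension to a tree level. The following is a minimal sketch, not part of the ESSC software: the function name, the level labels, and the treatment of values that fall exactly on a boundary are all assumptions, since the ranges are only rough. The global level is omitted because it is not defined by a size range.

```python
from bisect import bisect_right

# Approximate boundaries between levels, in km, taken from the list above.
BOUNDS_KM = [10, 50, 200, 1000, 5000]

# Hypothetical labels for the six regional levels, finest first.
LEVEL_NAMES = [
    "local (< 10 km)",
    "7.5-minute quad (10-50 km)",
    "1-degree square (50-200 km)",
    "state or province (200-1000 km)",
    "country or sea (1000-5000 km)",
    "continent or ocean (> 5000 km)",
]


def level_for_dimension(km):
    """Return the label of the level whose size range contains km.

    A value exactly on a boundary is assigned to the coarser level;
    this choice is an assumption made for the sketch.
    """
    return LEVEL_NAMES[bisect_right(BOUNDS_KM, km)]
```

For example, a region roughly 100 km across would be placed at the 1-degree-square level, and one roughly 500 km across at the state or province level.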

At each level, there are two types of subdirectories: subdirectories containing actual data at the scale appropriate to that level, and subdirectories containing information for regions at the next finer scale.

The subdirectory containing the actual data for each region is further subdivided by type of data, etc., and these sub-subdirectories in turn contain a separate directory for each data set. This permits having documentation files co-located with the data, accommodates data formats which require multiple files for a single data set (e.g., Arc/Info and LAS), and permits storing frequently used data sets in more than one format.

Because the Unix pathnames to reach dataset directories are often long, especially at the deeper levels of the geographic tree, each dataset directory also contains a "dataset reference code" (DSRCODE), which is displayed immediately below the dataset name in the hypertext listing of the directory contents. This code can be used to simplify accessing the dataset directory and retrieving data from it.

To accommodate different names for overlapping regions (e.g., the Susquehanna River Basin includes parts of NY, PA, and MD), many directories contain pointers (logical links) to other directories on the same or next lower level. For example, the "susq" directory contains pointers to all 1-degree-square subdirectories of the three states listed above which cover all or part of the basin, and the directory for each of these states contains a pointer to the "susq" directory.

In addition to the actual data and documentation, the top-level directory contains several files (README, NEWS, and FORMATS) and directories (notes/ and projections/) containing explanatory and supplementary information about the database. Each subregion also contains a file defining its geographic extent to facilitate database searches.

Data Set Documentation

At present, most datasets are accompanied only by a temporary documentation file, named "doc", containing a brief description of the data set, including its source, resolution, and units of measurement. Eventually, we hope to provide full documentation (metadata) for each dataset, in conformance with the content standards for spatial metadata developed by a working group representing a number of U.S. Government agencies.

Since the complete documentation for one dataset can run to many pages, but contains many sections which are identical for all datasets of a given type or for a given region, the database contains provision for splitting the metadata into the three categories described below.

Generic: Records or fields defined by the metadata standard which are the same for all datasets of a given type (e.g., all 30m DEMs). This information will be collected together in files in a directory at the top of the database tree.
Collective: Records or fields defined by the metadata standard which are the same for all datasets in a group (e.g., all model-output files for a given experiment). This information will be collected into a cdoc/ subdirectory of whatever region contains all the datasets in the group.
Dataset specific: This information will be contained in a metadata file located in the subdirectory containing the dataset to which it applies.

The intention is that the metadata file for each dataset will contain pointers to the relevant fields and records in the collective and generic documentation files. The software for displaying dataset documentation would then use these pointers to assemble a complete documentation file for display. This software has not yet been implemented.
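
Since the display software does not yet exist, the following is only a sketch of how the three metadata categories might be combined. Everything in it is an assumption made for illustration: the pointer fields "generic_ref" and "collective_ref", the use of in-memory dictionaries in place of files, and the rule that more specific records override more general ones.

```python
def assemble_metadata(specific, collective_docs, generic_docs):
    """Merge generic, collective, and dataset-specific records.

    specific        -- dict of dataset-specific fields; may hold the
                       hypothetical pointer fields "generic_ref" and
                       "collective_ref"
    collective_docs -- dict mapping a group name to its shared fields
    generic_docs    -- dict mapping a data-type name to its shared fields
    """
    merged = {}

    # Start with the most general records.
    generic_ref = specific.get("generic_ref")
    if generic_ref is not None:
        merged.update(generic_docs[generic_ref])

    # Group-level records override generic ones (assumed precedence).
    collective_ref = specific.get("collective_ref")
    if collective_ref is not None:
        merged.update(collective_docs[collective_ref])

    # Dataset-specific fields override everything; pointers are dropped.
    for field, value in specific.items():
        if field not in ("generic_ref", "collective_ref"):
            merged[field] = value
    return merged


# Illustrative records only; none of these names come from the database.
generic_docs = {"dem_30m": {"data_type": "DEM", "resolution": "30 m"}}
collective_docs = {"example_group": {"region": "example region"}}
specific = {
    "generic_ref": "dem_30m",
    "collective_ref": "example_group",
    "title": "example dataset",
}
full = assemble_metadata(specific, collective_docs, generic_docs)
```

Here "full" holds the four fields data_type, resolution, region, and title, which is the complete record a display program would show.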

Support for Database Searches

Users need to be able to search for datasets using a variety of criteria, singly or in combination, such as data type, geographic location, date, and data resolution. Ultimately, this could be facilitated by copying selected fields of the metadata for each dataset into a consolidated catalog file, which could then be rapidly scanned for datasets matching user-specified search criteria. Implementation of such a facility, however, must await creation of the metadata files for all datasets.
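
A scan of such a consolidated catalog might look like the sketch below. No catalog exists yet, so the record layout, the field names, and the example paths are all hypothetical; the point is only to show criteria being applied singly or in combination.

```python
def search_catalog(catalog, **criteria):
    """Return the catalog records that satisfy every criterion.

    Each criterion is either a plain value, matched by equality, or a
    one-argument function that returns True for acceptable values.
    A record lacking a requested field never matches.
    """
    def matches(record):
        for field, wanted in criteria.items():
            if field not in record:
                return False
            value = record[field]
            if callable(wanted):
                if not wanted(value):
                    return False
            elif value != wanted:
                return False
        return True

    return [record for record in catalog if matches(record)]


# Hypothetical catalog entries; field names and values are illustrative.
catalog = [
    {"path": "pa/data/dem/example_a", "data_type": "dem", "resolution_m": 30},
    {"path": "pa/data/dem/example_b", "data_type": "dem", "resolution_m": 90},
    {"path": "pa/data/landuse/example_c", "data_type": "landuse", "resolution_m": 30},
]

# Combine two criteria: data type and a maximum resolution of 30 m.
fine_dems = search_catalog(
    catalog, data_type="dem", resolution_m=lambda r: r <= 30
)
```

A linear scan like this is adequate for a catalog small enough to read into memory; a larger catalog would call for an index, which is beyond this sketch.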

Since two of the most important search criteria are geographic location and data type, the database has been designed to facilitate searches using these criteria. Each subdirectory contains a box file which gives, in degrees and decimal fractions, approximate values of the western- and eastern-most longitudes and the northern- and southern-most latitudes of any point in the subregion. Although individual datasets may not occupy the entire box, and the box enclosing an irregularly shaped region such as a state may contain extensive portions of adjacent regions, the box files do provide a rapid means for screening out most datasets which do not contain a specified point or set of points.
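
The screening step described above amounts to a point-in-rectangle test. The sketch below assumes a hypothetical box-file layout, four whitespace-separated numbers in the order west, east, north, south, because the actual layout is not described here. It also assumes that west longitudes and south latitudes are negative and that no box crosses the 180-degree meridian.

```python
from typing import NamedTuple


class Box(NamedTuple):
    west: float   # western-most longitude, degrees
    east: float   # eastern-most longitude, degrees
    north: float  # northern-most latitude, degrees
    south: float  # southern-most latitude, degrees


def parse_box(text):
    """Parse the assumed 'west east north south' layout into a Box."""
    west, east, north, south = (float(value) for value in text.split())
    return Box(west, east, north, south)


def box_contains(box, lon, lat):
    """Return True if the point lies inside or on the edge of the box."""
    return box.west <= lon <= box.east and box.south <= lat <= box.north


# Illustrative values only, roughly the extent of a mid-Atlantic state.
example_box = parse_box("-80.5 -74.7 42.3 39.7")
inside = box_contains(example_box, -77.9, 40.8)
outside = box_contains(example_box, -70.0, 40.8)
```

A True result means only that the subregion cannot be ruled out; as noted above, the point may still lie in an adjacent region that shares the box.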

To facilitate searches for a given data type, an attempt has been made to use a standardized set of names for generic data types. Standardization of names for subtypes is more difficult, so that the user may still need to check the dataset documentation file(s) to determine whether a dataset returned by a database search does in fact contain the desired type of data.

Symbolic Links

Symbolic links provide a mechanism for referring to the same data using more than one pathname. They are used for several purposes:

When a directory or data set at one level spans the region covered by more than one directory at the next higher level, the actual data will be entered under only one directory, and pointed to by links in the other directories. For example, the Wilmington West (N39W075) 1-degree square covers parts of Delaware, Maryland, New Jersey, and Pennsylvania; the actual data are entered under only one state, with symbolic links pointing to them from the directories for the other three states.
When a project focuses on a region which does not coincide with any region covered by a "standard" directory (for example, the Susquehanna River Basin covers three states), a separate directory may be created for the project which will contain both actual data, for data sets which extend over the entire region, and symbolic links to the directories at the next lower level which cover any part of the region. These lower level directories may in turn contain pointers back to the directory for the project region.
It is generally desirable to use fairly short names for directories, and more fully descriptive names for data sets. Symbolic links may be used to provide alternate, more descriptive names for directories and shorter, abbreviated names for data sets. For example, the "official" name of 1-degree square regions in the northern half of the western hemisphere incorporates the latitude and longitude of the southeast corner of the square; in general, symbolic links have been created to permit accessing the directory by the name of the corresponding 1:250000 map sheet.
Symbolic links are also used to implement the simplified Unix pathnames for datasets using their dataset reference codes.
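
The first use listed above, one real directory shared by several parent regions, can be sketched with the standard os.symlink call. The function name and the directory names "pa", "de", "md", and "nj" are assumptions made for illustration, as is the choice of which state holds the real data; the region name N39W075 is the one used in the example above.

```python
import os


def link_shared_region(root, owner, others, region):
    """Create region under owner and link to it from each other parent.

    root   -- top of the (hypothetical) tree
    owner  -- parent directory that holds the real data
    others -- parent directories that receive symbolic links
    region -- name of the shared subdirectory
    Returns the path of the real directory.
    """
    target = os.path.join(root, owner, region)
    os.makedirs(target, exist_ok=True)
    for other in others:
        parent = os.path.join(root, other)
        os.makedirs(parent, exist_ok=True)
        # A relative link keeps working if the whole tree is moved.
        os.symlink(os.path.relpath(target, parent),
                   os.path.join(parent, region))
    return target
```

Calling link_shared_region(root, "pa", ["de", "md", "nj"], "N39W075") would leave the data under one state and make the same files reachable through the other three directories.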

Last change: 1 Oct 1996, R. A. White / raw@essc.psu.edu