Purpose and Structure of the ESSC Data Tree
To accommodate the wide range of scales and geographic coverages of the
geographically referenced data sets required by various ESSC
investigators, the ESSC database is structured as a tree, with lower
levels being at correspondingly finer scales. Each level corresponds to
geographic regions in a given, approximate, size range. Very roughly,
these levels are
- Global;
- Continents, oceans, or subdivisions thereof with dimensions
greater than about 5000 km;
- Medium sized countries, regions of large countries, groups of
small countries, seas, and other regions with dimensions on the
order of 1000 to 5000 km;
- States, provinces, small countries, or other regions with
dimensions on the order of 200 km to 1000 km;
- 1 degree by 1 degree squares or other regions with
dimensions on the order of 50 km to 200 km;
- 7.5 minute map quads or other regions with dimensions on the
order of 10 km to 50 km;
- Regions with dimensions less than about 10 km.
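As a rough illustration of how a region's size maps onto a level of the tree, the approximate ranges listed above can be written as a small lookup. This is only a sketch: the level labels and the function name are hypothetical and are not names used in the database, and the boundaries are approximate, as stated above.

```python
# Approximate lower bounds (km) for each level, taken from the list above.
# The labels are illustrative only; they are not directory names.
LEVELS = [
    (5000.0, "continental"),  # greater than about 5000 km
    (1000.0, "regional"),     # on the order of 1000 to 5000 km
    (200.0, "state"),         # on the order of 200 to 1000 km
    (50.0, "one_degree"),     # on the order of 50 to 200 km
    (10.0, "quad"),           # on the order of 10 to 50 km
]


def level_for_extent(extent_km, is_global=False):
    """Return a label for the level whose size range contains extent_km."""
    if is_global:
        return "global"
    if extent_km <= 0:
        raise ValueError("extent_km must be positive")
    for lower_bound_km, label in LEVELS:
        if extent_km >= lower_bound_km:
            return label
    return "local"  # less than about 10 km


# A 1-degree square is roughly 111 km north to south.
print(level_for_extent(111.0))
```

Because the ranges are approximate, a region near a boundary could reasonably be placed at either of the adjacent levels.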
At each level, there are two types of subdirectories: subdirectories
containing actual data at the scale appropriate to that level, and
subdirectories containing information for regions at the next finer
scale.
The subdirectory containing the actual data for each region is further
subdivided by type of data, etc., and these sub-subdirectories in turn
contain a separate directory for each data set. This permits
documentation files to be co-located with the data, accommodates
data formats that require multiple files for a single data set
(e.g., Arc/Info and LAS), and permits storing frequently used data
sets in more than one format.
Because the Unix pathnames to reach dataset directories are often
long, especially at the deeper levels of the geographic tree, each
dataset directory also contains a "dataset reference code" (DSRCODE),
which is displayed immediately below the dataset name in the hypertext
listing of the directory contents. This code can be used to simplify
accessing the dataset directory and
retrieving data from it.
To accommodate different names for overlapping regions (e.g., the
Susquehanna River Basin includes parts of NY, PA, and MD), many
directories contain pointers (logical links) to
other directories on the same or next lower level. For example, the
"susq" directory contains pointers to all 1-degree-square subdirectories
of the three states listed above which cover all or part of the basin;
and the directory for each of these states contains a pointer to the
"susq" directory.
In addition to the actual data and documentation, the top-level
directory contains several files (README, NEWS, and FORMATS) and
directories (notes/ and projections/) containing explanatory and
supplementary information about the database. Each subregion also
contains a file defining its geographic extent
to facilitate database searches.
At present, most datasets are accompanied only by a temporary
documentation file, named "doc", containing a brief description
of the data set, including its source, resolution, and units of
measurement. Eventually, we hope to provide full documentation
(metadata) for each dataset, in conformance with the content standards
for spatial metadata developed by a working group representing a number
of U.S. Government agencies.
Since the complete documentation for one dataset can run to many pages,
but contains many sections which are identical for all datasets of a
given type or for a given region, the database contains provision for
splitting the metadata into the three categories described below.
- Generic: records or fields defined by the metadata standard which
are the same for all datasets of a given type (e.g., all 30 m DEMs).
This information will be collected together in files in a
directory at the top of the database tree.
- Collective: records or fields defined by the metadata standard
which are the same for all datasets in a group (e.g., all
model-output files for a given experiment). This information will
be collected into a cdoc/ subdirectory of whatever region
contains all the datasets in the group.
- Dataset specific: this information will be contained in a metadata
file located in the subdirectory containing the dataset to which it
applies.
The intention is that the metadata file for each dataset will contain
pointers to the relevant fields and records in the collective and
generic documentation files. The software for displaying dataset
documentation would then use these pointers to assemble a complete
documentation file for display. This software has not yet been
implemented.
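Since no such software exists yet, the following is only a sketch of how the pointer scheme might work, not a description of an actual implementation. It assumes, purely for illustration, that each metadata file has been read into a dictionary and that a pointer is stored as a (category, document name, field) tuple; the function name and all example values are hypothetical.

```python
def assemble_documentation(dataset_meta, generic_docs, collective_docs):
    """Build a complete documentation record for one dataset.

    Plain values in dataset_meta are copied as-is. A tuple value is treated
    as a pointer (category, doc_name, field) into the generic or collective
    documentation and is replaced by the value it points to.
    """
    sources = {"generic": generic_docs, "collective": collective_docs}
    complete = {}
    for field, value in dataset_meta.items():
        if isinstance(value, tuple):
            category, doc_name, key = value
            if category not in sources:
                raise KeyError(f"unknown metadata category: {category}")
            complete[field] = sources[category][doc_name][key]
        else:
            complete[field] = value
    return complete


# Hypothetical example content, not taken from the database.
generic_docs = {"dem30m": {"units": "meters", "resolution": "30 m"}}
collective_docs = {"experiment_a": {"contact": "project office"}}
dataset_meta = {
    "title": "example elevation grid",
    "units": ("generic", "dem30m", "units"),
    "resolution": ("generic", "dem30m", "resolution"),
    "contact": ("collective", "experiment_a", "contact"),
}

for field, value in assemble_documentation(
        dataset_meta, generic_docs, collective_docs).items():
    print(f"{field}: {value}")
```

Keeping shared fields in one place means a correction to the generic or collective documentation would show up in every dataset that points to it.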
Users need to be able to search for datasets using a variety of
criteria, singly or in combination, such as data type, geographic
location, date, and data resolution. Ultimately, this could be
facilitated by copying selected fields of the metadata
for each dataset into a consolidated catalog file, which could then
be rapidly scanned for datasets matching user-specified search criteria.
Implementation of such a facility, however, must await creation of the
metadata files for all datasets.
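To make the idea of scanning a consolidated catalog concrete, here is a minimal sketch. It assumes the catalog has already been read into a list of dictionaries, one per dataset; the field names, the example records, and the function name are all hypothetical, since no catalog format has been defined.

```python
def record_matches(record, criteria):
    """Return True if the record satisfies every criterion.

    A criterion is either a plain value (tested for equality) or a
    callable predicate applied to the record's value for that field.
    """
    for field, wanted in criteria.items():
        if field not in record:
            return False
        value = record[field]
        if callable(wanted):
            if not wanted(value):
                return False
        elif value != wanted:
            return False
    return True


def search_catalog(catalog, **criteria):
    """Return the catalog records that satisfy all of the given criteria."""
    return [record for record in catalog if record_matches(record, criteria)]


# Hypothetical catalog entries, for illustration only.
catalog = [
    {"name": "a", "data_type": "dem", "resolution_m": 30, "year": 1994},
    {"name": "b", "data_type": "dem", "resolution_m": 90, "year": 1995},
    {"name": "c", "data_type": "soils", "resolution_m": 30, "year": 1995},
]

# Criteria used in combination: DEMs with resolution of 30 m or finer.
hits = search_catalog(catalog, data_type="dem",
                      resolution_m=lambda r: r <= 30)
print([record["name"] for record in hits])
```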
Since two of the most important search criteria are geographic location
and data type, the database has been designed to facilitate searches
using these criteria. Each subdirectory contains a box file
which gives, in degrees and decimal fractions, approximate values of the
western- and eastern-most longitudes and the northern- and southern-most
latitudes of any point in the subregion. Although individual datasets
may not occupy the entire box, and the box enclosing an irregularly
shaped region such as a state may contain extensive portions of adjacent
regions, the box files do provide a rapid means for screening out most
datasets which do not contain a specified point or set of points.
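The screening step amounts to a point-in-box test. The sketch below assumes, for illustration only, that a box file holds four whitespace-separated numbers in the order west, east, north, south, with west longitudes and south latitudes negative; the actual file layout and sign convention may differ. The region names and coordinates are hypothetical, and boxes that cross the 180-degree meridian are not handled.

```python
from typing import NamedTuple


class Box(NamedTuple):
    west: float   # western-most longitude, decimal degrees
    east: float   # eastern-most longitude, decimal degrees
    north: float  # northern-most latitude, decimal degrees
    south: float  # southern-most latitude, decimal degrees


def parse_box(text):
    """Parse the assumed 'west east north south' layout of a box file."""
    west, east, north, south = (float(token) for token in text.split())
    return Box(west, east, north, south)


def box_contains(box, lon, lat):
    """Return True if the point (lon, lat) lies inside or on the box."""
    return box.west <= lon <= box.east and box.south <= lat <= box.north


def screen(boxes, lon, lat):
    """Return the names of subregions whose box contains the point."""
    return [name for name, box in boxes.items()
            if box_contains(box, lon, lat)]


# Hypothetical subregions and box contents.
boxes = {
    "region_a": parse_box("-80.5 -74.7 42.3 39.7"),
    "region_b": parse_box("-79.5 -75.0 39.7 37.9"),
}
print(screen(boxes, -77.9, 40.8))
```

A subregion that passes this test still has to be checked in detail, since the box may cover parts of adjacent regions.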
To facilitate searches for a given data type, an attempt has been made
to use a standardized set of names for generic data types.
Standardization of names for subtypes is more difficult, so the
user may still need to check the dataset documentation file(s) to
determine whether a dataset returned by a database search does in fact
contain the desired type of data.
Symbolic links provide a mechanism for referring to the same data using
more than one pathname. They are used for several purposes:
- When a directory or data set at one level spans the region
covered by more than one directory at the next higher level,
the actual data will be entered under only one directory, and
pointed to by links in the other directories. For example, the
Wilmington West (N39W075) 1-degree square covers parts of
Delaware, Maryland, New Jersey, and Pennsylvania; the actual
data are entered under only one state, with symbolic links
pointing to them from the directories for the other three
states.
- When a project focuses on a region which does not coincide with
any region covered by a "standard" directory (for example, the
Susquehanna River Basin covers three states), a separate
directory may be created for the project which will contain
both actual data, for data sets which extend over the entire
region, and symbolic links to the directories at the next lower
level which cover any part of the region. These lower level
directories may in turn contain pointers back to the directory
for the project region.
- It is generally desirable to use fairly short names for
directories, and more fully descriptive names for data sets.
Symbolic links may be used to provide alternate, more
descriptive names for directories and shorter, abbreviated
names for data sets. For example, the "official" name of
1-degree square regions in the northern half of the western
hemisphere incorporates the latitude and longitude of the
southeast corner of the square; in general, symbolic links
have been created to permit accessing the directory by the
name of the corresponding 1:250000 map sheet.
- Symbolic links are also used to implement the simplified Unix
pathnames for datasets using their dataset
reference codes.
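The alternate-name case above can be sketched with the standard os.symlink call. The example works in a temporary directory rather than in the database itself; the link name "wilmington_west" is a hypothetical spelling chosen for illustration, and the helper function is not part of any existing ESSC software.

```python
import os
import tempfile


def add_alias(parent_dir, target_name, alias_name):
    """Create alias_name in parent_dir as a relative symbolic link
    pointing to the sibling directory target_name."""
    alias_path = os.path.join(parent_dir, alias_name)
    os.symlink(target_name, alias_path, target_is_directory=True)
    return alias_path


with tempfile.TemporaryDirectory() as root:
    # The "official" directory, named for the southeast corner of the square.
    official = os.path.join(root, "N39W075")
    os.mkdir(official)
    with open(os.path.join(official, "doc"), "w") as handle:
        handle.write("placeholder documentation\n")

    # A more descriptive alternate name for the same directory.
    alias = add_alias(root, "N39W075", "wilmington_west")

    # Both pathnames reach the same directory and the same files.
    print(os.path.samefile(official, alias))
    print(os.listdir(alias))
```

A relative link target is used here so that the link would remain valid if the parent directory were moved as a whole.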
Last change: 1 Oct 1996,
R. A. White / raw@essc.psu.edu