
Information Management

Data Management Plan: download as PDF

Data management is a subset of information management but the distinction blurs. Data management cannot be separated from research themes, publications, personnel, sampling locations, controlled vocabularies and the system which integrates all these.

Information Management System

Figure 1. The MCR LTER IM System (right) shows examples of data types and sources (ovals) and databases (cylinders) at the LTER Network Office (LNO). Currently only the Metacat data catalog both supplies and receives data and metadata, as represented by a two-way arrow. The other LTER Network databases are currently sinks only, represented by one-way arrows; the plan is for them to become sources as well. The hexagons at right represent the modular components of our web interface; the two-way arrow represents the public website's queries to the MCR IMS for content.


Figure 2. The information management process (left) as it relates to the design and maintenance of the MCR long-term experiment. Information defined in the initial design of an experiment, such as its data structure, protocols, personnel and sites, is collected early in the process and updated regularly as necessary. Incremental updates from additional data collection are added to the database (or appended to flat files on the file system) and automatically become available when new versions of metadata are submitted to Metacat. Once verified, these updates appear immediately in the public data packages on the public web interface. Persistent links to protocol documents stored in the file system are included in the metadata so that updated versions become available automatically.


There is no single metric for the amount of data. The complexity of MCR data packages varies widely: each contains from one to eleven data tables, and each table has from three to over 150 columns. Each data table has from two to hundreds of thousands of rows of data, with a mean of 45,229 and a median of 268 rows. Each datum may be the result of lab analysis, an observation by a diver, or an automated sensor recording. Time series data are appended to existing data packages as updated revisions; data are not split into separate tables by year.

Temporal type | N pkgs
Ongoing time series | 32
Completed time series | 8
Short term study | 17
Non-temporal | 2

Spatial extent | N pkgs
LTER site or subset | 41
Single point | 7
Laboratory | 3
Regional or global | 8

Category | N pkgs
Core and Signature | 30
Non-core | 29
Reference or Exogenous | 6
Thesis | 4

MCR LTER Information Management Plan

INTRODUCTION

The primary Information Management missions of the MCR LTER are to facilitate the site’s scientific work; ensure data integrity, security and longevity; maintain appropriate accessibility to data; and facilitate data synthesis. Information management at MCR is collaborative. The Principal Investigators, the Deputy Director, and the Information Manager work together to plan the activities and set priorities. During MCR II, improvements and new components in our Information Management System (IMS) will be driven by our missions and with compatibility in mind. This six-year renewal cycle of MCR coincides with a dynamic period of planning and development of cyberinfrastructure (CI) in the LTER network and thus provides an opportunity to develop our local system to take advantage of these advances. MCR II activities will include:

  • Participation in network efforts to optimize scientific data management
  • Collaboration with other research sites to leverage or adapt compatible systems
  • Support for trends toward increasingly interdisciplinary studies
  • Continued commitment to comprehensive metadata in the network standard (EML)
  • Modular development of the IMS to enable integration of future projects
  • Ensuring data integrity and consistency throughout the data lifecycle
  • Development of applications that handle data in the form of web services

STATUS AND DEVELOPMENT OF THE CURRENT SYSTEM

The MCR LTER collects diverse types of data from a variety of sources and our growing IMS reflects this heterogeneity. Highlights of the processes, tools and protocols for the MCR IMS include:

Data Catalog. Currently (as of July 4, 2013) 59 data packages containing 195 data tables (or other data entities) are available, both locally and through the LTER Network Metacat catalog. All of the Type I packages are also available through the LTER Network Data Portal (aka PASTA). MCR's local data catalog is based on the XML specification adopted by the network (Ecological Metadata Language, EML), and so MCR is well placed to take full advantage of applications based on EML. Our catalog is a hybrid system that is populated from the network Metacat and transformed using local XSLT templates, which ensures that local and network catalogs contain identical inventories.

Year | N pkgs | N Type I | N Type II | I & II mix | N with data | N tables | N packages "Integration" (2009) or Workflow-Ready (2012) | Data years cataloged (excl. pre 2005)
2009 | 24 | 19 | 5 | | 19 | 24 | 15 | 5
2012 | 44 | 39 | 5 | | 43 | 118 | 41* (2 required login) | 8
2013 | 59 | 45 | 12 | 2 | 59 | 195 | >45* (not counting Type II) | 9

MCR LTER uses the LTER standard specification, EML, to record metadata. The Information Manager is involved in all levels of data collection. Data processing is laboratory-based, in the languages preferred by the investigators and their staff. For each data product, the Investigator(s) and Information Manager discuss potential output formats and agree on a product. In most cases the long-term data product from each lab is provided to the Information Manager as ASCII text. The exception is the annual biological census data, which are uploaded directly to the database by the technicians using a web interface. Currently, metadata is entered and maintained directly in EML using an XML editor. However, we are moving toward generating EML dynamically from a metadata database. For this we ported the relational database model "GCE Metabase2" to our local system in 2011.
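
As a rough illustration of generating EML dynamically from a metadata database, the sketch below renders relational rows as a small EML-like document. It uses an in-memory SQLite database as a stand-in for a production database. The table and column names (`dataset`, `attribute`, `position`) are hypothetical and are not the Metabase schema, and the output is a simplified fragment rather than schema-valid EML.

```python
import sqlite3
import xml.etree.ElementTree as ET


def build_demo_db():
    """Create an in-memory database with hypothetical metadata tables."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE dataset (id INTEGER PRIMARY KEY, package_id TEXT, title TEXT);
        CREATE TABLE attribute (
            dataset_id INTEGER REFERENCES dataset(id),
            position   INTEGER,
            name       TEXT,
            definition TEXT);
        INSERT INTO dataset VALUES (1, 'example.1.1', 'Example fish census');
        INSERT INTO attribute VALUES
            (1, 1, 'date',  'Date of survey'),
            (1, 2, 'site',  'Sampling location code'),
            (1, 3, 'count', 'Number of individuals observed');
    """)
    return con


def dataset_to_eml(con, dataset_id):
    """Render one dataset and its attributes as a simplified EML-like tree."""
    package_id, title = con.execute(
        "SELECT package_id, title FROM dataset WHERE id = ?", (dataset_id,)
    ).fetchone()
    root = ET.Element("eml", {"packageId": package_id})
    dataset = ET.SubElement(root, "dataset")
    ET.SubElement(dataset, "title").text = title
    table = ET.SubElement(dataset, "dataTable")
    attr_list = ET.SubElement(table, "attributeList")
    rows = con.execute(
        "SELECT name, definition FROM attribute "
        "WHERE dataset_id = ? ORDER BY position",
        (dataset_id,),
    )
    for name, definition in rows:
        attr = ET.SubElement(attr_list, "attribute")
        ET.SubElement(attr, "attributeName").text = name
        ET.SubElement(attr, "attributeDefinition").text = definition
    return root


con = build_demo_db()
eml_root = dataset_to_eml(con, 1)
eml_text = ET.tostring(eml_root, encoding="unicode")
print(eml_text)
```

Because the document is rebuilt from the database on each run, a change to a row (for example a corrected attribute definition) shows up in the next generated revision without hand-editing XML.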

Metadata database. Manual entry of metadata into EML is practical for a small inventory of datasets. However, in addition to keeping pace with data submissions, the LTER network continually moves toward improved data discovery and synthesis, which is often accomplished by enriching the metadata. Such upgrades are better accomplished with metadata in a relational database than in static XML documents. In 2010 we surveyed EML management systems, and in 2011 we began adopting components of the GCE LTER IM System, starting with Metabase. Currently, EML data package inventory management, status and tasks are tracked in our database.

Congruence. The way we measure completeness of the metadata has matured over the last decade. Using the 2004 version of Best Practices, in terminology now deprecated, datasets were categorized from "Discovery" to "Integration." Those measures pertained only to the metadata, not to the congruence of data with metadata. Since 2011, MCR has required all datasets to pass a congruence test, which ensures that the metadata adequately describes the data, prior to submission to the data catalog. The Quality Engine component of PASTA has made that process far more efficient than the older methods of congruence testing. Metadata for all packages is regularly maintained using Best Practices (v.2 August 2011) and has been in EML 2.1.0 since 2010. All data packages include methods or protocols (often with a downloadable document in addition to embedded text).
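
To make the idea of a congruence test concrete, here is a minimal sketch that compares a delimited data table against the attributes declared in its metadata. It is not the PASTA Quality Engine and does not reproduce its checks. The function name `check_congruence` is hypothetical, and the metadata is a simplified EML-like fragment that assumes the data file has a single header line.

```python
import csv
import io
import xml.etree.ElementTree as ET


def check_congruence(eml_text, csv_text):
    """Return a list of mismatches between metadata and data (empty if congruent)."""
    table = ET.fromstring(eml_text).find(".//dataTable")
    declared = [e.text.strip() for e in table.iter("attributeName")]
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, records = rows[0], rows[1:]
    problems = []
    # Column names and order must match the declared attribute list.
    if header != declared:
        problems.append(f"columns differ: metadata {declared}, data {header}")
    # If the metadata states a record count, the data must agree with it.
    expected = table.findtext("numberOfRecords")
    if expected is not None and int(expected) != len(records):
        problems.append(
            f"record count differs: metadata {expected}, data {len(records)}")
    # Every record must have one field per declared attribute.
    for line_no, record in enumerate(records, start=2):
        if len(record) != len(declared):
            problems.append(
                f"line {line_no}: {len(record)} fields, expected {len(declared)}")
    return problems


EML = """<eml><dataset><dataTable>
  <attributeList>
    <attribute><attributeName>date</attributeName></attribute>
    <attribute><attributeName>site</attributeName></attribute>
    <attribute><attributeName>count</attributeName></attribute>
  </attributeList>
  <numberOfRecords>2</numberOfRecords>
</dataTable></dataset></eml>"""

GOOD = "date,site,count\n2012-07-01,SITE1,14\n2012-07-02,SITE2,9\n"
BAD = "date,site\n2012-07-01,SITE1\n"

good_problems = check_congruence(EML, GOOD)
bad_problems = check_congruence(EML, BAD)
for problem in bad_problems:
    print(problem)
```

A fuller test would also compare each value against the declared storage type, units and missing-value codes; this sketch only covers structure.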

Controlled Vocabularies. MCR has robust, structured and well-controlled vocabularies for taxonomy and sampling locations. We will benefit from additional LTER and scientific community vocabularies for other data components such as keywords, observations and measurements (units). Where appropriate, development of vocabularies at MCR will leverage work at the network level, specifically the LTER Unit Registry and the thesaurus for keywords. Where possible, we will relate our vocabularies to existing standardized vocabularies, e.g., Global Change Master Directory, Open Geospatial Consortium. In addition to selecting dataset keywords from the Metacat browse page, in 2011 we began using the LTER Controlled Vocabulary tools getHiveEML and the TemaTres browser to add keywords when datasets are created or updated.

File system. The MCR file system is co-managed by the Marine Science Institute (MSI) and two other research groups, including the Santa Barbara Coastal LTER (SBC). MCR LTER data and document storage has three tiers of access: public access to data packages, internally shared pre-release or controlled-access data packages, and areas accessible only by the data owner. This enables us to secure new data against loss early in the process, before it is ready for review and internal sharing. The file system is protected by daily on-site incremental backups, less frequent full backups, and off-site tape backups. In January 2012 our shared storage capacity was doubled to 12 terabytes.

Web Site. Our web site is a hybrid of static informational pages and dynamic pages with content supplied by our relational database. In 2010 we migrated to an updated design, with separation of content and style and new templates implemented with W3C standards in valid XHTML and CSS. Time invested in the redesign of the website has been regained in ease of maintenance and extensibility. Web site components are stored in a version control system (Subversion, SVN). Our web site host is shared with partner projects at MSI. MCR uses the lternet.edu domain name maintained by the LTER network.

NEW MODULES

We plan to take advantage of services to be offered by the LTER Network Office (LNO) as these develop (Fig. 1). Because computing technology changes rapidly relative to the long-term mission of the LTER network, the modular design of the MCR IM system will enable components to be replaced without requiring redesign. This design mode merges current recommended practices for programming with the pragmatic style of information management necessary for an LTER site.

Software Architecture Standardization. We are progressing toward the goal of standardizing our software architecture on a constrained group of languages and frameworks. In some cases this has meant porting IMS tools from other LTER sites that use different architectures, such as when we ported the GCE Metabase from SQL Server to PostgreSQL. Currently we use a PostgreSQL relational database as the back end serving the web site and data access. Five LTER sites use PostgreSQL. We connect to this database with PHP and Perl CGI. When the LNO becomes able to offer web service access to the network PersonnelDB and all-site Bibliographic DB, our site system will implement web service clients to synchronize that content with our local database.

Project Cross-Referencing. Implementation of the ProjectDB, a way of storing and displaying research project information developed by an IMC working group, began in 2013 at MCR. In this sense, a project can accommodate cross-referencing between MCR Working Groups, an LTER network research focus, local research groups or labs, or even a particularly rich or diverse collaborative dataset. Each project is characterized by its description, people, datasets, locations, publications, and reference materials. Since March 2010 we have been using a simpler database model for cross-referencing, which will continue to serve us until ProjectDB is fully populated. This simpler model allows cross-referencing between people, publications, datasets, research activities, and network core research areas.
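
One common way to support this kind of cross-referencing in a relational database is with junction (many-to-many) tables. The sketch below is an illustration only, using in-memory SQLite. The schema, table names and sample rows are hypothetical and do not reflect the actual MCR model or ProjectDB.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE person      (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE dataset     (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE publication (id INTEGER PRIMARY KEY, citation TEXT NOT NULL);

    -- Junction tables carry the cross-references (many-to-many).
    CREATE TABLE dataset_person (
        dataset_id INTEGER REFERENCES dataset(id),
        person_id  INTEGER REFERENCES person(id),
        PRIMARY KEY (dataset_id, person_id));
    CREATE TABLE publication_dataset (
        publication_id INTEGER REFERENCES publication(id),
        dataset_id     INTEGER REFERENCES dataset(id),
        PRIMARY KEY (publication_id, dataset_id));

    INSERT INTO person VALUES (1, 'Researcher A'), (2, 'Researcher B');
    INSERT INTO dataset VALUES
        (10, 'Example time series'), (11, 'Example short-term study');
    INSERT INTO publication VALUES (100, 'Example citation 2012');
    INSERT INTO dataset_person VALUES (10, 1), (10, 2), (11, 2);
    INSERT INTO publication_dataset VALUES (100, 10);
""")


def datasets_for_person(con, person_id):
    """List dataset titles linked to a person, in alphabetical order."""
    rows = con.execute("""
        SELECT d.title
        FROM dataset_person dp
        JOIN dataset d ON d.id = dp.dataset_id
        WHERE dp.person_id = ?
        ORDER BY d.title""", (person_id,))
    return [title for (title,) in rows]


def publications_for_person(con, person_id):
    """Follow person -> datasets -> publications through the junction tables."""
    rows = con.execute("""
        SELECT DISTINCT p.citation
        FROM dataset_person dp
        JOIN publication_dataset pd ON pd.dataset_id = dp.dataset_id
        JOIN publication p ON p.id = pd.publication_id
        WHERE dp.person_id = ?
        ORDER BY p.citation""", (person_id,))
    return [citation for (citation,) in rows]


print(datasets_for_person(con, 2))
print(publications_for_person(con, 1))
```

Adding another cross-referenced entity, such as research activities or core research areas, would mean adding one entity table and one junction table per relationship, leaving the existing tables unchanged.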

POLICIES AND PRACTICES

Data Contributions and Access. LTER network policy is implemented by the MCR IMS as follows:

  • Type 0: Near-real-time physical oceanographic and meteorological raw data, and derived near-real-time event detection data, are made publicly available without delay, as immediacy is part of their value.
  • Type I: Data that require some level of QA/QC, analysis or post-processing to generate the desired data products. The majority of MCR-LTER Long-term Time Series Program data packages are Type I, with public release as soon as data are analyzed and verified.
  • Type II: Sensitive data resources, such as those collected for use in graduate student theses and dissertations or collected in collaboration with non-MCR researchers. These data are handled on a case-by-case basis with special approval of the Principal Investigators and/or Executive Committee. These data will be made available to the public after an agreed-upon, specified period of time.
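
The release rules above can be sketched as a small decision function. This is an illustration under stated assumptions, not the MCR implementation: the class and field names (`DataPackage`, `verified`, `release_date`) are hypothetical, and the Type II rule assumes the agreed period has been recorded as a single release date.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class DataType(Enum):
    TYPE_0 = 0   # near-real-time raw and derived data
    TYPE_I = 1   # data requiring QA/QC, analysis or post-processing
    TYPE_II = 2  # sensitive data released after an agreed period


@dataclass
class DataPackage:
    data_type: DataType
    verified: bool = False                # analysis and verification done (Type I)
    release_date: Optional[date] = None   # end of the agreed period (Type II)


def is_public(pkg: DataPackage, today: date) -> bool:
    """Decide whether a package may be released publicly on the given day."""
    if pkg.data_type is DataType.TYPE_0:
        return True                       # released without delay
    if pkg.data_type is DataType.TYPE_I:
        return pkg.verified               # released once analyzed and verified
    # Type II: released only once an agreed release date has been reached.
    return pkg.release_date is not None and today >= pkg.release_date


today = date(2013, 7, 4)
thesis = DataPackage(DataType.TYPE_II, release_date=date(2014, 1, 1))
print(is_public(DataPackage(DataType.TYPE_0), today))
print(is_public(DataPackage(DataType.TYPE_I, verified=True), today))
print(is_public(thesis, today))
```

A Type II package with no recorded release date stays private, which matches the case-by-case handling described above: nothing is released until an approval sets the date.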

Public downloads of data are tracked. MCR investigators may view the record of public downloads on an internal web page. Data users accept the access policy prior to download.

Quality Assurance, Control, and Monitoring. For the majority of MCR's data components, quality assurance and control are done before submission to the IMS. Quality assurance measures built into data collection by design are encouraged. For the annual biological census data, the relational database itself constrains and validates incoming data, so that successful upload into the data model is part of the quality assurance process. Additionally, a random sample of each year’s data is queried and manually compared to original data sheets before the data are marked as verified. These procedures were agreed upon and designed by the Information Manager in partnership with the Investigators and their staff.
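
The two census safeguards described above, database constraints that reject invalid rows and a random sample drawn for manual comparison, can be sketched as follows. This is a minimal example using in-memory SQLite as a stand-in. The schema, site codes, value ranges and function names are all hypothetical and are not the MCR census data model.

```python
import random
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
con.executescript("""
    CREATE TABLE site (code TEXT PRIMARY KEY);
    CREATE TABLE census (
        id            INTEGER PRIMARY KEY,
        year          INTEGER NOT NULL CHECK (year BETWEEN 1900 AND 2100),
        site_code     TEXT    NOT NULL REFERENCES site(code),
        taxon         TEXT    NOT NULL,
        n_individuals INTEGER NOT NULL CHECK (n_individuals >= 0));
    INSERT INTO site VALUES ('SITE1'), ('SITE2');
""")


def upload(con, rows):
    """Insert rows one at a time and return the rows the database rejected."""
    rejected = []
    for row in rows:
        try:
            with con:  # commits on success, rolls back on error
                con.execute(
                    "INSERT INTO census (year, site_code, taxon, n_individuals) "
                    "VALUES (?, ?, ?, ?)", row)
        except sqlite3.IntegrityError as err:
            rejected.append((row, str(err)))
    return rejected


def sample_for_manual_check(con, year, k, seed=None):
    """Pick up to k record ids from one year to compare with the data sheets."""
    ids = [i for (i,) in con.execute(
        "SELECT id FROM census WHERE year = ? ORDER BY id", (year,))]
    rng = random.Random(seed)
    return sorted(rng.sample(ids, min(k, len(ids))))


rows = [
    (2012, "SITE1", "Taxon A", 14),
    (2012, "SITE2", "Taxon B", 3),
    (2012, "SITE9", "Taxon A", 5),   # unknown site: foreign key violation
    (2012, "SITE1", "Taxon C", -2),  # negative count: CHECK violation
]
rejected = upload(con, rows)
to_check = sample_for_manual_check(con, 2012, k=1, seed=0)
for row, reason in rejected:
    print(row, "->", reason)
```

Passing a fixed `seed` makes the sample reproducible, so the same records can be pulled again if the manual comparison needs to be repeated or audited.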

IM System Team. MCR information management is an integrated system with smooth transition from data collection to curation (Fig. 2). The Information Manager requests periodic conferences with each Investigator individually to review their metadata and data contributions. Areas where science may benefit from the IM system resources are identified so that the process is seen as a two-way flow. MCR site technicians participate significantly in data documentation, re-formatting, and quality control and contribute their detailed knowledge to the design of data models and protocols. The Deputy Director provides the Information Manager with site, publication and personnel information. The MCR and SBC Information Managers actively collaborate, which has provided continuity and leveraged resources since 2004.

Training. Undergraduate and recent graduate assistants have been a welcome addition to the team. Lab and field assistants who have shown aptitude in data handling have been trained in metadata collection, entry, and congruence diagnosis. Data handling protocols have become more thoroughly documented as each new student describes details omitted at a higher level. Besides increasing our current capacity to move datasets through the system, these assistants will take with them skills they can contribute to their own research groups in the future. Assistants have been funded with LTER supplements.

 

 
This material is based upon work supported by the National Science Foundation through the Moorea Coral Reef Long-Term Ecological Research program under Cooperative Agreement #OCE-0417412 and #OCE-1026851. Any opinions, findings, conclusions, or recommendations expressed in the material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.