Information Management

Data Management Plan: download as pdf

Data management is a subset of information management but the distinction blurs. Data management cannot be separated from research themes, publications, personnel, sampling locations, controlled vocabularies and the system which integrates all these.

From SCUBA to PASTA

Figure 2. The information management process (left) as it relates to the design and maintenance of the MCR long-term experiment. In preparation for the field season in Moorea, the science coordinator ensures that all experiments have the latest taxonomic lists, GPS coordinates, and data collection protocols. Information defined in the initial design of an experiment, such as its data structure, protocols, personnel and sites, is collected early in the process and updated as necessary. Protocol documents stored in the file system are linked from the metadata so that updated versions become available automatically. Incremental updates from additional data collection are fed to the database (or appended to flat files on the file system) and become available when new revisions of metadata are submitted to the data repository. The LTER Network specifies that data be archived in an appropriate repository. For nearly all data, we use the repository maintained by the Environmental Data Initiative, which runs on the PASTA software. The metadata catalog DataONE receives these data packages from the LTER Member Node. Other repositories containing data from Moorea, such as BCO-DMO and the KNB Metacat used by NCEAS, are also DataONE member nodes, allowing "one stop shopping" for data searches.

 

There is no single metric for the amount of data. The complexity of MCR data packages varies widely: each contains from one to eleven data tables, and each table has from three to over 150 columns. Data tables range from two to hundreds of thousands of rows, with a mean of 45,229 rows and a median of 268. Each datum may be the result of lab analysis, an observation by a diver, or an automated sensor recording. Time series data are appended to existing data packages as updated revisions; data are not split into separate tables by year.

MISSION

The primary Information Management missions of the MCR LTER are to facilitate the site’s scientific work; ensure data integrity, security and longevity; maintain appropriate accessibility to data; and facilitate data synthesis. Information management at MCR is collaborative. The Principal Investigators, the Deputy Director, and the Information Manager work together to plan the activities and set priorities. During MCR III, improvements and new components in our Information Management System (IMS) will be driven by our missions and with compatibility in mind.

COMPONENTS

The MCR LTER collects diverse types of data from a variety of sources and our growing IMS reflects this heterogeneity. Highlights of the processes, tools and protocols for the MCR IMS include:

Data Catalog. As of May 2019 there are 75 public data packages published in the LTER Network Data Portal (aka PASTA), containing 238 data tables (or other data entities). Within our local MCR LTER data catalog, data packages can be browsed by topic, investigator, or LTER Core Research Area, or listed as an inventory. Metadata is encoded in the XML specification adopted by the network (Ecological Metadata Language, EML).

MCR LTER uses the LTER standard specification, EML, to record metadata. The Information Manager is involved in all levels of data collection. Data processing is laboratory-based, in the languages preferred by the investigators and their staff. For each data product, the Investigator(s) and Information Manager discuss potential output formats and agree on a product. In most cases the long-term data product from each lab is provided to the Information Manager as ASCII text. The exception is the annual biological census data, which are uploaded directly to the database by the technicians using a web interface. Currently, metadata is entered and maintained directly in EML using an XML editor. However, we are moving toward generating EML dynamically from a metadata database. For this we ported the relational database model "GCE Metabase2" to our local system in 2011.

Metadata database. Manual entry of metadata into EML is practical for a small inventory of datasets. However, in addition to keeping pace with data submissions, the LTER network continually moves toward improved data discovery and synthesis, which is often accomplished by enriching the metadata. Such upgrades are better accomplished with metadata in a relational database than in static XML documents. In 2010 we surveyed EML management systems, and in 2011 we began adopting components of the GCE LTER IM System, starting with Metabase. Currently, EML data package inventory management, status and tasks are tracked in our database. In 2018 MCR and SBC began a collaboration with the new BLE LTER site information manager to polish the Metabase system in preparation for wider release.
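To illustrate why a relational store makes metadata upgrades easier, here is a minimal sketch in Python of generating an EML fragment from database rows. The two-table schema, the table and column names, and the sample rows are all hypothetical and are not the Metabase schema; the output uses real EML element names but is far from a complete, schema-valid EML document.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, minimal metadata schema (NOT the actual Metabase schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_table (id INTEGER PRIMARY KEY, entity_name TEXT);
CREATE TABLE attribute (table_id INTEGER, position INTEGER,
                        name TEXT, definition TEXT);
INSERT INTO data_table VALUES (1, 'fish_counts');
INSERT INTO attribute VALUES (1, 2, 'site', 'Sampling site code');
INSERT INTO attribute VALUES (1, 1, 'date', 'Date of survey');
""")


def build_data_table(conn, table_id):
    """Build a <dataTable> element from the rows describing one data table."""
    (entity_name,) = conn.execute(
        "SELECT entity_name FROM data_table WHERE id = ?", (table_id,)
    ).fetchone()
    table = ET.Element("dataTable")
    ET.SubElement(table, "entityName").text = entity_name
    attr_list = ET.SubElement(table, "attributeList")
    # Column order in the XML follows the stored position, not insert order.
    rows = conn.execute(
        "SELECT name, definition FROM attribute "
        "WHERE table_id = ? ORDER BY position",
        (table_id,),
    )
    for name, definition in rows:
        attr = ET.SubElement(attr_list, "attribute")
        ET.SubElement(attr, "attributeName").text = name
        ET.SubElement(attr, "attributeDefinition").text = definition
    return table


element = build_data_table(conn, 1)
xml_text = ET.tostring(element, encoding="unicode")
print(xml_text)
```

With this arrangement, a change such as rewording a definition becomes a single database update followed by regeneration, rather than a hand edit in each XML document.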

Congruence. The way we measure completeness of the metadata has matured over the last decade. Under the old 2004 version of Best Practices, in terminology now deprecated, datasets were categorized from "Discovery" to "Integration." Those measures pertained only to the metadata, not to the congruence of data with metadata. Since 2011, MCR has required all datasets to pass a congruence test, which ensures the metadata adequately describes the data, prior to submission to the data catalog. The Quality Engine component of PASTA has made that process far more efficient than the older methods of congruence testing. Metadata for all packages is regularly maintained using Best Practices (version 3, 2017) and has been in EML 2.1.0 since 2010 and EML 2.1.1 since 2014. All data packages include methods or protocols (often with a downloadable document in addition to embedded text).
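As a rough illustration of what a congruence test checks, the following Python sketch compares the attribute names declared in a metadata fragment against the header and row widths of a data file. It is not the PASTA Quality Engine and does not reflect its checks in detail; the EML fragment is heavily trimmed, and the table name, column names, and data values are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed EML fragment for one data table.
EML = """<dataset>
  <dataTable>
    <entityName>fish_counts</entityName>
    <attributeList>
      <attribute><attributeName>date</attributeName></attribute>
      <attribute><attributeName>site</attributeName></attribute>
      <attribute><attributeName>count</attributeName></attribute>
    </attributeList>
  </dataTable>
</dataset>"""

# Hypothetical data file contents.
DATA = "date,site,count\n2018-07-01,LTER1,12\n2018-07-02,LTER2,7\n"


def check_congruence(eml_text, csv_text):
    """Return a list of mismatches between declared attributes and the data."""
    root = ET.fromstring(eml_text)
    declared = [el.text.strip() for el in root.iter("attributeName")]
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    problems = []
    # The header must list the declared attributes, in the declared order.
    if header != declared:
        problems.append(f"header {header} != declared attributes {declared}")
    # Every data row must have one field per declared attribute.
    for line_no, row in enumerate(body, start=2):
        if len(row) != len(declared):
            problems.append(
                f"line {line_no}: {len(row)} fields, expected {len(declared)}"
            )
    return problems


problems = check_congruence(EML, DATA)
bad = check_congruence(EML, "date,site\n2018-07-01,LTER1\n")
print(problems)
print(bad)
```

A fuller check would also compare each column's values against the declared storage type, units, and missing-value codes; this sketch covers only names and field counts.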

Controlled Vocabularies. MCR has robust, structured and well-controlled vocabularies for taxonomy and sampling locations. We benefit from additional LTER and scientific community vocabularies for other data components such as keywords, observations and measurements (units). Where possible, we relate our vocabularies to existing standardized vocabularies, e.g., the Global Change Master Directory, the Open Geospatial Consortium, the former NBII and, of course, the LTER Controlled Vocabulary. The soon-to-be-released EML 2.2 offers semantic annotation, which will allow more specific search and definition terms than keywords.

File system. The MCR file system is co-managed with the Marine Science Institute (MSI) and two other research groups, including the Santa Barbara Coastal LTER (SBC). MCR LTER data and document storage has three tiers of access: public access to data packages, internally shared pre-release or controlled-access data packages, and areas accessible only by the data owner. This enables us to secure new data against loss early in the process, before they are ready for review and internal sharing. The file system is protected by daily incremental backups on-site, less frequent full backups, and tape backups off-site. In January 2012 our shared storage capacity was doubled to 12 terabytes.

Web Site. Our web site is a hybrid of static informational pages and dynamic pages with content supplied by our relational database. In 2010 we migrated to an updated design, with separation of content and style and new templates implemented with W3C standards in valid XHTML and CSS. Time invested in the redesign of the website has been regained in ease of maintenance and extensibility. Web site components are stored in a version control system (Subversion, SVN). Our web site host is shared with partner projects at MSI. MCR uses the lternet.edu domain name maintained by the LTER network.

In 2016 we migrated the public-facing informational side of our website to a content management system (Drupal) to allow less technical personnel to update content. The database-driven specialized pages such as the data catalog and metadata display are retained from our previous system.

POLICIES AND PRACTICES

Data Contributions and Access. LTER network policy is implemented by the MCR IMS as follows:

  • Type I. The majority of MCR-LTER Long-term Time Series Program data packages are Type I, with public release as soon as data are verified. Data that require follow-up during QA/QC, or some level of analysis or post-processing to generate the desired data products, are released once we have confidence the data are correct.
  • Type II. Sensitive data resources, such as those collected for use in graduate student theses and dissertations or collected in collaboration with non-MCR researchers, are handled on a case-by-case basis with special approval of the Principal Investigators and/or Executive Committee. These data will be made available to the public after an agreed-upon period of time.

Quality Assurance, Control, and Monitoring. For some of the data components, quality assurance and control are done before submission to the IMS. Quality assurance measures built into data collection by design are encouraged. For the annual biological census data, the relational database itself constrains and validates incoming data, so that successful upload into the data model is part of the quality assurance process. Additionally, a random sample of each year’s data is queried and manually compared to the original data sheets before the data are marked as verified. These procedures were agreed upon and designed by the Information Manager in partnership with the Investigators and their staff.
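The random-sample check described above could be sketched in Python as follows. The record layout (an id and a count), the 5% sampling fraction, and the function names are assumptions made for illustration; the actual sample size and comparison procedure are not specified here.

```python
import random


def draw_verification_sample(records, fraction=0.05, seed=None):
    """Pick a random subset of records for manual comparison.

    The 5% default is an assumed value. Passing a seed makes the draw
    reproducible, so the same rows can be re-checked later.
    """
    if not records:
        return []
    k = max(1, round(len(records) * fraction))
    return random.Random(seed).sample(records, k)


def find_discrepancies(sample, data_sheet):
    """Return sampled rows whose count differs from the data-sheet value."""
    return [row for row in sample if data_sheet.get(row["id"]) != row["count"]]


# Hypothetical database rows for one year, and the matching data-sheet values.
records = [{"id": i, "count": i % 7} for i in range(1, 101)]
sheet = {row["id"]: row["count"] for row in records}

sample = draw_verification_sample(records, fraction=0.05, seed=2019)
print(len(sample), find_discrepancies(sample, sheet))

# Simulate a transcription error on one sampled row.
sheet_with_error = dict(sheet)
sheet_with_error[sample[0]["id"]] += 1
flagged = find_discrepancies(sample, sheet_with_error)
```

In practice the comparison against paper data sheets is manual; the code only shows how a reproducible sample might be drawn and how mismatches could be recorded once the sheet values have been transcribed.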

IM System Team. MCR information management is an integrated system from data collection to curation. The Information Manager requests periodic conferences with each Investigator individually to review their metadata and data contributions. Areas where science may benefit from the IM system resources are identified so that the process is seen as a two-way flow. MCR site technicians participate significantly in data documentation, re-formatting, and quality control and contribute their detailed knowledge to the design of data models and protocols. The Deputy Director provides the Information Manager with site, publication and personnel information. The MCR and SBC Information Managers actively collaborate, which has provided continuity and leveraged resources since 2004.

Training. Undergraduate and recent graduate assistants have been a welcome addition to the team. Lab and field assistants who have shown aptitude in data handling have been trained in metadata collection, entry, and congruence diagnosis. Data handling protocols have become more thoroughly documented as each new student describes details omitted at a higher level. These assistants increase our capacity to move datasets through the system now, and when they move on they take with them skills they can contribute to their own research groups. Assistants have been funded in the past with LTER supplements.