presenter notes Understanding preservation metadata is crucial for several reasons. Firstly, it ensures long-term access to digital materials, safeguarding against format obsolescence and technological changes. Additionally, preservation metadata documents the provenance and authenticity of digital materials, supporting their trustworthiness and reliability. Effective management of digital collections relies on robust preservation metadata, facilitating systematic organization, monitoring, and decision-making processes. Moreover, standardized preservation metadata formats promote interoperability and exchange of digital materials across diverse systems and institutions. Lastly, preservation metadata aids in legal and ethical compliance, helping institutions adhere to copyright laws, privacy regulations, and access restrictions. In summary, preservation metadata plays a fundamental role in the preservation, management, and accessibility of digital materials.
presenter notes One definition I have often heard is that metadata is “data about data”. This is quite possibly the least helpful definition I have ever heard. A somewhat better definition comes from Jeffrey Pomerantz, an information science educator who has taught courses at both Simmons College and UNC Chapel Hill and has also published a number of MOOCs, many of which address the topics and standards we will be covering today. In 2015, he wrote a book titled Metadata. In it, he describes metadata as “a means by which the complexity of an object is represented in a simpler form”.
presenter notes You can consider hand-written or typed cards in card catalogs as a pre-computer type of metadata. Image credit: https://www.themarginalian.org/2013/10/01/card-catalog-chronicle/
presenter notes These categories are derived from the Library of Congress’ Digital Preservation Metadata Standards document: https://www.loc.gov/standards/premis/FE_Dappert_Enders_MetadataStds_isqv22no2.pdf
presenter notes A metadata standard is a set of guidelines, rules, or best practices for describing data that has been established by a recognized organization or community which helps ensure consistency and interoperability between different data systems and applications.
presenter notes A metadata schema is a specific implementation of a metadata standard and provides a more detailed set of guidelines for describing a specific type of resource or domain, such as images, audio files, video files, or other data about specific formats. A metadata schema may include additional elements or refinements to the standard, and may also provide guidelines for encoding or storing the metadata.
presenter notes Here are two examples of metadata standards: MARC (Machine-Readable Cataloging) and Dublin Core. Each standard is used to describe a broad set of things. Dublin Core, for example, is best suited to digital resources; MARC is better suited to bibliographic items. These standards have enabled us to structure data about broad categories of things. However, because of their breadth, most of these standards can be quite limited when we are describing something very specific.
presenter notes Dublin Core is a metadata standard designed specifically for digital resources that live on the web and in networked environments. Dublin Core is named after the city (Dublin, Ohio) where it was developed. The standard was a response to the rapid uptick in internet usage by users and library systems in the mid-1990s.
presenter notes - Contributor - Coverage - Creator - Date - Description - Format - Identifier - Language - Publisher - Relation - Rights - Source - Subject - Title - Type
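presenter notes As a rough illustration of how the fifteen elements listed above can act as a controlled set of field names, here is a minimal Python sketch. The function name `unknown_elements`, the sample record, and the `shelfmark` field are all hypothetical and exist only for this example; a real system would more likely validate against the published schema.

```python
# The fifteen elements of the Dublin Core Metadata Element Set, lowercased.
DC_ELEMENTS = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}


def unknown_elements(record: dict) -> list:
    """Return the keys in `record` that are not Dublin Core elements."""
    return sorted(set(record) - DC_ELEMENTS)


# Hypothetical record; "shelfmark" is deliberately not a Dublin Core element.
record = {
    "title": "Millions of Cats",
    "creator": "Wanda Gág",
    "type": "Text",
    "format": "image/jpeg",
    "shelfmark": "local field, not part of Dublin Core",
}

print(unknown_elements(record))
```

Anything the function returns is a field that would need either a mapping to one of the fifteen elements or a more specific schema.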
presenter notes For those very specific things, we have developed metadata schemas. For example, Dublin Core, which is used to describe digital resources broadly, has been used as the basis for three other schemas: EAD, specific to finding aids; VRA Core, specific to visual materials and images; and PBCore, specific to audiovisual materials. If you look into how these schemas are structured, you will see a lot of similarities to Dublin Core, as well as certain additions that veer away from Dublin Core.
presenter notes A metadata element is a discrete piece of information that describes a digital or physical object. It provides descriptive, administrative, technical, or structural information about the object to facilitate its management, discovery, access, and preservation. Metadata elements are typically organized into a standardized schema or framework to ensure consistency and interoperability across different systems, domains, and communities.
presenter notes Machine-Readable Cataloging (MARC) is a standard for the representation and communication of bibliographic and related information in machine-readable form. MARC was developed by the Library of Congress in the 1960s, and was used by libraries to store and share catalog records with each other. Each MARC record contains bibliographic data in a structured format that computers can easily process, allowing for efficient cataloging, searching, and sharing of resources across library systems.
presenter notes MODS (Metadata Object Description Schema) is a metadata standard developed by the Library of Congress for describing digital resources. It provides a flexible and extensible framework for encoding bibliographic and descriptive metadata about various types of digital objects, including electronic texts, images, audiovisual materials, and more. MODS is XML-based and designed to be interoperable with other metadata standards and systems. MODS is considered a derivative of MARC. MARC is a widely used metadata format for bibliographic records, originally developed for library cataloging purposes. However, MARC is highly complex and not well-suited for describing digital resources or accommodating the needs of modern digital libraries and repositories, which is the gap MODS is meant to fill.
presenter notes https://www.loc.gov/standards/mods/userguide/generalapp.html This screen capture shows the “MODS Elements and Attributes” Guidelines page, where you can see a list of what they refer to as “top-level elements”: things like titleInfo, language, note, location, name, physicalDescription, subject, etc.
presenter notes Most of the metadata schemas that you will come across will be written in, or at least be compatible with, Extensible Markup Language, or XML. This reflects the fact that we live in a networked world, which requires that information be transferable across systems. Metadata schemas were not always written with networks in mind. For example, the first finding aids describing the holdings of archival repositories were often typed (using a typewriter) onto paper, later input into electronic word processing documents, and stored locally. Nowadays, finding aids are more likely to be input into a descriptive system such as ArchivesSpace (ASpace), which will transform whatever information the archivist inputs into Encoded Archival Description (EAD) format, a metadata schema used for finding aids that is written and expressed in XML. By writing finding aid data using the EAD schema, we can post finding aids online, relay information about archival holdings to other platforms, and also represent repository information in a structured, hierarchical way. XML is so widely used for metadata schemas primarily because it is platform- and language-independent. XML was developed in the 1990s as a way to exchange data over the internet, and has become a sort of universal data exchange language.
presenter notes Over the next few slides we will be looking at an example of a MODS metadata file snippet, written in XML (see MODS schema), in order to give you a sense for how to read XML, and how it is structured. Starting from the top, we will always first declare what metadata schema and version we are using throughout the entire document. Here, we are saying, this file uses MODS by using the <mods> element on the first line. Within the <mods> element, we use attributes, which are basically pieces of data that qualify an element. So here, we are using the xmlns attribute of mods to point to the specific standard we are using on the LOC website. We further qualify the mods element using the version attribute, to say we are using version 3.0. In the context of XML, an "element" is a fundamental building block of an XML document. It is used to represent data structure and content.
presenter notes In the example on the slide, we are seeing the start of information to do with origin, i.e., publishing details like the city where the item was created, the press that published it, etc. Once we are done describing these origin-related aspects of the item, we use the </originInfo> closing tag.
presenter notes Attributes can be added to an element or sub-element to provide additional qualifying information for that element. An attribute is written as the attribute name, followed by an equals sign, and then the value in opening and closing quotes. For example, within the <placeTerm> element, we have two attributes listed: type="code" and authority="marccountry". Here, we are saying that we are providing some information about a place using the <placeTerm> element, and that this particular information will be encoded using the MARC Code List for Countries. This is a controlled list of two- or three-character alphabetic codes that represent countries and, for some countries, states or provinces. In this case, “nyu” represents “New York State”. So here, we are using metadata not just to record information about something, but also to refer to an existing encoding scheme that qualifies what we are recording.
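presenter notes To make the element and attribute vocabulary from the last few slides concrete, here is a minimal sketch that reads a small MODS fragment with Python's standard `xml.etree.ElementTree` module. The fragment is illustrative rather than a copy of the slide, and the publisher name is a placeholder.

```python
import xml.etree.ElementTree as ET

# Illustrative MODS fragment; "Example Press" is a placeholder value.
MODS_XML = """
<mods xmlns="http://www.loc.gov/mods/v3" version="3.0">
  <originInfo>
    <place>
      <placeTerm type="code" authority="marccountry">nyu</placeTerm>
    </place>
    <publisher>Example Press</publisher>
  </originInfo>
</mods>
"""

# Map a prefix of our choosing to the MODS namespace so we can search by name.
NS = {"mods": "http://www.loc.gov/mods/v3"}

root = ET.fromstring(MODS_XML)
place_term = root.find("mods:originInfo/mods:place/mods:placeTerm", NS)

version = root.get("version")         # attribute on the root <mods> element
place_attrs = dict(place_term.attrib)  # the two qualifying attributes
place_code = place_term.text           # the element's content

print(version, place_attrs, place_code)
```

The attributes come back as ordinary name/value pairs, separate from the element's text, which is the distinction the slide is drawing.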
presenter notes The Metadata Encoding and Transmission Standard (METS) is a metadata standard used for describing the structure and content of digital objects. It was developed by the Library of Congress in collaboration with other institutions, and was first released in 2001. METS is always written in XML.
presenter notes A METS file is typically composed of seven main sections: <metsHdr>, <dmdSec>, <amdSec>, <fileSec>, <structMap>, <structLink>, <behaviorSec>. “Sec” means “section”, and “Hdr” means “header”, or the beginning of the document.
presenter notes All METS files start off with a top-most <mets:mets> tag, sometimes referred to as the “root”. Notice that it starts with “<mets”, followed by a “:” and then “mets” again. What is going on here? First mets: This is the namespace prefix. In XML, a namespace is a collection of names, identified by a URI reference, used to avoid conflicts between elements that have the same name but are used in different contexts. The prefix mets: indicates that the elements (and attributes) are defined within the METS schema, which is associated with a specific URI (usually something like http://www.loc.gov/METS/). Before you can use a prefix like mets:, it must be declared in the document, typically in the root element, using the xmlns:mets attribute. Second mets: This is the local name of the element within the METS namespace.
presenter notes A namespace is a collection of names which are used in XML documents as element and attribute names. Namespaces are a way of ensuring that the names of elements and attributes used in an XML file are unique and do not conflict, and can be disambiguated from any other names in the same file. METS files can use XML vocabularies from multiple sources.
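presenter notes One way to see what a namespace prefix does is to parse a document that uses the same local name, `name`, from two vocabularies. This is only a sketch: the document below is not schema-valid METS, and `split_tag` is a hypothetical helper written for this example. Python's `xml.etree.ElementTree` replaces each prefix with the full namespace URI in braces.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
MODS_NS = "http://www.loc.gov/mods/v3"

# Not valid METS: the two <name> elements only illustrate disambiguation.
DOC = f"""
<mets:mets xmlns:mets="{METS_NS}" xmlns:mods="{MODS_NS}">
  <mets:name>Name of an agent</mets:name>
  <mods:name>Name of an author</mods:name>
</mets:mets>
"""


def split_tag(tag: str) -> tuple:
    """Split an ElementTree tag of the form '{uri}local' into (uri, local)."""
    uri, _, local = tag[1:].partition("}")
    return uri, local


root = ET.fromstring(DOC)
children = [split_tag(child.tag) for child in root]

for uri, local in children:
    print(f"{local!r} belongs to {uri}")
```

Both children have the local name `name`, but the namespace URI tells a program which vocabulary each one belongs to.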
presenter notes In this XML example, we start off with the root element <mets:mets>. This can be qualified with attributes, specifically “xmlns”, which stands for “XML Namespace” and declares the namespaces that may be used throughout the METS file. For example, we have the URL to the Library of Congress’ page for the METS standard, followed by the URL to the W3C’s XLink standard.
presenter notes Next, we have the <mets:metsHdr> element containing administrative metadata about the METS file itself: who created the file, what created it, when it was created, etc. It is usually positioned at the very top of the document. Unlike other elements, the METS Header never repeats itself within the same file.
presenter notes In this example, we are saying that this document was created by two individuals, with two different roles. We use the ROLE attribute of the <mets:agent> sub-element to indicate each agent’s role (e.g. archivist), followed by the <mets:name> element. The CREATEDATE attribute, by contrast, belongs to <mets:metsHdr> and records when the METS file was created.
presenter notes The <dmdSec> Descriptive Section contains descriptive metadata for the resource being described by the METS record. METS is agnostic to which metadata schema is used and allows for multiple <dmdSec> elements. The descriptive data is either wrapped or linked (examples in the next two slides).
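presenter notes As a sketch of how a program might read a METS header, the following Python example lists each agent's role and name. The header, date, and agent names are made up for illustration; the ROLE and TYPE attributes on <mets:agent> and the CREATEDATE attribute on <mets:metsHdr> follow the METS schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical header with two agents; names and date are placeholders.
METS_XML = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:metsHdr CREATEDATE="2024-01-15T10:00:00">
    <mets:agent ROLE="CREATOR" TYPE="INDIVIDUAL">
      <mets:name>Example Creator</mets:name>
    </mets:agent>
    <mets:agent ROLE="ARCHIVIST" TYPE="INDIVIDUAL">
      <mets:name>Example Archivist</mets:name>
    </mets:agent>
  </mets:metsHdr>
</mets:mets>
"""

NS = {"mets": "http://www.loc.gov/METS/"}

root = ET.fromstring(METS_XML)
header = root.find("mets:metsHdr", NS)

created = header.get("CREATEDATE")
agents = [
    (agent.get("ROLE"), agent.findtext("mets:name", namespaces=NS))
    for agent in header.findall("mets:agent", NS)
]

print(created)
print(agents)
```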
presenter notes This example contains a bibliographic description for the book Alice in Wonderland, which in this pretend system uses the Dublin Core metadata schema. Because Dublin Core uses a different namespace from METS, we have to first declare this namespace and “wrap” the bibliographic metadata within it. To do this, we use the <mdWrap> tag nested within <dmdSec>: “mdWrap” stands for “metadata wrapper”. Within <mdWrap>, we declare the descriptive data schema we are using, Dublin Core, using <mdWrap> attributes (MDTYPE="DC", LABEL="Dublin Core Metadata"). Nested within <mdWrap> is the <xmlData> tag, which contains the actual bibliographic metadata like title, creator, date, publisher, and format type. Notice how each bibliographic tag starts with dc:. This is how we say, “This tag uses the Dublin Core namespace, which is different from the METS namespace used throughout this file”.
presenter notes Here we have a second example of bibliographic data contained in a METS file, but instead of including the actual bibliographic metadata within the file, like creator and title, we link out to a bibliographic record in a different database. To do this, we use the <mdRef> tag, which tells us that we are referencing an EAD-encoded finding aid. Either way (wrapping or referencing) works here: it really depends on your local system setup and standards.
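presenter notes Here is a minimal sketch of pulling wrapped Dublin Core out of a <dmdSec> with Python's `xml.etree.ElementTree`. The XML is an approximation of what the slide shows, not a copy of it; the `dmd1` ID and the exact set of Dublin Core fields are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Approximation of a wrapped descriptive section; the ID "dmd1" is assumed.
METS_XML = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <mets:dmdSec ID="dmd1">
    <mets:mdWrap MDTYPE="DC" LABEL="Dublin Core Metadata">
      <mets:xmlData>
        <dc:title>Alice in Wonderland</dc:title>
        <dc:creator>Lewis Carroll</dc:creator>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
</mets:mets>
"""

NS = {"mets": "http://www.loc.gov/METS/"}
DC_NS = "http://purl.org/dc/elements/1.1/"

root = ET.fromstring(METS_XML)
wrap = root.find("mets:dmdSec/mets:mdWrap", NS)
xml_data = wrap.find("mets:xmlData", NS)

# Keep only children in the Dublin Core namespace, keyed by local name.
prefix = "{" + DC_NS + "}"
record = {
    child.tag[len(prefix):]: child.text
    for child in xml_data
    if child.tag.startswith(prefix)
}

print(wrap.get("MDTYPE"), record)
```

The MDTYPE attribute tells a reader which schema to expect, while the namespace on each child element is what actually identifies it as Dublin Core.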
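presenter notes For comparison with wrapping, this sketch reads a <dmdSec> that points to an external record through <mdRef>. The URL, the `dmd2` ID, and the LOCTYPE value are placeholders chosen for the example; the link itself is carried in an `xlink:href` attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical referenced record; example.org is a placeholder host.
METS_XML = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:dmdSec ID="dmd2">
    <mets:mdRef LOCTYPE="URL" MDTYPE="EAD"
                xlink:href="https://example.org/findingaids/example.xml"/>
  </mets:dmdSec>
</mets:mets>
"""

NS = {"mets": "http://www.loc.gov/METS/"}
# Namespaced attributes are addressed as "{namespace-uri}localname".
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

root = ET.fromstring(METS_XML)
ref = root.find("mets:dmdSec/mets:mdRef", NS)

md_type = ref.get("MDTYPE")
location = ref.get(XLINK_HREF)
has_inline_metadata = len(ref) > 0  # an <mdRef> carries no child elements here

print(md_type, location, has_inline_metadata)
```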
presenter notes The Administrative Section <amdSec> contains information pertaining to the files that make up the digital objects described by the METS file. It has four subsections: <techMD> (Technical Metadata), <digiprovMD> (Digital Provenance Metadata), <sourceMD> (Source Metadata), and <rightsMD> (Rights Metadata). We are going to look at two specific subsections of <mets:amdSec>: Technical Metadata and Digital Provenance Metadata. We will focus mainly on Digital Provenance Metadata, particularly with regard to PREMIS.
presenter notes Here we start with <amdSec>, followed immediately by the <techMD> tag. <techMD> contains technical metadata pertaining to the technical characteristics of the digital object or objects described by the METS file. In this example, let’s pretend we are dealing with a book (Alice in Wonderland) that was scanned by a photography technician at NYU. The technical metadata uses a different metadata schema/namespace, the National Information Standards Organization (NISO) Data Dictionary for Technical Metadata for Digital Still Images. We declare this schema within the <mdWrap> tag (i.e. MDTYPE="NISOIMG" and LABEL="NISO Img. Data") in the same way we did when wrapping the Dublin Core record earlier within the descriptive metadata section. What this means is you can use <mdWrap> throughout the METS file: it is not specific to any particular section. Notice that each technical metadata detail is preceded by “<niso:”, which disambiguates it from the METS namespace used throughout the rest of the file. https://www.niso.org/publications/ansiniso-z3987-2006-r2017-data-dictionary-technical-metadata-digital-still-images
presenter notes The <mets:fileSec> section lists all files containing content which comprise the electronic versions of the digital object. This is where METS starts to become interesting (and maybe even a little bit fun)
presenter notes Take, for example, a digitized book like Millions of Cats, which you can browse on the HathiTrust website: https://babel.hathitrust.org/cgi/pt?id=uc1.c030214385&seq=1 In some systems, this is considered a single digital resource (i.e. a book, a video, a sound recording). However, a single digital resource can comprise multiple related files. For example, Millions of Cats is composed of 42 page scans, one for each side of each page, along with the cover, inner cover, back inner cover, and back cover. Along with full-resolution scans of each page, there might also be derivative files, such as thumbnail previews. We can use the METS file, specifically the File Section element <fileSec>, to express this relationship between a digital resource and its component and derivative files.
presenter notes In the example, we start with the <mets:fileSec> tag. Beneath this is the sub-element <mets:fileGrp> or “file group”, which we have given the nickname “page_images”. Beneath <mets:fileGrp>, we have listed two files, one for Page 1 and another for Page 2. Each file is contained within a <mets:file> tag. Within each <mets:file> tag, we have applied the ID attribute (which we will use later on), so the first file has an ID of page1. We then declare the type of file (JPEG) and use the <FLocat> tag (which stands for “file location”) to say where the image file is located and what its filename is. We repeat the <mets:file> tag for every file in this group.
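presenter notes The file section described above can be read into a simple lookup table. This is a sketch under a few assumptions: the “page_images” nickname is carried in the ID attribute of <mets:fileGrp>, the file type is expressed with MIMETYPE, and the example.org URLs are placeholders.

```python
import xml.etree.ElementTree as ET

# Approximation of the slide's file section; URLs are placeholders.
METS_XML = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp ID="page_images">
      <mets:file ID="page1" MIMETYPE="image/jpeg">
        <mets:FLocat LOCTYPE="URL"
                     xlink:href="https://example.org/images/page1.jpg"/>
      </mets:file>
      <mets:file ID="page2" MIMETYPE="image/jpeg">
        <mets:FLocat LOCTYPE="URL"
                     xlink:href="https://example.org/images/page2.jpg"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

root = ET.fromstring(METS_XML)
group = root.find("mets:fileSec/mets:fileGrp", NS)

# Build a table of file ID -> (MIME type, location).
files = {
    f.get("ID"): (f.get("MIMETYPE"), f.find("mets:FLocat", NS).get(XLINK_HREF))
    for f in group.findall("mets:file", NS)
}

print(group.get("ID"))
for file_id, (mime, href) in files.items():
    print(file_id, mime, href)
```

The file IDs collected here are what other sections of the METS file use to refer back to individual files.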
presenter notes This is where things get really interesting in METS! The Structure Map <mets:structMap> section refers back to the files listed within the File Section <mets:fileSec>, and outlines how these files are structured in a digital library. This information is used by things like the digital library front-end, so that the end-user is presented with the files in a way that they can logically browse. Though not always the case, it often is set up to mimic the experience of paging through a physical book (i.e. front to back).
presenter notes In this example, we are continuing our example of a book that is composed of pages. Here, we use a series of nested <div> tags (“div” stands for “division”). The first <div> tag declares that we have a book, so we use the ID attribute to give it the nickname “book”. Nested beneath the book are <div> tags for each page. Beneath the page1 <div>, we use the <fptr> or “file pointer” tag to reference the ID of the file listed in the <fileSec>. This repeats for the next page, Page 2 (nickname “page2” in the <fileSec>). In addition, each page has a thumbnail counterpart, which is specified using an <smLink> or “structural map link” tag. Here, we are saying that for page 1, we have a related thumbnail image that shares the same part of the structural map hierarchy as the high-resolution image. We then point to the location of the thumbnail using a URL. Overall, the <smLink> element provides a way to associate related material with specific parts of a digital object, making it easier to manage and organize metadata related to the object. This can be particularly useful in complex digital collections with many components and related records.
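presenter notes To show how a front-end might use the structure map, this sketch follows each page's <fptr> back to the file section and returns the image locations in reading order. The XML is an approximation of the slides: the TYPE and ORDER attributes, the `div_page1` style IDs, and the example.org URLs are assumptions, and the thumbnail links are left out to keep the example short.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Page 2 is deliberately listed first to show that ORDER drives the sequence.
METS_XML = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp ID="page_images">
      <mets:file ID="page1" MIMETYPE="image/jpeg">
        <mets:FLocat LOCTYPE="URL"
                     xlink:href="https://example.org/images/page1.jpg"/>
      </mets:file>
      <mets:file ID="page2" MIMETYPE="image/jpeg">
        <mets:FLocat LOCTYPE="URL"
                     xlink:href="https://example.org/images/page2.jpg"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="physical">
    <mets:div ID="book" TYPE="book">
      <mets:div ID="div_page2" TYPE="page" ORDER="2">
        <mets:fptr FILEID="page2"/>
      </mets:div>
      <mets:div ID="div_page1" TYPE="page" ORDER="1">
        <mets:fptr FILEID="page1"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""

root = ET.fromstring(METS_XML)

# Index the file section: file ID -> location.
locations = {
    f.get("ID"): f.find("mets:FLocat", NS).get(XLINK_HREF)
    for f in root.findall("mets:fileSec/mets:fileGrp/mets:file", NS)
}

# Sort the page divisions by ORDER, then follow each file pointer.
pages = root.findall("mets:structMap/mets:div/mets:div", NS)
reading_order = [
    locations[page.find("mets:fptr", NS).get("FILEID")]
    for page in sorted(pages, key=lambda div: int(div.get("ORDER")))
]

for href in reading_order:
    print(href)
```

The structure map holds no file locations of its own; it only points at IDs, which is why the file section has to be indexed first.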
presenter notes In this example, the <behavior> element is used to create a behavior called "playAudio". The behavior is defined using two sub-elements: <interfaceDef>, which defines the user interface for the behavior, and <mechanism>, which defines the actual code that implements the behavior. The code itself basically calls up an audio player that can be played or paused.
presenter notes PREMIS has 5 main elements: Objects, Environments, Agents, Events, and Rights.
presenter notes Here, we are describing an object’s characteristics in terms of a unique identifier, type (file), an MD5 checksum, and file size (2.5 gigabytes). The <premis:environments> element contains two <premis:environment> sub-elements, one for software type and one for note. The software environment is identified as Archivematica. You could also use this section to declare what hardware you used to run Archivematica, such as a computer model or type. This information can be useful for understanding the technical context in which the digital object was created, accessed, or modified.
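presenter notes The checksum and size recorded for an object are values a tool computes from the file's bytes. As a rough sketch, this is how they could be produced with Python's standard `hashlib` module. The function name `object_characteristics` is hypothetical, and the dictionary keys borrow PREMIS semantic unit names only as labels; this does not produce PREMIS XML.

```python
import hashlib


def object_characteristics(data: bytes) -> dict:
    """Compute fixity and size values for a byte string (hypothetical helper)."""
    return {
        "messageDigestAlgorithm": "MD5",
        "messageDigest": hashlib.md5(data).hexdigest(),
        "size": len(data),  # size in bytes
    }


# Stand-in content; a real workflow would read the bytes from the file itself.
sample = b"hello"
info = object_characteristics(sample)

print(info["messageDigestAlgorithm"], info["messageDigest"], info["size"])
```

Recomputing the digest later and comparing it with the stored value is how a repository checks that a file has not changed.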
presenter notes In this example, the metadata describes a digital object. The <rights> element contains a <rightsStatement> sub-element that describes the intellectual property rights associated with the digital object. The rights statement is identified by a URI, which in this case is a Creative Commons Attribution-NonCommercial 4.0 International license. The basis for the rights is identified as a license, and the specific rights granted are listed as Attribution-NonCommercial 4.0 International. The rights statement also includes start and end dates for the license.
presenter notes In this example, we have used the <event> element to describe a distinct event that has happened to a digital object. In this case, a person–Mary Kidd–performed a file format identification step using the fido tool (https://openpreservation.org/tools/fido/). Here, we have given the event a unique ID, and know when it happened.
presenter notes This is an example of PREMIS metadata contained, or wrapped, within a METS metadata file. The PREMIS metadata is contained specifically within the <mets:digiprovMD> element, which stands for “Digital Provenance Metadata”. This is followed by a <mets:mdWrap> element that declares the type of the wrapped metadata using the attribute MDTYPE (metadata type) = "PREMIS:EVENT". So here, we are not only declaring that we are using PREMIS, we are also saying that we are specifically using its event element. After declaring the <mets:xmlData> element, our PREMIS chunk begins. First, we use the xmlns (XML namespace) attribute of <premis:event> to point to the PREMIS standard, using a URL. We then assign a UUID to the event. Next, we declare the event type (“format identification”), declare when the event happened (using a timestamp), and then say what program we used to do the format identification (in this case, we used Siegfried, which is a popular format identification tool).
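presenter notes The wrapping described above can also be generated programmatically. This is a minimal sketch using Python's `xml.etree.ElementTree`, assuming the PREMIS version 3 namespace and element names; `build_event`, the `digiprov1` ID, and the detail text are hypothetical, and the output is not validated against either schema.

```python
import uuid
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

METS_NS = "http://www.loc.gov/METS/"
PREMIS_NS = "http://www.loc.gov/premis/v3"  # assumes PREMIS version 3

# Ask ElementTree to serialize with readable prefixes instead of ns0, ns1.
ET.register_namespace("mets", METS_NS)
ET.register_namespace("premis", PREMIS_NS)


def q(ns: str, local: str) -> str:
    """Build an ElementTree qualified name of the form '{uri}local'."""
    return f"{{{ns}}}{local}"


def build_event(event_type: str, detail: str, when: datetime) -> ET.Element:
    """Wrap one PREMIS event in a METS <digiprovMD> (hypothetical helper)."""
    digiprov = ET.Element(q(METS_NS, "digiprovMD"), {"ID": "digiprov1"})
    wrap = ET.SubElement(digiprov, q(METS_NS, "mdWrap"), {"MDTYPE": "PREMIS:EVENT"})
    xml_data = ET.SubElement(wrap, q(METS_NS, "xmlData"))

    event = ET.SubElement(xml_data, q(PREMIS_NS, "event"))
    ident = ET.SubElement(event, q(PREMIS_NS, "eventIdentifier"))
    ET.SubElement(ident, q(PREMIS_NS, "eventIdentifierType")).text = "UUID"
    ET.SubElement(ident, q(PREMIS_NS, "eventIdentifierValue")).text = str(uuid.uuid4())
    ET.SubElement(event, q(PREMIS_NS, "eventType")).text = event_type
    ET.SubElement(event, q(PREMIS_NS, "eventDateTime")).text = when.isoformat()
    info = ET.SubElement(event, q(PREMIS_NS, "eventDetailInformation"))
    ET.SubElement(info, q(PREMIS_NS, "eventDetail")).text = detail
    return digiprov


section = build_event(
    "format identification",
    "program=Siegfried",  # placeholder detail text
    datetime(2024, 1, 15, 10, 0, tzinfo=timezone.utc),
)
print(ET.tostring(section, encoding="unicode"))
```

Because the namespaces are registered up front, the serialized output declares `xmlns:mets` and `xmlns:premis` and uses those prefixes on every element.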