Week 1

Introduction to Digital Archives

Today’s Agenda

  • Settle in/reminders/announcements (15 minutes)
  • Introductions (10 minutes)
  • Syllabus review (20 minutes)
  • Lecture: Introduction to Digital Archives (60 minutes)
  • Break (10 minutes)
  • Start weekly activity (35 minutes)

Announcements

Feel free to email me with any announcements you would like me to boost (upcoming conferences, webinars, trainings, or other events/topics of interest). Alternatively, you can post them in the Discussions section of Brightspace.

Ground Rules

This class is intended to be a welcoming and productive space. All questions, including repeat questions and questions with obvious answers, are welcome and encouraged. Repetition = learning.

About Me

My name is Mary Kidd (she/her). You can call me Mary in class, over email, or anywhere else.

I work at Yale University in the Library Information Technology (LIT) Department. My role: Technical Lead for Archival Systems.

My email: mary.kidd@nyu.edu

Introduce yourself

  • Your name
  • What program are you in and how far along are you?
  • What do you hope to learn in this course?

Syllabus Review

https://digital-archives.github.io/HISTGA1011/

Introduction to Digital Archives

Question (with no wrong answers)

What does the term "digital archives" bring to mind for you?

Where have you encountered a digital archive?

~Digital archives/archiving can have many meanings~

Definition

Digital Archiving

Digital archiving can refer to the specific archival processing practice of accessioning, appraising, arranging, and describing born-digital archival materials.

The people who do this work are sometimes referred to as digital archivists (but sometimes they are also called just "archivists" or something else entirely).

Definition

Digital Archive

A digital archive or digital repository can refer to an archive consisting partly or entirely of born-digital materials or digital surrogates (digital representations of "born-physical" materials). Digital archives are often presented on online platforms called digital libraries or digital collections.

Definition

Digital Preservation

Digital preservation, sometimes also called digital curation, refers to the activities involved in selecting, establishing, maintaining, and making accessible the contents of a digital repository or archive.

The common thread through all of these terms and definitions is digital.

“As archivists, if we are going to be able to take care of digital collections into the future, we must understand that the basic building blocks of… digital files are… bits and bytes. To know files, we must know how they are constructed… And from this knowledge, we will be better equipped to design preservation strategies for our digital collections.”

Bertram Lyons, The Digital Archives Handbook (2019)

Definition

Data Object

A data object is anything that encodes and/or decodes information in binary format.

Examples: a computer file, a software application, a hard disk, a floppy disk, a thumb drive, a flip phone

Data Object: Relationships & Layers

A Data Object requires → Data Objects
↓ requires
Specialized knowledge (documentation, training, users)
↓ maintained by
Accessibility, preservation, stewardship

If these relationships are not maintained, the result is obsolescence, inaccessibility, data loss, meaning loss.

Data Object Example

Data Objects: DOC file stored on 3.5-inch floppy disk → Floppy disk drive and USB cable → Requires Microsoft Word
↓ Requires
Specialized Knowledge: General knowledge of word processing software

Digital Information Encoding

Question

Can you think of an example from your life or work where you used or encountered a standardized sequence of numbers to represent something?

Examples: A US ZIP code represents geographic areas using 5 digits; a barcode represents the SKU of a product

Graphical User Interface (what we see on screen)
Programming language
Assembly
Machine Language
Binary
Electricity/circuitry

Definition

Binary

Binary is a counting system that uses two binary digits (1 and 0), also known as "bits", together with place values to represent numeric values.

Most computers use binary to encode information.

Question

Why do you suppose computers best use a binary/base-2 system for encoding, storing and reading information?

Caption: NMOS Hybrid Integrated Circuit.

Answer: Computers use electricity, and electrical signals naturally fall into two states: ON (electricity is flowing) or OFF (electricity is not flowing). This makes a binary/base-2 system perfect, because it can easily represent these two states.

Base-10

0 1 2 3 4 5 6 7 8 9

The number system we use every day is known as "Base-10" because it uses 10 digits (0-9) and place values to represent numeric values.

12

Digit 1 2
Place Index 1 0
Weight 10¹ 10⁰

Each digit has a place index starting from 0 and a weight calculated by raising 10 to the power of the place index. To calculate the value, multiply each digit by its weight, then add all of the products together.

So:
2 × 10⁰ + 1 × 10¹ = 2 + 10 = 12

Digit 6 4 7 8 3 4 1
Weight 10⁶ 10⁵ 10⁴ 10³ 10² 10¹ 10⁰

6 → millions → 6 × 10⁶ = 6,000,000
4 → hundred-thousands → 4 × 10⁵ = 400,000
7 → ten-thousands → 7 × 10⁴ = 70,000
8 → thousands → 8 × 10³ = 8,000
3 → hundreds → 3 × 10² = 300
4 → tens → 4 × 10¹ = 40
1 → ones → 1 × 10⁰ = 1
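The place-value breakdown above can be sketched in a few lines of Python. This is only an illustration: the function name `place_value_terms` is made up for this example, and it assumes a non-negative whole number written in base-10.

```python
def place_value_terms(number: int) -> list[tuple[int, int, int]]:
    """Return (digit, place_index, digit * 10**place_index) for each digit, left to right.

    Hypothetical helper for illustration; assumes a non-negative base-10 integer.
    """
    digits = [int(character) for character in str(number)]
    terms = []
    for position, digit in enumerate(digits):
        # The right-most digit has place index 0, the next one 1, and so on.
        place_index = len(digits) - 1 - position
        terms.append((digit, place_index, digit * 10 ** place_index))
    return terms


terms = place_value_terms(6478341)
for digit, place_index, value in terms:
    print(f"{digit} x 10^{place_index} = {value:,}")

# Adding every product back together recovers the original number.
print("Sum:", f"{sum(value for _, _, value in terms):,}")
```

Running it with 12 instead of 6478341 gives the two terms from the earlier example: 1 × 10¹ = 10 and 2 × 10⁰ = 2.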

Bit 0 1
Place Index 1 0
Weight 2¹ 2⁰

Like base-10, binary uses a place index for each bit, read from right to left, starting from 0.

Each bit is multiplied by its weight (2 to the power of the bit's place index).

Definition

Byte (1/3)

A byte is a discrete-length group of bits.

Example: 0000 0111

This byte has a discrete length of 8 bits.

Definition

Byte (2/3)

Byte length determines total number of values a byte can represent.

A 1-bit byte can hold up to two values (1 and 0). A 2-bit byte can hold up to 4 (00, 11, 01, 10) values. An 8-bit byte can represent up to 256 values.
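To see where these counts come from, here is a small Python sketch that lists every possible pattern of 1s and 0s for a given byte length. The function name `bit_patterns` is an assumption made for this example, not something defined elsewhere in the course.

```python
from itertools import product


def bit_patterns(length: int) -> list[str]:
    """List every possible string of 0s and 1s of the given length (illustrative helper)."""
    return ["".join(bits) for bits in product("01", repeat=length)]


print(bit_patterns(1))       # ['0', '1']
print(bit_patterns(2))       # ['00', '01', '10', '11']
print(len(bit_patterns(8)))  # 256
```

Each extra bit doubles the number of patterns, which is why the count grows from 2 to 4 and eventually to 256.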

Definition

Byte (3/3)

To determine the maximum number of values a byte can represent, you raise the number of possible bit values (2, for 1 or 0) to the power of the byte length (8), notated as 2^8.

2 * 2 * 2 * 2 * 2 * 2 * 2 * 2 
= 
256 possible values

Question

How many possible values are there in a 16-bit system?

To determine this:

  • Take the number of possible bit values: 2
  • Take the length of byte: 16
  • Raise 2 to the power of the byte length (2^16)

Answer: 65,536 byte values

The number of possible values in a 16-bit system can be calculated by raising the number of possible bit values (2) to the power of the length of the byte (16), or "two to the power of 16" (2^16). That is:

2 * 2 * 2 * 2 *  
2 * 2 * 2 * 2 *  
2 * 2 * 2 * 2 *  
2 * 2 * 2 * 2
=
65,536
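The same calculation can be checked quickly in Python, where `**` is the "raise to the power of" operator. The function name `possible_values` is invented for this sketch.

```python
def possible_values(bit_length: int) -> int:
    """Number of distinct values a group of `bit_length` bits can represent (illustrative)."""
    return 2 ** bit_length


for bit_length in (1, 2, 8, 16):
    print(f"{bit_length}-bit: {possible_values(bit_length):,} possible values")
```

Because counting starts at 0, the largest number an 8-bit byte can hold is 255, one less than its 256 possible values.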

Console screen capture of an 8-bit Nintendo Entertainment System (NES) gaming system from the late 1980s.

Side-by-side animated GIF comparison of an 8-bit system, versus a 16-bit system. The 16-bit system shows more color, texture, and detail.

Pikachu Digital image (pikachu.jpg)
1 red pixel from Pikachu's wand
255 [red], 0 [green], 0 [blue] Pixel decimal value (3 color intensities indicated by a number between 0-255)
FF [red], 00 [green], 00 [blue] Hexadecimal value (binary value shorthand)
11111111 [red], 00000000 [green], 00000000 [blue] Binary value
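As a rough sketch of the layers in this table, the Python below takes one pixel's red, green and blue intensities and shows each one as a decimal, hexadecimal and binary value. It assumes 8 bits per color channel, as in the slide; the function name `describe_pixel` is hypothetical.

```python
def describe_pixel(red: int, green: int, blue: int) -> dict[str, tuple[int, str, str]]:
    """Map each color channel to (decimal, two-character hex, 8-bit binary).

    Illustrative helper; assumes each intensity is a whole number from 0 to 255.
    """
    rows = {}
    for name, intensity in (("red", red), ("green", green), ("blue", blue)):
        if not 0 <= intensity <= 255:
            raise ValueError(f"{name} intensity must be between 0 and 255")
        # "02X" pads to two uppercase hex characters; "08b" pads to eight bits.
        rows[name] = (intensity, format(intensity, "02X"), format(intensity, "08b"))
    return rows


for name, (decimal, hexadecimal, binary) in describe_pixel(255, 0, 0).items():
    print(f"{name:>5}: decimal {decimal:>3} | hex {hexadecimal} | binary {binary}")
```

For the pure red pixel this prints FF and 11111111 for red, and 00 and 00000000 for green and blue.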
Word OK
ASCII Characters O K

Definition

The American Standard Code for Information Interchange (ASCII)

The American Standard Code for Information Interchange (ASCII) is a character encoding standard for electronic communication. It encodes 128 specified characters into seven-bit integers.

Image charting ASCII symbols and their binary and decimal equivalents

Word OK
Characters O K
Decimals 79 75
Hexadecimals 4F 4B
Byte (Binary) 01001111 01001011
Hardware (Voltage High/Low) □ ■ □ □ ■ ■ ■ ■ □ ■ □ □ ■ □ ■ ■
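The layers in this table, down to the byte level, can be reproduced with a short Python sketch. Python's built-in `ord()` returns a character's code number, which for these letters matches the ASCII table. The function name `encode_layers` is an assumption for this example, and the hardware layer is not modeled.

```python
def encode_layers(word: str) -> list[dict[str, object]]:
    """Show each character as decimal, hexadecimal and 8-bit binary (illustrative helper)."""
    layers = []
    for character in word:
        code = ord(character)  # the character's numeric code
        if code > 127:
            raise ValueError(f"{character!r} is not an ASCII character")
        layers.append(
            {
                "character": character,
                "decimal": code,
                "hex": format(code, "02X"),
                "binary": format(code, "08b"),
            }
        )
    return layers


for layer in encode_layers("OK"):
    print(layer)
```

For "OK" this gives decimal 79 and 75, hex 4F and 4B, and bytes 01001111 and 01001011.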

Binary -> Decimal

Binary value Decimal value
0000 0000 0
0000 0001 1
0000 0010 2
0000 0011 3
0000 0100 4
0000 0101 5
0000 0110 6
0000 0111 7
0000 1000 8
0000 1001 9
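A table like this one can be generated rather than typed out by hand. The sketch below goes in the decimal-to-binary direction, using Python's built-in `format()`. The function name `to_byte_string` is made up for this example, and it assumes an 8-bit byte.

```python
def to_byte_string(value: int) -> str:
    """Write a number from 0 to 255 as an 8-bit byte in two groups of four bits (illustrative)."""
    if not 0 <= value <= 255:
        raise ValueError("an 8-bit byte can only hold values from 0 to 255")
    bits = format(value, "08b")  # pad with leading zeros to 8 bits
    return f"{bits[:4]} {bits[4:]}"


print("Binary value  Decimal value")
for value in range(10):
    print(f"{to_byte_string(value)}     {value}")
```

Changing `range(10)` to `range(256)` would print the complete list for an 8-bit system.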
Bit 0 0 0 0 0 1 1 1

This byte represents the decimal number 7.

Let's step through how we get from 0000 0111 to 7.

Bit 0 0 0 0 0 1 1 1

First question to ask: How many ones (1s) are there?

Bit 0 0 0 0 0 1 1 1

Answer: There are three 1s.

Bit 0 0 0 0 0 1 1 1
Place 7 6 5 4 3 2 1 0

Second question to ask: For each 1 we've found, what are their place values?

Bit 0 0 0 0 0 1 1 1
Place 7 6 5 4 3 2 1 0

Answer: Their place values are 0, 1 and 2.

Bit 0 0 0 0 0 1 1 1
Place 7 6 5 4 3 2 1 0
Weight 2^7 2^6 2^5 2^4 2^3 2^2 2^1 2^0

Third question to ask: For each 1 we've found, what is their weight? (Weight = 2, raised to the power of the place value)

Bit 0 0 0 0 0 1 1 1
Place 7 6 5 4 3 2 1 0
Weight 2^7 2^6 2^5 2^4 2^3 2^2 2^1 2^0
Weight (calculated) NA NA NA NA NA 4 2 1

Answer: The weights of the 1s are 4, 2 and 1.

Bit 0 0 0 0 0 1 1 1
Place 7 6 5 4 3 2 1 0
Weight 2^7 2^6 2^5 2^4 2^3 2^2 2^1 2^0
Weight (calculated) NA NA NA NA NA 4 2 1

What is the sum of all weight values added together?
4 + 2 + 1 = 7

Byte value 00000111 = Decimal number 7
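The steps we just walked through (find the 1s, find their places, calculate their weights, add the weights) can be written as a short Python sketch. The function name `binary_to_decimal` is an assumption for this example; Python's built-in `int("00000111", 2)` gives the same answer in one step.

```python
def binary_to_decimal(byte: str) -> int:
    """Convert a string of 1s and 0s to its decimal value by summing place weights.

    Illustrative helper; spaces between groups of bits are ignored.
    """
    bits = byte.replace(" ", "")
    if set(bits) - {"0", "1"}:
        raise ValueError("a binary value may only contain 1s and 0s")
    total = 0
    # Places are read from right to left, starting from place 0.
    for place, bit in enumerate(reversed(bits)):
        if bit == "1":
            weight = 2 ** place
            print(f"1 in place {place} -> weight 2^{place} = {weight}")
            total += weight
    return total


print("Decimal value:", binary_to_decimal("0000 0111"))  # Decimal value: 7
```

Trying it with 01001111 gives 79, the decimal value for the letter "O" in the ASCII table.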

Break

Animated GIF of a teapot and a steaming cup of tea.

Weekly Activity

Data Object

Start: https://digital-archives.github.io/HISTGA1011/activities/data_object.html

Animated GIF of a sun setting over still water.

Final questions or reflections?

mary.kidd@nyu.edu

presenter notes This semester’s syllabus is hosted on GitHub. GitHub is an online platform that is used to store and version information. It is also used widely in the digital archives and preservation fields. We will cover GitHub in more detail later in the semester, and see some "real life" examples of digital archiving and preservation repositories. But for now, you will be using it primarily to access the class syllabus, as well as assignments and other documents we will be using for in-class activities. Syllabus link: https://github.com/kiddmary/HIST-GA-1011

presenter notes I want to step you through basic concepts to do with what digital information is, and in particular, how it is encoded.

presenter notes Lyons, Bertram. "Digital Preservation." In The Digital Archives Handbook: A Guide to Creation, Management, and Preservation, edited by Aaron D. Purcell, 3-18. Rowman & Littlefield Publishers, 2019. Accessed September 11, 2023. http://ebookcentral.proquest.com/lib/nyulibrary-ebook/detail.action?docID=5646172, 3.

presenter notes Let's unpack this definition by thinking a bit about Data Objects we encounter through our life and work. We will return to defining Bitstreams later on.

Data Objects encapsulate various forms of digital content, such as documents, media, or software. Every Data Object, whether it is a single file or an entire application, requires specialized software, hardware, emulation, or specialized knowledge, or some combination of these, to be faithfully rendered and understood. Maintaining those dependencies is what ensures long-term accessibility and preservation.

presenter notes In this next section, we will talk about how binary digits are used by data objects to encode digital information, and how that information is in turn used by computers to render what we see. Before we do so, it's good to take a pause and think broadly about how we generally use numbers to represent things in the world. My favorite example of this is a zip code. In the United States at least, a zip code is composed of five digits that enable the postal service to quickly identify where a piece of mail is headed.

Take a second here to think through why a zip code might be more efficient than a non-numeric system for representing a place in the world: that is, writing out the specific location where a piece of mail is heading. A good example might be disambiguation between street addresses. Let's say I'm sending a letter to 11 91st Street. Though I haven't counted, there are likely many, many 11 91st Streets throughout the United States. Now, you can further clarify where the mail is going by writing out the city and state, which we commonly do when we prepare mail to be sent. However, this still might not be enough information to clarify where this mail is bound. For example, in New York City, where we live, there are several 11 91st Streets, depending on the borough: there's one in Queens and another in Brooklyn and an 11 East 91st Street in Manhattan. And, maybe this is a very New York-y thing, but I've received letters where the city and state are listed as New York, New York rather than Brooklyn: technically, both are true, since Brooklyn is a part of New York City.

This is where zip codes come in handy. They're not random groups of five digits; instead, they are structured in a way that indicates with increasing granularity where something is going. The first digit represents a certain group of U.S. states, the second and third digits together represent a region in that group (or perhaps a large city), and the fourth and fifth digits represent a group of delivery addresses within that region.

presenter notes Over these next slides, I will be talking about the binary or base-2 counting system. Why should we care about this? This is because binary is as close as you can get to the underlying physicality of any computer. In our day-to-day we are actually several layers removed from what goes on underneath our computers: we are likely only really interfacing with a GUI (graphical user interface, so the buttons, windows, words, etc. that are displayed on your screen), or programming something using a specific language. But all of this is abstracted up from what are essentially just billions of transistors and logic gates.

presenter notes Binary is an encoding scheme that, instead of using the decimal digits (0-9) we are used to using to represent information, uses binary digits (1 and 0), known more commonly as "bits". So, a 1 is a bit, and a 0 is a bit, and that's all there is in a binary system. 1 or 0. Since there are only two possible values used, binary is considered what's known as a base-2 system. Along with bits, binary also uses place values to represent information. Place values are a term we were all probably introduced to in elementary or middle school. So, let's switch gears and look at the encoding scheme we are most used to: The base-10 decimal digit system.

presenter notes https://commons.wikimedia.org/wiki/File:HP_1813-0091_top_case_removed.jpg#Summary

presenter notes If you did not know this already, the numbers that you and I are most familiar with are written in a "base-10" decimal system. The 10 in base-10 refers to the fact that it uses 10 digits (0-9) to represent numeric values.

presenter notes Therefore, when we write out a number like twelve (12), we don't have a single digit that represents 12 (if we did, we would need a system with more than ten digits). Instead, we combine a 1 and a 2 together to form a 12. The 2 is in the "ones" place, which we know to be the right-most digit, and the 1 is in the "tens" place. By combining digits and using place values, we can represent any number.

presenter notes As in our "12" example, for 6,478,341, each digit’s **place** has a **weight**, a power of 10, that we subconsciously add together. We may even insert a nice comma in there to separate chunks of 3 places to make large numbers like this easier to read.

presenter notes A byte is a discrete-length grouping of bits. In the slide, we have an example of a byte whose length is 8 bits. You can think of a byte as a container that holds a certain number of bits. Computers are built to handle specific byte lengths. Some handle 8-bit bytes, others 16 or 32.

presenter notes An 8-bit byte system means each byte contains 8 bits. Each bit represents 1 of 2 values: a 1 or 0. To calculate how many different combinations of eight 1s and 0s are possible, we raise the number 2 (standing for 2 possible values) to the power of 8 (8 total bits). From this, we get 256 possible values.

presenter notes This slide compares an 8-bit Nintendo Entertainment System to a 16-bit system side-by-side. There are more colors, shades, textures, and tones in the right-hand screen. The more values you can encode, the more colors and other visual details you can represent on-screen.

presenter notes So how do we get from bits to Mario - or in my example in the slide, an image of Pikachu? The constituent parts of an image are known as pixels, which are tiny squares of one particular color. The color of a single pixel can be encoded in what is known as the Red, Green and Blue color model, aka RGB. The RGB color model creates colors by combining various levels of red, green, and blue. Let’s pretend that the particular system we are using to render Pikachu is an 8-bit system, which means that each of the red, green and blue values is represented by a combination of eight 1s and 0s. That combination corresponds to the intensity of each color that is mixed in to create the color we see on the screen. We can express each of these 8-bit bitstreams as a decimal number ranging from 0 to 255. Each of these three values from 0 to 255 can be translated further into what are known as hexadecimal values. Hexadecimal values are written here as pairs of alphanumeric characters, and each character represents 4 bits. Since we are using an 8-bit system, each of the red, green and blue values corresponds to a 2-character hex value. Hex values can then be broken down into bits. In this case, F stands for 1111, so two Fs equal 11111111.

presenter notes Let’s shift from the raw binary representation to something more familiar—an actual word. In this case, let's use the word "OK" as an example. When you see the word "OK" on a computer screen, you’re looking at an abstraction built on several layers of encoded data. The process that brings that simple word to your screen involves multiple transformations, from human-readable characters to machine-interpretable code. In the table on the slide, the left-hand column names each of these layers, while the right-hand column shows how the computer encodes and interprets the information. We are going to "drill down" through these layers, one-by-one.

presenter notes The first layer is what you see: the letters "O" and "K." Notice that the chart labels these as "ASCII" characters (pronounced ask-key).

presenter notes Image source: https://upload.wikimedia.org/wikipedia/commons/1/1b/ASCII-Table-wide.svg

presenter notes Each letter is assigned a decimal number through a computer’s internal dictionary, also known as the ASCII table. The letter "O" corresponds to the decimal number 79, and "K" corresponds to 75.

presenter notes Then, these decimal values are often written in hexadecimal for compactness, where "O" becomes 4F and "K" becomes 4B. You can think of hexadecimals, sometimes referred to as "hex" for short, as a kind of shorthand for bytes.

presenter notes These values are converted into their binary representations: 01001111 for "O" and 01001011 for "K." At its core, computers understand and process everything in bits and bytes. In this case, each character in "OK" is made up of 8 bits, with a specific combination of 1s and 0s. These bits are then stored physically in hardware.

presenter notes If we could microscopically zoom into the physical storage—like a hard drive or memory chip—we would see that these bits are stored using electrical signals or magnetic charges. Think of each 1 and 0 as a tiny "on" or "off" switch, or a north/south magnetic direction. For example, a 1 might be represented by a magnetic field pointing in one direction, while a 0 is stored as the magnetic field pointing in the opposite direction. On a hard drive or chip, this encoding process happens for every single bit, ensuring that what you see on the screen is faithfully represented by physical signals underneath. So, whether you're reading a word, watching a video, or listening to music, it's all fundamentally encoded in binary and stored physically as on/off signals or magnetic impressions. This entire process—from the word "OK" you see on the screen down to the magnetic signals on a storage device—is how modern computing translates information into a format both humans and machines can understand.

presenter notes Here is a sample list of binary values and their corresponding decimal values in an 8-bit system. The slide shows 10 decimal values, 0 through 9, alongside their binary values. In an 8-bit system, the complete list would show 256 possible values. You may have noticed that there seems to be a pattern in the placement of 1s and 0s as the decimal values go up. Bytes are not arbitrarily assigned to decimals: there is a mathematical system behind the assignment, which corresponds to the chains of logic gates that are the physical manifestation of math (adding, subtracting, etc.). Because of this system, you can take a binary value and work out, in a few steps, the decimal value it represents.

presenter notes Each bit has its own place or position, which is mapped out on the slide. In an 8-bit system, we have 8 possible place values, starting from place 0, up to place 7. Places are read from right to left.

presenter notes What do we mean by weight? A good example comes from the base-10 decimal system we are most familiar with.

presenter notes

  • The 1 in Place 0 carries a weight of 2^0 or 1. We multiply 1 by 1 to get a Value of 1.
  • The 1 in Place 1 carries a weight of 2^1 or 2. We multiply 2 by 1 to get a Value of 2.
  • The 1 in Place 2 carries a weight of 2^2 or 4. We multiply 4 by 1 to get a Value of 4.
  • Add together all values: 4 + 2 + 1 = 7