presenter notes So why learn scripting in the first place? There are a number of reasons: some are about making work in general easier, and some are specific to digital preservation environments. Some of the reasons I have listed here are: Automate your own and others’ work: reduce repetitive clicking, mousing, and typing; reduce error. Maintain file integrity and standards. Promote scaling up without increasing workload. Allow systems to “talk” to each other.
presenter notes How do we access the command line? This is usually accomplished by opening what is known as “the terminal” or just “terminal”. A terminal is a computer program that provides a user interface for interacting with the operating system through typed commands. Most operating systems, including Windows and macOS, come with some sort of terminal program. On a Mac, it is called “Terminal” (with a capital T) and can be opened by clicking on the Finder, opening the Applications folder, then the Utilities folder, and double-clicking the icon labeled Terminal. In Windows, the terminal program is called Command Prompt. To get to it, press the Windows key on your keyboard, and either open the Run utility and type in “cmd” (the short name for Command Prompt), or just search for the Command Prompt application and open it. It’s as simple as opening up any program.
presenter notes A really good introduction to automation is through what is known as the Command Line Interface or CLI (I prefer and often refer to it as just “the command line”). The command line is a pretty powerful tool, and is used heavily by professionals in the digital archiving and preservation field. If you have a computer, you likely have a way to access the command line now. All Mac and Windows machines come with a terminal program. Meaning, there isn’t anything that you have to install to use the command line.
presenter notes This, I think, is one of the most important reasons why the command line is great: it tends not to share the fate of software applications, which often become obsolete over time. Though this is anecdotal, I have been using the command line since I was a kid (I started using computers in the late 1980s) and the command line has basically looked and operated the same way since. Another reason why the command line is great is that it can do a lot, and is not specific to one kind of task. Its non-specificity makes it universal. It is also easy to learn, in part because it has been around for so long and has generally behaved the same way over time, so it is well documented. Basically, once you get the syntax down and see a few examples, it’s fairly easy to turn around and just start using it.
presenter notes A shell is a specific command-line interpreter that takes input from the user, interprets it, and then executes the appropriate commands. There are many different types of shells, such as PowerShell (Windows) and bash (common on Linux and on older Macs). My Mac, by default, uses Z Shell or zsh, another type of shell. So what makes shells different from one another?
presenter notes Teletype Model 33, an electromechanical teleprinter from the 1960s with actual cylinders as its keys Photo credit: https://retrocomputing.stackexchange.com/questions/2697/could-you-see-what-you-are-typing-in-a-teletype Like most things in computing, the term “terminal” has its roots in the physical world. In the early days of computing, large mainframe computers were used to process and store data for businesses and organizations. A mainframe is a type of large, powerful computer designed for processing and storing large amounts of data. A user could access the mainframe by using a device known as a teletype. A teletype, also known as a teleprinter, teletypewriter, or just “tty” for short, was a device that could both send and receive messages to and from a mainframe. The user would type commands or queries using the keys which would be relayed to the mainframe, which in turn would return information that would be printed on a piece of paper. Some models could also be used to create punched tape for data storage (either from typed input or from data received from a remote source) and to read back such tape for local printing or transmission.
presenter notes Photo credit: From https://en.wikipedia.org/wiki/IBM_3270#/media/File:Informatics_General_programmer_at_terminal.jpg Eventually, TTYs were replaced with computers with electronic screens. Commands would still be entered by a human using a keyboard. The computer’s answers would “print” to a screen rather than to paper, in the form of text or graphics. These sorts of computers were referred to as “terminals”. Most terminals on modern-day computers have a similar look and feel to that of the screen you see the computer programmer using on the slide, from 1983. Very text-heavy, light typeface on a plain background, not much else going on. This can sometimes be intimidating: we are so used to what are known as graphical user interfaces, or GUIs, with graphics, colors, and shapes; in comparison, the terminal looks very stark.
presenter notes Prompt: The space following the prompt (in this case, an angle bracket, or >) is where you input commands. Location: Your location in the file directory structure (the Documents and Settings folder within the C: drive). CLI Anatomy. Command line prompt: A text-based symbol that appears in a command line interface, indicating that the interface is ready to accept commands from the user. The prompt typically appears on a new line and is followed by a cursor, indicating the position at which the user can begin typing commands. The exact appearance of the prompt varies depending on the operating system and the specific command line interface being used. For example, on the Windows operating system, the default prompt in Command Prompt ends in a greater-than symbol (>). On macOS and Linux systems, the default prompt in Terminal is typically the user's username followed by the name of the current directory and a dollar sign ($). The prompt character is usually preceded by some information about where the user is, that is, which part of the file directory commands will be executed in. In this example, when we launch the Command Prompt on a Windows machine, the user starts off by default in the Documents and Settings folder, on the drive called “C”. This can look a little bit different on a Mac.
presenter notes Prompt: Where you type in commands. Here, the prompt is the % character. Username (marykidd) at (@) computer name (Marys-MacBook-Air). Symbol that indicates which shell you are running (definition in next slide). Here, % = zsh (stands for “Z Shell”). On my Mac, my command line prompt character is a percent sign (%), preceded by a tilde (~). A tilde, in the context of a Mac CLI, refers to my home directory. So, it’s the directory that contains my Downloads, Desktop, Documents, Pictures and other folders. The home directory is where my Mac starts by default when I open up the terminal. However, let’s say you have a folder of scripts that you want to work out of all the time: you can change the default directory where you start. Prior to the tilde is a little bit of information about my computer. “marykidd” is my username. If you use a computer shared by multiple users, it’s a good idea to check that this shows the correct username. This is followed by an @ sign and then “Marys-MacBook-Air”, which is the name of my computer. You can read the prompt in a similar way to how you would read an email address. For example, my NYU email address is mary.kidd@nyu.edu. So I’m the user Mary in the NYU domain. Similarly, on my local computer, I am the user marykidd on the Marys-MacBook-Air domain. Let’s go back to the % sign here. On a Mac, the particular symbol used is actually indicative of the shell my CLI is using. In this case, my command line, by default, uses the “zsh” shell.
presenter notes In the screenshot is my Mac terminal, running zsh, where I've typed in the command "ls", which is short for “list” and means “list the contents of the directory I am in”. Once you type in a command like "ls", or any command, you execute it by pressing the [return] or [enter] key. Once I do that, the terminal will list all the folders within the directory I am in, my home directory: Applications, Downloads, Movies, Desktop, etc.
presenter notes This screenshot features another shell, specific to PCs in the 1990s, called COMMAND.COM. This shell was around when I was a kid, so I weirdly feel at home seeing this screen! I'm assuming, though, that if this is your first time seeing terminal screens, this may look a bit stark. It's okay to think that (I felt the same way when I first encountered it, too). However, this behaves in pretty much the same way my modern Mac's terminal behaves, with slight differences. Here, I typed in the “dir” command, which is short for "list directory contents". Similar to the "ls" command, entering this and pressing [return] or [enter] produces a list of folders and files in the current directory. Learning how one shell takes commands versus another is like learning a different dialect of a language. That said, commands that appear similar from one shell to another may actually behave differently. For example, notice how the ls command just shows a list of folder names, whereas the dir command shows names, dates, a file count, a folder count, etc. We could modify the ls command with additional options to output similar information. What this indicates is that different shells not only have different commands, but different command behaviors and results.
presenter notes What this mini activity shows you how to do is check which shell you are running on a Mac or Windows operating system using the "echo" command (yes, different shells/operating systems can have the same-named commands!) The "echo" command can be read as “print this [thing]”. Echo is a command used to print text to the terminal so that it can be seen by the user. $0 (on a Mac) and %COMSPEC% (on Windows) are the names of two variables. You can think of a variable as a little named drawer in a shelf that stores a little bit of information. So here, we are saying, "Please print the information stored within the drawer called $0".
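presenter notes If it helps to see the “named drawer” idea outside of a shell, here is a minimal sketch in Python, a language we will get to later in this session. It is only an illustration under a few assumptions: read_drawer is a made-up helper name, and SHELL is an environment variable that macOS and Linux usually set to your login shell, which is related to, but not exactly the same thing as, what $0 reports. COMSPEC is usually only set on Windows, so on any given machine one of the two drawers will probably be empty.

```python
import os


def read_drawer(name, drawers=os.environ):
    """Return the value stored under `name`, or a note if that drawer is empty.

    `read_drawer` is a hypothetical helper written for this example only.
    """
    return drawers.get(name, "(not set on this system)")


# COMSPEC is normally set on Windows; SHELL is normally set on macOS and Linux.
for name in ("COMSPEC", "SHELL"):
    print(f"{name} = {read_drawer(name)}")
```

Whichever line prints a real path tells you something about the command interpreter your system is set up to use.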
presenter notes So far, we have used the command line to run fairly simple commands that are native to the shells we are using. However, you should know that the command line can also be used to run command line tools or programs. A command line tool or program is a type of computer program designed to be executed through the command line. It is similar to how, when you purchase a new computer, it comes with your usual basic suite of things you can do, like use the Finder or open and use a calculator. However, you can download and install other programs, like GarageBand or Photoshop. Similarly, the command line has its own distinct tools that are used for different purposes.
presenter notes So how do you install programs in the command line? This is somewhat similar to how you would download and install a program through your Windows GUI: you go somewhere, download a file, unzip it, run the installer, check off some options, and now the program is ready to use on your computer. The easiest way to install a program into your CLI is by using what is known as a package manager. A package manager is a software tool that is used to automate the process of installing, updating, configuring, and removing software packages on a computer system. It provides a simple and efficient way to manage software dependencies and ensure that all required libraries and components are installed and working correctly. Different operating systems, and even different programming languages, have different package managers. For example, a popular package manager for the Mac operating system CLI is known as Homebrew. For Windows, it’s Chocolatey. Neither comes preinstalled, and there are many other options, but these are some of the most common. Installing a program using a package manager makes installation a cinch, and is quicker than going to a website, clicking around for the right version, etc. All you need to do is type in a simple command: the name of the package manager, the word “install”, and then the name of the program you want to install. The package manager will take care of all the rest.
presenter notes This is the basic syntax of how to run a command in the rsync program. Rsync commands are all, more or less, structured like this. An important thing to know is that rsync commands, and really any commands, are always written in a specific order, with specific words/terms. This sequence or order is referred to as “syntax”.
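presenter notes To make the idea of “a specific order” concrete, here is a small Python sketch that assembles an rsync command piece by piece: the program name, then the options, then the source, then the destination. It only builds and prints the command; it does not run rsync. The folder names and the build_rsync_command helper are made up for illustration, and the options shown (-a to preserve file metadata, -v for more detailed output, --progress to show transfer progress) are real rsync options chosen as an example, not the ones from any particular institution's workflow.

```python
import shlex


def build_rsync_command(source, destination, options=()):
    """Assemble 'rsync [options] source destination' as a list of words.

    Hypothetical helper for illustration; it only builds the command.
    """
    return ["rsync", *options, source, destination]


command = build_rsync_command(
    "accession_2023/",      # hypothetical source folder
    "/Volumes/archive/",    # hypothetical destination
    options=["-a", "-v", "--progress"],
)

# Join the words back into the single line you would type at the prompt.
print(shlex.join(command))
# prints: rsync -a -v --progress accession_2023/ /Volumes/archive/
```

Swapping the source and destination would still produce a valid-looking line, but it would copy in the opposite direction, which is why the order matters.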
presenter notes https://twobitpreservation.com/blog/2020/6/11/why-use-the-command-line-for-digital-archiving-and-preservation rsync is one of many command line tools used across the digital preservation field. The rsync program is a powerful tool that is used in a variety of digital preservation contexts. One case study where rsync came into play comes from one of the week’s assigned readings, authored by Nicole Martin, who is the Associate Director of Archives and Digital Systems at Human Rights Watch, as well as an Adjunct Professor here at NYU and former Associate Director of the Digital Preservation and Handling Complex Media courses at New York University’s Moving Image Archiving and Preservation (MIAP) graduate program.
presenter notes Let’s look at a real rsync command used by the Johns Hopkins University Archives and Manuscripts. They use the rsync command in their electronic records accessioning workflow. Let’s step through this briefly by looking at their GitHub repo: https://github.com/jhu-archives-and-manuscripts/electronic-records
presenter notes What is pseudocode? Pseudocode is a way of writing out the steps or logic of a computer program in plain, informal language that is not tied to any specific programming language syntax. It is a form of "fake" or "pretend" code that is used to plan out the structure and flow of a program before actually writing the code in a specific programming language, or that can be used to interpret some code that has already been written. There is no right way to write pseudocode, since it’s not going to be executed in any way. It’s just a way to describe what is going on in plain speak. By laying out pseudocode, you can draft or sketch what you ultimately want to write or what you understand is going on. Take a minute or two to look at the chunk of code on screen, and write out what you think it is doing. You may need to look up a couple of the command’s modifiers. I’ve included the basic rsync syntax at the top, as a reminder of how, generally, an rsync command is structured.
presenter notes “Use the rsync program to move files from the current directory containing a TAR file to the SAM. When you transfer the files, please keep each synced file’s metadata the same as the source file’s metadata. Please also show me additional information about the transfer, such as file size and transfer speed. Lastly, please show me a progress bar so I know how far along the transfer is.”
Syntax Highlighting
Explanation Table
presenter notes Notice I've included ???. I don't want you to guess this time. Instead, I'd like you to open up the ffmpeg website (https://www.ffmpeg.org/), click on the "Documentation" section, click the "Command Line Tools Documentation" > "ffmpeg" section (so you ultimately end up at https://ffmpeg.org/ffmpeg.html) and search for the meaning of the -i option.
Syntax Highlighting
Explanation Table
presenter notes https://saaers.wordpress.com/2018/07/31/small-scale-scripts-for-large-scale-analysis-python-at-the-alexander-turnbull-library/ The National Library of New Zealand case study also mentions, generally speaking, what Python can be used for in a digital preservation or archives environment. Transfer: generating a list of files on original storage media; transferring files off the original digital media to our storage servers. Appraisal: identifying duplicate files across different locations; adding file extensions so material opens in the correct software; flattening complex folder structures to support easy assessment. Technical Analysis: sorting files into groups based on file extension to isolate unknown files; extracting file signature information from unknown files. …and much more!
presenter notes Python is a high-level, interpreted programming language that was first released in 1991 by Guido van Rossum. It is designed to be easy to read and write, with a simple syntax and minimalistic approach to coding, making it a popular language for beginners and experts alike. Python has a large standard library and a vast collection of third-party libraries and modules, making it suitable for a wide range of applications, including web development, scientific computing, data analysis, artificial intelligence, machine learning, and more. Python's popularity has grown rapidly over the years, and it is now one of the most widely used programming languages in the world, with an active and vibrant community of developers and users. It is open-source, free to use, and available for various platforms, including Windows, macOS, and Linux.
presenter notes Most programming languages, once installed on your workstation, come with a standard “library” of pre-built functions. For example, when you download and install Python, its standard library includes a module called “os”. The os module provides a way for Python programs to interact with the operating system on which they are running. It provides functions for performing common tasks such as navigating directories, renaming files, and listing folders and files: the usual things that you are used to doing when interacting with your file system. You can kind of think of libraries as recipe books for different cuisines.
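presenter notes Here is a short sketch of what using the os module can look like. To keep it safe to run anywhere, it works inside a temporary folder that Python creates and then deletes; the file names and the demo_listing function are made up for this example. It creates two empty files, renames one of them, and lists what is in the folder.

```python
import os
import tempfile


def demo_listing():
    """Create, rename, and list files inside a throwaway folder."""
    with tempfile.TemporaryDirectory() as workspace:
        # Create two empty files (hypothetical names).
        for name in ("letter.txt", "photo.tif"):
            with open(os.path.join(workspace, name), "w"):
                pass

        # Rename one file, the way you might in Finder or File Explorer.
        os.rename(
            os.path.join(workspace, "letter.txt"),
            os.path.join(workspace, "letter_1987.txt"),
        )

        # List the folder's contents, sorted alphabetically.
        return sorted(os.listdir(workspace))


print(demo_listing())
# prints: ['letter_1987.txt', 'photo.tif']
```

The os.listdir call here plays the same role as the ls and dir commands we saw earlier, just from inside a Python script.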
presenter notes https://saaers.wordpress.com/2018/07/31/small-scale-scripts-for-large-scale-analysis-python-at-the-alexander-turnbull-library/
presenter notes On the slide is a screencap of a Python library called ArchivesSnake, which is something I am currently using to pull data out of NYPL’s ArchivesSpace database using Python commands and ArchivesSpace’s Application Programming Interface or API. ArchivesSnake provides a set of tools for working with archival collections and data. It was developed by the Rockefeller Archive Center and is designed to simplify the process of working with large and complex archival collections, such as those containing manuscripts, photographs, and other historical documents. This particular library does not come with Python. It has to be downloaded separately by the user. Python has a package manager known as pip3 that can make downloading these sorts of libraries easy. Similar to what we just saw with Homebrew and Chocolatey, if I wanted to install ArchivesSnake, I would use the pip3 package manager and type into my command line, pip3 install archivessnake, and voila, ArchivesSnake would be installed onto my computer in a matter of minutes.
presenter notes https://nypl.github.io/digpres/posts/data-analysis-tools This code example comes from NYPL’s Digital Preservation blog, which shows some examples of using pandas, an open-source Python library used for data manipulation and analysis. Again, here we see that the code starts by importing pandas by saying “import pandas” and then giving it the nickname “pd”. Next, we declare a variable, “df” (which stands for dataframe; you can name variables whatever you want, but this is a common name for a variable that holds tabular data). The variable’s value is the contents of a comma-separated values (CSV) spreadsheet. So using pandas, you can store the contents of a spreadsheet in a variable, and then do things with that variable. Here, we have a spreadsheet created using siegfried, which is a tool used by digital archivists to identify file formats. In the first example, we take the variable df, containing our spreadsheet, point to the column “filesize”, which contains a numeric value for each file listed, and use the sum method, which adds up those values to give the total file size across all files in the spreadsheet. This is similar to, if you’ve used Excel or Google Sheets, creating a grand total across an entire column’s worth of values. In fact, you can think of pandas as just another kind of spreadsheet analysis tool, but instead of clicking on buttons to understand a spreadsheet’s contents, you are using text. The second example groups the rows in the "df" DataFrame by their file format, calculates the average file size for each group, sorts the results in descending order based on the average file size, selects the top 10 rows from the sorted data, and finally prints out those 10 rows to the console.
Lastly, we have some code that selects a subset of rows from the "df" DataFrame where the "modified" date is earlier than the year 1990, groups the selected rows by their file format, counts the number of rows in each group, sorts the results in descending order based on the group size, and finally prints out the first 5 rows to the console. What does the output look like when you press enter and run these scripts? I will show you in just a sec…
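presenter notes If you would like to try the three steps described above without a real siegfried report, here is a minimal sketch that builds a tiny stand-in spreadsheet directly in Python. It is not the code from the NYPL post: the five rows are invented, the column names ("filesize", "format", "modified") are assumptions modeled on the description above, and the date comparison is done by converting "modified" to a date and looking at its year, which may differ from how the original example does it. It requires pandas to be installed (for example with pip3 install pandas).

```python
import pandas as pd

# A made-up, five-row stand-in for a siegfried CSV report.
df = pd.DataFrame({
    "filename": ["a.doc", "b.doc", "c.tif", "d.tif", "e.pdf"],
    "filesize": [100, 300, 5000, 7000, 250],
    "format": ["Microsoft Word", "Microsoft Word", "TIFF", "TIFF", "PDF"],
    "modified": ["1987-05-01", "1986-03-12", "1989-11-30", "2001-07-04", "1988-01-15"],
})

# 1. Grand total of the filesize column.
total_size = df["filesize"].sum()
print(total_size)

# 2. Average file size per format, largest first, top 10 rows.
mean_by_format = (
    df.groupby("format")["filesize"]
    .mean()
    .sort_values(ascending=False)
    .head(10)
)
print(mean_by_format)

# 3. Files last modified before 1990, counted per format, top 5 rows.
modified_year = pd.to_datetime(df["modified"]).dt.year
old_counts = (
    df[modified_year < 1990]
    .groupby("format")
    .size()
    .sort_values(ascending=False)
    .head(5)
)
print(old_counts)
```

With a real report, you would replace the hand-typed DataFrame with something like pd.read_csv pointed at your own CSV file; the three analysis steps would stay the same as long as the column names match.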
presenter notes https://towardsdatascience.com/visualizations-with-matplotlib-part-1-c9651008b6b8
presenter notes https://www.crummy.com/software/BeautifulSoup/