
The Interconnectedness of All Things: Understanding Digital Collections Through File Similarity

  • St John Karp
Published/Copyright: November 22, 2024

Abstract

Archives that house digital collections often struggle with rapidly evolving workflows and the intrinsic difficulties in managing disordered records. Both physical and digital records may have complex relationships with other records such as drafts of the same document or one document that is included in another, but digital records offer the possibility that a computer may analyze the collection and automatically discover such relationships. An analytical tool for digital collections would employ a model that can represent the network of relationships between files instead of the hierarchical model used in traditional archival arrangement and description. A proof-of-concept of such a tool, employing techniques such as fuzzy and perceptual hashes, demonstrates the viability of this approach and suggests avenues for future research and development.

1 Introduction

“What we are concerned with here is the fundamental interconnectedness of all things.”

Douglas Adams

Dirk Gently’s Holistic Detective Agency

In 2022 an archivist was processing an artist’s collection which included a large number of compact discs onto which the artist had saved their work. One of the archivist’s tasks was to comb through the contents of each disc in order to understand them, determine to what extent they were duplicated on other discs, and finally to arrange and describe them for use by researchers. A common tool in this process is file hashing, which uses a cryptographic algorithm to produce a unique checksum for each file. Using checksums an archivist can rapidly determine whether multiple copies of the same file are present even when those files have different names and locations on the file system. The tools to enable the archivist to compute checksums across different discs, however, were not in the archivist’s toolkit, with the result that a large degree of manual labor was still required.

Digital archival practice is such a new and rapidly changing field that the tools and workflows are still being developed. The first version of BitCurator, a toolkit for archivists working with digital collections, was only released in 2013 (BitCurator Consortium 2024). The FAIR Data Principles for the sharing and reuse of data resources were proposed in 2016 (Wilkinson et al. 2016), while in 2020 the CARE Principles built on FAIR to create an ethical framework for managing indigenous data (Carroll et al. 2020). Increasingly digital collections are coming to be regarded as data sets from which computers can derive insights at scale using techniques such as text mining, machine learning, and image/audio analysis (Padilla et al. 2023). The result has been the rise of new projects such as the International GLAM Labs Community which encourages user-centered and participatory experimentation with digital collections and data (Mahey et al. 2019, 31) and the European Commission’s “data spaces” which facilitate the pooling and sharing of data sources from the European Union (Page and Cecconi 2023).

The rapid development of digital archival theory and data use has largely outpaced the development of tools for the common archivist. The result is that digital archivists working in the trenches still have to perform a significant amount of manual labor even when the computer could help, as in the example of the archivist working with the artist’s collection. The practice of computing and comparing checksums to identify duplicates, for example, is common, and yet the practice of imaging multiple physical media, running checksums on their contents, and comparing checksums across different images is not common. The technologies are all there but the glue between them is not.

Although most people are familiar with the idea that Google, for example, can perform reverse image searches (i.e. search for images based on what they look like), such search mechanisms have yet to be applied widely in the context of archival processing. But if a computer could identify not just images that look alike but also similar audio, video, and textual documents then it could uncover connections that exist between files on one or more file systems. This level of analysis goes beyond simply identifying duplicates. It achieves a level of sophistication that more accurately reflects the way record creators manage their files. It can identify files that are included in one another, such as images embedded in a Word document, and it can identify files that are drafts or edited versions of the same work, such as cropped versions of the same image or drafts of the same document.

This paper will lay out the conceptual background of such a tool, describe its potential applications to the archival profession, and conclude with the findings of a study to build a proof-of-concept implementation.

2 Literature Review

2.1 Digital Collections

“In the beginning there was poor information.”

Stanisław Lem

A Perfect Vacuum

In an ideal world the flow of records from the departments of an institution to its archives is a formally defined process. The ideal is, however, “as rare as a feathered alligator” (Millar 2017, 124) and the practicalities are more challenging. The archives are sometimes overlooked and treated as an afterthought, with the result that the accumulation of records is haphazard. The increasing prevalence of born-digital records over the last 30 years has complicated the landscape by requiring new workflows to collect, preserve, and provide access to digital assets (Davis 2008). Despite the problem being some decades old now, archivists are still grappling with changing technologies and workflows (Rimkus et al. 2020). The result is that “with the breakdown of corporate/organisational controls on the information flows and information management patterns of individual staff members, the making and keeping of government and corporate information assets is increasingly coming to resemble the anarchic, heterogeneous, and idiosyncratic recordkeeping behaviours of private individuals” (Cunningham 2011, n.p.).

At the same time as digital archivists are struggling with the mundane and quotidian difficulties of their work, development continues apace in theory and data use in the modern world. Archivists work with the understanding that they are not just arranging and describing digital assets for use on local systems but are in fact integrating their work into a wider and networked digital landscape. The Vancouver Statement on Collections as Data lays out a set of principles for knowledge and memory workers who produce digital collections, in light of the fact that digital collections are increasingly being used as data in themselves (Padilla et al. 2023). These principles guide workers in making collections available as data for use and reuse by computational systems while at the same time being guided by ethical frameworks such as the CARE Principles. Digital archivists must work with the awareness that the collections they arrange and describe may be incorporated into larger data sets and see new kinds of usage that are not part of traditional archival workflows.

One project working with collections-as-data is GLAM Labs, which encourages user experimentation with digital collections through the organization of independent “labs.” These labs may bring together researchers, artists, communities, and businesses to foster engagement with the institution’s collections and create opportunities for future collaborations and partnerships (Mahey et al. 2019, 89–95). The extent to which such experimentation and reuse is possible, however, is predicated on how the collection has been processed and made available (Mahey et al. 2019, 100). A similar project is the European Commission’s “data spaces” which “bring together relevant data infrastructures and governance frameworks in order to facilitate data pooling and sharing” (“Commission Staff Working Document on Common European Data Spaces” 2022). The goal is not simply the creation of an organization, a technology platform, the data themselves, or the services enabled by the sharing of these data sets but rather the collective workings of all those elements (Page and Cecconi 2023).

Collections-as-data projects have advanced their own workflows for curating collections that can be reused and shared in digital environments (Alkemade et al. 2023; Candela et al. 2023). These workflows involve documentation, arrangement, the creation of machine-readable metadata, and investing in collaborative hosting platforms. While all of these are desirable, the expectation of creating better and more granular metadata only weighs more heavily on the archivist’s already limited resources.

Digital archivists must, therefore, avail themselves of all the tools at their command in order to deal with the collections they are entrusted to preserve and make accessible for use by both human researchers and machines. A minimal level of processing leaves collections isolated and difficult to use, but the more digital archivists can do to make collections accessible and reusable the more those collections can be integrated into the larger data landscape.

The International Council on Archives (ICA) published 12 principles for making, keeping, and using electronic records (“Principles and Functional Requirements for Records in Electronic Office Environments – Module 1: Overview and Statement of Principles” 2008, 9–11), of which the following five are of particular interest:

  1. Systems for capturing and managing business information have to rely on standardised metadata as an active, dynamic and integral part of the recordkeeping process.

  2. Systems have to ensure interoperability across platforms and domains and over time.

  3. Systems should rely as far as possible on open standards and technological neutrality.

  4. Systems should have the capacity for bulk import and export using open formats.

  5. As much metadata as possible should be system generated.

These principles provide a framework for the development of software for managing electronic records. Cunningham (2011) adds a key elaboration to the sixth principle: “the mapping of relationships between related records and groups of records.” By adding this elaboration, not present in the ICA’s original, Cunningham (2011) raises a tantalizing possibility, the idea that software can help determine the relationship between records automatically. Such software can only be possible, however, on two conditions. The first is that there must be an appropriate model of information that can adequately represent the way records interrelate. The second is that there must be methods of identifying the degree of similarity between files such that edited versions of the same file and files that are included in other files can be detected automatically. If these two conditions can be met and the functionality brought together, then archivists will have a valuable tool to help them understand messy or complex digital collections and enable them to process collections with a higher degree of granularity.

2.2 The Shape of Information

“Night which Pagan Theology could make the daughter of Chaos, affords no advantage to the description of order.”

Thomas Browne

The Garden of Cyrus

The ICA’s International Standard Archival Description (ISAD(G)) states that “The purpose of archival description is to identify and explain the context and content of archival material in order to promote its accessibility. This is achieved by creating accurate and appropriate representations and by organizing them in accordance with predetermined models” (“ISAD(G): General International Standard Archival Description” 2000, 7). The second sentence balances two different concepts: creating accurate and appropriate representations and organizing them in accordance with a model. The choice of model will determine how accurate and appropriate the archival description’s representations will be. Any model will place natural limitations on what can be described within the bounds of that model and an inadequate model will result in inadequate representations.

Traditional archival description employs a hierarchical model that starts with a single fonds, which is defined as “the whole body of documents, regardless of form or medium, created or accumulated by a particular individual, family, corporate body or other agency” (Millar 2017, 50). The fonds is then subdivided into series, files, and items (“ISAD(G): General International Standard Archival Description” 2000, 36) (Figure 1). The ICA does not delve into why hierarchies should be the most useful way of describing archives though it does make passing references to the fact that corporate bodies are structured in administrative hierarchies (“ISAD(G): General International Standard Archival Description” 2000, 20). Historically this has been true; in the 1930s the US National Archives settled on a hierarchical arrangement of record groups and file series in order to reflect the structure of American bureaucracy (Wiedeman 2019). Hierarchical description thus reflects the context in which records are often created and used. This context makes it easier for both archivists and users to prove a record’s authenticity and understand its content (Vafaie et al. 2021).

Figure 1: The ISAD(G) hierarchical model of archival arrangement showing the fonds, series, file, and item levels (“ISAD(G): General International Standard Archival Description” 2000, 36).

Hierarchy theory holds that hierarchies are an efficient way for complex systems to evolve (Wu 2013). Examples from the natural world include biological and ecological systems, both of which have hierarchical structures such as the evolutionary tree. A non-hierarchical complex system is simply less likely to evolve because it would be less efficient than a hierarchical system. If the world truly has a hierarchical structure then it should not be surprising that hierarchies are a natural fit for the human brain. Hierarchies are the environment in which we have evolved and which our brains have therefore adapted to navigate.

Yet the hierarchical structure of archival description is not without criticism. It has long been problematic that the concept of the fonds is nebulous and is further complicated by changes of jurisdiction in the agency creating records (Duchein 1983). Other problems that complicate the fonds include collections that are incomplete, artificial collections, unknown creators, and multiple provenance (Millar 2017, 55–59). Postcolonial scholarship has pointed out that the hierarchy privileges one way of knowing above others such as tribal narratives (Christen and Anderson 2019). The CARE Principles describe such privilege as a “globalization of Western ideas, values, and lifestyles” that has resulted in “epistemicide, the suppression and co-optation of Indigenous knowledges and data systems” (Carroll et al. 2020). The imposition of a rigid hierarchy, then, may help make information accessible but it also runs the risk of obscuring information that falls outside of the hierarchical model, potentially in highly damaging ways.

Hierarchies are not the only system of organizing information. Information technology designer and theorist Ted Nelson promotes a model of information he refers to as the “perplex”: “a tangle of items and relations; of facts, partial facts, beliefs, statements and views which can contradict each other in many different ways” (Nelson 1997, 8). According to Nelson information is fundamentally disordered and it is a mistake to oversimplify it arbitrarily: “Hierarchy is a very bad idea that comes from a naive view of ideas. Because some things are hierarchical, there are traditions of making and insisting on hierarchies at all times … Most of the things that people describe and model with hierarchies and categories are overlapping and cross-connected, and the hierarchical and categorical descriptions usually miss this. Everything is much more likely to be interconnected, overlapping, cross-connected, intertwined and intermingled (I like to say ‘intertwingled’)” (Nelson 1997, 10) (Figure 2).

Figure 2: Theodor H. Nelson (1974, DM 45) visualizes the intertwingularity of documents.

Nelson applied his model of information to digital publishing. He predicted the invention of hypertext systems and in 1966 started Project Xanadu to create a digital document platform that would implement his model of complexity in which documents can be linked, “transcluded” (included in one another), versioned, copyrighted, and navigated with the use of computers in a global network (Barnet 2015). In 1989, Tim Berners-Lee, aware of Nelson’s hypertext model, started developing what would soon become the World Wide Web (Berners-Lee and Fischetti 2000, 5, 23). To Nelson’s chagrin what Berners-Lee produced was not the complex document platform Nelson had proposed in the 1960s but a deeply broken and simplistic system (Nelson 2008, 165–66). Berners-Lee’s subsequent proposal for a Semantic Web based on the Resource Description Framework (RDF) (Berners-Lee and Fischetti 2000, 177–98) goes some way towards linking documents with semantic relationships as opposed to the links of the Web we have now which do not carry any meaning other than the location of another document.

The trend towards an RDF-based linked-data environment is reflected in the Library of Congress’s BIBFRAME project, its successor to the MARC specification for bibliographic metadata, and other organizations which have made their metadata available in RDF such as Europeana, the Bibliothèque nationale de France, and the Rijksmuseum. More recently the ICA has followed the trend towards RDF with Records in Contexts, a conceptual model (RiC-CM) and ontology (RiC-O) for describing archives using RDF: “While XML supports a specific form of graphs, the hierarchy (or ‘tree’), graph technologies enable unbounded representation of networks of interconnected data objects as well as real world objects (represented by data)” (“Records in Contexts: Conceptual Model” 2023). This move from hierarchical representations to interconnected graphs (Figure 3) is straight out of the Ted Nelson playbook and RiC-CM would therefore seem to employ a very Nelson-like model of electronic records.

Figure 3: The representation of data in a hierarchy versus triples in a graph structure (“Records in Contexts: Conceptual Model” 2023, 6).

The proposed benefits of modeling semantic relationships extend beyond internal data representations. Because different projects can use common ontologies and because, as with Berners-Lee’s Semantic Web, everything can be given a unique URI, different data sets can link to each other. These semantically linked data sets, known as Linked Open Data, “contribute to a global data integration, connecting data from diverse domains, such as people, companies, books, scientific publications, films, music, reviews, television and radio programs, medicine, statistics, online communities, and scientific data” (Sikos 2015, n.p.). Archival and library metadata are traditionally siloed, especially in the case of legacy finding aids formatted as PDF or word processor documents. Creating linked data carries with it the promise that archives and libraries can integrate with data sets on the Semantic Web such as WikiData and the Virtual International Authority File (VIAF) to enable discovery, navigation, and reuse (Figure 4).

The implementation of a graph model over a hierarchical tree carries with it a cost. Records in a hierarchy have only one kind of relationship, parent/child. RiC-O, however, models over four hundred different kinds of relationship between objects. The labor required to input all the additional data would be prohibitive for many archives, especially given the fact that archives are already processing minimally and have extensive backlogs. Whether or not an implementation of RiC-O is possible will depend on whether we have the tools to create RiC-O data without unduly burdening the archivist.

Figure 4: An example RiC-O graph from the Greater Manchester Asbestos Victims Support Group oral history project (International Council on Archives. Expert Group on Archival Description 2024).

2.3 Fuzzy and Perceptual Hashing

“I always find one thing very like another in this world.”

Agatha Christie

“The Bloodstained Pavement”

Hashing is a cryptographic operation that uses algorithms such as MD5 or SHA256 to create checksums or hashes based on a digital file’s contents. These checksums are used in digital archives to guarantee authenticity, verify fixity, and detect duplicate files (Lyons 2019). Files with identical contents, regardless of the files’ name or location, will always have the same checksum. Calculating the checksums of files in a digital collection is a valuable method for the archivist to determine whether or not any material is duplicated and can safely be deaccessioned (Emerling 2019). Digital storage is costly and saving space in each collection can help minimize the cost and environmental impact of long-term digital repositories (Pendergrass et al. 2019).
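By way of illustration, a minimal sketch of this duplicate-detection step might look like the following. It is written in Python with the standard hashlib module rather than in Eltrovo’s C#, and the directory path and function names are purely illustrative, not part of any existing tool.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: str) -> dict[str, list[Path]]:
    """Group every file under `root` by checksum; groups with more than one
    member are exact duplicates regardless of file name or location."""
    groups = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            groups[sha256_of(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}


if __name__ == "__main__":
    # Hypothetical mount point for one or more imaged discs.
    for digest, paths in find_duplicates("/mnt/disc_images").items():
        print(digest[:12], *paths, sep="\n  ")
```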

Other types of cryptographic functions can also be run on the contents of files. A technique known as fuzzy hashing can produce a special kind of hash that, when compared to that of another file, can reveal the degree of similarity between the two files (Kornblum 2006). This similarity rating can help the archivist identify files that share a significant proportion of their contents. These files may be edited versions of the same document or one file may “transclude” the contents of the other file. This technique can be used to inform the archivist about the contents of a digital collection and help them identify the relationships between digital assets.
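A hedged sketch of how such a comparison looks in practice, assuming the python-ssdeep binding to Kornblum’s ssdeep library; the file names are hypothetical.

```python
import ssdeep  # python-ssdeep binding for the ssdeep fuzzy-hashing library

# Fuzzy hashes of two files, e.g. two drafts of the same document.
hash_a = ssdeep.hash_from_file("draft1.docx")
hash_b = ssdeep.hash_from_file("draft2.docx")

# ssdeep.compare returns a similarity rating from 0 (no match) to 100 (identical).
similarity = ssdeep.compare(hash_a, hash_b)
print(f"ssdeep similarity: {similarity}")
```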

While fuzzy hashes work on the binary contents of a file, perceptual hashes operate on the visual similarity of two images (Tikhonov 2019). A perceptual hash would allow the archivist to identify one image that is a scaled, cropped, or otherwise edited version of another image. Even if no edits have been made, the binary contents of two images may be different; if the owner captured an image as an uncompressed PNG, for example, and then saved it as a compressed JPG, the binary contents of the files will be different because they have been encoded using different formats. To a human being, however, the images will look identical. Perceptual hashing allows the computer to detect automatically that the two images look the same. The same principles have been applied in digital archives to match audio content that sounds similar (Six, Bressan, and Renders 2023) and video content that looks similar (Mühling et al. 2019).
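The equivalent operation for images, sketched here with the third-party imagehash library standing in for pHash: the conversion of the bitwise Hamming distance into a 0–100 similarity rating is an assumption made for illustration, and the file names are hypothetical.

```python
from PIL import Image
import imagehash  # the imagehash library implements several perceptual hashes

# Perceptual hashes of an original image and a resaved or cropped copy.
hash_a = imagehash.phash(Image.open("cover.png"))
hash_b = imagehash.phash(Image.open("cover_cropped.jpg"))

# The difference is a Hamming distance over the (by default) 64-bit hash.
# Mapping it onto a 0-100 "similarity" rating is an assumption made here
# purely to illustrate how such a rating could be derived.
bits = hash_a.hash.size
similarity = 100 * (bits - (hash_a - hash_b)) / bits
print(f"perceptual similarity: {similarity:.0f}")
```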

2.4 Applications

The use of fuzzy hashing has been raised in archival literature. Thomas (2011) mentions fuzzy hashing as a potential tool for archival arrangement in the digital collections at the Bodleian Library. Indeed the BitCurator Environment includes both ssdeep and sdhash (BitCurator Consortium n.d.), command-line tools that use fuzzy hashing algorithms to determine the degree of similarity between files. Bjelland, Franke, and Årnes (2014) discuss an experiment based on sdhash to visualize the similarity between files in complex data sets including matching emails from the same conversation and, significantly, creating cluster visualizations of similar PDF documents. Little has been done, however, to bring together different hashing algorithms into a single tool that can dive deep into a digital collection and present the archivist with a report to assist in archival arrangement and description.

Such a tool, built according to the software guidelines from “Principles and Functional Requirements for Records in Electronic Office Environments – Module 1: Overview and Statement of Principles” (2008, 9–11), could be invaluable to an archivist attempting to gain insight into a disordered digital collection. When the archivist receives the storage media, the next step in their workflow is to create read-only file system images of those media in a format such as ISO or EWF (Expert Witness Disk Image Format). These images can then be mounted as regular file systems and explored by the archivist. Due to the fact that the archives’ resources are often stretched and it can be laborious to comb through different media, many archives are obliged to choose the level at which they will process collections (Waugh, Roke, and Farr 2016). Factors that go into that decision may include the collection’s value and restrictions but another factor will surely also be the effort required to process it. If a software tool can reduce the amount of effort required of the archivist then it may allow archives to process more collections in greater detail.

A software tool of this kind would be able to take any number of file system images and explore them automatically. It would construct a list of all the files, perhaps ignoring those in system directories and concentrating on the user’s directories, and it would run fuzzy and perceptual hashes on those files as appropriate. It may even be able to pull out embedded files, such as documents attached to emails or images embedded in documents, and run hashes on those too. It would then be able to generate a report for the archivist describing which files are similar and in what ways. This information would provide the archivist with some insight into the contents of the file systems and give them starting points for exploring and analyzing the file systems by hand. The software would not make any changes to the files or direct the archivist on what actions to take but it would empower the archivist with a deeper understanding of the file systems that would allow them to navigate and describe the collection more effectively. Armed with this information the archivist might also make processing decisions such as deaccessioning, for example in the case where one video file is found to be a lower-resolution version of another. The archivist may choose to keep both videos but storage is costly and they may choose only to keep the higher resolution video.

Other applications outside of archival processing also suggest themselves. Mühling et al. (2019) have used perceptual hashes as an information retrieval technique to find videos in historical collections of television recordings. The Internet Archive has similar collections of television recordings among a wealth of other resources. These recordings are a tangle of interrelated works waiting to be connected. A recording of a single television broadcast, for example, may contain a feature film but will also contain dozens of advertisements. Each one is a work that may appear in other television broadcast recordings, but without the metadata to document those connections the relationships between files will remain invisible. Further applications in other domains might include helping to manage digital assets for video content creators and documenting the relationships between assets in time-based media conservation. Many more examples like this are surely possible; however, the focus of this paper will remain on the use of software for automatic discovery and navigation during the processing of digital archival collections.

3 Eltrovo

Between January and April 2024 this study set about creating a proof-of-concept tool that would implement and demonstrate some of the features described above. The full suite of features was never intended to be implemented during this time but rather a subset of features in order to test some of the concepts discussed, identify problems, and suggest future directions. The tool has been codenamed “Eltrovo,” which is the Esperanto word for “finding out,” and the code has been published online under a GNU General Public License (Karp 2024).

3.1 Implementation

Eltrovo has been written in C# using the .NET framework, both of which were chosen because they are open-source and portable across platforms. Eltrovo can thus be compiled for Windows, macOS, and Linux. It is important that all platforms should be targeted and supported equally in order to meet the software guidelines laid out in “Principles and Functional Requirements for Records in Electronic Office Environments – Module 1: Overview and Statement of Principles” (2008, 9–11) and support the work of archivists without dictating what kind of systems they are required to use.

The proof-of-concept for Eltrovo aimed to implement the following features:

  1. finding files by recursing through a chosen target directory,

  2. identifying similar files using a fuzzy hashing algorithm,

  3. identifying visually similar images using a perceptual hashing algorithm,

  4. modeling archival data using RiC-CM,

  5. saving the results using RiC-O,

  6. allowing the archivist to configure the parameters of the search with a rudimentary user interface, and

  7. visualizing the results in an accessible and easily navigable way.

Eltrovo successfully implemented and demonstrated the viability of all of the goals listed with the exception of the visualization. The proof-of-concept version of the software has a basic user interface that allows the archivist to select a target directory. Eltrovo then recurses through that directory to calculate and compare fuzzy hashes of all files and perceptual hashes of images. Each pair of files is given a similarity rating between 0 and 100. In the case of the fuzzy hash used, ssdeep (Kornblum 2006), it was found that in the trial data set the optimum threshold for separating the significant matches from the false positives was approximately 50. In the case of the perceptual hash used, pHash (Krawetz 2011), the optimum threshold was approximately 70. These thresholds, however, may need to be adjustable to allow the archivist to calibrate Eltrovo for different data sets.
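To make the workflow concrete, the following sketch approximates the hashing-and-comparison loop in Python rather than in Eltrovo’s actual C# code, using python-ssdeep and imagehash as stand-ins for ssdeep and pHash and the thresholds reported above; the mapping of the perceptual-hash distance onto a 0–100 rating, the image suffix list, and the function name are assumptions made for illustration.

```python
from itertools import combinations
from pathlib import Path

import imagehash
import ssdeep
from PIL import Image

FUZZY_THRESHOLD = 50       # approximate optimum found for ssdeep on the trial data set
PERCEPTUAL_THRESHOLD = 70  # approximate optimum found for pHash on the trial data set
IMAGE_SUFFIXES = {".bmp", ".gif", ".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def scan(target: str) -> list[tuple[Path, Path, str, int]]:
    """Recurse through `target`, hash every file, and report similar pairs."""
    files = [p for p in Path(target).rglob("*") if p.is_file()]
    fuzzy = {p: ssdeep.hash_from_file(str(p)) for p in files}
    perceptual = {p: imagehash.phash(Image.open(p))
                  for p in files if p.suffix.lower() in IMAGE_SUFFIXES}

    matches = []
    for a, b in combinations(files, 2):
        score = ssdeep.compare(fuzzy[a], fuzzy[b])
        if score >= FUZZY_THRESHOLD:
            matches.append((a, b, "fuzzy", score))
        if a in perceptual and b in perceptual:
            bits = perceptual[a].hash.size  # 64 bits by default
            score = round(100 * (bits - (perceptual[a] - perceptual[b])) / bits)
            if score >= PERCEPTUAL_THRESHOLD:
                matches.append((a, b, "perceptual", score))
    return matches
```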

When Eltrovo starts it creates an empty RDF graph and populates it with standard nodes from RiC-O. These include Record, which represents the abstract concept of a file, Instantiation, which represents an actual file on the file system, and Extent, which is being used in this case for storing the file’s hashes. The reason a record is separate from an instantiation is because the file system may contain two copies of the same file. In this case the files will have their own instantiations but those instantiations will share a record.

For each file in the file system Eltrovo creates a record and an instantiation node and links the two. It also creates one or more extent nodes which are linked to the record node. If two files have the same fuzzy hash it means they are identical, and any identical files have their record nodes merged into one. If two files are found to be sufficiently similar, their record nodes are linked using hasGeneticLinkToRecordResource. Other more specific relationships are possible in RiC-O such as hasDraft but Eltrovo does not yet have the sophistication to determine whether or not two files are drafts of the same document. Such a relationship may be determined through analysis of the files’ contents or by looking for clues in the files’ names (e.g. “DocumentDraft1.docx,” “DocumentDraft2.docx,” etc.).
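A hedged sketch of this graph construction using Python’s rdflib library (Eltrovo itself is written in C#): the RiC-O namespace URI and the hasInstantiation/hasExtent property names are assumptions based on the published ontology rather than taken from Eltrovo, the fuzzyHash property is a hypothetical placeholder for however the hashes are stored on the Extent node, and the node identifiers and hash values are invented for illustration.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

# Assumed RiC-O namespace and an invented namespace for this sketch's own nodes.
RICO = Namespace("https://www.ica.org/standards/RiC/ontology#")
EX = Namespace("http://example.org/eltrovo/")

g = Graph()
g.bind("rico", RICO)


def add_file(graph: Graph, file_id: str, fuzzy_hash: str) -> URIRef:
    """Create a Record, an Instantiation, and an Extent node for one file."""
    record = EX[f"record/{file_id}"]
    instantiation = EX[f"instantiation/{file_id}"]
    extent = EX[f"extent/{file_id}"]
    graph.add((record, RDF.type, RICO.Record))
    graph.add((instantiation, RDF.type, RICO.Instantiation))
    graph.add((extent, RDF.type, RICO.Extent))
    graph.add((record, RICO.hasInstantiation, instantiation))  # assumed property name
    graph.add((record, RICO.hasExtent, extent))                # assumed property name
    graph.add((extent, EX.fuzzyHash, Literal(fuzzy_hash)))     # hypothetical placeholder
    return record


rec_a = add_file(g, "cover-png", "3:AXGBicFlgVNhBGcL...")
rec_b = add_file(g, "cover-jpg", "3:AXGBicFlgVNhBGdM...")

# Two sufficiently similar files get a genetic link between their record nodes.
g.add((rec_a, RICO.hasGeneticLinkToRecordResource, rec_b))

# Export the populated graph as RDF/XML.
g.serialize(destination="eltrovo-output.rdf", format="xml")
```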

The resulting graph (Figure 5), now populated with a model of the file system, is exported as an XML file containing RDF data. Few tools were found to be adequate for visualizing the resulting RDF data, but Graphviz (2024) is popular for these kinds of visualizations, and it was a simple matter of invoking command-line tools to convert the RDF to Graphviz’s DOT format and then generate an image or PDF of the graph. Such command-line invocations can of course be made automatically from within the Eltrovo code so that the archivist does not have to switch from the UI to the terminal.
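A minimal sketch of that conversion step, assuming the RDF/XML output and namespace from the previous sketch: it extracts the genetic links, writes a Graphviz DOT file, and invokes the dot command to render a PDF.

```python
import subprocess

from rdflib import Graph, URIRef

RICO = "https://www.ica.org/standards/RiC/ontology#"  # assumed RiC-O namespace

g = Graph()
g.parse("eltrovo-output.rdf", format="xml")

# Emit one undirected DOT edge per genetic link between records.
lines = ["graph eltrovo {"]
link = URIRef(RICO + "hasGeneticLinkToRecordResource")
for subject, obj in g.subject_objects(predicate=link):
    lines.append(f'  "{subject}" -- "{obj}";')
lines.append("}")

with open("eltrovo.dot", "w") as handle:
    handle.write("\n".join(lines))

# Graphviz's dot tool renders the DOT file as a PDF.
subprocess.run(["dot", "-Tpdf", "eltrovo.dot", "-o", "eltrovo.pdf"], check=True)
```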

Figure 5: Graph of Eltrovo’s configuration of RiC-O nodes for an image file with an identified link to another record.

The records handled by archivists may be sensitive due to factors such as the privacy of the creators, restrictions on the handling of medical information, copyright protections, and cultural respect. It is therefore vital that Eltrovo and any other such tool make no connections to remote machines or send data beyond the computer on which it is running. The data analyzed by Eltrovo and the results of its analysis remain entirely local and under the control of the archivist (Figure 6).

Figure 6: Eltrovo’s architecture and workflow in the proof-of-concept stage. The similarity algorithms used here are ssdeep (a fuzzy hash) and pHash (a perceptual hash).

3.2 Findings

The trial data set used in this study consisted of an author’s archive containing 229 files (1 GB total) which included different versions of literary works and other documents (EPUB, DOC, DOCX, PDF, ODT, MD, and TXT files) as well as images of cover artwork, some of which were included in the documents (BMP, GIF, JPG, PNG, and TIFF files). The computer used has a quad-core Intel Celeron N4120 processor (2.6 GHz) and 8 GB of RAM. Its operating system is Slackware Linux 15.0.

Eltrovo was successfully able to match scaled and cropped versions of the same image, drafts of the same document, and even images embedded in documents. This last finding was a surprise because this version of Eltrovo is not equipped with functionality for pulling out assets embedded in other files, such as an image in a Word document. The modern DOCX format is simply a ZIP archive containing XML and other assets such as embedded images, so ssdeep must have been able to match the binary contents of the image with the binary contents of the ZIP file. This will not act as a complete replacement for the feature of extracting embedded assets for separate analysis but it was a pleasant bonus.
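The following snippet illustrates why such a match is possible: the standard zipfile module can open a DOCX directly and list the embedded media whose bytes ssdeep evidently matched. The file name is hypothetical, and this is not the extraction feature itself, only a starting point for one.

```python
import zipfile

# A modern DOCX file is a ZIP archive; embedded images normally live under word/media/.
with zipfile.ZipFile("manuscript.docx") as archive:
    embedded = [name for name in archive.namelist() if name.startswith("word/media/")]
    for name in embedded:
        print(name, archive.getinfo(name).file_size, "bytes")
```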

It was found that with these resources a single-threaded process performing both the fuzzy hash and the perceptual image hash took an average of 320 s to complete the task. Real data sets may be significantly larger and therefore may require more time to process. It is also the case that future versions of Eltrovo may implement more hashing algorithms, such as audio and video perceptual hashes, which will also be expensive to compute.

While some computational cost must be incurred, it is considered vital to respect the time and resources of the archivist. Any software must run as efficiently as possible in order to make the best use of the archives’ resources, which may well be limited. The computer used to run Eltrovo is by no means a high-end machine but it may also be the case that archives do not have the choice of their computer equipment or may not prioritize purchasing the fastest computers. Cryptographic operations are by nature computationally intensive and it may sound impractical if Eltrovo were to take days or even weeks to fully analyze the contents of a digital collection. However it is also important to remember that the execution time does not come out of the archivist’s day. Even a low-end computer can be set up and left to analyze a collection while the archivist carries on with more important work. The archivist’s time is only required once Eltrovo has completed its calculations, at which point the archivist can begin to study the results.

Exactly how the archivist studies those results was Eltrovo’s primary sticking point. It was found that even running the fuzzy hash against a small data set produced a graph so large that it was impossible for a human to navigate. Running both hashes on the 229 files of the trial data set resulted in 1,168 RDF triples, a graph so large that the conversion to a PDF could not be completed. This was an important finding because how the problem is addressed will determine the ultimate usability of a tool like Eltrovo.

3.3 Future Directions

The struggle with an unwieldy amount of graph data is not solely Eltrovo’s problem. Borgman (2015, n.p.), in describing the problems Ted Nelson encountered when attempting to realize Project Xanadu, wrote, “the apparatus necessary to represent relationships between documents can be very large … The metadata required to manage, to find, and to follow relationships amongst documents is often much more voluminous than the documents themselves.”

Future experimentation must investigate alternative methods of visualizing and interacting with the data. One possibility is not attempting to visualize the entire data set at once. It does, after all, contain a lot of RiC-O nodes that are not of immediate interest to the archivist. They are an important part of modeling the contents of the file system, but the archivist is interested only in specific parts of that information. Instead the graph should be queried and the key findings isolated and output on their own. This would exclude not only the RiC-O nodes but also any nodes that represent files without connections to other files.
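One way to isolate those key findings, sketched here under the same rdflib and namespace assumptions as above, is a SPARQL query that returns only pairs of records joined by a genetic link; instantiation and extent nodes, and files with no matches, simply never appear in the results.

```python
from rdflib import Graph

g = Graph()
g.parse("eltrovo-output.rdf", format="xml")

# Select only records that were linked to another record; every other node
# (instantiations, extents, files with no matches) is never returned.
query = """
PREFIX rico: <https://www.ica.org/standards/RiC/ontology#>
SELECT ?recordA ?recordB
WHERE { ?recordA rico:hasGeneticLinkToRecordResource ?recordB . }
"""
for row in g.query(query):
    print(row.recordA, "<->", row.recordB)
```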

An alternative method of visualizing the output may be to generate a textual report that arranges the data in an accessible format such as a hierarchy. While this might at first sound like a step backwards, the hierarchy is not being used to model the data internally. The internal representation will still be an RDF “perplex,” but RDF is not intended for direct use by human beings. To our brains hierarchies are still one of the most useful ways of understanding complex data. A textual report or a simple UI might allow the archivist to drill down into groups of files that share some similarity with each other, thus grouping drafts of the same document, for example, even though they may be located in different parts of the file system.

Assessing the data quality of the graph produced by Eltrovo will be important in order to fine-tune the matching thresholds. If the thresholds are too low then the software will produce too many erroneous matches, but if the thresholds are too high it will ignore valid matches. A future study may be necessary to determine the optimum thresholds, but then it may also be the case that the optimum thresholds are different depending on the nature of the data set. An easier option may be to allow the archivist to adjust the thresholds themselves if they find they are not getting the right results.

The wealth of as-yet unimplemented features will of course need to be addressed. These include selecting multiple directories to be scanned; traversing file system images such as EWF and ISO files; running audio and video perceptual hashes; and extracting embedded files such as email attachments or images inside Word documents. Eltrovo should also run as a multi-threaded application so that operations that can be performed in parallel are run at the same time, thus making the best use of the computational resources available. The comparison of all the hashes, for example, cannot be started until all the hashes have been calculated, but the calculation of those hashes could easily be done in parallel with the same number of processes as the CPU has cores.
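A sketch of that parallelization in Python, assuming python-ssdeep and the standard concurrent.futures module: hashing runs in one worker process per CPU core, and the pairwise comparison begins only after every hash has been computed. The target path and threshold are illustrative.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations
from pathlib import Path

import ssdeep

FUZZY_THRESHOLD = 50  # illustrative threshold


def fuzzy_hash(path: Path) -> tuple[Path, str]:
    """Hash one file; runs inside a worker process."""
    return path, ssdeep.hash_from_file(str(path))


def hash_in_parallel(target: str) -> dict[Path, str]:
    """Hash every file under `target` using one worker process per CPU core."""
    files = [p for p in Path(target).rglob("*") if p.is_file()]
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        return dict(pool.map(fuzzy_hash, files))


if __name__ == "__main__":
    hashes = hash_in_parallel("/data/collection")
    # The pairwise comparison can only start once every hash has been computed.
    for a, b in combinations(hashes, 2):
        if ssdeep.compare(hashes[a], hashes[b]) >= FUZZY_THRESHOLD:
            print(a, "<->", b)
```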

Some features further down the line also suggest themselves. If the archivist could not only navigate the graph but also annotate it then the archivist would be able to correct and add to Eltrovo’s findings. The archivist could add elements of archival description to the graph and then export the data for use with other software. Eltrovo should therefore be able to convert its own internal model into other archival metadata standards such as Encoded Archival Description (EAD). Eltrovo might even be able to output whole finding aids on the basis of the descriptive elements written by the archivist.

There has been much hype and apprehension over the use of artificial intelligence (AI). This is no less true in the realm of libraries and archives, where memory workers have expressed concern that AI might subject patrons to the tyranny of algorithmic bias (Cordell 2020, 12–16). The value of human labor and understanding should in no way be underestimated. The true value of computer labor is in acting as an “intelligence amplifier” that can supplement our own intelligence in the tasks that humans cannot perform alone, just as a crane or an automobile supplements our physical abilities (Lem 2013, 90–94). AI models have shown promise in such problems as predicting protein folding structures, a process too time-consuming and with too many variables for humans to perform unassisted (Al-Janabi 2022). AI should, therefore, inform the archivist so that they can make better decisions: the decisions remain the archivist’s to make, not the computer’s. An AI component of Eltrovo might allow the computer to find similar files more effectively than hashing algorithms, but this possibility has yet to be determined through experimental results.

4 Conclusions

The resources of archives are already stretched in keeping up with new acquisitions and managing extensive backlogs with the result that many archives are forced to process collections minimally. Meanwhile the expectations for quality and granularity of metadata are rising in order to allow collections to be treated as data in the broader context of the Semantic Web and shared data projects. If the challenges of processing digital collections could be reduced, a larger number of those collections might be processed with a level of detail that enables the discovery and reuse of those collections in the broader data landscape. The current standard of archival description obliges the archivist to impose a hierarchical structure on records even though this process can be time-consuming and can be an imperfect model for the nature of the records. Hierarchies are nevertheless one of the best structures for modeling complex systems in a simple and accessible way. Computers, however, are able to manage non-hierarchical structures better than humans. With emerging standards such as Records in Contexts and with hashing techniques such as fuzzy and perceptual hashes, it may be possible for computers to determine automatically the relationships between files in a digital collection and thus enable the archivist to understand the nature of the records with less manual effort on their part. Eltrovo has demonstrated the viability of such an approach and it is hoped that continued development on Eltrovo or the development of a tool like it will one day provide archivists with a new technique for understanding digital collections.


Corresponding author: St John Karp, Horological Society of New York, 20 West 44th Street, Suite 501, 10036, New York, USA, E-mail:

Acknowledgments

The author would like to thank Amanda Garfunkel, Dianne Dietrich, Chiyong (Tali) Han, and Vicky Rampin, whose early feedback helped shape Eltrovo’s goals, and in particular Dr. Anthony Cocciolo for supervising this project.

References

Al-Janabi, A. 2022. “Has DeepMind’s AlphaFold Solved the Protein Folding Problem?” Biotechniques 72 (3): 73–6. https://doi.org/10.2144/btn-2022-0007.

Alkemade, H., S. Claeyssens, G. Colavizza, N. Freire, J. Lehmann, C. Neudecker, G. Osti, and D. van 2023. “Datasheets for Digital Cultural Heritage Datasets.” Journal of Open Humanities Data 9 (17): 1–11. https://doi.org/10.5334/johd.124.

Barnet, B. 2015. “The Importance of Ted’s Vision.” In Intertwingled: The Work and Influence of Ted Nelson, edited by D. R. Dechow, and D. C. Struppa, 59–66. New York: Springer Open. https://doi.org/10.1007/978-3-319-16925-5_9.

Berners-Lee, T., and M. Fischetti. 2000. Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by Its Inventor. New York: HarperBusiness.

BitCurator Consortium. 2024. “Release History. Bitcurator-Distro.” https://github.com/BitCurator/bitcurator-distro/wiki/Release-History#release-history (accessed June 24, 2024).

BitCurator Consortium. n.d. “Tools. BitCurator Docs.” https://bitcurator.github.io/documentation/Tools/ (accessed April 20, 2024).

Bjelland, P. C., K. Franke, and A. Årnes. 2014. “Practical Use of Approximate Hash Based Matching in Digital Investigations.” Digital Investigation 11 (Supplement 1): S18–26. https://doi.org/10.1016/j.diin.2014.03.003.

Borgman, C. L. 2015. “Data, Metadata, and Ted.” In Intertwingled: The Work and Influence of Ted Nelson, edited by D. R. Dechow, and D. C. Struppa, 67–74. New York: Springer Open. https://doi.org/10.1007/978-3-319-16925-5_10.

Candela, G., N. Gabriëls, S. Chambers, M. Dobreva, S. Ames, M. Ferriter, N. Fitzgerald, et al. 2023. “A Checklist to Publish Collections as Data in GLAM Institutions.” Global Knowledge, Memory and Communication. https://doi.org/10.1108/GKMC-06-2023-0195.

Carroll, S., I. Garba, O. Figueroa-Rodríguez, J. Holbrook, R. Lovett, S. Materechera, M. Parsons, et al. 2020. “The CARE Principles for Indigenous Data Governance.” Data Science Journal 19 (43): 1–12. https://doi.org/10.5334/dsj-2020-043.

Christen, K., and J. Anderson. 2019. “Toward Slow Archives.” Archival Science 19: 87–116. https://doi.org/10.1007/s10502-019-09307-x.

“Commission Staff Working Document on Common European Data Spaces.” 2022. SWD(2022) 45. European Commission.

Cordell, R. 2020. “Machine Learning + Libraries: A Report on the State of the Field.” Library of Congress, LC Labs. https://labs.loc.gov/static/labs/work/reports/Cordell-LOC-ML-report.pdf (accessed November 4, 2024).

Cunningham, A. 2011. “Ghosts in the Machine: Towards a Principles-Based Approach to Making and Keeping Personal Digital Records.” In I, Digital: Personal Digital Collections in the Digital Era, edited by C. A. Lee, 78–89. Chicago: Society of American Archivists.

Davis, S. E. 2008. “Electronic Records Planning in ‘Collecting’ Repositories.” The American Archivist 71 (1): 167–89. https://doi.org/10.17723/aarc.71.1.024q2020828t7332.

Duchein, M. 1983. “Theoretical Principles and Practical Problems of Respect Des Fonds in Archival Science.” Archivaria 16: 64–82.

Emerling, D. 2019. “Congressional Collections.” In The Digital Archives Handbook: A Guide to Creation, Management, and Preservation, edited by A. D. Purcell, 195–214. Lanham: Rowman & Littlefield.

Graphviz. 2024. “Graphviz.” https://graphviz.org/ (accessed November 4, 2024).

International Council on Archives. Expert Group on Archival Description. 2024. “ICA Records in Contexts-Ontology (ICA RiC-O).” May 13. https://github.com/ICA-EGAD/RiC-O/tree/v1.0.1/ontology/current-version (accessed November 4, 2024).

“ISAD(G): General International Standard Archival Description.” 2000. 2nd ed. International Council on Archives, Committee on Descriptive Standards.

Karp, St John. 2024. “Eltrovo: Discovery, Navigation, and Description for Digital Archives.” GitHub, April 7. https://stjo.hn/eltrovo-src (accessed November 4, 2024).

Kornblum, J. 2006. “Identifying Almost Identical Files Using Context Triggered Piecewise Hashing.” Digital Investigation 3 (Supplement): 91–7. https://doi.org/10.1016/j.diin.2006.06.015.

Krawetz, N. 2011. “Looks Like It.” The Hacker Factor Blog. https://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html.

Lem, Stanisław. 2013. Summa Technologiae. Translated by Joanna Zylinska. Minneapolis: University of Minnesota Press.

Lyons, B. 2019. “Digital Preservation.” In The Digital Archives Handbook: A Guide to Creation, Management, and Preservation, edited by A. D. Purcell, 45–71. Lanham: Rowman & Littlefield.

Mahey, M., A. Al-Abdulla, S. Ames, P. Bray, G. Candela, S. Chambers, C. Derven, et al. 2019. Open a GLAM Lab. Doha: International GLAM Labs Community.

Millar, Laura A. 2017. Archives: Principles and Practices, 2nd ed. London: Facet Publishing. https://doi.org/10.29085/9781783302086.

Mühling, M., M. Meister, N. Korfhage, J. Wehling, A. Hörth, R. Ewerth, and B. Freisleben. 2019. “Content-Based Video Retrieval in Historical Collections of the German Broadcasting Archive.” International Journal on Digital Libraries 20 (2): 167–83. https://doi.org/10.1007/s00799-018-0236-z.

Nelson, T. 2008. Geeks Bearing Gifts: How the Computer World Got This Way, 1.0 ed. Sausalito: Mindful Press.

Nelson, T. H. 1974. Dream Machines. Chicago: Hugo’s Book Service.

Nelson, T. Holm. 1997. The Future of Information: Ideas, Connections, and the Gods of Electronic Literature. Tokyo: ASCII Corporation.

Padilla, T., H. S. Kettler, S. Varner, and Y. Shorish. 2023. “Vancouver Statement on Collections as Data.” Zenodo. https://doi.org/10.5281/zenodo.8341519.

Page, M., and G. Cecconi. 2023. European Data Spaces and the Role of Data.europa.eu. Luxembourg: data.europa.eu; Publications Office of the European Union.

Pendergrass, K. L., W. Sampson, T. Walsh, and L. Alagna. 2019. “Toward Environmentally Sustainable Digital Preservation.” American Archivist 82 (1): 165–206. https://doi.org/10.17723/0360-9081-82.1.165.

“Principles and Functional Requirements for Records in Electronic Office Environments – Module 1: Overview and Statement of Principles.” 2008. International Council on Archives; Australasian Digital Records Initiative.

“Records in Contexts: Conceptual Model.” 2023. 1.0 ed. International Council on Archives, Expert Group on Archival Description.

Rimkus, K. R., B. Anderson, K. E. Germeck, C. C. Nielsen, C. J. Prom, and T. Popp. 2020. “Preservation and Access for Born-Digital Electronic Records: The Case for an Institutional Digital Content Format Registry.” The American Archivist 83 (2): 397–428. https://doi.org/10.17723/0360-9081-83.2.397.

Sikos, L. F. 2015. Mastering Structured Data on the Semantic Web: From HTML5 Microdata to Linked Open Data. New York: Apress. https://doi.org/10.1007/978-1-4842-1049-9.

Six, J., F. Bressan, and K. Renders. 2023. “Duplicate Detection for Digital Audio Archive Management: Two Case Studies.” In Advances in Speech and Music Technology: Computational Aspects and Applications, edited by A. Biswas, E. Wennekes, A. Wieczorkowska, and R. H. Laskar, 311–29. Cham: Springer. https://doi.org/10.1007/978-3-031-18444-4_16.

Thomas, S. 2011. “Curating the I, Digital: Experiences at the Bodleian Library.” In I, Digital: Personal Digital Collections in the Digital Era, edited by C. A. Lee, 280–305. Chicago: Society of American Archivists.

Tikhonov, A. 2019. “Preservation of Digital Images: Question of Fixity.” Heritage 2 (2): 1160–5. https://doi.org/10.3390/heritage2020075.

Vafaie, M., O. Bruns, D. Dessí, H. Sack, and N. Pilz. 2021. “Modelling Archival Hierarchies in Practice: Key Aspects and Lessons Learned.” In Proceedings of the 6th International Workshop on Computational History (HistoInformatics 2021), 2981. Aachen: CEUR Workshop Proceedings.

Waugh, D., E. R. Roke, and E. Farr. 2016. “Flexible Processing and Diverse Collections: A Tiered Approach to Delivering Born Digital Archives.” Archives and Records 37 (1): 3–19. https://doi.org/10.1080/23257962.2016.1139493.

Wiedeman, G. 2019. “The Historical Hazards of Finding Aids.” The American Archivist 82 (2): 381–420. https://doi.org/10.17723/aarc-82-02-20.

Wilkinson, Mark D., M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, et al. 2016. “The FAIR Guiding Principles for Scientific Data Management and Stewardship.” Scientific Data 3: 1–9. https://doi.org/10.1038/sdata.2016.18.

Wu, J. 2013. “Hierarchy Theory: An Overview.” In Linking Ecology and Ethics for a Changing World: Values, Philosophy, and Action, edited by R. Rozzi, S. T. A. Pickett, C. Palmer, J. J. Armesto, and J. Baird Callicott, 281–301. New York: Springer. https://doi.org/10.1007/978-94-007-7470-4_24.

Received: 2024-06-22
Accepted: 2024-09-11
Published Online: 2024-11-22
Published in Print: 2024-12-17

© 2024 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.
