Minnesota State Archives

Center for Archival Resources On Legislatures (CAROL)

Foundations: Methods of Content Acquisition

Introduction

The content you are trying to preserve or provide access to may be your own or produced by someone else. If you are not the content creator, are you acquiring the content directly or indirectly? Partnerships and collaborations often result in methods of direct acquisition; at other times, methods of indirect acquisition must be used.

Collaboration is a valuable activity for all parties. Use common goals to develop partnerships. Developing relationships with records creators early makes it easier to pass along information about appropriate formats and standards, as well as expectations for long-term preservation. This in turn makes long-term preservation and access easier.

With indirect methods of acquisition, you do not need to spend time building relationships with content providers. This may take less time, but you may receive less-than-ideal formats, the records may be incomplete, and they will probably lack useful contextual information.

 

Direct

The most straightforward means of obtaining records is usually directly from the records creator or records steward. This may be through an established, ongoing relationship or through one-time or limited contact. Often the content and frequency of transfers are defined through records retention schedules.

File transfer may take place via shared network servers or cloud storage, SFTP (SSH File Transfer Protocol, often called Secure File Transfer Protocol), Application Programming Interfaces (APIs), exchange of external hard drives or other removable media, or other agreed-upon means. Depending on the circumstances, it may be useful to use a file packaging format, such as the BagIt specification developed by the California Digital Library (University of California) and the Library of Congress, to bundle records, transmit metadata, and provide checksums or other integrity-checking information.
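To make the packaging idea concrete, here is a minimal sketch in Python of a BagIt-style package: the records are copied into a data directory, a bag declaration is written, and a manifest records a SHA-256 checksum for each file. The function names (sha256_of, make_bag, verify_bag) are hypothetical and chosen for illustration. The sketch uses only the standard library and omits optional parts of the specification such as bag-info.txt and tag manifests, so it should be read as an illustration of the idea rather than a complete or official implementation.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(source_dir: Path, bag_dir: Path) -> None:
    """Copy source_dir into bag_dir/data and write a minimal BagIt-style bag.

    Assumes bag_dir does not already contain a data directory.
    """
    data_dir = bag_dir / "data"
    shutil.copytree(source_dir, data_dir)

    # Bag declaration file expected at the top level of a bag.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "<checksum>  <relative path>" line per file.
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        relative = path.relative_to(bag_dir).as_posix()
        lines.append(f"{sha256_of(path)}  {relative}\n")
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")


def verify_bag(bag_dir: Path) -> list:
    """Return payload paths that are missing or no longer match the manifest."""
    failures = []
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, relative = line.split(maxsplit=1)
        target = bag_dir / relative
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(relative)
    return failures
```

In this sketch the sender would call make_bag before transfer and the recipient would call verify_bag on arrival; an empty list suggests that the listed files arrived unchanged.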

One of the main benefits of direct acquisition is that documentation about the records, technical information about the files, and other metadata are often included in the transfer. This information greatly aids in the management, use, and preservation of the records.

 

Indirect

If too many barriers prevent the development of a partnership, it may be necessary to use indirect methods of acquisition. Understand that indirect methods carry a greater chance of collecting incomplete or inaccurate records.

Web Harvesting: Web harvesting is the use of web crawlers to collect content from selected websites. The NDIIPP project explored web archiving with both the Internet Archive and the Web Archiving Service. Information on web archiving in general and the case studies can be found here.
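As a rough illustration of what a web crawler does, the following Python sketch starts from a seed address, saves each page it retrieves, and follows links that stay on the same host. The names LinkCollector, harvest, fetch, and max_pages are hypothetical. The fetch argument is an assumption made so the example can run without a network connection: it is any function that takes a URL and returns the page's HTML, or None if the page is unavailable. Production harvesting services also deal with robots.txt rules, rate limits, and archival storage formats, none of which is shown here.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def harvest(seed_url, fetch, max_pages=50):
    """Breadth-first crawl from seed_url, staying on the seed's host.

    Returns a dict mapping each captured URL to its HTML.
    """
    host = urlparse(seed_url).netloc
    queue = deque([seed_url])
    seen = {seed_url}
    captured = {}
    while queue and len(captured) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be retrieved; skip it
        captured[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop "#fragment" suffixes.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return captured


# A made-up three-page site held in memory, used in place of real requests.
FAKE_SITE = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="http://other.example/x">X</a>',
    "http://example.org/a.html": '<a href="b.html#top">B</a> <a href="/">Home</a>',
    "http://example.org/b.html": "<p>No links here.</p>",
}

pages = harvest("http://example.org/", FAKE_SITE.get)
```

Here pages holds the three example.org pages, and the link to the other host is never followed. Limiting the crawl to selected hosts mirrors the "selected websites" scope described above.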

Screen Scraping: Screen scraping is a technique for automating the collection of information made available via a website. Information displayed in a web browser is delivered in a format called HTML, which consists of text divided into sections by tags such as <div> or <h1>. Because web content is generally formatted in a predictable pattern, it is possible to write programs that use the tag boundaries to extract meaningful content from a web page.

Screen scrapers are written by analyzing the structure of a website's HTML and then extracting the relevant information from an intuited pattern. While this method can be very powerful, changes in page structure due to new data or a redesign make continual updates necessary. As a result, it tends to be an expensive and time-consuming technique, to be relied upon only when intentionally machine-readable information isn't available.
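A small Python sketch shows how tag boundaries can be used this way. It relies on the standard library's html.parser module and collects the text found between <h1> and </h1> tags. The class name HeadingScraper, the function scrape_headings, and the sample page (including its headings) are all invented for illustration; a real scraper would be written against the structure of the actual site.

```python
from html.parser import HTMLParser


class HeadingScraper(HTMLParser):
    """Collect the text inside every <h1> element of a page."""

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_heading = True
            self.headings.append("")  # start a new, empty heading

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_heading = False

    def handle_data(self, data):
        # Text is kept only while the parser is between <h1> and </h1>.
        if self.in_heading:
            self.headings[-1] += data


def scrape_headings(html):
    """Return the stripped text of each <h1> element in the given HTML."""
    scraper = HeadingScraper()
    scraper.feed(html)
    scraper.close()
    return [text.strip() for text in scraper.headings]


# Made-up sample page; the headings are placeholders, not real records.
SAMPLE_PAGE = """
<html><body>
  <div class="item"><h1>First Example Title</h1><p>Summary text.</p></div>
  <div class="item"><h1>Second Example Title</h1><p>More summary text.</p></div>
</body></html>
"""

titles = scrape_headings(SAMPLE_PAGE)
```

If the site were redesigned so that titles moved from <h1> to <h2> tags, this scraper would silently return an empty list, which is the kind of maintenance burden described above.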

For more information on how to make data more accessible and useful when it is collected by indirect methods, review the following reports:

 


 

February 21, 2012; links verified March 29, 2013.