welcome to theresourcedepot



This document provides an introductory overview of theresourcedepot, a transactional archiving solution hosted in the cloud.

Transactional Archive

Transactional Archiving consists of selectively capturing and storing transactions that take place between a web client (browser) and a web server.

Most existing web archives recurrently send out bots to crawl the content of web servers. This results in observations of a server's content at the time of crawling. Since the crawling frequency is generally not aligned with the change rate of a server's resources, this approach is typically not able to capture all versions of a server's resource. The resulting archive may provide an acceptable overview of a server's evolution over time, but it will not provide an accurate representation of the server's entire history.

A Transactional Archive, however, captures every version of a resource as it is being requested by a browser. The resulting archive is effectively representative of a server's entire history, although versions of resources that are never requested by a browser will also never be archived.

Existing web archives, like the Internet Archive, may benefit from Transactional Archives by importing the detailed archived content in bulk.
 

theresourcedepot

theresourcedepot is a cloud-hosted Transactional Archive solution that archives content from web servers. Currently, only Apache servers are supported. To activate transactional archiving in theresourcedepot, a filter called mod_ta is attached to an Apache Web Server. More info on Apache filters and mod_ta can be found here. From now on, this Apache Web Server is referred to as a Content Server.

The mod_ta filter submits anonymized request/response pairs served by the Content Server to theresourcedepot. There, the submitted content is deduplicated and archived in an area dedicated to the Content Server. theresourcedepot’s mod_ta filter works seamlessly with existing Apache installations and is easy to install and configure. Only content that is served to clients through the Content Server's Apache Web Server and that results from a client's HTTP GET request will be archived.

Once mod_ta is installed, every time the Content Server responds to an HTTP GET request from a client, the filter also pushes that response (along with the request headers) to theresourcedepot. There, uniqueness of a response's content is determined by computing a checksum and comparing it with earlier checksums for the same resource. If the content has been previously archived, only the request and response headers are stored. theresourcedepot stores requests and responses for the HTTP GET method only. Resources that are transmitted to the client over HTTPS are ignored by the filter, so they are not pushed into theresourcedepot and are not archived.
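The checksum-based deduplication described above can be sketched as follows. This is only an illustration under stated assumptions: the class DedupStore, its in-memory dictionaries, and the choice of SHA-256 are hypothetical, since this document does not say which checksum or storage layout theresourcedepot actually uses.

```python
import hashlib


class DedupStore:
    """Toy in-memory stand-in for an archive area (hypothetical, not the real service)."""

    def __init__(self):
        self.bodies = {}        # (uri, checksum) -> body bytes, stored once per version
        self.transactions = []  # one record per submitted request/response pair

    def submit(self, uri, request_headers, response_headers, body):
        """Record a transaction; store the body only if this version is new.

        Returns True when the body was stored, False when it was a duplicate.
        """
        # SHA-256 is an assumption; the document only says "a checksum".
        checksum = hashlib.sha256(body).hexdigest()
        key = (uri, checksum)
        is_new = key not in self.bodies
        if is_new:
            self.bodies[key] = body
        # Headers are always kept, even when the body is already archived.
        self.transactions.append({
            "uri": uri,
            "checksum": checksum,
            "request_headers": dict(request_headers),
            "response_headers": dict(response_headers),
        })
        return is_new
```

In this sketch, submitting the same body twice for one URI stores it once but records two transactions, which mirrors the rule that only the headers are stored for previously archived content.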

theresourcedepot uses the Memento protocol for discovery and retrieval of archived content. The Content Server can advertise the TimeGates that theresourcedepot provides for its resources through HTTP Link headers and robots.txt files, so that Memento clients can easily find the archived versions.
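As a rough illustration of Link-header discovery, the sketch below builds a header value that points a Memento client at a TimeGate. The rel="timegate" relation type comes from the Memento protocol; the function name is hypothetical, example.org is a placeholder, and the exact header mod_ta emits may differ.

```python
def timegate_link_header(timegate_base, original_url):
    """Return a Link header value advertising the TimeGate for original_url.

    timegate_base is the ArchiveTimeGate value, ending in a slash.
    """
    return '<{}{}>; rel="timegate"'.format(timegate_base, original_url)


# Example with the Content Server ID used elsewhere in this document.
value = timegate_link_header(
    "http://theresourcedepot.org/020014/timegate/",
    "http://example.org/page",
)
print("Link: " + value)
```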

theresourcedepot archives content from various Content Servers. Each Content Server is given a unique ID and is archived in a separate database. To ensure that submitted content originates from a legitimate Content Server, theresourcedepot registers the Content Server's IP address and only accepts content from registered IP addresses.

The installation instructions for mod_ta are given below, followed by an explanation of the discovery services with examples. Hands-on examples of installing mod_ta.c can be found here: for Fedora Core Linux and Ubuntu Linux

mod_ta Installation Instructions

Prerequisites:
  • Apache Web Server version 2.2 or higher.
    • mod_ta requires the Apache Web Server to be the primary web server that is serving the content to be archived.
  • Apache Extension Tool (apxs) installed.
    • The Apache website has detailed instructions on how to install both the web server and the extension tool.
    • For Debian-based Linux distributions, the extension tool can be installed with apt-get install apache2-threaded-dev or apt-get install apache2-prefork-dev. For Fedora/Red Hat systems, yum install httpd-devel will install apxs.

Installing mod_ta Apache Extension:

To install the mod_ta filter that will allow activation of transactional archiving, run the following command:
sudo /usr/sbin/apxs -c -i -a mod_ta.c

This will install the mod_ta filter in your Apache Web Server. The installation is successful if the file mod_ta.so is found in the Apache modules directory.

Configure Apache:

Add the following lines to the Apache configuration file:
                                                    
<IfModule ta_module>
EnableArchiving On
ArchiveHost www.theresourcedepot.org
ArchivePort 8080
ArchivePath /<contentserver_id>/put/
ArchiveTimeGate http://theresourcedepot.org/<contentserver_id>/timegate/
EnableIP On
Excluded /search /test
</IfModule>
Configure the directives as follows:
  • EnableArchiving On/Off: toggles transactional archiving on or off.
  • ArchiveHost: the host name of the transactional archive. The value must be www.theresourcedepot.org.
  • ArchivePath: the path to the Content Server's submission interface in the transactional archive. The value must be http://theresourcedepot.org/<contentserver_id>/put/, where <contentserver_id> stands for the unique ID that is assigned to the content server. Hence, a real value looks like http://theresourcedepot.org/020014/put/.
  • ArchivePort: the port number that the mod_ta filter should use to connect with the transactional archive. The port used by www.theresourcedepot.org is 80.
  • ArchiveTimeGate: The baseURL of Memento TimeGates at www.theresourcedepot.org. The value must be http://theresourcedepot.org/<contentserver_id>/timegate/, where <contentserver_id> stands for the unique ID that is assigned to the content server. Hence, a real value looks like http://theresourcedepot.org/020014/timegate/.
  • EnableIP On/Off: enables or disables recording of the client's IP address.
  • Excluded: list of directories excluded from archiving. Optional parameter. All content of the listed directories, including their child directories, will be excluded.
Finally, restart the Apache Web Server.
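To make the effect of the Excluded directive concrete, the sketch below shows one plausible way to decide whether a request path falls under an excluded directory. It assumes matching on whole path segments, so that /search covers /search/results but not /searching; whether mod_ta matches exactly this way is not stated here, and the function name is hypothetical.

```python
def is_excluded(request_path, excluded_dirs):
    """Return True if request_path is an excluded directory or lies beneath one."""
    for directory in excluded_dirs:
        prefix = directory.rstrip("/")
        # Match the directory itself, or anything below it on a segment boundary.
        if request_path == prefix or request_path.startswith(prefix + "/"):
            return True
    return False


excluded = ["/search", "/test"]  # the values from the sample configuration
for path in ["/search/results/page2", "/test", "/searching", "/index.html"]:
    print(path, is_excluded(path, excluded))
```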

Access Interfaces

To let Memento clients access archived content in theresourcedepot, the following services are provided. Please refer to the Memento protocol for additional information.

TimeGate:

To access the TimeGate at theresourcedepot for the resource with URI <original_url> that is served by the Content Server:
curl -D headers.txt -H 'Accept-Datetime: Mon, 29 Sep 2008 12:00:04 GMT' \
        http://theresourcedepot.org/<contentserver_id>/timegate/<original_url>
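The same TimeGate request can be prepared programmatically. The sketch below only builds the URL and the Accept-Datetime header and does not send anything; the function name and example.org are hypothetical. It uses the standard library's email.utils.format_datetime, which derives the weekday name from the date and so avoids mismatched values.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def timegate_request(timegate_base, original_url, when):
    """Return (url, headers) for a TimeGate lookup at the given moment.

    `when` should be a timezone-aware datetime; it is converted to GMT.
    """
    url = timegate_base + original_url
    stamp = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return url, {"Accept-Datetime": stamp}


url, headers = timegate_request(
    "http://theresourcedepot.org/020014/timegate/",
    "http://example.org/page",
    datetime(2008, 9, 29, 12, 0, 4, tzinfo=timezone.utc),
)
print(url)
print(headers)
```

The returned pair can be passed to any HTTP client, for instance urllib.request.Request(url, headers=headers).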


Memento:

To retrieve a Memento (archived version of a resource) from theresourcedepot for the resource with URI <original_url> that is served by the Content Server:
curl -D headers.txt \
        http://theresourcedepot.org/<contentserver_id>/memento/20110311000508/<original_url>
In this URI, 20110311000508 represents the archival date/time of the resource with the URI <original_url> expressed in the form YYYYMMDDHHMMSS.
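A client that receives a Memento URI may want to recover the archival date/time and the original URL from it. The sketch below does this for the URI layout shown above. Treating the timestamp as UTC is an assumption, the function names are hypothetical, and example.org is a placeholder.

```python
import re
from datetime import datetime, timezone

MEMENTO_PATTERN = re.compile(r"/memento/(\d{14})/(.+)$")


def parse_memento_timestamp(stamp):
    """Convert a YYYYMMDDHHMMSS string to a datetime (UTC is assumed)."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def split_memento_uri(uri):
    """Return (archival datetime, original URL) for a Memento URI."""
    match = MEMENTO_PATTERN.search(uri)
    if match is None:
        raise ValueError("not a Memento URI: " + uri)
    return parse_memento_timestamp(match.group(1)), match.group(2)


when, original = split_memento_uri(
    "http://theresourcedepot.org/020014/memento/20110311000508/http://example.org/page"
)
print(when.isoformat(), original)
```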

TimeMap:

To retrieve a link-value formatted TimeMap from theresourcedepot for the resource with URI <original_url> that is served by the Content Server:
curl http://theresourcedepot.org/<contentserver_id>/timemap/link/<original_url>
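A link-value formatted TimeMap is a comma-separated list of entries of the form <uri>; rel="..."; datetime="...". Because datetime values themselves contain commas, splitting on commas is not safe. The sketch below uses regular expressions instead. It is a simplified parser that handles only quoted attribute values, and SAMPLE is made-up illustrative data, not real output from theresourcedepot.

```python
import re

# One entry: a <uri> followed by zero or more ; name="value" parameters.
ENTRY = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return a list of (uri, attributes) pairs from a link-value TimeMap."""
    entries = []
    for uri, params in ENTRY.findall(text):
        entries.append((uri, dict(ATTR.findall(params))))
    return entries


# Hypothetical sample, shaped like the URIs used in this document.
SAMPLE = (
    '<http://example.org/page>; rel="original",\n'
    '<http://theresourcedepot.org/020014/memento/20110311000508/http://example.org/page>'
    '; rel="memento"; datetime="Fri, 11 Mar 2011 00:05:08 GMT"'
)

for uri, attrs in parse_timemap(SAMPLE):
    print(attrs.get("rel"), uri)
```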

Appendix

Pushing content into theresourcedepot:

Every time the Content Server responds to an HTTP GET request for the resource with URI <original_url>, the mod_ta filter uses the following URL pattern to submit the request/response pair into theresourcedepot:
http://theresourcedepot.org/<contentserver_id>/put/<original_url>

and uses the HTTP PUT method to submit the following data related to the client's request:

<http_request_headers>
<http_response_headers>
<body>
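The submission can be pictured with the sketch below, which assembles the PUT URL and a payload in the order listed above. The exact serialization mod_ta uses is not described in this document: the "Name: value" header lines and the blank-line separators are assumptions, and the function names are hypothetical. Nothing is sent over the network.

```python
def format_headers(headers):
    """Render a header mapping as CRLF-terminated "Name: value" lines (assumed format)."""
    return "".join("{}: {}\r\n".format(name, value) for name, value in headers.items())


def build_submission(put_base, original_url, request_headers, response_headers, body):
    """Return (method, url, payload) for one request/response pair.

    put_base is the Content Server's submission URL, ending in /put/.
    """
    url = put_base + original_url
    # Assumed layout: request headers, blank line, response headers, blank line, body.
    head = format_headers(request_headers) + "\r\n" + format_headers(response_headers) + "\r\n"
    return "PUT", url, head.encode("utf-8") + body


method, url, payload = build_submission(
    "http://theresourcedepot.org/020014/put/",
    "http://example.org/page",
    {"Accept": "text/html"},
    {"Content-Type": "text/html"},
    b"<html></html>",
)
print(method, url, len(payload))
```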