The Electronic Filing Cabinet (EFC) is a repository of media objects, built on World Wide Web and other Internet protocols. Metadata supports indexing and search functions, providing flexible, content-based retrieval. Users may also categorize objects explicitly by organizing them into a folder hierarchy. Data import and export are supported through several protocols. The EFC resembles a virtual operating system on the Web.
Today's servers and browsers approximate an ideal infrastructure: independent of client and server operating systems, and of the particular web server and client browser; component-based, allowing for module substitution; and built from software that is inexpensive in our academic setting.
The EFC makes only functional demands on its supporting components, with no dependency on a specific web server, browser, or operating system. The prototype system uses the NCSA httpd Web server running under Solaris. The current server-side implementation is in Perl.
Functionality
The EFC is a web-based repository of data, generally MIME instances. Scanned images and mail messages are currently implemented, with arbitrary URLs and faxed documents to be added. Metadata accompanies each data object. Attributes are specific to datatype. For example, TITLE is an attribute of image documents, SENDER an attribute of mail messages, and DATE applies to either. Searching on metadata allows retrieval without prior time-consuming categorization. Categorization of EFC objects into folders is optional. Folders are for classification only, and do not indicate ownership.
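As a rough illustration of retrieval by metadata, the sketch below stores datatype-specific attributes with each object and searches them directly, with no folder involved. The record layout, the sample objects, and the `search` function are assumptions made for this example; a linear scan stands in for a real search engine.

```python
# Hypothetical object records: each carries attributes specific to its datatype.
OBJECTS = [
    {"id": "image.1", "datatype": "image",
     "metadata": {"TITLE": "Campus map", "DATE": "1996-03-01"}},
    {"id": "mail.1", "datatype": "mail",
     "metadata": {"SENDER": "alice@example.edu", "DATE": "1996-03-01"}},
    {"id": "mail.2", "datatype": "mail",
     "metadata": {"SENDER": "bob@example.edu", "DATE": "1996-04-15"}},
]


def search(objects, **criteria):
    """Return ids of objects whose metadata matches every criterion.

    A criterion matches when the attribute exists on the object and
    contains the given text (case-insensitive substring match).
    """
    hits = []
    for obj in objects:
        meta = obj["metadata"]
        if all(attr in meta and text.lower() in meta[attr].lower()
               for attr, text in criteria.items()):
            hits.append(obj["id"])
    return hits
```

Searching on DATE finds both images and mail messages, while searching on SENDER can only match mail messages, since image documents do not carry that attribute.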
Authentication is by user name and password, and is independent of the underlying operating system. Permissions on EFC objects specify who can access or change the object. Users are organized into groups to ease permissions maintenance. The scheme is modeled on UNIX permissions. A significant departure is that there is no notion of a current group for a user: whether or not a user can access or change an object is a function of all groups to which the user belongs.
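The group rule can be sketched as a set intersection between the groups a user belongs to and the groups an object permits. The record fields, the group table, and the assumption that an owner can always reach their own object are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectPermissions:
    """Hypothetical permission record for one EFC object."""
    owner: str
    readers: set = field(default_factory=set)        # individual users allowed to read
    reader_groups: set = field(default_factory=set)  # groups allowed to read
    writers: set = field(default_factory=set)        # individual users allowed to change
    writer_groups: set = field(default_factory=set)  # groups allowed to change


# Hypothetical group membership table: group name -> set of user names.
GROUPS = {
    "faculty": {"alice", "bob"},
    "library": {"bob", "carol"},
}


def groups_of(user):
    """Return every group the user belongs to; there is no 'current' group."""
    return {name for name, members in GROUPS.items() if user in members}


def may(user, perms, action):
    """Check 'read' or 'write' against the user and all of the user's groups."""
    if action == "read":
        users, groups = perms.readers, perms.reader_groups
    elif action == "write":
        users, groups = perms.writers, perms.writer_groups
    else:
        raise ValueError(f"unknown action: {action}")
    # Assumption: the owner always has access to their own object.
    if user == perms.owner or user in users:
        return True
    # Any overlap between the user's groups and the permitted groups suffices.
    return bool(groups_of(user) & groups)
```

A user in both groups gains the union of what each group allows, without first having to switch groups.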
Import into the EFC is supported over various protocols. NFS is used presently to import scanned image documents and OCR text. Mail provides the transport mechanism to import mail messages. Users have inboxes, allowing them to screen incoming objects and to apply permissions and metadata. Export of objects is through email notification of the URL through which the object is normally available.
The user interface is web-based. It allows searching on the metadata of the entire data store, viewing and editing metadata, and viewing objects themselves. Users can browse among their folder links, create new folders, and rearrange folder contents. Only the owner and system administrator can edit object permissions. Security parameters for users and groups are edited on a separate administration page, which only the system administrator can access.
Anticipated uses include searchable storage of written and electronic correspondence, on-line art image and journal article repositories, and fax distribution.
Design
The system is designed for flexibility: to add new datatypes and import protocols; to port to other server operating systems and web servers; and to replace components, such as the search engine or user interface, independently.
Objects are identified uniquely based on timestamp and datatype. Storage is relatively "flat", with additional structure contemplated for efficiency. Storage design depends on the underlying operating system only in requiring the now-ubiquitous hierarchical directory/file model.
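One way to build such an identifier is sketched below. The exact format, the separator, and the trailing sequence number (added here so that two objects created in the same second still differ) are assumptions for illustration, not the EFC's actual scheme.

```python
import itertools
import time

# Hypothetical tie-breaker for objects created within the same second.
_sequence = itertools.count()


def make_object_id(datatype, now=None):
    """Build an identifier from the datatype and a UTC timestamp.

    `now` is seconds since the epoch; None means the current time.
    """
    stamp = time.strftime("%Y%m%d%H%M%S", time.gmtime(now))
    return f"{datatype}.{stamp}.{next(_sequence):04d}"
```

With flat storage, an identifier of this kind can double as the name of the object's directory directly under the datastore root.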
Metadata, folder contents, and permissions are stored as SGML. Metadata elements are based on a modified and expanded Dublin Core. There is a single SGML Document Type Definition (DTD) per document type. The search engine (Harvest) indexes objects through their metadata, guided by the DTD for each object's datatype. Image documents have a folder-like structure, with each page analogous to a folder item.
Data and code are kept separate and coupled flexibly. This supports the current HTTP/HTML implementation, and is realized by symbolic links from the EFC directory to the code and other subdirectories of the current code version. The benefits are several: multiple EFCs can reside on one server; multiple code versions can reside on one server; upgrading one EFC's code version does not affect other EFCs; and browser bookmarks do not reference the code version, so they do not break on a version upgrade.
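The symbolic-link coupling might look like the following sketch. The link name `code` and the directory layout are assumptions for illustration; the staging-then-rename step is one common way to switch a link without a moment in which it is missing, and is not necessarily how the EFC does it.

```python
import os


def point_efc_at_version(efc_dir, version_dir, link_name="code"):
    """Make <efc_dir>/<link_name> a symbolic link to version_dir.

    The link is first created under a temporary name and then renamed
    over the old one, so URLs that pass through the link keep working.
    """
    link = os.path.join(efc_dir, link_name)
    staging = link + ".new"
    if os.path.lexists(staging):
        os.remove(staging)            # clear a leftover from an interrupted run
    os.symlink(version_dir, staging)
    os.replace(staging, link)         # replaces the old link itself, not its target
    return link
```

Because each EFC directory holds its own link, upgrading one EFC is a matter of retargeting that one link, and other EFCs on the same server continue to use the version they already reference.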
User interface and datastore access routines are separated and coupled flexibly. The current front-end routines serve the HTTP/HTML implementation and are written in Perl. They depend on the current back-end routines. The back-end routines do not depend on the front-end routines, so they could be reused in support of an alternate front-end. Authorization is done in the back-end, making that code reusable as well. This would be advantageous in deploying a Java user interface while retaining HTTP data connectivity. Our move to Java will probably supersede this, with Java implemented on the server as well. The benefit of front-end / back-end separation is therefore likely to be modularity more than reuse.
A flexible template mechanism produces text-based system objects: HTTP responses, user interface HTML, and metadata SGML. Each template is in the protocol/language of the produced object, and includes variables for value-substitution on template completion. A default routine supports common processing. Custom routines support repeating items.
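A minimal sketch of such a mechanism follows, using Python's `string.Template` for the variable substitution. The template text, the `$name` variable syntax, and the function names are assumptions for illustration; the EFC's own templates and routines may look quite different.

```python
import html
from string import Template

# Templates are written in the language of the object they produce (here, HTML).
PAGE = Template(
    "<html><head><title>$title</title></head>\n"
    "<body><ul>\n$items</ul></body></html>\n"
)
ITEM = Template('<li><a href="$url">$label</a></li>\n')


def fill(template, values):
    """Default routine: substitute each variable with its value."""
    return template.substitute(values)


def fill_folder_page(title, entries):
    """Custom routine: expand the repeating item template, then the page."""
    items = "".join(
        fill(ITEM, {"url": html.escape(entry["url"], quote=True),
                    "label": html.escape(entry["label"])})
        for entry in entries
    )
    return fill(PAGE, {"title": html.escape(title), "items": items})
```

The same `fill` routine could complete an HTTP response or an SGML metadata template, since it knows nothing about the language of the text it fills in.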
Import protocols are treated generally. Currently supported protocols are NFS and mail. Like-named routines place inbound objects into the user inbox, with partial metadata. (Import currently occurs as a background activity, independent of web browsing. This will change somewhat as browser-based upload methods are incorporated into the EFC.) Similarly, datatypes supported by the EFC include recognizers supporting user interface and conversion routines.
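The general treatment of import protocols can be sketched as a table of routines keyed by protocol name. The registry, the inbox structure, and the SOURCE attribute are hypothetical; only SENDER and DATE come from the description of mail metadata earlier in this document.

```python
import email

# Hypothetical registry: protocol name -> routine of the same name.
IMPORTERS = {}


def importer(protocol):
    """Register an import routine under its protocol name."""
    def register(func):
        IMPORTERS[protocol] = func
        return func
    return register


@importer("nfs")
def import_nfs(path):
    # A scanned image arrives as a file path on a mounted filesystem.
    return {"datatype": "image", "metadata": {"SOURCE": path}}


@importer("mail")
def import_mail(raw_message):
    # Partial metadata is taken from the message headers.
    msg = email.message_from_string(raw_message)
    return {"datatype": "mail",
            "metadata": {"SENDER": msg["From"], "DATE": msg["Date"]}}


def receive(protocol, user, source, inboxes):
    """Run the routine for the protocol and file the result in the user's inbox."""
    entry = IMPORTERS[protocol](source)
    inboxes.setdefault(user, []).append(entry)
    return entry
```

Under this arrangement, supporting a new protocol means registering one more routine; the inbox handling stays unchanged.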
User interface and data storage design generalize several stored resolutions into a single datatype, so no image sizes are hard-coded.
Stand-in images prevent image conversion from delaying object creation. Actual images replace these as they become available. (Our current platform requires several minutes for each TIFF-to-GIF conversion, done as a background process. This will soon be reduced to several seconds by a platform upgrade.)
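The stand-in behavior can be sketched as a lookup that falls back to a placeholder until the background conversion has written its output. The file naming by resolution label and the placeholder name are assumptions for this example.

```python
import os


def image_to_serve(object_dir, resolution, stand_in="standin.gif"):
    """Return the converted image if it exists yet, otherwise the stand-in.

    `resolution` is a free-form label (for example "thumb" or "full"),
    so no particular image size is built into the lookup.
    """
    converted = os.path.join(object_dir, f"{resolution}.gif")
    return converted if os.path.exists(converted) else stand_in
```

Once the background conversion writes the file, the same call begins returning the actual image with no further bookkeeping.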
Our current HTTP/HTML implementation partitions user authentication into a separate executable “wrapper”, providing an authentication session which spans individual HTTP sessions. Application data carried as HTTP pathinfo is transparent to the wrapper.
Additionally, the HTTP/HTML user interface is produced by front-end routines separate from those which result in metadata and data updates. These latter CGI routines do not themselves produce a user interface, but instead redirect the browser back to the user interface CGI.
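The update-then-redirect pattern might be sketched as follows. The in-memory store, the `/cgi-bin/efc-ui` path, and the function name are hypothetical; the response uses the standard CGI `Status` and `Location` headers and has no body of its own.

```python
from urllib.parse import quote


def update_then_redirect(store, object_id, attribute, value,
                         ui_base="/cgi-bin/efc-ui"):
    """Apply a metadata update, then send the browser back to the UI routine.

    Returns the CGI response text: headers only, no user interface.
    """
    store.setdefault(object_id, {})[attribute] = value
    # The object id travels as path information on the user interface URL.
    location = f"{ui_base}/{quote(object_id)}"
    return f"Status: 302 Found\r\nLocation: {location}\r\n\r\n"
```

Keeping page generation out of the update routine means the user interface routine is the only place where HTML is produced.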
