Article contents
- Analog and digital representations
- Resources: Representations
- Layers of representation
- Resources: Layers of representation
- Curatorial requirements for data representations
- Bit preservation and information preservation
- Resources: Bit preservation and information preservation
- Format Information
- Resources: Format information and identification
- Word-Processor Formats
- Resources: Word-Processor Formats
- Resources: HTML
- SGML, XML, and their Applications
- Resources: SGML, XML, and their Applications
- Other Formats
- Resources: Other document formats
- Resources: Raster Graphics
- Resources: Vector Graphics
- Resources: Databases
- Resources: Numbers
By data representation is meant, in general, any convention for the arrangement of things in the physical world in such a way as to enable information to be encoded and later decoded by suitable automatic systems.
We specify conventions because information can be conveyed by other means as well. A dog may know by sniffing the air who has passed a given way in the preceding hour or so, but does not in doing so rely on any agreed conventions for information transfer. This and similar exchanges of information do not involve data representation in the sense we mean it here.
We specify automatic systems to distinguish data representation from the more general topic of the representation or encoding of information, which includes conventional writing systems, paper drawings, and other representations of information used by human beings.
We do not specify what physical objects are to be arranged, or how, or what kind of information they are to be used to encode, because data representation might in theory involve any kind of physical object and any kind of information. In practice both the physical objects involved and the conventions for their arrangement have varied a good deal over the short history of automatic information processing by means of machines. Holes punched in stiff cards, magnetic charges on a thin coating applied to plastic tape or flat metal disks, holes in paper tape, variations in the optical properties of the surface of a thin plastic disk, dials controlling electrical circuits, the positions and lengths of cables, and the positions of spools and sticks in a Tinkertoy construction have all been used successfully to represent data.
A particular convention for data representation is often referred to as a data format.
An understanding of principles and issues of data representation is essential for data curators because only curators who understand how information is represented by the digital objects in their care can take effective steps to ensure that the information represented is not lost. The long-term sustainability of digital objects is materially affected by the methods of data representation relied on by those objects; tradeoffs between different courses of curatorial action can be correctly assessed only with an understanding of how the information to be preserved is represented by physical objects.
One way to represent information is to create and sustain a direct analogy between salient properties of the system being modeled (the information) and the physical representation of the information. When the physical representation is astutely chosen, operations on the physical representation can correspond to operations on the objects being represented and the representation can be used for (for example) calculation.
Real numbers, for example, can be represented in an intuitive way using lengths of string or wood. Numbers with similar values will have physically similar representations, and the addition of a set of numbers can be represented by placing the representations of the numbers end to end; the sum is represented by a string or piece of wood whose length just equals the distance from one end of the sequence of addends to the other. The addition of numbers using a slide rule is based on precisely such a representation. (A more common use of a slide rule, of course, is to multiply numbers; in this case, each number is represented by a length of rule proportional to the logarithm of the number, and multiplication is represented as the addition, using the convention just described, of the logs of the multiplicands.)
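The slide rule's procedure can be sketched in a few lines of Python: multiplication is carried out by adding the logarithms of the operands, just as two lengths of log-scaled rule are placed end to end. (The function name here is invented for illustration.)

```python
import math

# Analog addition: placing two lengths end to end yields a length equal
# to their sum. A slide rule multiplies by adding lengths proportional
# to the logarithms of the operands: log(a) + log(b) = log(a * b).
def slide_rule_multiply(a, b):
    # Each factor is "marked off" as a length proportional to its log;
    # the two lengths are added, and the product is read back off the scale.
    return math.exp(math.log(a) + math.log(b))

print(slide_rule_multiply(3.0, 7.0))  # approximately 21.0
```

The small floating-point error in the result mirrors the small reading errors of a physical rule: in an analog scheme, small errors in the representation produce correspondingly small errors in the result.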
Because it is based on an analogy of properties between the representation and the represented, this form of information representation is called analog. A fundamental property of an analog representation of information is that representations with similar physical properties represent similar things: infinitesimally small differences in the state of a representation correspond to infinitesimally small differences in the information represented, and there are uncountably many different meaningful states. It is a consequence of this property that small errors in the representation will result in small (though often tolerable) errors in the results of a calculation.
Among the best-known uses of analog representations for complex information are the tide-predicting machines developed in the 19th century and in use in some locations until the 1970s, which predicted tides using ingenious systems of cables, wheels, and pulleys.
A different family of representations uses a purely arbitrary or symbolic relation between an object and its representation; the physical representation serves to record a symbolic expression or notation of some sort, and the expression has an arbitrary relation, defined by convention, to the information it represents. The arbitrariness of the relation will be familiar to some readers from many other discussions of signs and signifiers.
The data representations used in modern computer systems all fall into this family. Their fundamental property is that they represent information indirectly: physical phenomena are used to represent sequences of binary digits (zero or one), and sequences of binary digits are then interpreted as integers, real numbers, characters, or other “primitive” data types. From the use of binary digits as a fundamental building block (and more generally from the similarity of these representations to the use of fingers as symbolic units in counting), these representations are termed digital.
The fundamental property of digital representations is that they are based on the use of a finite number of discrete symbols to represent information. Because finite systems can represent only a finite number of symbols, in any such system there is only a finite number of possible meaningful states; this is a fundamental difference between digital and analog representations of information. (In practice, the number of states distinguishable in a digital system is large enough that it often simplifies reasoning to pretend that it is infinite.) In digital systems, the physical similarity of two representations of information is no guide to the similarity of the information they represent. (For example, the bit sequences 0000 0000 and 1000 0000 differ only in the value of a single bit, but if they are taken as unsigned integers, they denote 0 and 128; many numbers much closer to zero than 128 have representations very different from either.) Small errors occurring in the physical representation of information (e.g. the accidental flipping of a single bit) can and often do lead to wildly erratic results.
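A short Python fragment makes the point concrete, using the bit patterns from the example above:

```python
# Two bit patterns differing in a single bit can denote widely separated
# values, while numerically adjacent values can have entirely different
# bit patterns.
a = int("00000000", 2)  # the pattern 0000 0000 denotes 0
b = int("10000000", 2)  # 1000 0000: one flipped bit, but it denotes 128
c = format(127, "08b")  # 127 is adjacent to 128, yet its pattern ...
print(a, b, c)          # ... 0111 1111 differs from 128's in every bit
```

In an analog system a one-part-in-256 disturbance would shift the represented value by about the same proportion; here it shifts the value from 0 to 128.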
The early years of electronic computing machinery were marked by competition between digital and analog (or more frequently digital/analog hybrid) computers, but eventually digital devices swept the analog and hybrid devices from the marketplace so thoroughly that early descriptions of electronic binary digital stored-program computing machines now seem quaintly dated, and the representation of pre-existing materials in machine-processable form is referred to as digitization, as if no other form of machine processing were conceivable. Those early descriptions now serve as reminders that not all computational devices need be digital, or binary, or electronic. A key element in the commercial victory of digital devices was the development of methods for simulating the behavior of analog devices using digital representations. So even in environments which are strictly speaking digital it is sometimes useful to distinguish methods of representation which are more purely digital from those which have, or seek to have, analog properties. In the context of contemporary digital machines, therefore, the term analog may be applied to representations of a thing which model selected physical properties of that thing as closely as possible, typically using (digital representations of) real numbers; in contrast a digital representation represents the thing in symbolic form, typically using symbols from a (relatively) small number of discrete, enumerable atomic symbols.
A text, for example, can be represented by a scanned image of a page on which the text has been written, in which the data format records information about the hue and brightness of the light reflected from different points on the paper; this is an analog representation of the text (or more precisely, of the page), in the sense just described. The text (and the page) can also, however, be represented by a sequence of characters, each character represented internally by a distinct pattern of bits, with no attempt to record the physical appearance of the paper or the writing on it, only to record the identities of the symbols used to encode the text. In the sense just given, this is a digital representation of the text.
As illustrated by this example, analog representations often mimic physical attributes without any distinction between those which carry meaning and those which do not, while digital representations require an understanding of the properties of the thing being represented. For this reason, analog representations are sometimes associated with the act of perception and digital representations with the act of cognition (e.g. by Devlin 1991). Typically, digital representations provide better access to information of interest than do analog representations. Full-text search of digital representations is a straightforward operation, while full-text search of images representing pages of text is possible only via a detour through a textual representation (often created automatically by optical character recognition, with the high error rates entailed by that operation). Because purely digital representations can omit extraneous information, they tend to be more compact than analog representations. An image of a page typically requires many times the storage space needed for a character-based transcription of the page. Conversely, because analog representations do not discriminate between relevant and extraneous information, they will typically convey information omitted from a purely digital representation of the same thing. Sometimes this extra information will prove useful or important.
In existing computer systems there is typically a long chain of relations connecting the physical phenomena by which data are represented with the data being represented. Each link in the chain connects two layers of representation: each layer organizes information available at the next lower level into structures at a higher (or at least different) layer of abstraction, and in this way provides information used in turn by the next higher level in the representation. For example, the representation of an email message may involve the following layers:
- Physical layer: holes in cards or tape, magnetic charges, color changes on optical disks or scan codes, tones on a telephone connection, or similar phenomena are interpreted as representing sequences of bits.
- Bit layer: those sequences of bits may be interpreted as representations of other different sequences of bits (for example five bits may be written to the physical medium to represent four bits of data, in such a way as to guarantee a minimum and maximum amount of space between magnetic flux events in the media).
- Byte / octet layer: the sequences of bits read from the storage device are grouped into octets: units of eight bits often referred to as bytes. (Historically different machines had bytes of different sizes, but it has been some decades since any prominent system had bytes of other than eight bits.)
- Character layer: an octet sequence may be interpreted as a sequence of characters. For conventional email, each octet will be interpreted as one character, as defined by the appropriate character-set standard.
- Application-specific data structure layer: the email reader will read the character stream and distinguish the mail header from the message body, and may distinguish multiple alternative representations of the message and attachments within the message body. Within the mail header, mail software will distinguish important fields like date, sender, and addressee.
- Presentation layer: the email reader will display the message on the user's screen.
The human reader of the mail will of course read the screen and (in the normal case) discern letters, words, and sentences, as well as (perhaps) images.
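The upper layers of this chain can be illustrated with Python's standard email library, which groups octets into characters and characters into an application-level structure of header fields and message body. (The addresses and subject line are invented for illustration.)

```python
import email

# Character layer: raw octets as they might arrive over the wire,
# interpreted here as ASCII characters.
raw = (b"From: alice@example.org\r\n"
       b"To: bob@example.org\r\n"
       b"Subject: Layered representations\r\n"
       b"\r\n"
       b"The body of the message.\r\n")

# Application-specific data structure layer: the library separates
# the header from the body and parses the individual header fields.
msg = email.message_from_bytes(raw)
print(msg["Subject"])              # Layered representations
print(msg.get_payload().strip())   # The body of the message.
```

Everything below the character layer (octets on disk, bits on the wire, magnetic flux events) is invisible to this code, which is exactly the independence the layered design is meant to provide.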
This hierarchy of layers of abstraction is characteristic of many information technologies, not just data representations. It has parallels to the structuralist idealization of natural language as organized into phonological, morphological, lexical, syntactic, semantic, and pragmatic layers. In the case of natural language, different layers sometimes interact in ways that conflict with the hierarchical model. Artificial systems of data representation, in contrast, may follow, more consistently than natural languages do, the ideal of a strict hierarchy of layers in which no layer depends on or interacts with any layers other than the immediately adjacent ones. In the design of technologies, such layering helps limit the complexity of the system and reduces the likelihood of error in its construction. From the metaphor of several pieces of software, each layered on top of the next, systems constructed in this way are often referred to as a software (or technology) stack. Software that supports network connectivity is often referred to as “the network stack”; the technologies available for working with XML documents are sometimes referred to as “the XML stack”; and so on.
There is no single hierarchy of data representation layers that applies to all data representations; other software running on the same machine may have a chain of representations and layers rather different from the chain described above for email messages. In particular, different applications will almost always have different application-specific data structures. Moreover, proprietary applications frequently use binary data formats which have no distinguishable character-level representation.
Despite the manifold opportunities for variation, however, some properties are shared by many data representations in wide use today:
- Unlike proprietary applications, non-proprietary applications often define character-level representations as a way of allowing interoperability between different implementations.
- Historically, machines from different manufacturers often used different character encodings. Since the development of the so-called Universal Character Set (UCS) of ISO 10646 and Unicode in the 1990s, however, hardware manufacturers, software developers, and the writers of non-proprietary specifications have been slowly converging on the use of the UCS as a standard character representation level. Examples include the use of the UCS as the fundamental character representation in the Java programming language, in HTML beginning with HTML 4.0, and in XML. In consequence, character-set variation is likely to pose practical problems primarily in the case of formats or data material from the 1990s or earlier.
- Most applications on most systems share the lowest levels of the representation and diverge only in their treatment of the octet sequence.
- Many applications designed for use on a single computer system also assume that the operating system within which they are used provides a file system (which can be described abstractly as a mapping from file names to octet sequences). This is not a universal property, however: not all computing devices provide file systems, and (in the interests of speed and/or reliability) many database management systems bypass the file system to deal directly with the interface to the hard disk or other storage media. Network protocols, in contrast, typically avoid assuming the existence of a file system and rely instead on concepts of messages, data transmissions, or (especially in the context of the World Wide Web) resources.
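The character-set variation mentioned above is easy to demonstrate: one and the same octet denotes different characters under different pre-UCS encodings, and standing alone it is not valid UTF-8 at all. A sketch in Python:

```python
# One octet, three character sets, three different interpretations.
octet = b"\xe9"
print(octet.decode("latin-1"))   # é (ISO 8859-1)
print(octet.decode("cp437"))     # Θ (the old IBM PC character set)
try:
    # In UTF-8, 0xE9 is a lead byte that must be followed by two
    # continuation bytes; on its own it is a decoding error.
    octet.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8 on its own")
```

Data from the 1990s or earlier often records nothing about which convention was used, which is why character-set identification remains a practical problem for legacy material.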
In practice, the issues of concern for data curation are almost all at or above the octet level. Partly this is because lower levels are normally highly reliable (and thus seldom need attention), partly because intervention at lower levels requires specialized engineering knowledge and equipment, and partly because application formats are designed to rely only on the octet level, precisely in order to make them independent of the precise implementation of the lower levels. (But see the discussion of bit preservation below.)
For data curation purposes there are two fundamental requirements; all other requirements derive from these (or are not requirements at all, but negotiable desiderata):
- Permanence: the data representation must last a long time without corruption, degradation, decay, or loss.
- Usability: it must be possible to use the information being preserved. Using the information provides a check that the information has thus far been successfully preserved without the loss of some crucial bits, and in any case if the data have become unusable there may be little point in spending further resources to preserve them longer.
From the essential requirements others follow; all of the following are desiderata only, not absolute requirements.
- Any data representation relied on for long-term preservation of information must have clear, well written, published documentation. If the format is not documented, the likelihood that the information it represents can be preserved without loss across media conversions is small; the likelihood that it can be preserved without loss across format conversions is nil. One of the most effective methods available for confirming that digital objects have been successfully preserved so far is to provide effective intellectual access to the material; active users of the material provide a far better monitor of data quality than any automated system could ever do. But if the format of the data is not documented, it is much harder, if not always impossible, to provide effective intellectual access to the material.
- The specification documents for preservation formats should be controlled by public bodies, preferably consensus-based organizations in the international standardization system or by relevant industry consortia. Proprietary formats are subject to change and abandonment by their owners in ways that make them a poor bet for long-term access to information.
- Other things being equal, a data representation that is widely supported has a better chance of long-term utility than one with a much smaller user community. Larger numbers of users mean that it is easier to share the costs of maintenance and development across a larger pool of resources, understanding and documentation of the format are likely to be more widespread, and there are better prospects for commercial support for the format. There are limits to this principle, however: a suitable format used by a small specialized community will often be preferable to a format used by a much larger community that does not provide a suitable representation of the information. (The most widely supported representations of human-readable documents, for example, are those of word-processor software. But many scholars using computers for the analysis of language and literature prefer other formats for the data they work on, because word-processor formats are not oriented to linguistic and literary concerns. It would not be a good idea to translate data from a well designed XML format into a proprietary word-processor format on the grounds that the word-processor format is more widely used.)
Practical work on data curation can usefully be divided into two classes: efforts focused on the preservation of information at the bit or octet level (bit preservation) and efforts focused on higher levels. Efforts at both levels are essential to the successful preservation of digital materials; which area more urgently requires the attention and resources of data curators is an area of active controversy.
Briefly, bit preservation is the act of ensuring that devices in the future will be able to reproduce the sequence of bits, or octets, currently used to represent the information to be conserved. Bit preservation protects against bit rot and media failure, but not against other threats to digital preservation and access.
Information preservation is the act of ensuring that the information represented in a resource is preserved, possibly by translating it from an obsolescent format into a more current format. Note that format conversion protects against file-format obsolescence, but not against other possible threats to digital preservation.
Preservation of bits is a necessary part of digital preservation: since the bit sequence is the foundation for all the higher levels in the representation of the information, if the bit sequence is lost, the information will be lost as well. But bit preservation is not sufficient: a future user interested in a WordStar 1.3 document (for example) will be able to make use of the document effectively only if software capable of reading the WordStar 1.3 file format is available. Since WordStar was a very popular program for its day, such software may very possibly be available in practice. For the formats of less popular software, however, the situation looks less promising.
Since the set of formats potentially faced by a digital curator is unbounded, it is not feasible to provide a guide to all relevant formats. This section mentions a few of the most common and most important formats and points to other resources for further information.
For conventional prose documents, two classes of format can be distinguished: formats originally designed for a particular application (often for a single piece of software) and formats originally designed for the application-independent representation of documents. Families of formats defined in a common metalanguage (e.g. SGML and XML) are a third area of interest.
Most widespread are word-processor and other office-document formats. When this material was compiled, two of these formats were more or less reliably documented in international standards, namely the Open Document Format and the Office Open XML File Format. For other word-processor formats, there is rarely any technical documentation. It is often possible for technical people of sufficient skill and patience to reverse engineer a format, if well understood sample documents in the format are available for examination. In such efforts, partial success is often attainable; perfect success is a theoretical possibility.
A second widely used application format, HTML, is defined for the display of documents on the World Wide Web. In addition to the resources listed below, the W3C has published a number of ancillary documents related to HTML; see the W3C Technical Reports page.
Less widely used than word-processor formats or HTML, but perhaps more popular among digitization projects concerned with data longevity and reuse, are document formats designed for the software-independent representation of documents. These are of particular interest and importance for data curators because they seek, by design, to make the documents represented independent of any single piece of software. They thus avoid the single most common cause of format obsolescence, which is discontinuation of the software supporting the format. The desire for software independence also forces the designers of such formats to document the meaning of the format somewhat more carefully and completely than is usual among creators of software-specific formats.
While in theory there is no end to the methods that might be used to define document formats in a software-independent way, in practice almost all recent efforts in this direction have used SGML or XML.
There are a very large number of XML-based vocabularies; the single most useful source of information about them, and more generally about XML and related technologies, is The Cover Pages, compiled by Robin Cover and currently hosted by the Organization for the Advancement of Structured Information Standards, OASIS.
TeX is a batch document-formatting program written by the computer scientist Donald Knuth; its capabilities for formatting mathematical expressions are particularly well thought of. Since the formatting commands intrinsic to TeX operate at an extremely low level, it is customary to use TeX by defining higher level commands called macros. Over the years, a number of TeX macro sets have been written.
By far the most commonly used set of macros for TeX is LaTeX, originally written by the computer scientist Leslie Lamport.
TeX and LaTeX are in wide use for the creation of technical and scientific documents, particularly among academics. Unfortunately, the data format is defined exclusively in terms of the operational semantics provided by the executable TeX program; while it is possible in principle to define a declarative semantics for most of LaTeX, in practice many LaTeX authors extend the system with macros of their own. For preservation purposes, therefore, TeX and LaTeX documents rely on the continued existence of software to process them. Fortunately, the source code for TeX and LaTeX is publicly available and written with a great deal of care to be device- and system-independent.
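The preservation hazard posed by author-defined macros can be seen even in a trivial LaTeX document (the \gene macro here is invented for illustration): without the \newcommand line, neither a processor nor a human reader of the source can be certain what the author meant.

```latex
% A LaTeX document whose meaning depends on an author-defined macro.
\documentclass{article}
% Local convention: gene names are set in italics. If this definition
% is lost, \gene{BRCA1} below is meaningless to any TeX processor.
\newcommand{\gene}[1]{\textit{#1}}
\begin{document}
The \gene{BRCA1} gene is discussed in the following section.
\end{document}
```

The operational, rather than declarative, definition of such macros is what ties the long-term intelligibility of TeX and LaTeX documents to the survival of the software itself.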
PostScript is a programming language devised by Adobe Systems Incorporated; PDF (Portable Document Format) is a document format devised by the same organization, which uses a subset of PostScript and provides rules for embedding fonts in a document and for bundling all the pieces of a document together.
While originally a proprietary format, PDF has more recently been standardized and Adobe has issued a public license allowing the use of its patented technology in the creation of PDF software that supports the ISO standard definition of PDF.
Image formats fall into two classes: raster graphics, which represent images as an array of picture elements (pixels) coded for color, and vector graphics, which represent images as sets of geometric shapes (lines, rectangles, circles, ellipses, curves of various degrees of complexity, and text).
Vector graphics are more often used for the creation of new graphics than for the digitization of pre-existing non-digital graphic material, so for curatorial purposes the reader is more likely to encounter raster graphics than vector graphics. Vector graphics have a number of properties that make them attractive for the creation of new images, however (they are often more compact, and they do not degrade when the user zooms in on details), and the reader may wish to use vector graphics when creating new materials.
Several formats can contain graphic elements in either raster or vector format.
It is central to the conception of database management systems that the internal data representation of material in the database should not be visible to users of the database, except through a defined application-program interface (API) such as SQL. Discussions of the formats used internally are thus of no particular use to users of database management systems. They are in any case not standardized; competing systems strive to find data representations that allow faster indexing or retrieval and/or more compact storage, and in the case of commercial products the details of the representation are likely to be a closely guarded trade secret.
In order to allow mass imports or exports of data, however, database management systems typically provide one or more dump formats which can be read and written by the system. These are again apt to be implementation-specific, though comma-separated-value (CSV) formats are common. There is no standard definition of the CSV format, however, and implementations vary a good deal in punctuation rules and character sets. Occasional attempts have been made to write out a coherent specification for the CSV format, but these appear not to have any influence on the majority of implementors. The problems inherent in such variation led database vendors to adopt XML for inter-database exchange very early in the life of the XML specification.
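The kind of variation at issue can be illustrated with Python's standard csv module, which writes the very same records quite differently depending on the delimiter and quoting conventions chosen:

```python
import csv
import io

# The same two records written under two different CSV conventions.
rows = [["id", "title"],
        ["1", 'A "quoted" word, with a comma']]

for options in ({"delimiter": ",", "quoting": csv.QUOTE_MINIMAL},
                {"delimiter": ";", "quoting": csv.QUOTE_ALL}):
    buf = io.StringIO()
    csv.writer(buf, **options).writerows(rows)
    print(repr(buf.getvalue()))
```

A reader configured for one convention will misparse data written under the other, which is precisely why undocumented dump formats are hazardous for data exchange.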
Historically, the representation of numbers in electronic form has been a fundamental design question for computer systems, with analog and digital representations competing with each other for adoption. In modern digital systems, four main families of representations can be distinguished:
- Integers are typically represented in a fixed-width field of bits, either as unsigned base-2 numbers (so the possible values representable in a field of n bits range from 0 to 2^n - 1) or as signed numbers. Different methods of representing negative numbers are possible; virtually all current systems use the so-called “two's-complement” representation (which will not be explained here).
- Since binary numbers have rounding properties that differ from those of decimal numbers, they can cause problems for financial applications (which conventionally assume and require rounding behaviors suitable for decimal numbers). For this reason, systems intended for commercial use (most notably mainframe computers manufactured by IBM) often use binary-coded decimal representations of numbers. In this system, groups of four bits are used to represent the decimal digits, and a number is represented as a sequence of such decimal digits. Fractional numbers are handled by conventions at the programming-language level or higher which supply an implicit decimal point at a fixed location. The number of bits used to represent a number may vary. Computer hardware other than IBM mainframes seldom has hardware support for binary-coded decimal arithmetic, but software systems designed to support computation with large numbers often use binary-coded decimal representations.
- Real numbers pose a particularly thorny problem for digital systems, since one of the fundamental properties of the real number continuum (the fact that given any two real numbers we can identify a third midway between them) is very difficult to model with a digital system. For most purposes, real numbers are represented in modern computer systems using floating-point binary numbers, which use a fixed-width bit field (divided into a sign, an exponent, and a significand) to represent numbers over a wide range of magnitudes at a fixed precision. Over the years, the width of the bit field commonly used for floating-point numbers has grown from 32 bits to 64 and 128 bits. The standard representation of floating-point binary is defined by IEEE 754, which is supported by virtually all current hardware; other floating-point binary formats survive in some specialized markets. The 2008 revision of IEEE 754 specifies not only floating-point binary formats but also floating-point decimal formats.
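Two of the representations just described can be observed directly in Python: the struct module exposes the two's-complement convention for signed integers, and ordinary arithmetic exposes the rounding behavior of binary floating point that motivates decimal representations.

```python
import struct

# Two's complement: the same 8-bit pattern, 0xFF, denotes 255 when
# interpreted as unsigned ("B") but -1 when interpreted as signed ("b").
(unsigned,) = struct.unpack("B", b"\xff")
(signed,) = struct.unpack("b", b"\xff")
print(unsigned, signed)  # 255 -1

# Binary floating point cannot represent most decimal fractions exactly,
# which is why financial applications prefer decimal representations.
print(0.1 + 0.2 == 0.3)      # False
print(f"{0.1 + 0.2:.17f}")   # 0.30000000000000004
```

The bit pattern itself carries no indication of which interpretation is intended; as throughout this article, the meaning of the octet is fixed only by convention.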
Those involved with data curation will probably seldom have need for detailed technical understanding of the formats historically or now used to represent numbers in computing. (Exceptions may arise when dealing with material which uses non-standard number formats for any reason.) But it may be worth while to scan the descriptions of number representations in Wikipedia, if only to dispel the notion that computer representations of numbers are somehow natural and thus simpler and less problematic than computer representations of other datatypes. The treatment in Wikipedia is reasonably sound, though by nature it will strike some readers as a bit dry.