Menasseh Ben Israel: local WebDOC project, University of Amsterdam
paper presented by J.J.M. de Haas, November 18, 1997

Introduction
The Bibliotheca Rosenthaliana, the Department of Judaica and Hebraica of Amsterdam University Library, holds many unique research collections. One of the most prominent consists of the editions printed in the seventeenth century by Menasseh Ben Israel, who established the first Hebrew printing office in Amsterdam in 1626. In total he printed some seventy books, the greater part of which is in the collection of the Bibliotheca Rosenthaliana. Being almost complete, this collection is of great importance to scholars all over the world. Apart from the printed books, the Bibliotheca Rosenthaliana holds six autograph letters by Menasseh Ben Israel, which are rare because hardly any of his letters have been preserved. There is a copper engraving of his portrait by Salom Italia. There is also an etching by Rembrandt that is alleged to represent him, but scholars dispute this to this day.

The WebDOC project aimed at the following:

  • investigating in what way these 17th century editions can best be digitalised without damaging the books
  • making the whole of the collection available in electronic form (presentation of digital collection)
  • investigating which OCR method (if any) can best be used to make the Hebrew characters optically recognisable (conversion from image to searchable and editable text).
We will briefly discuss the problems we met during the project and the choices we made in order to reach our goal.

Digitalisation without damaging the books
Most books in the collection would be damaged if pressed against the glass plate of a flatbed scanner, so they had to be scanned face up. There were three ways in which the books could be handled for digitalisation: photographing with a digital camera; recording with a video camera and converting the video stills to a digital format; or photographing with an 'ordinary' analogue camera and converting the resulting microfilm to a digital format.

Using a digital camera proved too expensive, because every shot had to be focused manually, which would have taken far too long for the entire collection (23,862 pages).
Using a video camera took even more time: each page had to be divided into at least two, but preferably four, parts in order to obtain a legible image.
Using an ordinary analogue camera, followed by a run through a SunRise microfilm scanner, proved to be the best method. In the first place, film is known to have a lifespan of at least a hundred years. Secondly, film gives the highest possible contrast, which is very important for the next step, digitalisation. In the third place, the settings of the microfilm scanner could be chosen at the beginning of a film and remained valid for the whole of it, so digitalising the films took very little time. Filming and then running the resulting analogue images through the microfilm scanner took less time than photographing the collection with a digital camera, and the images were clearer, because a film of very high contrast proved to be a more suitable basis for digitalisation than the original pages.

Images: compression and formats on the web
We wanted to present the images of the pages on the Internet in such a way that end users would be able to read them using just Netscape 2.11 or higher (Internet Explorer 2.0 did not support frames at the time). As JPEG (Joint Photographic Experts Group) and GIF (Graphics Interchange Format) are frequently and successfully used on the web, we ran some tests with these formats.
Scanning a book opening (one image of two pages) produced a bitmap image of 2.8 MB (256 colours, 2000 x 1400 pixels). This was converted with Paint Shop Pro 4.10 to a JPEG image of 994 KB using standard Huffman compression. A conversion of the same bitmap image to GIF with 256 colours remained 2.8 MB in size.
In order to obtain smaller files we reduced the number of colours from 256 to 2. The JPEG image remained 994 KB in size, but the GIF was reduced to 110 KB.
Images of 994 KB take a long time to load. Even images of 110 KB are awkward to page through, so we decided to reduce the image width from 2000 to 1000 pixels. Most pages are still legible at this size, and two colours are sufficient. This resulted in a b/w GIF image of 1000 x 700 pixels with a size of only 38 KB, which is very acceptable. For comparison, the corresponding JPEG image in 256 colours at 1000 x 700 pixels measured 310 KB.
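The sizes quoted above can be compared with the uncompressed size of a bitmap, which follows directly from the pixel dimensions and the number of colours. The sketch below is only an illustration and not part of the project's tooling: the function names are hypothetical, and the modem speed of 28.8 kbit/s is an assumption about a typical connection of the period, not a figure from our tests.

```python
import math


def raw_size_bytes(width: int, height: int, colours: int) -> int:
    """Uncompressed bitmap size, storing one fixed-width code per pixel."""
    bits_per_pixel = max(1, math.ceil(math.log2(colours)))
    return width * height * bits_per_pixel // 8


def load_time_seconds(size_bytes: int, bits_per_second: int = 28_800) -> float:
    """Idealised transfer time; the default line speed is an assumption."""
    return size_bytes * 8 / bits_per_second


if __name__ == "__main__":
    print(raw_size_bytes(2000, 1400, 256))  # 2800000, the 2.8 MB bitmap
    print(raw_size_bytes(2000, 1400, 2))    # 350000, two colours, full size
    print(raw_size_bytes(1000, 700, 2))     # 87500, two colours, reduced size
    # Transfer time of the 38 KB GIF, taking 1 KB as 1000 bytes (assumption).
    print(load_time_seconds(38_000))
```

Under these assumptions the 256-colour bitmap comes to 2,800,000 bytes, which matches the 2.8 MB we measured. The two-colour images are 350,000 and 87,500 bytes before compression; the GIF files of 110 KB and 38 KB are smaller still because GIF compresses the pixel data further.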
During the project all kinds of technical developments occurred. At the end of 1996 a rather stunning new compression technique appeared: Lightning Strike 2.6. We tried it out on the original image (bitmap 2.8 MB, JPEG 994 KB). The resulting cod image in 256 colours was 295 KB. The image can be presented at a rather small size, since it is possible to zoom in using the right mouse button.
Our conclusion was that for a clear b/w representation of a page the small GIF image (38 KB, 1000 x 700 pixels) was sufficient, with the advantage that no special viewer has to be installed. If the page needs to be represented 'life-like', just as it looks in reality, colours matter and the cod format is an interesting solution: after compression no information is lost, and the image keeps its 256 colours and 2000 x 1400 pixels. The disadvantage is that the cod viewer has to be installed.

Images: compression and format on cd-rom
Scholars in the field of Hebrew studies need a clear view of the texts. On cd-rom there are far fewer restrictions on size, because the images are loaded locally, independent of (expensive) Internet connections. It is also possible to put a special viewer on the cd-rom. Without the size limitation the images can have a format other than GIF or JPEG, so a higher quality can be obtained. The format for the images on cd-rom is TIFF (Tagged Image File Format) Group IV, 300 dpi (dots per inch). The compression used by TIFF Group IV is an international standard. It was developed by post, telephone and telegraph companies worldwide in order to send faxes efficiently.
In digitalising, resolution is a very important factor. It is decisive for the visibility of details. It also has a great influence on the size of the image: an image with a resolution of 400 x 400 dpi contains four times as many pixels as an image with a resolution of 200 x 200 dpi. Considering the original size of the printed pages, 300 dpi was the right choice. This resolution is sufficient for discerning any details. Generally speaking, a higher resolution might sometimes be advisable, but as we scanned from microfilm a very high resolution had already been obtained. Furthermore, OCR systems generally advise 300 dpi. A lower resolution would give a poor result. A higher resolution would only be better in the case of extremely small print (e.g. telephone directories). The size of the print in the editions involved justified the choice of 300 dpi. The TIFF pages, at an average of 100-150 KB, may be rather large for the Internet, but this size is no problem for the cd-rom.
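The relation between resolution and image size mentioned above can be made concrete with a small calculation. This is a sketch only: the page size of 6 x 8 inches is a hypothetical value chosen for the example, not a measurement of any book in the collection, and the function name is likewise an assumption.

```python
def pixel_count(width_inch: float, height_inch: float, dpi: int) -> int:
    """Number of pixels needed to scan a page of the given size at `dpi`."""
    return round(width_inch * dpi) * round(height_inch * dpi)


# Hypothetical page size in inches, used for illustration only.
PAGE_WIDTH, PAGE_HEIGHT = 6.0, 8.0

if __name__ == "__main__":
    base = pixel_count(PAGE_WIDTH, PAGE_HEIGHT, 200)
    for dpi in (200, 300, 400):
        pixels = pixel_count(PAGE_WIDTH, PAGE_HEIGHT, dpi)
        # The ratio grows with the square of the resolution.
        print(dpi, pixels, pixels / base)
```

Because the pixel count grows with the square of the resolution, going from 200 to 400 dpi quadruples it, while 300 dpi needs 2.25 times as many pixels as 200 dpi, whatever the page size.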
During the project, a (freeware) TIFF viewer became available as a Netscape plug-in: the Watermark WebSeries Viewer (http://www.filenet.com/prods/watermark/webdn.htm). This viewer is suitable for viewing the TIFF pages on the cd-rom.
Later versions of Microsoft's Internet Explorer (from version 3.0 onwards) can handle frames and are also equipped with a TIFF viewer, so end users of these versions can read TIFF pages quite transparently. That is why in some cases (those where the characters were rather too small to read) we replaced the GIF images with TIFF images. For end users of Netscape browsers up to 3.0, we linked to the site where the free TIFF plug-in can be downloaded.

Presentation of the digital collection
The electronic version of the Menasseh Ben Israel collection consists of more than 11,000 digital images, in both TIFF (cd-rom) and GIF (Internet). These cannot be presented as one linear sequence: there has to be an unambiguous structure which distinguishes the books from each other and which allows the end user to navigate easily through the collection and the individual editions.
The homepage gives general information and is the entry point to the Menasseh Ben Israel collection. On the left are links to the books and letters by Menasseh, to a biography of and publications about Menasseh, and to the text of the final report on the project.
To get a brief description of all the books in the collection, click on 'books'. Every brief description has a link to the full description with annotations and to the pages of the book itself.
We used three frames:

  • The main frame shows the full bibliographic description with annotations and sometimes a small image of the title page.
  • The frame on the left shows a vertical column with the numbers of the book openings, which are used to turn the pages.
  • The frame at the top contains the short title of the book, with on the left a button for returning in one click to the brief descriptions of the complete collection, and on the right a button for returning in one click to the full bibliographic description of the book.

OCR
Basically, there are two kinds of OCR software. On the one hand, there is software using font recognition: the program has been trained to recognize a specific font (e.g. Times New Roman, or Web Hebrew AD) and automatically converts text in this font into a digital character set (ASCII: American Standard Code for Information Interchange). On the other hand, there is software using pattern recognition: no font has been pre-programmed, and the specific character set has to be taught to the program before recognition can take place. Font recognition programs can also be trained on some new characters, but there is a limit, whereas pattern recognition programs offer far more possibilities in this respect.
We experimented with both kinds of OCR software. Font recognition software was represented by OmniPage Pro 7.0, pattern recognition by proLector 1.20 D.
The Hebrew font we use is Web Hebrew AD, which we downloaded from http://www1.snunit.k12.il/heb_pc.html. Characteristic of this font is that its Hebrew characters are represented by ASCII codes 224 - 250 (strictly speaking these are extended 8-bit codes, as standard ASCII ends at 127). The other codes produce Latin characters. By using these codes, a website can contain Dutch, English and Hebrew texts at the same time.
The codes can be entered using the numeric keypad on the right of the keyboard. Make sure the Num Lock key is on, press ALT and keep it pressed, then type 0 followed by the numerical code. Most of these codes can also be entered by typing the letters with diacritics that correspond to the same numerical codes.
Hebrew texts can likewise be searched from a U.S. international standard keyboard by typing in the corresponding codes.
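The mapping of the codes 224 - 250 to Hebrew letters can be illustrated as follows. This is a sketch under an assumption: we take it that Web Hebrew AD places its letters at the same positions as the Windows-1255 code page (available in Python as "cp1255"), which also uses 224 for alef and 250 for tav. The function and variable names are hypothetical.

```python
HEBREW_FIRST, HEBREW_LAST = 224, 250  # code range stated for the font


def hebrew_char(code: int) -> str:
    """Return the Hebrew letter for a code in the range 224-250.

    Assumption: the font follows the Windows-1255 layout for these codes.
    """
    if not HEBREW_FIRST <= code <= HEBREW_LAST:
        raise ValueError(f"{code} is outside the Hebrew range 224-250")
    return bytes([code]).decode("cp1255")


# Table of all 27 letters (22 letters plus 5 final forms).
HEBREW_TABLE = {code: hebrew_char(code)
                for code in range(HEBREW_FIRST, HEBREW_LAST + 1)}


def decode_mixed(raw: bytes) -> str:
    """Decode a byte string that mixes Latin text and Hebrew codes."""
    return raw.decode("cp1255")


if __name__ == "__main__":
    print(len(HEBREW_TABLE))               # 27
    print(hebrew_char(224))                # alef
    print(decode_mixed(b"Menasseh \xee"))  # Latin text followed by mem
```

Codes below 128 decode to ordinary Latin characters, which is why one byte sequence can carry Dutch, English and Hebrew text side by side.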
The characters used in modern printing are nowadays produced by computers. Consequently, every -a- is identical to the next; they are cloned, as it were. OCR using font recognition, such as OmniPage Pro 7.0, is very suitable for this kind of printing. Font recognition programs also have a (modest) ability to 'learn' new characters, up to a limited number. We tested this feature of OmniPage on a modern Hebrew text, which caused no difficulties.
Seventeenth century texts, however, consist of characters that were made by hand and are far from identical to each other. An -a- may have dozens of different shapes. That is why, for our project, OCR using pattern recognition proved to be the only kind of OCR software with a more or less satisfactory result. It also explains why it really does not matter whether the text at hand is in Latin, Hebrew or any other script: the pattern recognition program can be taught to recognize any letter and link it to the corresponding ASCII code. As an example we refer to: http://www.uba.uva.nl/nl/collecties/rosenthaliana/menasseh/ocrapport/hebzpunt.html.
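The idea of training several shapes for one letter can be sketched with a toy nearest-pattern matcher. This is not how proLector works internally, which we cannot describe here; it is a deliberately simple illustration in which every name, the 3 x 3 glyphs and the pixel-difference measure are assumptions made for the example.

```python
from typing import List, Tuple

Glyph = Tuple[str, ...]  # rows of '#' (ink) and '.' (paper)


def distance(a: Glyph, b: Glyph) -> int:
    """Count the pixels in which two equally sized glyphs differ."""
    return sum(pa != pb for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


class PatternRecognizer:
    """Toy recognizer: stores trained patterns, returns the closest label."""

    def __init__(self) -> None:
        self.patterns: List[Tuple[Glyph, str]] = []

    def train(self, glyph: Glyph, label: str) -> None:
        # The same label may be trained with many different shapes.
        self.patterns.append((glyph, label))

    def recognize(self, glyph: Glyph) -> str:
        if not self.patterns:
            raise ValueError("no patterns have been trained")
        _, label = min(self.patterns, key=lambda p: distance(p[0], glyph))
        return label


if __name__ == "__main__":
    ocr = PatternRecognizer()
    ocr.train(("###", "#.#", "###"), "o")  # first shape of the letter
    ocr.train(("###", "#.#", ".##"), "o")  # second, damaged shape
    ocr.train(("#..", "#..", "###"), "l")  # a different letter
    # An unseen, slightly different glyph is matched to the nearest pattern.
    print(ocr.recognize((".##", "#.#", ".##")))  # o
```

Since the label is just a string, a pattern covering several joined letters could be trained with a label of more than one character, and the script of the text plays no role in the matching.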

General remarks about our experience with proLector and 17th century Hebrew:

  • Vocalized Hebrew (characters with dots) can be trained without the dots: it makes no difference when searching the text.
  • If some letters are not clearly separated from each other, they can be trained for the occasion as one pattern. This only works for up to four characters.
  • Occurrence of varieties within a font (italics, bold, larger capitals etc.) is no problem because this also is a matter of training.

Main problem:
Characters spanning more than one line (e.g. large ornamental letters at the beginning of a chapter) are cut into pieces when they stand next to smaller blocks of text, and are then not recognized. Blocks of differing text have to be removed or put into separate zones in order to be recognized.

The URL of the full description of the project (42 pages, in Dutch) is: http://www.uba.uva.nl/nl/collecties/rosenthaliana/menasseh/eindrapport/index.html



© Pica, 1997