Uploading existing transcriptions or OCR with Page Images

Existing transcripts or OCR text can be imported along with the page images in a zip file upload.

First, create a folder with image files in it. Then, for each page that has a transcript, add a .txt or .xml file containing the transcript of that page, named the same as the image apart from the extension, as in this example:

envelope.jpg

envelope.txt

page_001.jpg

page_001.txt

page_002.jpg

page_002.txt

postmark.JPG

postmark.txt

Not all image files need corresponding text files, but where a text file with a transcript exists, its filename must be identical to the image's filename (except for the extension).
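Before zipping, it can help to check the folder for transcript files whose names do not match any image. The following is a minimal sketch, not part of FromThePage itself: the function name check_folder and the list of image extensions are assumptions for illustration, and the name comparison is an exact match on everything before the extension.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to whatever your scans use.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}
TRANSCRIPT_EXTENSIONS = {".txt", ".xml"}


def check_folder(folder):
    """Return (orphan_transcripts, images_without_transcripts) as sorted name lists."""
    images = {}
    transcripts = {}
    for path in Path(folder).iterdir():
        if not path.is_file():
            continue
        # Compare extensions case-insensitively so postmark.JPG counts as an image.
        ext = path.suffix.lower()
        if ext in IMAGE_EXTENSIONS:
            images[path.stem] = path.name
        elif ext in TRANSCRIPT_EXTENSIONS:
            transcripts[path.stem] = path.name
    # Transcripts with no image of the same base name would have nothing to attach to.
    orphans = sorted(name for stem, name in transcripts.items() if stem not in images)
    # Images without transcripts are allowed; they are listed only for information.
    untranscribed = sorted(name for stem, name in images.items() if stem not in transcripts)
    return orphans, untranscribed
```

Any name in the first list points to a transcript that should be renamed or removed before uploading. Other files in the folder, such as metadata.yml, are ignored by this check.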

Create a metadata.yml file if you wish, and place it in the same folder.

Then zip up the folder (along with other folders, if you want), and upload the zip file on the Start a Project screen. Make sure to check the "import text" box.
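Any zip tool will do for this step. If you prefer to script it, the sketch below uses Python's standard zipfile module to pack one or more folders into a single archive. The function name zip_folders and the layout of one top-level entry per folder inside the archive are assumptions for illustration, not a documented FromThePage requirement.

```python
import zipfile
from pathlib import Path


def zip_folders(folders, archive_path):
    """Write each folder, with all of its files, into one zip archive."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder in folders:
            folder = Path(folder).resolve()
            for path in sorted(folder.rglob("*")):
                if path.is_file():
                    # Store entries as "<folder name>/<file>" so folders stay distinct.
                    arcname = (Path(folder.name) / path.relative_to(folder)).as_posix()
                    archive.write(path, arcname)
    return archive_path
```

For example, zip_folders(["letters_1850", "letters_1851"], "upload.zip") would produce one archive containing both folders (the folder names here are hypothetical), which you would then upload as described above.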

Each folder should be converted into a FromThePage work, with the contents of the text or xml files set as the raw OCR text. You'll probably want to switch the work from an OCR correction project to a manuscript transcription work (using the checkbox on the collection settings page) so that the nomenclature changes accordingly.