FromThePage Blog

Crowdsourcing, transcription and indexing for libraries and archives

Advanced Mark-up

FromThePage supports mark-up of text for common features found in historic documents.  This guide is a comprehensive listing of supported mark-up and its uses.

Anyone who has started transcribing a document runs into questions of encoding within the first few pages.  This phrase is underlined, that word isn't clearly legible, the rest of this sentence crawls its way up the edge of the page because the writer ran out of room -- how should these features be typed?  Is it even important to indicate them?

Our goal is to let users type naturally, balancing the flexibility to encode difficult features in the text against the speed of typing what they see with minimal mark-up.  The wiki-text conventions used in FromThePage were originally inspired by the MediaWiki software used by Wikipedia, and have since been informed by Markdown.  Please also see the page on the Advanced Mark-up Editor for configuring an easier interface for mark-up.

Displaying and Exporting Mark-up

The reason we mark up text is so that it can be used properly.  Text should be easy to read on-screen, but it should also be easy to re-use by editors working on scholarly editions or librarians integrating it into their systems.  This means that the same mark-up will be displayed differently in these environments:

  • Single-page display:  When users read a single page on FromThePage, viewing a page image next to the transcript, we try to preserve any line-breaks in the text, so that the transcript lines up clearly with the manuscript image.
  • Multi-page display: Users reading the text of a work several pages at a time see a compacted version of the text with only thumbnail images for page facsimiles.  This display is also used for viewing all the pages that mention a subject or search results.  In this case, we attempt to wrap words, but preserve paragraph breaks.
  • TEI-XML export: TEI projects using FromThePage generally take documents exported in TEI as a starting point for further mark-up and processing.  For this format, we attempt to preserve all mark-up created by users and translate it to the corresponding TEI elements to produce P5-compliant XML documents.
  • Verbatim plaintext export: Users can export verbatim plaintext from FromThePage for use in further editing projects.  In these cases, most mark-up is stripped in an attempt to present the basic transcript as typed, with the expectation that the export will be imported into off-line editing software.
  • Plaintext export for search or analysis: Both library systems supporting full-text search features and analytical tools operating on "bag of words" datasets need inputs that are stripped of markup.  However, their functionalities also benefit from further reductions like removing newlines and re-joining words hyphenated across line breaks, so these export formats support this additional stripping.
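
The extra reductions described for search and analysis exports can be sketched in a few lines of Python. This is only an illustration of the idea, not FromThePage's actual export code; the function name `to_search_text` and the exact regular expressions are assumptions made for the example.

```python
import re

def to_search_text(transcript: str) -> str:
    """Reduce a transcript to one line of plain words (illustrative only)."""
    # Strip anything that looks like an HTML/XML tag, e.g. <u> or </u>.
    text = re.sub(r"</?[A-Za-z][^>]*>", "", transcript)
    # Re-join words hyphenated across a line break: "contin-\nues" -> "continues".
    text = re.sub(r"(?<=\w)-\n(?=\w)", "", text)
    # Collapse newlines and other runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()

print(to_search_text("word contin-\nues on <u>next</u> line"))
```

A verbatim plaintext export, by contrast, would keep the line breaks and the end-of-line hyphen as typed.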

Mark-up Guidelines

Line and Paragraph Breaks

A single newline represents a new line within a paragraph of text.  This line will be wrapped for readers.

A single hyphen at the end of a line represents a word-break within the text.  Readers will see the word joined together, as if it were not broken in the original.  To indicate a literal hyphen at the end of a line, add a space after the hyphen.

A double newline represents a break between paragraphs or sections of text.  The paragraphs will be kept separate when displayed, and will be represented as paragraphs if exported.
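
The line-break rules above can be sketched as a pair of small Python functions. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the real display and export code in FromThePage may handle more cases than shown here.

```python
import re

def split_paragraphs(source: str) -> list[str]:
    # Two or more consecutive newlines separate paragraphs.
    return [p for p in re.split(r"\n{2,}", source.strip()) if p]

def multipage_display(source: str) -> list[str]:
    """Wrap lines within each paragraph, as in the multi-page display."""
    paragraphs = []
    for para in split_paragraphs(source):
        # A hyphen directly before a newline marks a broken word: join it.
        joined = re.sub(r"-\n", "", para)
        # Every remaining single newline becomes a space.
        paragraphs.append(re.sub(r" *\n", " ", joined))
    return paragraphs

def tei_lines(source: str) -> list[str]:
    """Mark line breaks with <lb/>, as in the TEI export examples."""
    out = []
    for para in split_paragraphs(source):
        # Drop the end-of-line hyphen but keep the break position.
        marked = re.sub(r"-\n", "<lb/>", para)
        out.append(marked.replace("\n", "<lb/>"))
    return out

print(multipage_display("word contin-\nues on next line"))
print(tei_lines("first line\nsecond line"))
```

Because the hyphen rule only matches a hyphen immediately followed by a newline, a hyphen followed by a space is left in place, which mirrors the convention for literal hyphens.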

 

Source               Multi-page Display             Export
-------------------  -----------------------------  ----------------------------------------------------------
first line           first line second line         first line<lb/>second line
second line

word contin-         word continues on next line    TEI-XML: word contin<lb/>ues on next line
ues on next line                                    Searchable/Analytic Plaintext: word continues on next line

Headers

Surrounding a phrase with a balanced number of equals signs marks it as a section header.  Increasing numbers of equals signs indicate more substantial section breaks:

===Chapter 1===
==Section A==
==Section B==

===Chapter 2===
==Section C==
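
To make the "balanced number of equals signs" rule concrete, here is a hedged Python sketch that recognises a header line and reports its level. The name `parse_header` and the regular expression are assumptions for illustration; they are not taken from FromThePage itself.

```python
import re

# One or more "=" on each side, with the header text in between.
HEADER = re.compile(r"^(=+)\s*(.+?)\s*(=+)$")

def parse_header(line: str):
    """Return (level, title) for a balanced header line, else None."""
    m = HEADER.match(line.strip())
    # Only count it as a header when both sides have the same number of "=".
    if m and len(m.group(1)) == len(m.group(3)):
        return len(m.group(1)), m.group(2)
    return None

for line in ["===Chapter 1===", "==Section A==", "ordinary text"]:
    print(line, "->", parse_header(line))
```

In this sketch a higher level number means a more substantial break, so the chapter line above outranks the section lines.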

HTML/XML

Experienced users often add HTML tags to their texts, and FromThePage presents them as marked when displaying to users.

  • <i> The italic tag in HTML is rendered as such in display modes, stripped from plaintext exports, and translated into <hi rend="italic"> in TEI exports.
  • <u> The underline HTML tag is rendered as an underline in display modes, stripped from plaintext exports, and translated into <hi rend="underline"> in TEI exports.
  • <s> and <strike> tags are rendered as strike-throughs when displaying, stripped from plaintext exports, and translated into <del> in TEI exports.
  • <sup> is rendered as a superscript in display, stripped from plaintext, and translated into <add> in TEI exports.
  • <sensitive> is a FromThePage-specific XML tag that will display the text to anyone with the permission to transcribe a text, but hide it from web crawlers and users who are not logged in.
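
The tag handling listed above amounts to a small lookup table. The following Python sketch shows one way to express it; the names `to_tei` and `to_plaintext` are hypothetical, only the five display tags are covered, and `<sensitive>` is left out because its behaviour depends on who is viewing the page.

```python
import re

# HTML tag -> (TEI opening tag contents, TEI closing tag name), per the list above.
TEI_TAGS = {
    "i": ('hi rend="italic"', "hi"),
    "u": ('hi rend="underline"', "hi"),
    "s": ("del", "del"),
    "strike": ("del", "del"),
    "sup": ("add", "add"),
}

# Matches an opening or closing tag for any of the supported names.
TAG = re.compile(r"<(/?)(i|u|s|strike|sup)>")

def to_tei(text: str) -> str:
    """Translate the supported HTML tags into their TEI counterparts."""
    def swap(m):
        opening, closing = TEI_TAGS[m.group(2)]
        return f"</{closing}>" if m.group(1) else f"<{opening}>"
    return TAG.sub(swap, text)

def to_plaintext(text: str) -> str:
    """Strip the supported tags, keeping the words they enclose."""
    return TAG.sub("", text)

print(to_tei("my <u>very</u> dear <s>frend</s> <sup>friend</sup>"))
print(to_plaintext("<i>Ibid.</i>"))
```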

 

Tabular

Accounting ledgers and other tables can be encoded using Markdown table syntax.  See this article on table encoding for examples and more information.

LaTeX

FromThePage supports LaTeX for encoding scientific and mathematical formulae.  Enclose any valid LaTeX expression in double curly braces and "tex" tags as follows:
{{tex: expression goes here :tex}}
The LaTeX expression will be rendered as an inline image in the display.  For more information and examples see this article on encoding mathematical and scientific formulae with LaTeX.
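
For anyone post-processing exported transcripts, the delimiters make embedded formulae easy to locate. The Python sketch below pulls out each expression; the function name `find_tex` and the pattern are assumptions for illustration, not part of FromThePage.

```python
import re

# Non-greedy match between "{{tex:" and ":tex}}"; DOTALL lets a formula span lines.
TEX = re.compile(r"\{\{tex:(.*?):tex\}\}", re.DOTALL)

def find_tex(text: str) -> list[str]:
    """Return every LaTeX expression found between the tex delimiters."""
    return [expr.strip() for expr in TEX.findall(text)]

sample = r"Energy is {{tex: E = mc^2 :tex}} and area is {{tex: \pi r^2 :tex}}."
print(find_tex(sample))
```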
