Conversations at the Washington Library Podcast: Transcribing From The Page with Sara and Ben Brumfield

On April 29, 2021, Ben and Sara Brumfield sat down with Jim Ambuske, Ph.D., who leads the Center for Digital History at the Washington Library, to record an episode of Conversations at the Washington Library. In this podcast, Jim discusses how the Washington Library in Mount Vernon used FromThePage to create work-from-home transcription solutions for their team members. At the time of this recording, project collaborators had made almost 9,000 page edits and contributed over 400 research notes.

Jim Ambuske: Hey, everyone. I am Jim Ambuske, and this is Conversations at the Washington Library. When the COVID pandemic struck last spring, thousands of cultural heritage sites, including the Washington Library in Mount Vernon, had to find ways to help team members do work from home. That wasn't always easy, especially as so much of our normal work requires a physical presence. One of our solutions at the library was to use this time to transcribe the voluminous correspondence of Harrison Dodge, Mount Vernon's Superintendent in the late 19th century. To do that, we turned to a digital platform called From the Page. From the Page is a crowdsourcing transcription tool that allows users to transcribe historical documents from the comfort of their own homes.

Jim Ambuske: Since last March, for example, our Dodge Project collaborators have made nearly 9,000 page edits and contributed over 400 research notes, so on today's episode, you'll meet Sara and Ben Brumfield, the creators of From the Page. Inspired by their involvement in Wikipedia's early days and hoping to find ways to transcribe treasured family heirlooms, the Brumfields set out to create a way for people, including those of you listening right now, to go out and really transcribe the past. Check out our show notes or go to FromthePage.com to find out how you can join our crowdsourced transcription project, but before you do, let's transcribe From the Page with Sara and Ben Brumfield.

Jim Ambuske: I guess we should mention at the top of the podcast here, Ben and Sara, that I don't think I've ever done this. In the interest of full disclosure, Mount Vernon does have a subscription relationship with From the Page, but our relationship actually goes back quite a few years now to when I was at UVA. Why don't we start from a kind of a 30,000 foot level because we're here to talk about From the Page, which is a research tool that you've designed to help crowdsource transcription of historic manuscripts? That gives away a little bit, but not the whole thing. Why don't we talk more about what it is and what it's designed to do?

Sara Brumfield: From the Page is our collaborative transcription platform. People get to work together to transcribe generally handwritten, occasionally typewritten, historical documents. When we say collaborate, that can mean a lot of different things to different people. We have research groups that work together to transcribe and work on documents and documentary editions, and we have classrooms and teachers that work on using it for pedagogy and teaching. During COVID times especially, a lot of staff work on transcribing materials. It gives them something that they can do from home when they're not allowed to go into the office, but by far, our most popular use is the public, working with institutions to transcribe their historic documents.

Ben Brumfield: Fundamentally, it's very simple. We don't use machine learning. We present an image of the page to a user and a place for them to type what's on the page. Those two things together make magic, but the magic is done by humans.

Sara Brumfield: One of the things that's kind of different about how our platform works is that we're very inspired by Wikipedia and they have this ethos of collaborative knowledge creation that not one person by themselves is creating knowledge, but that one person creates a version of knowledge and then you kind of iterate and improve on that. Built into our platform is this idea that you can transcribe a page, but then someone can come behind you and read it and review it and make updates to it. Maybe they can read the handwriting better than you can, maybe they know the technical terminology. There's a lot of reasons why four, five, 10 people working on the same page actually produces a better transcription than one individual.

Ben Brumfield: It's fun to watch them as they work discuss different terms that they see. They'll encounter very strange looking things. Then someone will go off and do research and say, "Oh, okay. Well, someone who rode in on The Owl, it turns out that The Owl was an overnight sleeper train from San Francisco to Los Angeles, so that word really is owl, but it's a reference to a train."

Sara Brumfield: Yeah. When we were developing From the Page, we were very inspired by Wikipedia's model of collaborative knowledge creation, and this idea that one person can come in and do a first revision of an article, in Wikipedia's case, or a transcription in our case, and then other people can come behind them and review it. Maybe they can read the handwriting better or maybe they understand the technical vocabulary that's on the page. Two, three, 10 people can review the same page. They have a place where they can interact in notes on that page, and we think that produces a better transcription than just one person working by themselves.

Ben Brumfield: Yeah, in isolation.

Jim Ambuske: All right. A minute ago, you mentioned magic, so maybe we should go into the origin story of From the Page. Why From the Page? When did you initially conceive of this tool and what were your aspirations for it?

Ben Brumfield: This began as a personal project, and the inspiration for it was a project my father did in 1991. When my great-great-grandmother died, she had kept a diary for the last 20 years of her life, and those were distributed to all of her grandchildren, and they went off to the four winds. My father transcribed one of these in WordPerfect 5.1 and printed out a number of copies and passed them around to family members and neighbors in the small county in Virginia, Pittsylvania County.

Sara Brumfield: Probably a lot of your listeners are from Virginia.

Ben Brumfield: It became this hit. Whenever we'd go back there, complete strangers would come up and say, "You did a wonderful thing by typing this up." You'd have 80-year-old women who would say that whenever they had a bad day, they would open up this diary and read entries of this other 80-year-old woman who would get up and feed the chickens and feed the hogs and milk the cow, and then do some quilting and make dinner, and all these amazing things. They would get inspiration from that in their days. We also had people coming up saying that they'd found mistakes on family gravestones because they could see in the primary sources when someone died. Then if the gravestone says they died three days later, well, we know that's not quite right. I wanted to do the same thing with a separate copy of the diary.

Sara Brumfield: A different year.

Ben Brumfield: Yeah, a different year. Right. And I started to try to do that by myself and I discovered that I didn't know the people who were involved, I didn't know anything about early 20th century tobacco agriculture. I was just in way over my head. And around that time, Sara and I had gotten involved in the very, very early days of Wikipedia back then, and the Wiki model of multiple people being able to collaborate on the same text was what I thought would be a good solution because you have the ability for one person who's a good typist to type what they can, and then someone else who perhaps is elderly, rural, doesn't have a good internet connection, can come in and say, "Oh, well, that's stripping tobacco. Here's what's happening," so those kinds of things.

Sara Brumfield: We need to add here that we're both computer scientists by trade, have degrees in computer science, so hammer, nail, right?

Ben Brumfield: Right. It turns out that building a software system to-

Sara Brumfield: To do this. Yes.

Ben Brumfield: A tool to let other people collaborate to do this was a lot easier than doing the historical research by myself. What I really wanted was the ability for people to see the page as they were transcribing the text because editing systems already existed for collaboration, but once you severed the transcript from the digital image of the text, you have no way to verify what's going on. You don't have the provenance; you don't have that...

Sara Brumfield: ...context.

Ben Brumfield: ... transparency, the context, any of that. That's why we wrote From the Page.

Jim Ambuske: It's fascinating, right, because it seems like at base level, it's a simple thing. That you would want to have that image directly next to the transcription field, so that you could do that kind of verification. I used to work at the Washington Papers Project back in my graduate school days, and if somebody had to check the copy text or something like that, they'd have to go to the files, which were in a vault... it's not a vault, but it's a section in Alderman Library where they'd have to go and they'd have to get the Xerox copy of that copy text, bring it back to the desk, and look down and up, look down and up, as they were checking what they had done, but they could not see it side by side. I imagine then it made it a whole lot easier to begin transcribing your... is it your great-great-grandmother?

Ben Brumfield: Yes, yes. It was. The collaboration worked in ways that I was not expecting. The people who had originally... who I knew well, who originally volunteered to work on this with me, they didn't end up doing more than a few pages, but they passed the document on to other people. There was a distant relative who had retired early with some health problems and was stuck at home by herself. She would plow through these pages, and she was able to find other copies of the diaries, other years, and get them scanned, and send them to me to put online. At one point, she would scale back what she was doing. I noticed she went from 10 pages a day, to five pages a day, to two pages a day because she was afraid that she would run out of material to work on and was waiting for me to post more. We also saw other people who did searches for their own name.

Sara Brumfield: Right. There's a gentleman who lives in Virginia by the name of Nat Wooding, and he was doing a Google search for his name, and he ran across his name in a transcribed page of one of these diaries. It turns out that his uncle, great uncle, I think it was his great uncle was the diarist's mailman, so he shows up in the diaries every once in a while. I think her husband or son often helps him fix his car because I guess if you're a mailman-

Ben Brumfield: Right, it breaks down.

Sara Brumfield: That's very important to have a working car. He was a mechanically minded person. Nat found this, and he started working on the documents, and he became a very large contributor.

Ben Brumfield: Yes.

Sara Brumfield: In fact, when those diaries were done, he's like, "Well, what else do you got? I really like doing this type of work." That's what we've seen with a lot of volunteers is that they really enjoy the process of immersing themselves in these historic documents.

Ben Brumfield: He switched over to work on herpetology notebooks at the Museum of Vertebrate Zoology, and just plowed through those as well. Since then, he's become a super transcriber volunteer for the Library of Virginia, so he does a lot of the Virginia Supreme Court case transcriptions in the 19th century.

Sara Brumfield: Yeah. It's neat.

Ben Brumfield: It's neat to watch people develop over time and-

Sara Brumfield: Right. To see people move from project to project as one wraps up, and to realize there's really an ecosystem of people who enjoy doing this type of volunteer work.

Jim Ambuske: Well, as you said, you have a computer science background, and you were in the early days of Wikipedia. Can we talk a little bit more about that? I'm interested in that connection between Wikipedia and From the Page. Wikipedia is so ubiquitous these days that it's hard to conceive of a moment when it was very small. How did you actually become involved in that project, and then, what lessons were you taking away that you realized later laid the groundwork for something like From the Page?

Sara Brumfield: I think an example of how early we were in Wikipedia's history is I created the first page for San Antonio, Texas. Austin had pages because there are lots of tech consult people in Austin, but San Antonio just a hundred miles down the road didn't.

Ben Brumfield: Yeah. I mean, it really was kind of a joke. Most articles were...

Sara Brumfield: ...a paragraph

Ben Brumfield: ...or less. I did a lot of edits on the article on tobacco. This was something that I thought was interesting, which is that it's really hard if you give people a blank slate and call them to do something, and say, "Please write this article for free for thought," but if you present them with something that is half done or that is wrong, well, then they'll jump in. I looked at Wikipedia and I thought, "This is a joke. This is ridiculous," and then I read an article on tobacco that described it as a vine. I said, "Well, it's not a vine." Immediately I had to go in and fix that. This desire to fix things that are wrong really... it's this very strong, human thing. Whether that means we're all kind of bloody minded nitpickers, or whether we're concerned with truth and the right way to do things, I'm not sure.

Sara Brumfield: There's also the conversations that happen around Wikipedia articles, so I know for a while Ben especially would just...he would watch the talk sections for ones he was very passionate about, so he could make sure nobody was doing anything wrong. Right?

Ben Brumfield: Yes, that's right. Make sure nobody messed things up.

Sara Brumfield: We took that ethos and brought it into From the Page. You have the page image and you have the transcription, but then underneath them you have a place for notes and comments. We wanted to place that within the context of the page because you want to be able just to look up and say, "Oh, right there is what they're talking about. What is that word again?"

Ben Brumfield: A lot of that inspiration came from Pepys Diary online, which was a retro blog of Samuel Pepys' diary that would just post an entry per day, and there'd be 30 or 40 comments by people doing all this additional research and some of it was kind of speculative gossip about his sex life and what he's going to do next. Other people would go find contemporary entries by other diarists and post those. That kind of collaboration was amazing. Trying to put all these things together and tie them to the page image that you had, transparency and provenance was really the goal.

Sara Brumfield: What sort of things are we seeing with that, because we do see that? We watch the comments come in for different projects on From the Page. I think probably one of the most interesting ones, one of our very early projects, was lighthouse keepers' logs from the Yaquina Head Lighthouses off the coast of Oregon.

Ben Brumfield: Yeah, I remember that one.

Sara Brumfield: Okay. Yeah. We talked about it a lot in the early days because of the lighthouses. Everyone loves lighthouses. They have these keepers' logs that they had actually digitized from the National Archives and put them online and had people work on transcribing them. There was this one entry.

Ben Brumfield: The entry was... all these entries are maybe two lines total, and this entry starts off with weather observations, and then says that the lifeboat from, name of the ship, came to shore at the lighthouse and the assistant lighthouse keeper carried the survivors to the doctor, while the head lighthouse keeper helped bury the captain's body. The person who transcribed this two-line entry went off and found a contemporary newspaper report and found that there was this coal ship 200 miles off the Oregon coast that had exploded, and the survivors were stuck on a lifeboat for two weeks offshore before they finally washed up. That's an amazing discovery that people would not have found otherwise if some interested, motivated volunteer hadn't transcribed that and said, "Wait, wait, what is that about? Let's go look that up and find out more about this lifeboat."

Jim Ambuske: Golly. That's nuts.

Sara Brumfield: Yeah. Something that pops up all the time, but you have to do a lot of work to happen to stumble across one. Right?

Ben Brumfield: Right. It's not always... the things that we find aren't always people going off and doing research. Sometimes it's just analysis, an emotional connection with the material. We saw in a comment yesterday. There's a project out of-

Sara Brumfield: University of the South, Sewanee.

Ben Brumfield: Sewanee, right, that's working with a set of... with a convict leasing program in the 1870s, which is this absolutely horrible program in which people who are convicted, primarily African American men, are leased to this terrible coal mine and-

Sara Brumfield: For hard labor. Right?

Ben Brumfield: ... for hard labor. Just someone yesterday noticed as they were transcribing, this person is 12 years old. There's a 12 year old boy who is sentenced to three years hard labor. His measurements, he's not even five feet tall and he is stuck working in this coal mine. Just those stories and that connection that people that are transcribing this material have with it, even if there's no additional research, that's really powerful.

Jim Ambuske: Oh, yeah. I mean, it's a horrific example, but it's also a terrific example of how you can create those connections with the past in really deep and meaningful ways, and take with it and run with it, and dig up all that kind of information that most people either may not have seen before or not had an opportunity to if this stuff had not been digitized really.

Ben Brumfield: Yes. Absolutely.

Jim Ambuske: It's funny you talk about comments because I watched those come in, in our project, and one of the things that we're working on is we've had some volunteers transcribing the letter books of a man named Harrison Dodge, who was the Superintendent of Mount Vernon, essentially the president or the groundskeeper, I guess. Groundskeeper is not the right word. Superintendent of Mount Vernon in the late 19th and early 20th century. We've got a couple of very passionate volunteers who really love Dodge, and it's fun to watch them talk to each other in the comments. They're asking themselves can you please verify this word or the work that Dodge is being asked to be done may indicate that there was some development on this part of the grounds and whatnot.

Jim Ambuske: It's been really fascinating to watch, and they're pointing out the various things like books that Dodge mentions because this is a period in which the Mount Vernon Ladies Association is very actively engaged in trying to recover "relics" that had been distributed amongst the Washington family members in the 19th and early 20th century. Trying to figure out where those pieces are.

Jim Ambuske: Then, going back, the example you talked about at Sewanee, it's also been very helpful in thinking about what happened to the enslaved people's quarters around the site in that period. The real connection between the architectural history of the landscape and how it was modified over time, which has been very useful in thinking about how this was a community of people, primarily enslaved people, and what their lives looked like from a functional, practical standpoint.

Ben Brumfield: Yeah. We've seen some projects that go deep on that kind of material. There's one working with the Cameron Family papers out of University of North Carolina, working with an early set of store ledgers. The storekeeper on the plan... this is a plantation store, kept two sets of books. One for transactions with free customers, and one for transactions with enslaved customers. You can correlate the same entries and the same days and ask questions like if an enslaved person is sent in to buy something on behalf of a free customer, are they also conducting business on their own at the same time. Just fascinating, fascinating stuff.

Jim Ambuske: Oh, that's really cool. I want to see that project.

Sara Brumfield: It's Anna Agbe-Davies, who's an anthropologist and an archeologist at UNC.

Ben Brumfield: Yeah. It's also interesting because she really focused on the specific material things that people are buying.

Jim Ambuske: Oh, that's cool.

Sara Brumfield: Look, they bought leather and something else, and then three days later came back in and sold a pair of shoes to the store. All of a sudden, you can see industry and commerce going on.

Jim Ambuske: Oh, yeah. There's various ways you can push that even further. Right? You can begin to map out where they're coming from. You may be capable, if some of that location data's in there, of figuring out what other plantations they are coming from and look at that entire network then.

Sara Brumfield: Mm-hmm.

Ben Brumfield: Yeah. That's exciting stuff.

Jim Ambuske: What was the state of crowdsourcing in general when you first spun up From the Page? What did it look like? I was aware of a few projects back in that period, mainly the Jeremy Bentham project out of London, but that's the only major one I can think of around that time period. What did it look like at that point?

Ben Brumfield: There wasn't much. I mean, we got started in 2005, and announced the project in 2008. When we went public, there was a lot of interest, but not much institutional interest. There were very few people at libraries, archives and museums who wanted to open up to the public the ability to correct their material or to contribute, make those kinds of contributions. It really took a long time for the idea to be accepted.

Sara Brumfield: I think the Smithsonian, the Smithsonian Transcription Center was the first really big, really successful, very public facing transcription project.

Ben Brumfield: One of the problems was that I guess around 2006, the word crowdsourcing was coined, and it was coined in the context of systems like Mechanical Turk, very explicitly with the idea of outsourcing, which is you fire all your paid staff and you replace them with the crowd, which is not something that staff members at institutions want to hear, and it's also not something that would work because you can't run a successful crowdsourcing project by replacing staff members with volunteers. It's going to go nowhere. It's not really possible. That was an additional...

Sara Brumfield: ...barrier.

Ben Brumfield: ... barrier to adoption. It doesn't make existing projects cheaper; it makes some impossible projects possible.

Sara Brumfield: This was work that wasn't happening. Right?

Ben Brumfield: Yeah.

Sara Brumfield: I mean, you have projects that do transcription for documentary editions and things like that, like the Washington Papers you referenced earlier, but most of the stuff was never going to get transcribed because it wasn't important.

Ben Brumfield: Right. Most library special collections, most archivists, they're dealing with boxes of unprocessed material, and they're trying to go through with less process, more product. That's the goal right now, which means you're not going to stop and sit down and type every single letter in every single folder in every single box when there's hundreds of boxes waiting to just have basic finding aids written.

Sara Brumfield: You can kind of hear us starting to transition the language we're using to talk about these projects, from research focused projects and individual projects, to institutional ones. That's the trend that we've seen over the years that we've been working is that individuals are willing to take this risk and do this work early on, and then institutions start embracing it.

Ben Brumfield: Right.

Jim Ambuske: When we first started chatting a few years ago when I was at UVA Law Library, and that was a special collection, so I'm very aware of cataloging backlogs and things like that. It may be disappointing to some folks out there, but most of what archivists do is just trying to keep up with the bloody pile they've got. I mean, it's just an unrelenting effort to catalog everything. I'm wondering then, as you started to see that institutional shift, was that one a more general acceptance of crowdsourcing in general, but was it also a function of the fact that, hey, from an institutional standpoint, if we make the investment of getting stuff digitized, then at least we can not necessarily expedite cataloging, but at least offer content to our patrons in ways that we just can't take on ourselves right now?

Ben Brumfield: That's a good question. I'm not sure it's a matter of offering more content but offering different kinds of content. A lot of that comes down to findability. With the advent of things like Google searches, you have people who are searching for text, and we have the ability to find material if it's been transcribed, but these institutions that have done all this work scanning from the late 80s on, those scanned documents were still just pixels, and they couldn't be found, they couldn't be searched. In a sense, there's a backlog of material that's already had the labor put into it to digitize it and scan it, and even put it online, that could be made more valuable this way and allow the institutions to almost do the scanning in parallel with a public project of doing the transcription to make findability easier.

Ben Brumfield: This gets back to cataloging. You think about the example of the person who does a vanity search and finds the diarist's mailman. What catalog, what finding aid is going to list a document and list everyone who's mentioned within it down to the diarist's mailman?

Sara Brumfield: No one will do that. Right?

Ben Brumfield: It wouldn't even be very useful in a list format.

Jim Ambuske: Yeah. I know, that makes total sense because you only have so much time and you've just sort of got to get the basic, bare bones of what the collection is down on paper, and then move on.

Sara Brumfield: What we have seen is that as people start spinning up some of these transcription projects and start building a volunteer corps and do this work, all of a sudden, there's this phrase that Sonya Coleman, who is at the Library of Virginia, has coined: feeding the beast. Right? You have these voracious volunteers who want to do this work, and all of a sudden, your digitization program becomes in support of your transcription program as opposed to the other way around. Kind of fascinating. Right? It's great. It's a wonderful problem to have, to have so many volunteers who want to work on your material that you're in a race to keep them happy.

Jim Ambuske: I love that phrase, feed the beast. Yeah, Sonya is great. I worked with her a little bit, and that's a very apt description. They've got it down there at the Library of Virginia, a very robust program going on right now, where they are just shoveling coal into that fire as fast as they can.

Ben Brumfield: That's right.

Sara Brumfield: Yep.

Ben Brumfield: They've been doing this for many, many years. We're only one of three transcription platforms that they use. I mean, they are really pushing out, they...

Sara Brumfield: Yes. They do a lot.

Jim Ambuske: Well, I want to say publicly thank you to all those volunteers because I do use LVA stuff for my research from time to time, and it is extremely helpful.

Sara Brumfield: Yep. Yep. Yeah, and you're not the only one. We see academic researchers looking at documents that have been transcribed by volunteers, but the other real big place that this is valuable is for family historians. We have a lot of projects that are not... the documents are not as intrinsically interesting as letters or diaries, but they're more World War I service cards where you've got a card for every single service member, in this case, let's see, the State of Alabama, now I think Indiana has...

Ben Brumfield: Indiana as well. Right?

Sara Brumfield: ...one like this. They collect five fields off of this card, and then they're able to re-index them in their digital asset management systems, so that you can find your family members from a particular county and see where they enlisted and where they served and what their vital statistics were. That's pretty powerful if you're a family historian.

Ben Brumfield: I know we've seen that with vital records; the Maryland State Archives is working on marriage certificates. We've seen it with Indianapolis Public Schools where you have just a list of students' names and who their parents are and where they came from when they enrolled and where they're going when they de-enrolled. That's a genealogical gold mine.

Jim Ambuske: Oh, yeah.

Sara Brumfield: Not very interesting to transcribe, but people do it because they're very motivated because of the genealogical value and they can see that value.

Jim Ambuske: Well, and we're talking about these kinds of records. Those are very structured records. They've got tables and fields and things like that. Makes you wonder then about From the Page's early days because you would have to build in that kind of functionality to facilitate that kind of work. When you rolled out From the Page at the beginning, what did it look like? Was it geared toward particular kinds of records, or did you have these things in mind already?

Ben Brumfield: It was geared towards early 20th century printed diaries. All of our-

Sara Brumfield: Not printed.

Ben Brumfield: Not printed diaries, but you go to a store and you'd buy a diary for 1916, and every page would have a date heading at the top of it, and maybe 20 lines to write on. In the early days, it really was that the affordances of the tool were shaped by that experience.

Sara Brumfield: Julia Brumfield's diaries, fundamentally.

Ben Brumfield: Right. We have had to do a lot of work to support things like correspondence because the way that you navigate 20 documents that are each 365 pages long is very different from the way that you'd navigate a thousand two-page documents. Very different experience, and also, a lot of our assumptions about whether, say, a page is a meaningful unit for analysis. Really, we had to rethink a lot of those when we started working with different kinds of material.

Sara Brumfield: Then, things like the World War I service cards were actually a collaboration between From the Page and the Council of State Archivists. The Alabama Department of Archives and History came to us, and they're like, "Hey, the World War I centennial's coming up, we have all these cards. We'd really like to do a project where we transcribe them and index them, and your tool is the closest we've found, but it really doesn't quite do what we need it to do." We were able to work with them and they provided funding from a number of different state archives coming together, which I think is a really neat model for collaboration to move a digital humanities tool forward. We were able to do what we call Field Based Transcription. One document gives you one record with as many fields as you want to configure.

Ben Brumfield: It's the sort of thing that you'd want to pull out as a spreadsheet rather than as the kind of thing you'd print out and read and-

Sara Brumfield: Right, textual documents.

Ben Brumfield: Right.

Jim Ambuske: Yeah. Then, you take that and do all kinds of interesting data analysis with those records as well. Right?

Ben Brumfield: Yes.

Jim Ambuske: Yes. Absolutely.

Ben Brumfield: We've also done a lot more work to support non-English documents and more difficult encoding. Pretty early on, there were some collaborators at Fordham that came to us with a couple of projects that were working with old French legal texts, which are of a lot of interest to people who don't necessarily know how to read old French. Being able to pull this material into a system where you could transcribe it, but then also translate it, was important. We did some similar work with a project at Fordham working with some Aztec codices, with a codex which is written entirely in Nahuatl, which is great, but needs to be translated for access to most people who don't speak Nahuatl.

Jim Ambuske: Are there people doing the translation, then, or have you found a way to do a programmatic translation?

Ben Brumfield: No. No. We have humans doing translation. Right?

Jim Ambuske: Okay.

Ben Brumfield: Yeah, and this is really important. Computers should help us out, but they shouldn't replace us. Maybe I'm biased; I'm a human.

Jim Ambuske: Yeah. Well, just like your Spanish language exam in college. If you get caught using Google Translate, you're in trouble.

Ben Brumfield: When I was in college, they didn't have Google Translate. Wasn't an issue.

Jim Ambuske: It sounds like though that this has been a model for you where someone who approaches you with a problem that From the Page can't immediately solve, but then they're willing to work with you to make improvements to the tools, so that they can facilitate that work.

Sara Brumfield: Yes. We sort of fell into that model early because we were running From the Page but a lot of our... When we started building it, we were still working day jobs, and then we slowly transitioned over time to doing consulting work in the same kind of digital humanities area. It made it easy for us to say, "Oh, yeah. You want to build something on our platform, yes, we can do that." Then, in the last year or so, we've gotten to the point where we're doing hardly any consulting work, and we're just focusing on From the Page, but those collaborations with other institutions to help fund new features are really awesome because we get to, we like that. Right? We have the cash that lets us do that instead of building someone else's software, but everything that we build gets rolled into From the Page, and then it's both available on FromthePage.com, our software as a service, but everything we build is also available as an open source software tool.

Sara Brumfield: We do a lot of collaboration with University of Texas at Austin. We're in Austin, we happen to know them personally, so they recently received an NEH grant to do a number of enhancements. We did a translation of our interface, so not the documents that we're working on, but the software, into Spanish and Portuguese because they have a lot of colonial documents, and they do a lot of collaboration with folks in Mexico. They've got the Puebla story...

Ben Brumfield: Yeah. The archives of Puebla and Cholula are material that's half Nahuatl, half Spanish, and the interface needs to not be in English.

Sara Brumfield: That's been a great collaboration. It's also going to in this coming year give us some additional export formats, like PDFs or Word or we're still talking through what those are, but sustaining software is really hard. Software breaks, and if you don't have people who can fix it, it's challenging, so trying very hard to be a platform that you can contribute to helping keep going means that From the Page can keep going and doesn't degrade like many, many projects happen to.

Ben Brumfield: It also provides a lot of efficiency for our projects in the humanities that typically don't have big budgets. If you're the British Library who needs an Arabic transcription platform, well, you can build an Arabic transcription platform from scratch and that costs...

Sara Brumfield: Will take you a year.

Ben Brumfield: Yeah, it will take you a year or more and it will be somewhere in the six digits or in the case of the British Library, they were able to throw three digits at us, and we were able to figure out what was needed to make our existing platform support Arabic. Yeah.

Sara Brumfield: Two months later, they had something that worked, and that's pretty awesome.

Jim Ambuske: Everybody wins in the end, too. As you said, it gets integrated into From the Page, but then it's also open source.

Ben Brumfield: Yes.

Sara Brumfield: Mm-hmm. Yep.

Ben Brumfield: It's one of the wonderful things from our perspective of working in this world rather than working in industry where your customers might be competing with each other and if they fund one project, their contract may say that no one else gets to use that. Only they get to use that, but that's not what cultural heritage and libraries, archives, and museums are like. They tend to be really excited to be able to help other people out.

Jim Ambuske: Why don't we talk a little bit about how a user actually uses From the Page, as best we can because nobody can see us right now because we're not a video podcast, and it does make you wonder what the point of a video podcast is if it's a podcast, but that's a topic for another discussion. How does someone interact with it? What do you do?

Sara Brumfield: If you're a transcriber, and you go to FromthePage.com, most of that page is aimed at people who might want to run projects, but at the very top, there's a place to sign in and you can sign up as a transcriber.

Ben Brumfield: It even says, "Sign up to transcribe."

Sara Brumfield: Sign up to transcribe. There's a link up there as well that says, "Find a project." When you click on the Find a Project page, all of the projects on From the Page that are public, that have enough un-transcribed material that they still need new volunteers, and that have put enough effort into describing, so we like to have a picture and a short description, and those three things together, we'll put them all on our Find a Project page. A volunteer can scroll through that looking for things they might be interested in. They can also use like a search bar, so if you're interested in colonial documents, you could search for colonial and what would pop up would be so the Harvard University Colonial North America Project, which has everything from the colonial era, all of their libraries, so their theology library, their science library, the Radcliffe or Schlesinger Library for their women's stuff. I think there's like 15 different libraries at Harvard that all have colonial era material and it's all part of this project.

Ben Brumfield: You'll also get the North Carolina State Archives-

Sara Brumfield: Yeah, one of the state archives.

Ben Brumfield: ... colonial court records.

Sara Brumfield: Then, if you click into one of those projects, there's a button that's "Start Transcribing" and it will take you to the first page of a document that hasn't been worked on yet, if there are any. Or, if there are no documents that haven't been worked on recently, it will take you to the next page of the least recently worked-on documents, so that we try to keep people from stepping on each other, but because we have this very collaborative model, it is possible. You get a warning if you try to transcribe a page somebody else is working on.
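
The "Start Transcribing" behavior Sara describes can be sketched roughly as follows. This is one reading of her description under stated assumptions, not From the Page's actual implementation: the Document class, its fields, and the pick_start_page function are all hypothetical names, and the sketch assumes pages within a document are transcribed in order.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Document:
    """Hypothetical stand-in for a document in a transcription project."""
    title: str
    page_count: int
    pages_done: int = 0                 # pages already transcribed, counted from the start
    last_edit: Optional[float] = None   # timestamp of the latest edit; None means untouched


def pick_start_page(documents: List[Document]) -> Optional[Tuple[str, int]]:
    """Return (document title, 1-based page number) for a new volunteer, or None."""
    unfinished = [d for d in documents if d.pages_done < d.page_count]
    if not unfinished:
        return None  # nothing left to transcribe
    # Prefer a document nobody has worked on yet, starting at its first page.
    untouched = [d for d in unfinished if d.last_edit is None]
    if untouched:
        return (untouched[0].title, 1)
    # Otherwise continue the document that has gone longest without an edit,
    # which makes it less likely that two volunteers land on the same page.
    stalest = min(unfinished, key=lambda d: d.last_edit)
    return (stalest.title, stalest.pages_done + 1)


docs = [
    Document("Letterbook A", page_count=3, pages_done=1, last_edit=200.0),
    Document("Letterbook B", page_count=2, pages_done=1, last_edit=100.0),
]
print(pick_start_page(docs))
```

Steering volunteers toward stale material only reduces collisions; it does not prevent them, which is why the warning Sara mentions is still needed.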

Jim Ambuske: Can you give us a sense of what steps the owner has to go through to get a project spun up? Then, maybe we can talk about what's the end game once they're satisfied that their documents have been transcribed adequately, what do they do with it then?

Ben Brumfield: We try to make the process of getting material onto the crowdsourcing platform and getting the crowd contributions, getting the transcripts back out as simple as possible, which means we have a lot of different on-ramps. I'm not going to go through all of them, but most simply someone would upload a zip file full of folders of images, they could also upload a PDF.

Sara Brumfield: They can use a standard called IIIF. You just copy and paste a URL in, there's lots of different ways. Right?

Ben Brumfield: Right. You pull that material into a new project that you would create, you'd configure the project a little bit with a title and a description and the great user-friendly things that Sara mentioned. You'd also define transcription conventions. This is really important because different projects use different conventions. It's really important that transcriptions are consistent within a project and that they match those descriptions. The kind of fundamental rule of scholarly editing is: define your rules, and then follow them. Those transcription conventions get displayed to users transcribing as they go. Then, once material is transcribed, they can export and download as zip files and PDFs... I'm sorry, not PDFs.

Sara Brumfield: Not PDFs, yeah.

Ben Brumfield: Zip files, HTML-

Sara Brumfield: TEI.

Ben Brumfield: All kinds of different options.

Jim Ambuske: Yeah. Do you ever hear of anybody going rogue in terms of their volunteers? They're just like, you know what, I'm going to transcribe according to my own conventions?

Ben Brumfield: We do actually, and that has happened on a project that I ran working with some material from the 1820s in which I had not updated my transcription conventions. They still used the old 1918 conventions I had come up with for the first project. I had a volunteer who started plowing through those. She was a descendant of the diarist in this case, and she at one point said, "You say to modernize capitalization, but I've done some research and that's not the right thing to do. Here's what I'm going to do." That was a case in which she was right. We revised the transcription conventions to reflect proper practice, and went back and did a little bit of revision, and-

Sara Brumfield: The system allows for both of those things. You can revise your transcription conventions, and you can revise your transcriptions.

Ben Brumfield: More common you have volunteers who don't necessarily go rogue, but they encounter gaps in the instructions. They come and they say, "Oh, well, this bit of text is crossed out. What do I do about it? You're not giving me any guidance, so I'm going to try doing this. I'm going to try using HTML tags or I'll just put a little parenthetical note that says line 43 crossed out, or something."

Jim Ambuske: Yep, and I've had that as well, where most of the stuff that we're working on is letters, correspondence, but occasionally a table will pop up and I think when we initially started the project, somebody was like, "What do we do with these?" I'm like, "We're just going to put those aside for now and we're going to figure that out later because that's tabular data." I mean, you could do... yeah, you can do HTML and I think this gentleman knew how to do it. I was like, "You know how to do that, man, you go for it, but otherwise, just skip that part. We'll go from there."

Sara Brumfield: We also support Markdown for tables, but it's a beast. Right?

Jim Ambuske: Yeah.

Sara Brumfield: Doing it in any of these formats is not easy.

Ben Brumfield: Table encoding is hard.
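
For a sense of why table encoding is laborious by hand, here is a small Python sketch that renders rows as a Markdown pipe table. Pipe tables are a widely used Markdown extension; whether From the Page accepts exactly this syntax is an assumption, and the function name and the sample ledger rows are invented for illustration.

```python
def to_markdown_table(header, rows):
    """Render a header and rows as a Markdown pipe table (a common extension syntax)."""

    def fmt(cells):
        # Escape pipes so cell text cannot break the column structure.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), fmt(["---"] * len(header))]
    for row in rows:
        # Pad short rows with empty cells and drop extras so every row lines up.
        padded = list(row) + [""] * (len(header) - len(row))
        lines.append(fmt(padded[: len(header)]))
    return "\n".join(lines)


# Invented ledger rows; the second row is missing its cost.
print(to_markdown_table(["Item", "Cost"], [["Nails", "$1.50"], ["Rope"]]))
```

Every cell needs its delimiters in the right place and every row needs the same number of columns, which is easy for a program and tedious for a volunteer typing from a handwritten page.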

Jim Ambuske: Yeah. Well, what's nice though is you've got a level right where everybody can do it, and then if people want to, they can take advantage of what you have to offer to do something really complicated if they want.

Ben Brumfield: Right. That's actually one of our goals, and this is... so there's a concept in web development called progressive enhancement. You go to a website, and you're on a text only browser, for example. You should be able to see something that makes sense to you. Maybe you don't get the pretty picture, but if you go to a website and you have a browser that has... supports pictures but doesn't support JavaScript. You should still be able to read the news article or something. That all of these additional features should be enhancements.

Ben Brumfield: We like to think that that might be possible with transcription and text encoding, that if you approach a project and all you're comfortable with is typing in plain text, so long as someone is willing to go behind you and add the additional markup that might be desired by the project, that's okay. You can make a really valuable contribution that way. If what you want to do is mark up the entities that are mentioned within a project, all the names of people, then great. Do that, too, but we shouldn't require people to master XML and coding and all these other things, in order to contribute.

Jim Ambuske: Exactly. Well, as you both know, we are still dealing with the COVID pandemic, and thinking about what contributions people have made, what opportunities they've had to develop their own skill sets using From the Page, and thinking about institutions, like Mount Vernon, because one of the reasons we signed up was because we needed to create a digital project that would help support our staff and keep them engaged when we were still thinking that it would be a two week shutdown. What kind of use have you seen since COVID reared its ugly head last year?

Sara Brumfield: We ran these numbers just the other day. The number of visitors to our site year over year is-

Ben Brumfield: Tripled, I think.

Sara Brumfield: Yeah.

Ben Brumfield: Tripled or quadrupled. Yeah.

Sara Brumfield: We spend a lot of time working on performance problems, which you will never see other than when it doesn't work. Right?

Ben Brumfield: Right.

Sara Brumfield: Keeping the site up and running when you start growing that fast has its own set of challenges.

Ben Brumfield: Right. When you go from maybe 300 people working on the site per day, to 1,000 people working on the site per day, that requires some stretching. We also have seen a usage pattern that we never really saw before, which is institutions using the tool for remote work. If you are a university library and you've got students on a work-study scholarship whose job is to come into the library and take special material from special collections at the library, and scan them on scanners at the library, and suddenly the students are stuck at home and need to do remote work...

Sara Brumfield: ...in order to keep their scholarships.

Ben Brumfield: How is that going to happen? If you have hourly employees that are working at state archives, how are they going to work from home when there's nothing that has ever been part of the workflow in their state systems? We've really had to push hard to enable a lot more private projects or restricted projects in which, say, only staff members or only students are able to contribute. We've also done a lot more with workflow management, trying to make sure that the students make one pass, and we need to be able to lock that down, so that staff members can take a pass to review what they've done. We've done a lot with time sheet recording. We're getting emails we never would have gotten before from people who are volunteers who need a record of their hours for their probation officers.

Sara Brumfield: Probation officers.

Ben Brumfield: ...for example. They've got court appointed community service that they can't do in person anymore. Enabling e-volunteering is really important.

Jim Ambuske: That's amazing. I mean, I had a little bit of a sense because our volunteers have to record hours, I think it... I'm not sure why exactly. I think it just gets factored into whatever reports that the higher-ups produce for the board and whatnot. I'm not even sure if there's any kind of tax implication or whatnot, but it never occurred to me that there could be instances in which people are mandated by law to do something, either community service as you just pointed out, who then can use something like this to satisfy those requirements and not get penalized for it.

Ben Brumfield: Right. It's also neat though to see, on a little happier note, projects happening that I think have the potential to transform the in-person experience once the in-person experience is open again. There's a project that's just starting at the Boston Public Library, that is engaging their volunteer docents, their tour guides of the big historic building on Copley Square, who have all been stuck at home and not been able to do any... guide any tours, to transcribe the trustees' minutes from when that building was being planned, in 1885 to around 1900. You've got these volunteers who are intimately familiar with the physical space, and they are looking at the architectural discussions and they're transcribing those. These are the people who understand that building and understand that really well. They're going to be going through as they transcribe and finding all these what-ifs of things that were discussed, dropped, all the discussions and back stories behind this space-

Sara Brumfield: Where material came from, how much should it cost.

Ben Brumfield: Right. All this stuff, and not only are they the perfect people to be transcribing this material, but when all this is over and we're back to normal and those volunteer docents start giving tours again, just imagine that wealth of knowledge that they will have acquired and what they'll bring to that tour experience.

Jim Ambuske: Oh, that's really fascinating. Yeah, they'll make a much richer tour experience for the people who come and see those sites. It's funny, too, because a lot of times individuals who are in those positions or similar positions, they don't necessarily have the time to do the kind of research that they probably would like to unless they're doing it on their own time, but then to have the opportunity to sit down and actually stare at these things on a consistent basis over a longer period of time is probably going to change the way that they do things I would imagine.

Ben Brumfield: It's also possible that it changes the conversation with the institution. If you have a group of people who are giving tours, and one of them goes off and does all this extra research, and there may not necessarily be a channel for feedback to pass that up or to pass that to the rest of the tour group, but if all the guides are doing this together, then there's this really focused conversation happening that is pretty exciting.

Jim Ambuske: Yeah. It gets back to that idea of the collaborative model that you've been talking about over the course of our chat. Where do you see From the Page going next? What's on the horizon? You've talked about a couple of things you're working on at the moment, but what's down the road and what do you hope to achieve in say, I don't know, the next five years?

Sara Brumfield: The next year is interesting because we have actually a ton of this collaborative feature work that we referred to earlier, so the Council of State Archivists is funding what we call ledger style transcription, so picture documents that look a lot like spreadsheets. We have not been able to support those because they're hard to transcribe, they're hard to help your transcribers keep track of where they are, they've got many rows that you have to set up, so it's challenging. We're working on building that right now.

Sara Brumfield: They've also got quality control work that we're going to be doing for that, and then to allude to something you mentioned earlier about a lot of special collections work just being trying to keep up with the describing the flood of stuff that you have or that's coming in. One of the features that Alabama in particular wanted was what we call metadata description, so instead of, or in addition to, transcribing a document, also describe the document. Give us a paragraph about what it is. That's going to be an interesting experiment to see how that works.

Ben Brumfield: Right. It's especially relevant for correspondence collections where if you've got someone who's transcribed the document, asking them who's the sender, who's the recipient, what are the dates here, it's a very easy thing for them to answer because they've just been in that document. We think probably this year the most transformative thing will be the ledger style transcription because that is something that so many records of importance are in. We're working with, say, the voter registration rolls from the 1867 election in Alabama, which is the first free election in which you have freed African American males voting. That's really, really powerful, and so many other records in government archives, but also in natural sciences, social sciences come in that tabular format and there's really not a good option out there for people to transcribe that.

Jim Ambuske: Yeah. That's like the holy grail. Ledgers are a pickle.

Ben Brumfield: Yes.

Sara Brumfield: Yes. They are hard. I think when we look five years, it gets harder to figure out. I have a lot of possibilities in my head, and a lot of it is kind of our first thing is to grow the platform and to make it even more stable and to make sure it can handle all of these different types of manuscripts, which is a lot of what will be happening in the next year, year and a half, but once you get beyond there, there's interesting questions about, okay, well what about audio. Do you want to do audio? Would you do it From the Page, would you do it in a different system? I have some ideas. We'll have to see where we are when we get there.

Sara Brumfield: Photograph identification and description is a different beast because you might have a set of a hundred photographs that you ask people to look at, and really what you want is for them to just slide through them on an iPad or something. When they see one that they recognize, then to go deep and give you a good description, but maybe that's only three out of a hundred. That's assuming they know the area and the people and whatever you're trying to identify. That's a different type of crowdsourcing that maybe we get to, I'd like to get to.

Ben Brumfield: There's also more prosaic things like watching users' workflow and trying to figure out what is frustrating for them, what makes them productive, what makes the task easier? Looking at the people who sign up and maybe they never transcribe a page, and asking, "Well, why? What's going on there?" That's something that is going to be constant over the next five years, just trying to make sure that people are able to do the best work they can and that the tool helps them rather than getting in the way.

Jim Ambuske: Well, let's check back in in five years and see where you are.

Sara Brumfield: Okay, that sounds great. Sounds great.

Jim Ambuske: Well, Sara, Ben, thank you very much. It's funny, I think as I said to you in an email, we've known each other for a while, but we've never actually talked about any of this, so I was very delighted that you said you would come on because I was very curious. Now, I feel like I'm fully informed.

Sara Brumfield: Great.

Jim Ambuske: Well, take care. We'll see you in five years, and stay safe in Texas.

Ben Brumfield: It's been a pleasure. Thank you so much for having us on.

Sara Brumfield: Thank you, Jim.

Jim Ambuske: Thanks for listening to Conversations, a production of the Center for Digital History at the Washington Library. This episode was hosted and produced by me, Jim Ambuske, with editorial assistance from Jeanette Patrick and support from Mount Vernon's media and communications department. Our music is Witches Brew by CK Martin. Be sure to rate and subscribe to Conversations wherever you get your favorite podcasts. To find out more, please check us out at GeorgeWashingtonPodcast.com. Thanks, and we'll see you next time.

Listen to this podcast episode here, and find all episodes on GeorgeWashingtonPodcast.com.