Methodology

Suzette

Digitizing Suzette

Our strategic approaches to encoding the teacher’s edition of Suzette have evolved significantly since 2019. We balanced experimental approaches with the following project goals:

Capture the differences between the text that teachers would have used in their classrooms and the text that students would have read
Document and describe the structure of the printed text (e.g., form words, the locations of illustrations on pages)
Connect lexical references within the text, placing glossed definitions in line with the referring terms
Identify the places in each chapter where the Classification is invoked
Distinguish and tag persons and places (both real/historical and fictional)
Anticipate more complex conceptual moments in the text that could be drawn upon for advanced analysis

Process

Work was undertaken in phases. Multiple team members contributed suggestions about approaches and made discoveries through close collaborative reading that informed further pathways. This phased process also proved to be a valuable pedagogical experience for the students involved, and several have become something of ‘subject experts’ on education and French society.

Image to Machine-Readable Text:

The first copy of the Suzette text was a PDF of a microfilm scan available through the Bibliothèque nationale de France. Early attempts at hand transcription proved slow. We were then introduced to the Transkribus text recognition engine, which helped us speed up the transcription proofing process (although the source PDF was difficult for the available models to ‘read’). Eventually we gained access to a physical copy of the teacher’s edition, which made that process more reliable (although the 19th-century printed text still had to be checked). Through Transkribus we were able to export multiple versions of the machine-readable text: in addition to a TEI format, we exported an OCR’d PDF and MS Word files.

Customization

Encoding is done using Text Encoding Initiative (TEI) standards. Starting with a TEI-All schema, we customized our schema to support a French-language constrained vocabulary for tagging, which helped us shape the semantic material we hoped to analyze. We resisted the desire to create project-specific elements and attributes, focusing instead on a value-based customization. This proved to be a helpful decision, as we were able to ‘tweak’ the encoding where necessary by swapping out TEI elements (e.g., exchanging ‘div#’ elements for ‘div’ elements). In only one case did we feel hampered by the lack of a TEI element: for our sixth strategic goal (tagging more complex concepts that refer to the steps involved in food growing/gathering/preparation/serving that constitute ‘events’ within the text), we would have been well served by an element like <objectName>. The TEI Technical Council is fortuitously adding that element later in 2023, so we will be able to use it in the next phase of the project.

The Suzette schema is available here, and we invite people to adopt or adapt it for their own use.
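As an illustration of what a value-based customization enforces, the sketch below checks that an attribute only uses values from a closed French-language list. It is a minimal approximation in Python, not the project's schema: the attribute (`type` on `div`), the allowed values, and the sample markup are all hypothetical, chosen only to show the idea.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical closed value list for div/@type; the project's real
# vocabulary is defined in its schema and is not reproduced here.
ALLOWED_DIV_TYPES = {"chapitre", "lexique", "questionnaire"}

# Hypothetical sample: the second div uses an English value on purpose.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <div type="chapitre" n="1"><p>...</p></div>
  <div type="chapter" n="2"><p>...</p></div>
</body>"""


def find_invalid_types(xml_text, allowed):
    """Return (n, type) pairs for div elements whose @type is not allowed.

    A div with no @type at all is also reported, with None as its value.
    """
    root = ET.fromstring(xml_text)
    invalid = []
    for div in root.iter(f"{{{TEI_NS}}}div"):
        value = div.get("type")
        if value not in allowed:
            invalid.append((div.get("n"), value))
    return invalid


if __name__ == "__main__":
    for n, value in find_invalid_types(SAMPLE, ALLOWED_DIV_TYPES):
        print(f"div n={n}: unexpected type {value!r}")
```

In practice this check is done by the schema itself during validation in an XML editor; the script only makes the principle visible outside that environment.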

Code-Sharing

Logistically, we used GitHub to support collaborative encoding and versioning, with one main repository and each team member working in their own fork. This helped to ensure that we could ‘proof’ and troubleshoot encoding as it was undertaken by multiple team members over time. We also stored the encoded files in different directories: some tracked the process from structural to semantic encoding, others split the 144 chapters into batches of different lengths to facilitate group encoding, and still others held versions of the text exported in .txt and .csv formats for analytical purposes.
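The plain-text exports mentioned above can be produced in many ways. The following is one possible sketch using only Python's standard library; it assumes each chapter is a top-level `div` with an `n` attribute, which may not match the project's actual encoding, and the sample text is invented for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Invented sample markup, not taken from the edition.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <div type="chapitre" n="1">
    <head>Titre un</head>
    <p>Premier   paragraphe avec <placeName>Paris</placeName>.</p>
  </div>
  <div type="chapitre" n="2">
    <head>Titre deux</head>
    <p>Second paragraphe.</p>
  </div>
</body>"""


def chapters_as_text(xml_text):
    """Map each div's @n to its text content with tags removed.

    Whitespace runs are collapsed to single spaces. Nested divs are not
    handled specially: their text would appear in the parent as well.
    """
    root = ET.fromstring(xml_text)
    chapters = {}
    for div in root.iter(f"{{{TEI_NS}}}div"):
        raw = "".join(div.itertext())
        chapters[div.get("n")] = " ".join(raw.split())
    return chapters


if __name__ == "__main__":
    for n, text in chapters_as_text(SAMPLE).items():
        print(n, text)
```

Each value could then be written to its own .txt file for use in a text analysis tool.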

Tools and Platforms:

Initially all team members used the Oxygen XML code editor for encoding and processing. Beginning in 2022 we experimented with the LEAF-Writer web-based platform, which proved capable of supporting the project's encoding needs. We also used Voyant for tokenized text analysis and ArcGIS to generate spatial visualizations of places referred to in the text. Lately we have experimented with OpenRefine to clean our extracted person and place data.

Tagging Entities and Linked Open Data:

As our encoding became more sophisticated, and as LEAF-Writer offered more features, we expanded our internal authority structure to include externally recognized named entities that aligned with the linguistic and temporal specificities of the text. We discovered that we could capture Wikidata entities with French labels and descriptions for persons and places, and use LEAF-Writer’s entity lookup function to record that information in attributes, such that:

<persName key="XX" ref="Q##">XX</persName>

<placeName key="YY" ref="Q##">YY</placeName>
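To show how attributes of this shape can feed later cleaning and mapping work, here is a minimal Python sketch that collects tagged persons and places into CSV rows. It assumes `ref` holds either a bare Wikidata identifier or a full URI; the sample names and identifiers (Q7226 for Jeanne d'Arc, Q90 for Paris) are illustrative and not drawn from the edition, and the function names are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
WIKIDATA_ENTITY = "http://www.wikidata.org/entity/"

# Illustrative sample; the entities are not taken from the edition.
SAMPLE = """<p xmlns="http://www.tei-c.org/ns/1.0">
  <persName key="Jeanne d'Arc" ref="Q7226">Jeanne d'Arc</persName> ...
  <placeName key="Paris" ref="Q90">Paris</placeName>
</p>"""


def extract_entities(xml_text):
    """Return one dict per persName/placeName, persons first, then places."""
    root = ET.fromstring(xml_text)
    rows = []
    for tag in ("persName", "placeName"):
        for el in root.iter(f"{{{TEI_NS}}}{tag}"):
            ref = el.get("ref", "")
            # Assumption: a bare Q-identifier is expanded to an entity URI;
            # a value that is already a URI is kept unchanged.
            if ref and not ref.startswith("http"):
                ref = WIKIDATA_ENTITY + ref
            rows.append({
                "type": tag,
                "key": el.get("key", ""),
                "text": "".join(el.itertext()).strip(),
                "uri": ref,
            })
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["type", "key", "text", "uri"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(extract_entities(SAMPLE)))
```

A table like this can be loaded into a data-cleaning tool to reconcile spelling variants before the place rows are mapped.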

Furthermore, as we developed prosopographical information about the historical persons referred to in the text, we realized that there was valuable information connected to the curriculum related to occupation, avocation, and social status. Working with colleagues at the LINCS Project, we considered ways in which our list of occupations could align with their occupation vocabulary. Currently the LINCS vocabulary is in English, but we are in discussions to contribute to a French-language translation using the Suzette terms.

Developed for Collaboration

In undertaking this digital edition of Suzette, we made a conscious decision to make the project truly bilingual, and ultimately French-first (our next task will be to translate the few remaining English-language pages like this one). We believe that our approach to encoding, to capturing and sharing information in public spheres, and to collaborating with other projects working on similar or adjacent research subjects will be of value to a broader community of scholars than might previously have been possible. With humility, we are happy to share what we have learned and the materials we have developed; we also look forward to learning from others who have undertaken this kind of digital humanities work.