Below you will find the abstracts that were submitted and unfortunately not accepted for the 2017 run of the Researcher-in-residence programme. The abstracts are in alphabetical order. If your abstract is published here and you would like to have your name posted with it, please contact us and let us know. The accepted projects and their abstracts can be found here.
We want to thank all researchers for their interesting proposals, wish them all the best for 2017 and hope to see them again in a following year!
- Deep learning OCR post-correction
- Faith in Old Age. A biographical and micro-historical study into the relation between religious beliefs and social-cultural perceptions, experiences and practices of ageing (c. 1800-1950)
- Mapping the Early Modern Dutch News(papers) (1618-1795)
- Segmentation and Categorization of Advertisements in Delpher’s Newspapers: An Eighteenth Century Feasibility Case Study
- Sound Patterns of Golden Age Theatrical Emotions in KB/DBNL’s digitised theatre plays
- Understanding petitioning behavior in the Batavian Republic (1795-1801) through enhanced access to serial government sources
- Unlocking the STCN
Deep learning OCR post-correction – dr. Janneke van der Zwaan
Humanities research makes extensive use of digital archives. Most of these archives, including the KB newspaper data, consist of digitized text. One of the major challenges of using these collections for research is the fact that Optical Character Recognition (OCR) on scanned historical documents is far from perfect. Although it is hard to quantify the impact of OCR mistakes on humanities research (Traub et al., 2015), it is known that these mistakes have a negative impact on basic text processing techniques such as sentence boundary detection, tokenization, and part-of-speech tagging (Lopresti, 2009). As these basic techniques are often used prior to performing more advanced techniques and most advanced techniques use words as features, it is likely that OCR mistakes have a negative impact on more advanced text mining tasks humanities researchers are interested in, such as named entity recognition, topic modeling, and sentiment analysis.
The goal of the proposed research is to bring the digitized text closer to the original newspaper articles by applying post-correction. Post-correction involves improving digitized text quality by manipulating the textual output of the OCR process directly. The idea is that better quality data boosts eHumanities research. Although the quality of the KB newspaper data would definitely benefit from improving the OCR process itself (i.e., improved image recognition), post-correction will still be necessary, because the quality of historical newspapers is suboptimal for OCR (e.g., due to poor paper and print quality) (Arlitsch & Herbert, 2004).
Existing approaches to OCR post-correction generally use extensive dictionaries to replace words in the OCRed text that do not occur in the dictionary with words that do (see e.g., Alex et al., 2012; Strange et al., 2014; Volk et al., 2011). Based on the assumption that a number of characters in every word will be identified correctly, words not in the dictionary are replaced with alternatives that are as similar as possible to the recognized text, possibly taking word frequencies into account to break ties. The main problem with these existing approaches is that they do not take into account the context in which words occur.
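The dictionary-based strategy described above can be illustrated with a minimal sketch. It is not the method of any of the cited papers: the toy lexicon, its frequency values, the similarity cutoff and the function names are all assumptions made for illustration, and string similarity is measured with Python's standard `difflib` rather than a dedicated edit-distance library.

```python
from difflib import SequenceMatcher, get_close_matches

# Hypothetical toy lexicon mapping words to corpus frequencies (made-up values).
LEXICON = {"krant": 120, "kraan": 15, "stad": 300, "staat": 250, "de": 5000}


def correct_word(word, lexicon=LEXICON, cutoff=0.6):
    """Return `word` if it is in the lexicon, else the most similar lexicon entry.

    Ties in similarity are broken by the higher word frequency.
    Words with no sufficiently similar entry are left unchanged.
    """
    if word in lexicon:
        return word
    candidates = get_close_matches(word, list(lexicon), n=5, cutoff=cutoff)
    if not candidates:
        return word

    def rank(candidate):
        similarity = SequenceMatcher(None, word, candidate).ratio()
        return (similarity, lexicon[candidate])

    return max(candidates, key=rank)


def correct_text(text, lexicon=LEXICON):
    """Correct every whitespace-separated token independently (no context used)."""
    return " ".join(correct_word(token, lexicon) for token in text.split())
```

Because each token is corrected in isolation, the sketch shares the weakness named above: it cannot use the surrounding words to choose between equally plausible candidates.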
Deep learning techniques provide an opportunity to take this context into account. I propose to learn a character-based language model of Dutch newspaper articles. This is a model of the character sequences occurring in the text of a corpus (see Karpathy (2015) for examples). OCR mistakes can be viewed as deviations from this model. Mistakes can be fixed by intervening when text deviates too much from the model.
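The idea of treating OCR mistakes as deviations from a character model can be sketched as follows. Note the substitution: instead of the neural model the proposal has in mind, this sketch uses a simple smoothed character trigram model, because it fits in a few lines and needs no external libraries. The class name, the smoothing scheme and the training sentence are assumptions for illustration only.

```python
import math
from collections import Counter


class CharNgramModel:
    """Add-one smoothed character n-gram model (a stand-in for a neural model)."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = set()

    def train(self, text):
        # Pad the start so the first characters also have a full context.
        padded = " " * (self.n - 1) + text
        self.vocab.update(padded)
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            self.ngram_counts[gram] += 1
            self.context_counts[gram[:-1]] += 1

    def log_prob(self, context, char):
        # One extra vocabulary slot reserves probability mass for unseen characters.
        vocab_size = len(self.vocab) + 1
        numerator = self.ngram_counts[context + char] + 1
        denominator = self.context_counts[context] + vocab_size
        return math.log(numerator / denominator)

    def score(self, token):
        """Average log-probability per character of `token`, space-padded on the left."""
        padded = " " * (self.n - 1) + token
        total = sum(
            self.log_prob(padded[i:i + self.n - 1], padded[i + self.n - 1])
            for i in range(len(token))
        )
        return total / max(len(token), 1)


# Made-up training text; a real model would be trained on a large clean corpus.
model = CharNgramModel(n=3)
model.train("de krant van de stad en de krant van het land " * 20)
clean_score = model.score("krant")
noisy_score = model.score("kr4nt")
```

A token whose score falls below some threshold would be flagged as a likely OCR error. How to choose that threshold, and how to generate a replacement, are left open here, as they are in the abstract.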
Faith in Old Age. A biographical and micro-historical study into the relation between religious beliefs and social-cultural perceptions, experiences and practices of ageing (c. 1800-1950)
In 2041, 4.7 million inhabitants of the Netherlands (26.5%) will be 65 years or older. One third will be above eighty and likely in need of care. Parallel to this development we are moving away from the welfare state towards a participation society in which ‘a strengths-based approach that encourages citizens to be more in control of their own lives, of their own communities, and eventually of society as a whole’ is needed. The outcomes of this biographical and microhistorical study will contribute on a fundamental level to that need.
In this study I will research how and to what extent small-scale life histories reflect the relation between personal religious beliefs and social-cultural perceptions, experiences and practices of ageing and caring for the aged, and whether and how these religious convictions reflected and shaped urban social-cultural ageing practice in the past (c. 1800-1950).
Although the past does not provide answers for the future, this study will open up longer perspectives on ideas and practices of ageing and ageing care and thus facilitate the construction of broader imaginative and critical perceptions of ourselves in our society and the way we (can) act today. By working together with other academic disciplines and professionals in social development this project will fuel a necessary scientific, political and public dialogue on ageing and what will motivate people in the near future to ‘reconquer the initiative’ to care for the aged from the welfare state.
In the long run it will also contribute to the understanding of what religious societies are and how they function, which has relevance for other academic disciplines as well as for society as a whole.
Mapping the Early Modern Dutch News(papers) (1618-1795)
The first Dutch newspaper was published in 1618. It marked the beginning of the rise of newspapers in the Dutch Republic. It was the recent launch of the newspaper database Delpher in 2013 which caused an increase in the research on early modern newspapers (Van Groesen, 2013; Van Groesen, 2015; Der Weduwen, 2015). Despite the digital disclosure of these historical newspapers, most research is done manually and on a small scale. It still remains difficult to determine the larger picture of the news provision in the early modern period. The big questions about the precise content, extent and origin of this news, are yet to be answered.
The aim of this project is to use Delpher as an instrument to visualize the origin and spread (both in location and tempo) of the early modern news in the Dutch Republic. By enriching metadata it becomes possible to show where the news came from, and how long the news was on the way from a certain region. By using the origin and the date of the news, it is possible to make a digital map based on early modern newspaper data.
Early modern newspapers consisted mainly of a single sheet of paper filled with foreign news. The newspaper was clearly divided into blocks per country (with the name of a country as a caption above). Utilizing this recognizable and clear layout as a filter, it is possible to show where the news originated and the percentage of space of the delimited text. This project focuses on developing a deep learning program based on computer vision techniques which can automatically determine and extract the news items from digitized newspapers.
The blocks with news (sorted by country) contain separate items which are clearly recognizable with indented paragraphs. This standard format is important for software enhanced filtering. Each item begins with the city (or region) of origin, and date. This project will develop a method which adds metadata (country, city and date) to each news item. While great progress is made on fully searchable newspapers via OCR, this project adds a valuable dimension which allows a view from above.
Ultimately, questions will be answered such as: in what period did Swedish news get more attention (in frequency and coverage)? And: from which countries did news originate during the Nine Years’ War? Furthermore, the rise and scale of transnational news circulation can be mapped. With this, for the first time, it can be clearly established where news came from. The dating of news is also very important. Combined with the publication date of the newspaper, the date of each news item answers the important question of how long news was in transit. On the basis of the origin and date, a digital map with a timeline will be created. Another, more experimental version of this map will also be created, which uses time instead of distance as the measure between locations.
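The metadata step described in this abstract (country, city and date per news item, plus the time the news spent in transit) could be sketched roughly as below. Everything specific here is an assumption: the dateline pattern "City den DAY MONTH", the modern Dutch month spellings, the field names and the sample sentences are hypothetical, and real early modern datelines vary far more. The sketch also ignores the fact that Julian and Gregorian calendars were in use side by side in this period.

```python
import re
from datetime import date

# Simplified lookup with modern Dutch month names; historical spellings differ.
MONTHS = {
    "januari": 1, "februari": 2, "maart": 3, "april": 4, "mei": 5, "juni": 6,
    "juli": 7, "augustus": 8, "september": 9, "oktober": 10, "november": 11,
    "december": 12,
}

# Hypothetical dateline shape: "<city> den <day> <month>" at the start of an item.
DATELINE = re.compile(
    r"^\s*(?P<city>[A-Za-z' -]+?),?\s+den\s+(?P<day>\d{1,2})\s+(?P<month>[A-Za-z]+)",
    re.IGNORECASE,
)


def parse_item(text, country, published):
    """Extract city and date from an item's dateline and compute days in transit.

    `country` comes from the block caption and `published` from the issue date.
    Returns None when no usable dateline is found.
    """
    match = DATELINE.match(text)
    if match is None:
        return None
    month = MONTHS.get(match.group("month").lower())
    if month is None:
        return None
    day = int(match.group("day"))
    try:
        sent = date(published.year, month, day)
        if sent > published:
            # News dated late in the year but printed early in the next one.
            sent = date(published.year - 1, month, day)
    except ValueError:
        return None
    return {
        "country": country,
        "city": match.group("city").strip(),
        "sent": sent,
        "delay_days": (published - sent).days,
    }
```

Records of this shape (origin, date sent, delay in days) are what a map with a timeline, or a map that uses time rather than distance, would be drawn from.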
Segmentation and Categorization of Advertisements in Delpher’s Newspapers: An Eighteenth Century Feasibility Case Study
This project aims to arrive at a better level of segmentation and categorization for individual advertisements in Delpher’s newspapers, with a particular focus on the eighteenth century collection. In this way, this invaluable historical resource of advertisements can be studied in much greater detail than is currently the case. This is extremely important for historical researchers, because the entire market economy passes by in these advertisements. They contain essential information about the flow of many goods and services in the Dutch Republic and subsequently the Kingdom of the Netherlands, which cannot yet be accessed in detail on such a big scale.
Currently, search queries for any product or service will give results that are not obviously relevant, because the nature of an advertisement is not disclosed in the search results. The smallest entities in Delpher are sections of advertisements, with metadata that have little to say about individual advertisements. Sections of advertisements can hold between 1 and 25+ advertisements, and the diversity of material within these sections is massive. Therefore, users still have to check the relevance of their results manually, because a query can occur in any kind of advertisement.
With a better level of segmentation and categorization, individual advertisements are recognizable as separate entities. In this project, both shape and content of the advertisements are used to arrive at a better level of segmentation. By making better use of markers for relevance on a page, such as indentations or capitalized words, it is possible to split segments of advertisements into smaller entities, that have meaning on their own. Once technical feasibility has been fully established, then advertisements can be studied, clustered, enriched and re-used in superior ways. For the purpose of categorization, a library of predefined categories of advertisements will be created, which allows users to narrow down to a baseline of relevant advertisements much quicker.
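A minimal sketch of this idea, using indentation and capitalized opening words as markers and a small keyword library for categories, might look like the following. The marker rules, the category names, the keywords and the sample text are all invented for illustration; the sketch also assumes that the OCR text preserves leading whitespace, which may not hold for the actual Delpher data.

```python
import re

# A line opening with a word of three or more capitals is treated as a new advertisement.
STARTS_WITH_CAPS = re.compile(r"^[A-Z]{3,}\b")

# Hypothetical category library: category name -> keywords (made-up examples).
CATEGORIES = {
    "auction": ("verkoopen", "veyling"),
    "book": ("boekverkoper", "gedrukt"),
}


def is_ad_start(line):
    """Heuristic: a new advertisement starts at an indented or capitalized line."""
    indented = line.startswith((" ", "\t"))
    return indented or bool(STARTS_WITH_CAPS.match(line.strip()))


def split_ads(section_text):
    """Split one advertisement section into individual advertisements."""
    ads, current = [], []
    for line in section_text.splitlines():
        if not line.strip():
            continue
        if is_ad_start(line) and current:
            ads.append(" ".join(current))
            current = []
        current.append(line.strip())
    if current:
        ads.append(" ".join(current))
    return ads


def categorize(ad):
    """Return every category whose keywords occur in the advertisement."""
    lowered = ad.lower()
    return [
        name for name, keywords in CATEGORIES.items()
        if any(keyword in lowered for keyword in keywords)
    ]


# Invented sample section with two advertisements.
section = (
    "  MAKELAARS zullen verkoopen een party\n"
    "Drogeryen en Rhabarber.\n"
    "  By J. Jansz, Boekverkoper, is gedrukt\n"
    "een nieuw Boek.\n"
)
ads = split_ads(section)
```

Plain keyword matching is only a baseline. As the next paragraph of the abstract notes, the contents of categories overlap, so a real categorizer would need more than a keyword list.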
As a test case for this approach, a specific yet recognizable category of advertisements will be used: eighteenth century advertisements for auctions of drug components. These advertisements contain essential information about the early modern drug trade, but their contents overlap with other categories of advertisements: it is hard to find solid search queries to isolate this category of advertisements from others. Labelling these advertisements on the basis of a predefined category makes it possible to analyze them in greater detail, and to arrive at a valid thesis about the drug trade. This is of crucial importance to understand the mechanisms of the premodern medical marketplace: many aspects that receive substantial attention from scholars (clinical testing, prescriptive procedures, preparation of remedies and so on) require understanding of the import and availability of raw materials.
Thus, this project will clarify the commercial dimension of early modern medicines, as a test case for developments of the market economy as a whole.
Sound Patterns of Golden Age Theatrical Emotions in KB/DBNL’s digitised theatre plays
Sound Patterns of Golden Age Theatrical Emotions: the development of a tool to reveal the correlation between phonological patterns and emotions in KB/DBNL’s digitised Theatre Plays, to unravel the aural elements of historical texts.
What did emotions in the Dutch theatre sound like in the Golden Age? Did comedies sound different to tragedies? Did angry men on stage sound different to angry women? How did queens in love sound in relation to servants in love?
My PhD project researches the role that phonological patterns play in the expression of emotions in early modern Dutch theatre plays, and the way analysis of phonological patterns can contribute to new methods of author identification.
With a quantitative approach to modelling sounds and emotions, the project includes 200 digitised Dutch theatre plays provided by the Digital Library for Dutch Literature (DBNL), covering the entire early modern period in the Netherlands from 1570 to 1800. The project will result in a historical sound pattern timeline, which fits in with the results of the Historic Embodied Emotion Model (HEEM; Leemans et al. 2015, Leemans et al. forthcoming 2016), a product of the emotion mining project conducted by the Amsterdam Centre for Cross-disciplinary Emotion and Sensory Studies (ACCESS). The selection of plays involved in my research corresponds to the corpus of the ACCESS research group. With HEEM, ACCESS has created a new technique of sentiment mining. STAGE (“Sounds in Theatre plays featuring Golden Age Emotions”) adds a new element by associating emotions with phonological patterns and by bringing together the history of emotions with (historical) phonology, musicology, history of theatre and computational linguistics.
Counting and analysis of phonological patterns will help reveal how the history of emotional expressions on stage has evolved. In addition the data this tool generates could open other opportunities for research on sound patterns in texts. Furthermore, as the tool has modular construction, this enables its application to related projects in the field of computational linguistics, (historical) phonology and ‘distant reading’ in a wide range of texts.
The supervisors are Prof. Inger Leemans (Vrije Universiteit Amsterdam) and Prof. Karina van Dalen-Oskam (Universiteit van Amsterdam, Huygens-ING).
Keywords: History of Emotions, History of (Dutch) Theatre, Historical Phonological Patterns, Machine Learning, Open Access Tool, Digital Humanities
Research Question: How did the expression of emotions on stage evolve in the Netherlands during the Golden Age? How do phonological patterns relate to historical emotional expressions on stage? My PhD project aims to reveal the role phonological patterns play in the expression of emotions in early modern Dutch theatre.
Understanding petitioning behavior in the Batavian Republic (1795-1801) through enhanced access to serial government sources
For scholars working on the last decades of the eighteenth century, the Early Dutch Books Online (EDBO) dataset is an invaluable corpus of source material.
Delpher and Nederlab provide adequate access to the literary works, political writings and other conventional texts in this dataset. There is, however, at least one important source type for which the present search options of these tools are not optimal. As EDBO contains virtually the entire printed output of the revolutionary Batavian Republic, it also includes the many multi-volume proceedings of local, provincial, and national representative bodies that were printed in order to ensure a maximally transparent government. For historians, the Dagverhaal der handelingen van de Nationaale Vergadering and other such serial government publications are immensely rich sources that are also notoriously tough to work with. It is my conviction that the accessibility of this source type could be greatly improved by an approach more comparable, but not identical, to that already applied to other datasets, such as KB Kranten and Staten-Generaal Digitaal.
As a researcher-in-residence I therefore intend to build on my experience in working with this source type to create, in close collaboration with the KB digital humanities team, customized search options in Delpher. I propose a multifaceted approach with a primary focus on the use of automatic segmentation to separate the daily or weekly instalments in which these serial sources were published, the sessions of the representative bodies, and the deliberative elements of which each session was made up. If users can be enabled to search only the deliberative elements that are relevant to them in the proceedings of multiple governmental bodies at the same time and if their search results can be sorted by date or session, this opens up a whole new realm of research opportunities.
As for my own research, I want to deploy the enhanced searchability that should result from this project to address a set of research questions concerning the petitioning behavior of Dutch citizens during the Batavian Republic. Between 1795 and 1801 citizens petitioned all levels of government, as they had done in the old regime Dutch Republic but on a much larger scale and often with a more overtly political agenda. The description and discussion of petitions in the proceedings of various government bodies, the inventorying of which will be made manageable by this project, provide insight into how citizens related to local and supra-local contexts and how they came to terms with the great ideological and institutional transformations of their day. These questions are at the heart of my current research project, The primacy of local belonging. Private papers, petitioning, and periodical press, 1747-1848.
In the long run, I consider my contribution to meeting the objectives set out in this proposal an investment in new research and teaching initiatives.
Moreover, the benefits of this project could become greater in the future as the knowledge and skills gained from it might eventually also be applied to other serial sources in the EDBO dataset.
Unlocking the STCN
The aim of the project is to use the Short-Title Catalogue, Netherlands (STCN) as an instrument for an easy-to-use tool to visualize trends in the history of the book in the sixteenth and seventeenth century based on a SPARQL generator.
The STCN is the national bibliography of the Dutch printed book up to the year 1800. It is a catalogue with over 204,000 titles. But the STCN is more than a catalogue. It is an overview of the printed book in the sixteenth and seventeenth century in the Netherlands, in which all sorts of data lie hidden that can give us insight into how the book changed in these centuries. The STCN is available online, but would benefit from additional ways to consult its underlying data to gain insight into larger trends. For example, it can be used to show changes in the book in a specific genre or a specific decade, to observe typographical developments, or to find the most active printer or author based on location or year. It is possible to use the STCN as a research tool via the Advanced Search function, but this often comes down to tallying.
In the proposed project, the STCN will be used to create an accessible (RDF) dataset and a tool (SPARQL generator) for researchers, students and other interested parties. This tool will be an easily accessible platform in which the user can request information about trends in the printed book. Recently it has become possible to answer questions based on the STCN with the help of the RDF query language SPARQL. However, this is a difficult language to master for occasional users. Last year, the Koninklijke Bibliotheek (KB) offered an STCN SPARQL workshop for researchers. Despite the high turnout of interested researchers, working with SPARQL proved to be too difficult for most participants. The proposed tool will make working with SPARQL easier.
With the help of a SPARQL generator, the tool gives users the option to combine selected variables to answer their questions in just a few clicks.
The output can be used to display information visually, in charts or plots.
This digital humanities project will increase the potential use of the STCN.
Trends and changes in the book, which now have to be dug out of the STCN, will soon be just a few clicks away, allowing for a better understanding of the emergence of the printed book. The project will be more than a plaything for data mining the STCN hosted by the KB. This experimental SPARQL generator can be used for other (KB) projects with RDF data or the Semantic Web.