This post is written by Dr. Jiyin He – Researcher-in-residence at the KB Research Lab from June – October 2014.
Being able to study primary sources is pivotal to the work of historians. Today’s mass digitisation of historical records such as books, newspapers, and pamphlets provides researchers with the opportunity to study an unprecedented amount of material without the need for physical access to archives. Access to this material is provided through search systems; however, the effectiveness of such systems seems to lag behind that of the major web search engines. Some of the things that make web search engines so effective are the redundancy of information, the fact that popular material is often considered relevant material, and the fact that the preferences of other users may be used to determine what you would find relevant. These properties do not hold or are unavailable for collections of historical material. In the past 3 months I have worked at the KB as a guest researcher. Together with Dr. Samuël Kruizinga, a historian, I explored how we can enhance the search system at the KB to address the search challenges historians face. In this blogpost, I will share our experience of working together, the system we have developed, as well as lessons learnt during this project.
A Historian at Work
Samuël summarises the research approach of a historian as a four-stage procedure.
(1) Exploring. In this stage, a researcher has an initial research idea. With this initial idea in mind, he explores the target domain and literature in order to arrive at a preliminary research question. At this stage, the researcher also starts to explore possible primary data sources that can be used.
(2) Contextualisation. Given the preliminary research question, the researcher conducts historiographical analyses and further explores the availability of data sources. By the end of this stage, the researcher arrives at a refined research question.
(3) Operationalisation. Given the refined research question, the researcher formulates his theory or model, and decides on the primary sources from which he will search for materials to answer his research question. Based on his theory or model, the researcher formulates a set of sub research questions. These sub research questions form the sufficient or necessary conditions to answer the original research question.
(4) Execution. In this stage, the researcher actually starts to search the data sources that were selected in the previous stage. For each sub research question, the following steps are taken:
- Search each data source with source-specific queries. That is, these queries are specific with respect to the type of content, metadata, organisation, etc., of the source collection. Note that this is not necessarily done with a single search system, or even with digital tools.
- By analysing the retrieved materials, the researcher attempts to answer the sub research question, and evaluates the impact of the answer on the original research question.
- If the results of this stage are not satisfactory, the researcher goes back to stage (3).
The stages sketched above vary in search style, information needs, and material sought, hence different digital tools might be developed to best support the historian at each stage. For instance, in the first two stages, there is a need to explore, overview, and compare different data sources to help the researcher discover possible data sources and select the ones that may contain useful information. In later stages, the focus moves towards locating detailed information objects (e.g., articles, images) relevant to solving specific research questions. In this project, we focus on the latter, i.e., on providing support in exploring and accessing information in a particular data source, namely the historical Dutch newspapers.
The prototype system
One misconception about the role of digital tools in humanities research is that they should hide the complexity of the data selection or analysis techniques from the user. Samuël describes the role of digital tools as: supporting researchers to locate and gain access to potentially relevant materials, while allowing the researcher to select, digest, and interpret the materials. He stresses the risk of blindly following the results and analyses generated by digital tools, and his request for our system’s functionality can be summarised as: control and transparency. That is, the researcher should have control over the ways to search and explore a collection, and simple but understandable operations are preferred over complex black-box algorithms.
The data used in our project consisted of the KB historical newspaper collection covering the First World War (WWI) and the interwar years (1914 – 1940). This material aligns with Samuël’s research interest of comparing collective memories of WWI as represented in the newspapers of different regional groups, e.g., “Landelijke” and “Nationale / Lokaal”.
Elasticsearch (ES) provides the basic framework for our retrieval system. News articles were retrieved from the KB data API, processed and then stored in an ES index. In total 55,639,628 articles were indexed. More details about the indexing process with ES are provided later in this post.
One way of providing users with greater control over their search is to provide a richer query language than keywords. In particular, we decided to implement the following features: (1) Boolean query operations. This includes specifying terms that “must occur”, “should occur but not necessarily occur”, or “must not occur” in the retrieved articles. (2) Wildcard queries and proximity queries. (3) Filtering on multiple time ranges and newspaper selections.
Note that features such as wildcard and proximity queries are already supported by Lucene, the underlying search engine of KB’s Delpher system (as well as the basis of Elasticsearch). However, this support is rather implicit: users may not be aware of what is possible, and may not be able to use the query language defined by Lucene. Here, we make the possible operations explicit, including explicit instructions on how wildcard and proximity queries should be constructed, as shown in the screenshot on the right.
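To make the three query features more concrete, here is a minimal sketch of how such a request could be expressed as an Elasticsearch query body. It is not the code of our prototype: the function name build_query and the field names "text", "date", and "paper_title" are assumptions made for illustration, and the "filter" clause inside "bool" follows the current Elasticsearch query DSL, which may differ from the version we used in 2014. Wildcards (beurs*) and proximity expressions ("beurs krach"~5) are passed through in Lucene query syntax.

```python
import json


def build_query(must=(), should=(), must_not=(), year_ranges=(), newspapers=(),
                text_field="text", date_field="date", paper_field="paper_title"):
    """Build an Elasticsearch request body (field names are hypothetical)."""

    def text_clause(expression):
        # query_string accepts Lucene syntax, e.g. beurs* or "beurs krach"~5
        return {"query_string": {"query": expression, "default_field": text_field}}

    bool_query = {
        "must": [text_clause(e) for e in must],          # must occur
        "should": [text_clause(e) for e in should],      # should occur, optional
        "must_not": [text_clause(e) for e in must_not],  # must not occur
    }

    filters = []
    if year_ranges:
        # Several time ranges: an article passes if it falls in any one of them.
        filters.append({
            "bool": {
                "should": [
                    {"range": {date_field: {"gte": f"{start}-01-01",
                                            "lte": f"{end}-12-31"}}}
                    for start, end in year_ranges
                ],
                "minimum_should_match": 1,
            }
        })
    if newspapers:
        filters.append({"terms": {paper_field: list(newspapers)}})
    if filters:
        bool_query["filter"] = filters

    return {"query": {"bool": bool_query}}


if __name__ == "__main__":
    body = build_query(
        must=["beurs*"],
        should=['"new york beurs"~5'],
        must_not=["voetbal"],
        year_ranges=[(1929, 1931), (1935, 1936)],
        newspapers=["De Telegraaf"],
    )
    print(json.dumps(body, indent=2))
```

Keeping the time ranges and newspaper selections in filter clauses means they restrict which articles are returned without influencing the relevance score, which fits the wish to keep each form of control separate and visible.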
To provide greater control over the results displayed, we added two sorting options alongside the default ranking criterion (i.e., by relevance scores), namely sorting by date and sorting by article length. Sorting by date provides a quick means to locate articles on specific dates. The argument for sorting by article length is that longer articles are likely to contain important content while news articles consisting of a few lines are generally of less value.
Interestingly, in many retrieval tasks (e.g., microblog search, ad-hoc search), document length and date have been combined with the relevance scores for the ranking of retrieved results. In our case, however, it was preferred that the different ranking criteria are kept separate, and that the control of ranking criteria is transparent and flexible to the researcher.
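As an illustration of keeping the ranking criteria separate, the sketch below attaches exactly one explicit sort clause to an Elasticsearch request body instead of blending date or length into the relevance score. The helper with_sort and the field names "date" and "article_length" are assumptions for the example, not the names used in our index.

```python
# One explicit sort clause per user-visible option (field names are hypothetical).
SORT_OPTIONS = {
    "relevance": [{"_score": {"order": "desc"}}],
    "date": [{"date": {"order": "asc"}}],
    "length": [{"article_length": {"order": "desc"}}],
}


def with_sort(body, option="relevance"):
    """Return a copy of the request body with the chosen sort clause added."""
    if option not in SORT_OPTIONS:
        raise ValueError(f"unknown sort option: {option!r}")
    return {**body, "sort": SORT_OPTIONS[option]}


if __name__ == "__main__":
    request = {"query": {"match": {"text": "beurskrach"}}}
    print(with_sort(request, "length"))
```

Because the original body is copied rather than modified, the same query can be re-issued under a different ordering when the researcher switches the sort option.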
In Web search, query suggestion is a common means of assisting users in issuing better keyword queries. In our case, users can also constrain their queries by time range and by newspaper selection. Supporting query formulation therefore requires suggestions of query words combined with suggestions of appropriate values for these constraints. To this end, we implemented a temporal topic preview widget: while typing a query, users see named entities from the matching documents in the form of term clouds along a timeline. This widget is intended to help users in three ways:
- To determine the interesting time periods.
- To identify entities related to the original query which can be used for query reformulation.
- To compare topics discussed in different types of newspapers.
The methods to generate entity clouds for a specific year and for a given selection of newspapers are described next.
Entities. For each entity cloud, we select the top 10 most significant entities from the articles within that period and from the selected newspapers. Entities were extracted using KB’s named entity recognizer and prestored in the index.
Entity selection. To select the top 10 entities, we take the following steps.
- We consider a foreground and a background document set. The foreground set consists of articles that contain the query words and are in the selected period and newspapers. The background set consists of all articles in the collection. Our goal is to select entities that are representative of the foreground set (e.g., occur frequently in it) in contrast to the background set.
- We compute two types of conditional probabilities: the probability that the given entity is “generated” by the foreground set (p), and the probability that it is generated by the background set (q), using a language modeling approach. We then compute the Kullback-Leibler divergence between the two probability distributions KL(p||q).
- Entities within the foreground document set are ranked in descending order of their KL divergence scores.
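The selection steps above can be sketched as follows. This is a simplified illustration rather than the prototype's code: it assumes each article is represented as a list of its entity strings, estimates p and q by maximum likelihood from entity counts, floors q with a small constant (eps) instead of proper language-model smoothing, and scores each entity by its own term p * log(p / q) in the KL sum. The function name rank_entities and its parameters are hypothetical.

```python
import math
from collections import Counter


def rank_entities(foreground_docs, background_docs, top_k=10, eps=1e-9):
    """Rank foreground entities by their contribution to KL(p || q).

    Each document is a list of entity strings. p is the entity distribution
    of the foreground set, q that of the background set.
    """
    fg_counts = Counter(e for doc in foreground_docs for e in doc)
    bg_counts = Counter(e for doc in background_docs for e in doc)
    fg_total = sum(fg_counts.values())
    bg_total = sum(bg_counts.values())

    scores = {}
    for entity, count in fg_counts.items():
        p = count / fg_total
        # Floor q to avoid division by zero (a crude stand-in for smoothing).
        q = max(bg_counts[entity] / bg_total, eps) if bg_total else eps
        scores[entity] = p * math.log(p / q)

    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    foreground = [["New York", "Amsterdam"], ["New York", "Wall Street"]]
    background = foreground + [["Amsterdam"]] * 6
    for entity, score in rank_entities(foreground, background):
        print(f"{entity}: {score:.3f}")
```

In this toy example "Amsterdam" is common everywhere and therefore receives a negative score, while "New York" is over-represented in the foreground set and ranks first.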
Preview updates. When the user types in query words or changes newspaper selections, the preview updates. To prevent updates on incomplete query input, we wait for 500ms after the user stops typing before updating.
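The waiting behaviour described here is commonly called debouncing. The sketch below shows the idea in Python with a timer that is restarted on every keystroke; it only illustrates the logic and is not the widget's actual code. The class name Debouncer and its methods are made up for this example.

```python
import threading


class Debouncer:
    """Call `callback` only after `wait_seconds` have passed without a new trigger."""

    def __init__(self, wait_seconds, callback):
        self.wait_seconds = wait_seconds
        self.callback = callback
        self._timer = None
        self._lock = threading.Lock()

    def trigger(self, *args):
        """Register new input; any pending update is cancelled and rescheduled."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self.wait_seconds, self.callback, args=args)
            self._timer.start()

    def wait_until_idle(self):
        """Block until the most recently scheduled update has run (useful in tests)."""
        with self._lock:
            timer = self._timer
        if timer is not None:
            timer.join()


if __name__ == "__main__":
    # 0.5 seconds mirrors the 500ms delay mentioned above.
    preview = Debouncer(0.5, lambda query: print("updating preview for:", query))
    for partial in ["b", "beurs", "beurskrach"]:
        preview.trigger(partial)
    preview.wait_until_idle()
```

Only the last of the three partial inputs leads to an update, so no entity clouds are computed for incomplete query words.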
An illustrative example
The following example is generated by typing in the query word “beurskrach”, referring to the economic crisis around 1930. In the screenshot we see two rows of entity clouds. The top row is generated from “Landelijke” newspapers, and the bottom row is generated for the “Nationale / Lokaal” newspapers.
We have the following observations.
- This word starts to appear in the news after 1929. This is correct, as the crisis started in that year. The entity clouds for the previous years are absent, as no articles containing this word were retrieved.
- We can see entities relevant to the crisis were selected, e.g., New York.
- If we compare the “Landelijke” newspapers to the “Nationale / Lokaal” newspapers, we find that in local newspapers the crisis is hardly discussed.
UI wrap up
The resulting user interface looks as follows. While the user is formulating his or her query, the temporal topic overview is shown to assist this query formulation process (left). After the user has submitted the query, search results are shown, with possible result operations (right).
Finally, I would like to discuss some of the lessons learnt during this interdisciplinary project to design experimental search tools for historical research.
Support for an iterative design process. In this project, I started with a standard search engine setup, i.e., to support keyword searching in document content. Later, after Samuël joined the project, we started to add additional features to the system. This was an iterative process consisting of discussion, implementation, testing, and new discussion. During this process newly requested features kept emerging.
The feature updates fall into two categories: those at the UI level and those at the index level. Updates at the UI level are relatively simple: they add additional means of access to, or interaction with, existing (indexed) data. Updates at the index level make additional data searchable, which is more complicated: often it means reindexing the collection.
Index updates were necessary in two situations: (1) metadata that existed but was not in the same collection (e.g., the page number of a news article existed, but in a different version of the KB news collection than the one previously indexed); (2) derived data (e.g., article length: while it is possible to compute it at query time, document length was included at indexing time in order to allow efficient sorting of results).
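To illustrate the derived-data case, the sketch below computes an article's length once, before the document is sent to the index, so that sorting can later use a stored numeric field. The function add_derived_fields, the field names "text" and "article_length", and the choice of counting whitespace-separated words are all assumptions for this example; the mapping fragment shows the general shape of declaring such a field, not our exact index configuration.

```python
def add_derived_fields(article, text_field="text"):
    """Return a copy of the article with a precomputed length field.

    Length is measured in whitespace-separated words (an assumption;
    characters or lines would work just as well for sorting).
    """
    enriched = dict(article)
    enriched["article_length"] = len(article.get(text_field, "").split())
    return enriched


# Mapping fragment declaring the new numeric field (hypothetical field name).
ARTICLE_LENGTH_MAPPING = {
    "properties": {
        "article_length": {"type": "integer"},
    }
}


if __name__ == "__main__":
    doc = {"title": "Beurskrach in New York", "text": "De koersen daalden sterk"}
    print(add_derived_fields(doc))
```

Declaring the field in the mapping only makes it available; articles that were indexed earlier still have to be updated or reindexed before the field carries a value for them.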
With respect to the design and development of tools for historical research my view is as follows:
- From a system perspective, we need systems that allow flexible index updates. It was helpful that Elasticsearch allows adding additional fields without reindexing the data. In addition, when reindexing has to happen (e.g., when the data schema has changed), it can be done in the background without bringing down the whole system.
- From a user perspective, it may be useful to provide supporting tools that allow the target users to explore the availability of data as well as possible derived data in the early stage of the design process.
Memory issues. This was the first time I used Elasticsearch. Previously, I had always used academic search systems such as Lemur and Terrier. I decided to experiment with Elasticsearch mainly because of its rich aggregation functions.
One of the issues I struggled with was memory usage. Given the focus of the project as well as its short duration, I did not experiment with different configurations of ES, but simply used the default configuration. The indexed collection consists of 55,639,628 documents, resulting in an index of 260G. It seems that the memory usage can easily go over 10G, which is rather surprising. The machine we used has 30G RAM. While it was fine to perform simple keyword searches, operations such as sorting on a specific field or more complex queries can lead to out-of-memory exceptions. Unfortunately, I did not encounter this problem until the end of the project, when all the data were indexed and all functionalities were implemented. The resulting system is therefore rather unstable.
Here is what I learnt with respect to the use of Elasticsearch: (1) It is not trivial to set the appropriate configuration for the Elasticsearch system. Careful study and experiments are needed. (2) In order to properly configure the system, it is important to have an estimate of the size of the collection beforehand. In my case, two factors made the estimation difficult. On the one hand, the documents were retrieved from a data service API and were processed and indexed on the fly. On the other hand, we kept updating the index with additional data fields in response to newly emerging functionality requests throughout the project.
In this project, we have discussed the research practice of historians and explored possible ways to support this process with novel search tool features. While a prototype system has been developed, much is left for further exploration, e.g., do the research practices as well as the requirements for search systems found in this project generalise to those of other historians?
Dr. He’s tool is available for download on KB Research’s GitHub: https://github.com/KBNLresearch/spatio-temporal-topics