KB Research

Research at the National Library of the Netherlands

Tag: newspapers

FAQ: Call for Proposals Researcher-in-Residence

Updated 04 June 2018

I don’t live or work in the Netherlands. Can I apply? 
Probably! Contact us at dh@kb.nl and we’ll discuss your options.

I want to use my own dataset. Is that possible?
Sure! As long as you also use one of the KB's datasets and your own dataset does not limit publication of the project's end results.

I don't know how to code. Is that a problem?
Not at all. We have skilled programmers who can help you with your project, or we will try to find a match for you if you prefer to work with someone else. Note that this means submitting as a team, which cuts the budget in half. Reach out to us to discuss the options.


Terms and conditions of the KB Researcher-in-residence programme 2017

This programme as detailed at the KB-website (“Programme”) is operated by the Koninklijke Bibliotheek, National Library of the Netherlands (“KB”), Prins Willem-Alexanderhof 5 (2509 LK) Den Haag, The Netherlands.


KB at DHBenelux 2016

This week, the annual DHBenelux conference will take place in Belval, Luxembourg. It will bring together practically all DH scholars from Belgium (BE), the Netherlands (NE) and Luxembourg (LUX). You can read the full program and all abstracts on the website. Two presentations are by members of our DH team (Steven Claeyssens & Martijn Kleppe) and one presentation is by our current researcher in residence (Puck Wildschut – Radboud University Nijmegen). Please find the first paragraphs of their abstracts below:


Dataset KBK-1M containing 1.6 Million Newspaper Images available for researchers

Each year the KB invites two academics to come and work with us as researchers in residence: early career researchers who work in the library with our Digital Humanities team and KB data. Together we address their research questions in a six-month project using our digital collections and computational techniques. The output of the project is incorporated in the KB Research Lab. Today we are happy to announce the output of the PhoCon project (‘Photos in and Out of Context’) by dr. Martijn Kleppe and dr. Desmond Elliott: the KBK-1M dataset, containing 1.6 million newspaper images.


“FoCon – Foto’s in en uit context” (‘Photos in and out of context’) by dr. Martijn Kleppe

This blog post was written by dr. Martijn Kleppe and is reblogged from www.martijnkleppe.nl (17 April 2015). Since publication, some aspects of the research have changed. Martijn will soon write a more extensive blog post about this in English.

Since 1 April I have been a ‘researcher in residence’ at the research department of the Koninklijke Bibliotheek for six months, working on my project ‘FoCon – Foto’s in en uit context’. It is a great opportunity, because it gives me the room to explore the KB's digitised newspaper and magazine collections as well as its web archive, focusing in particular on the published images.


What’s happening with our digitised newspapers?

The KB has about 10 million digitised newspaper pages, ranging from 1650 to 1995. We negotiated rights to make these pages available for research, and researchers have made increasing use of them over the past years. We thought that many of these projects might be interested in knowing what others are doing, and we wanted to provide a networking opportunity for them to share their results. This is why we organised a newspapers symposium focusing on the digitised newspapers of the KB, which was a great success!

Prof. dr. Huub Wijfjes (RUG/UvA) showing word clouds used in his research.


OCR improvement: helping and hindering researchers

Author: Tineke Koster

As I am writing this, volunteers are rekeying our 17th-century newspaper articles. Optical character recognition of the gothic typeface in use at the time has yielded poor results, making this part of our digital collection nearly inaccessible for full-text search. The Meertens Institute, which has an excellent track record in crowdsourcing, has developed the rekeying editor (in Dutch). Together with them we are working towards a full update of all newspaper issues from 1618 to 1700 that are available on our website Delpher.

Great news and, for some researchers, an eagerly awaited development. A bright future beckons in which our digital text corpus is 100% correct, just waiting to be mined for dynamic phenomena and paradigm shifts.

But we have to realize that without the proper precautions, correcting digital texts may also hinder researchers in their work. How so? These texts may have been used (browsed, mined, cited, etc.) by researchers in their earlier form. The improvement or enrichment may have consequences for the reproducibility of their research results.

For all researchers, the need to be able to reproduce research results is growing, with new guidelines following from new legislation. There is also a specific group of researchers that needs sustained access to older versions of digital texts. The need is greatest for research whose goal is to develop an algorithm and to assess its quality relative to previous versions of the same algorithm or to other algorithms. Without sustained access to older versions, these researchers cannot do their work.

Is it our role to provide this access? I hope to explain how the National Library of the Netherlands is thinking about this issue in a later blog post (soon!). Meanwhile, I would be very interested to hear about your experiences. How is this subject discussed in your organisation? Does your organisation have a policy in place to deal with this?

How to maximise usage of digital collections

Libraries want to understand the researchers who use their digital collections and researchers want to understand the nature of these collections better. The seminar ‘Mining digital repositories’ brought them together at the Dutch Koninklijke Bibliotheek (KB) on 10-11 April, 2014, to discuss both the good and the bad of working with digitised collections – especially newspapers. And to look ahead at what a ‘digital utopia’ might look like. One easy point to agree on: it would be a world with less restrictive copyright laws. And a world where digital ‘portals’ are transformed into ‘platforms’ where researchers can freely ‘tinker’ with the digital data. – Report & photographs by Inge Angevaare, KB.


Hans-Jörg Lieder of the Berlin State Library (front left) is given an especially warm welcome by conference chair Toine Pieters (Utrecht), ‘because he was the only guy in Germany who would share his data with us in the Biland project.’

Libraries and researchers: a changing relationship

‘A lot has changed in recent years,’ Arjan van Hessen of the University of Twente and the CLARIN project told me. ‘Ten years ago someone might have suggested that perhaps we should talk to the KB. Now we are practically in bed together.’

But each relationship has its difficult moments. Researchers are not happy when they discover gaps in the data on offer, such as missing issues or volumes of newspapers, or incomprehensible transcriptions of texts because of inadequate OCR (optical character recognition). Conference organisers Toine Pieters and Jaap Verheul (University of Utrecht) invited Hans-Jörg Lieder of the Berlin State Library to explain why he ‘could not give researchers everything everywhere today’.

Lieder & Thomas: ‘Digitising newspapers is difficult’

Both Deborah Thomas of the Library of Congress and Hans-Jörg Lieder stressed how complicated it is to digitise historical newspapers. ‘OCR does not recognise the layout in columns, or the “continued on page 5”. Plus the originals are often in a bad state – brittle and sometimes torn paper, or they are bound in such a way that text is lost in the middle. And there are all these different fonts, e.g., Gothic script in German, and the well-known long-s/f confusion.’ Lieder provided the ultimate proof of how difficult digitising newspapers is: ‘Google only digitises books, they don’t touch newspapers.’


Thomas: ‘The stuff we are digitising is often damaged’

Another thing researchers should be aware of: ‘Texts are liquid things. Libraries enrich and annotate texts, versions may differ.’ Libraries do their best to connect and cluster collections of newspapers (e.g., in Europeana Newspapers), but ‘the truth of the matter is that most newspaper collections are still analogue; at this moment we have only bits and pieces in digital form, and there is a lot of bad OCR.’ There is no question that libraries are working on improving the situation, but funding is always a problem. And the choices to be made with bad OCR are sometimes difficult: ‘Should we manually correct it all, or maybe retype it, or maybe even wait a couple of years for OCR technology to improve?’


Librarians and researchers discuss what is possible and what not. From the left, Steven Claeyssens, KB Data Services, Arjan van Hessen, CLARIN, and Tom Kenter, Translantis.

Researchers: how to mine for meaning

Researchers themselves are still debating how to fit these new digital resources into their academic work. Obviously, being able to search millions of newspaper pages from different countries in a matter of days opens up a lot of new research possibilities. Conference organisers Toine Pieters and Jaap Verheul (University of Utrecht) are both involved in the HERA Translantis project, which breaks away from traditional ‘national’ historical research by looking at the transnational influence of so-called ‘reference cultures’:


Definition of Reference Cultures in the Translantis project which mines digital newspaper collections

In the 17th century the Dutch Republic was such a reference culture. In the 20th century the United States developed into a reference culture and Translantis digs deep into the digital newspaper archives of the Netherlands, the UK, Belgium and Germany to try and find out how the United States is depicted in public discourse:


Jaap Verheul (Translantis) shows how the US is depicted in Dutch newspapers

Joris van Eijnatten introduced another transnational HERA project, ASYMENC, which is exploring cultural aspects of European identity with digital humanities methodologies.

All of this sounds straightforward enough, but researchers themselves have yet to develop a scholarly culture around the new resources:

  • What type of research questions do the digital collections allow? Are these new questions or just old questions to be researched in a new way?
  • What is scientific ‘proof’ if the collections you mine have big gaps and faulty OCR?
  • How to interpret the findings? You can search words and combinations of words in digital repositories, but how can you assess what the words mean? Meanings change over time. Also: how can you distinguish between irony and seriousness?
  • How do you know that a repository is trustworthy?
  • How to deal with language barriers in transnational research? Mere translations of concepts do not reflect the sentiment behind the words.
  • How can we analyse what newspapers do not discuss (also known as the ‘Voldemort’ phenomenon)?
  • How sustainable is digital content? Long-term storage of digital objects is uncertain and expensive. (Microfilms are much easier to keep, but then again, they do not allow for text mining …)
  • How do available tools influence research questions?
  • Researchers need a better understanding of text mining per se.

Some humanities scholars have yet to be convinced of the need to go digital

Rens Bod, Director of the Dutch Centre for Digital Humanities, enthusiastically presented his ideas about the value of parsing (analysing parts of speech) for uncovering deep patterns in digital repositories. If you want to know more: Bod recently published a book about it.


Professor Rens Bod: ‘At the University of Amsterdam we offer a free course in working with digital data.’

But in the context of this blog, his remarks about the lack of big-data awareness and competencies among many humanities scholars, including young students, were perhaps more striking. The University of Amsterdam offers a crash course in working with digital data to bridge the gap. The one-week course is free and deals with all aspects of working with data, from ‘gathering data’ to ‘cooking data’.

As the scholarly dimensions of working with big data are not this blogger’s expertise, I will not delve into these further but gladly refer you to an article Toine Pieters and Jaap Verheul are writing about the scholarly outcomes of the conference [I will insert a link when it becomes available].


Conference hosts Jaap Verheul (left) and Toine Pieters taking analogue notes for their article on Mining Digital Repositories. And just in case you wonder: the meeting rooms are probably the last rooms in the KB to be migrated to Windows 7

More data providers: the ‘bad’ guys in the room

It was the commercial data providers in the room themselves that spoke of ‘bad guys’ or ‘bogey man’ – an image both Ray Abruzzi of Cengage Learning/Gale and Elaine Collins of DC Thomson Family History were hoping to at least soften a bit. Both companies provide huge quantities of digitised material. And, yes, they are in it for the money, which would account for their bogeyman image. But, they both stressed, everybody benefits from their efforts:

Value proposition of DC Thomson Family History

Cengage Learning is putting 25-30 million pages online annually. Thomson is digitising 750 million (!) newspaper & periodical pages for the British Library. Collins: ‘We take the risk, we do all the work, in exchange for certain rights.’ If you want to access the archive, you have to pay.

In and of itself, this is quite understandable. Public funding just doesn’t cut it when you are talking billions of pages. Both the KB’s Hans Jansen and Rens Bod (U. of Amsterdam) stressed the need for public/private partnerships in digitisation projects.

And yet.

Elaine Collins readily admitted that researchers ‘are not our most lucrative stakeholders’; that most of Thomson’s revenue comes from genealogists and the general public. So why not give digital humanities scholars free access to their resources for research purposes, if need be under the strictest conditions that the information does not go anywhere else? Both Abruzzi and Collins admitted that such restricted access is difficult to organise. ‘And once the data are out there, our entire investment is gone.’

Libraries to mediate access?

Perhaps, Ray Abruzzi allowed, access to certain types of data, e.g., metadata, could be granted under certain conditions, but, he stressed, individual scholars who apply to Cengage for access do not stand a chance. Their requests for data are far too varied for Cengage to have any kind of business proposition. And there is the trust issue. Abruzzi recommended that researchers turn to libraries to mediate access to certain content. If libraries give certain guarantees, then perhaps …


You think OCR is difficult to read? Try human handwriting!

What do researchers want from libraries?

More data, of course, including more contemporary data (… ah, but copyright …)

And better quality OCR, please.

What if libraries have to choose between quality and quantity?  That is when things get tricky, because the answer would depend on the researcher you question. Some may choose quantity, others quality.

Should libraries build tools for analysing content? The researchers in the room seemed to agree that libraries should concentrate on data rather than tools. Tools are very temporary, and researchers often need to build the tools around their specific research questions.

But it would be nice if libraries started allowing users to upload enrichments to the content, such as better OCR transcriptions and/or metadata.


Researchers and libraries discussing what is desirable and what is possible. In the front row, from the left, Irene Haslinger (KB), Julia Noordegraaf (U. of Amsterdam), Toine Pieters (Utrecht), Hans Jansen (KB); further down the front row James Baker (British Library) and Ulrich Tiedau (UCL). Behind the table Jaap Verheul (Utrecht) and Deborah Thomas (Library of Congress).

And there is one more urgent request: that libraries become more transparent in what is in their collections and what is not. And be more open about the quality of the OCR in the collections. Take, e.g., the new Dutch national search service Delpher. A great project, but scholars must know exactly what’s in it and what’s not for their findings to have any meaning. And for scientific validity they must be able to reconstruct such information in retrospect. So a full historical overview of what is being added at what time would be a valuable addition to Delpher. (I shall personally communicate this request to the Delpher people, who are, I may add, working very hard to implement user requests).

American newspapers

Deborah Thomas of the US Library of Congress: ‘This digital age is a bit like the American Wild West. It is a frontier with lots of opportunities and hopes for striking it rich. And maybe it is a bit unruly.’

New to the library: labs for researchers

Deborah Thomas of the Library of Congress made no bones about her organisation's strategy towards researchers: we put out the content, and you do with it whatever you want. In addition to APIs (Application Programming Interfaces), the Library also allows downloads of bulk content. The basic content is available free of charge, but additional metadata levels may come at a price.

The British Library (BL) is taking a more active approach. The BL's James Baker explained how the BL is trying to bridge the gap between researchers and content by providing special labs for researchers. As I (unfortunately!) missed that parallel session, let me mention the KB's own efforts to set up a KB lab where researchers are invited to experiment with KB data using open source tools. The lab is still in its ‘pre-beta phase’, as Hildelies Balk of the KB explained. If you want the full story, by all means attend the Digital Humanities Benelux Conference in The Hague on 12-13 June, where Steven Claeyssens and Clemens Neudecker of the KB are scheduled to launch the beta version of the platform. Here is a sneak preview of the lab: a scansion machine built by KB Data Services in collaboration with phonologist Marc van Oostendorp (audio in Dutch):

https://www.youtube.com/watch?v=FcTufco9P3A

Europeana: the aggregator

“Portals are for visiting; platforms are for building on.”

Another effort by libraries to facilitate transnational research is the aggregation of their content in Europeana, especially Europeana Newspapers. For the time being the metadata are being aggregated, but in Alistair Dunning's vision, Europeana will grow from an end-user portal into a data brain, a cloud platform that will include the content and allow for metadata enrichment:


Alistair Dunning: ‘Europeana must grow into a data brain to bring disparate data sets together.’


Dunning’s vision of Europeana 3.0

Dunning also indicated that Europeana might develop brokerage services to clear content for non-commercial purposes. In a recent interview Toine Pieters said that researchers would welcome Europeana to take such a role, ‘because individual researchers should not be bothered with all these access/copyright issues.’ In the United States, the Library of Congress is not contemplating a move in that direction, Deborah Thomas told her audience. ‘It is not our mission to negotiate with publishers.’ And recent ‘Mickey Mouse’ legislation, said to have been inspired by Disney interests, seems to be leading to less rather than more access.

Dreaming of digital utopias

What would a digital utopia look like for the conference attendees? Jaap Verheul invited his guests to dream of what they would do if they were granted, say, €100 million to spend as they pleased.

Deborah Thomas of the Library of Congress would put her money into partnerships with commercial companies to digitise more material, especially the post-1922 stuff (less restrictive copyright laws being part and parcel of the dream). And she would build facilities for uploading enrichments to the data.

James Baker of the British Library would put his money into the labs for researchers.

Researcher Julia Noordegraaf of the University of Amsterdam (heritage and digital culture) would rather put the money towards improving OCR quality.

Joris van Eijnatten’s dream took the Europeana plans a few steps further. His dream would be of a ‘Globiana 5.0’ – a worldwide, transnational repository filled with material in standardised formats, connected to bilingual and multilingual dictionaries and researched by a network of multilingual, big data-savvy researchers. In this context, he suggested that ‘Google-like companies might not be such a bad thing’ in terms of sustainability and standardisation.


Joris van Eijnatten: ‘Perhaps – and this is a personal observation – Google-like companies are not such a bad thing after all in terms of sustainability and standardisation of formats.’

At the end of the two-day workshop, perhaps not all of the ambitious agenda had been covered. But, then again, nobody had expected that.


Mining Digital Repositories 2014 – the ambitious agenda

The trick is for providers and researchers to keep talking and conquer this ‘unruly’ Wild West of digital humanities bit by bit, step by step.

And, by all means, allow researchers to ‘tinker’ with the data. Verheul: ‘There is a certain serendipity in working with big data that allows for playfulness.’


Named entity recognition for digitised historical newspapers

The refinement partners in the Europeana Newspapers project will produce the astonishing amount of 10 million pages of full-text from historical newspapers from all over Europe. What could be done to further enrich that full-text?

The KB, National Library of the Netherlands, has been investigating named entity recognition (NER) and linked data technologies for a while now in projects such as IMPACT and STITCH+, and we felt it was about time to approach this on a production scale. So, as part of the project, we decided to produce open-source software, trained models and raw training data for NER applications aimed specifically at digitised historical newspapers.

What is named entity recognition (NER)?

Named entity recognition is the process of identifying and classifying entities such as persons, locations and organisations in full-text in order to enhance searchability. There are basically two types of approach: a statistical one and a rule-based one. Rule-based systems rely mostly on grammar rules defined by linguists, while statistical systems require large amounts of manually produced training data that they can learn from. While both approaches have their benefits and drawbacks, we decided to go for a statistical tool, the CRFNER system from Stanford University. In our comparison, this software proved to be the most reliable, and it is supported by an active user community. Stanford University has an online demo where you can try it out: http://nlp.stanford.edu:8080/ner/.
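To give an impression of what this looks like in practice, here is a minimal sketch of calling the Stanford NER tagger from Python via NLTK's wrapper. This is not the project's actual pipeline; the model and jar paths are placeholders for the files shipped with the Stanford NER distribution.

```python
# Minimal sketch (not the project's pipeline): tagging a sentence with the
# Stanford CRF-based NER tool via NLTK's wrapper. Paths are placeholders.
from nltk.tag.stanford import StanfordNERTagger

tagger = StanfordNERTagger(
    "classifiers/english.all.3class.distsim.crf.ser.gz",  # a trained CRF model
    "stanford-ner.jar",                                    # the Stanford NER jar
    encoding="utf-8",
)

tokens = "Albert Einstein was born in Ulm in 1879".split()
print(tagger.tag(tokens))
# e.g. [('Albert', 'PERSON'), ('Einstein', 'PERSON'), ..., ('Ulm', 'LOCATION'), ...]
```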


Example of Wikipedia article for Albert Einstein, tagged with the Stanford NER tool

Requirements & challenges

There are some particular requirements and challenges when applying these techniques to digital historical newspapers. Since full-text for more than 10 million pages will be produced in the project, one requirement for our NER tool was that it should be able to process large amounts of texts in a rather short time. This is possible with the Stanford tool,  which as of version 1.2.8 is “thread-safe”, i.e. it can run in parallel on a multi-core machine. Another requirement was to preserve the information about where on a page a named entity has been detected – based on coordinates. This is particularly important for newspapers: instead of having to go through all the articles on a newspaper page to find the named entity, it can be highlighted so that one can easily spot it even on very dense pages.
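As an illustration of the coordinate requirement, the sketch below shows one way to read words and their positions from an ALTO file (the OCR format received in the project, see the workflow below), so that entities recognised in the plain text can be mapped back to a location on the page. This is a simplified example rather than the project's code; the file name is a placeholder.

```python
# Minimal sketch: extract words with their page coordinates from an ALTO file,
# so that entities found by the NER step can be traced back to a position on
# the page. The file name is a placeholder.
import xml.etree.ElementTree as ET

def alto_words(path):
    """Yield (text, hpos, vpos, width, height) for every ALTO String element."""
    for _, elem in ET.iterparse(path):
        # ALTO elements are namespaced, e.g. '{...alto/ns-v2#}String'
        if elem.tag.endswith("String"):
            yield (
                elem.get("CONTENT"),
                elem.get("HPOS"),    # coordinates kept as strings; cast as needed
                elem.get("VPOS"),
                elem.get("WIDTH"),
                elem.get("HEIGHT"),
            )
            elem.clear()  # keep memory use low on dense newspaper pages

words = list(alto_words("newspaper_page.alto.xml"))
tokens = [w[0] for w in words]
# `tokens` can now be fed to the NER tagger; the index of each tagged token
# points back to its coordinates in `words`.
```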

Then there are also challenges, of course, mainly due to the quality of the OCR and the historical spelling found in many of these old newspapers. In the course of 2014 we will therefore collaborate with the Dutch Institute for Lexicology (INL), which has produced modules that can be used in a pre-processing step before the Stanford system and that can, to some extent, mitigate problems caused by low-quality full-text or historical spelling variants.

The Europeana Newspapers NER workflow

For Europeana Newspapers, we decided to focus on three languages: Dutch, French and German. The content in these three languages makes up about half of the newspaper pages that will become available through Europeana Newspapers. For the French materials we cooperate with LIP6-ACASA, for Dutch again with INL. The workflow goes like this:

  1. We receive OCR results in ALTO format (or METS/MPEG21-DIDL containers)
  2. We process the OCR with our NER software to derive a pre-tagged corpus
  3. We upload the pre-tagged corpus into an online Attestation Tool (provided by INL)
  4. Within the Attestation Tool, the libraries make corrections and add tags until we arrive at a “gold corpus”, i.e. all named entities on the pages have been manually marked
  5. We train our NER software based on the gold corpus derived in step (4) (a minimal sketch of the kind of training file involved follows below)
  6. We process the OCR again with our NER software trained on the gold corpus
  7. We repeat steps (2) – (6) until the tagging results no longer improve

Screenshot of the NER Attestation Tool
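For illustration, here is what writing such a gold corpus to the tab-separated token/label format commonly used to train Stanford's CRFClassifier might look like. This is a minimal sketch with invented example sentences, not project data or the project's actual tooling.

```python
# Minimal sketch with invented example data: write manually corrected ("gold")
# sentences to the tab-separated token/label format that Stanford's
# CRFClassifier can be trained on (one token per line, blank line between
# sentences, column map "word=0,answer=1" in the training properties file).
gold_sentences = [
    [("Den", "LOCATION"), ("Haag", "LOCATION"), (",", "O"), ("10", "O"), ("april", "O")],
    [("De", "O"), ("Koninklijke", "ORGANIZATION"), ("Bibliotheek", "ORGANIZATION")],
]

with open("gold_corpus.tsv", "w", encoding="utf-8") as out:
    for sentence in gold_sentences:
        for token, label in sentence:
            out.write(f"{token}\t{label}\n")
        out.write("\n")  # sentence boundary
```

The actual training run is then done with the Stanford toolkit itself, with a properties file pointing at a file like this one.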

Preliminary results

Named entity recognition is typically evaluated by means of precision, recall and F-measure. Precision indicates how many of the named entities the software found are in fact named entities of the correct type, while recall indicates how many of all the named entities present in the text have been detected by the software. The F-measure combines both scores into a single value between 0 and 1; the commonly used F1 score is the harmonic mean of precision and recall.
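In terms of true positives (TP), false positives (FP) and false negatives (FN), the standard definitions are:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```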

Here are our (preliminary) results for Dutch so far:

Dutch          Persons    Locations    Organizations
Precision      0.940      0.950        0.942
Recall         0.588      0.760        0.559
F-measure      0.689      0.838        0.671

These figures were derived from a k-fold cross-evaluation of 25 out of 100 manually tagged pages of Dutch newspapers from the KB. The results confirm that the Stanford system tends to be a bit “conservative”, i.e. it trades a somewhat lower recall for higher precision, which is also what we wanted.

Conclusion and outlook

In this final year of the project we look forward to seeing how far we can boost these figures by adopting the extra modules from INL, and what results we can achieve on the French and German newspapers. We will also investigate software for linking the named entities to additional online resource descriptions and authority files such as DBpedia or VIAF to create linked data. The crucial question will be how well we can disambiguate the named entities and find a correct match in these resources. If there is time, we would also like to experiment with NER in other languages, such as Serbian or Latvian. And, if all goes well, you might hear more about this at the upcoming IFLA newspapers conference “Digital transformation and the changing role of news media in the 21st Century”.


Europeana Newspapers Refinement & Aggregation Workshop

The KB participates in the Europeana Newspapers project, which started in February 2012. The project will enrich 18 million pages of digitised newspapers from all over Europe with Optical Character Recognition (OCR), Optical Layout Recognition (OLR) and Named Entity Recognition (NER) and deliver them to Europeana. The project consortium consists of 18 partners from all over Europe: some will provide (technical) support, while others will provide their digitised newspapers. The KB has two roles: we will not only deliver 2 million of our newspaper pages to Europeana, but also enrich our own newspapers and those of other partners with NER.


Europeana Newspapers Workshop in Belgrade

In recent months the project has welcomed 11 new associated partners. To make sure they can benefit as much as possible from the experiences of the project partners, the University Library of Belgrade and LIBER jointly organised a workshop on refinement and aggregation on 13 and 14 June. Here, the KB (Clemens Neudecker and I) presented the work currently being done to produce named entities for several partners. So that this work also benefits our direct colleagues, we were joined by a colleague from our Digitisation department.

The workshop started with a warm welcome in Belgrade by the director of the library, Prof. Aleksandar Jerkov. After a short introduction to the project by project leader Hans-Jörg Lieder of the Berlin State Library, Clemens Neudecker of the KB presented the project's refinement process. All presentations will be shared on the project's Slideshare account. The refinement of the newspapers has already started and is being done by the University of Innsbruck and the company CCS in Hamburg. However, it was still a big surprise when Hans-Jörg Lieder announced a present for the director of the University Library of Belgrade: the first batch of their processed newspapers!

Giving a gift of 200,000 digitised and refined newspapers to our Belgrade hosts

The day continued with an introduction to the importance of evaluating OCR and OLR, and a demonstration of the tools used for this, by Stefan Pletschacher and Christian Clausner of the University of Salford. This sparked some interesting discussions in the break-out sessions on methods of evaluation in libraries that are digitising their collections. For example, do you tell your service provider what you will be checking when you receive a batch? You could argue that the service provider would then only fix what you check. On the other hand, if that is what you need to reach your goal, it would save a lot of time and rejected batches.

After a short getting-to-know-each-other session the whole workshop party moved to the Nikola Tesla Museum nearby where we were introduced to their newspaper clippings project. All newspaper clippings collected by Nikola Tesla are now being digitised for publication on the museum’s website. A nice tour through the museum followed with several demonstrations (don’t worry, no one was electrocuted) and the day was concluded with a dinner in the bohemian quarter.

Breakout groups at the Belgrade Workshop

The second day of the workshop was dedicated solely to refinement. I kicked off the day with the question ‘What is a named entity?’. This sounds easy, but it can present some dilemmas as well. For example, a dog's name is a name, but do you want it to be tagged as a named entity? And what do you do with a title such as Romeo and Juliet? Consistency is key here, and as long as you keep your goal in mind while training your software, you should end up with the results you are looking for.

Claus Gravenhorst followed me with his presentation on OLR at CCS using docWorks, with which they will process 2 million pages. It was then again our turn, with a hands-on session about the tools we are using, which are also available on GitHub. The last session of the workshop was a collaboration between Claus Gravenhorst from CCS and Günter Mühlberger from the University of Innsbruck, who gave us a nice insight into their tools and the considerations involved when working with digitised newspapers. For example, how many categories would you need to tag every article?

Group photo from the Europeana Newspapers workshop in Belgrade

All in all, it was a very successful workshop and I hope that all participants enjoyed it as much as I did. I, for one, am happy to have spoken to so many interesting people and to have heard about their experiences in other digitisation projects. There is still much to learn from each other, and projects like Europeana Newspapers contribute to a good exchange of knowledge between libraries, ensuring that our users get the best experience when browsing these rich digital collections.

© 2018 KB Research
