Extracting information from newspaper archives in Africa
In sub-Saharan Africa, a lack of useful information for the public good is one obstacle to the development of public services (public safety, education, healthcare, etc.). Extracting data from archives, both analog sources such as printed newspapers and born-digital sources such as native PDFs, is therefore an attractive way to increase the amount and diversity of potentially useful information. Printed newspapers use a variety of multi-article page layouts, designed so that readers can choose their own reading order. An article's title, its introductory text, and related images are usually grouped together. However, subsequent paragraphs and images are spread across other pages of the newspaper in a somewhat unpredictable manner. This, together with the poor quality of existing archives, makes extracting data from archived newspapers a daunting research problem. To address these challenges, we present a system that extracts, detects, and clusters newspaper articles from digital archives (mainly scanned copies of printed newspapers). Finally, we describe a proof-of-concept service built on the extracted data.