William Ogallo, Skyler Speakman, et al.
IJCAI 2020
In sub-Saharan Africa, the lack of useful information for the public good is one obstacle to the development of public services (public safety, education, healthcare, etc.). This makes extracting data from digital archives (e.g., digitized analog sources such as printed newspaper archives, and born-digital sources such as native PDFs) an interesting alternative way to increase the amount and diversity of potentially useful information. Printed newspapers use a variety of multi-article page layouts, designed so that readers can choose their own reading order. An article's title, its introductory text, and related images are usually grouped together. However, subsequent paragraphs and images are spread across various pages of the newspaper in a somewhat unpredictable manner. This, together with the poor quality of existing archives, makes extracting data from archived newspapers a daunting research problem. To address these challenges, we present a system that extracts, detects, and clusters articles in newspapers from digital archives, which mainly consist of scanned newspapers. Finally, we describe our proof-of-concept service that uses the extracted data.
Jaye Clarkes-Nias, Oliver Bent, et al.
L@S 2015
Girmaw Abebe Tadesse, Oliver Bent, et al.
IEEE SPM
Charles M. Wachira, Sekou L. Remy, et al.
ICHI 2020