Digital Approaches to Text Analysis: An International Digital Humanities Symposium

Friday, 27 May 2022 - Aula Magna, Chiostro di Sant’Abbondio, Como (IT)

This one-day international symposium aims to showcase recent research and upcoming trends in text analysis within the context of digital humanities. Through a series of invited talks, the symposium will present a variety of approaches and methods for the digital investigation of texts and discourses, as well as the range of text types to which these approaches can be applied.


  • 09:30-09:45 Welcome
  • 09:45-10:45 Plenary talk: Michaela Mahlberg, Re-thinking corpus linguistics
  • 10:45-11:15 Coffee break
  • 11:15-11:55 Gerold Schneider, Data-driven approaches to content analysis: Case studies from history, literature and medicine
  • 11:55-12:35 Viola Wiegand, Digital approaches to analysing meaning in discourse: Case studies on surveillance
  • 12:35-13:15 Lorenzo Mastropierro, Digital approaches to literary translation
  • 13:15-14:30 Buffet lunch
  • 14:30-15:30 Plenary talk: Turo Hiltunen, Digital humanities, corpus linguistics, and the problem of register
  • 15:30-16:00 Coffee break
  • 16:00-16:40 Sara Tonelli, Towards olfactory information extraction from historical texts
  • 16:40-17:20 Mikko Laitinen, American, British or a Mixture? Spatial Variability of Nordic Tweets in English 
  • 17:20-17:30 Closing

The “Digital Approaches to Text Analysis” symposium is organized by Paola Baseotto, Ruggero Lanotte, and Lorenzo Mastropierro (Department of Human Sciences and Innovation for the Territory, University of Insubria, Como), with the support of the Associazione Italiana di Anglistica (AIA) and the Sezione di Mediazione Interlinguistica e Interculturale of the Department of Human Sciences and Innovation for the Territory, University of Insubria, Como.

  • “Digital Approaches to Text Analysis” is primarily an in-person event, but online participation is also possible.
  • The registration fee is €35 for in-person attendance, while online participation is free.
  • The registration fee includes lunch, refreshments, and a delegate pack.

In-person attendance:

Please use the university payment system PAGOPA to pay the registration fee of €35.00 by Monday 23 May.
You can access PAGOPA by clicking on the following link:

  1. Sign up and complete the form providing the required information.
  2. In the Payment type drop-down menu, select “Conferences/seminars/master classes”.
  3. Once you have selected the payment type, the text box Purpose of payment will appear. Insert “registrazione DH”.
  4. In the Office/Department drop-down menu, select “Department of Human Sciences and Innovation for the Territory”.
  5. Check your details and proceed with the payment of €35.00.
  6. Attach the PAGOPA payment receipt to an email with your name, surname, and affiliation.

Please note that registration will be complete only after the PAGOPA payment receipt has been sent to the email address above.

Online attendance:

To register for online attendance, please send an email with your name, surname, and affiliation.


Digital humanities, corpus linguistics, and the problem of register
Turo Hiltunen, University of Helsinki

The increasing availability of large digital archives and datasets has had a major influence on linguistic research, enabling scholars to study rare phenomena, analyse more specialised research questions, and establish interdisciplinary collaborations with practitioners of digital humanities (e.g. Hiltunen, Säily & McVeigh 2017, Mehl 2021). However, from the perspective of corpus linguistics, there are often issues related to the use of these materials, which need to be addressed before they can be used to their full potential as resources for digital text analysis. In this talk, I will review some of these issues, arguing that they are often linked to the concept of register. To use archive data as “opportunistic corpora”, it is therefore necessary to critically reflect on the concept of register and how it can be operationalised in different settings. I will illustrate this with examples from historical data representing different types of public discourses.
Hiltunen, T., J. McVeigh, & T. Säily. (2017). How to turn linguistic data into evidence? In T. Hiltunen, J. McVeigh & T. Säily (Eds.), Big and Rich Data in English Corpus Linguistics: Methods and Explorations. Helsinki: VARIENG.
Mehl, S. (2021). Why linguists should care about digital humanities (and epidemiology). Journal of English Linguistics, 49(3), 331-337.

American, British or a Mixture? Spatial Variability of Nordic Tweets in English
Mikko Laitinen, University of Eastern Finland

This presentation focuses on language use in social media and investigates the spatial variability of networks in which English is used in the Nordic region. We ask whether people in the Nordic region systematically select British English forms or resort to variants found in American English. Our empirical part examines both spelling and lexico-grammatical variables. Of interest is the process of Americanization, viz. the gradual change whereby English around the world tends to follow contemporary American English norms (Leech et al. 2009: 252-259). Gonçalves et al. (2018) observe that American English variants dominate expanding circle settings. They point out, however, that “in countries where English is not the mother tongue the real problem is the lack of data” (p. 8). We address this bad data problem by providing a more detailed picture of the spatial variability of English in the Nordics. In addition, we investigate how network structures influence variability (Taipale and Laitinen 2022). Our empirical data consists of one year of English tweets and their metadata from the Nordic Tweet Stream (NTS) corpus. This real-time monitor corpus currently contains material from over 700,000 user accounts.
Our empirical part focuses on a set of spelling variables (e.g. British centre vs. American center) and lexico-grammatical variables (e.g. singular or plural agreement with collective nouns). We utilize geotagging properties to calculate a regional index of American vs. British forms. Our results show a mixed use in which spelling tends to follow the British norm, while American English forms are preferred for the lexico-grammatical variables. The study provides methodological improvements, as the methods used and the data from the NTS enable researchers to obtain an accurate overview of spatial variability in urban and rural areas in the expanding circle and facilitate the study of language use in large networks.
Gonçalves, B., Loureiro-Porto, L., Ramasco, J., & Sánchez, D. (2018). Mapping the Americanization of English in space and time. PLoS ONE 13(5), e0197741.
Leech, G., Hundt, M., Mair, C., & Smith, N. (2009). Change in Contemporary English: A Grammatical Study. Cambridge: Cambridge University Press.
Taipale, I. & Laitinen, M. (2022). Individual sensitivity to change in the lingua franca use of English. Frontiers in Communication 6, 737017.

Re-thinking corpus linguistics
Michaela Mahlberg, University of Birmingham

In corpus linguistics, innovation has crucially depended on computational tools and methods. Since the early days, approaches and methods to collect, store, and analyse language data have developed significantly. Many areas of linguistics have now seen a corpus or a computational turn. Corpus linguistics and digital humanities also have much in common. Beyond linguistics, language data is increasingly studied in a wide range of subjects and areas of application. What does that mean for corpus linguistics? Is there a question of whether corpus linguistics has kept up with the most recent technology? Or is there more than technology to think about? In this talk, I want to argue that corpus linguistics is now at a point where we need to be clear about the foundations of the field and the guiding principles that will determine its future. At the beginning, corpus linguistics brought the computational into linguistics. Now, it is the task of corpus linguistics to keep the focus on the language in a world of data. This talk will begin with some suggestions for this linguistic focus.

Digital approaches to literary translation
Lorenzo Mastropierro, University of Insubria

The descriptive study of translation as envisaged by Holmes (1972) and Toury (1995) relies on the comparison of source and target texts as its primary method of analysis. In this talk, I will argue that the use of digital approaches and resources can greatly improve the comparative scope of descriptive translation studies. In particular, I will showcase applications of digital approaches to the study of literary translation, showing how such methods can be used to examine a multitude of translational phenomena. I will draw on a series of case studies based on the investigation of reporting verbs in the Harry Potter book series and its Italian translation to demonstrate how the comparison of source and target (literary) texts with digital and corpus methods can provide important insight into stylistic, interpretational, and pedagogical issues in literary translation studies.
Holmes, J. (1972/2004). The name and nature of Translation Studies. In L. Venuti (Ed.), The Translation Studies Reader (2nd Edition) (pp. 180-192). London/New York: Routledge.
Toury, G. (1995). Descriptive Translation Studies and Beyond. Amsterdam: John Benjamins.

Data-driven approaches to content analysis: Case studies from history, literature and medicine
Gerold Schneider, University of Zurich

Data-driven approaches detect patterns fully automatically and offer many applications in digital humanities and automatic content analysis. We showcase how topic modelling (Blei 2012), conceptual maps with kernel density estimation (Schneider 2020, Eve 2022), and distributional semantics (Firth 1957, Mikolov et al. 2017) can detect historical and social developments, literary styles and visions, or patients’ concerns. Our historical application focusses on American history from 1860 to 1999, based on the Corpus of Historical American English (COHA). In literature, we present a distant-reading study (zooming in to close reading, see Moretti 2013) of Charles Dickens’s style of literary realism, his compassion for the poor, and his visions for social reform (Mahlberg 2013). From the medical sector, we present a bird’s-eye view of the history of medicine (Schneider in press) and show how patient interviews can be visualized to gain an overview of patients’ concerns, with a view to improving doctor-patient relations and supporting patients’ coping strategies.
Blei, D. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84.
Firth, J. R. (1957). A synopsis of linguistic theory 1930-1955. Studies in Linguistic Analysis, 1-32.
Fitzmaurice, S. J. R., Alexander, M., Hine, I., Mehl, S., & Dallachy, F. (2017). Linguistic DNA: Investigating conceptual change in Early Modern English discourse, Studia Neophilologica, 89(sup1), 21-38.
Eve, M. P. (2022). The Digital Humanities and Literary Studies. Oxford: Oxford University Press.
Mahlberg, M. (2013). Corpus Stylistics and Dickens’s Fiction. London/New York: Routledge.
Moretti, F. (2013). Distant Reading. London: Verso.
Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., & Joulin, A. (2017). Advances in pre-training distributed word representations.
Schneider, G. (2020). Changes in society and language: Charting poverty. In P. Rautionaho, A. Nurmi & J. Klemola (Eds.), Corpora and the Changing Society: Studies in the Evolution of English (pp. 29-56). Amsterdam: Benjamins.
Schneider, G. (in press). Medical topics and style from 1500 to 2018. In T. Hiltunen & I. Taavitsainen (Eds.), Corpus Pragmatic Studies on the History of Medical Discourse. Amsterdam: Benjamins.

Towards olfactory information extraction from historical texts
Sara Tonelli, Fondazione Bruno Kessler

Human experience is mediated through the senses, which we use to interact with the world. But what do we know about experiences from the past from a sensory perspective? Historical archives are rich in visual descriptions of cities and landscapes of the past, and can tell us what events affected people’s lives and what travellers saw on their journeys. However, we do not know much about how places smelled, and what impact olfactory experiences had on people’s lives. In fact, smell is an urgent but highly under-researched topic in computer science and the humanities. How should we safeguard our olfactory heritage? How can we extract sensory data from large-scale digital text and image collections? In this talk, I will address these questions by describing ongoing work on olfactory information extraction covering multiple domains and different languages, with the goal of obtaining a richer, multi-faceted knowledge of the European past. I will briefly describe how a temporally aware taxonomy of olfactory terms has been created, and which natural language processing approaches are being used to automatically detect information on smell events, qualities, and sources in historical archives. Our ultimate goal is to recognise, safeguard, and promote olfactory knowledge as a way to connect Europe’s tangible and intangible cultural heritage.

Digital approaches to analysing meaning in discourse: Case studies on surveillance
Viola Wiegand, University of Birmingham

In this talk, I will introduce digital approaches to analysing meaning in discourse, understanding meaning as a concept that (i) evolves with the discourse (across different genres, contexts, and times); (ii) takes shape in co-occurrence patterns; and (iii) emerges via comparison. I explore the representation of the concept of surveillance, which can be a contentious issue because it can be implemented for protection as well as suppression. While it may bring to mind technologies such as CCTV and browser cookies, surveillance is arguably part of everyday interactions: according to Goffman (1964), any social situation is an environment of mutual monitoring possibilities. I present corpus and qualitative analyses of discourse in the Times Digital Archive, with case studies of surveillance in relation to (i) identity documents and (ii) mental healthcare in historical classified adverts, based on my work in Wiegand (2019, 2021).
Goffman, E. (1964). The neglected situation. American Anthropologist, 66(6), 133–136.
Wiegand, V. (2019). A Corpus Linguistic Approach to Meaning-Making Patterns in Surveillance Discourse [PhD thesis, University of Birmingham].
Wiegand, V. (2021). Surveillance contexts in 19th-century British mental healthcare: A study of adverts in The Times. In N. Brownlees (Ed.), The Role of Context in the Production and Reception of Historical News (pp. 287–312). Pieterlen/Bern: Peter Lang.
