A small journey in the German language for Pre-Processing in NLP

Published on February 17, 2020 · 22 min read · by Flavio Clesio Silva de Souza




This is a summary of a talk that I was supposed to give at Data Council last year, but instead I gave a broader one called Low Hanging Fruit Projects in Machine Learning. This post expands on some of the bullet points that I prepared for that presentation. If you saw my talk about LHF Projects, some of the content will be familiar to you.

Disclaimer Project Report

This is only a project report with additional personal views and experiences. It is not a Towards Data Science post, an O’Reilly Strata keynote, a best-practices talk, a top-10-rules-you-must-follow list, or a cautionary tale; nor is it science, cognitive linguistics, linguistics, or computational linguistics.

Introduction

With all this hype about language models like BERT, GPT-2, RoBERTa, and others, there is no doubt that NLP is one of the hottest topics nowadays.

NLP is at the center of countless discussions today, like, for instance, the “dangerous” GPT-2. OpenAI said that it would be dangerous to society and did not release the weights. A few months later, some good developers managed to replicate all the code, and afterwards everyone saw that it was a good model that sometimes generates very brittle results.

Debates aside, one positive point today is that there are countless resources that bring the state of the art together with everyday applications, for example this NLP e-mail list provided by Sebastian Ruder.

However, what I am going to cover in the following lines are some small aspects of NLP for the German language, regarding the preprocessing part of a text project.

I will cover some basic aspects of our journey, along with some more project-level considerations about natural language processing and other findings I came across during that time.

Los geht’s?

German Language: Respecting the unknown unknowns

Original Saying:

“ Give me six hours to chop down a tree and I will spend the first four sharpening the ax.” (Abraham Lincoln)

Machine Learning Saying:

“ Give me six hours to deliver a Machine Learning Model and I will spend the first four doing Feature Engineering.”

German NLP Saying:

“ Give me six hours to deliver a German NLP Model and I will spend the first five hours and thirty minutes doing text pre-processing.”

As few people know, I wasn’t born or raised in any German-speaking country, and this already places a very big initial barrier on some aspects of the language, such as its nuances and even the understanding of trivial issues like grammatical structure.

And here is my first tip: if you are not a native speaker of the language, I suggest reaching at least an A2-equivalent certificate level, so that you understand the basic grammatical structure of the language before dealing with it directly.

German, unlike my mother tongue (Brazilian Portuguese), has a very different sentence structure: Portuguese follows the SVO (Subject-Verb-Object) order, while in German this rule is not so strict; for instance, verbs can appear at the end of a sentence.

It may seem a small thing, but for a Portuguese speaker, a negative answer or even the verb’s action is indicated in the first words of a sentence, not at the end. This forces an extra mental load on us: we have to read the sentence to the end before having the right context of what is going on in it.

In German, this rule does not necessarily hold, with the disadvantage that, as someone literate in Brazilian Portuguese, I literally have to read the sentence completely and do the translation work in order to understand it.

One factor that helped me a lot here was that, since I was dealing with simple service-request texts on an internet platform, things were somewhat easier: the textual structure is quite similar whenever someone wants certain types of services.

In other words, my corpora would be very restricted and would need a large degree of specialization, but within a single domain with a singular corpus. If it were a type of text requiring a very high degree of specialization, such as constitutional, legal, or scientific text, I would have to go a bit beyond A2 just as an initial prerequisite.

Obviously, this is not a mandatory requirement, but I see understanding the language as that extra 20% of model performance that you only get by understanding what makes sense (or not) in a sentence, or even in the way you do preprocessing.

Here is the simple tip: respect the complexities of the language, and if it is a language you are not a native speaker of, respect it even more and try to understand its structures before writing the first line of code. I will explore the language a bit more in a few topics later in this post.

First, let’s take a look at the MyHammer case.

Context: Classification as a triage for a better match between tradesmen and consumers

For those who don’t know, MyHammer is a marketplace that connects craftsmen with consumers who need home services of the best quality. Our main objective is for craftsmen to receive relevant jobs to work on, and for consumers to place jobs that need to be done and receive good offers for high-quality services.

To reach that goal, we need to deliver the most suitable jobs to each craftsman considering their skills, availability, relevance, and potential economic interest. Our text classification project enters that equation by re-labeling jobs that sit in the wrong categories inside our platform, helping those matches happen.

In summary, our data has the following characteristics:

  • 200+ classes
  • Overlap of keywords between several classes
  • Past data was mislabeled, and as we created new classes, we didn’t correct the past
  • Tons of abbreviations
  • Hierarchical data (taxonomy) in terms of business, but not related at all in terms of language semantics
  • A lot of classes with 1000+ words per record
  • Dominance of imbalanced data (the top 10 categories hold 26% of all data; the bottom 100 hold less than 10%)
  • Miscellaneous classes that encompass several categories, which increases inter-class entropy
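As a side note, imbalance numbers like the ones above are easy to keep tracking as classes get added. A minimal sketch (the labels and the `category_share` helper are illustrative, not part of our pipeline):

```python
from collections import Counter

def category_share(labels, top_n):
    """Fraction of all records held by the top_n most frequent categories."""
    counts = Counter(labels)
    top = counts.most_common(top_n)
    return sum(count for _, count in top) / len(labels)

# Toy label distribution: one dominant class, a mid class, a rare class
labels = ["moving"] * 6 + ["painting"] * 3 + ["garden"]
print(category_share(labels, 1))  # share held by the single biggest class
```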

With this scenario, we made the first hard decision of the project, which was to invest at least 95% of our time in understanding the language across each class and building a strong pre-processing pipeline.

In other words: if we understand our language inside our corpora well, we can turn that to our advantage even using plain vanilla models.

With that in mind, we jumped into understanding our language better, instead of starting with algorithms and hoping for some very complex algorithm to work.

Language is Hard

Language is not hard. Language is very hard. I don’t want to go into much detail about the hype around it and the brittleness of the state of the art, but technically speaking, I think we are far from being anywhere near solving the kinds of problems that involve language in terms of conversation, or even having machines generate texts good enough to pass a simple essay.

Language contains tons of aspects and complexities that make everything hard. This very good post from MonkeyLearn describes some of those complexities, like:

  • polysemy: words that have several meanings
  • synonymy: different words that have similar meanings
  • ambiguity: a statement or resolution is not explicitly defined, making several interpretations plausible
  • phonology: the systematic organization of sounds in spoken languages and signs in sign languages
  • morphology: the study of the internal structure of words, a core part of linguistic study today
  • syntax: the set of rules, principles, and processes that govern sentence structure in a given language
  • semantics: the study of meaning in language, concerned with the relationship between signifiers (words, phrases, signs, and symbols) and what they stand for in reality, their denotation

Understanding just one of these aspects in depth would demand at least a full-time master’s degree.

The point I would like to make here is that knowing those aspects means understanding that language is hard. We chose to start with a statistical approach to prune out non-relevant words in our corpora and do the heavy-lifting work; after that, we would jump to the linguistic/symbolic approach to fine-tune the corpora before training models, to stay on the safer side in terms of NLP modeling.

Symbolic or statistical, what’s the best approach?

There’s a huge discussion about symbolic versus statistical approaches to language happening nowadays. Some proponents of the statistical side as the main paradigm are Yann LeCun and Yoshua Bengio; on the other side of the debate is Gary Marcus. There are some resources available and some debates about it, like this one between LeCun and Marcus and this thread.

For practitioners who are in the trenches daily, I would suggest pragmatism: use all the tools and methods that solve your problem in an efficient and scalable way. Here at MyHammer, I adopted a statistical approach for the heavy-lifting work and some language rules for tuning. Here are the quotes from Lexalytics that I like on the subject:

[…]The good point about statistical methods is that you can do a lot with a little. So if you want to build a NLP application, you may want to start with this family of methods[…]

[…]Statistical approaches have their limitations. When the era of HMM-based PoS taggers started, performances were around 95%. Well, it seems a very good result, an error rate of 5% seems acceptable. Maybe, but if you consider sentences of 20 words on average, 5% means that each sentence will have a word mislabeled […]

Source: Machine Learning Micromodels: More Data is Not Always Better

Yoav Goldberg, at the spaCy IRL conference, gave a great talk called “The missing elements in NLP”, where I think he excelled in saying that as we move from linguistics expertise to a more deep-learning approach to modeling NLP, we are going down a path of less debuggability and a more black-box approach. We can see this better in the following slide:

NLP Tomorrow

After that, I made the very hard decision to stick with a tool that could give me a bit of debuggability and transparency. I know that FastText deals with neural networks internally, but as FastText relies a lot on word n-grams, with simple data analysis it’s possible to explain why we are getting certain results or not.

The strategy here was: let’s take a very extreme pre-processing approach to our corpora to get the leanest corpora we can, and after this optimization, we can play with different models and see what’s going on.

To exemplify that, this figure from Kavita Ganesan illustrates our point:

Level of Text preprocessing

Source: All you need to know about text pre-processing for NLP and Machine Learning. https://www.freecodecamp.org/news/all-you-need-to-know-about-text-preprocessing-for-nlp-and-machine-learning-bc1c5765ff67/

Some specifics in Pre-Processing for the German language

Umlauts (ä, ö and ü)  and Encoding
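The main trap here is that the same umlaut can arrive either as a precomposed character (ä) or as a base letter plus a combining diaeresis, and user-generated text often transliterates umlauts to ASCII (ae, oe, ue). A minimal normalization sketch, assuming NFC as the target form and an optional ASCII fallback (the helper name is ours, not from any library):

```python
import unicodedata

# Common ASCII transliterations used in German when umlauts are unavailable
TRANSLITERATIONS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
                                  "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"})

def normalize_german(text, ascii_fallback=False):
    """Collapse combining umlauts into precomposed characters (NFC),
    optionally transliterating to plain ASCII."""
    text = unicodedata.normalize("NFC", text)
    return text.translate(TRANSLITERATIONS) if ascii_fallback else text

print(normalize_german("Mu\u0308ller"))            # combining diaeresis -> "Müller"
print(normalize_german("Müller", ascii_fallback=True))  # -> "Mueller"
```

Whether you keep umlauts or transliterate them, the important thing is to pick one form and apply it consistently to both the training corpus and incoming text.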

Long German Nouns

  • This problem was raised by Markus Konrad in a piece called Lemmatization of German Language Text, where he explains that German nouns can be formed by compounding other words; for example, “Feinstaubbelastungen” consists of “Feinstaub” and “Belastungen”.

As we know that these long nouns can appear depending on the situation, our strategy was to analyze the word n-grams and TF-IDF scores to see the relevance of these words: if relevant, use some rules to break them down and keep them; if not, remove them from the vocabulary.
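A naive version of such a break-down rule can be sketched as a greedy, vocabulary-driven splitter. This is an illustrative toy under strong assumptions: it needs a vocabulary of known parts and ignores linking elements (such as the Fugen-s) that real German compounds often contain:

```python
def split_compound(word, vocab):
    """Greedy longest-prefix split of a German compound against a known
    vocabulary of lowercase parts. Returns a list of parts, or None if
    no full decomposition is found."""
    word = word.lower()
    if word in vocab:
        return [word]
    # Try the longest prefix first; require parts of at least 3 characters
    for i in range(len(word) - 1, 2, -1):
        prefix, rest = word[:i], word[i:]
        if prefix in vocab:
            tail = split_compound(rest, vocab)
            if tail:
                return [prefix] + tail
    return None

vocab = {"feinstaub", "belastungen"}
print(split_compound("Feinstaubbelastungen", vocab))  # ['feinstaub', 'belastungen']
```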

Part-of-Speech tagging

  • We used part-of-speech tagging to remove some words from our corpora. Our strategy consisted of: a) always keeping the verbs; b) using placeholders to abstract data entities inside our data (e.g., in our domain we use placeholders for sqm (square meters), and this placeholder gives some information gain in all the particular classes that contain these words); c) cutting out pronouns and conjunctions, since in our case they do not contain any meaningful information most of the time.
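The placeholder idea from (b) can be sketched with a plain regex; the `<sqm>` token and the unit spellings covered here are assumptions for illustration, not our production pattern:

```python
import re

# Match quantities like "80 qm", "100 m2", "100 Quadratmeter"
SQM_PATTERN = re.compile(r"\b\d+\s*(?:qm|m2|quadratmeter)\b", re.IGNORECASE)

def add_placeholders(text):
    """Replace square-meter mentions with an abstract placeholder token."""
    return SQM_PATTERN.sub("<sqm>", text)

print(add_placeholders("Wohnung mit 80 qm streichen"))
# -> "Wohnung mit <sqm> streichen"
```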

For those interested in some examples, this table from NLTK is a good start:

German examples

Source: NLTK Universal Part-of-Speech Tagset

Stopwords: Analyze first, cut after…

One of the biggest endeavors of the project was to find a very nice library with consolidated German corpora, in a world where the majority of implementations and SOTA algorithms are crafted for English as the first-class language.

As our task was text classification only and our data was massive from the beginning, we decided to perform an extreme cut of stopwords, because we were not doing any posterior application that relied on sequence, like an LSTM or seq2seq. This gave us room for a more unorthodox approach.

In the work by Silva and Ribeiro called The importance of stop word removal on recall values in text categorization, the authors showed a positive relation between stopword removal and recall, and we followed that methodology in our work.

If I could give specific advice on the matter, I would suggest using out-of-the-box stopword lists from those packages only if you have no time at all to analyze your corpora. Otherwise, always perform the analysis and create your own personalized list. To make my point clear, I’ll use an example from Chris Diehl.

Chris Diehl, in his post called Social Signaling and Language Use, performed a linguistic analysis on some e-mails from Enron, a company involved in a gigantic case of financial/corporate fraud. The analysis consisted of discovering whether a manager-subordinate social relationship exists.

The original e-mail is presented below:

Email

Doing a skim read, we can see that there’s a clear social relationship that characterizes subordination. However, if we feed this same text to a regular stopwords package, this will be the outcome:

Stop words

As Chris Diehl pointed out, the terms that matter in this message are function words, not content words. In other words, the removal of these words could mischaracterize the whole message and its meaning. For more, I suggest reading Chris’s entire article.

A great post about the differences in stopwords across several open-source packages was written by Gosia Adamczyk, called Common pitfalls with the preprocessing of German text for NLP, where she shows some differences between those packages.

Pitfalls

Source: Common pitfalls with the preprocessing of German text for NLP, Gosia Adamczyk 

The key takeaway we got here was: trust the stopwords from packages, but check them, and if necessary mix all of them together and use knowledge about your domain to enhance the list.

Stopwords as Hyperparameters

In our project, as we started to go deeper into our corpora, we discovered very quickly that the usual stopword lists were not only unsuitable for us, but that we needed to consolidate as many of them as possible to remove words from our corpora.

This was necessary because, as we were dealing with such an amount of text, we wished to reduce training time as much as possible, and having lean corpora was mandatory for that.

The main problem I see with the current stopword lists is that they are built on top of tons of text suitable for general purposes, but when we go into specific domains, the coverage is not enough, and not knowing that was a huge source of inefficiency for us.

We followed a strategy of analyzing the results, and if we noticed some improvement in model performance or gains in processing time, we added more stopwords to the list. Roughly speaking, it was a kind of “stopword list as hyperparameter”.

This example from Lousy Linguist translates my point:

Stopwords as hyperparameters

Source: https://twitter.com/lousylinguist/status/1068285983483822085

A single example that happened to us: our models started to give very strange results when the text contained the name Hamburg or München. Basically, every time a model received those words, it always gave a single service as the prediction. In other words, it was a clear case of overfitting.

Long story short: we discovered that for one specific moving service, when customers placed the service request for our craftsmen, the text posted by the consumer always contained information about the city (which makes total sense, since when someone needs to move to another place, in the majority of cases the origin and destination are stated).

A single example to illustrate that:

  • I would like to move my piano from Hamburg to Köln (Moving service)
  • I would like to paint my apartment. I’m located in downtown Hamburg. (Painting service)

The problem was that the second case always ended up classified as Moving Service.

The solution here was to debug the model by analyzing the word n-gram composition for this service and include the cities as stopwords; after that, everything worked well.

There’s no exhaustive list, but if I can give some hints, I would classify the stopword lists like this:

German States and Cities: bayern, baden-württemberg, nordrhein-westfalen, hessen, sachsen, niedersachsen, rheinland-pfalz, thüringen

W-Frage: wer, was, wann, wo, warum, wie, wozu

Pronouns: das, dein, deine, der, dich, die, diese, diesem, diesen, dieser, dieses, dir, du, er, es, euch, eur, eure, ich, ihm, ihn, ihnen, ihr, ihre, mein, meine, meinem, meinen, meiner, meines, mich, mir, sie, uns, unser, unsere, unserem, unseren, unserer, unseres, wir

Numbers: null, eins, zwei, drei, vier, fünf, sechs, sieben, acht, neun, zehn, elf, zwölf

Ordinals: erste, zweite, dritte, vierte, fünfte, sechste, siebte, achte, neunte, zehnte, elfte, zwölfte, dreizehnte

Adjectives: egoistisch, ehrgeizig, ehrlich, eingebildet, empfindlich, englisch, engstirnig, entspannend, faul, feigling, fett, fleißig, französisch, freundlich, fromm, fröhlich, gesprächig, grausam, großzügig, hallo, herkömmlich, höflich, hübsch, intelligent, interessant, ja, jung, jämmerlich, komisch, langsam, langweilig, launisch, lustig

Greetings: ciao, hallo, bis, später, guten, tag, tschüss, wiederhören, wiedersehen, wochenende, hallo

Clarification words: dass, dafür, daher, dabei, ab, zb, usw, schon, sowie, sowieso, seit, bereits, hierfür, oft, mehr, na
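Consolidated, these category lists boil down to one merged set that tokens are filtered against. A sketch with a few sample entries from the lists above (illustrative subsets, not the full lists):

```python
# Sample entries from the stopword categories above
CITIES = {"hamburg", "münchen", "köln", "berlin"}
W_FRAGEN = {"wer", "was", "wann", "wo", "warum", "wie", "wozu"}
PRONOUNS = {"ich", "du", "er", "sie", "es", "wir", "ihr", "mein", "dein"}
GREETINGS = {"hallo", "tschüss", "ciao", "guten", "tag"}

# One consolidated, domain-specific stopword set
CUSTOM_STOPWORDS = CITIES | W_FRAGEN | PRONOUNS | GREETINGS

def remove_stopwords(tokens):
    """Filter tokens against the consolidated stopword set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in CUSTOM_STOPWORDS]

print(remove_stopwords(["Hallo", "ich", "möchte", "von", "Hamburg",
                        "nach", "Köln", "umziehen"]))
```

Treating this merged set as a tunable artifact, rather than a fixed import, is exactly what “stopwords as hyperparameters” means in practice.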

Some lessons along the way…

We had some lessons along the way in our case. The point here is not to state any truths, but to show that while analyzing the data and performing experiments with our dataset, we found things totally different from what we usually hear about NLP and text pre-processing. I divide these sections into what worked and what didn’t work.

What didn’t work

  • Lemmatization: This was the most surprising takeaway: using lemmatization as a common “best” practice, we had a 3% decrease in Top@5 accuracy. The reason was that for some categories, a specific subset of words characterizes that category.

For instance: we have some services that, despite sharing the same subset of words in our corpora, like Wunschlackierung (desired paint job), Lackaufbereitung (paint preparation), Unfallreparaturen (accident repairs), and Kratzer im Lack (scratches in the paint), can fit into more than one service, and the lemmatization of some words caused a loss of information inside the classes, decreasing the algorithm’s performance. To solve this problem, we removed lemmatization from our pre-processing pipeline.

  • Lemmatization was too slow for our data: Another deal breaker was that lemmatization, even using a very good API like spaCy, took ages on our data in the beginning. Our dataset contains millions of records with text fields holding hundreds of words per line. Even using a 128-CPU machine, it took a long time, and as we also saw the decrease in model performance, we abandoned that approach. [Note: in the beginning we used a previous version of spaCy that didn’t contain several improvements over the current version]

  • Hyperparametrization of FastText: At the start of this project we relied a lot on FastText, and we didn’t regret it, but a huge limiting factor is that there are few meaningful parameters to use as hyperparameters. Here I’m specifically talking about the window size and dimension parameters, which in our data showed no sign of being meaningful and/or useful in any grid-search or tuning strategy.

  • Spark for the data pipeline and model building and training: Spark is a cool tool for data processing, but for NLP, the integrations using native Scala code didn’t provide the minimum in terms of libraries, flexibility, and ease of use for us to rely on for text pre-processing in German. I personally think text processing using linguistic features is one of the weakest points of Spark’s libraries. We’re still using it, but only for data aggregation and dumping from one place to another.

What worked

  • Our own pre-processing library and personalized stopword list: This can sound a bit like overengineering, but it was the only way for us to get maximum flexibility for our needs. We simply united the best of all the tools for German NLP, like stopword lists, stemming, PoS, and so on, and crafted something that saves us tons of time.

  • Hierarchical models: This one deserves a special blog post, but for us, using our natural ontology put less pressure on accuracy, because at the end of the day we got a natural pruning of bad results.

  • Using EDA + TF-IDF scores to remove low-relevance words: We learned that in our case, removing words with low TF-IDF scores helped a lot in terms of experimentation without sacrificing performance. In the beginning, I just created a full TF-IDF list, and as I included more words, I monitored the performance to see if there was some loss. My heuristic was: get the minimum number of words possible, with a tolerance of no more than 1% loss in model performance. This can look like quite a hard metric, but at least in the beginning I was more concerned with having very lean corpora than with having good performance from a more complex and brittle model.

  • The oldie but goldie regex: This worked way better than lambda and map() in Python for word replacement. The takeaway here is to use lambda and map only if really necessary.

  • Word embeddings: We didn’t use embeddings at first, to force ourselves to stress all the possibilities with our corpora and hyperparameters. However, the performance I lost by removing low-IDF-score words and using a personalized stopword list, I gained back just by plugging embeddings into the FastText model.
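To make the TF-IDF pruning heuristic concrete, here is a plain-Python sketch that scores each term by its mean TF-IDF across documents and drops terms below a threshold. This is an illustrative toy under assumptions of ours (the smoothing and threshold are made up); our actual pipeline used proper tooling:

```python
import math
from collections import Counter

def mean_tfidf(docs):
    """Mean TF-IDF score per term across tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    # +1 smoothing keeps ubiquitous terms above zero
    idf = {t: math.log(n / df[t]) + 1 for t in df}
    totals = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            totals[term] += (count / len(doc)) * idf[term]
    return {t: totals[t] / n for t in totals}

def prune_low_score(docs, threshold):
    """Drop terms whose mean TF-IDF falls below the threshold."""
    scores = mean_tfidf(docs)
    return [[t for t in doc if scores[t] >= threshold] for doc in docs]

docs = [["umzug", "klavier"], ["umzug", "maler"], ["umzug", "garten"]]
print(prune_low_score(docs, 0.4))  # rare, low-scoring terms are pruned
```

The threshold plays the same role as the 1%-tolerance heuristic above: you tighten it until the corpora shrink as far as the model performance budget allows.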

The key takeaway was: use best practices as a starting point, but always check alternatives, because your data can have specificities that generate better results.

Final Remarks

If I have to do a wrap-up of all of these things, I would highlight as the main ones:

  1. Recognize your own language and jargon: For us, understanding what we were dealing with, i.e., web-based texts in the context of craftsmanship and its language, helped us perform better pre-processing, which enhanced our models;

  2. Make your own corpus/vocabulary/embeddings: Out-of-the-box stopword lists and common text pre-processing helped a lot, but at the end of the day we needed to create our own stopword list, and we used low-TF-IDF-score words as additional stopwords to prune out unnecessary words, reduce our corpora, and consequently reduce training time;

  3. Do not automate misleading data: In our case, one of the capital mistakes in the beginning was trying to automate the data pipeline and cleaning without considering some specifics of our case, like jargon and abbreviations. In other words: data pre-processing is gold, models are silver.

LIBRARIES

  • German_stopwords: Last commit 3 years old, but contains a good set of loose words and abbreviations
  • DEMorphy: morphological analyzer for German language (several types of Tagset for dictionary filtering)
  • textblob-de by markuskiller: PoS
  • GermaLemma: First one that uses PoS before lemmatization (spaCy will do that out of the box in the next version)
  • tmtoolkit: Wrapper with PoS, lemmatization, and stemming for German. Good for EDA with Latent Dirichlet Allocation.
  • NLTK: There’s a small trick to use PoS with German
  • StanfordNLP: (for future test)
  • SpaCy: NLP with batteries included (syntactic dependency parsing, named entity recognition, PoS)
  • Python Stopwords: Generic compilation of several Stopwords 

REFERENCES

Useful Links

Papers

  • Nothman, Joel, Hanmin Qin, and Roman Yurchak. “Stop Word Lists in Free Open-source Software Packages.” Proceedings of Workshop for NLP Open Source Software (NLP-OSS). 2018. - https://aclweb.org/anthology/W18-2502

Books

  • Indurkhya, Nitin and Fred Damerau (eds.) (2010) Handbook of Natural Language Processing (Second Edition). Chapman & Hall/CRC.
  • Jurafsky, Daniel and James Martin (2008) Speech and Language Processing (Second Edition). Prentice Hall.
  • Mitkov, Ruslan (ed.) (2003) The Oxford Handbook of Computational Linguistics. Oxford University Press.