
NLP in Recruitment


31/01/2018 | 17 min read

Paweł Dyrek

In companies like Codete, we use computers pretty extensively: not only for developing the great ideas our clients bring to us, but also for things like management, communication, security, human resources, and so on. We invest a lot in improving our workflows so that we can focus more on delivering great software.

This ultimate goal requires one key component - the best developers. Our HR department has to bend over backwards to find them in the recesses of the Internet, but finding them isn’t the only problem. Even if we build a great database of candidates, we still need our HR specialists to find the perfect match, and that is always done manually: making decisions is a human job, not a computer’s.

 

Technology to the rescue

What if there’s actually something we can use to help HR staff and reduce the error-prone and tedious task of going through a big database of developers while looking for the perfect match? What if we take Artificial Intelligence, which is making its way through the trends in Computer Science, and bring it to our HR specialists? Let me present some ideas on how AI, in particular Natural Language Processing, can improve hiring the best candidates possible.

Before we start digging into the topic of using technology to help HR specialists in their work, we have to establish some common ground. The first, and obvious, question which may arise from the introductory paragraph is: what actually is Natural Language Processing? And, perhaps even more curiously: how can we apply it to something so human-specific? I’ll try to answer those questions, so without further ado, let’s jump right into the theory!

We all know what a developer’s job is: to write and to understand code in a language which the computer can interpret and execute, right? There are a lot of programming languages out there, using different paradigms, having different use cases, and so on. Usually, people consider programming hard work, but in general, it can be similar to learning a new foreign language. So yes, it’s hard, but it’s doable - you can learn Spanish, can’t you?

Natural Language Processing (NLP) is pretty much the same... but it’s about teaching computers to understand the languages used by humans. Sounds pretty easy, right? Just give them an exercise book and some time. Well, if it were that easy, we’d probably have a cure for cancer in our pockets by now. Even though human languages have grammar and are formally described, they can be used in a lot of different ways, and that’s the main problem compared to programming languages, which are pretty strict. Additionally, human languages evolve; we have dialects, slang, abbreviations, and many, many more edge cases.

That’s the field of expertise of NLP. Smart people around the world are working really hard to get computers to understand human languages, process them and, later on, use computing power to help people achieve amazing things. Make no mistake, it’s hard: it takes more than a couple of programmers to make something useful in the field of NLP. It takes a lot of time and a lot of scientists with a solid academic background to advance this field of study.

While research papers, a strong theoretical background, and experiments are really important, the real deal begins when we want to apply these techniques in real life. Nothing verifies science as much as the day-to-day tasks in which we can apply it, and probably nothing expresses more gratitude to the scientists than millions of people using their research every day.

This article is focused on using Artificial Intelligence in the recruitment process, so let’s see how we can take advantage of the proper tools, put them together and achieve some useful results.

 

What are we dealing with?

The recruitment process is usually based on exchanging emails, right? Let’s make some assumptions to simplify this process for now (you can extend it later; the sky is the limit). Imagine that instead of getting CVs in PDF format, we get them straight in an email message. In order to extract information from emails, we need to convert them to text and then perform text processing.

As mentioned before, natural language processing is widely researched (there are many books and papers on the subject) and there are many tools and SDKs available to perform such computing. With that in mind, let’s draw up a couple of theoretical steps to achieve our goal.

First of all, we’ll have to convert emails to text. One of our initial assumptions was that we already exchange emails in text form, so we just have to extract the core content and we’re good to go. There’s a possibility that instead we’ll get the CV in PDF or another format. In that case, we’ll have to use some transformation tools. For PDF, there are a lot of libraries and programs available, but it’s also possible that we’ll have to use more advanced techniques like optical character recognition.
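To give you an idea of how simple that first step can be, here’s a minimal sketch using the pdfminer.six library (the file name is just a made-up example):

```python
# A minimal sketch of the PDF-to-text step, using the pdfminer.six library
# (pip install pdfminer.six). The file name is a made-up example.
from pdfminer.high_level import extract_text

cv_text = extract_text("candidate_cv.pdf")  # plain text from all pages of the PDF
print(cv_text[:500])                        # preview the beginning of the CV
```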

Preprocess the text

The next step is the actual preprocessing of the text documents. This is a really important part of building our database of candidates. In this step we’re actually doing a really important thing, or even a couple of them: we’re preparing something which the computer will understand. Here are some steps taken in the process of text preprocessing (a short code sketch follows the list):

  1. Chunking - extracting sentences from the text.
  2. Tokenizing - breaking up a stream of text into words (so-called tokens).
  3. Stemming/Lemmatization - reducing words to their base form in order to be analyzed as a single item.
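Here’s a minimal sketch of those three steps using NLTK, a popular Python NLP library (the sample text is made up, and depending on your NLTK version you may need to download additional data packages):

```python
# A minimal sketch of the three preprocessing steps using NLTK
# (pip install nltk; some NLTK versions require extra data packages).
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt")    # models for sentence and word tokenization
nltk.download("wordnet")  # lexical database used by the lemmatizer

text = "We are looking for Java developers. Good knowledge of SQL is required."

sentences = nltk.sent_tokenize(text)                 # 1. chunking into sentences
tokens = [nltk.word_tokenize(s) for s in sentences]  # 2. tokenizing

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for sentence in tokens:
    print([stemmer.stem(t) for t in sentence])          # 3a. stemming (crude suffix stripping)
    print([lemmatizer.lemmatize(t) for t in sentence])  # 3b. lemmatization (dictionary base forms)
```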

It may look scary, and if you had to code it from scratch every time you wanted to achieve this, well... we’ve discussed that already. It’s much better to use battle-tested software, and fortunately there are quite a few solutions on the market for you to grab, use and get the results you need. Below you’ll find just a couple of examples of software capable of such preprocessing:

  • TextRazor has a very impressive demo of its capabilities. It supports multiple languages, including German. You can use it either as a REST API or via their SDK (available in Java, Python, and PHP). What’s really important, you can extract semantic metadata using custom rules.
  • Stanford NLP Software includes a tokenizer, CoreNLP, and a classifier. It’s software produced by Stanford, which is home to some of the best NLP computer scientists in the world. One disadvantage we can point out here is that this set of tools fully supports only English. Other languages like Arabic, Chinese, French, German and Spanish have only partial support.

There are more of those, like the tools from the Cognitive Computation Group or Apache’s OpenNLP, so as you can see, you can easily choose from a wide range of available software, test it and decide whether it suits your requirements or whether you need something else.

Structure the data

Going further, we have to construct a vector space model from the text documents we’ve collected so far. It’s an algebraic way of representing a document as a set of vectors with corresponding values, which makes it a great representation for further processing. There are many papers on how to approach this topic, but we recommend familiarizing yourself with two chapters from Speech and Language Processing by Dan Jurafsky and James H. Martin: Vector Semantics and Semantics with Dense Vectors.

This process can be divided into a couple of steps, and we’ll guide you through them so it’s easier to grasp what’s actually going on. This will be the most theoretical and (from the mathematical point of view) advanced part of this article, but don’t worry: there are tools that can easily do it for you. Still, being conscious of what’s happening underneath is always beneficial.

In our case, the first step will be the preparation of a word-person matrix, in which a person is represented by all emails that refer to him or her. It’s important to make sure that all emails related to a particular person are concatenated into a single text document, so that we have everything in one place and get correct results. You may wonder what this matrix will contain, so let’s look at the following points describing its elements:

  • each row represents a particular person
  • each column represents a word from the vocabulary (generated from all preprocessed emails)
  • each cell contains the number of times a particular word (column) occurs for a particular person (row)

Now, let’s create this matrix. To do so, we’ll split the work into three steps. The first one is to create a matrix containing the number of times each word occurs in all emails related to a particular person. Yes, it’s simple counting, so assume this is the sample matrix we got from processing our correspondence with three people:

 

          PHP   C++   C#   Java   JavaScript   Python   SQL
person1   800     0   50      0          440        0    50
person2     0   430    0    650            0        0   220
person3   780     0    0      0          230      710     0

Table 1. The matrix containing the number of times a particular word occurs in all emails related to a particular person.

The next step is even simpler, because we just have to count all occurrences of nouns in those emails, so we can present it as a vector:

 

          nouns
person1   10000
person2   20000
person3   40000

Table 2. A vector containing the number of all nouns in all emails related to a particular person.

Last but not least, we have to calculate the initial matrix we need for further processing. As you may have already figured out, we have to divide the number of times a particular word occurs in all emails related to a particular person by the number of all nouns in those emails. Let’s see how it looks in a visual form:

 

          PHP      C++      C#      Java     JavaScript   Python    SQL
person1   0.08     0        0.005   0        0.044        0         0.005
person2   0        0.0215   0       0.0325   0            0         0.011
person3   0.0195   0        0       0        0.00575      0.01775   0

Table 3. Initial matrix. The value of a cell is the number of times a particular word occurs in all emails related to a particular person divided by the number of all nouns in those emails.
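If you’d like to play with these numbers yourself, here’s a small sketch reproducing Tables 1-3 with NumPy (the counts are the sample values from the tables above; in a real system they would come from the preprocessing step):

```python
# A sketch reproducing Tables 1-3 with NumPy. The counts are the sample
# values from this article; in practice they come from preprocessing.
import numpy as np

skills = ["PHP", "C++", "C#", "Java", "JavaScript", "Python", "SQL"]

# Table 1: word counts per person (rows = persons, columns = skills)
counts = np.array([
    [800,   0, 50,   0, 440,   0,  50],   # person1
    [  0, 430,  0, 650,   0,   0, 220],   # person2
    [780,   0,  0,   0, 230, 710,   0],   # person3
], dtype=float)

# Table 2: total number of nouns in each person's emails
nouns = np.array([10000, 20000, 40000], dtype=float)

# Table 3: divide each person's row by their noun count
initial = counts / nouns[:, np.newaxis]
print(initial.round(5))
```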

The first, and very important, step of constructing the vector space model is behind us, but you’ve probably noticed that the values in the resulting matrix don’t really reflect anything meaningful yet. It looks like the more emails you exchange, the lower your chances of being considered a good developer in a specific field. Why is that? It’s because of the way we’ve normalized the values so far.

We have to change the figures in our matrix from absolute values to relative ones, and there are a couple of ways to do it. Two options are PMI and PPMI, which look like this:

PMI(w, p) = log₂( P(w, p) / ( P(w) × P(p) ) )

PPMI(w, p) = max( log₂( P(w, p) / ( P(w) × P(p) ) ), 0 )

where:

  • w - word
  • p - person
  • P(w, p) - the value of the cell in the p-th row and w-th column divided by the sum of all values in the matrix
  • P(w) - the sum of values in the w-th column divided by the sum of all values in the matrix
  • P(p) - the sum of values in the p-th row divided by the sum of all values in the matrix
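As a rough illustration, here’s how PPMI could be computed with NumPy, following the definitions above (a sketch, not a production implementation; `matrix` is assumed to have persons as rows and words as columns):

```python
import numpy as np

def ppmi(matrix):
    """PPMI for a person-word matrix (rows = persons, columns = words)."""
    total = matrix.sum()
    p_wp = matrix / total                          # P(w, p): joint probabilities
    p_w = matrix.sum(axis=0) / total               # P(w): word (column) marginals
    p_p = matrix.sum(axis=1) / total               # P(p): person (row) marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wp / np.outer(p_p, p_w))   # pointwise mutual information
    pmi[~np.isfinite(pmi)] = 0.0                   # zero counts give log(0); clamp them
    return np.maximum(pmi, 0.0)                    # keep only positive associations
```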

We can also apply another numerical statistic called term frequency-inverse document frequency (tf-idf for short), and this will be our method of choice for the rest of our example, so below you’ll see the steps necessary to calculate our normalized matrix.

Let’s start with the formula for calculating tf-idf:

tfidf(w, p) = tf_wp × idf_w

where:  

  • tf_wp (term frequency) - the frequency of a particular word for a particular person, relative to all people: the value of the cell in the p-th row and w-th column divided by the sum of values in the w-th column.
 

          PHP     C++   C#   Java   JavaScript   Python   SQL
person1   0.804   0     1    0      0.884        0        0.312
person2   0       1     0    1      0            0        0.688
person3   0.196   0     0    0      0.116        1        0

Table 4. Matrix with normalized values from the matrix in Table 3 (the sum over all people within a particular skill equals 1).

  • idf_w (inverse document frequency) = log₂( N / df_w ), where:
    • N - the total number of people in the collection
    • df_w - the number of people for whom the given word "w" occurs

Let’s consider an example: PHP occurs for two of our three people, so idf_PHP = log₂(3/2) ≈ 0.585. Calculating the inverse document frequency for every skill gives us the following vector:

 

             idf_w
PHP          0.5849625007
C++          1.5849625007
C#           1.5849625007
Java         1.5849625007
JavaScript   0.5849625007
Python       1.5849625007
SQL          0.5849625007

Table 5. Inverse document frequency vector.

With this vector, we can proceed and calculate the whole normalized matrix, and as a result we should get:

 

          PHP     C++     C#      Java    JavaScript   Python   SQL
person1   0.470   0       1.585   0       0.517        0        0.183
person2   0       1.585   0       1.585   0            0        0.402
person3   0.115   0       0       0       0.068        1.585    0

Table 6. The tf-idf normalized matrix: each element of a column in the matrix from Table 4 is multiplied by the corresponding element of the vector from Table 5.
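Here’s a sketch of the whole tf-idf normalization in NumPy, starting from the Table 3 matrix (it assumes every skill occurs for at least one person, which holds in our example):

```python
import numpy as np

# `initial` is the Table 3 matrix (rows = persons, columns = skills).
initial = np.array([
    [0.08,   0.0,    0.005, 0.0,    0.044,   0.0,     0.005],
    [0.0,    0.0215, 0.0,   0.0325, 0.0,     0.0,     0.011],
    [0.0195, 0.0,    0.0,   0.0,    0.00575, 0.01775, 0.0  ],
])

tf = initial / initial.sum(axis=0)      # Table 4: each column normalized to sum to 1
df = (initial > 0).sum(axis=0)          # number of people each skill occurs for
idf = np.log2(initial.shape[0] / df)    # Table 5: idf_w = log2(N / df_w)
tfidf = tf * idf                        # Table 6: the tf-idf weighted matrix
print(tfidf.round(3))
```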

The steps you’ve seen above for normalizing the matrix aren’t hard, but the whole idea of this article is to show you that you can achieve your goal by simply using existing software and not worrying about all the computation. For this step, we can use tools like Weka or Apache Lucene Core.

Profit!

Looks like we’re getting closer and closer to what we actually want to achieve. The next step is to create a proper query, so we have to formalize the skills we’re looking for in our future candidate. To do so, we need to find similar vectors. You probably know that two vectors are similar when the angle between them is small (cosine close to 1) and dissimilar when the angle approaches 90°, in other words, when the cosine is close to 0.

Putting this in our context, person A (represented by vector A) is a better candidate than person B (represented by vector B) when cosine between vector A and query vector is bigger than cosine between vector B and query vector. Let’s say we’re looking for a candidate who knows Java, SQL, and JavaScript. Our first step is to translate this query into a vector:

Skills       Candidate
PHP          0
C++          0
C#           0
Java         1
JavaScript   1
Python       0
SQL          1

Table 7. Query vector.

Next, we simply have to compute the dot product (or cosine) of the query vector and each person’s vector to get the final scores:

 

        person1   person2   person3
score   0.700     1.987     0.068

Table 8. Final score.

Last but not least, we have to sort the final scores to get an ordered list of candidates that fulfill the given criteria.
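Putting the last few steps together, here’s a small sketch that scores and ranks our three candidates (the `tfidf` values are the ones from Table 6):

```python
import numpy as np

# The tf-idf matrix from Table 6 (rows = persons, columns = skills).
tfidf = np.array([
    [0.470, 0.0,   1.585, 0.0,   0.517, 0.0,   0.183],
    [0.0,   1.585, 0.0,   1.585, 0.0,   0.0,   0.402],
    [0.115, 0.0,   0.0,   0.0,   0.068, 1.585, 0.0  ],
])

# Query vector from Table 7: we're looking for Java, JavaScript and SQL.
query = np.array([0, 0, 0, 1, 1, 0, 1])

scores = tfidf @ query                  # Table 8: one dot-product score per person
for rank, p in enumerate(np.argsort(scores)[::-1], start=1):
    print(f"{rank}. person{p + 1}: {scores[p]:.3f}")
```

Running it prints person2 first with a score of 1.987, matching the final ranking above.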

That’s it! We’ve got our final results. Of course, there are a couple of things to note and consider while looking into this sort of solution.

Just to name a couple of them: the email-to-person mapping is assumed to be one-to-many, because an email has at least one sender and one recipient, hence at least a 1-to-2 mapping (not to mention Cc and Bcc). You may get slightly different results (maybe more accurate in your case) if you prioritize people based on which list ('From', 'To', 'Cc', 'Bcc') they belong to (there is no prioritization in the current solution).

The weights can be real numbers ranging from 0 to 1, for example:

  • “From”, “To” - 1 
  • “Cc” - 0.5
  • “Bcc” -  0.25

The example presented in this article is really small, so the operations on the matrix were fast and easy to calculate, but in real life the lists of candidates and technologies may grow drastically, giving you a hard time managing them. In that case you should consider dimensionality reduction, and you have a couple of options, so let’s have a brief look at them:

  • SVD (singular value decomposition) - seems to be the best solution, as it is based on pure vector/matrix operations (see the sketch after this list).
  • Skip-gram and CBOW (neural network approaches) - although these are dimensionality reduction approaches, they don’t seem to be the proper ones here. These solutions are used to evaluate whether one word is similar to another, based on the fact that words with similar meanings frequently occur near each other in text. Neural networks are used to predict the neighborhood of a given word, in contrast to our current case, where we rely on absolute values (the number of times a particular word occurs in the text).
  • Brown clustering - it doesn’t seem to be the proper choice either, because it clusters words based on associations between preceding and following words.
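For the SVD option, a minimal sketch with NumPy could look like this (`matrix` would be our tf-idf matrix and `k` the target number of dimensions; this is just an illustration, not a tuned implementation):

```python
import numpy as np

def reduce_dimensions(matrix, k):
    """Project each row (person) of `matrix` onto its k strongest SVD dimensions."""
    # Full decomposition: matrix = U * diag(s) * Vt
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep only the k largest singular values and their vectors
    return u[:, :k] * s[:k]  # rows are persons in a k-dimensional space
```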

I hope this article gets things moving in your brain. I hope you see the potential for AI and NLP in almost every aspect of our work, and that you see it not as a replacement for humans, but as a complementary tool for boosting our productivity and delivering the best results in a short amount of time.

Of course, this article only scratches the surface of a big and complex topic, so I’m more than happy to have given you this general notion of what’s going on and how you can apply it in your company.


Paweł Dyrek

Director of Technology at Codete
