At companies like Codete we use computers pretty extensively, not only for developing the great ideas our clients bring to us, but also for things like management, communication, security, human resources and so on. We invest a lot in improving our workflows so that we can focus more on delivering great software. This ultimate goal requires one key component – the best developers. Our HR department has to bend over backwards to find them in the recesses of the Internet, but finding them isn’t the only problem. Even if we build a great database of candidates, we still need our HR specialists to find the perfect match, and that is always done manually – making decisions is a human job, not a computer’s.

NLP – Technology to the rescue

What if there were actually something we could use to help HR staff and reduce the error-prone, tedious task of combing through a big database of developers in search of the perfect match? What if we could take Artificial Intelligence, which is making its way through the trends in Computer Science, and bring it to our HR specialists? Let me present some ideas on how AI, and Natural Language Processing in particular, can improve hiring the best candidates possible.

Before we start digging into the topic of using technology to help HR specialists in their work, we have to establish some common ground. The first and most obvious question arising from the introductory paragraph is: what actually is Natural Language Processing? And, perhaps more curiously, how can we apply it to something so human-specific? I’ll try to answer those questions, so without further ado – let’s jump right into the theory!

We all know what a developer’s job is – to write and understand code in a language a computer can interpret and execute, right? There are a lot of programming languages out there, using different paradigms, serving different use cases and so on. People usually consider programming hard work, but in general it is similar to learning a new foreign language: yes, it’s hard, but it’s doable – you can learn Spanish, can’t you?
Natural Language Processing (NLP) is pretty much the same… but it’s about teaching computers to understand the languages used by humans. Sounds easy, right? Just give them an exercise book and some time. Well, if it were that easy, we’d probably have a cure for cancer in our pockets by now. Even though human languages have grammar and are formally described, they can be used in many different ways, and that’s the main problem compared to programming languages, which are quite strict. Additionally, human languages evolve; we have dialects, slang, abbreviations and many, many more edge cases. That is NLP’s field of expertise. Smart people around the world are working really hard to get computers to understand human languages, process them, and later use computing power to help people achieve amazing things. Make no mistake, it’s hard: it doesn’t take a couple of programmers to make something useful in the field of NLP – it takes a lot of time and many scientists with a solid academic background to advance this field of study.

While research papers, a strong theoretical background and experiments are really important, the real deal begins when we want to apply these techniques in real life. Nothing verifies science as much as the day-to-day tasks to which we can apply it, and probably nothing expresses more gratitude to the scientists than millions of people using their research every day.

This article focuses on using Artificial Intelligence in the recruitment process, so let’s see how we can take advantage of the proper tools, put them together and achieve some useful results.

What are we dealing with?

The recruitment process is usually based on exchanging emails, right? Let’s make some assumptions to simplify this process for now (you can extend it later – the sky is the limit). Imagine that instead of getting CVs in PDF format, we get them straight in the body of an email message. In order to extract information from emails, we need to convert the emails to text and then perform text processing. As mentioned before, natural language processing is widely researched (there are many books and papers), and many tools and SDKs are available to perform such computing. With that in mind, let’s draw up a couple of theoretical steps to achieve our goal.

First of all, we have to convert emails to text. One of our initial assumptions was that we already exchange emails in text form, so we just have to extract the core content and we’re good to go. There is a possibility that instead we’ll get a CV in PDF or another format. In that case we’ll have to use some transformation tools. For PDF there are lots of libraries and tools available, but it’s also possible that we’ll have to use more advanced techniques such as optical character recognition.
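Extracting the core text of an email can be done entirely with Python’s standard library. Below is a minimal sketch; the sample message is made up for illustration:

```python
# A minimal sketch of extracting plain text from a raw email using only
# Python's standard library; the sample message here is made up.
from email import message_from_string

raw = """\
From: candidate@example.com
To: hr@example.com
Subject: Application
Content-Type: text/plain

I have 5 years of experience with Java and SQL.
"""

msg = message_from_string(raw)

def extract_text(msg):
    """Return the concatenated text/plain parts of a message."""
    if msg.is_multipart():
        parts = [p.get_payload() for p in msg.walk()
                 if p.get_content_type() == "text/plain"]
        return "\n".join(parts)
    return msg.get_payload()

print(extract_text(msg))  # the body, ready for preprocessing
```

Real-world messages (HTML bodies, attachments, encodings) need more care, but the idea stays the same: strip the transport wrapper and keep the text.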

Preprocess the text

The next step is the actual preprocessing of the text documents. This is a really important part of building our database of candidates: we are preparing something the computer will understand. Here are some steps taken in the process of text preprocessing:

1. Chunking (sentence segmentation) – extracting sentences from the text.
2. Tokenizing – breaking up a stream of text into words (so called: tokens).
3. Stemming/Lemmatization – reducing words to their base form in order to be analysed as a single item.
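The three steps above can be sketched in a few lines of toy Python. This is deliberately naive (the regex sentence splitter and the suffix-stripping “stemmer” are stand-ins for what a real library such as NLTK, spaCy or Stanford CoreNLP would do):

```python
# A toy illustration of the three preprocessing steps using only the
# standard library; real projects would use an NLP library instead.
import re

text = "I love programming. I have used Python and Java in many projects."

# 1. Sentence extraction: naive split on sentence-ending punctuation.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# 2. Tokenizing: break each sentence into lowercase word tokens.
tokens = [t for s in sentences for t in re.findall(r"[A-Za-z+#]+", s.lower())]

# 3. Stemming: a crude suffix-stripping rule (a stand-in for Porter stemming).
def stem(token):
    for suffix in ("ming", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

stems = [stem(t) for t in tokens]
print(stems)
```

Each stage reduces variation, so that “programming”, “programs” and “program” all end up counted as the same item.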

It may look scary, and if you had to code it from scratch every time… well, we’ve discussed that already. It’s much better to use battle-tested software, and fortunately there are quite a few solutions on the market for you to grab, use and get the results you need. Below are just a couple of examples of software capable of such preprocessing:

• TextRazor has a very impressive demo of its capabilities. It supports multiple languages, including German. You can use it either as a REST API or through one of its SDKs (available in Java, Python and PHP). What’s really important, you can extract semantic metadata using custom rules.
• Stanford NLP Software includes a tokenizer, CoreNLP and a classifier. It is produced at Stanford, home to some of the best NLP computer scientists in the world. One disadvantage we can point out here is that this set of tools fully supports only English; other languages, such as Arabic, Chinese, French, German and Spanish, have only partial support.

There are more of these, such as the tools from the Cognitive Computation Group or Apache’s OpenNLP, so as you can see, you can easily choose from a wide range of available software, test it and decide whether it suits your requirements or you need something else.

Structure the data

Going further, we have to construct a vector space model from the text documents we’ve collected so far. It is an algebraic way of representing documents as vectors of corresponding values, which makes it a great representation for further processing. There are many papers on how to approach this topic, but we recommend you familiarize yourself with two chapters from Speech and Language Processing by Dan Jurafsky and James H. Martin – Vector Semantics and Semantics with Dense Vectors.
This process can be divided into a couple of steps, and we’ll guide you through them so it’s easier to grasp what is actually going on. This will be the most theoretical and (from a mathematical point of view) advanced part of this article, but don’t worry: there are tools which can easily do it for you, though being conscious of what happens underneath is always beneficial.
In our case the first step will be the preparation of a word-person matrix, in which a person is represented by all the emails that refer to him or her. It is important to make sure all emails related to a particular person are concatenated into a single text document, so that we have everything in one place and get correct results. You may wonder what will be in this matrix, so let’s describe each of its elements:

• each row represents a particular person
• each column represents a word in the vocabulary (generated from all preprocessed emails)
• each cell holds the number of times a particular word (column) occurs for a particular person (row)

Now, let’s create this matrix. To do so, we split the work into three steps, the first of which is to create a matrix containing the number of times a particular word occurs in all emails related to a particular person. Yes, it’s simple counting, so assume this is the sample matrix we got from processing our correspondence with three people:

PHP C++ C# Java JavaScript Python SQL
person1 800 0 50 0 440 0 50
person2 0 430 0 650 0 0 220
person3 780 0 0 0 230 710 0

Table 1. Matrix containing the number of times a particular word occurs in all emails related to a particular person.
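This counting step can be sketched as follows. The skill vocabulary and the toy token lists are made-up examples (in practice the tokens come from the preprocessing step above):

```python
# A sketch of building the word-person count matrix from preprocessed
# emails; the emails and skill vocabulary here are made-up examples.
from collections import Counter

vocabulary = ["php", "c++", "c#", "java", "javascript", "python", "sql"]

# All emails related to one person are concatenated into one token list.
emails_by_person = {
    "person1": "php javascript sql php".split(),
    "person2": "java c++ sql java".split(),
    "person3": "php python javascript".split(),
}

counts = {
    person: Counter(t for t in tokens if t in vocabulary)
    for person, tokens in emails_by_person.items()
}

# Each row is a person, each column a word from the vocabulary (as in Table 1).
matrix = {p: [counts[p][w] for w in vocabulary] for p in counts}
print(matrix["person1"])  # [2, 0, 0, 0, 1, 0, 1]
```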

The next step is even simpler: we have to count all occurrences of nouns in those emails, so we can present the result as a vector:

nouns
person1 10000
person2 20000
person3 40000

Table 2. Vector containing the number of all nouns in all emails related to a particular person.

Last but not least, we have to calculate the initial matrix we need for further processing. As you may have already figured out, we divide the number of times a particular word occurs in all emails related to a particular person by the number of all nouns in the emails related to that person. Let’s see what it looks like in visual form:

PHP C++ C# Java JavaScript Python SQL
person1 0.08 0 0.005 0 0.044 0 0.005
person2 0 0.0215 0 0.0325 0 0 0.011
person3 0.0195 0 0 0 0.00575 0.01775 0

Table 3. Initial matrix. The value of a cell is the number of times a particular word occurs in all emails related to a particular person, divided by the number of all nouns in the emails related to that person.
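This division reproduces Table 3 directly from the numbers in Tables 1 and 2:

```python
# Reproducing Table 3 from Tables 1 and 2: each word count is divided by
# the total number of nouns in that person's emails.
vocabulary = ["PHP", "C++", "C#", "Java", "JavaScript", "Python", "SQL"]

word_counts = {                       # Table 1
    "person1": [800, 0, 50, 0, 440, 0, 50],
    "person2": [0, 430, 0, 650, 0, 0, 220],
    "person3": [780, 0, 0, 0, 230, 710, 0],
}
noun_counts = {"person1": 10000, "person2": 20000, "person3": 40000}  # Table 2

initial_matrix = {
    person: [c / noun_counts[person] for c in counts]
    for person, counts in word_counts.items()
}
print(initial_matrix["person1"])  # [0.08, 0.0, 0.005, 0.0, 0.044, 0.0, 0.005]
```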

The first, and very important, step of constructing the vector space model is now behind us, but you’ve probably noticed that the values in the resulting matrix don’t really reflect anything meaningful yet. It looks as if exchanging more emails decreases your chances of being considered a good developer in a specific field. Why is that? Because the values haven’t been normalized yet. We have to change the figures in our matrix from absolute values to relative ones, and we have a couple of ways to do it. Two of them are PMI and PPMI:

$$PMI(w, p) = \log_{2}\frac{P(w,p)}{P(w)\times P(p)}$$ $$PPMI(w, p) = \max\left(\log_{2}\frac{P(w,p)}{P(w)\times P(p)}, 0\right)$$

where:

• $$w$$ – word
• $$p$$  – person
• $$P(w, p)$$ – value of the cell in the w-th column and p-th row divided by the sum of all values in the matrix
• $$P(w)$$ – sum of the values in the w-th column divided by the sum of all values in the matrix
• $$P(p)$$ – sum of the values in the p-th row divided by the sum of all values in the matrix

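These probabilities are straightforward to compute from a count matrix. A small sketch of PPMI (the tiny 2×3 matrix here is a made-up example):

```python
# A small sketch of PPMI normalization on a count matrix: each cell, row
# and column is turned into a probability and compared via a base-2 log.
from math import log2

counts = [
    [2, 0, 1],   # person1
    [0, 3, 1],   # person2
]
total = sum(sum(row) for row in counts)
row_sums = [sum(row) for row in counts]
col_sums = [sum(col) for col in zip(*counts)]

def ppmi(p, w):
    p_wp = counts[p][w] / total          # P(w, p)
    if p_wp == 0:
        return 0.0
    p_p = row_sums[p] / total            # P(p)
    p_w = col_sums[w] / total            # P(w)
    return max(log2(p_wp / (p_w * p_p)), 0.0)

ppmi_matrix = [[ppmi(p, w) for w in range(len(col_sums))]
               for p in range(len(counts))]
print(ppmi_matrix)
```

Note how a cell whose value is merely “expected by chance” (or below) is clamped to 0 by PPMI, which is exactly the relative-vs-absolute correction we were after.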
We can also apply another numerical statistic called term frequency–inverse document frequency (tf-idf for short), and this will be our method of choice for our example, so below you will see the steps necessary to calculate our normalized matrix.
Let’s start with the formula for calculating tf-idf:

$$tfidf(w, p) = tf_{wp}\times idf_{w}$$

where:

• $$tf_{wp}$$ (term frequency) – the share of a particular word’s occurrences attributed to a particular person = the value of the cell in the p-th row and w-th column divided by the sum of the values in the w-th column.
PHP C++ C# Java JavaScript Python SQL
person1 0.804 0 1 0 0.884 0 0.312
person2 0 1 0 1 0 0 0.688
person3 0.196 0 0 0 0.116 1 0

Table 4. Term frequency matrix: values from Table 3 normalized so that for each skill the sum over all people equals 1.

• $$idf_{w}$$ (inverse document frequency) $$= \log_{2}(\frac{N}{df_{w}})$$
• $$N$$ – total number of people in the collection
• $$df_{w}$$ – number of people for whom the given word “w” occurs

Let’s consider an example: PHP occurs for two out of three people, so $$idf_{PHP}=\log_{2}\frac{3}{2}$$. Calculating the whole inverse document frequency vector gives:

$$idf_{w}$$
PHP 0.5849625007
C++ 1.5849625007
C# 1.5849625007
Java 1.5849625007
JavaScript 0.5849625007
Python 1.5849625007
SQL 0.5849625007

Table 5. Inverse document frequency vector
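The idf values in Table 5 can be reproduced in a couple of lines (note the base-2 logarithm, which matches the figures in the table):

```python
# Reproducing the idf vector from Table 5: N = 3 people, and df_w is the
# number of people whose emails mention word w at least once.
from math import log2

vocabulary = ["PHP", "C++", "C#", "Java", "JavaScript", "Python", "SQL"]
df = {"PHP": 2, "C++": 1, "C#": 1, "Java": 1, "JavaScript": 2,
      "Python": 1, "SQL": 2}
N = 3

idf = {w: log2(N / df[w]) for w in vocabulary}
print(round(idf["PHP"], 10))   # 0.5849625007
print(round(idf["Java"], 10))  # 1.5849625007
```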

With this vector we can proceed to calculate the whole normalized matrix, and as a result we should get:

PHP C++ C# Java JavaScript Python SQL
person1 0.470 0 1.585 0 0.517 0 0.183
person2 0 1.585 0 1.585 0 0 0.402
person3 0.115 0 0 0 0.068 1.585 0

Table 6. “tf-idf” normalized matrix: each element of the matrix from Table 4 is multiplied by the corresponding element of the vector from Table 5.
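The multiplication itself is elementwise, term frequency times the idf of that term’s word:

```python
# A sketch of computing the tf-idf matrix: each term-frequency value
# (Table 4) is multiplied by the idf of its word (Table 5).
tf = {                                   # Table 4, rows are people
    "person1": [0.804, 0, 1, 0, 0.884, 0, 0.312],
    "person2": [0, 1, 0, 1, 0, 0, 0.688],
    "person3": [0.196, 0, 0, 0, 0.116, 1, 0],
}
idf = [0.5849625007, 1.5849625007, 1.5849625007, 1.5849625007,
       0.5849625007, 1.5849625007, 0.5849625007]        # Table 5

tfidf = {p: [t * i for t, i in zip(row, idf)] for p, row in tf.items()}
print([round(v, 3) for v in tfidf["person2"]])
# [0.0, 1.585, 0.0, 1.585, 0.0, 0.0, 0.402]
```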

The steps you’ve seen above for normalizing your matrix aren’t hard, but the whole idea of this article is to show that you can achieve your goal simply by using existing software, without worrying about all the computation. For this step we can use tools such as Weka or Apache Lucene Core.

Profit!

Looks like we’re getting closer and closer to what we actually want to achieve. The next step is to create a proper query, so we have to formalize the skills we’re looking for in our future candidate, and then find similar vectors. You probably know that vectors are similar when the angle between them is small (cos close to 1) and different when the angle equals 90°, in other words, when cos = 0. Putting this into our context, person A (represented by vector A) is a better candidate than person B (represented by vector B) when the cosine between vector A and the query vector is greater than the cosine between vector B and the query vector. Let’s say we’re looking for a candidate who knows Java, SQL and JavaScript. Our first step is to translate this query into a vector:

Candidate
PHP 0
C++ 0
C# 0
Java 1
JavaScript 1
Python 0
SQL 1

Table 7. Query vector

Next, we simply compute the dot product (or cosine) of the query vector and each person’s vector, so we get the final scores:

person1 person2 person3
Score 0.700 1.987 0.068

Table 8. Final score

Last but not least, we sort the final scores to get an ordered list of candidates who fulfill the given criteria.
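The scoring and sorting can be sketched as follows; the tf-idf rows here are the term-frequency values from Table 4 multiplied by the idf values from Table 5, rounded for readability, and the query is the vector from Table 7:

```python
# A sketch of scoring candidates: the dot product of each person's
# tf-idf vector with the query vector, then sorting by score.
tfidf = {
    "person1": [0.470, 0, 1.585, 0, 0.517, 0, 0.183],
    "person2": [0, 1.585, 0, 1.585, 0, 0, 0.402],
    "person3": [0.115, 0, 0, 0, 0.068, 1.585, 0],
}
query = [0, 0, 0, 1, 1, 0, 1]   # Java, JavaScript, SQL (Table 7)

scores = {p: sum(a * b for a, b in zip(vec, query))
          for p, vec in tfidf.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['person2', 'person1', 'person3']
```

The resulting scores (about 0.700, 1.987 and 0.068) match Table 8, with person2 the clear winner for a Java/SQL/JavaScript role.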

That’s it! We’ve got our final results. Of course, there are a couple of things to note and consider while looking into this sort of solution, just to name a few.
The email-to-person mapping is assumed to be one-to-many, because an email has at least one sender and one recipient, hence at least a 1-to-2 mapping (not to mention Ccs and Bccs).
You may get slightly different results (perhaps more accurate in your case) if you prioritize people based on which list (‘From’, ‘To’, ‘Cc’, ‘Bcc’) they belong to (there is no prioritization in the current solution). Weights can be real numbers ranging from 0 to 1, for example:

• “From”, “To” – 1
• “Cc” – 0.5
• “Bcc” –  0.25
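Applying such weights could look like the hypothetical sketch below, where each word-occurrence count is scaled by the header field that links the email to the person (the `weighted_count` helper and its input format are inventions for illustration):

```python
# A hypothetical sketch of weighting word occurrences by which header
# field links the email to the person; weights follow the list above.
WEIGHTS = {"From": 1.0, "To": 1.0, "Cc": 0.5, "Bcc": 0.25}

def weighted_count(occurrences):
    """occurrences: list of (count, header_field) pairs for one person."""
    return sum(count * WEIGHTS[field] for count, field in occurrences)

# e.g. "java" appeared 10 times in mails where the person was in 'To'
# and 4 times in mails where they were only Cc'd:
print(weighted_count([(10, "To"), (4, "Cc")]))  # 12.0
```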

The example presented in this article is really small, so the operations on the matrix were fast and easy to calculate, but in real life the list of candidates and technologies may grow drastically, giving you a hard time managing it. In fact, you should consider dimensionality reduction, and you have a couple of options, so let’s have a brief look at them:

• SVD (singular value decomposition) – seems to be the best solution, as it is based on pure vector/matrix operations.
• Skip-gram and CBOW (neural network approaches) – although these are dimensionality reduction approaches, they don’t seem to be the proper ones here. These techniques are used to evaluate whether one word is similar to another, based on the fact that words with similar meanings frequently occur near each other in text. Neural networks are used to predict the neighbourhood of a given word, in contrast to our case, where we rely on absolute values (the number of times a particular word occurs in text).
• Brown clustering – it doesn’t seem to be the proper one either, because it clusters words based on associations between preceding and following words.
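A brief sketch of the SVD option, assuming NumPy is available; the matrix is the small example from Table 3, truncated to its two strongest dimensions:

```python
# Rank-k truncated SVD for dimensionality reduction: each person is
# projected from 7 skill dimensions down to k latent dimensions.
import numpy as np

A = np.array([[0.08, 0.0, 0.005, 0.0, 0.044, 0.0, 0.005],
              [0.0, 0.0215, 0.0, 0.0325, 0.0, 0.0, 0.011],
              [0.0195, 0.0, 0.0, 0.0, 0.00575, 0.01775, 0.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                  # keep the 2 strongest dimensions
A_reduced = U[:, :k] * s[:k]           # each person as a k-dim vector
print(A_reduced.shape)  # (3, 2)
```

With thousands of candidates and skills, comparing these short vectors is far cheaper than comparing the full rows, at the cost of a small, controlled approximation error.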

I hope this article gets things moving in your brain. I hope you see the potential for AI and NLP in almost every aspect of our work, and that you see it not as a replacement for humans, but as a complementary tool for boosting our productivity and delivering the best results in a short amount of time. Of course, this article only scratches the surface of a big and complex topic, so I’m more than happy to have given you this general notion of what’s going on and how you can apply it in your company.