NLP for Connecting Data Sources

How ING Merchant Bank connects the dots

  • ING Merchant Bank wanted to match the different names data providers give to the same companies.
  • TF-IDF is a useful NLP concept, telling a machine that some words are more important than others.
  • The bank used TF-IDF to spot company-name similarity.

Financial-data providers don't agree on company names

Banks and asset managers purchase data from many different information providers to track the financial performance of millions of companies.

This data may include stock prices, financial reporting data, corporate actions and analyst coverage reports.

However, no single company identifier is shared consistently across the different source systems. As a result, connecting data sets can be difficult, since a company’s name can differ from one data set to the next.

The figure below illustrates the name-matching problem: The Ground Truth table features the company names, as might be used internally. The other tables show how data providers might name the same companies.

Financial services firms are tasked with building systems that correctly match different data sets. With millions of different company names to match across sources, this problem can become acute.

The name-matching problem
Source: ING Merchant Bank, FinText 

NLP Concept: TF-IDF

A term-weighting scheme is a way of helping a machine understand a text’s meaning, by telling it to give some terms more importance than others.

For example, ‘The’ appears often in English, and doesn’t add much meaning. Conversely, rarer words are often suggestive of a text’s topic (for example, ‘inflation’).

But the relative rarity of a word in a document depends on context. ‘Inflation’ is much more common in economic texts than in children’s books.

That’s why many term-weighting schemes take into account how important a word is to a document, relative to its importance in a wider collection of documents. (The collection is also known as the corpus.)

Term Frequency–Inverse Document Frequency (TF-IDF) is a popular scheme of this kind. It scores a word by how often it appears in a document, discounted by how many documents in the wider collection contain it.

Words that are common in both the document and the corpus receive a low weight. Words that appear in the document but are rare in the corpus receive a high weight.
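
The source does not say which TF-IDF variant was used. As an illustration only, one common textbook formulation is shown below. The symbols are assumptions introduced here: tf(t, d) is the number of times term t appears in document d, N is the number of documents in the corpus, and df(t) is the number of documents that contain t.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}
```

Under this formulation, a term that appears in every document has df(t) = N, so its weight is log(1) = 0. A term that appears in only a few documents gets a larger weight.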

In ING’s case, each ‘document’ was simply a company name, as it arrived from one of the data sources. ‘Rabobank N.V’, for example, is a document containing two tokens.

But the token ‘N.V’ appears in many other company names. By using a term-weighting scheme like TF-IDF, its relative importance diminishes. Among the millions of company names, the token ‘Rabobank’ is not that common; its relative importance will be high.

Therefore, when comparing the representations of ‘Rabobank N.V’ with the Ground Truth term for Rabobank, the two will appear very similar.
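
The weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not ING's implementation: the list of names is hypothetical, the tokenizer simply lowercases and splits on whitespace, and the weight is assumed to be the token count multiplied by log(N / df).

```python
import math
from collections import Counter

# Hypothetical company names; each name is treated as one "document".
names = [
    "Rabobank N.V",
    "ING Groep N.V",
    "Philips N.V",
    "Heineken N.V",
]

def tokenize(name):
    """Lowercase a name and split it on whitespace."""
    return name.lower().split()

def idf_table(docs):
    """Map each token to log(N / df), where df counts the names containing it."""
    n_docs = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log(n_docs / count) for tok, count in df.items()}

def tfidf(name, idf):
    """Return {token: tf * idf} for one name, skipping tokens unseen in the corpus."""
    tf = Counter(tokenize(name))
    return {tok: count * idf[tok] for tok, count in tf.items() if tok in idf}

idf = idf_table(names)
weights = tfidf("Rabobank N.V", idf)
print(weights)
```

In this toy corpus the token 'n.v' occurs in every name, so its weight drops to zero, while 'rabobank' occurs once and keeps a positive weight.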

How ING Merchant Bank solved the problem

ING Merchant Bank solves the problem using a technique that approximates name similarity. It represents company names with numbers, in a way that encodes all the different tokens a company name might have. (See the NLP Concept in the Barings Asset Management case study.)

Names in the source data sets are encoded based on the tokens they have. For example, ‘RBS LLC’ has two tokens: RBS and LLC.

Names in the Ground Truth table are encoded with both the tokens they have and with ones they might have. Since suffixes like LLC can appear in lots of company names, they’re less helpful for finding the right match.

Numerical representations of company names are adjusted using TF-IDF, a popular term-weighting scheme. (See NLP Concept above.)

As data arrives, names are numerically represented and compared to find the closest match in the Ground Truth representations. This delivers a matching that’s accurate. Crucially, given the ever-growing volume of data, it’s also fast to compute.
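
The matching step can be sketched as follows. This is an illustration under stated assumptions, not the bank's actual system: the Ground Truth and incoming names are hypothetical, cosine similarity is assumed as the closeness measure (the source does not name one), and the IDF values are computed over both lists combined.

```python
import math
from collections import Counter

# Hypothetical data: internal reference names and names arriving from providers.
ground_truth = ["Rabobank", "ING Groep", "Heineken"]
incoming = ["Rabobank N.V", "ING Groep N.V", "Heineken N.V"]

def tokenize(name):
    """Lowercase a name and split it on whitespace."""
    return name.lower().split()

def build_idf(docs):
    """Map each token to log(N / df) over the given names."""
    n_docs = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log(n_docs / count) for tok, count in df.items()}

def vectorize(name, idf):
    """Represent a name as a sparse {token: tf * idf} vector."""
    tf = Counter(tokenize(name))
    return {tok: count * idf.get(tok, 0.0) for tok, count in tf.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

idf = build_idf(ground_truth + incoming)
truth_vectors = {name: vectorize(name, idf) for name in ground_truth}

def best_match(name):
    """Return the Ground Truth name whose vector is closest to the incoming name."""
    query = vectorize(name, idf)
    return max(truth_vectors, key=lambda t: cosine(query, truth_vectors[t]))

matches = {name: best_match(name) for name in incoming}
print(matches)
```

This sketch compares each incoming name against every Ground Truth name in turn, which is only practical for small lists. Matching millions of names quickly would need a more efficient representation, such as sparse matrices, which the sketch does not attempt.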

How finance uses NLP