Short story on scaling an NLP problem without using a ton of hardware.

The cornerstone of the small project that gathered the data for these charts with IEPY was being able to catch every mention of a company.

The basic idea of relation extraction is to detect the things mentioned in a text (so-called Mentions, or Entity-Occurrences) and then decide, for each pair of those things, whether the text expresses the target relation between them. In our case, we needed to find where companies were mentioned and then determine whether a given sentence said that Company-A was funding Company-B.
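
To make the "each pair" step concrete, here is a minimal sketch in plain Python. It is not IEPY's API: the `Mention` type and the `candidate_pairs` helper are hypothetical names used only for illustration. Because funding is directional (A funds B is not the same as B funds A), the sketch enumerates ordered pairs.

```python
from itertools import permutations
from typing import List, NamedTuple, Tuple


class Mention(NamedTuple):
    """Hypothetical record for one company mention inside a sentence."""
    text: str
    start: int  # index of the first token of the mention
    end: int    # index one past the last token of the mention


def candidate_pairs(mentions: List[Mention]) -> List[Tuple[Mention, Mention]]:
    """Return every ordered (funder, funded) candidate pair in a sentence.

    Each pair is only a candidate: a later step still has to decide
    whether the sentence really expresses the funding relation.
    """
    return list(permutations(mentions, 2))


sentence_mentions = [
    Mention("Company-A", 0, 1),
    Mention("Company-B", 3, 4),
    Mention("Company-C", 6, 7),
]
pairs = candidate_pairs(sentence_mentions)
```

A sentence with n mentions yields n * (n - 1) candidates, so a missed mention removes every pair it would have taken part in. That is why capturing all mentions matters so much.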

To detect those funding relations, we needed to be sure we captured every mention of a company. Although the NER we used caught most of them, there are always some folks who name their company #waywire or 8th Story, names that a NER cannot easily recognize.

A good solution is to build a Gazetteer containing all the company names we can get. The idea behind a Gazetteer is that each time one of its entries is seen in a text, it is automatically considered a mention of a given object, i.e., an Entity-Occurrence.
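
As a rough illustration of Gazetteer matching, the sketch below indexes entries by their first token and scans a tokenized text for the longest entry starting at each position. This is an assumption-laden toy, not how IEPY implements it: `build_gazetteer` and `find_mentions` are hypothetical helpers, tokenization is a plain whitespace split, and matching is exact and case-sensitive.

```python
from typing import Dict, List, Set, Tuple

Entry = Tuple[str, ...]


def build_gazetteer(names: List[str]) -> Dict[str, Set[Entry]]:
    """Index each entry (as a tuple of tokens) under its first token."""
    index: Dict[str, Set[Entry]] = {}
    for name in names:
        tokens = tuple(name.split())
        if tokens:
            index.setdefault(tokens[0], set()).add(tokens)
    return index


def find_mentions(tokens: List[str],
                  index: Dict[str, Set[Entry]]) -> List[Tuple[int, int, str]]:
    """Return (start, end, name) for the longest entry found at each position."""
    mentions = []
    i = 0
    while i < len(tokens):
        best: Entry = ()
        # Only entries sharing the current first token need to be checked.
        for candidate in index.get(tokens[i], ()):
            n = len(candidate)
            if n > len(best) and tuple(tokens[i:i + n]) == candidate:
                best = candidate
        if best:
            mentions.append((i, i + len(best), " ".join(best)))
            i += len(best)  # skip past the matched entry
        else:
            i += 1
    return mentions


gazetteer = build_gazetteer(["#waywire", "8th Story", "Yiftee"])
text_tokens = "Investors backed #waywire and 8th Story last year".split()
found = find_mentions(text_tokens, gazetteer)
```

Keying the index by first token means each text position is compared only against the entries that could start there, rather than against the whole list.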

From an encyclopedic source, we got more than 300K entries. Great!

The next challenge was that… well, in the text to process, a company could be mentioned in a different way than the official name stated in the encyclopedic source. For instance, it would be more natural to find mentions of “Yiftee” than of “Yiftee Inc.”

So, after incorporating a basic scheme for alternative names (i.e., substrings of the original long name), the number of entries grew to 600K.
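
The exact scheme is not spelled out here, so the following is only one possible reading of "substrings of the original long name": take the leading-token prefixes of the official name. The `alternative_names` helper and its `min_chars` parameter are hypothetical.

```python
from typing import List


def alternative_names(official_name: str, min_chars: int = 3) -> List[str]:
    """Return the official name plus its shorter leading-token prefixes.

    Assumption: an alias is formed by dropping trailing tokens, so
    "Yiftee Inc." also yields "Yiftee". Prefixes shorter than
    `min_chars` characters are skipped as too ambiguous.
    """
    tokens = official_name.split()
    variants: List[str] = []
    for n in range(len(tokens), 0, -1):
        # Strip a trailing comma left behind when a suffix is dropped.
        candidate = " ".join(tokens[:n]).rstrip(",")
        if len(candidate) >= min_chars and candidate not in variants:
            variants.append(candidate)
    return variants


yiftee_variants = alternative_names("Yiftee Inc.")
story_variants = alternative_names("8th Story")
```

With this scheme most multi-word names contribute at least one extra entry, which would be in line with the list roughly doubling. It also shows the risk: "8th Story" produces the alias "8th", which would match far more than the company, so a real scheme needs some filtering of overly generic aliases.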
