Social Media Information Extraction using NLP

This post looks at information extraction from social media using Natural Language Processing (NLP). Before getting into how information is extracted, let’s first look at why we need to extract information from social media at all. The rapid development of information technology over the last twenty years has led to a growth in the amount of data available on the World Wide Web. Social media is a newer style of exchanging and sharing that data.

Photo by Adem AY on Unsplash

Social media refers to the means of interaction among people in which they create, share, and exchange information and ideas in virtual communities and networks (like Twitter and Facebook). In many cases, social media provides more up-to-date information than traditional sources like online news. All of this data is unstructured, so a machine cannot yet understand or relate it. To make use of this huge amount of data, we need to extract structured information from these heterogeneous, unstructured sources.


There are various challenges faced in extracting useful information from Social Media platforms. Some of them are listed below.

1. References: Most social media platforms put a length limit on posts. Because of those limits, users rely on short forms to fit more information into a post. This brevity makes information extraction more challenging.

2. Informal Language: On social platforms users share opinions, images, announcements, and so on. These posts are often noisy and contain misspellings, punctuation errors, and grammatical errors. Traditional knowledge base construction relies on features such as capitalization and POS (Part-Of-Speech) tags to extract named entities. These features are often not available in social media posts, which makes the information extraction task more challenging.

3. Unwanted (noisy) Contents: Users often post random content that carries no useful information. Around 40–60% of posts are pointless, with the user just talking about some insignificant topic. Because of that, filtering is necessary to get to the useful posts.

4. Less Usable Entities: Social media is where users voice their opinions and share information about small events and festivals. The entities involved in such posts also have to be stored in the knowledge base (KB).

5. Uncertain Contents: Not all information available on social platforms is trustworthy, and it can contain other types of errors as well. The information extraction process therefore has to handle this uncertainty.

How to overcome these challenges?

To overcome the challenges of information extraction from social media, methods have been designed that target each of the issues mentioned above. Before discussing the proposed framework, we first describe the key components and aspects of the solution.

  • Noisy Text Filtering: A huge amount of data is generated on social media each day. On a typical day, more than 140 million tweets are sent by over 200 million users around the world, and these numbers are growing exponentially. To extract useful information, we need to filter out non-informative posts. Filtering can be based on domain, language, or other criteria, so that only relevant posts that contain information about the domain are processed.
  • Named Entity Extraction: Because social media lacks a formal writing style, we need new approaches to NEE that don’t rely heavily on syntactic features like capitalization and Part-Of-Speech (POS) tags. Existing approaches to named entity recognition suffer from data sparsity when applied to short and informal texts, especially user-generated social media content. Semantic augmentation is a potential way to alleviate this problem. Since rich semantic information is implicitly preserved in pre-trained word embeddings, they are potentially ideal resources for semantic augmentation.
  • Named Entity Disambiguation: This is one of the most interesting pieces of the information extraction puzzle. Named Entity Disambiguation is the task of mapping expressions of interest, such as names of people, locations, and companies, from an input text document to the corresponding unique entities in a target knowledge base. The target knowledge base depends on the application, but a vast amount of text data is available on Wikipedia. Named Entity Disambiguation systems usually don’t use Wikipedia directly; instead they use databases that contain structured versions of it, like DBpedia or Wikidata.
  • Feedback Extraction: A feedback loop runs between the FE (fact extraction) and NED (Named Entity Disambiguation) modules. This feedback helps to resolve errors made earlier in the disambiguation step.
Image reference: research paper by Mena B. Habib and Maurice van Keulen on Information Extraction for Social Media
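
To make the components above more concrete, here is a minimal sketch of noisy text filtering, named entity extraction, and named entity disambiguation chained together. It is not the implementation from the paper: the toy `KNOWLEDGE_BASE`, the function names, the length threshold, and the word-overlap scoring are all assumptions made for illustration, and the feedback loop is left out. A real system would look candidates up in DBpedia or Wikidata instead of a hard-coded dictionary.

```python
import re

# Hypothetical toy knowledge base: surface form -> candidate entities,
# each described by a few context words.
KNOWLEDGE_BASE = {
    "paris": [
        {"id": "Paris_(France)", "context": {"france", "city", "eiffel", "seine"}},
        {"id": "Paris_Hilton", "context": {"hilton", "celebrity", "hotel", "heiress"}},
    ],
    "apple": [
        {"id": "Apple_Inc.", "context": {"iphone", "mac", "company", "store"}},
        {"id": "Apple_(fruit)", "context": {"fruit", "eat", "pie", "juice"}},
    ],
}


def tokenize(post):
    """Lowercase the post and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", post.lower())


def is_informative(tokens, min_tokens=4):
    """Noisy text filter: keep posts that are long enough and mention a known name."""
    return len(tokens) >= min_tokens and any(t in KNOWLEDGE_BASE for t in tokens)


def extract_mentions(tokens):
    """Named entity extraction by dictionary lookup, so capitalization is not needed."""
    return [t for t in tokens if t in KNOWLEDGE_BASE]


def disambiguate(mention, tokens):
    """Pick the candidate whose context words overlap most with the post."""
    words = set(tokens)
    best = max(KNOWLEDGE_BASE[mention], key=lambda c: len(c["context"] & words))
    return best["id"]


def process(posts):
    """Run filter -> extraction -> disambiguation over a list of posts."""
    results = []
    for post in posts:
        tokens = tokenize(post)
        if not is_informative(tokens):
            continue  # dropped by the noisy text filter
        for mention in extract_mentions(tokens):
            results.append((mention, disambiguate(mention, tokens)))
    return results


posts = [
    "lol ok",
    "just landed in paris, the eiffel tower looks amazing",
    "new apple store opened downtown, got my iphone",
]
print(process(posts))
# [('paris', 'Paris_(France)'), ('apple', 'Apple_Inc.')]
```

The first post is discarded by the filter, and the other two are linked to the candidate whose context words appear in the post.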

NLP Models

1. Automatic Summarization: Automatic summarization is the process of reducing a text document with computer software in order to create a summary that retains the most important points of the original document. Technologies that can produce a coherent summary take into account variables such as length, writing style, and syntax. The main idea of summarization is to find a representative subset of the data that contains the information of the whole set. Generally, there are two approaches to automatic summarization: extraction and abstraction. Extraction refers to selecting a subset of existing words, phrases, or sentences in the original text to form the summary. In contrast, abstraction builds an internal semantic representation and then uses natural language generation techniques to create a summary that is closer to what a human might write. An automatic summarization system takes three basic steps, namely analysis, transformation, and realization, which are briefly described below. In analysis, the goal is a concise and fluent summary of the most important information in the input.
Fig 1. Process of Auto Summarization

This requires the ability to reorganize, modify, and merge information expressed in different sentences in the input. In transformation, an ordered text is generated by manipulating the internal representation produced by the analysis. In the realization phase, the summary text is generated using the scores from the transformation.
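
The extraction approach can be sketched in a few lines. The example below is only one simple way to do it, assuming word frequency as the scoring signal: analysis builds a frequency table, transformation scores and ranks the sentences, and realization emits the top sentences in their original order. The stop-word list and the function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def split_sentences(text):
    """Split on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lowercased words of the sentence with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]


def summarize(text, max_sentences=1):
    sentences = split_sentences(text)

    # Analysis: a word-frequency table serves as the internal representation.
    freq = Counter(w for s in sentences for w in content_words(s))

    # Transformation: score each sentence by the average frequency of its words.
    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)

    # Realization: output the best sentences in their original order.
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)


text = "Social media posts are short. Social media posts are noisy and short. Cats sleep."
print(summarize(text))
# Social media posts are short.
```

An abstractive summarizer would instead generate new sentences, which needs a language generation model and is beyond a short sketch.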

2. Chunking: Chunking is the basic technique used for entity detection. Like tokenization, which omits whitespace, chunking usually selects a subset of the tokens, and the pieces it forms do not overlap in the source text. Sometimes it is easier to describe what should be excluded from a chunk. A chink can be defined as a sequence of tokens that is not in a chunk, and removing a sequence of tokens from a chunk is called chinking. If the matching sequence of tokens spans a whole chunk, the entire chunk is removed. If it appears in the middle of the chunk, the tokens are removed, leaving two chunks where there was only one before. If the sequence is at the edge of the chunk, a smaller chunk remains.
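
The three chinking cases can be illustrated with a small sketch that works on already tagged tokens. This is a plain-Python illustration rather than a library API: the functions `chunk` and `chink` are hypothetical, and the tags (`DT`, `JJ`, `NN`, `VBD`, `IN`) are assumed to follow the Penn Treebank convention.

```python
def chunk(tagged, chunk_tags):
    """Group maximal runs of tokens whose tag is in chunk_tags into chunks."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag in chunk_tags:
            current.append((word, tag))
        elif current:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks


def chink(chunks, chink_tags):
    """Remove tokens whose tag is in chink_tags from every chunk.

    A chunk made only of chinked tokens disappears, a chink in the middle
    splits the chunk in two, and a chink at the edge leaves a smaller chunk.
    """
    result = []
    for piece in chunks:
        current = []
        for word, tag in piece:
            if tag in chink_tags:
                if current:
                    result.append(current)
                current = []
            else:
                current.append((word, tag))
        if current:
            result.append(current)
    return result


tagged = [("the", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD"),
          ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# Start with one chunk that covers the whole sentence, then chink out VBD and IN.
whole = chunk(tagged, {"DT", "JJ", "NN", "VBD", "IN"})
pieces = chink(whole, {"VBD", "IN"})
print([[word for word, _ in piece] for piece in pieces])
# [['the', 'little', 'dog'], ['the', 'cat']]
```

Here the chink sits in the middle of the chunk, so one chunk becomes two.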

Fig 2. Process of Text Mining

3. Parts-of-Speech Tagging: A part-of-speech tagger is a piece of software that reads text in some language and assigns a part of speech to each word, such as noun, verb, or adjective, to name a few. Computational applications generally use more fine-grained part-of-speech tags, such as ‘noun-plural’. Dictionaries list one or more categories for a given word, which means that a word may belong to more than one category. For example, ‘run’ is both a noun and a verb. Taggers use probabilistic information to resolve this ambiguity.
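
As a rough sketch of how probabilistic information can resolve the ‘run’ ambiguity, the tagger below counts which tag each word received after each previous tag in a handful of hand-tagged sentences, then picks the most frequent one. The training data, the tag names, and the functions `train` and `tag` are all made up for illustration; real taggers are trained on large annotated corpora and use richer models.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged training sentences, far too small for real use.
TRAINING = [
    [("i", "PRON"), ("run", "VERB"), ("daily", "ADV")],
    [("they", "PRON"), ("run", "VERB"), ("fast", "ADV")],
    [("a", "DET"), ("run", "NOUN"), ("helps", "VERB")],
    [("the", "DET"), ("run", "NOUN"), ("ended", "VERB")],
    [("we", "PRON"), ("run", "VERB")],
]


def train(sentences):
    """Count tags per (previous tag, word) pair and per word alone."""
    by_context = defaultdict(Counter)
    by_word = defaultdict(Counter)
    for sentence in sentences:
        prev = "<START>"
        for word, tag_ in sentence:
            by_context[(prev, word)][tag_] += 1
            by_word[word][tag_] += 1
            prev = tag_
    return by_context, by_word


def tag(words, model, default="NOUN"):
    """Tag left to right, preferring counts that include the previous tag."""
    by_context, by_word = model
    tagged, prev = [], "<START>"
    for word in words:
        word = word.lower()
        if (prev, word) in by_context:
            best = by_context[(prev, word)].most_common(1)[0][0]
        elif word in by_word:
            best = by_word[word].most_common(1)[0][0]  # back off to the word alone
        else:
            best = default  # unseen word
        tagged.append((word, best))
        prev = best
    return tagged


model = train(TRAINING)
print(tag(["the", "run"], model))  # [('the', 'DET'), ('run', 'NOUN')]
print(tag(["we", "run"], model))   # [('we', 'PRON'), ('run', 'VERB')]
```

The same word gets a different tag depending on what precedes it, which is the ambiguity the counts are meant to settle.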

4. Word Sense Disambiguation: This is an open problem in NLP and ontology: identifying the correct sense of a word in a sentence when the word has multiple meanings. It is easy for a human to understand the meaning of a word based on background knowledge of the subject. However, identifying the sense of a word is difficult for a machine. This technique provides a mechanism to reduce the ambiguity of words in the text. For example, WordNet is a free lexical database of English that contains a large collection of words and senses.
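
One well-known way to pick a sense is the simplified Lesk idea: choose the sense whose dictionary gloss shares the most words with the sentence. The sketch below assumes a tiny hand-written sense inventory in place of WordNet, so the `SENSES` dictionary, its sense labels, and the stop-word list are hypothetical.

```python
import re

# Hypothetical mini sense inventory; WordNet would supply real glosses.
SENSES = {
    "bank": {
        "bank.financial": "an institution that accepts deposits and lends money",
        "bank.river": "sloping land beside a river or a lake",
    },
}
STOP_WORDS = {"a", "an", "and", "at", "by", "i", "in", "of", "or", "that", "the", "to"}


def words_of(text):
    """Set of lowercased words in the text, without stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}


def simplified_lesk(word, sentence):
    """Return the sense whose gloss shares the most words with the sentence."""
    context = words_of(sentence) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & words_of(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(simplified_lesk("bank", "I sat on the bank of the river"))
# bank.river
print(simplified_lesk("bank", "The bank lends money to small businesses"))
# bank.financial
```

With short, noisy posts the overlap is often zero, which is one reason word sense disambiguation remains hard on social media text.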

5. Sentiment Analysis: Sentiment analysis is an NLP process that identifies, extracts, and quantifies the attitude of a user toward the information that the user provides in free-form text. A piece of text can show several sentiments, which may be positive, negative, or neutral. Sentiment analysis is commonly used in processing survey forms, online reviews, and social media monitoring. It returns the identified sentiment with a numeric score from 1.0 to -1.0, where 1.0 means strongly positive and -1.0 means strongly negative. A practical application would be a typical e-commerce website. Popular or ‘Top Rated’ products are likely to attract thousands of reviews, and this can make it difficult for prospective buyers to find the relevant reviews that would help them make a decision. Sellers use sentiment analysis to identify relevant reviews and to ignore misleading ones.
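
A score in the range 1.0 to -1.0 can be produced in many ways. The sketch below assumes the simplest one, a polarity lexicon: each known word carries a value between -1.0 and 1.0, a preceding negation flips the sign, and the result is the average. The `LEXICON` entries, the negation list, and the threshold in `label` are invented for illustration and are not taken from any particular tool.

```python
import re

# Hypothetical polarity lexicon; real systems use much larger resources.
LEXICON = {
    "great": 1.0, "love": 1.0, "good": 0.5, "fine": 0.25,
    "bad": -0.5, "poor": -0.5, "terrible": -1.0, "hate": -1.0,
}
NEGATIONS = {"not", "never", "no"}


def sentiment_score(text):
    """Average the polarity of known words, flipping the sign after a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = []
    for i, token in enumerate(tokens):
        if token in LEXICON:
            value = LEXICON[token]
            if i > 0 and tokens[i - 1] in NEGATIONS:
                value = -value
            scores.append(value)
    # Text without any known word is treated as neutral.
    return sum(scores) / len(scores) if scores else 0.0


def label(score, threshold=0.1):
    """Map a numeric score to positive, negative, or neutral."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


for review in ["I love this phone, great battery",
               "Terrible screen, I hate it",
               "Arrived on Tuesday"]:
    score = sentiment_score(review)
    print(review, "->", score, label(score))
# I love this phone, great battery -> 1.0 positive
# Terrible screen, I hate it -> -1.0 negative
# Arrived on Tuesday -> 0.0 neutral
```

Because every lexicon value lies between -1.0 and 1.0, the average always stays in that range as well.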


In this article, we saw that social media has become a major part of how people share their thoughts and exchange data. Information extraction from social media is an emerging field. We discussed how this social media data can be put to use and how different NLP models help extract it. Extracting this data comes with many challenges, such as references and unwanted data, which make it difficult to extract exactly the data we want. To overcome these problems, we described a framework for extracting data using NLP. The proposed framework relies on noisy text filtering, which removes unwanted or non-informative data so that only the required data is processed. The other components of this framework are named entity extraction, named entity disambiguation, and feedback extraction, all of which are discussed above.
