Text Mining for Big Data

Leverage Text Mining to Discover Customer Needs and Wants

David Haertzen - November 1, 2024
Text mining methods are techniques that can turn unstructured data like emails, tweets and recordings into actionable insights. The knowledge gained can be used to identify opportunities and serve customers, as well as to manage risks such as cybercrime. Examples of text mining use cases that capitalize on opportunities include:

  • Customer Experience: Obtain knowledge about customers from diverse sources such as emails, surveys and calls to provide automated responses and to identify opportunities and issues.
  • Contextual Advertising: Target advertising to specific customers based on analysis of text.
  • Business Intelligence: Answer specific business questions by scanning and analyzing thousands of documents.
  • Knowledge Management: Gain value from huge amounts of information in areas like product research and clinical patient data.
  • Content Enrichment: Add value to content by organizing, summarizing and tagging.
  • Social Media Analysis: Scan large data volumes to gather opinions, sentiments and intentions relating to organization reputation, brands and offerings.

Examples of text mining use cases that address risks and losses include:

  • Cybercrime Detection: Detect malicious threats such as ransomware and identity theft using machine learning to identify likely malware.
  • Machine Learning: Identify trends and improve predictions through experience.
  • Fraud Detection: Identify potential fraudulent activity such as insurance claim fraud through analysis of unstructured data.
  • Risk Management: Scan thousands of documents to find patterns that identify risks to be addressed.
  • Spam Filtering: Reduce the volume of spam through better filtering tuned through machine learning.

How can we take advantage of these use cases? One way is to use the Term Frequency-Inverse Document Frequency (TF-IDF) method to quantify the strength of the words that make up documents, based on the relative frequency of those words. The flow of this process is illustrated in the following diagram.

There are five major steps to this process:

  1. Gather Text: Read in the body of text (corpus) from sources such as: emails, reports, tweets, comments and notes which may be stored as separate files or as fields in a database.
  2. Preprocess Text: Produce a streamlined version of the text by removing punctuation, shifting to lower case, removing stop words and location words, and reducing words to their stems (stemming). Then split the text into tokens and use a representation such as “bag of words” to render the words as numbers.
  3. Apply TF-IDF Algorithm: Calculate the strength of words using the TF-IDF calculation. Term Frequency (TF) for each word in a document = count of that word in the document divided by the total number of words in the document. Inverse Document Frequency (IDF) = natural log of (total number of documents / number of documents containing the word). Finally, TF-IDF = TF * IDF.
  4. Output Structured Data File: Generate one flat file record for each input document. Each record will contain a document identifier plus a field for each word of interest. See the example structured flat file below.
  5. Apply Data Science Algorithms: The generated flat file is in a format that data science algorithms such as regression, decision trees, clustering or neural networks can use to better understand the data or predict outcomes.
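
Steps 2 through 4 can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation. It uses only the standard library, and the sample corpus, the tiny stop-word list and the function names (preprocess, tf_idf, to_flat_file) are all made up for this example. Stemming and the removal of location words are left out to keep the sketch short.

```python
import csv
import io
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "is", "was", "to", "of", "my", "in", "it"}


def preprocess(text):
    """Lowercase, strip punctuation, split into words, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def tf_idf(corpus):
    """Return one {word: score} dict per document, using TF * ln(N / df)."""
    tokenized = [preprocess(doc) for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


def to_flat_file(scores):
    """Render scores as CSV text: one row per document, one column per word."""
    vocabulary = sorted({word for doc in scores for word in doc})
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["doc_id"] + vocabulary)
    for doc_id, doc in enumerate(scores, start=1):
        writer.writerow([doc_id] + [round(doc.get(w, 0.0), 4) for w in vocabulary])
    return buffer.getvalue()


# Hypothetical three-document corpus, invented for this sketch.
corpus = [
    "The delivery was late and the package was damaged.",
    "Great service, fast delivery!",
    "My refund is late.",
]
scores = tf_idf(corpus)
print(to_flat_file(scores))
```

Each printed row is one document and each column is one word of interest, which is the shape expected by the algorithms in step 5. Words that appear in many documents, such as "delivery" and "late" here, receive lower scores than words unique to a single document. A word that appears in every document would score zero under this formula.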

In conclusion, text mining methods are available that can be used to capitalize on opportunities, reduce losses and manage risks. The TF-IDF method is one of many approaches to successful text mining and is a good example of the overall approach. Typically, multiple documents are scanned, pre-processed and then analyzed using an algorithm like TF-IDF, Keyword Association Network (KAN) or Support Vector Machines (SVM). Libraries of algorithms such as Python's scikit-learn support text processing via machine learning. I encourage you to learn more about text processing and its applications.