Python topic modeling visualization LDA and T-SNE interactive visualization

Original source: Tuoduan Data Tribe Official Account


I use Latent Dirichlet Allocation (LDA) to extract topics from a collection of papers. This tutorial walks through a natural language processing workflow: starting with raw data, then preprocessing, modeling, and visualizing the papers.

We will cover the following points

Use LDA for topic modeling
Use pyLDAvis to visualize topic models
Use t-SNE to visualize LDA results

In [1]:

from scipy import sparse as sp

Populating the interactive namespace from numpy and matplotlib

In [2]:

docs = array(p_df['PaperText'])

 Preprocess and vectorize documents

In [3]:

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

def docs_preprocessor(docs):
    tokenizer = RegexpTokenizer(r'\w+')
    for idx in range(len(docs)):
        docs[idx] = docs[idx].lower()  # Convert to lowercase.
        docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

    # Remove numbers, but keep words that contain numbers.
    docs = [[token for token in doc if not token.isdigit()] for doc in docs]

    # Remove words of three characters or fewer.
    docs = [[token for token in doc if len(token) > 3] for doc in docs]

    # Lemmatize all words in the documents.
    lemmatizer = WordNetLemmatizer()
    docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]

    return docs

In [4]:

docs = docs_preprocessor(docs)
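To see what the filtering rules do to a single sentence, here is a minimal standard-library sketch. It is not the function above: `re.findall` stands in for `RegexpTokenizer`, lemmatization is left out, and the names `preprocess`, `min_len`, and `sample` are made up for illustration.

```python
import re

def preprocess(text, min_len=4):
    """Lowercase, split on word characters, drop pure numbers and short tokens."""
    tokens = re.findall(r"\w+", text.lower())        # same pattern as RegexpTokenizer(r'\w+')
    tokens = [t for t in tokens if not t.isdigit()]  # "2017" is dropped, "word2vec" is kept
    return [t for t in tokens if len(t) >= min_len]  # same cutoff as len(token) > 3

sample = "In 2017 the word2vec model was trained on 3 GPUs."
tokens = preprocess(sample)
print(tokens)  # ['word2vec', 'model', 'trained', 'gpus']
```

Note that the length cutoff also removes common short words such as "the" and "was", so it acts as a rough stop-word filter.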

Compute bigrams and trigrams:

Because the topics are very similar, phrases distinguish them better than individual words do.

In [5]:

from gensim.models import Phrases

# Add bigrams and trigrams to the documents (only those that appear 10 times or more).
bigram = Phrases(docs, min_count=10)
trigram = Phrases(bigram[docs])

for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add it to the document.
            docs[idx].append(token)
    for token in trigram[docs[idx]]:
        if '_' in token:
            # Token is a phrase found by the trigram model, add it to the document.
            docs[idx].append(token)

Using TensorFlow backend.
/opt/conda/lib/python3.6/site-packages/gensim/models/ UserWarning: For a faster implementation, use the gensim.models.phrases.Phraser class
  warnings.warn("For a faster implementation, use the gensim.models.phrases.Phraser class")
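The idea of appending phrase tokens can be illustrated without gensim. The sketch below is a simplification: it joins adjacent words based on raw pair counts only, whereas `Phrases` also applies a scoring threshold. The function name `add_frequent_bigrams` and the toy documents are assumptions made for this example.

```python
from collections import Counter

def add_frequent_bigrams(docs, min_count=2):
    """Append 'a_b' tokens for adjacent word pairs seen at least min_count times."""
    pair_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    augmented = []
    for doc in docs:
        extra = ["_".join(p) for p in zip(doc, doc[1:]) if pair_counts[p] >= min_count]
        augmented.append(doc + extra)  # keep the single words, add the phrases
    return augmented

toy_docs = [
    ["gradient", "descent", "converges"],
    ["stochastic", "gradient", "descent"],
    ["gradient", "flow"],
]
augmented = add_frequent_bigrams(toy_docs)
print(augmented[0])  # ['gradient', 'descent', 'converges', 'gradient_descent']
```

As in the cell above, the original words stay in the document and the phrase is added alongside them, so both can contribute to a topic.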


In [6]:

from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)
print('Number of unique words in initial documents:', len(dictionary))

# Filter out words that occur in fewer than 10 documents or in more than 20% of the documents.
dictionary.filter_extremes(no_below=10, no_above=0.2)
print('Number of unique words after removing rare and common words:', len(dictionary))

Number of unique words in initial documents: 39534
Number of unique words after removing rare and common words: 6001

After removing common and rare words, we are left with only about 15% of the original vocabulary (6,001 of 39,534 words).
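The filtering rule is based on document frequency, that is, the number of documents a word appears in. A minimal sketch of that rule, assuming a word is kept when its document frequency lies between `no_below` and `no_above` times the number of documents (the helper `filter_vocab` and the toy data are hypothetical):

```python
from collections import Counter

def filter_vocab(docs, no_below=2, no_above=0.5):
    """Keep tokens found in at least no_below docs and in at most no_above of all docs."""
    doc_freq = Counter(tok for doc in docs for tok in set(doc))  # count each doc once
    max_docs = no_above * len(docs)
    return sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)

toy_docs = [
    ["model", "bandit", "regret"],
    ["model", "bandit", "kernel"],
    ["model", "tensor", "kernel"],
    ["model", "tensor", "lasso"],
]
kept = filter_vocab(toy_docs)
print(kept)  # ['bandit', 'kernel', 'tensor']
```

Here "model" is dropped because it appears in every document, while "regret" and "lasso" are dropped because they appear in only one.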

Vectorize the data:
The first step is to obtain the bag-of-words representation of each document.

In [7]:

corpus = [dictionary.doc2bow(doc) for doc in docs]

In [8]:

print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 6001
Number of documents: 403
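A bag-of-words vector is simply a list of (token id, count) pairs for each document. The following standard-library sketch shows the idea; the id assignment is an assumption of this example and may differ from the ids gensim chooses, and `build_vocab` and `to_bow` are hypothetical helpers.

```python
from collections import Counter

def build_vocab(docs):
    """Assign a consecutive integer id to every distinct token, in order of first use."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens that are not in vocab."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

toy_docs = [["topic", "model", "topic"], ["model", "kernel"]]
vocab = build_vocab(toy_docs)
bow = [to_bow(doc, vocab) for doc in toy_docs]
print(bow)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded at this point; only the counts per document are passed on to the topic model.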

With the bag-of-words corpus in hand, we can go on to learn a topic model from the documents.

Train the LDA model 

In [9]:

from gensim.models import LdaModel

In [10]:

# id2word, chunksize, iterations, num_topics, passes, and eval_every
# are set in a cell that is not shown here.
%time model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize, \
                       alpha='auto', eta='auto', \
                       iterations=iterations, num_topics=num_topics, \
                       passes=passes, eval_every=eval_every)

CPU times: user 3min 58s, sys: 348 ms, total: 3min 58s
Wall time: 3min 59s

How to choose the number of topics?

LDA is an unsupervised technique, which means that we don't know how many topics exist in our corpus before running the model. Topic coherence is one of the main techniques used to choose the number of topics.
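To give a feel for what a coherence score measures, here is a small sketch of a UMass-style coherence computed from document co-occurrence counts. It is an illustration under simplifying assumptions (every top word occurs in at least one document; `umass_coherence` and the toy data are made up), not the procedure used in this notebook. gensim also ships a `CoherenceModel` class for this purpose.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """UMass-style coherence of one topic's top words (higher is better)."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets)

    score = 0.0
    # top_words is ordered from most to least probable; condition on the earlier word.
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_count(earlier, later) + 1) / doc_count(earlier))
    return score

toy_docs = [
    ["bandit", "regret", "reward"],
    ["bandit", "regret"],
    ["kernel", "tensor"],
    ["bandit", "kernel"],
]
coherent = umass_coherence(["bandit", "regret"], toy_docs)
mixed = umass_coherence(["bandit", "tensor"], toy_docs)
print(coherent > mixed)  # True
```

One would train a model for each candidate number of topics, average the coherence over its topics, and prefer the candidate with the highest average.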

Here, however, I used the LDA visualization tool pyLDAvis, tried several topic counts, and compared the results. Four seems to be the number of topics that separates the topics best.

In [11]:

import pyLDAvis.gensim
pyLDAvis.enable_notebook()

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [12]:

pyLDAvis.gensim.prepare(model, corpus, dictionary)


What do we see here?

The left panel is labeled Intertopic Distance Map. Each circle represents a topic, and the distance between circles reflects how different the topics are: similar topics appear closer together, while dissimilar topics are farther apart. The relative size of a topic's circle corresponds to the relative frequency of that topic in the corpus.
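As far as I know, pyLDAvis by default derives these distances from the Jensen-Shannon divergence between the topics' word distributions and then projects them to two dimensions. The sketch below only shows the divergence itself on made-up distributions; `kl`, `js_divergence`, and the three toy topics are assumptions for illustration.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and 0 for identical distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # midpoint distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy word distributions over a four-word vocabulary.
topic_a = [0.5, 0.4, 0.1, 0.0]
topic_b = [0.4, 0.5, 0.1, 0.0]
topic_c = [0.0, 0.1, 0.4, 0.5]

near = js_divergence(topic_a, topic_b)
far = js_divergence(topic_a, topic_c)
print(near < far)  # True
```

Topics that put their probability mass on the same words get a small divergence and would be drawn close together on such a map.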

How to evaluate our model? 

Divide each document into two parts and see if the topics assigned to them are similar. => The more similar the better

Compare randomly selected documents with each other. => The less similar the better

In [13]:

from sklearn.metrics.pairwise import cosine_similarity

p_df['tokenz'] = docs

# Split every document into a first and a second half.
docs1 = p_df['tokenz'].apply(lambda l: l[:len(l) // 2])
docs2 = p_df['tokenz'].apply(lambda l: l[len(l) // 2:])

Convert data

In [14]:

corpus1 = [dictionary.doc2bow(doc) for doc in docs1]
corpus2 = [dictionary.doc2bow(doc) for doc in docs2]

# Transform both corpora with the LDA model.
lda_corpus1 = model[corpus1]
lda_corpus2 = model[corpus2]

In [15]:

from collections import OrderedDict

def get_doc_topic_dist(model, corpus, kwords=False):
    '''
    LDA transformation. For each document, the model returns only the topics
    with non-zero weight; this function fills in the missing topics with 0 so
    that every document gets a full vector in topic space.
    '''
    top_dist = []
    keys = []

    for d in corpus:
        tmp = {i: 0 for i in range(num_topics)}
        tmp.update(dict(model[d]))
        vals = list(OrderedDict(tmp).values())
        top_dist += [array(vals)]
        if kwords:
            keys += [array(vals).argmax()]

    return array(top_dist), keys

Intra similarity: cosine similarity for corresponding parts of a doc(higher is better): 0.906086532099
Inter similarity: cosine similarity between random parts (lower is better): 0.846485334252
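The cell that printed these two similarity figures is not included above. A minimal sketch of how such figures could be computed from two lists of topic vectors follows. It is an assumption, not the author's code: `cosine`, `intra_inter_similarity`, the `seed` parameter, and the toy vectors are all hypothetical, and the "random" pairing is done with a random rotation so that no half is compared with its own document.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two equal-length vectors with non-zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def intra_inter_similarity(first_halves, second_halves, seed=0):
    """Mean cosine similarity of matching halves vs. halves of different documents."""
    n = len(first_halves)
    intra = sum(cosine(first_halves[i], second_halves[i]) for i in range(n)) / n
    offset = random.Random(seed).randrange(1, n)  # never 0, so no document meets itself
    inter = sum(cosine(first_halves[i], second_halves[(i + offset) % n]) for i in range(n)) / n
    return intra, inter

# Toy topic distributions for the two halves of three documents.
halves1 = [[0.9, 0.1, 0.0, 0.0], [0.0, 0.8, 0.2, 0.0], [0.0, 0.0, 0.1, 0.9]]
halves2 = [[0.8, 0.2, 0.0, 0.0], [0.1, 0.7, 0.2, 0.0], [0.0, 0.1, 0.1, 0.8]]
intra, inter = intra_inter_similarity(halves1, halves2)
print(intra > inter)  # True
```

A large gap between the two values suggests that the model assigns consistent topics within a document while still telling documents apart. In the output above the gap is fairly small (0.906 versus 0.846).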

 Let's take a look at the words that appear in each topic.

In [17]:

def explore_topic(lda_model, topic_number, topn, output=True):
    """
    Print and return a list of the topn words of a topic.
    """
    terms = []
    for term, frequency in lda_model.show_topic(topic_number, topn=topn):
        terms += [term]
        if output:
            print(u'{:20} {:.3f}'.format(term, round(frequency, 3)))

    return terms

In [18]:

term                 frequency

Topic 0 |---------------------
data_set             0.006
embedding            0.004
query                0.004
document             0.003
tensor               0.003
multi_label          0.003
graphical_model      0.003
singular_value       0.003
topic_model          0.003
margin               0.003

Topic 1 |---------------------
policy               0.007
regret               0.007
bandit               0.006
reward               0.006
active_learning      0.005
agent                0.005
vertex               0.005
item                 0.005
reward_function      0.005
submodular           0.004

Topic 2 |---------------------
convolutional        0.005
generative_model     0.005
variational_inference 0.005
recurrent            0.004
gaussian_process     0.004
fully_connected      0.004
recurrent_neural     0.004
hidden_unit          0.004
deep_learning        0.004
hidden_layer         0.004

Topic 3 |---------------------
convergence_rate     0.007
step_size            0.006
matrix_completion    0.006
rank_matrix          0.005
gradient_descent     0.005
regret               0.004
sample_complexity    0.004
strongly_convex      0.004
line_search          0.003
sample_size          0.003

 From above, you can check each topic and assign an interpretable label to it. Here I mark them as follows:

In [19]:

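# Labels follow the top words printed above: topic 1 is dominated by bandit/regret terms,
# topic 2 by neural-network terms, and topic 3 by optimization terms.
top_labels = {0: 'Statistics', 1: 'Online Learning', 2: 'Deep Learning', 3: 'Numerical Analysis'}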

In [20]:

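import re
import nltk

# NOTE: the `def` line and the `return` are missing from the source; they are
# reconstructed here from the call `tokenizer=paper_to_wordlist` in the next cell.
# `stops` is assumed to be a collection of stop words defined elsewhere.
def paper_to_wordlist(paper):
    # 1. Remove non-letters
    paper_text = re.sub("[^a-zA-Z]", " ", paper)
    # 2. Convert to lowercase and split into words
    words = paper_text.lower().split()
    # 3. Remove stop words
    words = [w for w in words if w not in stops]
    # 4. Remove short words
    words = [t for t in words if len(t) > 2]
    # 5. Lemmatize
    words = [nltk.stem.WordNetLemmatizer().lemmatize(t) for t in words]
    return words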

In [21]:

from sklearn.feature_extraction.text import TfidfVectorizer

tvectorizer = TfidfVectorizer(input='content', analyzer='word', lowercase=True, stop_words='english',
                              tokenizer=paper_to_wordlist, ngram_range=(1, 3), min_df=40, max_df=0.20,
                              norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=True)

dtm = tvectorizer.fit_transform(p_df['PaperText']).toarray()
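The weighting options used here (`sublinear_tf`, `smooth_idf`, `norm='l2'`) can be written out by hand. The sketch below applies the corresponding formulas to toy documents in pure Python; it ignores n-grams, `min_df`, and `max_df`, and the helper `tfidf` is hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Sublinear-tf, smoothed-idf, L2-normalized weights for each document."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    rows = []
    for doc in docs:
        weights = {}
        for tok, count in Counter(doc).items():
            tf = 1.0 + math.log(count)                                # sublinear_tf=True
            idf = math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0  # smooth_idf=True
            weights[tok] = tf * idf
        norm = math.sqrt(sum(w * w for w in weights.values()))        # norm='l2'
        rows.append({tok: w / norm for tok, w in weights.items()})
    return rows

toy_docs = [["model", "bandit", "bandit"], ["model", "kernel"]]
weights = tfidf(toy_docs)
print(weights[0]["bandit"] > weights[0]["model"])  # True
```

Words that occur in every document, such as "model" here, get the lowest idf, so the more distinctive words end up with the largest weights in each row.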

In [22]:

top_dist = []
for d in corpus:
    tmp = {i: 0 for i in range(num_topics)}
    tmp.update(dict(model[d]))
    vals = list(OrderedDict(tmp).values())
    top_dist += [array(vals)]

In [23]:

top_dist, lda_keys = get_doc_topic_dist(model, corpus, True)
# In scikit-learn 1.0 and later, use get_feature_names_out() instead.
features = tvectorizer.get_feature_names()

In [24]:

top_ws = []
for n in range(len(dtm)):
    # Indices of the four highest-weighted terms of document n.
    inds = argsort(dtm[n])[::-1][:4]
    tmp = [features[i] for i in inds]
    top_ws += [''.join(tmp)]

cluster_colors = {0: 'blue', 1: 'green', 2: 'yellow', 3: 'red', 4: 'skyblue', 5: 'salmon',
                  6: 'orange', 7: 'maroon', 8: 'crimson', 9: 'black', 10: 'gray'}

p_df['colors'] = p_df['clusters'].apply(lambda l: cluster_colors[l])
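The `clusters` column is not created in any of the cells shown. One plausible source is the dominant topic of each document, which is what `get_doc_topic_dist` returns as its second value when `kwords=True`. A minimal sketch of that assignment, using a hypothetical helper `dominant_topics` and placeholder labels:

```python
def dominant_topics(doc_topic, labels):
    """Return (topic index, label) of the highest-weight topic for each document."""
    result = []
    for weights in doc_topic:
        best = max(range(len(weights)), key=weights.__getitem__)  # argmax; ties pick the first
        result.append((best, labels[best]))
    return result

# Placeholder labels and toy document-topic weights, for illustration only.
toy_labels = {0: "A", 1: "B", 2: "C", 3: "D"}
toy_doc_topic = [
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.80, 0.10],
    [0.25, 0.25, 0.20, 0.30],
]
assigned = dominant_topics(toy_doc_topic, toy_labels)
print(assigned)  # [(0, 'A'), (2, 'C'), (3, 'D')]
```

With such an assignment, each document gets exactly one color in the scatter plot even when its topic weights are spread fairly evenly, as in the third toy row.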

In [25]:

from sklearn.manifold import TSNE

# Project the document-topic vectors to two dimensions.
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(top_dist)

In [26]:

p_df['X_tsne'] = X_tsne[:, 0]
p_df['Y_tsne'] = X_tsne[:, 1]

In [27]:

from bokeh.plotting import figure, show, output_notebook, save  # Output file
from bokeh.models import HoverTool, value, LabelSet, Legend, ColumnDataSource

output_notebook()

 BokehJS 0.12.5 was successfully loaded.

In [28]:

source = ColumnDataSource(dict(
    x=p_df['X_tsne'],
    y=p_df['Y_tsne'],
    color=p_df['colors'],
    label=p_df['clusters'].apply(lambda l: top_labels[l]),
    # msize=p_df['marker_size'],
    topic_key=p_df['clusters'],
    title=p_df[u'Title'],
    content=p_df['Text_Rep']
))

In [29]:

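title = 'T-SNE visualization of topics'

# The cell that creates plot_lda is missing from the source; a bare figure is assumed here.
plot_lda = figure(title=title)

plot_lda.scatter(x='x', y='y', legend='label', source=source,
                 color='color', alpha=0.8, size=10)  # 'msize',

show(plot_lda)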
