Aggregate Sentence Structures from a Large Corpus
With the Wolfram Language, it is possible to analyze large datasets with ease. This example uses ExtendedEntityClass to extract and investigate the grammatical structure of over one million sentences from the posts on the website english.stackexchange.com.
Import an EntityStore created from english.stackexchange.com.
Register the store for use in EntityValue.
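The store itself is not reproduced here, so the following is only a minimal sketch of the registration step on a tiny in-memory store. The type name "ToyPost", the entity names, and the "Body" and "Tags" properties are hypothetical stand-ins, not the schema of the actual store.

```wolfram
(* Hypothetical stand-in for the imported store: two posts with made-up properties *)
toyStore = EntityStore[
   "ToyPost" -> <|
     "Entities" -> <|
       "p1" -> <|"Body" -> "Is there a word for this? I need one.",
         "Tags" -> {"single-word-requests"}|>,
       "p2" -> <|"Body" -> "Which form is correct here?",
         "Tags" -> {"grammar"}|>
       |>
     |>];

(* Make the "ToyPost" type visible to EntityValue *)
EntityRegister[toyStore];

(* Query a property through the standard entity framework *)
body = EntityValue[Entity["ToyPost", "p1"], "Body"]
```

Once a store is registered, its entities can be queried with EntityValue in the same way as built-in entity types.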
For posts classified with the "single-word-requests" tag, find the 50 most commonly quoted, italicized, bolded or linked words and make a word cloud of the results.
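The counting and plotting part of this step might look as follows. The word list is made up for illustration and stands in for the emphasized words extracted from the tagged posts; the extraction itself is not shown.

```wolfram
(* Made-up emphasized words standing in for those extracted from the tagged posts *)
emphasized = {"ubiquitous", "serendipity", "ubiquitous", "petrichor",
   "serendipity", "ubiquitous", "sonder"};

(* Tally the words and keep the n most frequent (the text uses n = 50) *)
topWords = TakeLargest[Counts[emphasized], UpTo[3]];

(* WordCloud accepts an association of word -> weight *)
WordCloud[topWords]
```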
You can investigate the site on a wider scale by examining sentence structures used in posts. Start by extending the post entity type with a property to extract simple sentences.
Use the new property to extract over one million sentences from the posts.
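As a rough sketch of these two steps, the code below extends a toy entity type with a computed property and then collects its values. The "ToyPost" type and its "Body" property are hypothetical, the store is re-created here so the example is self-contained, and TextSentences is only one possible way to split a post into sentences; the property definition used for the real corpus may differ.

```wolfram
(* Hypothetical toy store, re-created so this example stands alone *)
toyStore = EntityStore["ToyPost" -> <|"Entities" -> <|
      "p1" -> <|"Body" -> "Is there a word for this? I need one."|>,
      "p2" -> <|"Body" -> "Which form is correct here?"|>|>|>];
EntityRegister[toyStore];

(* Add a computed "Sentences" property to every entity of the type *)
withSentences = ExtendedEntityClass["ToyPost",
   "Sentences" -> EntityFunction[post, TextSentences[post["Body"]]]];

(* One list of sentences per post, flattened into a single corpus *)
sentences = Flatten[EntityValue[withSentences, "Sentences"]]
```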
Find the words in each sentence by splitting on whitespace or punctuation.
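One way to express this split is a StringSplit pattern that treats any run of whitespace or punctuation characters as a single delimiter. The sample sentence is made up.

```wolfram
sentence = "Is this, perhaps, a word?";

(* Any run of whitespace or punctuation characters acts as one delimiter *)
words = StringSplit[sentence, (WhitespaceCharacter | PunctuationCharacter) ..]
(* {"Is", "this", "perhaps", "a", "word"} *)

Length[words]  (* 5 *)
```

With this pattern, an apostrophe also counts as punctuation, so a contraction such as "don't" is split into two pieces.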
A journal article conjectured that the number of words per sentence in written prose follows a log-normal distribution. Use FindDistributionParameters to fit a log-normal distribution to the sentence lengths in the corpus, then plot the fitted distribution together with the observed lengths for comparison.
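The fit can be sketched on synthetic data. The sentence lengths below are randomly generated, and the parameters 2.7 and 0.6 are arbitrary choices for illustration, not estimates from the corpus.

```wolfram
SeedRandom[1];
(* Synthetic sentence lengths; 2.7 and 0.6 are arbitrary, not corpus estimates *)
lengths = Ceiling[RandomVariate[LogNormalDistribution[2.7, 0.6], 5000]];

(* Estimate the two log-normal parameters (maximum likelihood by default) *)
params = FindDistributionParameters[lengths, LogNormalDistribution[mu, sigma]];
fitted = LogNormalDistribution[mu, sigma] /. params;

(* Overlay the fitted density on a histogram normalized as a PDF *)
Show[
 Histogram[lengths, {1}, "PDF"],
 Plot[PDF[fitted, x], {x, 0, 80}, PlotStyle -> Red, PlotRange -> All]
]
```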
Find how often each individual word occurs.
Use DeleteStopwords to clean up the dataset.
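A small sketch of the counting and cleanup steps, using two made-up word lists. Lowercasing before counting is an assumption made here so that "The" and "the" are tallied together; the original analysis may treat case differently.

```wolfram
(* Made-up word lists, one per sentence *)
wordLists = {{"Is", "there", "a", "word", "for", "this"},
   {"The", "word", "is", "a", "synonym"}};

(* Frequency of each word, ignoring case *)
counts = Counts[ToLowerCase[Flatten[wordLists]]];

(* Keep only the keys that survive DeleteStopwords *)
cleaned = KeyTake[counts, DeleteStopwords[Keys[counts]]]
```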
Visualize the cleaned-up word counts in a log-log plot.
Focus on the top 50 words, using Callout to see the individual words.
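Both plots can be sketched with a handful of made-up counts. The words and numbers below are illustrative only.

```wolfram
(* Made-up counts, assumed to be already cleaned of stopwords *)
counts = <|"word" -> 120, "meaning" -> 75, "phrase" -> 40,
   "sentence" -> 33, "verb" -> 12, "idiom" -> 5|>;

(* Sort by count, most frequent first *)
sorted = Reverse[Sort[counts]];

(* Rank on the horizontal axis, count on the vertical axis, both logarithmic *)
ListLogLogPlot[Values[sorted], AxesLabel -> {"rank", "count"}]

(* Label the n most frequent words (the text uses n = 50) *)
top = TakeLargest[counts, UpTo[4]];
ListLogLogPlot[KeyValueMap[Callout[#2, #1] &, top]]
```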
Analyze all of the sentences in the corpus with TextStructure, appending each result to a file as soon as it is finished. Note that this evaluation is very slow and may run for several days.
Read in the data from the file.
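The append-then-read pattern might be sketched as below, on two made-up sentences. The file path is hypothetical, and storing each result as a sentence -> structure rule is an assumption made for this sketch.

```wolfram
sentences = {"The cat sat on the mat.", "Is there a word for this?"};

(* Hypothetical output path; start from an empty file *)
file = FileNameJoin[{$TemporaryDirectory, "toy-structures.wl"}];
If[FileExistsQ[file], DeleteFile[file]];

(* Append each result as soon as it is computed, so an interrupted run loses little work *)
Scan[PutAppend[# -> TextStructure[#], file] &, sentences];

(* Read every saved expression back as a list of sentence -> structure rules *)
results = ReadList[file];
Length[results]
```

Writing results incrementally means a long evaluation can be stopped and inspected without discarding what has already been computed.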
Look at a specific example.
Build a function to extract the core structure of a sentence.
Extract the core structure of all of the sentences.
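What counts as the "core structure" is not spelled out here, so the sketch below takes one possible reading: the left-to-right sequence of word-level grammatical units. It runs on a hand-written, simplified stand-in for a TextStructure result; the helper el and the function name coreStructure are hypothetical, and real TextStructure output may differ in detail (for example, by representing grammatical units as Entity objects, which the Replace step allows for).

```wolfram
(* Hand-written, simplified stand-in for a TextStructure result *)
el[content_, unit_] := TextElement[content, <|"GrammaticalUnit" -> unit|>];
toy = el[{
    el[{el[{"The"}, "Determiner"], el[{"cat"}, "Noun"]}, "NounPhrase"],
    el[{el[{"sat"}, "Verb"]}, "VerbPhrase"]}, "Sentence"];

(* Collect the grammatical unit of every word-level element, left to right *)
coreStructure[structure_] := Cases[structure,
   TextElement[{_String}, info_] :>
    Replace[info["GrammaticalUnit"], Entity[_, name_] :> name],
   Infinity];

coreStructure[toy]
(* {"Determiner", "Noun", "Verb"} *)
```

Applying the function to every parsed sentence is then a plain Map, for example coreStructure /@ parsed, where parsed is a list of parse results.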
Find all grammatical units in the data and how often they appear.
Find transition counts for each consecutive pair of units.
Here is the number of transitions between nouns and prepositions.
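The unit and transition counts can be illustrated with three made-up structures, each written as a list of unit names.

```wolfram
(* Made-up core structures, one per sentence *)
structures = {
   {"Determiner", "Noun", "Verb", "Preposition", "Determiner", "Noun"},
   {"Pronoun", "Verb", "Determiner", "Noun", "Preposition", "Noun"},
   {"Determiner", "Noun", "Verb"}};

(* How often each grammatical unit appears overall *)
unitCounts = Counts[Flatten[structures]];
unitCounts["Noun"]  (* 5 *)

(* Partition[s, 2, 1] lists overlapping pairs; Catenate merges the per-sentence lists *)
transitions = Counts[Catenate[Partition[#, 2, 1] & /@ structures]];

transitions[{"Noun", "Preposition"}]  (* 1 *)
transitions[{"Determiner", "Noun"}]   (* 4 *)
```

Pairs are formed within each sentence only, so no transition is counted across a sentence boundary.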
Visualize how frequently each transition occurs with MatrixPlot.
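A possible way to turn transition counts into a matrix for MatrixPlot is sketched below on made-up structures. The choice of rows for the first unit and columns for the second is an assumption of this sketch.

```wolfram
(* Made-up core structures, repeated here so the example stands alone *)
structures = {
   {"Determiner", "Noun", "Verb", "Preposition", "Determiner", "Noun"},
   {"Pronoun", "Verb", "Determiner", "Noun", "Preposition", "Noun"},
   {"Determiner", "Noun", "Verb"}};
transitions = Counts[Catenate[Partition[#, 2, 1] & /@ structures]];
units = Union[Flatten[structures]];

(* Rows are the first unit of a pair, columns the second; unseen pairs count as 0 *)
matrix = Table[
   Replace[transitions[{a, b}], _Missing -> 0], {a, units}, {b, units}];

(* Label rows and columns with the unit names *)
ticks = Transpose[{Range[Length[units]], units}];
MatrixPlot[matrix, FrameTicks -> {{ticks, None}, {None, ticks}}]
```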
Group sentences with the same structure.
Visualize the most common sentence structures in a plot.
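Grouping and plotting might be sketched as follows. The sentences and their structures are made up, and the sentence -> structure rule format is an assumption of this sketch.

```wolfram
(* Made-up sentence -> structure pairs *)
data = {
   "The cat sleeps." -> {"Determiner", "Noun", "Verb"},
   "A dog barks." -> {"Determiner", "Noun", "Verb"},
   "She laughs." -> {"Pronoun", "Verb"},
   "The bird sings." -> {"Determiner", "Noun", "Verb"}};

(* Key: structure; value: all sentences sharing that structure *)
groups = GroupBy[data, Last -> First];

(* Number of sentences per structure, most common first *)
sizes = Reverse[Sort[Length /@ groups]];

BarChart[Values[sizes],
 ChartLabels -> (StringRiffle[#, " "] & /@ Keys[sizes]),
 BarOrigin -> Left]
```

The grouped association also gives direct access to example sentences for a structure, for instance groups[{"Pronoun", "Verb"}].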
Look at example sentences for a few interesting structures.
Create a network of some of the most common sentence structures, connecting two structures if they share a parent-child relationship through the insertion of one grammatical unit.
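The parent-child test can be sketched as: a structure is a child of another if deleting exactly one of its units yields the parent. The structures below are made up, and the names childQ and label are hypothetical helpers introduced for this illustration.

```wolfram
(* Made-up structures standing in for the most common ones *)
structures = {
   {"Noun", "Verb"},
   {"Determiner", "Noun", "Verb"},
   {"Determiner", "Adjective", "Noun", "Verb"},
   {"Noun", "Verb", "Noun"},
   {"Pronoun", "Verb"}};

(* child is parent with exactly one grammatical unit inserted *)
childQ[parent_, child_] := Length[child] == Length[parent] + 1 &&
   MemberQ[Table[Delete[child, i], {i, Length[child]}], parent];

(* Use readable strings as vertex names *)
label[s_] := StringRiffle[s, " "];

(* Test every ordered pair and keep the parent -> child links *)
edges = DirectedEdge[label[#1], label[#2]] & @@@
   Select[Tuples[structures, 2], childQ @@ # &];

Graph[label /@ structures, edges, VertexLabels -> "Name"]
```

Testing all ordered pairs is quadratic in the number of structures, which is manageable only because the network is restricted to a small set of the most common ones.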