Newsmastering Tips and Hacks – Part II – Taxonomy Tools
August 18, 2007 at 1:12 am | Posted in Unscientific Research
Tags: newsmastering, taxonomy, tools
This article is the second part of a series dedicated to Newsmastering tools, experiences and lessons learned. The first article is at: https://radarfarms.wordpress.com/2007/06/13/newsmastering-tips-and-hacks-part-i-keyword-rss-generators/
For those who have not read the first article, we would like to repeat the definition of Newsmastering:
Newsmastering can be defined as the ability to define, locate, select, aggregate, filter, process and publish Web-based information on a specific subject. RSS Newsmastering utilizes Really Simple Syndication (RSS) feeds, which include brief summaries of web and news articles, blog and user group postings, images, video and audio, message board/forum threads, job postings etc.
These pieces can be composed into specific information channels, which have a variety of names: Feed Digests, Lenses, FeedBots, and Bees etc. We call them News Radars.
We would also like to re-emphasize that the prevalent way of building News Radars of decent quality is to generate feeds using search queries against multiple web, blog, news, user group, message board and multimedia search engines (e.g., MSN/Live.com, Technorati, Google News, Google Groups and Video, Boardtracker, Blogdigger) and social networking, bookmarking and publishing sites (del.icio.us, Digg, YouTube, Flickr etc.).
Therefore, when a Newsmaster is contemplating a new News Radar (or, as it has lately been called, an RSS Mashup), one of the most important questions he/she faces is: which keywords or key phrases define the radar? The answer makes the difference between a shallow, watered-down radar and a deep and robust “industry-quality” News Radar.
The difference is particularly palpable when we are talking about the News Radars covering a broad and/or complex subject.
Even if your radar’s subject is granular enough, like radars we created recently, Air Hogs Storm Launcher (a cool RC toy for teens and adults alike) or Danier Leather Bombers (leather jackets by one of the leading Canadian leather manufacturers), you can still mess up if the subject’s keywords are not properly chosen (“danier leather” bombers vs. danier “leather bombers” vs. danier leather bombers vs. “danier leather bombers”).
However, using such a simple tool as Google, a Newsmaster can find his/her way around and figure out that “danier leather” bombers is the best single option for the radar (we will talk about using search engine operators in one of our later articles in this series).
But could you really create a radar about “Academic Accounting News and Research” or “Sentiment Classification” or “Web Data Mining” without knowledge of the taxonomies for the radar’s subject?
According to Wikipedia, Taxonomy is the practice and science of classification. Originally the term taxonomy only referred to the science of classifying living organisms (now known as alpha taxonomy); however, the term is now applied in a wider, more general sense and now may refer to a classification of things, as well as to the principles underlying such a classification.
Taxonomy is often used interchangeably with Ontology. Wikipedia defines the latter as follows: in both computer science and information science, an ontology is a data model that represents a set of concepts within a domain and the relationships between those concepts.
The bottom line is that if the radar’s subject represents a domain, then in order to describe the domain we need to know its set of concepts (collections of objects). If we decided to create a radar describing Motor Vehicles, we would not describe the subject as just Motor Vehicles but rather as Cars and Trucks. We would probably go even further and include 2-Wheel Drive Cars and 4-Wheel Drive Cars, and then break down the former category into Front-Wheel Drive Cars and Rear-Wheel Drive Cars. We guess you get the idea.
You need to remember that RSS feeds generated by search engines of different kinds pick up a limited number of items (usually the 10-15 most relevant) per search query. Therefore, if, for instance, we use the Live.com/MSN web search engine to generate RSS feeds (see our article at: https://radarfarms.wordpress.com/2006/10/16/newsmastering-hacks-how-to-generate-rss-feeds-from-msncom) and use “Motor Vehicles” as the key phrase, we would generate an RSS feed incorporating the 10 items most relevant to the “Motor Vehicles” key phrase. If we use the same approach but generate MSN RSS feeds for “Cars” and “Trucks”, we would pick up the 10 most relevant items for “Cars” and the 10 most relevant items for “Trucks”. Which radar will have better quality?
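To make the arithmetic concrete, here is a minimal Python sketch of generating one feed URL per sub-concept instead of one for the whole subject. The endpoint `https://search.example.com/results` and the `q`/`format` parameters are placeholders we made up for illustration; substitute the RSS search URL of whatever engine you actually use.

```python
from urllib.parse import urlencode

# Hypothetical feed endpoint (an assumption, not a real search engine URL).
FEED_ENDPOINT = "https://search.example.com/results"


def feed_url(key_phrase, fmt="rss"):
    """Return a search-feed URL for a single key phrase."""
    return FEED_ENDPOINT + "?" + urlencode({"q": key_phrase, "format": fmt})


def explode_vertically(subject, sub_concepts):
    """Map each sub-concept to its own feed URL.

    Falls back to the subject itself when no sub-concepts are known.
    """
    phrases = sub_concepts or [subject]
    return {phrase: feed_url(phrase) for phrase in phrases}


# One feed for "Motor Vehicles" yields ~10 items; two feeds yield ~20.
feeds = explode_vertically("Motor Vehicles", ["Cars", "Trucks"])
for phrase, url in feeds.items():
    print(phrase, "->", url)
```

Each extra sub-concept adds another batch of top-ranked items to the radar, which is the whole point of breaking the subject down.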
Among ourselves we call this approach “Subject Exploding” (not sure that it is a correct term, though). The subject may be exploded vertically, as in the example above, or horizontally.
The “horizontal explosion” is the discovery of keywords/key phrases with a similar scientific, technical and/or cultural meaning. For instance, our company is slowly but surely moving into the uncharted waters of Opinion Mining. We created quite a few Opinion Radars tracking Web Buzz about consumer products (Motorola Pebl, Gillette Fusion, Playstation 3, JVC Everio, Universal Remote Controls etc.), movies (All the King’s Men), great writers and musicians (Ernest Hemingway, Pink Floyd, Gogol Bordello, Jazz Legends, etc.). We performed one-time indexing of them to create Tag Clouds and, in the process, collected data about keywords/key phrases and were able to see some patterns. However, we would like to present a more formal evaluation of the Web Buzz sentiment to our visitors and clients alike.
As usual, while we are developing new algorithms, we mine the Web for our area of interest and create a radar. However, as happens more often than not, the radar’s subject name is not unique. Different authors and different reference sources call this domain differently:
- Opinion Mining
- Opinion Extraction
- Sentiment Classification
- Sentiment Mining
- Sentiment Evaluation
- Sentiment Analysis etc.
There will be some overlap between the above terms, and if we pick the most representative one, Sentiment Analysis, the top 10-15 most relevant items on the Web may cover the others. However, we would miss a lot.
Therefore, if we shoot for a really comprehensive, high-quality News Radar, we need to generate queries for several keywords/key phrases describing the radar’s subject.
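Because synonym feeds overlap, the same item can arrive more than once. Below is a rough Python sketch of merging several key-phrase feeds into one radar while keeping each link only once. The sample feeds and the `make_rss` helper are invented for the demonstration; a real radar would fetch the feed XML over HTTP.

```python
import xml.etree.ElementTree as ET


def parse_items(rss_xml):
    """Extract (title, link) pairs from an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return [(item.findtext("title", ""), item.findtext("link", ""))
            for item in root.iter("item")]


def merge_feeds(feeds_by_phrase):
    """Merge items from several key-phrase feeds, keeping each link once.

    Returns (title, link, phrases) tuples, where phrases lists every
    key phrase whose feed contained that link.
    """
    merged = {}
    for phrase, rss_xml in feeds_by_phrase.items():
        for title, link in parse_items(rss_xml):
            entry = merged.setdefault(link, {"title": title, "phrases": []})
            entry["phrases"].append(phrase)
    return [(e["title"], link, e["phrases"]) for link, e in merged.items()]


def make_rss(items):
    """Build a tiny RSS document for the demo (hypothetical helper)."""
    body = "".join(f"<item><title>{t}</title><link>{l}</link></item>"
                   for t, l in items)
    return f"<rss version='2.0'><channel>{body}</channel></rss>"


sample = {
    "Sentiment Analysis": make_rss([("Post A", "http://example.com/a"),
                                    ("Post B", "http://example.com/b")]),
    "Opinion Mining": make_rss([("Post B", "http://example.com/b"),
                                ("Post C", "http://example.com/c")]),
}
radar = merge_feeds(sample)
for title, link, phrases in radar:
    print(title, link, phrases)
```

Recording which phrases matched each item is optional, but it shows at a glance how much each synonym really adds to the radar.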
Hopefully, this makes sense to the readers. However, it brings us to a question: where can I find a Taxonomy for the subject of my interest? Are taxonomies readily available?
This article is dedicated to answering these questions.
Sources of Taxonomies for specific subjects are multiple and can be divided into two broad categories:
- 1) Domain Taxonomies or Controlled Vocabularies
- 2) Document and/or Search Clusters
Where can you find Taxonomies/Controlled Vocabularies? The greatest source is Taxonomy Warehouse. It is a free service (free to users and free to vocabulary publishers) provided by Factiva, a multi-national consulting company for content providers, for the benefit of the information and knowledge management community.
The Warehouse aims to provide a comprehensive directory of taxonomies, thesauri, classification schemes and other authority files from around the world, plus information about taxonomy references, resources and events.
- Over 660 Taxonomies
- Classified by 73 subject domains
- Produced by 261 publishers
- In 39 languages
- 65% produced in digital media
- 100 directly licensable
Sites hosting online publications can also contain Taxonomies represented as Keywords. For instance, when we created the Academic Accounting Research and News radar, we went to the Canadian Academic Accounting Association site, located the Accounting Publications page and, using the publication abstracts, were able to retrieve quite a few keywords.
It was not easy, but we still consider this radar our benchmark in terms of quality.
There is a fair number of Taxonomies and Controlled Vocabularies on the Web; however, they are scattered, and if you want to find them you can use either del.icio.us or our very own Taxonomies and Folksonomies radar. In most cases, though, you are on your own. However, taxonomy/ontology-related information can be found in rather unexpected places, readily available and free.
Let’s start with Vivisimo and its Clusty, a free, public clustering search engine. Vivisimo may not be such a household name as Google or Yahoo or even Ask.com; however, this R&D offspring of Carnegie Mellon University is the best thing that has happened to Web and Enterprise Search since Google. Vivisimo’s bread and butter is the clustering of search results.
Vivisimo provides the following description for the clustered search:
Dynamic, or on-the-fly, document clustering refers to the automatic grouping of documents into spontaneously labeled categories. Thus, hundreds of search results on Pittsburgh might be grouped into Hotels, Universities, Steelers, Pirates, Mayor Murphy, Carnegie Museums, and so on. These groups can then be displayed as folders in the familiar style, in which folders are shown on the left and individual search results are shown on the right. Dynamic clustering forms groups via a quick statistical and linguistic analysis of the available textual descriptions, such as each search result’s title and summary.
If a Pittsburgh-related collection of Web documents is categorized into Pittsburgh Hotels, Pittsburgh Steelers, Pittsburgh Pirates, Pittsburgh Mayor Murphy, Pittsburgh Carnegie Museums and other groupings, and if we have to create a News Radar dedicated to various aspects of Pittsburgh life, the above categories would help us immensely.
Vivisimo not only provides the search categories but also allows drilling down into them to identify deeper layers (sub-categories). Of course, search clustering depends very much on the quality of the document corpus (collection) the search was performed on; the GIGO principle is universally true :)
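We do not know how Vivisimo's clustering works internally, so the following Python snippet is only a toy illustration of the general idea: group result titles under the most widely shared term that is neither a stopword nor part of the query. The stopword list, the sample titles and the labeling rule are all our own assumptions.

```python
from collections import defaultdict
import re

# Tiny stopword list, chosen only for this demo.
STOPWORDS = {"the", "of", "in", "and", "a", "for", "to", "at"}


def cluster_results(titles, query):
    """Group titles under their most widely shared non-query term."""
    query_terms = set(query.lower().split())
    tokenized = []
    doc_freq = defaultdict(int)
    for title in titles:
        terms = {w for w in re.findall(r"[a-z]+", title.lower())
                 if w not in STOPWORDS and w not in query_terms}
        tokenized.append(terms)
        for term in terms:
            doc_freq[term] += 1

    clusters = defaultdict(list)
    for title, terms in zip(titles, tokenized):
        # Label = term found in the most titles; ties broken alphabetically.
        label = min(terms, key=lambda t: (-doc_freq[t], t)) if terms else "other"
        clusters[label].append(title)
    return dict(clusters)


titles = [
    "Pittsburgh Hotels Downtown",
    "Cheap Hotels in Pittsburgh",
    "Pittsburgh Steelers Schedule",
    "Steelers Tickets Pittsburgh",
    "Carnegie Museums of Pittsburgh",
]
clusters = cluster_results(titles, "pittsburgh")
for label, members in clusters.items():
    print(label, members)
```

Even this crude grouping produces labels that could be combined with the original query to seed separate feeds for a radar.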
Clusty.com is a meta-search engine that queries MSN, Open Directory, Wikipedia, NY Times, Ask, Yahoo News, Gigablast and Wisenut. Even though this list doesn’t include Google and/or Yahoo web search, the above sources combined provide a set of results representative enough to be considered.
This is what the clustered search for “Sentiment Analysis” (the most representative term for our area of interest) looks like:
The top ten search clusters are on the left side. Taken stand-alone, not all of them make sense in the given context, but combined with “sentiment analysis” where required, they look as follows:
- Market Sentiment Analysis – 21 results out of total of 172
- Stock Sentiment Analysis – 18 results
- Blog Sentiment Analysis – 16 results
- Sentiment Analysis Monitoring – 14 results
- Opinion Intelligence – 12 results – it is more puzzling and requires some extra drilldown:
Apparently, it relates to combinations of opinion intelligence sentiment analysis keywords – for instance, opinion intelligence “sentiment analysis”
- Technical & Sentiment Analysis – 11 results
- Search & Sentiment Analysis – 10 results
- Institutional Sentiment & Analysis – 6 results
- Sentiment Mining Analysis – 8 results
- Sentiment Analysis for Homeland Security – 6 results
All of the above categories can be used for creating the Sentiment Analysis radar. However, this radar will be deep rather than broad: all these categories, with the exception of Opinion Intelligence, are sub-categories of Sentiment Analysis, so synonyms of Sentiment Analysis will not be sufficiently covered.
Another incarnation of Clusty search is Clusty Clouds. This is how Vivisimo describes the feature:
A Clusty Cloud is a tool that webmasters or bloggers can use to instantly visualize a topic using the familiar tag cloud display. What makes the Clusty Cloud unique is that you can create a cloud based on any topic or query – you don’t need tags or months of content on a subject to create an interesting cloud.
Clusty Clouds are generated using our search results for the topic you enter. Since the tags you see come from Clusty, you can click on any of them to go to Clusty’s search results. Using Clusty to generate the cloud also ensures that it is always up-to-date because the clusters are generated in real-time.
Surprisingly, however, the Clusty Cloud for Sentiment Analysis is a bit different from the Clustered Search results for the very same keyword. A larger and/or bolder font for a keyword/key phrase indicates its larger weight in the Clusty Cloud; nevertheless, we drilled down into each of the tags in the cloud.
The drilldown for the Investing keyword reveals that the cloud’s tag is added to the Sentiment Analysis keyword, which makes the complete query look like “sentiment analysis” investing:
Therefore, results of such a combined query are probably more representative and reliable.
The top ten queries then are as follows:
- opinion “sentiment analysis” – 42,606 results
- technical and sentiment analysis “sentiment analysis” – 32,766 results
- market “sentiment analysis” – 24,698 results
- search and sentiment analysis “sentiment analysis” – 7,410 results
- financial “sentiment analysis” – 4,035 results
- blog “sentiment analysis” – 3,606 results
- mining “sentiment analysis” – 2,762 results
- investing “sentiment analysis” – 2,336 results
- security “sentiment analysis” – 2,295 results
- enterprise search “sentiment analysis” – 1,533 results
The top three queries return a significantly larger set of results than the other queries. However, “technical and sentiment analysis” is a technique used by Stock Traders and has nothing to do with the Opinion Mining area. Ditto for the “market”, “financial” and “investing” sub-queries. If we used them to create RSS feeds for our radar, it would cover both the Opinion Mining and Stock Trading areas, while we need just the former.
Thus, if we want to create a radar of even half-decent quality for this subject, we need to create the feeds for the following key phrases:
- opinion “sentiment analysis”
- “search and sentiment analysis”
- blog “sentiment analysis”
- opinion mining “sentiment analysis” (adding “opinion” removes “Stock Trading” noise from this query)
- homeland security “sentiment analysis” (adding “homeland“ removes “Stock Trading” noise from this query)
- enterprise search “sentiment analysis”
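The selection above can be expressed as a small Python sketch: combine each cloud tag with the quoted base phrase, drop the tags we judged to be off-topic, and refine the ambiguous ones. The tag counts are the ones quoted earlier; the `OFF_TOPIC` and `REFINED` tables are manual editorial decisions, not something the script can work out by itself, and all the names are ours.

```python
BASE_PHRASE = '"sentiment analysis"'

# Cloud tags with the result counts quoted in the article.
CLOUD_TAGS = {
    "opinion": 42606,
    "technical and sentiment analysis": 32766,
    "market": 24698,
    "search and sentiment analysis": 7410,
    "financial": 4035,
    "blog": 3606,
    "mining": 2762,
    "investing": 2336,
    "security": 2295,
    "enterprise search": 1533,
}

# Editorial decisions made by a human reviewer (assumptions of this sketch).
OFF_TOPIC = {"technical and sentiment analysis", "market", "financial", "investing"}
REFINED = {"mining": "opinion mining", "security": "homeland security"}


def build_queries(tags, base, off_topic, refined):
    """Return on-topic queries, ordered by descending result count."""
    queries = []
    for tag in sorted(tags, key=tags.get, reverse=True):
        if tag in off_topic:
            continue  # skip Stock Trading noise
        queries.append(f"{refined.get(tag, tag)} {base}")
    return queries


queries = build_queries(CLOUD_TAGS, BASE_PHRASE, OFF_TOPIC, REFINED)
for query in queries:
    print(query)
```

Keeping the off-topic and refinement tables separate from the code makes it easy to revisit those decisions when the cloud changes.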
We can also drill down into opinion “sentiment analysis” and see what is beneath, but it may be a bit of overkill:
Thus the graph of keywords/key phrases that make up the area becomes as follows:
Clusty is not the only search engine that provides search clustering and thereby invaluable taxonomies/ontologies for Newsmasters. Such household names as Live.com and Ask also provide it.
We tried to run the “sentiment analysis” search on Live.com (formerly known as MSN). However, apparently the search engine didn’t cluster the search results:
We were luckier running the same query on Ask (formerly known as Ask Jeeves):
However, the cluster generated by the engine was either overwhelmed by the Stock Market Sentiment Analysis or merely diluted by other Sentiment terms.
We tested the clustered search feature using the “honduras real estate” query (we have a News Radar with a similar name) and achieved excellent results using Live.com:
and not so using Ask.com:
As you can see, the clustered search offered by Live and Ask should be considered on a case-by-case basis but certainly should not be ignored.
What other tools are available to create taxonomies/ontologies for News Radars?
1) Google Sets – Google Sets attempts to make a list of items when the user enters a few examples. For example, entering “Green, Purple, and Red” produces the list “Green, Purple, Red, Blue, Black, White, Yellow, Orange, and Brown”.
Below is what we got by creating a small set of similar terms. Not bad: the arnaud fischer term was our catch!
The larger set was more diluted:
2) Google Suggest – uses auto-complete while typing to suggest popular searches. The catch was not very impressive:
3) The Yahoo and Ask.com suggestion features are similar to Google Suggest but even less impressive for our task:
4) Google Keyword Tool –
Use the Keyword Tool to get new keyword ideas. Pick one of the tabs below and enter keywords or URLs that are relevant to your business. The tool was biased toward the Stock Market’s “sentiment analysis”:
It could not locate anything for “Sentiment Analysis” but for the more trivial “Honduras Real Estate” it produced quite an impressive list of suggestions:
5) Free version of the Wordtracker tool – it couldn’t find anything for “Sentiment Analysis” and was not that spectacular for “Honduras Real Estate”:
6) Amazon Data Mining Stats
Amazon.com is a hidden treasure trove for Taxonomy “hunters”! This is not surprising, considering that Amazon is likely the largest public data warehouse on Earth. They quietly digitize the contents of all the books they sell and, therefore, are sitting on a huge stockpile of data. With Jeff Bezos, a former Wall Street analyst, being so obsessed with numbers/stats, Amazon has done a formidable job generating various stats (even though some of them, such as Text Stats, seem a bit far-fetched and not very applicable). We admired this exercise so much that we even dedicated a News Radar to Amazon Data Mining.
Anyway, how can Newsmasters benefit from the Amazon data processing frills?
We explored the Mining the Web subject using Amazon.com, and the most popular/relevant book was Mining the Web: Discovering Knowledge from Hypertext Data (Hardcover) by Soumen Chakrabarti:
We drilled down into the data about the book:
And quickly hit the jackpot: Key Phrases – Statistically Improbable Phrases (SIPs):
This is how Amazon describes the feature:
Amazon.com Statistically Improbable Phrases
Amazon.com’s Statistically Improbable Phrases, or “SIPs”, are the most distinctive phrases in the text of books in the Search Inside!™ program. To identify SIPs, our computers scan the text of all books in the Search Inside! program. If they find a phrase that occurs a large number of times in a particular book relative to all Search Inside! books, that phrase is a SIP in that book.
SIPs are not necessarily improbable within a particular book, but they are improbable relative to all books in Search Inside!. For example, most SIPs for a book on taxes are tax related. But because we display SIPs in order of their improbability score, the first SIPs will be on tax topics that this book mentions more often than other tax books. For works of fiction, SIPs tend to be distinctive word combinations that often hint at important plot elements.
Click on a SIP to view a list of books in which the phrase occurs. You can also view a list of references to the phrase in each book. Learn more about the phrase by clicking on the A9.com search link.
The SIPs can be used as taxonomy substitutes because they represent, to some extent, sub-categories of the Mining the Web subject (assuming that the book we used as the example sufficiently represents the subject).
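Amazon does not publish its improbability score, so the Python sketch below is only our guess at the flavour of the idea: rank a book's two-word phrases by how much more often they occur in the book than in the whole collection. The ratio with add-one smoothing and the tiny sample texts are our own assumptions.

```python
from collections import Counter


def bigrams(text):
    """Return the list of adjacent word pairs in a text."""
    words = text.lower().split()
    return list(zip(words, words[1:]))


def improbable_phrases(book, corpus, top=3):
    """Rank the book's bigrams by (share in book) / (smoothed share in corpus).

    The corpus is expected to include the book itself, mirroring the
    'relative to all books' wording of the description above.
    """
    book_counts = Counter(bigrams(book))
    corpus_counts = Counter()
    for other in corpus:
        corpus_counts.update(bigrams(other))

    book_total = sum(book_counts.values())
    corpus_total = sum(corpus_counts.values())
    vocab = len(set(book_counts) | set(corpus_counts))

    scores = {}
    for phrase, count in book_counts.items():
        book_share = count / book_total
        # Add-one smoothing keeps the denominator away from zero.
        corpus_share = (corpus_counts[phrase] + 1) / (corpus_total + vocab)
        scores[phrase] = book_share / corpus_share

    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return [" ".join(p) for p in ranked[:top]]


book = "focused crawler finds topic pages and the focused crawler ranks topic pages"
others = [
    "the cat sat on the mat and the dog sat on the mat",
    "the stock market and the bond market",
]
corpus = [book] + others
print(improbable_phrases(book, corpus, top=2))
```

Phrases that are frequent everywhere, such as "and the", sink to the bottom, while phrases concentrated in the one book rise to the top and become candidate key phrases for a radar.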
Key Phrases – Capitalized Phrases (CAPs) can also be used for composing key phrases for News Radars, and they seem to represent more aggregated categories of the Mining the Web domain.
The next jackpot is the Concordance, which is Amazon’s name for the ubiquitous Tag Cloud:
As with Tag Clouds, the larger and bolder the font of a word in the cloud, the greater its weight. The only downside of the Amazon concordance is that it doesn’t index bi- and trigrams (key phrases consisting of two and three words respectively), as a lot of sites, including RadarFarms.com, do. This deflates the concordance’s value as a source of taxonomy for Newsmasters.
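To show why bi- and trigrams matter, here is a short Python sketch that counts n-grams in a piece of text. The sample sentence is made up, and this is not how Amazon or any particular site builds its cloud; it only illustrates what a phrase-aware concordance would capture.

```python
from collections import Counter
import re


def ngram_counts(text, n):
    """Count n-word phrases (n=1 gives a plain word concordance)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Zip n staggered copies of the word list to get sliding windows.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


text = ("Sentiment analysis tools compare opinion mining "
        "and sentiment analysis methods")

unigram_counts = ngram_counts(text, 1)
bigram_counts = ngram_counts(text, 2)
trigram_counts = ngram_counts(text, 3)

print(unigram_counts.most_common(2))
print(bigram_counts.most_common(1))
```

A single-word concordance reports "sentiment" and "analysis" separately, while the bigram count surfaces "sentiment analysis" as one key phrase that can go straight into a search query.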
Conclusions and Recommendations:
1. Development of deep and robust “industry-quality” News Radars (or RSS Mashups) for a broad and/or complex subject is impossible without knowledge of the subject’s taxonomy/ontology. The best proof of this statement is that if we used the “Sentiment Analysis” keyword without researching its taxonomy, we would end up with a News Radar covering both the Stock Market Sentiment and Opinion Mining domains.
2. There are several sources of Taxonomies on the Web (the most notable is Taxonomy Warehouse); however, taxonomy resources are still scarce and scattered across the Web.
3. There are several free online tools which can be used for generating taxonomies. The best tool is Clusty Clouds.