Newsmastering Tips and Hacks – Part II – Taxonomy Tools

August 18, 2007 at 1:12 am | Posted in Unscientific Research | Leave a comment

This article is the second part of a series dedicated to Newsmastering tools, experiences and lessons learned. The first article is at: https://radarfarms.wordpress.com/2007/06/13/newsmastering-tips-and-hacks-part-i-keyword-rss-generators/

For those who have not read the first article, we would like to repeat the definition of Newsmastering:

Newsmastering can be defined as the ability to define, locate, select, aggregate, filter, process and publish Web-based information on a specific subject. RSS Newsmastering utilizes Really Simple Syndication (RSS) feeds, which include brief summaries of web and news articles, blog and user group postings, images, video and audio, message board/forum threads, job postings etc.

These pieces can be composed into specific information channels, which have a variety of names: Feed Digests, Lenses, FeedBots, and Bees etc. We call them News Radars.

We would also like to re-emphasize that the prevalent way of building News Radars of decent quality is to generate feeds using search queries against multiple web, blog, news, user group, message board and multimedia search engines (e.g., MSN/Live.com, Technorati, Google News, Google Groups and Video, Boardtracker, Blogdigger) and social networking, bookmarking and publishing sites (del.icio.us, Digg, YouTube, Flickr etc.).

Therefore, when a Newsmaster is contemplating a new News Radar (or, as it has lately been called, an RSS Mashup), one of the most important questions he/she faces is: which keywords or key phrases define the radar? The answer makes the difference between a shallow, watered-down News Radar and a deep and robust “industry-quality” one.

The difference is particularly palpable for News Radars covering a broad and/or complex subject.

Even if your radar’s subject is granular enough, like the radars we created recently, Air Hogs Storm Launcher (a cool RC toy for teens and adults alike) or Danier Leather Bombers (leather jackets by one of the leading Canadian leather manufacturers), one can still mess up if the subject’s keywords are not properly chosen (“danier leather” bombers vs. danier “leather bombers” vs. danier leather bombers vs. “danier leather bombers”).

However, using such a simple tool as Google, a Newsmaster can find his/her way around and figure out that “danier leather” bombers is the best single option for the radar (we will talk about using search engine operators in one of the later articles in this series).

But could you really create a radar about “Academic Accounting News and Research” or “Sentiment Classification” or “Web Data Mining” without knowing the taxonomies for the radar’s subject?

According to Wikipedia, Taxonomy is the practice and science of classification. Originally the term taxonomy only referred to the science of classifying living organisms (now known as alpha taxonomy); however, the term is now applied in a wider, more general sense and now may refer to a classification of things, as well as to the principles underlying such a classification.

Taxonomy is often used interchangeably with Ontology. Wikipedia defines the latter as follows: in both computer science and information science, an ontology is a data model that represents a set of concepts within a domain and the relationships between those concepts.

The bottom line is that if the radar’s subject represents a domain, then in order to describe the domain we need to know its set of concepts (collections of objects). If we decided to create a radar describing Motor Vehicles, we would not describe the subject as just Motor Vehicles but rather as Cars and Trucks. We would probably go even further and include 2-Wheel Drive Cars and 4-Wheel Drive Cars, and then break down the former category into Front-Wheel Drive Cars and Rear-Wheel Drive Cars. We guess you get the idea.

You need to remember that RSS feeds generated by search engines of different kinds pick up a limited number of items (usually the 10-15 most relevant) per search query. Therefore, if, for instance, we use the Live.com/MSN web search engine to generate RSS feeds (see our article at:  https://radarfarms.wordpress.com/2006/10/16/newsmastering-hacks-how-to-generate-rss-feeds-from-msncom) and use “Motor Vehicles” as the key phrase, we would generate an RSS feed incorporating the 10 items most relevant to the “Motor Vehicles” key phrase. If we use the same approach but generate the MSN RSS feeds for “Cars” and “Trucks”, we would be able to pick up the 10 most relevant items for “Cars” and the 10 most relevant items for “Trucks”. Which radar will have the better quality?

Among ourselves we call this approach “Subject Exploding” (we are not sure that it is the correct term, though). The subject may be exploded vertically (see the above example):
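To make the arithmetic concrete, here is a minimal Python sketch of the idea. The nested dictionary is our own hypothetical rendering of the Motor Vehicles example, and the cap of 10 items per feed is an assumption taken from the MSN case above; real engines vary.

```python
# Hypothetical taxonomy for the "Motor Vehicles" example: a nested dict
# in which an empty dict marks a concept with no further breakdown.
TAXONOMY = {
    "Motor Vehicles": {
        "Cars": {
            "2-Wheel Drive Cars": {},
            "4-Wheel Drive Cars": {},
        },
        "Trucks": {},
    }
}

ITEMS_PER_FEED = 10  # assumed cap per search-generated feed


def explode(tree, depth):
    """Return the key phrases found `depth` levels below the top of `tree`.

    A branch that ends before `depth` contributes its last concept,
    so no part of the subject is dropped.
    """
    phrases = []
    for name, children in tree.items():
        if depth == 0 or not children:
            phrases.append(name)
        else:
            phrases.extend(explode(children, depth - 1))
    return phrases


def max_items(phrases):
    """Upper bound on items collected, assuming no overlap between feeds."""
    return len(phrases) * ITEMS_PER_FEED


if __name__ == "__main__":
    for depth in range(3):
        phrases = explode(TAXONOMY, depth)
        print(depth, phrases, max_items(phrases))
```

Each extra level adds one feed per concept, so the radar's ceiling grows from 10 items to 20 and then 30 in this toy case. In practice the feeds overlap, so the real gain is smaller than the upper bound.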

 taxonomy-image-1.jpg

or horizontally.

The “horizontal explosion” is discovering keywords/key phrases with a similar scientific, technical and/or cultural meaning. For instance, our company is slowly but surely moving into uncharted waters with Opinion Mining. We created quite a few Opinion Radars tracking Web Buzz about consumer products (Motorola Pebl, Gillette Fusion, Playstation 3, JVC Everio, Universal Remote Controls etc.), movies (All the King’s Men), and great writers and musicians (Ernest Hemingway, Pink Floyd, Gogol Bordello, Jazz Legends, etc.). We performed a one-time indexing of them to create Tag Clouds and, in the process, collected data about keywords/key phrases and were able to see some patterns. However, we would like to provide a more formal evaluation of the Web Buzz sentiment for our visitors and clients alike.

As usual, while we are developing new algorithms, we mine the Web for our area of interest and create a radar. However, as happens more often than not, the radar’s subject name is not unique. Different authors and different reference sources call this domain by different names:

  • Opinion Mining
  • Opinion Extraction
  • Sentiment Classification
  • Sentiment Mining
  • Sentiment Evaluation
  • Sentiment Analysis etc.

There will be some overlap between the above terms, and if we pick the most representative one, Sentiment Analysis, the top 10-15 most relevant items on the Web may cover the others to some extent. However, we would miss a lot.

Therefore, if we shoot for a really comprehensive high-quality News Radar, we need to generate queries for several keywords/key phrases describing the radar’s subject.
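Because the synonym feeds overlap, the items have to be merged and de-duplicated before publishing. Below is a rough sketch of that step, not our production code. The item dictionaries, the example.com links and the rule of treating links that differ only by a trailing slash as the same item are all assumptions made for illustration.

```python
def merge_feeds(feeds):
    """Merge per-phrase item lists, keeping the first copy of each link.

    `feeds` maps a key phrase to a list of items; each item is a dict
    with at least a "link" key. The phrase that first matched an item
    is recorded under "matched".
    """
    seen = set()
    merged = []
    for phrase, items in feeds.items():
        for item in items:
            key = item["link"].rstrip("/")  # simplistic normalization
            if key in seen:
                continue
            seen.add(key)
            merged.append({**item, "matched": phrase})
    return merged


# Made-up sample data standing in for two search-generated feeds.
SAMPLE_FEEDS = {
    "sentiment analysis": [
        {"title": "Post A", "link": "http://example.com/a"},
        {"title": "Post B", "link": "http://example.com/b"},
    ],
    "opinion mining": [
        {"title": "Post B again", "link": "http://example.com/b/"},
        {"title": "Post C", "link": "http://example.com/c"},
    ],
}

if __name__ == "__main__":
    for item in merge_feeds(SAMPLE_FEEDS):
        print(item["matched"], "|", item["title"], "|", item["link"])
```

Here four fetched items collapse to three, and the third one would have been missed entirely by a radar built on “sentiment analysis” alone.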

taxonomy-image-2.jpg

Hopefully, this makes sense to the readers of this article. However, it brings us to a question: where can I find a Taxonomy for the subject of my interest? Are taxonomies readily available?

This article is dedicated to answering these questions.

Sources of Taxonomies for specific subjects are multiple and can be divided into two broad categories:

  • 1) Domain Taxonomies or Controlled Vocabularies
  • 2) Document and/or Search Clusters

Where can you find Taxonomies/Controlled Vocabularies? The greatest source is Taxonomy Warehouse. It is a free service (free to users and free to vocabulary publishers) put out by Factiva, a multi-national consulting company for content providers, for the benefit of the information and knowledge management community.

The Warehouse aims to provide a comprehensive directory of taxonomies, thesauri, classification schemes and other authority files from around the world, plus information about taxonomy references, resources and events.

  • Over 660 Taxonomies
  • Classified by 73 subject domains
  • Produced by 261 publishers
  • In 39 languages
  • 65% produced in digital media
  • 100 directly licensable

Sites hosting online publications can also contain Taxonomies represented as Keywords. For instance, when we created the Academic Accounting Research and News radar, we went to the Canadian Academic Accounting Association site and located the Accounting Publications page with its publication abstracts:

 taxonomy-image-3.jpg

Using the abstracts, we were able to retrieve quite a few keywords:

taxonomy-image-4.jpg 

It was not easy, but we still consider this radar our benchmark in terms of quality.

There is a fair number of Taxonomies and Controlled Vocabularies on the Web; however, they are scattered, and if you want to find them you can use either del.icio.us or our very own Taxonomies and Folksonomies radar. In most cases, though, you are on your own. Fortunately, taxonomy/ontology-related information can be found in quite unexpected places, readily available and for free.

Let’s start with Vivisimo and its Clusty, a free, public clustering search engine. Vivisimo may not be such a household name as Google, Yahoo or even Ask.com; however, this R&D offspring of Carnegie Mellon University is the best thing that has happened to Web and Enterprise Search since Google. Vivisimo’s bread and butter is the clustering of search results.

Vivisimo provides the following description for the clustered search:

Dynamic, or on-the-fly, document clustering refers to the automatic grouping of documents into spontaneously labeled categories. Thus, hundreds of search results on Pittsburgh might be grouped into Hotels, Universities, Steelers, Pirates, Mayor Murphy, Carnegie Museums, and so on. These groups can then be displayed as folders in the familiar style, in which folders are shown on the left and individual search results are shown on the right. Dynamic clustering forms groups via a quick statistical and linguistic analysis of the available textual descriptions, such as each search result’s title and summary.

If a Pittsburgh-related collection of Web documents is categorized into Pittsburgh Hotels, Pittsburgh Steelers, Pittsburgh Pirates, Pittsburgh Mayor Murphy, Pittsburgh Carnegie Museums and other groupings, and if we have to create a News Radar dedicated to various aspects of Pittsburgh life, the above categories would help us immensely.

Vivisimo not only provides the search categories but also allows drilling down into them and identifying the deeper layers (sub-categories). Of course, search clustering depends very much on the quality of the document corpus (collection) the search was performed on; the GIGO principle is universally true :)

Clusty.com is a meta-search engine that queries results from MSN, Open Directory, Wikipedia, NY Times, Ask, Yahoo News, Gigablast and Wisenut. Even though this list doesn’t include Google and/or Yahoo web search, the above sources combined provide a set of results representative enough to be considered.

This is what the clustered search for “Sentiment Analysis” (the most representative term for our area of interest) looks like:

taxonomy-image-5.jpg

The top ten search clusters are on the left side. Standing alone, not all of them make sense in the given context, but combined with “sentiment analysis” where required, they look as follows:

  • Market Sentiment Analysis – 21 results out of total of 172
  • Stock Sentiment Analysis – 18 results
  • Blog Sentiment Analysis – 16 results
  • Sentiment Analysis Monitoring – 14 results
  • Opinion Intelligence – 12 results – it is more puzzling and requires some extra drilldown:

taxonomy-image-6.jpg

Apparently, it relates to combinations of opinion intelligence sentiment analysis keywords – for instance, opinion intelligence “sentiment analysis”

  • Technical & Sentiment Analysis – 11 results
  • Search & Sentiment Analysis – 10 results
  • Institutional Sentiment & Analysis – 6 results
  • Sentiment Mining Analysis – 8 results
  • Sentiment Analysis for Homeland Security – 6 results

All of the above categories can be used for creating the Sentiment Analysis radar; however, this radar will be deep rather than broad, because all these categories, with the exception of Opinion Intelligence, represent sub-categories of Sentiment Analysis, and the Sentiment Analysis synonyms will not be sufficiently covered.

Another incarnation of Clusty search is Clusty Clouds. That’s how Vivisimo describes this feature:

A Clusty Cloud is a tool that webmasters or bloggers can use to instantly visualize a topic using the familiar tag cloud display. What makes the Clusty Cloud unique is that you can create a cloud based on any topic or query – you don’t need tags or months of content on a subject to create an interesting cloud.

Clusty Clouds are generated using our search results for the topic you enter. Since the tags you see come from Clusty, you can click on any of them to go to Clusty’s search results. Using Clusty to generate the cloud also ensures that it is always up-to-date because the clusters are generated in real-time.

Surprisingly, however, the Clusty Cloud for Sentiment Analysis is a bit different from the Clustered Search results for the very same keyword. A larger and/or bold font for a keyword/key phrase indicates its larger weight in the Clusty Cloud; nevertheless, we still drilled down into each of the tags in the cloud.

taxonomy-image-7.jpg

The results of the drilldown for the Investing keyword reveal that the cloud’s tag is added to the Sentiment Analysis keyword, which makes the complete query look like “sentiment analysis” investing:

 taxonomy-image-8.jpg

Therefore, the results of such a combined query are probably more representative and reliable.

The top ten queries then are as follows:

  • opinion “sentiment analysis” – 42,606 results
  • technical and sentiment analysis “sentiment analysis” – 32,766 results
  • market “sentiment analysis” – 24,698 results
  • search and sentiment analysis “sentiment analysis” – 7,410 results
  • financial “sentiment analysis” – 4,035 results
  • blog “sentiment analysis” – 3,606 results
  • mining “sentiment analysis” – 2,762 results
  • investing “sentiment analysis” – 2,336 results
  • security “sentiment analysis” – 2,295 results
  • enterprise search “sentiment analysis” – 1,533 results

The top three queries return a significantly larger set of results than the other queries. However, “technical and sentiment analysis” is a technique used by stock traders and has nothing to do with the Opinion Mining area. Ditto for the “market”, “financial” and “investing” sub-queries. If we used them to create RSS feeds for our radar, it would cover both the Opinion Mining and Stock Trading areas, while we need just the former.

Thus, if we want to create a radar of even half-decent quality for this subject, we need to create feeds for the following key phrases:

  • opinion “sentiment analysis”
  • “search and sentiment analysis”
  • blog “sentiment analysis”
  • opinion mining “sentiment analysis” (adding “opinion” removes “Stock Trading” noise from this query)
  • homeland security “sentiment analysis” (adding “homeland” removes “Stock Trading” noise from this query)
  • enterprise search “sentiment analysis”
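The selection above can be sketched in a few lines of Python. This is only an illustration of the reasoning, under our own assumptions: the tag list is copied from the Clusty Cloud drilldown, while the noise set and the refinement table are hand-picked by us and not produced by any tool. For simplicity the sketch keeps the tag-plus-quoted-phrase form for every query.

```python
CORE_PHRASE = "sentiment analysis"

# Tags as they appeared in the cloud drilldown, largest result set first.
CLOUD_TAGS = [
    "opinion",
    "technical and sentiment analysis",
    "market",
    "search and sentiment analysis",
    "financial",
    "blog",
    "mining",
    "investing",
    "security",
    "enterprise search",
]

# Hand-picked: tags that pull in Stock Trading content.
NOISE_TAGS = {"technical and sentiment analysis", "market", "financial", "investing"}

# Hand-picked: tags that need an extra word to stay on topic.
REFINEMENTS = {"mining": "opinion mining", "security": "homeland security"}


def build_query(tag, phrase=CORE_PHRASE):
    """Combine a tag with the quoted core phrase, e.g. blog "sentiment analysis"."""
    return f'{tag} "{phrase}"'


def radar_queries(tags):
    """Drop noisy tags, refine ambiguous ones, and build the final queries."""
    kept = [REFINEMENTS.get(tag, tag) for tag in tags if tag not in NOISE_TAGS]
    return [build_query(tag) for tag in kept]


if __name__ == "__main__":
    for query in radar_queries(CLOUD_TAGS):
        print(query)
```

Keeping the noise set and the refinements as plain data makes it easy to revisit the decision later, for example if a Stock Market sentiment radar is wanted after all.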

taxonomy-image-9.jpg

We can also drill down into opinion “sentiment analysis” and see what is beneath, but it may be a bit of an overkill:

taxonomy-image-10.jpg

Thus the graph of keywords/key phrases making up the area becomes as follows:

taxonomy-image-11.jpg

Clusty is not the only search engine that provides search clustering and thereby invaluable taxonomies/ontologies for Newsmasters. Such household names as Live.com and Ask.com also provide it.

We tried to run the “sentiment analysis” search on Live.com (formerly known as MSN). However, the search engine apparently didn’t cluster the search results:

taxonomy-image-12.jpg

We were luckier running the same query using Ask (formerly known as Ask Jeeves):

taxonomy-image-13.jpg

However, the cluster generated by the engine was either overwhelmed by the Stock Market Sentiment Analysis or merely diluted by other Sentiment terms.

We also tested the clustered search feature using the “honduras real estate” query (we have a News Radar with a similar name) and achieved excellent results using Live.com:

taxonomy-image-14.jpg 

and not so using Ask.com:

taxonomy-image-15.jpg

As you can see, the clustered search offered by Live and Ask should be considered on a case-by-case basis, but certainly should not be ignored.

What other tools are available to create taxonomies/ontologies for News Radars?

1) Google Sets – Google Sets attempts to make a list of items when the user enters a few examples. For example, entering “Green, Purple, and Red” produces the list “Green, Purple, Red, Blue, Black, White, Yellow, Orange, and Brown”.

taxonomy-image-16.jpg

Below is what we got by creating a small set of similar terms. Not bad: the arnaud fischer term was our catch!

taxonomy-image-17.jpg

The larger set was more diluted:

Predicted Items

  • sentiment analysis
  • opinion mining
  • social search
  • search engine
  • information integration
  • signature file
  • streaming
  • sql
  • server
  • spam
  • site
  • ssl certificate
  • smtp
  • text mining
  • steve ballmer
  • summary
  • introduction
  • semantic annotation
  • arnaud fischer
  • project scheduling
  • budget
  • site selection
  • admin
  • web tv
  • ontology population

2) Google Suggest uses auto-complete while you type to offer popular searches. The catch was not very impressive:

taxonomy-image-19.jpg

3) The Yahoo and Ask.com suggestion features are similar to Google Suggest but even less impressive for our task:

taxonomy-image-20.jpg

taxonomy-image-21.jpg

4) Google Keyword Tool

Use the Keyword Tool to get new keyword ideas. Pick one of the tabs below and enter keywords or URLs that are relevant to your business. The tool was biased toward the Stock Market’s “sentiment analysis”:

taxonomy-image-22.jpg

                 

5) SEO Book Keyword Tool:

It could not locate anything for “Sentiment Analysis”, but for the more trivial “Honduras Real Estate” it produced quite an impressive list of suggestions:

taxonomy-image-23.jpg

6) Free version of the Wordtracker tool. It couldn’t find anything for “Sentiment Analysis” and was not that spectacular for “Honduras Real Estate”:

taxonomy-image-24.jpg

7) Amazon Data Mining Stats

Amazon.com is a hidden treasure trove for Taxonomy “hunters”! This is not surprising, considering that Amazon is likely the largest public data warehouse on Earth. They quietly digitize the contents of all the books they sell and, therefore, sit on a huge stockpile of data. With Jeff Bezos, a former Wall Street analyst, being so obsessed with numbers and stats, Amazon has done a formidable job generating various stats (even though some of them, such as Text Stats, seem a bit far-fetched and not very applicable). We admired this exercise so much that we even dedicated a News Radar to Amazon Data Mining.

Anyway, how can Newsmasters benefit from the Amazon data processing frills?

We explored the Mining the Web subject using Amazon.com, and the most popular/relevant book was Mining the Web: Discovering Knowledge from Hypertext Data (Hardcover) by Soumen Chakrabarti:

taxonomy-image-25.jpg

We drilled down into the data about the book:

taxonomy-image-26.jpg

And quickly hit the jackpot: Key Phrases – Statistically Improbable Phrases (SIPs):

taxonomy-image-27.jpg

This is how Amazon describes the feature:

Amazon.com Statistically Improbable Phrases

Amazon.com’s Statistically Improbable Phrases, or “SIPs”, are the most distinctive phrases in the text of books in the Search Inside!™ program. To identify SIPs, our computers scan the text of all books in the Search Inside! program. If they find a phrase that occurs a large number of times in a particular book relative to all Search Inside! books, that phrase is a SIP in that book.

SIPs are not necessarily improbable within a particular book, but they are improbable relative to all books in Search Inside!. For example, most SIPs for a book on taxes are tax related. But because we display SIPs in order of their improbability score, the first SIPs will be on tax topics that this book mentions more often than other tax books. For works of fiction, SIPs tend to be distinctive word combinations that often hint at important plot elements.

Click on a SIP to view a list of books in which the phrase occurs. You can also view a list of references to the phrase in each book. Learn more about the phrase by clicking on the A9.com search link.

Have some ideas for improving this feature? Please send your feedback to sitb-feedback@amazon.com (subject: Statistically Improbable Phrases).

The SIPs can be used as taxonomy substitutes because they represent, to some extent, sub-categories of the Mining the Web subject (assuming that the book we used as the example sufficiently represents the subject).

Key Phrases – Capitalized Phrases (CAPs) can also be used for composing key phrases for News Radars, and it seems that they represent more aggregated categories of the Mining the Web domain.
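Amazon does not publish how the improbability score is computed, so the following Python sketch is only our guess at the general idea from the description above: rank two-word phrases by how much more often they occur in one book than in a reference collection. The function names, the add-one smoothing and the `min_count` threshold are our own assumptions, not Amazon's method.

```python
import re
from collections import Counter


def bigrams(text):
    """Split text into lowercase words and return adjacent word pairs."""
    words = re.findall(r"[a-z]+", text.lower())
    return [" ".join(pair) for pair in zip(words, words[1:])]


def improbable_phrases(book, corpus, top=5, min_count=2):
    """Rank the book's bigrams by (rate in book) / (rate in corpus).

    `corpus` is a list of other texts used as the reference collection.
    """
    book_counts = Counter(bigrams(book))
    corpus_counts = Counter()
    for text in corpus:
        corpus_counts.update(bigrams(text))

    book_total = sum(book_counts.values()) or 1
    corpus_total = sum(corpus_counts.values())

    scores = {}
    for phrase, count in book_counts.items():
        if count < min_count:
            continue  # ignore phrases that occur only once in the book
        book_rate = count / book_total
        # Add-one smoothing keeps phrases absent from the corpus finite.
        corpus_rate = (corpus_counts[phrase] + 1) / (corpus_total + 1)
        scores[phrase] = book_rate / corpus_rate
    return sorted(scores, key=scores.get, reverse=True)[:top]


if __name__ == "__main__":
    book = ("Focused crawling finds pages. Focused crawling uses a classifier. "
            "The web is big. The web is large.")
    corpus = ["The web is a network. The web is everywhere.",
              "The cat sat. The web is slow."]
    print(improbable_phrases(book, corpus, top=3))
```

In the toy run, “focused crawling” comes out on top: “the web” is just as frequent in the book, but it is also common in the reference texts, which is exactly the distinction the SIP description draws.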

The next jackpot is Concordance, which is Amazon’s name for the ubiquitous Tag Cloud:

taxonomy-image-28.jpg

As with Tag Clouds, the larger and bolder the font of a word in the cloud, the greater its weight. The only downside of the Amazon concordance is that it doesn’t index bi- and trigrams (key phrases consisting of two and three words respectively), as a lot of sites, including RadarFarms.com, do. This reduces the value of the concordance as a source of taxonomy for Newsmasters.
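For readers who want to see what indexing bi- and trigrams amounts to, here is a minimal Python sketch. It is a generic n-gram counter written for this article, not the indexer used by Amazon or by RadarFarms.com, and the tokenizing rule (lowercase letters, digits and apostrophes) is an assumption.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count every run of `n` consecutive words in `text`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    # zip over n shifted copies of the word list yields the n-word windows
    windows = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(window) for window in windows)


if __name__ == "__main__":
    sample = "web data mining and web data extraction"
    for n in (1, 2, 3):
        print(n, ngram_counts(sample, n).most_common(3))
```

With single words, “web” and “data” look like two separate tags; the bigram count shows that they actually travel together as “web data”, which is the kind of key phrase a radar query needs.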

Conclusions and Recommendations:

1. Development of deep and robust “industry-quality” News Radars (or RSS Mashups) for a broad and/or complex subject is impossible without knowledge of the subject’s taxonomy/ontology. The best proof of this statement is that if we used the “Sentiment Analysis” keyword without researching its taxonomy, we would end up with a News Radar covering both the Stock Market Sentiment and Opinion Mining domains.

2. There are several sources of Taxonomies on the Web (the most notable is Taxonomy Warehouse); however, taxonomy resources are still scarce and scattered across the Web.

3. There are several free online tools which can be used for generating taxonomies. The best tool is Clusty Clouds.

4. Other tools such as Google Sets, Live Related Searches and Amazon SIPs, CAPs and Concordance can also be used.


Newsmastering Tips and Hacks – Part I – Keyword RSS Generators

June 13, 2007 at 3:53 am | Posted in Unscientific Research | 1 Comment

This article is the first in a series we have contemplated for a while, in order to share our Newsmastering experiences and lessons learned and, hopefully, trigger a discussion amongst a small but growing community of Newsmasters.

We have already published some tidbits on this very broad subject (Newsmastering Hacks – How to generate RSS feeds from MSN.com, and others at the currently defunct Radar Forums); however, they are not structured and were written mostly in response to challenges we encountered with some sources of RSS feeds.

Newsmastering can be defined as the ability to define, locate, select, aggregate, filter, process and publish Web-based information on a specific subject. RSS Newsmastering utilizes Really Simple Syndication (RSS) feeds, which include brief summaries of web and news articles, blog and user group postings, images, video and audio, message board/forum threads, job postings etc.

These pieces can be composed into specific information channels, which have a variety of names: Feed Digests, Lenses, FeedBots, and Bees etc. We call them News Radars.

News Radar applications are multiple; to name just a few:

1) Web Buzz and News Monitoring

2) Competitive Intelligence and Market Research

3) Conversational Marketing

4) Web Site Dynamic Content (Widgets and Mini-Sites)

5) Web Knowledge Bases

6) Job Search Engines

7) Internet TV Channels

8) Search Engine Optimization (SEO)

Information aggregation, filtering, processing and publishing require Newsmastering tools, and these processes can be mostly automated once they are set up by Newsmasters. However, the definition, location and selection of information for News Radars is the one area where most of the human skill is required.

According to The Birth of the NewsMaster, newsmastering is

“the ability to concert, orchestrate, edit, and refine quality search formulas that tap into the whole RSS universe and beyond, and that filter out relevant content based on selected keywords, sources, type of content, ranking and many other possible criteria”. Very well said!

It seems that precisely this ability to concert, orchestrate, edit and refine Web search queries and to define/locate/select the relevant content for Web Channels will be the set of skills that sets a great Newsmaster apart from the rest of the pack, and which will be sought after in years to come, when the need for processing overabundant “raw” information from the Web is better understood.

Where can you find decent RSS feeds for your Channel? In a nutshell, today there are only three (3) ways to do it:

1) Generate feeds using search queries against multiple web, blog, news, user group, message board and multimedia search engines (e.g., MSN/Live.com, Technorati, Google News, Google Groups and Video, Boardtracker, Blogdigger) and social networking, bookmarking and publishing sites (del.icio.us, Digg, YouTube, Flickr etc.)

2) Locate “native” RSS feeds available at sites or blogs of your interest

3) Scrape those Web sites or blogs, which do not have the RSS feeds

We will cover all of the above ways in our series; however, we would like to start with keyword-based RSS feed generation, which, at this moment, is the prevalent mode of web content acquisition despite the growth in the number of “native” RSS feeds.

We will also skip for now the Channel’s subject definition (or refining), which is the initial step in creating a professional, constantly updated Web channel or “News Radar”. This step is so important, albeit perhaps a bit far-fetched, that it requires a separate article; therefore, we consciously decided not to follow the Newsmastering “lifecycle” and instead start with generating the content, and then move on to the subject’s definition/refining.

Most leading search engines provide the RSS output in one form or another although some prominent Web sites have somewhat limited RSS output capabilities (Amazon, del.icio.us, MySpace, Flickr).

However, Google, Yahoo, Live and Ask web search do not provide RSS feeds. There are some hacks and tools that provide the feeds through the search engines’ APIs. For instance, Ben Hammersley describes how you can generate feeds for Google searches (Ben Hammersley – Google to RSS).

TagJag provides the Yahoo and Ask feeds, while Kebberfegg enables access to Live.com (formerly MSN) feeds.

Kebberfegg

The oldest keyword-based RSS feed generator is Kebberfegg, released at the end of 2005 by the Web search and SEO pundit Tara Calishain. Kebberfegg stands for Keyword Based RSS Feed Generator.

That’s how Tara describes her tool:

Kebberfegg is a tool to help you generate large sets of keyword-based RSS feeds at one time. It gives you one place to set up as many as 64 (?) keyword-based RSS feeds at a time, in yummy HTML or OPML flavors.

Keyword-based feeds are great because they can save you a lot of time by automatically updating search results and sending them to your RSS feed reader. But it can take a lot of time to set up all the keyword-based feeds you might want to use across several different resources.

Enter your query in the box below. Underneath that you’ll have the option to choose for what categories you would like to generate feeds — you may wish to search only blogs, for example, or only news sites. You may choose multiple categories — use your CTRL key to select more than one category.

Beneath that you’ll have the choice to generate an OPML file containing your newly-generated feeds, or an HTML list with a link to the main site, a plain RSS link, and a direct link to add the feed to My Yahoo.
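For readers unfamiliar with OPML, the file in question is simply an XML outline listing feed URLs. The Python sketch below shows roughly what producing one involves; it is our own illustration, not Kebberfegg's code. The feed URLs are made-up example.com placeholders, and the attribute set (`type`, `text`, `xmlUrl`) follows common feed-reader practice rather than anything stated by the tool.

```python
import xml.etree.ElementTree as ET


def build_opml(title, feeds):
    """Return an OPML document (as a string) listing the given feeds.

    `feeds` is a list of (name, feed_url) pairs.
    """
    opml = ET.Element("opml", version="2.0")
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(opml, "body")
    for name, url in feeds:
        # One <outline> element per feed; special characters in the URL
        # (such as &) are escaped automatically on serialization.
        ET.SubElement(body, "outline", type="rss", text=name, xmlUrl=url)
    return ET.tostring(opml, encoding="unicode")


# Placeholder feeds for illustration only.
SAMPLE_FEEDS = [
    ("News search", "http://example.com/news?q=granada+real+estate&format=rss"),
    ("Blog search", "http://example.com/blogs?q=granada+real+estate&format=rss"),
]

if __name__ == "__main__":
    print(build_opml("Granada real estate", SAMPLE_FEEDS))
```

A feed reader that imports this file should offer both feeds as subscriptions in one step, which is the whole point of the OPML option.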

Using the tool is quite simple:

  • First, enter your keyword strategy into the search box.
  • For this demo, we will use the query “Granada real estate.”

thumb1.jpg

  • Then, we select a category or categories of feed sources we are interested in receiving.
  • We decided to select the following categories: Multimedia and News and News Search Engines.
  • You can choose the output as plain RSS links on an HTML page or as an OPML file that you can import into a feed reader.
  • We chose the HTML option and now have a page of direct RSS links (preconfigured for our search query), ready to be added to our aggregator.
  • You’ll see an XML button, or buttons to add the feed directly to My Yahoo, Bloglines or MultiRSS, or to get the feed via e-mail.

thumb2.jpg

If we choose the OPML option, we get an OPML file through the site’s URL:

thumb3.jpg

However, the file can’t be recognized by either FeedDemon or our engine:

thumb4.jpg

Overall, the tool is quite reliable, and has not changed much since we discovered it last year; it seems that Tara just added several sources since then.

Keotag

Keotag (just recently out of beta) describes itself as “tag search multiple engines, tag generator and social bookmark links generator”. The tool is a child of Netwizz.net, a Web and Blog Agency from New Caledonia.

The tool has an attractive and slick user interface:

thumb5.jpg

This is how the above-mentioned Tara Calishain describes using this tool:

Go to Keotag and enter a query; simpler ones are better because you are searching tag sites. Keotag refreshes with a Technorati graph showing you how active that query has been (sometimes the graph is blank) and a set of icons. If you mouseover the icons you’ll get a popup showing you with what search engines they’re affiliated. Many of the search engines will be familiar to you — Technorati, Yahoo, Feedster, Google blog search, etc.

Click on an icon and a preview window will open showing you a summary of the content from that site. For example, click on the Technorati icon and you’ll get a list of Technorati’s tag search result for your query. (Note that the tag search is NOT the blog search. The tag search is in my experience more limited in its results.) You won’t get any additional information that you might get from the search result at the site, like index date, source, or page size. It’s just like getting a headline-only RSS feed. The results from a site also include an XML link to get an RSS feed for that specific query from the site.

Despite Tara’s recommendations, we tried a more complicated search query, “Granada real estate”, and were lucky with Google Blog Search, IceRocket, Live Search (added just lately), Blogdigger, Blogpulse, Yahoo etc.

thumb6.jpg

We are not sure that the above sites all use tags, but it doesn’t really matter as long as one can generate a feed of decent quality.

The search output is in RSS and OPML format:

thumb7.jpg

Clicking on the RSS icon brings up a regular RSS feed with a URL that can be copied and pasted into a feed reader of your choice, while clicking on the OPML sign generates an OPML file, which can be saved on your hard drive:

thumb8.jpg

The OPML file works very well with FeedDemon and other RSS readers.

thumb9.jpg

Generally, the tool is fun to use, but it only provides feeds from 18 sources: search and blog engines, and social sites. It doesn’t have the flexibility of Kebberfegg, in the sense that one cannot choose the source type (e.g., multimedia or blogs).

TagJag

TagJag (formerly Gada.be) is our “weapon of choice”, even though we have never seen a tool so volatile in terms of quality and usability.

This is how Pete Cashmore describes Gada.be on his Mashable blog:

Gada.be launched in October 2005 as an RSS search engine – it brings together results from Feedster, Technorati, Sphere, Tailrank, YouTube and hundreds of other services. It’s also easily accessible from mobile devices. Gada.be’s biggest drawbacks were always the name (great for mobiles, but not great for computer users) and the incomprehensible design – with these issues fixed, it could be a great service.

TagJag is unique in generating and discovering RSS feeds that could not be generated and/or discovered otherwise.

The feeds are categorized quite well by their type – for instance, Blogs, Discovery, Jobs, Multimedia, News, etc.

thumb10.jpg

You type in a specific keyword or key phrase, choose the feed category and press the “Find It” button. The feed icons appear; however, a feed cannot be previewed without clicking its icon (this feature was available before; see our TagJag New Interface posting):

thumb11.jpg

The OPML feature, which did not work until recently, produces a clean file that the FeedDemon RSS reader renders nicely:

thumb12.jpg

thumb13.jpg

It seems that Chris Pirillo, the man behind TagJag, is perpetually looking for new ways to improve the tool, which sometimes makes it flaky.

Regardless, we are heavy users of TagJag and, in our opinion, it is the best tool of its kind, which is reflected in its Alexa ranking compared with Keotag and Kebberfegg:

thumb14.jpg

Now enter a newcomer: Radar Wizard. It is our tool (RadarFarms.com) and, by definition, we cannot be objective. It still has modest capabilities compared to the aforementioned heavyweights (we only generate seven feeds); however, it allows creating, loading and hosting News Radars (Feed Digests) at our site.

thumb15.jpg

All you need to do is specify a keyword or key phrase that represents your Radar’s subject, choose a Feed Type (News, Blogs etc.) or All, generate and preview the feeds, select or unselect them, and then publish your Radar.

We are working on increasing the number of feeds that can be generated for a specified keyword or key phrase, and on other unique features, which will be delivered shortly.

Read more about Radar Wizard at: http://www.radarfarms.com/rw_intro.php

Web Intelligence for Masses – Satellite Radio Wars

August 25, 2006 at 5:57 pm | Posted in Unscientific Research

This blog is inspired by the Business Week article “Grudge Match”.

Satellite radio is slowly but steadily coming into our cars, our family rooms, our cell phones and PDAs, and our teens’ portable music players. Some people even think that it is the best thing since sliced bread. We do not know about that, but it is really cool. You can choose a genre of music you are into, or sports, and just soak in it, and it is all without damn commercials.

The satellite radio jukeboxes threaten the traditional distribution of music through CDs or DVDs even more than all the Napsters and Kazaas combined because they are mainstream and perfectly legal. The music recording industry has already started taking notice and will eventually launch some sort of war, asking for more royalty fees that either eat into the radio providers’ profits or push up user fees. However, for now, it is not enemy #1.

The North American continent is a battlefield between two satellite giants: XM and Sirius (WorldSpace enjoys the European/Asian monopoly for now). Signs of the War are everywhere. They are fighting for retail customers and sales clerks (walk into The Source by Circuit City in Canada and they don’t even want to talk about XM, just Sirius), for carmakers (XM signed up GM, VW and all the Asian car manufacturers, while Sirius successfully courted BMW, Daimler-Chrysler and Ford), for radio anchors (does Howard Stern’s name ring a bell?), and for Wall Street analysts, media and policymakers.

Consumers have not seen such a fierce duopoly war since Pepsi vs. Coca-Cola in the soft drink market. Any war needs good intelligence; economic wars need Competitive Intelligence (CI).

We strongly felt that Newsmastering combined with Data Mining could deliver the best online CI solution. This Radar was supposed to serve as a prototype of online CI services. We would monitor and analyze all news and pundits’ and consumers’ opinions about XM and Sirius, and keep you posted on who is winning this Satellite War.

Well, after almost 2 months, do we know who is winning the hearts and purses of the satellite radio neophytes? Sort of…

As of mid-August, searching the XM vs. Sirius radar returned 2595 results for XM and 2468 results for Sirius. Since the radar’s inception, its visitors have searched 311 times for “XM” or “XM Radio” and just 201 times for “Sirius”, “Sirius Radio” or “Sirius Satellite Radio”. The difference is substantial enough to suggest a clear winner.

However, our Web analytics do not confirm such a big spread. Visitors were referred to our radar 308 times when they typed the “XM” keyword and 301 times when they were looking for “Sirius”.
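To put the spread into proportion, the counts quoted above can be turned into percentage shares. The short Python sketch below does only that; the labels and the helper name `share` are ours and purely illustrative.

```python
def share(a, b):
    """Percentage of a in the total a + b, rounded to one decimal."""
    return round(100.0 * a / (a + b), 1)


# (XM, Sirius) counts as quoted in the text above.
counts = {
    "results returned": (2595, 2468),
    "visitor searches": (311, 201),
    "search referrals": (308, 301),
}

for label, (xm, sirius) in counts.items():
    print(f"{label}: XM {share(xm, sirius)}% vs. Sirius {share(sirius, xm)}%")
```

Only the visitor searches show a wide gap (roughly 61% to 39%); the other two measures are close to an even split.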

We decided to check what the authorities on Web and Blogosphere trendspotting have to say about this epic battle of our time.

Google Trends indicates that XM (in red on the diagram below) is more popular on the Web than Sirius (in blue).

However, the regional trends vary. XM is more popular in the US but Sirius prevails in Europe and in Canada.

 

As proud Canadian owners of a Sirius Starmate, we can confirm the latter.

The Blogosphere begs to differ, though. Below are the trend results from BlogPulse:

 

Over the last 6 months, Sirius generated more buzz than XM on all but three days.

As you can see, the Satellite War is still going on. We will keep you posted.

How to Measure Quality of News Radars

February 14, 2006 at 9:16 pm | Posted in Unscientific Research

News Radars are topic-specific, high-focus information channels that syndicate RSS feeds dedicated to the same subject (e.g., “accrual accounting”, “Pink Floyd” or “Boston Red Sox”).

A News Radar is a constantly updated thematic channel of highly relevant web references that are gathered in accordance with single or multiple specific, persistent search criteria. Radars can focus on anything: topics, people, opinions, products, news items, events or passions. The constant updating of the channel is accomplished by leveraging the RSS technology to its full power.

News Radars are also called Feed Digests, Feed Channels, Newspages etc., depending on the RSS aggregator vendor or Web 2.0 site.

Creating News Radars requires tools for presenting and administering the RSS feeds that make up a Radar, as well as Newsmastering.

Robin Good, aka Luigi Canali De Rossi, who runs the Master New Media web site that introduced the “News Radars” approach, defines “Newsmaster” as:

The newsmaster is an individual capable of personally crafting RSS-based specialized information channels by utilizing technologies that allow him/her to select, aggregate, filter, exclude and identify quality news, information, content, tools and resources from the whole universe of content, news and information available on the Internet.
 
Newsmastering is the ability of a human being to concert, orchestrate, edit, and refine quality search formulas that tap into the whole Internet content universe and beyond, and that filter out relevant information through selected keywords, source selection, ranking, heuristics, and many other possible criteria.

Currently, Newsmastering is more an art than a science, and you can hear it first-hand. ITDynamo has created and published seven (7) News Radars, and more than a dozen others run in our testing and R & D environments.

How does a Newsmaster evaluate the quality of her/his effort? How can we tell a Radar with relevant, focused and still deep coverage of its subject from a wide stream of news, articles and blogs where we have more misses than hits on the subject of our interest?

We also need to take into consideration that Radars can be used as News Pages/News Displays with a constantly updating stream of information from the Web, or they can be utilized as external Knowledge Bases where the information does not disappear after being removed from the display but is rather archived and categorized. In the latter case, the difference between a high-quality and a “so-so” News Radar may be even more visible.

We researched quite a few authoritative Web resources on newsmastering/feed aggregation, trying to find out how others test, validate and evaluate their Radars and, believe it or not, came out empty-handed.

Most of the sources, including Robin Good’s very comprehensive Mini Guide “The RSS NewsMaster’s Toolkit And The Creation Of RSS Information Radars. Automatic Filtering And Aggregation Of Online Content Via RSS. How To Create, Publish And Promote Topic-Specific Information Channels And Niche Sites”, describe how to create Radars but do not provide recommendations on how to measure the newsmastering results.

There should be ways to see the Radar’s output, we thought, and there were: Tags. Wikipedia describes Tags as pieces of information separate from, but related to, an object. In the practice of collaborative categorization using freely chosen keywords, tags are descriptors that individuals assign to objects.

Tags can be used to specify properties of an object that are not obvious from the object itself. They can then be used to find objects with some desired set of properties, or to organize objects. These features are exploited extensively in social software and folksonomies.

Tags have been around for a long while, but Web 2.0 brought them a lot of popularity. Tags assigned to information objects by the same person who created the content can be seen in places like Technorati, where blog posts are aggregated along with their tags. Reader tags, on the other hand, are created by anyone else and so might be closer to annotation systems. You can see reader tags on sites like del.icio.us that allow anyone to tag any document. Services like Flickr combine both into one mix, allowing both the author and (some) readers to add tags (Hellonline. Eran’s Blog).

However, with all due respect to human beings tagging their own or somebody else’s bits and pieces of information, we needed automated services that tag and index the feeds’ output and display the tags as Tag Clouds. This way, the tags can be considered the Radar’s output categories or clusters.

Some sites use their own proprietary technologies (ITD is also working on a proprietary technology) or utilize free third-party services such as TagCloud or Wanabo.

We used TagCloud.com. TagCloud, created by John Herren, is an automated Folksonomy tool. Essentially, TagCloud searches any number of RSS feeds you specify, extracts keywords from the content and lists them according to prevalence within the RSS feeds. Clicking on the tag’s link will display a list of all the article abstracts associated with that keyword. TagCloud lets you create and manage clouds with content you are interested in, and lets you publish them on your own website.
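To give a feel for the idea, here is a rough Python sketch of keyword extraction by prevalence. It is not TagCloud’s actual algorithm, which we do not know; the stop-word list, the sample item titles and the function names are our own assumptions for illustration.

```python
import re
from collections import Counter

# A tiny, hand-picked stop-word list (an assumption, not TagCloud's).
STOPWORDS = {"the", "a", "an", "and", "of", "in", "for", "to", "is", "on", "with"}


def words_in(text):
    """Lowercase words of a text, without stop words and very short words."""
    return [
        w for w in re.findall(r"[a-z]+", text.lower())
        if w not in STOPWORDS and len(w) > 2
    ]


def tag_counts(texts, top=5):
    """Most frequent keywords across all texts, as (tag, count) pairs."""
    counter = Counter()
    for text in texts:
        counter.update(words_in(text))
    return counter.most_common(top)


def articles_for(tag, texts):
    """Texts that contain the given tag, as when clicking a tag link."""
    return [text for text in texts if tag in words_in(text)]


# Hypothetical feed item titles.
items = [
    "RSS feeds in the enterprise",
    "Enterprise RSS readers compared",
    "Tagging RSS items for search",
]

print(tag_counts(items, top=2))
print(articles_for("enterprise", items))
```

A real service would work on full article text and a much larger stop-word list, but the principle of counting and ranking keywords is the same.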

This is how the Tag Cloud looks for our RSS in Enterprise news radar:

What attracted us to TagCloud is its ability to generate Tag Clouds in an XML format with up to 250 tags in the file. For example, the XML file for the above radar of ours is http://www.tagcloud.com/cloud/xml/ERSS/default/250.

The file includes attributes for the tags:

If this file is opened in MS Excel and sorted in descending order, it looks like this:

Most importantly, this tag soup can now be processed, and some conclusions can be drawn from a statistical distribution analysis.

We chose linear regression analysis to analyze the statistical distribution. We tried to visualize what the perfect distribution would look like and came to the following numbers:

# of Categories    Scale
1                  9
2                  8
4                  7
8                  6
16                 5
32                 4
64                 3
128                2
256                1

The linear regression parameters then would be as follows:

Slope: -0.0264

Intercept: 6.5

R-Squared: 0.679.
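These parameters can be reproduced with an ordinary least-squares fit of Scale against the number of Categories. Below is a minimal Python sketch using only the standard library; the function name `linear_fit` is ours, and this is not necessarily how the original numbers were produced (a spreadsheet would give the same result).

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# The "perfect distribution" from the table above.
categories = [1, 2, 4, 8, 16, 32, 64, 128, 256]
scale = [9, 8, 7, 6, 5, 4, 3, 2, 1]

slope, intercept, r_squared = linear_fit(categories, scale)
print(f"Slope: {slope:.4f}")
print(f"Intercept: {intercept:.1f}")
print(f"R-Squared: {r_squared:.3f}")
```

The same function can be applied to any radar once its tags are reduced to pairs of category count and scale.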

Here is some interpretation of the above statistical parameters.

In theory, we should not even try to interpret the Intercept value because, in the traditional sense, it is the value of Y (Scale) when the number of Categories (tags) equals zero. However, it could represent something like the average Radar strength. Slope makes more sense because it is the estimated average change in Scale when the number of Categories increases by one. If the line is steeper, that is, the absolute value of Slope is larger, then Scale (Radar strength) drops more quickly. In plain English, it means that we have a few strong categories (“signal”) and then a lot of “noise” or, speaking more statistically, the “long tail” just becomes longer.

R-Squared (the coefficient of determination) describes how closely the tags follow the fitted line, which we read as the “normality” or consistency of the distribution, while Slope is also a measure of strength and a more accurate one than Intercept.

For our study we used three (3) Radars we created. Two of them, RSS in Enterprise and Newsmastering and Newsradars, are published on our site. The third, ITDynamo/JobExposer Web Buzz, is running in our test environment.

The RSS in Enterprise radar covers the broad area of RSS application as an enterprise tool. The Newsmastering and Newsradars radar can be considered a sub-set of the first domain, while ITDynamo/JobExposer Web Buzz focuses only on monitoring the Web’s response to our existence and activities.

Thus, one can expect that the first radar will be broader and more “diluted”, with fewer strong categories, and the last will be narrower and stronger, with a larger number of strong categories, with the Newsmastering/Newsradars radar somewhere in between.

Tag Clouds in HTML and XML format on February 12, 2006 (the tag clouds may change over some time) for the above radars are as follows:

http://www.tagcloud.com/cloud/html/ERSS/default/250 and http://www.tagcloud.com/cloud/xml/ERSS/default/250 – for the RSS in Enterprise radar

http://www.tagcloud.com/cloud/html/NewsRadars/default/250 and http://www.tagcloud.com/cloud/xml/NewsRadars/default/250 – for the Newsmastering and Newsradars radar

http://www.tagcloud.com/cloud/html/ITDynamo/default/250 and http://www.tagcloud.com/cloud/xml/ITDynamo/default/250 – for the ITDynamo/JobExposer Web Buzz radar

We processed the Radars’ tag XML files using linear regression analysis, and the results were as follows:

 

 

The results of the linear regression analysis of the Radar’s tags (categories) are not always consistent with our expectations.  

We can take the ITDynamo/JobExposer Web Buzz radar tags as an example. They seem to be more “normally” distributed and more consistent than those of the other two radars (R-Squared of 0.45 vs. 0.37 and 0.32 for the Newsmastering/Newsradars and RSS in Enterprise radars, respectively). The radar’s Intercept is also larger: 5.87 (the Newsmastering/Newsradars and RSS in Enterprise radars’ values are 5.5 and 5.32, respectively).

However, surprisingly, the absolute value of Slope for the Web Buzz radar (0.31) is larger than for the other radars, which are in the range of 0.2-0.21. It means that while the Web Buzz radar has stronger “signal” categories, it also has a longer “tail” (“noise”).

Therefore, as you can see, linear regression analysis of the Tag Clouds for News Radars may serve as some indicator of a radar’s quality; however, it is still very much a work in progress and requires more study (as you know, the larger the results array, the more accurate the statistics).

Relevancy of the Radar’s output to the Radar’s subject cannot be measured through the above method.

It would be interesting to see some discussion on this subject because we strongly believe that if you cannot measure the quality of Web information channels, you cannot effectively manage them.
