From Corpora to Matching

Making effective use of the Internet is increasingly about creating better and more intelligent applications and search engines. Here is a brief introduction to how search engines work:

01) Define the corpus, search space/data;
02) Separate the corpus into documents;
03) Generate features for each document;
04) Generate a representation of each document;
05) Study the feature/vector space;
06) Cluster documents;
07) Reduce dimensionality;
08) Accept input queries;
09) Find the cosine angles against the query vector;
10) Find the sought vector column;
11) Output results to the user in some way.

Each document in a corpus (database) is described by a set of keywords called index terms. We assign weights to index terms according to their relevance (frequency of occurrence, for instance). This is how we create the index that we can then search.

Corpus preparation:
Web pages of interest are analysed and cleaned by removing hypertext tags and any other markup. Pages are then broken down into documents, and each document is scanned for words/terms of interest: those which make a document unique, not standard words.
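The cleaning step can be sketched with Python's standard library alone. The stop-word list, the sample page and the names TextExtractor and clean_page are assumptions made for illustration; a real crawler would need a much longer word list and more careful handling of scripts and styles.

```python
import re
from html.parser import HTMLParser

# Illustrative stop-word list (an assumption); real lists are far longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "to", "on"}


class TextExtractor(HTMLParser):
    """Collects the text found between tags and discards the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def clean_page(html):
    """Return the lower-cased words of a page, minus tags and stop words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


page = "<html><body><h1>The River Wandle</h1><p>Mills on the river.</p></body></html>"
terms = clean_page(page)
```

Here terms holds only the words that survive the clean-up, ready to be counted.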

Extract terms of interest:
Bear in mind that terms of interest must be invariant, that is, characteristic of a document rather than generic and easy to find in any corpus/document. The idea is to find a signature for each document.
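One common heuristic for discarding generic terms is to drop any term that appears in too large a share of the documents. The sketch below assumes that approach; the function name select_terms and the 50% threshold are illustrative choices, not something the method prescribes.

```python
def select_terms(documents, max_fraction=0.5):
    """Keep terms that occur in at most max_fraction of the documents."""
    doc_count = len(documents)
    doc_freq = {}
    for words in documents:
        # set() so that a term is counted once per document
        for term in set(words):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(t for t, n in doc_freq.items() if n / doc_count <= max_fraction)


documents = [
    ["river", "mill", "history"],
    ["river", "museum", "history"],
    ["river", "search", "engine"],
    ["river", "search", "index"],
]
selected = select_terms(documents)
```

In this toy corpus "river" occurs in every document, so it says nothing about any one of them and is left out.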

Build term-by-document matrix:
The search space is defined by N dimensions, one per chosen term. Each document is a point in this N-term space, placed according to the terms/features it contains; this allows conceptual/semantic searches.

Each document becomes a column vector and each row represents a term, so a row records the frequency of one term across the analysed corpus. At first we simply build the matrix by counting the terms in each document.
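A minimal sketch of the counting step, using plain Python lists. The tiny corpus and the name build_matrix are assumptions for illustration.

```python
def build_matrix(documents, terms):
    """Rows are terms, columns are documents, entries are raw counts."""
    return [[doc.count(term) for doc in documents] for term in terms]


documents = [
    ["river", "mill", "mill"],
    ["museum", "river"],
    ["search", "engine", "search"],
]
# Sorted vocabulary, so that row order is predictable
terms = sorted({word for doc in documents for word in doc})
matrix = build_matrix(documents, terms)
```

Row i of matrix belongs to terms[i], and column j belongs to documents[j].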

Compress the matrix:
There are two basic techniques/methods: Compressed Row Storage (scans the matrix row by row) and Compressed Column Storage (scans the matrix column by column). Both use three arrays: the non-zero values, their indices, and pointers to where each row or column starts.
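As a sketch of the row-wise variant, the function below builds the three arrays from a dense matrix. The names to_crs, values, col_index and row_ptr are illustrative; libraries that implement this format use their own naming.

```python
def to_crs(matrix):
    """Compress a dense matrix row by row into three arrays."""
    values = []      # the non-zero entries, in row order
    col_index = []   # the column of each non-zero entry
    row_ptr = [0]    # where each row starts inside values
    for row in matrix:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)
                col_index.append(j)
        row_ptr.append(len(values))
    return values, col_index, row_ptr


matrix = [
    [0, 0, 1],
    [2, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 2],
]
values, col_index, row_ptr = to_crs(matrix)
```

The non-zero entries of row i are values[row_ptr[i]:row_ptr[i + 1]]. The column-wise variant is the same idea with the roles of rows and columns swapped.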

Normalise the matrix:
Normalisation implies transforming column vectors to unit vectors, i.e. vectors of unit length.

Unit document vectors contain the frequency of terms; the normalisation is applied because the semantic content of a document is generally determined by the relative frequency of terms.
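A short sketch of column normalisation, assuming the usual Euclidean length. The name normalise_columns is illustrative.

```python
import math


def normalise_columns(matrix):
    """Scale every column (document) to Euclidean length 1."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        length = math.sqrt(sum(matrix[i][j] ** 2 for i in range(n_rows)))
        if length == 0:
            continue  # a document with no indexed terms stays a zero vector
        for i in range(n_rows):
            result[i][j] = matrix[i][j] / length
    return result


matrix = [[3, 0], [4, 2]]
unit = normalise_columns(matrix)
```

After this step a long document and a short one with the same mix of terms end up as the same vector.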

Singular Value Decomposition:
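This factorises a matrix into three matrices. Two hold the singular vectors, which define the new dimensions: they are the eigenvectors of the matrix multiplied by its own transpose (on one side or the other), and they are identical only when the original matrix is symmetric and positive semi-definite. The third is diagonal and holds the singular values, the square roots of the corresponding eigenvalues, that is, the spread of the corpus along these new dimensions.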
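In standard notation the decomposition reads as follows. The symbols are chosen here for illustration: A is the term-by-document matrix with m terms, n documents and rank r.

```latex
A = U \Sigma V^{\mathsf{T}}, \qquad
U^{\mathsf{T}} U = I, \quad V^{\mathsf{T}} V = I, \quad
\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_r), \quad
\sigma_1 \ge \cdots \ge \sigma_r > 0

A A^{\mathsf{T}} = U \Sigma^{2} U^{\mathsf{T}}, \qquad
A^{\mathsf{T}} A = V \Sigma^{2} V^{\mathsf{T}}
```

The second line shows where the eigenvectors and eigenvalues come from: the columns of U are eigenvectors of A A^T (directions in term space), the columns of V are eigenvectors of A^T A (directions in document space), and both share the eigenvalues sigma_i squared.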

A geometric interpretation:
The corpus is first formatted and stemmed, and is then stored in a compact term-by-document matrix. Each column of this matrix is then normalised, so that its entries give the relative frequency of terms in a document.

The term-by-document matrix is then decomposed to calculate eigenvalues and eigenvectors. The eigenvectors represent a new Cartesian coordinate frame spanning the same search space, but they indicate the most important dimensions/axes along which documents mainly lie. The eigenvalues quantify the spread of documents along these new axes.
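To make the idea of a "most important axis" concrete, the sketch below finds only the single dominant axis by power iteration, a simpler technique swapped in here for a full decomposition. The name dominant_axis, the iteration count and the toy matrix are assumptions for illustration; in practice a numerical library routine would be used instead.

```python
import math


def dominant_axis(matrix, iterations=200):
    """Power iteration on A * A^T.

    Returns (spread, axis): the largest singular value of the matrix and
    the unit vector in term space along which the documents spread most.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    axis = [1.0 / math.sqrt(n_rows)] * n_rows
    spread = 0.0
    for _ in range(iterations):
        # Project every document onto the current axis: A^T * axis
        proj = [sum(matrix[i][j] * axis[i] for i in range(n_rows))
                for j in range(n_cols)]
        # Map the projections back into term space: A * proj
        nxt = [sum(matrix[i][j] * proj[j] for j in range(n_cols))
               for i in range(n_rows)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        axis = [x / norm for x in nxt]
        # norm converges to the largest eigenvalue of A * A^T
        spread = math.sqrt(norm)
    return spread, axis


# Three terms (rows), three documents (columns)
matrix = [
    [2, 2, 0],
    [1, 1, 0],
    [0, 0, 1],
]
spread, axis = dominant_axis(matrix)
```

In this toy matrix the first two documents are identical and dominate the corpus, so the axis found points along their shared mix of the first two terms.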

Queries:
Queries must be based on defined features/terms within the term-by-document matrix. Matching in a vector space such as this is implemented by multiplying the query vector against the term-by-document matrix, i.e. matching a query vector q against the documents (columns) of the matrix.
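A minimal sketch of the matching step: the query is written as a vector over the same terms as the matrix rows, and each document is scored by the cosine of the angle between its column and the query. The name cosine_scores and the toy data are assumptions for illustration.

```python
import math


def cosine_scores(matrix, query):
    """Cosine of the angle between the query and each document column."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    q_len = math.sqrt(sum(x * x for x in query))
    scores = []
    for j in range(n_cols):
        column = [matrix[i][j] for i in range(n_rows)]
        d_len = math.sqrt(sum(x * x for x in column))
        dot = sum(q * d for q, d in zip(query, column))
        # Guard against empty queries or empty documents
        scores.append(dot / (q_len * d_len) if q_len and d_len else 0.0)
    return scores


terms = ["engine", "mill", "river", "search"]
matrix = [
    [0, 0, 1],
    [2, 0, 0],
    [1, 1, 0],
    [0, 0, 2],
]
query = [0, 0, 1, 0]  # a one-word query: "river"
scores = cosine_scores(matrix, query)
best = max(range(len(scores)), key=scores.__getitem__)
```

The document with the highest score is the column sought. If the columns and the query are already unit vectors, the division can be skipped and the scores are simply the query multiplied against the matrix.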

I am the website administrator of the Wandle industrial museum (http://www.wandle.org), established in 1983 by local people determined to ensure that the history of the valley was no longer neglected, and to enhance awareness of its heritage for the use and benefit of the community.
