From Corpora to Matching

Making effective use of the Internet is increasingly about creating better and more intelligent applications and search engines. Here is a brief introduction to how search engines work:

01) Define the corpus, search space/data;
02) Separate the corpus into documents;
03) Generate features for each document;
04) Generate a representation of each document;
05) Study the feature/vector space;
06) Cluster documents;
07) Reduce dimensionality;
08) Accept input queries;
09) Compute the cosine of the angle between each document vector and the query vector;
10) Find the best-matching column vector (document);
11) Output results to the user in some way.

Each document in a corpus (database) is described by a set of keywords called index terms. We assign weights to the index terms according to their relevance (frequency of occurrence, for instance). This is how we create the index that we can then search.

Corpus preparation:
Web pages of interest are analysed and cleaned by removing hypertext tags and any other markup. Pages are then broken down into documents, and each document is scanned for words/terms of interest: those which make a document unique, not standard words.
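
The cleaning step above can be sketched as follows. This is a minimal illustration, not a production cleaner: the class name TextExtractor, the function clean_page, the tiny stop-word list and the sample page are all assumptions made for the example. It uses only the Python standard library.

```python
import re
from html.parser import HTMLParser

# Illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "by"}


class TextExtractor(HTMLParser):
    """Collects the text content of a page and drops the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are ignored.
        self.chunks.append(data)


def clean_page(html):
    """Return the lower-case terms of a page, without tags or stop words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


# Made-up sample page.
page = "<html><body><h1>The search engine</h1><p>An index of the pages.</p></body></html>"
terms = clean_page(page)
print(terms)
```

A real cleaner would probably also skip script and style content and apply stemming; both are left out here to keep the sketch short.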

Extract terms of interest:
Bear in mind that terms of interest must be invariant, that is, characteristic of a document rather than generic and easy to find in any corpus/document. The idea is to find a signature for each document.
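
One simple way to approximate "characteristic, not generic" is a document-frequency cut-off: drop any term that appears in too many documents. The text above does not prescribe this method, so treat it as one possible sketch; the function name terms_of_interest, the 0.5 threshold and the sample documents are assumptions.

```python
def terms_of_interest(docs, max_doc_fraction=0.5):
    """Keep terms that occur in at most max_doc_fraction of the documents."""
    doc_freq = {}
    for doc in docs:
        # set() so that a term is counted once per document.
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    limit = max_doc_fraction * len(docs)
    return sorted(t for t, n in doc_freq.items() if n <= limit)


# Made-up documents, already cleaned and tokenised.
docs = [
    ["page", "search", "index"],
    ["page", "river", "mill"],
    ["page", "search", "query"],
    ["page", "museum", "mill"],
]
selected = terms_of_interest(docs)
print(selected)
```

Here "page" occurs in every document, so it carries no signature and is dropped, while the rarer terms are kept.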

Build term-by-document matrix:
The search space has N dimensions, one per chosen term/feature, and each document is a point in this N-term space. This allows conceptual/semantic searches.

Each document becomes a column vector and each row represents a term. Each row gives the frequency of a term across the analysed corpus. At first we simply build the matrix by counting the terms in each document.
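
As a small sketch of the counting step, the following builds the matrix as a list of rows, one row per term and one column per document. The function name term_document_matrix and the three sample documents are made up for illustration.

```python
def term_document_matrix(docs, terms):
    """Rows are terms, columns are documents, entries are raw counts."""
    return [[doc.count(term) for doc in docs] for term in terms]


# Made-up documents, already cleaned and tokenised.
docs = [
    ["search", "engine", "search"],
    ["engine", "index"],
    ["index", "index", "query"],
]
terms = ["engine", "index", "query", "search"]
matrix = term_document_matrix(docs, terms)
for term, row in zip(terms, matrix):
    print(term, row)
```

Reading down a column gives the term counts of one document; reading along a row shows how one term is spread over the corpus.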

Compress the matrix:
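There are two basic techniques/methods: Compressed Row Storage (scans the matrix row by row) and Compressed Column Storage (scans the matrix column by column). Both use three arrays: the non-zero values, their column (or row) indices, and pointers to where each row (or column) starts.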
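
A minimal sketch of row-wise compressed storage, assuming the matrix is held as a list of rows. The names to_crs and crs_get are illustrative; the three arrays are the non-zero values, the column index of each value, and a pointer to the start of each row.

```python
def to_crs(matrix):
    """Compress a dense matrix row by row into three arrays."""
    values, col_index, row_ptr = [], [], [0]
    for row in matrix:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)
                col_index.append(j)
        # Position in `values` where the next row starts.
        row_ptr.append(len(values))
    return values, col_index, row_ptr


def crs_get(values, col_index, row_ptr, i, j):
    """Look up entry (i, j); entries that were not stored are zero."""
    for k in range(row_ptr[i], row_ptr[i + 1]):
        if col_index[k] == j:
            return values[k]
    return 0


matrix = [
    [1, 1, 0],
    [0, 1, 2],
    [0, 0, 1],
    [2, 0, 0],
]
values, col_index, row_ptr = to_crs(matrix)
print(values, col_index, row_ptr)
```

Column-wise storage is the same idea applied to the transposed matrix. The saving grows with the number of zeros, and term-by-document matrices are typically very sparse.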

Normalise the matrix:
Normalisation implies transforming column vectors to unit vectors, i.e. vectors of unit length.

Unit document vectors contain the frequency of terms; the normalisation is applied because the semantic content of a document is generally determined by the relative frequency of terms.
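
A short sketch of column normalisation, assuming Euclidean length as the measure of "unit length" and a dense list-of-rows matrix. The function name normalise_columns is illustrative.

```python
import math


def normalise_columns(matrix):
    """Scale every column to Euclidean length 1; zero columns stay zero."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    result = [[float(x) for x in row] for row in matrix]
    for j in range(n_cols):
        length = math.sqrt(sum(matrix[i][j] ** 2 for i in range(n_rows)))
        if length > 0:
            for i in range(n_rows):
                result[i][j] = matrix[i][j] / length
    return result


# Two made-up documents over two terms.
matrix = [
    [3, 0],
    [4, 2],
]
unit = normalise_columns(matrix)
print(unit)
```

After this step a long document and a short one with the same mix of terms end up as the same vector, which is the point of normalising.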

Singular Value Decomposition:
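This factors the term-by-document matrix A into three matrices, A = U S V^T. The columns of U and V are the eigenvectors of the symmetric matrices A A^T and A^T A respectively: the new dimensions. Only for a symmetric positive semi-definite matrix, such as A^T A itself, are the two identical. The third matrix, S, is diagonal and holds the singular values, which are the square roots of the eigenvalues of A^T A and measure the spread of the corpus along these new dimensions.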
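
To make the link between the decomposition and eigenvectors concrete, here is a sketch that finds only the largest singular value and its pair of vectors, using power iteration on A^T A. It is not a full decomposition, and in practice a numerical library routine would normally be used instead. The function names and the sample matrix are assumptions, and the matrix is assumed to be non-zero.

```python
import math


def mat_vec(a, v):
    """Multiply matrix a (list of rows) by vector v."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def dominant_singular_triplet(a, iterations=200):
    """Power iteration on A^T A: returns (u, sigma, v) with A v = sigma u."""
    at = transpose(a)
    v = [1.0] * len(a[0])
    for _ in range(iterations):
        # One step of v <- A^T A v, rescaled to unit length.
        w = mat_vec(at, mat_vec(a, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    av = mat_vec(a, v)
    sigma = math.sqrt(sum(x * x for x in av))
    u = [x / sigma for x in av]
    return u, sigma, v


# Made-up 3-term by 2-document matrix.
a = [
    [3.0, 0.0],
    [0.0, 2.0],
    [0.0, 0.0],
]
u, sigma, v = dominant_singular_triplet(a)
print(sigma, u, v)
```

For this matrix A^T A is diagonal with entries 9 and 4, so the iteration settles on the first axis and the singular value is the square root of 9.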

A geometric interpretation:
The corpus is first formatted and stemmed, and is then stored in a compact term-by-document matrix. Each column of this matrix is then normalised so that its entries reflect the relative frequency of terms in a document.

The term-by-document matrix is then decomposed to calculate eigenvalues and eigenvectors (strictly, the singular values and singular vectors of the matrix). The eigenvectors represent a new Cartesian coordinate frame spanning the same search space, but they indicate the most important dimensions/axes along which the documents mainly lie. The eigenvalues quantify the spread of documents along these new axes/eigenvectors.

Queries:
Queries must be based on the features/terms defined in the term-by-document matrix. Matching in a vector space such as this is implemented by multiplying the query vector by the term-by-document matrix, i.e. matching a query vector q against the documents (columns) of the matrix.
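
A minimal sketch of query matching by cosine similarity, assuming a dense list-of-rows matrix and a query expressed over the same terms. The function name cosine_scores, the term list and the counts are made up for the example.

```python
import math


def cosine_scores(matrix, query):
    """Cosine of the angle between the query and each document column."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    q_len = math.sqrt(sum(x * x for x in query))
    scores = []
    for j in range(n_cols):
        col = [matrix[i][j] for i in range(n_rows)]
        d_len = math.sqrt(sum(x * x for x in col))
        dot = sum(q * d for q, d in zip(query, col))
        scores.append(dot / (q_len * d_len) if q_len and d_len else 0.0)
    return scores


terms = ["engine", "index", "query", "search"]
matrix = [
    [1, 1, 0],  # engine
    [0, 1, 2],  # index
    [0, 0, 1],  # query
    [2, 0, 0],  # search
]
query = [0, 1, 1, 0]  # the query "index query" as a term vector
scores = cosine_scores(matrix, query)
# Document indices, best match first.
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(scores, ranking)
```

If the columns and the query are already unit vectors, the division is unnecessary and the scores reduce to the plain product of the query vector with the matrix.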

© I am the website administrator of the Wandle Industrial Museum (http://www.wandle.org), established in 1983 by local people determined to ensure that the history of the valley was no longer neglected, and to enhance awareness of its heritage for the use and benefit of the community.
