Making effective use of the Internet is increasingly about creating better and more intelligent applications and search engines. Here is a brief introduction to how search engines work:
01) Define the corpus, search space/data;
02) Separate the corpus into documents;
03) Generate features for each document;
04) Generate a representation of each document;
05) Study the feature/vector space;
06) Cluster documents;
07) Reduce dimensionality;
08) Accept input queries;
09) Find the cosine of the angle between the query vector and each document vector;
10) Find the sought vector column;
11) Output results to the user in some way.
Each document in a corpus (database) is described by a set of keywords called index terms. We assign weights to index terms according to their relevance (frequency of occurrence, for instance). This is how we create the index that we can then search.
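As a minimal sketch of such an index, assuming raw term counts are used as the weights, the following Python maps each term to the documents that contain it. The function name `build_index` and the sample documents are hypothetical and only serve as illustration.

```python
from collections import Counter, defaultdict

def build_index(documents):
    """Map each term to {document id: count of the term in that document}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        # Counter gives the frequency of each lowercased word in the document.
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return index

docs = {"d1": "mill river mill", "d2": "river valley"}
index = build_index(docs)
print(index["mill"])   # documents containing "mill", with their weights
```

Looking up a term in `index` returns every document that mentions it together with its weight, which is the basic operation a search needs.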
Corpus preparation:
Web pages of interest are analysed and cleaned by removing hypertext tags and any other markup. Pages are then broken down into documents, and each document is scanned for words/terms of interest: those which make a document unique, not standard words.
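The cleaning step could look roughly like the sketch below, which uses Python's standard `html.parser` module to drop the tags and then discards a few standard words. The `STOP_WORDS` set, the class name `TextExtractor` and the function `clean_page` are assumptions made for this example; a real stop-word list would be much longer, and the contents of script or style elements would need separate handling.

```python
from html.parser import HTMLParser
import re

# Illustrative stop-word list only; not taken from the article.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

class TextExtractor(HTMLParser):
    """Collect the text content of a page, discarding the tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def clean_page(html):
    """Return the lowercased words of a page with tags and stop words removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

page = "<html><body><h1>The Wandle</h1><p>A river in London.</p></body></html>"
terms = clean_page(page)
print(terms)
```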
Extract terms of interest:
Bear in mind that terms of interest must be invariant, that is, characteristic of a document rather than generic and easy to find in any corpus/document. The idea is to find a signature for each document.
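One simple heuristic for spotting generic terms, offered here as an assumption rather than as the article's own method, is to drop any term that occurs in too large a share of the documents. The function name `distinctive_terms` and the threshold `max_fraction` are hypothetical.

```python
def distinctive_terms(documents, max_fraction=0.5):
    """Keep terms that occur in at most max_fraction of the documents."""
    doc_sets = [set(doc) for doc in documents]
    limit = max_fraction * len(doc_sets)
    vocabulary = set().union(*doc_sets)
    return sorted(
        term for term in vocabulary
        # Count the documents containing the term (its document frequency).
        if sum(term in d for d in doc_sets) <= limit
    )

docs = [
    ["river", "mill", "page"],
    ["river", "museum", "page"],
    ["valley", "page"],
    ["mill", "wheel"],
]
kept = distinctive_terms(docs)
print(kept)   # "page" appears in 3 of 4 documents and is dropped
```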
Build term-by-document matrix:
The search space is defined by N dimensions, one per chosen term/feature, and each document is a point in this N-term space. This allows conceptual/semantic searches.
Each document becomes a column vector and each row represents a term, so a row records the frequency of one term across the analysed corpus. At first we simply build the matrix by counting the terms in each document.
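A minimal sketch of this counting step, assuming each document has already been reduced to a list of terms, is shown below. The function name `term_document_matrix` and the sample data are hypothetical.

```python
def term_document_matrix(documents):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    terms = sorted({t for doc in documents for t in doc})
    row = {t: i for i, t in enumerate(terms)}
    # One row per term, one column per document, all counts starting at zero.
    matrix = [[0] * len(documents) for _ in terms]
    for j, doc in enumerate(documents):
        for t in doc:
            matrix[row[t]][j] += 1
    return terms, matrix

docs = [["mill", "river", "mill"], ["river", "valley"]]
terms, A = term_document_matrix(docs)
for term, counts in zip(terms, A):
    print(term, counts)
```

Reading down a column of `A` gives one document's vector; reading along a row gives one term's counts across the corpus.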
Compress the matrix:
There are two basic techniques/methods: Compressed Row Storage (scans the matrix row by row) and Compressed Column Storage (scans the matrix column by column). Both use three arrays.
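To make the three arrays concrete, here is a minimal sketch of Compressed Row Storage. The array names `values`, `col_index` and `row_ptr` are common conventions chosen for this example, not names from the article: the first holds the non-zero entries, the second their column numbers, and the third the position in `values` where each row starts.

```python
def to_crs(matrix):
    """Convert a dense matrix (list of rows) to Compressed Row Storage."""
    values, col_index, row_ptr = [], [], [0]
    for row in matrix:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)      # the non-zero entry itself
                col_index.append(j)   # the column it came from
        row_ptr.append(len(values))   # where the next row begins
    return values, col_index, row_ptr

A = [[2, 0], [1, 1], [0, 1]]
values, col_index, row_ptr = to_crs(A)
print(values, col_index, row_ptr)
```

Row i of the original matrix is recovered from `values[row_ptr[i]:row_ptr[i + 1]]` and the matching slice of `col_index`. Compressed Column Storage is the same idea applied column by column.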
Normalise the matrix:
Normalisation means transforming the column vectors into unit vectors, i.e. vectors of unit length.
Unit document vectors contain the relative frequency of terms; the normalisation is applied because the semantic content of a document is generally determined by the relative frequency of terms.
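As a sketch, assuming the Euclidean length is the one intended, each column can be divided by the square root of the sum of its squared entries. The function name `normalise_columns` is hypothetical.

```python
import math

def normalise_columns(matrix):
    """Return a copy of the matrix whose columns have Euclidean length 1."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        length = math.sqrt(sum(matrix[i][j] ** 2 for i in range(n_rows)))
        if length == 0:
            continue  # an all-zero column is left as zeros
        for i in range(n_rows):
            result[i][j] = matrix[i][j] / length
    return result

A = [[3, 0], [4, 1]]
unit = normalise_columns(A)
print(unit)   # first column becomes roughly (0.6, 0.8)
```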
Singular Value Decomposition:
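This decomposes a matrix into three matrices. In the symmetric (positive semi-definite) case described here, two of them are identical and hold the eigenvectors: the new dimensions. The third is diagonal and holds the eigenvalues, that is, the spread of the corpus along these new dimensions. For a general rectangular matrix, such as a term-by-document matrix, the two outer matrices differ and the diagonal entries are called singular values.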
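In symbols, and assuming A denotes the term-by-document matrix (the symbol names are not part of the original text), the standard form of the decomposition can be written as:

```latex
A = U \Sigma V^{T}, \qquad A A^{T} = U \, \Sigma \Sigma^{T} \, U^{T}
```

Here U and V have orthonormal columns and \Sigma is diagonal. The second identity follows from V^{T} V = I and shows where the symmetric case comes from: the columns of U are eigenvectors of A A^{T}, and the squared singular values are its eigenvalues.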
A geometric interpretation:
The corpus is first formatted and stemmed, and is then stored in a compact term-by-document matrix. Each column of this matrix is then normalised to produce the likelihood of a term across the corpus or, equivalently, the frequency of terms in a document.
The term-by-document matrix is then decomposed to calculate eigenvalues and eigenvectors. The eigenvectors represent a new Cartesian coordinate frame spanning the same search space, but they indicate the most important dimensions/axes along which the documents mainly lie. The eigenvalues quantify the spread of documents along these new axes/eigenvectors.
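To illustrate what the most important axis means, the sketch below uses power iteration, a simple technique swapped in here for a full decomposition, to approximate the leading eigenvalue and eigenvector of A multiplied by its transpose. The function name `dominant_axis` and the iteration count are assumptions, and the method assumes the starting vector is not orthogonal to the leading eigenvector.

```python
import math

def dominant_axis(matrix, iterations=100):
    """Approximate the leading eigenvalue and eigenvector of A * A^T."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0] * n_rows
    eigenvalue = 0.0
    for _ in range(iterations):
        # w = A^T v, then u = A w, so that u = (A A^T) v.
        w = [sum(matrix[i][j] * v[i] for i in range(n_rows)) for j in range(n_cols)]
        u = [sum(matrix[i][j] * w[j] for j in range(n_cols)) for i in range(n_rows)]
        norm = math.sqrt(sum(x * x for x in u))
        if norm == 0:
            break
        v = [x / norm for x in u]   # keep the estimate at unit length
        eigenvalue = norm           # converges to the largest eigenvalue
    return eigenvalue, v

A = [[2, 0], [0, 1]]
spread, axis = dominant_axis(A)
print(spread, axis)
```

In this small example the documents spread most along the first term axis, so the returned vector points along it and the returned value measures that spread.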
Queries:
Queries must be based on the features/terms defined within the term-by-document matrix. Matching in a vector space such as this is implemented by multiplying the query vector by the term-by-document matrix, i.e. matching a query vector q against the documents of the matrix.
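A minimal sketch of this matching step is given below. It computes the cosine of the angle between the query vector and every document column; the function name `cosine_scores` and the sample matrix are hypothetical, and the query is assumed to use the same term order as the matrix rows.

```python
import math

def cosine_scores(matrix, query):
    """Return the cosine between the query and each document column."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    q_len = math.sqrt(sum(x * x for x in query))
    scores = []
    for j in range(n_cols):
        column = [matrix[i][j] for i in range(n_rows)]
        d_len = math.sqrt(sum(x * x for x in column))
        dot = sum(q * d for q, d in zip(query, column))
        # A zero-length query or document cannot match anything.
        scores.append(dot / (q_len * d_len) if q_len and d_len else 0.0)
    return scores

A = [[2, 0], [1, 1], [0, 1]]   # rows: mill, river, valley
q = [1, 0, 0]                  # a query for the single term "mill"
scores = cosine_scores(A, q)
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, best)
```

The document with the highest score is the closest match. If the columns and the query are already unit vectors, the division is unnecessary and the scores reduce to a plain matrix-vector product.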
© I am the website administrator of the Wandle industrial museum (http://www.wandle.org), established in 1983 by local people determined to ensure that the history of the valley was no longer neglected and to enhance awareness of its heritage for the use and benefit of the community.