Making effective use of the Internet is increasingly about creating better and more intelligent applications and search engines. Here is a brief introduction to how search engines work:
01) Define the corpus, search space/data;
02) Separate the corpus into documents;
03) Generate features for each document;
04) Generate a representation of each document;
05) Study the feature/vector space;
06) Cluster documents;
07) Reduce dimensionality;
08) Accept input queries;
09) Find the cosine angles against the query vector;
10) Find the sought vector column;
11) Output results to the user in some way.
Each document in a corpus (database) is described by a set of keywords called index terms. We assign weights to index terms according to their relevance (frequency of occurrence, for instance); this is how we create the index that we can then search.
Corpus preparation:
Web pages of interest are analysed and cleaned by removing hypertext tags and any other markup. Pages are then broken down into documents, and each document is scanned for words/terms of interest: those that make a document unique, not standard words.
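The cleaning step can be sketched with Python's standard library. This is a minimal illustration, not the method of any particular search engine: the class name TextExtractor, the function clean_page and the sample page are all assumptions made for the example, and real pages need more careful handling.

```python
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects the text of an HTML page and ignores the tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style>, whose content is not prose

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def clean_page(html):
    """Return the lower-cased word tokens of an HTML page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    # Keep runs of letters only; digits and punctuation are dropped in this sketch.
    return re.findall(r"[a-z]+", text.lower())


# Invented sample page, for illustration only.
page = "<html><body><h1>Wandle Mills</h1><p>The river powered many mills.</p></body></html>"
tokens = clean_page(page)
```

The resulting token list is the raw material for the next step, where generic words are discarded.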
Extract terms of interest:
Bear in mind that terms of interest must be invariant, that is, characteristic of a document, not generic and easy to find in any corpus/document. The idea is to find a signature per document.
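One simple way to approximate "characteristic, not generic" is to drop common stop words and any term that occurs in too many documents. The sketch below assumes that heuristic; the stop-word list, the 0.5 threshold and the function name terms_of_interest are illustrative choices, not part of the method described above.

```python
STOP_WORDS = {"the", "a", "of", "and", "in", "is", "to"}  # illustrative, not exhaustive


def terms_of_interest(docs, max_doc_fraction=0.5):
    """Keep terms that are not stop words and that appear in at most
    max_doc_fraction of the documents (an assumed heuristic)."""
    doc_freq = {}
    for tokens in docs:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    limit = max_doc_fraction * len(docs)
    return sorted(
        term for term, df in doc_freq.items()
        if term not in STOP_WORDS and df <= limit
    )


# Four tiny, already tokenised documents (invented data).
docs = [
    ["the", "river", "mill", "water"],
    ["the", "mill", "textile", "print"],
    ["the", "museum", "mill", "history"],
    ["the", "river", "fish"],
]
vocabulary = terms_of_interest(docs)
```

Here "mill" is rejected because it appears in three of the four documents, so it says little about any one of them.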
Build term-by-document matrix:
The search space is defined by N dimensions, one per chosen term/feature, and each document is a point in this N-term space. This allows conceptual/semantic searches.
Each document becomes a column vector and each row represents a term, so a row records the frequency of one term across the analysed corpus. At first we simply build the matrix by counting the terms in each document.
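Counting terms per document can be sketched as follows. The function name term_document_matrix and the sample data are assumptions for illustration; the matrix is held as a plain list of rows to keep the example self-contained.

```python
def term_document_matrix(docs, vocabulary):
    """Return a list of rows: matrix[i][j] is the count of vocabulary[i] in docs[j]."""
    row_of = {term: i for i, term in enumerate(vocabulary)}
    matrix = [[0] * len(docs) for _ in vocabulary]
    for j, tokens in enumerate(docs):
        for term in tokens:
            i = row_of.get(term)
            if i is not None:  # terms outside the vocabulary are ignored
                matrix[i][j] += 1
    return matrix


# Rows: mill, museum, river. Columns: three invented documents.
vocabulary = ["mill", "museum", "river"]
docs = [
    ["river", "mill", "river"],
    ["museum", "mill"],
    ["river"],
]
A = term_document_matrix(docs, vocabulary)
```

Reading a column of A gives one document's vector; reading a row gives one term's frequency across the corpus.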
Compress the matrix:
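There are two basic techniques/methods: Compressed Row Storage (scans the matrix row by row) and Compressed Column Storage (scans the matrix column by column). Both use three arrays: one for the non-zero values, one for their column (or row) indices, and one marking where each row (or column) starts.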
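A minimal sketch of row-wise compressed storage follows. The array names values, col_index and row_ptr are conventional but assumed here, as are the helper names to_crs and crs_get.

```python
def to_crs(matrix):
    """Compress a dense matrix (list of rows) into three arrays."""
    values = []      # non-zero entries, scanned row by row
    col_index = []   # column of each stored entry
    row_ptr = [0]    # row i occupies values[row_ptr[i]:row_ptr[i + 1]]
    for row in matrix:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)
                col_index.append(j)
        row_ptr.append(len(values))
    return values, col_index, row_ptr


def crs_get(crs, i, j):
    """Look up entry (i, j) in the compressed form; absent entries are zero."""
    values, col_index, row_ptr = crs
    for k in range(row_ptr[i], row_ptr[i + 1]):
        if col_index[k] == j:
            return values[k]
    return 0


A = [[1, 1, 0], [0, 1, 0], [2, 0, 1]]
crs = to_crs(A)
```

Column-wise storage is the same idea applied to the transposed matrix. The saving grows with sparsity, and term-by-document matrices are usually very sparse because most terms are absent from most documents.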
Normalise the matrix:
Normalisation means transforming column vectors into unit vectors, i.e. vectors of unit length.
Unit document vectors contain the relative frequency of terms; normalisation is applied because the semantic content of a document is generally determined by the relative frequency of terms.
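As a sketch, assuming Euclidean (L2) length is the intended notion of "unit length", each column can be divided by its own length. The function name normalise_columns is hypothetical.

```python
import math


def normalise_columns(matrix):
    """Scale every column of a dense matrix (list of rows) to Euclidean length 1."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        length = math.sqrt(sum(matrix[i][j] ** 2 for i in range(n_rows)))
        if length == 0:
            continue  # an empty document stays a zero vector
        for i in range(n_rows):
            result[i][j] = matrix[i][j] / length
    return result


A = [[3, 0],
     [4, 2]]
U = normalise_columns(A)
```

After this step a long document and a short one with the same mix of terms map to the same vector, which is the point of normalising.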
Singular Value Decomposition:
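This factors a matrix into three matrices. For a symmetric matrix, the two outer matrices are identical (one is the transpose of the other) and hold the eigenvectors: the new dimensions. For a general term-by-document matrix they differ, holding the term-side and document-side singular vectors. The third matrix is diagonal and holds the eigenvalues (singular values in the general case), that is, the spread of the corpus along these new dimensions.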
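In standard notation the decomposition can be written as below. The symbols A, U, V, Sigma and the dimensions m (terms) and n (documents) are notation assumed for this note, not taken from the text above.

```latex
% A is the m-by-n term-by-document matrix (m terms, n documents), of rank r.
A = U \Sigma V^{T}, \qquad U^{T} U = I, \quad V^{T} V = I, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \quad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0 .

% The symmetric case arises from the term-by-term matrix A A^{T}:
A A^{T} = U \Sigma V^{T} V \Sigma U^{T} = U \Sigma^{2} U^{T} .
```

The second line shows why the symmetric case has two identical outer matrices: the columns of U are the eigenvectors of A A^T, and its eigenvalues are the squared singular values.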
A geometric interpretation:
The corpus is first formatted and stemmed, and is then stored in a compact term-by-document matrix. Each column of this matrix is then normalised, so that its entries reflect the relative frequency of terms in a document.
The term-by-document matrix is then decomposed to calculate eigenvalues and eigenvectors. The eigenvectors represent a new Cartesian coordinate frame spanning the same search space, but they indicate the most important dimensions/axes along which documents mainly lie. The eigenvalues quantify the spread of documents along these new axes.
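To make the geometric picture concrete, the most important axis can be found with power iteration on the symmetric term-by-term matrix A * A^T. This is a swapped-in, simplified technique chosen so the example needs no libraries; it is not a full decomposition, and the function names gram_matrix and dominant_axis are hypothetical.

```python
import math


def gram_matrix(A):
    """Return A * A^T, the symmetric term-by-term matrix."""
    return [[sum(a * b for a, b in zip(row_i, row_k)) for row_k in A] for row_i in A]


def dominant_axis(S, iterations=100):
    """Power iteration: approximate the largest eigenvalue of a symmetric,
    non-negative, non-empty matrix S and a matching unit eigenvector."""
    n = len(S)
    v = [1.0 / math.sqrt(n)] * n  # positive start vector, suited to count data
    for _ in range(iterations):
        w = [sum(s * x for s, x in zip(row, v)) for row in S]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # S maps v to zero; nothing to iterate on
        v = [x / norm for x in w]
    Sv = [sum(s * x for s, x in zip(row, v)) for row in S]
    eigenvalue = sum(a * b for a, b in zip(v, Sv))  # Rayleigh quotient
    return eigenvalue, v


# Two terms (rows) and two documents (columns), invented data.
A = [[2, 0],
     [1, 1]]
S = gram_matrix(A)
spread, axis = dominant_axis(S)
```

The returned axis is the direction along which the documents are most spread out, and spread measures how much; further axes would be found by repeating the process on what remains, which a full decomposition routine does in one pass.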
Queries:
Queries must be based on the features/terms defined in the term-by-document matrix. Matching in a vector space such as this is implemented by multiplying the query vector by the term-by-document matrix, i.e. matching a query vector q against the documents (columns) of the matrix. With unit-length vectors, each product is the cosine of the angle between the query and a document.
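Query matching can be sketched as a cosine score per document column. The function name cosine_scores, the three-term vocabulary and the sample matrix are assumptions for illustration; the lengths are computed inside the function so it also works on a matrix that has not been normalised.

```python
import math


def cosine_scores(matrix, query):
    """Cosine of the angle between the query vector and every document column.
    matrix is a list of term rows; query has one weight per term."""
    n_terms = len(matrix)
    n_docs = len(matrix[0]) if matrix else 0
    q_len = math.sqrt(sum(q * q for q in query))
    scores = []
    for j in range(n_docs):
        column = [matrix[i][j] for i in range(n_terms)]
        d_len = math.sqrt(sum(x * x for x in column))
        dot = sum(q * x for q, x in zip(query, column))
        scores.append(dot / (q_len * d_len) if q_len and d_len else 0.0)
    return scores


# Rows: mill, museum, river. Columns: three invented documents.
A = [[1, 1, 0],
     [0, 1, 0],
     [2, 0, 1]]
query = [0, 0, 1]  # a one-word query for "river"
scores = cosine_scores(A, query)
# Document indices ordered from best match to worst.
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

A score near 1 means the document points in almost the same direction as the query; a score of 0 means they share no terms.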
© I am the website administrator of the Wandle industrial museum (http://www.wandle.org). The museum was established in 1983 by local people determined to ensure that the history of the valley was no longer neglected, and to enhance awareness of its heritage for the use and benefit of the community.