
Search Engines and Latent Semantic Indexing

It’s being suggested that Google’s new Caffeine update is going to place more weight on LSI. So here is some information for those interested in the basics. Don’t worry about references to previous tutorials, as this piece is pretty much stand-alone.

Search Engines and Latent Semantic Indexing

In the last unit of the course, we offered a basic overview of how search engines work. In this unit, we are going to take a more in-depth look at search technology, explaining some of the innovations that have been made to help return relevant results for search queries. In particular, we will be looking at some of the factors involved in Latent Semantic Indexing (or LSI for short).

Because of its very nature, beginners may find the material in this unit quite advanced, so you are encouraged to take your time and try to absorb the main points. We have also provided footnotes and suggestions for further reading. You are not required to read this material, but more advanced webmasters may find some of the sources mentioned useful.

By the end of this unit you should be able to:

  • understand the basics of Latent Semantic Indexing
  • understand how a search engine sees documents
  • understand how a search engine weights keywords

This unit assumes that you have read the previous parts of the course and are familiar with major search engines such as Google.


3.1 Another look at search engines

SEO requires quite a lot of background knowledge if you are going to optimise your pages in a manner that is effective and does not actually damage the ranking of your website. Before you even get your hands dirty altering the code on your web pages, you need to do quite a bit of research into the most effective keywords for your products or services and into the competition you face in the search engine rankings. Even prior to this, however, you need to understand a bit about

  • search engines
  • the individual searcher

In one sense, these can be considered the ‘rocks’ upon which effective SEO is founded. After all, SEO is about improving your search engine visibility in order to bring targeted traffic to your site, and this implicitly involves understanding the nature of both search engines and the average searcher. Knowledge of these areas will prove an immense help when you come to optimising your own pages.

The two areas are inter-related: after all, the function of a search engine is not to search documents per se, but to search documents in a way that satisfies the needs of the average searcher. In order to maintain trust, the search engine must continue to provide the user with reliable and relevant results for search queries. In this context, any innovations made in search engine algorithms can be considered, first and foremost, as refinements aimed at providing ever more relevant results for the searcher.

We will now look at some of the advanced techniques that search engines are beginning to employ in order to satisfy the needs of the searcher.


3.2 Google and Latent Semantic Indexing (LSI)

In the last unit of the course, we showed you how a search engine attempts to find relevant documents for a search query by locating pages in its index that match the search query - that is, pages that contain the specific words we entered. However, the process is rather more complex than this, largely because of an innovation on the part of the world’s leading search provider, Google.

In order to return more relevant results for the user, Google uses a method called ‘Latent Semantic Indexing’ when indexing documents on the web. Although this method is not used universally by all search engines, it is likely that other search engines will begin to factor this (or a similar) method into their algorithms in the future.

Note that Google does not rely entirely on LSI for finding relevant results. However, according to noted SEO experts, Google has been using LSI ‘for a while’ and has ‘recently increased its weighting’. This means that while traditional keyword-based search queries are still relevant - i.e., Google still tries to retrieve documents that contain the specific search terms or keywords you use - Google’s search algorithm has begun to place more importance on LSI when attempting to determine and retrieve relevant documents for a specific search query.

So what is LSI and how does it differ from a standard keyword search? In essence, LSI is a method for retrieving documents that are relevant to a search but that may not contain the specific keyword entered by the user.

For example, in a traditional keyword-based search, if I enter the search phrase ‘used cars’ into the search engine, it will only return documents that mention those actual terms somewhere on the page. It will not return web pages that mention terms that we normally consider to be closely related to our search query, e.g. ‘second hand’, ‘vehicles’, ‘automobiles’, and so forth (unless these pages also happen to use the keyphrase ‘used cars’).

When using LSI, on the other hand, the search engine finds a means to locate pages that contain related terms as well as our specific keyphrase. Therefore, our search might also return pages that only mention ‘second-hand automobiles’ as well as pages that specifically mention ‘used cars’.

As you can see, then, LSI allows the search engine to return documents that are outside our specific search phrase, but that are still relevant to our search. It begins to approximate how we actually use language in real life, where we are aware of alternative terms and synonyms for words, and for this reason should prove to be more useful to the searcher than a standard keyword search.
 
3.3 Latent Semantic Analysis

LSI is based on a theory called Latent Semantic Analysis. This theory was devised in 1990 by Susan Dumais, George Furnas, Scott Deerwester, Thomas Landauer, and Richard Harshman. According to Landauer, Foltz and Laham, Latent Semantic Analysis, or LSA, is

a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text.

In other words, LSA is a statistical and mathematical method for finding the contextual meaning of words in a large collection of documents. Such a collection could be something like the Internet, which contains a vast corpus of text-based documents in the form of web pages.

If this begins to sound like advanced mathematics meets advanced linguistics, that’s because it is! (LSA even borders on cognitive science.) This method, however, has immediate applicability to search engines, because we are dealing with the problem of making a mathematical machine, or computer, ‘understand’, or analyse, the meaning of words (semantics is the study of word meaning, hence Latent Semantic Analysis).

Unlike most humans, who usually acquire the ability to use and understand language at an early age, computers cannot understand what words mean. The same holds for search engines. Despite their sophisticated mathematical algorithms, and despite the fact these algorithms ‘read’ the text on web pages to some extent, search engines are actually rather stupid and cannot form even the most basic understanding of what words mean.

What is the ‘contextual-usage meaning’ of words? To explain this we have to look at two features of everyday language which cause particular problems for computers and search engines.

  • synonymy.
  • polysemy.

A synonym is a word that has roughly the same meaning as another word. To find synonyms for a word you simply have to consult a thesaurus, where you will find a list of alternative words that can be interchanged with the original word.

I say ‘roughly’ because we can’t just select any alternative listed in the thesaurus to replace our original word. In fact, some words only become synonymous with other words when used in the right context.

For example, if I consult my thesaurus for a synonym for ‘used’, part of our earlier search for a car, I am provided with the following list of possible alternatives:

  • cast-off
  • hand-me-down
  • nearly new, not new
  • reach-me-down
  • second-hand
  • shopsoiled
  • worn

If I were looking for second-hand clothes rather than used cars, I could use many of the above synonyms, as we customarily refer to second-hand clothes as ‘cast-offs’ or ‘hand-me-downs’. However, we don’t use phrases such as ‘hand-me-down cars’ or ‘shopsoiled automobiles’. The context of our original phrase, or the word ‘cars’, determines that only one of the above alternatives - ‘second-hand’ - is an appropriate substitute for ‘used’.

In other words, we understand which words are synonyms according to the context in which they appear.

Of course, it would be of great advantage to us as searchers if the search engine were to automatically find commonly used alternative terms for the search phrases we entered. While we could simply construct a search engine with its own built-in thesaurus, the above example shows us the problems we would inevitably encounter if we did so. If the search engine attempted to substitute our search terms with all the alternatives found in its thesaurus, it would produce some very strange search results. Without some understanding of ‘contextual-usage meaning’, or the context in which the term to be substituted appears, the search engine would be unable to pick the ‘right’ synonyms.

‘Polysemy’ can roughly be translated as ‘many-meaning’. It refers to the fact that most words in any given language have more than one meaning.

To see this you simply have to look in a dictionary, where you will find that most words have more than one definition. If, for example, we take ‘vehicle’, one of the related terms from our earlier search example, we can see that it could have more than one meaning. According to the Oxford Concise Dictionary, a ‘vehicle’ could be a thing for transporting people, a means of expressing something, or a film intended to display its leading performer to best advantage!

How do we decide which of these possible meanings applies at any given point? This is where ‘contextual usage’ comes into play. As language users, we know which meaning is being used according to the context in which it appears. If, for example, I was to say that ‘Top Hat was a vehicle for Fred Astaire and Ginger Rogers’, you would know that the word ‘vehicle’ in this context refers to a type of film and not a car. If, on the other hand, I use the phrase ‘second-hand vehicle’ you are likely to know that I am referring to a car.

Unfortunately, a computer has no way of distinguishing between the two as it lacks the ability to understand the context of statements and has no knowledge of the linguistic customs that give rise to polysemy. This means that the search query ‘second-hand vehicles’ could potentially return any page that happens to mention the two words, including pages that mention films or even ‘vehicles’ for expression such as poems.

We clearly have a problem, then, because computers can’t understand the meaning of words according to the context in which they appear. The computer either has to stick with the terms given and ignore all possible alternatives - which means that we could miss documents that are relevant to our search but don’t contain our keyphrase - or include all possible alternatives - which means that numerous irrelevant results could be returned.

LSA provides us with a means of getting round the problem of computers not being able to understand contextual-usage meaning. It has been successfully applied to the process of information retrieval - that is, the process of retrieving information from large databases and collections of documents (like the Internet) - because it adequately gets round the two problems of synonymy and polysemy.

It does this by looking at the collection of documents as a whole and finding words that commonly appear closely together. For example, by looking at enough documents it could find that ‘used cars’ and ‘second-hand automobiles’ are closely related terms simply because these terms customarily appear together on the same pages. Let’s have a closer look at how this works.


3.4 Latent Semantic Indexing in Action

The following material is based on the paper ‘Patterns in Unstructured Data’ by Clara Yu, John Cuadrado, Maciej Ceglowski and J. Scott Payne. In this paper, the authors present what is probably the best introduction to LSI and search engines currently available, and one that is popular amongst the SEO community.

Advanced webmasters may wish to consult this paper themselves, although this is not an absolute requirement as the following material provides a simplified version of the research it contains.


3.4.1 What people expect from a search engine

Yu et al. begin by pointing out some of the problems faced by current search technology. The internet is growing at an exponential rate, to the extent that, as the authors point out, Google has over 8 billion webpages in its index. This effectively means that more and more users have access to a vast collection of information, and that search engines face the task of indexing and searching this vast reservoir of data to return results that are both relevant to the individual searcher and simple enough for the average user to understand.

The trouble is that, given the sheer size of the Internet and the current state of search engine technology, any relevant information we find will still appear among a ton of irrelevant pages. Therefore search engines today still face the task of coming up with ever better ways of finding relevant results for the individual searcher.

According to Yu et al, there are three main things that people expect from a search engine (what they call the ‘Holy Trinity’ of searching). These things can be defined as follows:

  • Recall
  • Precision
  • Ranking

‘Recall’ refers to the ability of the search engine to recall relevant information for a search. This relates to our desire to obtain all the relevant information that exists on a topic when we search for it. ‘Precision’ relates to the fact that we want the results returned to be precise, i.e. to contain more relevant than irrelevant information. Finally, we don’t expect the results returned to be presented to us in a random manner. We expect them to be ranked in such a way that the search engine presents what it perceives to be the most relevant results for our specific search first, and the least relevant results last.

We can use these criteria to judge the efficacy of any current search technology, e.g. Google. When we use a search engine, we expect it not to omit information (recall), to return relevant results (precision), and to arrange those results in SERPs (ranking). Yu et al. envision that the ideal search engine would be able to search every document on the Internet and return up-to-date results quickly while still satisfying these criteria. However, whereas it is relatively easy for a search engine to increase its scope or speed up its searches, as this largely involves investing in additional resources, it is still difficult for a search engine to improve upon the recall, precision and ranking of searches. This is where Latent Semantic Indexing comes in.
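For readers who like to see things concretely, recall and precision have standard definitions in information retrieval. The tiny sketch below (in Python, with made-up document names) computes them for a hypothetical set of search results; a real search engine, of course, never knows the full set of relevant documents in advance.

```python
# Hypothetical example: we pretend to know which documents are truly relevant.
relevant = {"doc1", "doc2", "doc3", "doc4"}      # everything on the topic (unknowable to a real engine)
retrieved = ["doc2", "doc7", "doc1", "doc9"]     # what the engine returned, in ranked order

retrieved_relevant = [d for d in retrieved if d in relevant]

recall = len(retrieved_relevant) / len(relevant)        # how much of the relevant material was found
precision = len(retrieved_relevant) / len(retrieved)    # how much of what was found is relevant

print(f"recall = {recall:.2f}, precision = {precision:.2f}")   # recall = 0.50, precision = 0.50
```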
 
3.4.2 A Middle Ground?

Yu et al. outline two main ways in which we can search a collection of documents such as the Internet. For the sake of simplicity we will call these methods:

  • Human
  • Mechanical

The first type of search is not likely to be exhaustive, as nobody has the time or resources to go through a whole collection of documents word by word (can you imagine reading every page on the Internet!). Human beings are more likely to scan pages for relevant information than to read pages in their entirety to see if they contain the phrase or information we are looking for.

Although this kind of search is not exhaustive, it is based on a high-level understanding of context, in the sense that human searchers usually know that certain parts of documents - e.g. page titles, headings, indexes - usually contain relevant information regarding what a page is about. Because this kind of search is carried out with an understanding of context, it can also successfully uncover relevant information in unexpected places, e.g. in articles which are not dedicated to the subject we were originally looking for.

The second type of search is exhaustive in the sense that it works methodically and mechanically through an entire collection of documents, noting down every single mention of the topic we are looking for. Computers are particularly good at this kind of task.

Although this second type of search can find every single instance where a term is mentioned, it has no understanding of context. Without this understanding of context, the computer cannot return documents that are related to our search but that don’t actually contain our search terms. Alternatively, the search engine returns documents that mention our specific search terms but that are using those terms in the wrong context (see the problems of synonymy and polysemy outlined above).

In short, a human search understands context but remains inexhaustive or sorely incomplete, while a mechanical search is exhaustive but has no understanding of context.

An obvious solution to the difficulty of searching a collection of documents like the Internet would be to find a way to combine the two. That way we would have the best of both worlds, where an exhaustive mechanical search would also display an understanding of context, thereby allowing ‘synonymous’ or related material to be found while cutting out the irrelevant material caused by the problem of polysemy.

Yu et al. point out that past attempts to combine mechanical searches with a human element have met with only limited success. Attempts to supplement searches by providing a computer with a human-compiled list of synonyms to search have not proved successful. Surprisingly, there would also be shortcomings were we to employ a human ‘taxonomy’, or system of classification, such as the systems that have been used by libraries for generations (e.g. the Library of Congress). Under such systems, documents are classified according to different human-defined categories (e.g. a library book could belong to categories such as science, natural philosophy, natural history, and so forth).

Even though traditional archivists successfully employ such systems, they might not work so well for the Internet. How, for example, would one find the means and resources to go about placing the billions of pages on the Internet into little pigeonholes? What, moreover, would happen if many of these documents were relevant to more than one category (as most will inevitably be), or if your average Internet user didn’t have knowledge of the category their search belongs to?

Latent Semantic Indexing can be said to be a solution to the above problems in that it appears to offer a ‘middle ground’ between the two methods outlined above. LSI offers an exhaustive search that still understands context. Better still, LSI is entirely mechanical and doesn’t require human intervention.


3.4.3 How LSI works

In the last unit of this course, we pointed out that a search engine attempts to find relevant results for a search query by finding pages that contain the terms used in that query. For example, a search for ‘mobile phone accessories’ will return pages that actually mention the words ‘mobile’, ‘phone’, and ‘accessories’.

This system is not ideal, as it deems all pages that don’t contain our specific search term as irrelevant, even if those pages potentially contain information that is relevant to our search.

As Yu et al. suggest, LSI still takes account of the words a document contains, but it takes the extra step of examining the document collection as a whole to see which other documents contain the same words. If it finds other documents that contain those words, it considers them to be ‘semantically close’. ‘Semantically distant’ documents, by contrast, are documents that don’t have many words in common.

The important thing to note here is that, by calculating the similarity values between documents, LSI can actually find words that are semantically related. For example, if the terms ‘cars’, ‘automobiles’ and ‘vehicles’ appear together in enough documents on the Internet, LSI will consider them to be semantically related. Therefore, a search engine that uses LSI in its index will return pages that mention ‘vehicles’ when you search for ‘cars’.

In short, then, Latent Semantic Indexing enhances our searches by taking account of related terms. By looking at enough documents on the Internet, it can find which words are related to other words, or words that are synonymous with other words. A search engine that uses LSI can thereby return documents that are relevant to but outside of our specific search query.
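For the curious, here is a very small sketch in Python of how this ‘semantic closeness’ can be computed. It builds a toy term-document matrix and applies a truncated singular value decomposition (a technique we mention again in the conclusion); the terms and counts are invented purely for illustration and are nothing like a real web-scale index.

```python
import numpy as np

# Hypothetical toy term-document matrix: rows are terms, columns are documents.
# The counts are invented purely for illustration.
terms = ["cars", "vehicles", "automobiles", "poetry"]
term_doc = np.array([
    [2, 1, 0, 0],   # 'cars'
    [1, 2, 1, 0],   # 'vehicles'
    [0, 1, 2, 0],   # 'automobiles'
    [0, 0, 0, 3],   # 'poetry'
], dtype=float)

# Truncated singular value decomposition: keep only the k strongest 'concepts'.
U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
k = 2
term_vectors = U[:, :k] * s[:k]   # each term as a point in the reduced concept space

def cosine(a, b):
    """Similarity between two term vectors (1 = same direction, 0 = unrelated)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Terms that tend to appear in the same documents end up close together,
# even though they never share a literal string.
print(cosine(term_vectors[0], term_vectors[1]))   # 'cars' vs 'vehicles'  -> close to 1
print(cosine(term_vectors[0], term_vectors[3]))   # 'cars' vs 'poetry'    -> close to 0
```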


3.5 How search engines view web pages

As we noted above, this process does not require human intervention. There is nobody telling Google, for example, that ‘cars’ and ‘vehicles’ are related terms. Instead, LSI finds related terms all by itself simply by looking at enough documents.

LSI, in fact, is simply a statistical and mathematical computation that looks at word patterns across documents. It is not an Artificial Intelligence programme that gives Google a way to actually read documents as humans would. In fact, the search engine that uses LSA to index pages remains as stupid as ever in the sense that it cannot understand even the basic meaning of words.

But that is not to say that LSI doesn’t focus on word meaning. Nor does it pay attention to every single word on the page.

In every language, you have two different kinds of word:

  • content words - e.g. car, phone, liberty, celebrity, etc.
  • function words - e.g. and, but, to, the, etc.

In simple terms, the first kind of word has some kind of meaning for us (i.e., we can visualise what a car is or understand the concept of liberty), while the second doesn’t have the same kind of meaning (ask yourself, what is the meaning of ‘the’?). In other words, words can be divided into those that carry meaning and those that do not.

LSI works by stripping documents of function words and extraneous terms to focus on terms with semantic content. It is useful to know this, as it is what a search engine will be doing to the words on your web pages when it reads them.

In fact, the search engine employs what is known as a stop list in order to strip web pages down to a skeleton of content words. This stop list is a list of commonly used words - function words, common verbs, prepositions, etc. - which it removes from the page in order to focus on the words that carry the main meaning of the page. This greatly reduces the ‘noise’ on the page and helps the search engine determine what the page is about.

This is all part of a process the search engine performs upon web pages in order to determine the relevance of each page objectively. The process LSI performs upon web pages when indexing a document is as follows:

1) Linearisation
The search engine removes all markup tags (i.e. code) from a page so that its content is represented as a plain series of characters. The search engine moves through the page systematically, working from top to bottom and left to right, extracting content from tags as it finds it.

2) Tokenization
The search engine strips the page of formatting such as punctuation, capitalisation and markup.

3) Filtration
The search engine applies a stop list to remove commonly used words from the document. This leaves us with only content words.

4) Stemming
The remaining content words are then ‘stemmed’. That is to say that the remaining terms are reduced to common word roots (e.g. ‘techno’ for ‘technology’, ‘technologies’, ‘technological’).

5) Weighting
Weighting is the process of determining how important a term is in a document.
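To make the first four steps above a little more concrete, here is a deliberately simplified sketch of the pipeline in Python. The stop list and the crude suffix-stripping ‘stemmer’ are invented for illustration; a real search engine’s versions are far more sophisticated.

```python
import re

# Hypothetical miniature stop list and suffix list, for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "but", "to", "of", "is", "are", "for"}
SUFFIXES = ("ologies", "ology", "ological", "ies", "ing", "ed", "s")

def linearise(html: str) -> str:
    """1) Linearisation: strip markup tags, keeping only the text."""
    return re.sub(r"<[^>]+>", " ", html)

def tokenise(text: str) -> list[str]:
    """2) Tokenisation: lower-case and drop punctuation, leaving bare words."""
    return re.findall(r"[a-z]+", text.lower())

def filtrate(tokens: list[str]) -> list[str]:
    """3) Filtration: remove stop words, keeping only content words."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token: str) -> str:
    """4) Stemming: crude suffix stripping (a real engine uses a proper stemmer)."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

page = "<h1>Used Cars</h1><p>The best used cars and vehicles for sale.</p>"
content_words = [stem(t) for t in filtrate(tokenise(linearise(page)))]
print(content_words)   # ['used', 'car', 'best', 'used', 'car', 'vehicle', 'sale']
```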
 
3.5.1 Term Weighting

By ‘term weighting’ we mean the importance given to terms or words that appear in a document.

A search engine does not see all terms in a document as equally important (the use of a stop list, for instance, shows that the search engine treats common words, function words and non-content words as wholly unimportant). Similarly, the search engine does not treat the content words that remain after it has filtered a document as if they are all equally important.

According to Yu et al., the weighting of terms by the search engine is based on two ‘common sense insights’. Firstly, content words that are repeated in a single page are more likely to be significant than content words that appear only once. Secondly, words that are not used very often across the collection are likely to be more significant than words that are used a lot.

For example, if the word ‘aircraft’ appears a number of times in a single page, it is likely to be fairly significant to that document. Remember that a search engine can’t read, so a recurrence of such terms may just indicate roughly what that page is about.

However, if one takes a word that appears in lots of pages - say, a common content word - then it is treated as less significant. It would not, for example, be much help in allowing the search engine to distinguish between these pages in terms of their different content.

There are therefore three types of weighting employed by a search engine:

  • Local weight
  • Global weight
  • Normalisation

Normalisation simply refers to the process by which documents of different lengths are made to appear ‘equal’. If this did not occur, longer documents - which, of course, contain more keywords - would tend to outweigh or subsume shorter documents.

Local weight refers to the number of times a term appears in a document. A word that features numerous times in a single document will have a greater weight than a word that features only once. This is also known as term frequency (tf).

Global weight refers to the number of documents in the collection that feature the term; the fewer documents a term appears in, the higher its global weight. This is often referred to as inverse document frequency (IDF).

Keyword weighting is calculated according to the following equation:

weight = tf * IDF

where tf = term frequency and IDF = inverse document frequency.
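One common way of putting this into practice (though not necessarily the exact formula any particular engine uses) is to multiply a term’s frequency in the document by the logarithm of how rare it is across the collection. Here is a minimal sketch in Python with a made-up three-document collection:

```python
import math

def tf_idf(term: str, document: list[str], collection: list[list[str]]) -> float:
    """Local weight (tf) multiplied by global weight (IDF), in one common formulation."""
    tf = document.count(term)                            # local weight: occurrences in this document
    docs_with_term = sum(1 for doc in collection if term in doc)
    idf = math.log(len(collection) / docs_with_term)     # global weight: rarer terms score higher
    return tf * idf                                      # assumes the term appears somewhere in the collection

# Hypothetical collection of three already-filtered documents (content words only).
collection = [
    ["used", "car", "used", "car", "sale"],
    ["car", "insurance", "quote"],
    ["aircraft", "engine", "maintenance"],
]

print(tf_idf("used", collection[0], collection))   # frequent locally, rare globally   -> about 2.20
print(tf_idf("car", collection[0], collection))    # frequent locally, common globally -> about 0.81
```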


3.5.2 Weighting and distribution vs Keyword density

Although this material is fairly complex, as it appears to involve advanced linguistics and complex mathematical formulas, it is useful to have a basic grasp of it, as it has ramifications for the SEO process.

Traditionally, SEO professionals have focused on something called keyword density when dealing with term weighting. Keyword density is a measure of the number of times your keywords appear on a page relative to the total number of words. For example, if the keyword ‘cars’ appeared three times in a document that contained 100 words, the keyword density for that page would be 0.03, or 3% (3/100).
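As a trivial sketch, that calculation is simply:

```python
keyword_count = 3      # occurrences of 'cars' on the page
total_words = 100      # total words on the page
density = keyword_count / total_words
print(f"{density} ({density:.0%})")    # 0.03 (3%)
```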

Under the keyword density model, the more times a keyword appears on a single page, the more likely it is that the search engine will find you relevant for that keyword. Under this system, optimising your page simply involves increasing its keyword density by mentioning your keywords as many times as you can on a single page.

However, SEO professionals are beginning to realise that this is not how search engines work when they look at keywords or determine the importance of terms on a page. Keyword density only refers to the use of keywords on a per-page basis and not across the document collection as a whole. As Dr. E. Garcia points out, modern search engines also have to take into account the following factors when dealing with keywords:

  • proximity - the distance between keywords on a page
  • distribution - where keywords appear on a page

These factors have a direct bearing on what a document is about. For example, if the keywords ‘used’ and ‘cars’ have a close proximity, i.e. they appear on the page together as ‘used cars’, then that page is more likely to be about used cars. The same goes when one looks at where the keywords appear on a page (e.g. do they appear in titles and main headings and so forth?).

The concept of keyword density, by contrast, does not take into account the position of keywords in relation to each other on a page. If search engines actually used keyword density as a measure of the relevance of a page, they could potentially return pages that mention ‘used’ and ‘cars’ enough times no matter where they appeared on a page. For the sake of illustration, we could say that the following phrase might make a page relevant for the keyword search ‘used cars’:

‘I used to cycle to work a lot but most people drive their cars to get there.’


As you can see, this phrase takes no account of the proximity or distribution of keywords, both of which have an impact upon what the page is about.
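As a purely hypothetical illustration of why proximity matters, the sketch below measures how many words separate ‘used’ and ‘cars’ in the sentence above and in a more typical piece of sales copy; the function and the example page copy are invented for illustration.

```python
def keyword_gap(text: str, first: str, second: str) -> int | None:
    """Smallest number of words separating two keywords, or None if either is missing."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    first_positions = [i for i, w in enumerate(words) if w == first]
    second_positions = [i for i, w in enumerate(words) if w == second]
    if not first_positions or not second_positions:
        return None
    return min(abs(a - b) for a in first_positions for b in second_positions) - 1

phrase = "I used to cycle to work a lot but most people drive their cars to get there."
sales_copy = "We sell quality used cars at fair prices."   # invented page copy for contrast

print(keyword_gap(phrase, "used", "cars"))       # 11 words apart: unlikely to be about 'used cars'
print(keyword_gap(sales_copy, "used", "cars"))   # 0 words apart: the keywords form a single phrase
```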

Note: In future units of the course, we will occasionally refer to keyword density, as it is a term still used by SEO professionals and, as a concept, it still works as a suggestive way to get SEO beginners to start increasing the frequency with which they employ their keywords on web pages. However, it will pay to remember that search engines use a different system than keyword density for determining the importance of keywords on a page. Bear this in mind when we start showing you how to employ keywords in your own pages.


3.6 Conclusion

In this unit, we have only touched upon the basics of search engines and LSI. There are many other concepts that could be covered (including such things as singular value decomposition and the term-document matrix!).

The material we have covered so far, however, has been simplified to a number of main points that have a bearing on the basic principles of SEO. These points should help you form a basic understanding of:

  • the methods search engines are beginning to use for information retrieval
  • how search engines actually see web pages
  • how search engines weight keywords

By now you should have a more in-depth understanding of how search engines are beginning to index documents. It is useful to understand the principles behind LSI as we will refer back to it later in the course. As we will see later, LSI also has direct and practical implications for key areas of the SEO process such as deploying keywords, writing page copy, and constructing anchor text for external links.


SUMMARY:
In order to return more relevant results for the user, Search Engines like Google are beginning to use Latent Semantic Indexing to retrieve web pages.

Latent Semantic Analysis is a statistical method for determining the contextual meaning of words.

Latent Semantic Indexing helps get round the problems of synonymy and polysemy encountered when people search the internet.

Latent Semantic Indexing can return relevant pages that do not contain the actual terms of a keyword search.

Search Engines do not read every word on a web page. Instead they focus on content words.

Web pages are subjected to a complex process of linearisation, tokenisation, filtration, and stemming, whereby markup, punctuation, and the commonly used words on a stop list are stripped from the page.

The content words on web pages are weighted differently according to how frequently they appear on a page and in the collection of documents as a whole.

Keyword density is not an accurate measure of the importance of keywords on a page. Search engines actually use methods that look at keyword weighting and distribution.

REFLECT:

What do you understand by the following terms?

Latent Semantic Analysis
Polysemy and Synonymy
Content Word
Linearisation
Tokenization
Filtration
Stop List
Stemming
Weighting
Normalisation
Keyword Density
 