Using Latent Semantic Indexing for Information Filtering

Peter W. Foltz
Box 345, Dept. of Psychology
University of Colorado
Boulder, CO 80309

Abstract

Latent Semantic Indexing (LSI) is an information retrieval method that organizes information into a semantic structure that takes advantage of some of the implicit higher-order associations of words with text objects. The resulting structure reflects the major associative patterns in the data while ignoring some of the smaller variations that may be due to idiosyncrasies in the word usage of individual documents. This permits retrieval based on the "latent" semantic content of the documents rather than just on keyword matches. This paper evaluates the use of LSI for filtering information such as Netnews articles based on a model of user preferences for articles. Users judged articles on how interesting they were, and based on these judgments, LSI predicted whether new articles would be judged interesting. LSI improved prediction performance over keyword matching by an average of 13% and showed a 26% improvement in precision over presenting articles in the order received. The results indicate that user preferences for articles tend to cluster based on the semantic similarities between articles.

Published as: Foltz, P. W. (1990) Using Latent Semantic Indexing for Information Filtering. In R. B. Allen (Ed.) Proceedings of the Conference on Office Information Systems, Cambridge, MA, 40-47.

1. Introduction

While the greater availability of information in electronic form will bring wider access to information, it will also require users to do additional processing to decide what part of the information is actually useful to them. Currently, some users have access to a great many types of information through such things as commercial online services, news and financial wires, mail messages, electronic bulletin boards, and netnews articles. Although users of this electronic information have access to a rich body of information, only a small fraction of the information is actually relevant to any one user's interest. Thus, there is the problem of determining the part of the information that is of interest to the user while minimizing the amount of search through irrelevant information. One solution is to develop filters that selectively weed out the irrelevant information based on user preferences. This paper describes the results of a study of using an information retrieval method to structure and filter electronic information.

Filtering information is not a new concept; when we read standard paper texts, information filtering occurs. We buy only certain magazines, since others may contain information that is redundant or irrelevant to our interests. In this way we filter out some of the large amount of information to which we have access. Within any particular magazine, we also expend some effort choosing which articles appear relevant to our interests. Thus, information filtering is something people do continually whenever they acquire information. With the advent of electronic presentation of information, some of that filtering need no longer be done by us, but could be done automatically by the system that presents the information.

While automatic filtering of information sounds like a wonderful vision, there are many difficulties in determining what information a person would actually want to see. One issue is creating an accurate user model. The better the model of what types of information a user is interested in, the better the prediction of what information should be presented. Nevertheless, there are many problems in developing a good model of a user's interests. One of the primary problems is determining what the factors of interest in the information are. Allen (in press) tested four emotional factors based on Schank (1979) (Death, Ambition, Romance, and Disease) for their ability to predict reader interest in news articles, but found no clear pattern of correlations. Nevertheless, other factors, such as familiarity, novelty, importance, or urgency, may be useful in predicting what information a user might want to see.

Even with a clear idea of what factors are important to predicting interest, there is no guarantee that those factors can be identified easily. One of the simplest methods of determining whether information matches a certain user model is keyword matching. If a user's interests are clearly indicated by certain words, then information containing those words should be relevant. Nevertheless, keyword matching can still fail due to polysemy (more than one meaning for a single word) and synonymy (many ways of referring to the same concept). Furnas, Landauer, Gomez & Dumais (1983) showed that two people use the same word to describe an object only 10 to 20% of the time. This suggests that keyword matching in texts may fail. Thus, the success of a user model for automatic information filtering will depend in part on its ability to determine which parts of the text actually allow prediction of the user's interest.

1.1 Information filtering systems and methods

There have been several studies and systems developed to test the abilities of information filtering. Allen (in press) conducted a series of experiments to test user models in predicting preferences for news articles. Through the analysis of which texts are read, and what is in those texts, a user model is developed. The models create a structure to represent the information based on user preferences. For predicting which articles would be read based on previous articles read, Allen used a measure of the overlap of nouns between the new and old articles. While the predictions were better than chance, the average correlation between the retrieved articles and the users' preferences was low. Overall, from looking at several models, he found that predictions were successful for user preferences for general categories of articles, but not very good for predicting preferences for specific articles. These results suggested that user models might need to take into account many different factors, or that prediction of news articles is simply very difficult.

A somewhat different approach is to let the user structure the information rather than having the computer develop a model based on the user's past preferences. The Information Lens system (Malone, Grant, Lai, Rao, & Rosenblitt, 1987; Mackay, Malone, Crowston, Rao, Rosenblitt, & Card, 1989) allows users to create rules to filter mail messages based on keyword matches in the mail fields. Since there is already some implicit structure in mail messages, such as sender information in a sender field and keywords in a keyword field, these rules can take advantage of this structure to perform user-specified actions on the messages. Thus, a certain rule may take the form of deleting all messages from a certain person, or labeling messages with certain keywords as urgent. Mackay et al. performed an evaluation of Information Lens users and found that people without much computer experience were able to create their own rules to prioritize and filter their mail messages. Their results showed that the largest percentage of rules were created to match on information about the sender and the recipient, while fewer rules were created to match on textual information such as the subject and text body. Through these rules, users are creating their own user model to handle the information.

Infoscope, a system that works in a similar way to the Information Lens, has been developed by Stevens and Fischer (Stevens, 1989; Fischer, Foltz, Kintsch, Nieper-Lemke & Stevens, submitted). It permits users to create matching rules that sort the information into different bins for easier access. The system then uses intelligent agents to analyze and suggest rules to aid in structuring the information into a directed graph. Another recent system, developed by Anderson, Nielsen & Rasmussen (1989), uses hypertext-like links between similar documents. It creates links between articles that comment on each other or that have a high degree of similarity based on a weighted keyword match.

While a domain such as mail messages contains enough implicit structure to allow these rules to succeed, a less structured domain such as news articles (i.e. one without fields) could only use the rules to match on words in the text. As suggested above, polysemy and synonymy pose problems for direct keyword matching. This problem may be greater in domains with less structuring information, since the structuring information in domains such as electronic mail also gives clues about the correct semantics to supply. In these less structured domains, it may be useful to let the computer structure the domain. While systems that let the user structure the information, like Information Lens, allow users to create their own user model, they also have the cost that the user must take an active part in the structuring. Automatic structuring of the information would take some of the burden off the user; however, it does mean that the user model is hidden from the user, making it more difficult for the user to evaluate and modify.

1.2 Information Retrieval

Conventional information retrieval (IR) is very closely related to information filtering in that both have the goal of retrieving information relevant to what a user wants while minimizing the amount of irrelevant information retrieved (e.g. Salton & McGill, 1983; van Rijsbergen, 1979). The primary difference is that in conventional information retrieval a user's query is typically a brief description of a certain type of information, while in information filtering, the query needs to be a much larger description of all the types of information the user could want. Some research in IR is now moving towards incorporating user models to improve retrieval (Brooks, Daniels, & Belkin, 1985). The research in this paper uses an information retrieval technique, Latent Semantic Indexing (LSI), to filter information based on user preferences.

1.3 Latent Semantic Indexing

Latent Semantic Indexing (Deerwester, Dumais, Furnas, Landauer & Harshman, in press; Dumais, Furnas, Landauer, Deerwester, & Harshman, 1988) takes advantage of the implicit higher-order structure of the association of terms with articles to create a multi-dimensional semantic structure of the information. Through the pattern of co-occurrences of words, LSI is able to infer the structure of relationships between articles and words. A singular-value decomposition (SVD) of the term-by-article association matrix is computed, producing a reduced-dimensionality matrix containing the best k orthogonal factors to approximate the original matrix; this approximation serves as the model of the "semantic" space for the collection. This semantic space reflects the major associative patterns in the data while ignoring some of the smaller variations that may be due to idiosyncrasies in the word usage of individual documents. In this way, LSI produces a representation of the underlying "latent" semantic structure of the information. Retrieving information with LSI overcomes some of the problems of keyword matching by basing retrieval on the higher-level semantic structure rather than just the surface-level word choice. Research on LSI (Deerwester, Dumais, Landauer, Furnas & Beck, 1988; Dumais, 1989) shows that retrieval of relevant documents is significantly improved over keyword matching; a more complete description of the model may be found in these papers.
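
To make the decomposition concrete, the sketch below builds a toy term-by-article count matrix and keeps the k largest singular values. This is a minimal illustration in modern terms, not the original implementation; the matrix, the value of k, and all variable names are invented for the example.

```python
import numpy as np

# Toy term-by-article count matrix X: rows are terms, columns are articles.
# In practice the entries would be weighted term frequencies.
X = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 1, 2],
], dtype=float)

# Singular-value decomposition: X = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keeping only the k largest singular values gives the best rank-k
# least-squares approximation of X; its factors define the semantic space.
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Each column of diag(s_k) @ Vt_k gives an article's coordinates
# in the k-dimensional semantic space.
article_coords = (np.diag(s_k) @ Vt_k).T   # shape: (n_articles, k)
```

In the experiments reported below, k is on the order of 200 factors rather than the 2 used in this toy example.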

1.4 Filtering using LSI

Because LSI results in a space based on underlying semantic similarity, articles on similar topics should tend to cluster in the space. This clustering is the basis for using LSI for filtering. In general, the idea is to create a semantic space from a set of articles that have previously been judged by a user to be interesting or not. To determine whether a new article is relevant, it is first folded into the semantic space on the basis of the terms it contains. If the article appears close to other interesting articles in the space, then it would be considered likely to be interesting to the user. Conversely, if that article appears close to other non-interesting articles, then it would be considered not interesting to the user.
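
The folding-in step can be sketched as follows, using the standard LSI projection: a new article's term vector d is mapped to d · U_k · diag(s_k)^-1. The shapes, the stand-in factors, and the names below are invented for illustration, assuming numpy.

```python
import numpy as np

def fold_in(term_vector, U_k, s_k):
    """Project a new article's term vector into the k-dimensional
    LSI space: d_hat = d @ U_k @ inv(diag(s_k))."""
    return term_vector @ U_k / s_k

def cosine(a, b):
    """Cosine similarity between two vectors in the semantic space."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Invented example: 4 terms, k = 2. U_k would come from the SVD of the
# scaled collection; here it is replaced by a stand-in orthonormal matrix.
rng = np.random.default_rng(0)
U_k = np.linalg.qr(rng.normal(size=(4, 2)))[0]
s_k = np.array([3.0, 1.5])

new_article = np.array([1.0, 0.0, 2.0, 0.0])   # term counts of a new article
new_coords = fold_in(new_article, U_k, s_k)

judged_article = np.array([0.4, -0.2])          # coordinates of a judged article
similarity = cosine(new_coords, judged_article)
```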

This type of filtering differs greatly from the methods of Information Lens and Infoscope, in that in both of those systems the user structures the space. Using LSI, the space is structured entirely from the semantics of the information. Both relevant and non-relevant information is stored in the space, providing rich examples both of information that a user wants and of information that a user does not want. Since a person may be interested in many different types of articles, there may be multiple clusters of relevant documents. Of course, one of the primary assumptions of this method is that if a user finds a particular article interesting, other articles that are semantically similar, and thus near it, will also be interesting. This may not always be the case, for a variety of reasons. A user may find one article interesting but, after reading it, may have learned enough about that subject and not need to read other articles on it. Nevertheless, we do tend to use information within certain domains while ignoring others, so similarities between articles will provide some measure of interest.

2. Method

The method of using an article as the query to determine its relevance differs in several respects from standard information retrieval. Typically, in information retrieval, an ordered list of articles is retrieved based on their closeness to a query. For filtering, a new article is used as the query, and it is not important how well the article retrieves other articles in an ordered list, but rather whether the top articles in that ordered list were previously judged relevant. Thus, the filtering method returns an ordered list of articles and whether or not each of those returned articles was interesting.

Ideally, a new relevant article would fall within a cluster of other relevant articles, so the top of the ordered list would contain many other relevant articles. In this case, the returned list of closest articles would all be relevant articles and the system could then decide that the new article must also be relevant. However, in reality, any cluster of articles may contain both relevant and non-relevant articles. Therefore, it is necessary to develop measures that determine whether a new article is relevant based on some characteristic of what is returned. Two plausible criteria for such a measure are that a new article appearing close to any relevant article is probably relevant, and that a high ratio of relevant to non-relevant articles close to a new article indicates that the new article is probably relevant. The utility of these two factors was investigated in this research.
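
Both criteria are easy to state over the folded-in coordinates. The sketch below is one plausible reading of them, with invented function names and data; the exact ratio computation used in the study may differ (the fraction computed here orders articles the same way as the ratio of relevant to non-relevant, since f/(1-f) is monotone in f).

```python
import numpy as np

def cosines_to(new_vec, coords):
    """Cosine between a new article's coordinates and every scaled article."""
    norms = np.linalg.norm(coords, axis=1) * np.linalg.norm(new_vec)
    return coords @ new_vec / norms

def nearest_relevant_score(new_vec, coords, relevant):
    """Criterion 1: cosine to the closest previously judged relevant article."""
    return cosines_to(new_vec, coords)[relevant].max()

def relevant_fraction_score(new_vec, coords, relevant, threshold=0.65):
    """Criterion 2: fraction of relevant articles among all scaled articles
    whose cosine with the new article exceeds the threshold."""
    near = cosines_to(new_vec, coords) >= threshold
    return relevant[near].mean() if near.any() else 0.0

# Invented data: three scaled articles in a 2-dimensional space.
coords = np.array([[0.9, 0.1], [0.8, 0.3], [0.1, 0.9]])
relevant = np.array([True, False, True])
new_vec = np.array([0.85, 0.2])

print(nearest_relevant_score(new_vec, coords, relevant))
print(relevant_fraction_score(new_vec, coords, relevant))
```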

2.1 Data collection

Five readers of Netnews and one reader of local electronic bulletin board messages were recruited. Over a two-week period they read articles as they normally did, but after they saw each article, the computer paused and asked them to rate whether the article was interesting or not. They were asked to read the newsgroups they normally read at the times they normally would. They were further asked to rate the articles on interest, where interesting meant that they would like to see similar articles. While prompting for a judgment after each article may not be an ideal method for an actual information filtering system, it served its main purpose of collecting interest data. Other methods, such as measuring the amount of time spent per article or the number of pages read, may provide less obtrusive ways of determining the user's interest. The subjects read an average of 681 articles over the two-week period, in an average of 16 newsgroups. Overall they judged 28% of the articles as interesting, with a range of 13.4% to 63.7% across the different newsgroups. This range indicates that there were very different interest thresholds for different people reading on different topics.

The data from three readers, each on one newsgroup, and from the one reader of the electronic bulletin board messages were chosen for further analysis. These sets were chosen because they all contained enough articles (greater than 250) to create a sufficiently large space in the scaling. For each data set, approximately the first 80 percent of the articles were scaled using LSI. For each scaling, a term frequency weighting was used within each article, and an entropy weighting was used between articles (a sketch of one common form of this weighting appears after Table 1). These weighting schemes were chosen based on the results of Dumais (1989), which showed very good performance. The remaining approximately 20 percent of the articles were then used as the queries to test the model's ability to predict whether they were judged interesting or not. Table 1 shows the number of articles in the scaled and test sets and the percentage judged relevant for each data set.

Initially three methods were tested as models of the user preferences. The first two methods returned the cosine between a new article used as a query and its nearest neighbor that was judged relevant. A new, relevant article should appear near some other relevant article and should therefore have a higher cosine than a new non-relevant article. This prediction was tested with various reduced dimensionalties of LSI and for the full keyword space. These results should then give an indication as to whether LSI succeeds better than keyword matching for similarity matching in filtering articles. The third method used as a criterion the ratio of relevant to non-relevant articles within a particular cosine of the new article using LSI. This should then give an indication whether relevant articles tend to cluster together.

Data set          # of articles    # of articles    Overall percentage of
                  in scaling       in test set      articles judged relevant

Bboard msgs            233               45                 39.9
Comp.windows.x         219               50                 27.5
Soc.women              250               50                 64.0
Rec.ham-radio          344               73                 21.6

Table 1. Number of articles scaled and tested, and overall percentage of articles judged relevant for the four data sets.
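
The term-frequency and entropy weighting mentioned above can be sketched as follows: raw term frequency within each article, multiplied by an entropy-based global weight per term. This is one common form of the scheme; the exact formula used in the study follows Dumais (1989) and may differ in detail, and the function name is invented.

```python
import numpy as np

def entropy_weighted(X):
    """Weight a term-by-article count matrix: raw term frequency locally,
    an entropy-based global weight per term (one common variant)."""
    X = np.asarray(X, dtype=float)
    n_docs = X.shape[1]
    gf = X.sum(axis=1, keepdims=True)          # global frequency of each term
    p = np.divide(X, gf, out=np.zeros_like(X), where=gf > 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    # g_i = 1 - H_i / log(n_docs), in [0, 1]: terms concentrated in a few
    # articles get weights near 1, terms spread evenly get weights near 0.
    g = 1.0 + plogp.sum(axis=1) / np.log(n_docs)
    return X * g[:, None]
```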

The ability to predict the relevance of new articles based on the scaling of the old articles was tested using a precision-recall measure (e.g. Salton & McGill, 1983). Recall measures the proportion of relevant articles retrieved in response to a query out of all the relevant articles, while precision measures the proportion of retrieved articles that are actually relevant. This measure was used in a way that differs slightly from the standard in IR. In IR, precision-recall measures can be obtained for each query based on a ranking of all the documents returned. In this study, we treat each new article as a query and obtain a single cosine measure of its distance to the nearest relevant previously scaled article. These cosines, each associated with one new article, are then ranked, and a precision-recall measure is computed on this ranked list over all new articles. This ranking of documents in the order of predicted relevance differs from the binary cutoff of Malone's Information Lens, in which a document is shown only if it matches the user's criteria. Such a ranking would permit users to work their way down the ranked list of articles and set their own cutoff criterion.
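
The evaluation can be sketched as follows: rank the new articles by their scores, then read precision off the ranked list at fixed recall levels. This is a plausible reconstruction with invented names; details such as interpolation may differ from the original study.

```python
import numpy as np

def precision_at_recall(scores, relevant, levels=(0.25, 0.50, 0.75)):
    """Rank new articles by score (e.g., cosine to the nearest relevant
    neighbor) and return interpolated precision at given recall levels."""
    order = np.argsort(scores)[::-1]           # highest scores first
    rel = np.asarray(relevant, dtype=float)[order]
    hits = np.cumsum(rel)
    recall = hits / rel.sum()                  # assumes at least one relevant
    precision = hits / np.arange(1, len(rel) + 1)
    # Interpolated precision: best precision at or beyond each recall level.
    return {lv: precision[recall >= lv].max() for lv in levels}

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5])
relevant = np.array([True, True, False, True, False])
print(precision_at_recall(scores, relevant))
```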

3. Results

3.1 Nearest relevant neighbor method

Ranked lists of the articles from the test set were computed using LSI at a variety of different dimensionalities and with keyword matching. The number of dimensions was varied in order to determine whether a model with many or few factors best predicts the data. From these ranked lists, precision-recall curves were then computed. Table 2 shows the average precision at 3 levels of recall (.25, .50 and .75) for keyword matching and for the LSI scaling using the number of dimensions that resulted in the highest average precision. As baseline measures, the table also shows the average precision for chance performance, equivalent to a random ordering of the articles, and the average precision for the articles if they were read in the order in which they were received.

Data set          Chance        Temporal   Keyword    Best LSI scaling    % improvement over
                  performance   order      matching   (# of dimensions)   keyword matching

Bboard msgs          .40          .44        .77        .77 (233)               0%
Comp.windows.x       .28          .42        .39        .44 (200)             +13%
Soc.women            .64          .55        .65        .74 (190)             +14%
Rec.ham-radio        .22          .34        .40        .50 (240)             +25%

Table 2. Average precision for chance performance, temporal order, keyword matching, and the best LSI scaling for the four data sets.

The average precision values for the data sets are substantially better than chance performance, indicating that both the keyword and the LSI orderings tend to be better than an ordering based on when the documents arrived. Table 2 also shows that LSI offers a large improvement over keyword matching for three out of the four data sets. This indicates that LSI is capturing more of the semantic structure that is useful in predicting preferences than keyword matching does. The numbers of dimensions for the best LSI performance are all fairly high, in a range of 69 to 91% of the total number of articles. Thus, it seems that this type of information works best when categorized on many factors, which likely results in many small clusters of articles.

Figure 1 shows precision plotted against recall for 3 different numbers of dimensions and for keyword matching for one of the data sets. At low dimensionality, the precision is very low, but it increases with the number of dimensions. Keyword matching, which is equivalent to an LSI scaling with the number of dimensions equal to the number of documents, consistently has a lower precision than the best LSI scaling with 190 dimensions.

Figure 1. Precision vs. recall for 3 levels of dimensions and keyword matching for the data from the news group Soc.women.

The fact that LSI did no better than keyword matching for one data set may be due to a variety of factors. It is possible that no meaningful structure was captured by LSI as compared with the information lost in dimension reduction. This could be the case in an information space that uses a precise and limited vocabulary. Nevertheless, the average precision for both the keyword and the LSI method still shows better than a 100% improvement over chance performance, indicating that this type of ordering of the information will improve a user's ability to filter it.

3.2 Ratio of relevant to non-relevant neighbors method

Some preliminary analyses have been completed on a measure of the ratio of relevant to non-relevant articles near any new article. It is possible that a non-relevant new article may be very near a relevant article in the space while all the other nearby articles are non-relevant. In this case, the nearest relevant neighbor method described above would fail, but some measure of the density of relevant and non-relevant documents near a new article would succeed. For this method, the new articles were ordered based on the ratio of the number of relevant to non-relevant documents near each article. Using this ratio for the 10 closest articles to a new article, with the number of dimensions that resulted in the best LSI scaling in the earlier analyses, the average precision was about 10% worse than using keyword matching in the nearest relevant neighbor method. However, using this ratio for the articles within a cosine of .65 of the new article did result in an improvement of 19% over keyword matching, and of 5% over the nearest relevant neighbor method, on one of the data sets using 30 dimensions. This suggests that measuring the distribution of relevant items within clusters may also succeed in predicting user preferences. A more sensitive analysis of these clusters should be done using a measure of the distribution of relevant articles (e.g. Kane-Esrig, Casella, Streeter, & Dumais, 1989).

4. Conclusions

The results of this study are encouraging. Overall, treating new articles as queries to be matched against the relevance judgments of previously judged articles succeeded in predicting the relevance of the new articles. Latent Semantic Indexing tended to work better than keyword matching at making these predictions. This indicates that LSI is capturing useful structure in word use. The fact that LSI works best with a large number of factors for these information spaces indicates that simple classifications of the information will not be as successful as classifications that take into account a wide variety of factors. Overall, LSI serves as a good model of people's preferences for articles.

This research is partially predicated on the hypothesis that user preferences will be similar for articles that are semantically similar. There are cases in which this will not be true. A user may tire of one type of information, having read enough on that topic, or may have already acquired the information from a previous article. A good example of this is the AP news wire. The same article may be broadcast many times throughout the day with only minor changes such as clarifications or spelling corrections. Just because a user indicates interest in the first broadcast does not mean that he or she wants to see it repeated with slight changes. Another problem with modeling user interests is that human judgments are not always consistent. Factors such as a user's particular task, mood, and amount of available time will help determine what information the user would prefer to see. While no model is able to take all these factors into account, by ordering the information based on predicted user preference, users would at least be able to decide on their own cutoff point. However, ordering the information solely on interest would not work well for temporally dated information such as ongoing conversations in Netnews. Since the ordering of articles is based entirely on the degree of match to the user's previous preferences, temporal order will be lost. For this type of information, it may be more useful to use a method that extracts order information, as in Anderson et al. (1989).

As this research is still in a preliminary stage, there remains more work to be done to improve filtering with LSI. One direction is to investigate the structure of human preferences for articles. The results from this study give some indication that LSI creates useful clusters for categorization, but we need to look more closely at what exactly those clusters are, and at whether they can be adjusted to allow better retrieval of relevant information. It would also be useful to examine how user preferences may change over time. Currently, the scaling of the original articles in LSI remains static. By folding in new articles and removing old ones, the structure would have the ability to change as the user's preferences change. This would be akin to a model of human cognition in which there is a decay in the ability to access older, unused information (e.g. Anderson & Milson, 1989).

One further step with this research is to test it in a real user environment, as in Mackay et al. (1989). This involves developing a user interface for the system and finding a better way to determine user preferences. The current method of having users rate each article as interesting or not interesting provides a lot of data, but also requires a lot of effort on the part of the user. Possible solutions may be to monitor how much of an article is read by the user, or to have the user just occasionally indicate articles that they deem good examples of what they like and do not like. While there remains more research to be done in evaluating LSI as an information filter, the results appear quite promising.

5. Acknowledgements

This research was completed while the author was at Bell Communications Research. He thanks Michael Littman for his considerable help in creating the software to monitor user preferences. He also thanks Michael Littman, Susan Dumais, and Thomas Landauer for their help and comments on the development of the project.

6. References

Allen, R. (In press) User models: Theory, method and practice. International Journal of Man-Machine Studies.

Anderson, J. R., & Milson, R. (1989) Human memory: An adaptive perspective. Psychological Review. 96, 4, 703-719.

Anderson, M. H., Nielsen, J., & Rasmussen, H. (1989) A similarity-based hypertext browser for reading the Unix network news. Working paper JN-1989-18.2, Technical University of Denmark. October.

Brooks, R. M., Daniels, P. J., & Belkin, N. J. (1985). Problem descriptions and user models: Developing an intelligent interface for document retrieval systems. Advances in Intelligent Retrieval, Proceedings of Informatics, 8.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (In press). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science.

Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Beck, L. (1988). Improving information retrieval using Latent Semantic Indexing. Proceedings of the 1988 annual meeting of the American Society for Information Science.

Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S., & Harshman, R. (1988). Using latent semantic analysis to improve access to textual information. Proceedings of the Conference on Human Factors in Computing Systems, CHI, 281-286.

Dumais, S. T. (1989). Enhancing performance in LSI (Latent Semantic Indexing) Retrieval. Bellcore Technical memo.

Fischer, G., Foltz, P. W., Kintsch, W., Nieper-Lemke, H., & Stevens, C. (Submitted). Personal information systems and models of human memory.

Furnas, G. W., Landauer, T. K., Gomez, L. M., & Dumais, S. T. (1983). Statistical semantics: Analysis of the potential performance of keyword information systems. Bell System Technical Journal, 62(6), 1753-1806.

Kane-Esrig, Y., Casella, G., Streeter, L. A., & Dumais, S. T. (1989). Ranking documents for retrieval by Bayesian modeling of a relevance distribution. In Proceedings of the 12th IRIS (Information Systems Research in Scandinavia) Conference, August 1989.

Mackay, W. E., Malone, T. W., Crowston, K., Rao, R., Rosenblitt, D., & Card, S. K. (1989). How do experienced Information Lens users use rules? Proceedings of the ACM CHI '89 Conference on Human Factors in Computing Systems, 211-216.

Malone, T. W., Grant, K. R., Lai, K. Y., Rao, R. & Rosenblitt, D. R. (1987). Semistructured messages are surprisingly useful for computer-supported coordination. ACM Transactions on Office Information Systems, 5(2), 115-131.

Salton, G., & McGill, M. J. (1983). Introduction to Modern Information Retrieval. McGraw Hill.

Schank, R. C. (1979). Interestingness. Artificial Intelligence, 12, 273-297.

Stevens, C. (1989). Large information spaces without a priori structuring mechanisms. Paper presented at the Vail workshop for the Human Computer Interaction Consortium.

van Rijsbergen, C. J. (1979). Information Retrieval. London: Butterworths.