Newsfilter project: framework and specifications

Version 0.70. Last updated 5 October 1999.

Note: This document is a description of, and a working reference for, the Newsfilter project, which aims to create a public, open-source service for quality-based filtering and recommendation of Usenet articles. For more information please review the Newsfilter group discussion archives and the references at the end of this document. Please contact Sasha Chislenko with any suggestions or criticisms.

Project Goal

The project is intended to create a technical framework for collaborative document filtering and recommendations, an information infrastructure, and a basic set of services that use people's assessments of online documents to improve navigation, and to apply them to Usenet messages.

Approach and aspirations

The project aims to create a set of open standards for storage and transmission of semantic encodings and client user interfaces, and the first implementations of crucial architectural components. This will create an easy-to-use infrastructure that should allow further rapid development of the system and smooth integration of additional services.

So far, Net tools have concentrated on information storage, transmission, and representation functions, while all semantic analysis has been done by humans. Now we can build another level of standards for representing, processing, and targeting documents based on their semantic encoding, and elevate the Web to a new level. In the resulting environment, the development of intelligent agents and symbolic AI may become as profitable for third-party services as text retrieval became with the invention of simple document-storage systems.

Project structure and deliverables

The project aims to design and deliver the first implementations of:

General scheme and information flows

In the traditional system, the user exchanges content with the content repository, querying it to obtain relevant documents. The content search engine knows nothing about the user. The client interface stores some simple user settings and allows browsing and querying of the content repository and posting of user messages.

The ratings-enriched system introduces additional elements: the user's own ratings profile, a general ratings repository, and a recommendation service. The right (ratings) wing of the following diagram of the ratings-based system is very similar to the left (content) wing, except that the advice it gives is based on the user-expressed semantics of the documents rather than their content.

             Content             Profile (Ratings)        
            repository            repository         
               |                    |        
               |                    |        
            Content                 |               
            search              recommendation        
            engine                service        
                \                   /          
                 \                 /           
                  \               /        
                   \             /        
                   Client interface        
                      User [+ user profile]        

An important concept in the proposed architecture is that of an advisor. An advisor is a human or machine generator of message ratings that the user decides to rely upon. A user can have multiple advisors whose recommendations may be combined. Each advisor has a named area of expertise and a reputation weight reflecting the relative utility of their recommendations to this user.

Every user can be an advisor if he

1) enters ratings
2) has anybody who is willing to follow his opinions.

The advisors can also be automatic (a kill file or an imported spam filter) or synthetic (e.g., an average of all human advisors is the "community" advisor, and can have a name like "").
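
As a rough illustration of the three kinds of advisors, the sketch below treats an advisor as anything that can return a rating for a message ID. The `Advisor` record, the 0-to-1 rating scale, the use of `None` for "no opinion", and the kill-file pattern format are all assumptions made for this example; none of them is fixed by the project yet.

```python
import re
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Optional

# A rating function returns a score in [0, 1], or None when the advisor
# has no opinion on the message. The scale is an assumption of this sketch.
RatingFn = Callable[[str], Optional[float]]

@dataclass
class Advisor:
    area: str      # named area of expertise, e.g. "humor"
    weight: float  # reputation weight assigned by this user
    rate: RatingFn

def human_advisor(ratings: dict[str, float]) -> RatingFn:
    """Human advisor: looks up ratings the person entered, by message ID."""
    return ratings.get

def kill_file_advisor(patterns: list[str], texts: dict[str, str]) -> RatingFn:
    """Automatic advisor: rates a message 0.0 if its text matches a pattern."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    def rate(message_id: str) -> Optional[float]:
        text = texts.get(message_id)
        if text is None:
            return None
        return 0.0 if any(c.search(text) for c in compiled) else None
    return rate

def community_advisor(humans: list[RatingFn]) -> RatingFn:
    """Synthetic advisor: plain average over the human advisors who rated."""
    def rate(message_id: str) -> Optional[float]:
        scores = [s for s in (h(message_id) for h in humans) if s is not None]
        return mean(scores) if scores else None
    return rate

# Hypothetical usage: two human advisors feeding one synthetic advisor.
alice = human_advisor({"msg1": 0.8, "msg2": 0.2})
bob = human_advisor({"msg1": 0.6})
community = Advisor(area="humor", weight=1.0,
                    rate=community_advisor([alice, bob]))
```

Because all three kinds share one calling convention, the aggregation step does not need to know whether a rating came from a person, a filter, or an average.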

The aggregation of advisor ratings (in a given named "sense" or area, e.g. "humor") is relatively simple, except for the confidence calculation.

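Suppose that Wt(A) is the weight of advisor A, Rating(A,M) is the rating that advisor A gave to message M, and Conf(A,M) is the confidence of that rating.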

Then the aggregated rating can be computed as an average rating by all advisors, taking weights and confidences into account:

       Sum over A [ Wt(A) * Rating(A,M) * Conf(A,M) ]
R(M) = ----------------------------------------------
            Sum over A [ Wt(A) * Conf(A,M) ]
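
The weighted average above can be sketched in a few lines. The function name, the tuple layout, and the choice to return `None` when no advisor contributes are assumptions of this example, not part of the specification.

```python
from typing import Optional

def aggregate_rating(
    opinions: list[tuple[float, float, float]],
) -> Optional[float]:
    """Weighted average of advisor ratings for one message.

    Each opinion is (weight, rating, confidence) for one advisor A,
    i.e. Wt(A), Rating(A, M), Conf(A, M) in the formula.
    """
    numerator = sum(w * r * c for w, r, c in opinions)
    denominator = sum(w * c for w, _, c in opinions)
    if denominator == 0:
        return None  # no advisor with nonzero weight and confidence
    return numerator / denominator

# Hypothetical input: a trusted, confident advisor and a weaker, unsure one.
score = aggregate_rating([(2.0, 0.9, 1.0), (1.0, 0.3, 0.5)])
```

An advisor with zero weight or zero confidence drops out of both sums, so it cannot move the result.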

Confidence computation is more complex, and depends on all advisor confidences, weights, the number of advisors that suggested their ratings, and the diversity/deviation of their opinions. The exact formulas can be selected based on statistical analysis of the recommendations, to optimize the recommendation quality (the accuracy of predicting the user's ratings).
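
Since the exact formula is left open, the following is only a placeholder heuristic showing how the listed factors could be combined. The product form, the smoothing constant `k`, and the assumption that ratings lie in [0, 1] are all invented for this sketch and would need to be replaced by whatever the statistical analysis recommends.

```python
import math

def aggregate_confidence(
    opinions: list[tuple[float, float, float]], k: float = 2.0
) -> float:
    """Heuristic confidence for an aggregated rating.

    opinions: (weight, rating, confidence) per advisor, ratings in [0, 1].
    Combines three factors named in the text:
      - weighted mean of the advisors' own confidences,
      - agreement (1 minus the normalized weighted deviation of ratings),
      - coverage (grows toward 1 as more advisors contribute).
    """
    total_w = sum(w for w, _, _ in opinions)
    if total_w == 0:
        return 0.0
    mean_conf = sum(w * c for w, _, c in opinions) / total_w
    mean_rating = sum(w * r for w, r, _ in opinions) / total_w
    variance = sum(w * (r - mean_rating) ** 2 for w, r, _ in opinions) / total_w
    # 0.5 is the largest possible standard deviation on a [0, 1] scale.
    agreement = 1.0 - min(1.0, math.sqrt(variance) / 0.5)
    coverage = len(opinions) / (len(opinions) + k)
    return mean_conf * agreement * coverage
```

With this form, advisors who fully disagree drive the confidence to zero, and adding more agreeing advisors raises it.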

Basic operations of the service

The system should allow the following operations:

Data structures

The "semantic data" should represent features of users, advisors, and messages, as well as their relations.
Data will be kept in standard records (database or XML) that allow easy extension.

Sample data formats

The two basic types of data records are object description records and relation (rating) description records.

Object [User/advisor/message] description record

An object profile consists of multiple object records, describing various features of users, advisors, and messages, such as name, age, preferred language, URNs, etc. Each record has the following structure:

Relation record

Relation records store user and advisor ratings as well as advisor records. The confidence reflects the degree to which the source is confident that the relation value is correct. The confidence may be higher if the record was derived by combining a large number of opinions of reliable agents that agreed on this value, and lower if there were only a few not very reliable agents that deviated from each other, if the record was derived implicitly, etc.

The reason for storing confidence explicitly is that different users have different degrees of tolerance to false positive and false negative recommendations.
Also, people sometimes can be interested in messages with low confidence as these indicate controversial or under-researched objects.
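
To make the role of the stored confidence concrete, here is a minimal sketch of a relation record and two ways a client might use it. The field names (`source`, `target`, `sense`, `value`, `confidence`) and the threshold parameters are hypothetical; the actual record layout is still to be specified.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RelationRecord:
    source: str        # who issued the rating (user or advisor ID)
    target: str        # what is rated (message ID or advisor ID)
    sense: str         # named area of the rating, e.g. "humor"
    value: float       # the rating itself
    confidence: float  # 0.0 (pure guess) to 1.0 (certain)

def select(records, min_value: float, min_confidence: float):
    """Recommendations this user is willing to act on.

    A user who dislikes false positives raises min_confidence;
    a user who dislikes false negatives lowers it.
    """
    return [r for r in records
            if r.value >= min_value and r.confidence >= min_confidence]

def controversial(records, max_confidence: float):
    """Low-confidence records: controversial or under-researched objects."""
    return [r for r in records if r.confidence <= max_confidence]

# Hypothetical records for two messages.
records = [
    RelationRecord("advisor-1", "msg-1", "humor", 0.9, 0.8),
    RelationRecord("advisor-1", "msg-2", "humor", 0.9, 0.2),
]
```

Keeping the confidence in the record, rather than folding it into the value, is what lets two users apply different thresholds to the same data.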

Data repositories

The data records may be stored in databases that may serve records on request, or published as standard formatted files.

Data requests and transports

The data transport mechanism transfers semantic data, content, and requests between data repositories, knowledge servers, and user client software.
The transport can be HTTP, a remote database interface, postings on a designated newsgroup (e.g., alt.newsgroups.ratings), or email. Each of these mechanisms has its own advantages in terms of delivery speed, privacy, and efficiency. We will start with the Web interface, which appears more immediately useful and easier to implement.

Data request examples

We also need to specify the formats of requests to the data repository. As we agreed in principle on the structures of requests and data record formats, the request formats seem to be a matter of protocol rather than architecture, so I'll skip them here, except for the opinion that they should also be human-readable in at least one of their representations.

The communication standard should also allow transparent extensions: if the services on two sides of the interface can use various extensions or subsets of the protocol, they should just get whatever parts of the record are available and process what they can understand.
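
One way to get this kind of transparent extension is for each receiver to read only the fields it knows and ignore the rest. The sketch below assumes an XML encoding with one child element per field; the element names and the `KNOWN_FIELDS` set are hypothetical.

```python
import xml.etree.ElementTree as ET

# Fields this particular receiver understands (hypothetical names).
KNOWN_FIELDS = {"source", "target", "sense", "value", "confidence"}

def parse_record(xml_text: str) -> dict[str, str]:
    """Return the fields this receiver understands; silently skip the rest.

    Missing fields are simply absent from the result, so a sender using
    a subset of the protocol is handled the same way as one using an
    extension of it.
    """
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip()
            for child in root if child.tag in KNOWN_FIELDS}

# A record from a sender that adds an extension field ("mood") and
# omits two standard ones ("sense", "confidence").
extended = (
    "<relation>"
    "<source>advisor-17</source>"
    "<target>msg-42</target>"
    "<value>0.9</value>"
    "<mood>cheerful</mood>"
    "</relation>"
)
fields = parse_record(extended)
```

The receiver neither fails on the unknown `mood` element nor requires the omitted ones, which is the behaviour the paragraph above asks for.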

We need to specify the exact transport syntax of the above records, as well as field lengths, and then, basically, we'll have the needed interface - at least for the architectural purposes.

Client Software

Client software should improve the users' navigation in the document space. It should allow the user to annotate existing documents (or annotate them automatically, based on the user's reading pattern), communicate annotations to data repositories, and request recommendations from knowledge servers. The recommendations will be used to filter and reorder the documents.


The semantic services (recommendation servers, reputation brokers, etc. - need a better generic name!) aggregate data from multiple users and software agents (this data is received from the data repositories described above) and form recommendations that should be used by the client software to improve selection and presentation of information to the user.

It is also possible to transmit a generic set of data and then perform the last personalization round on the client, such as weighting recommendations according to this user's affinity to the recommenders. This preserves the privacy of user data, reduces message traffic, and shifts part of the computational load to the client.
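
A client-side personalization round of this kind might look as follows. The shape of the generic data (ratings per message, keyed by advisor), the non-negative affinity weights, and the cut-off threshold are assumptions made for illustration.

```python
def personalize(
    generic: dict[str, dict[str, float]],
    affinity: dict[str, float],
    threshold: float = 0.5,
) -> list[str]:
    """Filter and reorder messages on the client.

    generic:  {message_id: {advisor_id: rating}} as sent by the server;
              it is the same for every user.
    affinity: {advisor_id: weight >= 0}; stays on the client, so the
              server never learns whom this user trusts.
    Returns message IDs scoring at least `threshold`, best first.
    """
    scored = []
    for message_id, ratings in generic.items():
        total = sum(affinity.get(a, 0.0) for a in ratings)
        if total == 0:
            continue  # no advisor this user trusts rated the message
        score = sum(affinity.get(a, 0.0) * r
                    for a, r in ratings.items()) / total
        if score >= threshold:
            scored.append((score, message_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [message_id for _, message_id in scored]

# Hypothetical data: this user trusts advisor "a" three times as much as "b".
generic = {
    "m1": {"a": 0.9, "b": 0.1},
    "m2": {"a": 0.2, "b": 0.8},
    "m3": {"c": 1.0},
}
affinity = {"a": 3.0, "b": 1.0}
reading_list = personalize(generic, affinity)
```

The trade-off is that the server must send ratings for more advisors than any single user needs, in exchange for keeping the affinity table private.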

Requests to semantic servers may include

Some of these functions can be iterative. For example, at the beginning of a session a user can request a list of like-minded users, and then use this list repeatedly to filter search results or listings for different groups. The user's feedback will be used to adjust the similarity/reputation factors for the selected advisors.
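
The feedback step could be as simple as nudging an advisor's weight toward how well that advisor agreed with the user. The update rule below (an exponential moving average with a hypothetical `rate` parameter, ratings assumed to lie in [0, 1]) is one possible choice, not something the project has settled on.

```python
def update_weight(
    weight: float, advisor_rating: float, user_rating: float, rate: float = 0.1
) -> float:
    """Move an advisor's reputation weight toward the observed agreement.

    Agreement is 1 when the advisor's rating matched the user's own
    rating exactly and 0 when they were at opposite ends of the scale.
    """
    agreement = 1.0 - abs(advisor_rating - user_rating)
    # Exponential moving average: old weight decays, new evidence is mixed in.
    return (1.0 - rate) * weight + rate * agreement

# Hypothetical session: the advisor agrees once, then badly disagrees.
w = 0.5
w_after_agreement = update_weight(w, advisor_rating=0.9, user_rating=0.9)
w_after_disagreement = update_weight(w, advisor_rating=0.0, user_rating=1.0)
```

A small `rate` makes the weights stable across sessions; a larger one lets them follow the user's changing tastes more quickly.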

The results of these functions should have the same structure as object and relation records.

First stage of the project

The first stage of development should create a collaborative message filtering framework and a basic functional service utilizing it.

This framework service should include:

The first stage of development should result in the creation of a basic, immediately useful service in a short time frame (counting on 2 developers * 3 months of work) that will be scalable and will allow multiple extensions.

The extensions, to be developed and/or integrated into the service during the following stages of the project, should include complex message evaluation schemes, automated selection of advisors for a given person, complex content search utilities in addition to browsing, additional sources of information, etc. Their selection for the next stages of the project will be determined during the implementation of the first stage, depending on its results and people's feedback.

Interface specification for stage 1

Page 1: Welcome screen

A short text describing the service and the latest announcement.

Links to:

Page 2: New user registration


Possibly more - a simple questionnaire: Age, gender, education level, a few keywords describing interests, "want to be on update mailing list"?

Page 3: User login

Could be the same as registration. Name, Password. Cookies if we manage.

Login should give us User Id and profile.

Page 4: Configuration screen

(people get here from login)

More general notes on communications between parts of the service

In the mature service (beyond the first stage) we will have the following agents producing and consuming ratings data:

Each of these agencies may be viewed as an Interactive Agent that can exchange requests with others. The request types may partially overlap between these agents. I can suggest the following types of communications (no claim about completeness of this list):

Note that the above communications are all called requests. In fact, a request only originates a [sequence of] communication(s) listed above.

The sequence of communications usually starts with a request and goes through action, transfer of stored or computed data, and completion confirmation. These communication sequences may loop, chain or extend in time to form dialogs and other transaction sequences.

We should be able to define the transaction syntax after we settle on the Agent Interaction Protocol, the Webmind SQL interface, and other related things.

Of course, we do not expect to implement all of these things in the first stage of Newsfilter. Hopefully though, this discussion can help us define the service structure that can be extended into more complex services and be compatible with Webmind.

Appendix: Web resources related to the project

© 1999 Newsfilter group