Note: This document is a description and a working reference         
for the Newsfilter project aimed at creating a public, open-source service        
for quality-based filtering and recommendation        
of Usenet articles. For more information please review         
newsfilter group discussion archives        
and references at the end of this document.          
Please contact        
Sasha Chislenko        
with any suggestions or criticisms.        
        
Project Goal
        
        
The project is intended to create a technical framework for collaborative        
document filtering and recommendations, an information infrastructure, and a
basic set of services that use people's assessments of online documents
to improve navigation, and to apply them to Usenet messages.
        
Approach and aspirations
        
        
The project aims to create a set of open standards for storage and        
transmission of semantic encodings and client user interfaces,        
and the first implementations of crucial architectural components.        
This will create an easy-to-use infrastructure that should allow        
further rapid development of the system and smooth integration        
of additional services.        
        
So far, Net tools have concentrated on information storage, transmission,
and representation functions, while all semantic analysis has been        
done by humans. Now, we can build another level of standards for        
representing, processing, and targeting documents based on their semantic        
encoding, and elevate the Web to a new level.  In the resulting        
environment, the development of intelligent agents and symbolic        
AI may become as profitable for third-party services as text        
retrieval became with the invention of simple document-storage        
systems.        
        
Project structure and deliverables
        
        
The project aims to design and deliver the first implementations of:
        
- Formats and storage facilities for rating profiles and other user data        
- Data request functions        
- Collaborative mechanisms for combining multiple sources of ratings into        
    a recommendation/filtering stream.        
- Client interface        
- Algorithms for generating, matching and aggregating ratings profiles
- Algorithms and interfaces for intercommunication of distributed
recommendation and prediction services.
- Help files, online documentation, and readable published code

General scheme and information flows
        
        
        
In the traditional system, the user exchanges        
content with the content repository, querying it        
to obtain relevant documents.  The content search        
engine knows nothing about the user.  Client interface        
stores some simple user settings and allows browsing        
and querying of the content repository and posting of
user messages.        
        
        
The ratings-enriched system introduces additional        
elements: user's own ratings profile, general ratings        
repository, and recommendation service.        
The right (ratings) wing of the following diagram of        
the ratings-based system is very similar to the left        
(content) wing, except that the advice it gives
is based on the user-expressed semantics of the
documents rather than their content.        
        
        
        
        
        
             Content             Profile (Ratings)        
            repository            repository         
               |                    |        
               |                    |        
            Content                 |               
            search              recommendation        
            engine                service        
                \                   /          
                 \                 /           
                  \               /        
                   \             /        
                   Client interface        
                         |        
                         |        
                      User [+ user profile]        
        
        
        
        
        
An important concept in the proposed architecture is that
of an advisor.  An advisor is a human or
machine generator of message ratings that the user        
decides to rely upon.  A user can have multiple advisors        
whose recommendations may be combined.  Each advisor        
has a named area of expertise and a reputation weight        
relating the relative utility of their recommendations        
to this user.        
Every user can be an advisor if he
1) enters ratings, and
2) has anybody who is willing to follow his opinions.

Advisors can also be automatic (a kill file or an imported
spam filter) or synthetic (e.g., the average of all human
advisors is the "community" advisor, and can have a name
like "comp.ai.alife.human_community").

The aggregation of advisor ratings (in a given named "sense"
or area, e.g. "humor") is relatively simple, except for the confidence
calculation.

Suppose that:
- Wt(A) is Advisor Weight designated by the user;
- Rating(A,M) is advisor A's rating for message M;
- Conf(A,M) is advisor A's confidence for message M.
Then the aggregated rating R(M) for message M can be computed as an average
rating by all advisors, taking weights and confidences into account:

         Sum [ Wt(A) * Rating(A,M) * Conf(A,M) ]
R(M) =  -----------------------------------------
              Sum [ Wt(A) * Conf(A,M) ]

where each sum runs over the user's advisors A.

Confidence computation is more complex, and depends on all
advisor confidences and weights, the number of advisors that supplied
ratings, and the diversity/deviation of their opinions.
The exact formulas can be selected based on statistical 
analysis of the recommendations, to optimize the recommendation
quality (the accuracy of predicting user's ratings).
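
The aggregated rating formula above can be sketched in a few lines of
Python.  This is only an illustration: the names AdvisorOpinion and
aggregate_rating are hypothetical, and returning None when no advisor
carries any weight is an assumption, since the text does not say what
should happen in that case.

```python
from dataclasses import dataclass


@dataclass
class AdvisorOpinion:
    weight: float      # Wt(A): weight the user assigned to advisor A
    rating: float      # Rating(A, M): advisor A's rating of message M
    confidence: float  # Conf(A, M): advisor A's confidence in that rating


def aggregate_rating(opinions):
    """Weighted average of ratings, using weight * confidence as the weight.

    Returns None when the denominator is zero (an assumption of this sketch).
    """
    denominator = sum(o.weight * o.confidence for o in opinions)
    if denominator == 0:
        return None
    numerator = sum(o.weight * o.rating * o.confidence for o in opinions)
    return numerator / denominator


opinions = [
    AdvisorOpinion(weight=0.7, rating=0.9, confidence=0.8),
    AdvisorOpinion(weight=0.6, rating=0.2, confidence=0.5),
]
print(aggregate_rating(opinions))
```

The confident, highly weighted advisor dominates the result, which is
the intended effect of multiplying weight by confidence.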
      
Basic operations of the service
        
        
The system should allow the following operations:        
        
        
- Selecting the advisor set        
 The user can directly list good advisors and their        
weights, or request N most reputable advisors [in        
a given area], or ask for advisors of his friends,        
or ask the recommendation service to find appropriate        
advisors based on analysis of profiles and reputations,        
or a combination of these.        
        
 
- Getting a general recommendation        
 The user sends the advisor set [and an area of interest]        
to the recommendation service.  The service retrieves        
ratings of the advisors, aggregates them, and returns        
a selected set of messages that are the best according to the opinion
of these advisors (or the worst, or most/least popular, or
most controversial).  This function can also be
performed on the client if it stores the advisor profiles.
Then the selected messages are fetched from the content         
repository.        
 
- Getting a restricted recommendation        
 The user sends a query to the search engine and gets        
a reply with a list of messages and indicators of their        
relevance.   This list is then passed through the advisor        
set, and for each message the recommendation system suggests        
its expected quality, with confidence factors.        
Then, the messages are re-sorted based on all indicators of
relevance, quality and confidence.
 (This allows personalized quality expectations to be taken into
account every time a search is conducted.)
 
- Improved navigation        
 Before presenting the user with a list of messages        
corresponding to a given newsgroup or a search list, client        
software should consult the recommendation service, and        
reorder the list, putting definitely good messages on the top        
of the list, definitely bad messages on the bottom, and        
everything else in between.        
 
- Rating feedback        
 The user should be able to enter feedback for each reviewed
message.  The feedback may include a named rating for the
message, a confidence factor, and a free-form comment.
The user may also request to see who recommended this message,
and adjust the reputation factors for those advisors.
This function can also be performed automatically.
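
The "improved navigation" operation above can be sketched as a three-way
split of the message list.  The function name reorder_messages, the
threshold values, and the shape of the predictions mapping are all
assumptions made for illustration; the text does not define what counts
as "definitely" good or bad.

```python
def reorder_messages(messages, predictions, good=0.7, bad=0.3, min_conf=0.6):
    """Put definitely good messages first and definitely bad ones last.

    messages:    message ids in their original order
    predictions: message id -> (predicted rating, confidence)
    A message is "definite" only if its confidence reaches min_conf
    (hypothetical threshold); everything else keeps its original order
    in the middle of the list.
    """
    top, middle, bottom = [], [], []
    for msg_id in messages:
        rating, conf = predictions.get(msg_id, (None, 0.0))
        if rating is not None and conf >= min_conf and rating >= good:
            top.append(msg_id)
        elif rating is not None and conf >= min_conf and rating <= bad:
            bottom.append(msg_id)
        else:
            middle.append(msg_id)
    return top + middle + bottom


predictions = {"a": (0.1, 0.9), "b": (0.9, 0.9), "c": (0.9, 0.2)}
print(reorder_messages(["a", "b", "c", "d"], predictions))
```

Message "c" has a high rating but low confidence, so it stays in the
middle together with the unrated message "d".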

Data structures
        
        
The "semantic data" should represent features of users, advisors,        
and messages, as well as their relations.        
        
Data will be kept in standard records (database or XML) that allow
easy extension.
        
Sample data formats
        
        
The two basic types of data records are object description        
records and relation (rating) description records.        
        
Object [User/advisor/message] description record
        
        
An object profile consists of multiple object records,         
describing various features of users, advisors, and messages,         
such as name, age, preferred language, URNs, etc.        
Each record has the following structure:        
        
        
- Object Id        
- Field name        
- Field value        
- Time Stamp        
- [possibly, Expiration time]        
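
As an illustration, the object record above could be represented as
follows.  The Python names, the use of datetime values, and the rule
that the most recent unexpired record wins for each field are
assumptions of this sketch, not part of the agreed format.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ObjectRecord:
    object_id: str
    field_name: str
    field_value: str
    timestamp: datetime
    expires: Optional[datetime] = None  # optional expiration time

    def is_expired(self, now):
        return self.expires is not None and now >= self.expires


def build_profile(records, object_id, now):
    """Collect the unexpired fields of one object into a dict.

    When a field appears more than once, the record with the latest
    time stamp wins (an assumption of this sketch).
    """
    profile, latest = {}, {}
    for r in records:
        if r.object_id != object_id or r.is_expired(now):
            continue
        if r.field_name not in latest or r.timestamp > latest[r.field_name]:
            latest[r.field_name] = r.timestamp
            profile[r.field_name] = r.field_value
    return profile
```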

Relation record
        
        
Relation records store user and advisor ratings as well
as advisor records.
        
        
- Object1 Id          User or advisor Id        
- Object2 Id          Advisor or message Id        
- Relation type      "Advisor", or "rated message"        
- Relation name       area of expertise, or message feature, e.g. "funny"        
- Relation value      Reputation/weight/rating        
- Value Confidence        
- Time Stamp        
- [Free-form comment]        
- [possibly, Expiration time]        
        
        
The confidence reflects the degree to which the source is confident        
that the relation value is correct.  The confidence may be stronger        
if the record was derived from combining a large number of opinions        
of reliable agents that agreed on this value (low std. dev.), and lower
if there were only a few not very reliable agents that deviated from
each other, or if the value was derived implicitly, etc.
        
The reason for storing confidence explicitly is that different users        
have different degrees of tolerance to false positive and false negative        
recommendations.
Also, people sometimes can be interested in messages with low confidence        
as these indicate controversial or under-researched objects.        
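
A relation record and the use of explicit confidence might look like
this in Python.  The field names, the numeric time stamp, and the helper
split_by_confidence are hypothetical; the threshold stands for one
user's personal tolerance and is not a value fixed by the design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RelationRecord:
    object1_id: str          # user or advisor Id
    object2_id: str          # advisor or message Id
    relation_type: str       # "advisor" or "rated message"
    relation_name: str       # area of expertise or message feature, e.g. "funny"
    value: float             # reputation, weight, or rating
    confidence: float        # how sure the source is that value is correct
    timestamp: float         # seconds since the epoch (an assumption)
    comment: Optional[str] = None
    expires: Optional[float] = None


def split_by_confidence(records, min_confidence):
    """Separate records a cautious user would trust from the uncertain rest.

    The uncertain list is kept rather than dropped, because low-confidence
    records may point at controversial or under-researched objects.
    """
    confident = [r for r in records if r.confidence >= min_confidence]
    uncertain = [r for r in records if r.confidence < min_confidence]
    return confident, uncertain
```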
        
Data repositories
        
        
The data records may be stored in databases that may serve         
records on request, or published as standard formatted files.         
        
Data requests and transports
        
        
The data transport mechanism transfers semantic data, content, and
requests between data repositories, knowledge servers, and user        
client software.        
        
The transport can be HTTP, remote database interface, postings on        
a designated newsgroup (e.g., alt.newsgroups.ratings), or email.
Each of these mechanisms has its own advantages in terms of        
delivery speed, privacy and efficiency.  We will start with        
the Web interface that appears more immediately useful and        
easy to implement.        
        
Data request examples
        
        
        
-  Get/write object description records
-  Get a list of unrated messages among {message list}
-  Get a list of most popular messages among {user set}
 (combination of number of ratings and average rating)
        
        
We also need to specify formats of requests to the data repository.
Since we agreed in principle on the structures of requests and
data record formats, the request formats seem to be a matter of
protocol rather than architecture, so I'll skip them here,
except for the opinion that they should also be human-readable,
in at least one of their representations.
        
The communication standard should also allow transparent extensions:        
if the services on two sides of the interface can use various extensions        
or subsets of the protocol, they should just get whatever parts of the        
record are available and process what they can understand.        
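
The "process what they can understand" rule can be sketched as a
tolerant parser.  The key="value" syntax separated by semicolons is
borrowed from the rating example elsewhere in this document; the field
list KNOWN_FIELDS and the function parse_record are assumptions, since
the transport syntax has not been specified yet.

```python
# Fields this (hypothetical) service understands, with their converters.
KNOWN_FIELDS = {"name": str, "rating": float, "confidence": float}


def parse_record(line):
    """Parse 'key="value"; key="value"' text into two dicts.

    Returns (understood, ignored): fields from KNOWN_FIELDS are converted
    and kept, unknown fields are set aside instead of causing an error.
    Values containing ';' or '=' are not handled by this sketch.
    """
    understood, ignored = {}, {}
    for part in line.split(";"):
        if "=" not in part:
            continue
        key, _, raw = part.partition("=")
        key = key.strip()
        value = raw.strip().strip('"')
        if key in KNOWN_FIELDS:
            understood[key] = KNOWN_FIELDS[key](value)
        else:
            ignored[key] = value
    return understood, ignored


print(parse_record('name="picture"; rating="0.1"; confidence ="0.85"; mood="silly"'))
```

A service that knows nothing about the "mood" extension still gets a
usable rating record from the same line.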
        
We need to specify the exact transport syntax of the above records,        
as well as field lengths, and then, basically, we'll have the needed        
interface - at least for the architectural purposes.        
        
Client Software
        
        
Client software should improve the users' navigation in the        
document space.  It should allow the user to annotate existing
documents (or annotate them automatically, based on the user's
reading pattern), communicate annotations to data repositories,
and request recommendations from knowledge servers.  The
recommendations will be used to filter and reorder the documents.        
        
Algorithms
        
        
The semantic services (recommendation servers, reputation brokers, etc.;
they need a better generic name!)
aggregate data from multiple users and software agents (this data is        
received from the data repositories described above) and form         
recommendations that should be used by the client software to        
improve selection and presentation of information to the user.        
        
It is also possible to transmit a generic set of data and then perform        
the last personalization round on the client, such as weighting recommendations
according to this user's affinity to the recommenders.  This preserves
the privacy of user data, reduces message traffic, and shifts part
of the computational load to the client.
        
Requests to semantic servers may include
        
        
-  Predict ratings for a given message by user X        
  (e.g., "funny: 0.6, confidence: 0.7;  intelligent: 0.1, confidence: 0.9")        
-  Filter a given list of messages for a user X with given thresholds        
-  Get a set of "like-minded users"/advisors for user X [for criterion C]        
-  Sort a given list of messages by predicted rating/confidence combination        
-  Get a list of most controversial messages among user set {X}
   (combination of number of ratings, their standard deviation and confidence)        
-  Get a list of messages that a given set of users considers similar to message I        
-  Compute reputation of an advisor X (utility of their advice) among user set {Y}        
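
One of the requests above, the list of most controversial messages, can
be sketched as follows.  The text only says that the score combines the
number of ratings, their standard deviation, and confidence; the exact
product used here, and the names controversy and most_controversial, are
assumptions chosen for illustration.

```python
import math
from statistics import pstdev


def controversy(ratings):
    """Score one message from its (rating value, confidence) pairs.

    Hypothetical formula: spread of the values, scaled by mean confidence
    and by the logarithm of the number of ratings.  A message with fewer
    than two ratings cannot be controversial and scores 0.
    """
    if len(ratings) < 2:
        return 0.0
    values = [v for v, _ in ratings]
    mean_conf = sum(c for _, c in ratings) / len(ratings)
    return pstdev(values) * mean_conf * math.log(1 + len(ratings))


def most_controversial(ratings_by_message, limit=10):
    """Return up to `limit` message ids, most controversial first."""
    ranked = sorted(
        ratings_by_message,
        key=lambda m: controversy(ratings_by_message[m]),
        reverse=True,
    )
    return ranked[:limit]
```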
        
        
Some of these functions can be iterative.        
For example, at the beginning of a session a user can request a list of
like-minded users, and then use this list repeatedly to filter search        
results or listings for different groups.  The user feedback will be        
used to adjust the similarity/reputation factors for the selected advisors.        
        
The results of these functions should have the same structure as object and        
relation records.        
         
First stage of the project
        
        
The first stage of development should create a collaborative        
message filtering framework and a basic functional service        
utilizing it.        
        
        
This framework service should include:        
        
        
- User registration facility (URF)
A user should be assigned, minimally, a unique Id and a password.        
The registration can also include a questionnaire.        
 
URF includes client and server sides. Client side is an HTML-form,
server side is a database and a CGI-script. 
 
 
-  Use spam filters and kill files as advisors.
They will also be given names (e.g. "picture" filter that leaves only pictures).
A conversion utility should turn a spam filter's message lists and the results of
applying kill files into rating values (e.g., name="picture"; rating="0.1"; confidence="0.85").
 
- Interface functions allowing users to manually select, exchange,        
and merge advisor sets.          
        
 
- Facility for expressing message selection criteria for a user.        
The selection criteria should include the maximal number of messages        
a user wants to see in each area of interest, and threshold values        
for message quality and aggregated advisor confidence.        
        
 
- A mechanism for aggregating rating streams from several advisors        
into a collaborative recommendation filter.        
        
 
- A web-based news browsing facility that displays messages        
based on this filter and collects message ratings from the user        
that will be used in the system.         
        
 
- Storage and retrieval facilities for messages.        
 Every message has:
- Id
- Poster
- Date
- Size
- Body
- [optionally, other fields, like keywords]
         
 
- Storage and retrieval facilities for user profiles and ratings.        
(The structure of user profile and rating records is listed above.)
        
 
- Sample utilities converting widely accepted message filters        
(keyword search, spam filters, kill files) into rating streams.        
        
 
- Online documentation, including description of the project goals,
list of contributors, current status, online help, to-do list, and        
readable published code.        
        
 
- A minimal facility for user feedback (at least, email or a guestbook)        
        
 
- The project should be beta-tested by a limited group of people and        
their suggestions should be taken into account in the document describing        
further development plans.        
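
The conversion utility mentioned in the list above, which turns a kill
file into a rating stream, could be sketched like this.  Real kill file
formats vary between newsreaders, so the rule format (header field plus
substring), the record layout, and the default rating and confidence
values are all assumptions of this sketch.

```python
def killfile_to_ratings(advisor, rules, messages, name="killfile",
                        killed_rating=0.0, passed_rating=0.5,
                        killed_conf=0.85, passed_conf=0.1):
    """Turn kill rules into one rating record per message.

    rules:    list of (header field, substring) pairs, matched
              case-insensitively (hypothetical rule format)
    messages: list of dicts with at least an "id" key
    A killed message gets a low rating with high confidence; a message
    that passes gets a neutral rating with low confidence, because the
    kill file says nothing positive about it.
    """
    records = []
    for msg in messages:
        killed = any(sub.lower() in str(msg.get(field, "")).lower()
                     for field, sub in rules)
        records.append({
            "advisor": advisor,
            "message": msg["id"],
            "name": name,
            "rating": killed_rating if killed else passed_rating,
            "confidence": killed_conf if killed else passed_conf,
        })
    return records


rules = [("subject", "make money fast")]
messages = [
    {"id": "1", "subject": "MAKE MONEY FAST!!!"},
    {"id": "2", "subject": "Alife simulation results"},
]
for record in killfile_to_ratings("my_killfile", rules, messages):
    print(record)
```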
         
        
  The first stage of development should result in the creation of        
a basic, immediately useful service in a short time frame (counting        
on 2 developers * 3 months of work) that will be scalable and will
allow multiple extensions.        
        
The extensions, to be developed and/or integrated into the service during        
the following stages of the project, should include complex message evaluation        
schemes, automated selection of advisors for a given person, complex content        
search utilities in addition to browsing, additional sources of        
information, etc.  Their selection for the next stages of the project
will be determined during the implementation of the first stage,
depending on its results and people's feedback.

Interface specification for stage 1

Page 1:  Welcome screen.
  A short text describing the service and latest announcement.
  links to: 
-  new user registration 
-  existing user login
 (the above may be combined)
-  online documentation
Page 2.   New user registration
    Minimally: 
- name (should be unique)
- password (some simple restriction, like at least 4 letters) 
Possibly more - a simple questionnaire:
Age, gender, education level,
a few keywords  describing interests,
"want to be on update mailing list"?
Page 3:  User login
Could be the same as registration.
Name, Password.  Cookies if we manage.
Login should give us User Id and profile.
 Page 4.  Configuration screen
   (people get here from login)
- newsgroup selector (User's usual newsgroup set, plus ability to add)
- topic selector  (user's usual interests, plus ability to add from other
    topics mentioned in this newsgroup)
- advisor selector
 The user sees a list of his usual advisors [relevant to these topics and newsgroup(s)]
and also a list of community advisors that he can add to his own list.
The advisor selector record looks like:
<advisor name>
 <checkmark>  - check to include, uncheck to exclude
 <rating name>
 <reputation value>  - [0 to 1]   User’s assigned reputation, or community reputation.
 
 
 E.g.:
  
 <Andrei><x> <general><0.7>
 <Sasha> <x> <science> <0.6>
 <Sasha> <->  <Culture> <0.4>   (community-suggested value)
 
 The checked fields represent user’s advisors; the rest are taken from the most
reputable advisors in the group, for user’s consideration, and may be included.
 
All selections should be stored in user profile for later use, so that the user doesn't
have to re-specify them every time.
 
 
- Selection conditions: minimal rating, minimal total confidence for each message
to be displayed.
-  Search button. When this button is pressed, the server searches for all
 messages that match the selected newsgroups and advisors, and the browsing page is loaded.
  Links to browsing and documentation pages.
 Page 5.  Browsing page
Four horizontal areas, from top down:
 
-  Topmost line: newsgroup/topic selector
 Fields (editable):
    -  Name of browsing newsgroup (e.g. "misc.philosophy")
    
-  Name of interest (e.g. "religion")
 
 
-   Top frame (under the top line):  message title (scrollable)
Contains normal message title fields: poster, date, size[?], subject,
and also predicted quality (averaged rating and confidence).
(Maybe, also - top advisors?)
Sortable by any field - or at least, by date and rating.
The frame should have adjustable lower bound. Default height: 30%.
 
-  Bottom frame:  Message body (scrollable)
 The message whose title was selected from the top frame.
 
-   Bottom line:  Feedback fields.
 
   -  Interest (default: browsing interest from the top field)
   
-  rating value;
   
-  confidence factor (with default value, to be stored in user profile)
   
-  "Reply" button - calls a new screen for writing a reply
        with parameters taken from the message.
 
    
-  Links to: configuration page, main page.
 
 
- Documentation screens:
 
-  Project description - why is it necessary, how it is useful, etc.
-  Help  (how to use it)
-  FAQ
-  to do list (known problems and planned changes)
-  feedback form  - user feedback goes to developer list and/or
  guestbook.
-  code to download
 (one Zip file with all HTML pages, scripts, and installation
   instructions)
 
More general notes on communications between parts of the service
In the mature service (beyond the first stage) we will have the following agents
producing and consuming ratings data:
-   rating agency.
This is an agent that takes a single message and produces a rating
record (e.g., human; killfile; word search; any other message
analysis mechanism).  Also known as advisor or expert.
 
 
-   rating repository (profile server)
Stores and serves profiles (groups of rating records) on request
 
 
-   recommendation server
Analyses profiles, matches users and advisors, aggregates ratings
values for different profiles, produces composite indicators of
quality, popularity, controversialness, etc. of messages.  Also,
it should use statistical analysis to optimize its algorithms
for various metrics of service quality.
   
This is the most complex part of the ratings processing mechanism,
and the one that will be barely present in the first stage (except,
mostly, for merging advisor profiles)
 
 
-   User Client
  (somewhat overlaps with rating agency where a human user is concerned;
   the emphasis here is on consumption, rather than production, of ratings)
   Issues requests for retrieval of most appropriate advisors and messages
   in given categories.
 
Each of these agencies may be viewed as an Interactive Agent that can
exchange requests with others.  The request types may partially overlap
between these agents.  I can suggest the following types of communications
(no claim about completeness of this list):
- 1.  Requests [typically] directed at the record server:
 
-   profile data query (directed at the profiles/ratings database)
For descriptive (non-rating) part of user profile, the query
may specify a UserId, and receive user data, or specify a condition
on user data, and receive a set of qualifying records.
 
For ratings part, the query can specify any condition on
rater Id, User Id, rating name, value, confidence, and time,
and receive matching ratings records.
 
For example, a query may request all ratings by the given set of
advisors for a selected message.
 
We can also have pending queries, with an expiration time, for agents
who want to be notified when new data appears that matches their request.
 
 
-   Data storage request.
Sends an attribute or rating record for storage
(typically, this can be sent by a rater or recommendation server
to a rating server)
 Receives a confirmation.
 
 
-   Data removal request.
    Sends a condition on the data to be removed, and authorization.
 Receives a confirmation.
 
 
 
-  2.  Requests to the recommendation/computation server
 
-   2a.  Simple requests
 
-   profile merger request 
sends a set of rating profiles;  receives a single profile
with combined ratings and confidence factors.
 
The simplest case here is the computation of a predicted
message rating by a set of advisors.
 
 
-  ?
 
 
-   2b.  Complex requests  (trigger a series of consecutive operations)
 
-   advisor set [re]computation
sends a user profile, existing advisor set, relation name, and number of
advisors required.
 
Updates advisor set with the advisors taken from the existing advisors'
lists, most reputable community advisors, advisors with highest 
affinity to the user, etc., and returns a combined set with optimized
weights.
 
 
-   prediction request:
Starts with a user Id, a message Id, and rating relation name.
Gets a list of advisors for the user.
 If there are any, gets their ratings of the message.
 If there are any, merges them.
 If either advisors or their ratings for the message are missing,
 starts the "fallback method": gets an average rating for the message.
 If there are no ratings, returns average rating value with confidence 0.
 
 
-   message list reordering:  
Sends a user Id, a list of message Ids, and rating relation name.
Goes through a series of operations similar to those described above;
returns an ordered message list (by quality, in descending order,
or by any other factor, if one is interested in controversial,
popular, unknown, etc. messages).
 
 
 
-  2c.  Special computation requests:
  Calls to special-purpose functions.
 For example, consistency check on the database,
calculation of average prediction efficiency,
optimization of algorithm parameters, reclustering
of synthetic profiles, etc.
 
 
 
-  3.  Other action requests:
 request to perform certain actions, such as publish data,
back up the database, etc.
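
The prediction request from item 2b above, including its fallback
method, can be sketched as follows.  The in-memory dictionaries stand in
for queries to the rating repository, and several details are
assumptions: the merged confidence is a weight-averaged confidence (the
document leaves the real formula open), weights and confidences are
taken to be non-negative, and the neutral default of 0.5 is a guess at
what "average rating value" means when nobody has rated the message.

```python
def predict(user_id, message_id, name, advisors, ratings, default_rating=0.5):
    """Predict (rating, confidence) for one user, message and relation name.

    advisors: user id -> {advisor id: weight}
    ratings:  (rater id, message id, relation name) -> (rating, confidence)
    """
    # Steps 1-2: collect the ratings of this user's advisors for the message.
    found = []
    for advisor_id, weight in advisors.get(user_id, {}).items():
        key = (advisor_id, message_id, name)
        if key in ratings:
            rating, conf = ratings[key]
            found.append((weight, rating, conf))

    # Step 3: merge them with the weighted average used for aggregation.
    total = sum(w * c for w, _, c in found)
    if total > 0:
        rating = sum(w * r * c for w, r, c in found) / total
        confidence = total / sum(w for w, _, _ in found)  # assumed formula
        return rating, confidence

    # Fallback: plain average over everyone who rated this message.
    everyone = [rc for (_, m, n), rc in ratings.items()
                if m == message_id and n == name]
    if everyone:
        rating = sum(r for r, _ in everyone) / len(everyone)
        confidence = sum(c for _, c in everyone) / len(everyone)
        return rating, confidence

    # Nobody rated the message: neutral value with confidence 0.
    return default_rating, 0.0
```

Message list reordering can then be built on top of this function by
predicting each message in the list and sorting by the result.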
We can notice that the above communications are all called
requests.  In fact, a request only originates the [sequence of]
communication(s) listed above.

The sequence of communications usually starts with a request
and goes through action, transfer of stored or computed
data, and completion confirmation.  These communication
sequences may loop, chain or extend in time to form dialogs
and other transaction sequences.

We should be able to define the transaction syntax after we settle
on the Agent Interaction Protocol, Webmind SQL interface, and other
related things.

Of course, we do not expect to implement all of these things
in the first stage of Newsfilter.  Hopefully though, this
discussion can help us define the service structure that can
be extended into more complex services and be compatible with
Webmind.
         
Appendix: Web resources related to the project
        
        
        
        
© 1999 
Newsfilter group