
Using tsearch2 in a Bayesian filter

by dalroi@[EMAIL PROTECTED] (Alban Hertroys) Apr 6, 2008 at 01:13 PM

Hi all,

In my spare time I've started on a general-purpose Bayesian filter
based on the now built-in tsearch2 functionality. The ability to stem
words from a message into lexemes and to remove stop words, together
with GiST indexes, looks promising enough to attempt this. However, my
experience with tsearch is somewhat limited, so I have a few questions...

The messages entering the filter will be in different languages and
encodings. For example, I get a lot of Cyrillic spam these days, while
I get a lot of English messages and a few in Dutch. The spam
especially is likely to lie about its encoding. Some messages will be
plain text, but many will be HTML.
- Is it possible to stem words from that wide a variety of content?
- If so, what approach would be best?
- Do I need to strip out the HTML tags or can they serve as lexemes  
themselves?

Next, to determine the probability of a lexeme being of a certain  
classification (for example spam or not spam), I need to be able to  
count the number of occurrences of that lexeme in a text. I can't  
store a probability, as the numbers aren't fixed[*] (was hoping to  
abuse score() here, but that's probably a no-op). I haven't found any  
tsearch functions to determine the number of occurrences of each  
lexeme in a text. Ideally I'd have a resultset with (lexeme, number
of occurrences) tuples, so that I can use that directly in a query.
- How do I determine the number of occurrences of each lexeme in a text?

Thanks for your time.


[*] As more messages enter the system, there will be more occurrences
of lexemes in messages and in classifications. If I start out with
one lexeme occurring once in a single message, the chance that lexeme
is in a message is 1. As soon as another message arrives not
containing that lexeme, the chance is 0.5. The number of messages and
the occurrences of lexemes in messages and classifications keep
changing, so I will need to store the counts the probability was
based on (I might still decide to add a column with the probability
calculated from those counts for speed, of course).
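
To make the arithmetic in this footnote concrete, here is a minimal
sketch in Python rather than SQL. It assumes the lexemes of each message
have already been extracted by some other means; the class name
LexemeCounts and its methods are hypothetical and are not part of
tsearch2. It stores only raw counts and derives the probability on
demand, so the value moves as new messages arrive.

```python
from collections import defaultdict


class LexemeCounts:
    """Hypothetical helper: keeps raw counts, derives probabilities on demand."""

    def __init__(self):
        self.total_messages = 0
        # lexeme -> number of messages that contain it at least once
        self.messages_with = defaultdict(int)

    def add_message(self, lexemes):
        """Register one message, given as an iterable of its lexemes."""
        self.total_messages += 1
        for lexeme in set(lexemes):  # count each lexeme once per message
            self.messages_with[lexeme] += 1

    def probability(self, lexeme):
        """Fraction of the messages seen so far that contain the lexeme."""
        if self.total_messages == 0:
            return 0.0
        return self.messages_with.get(lexeme, 0) / self.total_messages


counts = LexemeCounts()
counts.add_message(["cheap", "pill"])
print(counts.probability("pill"))  # 1.0: one message, and it contains the lexeme
counts.add_message(["meet", "tomorrow"])
print(counts.probability("pill"))  # 0.5: two messages, one contains the lexeme
```

Counts per classification (spam or not spam) could be kept in the same
way, with one such set of counters per class.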

Regards,

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.





-- 
Sent via pgsql-general mailing list (pgsql-general@[EMAIL PROTECTED])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
 



