After running the indexer on a default install and creating a vanilla search asset, I'm seeing weird behavior.
For instance:
Searching for "Getting" or "getting" returns no results.
Searching for "Started" returns the Getting Started page.
Searching for "Tell" or "Tell a Friend" yields nothing.
Searching for "Friend" returns the Tell a Friend page.
Very frustrating... I'll dig more tomorrow. Tried this query:
select assetId, match(keywords) against('+getting +started' in boolean mode) as score from assetIndex order by score;
Returns three scores of 1... excellent. Tried this:
select assetId, match(keywords) against('+getting' in boolean mode) as score from assetIndex order by score;
Returns no scores of 1... wtf?
"in boolean mode" is supposed to switch off the "ignore if it appears in more than 50% of the rows" rule.
And yet, if you run select keywords from assetIndex where keywords like 'getting%', you get several rows of data.
Mystery solved. Getting is in the stopword list.
http://dev.mysql.com/doc/refman/5.0/en/fulltext-stopwords.html
That's a big freaking list of words! I don't think we should exclude all of those. Tomorrow I'll see if we can override that somehow.
Did a lot more digging and thinking. The problem here is not the stopwords MySQL uses... searching for the word "getting" shouldn't return any results, as the word "getting" means nothing by itself.
The real issue here is knowing when MySQL actually discards a word from the query because it's in the stopword list. I've dug and dug and can't find any way to either:
a) run a query after the match/against query to see if any stopwords were detected
b) run a query that lists all current stopwords
The default stopwords are listed in the docs, but the list can be overridden from one MySQL server to the next. It is possible to detect whether the default stopwords are being used... unfortunately 'default' can change from version to version of MySQL, which makes relying on it impractical for us.
So, at this point I have posted to the mysql forums to see what their take is. http://forums.mysql.com/read.php?107,128828,128828#msg-128828
If I get snake-eyes there, I think the best solution is something like this:
1 - Add a table to wG that contains the current default MySQL stopwords.
2 - Write a method that checks the query entered against the stopwords table after each search, and returns a list of words that were ignored.
3 - Add a config file parameter called 'StopwordsFile' or similar with a default value of 'built-in'. Our method could then run the query "show variables like 'ft_stopword_file'" and compare the result to the StopwordsFile config value to make sure they match. If they match, tell the user which words were ignored; otherwise don't tell the user, and write a warning to the log file that WebGUI doesn't know what stopwords are being used.
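A rough sketch of how steps 2 and 3 could fit together, in Python just for illustration (WebGUI itself would do this differently). Everything named here is hypothetical: ignored_words, stopword_notice, and SAMPLE_STOPWORDS are made up for the example, and the sample set is a tiny stand-in for the real stopwords table ("getting" comes from the findings above, "tell" is assumed). The server value is passed in rather than queried, and the code assumes MySQL reports the default as "(built-in)" for ft_stopword_file.

```python
import re

# Hypothetical stand-in for the stopwords table from step 1.
# A real table would hold the full default list from the MySQL docs.
SAMPLE_STOPWORDS = {"getting", "tell"}


def ignored_words(query, stopwords):
    """Return the query words found in stopwords, in order, without duplicates."""
    found = []
    for word in re.findall(r"[A-Za-z0-9_']+", query.lower()):
        if word in stopwords and word not in found:
            found.append(word)
    return found


def stopword_notice(query, stopwords, server_stopword_file,
                    configured_stopword_file="built-in"):
    """Return (ignored, warning).

    server_stopword_file is whatever the server reports for its stopword
    file variable. Assumption: the default shows up as "(built-in)", so
    the parentheses are stripped before comparing with the config value.
    """
    if server_stopword_file.strip("()") != configured_stopword_file:
        # Step 3, mismatch: we don't know which list is in use,
        # so say nothing to the user and hand back a log warning.
        warning = "unknown stopword list in use: " + server_stopword_file
        return [], warning
    # Step 2: the lists match, so report which words were dropped.
    return ignored_words(query, stopwords), None
```

With the sample set, stopword_notice("Getting Started", SAMPLE_STOPWORDS, "(built-in)") would report "getting" as ignored, while a server pointing at a custom stopword file would produce only the log warning.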
I really hate that solution...
Closing this.
1 - The bug is with MySQL.
2 - To work around this requires an RFE which I've posted.