plainblack.com

Search Indexing

WebGUI has a pluggable search indexer, which allows you to index, and therefore search, not only your WebGUI content but also any files attached to that content.

 

The Indexer

Before we get into extending the abilities of the indexer, we should first talk about how to index content. An indexer is a program that catalogs content into keywords and phrases so that it can be searched rapidly. WebGUI automatically indexes your content as you commit your changes. However, you can also manually tell WebGUI to re-index your content. To do this, use the command line utility search.pl, found in your WebGUI/sbin folder. Run the following command to see what options are available:

 

cd /data/WebGUI/sbin
perl search.pl --help

perl search.pl [ options ]

Options:

--configFile=    The config file of the site you wish to perform
                 an action on.

--help           Displays this message.

--indexall       Reindexes all the sites. Note that this can take
                 many hours and will affect the performance of the
                 server during the indexing process.

--indexsite *    Reindexes the entire site. Note that depending
                 upon the amount of content you have, it may take
                 hours to index a site and server performance will
                 suffer somewhat during the indexing process.

--search= *      Searches the site for a keyword or phrase and
                 returns the results.

--updatesite *   Indexes content that has not be indexed, but does not
                 index content that has been indexed. This is useful
                 if the --indexsite option had to be stopped part way
                 through.

* This option requires the --configFile option.

 

Why would you want to manually re-index your content? There are several reasons:

  • You just added some plugins for additional file types and you want WebGUI to re-index the content so that it can index all your existing assets with those file types.

  • There was a change to the search system, or there was a bug, and the WebGUI gotcha.txt file tells you to re-index your content.

  • You performed some manual changes to the database, either through external content imports, site splits or merges, or some other external function, and you want to make sure that all the changes are indexed correctly.

To re-index the content on your site, type:

 

cd /data/WebGUI/sbin
perl search.pl --config=www.example.com.conf --indexsite

 

You can also search your content from the command line to make sure that the indexing worked as you expected it to. To do that, type:

 

cd /data/WebGUI/sbin
perl search.pl --config=www.example.com.conf --search=features

 

The word “features” in the above command is the keyword being searched for. If the search finds nothing, the output looks similar to the following:

 

Search took 0.048402 seconds.



If it does find some content, the output looks like this:

 

4Yfz9hqBqM8OYMGuQK8oLw Get Features
OhdaFLE7sXOzo_SIP2ZUgA Welcome

Search took 0.025347 seconds.

 

The first column is the asset id of the asset it found, and the second column is the title of the asset.
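
If you want to use these results in a script of your own, the two columns can be split on the first space. The following is only a sketch: it assumes the output layout shown above (asset id, a space, the title, and a final "Search took" line), and the function name list_assets is made up for illustration.

```shell
#!/bin/sh
# Sketch: split `search.pl --search=...` output into asset id and title.
# Assumes the layout shown above: "<assetId> <title>" per result, plus a
# trailing "Search took ..." line. list_assets is a hypothetical helper name.
list_assets() {
    while IFS= read -r line; do
        case "$line" in
            ''|'Search took '*) continue ;;   # skip blank lines and the timing line
        esac
        id=${line%% *}      # everything before the first space
        title=${line#* }    # everything after the first space
        printf 'id=%s title=%s\n' "$id" "$title"
    done
}
```

It could be used by piping the search into it, for example: perl search.pl --config=www.example.com.conf --search=features | list_assets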

 

Adding File Types

The WebGUI search indexer allows you to catalog attachments in addition to your WebGUI content. This is done through the use of command line programs that can turn file content into either text or HTML. The program must write the content to standard output. A good example of this is the “cat” program that comes with all Unix®, Linux®, and BSD systems. If you type:

 

cat /path/to/product.html

 

it would output the contents of the product.html file. In the case of files that contain only HTML or text, the cat program is a perfect way to index those files. Unfortunately it's not so easy for most binary application files, like those created by office productivity software.

Luckily, the WRE (WebGUI Runtime Environment) comes with a couple of utilities that will convert Microsoft® Word (catdoc) and Adobe® PDF (xpdf) files into text. When you create your site using the WRE, it automatically adds them to your WebGUI config file.

On a Unix®, Linux®, or BSD style operating system, a section like this will be added to your WebGUI config file:

 

"searchIndexerPlugins" : {
    "doc" : "/data/wre/bin/doc2txt.sh",
    "pdf" : "/data/wre/bin/pdf2txt.sh",
    "readme" : "/bin/cat",
    "txt" : "/bin/cat",
    "html" : "/bin/cat",
    "xls" : "/data/wre/bin/xls2txt.sh",
    "htm" : "/bin/cat",
    "ppt" : "/data/wre/bin/ppt2txt.sh",
    "rtf" : "/data/wre/bin/doc2txt.sh"
},
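
Before relying on an entry in this section, it can help to run the converter by hand and confirm that it writes text to standard output. The following is a rough sketch of such a check. It assumes the converter takes the file path as its only argument, as in the cat example, and the name check_plugin is hypothetical.

```shell
#!/bin/sh
# Sketch: try a converter on a sample file before trusting it in
# searchIndexerPlugins. Assumes the converter is called with the file path
# as its only argument (as in the cat example). check_plugin is a
# hypothetical helper name.
check_plugin() {
    prog=$1
    sample=$2
    # The program must exist and be executable
    if ! command -v "$prog" >/dev/null 2>&1; then
        echo "not found or not executable: $prog" >&2
        return 1
    fi
    # Capture standard output only; that is where the text must appear
    output=$("$prog" "$sample" 2>/dev/null) || {
        echo "converter failed: $prog" >&2
        return 1
    }
    if [ -z "$output" ]; then
        echo "converter produced no text: $prog" >&2
        return 1
    fi
    echo "ok: $prog"
}
```

For example, check_plugin /data/wre/bin/pdf2txt.sh /path/to/sample.pdf would print "ok: /data/wre/bin/pdf2txt.sh" if the script produced any text for that file.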

 

Unfortunately, not all of the same utilities are available for Windows® users. Your config file will have to be modified to look like this:

 

"searchIndexerPlugins" : {
    "doc" : "/data/wre/bin/doc2txt.bat",
    "pdf" : "/data/wre/bin/pdf2txt.bat",
    "readme" : "/data/wre/bin/cat.bat",
    "txt" : "/data/wre/bin/cat.bat",
    "html" : "/data/wre/bin/cat.bat",
    "htm" : "/data/wre/bin/cat.bat",
    "rtf" : "/data/wre/bin/doc2txt.bat"
},

 

If you can find, buy, or build utilities to convert other document types, you can add them to your config file. In each entry, the first value is the file extension to look for, and the second value is the path to the program that will convert that file into text or HTML.
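
As an illustration of building your own utility, here is a minimal sketch of a converter for CSV files. Everything about it is an assumption made for the example: the script name csv2txt.sh, the function name csv_to_text, and the idea that the indexer passes the file path as the first argument (as in the cat example). A matching config entry might look like "csv" : "/data/wre/bin/csv2txt.sh", where the path is also hypothetical.

```shell
#!/bin/sh
# Hypothetical csv2txt.sh: a minimal converter sketch for .csv files.
# Assumes the indexer calls the program with the file path as its first
# argument and reads the converted text from standard output.
csv_to_text() {
    # Replace commas with spaces so each cell becomes a separate word
    tr ',' ' ' < "$1"
}

# Convert the file named on the command line, if one was given
if [ "$#" -gt 0 ]; then
    csv_to_text "$1"
fi
```

Keeping the conversion in a small function makes it easy to try on a sample file before pointing the config file at the script.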

Note that as you add new programs to your search indexer plugins, they will not retroactively index content that is already on your site. For that, you need to re-run the indexer with the --indexsite option, as described previously. However, those plugins will be used for any new content added to your site.
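
The re-index and the follow-up check can be combined into one small helper. The sketch below uses only the search.pl options shown earlier. The function name reindex_and_check and the DRY_RUN variable are hypothetical, and /data/WebGUI/sbin is simply the path used in the examples above.

```shell
#!/bin/sh
# Sketch: re-index one site, then spot-check the index with a keyword search.
# reindex_and_check and DRY_RUN are hypothetical names; the search.pl options
# are the ones shown earlier in this page.
reindex_and_check() {
    config=$1
    keyword=$2
    for args in "--indexsite" "--search=$keyword"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Print the command instead of running it
            echo "perl search.pl --config=$config $args"
        else
            # Run from the sbin folder; stop at the first failing command
            (cd /data/WebGUI/sbin && perl search.pl --config="$config" "$args") || return 1
        fi
    done
}
```

Running DRY_RUN=1 reindex_and_check www.example.com.conf features prints the two commands without executing them, which is a safe way to confirm what will run before starting a long re-index.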

Keywords: config

© 2018 Plain Black Corporation | All Rights Reserved