RaZe

File indexer


Hi RaZe, thank you very much for this. If it covers the same functionality, please feel free to replace "mine" with "yours". If you want to, you can add yourself to the developers in the box. I'd be glad if someone were willing to continue developing this thing, since I don't work with MediaWiki at work any more. I changed companies and they're running a Foswiki :(( What do you think? --Flominator 18:54, 30 June 2009 (UTC)

Hi Flominator. First of all, thanks for your message. I am very glad that I didn't cause any trouble by editing your extension page without asking for permission first.

As I saw on your user page, you are a member of the German Wikipedia. Given that, and the fact that the first message on your user talk page was in German, we could also happily communicate in German, which would be a great deal easier for me :D But since this is an English-language site, I think English would be preferred by everyone else... and it may be a good exercise.

To be honest, I have no experience with your company's type of wiki. My first real contact with wikis in general was during my final thesis at university. It was directly connected with the MediaWiki engine...

Back to the topic: I will release a small change to the extension right now, but changing the whole content of the page will have to wait for another day. I will come back to this later. Thanks for your trust in my work :-D --Razqubik 14:21, 1 July 2009 (UTC)

Thank you for doing it. Can you maybe translate the comments to English? Regards, --Flominator 19:42, 3 July 2009 (UTC)

I will do so next time. --Razqubik 07:50, 6 July 2009 (UTC)

Hi RaZe

It looks like you are the only person that can help me with this extension. I saw that you replied on the discussions page, and from above discussion it looks like you took over the extension. I would greatly appreciate some help with this. I think I'm misunderstanding how this extension works.

I go to the 'Special:FileIndexer' page.
In the edit box I put in the following: 'File:SomeDocument.docx', then I click "create".
If I then search for a word I know is in the document, it still finds nothing.

Please Help me,

--Johannekie 16:05, 14 June 2010

Hi. As a first step, try to check the following:

Is 'File' a valid namespace that is enabled in your search configuration, or did you enable it at search time?
Please review the content of your article 'File:SomeDocument.docx'. Did something change?
Please post the following values from your LocalSettings.php: the globals $wgFiPrefix and $wgFiPostfix.
Set $wgFiCheckSystem to true and check whether the requirements are installed.
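
To collect the values asked for above in one go, something like the following may help. It is only a sketch: show_fi_settings is a hypothetical helper name, and it assumes the settings are plain assignments in a LocalSettings.php in the current directory (they may instead live in FileIndexer_cfg.php).

```shell
# Hypothetical helper: print every line that assigns one of the three
# FileIndexer globals mentioned above, prefixed with its line number.
show_fi_settings() {
    grep -nE '\$wgFi(Prefix|Postfix|CheckSystem)[[:space:]]*=' "$@" || true
}

# Assumed location; pass FileIndexer_cfg.php as well if the values live there.
if [ -f LocalSettings.php ]; then
    show_fi_settings LocalSettings.php
fi
```

If nothing is printed, the globals are not set in that file and the extension is presumably running with its defaults.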

--RaZe 21:19, 24 June 2010 (UTC)

Hi RaZe, the FileIndexer extension does not run on MW 1.16.x. Can you fix it? --Swus 12:46, 30 September 2010 (UTC)

Hi Swus. I think you will be pleased to hear that I am very close to releasing a totally new version of this extension, with a much improved special page and much more. It will also run on MW 1.16. Greetings --RaZe 14:26, 30 September 2010 (UTC)

New version (for 1.16) available now. Greetings --RaZe 16:13, 3 October 2010 (UTC)

Hi RaZe, there is a new MW version (1.18.x) and Extension:FileIndexer is not running on it :( Can you fix it? The flags on the extension page do not sound good: "major security risk". --Swus (talk) 08:23, 17 February 2012 (UTC)

Hi Swus,

I would love to have a look at this. I can't really say when I will be able to do so. Sorry --RaZe (talk) 10:20, 22 February 2012 (UTC)

Link to a working wiki with FileIndexer?


Hi RaZe,

Thanks for creating this extension. I was wondering if you could provide a link to a wiki that has a working version (perhaps your own). I'm a novice programmer and am trying to find the best way to index pdf files so that they are searchable.

Thanks in advance, Dana (dchandXXler@mit.edu, without the XXs)

Hi. I am awfully sorry, but I don't have any list of public wikis using FileIndexer. I am only running internal wikis. They are using it, but you will not be able to access them. Unfortunately, I don't have any demo system running either. :(

--RaZe 02:46, 12 January 2012 (UTC)

Hej RaZe,

We are willing to develop and promote a fully functioning demo MediaWiki with FileIndexer and Tesseract (further goal: OCRopus) as an OCR GNU solution, and have already done quite a bit of development. I'd be thrilled if you could discuss and talk with us for just a brief moment. It would be great to have you as our advisor. Thank you very much. CollinBloom2 12:16, 30 January 2012 (UTC)

Hi CollinBloom2

How can I be of service to you? Gladly ;) --RaZe 07:44, 31 January 2012 (UTC)

Hi RaZe,

Thanks for the quick response. We are trying to index multiple PDF documents and make them searchable with the help of FileIndexer. We have the documents on the server, and we would like to call FileIndexer functions from a script to index all the PDFs in a given location. We would be very grateful if you could give us a hint.

Hi,

When you say "on the server", do you mean "uploaded into a wiki system"? What version of MediaWiki do you use? It is stated somewhere above that the extension is not compatible with the latest wiki version, though I haven't had the time to look into the problem so far...

I need detailed information about what your goal is here. In general it will be possible to index these files by script if they are not part of the wiki, since I am not really using wiki functions for the pure indexing part... but if you want to upload them into your wiki at the same time, I wouldn't go that way. In that case, do it in two steps: use a multiple-upload extension, and afterwards use the special page of FileIndexer to index a batch of files per request (if there are that many, do it in parts to avoid running into server response timeouts)... just list the filenames (using wildcards).

If you don't even want to upload these files into a wiki (let's say you don't even want the index in it), don't use my extension's functions. I believe there are many index engines out there that are way better than this one - dunno.

I hope I could help you with this... I will try to continue helping once you leave me some more detailed information. --RaZe 10:14, 6 February 2012 (UTC)

Hi again,

We managed to do the programming to upload and OCR the documents via script; we used the Snoopy class to simulate a browser.

We now have just one small problem: the search we use (basic or SphinxSearch) does not search the uploaded documents' content, it only searches by title.

If you put a link to the uploaded file on a page, it is searchable, but we would like to have the files searchable right away, even if they are not linked from any page.

Is this possible through some configuration, or have you encountered this problem?

Thank you for the support. --CollinBloom2 11:25, 08 February 2012 (UTC)

Hi there,

I am awfully sorry, but I am not really sure I understand what you are doing. As I understand it, you now have files uploaded to your wiki whose content you want to make searchable. OCR sounds like these files are not text-based but image-based. This is something my extension is not prepared for so far, as the required command line is missing from the configs. But you may add that.

What I see as a problem is that you don't want any pages created for your files. This extension does not actually search file contents; it writes an index of an uploaded file into the content of a page. This is normally the page with the name of the file in the File namespace, so that when you search for a keyword from a file, you find that page and can directly access the corresponding file.

E.g. if the file "x.txt" has the word "key" in it, the extension adds the word "key" to the page content of "File:x.txt" in some form. When you now search for the word "key", the MediaWiki built-in search engine (or any other engine that parses the page content directly or indirectly) finds the page "File:x.txt". It did not search the file's content.
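
The x.txt example can be sketched in a few lines of shell. This is not the extension's actual code: make_index is a hypothetical helper, the word-splitting rule (lowercase, distinct, alphanumeric words) is an assumption, and the "{{FileIndex |index= }}" wrapper is simply the template call that indexed pages on this wiki setup are reported to contain.

```shell
# Hypothetical helper: reduce a plain-text file to its distinct,
# lowercased words, printed on a single line.
make_index() {
    tr -cs '[:alnum:]' '\n' < "$1" \
        | tr '[:upper:]' '[:lower:]' \
        | sort -u \
        | tr '\n' ' ' \
        | sed 's/^ *//; s/ *$//'
}

# Stand-in for an uploaded text file such as x.txt.
sample=$(mktemp)
printf 'The key is under the mat.\n' > "$sample"

# This line is what would end up in the page text; the wiki search then
# matches "key" in the page, not in the file itself.
printf '{{FileIndex |index= %s}}\n' "$(make_index "$sample")"
rm -f "$sample"
```

The point of the sketch is only that the searchable text lives in the page: whatever search engine indexes page content will find it, and nothing ever searches the uploaded file directly.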

You said you "OCR the documents via script" - I take that to mean you already have an index... where did you put it?

Or am I still wrong about your goals? --RaZe (talk) 10:42, 22 February 2012 (UTC)

A barnstar for you!


	The Technical Barnstar
	Thanks again, RaZe, for the FileIndexer. It is simply great ;-) SmartK (talk) 14:52, 22 February 2012 (UTC)

Thank you, I am really pleased! I took the liberty of carrying the star over to this version of the page, since your edit overwrote my last two posts (parallel posting in the wiki ;-) May the later one win!)

--RaZe (talk) 12:24, 24 February 2012 (UTC)

Sorry, RaZe. And yes, of course... gladly --SmartK (talk) 13:49, 24 February 2012 (UTC)

However you manage to do it... it seems to be reproducible :P I just had to correct it again :D --RaZe (talk) 13:17, 27 February 2012 (UTC)

I am innocent ;-) --SmartK (talk) 15:56, 27 February 2012 (UTC)

Dumping the full output from pdftotext into the wiki directly?


Hi RaZe. This really is an incredible extension. I second the barnstar and am truly thankful you created this. If at all possible, could you please look at the question I posted on the extension talk page and let me know if you have a quick suggestion for how to dump the full output of a basic pdftotext operation into the index? I would be immensely appreciative.

I imagine this is actually pretty easy, but I'm at a loss to figure out the place in the code where I could do that.

Hi Mr. Anonymous :D

Please check whether the changes I posted on the extension's talk page meet your requirements.

Best regards --RaZe (talk) 11:10, 20 April 2012 (UTC)

Fileindexer how to


Hi RaZe, I think I got the extension up and running. I can get to the special page, and I can get it to index the PDF file. But now I don't know where to search for it; it doesn't show up if I do a normal search at the top of the page. Can you help me find out what I'm doing wrong? Also, a guide on how to use the extension would be great :) Sincerely, Mikkel

Hi, how do you know that you got an index? Did you check the article it was supposed to be created in? If it is there, you should be able to search for it just as you can search for any other part of this article. It somewhat depends on which search engine you use (wiki built-in, Lucene, ...). All this extension does is read e.g. the PDF and copy its content in a given format into an article. If your search engine can do full-text searches, it should be able to find the copy of your PDF content. --RaZe (talk) 23:08, 3 May 2012 (UTC)

Hm, I found the article, but the place where the index should be is empty. The name of the article is the file name, and the text says

"{{FileIndex |index= }}" but none of the content of the PDF file.

I hope you can help me.

1st: It seems that you didn't create a template called "FileIndex" (step 6 in the installation description).

2nd: As you can see, no index was created from the PDF content... if one had been, it would appear right after the equals sign "=".

Did you follow all instructions?

Try setting $wgFiCheckSystem = true in the file FileIndexer_cfg.php and send me the result. --RaZe (talk) 14:00, 7 May 2012 (UTC)

The template is created, but how do I create the index correctly?

When I try to index a file from the special page, I have to uncheck "No Updates" and change the "Destenation Namespace" from "File" to "Main", or else it will just say

"For the following list of articles the index update process was suppressed:"

After I changed that, it seems to run correctly, and it then creates a new page with the file name and the template, but the index is still empty.

Where do I read the output of $wgFiCheckSystem?

Again, thank you for helping me.

Hi, I would like you to do a few things:

Make sure you followed the section Requirements.
Log on to your web server, where the required tools are installed, and send me the output of the following commands (and compare it to the following array).

Array for comparison (taken from file FileIndexer_cfg.php)

$wgFiCommandPaths = array(
    'pdftotext' => "/usr/bin/pdftotext",
    'iconv' => "/usr/bin/iconv",
    'antiword' => "/usr/bin/antiword",
    'xls2csv' => "/usr/bin/xls2csv",
    'catppt' => "/usr/bin/catppt",
    'strings' => "/usr/bin/strings",
    'unzip' => "/usr/bin/unzip",
);

Execute these commands and compare outputs with array above (change array if needed):

> which pdftotext
> which iconv
> which antiword
> which xls2csv
> which catppt
> which strings
> which unzip
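
The comparison can also be scripted. A minimal sketch, assuming every tool is expected under /usr/bin as in the array above; check_tool is a hypothetical helper name, not part of the extension:

```shell
# Hypothetical helper: compare where a tool is actually found with the
# path configured for it in $wgFiCommandPaths.
check_tool() {
    # $1 = tool name, $2 = configured path
    found=$(command -v "$1" 2>/dev/null || true)
    if [ -z "$found" ]; then
        echo "$1: MISSING"
    elif [ "$found" = "$2" ]; then
        echo "$1: OK"
    else
        echo "$1: MISMATCH (found $found, configured $2)"
    fi
}

# Same tools and the same /usr/bin paths as in the array above.
for tool in pdftotext iconv antiword xls2csv catppt strings unzip; do
    check_tool "$tool" "/usr/bin/$tool"
done
```

Any line reporting MISSING or MISMATCH points at an entry of the array that needs to be changed, or at a tool that still has to be installed.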

Make sure your HTTP user (the user your httpd process runs as, or whatever web server you use) is allowed to execute the commands above, including which (presumably the x attribute for 'others').
If everything is set up correctly, you may try a dry run on one of the files you want to build an index of:

I will give an example for a PDF file with the path "/tmp/my.pdf":

Copy or upload the file (however you like) to /tmp on your server.
Now you have to reconstruct the command. For this we take some information from the FileIndexer configuration (FileIndexer_cfg.php).

Config array $wgFiCommandCalls['pdf'] says:

WC_FI_COMMAND . "[pdftotext] -raw -nopgbrk \"" . WC_FI_FILEPATH . "\" -| " . WC_FI_COMMAND . "[iconv] -f ISO-8859-1 -t UTF-8"

Config array $wgFiCommandPaths['pdftotext']/$wgFiCommandPaths['iconv'] says:

    'pdftotext' => "/usr/bin/pdftotext",
    'iconv' => "/usr/bin/iconv",

So I substitute "WC_FI_COMMAND . [pdftotext]" with "/usr/bin/pdftotext", "WC_FI_COMMAND . [iconv]" with "/usr/bin/iconv" and "WC_FI_FILEPATH" with "/tmp/my.pdf" (I didn't really pay attention to the quotes here!)

The resulting command, which I finally execute on the console of my server, is (hopefully you then get the content of your file):

> /usr/bin/pdftotext -raw -nopgbrk "/tmp/my.pdf" -| /usr/bin/iconv -f ISO-8859-1 -t UTF-8
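
The substitution can be written down as a small shell function, which makes it easy to repeat the dry run for other files. This is a sketch only: build_pdf_command is a hypothetical name, and the paths are the ones from the example above.

```shell
# Hypothetical helper: fill the tool paths and the file path into the
# pdf command template quoted above.
build_pdf_command() {
    # $1 = path to pdftotext, $2 = path to iconv, $3 = file to index
    printf '%s -raw -nopgbrk "%s" - | %s -f ISO-8859-1 -t UTF-8\n' "$1" "$3" "$2"
}

cmd=$(build_pdf_command /usr/bin/pdftotext /usr/bin/iconv /tmp/my.pdf)
echo "$cmd"

# Run it only when both tools and the test file are present,
# and report how many bytes of text came out.
if [ -x /usr/bin/pdftotext ] && [ -x /usr/bin/iconv ] && [ -f /tmp/my.pdf ]; then
    sh -c "$cmd" | wc -c
fi
```

A byte count of 0 would be consistent with the empty "index=" seen in the article, and would suggest the problem lies with the tools or the file rather than with the wiki.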

I really hope this helps you/us find the problem... please leave me a note --RaZe (talk) 11:48, 15 May 2012 (UTC)

Hi, I appreciate your work! But we found another solution from Google that could do just what we wanted, and it was done in less than 5 minutes. Even if I could get this plugin to work, the Google solution would still be FAR easier to use, and I think that is what we need.

But I'm REALLY glad that you had time to help me, and your work won't be wasted, as I'm sure someone else can use the info you provided me.
