The quick and dirty:
New PasteLert lives at http://andrewmohawk.com/pasteLertV2/
Downloads:
» Interface -> http://andrewmohawk.com/pasteLertV2/src/pastelertv2_Interface.zip
» Cron Tasks -> http://andrewmohawk.com/pasteLertV2/src/pastelertv2_Cron_Tasks.zip
» Scraping Script -> http://andrewmohawk.com/pasteLertV2/src/pastelertv2_Python_Scraping_Script.zip
And of course if you want everything -> http://andrewmohawk.com/pasteLertV2/src/pastelertv2_all.zip
Overview
My Linode has been pretty much falling over due to the previous version of the pastebin alerts, for a number of reasons:
» Scripts sometimes get blackholed (pastebin.com allows the connection but doesn’t respond, due to their DDoS protection)
» A new script run would sometimes start while the PREVIOUS run had not yet completed, causing a chain reaction of fail
» Deletes would be happening while the above scripts were running, causing MySQL to tilt
Lucene/Solr
As such I recently re-worked the service. Initially I started playing around with other DB types to try and get my Linode to store more than a day or two’s worth of Pastebin.com data. I looked around and it appeared that Lucene/Solr was the solution I was looking for, and actually it does work _very_ well at storing large amounts of data (I had it running with about 2 weeks of data). However, there were a number of issues:
» After about a week or two’s worth of data (avg around 20-30K posts a day, x 14 = 280 000 – 420 000 posts) the search times were SLOW (talking something like 5-15 SECONDS)
» Because Lucene is not an RDBMS there is no concept of a row ID or an auto-incrementing ID, so the script would have to handle this itself by getting the number of entries and adding 1 every time
» Because of the above, alerts would have to work on a date (when the post was made, so working out from x seconds ago or y minutes ago), and an ISO-formatted date no less (no unixtime), which became a real pain.
That being said, I did still build the interfaces for it, and if you are looking to implement it with Solr / Lucene just message me for the schema and Python/PHP scripts.
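To illustrate the date handling mentioned above: Solr date fields expect ISO 8601 timestamps in UTC (for example 1995-12-31T23:59:59Z), so an "x seconds ago" cutoff has to be converted before it can be used in a query. The following is only a minimal sketch in Python 3, not the code from the Solr version of the scripts; the function names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

SOLR_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # ISO 8601, UTC, trailing "Z"


def solr_cutoff(seconds_ago, now=None):
    """Return the ISO timestamp for `seconds_ago` seconds before `now`.

    `now` is only a parameter so the function can be checked with a fixed
    time; it defaults to the current UTC time.
    """
    now = now or datetime.now(timezone.utc)
    return (now - timedelta(seconds=seconds_ago)).strftime(SOLR_FORMAT)


def unixtime_to_solr(unixtime):
    """Convert a unix timestamp (seconds since the epoch) to the same format."""
    return datetime.fromtimestamp(unixtime, tz=timezone.utc).strftime(SOLR_FORMAT)
```

With MySQL none of this is needed, since an auto-incrementing ID or a plain unixtime column can be compared directly.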
Basics
Ultimately, however, I decided to stick to the same system as before, but rather than have cron’d scripts that pull the data, there is now one long-running Python script that you can place in the background. It is pretty basic and the code should be self-explanatory. The gist of it:
1. Pull archive.php from pastebin.com [ http://pastebin.com/archive.php ]
2. Extract all the paste entries with a regular expression ( re.compile('<td><img src="/i/t.gif" .*?<a href="/(.*?)">(.*?)</a></td>.*?<td>(.*?)</td>',re.S) )
3. Check if we have seen it in the last 500 or so (which we keep in a Python list); if not, pull the raw paste
4. INSERT IGNORE this data (in case a duplicate slipped past the check above)
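The four steps above can be sketched roughly as follows. This is an illustrative Python 3 sketch, not the contents of scrapePastebinMySQL.py: only the regular expression comes from the post, while SeenCache, poll_once and the three callables passed into it (fetching the archive page, fetching a raw paste, and storing a row) are hypothetical names. A deque with a maximum length stands in for the "last 500 or so" list.

```python
import re
from collections import deque

# The expression from step 2: captures (paste id, title, age) per archive row.
PASTE_ROW = re.compile(
    r'<td><img src="/i/t.gif" .*?<a href="/(.*?)">(.*?)</a></td>.*?<td>(.*?)</td>',
    re.S,
)


def extract_entries(archive_html):
    """Step 2: return a list of (paste_id, title, age) tuples."""
    return PASTE_ROW.findall(archive_html)


class SeenCache:
    """Step 3: remember the most recent paste ids (500 by default)."""

    def __init__(self, size=500):
        self._ids = deque(maxlen=size)  # oldest ids fall off automatically

    def is_new(self, paste_id):
        if paste_id in self._ids:
            return False
        self._ids.append(paste_id)
        return True


def poll_once(fetch_archive, fetch_raw, store, cache):
    """One pass over steps 1-4; returns how many pastes were stored.

    fetch_archive() -> archive HTML, fetch_raw(paste_id) -> paste text and
    store(paste_id, title, body) are assumed hooks; store would be where
    the INSERT IGNORE happens.
    """
    stored = 0
    for paste_id, title, _age in extract_entries(fetch_archive()):
        if cache.is_new(paste_id):
            store(paste_id, title, fetch_raw(paste_id))
            stored += 1
    return stored
```

The long-running script would then call something like poll_once in a loop with a short sleep between passes. Giving the HTTP calls a timeout is one way to keep a blackholed connection from stalling that loop.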
Then for the “alerts” themselves, basically:
» Every 30 minutes (or however often you set the cron to run), check whether any of the alert terms in the database have been seen
» If seen, send out mail
Additionally, of course, there is a web interface that you can use to add alerts as well as search the currently indexed pastes.
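As a rough illustration of the alert pass described above, the matching can be thought of as below. The real alert job is the PHP cron task (sendAlerts.php) working against MySQL. This Python sketch uses plain in-memory lists instead, and the function names, the case-insensitive substring matching and the mail text are all assumptions made for the example.

```python
def find_matches(terms, pastes):
    """Map each alert term to the ids of the pastes that contain it.

    terms  -- iterable of search strings
    pastes -- iterable of (paste_id, body) pairs
    Matching here is a case-insensitive substring test.
    """
    hits = {}
    for paste_id, body in pastes:
        lowered = body.lower()
        for term in terms:
            if term.lower() in lowered:
                hits.setdefault(term, []).append(paste_id)
    return hits


def build_alert_mail(term, paste_ids):
    """Build the text of one alert mail, with a link per matching paste."""
    links = "\n".join("http://pastebin.com/" + pid for pid in paste_ids)
    return "Alert for '%s' matched %d paste(s):\n%s" % (term, len(paste_ids), links)
```

Terms with no hits simply do not appear in the result, so no mail goes out for them.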
Downloads / Config
My crontab at this stage looks as follows (if you want to just copy mine):
*/20 * * * * php /home/andrew/pasteLertV2/Cron_Tasks/sendAlerts.php
0 1 * * * php /home/andrew/pasteLertV2/Cron_Tasks/truncPastes.php
And I’ve kicked off the script that puts the data in the database with:
andrew@mothership:~/pasteLertV2/Python_Scraping_Script$ nohup python scrapePastebinMySQL.py &
I’ve separated the scripts into 3 sections:
» Interface -> http://andrewmohawk.com/pasteLertV2/src/pastelertv2_Interface.zip
» Cron Tasks -> http://andrewmohawk.com/pasteLertV2/src/pastelertv2_Cron_Tasks.zip
» Scraping Script -> http://andrewmohawk.com/pasteLertV2/src/pastelertv2_Python_Scraping_Script.zip
And of course if you want everything -> http://andrewmohawk.com/pasteLertV2/src/pastelertv2_all.zip
Essentially the only modification you need to make is to your DB credentials: within the interface / cron tasks, edit the ‘setDB.php’ script, and within the scraping script, set them on line 141.
-AM
Hi,
Thanks for sharing this script. It would be quite useful to create the database and its tables.
Just some positive feedback to improve :)
Cheers,
Jeroen
Hi Jeroen,
Download the pastelert archive from http://andrewmohawk.com/pasteLert/pasteLert.zip and use pastebinStructure.sql to create the DB structure.
Andrea