The Quick and dirty:

New PasteLert lives at http://andrewmohawk.com/pasteLertV2/

Downloads:

» Interface -> http://andrewmohawk.com/pasteLertV2/src/pastelertv2_Interface.zip
» Cron Tasks -> http://andrewmohawk.com/pasteLertV2/src/pastelertv2_Cron_Tasks.zip
» Scraping Script -> http://andrewmohawk.com/pasteLertV2/src/pastelertv2_Python_Scraping_Script.zip

And of course if you want everything -> http://andrewmohawk.com/pasteLertV2/src/pastelertv2_all.zip

Overview

My Linode has been pretty much falling over because of the previous version of the Pastebin alerts, for a number of reasons:

» Scripts sometimes got blackholed (pastebin.com accepts the connection but doesn't respond, due to their DDoS protection)
» Scripts were sometimes started while the PREVIOUS run had not yet completed, causing a chain reaction of fail
» Deletes would happen while the above scripts were running, causing MySQL to tilt
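
For what it's worth, the first two failure modes can be guarded against in a few lines. This is only a rough Python sketch, not what the old scripts did: the lock path, the function names and the 15 second timeout are all made up for illustration, and `fcntl` means it is Unix-only.

```python
import fcntl
import urllib.request


def acquire_run_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if another run has it."""
    handle = open(lock_path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the other run.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def fetch(url, timeout_seconds=15):
    """Fetch a URL, giving up if the server accepts the connection but never answers."""
    with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
        return response.read()
```

A cron'd script would call `acquire_run_lock()` first and simply exit if it gets `None` back; the lock is released when the file is closed or the process ends.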

Lucene/Solr

As such I recently re-worked the service. Initially I started playing around with other DB types to try and get my Linode to store more than a day or two's worth of Pastebin.com data. I looked around and it appeared that Lucene/Solr was the solution I was looking for, and it actually does work _very_ well at storing large amounts of data (I had it running with about 2 weeks of data). However, there were a number of issues:

» After a week or two's worth of data (averaging around 20-30K posts a day, x 14 = 280 000 to 420 000 posts) the search times were SLOW (something like 5-15 SECONDS)
» Because Lucene is not an RDBMS there is no concept of a row ID or an auto-incrementing ID, so the script would have to handle this itself by getting the number of entries and adding 1 every time
» Because of the above, alerts would have to work on a date (when the post was made, so working out from x seconds or y minutes ago), and an ISO formatted date no less (no unixtime), which became a real pain.
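
To give an idea of the date juggling involved, here is a small Python sketch of turning "the last x seconds" into an ISO formatted range query. The field name `posted` and both function names are assumptions for illustration, not taken from my actual schema.

```python
from datetime import datetime, timedelta, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # UTC timestamp with a trailing Z


def unix_to_iso(unix_seconds):
    """Convert a unixtime value to an ISO formatted UTC string."""
    return datetime.fromtimestamp(unix_seconds, timezone.utc).strftime(ISO_FORMAT)


def window_query(seconds_ago, now=None, field="posted"):
    """Build a range query covering the last `seconds_ago` seconds."""
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(seconds=seconds_ago)
    return "%s:[%s TO %s]" % (field, start.strftime(ISO_FORMAT), now.strftime(ISO_FORMAT))
```

With an auto-incrementing ID the same alert check is just "everything with an ID greater than the last one I looked at", which is a big part of why I went back to MySQL.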

However, with that being said, I did still build the interfaces for it, and if you are looking to implement it with Solr / Lucene just message me for the schema and Python/PHP scripts.

Basics

Ultimately, however, I decided to stick with the same system as before, but rather than have cron'd scripts pull the data there is now one long-running Python script that you can leave in the background. It's pretty basic and the code should be self-explanatory. The gist of it:

1. Pull archive.php from pastebin.com [ http://pastebin.com/archive.php ]
2. Extract all the paste entries with a regular expression ( re.compile('<td><img src="/i/t.gif" .*?<a href="/(.*?)">(.*?)</a></td>.*?<td>(.*?)</td>', re.S) )
3. Check if we have seen it among the last 500 or so pastes (kept in a Python list); if not, pull the raw paste
4. INSERT IGNORE this data (the IGNORE is in case a duplicate slipped through)
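
A cut-down Python sketch of steps 2 to 4, using the same regular expression. It is not the actual scrapePastebinMySQL.py: the sample HTML is invented, the class and function names are hypothetical, a `deque` stands in for the plain list, and the table and column names in the SQL are assumptions.

```python
import re
from collections import deque

PASTE_RE = re.compile(
    r'<td><img src="/i/t.gif" .*?<a href="/(.*?)">(.*?)</a></td>.*?<td>(.*?)</td>',
    re.S,
)

# Hypothetical table and columns; IGNORE skips rows whose unique key already exists.
INSERT_SQL = "INSERT IGNORE INTO pastes (paste_id, title, body) VALUES (%s, %s, %s)"


class SeenCache:
    """Remembers the most recent paste IDs and forgets the oldest automatically."""

    def __init__(self, size=500):
        self._ids = deque(maxlen=size)

    def is_new(self, paste_id):
        if paste_id in self._ids:
            return False
        self._ids.append(paste_id)
        return True


def new_pastes(archive_html, seen):
    """Return (paste_id, title, third_cell) tuples not seen recently."""
    return [row for row in PASTE_RE.findall(archive_html) if seen.is_new(row[0])]


# Made-up archive snippet, only shaped so that the regex matches it.
SAMPLE = '''
<tr><td><img src="/i/t.gif" class="i_p0" alt="" /><a href="/AbC123">Untitled</a></td>
<td>5 sec ago</td></tr>
<tr><td><img src="/i/t.gif" class="i_p0" alt="" /><a href="/XyZ789">my notes</a></td>
<td>9 sec ago</td></tr>
'''
```

Calling `new_pastes(SAMPLE, seen)` twice with the same `SeenCache` returns both entries the first time and nothing the second time, which is the point where the real script would go and pull the raw paste for each new ID.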

Then for the “alerts” themselves, basically:

» At a regular interval (whenever you set the cron to run; mine runs every 20 minutes) search the stored pastes for the alert terms in the database
» If a term has been seen, send out a mail
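
The alert logic boils down to something like the sketch below. Note the real cron task is a PHP script that searches MySQL; this is just a Python illustration with plain substring matching standing in for the database search, and the function names and data shapes are made up.

```python
from email.message import EmailMessage


def find_alert_hits(alerts, pastes):
    """alerts: (email, term) pairs; pastes: (paste_id, body) pairs.

    Returns {email: [(term, paste_id), ...]} for every term found in a paste.
    """
    hits = {}
    for email, term in alerts:
        needle = term.lower()
        for paste_id, body in pastes:
            if needle in body.lower():
                hits.setdefault(email, []).append((term, paste_id))
    return hits


def build_mail(email, matches):
    """Build (but do not send) one alert mail listing every match for a user."""
    message = EmailMessage()
    message["To"] = email
    message["Subject"] = "PasteLert: %d match(es)" % len(matches)
    lines = ["'%s' seen in http://pastebin.com/%s" % (term, pid) for term, pid in matches]
    message.set_content("\n".join(lines))
    return message
```

Sending is left out on purpose; the built message could be handed to `smtplib` or whatever mailer you already use.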

Additionally, of course, there is a web interface that you can use to add alerts as well as search the currently indexed pastes.

Downloads / Config

My Crontab at this stage looks as follows (if you want to just copy mine):
*/20 * * * * php /home/andrew/pasteLertV2/Cron_Tasks/sendAlerts.php
0 1 * * * php /home/andrew/pasteLertV2/Cron_Tasks/truncPastes.php

And I've kicked off the script that puts the data in the database with:

andrew@mothership:~/pasteLertV2/Python_Scraping_Script$ nohup python scrapePastebinMySQL.py &

I've separated the scripts into 3 sections:

» Interface -> http://andrewmohawk.com/pasteLertV2/src/pastelertv2_Interface.zip
» Cron Tasks -> http://andrewmohawk.com/pasteLertV2/src/pastelertv2_Cron_Tasks.zip
» Scraping Script -> http://andrewmohawk.com/pasteLertV2/src/pastelertv2_Python_Scraping_Script.zip

And of course if you want everything -> http://andrewmohawk.com/pasteLertV2/src/pastelertv2_all.zip

Essentially the only modification you need to make is to the DB credentials: for the interface / cron tasks, edit the 'setDB.php' script, and for the scraping script, set them on line 141.

-AM
