The quick and dirty:

New PasteLert lives at


» Interface ->
» Cron Tasks ->
» Scraping Script ->

And of course if you want everything ->


My Linode has been pretty much falling over because of the previous version of the Pastebin alerts, for a number of reasons:

» Scripts sometimes get blackholed (the site allows the connection but doesn't respond, due to their DDoS protection)
» A new run would sometimes start while the PREVIOUS script had not yet completed, causing a chain reaction of fail
» Deletes would happen while the above scripts were running, causing MySQL to tilt


As such I recently reworked the service. Initially I started playing around with other DB types to try and get my Linode to store more than a day or two's worth of data. I looked around and it appeared that Lucene/Solr was the solution I was looking for, and actually it does work _very_ well at storing large amounts of data (I had it running with about 2 weeks of data). However, there were a number of issues:

» After about a week or two's worth of data (averaging around 20-30K posts a day, x 14 = 280,000 to 420,000 posts) the search times were SLOW (something like 5-15 SECONDS)
» Because Lucene is not an RDBMS there is no concept of a row ID or an auto-incrementing ID, so the script would have to handle this itself by getting the number of entries and adding 1 every time
» Because of the above, alerts would have to work on a date (when the post was made, so working out from x seconds ago or y minutes ago), and an ISO-formatted date no less (no unixtime), which became a real pain
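To illustrate the date handling described in the last point, here is a minimal sketch of turning "x seconds ago" into an ISO 8601 UTC timestamp and an open-ended range clause of the kind Solr date fields accept. The function names and the `posted` field name are hypothetical; this is not the code from the original Solr scripts.

```python
from datetime import datetime, timedelta, timezone


def solr_cutoff(seconds_ago, now=None):
    """Return an ISO 8601 UTC timestamp for `seconds_ago` before `now`."""
    now = now or datetime.now(timezone.utc)
    cutoff = now.astimezone(timezone.utc) - timedelta(seconds=seconds_ago)
    return cutoff.strftime("%Y-%m-%dT%H:%M:%SZ")


def since_query(field, seconds_ago, now=None):
    """Build a range clause matching documents dated at or after the cutoff.

    `field` is whatever date field the schema defines (hypothetical here).
    """
    return f"{field}:[{solr_cutoff(seconds_ago, now)} TO *]"
```

For example, `since_query("posted", 1800)` would produce a clause covering the last 30 minutes.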

However, with that being said, I did still build the interfaces for it, and if you are looking to implement it with Solr / Lucene just message me for the schema and Python/PHP scripts.


Ultimately, however, I decided to stick to the same system previously used, but rather than have cron'd scripts that pull the data, have one long-running Python script that you can place in the background. It's pretty basic and the code should be self-explanatory. The gist of it:

1. Pull archive.php from [ ]
2. Extract all the paste entries with a regular expression ( re.compile('<td><img src="/i/t.gif" .*?<a href="/(.*?)">(.*?)</a></td>.*?<td>(.*?)</td>', re.S) )
3. Check if we have seen it in the last 500 or so (which we keep in a Python list); if not, pull the raw paste
4. INSERT IGNORE this data (in case we missed a duplicate)
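The four steps above can be sketched roughly as follows. This is an illustration under assumptions, not the script from the download: the function and class names, the `raw_url_for` and `store` callables, and the table and column names in `INSERT_SQL` are all hypothetical, and a bounded `deque` stands in for the plain Python list mentioned in step 3. The fetch uses a timeout so a blackholed connection cannot hang the loop.

```python
import re
from collections import deque
from urllib.request import urlopen

# The regex from step 2. Groups: paste ID, title, contents of the next cell.
PASTE_RE = re.compile(
    r'<td><img src="/i/t.gif" .*?<a href="/(.*?)">(.*?)</a></td>.*?<td>(.*?)</td>',
    re.S,
)

# Hypothetical table and column names; the real schema ships with the download.
INSERT_SQL = "INSERT IGNORE INTO pastes (paste_id, title, body) VALUES (%s, %s, %s)"


def fetch(url, timeout=10):
    """Fetch a page; the timeout guards against connections that never answer."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def extract_pastes(html):
    """Return a list of (paste_id, title, third_cell) tuples found in the page."""
    return PASTE_RE.findall(html)


class RecentlySeen:
    """Remember the last `maxlen` paste IDs; older IDs fall off automatically."""

    def __init__(self, maxlen=500):
        self._ids = deque(maxlen=maxlen)

    def is_new(self, paste_id):
        if paste_id in self._ids:
            return False
        self._ids.append(paste_id)
        return True


def new_pastes(html, seen):
    """Yield only the archive entries whose ID has not been seen recently."""
    for paste_id, title, third_cell in extract_pastes(html):
        if seen.is_new(paste_id):
            yield paste_id, title, third_cell


def scrape_once(archive_url, raw_url_for, store, seen):
    """One pass: pull the archive, fetch each unseen raw paste, hand it to `store`.

    `raw_url_for(paste_id)` and `store(paste_id, title, body)` are supplied by
    the caller; `store` would typically execute INSERT_SQL.
    """
    for paste_id, title, _ in new_pastes(fetch(archive_url), seen):
        store(paste_id, title, fetch(raw_url_for(paste_id)))
```

A long-running script would call `scrape_once` in a loop with a sleep between passes, which avoids the overlapping-runs problem that separate cron jobs had.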

Then for the “alerts” themselves, basically:

» Every 30 minutes (or whenever you set the cron to run), check whether the terms in the database have been seen
» If seen, send out mail
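As a rough illustration of the alert pass (the real cron task is the PHP script sendAlerts.php working against MySQL), here is a minimal Python sketch that matches terms against paste bodies held in memory and builds the notification mail. The function names, the case-insensitive substring matching, and the message wording are assumptions made for the example.

```python
from email.message import EmailMessage


def find_matches(terms, pastes):
    """Map each term to the IDs of pastes whose body contains it.

    `pastes` is an iterable of (paste_id, body) pairs; matching here is a
    simple case-insensitive substring test.
    """
    pastes = list(pastes)
    matches = {}
    for term in terms:
        needle = term.lower()
        hits = [pid for pid, body in pastes if needle in body.lower()]
        if hits:
            matches[term] = hits
    return matches


def build_alert(recipient, term, paste_ids):
    """Build (but do not send) an alert mail listing the matching paste IDs."""
    msg = EmailMessage()
    msg["To"] = recipient
    msg["Subject"] = f"PasteLert: '{term}' seen in {len(paste_ids)} paste(s)"
    msg.set_content("Matching pastes:\n" + "\n".join(paste_ids))
    return msg
```

Sending is left out on purpose; the resulting message could be passed to `smtplib.SMTP.send_message` or whatever mailer the host provides.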

Additionally, of course, there is a web interface that you can use to add alerts as well as search the currently indexed pastes.

Downloads / Config

My crontab at this stage looks as follows (if you want to just copy mine):
*/20 * * * * php /home/andrew/pasteLertV2/Cron_Tasks/sendAlerts.php
0 1 * * * php /home/andrew/pasteLertV2/Cron_Tasks/truncPastes.php

And I've kicked off the script that puts the data in the database with:

andrew@mothership:~/pasteLertV2/Python_Scraping_Script$ nohup python &

I've separated the scripts into 3 sections:

» Interface ->
» Cron Tasks ->
» Scraping Script ->

And of course if you want everything ->

Essentially, the only modification you need to make is to edit the 'setDB.php' script in the interface / cron tasks with your DB credentials, and to set the same credentials on line 141 of the scraping script.


2 Comments to “PasteLert v2!”

  • Hi,

    Thanks for sharing this script. It would be quite useful to create the database and its tables.

    Just some positive feedback to improve :)


  • Hi Jeroen,
    Download the PasteLert archive from and use pastebinStructure.sql to create the DB structure.
