Introduction
This is a little side project I did to try to scrape images out of Reddit threads. A few subreddits that discuss shows, /r/anime in particular, have threads where users post screenshots of the episodes. I thought it'd be cool to see how much effort it would take to automatically collate those screenshots from a thread and display them in a simple gallery. The result looked like this:
PRAW
PRAW, the Python Reddit API Wrapper, provides a nice set of bindings for talking to Reddit.
To scrape Reddit you need credentials. The way to generate them is hidden away at https://www.reddit.com/prefs/apps, where you have to register a new "app" with Reddit. Once you have them, connecting is as simple as:
import praw

reddit = praw.Reddit(
    client_id='id',
    client_secret='secret',
    user_agent='useragent',
    username='username',
    password='DevToIsCool',
)
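Hard-coding the password is fine for a quick experiment, but it is easy to leak once the script is shared. As a minimal sketch, the credentials could be read from environment variables instead. The variable names and the `load_reddit_credentials` helper below are hypothetical, chosen only for illustration.

```python
import os

# Hypothetical environment variable names; pick whatever suits your setup.
REQUIRED_VARS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "user_agent": "REDDIT_USER_AGENT",
    "username": "REDDIT_USERNAME",
    "password": "REDDIT_PASSWORD",
}


def load_reddit_credentials(env=os.environ):
    """Build the keyword arguments for praw.Reddit from a mapping of variables."""
    missing = [name for name in REQUIRED_VARS.values() if name not in env]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[name] for key, name in REQUIRED_VARS.items()}
```

With those variables set, the connection would become `reddit = praw.Reddit(**load_reddit_credentials())`.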
The API makes traversing Reddit simple. For example, this prints all of the comments in a thread:
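submission = reddit.submission(url="https://reddit.com/r/abcde")
# Drop the "load more comments" placeholders, which have no body
submission.comments.replace_more(limit=0)
for comment in submission.comments.list():
    print(comment.body)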
Finding links
Nearly all of the images I was looking for were posted to Imgur, so I only matched on those, using a regular expression to extract the links. I always recommend a tool like RegEx101 for debugging regular expressions, as they can be pretty brain-bending.
import re
# Dots are escaped so they match literally; \S keeps a match inside one link
REGEX_TEST = r"((http|https)://i\.imgur\.com/\S+?\.(jpg|png))"
p = re.compile(REGEX_TEST, re.IGNORECASE)
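To see what a pattern of this shape returns, the sketch below runs it against a made-up comment. The sample text and the names `IMGUR_PATTERN` and `sample_body` are illustrative only. Because the pattern contains several groups, `findall` returns a tuple per match, and the full URL is the first element.

```python
import re

# Pattern for direct Imgur image links; dots are escaped so they match literally.
IMGUR_PATTERN = re.compile(
    r"((http|https)://i\.imgur\.com/\S+?\.(jpg|png))", re.IGNORECASE
)

# Made-up comment body, for illustration only.
sample_body = (
    "Loved this scene: https://i.imgur.com/abc123.jpg "
    "and this one [link](http://i.imgur.com/XyZ789.PNG) too. "
    "Not a direct image: https://imgur.com/gallery/abc"
)

# Each match is a tuple of (full URL, scheme, extension).
matches = IMGUR_PATTERN.findall(sample_body)
urls = [match[0] for match in matches]
print(urls)
```

The gallery link is skipped because it is not on the `i.imgur.com` host, and the Markdown closing bracket is left out because the match ends at the file extension.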
Check if an image still exists
One of the problems I found was dead image links, so I created a simple helper that checks the status_code for that link.
import requests

# Check if a link still exists
def checkLinkActive(url):
    # HEAD fetches only the headers, so the image itself is not downloaded
    response = requests.head(url)
    return response.status_code == 200
Getting Thumbnails
To save bandwidth and your mobile data I wanted to return a smaller version of the image. Imgur lets you append a size character to the image name, just before the file extension, to get it at a different size, for example 'l' for large and 's' for small.
# Add a letter to an imgur url to make a small thumbnail
def getImgurThumbnail(url, size):
    startStr = url[:-4]
    endStr = url[-4:]
    return startStr + size + endStr
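The slicing in `getImgurThumbnail` relies on the extension being exactly four characters long, which holds for `.jpg` and `.png`. As a hedged alternative, the sketch below splits on the extension instead, so a link ending in `.jpeg` or carrying a query string would also work. The function name `imgur_thumbnail` and the default size are assumptions for illustration.

```python
from posixpath import splitext
from urllib.parse import urlsplit, urlunsplit


def imgur_thumbnail(url, size="m"):
    """Insert a size letter between the image name and its extension."""
    parts = urlsplit(url)
    # splitext separates "/abc123" from ".jpg", whatever the extension length
    stem, ext = splitext(parts.path)
    return urlunsplit(parts._replace(path=stem + size + ext))


print(imgur_thumbnail("https://i.imgur.com/abc123.jpg", "s"))
print(imgur_thumbnail("https://i.imgur.com/abc123.jpeg"))
```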
Putting it all together
Putting all of these bits together you get
def getImages(url):
    submission = reddit.submission(url=url)

    # Tell the API to return all comments in the thread; results are
    # paginated by default
    submission.comments.replace_more(limit=None)

    # Create RegEx object for matching images
    REGEX_TEST = r"((http|https)://i\.imgur\.com/\S+?\.(jpg|png))"
    p = re.compile(REGEX_TEST, re.IGNORECASE)

    imageMatches = []
    for comment in submission.comments.list():
        # findall returns one tuple per match; element 0 is the full URL
        matches = p.findall(comment.body)
        for match in matches:
            if checkLinkActive(match[0]):
                imageMatches.append(
                    {"image": match[0], "thumbnail": getImgurThumbnail(match[0], "m")}
                )

    return imageMatches
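To try the matching logic without Reddit credentials or network access, the same steps can be sketched over plain strings. The `collect_images` function below is hypothetical: it takes a list of comment bodies and an `is_active` callable standing in for `checkLinkActive`. Skipping links that were already seen is an addition of mine, not part of the original function.

```python
import re

IMGUR_PATTERN = re.compile(
    r"((http|https)://i\.imgur\.com/\S+?\.(jpg|png))", re.IGNORECASE
)


def collect_images(bodies, is_active=lambda url: True):
    """Return one image/thumbnail dict per distinct active Imgur link."""
    seen = set()
    images = []
    for body in bodies:
        for match in IMGUR_PATTERN.findall(body):
            url = match[0]
            if url in seen:
                continue  # the same screenshot is often posted more than once
            seen.add(url)
            if is_active(url):
                # Same slicing idea as getImgurThumbnail, with the "m" size
                images.append({"image": url, "thumbnail": url[:-4] + "m" + url[-4:]})
    return images


bodies = [
    "First shot https://i.imgur.com/aaa111.jpg",
    "Same again https://i.imgur.com/aaa111.jpg and https://i.imgur.com/bbb222.png",
]
result = collect_images(bodies)
print(result)
```

Passing `checkLinkActive` as `is_active` would bring back the dead-link check, at the cost of one HTTP request per distinct link.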
Trying it out
I decided to stand up a quick demo of this, using an Azure Function to host my new function and a simple web form to allow people to try it out. Just copy and paste a Reddit URL and the function will return any images.
The demo app uses Bulma for the look and feel, and a little bit of jQuery for loading the page.
In a future article I'll look at providing a show name search, so you don't have to paste individual episode URLs. Happy Reddit scraping!