
How to Use Python to Scrape Flickr Data

#Flickr API

Flickr has an API that allows you to download data from their website, including images and their metadata. These images can then be used for multiple purposes, including machine learning.

To use the API, you first need to:

  1. Create an account on Flickr
  2. Create an app on the “App Garden”, which gives you an API key and a secret key

As with any web scraping, make sure you scrape responsibly and follow the Flickr API Terms of Use.

#Example Tutorial: Downloading a Collection of Images

On Flickr, photos are stored in “sets”. Sometimes, sets are stored in “collections”. In this example, we will be downloading a single collection.

Collection
├─ Set
│  ├─Photo
│  ├─Photo
│  └─Photo
├─ Set
│  ├─Photo
│  └─Photo
└─ Set
   ├─Photo
   ├─Photo
   ├─Photo
   └─Photo

Note: Collections can contain thousands of images, so you should limit how many you download.

By the end of this tutorial, you’ll have written two Python scripts:

  1. flickr-scrape-urls.py: walks through a given collection and stores each set’s ID and photo URLs in a JSON file.
  2. flickr-dl.py: opens that JSON file, parses it, and downloads each image individually.

#Install Modules

Before writing our scripts, we need to install the following Python modules: requests, beautifulsoup4 (plus its lxml parser), and flickrapi.

You can install these libraries through your package manager (e.g. apt) like so:

bash

$ sudo apt install python3-requests python3-bs4 python3-lxml python3-flickrapi

Alternatively, you can use pip/pip3:

bash

$ pip install requests beautifulsoup4 lxml flickrapi
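
To verify everything installed correctly, you can try importing the modules our scripts will use:

bash

$ python3 -c "import requests, bs4, flickrapi, lxml"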

Now, let’s walk through an example and write some code.

#Part 1: Scraping Image URLs

In the Flickr API Documentation, you’ll notice there are two available API calls under “collections”:

  1. flickr.collections.getInfo

Returns information for a single collection. Currently can only be called by the collection owner; this may change.

  2. flickr.collections.getTree

Returns a tree (or sub tree) of collections belonging to a given user.

Since you’re likely not the owner of the collection you want to download, we need to use getTree. Unfortunately, this method is not available via the Python module flickrapi, so we have to query the Flickr API in a more manual way, using the REST endpoint at https://www.flickr.com/services/rest.

#Set up Libraries and Keys

Using the requests library, we can query the REST API and get a response in XML, which we can parse using BeautifulSoup. Let’s write some basic code to set this up:

python

import requests
from bs4 import BeautifulSoup
import lxml
import flickrapi
import json

# set up API keys
api_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
api_secret = "xxxxxxxxxxxxxx"

# create the flickrapi client (used later by walk_set())
flickr = flickrapi.FlickrAPI(api_key, api_secret)

#Determine IDs

Next, we need to know the collection ID and user ID that owns it. You can figure this out from the collection URL, which has the following structure:

https://www.flickr.com/photos/{username}/collections/{collection_id}/

Note that the URL contains the username, not the user_id. Luckily, there’s a flickr.people.findByUsername API call that returns the user_id:

flickr-get-user_id.py

python

import requests
from bs4 import BeautifulSoup
import lxml
import flickrapi

# set up API keys
api_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
api_secret = "xxxxxxxxxxxxxx"

username = "exampleUser123"

# set url to query
base_url = "https://www.flickr.com/services/rest"
url = base_url + f"/?method=flickr.people.findByUsername&api_key={api_key}&username={username}"

# use requests library to get URL and parse using BeautifulSoup
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")

# print user_id
user_id = soup.find('user')['nsid']
print(user_id)
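
For reference, the XML response from findByUsername looks roughly like this (placeholder IDs shown); the nsid attribute on the user tag is what the script extracts:

xml

<rsp stat="ok">
  <user id="xxxxxxxxxxxx" nsid="xxxxxxxxxxxx">
    <username>exampleUser123</username>
  </user>
</rsp>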

Run this script and you should see a user_id in return, which you can copy for later use.

bash

$ python3 ./flickr-get-user_id.py
xxxxxxxxxxxx

At this point, you should know the collection_id and user_id. Now, we can start writing flickr-scrape-urls.py:

flickr-scrape-urls.py

python

import requests
from bs4 import BeautifulSoup
import lxml
import flickrapi
import json

# set up API keys
api_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
api_secret = "xxxxxxxxxxxxxx"

# create the flickrapi client (used later by walk_set())
flickr = flickrapi.FlickrAPI(api_key, api_secret)

# set up IDs (keep these as strings)
collection_id = "xxxxxxxxxxxxxxxxx"
user_id = "xxxxxxxxxxxx"

# set url to query
base_url = "https://www.flickr.com/services/rest"
url = base_url + f"/?method=flickr.collections.getTree&api_key={api_key}&collection_id={collection_id}&user_id={user_id}"

# use requests library to get URL and parse using BeautifulSoup
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")

At this point, our soup contains a tree of collections. We only want one of those collections; in this example, let’s say it’s called “Mountain Pictures”. We can use the BeautifulSoup methods .find() and .find_all() to filter the XML down to exactly what we want.
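
For context, the getTree response looks something like this (hypothetical titles, attributes trimmed), which is why we can match the collection tag by its title attribute and then collect its set children:

xml

<rsp stat="ok">
  <collections>
    <collection id="xxxxxxxx-xxxxxxxxxxxxxxxxx" title="Mountain Pictures" description="">
      <set id="xxxxxxxxxxxxxxxxx" title="Alps" description="" />
      <set id="xxxxxxxxxxxxxxxxx" title="Rockies" description="" />
    </collection>
  </collections>
</rsp>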

flickr-scrape-urls.py

python

# grab our specific collection
mountain_collection = soup.find('collection', title="Mountain Pictures").find_all('set')

Next, we can loop through each set in the collection and use the walk_set() method of our flickrapi client (the flickr object we created earlier) to grab each image’s URL. We have to pass extras='url_o' so that each image’s metadata will contain its original URL.

We also collect all the set IDs and image URLs into a list called data, so we can store it into JSON:

flickr-scrape-urls.py

python

# grab pictures
data = []
for photo_set in mountain_collection:  # avoid shadowing the built-in set()

    # grab set ID
    print("Set ID: " + photo_set['id'])
    d = dict()
    d['id'] = photo_set['id']

    # grab photo URLs (url_o is absent when the owner disallows original downloads)
    urls = []
    for photo in flickr.walk_set(photo_set['id'], extras='url_o'):
        url = photo.get('url_o')
        if url:
            urls.append(url)
    d['urls'] = urls

    # add this set's data to our list
    data.append(d)

Note: Flickr stores images in multiple resolutions. In this example, we download the image’s “original” resolution, represented by url_o.
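
If some photos in your collection don’t expose an original, one option (just a sketch, not part of the final script below) is to request a fallback size as well; url_l is Flickr’s “large” size key:

python

# ask for both the original and large sizes, then prefer the original
for photo in flickr.walk_set(photo_set['id'], extras='url_o,url_l'):
    url = photo.get('url_o') or photo.get('url_l')
    if url:
        urls.append(url)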

Finally, we store all this data into a JSON file, which we can later parse in another script.

flickr-scrape-urls.py

python

with open('flickr-data.json', 'w') as f:
    json.dump(data, f)
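
The resulting flickr-data.json is just a list of objects, one per set, along these lines (placeholder values):

json

[
  {
    "id": "xxxxxxxxxxxxxxxxx",
    "urls": [
      "https://live.staticflickr.com/xxxx/xxxxxxxxxx_xxxxxxxxxx_o.jpg"
    ]
  }
]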

Putting all this together, we have flickr-scrape-urls.py:

flickr-scrape-urls.py

python

import requests
from bs4 import BeautifulSoup
import lxml
import flickrapi
import json

# set up API keys
api_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
api_secret = "xxxxxxxxxxxxxx"

# create the flickrapi client (used later by walk_set())
flickr = flickrapi.FlickrAPI(api_key, api_secret)

# set up IDs (keep these as strings)
collection_id = "xxxxxxxxxxxxxxxxx"
user_id = "xxxxxxxxxxxx"

# set url to query
base_url = "https://www.flickr.com/services/rest"
url = base_url + f"/?method=flickr.collections.getTree&api_key={api_key}&collection_id={collection_id}&user_id={user_id}"

# use requests library to get URL and parse using BeautifulSoup
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")

# grab our specific collection
mountain_collection = soup.find('collection', title="Mountain Pictures").find_all('set')

# loop through sets inside collection
data = []
for photo_set in mountain_collection:  # avoid shadowing the built-in set()

    # grab set ID
    print("Set ID: " + photo_set['id'])
    d = dict()
    d['id'] = photo_set['id']

    # grab photo URLs (url_o is absent when the owner disallows original downloads)
    urls = []
    for photo in flickr.walk_set(photo_set['id'], extras='url_o'):
        url = photo.get('url_o')
        if url:
            urls.append(url)
    d['urls'] = urls

    # add this set's data to our list
    data.append(d)

with open('flickr-data.json', 'w') as f:
    json.dump(data, f)
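
Run the script and it should print each set’s ID as it walks the collection, leaving flickr-data.json in the current directory (IDs below are placeholders):

bash

$ python3 ./flickr-scrape-urls.py
Set ID: xxxxxxxxxxxxxxxxx
Set ID: xxxxxxxxxxxxxxxxx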

#Part 2: Downloading Images from URLs

Phew, okay. We now have a JSON file called flickr-data.json that contains lists of image URLs to download. Next, we need to write a script called flickr-dl.py that opens that JSON file, parses it, and downloads the images.

First, let’s load the JSON, and create a directory called output to store the images:

flickr-dl.py

python

import requests
import json
import os

# load json
with open('flickr-data.json', 'r') as f:
    data = json.load(f)

# create image directory
base_dir = os.getcwd()
out_dir = os.path.join(base_dir, "output")
if not os.path.exists(out_dir):
    os.mkdir(out_dir)
print("Putting downloaded files into: " + out_dir)

Next, we loop through data and grab each image’s URL, making sure we haven’t already downloaded it. Notice that the outer loop only covers the first count sets, which I’ve set to 10. You’ll want to change this number depending on how many images you want; if you want every photo in the entire collection, just use count = len(data).

flickr-dl.py

python

count = 10

# loop through sets (only the first `count`)
for i in range(min(count, len(data))):
    print("Downloading set index: " + str(i))
    item = data[i]

    # loop through urls (use j so we don't shadow the outer loop's i)
    urlCount = len(item['urls'])
    for j in range(urlCount):
        url = item['urls'][j]

        print("(" + str(j+1) + "/" + str(urlCount) + ") Trying to download: " + url)

        # ensure file doesn't already exist
        out_file = os.path.join(out_dir, os.path.basename(url))
        if os.path.isfile(out_file):
            print("Skipping, file already downloaded\n")
            continue

All that’s left is to use requests.get() to retrieve the image. We wrap the call in a try/except block (catching requests’ RequestException) in case the download fails, and make sure we get an HTTP status code of 200 before writing the image to disk.

flickr-dl.py

python

        try:
            r = requests.get(url, timeout=30)
        except requests.exceptions.RequestException:
            print("Request failed, skipping")
            continue

        if r.status_code == 200:
            # write image to disk
            with open(out_file, 'wb') as f:
                print(f"Writing image to: {out_file}")
                f.write(r.content)
        else:
            print("Error downloading, HTTP response code was: " + str(r.status_code))
            continue
        print("")

Putting it all together, here’s the whole script, flickr-dl.py:

flickr-dl.py

python

import requests
import json
import os

# load json
with open('flickr-data.json', 'r') as f:
    data = json.load(f)

# create image directory
base_dir = os.getcwd()
out_dir = os.path.join(base_dir, "output")
if not os.path.exists(out_dir):
    os.mkdir(out_dir)
print("Putting downloaded files into: " + out_dir)

count = 10

# loop through sets (only the first `count`)
for i in range(min(count, len(data))):
    print("Downloading set index: " + str(i))
    item = data[i]

    # loop through urls (use j so we don't shadow the outer loop's i)
    urlCount = len(item['urls'])
    for j in range(urlCount):
        url = item['urls'][j]

        print("(" + str(j+1) + "/" + str(urlCount) + ") Trying to download: " + url)

        # ensure file doesn't already exist
        out_file = os.path.join(out_dir, os.path.basename(url))
        if os.path.isfile(out_file):
            print("Skipping, file already downloaded\n")
            continue

        try:
            r = requests.get(url, timeout=30)
        except requests.exceptions.RequestException:
            print("Request failed, skipping")
            continue

        if r.status_code == 200:
            # write image to disk
            with open(out_file, 'wb') as f:
                print(f"Writing image to: {out_file}")
                f.write(r.content)
        else:
            print("Error downloading, HTTP response code was: " + str(r.status_code))
            continue
        print("")

When you run the script, you should see something like this:

bash

$ python3 ./flickr-dl.py
Putting downloaded files into: /path/to/directory/output
Downloading set index: 0
(1/19) Trying to download: https://flickr.com/12345/image1.jpg
Writing image to: /path/to/directory/output/image1.jpg

(2/19) Trying to download: https://flickr.com/12345/image2.jpg
Writing image to: /path/to/directory/output/image2.jpg

...

#Conclusion

That’s it! With two short scripts, you learned how to use the Flickr API to find user IDs, walk collections and sets, and download images.
