# Flickr API
Flickr has an API that allows you to download data from their website, including images and their metadata. These images can then be used for multiple purposes, including machine learning.
To use the API, you first need to:
- Create an account on Flickr
- Create an app on the “App Garden”, which gives you an API key and a secret key
As with any web scraping, make sure you scrape responsibly and follow the Flickr API Terms of Use.
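Once you have your keys, you can sanity-check them with the API’s `flickr.test.echo` method, which simply echoes your parameters back. Here’s a minimal sketch using the `requests` library (installed below); the key value is a placeholder for your own:

```python
import requests

api_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # placeholder: your own API key

# flickr.test.echo echoes the parameters back, making it a cheap key check
r = requests.get(
    "https://www.flickr.com/services/rest",
    params={"method": "flickr.test.echo", "api_key": api_key},
)
print(r.text)  # a response with stat="ok" means the key was accepted
```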
## Example Tutorial: Downloading a Collection of Images
On Flickr, photos are stored in “sets”. Sometimes, sets are stored in “collections”. In this example, we will be downloading a single collection.
```
Collection
├─ Set
│  ├─ Photo
│  ├─ Photo
│  └─ Photo
├─ Set
│  ├─ Photo
│  └─ Photo
└─ Set
   ├─ Photo
   ├─ Photo
   ├─ Photo
   └─ Photo
```
Note: Collections can contain thousands of images, so you should limit how many you download.
By the end of this tutorial, you’ll have written two Python scripts:

- `flickr-scrape-urls.py`: parses through a given collection and stores each set’s ID and photo URLs in a JSON file (its rough shape is shown below).
- `flickr-dl.py`: opens and parses that JSON file, downloading each image individually.
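For reference, here’s the rough shape of the JSON file the first script writes and the second script reads (the set IDs and URLs below are made up):

```json
[
  {
    "id": "72157600000000001",
    "urls": [
      "https://live.staticflickr.com/65535/12345_abcde_o.jpg",
      "https://live.staticflickr.com/65535/12346_fghij_o.jpg"
    ]
  },
  {
    "id": "72157600000000002",
    "urls": ["https://live.staticflickr.com/65535/12347_klmno_o.jpg"]
  }
]
```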
## Install Modules
Before writing our scripts, we need to install the following Python modules:

- `beautifulsoup4`: BeautifulSoup is a Python library that pulls data out of HTML/XML.
- `flickrapi`: a Python library that makes calls to the Flickr API easy. It doesn’t include all of the Flickr API’s functionality, but it’s still a useful library.
- `lxml`: the XML parser that BeautifulSoup will use under the hood.
You can install these libraries through your package manager (e.g. `apt`) like so:

```bash
$ sudo apt install python3-bs4 python3-flickrapi python3-lxml
```
Alternatively, you can use `pip`/`pip3`:

```bash
$ pip install beautifulsoup4 flickrapi lxml
```
Now, let’s walk through an example and write some code.
## Part 1: Scraping Image URLs
In the Flickr API Documentation, you’ll notice there are two available API calls under “collections”:

- `flickr.collections.getInfo`: “Returns information for a single collection. Currently can only be called by the collection owner, this may change.”
- `flickr.collections.getTree`: “Returns a tree (or sub tree) of collections belonging to a given user.”

Since you’re likely not the owner of the collection you want to download, we need to use `getTree`, and this method is not available via the Python module `flickrapi`. Thus, we have to query the Flickr API more manually, using the REST endpoint at https://www.flickr.com/services/rest.
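Every REST call has the same shape: a method name plus its arguments as query parameters. The scripts below build the query string by hand with f-strings, but as a sketch, you could equally let `requests` handle the URL encoding (the parameter values here are placeholders):

```python
import requests

base_url = "https://www.flickr.com/services/rest"
params = {
    "method": "flickr.collections.getTree",
    "api_key": "your-api-key",           # placeholder
    "collection_id": "your-collection",  # placeholder
    "user_id": "your-user-id",           # placeholder
}
# requests URL-encodes the params and appends them as a query string
r = requests.get(base_url, params=params)
print(r.text)  # the response body is XML by default
```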
### Set up Libraries and Keys
Using the `requests` library, we can query the REST API and get a response in XML, which we can parse using `BeautifulSoup`. Let’s write some basic code to set this up:
```python
import requests
from bs4 import BeautifulSoup
import flickrapi
import json

# set up API keys
api_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
api_secret = "xxxxxxxxxxxxxx"

# set up the flickrapi client (we'll use it later to walk each set)
flickr = flickrapi.FlickrAPI(api_key, api_secret)
```
### Determine IDs
Next, we need to know the collection ID and the ID of the user that owns it. You can figure this out from the collection URL, which has the following structure:

```
https://www.flickr.com/photos/{username}/collections/{collection_id}/
```
Note that the URL contains the `username`, not the `user_id`. Luckily, there’s a `flickr.people.findByUsername` API call that gives us the `user_id`:
```python
import requests
from bs4 import BeautifulSoup

# set up API keys
api_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
api_secret = "xxxxxxxxxxxxxx"

username = "exampleUser123"

# set url to query
base_url = "https://www.flickr.com/services/rest"
url = base_url + f"/?method=flickr.people.findByUsername&api_key={api_key}&username={username}"

# use requests library to get URL and parse using BeautifulSoup
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")

# print user_id
user_id = soup.find('user')['nsid']
print(user_id)
```
Run this script and you should see a `user_id` in return, which you can copy for later use.

```bash
$ python3 ./flickr-get-user_id.py
xxxxxxxxxxxx
```
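For reference, the XML that `findByUsername` returns looks roughly like this (the IDs below are made up), which is why the script reads the `nsid` attribute off the `<user>` element:

```xml
<rsp stat="ok">
  <user id="12345678@N00" nsid="12345678@N00">
    <username>exampleUser123</username>
  </user>
</rsp>
```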
At this point, you should know the `collection_id` and `user_id`. Now, we can start writing `flickr-scrape-urls.py`:
```python
import requests
from bs4 import BeautifulSoup
import flickrapi
import json

# set up API keys
api_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
api_secret = "xxxxxxxxxxxxxx"
flickr = flickrapi.FlickrAPI(api_key, api_secret)

# set up IDs (both are strings, copied from the previous steps)
collection_id = "xxxxxxxxxxxxxxxxx"
user_id = "xxxxxxxxxxxx"

# set url to query
base_url = "https://www.flickr.com/services/rest"
url = base_url + f"/?method=flickr.collections.getTree&api_key={api_key}&collection_id={collection_id}&user_id={user_id}"

# use requests library to get URL and parse using BeautifulSoup
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")
```
At this point, our `soup` contains a tree of collections. We only want one of those collections; let’s say it’s called “Mountain Pictures”. We can use the `BeautifulSoup` methods `.find()` and `.find_all()` to filter the XML down to exactly what we want.
```python
# grab our specific collection
mountain_collection = soup.find('collection', title="Mountain Pictures").find_all('set')
```
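To see why this works, here’s roughly what a `getTree` response looks like (the titles and IDs are made up): the `title` attribute identifies the collection we want, and its `<set>` children are what `.find_all('set')` collects.

```xml
<rsp stat="ok">
  <collections>
    <collection id="12345678-72157600000000000" title="Mountain Pictures" description="">
      <set id="72157600000000001" title="Alps" description=""/>
      <set id="72157600000000002" title="Rockies" description=""/>
    </collection>
  </collections>
</rsp>
```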
Next, we can loop through each set in the collection and use `flickr.walk_set()` from `flickrapi` to grab each image’s URL. We have to pass `extras='url_o'` so that each image’s metadata contains its original URL. We also collect all the set IDs and image URLs into a list called `data`, so we can store it as JSON:
```python
# grab pictures
data = []
for photoset in mountain_collection:
    # grab set ID
    print("Set ID: " + photoset['id'])
    d = dict()
    d['id'] = photoset['id']
    # grab photo URLs
    urls = []
    for photo in flickr.walk_set(photoset['id'], extras='url_o'):
        urls.append(photo.get('url_o'))
    d['urls'] = urls
    # add this set's data to our list
    data.append(d)
```
Note: Flickr stores images in multiple resolutions. In this example, we download each image’s “original” resolution, represented by `url_o`.
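One caveat: owners can restrict access to originals, in which case a photo’s metadata has no `url_o` and `photo.get('url_o')` returns `None`. If you hit that, a defensive tweak to the inner loop (a sketch, not part of the original script) is to skip those photos:

```python
# inside the set loop: skip photos whose original URL isn't exposed
for photo in flickr.walk_set(photoset['id'], extras='url_o'):
    url = photo.get('url_o')
    if url is None:
        continue  # owner restricted the original resolution
    urls.append(url)
```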
Finally, we store all this data into a JSON file, which we can later parse in another script.
```python
with open('flickr-data.json', 'w') as f:
    json.dump(data, f)
```
Putting all this together, we have `flickr-scrape-urls.py`:
```python
import requests
from bs4 import BeautifulSoup
import flickrapi
import json

# set up API keys
api_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
api_secret = "xxxxxxxxxxxxxx"
flickr = flickrapi.FlickrAPI(api_key, api_secret)

# set up IDs
collection_id = "xxxxxxxxxxxxxxxxx"
user_id = "xxxxxxxxxxxx"

# set url to query
base_url = "https://www.flickr.com/services/rest"
url = base_url + f"/?method=flickr.collections.getTree&api_key={api_key}&collection_id={collection_id}&user_id={user_id}"

# use requests library to get URL and parse using BeautifulSoup
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")

# grab our specific collection
mountain_collection = soup.find('collection', title="Mountain Pictures").find_all('set')

# loop through sets inside collection
data = []
for photoset in mountain_collection:
    # grab set ID
    print("Set ID: " + photoset['id'])
    d = dict()
    d['id'] = photoset['id']
    # grab photo URLs
    urls = []
    for photo in flickr.walk_set(photoset['id'], extras='url_o'):
        urls.append(photo.get('url_o'))
    d['urls'] = urls
    # add this set's data to our list
    data.append(d)

# store everything in a JSON file for the download script
with open('flickr-data.json', 'w') as f:
    json.dump(data, f)
```
## Part 2: Downloading Images from URLs
Phew, okay. We have a JSON file called `flickr-data.json` that contains lists of URLs of images to download. Next, we need to write a script called `flickr-dl.py` that opens that JSON file, parses it, and downloads the images.
First, let’s load the JSON and create a directory called `output` to store the images:
```python
import requests
import json
import os

# load json
with open('flickr-data.json', 'r') as f:
    data = json.load(f)

# create image directory
base_dir = os.getcwd()
out_dir = os.path.join(base_dir, "output")
if not os.path.exists(out_dir):
    os.mkdir(out_dir)
print("Putting downloaded files into: " + out_dir)
```
Next, we loop through `data` and grab each image’s URL, making sure we haven’t already downloaded it. Notice that the loop only covers the first `count` sets, which I’ve set to `10`. You’ll want to change this number depending on how many images you want; if you want every photo in the entire collection, just use `len(data)`.
```python
count = 10

# loop through sets
for i in range(len(data[:count])):
    print("Downloading set index: " + str(i))
    item = data[i]
    # loop through urls
    urlCount = len(item['urls'])
    for j in range(urlCount):
        url = item['urls'][j]
        print("(" + str(j+1) + "/" + str(urlCount) + ") Trying to download: " + url)
        # ensure file doesn't already exist
        out_file = os.path.join(out_dir, os.path.basename(url))
        if os.path.isfile(out_file):
            print("Skipping, file already downloaded\n")
            continue
```
All that’s left is to use `requests.get()` to retrieve the image. We wrap the call in a `try/except` block in case the download fails, and make sure we get an HTTP status code of 200 before writing the image to disk.
```python
        # (continuing inside the inner URL loop)
        try:
            r = requests.get(url)
        except requests.exceptions.RequestException:
            print("Request failed, skipping")
            continue
        if r.status_code == 200:
            # write image to disk
            with open(out_file, 'wb') as f:
                print(f"Writing image to: {out_file}")
                f.write(r.content)
        else:
            print("Error downloading, HTTP response code was: " + str(r.status_code))
            continue
        print("")
```
Putting it all together, here’s the whole script, `flickr-dl.py`:
```python
import requests
import json
import os

# load json
with open('flickr-data.json', 'r') as f:
    data = json.load(f)

# create image directory
base_dir = os.getcwd()
out_dir = os.path.join(base_dir, "output")
if not os.path.exists(out_dir):
    os.mkdir(out_dir)
print("Putting downloaded files into: " + out_dir)

count = 10

# loop through sets
for i in range(len(data[:count])):
    print("Downloading set index: " + str(i))
    item = data[i]
    # loop through urls
    urlCount = len(item['urls'])
    for j in range(urlCount):
        url = item['urls'][j]
        print("(" + str(j+1) + "/" + str(urlCount) + ") Trying to download: " + url)
        # ensure file doesn't already exist
        out_file = os.path.join(out_dir, os.path.basename(url))
        if os.path.isfile(out_file):
            print("Skipping, file already downloaded\n")
            continue
        try:
            r = requests.get(url)
        except requests.exceptions.RequestException:
            print("Request failed, skipping")
            continue
        if r.status_code == 200:
            # write image to disk
            with open(out_file, 'wb') as f:
                print(f"Writing image to: {out_file}")
                f.write(r.content)
        else:
            print("Error downloading, HTTP response code was: " + str(r.status_code))
            continue
        print("")
```
When you run the script, you should see something like this:
```bash
$ python3 ./flickr-dl.py
Putting downloaded files into: /path/to/directory/output
Downloading set index: 0
(1/19) Trying to download: https://flickr.com/12345/image1.jpg
Writing image to: /path/to/directory/output/image1.jpg
(2/19) Trying to download: https://flickr.com/12345/image2.jpg
Writing image to: /path/to/directory/output/image2.jpg
...
```
## Conclusion
That’s it! In two separate scripts, you learned how to use the Flickr API to find user IDs, view collections and sets, and download images.