Some of the machine learning examples provided here employ images for training and inference. Acquiring enough images to train a deep neural network can be challenging. One approach to obtaining a large quantity of images is to scrape them from social media platforms with a script; many platforms provide APIs for this purpose. The availability of an API does not imply that any image found online may be used for any purpose: many images are subject to privacy and copyright restrictions that must be respected. With that in mind, using such an API can greatly simplify the creation of a dataset for machine learning.
flickr
One social media platform that hosts a large number of user-supplied images and that provides a Python API is flickr.
One way to install the flickrapi Python package, together with the other dependencies of the flickr_scraper example repository, is as follows:
git clone https://github.com/ultralytics/flickr_scraper
cd flickr_scraper
pip install -qr requirements.txt
In order to use the API, a flickr API key has to be requested. Information on how to obtain a flickr API key is available here. A thorough documentation of the API is available here.
Before images can be accessed through the flickr API, the API has to be instantiated in Python using the FlickrAPI function. This function expects two arguments, the api_key and the secret that were received when requesting the API key.
import flickrapi
api_key = '---------'
secret = '---------'
flickr = flickrapi.FlickrAPI(api_key, secret)
The flickr instance that is returned provides a function named walk for iterating over images on the flickr platform. This function accepts several arguments. The most important ones are:
- tag_mode: how multiple tags are combined ("any" matches photos carrying any of the tags, "all" requires all tags to match)
- text: terms to use for a full text search
- media: the type of media to be searched for
- sort: sorting criteria for matching media
- extras: additional information to return for each photo, e.g. the image URL at a given size
Several of the arguments can be assigned standard values: tag_mode="all", media="photos", sort="relevance". The only argument that requires some experimentation is text, the full-text search terms.
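As a minimal sketch, the standard values above can be collected in a small helper so that only the search terms (and optionally the extras keyword) vary between runs. The function name build_walk_args is an illustration and not part of the flickr API:

```python
def build_walk_args(search_terms, extras="url_q"):
    """Assemble keyword arguments for flickr.walk using the
    standard values suggested above."""
    return {
        "tag_mode": "all",     # require all tags to match
        "media": "photos",     # search for photos only
        "sort": "relevance",   # most relevant results first
        "text": search_terms,  # full-text search terms
        "extras": extras,      # request the image URL at this size
    }

args = build_walk_args("dancer, dance, contemporary, solo")
# the dictionary can then be unpacked into the call: flickr.walk(**args)
```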
The extras argument deserves some explanation. The following keywords are available as values for this argument:
- “url_sq” : small square 75×75
- “url_q” : large square 150×150
- “url_t” : thumbnail, 100 on longest side
- “url_s” : small, 240 on longest side
- “url_n” : small, 320 on longest side
- “url_m” : medium, 500 on longest side
- “url_z” : medium 640, 640 on longest side
- “url_c” : medium 800, 800 on longest side
- “url_l” : large, 1024 on longest side
- “url_o” : original image, either a jpg, gif or png, depending on source format
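Since each non-square keyword corresponds to a longest-side pixel size, a small helper can pick the smallest keyword that still meets a target resolution. This is a hypothetical convenience function, not part of the flickr API; the mapping below covers only the non-square sizes listed above:

```python
# longest-side pixel sizes for the non-square extras keywords listed above
EXTRAS_SIZES = {
    "url_t": 100, "url_s": 240, "url_n": 320,
    "url_m": 500, "url_z": 640, "url_c": 800, "url_l": 1024,
}

def extras_for_min_side(min_side):
    """Return the smallest extras keyword whose longest side is at
    least min_side pixels, falling back to the original ('url_o')."""
    for key, size in sorted(EXTRAS_SIZES.items(), key=lambda kv: kv[1]):
        if size >= min_side:
            return key
    return "url_o"

extras_for_min_side(300)   # -> "url_n" (320 on longest side)
```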
The following code employs the walk function to iterate over flickr images and download them:
# continuation from previous code block
import os
import urllib.request

extras = "url_q"
search_terms = 'dancer, dance, contemporary, solo'
image_save_directory = 'images/dancer/img_'
image_save_count = 5000

# make sure the target directory exists
os.makedirs(os.path.dirname(image_save_directory), exist_ok=True)

count = 0
for photo in flickr.walk(tag_mode='all',
                         text=search_terms,
                         media='photos',
                         sort='relevance',
                         extras=extras):
    try:
        # URL of the image at the size requested via extras
        photo_url = photo.get(extras)
        filename = image_save_directory + '{:0>10}'.format(count) + '.jpg'
        urllib.request.urlretrieve(photo_url, filename)
        count += 1
        print("get photo index {} name {}".format(count, filename))
    except Exception:
        # skip photos without the requested size or with download errors
        pass
    if count >= image_save_count:
        break
Some example images downloaded from flickr are available here.