This code performs the following operations:

### Purpose of the Code

The script uses the Selenium, Requests, and Pillow (PIL) libraries to extract, validate, and save images (both `<img>` tags and CSS background images) from a webpage. Its main functionality is web scraping: it downloads images and stores them locally while enforcing minimum image dimensions and avoiding duplicates.
### Workflow and Main Functions

1. `download_and_save_image(image_url, base_url, download_dir, min_width=50, min_height=50)`
   - Purpose: Downloads and saves a single image.
   - Steps:
     - Resolves relative image URLs against the `base_url`.
     - Validates the URL.
     - Downloads the image using the `requests` library.
     - Checks that the image's width and height meet `min_width` and `min_height`.
     - Avoids duplicate downloads by hashing the image's content.
     - Saves the image in the specified directory (`download_dir`) with a filename derived from the image's hash and extension.
   - Returns: The saved filename, or `None` if the image is too small, invalid, or already downloaded.
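Two of the pure steps above — URL resolution/validation and hash-based file naming — can be sketched with the standard library alone. The helper names below are hypothetical, not taken from the script:

```python
import hashlib
from urllib.parse import urljoin, urlparse


def resolve_image_url(image_url, base_url):
    """Resolve a possibly-relative image URL and validate its scheme."""
    absolute_url = urljoin(base_url, image_url)
    # Skip data: URIs, javascript: links, and other non-HTTP schemes.
    if urlparse(absolute_url).scheme not in ("http", "https"):
        return None
    return absolute_url


def hashed_filename(image_bytes, extension="jpg"):
    """Name the file after the MD5 of its content so duplicates collide."""
    return f"{hashlib.md5(image_bytes).hexdigest()}.{extension}"
```

Naming files by content hash means a repeated image maps to the same filename, so a simple existence check is enough to skip duplicates.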
2. `detect_and_save_images(url, output_dir="images", min_width=50, min_height=50, delay=2)`
   - Purpose: Detects and saves all images from a webpage.
   - Steps:
     - Creates the output directory (`output_dir`) if it doesn't exist.
     - Sets up a headless Chrome WebDriver using `selenium` and `webdriver_manager`.
     - Opens the given webpage (`url`).
     - Waits for the page to load (`time.sleep(delay)`).
     - Finds `<img>` tags and extracts their `src` attributes.
     - Detects elements with CSS background images and extracts their URLs.
     - Calls `download_and_save_image` for each detected image URL.
   - Returns: A list of filenames of successfully downloaded images.
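The background-image step typically reads each element's computed style (e.g. via Selenium's `value_of_css_property("background-image")`) and pulls the URL out of the `url(...)` wrapper. A minimal sketch of that parsing, with a hypothetical helper name:

```python
import re


def extract_background_image_url(style_value):
    """Pull the URL out of a CSS background-image value like url("...")."""
    # The URL may be wrapped in double quotes, single quotes, or nothing.
    match = re.search(r'url\((["\']?)(.*?)\1\)', style_value)
    return match.group(2) if match else None
```

Elements without a background image report the value `none`, which this helper maps to `None` so they can be skipped.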
### Example Usage

The script is runnable as a standalone program via the `if __name__ == "__main__":` block. In the example:
- The script scrapes the URL `https://vgen.co/kcpasin/portfolio`.
- Images are saved in the `portfolio_images` directory, with a minimum width/height of 100px and a 5-second page-load delay.
- The total count of downloaded images and their filenames are logged.
### Key Features

- **Image Validation and Deduplication:**
  - Ensures downloaded images meet the minimum dimensions (width and height).
  - Prevents duplicate downloads by using MD5 hashes of the image content.
- **Supports Both `<img>` and CSS Background Images:**
  - Searches for `<img>` tags.
  - Inspects all elements for CSS `background-image` properties.
- **Dynamic Webpage Support:**
  - Uses Selenium WebDriver to handle dynamically loaded content.
  - Allows a custom delay (`delay` parameter) to wait for the page to fully load.
- **Error Handling:**
  - Handles network errors, invalid URLs, and unsupported image formats with exception handling.
  - Logs warnings instead of aborting when a problem is encountered.
- **Customizable Parameters:**
  - Directory to store images (`output_dir`).
  - Minimum image size (`min_width` and `min_height`).
  - URL to scrape and page-load delay.
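The warn-and-continue error handling can be sketched as a small wrapper around each download attempt; `safe_download` is a hypothetical name, not from the script:

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("image_scraper")


def safe_download(url, fetch):
    """Run a download step, logging a warning instead of aborting on failure."""
    try:
        return fetch(url)
    except Exception as exc:  # network errors, bad URLs, unsupported formats
        log.warning("Skipping %s: %s", url, exc)
        return None
```

Returning `None` on failure lets the caller filter out failed downloads while the scrape of the remaining images continues.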
### Output

- After execution, all valid images are saved in the specified folder (e.g., `portfolio_images`).
- A list of filenames and the total count of successfully downloaded images are printed to the console.
### Dependencies

- `requests` for downloading images.
- `selenium` for automated webpage interaction.
- `Pillow` (PIL) for loading and validating images.
- `webdriver-manager` for managing the ChromeDriver installation.
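These can be installed together from PyPI (package names shown; `Pillow` is the package that provides the `PIL` module):

```shell
pip install requests selenium pillow webdriver-manager
```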
In summary, this code is a web-scraping utility for downloading images from webpages, handling both inline `<img>` images and CSS background images.