Why would you switch from Selenium to Beautiful Soup halfway through what you're trying to do, and force your program to re-request the same information from the web server? Selenium has access to the entire DOM and the entire JavaScript session already loaded in a running web browser. It has far more power for data mining than Beautiful Soup does.
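As a rough sketch of what that means in practice (the URL is a placeholder, and any browser driver works): once Selenium has loaded a page, the rendered DOM and the live JavaScript session are already there, so there's nothing to re-fetch with requests/Beautiful Soup.

```python
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://example.com")

html = driver.page_source  # the fully rendered HTML, after scripts have run

# Query the live JavaScript session directly:
title = driver.execute_script("return document.title;")
hrefs = driver.execute_script(
    "return Array.from(document.querySelectorAll('a')).map(a => a.href);"
)

print(title, len(hrefs))
driver.quit()
```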
It looks like they're just trying to use selectors, but these directions seem to completely miss that functionality in Selenium's API. Just search the WebDriver documentation for 'find_element_by_':
https://selenium-python.readthedocs.io/api.html
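For example, roughly like this (placeholder URL and selectors; newer Selenium releases spell these as find_element(By.CSS_SELECTOR, ...), but the find_element_by_* names are what that documentation page lists):

```python
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://example.com")

# CSS selectors:
headline = driver.find_element_by_css_selector("h1.article-title")
rows = driver.find_elements_by_css_selector("table.results tr")

# Or XPath, if you prefer it:
next_link = driver.find_element_by_xpath("//a[contains(text(), 'Next')]")

print(headline.text, len(rows), next_link.get_attribute("href"))
driver.quit()
```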
I use Selenium for all my web crawling, exactly because I would rather have one crawler with all the backing support of a modern web browser than corner myself out of something as crucial as a JavaScript parser halfway through implementing a bot whose whole job is to hook into what is basically an end-user interface sitting on top of all that.
The most obvious benefit of Selenium, to me, is that with all of that I can make my interactions with a web server look more like a user's and fly under the radar a little more. Treating websites as a whole package tends to mean less work on my part (though more RAM, yes!).
One reason to use Beautiful Soup is that Selenium is slow: you have to load the whole web page, including images, CSS, and so on. With requests/Beautiful Soup you can just parse the collected URLs very quickly.
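Something like this, as a sketch of the lightweight route (the URL is a placeholder; the point is just that no browser is involved, so no images, CSS, or JavaScript ever get fetched):

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

resp = requests.get("https://example.com/listing", timeout=10)
soup = BeautifulSoup(resp.text, "html.parser")

# Collect the links off the page, then parse each one the same cheap way.
urls = [urljoin(resp.url, a["href"]) for a in soup.select("a[href]")]
for url in urls[:10]:
    page = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    print(page.title.string if page.title else url)
```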
Selenium sets up the browser profile for you, so you can disable images, videos, CSS, JavaScript, embeds, all to your heart's content.
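With a Firefox profile that looks roughly like this; the preference names are Firefox-specific and have shifted between browser releases, so treat them as illustrative rather than definitive:

```python
from selenium import webdriver

profile = webdriver.FirefoxProfile()
profile.set_preference("permissions.default.image", 2)       # block images
profile.set_preference("permissions.default.stylesheet", 2)  # block CSS
profile.set_preference("javascript.enabled", False)          # no JavaScript
profile.set_preference("media.autoplay.enabled", False)      # no autoplaying media

driver = webdriver.Firefox(firefox_profile=profile)
driver.get("https://example.com")
```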
I've recently started using Selenium with the Privoxy proxy, exactly because browser headless modes are still fairly new tech. They don't all necessarily support the standard profile features (add-ons, settings, etc.) or behave the same way. It's really neat seeing where they're going, but they sometimes need a bit of help MITM-ing traffic, and that's where a good filtering proxy comes in handy.
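Wiring that up is just a proxy flag, roughly like the sketch below. It assumes Privoxy is already running on its default 127.0.0.1:8118, and depending on your Selenium version the keyword argument may be chrome_options= rather than options=:

```python
from selenium import webdriver

PROXY = "127.0.0.1:8118"  # Privoxy's default listen address

options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://" + PROXY)

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")  # traffic now flows through the filter
```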
In the user-facing web world, 'slow' is a relative term. Even on a bare-bones system, you're nearly always going faster than most servers will put out. I just take my chances bringing in the bigger tools, because the personal cost of keeping an under-equipped tool up to date as your target site evolves is usually a bigger time-waster than the personal cost of waiting for variably optimized background work to do its job.
Ryan Mitchell gave an excellent talk at DEF CON 23 about defeating bot checks and other common barriers that web scrapers face. It's well worth a watch for anyone interested in scraping: https://youtu.be/PADKIdSPOsc
Shameless plug: her O'Reilly book "Web Scraping with Python" and the associated GitHub repo are an excellent read.
Coming from a Ruby background, I've always been curious about Python's libraries for scraping. I've tried Scrapy and Beautiful Soup, but somehow kept going back to Nokogiri and Mechanize.
I found the CSS-selector or XPath-based syntax and the DSL a lot more convenient and less verbose to deal with.
Is Selenium still the best bet for parsing JS-powered pages these days? I was under the impression that headless Chrome was more memory- and performance-efficient.
I do a lot of scraping work, but my methods haven't really evolved in the past 3-4 years, so I'm always on the lookout for something more elegant / quicker.
Look into chromedriver. It's maintained by the Chromium team and provides an executable that lets WebDriver control Chrome. I've used it successfully in the past running Chrome in headless mode, and it seems pretty scalable. If I had to build the same product again, I'd still use chromedriver, though I'd also consider Sikuli for image recognition / automation outside of the browser. Check out the chromedriver project here: http://chromedriver.chromium.org/getting-started
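For anyone curious, a minimal headless setup looks roughly like this. It assumes chromedriver is on your PATH, and older Selenium versions take chrome_options= instead of options=:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--disable-gpu")            # historically needed on some platforms
options.add_argument("--window-size=1280,1024")

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
print(driver.title)
driver.quit()
```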