April 29, 2020

Web Scraping with Beautiful Soup

Object Types

Accessing a tag name as an attribute of the BeautifulSoup object returns the first tag of that type:
    print(soup.div)
You can get the name of the tag using .name and a dictionary representing the attributes of the tag using .attrs:
    print(soup.div.name)
    print(soup.div.attrs)
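To make the three calls above concrete, here is a minimal sketch run against a small made-up page. The HTML string and the names html and first_div are assumptions for illustration only; the parser is Python's built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used only for illustration.
html = """
<html><body>
<div id="intro" class="banner main">Welcome</div>
<div id="second">Later</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# soup.div is the first <div> in document order, not every <div>.
first_div = soup.div
print(first_div)

# .name is the tag name as a string.
print(first_div.name)

# .attrs is a dict of the tag's attributes; "class" can hold several
# values, so Beautiful Soup stores it as a list.
print(first_div.attrs)
```

Note that the second div is never reached this way; soup.div always stops at the first match.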
NavigableStrings are the pieces of text inside the HTML tags on the page. We can get the string inside a tag by accessing its .string attribute:
    print(soup.div.string)
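One detail worth knowing is that .string only returns text when the tag has exactly one child; otherwise it is None. The sketch below, using made-up HTML, shows both cases and uses get_text() as a fallback for tags with mixed content.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one tag with a single string, one with mixed content.
html = "<div>Hello</div><p>One <b>two</b></p>"
soup = BeautifulSoup(html, "html.parser")

# <div> has a single text child, so .string returns that NavigableString.
div_text = soup.div.string
print(div_text)
print(type(div_text).__name__)

# <p> has two children (a string and a <b> tag), so .string is None.
p_string = soup.p.string
print(p_string)

# get_text() joins all the text found inside the tag instead.
p_text = soup.p.get_text()
print(p_text)
```

The same rule explains why calling .string on the whole soup object often prints None: a full page almost always has more than one child.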
We can get the children of a tag by accessing the .children attribute:
    for child in soup.ul.children:
        print(child)
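When the HTML is written across several lines, .children also yields the whitespace between tags as NavigableStrings. A small sketch on a made-up list (the variable names are assumptions) shows one way to keep only the tags:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical list markup with newlines between the items.
html = """<ul>
<li>apples</li>
<li>pears</li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

# .children is an iterator over direct children only, including the
# newline strings that sit between the <li> tags.
children = list(soup.ul.children)
print(len(children))

# Keep only real tags by filtering on the Tag type.
items = [child for child in soup.ul.children if isinstance(child, Tag)]
texts = [li.string for li in items]
print(texts)
```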
We can also navigate up the tree from a tag by accessing the .parents attribute:
    for parent in soup.li.parents:
        print(parent)
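Printing each parent in full gets noisy, because every ancestor contains the ones before it. A shorter sketch, on a made-up page, prints only the tag names to show the order in which .parents walks up the tree:

```python
from bs4 import BeautifulSoup

# Hypothetical page with a single list item.
html = "<html><body><ul><li>item</li></ul></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .parents starts at the direct parent and ends at the BeautifulSoup
# object itself, whose name is "[document]".
names = [parent.name for parent in soup.li.parents]
print(names)

# .parent (singular) gives just the direct parent.
direct_parent = soup.li.parent
print(direct_parent.name)
```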
If we want to find all the occurrences of a tag, instead of just the first one, we can use .find_all(), which returns a list of every match:
    print(soup.find_all("h1"))
With .find_all(), we can pass regular expressions, lists of tag names, attribute filters, or even functions to select HTML elements more precisely.
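As a rough sketch of those options, the example below runs .find_all() three ways on a made-up page: with a compiled regex matched against tag names, with a list of tag names, and with a function that receives each tag and returns True or False. The helper has_id_no_class and the sample HTML are assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample page.
html = """
<h1>Title</h1>
<h2 id="sub">Subtitle</h2>
<p class="note">First</p>
<p>Second</p>
"""
soup = BeautifulSoup(html, "html.parser")

# A compiled regex is matched against tag names: here h1 through h6.
headings = soup.find_all(re.compile(r"^h[1-6]$"))
print([tag.name for tag in headings])

# A list matches any of the given tag names, in document order.
h1_or_p = soup.find_all(["h1", "p"])
print([tag.name for tag in h1_or_p])

# A function is called once per tag and keeps those where it returns True.
def has_id_no_class(tag):
    return tag.has_attr("id") and not tag.has_attr("class")

with_id_only = soup.find_all(has_id_no_class)
print([tag.name for tag in with_id_only])
```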
If we want to find all of the elements with the "banner" class, we could use the command:
    soup.find_all(attrs={"class": "banner"})
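Here is a small sketch of the class lookup on made-up markup. It also uses the class_ keyword, which Beautiful Soup accepts as a shorthand because class is a reserved word in Python. Searching for a single class value matches any element whose class list contains it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two banners (one with an extra class) and a footer.
html = """
<div class="banner">Top</div>
<div class="banner wide">Wide</div>
<div class="footer">Bottom</div>
"""
soup = BeautifulSoup(html, "html.parser")

# The attrs dictionary form.
by_attrs = soup.find_all(attrs={"class": "banner"})

# The class_ keyword form, which should select the same elements.
by_keyword = soup.find_all(class_="banner")

banner_texts = [tag.string for tag in by_attrs]
print(banner_texts)
print(by_attrs == by_keyword)
```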