Web scraping is the process of retrieving, or “scraping”, data from a website automatically rather than by hand. It uses automation to retrieve millions or even billions of data points from websites.

If a site offers no API to download its data, I can use web scraping


To perform web scraping, we need to know how a website is built, so take a look at this awesome intro to HTML here

We need to understand some TAGS and their structure, meaning which tag is the parent and which are its children.

So let's say we have <div> </div>. Where will this parent tag's children be? Just inside it, so ANYTHING within <div> </div> is a child of that DIV tag.

            <div>My message here</div>

Now let´s take a look at this HTML

            <div>My message here</div>
            <span>I am here</span>

Why are they called siblings? Because the <div> and <span> TAGS are at the same level: they are both children of <body>, and siblings to each other

To make a TAG easy to identify, we can give it something called ATTRIBUTES

These attributes can be a class or an ID, and each one is formed by a NAME and a VALUE

 <div class="main container">
            My message here
 </div>

Now we can easily identify this <div> tag because its class attribute is "main container" (strictly speaking, two classes: main and container)

For a deeper insight into HTML class and ID attributes, please check this class tutorial and this ID one.


Take a look at this info for an intro to how URLs work


When we type a site's URL it is usually short and easy to read, but if we run a search in Google we'll get a much longer one

Why all that stuff? Because the URL can be used to pass info, in this example about our search.

Let's analyze it

https -> the protocol, followed by the domain

/search -> the endpoint (it identifies the action the server will perform, this time a SEARCH. We can chain several path segments, like search/users)

Then we have the parameters. They start with “?”, so everything AFTER the ? is a parameter that the server will use to answer our request. Keep in mind that each parameter is a name-value pair separated by the “=” sign

We can have several parameters, separated by the “&” symbol

So in the example, q = web scrapping, sourceid = chrome and ie = UTF8.

All this info is received by the server and used to serve our request
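
To see those pieces in code, here is a minimal sketch that splits a URL with Python's standard urllib.parse module. The URL itself is an assumption: it is rebuilt from the three parameters named above (q, sourceid, ie) and a guessed Google domain, not copied from a real search.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL, rebuilt from the parameters named in the text above.
url = "https://www.google.com/search?q=web+scrapping&sourceid=chrome&ie=UTF8"

parts = urlsplit(url)
params = parse_qs(parts.query)   # everything after the "?"

print(parts.scheme)   # protocol
print(parts.netloc)   # domain
print(parts.path)     # endpoint
for name, values in params.items():
    print(name, "=", values[0])
```

parse_qs returns a list per name because the same parameter may appear more than once in a URL.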


1 – Static scraping (one page): when ALL the info is on a single page and it does not load dynamic content.

Tools used: requests (to “ask” for the data), Beautiful Soup to parse the XML and HTML we get, and Scrapy, which does both jobs (request and parse)

2 – Static scraping (several pages, same domain): also called horizontal scrolling (pagination) and vertical scrolling (product details).

Tools used: Scrapy

3 – Dynamic web scraping: we'll use some automation to fill in data, to scroll and to wait for the page to load its contents before scraping what we need.

Tools used: Selenium

4 – API web scraping:


1 – Define a root or seed URL: the main one from which to START the data extraction. It may not be the one we extract data from, but it is the one from which we'll start “travelling” to find the info

2 – Make a REQUEST to this URL

3 – Get the response to that request (it will be in HTML format)

4 – Parse the response to obtain what I am searching for

5 – Repeat from step 2 with another URL within the same domain (it may be obtained from the HTML response)
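
The five steps above can be sketched as a small loop. This is only an illustration under stated assumptions: LinkParser, crawl, fetch and FAKE_SITE are hypothetical names, the parsing uses the standard library's html.parser rather than the tools introduced later, and a dictionary stands in for the network so the sketch runs offline.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seed_url, fetch, max_pages=10):
    """Visit pages of the seed URL's domain, starting from the seed.

    `fetch` is any function that takes a URL and returns its HTML as text.
    """
    domain = urlsplit(seed_url).netloc
    pending = [seed_url]             # step 1: start from the seed URL
    visited = []
    while pending and len(visited) < max_pages:
        url = pending.pop(0)
        if url in visited:
            continue
        html_text = fetch(url)       # steps 2 and 3: request and response
        visited.append(url)
        parser = LinkParser()        # step 4: parse the response
        parser.feed(html_text)
        for href in parser.links:    # step 5: queue URLs of the same domain
            link = urljoin(url, href)
            if urlsplit(link).netloc == domain and link not in visited:
                pending.append(link)
    return visited


# A fake two-page site stands in for the network in this sketch.
FAKE_SITE = {
    "https://example.com/": '<a href="/page2">next</a> <a href="https://other.org/">out</a>',
    "https://example.com/page2": '<a href="/">home</a>',
}
print(crawl("https://example.com/", FAKE_SITE.get))
```

With a real site, fetch could be a small function that calls requests.get(url).text; the link to other.org is skipped because it is outside the seed's domain.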


To obtain the required info from the HTML response, we'll need XPath

XPath is a language that allows us to build expressions to extract info from XML or HTML data. I can search for and extract exactly what I need from all the gibberish we get back from our requests. We can search within the DOM elements in a number of ways.

Take a look at this awesome tool to learn how to use XPATH


Now, we must understand how XML works. It is made of a structure of LEVELS; these levels are the nodes (HTML tags), and nodes can have sub-levels, or nested levels, called “child nodes”

Take a look at this piece of code: <body> is the ROOT level, and its children are <h1>, <h1>, <div> and <div>, but the 1st <div> tag has a child of its own, <p>, and the 2nd <div> has a child <span>

<body>
    <h1>Main title</h1>
    <h1>Another main title</h1>
    <div class="main container">
        <p>My message here</p>
    </div>
    <div>
        <span>I am here</span>
    </div>
</body>

Now we can define our search axes to start a search; these axes are parameters that filter the tags we are looking for.

If I use // (double slash) it will search within ALL levels of the document

If I use a single slash (/) it will only search at the root of the document

Note that a search for <p> at the root finds nothing, because <p> is a child of a child, not the root. This document has only one tag as root, and it is <body>

Ok, after defining the search prefix (//, / or ./) we must add the node we are searching for; this is called a “step”. I can also define attributes to narrow the search even more.

This is done by adding [@attribute="value"] after the node name. Let's say I want to find the <h1> tag whose id is title
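
A small runnable sketch of these searches follows. It swaps in Python's standard xml.etree.ElementTree, which only supports a subset of XPath (relative paths such as .// and ./, plus attribute filters), so it illustrates the idea rather than the full language. The sample document follows the structure described above; the id="title" attribute is an assumption added so the attribute filter has something to match.

```python
import xml.etree.ElementTree as ET

# Sample document in the shape described above; id="title" is added
# here only so the attribute filter has something to match.
DOC = """
<body>
    <h1 id="title">Main title</h1>
    <h1>Another main title</h1>
    <div class="main container"><p>My message here</p></div>
    <div><span>I am here</span></div>
</body>
""".strip()

root = ET.fromstring(DOC)           # root is the <body> element

all_levels = root.findall(".//p")   # search every level below the root
direct_only = root.findall("./p")   # search only the root's direct children
by_id = root.findall(".//h1[@id='title']")

print(len(all_levels))    # 1
print(len(direct_only))   # 0, because <p> is a child of a child
print(by_id[0].text)      # Main title
```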

Here an awesome intro to Xpath


We can run a “live” XPath query by opening the web browser's dev tools (usually F12 or right-click -> inspect), then going to the “console” tab and running this code

$x("path expression")

Let's see an example by requesting every <div> in the document (//)



Remember that in order to get data from a website we need TWO separate procedures;

1 – REQUEST the page/server

2 – PARSE the data we received

We´ll use some Python libraries to do this.

To extract info from one-static-page we´ll use 4 different libraries:

Requests -> to obtain the HTML

LXML and beautifulsoup4 to parse the received info

Scrapy to perform the two operations; request and parse

To install a library, just open a CMD -> command prompt in Windows, or a terminal in Linux, and run

pip install library-name

or pip3 install library-name

or sudo pip install library-name (Linux)

or pip install library-name --user (Windows)

We´ll also install (for dynamic sites)


Pillow (to extract images)

Pymongo (to store data in DB)

In case you need Twisted to make scrapy work with windows, use this link

Requests full doc

LXML full doc

Pip full doc


Goal: Extract the names that Wikipedia shows in its main page


  • Requests to get the HTML from the server and
  • LXML to parse the tree and to get the desired info

Just to refresh, we need TWO steps -> request the data and parse it to get the exact info.

Bear in mind that a request also carries headers. One of the most useful is “user-agent”, which tells the server which browser and operating system the request comes from. If I DON'T define this user-agent, the default identifies our script rather than a browser (the requests library sends python-requests/<version>), so our attempt may be seen as an attack or as automated web scraping, and it may be blocked.

So we need to overwrite that default “user-agent” value.

To do this, BEFORE making a request I must create an object to hold the new values

new_header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
}

Now we can print the result with the “text” property

So far we have this working code

# Goal: Extract the names that Wikipedia shows in its main page
#
# Requests to get the HTML from the server and
# LXML to parse the tree and to get the desired info

import requests

# change the user-agent to avoid being blocked
new_header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
}

# define my seed URL, in this case the Wikipedia URL
seed_url = ""

# now I make the request
request_result = requests.get(seed_url, headers=new_header)

# now we can print the request_result
# -> [:200] cuts the text to only 200 characters
print(request_result.text[:200])

with this result (cropped to 200 characters)

<!DOCTYPE html>
<html lang="mul" class="no-js">
<meta charset="utf-8">
<meta name="description" content="Wikipedia is a free online encyclopedia, created and edited by 

Now it's time to use LXML to parse this data into something more useful.

Let's import lxml -> from lxml import html

Now we'll create a variable to hold the parsed tree

parser = html.fromstring(request_result.text)

Now this parser gives us several useful methods to search within the HTML tree. BUT to extract data we first need to check the HTML structure to find out where the data lives

Ok, let´s say we want to extract the ENGLISH text from Wikipedia main page

So we right-click on the element we want to check and then we select “inspect”

And we´ll get this info

This “brings” the HTML tree so we can easily find the element we want to parse

As we can see, the text is within an <a> tag with the ID “js-link-box-en”. Remember that an ID is UNIQUE, so we can use it to reach the text within this tag…

Back to our script: now that we have parsed the text, we have many methods to use, so let's try get_element_by_id

parser.get_element_by_id("js-link-box-en") #it receives as parameter the ID of the element we want to show

Now we assign this to a variable

english = parser.get_element_by_id("js-link-box-en") #it receives as parameter the ID of the element we want to show

And now we print it

print (english)

so we have this last piece of code

parser = html.fromstring(request_result.text)

english = parser.get_element_by_id("js-link-box-en") #it receives as parameter the element ID we want to show

print (english)

that brings…

<Element a at 0x31ecae0>

What is that?????? No worries, it is just an Element object, so we need to ask for its content like this

print (english.text_content())

and now it works

6 203 000+ articles

Ok, we did this with lxml's own methods, but we can also use XPath, remember? Let's see how to do it


and…what is that expression that will lead me to the element?

Back on the inspect page we see that we have an <a> element with an ID, and the text is within a child tag (<strong>)

So, the Xpath expression would be


And the piece of code to call the element …

english = parser.xpath("//a[@id='js-link-box-en']/strong/text()")
print (english)

and it works, returning


Ok, now let's focus on our goal: retrieve ALL the languages from the home page, so we need to create an XPath expression to do that

We need to find a pattern, something that wraps all the languages. Remember that an ID is unique, but a CLASS is for groups, meaning maybe we can find a class that contains our languages.

As we can see, it all happens within <div> tags, and every language has a CLASS (class="central-featured-lang") followed by lang1, lang2 … langN. So when writing our XPath expression we must use “contains”

And within each of those <div> tags there are also <a> and <strong> tags

languages = parser.xpath("//div[contains(@class, 'central-featured-lang')]//strong/text()")
print (languages)

and the result

['English', 'Español', 'æ\x97¥æ\x9c¬èª\x9e', 'Deutsch', 'Ð\xa0Ñ\x83Ñ\x81Ñ\x81кий', 'Français', 'Italiano', 'ä¸\xadæ\x96\x87', 'Português', 'Polski']

It works! And we receive the result as a List, but we can easily iterate it

for language in languages:
    print (language)
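
Some entries in the list printed above, such as 'æ\x97¥æ\x9c¬èª\x9e', look like UTF-8 text that was decoded with the wrong character set. The sketch below assumes that this is the cause (a guess based on the output, not something verified against the live page) and shows how such a string can be repaired after the fact.

```python
# One garbled entry copied from the printed list above.
garbled = 'æ\x97¥æ\x9c¬èª\x9e'

# Assumption: the page was UTF-8 but was decoded as Latin-1.
# Re-encoding as Latin-1 recovers the original bytes, which we then
# decode with the right character set.
repaired = garbled.encode('latin-1').decode('utf-8')
print(repaired)   # 日本語
```

If that assumption holds, it is probably cleaner to avoid the problem at the source, for example by setting request_result.encoding = "utf-8" before reading request_result.text.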



Well, now let's try it with another lxml method -> find_class

languages = parser.find_class('central-featured-lang')
for language in languages:
    print (language.text_content())


6 203 000+ articles

1 645 000+ artículos

1 242 000+ 記事

2 508 000+ Artikel

1 681 000+ статей

2 275 000+ articles

1 656 000+ voci

1 161 000+ 條目

1 048 000+ artigos

1 442 000+ haseł


Remember that when working with CLASSES there is one detail to watch out for

class="central-featured-lang lang1"

The space within a class value indicates that there is ANOTHER class, so in the example we have TWO classes, central-featured-lang and lang1 (this allows finer styling)

That's why for XPath the class is the WHOLE attribute string, so we need contains(), while lxml's find_class matches a single class name, so we pass only the first class
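
The difference between comparing the whole attribute and matching one class name can be shown with plain strings. In this sketch has_class is a hypothetical helper written for illustration; it is not part of lxml or XPath.

```python
def has_class(class_attribute, wanted):
    """True when `wanted` is one of the space-separated class names."""
    return wanted in class_attribute.split()


attribute = "central-featured-lang lang1"

print(attribute == "central-featured-lang")           # False: exact string comparison
print("central-featured-lang" in attribute)           # True: substring test, like contains()
print(has_class(attribute, "central-featured-lang"))  # True: whole-name test
print(has_class(attribute, "central-featured"))       # False: a partial name is not a class
```

Note that a substring test also accepts partial names such as central-featured, which is worth keeping in mind when two classes on a page share a prefix.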


As the title suggests, we no longer use lxml but another nice tool: Beautiful Soup. It works in a similar way, because it lets us apply functions to search by ID, class, and so on.

Goal: extract Title and description of published questions within stack overflow site main page

Tools to use: Beautiful Soup

  1. First, we import the requests library as usual
  2. Then the header stuff to avoid being banned
  3. Then we add the seed URL
  4. Now we make the request (get) to the URL to get the full tree
  5. We can print the result (it will be 200 if successful)

Nothing new so far, we have this code

# Goal: extract Title and description of published
# questions within stack overflow site main page
# Tools to use: Beautiful Soup
import requests

#change header to avoid being detected as a bot
new_header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
}

#define seed URL
url = ""

#now we make the request (get) to the url to get the full tree
result = requests.get(url, headers=new_header)

#show if the site is reachable
print (result)


<Response [200]>

Ok, time to import beautifulsoup (install it if not yet installed)

pip install beautifulsoup4 --user

from bs4 import BeautifulSoup

Remember that BeautifulSoup is another parser with a set of tools for retrieving/filtering info

So BeautifulSoup receives the result as a parameter (the text, the HTML tree), plus the name of the parser to use, and we assign it to a variable as usual

soup_info = BeautifulSoup(result.text, "html.parser")

Now inspect the page to look for some clues on what to scrape

We find a main <div> with an ID = “questions”

And within it, many <div> with a class = “question-summary”

So, the path is clear enough

We can use FIND to retrieve maybe the ID?

Let´s check

main_questions_container = soup_info.find(id="questions")

Now, with this main element I can get its children, so I am not going to search within soup_info, where we have ALL the tree info, BUT within main_questions_container, where I only have the child elements (question-summary)


Find retrieves only ONE result

Find_all brings everything, so now we'll use the latter


Note: class is a reserved word in Python, so Beautiful Soup uses class_

I can also double check that I am searching within a tag (in this case <div>) by adding it at the very beginning like this

main_questions_container.find_all('div', class_="question-summary")

now we just assign it to a variable

questions_list = main_questions_container.find_all('div', class_="question-summary")

It is a LIST, so I can iterate over it. But what should we extract on each iteration? Maybe the title, which is within an <h3> tag

for questions in questions_list:
    question_text = questions.find('h3').text
    print (question_text)

Let´s see the result

How to create a serializer for decimal in flutter
reference to submit is ambiguous: <T>submit(Callable<T>) in ExecutorService and method submit(Runnable) in ExecutorService match
How to grab a case in switch if its in another class using random
Android paging2 library : Network(PageKeyedDataSource) + Database idiomatic/expected way to implement
Why is my CAST not casting to float with 2 decimal places?
Question regarding slice assignment, deep copy and shallow copy in Python
How to access objects in S3 bucket, without making the object's folder public
Google Maps API - Const must be initialized
CS50_ Filter more: Blur
click on map doesn't get triggered when the click is on a polygon
How to generate UK postcode using Faker or by own function in Python?
Resize image to specific filesize python
C#: " 'The given path's format is not supported.'
why my code is in continuous loop for second part even if i have given correct user input what is the difference between (not in) and (!=)
Kompići C++ COCI 2011/2012 2nd round

And now to get the description we inspect and see that there is a class=excerpt

question_description = questions.find(class_='excerpt').text


Power BI - using Groups vs creating a new column using switch()

            If I have a continuous numeric field and want to group it in Power BI, I can

Create a new column using SWITCH() to perform the grouping

numeric_value_grouped = SWITCH(TRUE(),
Is there a way to run simple HTML code in Visual Studio?

            Remember this is Visual Studio, the IDE, not Visual Studio Code, the text editor. Anyways, if I wanted to run some simple HTML code like <p>Hello world!</p>, how would I do it? This code ... 
How do I sort an ArrayList of class objects?

             I'm having trouble figuring out how to sort an ArrayList of objects. The objects are of a class CityTemp that implements the interface Comparable, and I have defined a compareTo() method. It works for ...

Ok, some weird spaces, so let´s fix them

question_description = question_description.replace('\n', '').replace('\r', '').strip()

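replace removes the newline characters (\n and \r) by swapping them for an empty string (''), and strip removes TABS or spaces before and after the text. Because the newlines are replaced with nothing, the words on both sides of a removed line break end up glued together. We get this result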
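
As a possible alternative, the sketch below collapses every run of whitespace into a single space, so words on both sides of a line break stay separated. clean_text is a hypothetical helper name and the sample text is made up to resemble the scraped excerpts.

```python
def clean_text(text):
    """Collapse every run of whitespace into one space and trim the ends."""
    return " ".join(text.split())


raw = "\r\n            I can loop through all the nodes.\r\nEssentially each node is ...\r\n        "

# The approach used above: newlines vanish and the sentences are glued.
glued = raw.replace('\n', '').replace('\r', '').strip()

print(glued)
print(clean_text(raw))
```

str.split() with no arguments splits on any whitespace (spaces, tabs, \n, \r), which is why a single join is enough here.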

How can I fix CORS error while trying to access an Angular website hosted on github pages with custom domain?

As mentioned on the title, I'm hosting my angular website on gh-pages and pointing a custom domain to it. The website was loading before I added in the custom domain. Here's an example of the error I ...

The relationship between DataTables in DataSet: can we check if the parent is so and so?

I loaded a complicated XML file with lots of data where are complex level of nested elements. The DataSet.ReadXml() load all that nicely and I can loop through all the nodes.Essentially each node is ...

Module status keeps running after it has been disabled

I rebooted my linux machine and started noticing these odd requests in my Apache access log.::1 - - [16/Dec/2020:21:28:54 -0500] "GET /server-status?auto HTTP/1.1" 404 147 "-" &...


Full doc

Scrapy comes as a set of classes; we must import a set of functions, modules and classes.

Scrapy is a full framework.

First class:

class Date(Item):
    text = Field()

So, every element we search for has its own properties; a product, say, has a name, a price, reviews and so on

These are our Fields -> the info I want to extract from the product. I decide what to extract, so there can be many fields or just one.

Second Class:

The one that performs the extraction, our “spider”

First we define class variables (we give our spider a name and our seed URL)

Here we can define some rules to guide the spider where to look for the data.

Function parse, where the magic happens

    def parse(self, response):

parse receives a parameter, response, where the HTML tree will be stored (I don't need BeautifulSoup or lxml to parse the tree, Scrapy does it by itself)

Scrapy calls its parsers “selectors”. I can search using XPath, ID, classes, lists and so on.

To start loading data we use ItemLoader, which receives an instance of the class I created and the selector (the HTML tree where I'll search for the elements I want), in this case to fill the text Field with the result of the XPath expression

item.add_xpath('text', './/h3/a/text()')

Ok, here is the code so far. No worries, this is just an intro; we'll go deeper in the next lectures

#import modules
from scrapy.item import Field, Item
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.loader import ItemLoader

#class type of data to extract, article, image, name, user, product
class Date(Item):
    text = Field()

class SpiderData(Spider):
    name = "MySpider"
    start_urls = ['https://site-to-crawl']
    #redirection rules here
    def parse(self, response):
        sel = Selector(response)
        page_title = sel.xpath('//h1/text()').get()
        elements = sel.xpath('//div[@id="datos"]')
        for element in elements:
            item = ItemLoader(Date(), element)
            item.add_xpath('text', './/h3/a/text()')
            #hand the loaded item back to Scrapy
            yield item.load_item()


Ok, remember that the first step is to define our classes with the items to extract; it is a class abstraction. So our items to extract (again from Stack Overflow) are the questions, and the properties are the title and description from the main page.

Now that we have defined our items and properties, let's start.

Let's define the abstraction of what we want to extract.

We create a class with any related name; the important thing is that this class inherits from Item (from scrapy.item import Item)

And then within the class I just define the properties I want to bring (question and description)

class Question (Item):
    question = Field()
    description = Field()

And that´s our Class definition

Now we need to define the CORE class for Scrapy

This is the one that performs our requests, parsing, and more, BUT it must inherit from the Spider class

Note: we use Spider because we want to extract from ONE page; if we need to extract from multiple pages we'll need another class (we'll see it later)

class MainCoreScrapy(Spider):

Now we can define several things within this main class:

  • Spider name
  • Header to avoid being detected as a bot (Scrapy defines it within the “custom_settings” property); this object uses key-value pairs, with USER_AGENT in CAPS as the key and the user-agent string we already know as the value
  • URL (seed/starting URL)

class MainCoreScrapy(Spider):
    name = "MainSpider"
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
    }
    start_urls = ['']

Done (because it is just a one-page spider)

Now we need to define the function where the magic happens (the parse function)

def parse(self, response):

We need to add nothing here, because it is just one URL and Scrapy does the magic automatically and returns the response with the HTML tree.

So I receive the HTML tree, but where do we parse it?

Well, within the response parameter def parse(self, response):

So, here we have the first step done (request)

Now we need to parse it. We won't use BeautifulSoup or lxml; we'll do it Scrapy's way, with the Selector class

    sel = Selector(response)

now we have this sel variable, a selector built from the response, which we'll use to ask the page for the useful info

I can use XPath or CSS to make my way through the tree

This is the same example that we used with BeautifulSoup, so we know that we need the <div> tags with the class question-summary

So we should build an expression to retrieve those <div> tags as a list, which we can iterate to get the data

questions_list = sel.xpath('//div[@id="questions"]//div[@class="question-summary"]')

Now we can iterate this List

    for question in questions_list:

the variable question will hold one element on every iteration until we finish

so now we are iterating over all the elements, BUT we need to extract every item (the questions) from them.

Remember that when we started the program we imported some tools, one of them was

from scrapy.loader import ItemLoader, this class loads ITEMS,

so ItemLoader is a class that receives, as its first parameter, an instance of my class that contains the abstraction of what I need to extract (Question), and as its second parameter the HTML element (the selector) with the info that we'll use to fill these fields

class Question (Item):
    question = Field()
    description = Field()

so we have this so far

    for question in questions_list:
        item = ItemLoader(Question(), question)

Now I have to fill the fields question and description, maybe we can try this (several ways to do it)

        item.add_xpath('question', './/h3/a/text()')

so ‘question’ will be filled with the result of the XPath expression .//h3/a/text(), evaluated within the HTML element that I passed in here

    item = ItemLoader(Question(), question)

Now we need to fill the ‘description’, but now our XPath expression should reach ‘excerpt’

NOTE: when I use a dot and // it is because the search is RELATIVE to an element, in this case ‘question’ (item = ItemLoader(Question(), question))

    item.add_xpath('description', './/div[@class="excerpt"]/text()')

Well, now I need to apply a special return to close this

    yield item.load_item()

this sends the info loaded in the item to the output file

yield vs return
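
A minimal, Scrapy-free sketch of the difference: the two functions below (with_return and with_yield, both hypothetical names) run the same loop, but return ends the function at the first item while yield hands back one value at a time and keeps going.

```python
def with_return(numbers):
    for n in numbers:
        return n * 2      # the function ends at the first item


def with_yield(numbers):
    for n in numbers:
        yield n * 2       # hands back one value, then resumes for the next


print(with_return([1, 2, 3]))        # 2
print(list(with_yield([1, 2, 3])))   # [2, 4, 6]
```

That is why parse uses yield: a single call can hand back one item per question instead of stopping after the first one.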

I can add values not only via XPath; I can also add them by “value” to fill any property. Let's add another field just to see how easy it is

item.add_value('id', 1)

What is this? We just ADDED a VALUE (1) to the id field; instead of using XPath we set it directly

I only need to add the matching FIELD to our main abstraction class

id = Field()

that piece of code goes inside our class

class Question (Item):
    id = Field()
    question = Field()
    description = Field()

we have this working code so far

#let's import the required modules and functions
from scrapy.item import Field
from scrapy.item import Item
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.loader import ItemLoader

#define the data I have to fill in
#that will go to the results file
class Question (Item):
    id = Field()
    question = Field()
    description = Field()
#CLASS CORE - MainCoreScrapy
class MainCoreScrapy(Spider):
    name = "MainSpider"
    # configure the USER AGENT in Scrapy
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
    }
    # URL (seed URL)
    start_urls = ['']

    # This function will be filled when we make a request to the seed URL
    def parse(self, response):
        # Selectors: Scrapy´s Class to extract data
        sel = Selector(response)
        questions_list = sel.xpath('//div[@id="questions"]//div[@class="question-summary"]')
        for question in questions_list:
            # instantiate my ITEM with the selector that holds the data to fill it with
            # Fill my ITEM's properties with XPATH expressions to search within the "question" selector

            item = ItemLoader(Question(), question)
            item.add_xpath('question', './/h3/a/text()')
            item.add_xpath('description', './/div[@class="excerpt"]/text()')
            item.add_value('id', 1)
            #Yield info to write data in file
            yield item.load_item()
# run from a terminal; <your_script>.py stands for this file's name
# scrapy runspider <your_script>.py -o results.csv -t csv

Ok, if we just run this code from within our IDE or IDLE, it won't work… why?

Because we need to run it from the terminal, with the command that runs the Scrapy spider

AND we need to send the output to a file (-o filename -t format)

If we run that, after some output in the terminal we'll get a file -> scrapy_stackoverflow.csv (it can be JSON too, I just chose CSV). Look how Scrapy shows the seed URL and the parse results within the terminal

Now if we open our CSV file with Notepad or some text editor we'll have something like this


I tried many ways to create service process with session1, but I couldn’t find a suitable way to create a service process.
Can someone help me?thanks.
“,[1],CreateService with Session1

I wonder if I can somehow pass TInputQueryWizardPage, TInputOptionWizardPage, TInputDirWizardPage, TInputFileWizardPage, TOutputMsgWizardPage, TOutputMsgMemoWizardPage, TOutputProgressWizardPage pages …
“,[1],Can I somehow pass TInput/TOutput pages to a function as one parameter in Inno Setup?

Envs: Ubuntu 18.04, Miniconda3, python=3.7(GCC=7.3.0), GCC -v (7.4.0)
The error occurs when I run the following command:
scons build/X86/gem5.opt -j8

The error is as follow:
[ LINK] -> X86/…
“,[1],LTO compilation question when linking X86/marshal file

I have image of skin colour with repetitive pattern (Horizontal White Lines).
My Question is how to denoise the image effectively using FFT without affecting the quality of the image much, somebody …
“,[1],How to remove repititve pattern from an image using FFT

We have the result, but it is hard to read because it contains tabs and spaces; it is not “clean” yet (we'll fix that in later chapters)

But take a look at the [1] -> that is the ID we added with id = Field() and item.add_value('id', 1)

If we comment out those two lines and run the code again, we'll have no ID [1].

I have a server that accept ssl connections on port 443. I am using the boost libraries for the server implementation. Below is the code snippet:
// Open the acceptor with the option to reuse the …
“,Recv-Q has data pending as per netstat command and it never gets cleared

So I have an object which is moving in a circular path and enemy in the centre of this circle. I’m trying to find out how to calculate shotingDirection for bullets. Transform.position isn’t ennough …
“,”How to shoot an object, which is moving in a circle”

df = pd.read_csv(CITY_DATA[city])

def user_stats(df,city):

""""""Displays statistics of users.""""""

print('\nCalculating User Stats...\n')

start_time = time....
    ",How can I display five rows of data based on user in Python?

I want scope viewmodel by Fragment but activity,
interface MarsRepository {
suspend fun getProperties(): List
class MarsRepositoryModule …
“,How install view model in fragmentComponent with Hilt injection?

im trying to set the placeholder for the v-select

If we also comment out the lines that bring the “description”, we'll get a nice list of headlines with no spaces at all

class Question (Item):
    #id = Field()
    question = Field()
    #description = Field()


        for question in questions_list:
            # instantiate my ITEM with the selector that holds the data to fill it with
            # Fill my ITEM's properties with XPATH expressions to search within the "question" selector

            item = ItemLoader(Question(), question)
            item.add_xpath('question', './/h3/a/text()')
            #item.add_xpath('description', './/div[@class="excerpt"]/text()')
            #item.add_value('id', 1)

The result is much better

how to avoid invalid characters ? JSON
How do I make my require statement wait for it to finish before continueing?
GetEntityMetadata returns 0 attributes
Centralizing a TField’s Size value
GLSL Fit Fragment Shader Mask Into Vertex Dimensions
How can I create rounded corners button in WPF?
Return list of strings for query buffer with StatisticDefinition
How can I quickly groupby a large sparse dataframe?
How to remove space in the input box
Need a help REGEX php preg_match [duplicate]
How to draw a .obj file in pyqtgraph?
I have problem with converting MATLAB code to Python
Get the OpenSSL::PKey::EC key size in Ruby
how should the advanced contact page mysql scheme be?
Why is ZooKeeper LeaderElection Agent not being called by Spark Master?

Finally, instead of setting a fixed value we can generate the ID with an automatic counter; that is, instead of always having [1] we can add a counter within the code like this

        i = 0

        for question in questions_list:
            # instantiate my ITEM with the selector that holds the data to fill it with
            # Fill my ITEM's properties with XPATH expressions to search within the "question" selector

            item = ItemLoader(Question(), question)
            item.add_xpath('question', './/h3/a/text()')
            #item.add_xpath('description', './/div[@class="excerpt"]/text()')
            item.add_value('id', i)
            i +=1

[0],Failure to clone risc-v tools (failure with newlib-cygwin.git)
[1],How would you represent musical notes in JavaScript?
[2],Do I need to call DeleteObject() on font retrieved from SystemParametersInfo()?
[3],Automaticaly positioning rectangles estheticaly on a canvas with D3.js
[4],Paging in virtual memory
[5],How to run program that connects to another machine in C++?
[6],Rsnapshot filepermission problem with network hdd over raspberry pi
[7],Adding reaction if message contains certain content
[8],”Batch – Findstr with error level condition, quotes?”
[9],Implementing microprofile health checks with EJB application
[10],“Add [name] to fillable property to allow mass assignment on [Illuminate\Foundation\Auth\User].”
[11],Vscode platform specific shortcuts
[12],Problem filling an array of objects. The result is always null
[13],how to get the name instead of reference field in mongoengine and flask -admin
[14],extract email attachment from AWS SES mail in S3 with Python on AWS Lambda

IMPORTANT: if you get a blank file (0 bytes) when running the code, the most likely cause is a mistake in one of the XPath expressions
