Click a Button in Scrapy

I'm using Scrapy to crawl a webpage. Some of the information I need only pops up when you click on a certain button (and of course it then also appears in the HTML code after the click). I found out that...

Scrapy - Silently drop an item

I am using Scrapy to crawl several websites, which may share redundant information. For each page I scrape, I store the page's URL, its title, and its HTML code in MongoDB. I want to avoid...

pip install fails with "connection error: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:598)"

I am very new to Python and trying to pip install linkchecker on Windows 7. Some notes: pip install is failing no matter the package. For example, pip install scrapy also results in the SSL...

Scrapy shell against a local file

Before Scrapy 1.0, I could run the Scrapy shell against a local file quite simply: $ scrapy shell index.html. After upgrading to 1.0.3, it started to throw an error: $ scrapy shell...

How to make Scrapy crawl only 1 page (make it non recursive)?

I'm using the latest version of Scrapy (http://doc.scrapy.org/en/latest/index.html) and am trying to figure out how to make Scrapy crawl only the URL(s) fed to it as part of the start_urls list. In...

Scrapy: CSV output without header

When I use the command scrapy crawl <project> -o <filename.csv>, I get the output of my Item dictionary with headers. This is good. However, I would like scrapy to omit headers if the file already...
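Scrapy's built-in CSV feed export always writes a header row. The underlying logic, which you could apply in a custom pipeline or exporter, is to write the header only when the output file is new or empty. Here is a self-contained sketch of that logic with the stdlib csv module; the path and field names are placeholders.

```python
import csv
import os


def append_rows(path, fieldnames, rows):
    """Append dict rows to a CSV file, writing the header row only if
    the file does not exist yet or is empty."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
```

Inside Scrapy itself, the equivalent is a CsvItemExporter subclass that passes include_headers_line=False when the target file already has content.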

"read-only file system" error when pulling Docker image

I am trying to install Splash for Scrapy. According to its installation documentation, first of all Docker has to be installed. This has been successfully done. Then I launch the Docker Quickstart...

scrapy shell not opening long link

I'm working with the scrapy shell. The URL that I'm trying to crawl is:...

Read cookies from Splash request

I'm trying to access cookies after I've made a request using Splash. Below is how I've built the request: script = """ function main(splash) splash:init_cookies(splash.args.cookies) ...

Cannot connect to the Docker daemon at unix:/var/run/docker.sock. Is the docker daemon running?

I have applied every solution available on the internet, but I still cannot run Docker. I want to use Scrapy Splash on my server. Here is the history of commands I ran: docker run -p 8050:8050...

Construct DataFrame from scraped data using Scrapy

I have a problem constructing a CSV-style data file from scraped data. I have managed to scrape the data from the table, but I have been stuck for days on writing it out. I am using items...
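If each scraped table row is collected as a dict, pandas can build the DataFrame directly from the list of dicts and write the CSV in one call. A minimal sketch, assuming pandas is installed; the field names and values are made up for illustration.

```python
import pandas as pd

# Each scraped table row collected as a dict (field names are examples).
scraped_items = [
    {"name": "Alpha", "price": "10.5", "qty": "3"},
    {"name": "Beta", "price": "7.25", "qty": "1"},
]

df = pd.DataFrame(scraped_items)
df["price"] = df["price"].astype(float)  # scraped values arrive as strings
df["qty"] = df["qty"].astype(int)
df.to_csv("table.csv", index=False)
```

With Scrapy specifically, you can collect yielded items in a list (e.g. via a pipeline or signals) and hand that list to pd.DataFrame the same way.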

How scrapy yield request.follow actually works

I'm unable to figure out how yield works in yield request.follow(url, callback=func). So far I know that the request is sent, the response is passed to the callback function, and finally a...
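The confusing part is usually the generator mechanics, not the follow call (in Scrapy the method is on the response, response.follow): parse() is a generator, each yield hands one request or item back to the engine, the engine schedules the request, and later calls the callback with the response. Nothing returns to the line after yield until the engine resumes the generator. Below is a pure-Python sketch of that loop with no Scrapy at all; the names and the tuple-as-request convention are invented for the demo.

```python
from collections import deque


def parse(url):
    """A parse-like generator: yields final items (dicts) and follow-up
    'requests' (here just (url, callback) tuples)."""
    yield {"scraped_from": url}
    for link in ("a", "b"):  # pretend these links were extracted from the page
        yield (url + "/" + link, parse)  # like response.follow(link, callback=parse)


def engine(start_url, max_requests=5):
    """Tiny stand-in for Scrapy's engine: resumes the generator lazily,
    schedules follow-up requests, and collects the items."""
    items, queue, handled = [], deque([(start_url, parse)]), 0
    while queue and handled < max_requests:
        url, callback = queue.popleft()
        handled += 1
        for result in callback(url):  # pulls one yielded value at a time
            if isinstance(result, dict):
                items.append(result)  # a scraped item
            else:
                queue.append(result)  # a request to schedule for later
    return items
```

The key point the sketch shows: yielding a request does not fetch it immediately; it only enqueues it, and the callback runs later when the engine gets to it.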

nslookup: isc_socket_bind: address in use - can't resolve dns in docker container (phusion image)

I am running an AWS instance with 2 CPUs, 8 GB RAM, and 450 Mbps bandwidth, with a Docker container that holds a Python application. The container load average is almost ~6.0 during the day when Python is...

Running scrapy splash with rotating proxies

I'm trying to use scrapy with splash and rotating proxies. Here's my settings.py: ROBOTSTXT_OBEY = False BOT_NAME = 'mybot' SPIDER_MODULES = ['myproject.spiders'] NEWSPIDER_MODULE =...

How to create an index in MongoDB with pymongo

I use Scrapy to crawl data and save it to MongoDB, and I want to create a 2dsphere index in MongoDB. Here is my pipelines.py file with scrapy: from pymongo import MongoClient from scrapy.conf import...
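With pymongo, a 2dsphere index is created with collection.create_index, and the indexed field should hold GeoJSON with coordinates in [longitude, latitude] order. A sketch, assuming pymongo; the URI, database, collection, and field names are placeholders. The pymongo import is deferred into the function so the pure GeoJSON helper runs anywhere.

```python
def geo_point(lng, lat):
    """A GeoJSON Point in the [lng, lat] order a 2dsphere index expects."""
    return {"type": "Point", "coordinates": [float(lng), float(lat)]}


def ensure_geo_index(mongo_uri="mongodb://localhost:27017",
                     db="scrapydb", coll="places"):
    """Create the 2dsphere index once (idempotent) and return the collection."""
    # Deferred import so geo_point above works without pymongo installed.
    from pymongo import MongoClient, GEOSPHERE  # GEOSPHERE == "2dsphere"
    collection = MongoClient(mongo_uri)[db][coll]
    collection.create_index([("location", GEOSPHERE)])
    return collection
```

In a Scrapy pipeline you would typically call ensure_geo_index once in open_spider, then insert items shaped like {"title": ..., "location": geo_point(lng, lat)}.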

Scraping: SSL: CERTIFICATE_VERIFY_FAILED error for http://en.wikipedia.org

I'm practicing the code from 'Web Scraping with Python', and I keep having this certificate problem: from urllib.request import urlopen from bs4 import BeautifulSoup import re pages = set() def...
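CERTIFICATE_VERIFY_FAILED from urlopen usually means Python cannot find a CA bundle (common on fresh macOS installs, fixed by running the bundled "Install Certificates.command" or by pointing at certifi's bundle). A sketch of the two ssl-context options with the stdlib; the unverified context is a last resort because it disables certificate checking entirely.

```python
import ssl
from urllib.request import urlopen

# Preferred: a default context that verifies against the system CA bundle
# (or pass cafile=certifi.where() here if the certifi package is installed).
ctx = ssl.create_default_context()

# Last resort only: skips verification, so use it knowingly.
insecure_ctx = ssl._create_unverified_context()


def fetch(url, context=ctx):
    """Fetch a URL with an explicit SSL context."""
    with urlopen(url, context=context) as resp:
        return resp.read()
```

Passing the context explicitly to urlopen (rather than monkeypatching ssl globals) keeps the insecure fallback contained to the calls that genuinely need it.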

Scrapy: How to get cookies from splash

I am trying to get the cookies from a splash request, but I keep getting an error. Here is the code I am using: class P2PEye(scrapy.Spider): name = 'p2peyeSpider' allowed_domains =...

Dropping duplicate items from Scrapy pipeline?

My Scrapy crawler collects data from a set of URLs, but when I run it again to add new content, the old content is saved to my MongoDB database. Is there a way to check if this item is already...
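The standard pattern is a pipeline that tracks already-seen keys (or queries MongoDB for the URL) and raises Scrapy's DropItem for repeats. So the sketch is self-contained, DropItem is defined locally here as a stand-in; in a real project import it from scrapy.exceptions instead. The 'url' key is an assumption about the item shape.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem (import the real one in a project)."""


class DuplicatesPipeline:
    """Drop items whose 'url' was already seen during this run. In a
    persistent setup you would check MongoDB instead, e.g.
    collection.find_one({'url': item['url']})."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        key = item["url"]
        if key in self.seen:
            raise DropItem(f"Duplicate item: {key}")
        self.seen.add(key)
        return item
```

For re-runs against MongoDB, an alternative is to write with update_one(..., upsert=True) in the storage pipeline, so an existing URL is overwritten rather than duplicated.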

How can I clear the scrapy jobs list?

How can I clear the scrapy jobs list? When I start any spider, I end up with a lot of jobs for that spider, and I want to know how I can kill all of them. After reading the documentation I wrote the following code, which I run...

Handshake Failure: SSL Alert number 40

I'm trying to crawl a page without success: >> scrapy shell "XXXXXX" ... 2018-12-28 17:23:32 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET XXXXXXXX> (failed 1 times):...

unable to deploy scrapy to scrapyd server

I am trying to deploy my Scrapy project, which is connected to a Django project, to scrapyd, but when I tried scrapyd-deploy JD -p JDSpider, it failed. It said No module named GradutionProject. It seems the...

Running scrapy spider from script results in error ImportError: No module named scrapy

I've installed scrapy and created a spider that works when run from the command line with the command scrapy crawl getBUCPower. My issue is that I need to run the spider from another script when...

Scrapyd: How to cancel all jobs with one command?

I am running over 40 spiders which until now were scheduled via cron and launched via scrapy crawl. For several reasons I am now switching to scrapyd, one of them being the ability to see which jobs...
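Scrapyd has no cancel-all endpoint, but it exposes listjobs.json and cancel.json, so cancel-all is a loop over every pending and running job. In this sketch the HTTP call is separated from the pure job-collection logic; the base URL and project name are placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def jobs_to_cancel(listjobs_payload):
    """From a listjobs.json response dict, collect the ids of every
    pending and running job (finished jobs need no cancelling)."""
    return [job["id"]
            for state in ("pending", "running")
            for job in listjobs_payload.get(state, [])]


def cancel_all(base="http://localhost:6800", project="myproject"):
    """Cancel every pending and running job for one scrapyd project."""
    with urlopen(f"{base}/listjobs.json?project={project}") as resp:
        payload = json.load(resp)
    for job_id in jobs_to_cancel(payload):
        data = urlencode({"project": project, "job": job_id}).encode()
        urlopen(f"{base}/cancel.json", data=data)  # POST cancels one job
```

Note that scrapyd cancels a running job by signalling the process, so a stubborn job can require sending cancel.json a second time.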

Google Maps: some XPath selectors return data, some do not (Selenium, Python)

I was trying to scrape Google Maps. The phone and hours variables are not returning any data. The other variables work fine and return data. The XPath is correct. I am not sure what the issue is...

I have 12000 known URLs, what is the fastest way to scrape them with Python?

So I have a list of URLs that I pull from a database, and I need to crawl and parse through the JSON response of each URL. Some URLs return null, while others return information that is sent to a...
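Fetching 12000 known URLs is I/O-bound, so it parallelizes well with a stdlib thread pool (an async client such as aiohttp is the other common route). In this sketch the fetch function is injectable so the fan-out logic stands on its own; the worker count and timeout are tuning knobs, not recommendations.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """Fetch one URL and decode its JSON body (None for an empty body)."""
    with urlopen(url, timeout=timeout) as resp:
        body = resp.read()
    return json.loads(body) if body else None


def fetch_all(urls, fetch=fetch_json, workers=50):
    """Fan the URLs out over a thread pool; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```

Because pool.map preserves input order, the results line up with the original URL list, which makes writing them back to the database straightforward.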

How can I scrape location-based data from Amazon?

Whenever I try to scrape amazon.com, I fail, because product information changes according to location on amazon.com. The information that changes is as follows: 1. Price 2. Shipping...

How to scrape a site protected by Cloudflare

So I'm trying to scrape https://craft.co/tesla. When I visit it from the browser, it opens correctly. However, when I use Scrapy it fetches the site, but when I view the response with view(response), it...

scrapy-playwright:- Downloader/handlers: scrapy.exceptions.NotSupported: AsyncioSelectorReactor

I tried to extract some data from a dynamically loaded JavaScript website using scrapy-playwright, but I got stuck at the very beginning. Where I'm facing trouble: my settings.py file is as...

Failed to retrieve product listing pages from a few categories

From this webpage I am trying to get the kind of links where the different products are located. There are 6 categories with a More info button; when I traverse them recursively, I usually reach the...

How can I send dynamic website content (HTML generated by a Selenium browser) to Scrapy?

I am working on a stock-related project where I have the task of scraping all daily data for the last 5 years, i.e. from 2016 to date. I particularly thought of using Selenium because I...