Index
Symbols
" (quotation marks), 17
$ (dollar sign), 27
() (parentheses), 25
* (asterisk), 25, 157
+ (plus sign), 25
- (hyphen), 113
. (period), 25
403 Forbidden error, 187
404 Page Not Found error, 9
500 Internal Server Error, 9
; (semicolon), 210
?! (does not contain), 27
[] (square brackets), 25
\ (backslash), 27
^ (caret), 27
_ (underscore), 17
| (pipe), 25
A
a tag, 28, 156
Accept header, 179
Accept-Encoding header, 179
Accept-Language header, 179
action chains, 195
ActionScript, 147
add_cookie function, 182
Ajax (Asynchronous JavaScript and XML), 151-152
API keys
  Echo Nest example, 52, 54
  Google example, 60-63
  Twitter example, 56
APIs
  about, 49-50, 68
  authentication and, 52
  common conventions, 50-52
  Echo Nest example, 52, 54-55
  Google examples, 50, 60-63
  HTTP methods and, 51
  parsing JSON, 63
  responses, 52
  Twitter example, 55-59
  Wikipedia example, 64
ASCII character encoding, 95-98
assertions (unit tests), 190
asterisk (*), 25, 157
Asynchronous JavaScript and XML (Ajax), 151-152
AttributeError exception, 10
attributes
  accessing, 14, 28
  finding tags based on, 15-18
Auernheimer, Andrew, 227
auth module, 144
authentication
  about, 52
  handling logins, 142-144
  HTTP basic access, 144
  Twitter example, 57
B
backing up data, 172
BeautifulSoup library
  about, 6
  children() function, 20
  descendants() function, 20
  find() function, 16-18
  findAll() function, 15-18, 28
  get_text() function, 15
  installing, 6
  next_siblings() function, 21
  previous_siblings() function, 21
  regular expressions and, 27
  running, 8
  searching for tags, 14-22
  XPath and, 157
BeautifulSoup object, 15, 18, 192
Berne Convention for the Protection of Literary and Artistic Works, 218
body tag, 215
.box file extension, 171
building web scrapers
  advanced HTML parsing, 13-29
  crawling across the Web, 31-48
  first web scraper, 3-11
  reading documents, 93
  storing data, 71-91
  using APIs, 49-69
By object, 155
BytesIO object, 103
C
CAPTCHA characters
  about, 161, 169-171
  dragging, 197
  machine training and, 135, 171-174
  retrieving, 174-176
caret (^), 27
Carroll, Lewis, 6
Cascading Style Sheets (CSS)
  about, 14, 216
  dynamic HTML and, 151
  hidden fields and, 184
CGI (Common Gateway Interface), 204
checklist, human, 186
children (tags), 20
children() function, 20
Chrome developer tool, 141
class attribute, 14, 17, 156
cleaning dirty data
  cleaning after the fact, 113-118
  cleaning in code, 109-113
client-side processing
  handling redirects, 44, 158
  scripting languages and, 147
cloud computing, 204
colorpickers, 140
comma-separated values (CSV) files
  reading, 98-100
  storing data to, 74-76
Comment object, 18
Common Gateway Interface (CGI), 204
Computer Fraud and Abuse Act, 221, 227-229
Connection header, 179
Connection object, 83
connection/cursor model, 83
context-free grammars, 135
cookies
  handling, 142-143, 181-182
  verifying settings, 179
copyright law, 218-219, 229
cPanel software, 204
crawling across the Web (see web crawlers)
CREATE DATABASE statement, 80
CREATE INDEX statement, 86
CREATE TABLE statement, 80
credentials
  Google accounts, 60
  Twitter accounts, 58
CSS (Cascading Style Sheets)
  about, 14, 216
  dynamic HTML and, 151
  hidden fields and, 184
CSV (comma-separated values) files
  reading, 98-100
  storing data to, 74-76
csv library, 98-100
Cursor object, 83
D
Dark Web, 36
data gathering, 36, 38-40
data management
  about, 71
  email and, 90-91
  MySQL and, 76-89
  storing data to CSV, 74-76
  storing media files, 71-74
data normalization, 112-113
data warehouses, 40
database size versus query time, 86
Davies, Mark, 121
deactivate command, 8
Deep Web, 36
DELETE method (HTTP), 51
DELETE statement, 82
delete_all_cookies function, 182
delete_cookie function, 182
delimiters, 74
descendants (tags), 20
descendants() function, 20
DESCRIBE statement, 80
DHTML (dynamic HTML), 151-152
dictionaries, 85
DictReader object, 100
Digital Millennium Copyright Act (DMCA), 219
directed graph problems, 126
display:none attribute, 185
distributed computing, 201
DMCA (Digital Millennium Copyright Act), 219
.doc format, 102
documents, reading (see reading documents)
.docx format, 102-105
dollar sign ($), 27
downloading files from Internet, 74
drag-and-drop interfaces, 196
dynamic HTML (DHTML), 151-152
E
Easter Egg, 210
eBay v. Bidder's Edge, 226
Echo Nest API, 52, 54-55
EditThisCookie Chrome extension, 181
elements (Selenium), 153, 194
email
  identifying addresses, 24
  sending and receiving, 90-91
email package, 90
encoding (document)
  about, 93
  text files and, 94-98
environment variables, 163
escape characters, 27, 110
ethical guidelines, 177-178, 217-230
exception handling
  external links, 43
  handling redirects, 158
  network connections, 9
  suggestions for, 35, 40
explicit wait, 155
eXtensible Markup Language (XML), 52
external links
  cautions using, 41
  crawling across the Internet, 40-45
  crawling with Scrapy, 45-48
  finding, 42
F
Facebook social media site, 217
fair use clause, 219
fast scraping, 182, 187
Fibonacci sequence, 149
Field v. Google, 229
file attribute, 141
File object, 142
file uploads, 141
filtering data, 115-116, 165
finally statement, 85
find() function, 16-18
findAll() function, 15-18, 28
for loops, 39
forms
  about, 137
  file uploads and, 141
  handling logins and cookies, 142-144
  hidden fields in, 183-186
  images in, 141, 161
  input fields supported, 140
  malicious bots, 144
  security considerations, 183-186
  submitting basic, 138-140
backslash (\), 27
frequency distributions, 131-132
functions
  handling in JavaScript, 148
  lambda expressions and, 28
G
gathering data, 36, 38-40
GET method (HTTP)
  about, 51
  Google example, 62
  retrieving data, 53
  tracking requests, 140
get_cookies function, 181
get_text() function, 15
Google
  API examples, 50, 60-63
  building, 40
  Markov model example, 124
Google Analytics, 150, 181
Google Maps, 150
GREL (OpenRefine Expression Language), 116
H
h1 tag, 9, 39
head tag, 98, 215
headers (HTTP), 179-180, 187
hidden fields in forms, 183-186
Homebrew package manager, 78
homonyms, 133
honeypots, 184-186
Host header, 179
hotlinking, 72
href attribute, 28
HTML (HyperText Markup Language), 215
HTML Parser library, 29
HTML parsing
  accessing attributes, 28
  avoiding the need for, 13
  BeautifulSoup example, 14-22
  lambda expressions, 28
  regular expressions, 22-28
html tag, 215
HTTP (Hypertext Transfer Protocol)
  API functionality and, 50
  basic access authentication, 144
  error handling, 9, 187
  headers supported, 179-180
  methods supported, 51
HTTPBasicAuth object, 144
human checklist, 186
HyperText Markup Language (HTML), 215
Hypertext Transfer Protocol (see HTTP)
hyphen (-), 113
I
id attribute, 156
image processing
  scraping text from images, 166-169
  submitting image files, 141
  text recognition and, 161-176
img tag, 28
implicit wait, 155
indexing, 85
Innes, Nick, 100
input tag, 141
INSERT INTO statement, 81, 84
Intel Corp v. Hamidi, 227
intellectual property, 217-219
internal links
  crawling an entire site, 35-40
  crawling with Scrapy, 45-48
  traversing a single domain, 31-35
Internet
  about, 213-216
  cautions downloading files from, 74
  crawling across, 40-45
  moving forward, 206
IP address blocking, avoiding, 199-200
ISO character sets, 96-98
is_displayed function, 186
Item object, 46, 48
items.py file, 46
J
JavaScript
  about, 147-149
  common libraries, 149-151
  executing with Selenium, 152-156
  handling redirects, 158
  importing files, 14
JavaScript Object Notation (JSON)
  about, 52
  parsing, 63
jQuery library, 149
JSON (JavaScript Object Notation)
  about, 52
  parsing, 63
K
Kerr, Orin, 228
keywords, 17
L
lambda expressions, 28, 74
legalities of web scraping, 217-230
lexicographical analysis with NLTK, 132-136
libraries
  bundling with projects, 7
  OCR support, 161-164
logging with Scrapy, 48
logins
  about, 137
  handling, 142-143
  troubleshooting, 187
lxml library, 29
M
machine learning, 135, 180
machine training, 135, 171-174
Markov text generators, 123-129
media files, storing, 71-74
Mersenne Twister algorithm, 34
methods (HTTP), 51
Microsoft SQL Server, 76
Microsoft Word, 102-105
MIME (Multipurpose Internet Mail Extensions) protocol, 90
MIMEText object, 90
MySQL
  about, 76
  basic commands, 79-82
  database techniques, 85-87
  installing, 77-79
  integrating with Python, 82-85
  Wikipedia example, 87-89
N
name attribute, 140
natural language processing
  about, 119
  additional resources, 136
  Markov models, 123-129
  Natural Language Toolkit, 129-136
  summarizing data, 120-123
Natural Language Toolkit (NLTK)
  about, 129
  installation and setup, 129
  lexicographical analysis, 132-136
  statistical analysis, 130-132
NavigableString object, 18
navigating trees, 18-22
network connections
  about, 3-5
  connecting reliably, 9-11
  security considerations, 181
next_siblings() function, 21
ngrams module, 132
n-grams, 109-112, 120
NLTK (Natural Language Toolkit)
  about, 129
  installation and setup, 129
  lexicographical analysis, 132-136
  statistical analysis, 130-132
NLTK Downloader interface, 130
NLTK module, 129
None object, 10
normalizing data, 112-113
NumPy library, 164
O
OAuth authentication, 57
OCR (optical character recognition)
  about, 161
  library support, 162-164
OpenRefine Expression Language (GREL), 116
OpenRefine tool
  about, 114
  cleaning data, 116-118
  filtering data, 115-116
  installing, 114
  usage considerations, 114
optical character recognition (OCR)
  about, 161
  library support, 162-164
Oracle DBMS, 76
OrderedDict object, 112
os module, 74
P
page load times, 154, 182
parentheses (), 25
parents (tags), 20, 22
parsing HTML pages (see HTML parsing)
parsing JSON, 63
patents, 217
pay-per-hour computing instances, 205
PDF files, 100-102
PDFMiner3K library, 101
Penn Treebank Project, 133
period (.), 25
Peters, Tim, 211
PhantomJS tool, 152-155, 203
PIL (Python Imaging Library), 162
Pillow library
  about, 162
  processing well-formatted text, 165-169
pipe (|), 25
plus sign (+), 25
POST method (HTTP)
  about, 51
  tracking requests, 140
  troubleshooting, 186
  variable names and, 138
  viewing form parameters, 140
previous_siblings() function, 21
primary keys in tables, 85
programming languages, regular expressions and, 27
projects, bundling with libraries, 7
pseudorandom number generators, 34
PUT method (HTTP), 51
PyMySQL library, 82-85
PySocks module, 202
Python Imaging Library (PIL), 162
Python language, installing, 209-211
Q
query time versus database size, 86
quotation marks ("), 17
R
random number generators, 34
random seeds, 34
rate limits
  about, 52
  Google APIs, 60
  Twitter API, 55
reading documents
  document encoding, 93
  Microsoft Word, 102-105
  PDF files, 100
  text files, 94-98
recursion limit, 38, 89
redirects, 44, 158
Referer header, 179
RegexPal website, 24
regular expressions
  about, 22-27
  BeautifulSoup example, 27
  commonly used symbols, 25
  programming languages and, 27
relational data, 77
remote hosting
  running from a website hosting account, 203
  running from the cloud, 204
remote servers
  avoiding IP address blocking, 199-200
  extensibility and, 200
  portability and, 200
  PySocks and, 202
  Tor and, 201-202
Requests library
  about, 137
  auth module, 144
  installing, 138, 179
  submitting forms, 138
  tracking cookies, 142-143
requests module, 179-181
responses, API calls and, 52
Robots Exclusion Standard, 223
robots.txt file, 138, 167, 222-225, 229
S
safe harbor protection, 219, 230
Scrapy library, 45-48
screenshots, 197
script tag, 147
search engine optimization (SEO), 222
searching text data, 135
security considerations
  copyright law and, 219
  forms and, 183-186
  handling cookies, 181
SELECT statement, 79, 81
Selenium library
  about, 143
  elements and, 153, 194
  executing JavaScript, 152-156
  handling redirects, 158
  security considerations, 185
  testing example, 193-198
  Tor support, 203
semicolon (;), 210
SEO (search engine optimization), 222
server-side processing
  handling redirects, 44, 158
  scripting languages and, 147
sets, 67
siblings (tags), 21
Simple Mail Transfer Protocol (SMTP), 90
site maps, 36
Six Degrees of Wikipedia, 31-35
SMTP (Simple Mail Transfer Protocol), 90
smtplib package, 90
sorted function, 112
span tag, 15
Spitler, Daniel, 227
SQL Server (Microsoft), 76
square brackets [], 25
src attribute, 28, 72, 74
StaleElementReferenceException, 158
statistical analysis with NLTK, 130-132
storing data (see data management)
StringIO object, 99
strings, regular expressions and, 22-28
stylesheets
  about, 14, 216
  dynamic HTML and, 151
  hidden fields and, 184
Surface Web, 36
T
tables
  creating in databases, 80
  inserting data into, 81
  primary keys and, 85
  Wikipedia example, 88
Tag object, 18
tags
  accessing attributes, 14, 28
  finding based on location in document, 18-22
  finding based on name and attribute, 15-18
  preserving, 15
Terms of Service, 222-225
Tesseract library
  about, 163
  installing, 163
  processing well-formatted text, 164-169
  training, 171-174
Tesseract OCR Chopper tool, 171
testing
  about, 189
  Selenium example, 193-198
  unit tests, 190, 197
  unittest module, 190, 197
  Wikipedia example, 191-193
Text object, 130
text processing
  image-to-text translation, 161-176
  reading text files, 94-98
  scraping text from images, 166-169
  searching text data, 135
  strings and regular expressions, 22-28
  well-formatted text, 164-169
The Onion Router (Tor), 201-202
threshold filters, 165
timestamps, 87
tokens, 52, 58
Tor (The Onion Router), 201-202
trademarks, 218
traversing the Web (see web crawlers)
tree navigation, 18-22
trespass to chattels, 219-220, 226
trigrams module, 132
try...finally statement, 85
Twitov app, 123
Twitter API, 55-59
U
underscore (_), 17
undirected graph problems, 127
Unicode standard, 83, 95-98, 110
unit tests, 190, 197
United States v. Auernheimer, 227-229
unittest module, 190, 197
UPDATE statement, 82
urllib library, 5, 45
urllib.error module, 5
urllib.parse module, 5
urllib.request module, 5, 72
urllib2 library, 5
urlopen function, 5, 97
urlretrieve function, 72
USE statement, 80
User-Agent header, 179
UTF standards, 95, 110
V
variables
  environment, 163
  handling in JavaScript, 148
  lambda expressions and, 28
versions, multiple, 45
virtual environments, 7
W
Warden, Pete, 217
web crawlers
  about, 31
  cautions using, 41
  crawling across the Internet, 40-45
  crawling an entire site, 35-40
  crawling with Scrapy, 45-48
  traversing a single domain, 31-35
  usage considerations, 220
web scraping, viii-ix
WebDriver, 153-155, 181
websites
  analyzing for scraping, 216
  crawling entire, 35-40
  running from hosting accounts, 203
  scraping text from images on, 166-169
  testing with scrapers, 189-198
well-formatted text, 164-169
whitespace, 74, 210
Wikipedia
  cleaning dirty data, 109-112
  Markov model example, 126-129
  MySQL example, 87-89
  revision history example, 64
  robots.txt file, 224
  testing example, 191-193
  traversing a single domain example, 31-35
Word (Microsoft), 102-105
w:t tag, 104
X
XML (eXtensible Markup Language), 52
XPath (XML Path), 157