Interacting with strings

by Andres Baravalle

Week 4: Interacting with strings

  • Interacting with strings
  • Regular expressions

Interacting with strings

Strings

Strings are data structures composed of a sequence of characters (letters, digits, punctuation, whitespace and so on).

Strings are widely used in PHP; string functions can be used to:

  • Format strings (e.g. dates, numbers, currency)
  • Validate the content that users submit through forms and sanitise user data or user-generated content in general
  • Screenscrape web pages - as in phpGrabComics
  • Parse the user's referrer ($_SERVER["HTTP_REFERER"]) or HTTP headers in general
  • Write proxies - as in PHProxy++!

What?!

As of 2013, PHP has just under 100 functions to interact with strings.

You should become familiar with all the ones listed in the next few pages (and all the ones included in your text book, of course).

Formatting functions

trim() Remove leading and trailing whitespace from a string. You can also use ltrim() (leading) and rtrim() (trailing)
nl2br() Insert '<br />' or '<br>' before every newline in the string
strtoupper() Return uppercase string
strtolower() Return lowercase string
ucfirst() Return the string with an uppercase first character
ucwords() Capitalise the first character in each word in the string
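
The functions listed above can be tried on a single value. The following is a minimal sketch; the input string is a made-up example, not taken from the course material.

```php
<?php
// Hypothetical input, as it might arrive from a form field.
$raw = "  hello wORLD  ";

$clean = trim($raw);          // "hello wORLD"
$lower = strtolower($clean);  // "hello world"
$upper = strtoupper($clean);  // "HELLO WORLD"
$first = ucfirst($lower);     // "Hello world"
$words = ucwords($lower);     // "Hello World"

// nl2br() keeps the newline and inserts a line-break tag before it.
$html = nl2br("line one\nline two"); // "line one<br />\nline two"

echo $words, "\n", $html, "\n";
```

Note that none of these functions modifies its argument: each one returns a new string.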

printf() and sprintf()

printf() Output a formatted string
sprintf() Return a formatted string
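
The difference between the two functions is only in where the result goes. A small sketch, using a hypothetical price value:

```php
<?php
$price = 4.5; // hypothetical value

// printf() writes the formatted string to the output and returns its length.
$length = printf("Total: %.2f\n", $price);

// sprintf() builds the same kind of string but returns it instead of printing it,
// so it can be stored or processed further.
$line = sprintf("Total: %.2f", $price);
echo strtoupper($line), "\n";
```

Use sprintf() whenever the formatted text is needed as a value, for example to append it to a longer string.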

printf(): using %d and %s


<?php
$num = 5;
$location = "tree";
$format = 'There are %d monkeys in the %s';
printf($format, $num, $location);
?>             

Why is this better than the following code?


<?php
$num = 5;
$location = "tree";
$text = "There are $num monkeys in the $location";
echo $text;
?>


printf(): re-using placeholders


<?php
$num = 5;
$location = "tree";
$format = 'The %2$s contains %1$d monkeys.
That\'s a nice %2$s full of %1$d monkeys.';
printf($format, $num, $location);
?>

printf(): advanced formatting


<?php
$output = "";
$start_page = array(1, 20, 35, 70, 90, 123, 156, 190, 210, 230, 256);
for ($i = 0; $i < count($start_page); $i++) {
     $output .= sprintf("Chapter %'.-20d%'.4d\n", $i+1, $start_page[$i]);
}
echo $output;
// Output (first three lines):
// Chapter 1......................1
// Chapter 2.....................20
// Chapter 3.....................35
?>

printf(): advanced formatting (2)

Let's analyse the sprintf() format parameter in detail:

sprintf("Chapter %'.-20d%'.4d\n", $i+1,  $start_page[$i]);

For the first conversion specification:

  • % is the start character of the conversion specification
  • '. sets . as the padding character
  • - aligns the value to the left of the field (the default is right alignment)
  • 20 is the minimum width of the result; as we have 11 chapters, the numbers of chapters 1-9 will be followed by 19 dots (.) and those of chapters 10-11 by 18
  • d means that the argument will be formatted as a decimal integer (the choice of type is irrelevant in this example)
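
The parts of the conversion specification can be added one at a time to see what each contributes. This is a small sketch with a hypothetical value and a width of 5 instead of 20, to keep the results short:

```php
<?php
$n = 7; // hypothetical chapter number

$plain  = sprintf("[%d]", $n);     // "[7]"      no width
$right  = sprintf("[%5d]", $n);    // "[    7]"  width 5, right-aligned, spaces
$left   = sprintf("[%-5d]", $n);   // "[7    ]"  width 5, left-aligned, spaces
$dotted = sprintf("[%'.-5d]", $n); // "[7....]"  width 5, left-aligned, dots
$zeros  = sprintf("[%05d]", $n);   // "[00007]"  width 5, right-aligned, zeros

echo $plain, "\n", $right, "\n", $left, "\n", $dotted, "\n", $zeros, "\n";
```

The square brackets are there only to make the padding visible.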

Activity #1: using printf

Use the printf() function and:

  • take a month (from 1 to 12), a day (from 1 to 31), and a four-digit year
  • display the resulting date in mm/dd/yyyy format

Activity #2: advanced printf

Using the following array of chapter titles and pages:

<?php
$array = array(	
	array("Intro", 1), 
	array("Random chapter name", 6), 
	array("Another random chapter name", 13), 
	array("More random chapter name", 2), 
	array("Again a random chapter name", 33), 
	array("Blah blah random chapter name", 39), 
	array("Beh random chapter name", 45), 
	array("Atch! random chapter name", 61), 
	array("This is a random chapter name", 81), 
	array("Final random chapter name", 89)
);
?>

Use printf() to print the chapter number, title and page number, formatted as in the following lines:

Chapter 1: Intro.................................1
Chapter 2: Random chapter name...................6

Activity #2: advanced printf (hints)

  • Use a monospaced font - otherwise the columns will not line up
  • Read the "printf(): advanced formatting (2)" slide
  • Read the sprintf documentation
  • If in trouble, read the sprintf comments

Sanitising user data

Any content generated by users should be sanitised before passing it to the functions that will process it (e.g. to interact with a database).

SourceForge log-in form

Sanitising user-generated content (2)

When a user submits a form, the user's selections (e.g. checkbox/radio button selections, text in textarea/input fields) are sent to the server and made available in the $_GET or $_POST array.

The next step - in any page like the SourceForge log-in form on the previous slide - will be to compare the content submitted by the user (user name and password) against the database.

If the user content is not filtered, an attacker can try to inject SQL code into the query (rather than simple text).

A number of techniques are possible, but as you have no experience with SQL we will not explore them further.

Sanitising user-generated content (3)

Make sure that you sanitise any user content before using it. Always.

Sanitising functions

addslashes() Return a string with backslashes before single quotes ('), double quotes ("), backslashes (\) and NUL bytes. Reverse: stripslashes(). When the content to be escaped will be used in a database query, use the database's native function instead - e.g. mysqli_real_escape_string()
htmlspecialchars() Convert some special characters to HTML entities
htmlentities() Convert all applicable characters to HTML entities
strip_tags() Strip HTML and PHP tags from a string
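
The effect of these functions is easiest to see on one input. The sketch below uses a made-up string containing markup and quotes; the ENT_QUOTES flag and the UTF-8 encoding argument are choices made for this example, not requirements from the slides.

```php
<?php
// Hypothetical user input containing markup and quotes.
$input = "<b>Tom's</b> \"site\" & more";

// Escape the characters that have a special meaning in HTML (including both quote types).
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// Remove the tags altogether, keeping the text between them.
$stripped = strip_tags($input);

// Add backslashes before the quotes; stripslashes() undoes it.
$slashed  = addslashes($input);
$restored = stripslashes($slashed);

echo $escaped, "\n";  // &lt;b&gt;Tom&#039;s&lt;/b&gt; &quot;site&quot; &amp; more
echo $stripped, "\n"; // Tom's "site" & more
echo $slashed, "\n";  // <b>Tom\'s</b> \"site\" & more
```

Which function is appropriate depends on where the value is going: htmlspecialchars() is meant for output in an HTML page, while database queries should rely on the database's own escaping function.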
   

Joining and splitting functions

implode() Join array elements with a string
explode() Split a string by string; returns an array

Imploding an array

<?php
$array = array('lastname', 'email', 'phone');
$comma_separated = implode(",", $array);
echo $comma_separated; // lastname,email,phone
?>

Exploding a string


<?php
$ingredients = "tomato mozzarella basil artichokes mushrooms ham olives";
$ingredients_array = explode(" ", $ingredients);
?>

Activity #3: baby names!

Use this page to find the list of the 100 most popular male names in 2012 in the US and:

  • Copy the list into a string ("Aiden,Jackson,Ethan" etc.); use Notepad++ regexp features to convert new lines (\n) into commas.
  • Select an appropriate string function and convert the string to an array

Matching and replacing strings

strtr() Translate characters or replace substrings
substr() Return part of a string
str_replace() Replace search string with the replacement string
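
A short sketch of the three functions above, applied to made-up strings. Note that strtr() behaves differently depending on whether it receives two strings (character-by-character translation) or an array (substring replacement).

```php
<?php
$title = "Interacting with strings"; // hypothetical input

$start    = substr($title, 0, 11);                     // "Interacting"
$end      = substr($title, -7);                        // "strings"
$replaced = str_replace("strings", "arrays", $title);  // "Interacting with arrays"

// strtr() with two strings translates single characters: a -> 4, n -> N.
$translated = strtr("banana", "an", "4N");             // "b4N4N4"

// strtr() with an array replaces whole substrings.
$swapped = strtr("Hi all", array("Hi" => "Hello", "all" => "everyone")); // "Hello everyone"

echo $start, "\n", $end, "\n", $replaced, "\n", $translated, "\n", $swapped, "\n";
```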

Activity #4: Rosemary's baby

Building on top of activity #3, now help Rosemary find her perfect baby name:

  • It must start with "a" and include a second "a"
  • It should also include an "i"
  • She would prefer the least popular name

Activity #5: formatting a sonnet

Sonnet 116 by Shakespeare is stored in the code below with its rhyme scheme at the end of each line (ab ab cd cd ef ef gg):

$s116 = "Let me not to the marriage of true minds (a)
Admit impediments, love is not love (b)
Which alters when it alteration finds, (a)
Or bends with the remover to remove. (b)
O no, it is an ever fixèd mark (c)
That looks on tempests and is never shaken; (d)
It is the star to every wand'ring bark, (c)
Whose worth's unknown although his height be taken. (d)
Love's not time's fool, though rosy lips and cheeks (e)
Within his bending sickle's compass come, (f)
Love alters not with his brief hours and weeks, (e)
But bears it out even to the edge of doom: (f)
If this be error and upon me proved, (g)
I never writ, nor no man ever loved. (g)";

Activity #5: formatting a sonnet (2)

Use a set of replace functions to remove the rhyme indicators at the end of each line (e.g. (d)).

Activity #6: convert to array

The opposite of implode() can be used to convert the string into an array with as many elements as there are lines in the string. Read the documentation and apply the function to the sonnet used in the previous activity.

Regular expressions

Regular expressions

Regular expressions (reg exps) provide a special syntax for searching for patterns of text within strings.

Regular expressions are enclosed in delimiters (usually slashes). For example, this simple regular expression:

/word/

searches for the word "word" anywhere within the target string.

Regular expressions

Regular expressions as a concept arose in the 1950s and are in common use in Unix tools such as grep, ed and vi.

The next slides will focus first on syntax of regular expressions, and then on their use in PHP.

Using regular expressions

PHP's main pattern-matching function is preg_match(). The main pattern-replacing function is preg_replace().

<?php
// replace ~ with any symbol that does not appear in your pattern
if(preg_match('~word~','In linguistics, a word is the smallest 
	element that may be uttered in isolation 
	with semantic or pragmatic content.', $matches)) {
	echo "Pattern found."; 
}
?>             
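
preg_replace() takes the same kind of pattern. The following sketch shows both functions side by side on a shortened, made-up version of the sentence; the replacement word is arbitrary.

```php
<?php
$text = 'In linguistics, a word is the smallest element.';

// preg_match() returns 1 on a match and stores the matched text in $matches[0].
$found = preg_match('~word~', $text, $matches);

// preg_replace() returns a copy of the subject with every match replaced.
$changed = preg_replace('~word~', 'term', $text);

echo $matches[0], "\n"; // word
echo $changed, "\n";    // In linguistics, a term is the smallest element.
```

The original string in $text is left untouched; the result has to be assigned to a variable.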

Activity #7: the quick brown cat

Replace the word "fox" with "cat" in "The quick brown fox jumps over the lazy dog":

  • Start using str_replace()
  • Then use preg_replace()

Extra challenge: measure execution time for both approaches (hint: use microtime()).

Activity #8: the quick brown cat and the lazy mouse

Replace the word "fox" with "cat" and "dog" with "mouse" in "The quick brown fox jumps over the lazy dog":

  • Start using str_replace()
  • Then use preg_replace()

Extra challenge: measure execution time for both approaches (hint: use microtime()).

Character classes

Character classes (or sets) are used to match one of several characters:

  • Anything inside square brackets is in a character set
  • You can use the hyphen to match ranges

E.g.:

/[A-Z0-9]/

Will match any single uppercase letter or digit.
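
A character class always stands for one character of the subject. A minimal sketch with made-up strings, using preg_match_all() (which collects every match) to make this visible:

```php
<?php
$pattern = '/[A-Z0-9]/';

$found   = preg_match($pattern, 'route A7');  // 1: "A" belongs to the set
$missing = preg_match($pattern, 'lowercase'); // 0: no uppercase letter or digit

// Every match is exactly one character long.
preg_match_all($pattern, 'Flight BA2490', $matches);
print_r($matches[0]); // F, B, A, 2, 4, 9, 0
```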

Predefined classes

PHP's regular expression functions support a number of predefined (POSIX) character classes, including:

[[:alnum:]] Alphanumeric characters
[[:alpha:]] Alphabetic characters
[[:lower:]] Lowercase letters
[[:upper:]] Uppercase letters
[[:digit:]] Decimal digits
[[:punct:]] Punctuation
[[:blank:]] Space and tab
[[:space:]] Whitespace (space, tab, newline, etc.)

Refer to the full list for the remaining classes.

Using predefined classes

/[[:alpha:][:space:][:punct:]]/

Will match any single letter, whitespace character or punctuation mark.
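
Predefined classes are used inside square brackets like any other set. The sketch below counts the characters of each kind in a made-up product code; it relies on preg_match_all() returning the number of matches.

```php
<?php
$code = 'AB-12 / cd_34!'; // hypothetical input

// Collect the digits.
$digitCount = preg_match_all('/[[:digit:]]/', $code, $digits);

// Count uppercase letters and punctuation characters.
$upperCount = preg_match_all('/[[:upper:]]/', $code, $upper);
$punctCount = preg_match_all('/[[:punct:]]/', $code, $punct);

print_r($digits[0]);                       // 1, 2, 3, 4
echo $upperCount, " ", $punctCount, "\n";  // 2 4
```

The four punctuation characters found are the hyphen, the slash, the underscore and the exclamation mark.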

Using character classes

Regular expressions allow you to perform more precise and complex searches. For example, if I'm looking for my name in a string, I might want to look for both my Italian (Andrea Baravalle) and Spanish (Andres Baravalle) names at the same time.

Or I might also want to look for my full name, including my second name (Nicola/Nicolas).

This is how this is represented using regular expressions:

/Andre[as] (Nicolas? )?Baravalle/
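
A pattern of this kind can be checked against the name variants with preg_match(). The sketch below assumes that the names are separated by single spaces, and writes the pattern so that the optional second name carries its own trailing space; "Andrew Baravalle" is a made-up counter-example.

```php
<?php
$pattern = '/Andre[as] (Nicolas? )?Baravalle/';

$names = array(
    'Andrea Baravalle',
    'Andres Baravalle',
    'Andrea Nicola Baravalle',
    'Andres Nicolas Baravalle',
    'Andrew Baravalle', // should not match: "w" is not in [as]
);

$results = array();
foreach ($names as $name) {
    $results[$name] = preg_match($pattern, $name);
    echo $name, ": ", $results[$name], "\n";
}
```

Where the spaces sit matters: if the space were placed only inside the optional group, the short forms without a second name would need two spaces (or none) to match.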

Summary of operators

. Matches any single character.
? The preceding item is optional and will be matched, at most, once.
* The preceding item will be matched zero or more times.
+ The preceding item will be matched one or more times.
{N} The preceding item is matched exactly N times.
{N,} The preceding item is matched N or more times.
{N,M} The preceding item is matched at least N times, but not more than M times.
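- Inside a character class, represents a range (e.g. [a-z]).
^ Matches the empty string at the beginning of the string (or of a line, in multiline mode); at the start of a character class, it negates the class (e.g. [^0-9] matches any character that is not a digit).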

Using operators: examples

.{9} Any 9 characters
(az){3} azazaz
(az){2,3} azaz or azazaz
[a-c]{2} any 2 character combination of a, b and c
^Abcd Matches Abcd only at the beginning of the line
(az)?c Matches c and azc
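
These examples can be verified with a short script. In this sketch the test strings are made up, and most patterns are wrapped in ^ and $ so that the whole string has to match; $ (end of the string) is not covered in the summary above and is used here only for that purpose.

```php
<?php
// Each entry: pattern, subject, expected result of preg_match().
$checks = array(
    array('/^.{9}$/',      'abc 12345', 1), // any 9 characters
    array('/^(az){3}$/',   'azazaz',    1), // exactly three repetitions
    array('/^(az){2,3}$/', 'az',        0), // needs at least two repetitions
    array('/^[a-c]{2}$/',  'cb',        1), // two characters from a, b, c
    array('/^(az)?c$/',    'c',         1), // the group is optional
    array('/^Abcd/',       'xAbcd',     0), // Abcd is not at the beginning
);

$results = array();
foreach ($checks as $check) {
    list($pattern, $subject, $expected) = $check;
    $result = preg_match($pattern, $subject);
    $results[] = $result;
    printf("%-16s %-10s %d\n", $pattern, $subject, $result);
}
```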

 

Using character classes and operators: screenscraping EZTV.it

           
<?php
$url = "http://www.eztvproxy.org/shows/23/the-big-bang-theory/";

// we need to download the URL
$html = file_get_contents($url);

if ($html) {
	// uncomment the next line to check if you are downloading the original page
	// echo $html;

	// sample
	// <a href="http://torrent.zoink.it/The.Big.Bang.Theory.S02E06.HDTV.XviD-LOL.[eztv].torrent" class="download_1" title="Download Mirror #1"></a>

	// getting all episodes
	// $pattern = '#http://[a-z\.]+/The\.Big\.Bang\.Theory\.S[0-9]+E[0-9]+.*?\.torrent#';

	// getting series 5 only
	$pattern = '#http://[a-z\.]+/The\.Big\.Bang\.Theory\.S05E[0-9]+.*?\.torrent#';

	if (preg_match_all($pattern, $html, $matches)) {
		// $matches[0] holds the full matches; sort them in natural order
		$links = $matches[0];
		natcasesort($links);
		echo "<pre>";
		print_r(array_reverse($links));
		echo "</pre>";
	}
}
?>

Greedy and non-greedy matching

When you use quantifiers to match multiple characters, the quantifiers are by default greedy: they match as many characters as possible.

You can change a quantifier to be non-greedy. This causes it to match the smallest number of characters possible. To make a quantifier non-greedy, place a question mark (?) after the quantifier.

Non-greedy matching: example

<?php
preg_match("/P.*?r/", "Peter Piper", $matches);
echo $matches[0]; // Displays "Peter"
              
preg_match("/P.*r/", "Peter Piper", $matches);
echo $matches[0]; // Displays "Peter Piper"
?>

Activity #9

Solve the exercises on this page. If unsure, test them with a PHP script or in Notepad++.

Subpatterns

By placing a portion of your regular expression's rules in parentheses, you can group those rules into a subpattern. You can now use quantifiers (such as * and ? ) to match the whole subpattern a certain number of times.
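
The effect of the parentheses can be seen by applying the same quantifier with and without them. A minimal sketch on a made-up string:

```php
<?php
$subject = 'hahaha';

// Without parentheses, + applies only to the preceding character "a".
preg_match('/ha+/', $subject, $single);

// With parentheses, + applies to the whole subpattern "ha".
preg_match('/(ha)+/', $subject, $grouped);

echo $single[0], "\n";  // ha
echo $grouped[0], "\n"; // hahaha
```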

Subpatterns (2)

A side-effect of using subpatterns is that you can retrieve the individual subpattern matches in the matches array passed to preg_match(). The first element of the array contains the entire matched text as usual, and each subsequent element contains any matched subpatterns:

<?php
preg_match( "/(\d+\/\d+\/\d+) (\d+\:\d+.+)/", "7/18/2004 9:34AM", $matches );
echo $matches[0] . "<br>"; // Displays "7/18/2004 9:34AM"
echo $matches[1] . "<br>"; // Displays "7/18/2004"
echo $matches[2] . "<br>"; // Displays "9:34AM"
?>

Matching Alternative Patterns

Regular expressions let you combine patterns (and subpatterns) with the | (vertical bar) character to create alternatives.

$day = "wed";
echo preg_match( "/mon|tue|wed|thu|fri|sat|sun/", $day ); // Displays "1"
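
Alternatives can also be limited to one part of a pattern by placing them in a subpattern. The sketch below uses made-up sentences; the alternative that matched is then available in the matches array.

```php
<?php
// The parentheses restrict the alternatives to the start of the word.
$pattern = '/(sat|sun)day/';

$weekend = preg_match($pattern, 'See you on sunday', $matches); // 1
$weekday = preg_match($pattern, 'See you on monday');           // 0

echo $matches[1], "\n"; // sun
```

Without the parentheses, /sat|sunday/ would mean "sat" or "sunday", because | separates everything to its left from everything to its right.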

This work

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License
